Re: [PATCH] hostdev: Introduce vDPA device to hostdev subsystem as a new subtype

2023-05-24 Thread Jason Wang
On Mon, Apr 24, 2023 at 11:55 AM libai  wrote:
>
> The following is the XML of the vdpa device:
> <hostdev mode='subsystem' type='vdpa'>
>   <source dev='/dev/vhost-vdpa-0'/>
> </hostdev>
> And the command line passed to QEMU is as follows:
> -device {"driver":"vhost-vdpa-device-pci","vhostdev":"/dev/vhost-vdpa-0"}
>
> This solution was selected based on the previous discussion about how to
> support generic vDPA devices.
> For details, see the following:
> https://listman.redhat.com/archives/libvir-list/2023-March/239018.html

It would also be helpful if you could point to some links explaining the
generic vDPA device.

Adding Laine and Jonathon for comments.

Thanks

>
> Signed-off-by: libai 
> ---
>  src/conf/domain_audit.c |  4 +++
>  src/conf/domain_conf.c  | 47 +
>  src/conf/domain_conf.h  |  6 +
>  src/conf/domain_validate.c  |  1 +
>  src/conf/virconftypes.h |  2 ++
>  src/qemu/qemu_command.c | 19 +
>  src/qemu/qemu_command.h |  3 +++
>  src/qemu/qemu_domain.c  |  6 +
>  src/qemu/qemu_hotplug.c |  1 +
>  src/qemu/qemu_migration.c   |  2 ++
>  src/qemu/qemu_validate.c|  2 ++
>  src/security/security_dac.c |  2 ++
>  src/security/security_selinux.c |  2 ++
>  13 files changed, 97 insertions(+)
>
> diff --git a/src/conf/domain_audit.c b/src/conf/domain_audit.c
> index ae875188bd..6906ce7ade 100644
> --- a/src/conf/domain_audit.c
> +++ b/src/conf/domain_audit.c
> @@ -344,6 +344,7 @@ virDomainAuditHostdev(virDomainObj *vm, virDomainHostdevDef *hostdev,
>      virDomainHostdevSubsysSCSI *scsisrc = &hostdev->source.subsys.u.scsi;
>      virDomainHostdevSubsysSCSIVHost *hostsrc = &hostdev->source.subsys.u.scsi_host;
>      virDomainHostdevSubsysMediatedDev *mdevsrc = &hostdev->source.subsys.u.mdev;
> +    virDomainHostdevSubsysVDPA *vdpasrc = &hostdev->source.subsys.u.vdpa;
>
>  virUUIDFormat(vm->def->uuid, uuidstr);
>  if (!(vmname = virAuditEncode("vm", vm->def->name))) {
> @@ -383,6 +384,9 @@ virDomainAuditHostdev(virDomainObj *vm, virDomainHostdevDef *hostdev,
>  case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV:
>  address = g_strdup(mdevsrc->uuidstr);
>  break;
> +case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_VDPA:
> +address = g_strdup(vdpasrc->devpath);
> +break;
>  case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST:
>  default:
>  VIR_WARN("Unexpected hostdev type while encoding audit message: %d",
> diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
> index b03a3ff011..e8f6d1457b 100644
> --- a/src/conf/domain_conf.c
> +++ b/src/conf/domain_conf.c
> @@ -1047,6 +1047,7 @@ VIR_ENUM_IMPL(virDomainHostdevSubsys,
>"scsi",
>"scsi_host",
>"mdev",
> +  "vdpa",
>  );
>
>  VIR_ENUM_IMPL(virDomainHostdevSubsysPCIBackend,
> @@ -2641,6 +2642,9 @@ virDomainHostdevDefClear(virDomainHostdevDef *def)
>  case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI:
>  g_clear_pointer(&def->source.subsys.u.pci.origstates, virBitmapFree);
>  break;
> +case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_VDPA:
> +VIR_FREE(def->source.subsys.u.vdpa.devpath);
> +break;
>  case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_USB:
>  case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV:
>  case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST:
> @@ -6160,6 +6164,22 @@ virDomainHostdevSubsysMediatedDevDefParseXML(virDomainHostdevDef *def,
>  return 0;
>  }
>
> +static int
> +virDomainHostdevSubsysVDPADefParseXML(xmlNodePtr sourcenode,
> +                                      virDomainHostdevDef *def)
> +{
> +    g_autofree char *devpath = NULL;
> +    virDomainHostdevSubsysVDPA *vdpa = &def->source.subsys.u.vdpa;
> +
> +    if (!(devpath = virXMLPropString(sourcenode, "dev"))) {
> +        virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s",
> +                       _("Missing 'dev' attribute for element <source>"));
> +        return -1;
> +    }
> +    vdpa->devpath = g_steal_pointer(&devpath);
> +    return 0;
> +}
> +
>  static int
>  virDomainHostdevDefParseXMLSubsys(xmlNodePtr node,
>xmlXPathContextPtr ctxt,
> @@ -6317,6 +6337,11 @@ virDomainHostdevDefParseXMLSubsys(xmlNodePtr node,
>  if (virDomainHostdevSubsysMediatedDevDefParseXML(def, ctxt) < 0)
>  return -1;
>  break;
> +case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_VDPA:
> +if (virDomainHostdevSubsysVDPADefParseXML(sourcenode, def) < 0) {
> +return -1;
> +}
> +break;
>
>  default:
>  virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
> @@ -12979,6 +13004,7 @@ virDomainHostdevDefParseXML(virDomainXMLOption *xmlopt,
>  case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI:
>  case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_SCSI_HOST:
>  case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV:
> +case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_VDPA:
>  case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST:
>
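The qemu_command.c hunk is not included in this excerpt. Purely as an illustration (not the patch's actual code; the helper name is made up), a simplified GLib-based sketch of how the parsed devpath maps to the -device JSON quoted at the top of the mail:

#include <glib.h>

/* Hypothetical helper, not from the patch: the real code goes through
 * libvirt's virJSONValue machinery, but the resulting property set is
 * {"driver":"vhost-vdpa-device-pci","vhostdev":"<devpath>"}. */
static gchar *
exampleBuildVDPAHostdevProps(const char *devpath)
{
    return g_strdup_printf("{\"driver\":\"vhost-vdpa-device-pci\","
                           "\"vhostdev\":\"%s\"}", devpath);
}

int main(void)
{
    g_autofree gchar *props = exampleBuildVDPAHostdevProps("/dev/vhost-vdpa-0");
    g_print("-device %s\n", props);
    return 0;
}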

Re: [RFC] Seeking advice on support for generic vDPA device

2023-03-27 Thread Jason Wang
On Sat, Mar 25, 2023 at 9:14 PM jiangdongxu  wrote:
>
> vDPA devices provide high-performance device access to a virtual machine
> through a wire-speed data path. These devices require a vendor-specific
> host driver, but the data path follows the virtio specification.
>
> Support for the generic vdpa device was recently added to QEMU in
> v8.0.0-rc:
> https://gitlab.com/qemu-project/qemu/-/commit/b430a2bd2303b9940cc80ec746887f463a25fc5c
>
> This allows libvirt to support these devices regardless of device type.
> Currently, libvirt only supports vDPA network devices, so we need a new
> solution to support generic vDPA devices.
>
> At present, we have considered several options.
>
> Solution A:
> Add type vdpa under the hostdev device. The device xml looks like:
> <hostdev mode='subsystem' type='vdpa'>
>   <source dev='/dev/vhost-vdpa-0'/>
> </hostdev>
> This solution is recommended: the configuration only requires that the
> device supports the vDPA framework, it is easy to extend in the future,
> and it follows the same design idea as the QEMU solution.

I prefer this one; it's easier to implement and works for any type of
virtio device and vDPA parent.

>
> Solution B:
> The current libvirt model is followed, adding a vDPA type under each
> existing device class.
> The device xml looks like:
> network:
> <devices>
>   <interface type='vdpa'>
>     <source dev='/dev/vhost-vdpa-0'/>
>   </interface>
> </devices>
>
> storage:
> 
> 
> 
> 
>
> The network interface already has a vdpa type, so a new type would be
> required to distinguish the "generic-vdpa" case. However, this method
> differentiates by device type, which is not convenient for extending to
> other types of vDPA devices. Not recommended.

This should be the way to go once we have dedicated block-layer support
for vhost-vDPA, which seems unnecessary for generic vDPA devices.

>
> Solution C:
> Referring to PCI passthrough, add the driver type (the appropriate
> vendor-specific driver) to the driver element of the PCI device.
> The unbind/bind of the vDPA device driver and the creation of the vDPA
> device are implemented by libvirt.
> device xml looks like:
> 
> 
> 
> 
>  function='0x01'/>
> 
> 
> 
> This solution is tied to the vendor-specific driver and supports only
> vDPA devices created from a PCI device; vDPA devices created via VDUSE
> are not supported. This solution is not universal and is not recommended.

I agree; this requires PCI-specific knowledge, which defeats vDPA's goal
of being device agnostic.

Thanks

>
>
> What's your proposal?
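Not from the thread: for reference, a generic vDPA device shows up on the host as a /dev/vhost-vdpa-N character device regardless of its parent (PCI VF/SF or VDUSE). Below is a minimal userspace sketch, using only the standard vhost uAPI, of querying what such a device is; this is what lets a single generic QEMU frontend cover any device class (the path is just an example, error handling kept minimal):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/vhost.h>

int main(void)
{
    const char *path = "/dev/vhost-vdpa-0";   /* example device node */
    uint32_t dev_id = 0;                      /* virtio device ID, e.g. 1 = net */
    uint64_t features = 0;
    int fd = open(path, O_RDWR);

    if (fd < 0) {
        perror(path);
        return 1;
    }
    /* Ask the kernel which virtio device type the parent implements. */
    if (ioctl(fd, VHOST_VDPA_GET_DEVICE_ID, &dev_id) == 0)
        printf("virtio device id: %u\n", dev_id);
    /* The feature bits the device offers for negotiation. */
    if (ioctl(fd, VHOST_GET_FEATURES, &features) == 0)
        printf("device features: 0x%llx\n", (unsigned long long)features);
    close(fd);
    return 0;
}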



Re: [PATCH v2 03/15] net/vhost-vdpa: Fix device compatibility check

2021-10-13 Thread Jason Wang



On 2021/10/8 9:34 PM, Kevin Wolf wrote:

vhost-vdpa works only with specific devices. At startup, it second-guesses
what the command line option handling will do and errors out if
it thinks a non-virtio device will attach to it.

This second guessing is not only ugly, it can lead to wrong error
messages ('-device floppy,netdev=foo' should complain about an unknown
property, not about the wrong kind of network device being attached) and
completely ignores hotplugging.

Drop the old checks and implement .check_peer_type() instead to fix
this. As a nice side effect, it also removes one more dependency on the
legacy QemuOpts infrastructure and even reduces the code size.

Signed-off-by: Kevin Wolf 



Acked-by: Jason Wang 



---
  net/vhost-vdpa.c | 37 ++---
  1 file changed, 14 insertions(+), 23 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 912686457c..6dc68d8677 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -147,12 +147,26 @@ static bool vhost_vdpa_has_ufo(NetClientState *nc)
  
  }
  
+static bool vhost_vdpa_check_peer_type(NetClientState *nc, ObjectClass *oc,

+   Error **errp)
+{
+const char *driver = object_class_get_name(oc);
+
+if (!g_str_has_prefix(driver, "virtio-net-")) {
+error_setg(errp, "vhost-vdpa requires frontend driver virtio-net-*");
+return false;
+}
+
+return true;
+}
+
  static NetClientInfo net_vhost_vdpa_info = {
  .type = NET_CLIENT_DRIVER_VHOST_VDPA,
  .size = sizeof(VhostVDPAState),
  .cleanup = vhost_vdpa_cleanup,
  .has_vnet_hdr = vhost_vdpa_has_vnet_hdr,
  .has_ufo = vhost_vdpa_has_ufo,
+.check_peer_type = vhost_vdpa_check_peer_type,
  };
  
  static int net_vhost_vdpa_init(NetClientState *peer, const char *device,

@@ -179,24 +193,6 @@ static int net_vhost_vdpa_init(NetClientState *peer, const 
char *device,
  return ret;
  }
  
-static int net_vhost_check_net(void *opaque, QemuOpts *opts, Error **errp)

-{
-const char *name = opaque;
-const char *driver, *netdev;
-
-driver = qemu_opt_get(opts, "driver");
-netdev = qemu_opt_get(opts, "netdev");
-if (!driver || !netdev) {
-return 0;
-}
-if (strcmp(netdev, name) == 0 &&
-!g_str_has_prefix(driver, "virtio-net-")) {
-error_setg(errp, "vhost-vdpa requires frontend driver virtio-net-*");
-return -1;
-}
-return 0;
-}
-
  int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
  NetClientState *peer, Error **errp)
  {
@@ -204,10 +200,5 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char 
*name,
  
  assert(netdev->type == NET_CLIENT_DRIVER_VHOST_VDPA);

  opts = &netdev->u.vhost_vdpa;
-/* verify net frontend */
-if (qemu_opts_foreach(qemu_find_opts("device"), net_vhost_check_net,
-  (char *)name, errp)) {
-return -1;
-}
  return net_vhost_vdpa_init(peer, TYPE_VHOST_VDPA, name, opts->vhostdev);
  }




Re: [PATCH v2 02/15] net/vhost-user: Fix device compatibility check

2021-10-13 Thread Jason Wang



On 2021/10/8 9:34 PM, Kevin Wolf wrote:

vhost-user works only with specific devices. At startup, it second-guesses
what the command line option handling will do and errors out if
it thinks a non-virtio device will attach to it.

This second guessing is not only ugly, it can lead to wrong error
messages ('-device floppy,netdev=foo' should complain about an unknown
property, not about the wrong kind of network device being attached) and
completely ignores hotplugging.

Drop the old checks and implement .check_peer_type() instead to fix
this. As a nice side effect, it also removes one more dependency on the
legacy QemuOpts infrastructure and even reduces the code size.

Signed-off-by: Kevin Wolf 



Acked-by: Jason Wang 



---
  net/vhost-user.c | 41 ++---
  1 file changed, 14 insertions(+), 27 deletions(-)

diff --git a/net/vhost-user.c b/net/vhost-user.c
index 4a939124d2..b1a0247b59 100644
--- a/net/vhost-user.c
+++ b/net/vhost-user.c
@@ -198,6 +198,19 @@ static bool vhost_user_has_ufo(NetClientState *nc)
  return true;
  }
  
+static bool vhost_user_check_peer_type(NetClientState *nc, ObjectClass *oc,

+   Error **errp)
+{
+const char *driver = object_class_get_name(oc);
+
+if (!g_str_has_prefix(driver, "virtio-net-")) {
+error_setg(errp, "vhost-user requires frontend driver virtio-net-*");
+return false;
+}
+
+return true;
+}
+
  static NetClientInfo net_vhost_user_info = {
  .type = NET_CLIENT_DRIVER_VHOST_USER,
  .size = sizeof(NetVhostUserState),
@@ -207,6 +220,7 @@ static NetClientInfo net_vhost_user_info = {
  .has_ufo = vhost_user_has_ufo,
  .set_vnet_be = vhost_user_set_vnet_endianness,
  .set_vnet_le = vhost_user_set_vnet_endianness,
+.check_peer_type = vhost_user_check_peer_type,
  };
  
  static gboolean net_vhost_user_watch(void *do_not_use, GIOCondition cond,

@@ -397,27 +411,6 @@ static Chardev *net_vhost_claim_chardev(
  return chr;
  }
  
-static int net_vhost_check_net(void *opaque, QemuOpts *opts, Error **errp)

-{
-const char *name = opaque;
-const char *driver, *netdev;
-
-driver = qemu_opt_get(opts, "driver");
-netdev = qemu_opt_get(opts, "netdev");
-
-if (!driver || !netdev) {
-return 0;
-}
-
-if (strcmp(netdev, name) == 0 &&
-!g_str_has_prefix(driver, "virtio-net-")) {
-error_setg(errp, "vhost-user requires frontend driver virtio-net-*");
-return -1;
-}
-
-return 0;
-}
-
  int net_init_vhost_user(const Netdev *netdev, const char *name,
  NetClientState *peer, Error **errp)
  {
@@ -433,12 +426,6 @@ int net_init_vhost_user(const Netdev *netdev, const char 
*name,
  return -1;
  }
  
-/* verify net frontend */

-if (qemu_opts_foreach(qemu_find_opts("device"), net_vhost_check_net,
-  (char *)name, errp)) {
-return -1;
-}
-
  queues = vhost_user_opts->has_queues ? vhost_user_opts->queues : 1;
  if (queues < 1 || queues > MAX_QUEUE_NUM) {
  error_setg(errp,




Re: [PATCH v2 01/15] net: Introduce NetClientInfo.check_peer_type()

2021-10-13 Thread Jason Wang



On 2021/10/8 9:34 PM, Kevin Wolf wrote:

Some network backends (vhost-user and vhost-vdpa) work only with
specific devices. At startup, they second guess what the command line
option handling will do and error out if they think a non-virtio device
will attach to them.

This second guessing is not only ugly, it can lead to wrong error
messages ('-device floppy,netdev=foo' should complain about an unknown
property, not about the wrong kind of network device being attached) and
completely ignores hotplugging.

Add a callback where backends can check compatibility with a device when
it actually tries to attach, even on hotplug.

Signed-off-by: Kevin Wolf 



Acked-by: Jason Wang 



---
  include/net/net.h| 2 ++
  hw/core/qdev-properties-system.c | 6 ++
  2 files changed, 8 insertions(+)

diff --git a/include/net/net.h b/include/net/net.h
index 5d1508081f..986288eb07 100644
--- a/include/net/net.h
+++ b/include/net/net.h
@@ -62,6 +62,7 @@ typedef struct SocketReadState SocketReadState;
  typedef void (SocketReadStateFinalize)(SocketReadState *rs);
  typedef void (NetAnnounce)(NetClientState *);
  typedef bool (SetSteeringEBPF)(NetClientState *, int);
+typedef bool (NetCheckPeerType)(NetClientState *, ObjectClass *, Error **);
  
  typedef struct NetClientInfo {

  NetClientDriver type;
@@ -84,6 +85,7 @@ typedef struct NetClientInfo {
  SetVnetBE *set_vnet_be;
  NetAnnounce *announce;
  SetSteeringEBPF *set_steering_ebpf;
+NetCheckPeerType *check_peer_type;
  } NetClientInfo;
  
  struct NetClientState {

diff --git a/hw/core/qdev-properties-system.c b/hw/core/qdev-properties-system.c
index e71f5d64d1..a91f60567a 100644
--- a/hw/core/qdev-properties-system.c
+++ b/hw/core/qdev-properties-system.c
@@ -431,6 +431,12 @@ static void set_netdev(Object *obj, Visitor *v, const char 
*name,
  goto out;
  }
  
+if (peers[i]->info->check_peer_type) {

+if (!peers[i]->info->check_peer_type(peers[i], obj->class, errp)) {
+goto out;
+}
+}
+
  ncs[i] = peers[i];
  ncs[i]->queue_index = i;
  }




Re: [ovirt-devel] Re: device compatibility interface for live migration with assigned devices

2020-08-30 Thread Jason Wang



On 2020/8/21 10:52 PM, Cornelia Huck wrote:

On Fri, 21 Aug 2020 11:14:41 +0800
Jason Wang  wrote:


On 2020/8/20 8:27 PM, Cornelia Huck wrote:

On Wed, 19 Aug 2020 17:28:38 +0800
Jason Wang  wrote:
  

On 2020/8/19 4:13 PM, Yan Zhao wrote:

On Wed, Aug 19, 2020 at 03:39:50PM +0800, Jason Wang wrote:

On 2020/8/19 2:59 PM, Yan Zhao wrote:

On Wed, Aug 19, 2020 at 02:57:34PM +0800, Jason Wang wrote:

On 2020/8/19 11:30 AM, Yan Zhao wrote:

hi All,
could we decide that sysfs is the interface that every VFIO vendor driver
needs to provide in order to support vfio live migration, otherwise the
userspace management tool would not list the device into the compatible
list?

if that's true, let's move to the standardizing of the sysfs interface.
(1) content
common part: (must)
- software_version: (in major.minor.bugfix scheme)

This can not work for devices whose features can be negotiated/advertised
independently. (E.g virtio devices)

I thought the 'software_version' was supposed to describe kind of a
'protocol version' for the data we transmit? I.e., you add a new field,
you bump the version number.


Ok, but since we mandate backward compatibility of uABI, is it really
worth having a version for sysfs? (Searching sysfs shows no examples
like this)

I was not thinking about the sysfs interface, but rather about the data
that is sent over while migrating. E.g. we find out that sending some
auxiliary data is a good idea and bump to version 1.1.0; version 1.0.0
cannot deal with the extra data, but version 1.1.0 can deal with the
older data stream.

(...)



Well, I think what data to transmit during migration is the duty of QEMU,
not the kernel. And I suspect the idea of reading opaque data (with a
version) from the kernel and transmitting it to the dest is the best approach.






- device_api: vfio-pci or vfio-ccw ...
- type: mdev type for mdev device or
a signature for physical device which is a counterpart for
   mdev type.

device api specific part: (must)
   - pci id: pci id of mdev parent device or pci id of physical pci
 device (device_api is vfio-pci)API here.

So this assumes a PCI device which is probably not true.
 

for device_api of vfio-pci, why it's not true?

for vfio-ccw, it's subchannel_type.

Ok, but having two different attributes for the same file is not a good idea.
How would mgmt know if there will be a 3rd type?

that's why some attributes need to be common. e.g.
device_api: it's common because mgmt need to know it's a pci device or a
   ccw device. and the api type is already defined vfio.h.
(The field is agreed by and actually suggested by Alex in previous 
mail)
type: mdev_type for mdev. if mgmt does not understand it, it would not
 be able to create one compatible mdev device.
software_version: mgmt can compare the major and minor if it understands
 this fields.

I think it would be helpful if you can describe how mgmt is expected to
work step by step with the proposed sysfs API. This can help people to
understand.

My proposal would be:
- check that device_api matches
- check possible device_api specific attributes
- check that type matches [I don't think the combination of mdev types
and another attribute to determine compatibility is a good idea;


Any reason for this? Actually, if we only use the mdev type to detect
compatibility, it would be much easier. Otherwise, we are actually
re-inventing mdev types.

E.g can we have the same mdev types with different device_api and other
attributes?

In the end, the mdev type is represented as a string; but I'm not sure
we can expect that two types with the same name, but a different
device_api are related in any way.

If we e.g. compare vfio-pci and vfio-ccw, they are fundamentally
different.

I was mostly concerned about the aggregation proposal, where type A +
aggregation value b might be compatible with type B + aggregation value
a.



Yes, that looks pretty complicated.







actually, the current proposal confuses me every time I look at it]
- check that software_version is compatible, assuming semantic
versioning
- check possible type-specific attributes


I'm not sure if this is too complicated. And I suspect there will be
vendor specific attributes:

- for compatibility check: I think we should either model everything
via mdev type or make it totally vendor specific. Having something in
the middle will bring a lot of burden

FWIW, I'm for a strict match on mdev type, and flexibility in per-type
attributes.



I'm not sure whether the above flexibility can work better than encoding
them into the mdev type. If we really want ultra flexibility, we need to
make the compatibility check totally vendor specific.






- for provisioning: it's still not clear. As shown in this proposal, for
NVME we may need to set remote_url, but unless there will be a subclass
(NVME) in the mdev (which I guess not), we can't prevent vendor from
using another

Re: [ovirt-devel] Re: device compatibility interface for live migration with assigned devices

2020-08-20 Thread Jason Wang



On 2020/8/20 8:27 PM, Cornelia Huck wrote:

On Wed, 19 Aug 2020 17:28:38 +0800
Jason Wang  wrote:


On 2020/8/19 4:13 PM, Yan Zhao wrote:

On Wed, Aug 19, 2020 at 03:39:50PM +0800, Jason Wang wrote:

On 2020/8/19 2:59 PM, Yan Zhao wrote:

On Wed, Aug 19, 2020 at 02:57:34PM +0800, Jason Wang wrote:

On 2020/8/19 11:30 AM, Yan Zhao wrote:

hi All,
could we decide that sysfs is the interface that every VFIO vendor driver
needs to provide in order to support vfio live migration, otherwise the
userspace management tool would not list the device into the compatible
list?

if that's true, let's move to the standardizing of the sysfs interface.
(1) content
common part: (must)
   - software_version: (in major.minor.bugfix scheme)

This can not work for devices whose features can be negotiated/advertised
independently. (E.g virtio devices)

I thought the 'software_version' was supposed to describe kind of a
'protocol version' for the data we transmit? I.e., you add a new field,
you bump the version number.



Ok, but since we mandate backward compatibility of uABI, is it really
worth having a version for sysfs? (Searching sysfs shows no examples
like this)





  

sorry, I don't understand here, why virtio devices need to use vfio interface?

I don't see any reason that virtio devices can't be used by VFIO. Do you?

Actually, virtio devices have been used by VFIO for many years:

- passthrough a hardware virtio devices to userspace(VM) drivers
- using virtio PMD inside guest
  

So, what's different for it vs passing through a physical hardware via VFIO?


The difference is that in the guest, the device could be either real
hardware or an emulated one.



even though the features are negotiated dynamically, could you explain
why it would cause software_version not work?


Virtio device 1 supports feature A, B, C
Virtio device 2 supports feature B, C, D

So you can't migrate a guest from device 1 to device 2. And it's
impossible to model the features with versions.

We're talking about the features offered by the device, right? Would it
be sufficient to mandate that the target device supports the same
features or a superset of the features supported by the source device?



Yes.
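A hedged sketch of the superset rule just agreed on, assuming device features are exposed as a 64-bit bitmap the way virtio feature bits are (helper name made up):

#include <stdbool.h>
#include <stdint.h>

/* A guest can be migrated from src to dst only if every feature bit
 * offered by the source device is also offered by the destination.
 * Extra destination features are harmless: the guest never negotiated
 * them, so they stay unused. */
static bool features_compatible(uint64_t src_features, uint64_t dst_features)
{
    return (src_features & ~dst_features) == 0;
}

/* Example from the thread: device 1 offers A,B,C and device 2 offers
 * B,C,D -> incompatible in both directions; a device offering A,B,C,E
 * would be a valid migration target for device 1. */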






  

I think this thread is discussing about vfio related devices.
  

   - device_api: vfio-pci or vfio-ccw ...
   - type: mdev type for mdev device or
   a signature for physical device which is a counterpart for
   mdev type.

device api specific part: (must)
  - pci id: pci id of mdev parent device or pci id of physical pci
device (device_api is vfio-pci)API here.

So this assumes a PCI device which is probably not true.
  

for device_api of vfio-pci, why it's not true?

for vfio-ccw, it's subchannel_type.

Ok, but having two different attributes for the same file is not a good idea.
How would mgmt know if there will be a 3rd type?

that's why some attributes need to be common. e.g.
device_api: it's common because mgmt need to know it's a pci device or a
  ccw device. and the api type is already defined vfio.h.
(The field is agreed by and actually suggested by Alex in previous 
mail)
type: mdev_type for mdev. if mgmt does not understand it, it would not
be able to create one compatible mdev device.
software_version: mgmt can compare the major and minor if it understands
this fields.


I think it would be helpful if you can describe how mgmt is expected to
work step by step with the proposed sysfs API. This can help people to
understand.

My proposal would be:
- check that device_api matches
- check possible device_api specific attributes
- check that type matches [I don't think the combination of mdev types
   and another attribute to determine compatibility is a good idea;



Any reason for this? Actually, if we only use the mdev type to detect
compatibility, it would be much easier. Otherwise, we are actually
re-inventing mdev types.


E.g can we have the same mdev types with different device_api and other 
attributes?




   actually, the current proposal confuses me every time I look at it]
- check that software_version is compatible, assuming semantic
   versioning
- check possible type-specific attributes



I'm not sure if this is too complicated. And I suspect there will be 
vendor specific attributes:


- for compatibility check: I think we should either model everything
via mdev type or make it totally vendor specific. Having something in
the middle will bring a lot of burden
- for provisioning: it's still not clear. As shown in this proposal, for 
NVME we may need to set remote_url, but unless there will be a subclass 
(NVME) in the mdev (which I guess not), we can't prevent vendor from 
using another attribute name, in this case, tricks like attributes 
iteration in some sub directory won't work. So even if we had some 
common API for compatibility check, the provisioning API is still vendor 
specific ...


Thanks






Thanks

Re: device compatibility interface for live migration with assigned devices

2020-08-19 Thread Jason Wang



On 2020/8/19 1:58 PM, Parav Pandit wrote:



From: Yan Zhao 
Sent: Wednesday, August 19, 2020 9:01 AM
On Tue, Aug 18, 2020 at 09:39:24AM +, Parav Pandit wrote:

Please refer to my previous email which has more example and details.

hi Parav,
the example is based on a new vdpa tool running over netlink, not based on
devlink, right?

Right.


For vfio migration compatibility, we have to deal with both mdev and physical
pci devices, I don't think it's a good idea to write a new tool for it, given 
we are
able to retrieve the same info from sysfs and there's already an mdevctl from

mdev attribute should be visible in the mdev's sysfs tree.
I do not propose to write a new mdev tool over netlink. I am sorry if I implied 
that with my suggestion of vdpa tool.

If underlying device is vdpa, mdev might be able to understand vdpa device and 
query from it and populate in mdev sysfs tree.



Note that vdpa is bus independent so it can't work now, and the support
for mdev on top of vDPA has been rejected (as it duplicates vhost-vDPA).


Thanks




The vdpa tool I propose is usable even without mdevs.
vdpa tool's role is to create one or more vdpa devices and place on the "vdpa" 
bus which is the lowest layer here.
Additionally this tool let user query virtqueue stats, db stats.
When a user creates vdpa net device, user may need to configure features of the 
vdpa device such as VIRTIO_NET_F_MAC, default VIRTIO_NET_F_MTU.
These are vdpa level features, attributes. Mdev is layer above it.


Alex
(https://nam03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.
com%2Fmdevctl%2Fmdevctldata=02%7C01%7Cparav%40nvidia.com%7C
0c2691d430304f5ea11308d843f2d84e%7C43083d15727340c1b7db39efd9ccc17
a%7C0%7C0%7C637334057571911357sdata=KxH7PwxmKyy9JODut8BWr
LQyOBylW00%2Fyzc4rEvjUvA%3Dreserved=0).


Sorry for above link mangling. Our mail server is still transitioning due to 
company acquisition.

I am less familiar on below points to comment.


hi All,
could we decide that sysfs is the interface that every VFIO vendor driver needs
to provide in order to support vfio live migration, otherwise the userspace
management tool would not list the device into the compatible list?

if that's true, let's move to the standardizing of the sysfs interface.
(1) content
common part: (must)
- software_version: (in major.minor.bugfix scheme)
- device_api: vfio-pci or vfio-ccw ...
- type: mdev type for mdev device or
a signature for physical device which is a counterpart for
   mdev type.

device api specific part: (must)
   - pci id: pci id of mdev parent device or pci id of physical pci
 device (device_api is vfio-pci)
   - subchannel_type (device_api is vfio-ccw)

vendor driver specific part: (optional)
   - aggregator
   - chpid_type
   - remote_url

NOTE: vendors are free to add attributes in this part with a restriction that 
this
attribute is able to be configured with the same name in sysfs too. e.g.
for aggregator, there must be a sysfs attribute in device node
/sys/devices/pci:00/:00:02.0/882cc4da-dede-11e7-9180-
078a62063ab1/intel_vgpu/aggregator,
so that the userspace tool is able to configure the target device according to
source device's aggregator attribute.


(2) where and structure
proposal 1:
|- [path to device]
   |--- migration
   | |--- self
   | ||-software_version
   | ||-device_api
   | ||-type
   | ||-[pci_id or subchannel_type]
   | ||-
   | |--- compatible
   | ||-software_version
   | ||-device_api
   | ||-type
   | ||-[pci_id or subchannel_type]
   | ||-
multiple compatible is allowed.
attributes should be ASCII text files, preferably with only one value per file.


proposal 2: use bin_attribute.
|- [path to device]
   |--- migration
   | |--- self
   | |--- compatible

so we can continue use multiline format. e.g.
cat compatible
   software_version=0.1.0
   device_api=vfio_pci
   type=i915-GVTg_V5_{val1:int:1,2,4,8}
   pci_id=80865963
   aggregator={val1}/2

Thanks
Yan
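Not part of the proposal itself, but to make the intended management flow concrete, here is a rough sketch of the checks a mgmt tool could run after reading the source's "self" attributes and the target's "compatible" attributes, assuming plain string values and the major.minor.bugfix scheme above (the {val...} templating for type/aggregator is ignored; all names are hypothetical):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Attribute values read from sysfs: one file per attribute in proposal 1,
 * or parsed out of the multiline blob in proposal 2. */
struct mig_attrs {
    char device_api[32];        /* e.g. "vfio-pci" or "vfio-ccw" */
    char type[64];              /* mdev type, or the physical-device signature */
    char software_version[16];  /* major.minor.bugfix */
};

/* Semantic-versioning rule: same major, and the target's minor must not
 * be older than the source's. */
static bool version_compatible(const char *src, const char *dst)
{
    unsigned smaj, smin, sfix, dmaj, dmin, dfix;

    if (sscanf(src, "%u.%u.%u", &smaj, &smin, &sfix) != 3 ||
        sscanf(dst, "%u.%u.%u", &dmaj, &dmin, &dfix) != 3)
        return false;
    return smaj == dmaj && dmin >= smin;
}

static bool devices_compatible(const struct mig_attrs *src_self,
                               const struct mig_attrs *dst_compat)
{
    return strcmp(src_self->device_api, dst_compat->device_api) == 0 &&
           strcmp(src_self->type, dst_compat->type) == 0 &&
           version_compatible(src_self->software_version,
                              dst_compat->software_version);
}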




Re: [ovirt-devel] Re: device compatibility interface for live migration with assigned devices

2020-08-19 Thread Jason Wang



On 2020/8/19 4:13 PM, Yan Zhao wrote:

On Wed, Aug 19, 2020 at 03:39:50PM +0800, Jason Wang wrote:

On 2020/8/19 2:59 PM, Yan Zhao wrote:

On Wed, Aug 19, 2020 at 02:57:34PM +0800, Jason Wang wrote:

On 2020/8/19 11:30 AM, Yan Zhao wrote:

hi All,
could we decide that sysfs is the interface that every VFIO vendor driver
needs to provide in order to support vfio live migration, otherwise the
userspace management tool would not list the device into the compatible
list?

if that's true, let's move to the standardizing of the sysfs interface.
(1) content
common part: (must)
  - software_version: (in major.minor.bugfix scheme)

This can not work for devices whose features can be negotiated/advertised
independently. (E.g virtio devices)


sorry, I don't understand here, why virtio devices need to use vfio interface?


I don't see any reason that virtio devices can't be used by VFIO. Do you?

Actually, virtio devices have been used by VFIO for many years:

- passthrough a hardware virtio devices to userspace(VM) drivers
- using virtio PMD inside guest


So, what's different for it vs passing through a physical hardware via VFIO?



The difference is that in the guest, the device could be either real
hardware or an emulated one.




even though the features are negotiated dynamically, could you explain
why it would cause software_version not work?



Virtio device 1 supports feature A, B, C
Virtio device 2 supports feature B, C, D

So you can't migrate a guest from device 1 to device 2. And it's 
impossible to model the features with versions.







I think this thread is discussing about vfio related devices.


  - device_api: vfio-pci or vfio-ccw ...
  - type: mdev type for mdev device or
  a signature for physical device which is a counterpart for
   mdev type.

device api specific part: (must)
 - pci id: pci id of mdev parent device or pci id of physical pci
   device (device_api is vfio-pci)API here.

So this assumes a PCI device which is probably not true.


for device_api of vfio-pci, why it's not true?

for vfio-ccw, it's subchannel_type.


Ok, but having two different attributes for the same file is not a good idea.
How would mgmt know if there will be a 3rd type?

that's why some attributes need to be common. e.g.
device_api: it's common because mgmt need to know it's a pci device or a
 ccw device. and the api type is already defined vfio.h.
(The field is agreed by and actually suggested by Alex in previous 
mail)
type: mdev_type for mdev. if mgmt does not understand it, it would not
   be able to create one compatible mdev device.
software_version: mgmt can compare the major and minor if it understands
   this fields.



I think it would be helpful if you can describe how mgmt is expected to 
work step by step with the proposed sysfs API. This can help people to 
understand.


Thanks for the patience. Since sysfs is uABI, once accepted we need to
support it forever. That's why we need to be careful.






 - subchannel_type (device_api is vfio-ccw)
vendor driver specific part: (optional)
 - aggregator
 - chpid_type
 - remote_url

For "remote_url", just wonder if it's better to integrate or reuse the
existing NVME management interface instead of duplicating it here. Otherwise
it could be a burden for mgmt to learn. E.g vendor A may use "remote_url"
but vendor B may use a different attribute.


it's vendor driver specific.
vendor specific attributes are inevitable, and that's why we are
discussing here of a way to standardizing of it.


Well, then you will end up with a very long list to discuss. E.g for
networking devices, you will have "mac", "v(x)lan" and a lot of other.

Note that "remote_url" is not vendor specific but NVME (class/subsystem)
specific.


yes, it's just NVMe specific. I added it as an example to show what is
vendor specific.
if one attribute is vendor specific across all vendors, then it's not vendor 
specific,
it's already common attribute, right?



It's common but the issue is about naming and mgmt overhead. Unless you 
have a unified API per class (NVME, ethernet, etc), you can't prevent 
vendor from using another name instead of "remote_url".






The point is that if vendor/class specific part is unavoidable, why not
making all of the attributes vendor specific?


some parts need to be common, as I listed above.



This is hard, unless VFIO knows the type of device (e.g it's a NVME or 
networking device).






our goal is that mgmt can use it without understanding the meaning of vendor
specific attributes.


I'm not sure this is the correct design of uAPI. Is there something similar
in the existing uAPIs?

And it might be hard to work for virtio devices.



NOTE: vendors are free to add attributes in this part with a
restriction that this attribute is able to be configured with the same
name in sysfs too. e.g.

Sysfs works well for common attr

Re: [ovirt-devel] Re: device compatibility interface for live migration with assigned devices

2020-08-19 Thread Jason Wang



On 2020/8/19 2:59 PM, Yan Zhao wrote:

On Wed, Aug 19, 2020 at 02:57:34PM +0800, Jason Wang wrote:

On 2020/8/19 11:30 AM, Yan Zhao wrote:

hi All,
could we decide that sysfs is the interface that every VFIO vendor driver
needs to provide in order to support vfio live migration, otherwise the
userspace management tool would not list the device into the compatible
list?

if that's true, let's move to the standardizing of the sysfs interface.
(1) content
common part: (must)
 - software_version: (in major.minor.bugfix scheme)


This can not work for devices whose features can be negotiated/advertised
independently. (E.g virtio devices)


sorry, I don't understand here, why virtio devices need to use vfio interface?



I don't see any reason that virtio devices can't be used by VFIO. Do you?

Actually, virtio devices have been used by VFIO for many years:

- passthrough a hardware virtio devices to userspace(VM) drivers
- using virtio PMD inside guest



I think this thread is discussing about vfio related devices.


 - device_api: vfio-pci or vfio-ccw ...
 - type: mdev type for mdev device or
 a signature for physical device which is a counterpart for
   mdev type.

device api specific part: (must)
- pci id: pci id of mdev parent device or pci id of physical pci
  device (device_api is vfio-pci)API here.


So this assumes a PCI device which is probably not true.


for device_api of vfio-pci, why it's not true?

for vfio-ccw, it's subchannel_type.



Ok, but having two different attributes for the same file is not a good
idea. How would mgmt know if there will be a 3rd type?






- subchannel_type (device_api is vfio-ccw)
vendor driver specific part: (optional)
- aggregator
- chpid_type
- remote_url


For "remote_url", just wonder if it's better to integrate or reuse the
existing NVME management interface instead of duplicating it here. Otherwise
it could be a burden for mgmt to learn. E.g vendor A may use "remote_url"
but vendor B may use a different attribute.


it's vendor driver specific.
vendor specific attributes are inevitable, and that's why we are
discussing here of a way to standardizing of it.



Well, then you will end up with a very long list to discuss. E.g for 
networking devices, you will have "mac", "v(x)lan" and a lot of other.


Note that "remote_url" is not vendor specific but NVME (class/subsystem) 
specific.


The point is that if vendor/class specific part is unavoidable, why not 
making all of the attributes vendor specific?




our goal is that mgmt can use it without understanding the meaning of vendor
specific attributes.



I'm not sure this is the correct design of uAPI. Is there something 
similar in the existing uAPIs?


And it might be hard to work for virtio devices.





NOTE: vendors are free to add attributes in this part with a
restriction that this attribute is able to be configured with the same
name in sysfs too. e.g.


Sysfs works well for common attributes belongs to a class, but I'm not sure
it can work well for device/vendor specific attributes. Does this mean mgmt
need to iterate all the attributes in both src and dst?


no. just attributes under migration directory.


for aggregator, there must be a sysfs attribute in device node
/sys/devices/pci:00/:00:02.0/882cc4da-dede-11e7-9180-078a62063ab1/intel_vgpu/aggregator,
so that the userspace tool is able to configure the target device
according to source device's aggregator attribute.


(2) where and structure
proposal 1:
|- [path to device]
|--- migration
| |--- self
| ||-software_version
| ||-device_api
| ||-type
| ||-[pci_id or subchannel_type]
| ||-
| |--- compatible
| ||-software_version
| ||-device_api
| ||-type
| ||-[pci_id or subchannel_type]
| ||-
multiple compatible is allowed.
attributes should be ASCII text files, preferably with only one value
per file.


proposal 2: use bin_attribute.
|- [path to device]
|--- migration
| |--- self
| |--- compatible

so we can continue use multiline format. e.g.
cat compatible
software_version=0.1.0
device_api=vfio_pci
type=i915-GVTg_V5_{val1:int:1,2,4,8}
pci_id=80865963
aggregator={val1}/2


So basically two questions:

- how hard to standardize sysfs API for dealing with compatibility check (to
make it work for most types of devices)

sorry, I just know we are in the process of standardizing of it :)



It's not easy. As I said, the current design can't work for virtio 
devices and it's not hard to find other examples. I remember some Intel 
devices have bitmask based capability registers.






- how hard for the mgmt to learn with a vendor specific attributes (vs
existing management API)

what is existing management API?



It depends on the type of devices. E.g for NVME, we've al

Re: [ovirt-devel] Re: device compatibility interface for live migration with assigned devices

2020-08-19 Thread Jason Wang



On 2020/8/19 11:30 AM, Yan Zhao wrote:

hi All,
could we decide that sysfs is the interface that every VFIO vendor driver
needs to provide in order to support vfio live migration, otherwise the
userspace management tool would not list the device into the compatible
list?

if that's true, let's move to the standardizing of the sysfs interface.
(1) content
common part: (must)
- software_version: (in major.minor.bugfix scheme)



This can not work for devices whose features can be 
negotiated/advertised independently. (E.g virtio devices)




- device_api: vfio-pci or vfio-ccw ...
- type: mdev type for mdev device or
a signature for physical device which is a counterpart for
   mdev type.

device api specific part: (must)
   - pci id: pci id of mdev parent device or pci id of physical pci
 device (device_api is vfio-pci)API here.



So this assumes a PCI device which is probably not true.



   - subchannel_type (device_api is vfio-ccw)
  
vendor driver specific part: (optional)

   - aggregator
   - chpid_type
   - remote_url



For "remote_url", just wonder if it's better to integrate or reuse the 
existing NVME management interface instead of duplicating it here. 
Otherwise it could be a burden for mgmt to learn. E.g vendor A may use 
"remote_url" but vendor B may use a different attribute.





NOTE: vendors are free to add attributes in this part with a
restriction that this attribute is able to be configured with the same
name in sysfs too. e.g.



Sysfs works well for common attributes belongs to a class, but I'm not 
sure it can work well for device/vendor specific attributes. Does this 
mean mgmt need to iterate all the attributes in both src and dst?




for aggregator, there must be a sysfs attribute in device node
/sys/devices/pci:00/:00:02.0/882cc4da-dede-11e7-9180-078a62063ab1/intel_vgpu/aggregator,
so that the userspace tool is able to configure the target device
according to source device's aggregator attribute.


(2) where and structure
proposal 1:
|- [path to device]
   |--- migration
   | |--- self
   | ||-software_version
   | ||-device_api
   | ||-type
   | ||-[pci_id or subchannel_type]
   | ||-
   | |--- compatible
   | ||-software_version
   | ||-device_api
   | ||-type
   | ||-[pci_id or subchannel_type]
   | ||-
multiple compatible is allowed.
attributes should be ASCII text files, preferably with only one value
per file.


proposal 2: use bin_attribute.
|- [path to device]
   |--- migration
   | |--- self
   | |--- compatible

so we can continue use multiline format. e.g.
cat compatible
   software_version=0.1.0
   device_api=vfio_pci
   type=i915-GVTg_V5_{val1:int:1,2,4,8}
   pci_id=80865963
   aggregator={val1}/2



So basically two questions:

- how hard to standardize sysfs API for dealing with compatibility check 
(to make it work for most types of devices)
- how hard for the mgmt to learn with a vendor specific attributes (vs 
existing management API)


Thanks




Thanks
Yan




Re: device compatibility interface for live migration with assigned devices

2020-08-19 Thread Jason Wang



On 2020/8/19 1:26 PM, Parav Pandit wrote:



From: Jason Wang 
Sent: Wednesday, August 19, 2020 8:16 AM



On 2020/8/18 5:32 PM, Parav Pandit wrote:

Hi Jason,

From: Jason Wang 
Sent: Tuesday, August 18, 2020 2:32 PM


On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote:
On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:
On 2020/8/14 1:16 PM, Yan Zhao wrote:
On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
On 2020/8/10 3:46 PM, Yan Zhao wrote:
driver is it handled by?
It looks that the devlink is for network device specific, and in
devlink.h, it says include/uapi/linux/devlink.h - Network physical
device Netlink interface, Actually not, I think there used to have
some discussion last year and the conclusion is to remove this
comment.

[...]


Yes, but it could be hard. E.g vDPA will chose to use devlink (there's a long

debate on sysfs vs devlink). So if we go with sysfs, at least two APIs needs to 
be
supported ...

We had internal discussion and proposal on this topic.
I wanted Eli Cohen to be back from vacation on Wed 8/19, but since this is

active discussion right now, I will share the thoughts anyway.

Here are the initial round of thoughts and proposal.

User requirements:
---
1. User might want to create one or more vdpa devices per PCI PF/VF/SF.
2. User might want to create one or more vdpa devices of type net/blk or

other type.

3. User needs to look and dump at the health of the queues for debug purpose.
4. During vdpa net device creation time, user may have to provide a MAC

address and/or VLAN.

5. User should be able to set/query some of the attributes for
debug/compatibility check 6. When user wants to create vdpa device, it needs

to know which device supports creation.

7. User should be able to see the queue statistics of doorbells, wqes
etc regardless of class type


Note that wqes is probably not something common in all of the vendors.

Yes. I think virtqueue descriptor stats are better for monitoring the virtqueues.




To address above requirements, there is a need of vendor agnostic tool, so

that user can create/config/delete vdpa device(s) regardless of the vendor.

Hence,
We should have a tool that lets user do it.

Examples:
-
(a) List parent devices which supports creating vdpa devices.
It also shows which class types supported by this parent device.
In below command two parent devices support vdpa device creation.
First is PCI VF whose bdf is 03.00:5.
Second is PCI SF whose name is mlx5_sf.1

$ vdpa list pd


What did "pd" mean?


A parent device supports creation of one or more vdpa devices.
In a system there can be multiple parent devices which may support vdpa
creation.
The user should be able to know which devices support it, and when the user
creates a vdpa device, they tell which parent device to use for creation, as
done in the vdpa dev add example below.

pci/:03.00:5
class_supports
  net vdpa
virtbus/mlx5_sf.1


So creating mlx5_sf.1 is the charge of devlink?


Yes.
But here vdpa tool is working at the parent device identifier {bus+name} 
instead of devlink identifier.



class_supports
  net

(b) Now add a vdpa device and show the device.
$ vdpa dev add pci/:03.00:5 type net


So if you want to create devices types other than vdpa on
pci/:03.00:5 it needs some synchronization with devlink?

Please refer to FAQ-1,  a new tool is not linked to devlink because vdpa will 
evolve with time and devlink will fall short.
So no, it doesn't need any synchronization with devlink.
As long as the parent device exists, the user can create it.
All synchronization will be within drivers/vdpa/vdpa.c
This user interface is exposed via new netlink family by doing genl_register_family() 
with new name "vdpa" in drivers/vdpa/vdpa.c.



Just to make sure I understand here.

Suppose we had virtbus/mlx5_sf.1. Process A wants to create a vDPA
instance on top of it but Process B wants to create an IB instance. Then I
think some synchronization is needed at least at the parent device level?








$ vdpa dev show
vdpa0@pci/:03.00:5 type net state inactive maxqueues 8 curqueues 4

(c) vdpa dev show features vdpa0
iommu platform
version 1

(d) dump vdpa statistics
$ vdpa dev stats show vdpa0
kickdoorbells 10
wqes 100

(e) Now delete a vdpa device previously created.
$ vdpa dev del vdpa0

Design overview:
---
1. Above example tool runs over netlink socket interface.
2. This enables users to return meaningful error strings in addition to code so

that user can be more informed.

Often this is missing in ioctl()/configfs/sysfs interfaces.
3. This tool over netlink enables syscaller tests to be more usable like other

subsystems to keep kernel robust

4. This provides vendor agnostic view of all vdpa capable parent and vdpa

devices.

5. Each driver which supports vdpa device creation, registers the parent device

along with supported classes.

FAQs:

1. Why not using devlink?
Ans: Because as vd

Re: device compatibility interface for live migration with assigned devices

2020-08-18 Thread Jason Wang



On 2020/8/18 5:36 PM, Cornelia Huck wrote:

On Tue, 18 Aug 2020 10:16:28 +0100
Daniel P. Berrangé  wrote:


On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote:

On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote:

  On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:

  On 2020/8/14 1:16 PM, Yan Zhao wrote:

  On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:

  On 2020/8/10 3:46 PM, Yan Zhao wrote:
  we actually can also retrieve the same information through sysfs, e.g.

  |- [path to device]
 |--- migration
 | |--- self
 | |   |---device_api
 ||   |---mdev_type
 ||   |---software_version
 ||   |---device_id
 ||   |---aggregator
 | |--- compatible
 | |   |---device_api
 ||   |---mdev_type
 ||   |---software_version
 ||   |---device_id
 ||   |---aggregator


  Yes but:

  - You need one file per attribute (one syscall for one attribute)
  - Attribute is coupled with kobject

Is that really that bad? You have the device with an embedded kobject
anyway, and you can just put things into an attribute group?



Yes, but all of this could be done via devlink(netlink) as well with low 
overhead.





[Also, I think that self/compatible split in the example makes things
needlessly complex. Shouldn't semantic versioning and matching already
cover nearly everything?



That's my question as well. E.g. for virtio, versioning may not even
work; some of the features are negotiated independently:


Source features: A, B, C
Dest features: A, B, C, E

We just need to make sure the dest features are a superset of the source's,
then we're all set.




  I would expect very few cases that are more
complex than that. Maybe the aggregation stuff, but I don't think we
need that self/compatible split for that, either.]


  All of above seems unnecessary.

  Another point, as we discussed in another thread, it's really hard to make
  sure the above API work for all types of devices and frameworks. So having a
  vendor specific API looks much better.

  From the POV of userspace mgmt apps doing device compat checking / migration,
  we certainly do NOT want to use different vendor specific APIs. We want to
  have an API that can be used / controlled in a standard manner across vendors.

Yes, but it could be hard. E.g vDPA will chose to use devlink (there's a
long debate on sysfs vs devlink). So if we go with sysfs, at least two
APIs needs to be supported ...

NB, I was not questioning devlink vs sysfs directly. If devlink is related
to netlink, I can't say I'm enthusiastic as IMKE sysfs is easier to deal
with. I don't know enough about devlink to have much of an opinion though.
The key point was that I don't want the userspace APIs we need to deal with
to be vendor specific.

 From what I've seen of devlink, it seems quite nice; but I understand
why sysfs might be easier to deal with (especially as there's likely
already a lot of code using it.)

I understand that some users would like devlink because it is already
widely used for network drivers (and some others), but I don't think
the majority of devices used with vfio are network (although certainly
a lot of them are.)



Note that though devlink may be popular only for network devices,
netlink is widely used by a lot of subsystems (e.g. SCSI).


Thanks





What I care about is that we have a *standard* userspace API for performing
device compatibility checking / state migration, for use by QEMU/libvirt/
OpenStack, such that we can write code without countless vendor specific
code paths.

If there is vendor specific stuff on the side, that's fine as we can ignore
that, but the core functionality for device compat / migration needs to be
standardized.

To summarize:
- choose one of sysfs or devlink
- have a common interface, with a standardized way to add
   vendor-specific attributes
?




Re: device compatibility interface for live migration with assigned devices

2020-08-18 Thread Jason Wang



On 2020/8/18 5:32 PM, Parav Pandit wrote:

Hi Jason,

From: Jason Wang 
Sent: Tuesday, August 18, 2020 2:32 PM


On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote:
On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:
On 2020/8/14 1:16 PM, Yan Zhao wrote:
On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
On 2020/8/10 3:46 PM, Yan Zhao wrote:
driver is it handled by?
It looks that the devlink is for network device specific, and in
devlink.h, it says
include/uapi/linux/devlink.h - Network physical device Netlink
interface,
Actually not, I think there used to have some discussion last year and the
conclusion is to remove this comment.

[...]


Yes, but it could be hard. E.g vDPA will chose to use devlink (there's a long 
debate on sysfs vs devlink). So if we go with sysfs, at least two APIs needs to 
be supported ...

We had internal discussion and proposal on this topic.
I wanted Eli Cohen to be back from vacation on Wed 8/19, but since this is 
active discussion right now, I will share the thoughts anyway.

Here are the initial round of thoughts and proposal.

User requirements:
---
1. User might want to create one or more vdpa devices per PCI PF/VF/SF.
2. User might want to create one or more vdpa devices of type net/blk or other 
type.
3. User needs to look and dump at the health of the queues for debug purpose.
4. During vdpa net device creation time, user may have to provide a MAC address 
and/or VLAN.
5. User should be able to set/query some of the attributes for 
debug/compatibility check
6. When user wants to create vdpa device, it needs to know which device 
supports creation.
7. User should be able to see the queue statistics of doorbells, wqes etc 
regardless of class type



Note that wqes is probably not something common in all of the vendors.




To address above requirements, there is a need of vendor agnostic tool, so that 
user can create/config/delete vdpa device(s) regardless of the vendor.

Hence,
We should have a tool that lets user do it.

Examples:
-
(a) List parent devices which supports creating vdpa devices.
It also shows which class types supported by this parent device.
In below command two parent devices support vdpa device creation.
First is PCI VF whose bdf is 03.00:5.
Second is PCI SF whose name is mlx5_sf.1

$ vdpa list pd



What did "pd" mean?



pci/:03.00:5
   class_supports
 net vdpa
virtbus/mlx5_sf.1



So creating mlx5_sf.1 is the charge of devlink?



   class_supports
 net

(b) Now add a vdpa device and show the device.
$ vdpa dev add pci/:03.00:5 type net



So if you want to create devices types other than vdpa on 
pci/:03.00:5 it needs some synchronization with devlink?




$ vdpa dev show
vdpa0@pci/:03.00:5 type net state inactive maxqueues 8 curqueues 4

(c) vdpa dev show features vdpa0
iommu platform
version 1

(d) dump vdpa statistics
$ vdpa dev stats show vdpa0
kickdoorbells 10
wqes 100

(e) Now delete a vdpa device previously created.
$ vdpa dev del vdpa0

Design overview:
---
1. Above example tool runs over netlink socket interface.
2. This enables users to return meaningful error strings in addition to code so 
that user can be more informed.
Often this is missing in ioctl()/configfs/sysfs interfaces.
3. This tool over netlink enables syscaller tests to be more usable like other 
subsystems to keep kernel robust
4. This provides vendor agnostic view of all vdpa capable parent and vdpa 
devices.

5. Each driver which supports vdpa device creation, registers the parent device 
along with supported classes.

FAQs:

1. Why not using devlink?
Ans: Because as the vdpa ecosystem grows, devlink will fall short for
extending vdpa-specific params, attributes, and stats.



This should be fine, but the difference between a vdpa netlink family and
a vdpa object in devlink is still not clear to me.


Thanks




2. Why not use sysfs?
Ans:
(a) Because the syscaller test infrastructure runs well over netlink sockets,
as it does for several other subsystems.
(b) It lacks the ability to return error messages; doing that via the kernel
log just doesn't work.
(c) Why not use some ioctl()? It would reinvent the wheel of netlink, which
has TLV formats for attributes.

3. Why not configfs?
It has the same limitations as sysfs.

Low level design and driver APIS:

Will post once we discuss this further.
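A small illustration of FAQ 2(b), the error-message point: inside a generic netlink handler the driver can hand a human-readable reason back to userspace through the extended ACK, which a sysfs store cannot do (it only returns an errno). Handler and message below are placeholders:

#include <linux/errno.h>
#include <net/genetlink.h>

static int example_vdpa_dev_add_doit(struct sk_buff *skb, struct genl_info *info)
{
    if (!info->attrs /* || mandatory attribute missing */) {
        /* The string travels back to the vdpa tool together with -EINVAL. */
        GENL_SET_ERR_MSG(info, "parent device attribute is required");
        return -EINVAL;
    }
    return 0;
}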




Re: device compatibility interface for live migration with assigned devices

2020-08-18 Thread Jason Wang



On 2020/8/18 5:16 PM, Daniel P. Berrangé wrote:

Your mail came through as HTML-only so all the quoting and attribution
is mangled / lost now :-(



My bad, sorry.




On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote:

On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote:

  On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:

  On 2020/8/14 1:16 PM, Yan Zhao wrote:

  On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:

  On 2020/8/10 3:46 PM, Yan Zhao wrote:
  we actually can also retrieve the same information through sysfs, e.g.

  |- [path to device]
 |--- migration
 | |--- self
 | |   |---device_api
 ||   |---mdev_type
 ||   |---software_version
 ||   |---device_id
 ||   |---aggregator
 | |--- compatible
 | |   |---device_api
 ||   |---mdev_type
 ||   |---software_version
 ||   |---device_id
 ||   |---aggregator


  Yes but:

  - You need one file per attribute (one syscall for one attribute)
  - Attribute is coupled with kobject

  All of above seems unnecessary.

  Another point, as we discussed in another thread, it's really hard to make
  sure the above API work for all types of devices and frameworks. So having a
  vendor specific API looks much better.

  From the POV of userspace mgmt apps doing device compat checking / migration,
  we certainly do NOT want to use different vendor specific APIs. We want to
  have an API that can be used / controlled in a standard manner across vendors.

Yes, but it could be hard. E.g vDPA will chose to use devlink (there's a
long debate on sysfs vs devlink). So if we go with sysfs, at least two
APIs needs to be supported ...

NB, I was not questioning devlink vs sysfs directly. If devlink is related
to netlink, I can't say I'm enthusiastic as IMKE sysfs is easier to deal
with. I don't know enough about devlink to have much of an opinion though.
The key point was that I don't want the userspace APIs we need to deal with
to be vendor specific.

What I care about is that we have a *standard* userspace API for performing
device compatibility checking / state migration, for use by QEMU/libvirt/
OpenStack, such that we can write code without countless vendor specific
code paths.

If there is vendor specific stuff on the side, that's fine as we can ignore
that, but the core functionality for device compat / migration needs to be
standardized.



Ok, I agree with you.

Thanks




Regards,
Daniel




Re: device compatibility interface for live migration with assigned devices

2020-08-18 Thread Jason Wang

  
  


On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote:


  On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:

  
On 2020/8/14 1:16 PM, Yan Zhao wrote:


  On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:

  
On 2020/8/10 3:46 PM, Yan Zhao wrote:


  
driver is it handled by?

  
It looks like devlink is specific to network devices, and in
devlink.h it says
include/uapi/linux/devlink.h - Network physical device Netlink
interface,


Actually no; I think there was some discussion last year and the
conclusion was to remove this comment.

It supports IB and probably vDPA in the future.


  
  Hmm... sorry, I didn't find the referenced discussion, only the discussion
below regarding why devlink was added.

https://www.mail-archive.com/netdev@vger.kernel.org/msg95801.html
	>This doesn't seem to be too much related to networking? Why can't something
	>like this be in sysfs?
	
	It is related to networking quite bit. There has been couple of
	iteration of this, including sysfs and configfs implementations. There
	has been a consensus reached that this should be done by netlink. I
	believe netlink is really the best for this purpose. Sysfs is not a good
	idea



See the discussion here:

https://patchwork.ozlabs.org/project/netdev/patch/20191115223355.1277139-1-jeffrey.t.kirs...@intel.com/




  https://www.mail-archive.com/netdev@vger.kernel.org/msg96102.html
	>there is already a way to change eth/ib via
	>echo 'eth' > /sys/bus/pci/drivers/mlx4_core/:02:00.0/mlx4_port1
	>
	>sounds like this is another way to achieve the same?
	
	It is. However the current way is driver-specific, not correct.
	For mlx5, we need the same, it cannot be done in this way. Do devlink is
	the correct way to go.

https://lwn.net/Articles/674867/
	There a is need for some userspace API that would allow to expose things
	that are not directly related to any device class like net_device of
	ib_device, but rather chip-wide/switch-ASIC-wide stuff.

	Use cases:
	1) get/set of port type (Ethernet/InfiniBand)
	2) monitoring of hardware messages to and from chip
	3) setting up port splitters - split port into multiple ones and squash again,
	   enables usage of splitter cable
	4) setting up shared buffers - shared among multiple ports within one chip



we actually can also retrieve the same information through sysfs, e.g.

|- [path to device]
   |--- migration
   |     |--- self
   |     |     |--- device_api
   |     |     |--- mdev_type
   |     |     |--- software_version
   |     |     |--- device_id
   |     |     |--- aggregator
   |     |--- compatible
   |     |     |--- device_api
   |     |     |--- mdev_type
   |     |     |--- software_version
   |     |     |--- device_id
   |     |     |--- aggregator



Yes but:

- You need one file per attribute (one syscall for one attribute)
- Attribute is coupled with kobject

All of the above seems unnecessary.

Another point: as we discussed in another thread, it's really hard to make
sure the above API works for all types of devices and frameworks. So having a
vendor-specific API looks much better.

  
  From the POV of userspace mgmt apps doing device compat checking / migration,
we certainly do NOT want to use different vendor specific APIs. We want to
have an API that can be used / controlled in a standard manner across vendors.



Yes, but it could be hard. E.g. vDPA will choose to use devlink
  (there's a long debate on sysfs vs devlink). So if we go with
  sysfs, at least two APIs need to be supported ...
Thanks




  



Regards,
Daniel

  




Re: device compatibility interface for live migration with assigned devices

2020-08-17 Thread Jason Wang



On 2020/8/14 1:16 PM, Yan Zhao wrote:

On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:

On 2020/8/10 3:46 PM, Yan Zhao wrote:

driver is it handled by?

It looks like devlink is specific to network devices, and in
devlink.h it says
include/uapi/linux/devlink.h - Network physical device Netlink
interface,


Actually no; I think there was some discussion last year and the
conclusion was to remove this comment.

It supports IB and probably vDPA in the future.


Hmm... sorry, I didn't find the referenced discussion, only the discussion
below regarding why devlink was added.

https://www.mail-archive.com/netdev@vger.kernel.org/msg95801.html
>This doesn't seem to be too much related to networking? Why can't 
something
>like this be in sysfs?

It is related to networking quite bit. There has been couple of
iteration of this, including sysfs and configfs implementations. There
has been a consensus reached that this should be done by netlink. I
believe netlink is really the best for this purpose. Sysfs is not a good
idea



See the discussion here:

https://patchwork.ozlabs.org/project/netdev/patch/20191115223355.1277139-1-jeffrey.t.kirs...@intel.com/




https://www.mail-archive.com/netdev@vger.kernel.org/msg96102.html
>there is already a way to change eth/ib via
>echo 'eth' > /sys/bus/pci/drivers/mlx4_core/:02:00.0/mlx4_port1
>
>sounds like this is another way to achieve the same?

It is. However the current way is driver-specific, not correct.
For mlx5, we need the same, it cannot be done in this way. Do devlink is
the correct way to go.

https://lwn.net/Articles/674867/
There a is need for some userspace API that would allow to expose things
that are not directly related to any device class like net_device of
ib_device, but rather chip-wide/switch-ASIC-wide stuff.

Use cases:
1) get/set of port type (Ethernet/InfiniBand)
2) monitoring of hardware messages to and from chip
3) setting up port splitters - split port into multiple ones and squash 
again,
   enables usage of splitter cable
4) setting up shared buffers - shared among multiple ports within one 
chip



we actually can also retrieve the same information through sysfs, e.g.

|- [path to device]
   |--- migration
   |     |--- self
   |     |     |--- device_api
   |     |     |--- mdev_type
   |     |     |--- software_version
   |     |     |--- device_id
   |     |     |--- aggregator
   |     |--- compatible
   |     |     |--- device_api
   |     |     |--- mdev_type
   |     |     |--- software_version
   |     |     |--- device_id
   |     |     |--- aggregator



Yes but:

- You need one file per attribute (one syscall for one attribute)
- Attribute is coupled with kobject

All of the above seems unnecessary.

Another point: as we discussed in another thread, it's really hard to
make sure the above API works for all types of devices and frameworks. So
having a vendor-specific API looks much better.






   I feel like it's not very appropriate for a GPU driver to use
this interface. Is that right?


I think not though most of the users are switch or ethernet devices. It
doesn't prevent you from inventing new abstractions.

so need to patch devlink core and the userspace devlink tool?
e.g. devlink migration



It's quite flexible: you can extend devlink, invent your own, or let mgmt
speak devlink directly.






Note that devlink is based on netlink, netlink has been widely used by
various subsystems other than networking.

the advantage of netlink I see is that it can monitor device status and
notify upper layer that migration database needs to get updated.



I may have missed something, but why is this needed?

From the device point of view, the following capabilities should be
sufficient to support live migration (a purely illustrative sketch follows
this list):


- set/get device state
- report dirty page tracking
- set/get capability
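
A purely hypothetical sketch of those three capabilities as a C ops table
(none of these names exist in the kernel; they only restate the bullets
above):

#include <stddef.h>   /* size_t */

/* Hypothetical illustration only: a vendor-implemented ops table covering
 * the three capabilities listed above. All names are invented. */
struct live_migration_dev_ops {
    /* set/get device state */
    int (*save_state)(void *dev, void *buf, size_t len);
    int (*restore_state)(void *dev, const void *buf, size_t len);

    /* report dirty page tracking */
    int (*start_dirty_tracking)(void *dev);
    int (*report_dirty_pages)(void *dev, unsigned long *bitmap,
                              unsigned long npages);

    /* set/get capability */
    int (*get_capability)(void *dev, unsigned int cap, unsigned long *value);
    int (*set_capability)(void *dev, unsigned int cap, unsigned long value);
};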



But not sure whether openstack would like to use this capability.
As Sean said, it's heavy for openstack. it's heavy for vendor driver
as well :)



Well, it depends on several factors. Just counting LOCs, sysfs-based
attributes are not lightweight.


Thanks




And devlink monitor now listens the notification and dumps the state
changes. If we want to use it, need to let it forward the notification
and dumped info to openstack, right?

Thanks
Yan





Re: device compatibility interface for live migration with assigned devices

2020-08-12 Thread Jason Wang



On 2020/8/10 3:46 PM, Yan Zhao wrote:

driver is it handled by?

It looks like devlink is specific to network devices, and in
devlink.h it says
include/uapi/linux/devlink.h - Network physical device Netlink
interface,



Actually no; I think there was some discussion last year and
the conclusion was to remove this comment.


It supports IB and probably vDPA in the future.



  I feel like it's not very appropriate for a GPU driver to use
this interface. Is that right?



I think not though most of the users are switch or ethernet devices. It 
doesn't prevent you from inventing new abstractions.


Note that devlink is based on netlink, netlink has been widely used by 
various subsystems other than networking.


Thanks




Thanks
Yan
  





Re: device compatibility interface for live migration with assigned devices

2020-08-05 Thread Jason Wang



On 2020/8/5 3:56 PM, Jiri Pirko wrote:

Wed, Aug 05, 2020 at 04:41:54AM CEST, jasow...@redhat.com wrote:

On 2020/8/5 10:16 AM, Yan Zhao wrote:

On Wed, Aug 05, 2020 at 10:22:15AM +0800, Jason Wang wrote:

On 2020/8/5 12:35 AM, Cornelia Huck wrote:

[sorry about not chiming in earlier]

On Wed, 29 Jul 2020 16:05:03 +0800
Yan Zhao  wrote:


On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:

(...)


Based on the feedback we've received, the previously proposed interface
is not viable.  I think there's agreement that the user needs to be
able to parse and interpret the version information.  Using json seems
viable, but I don't know if it's the best option.  Is there any
precedent of markup strings returned via sysfs we could follow?

I don't think encoding complex information in a sysfs file is a viable
approach. Quoting Documentation/filesystems/sysfs.rst:

"Attributes should be ASCII text files, preferably with only one value
per file. It is noted that it may not be efficient to contain only one
value per file, so it is socially acceptable to express an array of
values of the same type.
Mixing types, expressing multiple lines of data, and doing fancy
formatting of data is heavily frowned upon."

Even though this is an older file, I think these restrictions still
apply.

+1, that's another reason why devlink(netlink) is better.


hi Jason,
do you have any materials or sample code about devlink, so we can have a good
study of it?
I found some kernel docs about it but my preliminary study didn't show me the
advantage of devlink.


CC Jiri and Parav for a better answer for this.

My understanding is that the following advantages are obvious (as I replied
in another thread):

- existing users (NIC, crypto, SCSI, ib), mature and stable
- much better error reporting (ext_ack other than string or errno)
- namespace aware
- do not couple with kobject

Jason, what is your use case?



I think the use case is to report device compatibility for live 
migration. Yan proposed a simple sysfs based migration version first, 
but it looks not sufficient and something based on JSON is discussed.


Yan, can you help to summarize the discussion so far for Jiri as a 
reference?


Thanks







Thanks



Thanks
Yan





Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-08-04 Thread Jason Wang



On 2020/8/5 4:30 AM, Peter Xu wrote:

On Mon, Aug 03, 2020 at 06:00:34PM +0200, Eugenio Pérez wrote:

On Fri, 2020-07-03 at 15:24 +0800, Jason Wang wrote:

On 2020/7/2 11:45 PM, Peter Xu wrote:

On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote:

So I think we agree that a new notifier is needed?

Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)?

That should work but I wonder something as following is better.

Instead of introducing new flags, how about carry the type of event in
the notifier then the device (vhost) can choose the message it want to
process like:

static void vhost_iommu_event(IOMMUNotifier *n, IOMMUTLBEvent *event)
{
    switch (event->type) {
    case IOMMU_MAP:
    case IOMMU_UNMAP:
    case IOMMU_DEV_IOTLB_UNMAP:
        ...
    }
}

Thanks



Hi!

Sorry, I thought I had this clear but now it seems not so clear to me. Do you 
mean to add that switch to the current
vhost_iommu_unmap_notify, and then the "type" field to the IOMMUTLBEntry? Is 
that the scope of the changes, or there is
something I'm missing?

If that is correct, what is the advantage for vhost or other notifiers? I
understand that moving the IOMMUTLBEntry (addr,
len) -> (iova, mask) split/transformation to the different notifiers'
implementations could pollute them, but this is an even deeper change, and
vhost is not interested in other events but IOMMU_UNMAP, isn't it?

On the other hand, who decide what type of event is? If I follow the backtrace 
of the assert in
https://lists.gnu.org/archive/html/qemu-devel/2020-07/msg01015.html, it seems 
to me that it should be
vtd_process_device_iotlb_desc. How do I know if it should be IOMMU_UNMAP or 
IOMMU_DEV_IOTLB_UNMAP? If I set it in some
function of memory.c, I should decide the type looking the actual notifier, 
isn't?

(Since Jason didn't reply yesterday, I'll try to; Jason, feel free to correct
  me...)

IMHO whether to put the type into the IOMMUTLBEntry is not important.  The
important change should be that we introduce IOMMU_DEV_IOTLB_UNMAP (or I'd
rather call it IOMMU_DEV_IOTLB directly which is shorter and cleaner).  With
that information we can make the failing assertion conditional for MAP/UNMAP
only.



Or having another dedicated device IOTLB notifier.



   We can also allow dev-iotlb messages to take arbitrary addr_mask (so it
becomes a length of address range; imho we can keep using addr_mask for
simplicity, but we can comment for addr_mask that for dev-iotlb it can be not a
real mask).



Yes.

Thanks




Thanks,





Re: device compatibility interface for live migration with assigned devices

2020-08-04 Thread Jason Wang



On 2020/8/5 10:16 AM, Yan Zhao wrote:

On Wed, Aug 05, 2020 at 10:22:15AM +0800, Jason Wang wrote:

On 2020/8/5 12:35 AM, Cornelia Huck wrote:

[sorry about not chiming in earlier]

On Wed, 29 Jul 2020 16:05:03 +0800
Yan Zhao  wrote:


On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:

(...)


Based on the feedback we've received, the previously proposed interface
is not viable.  I think there's agreement that the user needs to be
able to parse and interpret the version information.  Using json seems
viable, but I don't know if it's the best option.  Is there any
precedent of markup strings returned via sysfs we could follow?

I don't think encoding complex information in a sysfs file is a viable
approach. Quoting Documentation/filesystems/sysfs.rst:

"Attributes should be ASCII text files, preferably with only one value
per file. It is noted that it may not be efficient to contain only one
value per file, so it is socially acceptable to express an array of
values of the same type.
Mixing types, expressing multiple lines of data, and doing fancy
formatting of data is heavily frowned upon."

Even though this is an older file, I think these restrictions still
apply.


+1, that's another reason why devlink(netlink) is better.


hi Jason,
do you have any materials or sample code about devlink, so we can have a good
study of it?
I found some kernel docs about it but my preliminary study didn't show me the
advantage of devlink.



CC Jiri and Parav for a better answer for this.

My understanding is that the following advantages are obvious (as I 
replied in another thread):


- existing users (NIC, crypto, SCSI, ib), mature and stable
- much better error reporting (ext_ack other than string or errno)
- namespace aware
- do not couple with kobject

Thanks




Thanks
Yan





Re: device compatibility interface for live migration with assigned devices

2020-08-04 Thread Jason Wang



On 2020/8/5 12:35 AM, Cornelia Huck wrote:

[sorry about not chiming in earlier]

On Wed, 29 Jul 2020 16:05:03 +0800
Yan Zhao  wrote:


On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:

(...)


Based on the feedback we've received, the previously proposed interface
is not viable.  I think there's agreement that the user needs to be
able to parse and interpret the version information.  Using json seems
viable, but I don't know if it's the best option.  Is there any
precedent of markup strings returned via sysfs we could follow?

I don't think encoding complex information in a sysfs file is a viable
approach. Quoting Documentation/filesystems/sysfs.rst:

"Attributes should be ASCII text files, preferably with only one value
per file. It is noted that it may not be efficient to contain only one
value per file, so it is socially acceptable to express an array of
values of the same type.
Mixing types, expressing multiple lines of data, and doing fancy
formatting of data is heavily frowned upon."

Even though this is an older file, I think these restrictions still
apply.



+1, that's another reason why devlink(netlink) is better.

Thanks



Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-21 Thread Jason Wang



On 2020/7/20 9:03 PM, Peter Xu wrote:

On Mon, Jul 20, 2020 at 12:02:06PM +0800, Jason Wang wrote:

Right, so there's no need to deal with unmap in vtd's replay implementation
(as what generic one did).

We don't even for now; see vtd_page_walk_info.notify_unmap.  Thanks,



Right, but I meant the vtd_address_space_unmap() in vtd_iommu_replay().
It looks to me that it will trigger UNMAP notifiers.


Thanks




Re: device compatibility interface for live migration with assigned devices

2020-07-20 Thread Jason Wang



On 2020/7/20 6:39 PM, Sean Mooney wrote:

On Mon, 2020-07-20 at 11:41 +0800, Jason Wang wrote:

On 2020/7/18 12:12 AM, Alex Williamson wrote:

On Thu, 16 Jul 2020 16:32:30 +0800
Yan Zhao  wrote:


On Thu, Jul 16, 2020 at 12:16:26PM +0800, Jason Wang wrote:

On 2020/7/14 7:29 AM, Yan Zhao wrote:

hi folks,
we are defining a device migration compatibility interface that helps upper
layer stack like openstack/ovirt/libvirt to check if two devices are
live migration compatible.
The "devices" here could be MDEVs, physical devices, or hybrid of the two.
e.g. we could use it to check whether
- a src MDEV can migrate to a target MDEV,
- a src VF in SRIOV can migrate to a target VF in SRIOV,
- a src MDEV can migration to a target VF in SRIOV.
 (e.g. SIOV/SRIOV backward compatibility case)

The upper layer stack could use this interface as the last step to check
if one device is able to migrate to another device before triggering a real
live migration procedure.
we are not sure if this interface is of value or help to you. please don't
hesitate to drop your valuable comments.


(1) interface definition
The interface is defined in below way:

                    userspace
                   /         \
             read /           \ write
                 /            \|/
   | migration_version |   | migration_version |  --> check migration
   ---------------------   ---------------------      compatibility
        device A                 device B


a device attribute named migration_version is defined under each device's
sysfs node. e.g. 
(/sys/bus/pci/devices/\:00\:02.0/$mdev_UUID/migration_version).

Are you aware of the devlink based device management interface that is
proposed upstream? I think it has many advantages over sysfs, do you
consider to switch to that?

Advantages, such as?


My understanding for devlink(netlink) over sysfs (some are mentioned at
the time of vDPA sysfs mgmt API discussion) are:

I thought netlink was used more as a configuration protocol to query and
configure NICs and, I guess, other devices in its devlink form, requiring a
tool to be written that can speak the protocol to interact with it.
The primary advantage of sysfs is that everything is just a file; there are no
additional dependencies needed.



Well, if you try to build logic like introspection on top of sophisticated
hardware, you probably need to have a library on top. And its
attribute-per-file model is pretty inefficient.




And unlike netlink, there are no interoperability issues in a containerised
env. If you are using different versions of libc and gcc in the container vs
the host, my understanding is that tools like ethtool from Ubuntu deployed in
a container on a CentOS host can have issues communicating with the host
kernel.



Kernel provides stable ABI for userspace, so it's not something that we 
can't fix.




If it's just a file, then unless the format the data is returned in changes
or the layout of sysfs changes, it's compatible regardless of what you use to
read it.



I believe you can't change the sysfs layout, which is part of the uABI. But as
I mentioned below, sysfs has several drawbacks. There's no harm in comparing
different approaches when you start a new device management API.


Thanks



- existing users (NIC, crypto, SCSI, ib), mature and stable
- much better error reporting (ext_ack other than string or errno)
- namespace aware
- do not couple with kobject

Thanks





Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-19 Thread Jason Wang



On 2020/7/17 10:18 PM, Peter Xu wrote:

On Thu, Jul 16, 2020 at 10:54:31AM +0800, Jason Wang wrote:

On 2020/7/16 9:00 AM, Peter Xu wrote:

On Mon, Jul 13, 2020 at 12:04:16PM +0800, Jason Wang wrote:

On 2020/7/10 9:30 PM, Peter Xu wrote:

On Fri, Jul 10, 2020 at 02:34:11PM +0800, Jason Wang wrote:

On 2020/7/9 10:10 PM, Peter Xu wrote:

On Thu, Jul 09, 2020 at 01:58:33PM +0800, Jason Wang wrote:

- If we care the performance, it's better to implement the MAP event for
vhost, otherwise it could be a lot of IOTLB miss

I feel like these are two things.

So far what we are talking about is whether vt-d should have knowledge about
what kind of events one iommu notifier is interested in.  I still think we
should keep this as answered in question 1.

The other question is whether we want to switch vhost from UNMAP to MAP/UNMAP
events even without vDMA, so that vhost can establish the mapping even before
IO starts.  IMHO it's doable, but only if the guest runs DPDK workloads.  When
the guest is using dynamic iommu page mappings, I feel like that can be even
slower, because then the worst case is for each IO we'll need to vmexit twice:

   - The first vmexit caused by an invalidation to MAP the page tables, so 
vhost
 will setup the page table before IO starts

   - IO/DMA triggers and completes

   - The second vmexit caused by another invalidation to UNMAP the page 
tables

So it seems to be worse than when vhost only uses UNMAP like right now.  At
least we only have one vmexit (when UNMAP).  We'll have a vhost translate()
request from kernel to userspace, but IMHO that's cheaper than the vmexit.

Right but then I would still prefer to have another notifier.

Since vtd_page_walk has nothing to do with device IOTLB. IOMMU have a
dedicated command for flushing device IOTLB. But the check for
vtd_as_has_map_notifier is used to skip the device which can do demand
paging via ATS or device specific way. If we have two different notifiers,
vhost will be on the device iotlb notifier so we don't need it at all?

But we can still have iommu notifier that only registers to UNMAP even after we
introduce dev-iotlb notifier?  We don't want to do page walk for them as well.
TCG should be the only one so far, but I don't know.. maybe there can still be
new ones?

I think you're right. But looking at the codes, it looks like the check of
vtd_as_has_map_notifier() was only used in:

1) vtd_iommu_replay()
2) vtd_iotlb_page_invalidate_notify() (PSI)

For the replay, it's expensive anyhow. For PSI, I think it's just about one
or few mappings, not sure it will have obvious performance impact.

And I had two questions:

1) The codes doesn't check map for DSI or GI, does this match what spec
said? (It looks to me the spec is unclear in this part)

Both DSI/GI should cover maps too?  E.g. vtd_sync_shadow_page_table() in
vtd_iotlb_domain_invalidate().

I meant the code doesn't check whether there's an MAP notifier :)

It's actually checked, because it loops over vtd_as_with_notifiers, and only
MAP notifiers register to that. :)


I may miss something but I don't find the code to block UNMAP notifiers?

vhost_iommu_region_add()
     memory_region_register_iommu_notifier()
         memory_region_update_iommu_notify_flags()
             imrc->notify_flag_changed()
                 vtd_iommu_notify_flag_changed()

?

Yeah I think you're right.  I might have confused with some previous
implementations.  Maybe we should also do similar thing for DSI/GI just like
what we do in PSI.



Ok.





2) for the replay() I don't see other implementations (either spapr or
generic one) that did unmap (actually they skip unmap explicitly), any
reason for doing this in intel IOMMU?

I could be wrong, but I'd guess it's because vt-d implemented the caching mode
by leveraging the same invalidation strucuture, so it's harder to make all
things right (IOW, we can't clearly identify MAP with UNMAP when we receive an
invalidation request, because MAP/UNMAP requests look the same).

I didn't check others, but I believe spapr is doing it differently by using
some hypercalls to deliver IOMMU map/unmap requests, which seems a bit close to
what virtio-iommu is doing.  Anyway, the point is if we have explicit MAP/UNMAP
from the guest, logically the replay indeed does not need to do any unmap
because we don't need to call replay() on an already existing device but only
for e.g. hot plug.

But this looks conflict with what memory_region_iommu_replay( ) did, for
IOMMU that doesn't have a replay method, it skips UNMAP request:

    for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
        iotlb = imrc->translate(iommu_mr, addr, IOMMU_NONE, n->iommu_idx);
        if (iotlb.perm != IOMMU_NONE) {
            n->notify(n, &iotlb);
        }

I guess there's no knowledge of whether guest have an explicit MAP/UMAP for
this generic code. Or replay implies that guest doesn't have explicit
MAP/UNMAP?

I think it 

Re: device compatibility interface for live migration with assigned devices

2020-07-19 Thread Jason Wang



On 2020/7/18 12:12 AM, Alex Williamson wrote:

On Thu, 16 Jul 2020 16:32:30 +0800
Yan Zhao  wrote:


On Thu, Jul 16, 2020 at 12:16:26PM +0800, Jason Wang wrote:

On 2020/7/14 7:29 AM, Yan Zhao wrote:

hi folks,
we are defining a device migration compatibility interface that helps upper
layer stack like openstack/ovirt/libvirt to check if two devices are
live migration compatible.
The "devices" here could be MDEVs, physical devices, or hybrid of the two.
e.g. we could use it to check whether
- a src MDEV can migrate to a target MDEV,
- a src VF in SRIOV can migrate to a target VF in SRIOV,
- a src MDEV can migration to a target VF in SRIOV.
(e.g. SIOV/SRIOV backward compatibility case)

The upper layer stack could use this interface as the last step to check
if one device is able to migrate to another device before triggering a real
live migration procedure.
we are not sure if this interface is of value or help to you. please don't
hesitate to drop your valuable comments.


(1) interface definition
The interface is defined in below way:

                    userspace
                   /         \
             read /           \ write
                 /            \|/
   | migration_version |   | migration_version |  --> check migration
   ---------------------   ---------------------      compatibility
        device A                 device B


a device attribute named migration_version is defined under each device's
sysfs node. e.g. 
(/sys/bus/pci/devices/\:00\:02.0/$mdev_UUID/migration_version).


Are you aware of the devlink based device management interface that is
proposed upstream? I think it has many advantages over sysfs, do you
consider to switch to that?


Advantages, such as?



My understanding for devlink(netlink) over sysfs (some are mentioned at 
the time of vDPA sysfs mgmt API discussion) are:


- existing users (NIC, crypto, SCSI, ib), mature and stable
- much better error reporting (ext_ack other than string or errno)
- namespace aware
- do not couple with kobject

Thanks



Re: device compatibility interface for live migration with assigned devices

2020-07-16 Thread Jason Wang



On 2020/7/16 4:32 PM, Yan Zhao wrote:

On Thu, Jul 16, 2020 at 12:16:26PM +0800, Jason Wang wrote:

On 2020/7/14 7:29 AM, Yan Zhao wrote:

hi folks,
we are defining a device migration compatibility interface that helps upper
layer stack like openstack/ovirt/libvirt to check if two devices are
live migration compatible.
The "devices" here could be MDEVs, physical devices, or hybrid of the two.
e.g. we could use it to check whether
- a src MDEV can migrate to a target MDEV,
- a src VF in SRIOV can migrate to a target VF in SRIOV,
- a src MDEV can migration to a target VF in SRIOV.
(e.g. SIOV/SRIOV backward compatibility case)

The upper layer stack could use this interface as the last step to check
if one device is able to migrate to another device before triggering a real
live migration procedure.
we are not sure if this interface is of value or help to you. please don't
hesitate to drop your valuable comments.


(1) interface definition
The interface is defined in below way:

                    userspace
                   /         \
             read /           \ write
                 /            \|/
   | migration_version |   | migration_version |  --> check migration
   ---------------------   ---------------------      compatibility
        device A                 device B


a device attribute named migration_version is defined under each device's
sysfs node. e.g. 
(/sys/bus/pci/devices/\:00\:02.0/$mdev_UUID/migration_version).


Are you aware of the devlink based device management interface that is
proposed upstream? I think it has many advantages over sysfs, do you
consider to switch to that?

not familiar with the devlink. will do some research of it.



userspace tools read the migration_version as a string from the source device,
and write it to the migration_version sysfs attribute in the target device.

The userspace should treat ANY of below conditions as two devices not 
compatible:
- any one of the two devices does not have a migration_version attribute
- error when reading from migration_version attribute of one device
- error when writing migration_version string of one device to
migration_version attribute of the other device

The string read from migration_version attribute is defined by device vendor
driver and is completely opaque to the userspace.
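
For illustration, the userspace side of that read/write handshake is tiny; a
minimal sketch (the sysfs paths are hypothetical examples, and every failure
is collapsed into "not compatible" exactly as described above):

/* Minimal sketch of the proposed check: read migration_version from the
 * source device and write it into the target device's attribute; any
 * failure is treated as "not compatible". Paths below are made up. */
#include <stdio.h>

static int devices_compatible(const char *src_attr, const char *dst_attr)
{
    char version[4096];
    size_t len;
    int ok;
    FILE *src, *dst;

    src = fopen(src_attr, "r");
    if (!src)
        return 0;                  /* missing attribute -> not compatible */
    len = fread(version, 1, sizeof(version), src);
    fclose(src);
    if (len == 0)
        return 0;                  /* read error -> not compatible */

    dst = fopen(dst_attr, "w");
    if (!dst)
        return 0;
    ok = fwrite(version, 1, len, dst) == len;
    if (fclose(dst) != 0)          /* the driver may reject the string here */
        ok = 0;
    return ok;
}

int main(void)
{
    const char *a = "/sys/bus/mdev/devices/UUID-A/migration_version";
    const char *b = "/sys/bus/mdev/devices/UUID-B/migration_version";

    printf("compatible: %s\n", devices_compatible(a, b) ? "yes" : "no");
    return 0;
}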


My understanding is that something opaque to userspace is not the philosophy

but the VFIO live migration in itself is essentially a big opaque stream to 
userspace.



I think it's better not to limit the kernel interface to a specific use
case. This is basically device introspection.






of Linux. Instead of having a generic API but opaque value, why not do in a
vendor specific way like:

1) exposing the device capability in a vendor specific way via sysfs/devlink
or other API
2) management read capability in both src and dst and determine whether we
can do the migration

This is the way we plan to do with vDPA.


yes, in another reply, Alex proposed to use an interface in json format.
I guess we can define something like

{ "self" :
   [
     { "pciid" : "8086591d",
       "driver" : "i915",
       "gvt-version" : "v1",
       "mdev_type"   : "i915-GVTg_V5_2",
       "aggregator"  : "1",
       "pv-mode"     : "none"
     }
   ],
  "compatible" :
   [
     { "pciid" : "8086591d",
       "driver" : "i915",
       "gvt-version" : "v1",
       "mdev_type"   : "i915-GVTg_V5_2",
       "aggregator"  : "1",
       "pv-mode"     : "none"
     },
     { "pciid" : "8086591d",
       "driver" : "i915",
       "gvt-version" : "v1",
       "mdev_type"   : "i915-GVTg_V5_4",
       "aggregator"  : "2",
       "pv-mode"     : "none"
     },
     { "pciid" : "8086591d",
       "driver" : "i915",
       "gvt-version" : "v2",
       "mdev_type"   : "i915-GVTg_V5_4",
       "aggregator"  : "2",
       "pv-mode"     : "none, ppgtt, context"
     }
     ...
   ]
}



This is probably another call for a devlink-based interface.




But as those fields are mostly vendor specific, the userspace can
only do simple string comparing, I guess the list would be very long as
it needs to enumerate all possible targets.
also, in some fileds like "gvt-version", is there a simple way to express
things like v2+?



That's totally vendor-specific I think. If "v2+" means it only supports
version 2+, we can introduce fields like min_version and max_version.
B

Re: device compatibility interface for live migration with assigned devices

2020-07-15 Thread Jason Wang



On 2020/7/14 7:29 AM, Yan Zhao wrote:

hi folks,
we are defining a device migration compatibility interface that helps upper
layer stack like openstack/ovirt/libvirt to check if two devices are
live migration compatible.
The "devices" here could be MDEVs, physical devices, or hybrid of the two.
e.g. we could use it to check whether
- a src MDEV can migrate to a target MDEV,
- a src VF in SRIOV can migrate to a target VF in SRIOV,
- a src MDEV can migration to a target VF in SRIOV.
   (e.g. SIOV/SRIOV backward compatibility case)

The upper layer stack could use this interface as the last step to check
if one device is able to migrate to another device before triggering a real
live migration procedure.
we are not sure if this interface is of value or help to you. please don't
hesitate to drop your valuable comments.


(1) interface definition
The interface is defined in below way:

                    userspace
                   /         \
             read /           \ write
                 /            \|/
   | migration_version |   | migration_version |  --> check migration
   ---------------------   ---------------------      compatibility
        device A                 device B


a device attribute named migration_version is defined under each device's
sysfs node. e.g. 
(/sys/bus/pci/devices/\:00\:02.0/$mdev_UUID/migration_version).



Are you aware of the devlink based device management interface that is 
proposed upstream? I think it has many advantages over sysfs, do you 
consider to switch to that?




userspace tools read the migration_version as a string from the source device,
and write it to the migration_version sysfs attribute in the target device.

The userspace should treat ANY of below conditions as two devices not 
compatible:
- any one of the two devices does not have a migration_version attribute
- error when reading from migration_version attribute of one device
- error when writing migration_version string of one device to
   migration_version attribute of the other device

The string read from migration_version attribute is defined by device vendor
driver and is completely opaque to the userspace.



My understanding is that something opaque to userspace is not the 
philosophy of Linux. Instead of having a generic API but opaque value, 
why not do in a vendor specific way like:


1) exposing the device capability in a vendor specific way via 
sysfs/devlink or other API
2) management read capability in both src and dst and determine whether 
we can do the migration


This is the way we plan to do with vDPA.

Thanks



for a Intel vGPU, string format can be defined like
"parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator 
count".

for an NVMe VF connecting to a remote storage. it could be
"PCI ID" + "driver version" + "configured remote storage URL"

for a QAT VF, it may be
"PCI ID" + "driver version" + "supported encryption set".

(to avoid namespace confliction from each vendor, we may prefix a driver name to
each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)


(2) backgrounds

The reason we hope the migration_version string is opaque to the userspace
is that it is hard to generalize standard comparing fields and comparing
methods for different devices from different vendors.
Though userspace now could still do a simple string compare to check if
two devices are compatible, and result should also be right, it's still
too limited as it excludes the possible candidate whose migration_version
string fails to be equal.
e.g. an MDEV with mdev_type_1, aggregator count 3 is probably compatible
with another MDEV with mdev_type_3, aggregator count 1, even their
migration_version strings are not equal.
(assumed mdev_type_3 is of 3 times equal resources of mdev_type_1).

besides that, driver version + configured resources are all elements demanding
to take into account.

So, we hope leaving the freedom to vendor driver and let it make the final 
decision
in a simple reading from source side and writing for test in the target side 
way.


we then think the device compatibility issues for live migration with assigned
devices can be divided into two steps:
a. management tools filter out possible migration target devices.
Tags could be created according to info from product specification.
we think openstack/ovirt may have vendor proprietary components to create
those customized tags for each product from each vendor.
e.g.
for Intel vGPU, with a vGPU(a MDEV device) in source side, the tags to
search target vGPU are like:
a tag for compatible parent PCI IDs,
a tag for a range of gvt driver versions,
a tag for a range of mdev type + aggregator count

for NVMe VF, the tags to search target VF may be like:
a tag for compatible PCI IDs,
a tag for a range of driver versions,
a tag for URL of configured remote storage.

b. with the output from 

Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-15 Thread Jason Wang



On 2020/7/16 9:00 AM, Peter Xu wrote:

On Mon, Jul 13, 2020 at 12:04:16PM +0800, Jason Wang wrote:

On 2020/7/10 9:30 PM, Peter Xu wrote:

On Fri, Jul 10, 2020 at 02:34:11PM +0800, Jason Wang wrote:

On 2020/7/9 10:10 PM, Peter Xu wrote:

On Thu, Jul 09, 2020 at 01:58:33PM +0800, Jason Wang wrote:

- If we care the performance, it's better to implement the MAP event for
vhost, otherwise it could be a lot of IOTLB miss

I feel like these are two things.

So far what we are talking about is whether vt-d should have knowledge about
what kind of events one iommu notifier is interested in.  I still think we
should keep this as answered in question 1.

The other question is whether we want to switch vhost from UNMAP to MAP/UNMAP
events even without vDMA, so that vhost can establish the mapping even before
IO starts.  IMHO it's doable, but only if the guest runs DPDK workloads.  When
the guest is using dynamic iommu page mappings, I feel like that can be even
slower, because then the worst case is for each IO we'll need to vmexit twice:

  - The first vmexit caused by an invalidation to MAP the page tables, so 
vhost
will setup the page table before IO starts

  - IO/DMA triggers and completes

  - The second vmexit caused by another invalidation to UNMAP the page 
tables

So it seems to be worse than when vhost only uses UNMAP like right now.  At
least we only have one vmexit (when UNMAP).  We'll have a vhost translate()
request from kernel to userspace, but IMHO that's cheaper than the vmexit.

Right but then I would still prefer to have another notifier.

Since vtd_page_walk has nothing to do with device IOTLB. IOMMU have a
dedicated command for flushing device IOTLB. But the check for
vtd_as_has_map_notifier is used to skip the device which can do demand
paging via ATS or device specific way. If we have two different notifiers,
vhost will be on the device iotlb notifier so we don't need it at all?

But we can still have iommu notifier that only registers to UNMAP even after we
introduce dev-iotlb notifier?  We don't want to do page walk for them as well.
TCG should be the only one so far, but I don't know.. maybe there can still be
new ones?

I think you're right. But looking at the codes, it looks like the check of
vtd_as_has_map_notifier() was only used in:

1) vtd_iommu_replay()
2) vtd_iotlb_page_invalidate_notify() (PSI)

For the replay, it's expensive anyhow. For PSI, I think it's just about one
or few mappings, not sure it will have obvious performance impact.

And I had two questions:

1) The codes doesn't check map for DSI or GI, does this match what spec
said? (It looks to me the spec is unclear in this part)

Both DSI/GI should cover maps too?  E.g. vtd_sync_shadow_page_table() in
vtd_iotlb_domain_invalidate().


I meant the code doesn't check whether there's an MAP notifier :)

It's actually checked, because it loops over vtd_as_with_notifiers, and only
MAP notifiers register to that. :)



I may miss something but I don't find the code to block UNMAP notifiers?

vhost_iommu_region_add()
    memory_region_register_iommu_notifier()
        memory_region_update_iommu_notify_flags()
            imrc->notify_flag_changed()
                vtd_iommu_notify_flag_changed()

?




But I agree with you that it should be cleaner to introduce the dev-iotlb
notifier type.



Yes, it can solve the duplicated UNMAP issue of vhost.





2) for the replay() I don't see other implementations (either spapr or
generic one) that did unmap (actually they skip unmap explicitly), any
reason for doing this in intel IOMMU?

I could be wrong, but I'd guess it's because vt-d implemented the caching mode
by leveraging the same invalidation strucuture, so it's harder to make all
things right (IOW, we can't clearly identify MAP with UNMAP when we receive an
invalidation request, because MAP/UNMAP requests look the same).

I didn't check others, but I believe spapr is doing it differently by using
some hypercalls to deliver IOMMU map/unmap requests, which seems a bit close to
what virtio-iommu is doing.  Anyway, the point is if we have explicit MAP/UNMAP
from the guest, logically the replay indeed does not need to do any unmap
because we don't need to call replay() on an already existing device but only
for e.g. hot plug.


But this looks conflict with what memory_region_iommu_replay( ) did, for
IOMMU that doesn't have a replay method, it skips UNMAP request:

    for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
        iotlb = imrc->translate(iommu_mr, addr, IOMMU_NONE, n->iommu_idx);
        if (iotlb.perm != IOMMU_NONE) {
            n->notify(n, &iotlb);
        }

I guess there's no knowledge of whether guest have an explicit MAP/UMAP for
this generic code. Or replay implies that guest doesn't have explicit
MAP/UNMAP?

I think it matches exactly with a hot plug case?  Note that when IOMMU_NONE
could also mean the translation does not exist.  So it's actuall

Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-12 Thread Jason Wang



On 2020/7/10 9:30 PM, Peter Xu wrote:

On Fri, Jul 10, 2020 at 02:34:11PM +0800, Jason Wang wrote:

On 2020/7/9 10:10 PM, Peter Xu wrote:

On Thu, Jul 09, 2020 at 01:58:33PM +0800, Jason Wang wrote:

- If we care the performance, it's better to implement the MAP event for
vhost, otherwise it could be a lot of IOTLB miss

I feel like these are two things.

So far what we are talking about is whether vt-d should have knowledge about
what kind of events one iommu notifier is interested in.  I still think we
should keep this as answered in question 1.

The other question is whether we want to switch vhost from UNMAP to MAP/UNMAP
events even without vDMA, so that vhost can establish the mapping even before
IO starts.  IMHO it's doable, but only if the guest runs DPDK workloads.  When
the guest is using dynamic iommu page mappings, I feel like that can be even
slower, because then the worst case is for each IO we'll need to vmexit twice:

 - The first vmexit caused by an invalidation to MAP the page tables, so 
vhost
   will setup the page table before IO starts

 - IO/DMA triggers and completes

 - The second vmexit caused by another invalidation to UNMAP the page tables

So it seems to be worse than when vhost only uses UNMAP like right now.  At
least we only have one vmexit (when UNMAP).  We'll have a vhost translate()
request from kernel to userspace, but IMHO that's cheaper than the vmexit.

Right but then I would still prefer to have another notifier.

Since vtd_page_walk has nothing to do with device IOTLB. IOMMU have a
dedicated command for flushing device IOTLB. But the check for
vtd_as_has_map_notifier is used to skip the device which can do demand
paging via ATS or device specific way. If we have two different notifiers,
vhost will be on the device iotlb notifier so we don't need it at all?

But we can still have iommu notifier that only registers to UNMAP even after we
introduce dev-iotlb notifier?  We don't want to do page walk for them as well.
TCG should be the only one so far, but I don't know.. maybe there can still be
new ones?


I think you're right. But looking at the codes, it looks like the check of
vtd_as_has_map_notifier() was only used in:

1) vtd_iommu_replay()
2) vtd_iotlb_page_invalidate_notify() (PSI)

For the replay, it's expensive anyhow. For PSI, I think it's just about one
or few mappings, not sure it will have obvious performance impact.

And I had two questions:

1) The codes doesn't check map for DSI or GI, does this match what spec
said? (It looks to me the spec is unclear in this part)

Both DSI/GI should cover maps too?  E.g. vtd_sync_shadow_page_table() in
vtd_iotlb_domain_invalidate().



I meant the code doesn't check whether there's an MAP notifier :)





2) for the replay() I don't see other implementations (either spapr or
generic one) that did unmap (actually they skip unmap explicitly), any
reason for doing this in intel IOMMU?

I could be wrong, but I'd guess it's because vt-d implemented the caching mode
by leveraging the same invalidation strucuture, so it's harder to make all
things right (IOW, we can't clearly identify MAP with UNMAP when we receive an
invalidation request, because MAP/UNMAP requests look the same).

I didn't check others, but I believe spapr is doing it differently by using
some hypercalls to deliver IOMMU map/unmap requests, which seems a bit close to
what virtio-iommu is doing.  Anyway, the point is if we have explicit MAP/UNMAP
from the guest, logically the replay indeed does not need to do any unmap
because we don't need to call replay() on an already existing device but only
for e.g. hot plug.



But this looks conflict with what memory_region_iommu_replay( ) did, for 
IOMMU that doesn't have a replay method, it skips UNMAP request:


    for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
        iotlb = imrc->translate(iommu_mr, addr, IOMMU_NONE, n->iommu_idx);
        if (iotlb.perm != IOMMU_NONE) {
            n->notify(n, &iotlb);
        }

I guess there's no knowledge of whether guest have an explicit MAP/UMAP 
for this generic code. Or replay implies that guest doesn't have 
explicit MAP/UNMAP?


(btw, the code shortcut the memory_region_notify_one(), not sure the reason)



  VT-d does not have that clear interface, so VT-d needs to
maintain its own mapping structures, and also vt-d is using the same replay &
page_walk operations to sync all these structures, which complicated the vt-d
replay a bit.  With that, we assume replay() can be called anytime on a device,
and we won't notify duplicated MAPs to lower layer like vfio if it is mapped
before.  At the meantime, since we'll compare the latest mapping with the one
we cached in the iova tree, UNMAP becomes possible too.



AFAIK vtd_iommu_replay() did a completely UNMAP:

    /*
 * The replay can be triggered by either a invalidation or a newly
 * created entry. No matter what, we release existing mappings
 * (it means

Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-10 Thread Jason Wang



On 2020/7/9 10:10 PM, Peter Xu wrote:

On Thu, Jul 09, 2020 at 01:58:33PM +0800, Jason Wang wrote:

- If we care the performance, it's better to implement the MAP event for
vhost, otherwise it could be a lot of IOTLB miss

I feel like these are two things.

So far what we are talking about is whether vt-d should have knowledge about
what kind of events one iommu notifier is interested in.  I still think we
should keep this as answered in question 1.

The other question is whether we want to switch vhost from UNMAP to MAP/UNMAP
events even without vDMA, so that vhost can establish the mapping even before
IO starts.  IMHO it's doable, but only if the guest runs DPDK workloads.  When
the guest is using dynamic iommu page mappings, I feel like that can be even
slower, because then the worst case is for each IO we'll need to vmexit twice:

- The first vmexit caused by an invalidation to MAP the page tables, so 
vhost
  will setup the page table before IO starts

- IO/DMA triggers and completes

- The second vmexit caused by another invalidation to UNMAP the page tables

So it seems to be worse than when vhost only uses UNMAP like right now.  At
least we only have one vmexit (when UNMAP).  We'll have a vhost translate()
request from kernel to userspace, but IMHO that's cheaper than the vmexit.


Right but then I would still prefer to have another notifier.

Since vtd_page_walk has nothing to do with device IOTLB. IOMMU have a
dedicated command for flushing device IOTLB. But the check for
vtd_as_has_map_notifier is used to skip the device which can do demand
paging via ATS or device specific way. If we have two different notifiers,
vhost will be on the device iotlb notifier so we don't need it at all?

But we can still have iommu notifier that only registers to UNMAP even after we
introduce dev-iotlb notifier?  We don't want to do page walk for them as well.
TCG should be the only one so far, but I don't know.. maybe there can still be
new ones?



I think you're right. But looking at the codes, it looks like the check 
of vtd_as_has_map_notifier() was only used in:


1) vtd_iommu_replay()
2) vtd_iotlb_page_invalidate_notify() (PSI)

For the replay, it's expensive anyhow. For PSI, I think it's just about 
one or few mappings, not sure it will have obvious performance impact.


And I had two questions:

1) The codes doesn't check map for DSI or GI, does this match what spec 
said? (It looks to me the spec is unclear in this part)
2) for the replay() I don't see other implementations (either spapr or 
generic one) that did unmap (actually they skip unmap explicitly), any 
reason for doing this in intel IOMMU?


Thanks




Thanks,





Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-09 Thread Jason Wang



On 2020/7/8 10:16 PM, Peter Xu wrote:

On Wed, Jul 08, 2020 at 01:42:30PM +0800, Jason Wang wrote:

So it should be functional equivalent to vtd_as_has_notifier().

For example: in vtd_iommu_replay() we'll skip the replay if vhost has
registered the iommu notifier because vtd_as_has_map_notifier() will return
false.


Two questions:

- Do we care the performance here? If not, vhost may just ignore the MAP
event?

I think we care, because vtd_page_walk() can be expensive.



Ok.





- If we care the performance, it's better to implement the MAP event for
vhost, otherwise it could be a lot of IOTLB miss

I feel like these are two things.

So far what we are talking about is whether vt-d should have knowledge about
what kind of events one iommu notifier is interested in.  I still think we
should keep this as answered in question 1.

The other question is whether we want to switch vhost from UNMAP to MAP/UNMAP
events even without vDMA, so that vhost can establish the mapping even before
IO starts.  IMHO it's doable, but only if the guest runs DPDK workloads.  When
the guest is using dynamic iommu page mappings, I feel like that can be even
slower, because then the worst case is for each IO we'll need to vmexit twice:

   - The first vmexit caused by an invalidation to MAP the page tables, so vhost
 will setup the page table before IO starts

   - IO/DMA triggers and completes

   - The second vmexit caused by another invalidation to UNMAP the page tables

So it seems to be worse than when vhost only uses UNMAP like right now.  At
least we only have one vmexit (when UNMAP).  We'll have a vhost translate()
request from kernel to userspace, but IMHO that's cheaper than the vmexit.



Right but then I would still prefer to have another notifier.

Since vtd_page_walk has nothing to do with device IOTLB. IOMMU have a 
dedicated command for flushing device IOTLB. But the check for 
vtd_as_has_map_notifier is used to skip the device which can do demand 
paging via ATS or device specific way. If we have two different 
notifiers, vhost will be on the device iotlb notifier so we don't need 
it at all?


Thanks








Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-07 Thread Jason Wang



On 2020/7/8 3:54 AM, Peter Xu wrote:

On Tue, Jul 07, 2020 at 04:03:10PM +0800, Jason Wang wrote:

On 2020/7/3 9:03 PM, Peter Xu wrote:

On Fri, Jul 03, 2020 at 03:24:19PM +0800, Jason Wang wrote:

On 2020/7/2 11:45 PM, Peter Xu wrote:

On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote:

So I think we agree that a new notifier is needed?

Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)?

That should work but I wonder something as following is better.

Instead of introducing new flags, how about carry the type of event in the
notifier then the device (vhost) can choose the message it want to process
like:

static void vhost_iommu_event(IOMMUNotifier *n, IOMMUTLBEvent *event)
{
    switch (event->type) {
    case IOMMU_MAP:
    case IOMMU_UNMAP:
    case IOMMU_DEV_IOTLB_UNMAP:
        ...
    }
}

Looks ok to me, though imo we should still keep the registration information,
so VT-d knows which notifiers is interested in which events.  E.g., we can
still do something like vtd_as_has_map_notifier().


Is this for better performance?

I wonder whether it's worth it, since we can't have both vhost and vtd
registered into the same as.

/me failed to follow this sentence.. :(



Sorry, I meant "vfio" instead "vtd".





So it should be functional equivalent to vtd_as_has_notifier().

For example: in vtd_iommu_replay() we'll skip the replay if vhost has
registered the iommu notifier because vtd_as_has_map_notifier() will return
false.



Two questions:

- Do we care the performance here? If not, vhost may just ignore the MAP 
event?
- If we care the performance, it's better to implement the MAP event for 
vhost, otherwise it could be a lot of IOTLB miss


Thanks



   It'll only return true if the device is a vfio-pci device.

Without vtd_as_has_map_notifier(), how could we do that?





Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-07 Thread Jason Wang



On 2020/7/3 9:03 PM, Peter Xu wrote:

On Fri, Jul 03, 2020 at 03:24:19PM +0800, Jason Wang wrote:

On 2020/7/2 11:45 PM, Peter Xu wrote:

On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote:

So I think we agree that a new notifier is needed?

Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)?


That should work but I wonder something as following is better.

Instead of introducing new flags, how about carry the type of event in the
notifier then the device (vhost) can choose the message it want to process
like:

static void vhost_iommu_event(IOMMUNotifier *n, IOMMUTLBEvent *event)
{
    switch (event->type) {
    case IOMMU_MAP:
    case IOMMU_UNMAP:
    case IOMMU_DEV_IOTLB_UNMAP:
        ...
    }
}

Looks ok to me, though imo we should still keep the registration information,
so VT-d knows which notifiers is interested in which events.  E.g., we can
still do something like vtd_as_has_map_notifier().



Is this for better performance?

I wonder whether it's worth it, since we can't have both vhost and vtd
registered into the same as.


So it should be functional equivalent to vtd_as_has_notifier().

Thanks




So these are probably two different things: the new IOMMU_NOTIFIER_DEV_IOTLB
flag is one, as discussed; the other one is whether we would like to introduce
IOMMUTLBEvent to include the type, so that we can avoid introducing two
notifiers for one device, mainly to tell dev-iotlb invalidations from unmaps
(a rough sketch follows).
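
A sketch of that second idea, pairing the notification type with the existing
TLB entry so one callback can tell dev-iotlb invalidations from ordinary
unmaps (this assumes QEMU's IOMMUNotifierFlag/IOMMUTLBEntry types; the struct
that eventually lands may well differ):

/* Sketch only: carry the kind of notification next to the TLB entry, so
 * vhost can dispatch on MAP / UNMAP / dev-iotlb unmap from one notifier. */
typedef struct IOMMUTLBEvent {
    IOMMUNotifierFlag type;   /* e.g. IOMMU_NOTIFIER_MAP / IOMMU_NOTIFIER_UNMAP */
    IOMMUTLBEntry entry;      /* iova, translated_addr, addr_mask, perm */
} IOMMUTLBEvent;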

Thanks,





Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-03 Thread Jason Wang



On 2020/7/2 11:45 PM, Peter Xu wrote:

On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote:

So I think we agree that a new notifier is needed?

Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)?



That should work but I wonder something as following is better.

Instead of introducing new flags, how about carry the type of event in 
the notifier then the device (vhost) can choose the message it want to 
process like:


static void vhost_iommu_event(IOMMUNotifier *n, IOMMUTLBEvent *event)
{
    switch (event->type) {
    case IOMMU_MAP:
    case IOMMU_UNMAP:
    case IOMMU_DEV_IOTLB_UNMAP:
        ...
    }
}

Thanks








Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-01 Thread Jason Wang



On 2020/7/1 下午4:09, Jason Wang wrote:


On 2020/6/30 下午11:39, Peter Xu wrote:

On Tue, Jun 30, 2020 at 10:41:10AM +0800, Jason Wang wrote:

  /* According to ATS spec table 2.4:
   * S = 0, bits 15:12 = xxxx range size: 4K
   * S = 1, bits 15:12 = xxx0 range size: 8K
   * S = 1, bits 15:12 = xx01 range size: 16K
   * S = 1, bits 15:12 = x011 range size: 32K
   * S = 1, bits 15:12 = 0111 range size: 64K
   * ...
   */


Right, but the comment is probably misleading here, since it's for the PCI-E
transaction between IOMMU and device, not for the device IOTLB invalidation
descriptor.

For device IOTLB invalidation descriptor, spec allows a [0, ~0ULL]
invalidation:

"

6.5.2.5 Device-TLB Invalidate Descriptor

...

Size (S): The size field indicates the number of consecutive pages targeted
by this invalidation request. If S field is zero, a single page at page
address specified by Address [63:12] is requested to be invalidated. If S
field is Set, the least significant bit in the Address field with value 0b
indicates the invalidation address range. For example, if S field is Set and
Address[12] is Clear, it indicates an 8KB invalidation address range with
base address in Address [63:13]. If S field and Address[12] is Set and bit 13
is Clear, it indicates a 16KB invalidation address range with base address in
Address [63:14], etc.

"

So if we receive an address whose [63] is 0 and the rest is all 1, it's then
a [0, ~0ULL] invalidation.

Yes.  I think invalidating the whole range is always fine.  It's still not
arbitrary, right?  E.g., we can't even invalidate (0x1000, 0x3000) with
device-iotlb because of the address mask, not to say sub-pages.



Yes.






How about just convert to use a range [start, end] for any notifier and move
the checks (e.g the assert) into the actual notifier implemented (vhost or
vfio)?

IOMMUTLBEntry itself is the abstraction layer of TLB entry.  Hardware TLB
entry is definitely not arbitrary range either (because AFAICT the hardware
should only cache PFN rather than address, so at least PAGE_SIZE aligned).
Introducing this flag will already make this trickier just to avoid
introducing another similar struct to IOMMUTLBEntry, but I really don't want
to make it a default option...  Not to mention I probably have no reason to
urge the rest iommu notifier users (tcg, vfio) to change their existing good
code to suit any of the backend who can cooperate with arbitrary address
ranges...

Ok, so it looks like we need a dedicated notifier for device IOTLB.

Or we can also make a new flag for device iotlb just like current UNMAP? Then
we replace the vhost type from UNMAP to DEVICE_IOTLB.  But IMHO using the
ARBITRARY_LENGTH flag would work in a similar way.  DEVICE_IOTLB flag could
also allow virtio/vhost to only receive one invalidation (now IIUC it'll
receive both iotlb and device-iotlb for unmapping a page when ats=on), but
then ats=on will be a must and it could break some old (misconfigured) qemu
because afaict previously virtio/vhost could even work with vIOMMU
(accidentally) even without ats=on.


That's a bug and I don't think we need to work around misconfigured qemu :)

IMHO it depends on the strictness we want on the qemu cmdline API. :)

We should at least check libvirt to make sure it's using ats=on always, then
I agree maybe we can avoid considering the rest...

Thanks,



Cc libvirt list, but I think we should fix libvirt if they don't 
provide "ats=on".


Thanks



Libvirt looks fine, according to the domain  XML documentation[1]:

 QEMU's virtio devices have some attributes related to the virtio 
transport under the driver element: The iommu attribute enables the use 
of emulated IOMMU by the device. The attribute ats controls the Address 
Translation Service support for PCIe devices. This is needed to make use 
of IOTLB support (see IOMMU device). Possible values are on or off. 
Since 3.5.0
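
For reference, those attributes sit on the interface's driver element; a
hand-written example (the network name is illustrative, not taken from this
thread) looks like:

    <interface type='network'>
      <source network='default'/>
      <model type='virtio'/>
      <driver iommu='on' ats='on'/>
    </interface>

and libvirt turns iommu='on' and ats='on' into iommu_platform=on,ats=on on
the corresponding virtio-net-pci device.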


So I think we agree that a new notifier is needed?

Thanks

[1] https://libvirt.org/formatdomain.html










Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-01 Thread Jason Wang



On 2020/6/30 下午11:39, Peter Xu wrote:

On Tue, Jun 30, 2020 at 10:41:10AM +0800, Jason Wang wrote:

  /* According to ATS spec table 2.4:
   * S = 0, bits 15:12 = xxxx range size: 4K
   * S = 1, bits 15:12 = xxx0 range size: 8K
   * S = 1, bits 15:12 = xx01 range size: 16K
   * S = 1, bits 15:12 = x011 range size: 32K
   * S = 1, bits 15:12 = 0111 range size: 64K
   * ...
   */


Right, but the comment is probably misleading here, since it's for the PCI-E
transaction between IOMMU and device not for the device IOTLB invalidation
descriptor.

For device IOTLB invalidation descriptor, spec allows a [0, ~0ULL]
invalidation:

"

6.5.2.5 Device-TLB Invalidate Descriptor

...

Size (S): The size field indicates the number of consecutive pages targeted
by this invalidation request. If S field is zero, a single page at page
address specified by Address [63:12] is requested to be invalidated. If S
field is Set, the least significant bit in the Address field with value 0b
indicates the invalidation address range. For example, if S field is Set and
Address[12] is Clear, it indicates an 8KB invalidation address range with
base address in Address [63:13]. If S field and Address[12] is Set and bit 13
is Clear, it indicates a 16KB invalidation address range with base address in
Address [63:14], etc.

"

So if we receive an address whose [63] is 0 and the rest is all 1, it's then
a [0, ~0ULL] invalidation.
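
To make the encoding concrete, a small helper along these lines (a sketch,
not code taken from QEMU) recovers the invalidation size from the S bit and
the Address field:

    #include <stdbool.h>
    #include <stdint.h>

    /* Decode a device-IOTLB invalidation size: S = 0 means one 4K page,
     * S = 1 means the lowest clear bit in Address[63:12] encodes the range
     * (bit 12 clear -> 8K, bit 13 clear with bit 12 set -> 16K, ...). */
    static uint64_t devtlb_inv_size(uint64_t addr, bool s)
    {
        int bit;

        if (!s) {
            return 4096;
        }
        for (bit = 12; bit < 63; bit++) {
            if (!(addr & (1ULL << bit))) {
                return 1ULL << (bit + 1);
            }
        }
        /* Address[62:12] all set: effectively the [0, ~0ULL] case above */
        return UINT64_MAX;
    }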

Yes.  I think invalidating the whole range is always fine.  It's still not
arbitrary, right?  E.g., we can't even invalidate (0x1000, 0x3000) with
device-iotlb because of the address mask, not to say sub-pages.



Yes.







How about just convert to use a range [start, end] for any notifier and move
the checks (e.g the assert) into the actual notifier implemented (vhost or
vfio)?

IOMMUTLBEntry itself is the abstraction layer of TLB entry.  Hardware TLB entry
is definitely not arbitrary range either (because AFAICT the hardware should
only cache PFN rather than address, so at least PAGE_SIZE aligned).
Introducing this flag will already make this trickier just to avoid introducing
another similar struct to IOMMUTLBEntry, but I really don't want to make it a
default option...  Not to mention I probably have no reason to urge the rest
iommu notifier users (tcg, vfio) to change their existing good code to suit
any of the backend who can cooperate with arbitrary address ranges...

Ok, so it looks like we need a dedicated notifier for device IOTLB.

Or we can also make a new flag for device iotlb just like current UNMAP? Then
we replace the vhost type from UNMAP to DEVICE_IOTLB.  But IMHO using the
ARBITRARY_LENGTH flag would work in a similar way.  DEVICE_IOTLB flag could
also allow virtio/vhost to only receive one invalidation (now IIUC it'll
receive both iotlb and device-iotlb for unmapping a page when ats=on), but then
ats=on will be a must and it could break some old (misconfigured) qemu because
afaict previously virtio/vhost could even work with vIOMMU (accidentally) even
without ats=on.


That's a bug and I don't think we need to work around misconfigured qemu :)

IMHO it depends on the strictness we want on the qemu cmdline API. :)

We should at least check libvirt to make sure it's using ats=on always, then I
agree maybe we can avoid considering the rest...

Thanks,



Cc libvirt list, but I think we should fix libvirt if they don't provide 
"ats=on".


Thanks




Re: [PATCH 4/4] docs: documentation for virtio packed option

2020-03-26 Thread Jason Wang



On 2020/3/27 上午9:08, Halil Pasic wrote:

On Thu, 26 Mar 2020 11:32:14 +0100
Bjoern Walk  wrote:


+
+  The attribute packed controls the usage of packed
+  virtqueues.

This reads to me like: if packed='on' is specified, then packed virtqueues
are used (or the device fails); if packed='off' is specified, then the split
virtqueue format is used.

But that statement is not always true. One might still end up using split
despite packed='on'. One example is a vhost-net setup: the in-kernel
implementation does not support packed queues, so F_RING_PACKED does not get
offered by the device.

Another reason why one could end up with F_RING_PACKED not getting
negotiated is lack of guest support.

Please note that this behavior is a little different than what we do for
cpus (and cpu models). There one either gets what he asked for, or the
VM won't get constructed.

I believe, we should at least document this cleanly. Cc-ing Michael
and Jason as hypervisor experts on these topics.

Regards,
Halil



I think we need to document that packed=on only enables the QEMU support for
the packed ring; whether it is actually used depends on the feature
negotiation between qemu, vhost backends and guest drivers.
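
In code terms, "actually used" boils down to the negotiated feature bit
rather than the property alone; a sketch (an illustration, not a quote from
QEMU's sources):

    #include "hw/virtio/virtio.h"  /* virtio_vdev_has_feature(), VIRTIO_F_RING_PACKED */

    /* True only if QEMU offered VIRTIO_F_RING_PACKED (packed=on) *and* the
     * guest driver -- and, for vhost, the backend -- accepted it. */
    static bool ring_is_packed(VirtIODevice *vdev)
    {
        return virtio_vdev_has_feature(vdev, VIRTIO_F_RING_PACKED);
    }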


Thanks






Compared to regular split queues, packed queues consist of
+  only a single descriptor ring replacing available and used ring, index
+  and descriptor buffer. This can result in better cache utilization and
+  performance. Possible values are on or off.
+  Since 6.3.0 (QEMU and KVM only)
+




Re: [libvirt] [RFC PATCH 0/9] Introduce mediate ops in vfio-pci

2019-12-17 Thread Jason Wang


On 2019/12/12 下午1:47, Yan Zhao wrote:

On Thu, Dec 12, 2019 at 11:48:25AM +0800, Jason Wang wrote:

On 2019/12/6 下午8:49, Yan Zhao wrote:

On Fri, Dec 06, 2019 at 05:40:02PM +0800, Jason Wang wrote:

On 2019/12/6 下午4:22, Yan Zhao wrote:

On Thu, Dec 05, 2019 at 09:05:54PM +0800, Jason Wang wrote:

On 2019/12/5 下午4:51, Yan Zhao wrote:

On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:

Hi:

On 2019/12/5 上午11:24, Yan Zhao wrote:

For SRIOV devices, VFs are passthroughed into guest directly without host
driver mediation. However, when VMs migrating with passthroughed VFs,
dynamic host mediation is required to  (1) get device states, (2) get
dirty pages. Since device states as well as other critical information
required for dirty page tracking for VFs are usually retrieved from PFs,
it is handy to provide an extension in PF driver to centralizingly control
VFs' migration.

Therefore, in order to realize (1) passthrough VFs at normal time, (2)
dynamically trap VFs' bars for dirty page tracking and

A silly question, what's the reason for doing this, is this a must for dirty
page tracking?


For performance consideration. VFs' bars should be passthoughed at
normal time and only enter into trap state on need.

Right, but how does this matter for the case of dirty page tracking?


Take NIC as an example, to trap its VF dirty pages, software way is
required to trap every write of ring tail that resides in BAR0.

Interesting, but it looks like we need:
- decode the instruction
- mediate all access to BAR0
All of which seems a great burden for the VF driver. I wonder whether or
not doing interrupt relay and tracking head is better in this case.


hi Jason

not familiar with the way you mentioned. could you elaborate more?


It looks to me that you want to intercept the bar that contains the
head. Then you can figure out the buffers submitted from driver and you
still need to decide a proper time to mark them as dirty.


No need to be accurate, right? Just a superset of the real dirty bitmap is
enough.



If the superset is too large compared with the dirty pages, it will lead to
a lot of side effects.






What I meant is: intercept the interrupt, then you can still figure out the
buffers which have been modified by the device and mark them as dirty.

Then there's no need to trap BAR and do decoding/emulation etc.

But it will still be tricky to be correct...


Intercepting the interrupt is a little hard if posted interrupts are enabled...



We don't care about the performance too much in this case. Can we simply 
disable it?




I think what you worried about here is the timing to mark dirty pages,
right? upon interrupt receiving, you regard DMAs are finished and safe
to make them dirty.
But with BAR trap way, we at least can keep those dirtied pages as dirty
until device stop. Of course we have other methods to optimize it.



I'm not sure this will not confuse the migration convergence time estimation.





There's
still no IOMMU Dirty bit available.

  (3) centralizing
VF critical states retrieving and VF controls into one driver, we propose
to introduce mediate ops on top of current vfio-pci device driver.


[Diagram: vfio-pci registers mediate ops with the VF mediate driver (a
standalone driver or a built-in extension of the PF driver); vfio-pci calls
open(pdev) through those ops; the VF sits under vfio-pci and the PF under the
PF driver.]


VF mediate driver could be a standalone driver that does not bind to
any devices (as in demo code in patches 5-6) or it could be a built-in
extension of PF driver (as in patches 7-9) .

Rather than directly bind to VF, VF mediate driver register a mediate
ops into vfio-pci in driver init. vfio-pci maintains a list of such
mediate ops.
(Note that: VF mediate driver can register mediate ops into vfio-pci
before vfio-pci binding to any devices. And VF mediate driver can
support mediating multiple devices.)

When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
list and calls each vfio_pci_mediate_ops->open() with pdev of the opening
device as a parameter.
VF mediate driver should return success or failure depending on it
supports the pdev or not.
E.g. VF mediate driver would compare its supported VF devfn with the
devfn of the passed-in pdev.
Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
stop querying other mediate ops and bind

Re: [libvirt] [RFC PATCH 0/9] Introduce mediate ops in vfio-pci

2019-12-11 Thread Jason Wang


On 2019/12/7 上午1:42, Alex Williamson wrote:

On Fri, 6 Dec 2019 17:40:02 +0800
Jason Wang  wrote:


On 2019/12/6 下午4:22, Yan Zhao wrote:

On Thu, Dec 05, 2019 at 09:05:54PM +0800, Jason Wang wrote:

On 2019/12/5 下午4:51, Yan Zhao wrote:

On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:

Hi:

On 2019/12/5 上午11:24, Yan Zhao wrote:

For SRIOV devices, VFs are passthroughed into guest directly without host
driver mediation. However, when VMs migrating with passthroughed VFs,
dynamic host mediation is required to  (1) get device states, (2) get
dirty pages. Since device states as well as other critical information
required for dirty page tracking for VFs are usually retrieved from PFs,
it is handy to provide an extension in PF driver to centralizingly control
VFs' migration.

Therefore, in order to realize (1) passthrough VFs at normal time, (2)
dynamically trap VFs' bars for dirty page tracking and

A silly question, what's the reason for doing this, is this a must for dirty
page tracking?
  

For performance consideration. VFs' bars should be passthoughed at
normal time and only enter into trap state on need.

Right, but how does this matter for the case of dirty page tracking?
  

Take NIC as an example, to trap its VF dirty pages, software way is
required to trap every write of ring tail that resides in BAR0.


Interesting, but it looks like we need:
- decode the instruction
- mediate all access to BAR0
All of which seems a great burden for the VF driver. I wonder whether or
not doing interrupt relay and tracking head is better in this case.

This sounds like a NIC specific solution, I believe the goal here is to
allow any device type to implement a partial mediation solution, in
this case to sufficiently track the device while in the migration
saving state.



I suspect there's a solution that can work for any device type. E.g. for
virtio, the avail index (head) doesn't belong to any BAR and the device may
decide to disable the doorbell from the guest. The same applies to interrupt
relay, since the driver may choose to disable interrupts from the device. In
this case, the only way to track dirty pages correctly is to switch to a
software datapath.






   There's
still no IOMMU Dirty bit available.

 (3) centralizing
VF critical states retrieving and VF controls into one driver, we propose
to introduce mediate ops on top of current vfio-pci device driver.


[Diagram: vfio-pci registers mediate ops with the VF mediate driver (a
standalone driver or a built-in extension of the PF driver); vfio-pci calls
open(pdev) through those ops; the VF sits under vfio-pci and the PF under the
PF driver.]


VF mediate driver could be a standalone driver that does not bind to
any devices (as in demo code in patches 5-6) or it could be a built-in
extension of PF driver (as in patches 7-9) .

Rather than directly bind to VF, VF mediate driver register a mediate
ops into vfio-pci in driver init. vfio-pci maintains a list of such
mediate ops.
(Note that: VF mediate driver can register mediate ops into vfio-pci
before vfio-pci binding to any devices. And VF mediate driver can
support mediating multiple devices.)

When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
list and calls each vfio_pci_mediate_ops->open() with pdev of the opening
device as a parameter.
VF mediate driver should return success or failure depending on it
supports the pdev or not.
E.g. VF mediate driver would compare its supported VF devfn with the
devfn of the passed-in pdev.
Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
stop querying other mediate ops and bind the opening device with this
mediate ops using the returned mediate handle.

Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
VF will be intercepted into VF mediate driver as
vfio_pci_mediate_ops->get_region_info(),
vfio_pci_mediate_ops->rw,
vfio_pci_mediate_ops->mmap, and get customized.
For vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap, they will
further return 'pt' to indicate whether vfio-pci should further
passthrough data to hw.

when vfio-pci closes the VF, it calls its vfio_pci_mediate_ops->release()
with a mediate handle as parameter.

The mediate handle returned from vfio_pci_mediate_ops->open() lets VF
mediate driver be able to differentiate two opening VFs of the same device
id and vendor id.

When VF mediate driver exit

Re: [libvirt] [RFC PATCH 0/9] Introduce mediate ops in vfio-pci

2019-12-11 Thread Jason Wang


On 2019/12/6 下午8:49, Yan Zhao wrote:

On Fri, Dec 06, 2019 at 05:40:02PM +0800, Jason Wang wrote:

On 2019/12/6 下午4:22, Yan Zhao wrote:

On Thu, Dec 05, 2019 at 09:05:54PM +0800, Jason Wang wrote:

On 2019/12/5 下午4:51, Yan Zhao wrote:

On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:

Hi:

On 2019/12/5 上午11:24, Yan Zhao wrote:

For SRIOV devices, VFs are passthroughed into guest directly without host
driver mediation. However, when VMs migrating with passthroughed VFs,
dynamic host mediation is required to  (1) get device states, (2) get
dirty pages. Since device states as well as other critical information
required for dirty page tracking for VFs are usually retrieved from PFs,
it is handy to provide an extension in PF driver to centralizingly control
VFs' migration.

Therefore, in order to realize (1) passthrough VFs at normal time, (2)
dynamically trap VFs' bars for dirty page tracking and

A silly question, what's the reason for doing this, is this a must for dirty
page tracking?


For performance consideration. VFs' bars should be passthoughed at
normal time and only enter into trap state on need.

Right, but how does this matter for the case of dirty page tracking?


Take NIC as an example, to trap its VF dirty pages, software way is
required to trap every write of ring tail that resides in BAR0.


Interesting, but it looks like we need:
- decode the instruction
- mediate all access to BAR0
All of which seems a great burden for the VF driver. I wonder whether or
not doing interrupt relay and tracking head is better in this case.


hi Jason

not familiar with the way you mentioned. could you elaborate more?



It looks to me that you want to intercept the bar that contains the 
head. Then you can figure out the buffers submitted from driver and you 
still need to decide a proper time to mark them as dirty.


What I meant is: intercept the interrupt, then you can still figure out the
buffers which have been modified by the device and mark them as dirty.


Then there's no need to trap BAR and do decoding/emulation etc.

But it will still be tricky to be correct...



   There's
still no IOMMU Dirty bit available.

 (3) centralizing
VF critical states retrieving and VF controls into one driver, we propose
to introduce mediate ops on top of current vfio-pci device driver.


[Diagram: vfio-pci registers mediate ops with the VF mediate driver (a
standalone driver or a built-in extension of the PF driver); vfio-pci calls
open(pdev) through those ops; the VF sits under vfio-pci and the PF under the
PF driver.]


VF mediate driver could be a standalone driver that does not bind to
any devices (as in demo code in patches 5-6) or it could be a built-in
extension of PF driver (as in patches 7-9) .

Rather than directly bind to VF, VF mediate driver register a mediate
ops into vfio-pci in driver init. vfio-pci maintains a list of such
mediate ops.
(Note that: VF mediate driver can register mediate ops into vfio-pci
before vfio-pci binding to any devices. And VF mediate driver can
support mediating multiple devices.)

When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
list and calls each vfio_pci_mediate_ops->open() with pdev of the opening
device as a parameter.
VF mediate driver should return success or failure depending on it
supports the pdev or not.
E.g. VF mediate driver would compare its supported VF devfn with the
devfn of the passed-in pdev.
Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
stop querying other mediate ops and bind the opening device with this
mediate ops using the returned mediate handle.

Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
VF will be intercepted into VF mediate driver as
vfio_pci_mediate_ops->get_region_info(),
vfio_pci_mediate_ops->rw,
vfio_pci_mediate_ops->mmap, and get customized.
For vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap, they will
further return 'pt' to indicate whether vfio-pci should further
passthrough data to hw.

when vfio-pci closes the VF, it calls its vfio_pci_mediate_ops->release()
with a mediate handle as parameter.

The mediate handle returned from vfio_pci_mediate_ops->open() lets VF
mediate driver be able to differentiate two opening VFs of the same device
id and vendor id.

When VF mediate driver exits, it unregisters its mediate ops from

Re: [libvirt] [RFC PATCH 0/9] Introduce mediate ops in vfio-pci

2019-12-06 Thread Jason Wang


On 2019/12/6 下午4:22, Yan Zhao wrote:

On Thu, Dec 05, 2019 at 09:05:54PM +0800, Jason Wang wrote:

On 2019/12/5 下午4:51, Yan Zhao wrote:

On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:

Hi:

On 2019/12/5 上午11:24, Yan Zhao wrote:

For SRIOV devices, VFs are passthroughed into guest directly without host
driver mediation. However, when VMs migrating with passthroughed VFs,
dynamic host mediation is required to  (1) get device states, (2) get
dirty pages. Since device states as well as other critical information
required for dirty page tracking for VFs are usually retrieved from PFs,
it is handy to provide an extension in PF driver to centralizingly control
VFs' migration.

Therefore, in order to realize (1) passthrough VFs at normal time, (2)
dynamically trap VFs' bars for dirty page tracking and

A silly question, what's the reason for doing this, is this a must for dirty
page tracking?


For performance consideration. VFs' bars should be passthoughed at
normal time and only enter into trap state on need.


Right, but how does this matter for the case of dirty page tracking?


Take NIC as an example, to trap its VF dirty pages, software way is
required to trap every write of ring tail that resides in BAR0.



Interesting, but it looks like we need:
- decode the instruction
- mediate all access to BAR0
All of which seems a great burden for the VF driver. I wonder whether or 
not doing interrupt relay and tracking head is better in this case.




  There's
still no IOMMU Dirty bit available.

(3) centralizing
VF critical states retrieving and VF controls into one driver, we propose
to introduce mediate ops on top of current vfio-pci device driver.


[Diagram: vfio-pci registers mediate ops with the VF mediate driver (a
standalone driver or a built-in extension of the PF driver); vfio-pci calls
open(pdev) through those ops; the VF sits under vfio-pci and the PF under the
PF driver.]


VF mediate driver could be a standalone driver that does not bind to
any devices (as in demo code in patches 5-6) or it could be a built-in
extension of PF driver (as in patches 7-9) .

Rather than directly bind to VF, VF mediate driver register a mediate
ops into vfio-pci in driver init. vfio-pci maintains a list of such
mediate ops.
(Note that: VF mediate driver can register mediate ops into vfio-pci
before vfio-pci binding to any devices. And VF mediate driver can
support mediating multiple devices.)

When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
list and calls each vfio_pci_mediate_ops->open() with pdev of the opening
device as a parameter.
VF mediate driver should return success or failure depending on it
supports the pdev or not.
E.g. VF mediate driver would compare its supported VF devfn with the
devfn of the passed-in pdev.
Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
stop querying other mediate ops and bind the opening device with this
mediate ops using the returned mediate handle.

Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
VF will be intercepted into VF mediate driver as
vfio_pci_mediate_ops->get_region_info(),
vfio_pci_mediate_ops->rw,
vfio_pci_mediate_ops->mmap, and get customized.
For vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap, they will
further return 'pt' to indicate whether vfio-pci should further
passthrough data to hw.

when vfio-pci closes the VF, it calls its vfio_pci_mediate_ops->release()
with a mediate handle as parameter.

The mediate handle returned from vfio_pci_mediate_ops->open() lets VF
mediate driver be able to differentiate two opening VFs of the same device
id and vendor id.

When VF mediate driver exits, it unregisters its mediate ops from
vfio-pci.
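
Putting the callbacks named above together, the proposed ops table would look
roughly like the sketch below; the field types are inferred from this
description, not copied from the actual patches:

    #include <linux/pci.h>
    #include <linux/mm.h>
    #include <linux/vfio.h>

    struct vfio_pci_mediate_ops {
        /* return a mediate handle (>= 0) if this driver supports pdev */
        int (*open)(struct pci_dev *pdev);
        void (*release)(int handle);
        int (*get_region_info)(int handle, struct vfio_region_info *info);
        /* '*pt' tells vfio-pci whether to also pass the access through to hw */
        ssize_t (*rw)(int handle, char __user *buf, size_t count,
                      loff_t *ppos, bool iswrite, bool *pt);
        int (*mmap)(int handle, struct vm_area_struct *vma, bool *pt);
    };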


In this patchset, we enable vfio-pci to provide 3 things:
(1) calling mediate ops to allow vendor driver customizing default
region info/rw/mmap of a region.
(2) provide a migration region to support migration

What's the benefit of introducing a region? It looks to me we don't expect
the region to be accessed directly from guest. Could we simply extend device
fd ioctl for doing such things?


You may take a look on mdev live migration discussions in
https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html

or previous discussion at
https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html,
which has

Re: [libvirt] [RFC PATCH 0/9] Introduce mediate ops in vfio-pci

2019-12-05 Thread Jason Wang


On 2019/12/5 下午4:51, Yan Zhao wrote:

On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:

Hi:

On 2019/12/5 上午11:24, Yan Zhao wrote:

For SRIOV devices, VFs are passthroughed into guest directly without host
driver mediation. However, when VMs migrating with passthroughed VFs,
dynamic host mediation is required to  (1) get device states, (2) get
dirty pages. Since device states as well as other critical information
required for dirty page tracking for VFs are usually retrieved from PFs,
it is handy to provide an extension in PF driver to centralizingly control
VFs' migration.

Therefore, in order to realize (1) passthrough VFs at normal time, (2)
dynamically trap VFs' bars for dirty page tracking and


A silly question, what's the reason for doing this, is this a must for dirty
page tracking?


For performance consideration. VFs' bars should be passthoughed at
normal time and only enter into trap state on need.



Right, but how does this matter for the case of dirty page tracking?





   (3) centralizing
VF critical states retrieving and VF controls into one driver, we propose
to introduce mediate ops on top of current vfio-pci device driver.


[Diagram: vfio-pci registers mediate ops with the VF mediate driver (a
standalone driver or a built-in extension of the PF driver); vfio-pci calls
open(pdev) through those ops; the VF sits under vfio-pci and the PF under the
PF driver.]


VF mediate driver could be a standalone driver that does not bind to
any devices (as in demo code in patches 5-6) or it could be a built-in
extension of PF driver (as in patches 7-9) .

Rather than directly bind to VF, VF mediate driver register a mediate
ops into vfio-pci in driver init. vfio-pci maintains a list of such
mediate ops.
(Note that: VF mediate driver can register mediate ops into vfio-pci
before vfio-pci binding to any devices. And VF mediate driver can
support mediating multiple devices.)

When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
list and calls each vfio_pci_mediate_ops->open() with pdev of the opening
device as a parameter.
VF mediate driver should return success or failure depending on it
supports the pdev or not.
E.g. VF mediate driver would compare its supported VF devfn with the
devfn of the passed-in pdev.
Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
stop querying other mediate ops and bind the opening device with this
mediate ops using the returned mediate handle.

Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
VF will be intercepted into VF mediate driver as
vfio_pci_mediate_ops->get_region_info(),
vfio_pci_mediate_ops->rw,
vfio_pci_mediate_ops->mmap, and get customized.
For vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap, they will
further return 'pt' to indicate whether vfio-pci should further
passthrough data to hw.

when vfio-pci closes the VF, it calls its vfio_pci_mediate_ops->release()
with a mediate handle as parameter.

The mediate handle returned from vfio_pci_mediate_ops->open() lets VF
mediate driver be able to differentiate two opening VFs of the same device
id and vendor id.

When VF mediate driver exits, it unregisters its mediate ops from
vfio-pci.


In this patchset, we enable vfio-pci to provide 3 things:
(1) calling mediate ops to allow vendor driver customizing default
region info/rw/mmap of a region.
(2) provide a migration region to support migration


What's the benefit of introducing a region? It looks to me we don't expect
the region to be accessed directly from guest. Could we simply extend device
fd ioctl for doing such things?


You may take a look on mdev live migration discussions in
https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html

or previous discussion at
https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html,
which has kernel side implementation 
https://patchwork.freedesktop.org/series/56876/

generally speaking, the qemu part of live migration is consistent for
vfio-pci + mediate ops way or mdev way.



So in mdev, do you still have a mediate driver? Or do you expect the parent
to implement the region?




The region is only a channel for
QEMU and kernel to communicate information without introducing IOCTLs.



Well, at least you introduce a new type of region in the uapi. So this does
not answer why a region is better than an ioctl. If the region will only be
Re: [libvirt] [RFC PATCH 0/9] Introduce mediate ops in vfio-pci

2019-12-05 Thread Jason Wang

Hi:

On 2019/12/5 上午11:24, Yan Zhao wrote:

For SRIOV devices, VFs are passthroughed into guest directly without host
driver mediation. However, when VMs migrating with passthroughed VFs,
dynamic host mediation is required to  (1) get device states, (2) get
dirty pages. Since device states as well as other critical information
required for dirty page tracking for VFs are usually retrieved from PFs,
it is handy to provide an extension in PF driver to centralizingly control
VFs' migration.

Therefore, in order to realize (1) passthrough VFs at normal time, (2)
dynamically trap VFs' bars for dirty page tracking and



A silly question, what's the reason for doing this, is this a must for 
dirty page tracking?




  (3) centralizing
VF critical states retrieving and VF controls into one driver, we propose
to introduce mediate ops on top of current vfio-pci device driver.


[Diagram: vfio-pci registers mediate ops with the VF mediate driver (a
standalone driver or a built-in extension of the PF driver); vfio-pci calls
open(pdev) through those ops; the VF sits under vfio-pci and the PF under the
PF driver.]


VF mediate driver could be a standalone driver that does not bind to
any devices (as in demo code in patches 5-6) or it could be a built-in
extension of PF driver (as in patches 7-9) .

Rather than directly bind to VF, VF mediate driver register a mediate
ops into vfio-pci in driver init. vfio-pci maintains a list of such
mediate ops.
(Note that: VF mediate driver can register mediate ops into vfio-pci
before vfio-pci binding to any devices. And VF mediate driver can
support mediating multiple devices.)

When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
list and calls each vfio_pci_mediate_ops->open() with pdev of the opening
device as a parameter.
VF mediate driver should return success or failure depending on it
supports the pdev or not.
E.g. VF mediate driver would compare its supported VF devfn with the
devfn of the passed-in pdev.
Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
stop querying other mediate ops and bind the opening device with this
mediate ops using the returned mediate handle.

Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
VF will be intercepted into VF mediate driver as
vfio_pci_mediate_ops->get_region_info(),
vfio_pci_mediate_ops->rw,
vfio_pci_mediate_ops->mmap, and get customized.
For vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap, they will
further return 'pt' to indicate whether vfio-pci should further
passthrough data to hw.

when vfio-pci closes the VF, it calls its vfio_pci_mediate_ops->release()
with a mediate handle as parameter.

The mediate handle returned from vfio_pci_mediate_ops->open() lets VF
mediate driver be able to differentiate two opening VFs of the same device
id and vendor id.

When VF mediate driver exits, it unregisters its mediate ops from
vfio-pci.


In this patchset, we enable vfio-pci to provide 3 things:
(1) calling mediate ops to allow vendor driver customizing default
region info/rw/mmap of a region.
(2) provide a migration region to support migration



What's the benefit of introducing a region? It looks to me we don't 
expect the region to be accessed directly from guest. Could we simply 
extend device fd ioctl for doing such things?




(3) provide a dynamic trap bar info region to allow vendor driver
control trap/untrap of device pci bars

This vfio-pci + mediate ops way differs from mdev way in that
(1) medv way needs to create a 1:1 mdev device on top of one VF, device
specific mdev parent driver is bound to VF directly.
(2) vfio-pci + mediate ops way does not create mdev devices and VF
mediate driver does not bind to VFs. Instead, vfio-pci binds to VFs.

The reason why we don't choose the way of writing mdev parent driver is
that
(1) VFs are almost all the time directly passthroughed. Directly binding
to vfio-pci can make most of the code shared/reused.



Can we split out the common parts from vfio-pci?



  If we write a
vendor specific mdev parent driver, most of the code (like passthrough
style of rw/mmap) still needs to be copied from vfio-pci driver, which is
actually a duplicated and tedious work.



The mediate ops looks quite similar to what vfio-mdev did. And it looks 
to me we need to consider live migration for mdev as well. In that case, 
do we still expect 

Re: [libvirt] [Qemu-devel] Virtio-net control queue handling

2019-02-27 Thread Jason Wang


On 2019/2/28 上午1:39, Nikhil Agarwal wrote:

Hi Stefan,

Thanks for directing the question to the correct people; yes, your
understanding is correct.

I want the virtio-net control virtqueue to be handled by the vhost-user-net
device backend for the vDPA use case, where the control message would do some
configuration on the actual hw device. Or should these changes be done in
libvirt, where QMP events are being handled as of now?

Regards
Nikhil



Cc Michael. Several methods for dealing with the control virtqueue:

1) using a backend specific method. For example, multiqueue control for 
TAP was done in this way, and we plan to support RSS in this way as well 
(through a steering eBPF program). But this may not work for the 
privileged operations.
2) inventing a new vhost(user) protocol: convert the vq command to a 
vhost(user) command and pass it to the vhost(user) backend.

3) extend the vhost protocol to deal with the control vq like you propose here.
4) notify libvirt using a QMP event; we already support this if the rx 
filter is changed for virtio-net, and let management deal with the 
changes.
5) implementing or linking the vhost backend in qemu, then there's no need to 
invent new IPCs


For the case of vDPA, I'm not sure the use case for vDPA through 
vhost-user. Can you describe more about that? To me the best way is 
probably done through mdev and VFIO not userspace path.


Thanks





-Original Message-
From: Stefan Hajnoczi 
Sent: Wednesday, February 27, 2019 8:54 PM
To: Nikhil Agarwal 
Cc: qemu-de...@nongnu.org; jasow...@redhat.com; libvir-list@redhat.com
Subject: Re: [Qemu-devel] Virtio-net control queue handling

On Tue, Feb 26, 2019 at 02:13:43PM +, Nikhil Agarwal wrote:

Is there any option in QEMU to offload virtio device control queue handling
to the backend instead of deciphering and passing it to libvirt via QMP? If
not, is there any interface in libvirt which passes on these messages to the
application, or should I directly use the QMP interface from QEMU?

I'm not sure I understand your question.  It could be because of
terminology:

"virtio device control queue" == virtio-net control virtqueue?

"backend" == vhost-user-net device backend?

"message" == QMP event?

I have CCed Jason Wang, QEMU network subsystem maintainer, and the libvirt 
mailing list.

Stefan




Re: [libvirt] [PATCH 09/18] Add virtio-related options to interfaces

2017-06-01 Thread Jason Wang



On 2017年05月30日 21:09, Daniel P. Berrange wrote:

On Tue, May 30, 2017 at 02:50:32PM +0200, Ján Tomko wrote:


   
   
   


Feels like we could just use  'iommu=on' as the name.

Also, I'm unclear whether we actually need the 'ats' attribute.

 From the description it sounds like if we have device_iotlb=on
for the IOMMU device, then we should have ats=on too. Is there
a reason to use ats=off when device_iotlb=on, or vice versa?
If not, then we should just do the right thing and set ats=on
whenever we see device_iotlb=on.


Maybe for virtio-net only. Vhost can get the best performance only when both 
are set; disabling either will be slower.  For the cases of non-vhost 
and other virtio devices, enabling ATS may be a little bit slower since 
we don't implement device IOTLB in the device now (though we may want to 
implement it in the future). So for now, it's better to leave the choice 
to the user.


Another problem is that ATS depends on PCIE; there are some requirements, e.g. 
qemu does not support an integrated PCIE device, which means we may need 
a PCIE switch (pcie-root-port) to make PCIE and ATS work.


Thanks

  


Regards,
Daniel

