Re: [PATCH] hostdev: Introduce vDPA device to hostdev subsystem as a new subtype
On Mon, Apr 24, 2023 at 11:55 AM libai wrote:
> The following is the xml of the vdpa device:
>
> <hostdev mode='subsystem' type='vdpa'>
>   <source dev='/dev/vhost-vdpa-0'/>
> </hostdev>
>
> And the command line passed to QEMU is as follows:
> -device {"driver":"vhost-vdpa-device-pci","vhostdev":"/dev/vhost-vdpa-0"}
>
> This solution was selected according to the previous discussion
> about supporting the vDPA device. For details, see:
> https://listman.redhat.com/archives/libvir-list/2023-March/239018.html

It would also be helpful if you could point to some links explaining the
generic vDPA device.

Adding Laine and Jonathon for comments.

Thanks

> Signed-off-by: libai
> ---
>  src/conf/domain_audit.c         |  4 ++++
>  src/conf/domain_conf.c          | 47 ++++++++++++++++++++++++++++++++++++++++++++++++
>  src/conf/domain_conf.h          |  6 ++++++
>  src/conf/domain_validate.c      |  1 +
>  src/conf/virconftypes.h         |  2 ++
>  src/qemu/qemu_command.c         | 19 +++++++++++++++++++
>  src/qemu/qemu_command.h         |  3 +++
>  src/qemu/qemu_domain.c          |  6 ++++++
>  src/qemu/qemu_hotplug.c         |  1 +
>  src/qemu/qemu_migration.c       |  2 ++
>  src/qemu/qemu_validate.c        |  2 ++
>  src/security/security_dac.c     |  2 ++
>  src/security/security_selinux.c |  2 ++
>  13 files changed, 97 insertions(+)
>
> diff --git a/src/conf/domain_audit.c b/src/conf/domain_audit.c
> index ae875188bd..6906ce7ade 100644
> --- a/src/conf/domain_audit.c
> +++ b/src/conf/domain_audit.c
> @@ -344,6 +344,7 @@ virDomainAuditHostdev(virDomainObj *vm, virDomainHostdevDef *hostdev,
>      virDomainHostdevSubsysSCSI *scsisrc = &hostdev->source.subsys.u.scsi;
>      virDomainHostdevSubsysSCSIVHost *hostsrc = &hostdev->source.subsys.u.scsi_host;
>      virDomainHostdevSubsysMediatedDev *mdevsrc = &hostdev->source.subsys.u.mdev;
> +    virDomainHostdevSubsysVDPA *vdpasrc = &hostdev->source.subsys.u.vdpa;
>
>      virUUIDFormat(vm->def->uuid, uuidstr);
>      if (!(vmname = virAuditEncode("vm", vm->def->name))) {
> @@ -383,6 +384,9 @@ virDomainAuditHostdev(virDomainObj *vm, virDomainHostdevDef *hostdev,
>      case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV:
>          address = g_strdup(mdevsrc->uuidstr);
>          break;
> +    case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_VDPA:
> +        address = g_strdup(vdpasrc->devpath);
> +        break;
>      case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST:
>      default:
>          VIR_WARN("Unexpected hostdev type while encoding audit message: %d",
> diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
> index b03a3ff011..e8f6d1457b 100644
> --- a/src/conf/domain_conf.c
> +++ b/src/conf/domain_conf.c
> @@ -1047,6 +1047,7 @@ VIR_ENUM_IMPL(virDomainHostdevSubsys,
>                "scsi",
>                "scsi_host",
>                "mdev",
> +              "vdpa",
>  );
>
>  VIR_ENUM_IMPL(virDomainHostdevSubsysPCIBackend,
> @@ -2641,6 +2642,9 @@ virDomainHostdevDefClear(virDomainHostdevDef *def)
>      case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI:
>          g_clear_pointer(&def->source.subsys.u.pci.origstates, virBitmapFree);
>          break;
> +    case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_VDPA:
> +        VIR_FREE(def->source.subsys.u.vdpa.devpath);
> +        break;
>      case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_USB:
>      case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV:
>      case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST:
> @@ -6160,6 +6164,22 @@ virDomainHostdevSubsysMediatedDevDefParseXML(virDomainHostdevDef *def,
>      return 0;
>  }
>
> +static int
> +virDomainHostdevSubsysVDPADefParseXML(xmlNodePtr sourcenode,
> +                                      virDomainHostdevDef *def)
> +{
> +    g_autofree char *devpath = NULL;
> +    virDomainHostdevSubsysVDPA *vdpa = &def->source.subsys.u.vdpa;
> +
> +    if (!(devpath = virXMLPropString(sourcenode, "dev"))) {
> +        virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s",
> +                       _("Missing 'dev' attribute for <source> element"));
> +        return -1;
> +    }
> +    vdpa->devpath = g_steal_pointer(&devpath);
> +    return 0;
> +}
> +
>  static int
>  virDomainHostdevDefParseXMLSubsys(xmlNodePtr node,
>                                    xmlXPathContextPtr ctxt,
> @@ -6317,6 +6337,11 @@ virDomainHostdevDefParseXMLSubsys(xmlNodePtr node,
>          if (virDomainHostdevSubsysMediatedDevDefParseXML(def, ctxt) < 0)
>              return -1;
>          break;
> +    case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_VDPA:
> +        if (virDomainHostdevSubsysVDPADefParseXML(sourcenode, def) < 0) {
> +            return -1;
> +        }
> +        break;
>
>      default:
>          virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
> @@ -12979,6 +13004,7 @@ virDomainHostdevDefParseXML(virDomainXMLOption *xmlopt,
>      case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI:
>      case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_SCSI_HOST:
>      case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV:
> +    case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_VDPA:
>      case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST:
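The patch above parses `<source dev='...'/>` out of the hostdev XML and QEMU then receives the path as the `vhostdev` property of `vhost-vdpa-device-pci`. A minimal standalone C sketch of that mapping (the helper name and validation are hypothetical, not libvirt's actual qemu_command.c code):

```c
#include <stdio.h>

/* Hypothetical helper: format the QEMU -device JSON property string for a
 * generic vhost-vdpa device from the devpath parsed out of <source dev=...>.
 * Returns 0 on success, -1 on a missing/relative path or truncation,
 * loosely mirroring the patch's "Missing 'dev' attribute" validation. */
static int qemu_build_vdpa_device_str(const char *devpath,
                                      char *buf, size_t buflen)
{
    if (devpath == NULL || devpath[0] != '/')
        return -1;

    int n = snprintf(buf, buflen,
                     "{\"driver\":\"vhost-vdpa-device-pci\",\"vhostdev\":\"%s\"}",
                     devpath);
    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```

For `/dev/vhost-vdpa-0` this produces exactly the `-device` argument quoted in the mail above.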
Re: [RFC] Seeking advice on support for generic vDPA device
On Sat, Mar 25, 2023 at 9:14 PM jiangdongxu wrote:
> vDPA devices allow high-performance devices in a virtual machine by
> providing a wire-speed data path. These devices require a vendor-specific
> host driver, but the data path follows the virtio specification.
>
> Support for the generic vDPA device was recently added to QEMU in the
> v8.0.0-rc release:
> https://gitlab.com/qemu-project/qemu/-/commit/b430a2bd2303b9940cc80ec746887f463a25fc5c
>
> This allows libvirt to support these devices regardless of device type.
> Currently, libvirt supports only vDPA network devices; we need a new
> solution to support generic vDPA devices.
>
> At present, we have considered several options.
>
> Solution A:
> Add type vdpa under the hostdev device. The device xml looks like:
>
> <hostdev mode='subsystem' type='vdpa'>
>   <source dev='/dev/vhost-vdpa-0'/>
> </hostdev>
>
> This solution is recommended. This configuration requires only that the
> device supports the vDPA framework, is easy to extend in the future, and
> follows the same design idea as the QEMU solution.

I prefer this one, it's easier to implement and works for any type of
virtio devices and vDPA parents.

> Solution B:
> Use the current libvirt model and add vDPA devices under the different
> device types. The device xml looks like:
>
> network:
> <interface type='vdpa'>
>   <source dev='/dev/vhost-vdpa-0'/>
> </interface>
>
> storage:
>
> The network interface already has the vdpa type, so a new type would be
> required to distinguish the "generic-vdpa" type. However, this method
> differentiates device types, which is not convenient for extending to
> other types of vDPA devices. Not recommended.

This should be the way when we have dedicated block layer support for
vhost-vDPA, which seems not necessary for general vDPA devices.

> Solution C:
> Referring to PCI passthrough, add the driver type (the appropriate
> vendor-specific driver) to the driver of the PCI device.
> The unbind/bind of the vDPA device driver and the creation of the vDPA
> device are implemented by libvirt.
> The device xml looks like:
>
> <hostdev mode='subsystem' type='pci'>
>   <driver name='...'/>
>   <source>
>     <address ... function='0x01'/>
>   </source>
> </hostdev>
>
> This solution is tied to the vendor-specific driver and supports only
> vdpa devices created from PCI devices; vdpa devices created in vduse
> mode are not supported. This solution is not universal and is not
> recommended.

I agree, this requires PCI-specific knowledge, which defeats the goal of
vDPA to be device agnostic.

Thanks

> What's your proposal?
Re: [PATCH v2 03/15] net/vhost-vdpa: Fix device compatibility check
On 2021/10/8 at 9:34 PM, Kevin Wolf wrote:

vhost-vdpa works only with specific devices. At startup, it second-guesses
what the command line option handling will do and errors out if it thinks
a non-virtio device will attach to it.

This second-guessing is not only ugly, it can lead to wrong error messages
('-device floppy,netdev=foo' should complain about an unknown property,
not about the wrong kind of network device being attached) and completely
ignores hotplugging.

Drop the old checks and implement .check_peer_type() instead to fix this.
As a nice side effect, it also removes one more dependency on the legacy
QemuOpts infrastructure and even reduces the code size.

Signed-off-by: Kevin Wolf
Acked-by: Jason Wang
---
 net/vhost-vdpa.c | 37 ++++++++++++++-----------------------
 1 file changed, 14 insertions(+), 23 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 912686457c..6dc68d8677 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -147,12 +147,26 @@ static bool vhost_vdpa_has_ufo(NetClientState *nc)
 }

+static bool vhost_vdpa_check_peer_type(NetClientState *nc, ObjectClass *oc,
+                                       Error **errp)
+{
+    const char *driver = object_class_get_name(oc);
+
+    if (!g_str_has_prefix(driver, "virtio-net-")) {
+        error_setg(errp, "vhost-vdpa requires frontend driver virtio-net-*");
+        return false;
+    }
+
+    return true;
+}
+
 static NetClientInfo net_vhost_vdpa_info = {
     .type = NET_CLIENT_DRIVER_VHOST_VDPA,
     .size = sizeof(VhostVDPAState),
     .cleanup = vhost_vdpa_cleanup,
     .has_vnet_hdr = vhost_vdpa_has_vnet_hdr,
     .has_ufo = vhost_vdpa_has_ufo,
+    .check_peer_type = vhost_vdpa_check_peer_type,
 };

 static int net_vhost_vdpa_init(NetClientState *peer, const char *device,
@@ -179,24 +193,6 @@ static int net_vhost_vdpa_init(NetClientState *peer, const char *device,
     return ret;
 }

-static int net_vhost_check_net(void *opaque, QemuOpts *opts, Error **errp)
-{
-    const char *name = opaque;
-    const char *driver, *netdev;
-
-    driver = qemu_opt_get(opts, "driver");
-    netdev = qemu_opt_get(opts, "netdev");
-    if (!driver || !netdev) {
-        return 0;
-    }
-    if (strcmp(netdev, name) == 0 &&
-        !g_str_has_prefix(driver, "virtio-net-")) {
-        error_setg(errp, "vhost-vdpa requires frontend driver virtio-net-*");
-        return -1;
-    }
-    return 0;
-}
-
 int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
                         NetClientState *peer, Error **errp)
 {
@@ -204,10 +200,5 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
     assert(netdev->type == NET_CLIENT_DRIVER_VHOST_VDPA);
     opts = &netdev->u.vhost_vdpa;

-    /* verify net frontend */
-    if (qemu_opts_foreach(qemu_find_opts("device"), net_vhost_check_net,
-                          (char *)name, errp)) {
-        return -1;
-    }
     return net_vhost_vdpa_init(peer, TYPE_VHOST_VDPA, name, opts->vhostdev);
 }
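The new `.check_peer_type()` callback boils down to a name-prefix test on the frontend driver. A standalone plain-C sketch of just that test (hypothetical function name, no GLib, not QEMU code):

```c
#include <stdbool.h>
#include <string.h>

/* Sketch of the compatibility check a vhost-vdpa backend performs:
 * only frontends whose driver name starts with "virtio-net-"
 * (e.g. virtio-net-pci, virtio-net-ccw) are accepted as peers. */
static bool vdpa_peer_type_ok(const char *driver)
{
    const char *prefix = "virtio-net-";
    return driver != NULL && strncmp(driver, prefix, strlen(prefix)) == 0;
}
```

Because the check runs when the peer actually attaches, it also covers hotplug, which the old QemuOpts scan could not.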
Re: [PATCH v2 02/15] net/vhost-user: Fix device compatibility check
On 2021/10/8 at 9:34 PM, Kevin Wolf wrote:

vhost-user works only with specific devices. At startup, it second-guesses
what the command line option handling will do and errors out if it thinks
a non-virtio device will attach to it.

This second-guessing is not only ugly, it can lead to wrong error messages
('-device floppy,netdev=foo' should complain about an unknown property,
not about the wrong kind of network device being attached) and completely
ignores hotplugging.

Drop the old checks and implement .check_peer_type() instead to fix this.
As a nice side effect, it also removes one more dependency on the legacy
QemuOpts infrastructure and even reduces the code size.

Signed-off-by: Kevin Wolf
Acked-by: Jason Wang
---
 net/vhost-user.c | 41 +++++++++++++---------------------------
 1 file changed, 14 insertions(+), 27 deletions(-)

diff --git a/net/vhost-user.c b/net/vhost-user.c
index 4a939124d2..b1a0247b59 100644
--- a/net/vhost-user.c
+++ b/net/vhost-user.c
@@ -198,6 +198,19 @@ static bool vhost_user_has_ufo(NetClientState *nc)
     return true;
 }

+static bool vhost_user_check_peer_type(NetClientState *nc, ObjectClass *oc,
+                                       Error **errp)
+{
+    const char *driver = object_class_get_name(oc);
+
+    if (!g_str_has_prefix(driver, "virtio-net-")) {
+        error_setg(errp, "vhost-user requires frontend driver virtio-net-*");
+        return false;
+    }
+
+    return true;
+}
+
 static NetClientInfo net_vhost_user_info = {
     .type = NET_CLIENT_DRIVER_VHOST_USER,
     .size = sizeof(NetVhostUserState),
@@ -207,6 +220,7 @@ static NetClientInfo net_vhost_user_info = {
     .has_ufo = vhost_user_has_ufo,
     .set_vnet_be = vhost_user_set_vnet_endianness,
     .set_vnet_le = vhost_user_set_vnet_endianness,
+    .check_peer_type = vhost_user_check_peer_type,
 };

 static gboolean net_vhost_user_watch(void *do_not_use, GIOCondition cond,
@@ -397,27 +411,6 @@ static Chardev *net_vhost_claim_chardev(
     return chr;
 }

-static int net_vhost_check_net(void *opaque, QemuOpts *opts, Error **errp)
-{
-    const char *name = opaque;
-    const char *driver, *netdev;
-
-    driver = qemu_opt_get(opts, "driver");
-    netdev = qemu_opt_get(opts, "netdev");
-
-    if (!driver || !netdev) {
-        return 0;
-    }
-
-    if (strcmp(netdev, name) == 0 &&
-        !g_str_has_prefix(driver, "virtio-net-")) {
-        error_setg(errp, "vhost-user requires frontend driver virtio-net-*");
-        return -1;
-    }
-
-    return 0;
-}
-
 int net_init_vhost_user(const Netdev *netdev, const char *name,
                         NetClientState *peer, Error **errp)
 {
@@ -433,12 +426,6 @@ int net_init_vhost_user(const Netdev *netdev, const char *name,
         return -1;
     }

-    /* verify net frontend */
-    if (qemu_opts_foreach(qemu_find_opts("device"), net_vhost_check_net,
-                          (char *)name, errp)) {
-        return -1;
-    }
-
     queues = vhost_user_opts->has_queues ? vhost_user_opts->queues : 1;
     if (queues < 1 || queues > MAX_QUEUE_NUM) {
         error_setg(errp,
Re: [PATCH v2 01/15] net: Introduce NetClientInfo.check_peer_type()
On 2021/10/8 at 9:34 PM, Kevin Wolf wrote:

Some network backends (vhost-user and vhost-vdpa) work only with specific
devices. At startup, they second-guess what the command line option
handling will do and error out if they think a non-virtio device will
attach to them.

This second-guessing is not only ugly, it can lead to wrong error messages
('-device floppy,netdev=foo' should complain about an unknown property,
not about the wrong kind of network device being attached) and completely
ignores hotplugging.

Add a callback where backends can check compatibility with a device when
it actually tries to attach, even on hotplug.

Signed-off-by: Kevin Wolf
Acked-by: Jason Wang
---
 include/net/net.h                | 2 ++
 hw/core/qdev-properties-system.c | 6 ++++++
 2 files changed, 8 insertions(+)

diff --git a/include/net/net.h b/include/net/net.h
index 5d1508081f..986288eb07 100644
--- a/include/net/net.h
+++ b/include/net/net.h
@@ -62,6 +62,7 @@ typedef struct SocketReadState SocketReadState;
 typedef void (SocketReadStateFinalize)(SocketReadState *rs);
 typedef void (NetAnnounce)(NetClientState *);
 typedef bool (SetSteeringEBPF)(NetClientState *, int);
+typedef bool (NetCheckPeerType)(NetClientState *, ObjectClass *, Error **);

 typedef struct NetClientInfo {
     NetClientDriver type;
@@ -84,6 +85,7 @@ typedef struct NetClientInfo {
     SetVnetBE *set_vnet_be;
     NetAnnounce *announce;
     SetSteeringEBPF *set_steering_ebpf;
+    NetCheckPeerType *check_peer_type;
 } NetClientInfo;

 struct NetClientState {
diff --git a/hw/core/qdev-properties-system.c b/hw/core/qdev-properties-system.c
index e71f5d64d1..a91f60567a 100644
--- a/hw/core/qdev-properties-system.c
+++ b/hw/core/qdev-properties-system.c
@@ -431,6 +431,12 @@ static void set_netdev(Object *obj, Visitor *v, const char *name,
             goto out;
         }

+        if (peers[i]->info->check_peer_type) {
+            if (!peers[i]->info->check_peer_type(peers[i], obj->class, errp)) {
+                goto out;
+            }
+        }
+
         ncs[i] = peers[i];
         ncs[i]->queue_index = i;
     }
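The shape of the hook is a classic optional callback in an info struct: the attach path calls it only when the backend sets it, so backends without special requirements need no change. A minimal sketch of that pattern (simplified types and names, not QEMU code):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for NetClientInfo: an optional peer-type check
 * that the attach path invokes only when the backend provides it. */
typedef bool (*check_peer_type_fn)(const char *driver, const char **errp);

struct net_client_info {
    const char *type;
    check_peer_type_fn check_peer_type;  /* optional, may be NULL */
};

/* A vhost-style backend's hook: accept only virtio-net-* frontends. */
static bool vhost_like_check(const char *driver, const char **errp)
{
    if (strncmp(driver, "virtio-net-", strlen("virtio-net-")) != 0) {
        *errp = "requires frontend driver virtio-net-*";
        return false;
    }
    return true;
}

/* Attach-time check, mirroring what set_netdev() does: succeed when the
 * backend has no hook, or when the hook accepts the peer. */
static bool attach_peer(const struct net_client_info *info,
                        const char *driver, const char **errp)
{
    if (info->check_peer_type && !info->check_peer_type(driver, errp)) {
        return false;
    }
    return true;
}
```

Running the check at attach time is what makes it cover hotplug as well as the initial command line.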
Re: [ovirt-devel] Re: device compatibility interface for live migration with assigned devices
On 2020/8/21 下午10:52, Cornelia Huck wrote: On Fri, 21 Aug 2020 11:14:41 +0800 Jason Wang wrote: On 2020/8/20 下午8:27, Cornelia Huck wrote: On Wed, 19 Aug 2020 17:28:38 +0800 Jason Wang wrote: On 2020/8/19 下午4:13, Yan Zhao wrote: On Wed, Aug 19, 2020 at 03:39:50PM +0800, Jason Wang wrote: On 2020/8/19 下午2:59, Yan Zhao wrote: On Wed, Aug 19, 2020 at 02:57:34PM +0800, Jason Wang wrote: On 2020/8/19 上午11:30, Yan Zhao wrote: hi All, could we decide that sysfs is the interface that every VFIO vendor driver needs to provide in order to support vfio live migration, otherwise the userspace management tool would not list the device into the compatible list? if that's true, let's move to the standardizing of the sysfs interface. (1) content common part: (must) - software_version: (in major.minor.bugfix scheme) This can not work for devices whose features can be negotiated/advertised independently. (E.g virtio devices) I thought the 'software_version' was supposed to describe kind of a 'protocol version' for the data we transmit? I.e., you add a new field, you bump the version number. Ok, but since we mandate backward compatibility of uABI, is this really worth to have a version for sysfs? (Searching on sysfs shows no examples like this) I was not thinking about the sysfs interface, but rather about the data that is sent over while migrating. E.g. we find out that sending some auxiliary data is a good idea and bump to version 1.1.0; version 1.0.0 cannot deal with the extra data, but version 1.1.0 can deal with the older data stream. (...) Well, I think what data to transmit during migration is the duty of qemu not kernel. And I suspect the idea of reading opaque data (with version) from kernel and transmit them to dest is the best approach. - device_api: vfio-pci or vfio-ccw ... - type: mdev type for mdev device or a signature for physical device which is a counterpart for mdev type. 
device api specific part: (must) - pci id: pci id of mdev parent device or pci id of physical pci device (device_api is vfio-pci)API here. So this assumes a PCI device which is probably not true. for device_api of vfio-pci, why it's not true? for vfio-ccw, it's subchannel_type. Ok but having two different attributes for the same file is not good idea. How mgmt know there will be a 3rd type? that's why some attributes need to be common. e.g. device_api: it's common because mgmt need to know it's a pci device or a ccw device. and the api type is already defined vfio.h. (The field is agreed by and actually suggested by Alex in previous mail) type: mdev_type for mdev. if mgmt does not understand it, it would not be able to create one compatible mdev device. software_version: mgmt can compare the major and minor if it understands this fields. I think it would be helpful if you can describe how mgmt is expected to work step by step with the proposed sysfs API. This can help people to understand. My proposal would be: - check that device_api matches - check possible device_api specific attributes - check that type matches [I don't think the combination of mdev types and another attribute to determine compatibility is a good idea; Any reason for this? Actually if we only use mdev type to detect the compatibility, it would be much more easier. Otherwise, we are actually re-inventing mdev types. E.g can we have the same mdev types with different device_api and other attributes? In the end, the mdev type is represented as a string; but I'm not sure we can expect that two types with the same name, but a different device_api are related in any way. If we e.g. compare vfio-pci and vfio-ccw, they are fundamentally different. I was mostly concerned about the aggregation proposal, where type A + aggregation value b might be compatible with type B + aggregation value a. Yes, that looks pretty complicated. 
actually, the current proposal confuses me every time I look at it] - check that software_version is compatible, assuming semantic versioning - check possible type-specific attributes I'm not sure if this is too complicated. And I suspect there will be vendor specific attributes: - for compatibility check: I think we should either modeling everything via mdev type or making it totally vendor specific. Having something in the middle will bring a lot of burden FWIW, I'm for a strict match on mdev type, and flexibility in per-type attributes. I'm not sure whether the above flexibility can work better than encoding them to mdev type. If we really want ultra flexibility, we need making the compatibility check totally vendor specific. - for provisioning: it's still not clear. As shown in this proposal, for NVME we may need to set remote_url, but unless there will be a subclass (NVME) in the mdev (which I guess not), we can't prevent vendor from using another
Re: [ovirt-devel] Re: device compatibility interface for live migration with assigned devices
On 2020/8/20 下午8:27, Cornelia Huck wrote: On Wed, 19 Aug 2020 17:28:38 +0800 Jason Wang wrote: On 2020/8/19 下午4:13, Yan Zhao wrote: On Wed, Aug 19, 2020 at 03:39:50PM +0800, Jason Wang wrote: On 2020/8/19 下午2:59, Yan Zhao wrote: On Wed, Aug 19, 2020 at 02:57:34PM +0800, Jason Wang wrote: On 2020/8/19 上午11:30, Yan Zhao wrote: hi All, could we decide that sysfs is the interface that every VFIO vendor driver needs to provide in order to support vfio live migration, otherwise the userspace management tool would not list the device into the compatible list? if that's true, let's move to the standardizing of the sysfs interface. (1) content common part: (must) - software_version: (in major.minor.bugfix scheme) This can not work for devices whose features can be negotiated/advertised independently. (E.g virtio devices) I thought the 'software_version' was supposed to describe kind of a 'protocol version' for the data we transmit? I.e., you add a new field, you bump the version number. Ok, but since we mandate backward compatibility of uABI, is this really worth to have a version for sysfs? (Searching on sysfs shows no examples like this) sorry, I don't understand here, why virtio devices need to use vfio interface? I don't see any reason that virtio devices can't be used by VFIO. Do you? Actually, virtio devices have been used by VFIO for many years: - passthrough a hardware virtio devices to userspace(VM) drivers - using virtio PMD inside guest So, what's different for it vs passing through a physical hardware via VFIO? The difference is in the guest, the device could be either real hardware or emulated ones. even though the features are negotiated dynamically, could you explain why it would cause software_version not work? Virtio device 1 supports feature A, B, C Virtio device 2 supports feature B, C, D So you can't migrate a guest from device 1 to device 2. And it's impossible to model the features with versions. 
We're talking about the features offered by the device, right? Would it be sufficient to mandate that the target device supports the same features or a superset of the features supported by the source device? Yes. I think this thread is discussing about vfio related devices. - device_api: vfio-pci or vfio-ccw ... - type: mdev type for mdev device or a signature for physical device which is a counterpart for mdev type. device api specific part: (must) - pci id: pci id of mdev parent device or pci id of physical pci device (device_api is vfio-pci)API here. So this assumes a PCI device which is probably not true. for device_api of vfio-pci, why it's not true? for vfio-ccw, it's subchannel_type. Ok but having two different attributes for the same file is not good idea. How mgmt know there will be a 3rd type? that's why some attributes need to be common. e.g. device_api: it's common because mgmt need to know it's a pci device or a ccw device. and the api type is already defined vfio.h. (The field is agreed by and actually suggested by Alex in previous mail) type: mdev_type for mdev. if mgmt does not understand it, it would not be able to create one compatible mdev device. software_version: mgmt can compare the major and minor if it understands this fields. I think it would be helpful if you can describe how mgmt is expected to work step by step with the proposed sysfs API. This can help people to understand. My proposal would be: - check that device_api matches - check possible device_api specific attributes - check that type matches [I don't think the combination of mdev types and another attribute to determine compatibility is a good idea; Any reason for this? Actually if we only use mdev type to detect the compatibility, it would be much more easier. Otherwise, we are actually re-inventing mdev types. E.g can we have the same mdev types with different device_api and other attributes? 
actually, the current proposal confuses me every time I look at it] - check that software_version is compatible, assuming semantic versioning - check possible type-specific attributes I'm not sure if this is too complicated. And I suspect there will be vendor specific attributes: - for compatibility check: I think we should either modeling everything via mdev type or making it totally vendor specific. Having something in the middle will bring a lot of burden - for provisioning: it's still not clear. As shown in this proposal, for NVME we may need to set remote_url, but unless there will be a subclass (NVME) in the mdev (which I guess not), we can't prevent vendor from using another attribute name, in this case, tricks like attributes iteration in some sub directory won't work. So even if we had some common API for compatibility check, the provisioning API is still vendor specific ... Thanks Thanks
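The virtio feature argument above (device 1 offers A, B, C; device 2 offers B, C, D; migration fails in both directions) is a superset test on feature bitmasks, not a version comparison. A small sketch under that framing (illustrative feature names, not real virtio feature bits):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative feature bits, mirroring the A/B/C/D example in the thread. */
enum {
    FEAT_A = 1u << 0,
    FEAT_B = 1u << 1,
    FEAT_C = 1u << 2,
    FEAT_D = 1u << 3,
};

/* A guest can migrate only if every feature offered by the source
 * device also exists on the destination device (superset test). */
static bool can_migrate_to(uint32_t src_features, uint32_t dst_features)
{
    return (src_features & ~dst_features) == 0;
}
```

This is why a single `software_version` cannot model independently negotiated features: neither device in the example is "newer" than the other, yet neither is a valid migration target for the other.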
Re: device compatibility interface for live migration with assigned devices
On 2020/8/19 下午1:58, Parav Pandit wrote: From: Yan Zhao Sent: Wednesday, August 19, 2020 9:01 AM On Tue, Aug 18, 2020 at 09:39:24AM +, Parav Pandit wrote: Please refer to my previous email which has more example and details. hi Parav, the example is based on a new vdpa tool running over netlink, not based on devlink, right? Right. For vfio migration compatibility, we have to deal with both mdev and physical pci devices, I don't think it's a good idea to write a new tool for it, given we are able to retrieve the same info from sysfs and there's already an mdevctl from mdev attribute should be visible in the mdev's sysfs tree. I do not propose to write a new mdev tool over netlink. I am sorry if I implied that with my suggestion of vdpa tool. If underlying device is vdpa, mdev might be able to understand vdpa device and query from it and populate in mdev sysfs tree. Note that vdpa is bus independent so it can't work now and the support of mdev on top of vDPA have been rejected (and duplicated with vhost-vDPA). Thanks The vdpa tool I propose is usable even without mdevs. vdpa tool's role is to create one or more vdpa devices and place on the "vdpa" bus which is the lowest layer here. Additionally this tool let user query virtqueue stats, db stats. When a user creates vdpa net device, user may need to configure features of the vdpa device such as VIRTIO_NET_F_MAC, default VIRTIO_NET_F_MTU. These are vdpa level features, attributes. Mdev is layer above it. Alex (https://nam03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub. com%2Fmdevctl%2Fmdevctldata=02%7C01%7Cparav%40nvidia.com%7C 0c2691d430304f5ea11308d843f2d84e%7C43083d15727340c1b7db39efd9ccc17 a%7C0%7C0%7C637334057571911357sdata=KxH7PwxmKyy9JODut8BWr LQyOBylW00%2Fyzc4rEvjUvA%3Dreserved=0). Sorry for above link mangling. Our mail server is still transitioning due to company acquisition. I am less familiar on below points to comment. 
hi All,
could we decide that sysfs is the interface that every VFIO vendor driver
needs to provide in order to support vfio live migration, otherwise the
userspace management tool would not list the device into the compatible list?

if that's true, let's move to the standardizing of the sysfs interface.

(1) content

common part: (must)
- software_version: (in major.minor.bugfix scheme)
- device_api: vfio-pci or vfio-ccw ...
- type: mdev type for mdev device, or a signature for a physical device
  which is a counterpart for the mdev type.

device api specific part: (must)
- pci id: pci id of mdev parent device or pci id of physical pci device
  (device_api is vfio-pci)
- subchannel_type (device_api is vfio-ccw)

vendor driver specific part: (optional)
- aggregator
- chpid_type
- remote_url

NOTE: vendors are free to add attributes in this part with a restriction
that each such attribute is able to be configured under the same name in
sysfs too. e.g. for aggregator, there must be a sysfs attribute in the
device node
/sys/devices/pci0000:00/0000:00:02.0/882cc4da-dede-11e7-9180-078a62063ab1/intel_vgpu/aggregator,
so that the userspace tool is able to configure the target device
according to the source device's aggregator attribute.

(2) where and structure

proposal 1:
|- [path to device]
   |--- migration
   |    |--- self
   |    |    |--- software_version
   |    |    |--- device_api
   |    |    |--- type
   |    |    |--- [pci_id or subchannel_type]
   |    |    |--- ...
   |    |--- compatible
   |    |    |--- software_version
   |    |    |--- device_api
   |    |    |--- type
   |    |    |--- [pci_id or subchannel_type]
   |    |    |--- ...

multiple compatible entries are allowed.
attributes should be ASCII text files, preferably with only one value per file.

proposal 2: use bin_attribute.
|- [path to device]
   |--- migration
   |    |--- self
   |    |--- compatible

so we can continue to use the multiline format. e.g.

cat compatible
  software_version=0.1.0
  device_api=vfio_pci
  type=i915-GVTg_V5_{val1:int:1,2,4,8}
  pci_id=80865963
  aggregator={val1}/2

Thanks
Yan
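Proposal 2's multiline `compatible` file implies the management tool parses key=value lines and then applies the semantic-versioning rule discussed earlier in the thread (same major, destination minor not older). A C sketch of that consumer side (hypothetical helpers; real matching for patterned attributes like `type=i915-GVTg_V5_{...}` would need more than string equality):

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Find "key=value" at the start of a line in a multiline blob and copy
 * the value into out. Returns false if the key is absent or out is too
 * small. */
static bool lookup_attr(const char *blob, const char *key,
                        char *out, size_t outlen)
{
    size_t klen = strlen(key);
    const char *p = blob;

    while (p && *p) {
        if (strncmp(p, key, klen) == 0 && p[klen] == '=') {
            const char *v = p + klen + 1;
            const char *end = strchr(v, '\n');
            size_t vlen = end ? (size_t)(end - v) : strlen(v);
            if (vlen >= outlen)
                return false;
            memcpy(out, v, vlen);
            out[vlen] = '\0';
            return true;
        }
        p = strchr(p, '\n');
        if (p)
            p++;
    }
    return false;
}

/* major.minor.bugfix comparison: majors must match and the destination
 * minor must be >= the source minor (newer minors read older streams). */
static bool software_version_compatible(const char *src, const char *dst)
{
    int smaj, smin, sfix, dmaj, dmin, dfix;

    if (sscanf(src, "%d.%d.%d", &smaj, &smin, &sfix) != 3 ||
        sscanf(dst, "%d.%d.%d", &dmaj, &dmin, &dfix) != 3)
        return false;
    return smaj == dmaj && dmin >= smin;
}
```

This matches the earlier 1.0.0 vs 1.1.0 example: a 1.1.0 destination accepts a 1.0.0 source, but not the other way around.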
Re: [ovirt-devel] Re: device compatibility interface for live migration with assigned devices
On 2020/8/19 下午4:13, Yan Zhao wrote: On Wed, Aug 19, 2020 at 03:39:50PM +0800, Jason Wang wrote: On 2020/8/19 下午2:59, Yan Zhao wrote: On Wed, Aug 19, 2020 at 02:57:34PM +0800, Jason Wang wrote: On 2020/8/19 上午11:30, Yan Zhao wrote: hi All, could we decide that sysfs is the interface that every VFIO vendor driver needs to provide in order to support vfio live migration, otherwise the userspace management tool would not list the device into the compatible list? if that's true, let's move to the standardizing of the sysfs interface. (1) content common part: (must) - software_version: (in major.minor.bugfix scheme) This can not work for devices whose features can be negotiated/advertised independently. (E.g virtio devices) sorry, I don't understand here, why virtio devices need to use vfio interface? I don't see any reason that virtio devices can't be used by VFIO. Do you? Actually, virtio devices have been used by VFIO for many years: - passthrough a hardware virtio devices to userspace(VM) drivers - using virtio PMD inside guest So, what's different for it vs passing through a physical hardware via VFIO? The difference is in the guest, the device could be either real hardware or emulated ones. even though the features are negotiated dynamically, could you explain why it would cause software_version not work? Virtio device 1 supports feature A, B, C Virtio device 2 supports feature B, C, D So you can't migrate a guest from device 1 to device 2. And it's impossible to model the features with versions. I think this thread is discussing about vfio related devices. - device_api: vfio-pci or vfio-ccw ... - type: mdev type for mdev device or a signature for physical device which is a counterpart for mdev type. device api specific part: (must) - pci id: pci id of mdev parent device or pci id of physical pci device (device_api is vfio-pci)API here. So this assumes a PCI device which is probably not true. for device_api of vfio-pci, why it's not true? 
for vfio-ccw, it's subchannel_type. Ok but having two different attributes for the same file is not good idea. How mgmt know there will be a 3rd type? that's why some attributes need to be common. e.g. device_api: it's common because mgmt need to know it's a pci device or a ccw device. and the api type is already defined vfio.h. (The field is agreed by and actually suggested by Alex in previous mail) type: mdev_type for mdev. if mgmt does not understand it, it would not be able to create one compatible mdev device. software_version: mgmt can compare the major and minor if it understands this fields. I think it would be helpful if you can describe how mgmt is expected to work step by step with the proposed sysfs API. This can help people to understand. Thanks for the patience. Since sysfs is uABI, when accepted, we need support it forever. That's why we need to be careful. - subchannel_type (device_api is vfio-ccw) vendor driver specific part: (optional) - aggregator - chpid_type - remote_url For "remote_url", just wonder if it's better to integrate or reuse the existing NVME management interface instead of duplicating it here. Otherwise it could be a burden for mgmt to learn. E.g vendor A may use "remote_url" but vendor B may use a different attribute. it's vendor driver specific. vendor specific attributes are inevitable, and that's why we are discussing here of a way to standardizing of it. Well, then you will end up with a very long list to discuss. E.g for networking devices, you will have "mac", "v(x)lan" and a lot of other. Note that "remote_url" is not vendor specific but NVME (class/subsystem) specific. yes, it's just NVMe specific. I added it as an example to show what is vendor specific. if one attribute is vendor specific across all vendors, then it's not vendor specific, it's already common attribute, right? It's common but the issue is about naming and mgmt overhead. 
Unless you have a unified API per class (NVME, ethernet, etc), you can't prevent vendor from using another name instead of "remote_url". The point is that if vendor/class specific part is unavoidable, why not making all of the attributes vendor specific? some parts need to be common, as I listed above. This is hard, unless VFIO knows the type of device (e.g it's a NVME or networking device). our goal is that mgmt can use it without understanding the meaning of vendor specific attributes. I'm not sure this is the correct design of uAPI. Is there something similar in the existing uAPIs? And it might be hard to work for virtio devices. NOTE: vendors are free to add attributes in this part with a restriction that this attribute is able to be configured with the same name in sysfs too. e.g. Sysfs works well for common attr
Re: [ovirt-devel] Re: device compatibility interface for live migration with assigned devices
On 2020/8/19 2:59 PM, Yan Zhao wrote: On Wed, Aug 19, 2020 at 02:57:34PM +0800, Jason Wang wrote: On 2020/8/19 11:30 AM, Yan Zhao wrote: hi All, could we decide that sysfs is the interface that every VFIO vendor driver needs to provide in order to support vfio live migration, otherwise the userspace management tool would not list the device into the compatible list? if that's true, let's move on to standardizing the sysfs interface.
(1) content
common part: (must)
- software_version: (in major.minor.bugfix scheme) This can not work for devices whose features can be negotiated/advertised independently. (E.g. virtio devices) sorry, I don't understand here, why do virtio devices need to use the vfio interface? I don't see any reason that virtio devices can't be used by VFIO. Do you? Actually, virtio devices have been used by VFIO for many years:
- passthrough of hardware virtio devices to userspace (VM) drivers
- using the virtio PMD inside the guest
I think this thread is discussing vfio related devices.
- device_api: vfio-pci or vfio-ccw ...
- type: mdev type for mdev device or a signature for physical device which is a counterpart for mdev type.
device api specific part: (must)
- pci id: pci id of mdev parent device or pci id of physical pci device (device_api is vfio-pci) So this assumes a PCI device, which is probably not true. for device_api of vfio-pci, why is it not true? for vfio-ccw, it's subchannel_type. Ok, but having two different attributes for the same file is not a good idea. How would mgmt know there won't be a 3rd type?
- subchannel_type (device_api is vfio-ccw)
vendor driver specific part: (optional)
- aggregator
- chpid_type
- remote_url
For "remote_url", just wondering if it's better to integrate or reuse the existing NVMe management interface instead of duplicating it here. Otherwise it could be a burden for mgmt to learn. E.g. vendor A may use "remote_url" but vendor B may use a different attribute. it's vendor driver specific.
vendor specific attributes are inevitable, and that's why we are discussing here a way of standardizing them. Well, then you will end up with a very long list to discuss. E.g. for networking devices, you will have "mac", "v(x)lan" and a lot of others. Note that "remote_url" is not vendor specific but NVMe (class/subsystem) specific. The point is that if a vendor/class specific part is unavoidable, why not make all of the attributes vendor specific? our goal is that mgmt can use it without understanding the meaning of vendor specific attributes. I'm not sure this is the correct design of a uAPI. Is there something similar in the existing uAPIs? And it might be hard to make it work for virtio devices. NOTE: vendors are free to add attributes in this part with the restriction that such an attribute is able to be configured with the same name in sysfs too. e.g. Sysfs works well for common attributes belonging to a class, but I'm not sure it can work well for device/vendor specific attributes. Does this mean mgmt needs to iterate all the attributes in both src and dst? no. just the attributes under the migration directory. for aggregator, there must be a sysfs attribute in the device node /sys/devices/pci0000:00/0000:00:02.0/882cc4da-dede-11e7-9180-078a62063ab1/intel_vgpu/aggregator, so that the userspace tool is able to configure the target device according to the source device's aggregator attribute.
(2) where and structure
proposal 1:
|- [path to device]
 |--- migration
 |    |--- self
 |    |    |- software_version
 |    |    |- device_api
 |    |    |- type
 |    |    |- [pci_id or subchannel_type]
 |    |    |- ...
 |    |--- compatible
 |    |    |- software_version
 |    |    |- device_api
 |    |    |- type
 |    |    |- [pci_id or subchannel_type]
 |    |    |- ...
multiple compatible is allowed. attributes should be ASCII text files, preferably with only one value per file.
proposal 2: use bin_attribute.
|- [path to device]
 |--- migration
 |    |--- self
 |    |--- compatible
so we can continue to use the multiline format. e.g.
cat compatible
software_version=0.1.0
device_api=vfio_pci
type=i915-GVTg_V5_{val1:int:1,2,4,8}
pci_id=80865963
aggregator={val1}/2
So basically two questions:
- how hard is it to standardize a sysfs API for dealing with the compatibility check (to make it work for most types of devices)? sorry, I just know we are in the process of standardizing it :) It's not easy. As I said, the current design can't work for virtio devices and it's not hard to find other examples. I remember some Intel devices have bitmask based capability registers.
- how hard is it for the mgmt to learn vendor specific attributes (vs an existing management API)? what is the existing management API? It depends on the type of devices. E.g for NVME, we've al
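As an editor's sketch (not part of the original thread), the multiline `compatible` format above could be consumed by a mgmt tool roughly as follows. The function names are invented for illustration, and the template grammar is inferred from the single `i915-GVTg_V5_{val1:int:1,2,4,8}` example:

```python
import re

def parse_compatible(text):
    """Parse the multiline key=value format from the 'compatible' file."""
    attrs = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition("=")
        attrs[key.strip()] = value.strip()
    return attrs

def match_templated(pattern, value):
    """Match a value against a template like i915-GVTg_V5_{val1:int:1,2,4,8}.
    Returns the bound template variables on success, None on mismatch."""
    m = re.fullmatch(r"(.*)\{(\w+):int:([\d,]+)\}", pattern)
    if m is None:                       # plain attribute: exact match only
        return {} if pattern == value else None
    prefix, var, allowed = m.groups()
    suffix = value[len(prefix):] if value.startswith(prefix) else ""
    if suffix.isdigit() and int(suffix) in {int(x) for x in allowed.split(",")}:
        return {var: int(suffix)}
    return None

compat = parse_compatible(
    "software_version=0.1.0\n"
    "device_api=vfio_pci\n"
    "type=i915-GVTg_V5_{val1:int:1,2,4,8}\n"
    "pci_id=80865963\n"
    "aggregator={val1}/2\n"
)
```

A target's `self` attributes would then be checked key by key against one of the source's `compatible` sets, with templated keys binding values (here, `val1`) that parameterize others such as `aggregator`.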
Re: [ovirt-devel] Re: device compatibility interface for live migration with assigned devices
On 2020/8/19 11:30 AM, Yan Zhao wrote: hi All, could we decide that sysfs is the interface that every VFIO vendor driver needs to provide in order to support vfio live migration, otherwise the userspace management tool would not list the device into the compatible list? if that's true, let's move on to standardizing the sysfs interface.
(1) content
common part: (must)
- software_version: (in major.minor.bugfix scheme) This can not work for devices whose features can be negotiated/advertised independently. (E.g. virtio devices)
- device_api: vfio-pci or vfio-ccw ...
- type: mdev type for mdev device or a signature for physical device which is a counterpart for mdev type.
device api specific part: (must)
- pci id: pci id of mdev parent device or pci id of physical pci device (device_api is vfio-pci) So this assumes a PCI device, which is probably not true.
- subchannel_type (device_api is vfio-ccw)
vendor driver specific part: (optional)
- aggregator
- chpid_type
- remote_url
For "remote_url", just wondering if it's better to integrate or reuse the existing NVMe management interface instead of duplicating it here. Otherwise it could be a burden for mgmt to learn. E.g. vendor A may use "remote_url" but vendor B may use a different attribute. NOTE: vendors are free to add attributes in this part with the restriction that such an attribute is able to be configured with the same name in sysfs too. e.g. Sysfs works well for common attributes belonging to a class, but I'm not sure it can work well for device/vendor specific attributes. Does this mean mgmt needs to iterate all the attributes in both src and dst? for aggregator, there must be a sysfs attribute in the device node /sys/devices/pci0000:00/0000:00:02.0/882cc4da-dede-11e7-9180-078a62063ab1/intel_vgpu/aggregator, so that the userspace tool is able to configure the target device according to the source device's aggregator attribute.
(2) where and structure
proposal 1:
|- [path to device]
 |--- migration
 |    |--- self
 |    |    |- software_version
 |    |    |- device_api
 |    |    |- type
 |    |    |- [pci_id or subchannel_type]
 |    |    |- ...
 |    |--- compatible
 |    |    |- software_version
 |    |    |- device_api
 |    |    |- type
 |    |    |- [pci_id or subchannel_type]
 |    |    |- ...
multiple compatible is allowed. attributes should be ASCII text files, preferably with only one value per file.
proposal 2: use bin_attribute.
|- [path to device]
 |--- migration
 |    |--- self
 |    |--- compatible
so we can continue to use the multiline format. e.g.
cat compatible
software_version=0.1.0
device_api=vfio_pci
type=i915-GVTg_V5_{val1:int:1,2,4,8}
pci_id=80865963
aggregator={val1}/2
So basically two questions:
- how hard is it to standardize a sysfs API for dealing with the compatibility check (to make it work for most types of devices)?
- how hard is it for the mgmt to learn vendor specific attributes (vs an existing management API)?
Thanks
Thanks Yan
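Proposal 1's one-value-per-file layout lends itself to a simple mgmt-side check. Below is a minimal sketch (an editor's illustration, not code from the thread), assuming the `migration/self` and `migration/compatible` directories described above; the function names are invented:

```python
import os

def read_attr_dir(path):
    """Read a directory of one-value-per-file ASCII attributes into a dict,
    mirroring proposal 1's migration/self and migration/compatible layout."""
    attrs = {}
    for name in sorted(os.listdir(path)):
        with open(os.path.join(path, name)) as f:
            attrs[name] = f.read().strip()
    return attrs

def can_migrate(src_self, dst_compatible):
    """The source device is acceptable if every attribute the destination
    lists in one of its 'compatible' sets matches the source's 'self'."""
    return all(src_self.get(key) == value
               for key, value in dst_compatible.items())
```

Since multiple `compatible` sets are allowed, a real tool would call `can_migrate` once per advertised set and accept the device if any set matches.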
Re: device compatibility interface for live migration with assigned devices
On 2020/8/19 1:26 PM, Parav Pandit wrote: From: Jason Wang Sent: Wednesday, August 19, 2020 8:16 AM On 2020/8/18 5:32 PM, Parav Pandit wrote: Hi Jason, From: Jason Wang Sent: Tuesday, August 18, 2020 2:32 PM On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote: On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote: On 2020/8/14 1:16 PM, Yan Zhao wrote: On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: On 2020/8/10 3:46 PM, Yan Zhao wrote: driver is it handled by? It looks like devlink is network device specific, and in devlink.h, it says include/uapi/linux/devlink.h - Network physical device Netlink interface. Actually not, I think there was some discussion last year and the conclusion was to remove this comment. [...] Yes, but it could be hard. E.g. vDPA will choose to use devlink (there's a long debate on sysfs vs devlink). So if we go with sysfs, at least two APIs need to be supported ... We had an internal discussion and proposal on this topic. I wanted Eli Cohen to be back from vacation on Wed 8/19, but since this is an active discussion right now, I will share the thoughts anyway. Here is the initial round of thoughts and proposal.
User requirements:
---
1. User might want to create one or more vdpa devices per PCI PF/VF/SF.
2. User might want to create one or more vdpa devices of type net/blk or other type.
3. User needs to look at and dump the health of the queues for debug purposes.
4. During vdpa net device creation time, user may have to provide a MAC address and/or VLAN.
5. User should be able to set/query some of the attributes for debug/compatibility checks.
6. When user wants to create a vdpa device, it needs to know which device supports creation.
7. User should be able to see the queue statistics of doorbells, wqes etc regardless of class type.
Note that wqes is probably not something common in all of the vendors. Yes. I think virtq descriptor stats are better for monitoring the virtqueues.
To address the above requirements, there is a need for a vendor agnostic tool, so that the user can create/config/delete vdpa device(s) regardless of the vendor. Hence, we should have a tool that lets the user do it. Examples:
(a) List parent devices which support creating vdpa devices. It also shows which class types are supported by each parent device. In the below command two parent devices support vdpa device creation. First is a PCI VF whose bdf is 03.00:5. Second is a PCI SF whose name is mlx5_sf.1
$ vdpa list pd
What did "pd" mean? Parent device which supports creation of one or more vdpa devices. In a system there can be multiple parent devices which may support vdpa creation. User should be able to know which devices support it, and when the user creates a vdpa device, it tells which parent device to use for creation, as done in the below vdpa dev add example.
pci/0000:03.00:5
  class_supports net vdpa
virtbus/mlx5_sf.1
So creating mlx5_sf.1 is in the charge of devlink? Yes. But here the vdpa tool is working with the parent device identifier {bus+name} instead of the devlink identifier.
  class_supports net
(b) Now add a vdpa device and show the device.
$ vdpa dev add pci/0000:03.00:5 type net
So if you want to create device types other than vdpa on pci/0000:03.00:5 it needs some synchronization with devlink? Please refer to FAQ-1, a new tool is not linked to devlink because vdpa will evolve with time and devlink will fall short. So no, it doesn't need any synchronization with devlink. As long as the parent device exists, the user can create it. All synchronization will be within drivers/vdpa/vdpa.c This user interface is exposed via a new netlink family by doing genl_register_family() with the new name "vdpa" in drivers/vdpa/vdpa.c. Just to make sure I understand here. Consider we had virtbus/mlx5_sf.1. Process A wants to create a vDPA instance on top of it but Process B wants to create an IB instance. Then I think some synchronization is needed at least at the parent device level?
$ vdpa dev show
vdpa0@pci/0000:03.00:5 type net state inactive maxqueues 8 curqueues 4
(c) $ vdpa dev show features vdpa0
iommu platform version 1
(d) dump vdpa statistics
$ vdpa dev stats show vdpa0
kickdoorbells 10
wqes 100
(e) Now delete a vdpa device previously created.
$ vdpa dev del vdpa0
Design overview:
---
1. Above example tool runs over the netlink socket interface.
2. This enables returning meaningful error strings in addition to error codes so that the user can be better informed. Often this is missing in ioctl()/configfs/sysfs interfaces.
3. This tool over netlink enables syscaller tests to be more usable, like other subsystems, to keep the kernel robust.
4. This provides a vendor agnostic view of all vdpa capable parent and vdpa devices.
5. Each driver which supports vdpa device creation registers the parent device along with supported classes.
FAQs:
1. Why not use devlink?
Ans: Because as the vdpa ecosystem grows, devlink will fall short of extending vdpa specific params, attributes, stats.
Re: device compatibility interface for live migration with assigned devices
On 2020/8/18 5:36 PM, Cornelia Huck wrote: On Tue, 18 Aug 2020 10:16:28 +0100 Daniel P. Berrangé wrote: On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote: On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote: On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote: On 2020/8/14 1:16 PM, Yan Zhao wrote: On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: On 2020/8/10 3:46 PM, Yan Zhao wrote: we actually can also retrieve the same information through sysfs, e.g.
|- [path to device]
 |--- migration
 |    |--- self
 |    |    |--- device_api
 |    |    |--- mdev_type
 |    |    |--- software_version
 |    |    |--- device_id
 |    |    |--- aggregator
 |    |--- compatible
 |    |    |--- device_api
 |    |    |--- mdev_type
 |    |    |--- software_version
 |    |    |--- device_id
 |    |    |--- aggregator
Yes, but:
- You need one file per attribute (one syscall for one attribute)
- Attribute is coupled with kobject
Is that really that bad? You have the device with an embedded kobject anyway, and you can just put things into an attribute group? Yes, but all of this could be done via devlink (netlink) as well with low overhead. [Also, I think that the self/compatible split in the example makes things needlessly complex. Shouldn't semantic versioning and matching already cover nearly everything? That's my question as well. E.g. for virtio, versioning may not even work; some features are negotiated independently:
Source features: A, B, C
Dest features: A, B, C, E
We just need to make sure the dest features are a superset of the source, then all is set. I would expect very few cases that are more complex than that. Maybe the aggregation stuff, but I don't think we need that self/compatible split for that, either.] All of the above seems unnecessary. Another point, as we discussed in another thread: it's really hard to make sure the above API works for all types of devices and frameworks. So having a vendor specific API looks much better.
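The superset rule described above for independently negotiated features can be stated in a few lines. A sketch (editor's illustration), treating feature sets as bitmasks; the bit assignments for A, B, C, E below are arbitrary:

```python
def features_compatible(src_features: int, dst_features: int) -> bool:
    """Dest must offer at least every feature bit the source negotiated."""
    return src_features & dst_features == src_features

# Arbitrary bit positions for the features named in the example above.
A, B, C, E = 1 << 0, 1 << 1, 1 << 2, 1 << 4

# Source {A, B, C} migrates to dest {A, B, C, E}; the reverse does not.
assert features_compatible(A | B | C, A | B | C | E)
assert not features_compatible(A | B | C | E, A | B | C)
```

This is exactly why a single `software_version` string cannot capture virtio compatibility: each negotiable feature bit must be compared individually, not a scalar version.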
From the POV of userspace mgmt apps doing device compat checking / migration, we certainly do NOT want to use different vendor specific APIs. We want to have an API that can be used / controlled in a standard manner across vendors. Yes, but it could be hard. E.g. vDPA will choose to use devlink (there's a long debate on sysfs vs devlink). So if we go with sysfs, at least two APIs need to be supported ... NB, I was not questioning devlink vs sysfs directly. If devlink is related to netlink, I can't say I'm enthusiastic as IMHO sysfs is easier to deal with. I don't know enough about devlink to have much of an opinion though. The key point was that I don't want the userspace APIs we need to deal with to be vendor specific. From what I've seen of devlink, it seems quite nice; but I understand why sysfs might be easier to deal with (especially as there's likely already a lot of code using it.) I understand that some users would like devlink because it is already widely used for network drivers (and some others), but I don't think the majority of devices used with vfio are network (although certainly a lot of them are.) Note that though devlink may be popular only for network devices, netlink is widely used by a lot of subsystems (e.g. SCSI). Thanks What I care about is that we have a *standard* userspace API for performing device compatibility checking / state migration, for use by QEMU/libvirt/ OpenStack, such that we can write code without countless vendor specific code paths. If there is vendor specific stuff on the side, that's fine as we can ignore that, but the core functionality for device compat / migration needs to be standardized. To summarize:
- choose one of sysfs or devlink
- have a common interface, with a standardized way to add vendor-specific attributes?
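The summary above, a common interface plus a standardized way to mark vendor-specific attributes, could look roughly like this on the mgmt side. This is an editor's sketch only; the `vendor/` key prefix is purely an assumption here, since the thread did not settle on a namespacing convention:

```python
def split_attributes(attrs):
    """Partition advertised attributes into the standardized common set and
    the vendor-specific set, using a hypothetical 'vendor/' key prefix."""
    common = {k: v for k, v in attrs.items() if not k.startswith("vendor/")}
    vendor = {k: v for k, v in attrs.items() if k.startswith("vendor/")}
    return common, vendor

def compat_check(src, dst):
    """Generic mgmt code strictly compares the common part; the vendor
    attributes are passed through untouched for the vendor driver to judge,
    so mgmt never needs to understand their meaning."""
    src_common, src_vendor = split_attributes(src)
    dst_common, dst_vendor = split_attributes(dst)
    return src_common == dst_common, (src_vendor, dst_vendor)
```

The design point is the one Daniel makes: vendor-specific data may exist "on the side", but generic code paths must only ever depend on the standardized common keys.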
Re: device compatibility interface for live migration with assigned devices
On 2020/8/18 5:32 PM, Parav Pandit wrote: Hi Jason, From: Jason Wang Sent: Tuesday, August 18, 2020 2:32 PM On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote: On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote: On 2020/8/14 1:16 PM, Yan Zhao wrote: On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: On 2020/8/10 3:46 PM, Yan Zhao wrote: driver is it handled by? It looks like devlink is network device specific, and in devlink.h, it says include/uapi/linux/devlink.h - Network physical device Netlink interface. Actually not, I think there was some discussion last year and the conclusion was to remove this comment. [...] Yes, but it could be hard. E.g. vDPA will choose to use devlink (there's a long debate on sysfs vs devlink). So if we go with sysfs, at least two APIs need to be supported ... We had an internal discussion and proposal on this topic. I wanted Eli Cohen to be back from vacation on Wed 8/19, but since this is an active discussion right now, I will share the thoughts anyway. Here is the initial round of thoughts and proposal.
User requirements:
---
1. User might want to create one or more vdpa devices per PCI PF/VF/SF.
2. User might want to create one or more vdpa devices of type net/blk or other type.
3. User needs to look at and dump the health of the queues for debug purposes.
4. During vdpa net device creation time, user may have to provide a MAC address and/or VLAN.
5. User should be able to set/query some of the attributes for debug/compatibility checks.
6. When user wants to create a vdpa device, it needs to know which device supports creation.
7. User should be able to see the queue statistics of doorbells, wqes etc regardless of class type.
Note that wqes is probably not something common in all of the vendors. To address the above requirements, there is a need for a vendor agnostic tool, so that the user can create/config/delete vdpa device(s) regardless of the vendor. Hence, we should have a tool that lets the user do it.
Examples:
(a) List parent devices which support creating vdpa devices. It also shows which class types are supported by each parent device. In the below command two parent devices support vdpa device creation. First is a PCI VF whose bdf is 03.00:5. Second is a PCI SF whose name is mlx5_sf.1
$ vdpa list pd
What did "pd" mean?
pci/0000:03.00:5
  class_supports net vdpa
virtbus/mlx5_sf.1
So creating mlx5_sf.1 is in the charge of devlink?
  class_supports net
(b) Now add a vdpa device and show the device.
$ vdpa dev add pci/0000:03.00:5 type net
So if you want to create device types other than vdpa on pci/0000:03.00:5 it needs some synchronization with devlink?
$ vdpa dev show
vdpa0@pci/0000:03.00:5 type net state inactive maxqueues 8 curqueues 4
(c) $ vdpa dev show features vdpa0
iommu platform version 1
(d) dump vdpa statistics
$ vdpa dev stats show vdpa0
kickdoorbells 10
wqes 100
(e) Now delete a vdpa device previously created.
$ vdpa dev del vdpa0
Design overview:
---
1. Above example tool runs over the netlink socket interface.
2. This enables returning meaningful error strings in addition to error codes so that the user can be better informed. Often this is missing in ioctl()/configfs/sysfs interfaces.
3. This tool over netlink enables syscaller tests to be more usable, like other subsystems, to keep the kernel robust.
4. This provides a vendor agnostic view of all vdpa capable parent and vdpa devices.
5. Each driver which supports vdpa device creation registers the parent device along with supported classes.
FAQs:
1. Why not use devlink?
Ans: Because as the vdpa ecosystem grows, devlink will fall short of extending vdpa specific params, attributes, stats. This should be fine, but it's still not clear to me what the difference is between a vdpa netlink and a vdpa object in devlink. Thanks
2. Why not use sysfs?
Ans: (a) Because the syscaller infrastructure can run well over netlink sockets, like it runs for several subsystems. (b) It lacks the ability to return error messages. Doing it via the kernel log just doesn't work.
(c) Why not use some ioctl()? It would reinvent the wheel of netlink, which has TLV formats for several attributes.
3. Why not configfs? It has the same limitations as sysfs.
Low level design and driver APIs: Will post once we discuss this further.
Re: device compatibility interface for live migration with assigned devices
On 2020/8/18 5:16 PM, Daniel P. Berrangé wrote: Your mail came through as HTML-only so all the quoting and attribution is mangled / lost now :-( My bad, sorry. On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote: On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote: On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote: On 2020/8/14 1:16 PM, Yan Zhao wrote: On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: On 2020/8/10 3:46 PM, Yan Zhao wrote: we actually can also retrieve the same information through sysfs, e.g.
|- [path to device]
 |--- migration
 |    |--- self
 |    |    |--- device_api
 |    |    |--- mdev_type
 |    |    |--- software_version
 |    |    |--- device_id
 |    |    |--- aggregator
 |    |--- compatible
 |    |    |--- device_api
 |    |    |--- mdev_type
 |    |    |--- software_version
 |    |    |--- device_id
 |    |    |--- aggregator
Yes, but:
- You need one file per attribute (one syscall for one attribute)
- Attribute is coupled with kobject
All of the above seems unnecessary. Another point, as we discussed in another thread: it's really hard to make sure the above API works for all types of devices and frameworks. So having a vendor specific API looks much better. From the POV of userspace mgmt apps doing device compat checking / migration, we certainly do NOT want to use different vendor specific APIs. We want to have an API that can be used / controlled in a standard manner across vendors. Yes, but it could be hard. E.g. vDPA will choose to use devlink (there's a long debate on sysfs vs devlink). So if we go with sysfs, at least two APIs need to be supported ... NB, I was not questioning devlink vs sysfs directly. If devlink is related to netlink, I can't say I'm enthusiastic as IMHO sysfs is easier to deal with. I don't know enough about devlink to have much of an opinion though. The key point was that I don't want the userspace APIs we need to deal with to be vendor specific.
What I care about is that we have a *standard* userspace API for performing device compatibility checking / state migration, for use by QEMU/libvirt/ OpenStack, such that we can write code without countless vendor specific code paths. If there is vendor specific stuff on the side, that's fine as we can ignore that, but the core functionality for device compat / migration needs to be standardized. Ok, I agree with you. Thanks Regards, Daniel
Re: device compatibility interface for live migration with assigned devices
On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote: On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote: On 2020/8/14 1:16 PM, Yan Zhao wrote: On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: On 2020/8/10 3:46 PM, Yan Zhao wrote: driver is it handled by? It looks like devlink is network device specific, and in devlink.h, it says include/uapi/linux/devlink.h - Network physical device Netlink interface. Actually not, I think there was some discussion last year and the conclusion was to remove this comment. It supports IB and probably vDPA in the future. hmm... sorry, I didn't find the referred discussion. only the below discussion regarding why devlink was added. https://www.mail-archive.com/netdev@vger.kernel.org/msg95801.html
>This doesn't seem to be too much related to networking? Why can't something
>like this be in sysfs?
It is related to networking quite a bit. There have been a couple of iterations of this, including sysfs and configfs implementations. There has been a consensus reached that this should be done by netlink. I believe netlink is really the best for this purpose. Sysfs is not a good idea. See the discussion here: https://patchwork.ozlabs.org/project/netdev/patch/20191115223355.1277139-1-jeffrey.t.kirs...@intel.com/ https://www.mail-archive.com/netdev@vger.kernel.org/msg96102.html
>there is already a way to change eth/ib via
>echo 'eth' > /sys/bus/pci/drivers/mlx4_core/0000:02:00.0/mlx4_port1
>
>sounds like this is another way to achieve the same?
It is. However the current way is driver-specific, and not correct. For mlx5, we need the same; it cannot be done in this way. So devlink is the correct way to go. https://lwn.net/Articles/674867/ There is a need for some userspace API that would allow exposing things that are not directly related to any device class, like the net_device of an ib_device, but rather chip-wide/switch-ASIC-wide stuff.
Use cases:
1) get/set of port type (Ethernet/InfiniBand)
2) monitoring of hardware messages to and from the chip
3) setting up port splitters - split a port into multiple ones and squash again, enables usage of splitter cables
4) setting up shared buffers - shared among multiple ports within one chip
we actually can also retrieve the same information through sysfs, e.g.
|- [path to device]
 |--- migration
 |    |--- self
 |    |    |--- device_api
 |    |    |--- mdev_type
 |    |    |--- software_version
 |    |    |--- device_id
 |    |    |--- aggregator
 |    |--- compatible
 |    |    |--- device_api
 |    |    |--- mdev_type
 |    |    |--- software_version
 |    |    |--- device_id
 |    |    |--- aggregator
Yes, but:
- You need one file per attribute (one syscall for one attribute)
- Attribute is coupled with kobject
All of the above seems unnecessary. Another point, as we discussed in another thread: it's really hard to make sure the above API works for all types of devices and frameworks. So having a vendor specific API looks much better. From the POV of userspace mgmt apps doing device compat checking / migration, we certainly do NOT want to use different vendor specific APIs. We want to have an API that can be used / controlled in a standard manner across vendors. Yes, but it could be hard. E.g. vDPA will choose to use devlink (there's a long debate on sysfs vs devlink). So if we go with sysfs, at least two APIs need to be supported ... Thanks Regards, Daniel
Re: device compatibility interface for live migration with assigned devices
On 2020/8/14 1:16 PM, Yan Zhao wrote: On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: On 2020/8/10 3:46 PM, Yan Zhao wrote: driver is it handled by? It looks like devlink is network device specific, and in devlink.h, it says include/uapi/linux/devlink.h - Network physical device Netlink interface. Actually not, I think there was some discussion last year and the conclusion was to remove this comment. It supports IB and probably vDPA in the future. hmm... sorry, I didn't find the referred discussion. only the below discussion regarding why devlink was added. https://www.mail-archive.com/netdev@vger.kernel.org/msg95801.html
>This doesn't seem to be too much related to networking? Why can't something
>like this be in sysfs?
It is related to networking quite a bit. There have been a couple of iterations of this, including sysfs and configfs implementations. There has been a consensus reached that this should be done by netlink. I believe netlink is really the best for this purpose. Sysfs is not a good idea. See the discussion here: https://patchwork.ozlabs.org/project/netdev/patch/20191115223355.1277139-1-jeffrey.t.kirs...@intel.com/ https://www.mail-archive.com/netdev@vger.kernel.org/msg96102.html
>there is already a way to change eth/ib via
>echo 'eth' > /sys/bus/pci/drivers/mlx4_core/0000:02:00.0/mlx4_port1
>
>sounds like this is another way to achieve the same?
It is. However the current way is driver-specific, and not correct. For mlx5, we need the same; it cannot be done in this way. So devlink is the correct way to go. https://lwn.net/Articles/674867/ There is a need for some userspace API that would allow exposing things that are not directly related to any device class, like the net_device of an ib_device, but rather chip-wide/switch-ASIC-wide stuff.
Use cases:
1) get/set of port type (Ethernet/InfiniBand)
2) monitoring of hardware messages to and from the chip
3) setting up port splitters - split a port into multiple ones and squash again, enables usage of splitter cables
4) setting up shared buffers - shared among multiple ports within one chip
we actually can also retrieve the same information through sysfs, e.g.
|- [path to device]
 |--- migration
 |    |--- self
 |    |    |--- device_api
 |    |    |--- mdev_type
 |    |    |--- software_version
 |    |    |--- device_id
 |    |    |--- aggregator
 |    |--- compatible
 |    |    |--- device_api
 |    |    |--- mdev_type
 |    |    |--- software_version
 |    |    |--- device_id
 |    |    |--- aggregator
Yes, but:
- You need one file per attribute (one syscall for one attribute)
- Attribute is coupled with kobject
All of the above seems unnecessary. Another point, as we discussed in another thread: it's really hard to make sure the above API works for all types of devices and frameworks. So having a vendor specific API looks much better. I feel like it's not very appropriate for a GPU driver to use this interface. Is that right? I think not, though most of the users are switch or ethernet devices. It doesn't prevent you from inventing new abstractions. so we need to patch the devlink core and the userspace devlink tool? e.g. devlink migration It's quite flexible: you can extend devlink, invent your own, or let mgmt establish devlink directly. Note that devlink is based on netlink; netlink has been widely used by various subsystems other than networking. the advantage of netlink I see is that it can monitor device status and notify the upper layer that the migration database needs to get updated. I may have missed something, but why is this needed? From the device point of view, the following capabilities should be sufficient to support live migration:
- set/get device state
- report dirty page tracking
- set/get capability
But not sure whether openstack would like to use this capability. As Sean said, it's heavy for openstack.
it's heavy for the vendor driver as well :) Well, it depends on several factors. Just counting LOCs, sysfs based attributes are not lightweight. Thanks And devlink monitor now listens for the notifications and dumps the state changes. If we want to use it, we need to let it forward the notifications and dumped info to openstack, right? Thanks Yan
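The minimal device-side capability set listed in the exchange above (set/get device state, dirty page reporting, capability query) can be sketched as an abstract interface. All names here are invented for illustration by the editor; no real vdpa or VFIO API is implied:

```python
from abc import ABC, abstractmethod


class MigratableDevice(ABC):
    """Hypothetical device-side contract for live migration: state
    save/restore, dirty page reporting, and capability query."""

    @abstractmethod
    def get_state(self) -> bytes: ...

    @abstractmethod
    def set_state(self, blob: bytes) -> None: ...

    @abstractmethod
    def dirty_pages(self) -> set:
        """Page frame numbers written since tracking started."""

    @abstractmethod
    def capabilities(self) -> dict: ...


class FakeDevice(MigratableDevice):
    """In-memory stand-in showing a save/restore round trip."""

    def __init__(self):
        self._state, self._dirty = b"", set()

    def get_state(self):
        return self._state

    def set_state(self, blob):
        self._state = blob

    def dirty_pages(self):
        return self._dirty

    def capabilities(self):
        return {"device_api": "vfio-pci"}
```

Whatever transport carries these calls (sysfs, devlink, or a vendor channel) is then an implementation detail behind the same three operations, which is the point being argued.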
Re: device compatibility interface for live migration with assigned devices
On 2020/8/10 3:46 PM, Yan Zhao wrote: driver is it handled by? It looks like devlink is network device specific, and in devlink.h, it says include/uapi/linux/devlink.h - Network physical device Netlink interface. Actually not, I think there was some discussion last year and the conclusion was to remove this comment. It supports IB and probably vDPA in the future. I feel like it's not very appropriate for a GPU driver to use this interface. Is that right? I think not, though most of the users are switch or ethernet devices. It doesn't prevent you from inventing new abstractions. Note that devlink is based on netlink; netlink has been widely used by various subsystems other than networking. Thanks Thanks Yan
Re: device compatibility interface for live migration with assigned devices
On 2020/8/5 3:56 PM, Jiri Pirko wrote: Wed, Aug 05, 2020 at 04:41:54AM CEST, jasow...@redhat.com wrote: On 2020/8/5 10:16 AM, Yan Zhao wrote: On Wed, Aug 05, 2020 at 10:22:15AM +0800, Jason Wang wrote: On 2020/8/5 12:35 AM, Cornelia Huck wrote: [sorry about not chiming in earlier] On Wed, 29 Jul 2020 16:05:03 +0800 Yan Zhao wrote: On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote: (...) Based on the feedback we've received, the previously proposed interface is not viable. I think there's agreement that the user needs to be able to parse and interpret the version information. Using json seems viable, but I don't know if it's the best option. Is there any precedent of markup strings returned via sysfs we could follow? I don't think encoding complex information in a sysfs file is a viable approach. Quoting Documentation/filesystems/sysfs.rst: "Attributes should be ASCII text files, preferably with only one value per file. It is noted that it may not be efficient to contain only one value per file, so it is socially acceptable to express an array of values of the same type. Mixing types, expressing multiple lines of data, and doing fancy formatting of data is heavily frowned upon." Even though this is an older file, I think these restrictions still apply. +1, that's another reason why devlink (netlink) is better. hi Jason, do you have any materials or sample code about devlink, so we can have a good study of it? I found some kernel docs about it but my preliminary study didn't show me the advantage of devlink. CC Jiri and Parav for a better answer for this. My understanding is that the following advantages are obvious (as I replied in another thread):
- existing users (NIC, crypto, SCSI, ib), mature and stable
- much better error reporting (ext_ack other than string or errno)
- namespace aware
- do not couple with kobject
Jason, what is your use case? I think the use case is to report device compatibility for live migration.
Yan proposed a simple sysfs based migration version first, but it looks not sufficient and something based on JSON is discussed. Yan, can you help to summarize the discussion so far for Jiri as a reference? Thanks Thanks Thanks Yan
Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
On 2020/8/5 4:30 AM, Peter Xu wrote: On Mon, Aug 03, 2020 at 06:00:34PM +0200, Eugenio Pérez wrote: On Fri, 2020-07-03 at 15:24 +0800, Jason Wang wrote: On 2020/7/2 11:45 PM, Peter Xu wrote: On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote: So I think we agree that a new notifier is needed? Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)? That should work, but I wonder whether something like the following is better. Instead of introducing new flags, how about carrying the type of event in the notifier, so the device (vhost) can choose the messages it wants to process, like: static void vhost_iommu_event(IOMMUNotifier *n, IOMMUTLBEvent *event) { switch (event->type) { case IOMMU_MAP: case IOMMU_UNMAP: case IOMMU_DEV_IOTLB_UNMAP: ... } } Thanks Hi! Sorry, I thought I had this clear but now it seems not so clear to me. Do you mean to add that switch to the current vhost_iommu_unmap_notify, and then the "type" field to the IOMMUTLBEntry? Is that the scope of the changes, or is there something I'm missing? If that is correct, what is the advantage for vhost or other notifiers? I understand that moving the IOMMUTLBEntry (addr, len) -> (iova, mask) split/transformation into the different notifier implementations could pollute them, but that is an even deeper change, and vhost is not interested in any events but IOMMU_UNMAP, is it? On the other hand, who decides what the type of an event is? If I follow the backtrace of the assert in https://lists.gnu.org/archive/html/qemu-devel/2020-07/msg01015.html, it seems to me that it should be vtd_process_device_iotlb_desc. How do I know if it should be IOMMU_UNMAP or IOMMU_DEV_IOTLB_UNMAP? If I set it in some function of memory.c, I should decide the type by looking at the actual notifier, shouldn't I? (Since Jason didn't reply yesterday, I'll try to; Jason, feel free to correct me...) IMHO whether to put the type into the IOMMUTLBEntry is not important.
The important change should be that we introduce IOMMU_DEV_IOTLB_UNMAP (or I'd rather call it IOMMU_DEV_IOTLB directly which is shorter and cleaner). With that information we can make the failing assertion conditional for MAP/UNMAP only. Or having another dedicated device IOTLB notifier. We can also allow dev-iotlb messages to take arbitrary addr_mask (so it becomes a length of address range; imho we can keep using addr_mask for simplicity, but we can comment for addr_mask that for dev-iotlb it can be not a real mask). Yes. Thanks Thanks,
Re: device compatibility interface for live migration with assigned devices
On 2020/8/5 上午10:16, Yan Zhao wrote: On Wed, Aug 05, 2020 at 10:22:15AM +0800, Jason Wang wrote: On 2020/8/5 上午12:35, Cornelia Huck wrote: [sorry about not chiming in earlier] On Wed, 29 Jul 2020 16:05:03 +0800 Yan Zhao wrote: On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote: (...) Based on the feedback we've received, the previously proposed interface is not viable. I think there's agreement that the user needs to be able to parse and interpret the version information. Using json seems viable, but I don't know if it's the best option. Is there any precedent of markup strings returned via sysfs we could follow? I don't think encoding complex information in a sysfs file is a viable approach. Quoting Documentation/filesystems/sysfs.rst: "Attributes should be ASCII text files, preferably with only one value per file. It is noted that it may not be efficient to contain only one value per file, so it is socially acceptable to express an array of values of the same type. Mixing types, expressing multiple lines of data, and doing fancy formatting of data is heavily frowned upon." Even though this is an older file, I think these restrictions still apply. +1, that's another reason why devlink(netlink) is better. hi Jason, do you have any materials or sample code about devlink, so we can have a good study of it? I found some kernel docs about it but my preliminary study didn't show me the advantage of devlink. CC Jiri and Parav for a better answer for this. My understanding is that the following advantages are obvious (as I replied in another thread): - existing users (NIC, crypto, SCSI, ib), mature and stable - much better error reporting (ext_ack other than string or errno) - namespace aware - do not couple with kobject Thanks Thanks Yan
Re: device compatibility interface for live migration with assigned devices
On 2020/8/5 上午12:35, Cornelia Huck wrote: [sorry about not chiming in earlier] On Wed, 29 Jul 2020 16:05:03 +0800 Yan Zhao wrote: On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote: (...) Based on the feedback we've received, the previously proposed interface is not viable. I think there's agreement that the user needs to be able to parse and interpret the version information. Using json seems viable, but I don't know if it's the best option. Is there any precedent of markup strings returned via sysfs we could follow? I don't think encoding complex information in a sysfs file is a viable approach. Quoting Documentation/filesystems/sysfs.rst: "Attributes should be ASCII text files, preferably with only one value per file. It is noted that it may not be efficient to contain only one value per file, so it is socially acceptable to express an array of values of the same type. Mixing types, expressing multiple lines of data, and doing fancy formatting of data is heavily frowned upon." Even though this is an older file, I think these restrictions still apply. +1, that's another reason why devlink(netlink) is better. Thanks
Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
On 2020/7/20 9:03 PM, Peter Xu wrote: On Mon, Jul 20, 2020 at 12:02:06PM +0800, Jason Wang wrote: Right, so there's no need to deal with unmap in vtd's replay implementation (as the generic one does). We don't even for now; see vtd_page_walk_info.notify_unmap. Thanks, Right, but I meant the vtd_address_space_unmap() in vtd_iommu_replay(). It looks to me that it will trigger UNMAP notifiers. Thanks
Re: device compatibility interface for live migration with assigned devices
On 2020/7/20 6:39 PM, Sean Mooney wrote: On Mon, 2020-07-20 at 11:41 +0800, Jason Wang wrote: On 2020/7/18 12:12 AM, Alex Williamson wrote: On Thu, 16 Jul 2020 16:32:30 +0800 Yan Zhao wrote: On Thu, Jul 16, 2020 at 12:16:26PM +0800, Jason Wang wrote: On 2020/7/14 7:29 AM, Yan Zhao wrote: hi folks, we are defining a device migration compatibility interface that helps upper layer stack like openstack/ovirt/libvirt to check if two devices are live migration compatible. The "devices" here could be MDEVs, physical devices, or a hybrid of the two. e.g. we could use it to check whether - a src MDEV can migrate to a target MDEV, - a src VF in SRIOV can migrate to a target VF in SRIOV, - a src MDEV can migrate to a target VF in SRIOV. (e.g. SIOV/SRIOV backward compatibility case) The upper layer stack could use this interface as the last step to check if one device is able to migrate to another device before triggering a real live migration procedure. we are not sure if this interface is of value or help to you. please don't hesitate to drop your valuable comments. (1) interface definition The interface is defined in the below way:

                   userspace
                  /         \
            read /           \ write
                /             v
      -------------------   -------------------
     | migration_version | | migration_version | --> check migration
      -------------------   -------------------      compatibility
          device A               device B

a device attribute named migration_version is defined under each device's sysfs node. e.g. (/sys/bus/pci/devices/\:00\:02.0/$mdev_UUID/migration_version). Are you aware of the devlink based device management interface that is proposed upstream? I think it has many advantages over sysfs, do you consider switching to that? Advantages, such as? My understanding for devlink(netlink) over sysfs (some were mentioned at the time of the vDPA sysfs mgmt API discussion) are: i thought netlink was used more as a configuration protocol to query and configure NICs, and i guess other devices in its devlink form, requiring a tool to be written that can speak the protocol to interact with it.
the primary advantage of sysfs is that everything is just a file. there are no additional dependencies needed. Well, if you try to build logic like introspection on top for sophisticated hardware, you probably need a library on top. And its attribute-per-file model is pretty inefficient. and unlike netlink there are no interoperability issues in a containerised env. if you are using different versions of libc and gcc in the container vs the host, my understanding is that tools like ethtool from ubuntu deployed in a container on a centos host can have issues communicating with the host kernel. The kernel provides a stable ABI for userspace, so that's not something we can't fix. if it's just a file, unless the format the data is returned in changes, or the layout of sysfs changes, it's compatible regardless of what you use to read it. I believe you can't change the sysfs layout, which is part of the uABI. But as I mentioned below, sysfs has several drawbacks. There's no harm in comparing different approaches when you start a new device management API. Thanks - existing users (NIC, crypto, SCSI, ib), mature and stable - much better error reporting (ext_ack other than string or errno) - namespace aware - do not couple with kobject Thanks
Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
On 2020/7/17 下午10:18, Peter Xu wrote: On Thu, Jul 16, 2020 at 10:54:31AM +0800, Jason Wang wrote: On 2020/7/16 上午9:00, Peter Xu wrote: On Mon, Jul 13, 2020 at 12:04:16PM +0800, Jason Wang wrote: On 2020/7/10 下午9:30, Peter Xu wrote: On Fri, Jul 10, 2020 at 02:34:11PM +0800, Jason Wang wrote: On 2020/7/9 下午10:10, Peter Xu wrote: On Thu, Jul 09, 2020 at 01:58:33PM +0800, Jason Wang wrote: - If we care the performance, it's better to implement the MAP event for vhost, otherwise it could be a lot of IOTLB miss I feel like these are two things. So far what we are talking about is whether vt-d should have knowledge about what kind of events one iommu notifier is interested in. I still think we should keep this as answered in question 1. The other question is whether we want to switch vhost from UNMAP to MAP/UNMAP events even without vDMA, so that vhost can establish the mapping even before IO starts. IMHO it's doable, but only if the guest runs DPDK workloads. When the guest is using dynamic iommu page mappings, I feel like that can be even slower, because then the worst case is for each IO we'll need to vmexit twice: - The first vmexit caused by an invalidation to MAP the page tables, so vhost will setup the page table before IO starts - IO/DMA triggers and completes - The second vmexit caused by another invalidation to UNMAP the page tables So it seems to be worse than when vhost only uses UNMAP like right now. At least we only have one vmexit (when UNMAP). We'll have a vhost translate() request from kernel to userspace, but IMHO that's cheaper than the vmexit. Right but then I would still prefer to have another notifier. Since vtd_page_walk has nothing to do with device IOTLB. IOMMU have a dedicated command for flushing device IOTLB. But the check for vtd_as_has_map_notifier is used to skip the device which can do demand paging via ATS or device specific way. If we have two different notifiers, vhost will be on the device iotlb notifier so we don't need it at all? 
But we can still have an iommu notifier that only registers to UNMAP even after we introduce the dev-iotlb notifier? We don't want to do page walk for them as well. TCG should be the only one so far, but I don't know.. maybe there can still be new ones? I think you're right. But looking at the codes, it looks like the check of vtd_as_has_map_notifier() was only used in: 1) vtd_iommu_replay() 2) vtd_iotlb_page_invalidate_notify() (PSI) For the replay, it's expensive anyhow. For PSI, I think it's just about one or a few mappings, not sure it will have obvious performance impact. And I had two questions: 1) The code doesn't check map for DSI or GI, does this match what the spec said? (It looks to me the spec is unclear in this part) Both DSI/GI should cover maps too? E.g. vtd_sync_shadow_page_table() in vtd_iotlb_domain_invalidate(). I meant the code doesn't check whether there's a MAP notifier :) It's actually checked, because it loops over vtd_as_with_notifiers, and only MAP notifiers register to that. :) I may miss something but I don't find the code to block UNMAP notifiers? vhost_iommu_region_add() memory_region_register_iommu_notifier() memory_region_update_iommu_notify_flags() imrc->notify_flag_changed() vtd_iommu_notify_flag_changed() ? Yeah I think you're right. I might have confused it with some previous implementations. Maybe we should also do a similar thing for DSI/GI just like what we do in PSI. Ok. 2) for the replay() I don't see other implementations (either spapr or the generic one) that did unmap (actually they skip unmap explicitly), any reason for doing this in the intel IOMMU? I could be wrong, but I'd guess it's because vt-d implemented the caching mode by leveraging the same invalidation structure, so it's harder to make all things right (IOW, we can't clearly identify MAP with UNMAP when we receive an invalidation request, because MAP/UNMAP requests look the same).
I didn't check others, but I believe spapr is doing it differently by using some hypercalls to deliver IOMMU map/unmap requests, which seems a bit close to what virtio-iommu is doing. Anyway, the point is if we have explicit MAP/UNMAP from the guest, logically the replay indeed does not need to do any unmap, because we don't need to call replay() on an already existing device but only for e.g. hot plug. But this looks to conflict with what memory_region_iommu_replay() did; for an IOMMU that doesn't have a replay method, it skips UNMAP requests: for (addr = 0; addr < memory_region_size(mr); addr += granularity) { iotlb = imrc->translate(iommu_mr, addr, IOMMU_NONE, n->iommu_idx); if (iotlb.perm != IOMMU_NONE) { n->notify(n, &iotlb); } } I guess there's no knowledge of whether the guest has an explicit MAP/UNMAP for this generic code. Or does replay imply that the guest doesn't have explicit MAP/UNMAP? I think it
Re: device compatibility interface for live migration with assigned devices
On 2020/7/18 上午12:12, Alex Williamson wrote: On Thu, 16 Jul 2020 16:32:30 +0800 Yan Zhao wrote: On Thu, Jul 16, 2020 at 12:16:26PM +0800, Jason Wang wrote: On 2020/7/14 上午7:29, Yan Zhao wrote: hi folks, we are defining a device migration compatibility interface that helps upper layer stack like openstack/ovirt/libvirt to check if two devices are live migration compatible. The "devices" here could be MDEVs, physical devices, or hybrid of the two. e.g. we could use it to check whether - a src MDEV can migrate to a target MDEV, - a src VF in SRIOV can migrate to a target VF in SRIOV, - a src MDEV can migration to a target VF in SRIOV. (e.g. SIOV/SRIOV backward compatibility case) The upper layer stack could use this interface as the last step to check if one device is able to migrate to another device before triggering a real live migration procedure. we are not sure if this interface is of value or help to you. please don't hesitate to drop your valuable comments. (1) interface definition The interface is defined in below way: __userspace /\ \ / \write / read \ /__ ___\|/_ | migration_version | | migration_version |-->check migration - - compatibility device Adevice B a device attribute named migration_version is defined under each device's sysfs node. e.g. (/sys/bus/pci/devices/\:00\:02.0/$mdev_UUID/migration_version). Are you aware of the devlink based device management interface that is proposed upstream? I think it has many advantages over sysfs, do you consider to switch to that? Advantages, such as? My understanding for devlink(netlink) over sysfs (some are mentioned at the time of vDPA sysfs mgmt API discussion) are: - existing users (NIC, crypto, SCSI, ib), mature and stable - much better error reporting (ext_ack other than string or errno) - namespace aware - do not couple with kobject Thanks
Re: device compatibility interface for live migration with assigned devices
On 2020/7/16 下午4:32, Yan Zhao wrote: On Thu, Jul 16, 2020 at 12:16:26PM +0800, Jason Wang wrote: On 2020/7/14 上午7:29, Yan Zhao wrote: hi folks, we are defining a device migration compatibility interface that helps upper layer stack like openstack/ovirt/libvirt to check if two devices are live migration compatible. The "devices" here could be MDEVs, physical devices, or hybrid of the two. e.g. we could use it to check whether - a src MDEV can migrate to a target MDEV, - a src VF in SRIOV can migrate to a target VF in SRIOV, - a src MDEV can migration to a target VF in SRIOV. (e.g. SIOV/SRIOV backward compatibility case) The upper layer stack could use this interface as the last step to check if one device is able to migrate to another device before triggering a real live migration procedure. we are not sure if this interface is of value or help to you. please don't hesitate to drop your valuable comments. (1) interface definition The interface is defined in below way: __userspace /\ \ / \write / read \ /__ ___\|/_ | migration_version | | migration_version |-->check migration - - compatibility device Adevice B a device attribute named migration_version is defined under each device's sysfs node. e.g. (/sys/bus/pci/devices/\:00\:02.0/$mdev_UUID/migration_version). Are you aware of the devlink based device management interface that is proposed upstream? I think it has many advantages over sysfs, do you consider to switch to that? not familiar with the devlink. will do some research of it. userspace tools read the migration_version as a string from the source device, and write it to the migration_version sysfs attribute in the target device. 
The userspace should treat ANY of the below conditions as two devices not compatible: - any one of the two devices does not have a migration_version attribute - error when reading from the migration_version attribute of one device - error when writing the migration_version string of one device to the migration_version attribute of the other device The string read from the migration_version attribute is defined by the device vendor driver and is completely opaque to the userspace. My understanding is that something opaque to userspace is not the philosophy of Linux. But VFIO live migration in itself is essentially a big opaque stream to userspace. I think it's better not to limit the kernel interface to a specific use case. This is basically device introspection. Instead of having a generic API but an opaque value, why not do it in a vendor specific way, like: 1) exposing the device capability in a vendor specific way via sysfs/devlink or another API 2) management reads the capability on both src and dst and determines whether we can do the migration. This is the way we plan to do it with vDPA. yes, in another reply, Alex proposed to use an interface in json format. I guess we can define something like

{
  "self" : [
    {
      "pciid" : "8086591d",
      "driver" : "i915",
      "gvt-version" : "v1",
      "mdev_type" : "i915-GVTg_V5_2",
      "aggregator" : "1",
      "pv-mode" : "none"
    }
  ],
  "compatible" : [
    {
      "pciid" : "8086591d",
      "driver" : "i915",
      "gvt-version" : "v1",
      "mdev_type" : "i915-GVTg_V5_2",
      "aggregator" : "1",
      "pv-mode" : "none"
    },
    {
      "pciid" : "8086591d",
      "driver" : "i915",
      "gvt-version" : "v1",
      "mdev_type" : "i915-GVTg_V5_4",
      "aggregator" : "2",
      "pv-mode" : "none"
    },
    {
      "pciid" : "8086591d",
      "driver" : "i915",
      "gvt-version" : "v2",
      "mdev_type" : "i915-GVTg_V5_4",
      "aggregator" : "2",
      "pv-mode" : "none, ppgtt, context"
    }
    ...
  ]
}

This is probably another call for a devlink based interface.
But as those fields are mostly vendor specific, the userspace can only do simple string comparing; I guess the list would be very long, as it needs to enumerate all possible targets. also, in some fields like "gvt-version", is there a simple way to express things like v2+? That's totally vendor specific, I think. If "v2+" means it only supports version 2 and up, we can introduce fields like min_version and max_version. B
Re: device compatibility interface for live migration with assigned devices
On 2020/7/14 7:29 AM, Yan Zhao wrote: hi folks, we are defining a device migration compatibility interface that helps upper layer stack like openstack/ovirt/libvirt to check if two devices are live migration compatible. The "devices" here could be MDEVs, physical devices, or a hybrid of the two. e.g. we could use it to check whether - a src MDEV can migrate to a target MDEV, - a src VF in SRIOV can migrate to a target VF in SRIOV, - a src MDEV can migrate to a target VF in SRIOV. (e.g. SIOV/SRIOV backward compatibility case) The upper layer stack could use this interface as the last step to check if one device is able to migrate to another device before triggering a real live migration procedure. we are not sure if this interface is of value or help to you. please don't hesitate to drop your valuable comments. (1) interface definition The interface is defined in the below way:

                   userspace
                  /         \
            read /           \ write
                /             v
      -------------------   -------------------
     | migration_version | | migration_version | --> check migration
      -------------------   -------------------      compatibility
          device A               device B

a device attribute named migration_version is defined under each device's sysfs node. e.g. (/sys/bus/pci/devices/\:00\:02.0/$mdev_UUID/migration_version). Are you aware of the devlink based device management interface that is proposed upstream? I think it has many advantages over sysfs, do you consider switching to that? userspace tools read the migration_version as a string from the source device, and write it to the migration_version sysfs attribute in the target device. The userspace should treat ANY of the below conditions as two devices not compatible: - any one of the two devices does not have a migration_version attribute - error when reading from the migration_version attribute of one device - error when writing the migration_version string of one device to the migration_version attribute of the other device The string read from the migration_version attribute is defined by the device vendor driver and is completely opaque to the userspace.
My understanding is that something opaque to userspace is not the philosophy of Linux. Instead of having a generic API but an opaque value, why not do it in a vendor specific way, like: 1) exposing the device capability in a vendor specific way via sysfs/devlink or another API 2) management reads the capability on both src and dst and determines whether we can do the migration. This is the way we plan to do it with vDPA. Thanks For an Intel vGPU, the string format can be defined like "parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator count". for an NVMe VF connecting to remote storage, it could be "PCI ID" + "driver version" + "configured remote storage URL" for a QAT VF, it may be "PCI ID" + "driver version" + "supported encryption set". (to avoid namespace conflicts between vendors, we may prefix a driver name to each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1) (2) backgrounds The reason we hope the migration_version string is opaque to the userspace is that it is hard to generalize standard comparing fields and comparing methods for different devices from different vendors. Though userspace now could still do a simple string compare to check if two devices are compatible, and the result should also be right, it's still too limited, as it excludes possible candidates whose migration_version strings fail to be equal. e.g. an MDEV with mdev_type_1, aggregator count 3 is probably compatible with another MDEV with mdev_type_3, aggregator count 1, even though their migration_version strings are not equal. (assuming mdev_type_3 has 3 times the resources of mdev_type_1). besides that, driver version + configured resources are all elements that demand to be taken into account. So, we hope to leave the freedom to the vendor driver and let it make the final decision, via a simple read from the source side and a test write on the target side.
we then think the device compatibility issues for live migration with assigned devices can be divided into two steps: a. management tools filter out possible migration target devices. Tags could be created according to info from product specification. we think openstack/ovirt may have vendor proprietary components to create those customized tags for each product from each vendor. e.g. for Intel vGPU, with a vGPU(a MDEV device) in source side, the tags to search target vGPU are like: a tag for compatible parent PCI IDs, a tag for a range of gvt driver versions, a tag for a range of mdev type + aggregator count for NVMe VF, the tags to search target VF may be like: a tag for compatible PCI IDs, a tag for a range of driver versions, a tag for URL of configured remote storage. b. with the output from
Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
On 2020/7/16 上午9:00, Peter Xu wrote: On Mon, Jul 13, 2020 at 12:04:16PM +0800, Jason Wang wrote: On 2020/7/10 下午9:30, Peter Xu wrote: On Fri, Jul 10, 2020 at 02:34:11PM +0800, Jason Wang wrote: On 2020/7/9 下午10:10, Peter Xu wrote: On Thu, Jul 09, 2020 at 01:58:33PM +0800, Jason Wang wrote: - If we care the performance, it's better to implement the MAP event for vhost, otherwise it could be a lot of IOTLB miss I feel like these are two things. So far what we are talking about is whether vt-d should have knowledge about what kind of events one iommu notifier is interested in. I still think we should keep this as answered in question 1. The other question is whether we want to switch vhost from UNMAP to MAP/UNMAP events even without vDMA, so that vhost can establish the mapping even before IO starts. IMHO it's doable, but only if the guest runs DPDK workloads. When the guest is using dynamic iommu page mappings, I feel like that can be even slower, because then the worst case is for each IO we'll need to vmexit twice: - The first vmexit caused by an invalidation to MAP the page tables, so vhost will setup the page table before IO starts - IO/DMA triggers and completes - The second vmexit caused by another invalidation to UNMAP the page tables So it seems to be worse than when vhost only uses UNMAP like right now. At least we only have one vmexit (when UNMAP). We'll have a vhost translate() request from kernel to userspace, but IMHO that's cheaper than the vmexit. Right but then I would still prefer to have another notifier. Since vtd_page_walk has nothing to do with device IOTLB. IOMMU have a dedicated command for flushing device IOTLB. But the check for vtd_as_has_map_notifier is used to skip the device which can do demand paging via ATS or device specific way. If we have two different notifiers, vhost will be on the device iotlb notifier so we don't need it at all? 
But we can still have an iommu notifier that only registers to UNMAP even after we introduce the dev-iotlb notifier? We don't want to do page walk for them as well. TCG should be the only one so far, but I don't know.. maybe there can still be new ones? I think you're right. But looking at the codes, it looks like the check of vtd_as_has_map_notifier() was only used in: 1) vtd_iommu_replay() 2) vtd_iotlb_page_invalidate_notify() (PSI) For the replay, it's expensive anyhow. For PSI, I think it's just about one or a few mappings, not sure it will have obvious performance impact. And I had two questions: 1) The code doesn't check map for DSI or GI, does this match what the spec said? (It looks to me the spec is unclear in this part) Both DSI/GI should cover maps too? E.g. vtd_sync_shadow_page_table() in vtd_iotlb_domain_invalidate(). I meant the code doesn't check whether there's a MAP notifier :) It's actually checked, because it loops over vtd_as_with_notifiers, and only MAP notifiers register to that. :) I may miss something but I don't find the code to block UNMAP notifiers? vhost_iommu_region_add() memory_region_register_iommu_notifier() memory_region_update_iommu_notify_flags() imrc->notify_flag_changed() vtd_iommu_notify_flag_changed() ? But I agree with you that it should be cleaner to introduce the dev-iotlb notifier type. Yes, it can solve the duplicated UNMAP issue of vhost. 2) for the replay() I don't see other implementations (either spapr or the generic one) that did unmap (actually they skip unmap explicitly), any reason for doing this in the intel IOMMU? I could be wrong, but I'd guess it's because vt-d implemented the caching mode by leveraging the same invalidation structure, so it's harder to make all things right (IOW, we can't clearly identify MAP with UNMAP when we receive an invalidation request, because MAP/UNMAP requests look the same).
I didn't check others, but I believe spapr is doing it differently by using some hypercalls to deliver IOMMU map/unmap requests, which seems a bit close to what virtio-iommu is doing. Anyway, the point is if we have explicit MAP/UNMAP from the guest, logically the replay indeed does not need to do any unmap, because we don't need to call replay() on an already existing device but only for e.g. hot plug. But this looks to conflict with what memory_region_iommu_replay() did; for an IOMMU that doesn't have a replay method, it skips UNMAP requests: for (addr = 0; addr < memory_region_size(mr); addr += granularity) { iotlb = imrc->translate(iommu_mr, addr, IOMMU_NONE, n->iommu_idx); if (iotlb.perm != IOMMU_NONE) { n->notify(n, &iotlb); } } I guess there's no knowledge of whether the guest has an explicit MAP/UNMAP for this generic code. Or does replay imply that the guest doesn't have explicit MAP/UNMAP? I think it matches exactly with a hot plug case? Note that when IOMMU_NONE could also mean the translation does not exist. So it's actuall
Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
On 2020/7/10 下午9:30, Peter Xu wrote: On Fri, Jul 10, 2020 at 02:34:11PM +0800, Jason Wang wrote: On 2020/7/9 下午10:10, Peter Xu wrote: On Thu, Jul 09, 2020 at 01:58:33PM +0800, Jason Wang wrote: - If we care the performance, it's better to implement the MAP event for vhost, otherwise it could be a lot of IOTLB miss I feel like these are two things. So far what we are talking about is whether vt-d should have knowledge about what kind of events one iommu notifier is interested in. I still think we should keep this as answered in question 1. The other question is whether we want to switch vhost from UNMAP to MAP/UNMAP events even without vDMA, so that vhost can establish the mapping even before IO starts. IMHO it's doable, but only if the guest runs DPDK workloads. When the guest is using dynamic iommu page mappings, I feel like that can be even slower, because then the worst case is for each IO we'll need to vmexit twice: - The first vmexit caused by an invalidation to MAP the page tables, so vhost will setup the page table before IO starts - IO/DMA triggers and completes - The second vmexit caused by another invalidation to UNMAP the page tables So it seems to be worse than when vhost only uses UNMAP like right now. At least we only have one vmexit (when UNMAP). We'll have a vhost translate() request from kernel to userspace, but IMHO that's cheaper than the vmexit. Right but then I would still prefer to have another notifier. Since vtd_page_walk has nothing to do with device IOTLB. IOMMU have a dedicated command for flushing device IOTLB. But the check for vtd_as_has_map_notifier is used to skip the device which can do demand paging via ATS or device specific way. If we have two different notifiers, vhost will be on the device iotlb notifier so we don't need it at all? But we can still have iommu notifier that only registers to UNMAP even after we introduce dev-iotlb notifier? We don't want to do page walk for them as well. 
TCG should be the only one so far, but I don't know.. maybe there can still be new ones? I think you're right. But looking at the code, it looks like the check of vtd_as_has_map_notifier() is only used in: 1) vtd_iommu_replay() 2) vtd_iotlb_page_invalidate_notify() (PSI) For the replay, it's expensive anyhow. For PSI, I think it's just about one or a few mappings; not sure it will have an obvious performance impact. And I had two questions: 1) The code doesn't check map for DSI or GI; does this match what the spec says? (It looks to me like the spec is unclear in this part) Both DSI/GI should cover maps too? E.g. vtd_sync_shadow_page_table() in vtd_iotlb_domain_invalidate(). I meant the code doesn't check whether there's a MAP notifier :) 2) For the replay(), I don't see other implementations (either spapr or the generic one) that do unmap (actually they skip unmap explicitly); any reason for doing this in the intel IOMMU? I could be wrong, but I'd guess it's because vt-d implemented caching mode by leveraging the same invalidation structure, so it's harder to make all things right (IOW, we can't clearly distinguish MAP from UNMAP when we receive an invalidation request, because MAP/UNMAP requests look the same). I didn't check the others, but I believe spapr is doing it differently by using some hypercalls to deliver IOMMU map/unmap requests, which seems a bit close to what virtio-iommu is doing. Anyway, the point is that if we have explicit MAP/UNMAP from the guest, logically the replay indeed does not need to do any unmap, because we don't need to call replay() on an already existing device but only for e.g. hot plug.
But this looks to conflict with what memory_region_iommu_replay() does; for an IOMMU that doesn't have a replay method, it skips UNMAP requests:

for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
    iotlb = imrc->translate(iommu_mr, addr, IOMMU_NONE, n->iommu_idx);
    if (iotlb.perm != IOMMU_NONE) {
        n->notify(n, &iotlb);
    }
}

I guess there's no knowledge of whether the guest has explicit MAP/UNMAP in this generic code. Or does replay imply that the guest doesn't have explicit MAP/UNMAP? (btw, the code shortcuts memory_region_notify_one(); not sure of the reason) VT-d does not have that clear interface, so VT-d needs to maintain its own mapping structures, and also vt-d is using the same replay & page_walk operations to sync all these structures, which complicates the vt-d replay a bit. With that, we assume replay() can be called anytime on a device, and we won't notify duplicated MAPs to lower layers like vfio if it is mapped before. At the same time, since we'll compare the latest mapping with the one we cached in the iova tree, UNMAP becomes possible too. AFAIK vtd_iommu_replay() does a complete UNMAP first: /* * The replay can be triggered by either an invalidation or a newly * created entry. No matter what, we release existing mappings * (it means
Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
On 2020/7/9 10:10 PM, Peter Xu wrote: On Thu, Jul 09, 2020 at 01:58:33PM +0800, Jason Wang wrote: - If we care about performance, it's better to implement the MAP event for vhost, otherwise there could be a lot of IOTLB misses. I feel like these are two things. So far what we are talking about is whether vt-d should have knowledge about what kind of events one iommu notifier is interested in. I still think we should keep this as answered in question 1. The other question is whether we want to switch vhost from UNMAP to MAP/UNMAP events even without vDMA, so that vhost can establish the mapping even before IO starts. IMHO it's doable, but only if the guest runs DPDK workloads. When the guest is using dynamic iommu page mappings, I feel like that can be even slower, because then the worst case is that for each IO we'll need to vmexit twice: - The first vmexit caused by an invalidation to MAP the page tables, so vhost will set up the page table before IO starts - IO/DMA triggers and completes - The second vmexit caused by another invalidation to UNMAP the page tables So it seems to be worse than when vhost only uses UNMAP like right now. At least we only have one vmexit (when UNMAP). We'll have a vhost translate() request from kernel to userspace, but IMHO that's cheaper than the vmexit. Right, but then I would still prefer to have another notifier, since vtd_page_walk has nothing to do with the device IOTLB. The IOMMU has a dedicated command for flushing the device IOTLB. But the check for vtd_as_has_map_notifier is used to skip devices that can do demand paging via ATS or a device-specific way. If we have two different notifiers, vhost will be on the device iotlb notifier, so we don't need it at all? But we can still have an iommu notifier that only registers for UNMAP even after we introduce the dev-iotlb notifier? We don't want to do page walks for them as well. TCG should be the only one so far, but I don't know.. maybe there can still be new ones? I think you're right.
But looking at the code, it looks like the check of vtd_as_has_map_notifier() is only used in: 1) vtd_iommu_replay() 2) vtd_iotlb_page_invalidate_notify() (PSI) For the replay, it's expensive anyhow. For PSI, I think it's just about one or a few mappings; not sure it will have an obvious performance impact. And I had two questions: 1) The code doesn't check map for DSI or GI; does this match what the spec says? (It looks to me like the spec is unclear in this part) 2) For the replay(), I don't see other implementations (either spapr or the generic one) that do unmap (actually they skip unmap explicitly); any reason for doing this in the intel IOMMU? Thanks Thanks,
Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
On 2020/7/8 10:16 PM, Peter Xu wrote: On Wed, Jul 08, 2020 at 01:42:30PM +0800, Jason Wang wrote: So it should be functionally equivalent to vtd_as_has_notifier(). For example: in vtd_iommu_replay() we'll skip the replay if vhost has registered the iommu notifier, because vtd_as_has_map_notifier() will return false. Two questions: - Do we care about performance here? If not, vhost may just ignore the MAP event? I think we care, because vtd_page_walk() can be expensive. Ok. - If we care about performance, it's better to implement the MAP event for vhost, otherwise there could be a lot of IOTLB misses. I feel like these are two things. So far what we are talking about is whether vt-d should have knowledge about what kind of events one iommu notifier is interested in. I still think we should keep this as answered in question 1. The other question is whether we want to switch vhost from UNMAP to MAP/UNMAP events even without vDMA, so that vhost can establish the mapping even before IO starts. IMHO it's doable, but only if the guest runs DPDK workloads. When the guest is using dynamic iommu page mappings, I feel like that can be even slower, because then the worst case is that for each IO we'll need to vmexit twice: - The first vmexit caused by an invalidation to MAP the page tables, so vhost will set up the page table before IO starts - IO/DMA triggers and completes - The second vmexit caused by another invalidation to UNMAP the page tables So it seems to be worse than when vhost only uses UNMAP like right now. At least we only have one vmexit (when UNMAP). We'll have a vhost translate() request from kernel to userspace, but IMHO that's cheaper than the vmexit. Right, but then I would still prefer to have another notifier, since vtd_page_walk has nothing to do with the device IOTLB. The IOMMU has a dedicated command for flushing the device IOTLB. But the check for vtd_as_has_map_notifier is used to skip devices that can do demand paging via ATS or a device-specific way.
If we have two different notifiers, vhost will be on the device iotlb notifier so we don't need it at all? Thanks
Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
On 2020/7/8 3:54 AM, Peter Xu wrote: On Tue, Jul 07, 2020 at 04:03:10PM +0800, Jason Wang wrote: On 2020/7/3 9:03 PM, Peter Xu wrote: On Fri, Jul 03, 2020 at 03:24:19PM +0800, Jason Wang wrote: On 2020/7/2 11:45 PM, Peter Xu wrote: On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote: So I think we agree that a new notifier is needed? Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)? That should work, but I wonder whether something like the following is better. Instead of introducing new flags, how about carrying the type of event in the notifier, so the device (vhost) can choose the messages it wants to process, like: static void vhost_iommu_event(IOMMUNotifier *n, IOMMUTLBEvent *event) { switch (event->type) { case IOMMU_MAP: case IOMMU_UNMAP: case IOMMU_DEV_IOTLB_UNMAP: ... } } Looks ok to me, though imo we should still keep the registration information, so VT-d knows which notifiers are interested in which events. E.g., we can still do something like vtd_as_has_map_notifier(). Is this for better performance? I wonder whether it's worth it, since we can't not have both vhost and vtd to be registered into the same as. /me failed to follow this sentence.. :( Sorry, I meant "vfio" instead of "vtd". So it should be functionally equivalent to vtd_as_has_notifier(). For example: in vtd_iommu_replay() we'll skip the replay if vhost has registered the iommu notifier, because vtd_as_has_map_notifier() will return false. Two questions: - Do we care about performance here? If not, vhost may just ignore the MAP event? - If we care about performance, it's better to implement the MAP event for vhost, otherwise there could be a lot of IOTLB misses. Thanks It'll only return true if the device is a vfio-pci device. Without vtd_as_has_map_notifier(), how could we do that?
Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
On 2020/7/3 9:03 PM, Peter Xu wrote: On Fri, Jul 03, 2020 at 03:24:19PM +0800, Jason Wang wrote: On 2020/7/2 11:45 PM, Peter Xu wrote: On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote: So I think we agree that a new notifier is needed? Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)? That should work, but I wonder whether something like the following is better. Instead of introducing new flags, how about carrying the type of event in the notifier, so the device (vhost) can choose the messages it wants to process, like: static void vhost_iommu_event(IOMMUNotifier *n, IOMMUTLBEvent *event) { switch (event->type) { case IOMMU_MAP: case IOMMU_UNMAP: case IOMMU_DEV_IOTLB_UNMAP: ... } } Looks ok to me, though imo we should still keep the registration information, so VT-d knows which notifiers are interested in which events. E.g., we can still do something like vtd_as_has_map_notifier(). Is this for better performance? I wonder whether it's worth it, since we can't not have both vhost and vtd to be registered into the same as. So it should be functionally equivalent to vtd_as_has_notifier(). Thanks So these are probably two different things: the new IOMMU_NOTIFIER_DEV_IOTLB flag is one, as discussed; the other is whether we would like to introduce IOMMUTLBEvent to include the type, so that we can avoid introducing two notifiers for one device, mainly to distinguish dev-iotlb from unmaps. Thanks,
Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
On 2020/7/2 11:45 PM, Peter Xu wrote: On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote: So I think we agree that a new notifier is needed? Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)? That should work, but I wonder whether something like the following is better. Instead of introducing new flags, how about carrying the type of event in the notifier, so the device (vhost) can choose the messages it wants to process, like:

static void vhost_iommu_event(IOMMUNotifier *n, IOMMUTLBEvent *event)
{
    switch (event->type) {
    case IOMMU_MAP:
    case IOMMU_UNMAP:
    case IOMMU_DEV_IOTLB_UNMAP:
        ...
    }
}

Thanks
Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
On 2020/7/1 4:09 PM, Jason Wang wrote: On 2020/6/30 11:39 PM, Peter Xu wrote: On Tue, Jun 30, 2020 at 10:41:10AM +0800, Jason Wang wrote:

/* According to ATS spec table 2.4:
 * S = 0, bits 15:12 = xxxx    range size: 4K
 * S = 1, bits 15:12 = xxx0    range size: 8K
 * S = 1, bits 15:12 = xx01    range size: 16K
 * S = 1, bits 15:12 = x011    range size: 32K
 * S = 1, bits 15:12 = 0111    range size: 64K
 * ...
 */

Right, but the comment is probably misleading here, since it's for the PCI-E transaction between the IOMMU and the device, not for the device IOTLB invalidation descriptor. For the device IOTLB invalidation descriptor, the spec allows a [0, ~0ULL] invalidation: " 6.5.2.5 Device-TLB Invalidate Descriptor ... Size (S): The size field indicates the number of consecutive pages targeted by this invalidation request. If S field is zero, a single page at page address specified by Address [63:12] is requested to be invalidated. If S field is Set, the least significant bit in the Address field with value 0b indicates the invalidation address range. For example, if S field is Set and Address[12] is Clear, it indicates an 8KB invalidation address range with base address in Address [63:13]. If S field and Address[12] is Set and bit 13 is Clear, it indicates a 16KB invalidation address range with base address in Address [63:14], etc. " So if we receive an address whose bit [63] is 0 and the rest is all 1s, it's then a [0, ~0ULL] invalidation. Yes. I think invalidating the whole range is always fine. It's still not arbitrary, right? E.g., we can't even invalidate (0x1000, 0x3000) with device-iotlb because of the address mask, not to say sub-pages. Yes. How about just converting to use a range [start, end] for any notifier and moving the checks (e.g. the assert) into the actual notifier implementation (vhost or vfio)? IOMMUTLBEntry itself is the abstraction layer of a TLB entry.
Hardware TLB entry is definitely not an arbitrary range either (because AFAICT the hardware should only cache the PFN rather than the address, so at least PAGE_SIZE aligned). Introducing this flag will already make this trickier just to avoid introducing another similar struct to IOMMUTLBEntry, but I really don't want to make it a default option... Not to mention I probably have no reason to urge the rest of the iommu notifier users (tcg, vfio) to change their existing good code to suit any backend that can cooperate with arbitrary address ranges... Ok, so it looks like we need a dedicated notifier for the device IOTLB. Or we can also make a new flag for the device iotlb, just like the current UNMAP? Then we replace the vhost type from UNMAP to DEVICE_IOTLB. But IMHO using the ARBITRARY_LENGTH flag would work in a similar way. A DEVICE_IOTLB flag could also allow virtio/vhost to only receive one invalidation (now IIUC it'll receive both iotlb and device-iotlb for unmapping a page when ats=on), but then ats=on will be a must, and it could break some old (misconfigured) qemu, because afaict previously virtio/vhost could even work with vIOMMU (accidentally) even without ats=on. That's a bug and I don't think we need to work around misconfigured qemu :) IMHO it depends on the strictness we want on the qemu cmdline API. :) We should at least check libvirt to make sure it's always using ats=on; then I agree maybe we can avoid considering the rest... Thanks, Cc the libvirt list, but I think we should fix libvirt if they don't provide "ats=on". Thanks Libvirt looks fine, according to the domain XML documentation[1]: QEMU's virtio devices have some attributes related to the virtio transport under the driver element: The iommu attribute enables the use of emulated IOMMU by the device. The attribute ats controls the Address Translation Service support for PCIe devices. This is needed to make use of IOTLB support (see IOMMU device). Possible values are on or off.
Since 3.5.0. So I think we agree that a new notifier is needed? Thanks [1] https://libvirt.org/formatdomain.html
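For illustration, a sketch of how the attributes quoted from the formatdomain documentation appear in a libvirt domain XML (the network name and the rest of the device are placeholders; only the driver line carries the attributes under discussion):

```xml
<interface type='network'>
  <source network='default'/>
  <model type='virtio'/>
  <!-- iommu enables the emulated IOMMU for this virtio device;
       ats enables Address Translation Service, which is what makes
       device-IOTLB support usable -->
  <driver iommu='on' ats='on'/>
</interface>
```

With this configuration libvirt generates the corresponding iommu_platform/ats properties on the QEMU virtio device, which is the "ats=on" guarantee discussed above.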
Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
On 2020/6/30 11:39 PM, Peter Xu wrote: On Tue, Jun 30, 2020 at 10:41:10AM +0800, Jason Wang wrote:

/* According to ATS spec table 2.4:
 * S = 0, bits 15:12 = xxxx    range size: 4K
 * S = 1, bits 15:12 = xxx0    range size: 8K
 * S = 1, bits 15:12 = xx01    range size: 16K
 * S = 1, bits 15:12 = x011    range size: 32K
 * S = 1, bits 15:12 = 0111    range size: 64K
 * ...
 */

Right, but the comment is probably misleading here, since it's for the PCI-E transaction between the IOMMU and the device, not for the device IOTLB invalidation descriptor. For the device IOTLB invalidation descriptor, the spec allows a [0, ~0ULL] invalidation: " 6.5.2.5 Device-TLB Invalidate Descriptor ... Size (S): The size field indicates the number of consecutive pages targeted by this invalidation request. If S field is zero, a single page at page address specified by Address [63:12] is requested to be invalidated. If S field is Set, the least significant bit in the Address field with value 0b indicates the invalidation address range. For example, if S field is Set and Address[12] is Clear, it indicates an 8KB invalidation address range with base address in Address [63:13]. If S field and Address[12] is Set and bit 13 is Clear, it indicates a 16KB invalidation address range with base address in Address [63:14], etc. " So if we receive an address whose bit [63] is 0 and the rest is all 1s, it's then a [0, ~0ULL] invalidation. Yes. I think invalidating the whole range is always fine. It's still not arbitrary, right? E.g., we can't even invalidate (0x1000, 0x3000) with device-iotlb because of the address mask, not to say sub-pages. Yes. How about just converting to use a range [start, end] for any notifier and moving the checks (e.g. the assert) into the actual notifier implementation (vhost or vfio)? IOMMUTLBEntry itself is the abstraction layer of a TLB entry. Hardware TLB entry is definitely not an arbitrary range either (because AFAICT the hardware should only cache the PFN rather than the address, so at least PAGE_SIZE aligned).
Introducing this flag will already make this trickier just to avoid introducing another similar struct to IOMMUTLBEntry, but I really don't want to make it a default option... Not to mention I probably have no reason to urge the rest of the iommu notifier users (tcg, vfio) to change their existing good code to suit any backend that can cooperate with arbitrary address ranges... Ok, so it looks like we need a dedicated notifier for the device IOTLB. Or we can also make a new flag for the device iotlb, just like the current UNMAP? Then we replace the vhost type from UNMAP to DEVICE_IOTLB. But IMHO using the ARBITRARY_LENGTH flag would work in a similar way. A DEVICE_IOTLB flag could also allow virtio/vhost to only receive one invalidation (now IIUC it'll receive both iotlb and device-iotlb for unmapping a page when ats=on), but then ats=on will be a must, and it could break some old (misconfigured) qemu, because afaict previously virtio/vhost could even work with vIOMMU (accidentally) even without ats=on. That's a bug and I don't think we need to work around misconfigured qemu :) IMHO it depends on the strictness we want on the qemu cmdline API. :) We should at least check libvirt to make sure it's always using ats=on; then I agree maybe we can avoid considering the rest... Thanks, Cc the libvirt list, but I think we should fix libvirt if they don't provide "ats=on". Thanks
Re: [PATCH 4/4] docs: documentation for virtio packed option
On 2020/3/27 9:08 AM, Halil Pasic wrote: On Thu, 26 Mar 2020 11:32:14 +0100 Bjoern Walk wrote: The attribute packed controls the usage of packed virtqueues. This reads to me like: if packed='on' is specified, then packed virtqueues are used (or the device fails); if packed='off' is specified, then the split virtqueue format is used. But that statement is not always true. One might still end up using split despite packed='on'. One example is a vhost-net setup: the in-kernel implementation does not support packed queues, so F_RING_PACKED does not get offered by the device. Another reason why one could end up with F_RING_PACKED not getting negotiated is lack of guest support. Please note that this behavior is a little different from what we do for cpus (and cpu models). There one either gets what one asked for, or the VM won't be constructed. I believe we should at least document this cleanly. Cc-ing Michael and Jason as hypervisor experts on these topics. Regards, Halil I think we need to document that packed=on only enables the qemu support of the packed ring; whether it is actually used depends on the feature negotiation between qemu, the vhost backends and the guest drivers. Thanks Compared to regular split queues, packed queues consist of only a single descriptor ring, replacing the available and used rings, index and descriptor buffer. This can result in better cache utilization and performance. Possible values are on or off. Since 6.3.0 (QEMU and KVM only)
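The attribute under review would be used like this in a domain XML (a sketch; the surrounding interface definition is a placeholder). As the thread stresses, packed='on' only makes QEMU offer VIRTIO_F_RING_PACKED; the split format is still used if the vhost backend or the guest driver lacks support:

```xml
<interface type='network'>
  <source network='default'/>
  <model type='virtio'/>
  <!-- packed='on' enables QEMU's packed-ring support; actual use is
       decided by feature negotiation, not guaranteed by this attribute -->
  <driver packed='on'/>
</interface>
```

This negotiation-dependent behavior is exactly the documentation gap Halil points out: unlike CPU model attributes, the setting is an offer, not a guarantee.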
Re: [libvirt] [RFC PATCH 0/9] Introduce mediate ops in vfio-pci
On 2019/12/12 下午1:47, Yan Zhao wrote: On Thu, Dec 12, 2019 at 11:48:25AM +0800, Jason Wang wrote: On 2019/12/6 下午8:49, Yan Zhao wrote: On Fri, Dec 06, 2019 at 05:40:02PM +0800, Jason Wang wrote: On 2019/12/6 下午4:22, Yan Zhao wrote: On Thu, Dec 05, 2019 at 09:05:54PM +0800, Jason Wang wrote: On 2019/12/5 下午4:51, Yan Zhao wrote: On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote: Hi: On 2019/12/5 上午11:24, Yan Zhao wrote: For SRIOV devices, VFs are passthroughed into guest directly without host driver mediation. However, when VMs migrating with passthroughed VFs, dynamic host mediation is required to (1) get device states, (2) get dirty pages. Since device states as well as other critical information required for dirty page tracking for VFs are usually retrieved from PFs, it is handy to provide an extension in PF driver to centralizingly control VFs' migration. Therefore, in order to realize (1) passthrough VFs at normal time, (2) dynamically trap VFs' bars for dirty page tracking and A silly question, what's the reason for doing this, is this a must for dirty page tracking? For performance consideration. VFs' bars should be passthoughed at normal time and only enter into trap state on need. Right, but how does this matter for the case of dirty page tracking? Take NIC as an example, to trap its VF dirty pages, software way is required to trap every write of ring tail that resides in BAR0. Interesting, but it looks like we need: - decode the instruction - mediate all access to BAR0 All of which seems a great burden for the VF driver. I wonder whether or not doing interrupt relay and tracking head is better in this case. hi Jason not familiar with the way you mentioned. could you elaborate more? It looks to me that you want to intercept the bar that contains the head. Then you can figure out the buffers submitted from driver and you still need to decide a proper time to mark them as dirty. Not need to be accurate, right? 
just a superset of real dirty bitmap is enough. If the superset is too large compared with the dirty pages, it will lead a lot of side effects. What I meant is, intercept the interrupt, then you can figure still figure out the buffers which has been modified by the device and make them as dirty. Then there's no need to trap BAR and do decoding/emulation etc. But it will still be tricky to be correct... intercept the interrupt is a little hard if post interrupt is enabled.. We don't care about the performance too much in this case. Can we simply disable it? I think what you worried about here is the timing to mark dirty pages, right? upon interrupt receiving, you regard DMAs are finished and safe to make them dirty. But with BAR trap way, we at least can keep those dirtied pages as dirty until device stop. Of course we have other methods to optimize it. I'm not sure this will not confuse the migration converge time estimation. There's still no IOMMU Dirty bit available. (3) centralizing VF critical states retrieving and VF controls into one driver, we propose to introduce mediate ops on top of current vfio-pci device driver. _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ __ register mediate ops| ___ ___| | |<---| VF| | | | vfio-pci | | | mediate | | PF driver | | |__|--->| driver | |___| |open(pdev) | --- | | || ||_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _| \|/ \|/ --- |VF | |PF| --- VF mediate driver could be a standalone driver that does not bind to any devices (as in demo code in patches 5-6) or it could be a built-in extension of PF driver (as in patches 7-9) . Rather than directly bind to VF, VF mediate driver register a mediate ops into vfio-pci in driver init. vfio-pci maintains a list of such mediate ops. (Note that: VF mediate driver can register mediate ops into vfio-pci before vfio-pci binding to any devices. And VF mediate driver can support mediating multiple devices.) When opening a device (e.g. 
a VF), vfio-pci goes through the mediate ops list and calls each vfio_pci_mediate_ops->open() with pdev of the opening device as a parameter. VF mediate driver should return success or failure depending on it supports the pdev or not. E.g. VF mediate driver would compare its supported VF devfn with the devfn of the passed-in pdev. Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will stop querying other mediate ops and bind
Re: [libvirt] [RFC PATCH 0/9] Introduce mediate ops in vfio-pci
On 2019/12/7 上午1:42, Alex Williamson wrote: On Fri, 6 Dec 2019 17:40:02 +0800 Jason Wang wrote: On 2019/12/6 下午4:22, Yan Zhao wrote: On Thu, Dec 05, 2019 at 09:05:54PM +0800, Jason Wang wrote: On 2019/12/5 下午4:51, Yan Zhao wrote: On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote: Hi: On 2019/12/5 上午11:24, Yan Zhao wrote: For SRIOV devices, VFs are passthroughed into guest directly without host driver mediation. However, when VMs migrating with passthroughed VFs, dynamic host mediation is required to (1) get device states, (2) get dirty pages. Since device states as well as other critical information required for dirty page tracking for VFs are usually retrieved from PFs, it is handy to provide an extension in PF driver to centralizingly control VFs' migration. Therefore, in order to realize (1) passthrough VFs at normal time, (2) dynamically trap VFs' bars for dirty page tracking and A silly question, what's the reason for doing this, is this a must for dirty page tracking? For performance consideration. VFs' bars should be passthoughed at normal time and only enter into trap state on need. Right, but how does this matter for the case of dirty page tracking? Take NIC as an example, to trap its VF dirty pages, software way is required to trap every write of ring tail that resides in BAR0. Interesting, but it looks like we need: - decode the instruction - mediate all access to BAR0 All of which seems a great burden for the VF driver. I wonder whether or not doing interrupt relay and tracking head is better in this case. This sounds like a NIC specific solution, I believe the goal here is to allow any device type to implement a partial mediation solution, in this case to sufficiently track the device while in the migration saving state. I suspect there's a solution that can work for any device type. E.g for virtio, avail index (head) doesn't belongs to any BAR and device may decide to disable doorbell from guest. 
So did interrupt relay since driver may choose to disable interrupt from device. In this case, the only way to track dirty pages correctly is to switch to software datapath. There's still no IOMMU Dirty bit available. (3) centralizing VF critical states retrieving and VF controls into one driver, we propose to introduce mediate ops on top of current vfio-pci device driver. _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ __ register mediate ops| ___ ___| | |<---| VF| | | | vfio-pci | | | mediate | | PF driver | | |__|--->| driver | |___| |open(pdev) | --- | | || ||_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _| \|/ \|/ --- |VF | |PF| --- VF mediate driver could be a standalone driver that does not bind to any devices (as in demo code in patches 5-6) or it could be a built-in extension of PF driver (as in patches 7-9) . Rather than directly bind to VF, VF mediate driver register a mediate ops into vfio-pci in driver init. vfio-pci maintains a list of such mediate ops. (Note that: VF mediate driver can register mediate ops into vfio-pci before vfio-pci binding to any devices. And VF mediate driver can support mediating multiple devices.) When opening a device (e.g. a VF), vfio-pci goes through the mediate ops list and calls each vfio_pci_mediate_ops->open() with pdev of the opening device as a parameter. VF mediate driver should return success or failure depending on it supports the pdev or not. E.g. VF mediate driver would compare its supported VF devfn with the devfn of the passed-in pdev. Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will stop querying other mediate ops and bind the opening device with this mediate ops using the returned mediate handle. Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the VF will be intercepted into VF mediate driver as vfio_pci_mediate_ops->get_region_info(), vfio_pci_mediate_ops->rw, vfio_pci_mediate_ops->mmap, and get customized. 
For vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap, they will further return 'pt' to indicate whether vfio-pci should further passthrough data to hw. when vfio-pci closes the VF, it calls its vfio_pci_mediate_ops->release() with a mediate handle as parameter. The mediate handle returned from vfio_pci_mediate_ops->open() lets VF mediate driver be able to differentiate two opening VFs of the same device id and vendor id. When VF mediate driver exit
Re: [libvirt] [RFC PATCH 0/9] Introduce mediate ops in vfio-pci
On 2019/12/6 下午8:49, Yan Zhao wrote: On Fri, Dec 06, 2019 at 05:40:02PM +0800, Jason Wang wrote: On 2019/12/6 下午4:22, Yan Zhao wrote: On Thu, Dec 05, 2019 at 09:05:54PM +0800, Jason Wang wrote: On 2019/12/5 下午4:51, Yan Zhao wrote: On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote: Hi: On 2019/12/5 上午11:24, Yan Zhao wrote: For SRIOV devices, VFs are passthroughed into guest directly without host driver mediation. However, when VMs migrating with passthroughed VFs, dynamic host mediation is required to (1) get device states, (2) get dirty pages. Since device states as well as other critical information required for dirty page tracking for VFs are usually retrieved from PFs, it is handy to provide an extension in PF driver to centralizingly control VFs' migration. Therefore, in order to realize (1) passthrough VFs at normal time, (2) dynamically trap VFs' bars for dirty page tracking and A silly question, what's the reason for doing this, is this a must for dirty page tracking? For performance consideration. VFs' bars should be passthoughed at normal time and only enter into trap state on need. Right, but how does this matter for the case of dirty page tracking? Take NIC as an example, to trap its VF dirty pages, software way is required to trap every write of ring tail that resides in BAR0. Interesting, but it looks like we need: - decode the instruction - mediate all access to BAR0 All of which seems a great burden for the VF driver. I wonder whether or not doing interrupt relay and tracking head is better in this case. hi Jason not familiar with the way you mentioned. could you elaborate more? It looks to me that you want to intercept the bar that contains the head. Then you can figure out the buffers submitted from driver and you still need to decide a proper time to mark them as dirty. What I meant is, intercept the interrupt, then you can figure still figure out the buffers which has been modified by the device and make them as dirty. 
Then there's no need to trap the BAR and do decoding/emulation, etc. But it will still be tricky to get correct...

There's still no IOMMU dirty bit available.

... and (3) centralize VF critical-state retrieval and VF controls into one driver, we propose to introduce mediate ops on top of the current vfio-pci device driver.

[ASCII diagram, garbled in the archive: the VF mediate driver (standalone, or built into the PF driver) registers mediate ops into vfio-pci; vfio-pci calls open(pdev) on them when a VF is opened; vfio-pci binds to the VF, while the PF driver binds to the PF.]

The VF mediate driver could be a standalone driver that does not bind to any device (as in the demo code in patches 5-6), or it could be a built-in extension of the PF driver (as in patches 7-9). Rather than binding directly to the VF, the VF mediate driver registers a mediate ops into vfio-pci at driver init; vfio-pci maintains a list of such mediate ops. (Note: a VF mediate driver can register mediate ops into vfio-pci before vfio-pci binds to any device, and a VF mediate driver can mediate multiple devices.)

When opening a device (e.g. a VF), vfio-pci goes through the mediate ops list and calls each vfio_pci_mediate_ops->open() with the pdev of the opening device as a parameter. The VF mediate driver should return success or failure depending on whether it supports the pdev; e.g. it might compare its supported VF devfn with the devfn of the passed-in pdev. Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it stops querying other mediate ops and binds the opening device to this mediate ops using the returned mediate handle.

Further vfio-pci ops on the VF (the VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) are then intercepted into the VF mediate driver as vfio_pci_mediate_ops->get_region_info(), vfio_pci_mediate_ops->rw(), and vfio_pci_mediate_ops->mmap(), and get customized. vfio_pci_mediate_ops->rw() and vfio_pci_mediate_ops->mmap() additionally return 'pt' to indicate whether vfio-pci should further pass the access through to hardware.

When vfio-pci closes the VF, it calls vfio_pci_mediate_ops->release() with the mediate handle as a parameter. The mediate handle returned from vfio_pci_mediate_ops->open() lets the VF mediate driver differentiate two open VFs with the same device ID and vendor ID. When the VF mediate driver exits, it unregisters its mediate ops from vfio-pci.
Re: [libvirt] [RFC PATCH 0/9] Introduce mediate ops in vfio-pci
On 2019/12/6 4:22 PM, Yan Zhao wrote:

[quoted text trimmed; it duplicates the cover letter and the BAR-trapping exchange in the first message above]

(2) provide a migration region to support migration

What's the benefit of introducing a region? It looks to me that we don't expect the region to be accessed directly from the guest. Could we simply extend the device fd ioctls to do such things?

You may take a look at the mdev live migration discussions in https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html or the previous discussion at https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html, which has
Re: [libvirt] [RFC PATCH 0/9] Introduce mediate ops in vfio-pci
On 2019/12/5 4:51 PM, Yan Zhao wrote:

[quoted text trimmed; it duplicates the cover letter and the region discussion in the messages above]

You may take a look at the mdev live migration discussions in https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html or the previous discussion at https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html, which has a kernel-side implementation at https://patchwork.freedesktop.org/series/56876/. Generally speaking, the QEMU part of live migration is consistent between the vfio-pci + mediate ops way and the mdev way.

So in mdev, do you still have a mediate driver? Or do you expect the parent to implement the region?

The region is only a channel for QEMU and the kernel to communicate information without introducing ioctls.

Well, at least you introduce a new type of region in the uapi. So this does not answer why a region is better than an ioctl. If the region will only be
Re: [libvirt] [RFC PATCH 0/9] Introduce mediate ops in vfio-pci
Hi:

On 2019/12/5 11:24 AM, Yan Zhao wrote:

[quoted cover-letter text trimmed; the SR-IOV motivation, diagram, and mediate ops description duplicate the first message above]

In this patchset, we enable vfio-pci to provide 3 things:
(1) calling mediate ops to allow the vendor driver to customize the default region info/rw/mmap of a region;
(2) providing a migration region to support migration;

What's the benefit of introducing a region? It looks to me that we don't expect the region to be accessed directly from the guest. Could we simply extend the device fd ioctls to do such things?

(3) providing a dynamic trap BAR info region to allow the vendor driver to control trap/untrap of the device's PCI BARs.

This vfio-pci + mediate ops way differs from the mdev way in that (1) the mdev way needs to create a 1:1 mdev device on top of one VF, with a device-specific mdev parent driver bound to the VF directly, while (2) the vfio-pci + mediate ops way does not create mdev devices, and the VF mediate driver does not bind to VFs; instead, vfio-pci binds to VFs.

The reason why we don't choose the way of writing an mdev parent driver is that (1) VFs are almost all the time directly passed through, and binding directly to vfio-pci lets most of the code be shared/reused.

Can we split out the common parts from vfio-pci?

(2) If we write a vendor-specific mdev parent driver, most of the code (like the passthrough style of rw/mmap) still needs to be copied from the vfio-pci driver, which is actually duplicated and tedious work.

The mediate ops look quite similar to what vfio-mdev did. And it looks to me that we need to consider live migration for mdev as well. In that case, do we still expect
Re: [libvirt] [Qemu-devel] Virtio-net control queue handling
On 2/28/2019 1:39 AM, Nikhil Agarwal wrote:

Hi Stefan, thanks for directing the question to the right people; yes, your understanding is correct. I want the virtio-net control virtqueue to be handled by a vhost-user-net device backend for the vDPA use case, where a control message would do some configuration on the actual hardware device. Or should these changes be done in libvirt, where QMP events are being handled as of now?

Regards, Nikhil

Cc Michael. There are several methods for dealing with the control virtqueue:

1) Use a backend-specific method. For example, multiqueue control for TAP was done this way, and we plan to support RSS this way as well (through a steering eBPF program). But this may not work for privileged operations.
2) Invent a new vhost(-user) protocol: convert the virtqueue command to a vhost(-user) command and pass it to the vhost(-user) backend.
3) Extend the vhost protocol to deal with the control virtqueue, as you propose here.
4) Notify libvirt using a QMP event (we already support this when the rx filter is changed for virtio-net) and let management deal with the changes.
5) Implement or link the vhost backend inside QEMU; then there's no need to invent new IPCs.

For the case of vDPA, I'm not sure about the use case for vDPA through vhost-user. Can you describe more about that? To me, the best way is probably through mdev and VFIO, not a userspace path.

Thanks

-----Original Message-----
From: Stefan Hajnoczi
Sent: Wednesday, February 27, 2019 8:54 PM
To: Nikhil Agarwal
Cc: qemu-de...@nongnu.org; jasow...@redhat.com; libvir-list@redhat.com
Subject: Re: [Qemu-devel] Virtio-net control queue handling

On Tue, Feb 26, 2019 at 02:13:43PM +, Nikhil Agarwal wrote:

Is there any option in QEMU to offload virtio device control queue handling to the backend instead of deciphering it and passing it to libvirt via QMP? If not, is there any interface in libvirt which passes these messages on to the application, or should I use the QMP interface from QEMU directly?

I'm not sure I understand your question. It could be because of terminology: "virtio device control queue" == virtio-net control virtqueue? "backend" == vhost-user-net device backend? "message" == QMP event?

I have CCed Jason Wang, the QEMU network subsystem maintainer, and the libvirt mailing list.

Stefan

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
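Option (4) above is the mechanism libvirt already relies on for rx-filter changes: when the guest reprograms the filter through the control virtqueue, QEMU emits a NIC_RX_FILTER_CHANGED QMP event, and management follows up with query-rx-filter. A sketch of that exchange (the device name, timestamp, and MAC are made up, and the return payload is abridged; exact fields may vary by QEMU version):

```json
{"event": "NIC_RX_FILTER_CHANGED",
 "data": {"name": "net0", "path": "/machine/peripheral/net0/virtio-backend"},
 "timestamp": {"seconds": 1551312000, "microseconds": 0}}

{"execute": "query-rx-filter", "arguments": {"name": "net0"}}

{"return": [{"name": "net0",
             "promiscuous": false,
             "multicast": "normal",
             "unicast": "normal",
             "broadcast-allowed": true,
             "main-mac": "52:54:00:12:34:56",
             "unicast-table": [],
             "multicast-table": []}]}
```

QEMU coalesces further NIC_RX_FILTER_CHANGED events for a NIC until the filter is actually queried, so management must issue query-rx-filter to re-arm the notification.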
Re: [libvirt] [PATCH 09/18] Add virtio-related options to interfaces
On May 30, 2017, 21:09, Daniel P. Berrange wrote:

On Tue, May 30, 2017 at 02:50:32PM +0200, Ján Tomko wrote:

Feels like we could just use 'iommu=on' as the name. Also, I'm unclear whether we actually need the 'ats' attribute. From the description it sounds like if we have device_iotlb=on for the IOMMU device, then we should have ats=on too. Is there a reason to use ats=off when device_iotlb=on, or vice versa? If not, then we should just do the right thing and set ats=on whenever we see device_iotlb=on.

Maybe for virtio-net only. Vhost gets the best performance only when both are set; disabling either will be slower. For the cases of non-vhost and other virtio devices, enabling ATS may be a little slower, since we don't implement a device IOTLB in the device now (though we may want to implement it in the future). So for now, it's better to leave the choice to the user.

Another problem is that ATS depends on PCIe, and there are some requirements there: e.g. QEMU does not support this for an integrated PCIe device, which means we may need a PCIe port (pcie-root-port) to make PCIe and ATS work.

Thanks

Regards, Daniel
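For reference, the attributes discussed here ended up on the virtio <driver> element in the libvirt domain XML. A minimal sketch of a guest combining them with a vIOMMU (the network name and PCI topology are assumptions for the example; iotlb='on' on the <iommu> element corresponds to QEMU's device-iotlb):

```xml
<devices>
  <!-- ATS requires the device to sit on PCIe, e.g. behind a root port -->
  <controller type='pci' model='pcie-root-port'/>
  <!-- virtio-net with IOMMU support and ATS both enabled -->
  <interface type='network'>
    <source network='default'/>
    <model type='virtio'/>
    <driver iommu='on' ats='on'/>
  </interface>
  <!-- the emulated IOMMU itself -->
  <iommu model='intel'>
    <driver intremap='on' iotlb='on'/>
  </iommu>
</devices>
```

Per the discussion above, enabling both attributes matters chiefly for vhost-backed virtio-net; for other virtio devices ATS may cost a little performance, which is why libvirt leaves the choice to the user rather than deriving ats from the IOMMU's iotlb setting.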