Re: device compatibility interface for live migration with assigned devices
On Thu, 2020-09-10 at 14:38 +0200, Cornelia Huck wrote:
> On Wed, 9 Sep 2020 10:13:09 +0800
> Yan Zhao wrote:
>
> > > > still, I'd like to put it more explicitly to make sure it's not
> > > > missed:
> > > > the reason we want to specify compatible_type as a trait and check
> > > > whether target compatible_type is the superset of source
> > > > compatible_type is for the consideration of backward compatibility.
> > > > e.g.
> > > > an old generation device may have a mdev type xxx-v4-yyy, while a newer
> > > > generation device may be of mdev type xxx-v5-yyy.
> > > > with the compatible_type traits, the old generation device is still
> > > > able to be regarded as compatible to newer generation device even if
> > > > their mdev types are not equal.
> > >
> > > If you want to support migration from v4 to v5, can't the (presumably
> > > newer) driver that supports v5 simply register the v4 type as well, so
> > > that the mdev can be created as v4? (Just like QEMU versioned machine
> > > types work.)
> >
> > yes, it should work in some conditions.
> > but it may not be that good in some cases when v5 and v4 in the name string
> > of mdev type identify hardware generation (e.g. v4 for gen8, and v5 for
> > gen9)
> >
> > e.g.
> > (1). when src mdev type is v4 and target mdev type is v5 as
> > software does not support it initially, and v4 and v5 identify hardware
> > differences.
>
> My first hunch here is: Don't introduce types that may be compatible
> later. Either make them compatible, or make them distinct by design,
> and possibly add a different, compatible type later.
>
> > then after software upgrade, v5 is now compatible to v4, should the
> > software now downgrade mdev type from v5 to v4?
> > not sure if moving hardware generation info into a separate attribute
> > from mdev type name is better. e.g. remove v4, v5 in mdev type, while use
> > compatible_pci_ids to identify compatibility.
>
> If the generations are compatible, don't mention it in the mdev type.
> If they aren't, use distinct types, so that management software doesn't
> have to guess. At least that would be my naive approach here.

Yep, that is what I would prefer to see too.

> > (2) name string of mdev type is composed of "driver_name + type_name".
> > in some devices, e.g. qat, different generations of devices are bound to
> > drivers of different names, e.g. "qat-v4", "qat-v5".
> > then though type_name is equal, mdev type is not equal. e.g.
> > "qat-v4-type1", "qat-v5-type1".
>
> I guess that shows a shortcoming of that "driver_name + type_name"
> approach? Or maybe I'm just confused.

Yes, I really don't like having the version in the mdev type name.
I would strongly prefer just qat-type-1, where qat is there only as a way of
namespacing. That said, symmetric-crypto, asymmetric-crypto and compression
would be better names than type-1, type-2, type-3, if that is what they would
end up mapping to; e.g. qat-compression or qat-aes is a much better name than
type-1.

Higher layers of software are unlikely to parse the mdev names, but for a
human looking at them it is much easier to understand if the names are
meaningful. I think the qat prefix is important, however, to make sure that
your mdev types don't collide with other vendors' mdev types, so I would
encourage all vendors to prefix their mdev types with either the device name
or the vendor.
Re: device compatibility interface for live migration with assigned devices
On Fri, 2020-08-28 at 15:47 +0200, Cornelia Huck wrote:
> On Wed, 26 Aug 2020 14:41:17 +0800
> Yan Zhao wrote:
>
> > previously, we want to regard the two mdevs created with dsa-1dwq x 30 and
> > dsa-2dwq x 15 as compatible, because the two mdevs consist of equal
> > resources.
> >
> > But, as it's a burden to upper layer, we agree that if this condition
> > happens, we still treat the two as incompatible.
> >
> > To fix it, either the driver should expose dsa-1dwq only, or the target
> > dsa-2dwq needs to be destroyed and reallocated via dsa-1dwq x 30.
>
> AFAIU, these are mdev types, aren't they? So, basically, any management
> software needs to take care to use the matching mdev type on the target
> system for device creation?

Or just do the simple thing and use the same mdev type on the source and
dest. Matching mdev types is not necessarily trivial. We could do that, but
we would have to do it in Python rather than SQL, so it would be slower, at
least today.

We don't currently have the ability to say that the resource provider must
have one of a set of traits, just that it must have a specific trait. This is
a feature we have discussed a couple of times and delayed until we really,
really need it, but it's not out of the question that we could add it for
this use case. I suspect, however, that we would do exact match first and
explore this later, after the initial mdev migration works.

By the way, I was looking at some vDPA-related material today and noticed
that vDPA devices no longer use mdevs and now use a vhost chardev, so I
guess we will need a completely separate mechanism for vDPA vs. mdev
migration as a result. That is rather unfortunate, but I guess that is life.
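To make the difference between the two kinds of trait query concrete, here is a minimal sketch in Python. It is not the placement API: the `Provider` class, the trait names and the host names are all hypothetical, and the filtering is done over an in-memory list purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Provider:
    """Hypothetical stand-in for a resource provider (not the placement API)."""
    name: str
    traits: set = field(default_factory=set)


def exact_match(providers, required_trait):
    """Today's behaviour as described above: one specific trait must be present."""
    return [p.name for p in providers if required_trait in p.traits]


def any_of_match(providers, acceptable_traits):
    """The discussed extension: at least one trait from a set must be present."""
    wanted = set(acceptable_traits)
    return [p.name for p in providers if p.traits & wanted]


# Hypothetical trait names derived from the mdev types mentioned in the thread.
PROVIDERS = [
    Provider("host-a", {"CUSTOM_MDEV_DSA_1DWQ"}),
    Provider("host-b", {"CUSTOM_MDEV_DSA_2DWQ"}),
    Provider("host-c", {"CUSTOM_MDEV_OTHER"}),
]

print(exact_match(PROVIDERS, "CUSTOM_MDEV_DSA_1DWQ"))
print(any_of_match(PROVIDERS, {"CUSTOM_MDEV_DSA_1DWQ", "CUSTOM_MDEV_DSA_2DWQ"}))
```

With exact match only the host exposing the same mdev type qualifies; the any-of form would also admit hosts exposing a type declared compatible, which is the part that is deferred here.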
Re: device compatibility interface for live migration with assigned devices
On Thu, 2020-08-20 at 14:27 +0800, Yan Zhao wrote:
> On Thu, Aug 20, 2020 at 06:16:28AM +0100, Sean Mooney wrote:
> > On Thu, 2020-08-20 at 12:01 +0800, Yan Zhao wrote:
> > > On Thu, Aug 20, 2020 at 02:29:07AM +0100, Sean Mooney wrote:
> > > > On Thu, 2020-08-20 at 08:39 +0800, Yan Zhao wrote:
> > > > > On Tue, Aug 18, 2020 at 11:36:52AM +0200, Cornelia Huck wrote:
> > > > > > On Tue, 18 Aug 2020 10:16:28 +0100
> > > > > > Daniel P. Berrangé wrote:
> > > > > >
> > > > > > > On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote:
> > > > > > > > On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote:
> > > > > > > >
> > > > > > > > On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:
> > > > > > > >
> > > > > > > > On 2020/8/14 1:16 PM, Yan Zhao wrote:
> > > > > > > >
> > > > > > > > On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
> > > > > > > >
> > > > > > > > On 2020/8/10 3:46 PM, Yan Zhao wrote:
> > > > > > > > we actually can also retrieve the same information through
> > > > > > > > sysfs, e.g.
> > > > > > > >
> > > > > > > > |- [path to device]
> > > > > > > >    |--- migration
> > > > > > > >    |     |--- self
> > > > > > > >    |     |     |---device_api
> > > > > > > >    |     |     |---mdev_type
> > > > > > > >    |     |     |---software_version
> > > > > > > >    |     |     |---device_id
> > > > > > > >    |     |     |---aggregator
> > > > > > > >    |     |--- compatible
> > > > > > > >    |     |     |---device_api
> > > > > > > >    |     |     |---mdev_type
> > > > > > > >    |     |     |---software_version
> > > > > > > >    |     |     |---device_id
> > > > > > > >    |     |     |---aggregator
> > > > > > > >
> > > > > > > > Yes but:
> > > > > > > >
> > > > > > > > - You need one file per attribute (one syscall for one
> > > > > > > >   attribute)
> > > > > > > > - Attribute is coupled with kobject
> > > > > >
> > > > > > Is that really that bad? You have the device with an embedded
> > > > > > kobject anyway, and you can just put things into an attribute
> > > > > > group?
> > > > > >
> > > > > > [Also, I think that self/compatible split in the example makes
> > > > > > things needlessly complex. Shouldn't semantic versioning and
> > > > > > matching already cover nearly everything? I would expect very few
> > > > > > cases that are more complex than that. Maybe the aggregation
> > > > > > stuff, but I don't think we need that self/compatible split for
> > > > > > that, either.]
> > > > >
> > > > > Hi Cornelia,
> > > > >
> > > > > The reason I want to declare compatible list of attributes is that
> > > > > sometimes it's not a simple 1:1 matching of source attributes and
> > > > > target attributes
> > > > > as I demonstrated below,
> > > > > source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is
> > > > > compatible to
> > > > > target mdev of (mdev_type i915-GVTg_V5_4 + aggregator 2),
> > > > >                (mdev_type i915-GVTg_V5_8 + aggregator 4)
> > > >
> > > > The way you are doing the naming is still really confusing, by the
> > > > way. If this has not already been merged in the kernel, can you
> > > > change the mdev so that mdev_type i915-GVTg_V5_2 is 2 of mdev_type
> > > > i915-GVTg_V5_1 instead of half the device?
> > > >
> > > > Currently you need to divide the aggregator by the number at the end
> > > > of the mdev type to figure out how much of the physical device is
> > > > being used, which is a very unfriendly API convention.
> > > >
> > > > The way aggregators ar
Re: device compatibility interface for live migration with assigned devices
On Thu, 2020-08-20 at 12:01 +0800, Yan Zhao wrote:
> On Thu, Aug 20, 2020 at 02:29:07AM +0100, Sean Mooney wrote:
> > On Thu, 2020-08-20 at 08:39 +0800, Yan Zhao wrote:
> > > On Tue, Aug 18, 2020 at 11:36:52AM +0200, Cornelia Huck wrote:
> > > > On Tue, 18 Aug 2020 10:16:28 +0100
> > > > Daniel P. Berrangé wrote:
> > > >
> > > > > On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote:
> > > > > > On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote:
> > > > > >
> > > > > > On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:
> > > > > >
> > > > > > On 2020/8/14 1:16 PM, Yan Zhao wrote:
> > > > > >
> > > > > > On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
> > > > > >
> > > > > > On 2020/8/10 3:46 PM, Yan Zhao wrote:
> > > > > > we actually can also retrieve the same information through sysfs,
> > > > > > e.g.
> > > > > >
> > > > > > |- [path to device]
> > > > > >    |--- migration
> > > > > >    |     |--- self
> > > > > >    |     |     |---device_api
> > > > > >    |     |     |---mdev_type
> > > > > >    |     |     |---software_version
> > > > > >    |     |     |---device_id
> > > > > >    |     |     |---aggregator
> > > > > >    |     |--- compatible
> > > > > >    |     |     |---device_api
> > > > > >    |     |     |---mdev_type
> > > > > >    |     |     |---software_version
> > > > > >    |     |     |---device_id
> > > > > >    |     |     |---aggregator
> > > > > >
> > > > > > Yes but:
> > > > > >
> > > > > > - You need one file per attribute (one syscall for one attribute)
> > > > > > - Attribute is coupled with kobject
> > > >
> > > > Is that really that bad? You have the device with an embedded kobject
> > > > anyway, and you can just put things into an attribute group?
> > > >
> > > > [Also, I think that self/compatible split in the example makes things
> > > > needlessly complex. Shouldn't semantic versioning and matching already
> > > > cover nearly everything? I would expect very few cases that are more
> > > > complex than that. Maybe the aggregation stuff, but I don't think we
> > > > need that self/compatible split for that, either.]
> > >
> > > Hi Cornelia,
> > >
> > > The reason I want to declare compatible list of attributes is that
> > > sometimes it's not a simple 1:1 matching of source attributes and target
> > > attributes
> > > as I demonstrated below,
> > > source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is compatible to
> > > target mdev of (mdev_type i915-GVTg_V5_4 + aggregator 2),
> > >                (mdev_type i915-GVTg_V5_8 + aggregator 4)
> >
> > The way you are doing the naming is still really confusing, by the way.
> > If this has not already been merged in the kernel, can you change the mdev
> > so that mdev_type i915-GVTg_V5_2 is 2 of mdev_type i915-GVTg_V5_1 instead
> > of half the device?
> >
> > Currently you need to divide the aggregator by the number at the end of
> > the mdev type to figure out how much of the physical device is being used,
> > which is a very unfriendly API convention.
> >
> > The way aggregators are being proposed in general is not really something
> > I like, but I think this at least is something that can be corrected.
> >
> > With the complexity in the mdev type name + aggregator, I suspect that
> > this will never be supported in OpenStack Nova directly, requiring
> > integration via Cyborg, unless we can pre-partition the device into mdevs
> > statically and just ignore this.
> >
> > This is way too vendor-specific to integrate into something like OpenStack
> > in Nova unless we can guarantee that how aggregators work will be portable
> > across vendors generically.
> >
> > > and aggregator may be just one of such examples that 1:1 matching does
> > > not fit.
> >
> > For OpenStack Nova I don't see us supporting anything beyond the 1:1 case
> > where the mdev type does not change.
>
> hi Sean,
> I understand it's hard for openstack.
Re: device compatibility interface for live migration with assigned devices
On Thu, 2020-08-20 at 08:39 +0800, Yan Zhao wrote:
> On Tue, Aug 18, 2020 at 11:36:52AM +0200, Cornelia Huck wrote:
> > On Tue, 18 Aug 2020 10:16:28 +0100
> > Daniel P. Berrangé wrote:
> >
> > > On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote:
> > > > On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote:
> > > >
> > > > On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:
> > > >
> > > > On 2020/8/14 1:16 PM, Yan Zhao wrote:
> > > >
> > > > On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
> > > >
> > > > On 2020/8/10 3:46 PM, Yan Zhao wrote:
> > > > we actually can also retrieve the same information through sysfs, e.g.
> > > >
> > > > |- [path to device]
> > > >    |--- migration
> > > >    |     |--- self
> > > >    |     |     |---device_api
> > > >    |     |     |---mdev_type
> > > >    |     |     |---software_version
> > > >    |     |     |---device_id
> > > >    |     |     |---aggregator
> > > >    |     |--- compatible
> > > >    |     |     |---device_api
> > > >    |     |     |---mdev_type
> > > >    |     |     |---software_version
> > > >    |     |     |---device_id
> > > >    |     |     |---aggregator
> > > >
> > > > Yes but:
> > > >
> > > > - You need one file per attribute (one syscall for one attribute)
> > > > - Attribute is coupled with kobject
> >
> > Is that really that bad? You have the device with an embedded kobject
> > anyway, and you can just put things into an attribute group?
> >
> > [Also, I think that self/compatible split in the example makes things
> > needlessly complex. Shouldn't semantic versioning and matching already
> > cover nearly everything? I would expect very few cases that are more
> > complex than that. Maybe the aggregation stuff, but I don't think we
> > need that self/compatible split for that, either.]
>
> Hi Cornelia,
>
> The reason I want to declare compatible list of attributes is that
> sometimes it's not a simple 1:1 matching of source attributes and target
> attributes
> as I demonstrated below,
> source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is compatible to
> target mdev of (mdev_type i915-GVTg_V5_4 + aggregator 2),
>                (mdev_type i915-GVTg_V5_8 + aggregator 4)

The way you are doing the naming is still really confusing, by the way.
If this has not already been merged in the kernel, can you change the mdev
so that mdev_type i915-GVTg_V5_2 is 2 of mdev_type i915-GVTg_V5_1 instead of
half the device?

Currently you need to divide the aggregator by the number at the end of the
mdev type to figure out how much of the physical device is being used, which
is a very unfriendly API convention.

The way aggregators are being proposed in general is not really something I
like, but I think this at least is something that can be corrected.

With the complexity in the mdev type name + aggregator, I suspect that this
will never be supported in OpenStack Nova directly, requiring integration via
Cyborg, unless we can pre-partition the device into mdevs statically and just
ignore this.

This is way too vendor-specific to integrate into something like OpenStack in
Nova unless we can guarantee that how aggregators work will be portable
across vendors generically.

> and aggregator may be just one of such examples that 1:1 matching does not
> fit.

For OpenStack Nova I don't see us supporting anything beyond the 1:1 case
where the mdev type does not change.

I would really prefer if there was just one mdev type that represented the
minimal allocatable unit, and the aggregators were used to create
compositions of that. I.e. instead of i915-GVTg_V5_2 being half the device,
have one mdev type i915-GVTg, and if the device supports 8 of them then we
can aggregate 4 of i915-GVTg.

If you want to have multiple mdev types to model the different amounts of
the resource, e.g. i915-GVTg_small and i915-GVTg_large, that is totally fine
too, or even i915-GVTg_4 indicating it is 4 of i915-GVTg.

Failing that, I would just expose an mdev type per composable resource and
allow us to compose them at the user level with some other construct
modelling an attachment to the device, e.g. create a composed mdev or
something that is an aggregation of multiple sub-resources, each of which is
an mdev. So kind of like how bond ports work: we would create an mdev for
each of the sub-resources and then create a bond or aggregated mdev by
referencing the other mdevs by UUID, then attach only the aggregated mdev to
the instance.

The current aggregator syntax and semantics, however, make me rather
uncomfortable when I think about orchestrating VMs on top of it, even to boot
them, let alone migrate them.

> So, we explicitly list out self/compatible attributes, and management
> tools only need to check if self attributes is contained compatible
> attributes.
>
> or do you mean only compatible list is enough, and the management tools
> need to find out self list by themselves?
> But I think provide a self list is easier for management tools.
>
> Thanks
> Yan
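The containment check Yan describes can be sketched roughly as follows. This is only an illustration under stated assumptions: the attribute names come from the sysfs tree quoted above, but representing the compatible list as a list of attribute combinations, and checking one device's self attributes against the other device's list, are assumptions made here, since the thread does not fix an encoding. All values are handled as strings, as they would be when read from sysfs files.

```python
def entry_matches(self_attrs, entry):
    """True if every attribute named in one compatible entry equals the
    corresponding 'self' attribute of the other device."""
    return all(self_attrs.get(key) == value for key, value in entry.items())


def is_listed(self_attrs, compatible_entries):
    """True if the 'self' attributes satisfy at least one compatible entry."""
    return any(entry_matches(self_attrs, entry) for entry in compatible_entries)


# Compatible combinations for a source mdev of
# (mdev_type i915-GVTg_V5_2 + aggregator 1), taken from the example above.
SOURCE_COMPATIBLE = [
    {"mdev_type": "i915-GVTg_V5_4", "aggregator": "2"},
    {"mdev_type": "i915-GVTg_V5_8", "aggregator": "4"},
]

target_ok = {"device_api": "vfio-pci", "mdev_type": "i915-GVTg_V5_4", "aggregator": "2"}
target_bad = {"device_api": "vfio-pci", "mdev_type": "i915-GVTg_V5_4", "aggregator": "1"}

print(is_listed(target_ok, SOURCE_COMPATIBLE))
print(is_listed(target_bad, SOURCE_COMPATIBLE))
```

Note that the entries are combinations rather than independent per-attribute sets, which is exactly the non-1:1 case (mdev type plus aggregator) that motivates the separate compatible list.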
Re: device compatibility interface for live migration with assigned devices
On Fri, 2020-08-14 at 13:16 +0800, Yan Zhao wrote:
> On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
> >
> > On 2020/8/10 3:46 PM, Yan Zhao wrote:
> > > > driver is it handled by?
> > >
> > > It looks like devlink is specific to network devices, and in
> > > devlink.h, it says
> > > include/uapi/linux/devlink.h - Network physical device Netlink
> > > interface,
> >
> > Actually not, I think there used to be some discussion last year and the
> > conclusion was to remove this comment.
> >
> > It supports IB and probably vDPA in the future.
>
> hmm... sorry, I didn't find the referred discussion. only below discussion
> regarding why to add devlink.
>
> https://www.mail-archive.com/netdev@vger.kernel.org/msg95801.html
>     >This doesn't seem to be too much related to networking? Why can't
>     >something like this be in sysfs?
>
>     It is related to networking quite bit. There has been couple of
>     iteration of this, including sysfs and configfs implementations. There
>     has been a consensus reached that this should be done by netlink. I
>     believe netlink is really the best for this purpose. Sysfs is not a good
>     idea
>
> https://www.mail-archive.com/netdev@vger.kernel.org/msg96102.html
>     >there is already a way to change eth/ib via
>     >echo 'eth' > /sys/bus/pci/drivers/mlx4_core/:02:00.0/mlx4_port1
>     >
>     >sounds like this is another way to achieve the same?
>
>     It is. However the current way is driver-specific, not correct.
>     For mlx5, we need the same, it cannot be done in this way. So devlink is
>     the correct way to go.

I'm not sure I agree with that. Standardising a filesystem-based API that is
used across all vendors is also a valid option. That said, if devlink is the
right choice from a kernel perspective, by all means use it, but I have not
heard a convincing argument for why it is actually better.
With that said, we have been using tools like ethtool to manage aspects of
NICs for decades, so it's not that strange an idea to use a tool and a binary
protocol rather than a text-based interface for this, but there are
advantages to both approaches.

> https://lwn.net/Articles/674867/
>     There is a need for some userspace API that would allow to expose things
>     that are not directly related to any device class like net_device of
>     ib_device, but rather chip-wide/switch-ASIC-wide stuff.
>
>     Use cases:
>     1) get/set of port type (Ethernet/InfiniBand)
>     2) monitoring of hardware messages to and from chip
>     3) setting up port splitters - split port into multiple ones and squash
>        again, enables usage of splitter cable
>     4) setting up shared buffers - shared among multiple ports within one
>        chip
>
> we actually can also retrieve the same information through sysfs, e.g.
>
> |- [path to device]
>    |--- migration
>    |     |--- self
>    |     |     |---device_api
>    |     |     |---mdev_type
>    |     |     |---software_version
>    |     |     |---device_id
>    |     |     |---aggregator
>    |     |--- compatible
>    |     |     |---device_api
>    |     |     |---mdev_type
>    |     |     |---software_version
>    |     |     |---device_id
>    |     |     |---aggregator
>
> > > I feel like it's not very appropriate for a GPU driver to use
> > > this interface. Is that right?
> >
> > I think not though most of the users are switch or ethernet devices. It
> > doesn't prevent you from inventing new abstractions.
>
> so need to patch devlink core and the userspace devlink tool?
> e.g. devlink migration

And devlink Python libs, if OpenStack was to use it directly.
We do have cases where we just fork a process and execute a command in a
shell, with or without elevated privilege, but we really don't like doing
that due to the performance impact and security implications, so where we can
use Python bindings over C APIs, we do. pyroute2 is the only Python lib I
know of off the top of my head that supports devlink, so we would need to
enhance it to support this new devlink API.
There may be others; I have not really looked in the past, since we don't
need to use devlink at all today.

> > Note that devlink is based on netlink, netlink has been widely used by
> > various subsystems other than networking.
>
> the advantage of netlink I see is that it can monitor device status and
> notify upper layer that migration database needs to get updated.
> But not sure whether openstack would like to use this capability.
> As Sean said, it's heavy for openstack. it's heavy for vendor driver
> as well :)
>
> And devlink monitor now listens the notification and dumps the state
> changes. If we want to use it, need to let it forward the notification
> and dumped info to openstack, right?

I don't think we would use direct devlink monitoring in Nova even if it was
available. We could, but we already poll libvirt and the system for other
resources periodically. we
Re: device compatibility interface for live migration with assigned devices
On Wed, 2020-08-05 at 12:53 +0200, Jiri Pirko wrote:
> Wed, Aug 05, 2020 at 11:33:38AM CEST, yan.y.z...@intel.com wrote:
> > On Wed, Aug 05, 2020 at 04:02:48PM +0800, Jason Wang wrote:
> > >
> > > On 2020/8/5 3:56 PM, Jiri Pirko wrote:
> > > > Wed, Aug 05, 2020 at 04:41:54AM CEST, jasow...@redhat.com wrote:
> > > > > On 2020/8/5 10:16 AM, Yan Zhao wrote:
> > > > > > On Wed, Aug 05, 2020 at 10:22:15AM +0800, Jason Wang wrote:
> > > > > > > On 2020/8/5 12:35 AM, Cornelia Huck wrote:
> > > > > > > > [sorry about not chiming in earlier]
> > > > > > > >
> > > > > > > > On Wed, 29 Jul 2020 16:05:03 +0800
> > > > > > > > Yan Zhao wrote:
> > > > > > > >
> > > > > > > > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson
> > > > > > > > > wrote:
> > > > > > > >
> > > > > > > > (...)
> > > > > > > >
> > > > > > > > > > Based on the feedback we've received, the previously
> > > > > > > > > > proposed interface is not viable. I think there's
> > > > > > > > > > agreement that the user needs to be able to parse and
> > > > > > > > > > interpret the version information. Using json seems
> > > > > > > > > > viable, but I don't know if it's the best option. Is
> > > > > > > > > > there any precedent of markup strings returned via sysfs
> > > > > > > > > > we could follow?
> > > > > > > >
> > > > > > > > I don't think encoding complex information in a sysfs file is
> > > > > > > > a viable approach. Quoting Documentation/filesystems/sysfs.rst:
> > > > > > > >
> > > > > > > > "Attributes should be ASCII text files, preferably with only
> > > > > > > > one value per file. It is noted that it may not be efficient
> > > > > > > > to contain only one value per file, so it is socially
> > > > > > > > acceptable to express an array of values of the same type.
> > > > > > > >
> > > > > > > > Mixing types, expressing multiple lines of data, and doing
> > > > > > > > fancy formatting of data is heavily frowned upon."
> > > > > > > >
> > > > > > > > Even though this is an older file, I think these restrictions
> > > > > > > > still apply.
> > > > > > >
> > > > > > > +1, that's another reason why devlink(netlink) is better.
> > > > > >
> > > > > > hi Jason,
> > > > > > do you have any materials or sample code about devlink, so we can
> > > > > > have a good study of it?
> > > > > > I found some kernel docs about it but my preliminary study didn't
> > > > > > show me the advantage of devlink.
> > > > >
> > > > > CC Jiri and Parav for a better answer for this.
> > > > >
> > > > > My understanding is that the following advantages are obvious (as I
> > > > > replied in another thread):
> > > > >
> > > > > - existing users (NIC, crypto, SCSI, ib), mature and stable
> > > > > - much better error reporting (ext_ack other than string or errno)
> > > > > - namespace aware
> > > > > - do not couple with kobject
> > > >
> > > > Jason, what is your use case?
> > >
> > > I think the use case is to report device compatibility for live
> > > migration. Yan proposed a simple sysfs based migration version first,
> > > but it looks not sufficient and something based on JSON is discussed.
> > >
> > > Yan, can you help to summarize the discussion so far for Jiri as a
> > > reference?
> >
> > yes.
> > we are currently defining a device live migration compatibility
> > interface in order to let user space like openstack and libvirt know
> > which two devices are live migration compatible.
> > currently the devices include mdev (a kernel emulated virtual device)
> > and physical devices (e.g. a VF of a PCI SRIOV device).
> >
> > the attributes we want user space to compare include
> > common attributes:
> >    device_api: vfio-pci, vfio-ccw...
> >    mdev_type: mdev type of mdev or similar signature for physical device
> >               It specifies a device's hardware capability. e.g.
> >               i915-GVTg_V5_4 means it's of 1/4 of a gen9 Intel graphics
> >               device.

By the way, this naming scheme works the opposite of how I would have
expected. I would have expected i915-GVTg_V5 to be the same as
i915-GVTg_V5_1, and i915-GVTg_V5_4 to use 4 times the amount of resource as
i915-GVTg_V5_1, not one quarter.

I would much rather see i915-GVTg_V5_4 expressed as
aggregator:i915-GVTg_V5=4, i.e. that it is 4 of the basic i915-GVTg_V5 type.
The inversion of the relationship makes this much harder to reason about,
IMO.

If i915-GVTg_V5_8 and i915-GVTg_V5_4 are both actually claiming the same
resource and both can be used at the same time, then with your suggested
naming scheme I have to find the mdev type with the largest value and store
that, then do math by dividing it by the suffix of the requested type every
time I want to claim the resource in our placement inventories.
If we represent it the way I suggest we don't if it
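As a rough illustration of the bookkeeping described in this message, the following Python sketch contrasts the two conventions. The function names and the list of known types are hypothetical, and the `type=count` string is just the spelling suggested above, not an existing kernel interface. The sketch also assumes every suffix divides the largest one evenly.

```python
def suffix(mdev_type):
    """Numeric suffix of a type name such as 'i915-GVTg_V5_4'."""
    return int(mdev_type.rsplit("_", 1)[1])


def units_inverted(requested, known_types):
    """Current naming: suffix N means 1/N of the device, so the smallest
    slice (largest suffix) has to be looked up first and used as the unit."""
    largest = max(suffix(t) for t in known_types)
    return largest // suffix(requested)


def units_direct(spec):
    """Suggested naming: 'base_type=count' states the unit count directly."""
    _base, count = spec.split("=")
    return int(count)


KNOWN_TYPES = ["i915-GVTg_V5_1", "i915-GVTg_V5_2", "i915-GVTg_V5_4", "i915-GVTg_V5_8"]

# A quarter of the device is 2 of the smallest (1/8) slices.
print(units_inverted("i915-GVTg_V5_4", KNOWN_TYPES))
# With the direct form no lookup over the other types is needed.
print(units_direct("i915-GVTg_V5=4"))
```

The inverted form needs knowledge of every type the device exposes before a single claim can be sized, whereas the direct form can be evaluated from the request alone.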
Re: device compatibility interface for live migration with assigned devices
On Thu, 2020-07-30 at 11:41 +0800, Yan Zhao wrote:
> > > interface_version=3
> >
> > Not much granularity here, I prefer Sean's previous
> > <major>.<minor>[.bugfix] scheme.
>
> yes, <major>.<minor>[.bugfix] scheme may be better, but I'm not sure if
> it works for a complicated scenario.
> e.g. for pv_mode,
> (1) initially, pv_mode is not supported, so it's pv_mode=none, it's 0.0.0,
> (2) then, pv_mode=ppgtt is supported, pv_mode="none+ppgtt", it's 0.1.0,
> indicating pv_mode=none can migrate to pv_mode="none+ppgtt", but not vice
> versa.
> (3) later, pv_mode=context is also supported,
> pv_mode="none+ppgtt+context", so it's 0.2.0.
>
> But if later, pv_mode=ppgtt is removed. pv_mode="none+context", how to
> name its version?

It would become 1.0.0.
Addition of a feature is a minor version bump, as it is backwards compatible:
if you don't request the new feature you don't need to use it, and the device
can continue to behave like a 0.0.0 device even if it is capable of acting as
a 0.1.0 device.

When you remove a feature, that is backward incompatible, as any instance
that was previously using it would no longer work, so you have to bump the
major version.

> "none+ppgtt" (0.1.0) is not compatible to
> "none+context", but "none+ppgtt+context" (0.2.0) is compatible to
> "none+context".
>
> Maintain such scheme is painful to vendor driver.

Not really, it's how most software libs are versioned today. Some use other
schemes, but semantic versioning done right is a concise and
easy-to-consume set of rules: https://semver.org/

However, you are right that it forces vendors to think about backwards and
forwards compatibility with each change, which for the most part is a good
thing. It goes hand in hand with having stable ABI and API definitions to
ensure firmware updates and driver changes don't break userspace that
depends on the kernel interfaces they expose.
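The rule discussed here can be sketched in a few lines of Python. This is an illustration of the semantic-versioning argument only, not a proposed kernel or libvirt interface; `parse` and `can_migrate` are hypothetical helper names, and the bugfix component is assumed not to affect compatibility.

```python
def parse(version):
    """Split '<major>.<minor>[.bugfix]' into a 3-tuple, defaulting bugfix to 0."""
    parts = [int(p) for p in version.split(".")]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts[:3])


def can_migrate(source, target):
    """A source can move to a target with the same major version and an equal
    or newer minor version; the bugfix component is ignored."""
    src_major, src_minor, _ = parse(source)
    dst_major, dst_minor, _ = parse(target)
    return src_major == dst_major and dst_minor >= src_minor


# pv_mode=none (0.0.0) -> pv_mode="none+ppgtt" (0.1.0): allowed
print(can_migrate("0.0.0", "0.1.0"))
# the reverse direction: refused
print(can_migrate("0.1.0", "0.0.0"))
# any 0.x source -> 1.0.0 after ppgtt is removed: refused
print(can_migrate("0.2.0", "1.0.0"))
```

This strict rule is conservative: it refuses 0.2.0 to 1.0.0 even for an instance that never used ppgtt, which is the cost Yan points out and the price of keeping the check down to a comparison of two numbers.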
Re: device compatibility interface for live migration with assigned devices
On Thu, 2020-07-30 at 09:56 +0800, Yan Zhao wrote:
> On Wed, Jul 29, 2020 at 12:28:46PM +0100, Sean Mooney wrote:
> > On Wed, 2020-07-29 at 16:05 +0800, Yan Zhao wrote:
> > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:
> > > > On Mon, 27 Jul 2020 15:24:40 +0800
> > > > Yan Zhao wrote:
> > > >
> > > > > > > As you indicate, the vendor driver is responsible for checking
> > > > > > > version information embedded within the migration stream.
> > > > > > > Therefore a migration should fail early if the devices are
> > > > > > > incompatible. Is it
> > > > > >
> > > > > > but as I know, currently in VFIO migration protocol, we have no
> > > > > > way to get vendor specific compatibility checking string in
> > > > > > migration setup stage (i.e. .save_setup stage) before the device
> > > > > > is set to _SAVING state.
> > > > > > In this way, for devices who does not save device data in precopy
> > > > > > stage, the migration compatibility checking is as late as in
> > > > > > stop-and-copy stage, which is too late.
> > > > > > do you think we need to add the getting/checking of vendor
> > > > > > specific compatibility string early in save_setup stage?
> > > > >
> > > > > hi Alex,
> > > > > after an offline discussion with Kevin, I realized that it may not
> > > > > be a problem if migration compatibility check in vendor driver
> > > > > occurs late in stop-and-copy phase for some devices, because if we
> > > > > report device compatibility attributes clearly in an interface, the
> > > > > chances for libvirt/openstack to make a wrong decision is little.
> > > >
> > > > I think it would be wise for a vendor driver to implement a pre-copy
> > > > phase, even if only to send version information and verify it at the
> > > > target. Deciding you have no device state to send during pre-copy
> > > > does not mean your vendor driver needs to opt-out of the pre-copy
> > > > phase entirely. Please also note that pre-copy is at the user's
> > > > discretion, we've defined that we can enter stop-and-copy at any
> > > > point, including without a pre-copy phase, so I would recommend that
> > > > vendor drivers validate compatibility at the start of both the
> > > > pre-copy and the stop-and-copy phases.
> > >
> > > ok. got it!
> > >
> > > > > so, do you think we are now arriving at an agreement that we'll
> > > > > give up the read-and-test scheme and start to defining one
> > > > > interface (perhaps in json format), from which libvirt/openstack is
> > > > > able to parse and find out compatibility list of a source
> > > > > mdev/physical device?
> > > >
> > > > Based on the feedback we've received, the previously proposed
> > > > interface is not viable. I think there's agreement that the user
> > > > needs to be able to parse and interpret the version information.
> > > > Using json seems viable, but I don't know if it's the best option.
> > > > Is there any precedent of markup strings returned via sysfs we could
> > > > follow?
> > >
> > > I found some examples of using formatted string under /sys, mostly under
> > > tracing. maybe we can do a similar implementation.
> > >
> > > #cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format
> > >
> > > name: kvm_mmio
> > > ID: 32
> > > format:
> > >         field:unsigned short common_type; offset:0; size:2; signed:0;
> > >         field:unsigned char common_flags; offset:2; size:1; signed:0;
> > >         field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
> > >         field:int common_pid; offset:4; size:4; signed:1;
> > >
> > >         field:u32 type; offset:8; size:4; signed:0;
> > >         field:u32 len; offset:12; size:4; signed:0;
> > >         field:u64 gpa; offset:16; size:8; signed:0;
Re: device compatibility interface for live migration with assigned devices
On Wed, 2020-07-29 at 16:05 +0800, Yan Zhao wrote: > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote: > > On Mon, 27 Jul 2020 15:24:40 +0800 > > Yan Zhao wrote: > > > > > > > As you indicate, the vendor driver is responsible for checking version > > > > > information embedded within the migration stream. Therefore a > > > > > migration should fail early if the devices are incompatible. Is it > > > > > > > > but as I know, currently in VFIO migration protocol, we have no way to > > > > get vendor specific compatibility checking string in migration setup > > > > stage > > > > (i.e. .save_setup stage) before the device is set to _SAVING state. > > > > In this way, for devices who does not save device data in precopy stage, > > > > the migration compatibility checking is as late as in stop-and-copy > > > > stage, which is too late. > > > > do you think we need to add the getting/checking of vendor specific > > > > compatibility string early in save_setup stage? > > > > > > > > > > hi Alex, > > > after an offline discussion with Kevin, I realized that it may not be a > > > problem if migration compatibility check in vendor driver occurs late in > > > stop-and-copy phase for some devices, because if we report device > > > compatibility attributes clearly in an interface, the chances for > > > libvirt/openstack to make a wrong decision is little. > > > > I think it would be wise for a vendor driver to implement a pre-copy > > phase, even if only to send version information and verify it at the > > target. Deciding you have no device state to send during pre-copy does > > not mean your vendor driver needs to opt-out of the pre-copy phase > > entirely. Please also note that pre-copy is at the user's discretion, > > we've defined that we can enter stop-and-copy at any point, including > > without a pre-copy phase, so I would recommend that vendor drivers > > validate compatibility at the start of both the pre-copy and the > > stop-and-copy phases. > > > > ok. 
got it! > > > > so, do you think we are now arriving at an agreement that we'll give up > > > the read-and-test scheme and start to defining one interface (perhaps in > > > json format), from which libvirt/openstack is able to parse and find out > > > compatibility list of a source mdev/physical device? > > > > Based on the feedback we've received, the previously proposed interface > > is not viable. I think there's agreement that the user needs to be > > able to parse and interpret the version information. Using json seems > > viable, but I don't know if it's the best option. Is there any > > precedent of markup strings returned via sysfs we could follow? > > I found some examples of using formatted string under /sys, mostly under > tracing. maybe we can do a similar implementation. > > #cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format > > name: kvm_mmio > ID: 32 > format: > field:unsigned short common_type; offset:0; size:2; > signed:0; > field:unsigned char common_flags; offset:2; size:1; > signed:0; > field:unsigned char common_preempt_count; offset:3; > size:1; signed:0; > field:int common_pid; offset:4; size:4; signed:1; > > field:u32 type; offset:8; size:4; signed:0; > field:u32 len; offset:12; size:4; signed:0; > field:u64 gpa; offset:16; size:8; signed:0; > field:u64 val; offset:24; size:8; signed:0; > > print fmt: "mmio %s len %u gpa 0x%llx val 0x%llx", > __print_symbolic(REC->type, { 0, "unsatisfied-read" }, { 1, "read" > }, { 2, "write" }), REC->len, REC->gpa, REC->val >

This is not JSON format, and it is not super friendly to parse.

> > #cat /sys/devices/pci:00/:00:02.0/uevent > DRIVER=vfio-pci > PCI_CLASS=3 > PCI_ID=8086:591D > PCI_SUBSYS_ID=8086:2212 > PCI_SLOT_NAME=:00:02.0 > MODALIAS=pci:v8086d591Dsv8086sd2212bc03sc00i00 >

This is INI or conf format, which is pretty simple to parse and would be fine. That said, you could also have a version or capability directory with a file for each key, each holding a single value.
I would personally prefer to do a single read rather than list the files in a directory and then read them all to build the data structure myself, but that is doable. The simple INI format used for uevent seems the best of the three options provided above.

> > > > Your idea of having both a "self" object and an array of "compatible" > > objects is perhaps something we can build on, but we must not assume > > PCI devices at the root level of the object. Providing both the > > mdev-type and the driver is a bit redundant, since the former includes > > the latter. We can't have vendor specific versioning schemes though, > > ie. gvt-version. We need to agree on a common scheme and decide which > > fields the version is relative to, ex. just the mdev type? > > what about making all comparing fields vendor specific? >
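As a rough illustration of why the uevent-style KEY=VALUE layout is easy to consume, here is a minimal Python sketch that parses such text into a dictionary in a single pass. The function name parse_uevent is hypothetical, and the sample text is copied from the uevent output quoted above; nothing here is an existing kernel or libvirt interface.

```python
def parse_uevent(text):
    """Parse KEY=VALUE lines (the layout used by sysfs uevent files) into a dict."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        # Split on the first '=' only, so values may themselves contain '='.
        key, _, value = line.partition("=")
        result[key] = value
    return result


# Sample taken from the uevent output quoted in this thread.
sample = """DRIVER=vfio-pci
PCI_CLASS=3
PCI_ID=8086:591D
PCI_SUBSYS_ID=8086:2212
MODALIAS=pci:v8086d591Dsv8086sd2212bc03sc00i00
"""

info = parse_uevent(sample)
print(info["DRIVER"], info["PCI_ID"])
```

A compatibility description published in this layout would need only one read of one file, which is the property preferred above over a directory with one file per key.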
Re: device compatibility interface for live migration with assigned devices
On Mon, 2020-07-20 at 11:41 +0800, Jason Wang wrote: > On 2020/7/18 上午12:12, Alex Williamson wrote: > > On Thu, 16 Jul 2020 16:32:30 +0800 > > Yan Zhao wrote: > > > > > On Thu, Jul 16, 2020 at 12:16:26PM +0800, Jason Wang wrote: > > > > On 2020/7/14 上午7:29, Yan Zhao wrote: > > > > > hi folks, > > > > > we are defining a device migration compatibility interface that helps > > > > > upper > > > > > layer stack like openstack/ovirt/libvirt to check if two devices are > > > > > live migration compatible. > > > > > The "devices" here could be MDEVs, physical devices, or hybrid of the > > > > > two. > > > > > e.g. we could use it to check whether > > > > > - a src MDEV can migrate to a target MDEV, > > > > > - a src VF in SRIOV can migrate to a target VF in SRIOV, > > > > > - a src MDEV can migration to a target VF in SRIOV. > > > > > (e.g. SIOV/SRIOV backward compatibility case) > > > > > > > > > > The upper layer stack could use this interface as the last step to > > > > > check > > > > > if one device is able to migrate to another device before triggering > > > > > a real > > > > > live migration procedure. > > > > > we are not sure if this interface is of value or help to you. please > > > > > don't > > > > > hesitate to drop your valuable comments. > > > > > > > > > > > > > > > (1) interface definition > > > > > The interface is defined in below way: > > > > > > > > > >__userspace > > > > > /\ \ > > > > >/ \write > > > > > / read \ > > > > > /__ ___\|/_ > > > > > | migration_version | | migration_version |-->check migration > > > > > - - compatibility > > > > >device Adevice B > > > > > > > > > > > > > > > a device attribute named migration_version is defined under each > > > > > device's > > > > > sysfs node. e.g. > > > > > (/sys/bus/pci/devices/\:00\:02.0/$mdev_UUID/migration_version). > > > > > > > > Are you aware of the devlink based device management interface that is > > > > proposed upstream? 
I think it has many advantages over sysfs, do you > > > > consider to switch to that? > > > > Advantages, such as? > > > My understanding for devlink(netlink) over sysfs (some are mentioned at > the time of vDPA sysfs mgmt API discussion) are:

I thought netlink was used more as a configuration protocol to query and configure NICs (and, I guess, other devices in its devlink form), requiring a tool to be written that can speak the protocol in order to interact with it. The primary advantage of sysfs is that everything is just a file. There are no additional dependencies needed, and unlike netlink there are no interoperability issues in a containerised environment. If you are using different versions of libc and gcc in the container vs. the host, my understanding is that tools like ethtool from Ubuntu deployed in a container on a CentOS host can have issues communicating with the host kernel. If it's just a file, then unless the format the data is returned in changes, or the layout of sysfs changes, it is compatible regardless of what you use to read it.

> > - existing users (NIC, crypto, SCSI, ib), mature and stable > - much better error reporting (ext_ack other than string or errno) > - namespace aware > - do not couple with kobject > > Thanks >
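To illustrate the "everything is just a file" point against the migration_version attribute from the original proposal, here is a minimal Python sketch of the read-from-source, write-to-target check. It assumes the attribute behaves as the proposal describes (a missing attribute, a failed read, or a failed write all mean "not compatible"). The function name and the paths passed to it are hypothetical.

```python
import os


def is_migration_compatible(src_attr, dst_attr):
    """Return True only if the source version string can be read and
    written unmodified to the target attribute without any error."""
    try:
        with open(src_attr, "rb") as src:
            version = src.read()
        # O_WRONLY without O_CREAT: a missing attribute is an error and is
        # never created, matching how a sysfs attribute would behave.
        fd = os.open(dst_attr, os.O_WRONLY)
        try:
            os.write(fd, version)
        finally:
            os.close(fd)
    except OSError:
        # Missing attribute, read error, or write rejected by the target.
        return False
    return True
```

Nothing beyond ordinary file I/O is needed, so the same check works from any language or container image that can see the sysfs tree. It still has to run on a host that can reach both attributes, which is the limitation raised elsewhere in this thread for scheduler-side use.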
Re: device compatibility interface for live migration with assigned devices
Resending with the full cc list, since I had this typed up. I would blame my email provider, but my email client does not seem to like long cc lists. We probably want to continue on Alex's thread so as not to split the discussion, but I have responded inline with some examples of how OpenStack schedules and what I meant by different mdev_types.

On Tue, 2020-07-14 at 20:29 +0100, Sean Mooney wrote: > On Tue, 2020-07-14 at 11:01 -0600, Alex Williamson wrote: > > On Tue, 14 Jul 2020 13:33:24 +0100 > > Sean Mooney wrote: > > > > > On Tue, 2020-07-14 at 11:21 +0100, Daniel P. Berrangé wrote: > > > > On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote: > > > > > hi folks, > > > > > we are defining a device migration compatibility interface that helps > > > > > upper > > > > > layer stack like openstack/ovirt/libvirt to check if two devices are > > > > > live migration compatible. > > > > > The "devices" here could be MDEVs, physical devices, or hybrid of the > > > > > two. > > > > > e.g. we could use it to check whether > > > > > - a src MDEV can migrate to a target MDEV, > > > > > > mdev live migration is completely possible to do but i agree with Dan > > > barrange's comments > > > from the point of view of openstack integration i dont see calling out to > > > a vender sepecific > > > tool to be an accpetable > > > > As I replied to Dan, I'm hoping Yan was referring more to vendor > > specific knowledge rather than actual tools. > > > > > solutions for device compatiablity checking. the sys filesystem > > > that describs the mdevs that can be created shoudl also > > > contain the relevent infomation such > > > taht nova could integrate it via libvirt xml representation or directly > > > retrive the > > > info from > > > sysfs. 
> > > > > - a src VF in SRIOV can migrate to a target VF in SRIOV, > > > > > > so vf to vf migration is not possible in the general case as there is no > > > standarised > > > way to transfer teh device state as part of the siorv specs produced by > > > the pci-sig > > > as such there is not vender neutral way to support sriov live migration. > > > > We're not talking about a general case, we're talking about physical > > devices which have vfio wrappers or hooks with device specific > > knowledge in order to support the vfio migration interface. The point > > is that a discussion around vfio device migration cannot be limited to > > mdev devices. > > ok, upstream in OpenStack at least we do not plan to support generic live migration for passthrough devices. We cheat with network interfaces: since operating systems in general handle hotplug of a NIC somewhat safely, where no abstraction layer like an mdev or a macvtap device is present, we hot-unplug the NIC before the migration and attach a new one after. For GPUs or crypto cards this likely would not be viable, since you cannot bond generic hardware devices to hide the removal and re-addition of a generic PCI device. We were hoping that there would be a convergence around MDEVs as a way to provide that abstraction going forward for generic devices, or some other new mechanism in the future. > > > > > > > - a src MDEV can migration to a target VF in SRIOV. > > > > > > that also makes this unviable > > > > > (e.g. SIOV/SRIOV backward compatibility case) > > > > > > > > > > The upper layer stack could use this interface as the last step to > > > > > check > > > > > if one device is able to migrate to another device before triggering > > > > > a real > > > > > live migration procedure. > > > > > > well actully that is already too late really. ideally we would want to do > > > this compaiablity > > > check much sooneer to avoid the migration failing. 
in an openstack > > > envionment at least > > > by the time we invoke libvirt (assuming your using the libvirt driver) to > > > do the migration we have alreaedy > > > finished schduling the instance to the new host. if if we do the > > > compatiablity check at this point > > > and it fails then the live migration is aborted and will not be retired. > > > These types of late check lead to a > > > poor user experince as unless you check the migration detial it basically > > > looks like the
Re: device compatibility interface for live migration with assigned devices
On Tue, 2020-07-14 at 11:21 +0100, Daniel P. Berrangé wrote: > On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote: > > hi folks, > > we are defining a device migration compatibility interface that helps upper > > layer stack like openstack/ovirt/libvirt to check if two devices are > > live migration compatible. > > The "devices" here could be MDEVs, physical devices, or hybrid of the two. > > e.g. we could use it to check whether > > - a src MDEV can migrate to a target MDEV,

mdev live migration is completely possible to do, but I agree with Dan Berrangé's comments. From the point of view of OpenStack integration, I don't see calling out to a vendor-specific tool as an acceptable solution for device compatibility checking. The sysfs filesystem that describes the mdevs that can be created should also contain the relevant information, such that Nova could integrate it via the libvirt XML representation or retrieve the info directly from sysfs.

> > - a src VF in SRIOV can migrate to a target VF in SRIOV,

So VF-to-VF migration is not possible in the general case, as there is no standardised way to transfer the device state as part of the SR-IOV specs produced by the PCI-SIG; as such, there is no vendor-neutral way to support SR-IOV live migration.

> > - a src MDEV can migration to a target VF in SRIOV.

That also makes this unviable.

> > (e.g. SIOV/SRIOV backward compatibility case) > > > > The upper layer stack could use this interface as the last step to check > > if one device is able to migrate to another device before triggering a real > > live migration procedure.

Well, actually that is already too late. Ideally we would want to do this compatibility check much sooner, to avoid the migration failing. In an OpenStack environment at least, by the time we invoke libvirt (assuming you are using the libvirt driver) to do the migration, we have already finished scheduling the instance to the new host.
If we do the compatibility check at this point and it fails, then the live migration is aborted and will not be retried. These types of late checks lead to a poor user experience: unless you check the migration details, it basically looks like the migration was ignored, as it starts to migrate and then continues running on the original host.

When using generic PCI passthrough with OpenStack, the PCI alias is intended to reference a single vendor ID/product ID, so you will have one or more aliases for each type of device. That allows OpenStack to schedule based on the availability of a compatible device, because we track inventories of PCI devices and can query them when selecting a host.

If we were to support mdev live migration in the future, we would want to take the same declarative approach:
1. introspect the capabilities of the devices we manage
2. create inventories of the allocatable devices and their capabilities
3. schedule the instance to a host based on the device type/capabilities and claim it atomically to prevent races
4. have the lower-level hypervisors do additional validation if needed before live migration.

This proposal seems to be targeting extending step 4, whereas ideally we should focus on providing the info that would be relevant in step 1, preferably in a vendor-neutral way via a kernel interface like /sys.

> > we are not sure if this interface is of value or help to you. please don't > > hesitate to drop your valuable comments. > > > > > > (1) interface definition > > The interface is defined in below way: > > > > __userspace > > /\ \ > > / \write > > / read \ > >/__ ___\|/_ > > | migration_version | | migration_version |-->check migration > > - - compatibility > > device Adevice B > > > > > > a device attribute named migration_version is defined under each device's > > sysfs node. e.g. > > (/sys/bus/pci/devices/\:00\:02.0/$mdev_UUID/migration_version). 
This might be useful, as we could tag the inventory with the migration version and only migrate to devices with the same version.

> > userspace tools read the migration_version as a string from the source > > device, > > and write it to the migration_version sysfs attribute in the target device.

This would not be useful, as the scheduler cannot directly connect to the compute host, and even if it could, it would be extremely slow to do this for thousands of hosts and potentially multiple devices per host.

> > > > The userspace should treat ANY of below conditions as two devices not > > compatible: > > - any one of the two devices does not have a migration_version attribute > > - error when reading from migration_version attribute of one device > > - error when writing migration_version string of one device to > > migration_version attribute of the other device > > > > The string read from migration_version attribute is defined by