RE: device compatibility interface for live migration with assigned devices
On Monday, September 14, 2020 10:45 PM Alex Williamson wrote:
> On Mon, 14 Sep 2020 13:48:43 + "Zeng, Xin" wrote:
> > [...]
Re: device compatibility interface for live migration with assigned devices
On Mon, 14 Sep 2020 13:48:43 + "Zeng, Xin" wrote:
> On Saturday, September 12, 2020 12:52 AM Alex Williamson wrote:
> > [...]
RE: device compatibility interface for live migration with assigned devices
On Saturday, September 12, 2020 12:52 AM Alex Williamson wrote:
> On Fri, 11 Sep 2020 08:56:00 +0800 Yan Zhao wrote:
> > [...]
Re: device compatibility interface for live migration with assigned devices
On Fri, 11 Sep 2020 08:56:00 +0800 Yan Zhao wrote:
> On Thu, Sep 10, 2020 at 12:02:44PM -0600, Alex Williamson wrote:
> > [...]
RE: device compatibility interface for live migration with assigned devices
> From: Cornelia Huck
> Sent: Friday, September 11, 2020 6:08 PM
>
> On Fri, 11 Sep 2020 08:56:00 +0800 Yan Zhao wrote:
> > [...]
Re: device compatibility interface for live migration with assigned devices
On Fri, 11 Sep 2020 08:56:00 +0800 Yan Zhao wrote:
> [...]
> hi Alex,
> got it. so do you suggest that vendors use consistent driver name over
> generations of devices?
> for qat, they create different modules for each generation. This
> practice is not good if they want to support migration between devices
> of different generations, right?

Even if they create different modules, I'd assume that they
Re: device compatibility interface for live migration with assigned devices
On Thu, Sep 10, 2020 at 12:02:44PM -0600, Alex Williamson wrote:
> [...]
> +1 to all this, the mdev type is meant to indicate a software
> compatible interface, if different hardware versions can be software
> compatible, then don't make the job of finding a compatible device
> harder. The full type is a combination of the vendor driver name plus
> the vendor provided type name specifically in order to provide a type
> namespace per vendor driver. That's done at the mdev core level.
> Thanks,

hi Alex,
got it. so do you suggest that vendors use consistent driver name over
generations of devices?
for qat, they create different modules for each generation. This
practice is not good if they want to support migration between devices
of different generations, right?

And can I understand that we don't want to support migration between
different mdev types, even in the future?

Thanks
Yan
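The compatible_type proposal quoted in this thread (accept a target whose advertised compatible types are a superset of the source's) can be contrasted with the exact-type-match rule argued for in the replies. The following is a minimal Python sketch under stated assumptions: the `Device` class, its `compatible_types` set, and the helper functions are hypothetical illustrations, not an existing kernel, libvirt, or OpenStack Placement API; the type names `xxx-v4-yyy` and `xxx-v5-yyy` are the placeholders used in the thread.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Device:
    """Hypothetical view of an mdev-capable device as seen by management software."""
    mdev_type: str
    # Hypothetical "compatible_type" trait set advertised for this device.
    compatible_types: frozenset = field(default_factory=frozenset)


def compatible_by_superset(src: Device, dst: Device) -> bool:
    """Superset rule: the target must advertise every compatible type the source does."""
    return dst.compatible_types >= src.compatible_types


def compatible_by_exact_type(src: Device, dst: Device) -> bool:
    """Exact-match rule: the mdev type itself names the software interface."""
    return src.mdev_type == dst.mdev_type


old_gen = Device("xxx-v4-yyy", frozenset({"xxx-v4-yyy"}))
new_gen = Device("xxx-v5-yyy", frozenset({"xxx-v4-yyy", "xxx-v5-yyy"}))

print(compatible_by_superset(old_gen, new_gen))    # True: v5 device also claims v4
print(compatible_by_superset(new_gen, old_gen))    # False: v4 device lacks the v5 trait
print(compatible_by_exact_type(old_gen, new_gen))  # False: type names differ
```

In this sketch the superset rule needs an extra, vendor-maintained attribute to decide compatibility, while the exact-match rule relies only on the mdev type name, which is why the replies ask vendors not to encode compatible generations as different types.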
Re: device compatibility interface for live migration with assigned devices
On Thu, 10 Sep 2020 13:50:11 +0100 Sean Mooney wrote: > On Thu, 2020-09-10 at 14:38 +0200, Cornelia Huck wrote: > > On Wed, 9 Sep 2020 10:13:09 +0800 > > Yan Zhao wrote: > > > > > > > still, I'd like to put it more explicitly to make sure it's not > > > > > missed: > > > > > the reason we want to specify compatible_type as a trait and check > > > > > whether target compatible_type is the superset of source > > > > > compatible_type is for the consideration of backward compatibility. > > > > > e.g. > > > > > an old generation device may have a mdev type xxx-v4-yyy, while a > > > > > newer > > > > > generation device may be of mdev type xxx-v5-yyy. > > > > > with the compatible_type traits, the old generation device is still > > > > > able to be regarded as compatible to newer generation device even though > > > > > their > > > > > mdev types are not equal. > > > > > > > > If you want to support migration from v4 to v5, can't the (presumably > > > > newer) driver that supports v5 simply register the v4 type as well, so > > > > that the mdev can be created as v4? (Just like QEMU versioned machine > > > > types work.) > > > > > > yes, it should work in some conditions. > > > but it may not be that good in some cases when v5 and v4 in the name > > > string > > > of mdev type identify hardware generation (e.g. v4 for gen8, and v5 for > > > gen9) > > > > > > e.g. > > > (1) when src mdev type is v4 and target mdev type is v5 as > > > software does not support it initially, and v4 and v5 identify hardware > > > differences. > > > > My first hunch here is: Don't introduce types that may be compatible > > later. Either make them compatible, or make them distinct by design, > > and possibly add a different, compatible type later. > > > > > then after software upgrade, v5 is now compatible to v4, should the > > > software now downgrade mdev type from v5 to v4? > > > not sure if moving hardware generation info into a separate attribute > > > from mdev type name is better. e.g. 
remove v4, v5 in mdev type, while using > > > compatible_pci_ids to identify compatibility. > > > > If the generations are compatible, don't mention it in the mdev type. > > If they aren't, use distinct types, so that management software doesn't > > have to guess. At least that would be my naive approach here. > yep that is what I would prefer to see too. > > > > > > > > (2) name string of mdev type is composed of "driver_name + type_name". > > > in some devices, e.g. qat, different generations of devices are bound to > > > drivers of different names, e.g. "qat-v4", "qat-v5". > > > then even though type_name is equal, mdev type is not equal. e.g. > > > "qat-v4-type1", "qat-v5-type1". > > > > I guess that shows a shortcoming of that "driver_name + type_name" > > approach? Or maybe I'm just confused. > yes, I really don't like having the version in the mdev-type name; > I would strongly prefer just qat-type-1, where qat is just there as a way of > namespacing. > although symmetric-crypto, asymmetric-crypto and compression would be better > names than type-1, type-2, type-3 if > that is what they would end up mapping to. e.g. qat-compression or qat-aes > is a much better name than type-1. > higher layers of software are unlikely to parse the mdev names, but as a human > looking at them it's much easier to > understand if the names are meaningful. the qat prefix I think is important, > however, to make sure that your mdev-types > don't collide with other vendors' mdev types. so I would encourage all vendors > to prefix their mdev types with either the > device name or the vendor. +1 to all this. The mdev type is meant to indicate a software-compatible interface; if different hardware versions can be software compatible, then don't make the job of finding a compatible device harder. The full type is a combination of the vendor driver name plus the vendor-provided type name, specifically in order to provide a type namespace per vendor driver. That's done at the mdev core level.
Thanks, Alex
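As a minimal illustration of the "driver_name + type_name" composition discussed above, the sketch below builds a full type id by joining the parent driver name and the vendor-provided type name with a hyphen. The helper name is hypothetical, and the hyphen separator is an assumption inferred from names used in this thread such as "qat-v4-type1" and "i915-GVTg_V5_4". It shows why the same vendor type name under two differently named drivers yields two distinct mdev types.

```python
def full_mdev_type(driver_name: str, type_name: str) -> str:
    """Return the namespaced mdev type id for a vendor-provided type name.

    Hypothetical helper: assumes the '<driver_name>-<type_name>' form seen in
    names such as "qat-v4-type1" and "i915-GVTg_V5_4".
    """
    return f"{driver_name}-{type_name}"


if __name__ == "__main__":
    v4 = full_mdev_type("qat-v4", "type1")
    v5 = full_mdev_type("qat-v5", "type1")
    # Same vendor type name, different parent drivers: the full types differ.
    print(v4, v5, v4 == v5)  # qat-v4-type1 qat-v5-type1 False
```

Keeping the driver name free of a generation suffix (a single "qat" driver exposing, say, "qat-compression") would keep the full type stable across hardware generations, which is the naming Sean and Alex favor.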
Re: device compatibility interface for live migration with assigned devices
On Thu, 2020-09-10 at 14:38 +0200, Cornelia Huck wrote: > On Wed, 9 Sep 2020 10:13:09 +0800 > Yan Zhao wrote: > > > > > still, I'd like to put it more explicitly to make sure it's not > > > > missed: > > > > the reason we want to specify compatible_type as a trait and check > > > > whether target compatible_type is the superset of source > > > > compatible_type is for the consideration of backward compatibility. > > > > e.g. > > > > an old generation device may have a mdev type xxx-v4-yyy, while a newer > > > > generation device may be of mdev type xxx-v5-yyy. > > > > with the compatible_type traits, the old generation device is still > > > > able to be regarded as compatible to newer generation device even though their > > > > mdev types are not equal. > > > > > > If you want to support migration from v4 to v5, can't the (presumably > > > newer) driver that supports v5 simply register the v4 type as well, so > > > that the mdev can be created as v4? (Just like QEMU versioned machine > > > types work.) > > > > yes, it should work in some conditions. > > but it may not be that good in some cases when v5 and v4 in the name string > > of mdev type identify hardware generation (e.g. v4 for gen8, and v5 for > > gen9) > > > > e.g. > > (1) when src mdev type is v4 and target mdev type is v5 as > > software does not support it initially, and v4 and v5 identify hardware > > differences. > > My first hunch here is: Don't introduce types that may be compatible > later. Either make them compatible, or make them distinct by design, > and possibly add a different, compatible type later. > > > then after software upgrade, v5 is now compatible to v4, should the > > software now downgrade mdev type from v5 to v4? > > not sure if moving hardware generation info into a separate attribute > > from mdev type name is better. e.g. remove v4, v5 in mdev type, while using > > compatible_pci_ids to identify compatibility. > > If the generations are compatible, don't mention it in the mdev type.
> If they aren't, use distinct types, so that management software doesn't > have to guess. At least that would be my naive approach here. yep that is what I would prefer to see too. > > > > > (2) name string of mdev type is composed of "driver_name + type_name". > > in some devices, e.g. qat, different generations of devices are bound to > > drivers of different names, e.g. "qat-v4", "qat-v5". > > then even though type_name is equal, mdev type is not equal. e.g. > > "qat-v4-type1", "qat-v5-type1". > > I guess that shows a shortcoming of that "driver_name + type_name" > approach? Or maybe I'm just confused. yes, I really don't like having the version in the mdev-type name; I would strongly prefer just qat-type-1, where qat is just there as a way of namespacing. although symmetric-crypto, asymmetric-crypto and compression would be better names than type-1, type-2, type-3 if that is what they would end up mapping to. e.g. qat-compression or qat-aes is a much better name than type-1. higher layers of software are unlikely to parse the mdev names, but as a human looking at them it's much easier to understand if the names are meaningful. the qat prefix I think is important, however, to make sure that your mdev-types don't collide with other vendors' mdev types. so I would encourage all vendors to prefix their mdev types with either the device name or the vendor. >
Re: device compatibility interface for live migration with assigned devices
On Wed, 9 Sep 2020 10:13:09 +0800 Yan Zhao wrote: > > > still, I'd like to put it more explicitly to make sure it's not missed: > > > the reason we want to specify compatible_type as a trait and check > > > whether target compatible_type is the superset of source > > > compatible_type is for the consideration of backward compatibility. > > > e.g. > > > an old generation device may have a mdev type xxx-v4-yyy, while a newer > > > generation device may be of mdev type xxx-v5-yyy. > > > with the compatible_type traits, the old generation device is still > > > able to be regarded as compatible to newer generation device even though their > > > mdev types are not equal. > > > > If you want to support migration from v4 to v5, can't the (presumably > > newer) driver that supports v5 simply register the v4 type as well, so > > that the mdev can be created as v4? (Just like QEMU versioned machine > > types work.) > yes, it should work in some conditions. > but it may not be that good in some cases when v5 and v4 in the name string > of mdev type identify hardware generation (e.g. v4 for gen8, and v5 for > gen9) > > e.g. > (1) when src mdev type is v4 and target mdev type is v5 as > software does not support it initially, and v4 and v5 identify hardware > differences. My first hunch here is: Don't introduce types that may be compatible later. Either make them compatible, or make them distinct by design, and possibly add a different, compatible type later. > then after software upgrade, v5 is now compatible to v4, should the > software now downgrade mdev type from v5 to v4? > not sure if moving hardware generation info into a separate attribute > from mdev type name is better. e.g. remove v4, v5 in mdev type, while using > compatible_pci_ids to identify compatibility. If the generations are compatible, don't mention it in the mdev type. If they aren't, use distinct types, so that management software doesn't have to guess. At least that would be my naive approach here.
> > (2) name string of mdev type is composed of "driver_name + type_name". > in some devices, e.g. qat, different generations of devices are bound to > drivers of different names, e.g. "qat-v4", "qat-v5". > then even though type_name is equal, mdev type is not equal. e.g. > "qat-v4-type1", "qat-v5-type1". I guess that shows a shortcoming of that "driver_name + type_name" approach? Or maybe I'm just confused.
Re: device compatibility interface for live migration with assigned devices
hi All,

Per our previous discussion, there are two main concerns with the previous proposal:
(1) it's currently hard for openstack to match mdev types.
(2) it's complicated.

so, we further propose the changes below:
(1) requiring two compatible mdevs to have the same mdev type for now (though the kernel still exposes the compatible_type attribute for future use).
(2) requiring a 1:1 match for other attributes under the sysfs type node for now (those attributes are specified via compatible_ but with only 1 value in it).
(3) not matching attributes under the device instance node. rather, they are regarded as part of the resource claiming process, so src and dest values are ensured to be 1:1.

A dynamic_resources attribute under the sysfs node is added to list the attributes under the device instance that mgt tools need to ensure are 1:1 between src and dest. The "aggregator" attribute under the device instance node is one such attribute that needs to be listed. Those listed attributes could actually be treated as device state set by the vendor driver during live migration, but we still want them to be set by mgt tools before live migration starts, in order to reduce the chance of live migration failure.

do you like those changes?

after the changes, the sysfs interface would look like below:

|- [parent physical device]
|--- Vendor-specific-attributes [optional]
|--- [mdev_supported_types]
|     |--- []
|     |     |--- create
|     |     |--- name
|     |     |--- available_instances
|     |     |--- device_api
|     |     |--- software_version
|     |     |--- compatible_type
|     |     |--- compatible_
|     |     |--- compatible_
|     |     |--- dynamic_resources
|     |     |--- description
|     |     |--- [devices]

- device_api: an exact match between src and dest is required. its value can be one of "vfio-pci", "vfio-platform", "vfio-amba", "vfio-ccw", "vfio-ap".
- software_version: version of the vendor driver, in major.minor.bugfix scheme. dest major should be equal to src major, and dest minor should be no less than src minor. once migration-stream-related code changes, vendor drivers need to bump the version.
- compatible_type: not used by mgt tools currently. vendor drivers can provide this attribute, but need to know that mgt apps would ignore it. when mgt tools support this attribute in the future, it would allow migration across different mdev types, so that devices of an older generation may be able to migrate to newer generations.
- compatible_: for device api specific attributes, e.g. compatible_subchannel_type, dest values should be a superset of src values. vendor drivers can specify only one value in this attribute, in order to do an exact match between src and dest. It's ok for mgt tools to only read one value in the attribute so that src:dest values are 1:1.
- compatible_: for mdev type specific attributes, e.g. compatible_pci_ids, compatible_chpid_type, dest values should be a superset of src values. vendor drivers can specify only one value in the attribute in order to do an exact match between src and dest. It's ok for mgt tools to only read one value in the attribute so that src:dest values are 1:1.
- dynamic_resources: though defined statically under , this attribute lists attributes under the device instance that need to be set as part of claiming dest resources. e.g. $ cat dynamic_resources: aggregator, fps,... then after the dest device is created, the values of its device attributes need to be set to those of the src device attributes. Failure in syncing src device values to dest device values is treated the same as failing to claim dest resources. attributes under the device instance that are not listed in this attribute would not be part of resource checking in mgt tools.

Thanks
Yan
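The matching rules in the proposal above can be sketched as a small check a management tool might run. This is a minimal sketch under stated assumptions, not an existing tool's API: the type-node attribute files are assumed to have already been read into plain dicts, the "mdev_type" key (holding the type directory name) is hypothetical, multi-valued attributes are assumed to be comma-separated, and all example values are made up.

```python
from __future__ import annotations

# Sketch of the revised matching rules. Assumptions (not an existing
# management-tool API): attribute files under the type node were already read
# into plain dicts; "mdev_type" holds the type directory name; multi-valued
# attributes are comma-separated; all example values are hypothetical.


def parse_values(raw: str) -> set[str]:
    """Split a comma-separated attribute value into a set of tokens."""
    return {v.strip() for v in raw.split(",") if v.strip()}


def parse_version(raw: str) -> tuple[int, int, int]:
    """Parse a major.minor.bugfix software_version string."""
    major, minor, bugfix = (int(x) for x in raw.strip().split("."))
    return major, minor, bugfix


def types_compatible(src: dict[str, str], dst: dict[str, str]) -> bool:
    """Apply the proposed src -> dest checks on type-node attributes."""
    # (1) Same mdev type for now; compatible_type is ignored by mgt tools.
    if src["mdev_type"] != dst["mdev_type"]:
        return False
    # device_api: exact match.
    if src["device_api"] != dst["device_api"]:
        return False
    # software_version: same major, dest minor no less than src minor.
    s_major, s_minor, _ = parse_version(src["software_version"])
    d_major, d_minor, _ = parse_version(dst["software_version"])
    if s_major != d_major or d_minor < s_minor:
        return False
    # compatible_*: dest values must be a superset of src values. With only
    # one value per attribute this degenerates to the proposed 1:1 match.
    for name, raw in src.items():
        if not name.startswith("compatible_") or name == "compatible_type":
            continue
        if name not in dst or not parse_values(raw) <= parse_values(dst[name]):
            return False
    return True


def sync_dynamic_resources(src_type: dict[str, str], src_dev: dict[str, str],
                           dst_dev: dict[str, str]) -> None:
    """Copy every device attribute listed in dynamic_resources to the dest.

    A KeyError for a missing attribute stands in for the "failed to claim
    dest resources" case.
    """
    for attr in sorted(parse_values(src_type.get("dynamic_resources", ""))):
        dst_dev[attr] = src_dev[attr]


if __name__ == "__main__":
    src = {
        "mdev_type": "i915-GVTg_V5_4",
        "device_api": "vfio-pci",
        "software_version": "1.2.0",
        "compatible_pci_ids": "0x1234",
        "dynamic_resources": "aggregator",
    }
    dst = dict(src, software_version="1.3.1")
    print(types_compatible(src, dst))  # True
    dst_dev: dict[str, str] = {}
    sync_dynamic_resources(src, {"aggregator": "2"}, dst_dev)
    print(dst_dev)  # {'aggregator': '2'}
```

Using a subset test for compatible_* keeps the superset semantics in place for the future. When vendors publish a single value, the same code performs the exact 1:1 match the proposal asks for today.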
Re: device compatibility interface for live migration with assigned devices
> > still, I'd like to put it more explicitly to make sure it's not missed: > > the reason we want to specify compatible_type as a trait and check > > whether target compatible_type is the superset of source > > compatible_type is for the consideration of backward compatibility. > > e.g. > > an old generation device may have a mdev type xxx-v4-yyy, while a newer > > generation device may be of mdev type xxx-v5-yyy. > > with the compatible_type traits, the old generation device is still > > able to be regarded as compatible to newer generation device even though their > > mdev types are not equal. > > If you want to support migration from v4 to v5, can't the (presumably > newer) driver that supports v5 simply register the v4 type as well, so > that the mdev can be created as v4? (Just like QEMU versioned machine > types work.) yes, it should work in some conditions. but it may not be that good in some cases when v5 and v4 in the name string of mdev type identify hardware generation (e.g. v4 for gen8, and v5 for gen9) e.g. (1) when src mdev type is v4 and target mdev type is v5 as software does not support it initially, and v4 and v5 identify hardware differences. then after software upgrade, v5 is now compatible to v4, should the software now downgrade mdev type from v5 to v4? not sure if moving hardware generation info into a separate attribute from mdev type name is better. e.g. remove v4, v5 in mdev type, while using compatible_pci_ids to identify compatibility. (2) name string of mdev type is composed of "driver_name + type_name". in some devices, e.g. qat, different generations of devices are bound to drivers of different names, e.g. "qat-v4", "qat-v5". then even though type_name is equal, mdev type is not equal. e.g. "qat-v4-type1", "qat-v5-type1". Thanks Yan
Re: device compatibility interface for live migration with assigned devices
On Mon, 31 Aug 2020 12:43:44 +0800 Yan Zhao wrote: > On Fri, Aug 28, 2020 at 03:04:12PM +0100, Sean Mooney wrote: > > On Fri, 2020-08-28 at 15:47 +0200, Cornelia Huck wrote: > > > On Wed, 26 Aug 2020 14:41:17 +0800 > > > Yan Zhao wrote: > > > > > > > previously, we wanted to regard the two mdevs created with dsa-1dwq x 30 > > > > and > > > > dsa-2dwq x 15 as compatible, because the two mdevs consist of equal > > > > resources. > > > > > > > > But, as it's a burden to upper layer, we agree that if this condition > > > > happens, we still treat the two as incompatible. > > > > > > > > To fix it, either the driver should expose dsa-1dwq only, or the target > > > > dsa-2dwq needs to be destroyed and reallocated via dsa-1dwq x 30. > > > > > > AFAIU, these are mdev types, aren't they? So, basically, any management > > > software needs to take care to use the matching mdev type on the target > > > system for device creation? > > > > or just do the simple thing of using the same mdev type on the source and > > dest. > > matching mdev types is not necessarily trivial. we could do that, but we would > > have > > to do that in python rather than sql, so it would be slower to do, at least > > today. > > > > we don't currently have the ability to say the resource provider must have 1 > > of this > > set of traits, just that we must have a specific trait. this is a feature > > we have > > discussed a couple of times and delayed until we really really need it, but > > it's not out > > of the question that we could add it for this use case. I suspect, however, we > > would do exact > > match first and explore this later after the initial mdev migration works. > > Yes, I think it's good. > > still, I'd like to put it more explicitly to make sure it's not missed: > the reason we want to specify compatible_type as a trait and check > whether target compatible_type is the superset of source > compatible_type is for the consideration of backward compatibility. > e.g.
> an old generation device may have a mdev type xxx-v4-yyy, while a newer > generation device may be of mdev type xxx-v5-yyy. > with the compatible_type traits, the old generation device is still > able to be regarded as compatible to newer generation device even though their > mdev types are not equal. If you want to support migration from v4 to v5, can't the (presumably newer) driver that supports v5 simply register the v4 type as well, so that the mdev can be created as v4? (Just like QEMU versioned machine types work.)
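With the approach Cornelia suggests, the destination only has to offer the exact type the source used, much as a newer QEMU still offers older versioned machine types. The sketch below shows that lookup-and-create step. It assumes the standard mdev sysfs layout (a per-type directory under mdev_supported_types with a writable "create" file). The function name is hypothetical, and error handling beyond the type lookup is omitted.

```python
import os
import uuid
from typing import Optional


def create_matching_mdev(parent_sysfs: str, src_type: str) -> Optional[str]:
    """Create an mdev of exactly src_type under a parent device, if offered.

    parent_sysfs is the parent device's sysfs directory (for example an entry
    under /sys/class/mdev_bus/). Returns the new mdev's UUID, or None when the
    parent does not register src_type, in which case the host is not a
    migration target for this device.
    """
    type_dir = os.path.join(parent_sysfs, "mdev_supported_types", src_type)
    if not os.path.isdir(type_dir):
        return None
    new_uuid = str(uuid.uuid4())
    # Writing a UUID into the type's "create" file asks the vendor driver to
    # instantiate an mdev of that type; a rejection surfaces as OSError.
    with open(os.path.join(type_dir, "create"), "w") as f:
        f.write(new_uuid)
    return new_uuid
```

If the destination driver does not register the source type, the function returns None and the host is simply not a candidate. No cross-type guessing is attempted.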
Re: device compatibility interface for live migration with assigned devices
On Fri, Aug 28, 2020 at 03:04:12PM +0100, Sean Mooney wrote: > On Fri, 2020-08-28 at 15:47 +0200, Cornelia Huck wrote: > > On Wed, 26 Aug 2020 14:41:17 +0800 > > Yan Zhao wrote: > > > > > previously, we wanted to regard the two mdevs created with dsa-1dwq x 30 and > > > dsa-2dwq x 15 as compatible, because the two mdevs consist of equal > > > resources. > > > > > > But, as it's a burden to upper layer, we agree that if this condition > > > happens, we still treat the two as incompatible. > > > > > > To fix it, either the driver should expose dsa-1dwq only, or the target > > > dsa-2dwq needs to be destroyed and reallocated via dsa-1dwq x 30. > > > > AFAIU, these are mdev types, aren't they? So, basically, any management > > software needs to take care to use the matching mdev type on the target > > system for device creation? > > or just do the simple thing of using the same mdev type on the source and dest. > matching mdev types is not necessarily trivial. we could do that, but we would > have > to do that in python rather than sql, so it would be slower to do, at least > today. > > we don't currently have the ability to say the resource provider must have 1 of > this > set of traits, just that we must have a specific trait. this is a feature we > have > discussed a couple of times and delayed until we really really need it, but > it's not out > of the question that we could add it for this use case. I suspect, however, we > would do exact > match first and explore this later after the initial mdev migration works. Yes, I think it's good. still, I'd like to put it more explicitly to make sure it's not missed: the reason we want to specify compatible_type as a trait and check whether target compatible_type is the superset of source compatible_type is for the consideration of backward compatibility. e.g. an old generation device may have a mdev type xxx-v4-yyy, while a newer generation device may be of mdev type xxx-v5-yyy.
with the compatible_type traits, the old generation device is still able to be regarded as compatible to newer generation device even though their mdev types are not equal. Thanks Yan > by the way, I was looking at some vdpa related material today and noticed vdpa > devices are no longer > using mdevs and now use a vhost chardev, so I guess we will need a > completely separate mechanism > for vdpa vs mdev migration as a result. that is rather unfortunate, but I guess > that is life. > > >
Re: [ovirt-devel] Re: device compatibility interface for live migration with assigned devices
On 2020/8/21 10:52 PM, Cornelia Huck wrote: On Fri, 21 Aug 2020 11:14:41 +0800 Jason Wang wrote: On 2020/8/20 8:27 PM, Cornelia Huck wrote: On Wed, 19 Aug 2020 17:28:38 +0800 Jason Wang wrote: On 2020/8/19 4:13 PM, Yan Zhao wrote: On Wed, Aug 19, 2020 at 03:39:50PM +0800, Jason Wang wrote: On 2020/8/19 2:59 PM, Yan Zhao wrote: On Wed, Aug 19, 2020 at 02:57:34PM +0800, Jason Wang wrote: On 2020/8/19 11:30 AM, Yan Zhao wrote: hi All, could we decide that sysfs is the interface that every VFIO vendor driver needs to provide in order to support vfio live migration, otherwise the userspace management tool would not list the device in the compatible list? if that's true, let's move on to standardizing the sysfs interface. (1) content common part: (must) - software_version: (in major.minor.bugfix scheme) This cannot work for devices whose features can be negotiated/advertised independently. (E.g. virtio devices) I thought the 'software_version' was supposed to describe kind of a 'protocol version' for the data we transmit? I.e., you add a new field, you bump the version number. Ok, but since we mandate backward compatibility of uABI, is it really worth having a version for sysfs? (Searching on sysfs shows no examples like this) I was not thinking about the sysfs interface, but rather about the data that is sent over while migrating. E.g. we find out that sending some auxiliary data is a good idea and bump to version 1.1.0; version 1.0.0 cannot deal with the extra data, but version 1.1.0 can deal with the older data stream. (...) Well, I think what data to transmit during migration is the duty of qemu, not the kernel. And I suspect the idea of reading opaque data (with version) from the kernel and transmitting it to dest is the best approach. - device_api: vfio-pci or vfio-ccw ... - type: mdev type for mdev device or a signature for physical device which is a counterpart for mdev type.
device api specific part: (must) - pci id: pci id of mdev parent device or pci id of physical pci device (device_api is vfio-pci)API here. So this assumes a PCI device which is probably not true. for device_api of vfio-pci, why is it not true? for vfio-ccw, it's subchannel_type. Ok but having two different attributes for the same file is not a good idea. How does mgmt know there will be a 3rd type? that's why some attributes need to be common. e.g. device_api: it's common because mgmt needs to know whether it's a pci device or a ccw device, and the api type is already defined in vfio.h. (The field was agreed on, and actually suggested, by Alex in a previous mail) type: mdev_type for mdev. if mgmt does not understand it, it would not be able to create one compatible mdev device. software_version: mgmt can compare the major and minor if it understands these fields. I think it would be helpful if you can describe how mgmt is expected to work step by step with the proposed sysfs API. This can help people to understand. My proposal would be: - check that device_api matches - check possible device_api specific attributes - check that type matches [I don't think the combination of mdev types and another attribute to determine compatibility is a good idea; Any reason for this? Actually if we only use mdev type to detect the compatibility, it would be much easier. Otherwise, we are actually re-inventing mdev types. E.g. can we have the same mdev types with different device_api and other attributes? In the end, the mdev type is represented as a string; but I'm not sure we can expect that two types with the same name, but a different device_api are related in any way. If we e.g. compare vfio-pci and vfio-ccw, they are fundamentally different. I was mostly concerned about the aggregation proposal, where type A + aggregation value b might be compatible with type B + aggregation value a. Yes, that looks pretty complicated.
actually, the current proposal confuses me every time I look at it] - check that software_version is compatible, assuming semantic versioning - check possible type-specific attributes I'm not sure if this is too complicated. And I suspect there will be vendor specific attributes: - for compatibility check: I think we should either model everything via mdev type or make it totally vendor specific. Having something in the middle will bring a lot of burden. FWIW, I'm for a strict match on mdev type, and flexibility in per-type attributes. I'm not sure whether the above flexibility can work better than encoding them into the mdev type. If we really want ultra flexibility, we need to make the compatibility check totally vendor specific. - for provisioning: it's still not clear. As shown in this proposal, for NVME we may need to set remote_url, but unless there will be a subclass (NVME) in the mdev (which I guess not), we can't prevent vendor from using another
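Cornelia's step-by-step check order can be sketched as a function that reports the first failing step, which is the kind of walkthrough Jason asks for. The function names and dict keys are hypothetical. Exact matching for the device_api-specific and type-specific attributes is an assumption, because the thread does not pin down their matching semantics. The version rule follows her semantic-versioning example: a 1.1.0 destination accepts a 1.0.0 stream, but not the reverse.

```python
from __future__ import annotations

from typing import Optional

# Hypothetical sketch of the proposed check order. Attribute names and the
# exact-match rule for device_api- and type-specific attributes are
# assumptions; the thread does not pin down their matching semantics.


def semver_compatible(src: str, dst: str) -> bool:
    """Same major version and dest not older than src (semantic versioning)."""
    s = tuple(int(x) for x in src.split("."))
    d = tuple(int(x) for x in dst.split("."))
    return s[0] == d[0] and d >= s


def first_mismatch(src: dict[str, str], dst: dict[str, str],
                   api_attrs: list[str], type_attrs: list[str]) -> Optional[str]:
    """Run the checks in order; return the first failing attribute or None."""
    if src["device_api"] != dst["device_api"]:
        return "device_api"
    for attr in api_attrs:  # e.g. a PCI id when device_api is vfio-pci
        if src.get(attr) != dst.get(attr):
            return attr
    if src["type"] != dst["type"]:
        return "type"
    if not semver_compatible(src["software_version"], dst["software_version"]):
        return "software_version"
    for attr in type_attrs:
        if src.get(attr) != dst.get(attr):
            return attr
    return None


if __name__ == "__main__":
    src = {"device_api": "vfio-pci", "pci_id": "0x1234",
           "type": "i915-GVTg_V5_4", "software_version": "1.0.0"}
    newer = dict(src, software_version="1.1.0")
    print(first_mismatch(src, newer, ["pci_id"], []))  # None
    print(first_mismatch(newer, src, ["pci_id"], []))  # software_version
```

Putting the cheap, generic checks first lets a management tool reject an obviously unsuitable host before it has to interpret any vendor-specific attribute.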
Re: device compatibility interface for live migration with assigned devices
On Fri, Aug 28, 2020 at 03:47:41PM +0200, Cornelia Huck wrote: > On Wed, 26 Aug 2020 14:41:17 +0800 > Yan Zhao wrote: > > > previously, we wanted to regard the two mdevs created with dsa-1dwq x 30 and > > dsa-2dwq x 15 as compatible, because the two mdevs consist of equal resources. > > > > But, as it's a burden to upper layer, we agree that if this condition > > happens, we still treat the two as incompatible. > > > > To fix it, either the driver should expose dsa-1dwq only, or the target > > dsa-2dwq needs to be destroyed and reallocated via dsa-1dwq x 30. > > AFAIU, these are mdev types, aren't they? So, basically, any management > software needs to take care to use the matching mdev type on the target > system for device creation? dsa-1dwq is the mdev type. there's no dsa-2dwq yet, and I think no dsa-2dwq should be provided in the future, according to our discussion. GVT currently does not support aggregator either. how to add the aggregator attribute is currently under discussion, and so far it is recommended to be a vendor-specific attribute. https://lists.freedesktop.org/archives/intel-gvt-dev/2020-July/006854.html. Thanks Yan
Re: device compatibility interface for live migration with assigned devices
On Fri, 2020-08-28 at 15:47 +0200, Cornelia Huck wrote: > On Wed, 26 Aug 2020 14:41:17 +0800 > Yan Zhao wrote: > > > previously, we wanted to regard the two mdevs created with dsa-1dwq x 30 and > > dsa-2dwq x 15 as compatible, because the two mdevs consist of equal resources. > > > > But, as it's a burden to upper layer, we agree that if this condition > > happens, we still treat the two as incompatible. > > > > To fix it, either the driver should expose dsa-1dwq only, or the target > > dsa-2dwq needs to be destroyed and reallocated via dsa-1dwq x 30. > > AFAIU, these are mdev types, aren't they? So, basically, any management > software needs to take care to use the matching mdev type on the target > system for device creation? or just do the simple thing of using the same mdev type on the source and dest. matching mdev types is not necessarily trivial. we could do that, but we would have to do that in python rather than sql, so it would be slower to do, at least today. we don't currently have the ability to say the resource provider must have 1 of this set of traits, just that we must have a specific trait. this is a feature we have discussed a couple of times and delayed until we really really need it, but it's not out of the question that we could add it for this use case. I suspect, however, we would do exact match first and explore this later after the initial mdev migration works. by the way, I was looking at some vdpa related material today and noticed vdpa devices are no longer using mdevs and now use a vhost chardev, so I guess we will need a completely separate mechanism for vdpa vs mdev migration as a result. that is rather unfortunate, but I guess that is life. >
Re: device compatibility interface for live migration with assigned devices
On Wed, 26 Aug 2020 14:41:17 +0800 Yan Zhao wrote: > previously, we wanted to regard the two mdevs created with dsa-1dwq x 30 and > dsa-2dwq x 15 as compatible, because the two mdevs consist of equal resources. > > But, as it's a burden to upper layer, we agree that if this condition > happens, we still treat the two as incompatible. > > To fix it, either the driver should expose dsa-1dwq only, or the target > dsa-2dwq needs to be destroyed and reallocated via dsa-1dwq x 30. AFAIU, these are mdev types, aren't they? So, basically, any management software needs to take care to use the matching mdev type on the target system for device creation?
Re: device compatibility interface for live migration with assigned devices
On Thu, Aug 20, 2020 at 02:24:26PM +0100, Sean Mooney wrote: > On Thu, 2020-08-20 at 14:27 +0800, Yan Zhao wrote: > > On Thu, Aug 20, 2020 at 06:16:28AM +0100, Sean Mooney wrote: > > > On Thu, 2020-08-20 at 12:01 +0800, Yan Zhao wrote: > > > > On Thu, Aug 20, 2020 at 02:29:07AM +0100, Sean Mooney wrote: > > > > > On Thu, 2020-08-20 at 08:39 +0800, Yan Zhao wrote: > > > > > > On Tue, Aug 18, 2020 at 11:36:52AM +0200, Cornelia Huck wrote: > > > > > > > On Tue, 18 Aug 2020 10:16:28 +0100 > > > > > > > Daniel P. Berrangé wrote: > > > > > > > > > > > > > > > On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote: > > > > > > > > > On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote: > > > > > > > > > > > > > > > > > > On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote: > > > > > > > > > > > > > > > > > > On 2020/8/14 1:16 PM, Yan Zhao wrote: > > > > > > > > > > > > > > > > > > On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: > > > > > > > > > > > > > > > > > > On 2020/8/10 3:46 PM, Yan Zhao wrote: > > > > > > > > > we actually can also retrieve the same information through > > > > > > > > > sysfs, e.g. > > > > > > > > > > > > > > > > > > |- [path to device] > > > > > > > > > |--- migration > > > > > > > > > | |--- self > > > > > > > > > | | |---device_api > > > > > > > > > | | |---mdev_type > > > > > > > > > | | |---software_version > > > > > > > > > | | |---device_id > > > > > > > > > | | |---aggregator > > > > > > > > > | |--- compatible > > > > > > > > > | | |---device_api > > > > > > > > > | | |---mdev_type > > > > > > > > > | | |---software_version > > > > > > > > > | | |---device_id > > > > > > > > > | | |---aggregator > > > > > > > > > > > > > > > > > > > > > > > > > > > Yes but: > > > > > > > > > > > > > > > > > > - You need one file per attribute (one syscall for one > > > > > > > > > attribute) > > > > > > > > > - Attribute is coupled with kobject > > > > > > > > > > > > > > Is that really that bad? 
You have the device with an embedded > > > > > > > kobject > > > > > > > anyway, and you can just put things into an attribute group? > > > > > > > > > > > > > > [Also, I think that self/compatible split in the example makes > > > > > > > things > > > > > > > needlessly complex. Shouldn't semantic versioning and matching > > > > > > > already > > > > > > > cover nearly everything? I would expect very few cases that are > > > > > > > more > > > > > > > complex than that. Maybe the aggregation stuff, but I don't think > > > > > > > we > > > > > > > need that self/compatible split for that, either.] > > > > > > > > > > > > Hi Cornelia, > > > > > > > > > > > > The reason I want to declare a compatible list of attributes is that > > > > > > sometimes it's not a simple 1:1 matching of source attributes and > > > > > > target attributes > > > > > > as I demonstrated below, > > > > > > source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is > > > > > > compatible to > > > > > > target mdev of (mdev_type i915-GVTg_V5_4 + aggregator 2), > > > > > > (mdev_type i915-GVTg_V5_8 + aggregator 4) > > > > > > > > > > the way you are doing the naming is still really confusing, by the way. > > > > > if this has not already been merged in the kernel, can you change the > > > > > mdev > > > > > so that mdev_type i915-GVTg_V5_2 is 2 of mdev_type i915-GVTg_V5_1 > > > > > instead of half the device? > > > > > > > > > > currently you need to divide the aggregator by the number at the end of > > > > > the mdev type to figure out > > > > > how much of the physical device is being used, which is a very unfriendly > > > > > api convention. > > > > > > > > > > the way aggregators are being proposed in general is not really something > > > > > I like, but I think this at least > > > > > is something that should be able to be corrected.
> > > > > > > > > > > with the complexity in the mdev type name + aggregator, I suspect that > > > > > this will never be supported > > > > > in openstack nova directly, requiring integration via cyborg, unless > > > > > we can pre-partition the > > > > > device into mdevs statically and just ignore this. > > > > > > > > > > this is way too vendor specific to integrate into something like > > > > > openstack in nova unless we can guarantee > > > > > that how aggregators work will be portable across vendors generically. > > > > > > > > > > > > > > > > > and aggregator may be just one of such examples where 1:1 matching > > > > > > does not > > > > > > fit. > > > > > > > > > > for openstack nova I don't see us supporting anything beyond the 1:1 case > > > > > where the mdev type does not change. > > > > > > > > > > > > > hi Sean, > > > > I understand it's hard for openstack. but 1:N is always meaningful. > > > > e.g. > > > > if source device 1 has cap A, it is compatible to > > > > device 2: cap A, > > > > device 3: cap A+B, > > > > device 4: cap A+B+C
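The disagreement above comes down to two matching rules, which can be compared directly. Yan's 1:N rule accepts any target whose capability set contains the source's. The 1:1 rule Sean expects nova to support accepts only an identical set. The sketch below applies both rules to the "cap A" example; "device 5" is a hypothetical extra candidate added to show a target that fails both rules.

```python
from __future__ import annotations

# Capability sets mirror the "cap A" example above; "device 5" is hypothetical.


def superset_targets(src: frozenset[str],
                     candidates: dict[str, frozenset[str]]) -> list[str]:
    """1:N rule: any target offering every source capability qualifies."""
    return [name for name, caps in candidates.items() if src <= caps]


def exact_targets(src: frozenset[str],
                  candidates: dict[str, frozenset[str]]) -> list[str]:
    """1:1 rule: only targets with an identical capability set qualify."""
    return [name for name, caps in candidates.items() if caps == src]


if __name__ == "__main__":
    candidates = {
        "device 2": frozenset({"A"}),
        "device 3": frozenset({"A", "B"}),
        "device 4": frozenset({"A", "B", "C"}),
        "device 5": frozenset({"B"}),  # hypothetical: lacks cap A
    }
    src = frozenset({"A"})
    print(superset_targets(src, candidates))  # ['device 2', 'device 3', 'device 4']
    print(exact_targets(src, candidates))     # ['device 2']
```

The superset rule finds more targets, but it needs an "any of" or "contains" query in the scheduler. The exact rule maps onto a single required trait, which is why Sean expects nova to start there.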
Re: device compatibility interface for live migration with assigned devices
On Tue, Aug 25, 2020 at 04:39:25PM +0200, Cornelia Huck wrote: <...> > > do you think the bin_attribute I proposed yesterday is good? > > Then we can have a single compatible with a variable in the mdev_type and > > aggregator. > > > > mdev_type=i915-GVTg_V5_{val1:int:2,4,8} > > aggregator={val1}/2 > > I'm not really a fan of binary attributes other than in cases where we > have some kind of binary format to begin with. > > IIUC, we basically have: > - different partitioning (expressed in the mdev_type) > - different number of partitions (expressed via the aggregator) > - devices being compatible if the partitioning:aggregator ratio is the > same > > (The multiple mdev_type variants seem to come from avoiding extra > creation parameters, IIRC?) > > Would it be enough to export > base_type=i915-GVTg_V5 > aggregation_ratio= > > to express the various combinations that are compatible without the > need for multiple sets of attributes? Yes. I agree we need to decouple the mdev type name and the aggregator for compatibility detection purposes. Please allow me to put some words to describe the history and motivation of introducing the aggregator. Initially, we had fixed mdev_types i915-GVTg_V5_1, i915-GVTg_V5_2, i915-GVTg_V5_4 and i915-GVTg_V5_8, the digit after i915-GVTg_V5 representing the max number of instances allowed to be created for this type. They also identify how many resources are to be allocated for each type. They have worked well so far for current Intel vGPUs, i.e., cutting the physical GPU into several virtual pieces and sharing them among several VMs in a pure mediation way. Fixed types are provided in advance, as we thought they could meet the needs of most users, and users can know the hardware capability they acquired from the type name: the bigger the number, the smaller the piece of physical hardware.
Then, when it comes to Scalable IOV in the near future, one physical device can be cut into a large number of units at the hardware layer. The single unit to be assigned to a guest can be very small, while one to several units are grouped into an mdev. The fixed type scheme is then cumbersome. Therefore, a new attribute, aggregator, is introduced to specify the number of resources to be assigned, based on the base resource specified in the type name. e.g. if the type name is dsa-1dwq and the aggregator is 30, then the resources assignable to the guest are 30 wqs in a single created mdev. If the type name is dsa-2dwq and the aggregator is 15, then the resources assignable to the guest are also 30 wqs in a single created mdev. (In this example, the rule for defining the type name is different from the case in GVT: here 1 wq means the wq number is 1. Yes, that is the current reality. :) ) Previously, we wanted to regard the two mdevs created with dsa-1dwq x 30 and dsa-2dwq x 15 as compatible, because the two mdevs consist of equal resources. But, as it's a burden to the upper layer, we agree that if this condition happens, we still treat the two as incompatible. To fix it, either the driver should expose dsa-1dwq only, or the target dsa-2dwq needs to be destroyed and reallocated via dsa-1dwq x 30. Does it make sense? Thanks Yan
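A minimal sketch of the DSA arithmetic above, assuming (hypothetically) that a type name of the form `dsa-<N>dwq` means N work queues per base unit and that the aggregator multiplies that unit. It shows why the two configurations carry equal resources and how the rule agreed in this message still treats them as incompatible. `Mdev`, `total_wqs` and `compatible` are illustrative names, not a kernel or vendor API; requiring an identical aggregator as well as an identical type is this sketch's own reading of "reallocated via dsa-1dwq x 30".

```python
import re
from dataclasses import dataclass

# Hypothetical parser for type names shaped like "dsa-1dwq" / "dsa-2dwq",
# where the number is the count of work queues (wqs) in one base unit.
_DSA_TYPE = re.compile(r"^dsa-(\d+)dwq$")


@dataclass(frozen=True)
class Mdev:
    mdev_type: str
    aggregator: int


def wqs_per_unit(mdev_type: str) -> int:
    m = _DSA_TYPE.match(mdev_type)
    if m is None:
        raise ValueError(f"unrecognized type name: {mdev_type!r}")
    return int(m.group(1))


def total_wqs(dev: Mdev) -> int:
    # Resources seen by the guest = base resource of the type x aggregator.
    return wqs_per_unit(dev.mdev_type) * dev.aggregator


def compatible(src: Mdev, dst: Mdev) -> bool:
    # Agreed rule: equal resources are not enough; the type name must match
    # (and, in this sketch, the aggregator too).
    return src.mdev_type == dst.mdev_type and src.aggregator == dst.aggregator


a = Mdev("dsa-1dwq", 30)
b = Mdev("dsa-2dwq", 15)
print(total_wqs(a), total_wqs(b))  # 30 30
print(compatible(a, b))            # False
```

Keeping the rule a plain string comparison is what relieves the upper layer: it never has to understand how a vendor encodes resources in a type name.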
Re: device compatibility interface for live migration with assigned devices
On Thu, 20 Aug 2020 11:16:21 +0800 Yan Zhao wrote: > On Wed, Aug 19, 2020 at 09:22:34PM -0600, Alex Williamson wrote: > > On Thu, 20 Aug 2020 08:39:22 +0800 > > Yan Zhao wrote: > > > > > On Tue, Aug 18, 2020 at 11:36:52AM +0200, Cornelia Huck wrote: > > > > On Tue, 18 Aug 2020 10:16:28 +0100 > > > > Daniel P. Berrangé wrote: > > > > > > > > > On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote: > > > > > >On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote: > > > > > > > > > > > > On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote: > > > > > > > > > > > > On 2020/8/14 1:16 PM, Yan Zhao wrote: > > > > > > > > > > > > On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: > > > > > > > > > > > > On 2020/8/10 3:46 PM, Yan Zhao wrote: > > > > > > > > > > > we actually can also retrieve the same information through sysfs, > > > > > > e.g. > > > > > > > > > > > > |- [path to device] > > > > > > |--- migration > > > > > > | |--- self > > > > > > | | |---device_api > > > > > > | | |---mdev_type > > > > > > | | |---software_version > > > > > > | | |---device_id > > > > > > | | |---aggregator > > > > > > | |--- compatible > > > > > > | | |---device_api > > > > > > | | |---mdev_type > > > > > > | | |---software_version > > > > > > | | |---device_id > > > > > > | | |---aggregator > > > > > > > > > > > > > > > > > > Yes, but: > > > > > > > > > > > > - You need one file per attribute (one syscall for one attribute) > > > > > > - Attribute is coupled with kobject > > > > > > > > Is that really that bad? You have the device with an embedded kobject > > > > anyway, and you can just put things into an attribute group? > > > > > > > > [Also, I think that self/compatible split in the example makes things > > > > needlessly complex. Shouldn't semantic versioning and matching already > > > > cover nearly everything? I would expect very few cases that are more > > > > complex than that.
Maybe the aggregation stuff, but I don't think we > > > > need that self/compatible split for that, either.] > > > Hi Cornelia, > > > > > > The reason I want to declare a compatible list of attributes is that > > > sometimes it's not a simple 1:1 matching of source attributes and target > > > attributes, > > > as I demonstrated below: > > > source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is compatible with > > > target mdev of (mdev_type i915-GVTg_V5_4 + aggregator 2), > > > (mdev_type i915-GVTg_V5_8 + aggregator 4) > > > > > > and aggregator may be just one of such examples where 1:1 matching does not > > > fit. > > > > If you're suggesting that we need a new 'compatible' set for every > > aggregation, haven't we lost the purpose of aggregation? For example, > > rather than having N mdev types to represent all the possible > > aggregation values, we have a single mdev type with N compatible > > migration entries, one for each possible aggregation value. BTW, how do > > we have multiple compatible directories? compatible0001, > > compatible0002? Thanks, > > > do you think the bin_attribute I proposed yesterday is good? > Then we can have a single compatible with a variable in the mdev_type and > aggregator. > > mdev_type=i915-GVTg_V5_{val1:int:2,4,8} > aggregator={val1}/2 I'm not really a fan of binary attributes other than in cases where we have some kind of binary format to begin with. IIUC, we basically have: - different partitioning (expressed in the mdev_type) - different number of partitions (expressed via the aggregator) - devices being compatible if the partitioning:aggregator ratio is the same (The multiple mdev_type variants seem to come from avoiding extra creation parameters, IIRC?) Would it be enough to export base_type=i915-GVTg_V5 aggregation_ratio= to express the various combinations that are compatible without the need for multiple sets of attributes?
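The templated entry and the base_type/aggregation_ratio alternative can be compared with a small sketch. It assumes the reading given in the thread: `mdev_type=i915-GVTg_V5_{val1:int:2,4,8}` with `aggregator={val1}/2` stands for three explicit (type, aggregator) pairs, and two configurations are compatible when the number at the end of the type divided by the aggregator is the same. The value of `aggregation_ratio` is not spelled out in the message, so computing it as that quotient is an assumption, and all function names are hypothetical.

```python
from fractions import Fraction


def expand_template(base_type: str, values: list[int]) -> list[tuple[str, int]]:
    # "mdev_type=<base>_{val1:int:2,4,8}" with "aggregator={val1}/2"
    # becomes one explicit (mdev_type, aggregator) pair per value of val1.
    return [(f"{base_type}_{v}", v // 2) for v in values]


def split_type(mdev_type: str) -> tuple[str, int]:
    # "i915-GVTg_V5_4" -> ("i915-GVTg_V5", 4); the suffix is the maximum
    # number of instances of this type on one physical GPU.
    base, _, n = mdev_type.rpartition("_")
    return base, int(n)


def ratio_key(mdev_type: str, aggregator: int) -> tuple[str, Fraction]:
    # Candidate (base_type, aggregation_ratio): partitioning / aggregator.
    base, n = split_type(mdev_type)
    return base, Fraction(n, aggregator)


def gpu_share(mdev_type: str, aggregator: int) -> Fraction:
    # Fraction of the physical GPU the instance gets: aggregator / n.
    _, n = split_type(mdev_type)
    return Fraction(aggregator, n)


pairs = expand_template("i915-GVTg_V5", [2, 4, 8])
for t, a in pairs:
    print(t, a, ratio_key(t, a)[1], gpu_share(t, a))
# i915-GVTg_V5_2 1 2 1/2
# i915-GVTg_V5_4 2 2 1/2
# i915-GVTg_V5_8 4 2 1/2
```

All three pairs collapse to a single (base_type, ratio) key, which is the point of exporting two plain attributes instead of one compatible set per aggregation value.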
Re: [ovirt-devel] Re: device compatibility interface for live migration with assigned devices
On Fri, 21 Aug 2020 11:14:41 +0800 Jason Wang wrote: > On 2020/8/20 8:27 PM, Cornelia Huck wrote: > > On Wed, 19 Aug 2020 17:28:38 +0800 > > Jason Wang wrote: > > > >> On 2020/8/19 4:13 PM, Yan Zhao wrote: > >>> On Wed, Aug 19, 2020 at 03:39:50PM +0800, Jason Wang wrote: > On 2020/8/19 2:59 PM, Yan Zhao wrote: > > On Wed, Aug 19, 2020 at 02:57:34PM +0800, Jason Wang wrote: > >> On 2020/8/19 11:30 AM, Yan Zhao wrote: > >>> hi All, > >>> could we decide that sysfs is the interface that every VFIO vendor > >>> driver > >>> needs to provide in order to support vfio live migration, otherwise > >>> the > >>> userspace management tool would not list the device in the > >>> compatible > >>> list? > >>> > >>> if that's true, let's move on to standardizing the sysfs > >>> interface. > >>> (1) content > >>> common part: (must) > >>> - software_version: (in major.minor.bugfix scheme) > >> This cannot work for devices whose features can be > >> negotiated/advertised > >> independently. (e.g. virtio devices) > > I thought the 'software_version' was supposed to describe kind of a > > 'protocol version' for the data we transmit? I.e., you add a new field, > > you bump the version number. > > > Ok, but since we mandate backward compatibility of uABI, is it really > worth having a version for sysfs? (Searching on sysfs shows no examples > like this) I was not thinking about the sysfs interface, but rather about the data that is sent over while migrating. E.g. we find out that sending some auxiliary data is a good idea and bump to version 1.1.0; version 1.0.0 cannot deal with the extra data, but version 1.1.0 can deal with the older data stream. (...) > >>> - device_api: vfio-pci or vfio-ccw ... > >>> - type: mdev type for mdev device or > >>> a signature for physical device which is a counterpart > >>> for > >>> mdev type.
> >>> > >>> device api specific part: (must) > >>> - pci id: pci id of mdev parent device or pci id of physical pci > >>> device (device_api is vfio-pci)API here. > >> So this assumes a PCI device which is probably not true. > >> > > for device_api of vfio-pci, why is it not true? > > > > for vfio-ccw, it's subchannel_type. > Ok, but having two different attributes for the same file is not a good > idea. > How does mgmt know there will be a 3rd type? > >>> that's why some attributes need to be common. e.g. > >>> device_api: it's common because mgmt needs to know whether it's a pci device or a > >>> ccw device, and the api type is already defined in vfio.h. > >>> (The field is agreed by and actually suggested by Alex in a previous > >>> mail) > >>> type: mdev_type for mdev. if mgmt does not understand it, it would not > >>> be able to create one compatible mdev device. > >>> software_version: mgmt can compare the major and minor if it understands > >>> these fields. > >> > >> I think it would be helpful if you can describe how mgmt is expected to > >> work step by step with the proposed sysfs API. This can help people to > >> understand. > > My proposal would be: > > - check that device_api matches > > - check possible device_api specific attributes > > - check that type matches [I don't think the combination of mdev types > > and another attribute to determine compatibility is a good idea; > > > Any reason for this? Actually if we only use mdev type to detect the > compatibility, it would be much easier. Otherwise, we are actually > re-inventing mdev types. > > E.g. can we have the same mdev types with different device_api and other > attributes? In the end, the mdev type is represented as a string; but I'm not sure we can expect that two types with the same name, but a different device_api are related in any way. If we e.g. compare vfio-pci and vfio-ccw, they are fundamentally different.
I was mostly concerned about the aggregation proposal, where type A + aggregation value b might be compatible with type B + aggregation value a. > > > > actually, the current proposal confuses me every time I look at it] > > - check that software_version is compatible, assuming semantic > > versioning > > - check possible type-specific attributes > > > I'm not sure if this is too complicated. And I suspect there will be > vendor-specific attributes: > > - for compatibility check: I think we should either model everything > via mdev type or make it totally vendor-specific. Having something in > the middle will bring a lot of burden. FWIW, I'm for a strict match on mdev type, and flexibility in per-type attributes. > - for provisioning: it's still not clear. As shown in this proposal, for > NVME we may need to set remote_url, but unless there will be a subclass > (NVME) in the mdev (which I guess not), we
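Cornelia's example of versioning the migration data stream (a 1.1.0 target can read a 1.0.0 stream, but not the other way round) corresponds to a check like the one below. It is a sketch under the assumption that `software_version` is a plain `major.minor.bugfix` string and that only major and minor matter, as suggested earlier in the thread; the function names are illustrative.

```python
def parse_version(s: str) -> tuple[int, int, int]:
    # "1.1.0" -> (1, 1, 0); raises ValueError on anything else.
    major, minor, bugfix = (int(p) for p in s.strip().split("."))
    return major, minor, bugfix


def can_accept_stream(source: str, target: str) -> bool:
    # Same major: the stream format belongs to the same family.
    # Target minor >= source minor: a newer target can read older data,
    # but an older target cannot handle fields added later.
    # The bugfix level is ignored.
    s_major, s_minor, _ = parse_version(source)
    t_major, t_minor, _ = parse_version(target)
    return s_major == t_major and t_minor >= s_minor


print(can_accept_stream("1.0.0", "1.1.0"))  # True
print(can_accept_stream("1.1.0", "1.0.0"))  # False
print(can_accept_stream("1.0.0", "2.0.0"))  # False
```

Note that this versions the migration data, not the sysfs interface itself, which is the distinction Cornelia draws in reply to Jason.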
Re: [ovirt-devel] Re: device compatibility interface for live migration with assigned devices
On 2020/8/20 8:27 PM, Cornelia Huck wrote: On Wed, 19 Aug 2020 17:28:38 +0800 Jason Wang wrote: On 2020/8/19 4:13 PM, Yan Zhao wrote: On Wed, Aug 19, 2020 at 03:39:50PM +0800, Jason Wang wrote: On 2020/8/19 2:59 PM, Yan Zhao wrote: On Wed, Aug 19, 2020 at 02:57:34PM +0800, Jason Wang wrote: On 2020/8/19 11:30 AM, Yan Zhao wrote: hi All, could we decide that sysfs is the interface that every VFIO vendor driver needs to provide in order to support vfio live migration, otherwise the userspace management tool would not list the device in the compatible list? if that's true, let's move on to standardizing the sysfs interface. (1) content common part: (must) - software_version: (in major.minor.bugfix scheme) This cannot work for devices whose features can be negotiated/advertised independently. (e.g. virtio devices) I thought the 'software_version' was supposed to describe kind of a 'protocol version' for the data we transmit? I.e., you add a new field, you bump the version number. Ok, but since we mandate backward compatibility of uABI, is it really worth having a version for sysfs? (Searching on sysfs shows no examples like this) sorry, I don't understand here, why do virtio devices need to use the vfio interface? I don't see any reason that virtio devices can't be used by VFIO. Do you? Actually, virtio devices have been used by VFIO for many years: - passing through hardware virtio devices to userspace (VM) drivers - using the virtio PMD inside the guest So, what's different about it vs passing through physical hardware via VFIO? The difference is that in the guest, the device could be either real hardware or an emulated one. even though the features are negotiated dynamically, could you explain why it would cause software_version not to work? Virtio device 1 supports features A, B, C. Virtio device 2 supports features B, C, D. So you can't migrate a guest from device 1 to device 2. And it's impossible to model the features with versions.
We're talking about the features offered by the device, right? Would it be sufficient to mandate that the target device supports the same features or a superset of the features supported by the source device? Yes. I think this thread is discussing vfio-related devices. - device_api: vfio-pci or vfio-ccw ... - type: mdev type for mdev device or a signature for physical device which is a counterpart for mdev type. device api specific part: (must) - pci id: pci id of mdev parent device or pci id of physical pci device (device_api is vfio-pci)API here. So this assumes a PCI device which is probably not true. for device_api of vfio-pci, why is it not true? for vfio-ccw, it's subchannel_type. Ok, but having two different attributes for the same file is not a good idea. How does mgmt know there will be a 3rd type? that's why some attributes need to be common. e.g. device_api: it's common because mgmt needs to know whether it's a pci device or a ccw device, and the api type is already defined in vfio.h. (The field is agreed by and actually suggested by Alex in a previous mail) type: mdev_type for mdev. if mgmt does not understand it, it would not be able to create one compatible mdev device. software_version: mgmt can compare the major and minor if it understands these fields. I think it would be helpful if you can describe how mgmt is expected to work step by step with the proposed sysfs API. This can help people to understand. My proposal would be: - check that device_api matches - check possible device_api specific attributes - check that type matches [I don't think the combination of mdev types and another attribute to determine compatibility is a good idea; Any reason for this? Actually if we only use mdev type to detect the compatibility, it would be much easier. Otherwise, we are actually re-inventing mdev types. E.g. can we have the same mdev types with different device_api and other attributes?
actually, the current proposal confuses me every time I look at it] - check that software_version is compatible, assuming semantic versioning - check possible type-specific attributes I'm not sure if this is too complicated. And I suspect there will be vendor-specific attributes: - for compatibility check: I think we should either model everything via mdev type or make it totally vendor-specific. Having something in the middle will bring a lot of burden. - for provisioning: it's still not clear. As shown in this proposal, for NVME we may need to set remote_url, but unless there will be a subclass (NVME) in the mdev (which I guess not), we can't prevent vendors from using another attribute name; in this case, tricks like iterating attributes in some subdirectory won't work. So even if we had some common API for compatibility check, the provisioning API is still vendor-specific ... Thanks Thanks for the
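Jason's virtio example (device 1 offers features A, B, C; device 2 offers B, C, D) and the rule Cornelia proposes, that the target must offer the same features as the source or a superset of them, reduce to a set-inclusion test. The sketch models features as strings for readability and, separately, as a bitmask; the bit assignment is hypothetical, and how such feature information would be surfaced to management software is not settled in the thread.

```python
def can_migrate(source_features: set[str], target_features: set[str]) -> bool:
    # The guest may rely on any feature the source offered, so the target
    # must offer all of them: equal sets or a superset are fine.
    return source_features <= target_features


def can_migrate_bits(source_bits: int, target_bits: int) -> bool:
    # Same rule on a feature bitmask: no bit set on the source may be
    # missing on the target.
    return (source_bits & ~target_bits) == 0


dev1 = {"A", "B", "C"}
dev2 = {"B", "C", "D"}
print(can_migrate(dev1, dev2))                  # False: A is missing on device 2
print(can_migrate(dev1, {"A", "B", "C", "D"}))  # True

# Hypothetical bit assignment: A=bit 0, B=bit 1, C=bit 2, D=bit 3.
print(can_migrate_bits(0b0111, 0b1110))  # False
print(can_migrate_bits(0b0111, 0b1111))  # True
```

This also shows Jason's point: no single version number orders {A, B, C} and {B, C, D}, because neither set contains the other.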
Re: device compatibility interface for live migration with assigned devices
On Thu, 2020-08-20 at 14:27 +0800, Yan Zhao wrote: > On Thu, Aug 20, 2020 at 06:16:28AM +0100, Sean Mooney wrote: > > On Thu, 2020-08-20 at 12:01 +0800, Yan Zhao wrote: > > > On Thu, Aug 20, 2020 at 02:29:07AM +0100, Sean Mooney wrote: > > > > On Thu, 2020-08-20 at 08:39 +0800, Yan Zhao wrote: > > > > > On Tue, Aug 18, 2020 at 11:36:52AM +0200, Cornelia Huck wrote: > > > > > > On Tue, 18 Aug 2020 10:16:28 +0100 > > > > > > Daniel P. Berrangé wrote: > > > > > > > > > > > > > On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote: > > > > > > > >On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote: > > > > > > > > > > > > > > > > On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote: > > > > > > > > > > > > > > > > On 2020/8/14 1:16 PM, Yan Zhao wrote: > > > > > > > > > > > > > > > > On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: > > > > > > > > > > > > > > > > On 2020/8/10 3:46 PM, Yan Zhao wrote: > > > > > > > > we actually can also retrieve the same information through > > > > > > > > sysfs, e.g. > > > > > > > > > > > > > > > > |- [path to device] > > > > > > > > |--- migration > > > > > > > > | |--- self > > > > > > > > | | |---device_api > > > > > > > > | | |---mdev_type > > > > > > > > | | |---software_version > > > > > > > > | | |---device_id > > > > > > > > | | |---aggregator > > > > > > > > | |--- compatible > > > > > > > > | | |---device_api > > > > > > > > | | |---mdev_type > > > > > > > > | | |---software_version > > > > > > > > | | |---device_id > > > > > > > > | | |---aggregator > > > > > > > > > > > > > > > > > > > > > > > > Yes, but: > > > > > > > > > > > > > > > > - You need one file per attribute (one syscall for one > > > > > > > > attribute) > > > > > > > > - Attribute is coupled with kobject > > > > > > > > > > > > Is that really that bad? You have the device with an embedded > > > > > > kobject > > > > > > anyway, and you can just put things into an attribute group?
> > > > > > > > > > > > [Also, I think that self/compatible split in the example makes > > > > > > things > > > > > > needlessly complex. Shouldn't semantic versioning and matching > > > > > > already > > > > > > cover nearly everything? I would expect very few cases that are more > > > > > > complex than that. Maybe the aggregation stuff, but I don't think we > > > > > > need that self/compatible split for that, either.] > > > > > > > > > > Hi Cornelia, > > > > > > > > > > The reason I want to declare a compatible list of attributes is that > > > > > sometimes it's not a simple 1:1 matching of source attributes and > > > > > target attributes, > > > > > as I demonstrated below: > > > > > source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is > > > > > compatible with > > > > > target mdev of (mdev_type i915-GVTg_V5_4 + aggregator 2), > > > > > (mdev_type i915-GVTg_V5_8 + aggregator 4) > > > > > > > > The way you are doing the naming is still really confusing, by the way. > > > > If this has not already been merged in the kernel, can you change the > > > > mdev > > > > so that mdev_type i915-GVTg_V5_2 is 2 of mdev_type i915-GVTg_V5_1 > > > > instead of half the device? > > > > > > > > Currently you need to divide the aggregator by the number at the end of > > > > the mdev type to figure out > > > > how much of the physical device is being used, which is a very unfriendly > > > > API convention. > > > > > > > > The way aggregators are being proposed in general is not really something I > > > > like, but I think this at least > > > > is something that should be possible to correct. > > > > > > > > With the complexity in the mdev type name + aggregator, I suspect that > > > > this will never be supported > > > > in OpenStack Nova directly, requiring integration via Cyborg, unless we > > > > can pre-partition the > > > > device into mdevs statically and just ignore this.
> > > > > > > > This is way too vendor-specific to integrate into something like > > > > OpenStack Nova unless we can guarantee > > > > that how aggregators work will be portable across vendors generically. > > > > > > > > > > > > > > and aggregator may be just one of such examples where 1:1 matching > > > > > does not > > > > > fit. > > > > > > > > For OpenStack Nova, I don't see us supporting anything beyond the 1:1 case > > > > where the mdev type does not change. > > > > > > > > > > hi Sean, > > > I understand it's hard for OpenStack, but 1:N is always meaningful. > > > e.g. > > > if source device 1 has cap A, it is compatible with > > > device 2: cap A, > > > device 3: cap A+B, > > > device 4: cap A+B+C > > > > > > to allow OpenStack to detect it correctly, in the compatible list of > > > device 2, we would say compatible cap is A; > > > device 3, compatible cap is A or A+B; > > > device 4, compatible cap is A or A+B, or A+B+C; > > > > > > then if OpenStack finds device 1's self cap A is contained in
Re: [ovirt-devel] Re: device compatibility interface for live migration with assigned devices
On Wed, 19 Aug 2020 17:28:38 +0800 Jason Wang wrote: > On 2020/8/19 4:13 PM, Yan Zhao wrote: > > On Wed, Aug 19, 2020 at 03:39:50PM +0800, Jason Wang wrote: > >> On 2020/8/19 2:59 PM, Yan Zhao wrote: > >>> On Wed, Aug 19, 2020 at 02:57:34PM +0800, Jason Wang wrote: > On 2020/8/19 11:30 AM, Yan Zhao wrote: > > hi All, > > could we decide that sysfs is the interface that every VFIO vendor > > driver > > needs to provide in order to support vfio live migration, otherwise the > > userspace management tool would not list the device in the compatible > > list? > > > > if that's true, let's move on to standardizing the sysfs interface. > > (1) content > > common part: (must) > > - software_version: (in major.minor.bugfix scheme) > This cannot work for devices whose features can be negotiated/advertised > independently. (e.g. virtio devices) I thought the 'software_version' was supposed to describe kind of a 'protocol version' for the data we transmit? I.e., you add a new field, you bump the version number. > > >>> sorry, I don't understand here, why do virtio devices need to use the vfio > >>> interface? > >> > >> I don't see any reason that virtio devices can't be used by VFIO. Do you? > >> > >> Actually, virtio devices have been used by VFIO for many years: > >> > >> - passing through hardware virtio devices to userspace (VM) drivers > >> - using the virtio PMD inside the guest > >> > > So, what's different about it vs passing through physical hardware via > > VFIO? > > > The difference is that in the guest, the device could be either real hardware > or an emulated one. > > > > even though the features are negotiated dynamically, could you explain > > why it would cause software_version not to work? > > > Virtio device 1 supports features A, B, C > Virtio device 2 supports features B, C, D > > So you can't migrate a guest from device 1 to device 2. And it's > impossible to model the features with versions. 
We're talking about the features offered by the device, right?
Would it be sufficient to mandate that the target device supports the same features or a superset of the features supported by the source device? > > > > > > > >>> I think this thread is discussing vfio-related devices. > >>> > > - device_api: vfio-pci or vfio-ccw ... > > - type: mdev type for mdev device or > > a signature for physical device which is a counterpart for > > mdev type. > > > > device api specific part: (must) > > - pci id: pci id of mdev parent device or pci id of physical pci > > device (device_api is vfio-pci)API here. > So this assumes a PCI device which is probably not true. > > >>> for device_api of vfio-pci, why is it not true? > >>> > >>> for vfio-ccw, it's subchannel_type. > >> > >> Ok, but having two different attributes for the same file is not a good idea. > >> How does mgmt know there will be a 3rd type? > > that's why some attributes need to be common. e.g. > > device_api: it's common because mgmt needs to know whether it's a pci device or a > > ccw device, and the api type is already defined in vfio.h. > > (The field is agreed by and actually suggested by Alex in a previous > > mail) > > type: mdev_type for mdev. if mgmt does not understand it, it would not > > be able to create one compatible mdev device. > > software_version: mgmt can compare the major and minor if it understands > > these fields. > > > I think it would be helpful if you can describe how mgmt is expected to > work step by step with the proposed sysfs API. This can help people to > understand. My proposal would be: - check that device_api matches - check possible device_api specific attributes - check that type matches [I don't think the combination of mdev types and another attribute to determine compatibility is a good idea; actually, the current proposal confuses me every time I look at it] - check that software_version is compatible, assuming semantic versioning - check possible type-specific attributes > > Thanks for the patience.
Since sysfs is uABI, when accepted, we need > to support it forever. That's why we need to be careful. Nod. (...)
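Cornelia's step-by-step proposal can be sketched as a management-side check over the `migration/self` attributes from the tree proposed earlier. This is a sketch under loud assumptions: the layout was a proposal in this thread, not an existing kernel interface; only `device_api`, `mdev_type` and `software_version` are treated as common; and the rule that any other attribute present on both sides must match exactly is invented here to stand in for the device_api- and type-specific steps. A temporary directory fakes the sysfs files.

```python
import tempfile
from pathlib import Path

# Hypothetical layout following the tree proposed earlier in the thread:
#   <device>/migration/self/<attribute>, one value per file.
# This was a proposal under discussion, not an existing kernel interface.
COMMON = ("device_api", "mdev_type", "software_version")


def read_attrs(device_dir: Path) -> dict[str, str]:
    self_dir = device_dir / "migration" / "self"
    return {f.name: f.read_text().strip() for f in self_dir.iterdir() if f.is_file()}


def version_ok(src: str, dst: str) -> bool:
    # Semantic versioning: same major, target minor >= source minor.
    s_major, s_minor = (int(x) for x in src.split(".")[:2])
    d_major, d_minor = (int(x) for x in dst.split(".")[:2])
    return s_major == d_major and d_minor >= s_minor


def check(src: dict[str, str], dst: dict[str, str]) -> list[str]:
    """Reasons why src -> dst migration is refused; an empty list means OK."""
    missing = [k for k in COMMON if k not in src or k not in dst]
    if missing:
        return ["missing: " + ", ".join(missing)]
    reasons = []
    if src["device_api"] != dst["device_api"]:  # step 1
        reasons.append("device_api differs")
    if src["mdev_type"] != dst["mdev_type"]:  # step 3: strict match
        reasons.append("type differs")
    if not version_ok(src["software_version"], dst["software_version"]):
        reasons.append("software_version incompatible")  # step 4
    # Steps 2 and 5 (device_api- and type-specific attributes): assumed here,
    # for illustration only, to require an exact match when both sides have them.
    for key in sorted((src.keys() & dst.keys()) - set(COMMON)):
        if src[key] != dst[key]:
            reasons.append(f"{key} differs")
    return reasons


def write_attrs(device_dir: Path, attrs: dict[str, str]) -> None:
    # Demo helper: fakes the hypothetical layout in a temporary directory.
    self_dir = device_dir / "migration" / "self"
    self_dir.mkdir(parents=True)
    for name, value in attrs.items():
        (self_dir / name).write_text(value + "\n")


with tempfile.TemporaryDirectory() as tmp:
    old = {"device_api": "vfio-pci", "mdev_type": "i915-GVTg_V5_4",
           "software_version": "1.0.0"}
    write_attrs(Path(tmp, "src"), old)
    write_attrs(Path(tmp, "dst"), {**old, "software_version": "1.1.0"})
    src_attrs = read_attrs(Path(tmp, "src"))
    dst_attrs = read_attrs(Path(tmp, "dst"))

print(check(src_attrs, dst_attrs))  # []
print(check(dst_attrs, src_attrs))  # ['software_version incompatible']
```

Returning a list of reasons rather than a bare boolean is a design choice for the sketch: a management tool can log why a host was filtered out. Each attribute still costs one file read, which is the per-attribute syscall overhead Jason raised.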
Re: device compatibility interface for live migration with assigned devices
On Thu, Aug 20, 2020 at 06:16:28AM +0100, Sean Mooney wrote: > On Thu, 2020-08-20 at 12:01 +0800, Yan Zhao wrote: > > On Thu, Aug 20, 2020 at 02:29:07AM +0100, Sean Mooney wrote: > > > On Thu, 2020-08-20 at 08:39 +0800, Yan Zhao wrote: > > > > On Tue, Aug 18, 2020 at 11:36:52AM +0200, Cornelia Huck wrote: > > > > > On Tue, 18 Aug 2020 10:16:28 +0100 > > > > > Daniel P. Berrangé wrote: > > > > > > > > > > > On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote: > > > > > > >On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote: > > > > > > > > > > > > > > On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote: > > > > > > > > > > > > > > On 2020/8/14 1:16 PM, Yan Zhao wrote: > > > > > > > > > > > > > > On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: > > > > > > > > > > > > > > On 2020/8/10 3:46 PM, Yan Zhao wrote: > > > > > > > we actually can also retrieve the same information through > > > > > > > sysfs, e.g. > > > > > > > > > > > > > > |- [path to device] > > > > > > > |--- migration > > > > > > > | |--- self > > > > > > > | | |---device_api > > > > > > > | | |---mdev_type > > > > > > > | | |---software_version > > > > > > > | | |---device_id > > > > > > > | | |---aggregator > > > > > > > | |--- compatible > > > > > > > | | |---device_api > > > > > > > | | |---mdev_type > > > > > > > | | |---software_version > > > > > > > | | |---device_id > > > > > > > | | |---aggregator > > > > > > > > > > > > > > > > > > > > > Yes, but: > > > > > > > > > > > > > > - You need one file per attribute (one syscall for one attribute) > > > > > > > - Attribute is coupled with kobject > > > > > > > > > > Is that really that bad? You have the device with an embedded kobject > > > > > anyway, and you can just put things into an attribute group? > > > > > > > > > > [Also, I think that self/compatible split in the example makes things > > > > > needlessly complex. Shouldn't semantic versioning and matching already > > > > > cover nearly everything?
I would expect very few cases that are more > > > > > complex than that. Maybe the aggregation stuff, but I don't think we > > > > > need that self/compatible split for that, either.] > > > > > > > > Hi Cornelia, > > > > > > > > The reason I want to declare a compatible list of attributes is that > > > > sometimes it's not a simple 1:1 matching of source attributes and > > > > target attributes, > > > > as I demonstrated below: > > > > source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is compatible > > > > with > > > > target mdev of (mdev_type i915-GVTg_V5_4 + aggregator 2), > > > > (mdev_type i915-GVTg_V5_8 + aggregator 4) > > > > > > The way you are doing the naming is still really confusing, by the way. > > > If this has not already been merged in the kernel, can you change the mdev > > > so that mdev_type i915-GVTg_V5_2 is 2 of mdev_type i915-GVTg_V5_1 instead > > > of half the device? > > > > > > Currently you need to divide the aggregator by the number at the end of the > > > mdev type to figure out > > > how much of the physical device is being used, which is a very unfriendly API > > > convention. > > > > > > The way aggregators are being proposed in general is not really something I > > > like, but I think this at least > > > is something that should be possible to correct. > > > > > > With the complexity in the mdev type name + aggregator, I suspect that this > > > will never be supported > > > in OpenStack Nova directly, requiring integration via Cyborg, unless we > > > can pre-partition the > > > device into mdevs statically and just ignore this. > > > > > > This is way too vendor-specific to integrate into something like OpenStack > > > Nova unless we can guarantee > > > that how aggregators work will be portable across vendors generically. > > > > > > > > > > > and aggregator may be just one of such examples where 1:1 matching does > > > > not > > > > fit. 
> > > > > > For OpenStack Nova, I don't see us supporting anything beyond the 1:1 case > > > where the mdev type does not change.
> > > > > > > hi Sean, > > I understand it's hard for OpenStack, but 1:N is always meaningful. > > e.g. > > if source device 1 has cap A, it is compatible with > > device 2: cap A, > > device 3: cap A+B, > > device 4: cap A+B+C > > > > to allow OpenStack to detect it correctly, in the compatible list of > > device 2, we would say compatible cap is A; > > device 3, compatible cap is A or A+B; > > device 4, compatible cap is A or A+B, or A+B+C; > > > > then if OpenStack finds device 1's self cap A is contained in the compatible > > cap of device 2/3/4, it can migrate device 1 to device 2, 3, 4. > > > > conversely, device 1's compatible cap is only A, > > so it is able to migrate device 2 to device 1, and it is not able to > > migrate device 3/4 to device 1. > > Yes, we built the placement service around the idea of capabilities as traits on >
Re: device compatibility interface for live migration with assigned devices
On Thu, 2020-08-20 at 12:01 +0800, Yan Zhao wrote: > On Thu, Aug 20, 2020 at 02:29:07AM +0100, Sean Mooney wrote: > > On Thu, 2020-08-20 at 08:39 +0800, Yan Zhao wrote: > > > On Tue, Aug 18, 2020 at 11:36:52AM +0200, Cornelia Huck wrote: > > > > On Tue, 18 Aug 2020 10:16:28 +0100 > > > > Daniel P. Berrangé wrote: > > > > > > > > > On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote: > > > > > >On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote: > > > > > > > > > > > > On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote: > > > > > > > > > > > > On 2020/8/14 1:16 PM, Yan Zhao wrote: > > > > > > > > > > > > On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: > > > > > > > > > > > > On 2020/8/10 3:46 PM, Yan Zhao wrote: > > > > > > we actually can also retrieve the same information through sysfs, > > > > > > e.g. > > > > > > > > > > > > |- [path to device] > > > > > > |--- migration > > > > > > | |--- self > > > > > > | | |---device_api > > > > > > | | |---mdev_type > > > > > > | | |---software_version > > > > > > | | |---device_id > > > > > > | | |---aggregator > > > > > > | |--- compatible > > > > > > | | |---device_api > > > > > > | | |---mdev_type > > > > > > | | |---software_version > > > > > > | | |---device_id > > > > > > | | |---aggregator > > > > > > > > > > > > > > > > > > Yes, but: > > > > > > > > > > > > - You need one file per attribute (one syscall for one attribute) > > > > > > - Attribute is coupled with kobject > > > > > > > > Is that really that bad? You have the device with an embedded kobject > > > > anyway, and you can just put things into an attribute group? > > > > > > > > [Also, I think that self/compatible split in the example makes things > > > > needlessly complex. Shouldn't semantic versioning and matching already > > > > cover nearly everything? I would expect very few cases that are more > > > > complex than that. 
Maybe the aggregation stuff, but I don't think we > > > > need that self/compatible split for that, either.]
> > > > > > Hi Cornelia, > > > > > > The reason I want to declare compatible list of attributes is that > > > sometimes it's not a simple 1:1 matching of source attributes and target > > > attributes > > > as I demonstrated below, > > > source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is compatible with > > > target mdev of (mdev_type i915-GVTg_V5_4 + aggregator 2), > > >(mdev_type i915-GVTg_V5_8 + aggregator 4) > > > > the way you are doing the naming is still really confusing, by the way. > > if this has not already been merged in the kernel, can you change the mdev > > so that mdev_type i915-GVTg_V5_2 is 2 of mdev_type i915-GVTg_V5_1 instead > > of half the device? > > > > currently you need to divide the aggregator by the number at the end of the > > mdev type to figure out > > how much of the physical device is being used, which is a very unfriendly api > > convention. > > > > the way aggregators are being proposed in general is not really something i > > like, but i think this at least > > is something that we should be able to correct. > > > > with the complexity in the mdev type name + aggregator i suspect that this > > will never be supported > > in openstack nova directly, requiring integration via cyborg, unless we can > > pre-partition the > > device into mdevs statically and just ignore this. > > > > this is way too vendor specific to integrate into something like openstack in > > nova unless we can guarantee > > that how aggregators work will be portable across vendors generically. > > > > > > > > and aggregator may be just one of such examples that 1:1 matching does not > > > fit. > > > > for openstack nova i don't see us supporting anything beyond the 1:1 case where > > the mdev type does not change. > > > > hi Sean, > I understand it's hard for openstack. but 1:N is always meaningful. > e.g.
> if source device 1 has cap A, it is compatible with > device 2: cap A, > device 3: cap A+B, > device 4: cap A+B+C > > to allow openstack to detect it correctly, in the compatible list of > device 2, we would say compatible cap is A; > device 3, compatible cap is A or A+B; > device 4, compatible cap is A or A+B, or A+B+C; > > then if openstack finds device 1's self cap A is contained in the compatible > cap of device 2/3/4, it can migrate device 1 to device 2,3,4. > > conversely, device 1's compatible cap is only A, > so it is able to migrate device 2 to device 1, and it is not able to > migrate device 3/4 to device 1. yes, we built the placement service around the idea of capabilities as traits on resource providers, which is why i originally asked if we could model compatibility with feature flags. we can easily model devices as supporting A, A+B or A+B+C and then select hosts and devices based on that, but the list of compatible devices you are proposing hides this feature information, which would be what
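A minimal sketch of the two checks discussed above, using the device 1-4 / cap A, B, C example. The device table and function names are hypothetical and only mirror the email; neither is a kernel, libvirt, or OpenStack API. The first function is Yan's "self cap is contained in the target's compatible list" rule; the second is the trait-style "target has every capability the source uses" rule Sean describes.

```python
# Hypothetical model of the "self cap" / "compatible cap" example in the thread.
# Device numbers and capability letters mirror the email; nothing here is a real API.
DEVICES = {
    1: {"self": frozenset("A"), "compatible": [frozenset("A")]},
    2: {"self": frozenset("A"), "compatible": [frozenset("A")]},
    3: {"self": frozenset("AB"), "compatible": [frozenset("A"), frozenset("AB")]},
    4: {"self": frozenset("ABC"),
        "compatible": [frozenset("A"), frozenset("AB"), frozenset("ABC")]},
}


def can_migrate_by_list(src, dst):
    """Proposed check: the source's self caps appear in the target's compatible list."""
    return DEVICES[src]["self"] in DEVICES[dst]["compatible"]


def can_migrate_by_traits(src, dst):
    """Trait-style check: the target offers every capability the source uses."""
    return DEVICES[src]["self"] <= DEVICES[dst]["self"]
```

For this example the two rules agree on every pair (1 can move to 2, 3 or 4; 3 and 4 cannot move to 1). The difference is in what is exposed: the trait form lets a scheduler filter on the capabilities themselves instead of on an enumerated list of accepted combinations.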
Re: device compatibility interface for live migration with assigned devices
On Thu, Aug 20, 2020 at 02:29:07AM +0100, Sean Mooney wrote: > On Thu, 2020-08-20 at 08:39 +0800, Yan Zhao wrote: > > On Tue, Aug 18, 2020 at 11:36:52AM +0200, Cornelia Huck wrote: > > > On Tue, 18 Aug 2020 10:16:28 +0100 > > > Daniel P. Berrangé wrote: > > > > > > > On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote: > > > > >On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote: > > > > > > > > > > On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote: > > > > > > > > > > On 2020/8/14 1:16 PM, Yan Zhao wrote: > > > > > > > > > > On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: > > > > > > > > > > On 2020/8/10 3:46 PM, Yan Zhao wrote: > > > > > we actually can also retrieve the same information through sysfs, > > > > > e.g. > > > > > > > > > > |- [path to device] > > > > > |--- migration > > > > > | |--- self > > > > > | | |---device_api > > > > > || |---mdev_type > > > > > || |---software_version > > > > > || |---device_id > > > > > || |---aggregator > > > > > | |--- compatible > > > > > | | |---device_api > > > > > || |---mdev_type > > > > > || |---software_version > > > > > || |---device_id > > > > > || |---aggregator > > > > > > > > > > > > > > > Yes but: > > > > > > > > > > - You need one file per attribute (one syscall for one attribute) > > > > > - Attribute is coupled with kobject > > > > > > Is that really that bad? You have the device with an embedded kobject > > > anyway, and you can just put things into an attribute group? > > > > > > [Also, I think that self/compatible split in the example makes things > > > needlessly complex. Shouldn't semantic versioning and matching already > > > cover nearly everything? I would expect very few cases that are more > > > complex than that. Maybe the aggregation stuff, but I don't think we > > > need that self/compatible split for that, either.]
> > > > Hi Cornelia, > > > > The reason I want to declare compatible list of attributes is that > > sometimes it's not a simple 1:1 matching of source attributes and target > > attributes > > as I demonstrated below, > > source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is compatible with > > target mdev of (mdev_type i915-GVTg_V5_4 + aggregator 2), > >(mdev_type i915-GVTg_V5_8 + aggregator 4) > the way you are doing the naming is still really confusing, by the way. > if this has not already been merged in the kernel, can you change the mdev > so that mdev_type i915-GVTg_V5_2 is 2 of mdev_type i915-GVTg_V5_1 instead of > half the device? > > currently you need to divide the aggregator by the number at the end of the > mdev type to figure out > how much of the physical device is being used, which is a very unfriendly api > convention. > > the way aggregators are being proposed in general is not really something i like, > but i think this at least > is something that we should be able to correct. > > with the complexity in the mdev type name + aggregator i suspect that this will > never be supported > in openstack nova directly, requiring integration via cyborg, unless we can > pre-partition the > device into mdevs statically and just ignore this. > > this is way too vendor specific to integrate into something like openstack in > nova unless we can guarantee > that how aggregators work will be portable across vendors generically. > > > > > and aggregator may be just one of such examples that 1:1 matching does not > > fit. > for openstack nova i don't see us supporting anything beyond the 1:1 case where > the mdev type does not change. > hi Sean, I understand it's hard for openstack. but 1:N is always meaningful. e.g.
if source device 1 has cap A, it is compatible with device 2: cap A, device 3: cap A+B, device 4: cap A+B+C to allow openstack to detect it correctly, in the compatible list of device 2, we would say compatible cap is A; device 3, compatible cap is A or A+B; device 4, compatible cap is A or A+B, or A+B+C; then if openstack finds device 1's self cap A is contained in the compatible cap of device 2/3/4, it can migrate device 1 to device 2,3,4. conversely, device 1's compatible cap is only A, so it is able to migrate device 2 to device 1, and it is not able to migrate device 3/4 to device 1. Thanks Yan > i would really prefer if there was just one mdev type that represented the > minimal allocatable unit and the > aggregators were used to create compositions of that, i.e. instead of > i915-GVTg_V5_2 being half the device, > have 1 mdev type i915-GVTg and if the device supports 8 of them then we can > aggregate 4 of i915-GVTg > > if you want to have multiple mdev types to model the different amounts of the > resource, e.g. i915-GVTg_small and i915-GVTg_large, > that is totally fine too, or even i915-GVTg_4 indicating it is 4 of i915-GVTg > > failing that i would just expose an mdev type per composable resource and > allow us to compose them at the user level with >
Re: device compatibility interface for live migration with assigned devices
On Wed, Aug 19, 2020 at 09:22:34PM -0600, Alex Williamson wrote: > On Thu, 20 Aug 2020 08:39:22 +0800 > Yan Zhao wrote: > > > On Tue, Aug 18, 2020 at 11:36:52AM +0200, Cornelia Huck wrote: > > > On Tue, 18 Aug 2020 10:16:28 +0100 > > > Daniel P. Berrangé wrote: > > > > > > > On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote: > > > > >On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote: > > > > > > > > > > On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote: > > > > > > > > > > On 2020/8/14 1:16 PM, Yan Zhao wrote: > > > > > > > > > > On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: > > > > > > > > > > On 2020/8/10 3:46 PM, Yan Zhao wrote: > > > > > > > > > we actually can also retrieve the same information through sysfs, > > > > > e.g. > > > > > > > > > > |- [path to device] > > > > > |--- migration > > > > > | |--- self > > > > > | | |---device_api > > > > > || |---mdev_type > > > > > || |---software_version > > > > > || |---device_id > > > > > || |---aggregator > > > > > | |--- compatible > > > > > | | |---device_api > > > > > || |---mdev_type > > > > > || |---software_version > > > > > || |---device_id > > > > > || |---aggregator > > > > > > > > > > > > > > > Yes but: > > > > > > > > > > - You need one file per attribute (one syscall for one attribute) > > > > > - Attribute is coupled with kobject > > > > > > Is that really that bad? You have the device with an embedded kobject > > > anyway, and you can just put things into an attribute group? > > > > > > [Also, I think that self/compatible split in the example makes things > > > needlessly complex. Shouldn't semantic versioning and matching already > > > cover nearly everything? I would expect very few cases that are more > > > complex than that. Maybe the aggregation stuff, but I don't think we > > > need that self/compatible split for that, either.]
> > Hi Cornelia, > > > > The reason I want to declare compatible list of attributes is that > > sometimes it's not a simple 1:1 matching of source attributes and target > > attributes > > as I demonstrated below, > > source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is compatible with > > target mdev of (mdev_type i915-GVTg_V5_4 + aggregator 2), > >(mdev_type i915-GVTg_V5_8 + aggregator 4) > > > > and aggregator may be just one of such examples that 1:1 matching does not > > fit. > > If you're suggesting that we need a new 'compatible' set for every > aggregation, haven't we lost the purpose of aggregation? For example, > rather than having N mdev types to represent all the possible > aggregation values, we have a single mdev type with N compatible > migration entries, one for each possible aggregation value. BTW, how do > we have multiple compatible directories? compatible0001, > compatible0002? Thanks, > do you think the bin_attribute I proposed yesterday is good? Then we can have a single compatible entry with a variable in the mdev_type and aggregator. mdev_type=i915-GVTg_V5_{val1:int:2,4,8} aggregator={val1}/2 Thanks Yan
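To make the templated entry concrete, here is a sketch of how a management tool might expand it into concrete candidates. The grammar is only inferred from this one example (`{name:int:v1,v2,...}` declares an integer variable, `{name}` references it, and `a/b` after substitution is an exact ratio); it is not an existing kernel or libvirt format, and `expand` is a hypothetical helper.

```python
import itertools
import re
from fractions import Fraction

# Hypothetical grammar inferred from the single example in this message:
#   {name:int:v1,v2,...}  declares an integer variable and its allowed values
#   {name}                refers to a declared variable
#   a/b                   (after substitution) is reduced to an exact ratio
DECL = re.compile(r"\{(\w+):int:([0-9,]+)\}")
REF = re.compile(r"\{(\w+)\}")
RATIO = re.compile(r"^(\d+)/(\d+)$")


def expand(template):
    """Expand one templated 'compatible' entry into concrete attribute dicts."""
    lines = [line.strip() for line in template.splitlines() if "=" in line]
    decls = {}
    for line in lines:
        for name, values in DECL.findall(line):
            decls[name] = [int(v) for v in values.split(",")]
    names = sorted(decls)
    entries = []
    # One concrete entry per combination of variable values.
    for combo in itertools.product(*(decls[n] for n in names)):
        env = dict(zip(names, combo))
        entry = {}
        for line in lines:
            key, value = (part.strip() for part in line.split("=", 1))
            value = DECL.sub(lambda m: str(env[m.group(1)]), value)
            value = REF.sub(lambda m: str(env[m.group(1)]), value)
            ratio = RATIO.match(value)
            if ratio:
                value = str(Fraction(int(ratio.group(1)), int(ratio.group(2))))
            entry[key] = value
        entries.append(entry)
    return entries
```

Expanding the two lines from the message yields exactly the (i915-GVTg_V5_2, 1), (i915-GVTg_V5_4, 2) and (i915-GVTg_V5_8, 4) pairs from the earlier example, which is the compression the bin_attribute proposal is after. Note that a value list containing 1, as in the later `{val1:int:1,2,4,8}` variant, makes this sketch report an aggregator of 1/2; a real parser would need an explicit rule for non-integer results.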
Re: device compatibility interface for live migration with assigned devices
On Wed, Aug 19, 2020 at 09:13:45PM -0600, Alex Williamson wrote: > On Thu, 20 Aug 2020 08:18:10 +0800 > Yan Zhao wrote: > > > On Wed, Aug 19, 2020 at 11:50:21AM -0600, Alex Williamson wrote: > > <...> > > > > > > > What I care about is that we have a *standard* userspace API for > > > > > > > performing device compatibility checking / state migration, for > > > > > > > use by > > > > > > > QEMU/libvirt/ OpenStack, such that we can write code without > > > > > > > countless > > > > > > > vendor specific code paths. > > > > > > > > > > > > > > If there is vendor specific stuff on the side, that's fine as we > > > > > > > can > > > > > > > ignore that, but the core functionality for device compat / > > > > > > > migration > > > > > > > needs to be standardized. > > > > > > > > > > > > To summarize: > > > > > > - choose one of sysfs or devlink > > > > > > - have a common interface, with a standardized way to add > > > > > > vendor-specific attributes > > > > > > ? > > > > > > > > > > Please refer to my previous email which has more examples and details. > > > > > > > > > hi Parav, > > > > the example is based on a new vdpa tool running over netlink, not based > > > > on devlink, right? > > > > For vfio migration compatibility, we have to deal with both mdev and > > > > physical > > > > pci devices, I don't think it's a good idea to write a new tool for it, > > > > given > > > > we are able to retrieve the same info from sysfs and there's already an > > > > mdevctl from Alex (https://github.com/mdevctl/mdevctl). > > > > > > > > hi All, > > > > could we decide that sysfs is the interface that every VFIO vendor > > > > driver > > > > needs to provide in order to support vfio live migration, otherwise the > > > > userspace management tool would not list the device in the compatible > > > > list? > > > > > > > > if that's true, let's move on to standardizing the sysfs interface.
> > > > (1) content > > > > common part: (must) > > > >- software_version: (in major.minor.bugfix scheme) > > > >- device_api: vfio-pci or vfio-ccw ... > > > >- type: mdev type for mdev device or > > > >a signature for physical device which is a counterpart for > > > >mdev type. > > > > > > > > device api specific part: (must) > > > > - pci id: pci id of mdev parent device or pci id of physical pci > > > > device (device_api is vfio-pci) > > > > > > As noted previously, the parent PCI ID should not matter for an mdev > > > device, if a vendor has a dependency on matching the parent device PCI > > > ID, that's a vendor specific restriction. An mdev device can also > > > expose a vfio-pci device API without the parent device being PCI. For > > > a physical PCI device, shouldn't the PCI ID be encompassed in the > > > signature? Thanks, > > > > > you are right. I need to put the PCI ID as a vendor specific field. > > I didn't do that because I wanted all fields in vendor specific to be > > configurable by management tools, so they can configure the target device > > according to the value of a vendor specific field even if they don't know > > the meaning of the field. > > But maybe they can just ignore the field when they can't find a matching > > writable field to configure the target. > > > If fields can be ignored, what's the point of reporting them? Seems > it's no longer a requirement. Thanks, > sorry about the confusion. I mean this scenario: when about to migrate, openstack searches for existing matching MDEVs; if there is one, i.e. all common/vendor specific fields match, it just creates a VM with the matching target MDEV (in this case, the PCI ID field is not ignored); if not, openstack tries to create an MDEV according to mdev_type, and configures the MDEV according to the vendor specific attributes. As PCI ID is not a configurable field, it just ignores the field. Thanks Yan
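The reuse-or-create flow Yan describes can be sketched as below. Everything here is hypothetical: the field names follow the proposal in this thread ("type" for the mdev type, vendor extras such as "aggregator" and "pci_id"), and `plan_target` only models the decision, not any sysfs writes or OpenStack code.

```python
from dataclasses import dataclass, field

# Hypothetical: field names follow the proposal in this thread.
COMMON_FIELDS = ("software_version", "device_api", "type")


@dataclass
class Plan:
    action: str                                    # "reuse" or "create"
    mdev_type: str
    target: dict = field(default_factory=dict)     # existing mdev, when reused
    configure: dict = field(default_factory=dict)  # vendor attrs to write, when created
    skipped: list = field(default_factory=list)    # vendor attrs with no writable file


def plan_target(source, existing, writable):
    """Pick a target mdev for `source` (a dict of the source device's attributes).

    existing: attribute dicts of mdevs already present on the target host.
    writable: names of vendor attributes the target exposes as writable files.
    """
    # Case 1: an existing mdev matches every common and vendor-specific field.
    for candidate in existing:
        if all(candidate.get(key) == value for key, value in source.items()):
            return Plan("reuse", source["type"], target=candidate)
    # Case 2: create an mdev of the same type, then copy writable vendor fields;
    # fields that cannot be written (e.g. pci_id) are skipped rather than matched.
    plan = Plan("create", source["type"])
    for key, value in source.items():
        if key in COMMON_FIELDS:
            continue
        if key in writable:
            plan.configure[key] = value
        else:
            plan.skipped.append(key)
    return plan
```

The `skipped` list makes Alex's objection visible: a field that is enforced in the reuse path but silently dropped in the create path is not really a compatibility requirement.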
Re: device compatibility interface for live migration with assigned devices
On Thu, 20 Aug 2020 08:39:22 +0800 Yan Zhao wrote: > On Tue, Aug 18, 2020 at 11:36:52AM +0200, Cornelia Huck wrote: > > On Tue, 18 Aug 2020 10:16:28 +0100 > > Daniel P. Berrangé wrote: > > > > > On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote: > > > >On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote: > > > > > > > > On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote: > > > > > > > > On 2020/8/14 1:16 PM, Yan Zhao wrote: > > > > > > > > On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: > > > > > > > > On 2020/8/10 3:46 PM, Yan Zhao wrote: > > > > > > > we actually can also retrieve the same information through sysfs, e.g. > > > > > > > > |- [path to device] > > > > |--- migration > > > > | |--- self > > > > | | |---device_api > > > > || |---mdev_type > > > > || |---software_version > > > > || |---device_id > > > > || |---aggregator > > > > | |--- compatible > > > > | | |---device_api > > > > || |---mdev_type > > > > || |---software_version > > > > || |---device_id > > > > || |---aggregator > > > > > > > > > > > > Yes but: > > > > > > > > - You need one file per attribute (one syscall for one attribute) > > > > - Attribute is coupled with kobject > > > > Is that really that bad? You have the device with an embedded kobject > > anyway, and you can just put things into an attribute group? > > > > [Also, I think that self/compatible split in the example makes things > > needlessly complex. Shouldn't semantic versioning and matching already > > cover nearly everything? I would expect very few cases that are more > > complex than that. Maybe the aggregation stuff, but I don't think we > > need that self/compatible split for that, either.]
> Hi Cornelia, > > The reason I want to declare compatible list of attributes is that > sometimes it's not a simple 1:1 matching of source attributes and target > attributes > as I demonstrated below, > source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is compatible with > target mdev of (mdev_type i915-GVTg_V5_4 + aggregator 2), >(mdev_type i915-GVTg_V5_8 + aggregator 4) > > and aggregator may be just one of such examples that 1:1 matching does not > fit. If you're suggesting that we need a new 'compatible' set for every aggregation, haven't we lost the purpose of aggregation? For example, rather than having N mdev types to represent all the possible aggregation values, we have a single mdev type with N compatible migration entries, one for each possible aggregation value. BTW, how do we have multiple compatible directories? compatible0001, compatible0002? Thanks, Alex
Re: device compatibility interface for live migration with assigned devices
On Thu, 20 Aug 2020 08:18:10 +0800 Yan Zhao wrote: > On Wed, Aug 19, 2020 at 11:50:21AM -0600, Alex Williamson wrote: > <...> > > > > > > What I care about is that we have a *standard* userspace API for > > > > > > performing device compatibility checking / state migration, for use > > > > > > by > > > > > > QEMU/libvirt/ OpenStack, such that we can write code without > > > > > > countless > > > > > > vendor specific code paths. > > > > > > > > > > > > If there is vendor specific stuff on the side, that's fine as we can > > > > > > ignore that, but the core functionality for device compat / > > > > > > migration > > > > > > needs to be standardized. > > > > > > > > > > To summarize: > > > > > - choose one of sysfs or devlink > > > > > - have a common interface, with a standardized way to add > > > > > vendor-specific attributes > > > > > ? > > > > > > > > Please refer to my previous email which has more examples and details. > > > > > > > hi Parav, > > > the example is based on a new vdpa tool running over netlink, not based > > > on devlink, right? > > > For vfio migration compatibility, we have to deal with both mdev and > > > physical > > > pci devices, I don't think it's a good idea to write a new tool for it, > > > given > > > we are able to retrieve the same info from sysfs and there's already an > > > mdevctl from Alex (https://github.com/mdevctl/mdevctl). > > > > > > hi All, > > > could we decide that sysfs is the interface that every VFIO vendor driver > > > needs to provide in order to support vfio live migration, otherwise the > > > userspace management tool would not list the device in the compatible > > > list? > > > > > > if that's true, let's move on to standardizing the sysfs interface. > > > (1) content > > > common part: (must) > > >- software_version: (in major.minor.bugfix scheme) > > >- device_api: vfio-pci or vfio-ccw ...
> > >- type: mdev type for mdev device or > > >a signature for physical device which is a counterpart for > > > mdev type. > > > > > > device api specific part: (must) > > > - pci id: pci id of mdev parent device or pci id of physical pci > > > device (device_api is vfio-pci) > > > > As noted previously, the parent PCI ID should not matter for an mdev > > device, if a vendor has a dependency on matching the parent device PCI > > ID, that's a vendor specific restriction. An mdev device can also > > expose a vfio-pci device API without the parent device being PCI. For > > a physical PCI device, shouldn't the PCI ID be encompassed in the > > signature? Thanks, > > > you are right. I need to put the PCI ID as a vendor specific field. > I didn't do that because I wanted all fields in vendor specific to be > configurable by management tools, so they can configure the target device > according to the value of a vendor specific field even if they don't know > the meaning of the field. > But maybe they can just ignore the field when they can't find a matching > writable field to configure the target. If fields can be ignored, what's the point of reporting them? Seems it's no longer a requirement. Thanks, Alex > > > - subchannel_type (device_api is vfio-ccw) > > > > > > vendor driver specific part: (optional) > > > - aggregator > > > - chpid_type > > > - remote_url > > > > > > NOTE: vendors are free to add attributes in this part with a > > > restriction that this attribute is able to be configured with the same > > > name in sysfs too. e.g. > > > for aggregator, there must be a sysfs attribute in device node > > > /sys/devices/pci:00/:00:02.0/882cc4da-dede-11e7-9180-078a62063ab1/intel_vgpu/aggregator, > > > so that the userspace tool is able to configure the target device > > > according to source device's aggregator attribute.
> > > > > > > > > (2) where and structure > > > proposal 1: > > > |- [path to device] > > > |--- migration > > > | |--- self > > > | ||-software_version > > > | ||-device_api > > > | ||-type > > > | ||-[pci_id or subchannel_type] > > > | ||- > > > | |--- compatible > > > | ||-software_version > > > | ||-device_api > > > | ||-type > > > | ||-[pci_id or subchannel_type] > > > | ||- > > > multiple compatible entries are allowed. > > > attributes should be ASCII text files, preferably with only one value > > > per file. > > > > > > > > > proposal 2: use bin_attribute. > > > |- [path to device] > > > |--- migration > > > | |--- self > > > | |--- compatible > > > > > > so we can continue to use the multiline format. e.g. > > > cat compatible > > > software_version=0.1.0 > > > device_api=vfio_pci > > > type=i915-GVTg_V5_{val1:int:1,2,4,8} > > > pci_id=80865963 > > > aggregator={val1}/2 > > > > > > Thanks > > > Yan > > > > > >
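Under proposal 1 each attribute is its own ASCII file, so a management tool reads the layout one file at a time (the "one syscall per attribute" cost Jason mentions). A minimal reader sketch follows, assuming the `<device>/migration/self` and `<device>/migration/compatible` directories described above; no such directories exist in mainline sysfs, and the naming of multiple compatible directories is still an open question in the thread, so only a single `compatible` directory is read.

```python
from pathlib import Path


def read_attr_dir(directory):
    """Return {attribute: value} for each regular file in `directory`.

    Following proposal 1, every attribute is its own ASCII file holding one
    value, so this costs one open/read per attribute.
    """
    attrs = {}
    for entry in sorted(Path(directory).iterdir()):
        if entry.is_file():
            attrs[entry.name] = entry.read_text().strip()
    return attrs


def read_migration_info(device_path):
    """Read the assumed <device>/migration/self and <device>/migration/compatible."""
    base = Path(device_path) / "migration"
    return {
        "self": read_attr_dir(base / "self"),
        "compatible": read_attr_dir(base / "compatible"),
    }
```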
Re: device compatibility interface for live migration with assigned devices
On Thu, 2020-08-20 at 08:39 +0800, Yan Zhao wrote: > On Tue, Aug 18, 2020 at 11:36:52AM +0200, Cornelia Huck wrote: > > On Tue, 18 Aug 2020 10:16:28 +0100 > > Daniel P. Berrangé wrote: > > > > > On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote: > > > >On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote: > > > > > > > > On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote: > > > > > > > > On 2020/8/14 1:16 PM, Yan Zhao wrote: > > > > > > > > On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: > > > > > > > > On 2020/8/10 3:46 PM, Yan Zhao wrote: > > > > we actually can also retrieve the same information through sysfs, e.g. > > > > > > > > |- [path to device] > > > > |--- migration > > > > | |--- self > > > > | | |---device_api > > > > || |---mdev_type > > > > || |---software_version > > > > || |---device_id > > > > || |---aggregator > > > > | |--- compatible > > > > | | |---device_api > > > > || |---mdev_type > > > > || |---software_version > > > > || |---device_id > > > > || |---aggregator > > > > > > > > > > > > Yes but: > > > > > > > > - You need one file per attribute (one syscall for one attribute) > > > > - Attribute is coupled with kobject > > > > Is that really that bad? You have the device with an embedded kobject > > anyway, and you can just put things into an attribute group? > > > > [Also, I think that self/compatible split in the example makes things > > needlessly complex. Shouldn't semantic versioning and matching already > > cover nearly everything? I would expect very few cases that are more > > complex than that. Maybe the aggregation stuff, but I don't think we > > need that self/compatible split for that, either.]
> > Hi Cornelia, > > The reason I want to declare compatible list of attributes is that > sometimes it's not a simple 1:1 matching of source attributes and target > attributes > as I demonstrated below, > source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is compatible with > target mdev of (mdev_type i915-GVTg_V5_4 + aggregator 2), >(mdev_type i915-GVTg_V5_8 + aggregator 4) the way you are doing the naming is still really confusing, by the way. if this has not already been merged in the kernel, can you change the mdev so that mdev_type i915-GVTg_V5_2 is 2 of mdev_type i915-GVTg_V5_1 instead of half the device? currently you need to divide the aggregator by the number at the end of the mdev type to figure out how much of the physical device is being used, which is a very unfriendly api convention. the way aggregators are being proposed in general is not really something i like, but i think this at least is something that we should be able to correct. with the complexity in the mdev type name + aggregator i suspect that this will never be supported in openstack nova directly, requiring integration via cyborg, unless we can pre-partition the device into mdevs statically and just ignore this. this is way too vendor specific to integrate into something like openstack in nova unless we can guarantee that how aggregators work will be portable across vendors generically. > > and aggregator may be just one of such examples that 1:1 matching does not > fit. for openstack nova i don't see us supporting anything beyond the 1:1 case where the mdev type does not change. i would really prefer if there was just one mdev type that represented the minimal allocatable unit and the aggregators were used to create compositions of that, i.e. instead of i915-GVTg_V5_2 being half the device, have 1 mdev type i915-GVTg and if the device supports 8 of them then we can aggregate 4 of i915-GVTg. if you want to have multiple mdev types to model the different amounts of the resource, e.g.
i915-GVTg_small and i915-GVTg_large, that is totally fine too, or even i915-GVTg_4 indicating it is 4 of i915-GVTg. failing that i would just expose an mdev type per composable resource and allow us to compose them at the user level with some other construct modeling an attachment to the device, e.g. create a composed mdev or something that is an aggregation of multiple sub-resources, each of which is an mdev, kind of like how bond ports work. we would create an mdev for each of the sub-resources and then create a bond or aggregated mdev by referencing the other mdevs by uuid, then attach only the aggregated mdev to the instance. the current aggregator syntax and semantics, however, make me rather uncomfortable when i think about orchestrating vms on top of it, even to boot them, let alone migrate them. > > So, we explicitly list out self/compatible attributes, and management > tools only need to check if self attributes are contained in compatible > attributes. > > or do you mean only compatible list is enough, and the management tools > need to find out self list by themselves? > But I think providing a self list is easier for management tools. > > Thanks > Yan >
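Sean's complaint about the naming convention can be made concrete with a small calculation. The sketch below assumes, as his description implies, that `i915-GVTg_V5_N` denotes 1/N of the physical device and that the aggregator multiplies that slice; both helpers are hypothetical and only do arithmetic on the names used in this thread.

```python
import re
from fractions import Fraction

SUFFIX = re.compile(r"_(\d+)$")


def device_share(mdev_type, aggregator):
    """Fraction of the physical device used, assuming i915-GVTg_V5_N is 1/N of
    the device and the aggregator multiplies that slice."""
    match = SUFFIX.search(mdev_type)
    if match is None:
        raise ValueError(f"no numeric suffix in {mdev_type!r}")
    return Fraction(aggregator, int(match.group(1)))


def minimal_units(share, units_per_device):
    """The same share as a count of one minimal mdev type (Sean's alternative);
    units_per_device is whatever the hardware supports, e.g. 8."""
    units = share * units_per_device
    if units.denominator != 1:
        raise ValueError("share is not a whole number of minimal units")
    return int(units)
```

All three pairs from Yan's example, (V5_2, 1), (V5_4, 2) and (V5_8, 4), come out as half the device. Under the single minimal type Sean proposes, that half would simply be 4 units of an 8-unit device, with no division by a number embedded in the type name.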
Re: device compatibility interface for live migration with assigned devices
On Tue, Aug 18, 2020 at 11:36:52AM +0200, Cornelia Huck wrote: > On Tue, 18 Aug 2020 10:16:28 +0100 > Daniel P. Berrangé wrote: > > > On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote: > > >On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote: > > > > > > On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote: > > > > > > On 2020/8/14 1:16 PM, Yan Zhao wrote: > > > > > > On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: > > > > > > On 2020/8/10 3:46 PM, Yan Zhao wrote: > > > > > we actually can also retrieve the same information through sysfs, e.g. > > > > > > |- [path to device] > > > |--- migration > > > | |--- self > > > | | |---device_api > > > || |---mdev_type > > > || |---software_version > > > || |---device_id > > > || |---aggregator > > > | |--- compatible > > > | | |---device_api > > > || |---mdev_type > > > || |---software_version > > > || |---device_id > > > || |---aggregator > > > > > > > > > Yes but: > > > > > > - You need one file per attribute (one syscall for one attribute) > > > - Attribute is coupled with kobject > > Is that really that bad? You have the device with an embedded kobject > anyway, and you can just put things into an attribute group? > > [Also, I think that self/compatible split in the example makes things > needlessly complex. Shouldn't semantic versioning and matching already > cover nearly everything? I would expect very few cases that are more > complex than that. Maybe the aggregation stuff, but I don't think we > need that self/compatible split for that, either.] Hi Cornelia, The reason I want to declare compatible list of attributes is that sometimes it's not a simple 1:1 matching of source attributes and target attributes as I demonstrated below, source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is compatible with target mdev of (mdev_type i915-GVTg_V5_4 + aggregator 2), (mdev_type i915-GVTg_V5_8 + aggregator 4) and aggregator may be just one of such examples that 1:1 matching does not fit.
So, we explicitly list out self/compatible attributes, and management tools only need to check if the self attributes are contained in the compatible attributes. or do you mean only the compatible list is enough, and the management tools need to find out the self list by themselves? But I think providing a self list is easier for management tools. Thanks Yan
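A sketch of the "self attributes are contained in the compatible attributes" check, under the assumption that each compatible entry maps an attribute name to the set of values it accepts and that a target may list several entries. The entry contents are hypothetical and mirror the i915 example; the thread has not settled on how multiple entries would be represented.

```python
def matches(self_attrs, compatible_entry):
    """True if every attribute constrained by one compatible entry accepts the
    source's value. compatible_entry maps attribute -> collection of accepted values."""
    return all(
        key in self_attrs and self_attrs[key] in accepted
        for key, accepted in compatible_entry.items()
    )


def is_compatible(self_attrs, compatible_entries):
    """The target accepts the source if any one of its compatible entries matches."""
    return any(matches(self_attrs, entry) for entry in compatible_entries)


# Hypothetical entries a target i915-GVTg_V5_4 mdev with aggregator 2 might list.
TARGET_COMPATIBLE = [
    {"mdev_type": {"i915-GVTg_V5_2"}, "aggregator": {"1"}},
    {"mdev_type": {"i915-GVTg_V5_4"}, "aggregator": {"2"}},
    {"mdev_type": {"i915-GVTg_V5_8"}, "aggregator": {"4"}},
]
```

Keeping each (type, aggregator) pair in its own entry is what prevents mixing: checked against a flat union of values per attribute, a source of (V5_2, aggregator 2) would wrongly pass. It also illustrates Alex's concern, since the target needs one entry per aggregation value.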
Re: device compatibility interface for live migration with assigned devices
On Wed, Aug 19, 2020 at 11:50:21AM -0600, Alex Williamson wrote: <...> > > > > > What I care about is that we have a *standard* userspace API for > > > > > performing device compatibility checking / state migration, for use by > > > > > QEMU/libvirt/ OpenStack, such that we can write code without countless > > > > > vendor specific code paths. > > > > > > > > > > If there is vendor specific stuff on the side, that's fine as we can > > > > > ignore that, but the core functionality for device compat / migration > > > > > needs to be standardized. > > > > > > > > To summarize: > > > > - choose one of sysfs or devlink > > > > - have a common interface, with a standardized way to add > > > > vendor-specific attributes > > > > ? > > > > > > Please refer to my previous email which has more examples and details. > > hi Parav, > > the example is based on a new vdpa tool running over netlink, not based > > on devlink, right? > > For vfio migration compatibility, we have to deal with both mdev and > > physical > > pci devices, I don't think it's a good idea to write a new tool for it, > > given > > we are able to retrieve the same info from sysfs and there's already an > > mdevctl from Alex (https://github.com/mdevctl/mdevctl). > > > > hi All, > > could we decide that sysfs is the interface that every VFIO vendor driver > > needs to provide in order to support vfio live migration, otherwise the > > userspace management tool would not list the device in the compatible > > list? > > > > if that's true, let's move on to standardizing the sysfs interface. > > (1) content > > common part: (must) > >- software_version: (in major.minor.bugfix scheme) > >- device_api: vfio-pci or vfio-ccw ... > >- type: mdev type for mdev device or > >a signature for physical device which is a counterpart for > >mdev type.
> > > > device api specific part: (must) > > - pci id: pci id of mdev parent device or pci id of physical pci > > device (device_api is vfio-pci) > > As noted previously, the parent PCI ID should not matter for an mdev > device, if a vendor has a dependency on matching the parent device PCI > ID, that's a vendor specific restriction. An mdev device can also > expose a vfio-pci device API without the parent device being PCI. For > a physical PCI device, shouldn't the PCI ID be encompassed in the > signature? Thanks, > you are right. I need to put the PCI ID as a vendor specific field. I didn't do that because I wanted all fields in vendor specific to be configurable by management tools, so they can configure the target device according to the value of a vendor specific field even if they don't know the meaning of the field. But maybe they can just ignore the field when they can't find a matching writable field to configure the target. Thanks Yan > > - subchannel_type (device_api is vfio-ccw) > > > > vendor driver specific part: (optional) > > - aggregator > > - chpid_type > > - remote_url > > > > NOTE: vendors are free to add attributes in this part with a > > restriction that this attribute is able to be configured with the same > > name in sysfs too. e.g. > > for aggregator, there must be a sysfs attribute in device node > > /sys/devices/pci:00/:00:02.0/882cc4da-dede-11e7-9180-078a62063ab1/intel_vgpu/aggregator, > > so that the userspace tool is able to configure the target device > > according to source device's aggregator attribute. > > > > > > (2) where and structure > > proposal 1: > > |- [path to device] > > |--- migration > > | |--- self > > | ||-software_version > > | ||-device_api > > | ||-type > > | ||-[pci_id or subchannel_type] > > | ||- > > | |--- compatible > > | ||-software_version > > | ||-device_api > > | ||-type > > | ||-[pci_id or subchannel_type] > > | ||- > > multiple compatible entries are allowed.
> > attributes should be ASCII text files, preferably with only one value > > per file. > > > > > > proposal 2: use bin_attribute. > > |- [path to device] > > |--- migration > > | |--- self > > | |--- compatible > > > > so we can continue use multiline format. e.g. > > cat compatible > > software_version=0.1.0 > > device_api=vfio_pci > > type=i915-GVTg_V5_{val1:int:1,2,4,8} > > pci_id=80865963 > > aggregator={val1}/2 > > > > Thanks > > Yan > > >
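For illustration only, here is a minimal sketch of how a management tool might read the proposal 2 `compatible` bin_attribute. It assumes exactly the format in the example: one key=value pair per line. `parse_compatible` is a hypothetical helper, not an existing libvirt or mdevctl function. Template values such as `{val1}/2` are returned unexpanded.

```python
def parse_compatible(text):
    """Parse the multiline key=value format sketched in proposal 2 (assumed format)."""
    attrs = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        # Split on the first '=' only; values are kept as raw strings.
        key, sep, value = line.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"line {lineno}: expected key=value, got {line!r}")
        attrs[key.strip()] = value.strip()
    return attrs
```

Keeping values as raw strings leaves the interpretation of template syntax to a separate step. A consumer that does not understand a given key can still carry it along unchanged.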
Re: device compatibility interface for live migration with assigned devices
On Wed, 19 Aug 2020 11:30:35 +0800 Yan Zhao wrote: > On Tue, Aug 18, 2020 at 09:39:24AM +, Parav Pandit wrote: > > Hi Cornelia, > > > > > From: Cornelia Huck > > > Sent: Tuesday, August 18, 2020 3:07 PM > > > To: Daniel P. Berrangé > > > Cc: Jason Wang ; Yan Zhao > > > ; k...@vger.kernel.org; libvir-l...@redhat.com; > > > qemu-devel@nongnu.org; Kirti Wankhede ; > > > eau...@redhat.com; xin-ran.w...@intel.com; cor...@lwn.net; openstack- > > > disc...@lists.openstack.org; shaohe.f...@intel.com; kevin.t...@intel.com; > > > Parav Pandit ; jian-feng.d...@intel.com; > > > dgilb...@redhat.com; zhen...@linux.intel.com; hejie...@intel.com; > > > bao.yum...@zte.com.cn; Alex Williamson ; > > > eskul...@redhat.com; smoo...@redhat.com; intel-gvt- > > > d...@lists.freedesktop.org; Jiri Pirko ; > > > dinec...@redhat.com; de...@ovirt.org > > > Subject: Re: device compatibility interface for live migration with > > > assigned > > > devices > > > > > > On Tue, 18 Aug 2020 10:16:28 +0100 > > > Daniel P. Berrangé wrote: > > > > > > > On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote: > > > > >On 2020/8/18 4:55 PM, Daniel P.
Berrangé wrote: > > > > > > > > > > On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote: > > > > > > > > > > On 2020/8/14 1:16 PM, Yan Zhao wrote: > > > > > > > > > > On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: > > > > > > > > > > On 2020/8/10 3:46 PM, Yan Zhao wrote: > > > > > > > > > we actually can also retrieve the same information through sysfs, > > > > > .e.g > > > > > > > > > > |- [path to device] > > > > > |--- migration > > > > > | |--- self > > > > > | | |---device_api > > > > > || |---mdev_type > > > > > || |---software_version > > > > > || |---device_id > > > > > || |---aggregator > > > > > | |--- compatible > > > > > | | |---device_api > > > > > || |---mdev_type > > > > > || |---software_version > > > > > || |---device_id > > > > > || |---aggregator > > > > > > > > > > > > > > > Yes but: > > > > > > > > > > - You need one file per attribute (one syscall for one attribute) > > > > > - Attribute is coupled with kobject > > > > > > Is that really that bad? You have the device with an embedded kobject > > > anyway, and you can just put things into an attribute group? > > > > > > [Also, I think that self/compatible split in the example makes things > > > needlessly complex. Shouldn't semantic versioning and matching already > > > cover nearly everything? I would expect very few cases that are more > > > complex than that. Maybe the aggregation stuff, but I don't think we need > > > that self/compatible split for that, either.] > > > > > > > > > > > > > All of above seems unnecessary. > > > > > > > > > > Another point, as we discussed in another thread, it's really hard > > > > > to make sure the above API work for all types of devices and > > > > > frameworks. So having a vendor specific API looks much better. > > > > > > > > > > From the POV of userspace mgmt apps doing device compat checking / > > > > > migration, we certainly do NOT want to use different vendor > > > > > specific APIs.
We want to have an API that can be used / controlled > > > > > in a > > > standard manner across vendors. > > > > > > > > > >Yes, but it could be hard. E.g vDPA will chose to use devlink > > > > > (there's a > > > > >long debate on sysfs vs devlink). So if we go with sysfs, at least > > > > > two > > > > >APIs needs to be supported ... > > > > > > > > NB, I was not questioning devlink vs sysfs directly. If devlink is > > > > related to netlink, I can't say I'm enthusiastic as IMKE sysfs is > > > > easier to deal with. I don't know enough about devlink to have much of > > > > an > > > o
Re: device compatibility interface for live migration with assigned devices
On 2020/8/19 1:58 PM, Parav Pandit wrote: From: Yan Zhao Sent: Wednesday, August 19, 2020 9:01 AM On Tue, Aug 18, 2020 at 09:39:24AM +, Parav Pandit wrote: Please refer to my previous email which has more example and details. hi Parav, the example is based on a new vdpa tool running over netlink, not based on devlink, right? Right. For vfio migration compatibility, we have to deal with both mdev and physical pci devices, I don't think it's a good idea to write a new tool for it, given we are able to retrieve the same info from sysfs and there's already an mdevctl from mdev attribute should be visible in the mdev's sysfs tree. I do not propose to write a new mdev tool over netlink. I am sorry if I implied that with my suggestion of vdpa tool. If underlying device is vdpa, mdev might be able to understand vdpa device and query from it and populate in mdev sysfs tree. Note that vdpa is bus independent so it can't work now and the support of mdev on top of vDPA has been rejected (and duplicated with vhost-vDPA). Thanks The vdpa tool I propose is usable even without mdevs. vdpa tool's role is to create one or more vdpa devices and place on the "vdpa" bus which is the lowest layer here. Additionally this tool lets the user query virtqueue stats, db stats. When a user creates vdpa net device, user may need to configure features of the vdpa device such as VIRTIO_NET_F_MAC, default VIRTIO_NET_F_MTU. These are vdpa level features, attributes. Mdev is layer above it. Alex (https://github.com/mdevctl/mdevctl). Sorry for above link mangling. Our mail server is still transitioning due to company acquisition. I am less familiar with the points below, so I won't comment on them.
hi All, could we decide that sysfs is the interface that every VFIO vendor driver needs to provide in order to support vfio live migration, otherwise the userspace management tool would not list the device into the compatible list? if that's true, let's move to the standardizing of the sysfs interface. (1) content common part: (must) - software_version: (in major.minor.bugfix scheme) - device_api: vfio-pci or vfio-ccw ... - type: mdev type for mdev device or a signature for physical device which is a counterpart for mdev type. device api specific part: (must) - pci id: pci id of mdev parent device or pci id of physical pci device (device_api is vfio-pci) - subchannel_type (device_api is vfio-ccw) vendor driver specific part: (optional) - aggregator - chpid_type - remote_url NOTE: vendors are free to add attributes in this part with a restriction that this attribute is able to be configured with the same name in sysfs too. e.g. for aggregator, there must be a sysfs attribute in device node /sys/devices/pci:00/:00:02.0/882cc4da-dede-11e7-9180- 078a62063ab1/intel_vgpu/aggregator, so that the userspace tool is able to configure the target device according to source device's aggregator attribute. (2) where and structure proposal 1: |- [path to device] |--- migration | |--- self | ||-software_version | ||-device_api | ||-type | ||-[pci_id or subchannel_type] | ||- | |--- compatible | ||-software_version | ||-device_api | ||-type | ||-[pci_id or subchannel_type] | ||- multiple compatible is allowed. attributes should be ASCII text files, preferably with only one value per file. proposal 2: use bin_attribute. |- [path to device] |--- migration | |--- self | |--- compatible so we can continue use multiline format. e.g. cat compatible software_version=0.1.0 device_api=vfio_pci type=i915-GVTg_V5_{val1:int:1,2,4,8} pci_id=80865963 aggregator={val1}/2 Thanks Yan
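The thread does not define the semantics of the template syntax in `type=i915-GVTg_V5_{val1:int:1,2,4,8}` and `aggregator={val1}/2`. One plausible reading is that `{val1:int:1,2,4,8}` declares an integer variable with an allowed value list and `{val1}` elsewhere refers to it. The sketch below assumes that reading. It expands such a template into concrete candidate attribute sets. `expand_templates` is a hypothetical helper, and the `/2` arithmetic is deliberately left as text for the consumer to interpret.

```python
import itertools
import re

# Assumed syntax (not specified in the thread):
#   {name:int:v1,v2,...}  declares an integer variable with allowed values
#   {name}                references a declared variable
DECL = re.compile(r"\{(\w+):int:(\d+(?:,\d+)*)\}")
REF = re.compile(r"\{(\w+)\}")


def expand_templates(attrs):
    """Return one concrete attribute dict per combination of declared values."""
    decls = {}
    for value in attrs.values():
        for name, choices in DECL.findall(value):
            decls[name] = [int(c) for c in choices.split(",")]
    names = sorted(decls)
    results = []
    # With no declarations, product() yields a single empty tuple, so the
    # input is returned unchanged as the only candidate.
    for combo in itertools.product(*(decls[n] for n in names)):
        binding = dict(zip(names, combo))
        concrete = {}
        for key, value in attrs.items():
            value = DECL.sub(lambda m: str(binding[m.group(1)]), value)
            # Unknown references are left as-is rather than raising.
            value = REF.sub(lambda m: str(binding.get(m.group(1), m.group(0))), value)
            concrete[key] = value
        results.append(concrete)
    return results
```

Under this interpretation, the example yields four candidates, from `type=i915-GVTg_V5_1, aggregator=1/2` up to `type=i915-GVTg_V5_8, aggregator=8/2`. A management tool would then try to match or create a target device against one of them.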
Re: [ovirt-devel] Re: device compatibility interface for live migration with assigned devices
On 2020/8/19 4:13 PM, Yan Zhao wrote: On Wed, Aug 19, 2020 at 03:39:50PM +0800, Jason Wang wrote: On 2020/8/19 2:59 PM, Yan Zhao wrote: On Wed, Aug 19, 2020 at 02:57:34PM +0800, Jason Wang wrote: On 2020/8/19 11:30 AM, Yan Zhao wrote: hi All, could we decide that sysfs is the interface that every VFIO vendor driver needs to provide in order to support vfio live migration, otherwise the userspace management tool would not list the device into the compatible list? if that's true, let's move to the standardizing of the sysfs interface. (1) content common part: (must) - software_version: (in major.minor.bugfix scheme) This can not work for devices whose features can be negotiated/advertised independently. (E.g virtio devices) sorry, I don't understand here, why virtio devices need to use vfio interface? I don't see any reason that virtio devices can't be used by VFIO. Do you? Actually, virtio devices have been used by VFIO for many years: - passthrough a hardware virtio devices to userspace(VM) drivers - using virtio PMD inside guest So, what's different for it vs passing through a physical hardware via VFIO? The difference is in the guest, the device could be either real hardware or emulated ones. even though the features are negotiated dynamically, could you explain why it would cause software_version not work? Virtio device 1 supports feature A, B, C Virtio device 2 supports feature B, C, D So you can't migrate a guest from device 1 to device 2. And it's impossible to model the features with versions. I think this thread is discussing about vfio related devices. - device_api: vfio-pci or vfio-ccw ... - type: mdev type for mdev device or a signature for physical device which is a counterpart for mdev type. device api specific part: (must) - pci id: pci id of mdev parent device or pci id of physical pci device (device_api is vfio-pci)API here. So this assumes a PCI device which is probably not true. for device_api of vfio-pci, why it's not true?
for vfio-ccw, it's subchannel_type. Ok but having two different attributes for the same file is not good idea. How mgmt know there will be a 3rd type? that's why some attributes need to be common. e.g. device_api: it's common because mgmt need to know it's a pci device or a ccw device. and the api type is already defined vfio.h. (The field is agreed by and actually suggested by Alex in previous mail) type: mdev_type for mdev. if mgmt does not understand it, it would not be able to create one compatible mdev device. software_version: mgmt can compare the major and minor if it understands this fields. I think it would be helpful if you can describe how mgmt is expected to work step by step with the proposed sysfs API. This can help people to understand. Thanks for the patience. Since sysfs is uABI, when accepted, we need support it forever. That's why we need to be careful. - subchannel_type (device_api is vfio-ccw) vendor driver specific part: (optional) - aggregator - chpid_type - remote_url For "remote_url", just wonder if it's better to integrate or reuse the existing NVME management interface instead of duplicating it here. Otherwise it could be a burden for mgmt to learn. E.g vendor A may use "remote_url" but vendor B may use a different attribute. it's vendor driver specific. vendor specific attributes are inevitable, and that's why we are discussing here of a way to standardizing of it. Well, then you will end up with a very long list to discuss. E.g for networking devices, you will have "mac", "v(x)lan" and a lot of other. Note that "remote_url" is not vendor specific but NVME (class/subsystem) specific. yes, it's just NVMe specific. I added it as an example to show what is vendor specific. if one attribute is vendor specific across all vendors, then it's not vendor specific, it's already common attribute, right? It's common but the issue is about naming and mgmt overhead. 
Unless you have a unified API per class (NVME, ethernet, etc), you can't prevent vendor from using another name instead of "remote_url". The point is that if vendor/class specific part is unavoidable, why not making all of the attributes vendor specific? some parts need to be common, as I listed above. This is hard, unless VFIO knows the type of device (e.g it's a NVME or networking device). our goal is that mgmt can use it without understanding the meaning of vendor specific attributes. I'm not sure this is the correct design of uAPI. Is there something similar in the existing uAPIs? And it might be hard to work for virtio devices. NOTE: vendors are free to add attributes in this part with a restriction that this attribute is able to be configured with the same name in sysfs too. e.g. Sysfs works well for common attributes belongs to a class, but I'm not sure it can work well for
Re: [ovirt-devel] Re: device compatibility interface for live migration with assigned devices
On Wed, Aug 19, 2020 at 03:39:50PM +0800, Jason Wang wrote: > > On 2020/8/19 2:59 PM, Yan Zhao wrote: > > On Wed, Aug 19, 2020 at 02:57:34PM +0800, Jason Wang wrote: > > > On 2020/8/19 11:30 AM, Yan Zhao wrote: > > > > hi All, > > > > could we decide that sysfs is the interface that every VFIO vendor > > > > driver > > > > needs to provide in order to support vfio live migration, otherwise the > > > > userspace management tool would not list the device into the compatible > > > > list? > > > > > > > > if that's true, let's move to the standardizing of the sysfs interface. > > > > (1) content > > > > common part: (must) > > > > - software_version: (in major.minor.bugfix scheme) > > > > > > This can not work for devices whose features can be negotiated/advertised > > > independently. (E.g virtio devices) > > > > > sorry, I don't understand here, why virtio devices need to use vfio > > interface? > > > I don't see any reason that virtio devices can't be used by VFIO. Do you? > > Actually, virtio devices have been used by VFIO for many years: > > - passthrough a hardware virtio devices to userspace(VM) drivers > - using virtio PMD inside guest > So, what's different for it vs passing through a physical hardware via VFIO? even though the features are negotiated dynamically, could you explain why it would cause software_version not work? > > > I think this thread is discussing about vfio related devices. > > > > > > - device_api: vfio-pci or vfio-ccw ... > > > > - type: mdev type for mdev device or > > > > a signature for physical device which is a counterpart for > > > >mdev type. > > > > > > > > device api specific part: (must) > > > > - pci id: pci id of mdev parent device or pci id of physical pci > > > > device (device_api is vfio-pci)API here. > > > > > > So this assumes a PCI device which is probably not true. > > > > > for device_api of vfio-pci, why it's not true? > > > > for vfio-ccw, it's subchannel_type.
> > > Ok but having two different attributes for the same file is not good idea. > How mgmt know there will be a 3rd type? that's why some attributes need to be common. e.g. device_api: it's common because mgmt need to know it's a pci device or a ccw device. and the api type is already defined vfio.h. (The field is agreed by and actually suggested by Alex in previous mail) type: mdev_type for mdev. if mgmt does not understand it, it would not be able to create one compatible mdev device. software_version: mgmt can compare the major and minor if it understands this fields. > > > > > > > > - subchannel_type (device_api is vfio-ccw) > > > > vendor driver specific part: (optional) > > > > - aggregator > > > > - chpid_type > > > > - remote_url > > > > > > For "remote_url", just wonder if it's better to integrate or reuse the > > > existing NVME management interface instead of duplicating it here. > > > Otherwise > > > it could be a burden for mgmt to learn. E.g vendor A may use "remote_url" > > > but vendor B may use a different attribute. > > > > > it's vendor driver specific. > > vendor specific attributes are inevitable, and that's why we are > > discussing here of a way to standardizing of it. > > > Well, then you will end up with a very long list to discuss. E.g for > networking devices, you will have "mac", "v(x)lan" and a lot of other. > > Note that "remote_url" is not vendor specific but NVME (class/subsystem) > specific. > yes, it's just NVMe specific. I added it as an example to show what is vendor specific. if one attribute is vendor specific across all vendors, then it's not vendor specific, it's already common attribute, right? > The point is that if vendor/class specific part is unavoidable, why not > making all of the attributes vendor specific? > some parts need to be common, as I listed above. > > > our goal is that mgmt can use it without understanding the meaning of vendor > > specific attributes. > > > I'm not sure this is the correct design of uAPI. 
Is there something similar > in the existing uAPIs? > > And it might be hard to work for virtio devices. > > > > > > > > NOTE: vendors are free to add attributes in this part with a > > > > restriction that this attribute is able to be configured with the same > > > > name in sysfs too. e.g. > > > > > > Sysfs works well for common attributes belongs to a class, but I'm not > > > sure > > > it can work well for device/vendor specific attributes. Does this mean > > > mgmt > > > need to iterate all the attributes in both src and dst? > > > > > no. just attributes under migration directory. > > > > > > for aggregator, there must be a sysfs attribute in device node > > > > /sys/devices/pci:00/:00:02.0/882cc4da-dede-11e7-9180-078a62063ab1/intel_vgpu/aggregator, > > > > so that the userspace tool is able to configure the target device > > > > according to source device's aggregator attribute. >
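Yan notes above that mgmt "can compare the major and minor" of software_version. The thread does not pin down the rule. The sketch below assumes a semantic-versioning style policy: the major versions must match, the destination minor must be at least the source minor, and the bugfix component is ignored. `can_migrate` is a hypothetical helper illustrating that assumption.

```python
def parse_version(s):
    """Parse a 'major.minor.bugfix' string into a tuple of three ints."""
    parts = s.strip().split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected major.minor.bugfix, got {s!r}")
    return tuple(int(p) for p in parts)


def can_migrate(src_version, dst_version):
    """Assumed policy: same major, destination minor >= source minor."""
    src_major, src_minor, _ = parse_version(src_version)
    dst_major, dst_minor, _ = parse_version(dst_version)
    return dst_major == src_major and dst_minor >= src_minor
```

This is exactly the kind of linear ordering Jason argues cannot capture independently negotiated features. It only helps when a vendor guarantees that a newer minor version is a strict superset of older ones.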
Re: [ovirt-devel] Re: device compatibility interface for live migration with assigned devices
On 2020/8/19 2:59 PM, Yan Zhao wrote: On Wed, Aug 19, 2020 at 02:57:34PM +0800, Jason Wang wrote: On 2020/8/19 11:30 AM, Yan Zhao wrote: hi All, could we decide that sysfs is the interface that every VFIO vendor driver needs to provide in order to support vfio live migration, otherwise the userspace management tool would not list the device into the compatible list? if that's true, let's move to the standardizing of the sysfs interface. (1) content common part: (must) - software_version: (in major.minor.bugfix scheme) This can not work for devices whose features can be negotiated/advertised independently. (E.g virtio devices) sorry, I don't understand here, why virtio devices need to use vfio interface? I don't see any reason that virtio devices can't be used by VFIO. Do you? Actually, virtio devices have been used by VFIO for many years: - passthrough a hardware virtio devices to userspace(VM) drivers - using virtio PMD inside guest I think this thread is discussing about vfio related devices. - device_api: vfio-pci or vfio-ccw ... - type: mdev type for mdev device or a signature for physical device which is a counterpart for mdev type. device api specific part: (must) - pci id: pci id of mdev parent device or pci id of physical pci device (device_api is vfio-pci)API here. So this assumes a PCI device which is probably not true. for device_api of vfio-pci, why it's not true? for vfio-ccw, it's subchannel_type. Ok but having two different attributes for the same file is not good idea. How mgmt know there will be a 3rd type? - subchannel_type (device_api is vfio-ccw) vendor driver specific part: (optional) - aggregator - chpid_type - remote_url For "remote_url", just wonder if it's better to integrate or reuse the existing NVME management interface instead of duplicating it here. Otherwise it could be a burden for mgmt to learn. E.g vendor A may use "remote_url" but vendor B may use a different attribute. it's vendor driver specific.
vendor specific attributes are inevitable, and that's why we are discussing here of a way to standardizing of it. Well, then you will end up with a very long list to discuss. E.g for networking devices, you will have "mac", "v(x)lan" and a lot of other. Note that "remote_url" is not vendor specific but NVME (class/subsystem) specific. The point is that if vendor/class specific part is unavoidable, why not making all of the attributes vendor specific? our goal is that mgmt can use it without understanding the meaning of vendor specific attributes. I'm not sure this is the correct design of uAPI. Is there something similar in the existing uAPIs? And it might be hard to work for virtio devices. NOTE: vendors are free to add attributes in this part with a restriction that this attribute is able to be configured with the same name in sysfs too. e.g. Sysfs works well for common attributes belongs to a class, but I'm not sure it can work well for device/vendor specific attributes. Does this mean mgmt need to iterate all the attributes in both src and dst? no. just attributes under migration directory. for aggregator, there must be a sysfs attribute in device node /sys/devices/pci:00/:00:02.0/882cc4da-dede-11e7-9180-078a62063ab1/intel_vgpu/aggregator, so that the userspace tool is able to configure the target device according to source device's aggregator attribute. (2) where and structure proposal 1: |- [path to device] |--- migration | |--- self | ||-software_version | ||-device_api | ||-type | ||-[pci_id or subchannel_type] | ||- | |--- compatible | ||-software_version | ||-device_api | ||-type | ||-[pci_id or subchannel_type] | ||- multiple compatible is allowed. attributes should be ASCII text files, preferably with only one value per file. proposal 2: use bin_attribute. |- [path to device] |--- migration | |--- self | |--- compatible so we can continue use multiline format. e.g. 
cat compatible software_version=0.1.0 device_api=vfio_pci type=i915-GVTg_V5_{val1:int:1,2,4,8} pci_id=80865963 aggregator={val1}/2 So basically two questions: - how hard to standardize sysfs API for dealing with compatibility check (to make it work for most types of devices) sorry, I just know we are in the process of standardizing of it :) It's not easy. As I said, the current design can't work for virtio devices and it's not hard to find other examples. I remember some Intel devices have bitmask based capability registers. - how hard for the mgmt to learn with a vendor specific attributes (vs existing management API) what is existing management API? It depends on the type of devices. E.g for NVME, we've already had one (/sys/kernel/config/nvme)? Thanks
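Jason's virtio example (device 1 supports A, B, C; device 2 supports B, C, D) and his remark about bitmask-based capability registers can be made concrete with a feature-mask check. The sketch assumes the rule is "every feature the source offers must also be offered by the destination". The feature names are placeholders, not real virtio feature bits. Because neither mask contains the other, no single version number can order these two devices.

```python
# Placeholder feature bits standing in for A, B, C, D from the example.
FEATURE_A = 1 << 0
FEATURE_B = 1 << 1
FEATURE_C = 1 << 2
FEATURE_D = 1 << 3


def missing_features(src_mask, dst_mask):
    """Bits set in the source mask but absent from the destination mask."""
    return src_mask & ~dst_mask


def features_compatible(src_mask, dst_mask):
    """Assumed rule: the destination must offer every feature the source offers."""
    return missing_features(src_mask, dst_mask) == 0
```

With device 1 = A|B|C and device 2 = B|C|D, migration fails in both directions: A is missing one way and D the other. A device offering only B|C could migrate to either.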
Re: [ovirt-devel] Re: device compatibility interface for live migration with assigned devices
On Wed, Aug 19, 2020 at 02:57:34PM +0800, Jason Wang wrote: > > On 2020/8/19 11:30 AM, Yan Zhao wrote: > > hi All, > > could we decide that sysfs is the interface that every VFIO vendor driver > > needs to provide in order to support vfio live migration, otherwise the > > userspace management tool would not list the device into the compatible > > list? > > > > if that's true, let's move to the standardizing of the sysfs interface. > > (1) content > > common part: (must) > > - software_version: (in major.minor.bugfix scheme) > > > This can not work for devices whose features can be negotiated/advertised > independently. (E.g virtio devices) > sorry, I don't understand here, why virtio devices need to use vfio interface? I think this thread is discussing about vfio related devices. > > > - device_api: vfio-pci or vfio-ccw ... > > - type: mdev type for mdev device or > > a signature for physical device which is a counterpart for > >mdev type. > > > > device api specific part: (must) > >- pci id: pci id of mdev parent device or pci id of physical pci > > device (device_api is vfio-pci)API here. > > > So this assumes a PCI device which is probably not true. > for device_api of vfio-pci, why it's not true? for vfio-ccw, it's subchannel_type. > > >- subchannel_type (device_api is vfio-ccw) > > vendor driver specific part: (optional) > >- aggregator > >- chpid_type > >- remote_url > > > For "remote_url", just wonder if it's better to integrate or reuse the > existing NVME management interface instead of duplicating it here. Otherwise > it could be a burden for mgmt to learn. E.g vendor A may use "remote_url" > but vendor B may use a different attribute. > it's vendor driver specific. vendor specific attributes are inevitable, and that's why we are discussing here of a way to standardizing of it. our goal is that mgmt can use it without understanding the meaning of vendor specific attributes.
> > > > > NOTE: vendors are free to add attributes in this part with a > > restriction that this attribute is able to be configured with the same > > name in sysfs too. e.g. > > > Sysfs works well for common attributes belongs to a class, but I'm not sure > it can work well for device/vendor specific attributes. Does this mean mgmt > need to iterate all the attributes in both src and dst? > no. just attributes under migration directory. > > > for aggregator, there must be a sysfs attribute in device node > > /sys/devices/pci:00/:00:02.0/882cc4da-dede-11e7-9180-078a62063ab1/intel_vgpu/aggregator, > > so that the userspace tool is able to configure the target device > > according to source device's aggregator attribute. > > > > > > (2) where and structure > > proposal 1: > > |- [path to device] > >|--- migration > >| |--- self > >| ||-software_version > >| ||-device_api > >| ||-type > >| ||-[pci_id or subchannel_type] > >| ||- > >| |--- compatible > >| ||-software_version > >| ||-device_api > >| ||-type > >| ||-[pci_id or subchannel_type] > >| ||- > > multiple compatible is allowed. > > attributes should be ASCII text files, preferably with only one value > > per file. > > > > > > proposal 2: use bin_attribute. > > |- [path to device] > >|--- migration > >| |--- self > >| |--- compatible > > > > so we can continue use multiline format. e.g. > > cat compatible > >software_version=0.1.0 > >device_api=vfio_pci > >type=i915-GVTg_V5_{val1:int:1,2,4,8} > >pci_id=80865963 > >aggregator={val1}/2 > > > So basically two questions: > > - how hard to standardize sysfs API for dealing with compatibility check (to > make it work for most types of devices) sorry, I just know we are in the process of standardizing of it :) > - how hard for the mgmt to learn with a vendor specific attributes (vs > existing management API) what is existing management API? Thanks
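As a rough illustration of the NOTE quoted above, the sketch below copies the source's vendor-specific attributes onto same-named files of the target device. The NOTE requires that each vendor attribute be configurable under the same sysfs name, so the tool can configure the target from the source. Three assumptions apply: the attributes are plain one-value files, the target directory is the one holding the writable attributes (for example the `intel_vgpu` group in the aggregator path), and fields without a matching writable file are skipped rather than treated as errors. Yan suggests skipping elsewhere in the thread. `configure_target` is hypothetical, not an existing mdevctl or libvirt interface.

```python
import os


def configure_target(source_attrs, target_dir):
    """Write each source attribute to a same-named writable file under target_dir.

    source_attrs: dict of attribute name -> value string (e.g. read from the
    source's migration/self directory; layout assumed from proposal 1).
    Returns (applied, skipped) lists of attribute names.
    """
    applied, skipped = [], []
    for name, value in sorted(source_attrs.items()):
        path = os.path.join(target_dir, name)
        # Ignore fields with no matching writable attribute on the target.
        if not os.path.isfile(path) or not os.access(path, os.W_OK):
            skipped.append(name)
            continue
        with open(path, "w") as f:
            f.write(value + "\n")
        applied.append(name)
    return applied, skipped
```

Returning the skipped names lets the caller decide whether an unconfigurable field (such as a parent PCI ID) should block the migration or merely be logged.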
Re: [ovirt-devel] Re: device compatibility interface for live migration with assigned devices
On 2020/8/19 11:30 AM, Yan Zhao wrote: hi All, could we decide that sysfs is the interface that every VFIO vendor driver needs to provide in order to support vfio live migration, otherwise the userspace management tool would not list the device into the compatible list? if that's true, let's move to the standardizing of the sysfs interface. (1) content common part: (must) - software_version: (in major.minor.bugfix scheme) This can not work for devices whose features can be negotiated/advertised independently. (E.g virtio devices) - device_api: vfio-pci or vfio-ccw ... - type: mdev type for mdev device or a signature for physical device which is a counterpart for mdev type. device api specific part: (must) - pci id: pci id of mdev parent device or pci id of physical pci device (device_api is vfio-pci)API here. So this assumes a PCI device which is probably not true. - subchannel_type (device_api is vfio-ccw) vendor driver specific part: (optional) - aggregator - chpid_type - remote_url For "remote_url", just wonder if it's better to integrate or reuse the existing NVME management interface instead of duplicating it here. Otherwise it could be a burden for mgmt to learn. E.g vendor A may use "remote_url" but vendor B may use a different attribute. NOTE: vendors are free to add attributes in this part with a restriction that this attribute is able to be configured with the same name in sysfs too. e.g. Sysfs works well for common attributes belongs to a class, but I'm not sure it can work well for device/vendor specific attributes. Does this mean mgmt need to iterate all the attributes in both src and dst? for aggregator, there must be a sysfs attribute in device node /sys/devices/pci:00/:00:02.0/882cc4da-dede-11e7-9180-078a62063ab1/intel_vgpu/aggregator, so that the userspace tool is able to configure the target device according to source device's aggregator attribute.
(2) where and structure proposal 1: |- [path to device] |--- migration | |--- self | ||-software_version | ||-device_api | ||-type | ||-[pci_id or subchannel_type] | ||- | |--- compatible | ||-software_version | ||-device_api | ||-type | ||-[pci_id or subchannel_type] | ||- multiple compatible is allowed. attributes should be ASCII text files, preferably with only one value per file. proposal 2: use bin_attribute. |- [path to device] |--- migration | |--- self | |--- compatible so we can continue use multiline format. e.g. cat compatible software_version=0.1.0 device_api=vfio_pci type=i915-GVTg_V5_{val1:int:1,2,4,8} pci_id=80865963 aggregator={val1}/2 So basically two questions: - how hard to standardize sysfs API for dealing with compatibility check (to make it work for most types of devices) - how hard for the mgmt to learn with a vendor specific attributes (vs existing management API) Thanks Thanks Yan
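To make proposal 1 concrete, the sketch below reads a one-value-per-file attribute directory and compares the source's `migration/self` against the destination's `migration/compatible`. Per Yan's answer, it looks only under the migration directory, not at every attribute of the device. It assumes plain ASCII files and exact string matching for every key the destination lists. The thread leaves open how multiple `compatible` entries are laid out, so only a single one is handled. Both function names are hypothetical.

```python
import os


def read_attr_dir(path):
    """Read each regular file in `path` as one ASCII value (proposal 1 layout)."""
    attrs = {}
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if os.path.isfile(full):
            with open(full, encoding="ascii") as f:
                attrs[name] = f.read().strip()
    return attrs


def mismatched_attrs(src_self, dst_compatible):
    """Keys listed by the destination whose value the source does not match exactly."""
    return sorted(k for k, v in dst_compatible.items() if src_self.get(k) != v)
```

This also illustrates Jason's cost concern: each attribute costs an open/read pair, whereas proposal 2's single bin_attribute needs one read but requires a parser.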
RE: device compatibility interface for live migration with assigned devices
> From: Jason Wang > Sent: Wednesday, August 19, 2020 12:19 PM > > > On 2020/8/19 1:26 PM, Parav Pandit wrote: > > > >> From: Jason Wang > >> Sent: Wednesday, August 19, 2020 8:16 AM > > > >> On 2020/8/18 5:32 PM, Parav Pandit wrote: > >>> Hi Jason, > >>> > >>> From: Jason Wang > >>> Sent: Tuesday, August 18, 2020 2:32 PM > >>> > >>> > >>> On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote: > >>> On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote: > >>> On 2020/8/14 1:16 PM, Yan Zhao wrote: > >>> On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: > >>> On 2020/8/10 3:46 PM, Yan Zhao wrote: > >>> driver is it handled by? > >>> It looks that the devlink is for network device specific, and in > >>> devlink.h, it says include/uapi/linux/devlink.h - Network physical > >>> device Netlink interface, Actually not, I think there used to have > >>> some discussion last year and the conclusion is to remove this > >>> comment. > >>> > >>> [...] > >>> > Yes, but it could be hard. E.g vDPA will chose to use devlink > (there's a long > >> debate on sysfs vs devlink). So if we go with sysfs, at least two > >> APIs needs to be supported ... > >>> We had internal discussion and proposal on this topic. > >>> I wanted Eli Cohen to be back from vacation on Wed 8/19, but since > >>> this is > >> active discussion right now, I will share the thoughts anyway. > >>> Here are the initial round of thoughts and proposal. > >>> > >>> User requirements: > >>> --- > >>> 1. User might want to create one or more vdpa devices per PCI PF/VF/SF. > >>> 2. User might want to create one or more vdpa devices of type > >>> net/blk or > >> other type. > >>> 3. User needs to look and dump at the health of the queues for debug > purpose. > >>> 4. During vdpa net device creation time, user may have to provide a > >>> MAC > >> address and/or VLAN. > >>> 5. User should be able to set/query some of the attributes for > >>> debug/compatibility check 6.
When user wants to create vdpa device, > >>> it needs > >> to know which device supports creation. > >>> 7. User should be able to see the queue statistics of doorbells, > >>> wqes etc regardless of class type > >> > >> Note that wqes is probably not something common in all of the vendors. > > Yes. I virtq descriptors stats is better to monitor the virtqueues. > > > >> > >>> To address above requirements, there is a need of vendor agnostic > >>> tool, so > >> that user can create/config/delete vdpa device(s) regardless of the vendor. > >>> Hence, > >>> We should have a tool that lets user do it. > >>> > >>> Examples: > >>> - > >>> (a) List parent devices which supports creating vdpa devices. > >>> It also shows which class types supported by this parent device. > >>> In below command two parent devices support vdpa device creation. > >>> First is PCI VF whose bdf is 03.00:5. > >>> Second is PCI SF whose name is mlx5_sf.1 > >>> > >>> $ vdpa list pd > >> > >> What did "pd" mean? > >> > > Parent device which support creation of one or more vdpa devices. > > In a system there can be multiple parent devices which may be support vdpa > creation. > > User should be able to know which devices support it, and when user creates > > a > vdpa device, it tells which parent device to use for creation as done in below > vdpa dev add example. > >>> pci/:03.00:5 > >>> class_supports > >>> net vdpa > >>> virtbus/mlx5_sf.1 > >> > >> So creating mlx5_sf.1 is the charge of devlink? > >> > > Yes. > > But here vdpa tool is working at the parent device identifier {bus+name} > instead of devlink identifier. > > > > > >>> class_supports > >>> net > >>> > >>> (b) Now add a vdpa device and show the device. > >>> $ vdpa dev add pci/:03.00:5 type net > >> > >> So if you want to create devices types other than vdpa on > >> pci/:03.00:5 it needs some synchronization with devlink? 
> > Please refer to FAQ-1, a new tool is not linked to devlink because vdpa > > will > evolve with time and devlink will fall short. > > So no, it doesn't need any synchronization with devlink. > > As long as parent device exist, user can create it. > > All synchronization will be within drivers/vdpa/vdpa.c This user > > interface is exposed via new netlink family by doing genl_register_family() > > with > new name "vdpa" in drivers/vdpa/vdpa.c. > > > Just to make sure I understand here. > > Consider we had virtbus/mlx5_sf.1. Process A want to create a vDPA instance on > top of it but Process B want to create a IB instance. Then I think some > synchronization is needed at at least parent device level? Likely but rdma device will be created either through $ rdma link add command. Or auto created by driver because there is only one without much configuration. While vdpa device(s) for virtbus/mlx5_sf.1 will be created through vdpa subsystem. And vdpa's synchronization will be contained within drivers/vdpa/vdpa.c > > > > > >> > >>> $ vdpa dev show > >>>
Re: device compatibility interface for live migration with assigned devices
On 2020/8/19 1:26 PM, Parav Pandit wrote: From: Jason Wang Sent: Wednesday, August 19, 2020 8:16 AM On 2020/8/18 5:32 PM, Parav Pandit wrote: Hi Jason, From: Jason Wang Sent: Tuesday, August 18, 2020 2:32 PM On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote: On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote: On 2020/8/14 1:16 PM, Yan Zhao wrote: On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: On 2020/8/10 3:46 PM, Yan Zhao wrote: driver is it handled by? It looks like devlink is network device specific, and in devlink.h, it says include/uapi/linux/devlink.h - Network physical device Netlink interface, Actually not, I think there was some discussion last year and the conclusion was to remove this comment. [...] Yes, but it could be hard. E.g. vDPA will choose to use devlink (there's a long debate on sysfs vs devlink). So if we go with sysfs, at least two APIs need to be supported ... We had an internal discussion and proposal on this topic. I wanted Eli Cohen to be back from vacation on Wed 8/19, but since this is an active discussion right now, I will share the thoughts anyway. Here is the initial round of thoughts and the proposal. User requirements: --- 1. User might want to create one or more vdpa devices per PCI PF/VF/SF. 2. User might want to create one or more vdpa devices of type net/blk or other type. 3. User needs to look at and dump the health of the queues for debug purposes. 4. During vdpa net device creation time, user may have to provide a MAC address and/or VLAN. 5. User should be able to set/query some of the attributes for debug/compatibility checks. 6. When user wants to create a vdpa device, it needs to know which device supports creation. 7. User should be able to see the queue statistics of doorbells, wqes etc. regardless of class type. Note that wqes is probably not something common in all of the vendors. Yes. Virtq descriptor stats are better for monitoring the virtqueues.
To address above requirements, there is a need for a vendor-agnostic tool, so that user can create/config/delete vdpa device(s) regardless of the vendor. Hence, we should have a tool that lets user do it. Examples: - (a) List parent devices which support creating vdpa devices. It also shows which class types are supported by this parent device. In the command below, two parent devices support vdpa device creation. First is PCI VF whose bdf is 03.00:5. Second is PCI SF whose name is mlx5_sf.1 $ vdpa list pd What did "pd" mean? Parent device which supports creation of one or more vdpa devices. In a system there can be multiple parent devices which may support vdpa creation. User should be able to know which devices support it, and when user creates a vdpa device, it tells which parent device to use for creation as done in the vdpa dev add example below. pci/:03.00:5 class_supports net vdpa virtbus/mlx5_sf.1 So creating mlx5_sf.1 is the charge of devlink? Yes. But here vdpa tool is working at the parent device identifier {bus+name} instead of devlink identifier. class_supports net (b) Now add a vdpa device and show the device. $ vdpa dev add pci/:03.00:5 type net So if you want to create device types other than vdpa on pci/:03.00:5 it needs some synchronization with devlink? Please refer to FAQ-1: the new tool is not linked to devlink because vdpa will evolve with time and devlink will fall short. So no, it doesn't need any synchronization with devlink. As long as the parent device exists, user can create it. All synchronization will be within drivers/vdpa/vdpa.c. This user interface is exposed via a new netlink family by doing genl_register_family() with the new name "vdpa" in drivers/vdpa/vdpa.c. Just to make sure I understand here. Consider we had virtbus/mlx5_sf.1. Process A wants to create a vDPA instance on top of it but Process B wants to create an IB instance. Then I think some synchronization is needed at least at the parent device level?
$ vdpa dev show vdpa0@pci/:03.00:5 type net state inactive maxqueues 8 curqueues 4 (c) vdpa dev show features vdpa0 iommu platform version 1 (d) dump vdpa statistics $ vdpa dev stats show vdpa0 kickdoorbells 10 wqes 100 (e) Now delete a vdpa device previously created. $ vdpa dev del vdpa0 Design overview: --- 1. The above example tool runs over a netlink socket interface. 2. This enables the kernel to return meaningful error strings in addition to error codes, so that the user can be better informed. Often this is missing in ioctl()/configfs/sysfs interfaces. 3. This tool over netlink makes syzkaller tests more usable, as in other subsystems, to keep the kernel robust. 4. This provides a vendor-agnostic view of all vdpa-capable parent and vdpa devices. 5. Each driver which supports vdpa device creation registers the parent device along with supported classes. FAQs: 1. Why not use devlink? Ans: Because as the vdpa ecosystem grows,
RE: device compatibility interface for live migration with assigned devices
> From: Yan Zhao > Sent: Wednesday, August 19, 2020 9:01 AM > On Tue, Aug 18, 2020 at 09:39:24AM +, Parav Pandit wrote: > > Please refer to my previous email which has more examples and details. > hi Parav, > the example is based on a new vdpa tool running over netlink, not based on > devlink, right? Right. > For vfio migration compatibility, we have to deal with both mdev and physical > pci devices, I don't think it's a good idea to write a new tool for it, given > we are > able to retrieve the same info from sysfs and there's already an mdevctl from mdev attributes should be visible in the mdev's sysfs tree. I do not propose to write a new mdev tool over netlink. I am sorry if I implied that with my suggestion of the vdpa tool. If the underlying device is vdpa, mdev might be able to understand the vdpa device, query it, and populate the mdev sysfs tree. The vdpa tool I propose is usable even without mdevs. The vdpa tool's role is to create one or more vdpa devices and place them on the "vdpa" bus, which is the lowest layer here. Additionally, this tool lets the user query virtqueue stats and db stats. When a user creates a vdpa net device, the user may need to configure features of the vdpa device such as VIRTIO_NET_F_MAC and a default VIRTIO_NET_F_MTU. These are vdpa-level features and attributes. Mdev is a layer above it. > Alex > (https://github.com/mdevctl/mdevctl). > Sorry for the link mangling above. Our mail server is still transitioning due to a company acquisition. I am less familiar with the points below, so I won't comment on them.
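Parav's point that VIRTIO_NET_F_MAC and VIRTIO_NET_F_MTU are vdpa-level features, and the "iommu platform" / "version 1" lines printed by `vdpa dev show features` in the proposal, both come down to virtio feature bits. The minimal Python sketch below encodes and decodes a 64-bit feature word; it is not part of any proposed tool. The bit numbers follow the virtio 1.1 specification (VIRTIO_F_ACCESS_PLATFORM was called VIRTIO_F_IOMMU_PLATFORM in earlier versions), while the helper names `encode_features` and `decode_features` are hypothetical.

```python
# Bit positions from the virtio 1.1 specification.
# VIRTIO_F_ACCESS_PLATFORM was called VIRTIO_F_IOMMU_PLATFORM in earlier
# versions of the spec.
VIRTIO_FEATURE_BITS = {
    3: "VIRTIO_NET_F_MTU",
    5: "VIRTIO_NET_F_MAC",
    32: "VIRTIO_F_VERSION_1",
    33: "VIRTIO_F_ACCESS_PLATFORM",
}
_BIT_BY_NAME = {name: bit for bit, name in VIRTIO_FEATURE_BITS.items()}


def encode_features(names):
    """Build a feature word from feature names (hypothetical helper)."""
    word = 0
    for name in names:
        word |= 1 << _BIT_BY_NAME[name]  # KeyError for names not in the table
    return word


def decode_features(word):
    """Split a feature word into known feature names and unknown bit numbers."""
    known, unknown = [], []
    bit = 0
    while word >> bit:  # stop once no higher bits remain set
        if (word >> bit) & 1:
            if bit in VIRTIO_FEATURE_BITS:
                known.append(VIRTIO_FEATURE_BITS[bit])
            else:
                unknown.append(bit)
        bit += 1
    return known, unknown


if __name__ == "__main__":
    word = encode_features(["VIRTIO_NET_F_MAC", "VIRTIO_F_VERSION_1"])
    print(hex(word), decode_features(word))
```

Unknown bits are returned separately rather than dropped, so a caller comparing two devices can still see bits this table does not name.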
> hi All,
> could we decide that sysfs is the interface that every VFIO vendor driver
> needs to provide in order to support vfio live migration, otherwise the
> userspace management tool would not list the device in the compatible list?
>
> if that's true, let's move on to standardizing the sysfs interface.
> (1) content
> common part: (must)
>    - software_version: (in major.minor.bugfix scheme)
>    - device_api: vfio-pci or vfio-ccw ...
>    - type: mdev type for mdev device or
>            a signature for physical device which is a counterpart for
>            mdev type.
>
> device api specific part: (must)
>    - pci id: pci id of mdev parent device or pci id of physical pci
>      device (device_api is vfio-pci)
>    - subchannel_type (device_api is vfio-ccw)
>
> vendor driver specific part: (optional)
>    - aggregator
>    - chpid_type
>    - remote_url
>
> NOTE: vendors are free to add attributes in this part, with the restriction
> that each such attribute can also be configured under the same name in sysfs.
> e.g. for aggregator, there must be a sysfs attribute in the device node
> /sys/devices/pci:00/:00:02.0/882cc4da-dede-11e7-9180-078a62063ab1/intel_vgpu/aggregator,
> so that the userspace tool is able to configure the target device according
> to the source device's aggregator attribute.
>
>
> (2) where and structure
> proposal 1:
> |- [path to device]
>   |--- migration
>   |     |--- self
>   |     |    |-software_version
>   |     |    |-device_api
>   |     |    |-type
>   |     |    |-[pci_id or subchannel_type]
>   |     |    |-
>   |     |--- compatible
>   |     |    |-software_version
>   |     |    |-device_api
>   |     |    |-type
>   |     |    |-[pci_id or subchannel_type]
>   |     |    |-
> multiple compatible is allowed.
> attributes should be ASCII text files, preferably with only one value per
> file.
>
>
> proposal 2: use bin_attribute.
> |- [path to device]
>   |--- migration
>   |     |--- self
>   |     |--- compatible
>
> so we can continue to use the multiline format. e.g.
> cat compatible
>   software_version=0.1.0
>   device_api=vfio_pci
>   type=i915-GVTg_V5_{val1:int:1,2,4,8}
>   pci_id=80865963
>   aggregator={val1}/2
>
> Thanks
> Yan
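One way userspace might read proposal 2's multiline format is sketched below in Python. This is an illustration under assumptions, not Yan's implementation: it assumes `{name:int:v1,v2,...}` is a placeholder matching one of the listed integers and binding it to `name`, it checks only the keys present in the source device's `self` attributes, and it leaves expressions such as `aggregator={val1}/2` unevaluated, returning the bound variables so a caller can interpret them. All function names are hypothetical.

```python
import re

# A placeholder such as {val1:int:1,2,4,8}: variable name, then allowed integers.
PLACEHOLDER = re.compile(r"\{(\w+):int:([0-9,]+)\}")


def parse_attrs(text):
    """Parse 'key=value' lines into a dict, skipping blank lines."""
    attrs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"malformed line: {line!r}")
        attrs[key.strip()] = value.strip()
    return attrs


def match_value(template, actual, bindings):
    """Match a compatible-side value against a source value.

    Literal text must match exactly; each placeholder must match one of its
    listed integers, which is then recorded in `bindings`.
    """
    names, pattern, pos = [], "", 0
    for m in PLACEHOLDER.finditer(template):
        pattern += re.escape(template[pos:m.start()])
        allowed = [v for v in m.group(2).split(",") if v]
        pattern += "(" + "|".join(re.escape(v) for v in allowed) + ")"
        names.append(m.group(1))
        pos = m.end()
    pattern += re.escape(template[pos:])
    found = re.fullmatch(pattern, actual)
    if found is None:
        return False
    for name, value in zip(names, found.groups()):
        if bindings.setdefault(name, int(value)) != int(value):
            return False  # same variable bound to two different values
    return True


def is_compatible(compatible, source):
    """Check every attribute of the source's `self` against one `compatible` entry.

    Returns (ok, bindings); keys present only in `compatible` are not checked.
    """
    bindings = {}
    for key, actual in source.items():
        template = compatible.get(key)
        if template is None or not match_value(template, actual, bindings):
            return False, {}
    return True, bindings
```

With the example above, a source whose `type` is `i915-GVTg_V5_4` matches and binds `val1` to 4, while `i915-GVTg_V5_3` is rejected because 3 is not in the allowed list.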
RE: device compatibility interface for live migration with assigned devices
> From: Jason Wang > Sent: Wednesday, August 19, 2020 8:16 AM > On 2020/8/18 5:32 PM, Parav Pandit wrote: > > Hi Jason, > > > > From: Jason Wang > > Sent: Tuesday, August 18, 2020 2:32 PM > > > > > > On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote: > > On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote: > > On 2020/8/14 1:16 PM, Yan Zhao wrote: > > On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: > > On 2020/8/10 3:46 PM, Yan Zhao wrote: > > driver is it handled by? > > It looks like devlink is network device specific, and in > > devlink.h, it says include/uapi/linux/devlink.h - Network physical > > device Netlink interface, Actually not, I think there was > > some discussion last year and the conclusion was to remove this > > comment. > > > > [...] > > > >> Yes, but it could be hard. E.g. vDPA will choose to use devlink (there's a > >> long > debate on sysfs vs devlink). So if we go with sysfs, at least two APIs need > to be > supported ... > > We had an internal discussion and proposal on this topic. > > I wanted Eli Cohen to be back from vacation on Wed 8/19, but since this is > an active discussion right now, I will share the thoughts anyway. > > > > Here is the initial round of thoughts and the proposal. > > > > User requirements: > > --- > > 1. User might want to create one or more vdpa devices per PCI PF/VF/SF. > > 2. User might want to create one or more vdpa devices of type net/blk or > other type. > > 3. User needs to look at and dump the health of the queues for debug > > purposes. > > 4. During vdpa net device creation time, user may have to provide a MAC > address and/or VLAN. > > 5. User should be able to set/query some of the attributes for > > debug/compatibility checks. 6. When user wants to create a vdpa device, it needs > to know which device supports creation. > > 7.
User should be able to see the queue statistics of doorbells, wqes > > etc. regardless of class type. > > > Note that wqes is probably not something common in all of the vendors. Yes. Virtq descriptor stats are better for monitoring the virtqueues. > > > > > > To address above requirements, there is a need for a vendor-agnostic tool, so > that user can create/config/delete vdpa device(s) regardless of the vendor. > > > > Hence, > > we should have a tool that lets user do it. > > > > Examples: > > - > > (a) List parent devices which support creating vdpa devices. > > It also shows which class types are supported by this parent device. > > In the command below, two parent devices support vdpa device creation. > > First is PCI VF whose bdf is 03.00:5. > > Second is PCI SF whose name is mlx5_sf.1 > > > > $ vdpa list pd > > > What did "pd" mean? > Parent device which supports creation of one or more vdpa devices. In a system there can be multiple parent devices which may support vdpa creation. User should be able to know which devices support it, and when user creates a vdpa device, it tells which parent device to use for creation as done in the vdpa dev add example below. > > > pci/:03.00:5 > >class_supports > > net vdpa > > virtbus/mlx5_sf.1 > > > So creating mlx5_sf.1 is the charge of devlink? > Yes. But here vdpa tool is working at the parent device identifier {bus+name} instead of devlink identifier. > > >class_supports > > net > > > > (b) Now add a vdpa device and show the device. > > $ vdpa dev add pci/:03.00:5 type net > > > So if you want to create device types other than vdpa on > pci/:03.00:5 it needs some synchronization with devlink? Please refer to FAQ-1: the new tool is not linked to devlink because vdpa will evolve with time and devlink will fall short. So no, it doesn't need any synchronization with devlink. As long as the parent device exists, user can create it.
All synchronization will be within drivers/vdpa/vdpa.c. This user interface is exposed via a new netlink family by doing genl_register_family() with the new name "vdpa" in drivers/vdpa/vdpa.c. > > > > $ vdpa dev show > > vdpa0@pci/:03.00:5 type net state inactive maxqueues 8 curqueues 4 > > > > (c) vdpa dev show features vdpa0 > > iommu platform > > version 1 > > > > (d) dump vdpa statistics > > $ vdpa dev stats show vdpa0 > > kickdoorbells 10 > > wqes 100 > > > > (e) Now delete a vdpa device previously created. > > $ vdpa dev del vdpa0 > > > > Design overview: > > --- > > 1. The above example tool runs over a netlink socket interface. > > 2. This enables the kernel to return meaningful error strings in addition to > > error codes, so > that the user can be better informed. > > Often this is missing in ioctl()/configfs/sysfs interfaces. > > 3. This tool over netlink makes syzkaller tests more usable, as in > > other > subsystems, to keep the kernel robust. > > 4. This provides a vendor-agnostic view of all vdpa-capable parent and vdpa > devices. > > > > 5. Each driver which supports vdpa device creation registers the parent > > device > along with supported
Re: device compatibility interface for live migration with assigned devices
On Tue, Aug 18, 2020 at 09:39:24AM +, Parav Pandit wrote: > Hi Cornelia, > > > From: Cornelia Huck > > Sent: Tuesday, August 18, 2020 3:07 PM > > To: Daniel P. Berrangé > > Cc: Jason Wang ; Yan Zhao > > ; k...@vger.kernel.org; libvir-l...@redhat.com; > > qemu-devel@nongnu.org; Kirti Wankhede ; > > eau...@redhat.com; xin-ran.w...@intel.com; cor...@lwn.net; openstack- > > disc...@lists.openstack.org; shaohe.f...@intel.com; kevin.t...@intel.com; > > Parav Pandit ; jian-feng.d...@intel.com; > > dgilb...@redhat.com; zhen...@linux.intel.com; hejie...@intel.com; > > bao.yum...@zte.com.cn; Alex Williamson ; > > eskul...@redhat.com; smoo...@redhat.com; intel-gvt- > > d...@lists.freedesktop.org; Jiri Pirko ; > > dinec...@redhat.com; de...@ovirt.org > > Subject: Re: device compatibility interface for live migration with assigned > > devices > > > > On Tue, 18 Aug 2020 10:16:28 +0100 > > Daniel P. Berrangé wrote: > > > > > On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote: > > > >On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote: > > > > > > > > On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote: > > > > > > > > On 2020/8/14 1:16 PM, Yan Zhao wrote: > > > > > > > > On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: > > > > > > > > On 2020/8/10 3:46 PM, Yan Zhao wrote: > > > > > > > we actually can also retrieve the same information through sysfs, > > > > e.g. > > > > > > > > |- [path to device] > > > > |--- migration > > > > | |--- self > > > > | | |---device_api > > > > || |---mdev_type > > > > || |---software_version > > > > || |---device_id > > > > || |---aggregator > > > > | |--- compatible > > > > | | |---device_api > > > > || |---mdev_type > > > > || |---software_version > > > > || |---device_id > > > > || |---aggregator > > > > > > > > > > > > Yes but: > > > > > > > > - You need one file per attribute (one syscall for one attribute) > > > > - Attribute is coupled with kobject > > > > Is that really that bad?
You have the device with an embedded kobject > > anyway, and you can just put things into an attribute group? > > > > [Also, I think that self/compatible split in the example makes things > > needlessly complex. Shouldn't semantic versioning and matching already > > cover nearly everything? I would expect very few cases that are more > > complex than that. Maybe the aggregation stuff, but I don't think we need > > that self/compatible split for that, either.] > > > > > > > > > > All of the above seems unnecessary. > > > > > > > > Another point, as we discussed in another thread, it's really hard > > > > to make sure the above API works for all types of devices and > > > > frameworks. So having a vendor specific API looks much better. > > > > > > > > From the POV of userspace mgmt apps doing device compat checking / > > > > migration, we certainly do NOT want to use different vendor > > > > specific APIs. We want to have an API that can be used / controlled in > > > > a > > standard manner across vendors. > > > > > > > >Yes, but it could be hard. E.g. vDPA will choose to use devlink > > > > (there's a > > > >long debate on sysfs vs devlink). So if we go with sysfs, at least > > > > two > > > >APIs need to be supported ... > > > > > > NB, I was not questioning devlink vs sysfs directly. If devlink is > > > related to netlink, I can't say I'm enthusiastic as IMKE sysfs is > > > easier to deal with. I don't know enough about devlink to have much of an > > opinion though. > > > The key point was that I don't want the userspace APIs we need to deal > > > with to be vendor specific. > > > > From what I've seen of devlink, it seems quite nice; but I understand why > > sysfs might be easier to deal with (especially as there's likely already a > > lot of > > code using it.) > > > > I understand that some users would like devlink because it is already widely > > used for network drivers (and some others), but I don't think the majority > >
Re: device compatibility interface for live migration with assigned devices
On 2020/8/18 5:36 PM, Cornelia Huck wrote: On Tue, 18 Aug 2020 10:16:28 +0100 Daniel P. Berrangé wrote: On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote: On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote: On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote: On 2020/8/14 1:16 PM, Yan Zhao wrote: On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: On 2020/8/10 3:46 PM, Yan Zhao wrote: we actually can also retrieve the same information through sysfs, e.g. |- [path to device] |--- migration | |--- self | | |---device_api || |---mdev_type || |---software_version || |---device_id || |---aggregator | |--- compatible | | |---device_api || |---mdev_type || |---software_version || |---device_id || |---aggregator Yes but: - You need one file per attribute (one syscall for one attribute) - Attribute is coupled with kobject Is that really that bad? You have the device with an embedded kobject anyway, and you can just put things into an attribute group? Yes, but all of this could be done via devlink (netlink) as well with low overhead. [Also, I think that self/compatible split in the example makes things needlessly complex. Shouldn't semantic versioning and matching already cover nearly everything? That's my question as well. E.g. for virtio, versioning may not even work, since some features are negotiated independently: Source features: A, B, C Dest features: A, B, C, E We just need to make sure the dest features are a superset of the source features, then we are all set. I would expect very few cases that are more complex than that. Maybe the aggregation stuff, but I don't think we need that self/compatible split for that, either.] All of the above seems unnecessary. Another point, as we discussed in another thread, it's really hard to make sure the above API works for all types of devices and frameworks. So having a vendor specific API looks much better.
From the POV of userspace mgmt apps doing device compat checking / migration, we certainly do NOT want to use different vendor specific APIs. We want to have an API that can be used / controlled in a standard manner across vendors. Yes, but it could be hard. E.g. vDPA will choose to use devlink (there's a long debate on sysfs vs devlink). So if we go with sysfs, at least two APIs need to be supported ... NB, I was not questioning devlink vs sysfs directly. If devlink is related to netlink, I can't say I'm enthusiastic as IMKE sysfs is easier to deal with. I don't know enough about devlink to have much of an opinion though. The key point was that I don't want the userspace APIs we need to deal with to be vendor specific. From what I've seen of devlink, it seems quite nice; but I understand why sysfs might be easier to deal with (especially as there's likely already a lot of code using it.) I understand that some users would like devlink because it is already widely used for network drivers (and some others), but I don't think the majority of devices used with vfio are network (although certainly a lot of them are.) Note that though devlink may be popular only for network devices, netlink is widely used by a lot of subsystems (e.g. SCSI). Thanks What I care about is that we have a *standard* userspace API for performing device compatibility checking / state migration, for use by QEMU/libvirt/ OpenStack, such that we can write code without countless vendor specific code paths. If there is vendor specific stuff on the side, that's fine as we can ignore that, but the core functionality for device compat / migration needs to be standardized. To summarize: - choose one of sysfs or devlink - have a common interface, with a standardized way to add vendor-specific attributes ?
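Jason's virtio example (source features A, B, C; destination features A, B, C, E) reduces to a subset test. Below is a minimal Python sketch with hypothetical function names, assuming the first list holds the features the source actually uses and the second holds what the destination can offer; the bitmask variant applies the same rule to virtio-style feature words.

```python
def missing_features(src_features, dst_features):
    """Features the source relies on that the destination cannot offer."""
    return sorted(set(src_features) - set(dst_features))


def features_compatible(src_features, dst_features):
    """Migration is allowed when the destination's features are a superset."""
    return not missing_features(src_features, dst_features)


def mask_compatible(src_mask, dst_mask):
    """Same rule on feature bitmasks: no bit set in src may be clear in dst."""
    return (src_mask & ~dst_mask) == 0


if __name__ == "__main__":
    print(features_compatible(["A", "B", "C"], ["A", "B", "C", "E"]))  # True
    print(missing_features(["A", "B", "C", "E"], ["A", "B", "C"]))     # ['E']
```

The check is directional: migrating back from a destination where E is in use to a source that lacks E fails, which a plain version comparison would not necessarily reveal.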
Re: device compatibility interface for live migration with assigned devices
On 2020/8/18 5:32 PM, Parav Pandit wrote: Hi Jason, From: Jason Wang Sent: Tuesday, August 18, 2020 2:32 PM On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote: On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote: On 2020/8/14 1:16 PM, Yan Zhao wrote: On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: On 2020/8/10 3:46 PM, Yan Zhao wrote: driver is it handled by? It looks like devlink is network device specific, and in devlink.h, it says include/uapi/linux/devlink.h - Network physical device Netlink interface, Actually not, I think there was some discussion last year and the conclusion was to remove this comment. [...] Yes, but it could be hard. E.g. vDPA will choose to use devlink (there's a long debate on sysfs vs devlink). So if we go with sysfs, at least two APIs need to be supported ... We had an internal discussion and proposal on this topic. I wanted Eli Cohen to be back from vacation on Wed 8/19, but since this is an active discussion right now, I will share the thoughts anyway. Here is the initial round of thoughts and the proposal. User requirements: --- 1. User might want to create one or more vdpa devices per PCI PF/VF/SF. 2. User might want to create one or more vdpa devices of type net/blk or other type. 3. User needs to look at and dump the health of the queues for debug purposes. 4. During vdpa net device creation time, user may have to provide a MAC address and/or VLAN. 5. User should be able to set/query some of the attributes for debug/compatibility checks. 6. When user wants to create a vdpa device, it needs to know which device supports creation. 7. User should be able to see the queue statistics of doorbells, wqes etc. regardless of class type. Note that wqes is probably not something common in all of the vendors. To address above requirements, there is a need for a vendor-agnostic tool, so that user can create/config/delete vdpa device(s) regardless of the vendor. Hence, we should have a tool that lets user do it.
Examples: - (a) List parent devices which support creating vdpa devices. It also shows which class types are supported by this parent device. In the command below, two parent devices support vdpa device creation. First is PCI VF whose bdf is 03.00:5. Second is PCI SF whose name is mlx5_sf.1 $ vdpa list pd What did "pd" mean? pci/:03.00:5 class_supports net vdpa virtbus/mlx5_sf.1 So creating mlx5_sf.1 is the charge of devlink? class_supports net (b) Now add a vdpa device and show the device. $ vdpa dev add pci/:03.00:5 type net So if you want to create device types other than vdpa on pci/:03.00:5 it needs some synchronization with devlink? $ vdpa dev show vdpa0@pci/:03.00:5 type net state inactive maxqueues 8 curqueues 4 (c) vdpa dev show features vdpa0 iommu platform version 1 (d) dump vdpa statistics $ vdpa dev stats show vdpa0 kickdoorbells 10 wqes 100 (e) Now delete a vdpa device previously created. $ vdpa dev del vdpa0 Design overview: --- 1. The above example tool runs over a netlink socket interface. 2. This enables the kernel to return meaningful error strings in addition to error codes, so that the user can be better informed. Often this is missing in ioctl()/configfs/sysfs interfaces. 3. This tool over netlink makes syzkaller tests more usable, as in other subsystems, to keep the kernel robust. 4. This provides a vendor-agnostic view of all vdpa-capable parent and vdpa devices. 5. Each driver which supports vdpa device creation registers the parent device along with supported classes. FAQs: 1. Why not use devlink? Ans: Because as the vdpa ecosystem grows, devlink will fall short of extending vdpa-specific params, attributes, stats. This should be fine, but the difference between a vdpa netlink family and a vdpa object in devlink is still not clear to me. Thanks 2. Why not use sysfs? Ans: (a) Because the syzkaller infrastructure can run well over netlink sockets, as it does for several subsystems. (b) sysfs lacks the ability to return error messages; doing it via the kernel log just doesn't work.
(c) Why not use some ioctl()? It would reinvent the wheel of netlink, which has TLV formats for several attributes. 3. Why not configfs? It has the same limitations as sysfs. Low-level design and driver APIs: Will post once we discuss this further.
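The FAQ answer that an ioctl() would "reinvent the wheel of netlink" refers to netlink's attribute TLV encoding. As a rough illustration, the Python sketch below packs and walks attributes in the layout of the kernel's `struct nlattr`: a 16-bit `nla_len` that counts the 4-byte header plus payload, a 16-bit `nla_type`, and padding to a 4-byte boundary. It does not open a netlink socket, and the attribute numbers `ATTR_DEV_NAME` and `ATTR_MAX_VQS` are hypothetical, not values from any vdpa netlink family.

```python
import struct

NLA_ALIGNTO = 4
NLA_HDRLEN = 4  # struct nlattr { __u16 nla_len; __u16 nla_type; }

# Hypothetical attribute numbers, for illustration only.
ATTR_DEV_NAME = 1
ATTR_MAX_VQS = 2


def nla_align(length):
    """Round a length up to the 4-byte netlink attribute alignment."""
    return (length + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1)


def pack_attr(attr_type, payload):
    """Encode one attribute; nla_len covers header + payload but not padding."""
    nla_len = NLA_HDRLEN + len(payload)
    padding = b"\0" * (nla_align(nla_len) - nla_len)
    # Netlink uses host byte order, hence "=" (native order, standard sizes).
    return struct.pack("=HH", nla_len, attr_type) + payload + padding


def unpack_attrs(buf):
    """Walk a buffer of consecutive attributes, returning (type, payload) pairs."""
    attrs, off = [], 0
    while off + NLA_HDRLEN <= len(buf):
        nla_len, nla_type = struct.unpack_from("=HH", buf, off)
        if nla_len < NLA_HDRLEN or off + nla_len > len(buf):
            raise ValueError(f"bad attribute length {nla_len} at offset {off}")
        attrs.append((nla_type, buf[off + NLA_HDRLEN:off + nla_len]))
        off += nla_align(nla_len)
    return attrs


if __name__ == "__main__":
    msg = pack_attr(ATTR_DEV_NAME, b"vdpa0\0") + pack_attr(ATTR_MAX_VQS, struct.pack("=I", 8))
    print(len(msg), unpack_attrs(msg))
```

Because each attribute carries its own type and length, a receiver can skip attributes it does not understand, which is the extensibility argument made for netlink over a fixed ioctl() structure.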
Re: device compatibility interface for live migration with assigned devices
On 2020/8/18 5:16 PM, Daniel P. Berrangé wrote: Your mail came through as HTML-only so all the quoting and attribution is mangled / lost now :-( My bad, sorry. On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote: On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote: On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote: On 2020/8/14 1:16 PM, Yan Zhao wrote: On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: On 2020/8/10 3:46 PM, Yan Zhao wrote: we actually can also retrieve the same information through sysfs, e.g. |- [path to device] |--- migration | |--- self | | |---device_api || |---mdev_type || |---software_version || |---device_id || |---aggregator | |--- compatible | | |---device_api || |---mdev_type || |---software_version || |---device_id || |---aggregator Yes but: - You need one file per attribute (one syscall for one attribute) - Attribute is coupled with kobject All of the above seems unnecessary. Another point, as we discussed in another thread, it's really hard to make sure the above API works for all types of devices and frameworks. So having a vendor specific API looks much better. From the POV of userspace mgmt apps doing device compat checking / migration, we certainly do NOT want to use different vendor specific APIs. We want to have an API that can be used / controlled in a standard manner across vendors. Yes, but it could be hard. E.g. vDPA will choose to use devlink (there's a long debate on sysfs vs devlink). So if we go with sysfs, at least two APIs need to be supported ... NB, I was not questioning devlink vs sysfs directly. If devlink is related to netlink, I can't say I'm enthusiastic as IMKE sysfs is easier to deal with. I don't know enough about devlink to have much of an opinion though. The key point was that I don't want the userspace APIs we need to deal with to be vendor specific.
What I care about is that we have a *standard* userspace API for performing device compatibility checking / state migration, for use by QEMU/libvirt/ OpenStack, such that we can write code without countless vendor specific code paths. If there is vendor specific stuff on the side, that's fine as we can ignore that, but the core functionality for device compat / migration needs to be standardized. Ok, I agree with you. Thanks Regards, Daniel
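Jason's objection that sysfs needs "one file per attribute (one syscall for one attribute)" is easy to see in code: reading a `migration/self` directory like the one in Yan's tree means one open and read per attribute. A minimal Python sketch, assuming each regular file holds a single ASCII value as the proposal suggests; the directory passed in would be the device's migration node, and `read_attr_dir` is a hypothetical helper, not an existing tool.

```python
import os


def read_attr_dir(path):
    """Read each regular file under `path` as one ASCII attribute value.

    Every attribute costs its own open()/read(), which is the per-attribute
    overhead mentioned in the thread.
    """
    attrs = {}
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if os.path.isfile(full):  # skip subdirectories such as a nested group
            with open(full, encoding="ascii") as f:
                attrs[name] = f.read().strip()
    return attrs
```

For a handful of attributes read once per migration decision this cost is small, which is roughly Cornelia's point that an attribute group on the existing kobject is not that bad.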
RE: device compatibility interface for live migration with assigned devices
Hi Cornelia, > From: Cornelia Huck > Sent: Tuesday, August 18, 2020 3:07 PM > To: Daniel P. Berrangé > Cc: Jason Wang ; Yan Zhao > ; k...@vger.kernel.org; libvir-l...@redhat.com; > qemu-devel@nongnu.org; Kirti Wankhede ; > eau...@redhat.com; xin-ran.w...@intel.com; cor...@lwn.net; openstack- > disc...@lists.openstack.org; shaohe.f...@intel.com; kevin.t...@intel.com; > Parav Pandit ; jian-feng.d...@intel.com; > dgilb...@redhat.com; zhen...@linux.intel.com; hejie...@intel.com; > bao.yum...@zte.com.cn; Alex Williamson ; > eskul...@redhat.com; smoo...@redhat.com; intel-gvt- > d...@lists.freedesktop.org; Jiri Pirko ; > dinec...@redhat.com; de...@ovirt.org > Subject: Re: device compatibility interface for live migration with assigned > devices > > On Tue, 18 Aug 2020 10:16:28 +0100 > Daniel P. Berrangé wrote: > > > On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote: > > >On 2020/8/18 下午4:55, Daniel P. Berrangé wrote: > > > > > > On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote: > > > > > > On 2020/8/14 下午1:16, Yan Zhao wrote: > > > > > > On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: > > > > > > On 2020/8/10 下午3:46, Yan Zhao wrote: > > > > > we actually can also retrieve the same information through sysfs, > > > .e.g > > > > > > |- [path to device] > > > |--- migration > > > | |--- self > > > | | |---device_api > > > || |---mdev_type > > > || |---software_version > > > || |---device_id > > > || |---aggregator > > > | |--- compatible > > > | | |---device_api > > > || |---mdev_type > > > || |---software_version > > > || |---device_id > > > || |---aggregator > > > > > > > > > Yes but: > > > > > > - You need one file per attribute (one syscall for one attribute) > > > - Attribute is coupled with kobject > > Is that really that bad? You have the device with an embedded kobject > anyway, and you can just put things into an attribute group? > > [Also, I think that self/compatible split in the example makes things > needlessly complex. 
Shouldn't semantic versioning and matching already > cover nearly everything? I would expect very few cases that are more > complex than that. Maybe the aggregation stuff, but I don't think we need > that self/compatible split for that, either.] > > > > > > > All of above seems unnecessary. > > > > > > Another point, as we discussed in another thread, it's really hard > > > to make sure the above API work for all types of devices and > > > frameworks. So having a vendor specific API looks much better. > > > > > > From the POV of userspace mgmt apps doing device compat checking / > > > migration, we certainly do NOT want to use different vendor > > > specific APIs. We want to have an API that can be used / controlled in a > standard manner across vendors. > > > > > >Yes, but it could be hard. E.g vDPA will chose to use devlink (there's > > > a > > >long debate on sysfs vs devlink). So if we go with sysfs, at least two > > >APIs needs to be supported ... > > > > NB, I was not questioning devlink vs sysfs directly. If devlink is > > related to netlink, I can't say I'm enthusiastic as IMKE sysfs is > > easier to deal with. I don't know enough about devlink to have much of an > opinion though. > > The key point was that I don't want the userspace APIs we need to deal > > with to be vendor specific. > > From what I've seen of devlink, it seems quite nice; but I understand why > sysfs might be easier to deal with (especially as there's likely already a > lot of > code using it.) > > I understand that some users would like devlink because it is already widely > used for network drivers (and some others), but I don't think the majority of > devices used with vfio are network (although certainly a lot of them are.) > > > > > What I care about is that we have a *standard* userspace API for > > performing device compatibility checking / state migration, for use by > > QEMU/libvirt/ OpenStack, such that we can write code without countless > > vendor specific code paths. 
> > > > If there is vendor specific stuff on the side, that's fine as we can > > ignore that, but the core functionality for device compat / migration > > needs to be standardized. > > To summarize: > - choose one of sysfs or devlink > - have a common interface, with a standardized way to add > vendor-specific attributes > ? Please refer to my previous email, which has more examples and details.
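To make the "one file per attribute" cost concrete, here is a minimal Python sketch that loads one directory of the proposed `migration/self` layout into a dict. The layout comes from the proposal quoted above. The function name is hypothetical, and the sketch does not assume that such a directory exists in any kernel. Each attribute costs its own open()/read() pair, which is the overhead Jason objects to. Cornelia's counterpoint is that an attribute group on the device's existing kobject keeps the kernel side simple.

```python
import os
from typing import Dict


def read_attribute_dir(path: str) -> Dict[str, str]:
    """Read each regular file directly under `path` as one attribute.

    Mirrors the sysfs one-value-per-file convention, so N attributes cost
    N separate open()/read() round trips. Subdirectories are skipped.
    """
    attrs: Dict[str, str] = {}
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file():
                with open(entry.path, encoding="utf-8") as f:
                    attrs[entry.name] = f.read().strip()
    return attrs


# Hypothetical location; the migration/ directory is only a proposal in this thread.
# self_attrs = read_attribute_dir("/sys/bus/mdev/devices/<uuid>/migration/self")
```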
RE: device compatibility interface for live migration with assigned devices
Hi Jason,

From: Jason Wang
Sent: Tuesday, August 18, 2020 2:32 PM

On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote: On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote: On 2020/8/14 1:16 PM, Yan Zhao wrote: On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: On 2020/8/10 3:46 PM, Yan Zhao wrote: driver is it handled by? It looks that the devlink is for network device specific, and in devlink.h, it says include/uapi/linux/devlink.h - Network physical device Netlink interface, Actually not, I think there used to have some discussion last year and the conclusion is to remove this comment. [...]

> Yes, but it could be hard. E.g vDPA will chose to use devlink (there's a long
> debate on sysfs vs devlink). So if we go with sysfs, at least two APIs needs
> to be supported ...

We had an internal discussion and a proposal on this topic. I wanted to wait for Eli Cohen to be back from vacation on Wed 8/19, but since this is an active discussion right now, I will share the thoughts anyway. Here is the initial round of thoughts and the proposal.

User requirements:
---
1. User might want to create one or more vdpa devices per PCI PF/VF/SF.
2. User might want to create one or more vdpa devices of type net/blk or other type.
3. User needs to look at and dump the health of the queues for debug purposes.
4. During vdpa net device creation time, user may have to provide a MAC address and/or VLAN.
5. User should be able to set/query some of the attributes for debug/compatibility check.
6. When user wants to create a vdpa device, it needs to know which device supports creation.
7. User should be able to see the queue statistics of doorbells, wqes etc. regardless of class type.

To address the above requirements, there is a need for a vendor-agnostic tool, so that the user can create/config/delete vdpa device(s) regardless of the vendor. Hence, we should have a tool that lets the user do it.

Examples:
---
(a) List parent devices which support creating vdpa devices. It also shows which class types are supported by each parent device. In the command below, two parent devices support vdpa device creation. The first is a PCI VF whose bdf is 03.00:5. The second is a PCI SF whose name is mlx5_sf.1.
$ vdpa list pd
pci/:03.00:5 class_supports net vdpa
virtbus/mlx5_sf.1 class_supports net

(b) Now add a vdpa device and show the device.
$ vdpa dev add pci/:03.00:5 type net
$ vdpa dev show
vdpa0@pci/:03.00:5 type net state inactive maxqueues 8 curqueues 4

(c) vdpa dev show features vdpa0
iommu platform
version 1

(d) Dump vdpa statistics.
$ vdpa dev stats show vdpa0
kickdoorbells 10
wqes 100

(e) Now delete a vdpa device previously created.
$ vdpa dev del vdpa0

Design overview:
---
1. The example tool above runs over a netlink socket interface.
2. This enables the kernel to return meaningful error strings in addition to error codes, so that the user can be better informed. This is often missing in ioctl()/configfs/sysfs interfaces.
3. Running this tool over netlink makes syzkaller tests more usable, as in other subsystems, to keep the kernel robust.
4. This provides a vendor-agnostic view of all vdpa-capable parent devices and vdpa devices.
5. Each driver which supports vdpa device creation registers the parent device along with its supported classes.

FAQs:
1. Why not use devlink?
Ans: Because as the vdpa ecosystem grows, devlink will fall short of extending vdpa-specific params, attributes and stats.
2. Why not use sysfs?
Ans: (a) Because the syzkaller infrastructure runs well over netlink sockets, as it does for several subsystems.
(b) It lacks the ability to return error messages. Doing it via the kernel log just doesn't work.
(c) Why not use some ioctl()? It would reinvent the wheel of netlink, which has TLV formats for several attributes.
3. Why not configfs?
It has the same limitations as sysfs.

Low level design and driver APIs:
Will post once we discuss this further.
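The FAQ's remark that an ioctl() would "reinvent the wheel of netlink" refers to netlink's type-length-value attribute encoding. Below is a minimal Python sketch of that encoding, following the layout of `struct nlattr`. Each attribute has a 16-bit length that includes the 4-byte header, a 16-bit type, and a payload padded to a 4-byte boundary. The sketch ignores the NLA_F_NESTED / NLA_F_NET_BYTEORDER flag bits. The attribute type numbers are made up and do not correspond to any real vdpa or devlink attribute.

```python
import struct
from typing import Iterator, Tuple

NLA_HDRLEN = 4    # sizeof(struct nlattr): u16 nla_len + u16 nla_type
NLA_ALIGNTO = 4   # attributes start on 4-byte boundaries


def nla_align(n: int) -> int:
    """Round n up to the next multiple of NLA_ALIGNTO."""
    return (n + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1)


def encode_attr(attr_type: int, payload: bytes) -> bytes:
    """Encode one attribute; nla_len covers header + payload but not padding."""
    length = NLA_HDRLEN + len(payload)
    header = struct.pack("=HH", length, attr_type)  # host byte order, as netlink uses
    return header + payload + b"\0" * (nla_align(length) - length)


def decode_attrs(buf: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (type, payload) pairs from a buffer of back-to-back attributes."""
    offset = 0
    while offset + NLA_HDRLEN <= len(buf):
        length, attr_type = struct.unpack_from("=HH", buf, offset)
        if length < NLA_HDRLEN or offset + length > len(buf):
            raise ValueError(f"malformed attribute at offset {offset}")
        yield attr_type, buf[offset + NLA_HDRLEN:offset + length]
        offset += nla_align(length)


# Hypothetical attribute numbers: 1 = device name, 2 = max queues.
msg = encode_attr(1, b"vdpa0\0") + encode_attr(2, struct.pack("=I", 8))
print(len(msg))  # prints 20: a 10-byte attribute padded to 12, plus an 8-byte one
```

Because each attribute carries its own type and length, a parser can skip types it does not know. This is what lets a netlink interface grow new attributes without breaking older userspace.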
Re: device compatibility interface for live migration with assigned devices
On Tue, 18 Aug 2020 10:24:33 +0100 Daniel P. Berrangé wrote: > On Tue, Aug 18, 2020 at 11:06:17AM +0200, Cornelia Huck wrote: > > On Tue, 18 Aug 2020 09:55:27 +0100 > > Daniel P. Berrangé wrote: > > > > > On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote: > > > > Another point, as we discussed in another thread, it's really hard to > > > > make > > > > sure the above API work for all types of devices and frameworks. So > > > > having a > > > > vendor specific API looks much better. > > > > > > From the POV of userspace mgmt apps doing device compat checking / > > > migration, > > > we certainly do NOT want to use different vendor specific APIs. We want to > > > have an API that can be used / controlled in a standard manner across > > > vendors. > > > > As we certainly will need to have different things to check for > > different device types and vendor drivers, would it still be fine to > > have differing (say) attributes, as long as they are presented (and can > > be discovered) in a standardized way? > > Yes, the control API and algorithm to deal with the problem needs to > have standardization, but the data passed in/out of the APIs can vary. > > Essentially the key is that vendors should be able to create devices > at the kernel, and those devices should "just work" with the existing > generic userspace migration / compat checking code, without needing > extra vendor specific logic to be added. > > Note, I'm not saying that the userspace decisions would be perfectly > optimal based on generic code. They might be making a simplified > decision that while functionally safe, is not the ideal solution. > Adding vendor specific code might be able to optimize the userspace > decisions, but that should be considered just optimization, not a > core must have for any opertion. Yes, that sounds reasonable.
Re: device compatibility interface for live migration with assigned devices
On Tue, 18 Aug 2020 10:16:28 +0100 Daniel P. Berrangé wrote: > On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote: > >On 2020/8/18 下午4:55, Daniel P. Berrangé wrote: > > > > On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote: > > > > On 2020/8/14 下午1:16, Yan Zhao wrote: > > > > On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: > > > > On 2020/8/10 下午3:46, Yan Zhao wrote: > > > we actually can also retrieve the same information through sysfs, .e.g > > > > |- [path to device] > > |--- migration > > | |--- self > > | | |---device_api > > || |---mdev_type > > || |---software_version > > || |---device_id > > || |---aggregator > > | |--- compatible > > | | |---device_api > > || |---mdev_type > > || |---software_version > > || |---device_id > > || |---aggregator > > > > > > Yes but: > > > > - You need one file per attribute (one syscall for one attribute) > > - Attribute is coupled with kobject Is that really that bad? You have the device with an embedded kobject anyway, and you can just put things into an attribute group? [Also, I think that self/compatible split in the example makes things needlessly complex. Shouldn't semantic versioning and matching already cover nearly everything? I would expect very few cases that are more complex than that. Maybe the aggregation stuff, but I don't think we need that self/compatible split for that, either.] > > > > All of above seems unnecessary. > > > > Another point, as we discussed in another thread, it's really hard to make > > sure the above API work for all types of devices and frameworks. So having > > a > > vendor specific API looks much better. > > > > From the POV of userspace mgmt apps doing device compat checking / > > migration, > > we certainly do NOT want to use different vendor specific APIs. We want to > > have an API that can be used / controlled in a standard manner across > > vendors. > > > >Yes, but it could be hard. 
E.g vDPA will chose to use devlink (there's a > >long debate on sysfs vs devlink). So if we go with sysfs, at least two > >APIs needs to be supported ... > > NB, I was not questioning devlink vs sysfs directly. If devlink is related > to netlink, I can't say I'm enthusiastic as IMKE sysfs is easier to deal > with. I don't know enough about devlink to have much of an opinion though. > The key point was that I don't want the userspace APIs we need to deal with > to be vendor specific. From what I've seen of devlink, it seems quite nice; but I understand why sysfs might be easier to deal with (especially as there's likely already a lot of code using it.) I understand that some users would like devlink because it is already widely used for network drivers (and some others), but I don't think the majority of devices used with vfio are network (although certainly a lot of them are.) > > What I care about is that we have a *standard* userspace API for performing > device compatibility checking / state migration, for use by QEMU/libvirt/ > OpenStack, such that we can write code without countless vendor specific > code paths. > > If there is vendor specific stuff on the side, that's fine as we can ignore > that, but the core functionality for device compat / migration needs to be > standardized. To summarize: - choose one of sysfs or devlink - have a common interface, with a standardized way to add vendor-specific attributes ?
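The `aggregator` entry in the tree above is defined elsewhere in this thread as a multiplier on the mdev type. For example, aggregator=2 together with i915-GVTg_V5_4 means 2*1/4=1/2 of a gen9 Intel graphics device. The Python sketch below does that arithmetic. It assumes, hypothetically, that a type name ending in `_N` always denotes a 1/N slice. That holds for the quoted example, but it is not a documented rule for every vendor.

```python
from fractions import Fraction


def device_share(mdev_type: str, aggregator: int = 1) -> Fraction:
    """Return the fraction of the parent device an mdev instance represents.

    Hypothetical convention: a type name ending in `_N` is a 1/N slice of the
    parent (i915-GVTg_V5_4 -> 1/4), and `aggregator` multiplies that slice.
    """
    _, sep, suffix = mdev_type.rpartition("_")
    if not sep or not suffix.isdigit() or int(suffix) == 0:
        raise ValueError(f"cannot infer slice size from {mdev_type!r}")
    if aggregator < 1:
        raise ValueError("aggregator must be >= 1")
    return aggregator * Fraction(1, int(suffix))


print(device_share("i915-GVTg_V5_4", aggregator=2))  # prints 1/2
```

A generic checker could compare computed shares rather than raw type names. Whether two instances with equal shares are actually interchangeable for migration remains a vendor-specific question.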
Re: device compatibility interface for live migration with assigned devices
On Tue, Aug 18, 2020 at 11:06:17AM +0200, Cornelia Huck wrote: > On Tue, 18 Aug 2020 09:55:27 +0100 > Daniel P. Berrangé wrote: > > > On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote: > > > Another point, as we discussed in another thread, it's really hard to make > > > sure the above API work for all types of devices and frameworks. So > > > having a > > > vendor specific API looks much better. > > > > From the POV of userspace mgmt apps doing device compat checking / > > migration, > > we certainly do NOT want to use different vendor specific APIs. We want to > > have an API that can be used / controlled in a standard manner across > > vendors. > > As we certainly will need to have different things to check for > different device types and vendor drivers, would it still be fine to > have differing (say) attributes, as long as they are presented (and can > be discovered) in a standardized way? Yes, the control API and algorithm to deal with the problem need to be standardized, but the data passed in/out of the APIs can vary. Essentially the key is that vendors should be able to create devices in the kernel, and those devices should "just work" with the existing generic userspace migration / compat checking code, without needing extra vendor specific logic to be added. Note, I'm not saying that the userspace decisions would be perfectly optimal based on generic code. They might be making a simplified decision that, while functionally safe, is not the ideal solution. Adding vendor specific code might be able to optimize the userspace decisions, but that should be considered just an optimization, not a core must-have for any operation. Regards, Daniel -- |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o-https://fstop138.berrange.com :| |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
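Daniel's split between a generic, functionally safe decision and an optional vendor-specific optimization can be sketched in Python. Everything here is hypothetical. Attributes are plain string dicts, and the generic rule is the most conservative one: every source attribute must appear on the destination with an identical value. The optional `scorer` only ranks destinations that already passed the generic check. Omitting it therefore never makes a decision unsafe, only possibly suboptimal.

```python
from typing import Callable, Dict, Optional

Attrs = Dict[str, str]
# Hypothetical vendor hook: higher score = preferred. Never used to decide safety.
VendorScorer = Callable[[Attrs, Attrs], int]


def generic_compatible(src: Attrs, dst: Attrs) -> bool:
    """Vendor-neutral check: every source attribute must match on the destination."""
    return all(dst.get(key) == value for key, value in src.items())


def pick_destination(src: Attrs, candidates: Dict[str, Attrs],
                     scorer: Optional[VendorScorer] = None) -> Optional[str]:
    """Return a safe destination name, or None if no candidate passes."""
    safe = [name for name, attrs in candidates.items()
            if generic_compatible(src, attrs)]
    if not safe:
        return None
    if scorer is None:
        # Generic path: deterministic and safe, but not necessarily optimal.
        return sorted(safe)[0]
    # Optimization path: the vendor hook only reorders already-safe choices.
    return max(safe, key=lambda name: scorer(src, candidates[name]))
```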
Re: device compatibility interface for live migration with assigned devices
Your mail came through as HTML-only so all the quoting and attribution is mangled / lost now :-( On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote: >On 2020/8/18 下午4:55, Daniel P. Berrangé wrote: > > On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote: > > On 2020/8/14 下午1:16, Yan Zhao wrote: > > On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: > > On 2020/8/10 下午3:46, Yan Zhao wrote: > we actually can also retrieve the same information through sysfs, .e.g > > |- [path to device] > |--- migration > | |--- self > | | |---device_api > || |---mdev_type > || |---software_version > || |---device_id > || |---aggregator > | |--- compatible > | | |---device_api > || |---mdev_type > || |---software_version > || |---device_id > || |---aggregator > > > Yes but: > > - You need one file per attribute (one syscall for one attribute) > - Attribute is coupled with kobject > > All of above seems unnecessary. > > Another point, as we discussed in another thread, it's really hard to make > sure the above API work for all types of devices and frameworks. So having a > vendor specific API looks much better. > > From the POV of userspace mgmt apps doing device compat checking / migration, > we certainly do NOT want to use different vendor specific APIs. We want to > have an API that can be used / controlled in a standard manner across > vendors. > >Yes, but it could be hard. E.g vDPA will chose to use devlink (there's a >long debate on sysfs vs devlink). So if we go with sysfs, at least two >APIs needs to be supported ... NB, I was not questioning devlink vs sysfs directly. If devlink is related to netlink, I can't say I'm enthusiastic as IMKE sysfs is easier to deal with. I don't know enough about devlink to have much of an opinion though. The key point was that I don't want the userspace APIs we need to deal with to be vendor specific. 
What I care about is that we have a *standard* userspace API for performing device compatibility checking / state migration, for use by QEMU/libvirt/ OpenStack, such that we can write code without countless vendor specific code paths. If there is vendor specific stuff on the side, that's fine as we can ignore that, but the core functionality for device compat / migration needs to be standardized. Regards, Daniel
Re: device compatibility interface for live migration with assigned devices
On Tue, 18 Aug 2020 09:55:27 +0100 Daniel P. Berrangé wrote: > On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote: > > Another point, as we discussed in another thread, it's really hard to make > > sure the above API work for all types of devices and frameworks. So having a > > vendor specific API looks much better. > > From the POV of userspace mgmt apps doing device compat checking / migration, > we certainly do NOT want to use different vendor specific APIs. We want to > have an API that can be used / controlled in a standard manner across vendors. As we certainly will need to have different things to check for different device types and vendor drivers, would it still be fine to have differing (say) attributes, as long as they are presented (and can be discovered) in a standardized way? (See e.g. what I came up with for vfio-ccw in a different branch of this thread.) E.g. version= .type_specific_value0= .type_specific_value1= .vendor_driver_specific_value0= with a type or vendor driver having some kind of get_supported_attributes method?
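Cornelia's idea of differing attributes discovered in a standardized way could start from a plain name=value listing like the one above. Below is a minimal Python sketch. It assumes one `name=value` pair per line, with names taken verbatim. `missing_attributes` takes the destination's supported names as a plain iterable, standing in for the hypothetical get_supported_attributes method, since no such interface exists yet.

```python
from typing import Dict, Iterable, Set


def parse_attributes(text: str) -> Dict[str, str]:
    """Parse 'name=value' lines into a dict; blank lines are skipped."""
    attrs: Dict[str, str] = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        name, sep, value = line.partition("=")
        if not sep or not name:
            raise ValueError(f"line {lineno}: expected name=value, got {line!r}")
        attrs[name.strip()] = value.strip()
    return attrs


def missing_attributes(src: Dict[str, str], dst_supported: Iterable[str]) -> Set[str]:
    """Attribute names the source reports but the destination does not know.

    `dst_supported` stands in for a hypothetical get_supported_attributes().
    """
    return set(src) - set(dst_supported)
```

If the result is non-empty, generic userspace can conservatively refuse the migration without understanding what the unknown attributes mean.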
Re: device compatibility interface for live migration with assigned devices
On 2020/8/18 下午4:55, Daniel P. Berrangé wrote: On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote: On 2020/8/14 下午1:16, Yan Zhao wrote: On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: On 2020/8/10 下午3:46, Yan Zhao wrote: driver is it handled by? It looks that the devlink is for network device specific, and in devlink.h, it says include/uapi/linux/devlink.h - Network physical device Netlink interface, Actually not, I think there used to have some discussion last year and the conclusion is to remove this comment. It supports IB and probably vDPA in the future. hmm... sorry, I didn't find the referred discussion. only below discussion regarding to why to add devlink. https://www.mail-archive.com/netdev@vger.kernel.org/msg95801.html >This doesn't seem to be too much related to networking? Why can't something >like this be in sysfs? It is related to networking quite bit. There has been couple of iteration of this, including sysfs and configfs implementations. There has been a consensus reached that this should be done by netlink. I believe netlink is really the best for this purpose. Sysfs is not a good idea See the discussion here: https://patchwork.ozlabs.org/project/netdev/patch/20191115223355.1277139-1-jeffrey.t.kirs...@intel.com/ https://www.mail-archive.com/netdev@vger.kernel.org/msg96102.html >there is already a way to change eth/ib via >echo 'eth' > /sys/bus/pci/drivers/mlx4_core/:02:00.0/mlx4_port1 > >sounds like this is another way to achieve the same? It is. However the current way is driver-specific, not correct. For mlx5, we need the same, it cannot be done in this way. Do devlink is the correct way to go. https://lwn.net/Articles/674867/ There a is need for some userspace API that would allow to expose things that are not directly related to any device class like net_device of ib_device, but rather chip-wide/switch-ASIC-wide stuff. 
Use cases: 1) get/set of port type (Ethernet/InfiniBand) 2) monitoring of hardware messages to and from chip 3) setting up port splitters - split port into multiple ones and squash again, enables usage of splitter cable 4) setting up shared buffers - shared among multiple ports within one chip we actually can also retrieve the same information through sysfs, .e.g |- [path to device] |--- migration | |--- self | | |---device_api | | |---mdev_type | | |---software_version | | |---device_id | | |---aggregator | |--- compatible | | |---device_api | | |---mdev_type | | |---software_version | | |---device_id | | |---aggregator Yes but: - You need one file per attribute (one syscall for one attribute) - Attribute is coupled with kobject All of above seems unnecessary. Another point, as we discussed in another thread, it's really hard to make sure the above API work for all types of devices and frameworks. So having a vendor specific API looks much better. From the POV of userspace mgmt apps doing device compat checking / migration, we certainly do NOT want to use different vendor specific APIs. We want to have an API that can be used / controlled in a standard manner across vendors. Yes, but it could be hard. E.g vDPA will chose to use devlink (there's a long debate on sysfs vs devlink). So if we go with sysfs, at least two APIs needs to be supported ... Thanks Regards, Daniel
Re: device compatibility interface for live migration with assigned devices
On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote: > > On 2020/8/14 下午1:16, Yan Zhao wrote: > > On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: > > > On 2020/8/10 下午3:46, Yan Zhao wrote: > > > > > driver is it handled by? > > > > It looks that the devlink is for network device specific, and in > > > > devlink.h, it says > > > > include/uapi/linux/devlink.h - Network physical device Netlink > > > > interface, > > > > > > Actually not, I think there used to have some discussion last year and the > > > conclusion is to remove this comment. > > > > > > It supports IB and probably vDPA in the future. > > > > > hmm... sorry, I didn't find the referred discussion. only below discussion > > regarding to why to add devlink. > > > > https://www.mail-archive.com/netdev@vger.kernel.org/msg95801.html > > >This doesn't seem to be too much related to networking? Why can't > > something > > >like this be in sysfs? > > > > It is related to networking quite bit. There has been couple of > > iteration of this, including sysfs and configfs implementations. There > > has been a consensus reached that this should be done by netlink. I > > believe netlink is really the best for this purpose. Sysfs is not a good > > idea > > > See the discussion here: > > https://patchwork.ozlabs.org/project/netdev/patch/20191115223355.1277139-1-jeffrey.t.kirs...@intel.com/ > > > > > > https://www.mail-archive.com/netdev@vger.kernel.org/msg96102.html > > >there is already a way to change eth/ib via > > >echo 'eth' > /sys/bus/pci/drivers/mlx4_core/:02:00.0/mlx4_port1 > > > > > >sounds like this is another way to achieve the same? > > > > It is. However the current way is driver-specific, not correct. > > For mlx5, we need the same, it cannot be done in this way. Do devlink is > > the correct way to go. 
> > > > https://lwn.net/Articles/674867/ > > There a is need for some userspace API that would allow to expose things > > that are not directly related to any device class like net_device of > > ib_device, but rather chip-wide/switch-ASIC-wide stuff. > > > > Use cases: > > 1) get/set of port type (Ethernet/InfiniBand) > > 2) monitoring of hardware messages to and from chip > > 3) setting up port splitters - split port into multiple ones and squash > > again, > >enables usage of splitter cable > > 4) setting up shared buffers - shared among multiple ports within one > > chip > > > > > > > > we actually can also retrieve the same information through sysfs, .e.g > > > > |- [path to device] > >|--- migration > >| |--- self > >| | |---device_api > >|| |---mdev_type > >|| |---software_version > >|| |---device_id > >|| |---aggregator > >| |--- compatible > >| | |---device_api > >|| |---mdev_type > >|| |---software_version > >|| |---device_id > >|| |---aggregator > > > > Yes but: > > - You need one file per attribute (one syscall for one attribute) > - Attribute is coupled with kobject > > All of above seems unnecessary. > > Another point, as we discussed in another thread, it's really hard to make > sure the above API work for all types of devices and frameworks. So having a > vendor specific API looks much better. From the POV of userspace mgmt apps doing device compat checking / migration, we certainly do NOT want to use different vendor specific APIs. We want to have an API that can be used / controlled in a standard manner across vendors. Regards, Daniel
Re: device compatibility interface for live migration with assigned devices
On 2020/8/14 下午1:16, Yan Zhao wrote: On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: On 2020/8/10 下午3:46, Yan Zhao wrote: driver is it handled by? It looks that the devlink is for network device specific, and in devlink.h, it says include/uapi/linux/devlink.h - Network physical device Netlink interface, Actually not, I think there used to have some discussion last year and the conclusion is to remove this comment. It supports IB and probably vDPA in the future. hmm... sorry, I didn't find the referred discussion. only below discussion regarding to why to add devlink. https://www.mail-archive.com/netdev@vger.kernel.org/msg95801.html >This doesn't seem to be too much related to networking? Why can't something >like this be in sysfs? It is related to networking quite bit. There has been couple of iteration of this, including sysfs and configfs implementations. There has been a consensus reached that this should be done by netlink. I believe netlink is really the best for this purpose. Sysfs is not a good idea See the discussion here: https://patchwork.ozlabs.org/project/netdev/patch/20191115223355.1277139-1-jeffrey.t.kirs...@intel.com/ https://www.mail-archive.com/netdev@vger.kernel.org/msg96102.html >there is already a way to change eth/ib via >echo 'eth' > /sys/bus/pci/drivers/mlx4_core/:02:00.0/mlx4_port1 > >sounds like this is another way to achieve the same? It is. However the current way is driver-specific, not correct. For mlx5, we need the same, it cannot be done in this way. Do devlink is the correct way to go. https://lwn.net/Articles/674867/ There a is need for some userspace API that would allow to expose things that are not directly related to any device class like net_device of ib_device, but rather chip-wide/switch-ASIC-wide stuff. 
Use cases: 1) get/set of port type (Ethernet/InfiniBand) 2) monitoring of hardware messages to and from chip 3) setting up port splitters - split port into multiple ones and squash again, enables usage of splitter cable 4) setting up shared buffers - shared among multiple ports within one chip we actually can also retrieve the same information through sysfs, .e.g |- [path to device] |--- migration | |--- self | | |---device_api || |---mdev_type || |---software_version || |---device_id || |---aggregator | |--- compatible | | |---device_api || |---mdev_type || |---software_version || |---device_id || |---aggregator Yes but: - You need one file per attribute (one syscall for one attribute) - Attribute is coupled with kobject All of above seems unnecessary. Another point, as we discussed in another thread, it's really hard to make sure the above API work for all types of devices and frameworks. So having a vendor specific API looks much better. I feel like it's not very appropriate for a GPU driver to use this interface. Is that right? I think not though most of the users are switch or ethernet devices. It doesn't prevent you from inventing new abstractions. so need to patch devlink core and the userspace devlink tool? e.g. devlink migration It quite flexible, you can extend devlink, invent your own or let mgmt to establish devlink directly. Note that devlink is based on netlink, netlink has been widely used by various subsystems other than networking. the advantage of netlink I see is that it can monitor device status and notify upper layer that migration database needs to get updated. I may miss something, but why this is needed? From device point of view, the following capability should be sufficient to support live migration: - set/get device state - report dirty page tracking - set/get capability But not sure whether openstack would like to use this capability. As Sean said, it's heavy for openstack. 
it's heavy for vendor driver as well :) Well, it depends on several factors. Just counting LOCs, sysfs-based attributes are not lightweight. Thanks And devlink monitor now listens the notification and dumps the state changes. If we want to use it, need to let it forward the notification and dumped info to openstack, right? Thanks Yan
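Jason's list (set/get device state, report dirty page tracking, set/get capability) can be read as a minimal interface. The Python sketch below is hypothetical. The class and method names, the two-state model and the page-number representation are assumptions for illustration, not the VFIO migration uAPI. The in-memory `FakeDevice` exists only to show how dirty tracking and device state interact.

```python
from abc import ABC, abstractmethod
from enum import Enum
from typing import FrozenSet, Iterable, Set


class DeviceState(Enum):
    RUNNING = "running"
    STOPPED = "stopped"


class MigratableDevice(ABC):
    """Hypothetical minimal migration surface; not an existing kernel API."""

    @abstractmethod
    def get_state(self) -> DeviceState: ...

    @abstractmethod
    def set_state(self, state: DeviceState) -> None: ...

    @abstractmethod
    def take_dirty_pages(self) -> Set[int]:
        """Return page numbers dirtied since the last call, then reset tracking."""

    @abstractmethod
    def get_capabilities(self) -> FrozenSet[str]: ...

    @abstractmethod
    def set_capabilities(self, caps: Iterable[str]) -> None: ...


class FakeDevice(MigratableDevice):
    """In-memory stand-in used only to exercise the interface."""

    def __init__(self, supported: Iterable[str]) -> None:
        self._state = DeviceState.RUNNING
        self._dirty: Set[int] = set()
        self._supported = frozenset(supported)
        self.enabled: FrozenSet[str] = frozenset()

    def get_state(self) -> DeviceState:
        return self._state

    def set_state(self, state: DeviceState) -> None:
        self._state = state

    def dma_write(self, page: int) -> None:
        # Simulates device DMA: only a running device dirties guest memory.
        if self._state is DeviceState.RUNNING:
            self._dirty.add(page)

    def take_dirty_pages(self) -> Set[int]:
        dirty, self._dirty = self._dirty, set()
        return dirty

    def get_capabilities(self) -> FrozenSet[str]:
        return self._supported

    def set_capabilities(self, caps: Iterable[str]) -> None:
        requested = frozenset(caps)
        if not requested <= self._supported:
            raise ValueError(f"unsupported: {sorted(requested - self._supported)}")
        self.enabled = requested
```

A pre-copy loop would call take_dirty_pages() repeatedly while the device runs, then stop the device and drain the remaining dirty pages one last time.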
Re: device compatibility interface for live migration with assigned devices
On Thu, 13 Aug 2020 15:02:53 -0400 Eric Farman wrote: > On 8/13/20 11:33 AM, Cornelia Huck wrote: > > On Fri, 7 Aug 2020 13:59:42 +0200 > > Cornelia Huck wrote: > > > >> On Wed, 05 Aug 2020 12:35:01 +0100 > >> Sean Mooney wrote: > >> > >>> On Wed, 2020-08-05 at 12:53 +0200, Jiri Pirko wrote: > Wed, Aug 05, 2020 at 11:33:38AM CEST, yan.y.z...@intel.com wrote: > >> > >> (...) > >> > >software_version: device driver's version. > > in <major>.<minor>[.<bugfix>] scheme, where there is no > >compatibility across major versions, minor versions have > >forward compatibility (ex. 1-> 2 is ok, 2 -> 1 is not) > > and > >bugfix version number indicates some degree of internal > >improvement that is not visible to the user in terms of > >features or compatibility, > > > > vendor specific attributes: each vendor may define different attributes > > device id : device id of a physical device or mdev's parent pci > > device. > > it could be equal to pci id for pci devices > > aggregator: used together with mdev_type. e.g. aggregator=2 together > > with i915-GVTg_V5_4 means 2*1/4=1/2 of a gen9 Intel > >graphics device. > > remote_url: for a local NVMe VF, it may be configured with a remote > > url of a remote storage and all data is stored in the > >remote side specified by the remote url. > > ... > >>> just a minor note that i find ^ much simpler to understand than > >>> the current proposal with self and compatible. > >>> if i have well-defined attributes that i can parse and understand, that allow > >>> me to calculate what is and is not compatible, that is likely going to > >>> be more useful as you won't have to keep maintaining a list of other > >>> compatible > >>> devices every time a new sku is released. > >>> > >>> in any case thanks for actually sharing ^ as it makes it simpler to reason > >>> about what > >>> you have previously proposed. > >> > >> So, what would be the most helpful format?
A 'software_version' field > >> that follows the conventions outlined above, and other (possibly > >> optional) fields that have to match? > > > > Just to get a different perspective, I've been trying to come up with > > what would be useful for a very different kind of device, namely > > vfio-ccw. (Adding Eric to cc: for that.) > > > > software_version makes sense for everybody, so it should be a standard > > attribute. > > > > For the vfio-ccw type, we have only one vendor driver (vfio-ccw_IO). > > > > Given a subchannel A, we want to make sure that subchannel B has a > > reasonable chance of being compatible. I guess that means: > > > > - same subchannel type (I/O) > > - same chpid type (e.g. all FICON; I assume there are no 'mixed' setups > > -- Eric?) > > Correct. > > > - same number of chpids? Maybe we can live without that and just inject > > some machine checks, I don't know. Same chpid numbers is something we > > cannot guarantee, especially if we want to migrate cross-CEC (to > > another machine.) > > I think we'd live without it, because I wouldn't expect it to be > consistent between systems. Yes, and the guest needs to be able to deal with changing path configurations anyway. > > > > > Other possibly interesting information is not available at the > > subchannel level (vfio-ccw is a subchannel driver.) > > I presume you're alluding to the DASD uid (dasdinfo -x) here? Yes, or the even more basic Sense ID information. > > > > > So, looking at a concrete subchannel on one of my machines, it would > > look something like the following: > > > > > > software_version=1.0.0 > > type=vfio-ccw <-- would be vfio-pci on the example above > > > > subchannel_type=0 > > > > chpid_type=0x1a > > chpid_mask=0xf0<-- not sure if needed/wanted Let's just drop the chpid_mask here. > > > > Does that make sense? Would be interesting if someone could come up with some possible information for a third type of device.
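The versioning rule quoted above translates into a small check: no compatibility across majors, minors forward-compatible only, bugfix invisible. So does Cornelia's vfio-ccw attribute list. This is a Python sketch under those assumptions only. The attribute names come from the example, and the function names are hypothetical. Chpid numbers and counts are deliberately not compared, since, as Eric and Cornelia concluded, the guest has to cope with changing path configurations anyway.

```python
from typing import Dict, Tuple


def parse_version(text: str) -> Tuple[int, int, int]:
    """Parse '<major>.<minor>[.<bugfix>]'; a missing bugfix counts as 0."""
    parts = text.split(".")
    if len(parts) not in (2, 3):
        raise ValueError(f"bad version: {text!r}")
    major, minor, *rest = (int(p) for p in parts)
    return major, minor, (rest[0] if rest else 0)


def version_can_migrate(src: str, dst: str) -> bool:
    s_major, s_minor, _ = parse_version(src)
    d_major, d_minor, _ = parse_version(dst)
    # Majors are incompatible with each other; minors are forward-compatible
    # only (1 -> 2 ok, 2 -> 1 not); bugfix levels are ignored.
    return s_major == d_major and d_minor >= s_minor


# Attributes that must match exactly, per the vfio-ccw example above;
# chpid numbers, counts and masks are intentionally left out.
VFIO_CCW_MUST_MATCH = ("type", "subchannel_type", "chpid_type")


def vfio_ccw_compatible(src: Dict[str, str], dst: Dict[str, str]) -> bool:
    if not version_can_migrate(src["software_version"], dst["software_version"]):
        return False
    return all(src.get(k) == dst.get(k) for k in VFIO_CCW_MUST_MATCH)
```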
Re: device compatibility interface for live migration with assigned devices
On Fri, Aug 14, 2020 at 01:30:00PM +0100, Sean Mooney wrote: > On Fri, 2020-08-14 at 13:16 +0800, Yan Zhao wrote: > > On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: > > > > > > On 2020/8/10 下午3:46, Yan Zhao wrote: > > > > > driver is it handled by? > > > > > > > > It looks that the devlink is for network device specific, and in > > > > devlink.h, it says > > > > include/uapi/linux/devlink.h - Network physical device Netlink > > > > interface, > > > > > > > > > Actually not, I think there used to have some discussion last year and the > > > conclusion is to remove this comment. > > > > > > It supports IB and probably vDPA in the future. > > > > > > > hmm... sorry, I didn't find the referred discussion. only below discussion > > regarding to why to add devlink. > > > > https://www.mail-archive.com/netdev@vger.kernel.org/msg95801.html > > >This doesn't seem to be too much related to networking? Why can't > > something > > >like this be in sysfs? > > > > It is related to networking quite bit. There has been couple of > > iteration of this, including sysfs and configfs implementations. There > > has been a consensus reached that this should be done by netlink. I > > believe netlink is really the best for this purpose. Sysfs is not a good > > idea > > > > https://www.mail-archive.com/netdev@vger.kernel.org/msg96102.html > > >there is already a way to change eth/ib via > > >echo 'eth' > /sys/bus/pci/drivers/mlx4_core/:02:00.0/mlx4_port1 > > > > > >sounds like this is another way to achieve the same? > > > > It is. However the current way is driver-specific, not correct. > > For mlx5, we need the same, it cannot be done in this way. Do devlink is > > the correct way to go. > im not sure i agree with that. > standardising a filesystem based api that is used across all vendors is also > a valid > option. 
That said, if devlink is the right choice from a kernel perspective > by all > means use it, but I have not heard a convincing argument for why it is actually > better. > With that said, we have been using tools like ethtool to manage aspects of NICs > for decades > so it's not that strange an idea to use a tool and binary protocol rather > than a text > based interface for this, but there are advantages to both approaches. > > Yes, I agree with you. > > https://lwn.net/Articles/674867/ > > There is a need for some userspace API that would allow to expose things > > that are not directly related to any device class like net_device or > > ib_device, but rather chip-wide/switch-ASIC-wide stuff. > > > > Use cases: > > 1) get/set of port type (Ethernet/InfiniBand) > > 2) monitoring of hardware messages to and from chip > > 3) setting up port splitters - split port into multiple ones and squash > > again, > >enables usage of splitter cable > > 4) setting up shared buffers - shared among multiple ports within one > > chip > > > > > > > > we actually can also retrieve the same information through sysfs, e.g. > > > > > - [path to device] > > > > |--- migration > > | |--- self > > | | |---device_api > > | | |---mdev_type > > | | |---software_version > > | | |---device_id > > | | |---aggregator > > | |--- compatible > > | | |---device_api > > | | |---mdev_type > > | | |---software_version > > | | |---device_id > > | | |---aggregator > > > > > > > > > > > > > I feel like it's not very appropriate for a GPU driver to use > > > > this interface. Is that right? > > > > > > > > > I think not, though most of the users are switch or Ethernet devices. It > > > doesn't prevent you from inventing new abstractions. > > > > so we would need to patch devlink core and the userspace devlink tool? > > e.g. devlink migration > and devlink python libs if OpenStack was to use it directly.
> We do have cases where we just fork a process and execute a command in a shell > with or without elevated privilege, but we really don't like doing that due to > the performance impact and security implications, so where we can use python > bindings > over C APIs we do. pyroute2 is the only python lib I know of off the top of > my head > that supports devlink, so we would need to enhance it to support this new > devlink API. > There may be others; I have not really looked in the past since we don't need > to use > devlink at all today. > > > > > Note that devlink is based on netlink, netlink has been widely used by > > > various subsystems other than networking. > > > > the advantage of netlink I see is that it can monitor device status and > > notify the upper layer that the migration database needs to be updated. > > But I'm not sure whether openstack would like to use this capability. > > As Sean said, it's heavy for openstack. It's heavy for the vendor driver > > as well :) > > > > And devlink monitor now listens for notifications and dumps the state > > changes. If we
Re: device compatibility interface for live migration with assigned devices
On Fri, 2020-08-14 at 13:16 +0800, Yan Zhao wrote: > On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: > > > > On 2020/8/10 3:46 PM, Yan Zhao wrote: > > > > driver is it handled by? > > > > > > It looks like devlink is network-device specific, and in > > > devlink.h, it says > > > include/uapi/linux/devlink.h - Network physical device Netlink > > > interface, > > > > > > Actually not, I think there was some discussion last year and the > > conclusion was to remove this comment. > > > > It supports IB and probably vDPA in the future. > > > > hmm... sorry, I didn't find the referenced discussion, only the discussion below > regarding why devlink was added. > > https://www.mail-archive.com/netdev@vger.kernel.org/msg95801.html > >This doesn't seem to be too much related to networking? Why can't > something > >like this be in sysfs? > > It is related to networking quite a bit. There have been a couple of > iterations of this, including sysfs and configfs implementations. There > has been a consensus reached that this should be done by netlink. I > believe netlink is really the best for this purpose. Sysfs is not a good > idea > > https://www.mail-archive.com/netdev@vger.kernel.org/msg96102.html > >there is already a way to change eth/ib via > >echo 'eth' > /sys/bus/pci/drivers/mlx4_core/:02:00.0/mlx4_port1 > > > >sounds like this is another way to achieve the same? > > It is. However the current way is driver-specific, not correct. > For mlx5, we need the same, it cannot be done in this way. So devlink is > the correct way to go. I'm not sure I agree with that. Standardising a filesystem-based API that is used across all vendors is also a valid option. That said, if devlink is the right choice from a kernel perspective, by all means use it, but I have not heard a convincing argument for why it is actually better.
With that said, we have been using tools like ethtool to manage aspects of NICs for decades, so it's not that strange an idea to use a tool and binary protocol rather than a text-based interface for this, but there are advantages to both approaches. > > https://lwn.net/Articles/674867/ > There is a need for some userspace API that would allow to expose things > that are not directly related to any device class like net_device or > ib_device, but rather chip-wide/switch-ASIC-wide stuff. > > Use cases: > 1) get/set of port type (Ethernet/InfiniBand) > 2) monitoring of hardware messages to and from chip > 3) setting up port splitters - split port into multiple ones and squash > again, > enables usage of splitter cable > 4) setting up shared buffers - shared among multiple ports within one > chip > > > > we actually can also retrieve the same information through sysfs, e.g. > > > - [path to device] > > |--- migration > | |--- self > | | |---device_api > | | |---mdev_type > | | |---software_version > | | |---device_id > | | |---aggregator > | |--- compatible > | | |---device_api > | | |---mdev_type > | | |---software_version > | | |---device_id > | | |---aggregator > > > > > > > > I feel like it's not very appropriate for a GPU driver to use > > > this interface. Is that right? > > > > > > I think not, though most of the users are switch or Ethernet devices. It > > doesn't prevent you from inventing new abstractions. > > so we would need to patch devlink core and the userspace devlink tool? > e.g. devlink migration and devlink python libs, if OpenStack was to use it directly. We do have cases where we just fork a process and execute a command in a shell with or without elevated privilege, but we really don't like doing that due to the performance impact and security implications, so where we can use python bindings over C APIs we do. pyroute2 is the only python lib I know of off the top of my head that supports devlink, so we would need to enhance it to support this new devlink API.
There may be others; I have not really looked in the past since we don't need to use devlink at all today. > > > Note that devlink is based on netlink, netlink has been widely used by > > various subsystems other than networking. > > the advantage of netlink I see is that it can monitor device status and > notify the upper layer that the migration database needs to be updated. > But I'm not sure whether openstack would like to use this capability. > As Sean said, it's heavy for openstack. It's heavy for the vendor driver > as well :) > > And devlink monitor now listens for notifications and dumps the state > changes. If we want to use it, we need to let it forward the notification > and dumped info to openstack, right? I don't think we would use direct devlink monitoring in nova even if it was available. We could, but we already poll libvirt and the system for other resources periodically. We
Re: device compatibility interface for live migration with assigned devices
On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote: > > On 2020/8/10 3:46 PM, Yan Zhao wrote: > > > driver is it handled by? > > It looks like devlink is network-device specific, and in > > devlink.h, it says > > include/uapi/linux/devlink.h - Network physical device Netlink > > interface, > > > Actually not, I think there was some discussion last year and the > conclusion was to remove this comment. > > It supports IB and probably vDPA in the future. > hmm... sorry, I didn't find the referenced discussion, only the discussion below regarding why devlink was added. https://www.mail-archive.com/netdev@vger.kernel.org/msg95801.html >This doesn't seem to be too much related to networking? Why can't something >like this be in sysfs? It is related to networking quite a bit. There have been a couple of iterations of this, including sysfs and configfs implementations. There has been a consensus reached that this should be done by netlink. I believe netlink is really the best for this purpose. Sysfs is not a good idea https://www.mail-archive.com/netdev@vger.kernel.org/msg96102.html >there is already a way to change eth/ib via >echo 'eth' > /sys/bus/pci/drivers/mlx4_core/:02:00.0/mlx4_port1 > >sounds like this is another way to achieve the same? It is. However the current way is driver-specific, not correct. For mlx5, we need the same, it cannot be done in this way. So devlink is the correct way to go. https://lwn.net/Articles/674867/ There is a need for some userspace API that would allow to expose things that are not directly related to any device class like net_device or ib_device, but rather chip-wide/switch-ASIC-wide stuff.
Use cases:
1) get/set of port type (Ethernet/InfiniBand)
2) monitoring of hardware messages to and from chip
3) setting up port splitters - split port into multiple ones and squash again, enables usage of splitter cable
4) setting up shared buffers - shared among multiple ports within one chip

we actually can also retrieve the same information through sysfs, e.g.

|- [path to device]
|--- migration
| |--- self
| | |---device_api
| | |---mdev_type
| | |---software_version
| | |---device_id
| | |---aggregator
| |--- compatible
| | |---device_api
| | |---mdev_type
| | |---software_version
| | |---device_id
| | |---aggregator

> > > I feel like it's not very appropriate for a GPU driver to use > > this interface. Is that right? > > > I think not, though most of the users are switch or Ethernet devices. It > doesn't prevent you from inventing new abstractions.

so we would need to patch devlink core and the userspace devlink tool? e.g. devlink migration

> Note that devlink is based on netlink, netlink has been widely used by > various subsystems other than networking.

the advantage of netlink I see is that it can monitor device status and notify the upper layer that the migration database needs to be updated. But I'm not sure whether openstack would like to use this capability. As Sean said, it's heavy for openstack. It's heavy for the vendor driver as well :)

And devlink monitor now listens for notifications and dumps the state changes. If we want to use it, we need to let it forward the notification and dumped info to openstack, right?

Thanks
Yan
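Under the sysfs layout sketched in this message, a consumer such as OpenStack would only need plain file reads, one value per file. The Python below is a minimal sketch of that, assuming the proposed migration/self and migration/compatible directories exist under a device path. That layout is a proposal under discussion, not a merged kernel ABI, and the helper names are hypothetical.

```python
from pathlib import Path

# Attribute files from the proposed layout; "migration/self" and
# "migration/compatible" are a proposal, not an existing kernel ABI.
MIGRATION_ATTRS = ("device_api", "mdev_type", "software_version",
                   "device_id", "aggregator")


def read_attr_dir(directory):
    """Read one value per file (the sysfs convention); skip missing files."""
    values = {}
    for name in MIGRATION_ATTRS:
        attr = Path(directory) / name
        if attr.is_file():
            values[name] = attr.read_text().strip()
    return values


def read_migration_info(device_path):
    """Return the 'self' and 'compatible' attribute sets of one device."""
    base = Path(device_path) / "migration"
    return {
        "self": read_attr_dir(base / "self"),
        "compatible": read_attr_dir(base / "compatible"),
    }
```

Nothing here forks a process or needs a binding library; that is the main practical appeal of a sysfs layout for a Python consumer, whatever its other trade-offs.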
Re: device compatibility interface for live migration with assigned devices
On 8/13/20 11:33 AM, Cornelia Huck wrote: > On Fri, 7 Aug 2020 13:59:42 +0200 > Cornelia Huck wrote: > >> On Wed, 05 Aug 2020 12:35:01 +0100 >> Sean Mooney wrote: >> >>> On Wed, 2020-08-05 at 12:53 +0200, Jiri Pirko wrote: Wed, Aug 05, 2020 at 11:33:38AM CEST, yan.y.z...@intel.com wrote: >> >> (...) >> >software_version: device driver's version. > in .[.bugfix] scheme, where there is no > compatibility across major versions, minor versions have > forward compatibility (ex. 1-> 2 is ok, 2 -> 1 is not) and > bugfix version number indicates some degree of internal > improvement that is not visible to the user in terms of > features or compatibility, > > vendor specific attributes: each vendor may define different attributes > device id : device id of a physical device or mdev's parent pci device. > it could be equal to pci id for pci devices > aggregator: used together with mdev_type. e.g. aggregator=2 together > with i915-GVTg_V5_4 means 2*1/4=1/2 of a gen9 Intel > graphics device. > remote_url: for a local NVMe VF, it may be configured with a remote > url of a remote storage and all data is stored in the > remote side specified by the remote url. > ... >>> Just a minor note that I find ^ much simpler to understand than >>> the current proposal with self and compatible. >>> If I have well-defined attributes that I can parse and understand, and that allow >>> me to calculate what is and is not compatible, that is likely going to >>> be more useful, as you won't have to keep maintaining a list of other compatible >>> devices every time a new SKU is released. >>> >>> In any case, thanks for actually sharing ^ as it makes it simpler to reason >>> about what >>> you have previously proposed. >> >> So, what would be the most helpful format? A 'software_version' field >> that follows the conventions outlined above, and other (possibly >> optional) fields that have to match?
> > Just to get a different perspective, I've been trying to come up with > what would be useful for a very different kind of device, namely > vfio-ccw. (Adding Eric to cc: for that.) > > software_version makes sense for everybody, so it should be a standard > attribute. > > For the vfio-ccw type, we have only one vendor driver (vfio-ccw_IO). > > Given a subchannel A, we want to make sure that subchannel B has a > reasonable chance of being compatible. I guess that means: > > - same subchannel type (I/O) > - same chpid type (e.g. all FICON; I assume there are no 'mixed' setups > -- Eric?) Correct. > - same number of chpids? Maybe we can live without that and just inject > some machine checks, I don't know. Same chpid numbers is something we > cannot guarantee, especially if we want to migrate cross-CEC (to > another machine.) I think we'd live without it, because I wouldn't expect it to be consistent between systems. > > Other possibly interesting information is not available at the > subchannel level (vfio-ccw is a subchannel driver.) I presume you're alluding to the DASD uid (dasdinfo -x) here? > > So, looking at a concrete subchannel on one of my machines, it would > look something like the following: > > > software_version=1.0.0 > type=vfio-ccw <-- would be vfio-pci on the example above > > subchannel_type=0 > > chpid_type=0x1a > chpid_mask=0xf0<-- not sure if needed/wanted > > Does that make sense? >
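Eric's answers reduce the vfio-ccw check to a few equality tests: same device type, same subchannel type and same chpid type. The number of chpids and the chpid mask are not compared, since they are not expected to be consistent between systems. The Python below is a minimal sketch that parses the key=value listing from Cornelia's example and applies those tests. The plain-text input format and the function names are assumptions for illustration, not an interface anyone has implemented. software_version is left out because it needs an ordering rule rather than plain equality.

```python
# Keys that must match for two vfio-ccw subchannels to have a reasonable
# chance of being compatible; chpid_mask and the number of chpids are
# deliberately not compared. software_version needs its own rule.
CCW_MUST_MATCH = ("type", "subchannel_type", "chpid_type")


def parse_attrs(text):
    """Parse 'key=value' lines, dropping trailing '<-- comment' notes."""
    attrs = {}
    for line in text.splitlines():
        line = line.split("<--", 1)[0].strip()
        if "=" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("=", 1)
        attrs[key.strip()] = value.strip()
    return attrs


def ccw_likely_compatible(source, target):
    """True if every must-match key is present and equal on both sides."""
    return all(
        key in source and source.get(key) == target.get(key)
        for key in CCW_MUST_MATCH
    )
```

Two subchannels that differ only in chpid_mask are reported as likely compatible, while a different chpid_type (for example, FICON versus another channel type) is not.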
Re: device compatibility interface for live migration with assigned devices
On Fri, 7 Aug 2020 13:59:42 +0200 Cornelia Huck wrote: > On Wed, 05 Aug 2020 12:35:01 +0100 > Sean Mooney wrote: > > > On Wed, 2020-08-05 at 12:53 +0200, Jiri Pirko wrote: > > > Wed, Aug 05, 2020 at 11:33:38AM CEST, yan.y.z...@intel.com wrote: > > (...) > > > > >software_version: device driver's version. > > > > in .[.bugfix] scheme, where there is no > > > >compatibility across major versions, minor versions have > > > >forward compatibility (ex. 1-> 2 is ok, 2 -> 1 is not) > > > > and > > > >bugfix version number indicates some degree of internal > > > >improvement that is not visible to the user in terms of > > > >features or compatibility, > > > > > > > > vendor specific attributes: each vendor may define different attributes > > > > device id : device id of a physical device or mdev's parent pci > > > > device. > > > > it could be equal to pci id for pci devices > > > > aggregator: used together with mdev_type. e.g. aggregator=2 together > > > > with i915-GVTg_V5_4 means 2*1/4=1/2 of a gen9 Intel > > > >graphics device. > > > > remote_url: for a local NVMe VF, it may be configured with a remote > > > > url of a remote storage and all data is stored in the > > > >remote side specified by the remote url. > > > > ... > > Just a minor note that I find ^ much simpler to understand than > > the current proposal with self and compatible. > > If I have well-defined attributes that I can parse and understand, and that allow > > me to calculate what is and is not compatible, that is likely going to > > be more useful, as you won't have to keep maintaining a list of other compatible > > devices every time a new SKU is released. > > > > In any case, thanks for actually sharing ^ as it makes it simpler to reason > > about what > > you have previously proposed. > > So, what would be the most helpful format? A 'software_version' field > that follows the conventions outlined above, and other (possibly > optional) fields that have to match?
Just to get a different perspective, I've been trying to come up with what would be useful for a very different kind of device, namely vfio-ccw. (Adding Eric to cc: for that.) software_version makes sense for everybody, so it should be a standard attribute. For the vfio-ccw type, we have only one vendor driver (vfio-ccw_IO). Given a subchannel A, we want to make sure that subchannel B has a reasonable chance of being compatible. I guess that means: - same subchannel type (I/O) - same chpid type (e.g. all FICON; I assume there are no 'mixed' setups -- Eric?) - same number of chpids? Maybe we can live without that and just inject some machine checks, I don't know. Same chpid numbers is something we cannot guarantee, especially if we want to migrate cross-CEC (to another machine.) Other possibly interesting information is not available at the subchannel level (vfio-ccw is a subchannel driver.) So, looking at a concrete subchannel on one of my machines, it would look something like the following: software_version=1.0.0 type=vfio-ccw <-- would be vfio-pci on the example above subchannel_type=0 chpid_type=0x1a chpid_mask=0xf0<-- not sure if needed/wanted Does that make sense?
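The aggregator example quoted at the top of this message is simple arithmetic: aggregator=2 on an i915-GVTg_V5_4 type (each instance being 1/4 of a gen9 GPU) gives 2*1/4=1/2. The Python below is a minimal sketch of that calculation. It hypothetically assumes that a trailing _N in the mdev_type name always encodes 1/N of the device, which the thread does not guarantee for other vendors. The helper names are likewise made up.

```python
from fractions import Fraction


def gvt_share(mdev_type, aggregator=1):
    """Return (type family, fraction of one physical GPU).

    Hypothetical: assumes the mdev_type name ends in _N meaning 1/N of
    the device, as in i915-GVTg_V5_4 = 1/4 of a gen9 GPU.
    """
    family, sep, denominator = mdev_type.rpartition("_")
    if not sep or not denominator.isdigit() or int(denominator) == 0:
        raise ValueError(f"cannot infer a share from {mdev_type!r}")
    return family, Fraction(int(aggregator), int(denominator))


def same_share(source, target):
    """Compare (mdev_type, aggregator) pairs by type family and total share."""
    return gvt_share(*source) == gvt_share(*target)
```

Under this assumption, ("i915-GVTg_V5_4", 2) and ("i915-GVTg_V5_2", 1) both come out as 1/2 of the device, even though a plain string comparison of mdev_type and aggregator would treat them as different. Whether such pairs are really migration compatible is still for the vendor driver to decide.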
Re: device compatibility interface for live migration with assigned devices
On 2020/8/10 3:46 PM, Yan Zhao wrote: driver is it handled by? It looks like devlink is network-device specific, and in devlink.h, it says include/uapi/linux/devlink.h - Network physical device Netlink interface, Actually not, I think there was some discussion last year and the conclusion was to remove this comment. It supports IB and probably vDPA in the future. I feel like it's not very appropriate for a GPU driver to use this interface. Is that right? I think not, though most of the users are switch or Ethernet devices. It doesn't prevent you from inventing new abstractions. Note that devlink is based on netlink, netlink has been widely used by various subsystems other than networking. Thanks Thanks Yan
Re: device compatibility interface for live migration with assigned devices
On Wed, Aug 05, 2020 at 12:53:19PM +0200, Jiri Pirko wrote: > Wed, Aug 05, 2020 at 11:33:38AM CEST, yan.y.z...@intel.com wrote: > >On Wed, Aug 05, 2020 at 04:02:48PM +0800, Jason Wang wrote: > >> > >> On 2020/8/5 3:56 PM, Jiri Pirko wrote: > >> > Wed, Aug 05, 2020 at 04:41:54AM CEST, jasow...@redhat.com wrote: > >> > > On 2020/8/5 10:16 AM, Yan Zhao wrote: > >> > > > On Wed, Aug 05, 2020 at 10:22:15AM +0800, Jason Wang wrote: > >> > > > > On 2020/8/5 12:35 AM, Cornelia Huck wrote: > >> > > > > > [sorry about not chiming in earlier] > >> > > > > > > >> > > > > > On Wed, 29 Jul 2020 16:05:03 +0800 > >> > > > > > Yan Zhao wrote: > >> > > > > > > >> > > > > > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson > >> > > > > > > wrote: > >> > > > > > (...) > >> > > > > > > >> > > > > > > > Based on the feedback we've received, the previously > >> > > > > > > > proposed interface > >> > > > > > > > is not viable. I think there's agreement that the user > >> > > > > > > > needs to be > >> > > > > > > > able to parse and interpret the version information. Using > >> > > > > > > > json seems > >> > > > > > > > viable, but I don't know if it's the best option. Is there > >> > > > > > > > any > >> > > > > > > > precedent of markup strings returned via sysfs we could > >> > > > > > > > follow? > >> > > > > > I don't think encoding complex information in a sysfs file is a > >> > > > > > viable > >> > > > > > approach. Quoting Documentation/filesystems/sysfs.rst: > >> > > > > > > >> > > > > > "Attributes should be ASCII text files, preferably with only one > >> > > > > > value > >> > > > > > per file. It is noted that it may not be efficient to contain > >> > > > > > only one > >> > > > > > value per file, so it is socially acceptable to express an array > >> > > > > > of > >> > > > > > values of the same type. > >> > > > > > Mixing types, expressing multiple lines of data, and doing fancy > >> > > > > > formatting of data is heavily frowned upon."
> >> > > > > > > >> > > > > > Even though this is an older file, I think these restrictions > >> > > > > > still > >> > > > > > apply. > >> > > > > +1, that's another reason why devlink(netlink) is better. > >> > > > > > >> > > > hi Jason, > >> > > > do you have any materials or sample code about devlink, so we can > >> > > > have a good > >> > > > study of it? > >> > > > I found some kernel docs about it but my preliminary study didn't > >> > > > show me the > >> > > > advantage of devlink. > >> > > > >> > > CC Jiri and Parav for a better answer for this. > >> > > > >> > > My understanding is that the following advantages are obvious (as I > >> > > replied > >> > > in another thread): > >> > > > >> > > - existing users (NIC, crypto, SCSI, ib), mature and stable > >> > > - much better error reporting (ext_ack other than string or errno) > >> > > - namespace aware > >> > > - do not couple with kobject > >> > Jason, what is your use case? > >> > >> > >> I think the use case is to report device compatibility for live migration. > >> Yan proposed a simple sysfs based migration version first, but it does not look > >> sufficient, and something based on JSON is being discussed. > >> > >> Yan, can you help to summarize the discussion so far for Jiri as a > >> reference? > >> > >yes. > >we are currently defining a device live migration compatibility > >interface in order to let user space like openstack and libvirt know > >which two devices are live migration compatible. > >currently the devices include mdev (a kernel emulated virtual device) > >and physical devices (e.g. a VF of a PCI SRIOV device). > > > >the attributes we want user space to compare include > >common attributes: > >device_api: vfio-pci, vfio-ccw... > >mdev_type: mdev type of mdev or similar signature for physical device > > It specifies a device's hardware capability. e.g. > >i915-GVTg_V5_4 means it's 1/4 of a gen9 Intel graphics > >device. > >software_version: device driver's version.
> > in .[.bugfix] scheme, where there is no > >compatibility across major versions, minor versions have > >forward compatibility (ex. 1-> 2 is ok, 2 -> 1 is not) and > >bugfix version number indicates some degree of internal > >improvement that is not visible to the user in terms of > >features or compatibility, > > > >vendor specific attributes: each vendor may define different attributes > > device id : device id of a physical device or mdev's parent pci device. > > it could be equal to pci id for pci devices > > aggregator: used together with mdev_type. e.g. aggregator=2 together > > with i915-GVTg_V5_4 means 2*1/4=1/2 of a gen9 Intel > >graphics device. > > remote_url: for a local NVMe VF, it may be configured with a remote > > url of a remote storage and all data is stored in the > >remote side
Re: device compatibility interface for live migration with assigned devices
On Wed, 05 Aug 2020 12:35:01 +0100 Sean Mooney wrote: > On Wed, 2020-08-05 at 12:53 +0200, Jiri Pirko wrote: > > Wed, Aug 05, 2020 at 11:33:38AM CEST, yan.y.z...@intel.com wrote: (...) > > >software_version: device driver's version. > > > in .[.bugfix] scheme, where there is no > > > compatibility across major versions, minor versions have > > > forward compatibility (ex. 1-> 2 is ok, 2 -> 1 is not) and > > > bugfix version number indicates some degree of internal > > > improvement that is not visible to the user in terms of > > > features or compatibility, > > > > > > vendor specific attributes: each vendor may define different attributes > > > device id : device id of a physical device or mdev's parent pci device. > > > it could be equal to pci id for pci devices > > > aggregator: used together with mdev_type. e.g. aggregator=2 together > > > with i915-GVTg_V5_4 means 2*1/4=1/2 of a gen9 Intel > > > graphics device. > > > remote_url: for a local NVMe VF, it may be configured with a remote > > > url of a remote storage and all data is stored in the > > > remote side specified by the remote url. > > > ... > Just a minor note that I find ^ much simpler to understand than > the current proposal with self and compatible. > If I have well-defined attributes that I can parse and understand, and that allow > me to calculate what is and is not compatible, that is likely going to > be more useful, as you won't have to keep maintaining a list of other compatible > devices every time a new SKU is released. > > In any case, thanks for actually sharing ^ as it makes it simpler to reason about > what > you have previously proposed. So, what would be the most helpful format? A 'software_version' field that follows the conventions outlined above, and other (possibly optional) fields that have to match? (...) > > Thanks for the explanation, I'm still fuzzy about the details. > > Anyway, I suggest you check the "devlink dev info" command we have > > implemented for multiple drivers.
> > is devlink exposed as a filesystem we can read with just open? > OpenStack will likely try to leverage libvirt to get this info, but when we > can't, it's much simpler to read sysfs than it is to take a dependency on a > command-line > tool and have to fork a shell to execute it and parse the CLI output. > pyroute2, which we use in some OpenStack projects, has basic python bindings for > devlink, but I'm not > sure how complete it is, as I think it's a relatively new addition. If we need to > take a dependency > we will, but that would be a drawback of devlink; not that it is a large one, > just something > to keep in mind. A devlinkfs, maybe? At least for reading information (IIUC, "devlink dev info" is only about information retrieval, right?)
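For comparison with a plain open() on sysfs, here is roughly what consuming "devlink dev info" from Python looks like without a binding library. The fork-and-parse pattern below is the one Sean would rather avoid. This is a minimal sketch: it relies on the iproute2 devlink tool's -j (JSON output) option. It also assumes that the per-device objects are nested under a top-level "info" key and carry a "driver" field, which should be checked against the installed iproute2 version. The helper names are hypothetical.

```python
import json
import subprocess


def run_devlink_dev_info():
    """Fork the iproute2 devlink tool and return its JSON output.

    Requires the devlink binary to be installed; raises on a non-zero exit.
    """
    result = subprocess.run(
        ["devlink", "-j", "dev", "info"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def drivers_by_handle(json_text):
    """Map each devlink handle (e.g. 'pci/...') to its reported driver.

    Assumes per-device objects are nested under a top-level "info" key.
    """
    data = json.loads(json_text)
    return {
        handle: fields.get("driver")
        for handle, fields in data.get("info", {}).items()
    }
```

Compared with reading a sysfs attribute, this adds a process fork per query and a dependency on the CLI's output format. Those are the costs raised above, though they may be acceptable if devlink turns out to be the preferred kernel interface.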
Re: device compatibility interface for live migration with assigned devices
Wed, Aug 05, 2020 at 04:41:54AM CEST, jasow...@redhat.com wrote: > >On 2020/8/5 10:16 AM, Yan Zhao wrote: >> On Wed, Aug 05, 2020 at 10:22:15AM +0800, Jason Wang wrote: >> > On 2020/8/5 12:35 AM, Cornelia Huck wrote: >> > > [sorry about not chiming in earlier] >> > > >> > > On Wed, 29 Jul 2020 16:05:03 +0800 >> > > Yan Zhao wrote: >> > > >> > > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote: >> > > (...) >> > > >> > > > > Based on the feedback we've received, the previously proposed >> > > > > interface >> > > > > is not viable. I think there's agreement that the user needs to be >> > > > > able to parse and interpret the version information. Using json >> > > > > seems >> > > > > viable, but I don't know if it's the best option. Is there any >> > > > > precedent of markup strings returned via sysfs we could follow? >> > > I don't think encoding complex information in a sysfs file is a viable >> > > approach. Quoting Documentation/filesystems/sysfs.rst: >> > > >> > > "Attributes should be ASCII text files, preferably with only one value >> > > per file. It is noted that it may not be efficient to contain only one >> > > value per file, so it is socially acceptable to express an array of >> > > values of the same type. >> > > Mixing types, expressing multiple lines of data, and doing fancy >> > > formatting of data is heavily frowned upon." >> > > >> > > Even though this is an older file, I think these restrictions still >> > > apply. >> > >> > +1, that's another reason why devlink(netlink) is better. >> > >> hi Jason, >> do you have any materials or sample code about devlink, so we can have a good >> study of it? >> I found some kernel docs about it but my preliminary study didn't show me the >> advantage of devlink. > > >CC Jiri and Parav for a better answer for this.
> >My understanding is that the following advantages are obvious (as I replied >in another thread): > >- existing users (NIC, crypto, SCSI, ib), mature and stable >- much better error reporting (ext_ack other than string or errno) >- namespace aware >- do not couple with kobject Jason, what is your use case? > >Thanks > > >> >> Thanks >> Yan >> >
Re: device compatibility interface for live migration with assigned devices
Wed, Aug 05, 2020 at 11:33:38AM CEST, yan.y.z...@intel.com wrote: >On Wed, Aug 05, 2020 at 04:02:48PM +0800, Jason Wang wrote: >> >> On 2020/8/5 3:56 PM, Jiri Pirko wrote: >> > Wed, Aug 05, 2020 at 04:41:54AM CEST, jasow...@redhat.com wrote: >> > > On 2020/8/5 10:16 AM, Yan Zhao wrote: >> > > > On Wed, Aug 05, 2020 at 10:22:15AM +0800, Jason Wang wrote: >> > > > > On 2020/8/5 12:35 AM, Cornelia Huck wrote: >> > > > > > [sorry about not chiming in earlier] >> > > > > > >> > > > > > On Wed, 29 Jul 2020 16:05:03 +0800 >> > > > > > Yan Zhao wrote: >> > > > > > >> > > > > > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote: >> > > > > > (...) >> > > > > > >> > > > > > > > Based on the feedback we've received, the previously proposed >> > > > > > > > interface >> > > > > > > > is not viable. I think there's agreement that the user needs >> > > > > > > > to be >> > > > > > > > able to parse and interpret the version information. Using >> > > > > > > > json seems >> > > > > > > > viable, but I don't know if it's the best option. Is there any >> > > > > > > > precedent of markup strings returned via sysfs we could follow? >> > > > > > I don't think encoding complex information in a sysfs file is a >> > > > > > viable >> > > > > > approach. Quoting Documentation/filesystems/sysfs.rst: >> > > > > > >> > > > > > "Attributes should be ASCII text files, preferably with only one >> > > > > > value >> > > > > > per file. It is noted that it may not be efficient to contain only >> > > > > > one >> > > > > > value per file, so it is socially acceptable to express an array of >> > > > > > values of the same type. >> > > > > > Mixing types, expressing multiple lines of data, and doing fancy >> > > > > > formatting of data is heavily frowned upon." >> > > > > > >> > > > > > Even though this is an older file, I think these restrictions still >> > > > > > apply. >> > > > > +1, that's another reason why devlink(netlink) is better.
>> > > > > >> > > > hi Jason, >> > > > do you have any materials or sample code about devlink, so we can have >> > > > a good >> > > > study of it? >> > > > I found some kernel docs about it but my preliminary study didn't show >> > > > me the >> > > > advantage of devlink. >> > > >> > > CC Jiri and Parav for a better answer for this. >> > > >> > > My understanding is that the following advantages are obvious (as I >> > > replied >> > > in another thread): >> > > >> > > - existing users (NIC, crypto, SCSI, ib), mature and stable >> > > - much better error reporting (ext_ack other than string or errno) >> > > - namespace aware >> > > - do not couple with kobject >> > Jason, what is your use case? >> >> >> I think the use case is to report device compatibility for live migration. >> Yan proposed a simple sysfs based migration version first, but it does not look >> sufficient, and something based on JSON is being discussed. >> >> Yan, can you help to summarize the discussion so far for Jiri as a >> reference? >> >yes. >we are currently defining a device live migration compatibility >interface in order to let user space like openstack and libvirt know >which two devices are live migration compatible. >currently the devices include mdev (a kernel emulated virtual device) >and physical devices (e.g. a VF of a PCI SRIOV device). > >the attributes we want user space to compare include >common attributes: >device_api: vfio-pci, vfio-ccw... >mdev_type: mdev type of mdev or similar signature for physical device > It specifies a device's hardware capability. e.g. > i915-GVTg_V5_4 means it's 1/4 of a gen9 Intel graphics > device. >software_version: device driver's version. > in .[.bugfix] scheme, where there is no > compatibility across major versions, minor versions have > forward compatibility (ex.
1 -> 2 is ok, 2 -> 1 is not) and > bugfix version number indicates some degree of internal > improvement that is not visible to the user in terms of > features or compatibility. > >vendor specific attributes: each vendor may define different attributes > device id : device id of a physical device or mdev's parent pci device. > it could be equal to pci id for pci devices > aggregator: used together with mdev_type. e.g. aggregator=2 together > with i915-GVTg_V5_4 means 2*1/4=1/2 of a gen9 Intel > graphics device. > remote_url: for a local NVMe VF, it may be configured with a remote > url of a remote storage and all data is stored in the > remote side specified by the remote url. > ... > >Comparing those attributes by user space alone is not an easy job, as it >can't simply assume an equal relationship between source attributes and >target attributes. e.g. >for a source device of mdev_type=i915-GVTg_V5_4,aggregator=2, (1/2 of >gen9), it could actually find a compatible device of
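The software_version rule in the summary above (no compatibility across major versions, forward-only compatibility across minor versions, bugfix number ignored) fits in a small predicate. This is a minimal sketch, assuming versions are plain dotted integers; `parse_version` and `can_migrate` are hypothetical helper names, not part of any proposed kernel or libvirt interface.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, int, int]:
    """Parse "major.minor[.bugfix]" into (major, minor, bugfix).

    A missing bugfix component is treated as 0 (an assumption of this sketch).
    """
    parts = [int(p) for p in text.strip().split(".")]
    if len(parts) not in (2, 3):
        raise ValueError(f"expected major.minor[.bugfix], got {text!r}")
    if len(parts) == 2:
        parts.append(0)
    return parts[0], parts[1], parts[2]


def can_migrate(src_version: str, dst_version: str) -> bool:
    """Return True if state saved at src_version may be loaded at dst_version.

    - different major versions: never compatible
    - minor versions are forward compatible only (src minor <= dst minor)
    - the bugfix component is ignored
    """
    src_major, src_minor, _ = parse_version(src_version)
    dst_major, dst_minor, _ = parse_version(dst_version)
    return src_major == dst_major and src_minor <= dst_minor


if __name__ == "__main__":
    print(can_migrate("1.1", "1.2"))      # True: minor 1 -> 2 is ok
    print(can_migrate("1.2", "1.1"))      # False: minor 2 -> 1 is not
    print(can_migrate("1.2.0", "1.2.7"))  # True: bugfix is ignored
    print(can_migrate("1.2", "2.2"))      # False: different major
```

The point of the scheme is that management software only needs this one generic rule; what a minor or major bump means for the device stays inside the vendor driver.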
Re: device compatibility interface for live migration with assigned devices
On Wed, 2020-08-05 at 12:53 +0200, Jiri Pirko wrote: > Wed, Aug 05, 2020 at 11:33:38AM CEST, yan.y.z...@intel.com wrote: > > On Wed, Aug 05, 2020 at 04:02:48PM +0800, Jason Wang wrote: > > > > > > On 2020/8/5 3:56 PM, Jiri Pirko wrote: > > > > Wed, Aug 05, 2020 at 04:41:54AM CEST, jasow...@redhat.com wrote: > > > > > On 2020/8/5 10:16 AM, Yan Zhao wrote: > > > > > > On Wed, Aug 05, 2020 at 10:22:15AM +0800, Jason Wang wrote: > > > > > > > On 2020/8/5 12:35 AM, Cornelia Huck wrote: > > > > > > > > [sorry about not chiming in earlier] > > > > > > > > > > > > > > > > On Wed, 29 Jul 2020 16:05:03 +0800 > > > > > > > > Yan Zhao wrote: > > > > > > > > > > > > > > > > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson > > > > > > > > > wrote: > > > > > > > > > > > > > > > > (...) > > > > > > > > > > > > > > > > > > Based on the feedback we've received, the previously > > > > > > > > > > proposed interface > > > > > > > > > > is not viable. I think there's agreement that the user > > > > > > > > > > needs to be > > > > > > > > > > able to parse and interpret the version information. Using > > > > > > > > > > json seems > > > > > > > > > > viable, but I don't know if it's the best option. Is there > > > > > > > > > > any > > > > > > > > > > precedent of markup strings returned via sysfs we could > > > > > > > > > > follow? > > > > > > > > > > > > > > > > I don't think encoding complex information in a sysfs file is a > > > > > > > > viable > > > > > > > > approach. Quoting Documentation/filesystems/sysfs.rst: > > > > > > > > > > > > > > > > "Attributes should be ASCII text files, preferably with only > > > > > > > > one value > > > > > > > > per file. It is noted that it may not be efficient to contain > > > > > > > > only one > > > > > > > > value per file, so it is socially acceptable to express an > > > > > > > > array of > > > > > > > > values of the same type.
> > > > > > > > Mixing types, expressing multiple lines of data, and doing fancy > > > > > > > > formatting of data is heavily frowned upon." > > > > > > > > > > > > > > > > Even though this is an older file, I think these restrictions > > > > > > > > still > > > > > > > > apply. > > > > > > > > > > > > > > +1, that's another reason why devlink(netlink) is better. > > > > > > > > > > > > > > > > > > > hi Jason, > > > > > > do you have any materials or sample code about devlink, so we can > > > > > > have a good > > > > > > study of it? > > > > > > I found some kernel docs about it but my preliminary study didn't > > > > > > show me the > > > > > > advantage of devlink. > > > > > > > > > > CC Jiri and Parav for a better answer for this. > > > > > > > > > > My understanding is that the following advantages are obvious (as I > > > > > replied > > > > > in another thread): > > > > > > > > > > - existing users (NIC, crypto, SCSI, ib), mature and stable > > > > > - much better error reporting (ext_ack other than string or errno) > > > > > - namespace aware > > > > > - do not couple with kobject > > > > > > > > Jason, what is your use case? > > > > > > > > > I think the use case is to report device compatibility for live migration. > > > Yan proposed a simple sysfs based migration version first, but it does not > > > look > > > sufficient and something based on JSON is being discussed. > > > > > > Yan, can you help to summarize the discussion so far for Jiri as a > > > reference? > > > > > > > yes. > > we are currently defining a device live migration compatibility > > interface in order to let user space like openstack and libvirt know > > which two devices are live migration compatible. > > currently the devices include mdev (a kernel emulated virtual device) > > and physical devices (e.g. a VF of a PCI SRIOV device). > > > > the attributes we want user space to compare include > > common attributes: > >device_api: vfio-pci, vfio-ccw...
> >mdev_type: mdev type of mdev or similar signature for physical device > > It specifies a device's hardware capability. e.g. > >i915-GVTg_V5_4 means it's 1/4 of a gen9 Intel graphics > >device. By the way, this naming scheme works the opposite of how I would have expected. I would have expected i915-GVTg_V5 to be the same as i915-GVTg_V5_1, and i915-GVTg_V5_4 to use 4 times the amount of resources as i915-GVTg_V5_1, not one quarter. I would much rather see i915-GVTg_V5_4 expressed as aggregator:i915-GVTg_V5=4, i.e. that it is 4 of the basic i915-GVTg_V5 type. The inversion of the relationship makes this much harder to reason about, IMO. If i915-GVTg_V5_8 and i915-GVTg_V5_4 are both actually claiming the same resource and both can be used at the same time, then with your suggested naming scheme I have to find the mdev type with the largest value and store that, then do math by dividing it by the suffix of the requested type every time I want to claim the resource in our placement inventories. If we represent it the way I suggest, we don't if it
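To make the bookkeeping Sean describes concrete: if a suffix N in a name like i915-GVTg_V5_N means 1/N of the parent GPU, a placement service that wants integer inventories has to take the finest type (largest N) as its unit and convert every request into that unit. A minimal sketch under those assumptions (suffix read as a divisor, all types of one family drawing on one pool); the regex and helper names are hypothetical.

```python
import re
from typing import Dict, Iterable, Tuple

# Hypothetical parser for GVT-g style names such as "i915-GVTg_V5_4",
# reading the trailing number N as "1/N of the parent GPU".
_TYPE_RE = re.compile(r"(?P<base>.+_V\d+)_(?P<denom>\d+)")


def split_type(mdev_type: str) -> Tuple[str, int]:
    m = _TYPE_RE.fullmatch(mdev_type)
    if m is None:
        raise ValueError(f"unrecognised mdev type name: {mdev_type!r}")
    return m.group("base"), int(m.group("denom"))


def slices_per_instance(available_types: Iterable[str]) -> Dict[str, int]:
    """Express each type as a whole number of the finest-grained slice.

    The finest type (largest N) becomes the inventory unit; every other
    type costs finest_N // N of those units.
    """
    parsed = {t: split_type(t) for t in available_types}
    if len({base for base, _ in parsed.values()}) != 1:
        raise ValueError("types must share one base (same parent GPU generation)")
    finest = max(denom for _, denom in parsed.values())
    result = {}
    for mdev_type, (_, denom) in parsed.items():
        if finest % denom:
            raise ValueError(f"{mdev_type} is not a whole number of 1/{finest} slices")
        result[mdev_type] = finest // denom
    return result


if __name__ == "__main__":
    types = ["i915-GVTg_V5_1", "i915-GVTg_V5_2", "i915-GVTg_V5_4", "i915-GVTg_V5_8"]
    print(slices_per_instance(types))
    # {'i915-GVTg_V5_1': 8, 'i915-GVTg_V5_2': 4, 'i915-GVTg_V5_4': 2, 'i915-GVTg_V5_8': 1}
```

With the naming Sean proposes (the count in the name rather than a divisor), the cost would be read straight from the type name and the "find the largest suffix, then divide" step would disappear.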
Re: device compatibility interface for live migration with assigned devices
On Wed, Aug 05, 2020 at 04:02:48PM +0800, Jason Wang wrote: > > On 2020/8/5 3:56 PM, Jiri Pirko wrote: > > Wed, Aug 05, 2020 at 04:41:54AM CEST, jasow...@redhat.com wrote: > > > On 2020/8/5 10:16 AM, Yan Zhao wrote: > > > > On Wed, Aug 05, 2020 at 10:22:15AM +0800, Jason Wang wrote: > > > > > On 2020/8/5 12:35 AM, Cornelia Huck wrote: > > > > > > [sorry about not chiming in earlier] > > > > > > > > > > > > On Wed, 29 Jul 2020 16:05:03 +0800 > > > > > > Yan Zhao wrote: > > > > > > > > > > > > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote: > > > > > > (...) > > > > > > > > > > > > > > Based on the feedback we've received, the previously proposed > > > > > > > > interface > > > > > > > > is not viable. I think there's agreement that the user needs > > > > > > > > to be > > > > > > > > able to parse and interpret the version information. Using > > > > > > > > json seems > > > > > > > > viable, but I don't know if it's the best option. Is there any > > > > > > > > precedent of markup strings returned via sysfs we could follow? > > > > > > I don't think encoding complex information in a sysfs file is a > > > > > > viable > > > > > > approach. Quoting Documentation/filesystems/sysfs.rst: > > > > > > > > > > > > "Attributes should be ASCII text files, preferably with only one > > > > > > value > > > > > > per file. It is noted that it may not be efficient to contain only > > > > > > one > > > > > > value per file, so it is socially acceptable to express an array of > > > > > > values of the same type. > > > > > > Mixing types, expressing multiple lines of data, and doing fancy > > > > > > formatting of data is heavily frowned upon." > > > > > > > > > > > > Even though this is an older file, I think these restrictions still > > > > > > apply. > > > > > +1, that's another reason why devlink(netlink) is better. > > > > > > > > > hi Jason, > > > > do you have any materials or sample code about devlink, so we can have > > > > a good > > > > study of it?
> > > > I found some kernel docs about it but my preliminary study didn't show > > > > me the > > > > advantage of devlink. > > > > > > CC Jiri and Parav for a better answer for this. > > > > > > My understanding is that the following advantages are obvious (as I > > > replied > > > in another thread): > > > > > > - existing users (NIC, crypto, SCSI, ib), mature and stable > > > - much better error reporting (ext_ack other than string or errno) > > > - namespace aware > > > - do not couple with kobject > > Jason, what is your use case? > > > I think the use case is to report device compatibility for live migration. > Yan proposed a simple sysfs based migration version first, but it does not look > sufficient and something based on JSON is being discussed. > > Yan, can you help to summarize the discussion so far for Jiri as a > reference? > yes. we are currently defining a device live migration compatibility interface in order to let user space like openstack and libvirt know which two devices are live migration compatible. currently the devices include mdev (a kernel emulated virtual device) and physical devices (e.g. a VF of a PCI SRIOV device). the attributes we want user space to compare include common attributes: device_api: vfio-pci, vfio-ccw... mdev_type: mdev type of mdev or similar signature for physical device It specifies a device's hardware capability. e.g. i915-GVTg_V5_4 means it's 1/4 of a gen9 Intel graphics device. software_version: device driver's version. in <major>.<minor>[.bugfix] scheme, where there is no compatibility across major versions, minor versions have forward compatibility (ex. 1 -> 2 is ok, 2 -> 1 is not) and bugfix version number indicates some degree of internal improvement that is not visible to the user in terms of features or compatibility. vendor specific attributes: each vendor may define different attributes device id : device id of a physical device or mdev's parent pci device.
it could be equal to pci id for pci devices aggregator: used together with mdev_type. e.g. aggregator=2 together with i915-GVTg_V5_4 means 2*1/4=1/2 of a gen9 Intel graphics device. remote_url: for a local NVMe VF, it may be configured with a remote url of a remote storage and all data is stored in the remote side specified by the remote url. ... Comparing those attributes by user space alone is not an easy job, as it can't simply assume an equal relationship between source attributes and target attributes. e.g. for a source device of mdev_type=i915-GVTg_V5_4,aggregator=2, (1/2 of gen9), it could actually find a compatible device of mdev_type=i915-GVTg_V5_8,aggregator=4 (also 1/2 of gen9), if mdev_type of i915-GVTg_V5_4 is not available in the target machine. So, in our current proposal, we want
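The aggregator example above comes down to comparing fractions of the parent device rather than comparing attribute strings for equality. A minimal sketch, assuming (as the example states) that a type suffix N means 1/N of the GPU and that aggregator=k multiplies that share; the "key=value,key=value" input form is taken from the example, and the helper names are hypothetical. A real check would also have to compare device_api, software_version and any vendor specific attributes.

```python
import re
from fractions import Fraction
from typing import Dict, Tuple

# Assumed meaning, taken from the example: "..._V5_N" is 1/N of a gen9 GPU,
# and aggregator=k multiplies that share by k.
_TYPE_RE = re.compile(r"(?P<base>.+_V\d+)_(?P<denom>\d+)")


def parse_attrs(spec: str) -> Dict[str, str]:
    """Parse "key=value,key=value" as written in the example."""
    return dict(item.split("=", 1) for item in spec.split(",") if item)


def gpu_share(spec: str) -> Tuple[str, Fraction]:
    """Return (type family, fraction of the parent GPU) for one device."""
    attrs = parse_attrs(spec)
    m = _TYPE_RE.fullmatch(attrs["mdev_type"])
    if m is None:
        raise ValueError(f"unrecognised mdev_type in {spec!r}")
    aggregator = int(attrs.get("aggregator", "1"))
    return m.group("base"), Fraction(aggregator, int(m.group("denom")))


def same_share(src_spec: str, dst_spec: str) -> bool:
    """True if both devices are the same family and the same share of the GPU."""
    return gpu_share(src_spec) == gpu_share(dst_spec)


if __name__ == "__main__":
    src = "mdev_type=i915-GVTg_V5_4,aggregator=2"
    print(gpu_share(src))                                             # ('i915-GVTg_V5', Fraction(1, 2))
    print(same_share(src, "mdev_type=i915-GVTg_V5_8,aggregator=4"))   # True
    print(same_share(src, "mdev_type=i915-GVTg_V5_8,aggregator=2"))   # False
```

Using `Fraction` keeps 2/4 and 4/8 exactly equal, which is the relationship a plain string comparison of the two attribute sets would miss.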
Re: device compatibility interface for live migration with assigned devices
* Yan Zhao (yan.y.z...@intel.com) wrote: > > > yes, including a device_api field is better. > > > for mdev, "device_type=vfio-mdev", is it right? > > > > No, vfio-mdev is not a device API, it's the driver that attaches to the > > mdev bus device to expose it through vfio. The device_api exposes the > > actual interface of the vfio device, it's also vfio-pci for typical > > mdev devices found on x86, but may be vfio-ccw, vfio-ap, etc... See > > VFIO_DEVICE_API_PCI_STRING and friends. > > > ok. got it. > > > > > > > device_id=8086591d > > > > > > > > Is device_id interpreted relative to device_type? How does this > > > > relate to mdev_type? If we have an mdev_type, doesn't that fully > > > > define the software API? > > > > > > > it's the parent pci id for mdev actually. > > > > If we need to specify the parent PCI ID then something is fundamentally > > wrong with the mdev_type. The mdev_type should define a unique, > > software compatible interface, regardless of the parent device IDs. If > > an i915-GVTg_V5_2 means different things based on the parent device IDs, > > then different mdev_types should be reported for those parent > > devices. > > > hmm, then do we allow vendor specific fields? > or is it a must that a vendor specific field should have a corresponding > vendor attribute? > > another thing is that the definition of mdev_type in GVT only corresponds > to vGPU computing ability currently, > e.g. i915-GVTg_V5_2 is 1/2 of a gen9 IGD, i915-GVTg_V4_2 is 1/2 of a > gen8 IGD. > It is too coarse-grained for live migration compatibility. Can you explain why that's too coarse? Is this because it's too specific (i.e. that an i915-GVTg_V4_2 could be migrated to a newer device?), or that it's too specific on the exact sizings (i.e. that there may be multiple different sizes of a gen9)? Dave > Do you think we need to update GVT's definition of mdev_type? > And is there any guidance on mdev_type definitions?
> > > > > > > mdev_type=i915-GVTg_V5_2 > > > > > > > > And how are non-mdev devices represented? > > > > > > > non-mdev can opt to not include this field, or as you said below, a > > > vendor signature. > > > > > > > > > aggregator=1 > > > > > > pv_mode="none+ppgtt+context" > > > > > > > > These are meaningless vendor specific matches afaict. > > > > > > > yes, pv_mode and aggregator are vendor specific fields. > > > but they are important for deciding whether two devices are compatible. > > > pv_mode means whether a vGPU supports guest paravirtualized api. > > > "none+ppgtt+context" means the guest can use no pv, ppgtt mode pv, or > > > context mode pv. > > > > > > > > > interface_version=3 > > > > > > > > Not much granularity here, I prefer Sean's previous > > > > <major>.<minor>[.bugfix] scheme. > > > > > > > yes, the <major>.<minor>[.bugfix] scheme may be better, but I'm not sure if > > > it works for a complicated scenario. > > > e.g. for pv_mode, > > > (1) initially, pv_mode is not supported, so it's pv_mode=none, it's > > > 0.0.0, > > > (2) then, pv_mode=ppgtt is supported, pv_mode="none+ppgtt", it's 0.1.0, > > > indicating pv_mode=none can migrate to pv_mode="none+ppgtt", but not vice > > > versa. > > > (3) later, pv_mode=context is also supported, > > > pv_mode="none+ppgtt+context", so it's 0.2.0. > > > > > > But if pv_mode=ppgtt is later removed, leaving pv_mode="none+context", how > > > should its version be named? "none+ppgtt" (0.1.0) is not compatible with > > > "none+context", but "none+ppgtt+context" (0.2.0) is compatible with > > > "none+context". > > > > If pv_mode=ppgtt is removed, then the compatible versions would be > > 0.0.0 or 1.0.0, i.e. the major version would be incremented due to > > feature removal. > > > > > Maintaining such a scheme is painful for the vendor driver. > > > > Migration compatibility is painful, there's no way around that.
I > > think the version scheme is an attempt to push some of that low level > > burden on the vendor driver, otherwise the management tools need to > > work on an ever growing matrix of vendor specific features which is > > going to become unwieldy and is largely meaningless outside of the > > vendor driver. Instead, the vendor driver can make strategic decisions > > about where to continue to maintain a support burden and make explicit > > decisions to maintain or break compatibility. The version scheme is a > > simplification and abstraction of vendor driver features in order to > > create a small, logical compatibility matrix. Compromises necessarily > > need to be made for that to occur. > > > ok. got it. > > > > > > > COMPATIBLE: > > > > > > device_type=pci > > > > > > device_id=8086591d > > > > > > mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8} > > > > > this mixed notation will be hard to parse so i would avoid that. > > > > > > > > Some background, Intel has been proposing aggregation as a solution to > > > > how we scale mdev devices when hardware exposes large numbers of > > > > assignable objects that can be
Re: device compatibility interface for live migration with assigned devices
On 2020/8/5 3:56 PM, Jiri Pirko wrote: Wed, Aug 05, 2020 at 04:41:54AM CEST, jasow...@redhat.com wrote: On 2020/8/5 10:16 AM, Yan Zhao wrote: On Wed, Aug 05, 2020 at 10:22:15AM +0800, Jason Wang wrote: On 2020/8/5 12:35 AM, Cornelia Huck wrote: [sorry about not chiming in earlier] On Wed, 29 Jul 2020 16:05:03 +0800 Yan Zhao wrote: On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote: (...) Based on the feedback we've received, the previously proposed interface is not viable. I think there's agreement that the user needs to be able to parse and interpret the version information. Using json seems viable, but I don't know if it's the best option. Is there any precedent of markup strings returned via sysfs we could follow? I don't think encoding complex information in a sysfs file is a viable approach. Quoting Documentation/filesystems/sysfs.rst: "Attributes should be ASCII text files, preferably with only one value per file. It is noted that it may not be efficient to contain only one value per file, so it is socially acceptable to express an array of values of the same type. Mixing types, expressing multiple lines of data, and doing fancy formatting of data is heavily frowned upon." Even though this is an older file, I think these restrictions still apply. +1, that's another reason why devlink(netlink) is better. hi Jason, do you have any materials or sample code about devlink, so we can have a good study of it? I found some kernel docs about it but my preliminary study didn't show me the advantage of devlink. CC Jiri and Parav for a better answer for this. My understanding is that the following advantages are obvious (as I replied in another thread): - existing users (NIC, crypto, SCSI, ib), mature and stable - much better error reporting (ext_ack other than string or errno) - namespace aware - do not couple with kobject Jason, what is your use case? I think the use case is to report device compatibility for live migration.
Yan proposed a simple sysfs based migration version first, but it does not look sufficient and something based on JSON is being discussed. Yan, can you help to summarize the discussion so far for Jiri as a reference? Thanks Thanks Thanks Yan
Re: device compatibility interface for live migration with assigned devices
On 2020/8/5 10:16 AM, Yan Zhao wrote: On Wed, Aug 05, 2020 at 10:22:15AM +0800, Jason Wang wrote: On 2020/8/5 12:35 AM, Cornelia Huck wrote: [sorry about not chiming in earlier] On Wed, 29 Jul 2020 16:05:03 +0800 Yan Zhao wrote: On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote: (...) Based on the feedback we've received, the previously proposed interface is not viable. I think there's agreement that the user needs to be able to parse and interpret the version information. Using json seems viable, but I don't know if it's the best option. Is there any precedent of markup strings returned via sysfs we could follow? I don't think encoding complex information in a sysfs file is a viable approach. Quoting Documentation/filesystems/sysfs.rst: "Attributes should be ASCII text files, preferably with only one value per file. It is noted that it may not be efficient to contain only one value per file, so it is socially acceptable to express an array of values of the same type. Mixing types, expressing multiple lines of data, and doing fancy formatting of data is heavily frowned upon." Even though this is an older file, I think these restrictions still apply. +1, that's another reason why devlink(netlink) is better. hi Jason, do you have any materials or sample code about devlink, so we can have a good study of it? I found some kernel docs about it but my preliminary study didn't show me the advantage of devlink. CC Jiri and Parav for a better answer for this. My understanding is that the following advantages are obvious (as I replied in another thread): - existing users (NIC, crypto, SCSI, ib), mature and stable - much better error reporting (ext_ack other than string or errno) - namespace aware - do not couple with kobject Thanks Thanks Yan
Re: device compatibility interface for live migration with assigned devices
On Wed, Aug 05, 2020 at 10:22:15AM +0800, Jason Wang wrote: > > On 2020/8/5 12:35 AM, Cornelia Huck wrote: > > [sorry about not chiming in earlier] > > > > On Wed, 29 Jul 2020 16:05:03 +0800 > > Yan Zhao wrote: > > > > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote: > > (...) > > > > > > Based on the feedback we've received, the previously proposed interface > > > > is not viable. I think there's agreement that the user needs to be > > > > able to parse and interpret the version information. Using json seems > > > > viable, but I don't know if it's the best option. Is there any > > > > precedent of markup strings returned via sysfs we could follow? > > I don't think encoding complex information in a sysfs file is a viable > > approach. Quoting Documentation/filesystems/sysfs.rst: > > > > "Attributes should be ASCII text files, preferably with only one value > > per file. It is noted that it may not be efficient to contain only one > > value per file, so it is socially acceptable to express an array of > > values of the same type. > > Mixing types, expressing multiple lines of data, and doing fancy > > formatting of data is heavily frowned upon." > > > > Even though this is an older file, I think these restrictions still > > apply. > > > +1, that's another reason why devlink(netlink) is better. > hi Jason, do you have any materials or sample code about devlink, so we can have a good study of it? I found some kernel docs about it but my preliminary study didn't show me the advantage of devlink. Thanks Yan
Re: device compatibility interface for live migration with assigned devices
On 2020/8/5 12:35 AM, Cornelia Huck wrote: [sorry about not chiming in earlier] On Wed, 29 Jul 2020 16:05:03 +0800 Yan Zhao wrote: On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote: (...) Based on the feedback we've received, the previously proposed interface is not viable. I think there's agreement that the user needs to be able to parse and interpret the version information. Using json seems viable, but I don't know if it's the best option. Is there any precedent of markup strings returned via sysfs we could follow? I don't think encoding complex information in a sysfs file is a viable approach. Quoting Documentation/filesystems/sysfs.rst: "Attributes should be ASCII text files, preferably with only one value per file. It is noted that it may not be efficient to contain only one value per file, so it is socially acceptable to express an array of values of the same type. Mixing types, expressing multiple lines of data, and doing fancy formatting of data is heavily frowned upon." Even though this is an older file, I think these restrictions still apply. +1, that's another reason why devlink(netlink) is better. Thanks
Re: device compatibility interface for live migration with assigned devices
[sorry about not chiming in earlier] On Wed, 29 Jul 2020 16:05:03 +0800 Yan Zhao wrote: > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote: (...) > > Based on the feedback we've received, the previously proposed interface > > is not viable. I think there's agreement that the user needs to be > > able to parse and interpret the version information. Using json seems > > viable, but I don't know if it's the best option. Is there any > > precedent of markup strings returned via sysfs we could follow? I don't think encoding complex information in a sysfs file is a viable approach. Quoting Documentation/filesystems/sysfs.rst: "Attributes should be ASCII text files, preferably with only one value per file. It is noted that it may not be efficient to contain only one value per file, so it is socially acceptable to express an array of values of the same type. Mixing types, expressing multiple lines of data, and doing fancy formatting of data is heavily frowned upon." Even though this is an older file, I think these restrictions still apply. > I found some examples of using formatted string under /sys, mostly under > tracing. maybe we can do a similar implementation. > > #cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format Note that this is *not* sysfs (anything under debug/ follows different rules anyway!) 
> > name: kvm_mmio > ID: 32 > format: > field:unsigned short common_type; offset:0; size:2; > signed:0; > field:unsigned char common_flags; offset:2; size:1; > signed:0; > field:unsigned char common_preempt_count; offset:3; > size:1; signed:0; > field:int common_pid; offset:4; size:4; signed:1; > > field:u32 type; offset:8; size:4; signed:0; > field:u32 len; offset:12; size:4; signed:0; > field:u64 gpa; offset:16; size:8; signed:0; > field:u64 val; offset:24; size:8; signed:0; > > print fmt: "mmio %s len %u gpa 0x%llx val 0x%llx", > __print_symbolic(REC->type, { 0, "unsatisfied-read" }, { 1, "read" }, { 2, > "write" }), REC->len, REC->gpa, REC->val > > > #cat /sys/devices/pci:00/:00:02.0/uevent 'uevent' can probably be considered a special case, I would not really want to copy it. > DRIVER=vfio-pci > PCI_CLASS=3 > PCI_ID=8086:591D > PCI_SUBSYS_ID=8086:2212 > PCI_SLOT_NAME=:00:02.0 > MODALIAS=pci:v8086d591Dsv8086sd2212bc03sc00i00 > (...) > what about a migration_compatible attribute under device node like > below? > > #cat /sys/bus/pci/devices/\:00\:02.0/UUID1/migration_compatible > SELF: > device_type=pci > device_id=8086591d > mdev_type=i915-GVTg_V5_2 > aggregator=1 > pv_mode="none+ppgtt+context" > interface_version=3 > COMPATIBLE: > device_type=pci > device_id=8086591d > mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8} > aggregator={val1}/2 > pv_mode={val2:string:"none+ppgtt","none+context","none+ppgtt+context"} > interface_version={val3:int:2,3} > COMPATIBLE: > device_type=pci > device_id=8086591d > mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8} > aggregator={val1}/2 > pv_mode="" #"" meaning empty, could be absent in a compatible device > interface_version=1 I'd consider anything of a comparable complexity to be a big no-no. If anything, this needs to be split into individual files (with many of them being vendor driver specific anyway.) I think we can list compatible versions in a range/list format, though. 
Something like cat interface_version 2.1.3 cat interface_version_compatible 2.0.2-2.0.4,2.1.0- (indicating that versions 2.0.{2,3,4} and all versions after 2.1.0 are compatible, considering versions <2 and >2 incompatible by default) Possible compatibility between different mdev types feels a bit odd to me, and should not be included by default (only if it makes sense for a particular vendor driver.)
Re: device compatibility interface for live migration with assigned devices
> > yes, including a device_api field is better. > > for mdev, "device_type=vfio-mdev", is it right? > > No, vfio-mdev is not a device API, it's the driver that attaches to the > mdev bus device to expose it through vfio. The device_api exposes the > actual interface of the vfio device, it's also vfio-pci for typical > mdev devices found on x86, but may be vfio-ccw, vfio-ap, etc... See > VFIO_DEVICE_API_PCI_STRING and friends. > ok. got it. > > > > > device_id=8086591d > > > > > > Is device_id interpreted relative to device_type? How does this > > > relate to mdev_type? If we have an mdev_type, doesn't that fully > > > define the software API? > > > > > it's the parent pci id for mdev actually. > > If we need to specify the parent PCI ID then something is fundamentally > wrong with the mdev_type. The mdev_type should define a unique, > software compatible interface, regardless of the parent device IDs. If > an i915-GVTg_V5_2 means different things based on the parent device IDs, > then different mdev_types should be reported for those parent > devices. > hmm, then do we allow vendor specific fields? or is it a must that a vendor specific field should have a corresponding vendor attribute? another thing is that the definition of mdev_type in GVT only corresponds to vGPU computing ability currently, e.g. i915-GVTg_V5_2 is 1/2 of a gen9 IGD, i915-GVTg_V4_2 is 1/2 of a gen8 IGD. It is too coarse-grained for live migration compatibility. Do you think we need to update GVT's definition of mdev_type? And is there any guidance on mdev_type definitions? > > > > > mdev_type=i915-GVTg_V5_2 > > > > > > And how are non-mdev devices represented? > > > > > non-mdev can opt to not include this field, or as you said below, a > > vendor signature. > > > > > > > aggregator=1 > > > > > pv_mode="none+ppgtt+context" > > > > > > These are meaningless vendor specific matches afaict. > > > > > yes, pv_mode and aggregator are vendor specific fields.
> > but they are important for deciding whether two devices are compatible. > > pv_mode means whether a vGPU supports guest paravirtualized api. > > "none+ppgtt+context" means the guest can use no pv, ppgtt mode pv, or > > context mode pv. > > > > > > > interface_version=3 > > > > > > Not much granularity here, I prefer Sean's previous > > > <major>.<minor>[.bugfix] scheme. > > > > > yes, the <major>.<minor>[.bugfix] scheme may be better, but I'm not sure if > > it works for a complicated scenario. > > e.g. for pv_mode, > > (1) initially, pv_mode is not supported, so it's pv_mode=none, it's 0.0.0, > > (2) then, pv_mode=ppgtt is supported, pv_mode="none+ppgtt", it's 0.1.0, > > indicating pv_mode=none can migrate to pv_mode="none+ppgtt", but not vice > > versa. > > (3) later, pv_mode=context is also supported, > > pv_mode="none+ppgtt+context", so it's 0.2.0. > > > > But if pv_mode=ppgtt is later removed, leaving pv_mode="none+context", how > > should its version be named? "none+ppgtt" (0.1.0) is not compatible with > > "none+context", but "none+ppgtt+context" (0.2.0) is compatible with > > "none+context". > > If pv_mode=ppgtt is removed, then the compatible versions would be > 0.0.0 or 1.0.0, i.e. the major version would be incremented due to > feature removal. > > > Maintaining such a scheme is painful for the vendor driver. > > Migration compatibility is painful, there's no way around that. I > think the version scheme is an attempt to push some of that low level > burden on the vendor driver, otherwise the management tools need to > work on an ever growing matrix of vendor specific features which is > going to become unwieldy and is largely meaningless outside of the > vendor driver. Instead, the vendor driver can make strategic decisions > about where to continue to maintain a support burden and make explicit > decisions to maintain or break compatibility. The version scheme is a > simplification and abstraction of vendor driver features in order to > create a small, logical compatibility matrix.
Compromises necessarily > need to be made for that to occur. > ok. got it. > > > > > COMPATIBLE: > > > > > device_type=pci > > > > > device_id=8086591d > > > > > mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8} > > > > this mixed notation will be hard to parse so i would avoid that. > > > > > > Some background, Intel has been proposing aggregation as a solution to > > > how we scale mdev devices when hardware exposes large numbers of > > > assignable objects that can be composed in essentially arbitrary ways. > > > So for instance, if we have a workqueue (wq), we might have an mdev > > > type for 1wq, 2wq, 3wq,... Nwq. It's not really practical to expose a > > > discrete mdev type for each of those, so they want to define a base > > > type which is composable to other types via this aggregation. This is > > > what this substitution and tagging is attempting to accomplish. So > > > imagine this set of values for cases where it's not practical to unroll > >
Re: device compatibility interface for live migration with assigned devices
On Thu, 30 Jul 2020 11:41:04 +0800 Yan Zhao wrote: > On Wed, Jul 29, 2020 at 01:12:55PM -0600, Alex Williamson wrote: > > On Wed, 29 Jul 2020 12:28:46 +0100 > > Sean Mooney wrote: > > > > > On Wed, 2020-07-29 at 16:05 +0800, Yan Zhao wrote: > > > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote: > > > > > On Mon, 27 Jul 2020 15:24:40 +0800 > > > > > Yan Zhao wrote: > > > > > > > > > > > > > As you indicate, the vendor driver is responsible for checking > > > > > > > > version > > > > > > > > information embedded within the migration stream. Therefore a > > > > > > > > migration should fail early if the devices are incompatible. > > > > > > > > Is it > > > > > > > > > > > > > > but as far as I know, currently in VFIO migration protocol, we have no > > > > > > > way to > > > > > > > get a vendor specific compatibility checking string in migration > > > > > > > setup stage > > > > > > > (i.e. .save_setup stage) before the device is set to _SAVING > > > > > > > state. > > > > > > > In this way, for devices that do not save device data in precopy > > > > > > > stage, > > > > > > > the migration compatibility checking is as late as in > > > > > > > stop-and-copy > > > > > > > stage, which is too late. > > > > > > > do you think we need to add the getting/checking of vendor > > > > > > > specific > > > > > > > compatibility string early in save_setup stage? > > > > > > > > > > > > > > > > > > > hi Alex, > > > > > > after an offline discussion with Kevin, I realized that it may not > > > > > > be a > > > > > > problem if the migration compatibility check in the vendor driver occurs > > > > > > late in > > > > > > the stop-and-copy phase for some devices, because if we report device > > > > > > compatibility attributes clearly in an interface, the chance of > > > > > > libvirt/openstack making a wrong decision is small.
> > > > > > > > > > I think it would be wise for a vendor driver to implement a pre-copy > > > > > phase, even if only to send version information and verify it at the > > > > > target. Deciding you have no device state to send during pre-copy > > > > > does > > > > > not mean your vendor driver needs to opt-out of the pre-copy phase > > > > > entirely. Please also note that pre-copy is at the user's discretion, > > > > > we've defined that we can enter stop-and-copy at any point, including > > > > > without a pre-copy phase, so I would recommend that vendor drivers > > > > > validate compatibility at the start of both the pre-copy and the > > > > > stop-and-copy phases. > > > > > > > > > > > > > ok. got it! > > > > > > > > > > so, do you think we are now arriving at an agreement that we'll > > > > > > give up > > > > > > the read-and-test scheme and start to defining one interface > > > > > > (perhaps in > > > > > > json format), from which libvirt/openstack is able to parse and > > > > > > find out > > > > > > compatibility list of a source mdev/physical device? > > > > > > > > > > Based on the feedback we've received, the previously proposed > > > > > interface > > > > > is not viable. I think there's agreement that the user needs to be > > > > > able to parse and interpret the version information. Using json seems > > > > > viable, but I don't know if it's the best option. Is there any > > > > > precedent of markup strings returned via sysfs we could follow? > > > > > > > > I found some examples of using formatted string under /sys, mostly under > > > > tracing. maybe we can do a similar implementation. 
> > > > > > > > #cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format > > > > > > > > name: kvm_mmio > > > > ID: 32 > > > > format: > > > > field:unsigned short common_type; offset:0; size:2; > > > > signed:0; > > > > field:unsigned char common_flags; offset:2; size:1; > > > > signed:0; > > > > field:unsigned char common_preempt_count; offset:3; > > > > size:1; signed:0; > > > > field:int common_pid; offset:4; size:4; signed:1; > > > > > > > > field:u32 type; offset:8; size:4; signed:0; > > > > field:u32 len; offset:12; size:4; signed:0; > > > > field:u64 gpa; offset:16; size:8; signed:0; > > > > field:u64 val; offset:24; size:8; signed:0; > > > > > > > > print fmt: "mmio %s len %u gpa 0x%llx val 0x%llx", > > > > __print_symbolic(REC->type, { 0, "unsatisfied-read" }, { 1, "read" > > > > }, { 2, "write" }), REC->len, REC->gpa, REC->val > > > > > > > this is not json fromat and its not supper frendly to parse. > > > > > > > > #cat /sys/devices/pci:00/:00:02.0/uevent > > > > DRIVER=vfio-pci > > > > PCI_CLASS=3 > > > > PCI_ID=8086:591D > > > > PCI_SUBSYS_ID=8086:2212 > > > > PCI_SLOT_NAME=:00:02.0 > > > > MODALIAS=pci:v8086d591Dsv8086sd2212bc03sc00i00 > > > > > > > this is ini format or conf formant > > > this is pretty simple to
Re: device compatibility interface for live migration with assigned devices
On Thu, 2020-07-30 at 11:41 +0800, Yan Zhao wrote: > > > >interface_version=3 > > > > Not much granularity here, I prefer Sean's previous > > .[.bugfix] scheme. > > > > yes, .[.bugfix] scheme may be better, but I'm not sure if > it works for a complicated scenario. > e.g for pv_mode, > (1) initially, pv_mode is not supported, so it's pv_mode=none, it's 0.0.0, > (2) then, pv_mode=ppgtt is supported, pv_mode="none+ppgtt", it's 0.1.0, > indicating pv_mode=none can migrate to pv_mode="none+ppgtt", but not vice > versa. > (3) later, pv_mode=context is also supported, > pv_mode="none+ppgtt+context", so it's 0.2.0. > > But if later, pv_mode=ppgtt is removed. pv_mode="none+context", how to > name its version? It would become 1.0.0. Adding a feature is a minor version bump, as it's backwards compatible: if you don't request the new feature you don't need to use it, and the device can continue to behave like a 0.0.0 device even if it's capable of acting as a 0.1.0 device. Removing a feature is backwards incompatible, as any instance that was previously using it would no longer work, so you have to bump the major version. > "none+ppgtt" (0.1.0) is not compatible to > "none+context", but "none+ppgtt+context" (0.2.0) is compatible to > "none+context". > > Maintain such scheme is painful to vendor driver. Not really; it's how most software libraries are versioned today. Some use other schemes, but semantic versioning done right is a concise and easy-to-consume set of rules: https://semver.org/ However, you are right that it forces vendors to think about backwards and forwards compatibility with each change, which for the most part is a good thing. It goes hand in hand with having stable ABI and API definitions, to ensure firmware updates and driver changes don't break userspace that depends on the kernel interfaces they expose.
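The rule Sean describes can be written as a small check. This is a sketch under stated assumptions: versions are `major.minor[.bugfix]` strings, and a target accepts a source when the major numbers match and the target's minor is at least the source's (the target supports a superset of features). The bugfix component is ignored. The function names are hypothetical.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    bugfix: int = 0


def parse_version(text):
    # Accept "major.minor" or "major.minor.bugfix".
    parts = [int(p) for p in text.split(".")]
    if not 2 <= len(parts) <= 3:
        raise ValueError(f"bad version string: {text!r}")
    return Version(*parts)


def can_migrate(source, target):
    """Same major version, and the target's minor is >= the source's minor."""
    src, dst = parse_version(source), parse_version(target)
    return src.major == dst.major and dst.minor >= src.minor
```

Applied to Yan's pv_mode example, 0.0.0 can move to 0.1.0 but not back. No 0.x device can move to the 1.0.0 "none+context" device, including the 0.2.0 "none+ppgtt+context" device Yan describes as compatible with it. A single version number cannot express that case, which is the granularity concern being raised.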
Re: device compatibility interface for live migration with assigned devices
On Thu, 2020-07-30 at 09:56 +0800, Yan Zhao wrote: > On Wed, Jul 29, 2020 at 12:28:46PM +0100, Sean Mooney wrote: > > On Wed, 2020-07-29 at 16:05 +0800, Yan Zhao wrote: > > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote: > > > > On Mon, 27 Jul 2020 15:24:40 +0800 > > > > Yan Zhao wrote: > > > > > > > > > > > As you indicate, the vendor driver is responsible for checking > > > > > > > version > > > > > > > information embedded within the migration stream. Therefore a > > > > > > > migration should fail early if the devices are incompatible. Is > > > > > > > it > > > > > > > > > > > > but as I know, currently in VFIO migration protocol, we have no way > > > > > > to > > > > > > get vendor specific compatibility checking string in migration > > > > > > setup stage > > > > > > (i.e. .save_setup stage) before the device is set to _SAVING state. > > > > > > In this way, for devices who does not save device data in precopy > > > > > > stage, > > > > > > the migration compatibility checking is as late as in stop-and-copy > > > > > > stage, which is too late. > > > > > > do you think we need to add the getting/checking of vendor specific > > > > > > compatibility string early in save_setup stage? > > > > > > > > > > > > > > > > hi Alex, > > > > > after an offline discussion with Kevin, I realized that it may not be > > > > > a > > > > > problem if migration compatibility check in vendor driver occurs late > > > > > in > > > > > stop-and-copy phase for some devices, because if we report device > > > > > compatibility attributes clearly in an interface, the chances for > > > > > libvirt/openstack to make a wrong decision is little. > > > > > > > > I think it would be wise for a vendor driver to implement a pre-copy > > > > phase, even if only to send version information and verify it at the > > > > target. 
Deciding you have no device state to send during pre-copy does > > > > not mean your vendor driver needs to opt-out of the pre-copy phase > > > > entirely. Please also note that pre-copy is at the user's discretion, > > > > we've defined that we can enter stop-and-copy at any point, including > > > > without a pre-copy phase, so I would recommend that vendor drivers > > > > validate compatibility at the start of both the pre-copy and the > > > > stop-and-copy phases. > > > > > > > > > > ok. got it! > > > > > > > > so, do you think we are now arriving at an agreement that we'll give > > > > > up > > > > > the read-and-test scheme and start to defining one interface (perhaps > > > > > in > > > > > json format), from which libvirt/openstack is able to parse and find > > > > > out > > > > > compatibility list of a source mdev/physical device? > > > > > > > > Based on the feedback we've received, the previously proposed interface > > > > is not viable. I think there's agreement that the user needs to be > > > > able to parse and interpret the version information. Using json seems > > > > viable, but I don't know if it's the best option. Is there any > > > > precedent of markup strings returned via sysfs we could follow? > > > > > > I found some examples of using formatted string under /sys, mostly under > > > tracing. maybe we can do a similar implementation. 
> > > > > > #cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format > > > > > > name: kvm_mmio > > > ID: 32 > > > format: > > > field:unsigned short common_type; offset:0; size:2; > > > signed:0; > > > field:unsigned char common_flags; offset:2; size:1; > > > signed:0; > > > field:unsigned char common_preempt_count; offset:3; > > > size:1; signed:0; > > > field:int common_pid; offset:4; size:4; signed:1; > > > > > > field:u32 type; offset:8; size:4; signed:0; > > > field:u32 len; offset:12; size:4; signed:0; > > > field:u64 gpa; offset:16; size:8; signed:0; > > > field:u64 val; offset:24; size:8; signed:0; > > > > > > print fmt: "mmio %s len %u gpa 0x%llx val 0x%llx", > > > __print_symbolic(REC->type, { 0, "unsatisfied-read" }, { 1, > > > "read" > > > }, { 2, "write" }), REC->len, REC->gpa, REC->val > > > > > > > this is not json fromat and its not supper frendly to parse. > > yes, it's just an example. It's exported to be used by userspace perf & > trace_cmd. > > > > > > > #cat /sys/devices/pci:00/:00:02.0/uevent > > > DRIVER=vfio-pci > > > PCI_CLASS=3 > > > PCI_ID=8086:591D > > > PCI_SUBSYS_ID=8086:2212 > > > PCI_SLOT_NAME=:00:02.0 > > > MODALIAS=pci:v8086d591Dsv8086sd2212bc03sc00i00 > > > > > > > this is ini format or conf formant > > this is pretty simple to parse whichi would be fine. > > that said you could also have a version or capablitiy directory with a file > > for each key and a singel value. > > > > if this is easy for openstack, maybe we can organize the data like below way? > > |- [device] > |- migration >
Re: device compatibility interface for live migration with assigned devices
On Wed, Jul 29, 2020 at 01:12:55PM -0600, Alex Williamson wrote: > On Wed, 29 Jul 2020 12:28:46 +0100 > Sean Mooney wrote: > > > On Wed, 2020-07-29 at 16:05 +0800, Yan Zhao wrote: > > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote: > > > > On Mon, 27 Jul 2020 15:24:40 +0800 > > > > Yan Zhao wrote: > > > > > > > > > > > As you indicate, the vendor driver is responsible for checking > > > > > > > version > > > > > > > information embedded within the migration stream. Therefore a > > > > > > > migration should fail early if the devices are incompatible. Is > > > > > > > it > > > > > > > > > > > > but as I know, currently in VFIO migration protocol, we have no way > > > > > > to > > > > > > get vendor specific compatibility checking string in migration > > > > > > setup stage > > > > > > (i.e. .save_setup stage) before the device is set to _SAVING state. > > > > > > In this way, for devices who does not save device data in precopy > > > > > > stage, > > > > > > the migration compatibility checking is as late as in stop-and-copy > > > > > > stage, which is too late. > > > > > > do you think we need to add the getting/checking of vendor specific > > > > > > compatibility string early in save_setup stage? > > > > > > > > > > > > > > > > hi Alex, > > > > > after an offline discussion with Kevin, I realized that it may not be > > > > > a > > > > > problem if migration compatibility check in vendor driver occurs late > > > > > in > > > > > stop-and-copy phase for some devices, because if we report device > > > > > compatibility attributes clearly in an interface, the chances for > > > > > libvirt/openstack to make a wrong decision is little. > > > > > > > > I think it would be wise for a vendor driver to implement a pre-copy > > > > phase, even if only to send version information and verify it at the > > > > target. 
Deciding you have no device state to send during pre-copy does > > > > not mean your vendor driver needs to opt-out of the pre-copy phase > > > > entirely. Please also note that pre-copy is at the user's discretion, > > > > we've defined that we can enter stop-and-copy at any point, including > > > > without a pre-copy phase, so I would recommend that vendor drivers > > > > validate compatibility at the start of both the pre-copy and the > > > > stop-and-copy phases. > > > > > > > > > > ok. got it! > > > > > > > > so, do you think we are now arriving at an agreement that we'll give > > > > > up > > > > > the read-and-test scheme and start to defining one interface (perhaps > > > > > in > > > > > json format), from which libvirt/openstack is able to parse and find > > > > > out > > > > > compatibility list of a source mdev/physical device? > > > > > > > > Based on the feedback we've received, the previously proposed interface > > > > is not viable. I think there's agreement that the user needs to be > > > > able to parse and interpret the version information. Using json seems > > > > viable, but I don't know if it's the best option. Is there any > > > > precedent of markup strings returned via sysfs we could follow? > > > > > > I found some examples of using formatted string under /sys, mostly under > > > tracing. maybe we can do a similar implementation. 
> > > > > > #cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format > > > > > > name: kvm_mmio > > > ID: 32 > > > format: > > > field:unsigned short common_type; offset:0; size:2; > > > signed:0; > > > field:unsigned char common_flags; offset:2; size:1; > > > signed:0; > > > field:unsigned char common_preempt_count; offset:3; > > > size:1; signed:0; > > > field:int common_pid; offset:4; size:4; signed:1; > > > > > > field:u32 type; offset:8; size:4; signed:0; > > > field:u32 len; offset:12; size:4; signed:0; > > > field:u64 gpa; offset:16; size:8; signed:0; > > > field:u64 val; offset:24; size:8; signed:0; > > > > > > print fmt: "mmio %s len %u gpa 0x%llx val 0x%llx", > > > __print_symbolic(REC->type, { 0, "unsatisfied-read" }, { 1, "read" > > > }, { 2, "write" }), REC->len, REC->gpa, REC->val > > > > > this is not json fromat and its not supper frendly to parse. > > > > > > #cat /sys/devices/pci:00/:00:02.0/uevent > > > DRIVER=vfio-pci > > > PCI_CLASS=3 > > > PCI_ID=8086:591D > > > PCI_SUBSYS_ID=8086:2212 > > > PCI_SLOT_NAME=:00:02.0 > > > MODALIAS=pci:v8086d591Dsv8086sd2212bc03sc00i00 > > > > > this is ini format or conf formant > > this is pretty simple to parse whichi would be fine. > > that said you could also have a version or capablitiy directory with a file > > for each key and a singel value. > > > > i would prefer to only have to do one read personally the list the files in > > directory and then read tehm all ot build the datastucture myself but that > > is > > doable though the simple ini
Re: device compatibility interface for live migration with assigned devices
On Wed, Jul 29, 2020 at 12:28:46PM +0100, Sean Mooney wrote: > On Wed, 2020-07-29 at 16:05 +0800, Yan Zhao wrote: > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote: > > > On Mon, 27 Jul 2020 15:24:40 +0800 > > > Yan Zhao wrote: > > > > > > > > > As you indicate, the vendor driver is responsible for checking > > > > > > version > > > > > > information embedded within the migration stream. Therefore a > > > > > > migration should fail early if the devices are incompatible. Is it > > > > > > > > > > > > > > > > but as I know, currently in VFIO migration protocol, we have no way to > > > > > get vendor specific compatibility checking string in migration setup > > > > > stage > > > > > (i.e. .save_setup stage) before the device is set to _SAVING state. > > > > > In this way, for devices who does not save device data in precopy > > > > > stage, > > > > > the migration compatibility checking is as late as in stop-and-copy > > > > > stage, which is too late. > > > > > do you think we need to add the getting/checking of vendor specific > > > > > compatibility string early in save_setup stage? > > > > > > > > > > > > > hi Alex, > > > > after an offline discussion with Kevin, I realized that it may not be a > > > > problem if migration compatibility check in vendor driver occurs late in > > > > stop-and-copy phase for some devices, because if we report device > > > > compatibility attributes clearly in an interface, the chances for > > > > libvirt/openstack to make a wrong decision is little. > > > > > > I think it would be wise for a vendor driver to implement a pre-copy > > > phase, even if only to send version information and verify it at the > > > target. Deciding you have no device state to send during pre-copy does > > > not mean your vendor driver needs to opt-out of the pre-copy phase > > > entirely. 
Please also note that pre-copy is at the user's discretion, > > > we've defined that we can enter stop-and-copy at any point, including > > > without a pre-copy phase, so I would recommend that vendor drivers > > > validate compatibility at the start of both the pre-copy and the > > > stop-and-copy phases. > > > > > > > ok. got it! > > > > > > so, do you think we are now arriving at an agreement that we'll give up > > > > the read-and-test scheme and start to defining one interface (perhaps in > > > > json format), from which libvirt/openstack is able to parse and find out > > > > compatibility list of a source mdev/physical device? > > > > > > Based on the feedback we've received, the previously proposed interface > > > is not viable. I think there's agreement that the user needs to be > > > able to parse and interpret the version information. Using json seems > > > viable, but I don't know if it's the best option. Is there any > > > precedent of markup strings returned via sysfs we could follow? > > > > I found some examples of using formatted string under /sys, mostly under > > tracing. maybe we can do a similar implementation. > > > > #cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format > > > > name: kvm_mmio > > ID: 32 > > format: > > field:unsigned short common_type; offset:0; size:2; > > signed:0; > > field:unsigned char common_flags; offset:2; size:1; > > signed:0; > > field:unsigned char common_preempt_count; offset:3; > > size:1; signed:0; > > field:int common_pid; offset:4; size:4; signed:1; > > > > field:u32 type; offset:8; size:4; signed:0; > > field:u32 len; offset:12; size:4; signed:0; > > field:u64 gpa; offset:16; size:8; signed:0; > > field:u64 val; offset:24; size:8; signed:0; > > > > print fmt: "mmio %s len %u gpa 0x%llx val 0x%llx", > > __print_symbolic(REC->type, { 0, "unsatisfied-read" }, { 1, "read" > > }, { 2, "write" }), REC->len, REC->gpa, REC->val > > > this is not json fromat and its not supper frendly to parse. 
yes, it's just an example. It's exported to be used by userspace perf & trace-cmd. > > > > #cat /sys/devices/pci:00/:00:02.0/uevent > > DRIVER=vfio-pci > > PCI_CLASS=3 > > PCI_ID=8086:591D > > PCI_SUBSYS_ID=8086:2212 > > PCI_SLOT_NAME=:00:02.0 > > MODALIAS=pci:v8086d591Dsv8086sd2212bc03sc00i00 > > > this is ini format or conf formant > this is pretty simple to parse whichi would be fine. > that said you could also have a version or capablitiy directory with a file > for each key and a singel value. >
If this is easy for openstack, maybe we can organize the data as below?

|- [device]
   |- migration
      |- self
      |- compatible1
      |- compatible2

e.g.
#cat /sys/bus/pci/devices/:00:02.0/UUID1/migration/self
field1=xxx
field2=xxx
field3=xxx
field4=xxx
#cat /sys/bus/pci/devices/:00:02.0/UUID1/migration/compatible
field1=xxx
field2=xxx
field3=xxx
field4=xxx

or in a
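Here is a sketch of how a management layer might consume the `migration/self` plus `migration/compatible*` layout proposed here. It assumes each file holds `key=value` lines, that the `compatible*` files list targets the source can migrate to, and that a match means exact equality of all fields. The layout, field names, and matching rule are all hypothetical, since the thread has not settled them, and wildcards or aggregation are not handled.

```python
from pathlib import Path


def read_fields(path):
    """Parse 'key=value' lines (blank lines ignored) into a dict."""
    fields = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"{path}: malformed line {line!r}")
        fields[key] = value
    return fields


def load_migration_dir(migration_dir):
    """Return (self_fields, [compatible_fields, ...]) for a hypothetical migration/ dir."""
    base = Path(migration_dir)
    self_fields = read_fields(base / "self")
    compatible = [read_fields(p) for p in sorted(base.glob("compatible*"))]
    return self_fields, compatible


def target_accepts(source_dir, target_dir):
    """True if the target's 'self' equals the source's 'self' or one of its 'compatible' entries."""
    src_self, src_compat = load_migration_dir(source_dir)
    dst_self, _ = load_migration_dir(target_dir)
    return dst_self == src_self or dst_self in src_compat
```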
Re: device compatibility interface for live migration with assigned devices
On Wed, 29 Jul 2020 12:28:46 +0100 Sean Mooney wrote: > On Wed, 2020-07-29 at 16:05 +0800, Yan Zhao wrote: > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote: > > > On Mon, 27 Jul 2020 15:24:40 +0800 > > > Yan Zhao wrote: > > > > > > > > > As you indicate, the vendor driver is responsible for checking > > > > > > version > > > > > > information embedded within the migration stream. Therefore a > > > > > > migration should fail early if the devices are incompatible. Is it > > > > > > > > > > > > > > > > but as I know, currently in VFIO migration protocol, we have no way to > > > > > get vendor specific compatibility checking string in migration setup > > > > > stage > > > > > (i.e. .save_setup stage) before the device is set to _SAVING state. > > > > > In this way, for devices who does not save device data in precopy > > > > > stage, > > > > > the migration compatibility checking is as late as in stop-and-copy > > > > > stage, which is too late. > > > > > do you think we need to add the getting/checking of vendor specific > > > > > compatibility string early in save_setup stage? > > > > > > > > > > > > > hi Alex, > > > > after an offline discussion with Kevin, I realized that it may not be a > > > > problem if migration compatibility check in vendor driver occurs late in > > > > stop-and-copy phase for some devices, because if we report device > > > > compatibility attributes clearly in an interface, the chances for > > > > libvirt/openstack to make a wrong decision is little. > > > > > > I think it would be wise for a vendor driver to implement a pre-copy > > > phase, even if only to send version information and verify it at the > > > target. Deciding you have no device state to send during pre-copy does > > > not mean your vendor driver needs to opt-out of the pre-copy phase > > > entirely. 
Please also note that pre-copy is at the user's discretion, > > > we've defined that we can enter stop-and-copy at any point, including > > > without a pre-copy phase, so I would recommend that vendor drivers > > > validate compatibility at the start of both the pre-copy and the > > > stop-and-copy phases. > > > > > > > ok. got it! > > > > > > so, do you think we are now arriving at an agreement that we'll give up > > > > the read-and-test scheme and start to defining one interface (perhaps in > > > > json format), from which libvirt/openstack is able to parse and find out > > > > compatibility list of a source mdev/physical device? > > > > > > Based on the feedback we've received, the previously proposed interface > > > is not viable. I think there's agreement that the user needs to be > > > able to parse and interpret the version information. Using json seems > > > viable, but I don't know if it's the best option. Is there any > > > precedent of markup strings returned via sysfs we could follow? > > > > I found some examples of using formatted string under /sys, mostly under > > tracing. maybe we can do a similar implementation. > > > > #cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format > > > > name: kvm_mmio > > ID: 32 > > format: > > field:unsigned short common_type; offset:0; size:2; > > signed:0; > > field:unsigned char common_flags; offset:2; size:1; > > signed:0; > > field:unsigned char common_preempt_count; offset:3; > > size:1; signed:0; > > field:int common_pid; offset:4; size:4; signed:1; > > > > field:u32 type; offset:8; size:4; signed:0; > > field:u32 len; offset:12; size:4; signed:0; > > field:u64 gpa; offset:16; size:8; signed:0; > > field:u64 val; offset:24; size:8; signed:0; > > > > print fmt: "mmio %s len %u gpa 0x%llx val 0x%llx", > > __print_symbolic(REC->type, { 0, "unsatisfied-read" }, { 1, "read" > > }, { 2, "write" }), REC->len, REC->gpa, REC->val > > > this is not json fromat and its not supper frendly to parse. 
> > > > #cat /sys/devices/pci:00/:00:02.0/uevent > > DRIVER=vfio-pci > > PCI_CLASS=3 > > PCI_ID=8086:591D > > PCI_SUBSYS_ID=8086:2212 > > PCI_SLOT_NAME=:00:02.0 > > MODALIAS=pci:v8086d591Dsv8086sd2212bc03sc00i00 > > > this is ini format or conf formant > this is pretty simple to parse whichi would be fine. > that said you could also have a version or capablitiy directory with a file > for each key and a singel value. > > i would prefer to only have to do one read personally the list the files in > directory and then read tehm all ot build the datastucture myself but that is > doable though the simple ini format use d for uevent seams the best of 3 > options > provided above. > > > > > > Your idea of having both a "self" object and an array of "compatible" > > > objects is perhaps something we can build on, but we must not assume > > > PCI devices at the root level of the object. Providing both the > > > mdev-type and the driver is a bit
Re: device compatibility interface for live migration with assigned devices
* Alex Williamson (alex.william...@redhat.com) wrote: > On Mon, 27 Jul 2020 15:24:40 +0800 > Yan Zhao wrote: > > > > > As you indicate, the vendor driver is responsible for checking version > > > > information embedded within the migration stream. Therefore a > > > > migration should fail early if the devices are incompatible. Is it > > > but as I know, currently in VFIO migration protocol, we have no way to > > > get vendor specific compatibility checking string in migration setup stage > > > (i.e. .save_setup stage) before the device is set to _SAVING state. > > > In this way, for devices who does not save device data in precopy stage, > > > the migration compatibility checking is as late as in stop-and-copy > > > stage, which is too late. > > > do you think we need to add the getting/checking of vendor specific > > > compatibility string early in save_setup stage? > > > > > hi Alex, > > after an offline discussion with Kevin, I realized that it may not be a > > problem if migration compatibility check in vendor driver occurs late in > > stop-and-copy phase for some devices, because if we report device > > compatibility attributes clearly in an interface, the chances for > > libvirt/openstack to make a wrong decision is little. > > I think it would be wise for a vendor driver to implement a pre-copy > phase, even if only to send version information and verify it at the > target. Deciding you have no device state to send during pre-copy does > not mean your vendor driver needs to opt-out of the pre-copy phase > entirely. Please also note that pre-copy is at the user's discretion, > we've defined that we can enter stop-and-copy at any point, including > without a pre-copy phase, so I would recommend that vendor drivers > validate compatibility at the start of both the pre-copy and the > stop-and-copy phases. 
That's quite curious; from a migration point of view I'd expect if you did want to skip pre-copy, that you'd go through the motions of entering it and then not saving any data and then going to stop-and-copy, rather than having two flows. Note that failing at a late stage of stop-and-copy is a pain; if you've just spent an hour migrating your huge busy VM over, you're going to be pretty annoyed when it goes pop near the end. Dave > > so, do you think we are now arriving at an agreement that we'll give up > > the read-and-test scheme and start to defining one interface (perhaps in > > json format), from which libvirt/openstack is able to parse and find out > > compatibility list of a source mdev/physical device? > > Based on the feedback we've received, the previously proposed interface > is not viable. I think there's agreement that the user needs to be > able to parse and interpret the version information. Using json seems > viable, but I don't know if it's the best option. Is there any > precedent of markup strings returned via sysfs we could follow? > > Your idea of having both a "self" object and an array of "compatible" > objects is perhaps something we can build on, but we must not assume > PCI devices at the root level of the object. Providing both the > mdev-type and the driver is a bit redundant, since the former includes > the latter. We can't have vendor specific versioning schemes though, > ie. gvt-version. We need to agree on a common scheme and decide which > fields the version is relative to, ex. just the mdev type? > > I had also proposed fields that provide information to create a > compatible type, for example to create a type_x2 device from a type_x1 > mdev type, they need to know to apply an aggregation attribute. 
If we > need to explicitly list every aggregation value and the resulting type, > I think we run aground of what aggregation was trying to avoid anyway, > so we might need to pick a language that defines variable substitution > or some kind of tagging. For example if we could define ${aggr} as an > integer within a specified range, then we might be able to define a type > relative to that value (type_x${aggr}) which requires an aggregation > attribute using the same value. I dunno, just spit balling. Thanks, > > Alex -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: device compatibility interface for live migration with assigned devices
On Wed, 2020-07-29 at 16:05 +0800, Yan Zhao wrote: > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote: > > On Mon, 27 Jul 2020 15:24:40 +0800 > > Yan Zhao wrote: > > > > > > > As you indicate, the vendor driver is responsible for checking version > > > > > information embedded within the migration stream. Therefore a > > > > > migration should fail early if the devices are incompatible. Is it > > > > > > > > but as I know, currently in VFIO migration protocol, we have no way to > > > > get vendor specific compatibility checking string in migration setup > > > > stage > > > > (i.e. .save_setup stage) before the device is set to _SAVING state. > > > > In this way, for devices who does not save device data in precopy stage, > > > > the migration compatibility checking is as late as in stop-and-copy > > > > stage, which is too late. > > > > do you think we need to add the getting/checking of vendor specific > > > > compatibility string early in save_setup stage? > > > > > > > > > > hi Alex, > > > after an offline discussion with Kevin, I realized that it may not be a > > > problem if migration compatibility check in vendor driver occurs late in > > > stop-and-copy phase for some devices, because if we report device > > > compatibility attributes clearly in an interface, the chances for > > > libvirt/openstack to make a wrong decision is little. > > > > I think it would be wise for a vendor driver to implement a pre-copy > > phase, even if only to send version information and verify it at the > > target. Deciding you have no device state to send during pre-copy does > > not mean your vendor driver needs to opt-out of the pre-copy phase > > entirely. Please also note that pre-copy is at the user's discretion, > > we've defined that we can enter stop-and-copy at any point, including > > without a pre-copy phase, so I would recommend that vendor drivers > > validate compatibility at the start of both the pre-copy and the > > stop-and-copy phases. > > > > ok. 
got it! > > > > so, do you think we are now arriving at an agreement that we'll give up > > > the read-and-test scheme and start to defining one interface (perhaps in > > > json format), from which libvirt/openstack is able to parse and find out > > > compatibility list of a source mdev/physical device? > > > > Based on the feedback we've received, the previously proposed interface > > is not viable. I think there's agreement that the user needs to be > > able to parse and interpret the version information. Using json seems > > viable, but I don't know if it's the best option. Is there any > > precedent of markup strings returned via sysfs we could follow? > > I found some examples of using formatted string under /sys, mostly under > tracing. maybe we can do a similar implementation. > > #cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format > > name: kvm_mmio > ID: 32 > format: > field:unsigned short common_type; offset:0; size:2; > signed:0; > field:unsigned char common_flags; offset:2; size:1; > signed:0; > field:unsigned char common_preempt_count; offset:3; > size:1; signed:0; > field:int common_pid; offset:4; size:4; signed:1; > > field:u32 type; offset:8; size:4; signed:0; > field:u32 len; offset:12; size:4; signed:0; > field:u64 gpa; offset:16; size:8; signed:0; > field:u64 val; offset:24; size:8; signed:0; > > print fmt: "mmio %s len %u gpa 0x%llx val 0x%llx", > __print_symbolic(REC->type, { 0, "unsatisfied-read" }, { 1, "read" > }, { 2, "write" }), REC->len, REC->gpa, REC->val > This is not JSON format, and it's not super friendly to parse. > > #cat /sys/devices/pci:00/:00:02.0/uevent > DRIVER=vfio-pci > PCI_CLASS=3 > PCI_ID=8086:591D > PCI_SUBSYS_ID=8086:2212 > PCI_SLOT_NAME=:00:02.0 > MODALIAS=pci:v8086d591Dsv8086sd2212bc03sc00i00 > This is INI or conf format. It is pretty simple to parse, which would be fine. That said, you could also have a version or capability directory with a file for each key and a single value.
Personally, I would prefer to only have to do one read, rather than list the files in a directory and then read them all to build the data structure myself, but that is doable. Still, the simple ini format used for uevent seems the best of the 3 options provided above. > > > > Your idea of having both a "self" object and an array of "compatible" > > objects is perhaps something we can build on, but we must not assume > > PCI devices at the root level of the object. Providing both the > > mdev-type and the driver is a bit redundant, since the former includes > > the latter. We can't have vendor specific versioning schemes though, > > ie. gvt-version. We need to agree on a common scheme and decide which > > fields the version is relative to, ex. just the mdev type? > > what about making all comparing fields vendor specific? >
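To make the single-read point concrete: if a (hypothetical) version or capability attribute used the same flat KEY=VALUE layout as the uevent file shown in this message, userspace could load it with one read and split each line on the first "=". A minimal Python sketch, assuming that format; `parse_kv` is a hypothetical helper and the sample reuses uevent fields quoted in the thread:

```python
def parse_kv(text):
    """Parse uevent-style 'KEY=VALUE' lines into a dict.

    Blank lines are skipped. The value is everything after the first '=',
    so values that themselves contain '=' survive intact.
    """
    attrs = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"malformed line: {raw!r}")
        attrs[key] = value
    return attrs


# In practice this text would come from a single read of the sysfs file.
sample = """DRIVER=vfio-pci
PCI_CLASS=3
PCI_ID=8086:591D
PCI_SUBSYS_ID=8086:2212
MODALIAS=pci:v8086d591Dsv8086sd2212bc03sc00i00
"""
info = parse_kv(sample)
print(info["PCI_ID"])  # 8086:591D
```

The one-file-per-key alternative would replace the single read with a directory listing plus one read per key, which is the extra work described above.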
Re: device compatibility interface for live migration with assigned devices
On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote: > On Mon, 27 Jul 2020 15:24:40 +0800 > Yan Zhao wrote: > > > > > As you indicate, the vendor driver is responsible for checking version > > > > information embedded within the migration stream. Therefore a > > > > migration should fail early if the devices are incompatible. Is it > > > but as I know, currently in VFIO migration protocol, we have no way to > > > get vendor specific compatibility checking string in migration setup stage > > > (i.e. .save_setup stage) before the device is set to _SAVING state. > > > In this way, for devices who does not save device data in precopy stage, > > > the migration compatibility checking is as late as in stop-and-copy > > > stage, which is too late. > > > do you think we need to add the getting/checking of vendor specific > > > compatibility string early in save_setup stage? > > > > > hi Alex, > > after an offline discussion with Kevin, I realized that it may not be a > > problem if migration compatibility check in vendor driver occurs late in > > stop-and-copy phase for some devices, because if we report device > > compatibility attributes clearly in an interface, the chances for > > libvirt/openstack to make a wrong decision is little. > > I think it would be wise for a vendor driver to implement a pre-copy > phase, even if only to send version information and verify it at the > target. Deciding you have no device state to send during pre-copy does > not mean your vendor driver needs to opt-out of the pre-copy phase > entirely. Please also note that pre-copy is at the user's discretion, > we've defined that we can enter stop-and-copy at any point, including > without a pre-copy phase, so I would recommend that vendor drivers > validate compatibility at the start of both the pre-copy and the > stop-and-copy phases. > ok. got it! 
> > so, do you think we are now arriving at an agreement that we'll give up > > the read-and-test scheme and start to defining one interface (perhaps in > > json format), from which libvirt/openstack is able to parse and find out > > compatibility list of a source mdev/physical device? > > Based on the feedback we've received, the previously proposed interface > is not viable. I think there's agreement that the user needs to be > able to parse and interpret the version information. Using json seems > viable, but I don't know if it's the best option. Is there any > precedent of markup strings returned via sysfs we could follow?

I found some examples of formatted strings under /sys, mostly under tracing. Maybe we can do a similar implementation.

#cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format
name: kvm_mmio
ID: 32
format:
    field:unsigned short common_type;    offset:0;    size:2;    signed:0;
    field:unsigned char common_flags;    offset:2;    size:1;    signed:0;
    field:unsigned char common_preempt_count;    offset:3;    size:1;    signed:0;
    field:int common_pid;    offset:4;    size:4;    signed:1;

    field:u32 type;    offset:8;    size:4;    signed:0;
    field:u32 len;    offset:12;    size:4;    signed:0;
    field:u64 gpa;    offset:16;    size:8;    signed:0;
    field:u64 val;    offset:24;    size:8;    signed:0;

print fmt: "mmio %s len %u gpa 0x%llx val 0x%llx", __print_symbolic(REC->type, { 0, "unsatisfied-read" }, { 1, "read" }, { 2, "write" }), REC->len, REC->gpa, REC->val

#cat /sys/devices/pci:00/:00:02.0/uevent
DRIVER=vfio-pci
PCI_CLASS=3
PCI_ID=8086:591D
PCI_SUBSYS_ID=8086:2212
PCI_SLOT_NAME=:00:02.0
MODALIAS=pci:v8086d591Dsv8086sd2212bc03sc00i00

> > Your idea of having both a "self" object and an array of "compatible" > objects is perhaps something we can build on, but we must not assume > PCI devices at the root level of the object. Providing both the > mdev-type and the driver is a bit redundant, since the former includes > the latter. We can't have vendor specific versioning schemes though, > ie. gvt-version.
We need to agree on a common scheme and decide which > fields the version is relative to, ex. just the mdev type? What about making all the compared fields vendor specific? Userspace like openstack would then only need to parse the lists and check whether the target device is within the source's compatible list, without understanding the meaning of each field. > I had also proposed fields that provide information to create a > compatible type, for example to create a type_x2 device from a type_x1 > mdev type, they need to know to apply an aggregation attribute. If we > need to explicitly list every aggregation value and the resulting type, > I think we run aground of what aggregation was trying to avoid anyway, > so we might need to pick a language that defines variable substitution > or some kind of tagging. For example if we could define ${aggr} as an > integer within a specified range, then we might be able to define a type > relative to that value (type_x${aggr})
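A sketch of the "compare without interpreting" idea: userspace decodes the JSON and asks only whether the target's "self" entry appears verbatim in the source's "compatible" list. This assumes the "self"/"compatible" layout from Yan's earlier example (with its missing commas fixed and fields trimmed); `target_in_compatible_list` and the exact-match rule are illustrative, not an agreed interface.

```python
import json

# Hypothetical layout modeled on the "self"/"compatible" example in this
# thread, with its missing commas fixed; field names are illustrative only.
SOURCE = json.loads("""
{
  "self": [
    {"mdev_type": "i915-GVTg_V5_2", "aggregator": "1", "gvt-version": "v1"}
  ],
  "compatible": [
    {"mdev_type": "i915-GVTg_V5_2", "aggregator": "1", "gvt-version": "v1"},
    {"mdev_type": "i915-GVTg_V5_4", "aggregator": "2", "gvt-version": "v1"}
  ]
}
""")


def target_in_compatible_list(source_doc, target_doc):
    """True if each "self" entry of the target equals some "compatible" entry
    of the source. Values are compared as opaque strings; no field is
    interpreted. An empty "self" list is treated as not compatible."""
    selves = target_doc.get("self", [])
    candidates = source_doc.get("compatible", [])
    return bool(selves) and all(entry in candidates for entry in selves)


target_ok = {"self": [{"mdev_type": "i915-GVTg_V5_4", "aggregator": "2",
                       "gvt-version": "v1"}]}
target_bad = {"self": [{"mdev_type": "i915-GVTg_V5_8", "aggregator": "1",
                        "gvt-version": "v1"}]}
```

Exact matching keeps userspace ignorant of what each field means, but every acceptable combination then has to be enumerated in the source's list, which is the length concern raised earlier in the thread.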
Re: device compatibility interface for live migration with assigned devices
On Mon, 27 Jul 2020 15:24:40 +0800 Yan Zhao wrote: > > > As you indicate, the vendor driver is responsible for checking version > > > information embedded within the migration stream. Therefore a > > > migration should fail early if the devices are incompatible. Is it > > but as I know, currently in VFIO migration protocol, we have no way to > > get vendor specific compatibility checking string in migration setup stage > > (i.e. .save_setup stage) before the device is set to _SAVING state. > > In this way, for devices who does not save device data in precopy stage, > > the migration compatibility checking is as late as in stop-and-copy > > stage, which is too late. > > do you think we need to add the getting/checking of vendor specific > > compatibility string early in save_setup stage? > > > hi Alex, > after an offline discussion with Kevin, I realized that it may not be a > problem if migration compatibility check in vendor driver occurs late in > stop-and-copy phase for some devices, because if we report device > compatibility attributes clearly in an interface, the chances for > libvirt/openstack to make a wrong decision is little. I think it would be wise for a vendor driver to implement a pre-copy phase, even if only to send version information and verify it at the target. Deciding you have no device state to send during pre-copy does not mean your vendor driver needs to opt-out of the pre-copy phase entirely. Please also note that pre-copy is at the user's discretion, we've defined that we can enter stop-and-copy at any point, including without a pre-copy phase, so I would recommend that vendor drivers validate compatibility at the start of both the pre-copy and the stop-and-copy phases. 
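The recommendation above can be illustrated with a toy model in which the source opens each phase with a version header, even when it has no device state to send, so the target can reject an incompatible device at the start of whichever phase comes first. Everything here (phase names, the dict-based stream, `source_stream`, `target_receive`, `accepted_versions`) is a hypothetical stand-in, not the VFIO migration protocol:

```python
# Toy model only: phase names and the dict-based stream are hypothetical and
# are not the VFIO migration uAPI.
PHASES = ("pre-copy", "stop-and-copy")


def source_stream(phase, version, state_chunks):
    """Yield what a source vendor driver sends in one phase.

    The version header always comes first, even when there is no device
    state to send, so the target can validate compatibility early.
    """
    if phase not in PHASES:
        raise ValueError(f"unknown phase {phase!r}")
    yield {"kind": "version", "value": version}
    for chunk in state_chunks:
        yield {"kind": "state", "value": chunk}


def target_receive(phase, stream, accepted_versions):
    """Check the header at the start of a phase, then collect device state."""
    items = iter(stream)
    header = next(items, None)
    if header is None or header["kind"] != "version":
        raise ValueError(f"{phase}: missing version header")
    if header["value"] not in accepted_versions:
        raise ValueError(f"{phase}: incompatible version {header['value']!r}")
    return [item["value"] for item in items]


accepted = {"v1", "v2"}
# Pre-copy with no device state still carries the version check.
precopy = target_receive("pre-copy", source_stream("pre-copy", "v2", []), accepted)
# The user may skip pre-copy entirely, so stop-and-copy repeats the check.
final = target_receive("stop-and-copy",
                       source_stream("stop-and-copy", "v2", [b"regs"]), accepted)
```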
> so, do you think we are now arriving at an agreement that we'll give up > the read-and-test scheme and start to defining one interface (perhaps in > json format), from which libvirt/openstack is able to parse and find out > compatibility list of a source mdev/physical device? Based on the feedback we've received, the previously proposed interface is not viable. I think there's agreement that the user needs to be able to parse and interpret the version information. Using json seems viable, but I don't know if it's the best option. Is there any precedent of markup strings returned via sysfs we could follow? Your idea of having both a "self" object and an array of "compatible" objects is perhaps something we can build on, but we must not assume PCI devices at the root level of the object. Providing both the mdev-type and the driver is a bit redundant, since the former includes the latter. We can't have vendor specific versioning schemes though, ie. gvt-version. We need to agree on a common scheme and decide which fields the version is relative to, ex. just the mdev type? I had also proposed fields that provide information to create a compatible type, for example to create a type_x2 device from a type_x1 mdev type, they need to know to apply an aggregation attribute. If we need to explicitly list every aggregation value and the resulting type, I think we run aground of what aggregation was trying to avoid anyway, so we might need to pick a language that defines variable substitution or some kind of tagging. For example if we could define ${aggr} as an integer within a specified range, then we might be able to define a type relative to that value (type_x${aggr}) which requires an aggregation attribute using the same value. I dunno, just spit balling. Thanks, Alex
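One way to read the ${aggr} suggestion, sketched with Python's `string.Template`: a single compatible entry carries a type template, an aggregation template and an integer range, and userspace expands or matches against it instead of reading an explicit list. The field names (`mdev_type`, `aggregation`, `aggr_range`), the inclusive range, and the helpers `expand`/`matches` are assumptions for illustration only:

```python
from string import Template

# Hypothetical entry: one template plus an inclusive integer range for
# ${aggr} stands in for an explicit list of (type, aggregation) pairs.
ENTRY = {
    "mdev_type": "type_x${aggr}",
    "aggregation": "${aggr}",
    "aggr_range": (1, 4),
}


def expand(entry):
    """Yield each concrete (mdev_type, aggregation) pair the entry allows."""
    lo, hi = entry["aggr_range"]
    for value in range(lo, hi + 1):
        yield (Template(entry["mdev_type"]).substitute(aggr=value),
               Template(entry["aggregation"]).substitute(aggr=value))


def matches(entry, mdev_type, aggregation):
    """True if the template produces this type with this aggregation value."""
    return (mdev_type, str(aggregation)) in set(expand(entry))
```

With this entry, a type_x2 device paired with aggregation 2 matches, while type_x2 with aggregation 3 does not. Expanding the range only makes sense for small ranges; a real scheme would more likely extract the integer from the type name, which is the kind of language choice left open here.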
Re: device compatibility interface for live migration with assigned devices
> > As you indicate, the vendor driver is responsible for checking version > > information embedded within the migration stream. Therefore a > > migration should fail early if the devices are incompatible. Is it > but as I know, currently in VFIO migration protocol, we have no way to > get vendor specific compatibility checking string in migration setup stage > (i.e. .save_setup stage) before the device is set to _SAVING state. > In this way, for devices who does not save device data in precopy stage, > the migration compatibility checking is as late as in stop-and-copy > stage, which is too late. > do you think we need to add the getting/checking of vendor specific > compatibility string early in save_setup stage? > hi Alex, after an offline discussion with Kevin, I realized that it may not be a problem if the migration compatibility check in the vendor driver occurs late, in the stop-and-copy phase, for some devices: if we report device compatibility attributes clearly in an interface, the chance of libvirt/openstack making a wrong decision is small. So, do you think we are now arriving at an agreement that we'll give up the read-and-test scheme and start defining one interface (perhaps in json format) from which libvirt/openstack can parse and find out the compatibility list of a source mdev/physical device? Thanks Yan
Re: device compatibility interface for live migration with assigned devices
On 2020/7/20 6:39 PM, Sean Mooney wrote: On Mon, 2020-07-20 at 11:41 +0800, Jason Wang wrote: On 2020/7/18 12:12 AM, Alex Williamson wrote: On Thu, 16 Jul 2020 16:32:30 +0800 Yan Zhao wrote: On Thu, Jul 16, 2020 at 12:16:26PM +0800, Jason Wang wrote: On 2020/7/14 7:29 AM, Yan Zhao wrote: hi folks, we are defining a device migration compatibility interface that helps upper layer stack like openstack/ovirt/libvirt to check if two devices are live migration compatible. The "devices" here could be MDEVs, physical devices, or hybrid of the two. e.g. we could use it to check whether - a src MDEV can migrate to a target MDEV, - a src VF in SRIOV can migrate to a target VF in SRIOV, - a src MDEV can migration to a target VF in SRIOV. (e.g. SIOV/SRIOV backward compatibility case) The upper layer stack could use this interface as the last step to check if one device is able to migrate to another device before triggering a real live migration procedure. we are not sure if this interface is of value or help to you. please don't hesitate to drop your valuable comments. (1) interface definition The interface is defined in below way: __userspace /\ \ / \write / read \ /__ ___\|/_ | migration_version | | migration_version |-->check migration - - compatibility device Adevice B a device attribute named migration_version is defined under each device's sysfs node. e.g. (/sys/bus/pci/devices/\:00\:02.0/$mdev_UUID/migration_version). Are you aware of the devlink based device management interface that is proposed upstream? I think it has many advantages over sysfs, do you consider to switch to that? Advantages, such as? My understanding for devlink(netlink) over sysfs (some are mentioned at the time of vDPA sysfs mgmt API discussion) are: I thought netlink was used more as a configuration protocol to query and configure NICs and, I guess, other devices in its devlink form, requiring a tool to be written that can speak the protocol to interact with.
The primary advantage of sysfs is that everything is just a file. There are no additional dependencies needed. Well, if you try to build logic like introspection on top of sophisticated hardware, you probably need a library on top. And one attribute per file is pretty inefficient. And unlike netlink, there are no interoperability issues in a containerised env. If you are using different versions of libc and gcc in the container vs the host, my understanding is that tools like ethtool from ubuntu deployed in a container on a centos host can have issues communicating with the host kernel. The kernel provides a stable ABI for userspace, so it's not something that we can't fix. If it's just a file, unless the format the data is returned in changes or the layout of sysfs changes, it's compatible regardless of what you use to read it. I believe you can't change the sysfs layout, which is part of the uABI. But as I mentioned below, sysfs has several drawbacks. It does no harm to compare different approaches when you start a new device management API. Thanks - existing users (NIC, crypto, SCSI, ib), mature and stable - much better error reporting (ext_ack other than string or errno) - namespace aware - do not couple with kobject Thanks
Re: device compatibility interface for live migration with assigned devices
On Fri, Jul 17, 2020 at 10:12:58AM -0600, Alex Williamson wrote: <...> > > yes, in another reply, Alex proposed to use an interface in json format. > > I guess we can define something like > > > > { "self" : > > [ > > { "pciid" : "8086591d", > > "driver" : "i915", > > "gvt-version" : "v1", > > "mdev_type" : "i915-GVTg_V5_2", > > "aggregator" : "1", > > "pv-mode" : "none", > > } > > ], > > "compatible" : > > [ > > { "pciid" : "8086591d", > > "driver" : "i915", > > "gvt-version" : "v1", > > "mdev_type" : "i915-GVTg_V5_2", > > "aggregator" : "1" > > "pv-mode" : "none", > > }, > > { "pciid" : "8086591d", > > "driver" : "i915", > > "gvt-version" : "v1", > > "mdev_type" : "i915-GVTg_V5_4", > > "aggregator" : "2" > > "pv-mode" : "none", > > }, > > { "pciid" : "8086591d", > > "driver" : "i915", > > "gvt-version" : "v2", > > "mdev_type" : "i915-GVTg_V5_4", > > "aggregator" : "2" > > "pv-mode" : "none, ppgtt, context", > > } > > ... > > ] > > } > > > > But as those fields are mostly vendor specific, the userspace can > > only do simple string comparing, I guess the list would be very long as > > it needs to enumerate all possible targets. > > > This ignores so much of what I tried to achieve in my example :( > sorry, I just was eager to show and confirm the way to list all compatible combination of mdev_type and mdev attributes. > > > also, in some fileds like "gvt-version", is there a simple way to express > > things like v2+? > > > That's not a reasonable thing to express anyway, how can you be certain > that v3 won't break compatibility with v2? Sean proposed a versioning > scheme that accounts for this, using an x.y.z version expressing the > major, minor, and bugfix versions, where there is no compatibility > across major versions, minor versions have forward compatibility (ex. 
1 > -> 2 is ok, 2 -> 1 is not) and bugfix version number indicates some > degree of internal improvement that is not visible to the user in terms > of features or compatibility, but provides a basis for preferring > equally compatible candidates. > right. if self version is v1, it can't know its compatible version is v2. it can only be done in reverse. i.e. when self version is v2, it can list its compatible version is v1 and v2. and maybe later when self version is v3, there's no v1 in its compatible list. In this way, do you think we still need the complex x.y.z versioning scheme? > > > If the userspace can read this interface both in src and target and > > check whether both src and target are in corresponding compatible list, I > > think it will work for us. > > > > But still, kernel should not rely on userspace's choice, the opaque > > compatibility string is still required in kernel. No matter whether > > it would be exposed to userspace as an compatibility checking interface, > > vendor driver would keep this part of code and embed the string into the > > migration stream. so exposing it as an interface to be used by libvirt to > > do a safety check before a real live migration is only about enabling > > the kernel part of check to happen ahead. > > As you indicate, the vendor driver is responsible for checking version > information embedded within the migration stream. Therefore a > migration should fail early if the devices are incompatible. Is it but as I know, currently in VFIO migration protocol, we have no way to get vendor specific compatibility checking string in migration setup stage (i.e. .save_setup stage) before the device is set to _SAVING state. In this way, for devices who does not save device data in precopy stage, the migration compatibility checking is as late as in stop-and-copy stage, which is too late. do you think we need to add the getting/checking of vendor specific compatibility string early in save_setup stage? 
> really libvirt's place to second guess what it has been directed to do? If libvirt uses the scheme of reading the compatibility string at the source and writing it for checking at the target, it cannot be called "a second guess". It's not a guess, but a confirmation. > Why would we even proceed to design a user parse-able version interface > if we still have a dependency on an opaque interface? Thanks, one reason is that libvirt can't trust the parsing result from openstack. Another reason is that libvirt can use this opaque interface more easily than doing its own parsing, and it would not add any burden to the kernel, which has to implement this part of the code anyway, whether libvirt uses it or not. Thanks Yan
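For comparison with the list-based approach, here is a minimal sketch of the x.y.z scheme as Alex summarizes it: no compatibility across major versions, minor versions are forward compatible only (1 -> 2 is ok, 2 -> 1 is not), and the bugfix number is used only to prefer one compatible candidate over another. The names `Version`, `can_migrate` and `best_target` are hypothetical, and ranking purely by bugfix number is one possible reading of "preferring equally compatible candidates":

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    bugfix: int

    @classmethod
    def parse(cls, text):
        """Parse an 'x.y.z' string into a Version."""
        major, minor, bugfix = (int(part) for part in text.split("."))
        return cls(major, minor, bugfix)


def can_migrate(src, dst):
    """Same major required; the target's minor must be >= the source's.
    The bugfix number plays no part in compatibility."""
    return src.major == dst.major and dst.minor >= src.minor


def best_target(src, candidates):
    """Among compatible candidates, prefer the highest bugfix number."""
    compatible = [c for c in candidates if can_migrate(src, c)]
    return max(compatible, key=lambda v: v.bugfix, default=None)
```

Unlike an explicit list, a v1.x source here automatically accepts a later v1.(x+1) target without the source having to know about it, which is the property the "self version v1 can't know about v2" discussion turns on.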
Re: device compatibility interface for live migration with assigned devices
On Mon, 2020-07-20 at 11:41 +0800, Jason Wang wrote: > On 2020/7/18 12:12 AM, Alex Williamson wrote: > > On Thu, 16 Jul 2020 16:32:30 +0800 > > Yan Zhao wrote: > > > > > On Thu, Jul 16, 2020 at 12:16:26PM +0800, Jason Wang wrote: > > > > On 2020/7/14 7:29 AM, Yan Zhao wrote: > > > > > hi folks, > > > > > we are defining a device migration compatibility interface that helps > > > > > upper > > > > > layer stack like openstack/ovirt/libvirt to check if two devices are > > > > > live migration compatible. > > > > > The "devices" here could be MDEVs, physical devices, or hybrid of the > > > > > two. > > > > > e.g. we could use it to check whether > > > > > - a src MDEV can migrate to a target MDEV, > > > > > - a src VF in SRIOV can migrate to a target VF in SRIOV, > > > > > - a src MDEV can migration to a target VF in SRIOV. > > > > > (e.g. SIOV/SRIOV backward compatibility case) > > > > > > > > > > The upper layer stack could use this interface as the last step to > > > > > check > > > > > if one device is able to migrate to another device before triggering > > > > > a real > > > > > live migration procedure. > > > > > we are not sure if this interface is of value or help to you. please > > > > > don't > > > > > hesitate to drop your valuable comments. > > > > > > > > > > > > > > > (1) interface definition > > > > > The interface is defined in below way: > > > > > > > > > >__userspace > > > > > /\ \ > > > > >/ \write > > > > > / read \ > > > > > /__ ___\|/_ > > > > > | migration_version | | migration_version |-->check migration > > > > > - - compatibility > > > > >device Adevice B > > > > > > > > > > > > > > > a device attribute named migration_version is defined under each > > > > > device's > > > > > sysfs node. e.g. > > > > > (/sys/bus/pci/devices/\:00\:02.0/$mdev_UUID/migration_version). > > > > > > > > Are you aware of the devlink based device management interface that is > > > > proposed upstream?
I think it has many advantages over sysfs, do you > > > > consider to switch to that? > > > > Advantages, such as? > > > My understanding for devlink(netlink) over sysfs (some are mentioned at > the time of vDPA sysfs mgmt API discussion) are: I thought netlink was used more as a configuration protocol to query and configure NICs and, I guess, other devices in its devlink form, requiring a tool to be written that can speak the protocol to interact with. The primary advantage of sysfs is that everything is just a file. There are no additional dependencies needed, and unlike netlink there are no interoperability issues in a containerised env. If you are using different versions of libc and gcc in the container vs the host, my understanding is that tools like ethtool from ubuntu deployed in a container on a centos host can have issues communicating with the host kernel. If it's just a file, unless the format the data is returned in changes or the layout of sysfs changes, it's compatible regardless of what you use to read it. > > - existing users (NIC, crypto, SCSI, ib), mature and stable > - much better error reporting (ext_ack other than string or errno) > - namespace aware > - do not couple with kobject > > Thanks >
Re: device compatibility interface for live migration with assigned devices
On 2020/7/18 12:12 AM, Alex Williamson wrote: On Thu, 16 Jul 2020 16:32:30 +0800 Yan Zhao wrote: On Thu, Jul 16, 2020 at 12:16:26PM +0800, Jason Wang wrote: On 2020/7/14 7:29 AM, Yan Zhao wrote: hi folks, we are defining a device migration compatibility interface that helps upper layer stack like openstack/ovirt/libvirt to check if two devices are live migration compatible. The "devices" here could be MDEVs, physical devices, or hybrid of the two. e.g. we could use it to check whether - a src MDEV can migrate to a target MDEV, - a src VF in SRIOV can migrate to a target VF in SRIOV, - a src MDEV can migration to a target VF in SRIOV. (e.g. SIOV/SRIOV backward compatibility case) The upper layer stack could use this interface as the last step to check if one device is able to migrate to another device before triggering a real live migration procedure. we are not sure if this interface is of value or help to you. please don't hesitate to drop your valuable comments. (1) interface definition The interface is defined in below way: __userspace /\ \ / \write / read \ /__ ___\|/_ | migration_version | | migration_version |-->check migration - - compatibility device Adevice B a device attribute named migration_version is defined under each device's sysfs node. e.g. (/sys/bus/pci/devices/\:00\:02.0/$mdev_UUID/migration_version). Are you aware of the devlink based device management interface that is proposed upstream? I think it has many advantages over sysfs, do you consider to switch to that? Advantages, such as? My understanding for devlink(netlink) over sysfs (some are mentioned at the time of vDPA sysfs mgmt API discussion) are: - existing users (NIC, crypto, SCSI, ib), mature and stable - much better error reporting (ext_ack other than string or errno) - namespace aware - do not couple with kobject Thanks
Re: device compatibility interface for live migration with assigned devices
On Fri, 17 Jul 2020 19:03:44 +0100 "Dr. David Alan Gilbert" wrote: > * Alex Williamson (alex.william...@redhat.com) wrote: > > On Wed, 15 Jul 2020 16:20:41 +0800 > > Yan Zhao wrote: > > > > > On Tue, Jul 14, 2020 at 02:59:48PM -0600, Alex Williamson wrote: > > > > On Tue, 14 Jul 2020 18:19:46 +0100 > > > > "Dr. David Alan Gilbert" wrote: > > > > > > > > > * Alex Williamson (alex.william...@redhat.com) wrote: > > > > > > On Tue, 14 Jul 2020 11:21:29 +0100 > > > > > > Daniel P. Berrangé wrote: > > > > > > > > > > > > > On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote: > > > > > > > > hi folks, > > > > > > > > we are defining a device migration compatibility interface that > > > > > > > > helps upper > > > > > > > > layer stack like openstack/ovirt/libvirt to check if two > > > > > > > > devices are > > > > > > > > live migration compatible. > > > > > > > > The "devices" here could be MDEVs, physical devices, or hybrid > > > > > > > > of the two. > > > > > > > > e.g. we could use it to check whether > > > > > > > > - a src MDEV can migrate to a target MDEV, > > > > > > > > - a src VF in SRIOV can migrate to a target VF in SRIOV, > > > > > > > > - a src MDEV can migration to a target VF in SRIOV. > > > > > > > > (e.g. SIOV/SRIOV backward compatibility case) > > > > > > > > > > > > > > > > The upper layer stack could use this interface as the last step > > > > > > > > to check > > > > > > > > if one device is able to migrate to another device before > > > > > > > > triggering a real > > > > > > > > live migration procedure. > > > > > > > > we are not sure if this interface is of value or help to you. > > > > > > > > please don't > > > > > > > > hesitate to drop your valuable comments. 
> > > > > > > > > > > > > > > > > > > > > > > > (1) interface definition > > > > > > > > The interface is defined in below way: > > > > > > > > > > > > > > > > __userspace > > > > > > > > /\ \ > > > > > > > > / \write > > > > > > > > / read \ > > > > > > > >/__ ___\|/_ > > > > > > > > | migration_version | | migration_version |-->check > > > > > > > > migration > > > > > > > > - - > > > > > > > > compatibility > > > > > > > > device Adevice B > > > > > > > > > > > > > > > > > > > > > > > > a device attribute named migration_version is defined under > > > > > > > > each device's > > > > > > > > sysfs node. e.g. > > > > > > > > (/sys/bus/pci/devices/\:00\:02.0/$mdev_UUID/migration_version). > > > > > > > > userspace tools read the migration_version as a string from the > > > > > > > > source device, > > > > > > > > and write it to the migration_version sysfs attribute in the > > > > > > > > target device. > > > > > > > > > > > > > > > > The userspace should treat ANY of below conditions as two > > > > > > > > devices not compatible: > > > > > > > > - any one of the two devices does not have a migration_version > > > > > > > > attribute > > > > > > > > - error when reading from migration_version attribute of one > > > > > > > > device > > > > > > > > - error when writing migration_version string of one device to > > > > > > > > migration_version attribute of the other device > > > > > > > > > > > > > > > > The string read from migration_version attribute is defined by > > > > > > > > device vendor > > > > > > > > driver and is completely opaque to the userspace. > > > > > > > > for a Intel vGPU, string format can be defined like > > > > > > > > "parent device PCI ID" + "version of gvt driver" + "mdev type" > > > > > > > > + "aggregator count". > > > > > > > > > > > > > > > > for an NVMe VF connecting to a remote storage. 
it could be > > > > > > > > "PCI ID" + "driver version" + "configured remote storage URL" > > > > > > > > > > > > > > > > for a QAT VF, it may be > > > > > > > > "PCI ID" + "driver version" + "supported encryption set". > > > > > > > > > > > > > > > > (to avoid namespace confliction from each vendor, we may prefix > > > > > > > > a driver name to > > > > > > > > each migration_version string. e.g. > > > > > > > > i915-v1-8086-591d-i915-GVTg_V5_8-1) > > > > > > > > > > > > It's very strange to define it as opaque and then proceed to > > > > > > describe > > > > > > the contents of that opaque string. The point is that its contents > > > > > > are defined by the vendor driver to describe the device, driver > > > > > > version, > > > > > > and possibly metadata about the configuration of the device. One > > > > > > instance of a device might generate a different string from another. > > > > > > The string that a device produces is not necessarily the only string > > > > > > the vendor driver will accept, for example the driver might support > > > > > > backwards compatible migrations. > > > > > > > > > > (As I've
Re: device compatibility interface for live migration with assigned devices
* Alex Williamson (alex.william...@redhat.com) wrote: > On Wed, 15 Jul 2020 16:20:41 +0800 > Yan Zhao wrote: > > > On Tue, Jul 14, 2020 at 02:59:48PM -0600, Alex Williamson wrote: > > > On Tue, 14 Jul 2020 18:19:46 +0100 > > > "Dr. David Alan Gilbert" wrote: > > > > > > > * Alex Williamson (alex.william...@redhat.com) wrote: > > > > > On Tue, 14 Jul 2020 11:21:29 +0100 > > > > > Daniel P. Berrangé wrote: > > > > > > > > > > > On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote: > > > > > > > hi folks, > > > > > > > we are defining a device migration compatibility interface that > > > > > > > helps upper > > > > > > > layer stack like openstack/ovirt/libvirt to check if two devices > > > > > > > are > > > > > > > live migration compatible. > > > > > > > The "devices" here could be MDEVs, physical devices, or hybrid of > > > > > > > the two. > > > > > > > e.g. we could use it to check whether > > > > > > > - a src MDEV can migrate to a target MDEV, > > > > > > > - a src VF in SRIOV can migrate to a target VF in SRIOV, > > > > > > > - a src MDEV can migration to a target VF in SRIOV. > > > > > > > (e.g. SIOV/SRIOV backward compatibility case) > > > > > > > > > > > > > > The upper layer stack could use this interface as the last step > > > > > > > to check > > > > > > > if one device is able to migrate to another device before > > > > > > > triggering a real > > > > > > > live migration procedure. > > > > > > > we are not sure if this interface is of value or help to you. > > > > > > > please don't > > > > > > > hesitate to drop your valuable comments. 
> > > > > > > > > > > > > > > > > > > > > (1) interface definition > > > > > > > The interface is defined in below way: > > > > > > > > > > > > > > __userspace > > > > > > > /\ \ > > > > > > > / \write > > > > > > > / read \ > > > > > > >/__ ___\|/_ > > > > > > > | migration_version | | migration_version |-->check > > > > > > > migration > > > > > > > - - compatibility > > > > > > > device Adevice B > > > > > > > > > > > > > > > > > > > > > a device attribute named migration_version is defined under each > > > > > > > device's > > > > > > > sysfs node. e.g. > > > > > > > (/sys/bus/pci/devices/\:00\:02.0/$mdev_UUID/migration_version). > > > > > > > userspace tools read the migration_version as a string from the > > > > > > > source device, > > > > > > > and write it to the migration_version sysfs attribute in the > > > > > > > target device. > > > > > > > > > > > > > > The userspace should treat ANY of below conditions as two devices > > > > > > > not compatible: > > > > > > > - any one of the two devices does not have a migration_version > > > > > > > attribute > > > > > > > - error when reading from migration_version attribute of one > > > > > > > device > > > > > > > - error when writing migration_version string of one device to > > > > > > > migration_version attribute of the other device > > > > > > > > > > > > > > The string read from migration_version attribute is defined by > > > > > > > device vendor > > > > > > > driver and is completely opaque to the userspace. > > > > > > > for a Intel vGPU, string format can be defined like > > > > > > > "parent device PCI ID" + "version of gvt driver" + "mdev type" + > > > > > > > "aggregator count". > > > > > > > > > > > > > > for an NVMe VF connecting to a remote storage. it could be > > > > > > > "PCI ID" + "driver version" + "configured remote storage URL" > > > > > > > > > > > > > > for a QAT VF, it may be > > > > > > > "PCI ID" + "driver version" + "supported encryption set". 
> > > > > > > > > > > > > > (to avoid namespace conflicts between vendors, we may prefix a > > > > > > > driver name to > > > > > > > each migration_version string. e.g. > > > > > > > i915-v1-8086-591d-i915-GVTg_V5_8-1) > > > > > > > > > > It's very strange to define it as opaque and then proceed to describe > > > > > the contents of that opaque string. The point is that its contents > > > > > are defined by the vendor driver to describe the device, driver > > > > > version, > > > > > and possibly metadata about the configuration of the device. One > > > > > instance of a device might generate a different string from another. > > > > > The string that a device produces is not necessarily the only string > > > > > the vendor driver will accept, for example the driver might support > > > > > backwards compatible migrations. > > > > > > > > (As I've said in the previous discussion, off one of the patch series) > > > > > > > > My view is it makes sense to have a half-way house on the opaqueness of > > > > this string; I'd expect to have an ID and version that are human > > > > readable, maybe a device ID/name that's human interpretable and then a > > > >
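A minimal userspace sketch of the check described in the proposal, assuming the per-device sysfs layout from the thread (a migration_version file under each device's sysfs directory). The function name migration_compatible is hypothetical and is not part of libvirt, QEMU, or any existing tool. It applies the three failure conditions literally and never parses the string; a vendor driver that rejects the string is assumed to fail the write with an error, which Python surfaces as OSError.

```python
import os


def migration_compatible(src_dev: str, dst_dev: str) -> bool:
    """Hypothetical helper: True only if dst accepts src's migration_version.

    src_dev and dst_dev are assumed to be sysfs device directories, e.g.
    /sys/bus/pci/devices/<addr>/<mdev UUID>.
    """
    src_attr = os.path.join(src_dev, "migration_version")
    dst_attr = os.path.join(dst_dev, "migration_version")

    # Condition 1: either device lacks the attribute -> not compatible.
    if not (os.path.exists(src_attr) and os.path.exists(dst_attr)):
        return False

    # Condition 2: reading the source attribute fails -> not compatible.
    try:
        with open(src_attr, "r") as f:
            version = f.read()
    except OSError:
        return False

    # Condition 3: writing the string to the target fails -> not compatible.
    # The string is treated as opaque: it is passed through, never parsed.
    # Errors raised when the file is flushed/closed are caught here too.
    try:
        with open(dst_attr, "w") as f:
            f.write(version)
    except OSError:
        return False
    return True
```

Because userspace only relays the string, the whole policy stays in the target's vendor driver, which is what allows that driver to accept strings other than the one its own device produces (for example, for backwards compatible migrations).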
Re: device compatibility interface for live migration with assigned devices
On Thu, 16 Jul 2020 16:32:30 +0800 Yan Zhao wrote: > On Thu, Jul 16, 2020 at 12:16:26PM +0800, Jason Wang wrote: > > > > On 2020/7/14 7:29 AM, Yan Zhao wrote: > > > hi folks, > > > we are defining a device migration compatibility interface that helps > > > upper > > > layer stack like openstack/ovirt/libvirt to check if two devices are > > > live migration compatible. > > > The "devices" here could be MDEVs, physical devices, or hybrid of the two. > > > e.g. we could use it to check whether > > > - a src MDEV can migrate to a target MDEV, > > > - a src VF in SRIOV can migrate to a target VF in SRIOV, > > > - a src MDEV can migrate to a target VF in SRIOV. > > >(e.g. SIOV/SRIOV backward compatibility case) > > > > > > The upper layer stack could use this interface as the last step to check > > > if one device is able to migrate to another device before triggering a > > > real > > > live migration procedure. > > > we are not sure if this interface is of value or help to you. please don't > > > hesitate to drop your valuable comments. > > > > > > > > > (1) interface definition > > > The interface is defined in below way: > > > > > > [diagram: userspace reads migration_version from device A and writes it to the migration_version attribute of device B to check migration compatibility] > > > > > > > > > a device attribute named migration_version is defined under each device's > > > sysfs node. e.g. > > > (/sys/bus/pci/devices/\:00\:02.0/$mdev_UUID/migration_version). > > > > > > Are you aware of the devlink based device management interface that is > > proposed upstream? I think it has many advantages over sysfs, do you > > consider to switch to that? Advantages, such as? > not familiar with the devlink. will do some research of it. > > > > > > > userspace tools read the migration_version as a string from the source > > > device, > > > and write it to the migration_version sysfs attribute in the target > > > device.
> > > > > > The userspace should treat ANY of below conditions as two devices not > > > compatible: > > > - any one of the two devices does not have a migration_version attribute > > > - error when reading from migration_version attribute of one device > > > - error when writing migration_version string of one device to > > >migration_version attribute of the other device > > > > > > The string read from migration_version attribute is defined by device > > > vendor > > > driver and is completely opaque to the userspace. > > > > > > My understanding is that something opaque to userspace is not the > > philosophy > > but the VFIO live migration in itself is essentially a big opaque stream to > userspace. > > > of Linux. Instead of having a generic API but opaque value, why not do in a > > vendor specific way like: > > > > 1) exposing the device capability in a vendor specific way via sysfs/devlink > > or other API > > 2) management read capability in both src and dst and determine whether we > > can do the migration > > > > This is the way we plan to do with vDPA. > > > yes, in another reply, Alex proposed to use an interface in json format. > I guess we can define something like > > { "self" : > [ > { "pciid" : "8086591d", > "driver" : "i915", > "gvt-version" : "v1", > "mdev_type" : "i915-GVTg_V5_2", > "aggregator" : "1", > "pv-mode" : "none" > } > ], > "compatible" : > [ > { "pciid" : "8086591d", > "driver" : "i915", > "gvt-version" : "v1", > "mdev_type" : "i915-GVTg_V5_2", > "aggregator" : "1", > "pv-mode" : "none" > }, > { "pciid" : "8086591d", > "driver" : "i915", > "gvt-version" : "v1", > "mdev_type" : "i915-GVTg_V5_4", > "aggregator" : "2", > "pv-mode" : "none" > }, > { "pciid" : "8086591d", > "driver" : "i915", > "gvt-version" : "v2", > "mdev_type" : "i915-GVTg_V5_4", > "aggregator" : "2", > "pv-mode" : "none, ppgtt, context" > } > ... 
> ] > } > > But as those fields are mostly vendor specific, the userspace can > only do simple string comparing, I guess the list would be very long as > it needs to enumerate all possible targets. This ignores so much of what I tried to achieve in my example :( > also, in some fields like "gvt-version", is there a simple way to express > things like v2+? That's not a reasonable thing to express anyway, how can you be certain that v3 won't break compatibility with v2? Sean proposed a versioning scheme that accounts for this, using an x.y.z version expressing the major, minor, and bugfix versions, where there is no compatibility across major versions,
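To illustrate the trade-off discussed above, here is a sketch (not an agreed interface) of matching a source device's "self" entry against a target's "compatible" list. It assumes the JSON layout from the quoted example, but replaces "gvt-version" with a hypothetical "version" field in x.y.z (major.minor.bugfix) form. Only the rule "no compatibility across major versions" comes from the thread; the rest of Sean's rule is cut off in this excerpt, so the sketch deliberately ignores the minor and bugfix numbers. Every other field is compared by plain string equality, as Yan describes. The function names are illustrative.

```python
import json


def major_version(version: str) -> int:
    """Return the major part of a hypothetical 'x.y.z' version string."""
    major, _minor, _bugfix = version.split(".")
    return int(major)


def is_compatible(src_doc: str, dst_doc: str, version_key: str = "version") -> bool:
    """Check src's "self" entry against each entry in dst's "compatible" list."""
    src_self = json.loads(src_doc)["self"][0]
    for entry in json.loads(dst_doc)["compatible"]:
        # Non-version fields: plain string equality, as proposed in the thread.
        fields_match = all(
            entry.get(key) == value
            for key, value in src_self.items()
            if key != version_key
        )
        if not fields_match:
            continue
        # Version field: only the major number must agree (the stated rule).
        # Minor/bugfix policy is deliberately left out of this sketch.
        if major_version(entry[version_key]) == major_version(src_self[version_key]):
            return True
    return False
```

With this shape, a target needs one entry per accepted configuration and major version rather than one per version string, which addresses the list-length concern. Whether minor versions may move forward or backward is not stated in this excerpt.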
Re: device compatibility interface for live migration with assigned devices
On Wed, 15 Jul 2020 15:37:19 +0800 Alex Xu wrote: > Alex Williamson wrote on Wed, 15 Jul 2020 at 5:00 AM: > > > On Tue, 14 Jul 2020 18:19:46 +0100 > > "Dr. David Alan Gilbert" wrote: > > > > > * Alex Williamson (alex.william...@redhat.com) wrote: > > > > On Tue, 14 Jul 2020 11:21:29 +0100 > > > > Daniel P. Berrangé wrote: > > > > > > > > > On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote: > > > > > > hi folks, > > > > > > we are defining a device migration compatibility interface that > > helps upper > > > > > > layer stack like openstack/ovirt/libvirt to check if two devices > > are > > > > > > live migration compatible. > > > > > > The "devices" here could be MDEVs, physical devices, or hybrid of > > the two. > > > > > > e.g. we could use it to check whether > > > > > > - a src MDEV can migrate to a target MDEV, > > > > > > - a src VF in SRIOV can migrate to a target VF in SRIOV, > > > > > > - a src MDEV can migrate to a target VF in SRIOV. > > > > > > (e.g. SIOV/SRIOV backward compatibility case) > > > > > > > > > > > > The upper layer stack could use this interface as the last step to > > check > > > > > > if one device is able to migrate to another device before > > triggering a real > > > > > > live migration procedure. > > > > > > we are not sure if this interface is of value or help to you. > > please don't > > > > > > hesitate to drop your valuable comments. > > > > > > > > > > > > > > > > > > (1) interface definition > > > > > > The interface is defined in below way: > > > > > > > > > > > > [diagram: userspace reads migration_version from device A and writes it to the migration_version attribute of device B to check migration compatibility] > > > > > > > > > > > > > > > > > > a device attribute named migration_version is defined under each > > device's > > > > > > sysfs node. e.g. > > (/sys/bus/pci/devices/\:00\:02.0/$mdev_UUID/migration_version).
> > > > > > userspace tools read the migration_version as a string from the > > source device, > > > > > > and write it to the migration_version sysfs attribute in the > > target device. > > > > > > > > > > > > The userspace should treat ANY of below conditions as two devices > > not compatible: > > > > > > - any one of the two devices does not have a migration_version > > attribute > > > > > > - error when reading from migration_version attribute of one device > > > > > > - error when writing migration_version string of one device to > > > > > > migration_version attribute of the other device > > > > > > > > > > > > The string read from migration_version attribute is defined by > > device vendor > > > > > > driver and is completely opaque to the userspace. > > > > > > for an Intel vGPU, string format can be defined like > > > > > > "parent device PCI ID" + "version of gvt driver" + "mdev type" + > > "aggregator count". > > > > > > > > > > > > for an NVMe VF connecting to a remote storage. it could be > > > > > > "PCI ID" + "driver version" + "configured remote storage URL" > > > > > > > > > > > > for a QAT VF, it may be > > > > > > "PCI ID" + "driver version" + "supported encryption set". > > > > > > > > > > > > (to avoid namespace conflicts between vendors, we may prefix a > > driver name to > > > > > > each migration_version string. e.g. > > i915-v1-8086-591d-i915-GVTg_V5_8-1) > > > > > > > > It's very strange to define it as opaque and then proceed to describe > > > > the contents of that opaque string. The point is that its contents > > > > are defined by the vendor driver to describe the device, driver > > version, > > > > and possibly metadata about the configuration of the device. One > > > > instance of a device might generate a different string from another. > > > > The string that a device produces is not necessarily the only string > > > > the vendor driver will accept, for example the driver might support > > > > backwards compatible migrations.
> > > > > > (As I've said in the previous discussion, off one of the patch series) > > > > > > My view is it makes sense to have a half-way house on the opaqueness of > > > this string; I'd expect to have an ID and version that are human > > > readable, maybe a device ID/name that's human interpretable and then a > > > bunch of other cruft that may be device/vendor/version specific. > > > > > > I'm thinking that we want to be able to report problems and include the > > > string and the user to be able to easily identify the device that was > > > complaining and notice a difference in versions, and perhaps also use > > > it in compatibility patterns to find compatible hosts; but that does > > > get tricky when it's a 'ask the
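Dave's half-opaque string can be sketched as follows, assuming (hypothetically) the "<driver>-<version>-<vendor-specific rest>" layout of the example string i915-v1-8086-591d-i915-GVTg_V5_8-1. Only the leading driver name and version are treated as human-readable; the remainder stays opaque. The helper names are illustrative, and they serve only error reporting: the compatibility decision itself still comes from the vendor driver.

```python
from typing import NamedTuple


class VersionString(NamedTuple):
    driver: str   # human-readable: which vendor driver produced the string
    version: str  # human-readable: that driver's migration format version
    opaque: str   # vendor-specific remainder, never interpreted by userspace


def split_version_string(s: str) -> VersionString:
    # Hypothetical layout: split off only the first two dash-separated fields.
    # Raises ValueError if there are fewer than three parts; a driver name
    # that itself contains a dash would be misparsed.
    driver, version, opaque = s.strip().split("-", 2)
    return VersionString(driver, version, opaque)


def explain_mismatch(src: str, dst: str) -> str:
    """Build a human-readable hint for a failed compatibility check."""
    a, b = split_version_string(src), split_version_string(dst)
    if a.driver != b.driver:
        return f"driver differs: {a.driver} vs {b.driver}"
    if a.version != b.version:
        return f"{a.driver} version differs: {a.version} vs {b.version}"
    return f"{a.driver} {a.version}: vendor-specific fields differ"
```

Keeping the human-readable prefix short and fixed lets a management tool log something a user can act on ("i915 version differs: v1 vs v2") without userspace ever interpreting the vendor-specific part.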
Re: device compatibility interface for live migration with assigned devices
On Wed, 15 Jul 2020 16:20:41 +0800 Yan Zhao wrote: > On Tue, Jul 14, 2020 at 02:59:48PM -0600, Alex Williamson wrote: > > On Tue, 14 Jul 2020 18:19:46 +0100 > > "Dr. David Alan Gilbert" wrote: > > > > > * Alex Williamson (alex.william...@redhat.com) wrote: > > > > On Tue, 14 Jul 2020 11:21:29 +0100 > > > > Daniel P. Berrangé wrote: > > > > > > > > > On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote: > > > > > > hi folks, > > > > > > we are defining a device migration compatibility interface that > > > > > > helps upper > > > > > > layer stack like openstack/ovirt/libvirt to check if two devices are > > > > > > live migration compatible. > > > > > > The "devices" here could be MDEVs, physical devices, or hybrid of > > > > > > the two. > > > > > > e.g. we could use it to check whether > > > > > > - a src MDEV can migrate to a target MDEV, > > > > > > - a src VF in SRIOV can migrate to a target VF in SRIOV, > > > > > > - a src MDEV can migrate to a target VF in SRIOV. > > > > > > (e.g. SIOV/SRIOV backward compatibility case) > > > > > > > > > > > > The upper layer stack could use this interface as the last step to > > > > > > check > > > > > > if one device is able to migrate to another device before > > > > > > triggering a real > > > > > > live migration procedure. > > > > > > we are not sure if this interface is of value or help to you. > > > > > > please don't > > > > > > hesitate to drop your valuable comments. > > > > > > > > > > > > > > > > > > (1) interface definition > > > > > > The interface is defined in below way: > > > > > > > > > > > > [diagram: userspace reads migration_version from device A and writes it to the migration_version attribute of device B to check migration compatibility] > > > > > > > > > > > > > > > > > > a device attribute named migration_version is defined under each > > > > > > device's > > > > > > sysfs node. e.g.
> > > > > > (/sys/bus/pci/devices/\:00\:02.0/$mdev_UUID/migration_version). > > > > > > userspace tools read the migration_version as a string from the > > > > > > source device, > > > > > > and write it to the migration_version sysfs attribute in the target > > > > > > device. > > > > > > > > > > > > The userspace should treat ANY of below conditions as two devices > > > > > > not compatible: > > > > > > - any one of the two devices does not have a migration_version > > > > > > attribute > > > > > > - error when reading from migration_version attribute of one device > > > > > > - error when writing migration_version string of one device to > > > > > > migration_version attribute of the other device > > > > > > > > > > > > The string read from migration_version attribute is defined by > > > > > > device vendor > > > > > > driver and is completely opaque to the userspace. > > > > > > for an Intel vGPU, string format can be defined like > > > > > > "parent device PCI ID" + "version of gvt driver" + "mdev type" + > > > > > > "aggregator count". > > > > > > > > > > > > for an NVMe VF connecting to a remote storage. it could be > > > > > > "PCI ID" + "driver version" + "configured remote storage URL" > > > > > > > > > > > > for a QAT VF, it may be > > > > > > "PCI ID" + "driver version" + "supported encryption set". > > > > > > > > > > > > (to avoid namespace conflicts between vendors, we may prefix a > > > > > > driver name to > > > > > > each migration_version string. e.g. > > > > > > i915-v1-8086-591d-i915-GVTg_V5_8-1) > > > > > > > > It's very strange to define it as opaque and then proceed to describe > > > > the contents of that opaque string. The point is that its contents > > > > are defined by the vendor driver to describe the device, driver version, > > > > and possibly metadata about the configuration of the device. One > > > > instance of a device might generate a different string from another.
> > > > The string that a device produces is not necessarily the only string > > > > the vendor driver will accept, for example the driver might support > > > > backwards compatible migrations. > > > > > > (As I've said in the previous discussion, off one of the patch series) > > > > > > My view is it makes sense to have a half-way house on the opaqueness of > > > this string; I'd expect to have an ID and version that are human > > > readable, maybe a device ID/name that's human interpretable and then a > > > bunch of other cruft that may be device/vendor/version specific. > > > > > > I'm thinking that we want to be able to report problems and include the > > > string and the user to be able to easily identify the device that was > > > complaining and notice a difference in versions, and perhaps also use > >