Re: [PATCH RFC 1/1] qemu: capabilities: disable csske for host cpu
On Fri, Mar 11, 2022 at 04:24:22PM +0100, Christian Borntraeger wrote:
> On 11.03.22 at 15:56, Daniel P. Berrangé wrote:
> > On Fri, Mar 11, 2022 at 03:52:57PM +0100, Christian Borntraeger wrote:
> > > On 11.03.22 at 14:08, Daniel P. Berrangé wrote:
> > > > On Fri, Mar 11, 2022 at 12:37:46PM +, Daniel P. Berrangé wrote:
> > > > > On Fri, Mar 11, 2022 at 01:12:35PM +0100, Christian Borntraeger wrote:
> > > > > > On 11.03.22 at 10:23, David Hildenbrand wrote:
> > > > > > > On 11.03.22 10:17, Daniel P. Berrangé wrote:
> > > > > > > > On Thu, Mar 10, 2022 at 11:17:38PM -0500, Collin Walling wrote:
> > > > > > > > > CPU models past gen16a will no longer support the csske
> > > > > > > > > feature. In order to secure migration of guests running on
> > > > > > > > > machines that still support this feature to machines that
> > > > > > > > > do not, let's disable csske in the host-model.
> > > > > > >
> > > > > > > Sorry to say, removing CPU features is a no-go when wanting
> > > > > > > to guarantee forward migration without taking care of CPU
> > > > > > > model details manually and simply using the host model.
> > > > > > > Self-made HW vendor problem.
> > > > > >
> > > > > > And this simply does not reflect reality. Intel and Power
> > > > > > have removed TX for example. We can now sit back and please
> > > > > > ourselves how we live in our world of dreams. Or we can try
> > > > > > to define an interface that deals with reality and actually
> > > > > > solves problems.
> > > > >
> > > > > This proposal wouldn't have helped in the case of Intel
> > > > > removing TSX, because it was removed without prior warning
> > > > > in the middle of the product lifecycle. At that time there
> > > > > were already millions of VMs in existence using the removed
> > > > > feature.
> > > > >
> > > > > > > > The problem scenario you describe is the intended semantics
> > > > > > > > of host-model though. It enables all features available in
> > > > > > > > the host that you launched on. It lets you live migrate to
> > > > > > > > a target host with the same, or a greater number of
> > > > > > > > features. If the target has a greater number of features,
> > > > > > > > it should restrict the VM to the subset of features that
> > > > > > > > were present on the original source CPU. If the target has
> > > > > > > > fewer features, then you simply can't live migrate a VM
> > > > > > > > using host-model.
> > > > > > > >
> > > > > > > > To get live migration in both directions across CPUs with
> > > > > > > > differing feature sets, the VM needs to be configured with
> > > > > > > > a named CPU model that is a subset of both, rather than
> > > > > > > > host-model.
> > > > > > >
> > > > > > > Right, and cpu-model-baseline does that job for you if
> > > > > > > you're too lazy to look up the proper model.
> > > > > >
> > > > > > Yes, baseline will work, but this requires tooling like
> > > > > > OpenStack. The normal user will just use the default and
> > > > > > this is host-model.
> > > > > >
> > > > > > Let me explain the use case for this feature. Migration
> > > > > > between different versions:
> > > > > > baseline: always works
> > > > > > host-passthrough: you get what you deserve
> > > > > > default model: works
> > > > > > We have disabled CSSKE from our default models (-cpu gen15a
> > > > > > will not present csske).
> > > > > > So that works as well.
> > > > > > host-model: Also works for all machines that have csske.
> > > > > > Now: Let's say gen17 will no longer support this. That
> > > > > > means that we cannot migrate host-model from gen16 or gen15
> > > > > > because those will have csske.
> > > > > > What options do we have? If we disable csske in the host
> > > > > > capabilities that would mean that a host compare against an
> > > > > > XML from an older QEMU would fail (even if you move from
> > > > > > gen14 to gen14). So this is not a good option.
> > > > > >
> > > > > > By disabling deprecated features ONLY for the _initial_
> > > > > > expansion of host-model, but keeping it in the host
> > > > > > capabilities you can migrate existing guests (with the
> > > > > > feature) as we only disable in the expansion, but manually
> > > > > > asking for it still works.
> > > > > > AND it will allow moving this instantiation of the guest to
> > > > > > future machines without the feature. Basically everything
> > > > > > works.
> > > > >
> > > > > The change you propose works functionally, but nonetheless
> > > > > it is changing the semantics of host-model. It is defined
> > > > > to expose all the features in the host, and the proposal
> > > > > changes that. If an app actually /wants/ to use the
> > > > > deprecated feature and it exists in the host, then
> > > > > host-model should be allowing that as it does today
Re: [PATCH RFC 1/1] qemu: capabilities: disable csske for host cpu
On Fri, Mar 11, 2022 at 04:31:49PM +0100, Christian Borntraeger wrote:
> On 11.03.22 at 16:24, Christian Borntraeger wrote:
> > On 11.03.22 at 15:56, Daniel P. Berrangé wrote:
> > > On Fri, Mar 11, 2022 at 03:52:57PM +0100, Christian Borntraeger wrote:
> > > > On 11.03.22 at 14:08, Daniel P. Berrangé wrote:
> > > > > On Fri, Mar 11, 2022 at 12:37:46PM +, Daniel P. Berrangé wrote:
> > > > > > On Fri, Mar 11, 2022 at 01:12:35PM +0100, Christian Borntraeger wrote:
> > > > > > > On 11.03.22 at 10:23, David Hildenbrand wrote:
> > > > > > > > On 11.03.22 10:17, Daniel P. Berrangé wrote:
> > > > > > > > > On Thu, Mar 10, 2022 at 11:17:38PM -0500, Collin Walling wrote:
> > > > > > > > > > CPU models past gen16a will no longer support the csske
> > > > > > > > > > feature. In order to secure migration of guests running on
> > > > > > > > > > machines that still support this feature to machines that
> > > > > > > > > > do not, let's disable csske in the host-model.
> > > > > > > >
> > > > > > > > Sorry to say, removing CPU features is a no-go when wanting
> > > > > > > > to guarantee forward migration without taking care of CPU
> > > > > > > > model details manually and simply using the host model.
> > > > > > > > Self-made HW vendor problem.
> > > > > > >
> > > > > > > And this simply does not reflect reality. Intel and Power
> > > > > > > have removed TX for example. We can now sit back and please
> > > > > > > ourselves how we live in our world of dreams. Or we can try
> > > > > > > to define an interface that deals with reality and actually
> > > > > > > solves problems.
> > > > > >
> > > > > > This proposal wouldn't have helped in the case of Intel
> > > > > > removing TSX, because it was removed without prior warning
> > > > > > in the middle of the product lifecycle. At that time there
> > > > > > were already millions of VMs in existence using the removed
> > > > > > feature.
> > > > > >
> > > > > > > > > The problem scenario you describe is the intended semantics
> > > > > > > > > of host-model though. It enables all features available in
> > > > > > > > > the host that you launched on. It lets you live migrate to
> > > > > > > > > a target host with the same, or a greater number of
> > > > > > > > > features. If the target has a greater number of features,
> > > > > > > > > it should restrict the VM to the subset of features that
> > > > > > > > > were present on the original source CPU. If the target has
> > > > > > > > > fewer features, then you simply can't live migrate a VM
> > > > > > > > > using host-model.
> > > > > > > > >
> > > > > > > > > To get live migration in both directions across CPUs with
> > > > > > > > > differing feature sets, the VM needs to be configured with
> > > > > > > > > a named CPU model that is a subset of both, rather than
> > > > > > > > > host-model.
> > > > > > > >
> > > > > > > > Right, and cpu-model-baseline does that job for you if
> > > > > > > > you're too lazy to look up the proper model.
> > > > > > >
> > > > > > > Yes, baseline will work, but this requires tooling like
> > > > > > > OpenStack. The normal user will just use the default and
> > > > > > > this is host-model.
> > > > > > >
> > > > > > > Let me explain the use case for this feature. Migration
> > > > > > > between different versions:
> > > > > > > baseline: always works
> > > > > > > host-passthrough: you get what you deserve
> > > > > > > default model: works
> > > > > > > We have disabled CSSKE from our default models (-cpu gen15a
> > > > > > > will not present csske).
> > > > > > > So that works as well.
> > > > > > > host-model: Also works for all machines that have csske.
> > > > > > > Now: Let's say gen17 will no longer support this. That
> > > > > > > means that we cannot migrate host-model from gen16 or gen15
> > > > > > > because those will have csske.
> > > > > > > What options do we have? If we disable csske in the host
> > > > > > > capabilities that would mean that a host compare against an
> > > > > > > XML from an older QEMU would fail (even if you move from
> > > > > > > gen14 to gen14). So this is not a good option.
> > > > > > >
> > > > > > > By disabling deprecated features ONLY for the _initial_
> > > > > > > expansion of host-model, but keeping it in the host
> > > > > > > capabilities you can migrate existing guests (with the
> > > > > > > feature) as we only disable in the expansion, but manually
> > > > > > > asking for it still works.
> > > > > > > AND it will allow moving this instantiation of the guest to
> > > > > > > future machines without the feature. Basically everything
> > > > > > > works.
> > > > > >
> > > > > > The change you p
Re: [PATCH RFC 1/1] qemu: capabilities: disable csske for host cpu
On 11.03.22 at 16:24, Christian Borntraeger wrote:
> On 11.03.22 at 15:56, Daniel P. Berrangé wrote:
> > On Fri, Mar 11, 2022 at 03:52:57PM +0100, Christian Borntraeger wrote:
> > > On 11.03.22 at 14:08, Daniel P. Berrangé wrote:
> > > > On Fri, Mar 11, 2022 at 12:37:46PM +, Daniel P. Berrangé wrote:
> > > > > On Fri, Mar 11, 2022 at 01:12:35PM +0100, Christian Borntraeger wrote:
> > > > > > On 11.03.22 at 10:23, David Hildenbrand wrote:
> > > > > > > On 11.03.22 10:17, Daniel P. Berrangé wrote:
> > > > > > > > On Thu, Mar 10, 2022 at 11:17:38PM -0500, Collin Walling wrote:
> > > > > > > > > CPU models past gen16a will no longer support the csske
> > > > > > > > > feature. In order to secure migration of guests running on
> > > > > > > > > machines that still support this feature to machines that
> > > > > > > > > do not, let's disable csske in the host-model.
> > > > > > >
> > > > > > > Sorry to say, removing CPU features is a no-go when wanting
> > > > > > > to guarantee forward migration without taking care of CPU
> > > > > > > model details manually and simply using the host model.
> > > > > > > Self-made HW vendor problem.
> > > > > >
> > > > > > And this simply does not reflect reality. Intel and Power
> > > > > > have removed TX for example. We can now sit back and please
> > > > > > ourselves how we live in our world of dreams. Or we can try
> > > > > > to define an interface that deals with reality and actually
> > > > > > solves problems.
> > > > >
> > > > > This proposal wouldn't have helped in the case of Intel
> > > > > removing TSX, because it was removed without prior warning
> > > > > in the middle of the product lifecycle. At that time there
> > > > > were already millions of VMs in existence using the removed
> > > > > feature.
> > > > >
> > > > > > > > The problem scenario you describe is the intended semantics
> > > > > > > > of host-model though. It enables all features available in
> > > > > > > > the host that you launched on. It lets you live migrate to
> > > > > > > > a target host with the same, or a greater number of
> > > > > > > > features. If the target has a greater number of features,
> > > > > > > > it should restrict the VM to the subset of features that
> > > > > > > > were present on the original source CPU. If the target has
> > > > > > > > fewer features, then you simply can't live migrate a VM
> > > > > > > > using host-model.
> > > > > > > >
> > > > > > > > To get live migration in both directions across CPUs with
> > > > > > > > differing feature sets, the VM needs to be configured with
> > > > > > > > a named CPU model that is a subset of both, rather than
> > > > > > > > host-model.
> > > > > > >
> > > > > > > Right, and cpu-model-baseline does that job for you if
> > > > > > > you're too lazy to look up the proper model.
> > > > > >
> > > > > > Yes, baseline will work, but this requires tooling like
> > > > > > OpenStack. The normal user will just use the default and
> > > > > > this is host-model.
> > > > > >
> > > > > > Let me explain the use case for this feature. Migration
> > > > > > between different versions:
> > > > > > baseline: always works
> > > > > > host-passthrough: you get what you deserve
> > > > > > default model: works
> > > > > > We have disabled CSSKE from our default models (-cpu gen15a
> > > > > > will not present csske).
> > > > > > So that works as well.
> > > > > > host-model: Also works for all machines that have csske.
> > > > > > Now: Let's say gen17 will no longer support this. That
> > > > > > means that we cannot migrate host-model from gen16 or gen15
> > > > > > because those will have csske.
> > > > > > What options do we have? If we disable csske in the host
> > > > > > capabilities that would mean that a host compare against an
> > > > > > XML from an older QEMU would fail (even if you move from
> > > > > > gen14 to gen14). So this is not a good option.
> > > > > >
> > > > > > By disabling deprecated features ONLY for the _initial_
> > > > > > expansion of host-model, but keeping it in the host
> > > > > > capabilities you can migrate existing guests (with the
> > > > > > feature) as we only disable in the expansion, but manually
> > > > > > asking for it still works.
> > > > > > AND it will allow moving this instantiation of the guest to
> > > > > > future machines without the feature. Basically everything
> > > > > > works.
> > > > >
> > > > > The change you propose works functionally, but nonetheless
> > > > > it is changing the semantics of host-model. It is defined
> > > > > to expose all the features in the host, and the proposal
> > > > > changes that. If an app actually /wants/ to use the
> > > > > deprecated feature and it exists in the host, then
> > > > > host-model should be allowing that as it does today.
> > > > >
> > > > > The problem scenario you describe is ultimately that
> > > > > OpenStack does not have a future-proof default CPU choice.
> > > > > Libvirt and QEMU provide a mechanism for them to pick other
> > > > > CPU models that would address the problem, but they're not
> > > > > using that. The challenge is that OpenStack defaults
> > > > > currently are a zero-interaction thing.
> > > > >
> > > > > They could retain their zero-interaction defaults, if at
> > > > > install time they queried the libvirt capabilities to learn
> > > > > which named CPU models are available, whereupon they could
> > > > > decide to use gen15a. The main challenge here is that the
> > > > > list of named CPU models is an unordered set, so it is hard
> > > > > to programmatically figure out which of the available named
> > > > > CPU models is the newest/best/recommended.
> > > > >
> > > > > IOW, what's missing is a way for apps to easily identify
> > > > > that 'gen15a' is the best CPU to use on the host, without
> > > > > needing human interaction.
> > > >
> > > > I think this could be solved with a change to
> > > > query-cpu-definitions in QEMU, to add an extra
> > > > 'recommended: bool' attribute to the CpuDefinitionInfo
> > > > struct. This would be defined to be only set for 1 CPU
> > > > model in the list, and would reflect the recommended CPU
> > > > model given the current version of QEMU, kernel and
> > > > hardware. Or we could allow 'rec
Re: [PATCH RFC 1/1] qemu: capabilities: disable csske for host cpu
On 11.03.22 at 15:56, Daniel P. Berrangé wrote:
> On Fri, Mar 11, 2022 at 03:52:57PM +0100, Christian Borntraeger wrote:
> > On 11.03.22 at 14:08, Daniel P. Berrangé wrote:
> > > On Fri, Mar 11, 2022 at 12:37:46PM +, Daniel P. Berrangé wrote:
> > > > On Fri, Mar 11, 2022 at 01:12:35PM +0100, Christian Borntraeger wrote:
> > > > > On 11.03.22 at 10:23, David Hildenbrand wrote:
> > > > > > On 11.03.22 10:17, Daniel P. Berrangé wrote:
> > > > > > > On Thu, Mar 10, 2022 at 11:17:38PM -0500, Collin Walling wrote:
> > > > > > > > CPU models past gen16a will no longer support the csske
> > > > > > > > feature. In order to secure migration of guests running on
> > > > > > > > machines that still support this feature to machines that
> > > > > > > > do not, let's disable csske in the host-model.
> > > > > >
> > > > > > Sorry to say, removing CPU features is a no-go when wanting
> > > > > > to guarantee forward migration without taking care of CPU
> > > > > > model details manually and simply using the host model.
> > > > > > Self-made HW vendor problem.
> > > > >
> > > > > And this simply does not reflect reality. Intel and Power
> > > > > have removed TX for example. We can now sit back and please
> > > > > ourselves how we live in our world of dreams. Or we can try
> > > > > to define an interface that deals with reality and actually
> > > > > solves problems.
> > > >
> > > > This proposal wouldn't have helped in the case of Intel
> > > > removing TSX, because it was removed without prior warning
> > > > in the middle of the product lifecycle. At that time there
> > > > were already millions of VMs in existence using the removed
> > > > feature.
> > > >
> > > > > > > The problem scenario you describe is the intended semantics
> > > > > > > of host-model though. It enables all features available in
> > > > > > > the host that you launched on. It lets you live migrate to
> > > > > > > a target host with the same, or a greater number of
> > > > > > > features. If the target has a greater number of features,
> > > > > > > it should restrict the VM to the subset of features that
> > > > > > > were present on the original source CPU. If the target has
> > > > > > > fewer features, then you simply can't live migrate a VM
> > > > > > > using host-model.
> > > > > > >
> > > > > > > To get live migration in both directions across CPUs with
> > > > > > > differing feature sets, the VM needs to be configured with
> > > > > > > a named CPU model that is a subset of both, rather than
> > > > > > > host-model.
> > > > > >
> > > > > > Right, and cpu-model-baseline does that job for you if
> > > > > > you're too lazy to look up the proper model.
> > > > >
> > > > > Yes, baseline will work, but this requires tooling like
> > > > > OpenStack. The normal user will just use the default and
> > > > > this is host-model.
> > > > >
> > > > > Let me explain the use case for this feature. Migration
> > > > > between different versions:
> > > > > baseline: always works
> > > > > host-passthrough: you get what you deserve
> > > > > default model: works
> > > > > We have disabled CSSKE from our default models (-cpu gen15a
> > > > > will not present csske).
> > > > > So that works as well.
> > > > > host-model: Also works for all machines that have csske.
> > > > > Now: Let's say gen17 will no longer support this. That
> > > > > means that we cannot migrate host-model from gen16 or gen15
> > > > > because those will have csske.
> > > > > What options do we have? If we disable csske in the host
> > > > > capabilities that would mean that a host compare against an
> > > > > XML from an older QEMU would fail (even if you move from
> > > > > gen14 to gen14). So this is not a good option.
> > > > >
> > > > > By disabling deprecated features ONLY for the _initial_
> > > > > expansion of host-model, but keeping it in the host
> > > > > capabilities you can migrate existing guests (with the
> > > > > feature) as we only disable in the expansion, but manually
> > > > > asking for it still works.
> > > > > AND it will allow moving this instantiation of the guest to
> > > > > future machines without the feature. Basically everything
> > > > > works.
> > > >
> > > > The change you propose works functionally, but nonetheless
> > > > it is changing the semantics of host-model. It is defined
> > > > to expose all the features in the host, and the proposal
> > > > changes that. If an app actually /wants/ to use the
> > > > deprecated feature and it exists in the host, then
> > > > host-model should be allowing that as it does today.
> > > >
> > > > The problem scenario you describe is ultimately that
> > > > OpenStack does not have a future-proof default CPU choice.
> > > > Libvirt and QEMU provide a mechanism for them to pick other
> > > > CPU models that would address the problem, but they're not
> > > > using that. The challenge is that OpenStack defaults
> > > > currently are a zero-interaction thing.
> > > >
> > > > They could retain their zero-interaction defaults, if at
> > > > install time they queried the libvirt capabilities to learn
> > > > which named CPU models are available, whereupon they could
> > > > decide to use gen15a. The main challenge here is that the
> > > > list of named CPU models is an unordered set, so it is hard
> > > > to programmatically figure out which of the available named
> > > > CPU models is the newest/best/recommended.
> > > >
> > > > IOW, what's missing is a way for apps to easily identify
> > > > that 'gen15a' is the best CPU to use on the host, without
> > > > needing human interaction.
> > >
> > > I think this could be solved with a change to
> > > query-cpu-definitions in QEMU, to add an extra
> > > 'recommended: bool' attribute to the CpuDefinitionInfo
> > > struct. This would be defined to be only set for 1 CPU
> > > model in the list, and would reflect the recommended CPU
> > > model given the current version of QEMU, kernel and
> > > hardware. Or we could allow 'recommended' to be set for
> > > more than 1 CPU, provided we de
Re: [PATCH RFC 1/1] qemu: capabilities: disable csske for host cpu
On Fri, Mar 11, 2022 at 03:52:57PM +0100, Christian Borntraeger wrote:
> On 11.03.22 at 14:08, Daniel P. Berrangé wrote:
> > On Fri, Mar 11, 2022 at 12:37:46PM +, Daniel P. Berrangé wrote:
> > > On Fri, Mar 11, 2022 at 01:12:35PM +0100, Christian Borntraeger wrote:
> > > > On 11.03.22 at 10:23, David Hildenbrand wrote:
> > > > > On 11.03.22 10:17, Daniel P. Berrangé wrote:
> > > > > > On Thu, Mar 10, 2022 at 11:17:38PM -0500, Collin Walling wrote:
> > > > > > > CPU models past gen16a will no longer support the csske
> > > > > > > feature. In order to secure migration of guests running on
> > > > > > > machines that still support this feature to machines that
> > > > > > > do not, let's disable csske in the host-model.
> > > > >
> > > > > Sorry to say, removing CPU features is a no-go when wanting
> > > > > to guarantee forward migration without taking care of CPU
> > > > > model details manually and simply using the host model.
> > > > > Self-made HW vendor problem.
> > > >
> > > > And this simply does not reflect reality. Intel and Power
> > > > have removed TX for example. We can now sit back and please
> > > > ourselves how we live in our world of dreams. Or we can try
> > > > to define an interface that deals with reality and actually
> > > > solves problems.
> > >
> > > This proposal wouldn't have helped in the case of Intel
> > > removing TSX, because it was removed without prior warning
> > > in the middle of the product lifecycle. At that time there
> > > were already millions of VMs in existence using the removed
> > > feature.
> > >
> > > > > > The problem scenario you describe is the intended semantics
> > > > > > of host-model though. It enables all features available in
> > > > > > the host that you launched on. It lets you live migrate to
> > > > > > a target host with the same, or a greater number of
> > > > > > features. If the target has a greater number of features,
> > > > > > it should restrict the VM to the subset of features that
> > > > > > were present on the original source CPU. If the target has
> > > > > > fewer features, then you simply can't live migrate a VM
> > > > > > using host-model.
> > > > > >
> > > > > > To get live migration in both directions across CPUs with
> > > > > > differing feature sets, the VM needs to be configured with
> > > > > > a named CPU model that is a subset of both, rather than
> > > > > > host-model.
> > > > >
> > > > > Right, and cpu-model-baseline does that job for you if
> > > > > you're too lazy to look up the proper model.
> > > >
> > > > Yes, baseline will work, but this requires tooling like
> > > > OpenStack. The normal user will just use the default and
> > > > this is host-model.
> > > >
> > > > Let me explain the use case for this feature. Migration
> > > > between different versions:
> > > > baseline: always works
> > > > host-passthrough: you get what you deserve
> > > > default model: works
> > > > We have disabled CSSKE from our default models (-cpu gen15a
> > > > will not present csske).
> > > > So that works as well.
> > > > host-model: Also works for all machines that have csske.
> > > > Now: Let's say gen17 will no longer support this. That
> > > > means that we cannot migrate host-model from gen16 or gen15
> > > > because those will have csske.
> > > > What options do we have? If we disable csske in the host
> > > > capabilities that would mean that a host compare against an
> > > > XML from an older QEMU would fail (even if you move from
> > > > gen14 to gen14). So this is not a good option.
> > > >
> > > > By disabling deprecated features ONLY for the _initial_
> > > > expansion of host-model, but keeping it in the host
> > > > capabilities you can migrate existing guests (with the
> > > > feature) as we only disable in the expansion, but manually
> > > > asking for it still works.
> > > > AND it will allow moving this instantiation of the guest to
> > > > future machines without the feature. Basically everything
> > > > works.
> > >
> > > The change you propose works functionally, but nonetheless
> > > it is changing the semantics of host-model. It is defined
> > > to expose all the features in the host, and the proposal
> > > changes that. If an app actually /wants/ to use the
> > > deprecated feature and it exists in the host, then
> > > host-model should be allowing that as it does today.
> > >
> > > The problem scenario you describe is ultimately that
> > > OpenStack does not have a future-proof default CPU choice.
> > > Libvirt and QEMU provide a mechanism for them to pick other
> > > CPU models that would address the problem, but they're not
> > > using that. The challenge is that OpenStack defaults
> > > currently are a zero-interaction thing.
> > >
> > > They could retain their zero-interaction defaults, if at
> > > install time they queried the libvirt capabilities to learn
> > > which named CPU models are available, whereupon they could
> > > decide to use gen15a. The main
Re: [PATCH RFC 1/1] qemu: capabilities: disable csske for host cpu
On 11.03.22 at 14:08, Daniel P. Berrangé wrote:
> On Fri, Mar 11, 2022 at 12:37:46PM +, Daniel P. Berrangé wrote:
> > On Fri, Mar 11, 2022 at 01:12:35PM +0100, Christian Borntraeger wrote:
> > > On 11.03.22 at 10:23, David Hildenbrand wrote:
> > > > On 11.03.22 10:17, Daniel P. Berrangé wrote:
> > > > > On Thu, Mar 10, 2022 at 11:17:38PM -0500, Collin Walling wrote:
> > > > > > CPU models past gen16a will no longer support the csske
> > > > > > feature. In order to secure migration of guests running on
> > > > > > machines that still support this feature to machines that
> > > > > > do not, let's disable csske in the host-model.
> > > >
> > > > Sorry to say, removing CPU features is a no-go when wanting
> > > > to guarantee forward migration without taking care of CPU
> > > > model details manually and simply using the host model.
> > > > Self-made HW vendor problem.
> > >
> > > And this simply does not reflect reality. Intel and Power
> > > have removed TX for example. We can now sit back and please
> > > ourselves how we live in our world of dreams. Or we can try
> > > to define an interface that deals with reality and actually
> > > solves problems.
> >
> > This proposal wouldn't have helped in the case of Intel
> > removing TSX, because it was removed without prior warning
> > in the middle of the product lifecycle. At that time there
> > were already millions of VMs in existence using the removed
> > feature.
> >
> > > > > The problem scenario you describe is the intended semantics
> > > > > of host-model though. It enables all features available in
> > > > > the host that you launched on. It lets you live migrate to
> > > > > a target host with the same, or a greater number of
> > > > > features. If the target has a greater number of features,
> > > > > it should restrict the VM to the subset of features that
> > > > > were present on the original source CPU. If the target has
> > > > > fewer features, then you simply can't live migrate a VM
> > > > > using host-model.
> > > > >
> > > > > To get live migration in both directions across CPUs with
> > > > > differing feature sets, the VM needs to be configured with
> > > > > a named CPU model that is a subset of both, rather than
> > > > > host-model.
> > > >
> > > > Right, and cpu-model-baseline does that job for you if
> > > > you're too lazy to look up the proper model.
> > >
> > > Yes, baseline will work, but this requires tooling like
> > > OpenStack. The normal user will just use the default and
> > > this is host-model.
> > >
> > > Let me explain the use case for this feature. Migration
> > > between different versions:
> > > baseline: always works
> > > host-passthrough: you get what you deserve
> > > default model: works
> > > We have disabled CSSKE from our default models (-cpu gen15a
> > > will not present csske).
> > > So that works as well.
> > > host-model: Also works for all machines that have csske.
> > > Now: Let's say gen17 will no longer support this. That
> > > means that we cannot migrate host-model from gen16 or gen15
> > > because those will have csske.
> > > What options do we have? If we disable csske in the host
> > > capabilities that would mean that a host compare against an
> > > XML from an older QEMU would fail (even if you move from
> > > gen14 to gen14). So this is not a good option.
> > >
> > > By disabling deprecated features ONLY for the _initial_
> > > expansion of host-model, but keeping it in the host
> > > capabilities you can migrate existing guests (with the
> > > feature) as we only disable in the expansion, but manually
> > > asking for it still works.
> > > AND it will allow moving this instantiation of the guest to
> > > future machines without the feature. Basically everything
> > > works.
> >
> > The change you propose works functionally, but nonetheless
> > it is changing the semantics of host-model. It is defined
> > to expose all the features in the host, and the proposal
> > changes that. If an app actually /wants/ to use the
> > deprecated feature and it exists in the host, then
> > host-model should be allowing that as it does today.
> >
> > The problem scenario you describe is ultimately that
> > OpenStack does not have a future-proof default CPU choice.
> > Libvirt and QEMU provide a mechanism for them to pick other
> > CPU models that would address the problem, but they're not
> > using that. The challenge is that OpenStack defaults
> > currently are a zero-interaction thing.
> >
> > They could retain their zero-interaction defaults, if at
> > install time they queried the libvirt capabilities to learn
> > which named CPU models are available, whereupon they could
> > decide to use gen15a. The main challenge here is that the
> > list of named CPU models is an unordered set, so it is hard
> > to programmatically figure out which of the available named
> > CPU models is the newest/best/recommended.
> >
> > IOW, what's missing is a way for apps to easily identify
> > that 'gen15a' is the best CPU to use on the host, without
> > needing human interaction.
>
> I think this could be solved with a change to
> query-cpu-definitions in QEMU, to add an extra
> 'recommended: bool' attribute to the CpuDefinitionInfo
> struct. This would be defined to be only set for 1 CPU
> model in the list, and would reflect the recommended CPU
> model given the current version of QEMU, kernel and
> hardware. Or we could allow 'recommended' to be set for
> more than 1 CPU, provided we define an explicit ordering of
> returned CPU models.

I like the recommended: bool attribute. It should provide
what we need.
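For illustration, here is a minimal Python sketch of how a management
application might consume the attribute proposed above. It assumes the
application already holds the reply of query-cpu-definitions as a list of
dictionaries. The "recommended" key is only the proposal from this thread and
is hypothetical; the function name, the sample list and its values are made up
for the example.

```python
from typing import Dict, List, Optional

# Sample data shaped like a query-cpu-definitions reply. The "recommended"
# key is the attribute proposed in this thread; it is hypothetical, and the
# entries below are invented for illustration only.
SAMPLE_DEFINITIONS: List[Dict[str, object]] = [
    {"name": "gen16a", "migration-safe": True},
    {"name": "gen15a", "migration-safe": True, "recommended": True},
    {"name": "host", "migration-safe": False},
]


def pick_recommended(definitions: List[Dict[str, object]]) -> Optional[str]:
    """Return the name of the first model flagged as recommended, if any.

    Taking the first match covers both variants discussed above: exactly
    one flagged model, or several flagged models returned in an explicit
    order. A missing key is treated the same as recommended=False.
    """
    for model in definitions:
        if model.get("recommended", False):
            return str(model["name"])
    return None


if __name__ == "__main__":
    print(pick_recommended(SAMPLE_DEFINITIONS))
```

When no entry carries the flag, the function returns None, so a caller would
still need its own fallback, for example keeping its existing default.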
Re: [PATCH RFC 1/1] qemu: capabilities: disable csske for host cpu
On Fri, Mar 11, 2022 at 12:37:46PM +, Daniel P. Berrangé wrote: > On Fri, Mar 11, 2022 at 01:12:35PM +0100, Christian Borntraeger wrote: > > > > > > Am 11.03.22 um 10:23 schrieb David Hildenbrand: > > > On 11.03.22 10:17, Daniel P. Berrangé wrote: > > > > On Thu, Mar 10, 2022 at 11:17:38PM -0500, Collin Walling wrote: > > > > > CPU models past gen16a will no longer support the csske feature. In > > > > > order to secure migration of guests running on machines that still > > > > > support this feature to machines that do not, let's disable csske > > > > > in the host-model. > > > > > > Sorry to say, removing CPU features is a no-go when wanting to guarantee > > > forward migration without taking care about CPU model details manually > > > and simply using the host model. Self-made HW vendor problem. > > > > And this simply does not reflect reality. Intel and Power have removed TX > > for example. We can now sit back and please ourselves how we live in our > > world of dreams. Or we can try to define an interface that deals with > > reality and actually solves problems. > > This proposal wouldn't have helped in the case of Intel removing > TSX, because it was removed without prior warning in the middle > of the product lifecycle. At that time there were already millions > of VMs in existance using the removed feature. > > > > > The problem scenario you describe is the intended semantics of > > > > host-model though. It enables all features available in the host > > > > that you launched on. It lets you live migrate to a target host > > > > with the same, or a greater number of features. If the target has > > > > a greater number of features, it should restrict the VM to the > > > > subset of features that were present on the original source CPU. > > > > If the target has fewer features, then you simply can't live > > > > migrate a VM using host-model. 
> > > > > > > > To get live migration in both directions across CPUs with differing > > > > featuresets, then the VM needs to be configured with a named CPU > > > > model that is a subset of both, rather than host-model. > > > > > > Right, and cpu-model-baseline does that job for you if you're lazy to > > > lookup the proper model. > > > > Yes baseline will work, but this requires tooling like openstack. The normal > > user will just use the default and this is host-model. > > > > Let me explain the usecase for this feature. Migration between different > > versions > > baseline: always works > > host-passthrough: you get what you deserve > > default model: works > > We have disabled CSSKE from our default models (-cpu gen15a will not > > present csske). > > So that works as well. > > host-model: Also works for all machines that have csske. > > Now: Let's say gen17 will no longer support this. That means that we can not > > migrate > > host-model from gen16 or gen15 because those will have csske. > > What options do we have? If we disable csske in the host capabilities that > > would mean > > that a host compare against an xml from an older QEMU would fail (even if > > you move > > from gen14 to gen14). So this is not a good option. > > > > By disabling deprecated features ONLY for the _initial_ expansion of > > host-model, but > > keeping it in the host capabilities you can migrate existing guests (with > > the > > feature) as we only disable in the expansion, but manually asking for it > > still works. > > AND it will allow to move this instantiation of the guest to future > > machines without > > the feature. Basically everything works. > > The change you propose works functionally, but none the less it is > changing the semantics of host-model. It is defined to expose all the > features in the host, and the proposal changes that. 
If an app actually > /wants/ to use the deprecated feature and it exists in the host, then > host-model should be allowing that as it does today. > > The problem scenario you describe is ultimately that OpenStack does > not have a future proof default CPU choice. Libvirt and QEMU provide > a mechanism for them to pick other CPU models that would address the > problem, but they're not using that. The challenge is that OpenStack > defaults currently are a zero-interaction thing. > > They could retain their zero-interaction defaults, if at install time > they queried the libvirt capabilities to learn which named CPU models > are available, whereupon they could decide to use gen15a. The main > challenge here is that the list of named CPU models is an unordered > set, so it is hard to programmatically figure out which of the available > named CPU models is the newest/best/recommended. > > IOW, what's missing is a way for apps to easily identify that 'gen15a' > is the best CPU to use on the host, without needing human interaction. I think this could be solved with a change to query-cpu-definitions in QEMU, to add an extra 'recommended: bool' attribute to the CpuDefinitionInfo struct. This would be defined to b
Re: [PATCH RFC 1/1] qemu: capabilities: disable csske for host cpu
On 11.03.22 13:54, Christian Borntraeger wrote: > Am 11.03.22 um 13:27 schrieb David Hildenbrand: >> On 11.03.22 13:12, Christian Borntraeger wrote: >>> >>> >>> Am 11.03.22 um 10:23 schrieb David Hildenbrand: On 11.03.22 10:17, Daniel P. Berrangé wrote: > On Thu, Mar 10, 2022 at 11:17:38PM -0500, Collin Walling wrote: >> CPU models past gen16a will no longer support the csske feature. In >> order to secure migration of guests running on machines that still >> support this feature to machines that do not, let's disable csske >> in the host-model. Sorry to say, removing CPU features is a no-go when wanting to guarantee forward migration without taking care about CPU model details manually and simply using the host model. Self-made HW vendor problem. >>> >>> And this simply does not reflect reality. Intel and Power have removed TX >>> for example. We can now sit back and please ourselves how we live in our >>> world of dreams. Or we can try to define an interface that deals with >>> reality and actually solves problems. >>> >> >> Ehm, so, I spell out the obvious and get such a reaction? Okay, thank you. > > Sorry, reading my writing again shows that I clearly miscommunicated in a > very bad style. My point was rather trying to solve a problem instead > I wrote something up in a hurry which resulted in something offensive. > > Please accept my apologies. > No hard feelings, I understand that this is an important thing to sort out for IBM. -- Thanks, David / dhildenb
Re: [PATCH RFC 1/1] qemu: capabilities: disable csske for host cpu
Am 11.03.22 um 13:27 schrieb David Hildenbrand: On 11.03.22 13:12, Christian Borntraeger wrote: Am 11.03.22 um 10:23 schrieb David Hildenbrand: On 11.03.22 10:17, Daniel P. Berrangé wrote: On Thu, Mar 10, 2022 at 11:17:38PM -0500, Collin Walling wrote: CPU models past gen16a will no longer support the csske feature. In order to secure migration of guests running on machines that still support this feature to machines that do not, let's disable csske in the host-model. Sorry to say, removing CPU features is a no-go when wanting to guarantee forward migration without taking care about CPU model details manually and simply using the host model. Self-made HW vendor problem. And this simply does not reflect reality. Intel and Power have removed TX for example. We can now sit back and please ourselves how we live in our world of dreams. Or we can try to define an interface that deals with reality and actually solves problems. Ehm, so, I spell out the obvious and get such a reaction? Okay, thank you. Sorry, reading my writing again shows that I clearly miscommunicated in a very bad style. My point was rather trying to solve a problem instead I wrote something up in a hurry which resulted in something offensive. Please accept my apologies.
Re: [PATCH RFC 1/1] qemu: capabilities: disable csske for host cpu
On Fri, Mar 11, 2022 at 01:12:35PM +0100, Christian Borntraeger wrote: > > > Am 11.03.22 um 10:23 schrieb David Hildenbrand: > > On 11.03.22 10:17, Daniel P. Berrangé wrote: > > > On Thu, Mar 10, 2022 at 11:17:38PM -0500, Collin Walling wrote: > > > > CPU models past gen16a will no longer support the csske feature. In > > > > order to secure migration of guests running on machines that still > > > > support this feature to machines that do not, let's disable csske > > > > in the host-model. > > > > Sorry to say, removing CPU features is a no-go when wanting to guarantee > > forward migration without taking care about CPU model details manually > > and simply using the host model. Self-made HW vendor problem. > > And this simply does not reflect reality. Intel and Power have removed TX > for example. We can now sit back and please ourselves how we live in our > world of dreams. Or we can try to define an interface that deals with > reality and actually solves problems. This proposal wouldn't have helped in the case of Intel removing TSX, because it was removed without prior warning in the middle of the product lifecycle. At that time there were already millions of VMs in existence using the removed feature. > > > The problem scenario you describe is the intended semantics of > > > host-model though. It enables all features available in the host > > > that you launched on. It lets you live migrate to a target host > > > with the same, or a greater number of features. If the target has > > > a greater number of features, it should restrict the VM to the > > > subset of features that were present on the original source CPU. > > > If the target has fewer features, then you simply can't live > > > migrate a VM using host-model. > > > > > > To get live migration in both directions across CPUs with differing > > > featuresets, then the VM needs to be configured with a named CPU > > > model that is a subset of both, rather than host-model. 
> > > > Right, and cpu-model-baseline does that job for you if you're lazy to > > lookup the proper model. > > Yes baseline will work, but this requires tooling like openstack. The normal > user will just use the default and this is host-model. > > Let me explain the usecase for this feature. Migration between different > versions > baseline: always works > host-passthrough: you get what you deserve > default model: works > We have disabled CSSKE from our default models (-cpu gen15a will not present > csske). > So that works as well. > host-model: Also works for all machines that have csske. > Now: Let's say gen17 will no longer support this. That means that we can not > migrate > host-model from gen16 or gen15 because those will have csske. > What options do we have? If we disable csske in the host capabilities that > would mean > that a host compare against an xml from an older QEMU would fail (even if you > move > from gen14 to gen14). So this is not a good option. > > By disabling deprecated features ONLY for the _initial_ expansion of > host-model, but > keeping it in the host capabilities you can migrate existing guests (with the > feature) as we only disable in the expansion, but manually asking for it > still works. > AND it will allow to move this instantiation of the guest to future machines > without > the feature. Basically everything works. The change you propose works functionally, but none the less it is changing the semantics of host-model. It is defined to expose all the features in the host, and the proposal changes that. If an app actually /wants/ to use the deprecated feature and it exists in the host, then host-model should be allowing that as it does today. The problem scenario you describe is ultimately that OpenStack does not have a future proof default CPU choice. Libvirt and QEMU provide a mechanism for them to pick other CPU models that would address the problem, but they're not using that. 
The challenge is that OpenStack defaults currently are a zero-interaction thing.

They could retain their zero-interaction defaults, if at install time they queried the libvirt capabilities to learn which named CPU models are available, whereupon they could decide to use gen15a. The main challenge here is that the list of named CPU models is an unordered set, so it is hard to programmatically figure out which of the available named CPU models is the newest/best/recommended.

IOW, what's missing is a way for apps to easily identify that 'gen15a' is the best CPU to use on the host, without needing human interaction.

Regards,
Daniel
--
|: https://berrange.com -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
Re: [PATCH RFC 1/1] qemu: capabilities: disable csske for host cpu
On 11.03.22 13:12, Christian Borntraeger wrote: > > > Am 11.03.22 um 10:23 schrieb David Hildenbrand: >> On 11.03.22 10:17, Daniel P. Berrangé wrote: >>> On Thu, Mar 10, 2022 at 11:17:38PM -0500, Collin Walling wrote: CPU models past gen16a will no longer support the csske feature. In order to secure migration of guests running on machines that still support this feature to machines that do not, let's disable csske in the host-model. >> >> Sorry to say, removing CPU features is a no-go when wanting to guarantee >> forward migration without taking care about CPU model details manually >> and simply using the host model. Self-made HW vendor problem. > > And this simply does not reflect reality. Intel and Power have removed TX > for example. We can now sit back and please ourselves how we live in our > world of dreams. Or we can try to define an interface that deals with > reality and actually solves problems. > Ehm, so, I spell out the obvious and get such a reaction? Okay, thank you. See my other reply, maybe we want a different kind of "host-model" from QEMU. It's Friday and I'm not particularly motivated to participate further in this discussion today. So I'm going to step away for today, please myself and live in a world of dreams. -- Thanks, David / dhildenb
Re: [PATCH RFC 1/1] qemu: capabilities: disable csske for host cpu
Am 11.03.22 um 10:23 schrieb David Hildenbrand:
> On 11.03.22 10:17, Daniel P. Berrangé wrote:
>> On Thu, Mar 10, 2022 at 11:17:38PM -0500, Collin Walling wrote:
>>> CPU models past gen16a will no longer support the csske feature. In
>>> order to secure migration of guests running on machines that still
>>> support this feature to machines that do not, let's disable csske
>>> in the host-model.
>
> Sorry to say, removing CPU features is a no-go when wanting to guarantee
> forward migration without taking care about CPU model details manually
> and simply using the host model. Self-made HW vendor problem.

And this simply does not reflect reality. Intel and Power have removed TX
for example. We can now sit back and please ourselves how we live in our
world of dreams. Or we can try to define an interface that deals with
reality and actually solves problems.

>> The problem scenario you describe is the intended semantics of
>> host-model though. It enables all features available in the host
>> that you launched on. It lets you live migrate to a target host
>> with the same, or a greater number of features. If the target has
>> a greater number of features, it should restrict the VM to the
>> subset of features that were present on the original source CPU.
>> If the target has fewer features, then you simply can't live
>> migrate a VM using host-model.
>>
>> To get live migration in both directions across CPUs with differing
>> featuresets, then the VM needs to be configured with a named CPU
>> model that is a subset of both, rather than host-model.
>
> Right, and cpu-model-baseline does that job for you if you're lazy to
> lookup the proper model.

Yes baseline will work, but this requires tooling like openstack. The normal
user will just use the default and this is host-model.

Let me explain the usecase for this feature. Migration between different versions:
baseline: always works
host-passthrough: you get what you deserve
default model: works
We have disabled CSSKE from our default models (-cpu gen15a will not present csske).
So that works as well.
host-model: Also works for all machines that have csske.
Now: Let's say gen17 will no longer support this. That means that we can not migrate
host-model from gen16 or gen15 because those will have csske.
What options do we have? If we disable csske in the host capabilities that would mean
that a host compare against an xml from an older QEMU would fail (even if you move
from gen14 to gen14). So this is not a good option.

By disabling deprecated features ONLY for the _initial_ expansion of host-model, but
keeping it in the host capabilities you can migrate existing guests (with the
feature) as we only disable in the expansion, but manually asking for it still works.
AND it will allow to move this instantiation of the guest to future machines without
the feature. Basically everything works.

The alternative of removing csske would result in too many failure scenarios.
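The split described in this message (drop deprecated features only from the initial host-model expansion, keep them in the host capabilities) can be sketched roughly as follows. This is only an illustration of the idea with plain sets, not how libvirt implements it. The function names and the "feat-*" features are invented, and only "csske" comes from the thread.

```python
# Features that are still present on current hosts but slated for removal.
DEPRECATED_FEATURES = frozenset({"csske"})


def expand_host_model(host_caps: set[str]) -> set[str]:
    """Initial expansion for a NEW guest: host features minus deprecated ones."""
    return host_caps - DEPRECATED_FEATURES


def host_can_run(guest_features: set[str], host_caps: set[str]) -> bool:
    """The host capabilities stay complete, so explicit requests still pass."""
    return guest_features <= host_caps


gen15_caps = {"feat-a", "feat-b", "csske"}   # hypothetical host with csske
gen17_caps = {"feat-a", "feat-b", "feat-c"}  # hypothetical host without it

new_guest = expand_host_model(gen15_caps)    # started after the change
old_guest = {"feat-a", "feat-b", "csske"}    # existing guest using csske

print(host_can_run(old_guest, gen15_caps))   # existing guest keeps working
print(host_can_run(new_guest, gen17_caps))   # new guest can move forward
print(host_can_run(old_guest, gen17_caps))   # old guest still cannot
```

The point of keeping the two sets separate is that host comparison against older guest XML does not start failing, while newly expanded guests no longer pick up the feature by default.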
Re: [PATCH RFC 1/1] qemu: capabilities: disable csske for host cpu
On 11.03.22 10:17, Daniel P. Berrangé wrote: > On Thu, Mar 10, 2022 at 11:17:38PM -0500, Collin Walling wrote: >> CPU models past gen16a will no longer support the csske feature. In >> order to secure migration of guests running on machines that still >> support this feature to machines that do not, let's disable csske >> in the host-model. Sorry to say, removing CPU features is a no-go when wanting to guarantee forward migration without taking care about CPU model details manually and simply using the host model. Self-made HW vendor problem. > > The problem scenario you describe is the intended semantics of > host-model though. It enables all features available in the host > that you launched on. It lets you live migrate to a target host > with the same, or a greater number of features. If the target has > a greater number of features, it should restrict the VM to the > subset of features that were present on the original source CPU. > If the target has fewer features, then you simply can't live > migrate a VM using host-model. > > To get live migration in both directions across CPUs with differing > featuresets, then the VM needs to be configured with a named CPU > model that is a subset of both, rather than host-model. Right, and cpu-model-baseline does that job for you if you're lazy to lookup the proper model. -- Thanks, David / dhildenb
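As a rough illustration of what baselining buys you, the idea can be reduced to an intersection of feature sets. This is only a sketch of the concept. The real baseline commands work on CPU model definitions rather than bare feature sets, and the function name and feature names below are made up.

```python
def baseline_features(host_feature_sets: list[set[str]]) -> set[str]:
    """Return the features present on every host in the migration pool."""
    if not host_feature_sets:
        raise ValueError("at least one host is required")
    return set.intersection(*host_feature_sets)


# Hypothetical pool: one host still has csske, the other has dropped it.
pool = [
    {"feat-a", "feat-b", "csske"},
    {"feat-a", "feat-b", "feat-c"},
]

common = baseline_features(pool)
print(sorted(common))
```

A guest restricted to the common set can be migrated in either direction between the hosts in the pool, at the cost of not seeing features that only some hosts provide.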
Re: [PATCH RFC 1/1] qemu: capabilities: disable csske for host cpu
On Thu, Mar 10, 2022 at 11:17:38PM -0500, Collin Walling wrote: > CPU models past gen16a will no longer support the csske feature. In > order to secure migration of guests running on machines that still > support this feature to machines that do not, let's disable csske > in the host-model. The problem scenario you describe is the intended semantics of host-model though. It enables all features available in the host that you launched on. It lets you live migrate to a target host with the same, or a greater number of features. If the target has a greater number of features, it should restrict the VM to the subset of features that were present on the original source CPU. If the target has fewer features, then you simply can't live migrate a VM using host-model. To get live migration in both directions across CPUs with differing featuresets, then the VM needs to be configured with a named CPU model that is a subset of both, rather than host-model. With regards, Daniel -- |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o-https://fstop138.berrange.com :| |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
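The host-model migration rule described above boils down to a subset check. A minimal sketch, assuming features are plain strings; the function name and feature names are invented for illustration and are not libvirt or QEMU APIs.

```python
def can_live_migrate(guest_features: set[str], target_host_features: set[str]) -> bool:
    """A guest can move only if the target offers everything it was started with."""
    return guest_features <= target_host_features


older_host = {"feat-a", "feat-b"}
newer_host = {"feat-a", "feat-b", "feat-c"}

# host-model freezes the guest to the features of the host it launched on.
guest_from_older = set(older_host)
guest_from_newer = set(newer_host)

print(can_live_migrate(guest_from_older, newer_host))  # target has more features
print(can_live_migrate(guest_from_newer, older_host))  # target has fewer features
```

The asymmetry is the whole issue in this thread: a later host that drops a feature behaves like the "fewer features" case, even though it is newer hardware.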
[PATCH RFC 1/1] qemu: capabilities: disable csske for host cpu
CPU models past gen16a will no longer support the csske feature. In
order to secure migration of guests running on machines that still
support this feature to machines that do not, let's disable csske
in the host-model.

Signed-off-by: Collin Walling
---
 src/qemu/qemu_capabilities.c               | 10 ++
 tests/domaincapsdata/qemu_2.11.0.s390x.xml |  1 +
 tests/domaincapsdata/qemu_2.12.0.s390x.xml |  1 +
 tests/domaincapsdata/qemu_2.8.0.s390x.xml  |  1 +
 tests/domaincapsdata/qemu_2.9.0.s390x.xml  |  1 +
 tests/domaincapsdata/qemu_3.0.0.s390x.xml  |  1 +
 tests/domaincapsdata/qemu_4.2.0.s390x.xml  |  1 +
 tests/domaincapsdata/qemu_6.0.0.s390x.xml  |  1 +
 8 files changed, 17 insertions(+)

diff --git a/src/qemu/qemu_capabilities.c b/src/qemu/qemu_capabilities.c
index 1b28c3f161..6a65c81f81 100644
--- a/src/qemu/qemu_capabilities.c
+++ b/src/qemu/qemu_capabilities.c
@@ -3804,6 +3804,16 @@ virQEMUCapsInitHostCPUModel(virQEMUCaps *qemuCaps,
         goto error;
     }
 
+    if (ARCH_IS_S390(qemuCaps->arch)) {
+        /*
+         * The CSSKE feature will no longer be supported beyond gen16a.
+         * To protect migration, disable this feature ahead of time
+         * for all s390x CPU models.
+         */
+        if (virCPUDefAddFeatureIfMissing(cpu, "csske", VIR_CPU_FEATURE_DISABLE) < 0)
+            goto error;
+    }
+
     virQEMUCapsSetHostModel(qemuCaps, type, cpu, migCPU, fullCPU);
 
 cleanup:
diff --git a/tests/domaincapsdata/qemu_2.11.0.s390x.xml b/tests/domaincapsdata/qemu_2.11.0.s390x.xml
index 804bf8020e..f21efca122 100644
--- a/tests/domaincapsdata/qemu_2.11.0.s390x.xml
+++ b/tests/domaincapsdata/qemu_2.11.0.s390x.xml
@@ -61,6 +61,7 @@
+
 z890.2
diff --git a/tests/domaincapsdata/qemu_2.12.0.s390x.xml b/tests/domaincapsdata/qemu_2.12.0.s390x.xml
index 5c3d9ce7db..9dc5d1396c 100644
--- a/tests/domaincapsdata/qemu_2.12.0.s390x.xml
+++ b/tests/domaincapsdata/qemu_2.12.0.s390x.xml
@@ -60,6 +60,7 @@
+
 z890.2
diff --git a/tests/domaincapsdata/qemu_2.8.0.s390x.xml b/tests/domaincapsdata/qemu_2.8.0.s390x.xml
index 2c075d7cdb..857cb1ad5b 100644
--- a/tests/domaincapsdata/qemu_2.8.0.s390x.xml
+++ b/tests/domaincapsdata/qemu_2.8.0.s390x.xml
@@ -48,6 +48,7 @@
+
 z10EC-base
diff --git a/tests/domaincapsdata/qemu_2.9.0.s390x.xml b/tests/domaincapsdata/qemu_2.9.0.s390x.xml
index d5b58a786d..2e1ba62dc0 100644
--- a/tests/domaincapsdata/qemu_2.9.0.s390x.xml
+++ b/tests/domaincapsdata/qemu_2.9.0.s390x.xml
@@ -49,6 +49,7 @@
+
 z10EC-base
diff --git a/tests/domaincapsdata/qemu_3.0.0.s390x.xml b/tests/domaincapsdata/qemu_3.0.0.s390x.xml
index f49b6907ff..1b6f64e69f 100644
--- a/tests/domaincapsdata/qemu_3.0.0.s390x.xml
+++ b/tests/domaincapsdata/qemu_3.0.0.s390x.xml
@@ -64,6 +64,7 @@
+
 z890.2
diff --git a/tests/domaincapsdata/qemu_4.2.0.s390x.xml b/tests/domaincapsdata/qemu_4.2.0.s390x.xml
index fb162ea578..b41929b585 100644
--- a/tests/domaincapsdata/qemu_4.2.0.s390x.xml
+++ b/tests/domaincapsdata/qemu_4.2.0.s390x.xml
@@ -81,6 +81,7 @@
+
 z800-base
diff --git a/tests/domaincapsdata/qemu_6.0.0.s390x.xml b/tests/domaincapsdata/qemu_6.0.0.s390x.xml
index 13fa3a637e..da4017541d 100644
--- a/tests/domaincapsdata/qemu_6.0.0.s390x.xml
+++ b/tests/domaincapsdata/qemu_6.0.0.s390x.xml
@@ -84,6 +84,7 @@
+
 z800-base
--
2.31.1