Re: [Qemu-devel] [Qemu-ppc] [PULL 1/2] spapr: Reset CAS & IRQ subsystem after devices
On Fri, Aug 23, 2019 at 03:47:52PM +0200, Laurent Vivier wrote: > On 23/08/2019 07:39, David Gibson wrote: > > On Thu, Aug 22, 2019 at 10:07:09PM +0200, Laurent Vivier wrote: > >> On 13/08/2019 08:59, David Gibson wrote: > >>> This fixes a nasty regression in qemu-4.1 for the 'pseries' machine, > >>> caused by the new "dual" interrupt controller model. Specifically, > >>> qemu can crash when used with KVM if a 'system_reset' is requested > >>> while there's active I/O in the guest. > >>> > >>> The problem is that in spapr_machine_reset() we: > >>> > >>> 1. Reset the CAS vector state > >>> spapr_ovec_cleanup(spapr->ov5_cas); > >>> > >>> 2. Reset all devices > >>> qemu_devices_reset() > >>> > >>> 3. Reset the irq subsystem > >>> spapr_irq_reset(); > >>> > >>> However (1) implicitly changes the interrupt delivery mode, because > >>> whether we're using XICS or XIVE depends on the CAS state. We don't > >>> properly initialize the new irq mode until (3) though - in particular > >>> setting up the KVM devices. > >>> > >>> During (2), we can temporarily drop the BQL allowing some irqs to be > >>> delivered which will go to an irq system that's not properly set up. > >>> > >>> Specifically, if the previous guest was in (KVM) XIVE mode, the CAS > >>> reset will put us back in XICS mode. kvm_kernel_irqchip() still > >>> returns true, because XIVE was using KVM, however XICs doesn't have > >>> its KVM components intialized and kernel_xics_fd == -1. When the irq > >>> is delivered it goes via ics_kvm_set_irq() which assert()s that > >>> kernel_xics_fd != -1. > >>> > >>> This change addresses the problem by delaying the CAS reset until > >>> after the devices reset. The device reset should quiesce all the > >>> devices so we won't get irqs delivered while we mess around with the > >>> IRQ. The CAS reset and irq re-initialize should also now be under the > >>> same BQL critical section so nothing else should be able to interrupt > >>> it either. > >>> > >>> We also move the spapr_irq_msi_reset() used in one of the legacy irq > >>> modes, since it logically makes sense at the same point as the > >>> spapr_irq_reset() (it's essentially an equivalent operation for older > >>> machine types). Since we don't need to switch between different > >>> interrupt controllers for those old machine types it shouldn't > >>> actually be broken in those cases though. > >>> > >>> Cc: Cédric Le Goater > >>> > >>> Fixes: b2e22477 "spapr: add a 'reset' method to the sPAPR IRQ backend" > >>> Fixes: 13db0cd9 "spapr: introduce a new sPAPR IRQ backend supporting > >>> XIVE and XICS" > >>> Signed-off-by: David Gibson > >>> --- > >>> hw/ppc/spapr.c | 24 > >>> 1 file changed, 12 insertions(+), 12 deletions(-) > >>> > >>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c > >>> index 821f0d4a49..12ed4b065c 100644 > >>> --- a/hw/ppc/spapr.c > >>> +++ b/hw/ppc/spapr.c > >>> @@ -1726,6 +1726,18 @@ static void spapr_machine_reset(MachineState > >>> *machine) > >>> spapr_setup_hpt_and_vrma(spapr); > >>> } > >>> > >>> +/* > >>> + * NVLink2-connected GPU RAM needs to be placed on a separate NUMA > >>> node. > >>> + * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() > >>> which is > >>> + * called from vPHB reset handler so we initialize the counter here. > >>> + * If no NUMA is configured from the QEMU side, we start from 1 as > >>> GPU RAM > >>> + * must be equally distant from any other node. > >>> + * The final value of spapr->gpu_numa_id is going to be written to > >>> + * max-associativity-domains in spapr_build_fdt(). > >>> + */ > >>> +spapr->gpu_numa_id = MAX(1, nb_numa_nodes); > >>> +qemu_devices_reset(); > >>> + > >>> /* > >>> * If this reset wasn't generated by CAS, we should reset our > >>> * negotiated options and start from scratch > >>> @@ -1741,18 +1753,6 @@ static void spapr_machine_reset(MachineState > >>> *machine) > >>> spapr_irq_msi_reset(spapr); > >>> } > >>> > >>> -/* > >>> - * NVLink2-connected GPU RAM needs to be placed on a separate NUMA > >>> node. > >>> - * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() > >>> which is > >>> - * called from vPHB reset handler so we initialize the counter here. > >>> - * If no NUMA is configured from the QEMU side, we start from 1 as > >>> GPU RAM > >>> - * must be equally distant from any other node. > >>> - * The final value of spapr->gpu_numa_id is going to be written to > >>> - * max-associativity-domains in spapr_build_fdt(). > >>> - */ > >>> -spapr->gpu_numa_id = MAX(1, nb_numa_nodes); > >>> -qemu_devices_reset(); > >>> - > >>> /* > >>> * This is fixing some of the default configuration of the XIVE > >>> * devices. To be called after the reset of the machine devices. > >>> > >> > >> This commit breaks migration between POWER8 <-> POWER9 hosts: > >> > >>
Re: [Qemu-devel] [Qemu-ppc] [PULL 1/2] spapr: Reset CAS & IRQ subsystem after devices
On 23/08/2019 07:39, David Gibson wrote: > On Thu, Aug 22, 2019 at 10:07:09PM +0200, Laurent Vivier wrote: >> On 13/08/2019 08:59, David Gibson wrote: >>> This fixes a nasty regression in qemu-4.1 for the 'pseries' machine, >>> caused by the new "dual" interrupt controller model. Specifically, >>> qemu can crash when used with KVM if a 'system_reset' is requested >>> while there's active I/O in the guest. >>> >>> The problem is that in spapr_machine_reset() we: >>> >>> 1. Reset the CAS vector state >>> spapr_ovec_cleanup(spapr->ov5_cas); >>> >>> 2. Reset all devices >>> qemu_devices_reset() >>> >>> 3. Reset the irq subsystem >>> spapr_irq_reset(); >>> >>> However (1) implicitly changes the interrupt delivery mode, because >>> whether we're using XICS or XIVE depends on the CAS state. We don't >>> properly initialize the new irq mode until (3) though - in particular >>> setting up the KVM devices. >>> >>> During (2), we can temporarily drop the BQL allowing some irqs to be >>> delivered which will go to an irq system that's not properly set up. >>> >>> Specifically, if the previous guest was in (KVM) XIVE mode, the CAS >>> reset will put us back in XICS mode. kvm_kernel_irqchip() still >>> returns true, because XIVE was using KVM, however XICs doesn't have >>> its KVM components intialized and kernel_xics_fd == -1. When the irq >>> is delivered it goes via ics_kvm_set_irq() which assert()s that >>> kernel_xics_fd != -1. >>> >>> This change addresses the problem by delaying the CAS reset until >>> after the devices reset. The device reset should quiesce all the >>> devices so we won't get irqs delivered while we mess around with the >>> IRQ. The CAS reset and irq re-initialize should also now be under the >>> same BQL critical section so nothing else should be able to interrupt >>> it either. >>> >>> We also move the spapr_irq_msi_reset() used in one of the legacy irq >>> modes, since it logically makes sense at the same point as the >>> spapr_irq_reset() (it's essentially an equivalent operation for older >>> machine types). Since we don't need to switch between different >>> interrupt controllers for those old machine types it shouldn't >>> actually be broken in those cases though. >>> >>> Cc: Cédric Le Goater >>> >>> Fixes: b2e22477 "spapr: add a 'reset' method to the sPAPR IRQ backend" >>> Fixes: 13db0cd9 "spapr: introduce a new sPAPR IRQ backend supporting >>> XIVE and XICS" >>> Signed-off-by: David Gibson >>> --- >>> hw/ppc/spapr.c | 24 >>> 1 file changed, 12 insertions(+), 12 deletions(-) >>> >>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c >>> index 821f0d4a49..12ed4b065c 100644 >>> --- a/hw/ppc/spapr.c >>> +++ b/hw/ppc/spapr.c >>> @@ -1726,6 +1726,18 @@ static void spapr_machine_reset(MachineState >>> *machine) >>> spapr_setup_hpt_and_vrma(spapr); >>> } >>> >>> +/* >>> + * NVLink2-connected GPU RAM needs to be placed on a separate NUMA >>> node. >>> + * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which >>> is >>> + * called from vPHB reset handler so we initialize the counter here. >>> + * If no NUMA is configured from the QEMU side, we start from 1 as GPU >>> RAM >>> + * must be equally distant from any other node. >>> + * The final value of spapr->gpu_numa_id is going to be written to >>> + * max-associativity-domains in spapr_build_fdt(). >>> + */ >>> +spapr->gpu_numa_id = MAX(1, nb_numa_nodes); >>> +qemu_devices_reset(); >>> + >>> /* >>> * If this reset wasn't generated by CAS, we should reset our >>> * negotiated options and start from scratch >>> @@ -1741,18 +1753,6 @@ static void spapr_machine_reset(MachineState >>> *machine) >>> spapr_irq_msi_reset(spapr); >>> } >>> >>> -/* >>> - * NVLink2-connected GPU RAM needs to be placed on a separate NUMA >>> node. >>> - * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which >>> is >>> - * called from vPHB reset handler so we initialize the counter here. >>> - * If no NUMA is configured from the QEMU side, we start from 1 as GPU >>> RAM >>> - * must be equally distant from any other node. >>> - * The final value of spapr->gpu_numa_id is going to be written to >>> - * max-associativity-domains in spapr_build_fdt(). >>> - */ >>> -spapr->gpu_numa_id = MAX(1, nb_numa_nodes); >>> -qemu_devices_reset(); >>> - >>> /* >>> * This is fixing some of the default configuration of the XIVE >>> * devices. To be called after the reset of the machine devices. >>> >> >> This commit breaks migration between POWER8 <-> POWER9 hosts: >> >> qemu-system-ppc64: error while loading state for instance 0x1 of device 'cpu' >> qemu-system-ppc64: load of migration failed: Operation not permitted >> >> Using a guest with a running 4.18 kernel (RHEL 8) and "-M >> pseries,max-cpu-compat=power8" on both sides. >> >> There is no
Re: [Qemu-devel] [Qemu-ppc] [PULL 1/2] spapr: Reset CAS & IRQ subsystem after devices
On Thu, Aug 22, 2019 at 10:07:09PM +0200, Laurent Vivier wrote: > On 13/08/2019 08:59, David Gibson wrote: > > This fixes a nasty regression in qemu-4.1 for the 'pseries' machine, > > caused by the new "dual" interrupt controller model. Specifically, > > qemu can crash when used with KVM if a 'system_reset' is requested > > while there's active I/O in the guest. > > > > The problem is that in spapr_machine_reset() we: > > > > 1. Reset the CAS vector state > > spapr_ovec_cleanup(spapr->ov5_cas); > > > > 2. Reset all devices > > qemu_devices_reset() > > > > 3. Reset the irq subsystem > > spapr_irq_reset(); > > > > However (1) implicitly changes the interrupt delivery mode, because > > whether we're using XICS or XIVE depends on the CAS state. We don't > > properly initialize the new irq mode until (3) though - in particular > > setting up the KVM devices. > > > > During (2), we can temporarily drop the BQL allowing some irqs to be > > delivered which will go to an irq system that's not properly set up. > > > > Specifically, if the previous guest was in (KVM) XIVE mode, the CAS > > reset will put us back in XICS mode. kvm_kernel_irqchip() still > > returns true, because XIVE was using KVM, however XICs doesn't have > > its KVM components intialized and kernel_xics_fd == -1. When the irq > > is delivered it goes via ics_kvm_set_irq() which assert()s that > > kernel_xics_fd != -1. > > > > This change addresses the problem by delaying the CAS reset until > > after the devices reset. The device reset should quiesce all the > > devices so we won't get irqs delivered while we mess around with the > > IRQ. The CAS reset and irq re-initialize should also now be under the > > same BQL critical section so nothing else should be able to interrupt > > it either. > > > > We also move the spapr_irq_msi_reset() used in one of the legacy irq > > modes, since it logically makes sense at the same point as the > > spapr_irq_reset() (it's essentially an equivalent operation for older > > machine types). Since we don't need to switch between different > > interrupt controllers for those old machine types it shouldn't > > actually be broken in those cases though. > > > > Cc: Cédric Le Goater > > > > Fixes: b2e22477 "spapr: add a 'reset' method to the sPAPR IRQ backend" > > Fixes: 13db0cd9 "spapr: introduce a new sPAPR IRQ backend supporting > > XIVE and XICS" > > Signed-off-by: David Gibson > > --- > > hw/ppc/spapr.c | 24 > > 1 file changed, 12 insertions(+), 12 deletions(-) > > > > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c > > index 821f0d4a49..12ed4b065c 100644 > > --- a/hw/ppc/spapr.c > > +++ b/hw/ppc/spapr.c > > @@ -1726,6 +1726,18 @@ static void spapr_machine_reset(MachineState > > *machine) > > spapr_setup_hpt_and_vrma(spapr); > > } > > > > +/* > > + * NVLink2-connected GPU RAM needs to be placed on a separate NUMA > > node. > > + * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which > > is > > + * called from vPHB reset handler so we initialize the counter here. > > + * If no NUMA is configured from the QEMU side, we start from 1 as GPU > > RAM > > + * must be equally distant from any other node. > > + * The final value of spapr->gpu_numa_id is going to be written to > > + * max-associativity-domains in spapr_build_fdt(). > > + */ > > +spapr->gpu_numa_id = MAX(1, nb_numa_nodes); > > +qemu_devices_reset(); > > + > > /* > > * If this reset wasn't generated by CAS, we should reset our > > * negotiated options and start from scratch > > @@ -1741,18 +1753,6 @@ static void spapr_machine_reset(MachineState > > *machine) > > spapr_irq_msi_reset(spapr); > > } > > > > -/* > > - * NVLink2-connected GPU RAM needs to be placed on a separate NUMA > > node. > > - * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which > > is > > - * called from vPHB reset handler so we initialize the counter here. > > - * If no NUMA is configured from the QEMU side, we start from 1 as GPU > > RAM > > - * must be equally distant from any other node. > > - * The final value of spapr->gpu_numa_id is going to be written to > > - * max-associativity-domains in spapr_build_fdt(). > > - */ > > -spapr->gpu_numa_id = MAX(1, nb_numa_nodes); > > -qemu_devices_reset(); > > - > > /* > > * This is fixing some of the default configuration of the XIVE > > * devices. To be called after the reset of the machine devices. > > > > This commit breaks migration between POWER8 <-> POWER9 hosts: > > qemu-system-ppc64: error while loading state for instance 0x1 of device 'cpu' > qemu-system-ppc64: load of migration failed: Operation not permitted > > Using a guest with a running 4.18 kernel (RHEL 8) and "-M > pseries,max-cpu-compat=power8" on both sides. > > There is no problem if both hosts are of the same
Re: [Qemu-devel] [Qemu-ppc] [PULL 1/2] spapr: Reset CAS & IRQ subsystem after devices
On 13/08/2019 08:59, David Gibson wrote: > This fixes a nasty regression in qemu-4.1 for the 'pseries' machine, > caused by the new "dual" interrupt controller model. Specifically, > qemu can crash when used with KVM if a 'system_reset' is requested > while there's active I/O in the guest. > > The problem is that in spapr_machine_reset() we: > > 1. Reset the CAS vector state > spapr_ovec_cleanup(spapr->ov5_cas); > > 2. Reset all devices > qemu_devices_reset() > > 3. Reset the irq subsystem > spapr_irq_reset(); > > However (1) implicitly changes the interrupt delivery mode, because > whether we're using XICS or XIVE depends on the CAS state. We don't > properly initialize the new irq mode until (3) though - in particular > setting up the KVM devices. > > During (2), we can temporarily drop the BQL allowing some irqs to be > delivered which will go to an irq system that's not properly set up. > > Specifically, if the previous guest was in (KVM) XIVE mode, the CAS > reset will put us back in XICS mode. kvm_kernel_irqchip() still > returns true, because XIVE was using KVM, however XICs doesn't have > its KVM components intialized and kernel_xics_fd == -1. When the irq > is delivered it goes via ics_kvm_set_irq() which assert()s that > kernel_xics_fd != -1. > > This change addresses the problem by delaying the CAS reset until > after the devices reset. The device reset should quiesce all the > devices so we won't get irqs delivered while we mess around with the > IRQ. The CAS reset and irq re-initialize should also now be under the > same BQL critical section so nothing else should be able to interrupt > it either. > > We also move the spapr_irq_msi_reset() used in one of the legacy irq > modes, since it logically makes sense at the same point as the > spapr_irq_reset() (it's essentially an equivalent operation for older > machine types). Since we don't need to switch between different > interrupt controllers for those old machine types it shouldn't > actually be broken in those cases though. > > Cc: Cédric Le Goater > > Fixes: b2e22477 "spapr: add a 'reset' method to the sPAPR IRQ backend" > Fixes: 13db0cd9 "spapr: introduce a new sPAPR IRQ backend supporting > XIVE and XICS" > Signed-off-by: David Gibson > --- > hw/ppc/spapr.c | 24 > 1 file changed, 12 insertions(+), 12 deletions(-) > > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c > index 821f0d4a49..12ed4b065c 100644 > --- a/hw/ppc/spapr.c > +++ b/hw/ppc/spapr.c > @@ -1726,6 +1726,18 @@ static void spapr_machine_reset(MachineState *machine) > spapr_setup_hpt_and_vrma(spapr); > } > > +/* > + * NVLink2-connected GPU RAM needs to be placed on a separate NUMA node. > + * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which is > + * called from vPHB reset handler so we initialize the counter here. > + * If no NUMA is configured from the QEMU side, we start from 1 as GPU > RAM > + * must be equally distant from any other node. > + * The final value of spapr->gpu_numa_id is going to be written to > + * max-associativity-domains in spapr_build_fdt(). > + */ > +spapr->gpu_numa_id = MAX(1, nb_numa_nodes); > +qemu_devices_reset(); > + > /* > * If this reset wasn't generated by CAS, we should reset our > * negotiated options and start from scratch > @@ -1741,18 +1753,6 @@ static void spapr_machine_reset(MachineState *machine) > spapr_irq_msi_reset(spapr); > } > > -/* > - * NVLink2-connected GPU RAM needs to be placed on a separate NUMA node. > - * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which is > - * called from vPHB reset handler so we initialize the counter here. > - * If no NUMA is configured from the QEMU side, we start from 1 as GPU > RAM > - * must be equally distant from any other node. > - * The final value of spapr->gpu_numa_id is going to be written to > - * max-associativity-domains in spapr_build_fdt(). > - */ > -spapr->gpu_numa_id = MAX(1, nb_numa_nodes); > -qemu_devices_reset(); > - > /* > * This is fixing some of the default configuration of the XIVE > * devices. To be called after the reset of the machine devices. > This commit breaks migration between POWER8 <-> POWER9 hosts: qemu-system-ppc64: error while loading state for instance 0x1 of device 'cpu' qemu-system-ppc64: load of migration failed: Operation not permitted Using a guest with a running 4.18 kernel (RHEL 8) and "-M pseries,max-cpu-compat=power8" on both sides. There is no problem if both hosts are of the same kind ( P8 <-> P8 or P9 <-> P9). Thanks, Laurent