From: William Roche
Problem:
A Qemu VM can survive a memory error, as qemu can relay the error to the
VM kernel which could also deal with it -- poisoning/off-lining the impacted
page. This situation creates a hole in the VM memory address space (an
unreadable page or set of pages
From: William Roche
A memory page poisoned from the hypervisor level is no longer readable.
The migration of a VM will crash Qemu when it tries to read the
memory address space and stumbles on the poisoned page, with a stack
trace similar to the following:
Program terminated with signal SIGBUS, Bus error.
#0
On 11/8/23 22:45, Peter Xu wrote:
On Mon, Nov 06, 2023 at 10:38:14PM +0100, William Roche wrote:
But it implies a lot of other changes:
- The source has to flag the error pages to indicate a poison
(new flag in the exchange protocol)
- The destination has to be able to deal
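As a rough illustration of the first change listed above, the sketch below packs a per-page flag into the low bits of a page-aligned offset. This is in the same spirit as the RAM_SAVE_FLAG_* bits that QEMU's RAM migration stream combines with page offsets. However, every name and value here (the sketch_* functions, SKETCH_FLAG_POISON and the other flag constants) is hypothetical and is not QEMU's actual protocol. The destination's handling of such a page is only described in a comment.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values. They only have to stay below the page size,
 * because page offsets are page aligned and leave the low bits free. */
#define SKETCH_PAGE_SIZE    4096u
#define SKETCH_FLAG_ZERO    0x02u
#define SKETCH_FLAG_PAGE    0x08u
#define SKETCH_FLAG_POISON  0x40u   /* the new flag the source would send */
#define SKETCH_FLAG_MASK    ((uint64_t)SKETCH_PAGE_SIZE - 1)

/* Source side: tag a page-aligned offset with exactly one page type. */
static uint64_t sketch_encode_header(uint64_t page_offset, bool poisoned,
                                     bool is_zero)
{
    uint64_t flags = poisoned ? SKETCH_FLAG_POISON
                   : is_zero  ? SKETCH_FLAG_ZERO
                              : SKETCH_FLAG_PAGE;
    return page_offset | flags;
}

/* Destination side: split the header back into offset and flags. On
 * SKETCH_FLAG_POISON it would have to mark (or re-poison) its own copy
 * of the page instead of expecting page data to follow. */
static void sketch_decode_header(uint64_t header, uint64_t *page_offset,
                                 uint64_t *flags)
{
    *page_offset = header & ~SKETCH_FLAG_MASK;
    *flags = header & SKETCH_FLAG_MASK;
}
```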
From: William Roche
A memory page poisoned from the hypervisor level is no longer readable.
Thus, it is now treated as a zero-page for the ram saving migration phase.
The migration of a VM will crash Qemu when it tries to read the
memory address space and stumbles on the poisoned page
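The zero-page treatment described in this message can be sketched as a save routine that consults a list of known-poisoned offsets before it touches guest memory. The poisoned page is never dereferenced, so no SIGBUS is raised. The list type and the sketch_* functions are hypothetical stand-ins, and QEMU's real RAM save path is considerably more involved.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u

/* Hypothetical record of guest RAM offsets reported as poisoned. */
struct sketch_poison_list {
    const uint64_t *offsets;
    size_t count;
};

static bool sketch_is_poisoned(const struct sketch_poison_list *pl,
                               uint64_t offset)
{
    for (size_t i = 0; i < pl->count; i++) {
        if (pl->offsets[i] == offset) {
            return true;
        }
    }
    return false;
}

/* Copy one guest page into the outgoing buffer. A poisoned page is not
 * read (reading it would raise SIGBUS); an all-zero page is sent in its
 * place. Returns true when that substitution happened. */
static bool sketch_save_page(const struct sketch_poison_list *pl,
                             const uint8_t *guest_ram, uint64_t offset,
                             uint8_t out[SKETCH_PAGE_SIZE])
{
    if (sketch_is_poisoned(pl, offset)) {
        memset(out, 0, SKETCH_PAGE_SIZE);
        return true;
    }
    memcpy(out, guest_ram + offset, SKETCH_PAGE_SIZE);
    return false;
}
```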
From: William Roche
Migrating a poisoned page as a zero-page can only be done when the
running guest kernel knows about this poison, so that it marks this
page as inaccessible and any access in the VM would fail.
But if the poison information is not relayed to the VM, the kernel
does not prevent
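The condition stated in this message can be written as a check that runs before migration starts. Sending poisoned pages as zero pages is acceptable only if every one of them is already known to the guest. The struct and function names are hypothetical, and the sketch does not show how a real implementation would refuse the migration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-page record. */
struct sketch_poisoned_page {
    uint64_t ram_addr;
    bool guest_informed;   /* the guest kernel knows this page is poisoned */
};

/* If any poisoned page is unknown to the guest, the guest could later
 * read zeros where it expects its own data on the destination, so the
 * migration should not proceed. */
static bool sketch_zero_page_migration_safe(const struct sketch_poisoned_page *pages,
                                            size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (!pages[i].guest_informed) {
            return false;
        }
    }
    return true;
}
```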
From: William Roche
Note about ARM specifics:
This code has a small part that specifically impacts ARM machines,
which is why I added qemu-...@nongnu.org -- see description.
A Qemu VM can survive a memory error, as qemu can relay the error to the
VM kernel which could also deal
On 10/17/23 17:13, Peter Xu wrote:
On Tue, Oct 17, 2023 at 02:38:48AM +0200, William Roche wrote:
On 10/16/23 18:48, Peter Xu wrote:
On Fri, Oct 13, 2023 at 03:08:39PM +, William Roche wrote:
diff --git a/target/arm/kvm64.c b/target/arm/kvm64.c
index 5e95c496bb..e8db6380c1 100644
On 10/16/23 18:48, Peter Xu wrote:
On Fri, Oct 13, 2023 at 03:08:39PM +, William Roche wrote:
diff --git a/target/arm/kvm64.c b/target/arm/kvm64.c
index 5e95c496bb..e8db6380c1 100644
--- a/target/arm/kvm64.c
+++ b/target/arm/kvm64.c
@@ -1158,7 +1158,6 @@ void kvm_arch_on_sigbus_vcpu
to the kvm_hwpoison_page_add function in kvm_arch_on_sigbus_vcpu with:
kvm_hwpoison_page_add(ram_addr, (code == BUS_MCEERR_AR));
Of course we'll have to wait for the above patch to be integrated first.
HTH,
William.
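Here is a sketch of what the two-argument kvm_hwpoison_page_add() call quoted above could record. It assumes the boolean means "the guest kernel was informed of this poison". An action-required BUS_MCEERR_AR error is relayed to the guest, while, per this thread, an action-optional one may be ignored for some guests. The fixed-size table and all sketch_* names are hypothetical stand-ins, not the patch's code. The fallback constants are the Linux values.

```c
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4   /* Linux value: action-required memory error */
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5   /* Linux value: action-optional memory error */
#endif

#define SKETCH_MAX_POISON 64

struct sketch_poison_entry {
    uint64_t ram_addr;
    bool guest_informed;   /* assumption: an SRAR was injected into the guest */
};

static struct sketch_poison_entry sketch_poison[SKETCH_MAX_POISON];
static size_t sketch_poison_count;

/* Hypothetical stand-in for kvm_hwpoison_page_add(ram_addr, flag).
 * The same page can be reported twice (first SRAO, later SRAR), so an
 * existing entry is upgraded rather than duplicated. Returns false only
 * when the table is full. */
static bool sketch_hwpoison_page_add(uint64_t ram_addr, bool guest_informed)
{
    for (size_t i = 0; i < sketch_poison_count; i++) {
        if (sketch_poison[i].ram_addr == ram_addr) {
            sketch_poison[i].guest_informed |= guest_informed;
            return true;
        }
    }
    if (sketch_poison_count == SKETCH_MAX_POISON) {
        return false;
    }
    sketch_poison[sketch_poison_count].ram_addr = ram_addr;
    sketch_poison[sketch_poison_count].guest_informed = guest_informed;
    sketch_poison_count++;
    return true;
}

/* Caller side, mirroring the quoted line. */
static bool sketch_on_sigbus(int code, uint64_t ram_addr)
{
    return sketch_hwpoison_page_add(ram_addr, code == BUS_MCEERR_AR);
}
```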
On 9/19/23 00:00, William Roche wrote:
> Hi John,
>
> I'd li
From: William Roche
Migrating a poisoned page as a zero-page can only be done when the
running guest kernel knows about this poison, so that it marks this
page as inaccessible and any access in the VM would fail.
But if the poison information is not relayed to the VM, the kernel
does not prevent
From: William Roche
A Qemu VM can survive a memory error, as qemu can relay the error to the
VM kernel which could also deal with it -- poisoning/off-lining the impacted
page.
This situation creates a hole in the VM memory address space that the VM kernel
knows about (an unreadable page or set
On 9/22/23 16:30, Yazen Ghannam wrote:
On 9/22/23 4:36 AM, William Roche wrote:
On 9/21/23 19:41, Yazen Ghannam wrote:
[...]
Also, during page migration, does the data flow through the CPU core?
Sorry for the basic question. I haven't done a lot with virtualization.
Yes, in most cases
On 9/21/23 19:41, Yazen Ghannam wrote:
On 9/20/23 7:13 AM, Joao Martins wrote:
On 18/09/2023 23:00, William Roche wrote:
[...]
So it looks like the mechanism works fine... unless the VM has migrated
between the SRAO error and the first time it really touches the poisoned
page to get an SRAR
Thank you Zhijian for your feedback.
So I'll try to push this change today.
Cheers,
William.
On 9/20/23 12:04, Zhijian Li (Fujitsu) wrote:
On 15/09/2023 19:31, William Roche wrote:
On 9/15/23 05:13, Zhijian Li (Fujitsu) wrote:
I'm okay with "RDMA isn't touched".
BTW, could
Hi John,
I'd like to put the emphasis on the fact that ignoring the SRAO error
for a VM is a real problem at least for a specific (rare) case I'm
currently working on: the VM migration.
Context:
- In the case of a poisoned page in the VM address space, the migration
can't read it and will skip
On 9/15/23 05:13, Zhijian Li (Fujitsu) wrote:
I'm okay with "RDMA isn't touched".
BTW, could you share your reproducing program/hack to poison the page, so
that I can take a look at the RDMA part later when I'm free.
Not sure it's appropriate to acknowledge an untouched part. Anyway
On 9/7/23 13:12, Gupta, Pankaj wrote:
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 5fce74aac5..4d42d3ed4c 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -604,6 +604,10 @@ static void kvm_mce_inject(X86CPU *cpu, hwaddr paddr, int code)
On 9/6/23 17:16, Peter Xu wrote:
Just a note..
Probably fine for now to reuse block page size, but IIUC the right thing to
do is to fetch it from the signal info (in QEMU's sigbus_handler()) of
kernel_siginfo.si_addr_lsb.
At least for x86 I think that stores the "shift" of covered poisoned
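Following this suggestion, the size of the poisoned area can be taken from the siginfo_t passed to a SIGBUS handler installed with SA_SIGINFO. On Linux, for BUS_MCEERR_AR and BUS_MCEERR_AO, si_addr_lsb gives the least significant bit of the reported address, so the affected range is 1 << si_addr_lsb bytes. The helper below is a hypothetical sketch (the sketch_* names are made up), not QEMU's sigbus_handler(). The fallback constants are the Linux values.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4   /* Linux value: action-required memory error */
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5   /* Linux value: action-optional memory error */
#endif

/* Hypothetical description of the area reported by one SIGBUS. */
struct sketch_poison_range {
    uintptr_t start;  /* first byte of the poisoned range */
    size_t len;       /* length in bytes: 1 << si_addr_lsb */
};

/* Derive the poisoned range from siginfo instead of assuming the RAM
 * block's page size. Returns false for a SIGBUS that is not a memory
 * error. Pure arithmetic, so it is usable from a signal handler. */
static bool sketch_poison_range_from_siginfo(const siginfo_t *si,
                                             struct sketch_poison_range *r)
{
    if (si->si_code != BUS_MCEERR_AR && si->si_code != BUS_MCEERR_AO) {
        return false;
    }
    size_t len = (size_t)1 << si->si_addr_lsb;
    r->start = (uintptr_t)si->si_addr & ~(uintptr_t)(len - 1);
    r->len = len;
    return true;
}
```

With a 4 KiB page, si_addr_lsb would be 12. A 2 MiB huge page would report 21, which is exactly the case where reusing a fixed page size can go wrong.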
platforms.
Reported-by: William Roche
Signed-off-by: John Allen
---
target/i386/helper.c | 4
target/i386/kvm/kvm.c | 17 +++--
2 files changed, 15 insertions(+), 6 deletions(-)
diff --git a/target/i386/helper.c b/target/i386/helper.c
index 533b29cb91..a6523858e0 100644
enting this AMD specific filter:
commit bf8cc74df3fcc7bf958a7c42b876e9c059fe4d06
Author: William Roche
Date: Thu Aug 31 18:54:57 2023 +
i386: Explicitly ignore unsupported BUS_MCEERR_AO MCE on AMD guest
AMD guests can't currently deal with BUS_MCEERR_AO MCE injection
as it panics
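The filter described by this commit comes down to a vendor and error-code check made before an MCE is injected into the guest. The enum and function below are hypothetical, since QEMU has its own CPU vendor helpers. The sketch encodes only what the commit message states: an AMD guest can't currently deal with an injected BUS_MCEERR_AO MCE, so that case is ignored while action-required errors are still injected.

```c
#include <signal.h>
#include <stdbool.h>

#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4   /* Linux value: action-required memory error */
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5   /* Linux value: action-optional memory error */
#endif

/* Hypothetical vendor tag. */
enum sketch_cpu_vendor { SKETCH_VENDOR_INTEL, SKETCH_VENDOR_AMD };

/* Returns true if a memory-error SIGBUS with this code should become an
 * MCE injected into the guest. */
static bool sketch_should_inject_mce(enum sketch_cpu_vendor vendor, int code)
{
    if (vendor == SKETCH_VENDOR_AMD && code == BUS_MCEERR_AO) {
        return false;   /* explicitly ignored for AMD guests */
    }
    return code == BUS_MCEERR_AR || code == BUS_MCEERR_AO;
}
```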