Re: [patch uq/master 7/8] MCE: Relay UCR MCE to guest

2010-10-08 Thread Dean Nelson

On 10/07/2010 10:15 PM, Huang Ying wrote:

Hi, Seto,

On Thu, 2010-10-07 at 11:41 +0800, Hidetoshi Seto wrote:

(2010/10/07 3:10), Dean Nelson wrote:

<snip>

When I applied a patch to the guest's kernel which forces mce_ser to be
set, as if MCG_SER_P was set (see __mcheck_cpu_cap_init()), I found
that when the memory page was 'owned' by a guest process, the process
would be killed (if the page was dirty), and the guest would stay
running. The HWPoisoned page would be sidelined and not cause any more
issues.


Excellent.
So while guest kernel knows which page is poisoned, guest processes
are controlled not to touch the page.

... Therefore rebooting the vm and renewing the kernel will lose the
information about which page is poisoned.


Yes. That is an issue. Dean suggests making qemu-kvm refuse to reboot
the guest if there is a poisoned page and ask the user to intervene. I
have another idea: replace the poisoned pages with good pages on
reboot, that is, recover without user intervention.


Hi, Huang, I much prefer the replacing of the poisoned pages with good
pages on reboot, over the refusing to reboot. So definitely go with
your idea.

Thanks,
Dean
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
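
A minimal sketch of the "replace the poisoned pages with good pages" idea, assuming guest RAM is ordinary anonymous memory inside the qemu-kvm process. The helper name replace_page_in_place() is hypothetical and does not come from this patch series; the sketch only shows that mmap() with MAP_FIXED can put a fresh zero-filled page at the same virtual address, dropping whatever was mapped there before.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the patch): map a fresh anonymous,
 * zero-filled region over [addr, addr + len). MAP_FIXED discards the
 * previous mapping of that range, so the old backing page is no longer
 * referenced through these addresses.
 * Returns 0 on success, -1 on failure.
 */
static int replace_page_in_place(void *addr, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);

    assert(page > 0);
    assert((uintptr_t)addr % (size_t)page == 0);   /* must be page aligned */
    assert(len > 0 && len % (size_t)page == 0);    /* whole pages only */

    void *p = mmap(addr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    return p == addr ? 0 : -1;
}
```

The old contents are lost by design, which matches the reboot case discussed here: the new guest kernel starts from scratch and only needs the address range to be usable again.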


Re: [patch uq/master 7/8] MCE: Relay UCR MCE to guest

2010-10-07 Thread Dean Nelson

On 10/06/2010 10:41 PM, Hidetoshi Seto wrote:

(2010/10/07 3:10), Dean Nelson wrote:

On 10/06/2010 11:05 AM, Marcelo Tosatti wrote:

On Wed, Oct 06, 2010 at 10:58:36AM +0900, Hidetoshi Seto wrote:

I got some more question:

(2010/10/05 3:54), Marcelo Tosatti wrote:

Index: qemu/target-i386/cpu.h
===
--- qemu.orig/target-i386/cpu.h
+++ qemu/target-i386/cpu.h
@@ -250,16 +250,32 @@
   #define PG_ERROR_RSVD_MASK 0x08
   #define PG_ERROR_I_D_MASK  0x10

-#define MCG_CTL_P    (1UL<<8)   /* MCG_CAP register available */
+#define MCG_CTL_P    (1ULL<<8)   /* MCG_CAP register available */
+#define MCG_SER_P    (1ULL<<24) /* MCA recovery/new status bits */

-#define MCE_CAP_DEF    MCG_CTL_P
+#define MCE_CAP_DEF    (MCG_CTL_P|MCG_SER_P)
   #define MCE_BANKS_DEF    10



It seems that current kvm doesn't support SER_P, so injecting SRAO
to guest will mean that guest receives VAL|UC|!PCC and RIPV event
from virtual processor that doesn't have SER_P.


Dean also noted this. I don't think it was deliberate choice to not
expose SER_P. Huang?


In my testing, I found that MCG_SER_P was not being set (and I was
running on a Nehalem-EX system). Injecting a MCE resulted in the
guest entering into panic() from mce_panic(). If crash_kexec()
finds a kexec_crash_image the system ends up rebooting, otherwise,
what happens next requires operator intervention.


Good to know.
What I'm concerned about is that if a memory scrubbing SRAO event is
injected when !SER_P, a Linux guest with a certain mce tolerant level
might grade it as UC severity and continue running without any
panicking, killing, or poisoning because of !PCC and RIPV.

Could you provide the panic message of the guest in your test?
I think it can tell me why the mce handler decided to go panic.


Sure, I'll add the info below at the end of this email.



When I applied a patch to the guest's kernel which forces mce_ser to be
set, as if MCG_SER_P was set (see __mcheck_cpu_cap_init()), I found
that when the memory page was 'owned' by a guest process, the process
would be killed (if the page was dirty), and the guest would stay
running. The HWPoisoned page would be sidelined and not cause any more
issues.


Excellent.
So while guest kernel knows which page is poisoned, guest processes
are controlled not to touch the page.

... Therefore rebooting the vm and renewing the kernel will lose the
information about which page is poisoned.


Correct.



I think most OSes don't expect that it can receives MCE with !PCC
on traditional x86 processor without SER_P.

Q1: Is it safe to expect that guests can handle such !PCC event?


This might be best answered by Huang, but as I mentioned above, without
MCG_SER_P being set, the result was an orderly system panic on the
guest.


Though I'll wait Huang (I think he is on holiday), I believe that
system panic is just a possible option for AO (Action Optional)
event, no matter how the SER_P is.


I think you may be correct, but Huang will know for sure.



Q2: What is the expected behavior on the guest?


I think I answered this above.


Yeah, thanks.




Q3: What happen if guest reboots itself in response to the MCE?


That depends...

And the following issue also holds for a guest that is rebooted at
some point having successfully sidelined the bad page.

After the guest has panic'd, a system_reset of the guest or a restart
initiated by crash_kexec() (called by panic() on the guest), usually
results in the guest hanging because the bad page still belongs
to qemu-kvm and is now being referenced by the new guest in some way.


Yes. In other words my concern about reboot is that new guest kernel
including kdump kernel might try to read the bad page.  If there is
no AR-SIGBUS etc., we need some tricks to inhibit such accesses.


Agreed.



(It actually may not hang, but successfully reboot and be runnable,
with the bad page lurking in the background. It all seems to depend on
where the bad page ends up, and whether it's ever referenced.)


I know some tough guys using their PC with buggy DIMMs :-)



I believe there was an attempt to deal with this in kvm on the host.
See kvm_handle_bad_page(). This function was supposed to result in the
sending of a BUS_MCEERR_AR flavored SIGBUS by do_sigbus() to qemu-kvm
which in theory would result in the right thing happening. But commit
96054569190bdec375fe824e48ca1f4e3b53dd36 prevents the signal from being
sent. So this mechanism needs to be re-worked, and the issue remains.


Definitely.
I guess Huang has some plan or hint for rework this point.


Yeah, as far as I know Huang is looking into this.



I would think that if the bad page can't be sidelined, such that
the newly booting guest can't use it, then the new guest shouldn't be
allowed to boot. But perhaps there is some merit in letting it try to
boot and see if one gets 'lucky'.


In case of booting a real machine in real world, hardware and firmware
usually (or often) do self-test before passing control to 

Re: [patch uq/master 7/8] MCE: Relay UCR MCE to guest

2010-10-07 Thread Huang Ying
On Thu, 2010-10-07 at 00:05 +0800, Marcelo Tosatti wrote:
 On Wed, Oct 06, 2010 at 10:58:36AM +0900, Hidetoshi Seto wrote:
  I got some more question:
  
  (2010/10/05 3:54), Marcelo Tosatti wrote:
   Index: qemu/target-i386/cpu.h
   ===
   --- qemu.orig/target-i386/cpu.h
   +++ qemu/target-i386/cpu.h
   @@ -250,16 +250,32 @@
#define PG_ERROR_RSVD_MASK 0x08
#define PG_ERROR_I_D_MASK  0x10

   -#define MCG_CTL_P    (1UL<<8)   /* MCG_CAP register available */
   +#define MCG_CTL_P    (1ULL<<8)   /* MCG_CAP register available */
   +#define MCG_SER_P    (1ULL<<24) /* MCA recovery/new status bits */

   -#define MCE_CAP_DEF  MCG_CTL_P
   +#define MCE_CAP_DEF  (MCG_CTL_P|MCG_SER_P)
#define MCE_BANKS_DEF    10

  
  It seems that current kvm doesn't support SER_P, so injecting SRAO
  to guest will mean that guest receives VAL|UC|!PCC and RIPV event
  from virtual processor that doesn't have SER_P.
 
 Dean also noted this. I don't think it was deliberate choice to not
 expose SER_P. Huang?

In fact, that should be a BUG. I will fix it as soon as possible.

Best Regards,
Huang Ying
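
To make the capability bits in the quoted hunk concrete, here is a small sketch. The bit positions follow the architectural IA32_MCG_CAP layout (bits 7:0 bank count, bit 8 MCG_CTL_P, bit 24 MCG_SER_P). The helper names and the way the default value is composed are assumptions for illustration, not code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MCG_CTL_P      (1ULL << 8)    /* IA32_MCG_CTL register present */
#define MCG_SER_P      (1ULL << 24)   /* software error recovery supported */
#define MCE_CAP_DEF    (MCG_CTL_P | MCG_SER_P)
#define MCE_BANKS_DEF  10

/* Assumed composition: capability flags OR-ed with the bank count. */
static uint64_t default_mcg_cap(void)
{
    return MCE_CAP_DEF | MCE_BANKS_DEF;
}

/* True if the (virtual) CPU advertises MCA recovery to the guest. */
static bool mcg_cap_has_ser_p(uint64_t mcg_cap)
{
    return (mcg_cap & MCG_SER_P) != 0;
}

/* Number of MCA banks, taken from bits 7:0. */
static unsigned mcg_cap_bank_count(uint64_t mcg_cap)
{
    return (unsigned)(mcg_cap & 0xffULL);
}
```

With these definitions the default value works out to 0x100010a, and a guest that reads it would see SER_P set, which is what the fix discussed above is meant to achieve end to end.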




Re: [patch uq/master 7/8] MCE: Relay UCR MCE to guest

2010-10-07 Thread Huang Ying
Hi, Seto,

On Thu, 2010-10-07 at 11:41 +0800, Hidetoshi Seto wrote:
 (2010/10/07 3:10), Dean Nelson wrote:
  On 10/06/2010 11:05 AM, Marcelo Tosatti wrote:
  On Wed, Oct 06, 2010 at 10:58:36AM +0900, Hidetoshi Seto wrote:
  I got some more question:
 
  (2010/10/05 3:54), Marcelo Tosatti wrote:
  Index: qemu/target-i386/cpu.h
  ===
  --- qemu.orig/target-i386/cpu.h
  +++ qemu/target-i386/cpu.h
  @@ -250,16 +250,32 @@
#define PG_ERROR_RSVD_MASK 0x08
#define PG_ERROR_I_D_MASK  0x10
 
  -#define MCG_CTL_P    (1UL<<8)   /* MCG_CAP register available */
  +#define MCG_CTL_P    (1ULL<<8)   /* MCG_CAP register available */
  +#define MCG_SER_P    (1ULL<<24) /* MCA recovery/new status bits */
 
  -#define MCE_CAP_DEF    MCG_CTL_P
  +#define MCE_CAP_DEF    (MCG_CTL_P|MCG_SER_P)
#define MCE_BANKS_DEF    10
 
 
  It seems that current kvm doesn't support SER_P, so injecting SRAO
  to guest will mean that guest receives VAL|UC|!PCC and RIPV event
  from virtual processor that doesn't have SER_P.
 
  Dean also noted this. I don't think it was deliberate choice to not
  expose SER_P. Huang?
  
  In my testing, I found that MCG_SER_P was not being set (and I was
  running on a Nehalem-EX system). Injecting a MCE resulted in the
  guest entering into panic() from mce_panic(). If crash_kexec()
  finds a kexec_crash_image the system ends up rebooting, otherwise,
  what happens next requires operator intervention.
 
 Good to know.
 What I'm concerned about is that if a memory scrubbing SRAO event is
 injected when !SER_P, a Linux guest with a certain mce tolerant level
 might grade it as UC severity and continue running without any
 panicking, killing, or poisoning because of !PCC and RIPV.
 
 Could you provide the panic message of the guest in your test?
 I think it can tell me why the mce handler decided to go panic.

It is a bug that SER_P is not in KVM_MCE_CAP_SUPPORTED in the kernel.
I will fix it as soon as possible. Also, an SRAO MCE should not be sent
when !SER_P; we should add that condition in qemu-kvm.

  When I applied a patch to the guest's kernel which forces mce_ser to be
  set, as if MCG_SER_P was set (see __mcheck_cpu_cap_init()), I found
  that when the memory page was 'owned' by a guest process, the process
  would be killed (if the page was dirty), and the guest would stay
  running. The HWPoisoned page would be sidelined and not cause any more
  issues.
 
 Excellent.
 So while guest kernel knows which page is poisoned, guest processes
 are controlled not to touch the page.
 
 ... Therefore rebooting the vm and renewing the kernel will lose the
 information about which page is poisoned.

Yes. That is an issue. Dean suggests making qemu-kvm refuse to reboot
the guest if there is a poisoned page and ask the user to intervene. I
have another idea: replace the poisoned pages with good pages on
reboot, that is, recover without user intervention.

  I think most OSes don't expect that it can receives MCE with !PCC
  on traditional x86 processor without SER_P.
 
  Q1: Is it safe to expect that guests can handle such !PCC event?
  
  This might be best answered by Huang, but as I mentioned above, without
  MCG_SER_P being set, the result was an orderly system panic on the
  guest.
 
 Though I'll wait Huang (I think he is on holiday), I believe that
 system panic is just a possible option for AO (Action Optional)
 event, no matter how the SER_P is.

We should fix this as I said above.

  Q2: What is the expected behavior on the guest?
  
  I think I answered this above.
 
 Yeah, thanks.
 
  
  Q3: What happen if guest reboots itself in response to the MCE?
  
  That depends...
  
  And the following issue also holds for a guest that is rebooted at
  some point having successfully sidelined the bad page.
  
  After the guest has panic'd, a system_reset of the guest or a restart
  initiated by crash_kexec() (called by panic() on the guest), usually
  results in the guest hanging because the bad page still belongs
  to qemu-kvm and is now being referenced by the new guest in some way.
 
 Yes. In other words my concern about reboot is that new guest kernel
 including kdump kernel might try to read the bad page.  If there is
 no AR-SIGBUS etc., we need some tricks to inhibit such accesses.
 
  (It actually may not hang, but successfully reboot and be runnable,
  with the bad page lurking in the background. It all seems to depend on
  where the bad page ends up, and whether it's ever referenced.)
 
 I know some tough guys using their PC with buggy DIMMs :-)
 
  
  I believe there was an attempt to deal with this in kvm on the host.
  See kvm_handle_bad_page(). This function was supposed to result in the
  sending of a BUS_MCEERR_AR flavored SIGBUS by do_sigbus() to qemu-kvm
  which in theory would result in the right thing happening. But commit
  96054569190bdec375fe824e48ca1f4e3b53dd36 prevents the signal from being
  sent. So this mechanism needs to be re-worked, and the 

Re: [patch uq/master 7/8] MCE: Relay UCR MCE to guest

2010-10-07 Thread Hidetoshi Seto
Hi, Huang-san,

(2010/10/08 12:15), Huang Ying wrote:
 Hi, Seto,
 
 On Thu, 2010-10-07 at 11:41 +0800, Hidetoshi Seto wrote:
 (2010/10/07 3:10), Dean Nelson wrote:
 On 10/06/2010 11:05 AM, Marcelo Tosatti wrote:
 On Wed, Oct 06, 2010 at 10:58:36AM +0900, Hidetoshi Seto wrote:
 I got some more question:

 (2010/10/05 3:54), Marcelo Tosatti wrote:
 Index: qemu/target-i386/cpu.h
 ===
 --- qemu.orig/target-i386/cpu.h
 +++ qemu/target-i386/cpu.h
 @@ -250,16 +250,32 @@
   #define PG_ERROR_RSVD_MASK 0x08
   #define PG_ERROR_I_D_MASK  0x10

 -#define MCG_CTL_P    (1UL<<8)   /* MCG_CAP register available */
 +#define MCG_CTL_P    (1ULL<<8)   /* MCG_CAP register available */
 +#define MCG_SER_P    (1ULL<<24) /* MCA recovery/new status bits */

 -#define MCE_CAP_DEF    MCG_CTL_P
 +#define MCE_CAP_DEF    (MCG_CTL_P|MCG_SER_P)
   #define MCE_BANKS_DEF    10


 It seems that current kvm doesn't support SER_P, so injecting SRAO
 to guest will mean that guest receives VAL|UC|!PCC and RIPV event
 from virtual processor that doesn't have SER_P.

 Dean also noted this. I don't think it was deliberate choice to not
 expose SER_P. Huang?

 In my testing, I found that MCG_SER_P was not being set (and I was
 running on a Nehalem-EX system). Injecting a MCE resulted in the
 guest entering into panic() from mce_panic(). If crash_kexec()
 finds a kexec_crash_image the system ends up rebooting, otherwise,
 what happens next requires operator intervention.

 Good to know.
 What I'm concerned about is that if a memory scrubbing SRAO event is
 injected when !SER_P, a Linux guest with a certain mce tolerant level
 might grade it as UC severity and continue running without any
 panicking, killing, or poisoning because of !PCC and RIPV.

 Could you provide the panic message of the guest in your test?
 I think it can tell me why the mce handler decided to go panic.
 
 That is a bug that the SER_P is not in KVM_MCE_CAP_SUPPORTED in kernel.
 I will fix it as soon as possible. And SRAO MCE should not be sent
 when !SER_P, we should add that condition in qemu-kvm.

That makes sense.
I think qemu is responsible for what follows the AO-SIGBUS; what
action should be taken depends on KVM's capability.
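
A rough sketch of that capability check, under stated assumptions. The MCi_STATUS bit positions are the architectural ones; the enum, both function names, and the "do not inject" fallback are hypothetical, since the thread only agrees that an SRAO MCE should not be sent when SER_P is absent. The MCA error code that a real injection would carry in the low status bits is left out.

```c
#include <assert.h>
#include <stdint.h>

#define MCG_SER_P         (1ULL << 24)  /* MCG_CAP: recovery supported */

#define MCI_STATUS_VAL    (1ULL << 63)  /* register contents valid */
#define MCI_STATUS_UC     (1ULL << 61)  /* uncorrected error */
#define MCI_STATUS_EN     (1ULL << 60)  /* error reporting enabled */
#define MCI_STATUS_MISCV  (1ULL << 59)  /* MCi_MISC valid */
#define MCI_STATUS_ADDRV  (1ULL << 58)  /* MCi_ADDR valid */
#define MCI_STATUS_PCC    (1ULL << 57)  /* processor context corrupted */
#define MCI_STATUS_S      (1ULL << 56)  /* signaled via machine check */
#define MCI_STATUS_AR     (1ULL << 55)  /* action required */

/* Hypothetical policy names, for illustration only. */
enum ao_action { AO_DO_NOT_INJECT, AO_INJECT_SRAO };

/* Decide what to do with an action-optional SIGBUS for this guest. */
static enum ao_action action_for_ao_sigbus(uint64_t guest_mcg_cap)
{
    return (guest_mcg_cap & MCG_SER_P) ? AO_INJECT_SRAO : AO_DO_NOT_INJECT;
}

/* Status flags for an SRAO event: UC and S set, AR and PCC clear. */
static uint64_t srao_status_flags(void)
{
    return MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN |
           MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S;
}
```

What the "do not inject" branch should actually do (log, ignore, or something else) is exactly the open question in this thread, so the sketch leaves it as a bare enum value.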

 When I applied a patch to the guest's kernel which forces mce_ser to be
 set, as if MCG_SER_P was set (see __mcheck_cpu_cap_init()), I found
 that when the memory page was 'owned' by a guest process, the process
 would be killed (if the page was dirty), and the guest would stay
 running. The HWPoisoned page would be sidelined and not cause any more
 issues.

 Excellent.
 So while guest kernel knows which page is poisoned, guest processes
 are controlled not to touch the page.

 ... Therefore rebooting the vm and renewing the kernel will lose the
 information about which page is poisoned.
 
 Yes. That is an issue. Dean suggests making qemu-kvm refuse to reboot
 the guest if there is a poisoned page and ask the user to intervene. I
 have another idea: replace the poisoned pages with good pages on
 reboot, that is, recover without user intervention.

Sounds good.

I think it may be worthwhile to reserve pages for the replacement
before a reboot is requested; at least we really don't want rebooting
to fail with 'no memory'.
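
One way to picture such a reservation is a small pool of pages that is allocated and touched up front, so handing one out later cannot fail for lack of memory. This is only an illustrative sketch; struct spare_pool and its functions are hypothetical and nothing like them appears in the patch.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical reserve of spare pages, filled before any error occurs. */
struct spare_pool {
    unsigned char *base;   /* start of the reserved region */
    size_t page_size;      /* system page size in bytes */
    size_t total;          /* number of pages reserved */
    size_t used;           /* number of pages already handed out */
};

/* Reserve npages pages now. Returns 0 on success, -1 on failure. */
static int spare_pool_init(struct spare_pool *pool, size_t npages)
{
    long ps = sysconf(_SC_PAGESIZE);

    assert(pool != NULL && npages > 0 && ps > 0);

    void *p = mmap(NULL, npages * (size_t)ps, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        return -1;
    }
    /* Touch every page so the memory is committed at init time. */
    memset(p, 0, npages * (size_t)ps);

    pool->base = p;
    pool->page_size = (size_t)ps;
    pool->total = npages;
    pool->used = 0;
    return 0;
}

/* Hand out the next spare page, or NULL once the reserve is exhausted. */
static void *spare_pool_take(struct spare_pool *pool)
{
    if (pool->used == pool->total) {
        return NULL;
    }
    return pool->base + pool->page_size * pool->used++;
}
```

How many pages to reserve, and whether the reserve should be locked in memory, are policy questions the sketch does not try to answer.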

 I think most OSes don't expect that it can receives MCE with !PCC
 on traditional x86 processor without SER_P.

 Q1: Is it safe to expect that guests can handle such !PCC event?

 This might be best answered by Huang, but as I mentioned above, without
 MCG_SER_P being set, the result was an orderly system panic on the
 guest.

 Though I'll wait Huang (I think he is on holiday), I believe that
 system panic is just a possible option for AO (Action Optional)
 event, no matter how the SER_P is.
 
 We should fix this as I said above.
 
 Q2: What is the expected behavior on the guest?

 I think I answered this above.

 Yeah, thanks.


 Q3: What happen if guest reboots itself in response to the MCE?

 That depends...

 And the following issue also holds for a guest that is rebooted at
 some point having successfully sidelined the bad page.

 After the guest has panic'd, a system_reset of the guest or a restart
 initiated by crash_kexec() (called by panic() on the guest), usually
 results in the guest hanging because the bad page still belongs
 to qemu-kvm and is now being referenced by the new guest in some way.

 Yes. In other words my concern about reboot is that new guest kernel
 including kdump kernel might try to read the bad page.  If there is
 no AR-SIGBUS etc., we need some tricks to inhibit such accesses.

 (It actually may not hang, but successfully reboot and be runnable,
 with the bad page lurking in the background. It all seems to depend on
 where the bad page ends up, and whether it's ever referenced.)

 I know some tough guys using their PC with buggy DIMMs :-)


 I believe there was an attempt to deal with this in kvm on the host.
 See kvm_handle_bad_page(). This 

Re: [patch uq/master 7/8] MCE: Relay UCR MCE to guest

2010-10-06 Thread Marcelo Tosatti
On Wed, Oct 06, 2010 at 10:10:51AM +0900, Hidetoshi Seto wrote:
 
 (snip)
 
  Index: qemu/kvm.h
  ===
  --- qemu.orig/kvm.h
  +++ qemu/kvm.h
  @@ -110,6 +110,9 @@ int kvm_arch_init_vcpu(CPUState *env);
   
   void kvm_arch_reset_vcpu(CPUState *env);
   
  +int kvm_on_sigbus(CPUState *env, int code, void *addr);
  +int kvm_on_sigbus_vcpu(int code, void *addr);
  +
   struct kvm_guest_debug;
   struct kvm_debug_exit_arch;
   
 
 So kvm_on_sigbus() is called from qemu_kvm_eat_signal(), which is
 called on the vcpu thread, while kvm_on_sigbus_vcpu() is called via
 sigbus_handler, which is invoked on the iothread using signalfd.
 
 ... Inverse naming?

Yes, fixed.



Re: [patch uq/master 7/8] MCE: Relay UCR MCE to guest

2010-10-06 Thread Marcelo Tosatti
On Wed, Oct 06, 2010 at 10:58:36AM +0900, Hidetoshi Seto wrote:
 I got some more question:
 
 (2010/10/05 3:54), Marcelo Tosatti wrote:
  Index: qemu/target-i386/cpu.h
  ===
  --- qemu.orig/target-i386/cpu.h
  +++ qemu/target-i386/cpu.h
  @@ -250,16 +250,32 @@
   #define PG_ERROR_RSVD_MASK 0x08
   #define PG_ERROR_I_D_MASK  0x10
   
  -#define MCG_CTL_P  (1UL<<8)   /* MCG_CAP register available */
  +#define MCG_CTL_P  (1ULL<<8)   /* MCG_CAP register available */
  +#define MCG_SER_P  (1ULL<<24) /* MCA recovery/new status bits */
   
  -#define MCE_CAP_DEF    MCG_CTL_P
  +#define MCE_CAP_DEF    (MCG_CTL_P|MCG_SER_P)
   #define MCE_BANKS_DEF  10
   
 
 It seems that current kvm doesn't support SER_P, so injecting SRAO
 to guest will mean that guest receives VAL|UC|!PCC and RIPV event
 from virtual processor that doesn't have SER_P.

Dean also noted this. I don't think it was deliberate choice to not
expose SER_P. Huang?

 I think most OSes don't expect to receive an MCE with !PCC
 on a traditional x86 processor without SER_P.
 
 Q1: Is it safe to expect that guests can handle such !PCC event?
 Q2: What is the expected behavior on the guest?
 Q3: What happen if guest reboots itself in response to the MCE?
 
 
 Thanks,
 H.Seto


[patch uq/master 7/8] MCE: Relay UCR MCE to guest

2010-10-06 Thread Marcelo Tosatti
Port qemu-kvm's

commit 4b62fff1101a7ad77553147717a8bd3bf79df7ef
Author: Huang Ying ying.hu...@intel.com
Date:   Mon Sep 21 10:43:25 2009 +0800

MCE: Relay UCR MCE to guest

UCR (uncorrected recovery) MCE is supported in recent Intel CPUs,
where some hardware error such as some memory error can be reported
without PCC (processor context corrupted). To recover from such MCE,
the corresponding memory will be unmapped, and all processes accessing
the memory will be killed via SIGBUS.

For KVM, if QEMU/KVM is killed, all guest processes will be killed
too. So we relay SIGBUS from host OS to guest system via a UCR MCE
injection. Then guest OS can isolate corresponding memory and kill
necessary guest processes only. SIGBUS sent to main thread (not VCPU
threads) will be broadcast to all VCPU threads as UCR MCE.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
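
For readers unfamiliar with the signal side of this: on Linux, a memory-failure SIGBUS carries si_code BUS_MCEERR_AR (action required) or BUS_MCEERR_AO (action optional), and the patch below passes that code to kvm_on_sigbus()/kvm_on_sigbus_vcpu(), re-raising the signal when they return non-zero. The sketch here only shows the classification step; the enum and function name are hypothetical, and the fallback numeric values are the Linux ones for headers that lack the macros.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <signal.h>

/* Fallbacks for older headers; values as defined by Linux. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

/* Hypothetical classification of a SIGBUS by its si_code. */
enum sigbus_kind {
    SIGBUS_NOT_MCE,              /* ordinary SIGBUS: not a memory error */
    SIGBUS_MCE_ACTION_REQUIRED,  /* poisoned memory was consumed */
    SIGBUS_MCE_ACTION_OPTIONAL   /* poison found, not yet consumed */
};

static enum sigbus_kind classify_sigbus(int si_code)
{
    switch (si_code) {
    case BUS_MCEERR_AR:
        return SIGBUS_MCE_ACTION_REQUIRED;
    case BUS_MCEERR_AO:
        return SIGBUS_MCE_ACTION_OPTIONAL;
    default:
        return SIGBUS_NOT_MCE;
    }
}
```

Anything classified as SIGBUS_NOT_MCE corresponds to the case where the patch falls back to sigbus_reraise().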

Index: qemu/cpus.c
===
--- qemu.orig/cpus.c
+++ qemu/cpus.c
@@ -34,6 +34,10 @@
 
 #include "cpus.h"
 #include "compatfd.h"
+#ifdef CONFIG_LINUX
+#include <sys/prctl.h>
+#include <sys/signalfd.h>
+#endif
 
 #ifdef SIGRTMIN
 #define SIG_IPI (SIGRTMIN+4)
@@ -41,6 +45,10 @@
 #define SIG_IPI SIGUSR1
 #endif
 
+#ifndef PR_MCE_KILL
+#define PR_MCE_KILL 33
+#endif
+
 static CPUState *next_cpu;
 
 /***/
@@ -498,28 +506,77 @@ static void qemu_tcg_wait_io_event(void)
 }
 }
 
+static void sigbus_reraise(void)
+{
+    sigset_t set;
+    struct sigaction action;
+
+    memset(&action, 0, sizeof(action));
+    action.sa_handler = SIG_DFL;
+    if (!sigaction(SIGBUS, &action, NULL)) {
+        raise(SIGBUS);
+        sigemptyset(&set);
+        sigaddset(&set, SIGBUS);
+        sigprocmask(SIG_UNBLOCK, &set, NULL);
+    }
+    perror("Failed to re-raise SIGBUS!\n");
+    abort();
+}
+
+static void sigbus_handler(int n, struct qemu_signalfd_siginfo *siginfo,
+                           void *ctx)
+{
+#if defined(TARGET_I386)
+    if (kvm_on_sigbus(siginfo->ssi_code, (void *)(intptr_t)siginfo->ssi_addr))
+#endif
+        sigbus_reraise();
+}
+
 static void qemu_kvm_eat_signal(CPUState *env, int timeout)
 {
     struct timespec ts;
     int r, e;
     siginfo_t siginfo;
     sigset_t waitset;
+    sigset_t chkset;
 
     ts.tv_sec = timeout / 1000;
     ts.tv_nsec = (timeout % 1000) * 1000000;
 
     sigemptyset(&waitset);
     sigaddset(&waitset, SIG_IPI);
+    sigaddset(&waitset, SIGBUS);
 
-    qemu_mutex_unlock(&qemu_global_mutex);
-    r = sigtimedwait(&waitset, &siginfo, &ts);
-    e = errno;
-    qemu_mutex_lock(&qemu_global_mutex);
+    do {
+        qemu_mutex_unlock(&qemu_global_mutex);
 
-    if (r == -1 && !(e == EAGAIN || e == EINTR)) {
-        fprintf(stderr, "sigtimedwait: %s\n", strerror(e));
-        exit(1);
-    }
+        r = sigtimedwait(&waitset, &siginfo, &ts);
+        e = errno;
+
+        qemu_mutex_lock(&qemu_global_mutex);
+
+        if (r == -1 && !(e == EAGAIN || e == EINTR)) {
+            fprintf(stderr, "sigtimedwait: %s\n", strerror(e));
+            exit(1);
+        }
+
+        switch (r) {
+        case SIGBUS:
+#ifdef TARGET_I386
+            if (kvm_on_sigbus_vcpu(env, siginfo.si_code, siginfo.si_addr))
+#endif
+                sigbus_reraise();
+            break;
+        default:
+            break;
+        }
+
+        r = sigpending(&chkset);
+        if (r == -1) {
+            fprintf(stderr, "sigpending: %s\n", strerror(e));
+            exit(1);
+        }
+    } while (sigismember(&chkset, SIG_IPI) || sigismember(&chkset, SIGBUS));
 }
 
 static void qemu_kvm_wait_io_event(CPUState *env)
@@ -645,6 +702,7 @@ static void kvm_init_ipi(CPUState *env)
 
     pthread_sigmask(SIG_BLOCK, NULL, &set);
     sigdelset(&set, SIG_IPI);
+    sigdelset(&set, SIGBUS);
     r = kvm_set_signal_mask(env, &set);
     if (r) {
         fprintf(stderr, "kvm_set_signal_mask: %s\n", strerror(r));
@@ -655,6 +713,7 @@ static void kvm_init_ipi(CPUState *env)
 static sigset_t block_io_signals(void)
 {
     sigset_t set;
+    struct sigaction action;
 
     /* SIGUSR2 used by posix-aio-compat.c */
     sigemptyset(&set);
@@ -665,8 +724,15 @@ static sigset_t block_io_signals(void)
     sigaddset(&set, SIGIO);
     sigaddset(&set, SIGALRM);
     sigaddset(&set, SIG_IPI);
+    sigaddset(&set, SIGBUS);
     pthread_sigmask(SIG_BLOCK, &set, NULL);
 
+    memset(&action, 0, sizeof(action));
+    action.sa_flags = SA_SIGINFO;
+    action.sa_sigaction = (void (*)(int, siginfo_t*, void*))sigbus_handler;
+    sigaction(SIGBUS, &action, NULL);
+    prctl(PR_MCE_KILL, 1, 1, 0, 0);
+
     return set;
 }
 
Index: qemu/kvm.h
===
--- qemu.orig/kvm.h
+++ qemu/kvm.h
@@ -110,6 +110,9 @@ int kvm_arch_init_vcpu(CPUState *env);
 
 void kvm_arch_reset_vcpu(CPUState *env);
 
+int kvm_on_sigbus_vcpu(CPUState *env, int code, void *addr);
+int kvm_on_sigbus(int code, void *addr);
+
 struct 
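
The reworked qemu_kvm_eat_signal() above waits with sigtimedwait() and keeps looping while sigpending() still reports one of the two signals it cares about. A stand-alone sketch of that drain pattern follows, assuming the caller has already blocked both signals. The function name is hypothetical, and the real code dispatches on the signal number rather than just counting.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <time.h>

/*
 * Consume pending instances of sig_a and sig_b, waiting up to timeout_ms
 * for each. Both signals must already be blocked by the caller.
 * Returns the number of signals consumed, or -1 on an unexpected error.
 */
static int drain_two_signals(int sig_a, int sig_b, int timeout_ms)
{
    struct timespec ts;
    sigset_t waitset, chkset;
    siginfo_t info;
    int handled = 0;

    assert(timeout_ms >= 0);
    ts.tv_sec = timeout_ms / 1000;
    ts.tv_nsec = (long)(timeout_ms % 1000) * 1000000L;   /* ms to ns */

    sigemptyset(&waitset);
    sigaddset(&waitset, sig_a);
    sigaddset(&waitset, sig_b);

    do {
        int r = sigtimedwait(&waitset, &info, &ts);

        if (r == -1 && !(errno == EAGAIN || errno == EINTR)) {
            return -1;               /* neither timeout nor interruption */
        }
        if (r > 0) {
            handled++;               /* real code would dispatch on r here */
        }
        if (sigpending(&chkset) == -1) {
            return -1;
        }
    } while (sigismember(&chkset, sig_a) == 1 ||
             sigismember(&chkset, sig_b) == 1);

    return handled;
}
```

In the patch the two signals are SIG_IPI and SIGBUS; any pair of blocked signals behaves the same way for the purpose of this loop.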

Re: [patch uq/master 7/8] MCE: Relay UCR MCE to guest

2010-10-06 Thread Dean Nelson

On 10/06/2010 11:05 AM, Marcelo Tosatti wrote:

On Wed, Oct 06, 2010 at 10:58:36AM +0900, Hidetoshi Seto wrote:

I got some more question:

(2010/10/05 3:54), Marcelo Tosatti wrote:

Index: qemu/target-i386/cpu.h
===
--- qemu.orig/target-i386/cpu.h
+++ qemu/target-i386/cpu.h
@@ -250,16 +250,32 @@
  #define PG_ERROR_RSVD_MASK 0x08
  #define PG_ERROR_I_D_MASK  0x10

-#define MCG_CTL_P  (1UL<<8)   /* MCG_CAP register available */
+#define MCG_CTL_P  (1ULL<<8)   /* MCG_CAP register available */
+#define MCG_SER_P  (1ULL<<24) /* MCA recovery/new status bits */

-#define MCE_CAP_DEF    MCG_CTL_P
+#define MCE_CAP_DEF    (MCG_CTL_P|MCG_SER_P)
  #define MCE_BANKS_DEF 10



It seems that current kvm doesn't support SER_P, so injecting SRAO
to guest will mean that guest receives VAL|UC|!PCC and RIPV event
from virtual processor that doesn't have SER_P.


Dean also noted this. I don't think it was deliberate choice to not
expose SER_P. Huang?


In my testing, I found that MCG_SER_P was not being set (and I was
running on a Nehalem-EX system). Injecting a MCE resulted in the
guest entering into panic() from mce_panic(). If crash_kexec()
finds a kexec_crash_image the system ends up rebooting, otherwise,
what happens next requires operator intervention.

When I applied a patch to the guest's kernel which forces mce_ser to be
set, as if MCG_SER_P was set (see __mcheck_cpu_cap_init()), I found
that when the memory page was 'owned' by a guest process, the process
would be killed (if the page was dirty), and the guest would stay
running. The HWPoisoned page would be sidelined and not cause any more
issues.


I think most OSes don't expect that it can receives MCE with !PCC
on traditional x86 processor without SER_P.

Q1: Is it safe to expect that guests can handle such !PCC event?


This might be best answered by Huang, but as I mentioned above, without
MCG_SER_P being set, the result was an orderly system panic on the
guest.


Q2: What is the expected behavior on the guest?


I think I answered this above.


Q3: What happen if guest reboots itself in response to the MCE?


That depends...

And the following issue also holds for a guest that is rebooted at
some point having successfully sidelined the bad page.

After the guest has panic'd, a system_reset of the guest or a restart
initiated by crash_kexec() (called by panic() on the guest), usually
results in the guest hanging because the bad page still belongs
to qemu-kvm and is now being referenced by the new guest in some way.
(It actually may not hang, but successfully reboot and be runnable,
with the bad page lurking in the background. It all seems to depend on
where the bad page ends up, and whether it's ever referenced.)

I believe there was an attempt to deal with this in kvm on the host.
See kvm_handle_bad_page(). This function was supposed to result in the
sending of a BUS_MCEERR_AR flavored SIGBUS by do_sigbus() to qemu-kvm
which in theory would result in the right thing happening. But commit
96054569190bdec375fe824e48ca1f4e3b53dd36 prevents the signal from being
sent. So this mechanism needs to be re-worked, and the issue remains.

I would think that if the bad page can't be sidelined, such that
the newly booting guest can't use it, then the new guest shouldn't be
allowed to boot. But perhaps there is some merit in letting it try to
boot and see if one gets 'lucky'.

I understand that Huang is looking into what should be done. He can
give you better information than I in answer to your questions.

Dean


Re: [patch uq/master 7/8] MCE: Relay UCR MCE to guest

2010-10-06 Thread Hidetoshi Seto
(2010/10/07 3:10), Dean Nelson wrote:
 On 10/06/2010 11:05 AM, Marcelo Tosatti wrote:
 On Wed, Oct 06, 2010 at 10:58:36AM +0900, Hidetoshi Seto wrote:
 I got some more question:

 (2010/10/05 3:54), Marcelo Tosatti wrote:
 Index: qemu/target-i386/cpu.h
 ===
 --- qemu.orig/target-i386/cpu.h
 +++ qemu/target-i386/cpu.h
 @@ -250,16 +250,32 @@
   #define PG_ERROR_RSVD_MASK 0x08
   #define PG_ERROR_I_D_MASK  0x10

 -#define MCG_CTL_P    (1UL<<8)   /* MCG_CAP register available */
 +#define MCG_CTL_P    (1ULL<<8)   /* MCG_CAP register available */
 +#define MCG_SER_P    (1ULL<<24) /* MCA recovery/new status bits */

 -#define MCE_CAP_DEF    MCG_CTL_P
 +#define MCE_CAP_DEF    (MCG_CTL_P|MCG_SER_P)
   #define MCE_BANKS_DEF    10


 It seems that current kvm doesn't support SER_P, so injecting SRAO
 to guest will mean that guest receives VAL|UC|!PCC and RIPV event
 from virtual processor that doesn't have SER_P.

 Dean also noted this. I don't think it was deliberate choice to not
 expose SER_P. Huang?
 
 In my testing, I found that MCG_SER_P was not being set (and I was
 running on a Nehalem-EX system). Injecting a MCE resulted in the
 guest entering into panic() from mce_panic(). If crash_kexec()
 finds a kexec_crash_image the system ends up rebooting, otherwise,
 what happens next requires operator intervention.

Good to know.
What I'm concerned about is that if a memory scrubbing SRAO event is
injected when !SER_P, a Linux guest with a certain mce tolerant level
might grade it as UC severity and continue running without any
panicking, killing, or poisoning because of !PCC and RIPV.

Could you provide the panic message of the guest in your test?
I think it can tell me why the mce handler decided to go panic.

 When I applied a patch to the guest's kernel which forces mce_ser to be
 set, as if MCG_SER_P was set (see __mcheck_cpu_cap_init()), I found
 that when the memory page was 'owned' by a guest process, the process
 would be killed (if the page was dirty), and the guest would stay
 running. The HWPoisoned page would be sidelined and not cause any more
 issues.

Excellent.
So while guest kernel knows which page is poisoned, guest processes
are controlled not to touch the page.

... Therefore rebooting the vm and renewing the kernel will lose the
information about which page is poisoned.

 I think most OSes don't expect that it can receives MCE with !PCC
 on traditional x86 processor without SER_P.

 Q1: Is it safe to expect that guests can handle such !PCC event?
 
 This might be best answered by Huang, but as I mentioned above, without
 MCG_SER_P being set, the result was an orderly system panic on the
 guest.

Though I'll wait for Huang (I think he is on holiday), I believe that
a system panic is just one possible option for an AO (Action Optional)
event, regardless of SER_P.

 Q2: What is the expected behavior on the guest?
 
 I think I answered this above.

Yeah, thanks.

 
 Q3: What happen if guest reboots itself in response to the MCE?
 
 That depends...
 
 And the following issue also holds for a guest that is rebooted at
 some point having successfully sidelined the bad page.
 
 After the guest has panic'd, a system_reset of the guest or a restart
 initiated by crash_kexec() (called by panic() on the guest), usually
 results in the guest hanging because the bad page still belongs
 to qemu-kvm and is now being referenced by the new guest in some way.

Yes. In other words my concern about reboot is that new guest kernel
including kdump kernel might try to read the bad page.  If there is
no AR-SIGBUS etc., we need some tricks to inhibit such accesses.

 (It actually may not hang, but successfully reboot and be runnable,
 with the bad page lurking in the background. It all seems to depend on
 where the bad page ends up, and whether it's ever referenced.)

I know some tough guys using their PC with buggy DIMMs :-)

 
 I believe there was an attempt to deal with this in kvm on the host.
 See kvm_handle_bad_page(). This function was supposed to result in the
 sending of a BUS_MCEERR_AR flavored SIGBUS by do_sigbus() to qemu-kvm
 which in theory would result in the right thing happening. But commit
 96054569190bdec375fe824e48ca1f4e3b53dd36 prevents the signal from being
 sent. So this mechanism needs to be re-worked, and the issue remains.

Definitely.
I guess Huang has some plan or hint for reworking this point.

 
 I would think that if the bad page can't be sidelined, such that
 the newly booting guest can't use it, then the new guest shouldn't be
 allowed to boot. But perhaps there is some merit in letting it try to
 boot and see if one gets 'lucky'.

When booting a real machine in the real world, hardware and firmware
usually (or often) run a self-test before passing control to the OS.
Some platforms can boot the OS with a degraded configuration (for
example, less memory) if a component has trouble.  Some BIOSes may
stop booting and show messages like please reseat [component] on 

Re: [patch uq/master 7/8] MCE: Relay UCR MCE to guest

2010-10-05 Thread Hidetoshi Seto
(2010/10/05 3:54), Marcelo Tosatti wrote:
 Port qemu-kvm's
 
 commit 4b62fff1101a7ad77553147717a8bd3bf79df7ef
 Author: Huang Ying ying.hu...@intel.com
 Date:   Mon Sep 21 10:43:25 2009 +0800
 
 MCE: Relay UCR MCE to guest
 
 UCR (uncorrected recovery) MCE is supported in recent Intel CPUs,
 where some hardware error such as some memory error can be reported
 without PCC (processor context corrupted). To recover from such MCE,
 the corresponding memory will be unmapped, and all processes accessing
 the memory will be killed via SIGBUS.
 
 For KVM, if QEMU/KVM is killed, all guest processes will be killed
 too. So we relay SIGBUS from host OS to guest system via a UCR MCE
 injection. Then guest OS can isolate corresponding memory and kill
 necessary guest processes only. SIGBUS sent to main thread (not VCPU
 threads) will be broadcast to all VCPU threads as UCR MCE.
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 

(snip)

 +static void sigbus_handler(int n, struct qemu_signalfd_siginfo *siginfo,
 +                           void *ctx)
 +{
 +#if defined(TARGET_I386)
 +    if (kvm_on_sigbus_vcpu(siginfo->ssi_code, (void *)(intptr_t)siginfo->ssi_addr))
 +#endif
 +        sigbus_reraise();
 +}
 +
  static void qemu_kvm_eat_signal(CPUState *env, int timeout)
  {
      struct timespec ts;
      int r, e;
      siginfo_t siginfo;
      sigset_t waitset;
 +    sigset_t chkset;
  
      ts.tv_sec = timeout / 1000;
      ts.tv_nsec = (timeout % 1000) * 100;
  
      sigemptyset(&waitset);
      sigaddset(&waitset, SIG_IPI);
 +    sigaddset(&waitset, SIGBUS);
  
 -    qemu_mutex_unlock(&qemu_global_mutex);
 -    r = sigtimedwait(&waitset, &siginfo, &ts);
 -    e = errno;
 -    qemu_mutex_lock(&qemu_global_mutex);
 +    do {
 +        qemu_mutex_unlock(&qemu_global_mutex);
  
 -    if (r == -1 && !(e == EAGAIN || e == EINTR)) {
 -        fprintf(stderr, "sigtimedwait: %s\n", strerror(e));
 -        exit(1);
 -    }
 +        r = sigtimedwait(&waitset, &siginfo, &ts);
 +        e = errno;
 +
 +        qemu_mutex_lock(&qemu_global_mutex);
 +
 +        if (r == -1 && !(e == EAGAIN || e == EINTR)) {
 +            fprintf(stderr, "sigtimedwait: %s\n", strerror(e));
 +            exit(1);
 +        }
 +
 +        switch (r) {
 +        case SIGBUS:
 +#ifdef TARGET_I386
 +            if (kvm_on_sigbus(env, siginfo.si_code, siginfo.si_addr))
 +#endif
 +                sigbus_reraise();
 +            break;
 +        default:
 +            break;
 +        }
 +
 +        r = sigpending(&chkset);
 +        if (r == -1) {
 +            fprintf(stderr, "sigpending: %s\n", strerror(e));
 +            exit(1);
 +        }
 +    } while (sigismember(&chkset, SIG_IPI) || sigismember(&chkset, SIGBUS));
  }
  
  static void qemu_kvm_wait_io_event(CPUState *env)

(snip)

 Index: qemu/kvm.h
 ===
 --- qemu.orig/kvm.h
 +++ qemu/kvm.h
 @@ -110,6 +110,9 @@ int kvm_arch_init_vcpu(CPUState *env);
  
  void kvm_arch_reset_vcpu(CPUState *env);
  
 +int kvm_on_sigbus(CPUState *env, int code, void *addr);
 +int kvm_on_sigbus_vcpu(int code, void *addr);
 +
  struct kvm_guest_debug;
  struct kvm_debug_exit_arch;
  

So kvm_on_sigbus() is called from qemu_kvm_eat_signal(), which runs
on the vcpu thread, while kvm_on_sigbus_vcpu() is called via
sigbus_handler(), which is invoked on the iothread using signalfd.

... Inverse naming?


Thanks,
H.Seto



Re: [patch uq/master 7/8] MCE: Relay UCR MCE to guest

2010-10-05 Thread Hidetoshi Seto
I have some more questions:

(2010/10/05 3:54), Marcelo Tosatti wrote:
 Index: qemu/target-i386/cpu.h
 ===
 --- qemu.orig/target-i386/cpu.h
 +++ qemu/target-i386/cpu.h
 @@ -250,16 +250,32 @@
  #define PG_ERROR_RSVD_MASK 0x08
  #define PG_ERROR_I_D_MASK  0x10
  
 -#define MCG_CTL_P       (1UL<<8)   /* MCG_CAP register available */
 +#define MCG_CTL_P       (1ULL<<8)   /* MCG_CAP register available */
 +#define MCG_SER_P       (1ULL<<24) /* MCA recovery/new status bits */
  
 -#define MCE_CAP_DEF     MCG_CTL_P
 +#define MCE_CAP_DEF     (MCG_CTL_P|MCG_SER_P)
  #define MCE_BANKS_DEF   10
  

It seems that current kvm doesn't support SER_P, so injecting SRAO
into the guest means that the guest receives a VAL|UC|!PCC and RIPV
event from a virtual processor that doesn't have SER_P.

I think most OSes don't expect to receive an MCE with !PCC
on a traditional x86 processor without SER_P.

Q1: Is it safe to expect that guests can handle such a !PCC event?
Q2: What is the expected behavior on the guest?
Q3: What happens if the guest reboots itself in response to the MCE?


Thanks,
H.Seto



[patch uq/master 7/8] MCE: Relay UCR MCE to guest

2010-10-04 Thread Marcelo Tosatti
Port qemu-kvm's

commit 4b62fff1101a7ad77553147717a8bd3bf79df7ef
Author: Huang Ying ying.hu...@intel.com
Date:   Mon Sep 21 10:43:25 2009 +0800

MCE: Relay UCR MCE to guest

UCR (uncorrected recovery) MCE is supported in recent Intel CPUs,
where some hardware error such as some memory error can be reported
without PCC (processor context corrupted). To recover from such MCE,
the corresponding memory will be unmapped, and all processes accessing
the memory will be killed via SIGBUS.

For KVM, if QEMU/KVM is killed, all guest processes will be killed
too. So we relay SIGBUS from host OS to guest system via a UCR MCE
injection. Then guest OS can isolate corresponding memory and kill
necessary guest processes only. SIGBUS sent to main thread (not VCPU
threads) will be broadcast to all VCPU threads as UCR MCE.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: qemu/cpus.c
===
--- qemu.orig/cpus.c
+++ qemu/cpus.c
@@ -34,6 +34,10 @@
 
 #include "cpus.h"
 #include "compatfd.h"
+#ifdef CONFIG_LINUX
+#include <sys/prctl.h>
+#include <sys/signalfd.h>
+#endif
 
 #ifdef SIGRTMIN
 #define SIG_IPI (SIGRTMIN+4)
@@ -41,6 +45,10 @@
 #define SIG_IPI SIGUSR1
 #endif
 
+#ifndef PR_MCE_KILL
+#define PR_MCE_KILL 33
+#endif
+
 static CPUState *next_cpu;
 
 /***/
@@ -498,28 +506,77 @@ static void qemu_tcg_wait_io_event(void)
     }
 }
 
+static void sigbus_reraise(void)
+{
+    sigset_t set;
+    struct sigaction action;
+
+    memset(&action, 0, sizeof(action));
+    action.sa_handler = SIG_DFL;
+    if (!sigaction(SIGBUS, &action, NULL)) {
+        raise(SIGBUS);
+        sigemptyset(&set);
+        sigaddset(&set, SIGBUS);
+        sigprocmask(SIG_UNBLOCK, &set, NULL);
+    }
+    perror("Failed to re-raise SIGBUS!\n");
+    abort();
+}
+
+static void sigbus_handler(int n, struct qemu_signalfd_siginfo *siginfo,
+                           void *ctx)
+{
+#if defined(TARGET_I386)
+    if (kvm_on_sigbus_vcpu(siginfo->ssi_code, (void *)(intptr_t)siginfo->ssi_addr))
+#endif
+        sigbus_reraise();
+}
+
 static void qemu_kvm_eat_signal(CPUState *env, int timeout)
 {
     struct timespec ts;
     int r, e;
     siginfo_t siginfo;
     sigset_t waitset;
+    sigset_t chkset;
 
     ts.tv_sec = timeout / 1000;
     ts.tv_nsec = (timeout % 1000) * 100;
 
     sigemptyset(&waitset);
     sigaddset(&waitset, SIG_IPI);
+    sigaddset(&waitset, SIGBUS);
 
-    qemu_mutex_unlock(&qemu_global_mutex);
-    r = sigtimedwait(&waitset, &siginfo, &ts);
-    e = errno;
-    qemu_mutex_lock(&qemu_global_mutex);
+    do {
+        qemu_mutex_unlock(&qemu_global_mutex);
 
-    if (r == -1 && !(e == EAGAIN || e == EINTR)) {
-        fprintf(stderr, "sigtimedwait: %s\n", strerror(e));
-        exit(1);
-    }
+        r = sigtimedwait(&waitset, &siginfo, &ts);
+        e = errno;
+
+        qemu_mutex_lock(&qemu_global_mutex);
+
+        if (r == -1 && !(e == EAGAIN || e == EINTR)) {
+            fprintf(stderr, "sigtimedwait: %s\n", strerror(e));
+            exit(1);
+        }
+
+        switch (r) {
+        case SIGBUS:
+#ifdef TARGET_I386
+            if (kvm_on_sigbus(env, siginfo.si_code, siginfo.si_addr))
+#endif
+                sigbus_reraise();
+            break;
+        default:
+            break;
+        }
+
+        r = sigpending(&chkset);
+        if (r == -1) {
+            fprintf(stderr, "sigpending: %s\n", strerror(e));
+            exit(1);
+        }
+    } while (sigismember(&chkset, SIG_IPI) || sigismember(&chkset, SIGBUS));
 }
 
 static void qemu_kvm_wait_io_event(CPUState *env)
@@ -645,6 +702,7 @@ static void kvm_init_ipi(CPUState *env)
 
     pthread_sigmask(SIG_BLOCK, NULL, &set);
     sigdelset(&set, SIG_IPI);
+    sigdelset(&set, SIGBUS);
     r = kvm_set_signal_mask(env, &set);
     if (r) {
         fprintf(stderr, "kvm_set_signal_mask: %s\n", strerror(r));
@@ -655,6 +713,7 @@ static void kvm_init_ipi(CPUState *env)
 static sigset_t block_io_signals(void)
 {
     sigset_t set;
+    struct sigaction action;
 
     /* SIGUSR2 used by posix-aio-compat.c */
     sigemptyset(&set);
@@ -665,8 +724,15 @@ static sigset_t block_io_signals(void)
     sigaddset(&set, SIGIO);
     sigaddset(&set, SIGALRM);
     sigaddset(&set, SIG_IPI);
+    sigaddset(&set, SIGBUS);
     pthread_sigmask(SIG_BLOCK, &set, NULL);
 
+    memset(&action, 0, sizeof(action));
+    action.sa_flags = SA_SIGINFO;
+    action.sa_sigaction = (void (*)(int, siginfo_t*, void*))sigbus_handler;
+    sigaction(SIGBUS, &action, NULL);
+    prctl(PR_MCE_KILL, 1, 1, 0, 0);
+
     return set;
 }
 
Index: qemu/kvm.h
===
--- qemu.orig/kvm.h
+++ qemu/kvm.h
@@ -110,6 +110,9 @@ int kvm_arch_init_vcpu(CPUState *env);
 
 void kvm_arch_reset_vcpu(CPUState *env);
 
+int kvm_on_sigbus(CPUState *env, int code, void *addr);
+int kvm_on_sigbus_vcpu(int code, void *addr);
+
 struct