Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-24 Thread Gleb Natapov
On Tue, Jan 18, 2011 at 11:09:01AM -0600, Anthony Liguori wrote:
 But we also need to provide a compatible interface to management tools.
 Exposing the device model topology as a compatible interface
 artificially limits us.  It's far better to provide higher level
 supported interfaces to give us the flexibility to change the device
 model as we need to.
 How do you want to change qdev to keep the guest and management tool
 view stable while branching off kvm sub-buses?
 
 The qdev device model is not a stable interface.  I think that's
 been clear from the very beginning.
 

And what was the reason it was declared not stable? Maybe because we
were not sure we would get it right from the start, so changes would be
needed later. But changes should bring qdev closer to reflecting the
device topology as the guest expects it to look. That will bring us as
close to a stable state as possible.  We need this knowledge and
stability in qdev for device path creation, for both kinds of device
paths: OF and the one we use for migration. To create an OF device path
we need to know the topology as seen by the guest (and the guest does
not care how the ISA bus is implemented internally inside the south
bridge); to create the device path used for migration we need
stability, otherwise a change in the qdev topology will break
migration. All these artificial buses you propose to add move us in the
opposite direction and make qdev useless for anything but... well, for
anything.

--
Gleb.


[PATCH 02/31] kvm: convert kvm_ioctl(KVM_CHECK_EXTENSION) to kvm_check_extension()

2011-01-24 Thread Marcelo Tosatti
From: Lai Jiangshan la...@cn.fujitsu.com

Simple cleanup: use the existing helper kvm_check_extension().
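
For reference, the helper being substituted in is a thin wrapper around the
same ioctl that folds errors into "not available" (a sketch of the kvm-all.c
helper of this period; details may differ):

    int kvm_check_extension(KVMState *s, unsigned int extension)
    {
        int ret;

        ret = kvm_ioctl(s, KVM_CHECK_EXTENSION, extension);
        if (ret < 0) {
            /* error or unknown extension: treat as unsupported */
            ret = 0;
        }

        return ret;
    }

Callers can then test the return value directly instead of open-coding the
ioctl and its error handling, which is exactly what the hunks below do.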

Signed-off-by: Lai Jiangshan la...@cn.fujitsu.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
---
 kvm-all.c |2 +-
 target-i386/kvm.c |4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index 255b6fa..935c436 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -717,7 +717,7 @@ int kvm_init(int smp_cpus)
 
 s->broken_set_mem_region = 1;
 #ifdef KVM_CAP_JOIN_MEMORY_REGIONS_WORKS
-ret = kvm_ioctl(s, KVM_CHECK_EXTENSION, KVM_CAP_JOIN_MEMORY_REGIONS_WORKS);
+ret = kvm_check_extension(s, KVM_CAP_JOIN_MEMORY_REGIONS_WORKS);
 if (ret > 0) {
 s->broken_set_mem_region = 0;
 }
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 755f8c9..4004de7 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -187,7 +187,7 @@ static int kvm_get_mce_cap_supported(KVMState *s, uint64_t *mce_cap,
 {
 int r;
 
-r = kvm_ioctl(s, KVM_CHECK_EXTENSION, KVM_CAP_MCE);
+r = kvm_check_extension(s, KVM_CAP_MCE);
 if (r > 0) {
 *max_banks = r;
 return kvm_ioctl(s, KVM_X86_GET_MCE_CAP_SUPPORTED, mce_cap);
@@ -540,7 +540,7 @@ int kvm_arch_init(KVMState *s, int smp_cpus)
  * versions of KVM just assumed that it would be at the end of physical
  * memory but that doesn't work with more than 4GB of memory.  We simply
  * refuse to work with those older versions of KVM. */
-ret = kvm_ioctl(s, KVM_CHECK_EXTENSION, KVM_CAP_SET_TSS_ADDR);
+ret = kvm_check_extension(s, KVM_CAP_SET_TSS_ADDR);
 if (ret <= 0) {
 fprintf(stderr, "kvm does not support KVM_CAP_SET_TSS_ADDR\n");
 return ret;
-- 
1.7.2.3



[PATCH 22/31] kvm: x86: Refactor msr_star/hsave_pa setup and checks

2011-01-24 Thread Marcelo Tosatti
From: Jan Kiszka jan.kis...@siemens.com

Simplify kvm_has_msr_star/hsave_pa to booleans and push their one-time
initialization into kvm_arch_init. Also handle potential errors of that
setup procedure.
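
For context, KVM_GET_MSR_INDEX_LIST (queried by the refactored function
below) follows a common two-call ioctl pattern: a first call with nmsrs = 0
fails with E2BIG but reports the required count, and a second call fills in
the list. A minimal standalone sketch, assuming QEMU's convention that
kvm_ioctl() returns -errno on failure:

    struct kvm_msr_list probe = { .nmsrs = 0 };
    struct kvm_msr_list *list;
    int ret;

    ret = kvm_ioctl(s, KVM_GET_MSR_INDEX_LIST, &probe);
    if (ret < 0 && ret != -E2BIG) {
        return ret;                      /* real failure */
    }
    /* probe.nmsrs now holds the number of supported MSRs */
    list = qemu_mallocz(sizeof(*list) +
                        probe.nmsrs * sizeof(list->indices[0]));
    list->nmsrs = probe.nmsrs;
    ret = kvm_ioctl(s, KVM_GET_MSR_INDEX_LIST, list);
    /* on success, list->indices[0 .. nmsrs-1] are the supported MSRs */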

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
---
 target-i386/kvm.c |   47 +++
 1 files changed, 19 insertions(+), 28 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index c4a22dd..454ddb1 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -54,6 +54,8 @@
 #define BUS_MCEERR_AO 5
 #endif
 
+static bool has_msr_star;
+static bool has_msr_hsave_pa;
 static int lm_capable_kernel;
 
 #ifdef KVM_CAP_EXT_CPUID
@@ -459,13 +461,10 @@ void kvm_arch_reset_vcpu(CPUState *env)
 }
 }
 
-int has_msr_star;
-int has_msr_hsave_pa;
-
-static void kvm_supported_msrs(CPUState *env)
+static int kvm_get_supported_msrs(KVMState *s)
 {
 static int kvm_supported_msrs;
-int ret;
+int ret = 0;
 
 /* first time */
 if (kvm_supported_msrs == 0) {
@@ -476,9 +475,9 @@ static void kvm_supported_msrs(CPUState *env)
 /* Obtain MSR list from KVM.  These are the MSRs that we must
  * save/restore */
 msr_list.nmsrs = 0;
-ret = kvm_ioctl(env->kvm_state, KVM_GET_MSR_INDEX_LIST, &msr_list);
+ret = kvm_ioctl(s, KVM_GET_MSR_INDEX_LIST, &msr_list);
 if (ret < 0 && ret != -E2BIG) {
-return;
+return ret;
 }
 /* Old kernel modules had a bug and could write beyond the provided
memory. Allocate at least a safe amount of 1K. */
@@ -487,17 +486,17 @@ static void kvm_supported_msrs(CPUState *env)
   sizeof(msr_list.indices[0])));
 
 kvm_msr_list->nmsrs = msr_list.nmsrs;
-ret = kvm_ioctl(env->kvm_state, KVM_GET_MSR_INDEX_LIST, kvm_msr_list);
+ret = kvm_ioctl(s, KVM_GET_MSR_INDEX_LIST, kvm_msr_list);
 if (ret >= 0) {
 int i;
 
 for (i = 0; i < kvm_msr_list->nmsrs; i++) {
 if (kvm_msr_list->indices[i] == MSR_STAR) {
-has_msr_star = 1;
+has_msr_star = true;
 continue;
 }
 if (kvm_msr_list->indices[i] == MSR_VM_HSAVE_PA) {
-has_msr_hsave_pa = 1;
+has_msr_hsave_pa = true;
 continue;
 }
 }
@@ -506,19 +505,7 @@ static void kvm_supported_msrs(CPUState *env)
 free(kvm_msr_list);
 }
 
-return;
-}
-
-static int kvm_has_msr_hsave_pa(CPUState *env)
-{
-kvm_supported_msrs(env);
-return has_msr_hsave_pa;
-}
-
-static int kvm_has_msr_star(CPUState *env)
-{
-kvm_supported_msrs(env);
-return has_msr_star;
+return ret;
 }
 
 static int kvm_init_identity_map_page(KVMState *s)
@@ -543,9 +530,13 @@ static int kvm_init_identity_map_page(KVMState *s)
 int kvm_arch_init(KVMState *s, int smp_cpus)
 {
 int ret;
-
 struct utsname utsname;
 
+ret = kvm_get_supported_msrs(s);
+if (ret < 0) {
+return ret;
+}
+
 uname(utsname);
 lm_capable_kernel = strcmp(utsname.machine, "x86_64") == 0;
 
@@ -830,10 +821,10 @@ static int kvm_put_msrs(CPUState *env, int level)
 kvm_msr_entry_set(&msrs[n++], MSR_IA32_SYSENTER_CS, env->sysenter_cs);
 kvm_msr_entry_set(&msrs[n++], MSR_IA32_SYSENTER_ESP, env->sysenter_esp);
 kvm_msr_entry_set(&msrs[n++], MSR_IA32_SYSENTER_EIP, env->sysenter_eip);
-if (kvm_has_msr_star(env)) {
+if (has_msr_star) {
 kvm_msr_entry_set(&msrs[n++], MSR_STAR, env->star);
 }
-if (kvm_has_msr_hsave_pa(env)) {
+if (has_msr_hsave_pa) {
 kvm_msr_entry_set(&msrs[n++], MSR_VM_HSAVE_PA, env->vm_hsave);
 }
 #ifdef TARGET_X86_64
@@ -1076,10 +1067,10 @@ static int kvm_get_msrs(CPUState *env)
 msrs[n++].index = MSR_IA32_SYSENTER_CS;
 msrs[n++].index = MSR_IA32_SYSENTER_ESP;
 msrs[n++].index = MSR_IA32_SYSENTER_EIP;
-if (kvm_has_msr_star(env)) {
+if (has_msr_star) {
 msrs[n++].index = MSR_STAR;
 }
-if (kvm_has_msr_hsave_pa(env)) {
+if (has_msr_hsave_pa) {
 msrs[n++].index = MSR_VM_HSAVE_PA;
 }
 msrs[n++].index = MSR_IA32_TSC;
-- 
1.7.2.3



[PATCH 20/31] kvm: x86: Remove redundant mp_state initialization

2011-01-24 Thread Marcelo Tosatti
From: Jan Kiszka jan.kis...@siemens.com

kvm_arch_reset_vcpu initializes mp_state, and that function is invoked
right after kvm_arch_init_vcpu.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
---
 target-i386/kvm.c |2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 531b69e..07c75c0 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -321,8 +321,6 @@ int kvm_arch_init_vcpu(CPUState *env)
 uint32_t signature[3];
 #endif
 
-env->mp_state = KVM_MP_STATE_RUNNABLE;
-
 env->cpuid_features &= kvm_arch_get_supported_cpuid(env, 1, 0, R_EDX);
 
 i = env->cpuid_ext_features & CPUID_EXT_HYPERVISOR;
-- 
1.7.2.3



[PATCH 03/31] Clean up cpu_inject_x86_mce()

2011-01-24 Thread Marcelo Tosatti
From: Jin Dongming jin.dongm...@np.css.fujitsu.com

Clean up cpu_inject_x86_mce() in preparation for a later patch.

Signed-off-by: Jin Dongming jin.dongm...@np.css.fujitsu.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
---
 target-i386/helper.c |   27 +--
 1 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/target-i386/helper.c b/target-i386/helper.c
index 25a3e36..2c94130 100644
--- a/target-i386/helper.c
+++ b/target-i386/helper.c
@@ -1021,21 +1021,12 @@ static void breakpoint_handler(CPUState *env)
 /* This should come from sysemu.h - if we could include it here... */
 void qemu_system_reset_request(void);
 
-void cpu_inject_x86_mce(CPUState *cenv, int bank, uint64_t status,
+static void qemu_inject_x86_mce(CPUState *cenv, int bank, uint64_t status,
 uint64_t mcg_status, uint64_t addr, uint64_t misc)
 {
 uint64_t mcg_cap = cenv->mcg_cap;
-unsigned bank_num = mcg_cap & 0xff;
 uint64_t *banks = cenv->mce_banks;
 
-if (bank >= bank_num || !(status & MCI_STATUS_VAL))
-return;
-
-if (kvm_enabled()) {
-kvm_inject_x86_mce(cenv, bank, status, mcg_status, addr, misc, 0);
-return;
-}
-
 /*
  * if MSR_MCG_CTL is not all 1s, the uncorrected error
  * reporting is disabled
@@ -1076,6 +1067,22 @@ void cpu_inject_x86_mce(CPUState *cenv, int bank, uint64_t status,
 } else
 banks[1] |= MCI_STATUS_OVER;
 }
+
+void cpu_inject_x86_mce(CPUState *cenv, int bank, uint64_t status,
+uint64_t mcg_status, uint64_t addr, uint64_t misc)
+{
+unsigned bank_num = cenv->mcg_cap & 0xff;
+
+if (bank >= bank_num || !(status & MCI_STATUS_VAL)) {
+return;
+}
+
+if (kvm_enabled()) {
+kvm_inject_x86_mce(cenv, bank, status, mcg_status, addr, misc, 0);
+} else {
+qemu_inject_x86_mce(cenv, bank, status, mcg_status, addr, misc);
+}
+}
 #endif /* !CONFIG_USER_ONLY */
 
 static void mce_init(CPUX86State *cenv)
-- 
1.7.2.3



[PATCH 05/31] Add function for checking mca broadcast of CPU

2011-01-24 Thread Marcelo Tosatti
From: Jin Dongming jin.dongm...@np.css.fujitsu.com

Add a function for checking whether the current CPU supports MCA broadcast.
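
As a worked example of the version decoding this introduces (the CPUID value
is hypothetical): for cpuid_version = 0x000106a5, a Nehalem-class part,
family = (0x106a5 >> 8) & 0x0f = 6 and model = ((0x106a5 >> 12) & 0xf0) +
((0x106a5 >> 4) & 0x0f) = 0x10 + 0x0a = 26, so the (family == 6 && model >=
14) test below reports MCA broadcast as supported.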

Signed-off-by: Jin Dongming jin.dongm...@np.css.fujitsu.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
---
 target-i386/cpu.h|1 +
 target-i386/helper.c |   33 +
 target-i386/kvm.c|6 +-
 3 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index f0c07cd..dddcd74 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -760,6 +760,7 @@ int cpu_x86_exec(CPUX86State *s);
 void cpu_x86_close(CPUX86State *s);
 void x86_cpu_list (FILE *f, fprintf_function cpu_fprintf, const char *optarg);
 void x86_cpudef_setup(void);
+int cpu_x86_support_mca_broadcast(CPUState *env);
 
 int cpu_get_pic_interrupt(CPUX86State *s);
 /* MSDOS compatibility mode FPU exception support */
diff --git a/target-i386/helper.c b/target-i386/helper.c
index 2cfb4a4..6dfa27d 100644
--- a/target-i386/helper.c
+++ b/target-i386/helper.c
@@ -110,6 +110,32 @@ void cpu_x86_close(CPUX86State *env)
 qemu_free(env);
 }
 
+static void cpu_x86_version(CPUState *env, int *family, int *model)
+{
+int cpuver = env->cpuid_version;
+
+if (family == NULL || model == NULL) {
+return;
+}
+
+*family = (cpuver >> 8) & 0x0f;
+*model = ((cpuver >> 12) & 0xf0) + ((cpuver >> 4) & 0x0f);
+}
+
+/* Broadcast MCA signal for processor version 06H_EH and above */
+int cpu_x86_support_mca_broadcast(CPUState *env)
+{
+int family = 0;
+int model = 0;
+
+cpu_x86_version(env, family, model);
+if ((family == 6 && model >= 14) || family > 6) {
+return 1;
+}
+
+return 0;
+}
+
 /***/
 /* x86 debug */
 
@@ -1080,6 +1106,13 @@ void cpu_inject_x86_mce(CPUState *cenv, int bank, uint64_t status,
 return;
 }
 
+if (broadcast) {
+if (!cpu_x86_support_mca_broadcast(cenv)) {
+fprintf(stderr, "Current CPU does not support broadcast\n");
+return;
+}
+}
+
 if (kvm_enabled()) {
 if (broadcast) {
 flag |= MCE_BROADCAST;
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 8b868ad..2115a58 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1711,13 +1711,9 @@ static void hardware_memory_error(void)
 static void kvm_mce_broadcast_rest(CPUState *env)
 {
 CPUState *cenv;
-int family, model, cpuver = env->cpuid_version;
-
-family = (cpuver >> 8) & 0xf;
-model = ((cpuver >> 12) & 0xf0) + ((cpuver >> 4) & 0xf);
 
 /* Broadcast MCA signal for processor version 06H_EH and above */
-if ((family == 6 && model >= 14) || family > 6) {
+if (cpu_x86_support_mca_broadcast(env)) {
 for (cenv = first_cpu; cenv != NULL; cenv = cenv->next_cpu) {
 if (cenv == env) {
 continue;
-- 
1.7.2.3



[PATCH 15/31] kvm: Stop on all fatal exit reasons

2011-01-24 Thread Marcelo Tosatti
From: Jan Kiszka jan.kis...@siemens.com

Ensure that we stop the guest whenever we face a fatal or unknown exit
reason. If we stop, we also have to enforce a cpu loop exit.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
---
 kvm-all.c |   15 +++
 target-i386/kvm.c |4 
 target-ppc/kvm.c  |4 
 3 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index 86ddbd6..eaf9272 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -815,7 +815,7 @@ static int kvm_handle_io(uint16_t port, void *data, int direction, int size,
 }
 
 #ifdef KVM_CAP_INTERNAL_ERROR_DATA
-static void kvm_handle_internal_error(CPUState *env, struct kvm_run *run)
+static int kvm_handle_internal_error(CPUState *env, struct kvm_run *run)
 {
 
 if (kvm_check_extension(kvm_state, KVM_CAP_INTERNAL_ERROR_DATA)) {
@@ -833,13 +833,13 @@ static void kvm_handle_internal_error(CPUState *env, struct kvm_run *run)
 if (run->internal.suberror == KVM_INTERNAL_ERROR_EMULATION) {
 fprintf(stderr, "emulation failure\n");
 if (!kvm_arch_stop_on_emulation_error(env)) {
-return;
+return 0;
 }
 }
 /* FIXME: Should trigger a qmp message to let management know
  * something went wrong.
  */
-vm_stop(0);
+return -1;
 }
 #endif
 
@@ -967,16 +967,19 @@ int kvm_cpu_exec(CPUState *env)
 break;
 case KVM_EXIT_UNKNOWN:
 DPRINTF("kvm_exit_unknown\n");
+ret = -1;
 break;
 case KVM_EXIT_FAIL_ENTRY:
 DPRINTF("kvm_exit_fail_entry\n");
+ret = -1;
 break;
 case KVM_EXIT_EXCEPTION:
 DPRINTF("kvm_exit_exception\n");
+ret = -1;
 break;
 #ifdef KVM_CAP_INTERNAL_ERROR_DATA
 case KVM_EXIT_INTERNAL_ERROR:
-kvm_handle_internal_error(env, run);
+ret = kvm_handle_internal_error(env, run);
 break;
 #endif
 case KVM_EXIT_DEBUG:
@@ -997,6 +1000,10 @@ int kvm_cpu_exec(CPUState *env)
 }
 } while (ret > 0);
 
+if (ret < 0) {
+vm_stop(0);
+env->exit_request = 1;
+}
 if (env->exit_request) {
 env->exit_request = 0;
 env->exception_index = EXCP_INTERRUPT;
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 0aeb079..6b4abaa 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1537,6 +1537,10 @@ int kvm_arch_handle_exit(CPUState *env, struct kvm_run *run)
 case KVM_EXIT_SET_TPR:
 ret = 1;
 break;
+default:
+fprintf(stderr, "KVM: unknown exit reason %d\n", run->exit_reason);
+ret = -1;
+break;
 }
 
 return ret;
diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
index 5caa07c..849b404 100644
--- a/target-ppc/kvm.c
+++ b/target-ppc/kvm.c
@@ -307,6 +307,10 @@ int kvm_arch_handle_exit(CPUState *env, struct kvm_run *run)
 dprintf("handle halt\n");
 ret = kvmppc_handle_halt(env);
 break;
+default:
+fprintf(stderr, "KVM: unknown exit reason %d\n", run->exit_reason);
+ret = -1;
+break;
 }
 
 return ret;
-- 
1.7.2.3



[PATCH 27/31] kvm: x86: Rework identity map and TSS setup for larger BIOS sizes

2011-01-24 Thread Marcelo Tosatti
From: Jan Kiszka jan.kis...@siemens.com

In order to support loading BIOSes > 256K, reorder the code, adjusting
the base if the kernel supports moving the identity map.
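
The resulting guest-physical layout, for reference (addresses taken from the
hunks below):

    identity_base + 0x0000 : EPT identity map page   (1 page)
    identity_base + 0x1000 : vm86 TSS                (3 pages)

identity_base stays at the legacy 0xfffbc000 when the kernel lacks
KVM_CAP_SET_IDENTITY_MAP_ADDR (limiting the BIOS to 256K below 4G) and moves
down to 0xfeffc000 when the capability is present (allowing BIOSes up to
16M), with the whole 0x4000-byte range reserved via an e820 entry.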

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
---
 target-i386/kvm.c |   63 +---
 1 files changed, 30 insertions(+), 33 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 1db8227..72f9fdf 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -493,27 +493,9 @@ static int kvm_get_supported_msrs(KVMState *s)
 return ret;
 }
 
-static int kvm_init_identity_map_page(KVMState *s)
-{
-#ifdef KVM_CAP_SET_IDENTITY_MAP_ADDR
-int ret;
-uint64_t addr = 0xfffbc000;
-
-if (!kvm_check_extension(s, KVM_CAP_SET_IDENTITY_MAP_ADDR)) {
-return 0;
-}
-
-ret = kvm_vm_ioctl(s, KVM_SET_IDENTITY_MAP_ADDR, &addr);
-if (ret < 0) {
-fprintf(stderr, "kvm_set_identity_map_addr: %s\n", strerror(ret));
-return ret;
-}
-#endif
-return 0;
-}
-
 int kvm_arch_init(KVMState *s)
 {
+uint64_t identity_base = 0xfffbc000;
 int ret;
 struct utsname utsname;
 
@@ -525,27 +507,42 @@ int kvm_arch_init(KVMState *s)
 uname(utsname);
 lm_capable_kernel = strcmp(utsname.machine, "x86_64") == 0;
 
-/* create vm86 tss.  KVM uses vm86 mode to emulate 16-bit code
- * directly.  In order to use vm86 mode, a TSS is needed.  Since this
- * must be part of guest physical memory, we need to allocate it. */
-
-/* this address is 3 pages before the bios, and the bios should present
- * as unavaible memory.  FIXME, need to ensure the e820 map deals with
- * this?
- */
 /*
- * Tell fw_cfg to notify the BIOS to reserve the range.
+ * On older Intel CPUs, KVM uses vm86 mode to emulate 16-bit code directly.
+ * In order to use vm86 mode, an EPT identity map and a TSS are needed.
+ * Since these must be part of guest physical memory, we need to allocate
+ * them, both by setting their start addresses in the kernel and by
+ * creating a corresponding e820 entry. We need 4 pages before the BIOS.
+ *
+ * Older KVM versions may not support setting the identity map base. In
+ * that case we need to stick with the default, i.e. a 256K maximum BIOS
+ * size.
  */
-if (e820_add_entry(0xfffbc000, 0x4000, E820_RESERVED) < 0) {
-perror("e820_add_entry() table is full");
-exit(1);
+#ifdef KVM_CAP_SET_IDENTITY_MAP_ADDR
+if (kvm_check_extension(s, KVM_CAP_SET_IDENTITY_MAP_ADDR)) {
+/* Allows up to 16M BIOSes. */
+identity_base = 0xfeffc000;
+
+ret = kvm_vm_ioctl(s, KVM_SET_IDENTITY_MAP_ADDR, &identity_base);
+if (ret < 0) {
+return ret;
+}
 }
-ret = kvm_vm_ioctl(s, KVM_SET_TSS_ADDR, 0xfffbd000);
+#endif
+/* Set TSS base one page after EPT identity map. */
+ret = kvm_vm_ioctl(s, KVM_SET_TSS_ADDR, identity_base + 0x1000);
+if (ret < 0) {
+return ret;
+}
+
+/* Tell fw_cfg to notify the BIOS to reserve the range. */
+ret = e820_add_entry(identity_base, 0x4000, E820_RESERVED);
 if (ret < 0) {
+fprintf(stderr, "e820_add_entry() table is full\n");
 return ret;
 }
 
-return kvm_init_identity_map_page(s);
+return 0;
 }
 
 static void set_v8086_seg(struct kvm_segment *lhs, const SegmentCache *rhs)
-- 
1.7.2.3



[PATCH 18/31] kvm: x86: Align kvm_arch_put_registers code with comment

2011-01-24 Thread Marcelo Tosatti
From: Jan Kiszka jan.kis...@siemens.com

The ordering doesn't matter in this case, but it's better to keep the code
consistent with the comment.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
---
 target-i386/kvm.c |6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 0ba13fc..9bb34ab 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1388,12 +1388,12 @@ int kvm_arch_put_registers(CPUState *env, int level)
 if (ret < 0) {
 return ret;
 }
-/* must be last */
-ret = kvm_guest_debug_workarounds(env);
+ret = kvm_put_debugregs(env);
 if (ret < 0) {
 return ret;
 }
-ret = kvm_put_debugregs(env);
+/* must be last */
+ret = kvm_guest_debug_workarounds(env);
 if (ret < 0) {
 return ret;
 }
-- 
1.7.2.3



[PATCH 28/31] kvm: Flush coalesced mmio buffer on IO window exits

2011-01-24 Thread Marcelo Tosatti
From: Jan Kiszka jan.kis...@siemens.com

We must flush pending mmio writes if we leave kvm_cpu_exec for an IO
window. Otherwise we risk losing those requests when migrating to a
different host during that window.
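
For context, the flush drains the ring buffer shared with the kernel and
replays each buffered write through the normal memory path. A sketch of
kvm_flush_coalesced_mmio_buffer() as it looked in this era (details may
differ):

    void kvm_flush_coalesced_mmio_buffer(void)
    {
        KVMState *s = kvm_state;

        if (s->coalesced_mmio_ring) {
            struct kvm_coalesced_mmio_ring *ring = s->coalesced_mmio_ring;

            while (ring->first != ring->last) {
                struct kvm_coalesced_mmio *ent =
                    &ring->coalesced_mmio[ring->first];

                /* replay the buffered write into guest memory */
                cpu_physical_memory_write(ent->phys_addr, ent->data,
                                          ent->len);
                smp_wmb(); /* read the entry before freeing the slot */
                ring->first = (ring->first + 1) % KVM_COALESCED_MMIO_MAX;
            }
        }
    }

Moving the call before the IO-window check (hunks below) guarantees the
buffer is empty whenever kvm_cpu_exec() returns to its caller.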

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
---
 kvm-all.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index 3a1f63b..9976762 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -918,6 +918,8 @@ int kvm_cpu_exec(CPUState *env)
 cpu_single_env = env;
 kvm_arch_post_run(env, run);
 
+kvm_flush_coalesced_mmio_buffer();
+
 if (ret == -EINTR || ret == -EAGAIN) {
 cpu_exit(env);
 DPRINTF("io window exit\n");
@@ -930,8 +932,6 @@ int kvm_cpu_exec(CPUState *env)
 abort();
 }
 
-kvm_flush_coalesced_mmio_buffer();
-
 ret = 0; /* exit loop */
 switch (run-exit_reason) {
 case KVM_EXIT_IO:
-- 
1.7.2.3



[PATCH 29/31] kvm: Do not use qemu_fair_mutex

2011-01-24 Thread Marcelo Tosatti
From: Jan Kiszka jan.kis...@siemens.com

The imbalance in the hold time of qemu_global_mutex only exists in TCG
mode. In contrast to TCG VCPUs, KVM drops the global lock during guest
execution. We already avoid touching the fairness lock from the
IO-thread in KVM mode, so also stop using it from the VCPU threads.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
---
 cpus.c |2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/cpus.c b/cpus.c
index 0309189..4c9928e 100644
--- a/cpus.c
+++ b/cpus.c
@@ -735,9 +735,7 @@ static sigset_t block_io_signals(void)
 void qemu_mutex_lock_iothread(void)
 {
 if (kvm_enabled()) {
-qemu_mutex_lock(&qemu_fair_mutex);
 qemu_mutex_lock(&qemu_global_mutex);
-qemu_mutex_unlock(&qemu_fair_mutex);
 } else {
 qemu_mutex_lock(&qemu_fair_mutex);
 if (qemu_mutex_trylock(&qemu_global_mutex)) {
-- 
1.7.2.3



[PATCH 07/31] kvm: kvm_mce_inj_* subroutines for templated error injections

2011-01-24 Thread Marcelo Tosatti
From: Jin Dongming jin.dongm...@np.css.fujitsu.com

Refactor the code for maintainability.

Signed-off-by: Hidetoshi Seto seto.hideto...@jp.fujitsu.com
Signed-off-by: Jin Dongming jin.dongm...@np.css.fujitsu.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
---
 target-i386/kvm.c |  111 ++---
 1 files changed, 71 insertions(+), 40 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 5a699fc..ce01e18 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1722,44 +1722,75 @@ static void kvm_mce_broadcast_rest(CPUState *env)
 }
 }
 }
+
+static void kvm_mce_inj_srar_dataload(CPUState *env, target_phys_addr_t paddr)
+{
+struct kvm_x86_mce mce = {
+.bank = 9,
+.status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
+  | MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
+  | MCI_STATUS_AR | 0x134,
+.mcg_status = MCG_STATUS_MCIP | MCG_STATUS_EIPV,
+.addr = paddr,
+.misc = (MCM_ADDR_PHYS << 6) | 0xc,
+};
+int r;
+
+r = kvm_set_mce(env, &mce);
+if (r < 0) {
+fprintf(stderr, "kvm_set_mce: %s\n", strerror(errno));
+abort();
+}
+kvm_mce_broadcast_rest(env);
+}
+
+static void kvm_mce_inj_srao_memscrub(CPUState *env, target_phys_addr_t paddr)
+{
+struct kvm_x86_mce mce = {
+.bank = 9,
+.status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
+  | MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
+  | 0xc0,
+.mcg_status = MCG_STATUS_MCIP | MCG_STATUS_RIPV,
+.addr = paddr,
+.misc = (MCM_ADDR_PHYS << 6) | 0xc,
+};
+int r;
+
+r = kvm_set_mce(env, &mce);
+if (r < 0) {
+fprintf(stderr, "kvm_set_mce: %s\n", strerror(errno));
+abort();
+}
+kvm_mce_broadcast_rest(env);
+}
+
+static void kvm_mce_inj_srao_memscrub2(CPUState *env, target_phys_addr_t paddr)
+{
+uint64_t status;
+
+status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
+| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
+| 0xc0;
+kvm_inject_x86_mce(env, 9, status,
+   MCG_STATUS_MCIP | MCG_STATUS_RIPV, paddr,
+   (MCM_ADDR_PHYS << 6) | 0xc, ABORT_ON_ERROR);
+
+kvm_mce_broadcast_rest(env);
+}
+
 #endif
 
 int kvm_on_sigbus_vcpu(CPUState *env, int code, void *addr)
 {
 #if defined(KVM_CAP_MCE)
-struct kvm_x86_mce mce = {
-.bank = 9,
-};
 void *vaddr;
 ram_addr_t ram_addr;
 target_phys_addr_t paddr;
-int r;
 
 if ((env->mcg_cap & MCG_SER_P) && addr
 && (code == BUS_MCEERR_AR
 || code == BUS_MCEERR_AO)) {
-if (code == BUS_MCEERR_AR) {
-/* Fake an Intel architectural Data Load SRAR UCR */
-mce.status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
-| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
-| MCI_STATUS_AR | 0x134;
-mce.misc = (MCM_ADDR_PHYS << 6) | 0xc;
-mce.mcg_status = MCG_STATUS_MCIP | MCG_STATUS_EIPV;
-} else {
-/*
- * If there is an MCE exception being processed, ignore
- * this SRAO MCE
- */
-if (kvm_mce_in_progress(env)) {
-return 0;
-}
-/* Fake an Intel architectural Memory scrubbing UCR */
-mce.status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
-| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
-| 0xc0;
-mce.misc = (MCM_ADDR_PHYS << 6) | 0xc;
-mce.mcg_status = MCG_STATUS_MCIP | MCG_STATUS_RIPV;
-}
 vaddr = (void *)addr;
 if (qemu_ram_addr_from_host(vaddr, &ram_addr) ||
 !kvm_physical_memory_addr_from_ram(env->kvm_state, ram_addr, &paddr)) {
@@ -1772,13 +1803,20 @@ int kvm_on_sigbus_vcpu(CPUState *env, int code, void *addr)
 hardware_memory_error();
 }
 }
-mce.addr = paddr;
-r = kvm_set_mce(env, &mce);
-if (r < 0) {
-fprintf(stderr, "kvm_set_mce: %s\n", strerror(errno));
-abort();
+
+if (code == BUS_MCEERR_AR) {
+/* Fake an Intel architectural Data Load SRAR UCR */
+kvm_mce_inj_srar_dataload(env, paddr);
+} else {
+/*
+ * If there is an MCE exception being processed, ignore
+ * this SRAO MCE
+ */
+if (!kvm_mce_in_progress(env)) {
+/* Fake an Intel architectural Memory scrubbing UCR */
+kvm_mce_inj_srao_memscrub(env, paddr);
+}
 }
-kvm_mce_broadcast_rest(env);
 } else
 #endif
 {
@@ -1797,7 +1835,6 @@ int kvm_on_sigbus(int code, void *addr)
 {
 #if defined(KVM_CAP_MCE)
 if ((first_cpu->mcg_cap & MCG_SER_P) && addr && code == BUS_MCEERR_AO) {
-uint64_t status;
 void 

[PATCH 30/31] kvm: x86: Implicitly clear nmi_injected/pending on reset

2011-01-24 Thread Marcelo Tosatti
From: Jan Kiszka jan.kis...@siemens.com

All CPUX86State variables before CPU_COMMON are automatically cleared on
reset. Reorder nmi_injected and nmi_pending to avoid having to touch
them explicitly.
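
For reference, the mechanism relied upon here is the memset at the top of
the x86 cpu_reset(), which wipes everything located before CPU_COMMON in one
go; a sketch (the exact boundary field is an assumption):

    memset(env, 0, offsetof(CPUX86State, breakpoints));

so fields moved above CPU_COMMON need no explicit reset code of their own.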

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
---
 target-i386/cpu.h |6 --
 target-i386/kvm.c |2 --
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index a457423..af701a4 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -699,6 +699,10 @@ typedef struct CPUX86State {
 uint32_t smbase;
 int old_exception;  /* exception in flight */
 
+/* KVM states, automatically cleared on reset */
+uint8_t nmi_injected;
+uint8_t nmi_pending;
+
 CPU_COMMON
 
 /* processor features (e.g. for CPUID insn) */
@@ -726,8 +730,6 @@ typedef struct CPUX86State {
 int32_t exception_injected;
 int32_t interrupt_injected;
 uint8_t soft_interrupt;
-uint8_t nmi_injected;
-uint8_t nmi_pending;
 uint8_t has_error_code;
 uint32_t sipi_vector;
 uint32_t cpuid_kvm_features;
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 72f9fdf..b2c5ee0 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -435,8 +435,6 @@ void kvm_arch_reset_vcpu(CPUState *env)
 {
 env->exception_injected = -1;
 env->interrupt_injected = -1;
-env->nmi_injected = 0;
-env->nmi_pending = 0;
 env->xcr0 = 1;
 if (kvm_irqchip_in_kernel()) {
 env->mp_state = cpu_is_bsp(env) ? KVM_MP_STATE_RUNNABLE :
-- 
1.7.2.3



[PATCH 13/31] kvm: Fix coding style violations

2011-01-24 Thread Marcelo Tosatti
From: Jan Kiszka jan.kis...@siemens.com

No functional changes.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
---
 kvm-all.c |  139 ++--
 1 files changed, 79 insertions(+), 60 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index 935c436..86ddbd6 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -88,10 +88,12 @@ static KVMSlot *kvm_alloc_slot(KVMState *s)
 
 for (i = 0; i < ARRAY_SIZE(s->slots); i++) {
 /* KVM private memory slots */
-if (i >= 8 && i < 12)
+if (i >= 8 && i < 12) {
 continue;
-if (s->slots[i].memory_size == 0)
+}
+if (s->slots[i].memory_size == 0) {
 return &s->slots[i];
+}
 }
 
 fprintf(stderr, "%s: no free slot available\n", __func__);
@@ -226,9 +228,10 @@ int kvm_init_vcpu(CPUState *env)
 }
 
 #ifdef KVM_CAP_COALESCED_MMIO
-if (s->coalesced_mmio && !s->coalesced_mmio_ring)
-s->coalesced_mmio_ring = (void *) env->kvm_run +
-   s->coalesced_mmio * PAGE_SIZE;
+if (s->coalesced_mmio && !s->coalesced_mmio_ring) {
+s->coalesced_mmio_ring =
+(void *)env->kvm_run + s->coalesced_mmio * PAGE_SIZE;
+}
 #endif
 
 ret = kvm_arch_init_vcpu(env);
@@ -275,16 +278,14 @@ static int kvm_dirty_pages_log_change(target_phys_addr_t phys_addr,
 
 int kvm_log_start(target_phys_addr_t phys_addr, ram_addr_t size)
 {
-return kvm_dirty_pages_log_change(phys_addr, size,
-  KVM_MEM_LOG_DIRTY_PAGES,
-  KVM_MEM_LOG_DIRTY_PAGES);
+return kvm_dirty_pages_log_change(phys_addr, size, KVM_MEM_LOG_DIRTY_PAGES,
+  KVM_MEM_LOG_DIRTY_PAGES);
 }
 
 int kvm_log_stop(target_phys_addr_t phys_addr, ram_addr_t size)
 {
-return kvm_dirty_pages_log_change(phys_addr, size,
-  0,
-  KVM_MEM_LOG_DIRTY_PAGES);
+return kvm_dirty_pages_log_change(phys_addr, size, 0,
+  KVM_MEM_LOG_DIRTY_PAGES);
 }
 
 static int kvm_set_migration_log(int enable)
@@ -356,7 +357,7 @@ static int kvm_get_dirty_pages_log_range(unsigned long start_addr,
  * @end_addr: end of logged region.
  */
 static int kvm_physical_sync_dirty_bitmap(target_phys_addr_t start_addr,
- target_phys_addr_t end_addr)
+  target_phys_addr_t end_addr)
 {
 KVMState *s = kvm_state;
 unsigned long size, allocated_size = 0;
@@ -480,9 +481,8 @@ static int kvm_check_many_ioeventfds(void)
 #endif
 }
 
-static void kvm_set_phys_mem(target_phys_addr_t start_addr,
-ram_addr_t size,
-ram_addr_t phys_offset)
+static void kvm_set_phys_mem(target_phys_addr_t start_addr, ram_addr_t size,
+ ram_addr_t phys_offset)
 {
 KVMState *s = kvm_state;
 ram_addr_t flags = phys_offset & ~TARGET_PAGE_MASK;
@@ -589,13 +589,13 @@ static void kvm_set_phys_mem(target_phys_addr_t start_addr,
 }
 
 /* in case the KVM bug workaround already consumed the new slot */
-if (!size)
+if (!size) {
 return;
-
+}
 /* KVM does not need to know about this memory */
-if (flags >= IO_MEM_UNASSIGNED)
+if (flags >= IO_MEM_UNASSIGNED) {
 return;
-
+}
 mem = kvm_alloc_slot(s);
 mem->memory_size = size;
 mem->start_addr = start_addr;
@@ -611,30 +611,29 @@ static void kvm_set_phys_mem(target_phys_addr_t start_addr,
 }
 
 static void kvm_client_set_memory(struct CPUPhysMemoryClient *client,
- target_phys_addr_t start_addr,
- ram_addr_t size,
- ram_addr_t phys_offset)
+  target_phys_addr_t start_addr,
+  ram_addr_t size, ram_addr_t phys_offset)
 {
-   kvm_set_phys_mem(start_addr, size, phys_offset);
+kvm_set_phys_mem(start_addr, size, phys_offset);
 }
 
 static int kvm_client_sync_dirty_bitmap(struct CPUPhysMemoryClient *client,
-   target_phys_addr_t start_addr,
-   target_phys_addr_t end_addr)
+target_phys_addr_t start_addr,
+target_phys_addr_t end_addr)
 {
-   return kvm_physical_sync_dirty_bitmap(start_addr, end_addr);
+return kvm_physical_sync_dirty_bitmap(start_addr, end_addr);
 }
 
 static int kvm_client_migration_log(struct CPUPhysMemoryClient *client,
-   int enable)
+int enable)
 {
-   return kvm_set_migration_log(enable);
+return kvm_set_migration_log(enable);
 }
 
 static CPUPhysMemoryClient kvm_cpu_phys_memory_client = {
- 

[PATCH 08/31] kvm: introduce kvm_inject_x86_mce_on

2011-01-24 Thread Marcelo Tosatti
From: Jin Dongming jin.dongm...@np.css.fujitsu.com

Pass a table instead of multiple args.

Note:

kvm_inject_x86_mce(env, bank, status, mcg_status, addr, misc,
   abort_on_error);

is equal to:

struct kvm_x86_mce mce = {
.bank = bank,
.status = status,
.mcg_status = mcg_status,
.addr = addr,
.misc = misc,
};
kvm_inject_x86_mce_on(env, &mce, abort_on_error);

Signed-off-by: Hidetoshi Seto seto.hideto...@jp.fujitsu.com
Signed-off-by: Jin Dongming jin.dongm...@np.css.fujitsu.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
---
 target-i386/kvm.c |   57 +---
 1 files changed, 36 insertions(+), 21 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index ce01e18..9a4bf98 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -263,6 +263,23 @@ static void kvm_do_inject_x86_mce(void *_data)
 }
 }
 
+static void kvm_inject_x86_mce_on(CPUState *env, struct kvm_x86_mce *mce,
+  int flag)
+{
+struct kvm_x86_mce_data data = {
+.env = env,
+.mce = mce,
+.abort_on_error = (flag & ABORT_ON_ERROR),
+};
+
+if (!env->mcg_cap) {
+fprintf(stderr, "MCE support is not enabled!\n");
+return;
+}
+
+run_on_cpu(env, kvm_do_inject_x86_mce, &data);
+}
+
 static void kvm_mce_broadcast_rest(CPUState *env);
 #endif
 
@@ -278,21 +295,12 @@ void kvm_inject_x86_mce(CPUState *cenv, int bank, uint64_t status,
 .addr = addr,
 .misc = misc,
 };
-struct kvm_x86_mce_data data = {
-.env = cenv,
-.mce = mce,
-};
-
-if (!cenv->mcg_cap) {
-fprintf(stderr, "MCE support is not enabled!\n");
-return;
-}
 
 if (flag & MCE_BROADCAST) {
 kvm_mce_broadcast_rest(cenv);
 }
 
-run_on_cpu(cenv, kvm_do_inject_x86_mce, &data);
+kvm_inject_x86_mce_on(cenv, &mce, flag);
 #else
 if (flag & ABORT_ON_ERROR) {
 abort();
@@ -1708,6 +1716,13 @@ static void hardware_memory_error(void)
 #ifdef KVM_CAP_MCE
 static void kvm_mce_broadcast_rest(CPUState *env)
 {
+struct kvm_x86_mce mce = {
+.bank = 1,
+.status = MCI_STATUS_VAL | MCI_STATUS_UC,
+.mcg_status = MCG_STATUS_MCIP | MCG_STATUS_RIPV,
+.addr = 0,
+.misc = 0,
+};
 CPUState *cenv;
 
 /* Broadcast MCA signal for processor version 06H_EH and above */
@@ -1716,9 +1731,7 @@ static void kvm_mce_broadcast_rest(CPUState *env)
 if (cenv == env) {
 continue;
 }
-kvm_inject_x86_mce(cenv, 1, MCI_STATUS_VAL | MCI_STATUS_UC,
-   MCG_STATUS_MCIP | MCG_STATUS_RIPV, 0, 0,
-   ABORT_ON_ERROR);
+kvm_inject_x86_mce_on(cenv, &mce, ABORT_ON_ERROR);
 }
 }
 }
@@ -1767,15 +1780,17 @@ static void kvm_mce_inj_srao_memscrub(CPUState *env, target_phys_addr_t paddr)
 
 static void kvm_mce_inj_srao_memscrub2(CPUState *env, target_phys_addr_t paddr)
 {
-uint64_t status;
-
-status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
-| MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
-| 0xc0;
-kvm_inject_x86_mce(env, 9, status,
-   MCG_STATUS_MCIP | MCG_STATUS_RIPV, paddr,
-   (MCM_ADDR_PHYS << 6) | 0xc, ABORT_ON_ERROR);
+struct kvm_x86_mce mce = {
+.bank = 9,
+.status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN
+  | MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S
+  | 0xc0,
+.mcg_status = MCG_STATUS_MCIP | MCG_STATUS_RIPV,
+.addr = paddr,
+.misc = (MCM_ADDR_PHYS << 6) | 0xc,
+};
 
+kvm_inject_x86_mce_on(env, &mce, ABORT_ON_ERROR);
 kvm_mce_broadcast_rest(env);
 }
 
-- 
1.7.2.3



[PATCH 10/31] kvm: x86: Remove obsolete SS.RPL/DPL alignment

2011-01-24 Thread Marcelo Tosatti
From: Jan Kiszka jan.kis...@siemens.com

This seems to date back to the days when KVM didn't support real mode. The
check is no longer needed and, even worse, is corrupting the guest state
in case SS.RPL != DPL.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Signed-off-by: Avi Kivity a...@redhat.com
---
 target-i386/kvm.c |7 ---
 1 files changed, 0 insertions(+), 7 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index ee7bdf8..7e5982b 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -780,13 +780,6 @@ static int kvm_put_sregs(CPUState *env)
 set_seg(&sregs.fs, &env->segs[R_FS]);
 set_seg(&sregs.gs, &env->segs[R_GS]);
 set_seg(&sregs.ss, &env->segs[R_SS]);
-
-   if (env->cr[0] & CR0_PE_MASK) {
-   /* force ss cpl to cs cpl */
-   sregs.ss.selector = (sregs.ss.selector & ~3) |
-   (sregs.cs.selector & 3);
-   sregs.ss.dpl = sregs.ss.selector & 3;
-   }
 }
 
 set_seg(&sregs.tr, &env->tr);
-- 
1.7.2.3



[PATCH 00/31] [PULL] qemu-kvm.git uq/master queue

2011-01-24 Thread Marcelo Tosatti
The following changes since commit b646968336d4180bdd7d2e24209708dcee6ba400:

  checkpatch: adjust to QEMUisms (2011-01-20 20:58:56 +)

are available in the git repository at:
  git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git uq/master

Jan Kiszka (23):
  kvm: x86: Fix DPL write back of segment registers
  kvm: x86: Remove obsolete SS.RPL/DPL alignment
  kvm: x86: Prevent sign extension of DR7 in guest debugging mode
  kvm: x86: Fix a few coding style violations
  kvm: Fix coding style violations
  kvm: x86: Swallow KVM_EXIT_SET_TPR
  kvm: Stop on all fatal exit reasons
  kvm: Improve reporting of fatal errors
  x86: Optionally dump code bytes on cpu_dump_state
  kvm: x86: Align kvm_arch_put_registers code with comment
  kvm: x86: Prepare kvm_get_mp_state for in-kernel irqchip
  kvm: x86: Remove redundant mp_state initialization
  kvm: x86: Fix xcr0 reset mismerge
  kvm: x86: Refactor msr_star/hsave_pa setup and checks
  kvm: x86: Reset paravirtual MSRs
  kvm: x86: Fix !CONFIG_KVM_PARA build
  kvm: Drop smp_cpus argument from init functions
  kvm: Consolidate must-have capability checks
  kvm: x86: Rework identity map and TSS setup for larger BIOS sizes
  kvm: Flush coalesced mmio buffer on IO window exits
  kvm: Do not use qemu_fair_mutex
  kvm: x86: Implicitly clear nmi_injected/pending on reset
  kvm: x86: Only read/write MSR_KVM_ASYNC_PF_EN if supported

Jin Dongming (6):
  Clean up cpu_inject_x86_mce()
  Add broadcast option for mce command
  Add function for checking mca broadcast of CPU
  kvm: introduce kvm_mce_in_progress
  kvm: kvm_mce_inj_* subroutines for templated error injections
  kvm: introduce kvm_inject_x86_mce_on

Lai Jiangshan (2):
  kvm: Enable user space NMI injection for kvm guest
  kvm: convert kvm_ioctl(KVM_CHECK_EXTENSION) to kvm_check_extension()

 configure |   36 ++-
 cpu-all.h |5 +-
 cpus.c|2 -
 hmp-commands.hx   |6 +-
 kvm-all.c |  247 +
 kvm-stub.c|2 +-
 kvm.h |   14 +-
 monitor.c |7 +-
 target-i386/cpu.h |9 +-
 target-i386/cpuid.c   |5 +-
 target-i386/helper.c  |   97 ++-
 target-i386/kvm.c |  749 +++-
 target-i386/kvm_x86.h |5 +-
 target-ppc/kvm.c  |   10 +-
 target-s390x/kvm.c|6 +-
 vl.c  |2 +-
 16 files changed, 714 insertions(+), 488 deletions(-)


[PATCH 12/31] kvm: x86: Fix a few coding style violations

2011-01-24 Thread Marcelo Tosatti
From: Jan Kiszka jan.kis...@siemens.com

No functional changes.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Signed-off-by: Avi Kivity a...@redhat.com
---
 target-i386/kvm.c |  335 +
 1 files changed, 182 insertions(+), 153 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 85edacc..fda07d2 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -150,34 +150,34 @@ uint32_t kvm_arch_get_supported_cpuid(CPUState *env, uint32_t function,
 
 #ifdef CONFIG_KVM_PARA
 struct kvm_para_features {
-int cap;
-int feature;
+int cap;
+int feature;
 } para_features[] = {
 #ifdef KVM_CAP_CLOCKSOURCE
-{ KVM_CAP_CLOCKSOURCE, KVM_FEATURE_CLOCKSOURCE },
+{ KVM_CAP_CLOCKSOURCE, KVM_FEATURE_CLOCKSOURCE },
 #endif
 #ifdef KVM_CAP_NOP_IO_DELAY
-{ KVM_CAP_NOP_IO_DELAY, KVM_FEATURE_NOP_IO_DELAY },
+{ KVM_CAP_NOP_IO_DELAY, KVM_FEATURE_NOP_IO_DELAY },
 #endif
 #ifdef KVM_CAP_PV_MMU
-{ KVM_CAP_PV_MMU, KVM_FEATURE_MMU_OP },
+{ KVM_CAP_PV_MMU, KVM_FEATURE_MMU_OP },
 #endif
 #ifdef KVM_CAP_ASYNC_PF
-{ KVM_CAP_ASYNC_PF, KVM_FEATURE_ASYNC_PF },
+{ KVM_CAP_ASYNC_PF, KVM_FEATURE_ASYNC_PF },
 #endif
-{ -1, -1 }
+{ -1, -1 }
 };
 
 static int get_para_features(CPUState *env)
 {
-int i, features = 0;
+int i, features = 0;
 
-for (i = 0; i < ARRAY_SIZE(para_features) - 1; i++) {
-if (kvm_check_extension(env->kvm_state, para_features[i].cap))
-features |= (1 << para_features[i].feature);
+for (i = 0; i < ARRAY_SIZE(para_features) - 1; i++) {
+if (kvm_check_extension(env->kvm_state, para_features[i].cap)) {
+features |= (1 << para_features[i].feature);
 }
-
-return features;
+}
+return features;
 }
 #endif
 
@@ -389,13 +389,15 @@ int kvm_arch_init_vcpu(CPUState *env)
 c->index = j;
 cpu_x86_cpuid(env, i, j, &c->eax, &c->ebx, &c->ecx, &c->edx);
 
-if (i == 4 && c->eax == 0)
+if (i == 4 && c->eax == 0) {
 break;
-if (i == 0xb && !(c->ecx & 0xff00))
+}
+if (i == 0xb && !(c->ecx & 0xff00)) {
 break;
-if (i == 0xd && c->eax == 0)
+}
+if (i == 0xd && c->eax == 0) {
 break;
-
+}
 c = &cpuid_data.entries[cpuid_i++];
 }
 break;
@@ -425,17 +427,18 @@ int kvm_arch_init_vcpu(CPUState *env)
 uint64_t mcg_cap;
 int banks;
 
-if (kvm_get_mce_cap_supported(env->kvm_state, &mcg_cap, &banks))
+if (kvm_get_mce_cap_supported(env->kvm_state, &mcg_cap, &banks)) {
 perror("kvm_get_mce_cap_supported FAILED");
-else {
+} else {
 if (banks > MCE_BANKS_DEF)
 banks = MCE_BANKS_DEF;
 mcg_cap &= MCE_CAP_DEF;
 mcg_cap |= banks;
-if (kvm_setup_mce(env, &mcg_cap))
+if (kvm_setup_mce(env, &mcg_cap)) {
 perror("kvm_setup_mce FAILED");
-else
+} else {
 env->mcg_cap = mcg_cap;
+}
 }
 }
 #endif
@@ -577,7 +580,7 @@ int kvm_arch_init(KVMState *s, int smp_cpus)
 
 return kvm_init_identity_map_page(s);
 }
-
+
 static void set_v8086_seg(struct kvm_segment *lhs, const SegmentCache *rhs)
 {
 lhs->selector = rhs->selector;
@@ -616,23 +619,23 @@ static void get_seg(SegmentCache *lhs, const struct kvm_segment *rhs)
 lhs->selector = rhs->selector;
 lhs->base = rhs->base;
 lhs->limit = rhs->limit;
-lhs->flags =
-   (rhs->type << DESC_TYPE_SHIFT)
-   | (rhs->present * DESC_P_MASK)
-   | (rhs->dpl << DESC_DPL_SHIFT)
-   | (rhs->db << DESC_B_SHIFT)
-   | (rhs->s * DESC_S_MASK)
-   | (rhs->l << DESC_L_SHIFT)
-   | (rhs->g * DESC_G_MASK)
-   | (rhs->avl * DESC_AVL_MASK);
+lhs->flags = (rhs->type << DESC_TYPE_SHIFT) |
+ (rhs->present * DESC_P_MASK) |
+ (rhs->dpl << DESC_DPL_SHIFT) |
+ (rhs->db << DESC_B_SHIFT) |
+ (rhs->s * DESC_S_MASK) |
+ (rhs->l << DESC_L_SHIFT) |
+ (rhs->g * DESC_G_MASK) |
+ (rhs->avl * DESC_AVL_MASK);
 }
 
 static void kvm_getput_reg(__u64 *kvm_reg, target_ulong *qemu_reg, int set)
 {
-if (set)
+if (set) {
 *kvm_reg = *qemu_reg;
-else
+} else {
 *qemu_reg = *kvm_reg;
+}
 }
 
 static int kvm_getput_regs(CPUState *env, int set)
@@ -642,8 +645,9 @@ static int kvm_getput_regs(CPUState *env, int set)
 
 if (!set) {
 ret = kvm_vcpu_ioctl(env, KVM_GET_REGS, &regs);
-if (ret < 0)
+if (ret < 0) {
 return ret;
+}
 }
 
 kvm_getput_reg(&regs.rax, &env->regs[R_EAX], set);
@@ -668,8 +672,9 @@ static int kvm_getput_regs(CPUState *env, int set)
 

[PATCH 26/31] kvm: Consolidate must-have capability checks

2011-01-24 Thread Marcelo Tosatti
From: Jan Kiszka jan.kis...@siemens.com

Instead of splattering the code with #ifdefs and runtime checks for
capabilities we cannot work without anyway, provide central test
infrastructure for verifying their availability both at build and
runtime.
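
The central infrastructure revolves around a small capability table. The
supporting definitions added to kvm.h by this patch look roughly like the
following sketch (details may differ):

    typedef struct KVMCapabilityInfo {
        const char *name;
        int value;
    } KVMCapabilityInfo;

    #define KVM_CAP_INFO(CAP) { "KVM_CAP_" stringify(CAP), KVM_CAP_##CAP }
    #define KVM_CAP_LAST_INFO { NULL, 0 }

kvm_check_extension_list() (see the kvm-all.c hunk below) walks such a table
and returns the first entry the running kernel lacks, so the caller can
report the missing capability by name and fail initialization.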

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
---
 configure  |   39 --
 kvm-all.c  |   67 +---
 kvm.h  |   10 +++
 target-i386/kvm.c  |   39 ++
 target-ppc/kvm.c   |4 +++
 target-s390x/kvm.c |4 +++
 6 files changed, 79 insertions(+), 84 deletions(-)

diff --git a/configure b/configure
index 9a02d1f..4673bf0 100755
--- a/configure
+++ b/configure
@@ -1662,18 +1662,31 @@ if test $kvm != no ; then
#if !defined(KVM_API_VERSION) || KVM_API_VERSION < 12 || KVM_API_VERSION > 12
 #error Invalid KVM version
 #endif
-#if !defined(KVM_CAP_USER_MEMORY)
-#error Missing KVM capability KVM_CAP_USER_MEMORY
-#endif
-#if !defined(KVM_CAP_SET_TSS_ADDR)
-#error Missing KVM capability KVM_CAP_SET_TSS_ADDR
-#endif
-#if !defined(KVM_CAP_DESTROY_MEMORY_REGION_WORKS)
-#error Missing KVM capability KVM_CAP_DESTROY_MEMORY_REGION_WORKS
-#endif
-#if !defined(KVM_CAP_USER_NMI)
-#error Missing KVM capability KVM_CAP_USER_NMI
+EOF
+must_have_caps="KVM_CAP_USER_MEMORY \
+KVM_CAP_DESTROY_MEMORY_REGION_WORKS \
+KVM_CAP_COALESCED_MMIO \
+KVM_CAP_SYNC_MMU \
+   "
+if test \( "$cpu" = "i386" -o "$cpu" = "x86_64" \) ; then
+  must_have_caps="$caps \
+  KVM_CAP_SET_TSS_ADDR \
+  KVM_CAP_EXT_CPUID \
+  KVM_CAP_CLOCKSOURCE \
+  KVM_CAP_NOP_IO_DELAY \
+  KVM_CAP_PV_MMU \
+  KVM_CAP_MP_STATE \
+  KVM_CAP_USER_NMI \
+ "
+fi
+for c in $must_have_caps ; do
+  cat >> $TMPC <<EOF
+#if !defined($c)
+#error Missing KVM capability $c
 #endif
+EOF
+done
+cat >> $TMPC <<EOF
 int main(void) { return 0; }
 EOF
  if test "$kerneldir" != "" ; then
@@ -1708,8 +1721,8 @@ EOF
| awk -F "error: " '{if (NR>1) printf(", "); printf("%s",$2);}'
 if test "$kvmerr" != "" ; then
  echo -e "${kvmerr}\n\
-  NOTE: To enable KVM support, update your kernel to 2.6.29+ or install \
-  recent kvm-kmod from http://sourceforge.net/projects/kvm.";
+NOTE: To enable KVM support, update your kernel to 2.6.29+ or install \
+recent kvm-kmod from http://sourceforge.net/projects/kvm.";
 fi
   fi
   feature_not_found kvm
diff --git a/kvm-all.c b/kvm-all.c
index 8053f92..3a1f63b 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -63,9 +63,7 @@ struct KVMState
 int fd;
 int vmfd;
 int coalesced_mmio;
-#ifdef KVM_CAP_COALESCED_MMIO
 struct kvm_coalesced_mmio_ring *coalesced_mmio_ring;
-#endif
 int broken_set_mem_region;
 int migration_log;
 int vcpu_events;
@@ -82,6 +80,12 @@ struct KVMState
 
 static KVMState *kvm_state;
 
+static const KVMCapabilityInfo kvm_required_capabilites[] = {
+KVM_CAP_INFO(USER_MEMORY),
+KVM_CAP_INFO(DESTROY_MEMORY_REGION_WORKS),
+KVM_CAP_LAST_INFO
+};
+
 static KVMSlot *kvm_alloc_slot(KVMState *s)
 {
 int i;
@@ -227,12 +231,10 @@ int kvm_init_vcpu(CPUState *env)
 goto err;
 }
 
-#ifdef KVM_CAP_COALESCED_MMIO
 if (s->coalesced_mmio && !s->coalesced_mmio_ring) {
 s->coalesced_mmio_ring =
 (void *)env->kvm_run + s->coalesced_mmio * PAGE_SIZE;
 }
-#endif
 
 ret = kvm_arch_init_vcpu(env);
 if (ret == 0) {
@@ -401,7 +403,6 @@ static int kvm_physical_sync_dirty_bitmap(target_phys_addr_t start_addr,
 int kvm_coalesce_mmio_region(target_phys_addr_t start, ram_addr_t size)
 {
 int ret = -ENOSYS;
-#ifdef KVM_CAP_COALESCED_MMIO
 KVMState *s = kvm_state;
 
 if (s->coalesced_mmio) {
@@ -412,7 +413,6 @@ int kvm_coalesce_mmio_region(target_phys_addr_t start, 
ram_addr_t size)
 
 ret = kvm_vm_ioctl(s, KVM_REGISTER_COALESCED_MMIO, &zone);
 }
-#endif
 
 return ret;
 }
@@ -420,7 +420,6 @@ int kvm_coalesce_mmio_region(target_phys_addr_t start, ram_addr_t size)
 int kvm_uncoalesce_mmio_region(target_phys_addr_t start, ram_addr_t size)
 {
 int ret = -ENOSYS;
-#ifdef KVM_CAP_COALESCED_MMIO
 KVMState *s = kvm_state;
 
 if (s->coalesced_mmio) {
@@ -431,7 +430,6 @@ int kvm_uncoalesce_mmio_region(target_phys_addr_t start, ram_addr_t size)
 
 ret = kvm_vm_ioctl(s, KVM_UNREGISTER_COALESCED_MMIO, &zone);
 }
-#endif
 
 return ret;
 }
@@ -481,6 +479,18 @@ static int kvm_check_many_ioeventfds(void)
 #endif
 }
 
+static const KVMCapabilityInfo *
+kvm_check_extension_list(KVMState *s, const KVMCapabilityInfo *list)
+{
+while (list->name) {
+if (!kvm_check_extension(s, list->value)) {
+return list;
+}
+list++;
+}
+

[PATCH 19/31] kvm: x86: Prepare kvm_get_mp_state for in-kernel irqchip

2011-01-24 Thread Marcelo Tosatti
From: Jan Kiszka jan.kis...@siemens.com

This code path will not yet be taken as we still lack in-kernel irqchip
support. But qemu-kvm can already make use of it and drop its own
mp_state access services.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
---
 target-i386/kvm.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 9bb34ab..531b69e 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1198,6 +1198,9 @@ static int kvm_get_mp_state(CPUState *env)
 return ret;
 }
 env->mp_state = mp_state.mp_state;
+if (kvm_irqchip_in_kernel()) {
+env->halted = (mp_state.mp_state == KVM_MP_STATE_HALTED);
+}
 return 0;
 }
 
-- 
1.7.2.3



[PATCH 14/31] kvm: x86: Swallow KVM_EXIT_SET_TPR

2011-01-24 Thread Marcelo Tosatti
From: Jan Kiszka jan.kis...@siemens.com

This exit only triggers activity in the common exit path, but we should
accept it in order to be able to detect unknown exit types.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
---
 target-i386/kvm.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index fda07d2..0aeb079 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1534,6 +1534,9 @@ int kvm_arch_handle_exit(CPUState *env, struct kvm_run *run)
 DPRINTF("handle_hlt\n");
 ret = kvm_handle_halt(env);
 break;
+case KVM_EXIT_SET_TPR:
+ret = 1;
+break;
 }
 
 return ret;
-- 
1.7.2.3



[PATCH 01/31] kvm: Enable user space NMI injection for kvm guest

2011-01-24 Thread Marcelo Tosatti
From: Lai Jiangshan la...@cn.fujitsu.com

Make use of the new KVM_NMI IOCTL to send NMIs into the KVM guest if the
user space raised them. (example: qemu monitor's nmi command)
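
Usage, for illustration (HMP monitor syntax of this period; the CPU-index
argument is an assumption):

    (qemu) nmi 0

This raises CPU_INTERRUPT_NMI on the selected VCPU; the kvm_arch_pre_run()
hunk below then forwards it to the kernel via the KVM_NMI ioctl on the next
guest entry.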

Signed-off-by: Lai Jiangshan la...@cn.fujitsu.com
Acked-by: Jan Kiszka jan.kis...@siemens.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
---
 configure |3 +++
 target-i386/kvm.c |7 +++
 2 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/configure b/configure
index 210670c..9a02d1f 100755
--- a/configure
+++ b/configure
@@ -1671,6 +1671,9 @@ if test "$kvm" != "no" ; then
 #if !defined(KVM_CAP_DESTROY_MEMORY_REGION_WORKS)
 #error Missing KVM capability KVM_CAP_DESTROY_MEMORY_REGION_WORKS
 #endif
+#if !defined(KVM_CAP_USER_NMI)
+#error Missing KVM capability KVM_CAP_USER_NMI
+#endif
 int main(void) { return 0; }
 EOF
   if test $kerneldir !=  ; then
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 7dfc357..755f8c9 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1417,6 +1417,13 @@ int kvm_arch_get_registers(CPUState *env)
 
 int kvm_arch_pre_run(CPUState *env, struct kvm_run *run)
 {
+/* Inject NMI */
+if (env->interrupt_request & CPU_INTERRUPT_NMI) {
+env->interrupt_request &= ~CPU_INTERRUPT_NMI;
+DPRINTF("injected NMI\n");
+kvm_vcpu_ioctl(env, KVM_NMI);
+}
+
 /* Try to inject an interrupt if the guest can accept it */
 if (run->ready_for_interrupt_injection &&
 (env->interrupt_request & CPU_INTERRUPT_HARD) &&
-- 
1.7.2.3



[PATCH 24/31] kvm: x86: Fix !CONFIG_KVM_PARA build

2011-01-24 Thread Marcelo Tosatti
From: Jan Kiszka jan.kis...@siemens.com

If we lack kvm_para.h, MSR_KVM_ASYNC_PF_EN is not defined. The change in
kvm_arch_init_vcpu is just for consistency reasons.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
---
 target-i386/kvm.c |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 825af42..feaf33d 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -319,7 +319,7 @@ int kvm_arch_init_vcpu(CPUState *env)
 uint32_t limit, i, j, cpuid_i;
 uint32_t unused;
 struct kvm_cpuid_entry2 *c;
-#ifdef KVM_CPUID_SIGNATURE
+#ifdef CONFIG_KVM_PARA
 uint32_t signature[3];
 #endif
 
@@ -855,7 +855,7 @@ static int kvm_put_msrs(CPUState *env, int level)
 kvm_msr_entry_set(&msrs[n++], MSR_KVM_SYSTEM_TIME,
   env->system_time_msr);
 kvm_msr_entry_set(&msrs[n++], MSR_KVM_WALL_CLOCK, env->wall_clock_msr);
-#ifdef KVM_CAP_ASYNC_PF
+#if defined(CONFIG_KVM_PARA) && defined(KVM_CAP_ASYNC_PF)
 kvm_msr_entry_set(&msrs[n++], MSR_KVM_ASYNC_PF_EN, env->async_pf_en_msr);
 #endif
 }
@@ -1091,7 +1091,7 @@ static int kvm_get_msrs(CPUState *env)
 #endif
 msrs[n++].index = MSR_KVM_SYSTEM_TIME;
 msrs[n++].index = MSR_KVM_WALL_CLOCK;
-#ifdef KVM_CAP_ASYNC_PF
+#if defined(CONFIG_KVM_PARA) && defined(KVM_CAP_ASYNC_PF)
 msrs[n++].index = MSR_KVM_ASYNC_PF_EN;
 #endif
 
@@ -1167,7 +1167,7 @@ static int kvm_get_msrs(CPUState *env)
 }
 #endif
 break;
-#ifdef KVM_CAP_ASYNC_PF
+#if defined(CONFIG_KVM_PARA) && defined(KVM_CAP_ASYNC_PF)
 case MSR_KVM_ASYNC_PF_EN:
 env->async_pf_en_msr = msrs[i].data;
 break;
-- 
1.7.2.3



[PATCH 31/31] kvm: x86: Only read/write MSR_KVM_ASYNC_PF_EN if supported

2011-01-24 Thread Marcelo Tosatti
From: Jan Kiszka jan.kis...@siemens.com

If the kernel does not support KVM_CAP_ASYNC_PF, it also does not know
about the related MSR. So skip it during state synchronization in that
case. Fixes annoying kernel warnings.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
---
 target-i386/kvm.c |   13 +++--
 1 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index b2c5ee0..8e8880a 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -63,6 +63,9 @@ const KVMCapabilityInfo kvm_arch_required_capabilities[] = {
 
 static bool has_msr_star;
 static bool has_msr_hsave_pa;
#if defined(CONFIG_KVM_PARA) && defined(KVM_CAP_ASYNC_PF)
+static bool has_msr_async_pf_en;
+#endif
 static int lm_capable_kernel;
 
 static struct kvm_cpuid2 *try_get_cpuid(KVMState *s, int max)
@@ -164,6 +167,7 @@ static int get_para_features(CPUState *env)
 features |= (1 << para_features[i].feature);
 }
 }
+has_msr_async_pf_en = features & (1 << KVM_FEATURE_ASYNC_PF);
 return features;
 }
 #endif
@@ -828,7 +832,10 @@ static int kvm_put_msrs(CPUState *env, int level)
   env->system_time_msr);
 kvm_msr_entry_set(&msrs[n++], MSR_KVM_WALL_CLOCK, env->wall_clock_msr);
 #if defined(CONFIG_KVM_PARA) && defined(KVM_CAP_ASYNC_PF)
-kvm_msr_entry_set(&msrs[n++], MSR_KVM_ASYNC_PF_EN, env->async_pf_en_msr);
+if (has_msr_async_pf_en) {
+kvm_msr_entry_set(&msrs[n++], MSR_KVM_ASYNC_PF_EN,
+  env->async_pf_en_msr);
+}
 #endif
 }
 #ifdef KVM_CAP_MCE
@@ -1064,7 +1071,9 @@ static int kvm_get_msrs(CPUState *env)
 msrs[n++].index = MSR_KVM_SYSTEM_TIME;
 msrs[n++].index = MSR_KVM_WALL_CLOCK;
 #if defined(CONFIG_KVM_PARA) && defined(KVM_CAP_ASYNC_PF)
-msrs[n++].index = MSR_KVM_ASYNC_PF_EN;
+if (has_msr_async_pf_en) {
+msrs[n++].index = MSR_KVM_ASYNC_PF_EN;
+}
 #endif
 
 #ifdef KVM_CAP_MCE
-- 
1.7.2.3



[PATCH 25/31] kvm: Drop smp_cpus argument from init functions

2011-01-24 Thread Marcelo Tosatti
From: Jan Kiszka jan.kis...@siemens.com

No longer used.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
---
 kvm-all.c  |4 ++--
 kvm-stub.c |2 +-
 kvm.h  |4 ++--
 target-i386/kvm.c  |2 +-
 target-ppc/kvm.c   |2 +-
 target-s390x/kvm.c |2 +-
 vl.c   |2 +-
 7 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index 41decde..8053f92 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -636,7 +636,7 @@ static CPUPhysMemoryClient kvm_cpu_phys_memory_client = {
 .migration_log = kvm_client_migration_log,
 };
 
-int kvm_init(int smp_cpus)
+int kvm_init(void)
 {
     static const char upgrade_note[] =
         "Please upgrade to at least kernel 2.6.29 or recent kvm-kmod\n"
@@ -749,7 +749,7 @@ int kvm_init(int smp_cpus)
     s->xcrs = kvm_check_extension(s, KVM_CAP_XCRS);
 #endif
 
-ret = kvm_arch_init(s, smp_cpus);
+ret = kvm_arch_init(s);
     if (ret < 0) {
 goto err;
 }
diff --git a/kvm-stub.c b/kvm-stub.c
index 33d4476..88682f2 100644
--- a/kvm-stub.c
+++ b/kvm-stub.c
@@ -58,7 +58,7 @@ int kvm_check_extension(KVMState *s, unsigned int extension)
 return 0;
 }
 
-int kvm_init(int smp_cpus)
+int kvm_init(void)
 {
 return -ENOSYS;
 }
diff --git a/kvm.h b/kvm.h
index ce08d42..a971752 100644
--- a/kvm.h
+++ b/kvm.h
@@ -34,7 +34,7 @@ struct kvm_run;
 
 /* external API */
 
-int kvm_init(int smp_cpus);
+int kvm_init(void);
 
 int kvm_has_sync_mmu(void);
 int kvm_has_vcpu_events(void);
@@ -105,7 +105,7 @@ int kvm_arch_get_registers(CPUState *env);
 
 int kvm_arch_put_registers(CPUState *env, int level);
 
-int kvm_arch_init(KVMState *s, int smp_cpus);
+int kvm_arch_init(KVMState *s);
 
 int kvm_arch_init_vcpu(CPUState *env);
 
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index feaf33d..016b67d 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -527,7 +527,7 @@ static int kvm_init_identity_map_page(KVMState *s)
 return 0;
 }
 
-int kvm_arch_init(KVMState *s, int smp_cpus)
+int kvm_arch_init(KVMState *s)
 {
 int ret;
 struct utsname utsname;
diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
index 849b404..3c05630 100644
--- a/target-ppc/kvm.c
+++ b/target-ppc/kvm.c
@@ -56,7 +56,7 @@ static void kvm_kick_env(void *env)
 qemu_cpu_kick(env);
 }
 
-int kvm_arch_init(KVMState *s, int smp_cpus)
+int kvm_arch_init(KVMState *s)
 {
 #ifdef KVM_CAP_PPC_UNSET_IRQ
 cap_interrupt_unset = kvm_check_extension(s, KVM_CAP_PPC_UNSET_IRQ);
diff --git a/target-s390x/kvm.c b/target-s390x/kvm.c
index adf4a9e..b177e10 100644
--- a/target-s390x/kvm.c
+++ b/target-s390x/kvm.c
@@ -70,7 +70,7 @@
 #define SCLP_CMDW_READ_SCP_INFO 0x00020001
 #define SCLP_CMDW_READ_SCP_INFO_FORCED  0x00120001
 
-int kvm_arch_init(KVMState *s, int smp_cpus)
+int kvm_arch_init(KVMState *s)
 {
 return 0;
 }
diff --git a/vl.c b/vl.c
index 0292184..33f844f 100644
--- a/vl.c
+++ b/vl.c
@@ -2836,7 +2836,7 @@ int main(int argc, char **argv, char **envp)
 }
 
 if (kvm_allowed) {
-int ret = kvm_init(smp_cpus);
+int ret = kvm_init();
         if (ret < 0) {
             if (!kvm_available()) {
                 printf("KVM not supported for this target\n");
-- 
1.7.2.3



[PATCH 21/31] kvm: x86: Fix xcr0 reset mismerge

2011-01-24 Thread Marcelo Tosatti
From: Jan Kiszka jan.kis...@siemens.com

For unknown reasons, xcr0 reset ended up in kvm_arch_update_guest_debug
on upstream merge. Fix this and also remove the misleading comment (1 is
THE reset value).

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
---
 target-i386/kvm.c |3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 07c75c0..c4a22dd 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -450,6 +450,7 @@ void kvm_arch_reset_vcpu(CPUState *env)
     env->interrupt_injected = -1;
     env->nmi_injected = 0;
     env->nmi_pending = 0;
+    env->xcr0 = 1;
     if (kvm_irqchip_in_kernel()) {
         env->mp_state = cpu_is_bsp(env) ? KVM_MP_STATE_RUNNABLE :
                                           KVM_MP_STATE_UNINITIALIZED;
@@ -1759,8 +1760,6 @@ void kvm_arch_update_guest_debug(CPUState *env, struct kvm_guest_debug *dbg)
                 ((uint32_t)len_code[hw_breakpoint[n].len] << (18 + n*4));
         }
     }
-    /* Legal xcr0 for loading */
-    env->xcr0 = 1;
 }
 #endif /* KVM_CAP_SET_GUEST_DEBUG */
 
-- 
1.7.2.3



[PATCH 17/31] x86: Optionally dump code bytes on cpu_dump_state

2011-01-24 Thread Marcelo Tosatti
From: Jan Kiszka jan.kis...@siemens.com

Introduce the cpu_dump_state flag CPU_DUMP_CODE and implement it for
x86. This writes out the code bytes around the current instruction
pointer. Make use of this feature in KVM to help debugging fatal vm
exits.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
---
 cpu-all.h|2 ++
 kvm-all.c|4 ++--
 target-i386/helper.c |   21 +
 3 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/cpu-all.h b/cpu-all.h
index 4ce4e83..ffbd6a4 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -765,6 +765,8 @@ int page_check_range(target_ulong start, target_ulong len, int flags);
 CPUState *cpu_copy(CPUState *env);
 CPUState *qemu_get_cpu(int cpu);
 
+#define CPU_DUMP_CODE 0x0001
+
 void cpu_dump_state(CPUState *env, FILE *f, fprintf_function cpu_fprintf,
 int flags);
 void cpu_dump_statistics(CPUState *env, FILE *f, fprintf_function cpu_fprintf,
diff --git a/kvm-all.c b/kvm-all.c
index 10e1194..41decde 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -832,7 +832,7 @@ static int kvm_handle_internal_error(CPUState *env, struct kvm_run *run)
     if (run->internal.suberror == KVM_INTERNAL_ERROR_EMULATION) {
         fprintf(stderr, "emulation failure\n");
         if (!kvm_arch_stop_on_emulation_error(env)) {
-            cpu_dump_state(env, stderr, fprintf, 0);
+            cpu_dump_state(env, stderr, fprintf, CPU_DUMP_CODE);
             return 0;
         }
     }
@@ -994,7 +994,7 @@ int kvm_cpu_exec(CPUState *env)
     } while (ret > 0);
 
     if (ret < 0) {
-        cpu_dump_state(env, stderr, fprintf, 0);
+        cpu_dump_state(env, stderr, fprintf, CPU_DUMP_CODE);
         vm_stop(0);
         env->exit_request = 1;
     }
diff --git a/target-i386/helper.c b/target-i386/helper.c
index 6dfa27d..1217452 100644
--- a/target-i386/helper.c
+++ b/target-i386/helper.c
@@ -249,6 +249,9 @@ done:
     cpu_fprintf(f, "\n");
 }
 
+#define DUMP_CODE_BYTES_TOTAL    50
+#define DUMP_CODE_BYTES_BACKWARD 20
+
 void cpu_dump_state(CPUState *env, FILE *f, fprintf_function cpu_fprintf,
 int flags)
 {
@@ -434,6 +437,24 @@ void cpu_dump_state(CPUState *env, FILE *f, fprintf_function cpu_fprintf,
             cpu_fprintf(f, " ");
         }
     }
+    if (flags & CPU_DUMP_CODE) {
+        target_ulong base = env->segs[R_CS].base + env->eip;
+        target_ulong offs = MIN(env->eip, DUMP_CODE_BYTES_BACKWARD);
+        uint8_t code;
+        char codestr[3];
+
+        cpu_fprintf(f, "Code=");
+        for (i = 0; i < DUMP_CODE_BYTES_TOTAL; i++) {
+            if (cpu_memory_rw_debug(env, base - offs + i, &code, 1, 0) == 0) {
+                snprintf(codestr, sizeof(codestr), "%02x", code);
+            } else {
+                snprintf(codestr, sizeof(codestr), "??");
+            }
+            cpu_fprintf(f, "%s%s%s%s", i > 0 ? " " : "",
+                        i == offs ? "<" : "", codestr, i == offs ? ">" : "");
+        }
+        cpu_fprintf(f, "\n");
+    }
 }
 
 /***/
-- 
1.7.2.3



[PATCH 06/31] kvm: introduce kvm_mce_in_progress

2011-01-24 Thread Marcelo Tosatti
From: Jin Dongming jin.dongm...@np.css.fujitsu.com

Share same error handing, and rename this function after
MCIP (Machine Check In Progress) flag.

Signed-off-by: Hidetoshi Seto seto.hideto...@jp.fujitsu.com
Signed-off-by: Jin Dongming jin.dongm...@np.css.fujitsu.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
---
 target-i386/kvm.c |   15 +--
 1 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 2115a58..5a699fc 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -219,7 +219,7 @@ static int kvm_get_msr(CPUState *env, struct kvm_msr_entry *msrs, int n)
 }
 
 /* FIXME: kill this and kvm_get_msr, use env->mcg_status instead */
-static int kvm_mce_in_exception(CPUState *env)
+static int kvm_mce_in_progress(CPUState *env)
 {
 struct kvm_msr_entry msr_mcg_status = {
 .index = MSR_MCG_STATUS,
@@ -228,7 +228,8 @@ static int kvm_mce_in_exception(CPUState *env)
 
     r = kvm_get_msr(env, &msr_mcg_status, 1);
     if (r == -1 || r == 0) {
-        return -1;
+        fprintf(stderr, "Failed to get MCE status\n");
+        return 0;
     }
     return !!(msr_mcg_status.data & MCG_STATUS_MCIP);
 }
@@ -248,10 +249,7 @@ static void kvm_do_inject_x86_mce(void *_data)
     /* If there is an MCE exception being processed, ignore this SRAO MCE */
     if ((data->env->mcg_cap & MCG_SER_P) &&
         !(data->mce->status & MCI_STATUS_AR)) {
-        r = kvm_mce_in_exception(data->env);
-        if (r == -1) {
-            fprintf(stderr, "Failed to get MCE status\n");
-        } else if (r) {
+        if (kvm_mce_in_progress(data->env)) {
             return;
         }
     }
@@ -1752,10 +1750,7 @@ int kvm_on_sigbus_vcpu(CPUState *env, int code, void *addr)
          * If there is an MCE exception being processed, ignore
          * this SRAO MCE
          */
-        r = kvm_mce_in_exception(env);
-        if (r == -1) {
-            fprintf(stderr, "Failed to get MCE status\n");
-        } else if (r) {
+        if (kvm_mce_in_progress(env)) {
             return 0;
         }
         /* Fake an Intel architectural Memory scrubbing UCR */
-- 
1.7.2.3



[PATCH 09/31] kvm: x86: Fix DPL write back of segment registers

2011-01-24 Thread Marcelo Tosatti
From: Jan Kiszka jan.kis...@siemens.com

The DPL is stored in the flags and not in the selector. In fact, the RPL
may differ from the DPL at some point in time, and so we were corrupting
the guest state so far.
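
To make the mismatch concrete, a tiny standalone illustration (invented values; DESC_DPL_SHIFT as in target-i386/cpu.h):

    #include <stdio.h>
    #include <stdint.h>

    #define DESC_DPL_SHIFT 13   /* as in target-i386/cpu.h */

    int main(void)
    {
        /* invented example: selector with RPL 3, descriptor flags with DPL 0 */
        uint16_t selector = 0x2b;        /* index 5, TI 0, RPL 3 */
        uint32_t flags    = 0x00009300;  /* P=1, DPL=0, S=1, type=0x3 */

        printf("rpl=%d dpl=%d\n", selector & 3,
               (flags >> DESC_DPL_SHIFT) & 3);  /* prints rpl=3 dpl=0 */
        return 0;
    }

Since RPL and DPL may legitimately differ, only the flags are a safe source for the DPL, which is exactly what the hunk below switches to.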

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Signed-off-by: Avi Kivity a...@redhat.com
---
 target-i386/kvm.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 9a4bf98..ee7bdf8 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -602,7 +602,7 @@ static void set_seg(struct kvm_segment *lhs, const SegmentCache *rhs)
     lhs->limit = rhs->limit;
     lhs->type = (flags >> DESC_TYPE_SHIFT) & 15;
     lhs->present = (flags & DESC_P_MASK) != 0;
-    lhs->dpl = rhs->selector & 3;
+    lhs->dpl = (flags >> DESC_DPL_SHIFT) & 3;
     lhs->db = (flags >> DESC_B_SHIFT) & 1;
     lhs->s = (flags & DESC_S_MASK) != 0;
     lhs->l = (flags >> DESC_L_SHIFT) & 1;
-- 
1.7.2.3



[PATCH 23/31] kvm: x86: Reset paravirtual MSRs

2011-01-24 Thread Marcelo Tosatti
From: Jan Kiszka jan.kis...@siemens.com

Make sure to write the cleared MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
and MSR_KVM_ASYNC_PF_EN to the kernel state so that a freshly booted
guest cannot be disturbed by old values.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
CC: Glauber Costa glom...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
---
 target-i386/kvm.c |7 +++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 454ddb1..825af42 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -845,6 +845,13 @@ static int kvm_put_msrs(CPUState *env, int level)
         if (smp_cpus == 1 || env->tsc != 0) {
             kvm_msr_entry_set(&msrs[n++], MSR_IA32_TSC, env->tsc);
 }
+}
+/*
+ * The following paravirtual MSRs have side effects on the guest or are
+ * too heavy for normal writeback. Limit them to reset or full state
+ * updates.
+ */
+if (level = KVM_PUT_RESET_STATE) {
         kvm_msr_entry_set(&msrs[n++], MSR_KVM_SYSTEM_TIME,
                           env->system_time_msr);
         kvm_msr_entry_set(&msrs[n++], MSR_KVM_WALL_CLOCK, env->wall_clock_msr);
-- 
1.7.2.3



[PATCH 11/31] kvm: x86: Prevent sign extension of DR7 in guest debugging mode

2011-01-24 Thread Marcelo Tosatti
From: Jan Kiszka jan.kis...@siemens.com

This unbreaks guest debugging when the 4th hardware breakpoint used for
guest debugging is a watchpoint 4 or 8 bytes in length. Bit 31 of DR7 is
set in that case and used to cause a sign extension to the high word,
which was breaking the guest state (vm entry failure).
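
To illustrate the hazard with concrete numbers (example code, not part of the patch): for the 4th breakpoint n = 3, so the len bits land at bits 30-31 of a 32-bit intermediate, and plain int arithmetic sign-extends when widened to the 64-bit debug register:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* len_code 2 selects an 8-byte watchpoint; for breakpoint n = 3 it
         * is shifted by 18 + 3*4 = 30, setting bit 31 of the int result. */
        int len = 2;
        uint64_t bad = (uint64_t)(len << 30);           /* high word smeared */
        uint64_t ok  = (uint64_t)((uint32_t)len << 30); /* high word clean */

        printf("bad=%016llx ok=%016llx\n",
               (unsigned long long)bad, (unsigned long long)ok);
        return 0;   /* prints bad=ffffffff80000000 ok=0000000080000000 */
    }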

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Signed-off-by: Avi Kivity a...@redhat.com
---
 target-i386/kvm.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 7e5982b..85edacc 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1686,7 +1686,7 @@ void kvm_arch_update_guest_debug(CPUState *env, struct kvm_guest_debug *dbg)
             dbg->arch.debugreg[n] = hw_breakpoint[n].addr;
             dbg->arch.debugreg[7] |= (2 << (n * 2)) |
                 (type_code[hw_breakpoint[n].type] << (16 + n*4)) |
-                (len_code[hw_breakpoint[n].len] << (18 + n*4));
+                ((uint32_t)len_code[hw_breakpoint[n].len] << (18 + n*4));
 }
 }
 /* Legal xcr0 for loading */
-- 
1.7.2.3



[PATCH 16/31] kvm: Improve reporting of fatal errors

2011-01-24 Thread Marcelo Tosatti
From: Jan Kiszka jan.kis...@siemens.com

Report KVM_EXIT_UNKNOWN, KVM_EXIT_FAIL_ENTRY, and KVM_EXIT_EXCEPTION
with more details to stderr. The latter two are so far x86-only, so move
them into the arch-specific handler. Integrate the Intel real mode
warning on KVM_EXIT_FAIL_ENTRY that qemu-kvm carries, but actually
restrict it to Intel CPUs. Moreover, always dump the CPU state in case
we fail.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
---
 kvm-all.c   |   22 --
 target-i386/cpu.h   |2 ++
 target-i386/cpuid.c |5 ++---
 target-i386/kvm.c   |   33 +
 4 files changed, 45 insertions(+), 17 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index eaf9272..10e1194 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -817,22 +817,22 @@ static int kvm_handle_io(uint16_t port, void *data, int direction, int size,
 #ifdef KVM_CAP_INTERNAL_ERROR_DATA
 static int kvm_handle_internal_error(CPUState *env, struct kvm_run *run)
 {
-
+    fprintf(stderr, "KVM internal error.");
     if (kvm_check_extension(kvm_state, KVM_CAP_INTERNAL_ERROR_DATA)) {
         int i;
 
-        fprintf(stderr, "KVM internal error. Suberror: %d\n",
-                run->internal.suberror);
-
+        fprintf(stderr, " Suberror: %d\n", run->internal.suberror);
         for (i = 0; i < run->internal.ndata; ++i) {
             fprintf(stderr, "extra data[%d]: %"PRIx64"\n",
                     i, (uint64_t)run->internal.data[i]);
         }
+    } else {
+        fprintf(stderr, "\n");
     }
-    cpu_dump_state(env, stderr, fprintf, 0);
     if (run->internal.suberror == KVM_INTERNAL_ERROR_EMULATION) {
         fprintf(stderr, "emulation failure\n");
         if (!kvm_arch_stop_on_emulation_error(env)) {
+            cpu_dump_state(env, stderr, fprintf, 0);
             return 0;
         }
     }
@@ -966,15 +966,8 @@ int kvm_cpu_exec(CPUState *env)
             ret = 1;
             break;
         case KVM_EXIT_UNKNOWN:
-            DPRINTF("kvm_exit_unknown\n");
-            ret = -1;
-            break;
-        case KVM_EXIT_FAIL_ENTRY:
-            DPRINTF("kvm_exit_fail_entry\n");
-            ret = -1;
-            break;
-        case KVM_EXIT_EXCEPTION:
-            DPRINTF("kvm_exit_exception\n");
+            fprintf(stderr, "KVM: unknown exit, hardware reason %" PRIx64 "\n",
+                    (uint64_t)run->hw.hardware_exit_reason);
             ret = -1;
             break;
 #ifdef KVM_CAP_INTERNAL_ERROR_DATA
@@ -1001,6 +994,7 @@ int kvm_cpu_exec(CPUState *env)
     } while (ret > 0);
 
     if (ret < 0) {
+        cpu_dump_state(env, stderr, fprintf, 0);
         vm_stop(0);
         env->exit_request = 1;
     }
diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index dddcd74..a457423 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -874,6 +874,8 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
uint32_t *ecx, uint32_t *edx);
 int cpu_x86_register (CPUX86State *env, const char *cpu_model);
 void cpu_clear_apic_feature(CPUX86State *env);
+void host_cpuid(uint32_t function, uint32_t count,
+uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx);
 
 /* helper.c */
 int cpu_x86_handle_mmu_fault(CPUX86State *env, target_ulong addr,
diff --git a/target-i386/cpuid.c b/target-i386/cpuid.c
index 165045e..5382a28 100644
--- a/target-i386/cpuid.c
+++ b/target-i386/cpuid.c
@@ -103,9 +103,8 @@ typedef struct model_features_t {
 int check_cpuid = 0;
 int enforce_cpuid = 0;
 
-static void host_cpuid(uint32_t function, uint32_t count,
-   uint32_t *eax, uint32_t *ebx,
-   uint32_t *ecx, uint32_t *edx)
+void host_cpuid(uint32_t function, uint32_t count,
+uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx)
 {
 #if defined(CONFIG_KVM)
 uint32_t vec[4];
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 6b4abaa..0ba13fc 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1525,8 +1525,19 @@ static int kvm_handle_halt(CPUState *env)
 return 1;
 }
 
+static bool host_supports_vmx(void)
+{
+    uint32_t ecx, unused;
+
+    host_cpuid(1, 0, &unused, &unused, &ecx, &unused);
+    return ecx & CPUID_EXT_VMX;
+}
+
+#define VMX_INVALID_GUEST_STATE 0x80000021
+
 int kvm_arch_handle_exit(CPUState *env, struct kvm_run *run)
 {
+uint64_t code;
 int ret = 0;
 
 switch (run-exit_reason) {
@@ -1537,6 +1548,28 @@ int kvm_arch_handle_exit(CPUState *env, struct kvm_run *run)
     case KVM_EXIT_SET_TPR:
         ret = 1;
         break;
+    case KVM_EXIT_FAIL_ENTRY:
+        code = run->fail_entry.hardware_entry_failure_reason;
+        fprintf(stderr, "KVM: entry failed, hardware error 0x%" PRIx64 "\n",
+                code);
+        if (host_supports_vmx() && code == VMX_INVALID_GUEST_STATE) {
+            fprintf(stderr,
+                    "\nIf you're running a guest on an Intel machine without "
+

[PATCH 04/31] Add broadcast option for mce command

2011-01-24 Thread Marcelo Tosatti
From: Jin Dongming jin.dongm...@np.css.fujitsu.com

When the following test case is injected with the mce command, the user may
not get the expected result.
DATA
       command cpu bank status             mcg_status addr   misc
(qemu) mce     1   1    0xbd00000000000000 0x05       0x1234 0x8c

Expected Result
       panic type: Fatal Machine check

That is because each mce command can only inject the given cpu and cannot
inject an mce interrupt into other cpus. So the user will get the following
result:
       panic type: Fatal machine check on current CPU

The broadcast option is used for injecting dummy data into other cpus. When
mce is injected with this option, the expected result is obtained.

Usage:
Broadcast[on]
       command broadcast cpu bank status             mcg_status addr   misc
(qemu) mce     -b        1   1    0xbd00000000000000 0x05       0x1234 0x8c

Broadcast[off]
       command cpu bank status             mcg_status addr   misc
(qemu) mce     1   1    0xbd00000000000000 0x05       0x1234 0x8c

Signed-off-by: Jin Dongming jin.dongm...@np.css.fujitsu.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
---
 cpu-all.h |3 ++-
 hmp-commands.hx   |6 +++---
 monitor.c |7 +--
 target-i386/helper.c  |   20 ++--
 target-i386/kvm.c |   16 
 target-i386/kvm_x86.h |5 -
 6 files changed, 44 insertions(+), 13 deletions(-)

diff --git a/cpu-all.h b/cpu-all.h
index 30ae17d..4ce4e83 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -964,6 +964,7 @@ int cpu_memory_rw_debug(CPUState *env, target_ulong addr,
 uint8_t *buf, int len, int is_write);
 
 void cpu_inject_x86_mce(CPUState *cenv, int bank, uint64_t status,
-uint64_t mcg_status, uint64_t addr, uint64_t misc);
+uint64_t mcg_status, uint64_t addr, uint64_t misc,
+int broadcast);
 
 #endif /* CPU_ALL_H */
diff --git a/hmp-commands.hx b/hmp-commands.hx
index 1cea572..d65a41f 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -1116,9 +1116,9 @@ ETEXI
 
     {
         .name       = "mce",
-        .args_type  = "cpu_index:i,bank:i,status:l,mcg_status:l,addr:l,misc:l",
-        .params     = "cpu bank status mcgstatus addr misc",
-        .help       = "inject a MCE on the given CPU",
+        .args_type  = "broadcast:-b,cpu_index:i,bank:i,status:l,mcg_status:l,addr:l,misc:l",
+        .params     = "[-b] cpu bank status mcgstatus addr misc",
+        .help       = "inject a MCE on the given CPU [and broadcast to other CPUs with -b option]",
         .mhandler.cmd = do_inject_mce,
     },
 
diff --git a/monitor.c b/monitor.c
index d291158..396d5cd 100644
--- a/monitor.c
+++ b/monitor.c
@@ -2671,12 +2671,15 @@ static void do_inject_mce(Monitor *mon, const QDict *qdict)
     uint64_t mcg_status = qdict_get_int(qdict, "mcg_status");
     uint64_t addr = qdict_get_int(qdict, "addr");
     uint64_t misc = qdict_get_int(qdict, "misc");
+    int broadcast = qdict_get_try_bool(qdict, "broadcast", 0);
 
-    for (cenv = first_cpu; cenv != NULL; cenv = cenv->next_cpu)
+    for (cenv = first_cpu; cenv != NULL; cenv = cenv->next_cpu) {
         if (cenv->cpu_index == cpu_index && cenv->mcg_cap) {
-            cpu_inject_x86_mce(cenv, bank, status, mcg_status, addr, misc);
+            cpu_inject_x86_mce(cenv, bank, status, mcg_status, addr, misc,
+                               broadcast);
             break;
         }
+    }
 }
 #endif
 
diff --git a/target-i386/helper.c b/target-i386/helper.c
index 2c94130..2cfb4a4 100644
--- a/target-i386/helper.c
+++ b/target-i386/helper.c
@@ -1069,18 +1069,34 @@ static void qemu_inject_x86_mce(CPUState *cenv, int bank, uint64_t status,
 }
 
 void cpu_inject_x86_mce(CPUState *cenv, int bank, uint64_t status,
-                        uint64_t mcg_status, uint64_t addr, uint64_t misc)
+                        uint64_t mcg_status, uint64_t addr, uint64_t misc,
+                        int broadcast)
 {
     unsigned bank_num = cenv->mcg_cap & 0xff;
+    CPUState *env;
+    int flag = 0;
 
     if (bank >= bank_num || !(status & MCI_STATUS_VAL)) {
         return;
     }
 
     if (kvm_enabled()) {
-        kvm_inject_x86_mce(cenv, bank, status, mcg_status, addr, misc, 0);
+        if (broadcast) {
+            flag |= MCE_BROADCAST;
+        }
+
+        kvm_inject_x86_mce(cenv, bank, status, mcg_status, addr, misc, flag);
     } else {
         qemu_inject_x86_mce(cenv, bank, status, mcg_status, addr, misc);
+        if (broadcast) {
+            for (env = first_cpu; env != NULL; env = env->next_cpu) {
+                if (cenv == env) {
+                    continue;
+                }
+
+                qemu_inject_x86_mce(env, 1, 0xa000000000000000, 0, 0, 0);
+            }
+        }
     }
 }
 #endif /* !CONFIG_USER_ONLY */
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 4004de7..8b868ad 100644
--- a/target-i386/kvm.c
+++ 

Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts

2011-01-24 Thread Marcelo Tosatti
On Fri, Jan 21, 2011 at 04:48:02PM -0700, Alex Williamson wrote:
 When doing device assignment, we use cpu_register_physical_memory() to
 directly map the qemu mmap of the device resource into the address
 space of the guest.  The unadvertised feature of the register physical
 memory code path on kvm, at least for this type of mapping, is that it
 needs to allocate an index from a small, fixed array of memory slots.
 Even better, if it can't get an index, the code aborts deep in the
 kvm specific bits, preventing the caller from having a chance to
 recover.
 
 It's really easy to hit this by hot adding too many assigned devices
 to a guest (pretty easy to hit with too many devices at instantiation
 time too, but the abort is slightly more bearable there).
 
 I'm assuming it's pretty difficult to make the memory slot array
 dynamically sized.  If that's not the case, please let me know as
 that would be a much better solution.

It's not difficult to either increase the maximum number (defined as
32 now in both qemu and kernel) of static slots, or support dynamic
increases, if it turns out to be a performance issue.

But you'd probably want to fix the abort for currently supported kernels
anyway.

 I'm not terribly happy with the solution in this series, it doesn't
 provide any guarantees whether a cpu_register_physical_memory() will
 succeed, only slightly better educated guesses.
 
 Are there better ideas how we could solve this?  Thanks,

Why can't cpu_register_physical_memory() return an error so you can
fallback to slow mode or cancel device insertion?
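
A rough sketch of the error-returning shape suggested here (both helper names are hypothetical, not existing QEMU API):

    /* Hypothetical sketch: registration that can fail lets the caller fall
     * back to slow, trapped access or cancel the hotplug cleanly. */
    static int assigned_dev_map_bar(PCIDevice *dev, target_phys_addr_t addr,
                                    ram_addr_t size, ram_addr_t offset)
    {
        int ret = cpu_register_physical_memory_err(addr, size, offset); /* assumed */
        if (ret == -ENOSPC) {
            /* out of KVM memory slots: use MMIO emulation instead of mmap */
            return assigned_dev_map_bar_slow(dev, addr, size);          /* assumed */
        }
        return ret; /* 0 on success, negative errno otherwise */
    }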



Re: [PATCH 11/18] Introduce VCPU self-signaling service

2011-01-24 Thread Marcelo Tosatti
On Mon, Jan 10, 2011 at 09:32:04AM +0100, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com
 
 Introduce qemu_cpu_kick_self to send SIG_IPI to the calling VCPU
 context. First user will be kvm.
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com

For the updated patch, can't see where thread_kicked is cleared.



Re: [PATCH 07/18] kvm: Add MCE signal support for !CONFIG_IOTHREAD

2011-01-24 Thread Marcelo Tosatti
On Mon, Jan 10, 2011 at 09:32:00AM +0100, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com
 
 Currently, we only configure and process MCE-related SIGBUS events if
 CONFIG_IOTHREAD is enabled. Fix this by factoring out the required
 handler registration and system configuration. Make sure that events
 happening over a VCPU context in non-threaded mode get dispatched as
 VCPU MCEs.
 
 We also need to call qemu_kvm_eat_signals in non-threaded mode now, so
 move it (unmodified) and add the required Windows stub.
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 CC: Huang Ying ying.hu...@intel.com
 ---
  cpus.c |  200 +++
  1 files changed, 124 insertions(+), 76 deletions(-)
 
 diff --git a/cpus.c b/cpus.c
 index 6da0f8f..b6f1cfb 100644
 --- a/cpus.c
 +++ b/cpus.c
 @@ -34,9 +34,6 @@
  
  #include "cpus.h"
  #include "compatfd.h"
 -#ifdef CONFIG_LINUX
 -#include <sys/prctl.h>
 -#endif
  
  #ifdef SIGRTMIN
  #define SIG_IPI (SIGRTMIN+4)
 @@ -44,10 +41,24 @@
  #define SIG_IPI SIGUSR1
  #endif
  

 @@ -912,6 +954,8 @@ static int qemu_cpu_exec(CPUState *env)
  
  bool cpu_exec_all(void)
  {
 +int r;
 +
  if (next_cpu == NULL)
  next_cpu = first_cpu;
  for (; next_cpu != NULL && !exit_request; next_cpu = next_cpu->next_cpu) {
 @@ -923,7 +967,11 @@ bool cpu_exec_all(void)
  if (qemu_alarm_pending())
  break;
  if (cpu_can_run(env)) {
 -if (qemu_cpu_exec(env) == EXCP_DEBUG) {
 +r = qemu_cpu_exec(env);
 +if (kvm_enabled()) {
 +qemu_kvm_eat_signals(env);
 +}
 +if (r == EXCP_DEBUG) {
  break;
  }

SIGBUS should be processed outside of vcpu execution context, think of a
non MCE SIGBUS while vm is stopped. Could use signalfd for that.

But the SIGBUS handler for !IOTHREAD case should not ignore Action
Required, since it might have been generated in vcpu context.
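
A minimal sketch of what draining SIGBUS through signalfd could look like (standard Linux signalfd(2); error handling omitted):

    #include <sys/signalfd.h>
    #include <signal.h>
    #include <unistd.h>

    static int sigbus_fd;

    static void sigbus_fd_init(void)
    {
        sigset_t set;

        sigemptyset(&set);
        sigaddset(&set, SIGBUS);
        sigprocmask(SIG_BLOCK, &set, NULL);   /* deliver via fd, not handler */
        sigbus_fd = signalfd(-1, &set, 0);    /* poll this fd in the iothread */
    }

    static void sigbus_fd_read(void)
    {
        struct signalfd_siginfo info;

        if (read(sigbus_fd, &info, sizeof(info)) == sizeof(info)) {
            /* dispatch on info.ssi_code/info.ssi_addr, e.g. BUS_MCEERR_AO
             * for an Action Optional memory error outside vcpu context */
        }
    }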



Re: [PATCH 11/18] Introduce VCPU self-signaling service

2011-01-24 Thread Jan Kiszka
On 2011-01-24 12:47, Marcelo Tosatti wrote:
 On Mon, Jan 10, 2011 at 09:32:04AM +0100, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com

 Introduce qemu_cpu_kick_self to send SIG_IPI to the calling VCPU
 context. First user will be kvm.

 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 
 For the updated patch, can't see where thread_kicked is cleared.
 

Prevent abortion on multiple VCPU kicks, 6 patches earlier (assuming
you are hopefully looking at the patch queue in my git, not some older
postings).

Jan





Re: [PATCH 07/18] kvm: Add MCE signal support for !CONFIG_IOTHREAD

2011-01-24 Thread Jan Kiszka
On 2011-01-24 12:17, Marcelo Tosatti wrote:
 On Mon, Jan 10, 2011 at 09:32:00AM +0100, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com

 Currently, we only configure and process MCE-related SIGBUS events if
 CONFIG_IOTHREAD is enabled. Fix this by factoring out the required
 handler registration and system configuration. Make sure that events
 happening over a VCPU context in non-threaded mode get dispatched as
 VCPU MCEs.

 We also need to call qemu_kvm_eat_signals in non-threaded mode now, so
 move it (unmodified) and add the required Windows stub.

 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 CC: Huang Ying ying.hu...@intel.com
 ---
  cpus.c |  200 
 +++
  1 files changed, 124 insertions(+), 76 deletions(-)

 diff --git a/cpus.c b/cpus.c
 index 6da0f8f..b6f1cfb 100644
 --- a/cpus.c
 +++ b/cpus.c
 @@ -34,9 +34,6 @@
  
   #include "cpus.h"
   #include "compatfd.h"
  -#ifdef CONFIG_LINUX
  -#include <sys/prctl.h>
 -#endif
  
  #ifdef SIGRTMIN
  #define SIG_IPI (SIGRTMIN+4)
 @@ -44,10 +41,24 @@
  #define SIG_IPI SIGUSR1
  #endif
  
 
 @@ -912,6 +954,8 @@ static int qemu_cpu_exec(CPUState *env)
  
  bool cpu_exec_all(void)
  {
 +int r;
 +
  if (next_cpu == NULL)
  next_cpu = first_cpu;
   for (; next_cpu != NULL && !exit_request; next_cpu = next_cpu->next_cpu) {
 @@ -923,7 +967,11 @@ bool cpu_exec_all(void)
  if (qemu_alarm_pending())
  break;
  if (cpu_can_run(env)) {
 -if (qemu_cpu_exec(env) == EXCP_DEBUG) {
 +r = qemu_cpu_exec(env);
 +if (kvm_enabled()) {
 +qemu_kvm_eat_signals(env);
 +}
 +if (r == EXCP_DEBUG) {
  break;
  }
 
 SIGBUS should be processed outside of vcpu execution context, think of a
 non MCE SIGBUS while vm is stopped. Could use signalfd for that.

signalfd - that's the missing bit. I was thinking of how to handle
SIGBUS events raised outside the vcpu context. We need to handle them
synchronously, and signalfd should allow this.

 
 But the SIGBUS handler for !IOTHREAD case should not ignore Action
 Required, since it might have been generated in vcpu context.
 

Yes, the sigbus handler will require some rework when we actually start
using it for !IOTHREAD.

Will have a look, thanks,
Jan





KVM call agenda for Jan 25

2011-01-24 Thread Chris Wright
Please send in any agenda items you are interested in covering.

thanks,
-chris


Agenda for Jan 25

2011-01-24 Thread Juan Quintela

Please send in any agenda items you are interested in covering.

thanks, Juan.


Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-24 Thread Jan Kiszka
On 2011-01-21 19:49, Blue Swirl wrote:
 I'd add fourth possible class:
  - device, CPU and machine configuration, like nographic,
 win2k_install_hack, no_hpet, smp_cpus etc. Maybe also
 irqchip_in_kernel could fit here, though it obviously depends on a
 host capability too.

 I would count everything that cannot be assigned to a concrete device
 upfront to the dynamic state of a machine, thus class 2. The point is,
 (potentially) every device of that machine requires access to it, just
 like (indirectly, via the KVM core services) to some KVM VM state bits.
 
 The machine class should not be a catch-all, it would be like
 QEMUState or KVMState then. Perhaps each field or variable should be
 listed and given more thought.

Let's start with what is most urgent:

 - vmfd: file descriptor required for any KVM request that has VM scope
   (in-kernel device creation, device state synchronizations, IRQ
   routing etc.)
 - irqchip_in_kernel: VM uses in-kernel irqchip acceleration
   (some devices will have to adjust their behavior depending on this)

pit_in_kernel would be analogous to irqchip, but it's also conceptually
x86-only (the irqchip is only used by x86, but not tied to it) and it's not
mandatory for a first round of KVM devices for upstream.
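
Roughly, the two VM-scope bits could be grouped like this (an illustrative struct, not an actual QEMU type):

    /* Illustrative only: VM-scope KVM state that kvm-aware devices consume. */
    typedef struct KVMVMState {
        int vmfd;                 /* fd for VM-scope ioctls: in-kernel device
                                   * creation, state sync, IRQ routing, ... */
        bool irqchip_in_kernel;   /* devices adjust behavior when the irqchip
                                   * is accelerated in the kernel */
    } KVMVMState;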

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts

2011-01-24 Thread Alex Williamson
On Mon, 2011-01-24 at 15:16 +0100, Jan Kiszka wrote:
 On 2011-01-24 10:32, Marcelo Tosatti wrote:
  On Fri, Jan 21, 2011 at 04:48:02PM -0700, Alex Williamson wrote:
  When doing device assignment, we use cpu_register_physical_memory() to
  directly map the qemu mmap of the device resource into the address
  space of the guest.  The unadvertised feature of the register physical
  memory code path on kvm, at least for this type of mapping, is that it
  needs to allocate an index from a small, fixed array of memory slots.
  Even better, if it can't get an index, the code aborts deep in the
  kvm specific bits, preventing the caller from having a chance to
  recover.
 
  It's really easy to hit this by hot adding too many assigned devices
  to a guest (pretty easy to hit with too many devices at instantiation
  time too, but the abort is slightly more bearable there).
 
  I'm assuming it's pretty difficult to make the memory slot array
  dynamically sized.  If that's not the case, please let me know as
  that would be a much better solution.
  
  It's not difficult to either increase the maximum number (defined as
  32 now in both qemu and kernel) of static slots, or support dynamic
  increases, if it turns out to be a performance issue.
 
 Static limits are waiting to be hit again, just a bit later.

Yep, and I think static limits large enough that we can't hit them would
be performance prohibitive.

 I would start thinking about a tree search as well because iterating
 over all slots won't get faster over time.
 
  
  But you'd probably want to fix the abort for currently supported kernels
  anyway.
 
 Jep. Depending on how soon we have smarter solution in the kernel, this
 fix may vary in pragmatism.
 
 
  I'm not terribly happy with the solution in this series, it doesn't
  provide any guarantees whether a cpu_register_physical_memory() will
  succeed, only slightly better educated guesses.
 
  Are there better ideas how we could solve this?  Thanks,
  
  Why can't cpu_register_physical_memory() return an error so you can
  fallback to slow mode or cancel device insertion?

It appears that it'd be pretty intrusive to fix this since
cpu_register_physical_memory() returns void, and the kvm hook into
this is via a set_memory callback for the phys memory client.

 Doesn't that registration happens much later than the call to
 pci_register_bar? In any case, this will require significantly more
 invasive work (but it would be much nicer if possible, no question).

Right, we register BARs in the initfn, but we don't map them until the
guest writes the BARs, mapping them into MMIO space.  I don't think we
want to fall back to slow mapping at that point, so we either need to
fail in the initfn (like this series) or be able to dynamically allocate
more slots so the kvm callback can't fail.  I'll look at how we might be
able to allocate slots on demand.  Thanks,

Alex



Re: [PATCH 2/3] kvm hypervisor : Add hypercalls to support pv-ticketlock

2011-01-24 Thread Jeremy Fitzhardinge
On 01/22/2011 06:53 AM, Rik van Riel wrote:
 The main question that remains is whether the PV ticketlocks are
 a large enough improvement to also merge those.  I expect they
 will be, and we'll see so in the benchmark numbers.

The pathological worst-case of ticket locks in a virtual environment
isn't very hard to hit, so I expect they'll make a huge difference
there.  On lightly loaded systems (little or no CPU overcommit)
they should be close to a no-op.
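
For reference, a minimal ticket-lock sketch (not the kernel's implementation); the strict FIFO handoff is what makes a preempted waiter so expensive under overcommit:

    /* Sketch of a ticket lock: each CPU takes a ticket and spins until the
     * now-serving counter reaches it. If the vcpu holding the next ticket
     * is preempted, every later waiter spins uselessly behind it. */
    struct ticket_lock {
        unsigned short next;    /* next ticket to hand out */
        unsigned short owner;   /* ticket currently being served */
    };

    static void ticket_lock(struct ticket_lock *lock)
    {
        unsigned short me = __atomic_fetch_add(&lock->next, 1, __ATOMIC_RELAXED);

        while (__atomic_load_n(&lock->owner, __ATOMIC_ACQUIRE) != me)
            ;   /* spin; a pv version would hypercall/yield here instead */
    }

    static void ticket_unlock(struct ticket_lock *lock)
    {
        __atomic_fetch_add(&lock->owner, 1, __ATOMIC_RELEASE);  /* strict FIFO */
    }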

J


Re: [RFC -v6 PATCH 2/8] sched: limit the scope of clear_buddies

2011-01-24 Thread Peter Zijlstra
On Thu, 2011-01-20 at 16:33 -0500, Rik van Riel wrote:
 The clear_buddies function does not seem to play well with the concept
 of hierarchical runqueues.  In the following tree, task groups are
 represented by 'G', tasks by 'T', next by 'n' and last by 'l'.
 
           (nl)
          /    \
      G(nl)     G
      /   \      \
    T(l)  T(n)    T
 
 This situation can arise when a task T(n) is woken up, and the previously
 running task T(l) is marked last.
 
 When clear_buddies is called from either T(l) or T(n), the next and last
 buddies of the group G(nl) will be cleared.  This is not the desired
 result, since we would like to be able to find the other type of buddy
 in many cases.
 
 This is especially a worry when implementing yield_task_fair through the
 buddy system.
 
 The fix is simple: only clear the buddy type that the task itself
 is indicated to be.  As an added bonus, we stop walking up the tree
 when the buddy has already been cleared or pointed elsewhere.
 
 Signed-off-by: Rik van Riel r...@redhat.com
 ---
  kernel/sched_fair.c |   30 +++---
  1 files changed, 23 insertions(+), 7 deletions(-)
 
 diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
 index f4ee445..0321473 100644
 --- a/kernel/sched_fair.c
 +++ b/kernel/sched_fair.c
 @@ -784,19 +784,35 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
   __enqueue_entity(cfs_rq, se);
  }
  
 -static void __clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
 +static void __clear_buddies_last(struct sched_entity *se)
  {
 -	if (!se || cfs_rq->last == se)
 -		cfs_rq->last = NULL;
 +	for_each_sched_entity(se) {
 +		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 +		if (cfs_rq->last == se)
 +			cfs_rq->last = NULL;
 +		else
 +			break;
 +	}
 +}
  
 -	if (!se || cfs_rq->next == se)
 -		cfs_rq->next = NULL;
 +static void __clear_buddies_next(struct sched_entity *se)
 +{
 +	for_each_sched_entity(se) {
 +		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 +		if (cfs_rq->next == se)
 +			cfs_rq->next = NULL;
 +		else
 +			break;
 +	}
  }
  
  static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
  {
 -	for_each_sched_entity(se)
 -		__clear_buddies(cfs_rq_of(se), se);
 +	if (cfs_rq->last == se)
 +		__clear_buddies_last(se);
 +
 +	if (cfs_rq->next == se)
 +		__clear_buddies_next(se);
  }
  

Right, I think this sorta matches with something the Google guys talked
about, they wanted to change pick_next_task() to not always start from the
top but only go up one level when the current level ran out.

It looks ok, just sad that we can now have two hierarchy traversals (and
3 with the next patch).




Re: [RFC -v6 PATCH 3/8] sched: use a buddy to implement yield_task_fair

2011-01-24 Thread Peter Zijlstra
On Thu, 2011-01-20 at 16:33 -0500, Rik van Riel wrote:
 Use the buddy mechanism to implement yield_task_fair.  This
 allows us to skip onto the next highest priority se at every
 level in the CFS tree, unless doing so would introduce gross
 unfairness in CPU time distribution.
 
 We order the buddy selection in pick_next_entity to check
 yield first, then last, then next.  We need next to be able
 to override yield, because it is possible for the next and
 yield task to be different processes in the same sub-tree
 of the CFS tree.  When they are, we need to go into that
 sub-tree regardless of the yield hint, and pick the correct
 entity once we get to the right level.
 
 Signed-off-by: Rik van Riel r...@redhat.com
 
 diff --git a/kernel/sched.c b/kernel/sched.c
 index dc91a4d..e4e57ff 100644
 --- a/kernel/sched.c
 +++ b/kernel/sched.c
 @@ -327,7 +327,7 @@ struct cfs_rq {
* 'curr' points to currently running entity on this cfs_rq.
* It is set to NULL otherwise (i.e when none are currently running).
*/
 - struct sched_entity *curr, *next, *last;
 + struct sched_entity *curr, *next, *last, *yield;

I'd prefer it be called: skip or somesuch..

   unsigned int nr_spread_over;
  
 diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
 index ad946fd..f701a51 100644
 --- a/kernel/sched_fair.c
 +++ b/kernel/sched_fair.c
 @@ -384,6 +384,22 @@ static struct sched_entity *__pick_next_entity(struct cfs_rq *cfs_rq)
   return rb_entry(left, struct sched_entity, run_node);
  }
  
 +static struct sched_entity *__pick_second_entity(struct cfs_rq *cfs_rq)
 +{
 +	struct rb_node *left = cfs_rq->rb_leftmost;
 + struct rb_node *second;
 +
 + if (!left)
 + return NULL;
 +
 + second = rb_next(left);
 +
 + if (!second)
 + second = left;
 +
 + return rb_entry(second, struct sched_entity, run_node);
 +}

So this works because you only ever skip the leftmost; should we perhaps
write this as something like the below?

static struct sched_entity *__pick_next_entity(struct sched_entity *se)
{
	struct rb_node *next = rb_next(&se->run_node);
	if (!next)
		return NULL;
	return rb_entry(next, struct sched_entity, run_node);
}

  static struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
  {
 	struct rb_node *last = rb_last(&cfs_rq->tasks_timeline);
 @@ -806,6 +822,17 @@ static void __clear_buddies_next(struct sched_entity *se)
   }
  }
  
 +static void __clear_buddies_yield(struct sched_entity *se)
 +{
 + for_each_sched_entity(se) {
 + struct cfs_rq *cfs_rq = cfs_rq_of(se);
 +	if (cfs_rq->yield == se)
 +		cfs_rq->yield = NULL;
 + else
 + break;
 + }
 +}
 +
  static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
  {
 	if (cfs_rq->last == se)
 @@ -813,6 +840,9 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
  
 	if (cfs_rq->next == se)
 		__clear_buddies_next(se);
 +
 +	if (cfs_rq->yield == se)
 +		__clear_buddies_yield(se);
  }

The 3rd hierarchy iteration.. :/

  static void
 @@ -926,13 +956,27 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
  static int
  wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
  
 +/*
 + * Pick the next process, keeping these things in mind, in this order:
 + * 1) keep things fair between processes/task groups
 + * 2) pick the next process, since someone really wants that to run
 + * 3) pick the last process, for cache locality
 + * 4) do not run the yield process, if something else is available
 + */
  static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
  {
   struct sched_entity *se = __pick_next_entity(cfs_rq);
   struct sched_entity *left = se;
  
 -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
 -		se = cfs_rq->next;
 +	/*
 +	 * Avoid running the yield buddy, if running something else can
 +	 * be done without getting too unfair.
 +	 */
 +	if (cfs_rq->yield == se) {
 +		struct sched_entity *second = __pick_second_entity(cfs_rq);
 +		if (wakeup_preempt_entity(second, left) < 1)
 +			se = second;
 +	}
  
   /*
* Prefer last buddy, try to return the CPU to a preempted task.
 @@ -940,6 +984,12 @@ static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
 	if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
 		se = cfs_rq->last;
  
 + /*
 +  * Someone really wants this to run. If it's not unfair, run it.
 +  */
 +	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
 +		se = cfs_rq->next;
 +
   clear_buddies(cfs_rq, se);
  
   return se;

This seems to assume ->yield cannot be ->next nor ->last, but I'm not
quite sure that will actually be true.

 @@ -1096,52 +1146,6 @@ 

Re: [RFC -v6 PATCH 2/8] sched: limit the scope of clear_buddies

2011-01-24 Thread Rik van Riel

On 01/24/2011 12:57 PM, Peter Zijlstra wrote:

On Thu, 2011-01-20 at 16:33 -0500, Rik van Riel wrote:

The clear_buddies function does not seem to play well with the concept
of hierarchical runqueues.  In the following tree, task groups are
represented by 'G', tasks by 'T', next by 'n' and last by 'l'.

           (nl)
          /    \
      G(nl)     G
      /   \      \
    T(l)  T(n)    T

This situation can arise when a task T(n) is woken up, and the previously
running task T(l) is marked last.

When clear_buddies is called from either T(l) or T(n), the next and last
buddies of the group G(nl) will be cleared.  This is not the desired
result, since we would like to be able to find the other type of buddy
in many cases.

This is especially a worry when implementing yield_task_fair through the
buddy system.

The fix is simple: only clear the buddy type that the task itself
is indicated to be.  As an added bonus, we stop walking up the tree
when the buddy has already been cleared or pointed elsewhere.

Signed-off-by: Rik van Riel r...@redhat.com
---
  kernel/sched_fair.c |   30 +++---
  1 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index f4ee445..0321473 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -784,19 +784,35 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
__enqueue_entity(cfs_rq, se);
  }

-static void __clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static void __clear_buddies_last(struct sched_entity *se)
  {
-   if (!se || cfs_rq->last == se)
-   cfs_rq->last = NULL;
+   for_each_sched_entity(se) {
+   struct cfs_rq *cfs_rq = cfs_rq_of(se);
+   if (cfs_rq->last == se)
+   cfs_rq->last = NULL;
+   else
+   break;
+   }
+}

-   if (!se || cfs_rq->next == se)
-   cfs_rq->next = NULL;
+static void __clear_buddies_next(struct sched_entity *se)
+{
+   for_each_sched_entity(se) {
+   struct cfs_rq *cfs_rq = cfs_rq_of(se);
+   if (cfs_rq->next == se)
+   cfs_rq->next = NULL;
+   else
+   break;
+   }
  }

  static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
  {
-   for_each_sched_entity(se)
-   __clear_buddies(cfs_rq_of(se), se);
+   if (cfs_rq->last == se)
+   __clear_buddies_last(se);
+
+   if (cfs_rq->next == se)
+   __clear_buddies_next(se);
  }



Right, I think this sorta matches with something the Google guys talked
about, they wanted to change pick_next_task() to not always start from the
top but only go up one level when the current level ran out.

It looks ok, just sad that we can now have two hierarchy traversals (and
3 with the next patch).


On the other hand, I don't think we'll actually _do_ the
hierarchy traversal most of the time, since pick_next_entity
calls clear_buddies every step of the way down the tree.

A hierarchy traversal will only be done if a task already
has one type of buddy set, and then gets another type of
buddy set, before it is rescheduled.

Eg. a task can have ->last set and then call yield, causing
the ->yield buddy to get pointed at itself.  When doing that,
it will walk up the tree, clearing ->last.

I suspect that with this patch, we'll end up doing less
tree traversal than before.


[PATCH 08/16] KVM-HDR: Implement kvmclock systemtime over KVM - KVM Virtual Memory

2011-01-24 Thread Glauber Costa
As a proof of concept to KVM - Kernel Virtual Memory, this patch
implements kvmclock per-vcpu systime grabbing on top of it. At first, it
may seem a waste of work to just redo it, since it is working well. But over
time, other MSRs were added - think ASYNC_PF - and more will probably come.
After this patch, we won't need to ever add another virtual MSR to KVM.

If the hypervisor fails to register the memory area, we switch back to legacy
behavior on things that were already present - like kvm clock.

This patch contains the headers for it. I am keeping it separate to facilitate
backports to people who want to backport the kernel part but not the
hypervisor, or the other way around.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Avi Kivity a...@redhat.com
---
 arch/x86/include/asm/kvm_para.h |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index de088c8..4df93cf 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -47,6 +47,7 @@ struct kvm_memory_area {
 };
 
 #define KVM_AREA_WALLCLOCK 1
+#define KVM_AREA_SYSTIME   2
 
 #define KVM_MAX_MMU_OP_BATCH   32
 
-- 
1.7.2.3
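
For orientation, a sketch of the guest-side handshake these headers enable (the kvm_memory_area field order and the wrmsrl call are assumptions based on the rest of the series, not verbatim code from it):

    /* Hedged sketch: describe a guest page to the hypervisor via the new
     * MSR; kvm_memory_area fields (base/size/type/result) follow the
     * hypervisor patch in this series. */
    static int register_systime_area(struct pvclock_vcpu_time_info *area)
    {
        struct kvm_memory_area desc = {
            .base   = __pa(area),         /* guest-physical address */
            .size   = sizeof(*area),
            .type   = KVM_AREA_SYSTIME,   /* the type added by this patch */
            .result = 0xF,                /* hypervisor sets 0 on success */
        };

        wrmsrl(MSR_KVM_REGISTER_MEM_AREA, __pa(&desc));

        return desc.result == 0 ? 0 : -1; /* caller falls back to legacy MSRs */
    }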



[PATCH 06/16] KVM-HDR: Implement wallclock over KVM - KVM Virtual Memory

2011-01-24 Thread Glauber Costa
As a proof of concept to KVM - Kernel Virtual Memory, this patch
implements wallclock grabbing on top of it. At first, it may seem
a waste of work to just redo it, since it is working well. But over
time, other MSRs were added - think ASYNC_PF - and more will probably come.
After this patch, we won't need to ever add another virtual MSR to KVM.

If the hypervisor fails to register the memory area, we switch back to legacy
behavior on things that were already present - like kvm clock.

This patch contains the hypervisor implementation for it. I am keeping it
separate from the headers to facilitate backports to people who want to backport
the kernel part but not the hypervisor, or the other way around.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Avi Kivity a...@redhat.com
---
 arch/x86/kvm/x86.c |   10 ++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4ee9c87..a232a36 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1540,16 +1540,26 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 		break;
 	case MSR_KVM_REGISTER_MEM_AREA: {
 		struct kvm_memory_area area_desc;
+		u64 wall_data;
 
 		kvm_read_guest(vcpu->kvm, data, &area_desc, sizeof(area_desc));
 		area_desc.result = 0xF;
 
+		if (area_desc.type == KVM_AREA_WALLCLOCK) {
+			kvm_read_guest(vcpu->kvm, area_desc.base,
+				       &wall_data, area_desc.size);
+			vcpu->kvm->arch.wall_clock = wall_data;
+			kvm_write_wall_clock(vcpu->kvm, wall_data);
+			goto rma_out;
+		}
+
 		if (vcpu->kvm->register_mem_area_uspace) {
 			vcpu->run->exit_reason = KVM_EXIT_X86_MSR_OP;
 			vcpu->run->msr.msr_data = data;
 			return 1;
 		}
 rma_out:
+		area_desc.result = 0;
 		kvm_write_guest(vcpu->kvm, data, &area_desc, sizeof(area_desc));
 		break;
 	}
-- 
1.7.2.3



[PATCH 10/16] KVM-GST: Implement kvmclock systemtime over KVM - KVM Virtual Memory

2011-01-24 Thread Glauber Costa
As a proof of concept to KVM - Kernel Virtual Memory, this patch
implements kvmclock per-vcpu systime grabbing on top of it. At first, it
may seem a waste of work to just redo it, since it is working well. But over
time, other MSRs were added - think ASYNC_PF - and more will probably come.
After this patch, we won't need to ever add another virtual MSR to KVM.

If the hypervisor fails to register the memory area, we switch back to legacy
behavior on things that were already present - like kvm clock.

This patch contains the guest part for it. I am keeping it separate to
facilitate backports to people who want to backport the kernel part but
not the hypervisor, or the other way around.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Avi Kivity a...@redhat.com
---
 arch/x86/kernel/kvmclock.c |   31 ++-
 1 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index b8809f0..c304fdb 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -157,12 +157,28 @@ int kvm_register_clock(char *txt)
 {
int cpu = smp_processor_id();
int low, high, ret;
-
-	low = (int)__pa(per_cpu(hv_clock, cpu)) | 1;
-	high = ((u64)__pa(per_cpu(hv_clock, cpu)) >> 32);
-	ret = native_write_msr_safe(msr_kvm_system_time, low, high);
-	printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
-	       cpu, high, low, txt);
+	struct pvclock_vcpu_time_info *vcpu_time;
+	static int warned;
+
+	vcpu_time = &per_cpu(hv_clock, cpu);
+
+	ret = kvm_register_mem_area(__pa(vcpu_time), KVM_AREA_SYSTIME,
+				    sizeof(*vcpu_time));
+	if (ret == 0) {
+		printk(KERN_INFO "kvm-clock: cpu %d, mem_area %lx %s\n",
+		       cpu, __pa(vcpu_time), txt);
+	} else {
+		low = (int)__pa(vcpu_time) | 1;
+		high = ((u64)__pa(vcpu_time) >> 32);
+		ret = native_write_msr_safe(msr_kvm_system_time, low, high);
+
+		if (!warned++)
+			printk(KERN_INFO "kvm-clock: Using msrs %x and %x",
+			       msr_kvm_system_time, msr_kvm_wall_clock);
+
+		printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
+		       cpu, high, low, txt);
+	}
 
 	return ret;
 }
@@ -216,9 +232,6 @@ void __init kvmclock_init(void)
 	} else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
 		return;
 
-	printk(KERN_INFO "kvm-clock: Using msrs %x and %x",
-	       msr_kvm_system_time, msr_kvm_wall_clock);
-
 	if (kvm_register_clock("boot clock"))
 		return;
 	pv_time_ops.sched_clock = kvm_clock_read;
-- 
1.7.2.3



[PATCH 14/16] KVM-GST: KVM Steal time registration

2011-01-24 Thread Glauber Costa
Register steal time within KVM. Every time we sample the steal time
information, we update a local variable that records the last value
read. We then account the difference.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
---
 arch/x86/kernel/kvmclock.c |   32 
 1 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index c304fdb..c0b0522 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -22,6 +22,7 @@
 #include <asm/msr.h>
 #include <asm/apic.h>
 #include <linux/percpu.h>
+#include <linux/sched.h>
 
 #include <asm/x86_init.h>
 #include <asm/reboot.h>
 early_param("no-kvmclock", parse_no_kvmclock);
 
 /* The hypervisor will put information about time periodically here */
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct pvclock_vcpu_time_info, hv_clock);
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct kvm_steal_time, steal_time);
 static struct pvclock_wall_clock wall_clock;
+static DEFINE_PER_CPU(u64, steal_info);
 
 static int kvm_register_mem_area(u64 base, int type, int size)
 {
@@ -153,14 +156,43 @@ static struct clocksource kvm_clock = {
.flags = CLOCK_SOURCE_IS_CONTINUOUS,
 };
 
+static cputime_t kvm_account_steal_time(void)
+{
+   u64 delta = 0;
+   u64 *last_steal_info, this_steal_info;
+   struct kvm_steal_time *src;
+   cputime_t cpu;
+
+	src = &get_cpu_var(steal_time);
+	this_steal_info = src->steal;
+   put_cpu_var(steal_time);
+
+	last_steal_info = &get_cpu_var(steal_info);
+
+   delta = this_steal_info - *last_steal_info;
+
+   *last_steal_info = this_steal_info;
+   put_cpu_var(steal_info);
+
+   cpu = usecs_to_cputime(delta / 1000);
+
+   return cpu;
+}
+
 int kvm_register_clock(char *txt)
 {
int cpu = smp_processor_id();
int low, high, ret;
struct pvclock_vcpu_time_info *vcpu_time;
+   struct kvm_steal_time *stime;
static int warned;
 
 	vcpu_time = &per_cpu(hv_clock, cpu);
+	stime = &per_cpu(steal_time, cpu);
+
+	ret = kvm_register_mem_area(__pa(stime), KVM_AREA_STEAL_TIME,
+				    sizeof(*stime));
+   if (ret == 0)
+   hypervisor_steal_time = kvm_account_steal_time;
 
ret = kvm_register_mem_area(__pa(vcpu_time), KVM_AREA_SYSTIME,
sizeof(*vcpu_time));
-- 
1.7.2.3



[PATCH 12/16] KVM-HV: KVM Steal time implementation

2011-01-24 Thread Glauber Costa
To implement steal time, we need the hypervisor to pass the guest information
about how much time was spent running other processes outside the VM.
This is per-vcpu, and using the kvmclock structure for that is an abuse
we decided not to make.

This patch contains the hypervisor part for it. I am keeping it separate from
the headers to facilitate backports to people who want to backport the kernel
part but not the hypervisor, or the other way around.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
---
 arch/x86/kvm/x86.c |5 +
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 81b2f34..f129ea1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1564,6 +1564,11 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
goto rma_out;
}
 
+	if (area_desc.type == KVM_AREA_STEAL_TIME) {
+		vcpu->arch.stime = area_desc.base;
+		goto rma_out;
+	}
+
 	if (vcpu->kvm->register_mem_area_uspace) {
 		vcpu->run->exit_reason = KVM_EXIT_X86_MSR_OP;
 		vcpu->run->msr.msr_data = data;
-- 
1.7.2.3



[PATCH 13/16] KVM-HV: KVM Steal time calculation

2011-01-24 Thread Glauber Costa
To implement steal time, we need the hypervisor to pass the guest information
about how much time was spent running other processes outside the VM.
We consider time to be potentially stolen every time we schedule out the vcpu,
until we schedule it in again. Whether or not this is accounted
as steal time for the guest is the guest's decision.
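
The idea, modelled outside of KVM for clarity (a toy sketch, not the patch
code):

#include <stdint.h>

/* everything between sched-out and the next sched-in of the vcpu is
 * candidate steal time; the total is what gets published to the guest */
struct vcpu_steal {
	uint64_t total;
	uint64_t out_stamp;	/* when the vcpu was last scheduled out */
};

static void vcpu_sched_out(struct vcpu_steal *s, uint64_t now_ns)
{
	s->out_stamp = now_ns;
}

static void vcpu_sched_in(struct vcpu_steal *s, uint64_t now_ns)
{
	if (s->out_stamp) {
		s->total += now_ns - s->out_stamp;
		s->out_stamp = 0;
	}
}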

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
---
 arch/x86/include/asm/kvm_host.h |3 +++
 arch/x86/kvm/x86.c  |   13 +
 2 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ffd7f8d..038679b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -377,6 +377,9 @@ struct kvm_vcpu_arch {
unsigned int hw_tsc_khz;
unsigned int time_offset;
struct page *time_page;
+   gpa_t stime;
+   u64 time_out;
+   u64 this_time_out;
u64 last_host_tsc;
u64 last_guest_tsc;
u64 last_kernel_ns;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f129ea1..33af936 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2123,6 +2123,9 @@ static bool need_emulate_wbinvd(struct kvm_vcpu *vcpu)
 
 void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
+	struct kvm_steal_time *st;
+	st = (struct kvm_steal_time *)vcpu->arch.stime;
+
 	/* Address WBINVD may be executed by guest */
 	if (need_emulate_wbinvd(vcpu)) {
 		if (kvm_x86_ops->has_wbinvd_exit())
@@ -2148,6 +2151,14 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		kvm_migrate_timers(vcpu);
 		vcpu->cpu = cpu;
 	}
+	if (vcpu->arch.this_time_out) {
+		vcpu->arch.time_out +=
+			(get_kernel_ns() - vcpu->arch.this_time_out);
+		kvm_write_guest(vcpu->kvm, (gpa_t)&st->steal,
+				&vcpu->arch.time_out, sizeof(st->steal));
+		/* is it possible to have 2 loads in sequence? */
+		vcpu->arch.this_time_out = 0;
+	}
+   }
 }
 
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
@@ -5333,6 +5344,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
kvm_lapic_sync_from_vapic(vcpu);
 
+	vcpu->arch.this_time_out = get_kernel_ns();
+
 	r = kvm_x86_ops->handle_exit(vcpu);
 out:
return r;
-- 
1.7.2.3



[PATCH 04/16] KVM-HV: KVM Userspace registering ioctl

2011-01-24 Thread Glauber Costa
KVM, which stands for KVM Virtual Memory (I wanted to call it KVM Virtual Mojito),
is a piece of shared memory that is visible to both the hypervisor and the guest
kernel - but not the guest userspace.

The basic idea is that the guest can tell the hypervisor about a specific
piece of memory, and what it expects to find in there. This is a generic
abstraction, that goes to userspace (qemu) if KVM (the hypervisor) can't
handle a specific request, thus giving us flexibility in some features
in the future.

KVM (The hypervisor) can change the contents of this piece of memory at
will. This works well with paravirtual information, and hopefully
normal guest memory - like last update time for the watchdog, for
instance.

This patch contains the basic implementation of the userspace communication.
Userspace can query the presence/absence of this feature in the normal way.
It also tells the hypervisor that it is capable of handling - in whatever
way it chooses - registrations that the hypervisor does not know how to.
On x86, the only user so far, this mechanism is implemented as a generic
userspace msr exit, which could theoretically be used to implement
msr handling in userspace.
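
On the userspace side, handling such an exit would look roughly like this
(a sketch only; handle_mem_area() is a hypothetical helper, not part of this
series):

/* inside the vcpu thread, after ioctl(vcpu_fd, KVM_RUN, 0) returns */
switch (run->exit_reason) {
case KVM_EXIT_X86_MSR_OP:
	/* run->msr.msr_data holds the value the guest wrote to
	 * MSR_KVM_REGISTER_MEM_AREA, i.e. the guest physical address
	 * of the kvm_memory_area descriptor */
	handle_mem_area(run->msr.msr_data);	/* hypothetical helper */
	break;
default:
	break;
}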

I am keeping it separate from the headers to facilitate backports to people
who want to backport the kernel part but not the hypervisor, or the other way
around.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
---
 arch/x86/kvm/svm.c |4 
 arch/x86/kvm/vmx.c |4 
 arch/x86/kvm/x86.c |   11 +++
 3 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 73a8f1d..214e740 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2990,6 +2990,10 @@ static int wrmsr_interception(struct vcpu_svm *svm)
 	svm->next_rip = kvm_rip_read(&svm->vcpu) + 2;
 	if (svm_set_msr(&svm->vcpu, ecx, data)) {
 		trace_kvm_msr_write_ex(ecx, data);
+		if (svm->vcpu.run->exit_reason == KVM_EXIT_X86_MSR_OP) {
+			skip_emulated_instruction(&svm->vcpu);
+			return 0;
+		}
 		kvm_inject_gp(&svm->vcpu, 0);
} else {
trace_kvm_msr_write(ecx, data);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index e2c4e32..f5c585f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3385,6 +3385,10 @@ static int handle_wrmsr(struct kvm_vcpu *vcpu)
 
if (vmx_set_msr(vcpu, ecx, data) != 0) {
trace_kvm_msr_write_ex(ecx, data);
+		if (vcpu->run->exit_reason == KVM_EXIT_X86_MSR_OP) {
+   skip_emulated_instruction(vcpu);
+   return 0;
+   }
kvm_inject_gp(vcpu, 0);
return 1;
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6206fd3..4ee9c87 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1543,6 +1543,13 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, 
u64 data)
 
 	kvm_read_guest(vcpu->kvm, data, &area_desc, sizeof(area_desc));
 	area_desc.result = 0xF;
+
+	if (vcpu->kvm->register_mem_area_uspace) {
+		vcpu->run->exit_reason = KVM_EXIT_X86_MSR_OP;
+		vcpu->run->msr.msr_data = data;
+		return 1;
+	}
+rma_out:
 	kvm_write_guest(vcpu->kvm, data, &area_desc, sizeof(area_desc));
break;
}
@@ -1974,6 +1981,7 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_X86_ROBUST_SINGLESTEP:
case KVM_CAP_XSAVE:
case KVM_CAP_ASYNC_PF:
+   case KVM_CAP_REGISTER_MEM_AREA:
r = 1;
break;
case KVM_CAP_COALESCED_MMIO:
@@ -3555,6 +3563,9 @@ long kvm_arch_vm_ioctl(struct file *filp,
r = 0;
break;
}
+   case KVM_USERSPACE_REGISTER_MEM_AREA:
+   kvm-register_mem_area_uspace = 1;
+   break;
 
default:
;
-- 
1.7.2.3



[PATCH 15/16] KVM-GST: KVM Steal time accounting

2011-01-24 Thread Glauber Costa
This patch accounts steal time in kernel/sched.
I kept it from last proposal, because I still see advantages
in it: Doing it here will give us easier access from scheduler
variables such as the cpu rq. The next patch shows an example of
usage for it.

Since functions like account_idle_time() can be called from
multiple places, not only account_process_tick(), steal time
grabbing is repeated in each accounting function separately.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
---
 kernel/sched.c |   40 
 1 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index ea3e5ef..3b3e88d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3508,6 +3508,38 @@ unsigned long long thread_group_sched_runtime(struct 
task_struct *p)
return ns;
 }
 
+cputime_t (*hypervisor_steal_time)(void) = NULL;
+
+/*
+ * We have to flush steal time information every time something else
+ * is accounted. Since the accounting functions are all visible to the rest
+ * of the kernel, it gets tricky to do them in one place. This helper function
+ * helps us.
+ *
+ * When the system is idle, the concept of steal time does not apply. We just
+ * tell the underlying hypervisor that we grabbed the data, but skip steal time
+ * accounting.
+ */
+static int touch_steal_time(int is_idle)
+{
+   u64 steal;
+   struct rq *rq = this_rq();
+
+   if (!hypervisor_steal_time)
+   return 0;
+
+   steal = hypervisor_steal_time();
+
+   if (is_idle)
+   return 0;
+
+   if (steal) {
+   account_steal_time(steal);
+   return 1;
+   }
+   return 0;
+}
+
 /*
  * Account user cpu time to a process.
  * @p: the process that the cpu time gets accounted to
@@ -3520,6 +3552,9 @@ void account_user_time(struct task_struct *p, cputime_t 
cputime,
 	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
 	cputime64_t tmp;
 
+	if (touch_steal_time(0))
+		return;
+
 	/* Add user time to process. */
 	p->utime = cputime_add(p->utime, cputime);
 	p->utimescaled = cputime_add(p->utimescaled, cputime_scaled);
@@ -3580,6 +3615,9 @@ void account_system_time(struct task_struct *p, int 
hardirq_offset,
 	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
 	cputime64_t tmp;
 
+	if (touch_steal_time(0))
+		return;
+
 	if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
 		account_guest_time(p, cputime, cputime_scaled);
return;
@@ -3627,6 +3665,8 @@ void account_idle_time(cputime_t cputime)
cputime64_t cputime64 = cputime_to_cputime64(cputime);
struct rq *rq = this_rq();
 
+   touch_steal_time(1);
+
 	if (atomic_read(&rq->nr_iowait) > 0)
 		cpustat->iowait = cputime64_add(cpustat->iowait, cputime64);
else
-- 
1.7.2.3



[PATCH 01/16] KVM-HDR: register KVM basic header infrastructure

2011-01-24 Thread Glauber Costa
KVM, which stands for KVM Virtual Memory (I wanted to call it KVM Virtual Mojito),
is a piece of shared memory that is visible to both the hypervisor and the guest
kernel - but not the guest userspace.

The basic idea is that the guest can tell the hypervisor about a specific
piece of memory, and what it expects to find in there. This is a generic
abstraction, that goes to userspace (qemu) if KVM (the hypervisor) can't
handle a specific request, thus giving us flexibility in some features
in the future.

KVM (The hypervisor) can change the contents of this piece of memory at
will. This works well with paravirtual information, and hopefully
normal guest memory - like last update time for the watchdog, for
instance.

This is basic KVM registration headers. I am keeping headers
separate to facilitate backports to people who wants to backport
the kernel part but not the hypervisor, or the other way around.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Avi Kivity a...@redhat.com
---
 arch/x86/include/asm/kvm_para.h |   11 +++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index a427bf7..b0b0ee0 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -21,6 +21,7 @@
  */
 #define KVM_FEATURE_CLOCKSOURCE2       3
 #define KVM_FEATURE_ASYNC_PF           4
+#define KVM_FEATURE_MEMORY_AREA        5
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
@@ -35,6 +36,16 @@
 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
 #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
 
+#define MSR_KVM_REGISTER_MEM_AREA 0x4b564d03
+
+struct kvm_memory_area {
+   __u64 base;
+   __u32 size;
+   __u32 type;
+   __u8  result;
+   __u8  pad[3];
+};
+
 #define KVM_MAX_MMU_OP_BATCH   32
 
 #define KVM_ASYNC_PF_ENABLED   (1 << 0)
-- 
1.7.2.3



[PATCH 00/16] New Proposal for steal time in KVM

2011-01-24 Thread Glauber Costa
Hello people

This is the new version of the steal time series, this time on steroids.
The steal time per se is not much different from the last time I posted, so
I'll highlight what's around it.

Since one of the main fights was around how to register the shared memory area,
which would end up with a new MSR, I decided to provide an MSR to end all MSR
allocations. This patchset contains a generic guest-kernel-to-hypervisor memory
registration mechanism, that can be further used and abused without too much
hassle, even by letting userspace handle what the hypervisor won't - think of
feature emulation in older hosts, or any other thing.
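
Condensed from the patches below, the guest side of a registration is roughly
(sketch only; my_data stands for whatever is being shared):

	struct kvm_memory_area mem = {
		.base = __pa(&my_data),
		.size = sizeof(my_data),
		.type = KVM_AREA_WALLCLOCK,
	};

	/* hand the descriptor's physical address to the host via the MSR;
	 * the host (or userspace) fills mem.result with the outcome */
	native_write_msr(MSR_KVM_REGISTER_MEM_AREA,
			 (u32)__pa(&mem), (u64)__pa(&mem) >> 32);
	if (mem.result != 0)
		/* registration refused: fall back to the legacy MSR */;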

To demonstrate that, and how it works, I ported kvmclock to this infrastructure.
After that, I used it to implement steal time information exchange.

Even after our last discussions, I am keeping the steal time accounting in sched.c.
I still see value on that, as opposed to a lower level, because it will give us
easier access to the scheduler variables, such as the cpu runqueue.

The last patch in the series uses this to decrease CPU power according to the
current steal time information, leading to possibly smarter scheduling.
This, in particular, is very early work, and advice on it - including
"Stop now, you idiot!" - is very welcome.

TODO:
 * Handle unregister over reboots
 * Grab a list of current registrations, for migration
 * Write documentation
 * Check size in hv registrations, to prevent out-of-bounds
   exploits


Glauber Costa (16):
  KVM-HDR: register KVM basic header infrastructure
  KVM-HV: KVM - KVM Virtual Memory hypervisor implementation
  KVM-HDR: KVM Userspace registering ioctl
  KVM-HV: KVM Userspace registering ioctl
  KVM-HDR: Implement wallclock over KVM - KVM Virtual Memory
  KVM-HV: Implement wallclock over KVM - KVM Virtual Memory
  KVM-GST: Implement wallclock over KVM - KVM Virtual Memory
  KVM-HDR: Implement kvmclock systemtime over KVM - KVM Virtual Memory
  KVM-HV: Implement kvmclock systemtime over KVM - KVM Virtual Memory
  KVM-GST: Implement kvmclock systemtime over KVM - KVM Virtual Memory
  KVM-HDR: KVM Steal time implementation
  KVM-HV: KVM Steal time implementation
  KVM-HV: KVM Steal time calculation
  KVM-GST: KVM Steal time registration
  KVM-GST: KVM Steal time accounting
  KVM-GST: adjust scheduler cpu power

 arch/x86/include/asm/kvm_host.h |3 +
 arch/x86/include/asm/kvm_para.h |   19 +++
 arch/x86/kernel/kvmclock.c  |  104 +-
 arch/x86/kvm/svm.c  |4 ++
 arch/x86/kvm/vmx.c  |4 ++
 arch/x86/kvm/x86.c  |   99 +
 include/linux/kvm.h |   14 +-
 include/linux/kvm_host.h|1 +
 include/linux/sched.h   |1 +
 kernel/sched.c  |   44 
 kernel/sched_fair.c |   10 
 11 files changed, 268 insertions(+), 35 deletions(-)

-- 
1.7.2.3



[PATCH 07/16] KVM-GST: Implement wallclock over KVM - KVM Virtual Memory

2011-01-24 Thread Glauber Costa
As a proof of concept to KVM - Kernel Virtual Memory, this patch
implements wallclock grabbing on top of it. At first, it may seem
as a waste of work to just redo it, since it is working well. But over
time, other MSRs were added - think ASYNC_PF - and more will probably come.
After this patch, we won't need to ever add another virtual MSR to KVM.

If the hypervisor fails to register the memory area, we switch back to legacy
behavior on things that were already present - like kvm clock.

This patch contains the guest implementation for it. I am keeping it
separate to facilitate backports to people who want to backport
the kernel part but not the hypervisor, or the other way around.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Avi Kivity a...@redhat.com
---
 arch/x86/kernel/kvmclock.c |   41 -
 1 files changed, 36 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index f98d3ea..b8809f0 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -31,6 +31,7 @@
 static int kvmclock = 1;
 static int msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
 static int msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
+static int kvm_memory_area_available = 0;
 
 static int parse_no_kvmclock(char *arg)
 {
@@ -43,6 +44,27 @@ early_param("no-kvmclock", parse_no_kvmclock);
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct pvclock_vcpu_time_info, hv_clock);
 static struct pvclock_wall_clock wall_clock;
 
+static int kvm_register_mem_area(u64 base, int type, int size)
+{
+   int low, high;
+
+   struct kvm_memory_area mem;
+
+   if (!kvm_memory_area_available)
+   return 1;
+
+   mem.base = base;
+   mem.size = size;
+   mem.type = type;
+
+	low = (int)__pa_symbol(&mem);
+	high = ((u64)__pa_symbol(&mem) >> 32);
+
+   native_write_msr(MSR_KVM_REGISTER_MEM_AREA, low, high);
+   return mem.result;
+}
+
+
 /*
  * The wallclock is the time of day when we booted. Since then, some time may
  * have elapsed since the hypervisor wrote the data. So we try to account for
@@ -53,11 +75,17 @@ static unsigned long kvm_get_wallclock(void)
struct pvclock_vcpu_time_info *vcpu_time;
struct timespec ts;
int low, high;
-
-	low = (int)__pa_symbol(&wall_clock);
-	high = ((u64)__pa_symbol(&wall_clock) >> 32);
-
-	native_write_msr(msr_kvm_wall_clock, low, high);
+	u64 addr = __pa_symbol(&wall_clock);
+	int ret;
+
+	ret = kvm_register_mem_area(addr, KVM_AREA_WALLCLOCK,
+				    sizeof(wall_clock));
+	if (ret != 0) {
+		low = (int)addr;
+		high = ((u64)addr >> 32);
+
+		native_write_msr(msr_kvm_wall_clock, low, high);
+	}
 
 	vcpu_time = &get_cpu_var(hv_clock);
 	pvclock_read_wallclock(&wall_clock, vcpu_time, &ts);
@@ -179,6 +207,9 @@ void __init kvmclock_init(void)
if (!kvm_para_available())
return;
 
+   if (kvm_para_has_feature(KVM_FEATURE_MEMORY_AREA))
+   kvm_memory_area_available = 1;
+
 	if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
-- 
1.7.2.3



[PATCH 11/16] KVM-HDR: KVM Steal time implementation

2011-01-24 Thread Glauber Costa
To implement steal time, we need the hypervisor to pass the guest information
about how much time was spent running other processes outside the VM.
This is per-vcpu, and using the kvmclock structure for that is an abuse
we decided not to make.

This patch contains the headers for it. I am keeping it separate to facilitate
backports to people who want to backport the kernel part but not the
hypervisor, or the other way around.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
---
 arch/x86/include/asm/kvm_para.h |5 +
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 4df93cf..1ca49d8 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -46,8 +46,13 @@ struct kvm_memory_area {
__u8  pad[3];
 };
 
+struct kvm_steal_time {
+   __u64 steal;
+};
+
 #define KVM_AREA_WALLCLOCK 1
 #define KVM_AREA_SYSTIME   2
+#define KVM_AREA_STEAL_TIME    3
 
 #define KVM_MAX_MMU_OP_BATCH   32
 
-- 
1.7.2.3



[PATCH 09/16] KVM-HV: Implement kvmclock systemtime over KVM - KVM Virtual Memory

2011-01-24 Thread Glauber Costa
As a proof of concept to KVM - Kernel Virtual Memory, this patch
implements kvmclock per-vcpu systime grabbing on top of it. At first, it
may seem as a waste of work to just redo it, since it is working well. But over
time, other MSRs were added - think ASYNC_PF - and more will probably come.
After this patch, we won't need to ever add another virtual MSR to KVM.

If the hypervisor fails to register the memory area, we switch back to legacy
behavior on things that were already present - like kvm clock.

This patch contains the hypervisor part for it. I am keeping it separate from
the headers to facilitate backports to people who want to backport the kernel
part but not the hypervisor, or the other way around.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Avi Kivity a...@redhat.com
---
 arch/x86/kvm/x86.c |   55 +++
 1 files changed, 33 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a232a36..81b2f34 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1453,6 +1453,33 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, 
u64 data)
return 0;
 }
 
+static void kvmclock_register_page(struct kvm_vcpu *vcpu, u64 data)
+{
+	if (vcpu->arch.time_page) {
+		kvm_release_page_dirty(vcpu->arch.time_page);
+		vcpu->arch.time_page = NULL;
+	}
+
+	vcpu->arch.time = data;
+	kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+
+	/* we verify if the enable bit is set... */
+	if (!(data & 1))
+		return;
+
+	/* ...but clean it before doing the actual write */
+	vcpu->arch.time_offset = data & ~(PAGE_MASK | 1);
+
+	vcpu->arch.time_page =
+		gfn_to_page(vcpu->kvm, data >> PAGE_SHIFT);
+
+	if (is_error_page(vcpu->arch.time_page)) {
+		kvm_release_page_clean(vcpu->arch.time_page);
+		vcpu->arch.time_page = NULL;
+	}
+}
+
+
 int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 {
switch (msr) {
@@ -1510,28 +1537,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, 
u64 data)
break;
case MSR_KVM_SYSTEM_TIME_NEW:
case MSR_KVM_SYSTEM_TIME: {
-	if (vcpu->arch.time_page) {
-		kvm_release_page_dirty(vcpu->arch.time_page);
-		vcpu->arch.time_page = NULL;
-	}
-
-	vcpu->arch.time = data;
-	kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
-
-	/* we verify if the enable bit is set... */
-	if (!(data & 1))
-		break;
-
-	/* ...but clean it before doing the actual write */
-	vcpu->arch.time_offset = data & ~(PAGE_MASK | 1);
-
-	vcpu->arch.time_page =
-		gfn_to_page(vcpu->kvm, data >> PAGE_SHIFT);
-
-	if (is_error_page(vcpu->arch.time_page)) {
-		kvm_release_page_clean(vcpu->arch.time_page);
-		vcpu->arch.time_page = NULL;
-	}
+   kvmclock_register_page(vcpu, data);
break;
}
case MSR_KVM_ASYNC_PF_EN:
@@ -1553,6 +1559,11 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, 
u64 data)
goto rma_out;
}
 
+   if (area_desc.type == KVM_AREA_SYSTIME) {
+   kvmclock_register_page(vcpu, area_desc.base | 1);
+   goto rma_out;
+   }
+
 	if (vcpu->kvm->register_mem_area_uspace) {
 		vcpu->run->exit_reason = KVM_EXIT_X86_MSR_OP;
 		vcpu->run->msr.msr_data = data;
-- 
1.7.2.3



[PATCH 05/16] KVM-HDR: Implement wallclock over KVM - KVM Virtual Memory

2011-01-24 Thread Glauber Costa
As a proof of concept to KVM - Kernel Virtual Memory, this patch
implements wallclock grabbing on top of it. At first, it may seem
as a waste of work to just redo it, since it is working well. But over
time, other MSRs were added - think ASYNC_PF - and more will probably come.
After this patch, we won't need to ever add another virtual MSR to KVM.

If the hypervisor fails to register the memory area, we switch back to legacy
behavior on things that were already present - like kvm clock.

This patch contains the headers for it. I am keeping it separate to facilitate
backports to people who want to backport the kernel part but not the
hypervisor, or the other way around.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Avi Kivity a...@redhat.com
---
 arch/x86/include/asm/kvm_para.h |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index b0b0ee0..de088c8 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -46,6 +46,8 @@ struct kvm_memory_area {
__u8  pad[3];
 };
 
+#define KVM_AREA_WALLCLOCK 1
+
 #define KVM_MAX_MMU_OP_BATCH   32
 
 #define KVM_ASYNC_PF_ENABLED   (1 << 0)
-- 
1.7.2.3



[PATCH 03/16] KVM-HDR: KVM Userspace registering ioctl

2011-01-24 Thread Glauber Costa
KVM, which stands for KVM Virtual Memory (I wanted to call it KVM Virtual Mojito),
is a piece of shared memory that is visible to both the hypervisor and the guest
kernel - but not the guest userspace.

The basic idea is that the guest can tell the hypervisor about a specific
piece of memory, and what it expects to find in there. This is a generic
abstraction, that goes to userspace (qemu) if KVM (the hypervisor) can't
handle a specific request, thus giving us flexibility in some features
in the future.

KVM (The hypervisor) can change the contents of this piece of memory at
will. This works well with paravirtual information, and hopefully
normal guest memory - like last update time for the watchdog, for
instance.

This patch contains the header part of the userspace communication
implementation.
Userspace can query the presence/absence of this feature in the normal way.
It also tells the hypervisor that it is capable of handling - in whatever
way it chooses - registrations that the hypervisor does not know how to.
On x86, the only user so far, this mechanism is implemented as a generic
userspace msr exit, which could theoretically be used to implement
msr handling in userspace.

I am keeping the headers separate to facilitate backports to people
who want to backport the kernel part but not the hypervisor, or the other way
around.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Avi Kivity a...@redhat.com
---
 include/linux/kvm.h  |   14 +-
 include/linux/kvm_host.h |1 +
 2 files changed, 14 insertions(+), 1 deletions(-)

diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index ea2dc1a..5cc4fe8 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -161,6 +161,7 @@ struct kvm_pit_config {
 #define KVM_EXIT_NMI  16
 #define KVM_EXIT_INTERNAL_ERROR   17
 #define KVM_EXIT_OSI  18
+#define KVM_EXIT_X86_MSR_OP  19
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 #define KVM_INTERNAL_ERROR_EMULATION 1
@@ -264,6 +265,10 @@ struct kvm_run {
struct {
__u64 gprs[32];
} osi;
+   /* KVM_EXIT_X86_MSR_OP */
+   struct {
+   __u64 msr_data;
+   } msr;
/* Fix the size of the union. */
char padding[256];
};
@@ -422,6 +427,11 @@ struct kvm_ppc_pvinfo {
__u8  pad[108];
 };
 
+struct kvm_area_info {
+   __u8  enabled;
+   __u8  pad[3];
+};
+
 #define KVMIO 0xAE
 
 /*
@@ -541,6 +551,7 @@ struct kvm_ppc_pvinfo {
 #define KVM_CAP_PPC_GET_PVINFO 57
 #define KVM_CAP_PPC_IRQ_LEVEL 58
 #define KVM_CAP_ASYNC_PF 59
+#define KVM_CAP_REGISTER_MEM_AREA 60
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -677,7 +688,8 @@ struct kvm_clock_data {
 #define KVM_SET_PIT2  _IOW(KVMIO,  0xa0, struct kvm_pit_state2)
 /* Available with KVM_CAP_PPC_GET_PVINFO */
 #define KVM_PPC_GET_PVINFO   _IOW(KVMIO,  0xa1, struct kvm_ppc_pvinfo)
-
+#define KVM_USERSPACE_REGISTER_MEM_AREA \
+ _IOW(KVMIO,  0xa8, struct kvm_area_info)
 /*
  * ioctls for vcpu fds
  */
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b5021db..b7b361f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -258,6 +258,7 @@ struct kvm {
long mmu_notifier_count;
 #endif
long tlbs_dirty;
+   int register_mem_area_uspace;
 };
 
 /* The guest did something we don't support. */
-- 
1.7.2.3



Re: [RFC -v6 PATCH 4/8] sched: Add yield_to(task, preempt) functionality

2011-01-24 Thread Peter Zijlstra
On Thu, 2011-01-20 at 16:34 -0500, Rik van Riel wrote:
 From: Mike Galbraith efa...@gmx.de
 
 Currently only implemented for fair class tasks.
 
 Add a yield_to_task method() to the fair scheduling class, allowing the
 caller of yield_to() to accelerate another thread in its thread group,
 task group.
 
 Implemented via a scheduler hint, using cfs_rq->next to encourage the
 target being selected.  We can rely on pick_next_entity to keep things
 fair, so no one can accelerate a thread that has already used its fair
 share of CPU time.
 
 This also means callers should only call yield_to when they really
 mean it.  Calling it too often can result in the scheduler just
 ignoring the hint.
 
 Signed-off-by: Rik van Riel r...@redhat.com
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 Signed-off-by: Mike Galbraith efa...@gmx.de

Patch 5 wants to be merged back in here I think..

 diff --git a/include/linux/sched.h b/include/linux/sched.h
 index 2c79e92..6c43fc4 100644
 --- a/include/linux/sched.h
 +++ b/include/linux/sched.h
 @@ -1047,6 +1047,7 @@ struct sched_class {
   void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
   void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
   void (*yield_task) (struct rq *rq);
 + bool (*yield_to_task) (struct rq *rq, struct task_struct *p, bool 
 preempt);
  
   void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int 
 flags);
  
 @@ -1943,6 +1944,7 @@ static inline int rt_mutex_getprio(struct task_struct 
 *p)
  # define rt_mutex_adjust_pi(p)   do { } while (0)
  #endif
  
 +extern bool yield_to(struct task_struct *p, bool preempt);
  extern void set_user_nice(struct task_struct *p, long nice);
  extern int task_prio(const struct task_struct *p);
  extern int task_nice(const struct task_struct *p);
 diff --git a/kernel/sched.c b/kernel/sched.c
 index e4e57ff..1f38ed2 100644
 --- a/kernel/sched.c
 +++ b/kernel/sched.c
 @@ -5270,6 +5270,64 @@ void __sched yield(void)
  }
  EXPORT_SYMBOL(yield);
  
 +/**
 + * yield_to - yield the current processor to another thread in
 + * your thread group, or accelerate that thread toward the
 + * processor it's on.
 + *
 + * It's the caller's job to ensure that the target task struct
 + * can't go away on us before we can do any checks.
 + *
 + * Returns true if we indeed boosted the target task.
 + */
 +bool __sched yield_to(struct task_struct *p, bool preempt)
 +{
 + struct task_struct *curr = current;
 + struct rq *rq, *p_rq;
 + unsigned long flags;
 + bool yielded = 0;
 +
 + local_irq_save(flags);
 + rq = this_rq();
 +
 +again:
 + p_rq = task_rq(p);
 + double_rq_lock(rq, p_rq);
 + while (task_rq(p) != p_rq) {
 + double_rq_unlock(rq, p_rq);
 + goto again;
 + }
 +
 + if (!curr->sched_class->yield_to_task)
 + goto out;
 +
 + if (curr->sched_class != p->sched_class)
 + goto out;
 +
 + if (task_running(p_rq, p) || p->state)
 + goto out;
 +
 + if (!same_thread_group(p, curr))
 + goto out;
 +
 +#ifdef CONFIG_FAIR_GROUP_SCHED
 + if (task_group(p) != task_group(curr))
 + goto out;
 +#endif
 +
 + yielded = curr->sched_class->yield_to_task(rq, p, preempt);
 +
 +out:
 + double_rq_unlock(rq, p_rq);
 + local_irq_restore(flags);
 +
 + if (yielded)
 + yield();

Calling yield() here is funny, you just had all the locks to actually do
it..

 + return yielded;
 +}
 +EXPORT_SYMBOL_GPL(yield_to);



 diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
 index f701a51..097e936 100644
 --- a/kernel/sched_fair.c
 +++ b/kernel/sched_fair.c
 @@ -1800,6 +1800,23 @@ static void yield_task_fair(struct rq *rq)
   set_yield_buddy(se);
  }
  
 +static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool 
 preempt)
 +{
 + struct sched_entity *se = &p->se;
 +
 + if (!se->on_rq)
 + return false;
 +
 + /* Tell the scheduler that we'd really like pse to run next. */
 + set_next_buddy(se);
 +
 + /* Make p's CPU reschedule; pick_next_entity takes care of fairness. */
 + if (preempt)
 + resched_task(rq->curr);
 +
 + return true;
 +}

So here we set ->next, we could be ->last, and after this we'll set
->yield to curr by calling yield().

So if you do this cyclically I can see ->yield == {->next,->last}
happening.


Re: [RFC -v6 PATCH 3/8] sched: use a buddy to implement yield_task_fair

2011-01-24 Thread Rik van Riel

On 01/24/2011 01:04 PM, Peter Zijlstra wrote:


diff --git a/kernel/sched.c b/kernel/sched.c
index dc91a4d..e4e57ff 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -327,7 +327,7 @@ struct cfs_rq {
 * 'curr' points to currently running entity on this cfs_rq.
 * It is set to NULL otherwise (i.e when none are currently running).
 */
-   struct sched_entity *curr, *next, *last;
+   struct sched_entity *curr, *next, *last, *yield;


I'd prefer it be called: skip or somesuch..


I could do that.  Do any of the other scheduler people have
a preference?


+static struct sched_entity *__pick_second_entity(struct cfs_rq *cfs_rq)
+{
+	struct rb_node *left = cfs_rq->rb_leftmost;
+   struct rb_node *second;
+
+   if (!left)
+   return NULL;
+
+   second = rb_next(left);
+
+   if (!second)
+   second = left;
+
+   return rb_entry(second, struct sched_entity, run_node);
+}


So this works because you only ever skip the leftmost, should we perhaps
write this as something like the below?


Well, pick_next_entity only ever *picks* the leftmost entity,
so there's no reason to skip others.


@@ -813,6 +840,9 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct 
sched_entity *se)

 	if (cfs_rq->next == se)
 		__clear_buddies_next(se);
+
+	if (cfs_rq->yield == se)
+		__clear_buddies_yield(se);
  }


The 3rd hierarchy iteration.. :/


Except it won't actually walk up the tree above the level
where the buddy actually points at the se.  I suspect the
new code will do less tree walking than the old code.


+   /*
+* Someone really wants this to run. If it's not unfair, run it.
+*/
+	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
+		se = cfs_rq->next;
+
clear_buddies(cfs_rq, se);

return se;


This seems to assume ->yield cannot be ->next nor ->last, but I'm not
quite sure that will actually be true.


On the contrary, I specifically want ->next to be able to
override ->yield, for the reason that the _tasks_ that
have ->next and ->yield set could be inside the same _group_.

What I am assuming is that ->yield and ->last are not the
same task.  This is achieved by yield_task_fair calling
clear_buddies.


+/*
+ * sched_yield() is very simple
+ *
+ * The magic of dealing with the ->yield buddy is in pick_next_entity.
+ */
+static void yield_task_fair(struct rq *rq)
+{
+	struct task_struct *curr = rq->curr;
+	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
+	struct sched_entity *se = &curr->se;
+
+	/*
+	 * Are we the only task in the tree?
+	 */
+	if (unlikely(rq->nr_running == 1))
+		return;
+
+	clear_buddies(cfs_rq, se);
+
+	if (curr->policy != SCHED_BATCH) {
+		update_rq_clock(rq);
+		/*
+		 * Update run-time statistics of the 'current'.
+		 */
+		update_curr(cfs_rq);
+	}
+
+   set_yield_buddy(se);
+}


You just lost sysctl_sched_compat_yield, someone might be upset (I
really can't be bothered much with people using sys_yield :-), but if
you're going down that road you want a hunk in kernel/sysctl.c as well I
think.


I lost sysctl_sched_compat_yield, because with my code
yield is no longer a noop.

I'd be glad to remove the sysctl.c bits if you want :)


Re: [RFC -v6 PATCH 4/8] sched: Add yield_to(task, preempt) functionality

2011-01-24 Thread Rik van Riel

On 01/24/2011 01:12 PM, Peter Zijlstra wrote:

On Thu, 2011-01-20 at 16:34 -0500, Rik van Riel wrote:

From: Mike Galbraith efa...@gmx.de

Currently only implemented for fair class tasks.

Add a yield_to_task method() to the fair scheduling class, allowing the
caller of yield_to() to accelerate another thread in its thread group,
task group.

Implemented via a scheduler hint, using cfs_rq->next to encourage the
target being selected.  We can rely on pick_next_entity to keep things
fair, so no one can accelerate a thread that has already used its fair
share of CPU time.

This also means callers should only call yield_to when they really
mean it.  Calling it too often can result in the scheduler just
ignoring the hint.

Signed-off-by: Rik van Riel r...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Mike Galbraith efa...@gmx.de


Patch 5 wants to be merged back in here I think..


Agreed, but I wanted Mike's comments first  :)



+/**
+ * yield_to - yield the current processor to another thread in
+ * your thread group, or accelerate that thread toward the
+ * processor it's on.
+ *
+ * It's the caller's job to ensure that the target task struct
+ * can't go away on us before we can do any checks.
+ *
+ * Returns true if we indeed boosted the target task.
+ */
+bool __sched yield_to(struct task_struct *p, bool preempt)
+{
+   struct task_struct *curr = current;
+   struct rq *rq, *p_rq;
+   unsigned long flags;
+   bool yielded = 0;
+
+   local_irq_save(flags);
+   rq = this_rq();
+
+again:
+   p_rq = task_rq(p);
+   double_rq_lock(rq, p_rq);
+   while (task_rq(p) != p_rq) {
+   double_rq_unlock(rq, p_rq);
+   goto again;
+   }
+
+	if (!curr->sched_class->yield_to_task)
+		goto out;
+
+	if (curr->sched_class != p->sched_class)
+		goto out;
+
+	if (task_running(p_rq, p) || p->state)
+		goto out;
+
+	if (!same_thread_group(p, curr))
+		goto out;
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	if (task_group(p) != task_group(curr))
+		goto out;
+#endif
+
+	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
+
+out:
+   double_rq_unlock(rq, p_rq);
+   local_irq_restore(flags);
+
+   if (yielded)
+   yield();


Calling yield() here is funny, you just had all the locks to actually do
it..


This is us giving up the CPU, which requires not holding locks.

A different thing than us giving the CPU away to someone else.


+static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool 
preempt)
+{
+	struct sched_entity *se = &p->se;
+
+	if (!se->on_rq)
+		return false;
+
+	/* Tell the scheduler that we'd really like pse to run next. */
+	set_next_buddy(se);
+
+	/* Make p's CPU reschedule; pick_next_entity takes care of fairness. */
+	if (preempt)
+		resched_task(rq->curr);
+
+   return true;
+}


So here we set ->next, we could be ->last, and after this we'll set
->yield to curr by calling yield().

So if you do this cyclically I can see ->yield == {->next,->last}
happening.


That would only happen if we called yield_to with ourselves
as the argument!

There is no caller in the tree that does that - task p is
another task, not ourselves.


Re: Errors on MMIO read access on VM suspend / resume operations

2011-01-24 Thread Stefan Berger

On 01/18/2011 03:53 AM, Jan Kiszka wrote:

On 2011-01-18 04:03, Stefan Berger wrote:

On 01/16/2011 09:43 AM, Avi Kivity wrote:

On 01/14/2011 09:27 PM, Stefan Berger wrote:

Can you sprinkle some printfs() around kvm_run (in qemu-kvm.c) to
verify this?


Here's what I did:


interrupt exit requested

It appears from this you're using qemu.git.  Please try qemu-kvm.git,
where the code appears to be correct.


Cc'ing qemu-devel now. For reference, here the initial problem description:

http://www.spinics.net/lists/kvm/msg48274.html

I didn't know there was another tree...

I have seen now a couple of suspends-while-reading with patches applied
to the qemu-kvm.git tree and indeed, when run with the same host kernel
and VM I do not see the debugging dumps due to double-reads that I would
have anticipated seeing by now. Now what? Can this be easily fixed in
the other Qemu tree as well?

Please give this a try:

git://git.kiszka.org/qemu-kvm.git queues/kvm-upstream

I bet (& hope) "kvm: Unconditionally reenter kernel after IO exits"
fixes the issue for you. If other problems pop up with that tree, also
try resetting to that particular commit.

I'm currently trying to shake all those hidden or forgotten bug fixes
out of qemu-kvm and port them upstream. Most of those subtle differences
should hopefully soon be history.

I did the same test as I did with Avi's tree and haven't seen the 
consequences of possible double-reads. So, I would say that you should 
upstream those patches...


I searched for the text you mention above using 'gitk' but couldn't find 
a patch with that headline in your tree. There were others that seem to 
be related:


Gleb Natapov: do not enter vcpu again if it was stopped during IO

One thing I'd like to mention is that I have seen what I think are
interrupt stalls when running my tests inside the qemu-kvm.git tree
version and not suspending at all. At some point the interrupt counter in
the guest kernel does not increase anymore even though I see the device
model raising the IRQ and lowering it. The same tests run literally
forever in the qemu.git tree version of Qemu.

What about qemu-kvm and -no-kvm-irqchip?
That seems to be necessary for both trees, yours and the one Avi pointed 
me to. If applied, then I did not see the interrupt problem.


Stefan

Jan





Re: [PATCH 16/16] KVM-GST: adjust scheduler cpu power

2011-01-24 Thread Peter Zijlstra
On Mon, 2011-01-24 at 13:06 -0500, Glauber Costa wrote:
 This is a first proposal for using steal time information
 to influence the scheduler. There are a lot of optimizations
 and fine grained adjustments to be done, but it is working reasonably
 so far for me (mostly)
 
 With this patch (and some host pinnings to demonstrate the situation),
 two vcpus with very different steal time (Say 80 % vs 1 %) will not get
 an even distribution of processes. This is a situation that can naturally
 arise, especially in overcommitted scenarios. Previously, the guest scheduler
 would wrongly think that all cpus have the same ability to run processes,
 lowering the overall throughput.
 
 Signed-off-by: Glauber Costa glom...@redhat.com
 CC: Rik van Riel r...@redhat.com
 CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
 CC: Peter Zijlstra pet...@infradead.org
 CC: Avi Kivity a...@redhat.com
 ---
  include/linux/sched.h |1 +
  kernel/sched.c|4 
  kernel/sched_fair.c   |   10 ++
  3 files changed, 15 insertions(+), 0 deletions(-)
 
 diff --git a/include/linux/sched.h b/include/linux/sched.h
 index d747f94..beab72d 100644
 --- a/include/linux/sched.h
 +++ b/include/linux/sched.h
 @@ -302,6 +302,7 @@ long io_schedule_timeout(long timeout);
  extern void cpu_init (void);
  extern void trap_init(void);
  extern void update_process_times(int user);
 +extern cputime_t (*hypervisor_steal_time)(void);
  extern void scheduler_tick(void);
  
  extern void sched_show_task(struct task_struct *p);
 diff --git a/kernel/sched.c b/kernel/sched.c
 index 3b3e88d..c460f0d 100644
 --- a/kernel/sched.c
 +++ b/kernel/sched.c
 @@ -501,6 +501,8 @@ struct rq {
   struct sched_domain *sd;
  
   unsigned long cpu_power;
 + unsigned long real_ticks;
 + unsigned long total_ticks;
  
   unsigned char idle_at_tick;
   /* For active balancing */
 @@ -3533,10 +3535,12 @@ static int touch_steal_time(int is_idle)
   if (is_idle)
   return 0;
  
 + rq->total_ticks++;
   if (steal) {
   account_steal_time(steal);
   return 1;
   }
 + rq->real_ticks++;
   return 0;
  }

yuck!! is ticks really the best you can do?

I thought kvm had a ns resolution steal-time clock?

 diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
 index c62ebae..1364c28 100644
 --- a/kernel/sched_fair.c
 +++ b/kernel/sched_fair.c
 @@ -2509,6 +2509,16 @@ static void update_cpu_power(struct sched_domain *sd, 
 int cpu)
 	power >>= SCHED_LOAD_SHIFT;
   }
  
 + WARN_ON(cpu_rq(cpu)->real_ticks > cpu_rq(cpu)->total_ticks);
 +
 + if (cpu_rq(cpu)->total_ticks) {
 + power *= cpu_rq(cpu)->real_ticks;
 + power /= cpu_rq(cpu)->total_ticks;
 + }
 +
 + cpu_rq(cpu)->total_ticks = 0;
 + cpu_rq(cpu)->real_ticks = 0;
 +
   sdg-cpu_power_orig = power;
  
   if (sched_feat(ARCH_POWER))

I would really much rather see you change update_rq_clock_task() and
subtract your ns resolution steal time from our wall-time;
update_rq_clock_task() already updates the cpu_power relative to the
remaining time available.
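
For reference, a minimal sketch of that alternative (steal_clock() and
rq->prev_steal_time are illustrative names here, assuming a ns resolution,
monotonically increasing per-cpu steal counter - not existing interfaces):

static void update_rq_clock_task(struct rq *rq, s64 delta)
{
	u64 steal = steal_clock(cpu_of(rq)) - rq->prev_steal_time;

	rq->prev_steal_time += steal;
	if (steal > delta)
		steal = delta;

	/* only the time we really ran advances the task clock */
	delta -= steal;
	rq->clock_task += delta;
}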


Re: Flow Control and Port Mirroring Revisited

2011-01-24 Thread Rick Jones


Just to block netperf you can send it SIGSTOP :)



Clever :)  One could I suppose achieve the same result by making the remote 
receive socket buffer size smaller than the UDP message size and then not worry 
about having to learn the netserver's PID to send it the SIGSTOP.  I *think* the 
semantics will be substantially the same?  Both will be drops at the socket 
buffer, albeit for different reasons.  The too small socket buffer version 
though doesn't require one remember to wake the netserver in time to have it 
send results back to netperf without netperf tossing-up an error and not 
reporting any statistics.


Also, netperf has a no control connection mode where you can, in effect, cause
it to send UDP datagrams out into the void - I put it there to allow folks to
test against the likes of echo, discard and chargen services, but it may have a
use here.  Requires that one specify the destination IP and port for the data
connection explicitly via the test-specific options.  In that mode the only
stats reported are those local to netperf rather than netserver.


happy benchmarking,

rick jones


Re: Flow Control and Port Mirroring Revisited

2011-01-24 Thread Michael S. Tsirkin
On Mon, Jan 24, 2011 at 10:27:55AM -0800, Rick Jones wrote:
 
 Just to block netperf you can send it SIGSTOP :)
 
 
 Clever :)  One could I suppose achieve the same result by making the
 remote receive socket buffer size smaller than the UDP message size
 and then not worry about having to learn the netserver's PID to send
 it the SIGSTOP.  I *think* the semantics will be substantially the
 same?

If you could set, it, yes. But at least linux ignores
any value substantially smaller than 1K, and then
multiplies that by 2:

case SO_RCVBUF:
/* Don't error on this BSD doesn't and if you think
   about it this is right. Otherwise apps have to
   play 'guess the biggest size' games. RCVBUF/SNDBUF
   are treated in BSD as hints */

if (val > sysctl_rmem_max)
val = sysctl_rmem_max;
set_rcvbuf: 
sk->sk_userlocks |= SOCK_RCVBUF_LOCK;

/*
 * We double it on the way in to account for
 * struct sk_buff etc. overhead.   Applications
 * assume that the SO_RCVBUF setting they make will
 * allow that much actual data to be received on that
 * socket.
 *
 * Applications are unaware that struct sk_buff and
 * other overheads allocate from the receive buffer
 * during socket buffer allocation. 
 *
 * And after considering the possible alternatives,
 * returning the value we actually used in getsockopt
 * is the most desirable behavior.
 */ 
if ((val * 2) < SOCK_MIN_RCVBUF)
sk->sk_rcvbuf = SOCK_MIN_RCVBUF;
else
sk->sk_rcvbuf = val * 2;

and

/*  
 * Since sk_rmem_alloc sums skb->truesize, even a small frame might need
 * sizeof(sk_buff) + MTU + padding, unless net driver perform copybreak
 */ 
#define SOCK_MIN_RCVBUF (2048 + sizeof(struct sk_buff))


  Both will be drops at the socket buffer, albeit for
 different reasons.  The too small socket buffer version though
 doesn't require one remember to wake the netserver in time to have
 it send results back to netperf without netperf tossing-up an error
 and not reporting any statistics.
 
 Also, netperf has a no control connection mode where you can, in
 effect cause it to send UDP datagrams out into the void - I put it
 there to allow folks to test against the likes of echo discard and
 chargen services but it may have a use here.  Requires that one
 specify the destination IP and port for the data connection
 explicitly via the test-specific options.  In that mode the only
 stats reported are those local to netperf rather than netserver.

Ah, sounds perfect.
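
(For completeness, the clamping above is easy to observe from userspace - a
minimal sketch, error handling omitted:)

#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	int val = 1, out;
	socklen_t len = sizeof(out);

	setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &val, sizeof(val));
	getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &out, &len);
	/* prints the doubled, clamped value (SOCK_MIN_RCVBUF), not 2 */
	printf("SO_RCVBUF = %d\n", out);
	return 0;
}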

 happy benchmarking,
 
 rick jones



Re: [PATCH 16/16] KVM-GST: adjust scheduler cpu power

2011-01-24 Thread Glauber Costa
On Mon, 2011-01-24 at 19:32 +0100, Peter Zijlstra wrote:
 On Mon, 2011-01-24 at 13:06 -0500, Glauber Costa wrote:
  This is a first proposal for using steal time information
  to influence the scheduler. There are a lot of optimizations
  and fine grained adjustments to be done, but it is working reasonably
  so far for me (mostly)
  
  With this patch (and some host pinnings to demonstrate the situation),
  two vcpus with very different steal time (Say 80 % vs 1 %) will not get
  an even distribution of processes. This is a situation that can naturally
  arise, specially in overcommited scenarios. Previosly, the guest scheduler
  would wrongly think that all cpus have the same ability to run processes,
  lowering the overall throughput.
  
  Signed-off-by: Glauber Costa glom...@redhat.com
  CC: Rik van Riel r...@redhat.com
  CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
  CC: Peter Zijlstra pet...@infradead.org
  CC: Avi Kivity a...@redhat.com
  ---
   include/linux/sched.h |1 +
   kernel/sched.c|4 
   kernel/sched_fair.c   |   10 ++
   3 files changed, 15 insertions(+), 0 deletions(-)
  
  diff --git a/include/linux/sched.h b/include/linux/sched.h
  index d747f94..beab72d 100644
  --- a/include/linux/sched.h
  +++ b/include/linux/sched.h
  @@ -302,6 +302,7 @@ long io_schedule_timeout(long timeout);
   extern void cpu_init (void);
   extern void trap_init(void);
   extern void update_process_times(int user);
  +extern cputime_t (*hypervisor_steal_time)(void);
   extern void scheduler_tick(void);
   
   extern void sched_show_task(struct task_struct *p);
  diff --git a/kernel/sched.c b/kernel/sched.c
  index 3b3e88d..c460f0d 100644
  --- a/kernel/sched.c
  +++ b/kernel/sched.c
  @@ -501,6 +501,8 @@ struct rq {
  struct sched_domain *sd;
   
  unsigned long cpu_power;
  +   unsigned long real_ticks;
  +   unsigned long total_ticks;
   
  unsigned char idle_at_tick;
  /* For active balancing */
  @@ -3533,10 +3535,12 @@ static int touch_steal_time(int is_idle)
  if (is_idle)
  return 0;
   
  +   rq->total_ticks++;
  if (steal) {
  account_steal_time(steal);
  return 1;
  }
  +   rq->real_ticks++;
  return 0;
   }
 
 yuck!! is ticks really the best you can do?
No, but it is simple enough for a first try. With those variables, we're
not accounting anything, but rather, getting a rough estimate of the %
of steal time compared to the useful (non-idle) time.

 I thought kvm had a ns resolution steal-time clock?
Yes, the one I introduced earlier in this series is nsec. However, user
and system will be accounted in usec at most, so there is no point in
using nsec here.

 
  diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
  index c62ebae..1364c28 100644
  --- a/kernel/sched_fair.c
  +++ b/kernel/sched_fair.c
  @@ -2509,6 +2509,16 @@ static void update_cpu_power(struct sched_domain 
  *sd, int cpu)
  power >>= SCHED_LOAD_SHIFT;
  }
   
  +   WARN_ON(cpu_rq(cpu)->real_ticks > cpu_rq(cpu)->total_ticks);
  +
  +   if (cpu_rq(cpu)->total_ticks) {
  +   power *= cpu_rq(cpu)->real_ticks;
  +   power /= cpu_rq(cpu)->total_ticks;
  +   }
  +
  +   cpu_rq(cpu)->total_ticks = 0;
  +   cpu_rq(cpu)->real_ticks = 0;
  +
  sdg-cpu_power_orig = power;
   
  if (sched_feat(ARCH_POWER))
 
 I would really much rather see you change update_rq_clock_task() and
 subtract your ns resolution steal time from our wall-time,
 update_rq_clock_task() already updates the cpu_power relative to the
 remaining time available.

But then we get into the problems we already discussed in previous
submissions, which is that we don't really want to alter any notion of
wallclock time. Could you list any more concrete advantages of the
specific way you proposed?

I found that updating cpu power directly is pretty simple, and seems
to work well enough, without disturbing any notion of real time.



Re: Flow Control and Port Mirroring Revisited

2011-01-24 Thread Rick Jones

Michael S. Tsirkin wrote:

On Mon, Jan 24, 2011 at 10:27:55AM -0800, Rick Jones wrote:


Just to block netperf you can send it SIGSTOP :)



Clever :)  One could I suppose achieve the same result by making the
remote receive socket buffer size smaller than the UDP message size
and then not worry about having to learn the netserver's PID to send
it the SIGSTOP.  I *think* the semantics will be substantially the
same?



If you could set, it, yes. But at least linux ignores
any value substantially smaller than 1K, and then
multiplies that by 2:

case SO_RCVBUF:
/* Don't error on this BSD doesn't and if you think
   about it this is right. Otherwise apps have to
   play 'guess the biggest size' games. RCVBUF/SNDBUF
   are treated in BSD as hints */

if (val > sysctl_rmem_max)
val = sysctl_rmem_max;
set_rcvbuf: 
sk->sk_userlocks |= SOCK_RCVBUF_LOCK;


/*
 * We double it on the way in to account for
 * struct sk_buff etc. overhead.   Applications
 * assume that the SO_RCVBUF setting they make will
 * allow that much actual data to be received on that
 * socket.
 *
 * Applications are unaware that struct sk_buff and
 * other overheads allocate from the receive buffer
 * during socket buffer allocation. 
 *

 * And after considering the possible alternatives,
 * returning the value we actually used in getsockopt
 * is the most desirable behavior.
 */ 
if ((val * 2) < SOCK_MIN_RCVBUF)

sk->sk_rcvbuf = SOCK_MIN_RCVBUF;
else
sk->sk_rcvbuf = val * 2;

and

/*  
 * Since sk_rmem_alloc sums skb->truesize, even a small frame might need
 * sizeof(sk_buff) + MTU + padding, unless net driver perform copybreak
 */ 
#define SOCK_MIN_RCVBUF (2048 + sizeof(struct sk_buff))


Pity - seems to work back on 2.6.26:

raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1 -m 1024
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost 
(127.0.0.1) port 0 AF_INET : histogram

Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

124928    1024   10.00     2882334      0    2361.17
   256           10.00           0              0.00

raj@tardy:~/netperf2_trunk$ uname -a
Linux tardy 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64 GNU/Linux

Still, even with that (or SIGSTOP) we don't really know where the packets were
dropped, right?  There is no guarantee they weren't dropped before they got to
the socket buffer.


happy benchmarking,
rick jones

PS - here is with a -S 1024 option:

raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1024 -m 1024
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost 
(127.0.0.1) port 0 AF_INET : histogram

Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

124928    1024   10.00     1679269      0    1375.64
  2048           10.00     1490662           1221.13

showing that there is a decent chance that many of the frames were dropped at 
the socket buffer, but not all - I suppose I could/should be checking netstat 
stats... :)


And just a little more, only because I was curious :)

raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1M -m 257
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost 
(127.0.0.1) port 0 AF_INET : histogram

Socket  Message  Elapsed  Messages
SizeSize Time Okay Errors   Throughput
bytes   bytessecs#  #   10^6bits/sec

124928 257   10.00 1869134  0 384.29
262142   10.00 1869134384.29

raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1 -m 257
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost 
(127.0.0.1) port 0 AF_INET : histogram

Socket  Message  Elapsed  Messages
SizeSize Time Okay Errors   Throughput
bytes   bytessecs#  #   10^6bits/sec

124928 257   10.00 3076363  0 632.49
   256   10.00   0  0.00



Re: Flow Control and Port Mirroring Revisited

2011-01-24 Thread Michael S. Tsirkin
On Mon, Jan 24, 2011 at 11:01:45AM -0800, Rick Jones wrote:
 Michael S. Tsirkin wrote:
 On Mon, Jan 24, 2011 at 10:27:55AM -0800, Rick Jones wrote:
 
 Just to block netperf you can send it SIGSTOP :)
 
 
 Clever :)  One could I suppose achieve the same result by making the
 remote receive socket buffer size smaller than the UDP message size
 and then not worry about having to learn the netserver's PID to send
 it the SIGSTOP.  I *think* the semantics will be substantially the
 same?
 
 
 If you could set it, yes. But at least linux ignores
 any value substantially smaller than 1K, and then
 multiplies that by 2:
 
 case SO_RCVBUF:
 	/* Don't error on this BSD doesn't and if you think
 	   about it this is right. Otherwise apps have to
 	   play 'guess the biggest size' games. RCVBUF/SNDBUF
 	   are treated in BSD as hints */
 
 	if (val > sysctl_rmem_max)
 		val = sysctl_rmem_max;
 set_rcvbuf:
 	sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
 
 	/*
 	 * We double it on the way in to account for
 	 * struct sk_buff etc. overhead.   Applications
 	 * assume that the SO_RCVBUF setting they make will
 	 * allow that much actual data to be received on that
 	 * socket.
 	 *
 	 * Applications are unaware that struct sk_buff and
 	 * other overheads allocate from the receive buffer
 	 * during socket buffer allocation.
 	 *
 	 * And after considering the possible alternatives,
 	 * returning the value we actually used in getsockopt
 	 * is the most desirable behavior.
 	 */
 	if ((val * 2) < SOCK_MIN_RCVBUF)
 		sk->sk_rcvbuf = SOCK_MIN_RCVBUF;
 	else
 		sk->sk_rcvbuf = val * 2;
 
 and
 
 /*
  * Since sk_rmem_alloc sums skb->truesize, even a small frame might need
  * sizeof(sk_buff) + MTU + padding, unless net driver perform copybreak
  */
 #define SOCK_MIN_RCVBUF (2048 + sizeof(struct sk_buff))
 
 Pity - seems to work back on 2.6.26:

Hmm, that code is there at least as far back as 2.6.12.

 raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1 -m 1024
 MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
 localhost (127.0.0.1) port 0 AF_INET : histogram
 Socket  Message  Elapsed  Messages
 SizeSize Time Okay Errors   Throughput
 bytes   bytessecs#  #   10^6bits/sec
 
 1249281024   10.00 2882334  02361.17
256   10.00   0  0.00
 
 raj@tardy:~/netperf2_trunk$ uname -a
 Linux tardy 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64 
 GNU/Linux
 
 Still, even with that (or SIGSTOP) we don't really know where the
 packets were dropped, right?  There is no guarantee they weren't
 dropped before they got to the socket buffer.
 
 happy benchmarking,
 rick jones

Right. Better send to a port with no socket listening there,
that would drop the packet at an early (if not at the earliest
possible)  opportunity.
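
A sketch of that idea (the port number is an assumption -- any port with no
listener will do): each datagram is discarded on the receive side as soon as
no matching socket is found, before any receive buffer is involved.

#include <arpa/inet.h>
#include <string.h>
#include <sys/socket.h>

int main(void)
{
	int s = socket(AF_INET, SOCK_DGRAM, 0);
	char buf[1024] = { 0 };
	struct sockaddr_in dst;

	memset(&dst, 0, sizeof(dst));
	dst.sin_family = AF_INET;
	dst.sin_port = htons(54321);	/* assumed: nothing listens here */
	dst.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

	for (;;)	/* dropped at (close to) the earliest opportunity */
		sendto(s, buf, sizeof(buf), 0,
		       (struct sockaddr *)&dst, sizeof(dst));
}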

 PS - here is with a -S 1024 option:
 
 raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1024 -m 1024
 MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
 localhost (127.0.0.1) port 0 AF_INET : histogram
 Socket  Message  Elapsed  Messages
 SizeSize Time Okay Errors   Throughput
 bytes   bytessecs#  #   10^6bits/sec
 
 1249281024   10.00 1679269  01375.64
   2048   10.00 1490662   1221.13
 
 showing that there is a decent chance that many of the frames were
 dropped at the socket buffer, but not all - I suppose I could/should
 be checking netstat stats... :)
 
 And just a little more, only because I was curious :)
 
 raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1M -m 257
 MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
 localhost (127.0.0.1) port 0 AF_INET : histogram
 Socket  Message  Elapsed  Messages
 SizeSize Time Okay Errors   Throughput
 bytes   bytessecs#  #   10^6bits/sec
 
 124928 257   10.00 1869134  0 384.29
 262142   10.00 1869134384.29
 
 raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1 -m 257
 MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
 localhost (127.0.0.1) port 0 AF_INET : histogram
 Socket  Message  Elapsed  Messages
 SizeSize Time Okay Errors   Throughput
 bytes   bytessecs#  #   10^6bits/sec
 
 124928 257   10.00 3076363  0 632.49
256   10.00   0  0.00

Re: [PATCH 16/16] KVM-GST: adjust scheduler cpu power

2011-01-24 Thread Peter Zijlstra
On Mon, 2011-01-24 at 16:51 -0200, Glauber Costa wrote:
  I would really much rather see you change update_rq_clock_task() and
  subtract your ns resolution steal time from our wall-time,
  update_rq_clock_task() already updates the cpu_power relative to the
  remaining time available.
 
 But then we get into the problems we already discussed in previous
 submissions, which is, we don't really want to alter any notion of
 wallclock time. Could you list any more concrete advantages of the
 specific way you proposed? 

clock_task is the time spent on the task; by not taking steal time into
account, all steal time is accounted as service to whatever task was
current when the vcpu wasn't running.

It doesn't change wall-time in the sense of gtod, only the service time
to tasks.
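
As a toy model of that suggestion (all names and types here are illustrative
assumptions, not the actual scheduler code):

struct rq {
	unsigned long long clock_task;		/* ns credited to tasks */
	unsigned long long prev_steal_time;	/* last steal reading, ns */
};

/* delta: wall-clock ns since the last update; steal_now: cumulative
 * ns-resolution steal time for this cpu, e.g. from the steal-time area
 * this series exposes to the guest. */
static void update_rq_clock_task(struct rq *rq,
				 unsigned long long delta,
				 unsigned long long steal_now)
{
	unsigned long long steal = steal_now - rq->prev_steal_time;

	rq->prev_steal_time = steal_now;
	if (steal > delta)
		steal = delta;	/* can't have been stolen more than passed */

	/* Only the time the vcpu actually ran the task is accounted
	 * as service time; wall time in the sense of gtod is untouched. */
	rq->clock_task += delta - steal;
}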


Re: [PATCH 16/16] KVM-GST: adjust scheduler cpu power

2011-01-24 Thread Peter Zijlstra
On Mon, 2011-01-24 at 16:51 -0200, Glauber Costa wrote:
 
  I thought kvm had a ns resolution steal-time clock?
 Yes, the one I introduced earlier in this series is nsec. However, user
 and system will be accounted in usec at most, so there is no point in
 using nsec here.

Well, the scheduler accounts most things in ns these days -- some archs
even do user/sys time in ns -- using ticks is really very crude, esp
when you've got better information.


Re: [PATCH 16/16] KVM-GST: adjust scheduler cpu power

2011-01-24 Thread Glauber Costa
On Mon, 2011-01-24 at 20:51 +0100, Peter Zijlstra wrote:
 On Mon, 2011-01-24 at 16:51 -0200, Glauber Costa wrote:
   I would really much rather see you change update_rq_clock_task() and
   subtract your ns resolution steal time from our wall-time,
   update_rq_clock_task() already updates the cpu_power relative to the
   remaining time available.
  
  But then we get into the problems we already discussed in previous
  submissions, which is, we don't really want to alter any notion of
  wallclock time. Could you list any more concrete advantages of the
  specific way you proposed? 
 
 clock_task is the time spent on the task; by not taking steal time into
 account, all steal time is accounted as service to whatever task was
 current when the vcpu wasn't running.
 
 It doesn't change wall-time in the sense of gtod, only the service time
 to tasks.
Ok, I'll experiment with that and see how it goes.



Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-24 Thread Blue Swirl
On Mon, Jan 24, 2011 at 2:08 PM, Jan Kiszka jan.kis...@siemens.com wrote:
 On 2011-01-21 19:49, Blue Swirl wrote:
 I'd add a fourth possible class:
  - device, CPU and machine configuration, like nographic,
 win2k_install_hack, no_hpet, smp_cpus etc. Maybe also
 irqchip_in_kernel could fit here, though it obviously depends on a
 host capability too.

 I would count everything that cannot be assigned to a concrete device
 upfront to the dynamic state of a machine, thus class 2. The point is,
 (potentially) every device of that machine requires access to it, just
 like (indirectly, via the KVM core services) to some KVM VM state bits.

 The machine class should not be a catch-all, it would be like
 QEMUState or KVMState then. Perhaps each field or variable should be
 listed and given more thought.

 Let's start with what is most urgent:

  - vmfd: file descriptor required for any KVM request that has VM scope
   (in-kernel device creation, device state synchronizations, IRQ
   routing etc.)

I'd say VM state.

  - irqchip_in_kernel: VM uses in-kernel irqchip acceleration
   (some devices will have to adjust their behavior depending on this)

Since QEMU version is useless, I peeked at qemu-kvm version.

There are a lot of lines like:
if (kvm_enabled() && !kvm_irqchip_in_kernel())
    kvm_just_do_it();

Perhaps these would be cleaner with stub functions.

The device cases are obvious: the devices need a flag, passed to them
by pc.c, which combines kvm_enabled && kvm_irqchip_in_kernel(). This
gets stored in device state.

But exec.c case, where kvm_update_interrupt_request() is called, is
more interesting. CPU init could set up function pointer to either
stub/NULL or kvm_update_interrupt_request().
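
A sketch of that wiring (the hook plumbing is an assumption; only
kvm_update_interrupt_request() is taken from the discussion):

typedef struct CPUState CPUState;

void kvm_update_interrupt_request(CPUState *env);	/* existing KVM path */

static void stub_update_interrupt_request(CPUState *env)
{
	/* nothing to do without the in-kernel irqchip */
}

static void (*update_interrupt_request)(CPUState *env);

static void cpu_init_irq_hook(int irqchip_in_kernel)
{
	/* CPU init picks the implementation once, so hot paths in exec.c
	 * just call the pointer without testing kvm_irqchip_in_kernel(). */
	update_interrupt_request = irqchip_in_kernel
		? kvm_update_interrupt_request
		: stub_update_interrupt_request;
}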

I didn't look at kvm*.c, qemu-kvm*.c or stuff in kvm/.

So I'd eliminate kvm_irqchip_in_kernel() from outside of KVM and pc.c.
The information could be stored in a MachineState, where pc.c could
grab it for device and CPU setup.


Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-24 Thread Jan Kiszka
On 2011-01-24 22:35, Blue Swirl wrote:
 On Mon, Jan 24, 2011 at 2:08 PM, Jan Kiszka jan.kis...@siemens.com wrote:
 On 2011-01-21 19:49, Blue Swirl wrote:
 I'd add a fourth possible class:
  - device, CPU and machine configuration, like nographic,
 win2k_install_hack, no_hpet, smp_cpus etc. Maybe also
 irqchip_in_kernel could fit here, though it obviously depends on a
 host capability too.

 I would count everything that cannot be assigned to a concrete device
 upfront to the dynamic state of a machine, thus class 2. The point is,
 (potentially) every device of that machine requires access to it, just
 like (indirectly, via the KVM core services) to some KVM VM state bits.

 The machine class should not be a catch-all, it would be like
 QEMUState or KVMState then. Perhaps each field or variable should be
 listed and given more thought.

 Let's start with what is most urgent:

  - vmfd: file descriptor required for any KVM request that has VM scope
   (in-kernel device creation, device state synchronizations, IRQ
   routing etc.)
 
 I'd say VM state.

Good. That's +1 for introducing and distributing it.

 
  - irqchip_in_kernel: VM uses in-kernel irqchip acceleration
   (some devices will have to adjust their behavior depending on this)
 
 Since QEMU version is useless, I peeked at qemu-kvm version.
 
 There are a lot of lines like:
 if (kvm_enabled() && !kvm_irqchip_in_kernel())
     kvm_just_do_it();
 
 Perhaps these would be cleaner with stub functions.

Probably. I guess there is quite some room left for cleanups in this area.

 
 The device cases are obvious: the devices need a flag, passed to them
 by pc.c, which combines kvm_enabled && kvm_irqchip_in_kernel(). This
 gets stored in device state.

Not all devices are only instantiated by the machine init code. Even if
we are lucky that all those we need on x86 are created that way, we
shouldn't rely on this for future use cases, including other KVM archs.

 
 But exec.c case, where kvm_update_interrupt_request() is called, is
 more interesting. CPU init could set up function pointer to either
 stub/NULL or kvm_update_interrupt_request().
 

Yes, callbacks are the way to go long term. Here we could also define
one for VCPU interrupt handling and set it according to the VCPU mode.

 I didn't look at kvm*.c, qemu-kvm*.c or stuff in kvm/.
 
 So I'd eliminate kvm_irqchip_in_kernel() from outside of KVM and pc.c.
 The information could be stored in a MachineState, where pc.c could
 grab it for device and CPU setup.

I still don't see how we can distribute the information to all
interested devices. It's basically the same issue as with current kvm_state.

Jan





Re: KVM call agenda for Jan 25

2011-01-24 Thread Anthony Liguori

On 01/24/2011 07:25 AM, Chris Wright wrote:

Please send in any agenda items you are interested in covering.
   


- coroutines for the block layer
- glib everywhere

Regards,

Anthony Liguori


thanks,
-chris




Re: Errors on MMIO read access on VM suspend / resume operations

2011-01-24 Thread Jan Kiszka
On 2011-01-24 19:27, Stefan Berger wrote:
 On 01/18/2011 03:53 AM, Jan Kiszka wrote:
 On 2011-01-18 04:03, Stefan Berger wrote:
 On 01/16/2011 09:43 AM, Avi Kivity wrote:
 On 01/14/2011 09:27 PM, Stefan Berger wrote:
 Can you sprinkle some printfs() arount kvm_run (in qemu-kvm.c) to
 verify this?

 Here's what I did:


 interrupt exit requested
 It appears from this you're using qemu.git.  Please try qemu-kvm.git,
 where the code appears to be correct.

 Cc'ing qemu-devel now. For reference, here the initial problem
 description:

 http://www.spinics.net/lists/kvm/msg48274.html

 I didn't know there was another tree...

 I have seen now a couple of suspends-while-reading with patches applied
 to the qemu-kvm.git tree and indeed, when run with the same host kernel
 and VM I do not see the debugging dumps due to double-reads that I would
 have anticipated seeing by now. Now what? Can this be easily fixed in
 the other Qemu tree as well?
 Please give this a try:

 git://git.kiszka.org/qemu-kvm.git queues/kvm-upstream

 I bet (& hope) "kvm: Unconditionally reenter kernel after IO exits"
 fixes the issue for you. If other problems pop up with that tree, also
 try resetting to that particular commit.

 I'm currently trying to shake all those hidden or forgotten bug fixes
 out of qemu-kvm and port them upstream. Most of those subtle differences
 should hopefully soon be history.

 I did the same test as I did with Avi's tree and haven't seen the
 consequences of possible double-reads. So, I would say that you should
 upstream those patches...
 
 I searched for the text you mention above using 'gitk' but couldn't find
 a patch with that headline in your tree. There were others that seem to
 be related:
 
 Gleb Natapov: do not enter vcpu again if it was stopped during IO

Err, I don't think you checked out queues/kvm-upstream. I bet you just
ran my master branch which is a version of qemu-kvm's master. Am I right? :)

 One thing I'd like to mention is that I have seen what I think are
 interrupt stalls when running my tests inside the qemu-kvm.git tree
 version and not suspending at all. At some point the interrupt counter in
 the guest kernel does not increase anymore even though I see the device
 model raising the IRQ and lowering it. The same tests run literally
 forever in the qemu.git tree version of Qemu.
 What about qemu-kvm and -no-kvm-irqchip?
 That seems to be necessary for both trees, yours and the one Avi pointed
 me to. If applied, then I did not see the interrupt problem.

And the fact that you were able to call qemu from my tree with
-no-kvm-irqchip just underlines my assumption: that switch is refused by
upstream. Please retry with the latest kvm-upstream queue.

Besides that, this other bug you may see in the in-kernel IRQ path - how
can we reproduce it?

Jan





Re: [PATCH 11/16] KVM-HDR: KVM Steal time implementation

2011-01-24 Thread Rik van Riel

On 01/24/2011 01:06 PM, Glauber Costa wrote:

To implement steal time, we need the hypervisor to pass the guest information
about how much time was spent running other processes outside the VM.
This is per-vcpu, and using the kvmclock structure for that is an abuse
we decided not to make.

This patch contains the headers for it. I am keeping it separate to facilitate
backports to people who want to backport the kernel part but not the
hypervisor, or the other way around.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com


Acked-by: Rik van Riel r...@redhat.com


Re: [PATCH 12/16] KVM-HV: KVM Steal time implementation

2011-01-24 Thread Rik van Riel

On 01/24/2011 01:06 PM, Glauber Costa wrote:

To implement steal time, we need the hypervisor to pass the guest information
about how much time was spent running other processes outside the VM.
This is per-vcpu, and using the kvmclock structure for that is an abuse
we decided not to make.

This patch contains the hypervisor part for it. I am keeping it separate from
the headers to facilitate backports to people who want to backport the kernel
part but not the hypervisor, or the other way around.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com


Acked-by: Rik van Riel r...@redhat.com


Re: [PATCH 13/16] KVM-HV: KVM Steal time calculation

2011-01-24 Thread Rik van Riel

On 01/24/2011 01:06 PM, Glauber Costa wrote:

To implement steal time, we need the hypervisor to pass the guest information
about how much time was spent running other processes outside the VM.
We consider time to be potentially stolen every time we schedule out the vcpu,
until we schedule it in again. Whether this is, or is not, accounted
as steal time for the guest is the guest's decision.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com


Reviewed-by: Rik van Riel r...@redhat.com


Re: [PATCH 14/16] KVM-GST: KVM Steal time registration

2011-01-24 Thread Rik van Riel

On 01/24/2011 01:06 PM, Glauber Costa wrote:

Register steal time within KVM. Every time we sample the steal time
information, we update a local variable that records the last value
read. We then account the difference.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com


Reviewed-by: Rik van Riel r...@redhat.com


Re: [PATCH 14/16] KVM-GST: KVM Steal time registration

2011-01-24 Thread Rik van Riel

On 01/24/2011 01:06 PM, Glauber Costa wrote:

Register steal time within KVM. Every time we sample the steal time
information, we update a local variable that records the last value
read. We then account the difference.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com


On second thought - how does this deal with cpu hotplug and
hot unplug?

Do you allocate a new one of these structs every time a cpu
is hot unplugged and then hotplugged, leaking the old one?

Will leaving the old value around confuse the steal time
calculation?


Re: [PATCH 15/16] KVM-GST: KVM Steal time accounting

2011-01-24 Thread Rik van Riel

On 01/24/2011 01:06 PM, Glauber Costa wrote:

This patch accounts steal time in kernel/sched.
I kept it from the last proposal, because I still see advantages
in it: doing it here gives us easier access to scheduler
variables such as the cpu rq. The next patch shows an example of
usage for it.

Since functions like account_idle_time() can be called from
multiple places, not only account_process_tick(), steal time
grabbing is repeated in each account function separately.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com


Reviewed-by: Rik van Riel r...@redhat.com


Re: [PATCH 14/16] KVM-GST: KVM Steal time registration

2011-01-24 Thread Glauber Costa
On Mon, 2011-01-24 at 18:31 -0500, Rik van Riel wrote:
 On 01/24/2011 01:06 PM, Glauber Costa wrote:
  Register steal time within KVM. Every time we sample the steal time
  information, we update a local variable that records the last value
  read. We then account the difference.
 
  Signed-off-by: Glauber Costa glom...@redhat.com
  CC: Rik van Riel r...@redhat.com
  CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
  CC: Peter Zijlstra pet...@infradead.org
  CC: Avi Kivity a...@redhat.com
 
 On second thought - how does this deal with cpu hotplug and
 hot unplug?
 
 Do you allocate a new one of these structs every time a cpu
 is hot unplugged and then hotplugged, leaking the old one?
 
 Will leaving the old value around confuse the steal time
 calculation?

If you look closely, there are no allocations happening at all,
it's all static.
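
That is, something along these lines (a sketch; the exact struct layout is an
assumption):

struct kvm_steal_time {
	unsigned long long steal;	/* ns stolen, updated by the host */
	unsigned int version;
	unsigned int flags;
	unsigned int pad[12];
};

/* Illustrative stand-in for a per-cpu variable: static for the lifetime
 * of the kernel, so a cpu that is unplugged and replugged reuses the
 * same slot and nothing leaks -- but the stale contents survive, which
 * is what the reinitialization question above is about. */
#define NR_CPUS_SKETCH 64
static struct kvm_steal_time steal_time[NR_CPUS_SKETCH];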



Re: [PATCH 14/16] KVM-GST: KVM Steal time registration

2011-01-24 Thread Rik van Riel

On 01/24/2011 08:25 PM, Glauber Costa wrote:

On Mon, 2011-01-24 at 18:31 -0500, Rik van Riel wrote:

On 01/24/2011 01:06 PM, Glauber Costa wrote:

Register steal time within KVM. Every time we sample the steal time
information, we update a local variable that records the last value
read. We then account the difference.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com


On second thought - how does this deal with cpu hotplug and
hot unplug?

Do you allocate a new one of these structs every time a cpu
is hot unplugged and then hotplugged, leaking the old one?

Will leaving the old value around confuse the steal time
calculation?


If you look closely, there are no allocations happening at all,
it's all static.


In that case, does the per-cpu steal area need to be
reinitialized at hotplug time?

--
All rights reversed


Re: [PATCH 14/16] KVM-GST: KVM Steal time registration

2011-01-24 Thread Glauber Costa
On Mon, 2011-01-24 at 20:26 -0500, Rik van Riel wrote:
 On 01/24/2011 08:25 PM, Glauber Costa wrote:
  On Mon, 2011-01-24 at 18:31 -0500, Rik van Riel wrote:
  On 01/24/2011 01:06 PM, Glauber Costa wrote:
   Register steal time within KVM. Every time we sample the steal time
   information, we update a local variable that records the last value
   read. We then account the difference.
 
   Signed-off-by: Glauber Costa glom...@redhat.com
   CC: Rik van Riel r...@redhat.com
   CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
   CC: Peter Zijlstra pet...@infradead.org
   CC: Avi Kivity a...@redhat.com
 
  On second thought - how does this deal with cpu hotplug and
  hot unplug?
 
  Do you allocate a new one of these structs every time a cpu
  is hot unplugged and then hotplugged, leaking the old one?
 
  Will leaving the old value around confuse the steal time
  calculation?
 
  If you look closely, there are no allocations happening at all,
  it's all static.
 
 In that case, does the per-cpu steal area need to be
 reinitialized at hotplug time?
Probably.

I have to look closely at all unregistration scenarios, like
reboot. It's part of my todo list.



Re: [Qemu-devel] Re: Errors on MMIO read access on VM suspend / resume operations

2011-01-24 Thread Stefan Berger

On 01/24/2011 05:34 PM, Jan Kiszka wrote:

On 2011-01-24 19:27, Stefan Berger wrote:

On 01/18/2011 03:53 AM, Jan Kiszka wrote:

On 2011-01-18 04:03, Stefan Berger wrote:

On 01/16/2011 09:43 AM, Avi Kivity wrote:

On 01/14/2011 09:27 PM, Stefan Berger wrote:

Can you sprinkle some printfs() arount kvm_run (in qemu-kvm.c) to
verify this?


Here's what I did:


interrupt exit requested

It appears from this you're using qemu.git.  Please try qemu-kvm.git,
where the code appears to be correct.


Cc'ing qemu-devel now. For reference, here the initial problem
description:

http://www.spinics.net/lists/kvm/msg48274.html

I didn't know there was another tree...

I have seen now a couple of suspends-while-reading with patches applied
to the qemu-kvm.git tree and indeed, when run with the same host kernel
and VM I do not see the debugging dumps due to double-reads that I would
have anticipated seeing by now. Now what? Can this be easily fixed in
the other Qemu tree as well?

Please give this a try:

git://git.kiszka.org/qemu-kvm.git queues/kvm-upstream

I bet (& hope) "kvm: Unconditionally reenter kernel after IO exits"
fixes the issue for you. If other problems pop up with that tree, also
try resetting to that particular commit.

I'm currently trying to shake all those hidden or forgotten bug fixes
out of qemu-kvm and port them upstream. Most of those subtle differences
should hopefully soon be history.


I did the same test as I did with Avi's tree and haven't seen the
consequences of possible double-reads. So, I would say that you should
upstream those patches...

I searched for the text you mention above using 'gitk' but couldn't find
a patch with that headline in your tree. There were others that seem to
be related:

Gleb Natapov: do not enter vcpu again if it was stopped during IO

Err, I don't think you checked out queues/kvm-upstream. I bet you just
ran my master branch which is a version of qemu-kvm's master. Am I right? :)



You're right. :-) My lack of git knowledge - checked out the branch now.

I redid the testing and it passed. No double-reads or lost bytes from
what I could see.



One thing I'd like to mention is that I have seen what I think are
interrupt stalls when running my tests inside the qemu-kvm.git tree
version and not suspending at all. At some point the interrupt counter in
the guest kernel does not increase anymore even though I see the device
model raising the IRQ and lowering it. The same tests run literally
forever in the qemu.git tree version of Qemu.

What about qemu-kvm and -no-kvm-irqchip?

That seems to be necessary for both trees, yours and the one Avi pointed
me to. If applied, then I did not see the interrupt problem.

And the fact that you were able to call qemu from my tree with
-no-kvm-irqchip just underlines my assumption: that switch is refused by
upstream. Please retry with the latest kvm-upstream queue.

Besides that, this other bug you may see in the in-kernel IRQ path - how
can we reproduce it?
Unfortunately I don't know. Some things have to come together for the 
code I am working on to become available and useful for everyone. It's 
going to be a while.


Thanks!
   Stefan

Jan





[PATCH 0/3] Unmapped Page Cache Control (v4)

2011-01-24 Thread Balbir Singh
The following series implements page cache control;
this is a split-out version of patch 1 of version 3 of the
page cache optimization patches posted earlier.
Previous posting: http://lwn.net/Articles/419564/

The previous few revisions received a lot of comments; I've tried to
address as many of those as possible in this revision.

Detailed Description

This patch implements unmapped page cache control via preferred
page cache reclaim. The current patch hooks into kswapd and reclaims
page cache if the user has requested for unmapped page control.
This is useful in the following scenario
- In a virtualized environment with cache=writethrough, we see
  double caching (one in the host and one in the guest). As
  we try to scale guests, cache usage across the system grows.
  The goal of this patch is to reclaim page cache when Linux is running
  as a guest and get the host to hold the page cache and manage it.
  There might be temporary duplication, but in the long run, memory
  in the guests would be used for mapped pages.
- The option is controlled via a boot option and the administrator
  can selectively turn it on, on a need to use basis.

A lot of the code is borrowed from zone_reclaim_mode logic for
__zone_reclaim(). One might argue that with ballooning and
KSM this feature is not very useful, but even with ballooning,
we need extra logic to balloon multiple VM machines and it is hard
to figure out the correct amount of memory to balloon. With these
patches applied, each guest has a sufficient amount of free memory
available, that can be easily seen and reclaimed by the balloon driver.
The additional memory in the guest can be reused for additional
applications or used to start additional guests/balance memory in
the host.

KSM currently does not de-duplicate host and guest page cache. The goal
of this patch is to help automatically balance unmapped page cache when
instructed to do so.

The sysctl for min_unmapped_ratio provides further control from
within the guest on the amount of unmapped pages to reclaim; a similar
max_unmapped_ratio sysctl is added and helps in the decision-making
process of when reclaim should occur. This is tunable and set by
default to 16 (based on tradeoffs seen between aggressiveness in
balancing versus the size of unmapped pages). Distros and administrators
can further tweak this for desired control.

Data from the previous patchsets can be found at
https://lkml.org/lkml/2010/11/30/79


---

Balbir Singh (3):
  Move zone_reclaim() outside of CONFIG_NUMA
  Refactor zone_reclaim code
  Provide control over unmapped pages


 Documentation/kernel-parameters.txt |8 ++
 include/linux/mmzone.h  |9 ++-
 include/linux/swap.h|   23 +--
 init/Kconfig|   12 +++
 kernel/sysctl.c |   29 ++--
 mm/page_alloc.c |   31 -
 mm/vmscan.c |  122 +++
 7 files changed, 202 insertions(+), 32 deletions(-)

-- 
Balbir Singh


[PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA (v4)

2011-01-24 Thread Balbir Singh
This patch moves zone_reclaim and associated helpers
outside CONFIG_NUMA. This infrastructure is reused
in the patches for page cache control that follow.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 include/linux/mmzone.h |4 ++--
 include/linux/swap.h   |4 ++--
 kernel/sysctl.c|   18 +-
 mm/page_alloc.c|6 +++---
 mm/vmscan.c|2 --
 5 files changed, 16 insertions(+), 18 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 02ecb01..2485acc 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -303,12 +303,12 @@ struct zone {
 */
unsigned long   lowmem_reserve[MAX_NR_ZONES];
 
-#ifdef CONFIG_NUMA
-   int node;
/*
 * zone reclaim becomes active if more unmapped pages exist.
 */
unsigned long   min_unmapped_pages;
+#ifdef CONFIG_NUMA
+   int node;
unsigned long   min_slab_pages;
 #endif
struct per_cpu_pageset __percpu *pageset;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 5e3355a..7b75626 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -255,11 +255,11 @@ extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;
 
+extern int sysctl_min_unmapped_ratio;
+extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #ifdef CONFIG_NUMA
 extern int zone_reclaim_mode;
-extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
-extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #else
 #define zone_reclaim_mode 0
 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index bc86bb3..12e8f26 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1224,15 +1224,6 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 	},
 #endif
-#ifdef CONFIG_NUMA
-	{
-		.procname	= "zone_reclaim_mode",
-		.data		= &zone_reclaim_mode,
-		.maxlen		= sizeof(zone_reclaim_mode),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
-		.extra1		= &zero,
-	},
 	{
 		.procname	= "min_unmapped_ratio",
 		.data		= &sysctl_min_unmapped_ratio,
@@ -1242,6 +1233,15 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one_hundred,
 	},
+#ifdef CONFIG_NUMA
+	{
+		.procname	= "zone_reclaim_mode",
+		.data		= &zone_reclaim_mode,
+		.maxlen		= sizeof(zone_reclaim_mode),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+		.extra1		= &zero,
+	},
 	{
 		.procname	= "min_slab_ratio",
 		.data		= &sysctl_min_slab_ratio,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index aede3a4..7b56473 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4167,10 +4167,10 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 
 	zone->spanned_pages = size;
 	zone->present_pages = realsize;
-#ifdef CONFIG_NUMA
-	zone->node = nid;
 	zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)
 						/ 100;
+#ifdef CONFIG_NUMA
+	zone->node = nid;
 	zone->min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100;
 #endif
 	zone->name = zone_names[j];
@@ -5084,7 +5084,6 @@ int min_free_kbytes_sysctl_handler(ctl_table *table, int write,
 	return 0;
 }
 
-#ifdef CONFIG_NUMA
 int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write,
 	void __user *buffer, size_t *length, loff_t *ppos)
 {
@@ -5101,6 +5100,7 @@ int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write,
 	return 0;
 }
 
+#ifdef CONFIG_NUMA
 int sysctl_min_slab_ratio_sysctl_handler(ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos)
 {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 47a5096..5899f2f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2868,7 +2868,6 @@ static int __init kswapd_init(void)
 
 module_init(kswapd_init)
 
-#ifdef CONFIG_NUMA
 /*
  * Zone reclaim mode
  *
@@ -3078,7 +3077,6 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 
return ret;
 }
-#endif
 
 /*
  * page_evictable - test whether a page is evictable



[PATCH 1/2] Refactor zone_reclaim code (v4)

2011-01-24 Thread Balbir Singh
Changelog v3
1. Renamed zone_reclaim_unmapped_pages to zone_reclaim_pages

Refactor zone_reclaim, move reusable functionality outside
of zone_reclaim. Make zone_reclaim_unmapped_pages modular

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
Reviewed-by: Christoph Lameter c...@linux.com
---
 mm/vmscan.c |   35 +++
 1 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5899f2f..02cc82e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2943,6 +2943,27 @@ static long zone_pagecache_reclaimable(struct zone *zone)
 }
 
 /*
+ * Helper function to reclaim unmapped pages, we might add something
+ * similar to this for slab cache as well. Currently this function
+ * is shared with __zone_reclaim()
+ */
+static inline void
+zone_reclaim_pages(struct zone *zone, struct scan_control *sc,
+   unsigned long nr_pages)
+{
+   int priority;
+   /*
+* Free memory by calling shrink zone with increasing
+* priorities until we have enough memory freed.
+*/
+   priority = ZONE_RECLAIM_PRIORITY;
+   do {
+   shrink_zone(priority, zone, sc);
+   priority--;
+	} while (priority >= 0 && sc->nr_reclaimed < nr_pages);
+}
+
+/*
  * Try to free up some pages from this zone through reclaim.
  */
 static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
@@ -2951,7 +2972,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	const unsigned long nr_pages = 1 << order;
 	struct task_struct *p = current;
 	struct reclaim_state reclaim_state;
-	int priority;
 	struct scan_control sc = {
 		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
 		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
@@ -2975,17 +2995,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
-		/*
-		 * Free memory by calling shrink zone with increasing
-		 * priorities until we have enough memory freed.
-		 */
-		priority = ZONE_RECLAIM_PRIORITY;
-		do {
-			shrink_zone(priority, zone, &sc);
-			priority--;
-		} while (priority >= 0 && sc.nr_reclaimed < nr_pages);
-	}
+	if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages)
+		zone_reclaim_pages(zone, &sc, nr_pages);
 
 	nr_slab_pages0 = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
 	if (nr_slab_pages0 > zone->min_slab_pages) {



[PATCH 3/3] Provide control over unmapped pages (v4)

2011-01-24 Thread Balbir Singh
Changelog v4
1. Add max_unmapped_ratio and use that as the upper limit
to check when to shrink the unmapped page cache (Christoph
Lameter)

Changelog v2
1. Use a config option to enable the code (Andrew Morton)
2. Explain the magic tunables in the code or at least attempt
   to explain them (General comment)
3. Hint uses of the boot parameter with unlikely (Andrew Morton)
4. Use better names (balanced is not a good naming convention)

Provide control using zone_reclaim() and a boot parameter. The
code reuses functionality from zone_reclaim() to isolate unmapped
pages and reclaim them as a priority, ahead of other mapped pages.
A new sysctl for max_unmapped_ratio is provided and set to 16,
indicating that once 16% of the total zone pages are unmapped, we start
shrinking unmapped page cache.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 Documentation/kernel-parameters.txt |8 +++
 include/linux/mmzone.h  |5 ++
 include/linux/swap.h|   23 -
 init/Kconfig|   12 +
 kernel/sysctl.c |   11 
 mm/page_alloc.c |   25 ++
 mm/vmscan.c |   87 +++
 7 files changed, 166 insertions(+), 5 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index fee5f57..65a4ee6 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2500,6 +2500,14 @@ and is between 256 and 4096 characters. It is defined in the file
[X86]
Set unknown_nmi_panic=1 early on boot.
 
+	unmapped_page_control
+			[KNL] Available if CONFIG_UNMAPPED_PAGECACHE_CONTROL
+			is enabled. It controls the amount of unmapped memory
+			that is present in the system. This boot option plus
+			vm.min_unmapped_ratio (sysctl) provide granular control
+			over how much unmapped page cache can exist in the system
+			before kswapd starts reclaiming unmapped page cache pages.
+
usbcore.autosuspend=
[USB] The autosuspend time delay (in seconds) used
for newly-detected USB devices (default 2).  This
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2485acc..18f0f09 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -306,7 +306,10 @@ struct zone {
/*
 * zone reclaim becomes active if more unmapped pages exist.
 */
+#if defined(CONFIG_UNMAPPED_PAGE_CONTROL) || defined(CONFIG_NUMA)
unsigned long   min_unmapped_pages;
+   unsigned long   max_unmapped_pages;
+#endif
 #ifdef CONFIG_NUMA
int node;
unsigned long   min_slab_pages;
@@ -773,6 +776,8 @@ int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
 int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
+int sysctl_max_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
+					void __user *, size_t *, loff_t *);
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7b75626..ae62a03 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -255,19 +255,34 @@ extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;
 
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
 extern int sysctl_min_unmapped_ratio;
+extern int sysctl_max_unmapped_ratio;
+
 extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
-#ifdef CONFIG_NUMA
-extern int zone_reclaim_mode;
-extern int sysctl_min_slab_ratio;
 #else
-#define zone_reclaim_mode 0
 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
 {
return 0;
 }
 #endif
 
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+extern bool should_reclaim_unmapped_pages(struct zone *zone);
+#else
+static inline bool should_reclaim_unmapped_pages(struct zone *zone)
+{
+   return false;
+}
+#endif
+
+#ifdef CONFIG_NUMA
+extern int zone_reclaim_mode;
+extern int sysctl_min_slab_ratio;
+#else
+#define zone_reclaim_mode 0
+#endif
+
 extern int page_evictable(struct page *page, struct vm_area_struct *vma);
 extern void scan_mapping_unevictable_pages(struct address_space *);
 
diff --git a/init/Kconfig b/init/Kconfig
index 4f6cdbf..2dfbc09 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -828,6 +828,18 @@ config SCHED_AUTOGROUP
 config MM_OWNER
bool
 
+config UNMAPPED_PAGECACHE_CONTROL
+	bool "Provide control over unmapped page cache"
+   
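
The message is truncated here; from the declarations shown above, a plausible
sketch of the new helper (an assumption, not necessarily the posted code --
zone_unmapped_file_pages() is a stand-in name):

static int unmapped_page_control __read_mostly;	/* set by the boot param */

bool should_reclaim_unmapped_pages(struct zone *zone)
{
	if (likely(!unmapped_page_control))
		return false;
	/* Reclaim kicks in once unmapped page cache exceeds the
	 * threshold derived from the new max_unmapped_ratio sysctl,
	 * and __zone_reclaim() works back toward min_unmapped_pages. */
	return zone_unmapped_file_pages(zone) > zone->max_unmapped_pages;
}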

Re: [PATCH 1/2] Refactor zone_reclaim code (v4)

2011-01-24 Thread Balbir Singh
* Balbir Singh bal...@linux.vnet.ibm.com [2011-01-25 10:40:09]:

 Changelog v3
 1. Renamed zone_reclaim_unmapped_pages to zone_reclaim_pages
 
 Refactor zone_reclaim, move reusable functionality outside
 of zone_reclaim. Make zone_reclaim_unmapped_pages modular
 
 Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
 Reviewed-by: Christoph Lameter c...@linux.com

I got the patch numbering wrong due to an internet connection going down
in the middle of stg mail; restarting with specified patches goofed up
the numbering. I can resend the patches with the correct numbering if
desired. This patch should be numbered 2/3.

-- 
Three Cheers,
Balbir


Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts

2011-01-24 Thread Alex Williamson
On Mon, 2011-01-24 at 08:44 -0700, Alex Williamson wrote:
 I'll look at how we might be
 able to allocate slots on demand.  Thanks,

Here's a first cut just to see if this looks agreeable.  This allows the
slot array to grow on demand.  This works with current userspace, as
well as userspace trivially modified to double KVMState.slots and
hotplugging enough pci-assign devices to exceed the previous limit (w/ &
w/o ept).  Hopefully I got the rcu bits correct.  Does this look like
the right path?  If so, I'll work on removing the fixed limit from
userspace next.  Thanks,

Alex


kvm: Allow memory slot array to grow on demand

Remove fixed KVM_MEMORY_SLOTS limit, allowing the slot array
to grow on demand.  Private slots are now allocated at the
front instead of the end.  Only x86 seems to use private slots,
so this is now zero for all other archs.  The memslots pointer
is already updated using rcu, so changing the size of the
array when it's replaced is straightforward.  x86 also keeps
a bitmap of slots used by a kvm_mmu_page, which requires a
shadow tlb flush whenever we increase the number of slots.
This forces the pages to be rebuilt with the new bitmap size.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

 arch/ia64/include/asm/kvm_host.h|4 --
 arch/ia64/kvm/kvm-ia64.c|2 +
 arch/powerpc/include/asm/kvm_host.h |3 --
 arch/s390/include/asm/kvm_host.h|3 --
 arch/x86/include/asm/kvm_host.h |3 +-
 arch/x86/include/asm/vmx.h  |6 ++-
 arch/x86/kvm/mmu.c  |7 +++-
 arch/x86/kvm/x86.c  |6 ++-
 include/linux/kvm_host.h|7 +++-
 virt/kvm/kvm_main.c |   65 ---
 10 files changed, 63 insertions(+), 43 deletions(-)


diff --git a/arch/ia64/include/asm/kvm_host.h b/arch/ia64/include/asm/kvm_host.h
index 2689ee5..11d0ab2 100644
--- a/arch/ia64/include/asm/kvm_host.h
+++ b/arch/ia64/include/asm/kvm_host.h
@@ -23,10 +23,6 @@
 #ifndef __ASM_KVM_HOST_H
 #define __ASM_KVM_HOST_H
 
-#define KVM_MEMORY_SLOTS 32
-/* memory slots that does not exposed to userspace */
-#define KVM_PRIVATE_MEM_SLOTS 4
-
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
 
 /* define exit reasons from vmm to kvm*/
diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
index 70d224d..f1adda2 100644
--- a/arch/ia64/kvm/kvm-ia64.c
+++ b/arch/ia64/kvm/kvm-ia64.c
@@ -1814,7 +1814,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 	mutex_lock(&kvm->slots_lock);
 
 	r = -EINVAL;
-	if (log->slot >= KVM_MEMORY_SLOTS)
+	if (log->slot >= kvm->memslots->nmemslots)
 		goto out;
 
 	memslot = &kvm->memslots->memslots[log->slot];
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index bba3b9b..dc80057 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -29,9 +29,6 @@
 #include asm/kvm_asm.h
 
 #define KVM_MAX_VCPUS 1
-#define KVM_MEMORY_SLOTS 32
-/* memory slots that does not exposed to userspace */
-#define KVM_PRIVATE_MEM_SLOTS 4
 
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
 
diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index cef7dbf..92a964c 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -20,9 +20,6 @@
 #include asm/cpu.h
 
 #define KVM_MAX_VCPUS 64
-#define KVM_MEMORY_SLOTS 32
-/* memory slots that does not exposed to userspace */
-#define KVM_PRIVATE_MEM_SLOTS 4
 
 struct sca_entry {
atomic_t scn;
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ffd7f8d..df1382c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -27,7 +27,6 @@
 #include asm/msr-index.h
 
 #define KVM_MAX_VCPUS 64
-#define KVM_MEMORY_SLOTS 32
 /* memory slots that does not exposed to userspace */
 #define KVM_PRIVATE_MEM_SLOTS 4
 
@@ -207,7 +206,7 @@ struct kvm_mmu_page {
 * One bit set per slot which has memory
 * in this shadow page.
 */
-   DECLARE_BITMAP(slot_bitmap, KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS);
+   unsigned long *slot_bitmap;
bool multimapped; /* More than one parent_pte? */
bool unsync;
int root_count;  /* Currently serving as active root */
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 84471b8..7fd8c89 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -370,9 +370,9 @@ enum vmcs_field {
 
 #define AR_RESERVD_MASK 0xfffe0f00
 
-#define TSS_PRIVATE_MEMSLOT(KVM_MEMORY_SLOTS + 0)
-#define APIC_ACCESS_PAGE_PRIVATE_MEMSLOT   (KVM_MEMORY_SLOTS + 1)
-#define IDENTITY_PAGETABLE_PRIVATE_MEMSLOT (KVM_MEMORY_SLOTS + 2)
+#define TSS_PRIVATE_MEMSLOT0
+#define APIC_ACCESS_PAGE_PRIVATE_MEMSLOT   1
+#define IDENTITY_PAGETABLE_PRIVATE_MEMSLOT 2
 
 #define VMX_NR_VPIDS 
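
The patch is cut off above; the core of the idea -- publish a bigger copy of
the slot array with an RCU-style pointer swap so readers never see a
half-grown array -- looks roughly like this toy model (names and layout are
assumptions, not the actual patch):

#include <stdlib.h>
#include <string.h>

struct kvm_memory_slot {
	unsigned long base_gfn;
	unsigned long npages;
};

struct kvm_memslots {
	int nmemslots;
	struct kvm_memory_slot *memslots;	/* nmemslots entries */
};

static struct kvm_memslots *grow_memslots(struct kvm_memslots *old, int want)
{
	struct kvm_memslots *new = malloc(sizeof(*new));

	if (!new)
		return NULL;
	new->nmemslots = want;
	new->memslots = calloc(want, sizeof(*new->memslots));
	if (!new->memslots) {
		free(new);
		return NULL;
	}
	memcpy(new->memslots, old->memslots,
	       old->nmemslots * sizeof(*old->memslots));
	/* In the kernel this is where rcu_assign_pointer() would publish
	 * the new array and synchronize_srcu() would let readers drain
	 * before the old one is freed. */
	return new;
}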

Re: [Qemu-devel] Re: Errors on MMIO read access on VM suspend / resume operations

2011-01-24 Thread Jan Kiszka
On 2011-01-25 04:13, Stefan Berger wrote:
 On 01/24/2011 05:34 PM, Jan Kiszka wrote:
 On 2011-01-24 19:27, Stefan Berger wrote:
 On 01/18/2011 03:53 AM, Jan Kiszka wrote:
 On 2011-01-18 04:03, Stefan Berger wrote:
 On 01/16/2011 09:43 AM, Avi Kivity wrote:
 On 01/14/2011 09:27 PM, Stefan Berger wrote:
 Can you sprinkle some printfs() arount kvm_run (in qemu-kvm.c) to
 verify this?

 Here's what I did:


 interrupt exit requested
 It appears from this you're using qemu.git.  Please try qemu-kvm.git,
 where the code appears to be correct.

 Cc'ing qemu-devel now. For reference, here the initial problem
 description:

 http://www.spinics.net/lists/kvm/msg48274.html

 I didn't know there was another tree...

 I have seen now a couple of suspends-while-reading with patches
 applied
 to the qemu-kvm.git tree and indeed, when run with the same host
 kernel
 and VM I do not see the debugging dumps due to double-reads that I
 would
 have anticipated seeing by now. Now what? Can this be easily fixed in
 the other Qemu tree as well?
 Please give this a try:

 git://git.kiszka.org/qemu-kvm.git queues/kvm-upstream

 I bet (& hope) "kvm: Unconditionally reenter kernel after IO exits"
 fixes the issue for you. If other problems pop up with that tree, also
 try resetting to that particular commit.

 I'm currently trying to shake all those hidden or forgotten bug fixes
 out of qemu-kvm and port them upstream. Most of those subtle
 differences
 should hopefully soon be history.

 I did the same test as I did with Avi's tree and haven't seen the
 consequences of possible double-reads. So, I would say that you should
 upstream those patches...

 I searched for the text you mention above using 'gitk' but couldn't find
 a patch with that headline in your tree. There were others that seem to
 be related:

 Gleb Natapov: do not enter vcpu again if it was stopped during IO
 Err, I don't think you checked out queues/kvm-upstream. I bet you just
 ran my master branch which is a version of qemu-kvm's master. Am I
 right? :)

 
 You're right. :-) My lack of git knowledge - checked out the branch now.
 
 I redid the testing and it passed. No double-reads or lost bytes from
 what I could see.

Great, thanks.

 
 One thing I'd like to mention is that I have seen what I think are
 interrupt stalls when running my tests inside the qemu-kvm.git tree
 version and not suspending at all. At some point the interrupt
 counter in
 the guest kernel does not increase anymore even though I see the
 device
 model raising the IRQ and lowering it. The same tests run literally
 forever in the qemu.git tree version of Qemu.
 What about qemu-kvm and -no-kvm-irqchip?
 That seems to be necessary for both trees, yours and the one Avi pointed
 me to. If applied, then I did not see the interrupt problem.
 And the fact that you were able to call qemu from my tree with
 -no-kvm-irqchip just underlines my assumption: that switch is refused by
 upstream. Please retry with the latest kvm-upstream queue.

 Besides that, this other bug you may see in the in-kernel IRQ path - how
 can we reproduce it?
 Unfortunately I don't know. Some things have to come together for the
 code I am working on to become available and useful for everyone. It's
 going to be a while.

Do you see a chance to look closer at the issue yourself? E.g.
instrument the kernel's irqchip models and dump their states once your
guest is stuck?

Jan





Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts

2011-01-24 Thread Jan Kiszka
On 2011-01-25 06:37, Alex Williamson wrote:
 On Mon, 2011-01-24 at 08:44 -0700, Alex Williamson wrote:
 I'll look at how we might be
 able to allocate slots on demand.  Thanks,
 
 Here's a first cut just to see if this looks agreeable.  This allows the
 slot array to grow on demand.  This works with current userspace, as
 well as userspace trivially modified to double KVMState.slots and
 hotplugging enough pci-assign devices to exceed the previous limit (w/ &
 w/o ept).  Hopefully I got the rcu bits correct.  Does this look like
 the right path?  If so, I'll work on removing the fixed limit from
 userspace next.  Thanks,
 
 Alex
 
 
 kvm: Allow memory slot array to grow on demand
 
 Remove fixed KVM_MEMORY_SLOTS limit, allowing the slot array
 to grow on demand.  Private slots are now allocated at the
 front instead of the end.  Only x86 seems to use private slots,

Hmm, doesn't current user space expect slots 8..11 to be the private
ones and wouldn't it cause troubles if slots 0..3 are suddenly reserved?

 so this is now zero for all other archs.  The memslots pointer
 is already updated using rcu, so changing the size of the
 array when it's replaced is straightforward.  x86 also keeps
 a bitmap of slots used by a kvm_mmu_page, which requires a
 shadow tlb flush whenever we increase the number of slots.
 This forces the pages to be rebuilt with the new bitmap size.

Is it possible for user space to increase the slot number to ridiculous
amounts (at least as far as kmalloc allows) and then trigger a kernel
walk through them in non-preemptible contexts? Just wondering, I haven't
checked all contexts of functions like kvm_is_visible_gfn yet.

If yes, we should already switch to rbtree or something like that.
Otherwise that may wait a bit, but probably not too long.

Jan


