[COMMIT master] qemu-kvm: Drop redundant kvm_reset_mpstate

2011-01-13 Thread Avi Kivity
From: Jan Kiszka jan.kis...@siemens.com

kvm_arch_reset_vcpu includes the same logic (minus the obsolete feature
check), and every caller of kvm_reset_mpstate also calls that function.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index 672bcbf..2f1a090 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -564,20 +564,6 @@ static void kvm_arch_load_mpstate(CPUState *env)
 #endif
 }
 
-static void kvm_reset_mpstate(CPUState *env)
-{
-#ifdef KVM_CAP_MP_STATE
-    if (kvm_check_extension(kvm_state, KVM_CAP_MP_STATE)) {
-        if (kvm_irqchip_in_kernel()) {
-            env->mp_state = cpu_is_bsp(env) ? KVM_MP_STATE_RUNNABLE :
-                                              KVM_MP_STATE_UNINITIALIZED;
-        } else {
-            env->mp_state = KVM_MP_STATE_RUNNABLE;
-        }
-    }
-#endif
-}
-
 #define XSAVE_CWD_RIP 2
 #define XSAVE_CWD_RDP 4
 #define XSAVE_MXCSR   6
@@ -652,7 +638,6 @@ static int _kvm_arch_init_vcpu(CPUState *env)
 #ifdef KVM_EXIT_TPR_ACCESS
 kvm_enable_tpr_access_reporting(env);
 #endif
-kvm_reset_mpstate(env);
 return 0;
 }
 
@@ -761,7 +746,6 @@ void kvm_arch_cpu_reset(CPUState *env)
 {
 kvm_reset_msrs(env);
 kvm_arch_reset_vcpu(env);
-kvm_reset_mpstate(env);
 }
 
 #ifdef CONFIG_KVM_DEVICE_ASSIGNMENT


[COMMIT master] Merge branch 'upstream-merge'

2011-01-13 Thread Avi Kivity
From: Marcelo Tosatti mtosa...@redhat.com

* upstream-merge: (279 commits)
  target-ppc: Implement correct NaN propagation rules
  target-mips: Implement correct NaN propagation rules
  softfloat: use float{32,64,x80,128}_maybe_silence_nan()
  softfloat: add float{x80,128}_maybe_silence_nan()
  softfloat: fix float{32,64}_maybe_silence_nan() for MIPS
  softfloat: rename *IsNaN variables to *IsQuietNaN
  softfloat: remove HPPA specific code
  target-ppc: use float32_is_any_nan()
  target-ppc: fix default qNaN
  target-ppc: remove PRECISE_EMULATION define
  microblaze: Use more TB chaining
  cirrus_vga: fix division by 0 for color expansion rop
  Fix curses on big endian hosts
  noaudio: correctly account acquired samples
  target-arm: Implement correct NaN propagation rules
  softfloat: abstract out target-specific NaN propagation rules
  softfloat: Rename float*_is_nan() functions to float*_is_quiet_nan()
  TCG: Improve tb_phys_hash_func()
  target-arm: fix UMAAL instruction
  Fix translation of unary PPC/SPE instructions (efdneg etc.).
  ...

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com


[COMMIT master] remove qemu-kvm.h inclusion from monitor.c

2011-01-13 Thread Avi Kivity
From: Marcelo Tosatti mtosa...@redhat.com

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/monitor.c b/monitor.c
index cf1d3d0..40768bb 100644
--- a/monitor.c
+++ b/monitor.c
@@ -61,7 +61,6 @@
 #include trace.h
 #endif
 #include ui/qemu-spice.h
-#include qemu-kvm.h
 
 //#define DEBUG
 //#define DEBUG_COMPLETION


[COMMIT master] pci-assign: Fix transition MSI-INTx

2011-01-13 Thread Avi Kivity
From: Jan Kiszka jan.kis...@siemens.com

Make sure to re-register the IRQ of an assigned device as INTx when the
guest disables MSI[X] mode again.

Acked-by: Michael S. Tsirkin m...@redhat.com
Acked-by: Alex Williamson alex.william...@redhat.com
Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index a25f3e0..e97f565 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1136,7 +1136,10 @@ static void assigned_dev_update_msi(PCIDevice *pci_dev, unsigned int ctrl_pos)
         if (kvm_assign_irq(kvm_context, &assigned_irq_data) < 0)
             perror("assigned_dev_enable_msi: assign irq");
 
+        assigned_dev->girq = -1;
         assigned_dev->irq_requested_type = assigned_irq_data.flags;
+    } else {
+        assign_irq(assigned_dev);
     }
 }
 #endif
@@ -1276,7 +1279,10 @@ static void assigned_dev_update_msix(PCIDevice *pci_dev, unsigned int ctrl_pos)
             perror("assigned_dev_enable_msix: assign irq");
             return;
         }
+        assigned_dev->girq = -1;
         assigned_dev->irq_requested_type = assigned_irq_data.flags;
+    } else {
+        assign_irq(assigned_dev);
     }
 }
 #endif
 #endif


[COMMIT master] KVM: MMU: Don't flush shadow when enabling dirty tracking

2011-01-13 Thread Avi Kivity
From: Avi Kivity a...@redhat.com

Instead, drop large mappings, which were the reason we dropped shadow.

Signed-off-by: Avi Kivity a...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9cafbb4..772d212 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3445,14 +3445,18 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
 		if (!test_bit(slot, sp->slot_bitmap))
 			continue;
 
-		if (sp->role.level != PT_PAGE_TABLE_LEVEL)
-			continue;
-
 		pt = sp->spt;
-		for (i = 0; i < PT64_ENT_PER_PAGE; ++i)
+		for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
+			if (sp->role.level != PT_PAGE_TABLE_LEVEL
+			    && is_large_pte(pt[i])) {
+				drop_spte(kvm, &pt[i],
+					  shadow_trap_nonpresent_pte);
+				--kvm->stat.lpages;
+			}
 			/* avoid RMW */
 			if (is_writable_pte(pt[i]))
 				update_spte(&pt[i], pt[i] & ~PT_WRITABLE_MASK);
+		}
 	}
 	kvm_flush_remote_tlbs(kvm);
 }
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b1b6cbb..b3bfeb8 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -586,7 +586,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 			    struct kvm_userspace_memory_region *mem,
 			    int user_alloc)
 {
-	int r, flush_shadow = 0;
+	int r;
 	gfn_t base_gfn;
 	unsigned long npages;
 	unsigned long i;
@@ -706,8 +706,6 @@ skip_lpage:
 		if (kvm_create_dirty_bitmap(&new) < 0)
 			goto out_free;
 		/* destroy any largepage mappings for dirty tracking */
-		if (old.npages)
-			flush_shadow = 1;
 	}
 #else  /* not defined CONFIG_S390 */
 	new.user_alloc = user_alloc;
@@ -778,9 +776,6 @@ skip_lpage:
 	kvm_free_physmem_slot(&old, &new);
 	kfree(old_memslots);
 
-	if (flush_shadow)
-		kvm_arch_flush_shadow(kvm);
-
 	return 0;
 
 out_free:


[COMMIT master] KVM: PPC: Fix SPRG get/set for Book3S and BookE

2011-01-13 Thread Avi Kivity
From: Peter Tyser pty...@xes-inc.com

Previously, SPRGs 4-7 were read from and written to the wrong vcpu fields
(shifted by one register) in kvm_arch_vcpu_ioctl_get_regs() and
kvm_arch_vcpu_ioctl_set_regs().

Signed-off-by: Alexander Graf ag...@suse.de
Signed-off-by: Peter Tyser pty...@xes-inc.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index badc983..c961de4 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -1141,9 +1141,10 @@ int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 	regs->sprg1 = vcpu->arch.shared->sprg1;
 	regs->sprg2 = vcpu->arch.shared->sprg2;
 	regs->sprg3 = vcpu->arch.shared->sprg3;
-	regs->sprg5 = vcpu->arch.sprg4;
-	regs->sprg6 = vcpu->arch.sprg5;
-	regs->sprg7 = vcpu->arch.sprg6;
+	regs->sprg4 = vcpu->arch.sprg4;
+	regs->sprg5 = vcpu->arch.sprg5;
+	regs->sprg6 = vcpu->arch.sprg6;
+	regs->sprg7 = vcpu->arch.sprg7;
 
 	for (i = 0; i < ARRAY_SIZE(regs->gpr); i++)
 		regs->gpr[i] = kvmppc_get_gpr(vcpu, i);
@@ -1167,9 +1168,10 @@ int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 	vcpu->arch.shared->sprg1 = regs->sprg1;
 	vcpu->arch.shared->sprg2 = regs->sprg2;
 	vcpu->arch.shared->sprg3 = regs->sprg3;
-	vcpu->arch.sprg5 = regs->sprg4;
-	vcpu->arch.sprg6 = regs->sprg5;
-	vcpu->arch.sprg7 = regs->sprg6;
+	vcpu->arch.sprg4 = regs->sprg4;
+	vcpu->arch.sprg5 = regs->sprg5;
+	vcpu->arch.sprg6 = regs->sprg6;
+	vcpu->arch.sprg7 = regs->sprg7;
 
 	for (i = 0; i < ARRAY_SIZE(regs->gpr); i++)
 		kvmppc_set_gpr(vcpu, i, regs->gpr[i]);
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 77575d0..ef76acb 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -546,9 +546,10 @@ int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 	regs->sprg1 = vcpu->arch.shared->sprg1;
 	regs->sprg2 = vcpu->arch.shared->sprg2;
 	regs->sprg3 = vcpu->arch.shared->sprg3;
-	regs->sprg5 = vcpu->arch.sprg4;
-	regs->sprg6 = vcpu->arch.sprg5;
-	regs->sprg7 = vcpu->arch.sprg6;
+	regs->sprg4 = vcpu->arch.sprg4;
+	regs->sprg5 = vcpu->arch.sprg5;
+	regs->sprg6 = vcpu->arch.sprg6;
+	regs->sprg7 = vcpu->arch.sprg7;
 
 	for (i = 0; i < ARRAY_SIZE(regs->gpr); i++)
 		regs->gpr[i] = kvmppc_get_gpr(vcpu, i);
@@ -572,9 +573,10 @@ int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 	vcpu->arch.shared->sprg1 = regs->sprg1;
 	vcpu->arch.shared->sprg2 = regs->sprg2;
 	vcpu->arch.shared->sprg3 = regs->sprg3;
-	vcpu->arch.sprg5 = regs->sprg4;
-	vcpu->arch.sprg6 = regs->sprg5;
-	vcpu->arch.sprg7 = regs->sprg6;
+	vcpu->arch.sprg4 = regs->sprg4;
+	vcpu->arch.sprg5 = regs->sprg5;
+	vcpu->arch.sprg6 = regs->sprg6;
+	vcpu->arch.sprg7 = regs->sprg7;
 
 	for (i = 0; i < ARRAY_SIZE(regs->gpr); i++)
 		kvmppc_set_gpr(vcpu, i, regs->gpr[i]);


[COMMIT master] KVM guest: Fix section mismatch derived from kvm_guest_cpu_online()

2011-01-13 Thread Avi Kivity
From: Sedat Dilek sedat.di...@googlemail.com

WARNING: arch/x86/built-in.o(.text+0x1bb74): Section mismatch in reference from 
the function kvm_guest_cpu_online() to the function 
.cpuinit.text:kvm_guest_cpu_init()
The function kvm_guest_cpu_online() references
the function __cpuinit kvm_guest_cpu_init().
This is often because kvm_guest_cpu_online lacks a __cpuinit
annotation or the annotation of kvm_guest_cpu_init is wrong.

This patch fixes the warning.

Tested with linux-next (next-20101231)

Signed-off-by: Sedat Dilek sedat.di...@gmail.com
Acked-by: Rik van Riel r...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 8dc4466..33c07b0 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -493,7 +493,7 @@ static void __init kvm_smp_prepare_boot_cpu(void)
native_smp_prepare_boot_cpu();
 }
 
-static void kvm_guest_cpu_online(void *dummy)
+static void __cpuinit kvm_guest_cpu_online(void *dummy)
 {
kvm_guest_cpu_init();
 }


[COMMIT master] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6

2011-01-13 Thread Avi Kivity
From: Marcelo Tosatti mtosa...@redhat.com

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com


[COMMIT master] KVM: VMX: Avoid leaking fake realmode state to userspace

2011-01-13 Thread Avi Kivity
From: Avi Kivity a...@redhat.com

When emulating real mode, we fake some state:

 - tr.base points to a fake vm86 tss
 - segment registers are made to conform to vm86 restrictions

Change vmx_get_segment() not to expose this fake state to userspace;
instead, return the original state.

Signed-off-by: Avi Kivity a...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a2e83a9..87ad551 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2032,23 +2032,40 @@ static void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 	vmcs_writel(GUEST_CR4, hw_cr4);
 }
 
-static u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
-{
-	struct kvm_vmx_segment_field *sf = &kvm_vmx_segment_fields[seg];
-
-	return vmcs_readl(sf->base);
-}
-
 static void vmx_get_segment(struct kvm_vcpu *vcpu,
 			    struct kvm_segment *var, int seg)
 {
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	struct kvm_vmx_segment_field *sf = &kvm_vmx_segment_fields[seg];
+	struct kvm_save_segment *save;
 	u32 ar;
 
+	if (vmx->rmode.vm86_active
+	    && (seg == VCPU_SREG_TR || seg == VCPU_SREG_ES
+		|| seg == VCPU_SREG_DS || seg == VCPU_SREG_FS
+		|| seg == VCPU_SREG_GS)
+	    && !emulate_invalid_guest_state) {
+		switch (seg) {
+		case VCPU_SREG_TR: save = &vmx->rmode.tr; break;
+		case VCPU_SREG_ES: save = &vmx->rmode.es; break;
+		case VCPU_SREG_DS: save = &vmx->rmode.ds; break;
+		case VCPU_SREG_FS: save = &vmx->rmode.fs; break;
+		case VCPU_SREG_GS: save = &vmx->rmode.gs; break;
+		default: BUG();
+		}
+		var->selector = save->selector;
+		var->base = save->base;
+		var->limit = save->limit;
+		ar = save->ar;
+		if (seg == VCPU_SREG_TR
+		    || var->selector == vmcs_read16(sf->selector))
+			goto use_saved_rmode_seg;
+	}
 	var->base = vmcs_readl(sf->base);
 	var->limit = vmcs_read32(sf->limit);
 	var->selector = vmcs_read16(sf->selector);
 	ar = vmcs_read32(sf->ar_bytes);
+use_saved_rmode_seg:
 	if ((ar & AR_UNUSABLE_MASK) && !emulate_invalid_guest_state)
 		ar = 0;
 	var->type = ar & 15;
@@ -2062,6 +2079,18 @@ static void vmx_get_segment(struct kvm_vcpu *vcpu,
 	var->unusable = (ar >> 16) & 1;
 }
 
+static u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
+{
+	struct kvm_vmx_segment_field *sf = &kvm_vmx_segment_fields[seg];
+	struct kvm_segment s;
+
+	if (to_vmx(vcpu)->rmode.vm86_active) {
+		vmx_get_segment(vcpu, &s, seg);
+		return s.base;
+	}
+	return vmcs_readl(sf->base);
+}
+
 static int vmx_get_cpl(struct kvm_vcpu *vcpu)
 {
 	if (!is_protmode(vcpu))


[COMMIT master] KVM: VMX: Save and restore tr selector across mode switches

2011-01-13 Thread Avi Kivity
From: Avi Kivity a...@redhat.com

When emulating real mode we play with tr hidden state, but leave
tr.selector alone.  That works well, except for save/restore, since
loading TR writes it to the hidden state in vmx->rmode.

Fix by also saving and restoring the tr selector; this makes things
more consistent and allows migration to work during the early
boot stages of Windows XP.

Signed-off-by: Avi Kivity a...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index bf89ec2..a2e83a9 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1683,6 +1683,7 @@ static void enter_pmode(struct kvm_vcpu *vcpu)
 	vmx->emulation_required = 1;
 	vmx->rmode.vm86_active = 0;
 
+	vmcs_write16(GUEST_TR_SELECTOR, vmx->rmode.tr.selector);
 	vmcs_writel(GUEST_TR_BASE, vmx->rmode.tr.base);
 	vmcs_write32(GUEST_TR_LIMIT, vmx->rmode.tr.limit);
 	vmcs_write32(GUEST_TR_AR_BYTES, vmx->rmode.tr.ar);
@@ -1756,6 +1757,7 @@ static void enter_rmode(struct kvm_vcpu *vcpu)
 	vmx->emulation_required = 1;
 	vmx->rmode.vm86_active = 1;
 
+	vmx->rmode.tr.selector = vmcs_read16(GUEST_TR_SELECTOR);
 	vmx->rmode.tr.base = vmcs_readl(GUEST_TR_BASE);
 	vmcs_writel(GUEST_TR_BASE, rmode_tss_base(vcpu->kvm));
 


[COMMIT master] KVM: Initialize fpu state in preemptible context

2011-01-13 Thread Avi Kivity
From: Avi Kivity a...@redhat.com

init_fpu() (which is indirectly called by the fpu switching code) assumes
it is in process context.  Rather than making init_fpu() use an atomic
allocation, which can cause a task to be killed, make sure the fpu is
already initialized when we enter the run loop.

KVM-Stable-Tag.
Reported-and-tested-by: Kirill A. Shutemov k...@openvz.org
Acked-by: Pekka Enberg penb...@kernel.org
Reviewed-by: Christoph Lameter c...@linux.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c
index 58bb239..e60c38c 100644
--- a/arch/x86/kernel/i387.c
+++ b/arch/x86/kernel/i387.c
@@ -169,6 +169,7 @@ int init_fpu(struct task_struct *tsk)
set_stopped_child_used_math(tsk);
return 0;
 }
+EXPORT_SYMBOL_GPL(init_fpu);
 
 /*
  * The xstateregs_active() routine is the same as the fpregs_active() routine,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fa708c9..9dda70d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5376,6 +5376,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 	int r;
 	sigset_t sigsaved;
 
+	if (!tsk_used_math(current) && init_fpu(current))
+		return -ENOMEM;
+
 	if (vcpu->sigset_active)
 		sigprocmask(SIG_SETMASK, &vcpu->sigset, &sigsaved);
 


[COMMIT master] smp: speed up cpu_count()

2011-01-13 Thread Avi Kivity
From: Avi Kivity a...@redhat.com

cpu_count() is used in important places, like vmexit.flat's measuring
loop, yet it is ridiculously slow as it talks to the firmware config
interface.

Speed it up by reading the value from memory.

Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/lib/x86/smp.c b/lib/x86/smp.c
index 8da614a..d41c332 100644
--- a/lib/x86/smp.c
+++ b/lib/x86/smp.c
@@ -78,7 +78,7 @@ void spin_unlock(struct spinlock *lock)
 
 int cpu_count(void)
 {
-return fwcfg_get_nb_cpus();
+return _cpu_count;
 }
 
 int smp_id(void)
@@ -130,6 +130,8 @@ void smp_init(void)
 int i;
 void ipi_entry(void);
 
+_cpu_count = fwcfg_get_nb_cpus();
+
 set_ipi_descriptor(ipi_entry);
 
 setup_smp_id(0);


[COMMIT master] vmexit: fix race in joining smp tests

2011-01-13 Thread Avi Kivity
From: Avi Kivity a...@redhat.com

'nr_cpus_done' is not incremented atomically; this has been observed to
cause tests to stall.  Fix by using a proper atomic increment.

Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/x86/vmexit.c b/x86/vmexit.c
index 875caa3..ad8ab55 100644
--- a/x86/vmexit.c
+++ b/x86/vmexit.c
@@ -2,6 +2,7 @@
 #include "libcflat.h"
 #include "smp.h"
 #include "processor.h"
+#include "atomic.h"
 
 static unsigned int inl(unsigned short port)
 {
@@ -121,7 +122,7 @@ static struct test {
 };
 
 unsigned iterations;
-volatile int nr_cpus_done;
+static atomic_t nr_cpus_done;
 
 static void run_test(void *_func)
 {
@@ -131,7 +132,7 @@ static void run_test(void *_func)
     for (i = 0; i < iterations; ++i)
         func();
 
-    nr_cpus_done++;
+    atomic_inc(&nr_cpus_done);
 }
 
 static void do_test(struct test *test)
@@ -155,10 +156,10 @@ static void do_test(struct test *test)
 		for (i = 0; i < iterations; ++i)
 			func();
 	} else {
-		nr_cpus_done = 0;
+		atomic_set(&nr_cpus_done, 0);
 		for (i = cpu_count(); i > 0; i--)
 			on_cpu_async(i-1, run_test, func);
-		while (nr_cpus_done < cpu_count())
+		while (atomic_read(&nr_cpus_done) < cpu_count())
;
}
t2 = rdtsc();


[COMMIT master] smp: fix race in async on_cpu()

2011-01-13 Thread Avi Kivity
From: Avi Kivity a...@redhat.com

We fire off the IPI, but don't wait for the other cpu to pick up
the function and data before returning.

Fix by making the other cpu ACK the receipt of the IPI (but still
execute the result asynchronously).

Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/lib/x86/smp.c b/lib/x86/smp.c
index 241f755..8da614a 100644
--- a/lib/x86/smp.c
+++ b/lib/x86/smp.c
@@ -7,15 +7,27 @@
 #define IPI_VECTOR 0x20
 
 static struct spinlock ipi_lock;
-static void (*ipi_function)(void *data);
-static void *ipi_data;
+static volatile void (*ipi_function)(void *data);
+static volatile void *ipi_data;
 static volatile int ipi_done;
+static volatile bool ipi_wait;
+static int _cpu_count;
 
 static __attribute__((used)) void ipi()
 {
-ipi_function(ipi_data);
-apic_write(APIC_EOI, 0);
-ipi_done = 1;
+void (*function)(void *data) = ipi_function;
+void *data = ipi_data;
+bool wait = ipi_wait;
+
+if (!wait) {
+   ipi_done = 1;
+   apic_write(APIC_EOI, 0);
+}
+function(data);
+if (wait) {
+   ipi_done = 1;
+   apic_write(APIC_EOI, 0);
+}
 }
 
 asm (
@@ -92,13 +104,12 @@ static void __on_cpu(int cpu, void (*function)(void *data), void *data,
ipi_done = 0;
ipi_function = function;
ipi_data = data;
+   ipi_wait = wait;
apic_icr_write(APIC_INT_ASSERT | APIC_DEST_PHYSICAL | APIC_DM_FIXED
| IPI_VECTOR,
cpu);
-   if (wait) {
-   while (!ipi_done)
-   ;
-   }
+   while (!ipi_done)
+   ;
 }
 spin_unlock(ipi_lock);
 }


[PATCH uq/master 2/2] MCE, unpoison memory address across reboot

2011-01-13 Thread Huang Ying
In the Linux kernel's HWPoison handling, the virtual address that maps the
faulty physical memory page is marked as HWPoison in every process mapping
it, so any further access to that virtual address kills the corresponding
process with SIGBUS.

If the faulty physical memory page is used by a KVM guest, the SIGBUS is
sent to QEMU, and QEMU simulates an MCE to report the memory error to the
guest OS.  If the guest OS cannot recover from the error (for example,
because the page is accessed by kernel code), it reboots the system.  But
because the underlying host virtual address backing the guest physical
memory is still poisoned, the SIGBUS is delivered to QEMU and the MCE is
simulated again as soon as the guest touches that guest physical memory
after the reboot.  That is, the guest cannot recover by rebooting.

In fact, the contents of a guest physical memory page need not be kept
across a reboot.  We can allocate a new host physical page to back the
corresponding guest physical address.

This patch fixes the issue in QEMU by calling qemu_ram_remap() to clear
the corresponding page table entry, which makes it possible to allocate a
new page and recover.

Signed-off-by: Huang Ying ying.hu...@intel.com
---
 kvm.h |2 ++
 target-i386/kvm.c |   39 +++
 2 files changed, 41 insertions(+)

--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -580,6 +580,42 @@ static int kvm_get_supported_msrs(void)
 return ret;
 }
 
+struct HWPoisonPage;
+typedef struct HWPoisonPage HWPoisonPage;
+struct HWPoisonPage
+{
+ram_addr_t ram_addr;
+QLIST_ENTRY(HWPoisonPage) list;
+};
+
+static QLIST_HEAD(hwpoison_page_list, HWPoisonPage) hwpoison_page_list =
+QLIST_HEAD_INITIALIZER(hwpoison_page_list);
+
+void kvm_unpoison_all(void *param)
+{
+    HWPoisonPage *page, *next_page;
+
+    QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
+        QLIST_REMOVE(page, list);
+        qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
+        qemu_free(page);
+    }
+}
+
+static void kvm_hwpoison_page_add(ram_addr_t ram_addr)
+{
+    HWPoisonPage *page;
+
+    QLIST_FOREACH(page, &hwpoison_page_list, list) {
+        if (page->ram_addr == ram_addr)
+            return;
+    }
+
+    page = qemu_malloc(sizeof(HWPoisonPage));
+    page->ram_addr = ram_addr;
+    QLIST_INSERT_HEAD(&hwpoison_page_list, page, list);
+}
+
 int kvm_arch_init(void)
 {
 uint64_t identity_base = 0xfffbc000;
@@ -632,6 +668,7 @@ int kvm_arch_init(void)
         fprintf(stderr, "e820_add_entry() table is full\n");
 return ret;
 }
+qemu_register_reset(kvm_unpoison_all, NULL);
 
 return 0;
 }
@@ -1940,6 +1977,7 @@ int kvm_on_sigbus_vcpu(CPUState *env, in
 hardware_memory_error();
 }
 }
+kvm_hwpoison_page_add(ram_addr);
 
 if (code == BUS_MCEERR_AR) {
 /* Fake an Intel architectural Data Load SRAR UCR */
@@ -1984,6 +2022,7 @@ int kvm_on_sigbus(int code, void *addr)
 QEMU itself instead of guest system!: %p\n, addr);
 return 0;
 }
+kvm_hwpoison_page_add(ram_addr);
 kvm_mce_inj_srao_memscrub2(first_cpu, paddr);
 } else
 #endif
--- a/kvm.h
+++ b/kvm.h
@@ -188,6 +188,8 @@ int kvm_physical_memory_addr_from_ram(ra
   target_phys_addr_t *phys_addr);
 #endif
 
+void kvm_unpoison_all(void *param);
+
 #endif
 int kvm_set_ioeventfd_mmio_long(int fd, uint32_t adr, uint32_t val, bool assign);
 




[PATCH uq/master 1/2] Add qemu_ram_remap

2011-01-13 Thread Huang Ying
qemu_ram_remap() unmaps the specified RAM pages and then maps them again.
This is used by the KVM HWPoison support to clear hwpoisoned page table
entries across a guest reboot, so that a new page may be allocated later
to recover from the memory error.

Signed-off-by: Huang Ying ying.hu...@intel.com
---
 cpu-all.h|4 +++
 cpu-common.h |1 
 exec.c   |   61 ++-
 3 files changed, 65 insertions(+), 1 deletion(-)

--- a/cpu-all.h
+++ b/cpu-all.h
@@ -863,10 +863,14 @@ target_phys_addr_t cpu_get_phys_page_deb
 extern int phys_ram_fd;
 extern ram_addr_t ram_size;
 
+/* RAM is pre-allocated and passed into qemu_ram_alloc_from_ptr */
+#define RAM_PREALLOC_MASK  (1 << 0)
+
 typedef struct RAMBlock {
 uint8_t *host;
 ram_addr_t offset;
 ram_addr_t length;
+uint32_t flags;
 char idstr[256];
 QLIST_ENTRY(RAMBlock) next;
 #if defined(__linux__) && !defined(TARGET_S390X)
--- a/exec.c
+++ b/exec.c
@@ -2830,6 +2830,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(Devic
 
 if (host) {
         new_block->host = host;
+        new_block->flags |= RAM_PREALLOC_MASK;
 } else {
 if (mem_path) {
 #if defined (__linux__)  !defined(TARGET_S390X)
@@ -2883,7 +2884,9 @@ void qemu_ram_free(ram_addr_t addr)
     QLIST_FOREACH(block, &ram_list.blocks, next) {
         if (addr == block->offset) {
             QLIST_REMOVE(block, next);
-            if (mem_path) {
+            if (block->flags & RAM_PREALLOC_MASK)
+                ;
+            else if (mem_path) {
 #if defined (__linux__) && !defined(TARGET_S390X)
                 if (block->fd) {
                     munmap(block->host, block->length);
@@ -2906,6 +2909,62 @@ void qemu_ram_free(ram_addr_t addr)
 
 }
 
+void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
+{
+    RAMBlock *block;
+    ram_addr_t offset;
+    int flags;
+    void *area, *vaddr;
+
+    QLIST_FOREACH(block, &ram_list.blocks, next) {
+        offset = addr - block->offset;
+        if (offset < block->length) {
+            vaddr = block->host + offset;
+            if (block->flags & RAM_PREALLOC_MASK) {
+                ;
+            } else {
+                flags = MAP_FIXED;
+                munmap(vaddr, length);
+                if (mem_path) {
+#if defined (__linux__) && !defined(TARGET_S390X)
+                    if (block->fd) {
+#ifdef MAP_POPULATE
+                        flags |= mem_prealloc ? MAP_POPULATE | MAP_SHARED :
+                            MAP_PRIVATE;
+#else
+                        flags |= MAP_PRIVATE;
+#endif
+                        area = mmap(vaddr, length, PROT_READ | PROT_WRITE,
+                                    flags, block->fd, offset);
+                    } else {
+                        flags |= MAP_PRIVATE | MAP_ANONYMOUS;
+                        area = mmap(vaddr, length, PROT_READ | PROT_WRITE,
+                                    flags, -1, 0);
+                    }
+#endif
+                } else {
+#if defined(TARGET_S390X) && defined(CONFIG_KVM)
+                    flags |= MAP_SHARED | MAP_ANONYMOUS;
+                    area = mmap(vaddr, length, PROT_EXEC|PROT_READ|PROT_WRITE,
+                                flags, -1, 0);
+#else
+                    flags |= MAP_PRIVATE | MAP_ANONYMOUS;
+                    area = mmap(vaddr, length, PROT_READ | PROT_WRITE,
+                                flags, -1, 0);
+#endif
+                }
+                if (area != vaddr) {
+                    fprintf(stderr, "Could not remap addr: %lx@%lx\n",
+                            length, addr);
+                    exit(1);
+                }
+                qemu_madvise(vaddr, length, QEMU_MADV_MERGEABLE);
+            }
+            return;
+        }
+    }
+}
+
 /* Return a host pointer to ram allocated with qemu_ram_alloc.
With the exception of the softmmu code in this file, this should
only be used for local memory (e.g. video ram) that the device owns,
--- a/cpu-common.h
+++ b/cpu-common.h
@@ -50,6 +50,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(Devic
 ram_addr_t size, void *host);
 ram_addr_t qemu_ram_alloc(DeviceState *dev, const char *name, ram_addr_t size);
 void qemu_ram_free(ram_addr_t addr);
+void qemu_ram_remap(ram_addr_t addr, ram_addr_t length);
 /* This should only be used for ram local to a device.  */
 void *qemu_get_ram_ptr(ram_addr_t addr);
 /* Same but slower, to use for migration, where the order of




[PATCH 1/2] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page optionally

2011-01-13 Thread Huang Ying
Make __get_user_pages() return -EHWPOISON for a HWPOISON page only if
FOLL_HWPOISON is specified.  With this patch, interested callers can
distinguish a HWPOISON page from a general FAULT page, while other callers
still get -EFAULT, so the user-space interface need not change.

get_user_pages_hwpoison() is added as a variant of get_user_pages() that
can return -EHWPOISON for a HWPOISON page.

This feature is needed by KVM, where a UCR MCE should be relayed to the
guest for a HWPOISON page, while instruction emulation and MMIO are tried
for a general FAULT page.

The idea comes from Andrew Morton.

Signed-off-by: Huang Ying ying.hu...@intel.com
Cc: Andrew Morton a...@linux-foundation.org
---
 include/asm-generic/errno.h |2 +
 include/linux/mm.h  |   17 +
 mm/memory.c |   55 +---
 3 files changed, 71 insertions(+), 3 deletions(-)

--- a/include/asm-generic/errno.h
+++ b/include/asm-generic/errno.h
@@ -108,4 +108,6 @@
 
 #define ERFKILL		132	/* Operation not possible due to RF-kill */
 
+#define EHWPOISON	133	/* Memory page has hardware error */
+
 #endif
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -860,6 +860,22 @@ int get_user_pages(struct task_struct *t
struct page **pages, struct vm_area_struct **vmas);
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages);
+#ifdef CONFIG_MEMORY_FAILURE
+int get_user_pages_hwpoison(struct task_struct *tsk, struct mm_struct *mm,
+   unsigned long start, int nr_pages, int write,
+   int force, struct page **pages,
+   struct vm_area_struct **vmas);
+#else
+static inline int get_user_pages_hwpoison(struct task_struct *tsk,
+ struct mm_struct *mm,
+ unsigned long start, int nr_pages,
+ int write, int force,
+ struct page **pages,
+ struct vm_area_struct **vmas) {
+   return get_user_pages(tsk, mm, start, nr_pages,
+ write, force, pages, vmas);
+}
+#endif
 struct page *get_dump_page(unsigned long addr);
 
 extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
@@ -1415,6 +1431,7 @@ struct page *follow_page(struct vm_area_
 #define FOLL_GET	0x04	/* do get_page on page */
 #define FOLL_DUMP	0x08	/* give error on hole if it would be zero */
 #define FOLL_FORCE	0x10	/* get_user_pages read/write w/o permission */
+#define FOLL_HWPOISON	0x20	/* check page is hwpoisoned */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
void *data);
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1449,9 +1449,16 @@ int __get_user_pages(struct task_struct
 			if (ret & VM_FAULT_ERROR) {
 				if (ret & VM_FAULT_OOM)
 					return i ? i : -ENOMEM;
-				if (ret &
-				    (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE|
-				     VM_FAULT_SIGBUS))
+				if (ret & (VM_FAULT_HWPOISON |
+					   VM_FAULT_HWPOISON_LARGE)) {
+					if (i)
+						return i;
+					else if (gup_flags & FOLL_HWPOISON)
+						return -EHWPOISON;
+					else
+						return -EFAULT;
+				}
+				if (ret & VM_FAULT_SIGBUS)
 					return i ? i : -EFAULT;
 				BUG();
 			}
@@ -1563,6 +1570,48 @@ int get_user_pages(struct task_struct *t
 }
 EXPORT_SYMBOL(get_user_pages);
 
+#ifdef CONFIG_MEMORY_FAILURE
+/**
+ * get_user_pages_hwpoison() - pin user pages in memory, return hwpoison status
+ * @tsk:   task_struct of target task
+ * @mm:mm_struct of target mm
+ * @start: starting user address
+ * @nr_pages:  number of pages from start to pin
+ * @write: whether pages will be written to by the caller
+ * @force: whether to force write access even if user mapping is
+ * readonly. This will result in the page being COWed even
+ * in MAP_SHARED mappings. You do not want this.
+ * @pages: array that receives pointers to the pages pinned.
+ * Should be at least nr_pages long. Or NULL, 

[PATCH 2/2] KVM, Replace is_hwpoison_address with get_user_pages_hwpoison

2011-01-13 Thread Huang Ying
is_hwpoison_address() only checks whether the page table entry is
hwpoisoned, regardless of the memory page mapped, while
get_user_pages_hwpoison() checks both.

QEMU clears the poisoned page table entry (via unmap/map) to make it
possible to allocate a new memory page for the virtual address across a
guest reboot.  But it is also possible that the underlying memory page is
kept poisoned even after the corresponding page table entry is cleared,
that is, a new memory page cannot be allocated.  get_user_pages_hwpoison()
catches these situations.

Signed-off-by: Huang Ying ying.hu...@intel.com
---
 include/linux/mm.h  |8 
 mm/memory-failure.c |   32 
 virt/kvm/kvm_main.c |4 +++-
 3 files changed, 3 insertions(+), 41 deletions(-)

--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1524,14 +1524,6 @@ extern int sysctl_memory_failure_recover
 extern void shake_page(struct page *p, int access);
 extern atomic_long_t mce_bad_pages;
 extern int soft_offline_page(struct page *page, int flags);
-#ifdef CONFIG_MEMORY_FAILURE
-int is_hwpoison_address(unsigned long addr);
-#else
-static inline int is_hwpoison_address(unsigned long addr)
-{
-   return 0;
-}
-#endif
 
 extern void dump_page(struct page *page);
 
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1433,35 +1433,3 @@ done:
/* keep elevated page count for bad page */
return ret;
 }
-
-/*
- * The caller must hold current->mm->mmap_sem in read mode.
- */
-int is_hwpoison_address(unsigned long addr)
-{
-   pgd_t *pgdp;
-   pud_t pud, *pudp;
-   pmd_t pmd, *pmdp;
-   pte_t pte, *ptep;
-   swp_entry_t entry;
-
-	pgdp = pgd_offset(current->mm, addr);
-   if (!pgd_present(*pgdp))
-   return 0;
-   pudp = pud_offset(pgdp, addr);
-   pud = *pudp;
-   if (!pud_present(pud) || pud_large(pud))
-   return 0;
-   pmdp = pmd_offset(pudp, addr);
-   pmd = *pmdp;
-   if (!pmd_present(pmd) || pmd_large(pmd))
-   return 0;
-   ptep = pte_offset_map(pmdp, addr);
-   pte = *ptep;
-   pte_unmap(ptep);
-   if (!is_swap_pte(pte))
-   return 0;
-   entry = pte_to_swp_entry(pte);
-   return is_hwpoison_entry(entry);
-}
-EXPORT_SYMBOL_GPL(is_hwpoison_address);
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -966,7 +966,9 @@ static pfn_t hva_to_pfn(struct kvm *kvm,
goto return_fault_page;
 
 		down_read(&current->mm->mmap_sem);
-		if (is_hwpoison_address(addr)) {
+		npages = get_user_pages_hwpoison(current, current->mm,
+						 addr, 1, 1, 0, page, NULL);
+		if (npages == -EHWPOISON) {
 			up_read(&current->mm->mmap_sem);
 			get_page(hwpoison_page);
 			return page_to_pfn(hwpoison_page);




[PATCH kvm-unit-tests] kvmclock_test: fix smp initialization

2011-01-13 Thread Avi Kivity
cpu_count() is not valid before smp_init().

Signed-off-by: Avi Kivity a...@redhat.com
---
 x86/kvmclock_test.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/x86/kvmclock_test.c b/x86/kvmclock_test.c
index 5b14ae2..52a43fb 100644
--- a/x86/kvmclock_test.c
+++ b/x86/kvmclock_test.c
@@ -115,7 +115,7 @@ static int cycle_test(int ncpus, long loops, int check, struct test_info *ti)
 
 int main(int ac, char **av)
 {
-int ncpus = cpu_count();
+int ncpus;
 int nerr = 0, i;
 long loops = DEFAULT_TEST_LOOPS;
 long sec = 0;
@@ -130,6 +130,7 @@ int main(int ac, char **av)
 
 smp_init();
 
+ncpus = cpu_count();
     if (ncpus > MAX_CPU)
 ncpus = MAX_CPU;
 for (i = 0; i  ncpus; ++i)
-- 
1.7.1



Re: [PATCH uq/master 2/2] MCE, unpoison memory address across reboot

2011-01-13 Thread Jan Kiszka
Am 13.01.2011 09:34, Huang Ying wrote:
 In Linux kernel HWPoison processing implementation, the virtual
 address in processes mapping the error physical memory page is marked
 as HWPoison.  So that, the further accessing to the virtual
 address will kill corresponding processes with SIGBUS.
 
 If the error physical memory page is used by a KVM guest, the SIGBUS
 will be sent to QEMU, and QEMU will simulate a MCE to report that
 memory error to the guest OS.  If the guest OS can not recover from
 the error (for example, the page is accessed by kernel code), guest OS
 will reboot the system.  But because the underlying host virtual
 address backing the guest physical memory is still poisoned, if the
 guest system accesses the corresponding guest physical memory even
 after rebooting, the SIGBUS will still be sent to QEMU and MCE will be
 simulated.  That is, guest system can not recover via rebooting.
 
 In fact, across rebooting, the contents of guest physical memory page
 need not to be kept.  We can allocate a new host physical page to
 back the corresponding guest physical address.
 
 This patch fixes this issue in QEMU via calling qemu_ram_remap() to
 clear the corresponding page table entry, so that make it possible to
 allocate a new page to recover the issue.
 
 Signed-off-by: Huang Ying ying.hu...@intel.com
 ---
  kvm.h |2 ++
  target-i386/kvm.c |   39 +++
  2 files changed, 41 insertions(+)
 
 --- a/target-i386/kvm.c
 +++ b/target-i386/kvm.c
 @@ -580,6 +580,42 @@ static int kvm_get_supported_msrs(void)
  return ret;
  }
  
 +struct HWPoisonPage;
 +typedef struct HWPoisonPage HWPoisonPage;
 +struct HWPoisonPage
 +{
 +ram_addr_t ram_addr;
 +QLIST_ENTRY(HWPoisonPage) list;
 +};
 +
 +static QLIST_HEAD(hwpoison_page_list, HWPoisonPage) hwpoison_page_list =
 +QLIST_HEAD_INITIALIZER(hwpoison_page_list);
 +
 +void kvm_unpoison_all(void *param)

Minor nit: This can be static now.

 +{
 +HWPoisonPage *page, *next_page;
 +
 +QLIST_FOREACH_SAFE(page, hwpoison_page_list, list, next_page) {
 +QLIST_REMOVE(page, list);
 +qemu_ram_remap(page-ram_addr, TARGET_PAGE_SIZE);
 +qemu_free(page);
 +}
 +}
 +
 +static void kvm_hwpoison_page_add(ram_addr_t ram_addr)
 +{
 +HWPoisonPage *page;
 +
 +QLIST_FOREACH(page, hwpoison_page_list, list) {
 +if (page-ram_addr == ram_addr)
 +return;
 +}
 +
 +page = qemu_malloc(sizeof(HWPoisonPage));
 +page-ram_addr = ram_addr;
 +QLIST_INSERT_HEAD(hwpoison_page_list, page, list);
 +}
 +
  int kvm_arch_init(void)
  {
  uint64_t identity_base = 0xfffbc000;
 @@ -632,6 +668,7 @@ int kvm_arch_init(void)
  fprintf(stderr, e820_add_entry() table is full\n);
  return ret;
  }
 +qemu_register_reset(kvm_unpoison_all, NULL);
  
  return 0;
  }
 @@ -1940,6 +1977,7 @@ int kvm_on_sigbus_vcpu(CPUState *env, in
  hardware_memory_error();
  }
  }
 +kvm_hwpoison_page_add(ram_addr);
  
  if (code == BUS_MCEERR_AR) {
  /* Fake an Intel architectural Data Load SRAR UCR */
 @@ -1984,6 +2022,7 @@ int kvm_on_sigbus(int code, void *addr)
  QEMU itself instead of guest system!: %p\n, addr);
  return 0;
  }
 +kvm_hwpoison_page_add(ram_addr);
  kvm_mce_inj_srao_memscrub2(first_cpu, paddr);
  } else
  #endif
 --- a/kvm.h
 +++ b/kvm.h
 @@ -188,6 +188,8 @@ int kvm_physical_memory_addr_from_ram(ra
target_phys_addr_t *phys_addr);
  #endif
  
 +void kvm_unpoison_all(void *param);
 +

To be removed if kvm_unpoison_all is static.

  #endif
  int kvm_set_ioeventfd_mmio_long(int fd, uint32_t adr, uint32_t val, bool 
 assign);
  
 

As indicated, I'm sitting on lots of fixes and refactorings of the MCE
user space code. How do you test your patches? Any suggestions how to do
this efficiently would be warmly welcome.

Thanks,
Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: Errors on MMIO read access on VM suspend / resume operations

2011-01-13 Thread Avi Kivity

On 01/11/2011 06:19 PM, Stefan Berger wrote:

Hi!

  I am currently doing some long-term testing of a device model using 
memory mapped IO (TPM TIS) and am seeing some strange errors when the 
suspend occurs in the middle of a read operation in the Linux TPM TIS 
device driver where the driver reads the result packet from the mmio 
location.


 Short background: The TPM response packet is read in 2 chunks. First 
the first 10 bytes are read containing the response's header. 
Subsequently the rest of the packet is read using knowledge of the 
total size of the response packet from the header (bytes 2-5 in big 
endian format). The corresponding code reading the data from the 
hardware interface is here:



http://lxr.linux.no/#linux+v2.6.37/drivers/char/tpm/tpm_tis.c#L228


The test I am running is setup as follows:

- inside the VM keys are permanently generated by sending commands to 
the TPM; packets read from the interface are dumped to the screen


- on the host a script suspends the VM every 6 seconds and resumes it 
immediately afterwards (using libvirt)



As it happens, sometimes the VM is suspended in the middle of a read 
operation on the TPM TIS interface -- see above code reference. I see 
that because I do dump the state of the TPM TIS when suspending and 
see that the read offset is pointing to a location somewhere in the 
middle of the packet - so the TPM TIS Linux driver is in the above 
loop currently reading the data. I am observing two types of results 
if this happens:



- either the result read by the Linux TPM TIS driver is ok, so no 
problem here


- or the problematic case where the TPM TIS driver reads a packet with 
a byte missing and then at the end gets a zero byte from the TPM TIS 
interface indicating that it read beyond the available data. If the 
suspend happened while reading the first chunk of data (header), the 
TPM TIS driver will also complain that the available data for the 2nd  
chunk (burst size) is less than what's expected  -- it's an off-by-one 
error



So, I then modified the TPM TIS device model to decrement the read 
offset pointer by '1' in case it was detected that the suspend 
happened in the middle of the read operation -- in Qemu I do this in 
the post-load 'method'. This then leads to the following types of 
results:



- the problematic(!) case where the read packet was ok

- the expected case where the TPM TIS driver reads the packet and ends 
up having two same bytes in the result in consecutive array locations; 
besides that the TPM TIS driver will in this case complain that it has 
left-over data



So my conclusion from the above tests are:

- for some reason the memory read to the MMIO location happens as the 
last instruction executed on suspend and again as the very first on 
executed on resume. This explains to me that the TPM TIS model 
internal pointer into the packet was advanced by '1' (the packet is 
read by subsequently reading from the same memory location) and the 
above problematic cases make sense



Most likely this is qemu-kvm failing to obey this snippet from 
Documentation/kvm/api.txt:



NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the corresponding
operations are complete (and guest state is consistent) only after 
userspace

has re-entered the kernel with KVM_RUN.  The kernel side will first finish
incomplete operations and then check for pending signals.  Userspace
can re-enter the guest with an unmasked signal pending to complete
pending operations.



However, the code appears to be correct.  kvm_run() calls handle_mmio(), 
which returns 0.  The following bit


if (!r) {
goto again;
}

at the end of kvm_run() makes it enter the kernel again (delivering a 
signal to itself in case we want to stop).




- the other instruction in the Linux TPM TIS drivers that for example 
advance the buffer location do not execute twice, i.e., size++ in the 
buf[size++] = ... in the Linux driver.



What puzzles me is that the read operation may be run twice but others 
don't.




Reads have split execution: kvm emulates the mmio instruction, notices 
that it cannot satisfy the read request, exits to qemu, then restarts 
the instruction.  If the last step is omitted due to savevm, then kvm 
will exit back to qemu again and your device will see the read duplicated.
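
(For illustration, a minimal sketch with made-up names of the kind of read
handler that is bitten by this -- not the actual TPM TIS model:)

#include <stdint.h>

/* Made-up device state, for illustration only. */
typedef struct RespState {
    uint8_t  resp[4096];    /* response packet */
    uint32_t resp_len;      /* valid bytes in resp[] */
    uint32_t r_offset;      /* next byte the guest will read */
} RespState;

/* MMIO read callback that auto-advances an internal offset.  If the vcpu's
 * mmio instruction is restarted after savevm/loadvm instead of being
 * completed, this callback runs twice for a single guest load, so one byte
 * of the response is silently skipped. */
static uint32_t resp_mmio_readb(void *opaque, uint64_t addr)
{
    RespState *s = opaque;

    (void)addr;                         /* single data register in this sketch */
    if (s->r_offset < s->resp_len)
        return s->resp[s->r_offset++];  /* side effect on every call */
    return 0;                           /* reading past the end returns 0 */
}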


If you have insights as why the above may be occurring, please let me 
know. A simple solution to work around this may be to introduce a 
register holding the index into the result packet where to read the 
next byte from (rather than advancing an internal pointer to the next 
byte), though this would deviate the driver from the standard 
interface the model currently implements.


Most undesirable; I'd like to fix the bug.

Can you sprinkle some printfs() around kvm_run (in qemu-kvm.c) to verify 
this?


Good pattern:

  ioctl(KVM_RUN)
  - KVM_EXIT_MMIO
  ioctl(KVM_RUN)
  - ENTR
  no further KVM_RUNs

or

   ioctl(KVM_RUN)
   - something other than 

Re: [PATCH 1/2] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page optionally

2011-01-13 Thread Avi Kivity

On 01/13/2011 10:42 AM, Huang Ying wrote:

Make __get_user_pages return -EHWPOISON for HWPOISON page only if
FOLL_HWPOISON is specified.  With this patch, the interested callers
can distinguish HWPOISON page from general FAULT page, while other
callers will still get -EFAULT for pages, so the user space interface
need not to be changed.

get_user_pages_hwpoison is added as a variant of get_user_pages that
can return -EHWPOISON for HWPOISON page.

This feature is needed by KVM, where UCR MCE should be relayed to
guest for HWPOISON page, while instruction emulation and MMIO will be
tried for general FAULT page.

The idea comes from Andrew Morton.

Signed-off-by: Huang Ying ying.hu...@intel.com
Cc: Andrew Morton a...@linux-foundation.org

+#ifdef CONFIG_MEMORY_FAILURE
+int get_user_pages_hwpoison(struct task_struct *tsk, struct mm_struct *mm,
+   unsigned long start, int nr_pages, int write,
+   int force, struct page **pages,
+   struct vm_area_struct **vmas);
+#else


Since we'd also like to add get_user_pages_noio(), perhaps adding a 
flags field to get_user_pages() is better.
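
(For illustration, a rough sketch of that flags-based variant as it might
look inside mm/memory.c; get_user_pages_flags() is a made-up name, not an
existing kernel API:)

#include <linux/mm.h>
#include <linux/sched.h>

/*
 * Sketch only: a single entry point that forwards caller-supplied FOLL_*
 * bits (e.g. FOLL_HWPOISON from this patch, or a hypothetical FOLL_NOIO)
 * instead of adding one wrapper per behaviour.
 */
int get_user_pages_flags(struct task_struct *tsk, struct mm_struct *mm,
			 unsigned long start, int nr_pages, int write,
			 int force, unsigned int extra_flags,
			 struct page **pages, struct vm_area_struct **vmas)
{
	unsigned int flags = FOLL_TOUCH | extra_flags;

	if (pages)
		flags |= FOLL_GET;
	if (write)
		flags |= FOLL_WRITE;
	if (force)
		flags |= FOLL_FORCE;

	return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas);
}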


--
error compiling committee.c: too many arguments to function



Re: [GIT PULL] KVM updates for the 2.6.38 merge window

2011-01-13 Thread Gleb Natapov
On Wed, Jan 12, 2011 at 12:53:14PM -0800, Linus Torvalds wrote:
 On Wed, Jan 12, 2011 at 12:33 PM, Rik van Riel r...@redhat.com wrote:
 
  Now that we have FAULT_FLAG_ALLOW_RETRY, the async
  pagefault patches can be a little smaller.
 
 I suspect you do still want a new page flag, to say that
 FAULT_FLAG_ALLOW_RETRY shouldn't actually wait for the page that it
 allows retry for.
 
 But even then, that flag should not be named MINOR, it should be
 about what the behaviour is actually all about (NOWAIT_RETRY or
 whatever - it presumably would also cause us to not drop the
 mmap_sem).
 
 IOW, these days I suspect the patch _should_ look something like the attached.
 
 Anyway, with this, you should be able to use
 
   FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT
 
 to basically get a non-waiting page fault (and it will return the
 VM_FAULT_RETRY error code if it failed).
 
I implemented get_user_pages_nowait() on top of your patch. In my testing
it works as expected when used inside KVM. Does this looks OK to you?

diff --git a/include/linux/mm.h b/include/linux/mm.h
index dc83565..d78e9e7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -859,6 +859,9 @@ extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *
 int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int nr_pages, int write, int force,
struct page **pages, struct vm_area_struct **vmas);
+int get_user_pages_nowait(struct task_struct *tsk, struct mm_struct *mm,
+   unsigned long start, int nr_pages, int write, int force,
+   struct page **pages, struct vm_area_struct **vmas);
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages);
 struct page *get_dump_page(unsigned long addr);
@@ -1416,6 +1419,8 @@ struct page *follow_page(struct vm_area_struct *, unsigned long address,
 #define FOLL_GET	0x04	/* do get_page on page */
 #define FOLL_DUMP	0x08	/* give error on hole if it would be zero */
 #define FOLL_FORCE	0x10	/* get_user_pages read/write w/o permission */
+#define FOLL_RETRY	0x20	/* if disk transfer is needed release mmap_sem and return error */
+#define FOLL_NOWAIT	0x40	/* if disk transfer is needed return error without releasing mmap_sem */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
void *data);
diff --git a/mm/memory.c b/mm/memory.c
index 02e48aa..0a3d3b5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1443,8 +1443,12 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 				int ret;
 
 				ret = handle_mm_fault(mm, vma, start,
-					(foll_flags & FOLL_WRITE) ?
-					FAULT_FLAG_WRITE : 0);
+					((foll_flags & FOLL_WRITE) ?
+					FAULT_FLAG_WRITE : 0) |
+					((foll_flags & FOLL_RETRY) ?
+					FAULT_FLAG_ALLOW_RETRY : 0) |
+					((foll_flags & FOLL_NOWAIT) ?
+					FAULT_FLAG_RETRY_NOWAIT : 0));
 
 				if (ret & VM_FAULT_ERROR) {
 					if (ret & VM_FAULT_OOM)
@@ -1460,6 +1464,9 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 				else
 					tsk->min_flt++;
 
+				if (ret & VM_FAULT_RETRY)
+					return i ? i : -EFAULT;
+
/*
 * The VM_FAULT_WRITE bit tells us that
 * do_wp_page has broken COW when necessary,
@@ -1563,6 +1570,23 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 }
 EXPORT_SYMBOL(get_user_pages);
 
+int get_user_pages_nowait(struct task_struct *tsk, struct mm_struct *mm,
+   unsigned long start, int nr_pages, int write, int force,
+   struct page **pages, struct vm_area_struct **vmas)
+{
+   int flags = FOLL_TOUCH | FOLL_RETRY | FOLL_NOWAIT;
+
+   if (pages)
+   flags |= FOLL_GET;
+   if (write)
+   flags |= FOLL_WRITE;
+   if (force)
+   flags |= FOLL_FORCE;
+
+   return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas);
+}
+EXPORT_SYMBOL(get_user_pages_nowait);
+
 /**
  * get_dump_page() - pin user page in memory while writing it to core dump
  * @addr: user address
--
Gleb.

Re: BUG: sleeping function called from invalid context at mm/slub.c:793

2011-01-13 Thread Jan Kiszka
Am 11.01.2011 11:29, Avi Kivity wrote:
 Please try out the attached patch.
 
 From f3a6041b5bb3bf7c88f9694a66d7f34be2f78845 Mon Sep 17 00:00:00 2001
 From: Avi Kivity a...@redhat.com
 Date: Tue, 11 Jan 2011 12:15:54 +0200
 Subject: [PATCH] KVM: Initialize fpu state in preemptible context
 
 init_fpu() (which is indirectly called by the fpu switching code) assumes
 it is in process context.  Rather than makeing init_fpu() use an atomic
 allocation, which can cause a task to be killed, make sure the fpu is
 already initialized when we enter the run loop.
 
 Signed-off-by: Avi Kivity a...@redhat.com
 ---
  arch/x86/kernel/i387.c |1 +
  arch/x86/kvm/x86.c |3 +++
  2 files changed, 4 insertions(+), 0 deletions(-)
 
 diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c
 index 58bb239..e60c38c 100644
 --- a/arch/x86/kernel/i387.c
 +++ b/arch/x86/kernel/i387.c
 @@ -169,6 +169,7 @@ int init_fpu(struct task_struct *tsk)
   set_stopped_child_used_math(tsk);
   return 0;
  }
 +EXPORT_SYMBOL_GPL(init_fpu);
  
  /*
   * The xstateregs_active() routine is the same as the fpregs_active() 
 routine,
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 8652643..fd93cda 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -5351,6 +5351,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, 
 struct kvm_run *kvm_run)
   int r;
   sigset_t sigsaved;
  
 + if (!tsk_used_math(current)  init_fpu(current))
 + return -ENOMEM;
 +

Could become a rainy day for the kvm-kmod maintainer:

For compat support on kernels without init_fpu exported yet, can I
trigger the same result by simply issuing an FPU instruction here so
that do_device_not_available will perform the allocation? Not really
nice, but it doesn't appear to me like there is any code path that would
complain about in-kernel FPU usage (provided we don't need math
emulation - which is quite likely).

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUG: sleeping function called from invalid context at mm/slub.c:793

2011-01-13 Thread Avi Kivity

On 01/13/2011 02:59 PM, Jan Kiszka wrote:

  @@ -5351,6 +5351,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
int r;
sigset_t sigsaved;

  + if (!tsk_used_math(current) && init_fpu(current))
  + return -ENOMEM;
  +

Could become a rainy day for the kvm-kmod maintainer:

For compat support on kernels without init_fpu exported yet, can I
trigger the same result by simply issuing an FPU instruction here so
that do_device_not_available will perform the allocation? Not really
nice, but it doesn't appear to me like there is any code path that would
complain about in-kernel FPU usage (provided we don't need math
emulation - which is quite likely).


That's a pessimization, since it forces the fpu to be switched.  If both 
qemu and the guest don't use the fpu, we can run a guest with some other 
task's fpu loaded.


Oh, but if it's after the check for !tsk_used_math(), it only triggers 
once, so that's okay.  I guess something like mov %%xmm0, %%xmm0 should 
do nicely.
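
As a sketch of that compat fallback (illustrative only: kvm_compat_force_fpu_init() is an invented name, and it assumes the #NM path do_device_not_available -> math_state_restore -> init_fpu allocates the FPU state for the current task):

    #include <linux/errno.h>
    #include <linux/sched.h>    /* current, tsk_used_math() */

    /*
     * Hypothetical kvm-kmod fallback for kernels that don't export init_fpu():
     * touch the FPU once so the #NM handler allocates the state for us while
     * we are still in process context.  Faults at most once per task.
     */
    static int kvm_compat_force_fpu_init(void)
    {
            if (!tsk_used_math(current))
                    asm volatile("movaps %xmm0, %xmm0");    /* any FPU/SSE insn */

            return tsk_used_math(current) ? 0 : -ENOMEM;
    }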


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC -v4 PATCH 0/3] directed yield for Pause Loop Exiting

2011-01-13 Thread Avi Kivity

On 01/13/2011 07:21 AM, Rik van Riel wrote:

When running SMP virtual machines, it is possible for one VCPU to be
spinning on a spinlock, while the VCPU that holds the spinlock is not
currently running, because the host scheduler preempted it to run
something else.

Both Intel and AMD CPUs have a feature that detects when a virtual
CPU is spinning on a lock and will trap to the host.

The current KVM code sleeps for a bit whenever that happens, which
results in eg. a 64 VCPU Windows guest taking forever and a bit to
boot up.  This is because the VCPU holding the lock is actually
running and not sleeping, so the pause is counter-productive.

In other workloads a pause can also be counter-productive, with
spinlock detection resulting in one guest giving up its CPU time
to the others.  Instead of spinning, it ends up simply not running
much at all.

This patch series aims to fix that, by having a VCPU that spins
give the remainder of its timeslice to another VCPU in the same
guest before yielding the CPU - one that is runnable but got
preempted, hopefully the lock holder.


Can you share some benchmark results?

I'm mostly interested in moderately sized guests (4-8 vcpus) under 
conditions of no overcommit, and high overcommit (2x).


For no overcommit, I'd like to see comparisons against mainline with PLE 
disabled, to be sure there aren't significant regressions. For 
overcommit, comparisons against the no overcommit case.  Comparisons 
against mainline, with or without PLE disabled, are uninteresting since 
we know it sucks both ways.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC -v4 PATCH 3/3] kvm: use yield_to instead of sleep in kvm_vcpu_on_spin

2011-01-13 Thread Avi Kivity

On 01/13/2011 07:27 AM, Rik van Riel wrote:

Instead of sleeping in kvm_vcpu_on_spin, which can cause gigantic
slowdowns of certain workloads, we instead use yield_to to hand
the rest of our timeslice to another vcpu in the same KVM guest.





+   for (pass = 0; pass < 2 && !yielded; pass++) {
+   kvm_for_each_vcpu(i, vcpu, kvm) {
+   struct task_struct *task = vcpu->task;
+   if (!pass && i < last_boosted_vcpu) {
+   i = last_boosted_vcpu;
+   continue;
+   } else if (pass && i > last_boosted_vcpu)
+   break;
+   if (vcpu == me)
+   continue;
+   if (!task)
+   continue;
+   if (waitqueue_active(&vcpu->wq))
+   continue;


Suppose the vcpu exits at this point, and its task terminates.


+   if (task->flags & PF_VCPU)
+   continue;


Here you dereference freed memory.


+   kvm->last_boosted_vcpu = i;
+   yielded = 1;
+   yield_to(task, 1);


And here you do unimaginable things to that freed memory.

I think the first patch needs some reference counting... I'd move it to 
the outermost KVM_RUN loop to reduce the performance impact.



+   break;
+   }
+   }
  }
  EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/4] KVM: Fix x86_decode_insn() return code check

2011-01-13 Thread Avi Kivity

On 01/04/2011 03:14 PM, Avi Kivity wrote:

x86_decode_insn() doesn't return X86EMUL_* values, it returns
EMULATION_* codes.  Adjust the check.

Signed-off-by: Avi Kivitya...@redhat.com
---
  arch/x86/kvm/x86.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fa708c9..b20499d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4391,7 +4391,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
	vcpu->arch.emulate_ctxt.perm_ok = false;

	r = x86_decode_insn(&vcpu->arch.emulate_ctxt, insn, insn_len);
-   if (r == X86EMUL_PROPAGATE_FAULT)
+   if (r != EMULATION_OK)
goto done;

trace_kvm_emulate_insn_start(vcpu);


Strangely, this patch causes a failure in Windows XP.  Dropping until 
submitter investigates and fixes.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC -v4 PATCH 3/3] kvm: use yield_to instead of sleep in kvm_vcpu_on_spin

2011-01-13 Thread Rik van Riel

On 01/13/2011 08:16 AM, Avi Kivity wrote:


+ for (pass = 0; pass < 2 && !yielded; pass++) {
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ struct task_struct *task = vcpu->task;
+ if (!pass && i < last_boosted_vcpu) {
+ i = last_boosted_vcpu;
+ continue;
+ } else if (pass && i > last_boosted_vcpu)
+ break;
+ if (vcpu == me)
+ continue;
+ if (!task)
+ continue;
+ if (waitqueue_active(&vcpu->wq))
+ continue;


Suppose the vcpu exits at this point, and its task terminates.


Arghh, good point.


I think the first patch needs some reference counting... I'd move it to
the outermost KVM_RUN loop to reduce the performance impact.


I don't see how refcounting from that other thread could
possibly help, and I now see that the task_struct_cachep
does not have SLAB_DESTROY_BY_LRU, either :(

What do you have in mind here that would both work and
be acceptable to you as KVM maintainer?

--
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] KVM: SVM: Fix NMI path when NMI happens in guest mode

2011-01-13 Thread Joerg Roedel
The vmexit path on SVM needs to restore the KERNEL_GS_BASE
MSR in order to safely execute the NMI handler. Otherwise a
pending NMI can occur after the STGI instruction and crash
the machine.
This makes it impossible to run perf and kvm in parallel on
an AMD machine in a stable way.

Cc: sta...@kernel.org
Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/kvm/svm.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 25bd1bc..8b9bc72 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3637,6 +3637,7 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
 
 #ifdef CONFIG_X86_64
	wrmsrl(MSR_GS_BASE, svm->host.gs_base);
+	wrmsrl(MSR_KERNEL_GS_BASE, current->thread.gs);
 #else
	loadsegment(fs, svm->host.fs);
 #endif
-- 
1.7.1


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/2] perf-kvm support for SVM

2011-01-13 Thread Joerg Roedel
Hi Avi, Marcelo,

these two patches finally implement perf-kvm support for AMD machines.
The meat is in the second patch. The first one is an important fix
which, when missing, causes system crashes when NMI happen while in
guest mode. So the first patch should also make it to the various
stable-queues.

Regards,

Joerg


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] KVM: SVM: Add support for perf-kvm

2011-01-13 Thread Joerg Roedel
This patch adds the necessary code to run perf-kvm on AMD
machines.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/kvm/svm.c |   12 ++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 8b9bc72..2415129 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3646,13 +3646,21 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
 
local_irq_disable();
 
-   stgi();
-
	vcpu->arch.cr2 = svm->vmcb->save.cr2;
	vcpu->arch.regs[VCPU_REGS_RAX] = svm->vmcb->save.rax;
	vcpu->arch.regs[VCPU_REGS_RSP] = svm->vmcb->save.rsp;
	vcpu->arch.regs[VCPU_REGS_RIP] = svm->vmcb->save.rip;

+	if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI))
+		kvm_before_handle_nmi(&svm->vcpu);
+
+	stgi();
+
+	/* Any pending NMI will happen here */
+
+	if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI))
+		kvm_after_handle_nmi(&svm->vcpu);
+
	sync_cr8_to_lapic(vcpu);

	svm->next_rip = 0;
-- 
1.7.1


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC -v4 PATCH 3/3] kvm: use yield_to instead of sleep in kvm_vcpu_on_spin

2011-01-13 Thread Avi Kivity

On 01/13/2011 05:06 PM, Rik van Riel wrote:



I think the first patch needs some reference counting... I'd move it to
the outermost KVM_RUN loop to reduce the performance impact.


I don't see how refcounting from that other thread could
possibly help, and I now see that the task_struct_cachep
does not have SLAB_DESTROY_BY_LRU, either :(

What do you have in mind here that would both work and
be acceptable to you as KVM maintainer?



I think a 'struct pid' fits the bill here.
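
A rough sketch of that direction (illustrative only: it assumes a struct pid field on the vcpu that gets set when the task enters KVM_RUN, and vcpu_get_task() is an invented helper name, not code from this series):

    #include <linux/pid.h>
    #include <linux/rcupdate.h>
    #include <linux/sched.h>

    /* Resolve the vcpu's struct pid to a task, taking a reference on it. */
    static struct task_struct *vcpu_get_task(struct kvm_vcpu *vcpu)
    {
            struct task_struct *task = NULL;
            struct pid *pid;

            rcu_read_lock();
            pid = rcu_dereference(vcpu->pid);               /* assumed field */
            if (pid)
                    task = get_pid_task(pid, PIDTYPE_PID);  /* grabs a task ref */
            rcu_read_unlock();

            return task;    /* caller drops it with put_task_struct() */
    }

With something like that, the boosting loop can call yield_to(task, 1) and then put_task_struct(task) without racing against the task exiting.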


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] KVM: SVM: Fix NMI path when NMI happens in guest mode

2011-01-13 Thread Avi Kivity

On 01/13/2011 05:22 PM, Joerg Roedel wrote:

The vmexit path on SVM needs to restore the KERNEL_GS_BASE
MSR in order to savely execute the NMI handler. Otherwise a
pending NMI can occur after the STGI instruction and crash
the machine.
This makes it impossible to run perf and kvm in parallel on
an AMD machine in a stable way.

Cc: sta...@kernel.org
Signed-off-by: Joerg Roedeljoerg.roe...@amd.com
---
  arch/x86/kvm/svm.c |1 +
  1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 25bd1bc..8b9bc72 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3637,6 +3637,7 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)

  #ifdef CONFIG_X86_64
	wrmsrl(MSR_GS_BASE, svm->host.gs_base);
+   wrmsrl(MSR_KERNEL_GS_BASE, current->thread.gs);
  #else
	loadsegment(fs, svm->host.fs);
  #endif


Why would an NMI crash if MSR_KERNEL_GS_BASE is bad?

I see save_paranoid depends on MSR_GS_BASE (specifically its sign, which 
is bad for the new instructions that allow userspace to write gsbase), 
but not on MSR_KERNEL_GS_BASE.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] KVM updates for the 2.6.38 merge window

2011-01-13 Thread Linus Torvalds
On Thu, Jan 13, 2011 at 4:53 AM, Gleb Natapov g...@redhat.com wrote:

 I implemented get_user_pages_nowait() on top of your patch. In my testing
 it works as expected when used inside KVM. Does this looks OK to you?

It looks reasonable, although I suspect the subtle behavior wrt the
mmap_sem means that you should not expose the magic bare
FAULT_FLAG_ALLOW_RETRY flag to the __get_user_pages() thing. It's just
too easy to introduce bugs, methinks.

So I'd suggest

 - drop FOLL_RETRY

 -  make FOLL_NOWAIT set both (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_NOWAIT)

and that way the get_user_pages() thing will never release the
mmap_sem, and you never have any subtle locking issues for that
particular interface.

But some other VM person should look at it too.
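
Concretely, the suggested mapping might look roughly like this inside __get_user_pages() (a sketch only: gup_fault_flags() is an invented helper, and FAULT_FLAG_RETRY_NOWAIT is the name Gleb's patch above uses for what this mail calls FAULT_FLAG_NOWAIT):

    /* Sketch: map gup flags to fault flags; FOLL_NOWAIT implies both. */
    static inline unsigned int gup_fault_flags(unsigned int foll_flags)
    {
            unsigned int fault_flags = 0;

            if (foll_flags & FOLL_WRITE)
                    fault_flags |= FAULT_FLAG_WRITE;
            if (foll_flags & FOLL_NOWAIT)           /* FOLL_RETRY dropped */
                    fault_flags |= FAULT_FLAG_ALLOW_RETRY |
                                   FAULT_FLAG_RETRY_NOWAIT;

            return fault_flags;
    }

    /* ...and in __get_user_pages():
     *      ret = handle_mm_fault(mm, vma, start, gup_fault_flags(foll_flags));
     */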

 Linus
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] KVM: SVM: Fix NMI path when NMI happens in guest mode

2011-01-13 Thread Jan Kiszka
Am 13.01.2011 16:22, Joerg Roedel wrote:
 The vmexit path on SVM needs to restore the KERNEL_GS_BASE
 MSR in order to savely execute the NMI handler. Otherwise a
 pending NMI can occur after the STGI instruction and crash
 the machine.
 This makes it impossible to run perf and kvm in parallel on
 an AMD machine in a stable way.
 
 Cc: sta...@kernel.org
 Signed-off-by: Joerg Roedel joerg.roe...@amd.com
 ---
  arch/x86/kvm/svm.c |1 +
  1 files changed, 1 insertions(+), 0 deletions(-)
 
 diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
 index 25bd1bc..8b9bc72 100644
 --- a/arch/x86/kvm/svm.c
 +++ b/arch/x86/kvm/svm.c
 @@ -3637,6 +3637,7 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
  
  #ifdef CONFIG_X86_64
  wrmsrl(MSR_GS_BASE, svm->host.gs_base);
 + wrmsrl(MSR_KERNEL_GS_BASE, current->thread.gs);
  #else
  loadsegment(fs, svm->host.fs);
  #endif

Doesn't this also obsolete the wrmsrl(MSR_KERNEL_GS_BASE) in svm_vcpu_put?

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] KVM: SVM: Fix NMI path when NMI happens in guest mode

2011-01-13 Thread Roedel, Joerg
On Thu, Jan 13, 2011 at 10:42:01AM -0500, Avi Kivity wrote:
 On 01/13/2011 05:22 PM, Joerg Roedel wrote:
  The vmexit path on SVM needs to restore the KERNEL_GS_BASE
  MSR in order to savely execute the NMI handler. Otherwise a
  pending NMI can occur after the STGI instruction and crash
  the machine.
  This makes it impossible to run perf and kvm in parallel on
  an AMD machine in a stable way.
 
  Cc: sta...@kernel.org
  Signed-off-by: Joerg Roedeljoerg.roe...@amd.com
  ---
arch/x86/kvm/svm.c |1 +
1 files changed, 1 insertions(+), 0 deletions(-)
 
  diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
  index 25bd1bc..8b9bc72 100644
  --- a/arch/x86/kvm/svm.c
  +++ b/arch/x86/kvm/svm.c
  @@ -3637,6 +3637,7 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
 
#ifdef CONFIG_X86_64
  wrmsrl(MSR_GS_BASE, svm->host.gs_base);
  +   wrmsrl(MSR_KERNEL_GS_BASE, current->thread.gs);
   #else
  loadsegment(fs, svm->host.fs);
#endif
 
 Why would an NMI crash if MSR_KERNEL_GS_BASE is bad?
 
 I see save_paranoid depends on MSR_GS_BASE (specifically its sign, which 
 is bad for the new instructions that allow userspace to write gsbase), 
 but not on MSR_KERNEL_GS_BASE.

That's a good question. I have no idea. I spent some time trying to
figure this out (after I found out that wrong KERNEL_GS_BASE was the
cause of the crashes) but had no luck.

This also doesn't happen every time an NMI is delivered in svm_vcpu_run.
Sometimes it runs perfectly in parallel for a few minutues before the
machine triple-faults.

I also had a look at entry_64.S. The save_paranoid path can't be the
cause because MSR_GS_BASE is already negative at this point. The
re-schedule condition check at the end of the NMI handler code can't be
the cause either, because the NMI happens while preemption (and
interrupts) are disabled (and a re-schedule would also trigger the
preempt notifiers and restore KERNEL_GS_BASE).

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] KVM: SVM: Fix NMI path when NMI happens in guest mode

2011-01-13 Thread Roedel, Joerg
On Thu, Jan 13, 2011 at 10:48:33AM -0500, Jan Kiszka wrote:
 Am 13.01.2011 16:22, Joerg Roedel wrote:
   #ifdef CONFIG_X86_64
  wrmsrl(MSR_GS_BASE, svm->host.gs_base);
  +   wrmsrl(MSR_KERNEL_GS_BASE, current->thread.gs);
   #else
  loadsegment(fs, svm->host.fs);
   #endif
 
 Doesn't this also obsolete the wrmsrl(MSR_KERNEL_GS_BASE) in svm_vcpu_put?

Yes it does. Thanks.

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] Re: KVM call agenda for Jan 11

2011-01-13 Thread Avi Kivity

On 01/11/2011 04:43 PM, Anthony Liguori wrote:

- invalidate all buffers for that block device on machine A after
  migration.
  * with NFS, just close + reopen the file (and pray that nobody else
    has it also opened)
  * with block devices: use the BLKFLSBUF ioctl, and pray that nobody
    else is using the device, that the device is not a ramdisk, and some
    more things.  To add injury to insult, you need to be root to be
    able to issue that ioctl (technically have CAP_SYS_ADMIN).



Why isn't fsync() enough for a block device?


fsync() is fine on the outgoing side, but not on the incoming side.

(the incoming side might have valid buffers if it was the outgoing side 
on the previous migration, for example, or because of automatic probing)
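
For illustration only (not part of any patch in this thread), invalidating cached buffers for a raw block device on the incoming host could look roughly like this; as noted above, BLKFLSBUF needs root (CAP_SYS_ADMIN):

    #include <fcntl.h>
    #include <linux/fs.h>       /* BLKFLSBUF */
    #include <sys/ioctl.h>
    #include <unistd.h>

    /* Flush dirty data and drop cached buffers for a block device. */
    static int flush_block_dev(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;
        if (fsync(fd) < 0 || ioctl(fd, BLKFLSBUF, 0) < 0) {
            close(fd);
            return -1;
        }
        return close(fd);
    }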


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 02/19] Introduce read() to FdMigrationState.

2011-01-13 Thread Yoshiaki Tamura
Currently FdMigrationState doesn't support read(), and this patch
introduces it to get response from the other side.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 migration-tcp.c |   15 +++
 migration.c |   12 
 migration.h |3 +++
 3 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/migration-tcp.c b/migration-tcp.c
index b55f419..96e2411 100644
--- a/migration-tcp.c
+++ b/migration-tcp.c
@@ -39,6 +39,20 @@ static int socket_write(FdMigrationState *s, const void * buf, size_t size)
 return send(s->fd, buf, size, 0);
 }
 
+static int socket_read(FdMigrationState *s, const void * buf, size_t size)
+{
+ssize_t len;
+
+do { 
+len = recv(s->fd, (void *)buf, size, 0);
+} while (len == -1 && socket_error() == EINTR);
+if (len == -1) {
+len = -socket_error();
+}
+
+return len;
+}
+
 static int tcp_close(FdMigrationState *s)
 {
 DPRINTF(tcp_close\n);
@@ -94,6 +108,7 @@ MigrationState *tcp_start_outgoing_migration(Monitor *mon,
 
 s->get_error = socket_errno;
 s->write = socket_write;
+s->read = socket_read;
 s->close = tcp_close;
 s->mig_state.cancel = migrate_fd_cancel;
 s->mig_state.get_status = migrate_fd_get_status;
diff --git a/migration.c b/migration.c
index e5ba51c..6416ae5 100644
--- a/migration.c
+++ b/migration.c
@@ -330,6 +330,18 @@ ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size)
 return ret;
 }
 
+int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, int size)
+{
+FdMigrationState *s = opaque;
+ssize_t ret;
+ret = s->read(s, data, size);
+
+if (ret == -1)
+ret = -(s->get_error(s));
+
+return ret;
+}
+
 void migrate_fd_connect(FdMigrationState *s)
 {
 int ret;
diff --git a/migration.h b/migration.h
index d13ed4f..f033262 100644
--- a/migration.h
+++ b/migration.h
@@ -47,6 +47,7 @@ struct FdMigrationState
 int (*get_error)(struct FdMigrationState*);
 int (*close)(struct FdMigrationState*);
 int (*write)(struct FdMigrationState*, const void *, size_t);
+int (*read)(struct FdMigrationState *, const void *, size_t);
 void *opaque;
 };
 
@@ -115,6 +116,8 @@ void migrate_fd_put_notify(void *opaque);
 
 ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size);
 
+int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, int size);
+
 void migrate_fd_connect(FdMigrationState *s);
 
 void migrate_fd_put_ready(void *opaque);
-- 
1.7.1.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 12/19] Insert event_tap_mmio() to cpu_physical_memory_rw() in exec.c.

2011-01-13 Thread Yoshiaki Tamura
Record mmio write event to replay it upon failover.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 exec.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/exec.c b/exec.c
index 49c28b1..4a171cc 100644
--- a/exec.c
+++ b/exec.c
@@ -33,6 +33,7 @@
 #include "osdep.h"
 #include "kvm.h"
 #include "qemu-timer.h"
+#include "event-tap.h"
 #if defined(CONFIG_USER_ONLY)
 #include <qemu.h>
 #include <signal.h>
@@ -3625,6 +3626,9 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
 io_index = (pd >> IO_MEM_SHIFT) & (IO_MEM_NB_ENTRIES - 1);
 if (p)
 addr1 = (addr & ~TARGET_PAGE_MASK) + p->region_offset;
+
+event_tap_mmio(addr, buf, len);
+
 /* XXX: could force cpu_single_env to NULL to avoid
potential bugs */
 if (l >= 4 && ((addr1 & 3) == 0)) {
-- 
1.7.1.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 04/19] qemu-char: export socket_set_nodelay().

2011-01-13 Thread Yoshiaki Tamura
Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 qemu-char.c   |2 +-
 qemu_socket.h |1 +
 2 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/qemu-char.c b/qemu-char.c
index edc9ad6..737d347 100644
--- a/qemu-char.c
+++ b/qemu-char.c
@@ -2116,7 +2116,7 @@ static void tcp_chr_telnet_init(int fd)
 send(fd, (char *)buf, 3, 0);
 }
 
-static void socket_set_nodelay(int fd)
+void socket_set_nodelay(int fd)
 {
 int val = 1;
 setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, (char *)&val, sizeof(val));
diff --git a/qemu_socket.h b/qemu_socket.h
index 897a8ae..b7f8465 100644
--- a/qemu_socket.h
+++ b/qemu_socket.h
@@ -36,6 +36,7 @@ int inet_aton(const char *cp, struct in_addr *ia);
 int qemu_socket(int domain, int type, int protocol);
 int qemu_accept(int s, struct sockaddr *addr, socklen_t *addrlen);
 void socket_set_nonblock(int fd);
+void socket_set_nodelay(int fd);
 int send_all(int fd, const void *buf, int len1);
 
 /* New, ipv6-ready socket helper functions, see qemu-sockets.c */
-- 
1.7.1.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 01/19] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer().

2011-01-13 Thread Yoshiaki Tamura
Currently buf size is fixed at 32KB.  It would be useful if it could
be flexible.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 hw/hw.h  |2 ++
 savevm.c |   20 +++-
 2 files changed, 21 insertions(+), 1 deletions(-)

diff --git a/hw/hw.h b/hw/hw.h
index 163a683..a506688 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -58,6 +58,8 @@ void qemu_fflush(QEMUFile *f);
 int qemu_fclose(QEMUFile *f);
 void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size);
 void qemu_put_byte(QEMUFile *f, int v);
+void *qemu_realloc_buffer(QEMUFile *f, int size);
+void qemu_clear_buffer(QEMUFile *f);
 
 static inline void qemu_put_ubyte(QEMUFile *f, unsigned int v)
 {
diff --git a/savevm.c b/savevm.c
index 90aa237..8c64c63 100644
--- a/savevm.c
+++ b/savevm.c
@@ -172,7 +172,8 @@ struct QEMUFile {
when reading */
 int buf_index;
 int buf_size; /* 0 when writing */
-uint8_t buf[IO_BUF_SIZE];
+int buf_max_size;
+uint8_t *buf;
 
 int has_error;
 };
@@ -423,6 +424,9 @@ QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc *put_buffer,
 f->get_rate_limit = get_rate_limit;
 f->is_write = 0;
 
+f->buf_max_size = IO_BUF_SIZE;
+f->buf = qemu_malloc(sizeof(uint8_t) * f->buf_max_size);
+
 return f;
 }
 
@@ -453,6 +457,19 @@ void qemu_fflush(QEMUFile *f)
 }
 }
 
+void *qemu_realloc_buffer(QEMUFile *f, int size)
+{
+f->buf_max_size = size;
+f->buf = qemu_realloc(f->buf, f->buf_max_size);
+
+return f->buf;
+}
+
+void qemu_clear_buffer(QEMUFile *f)
+{
+f->buf_size = f->buf_index = f->buf_offset = 0;
+}
+
 static void qemu_fill_buffer(QEMUFile *f)
 {
 int len;
@@ -478,6 +495,7 @@ int qemu_fclose(QEMUFile *f)
 qemu_fflush(f);
 if (f->close)
 ret = f->close(f->opaque);
+qemu_free(f->buf);
 qemu_free(f);
 return ret;
 }
-- 
1.7.1.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 16/19] migration: introduce migrate_ft_trans_{put,get}_ready(), and modify migrate_fd_put_ready() when ft_mode is on.

2011-01-13 Thread Yoshiaki Tamura
Introduce migrate_ft_trans_put_ready() which kicks the FT transaction
cycle.  When ft_mode is on, migrate_fd_put_ready() would open
ft_trans_file and turn on event_tap.  To end or cancel the FT transaction,
ft_mode and event_tap are turned off.  migrate_ft_trans_get_ready() is
called to receive the ack from the receiver.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 migration.c |  266 +-
 migration.h |2 +-
 2 files changed, 262 insertions(+), 6 deletions(-)

diff --git a/migration.c b/migration.c
index 01b723d..557349b 100644
--- a/migration.c
+++ b/migration.c
@@ -21,6 +21,7 @@
 #include "qemu_socket.h"
 #include "block-migration.h"
 #include "qemu-objects.h"
+#include "event-tap.h"
 
 //#define DEBUG_MIGRATION
 
@@ -274,6 +275,14 @@ void migrate_fd_error(FdMigrationState *s)
 migrate_fd_cleanup(s);
 }
 
+static void migrate_ft_trans_error(FdMigrationState *s)
+{
+ft_mode = FT_ERROR;
+qemu_savevm_state_cancel(s->mon, s->file);
+migrate_fd_error(s);
+event_tap_unregister();
+}
+
 int migrate_fd_cleanup(FdMigrationState *s)
 {
 int ret = 0;
@@ -309,6 +318,17 @@ void migrate_fd_put_notify(void *opaque)
 qemu_file_put_notify(s-file);
 }
 
+static void migrate_fd_get_notify(void *opaque)
+{
+FdMigrationState *s = opaque;
+
+qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
+qemu_file_get_notify(s->file);
+if (qemu_file_has_error(s->file)) {
+migrate_ft_trans_error(s);
+}
+}
+
 ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size)
 {
 FdMigrationState *s = opaque;
@@ -333,15 +353,20 @@ ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size)
 return ret;
 }
 
-int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, int size)
+int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, size_t size)
 {
 FdMigrationState *s = opaque;
-ssize_t ret;
+int ret;
 ret = s->read(s, data, size);

-if (ret == -1)
+if (ret == -1) {
 ret = -(s->get_error(s));
-
+}
+
+if (ret == -EAGAIN) {
+qemu_set_fd_handler2(s->fd, NULL, migrate_fd_get_notify, NULL, s);
+}
+
 return ret;
 }
 
@@ -368,6 +393,226 @@ void migrate_fd_connect(FdMigrationState *s)
 migrate_fd_put_ready(s);
 }
 
+static int migrate_ft_trans_commit(void *opaque)
+{
+FdMigrationState *s = opaque;
+int ret = -1;
+
+if (ft_mode != FT_TRANSACTION_COMMIT && ft_mode != FT_TRANSACTION_ATOMIC) {
+fprintf(stderr,
+"migrate_ft_trans_commit: invalid ft_mode %d\n", ft_mode);
+goto out;
+}
+
+do {
+if (ft_mode == FT_TRANSACTION_ATOMIC) {
+if (qemu_ft_trans_begin(s->file) < 0) {
+fprintf(stderr, "qemu_ft_trans_begin failed\n");
+goto out;
+}
+
+if ((ret = qemu_savevm_trans_begin(s->mon, s->file, 0)) < 0) {
+fprintf(stderr, "qemu_savevm_trans_begin failed\n");
+goto out;
+}
+
+ft_mode = FT_TRANSACTION_COMMIT;
+if (ret) {
+/* don't proceed until the fd is ready */
+goto out;
+}
+}
+
+/* make the VM state consistent by flushing outstanding events */
+vm_stop(0);
+
+/* send at full speed */
+qemu_file_set_rate_limit(s->file, 0);
+
+if ((ret = qemu_savevm_trans_complete(s->mon, s->file)) < 0) {
+fprintf(stderr, "qemu_savevm_trans_complete failed\n");
+goto out;
+}
+
+if (ret) {
+/* don't proceed until the fd is ready */
+ret = 1;
+goto out;
+}
+
+if ((ret = qemu_ft_trans_commit(s->file)) < 0) {
+fprintf(stderr, "qemu_ft_trans_commit failed\n");
+goto out;
+}
+
+if (ret) {
+ft_mode = FT_TRANSACTION_RECV;
+ret = 1;
+goto out;
+}
+
+/* flush and check if events are remaining */
+vm_start();
+if ((ret = event_tap_flush_one()) < 0) {
+fprintf(stderr, "event_tap_flush_one failed\n");
+goto out;
+}
+
+ft_mode =  ret ? FT_TRANSACTION_BEGIN : FT_TRANSACTION_ATOMIC;
+} while (ft_mode != FT_TRANSACTION_BEGIN);
+
+vm_start();
+ret = 0;
+
+out:
+return ret;
+}
+
+static int migrate_ft_trans_get_ready(void *opaque)
+{
+FdMigrationState *s = opaque;
+int ret = -1;
+
+if (ft_mode != FT_TRANSACTION_RECV) {
+fprintf(stderr,
+"migrate_ft_trans_get_ready: invalid ft_mode %d\n", ft_mode);
+goto error_out;
+}
+
+/* flush and check if events are remaining */
+vm_start();
+if ((ret = event_tap_flush_one()) < 0) {
+fprintf(stderr, "event_tap_flush_one failed\n");
+goto error_out;
+}
+
+if (ret) {
+ft_mode = FT_TRANSACTION_BEGIN;
+} else {
+

[PATCH 19/19] migration: add a parser to accept FT migration incoming mode.

2011-01-13 Thread Yoshiaki Tamura
The option looks like, -incoming protocol:address:port,ft_mode

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 migration.c |9 -
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/migration.c b/migration.c
index d4922ce..1822e97 100644
--- a/migration.c
+++ b/migration.c
@@ -42,9 +42,16 @@ static MigrationState *current_migration;
 
 int qemu_start_incoming_migration(const char *uri)
 {
-const char *p;
+const char *p = uri;
 int ret;
 
+/* check ft_mode option */
+if ((p = strstr(p, "ft_mode"))) {
+if (!strcmp(p, "ft_mode")) {
+ft_mode = FT_INIT;
+}
+}
+
 if (strstart(uri, "tcp:", &p))
 ret = tcp_start_incoming_migration(p);
 #if !defined(WIN32)
-- 
1.7.1.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 07/19] Introduce fault tolerant VM transaction QEMUFile and ft_mode.

2011-01-13 Thread Yoshiaki Tamura
This code implements VM transaction protocol.  Like buffered_file, it
sits between savevm and migration layer.  With this architecture, VM
transaction protocol is implemented mostly independent from other
existing code.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp
---
 Makefile.objs   |1 +
 ft_trans_file.c |  626 +++
 ft_trans_file.h |   72 +++
 migration.c |3 +
 trace-events|   16 ++
 5 files changed, 718 insertions(+), 0 deletions(-)
 create mode 100644 ft_trans_file.c
 create mode 100644 ft_trans_file.h

diff --git a/Makefile.objs b/Makefile.objs
index c3e52c5..de38579 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -100,6 +100,7 @@ common-obj-y += msmouse.o ps2.o
 common-obj-y += qdev.o qdev-properties.o
 common-obj-y += block-migration.o
 common-obj-y += pflib.o
+common-obj-y += ft_trans_file.o
 
 common-obj-$(CONFIG_BRLAPI) += baum.o
 common-obj-$(CONFIG_POSIX) += migration-exec.o migration-unix.o migration-fd.o
diff --git a/ft_trans_file.c b/ft_trans_file.c
new file mode 100644
index 000..9dee8e9
--- /dev/null
+++ b/ft_trans_file.c
@@ -0,0 +1,626 @@
+/*
+ * Fault tolerant VM transaction QEMUFile
+ *
+ * Copyright (c) 2010 Nippon Telegraph and Telephone Corporation. 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * This source code is based on buffered_file.c.
+ * Copyright IBM, Corp. 2008
+ * Authors:
+ *  Anthony Liguori <aligu...@us.ibm.com>
+ */
+
+#include "qemu-common.h"
+#include "qemu-error.h"
+#include "hw/hw.h"
+#include "qemu-timer.h"
+#include "sysemu.h"
+#include "qemu-char.h"
+#include "trace.h"
+#include "ft_trans_file.h"
+
+// #define DEBUG_FT_TRANSACTION
+
+typedef struct FtTransHdr
+{
+uint16_t cmd;
+uint16_t id;
+uint32_t seq;
+uint32_t payload_len;
+} FtTransHdr;
+
+typedef struct QEMUFileFtTrans
+{
+FtTransPutBufferFunc *put_buffer;
+FtTransGetBufferFunc *get_buffer;
+FtTransPutReadyFunc *put_ready;
+FtTransGetReadyFunc *get_ready;
+FtTransWaitForUnfreezeFunc *wait_for_unfreeze;
+FtTransCloseFunc *close;
+void *opaque;
+QEMUFile *file;
+
+enum QEMU_VM_TRANSACTION_STATE state;
+uint32_t seq;
+uint16_t id;
+
+int has_error;
+
+bool freeze_output;
+bool freeze_input;
+bool rate_limit;
+bool is_sender;
+bool is_payload;
+
+uint8_t *buf;
+size_t buf_max_size;
+size_t put_offset;
+size_t get_offset;
+
+FtTransHdr header;
+size_t header_offset;
+} QEMUFileFtTrans;
+
+#define IO_BUF_SIZE 32768
+
+static void ft_trans_append(QEMUFileFtTrans *s,
+const uint8_t *buf, size_t size)
+{
+if (size > (s->buf_max_size - s->put_offset)) {
+trace_ft_trans_realloc(s->buf_max_size, size + 1024);
+s->buf_max_size += size + 1024;
+s->buf = qemu_realloc(s->buf, s->buf_max_size);
+}
+
+trace_ft_trans_append(size);
+memcpy(s->buf + s->put_offset, buf, size);
+s->put_offset += size;
+}
+
+static void ft_trans_flush(QEMUFileFtTrans *s)
+{
+size_t offset = 0;
+
+if (s->has_error) {
+error_report("flush when error %d, bailing", s->has_error);
+return;
+}
+
+while (offset < s->put_offset) {
+ssize_t ret;
+
+ret = s->put_buffer(s->opaque, s->buf + offset, s->put_offset - offset);
+if (ret == -EAGAIN) {
+break;
+}
+
+if (ret <= 0) {
+error_report("error flushing data, %s", strerror(errno));
+s->has_error = FT_TRANS_ERR_FLUSH;
+break;
+} else {
+offset += ret;
+}
+}
+
+trace_ft_trans_flush(offset, s->put_offset);
+memmove(s->buf, s->buf + offset, s->put_offset - offset);
+s->put_offset -= offset;
+s->freeze_output = !!s->put_offset;
+}
+
+static ssize_t ft_trans_put(void *opaque, void *buf, int size)
+{
+QEMUFileFtTrans *s = opaque;
+size_t offset = 0;
+ssize_t len;
+
+/* flush buffered data before putting next */
+if (s->put_offset) {
+ft_trans_flush(s);
+}
+
+while (!s->freeze_output && offset < size) {
+len = s->put_buffer(s->opaque, (uint8_t *)buf + offset, size - offset);
+
+if (len == -EAGAIN) {
+trace_ft_trans_freeze_output();
+s->freeze_output = 1;
+break;
+}
+
+if (len <= 0) {
+error_report("putting data failed, %s", strerror(errno));
+s->has_error = 1;
+offset = -EINVAL;
+break;
+}
+
+offset += len;
+}
+
+if (s->freeze_output) {
+ft_trans_append(s, buf + offset, size - offset);
+offset = size;
+}
+
+return offset;
+}
+
+static int ft_trans_send_header(QEMUFileFtTrans *s,
+enum QEMU_VM_TRANSACTION_STATE state,
+uint32_t 

[PATCH 08/19] savevm: introduce util functions to control ft_trans_file from savevm layer.

2011-01-13 Thread Yoshiaki Tamura
To utilize ft_trans_file function, savevm needs interfaces to be
exported.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 hw/hw.h  |5 ++
 savevm.c |  148 ++
 2 files changed, 153 insertions(+), 0 deletions(-)

diff --git a/hw/hw.h b/hw/hw.h
index a506688..ace1744 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -51,6 +51,7 @@ QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc *put_buffer,
 QEMUFile *qemu_fopen(const char *filename, const char *mode);
 QEMUFile *qemu_fdopen(int fd, const char *mode);
 QEMUFile *qemu_fopen_socket(int fd);
+QEMUFile *qemu_fopen_ft_trans(int s_fd, int c_fd);
 QEMUFile *qemu_popen(FILE *popen_file, const char *mode);
 QEMUFile *qemu_popen_cmd(const char *command, const char *mode);
 int qemu_stdio_fd(QEMUFile *f);
@@ -60,6 +61,9 @@ void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size);
 void qemu_put_byte(QEMUFile *f, int v);
 void *qemu_realloc_buffer(QEMUFile *f, int size);
 void qemu_clear_buffer(QEMUFile *f);
+int qemu_ft_trans_begin(QEMUFile *f);
+int qemu_ft_trans_commit(QEMUFile *f);
+int qemu_ft_trans_cancel(QEMUFile *f);
 
 static inline void qemu_put_ubyte(QEMUFile *f, unsigned int v)
 {
@@ -94,6 +98,7 @@ void qemu_file_set_error(QEMUFile *f);
  * halted due to rate limiting or EAGAIN errors occur as it can be used to
  * resume output. */
 void qemu_file_put_notify(QEMUFile *f);
+void qemu_file_get_notify(void *opaque);
 
 static inline void qemu_put_be64s(QEMUFile *f, const uint64_t *pv)
 {
diff --git a/savevm.c b/savevm.c
index 7bc3699..ebb3ef8 100644
--- a/savevm.c
+++ b/savevm.c
@@ -83,6 +83,7 @@
 #include "migration.h"
 #include "qemu_socket.h"
 #include "qemu-queue.h"
+#include "ft_trans_file.h"
 
 #define SELF_ANNOUNCE_ROUNDS 5
 
@@ -190,6 +191,13 @@ typedef struct QEMUFileSocket
 QEMUFile *file;
 } QEMUFileSocket;
 
+typedef struct QEMUFileSocketTrans
+{
+int fd;
+QEMUFileSocket *s;
+VMChangeStateEntry *e;
+} QEMUFileSocketTrans;
+
 static int socket_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size)
 {
 QEMUFileSocket *s = opaque;
@@ -205,6 +213,21 @@ static int socket_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size)
 return len;
 }
 
+static ssize_t socket_put_buffer(void *opaque, const void *buf, size_t size)
+{
+QEMUFileSocket *s = opaque;
+ssize_t len;
+
+do {
+len = send(s->fd, (void *)buf, size, 0);
+} while (len == -1 && socket_error() == EINTR);
+
+if (len == -1)
+len = -socket_error();
+
+return len;
+}
+
 static int socket_close(void *opaque)
 {
 QEMUFileSocket *s = opaque;
@@ -212,6 +235,70 @@ static int socket_close(void *opaque)
 return 0;
 }
 
+static int socket_trans_get_buffer(void *opaque, uint8_t *buf, int64_t pos, size_t size)
+{
+QEMUFileSocketTrans *t = opaque;
+QEMUFileSocket *s = t->s;
+ssize_t len;
+
+len = socket_get_buffer(s, buf, pos, size);
+
+return len;
+}
+
+static ssize_t socket_trans_put_buffer(void *opaque, const void *buf, size_t size)
+{
+QEMUFileSocketTrans *t = opaque;
+
+return socket_put_buffer(t->s, buf, size);
+}
+
+
+static int socket_trans_get_ready(void *opaque)
+{
+QEMUFileSocketTrans *t = opaque;
+QEMUFileSocket *s = t->s;
+QEMUFile *f = s->file;
+int ret = 0;
+
+ret = qemu_loadvm_state(f, 1);
+if (ret < 0) {
+fprintf(stderr,
+"socket_trans_get_ready: error while loading vmstate\n");
+}
+
+return ret;
+}
+
+static int socket_trans_close(void *opaque)
+{
+QEMUFileSocketTrans *t = opaque;
+QEMUFileSocket *s = t->s;
+
+qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
+qemu_set_fd_handler2(t->fd, NULL, NULL, NULL, NULL);
+qemu_del_vm_change_state_handler(t->e);
+close(s->fd);
+close(t->fd);
+qemu_free(s);
+qemu_free(t);
+
+return 0;
+}
+
+static void socket_trans_resume(void *opaque, int running, int reason)
+{
+QEMUFileSocketTrans *t = opaque;
+QEMUFileSocket *s = t->s;
+
+if (!running) {
+return;
+}
+
+qemu_announce_self();
+qemu_fclose(s->file);
+}
+
 static int stdio_put_buffer(void *opaque, const uint8_t *buf, int64_t pos, int size)
 {
 QEMUFileStdio *s = opaque;
@@ -334,6 +421,26 @@ QEMUFile *qemu_fopen_socket(int fd)
 return s->file;
 }
 
+QEMUFile *qemu_fopen_ft_trans(int s_fd, int c_fd)
+{
+QEMUFileSocketTrans *t = qemu_mallocz(sizeof(QEMUFileSocketTrans));
+QEMUFileSocket *s = qemu_mallocz(sizeof(QEMUFileSocket));
+
+t->s = s;
+t->fd = s_fd;
+t->e = qemu_add_vm_change_state_handler(socket_trans_resume, t);
+
+s->fd = c_fd;
+s->file = qemu_fopen_ops_ft_trans(t, socket_trans_put_buffer,
+  socket_trans_get_buffer, NULL,
+  socket_trans_get_ready,
+  migrate_fd_wait_for_unfreeze,
+  socket_trans_close, 0);
+

[PATCH 00/19] Kemari for KVM v0.2.4

2011-01-13 Thread Yoshiaki Tamura
Hi,

This patch series is a revised version of Kemari for KVM, which
applied comments for the previous post.  The current code is based on
qemu.git d03d11260ee2d55579e8b76116e35ccdf5031833.

The changes from v0.2.3 - v0.2.4 are:

- call vm_start() before event_tap_flush_one() to avoid failure in
  virtio-net assertion
- add vm_change_state_handler to turn off ft_mode
- use qemu_iovec functions in event-tap
- remove duplicated code in migration
- remove unnecessary new line for error_report in ft_trans_file

The changes from v0.2.2 - v0.2.3 are:

- queue async net requests without copying (MST)
-- if not async, contents of the packets are sent to the secondary
- better description for option -k (MST)
- fix memory transfer failure
- fix ft transaction initiation failure

The changes from v0.2.1 - v0.2.2 are:

- decrement last_avail_idx with inuse before saving (MST)
- remove qemu_aio_flush() and bdrv_flush_all() in migrate_ft_trans_commit()

The changes from v0.2 - v0.2.1 are:

- Move event-tap to net/block layer and use stubs (Blue, Paul, MST, Kevin)
- Tap bdrv_aio_flush (Marcelo)
- Remove multiwrite interface in event-tap (Stefan)
- Fix event-tap to use pio/mmio to replay both net/block (Stefan)
- Improve error handling in event-tap (Stefan)
- Fix leak in event-tap (Stefan)
- Revise virtio last_avail_idx manipulation (MST)
- Clean up migration.c hook (Marcelo)
- Make deleting change state handler robust (Isaku, Anthony)

The changes from v0.1.1 - v0.2 are:

- Introduce a queue in event-tap to make VM sync live.
- Change transaction receiver to a state machine for async receiving.
- Replace net/block layer functions with event-tap proxy functions.
- Remove dirty bitmap optimization for now.
- convert DPRINTF() in ft_trans_file to trace functions.
- convert fprintf() in ft_trans_file to error_report().
- improved error handling in ft_trans_file.
- add a tmp pointer to qemu_del_vm_change_state_handler.

The changes from v0.1 - v0.1.1 are:

- events are tapped in net/block layer instead of device emulation layer.
- Introduce a new option for -incoming to accept FT transaction.

- Removed writev() support to QEMUFile and FdMigrationState for now.
  I would post this work in a different series.

- Modified virtio-blk save/load handler to send inuse variable to
  correctly replay.

- Removed configure --enable-ft-mode.
- Removed unnecessary check for qemu_realloc().

The first 6 patches modify several functions of qemu to prepare
introducing Kemari specific components.

The next 6 patches are the components of Kemari.  They introduce
event-tap and the FT transaction protocol file based on buffered file.
The design document of FT transaction protocol can be found at,
http://wiki.qemu.org/images/b/b1/Kemari_sender_receiver_0.5a.pdf

Then the following 2 patches modifies net/block layer functions with
event-tap functions.  Please note that if Kemari is off, event-tap
will just pass through, and there is almost no intrusion to existing
functions, including normal live migration.

Finally, the migration layer are modified to support Kemari in the
last 5 patches.  Again, there shouldn't be any effect if a user
doesn't specify Kemari-specific options.  The transaction is now async
on both sender and receiver side.  The sender side respects the
max_downtime to decide when to switch from async to sync mode.

The repository contains all patches I'm sending with this message.
For those who want to try, please pull the following repository.  It
also includes dirty bitmap optimization which aren't ready for posting
yet.  To remove the dirty bitmap optimization, please look at HEAD~5
of the tree.

git://kemari.git.sourceforge.net/gitroot/kemari/kemari next

Thanks,

Yoshi

Yoshiaki Tamura (19):
  Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and
qemu_clear_buffer().
  Introduce read() to FdMigrationState.
  Introduce skip_header parameter to qemu_loadvm_state().
  qemu-char: export socket_set_nodelay().
  vl.c: add deleted flag for deleting the handler.
  virtio: decrement last_avail_idx with inuse before saving.
  Introduce fault tolerant VM transaction QEMUFile and ft_mode.
  savevm: introduce util functions to control ft_trans_file from savevm
layer.
  Introduce event-tap.
  Call init handler of event-tap at main() in vl.c.
  ioport: insert event_tap_ioport() to ioport_write().
  Insert event_tap_mmio() to cpu_physical_memory_rw() in exec.c.
  net: insert event-tap to qemu_send_packet() and
qemu_sendv_packet_async().
  block: insert event-tap to bdrv_aio_writev() and bdrv_aio_flush().
  savevm: introduce qemu_savevm_trans_{begin,commit}.
  migration: introduce migrate_ft_trans_{put,get}_ready(), and modify
migrate_fd_put_ready() when ft_mode is on.
  migration-tcp: modify tcp_accept_incoming_migration() to handle
ft_mode, and add a hack not to close fd when ft_mode is enabled.
  Introduce -k option to enable FT migration mode (Kemari).
  migration: add a parser to accept FT migration 

[PATCH 13/19] net: insert event-tap to qemu_send_packet() and qemu_sendv_packet_async().

2011-01-13 Thread Yoshiaki Tamura
event-tap function is called only when it is on.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 net.c |9 +
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/net.c b/net.c
index 9ba5be2..1176124 100644
--- a/net.c
+++ b/net.c
@@ -36,6 +36,7 @@
 #include "qemu-common.h"
 #include "qemu_socket.h"
 #include "hw/qdev.h"
+#include "event-tap.h"
 
 static QTAILQ_HEAD(, VLANState) vlans;
 static QTAILQ_HEAD(, VLANClientState) non_vlan_clients;
@@ -559,6 +560,10 @@ ssize_t qemu_send_packet_async(VLANClientState *sender,
 
 void qemu_send_packet(VLANClientState *vc, const uint8_t *buf, int size)
 {
+if (event_tap_is_on()) {
+return event_tap_send_packet(vc, buf, size);
+}
+
 qemu_send_packet_async(vc, buf, size, NULL);
 }
 
@@ -657,6 +662,10 @@ ssize_t qemu_sendv_packet_async(VLANClientState *sender,
 {
 NetQueue *queue;
 
+if (event_tap_is_on()) {
+return event_tap_sendv_packet_async(sender, iov, iovcnt, sent_cb);
+}
+
 if (sender->link_down || (!sender->peer && !sender->vlan)) {
 return calc_iov_length(iov, iovcnt);
 }
-- 
1.7.1.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 06/19] virtio: decrement last_avail_idx with inuse before saving.

2011-01-13 Thread Yoshiaki Tamura
For regular migration inuse == 0 always as requests are flushed before
save. However, event-tap log when enabled introduces an extra queue
for requests which is not being flushed, thus the last inuse requests
are left in the event-tap queue.  Move the last_avail_idx value sent
to the remote back to make it repeat the last inuse requests.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 hw/virtio.c |   10 +-
 1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/hw/virtio.c b/hw/virtio.c
index 07dbf86..408fad5 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -665,12 +665,20 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
 qemu_put_be32(f, i);
 
 for (i = 0; i  VIRTIO_PCI_QUEUE_MAX; i++) {
+/* For regular migration inuse == 0 always as
+ * requests are flushed before save. However, 
+ * event-tap log when enabled introduces an extra
+ * queue for requests which is not being flushed,
+ * thus the last inuse requests are left in the event-tap queue.
+ * Move the last_avail_idx value sent to the remote back
+ * to make it repeat the last inuse requests. */
+uint16_t last_avail = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
 if (vdev->vq[i].vring.num == 0)
 break;

 qemu_put_be32(f, vdev->vq[i].vring.num);
 qemu_put_be64(f, vdev->vq[i].pa);
-qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
+qemu_put_be16s(f, &last_avail);
 if (vdev->binding->save_queue)
 vdev->binding->save_queue(vdev->binding_opaque, i, f);
 }
-- 
1.7.1.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 15/19] savevm: introduce qemu_savevm_trans_{begin,commit}.

2011-01-13 Thread Yoshiaki Tamura
Introduce qemu_savevm_state_{begin,commit} to send the memory and
device info together, while avoiding cancelling memory state tracking.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 savevm.c |   88 ++
 sysemu.h |2 +
 2 files changed, 90 insertions(+), 0 deletions(-)

diff --git a/savevm.c b/savevm.c
index ebb3ef8..9d20c37 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1722,6 +1722,94 @@ int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f)
 return 0;
 }
 
+int qemu_savevm_trans_begin(Monitor *mon, QEMUFile *f, int init)
+{
+SaveStateEntry *se;
+int skipped = 0;
+
+QTAILQ_FOREACH(se, savevm_handlers, entry) {
+int len, stage, ret;
+
+if (se->save_live_state == NULL)
+continue;
+
+/* Section type */
+qemu_put_byte(f, QEMU_VM_SECTION_START);
+qemu_put_be32(f, se->section_id);
+
+/* ID string */
+len = strlen(se->idstr);
+qemu_put_byte(f, len);
+qemu_put_buffer(f, (uint8_t *)se->idstr, len);
+
+qemu_put_be32(f, se->instance_id);
+qemu_put_be32(f, se->version_id);
+
+stage = init ? QEMU_VM_SECTION_START : QEMU_VM_SECTION_PART;
+ret = se->save_live_state(mon, f, stage, se->opaque);
+if (!ret) {
+skipped++;
+}
+}
+
+if (qemu_file_has_error(f))
+return -EIO;
+
+return skipped;
+}
+
+int qemu_savevm_trans_complete(Monitor *mon, QEMUFile *f)
+{
+SaveStateEntry *se;
+
+cpu_synchronize_all_states();
+
+QTAILQ_FOREACH(se, savevm_handlers, entry) {
+int ret;
+
+if (se->save_live_state == NULL)
+continue;
+
+/* Section type */
+qemu_put_byte(f, QEMU_VM_SECTION_PART);
+qemu_put_be32(f, se->section_id);
+
+ret = se->save_live_state(mon, f, QEMU_VM_SECTION_PART, se->opaque);
+if (!ret) {
+/* do not proceed to the next vmstate. */
+return 1;
+}
+}
+
+QTAILQ_FOREACH(se, savevm_handlers, entry) {
+int len;
+
+if (se->save_state == NULL && se->vmsd == NULL)
+continue;
+
+/* Section type */
+qemu_put_byte(f, QEMU_VM_SECTION_FULL);
+qemu_put_be32(f, se->section_id);
+
+/* ID string */
+len = strlen(se->idstr);
+qemu_put_byte(f, len);
+qemu_put_buffer(f, (uint8_t *)se->idstr, len);
+
+qemu_put_be32(f, se->instance_id);
+qemu_put_be32(f, se->version_id);
+
+vmstate_save(f, se);
+}
+
+qemu_put_byte(f, QEMU_VM_EOF);
+
+if (qemu_file_has_error(f))
+return -EIO;
+
+return 0;
+}
+
 void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f)
 {
 SaveStateEntry *se;
diff --git a/sysemu.h b/sysemu.h
index 81bcf00..9c2c45e 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -80,6 +80,8 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
 int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f);
 int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f);
 void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f);
+int qemu_savevm_trans_begin(Monitor *mon, QEMUFile *f, int init);
+int qemu_savevm_trans_complete(Monitor *mon, QEMUFile *f);
 int qemu_loadvm_state(QEMUFile *f, int skip_header);
 
 /* SLIRP */
-- 
1.7.1.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 11/19] ioport: insert event_tap_ioport() to ioport_write().

2011-01-13 Thread Yoshiaki Tamura
Record ioport event to replay it upon failover.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 ioport.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/ioport.c b/ioport.c
index aa4188a..74aebf5 100644
--- a/ioport.c
+++ b/ioport.c
@@ -27,6 +27,7 @@
 
 #include "ioport.h"
 #include "trace.h"
+#include "event-tap.h"
 
 /***/
 /* IO Port */
@@ -76,6 +77,7 @@ static void ioport_write(int index, uint32_t address, uint32_t data)
 default_ioport_writel
 };
 IOPortWriteFunc *func = ioport_write_table[index][address];
+event_tap_ioport(index, address, data);
 if (!func)
 func = default_func[index];
 func(ioport_opaque[address], address, data);
-- 
1.7.1.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 10/19] Call init handler of event-tap at main() in vl.c.

2011-01-13 Thread Yoshiaki Tamura
Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 vl.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/vl.c b/vl.c
index 8bbb785..9faeb27 100644
--- a/vl.c
+++ b/vl.c
@@ -162,6 +162,7 @@ int main(int argc, char **argv)
 #include "qemu-queue.h"
 #include "cpus.h"
 #include "arch_init.h"
+#include "event-tap.h"
 
 #include "ui/qemu-spice.h"
 
@@ -2895,6 +2896,8 @@ int main(int argc, char **argv, char **envp)
 
 blk_mig_init();
 
+event_tap_init();
+
 if (default_cdrom) {
 /* we always create the cdrom drive, even if no disk is there */
 drive_add(NULL, CDROM_ALIAS);
-- 
1.7.1.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 18/19] Introduce -k option to enable FT migration mode (Kemari).

2011-01-13 Thread Yoshiaki Tamura
When -k option is set to migrate command, it will turn on ft_mode to
start FT migration mode (Kemari).
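
For example, a hypothetical monitor invocation (destination host and port are placeholders, not taken from this series) would look like:

    (qemu) migrate -d -k tcp:destination-host:4444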

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 hmp-commands.hx |7 ---
 migration.c |3 +++
 qmp-commands.hx |7 ---
 3 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index 1cea572..b7f8f2f 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -735,13 +735,14 @@ ETEXI
 
 {
 .name   = "migrate",
-.args_type  = "detach:-d,blk:-b,inc:-i,uri:s",
-.params = "[-d] [-b] [-i] uri",
+.args_type  = "detach:-d,blk:-b,inc:-i,ft:-k,uri:s",
+.params = "[-d] [-b] [-i] [-k] uri",
 .help   = "migrate to URI (using -d to not wait for completion)"
 "\n\t\t\t -b for migration without shared storage with"
 " full copy of disk\n\t\t\t -i for migration without "
 "shared storage with incremental copy of disk "
-"(base image shared between src and destination)",
+"(base image shared between src and destination)"
+"\n\t\t\t -k for Fault Tolerance mode (Kemari protocol)",
 .user_print = monitor_user_noop,   
.mhandler.cmd_new = do_migrate,
 },
diff --git a/migration.c b/migration.c
index 557349b..d4922ce 100644
--- a/migration.c
+++ b/migration.c
@@ -92,6 +92,9 @@ int do_migrate(Monitor *mon, const QDict *qdict, QObject 
**ret_data)
 return -1;
 }
 
+if (qdict_get_try_bool(qdict, "ft", 0))
+ft_mode = FT_INIT;
+
 if (strstart(uri, "tcp:", &p)) {
 s = tcp_start_outgoing_migration(mon, p, max_throttle, detach,
  blk, inc);
diff --git a/qmp-commands.hx b/qmp-commands.hx
index 56c4d8b..1521931 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -431,13 +431,14 @@ EQMP
 
 {
 .name   = "migrate",
-.args_type  = "detach:-d,blk:-b,inc:-i,uri:s",
-.params = "[-d] [-b] [-i] uri",
+.args_type  = "detach:-d,blk:-b,inc:-i,ft:-k,uri:s",
+.params = "[-d] [-b] [-i] [-k] uri",
 .help   = "migrate to URI (using -d to not wait for completion)"
 "\n\t\t\t -b for migration without shared storage with"
 " full copy of disk\n\t\t\t -i for migration without "
 "shared storage with incremental copy of disk "
-"(base image shared between src and destination)",
+"(base image shared between src and destination)"
+"\n\t\t\t -k for Fault Tolerance mode (Kemari protocol)",
 .user_print = monitor_user_noop,   
.mhandler.cmd_new = do_migrate,
 },
-- 
1.7.1.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 03/19] Introduce skip_header parameter to qemu_loadvm_state().

2011-01-13 Thread Yoshiaki Tamura
Introduce skip_header parameter to qemu_loadvm_state() so that it can
be called iteratively without reading the header.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 migration.c |2 +-
 savevm.c|   24 +---
 sysemu.h|2 +-
 3 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/migration.c b/migration.c
index 6416ae5..a400199 100644
--- a/migration.c
+++ b/migration.c
@@ -60,7 +60,7 @@ int qemu_start_incoming_migration(const char *uri)
 
 void process_incoming_migration(QEMUFile *f)
 {
-if (qemu_loadvm_state(f) < 0) {
+if (qemu_loadvm_state(f, 0) < 0) {
 fprintf(stderr, "load of migration failed\n");
 exit(0);
 }
diff --git a/savevm.c b/savevm.c
index 8c64c63..7bc3699 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1701,7 +1701,7 @@ typedef struct LoadStateEntry {
 int version_id;
 } LoadStateEntry;
 
-int qemu_loadvm_state(QEMUFile *f)
+int qemu_loadvm_state(QEMUFile *f, int skip_header)
 {
 QLIST_HEAD(, LoadStateEntry) loadvm_handlers =
 QLIST_HEAD_INITIALIZER(loadvm_handlers);
@@ -1710,17 +1710,19 @@ int qemu_loadvm_state(QEMUFile *f)
 unsigned int v;
 int ret;
 
-v = qemu_get_be32(f);
-if (v != QEMU_VM_FILE_MAGIC)
-return -EINVAL;
+if (!skip_header) {
+v = qemu_get_be32(f);
+if (v != QEMU_VM_FILE_MAGIC)
+return -EINVAL;
 
-v = qemu_get_be32(f);
-if (v == QEMU_VM_FILE_VERSION_COMPAT) {
-fprintf(stderr, "SaveVM v2 format is obsolete and don't work anymore\n");
-return -ENOTSUP;
+v = qemu_get_be32(f);
+if (v == QEMU_VM_FILE_VERSION_COMPAT) {
+fprintf(stderr, "SaveVM v2 format is obsolete and don't work anymore\n");
+return -ENOTSUP;
+}
+if (v != QEMU_VM_FILE_VERSION)
+return -ENOTSUP;
 }
-if (v != QEMU_VM_FILE_VERSION)
-return -ENOTSUP;
 
 while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
 uint32_t instance_id, version_id, section_id;
@@ -2043,7 +2045,7 @@ int load_vmstate(const char *name)
 return -EINVAL;
 }
 
-ret = qemu_loadvm_state(f);
+ret = qemu_loadvm_state(f, 0);
 
 qemu_fclose(f);
 if (ret  0) {
diff --git a/sysemu.h b/sysemu.h
index d8fceec..81bcf00 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -80,7 +80,7 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int 
blk_enable,
 int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f);
 int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f);
 void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f);
-int qemu_loadvm_state(QEMUFile *f);
+int qemu_loadvm_state(QEMUFile *f, int skip_header);
 
 /* SLIRP */
 void do_info_slirp(Monitor *mon);
-- 
1.7.1.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 05/19] vl.c: add deleted flag for deleting the handler.

2011-01-13 Thread Yoshiaki Tamura
Make handler deletion robust against removal of arbitrary elements while
the handlers are being walked, by using a deleted flag as is done for file
descriptor handlers.
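
For readers who have not seen this idiom, here is a minimal, self-contained
C sketch of deferred deletion with a deleted flag. The names are generic and
the code only illustrates the pattern this patch applies; it is not QEMU code.

/* Standalone sketch of deferred deletion: a callback may mark its own
 * entry (or any other) as deleted; the walker reaps marked entries. */
#include <stdio.h>
#include <stdlib.h>

struct handler {
    void (*cb)(void *opaque);
    void *opaque;
    int deleted;                 /* set instead of freeing right away */
    struct handler *next;
};

static struct handler *head;

static struct handler *handler_add(void (*cb)(void *), void *opaque)
{
    struct handler *h = calloc(1, sizeof(*h));
    h->cb = cb;
    h->opaque = opaque;
    h->next = head;
    head = h;
    return h;
}

static void handler_del(struct handler *h)
{
    h->deleted = 1;              /* safe even while notify_all() runs */
}

static void notify_all(void)
{
    struct handler **p = &head;

    while (*p) {
        struct handler *h = *p;
        if (h->deleted) {        /* reap lazily, outside any callback */
            *p = h->next;
            free(h);
        } else {
            h->cb(h->opaque);
            p = &h->next;
        }
    }
}

static void hello(void *opaque)
{
    printf("hello %s\n", (char *)opaque);
}

int main(void)
{
    struct handler *h = handler_add(hello, "world");
    notify_all();                /* prints "hello world" */
    handler_del(h);
    notify_all();                /* entry reaped here, nothing printed */
    return 0;
}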

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 vl.c |   13 +
 1 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/vl.c b/vl.c
index 0292184..8bbb785 100644
--- a/vl.c
+++ b/vl.c
@@ -1140,6 +1140,7 @@ static void nographic_update(void *opaque)
 struct vm_change_state_entry {
 VMChangeStateHandler *cb;
 void *opaque;
+int deleted;
 QLIST_ENTRY (vm_change_state_entry) entries;
 };
 
@@ -1160,8 +1161,7 @@ VMChangeStateEntry 
*qemu_add_vm_change_state_handler(VMChangeStateHandler *cb,
 
 void qemu_del_vm_change_state_handler(VMChangeStateEntry *e)
 {
-QLIST_REMOVE (e, entries);
-qemu_free (e);
+e->deleted = 1;
 }
 
 void vm_state_notify(int running, int reason)
@@ -1170,8 +1170,13 @@ void vm_state_notify(int running, int reason)
 
 trace_vm_state_notify(running, reason);
 
-for (e = vm_change_state_head.lh_first; e; e = e->entries.le_next) {
-e->cb(e->opaque, running, reason);
+QLIST_FOREACH(e, &vm_change_state_head, entries) {
+if (e->deleted) {
+QLIST_REMOVE(e, entries);
+qemu_free(e);
+} else {
+e->cb(e->opaque, running, reason);
+}
 }
 }
 
-- 
1.7.1.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 14/19] block: insert event-tap to bdrv_aio_writev() and bdrv_aio_flush().

2011-01-13 Thread Yoshiaki Tamura
The event-tap functions are called only when event-tap is on and the
requests were sent from device emulators.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 block.c |   11 +++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/block.c b/block.c
index ff2795b..85bd8b8 100644
--- a/block.c
+++ b/block.c
@@ -28,6 +28,7 @@
 #include "block_int.h"
 #include "module.h"
 #include "qemu-objects.h"
+#include "event-tap.h"
 
 #ifdef CONFIG_BSD
 #include <sys/types.h>
@@ -2111,6 +2112,11 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, 
int64_t sector_num,
 if (bdrv_check_request(bs, sector_num, nb_sectors))
 return NULL;
 
+if (bs->device_name && event_tap_is_on()) {
+return event_tap_bdrv_aio_writev(bs, sector_num, qiov, nb_sectors,
+ cb, opaque);
+}
+
 if (bs->dirty_bitmap) {
 blk_cb_data = blk_dirty_cb_alloc(bs, sector_num, nb_sectors, cb,
  opaque);
@@ -2374,6 +2380,11 @@ BlockDriverAIOCB *bdrv_aio_flush(BlockDriverState *bs,
 
 if (!drv)
 return NULL;
+
+if (bs->device_name && event_tap_is_on()) {
+return event_tap_bdrv_aio_flush(bs, cb, opaque);
+}
+
 return drv->bdrv_aio_flush(bs, cb, opaque);
 }
 
-- 
1.7.1.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 17/19] migration-tcp: modify tcp_accept_incoming_migration() to handle ft_mode, and add a hack not to close fd when ft_mode is enabled.

2011-01-13 Thread Yoshiaki Tamura
When ft_mode is set in the header, tcp_accept_incoming_migration()
sets ft_trans_incoming() as a callback and calls
qemu_file_get_notify() to receive FT transactions iteratively.  We also
need a hack not to close the fd before moving to ft_transaction mode, so
that we can reuse the fd for it.  A vm_change_state_handler is added to
turn off ft_mode when cont is issued.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 migration-tcp.c |   64 ++-
 1 files changed, 63 insertions(+), 1 deletions(-)

diff --git a/migration-tcp.c b/migration-tcp.c
index 96e2411..e6f09e5 100644
--- a/migration-tcp.c
+++ b/migration-tcp.c
@@ -18,6 +18,8 @@
 #include "sysemu.h"
 #include "buffered_file.h"
 #include "block.h"
+#include "ft_trans_file.h"
+#include "event-tap.h"
 
 //#define DEBUG_MIGRATION_TCP
 
@@ -29,6 +31,8 @@
 do { } while (0)
 #endif
 
+static VMChangeStateEntry *vmstate;
+
 static int socket_errno(FdMigrationState *s)
 {
 return socket_error();
@@ -56,7 +60,8 @@ static int socket_read(FdMigrationState *s, const void * buf, 
size_t size)
 static int tcp_close(FdMigrationState *s)
 {
 DPRINTF("tcp_close\n");
-if (s->fd != -1) {
+/* FIX ME: accessing ft_mode here isn't clean */
+if (s->fd != -1 && ft_mode != FT_INIT) {
 close(s->fd);
 s->fd = -1;
 }
@@ -150,6 +155,33 @@ MigrationState *tcp_start_outgoing_migration(Monitor *mon,
 return s-mig_state;
 }
 
+static void ft_trans_incoming(void *opaque) {
+QEMUFile *f = opaque;
+
+qemu_file_get_notify(f);
+if (qemu_file_has_error(f)) {
+ft_mode = FT_ERROR;
+qemu_fclose(f);
+}
+}
+
+static void ft_trans_reset(void *opaque, int running, int reason) {
+QEMUFile *f = opaque;
+
+if (running) {
+if (ft_mode != FT_ERROR) {
+qemu_fclose(f);
+}
+ft_mode = FT_OFF;
+qemu_del_vm_change_state_handler(vmstate);
+}
+}
+
+static void ft_trans_schedule_replay(QEMUFile *f) {
+event_tap_schedule_replay();
+vmstate = qemu_add_vm_change_state_handler(ft_trans_reset, f);
+}
+
 static void tcp_accept_incoming_migration(void *opaque)
 {
 struct sockaddr_in addr;
@@ -175,8 +207,38 @@ static void tcp_accept_incoming_migration(void *opaque)
 goto out;
 }
 
+if (ft_mode == FT_INIT) {
+autostart = 0;
+}
+
 process_incoming_migration(f);
+
+if (ft_mode == FT_INIT) {
+int ret;
+
+socket_set_nodelay(c);
+
+f = qemu_fopen_ft_trans(s, c);
+if (f == NULL) {
+fprintf(stderr, "could not qemu_fopen_ft_trans\n");
+goto out;
+}
+
+/* need to wait sender to setup */
+ret = qemu_ft_trans_begin(f);
+if (ret < 0) {
+goto out;
+}
+
+qemu_set_fd_handler2(c, NULL, ft_trans_incoming, NULL, f);
+ft_trans_schedule_replay(f);
+ft_mode = FT_TRANSACTION_RECV;
+
+return;
+}
+
 qemu_fclose(f);
+
 out:
 close(c);
 out2:
-- 
1.7.1.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 09/19] Introduce event-tap.

2011-01-13 Thread Yoshiaki Tamura
event-tap controls when to start an FT transaction, and provides proxy
functions to be called from net/block devices.  During an FT transaction,
it queues up net/block requests and flushes them when the transaction
completes.
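
Since the listing below is long and is cut off before the flush path, here is
a compact, self-contained C sketch of the queue-then-flush proxy idea. The
names and types are invented for illustration and are not event-tap's actual API.

/* Generic sketch: while a transaction is open, a proxy queues requests;
 * committing the transaction flushes them to the real backend in order. */
#include <stdio.h>

#define MAX_REQ 16

struct req {
    const char *payload;
};

static struct req queue[MAX_REQ];
static int queued;
static int transaction_open;

static void backend_submit(const struct req *r)
{
    printf("submit: %s\n", r->payload);   /* stands in for the real device */
}

/* Callers use the proxy instead of the backend while tapping is active. */
static void proxy_submit(const struct req *r)
{
    if (transaction_open && queued < MAX_REQ) {
        queue[queued++] = *r;             /* defer until commit */
    } else {
        backend_submit(r);                /* pass through when not tapping */
    }
}

static void transaction_commit(void)
{
    int i;

    for (i = 0; i < queued; i++) {
        backend_submit(&queue[i]);        /* flush in arrival order */
    }
    queued = 0;
    transaction_open = 0;
}

int main(void)
{
    struct req a = { "net: send packet" };
    struct req b = { "blk: write sector 1" };

    transaction_open = 1;
    proxy_submit(&a);                     /* queued */
    proxy_submit(&b);                     /* queued */
    transaction_commit();                 /* both hit the backend here */
    return 0;
}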

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp
---
 Makefile.target |1 +
 event-tap.c |  836 +++
 event-tap.h |   42 +++
 qemu-tool.c |   24 ++
 trace-events|9 +
 5 files changed, 912 insertions(+), 0 deletions(-)
 create mode 100644 event-tap.c
 create mode 100644 event-tap.h

diff --git a/Makefile.target b/Makefile.target
index e15b1c4..f36cd75 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -199,6 +199,7 @@ obj-y += rwhandler.o
 obj-$(CONFIG_KVM) += kvm.o kvm-all.o
 obj-$(CONFIG_NO_KVM) += kvm-stub.o
 LIBS+=-lz
+obj-y += event-tap.o
 
 QEMU_CFLAGS += $(VNC_TLS_CFLAGS)
 QEMU_CFLAGS += $(VNC_SASL_CFLAGS)
diff --git a/event-tap.c b/event-tap.c
new file mode 100644
index 000..1010b9e
--- /dev/null
+++ b/event-tap.c
@@ -0,0 +1,836 @@
+/*
+ * Event Tap functions for QEMU
+ *
+ * Copyright (c) 2010 Nippon Telegraph and Telephone Corporation. 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "qemu-common.h"
+#include "qemu-error.h"
+#include "block.h"
+#include "block_int.h"
+#include "ioport.h"
+#include "osdep.h"
+#include "sysemu.h"
+#include "hw/hw.h"
+#include "net.h"
+#include "event-tap.h"
+#include "trace.h"
+
+enum EVENT_TAP_STATE {
+EVENT_TAP_OFF,
+EVENT_TAP_ON,
+EVENT_TAP_FLUSH,
+EVENT_TAP_LOAD,
+EVENT_TAP_REPLAY,
+};
+
+static enum EVENT_TAP_STATE event_tap_state = EVENT_TAP_OFF;
+static BlockDriverAIOCB dummy_acb; /* we may need a pool for dummies */
+
+typedef struct EventTapIOport {
+uint32_t address;
+uint32_t data;
+int  index;
+} EventTapIOport;
+
+#define MMIO_BUF_SIZE 8
+
+typedef struct EventTapMMIO {
+uint64_t address;
+uint8_t  buf[MMIO_BUF_SIZE];
+int  len;
+} EventTapMMIO;
+
+typedef struct EventTapNetReq {
+char *device_name;
+int iovcnt;
+struct iovec *iov;
+int vlan_id;
+bool vlan_needed;
+bool async;
+NetPacketSent *sent_cb;
+} EventTapNetReq;
+
+#define MAX_BLOCK_REQUEST 32
+
+typedef struct EventTapBlkReq {
+char *device_name;
+int num_reqs;
+int num_cbs;
+bool is_flush;
+BlockRequest reqs[MAX_BLOCK_REQUEST];
+BlockDriverCompletionFunc *cb[MAX_BLOCK_REQUEST];
+void *opaque[MAX_BLOCK_REQUEST];
+} EventTapBlkReq;
+
+#define EVENT_TAP_IOPORT (1 << 0)
+#define EVENT_TAP_MMIO   (1 << 1)
+#define EVENT_TAP_NET    (1 << 2)
+#define EVENT_TAP_BLK    (1 << 3)
+
+#define EVENT_TAP_TYPE_MASK (EVENT_TAP_NET - 1)
+
+typedef struct EventTapLog {
+int mode;
+union {
+EventTapIOport ioport ;
+EventTapMMIO mmio;
+};
+union {
+EventTapNetReq net_req;
+EventTapBlkReq blk_req;
+};
+QTAILQ_ENTRY(EventTapLog) node;
+} EventTapLog;
+
+static EventTapLog *last_event_tap;
+
+static QTAILQ_HEAD(, EventTapLog) event_list;
+static QTAILQ_HEAD(, EventTapLog) event_pool;
+
+static int (*event_tap_cb)(void);
+static QEMUBH *event_tap_bh;
+static VMChangeStateEntry *vmstate;
+
+static void event_tap_bh_cb(void *p)
+{
+if (event_tap_cb) {
+event_tap_cb();
+}
+
+qemu_bh_delete(event_tap_bh);
+event_tap_bh = NULL;
+}
+
+static void event_tap_schedule_bh(void)
+{
+trace_event_tap_ignore_bh(!!event_tap_bh);
+
+/* if bh is already set, we ignore it for now */
+if (event_tap_bh) {
+return;
+}
+
+event_tap_bh = qemu_bh_new(event_tap_bh_cb, NULL);
+qemu_bh_schedule(event_tap_bh);
+
+return ;
+}
+
+static void event_tap_alloc_net_req(EventTapNetReq *net_req, 
+   VLANClientState *vc,
+   const struct iovec *iov, int iovcnt,
+   NetPacketSent *sent_cb, bool async)
+{
+int i;
+
+net_req->iovcnt = iovcnt;
+net_req->async = async;
+net_req->device_name = qemu_strdup(vc->name);
+net_req->sent_cb = sent_cb;
+
+if (vc->vlan) {
+net_req->vlan_needed = 1;
+net_req->vlan_id = vc->vlan->id;
+} else {
+net_req->vlan_needed = 0;
+}
+
+if (async) {
+net_req->iov = (struct iovec *)iov;
+} else {
+net_req->iov = qemu_malloc(sizeof(struct iovec) * iovcnt);
+for (i = 0; i < iovcnt; i++) {
+net_req->iov[i].iov_base = qemu_malloc(iov[i].iov_len);
+memcpy(net_req->iov[i].iov_base, iov[i].iov_base, iov[i].iov_len);
+net_req->iov[i].iov_len = iov[i].iov_len;
+}
+}
+}
+
+static void event_tap_alloc_blk_req(EventTapBlkReq *blk_req,
+BlockDriverState *bs, BlockRequest *reqs,
+int num_reqs, 

Compiling qemu 0.13.0

2011-01-13 Thread Erik Rull

Hi all,

I want to compile qemu 0.13.0 against a 2.6.29.4 kernel. I found no 
kvm-kmod sources for this kernel. Do I need them?


I tried to compile qemu, but it fails for the virtio pci code because the struct 
kvm_irq_routing_entry was not found. But this struct does exist in the kvm 
sources shipped with this qemu version. Is that a bug in the makefiles or am I 
doing something wrong?


I found an email from Avi dated 10.10.2010 saying that use of this struct 
should be avoided if CONFIG_KVM is disabled.


So I disabled KVM in my kernel sources - doesn't help, same error.

Any ideas how to proceed?
Changing the kernel version is quite complex for my project and I want to 
avoid that.


Best regards,

Erik
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] KVM updates for the 2.6.38 merge window

2011-01-13 Thread Hugh Dickins
On Thu, Jan 13, 2011 at 7:43 AM, Linus Torvalds
torva...@linux-foundation.org wrote:
 On Thu, Jan 13, 2011 at 4:53 AM, Gleb Natapov g...@redhat.com wrote:

 I implemented get_user_pages_nowait() on top of your patch. In my testing
 it works as expected when used inside KVM. Does this looks OK to you?

 It looks reasonable, although I suspect the subtle behavior wrt the
 mmap_sem means that you should not expose the magic bare
 FAULT_FLAG_ALLOW_RETRY flag to the __get_user_pages() thing. It's just
 too easy to introduce bugs, methinks.

 So I'd suggest

  - drop FOLL_RETRY

  -  make FOLL_NOWAIT set both (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_NOWAIT)

 and that way the get_user_pages() thing will never release the
 mmap_sem, and you never have any subtle locking issues for that
 particular interface.

 But some other VM person should look at it too.

I haven't been following and am not really looking yet, but should
express a preference: that Gleb keep it as he had it (if that works),
rather than making FOLL_NOWAIT do a combination.

A couple of months ago I needed to add a FOLL flag myself, and made a
patch to use the same space for FOLL flags and FAULT_FLAGs, to end
such ugliness as you see in this patch:

+   ((foll_flags & FOLL_WRITE) ?
+   FAULT_FLAG_WRITE : 0) |
+   ((foll_flags & FOLL_RETRY) ?
+   FAULT_FLAG_ALLOW_RETRY : 0) |
+   ((foll_flags & FOLL_NOWAIT) ?
+   FAULT_FLAG_RETRY_NOWAIT : 0));

But I missed my window, I think several have been added since, with
more on the way: I intend to put it together again for mmotm once the
dust has settled a little.

__get_user_pages already has wrappers for friendly usage, I think it's
okay for the wrappers to have special knowledge of what's needed.

Hugh
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] KVM: SVM: Fix NMI path when NMI happens in guest mode

2011-01-13 Thread Avi Kivity

On 01/13/2011 05:51 PM, Roedel, Joerg wrote:

On Thu, Jan 13, 2011 at 10:42:01AM -0500, Avi Kivity wrote:
  On 01/13/2011 05:22 PM, Joerg Roedel wrote:
The vmexit path on SVM needs to restore the KERNEL_GS_BASE
MSR in order to savely execute the NMI handler. Otherwise a
pending NMI can occur after the STGI instruction and crash
the machine.
This makes it impossible to run perf and kvm in parallel on
an AMD machine in a stable way.
  
Cc: sta...@kernel.org
Signed-off-by: Joerg Roedeljoerg.roe...@amd.com
---
  arch/x86/kvm/svm.c |1 +
  1 files changed, 1 insertions(+), 0 deletions(-)
  
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 25bd1bc..8b9bc72 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3637,6 +3637,7 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
  
  #ifdef CONFIG_X86_64
   wrmsrl(MSR_GS_BASE, svm->host.gs_base);
+   wrmsrl(MSR_KERNEL_GS_BASE, current->thread.gs);
  #else
loadsegment(fs, svm-host.fs);
  #endif

  Why would an NMI crash if MSR_KERNEL_GS_BASE is bad?

  I see save_paranoid depends on MSR_GS_BASE (specifically its sign, which
  is bad for the new instructions that allow userspace to write gsbase),
  but not on MSR_KERNEL_GS_BASE.

That's a good question. I have no idea. I spent some time trying to
figure this out (after I found out that a wrong KERNEL_GS_BASE was the
cause of the crashes) but had no luck.

This also doesn't happen every time an NMI is delivered in svm_vcpu_run.
Sometimes it runs perfectly in parallel for a few minutes before the
machine triple-faults.

I also had a look at entry_64.S. The save_paranoid could not be the
cause because MSR_GS_BASE is already negative at this point. But the
re-schedule condition check at the end of the NMI handler code could
also not be the cause because the NMI happens while preemption (and
interrupts) are disabled (a re-schedule should also trigger
preempt-notifiers and restore KERNEL_GS_BASE).



I have it:

ENTRY(native_load_gs_index)
CFI_STARTPROC
pushfq_cfi
DISABLE_INTERRUPTS(CLBR_ANY & ~CLBR_RDI)
SWAPGS
gs_change:
movl %edi,%gs
2:  mfence      /* workaround */
SWAPGS
popfq_cfi
ret

If an nmi hits between the two SWAPGSs, it sees the guest's 
MSR_KERNEL_GS_BASE as the host's MSR_GS_BASE.


An alternative to your fix would be to disable GIF around 
load_gs_index() in kvm.  I imagine it would be slower than your fix (not 
a trivial tradeoff - wrmsr every lightweight exit, vs. clgi/stgi every 
heavyweight exit).
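
As a rough sketch of that alternative (not the submitted patch), the
heavyweight exit path could keep GIF clear while the host gs selector is
reloaded; the placement around load_gs_index() in svm_vcpu_put() and the
svm->host.gs field name are assumptions for illustration only:

/*
 * Hypothetical sketch of the clgi/stgi alternative: with GIF clear, an
 * NMI cannot hit between the two SWAPGSs in native_load_gs_index() and
 * observe the guest's MSR_KERNEL_GS_BASE as the host's MSR_GS_BASE.
 */
#ifdef CONFIG_X86_64
	clgi();				/* hold off NMIs (GIF = 0) */
	load_gs_index(svm->host.gs);	/* reload host gs selector */
	stgi();				/* GIF = 1, NMIs may fire again */
#else
	loadsegment(gs, svm->host.gs);
#endif

The tradeoff being weighed above: the submitted fix costs a wrmsr on every
lightweight exit, while a variant like this costs a clgi/stgi pair only on
heavyweight exits.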


Please update the changelog, and add a comment.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH uq/master 1/2] Add qemu_ram_remap

2011-01-13 Thread Blue Swirl
On Thu, Jan 13, 2011 at 8:34 AM, Huang Ying ying.hu...@intel.com wrote:
 qemu_ram_remap() unmaps the specified RAM pages, then re-maps these
 pages again.  This is used by KVM HWPoison support to clear HWPoisoned
 page tables across guest rebooting, so that a new page may be
 allocated later to recover the memory error.

 Signed-off-by: Huang Ying ying.hu...@intel.com
 ---
  cpu-all.h    |    4 +++
  cpu-common.h |    1
  exec.c       |   61 
 ++-
  3 files changed, 65 insertions(+), 1 deletion(-)

 --- a/cpu-all.h
 +++ b/cpu-all.h
 @@ -863,10 +863,14 @@ target_phys_addr_t cpu_get_phys_page_deb
  extern int phys_ram_fd;
  extern ram_addr_t ram_size;

 +/* RAM is pre-allocated and passed into qemu_ram_alloc_from_ptr */
 +#define RAM_PREALLOC_MASK      (1 << 0)
 +
  typedef struct RAMBlock {
     uint8_t *host;
     ram_addr_t offset;
     ram_addr_t length;
 +    uint32_t flags;
     char idstr[256];
     QLIST_ENTRY(RAMBlock) next;
  #if defined(__linux__) && !defined(TARGET_S390X)
 --- a/exec.c
 +++ b/exec.c
 @@ -2830,6 +2830,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(Devic

     if (host) {
         new_block->host = host;
 +        new_block->flags |= RAM_PREALLOC_MASK;
     } else {
         if (mem_path) {
  #if defined (__linux__) && !defined(TARGET_S390X)
 @@ -2883,7 +2884,9 @@ void qemu_ram_free(ram_addr_t addr)
     QLIST_FOREACH(block, &ram_list.blocks, next) {
         if (addr == block->offset) {
             QLIST_REMOVE(block, next);
 -            if (mem_path) {
 +            if (block->flags & RAM_PREALLOC_MASK)

Missing braces.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 15/19] savevm: introduce qemu_savevm_trans_{begin, commit}.

2011-01-13 Thread Blue Swirl
On Thu, Jan 13, 2011 at 5:15 PM, Yoshiaki Tamura
tamura.yoshi...@lab.ntt.co.jp wrote:
 Introduce qemu_savevm_state_{begin,commit} to send the memory and
 device info together, while avoiding cancelling memory state tracking.

 Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
 ---
  savevm.c |   88 
 ++
  sysemu.h |    2 +
  2 files changed, 90 insertions(+), 0 deletions(-)

 diff --git a/savevm.c b/savevm.c
 index ebb3ef8..9d20c37 100644
 --- a/savevm.c
 +++ b/savevm.c
 @@ -1722,6 +1722,94 @@ int qemu_savevm_state_complete(Monitor *mon, QEMUFile 
 *f)
     return 0;
  }

 +int qemu_savevm_trans_begin(Monitor *mon, QEMUFile *f, int init)
 +{
 +    SaveStateEntry *se;
 +    int skipped = 0;
 +
 +    QTAILQ_FOREACH(se, &savevm_handlers, entry) {
 +        int len, stage, ret;
 +
 +        if (se->save_live_state == NULL)
 +            continue;

Missing braces.

 +
 +        /* Section type */
 +        qemu_put_byte(f, QEMU_VM_SECTION_START);
 +        qemu_put_be32(f, se->section_id);
 +
 +        /* ID string */
 +        len = strlen(se->idstr);
 +        qemu_put_byte(f, len);
 +        qemu_put_buffer(f, (uint8_t *)se->idstr, len);
 +
 +        qemu_put_be32(f, se->instance_id);
 +        qemu_put_be32(f, se->version_id);
 +
 +        stage = init ? QEMU_VM_SECTION_START : QEMU_VM_SECTION_PART;
 +        ret = se->save_live_state(mon, f, stage, se->opaque);
 +        if (!ret) {
 +            skipped++;
 +        }
 +    }
 +
 +    if (qemu_file_has_error(f))
 +        return -EIO;

Also here

 +
 +    return skipped;
 +}
 +
 +int qemu_savevm_trans_complete(Monitor *mon, QEMUFile *f)
 +{
 +    SaveStateEntry *se;
 +
 +    cpu_synchronize_all_states();
 +
 +    QTAILQ_FOREACH(se, &savevm_handlers, entry) {
 +        int ret;
 +
 +        if (se->save_live_state == NULL)
 +            continue;

here

 +
 +        /* Section type */
 +        qemu_put_byte(f, QEMU_VM_SECTION_PART);
 +        qemu_put_be32(f, se->section_id);
 +
 +        ret = se->save_live_state(mon, f, QEMU_VM_SECTION_PART, se->opaque);
 +        if (!ret) {
 +            /* do not proceed to the next vmstate. */
 +            return 1;
 +        }
 +    }
 +
 +    QTAILQ_FOREACH(se, &savevm_handlers, entry) {
 +        int len;
 +
 +        if (se->save_state == NULL && se->vmsd == NULL)
 +            continue;

here

 +
 +        /* Section type */
 +        qemu_put_byte(f, QEMU_VM_SECTION_FULL);
 +        qemu_put_be32(f, se->section_id);
 +
 +        /* ID string */
 +        len = strlen(se->idstr);
 +        qemu_put_byte(f, len);
 +        qemu_put_buffer(f, (uint8_t *)se->idstr, len);
 +
 +        qemu_put_be32(f, se->instance_id);
 +        qemu_put_be32(f, se->version_id);
 +
 +        vmstate_save(f, se);
 +    }
 +
 +    qemu_put_byte(f, QEMU_VM_EOF);
 +
 +    if (qemu_file_has_error(f))
 +        return -EIO;

and here.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 18/19] Introduce -k option to enable FT migration mode (Kemari).

2011-01-13 Thread Blue Swirl
On Thu, Jan 13, 2011 at 5:15 PM, Yoshiaki Tamura
tamura.yoshi...@lab.ntt.co.jp wrote:
 When -k option is set to migrate command, it will turn on ft_mode to
 start FT migration mode (Kemari).

 Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
 ---
  hmp-commands.hx |    7 ---
  migration.c     |    3 +++
  qmp-commands.hx |    7 ---
  3 files changed, 11 insertions(+), 6 deletions(-)

 diff --git a/hmp-commands.hx b/hmp-commands.hx
 index 1cea572..b7f8f2f 100644
 --- a/hmp-commands.hx
 +++ b/hmp-commands.hx
 @@ -735,13 +735,14 @@ ETEXI

     {
         .name       = migrate,
 -        .args_type  = detach:-d,blk:-b,inc:-i,uri:s,
 -        .params     = [-d] [-b] [-i] uri,
 +        .args_type  = detach:-d,blk:-b,inc:-i,ft:-k,uri:s,
 +        .params     = [-d] [-b] [-i] [-k] uri,
         .help       = migrate to URI (using -d to not wait for completion)
                      \n\t\t\t -b for migration without shared storage with
                       full copy of disk\n\t\t\t -i for migration without 
                      shared storage with incremental copy of disk 
 -                     (base image shared between src and destination),
 +                     (base image shared between src and destination)
 +                     \n\t\t\t -k for Fault Tolerance mode (Kemari 
 protocol),
         .user_print = monitor_user_noop,
        .mhandler.cmd_new = do_migrate,
     },
 diff --git a/migration.c b/migration.c
 index 557349b..d4922ce 100644
 --- a/migration.c
 +++ b/migration.c
 @@ -92,6 +92,9 @@ int do_migrate(Monitor *mon, const QDict *qdict, QObject 
 **ret_data)
         return -1;
     }

 +    if (qdict_get_try_bool(qdict, "ft", 0))
 +        ft_mode = FT_INIT;

Missing braces.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] KVM-test: Add a ENOSPC subtest

2011-01-13 Thread Lucas Meneghel Rodrigues
From: Amos Kong ak...@redhat.com

A KVM guest always pauses on an ENOSPC error; this test
repeatedly extends the guest disk space and resumes the guest
from the paused state.

Changes from v1:
- Use the most current KVM test API
- Use the autotest API for external commands execution
- Instead of chaining multiple shell commands as pre and
post commands, create proper pre and post scripts for the
test, as it is easier to figure out problems
- Instead of setting up /dev/loop0 hardcoded by default,
find the first available loop device and use it.

Signed-off-by: Amos Kong ak...@redhat.com
---
 client/tests/kvm/scripts/enospc-post.py   |   74 +
 client/tests/kvm/scripts/enospc-pre.py|   71 +++
 client/tests/kvm/tests/enospc.py  |   68 ++
 client/tests/kvm/tests_base.cfg.sample|   14 +
 4 files changed, 227 insertions(+), 0 deletions(-)
 mode change 100644 = 100755 client/tests/kvm/scripts/bonding_setup.py
 mode change 100644 = 100755 client/tests/kvm/scripts/check_image.py
 create mode 100755 client/tests/kvm/scripts/enospc-post.py
 create mode 100755 client/tests/kvm/scripts/enospc-pre.py
 create mode 100644 client/tests/kvm/tests/enospc.py

diff --git a/client/tests/kvm/scripts/bonding_setup.py 
b/client/tests/kvm/scripts/bonding_setup.py
old mode 100644
new mode 100755
diff --git a/client/tests/kvm/scripts/check_image.py 
b/client/tests/kvm/scripts/check_image.py
old mode 100644
new mode 100755
diff --git a/client/tests/kvm/scripts/enospc-post.py 
b/client/tests/kvm/scripts/enospc-post.py
new file mode 100755
index 000..71cc383
--- /dev/null
+++ b/client/tests/kvm/scripts/enospc-post.py
@@ -0,0 +1,74 @@
+#!/usr/bin/python
+"""
+Simple script to setup enospc test environment
+"""
+import os, commands
+
+class SetupError(Exception):
+"""
+Simple wrapper for the builtin Exception class.
+"""
+pass
+
+
+def find_command(cmd):
+"""
+Searches for a command on common paths, error if it can't find it.
+
+@param cmd: Command to be found.
+"""
+if os.path.exists(cmd):
+return cmd
+for dir in ["/usr/local/sbin", "/usr/local/bin",
+"/usr/sbin", "/usr/bin", "/sbin", "/bin"]:
+file = os.path.join(dir, cmd)
+if os.path.exists(file):
+return file
+raise ValueError('Missing command: %s' % cmd)
+
+
+def run(cmd, info=None):
+"""
+Run a command and throw an exception if it fails.
+Optionally, you can provide additional contextual info.
+
+@param cmd: Command string.
+@param info: Optional string that explains the context of the failure.
+
+@raise: SetupError if command fails.
+"""
+print "Running '%s'" % cmd
+cmd_name = cmd.split(' ')[0]
+find_command(cmd_name)
+status, output = commands.getstatusoutput(cmd)
+if status:
+e_msg = ('Command %s failed.\nStatus:%s\nOutput:%s' %
+ (cmd, status, output))
+if info is not None:
+e_msg += '\nAdditional Info:%s' % info
+raise SetupError(e_msg)
+
+return (status, output)
+
+
+if __name__ == "__main__":
+qemu_img_binary = os.environ['KVM_TEST_qemu_img_binary']
+if not os.path.isabs(qemu_img_binary):
+qemu_img_binary = os.path.join(KVM_TEST_DIR, qemu_img_binary)
+if not os.path.exists(qemu_img_binary):
+raise SetupError('The qemu-img binary that is supposed to be used '
+ '(%s) does not exist. Please verify your '
+ 'configuration' % qemu_img_binary)
+
+run("lvremove -f vgtest")
+status, output = run("losetup -a")
+loopback_device = None
+if output:
+for line in output.splitlines():
+device = line.split(":")[0]
+if "/tmp/enospc.raw" in line:
+loopback_device = device
+break
+if loopback_device is not None:
+run("losetup -d %s" % loopback_device)
+run("rm -rf /tmp/enospc.raw /tmp/kvm_autotest_root/images/enospc.qcow2")
diff --git a/client/tests/kvm/scripts/enospc-pre.py 
b/client/tests/kvm/scripts/enospc-pre.py
new file mode 100755
index 000..e56bdc0
--- /dev/null
+++ b/client/tests/kvm/scripts/enospc-pre.py
@@ -0,0 +1,71 @@
+#!/usr/bin/python
+
+Simple script to setup enospc test environment
+
+import os, commands
+
+class SetupError(Exception):
+
+Simple wrapper for the builtin Exception class.
+
+pass
+
+
+def find_command(cmd):
+
+Searches for a command on common paths, error if it can't find it.
+
+@param cmd: Command to be found.
+
+if os.path.exists(cmd):
+return cmd
+for dir in [/usr/local/sbin, /usr/local/bin,
+/usr/sbin, /usr/bin, /sbin, /bin]:
+file = os.path.join(dir, cmd)
+if os.path.exists(file):
+return file
+raise ValueError('Missing command: %s' % cmd)
+
+
+def run(cmd, info=None):
+
+Run a command and throw an exception if it fails.
+Optionally, you 

[PATCH] KVM-test: Add a ENOSPC subtest

2011-01-13 Thread Lucas Meneghel Rodrigues
From: Amos Kong ak...@redhat.com

A KVM guest always pauses on an ENOSPC error; this test
repeatedly extends the guest disk space and resumes the guest
from the paused state.

Changes from v2:
- Oops! Forgot to update tests_base.cfg.sample

Changes from v1:
- Use the most current KVM test API
- Use the autotest API for external commands execution
- Instead of chaining multiple shell commands as pre and
post commands, create proper pre and post scripts for the
test, as it is easier to figure out problems
- Instead of setting up /dev/loop0 hardcoded by default,
find the first available loop device and use it.

Signed-off-by: Amos Kong ak...@redhat.com
---
 client/tests/kvm/scripts/enospc-post.py   |   74 +
 client/tests/kvm/scripts/enospc-pre.py|   71 +++
 client/tests/kvm/tests/enospc.py  |   68 ++
 client/tests/kvm/tests_base.cfg.sample|   14 +
 4 files changed, 227 insertions(+), 0 deletions(-)
 mode change 100644 = 100755 client/tests/kvm/scripts/bonding_setup.py
 mode change 100644 = 100755 client/tests/kvm/scripts/check_image.py
 create mode 100755 client/tests/kvm/scripts/enospc-post.py
 create mode 100755 client/tests/kvm/scripts/enospc-pre.py
 create mode 100644 client/tests/kvm/tests/enospc.py

diff --git a/client/tests/kvm/scripts/bonding_setup.py 
b/client/tests/kvm/scripts/bonding_setup.py
old mode 100644
new mode 100755
diff --git a/client/tests/kvm/scripts/check_image.py 
b/client/tests/kvm/scripts/check_image.py
old mode 100644
new mode 100755
diff --git a/client/tests/kvm/scripts/enospc-post.py 
b/client/tests/kvm/scripts/enospc-post.py
new file mode 100755
index 000..71cc383
--- /dev/null
+++ b/client/tests/kvm/scripts/enospc-post.py
@@ -0,0 +1,74 @@
+#!/usr/bin/python
+
+Simple script to setup enospc test environment
+
+import os, commands
+
+class SetupError(Exception):
+
+Simple wrapper for the builtin Exception class.
+
+pass
+
+
+def find_command(cmd):
+
+Searches for a command on common paths, error if it can't find it.
+
+@param cmd: Command to be found.
+
+if os.path.exists(cmd):
+return cmd
+for dir in [/usr/local/sbin, /usr/local/bin,
+/usr/sbin, /usr/bin, /sbin, /bin]:
+file = os.path.join(dir, cmd)
+if os.path.exists(file):
+return file
+raise ValueError('Missing command: %s' % cmd)
+
+
+def run(cmd, info=None):
+
+Run a command and throw an exception if it fails.
+Optionally, you can provide additional contextual info.
+
+@param cmd: Command string.
+@param reason: Optional string that explains the context of the failure.
+
+@raise: SetupError if command fails.
+
+print Running '%s' % cmd
+cmd_name = cmd.split(' ')[0]
+find_command(cmd_name)
+status, output = commands.getstatusoutput(cmd)
+if status:
+e_msg = ('Command %s failed.\nStatus:%s\nOutput:%s' %
+ (cmd, status, output))
+if info is not None:
+e_msg += '\nAdditional Info:%s' % info
+raise SetupError(e_msg)
+
+return (status, output)
+
+
+if __name__ == __main__:
+qemu_img_binary = os.environ['KVM_TEST_qemu_img_binary']
+if not os.path.isabs(qemu_img_binary):
+qemu_img_binary = os.path.join(KVM_TEST_DIR, qemu_img_binary)
+if not os.path.exists(qemu_img_binary):
+raise SetupError('The qemu-img binary that is supposed to be used '
+ '(%s) does not exist. Please verify your '
+ 'configuration' % qemu_img_binary)
+
+run(lvremove -f vgtest)
+status, output = run(losetup -a)
+loopback_device = None
+if output:
+for line in output.splitlines():
+device = line.split(:)[0]
+if /tmp/enospc.raw in line:
+loopback_device = device
+break
+if loopback_device is not None:
+run(losetup -d %s % loopback_device)
+run(rm -rf /tmp/enospc.raw /tmp/kvm_autotest_root/images/enospc.qcow2)
diff --git a/client/tests/kvm/scripts/enospc-pre.py 
b/client/tests/kvm/scripts/enospc-pre.py
new file mode 100755
index 000..e56bdc0
--- /dev/null
+++ b/client/tests/kvm/scripts/enospc-pre.py
@@ -0,0 +1,71 @@
+#!/usr/bin/python
+
+Simple script to setup enospc test environment
+
+import os, commands
+
+class SetupError(Exception):
+
+Simple wrapper for the builtin Exception class.
+
+pass
+
+
+def find_command(cmd):
+
+Searches for a command on common paths, error if it can't find it.
+
+@param cmd: Command to be found.
+
+if os.path.exists(cmd):
+return cmd
+for dir in [/usr/local/sbin, /usr/local/bin,
+/usr/sbin, /usr/bin, /sbin, /bin]:
+file = os.path.join(dir, cmd)
+if os.path.exists(file):
+return file
+raise ValueError('Missing command: %s' % cmd)
+
+
+def run(cmd, info=None):
+
+Run a 

Re: Flow Control and Port Mirroring Revisited

2011-01-13 Thread Simon Horman
On Thu, Jan 13, 2011 at 10:45:38AM -0500, Jesse Gross wrote:
 On Thu, Jan 13, 2011 at 1:47 AM, Simon Horman ho...@verge.net.au wrote:
  On Mon, Jan 10, 2011 at 06:31:55PM +0900, Simon Horman wrote:
  On Fri, Jan 07, 2011 at 10:23:58AM +0900, Simon Horman wrote:
   On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:
  
   [ snip ]
   
I know that everyone likes a nice netperf result but I agree with
Michael that this probably isn't the right question to be asking.  I
don't think that socket buffers are a real solution to the flow
control problem: they happen to provide that functionality but it's
more of a side effect than anything.  It's just that the amount of
memory consumed by packets in the queue(s) doesn't really have any
implicit meaning for flow control (think multiple physical adapters,
all with the same speed instead of a virtual device and a physical
device with wildly different speeds).  The analog in the physical
world that you're looking for would be Ethernet flow control.
Obviously, if the question is limiting CPU or memory consumption then
that's a different story.
  
   Point taken. I will see if I can control CPU (and thus memory) 
   consumption
   using cgroups and/or tc.
 
  I have found that I can successfully control the throughput using
  the following techniques
 
  1) Place a tc egress filter on dummy0
 
  2) Use ovs-ofctl to add a flow that sends skbs to dummy0 and then eth1,
     this is effectively the same as one of my hacks to the datapath
     that I mentioned in an earlier mail. The result is that eth1
     paces the connection.
 
  Further to this, I wonder if there is any interest in providing
  a method to switch the action order - using ovs-ofctl is a hack imho -
  and/or switching the default action order for mirroring.
 
 I'm not sure that there is a way to do this that is correct in the
 generic case.  It's possible that the destination could be a VM while
 packets are being mirrored to a physical device or we could be
 multicasting or some other arbitrarily complex scenario.  Just think
 of what a physical switch would do if it has ports with two different
 speeds.

Yes, I have considered that case. And I agree that perhaps there
is no sensible default. But perhaps we could make it configurable somehow?
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page optionally

2011-01-13 Thread Huang Ying
On Thu, 2011-01-13 at 18:43 +0800, Avi Kivity wrote:
 On 01/13/2011 10:42 AM, Huang Ying wrote:
  Make __get_user_pages return -EHWPOISON for HWPOISON page only if
  FOLL_HWPOISON is specified.  With this patch, the interested callers
  can distinguish HWPOISON page from general FAULT page, while other
  callers will still get -EFAULT for pages, so the user space interface
  need not to be changed.
 
  get_user_pages_hwpoison is added as a variant of get_user_pages that
  can return -EHWPOISON for HWPOISON page.
 
  This feature is needed by KVM, where UCR MCE should be relayed to
  guest for HWPOISON page, while instruction emulation and MMIO will be
  tried for general FAULT page.
 
  The idea comes from Andrew Morton.
 
  Signed-off-by: Huang Yingying.hu...@intel.com
  Cc: Andrew Mortona...@linux-foundation.org
 
  +#ifdef CONFIG_MEMORY_FAILURE
  +int get_user_pages_hwpoison(struct task_struct *tsk, struct mm_struct *mm,
  +   unsigned long start, int nr_pages, int write,
  +   int force, struct page **pages,
  +   struct vm_area_struct **vmas);
  +#else
 
 Since we'd also like to add get_user_pages_noio(), perhaps adding a 
 flags field to get_user_pages() is better.

Or export __get_user_pages()?
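
For concreteness, here is a hedged sketch of the flags-based direction being
discussed: one exported helper that forwards caller-supplied FOLL_* bits
(FOLL_HWPOISON, a future FOLL_NOIO) to __get_user_pages(). The helper name and
the exact __get_user_pages() signature are assumptions, not code from this
thread or from mainline.

/*
 * Hypothetical sketch only -- not an existing API.  A single wrapper
 * that takes gup_flags lets callers opt in to FOLL_HWPOISON (or a
 * future FOLL_NOIO) without one exported wrapper per behaviour.
 */
int get_user_pages_flags(struct task_struct *tsk, struct mm_struct *mm,
			 unsigned long start, int nr_pages,
			 unsigned int gup_flags, struct page **pages,
			 struct vm_area_struct **vmas)
{
	if (pages)
		gup_flags |= FOLL_GET;	/* callers passing pages want refs */

	/* trailing argument assumed to be the nonblocking pointer */
	return __get_user_pages(tsk, mm, start, nr_pages, gup_flags,
				pages, vmas, NULL);
}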

Best Regards,
Huang Ying


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH uq/master 2/2] MCE, unpoison memory address across reboot

2011-01-13 Thread Huang Ying
On Thu, 2011-01-13 at 17:01 +0800, Jan Kiszka wrote:
 Am 13.01.2011 09:34, Huang Ying wrote:
  In Linux kernel HWPoison processing implementation, the virtual
  address in processes mapping the error physical memory page is marked
  as HWPoison.  So that, the further accessing to the virtual
  address will kill corresponding processes with SIGBUS.
  
  If the error physical memory page is used by a KVM guest, the SIGBUS
  will be sent to QEMU, and QEMU will simulate a MCE to report that
  memory error to the guest OS.  If the guest OS can not recover from
  the error (for example, the page is accessed by kernel code), guest OS
  will reboot the system.  But because the underlying host virtual
  address backing the guest physical memory is still poisoned, if the
  guest system accesses the corresponding guest physical memory even
  after rebooting, the SIGBUS will still be sent to QEMU and MCE will be
  simulated.  That is, guest system can not recover via rebooting.
  
  In fact, across rebooting, the contents of guest physical memory page
  need not to be kept.  We can allocate a new host physical page to
  back the corresponding guest physical address.
  
  This patch fixes this issue in QEMU via calling qemu_ram_remap() to
  clear the corresponding page table entry, so that make it possible to
  allocate a new page to recover the issue.
  
  Signed-off-by: Huang Ying ying.hu...@intel.com
  ---
   kvm.h |2 ++
   target-i386/kvm.c |   39 +++
   2 files changed, 41 insertions(+)
  
  --- a/target-i386/kvm.c
  +++ b/target-i386/kvm.c
  @@ -580,6 +580,42 @@ static int kvm_get_supported_msrs(void)
   return ret;
   }
   
  +struct HWPoisonPage;
  +typedef struct HWPoisonPage HWPoisonPage;
  +struct HWPoisonPage
  +{
  +ram_addr_t ram_addr;
  +QLIST_ENTRY(HWPoisonPage) list;
  +};
  +
  +static QLIST_HEAD(hwpoison_page_list, HWPoisonPage) hwpoison_page_list =
  +QLIST_HEAD_INITIALIZER(hwpoison_page_list);
  +
  +void kvm_unpoison_all(void *param)
 
 Minor nit: This can be static now.

In uq/master, it can be made static.  But in kvm/master, kvm_arch_init
is not compiled due to conditional compilation, so we would get a warning
and an error for the unused symbol.  Should we consider kvm/master in this
patch?

  +{
  +HWPoisonPage *page, *next_page;
  +
  +QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
  +QLIST_REMOVE(page, list);
  +qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
  +qemu_free(page);
  +}
  +}
  +
  +static void kvm_hwpoison_page_add(ram_addr_t ram_addr)
  +{
  +HWPoisonPage *page;
  +
  +QLIST_FOREACH(page, hwpoison_page_list, list) {
  +if (page->ram_addr == ram_addr)
  +return;
  +}
  +
  +page = qemu_malloc(sizeof(HWPoisonPage));
  +page->ram_addr = ram_addr;
  +QLIST_INSERT_HEAD(&hwpoison_page_list, page, list);
  +}
  +
   int kvm_arch_init(void)
   {
   uint64_t identity_base = 0xfffbc000;
  @@ -632,6 +668,7 @@ int kvm_arch_init(void)
   fprintf(stderr, e820_add_entry() table is full\n);
   return ret;
   }
  +qemu_register_reset(kvm_unpoison_all, NULL);
   
   return 0;
   }
  @@ -1940,6 +1977,7 @@ int kvm_on_sigbus_vcpu(CPUState *env, in
   hardware_memory_error();
   }
   }
  +kvm_hwpoison_page_add(ram_addr);
   
   if (code == BUS_MCEERR_AR) {
   /* Fake an Intel architectural Data Load SRAR UCR */
  @@ -1984,6 +2022,7 @@ int kvm_on_sigbus(int code, void *addr)
   QEMU itself instead of guest system!: %p\n, addr);
   return 0;
   }
  +kvm_hwpoison_page_add(ram_addr);
   kvm_mce_inj_srao_memscrub2(first_cpu, paddr);
   } else
   #endif
  --- a/kvm.h
  +++ b/kvm.h
  @@ -188,6 +188,8 @@ int kvm_physical_memory_addr_from_ram(ra
 target_phys_addr_t *phys_addr);
   #endif
   
  +void kvm_unpoison_all(void *param);
  +
 
 To be removed if kvm_unpoison_all is static.
 
   #endif
   int kvm_set_ioeventfd_mmio_long(int fd, uint32_t adr, uint32_t val, bool 
  assign);
   
  
 
 As indicated, I'm sitting on lots of fixes and refactorings of the MCE
 user space code. How do you test your patches? Any suggestions how to do
 this efficiently would be warmly welcome.

We use a self-made test script to test.  Repository is at:

git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git

The kvm test script is in the kvm sub-directory.

The attached qemu patch is needed by the test script.

Best Regards,
Huang Ying

Author: Max Asbock masb...@linux.vnet.ibm.com
Subject: [PATCH -v3] Monitor command: x-gpa2hva, translate guest physical address to host virtual address

Add command x-gpa2hva to translate guest physical address to host
virtual address. Because gpa to hva translation is not consistent, so
this command is only used for debugging.

The x-gpa2hva command 

Re: Flow Control and Port Mirroring Revisited

2011-01-13 Thread Michael S. Tsirkin
On Fri, Jan 14, 2011 at 08:41:36AM +0900, Simon Horman wrote:
 On Thu, Jan 13, 2011 at 10:45:38AM -0500, Jesse Gross wrote:
  On Thu, Jan 13, 2011 at 1:47 AM, Simon Horman ho...@verge.net.au wrote:
   On Mon, Jan 10, 2011 at 06:31:55PM +0900, Simon Horman wrote:
   On Fri, Jan 07, 2011 at 10:23:58AM +0900, Simon Horman wrote:
On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:
   
[ snip ]

 I know that everyone likes a nice netperf result but I agree with
 Michael that this probably isn't the right question to be asking.  I
 don't think that socket buffers are a real solution to the flow
 control problem: they happen to provide that functionality but it's
 more of a side effect than anything.  It's just that the amount of
 memory consumed by packets in the queue(s) doesn't really have any
 implicit meaning for flow control (think multiple physical adapters,
 all with the same speed instead of a virtual device and a physical
 device with wildly different speeds).  The analog in the physical
 world that you're looking for would be Ethernet flow control.
 Obviously, if the question is limiting CPU or memory consumption then
 that's a different story.
   
Point taken. I will see if I can control CPU (and thus memory) 
consumption
using cgroups and/or tc.
  
   I have found that I can successfully control the throughput using
   the following techniques
  
   1) Place a tc egress filter on dummy0
  
   2) Use ovs-ofctl to add a flow that sends skbs to dummy0 and then eth1,
      this is effectively the same as one of my hacks to the datapath
      that I mentioned in an earlier mail. The result is that eth1
      paces the connection.

This is actually a bug. This means that one slow connection will
affect fast ones. I intend to change the default for qemu to sndbuf=0:
this will fix it but break your pacing. So please do not count on this behaviour.

   Further to this, I wonder if there is any interest in providing
   a method to switch the action order - using ovs-ofctl is a hack imho -
   and/or switching the default action order for mirroring.
  
  I'm not sure that there is a way to do this that is correct in the
  generic case.  It's possible that the destination could be a VM while
  packets are being mirrored to a physical device or we could be
  multicasting or some other arbitrarily complex scenario.  Just think
  of what a physical switch would do if it has ports with two different
  speeds.
 
 Yes, I have considered that case. And I agree that perhaps there
 is no sensible default. But perhaps we could make it configurable somehow?

The fix is at the application level. Run netperf with -b and -w flags to
limit the speed to a sensible value.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM-test: Add a ENOSPC subtest

2011-01-13 Thread Amos Kong
- Original Message -
 From: Amos Kong ak...@redhat.com
 
 KVM guest always pauses on NOSPACE error, this test
 just repeatedly extend guest disk space and resume guest
 from paused status.
 
 Changes from v2:
 - Oops! Forgot to update tests_base.cfg.sample
 
 Changes from v1:
 - Use the most current KVM test API
 - Use the autotest API for external commands execution
 - Instead of chaining multiple shell commands as pre and
 post commands, create proper pre and post scripts for the
 test, as it is easier to figure out problems
 - Instead of setting up /dev/loop0 hardcoded by default,
 find the first available loop device before and use it.
 
 Signed-off-by: Amos Kong ak...@redhat.com

Thank you, Lucas
I've retested this patch, it's ok.

BTW, hi Kevin, when I checked the images during this test, I got the following
output. Is it harmful?

Leaked cluster 60796 refcount=1 reference=0
Leaked cluster 60797 refcount=1 reference=0
Leaked cluster 60798 refcount=1 reference=0
Leaked cluster 60799 refcount=1 reference=0
Leaked cluster 60800 refcount=1 reference=0
Leaked cluster 60801 refcount=1 reference=0
Leaked cluster 60802 refcount=1 reference=0
Leaked cluster 60803 refcount=1 reference=0
Leaked cluster 60804 refcount=1 reference=0
Leaked cluster 60805 refcount=1 reference=0
Leaked cluster 60806 refcount=1 reference=0
Leaked cluster 60807 refcount=1 reference=0
Leaked cluster 63982 refcount=1 reference=0
Leaked cluster 63983 refcount=1 reference=0
Leaked cluster 63984 refcount=1 reference=0
Leaked cluster 63985 refcount=1 reference=0
Leaked cluster 63986 refcount=1 reference=0
Leaked cluster 63987 refcount=1 reference=0
Leaked cluster 63988 refcount=1 reference=0
Leaked cluster 63989 refcount=1 reference=0
Leaked cluster 63990 refcount=1 reference=0
Leaked cluster 63991 refcount=1 reference=0
Leaked cluster 63992 refcount=1 reference=0
Leaked cluster 63993 refcount=1 reference=0
Leaked cluster 63994 refcount=1 reference=0
Leaked cluster 63995 refcount=1 reference=0
Leaked cluster 63996 refcount=1 reference=0
Leaked cluster 63997 refcount=1 reference=0
Leaked cluster 63998 refcount=1 reference=0
Leaked cluster 63999 refcount=1 reference=0

867 leaked clusters were found on the image.
This means waste of disk space, but no harm to data.


 ---
 client/tests/kvm/scripts/enospc-post.py | 74
 +
 client/tests/kvm/scripts/enospc-pre.py | 71
 +++
 client/tests/kvm/tests/enospc.py | 68 ++
 client/tests/kvm/tests_base.cfg.sample | 14 +
 4 files changed, 227 insertions(+), 0 deletions(-)
 mode change 100644 = 100755 client/tests/kvm/scripts/bonding_setup.py
 mode change 100644 = 100755 client/tests/kvm/scripts/check_image.py
 create mode 100755 client/tests/kvm/scripts/enospc-post.py
 create mode 100755 client/tests/kvm/scripts/enospc-pre.py
 create mode 100644 client/tests/kvm/tests/enospc.py
 
 diff --git a/client/tests/kvm/scripts/bonding_setup.py
 b/client/tests/kvm/scripts/bonding_setup.py
 old mode 100644
 new mode 100755
 diff --git a/client/tests/kvm/scripts/check_image.py
 b/client/tests/kvm/scripts/check_image.py
 old mode 100644
 new mode 100755
 diff --git a/client/tests/kvm/scripts/enospc-post.py
 b/client/tests/kvm/scripts/enospc-post.py
 new file mode 100755
 index 000..71cc383
 --- /dev/null
 +++ b/client/tests/kvm/scripts/enospc-post.py
 @@ -0,0 +1,74 @@
 +#!/usr/bin/python
 +
 +Simple script to setup enospc test environment
 +
 +import os, commands
 +
 +class SetupError(Exception):
 + 
 + Simple wrapper for the builtin Exception class.
 + 
 + pass
 +
 +
 +def find_command(cmd):
 + 
 + Searches for a command on common paths, error if it can't find it.
 +
 + @param cmd: Command to be found.
 + 
 + if os.path.exists(cmd):
 + return cmd
 + for dir in [/usr/local/sbin, /usr/local/bin,
 + /usr/sbin, /usr/bin, /sbin, /bin]:
 + file = os.path.join(dir, cmd)
 + if os.path.exists(file):
 + return file
 + raise ValueError('Missing command: %s' % cmd)
 +
 +
 +def run(cmd, info=None):
 + 
 + Run a command and throw an exception if it fails.
 + Optionally, you can provide additional contextual info.
 +
 + @param cmd: Command string.
 + @param reason: Optional string that explains the context of the
 failure.
 +
 + @raise: SetupError if command fails.
 + 
 + print Running '%s' % cmd
 + cmd_name = cmd.split(' ')[0]
 + find_command(cmd_name)
 + status, output = commands.getstatusoutput(cmd)
 + if status:
 + e_msg = ('Command %s failed.\nStatus:%s\nOutput:%s' %
 + (cmd, status, output))
 + if info is not None:
 + e_msg += '\nAdditional Info:%s' % info
 + raise SetupError(e_msg)
 +
 + return (status, output)
 +
 +
 +if __name__ == __main__:
 + qemu_img_binary = os.environ['KVM_TEST_qemu_img_binary']
 + if not os.path.isabs(qemu_img_binary):
 + qemu_img_binary = os.path.join(KVM_TEST_DIR, qemu_img_binary)
 + if not os.path.exists(qemu_img_binary):
 + raise SetupError('The qemu-img binary that is supposed to be used '
 + '(%s) does not 

Re: Flow Control and Port Mirroring Revisited

2011-01-13 Thread Simon Horman
On Fri, Jan 14, 2011 at 06:58:18AM +0200, Michael S. Tsirkin wrote:
 On Fri, Jan 14, 2011 at 08:41:36AM +0900, Simon Horman wrote:
  On Thu, Jan 13, 2011 at 10:45:38AM -0500, Jesse Gross wrote:
   On Thu, Jan 13, 2011 at 1:47 AM, Simon Horman ho...@verge.net.au wrote:
On Mon, Jan 10, 2011 at 06:31:55PM +0900, Simon Horman wrote:
On Fri, Jan 07, 2011 at 10:23:58AM +0900, Simon Horman wrote:
 On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:

 [ snip ]
 
  I know that everyone likes a nice netperf result but I agree with
  Michael that this probably isn't the right question to be asking.  
  I
  don't think that socket buffers are a real solution to the flow
  control problem: they happen to provide that functionality but it's
  more of a side effect than anything.  It's just that the amount of
  memory consumed by packets in the queue(s) doesn't really have any
  implicit meaning for flow control (think multiple physical 
  adapters,
  all with the same speed instead of a virtual device and a physical
  device with wildly different speeds).  The analog in the physical
  world that you're looking for would be Ethernet flow control.
  Obviously, if the question is limiting CPU or memory consumption 
  then
  that's a different story.

 Point taken. I will see if I can control CPU (and thus memory) 
 consumption
 using cgroups and/or tc.
   
I have found that I can successfully control the throughput using
the following techniques
   
1) Place a tc egress filter on dummy0
   
2) Use ovs-ofctl to add a flow that sends skbs to dummy0 and then eth1,
   this is effectively the same as one of my hacks to the datapath
   that I mentioned in an earlier mail. The result is that eth1
   paces the connection.
 
 This is actually a bug. This means that one slow connection will affect
 fast ones. I intend to change the default for qemu to sndbuf=0 : this
 will fix it but break your pacing. So pls do not count on this
 behaviour.

Do you have a patch I could test?

Further to this, I wonder if there is any interest in providing
a method to switch the action order - using ovs-ofctl is a hack imho -
and/or switching the default action order for mirroring.
   
   I'm not sure that there is a way to do this that is correct in the
   generic case.  It's possible that the destination could be a VM while
   packets are being mirrored to a physical device or we could be
   multicasting or some other arbitrarily complex scenario.  Just think
   of what a physical switch would do if it has ports with two different
   speeds.
  
  Yes, I have considered that case. And I agree that perhaps there
  is no sensible default. But perhaps we could make it configurable somehow?
 
 The fix is at the application level. Run netperf with -b and -w flags to
 limit the speed to a sensible value.

Perhaps I should have stated my goals more clearly.
I'm interested in situations where I don't control the application.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Flow Control and Port Mirroring Revisited

2011-01-13 Thread Michael S. Tsirkin
On Fri, Jan 14, 2011 at 03:35:28PM +0900, Simon Horman wrote:
 On Fri, Jan 14, 2011 at 06:58:18AM +0200, Michael S. Tsirkin wrote:
  On Fri, Jan 14, 2011 at 08:41:36AM +0900, Simon Horman wrote:
   On Thu, Jan 13, 2011 at 10:45:38AM -0500, Jesse Gross wrote:
On Thu, Jan 13, 2011 at 1:47 AM, Simon Horman ho...@verge.net.au 
wrote:
 On Mon, Jan 10, 2011 at 06:31:55PM +0900, Simon Horman wrote:
 On Fri, Jan 07, 2011 at 10:23:58AM +0900, Simon Horman wrote:
  On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:
 
  [ snip ]
  
   I know that everyone likes a nice netperf result but I agree with
   Michael that this probably isn't the right question to be 
   asking.  I
   don't think that socket buffers are a real solution to the flow
   control problem: they happen to provide that functionality but 
   it's
   more of a side effect than anything.  It's just that the amount 
   of
   memory consumed by packets in the queue(s) doesn't really have 
   any
   implicit meaning for flow control (think multiple physical 
   adapters,
   all with the same speed instead of a virtual device and a 
   physical
   device with wildly different speeds).  The analog in the physical
   world that you're looking for would be Ethernet flow control.
   Obviously, if the question is limiting CPU or memory consumption 
   then
   that's a different story.
 
  Point taken. I will see if I can control CPU (and thus memory) 
  consumption
  using cgroups and/or tc.

 I have found that I can successfully control the throughput using
 the following techniques

 1) Place a tc egress filter on dummy0

 2) Use ovs-ofctl to add a flow that sends skbs to dummy0 and then 
 eth1,
    this is effectively the same as one of my hacks to the datapath
    that I mentioned in an earlier mail. The result is that eth1
    paces the connection.
  
  This is actually a bug. This means that one slow connection will affect
  fast ones. I intend to change the default for qemu to sndbuf=0 : this
  will fix it but break your pacing. So pls do not count on this
  behaviour.
 
 Do you have a patch I could test?

You can (and users already can) just run qemu with sndbuf=0. But if you
like, the patch is below.

 Further to this, I wonder if there is any interest in providing
 a method to switch the action order - using ovs-ofctl is a hack imho -
 and/or switching the default action order for mirroring.

I'm not sure that there is a way to do this that is correct in the
generic case.  It's possible that the destination could be a VM while
packets are being mirrored to a physical device or we could be
multicasting or some other arbitrarily complex scenario.  Just think
of what a physical switch would do if it has ports with two different
speeds.
   
   Yes, I have considered that case. And I agree that perhaps there
   is no sensible default. But perhaps we could make it configurable somehow?
  
  The fix is at the application level. Run netperf with -b and -w flags to
  limit the speed to a sensible value.
 
 Perhaps I should have stated my goals more clearly.
 I'm interested in situations where I don't control the application.

Well an application that streams UDP without any throttling
at the application level will break on a physical network, right?
So I am not sure why one should try to make it work on the virtual one.

But let's assume that you do want to throttle the guest
for reasons such as QOS. The proper approach seems
to be to throttle the sender, not have a dummy throttled
receiver pacing it. Place the qemu process in the
correct net_cls cgroup, set the class id and apply a rate limit?


---

diff --git a/net/tap-linux.c b/net/tap-linux.c
index f7aa904..0dbcdd4 100644
--- a/net/tap-linux.c
+++ b/net/tap-linux.c
@@ -87,7 +87,7 @@ int tap_open(char *ifname, int ifname_size, int *vnet_hdr, 
int vnet_hdr_required
  * Ethernet NICs generally have txqueuelen=1000, so 1Mb is
  * a good default, given a 1500 byte MTU.
  */
-#define TAP_DEFAULT_SNDBUF 1024*1024
+#define TAP_DEFAULT_SNDBUF 0
 
 int tap_set_sndbuf(int fd, QemuOpts *opts)
 {
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html