RE: [PATCH] KVM: x86: Add lowest-priority support for vt-d posted-interrupts

2015-12-14 Thread Wu, Feng


> -Original Message-
> From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On
> Behalf Of Radim Krcmár
> Sent: Friday, December 11, 2015 10:38 PM
> To: Wu, Feng 
> Cc: pbonz...@redhat.com; kvm@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: Re: [PATCH] KVM: x86: Add lowest-priority support for vt-d posted-
> interrupts
> 
> 2015-12-10 01:52+, Wu, Feng:
> >> From: Radim Krčmář [mailto:rkrc...@redhat.com]
> >> (Physical xAPIC+x2APIC mode is still somewhat reasonable and xAPIC CPUs
> >>  start with LDR=0, which means that operating system doesn't need to
> >>  utilize mixed mode, as defined by KVM, when switching to x2APIC.)
> >
> > I think you mean Physical xAPIC+Physical x2APIC mode, right? For physical
> > mode, we don't use LDR in any case, do we? So in physical mode, we only
> > use the APIC ID, that is why they can be mixed, is my understanding correct?
> 
> Yes.  (Technically, physical and logical addressing is always active in
> APIC, but xAPIC must have nonzero LDR to accept logical interrupts[1].)
> If all xAPIC LDRs are zero, KVM doesn't enter a "mixed mode" even if
> some are xAPIC and some x2APIC [2].
> 
> 1: Real LAPICs probably do not accept broadcasts on APICs where LDR=0,
>KVM LAPICs do, but lowest priority broadcast is not allowed anyway,
>so PI doesn't care.
> 
> 2: KVM allows OS-writeable APIC ID, which complicates things and real
>hardware probably doesn't allow it because of that ... we'd be saner
>with RO APIC ID, but it's not that bad.  (And no major OS does it :])
> 
> >>  the system uses cluster xAPIC, OS should set DFR before LDR, which
> >>  doesn't trigger mixed mode either.)
> >
> > Just curious, if the APIC is software disabled and it is in xAPIC mode. OS 
> > sets
> > different value for DFR for different APICs, then when OS sets LDR, KVM can
> > trigger mixed flat and cluster mode, right?
> 
> Exactly.
> APICs with zeroed LDR are ignored, so KVM will use the slow-path for
> delivery (= trigger mixed mode) at the moment the first APIC with
> different DFR is configured.

Thanks a lot for your explanation!

Thanks,
Feng

> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] vgaarb: fix signal handling in vga_get()

2015-12-14 Thread David Herrmann
Hi

On Mon, Dec 14, 2015 at 9:19 AM, Kirill A. Shutemov
 wrote:
> On Thu, Dec 10, 2015 at 11:28:58AM +0100, David Herrmann wrote:
>> Hi
>>
>> On Mon, Nov 30, 2015 at 3:17 AM, Kirill A. Shutemov
>>  wrote:
>> > There are few defects in vga_get() related to signal hadning:
>> >
>> >   - we shouldn't check for pending signals for TASK_UNINTERRUPTIBLE
>> > case;
>> >
>> >   - if we found pending signal we must remove ourself from wait queue
>> > and change task state back to running;
>> >
>> >   - -ERESTARTSYS is more appropriate, I guess.
>> >
>> > Signed-off-by: Kirill A. Shutemov 
>> > ---
>> >
>> > Alex, I try to get KVM with VGA passthrough working properly. I have i915
>> > (HD 4600) on the host and GTX 580 for the guest. The guest GPU is not
>> > capabale of EFI, so I have to use x-vga=on. It's kinda work with your
>> > patch for i915.enable_hd_vgaarb=1. But guest refuse to initialize the GPU
>> > after KVM was not shut down correctly, resulting in host crash like this:
>> >
>> > BUG: unable to handle kernel paging request at 880870187ed8
>> > IP: [] 0x880870187ed8
>> > PGD 2129067 PUD 800841e3
>> > Oops: 0011 [#1] PREEMPT SMP
>> > Modules linked in: iwlmvm iwlwifi
>> > CPU: 6 PID: 3983 Comm: qemu-system-x86 Not tainted 4.3.0-gentoo #6
>> > Hardware name: Gigabyte Technology Co., Ltd. Z87X-UD7 TH/Z87X-UD7 TH-CF, 
>> > BIOS F5a 06/12/2014
>> > task: 88087a91 ti: 8808632c task.ti: 8808632c
>> > RIP: 0010:[]  [] 0x880870187ed8
>> > RSP: 0018:8808632c3d08  EFLAGS: 00010006
>> > RAX: 880870187db0 RBX: 70187f58 RCX: 
>> > RDX:  RSI: 0003 RDI: 880870187db0
>> > RBP: 8808632c3d48 R08:  R09: 
>> > R10: 000103c0 R11: 0293 R12: 81ea03c8
>> > R13: 8104c7cb R14:  R15: 0003
>> > FS:  7f984f9b2700() GS:88089f38() 
>> > knlGS:
>> > CS:  0010 DS:  ES:  CR0: 80050033
>> > CR2: 880870187ed8 CR3: 0008645f8000 CR4: 001426e0
>> > Stack:
>> >  810cc83d 632c3d28  81ea03c0
>> >  0046 0003  
>> >  8808632c3d80 810cca44 88087af63800 0286
>> > Call Trace:
>> >  [] ? __wake_up_common+0x4d/0x80
>> >  [] __wake_up+0x34/0x50
>> >  [] __vga_put+0x73/0xd0
>> >  [] vga_put+0x54/0x80
>> >  [] vfio_pci_vga_rw+0x1d2/0x220
>> >  [] vfio_pci_rw+0x33/0x60
>> >  [] vfio_pci_write+0x17/0x20
>> >  [] vfio_device_fops_write+0x26/0x30
>> >  [] __vfs_write+0x23/0xe0
>> >  [] ? __vfs_read+0x23/0xd0
>> >  [] ? do_vfs_ioctl+0x2b5/0x490
>> >  [] vfs_write+0xa4/0x190
>> >  [] SyS_pwrite64+0x66/0xa0
>> >  [] entry_SYSCALL_64_fastpath+0x12/0x6a
>> > Code: 88 ff ff e0 7e 18 70 08 88 ff ff 00 8c 57 76 08 88 ff ff 20 7f 18 70 
>> > 08 88 ff ff 08 7f 18 70 08 88 ff ff 94 51 1a 81 ff ff ff ff <09> 00 00 00 
>> > 00 00 00 00 01 8c 57 76 08 88 ff ff 00 8c 57 76 08
>> > RIP  [] 0x880870187ed8
>> >  RSP 
>> > CR2: 880870187ed8
>> >
>> > The patch fixes the crash, but doesn't help with getting GPU in guest
>> > working again.
>> >
>> > Any ideas?
>> >
>> > ---
>> >  drivers/gpu/vga/vgaarb.c | 6 --
>> >  1 file changed, 4 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/drivers/gpu/vga/vgaarb.c b/drivers/gpu/vga/vgaarb.c
>> > index 3166e4bc4eb6..9abcaa53bd25 100644
>> > --- a/drivers/gpu/vga/vgaarb.c
>> > +++ b/drivers/gpu/vga/vgaarb.c
>> > @@ -395,8 +395,10 @@ int vga_get(struct pci_dev *pdev, unsigned int rsrc, 
>> > int interruptible)
>> > set_current_state(interruptible ?
>> >   TASK_INTERRUPTIBLE :
>> >   TASK_UNINTERRUPTIBLE);
>> > -   if (signal_pending(current)) {
>> > -   rc = -EINTR;
>> > +   if (interruptible && signal_pending(current)) {
>> > +   __set_current_state(TASK_RUNNING);
>> > +   remove_wait_queue(_wait_queue, );
>> > +   rc = -ERESTARTSYS;
>> > break;
>>
>> All 3 points are valid, and the patch looks good to me:
>>
>> Reviewed-by: David Herrmann 
>>
>> However, there seems to be a race between vga_lock and putting the
>> thread asleep.
>
> I'm not sure I understand the race you're talking about.
> Could you elaborate?

Normal wake-up race. You have to check the condition you wait for
*after* linking your wait-queue and setting your thread-state. In the
current code, if the wake-up happens between spin_unlock_irqrestore()
and add_wait_queue(), you will never get woken up.

Thanks
David
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  

Re: [PATCH] target-i386: check vcpu features before accessing MSR_TSC_AUX

2015-12-14 Thread Paolo Bonzini


On 14/12/2015 12:07, Haozhong Zhang wrote:
> This patch fix a bug that prevents VM rebooting on recent versions of
> KVM (from commit 9dbe6cf).
> 
> kvm_get_msrs() is called to save guest MSR_TSC_AUX and other MSRs across
> rebooting. It only checks whether KVM exposes MSR_TSC_AUX to userspace.
> However, if vcpu does not support rdtscp (e.g. kvm64), current KVM will
> fail the saving and thus all other MSRs following it will fail in
> kvm_get_msrs(). As a result, from KVM commit 9dbe6cf that exposes
> MSR_TSC_AUX, VM can not successfully reboot.
> 
> This patch fixes this bug by adding the missing rdtscp feature checks.

That commit is not in any released kernel.  It's better if we just check
msr_info->host_initiated in vmx_get_msr and vmx_set_msr. Can you prepare
a patch?

Thanks,

Paolo

> Signed-off-by: Haozhong Zhang 
> ---
>  target-i386/kvm.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> index 6dc9846..cc842c6 100644
> --- a/target-i386/kvm.c
> +++ b/target-i386/kvm.c
> @@ -1414,7 +1414,8 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>  if (has_msr_hsave_pa) {
>  kvm_msr_entry_set([n++], MSR_VM_HSAVE_PA, env->vm_hsave);
>  }
> -if (has_msr_tsc_aux) {
> +if (has_msr_tsc_aux &&
> +(env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_RDTSCP)) {
>  kvm_msr_entry_set([n++], MSR_TSC_AUX, env->tsc_aux);
>  }
>  if (has_msr_tsc_adjust) {
> @@ -1793,7 +1794,8 @@ static int kvm_get_msrs(X86CPU *cpu)
>  if (has_msr_hsave_pa) {
>  msrs[n++].index = MSR_VM_HSAVE_PA;
>  }
> -if (has_msr_tsc_aux) {
> +if (has_msr_tsc_aux &&
> +(env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_RDTSCP)) {
>  msrs[n++].index = MSR_TSC_AUX;
>  }
>  if (has_msr_tsc_adjust) {
> 

This commit is not in any released kernel.  It's better if we just check
msr_info->host_initiated in vmx_get_msr and vmx_set_msr.  Can you
prepare a patch?

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] target-i386: check vcpu features before accessing MSR_TSC_AUX

2015-12-14 Thread Haozhong Zhang
This patch fix a bug that prevents VM rebooting on recent versions of
KVM (from commit 9dbe6cf).

kvm_get_msrs() is called to save guest MSR_TSC_AUX and other MSRs across
rebooting. It only checks whether KVM exposes MSR_TSC_AUX to userspace.
However, if vcpu does not support rdtscp (e.g. kvm64), current KVM will
fail the saving and thus all other MSRs following it will fail in
kvm_get_msrs(). As a result, from KVM commit 9dbe6cf that exposes
MSR_TSC_AUX, VM can not successfully reboot.

This patch fixes this bug by adding the missing rdtscp feature checks.

Signed-off-by: Haozhong Zhang 
---
 target-i386/kvm.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 6dc9846..cc842c6 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1414,7 +1414,8 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
 if (has_msr_hsave_pa) {
 kvm_msr_entry_set([n++], MSR_VM_HSAVE_PA, env->vm_hsave);
 }
-if (has_msr_tsc_aux) {
+if (has_msr_tsc_aux &&
+(env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_RDTSCP)) {
 kvm_msr_entry_set([n++], MSR_TSC_AUX, env->tsc_aux);
 }
 if (has_msr_tsc_adjust) {
@@ -1793,7 +1794,8 @@ static int kvm_get_msrs(X86CPU *cpu)
 if (has_msr_hsave_pa) {
 msrs[n++].index = MSR_VM_HSAVE_PA;
 }
-if (has_msr_tsc_aux) {
+if (has_msr_tsc_aux &&
+(env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_RDTSCP)) {
 msrs[n++].index = MSR_TSC_AUX;
 }
 if (has_msr_tsc_adjust) {
-- 
2.4.8

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] vgaarb: fix signal handling in vga_get()

2015-12-14 Thread Kirill A. Shutemov
On Mon, Dec 14, 2015 at 11:20:00AM +0100, David Herrmann wrote:
> Hi
> 
> On Mon, Dec 14, 2015 at 9:19 AM, Kirill A. Shutemov
>  wrote:
> > On Thu, Dec 10, 2015 at 11:28:58AM +0100, David Herrmann wrote:
> >> Hi
> >>
> >> On Mon, Nov 30, 2015 at 3:17 AM, Kirill A. Shutemov
> >>  wrote:
> >> > There are few defects in vga_get() related to signal hadning:
> >> >
> >> >   - we shouldn't check for pending signals for TASK_UNINTERRUPTIBLE
> >> > case;
> >> >
> >> >   - if we found pending signal we must remove ourself from wait queue
> >> > and change task state back to running;
> >> >
> >> >   - -ERESTARTSYS is more appropriate, I guess.
> >> >
> >> > Signed-off-by: Kirill A. Shutemov 
> >> > ---
> >> >
> >> > Alex, I try to get KVM with VGA passthrough working properly. I have i915
> >> > (HD 4600) on the host and GTX 580 for the guest. The guest GPU is not
> >> > capabale of EFI, so I have to use x-vga=on. It's kinda work with your
> >> > patch for i915.enable_hd_vgaarb=1. But guest refuse to initialize the GPU
> >> > after KVM was not shut down correctly, resulting in host crash like this:
> >> >
> >> > BUG: unable to handle kernel paging request at 880870187ed8
> >> > IP: [] 0x880870187ed8
> >> > PGD 2129067 PUD 800841e3
> >> > Oops: 0011 [#1] PREEMPT SMP
> >> > Modules linked in: iwlmvm iwlwifi
> >> > CPU: 6 PID: 3983 Comm: qemu-system-x86 Not tainted 4.3.0-gentoo #6
> >> > Hardware name: Gigabyte Technology Co., Ltd. Z87X-UD7 TH/Z87X-UD7 TH-CF, 
> >> > BIOS F5a 06/12/2014
> >> > task: 88087a91 ti: 8808632c task.ti: 8808632c
> >> > RIP: 0010:[]  [] 0x880870187ed8
> >> > RSP: 0018:8808632c3d08  EFLAGS: 00010006
> >> > RAX: 880870187db0 RBX: 70187f58 RCX: 
> >> > RDX:  RSI: 0003 RDI: 880870187db0
> >> > RBP: 8808632c3d48 R08:  R09: 
> >> > R10: 000103c0 R11: 0293 R12: 81ea03c8
> >> > R13: 8104c7cb R14:  R15: 0003
> >> > FS:  7f984f9b2700() GS:88089f38() 
> >> > knlGS:
> >> > CS:  0010 DS:  ES:  CR0: 80050033
> >> > CR2: 880870187ed8 CR3: 0008645f8000 CR4: 001426e0
> >> > Stack:
> >> >  810cc83d 632c3d28  81ea03c0
> >> >  0046 0003  
> >> >  8808632c3d80 810cca44 88087af63800 0286
> >> > Call Trace:
> >> >  [] ? __wake_up_common+0x4d/0x80
> >> >  [] __wake_up+0x34/0x50
> >> >  [] __vga_put+0x73/0xd0
> >> >  [] vga_put+0x54/0x80
> >> >  [] vfio_pci_vga_rw+0x1d2/0x220
> >> >  [] vfio_pci_rw+0x33/0x60
> >> >  [] vfio_pci_write+0x17/0x20
> >> >  [] vfio_device_fops_write+0x26/0x30
> >> >  [] __vfs_write+0x23/0xe0
> >> >  [] ? __vfs_read+0x23/0xd0
> >> >  [] ? do_vfs_ioctl+0x2b5/0x490
> >> >  [] vfs_write+0xa4/0x190
> >> >  [] SyS_pwrite64+0x66/0xa0
> >> >  [] entry_SYSCALL_64_fastpath+0x12/0x6a
> >> > Code: 88 ff ff e0 7e 18 70 08 88 ff ff 00 8c 57 76 08 88 ff ff 20 7f 18 
> >> > 70 08 88 ff ff 08 7f 18 70 08 88 ff ff 94 51 1a 81 ff ff ff ff <09> 00 
> >> > 00 00 00 00 00 00 01 8c 57 76 08 88 ff ff 00 8c 57 76 08
> >> > RIP  [] 0x880870187ed8
> >> >  RSP 
> >> > CR2: 880870187ed8
> >> >
> >> > The patch fixes the crash, but doesn't help with getting GPU in guest
> >> > working again.
> >> >
> >> > Any ideas?
> >> >
> >> > ---
> >> >  drivers/gpu/vga/vgaarb.c | 6 --
> >> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >> >
> >> > diff --git a/drivers/gpu/vga/vgaarb.c b/drivers/gpu/vga/vgaarb.c
> >> > index 3166e4bc4eb6..9abcaa53bd25 100644
> >> > --- a/drivers/gpu/vga/vgaarb.c
> >> > +++ b/drivers/gpu/vga/vgaarb.c
> >> > @@ -395,8 +395,10 @@ int vga_get(struct pci_dev *pdev, unsigned int 
> >> > rsrc, int interruptible)
> >> > set_current_state(interruptible ?
> >> >   TASK_INTERRUPTIBLE :
> >> >   TASK_UNINTERRUPTIBLE);
> >> > -   if (signal_pending(current)) {
> >> > -   rc = -EINTR;
> >> > +   if (interruptible && signal_pending(current)) {
> >> > +   __set_current_state(TASK_RUNNING);
> >> > +   remove_wait_queue(_wait_queue, );
> >> > +   rc = -ERESTARTSYS;
> >> > break;
> >>
> >> All 3 points are valid, and the patch looks good to me:
> >>
> >> Reviewed-by: David Herrmann 
> >>
> >> However, there seems to be a race between vga_lock and putting the
> >> thread asleep.
> >
> > I'm not sure I understand the race you're talking about.
> > Could you elaborate?
> 
> Normal wake-up race. You have to check the condition you wait for
> *after* linking your wait-queue and setting your thread-state. In the
> 

Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-14 Thread Michael S. Tsirkin
On Sun, Dec 13, 2015 at 11:47:44PM +0800, Lan, Tianyu wrote:
> 
> 
> On 12/11/2015 1:16 AM, Alexander Duyck wrote:
> >On Thu, Dec 10, 2015 at 6:38 AM, Lan, Tianyu  wrote:
> >>
> >>
> >>On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
> 
> Ideally, it is able to leave guest driver unmodified but it requires the
> >hypervisor or qemu to aware the device which means we may need a driver
> >in
> >hypervisor or qemu to handle the device on behalf of guest driver.
> >>>
> >>>Can you answer the question of when do you use your code -
> >>> at the start of migration or
> >>> just before the end?
> >>
> >>
> >>Just before stopping VCPU in this version and inject VF mailbox irq to
> >>notify the driver if the irq handler is installed.
> >>Qemu side also will check this via the faked PCI migration capability
> >>and driver will set the status during device open() or resume() callback.
> >
> >The VF mailbox interrupt is a very bad idea.  Really the device should
> >be in a reset state on the other side of a migration.  It doesn't make
> >sense to have the interrupt firing if the device is not configured.
> >This is one of the things that is preventing you from being able to
> >migrate the device while the interface is administratively down or the
> >VF driver is not loaded.
> 
> From my opinion, if VF driver is not loaded and hardware doesn't start
> to work, the device state doesn't need to be migrated.
> 
> We may add a flag for driver to check whether migration happened during it's
> down and reinitialize the hardware and clear the flag when system try to put
> it up.
> 
> We may add migration core in the Linux kernel and provide some helps
> functions to facilitate to add migration support for drivers.
> Migration core is in charge to sync status with Qemu.
> 
> Example.
> migration_register()
> Driver provides
> - Callbacks to be called before and after migration or for bad path
> - Its irq which it prefers to deal with migration event.
> 
> migration_event_check()
> Driver calls it in the irq handler. Migration core code will check
> migration status and call its callbacks when migration happens.
> 
> 
> >
> >My thought on all this is that it might make sense to move this
> >functionality into a PCI-to-PCI bridge device and make it a
> >requirement that all direct-assigned devices have to exist behind that
> >device in order to support migration.  That way you would be working
> >with a directly emulated device that would likely already be
> >supporting hot-plug anyway.  Then it would just be a matter of coming
> >up with a few Qemu specific extensions that you would need to add to
> >the device itself.  The same approach would likely be portable enough
> >that you could achieve it with PCIe as well via the same configuration
> >space being present on the upstream side of a PCIe port or maybe a
> >PCIe switch of some sort.
> >
> >It would then be possible to signal via your vendor-specific PCI
> >capability on that device that all devices behind this bridge require
> >DMA page dirtying, you could use the configuration in addition to the
> >interrupt already provided for hot-plug to signal things like when you
> >are starting migration, and possibly even just extend the shpc
> >functionality so that if this capability is present you have the
> >option to pause/resume instead of remove/probe the device in the case
> >of certain hot-plug events.  The fact is there may be some use for a
> >pause/resume type approach for PCIe hot-plug in the near future
> >anyway.  From the sounds of it Apple has required it for all
> >Thunderbolt device drivers so that they can halt the device in order
> >to shuffle resources around, perhaps we should look at something
> >similar for Linux.
> >
> >The other advantage behind grouping functions on one bridge is things
> >like reset domains.  The PCI error handling logic will want to be able
> >to reset any devices that experienced an error in the event of
> >something such as a surprise removal.  By grouping all of the devices
> >you could disable/reset/enable them as one logical group in the event
> >of something such as the "bad path" approach Michael has mentioned.
> >
> 
> These sounds we need to add a faked bridge for migration and adding a
> driver in the guest for it. It also needs to extend PCI bus/hotplug
> driver to do pause/resume other devices, right?
> 
> My concern is still that whether we can change PCI bus/hotplug like that
> without spec change.
> 
> IRQ should be general for any devices and we may extend it for
> migration. Device driver also can make decision to support migration
> or not.

A dedicated IRQ per device for something that is a system wide event
sounds like a waste.  I don't understand why a spec change is strictly
required, we only need to support this with the specific virtual bridge
used by QEMU, so I think that a vendor specific capability will do.
Once this works well in the field, a PCI spec 

Re: [PATCH 08/11] KVM: MMU: use page track for non-leaf shadow pages

2015-12-14 Thread Kai Huang



On 12/01/2015 02:26 AM, Xiao Guangrong wrote:

non-leaf shadow pages are always write protected, it can be the user
of page track

Signed-off-by: Xiao Guangrong 
---
  arch/x86/include/asm/kvm_page_track.h |  8 +
  arch/x86/kvm/mmu.c| 26 +---
  arch/x86/kvm/page_track.c | 58 +++
  3 files changed, 67 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/kvm_page_track.h 
b/arch/x86/include/asm/kvm_page_track.h
index 6744234..3447dac 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -41,8 +41,16 @@ int kvm_page_track_create_memslot(struct kvm_memory_slot 
*slot,
  void kvm_page_track_free_memslot(struct kvm_memory_slot *free,
 struct kvm_memory_slot *dont);
  
+void

+kvm_slot_page_track_add_page_nolock(struct kvm *kvm,
+   struct kvm_memory_slot *slot, gfn_t gfn,
+   enum kvm_page_track_mode mode);
  void kvm_page_track_add_page(struct kvm *kvm, gfn_t gfn,
 enum kvm_page_track_mode mode);
+void kvm_slot_page_track_remove_page_nolock(struct kvm *kvm,
+   struct kvm_memory_slot *slot,
+   gfn_t gfn,
+   enum kvm_page_track_mode mode);
  void kvm_page_track_remove_page(struct kvm *kvm, gfn_t gfn,
enum kvm_page_track_mode mode);
  bool kvm_page_track_check_mode(struct kvm_vcpu *vcpu, gfn_t gfn,
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index b23f9fc..5a2ca73 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -806,11 +806,17 @@ static void account_shadowed(struct kvm *kvm, struct 
kvm_mmu_page *sp)
struct kvm_memory_slot *slot;
gfn_t gfn;
  
+	kvm->arch.indirect_shadow_pages++;

gfn = sp->gfn;
slots = kvm_memslots_for_spte_role(kvm, sp->role);
slot = __gfn_to_memslot(slots, gfn);
+
+   /* the non-leaf shadow pages are keeping readonly. */
+   if (sp->role.level > PT_PAGE_TABLE_LEVEL)
+   return kvm_slot_page_track_add_page_nolock(kvm, slot, gfn,
+   KVM_PAGE_TRACK_WRITE);
+
kvm_mmu_gfn_disallow_lpage(slot, gfn);
-   kvm->arch.indirect_shadow_pages++;
  }
  
  static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)

@@ -819,11 +825,15 @@ static void unaccount_shadowed(struct kvm *kvm, struct 
kvm_mmu_page *sp)
struct kvm_memory_slot *slot;
gfn_t gfn;
  
+	kvm->arch.indirect_shadow_pages--;

gfn = sp->gfn;
slots = kvm_memslots_for_spte_role(kvm, sp->role);
slot = __gfn_to_memslot(slots, gfn);
+   if (sp->role.level > PT_PAGE_TABLE_LEVEL)
+   return kvm_slot_page_track_remove_page_nolock(kvm, slot, gfn,
+   KVM_PAGE_TRACK_WRITE);
+
kvm_mmu_gfn_allow_lpage(slot, gfn);
-   kvm->arch.indirect_shadow_pages--;
  }
  
  static bool __mmu_gfn_lpage_is_disallowed(gfn_t gfn, int level,

@@ -2140,12 +2150,18 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
kvm_vcpu *vcpu,
hlist_add_head(>hash_link,
>kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)]);
if (!direct) {
-   if (rmap_write_protect(vcpu, gfn))
+   /*
+* we should do write protection before syncing pages
+* otherwise the content of the synced shadow page may
+* be inconsistent with guest page table.
+*/
+   account_shadowed(vcpu->kvm, sp);
+
+   if (level == PT_PAGE_TABLE_LEVEL &&
+ rmap_write_protect(vcpu, gfn))
kvm_flush_remote_tlbs(vcpu->kvm);
I think your modification is good but I am little bit confused here. In 
account_shadowed, if sp->role.level > PT_PAGE_TABLE_LEVEL, the sp->gfn 
is write protected, and this is reasonable. So why write protecting the 
gfn of PT_PAGE_TABLE_LEVEL here?



if (level > PT_PAGE_TABLE_LEVEL && need_sync)
kvm_sync_pages(vcpu, gfn);
-
-   account_shadowed(vcpu->kvm, sp);
}
sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
init_shadow_page_table(sp);
diff --git a/arch/x86/kvm/page_track.c b/arch/x86/kvm/page_track.c
index 84420df..87554d3 100644
--- a/arch/x86/kvm/page_track.c
+++ b/arch/x86/kvm/page_track.c
@@ -77,6 +77,26 @@ static void update_gfn_track(struct kvm_memory_slot *slot, 
gfn_t gfn,
WARN_ON(val < 0);
  }
  
+void

+kvm_slot_page_track_add_page_nolock(struct kvm *kvm,
+   struct kvm_memory_slot *slot, gfn_t gfn,
+   enum kvm_page_track_mode mode)
+{
+   

Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-14 Thread Michael S. Tsirkin
On Fri, Dec 11, 2015 at 03:32:04PM +0800, Lan, Tianyu wrote:
> 
> 
> On 12/11/2015 12:11 AM, Michael S. Tsirkin wrote:
> >On Thu, Dec 10, 2015 at 10:38:32PM +0800, Lan, Tianyu wrote:
> >>
> >>
> >>On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
> Ideally, it is able to leave guest driver unmodified but it requires the
> >hypervisor or qemu to aware the device which means we may need a driver 
> >in
> >hypervisor or qemu to handle the device on behalf of guest driver.
> >>>Can you answer the question of when do you use your code -
> >>>at the start of migration or
> >>>just before the end?
> >>
> >>Just before stopping VCPU in this version and inject VF mailbox irq to
> >>notify the driver if the irq handler is installed.
> >>Qemu side also will check this via the faked PCI migration capability
> >>and driver will set the status during device open() or resume() callback.
> >
> >Right, this is the "good path" optimization. Whether this buys anything
> >as compared to just sending reset to the device when VCPU is stopped
> >needs to be measured. In any case, we probably do need a way to
> >interrupt driver on destination to make it reconfigure the device -
> >otherwise it might take seconds for it to notice.  And a way to make
> >sure driver can handle this surprise reset so we can block migration if
> >it can't.
> >
> 
> Yes, we need such a way to notify driver about migration status and do
> reset or restore operation on the destination machine. My original
> design is to take advantage of device's irq to do that. Driver can tell
> Qemu that which irq it prefers to handle such task and whether the irq
> is enabled or bound with handler. We may discuss the detail in the other
> thread.
> 
> >>>
> >>>It would be great if we could avoid changing the guest; but at least 
> >>>your guest
> >>>driver changes don't actually seem to be that hardware specific; could 
> >>>your
> >>>changes actually be moved to generic PCI level so they could be made
> >>>to work for lots of drivers?
> >
> >It is impossible to use one common solution for all devices unless the 
> >PCIE
> >spec documents it clearly and i think one day it will be there. But 
> >before
> >that, we need some workarounds on guest driver to make it work even it 
> >looks
> >ugly.
> >>
> >>Yes, so far there is not hardware migration support
> >
> >VT-D supports setting dirty bit in the PTE in hardware.
> 
> Actually, this doesn't support in the current hardware.
> VTD spec documents the dirty bit for first level translation which
> requires devices to support DMA request with PASID(process
> address space identifier). Most device don't support the feature.

True, I missed this.  It's generally unfortunate that first level
translation only applies to requests with PASID.  All other features
limited to requests with PASID like nested translation would be very
useful for all requests, not just requests with PASID.


> >
> >>and it's hard to modify
> >>bus level code.
> >
> >Why is it hard?
> 
> As Yang said, the concern is that PCI Spec doesn't document about how to do
> migration.

We can submit a PCI spec ECN documenting a new capability.

I think for existing devices which lack it, adding this capability to
the bridge to which the device is attached is preferable to trying to
add it to the device itself.

> >
> >>It also will block implementation on the Windows.
> >
> >Implementation of what?  We are discussing motivation here, not
> >implementation.  E.g. windows drivers typically support surprise
> >removal, should you use that, you get some working code for free.  Just
> >stop worrying about it.  Make it work, worry about closed source
> >software later.
> >
> >>>Dave
> >>>
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 22/22] arm64: KVM: Remove weak attributes

2015-12-14 Thread Christoffer Dall
On Mon, Dec 07, 2015 at 10:53:38AM +, Marc Zyngier wrote:
> As we've now switched to the new world switch implementation,
> remove the weak attributes, as nobody is supposed to override
> it anymore.
> 
> Signed-off-by: Marc Zyngier 

Acked-by: Christoffer Dall 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 10/22] arm64: KVM: Implement guest entry

2015-12-14 Thread Christoffer Dall
On Mon, Dec 07, 2015 at 10:53:26AM +, Marc Zyngier wrote:
> Contrary to the previous patch, the guest entry is fairly different
> from its assembly counterpart, mostly because it is only concerned
> with saving/restoring the GP registers, and nothing else.
> 
> Signed-off-by: Marc Zyngier 
> ---
>  arch/arm64/kvm/hyp/Makefile |   1 +
>  arch/arm64/kvm/hyp/entry.S  | 131 
> 
>  arch/arm64/kvm/hyp/hyp.h|   2 +
>  3 files changed, 134 insertions(+)
>  create mode 100644 arch/arm64/kvm/hyp/entry.S
> 
> diff --git a/arch/arm64/kvm/hyp/Makefile b/arch/arm64/kvm/hyp/Makefile
> index ec14cac..1e1ff06 100644
> --- a/arch/arm64/kvm/hyp/Makefile
> +++ b/arch/arm64/kvm/hyp/Makefile
> @@ -7,3 +7,4 @@ obj-$(CONFIG_KVM_ARM_HOST) += vgic-v3-sr.o
>  obj-$(CONFIG_KVM_ARM_HOST) += timer-sr.o
>  obj-$(CONFIG_KVM_ARM_HOST) += sysreg-sr.o
>  obj-$(CONFIG_KVM_ARM_HOST) += debug-sr.o
> +obj-$(CONFIG_KVM_ARM_HOST) += entry.o
> diff --git a/arch/arm64/kvm/hyp/entry.S b/arch/arm64/kvm/hyp/entry.S
> new file mode 100644
> index 000..47f3c69
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/entry.S
> @@ -0,0 +1,131 @@
> +/*
> + * Copyright (C) 2015 - ARM Ltd
> + * Author: Marc Zyngier 
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see .
> + */
> +
> +#include 
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define CPU_GP_REG_OFFSET(x) (CPU_GP_REGS + x)
> +#define CPU_XREG_OFFSET(x)   CPU_GP_REG_OFFSET(CPU_USER_PT_REGS + 8*x)
> +
> + .text
> + .pushsection.hyp.text, "ax"
> +
> +.macro save_callee_saved_regs ctxt
> + stp x19, x20, [\ctxt, #CPU_XREG_OFFSET(19)]
> + stp x21, x22, [\ctxt, #CPU_XREG_OFFSET(21)]
> + stp x23, x24, [\ctxt, #CPU_XREG_OFFSET(23)]
> + stp x25, x26, [\ctxt, #CPU_XREG_OFFSET(25)]
> + stp x27, x28, [\ctxt, #CPU_XREG_OFFSET(27)]
> + stp x29, lr,  [\ctxt, #CPU_XREG_OFFSET(29)]
> +.endm
> +
> +.macro restore_callee_saved_regs ctxt
> + ldp x19, x20, [\ctxt, #CPU_XREG_OFFSET(19)]
> + ldp x21, x22, [\ctxt, #CPU_XREG_OFFSET(21)]
> + ldp x23, x24, [\ctxt, #CPU_XREG_OFFSET(23)]
> + ldp x25, x26, [\ctxt, #CPU_XREG_OFFSET(25)]
> + ldp x27, x28, [\ctxt, #CPU_XREG_OFFSET(27)]
> + ldp x29, lr,  [\ctxt, #CPU_XREG_OFFSET(29)]
> +.endm
> +
> +/*
> + * u64 __guest_enter(struct kvm_vcpu *vcpu,
> + *struct kvm_cpu_context *host_ctxt);
> + */
> +ENTRY(__guest_enter)
> + // x0: vcpu
> + // x1: host/guest context
> + // x2-x18: clobbered by macros
> +
> + // Store the host regs
> + save_callee_saved_regs x1
> +
> + // Preserve vcpu & host_ctxt for use at exit time
> + stp x0, x1, [sp, #-16]!
> +
> + add x1, x0, #VCPU_CONTEXT
> +
> + // Prepare x0-x1 for later restore by pushing them onto the stack
> + ldp x2, x3, [x1, #CPU_XREG_OFFSET(0)]
> + stp x2, x3, [sp, #-16]!
> +
> + // x2-x18
> + ldp x2, x3,   [x1, #CPU_XREG_OFFSET(2)]
> + ldp x4, x5,   [x1, #CPU_XREG_OFFSET(4)]
> + ldp x6, x7,   [x1, #CPU_XREG_OFFSET(6)]
> + ldp x8, x9,   [x1, #CPU_XREG_OFFSET(8)]
> + ldp x10, x11, [x1, #CPU_XREG_OFFSET(10)]
> + ldp x12, x13, [x1, #CPU_XREG_OFFSET(12)]
> + ldp x14, x15, [x1, #CPU_XREG_OFFSET(14)]
> + ldp x16, x17, [x1, #CPU_XREG_OFFSET(16)]
> + ldr x18,  [x1, #CPU_XREG_OFFSET(18)]
> +
> + // x19-x29, lr
> + restore_callee_saved_regs x1
> +
> + // Last bits of the 64bit state
> + ldp x0, x1, [sp], #16
> +
> + // Do not touch any register after this!
> + eret
> +ENDPROC(__guest_enter)
> +
> +ENTRY(__guest_exit)
> + // x0: vcpu
> + // x1: return code
> + // x2-x3: free
> + // x4-x29,lr: vcpu regs
> + // vcpu x0-x3 on the stack
> +
> + add x2, x0, #VCPU_CONTEXT
> +
> + // Compute base to save registers

misleading comment again?  Or misplaced at least?

> + stp x4, x5,   [x2, #CPU_XREG_OFFSET(4)]
> + stp x6, x7,   [x2, #CPU_XREG_OFFSET(6)]
> + stp x8, x9,   [x2, #CPU_XREG_OFFSET(8)]
> + stp x10, x11, [x2, #CPU_XREG_OFFSET(10)]
> + stp x12, x13, [x2, #CPU_XREG_OFFSET(12)]
> + stp x14, x15, [x2, #CPU_XREG_OFFSET(14)]
> + stp x16, x17, [x2, #CPU_XREG_OFFSET(16)]
> + str 

[PATCH] KVM: VMX: Fix host initiated access to guest MSR_TSC_AUX

2015-12-14 Thread Haozhong Zhang
The current handling of accesses to guest MSR_TSC_AUX returns error if
vcpu does not support rdtscp, though those accesses are initiated by
host. This can result in the reboot failure of some versions of
QEMU. This patch fixes this issue by passing those host initiated
accesses for further handling instead.

Signed-off-by: Haozhong Zhang 
---
 arch/x86/kvm/vmx.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 1a8bfaa..50f2b78 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2802,7 +2802,7 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
msr_info->data = vcpu->arch.ia32_xss;
break;
case MSR_TSC_AUX:
-   if (!guest_cpuid_has_rdtscp(vcpu))
+   if (!guest_cpuid_has_rdtscp(vcpu) && !msr_info->host_initiated)
return 1;
/* Otherwise falls through */
default:
@@ -2908,7 +2908,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
clear_atomic_switch_msr(vmx, MSR_IA32_XSS);
break;
case MSR_TSC_AUX:
-   if (!guest_cpuid_has_rdtscp(vcpu))
+   if (!guest_cpuid_has_rdtscp(vcpu) && !msr_info->host_initiated)
return 1;
/* Check reserved bit, higher 32 bits should be zero */
if ((data >> 32) != 0)
-- 
2.6.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v1] kvm/x86: Remove Hyper-V SynIC timer stopping

2015-12-14 Thread Andrey Smetanin
It's possible that guest send us Hyper-V EOM at the middle
of Hyper-V SynIC timer running, so we start processing of Hyper-V
SynIC timers in vcpu context and stop the Hyper-V SynIC timer
uncoditionally and lose time expiration which Windows 2012R2 guest
expects.

The patch fixes such situation by not stopping Hyper-V SynIC timer
at all, because it's safe to restart it without stop in vcpu context
and timer callback always returns HRTIMER_NORESTART.

Signed-off-by: Andrey Smetanin 
CC: Gleb Natapov 
CC: Paolo Bonzini 
CC: Roman Kagan 
CC: Denis V. Lunev 
CC: qemu-de...@nongnu.org

---
 arch/x86/kvm/hyperv.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 8ff8829..f34f666 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -598,7 +598,6 @@ void kvm_hv_process_stimers(struct kvm_vcpu *vcpu)
for (i = 0; i < ARRAY_SIZE(hv_vcpu->stimer); i++)
if (test_and_clear_bit(i, hv_vcpu->stimer_pending_bitmap)) {
stimer = _vcpu->stimer[i];
-   stimer_stop(stimer);
if (stimer->config & HV_STIMER_ENABLE) {
time_now = get_time_ref_counter(vcpu->kvm);
if (time_now >= stimer->exp_time)
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1] kvm/x86: Remove Hyper-V SynIC timer stopping

2015-12-14 Thread Paolo Bonzini


On 14/12/2015 16:33, Andrey Smetanin wrote:
> It's possible that guest send us Hyper-V EOM at the middle
> of Hyper-V SynIC timer running, so we start processing of Hyper-V
> SynIC timers in vcpu context and stop the Hyper-V SynIC timer
> uncoditionally and lose time expiration which Windows 2012R2 guest
> expects.
> 
> The patch fixes such situation by not stopping Hyper-V SynIC timer
> at all, because it's safe to restart it without stop in vcpu context
> and timer callback always returns HRTIMER_NORESTART.

Can you summarize with a "picture" what is the bad race?

The patch seems safe, but I'd like to have a better understanding of
what goes wrong.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: MTRR: fix fixed MTRR segment look up

2015-12-14 Thread Paolo Bonzini


On 14/12/2015 15:39, Alexis D...t wrote:
> It fixes the slow-down of VM running with pci-passthrough, since some MTRR
> range changed from MTRR_TYPE_WRBACK to MTRR_TYPE_UNCACHABLE.
> 
> Fixes: fa61213746a ("KVM: MTRR: simplify kvm_mtrr_get_guest_memory_type")
> Bugzilla: (https://bugzilla.kernel.org/show_bug.cgi?id=107561)
> 
> Signed-off-by: Alexis Dambricourt 
> ---
>  arch/x86/kvm/mtrr.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
> index 9e8bf13..adc54e1 100644
> --- a/arch/x86/kvm/mtrr.c
> +++ b/arch/x86/kvm/mtrr.c
> @@ -267,7 +267,7 @@ static int fixed_mtrr_addr_to_seg(u64 addr)
> 
> for (seg = 0; seg < seg_num; seg++) {
> mtrr_seg = _seg_table[seg];
> -   if (mtrr_seg->start >= addr && addr < mtrr_seg->end)
> +   if (mtrr_seg->start <= addr && addr < mtrr_seg->end)

So this if could never be true.

> return seg;
> }
> 
> --
> 

While that's embarrassing, :)  it would only apply to memory in the 640K-1M
range, while the logs in the BZ have

nov 22 17:06:49 Core-i7-5.lan kernel: vmx_get_mt_mask got the following: cpu=6, 
vcpu=0, gfn=67fe00, MMIO=0, cache=0

Note that this is a page above 4GB, which is why in the BZ I thought that
the culprit was MAXPHYADDR (and I still believe it is, at least for the
OVMF case---there are probably two bugs, and your patch fixes one).

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: MTRR: fix fixed MTRR segment look up

2015-12-14 Thread Alexis D...t
>From what I see in my case, it was especially gfn from 0x0 to 0xA0
which are deadly impacted by the wrong cache property.

2015-12-14 16:36 GMT+01:00 Paolo Bonzini :
>
>
> On 14/12/2015 15:39, Alexis D...t wrote:
>> It fixes the slow-down of VM running with pci-passthrough, since some MTRR
>> range changed from MTRR_TYPE_WRBACK to MTRR_TYPE_UNCACHABLE.
>>
>> Fixes: fa61213746a ("KVM: MTRR: simplify kvm_mtrr_get_guest_memory_type")
>> Bugzilla: (https://bugzilla.kernel.org/show_bug.cgi?id=107561)
>>
>> Signed-off-by: Alexis Dambricourt 
>> ---
>>  arch/x86/kvm/mtrr.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
>> index 9e8bf13..adc54e1 100644
>> --- a/arch/x86/kvm/mtrr.c
>> +++ b/arch/x86/kvm/mtrr.c
>> @@ -267,7 +267,7 @@ static int fixed_mtrr_addr_to_seg(u64 addr)
>>
>> for (seg = 0; seg < seg_num; seg++) {
>> mtrr_seg = _seg_table[seg];
>> -   if (mtrr_seg->start >= addr && addr < mtrr_seg->end)
>> +   if (mtrr_seg->start <= addr && addr < mtrr_seg->end)
>
> So this if could never be true.
>
>> return seg;
>> }
>>
>> --
>>
>
> While that's embarrassing, :)  it would only apply to memory in the 640K-1M
> range, while the logs in the BZ have
>
> nov 22 17:06:49 Core-i7-5.lan kernel: vmx_get_mt_mask got the following: 
> cpu=6, vcpu=0, gfn=67fe00, MMIO=0, cache=0
>
> Note that this is a page above 4GB, which is why in the BZ I thought that
> the culprit was MAXPHYADDR (and I still believe it is, at least for the
> OVMF case---there are probably two bugs, and your patch fixes one).
>
> Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/1] virtio/s390: one fix

2015-12-14 Thread Michael S. Tsirkin
On Mon, Dec 14, 2015 at 04:02:33PM +0100, Cornelia Huck wrote:
> On Thu,  3 Dec 2015 17:23:59 +0100
> Cornelia Huck  wrote:
> 
> > Michael,
> > 
> > here's one fix for the virtio-ccw driver.
> > 
> > Patch is against master, as your git branches on mst/vhost.git seem
> > to be quite old. Is there some branch I can base my own branch on,
> > or do you prefer to take patches directly anyway?
> > 
> > Cornelia Huck (1):
> >   virtio/s390: handle error values in irb
> > 
> >  drivers/s390/virtio/virtio_ccw.c | 62 
> > 
> >  1 file changed, 37 insertions(+), 25 deletions(-)
> > 
> 
> Ping?

Thanks, I'll merge this, hope it can still make 4.4.

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 107561] 4.2 breaks PCI passthrough in QEMU/KVM

2015-12-14 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=107561

Paolo Bonzini  changed:

   What|Removed |Added

 Attachment #197081|0   |1
is obsolete||
 Status|NEW |NEEDINFO
   Assignee|virtualization_kvm@kernel-b |bonz...@gnu.org
   |ugs.osdl.org|

--- Comment #15 from Paolo Bonzini  ---
Created attachment 197381
  --> https://bugzilla.kernel.org/attachment.cgi?id=197381=edit
patch to be tested v2

There is indeed a two-character mistake in the patch.  Can you guys please test
this one instead?

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: MTRR: fix fixed MTRR segment look up

2015-12-14 Thread Paolo Bonzini


On 14/12/2015 17:06, Alexis D...t wrote:
> From what I see in my case, it was especially gfn from 0x0 to 0xA0
> which are deadly impacted by the wrong cache property.

Yes, indeed.  I'll send the patch to Linus this week; I've now posted to
the BZ a second patch for GFNs above 4GB.

Paolo

> 2015-12-14 16:36 GMT+01:00 Paolo Bonzini :
>> >
>> >
>> > On 14/12/2015 15:39, Alexis D...t wrote:
>>> >> It fixes the slow-down of VM running with pci-passthrough, since some 
>>> >> MTRR
>>> >> range changed from MTRR_TYPE_WRBACK to MTRR_TYPE_UNCACHABLE.
>>> >>
>>> >> Fixes: fa61213746a ("KVM: MTRR: simplify kvm_mtrr_get_guest_memory_type")
>>> >> Bugzilla: (https://bugzilla.kernel.org/show_bug.cgi?id=107561)
>>> >>
>>> >> Signed-off-by: Alexis Dambricourt 
>>> >> ---
>>> >>  arch/x86/kvm/mtrr.c | 2 +-
>>> >>  1 file changed, 1 insertion(+), 1 deletion(-)
>>> >>
>>> >> diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
>>> >> index 9e8bf13..adc54e1 100644
>>> >> --- a/arch/x86/kvm/mtrr.c
>>> >> +++ b/arch/x86/kvm/mtrr.c
>>> >> @@ -267,7 +267,7 @@ static int fixed_mtrr_addr_to_seg(u64 addr)
>>> >>
>>> >> for (seg = 0; seg < seg_num; seg++) {
>>> >> mtrr_seg = _seg_table[seg];
>>> >> -   if (mtrr_seg->start >= addr && addr < mtrr_seg->end)
>>> >> +   if (mtrr_seg->start <= addr && addr < mtrr_seg->end)
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/11] KVM: page track: add the framework of guest page tracking

2015-12-14 Thread Kai Huang

Hi Guangrong,

I am starting to review this series, and should have some comments or 
questions, you can determine whether they are valuable :)


See  below.

On 12/01/2015 02:26 AM, Xiao Guangrong wrote:

The array, gfn_track[mode][gfn], is introduced in memory slot for every
guest page, this is the tracking count for the gust page on different
modes. If the page is tracked then the count is increased, the page is
not tracked after the count reaches zero

Two callbacks, kvm_page_track_create_memslot() and
kvm_page_track_free_memslot() are implemented in this patch, they are
internally used to initialize and reclaim the memory of the array

Currently, only write track mode is supported

Signed-off-by: Xiao Guangrong 
---
  arch/x86/include/asm/kvm_host.h   |  2 ++
  arch/x86/include/asm/kvm_page_track.h | 13 
  arch/x86/kvm/Makefile |  3 +-
  arch/x86/kvm/page_track.c | 58 +++
  arch/x86/kvm/x86.c|  5 +++
  5 files changed, 80 insertions(+), 1 deletion(-)
  create mode 100644 arch/x86/include/asm/kvm_page_track.h
  create mode 100644 arch/x86/kvm/page_track.c

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5aa2dcc..afff1f1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -32,6 +32,7 @@
  #include 
  #include 
  #include 
+#include 
  
  #define KVM_MAX_VCPUS 255

  #define KVM_SOFT_MAX_VCPUS 160
@@ -612,6 +613,7 @@ struct kvm_lpage_info {
  struct kvm_arch_memory_slot {
struct kvm_rmap_head *rmap[KVM_NR_PAGE_SIZES];
struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
+   int *gfn_track[KVM_PAGE_TRACK_MAX];
  };
  
  /*

diff --git a/arch/x86/include/asm/kvm_page_track.h 
b/arch/x86/include/asm/kvm_page_track.h
new file mode 100644
index 000..347d5c9
--- /dev/null
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -0,0 +1,13 @@
+#ifndef _ASM_X86_KVM_PAGE_TRACK_H
+#define _ASM_X86_KVM_PAGE_TRACK_H
+
+enum kvm_page_track_mode {
+   KVM_PAGE_TRACK_WRITE,
+   KVM_PAGE_TRACK_MAX,
+};
+
+int kvm_page_track_create_memslot(struct kvm_memory_slot *slot,
+ unsigned long npages);
+void kvm_page_track_free_memslot(struct kvm_memory_slot *free,
+struct kvm_memory_slot *dont);
+#endif
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index a1ff508..464fa47 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -13,9 +13,10 @@ kvm-$(CONFIG_KVM_ASYNC_PF)   += $(KVM)/async_pf.o
  
  kvm-y			+= x86.o mmu.o emulate.o i8259.o irq.o lapic.o \

   i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
-  hyperv.o
+  hyperv.o page_track.o
  
  kvm-$(CONFIG_KVM_DEVICE_ASSIGNMENT)	+= assigned-dev.o iommu.o

+
  kvm-intel-y   += vmx.o pmu_intel.o
  kvm-amd-y += svm.o pmu_amd.o
  
diff --git a/arch/x86/kvm/page_track.c b/arch/x86/kvm/page_track.c

new file mode 100644
index 000..0338d36
--- /dev/null
+++ b/arch/x86/kvm/page_track.c
@@ -0,0 +1,58 @@
+/*
+ * Support KVM gust page tracking
+ *
+ * This feature allows us to track page access in guest. Currently, only
+ * write access is tracked.
+ *
+ * Copyright(C) 2015 Intel Corporation.
+ *
+ * Author:
+ *   Xiao Guangrong 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ */
+
+#include 
+#include 
+#include 
+
+#include "mmu.h"
+
+static void page_track_slot_free(struct kvm_memory_slot *slot)
+{
+   int i;
+
+   for (i = 0; i < KVM_PAGE_TRACK_MAX; i++)
+   if (slot->arch.gfn_track[i]) {
+   kvfree(slot->arch.gfn_track[i]);
+   slot->arch.gfn_track[i] = NULL;
+   }
+}
+
+int kvm_page_track_create_memslot(struct kvm_memory_slot *slot,
+ unsigned long npages)
+{
+   int  i, pages = gfn_to_index(slot->base_gfn + npages - 1,
+ slot->base_gfn, PT_PAGE_TABLE_LEVEL) + 1;
+
+   for (i = 0; i < KVM_PAGE_TRACK_MAX; i++) {
+   slot->arch.gfn_track[i] = kvm_kvzalloc(pages *
+   sizeof(*slot->arch.gfn_track[i]));
+   if (!slot->arch.gfn_track[i])
+   goto track_free;
+   }
+
+   return 0;
+
+track_free:
+   page_track_slot_free(slot);
+   return -ENOMEM;
+}
Is it necessary to use the 'unsigned long npages' pareameter? In my 
understanding you are going to track all GFNs in the memory slot anyway, 
right? If you want to keep npages, I think it's better to add a base_gfn 
as well otherwise you are assuming you are going to track the npages GFN 
starting from slot->base_gfn.



+
+void kvm_page_track_free_memslot(struct kvm_memory_slot *free,
+ 

ARM64/KVM: Bad page state in process iperf

2015-12-14 Thread Bhushan Bharat

Hi All,

I am running "iperf" in KVM guest on ARM64 machine and observing below crash.

=
$iperf -c 3.3.3.3 -P 4 -t 0 -i 5 -w 90k

Client connecting to 3.3.3.3, TCP port 5001
TCP window size:  180 KByte (WARNING: requested 90.0 KByte)

[  3] local 3.3.3.1 port 51131 connected with 3.3.3.3 port 5001
[  6] local 3.3.3.1 port 51134 connected with 3.3.3.3 port 5001
[  5] local 3.3.3.1 port 51133 connected with 3.3.3.3 port 5001
[  4] local 3.3.3.1 port 51132 connected with 3.3.3.3 port 5001
[   53.088567] random: nonblocking pool is initialized
[ ID] Interval   Transfer Bandwidth
[  3]  0.0- 5.0 sec   638 MBytes  1.07 Gbits/sec
[  4] 35.0-40.0 sec  1.66 GBytes  2.85 Gbits/sec
[  5] 40.0-45.0 sec  1.11 GBytes  1.90 Gbits/sec
[  4] 40.0-45.0 sec  1.16 GBytes  1.99 Gbits/sec
[   98.895207] BUG: Bad page state in process iperf  pfn:0a584
[   98.896164] page:78296100 count:-1 mapcount:0 mapping:  
(null) index:0x0
[   98.897436] flags: 0x0()
[   98.897885] page dumped because: nonzero _count
[   98.898640] Modules linked in:
[   98.899178] CPU: 0 PID: 1639 Comm: iperf Not tainted 4.1.8-00461-ge5431ad 
#141
[   98.900302] Hardware name: linux,dummy-virt (DT)
[   98.901014] Call trace:
[   98.901406] [] dump_backtrace+0x0/0x12c
[   98.902522] [] show_stack+0x10/0x1c
[   98.903441] [] dump_stack+0x8c/0xdc
[   98.904202] [] bad_page+0xc4/0x114
[   98.904945] [] get_page_from_freelist+0x590/0x63c
[   98.905871] [] __alloc_pages_nodemask+0xec/0x794
[   98.906791] [] skb_page_frag_refill+0x70/0xa8
[   98.907678] [] sk_page_frag_refill+0x20/0xd0
[   98.908550] [] tcp_sendmsg+0x1f8/0x9a8
[   98.909368] [] inet_sendmsg+0x5c/0xd0
[   98.910178] [] sock_sendmsg+0x14/0x58
[   98.911027] [] sock_write_iter+0x64/0xbc
[   98.912119] [] __vfs_write+0xac/0x10c
[   98.913126] [] vfs_write+0x90/0x1a0
[   98.913963] [] SyS_write+0x40/0xa0

---
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 05/11] KVM: page track: introduce kvm_page_track_{add,remove}_page

2015-12-14 Thread Kai Huang



On 12/01/2015 02:26 AM, Xiao Guangrong wrote:

These two functions are the user APIs:
- kvm_page_track_add_page(): add the page to the tracking pool after
   that later specified access on that page will be tracked

- kvm_page_track_remove_page(): remove the page from the tracking pool,
   the specified access on the page is not tracked after the last user is
   gone

Both of these are called under the protection of kvm->srcu or
kvm->slots_lock

Signed-off-by: Xiao Guangrong 
---
  arch/x86/include/asm/kvm_page_track.h |  5 ++
  arch/x86/kvm/page_track.c | 95 +++
  2 files changed, 100 insertions(+)

diff --git a/arch/x86/include/asm/kvm_page_track.h 
b/arch/x86/include/asm/kvm_page_track.h
index 347d5c9..9cc17c6 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -10,4 +10,9 @@ int kvm_page_track_create_memslot(struct kvm_memory_slot 
*slot,
  unsigned long npages);
  void kvm_page_track_free_memslot(struct kvm_memory_slot *free,
 struct kvm_memory_slot *dont);
+
+void kvm_page_track_add_page(struct kvm *kvm, gfn_t gfn,
+enum kvm_page_track_mode mode);
+void kvm_page_track_remove_page(struct kvm *kvm, gfn_t gfn,
+   enum kvm_page_track_mode mode);
  #endif
diff --git a/arch/x86/kvm/page_track.c b/arch/x86/kvm/page_track.c
index 0338d36..ad510db 100644
--- a/arch/x86/kvm/page_track.c
+++ b/arch/x86/kvm/page_track.c
@@ -56,3 +56,98 @@ void kvm_page_track_free_memslot(struct kvm_memory_slot 
*free,
if (!dont || free->arch.gfn_track != dont->arch.gfn_track)
page_track_slot_free(free);
  }
+
+static bool check_mode(enum kvm_page_track_mode mode)
+{
+   if (mode < 0 || mode >= KVM_PAGE_TRACK_MAX)
+   return false;
+
+   return true;
+}
+
+static void update_gfn_track(struct kvm_memory_slot *slot, gfn_t gfn,
+enum kvm_page_track_mode mode, int count)
+{
+   int index, val;
+
+   index = gfn_to_index(gfn, slot->base_gfn, PT_PAGE_TABLE_LEVEL);
+
+   slot->arch.gfn_track[mode][index] += count;
+   val = slot->arch.gfn_track[mode][index];
+   WARN_ON(val < 0);
+}
+
+/*
+ * add guest page to the tracking pool so that corresponding access on that
+ * page will be intercepted.
+ *
+ * It should be called under the protection of kvm->srcu or kvm->slots_lock
+ *
+ * @kvm: the guest instance we are interested in.
+ * @gfn: the guest page.
+ * @mode: tracking mode, currently only write track is supported.
+ */
+void kvm_page_track_add_page(struct kvm *kvm, gfn_t gfn,
+enum kvm_page_track_mode mode)
+{
+   struct kvm_memslots *slots;
+   struct kvm_memory_slot *slot;
+   int i;
+
+   WARN_ON(!check_mode(mode));
+
+   for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+   slots = __kvm_memslots(kvm, i);
+   slot = __gfn_to_memslot(slots, gfn);
+
+   spin_lock(>mmu_lock);
+   update_gfn_track(slot, gfn, mode, 1);
+
+   /*
+* new track stops large page mapping for the
+* tracked page.
+*/
+   kvm_mmu_gfn_disallow_lpage(slot, gfn);
Where is  kvm_mmu_gfn_disallow_lpage? Neither did I see it in your patch 
nor in my own latest KVM repo without your patch :)



+
+   if (mode == KVM_PAGE_TRACK_WRITE)
+   if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn))
+   kvm_flush_remote_tlbs(kvm);

Neither can I find kvm_mmu_slot_gfn_write_protect. Did I miss something?

Thanks,
-Kai

+   spin_unlock(>mmu_lock);
+   }
+}
+
+/*
+ * remove the guest page from the tracking pool which stops the interception
+ * of corresponding access on that page. It is the opposed operation of
+ * kvm_page_track_add_page().
+ *
+ * It should be called under the protection of kvm->srcu or kvm->slots_lock
+ *
+ * @kvm: the guest instance we are interested in.
+ * @gfn: the guest page.
+ * @mode: tracking mode, currently only write track is supported.
+ */
+void kvm_page_track_remove_page(struct kvm *kvm, gfn_t gfn,
+   enum kvm_page_track_mode mode)
+{
+   struct kvm_memslots *slots;
+   struct kvm_memory_slot *slot;
+   int i;
+
+   WARN_ON(!check_mode(mode));
+
+   for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+   slots = __kvm_memslots(kvm, i);
+   slot = __gfn_to_memslot(slots, gfn);
+
+   spin_lock(>mmu_lock);
+   update_gfn_track(slot, gfn, mode, -1);
+
+   /*
+* allow large page mapping for the tracked page
+* after the tracker is gone.
+*/
+   kvm_mmu_gfn_allow_lpage(slot, gfn);
+   spin_unlock(>mmu_lock);
+ 

Re: [RFC PATCH 3/3] x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest

2015-12-14 Thread Michael S. Tsirkin
On Sun, Dec 13, 2015 at 01:28:31PM -0800, Alexander Duyck wrote:
> This patch is meant to provide the guest with a way of flagging DMA pages
> as being dirty to the host when using a direct-assign device within a
> guest.  The advantage to this approach is that it is fairly simple, however
> It currently has a singificant impact on device performance in all the
> scenerios where it won't be needed.
> 
> As such this is really meant only as a proof of concept and to get the ball
> rolling in terms of figuring out how best to approach the issue of dirty
> page tracking for a guest that is using a direct assigned device.  In
> addition with just this patch it should be possible to modify current
> migration approaches so that instead of having to hot-remove the device
> before starting the migration this can instead be delayed until the period
> before the final stop and copy.
> 
> Signed-off-by: Alexander Duyck 
> ---
>  arch/arm/include/asm/dma-mapping.h   |3 ++-
>  arch/arm64/include/asm/dma-mapping.h |5 ++---
>  arch/ia64/include/asm/dma.h  |1 +
>  arch/mips/include/asm/dma-mapping.h  |1 +
>  arch/powerpc/include/asm/swiotlb.h   |1 +
>  arch/tile/include/asm/dma-mapping.h  |1 +
>  arch/unicore32/include/asm/dma-mapping.h |1 +
>  arch/x86/Kconfig |   11 +++
>  arch/x86/include/asm/swiotlb.h   |   26 ++
>  drivers/xen/swiotlb-xen.c|6 ++
>  lib/swiotlb.c|6 ++
>  11 files changed, 58 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm/include/asm/dma-mapping.h 
> b/arch/arm/include/asm/dma-mapping.h
> index ccb3aa64640d..1962d7b471c7 100644
> --- a/arch/arm/include/asm/dma-mapping.h
> +++ b/arch/arm/include/asm/dma-mapping.h
> @@ -167,7 +167,8 @@ static inline bool dma_capable(struct device *dev, 
> dma_addr_t addr, size_t size)
>   return 1;
>  }
>  
> -static inline void dma_mark_clean(void *addr, size_t size) { }
> +static inline void dma_mark_clean(void *addr, size_t size) {}
> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>  
>  extern int arm_dma_set_mask(struct device *dev, u64 dma_mask);
>  
> diff --git a/arch/arm64/include/asm/dma-mapping.h 
> b/arch/arm64/include/asm/dma-mapping.h
> index 61e08f360e31..8d24fe11c8a3 100644
> --- a/arch/arm64/include/asm/dma-mapping.h
> +++ b/arch/arm64/include/asm/dma-mapping.h
> @@ -84,9 +84,8 @@ static inline bool dma_capable(struct device *dev, 
> dma_addr_t addr, size_t size)
>   return addr + size - 1 <= *dev->dma_mask;
>  }
>  
> -static inline void dma_mark_clean(void *addr, size_t size)
> -{
> -}
> +static inline void dma_mark_clean(void *addr, size_t size) {}
> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>  
>  #endif   /* __KERNEL__ */
>  #endif   /* __ASM_DMA_MAPPING_H */
> diff --git a/arch/ia64/include/asm/dma.h b/arch/ia64/include/asm/dma.h
> index 4d97f60f1ef5..d92ebeb2758e 100644
> --- a/arch/ia64/include/asm/dma.h
> +++ b/arch/ia64/include/asm/dma.h
> @@ -20,5 +20,6 @@ extern unsigned long MAX_DMA_ADDRESS;
>  #define free_dma(x)
>  
>  void dma_mark_clean(void *addr, size_t size);
> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>  
>  #endif /* _ASM_IA64_DMA_H */
> diff --git a/arch/mips/include/asm/dma-mapping.h 
> b/arch/mips/include/asm/dma-mapping.h
> index e604f760c4a0..567f6e03e337 100644
> --- a/arch/mips/include/asm/dma-mapping.h
> +++ b/arch/mips/include/asm/dma-mapping.h
> @@ -28,6 +28,7 @@ static inline bool dma_capable(struct device *dev, 
> dma_addr_t addr, size_t size)
>  }
>  
>  static inline void dma_mark_clean(void *addr, size_t size) {}
> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>  
>  #include 
>  
> diff --git a/arch/powerpc/include/asm/swiotlb.h 
> b/arch/powerpc/include/asm/swiotlb.h
> index de99d6e29430..b694e8399e28 100644
> --- a/arch/powerpc/include/asm/swiotlb.h
> +++ b/arch/powerpc/include/asm/swiotlb.h
> @@ -16,6 +16,7 @@
>  extern struct dma_map_ops swiotlb_dma_ops;
>  
>  static inline void dma_mark_clean(void *addr, size_t size) {}
> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>  
>  extern unsigned int ppc_swiotlb_enable;
>  int __init swiotlb_setup_bus_notifier(void);
> diff --git a/arch/tile/include/asm/dma-mapping.h 
> b/arch/tile/include/asm/dma-mapping.h
> index 96ac6cce4a32..79953f09e938 100644
> --- a/arch/tile/include/asm/dma-mapping.h
> +++ b/arch/tile/include/asm/dma-mapping.h
> @@ -58,6 +58,7 @@ static inline phys_addr_t dma_to_phys(struct device *dev, 
> dma_addr_t daddr)
>  }
>  
>  static inline void dma_mark_clean(void *addr, size_t size) {}
> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>  
>  static inline void set_dma_ops(struct device *dev, struct dma_map_ops *ops)
>  {
> diff --git a/arch/unicore32/include/asm/dma-mapping.h 
> 

Re: kvmclock doesn't work, help?

2015-12-14 Thread Paolo Bonzini


On 11/12/2015 22:57, Andy Lutomirski wrote:
> I'm still not seeing the issue.
> 
> The formula is:
> 
> (((rdtsc - pvti->tsc_timestamp) * pvti->tsc_to_system_mul) >>
> pvti->tsc_shift) + pvti->system_time
> 
> Obviously, if you reset pvti->tsc_timestamp to the current tsc value
> after suspend/resume, you would also need to update system_time.
> 
> I don't see what this has to do with suspend/resume or with whether
> the effective scale factor is greater than or less than one.  The only
> suspend/resume interaction I can see is that, if the host allows the
> guest-observed TSC value to jump (which is arguably a bug, what that's
> not important here), it needs to update pvti before resuming the
> guest.

Which is not an issue, since freezing obviously gets all CPUs out of
guest mode.

Marcelo, can you provide an example with made-up values for tsc and pvti?

> Can you clarify concretely what goes wrong here?
> 
> (I'm also at a bit of a loss as to why this needs both system_time and
> tsc_timestamp.  They're redundant in the sense that you could set
> tsc_timestamp to zero and subtract (tsc_timestamp * tsc_to_system_mul) >>
> tsc_shift to system_time without changing the result of the
> calculation.)

You would have to ensure that all elements of pvti are rounded correctly
whenever the base TSC is updated.  Doable, but it does seem simpler to
keep subtract-TSC and add-nanoseconds separate.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] target-i386: check vcpu features before accessing MSR_TSC_AUX

2015-12-14 Thread Haozhong Zhang
On 12/14/15 12:51, Paolo Bonzini wrote:
> 
> 
> On 14/12/2015 12:07, Haozhong Zhang wrote:
> > This patch fix a bug that prevents VM rebooting on recent versions of
> > KVM (from commit 9dbe6cf).
> > 
> > kvm_get_msrs() is called to save guest MSR_TSC_AUX and other MSRs across
> > rebooting. It only checks whether KVM exposes MSR_TSC_AUX to userspace.
> > However, if vcpu does not support rdtscp (e.g. kvm64), current KVM will
> > fail the saving and thus all other MSRs following it will fail in
> > kvm_get_msrs(). As a result, from KVM commit 9dbe6cf that exposes
> > MSR_TSC_AUX, VM can not successfully reboot.
> > 
> > This patch fixes this bug by adding the missing rdtscp feature checks.
> 
> That commit is not in any released kernel.

Right, it's currently only in kvm next. But I assume it would finally come
into a released kernel.

> It's better if we just check
> msr_info->host_initiated in vmx_get_msr and vmx_set_msr. Can you prepare
> a patch?
>

Yes, I'll send a KVM patch later. And then this QEMU patch is not
needed any more.

Haozhong
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking

2015-12-14 Thread Michael S. Tsirkin
On Mon, Dec 14, 2015 at 03:20:26PM +0800, Yang Zhang wrote:
> On 2015/12/14 13:46, Alexander Duyck wrote:
> >On Sun, Dec 13, 2015 at 9:22 PM, Yang Zhang  wrote:
> >>On 2015/12/14 12:54, Alexander Duyck wrote:
> >>>
> >>>On Sun, Dec 13, 2015 at 6:27 PM, Yang Zhang 
> >>>wrote:
> 
> On 2015/12/14 5:28, Alexander Duyck wrote:
> >
> >
> >This patch set is meant to be the guest side code for a proof of concept
> >involving leaving pass-through devices in the guest during the warm-up
> >phase of guest live migration.  In order to accomplish this I have added
> >a
> >new function called dma_mark_dirty that will mark the pages associated
> >with
> >the DMA transaction as dirty in the case of either an unmap or a
> >sync_.*_for_cpu where the DMA direction is either DMA_FROM_DEVICE or
> >DMA_BIDIRECTIONAL.  The pass-through device must still be removed before
> >the stop-and-copy phase, however allowing the device to be present
> >should
> >significantly improve the performance of the guest during the warm-up
> >period.
> >
> >This current implementation is very preliminary and there are number of
> >items still missing.  Specifically in order to make this a more complete
> >solution we need to support:
> >1.  Notifying hypervisor that drivers are dirtying DMA pages received
> >2.  Bypassing page dirtying when it is not needed.
> >
> 
> Shouldn't current log dirty mechanism already cover them?
> >>>
> >>>
> >>>The guest has no way of currently knowing that the hypervisor is doing
> >>>dirty page logging, and the log dirty mechanism currently has no way
> >>>of tracking device DMA accesses.  This change is meant to bridge the
> >>>two so that the guest device driver will force the SWIOTLB DMA API to
> >>>mark pages written to by the device as dirty.
> >>
> >>
> >>OK. This is what we called "dummy write mechanism". Actually, this is just a
> >>workaround before iommu dirty bit ready. Eventually, we need to change to
> >>use the hardware dirty bit. Besides, we may still lost the data if dma
> >>happens during/just before stop and copy phase.
> >
> >Right, this is a "dummy write mechanism" in order to allow for entry
> >tracking.  This only works completely if we force the hardware to
> >quiesce via a hot-plug event before we reach the stop-and-copy phase
> >of the migration.
> >
> >The IOMMU dirty bit approach is likely going to have a significant
> >number of challenges involved.  Looking over the driver and the data
> >sheet it looks like the current implementation is using a form of huge
> >pages in the IOMMU, as such we will need to tear that down and replace
> >it with 4K pages if we don't want to dirty large regions with each DMA
> 
> Yes, we need to split the huge page into small pages to get the small dirty
> range.
> 
> >transaction, and I'm not sure that is something we can change while
> >DMA is active to the affected regions.  In addition the data sheet
> 
> what changes do you mean?
> 
> >references the fact that the page table entries are stored in a
> >translation cache and in order to sync things up you have to
> >invalidate the entries.  I'm not sure what the total overhead would be
> >for invalidating something like a half million 4K pages to migrate a
> >guest with just 2G of RAM, but I would think that might be a bit
> 
> Do you mean the cost of submit the flush request or the performance
> impaction due to IOTLB miss? For the former, we have domain-selective
> invalidation. For the latter, it would be acceptable since live migration
> shouldn't last too long.

That's pretty weak - if migration time is short and speed does not
matter during migration, then all this work is useless, temporarily
switching to a virtual card would be preferable.

> >expensive given the fact that IOMMU accesses aren't known for being
> >incredibly fast when invalidating DMA on the host.
> >
> >- Alex
> >
> 
> 
> -- 
> best regards
> yang
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvmclock doesn't work, help?

2015-12-14 Thread Marcelo Tosatti
On Fri, Dec 11, 2015 at 01:57:23PM -0800, Andy Lutomirski wrote:
> On Thu, Dec 10, 2015 at 1:32 PM, Marcelo Tosatti  wrote:
> > On Wed, Dec 09, 2015 at 01:10:59PM -0800, Andy Lutomirski wrote:
> >> I'm trying to clean up kvmclock and I can't get it to work at all.  My
> >> host is 4.4.0-rc3-ish on a Skylake laptop that has a working TSC.
> >>
> >> If I boot an SMP (2 vcpus) guest, tracing says:
> >>
> >>  qemu-system-x86-2517  [001] 102242.610654: kvm_update_master_clock:
> >> masterclock 0 hostclock tsc offsetmatched 0
> >>  qemu-system-x86-2521  [000] 102242.613742: kvm_track_tsc:
> >> vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc
> >>  qemu-system-x86-2522  [000] 102242.622959: kvm_track_tsc:
> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >>  qemu-system-x86-2521  [000] 102242.645123: kvm_track_tsc:
> >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >>  qemu-system-x86-2522  [000] 102242.647291: kvm_track_tsc:
> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >>  qemu-system-x86-2521  [000] 102242.653369: kvm_track_tsc:
> >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >>  qemu-system-x86-2522  [000] 102242.653429: kvm_track_tsc:
> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >>  qemu-system-x86-2517  [001] 102242.653447: kvm_update_master_clock:
> >> masterclock 0 hostclock tsc offsetmatched 1
> >>  qemu-system-x86-2521  [000] 102242.653657: kvm_update_master_clock:
> >> masterclock 0 hostclock tsc offsetmatched 1
> >>  qemu-system-x86-2522  [002] 102242.664448: kvm_update_master_clock:
> >> masterclock 0 hostclock tsc offsetmatched 1
> >>
> >>
> >> If I boot a UP guest, tracing says:
> >>
> >>  qemu-system-x86-2567  [001] 102370.447484: kvm_update_master_clock:
> >> masterclock 0 hostclock tsc offsetmatched 1
> >>  qemu-system-x86-2571  [002] 102370.447688: kvm_update_master_clock:
> >> masterclock 0 hostclock tsc offsetmatched 1
> >>
> >> I suspect, but I haven't verified, that this is fallout from:
> >>
> >> commit 16a9602158861687c78b6de6dc6a79e6e8a9136f
> >> Author: Marcelo Tosatti 
> >> Date:   Wed May 14 12:43:24 2014 -0300
> >>
> >> KVM: x86: disable master clock if TSC is reset during suspend
> >>
> >> Updating system_time from the kernel clock once master clock
> >> has been enabled can result in time backwards event, in case
> >> kernel clock frequency is lower than TSC frequency.
> >>
> >> Disable master clock in case it is necessary to update it
> >> from the resume path.
> >>
> >> Signed-off-by: Marcelo Tosatti 
> >> Signed-off-by: Paolo Bonzini 
> >>
> >>
> >> Can we please stop making kvmclock more complex?  It's a beast right
> >> now, and not in a good way.  It's far too tangled with the vclock
> >> machinery on both the host and guest sides, the pvclock stuff is not
> >> well thought out (even in principle in an ABI sense), and it's never
> >> been clear to my what problem exactly the kvmclock stuff is supposed
> >> to solve.
> >>
> >> I'm somewhat tempted to suggest that we delete kvmclock entirely and
> >> start over.  A correctly functioning KVM guest using TSC (i.e.
> >> ignoring kvmclock entirely)
> >> seems to work rather more reliably and
> >> considerably faster than a kvmclock guest.
> >>
> >> --Andy
> >>
> >> --
> >> Andy Lutomirski
> >> AMA Capital Management, LLC
> >
> > Andy,
> >
> > I am all for solving practical problems rather than pleasing aesthetic
> > pleasure.
> >
> >> Updating system_time from the kernel clock once master clock
> >> has been enabled can result in time backwards event, in case
> >> kernel clock frequency is lower than TSC frequency.
> >>
> >> Disable master clock in case it is necessary to update it
> >> from the resume path.
> >
> >> once master clock
> >> has been enabled can result in time backwards event, in case
> >> kernel clock frequency is lower than TSC frequency.
> >
> > guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc reads.
> >
> > If the effective frequency of the kernel clock is lower (for example
> > due to NTP correcting the TSC frequency of the system), and you resume
> > and update the system, the following happens:
> >
> > guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc 
> > reads=LARGE VALUE.

guest reads clock to memory at location A = scaled tsc read.

(note TSC is counting at frequency higher than advertised by
processor, thats why NTP has to "slow down" the kernel clock 
which is maintained by successive reads of the TSC).

> > suspend/resume event.
> > guest visible clock = tsc_timestamp (updated at time N) + scaled tsc 
> > reads=0.

Now the guest visible clock contains a tsc_timestamp that has been 
corrected by NTP, over say 5 days. So the tiny NTP correction has
been added up to something 

Re: [PATCH] target-i386: check vcpu features before accessing MSR_TSC_AUX

2015-12-14 Thread Paolo Bonzini


On 14/12/2015 14:54, Haozhong Zhang wrote:
>> > That commit is not in any released kernel.
> Right, it's currently only in kvm next. But I assume it would finally come
> into a released kernel.

Yes, but until it is, it's easier (and better) to fix KVM instead of QEMU.

> > It's better if we just check
> > msr_info->host_initiated in vmx_get_msr and vmx_set_msr. Can you prepare
> > a patch?
>
> Yes, I'll send a KVM patch later. And then this QEMU patch is not
> needed any more.

Great, thanks.

paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] KVM: MTRR: fix fixed MTRR segment look up

2015-12-14 Thread Alexis D...t
It fixes the slow-down of VM running with pci-passthrough, since some MTRR
range changed from MTRR_TYPE_WRBACK to MTRR_TYPE_UNCACHABLE.

Fixes: fa61213746a ("KVM: MTRR: simplify kvm_mtrr_get_guest_memory_type")
Bugzilla: (https://bugzilla.kernel.org/show_bug.cgi?id=107561)

Signed-off-by: Alexis Dambricourt 
---
 arch/x86/kvm/mtrr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
index 9e8bf13..adc54e1 100644
--- a/arch/x86/kvm/mtrr.c
+++ b/arch/x86/kvm/mtrr.c
@@ -267,7 +267,7 @@ static int fixed_mtrr_addr_to_seg(u64 addr)

for (seg = 0; seg < seg_num; seg++) {
mtrr_seg = _seg_table[seg];
-   if (mtrr_seg->start >= addr && addr < mtrr_seg->end)
+   if (mtrr_seg->start <= addr && addr < mtrr_seg->end)
return seg;
}

--
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/1] virtio/s390: one fix

2015-12-14 Thread Cornelia Huck
On Thu,  3 Dec 2015 17:23:59 +0100
Cornelia Huck  wrote:

> Michael,
> 
> here's one fix for the virtio-ccw driver.
> 
> Patch is against master, as your git branches on mst/vhost.git seem
> to be quite old. Is there some branch I can base my own branch on,
> or do you prefer to take patches directly anyway?
> 
> Cornelia Huck (1):
>   virtio/s390: handle error values in irb
> 
>  drivers/s390/virtio/virtio_ccw.c | 62 
> 
>  1 file changed, 37 insertions(+), 25 deletions(-)
> 

Ping?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [Patch V2 1/2] x86, mce: Basic support to add LMCE support to QEMU

2015-12-14 Thread Eduardo Habkost
Hi,

Comments below:

On Thu, Dec 10, 2015 at 02:41:21PM -0500, Ashok Raj wrote:
> This patch adds basic enumeration, control msr's required to support
> Local Machine Check Exception Support (LMCE).
> 
> - Added Local Machine Check definitions, changed MCG_CAP
> - Added support for IA32_FEATURE_CONTROL.
> - When delivering MCE to guest, we deliver to just a single CPU
>   when guest OS has opted in to Local delivery.
> 
> Signed-off-by: Ashok Raj 
> Tested-by: Gong Chen 
> ---
> Resending with proper commit message for second patch
> 
>  target-i386/cpu.c |  8 
>  target-i386/cpu.h |  8 ++--
>  target-i386/kvm.c | 38 +++---
>  3 files changed, 45 insertions(+), 9 deletions(-)
> 
> diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> index 11e5e39..167669a 100644
> --- a/target-i386/cpu.c
> +++ b/target-i386/cpu.c
> @@ -2737,6 +2737,13 @@ static void mce_init(X86CPU *cpu)
>  }
>  }
>  
> +static void feature_control_init(X86CPU *cpu)
> +{
> + CPUX86State *cenv = >env;
> +
> + cenv->msr_ia32_feature_control = ((1<<20) | (1<<0));

FEATURE_CONTROL_LOCKED and FEATURE_CONTROL_LMCE macros would make
the code clearer.

> +}
> +
>  #ifndef CONFIG_USER_ONLY
>  static void x86_cpu_apic_create(X86CPU *cpu, Error **errp)
>  {
> @@ -2858,6 +2865,7 @@ static void x86_cpu_realizefn(DeviceState *dev, Error 
> **errp)
>  #endif
>  
>  mce_init(cpu);
> +feature_control_init(cpu);
>  
>  #ifndef CONFIG_USER_ONLY
>  if (tcg_enabled()) {
> diff --git a/target-i386/cpu.h b/target-i386/cpu.h
> index 84edfd0..a567d7a 100644
> --- a/target-i386/cpu.h
> +++ b/target-i386/cpu.h
> @@ -282,8 +282,9 @@
>  
>  #define MCG_CTL_P   (1ULL<<8)   /* MCG_CAP register available */
>  #define MCG_SER_P   (1ULL<<24) /* MCA recovery/new status bits */
> +#define MCG_LMCE_P   (1ULL<<27) /* Local Machine Check Supported */

Please use spaces instead of tabs. You can detect this and other
coding style issues in this patch with checkpatch.pl.


>  
> -#define MCE_CAP_DEF (MCG_CTL_P|MCG_SER_P)
> +#define MCE_CAP_DEF (MCG_CTL_P|MCG_SER_P|MCG_LMCE_P)

This makes mcg_cap change when upgrading QEMU.

VMs with MCG_LMCE_P enabled shouldn't be migratable to hosts
running older kernels, or the guest may try to read or write
MSR_IA32_MCG_EXT_CTL after miration and get a #GP. That means:

1) Older machine-types (pc-2.5 and older) should keep the
   old (MCG_CTL_P|MCG_SER_P) default.
2) We can't make pc-2.6 enable LMCE by default, either, because
   QEMU guarantees that just changing the machine-type shouldn't
   introduce new host requirements (see:
   http://article.gmane.org/gmane.comp.emulators.qemu/346651)

It looks like we need a new -cpu option to enable the feature,
then. At least until we raise the minimum kernel version
requirements of QEMU.

(I didn't review the kvm_mce_inject() changes as I am not
familiar with the details of how LMCE is implemented.)


>  #define MCE_BANKS_DEF   10
>  
>  #define MCG_CAP_BANKS_MASK 0xff
> @@ -291,6 +292,7 @@
>  #define MCG_STATUS_RIPV (1ULL<<0)   /* restart ip valid */
>  #define MCG_STATUS_EIPV (1ULL<<1)   /* ip points to correct instruction */
>  #define MCG_STATUS_MCIP (1ULL<<2)   /* machine check in progress */
> +#define MCG_STATUS_LMCE (1ULL<<3)   /* Local MCE signaled */
>  
>  #define MCI_STATUS_VAL   (1ULL<<63)  /* valid error */
>  #define MCI_STATUS_OVER  (1ULL<<62)  /* previous errors lost */
> @@ -333,6 +335,7 @@
>  #define MSR_MCG_CAP 0x179
>  #define MSR_MCG_STATUS  0x17a
>  #define MSR_MCG_CTL 0x17b
> +#define MSR_MCG_EXT_CTL  0x4d0
>  
>  #define MSR_P6_EVNTSEL0 0x186
>  
> @@ -892,7 +895,6 @@ typedef struct CPUX86State {
>  
>  uint64_t mcg_status;
>  uint64_t msr_ia32_misc_enable;
> -uint64_t msr_ia32_feature_control;
>  
>  uint64_t msr_fixed_ctr_ctrl;
>  uint64_t msr_global_ctrl;
> @@ -977,8 +979,10 @@ typedef struct CPUX86State {
>  int64_t tsc_khz;
>  void *kvm_xsave_buf;
>  
> +uint64_t msr_ia32_feature_control;
>  uint64_t mcg_cap;
>  uint64_t mcg_ctl;
> +uint64_t mcg_ext_ctl;
>  uint64_t mce_banks[MCE_BANKS_DEF*4];
>  
>  uint64_t tsc_aux;
> diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> index 6dc9846..c61fe1f 100644
> --- a/target-i386/kvm.c
> +++ b/target-i386/kvm.c
> @@ -72,6 +72,7 @@ static bool has_msr_tsc_aux;
>  static bool has_msr_tsc_adjust;
>  static bool has_msr_tsc_deadline;
>  static bool has_msr_feature_control;
> +static bool has_msr_ext_mcg_ctl;
>  static bool has_msr_async_pf_en;
>  static bool has_msr_pv_eoi_en;
>  static bool has_msr_misc_enable;
> @@ -370,18 +371,30 @@ static void kvm_mce_inject(X86CPU *cpu, hwaddr paddr, 
> int code)
>  uint64_t status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN |
>MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S;
>  uint64_t 

Re: [RFC PATCH 3/3] x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest

2015-12-14 Thread Alexander Duyck
On Mon, Dec 14, 2015 at 6:00 AM, Michael S. Tsirkin  wrote:
> On Sun, Dec 13, 2015 at 01:28:31PM -0800, Alexander Duyck wrote:
>> This patch is meant to provide the guest with a way of flagging DMA pages
>> as being dirty to the host when using a direct-assign device within a
>> guest.  The advantage to this approach is that it is fairly simple, however
>> It currently has a singificant impact on device performance in all the
>> scenerios where it won't be needed.
>>
>> As such this is really meant only as a proof of concept and to get the ball
>> rolling in terms of figuring out how best to approach the issue of dirty
>> page tracking for a guest that is using a direct assigned device.  In
>> addition with just this patch it should be possible to modify current
>> migration approaches so that instead of having to hot-remove the device
>> before starting the migration this can instead be delayed until the period
>> before the final stop and copy.
>>
>> Signed-off-by: Alexander Duyck 
>> ---
>>  arch/arm/include/asm/dma-mapping.h   |3 ++-
>>  arch/arm64/include/asm/dma-mapping.h |5 ++---
>>  arch/ia64/include/asm/dma.h  |1 +
>>  arch/mips/include/asm/dma-mapping.h  |1 +
>>  arch/powerpc/include/asm/swiotlb.h   |1 +
>>  arch/tile/include/asm/dma-mapping.h  |1 +
>>  arch/unicore32/include/asm/dma-mapping.h |1 +
>>  arch/x86/Kconfig |   11 +++
>>  arch/x86/include/asm/swiotlb.h   |   26 ++
>>  drivers/xen/swiotlb-xen.c|6 ++
>>  lib/swiotlb.c|6 ++
>>  11 files changed, 58 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm/include/asm/dma-mapping.h 
>> b/arch/arm/include/asm/dma-mapping.h
>> index ccb3aa64640d..1962d7b471c7 100644
>> --- a/arch/arm/include/asm/dma-mapping.h
>> +++ b/arch/arm/include/asm/dma-mapping.h
>> @@ -167,7 +167,8 @@ static inline bool dma_capable(struct device *dev, 
>> dma_addr_t addr, size_t size)
>>   return 1;
>>  }
>>
>> -static inline void dma_mark_clean(void *addr, size_t size) { }
>> +static inline void dma_mark_clean(void *addr, size_t size) {}
>> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>>
>>  extern int arm_dma_set_mask(struct device *dev, u64 dma_mask);
>>
>> diff --git a/arch/arm64/include/asm/dma-mapping.h 
>> b/arch/arm64/include/asm/dma-mapping.h
>> index 61e08f360e31..8d24fe11c8a3 100644
>> --- a/arch/arm64/include/asm/dma-mapping.h
>> +++ b/arch/arm64/include/asm/dma-mapping.h
>> @@ -84,9 +84,8 @@ static inline bool dma_capable(struct device *dev, 
>> dma_addr_t addr, size_t size)
>>   return addr + size - 1 <= *dev->dma_mask;
>>  }
>>
>> -static inline void dma_mark_clean(void *addr, size_t size)
>> -{
>> -}
>> +static inline void dma_mark_clean(void *addr, size_t size) {}
>> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>>
>>  #endif   /* __KERNEL__ */
>>  #endif   /* __ASM_DMA_MAPPING_H */
>> diff --git a/arch/ia64/include/asm/dma.h b/arch/ia64/include/asm/dma.h
>> index 4d97f60f1ef5..d92ebeb2758e 100644
>> --- a/arch/ia64/include/asm/dma.h
>> +++ b/arch/ia64/include/asm/dma.h
>> @@ -20,5 +20,6 @@ extern unsigned long MAX_DMA_ADDRESS;
>>  #define free_dma(x)
>>
>>  void dma_mark_clean(void *addr, size_t size);
>> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>>
>>  #endif /* _ASM_IA64_DMA_H */
>> diff --git a/arch/mips/include/asm/dma-mapping.h 
>> b/arch/mips/include/asm/dma-mapping.h
>> index e604f760c4a0..567f6e03e337 100644
>> --- a/arch/mips/include/asm/dma-mapping.h
>> +++ b/arch/mips/include/asm/dma-mapping.h
>> @@ -28,6 +28,7 @@ static inline bool dma_capable(struct device *dev, 
>> dma_addr_t addr, size_t size)
>>  }
>>
>>  static inline void dma_mark_clean(void *addr, size_t size) {}
>> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>>
>>  #include 
>>
>> diff --git a/arch/powerpc/include/asm/swiotlb.h 
>> b/arch/powerpc/include/asm/swiotlb.h
>> index de99d6e29430..b694e8399e28 100644
>> --- a/arch/powerpc/include/asm/swiotlb.h
>> +++ b/arch/powerpc/include/asm/swiotlb.h
>> @@ -16,6 +16,7 @@
>>  extern struct dma_map_ops swiotlb_dma_ops;
>>
>>  static inline void dma_mark_clean(void *addr, size_t size) {}
>> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>>
>>  extern unsigned int ppc_swiotlb_enable;
>>  int __init swiotlb_setup_bus_notifier(void);
>> diff --git a/arch/tile/include/asm/dma-mapping.h 
>> b/arch/tile/include/asm/dma-mapping.h
>> index 96ac6cce4a32..79953f09e938 100644
>> --- a/arch/tile/include/asm/dma-mapping.h
>> +++ b/arch/tile/include/asm/dma-mapping.h
>> @@ -58,6 +58,7 @@ static inline phys_addr_t dma_to_phys(struct device *dev, 
>> dma_addr_t daddr)
>>  }
>>
>>  static inline void dma_mark_clean(void *addr, size_t size) {}
>> +static inline void dma_mark_dirty(void *addr, size_t size) {}
>>
>>  static inline 

Re: [PATCH v1] kvm/x86: Remove Hyper-V SynIC timer stopping

2015-12-14 Thread Andrey Smetanin



On 12/14/2015 07:09 PM, Paolo Bonzini wrote:



On 14/12/2015 16:33, Andrey Smetanin wrote:

It's possible that guest send us Hyper-V EOM at the middle
of Hyper-V SynIC timer running, so we start processing of Hyper-V
SynIC timers in vcpu context and stop the Hyper-V SynIC timer
uncoditionally and lose time expiration which Windows 2012R2 guest
expects.

The patch fixes such situation by not stopping Hyper-V SynIC timer
at all, because it's safe to restart it without stop in vcpu context
and timer callback always returns HRTIMER_NORESTART.


Can you summarize with a "picture" what is the bad race?

Currently I see that guest starts periodic timer and doesn't clear 
message slot after timer expires, so timer expires again and trying to 
deliver expiration message but message slot is still busy so we set 
->msg_pending flag for guest to receive EOM. timer restarts again and 
while it's not expired guest notifies us with EOM, in this case we 
schedule timer processing in vcpu context by KVM_REQ_HV_STIMER, 
kvm_hv_process_stimers() is called in vcpu context and stops the timer

before it expires, so timer is disabled forever but guest expects it's
periodic expiration(15ms).

I do not understand why Windows doesn't clear message slot for a long 
time, it's likely need to be analyzed with debugger(and need more 
research). But we can go out from such situation by such fix.



The patch seems safe, but I'd like to have a better understanding of
what goes wrong.

Paolo


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [Patch V2 1/2] x86, mce: Basic support to add LMCE support to QEMU

2015-12-14 Thread Borislav Petkov
On Mon, Dec 14, 2015 at 02:23:56PM -0200, Eduardo Habkost wrote:
> > -#define MCE_CAP_DEF (MCG_CTL_P|MCG_SER_P)
> > +#define MCE_CAP_DEF (MCG_CTL_P|MCG_SER_P|MCG_LMCE_P)
> 
> This makes mcg_cap change when upgrading QEMU.
> 
> VMs with MCG_LMCE_P enabled shouldn't be migratable to hosts
> running older kernels, or the guest may try to read or write
> MSR_IA32_MCG_EXT_CTL after miration and get a #GP. That means:
> 
> 1) Older machine-types (pc-2.5 and older) should keep the
>old (MCG_CTL_P|MCG_SER_P) default.
> 2) We can't make pc-2.6 enable LMCE by default, either, because
>QEMU guarantees that just changing the machine-type shouldn't
>introduce new host requirements (see:
>http://article.gmane.org/gmane.comp.emulators.qemu/346651)
> 
> It looks like we need a new -cpu option to enable the feature,
> then. At least until we raise the minimum kernel version
> requirements of QEMU.

... and obviously LMCE is vendor-specific so it cannot be enabled on
!Intel guests with a define like that. mce_init() in qemu should check
vendor too.

The same mistake was done with SER_P but that's much harder to change,
as we discussed previously.

-- 
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
-- 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1] kvm/x86: Remove Hyper-V SynIC timer stopping

2015-12-14 Thread Andrey Smetanin



On 12/14/2015 07:09 PM, Paolo Bonzini wrote:



On 14/12/2015 16:33, Andrey Smetanin wrote:

It's possible that guest send us Hyper-V EOM at the middle
of Hyper-V SynIC timer running, so we start processing of Hyper-V
SynIC timers in vcpu context and stop the Hyper-V SynIC timer
uncoditionally and lose time expiration which Windows 2012R2 guest
expects.

The patch fixes such situation by not stopping Hyper-V SynIC timer
at all, because it's safe to restart it without stop in vcpu context
and timer callback always returns HRTIMER_NORESTART.


Can you summarize with a "picture" what is the bad race?

hostguest
start periodic stimer
start periodic timer
timer expires after 15ms
send expiration message into guest
restart periodic timer
doing something
timer expires again after 15 ms
msg slot is still not cleared so
setup ->msg_pending
restart periodic timer
doing something
process timer msg and clear slot
so ->msg_pending was set:
send EOM into host
received EOM
queued call of kvm_hv_process_stimers()
by KVM_REQ_HV_STIMER

kvm_hv_process_stimers():
...
stimer_stop()
if (time_now >= stimer->exp_time)
stimer_expiration(stimer);
But time_now  < stimer->exp_time, so stimer_expiration is not called
in this case and timer is not restarted. so guest lose timer.


The patch seems safe, but I'd like to have a better understanding of
what goes wrong.

Paolo


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 3/3] x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest

2015-12-14 Thread Michael S. Tsirkin
On Mon, Dec 14, 2015 at 08:34:00AM -0800, Alexander Duyck wrote:
> > This way distro can use a guest agent to disable
> > dirtying until before migration starts.
> 
> Right.  For a v2 version I would definitely want to have some way to
> limit the scope of this.  My main reason for putting this out here is
> to start altering the course of discussions since it seems like were
> weren't getting anywhere with the ixgbevf migration changes that were
> being proposed.

Absolutely, thanks for working on this.

> >> + unsigned long pg_addr, start;
> >> +
> >> + start = (unsigned long)addr;
> >> + pg_addr = PAGE_ALIGN(start + size);
> >> + start &= ~(sizeof(atomic_t) - 1);
> >> +
> >> + /* trigger a write fault on each page, excluding first page */
> >> + while ((pg_addr -= PAGE_SIZE) > start)
> >> + atomic_add(0, (atomic_t *)pg_addr);
> >> +
> >> + /* trigger a write fault on first word of DMA */
> >> + atomic_add(0, (atomic_t *)start);
> >
> > start might not be aligned correctly for a cast to atomic_t.
> > It's harmless to do this for any memory, so I think you should
> > just do this for 1st byte of all pages including the first one.
> 
> You may not have noticed it but I actually aligned start in the line
> after pg_addr.

Yes you did. alignof would make it a bit more noticeable.

>  However instead of aligning to the start of the next
> atomic_t I just masked off the lower bits so that we start at the
> DWORD that contains the first byte of the starting address.  The
> assumption here is that I cannot trigger any sort of fault since if I
> have access to a given byte within a DWORD I will have access to the
> entire DWORD.

I'm curious where does this come from.  Isn't it true that access is
controlled at page granularity normally, so you can touch beginning of
page just as well?

>  I coded this up so that the spots where we touch the
> memory should match up with addresses provided by the hardware to
> perform the DMA over the PCI bus.

Yes but there's no requirement to do it like this from
virt POV. You just need to touch each page.

> Also I intentionally ran from highest address to lowest since that way
> we don't risk pushing the first cache line of the DMA buffer out of
> the L1 cache due to the PAGE_SIZE stride.
> 
> - Alex

Interesting. How does order of access help with this?

By the way, if you are into these micro-optimizations you might want to
limit prefetch, to this end you want to access the last line of the
page.  And it's probably worth benchmarking a bit and not doing it all just
based on theory, keep code simple in v1 otherwise.

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvmclock doesn't work, help?

2015-12-14 Thread Marcelo Tosatti
On Mon, Dec 14, 2015 at 10:07:21AM -0800, Andy Lutomirski wrote:
> On Fri, Dec 11, 2015 at 3:48 PM, Marcelo Tosatti  wrote:
> > On Fri, Dec 11, 2015 at 01:57:23PM -0800, Andy Lutomirski wrote:
> >> On Thu, Dec 10, 2015 at 1:32 PM, Marcelo Tosatti  
> >> wrote:
> >> > On Wed, Dec 09, 2015 at 01:10:59PM -0800, Andy Lutomirski wrote:
> >> >> I'm trying to clean up kvmclock and I can't get it to work at all.  My
> >> >> host is 4.4.0-rc3-ish on a Skylake laptop that has a working TSC.
> >> >>
> >> >> If I boot an SMP (2 vcpus) guest, tracing says:
> >> >>
> >> >>  qemu-system-x86-2517  [001] 102242.610654: kvm_update_master_clock:
> >> >> masterclock 0 hostclock tsc offsetmatched 0
> >> >>  qemu-system-x86-2521  [000] 102242.613742: kvm_track_tsc:
> >> >> vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc
> >> >>  qemu-system-x86-2522  [000] 102242.622959: kvm_track_tsc:
> >> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >> >>  qemu-system-x86-2521  [000] 102242.645123: kvm_track_tsc:
> >> >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >> >>  qemu-system-x86-2522  [000] 102242.647291: kvm_track_tsc:
> >> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >> >>  qemu-system-x86-2521  [000] 102242.653369: kvm_track_tsc:
> >> >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >> >>  qemu-system-x86-2522  [000] 102242.653429: kvm_track_tsc:
> >> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
> >> >>  qemu-system-x86-2517  [001] 102242.653447: kvm_update_master_clock:
> >> >> masterclock 0 hostclock tsc offsetmatched 1
> >> >>  qemu-system-x86-2521  [000] 102242.653657: kvm_update_master_clock:
> >> >> masterclock 0 hostclock tsc offsetmatched 1
> >> >>  qemu-system-x86-2522  [002] 102242.664448: kvm_update_master_clock:
> >> >> masterclock 0 hostclock tsc offsetmatched 1
> >> >>
> >> >>
> >> >> If I boot a UP guest, tracing says:
> >> >>
> >> >>  qemu-system-x86-2567  [001] 102370.447484: kvm_update_master_clock:
> >> >> masterclock 0 hostclock tsc offsetmatched 1
> >> >>  qemu-system-x86-2571  [002] 102370.447688: kvm_update_master_clock:
> >> >> masterclock 0 hostclock tsc offsetmatched 1
> >> >>
> >> >> I suspect, but I haven't verified, that this is fallout from:
> >> >>
> >> >> commit 16a9602158861687c78b6de6dc6a79e6e8a9136f
> >> >> Author: Marcelo Tosatti 
> >> >> Date:   Wed May 14 12:43:24 2014 -0300
> >> >>
> >> >> KVM: x86: disable master clock if TSC is reset during suspend
> >> >>
> >> >> Updating system_time from the kernel clock once master clock
> >> >> has been enabled can result in time backwards event, in case
> >> >> kernel clock frequency is lower than TSC frequency.
> >> >>
> >> >> Disable master clock in case it is necessary to update it
> >> >> from the resume path.
> >> >>
> >> >> Signed-off-by: Marcelo Tosatti 
> >> >> Signed-off-by: Paolo Bonzini 
> >> >>
> >> >>
> >> >> Can we please stop making kvmclock more complex?  It's a beast right
> >> >> now, and not in a good way.  It's far too tangled with the vclock
> >> >> machinery on both the host and guest sides, the pvclock stuff is not
> >> >> well thought out (even in principle in an ABI sense), and it's never
> >> >> been clear to my what problem exactly the kvmclock stuff is supposed
> >> >> to solve.
> >> >>
> >> >> I'm somewhat tempted to suggest that we delete kvmclock entirely and
> >> >> start over.  A correctly functioning KVM guest using TSC (i.e.
> >> >> ignoring kvmclock entirely)
> >> >> seems to work rather more reliably and
> >> >> considerably faster than a kvmclock guest.
> >> >>
> >> >> --Andy
> >> >>
> >> >> --
> >> >> Andy Lutomirski
> >> >> AMA Capital Management, LLC
> >> >
> >> > Andy,
> >> >
> >> > I am all for solving practical problems rather than pleasing aesthetic
> >> > pleasure.
> >> >
> >> >> Updating system_time from the kernel clock once master clock
> >> >> has been enabled can result in time backwards event, in case
> >> >> kernel clock frequency is lower than TSC frequency.
> >> >>
> >> >> Disable master clock in case it is necessary to update it
> >> >> from the resume path.
> >> >
> >> >> once master clock
> >> >> has been enabled can result in time backwards event, in case
> >> >> kernel clock frequency is lower than TSC frequency.
> >> >
> >> > guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc 
> >> > reads.
> >> >
> >> > If the effective frequency of the kernel clock is lower (for example
> >> > due to NTP correcting the TSC frequency of the system), and you resume
> >> > and update the system, the following happens:
> >> >
> >> > guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc 
> >> > reads=LARGE VALUE.
> >
> > guest reads clock to memory at location A = scaled tsc 

Re: [PATCH backport v3.12..v3.14 2/4] MIPS: KVM: Fix ASID restoration logic

2015-12-14 Thread Jiri Slaby
On 12/11/2015, 06:06 PM, James Hogan wrote:
> commit 002374f371bd02df864cce1fe85d90dc5b292837 upstream.

Applied 2/4 -- 4/4 to 3.12. Thanks.

-- 
js
suse labs
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [Patch V2 1/2] x86, mce: Basic support to add LMCE support to QEMU

2015-12-14 Thread Eduardo Habkost
On Mon, Dec 14, 2015 at 02:11:46PM -0500, Raj, Ashok wrote:
> On Mon, Dec 14, 2015 at 05:37:38PM +0100, Borislav Petkov wrote:
> > 
> > ... and obviously LMCE is vendor-specific so it cannot be enabled on
> > !Intel guests with a define like that. mce_init() in qemu should check
> > vendor too.
> > 
> > The same mistake was done with SER_P but that's much harder to change,
> > as we discussed previously.
> > 
> 
> This is mostly harmless.. since the MCG_CAP space is shared and has no
> conflict between vendors. Also just the CAP being set has no effect.
> 
> The Guest OS needs to opt-in and the SIGBUS indicating SRAR are the only 
> ones that are treated special treatment. Also Intel was the only one
> broadcasting MCE's.. so we are like the rest now :-)

If the feature won't be enabled by default, I believe it will be
OK to not make it conditional on CPUID vendor (as we would be
simply doing that the user asked for).

But if it's going to be enabled by default, I would like to get
some assurance that there won't be conflicts between vendors in
the MCG_CAP bits.

-- 
Eduardo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 3/3] x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest

2015-12-14 Thread Alexander Duyck
On Mon, Dec 14, 2015 at 9:20 AM, Michael S. Tsirkin  wrote:
> On Mon, Dec 14, 2015 at 08:34:00AM -0800, Alexander Duyck wrote:
>> > This way distro can use a guest agent to disable
>> > dirtying until before migration starts.
>>
>> Right.  For a v2 version I would definitely want to have some way to
>> limit the scope of this.  My main reason for putting this out here is
>> to start altering the course of discussions since it seems like were
>> weren't getting anywhere with the ixgbevf migration changes that were
>> being proposed.
>
> Absolutely, thanks for working on this.
>
>> >> + unsigned long pg_addr, start;
>> >> +
>> >> + start = (unsigned long)addr;
>> >> + pg_addr = PAGE_ALIGN(start + size);
>> >> + start &= ~(sizeof(atomic_t) - 1);
>> >> +
>> >> + /* trigger a write fault on each page, excluding first page */
>> >> + while ((pg_addr -= PAGE_SIZE) > start)
>> >> + atomic_add(0, (atomic_t *)pg_addr);
>> >> +
>> >> + /* trigger a write fault on first word of DMA */
>> >> + atomic_add(0, (atomic_t *)start);
>> >
>> > start might not be aligned correctly for a cast to atomic_t.
>> > It's harmless to do this for any memory, so I think you should
>> > just do this for 1st byte of all pages including the first one.
>>
>> You may not have noticed it but I actually aligned start in the line
>> after pg_addr.
>
> Yes you did. alignof would make it a bit more noticeable.
>
>>  However instead of aligning to the start of the next
>> atomic_t I just masked off the lower bits so that we start at the
>> DWORD that contains the first byte of the starting address.  The
>> assumption here is that I cannot trigger any sort of fault since if I
>> have access to a given byte within a DWORD I will have access to the
>> entire DWORD.
>
> I'm curious where does this come from.  Isn't it true that access is
> controlled at page granularity normally, so you can touch beginning of
> page just as well?

Yeah, I am pretty sure it probably is page granularity.  However my
thought was to try and stick to the start of the DMA as the last
access.  That way we don't pull in any more cache lines than we need
to in order to dirty the pages.  Usually the start of the DMA region
will contain some sort of headers or something that needs to be
accessed with the highest priority so I wanted to make certain that we
were forcing usable data into the L1 cache rather than just the first
cache line of the page where the DMA started.  If however the start of
a DMA was the start of the page there is nothing there to prevent
that.

>>  I coded this up so that the spots where we touch the
>> memory should match up with addresses provided by the hardware to
>> perform the DMA over the PCI bus.
>
> Yes but there's no requirement to do it like this from
> virt POV. You just need to touch each page.

I know, but at the same time if we match up with the DMA then it is
more likely that we avoid grabbing unneeded cache lines.  In the case
of most drivers the data for headers and start is at the start of the
DMA.  So if we dirty the cache line associated with the start of the
DMA it will be pulled into the L1 cache and there is a greater chance
that it may already be prefetched as well.

>> Also I intentionally ran from highest address to lowest since that way
>> we don't risk pushing the first cache line of the DMA buffer out of
>> the L1 cache due to the PAGE_SIZE stride.
>
> Interesting. How does order of access help with this?

If you use a PAGE_SIZE stride you will start evicting things from L1
cache after something like 8 accesses on an x86 processor as most of
the recent ones have a 32K 8 way associative L1 cache.  So if I go
from back to front then I evict the stuff that would likely be in the
data portion of a buffer instead of headers which are usually located
at the front.

> By the way, if you are into these micro-optimizations you might want to
> limit prefetch, to this end you want to access the last line of the
> page.  And it's probably worth benchmarking a bit and not doing it all just
> based on theory, keep code simple in v1 otherwise.

My main goal for now is functional code over high performance code.
That is why I have kept this code fairly simple.  I might have done
some optimization but it was as much about the optimization as keeping
the code simple.  For example by using the start of the page instead
of the end I could easily do the comparison against start and avoid
doing more than one write per page.

The issue for me doing performance testing is that I don't have
anything that uses DMA blocks that are actually big enough to make use
of the PAGE_SIZE stride.  That is why the PAGE_SIZE stride portion is
mostly just theoretical.  I just have a few NICs and most of them only
allocate 1 page or so for DMA buffers.  What little benchmarking I
have done with netperf only showed a ~1% CPU penalty for the page
dirtying code.  For setups where we did more with the DMA such 

Re: [Qemu-devel] [Patch V2 1/2] x86, mce: Basic support to add LMCE support to QEMU

2015-12-14 Thread Raj, Ashok
On Mon, Dec 14, 2015 at 05:37:38PM +0100, Borislav Petkov wrote:
> 
> ... and obviously LMCE is vendor-specific so it cannot be enabled on
> !Intel guests with a define like that. mce_init() in qemu should check
> vendor too.
> 
> The same mistake was done with SER_P but that's much harder to change,
> as we discussed previously.
> 

This is mostly harmless.. since the MCG_CAP space is shared and has no
conflict between vendors. Also just the CAP being set has no effect.

The Guest OS needs to opt-in and the SIGBUS indicating SRAR are the only 
ones that are treated special treatment. Also Intel was the only one
broadcasting MCE's.. so we are like the rest now :-)

Cheers,
Ashok


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvmclock doesn't work, help?

2015-12-14 Thread Andy Lutomirski
On Fri, Dec 11, 2015 at 3:48 PM, Marcelo Tosatti  wrote:
> On Fri, Dec 11, 2015 at 01:57:23PM -0800, Andy Lutomirski wrote:
>> On Thu, Dec 10, 2015 at 1:32 PM, Marcelo Tosatti  wrote:
>> > On Wed, Dec 09, 2015 at 01:10:59PM -0800, Andy Lutomirski wrote:
>> >> I'm trying to clean up kvmclock and I can't get it to work at all.  My
>> >> host is 4.4.0-rc3-ish on a Skylake laptop that has a working TSC.
>> >>
>> >> If I boot an SMP (2 vcpus) guest, tracing says:
>> >>
>> >>  qemu-system-x86-2517  [001] 102242.610654: kvm_update_master_clock:
>> >> masterclock 0 hostclock tsc offsetmatched 0
>> >>  qemu-system-x86-2521  [000] 102242.613742: kvm_track_tsc:
>> >> vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc
>> >>  qemu-system-x86-2522  [000] 102242.622959: kvm_track_tsc:
>> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>> >>  qemu-system-x86-2521  [000] 102242.645123: kvm_track_tsc:
>> >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>> >>  qemu-system-x86-2522  [000] 102242.647291: kvm_track_tsc:
>> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>> >>  qemu-system-x86-2521  [000] 102242.653369: kvm_track_tsc:
>> >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>> >>  qemu-system-x86-2522  [000] 102242.653429: kvm_track_tsc:
>> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
>> >>  qemu-system-x86-2517  [001] 102242.653447: kvm_update_master_clock:
>> >> masterclock 0 hostclock tsc offsetmatched 1
>> >>  qemu-system-x86-2521  [000] 102242.653657: kvm_update_master_clock:
>> >> masterclock 0 hostclock tsc offsetmatched 1
>> >>  qemu-system-x86-2522  [002] 102242.664448: kvm_update_master_clock:
>> >> masterclock 0 hostclock tsc offsetmatched 1
>> >>
>> >>
>> >> If I boot a UP guest, tracing says:
>> >>
>> >>  qemu-system-x86-2567  [001] 102370.447484: kvm_update_master_clock:
>> >> masterclock 0 hostclock tsc offsetmatched 1
>> >>  qemu-system-x86-2571  [002] 102370.447688: kvm_update_master_clock:
>> >> masterclock 0 hostclock tsc offsetmatched 1
>> >>
>> >> I suspect, but I haven't verified, that this is fallout from:
>> >>
>> >> commit 16a9602158861687c78b6de6dc6a79e6e8a9136f
>> >> Author: Marcelo Tosatti 
>> >> Date:   Wed May 14 12:43:24 2014 -0300
>> >>
>> >> KVM: x86: disable master clock if TSC is reset during suspend
>> >>
>> >> Updating system_time from the kernel clock once master clock
>> >> has been enabled can result in time backwards event, in case
>> >> kernel clock frequency is lower than TSC frequency.
>> >>
>> >> Disable master clock in case it is necessary to update it
>> >> from the resume path.
>> >>
>> >> Signed-off-by: Marcelo Tosatti 
>> >> Signed-off-by: Paolo Bonzini 
>> >>
>> >>
>> >> Can we please stop making kvmclock more complex?  It's a beast right
>> >> now, and not in a good way.  It's far too tangled with the vclock
>> >> machinery on both the host and guest sides, the pvclock stuff is not
>> >> well thought out (even in principle in an ABI sense), and it's never
>> >> been clear to my what problem exactly the kvmclock stuff is supposed
>> >> to solve.
>> >>
>> >> I'm somewhat tempted to suggest that we delete kvmclock entirely and
>> >> start over.  A correctly functioning KVM guest using TSC (i.e.
>> >> ignoring kvmclock entirely)
>> >> seems to work rather more reliably and
>> >> considerably faster than a kvmclock guest.
>> >>
>> >> --Andy
>> >>
>> >> --
>> >> Andy Lutomirski
>> >> AMA Capital Management, LLC
>> >
>> > Andy,
>> >
>> > I am all for solving practical problems rather than pleasing aesthetic
>> > pleasure.
>> >
>> >> Updating system_time from the kernel clock once master clock
>> >> has been enabled can result in time backwards event, in case
>> >> kernel clock frequency is lower than TSC frequency.
>> >>
>> >> Disable master clock in case it is necessary to update it
>> >> from the resume path.
>> >
>> >> once master clock
>> >> has been enabled can result in time backwards event, in case
>> >> kernel clock frequency is lower than TSC frequency.
>> >
>> > guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc reads.
>> >
>> > If the effective frequency of the kernel clock is lower (for example
>> > due to NTP correcting the TSC frequency of the system), and you resume
>> > and update the system, the following happens:
>> >
>> > guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc 
>> > reads=LARGE VALUE.
>
> guest reads clock to memory at location A = scaled tsc read.
>
> (note TSC is counting at frequency higher than advertised by
> processor, thats why NTP has to "slow down" the kernel clock
> which is maintained by successive reads of the TSC).
>
>> > suspend/resume event.
>> > guest visible clock = tsc_timestamp (updated at time N) + 

Re: [Qemu-devel] [Patch V2 1/2] x86, mce: Basic support to add LMCE support to QEMU

2015-12-14 Thread Raj, Ashok
Hi Eduardo,

Thanks for the feedback. All the suggestions make sense..

On Mon, Dec 14, 2015 at 02:23:56PM -0200, Eduardo Habkost wrote:
> > +static void feature_control_init(X86CPU *cpu)
> > +{
> > +   CPUX86State *cenv = >env;
> > +
> > +   cenv->msr_ia32_feature_control = ((1<<20) | (1<<0));
> 
> FEATURE_CONTROL_LOCKED and FEATURE_CONTROL_LMCE macros would make
> the code clearer.

Makes sense.. will fix it next round.

> > @@ -282,8 +282,9 @@
> >  
> >  #define MCG_CTL_P   (1ULL<<8)   /* MCG_CAP register available */
> >  #define MCG_SER_P   (1ULL<<24) /* MCA recovery/new status bits */
> > +#define MCG_LMCE_P (1ULL<<27) /* Local Machine Check Supported */
> 
> Please use spaces instead of tabs. You can detect this and other
> coding style issues in this patch with checkpatch.pl.
> 

Will change the ones i added. 

> 
> >  
> > -#define MCE_CAP_DEF (MCG_CTL_P|MCG_SER_P)
> > +#define MCE_CAP_DEF (MCG_CTL_P|MCG_SER_P|MCG_LMCE_P)
> 
> This makes mcg_cap change when upgrading QEMU.
> 
> VMs with MCG_LMCE_P enabled shouldn't be migratable to hosts
> running older kernels, or the guest may try to read or write
> MSR_IA32_MCG_EXT_CTL after miration and get a #GP. That means:
> 
> 1) Older machine-types (pc-2.5 and older) should keep the
>old (MCG_CTL_P|MCG_SER_P) default.
> 2) We can't make pc-2.6 enable LMCE by default, either, because
>QEMU guarantees that just changing the machine-type shouldn't
>introduce new host requirements (see:
>http://article.gmane.org/gmane.comp.emulators.qemu/346651)
> 
> It looks like we need a new -cpu option to enable the feature,
> then. At least until we raise the minimum kernel version
> requirements of QEMU.

Ok.. i will look this up. 
> 
> (I didn't review the kvm_mce_inject() changes as I am not
> familiar with the details of how LMCE is implemented.)
> 
This is quite simple.. we just don't broadcast MCE's and set LMCE in 
MCG_STATUS.

I will make the suggested changes and resend.

Cheers,
Ashok

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] vgaarb: fix signal handling in vga_get()

2015-12-14 Thread Kirill A. Shutemov
On Thu, Dec 10, 2015 at 11:28:58AM +0100, David Herrmann wrote:
> Hi
> 
> On Mon, Nov 30, 2015 at 3:17 AM, Kirill A. Shutemov
>  wrote:
> > There are few defects in vga_get() related to signal hadning:
> >
> >   - we shouldn't check for pending signals for TASK_UNINTERRUPTIBLE
> > case;
> >
> >   - if we found pending signal we must remove ourself from wait queue
> > and change task state back to running;
> >
> >   - -ERESTARTSYS is more appropriate, I guess.
> >
> > Signed-off-by: Kirill A. Shutemov 
> > ---
> >
> > Alex, I try to get KVM with VGA passthrough working properly. I have i915
> > (HD 4600) on the host and GTX 580 for the guest. The guest GPU is not
> > capabale of EFI, so I have to use x-vga=on. It's kinda work with your
> > patch for i915.enable_hd_vgaarb=1. But guest refuse to initialize the GPU
> > after KVM was not shut down correctly, resulting in host crash like this:
> >
> > BUG: unable to handle kernel paging request at 880870187ed8
> > IP: [] 0x880870187ed8
> > PGD 2129067 PUD 800841e3
> > Oops: 0011 [#1] PREEMPT SMP
> > Modules linked in: iwlmvm iwlwifi
> > CPU: 6 PID: 3983 Comm: qemu-system-x86 Not tainted 4.3.0-gentoo #6
> > Hardware name: Gigabyte Technology Co., Ltd. Z87X-UD7 TH/Z87X-UD7 TH-CF, 
> > BIOS F5a 06/12/2014
> > task: 88087a91 ti: 8808632c task.ti: 8808632c
> > RIP: 0010:[]  [] 0x880870187ed8
> > RSP: 0018:8808632c3d08  EFLAGS: 00010006
> > RAX: 880870187db0 RBX: 70187f58 RCX: 
> > RDX:  RSI: 0003 RDI: 880870187db0
> > RBP: 8808632c3d48 R08:  R09: 
> > R10: 000103c0 R11: 0293 R12: 81ea03c8
> > R13: 8104c7cb R14:  R15: 0003
> > FS:  7f984f9b2700() GS:88089f38() knlGS:
> > CS:  0010 DS:  ES:  CR0: 80050033
> > CR2: 880870187ed8 CR3: 0008645f8000 CR4: 001426e0
> > Stack:
> >  810cc83d 632c3d28  81ea03c0
> >  0046 0003  
> >  8808632c3d80 810cca44 88087af63800 0286
> > Call Trace:
> >  [] ? __wake_up_common+0x4d/0x80
> >  [] __wake_up+0x34/0x50
> >  [] __vga_put+0x73/0xd0
> >  [] vga_put+0x54/0x80
> >  [] vfio_pci_vga_rw+0x1d2/0x220
> >  [] vfio_pci_rw+0x33/0x60
> >  [] vfio_pci_write+0x17/0x20
> >  [] vfio_device_fops_write+0x26/0x30
> >  [] __vfs_write+0x23/0xe0
> >  [] ? __vfs_read+0x23/0xd0
> >  [] ? do_vfs_ioctl+0x2b5/0x490
> >  [] vfs_write+0xa4/0x190
> >  [] SyS_pwrite64+0x66/0xa0
> >  [] entry_SYSCALL_64_fastpath+0x12/0x6a
> > Code: 88 ff ff e0 7e 18 70 08 88 ff ff 00 8c 57 76 08 88 ff ff 20 7f 18 70 
> > 08 88 ff ff 08 7f 18 70 08 88 ff ff 94 51 1a 81 ff ff ff ff <09> 00 00 00 
> > 00 00 00 00 01 8c 57 76 08 88 ff ff 00 8c 57 76 08
> > RIP  [] 0x880870187ed8
> >  RSP 
> > CR2: 880870187ed8
> >
> > The patch fixes the crash, but doesn't help with getting GPU in guest
> > working again.
> >
> > Any ideas?
> >
> > ---
> >  drivers/gpu/vga/vgaarb.c | 6 --
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/vga/vgaarb.c b/drivers/gpu/vga/vgaarb.c
> > index 3166e4bc4eb6..9abcaa53bd25 100644
> > --- a/drivers/gpu/vga/vgaarb.c
> > +++ b/drivers/gpu/vga/vgaarb.c
> > @@ -395,8 +395,10 @@ int vga_get(struct pci_dev *pdev, unsigned int rsrc, 
> > int interruptible)
> > set_current_state(interruptible ?
> >   TASK_INTERRUPTIBLE :
> >   TASK_UNINTERRUPTIBLE);
> > -   if (signal_pending(current)) {
> > -   rc = -EINTR;
> > +   if (interruptible && signal_pending(current)) {
> > +   __set_current_state(TASK_RUNNING);
> > +   remove_wait_queue(_wait_queue, );
> > +   rc = -ERESTARTSYS;
> > break;
> 
> All 3 points are valid, and the patch looks good to me:
> 
> Reviewed-by: David Herrmann 
> 
> However, there seems to be a race between vga_lock and putting the
> thread asleep.

I'm not sure I understand the race you're talking about.
Could you elaborate?

-- 
 Kirill A. Shutemov
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH kvm-unit-tests 3/6] x86/pmu: expect failure with nmi_watchdog

2015-12-14 Thread Andrew Jones
On Mon, Dec 14, 2015 at 10:24:18PM +0100, Radim Krčmář wrote:
> Host's nmi_watchdog takes one slot, making the "all counters" unit-test
> fail.  We know exactly what happens, mark it as expected failure.
> 
> PMU test is now executed regardless of host_nmi_watchdog.
> 
> Signed-off-by: Radim Krčmář 
> ---
>  x86/pmu.c | 9 -
>  x86/unittests.cfg | 3 +--
>  2 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/x86/pmu.c b/x86/pmu.c
> index c68980044dee..4ca93235b977 100644
> --- a/x86/pmu.c
> +++ b/x86/pmu.c
> @@ -92,6 +92,7 @@ struct pmu_event {
>  };
>  
>  static int num_counters;
> +bool host_nmi_watchdog;
>  
>  char *buf;
>  
> @@ -291,7 +292,7 @@ static void check_counters_many(void)
>   if (!verify_counter([i]))
>   break;
>  
> - report("all counters", i == n);
> + report_xfail("all counters", host_nmi_watchdog, i == n);

How about outputting "host_nmi_watchdog=%d" as well?

>  }
>  
>  static void check_counter_overflow(void)
> @@ -374,6 +375,7 @@ static void check_rdpmc(void)
>  
>  int main(int ac, char **av)
>  {
> + int i;
>   struct cpuid id = cpuid(10);
>  
>   setup_vm();
> @@ -385,6 +387,11 @@ int main(int ac, char **av)
>   ebx.full = id.b;
>   edx.full = id.d;
>  
> + /* XXX: horrible command line parsing */
> + for (i = 1; i < ac; i++)
> + if (!strcmp(av[i], "host_nmi_watchdog=1"))
> + host_nmi_watchdog = true;
> +
>   if (!eax.split.version_id) {
>   printf("No pmu is detected!\n");
>   return report_summary();
> diff --git a/x86/unittests.cfg b/x86/unittests.cfg
> index c15c86df..6b94ad93dcf0 100644
> --- a/x86/unittests.cfg
> +++ b/x86/unittests.cfg
> @@ -106,8 +106,7 @@ file = msr.flat
>  
>  [pmu]
>  file = pmu.flat
> -extra_params = -cpu host
> -check = /proc/sys/kernel/nmi_watchdog=0
> +extra_params = -cpu host -append "host_nmi_watchdog=`cat 
> /proc/sys/kernel/nmi_watchdog`"
>  
>  [port80]
>  file = port80.flat
> -- 
> 2.6.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH kvm-unit-tests 2/6] x86/*: report skipped tests

2015-12-14 Thread Andrew Jones
On Mon, Dec 14, 2015 at 10:24:17PM +0100, Radim Krčmář wrote:
> No care to consistency or exhaustivity was given.
> 
> (svm-disabled test should be redone and it's weird that x86/hyperv_synic
>  is about the only one that does report_skip when unsupported.)
> 
> Signed-off-by: Radim Krčmář 
> ---
>  x86/apic.c | 7 +++
>  x86/emulator.c | 2 +-
>  x86/hyperv_synic.c | 2 +-
>  x86/pku.c  | 2 +-
>  x86/pmu.c  | 2 +-
>  x86/smap.c | 2 +-
>  x86/svm.c  | 2 +-
>  x86/tsc.c  | 2 +-
>  8 files changed, 10 insertions(+), 11 deletions(-)
> 
> diff --git a/x86/apic.c b/x86/apic.c
> index d4eec529e535..57af86de8f8c 100644
> --- a/x86/apic.c
> +++ b/x86/apic.c
> @@ -27,7 +27,7 @@ static void tsc_deadline_timer_isr(isr_regs_t *regs)
>  ++tdt_count;
>  }
>  
> -static void start_tsc_deadline_timer(void)
> +static void __test_tsc_deadline_timer(void)
>  {
>  handle_irq(TSC_DEADLINE_TIMER_VECTOR, tsc_deadline_timer_isr);
>  irq_enable();
> @@ -45,7 +45,6 @@ static int enable_tsc_deadline_timer(void)
>  if (cpuid(1).c & (1 << 24)) {
>  lvtt = TSC_DEADLINE_TIMER_MODE | TSC_DEADLINE_TIMER_VECTOR;
>  apic_write(APIC_LVTT, lvtt);
> -start_tsc_deadline_timer();
>  return 1;
>  } else {
>  return 0;
> @@ -55,9 +54,9 @@ static int enable_tsc_deadline_timer(void)
>  static void test_tsc_deadline_timer(void)
>  {
>  if(enable_tsc_deadline_timer()) {
> -printf("tsc deadline timer enabled\n");
> +__test_tsc_deadline_timer();
>  } else {
> -printf("tsc deadline timer not detected\n");
> +report_skip("tsc deadline timer not detected\n");

You probably don't want this '\n' anymore.

>  }
>  }
>  
> diff --git a/x86/emulator.c b/x86/emulator.c
> index e5c1c6b9a2f3..b64a5fe0f3dc 100644
> --- a/x86/emulator.c
> +++ b/x86/emulator.c
> @@ -1062,7 +1062,7 @@ static void illegal_movbe_handler(struct ex_regs *regs)
>  static void test_illegal_movbe(void)
>  {
>   if (!(cpuid(1).c & (1 << 22))) {
> - printf("SKIP: illegal movbe\n");
> + report_skip("illegal movbe");
>   return;
>   }
>  
> diff --git a/x86/hyperv_synic.c b/x86/hyperv_synic.c
> index 18d1295bfb37..602b79392bfd 100644
> --- a/x86/hyperv_synic.c
> +++ b/x86/hyperv_synic.c
> @@ -228,7 +228,7 @@ int main(int ac, char **av)
>  
>  report("Hyper-V SynIC test", ok);
>  } else {
> -report("Hyper-V SynIC is not supported", true);
> +report_skip("Hyper-V SynIC is not supported");
>  }
>  
>  return report_summary();
> diff --git a/x86/pku.c b/x86/pku.c
> index 0e00b9984d70..58971d21ed05 100644
> --- a/x86/pku.c
> +++ b/x86/pku.c
> @@ -91,7 +91,7 @@ int main(int ac, char **av)
>  
>  if (!(cpuid_indexed(7, 0).c & (1 << X86_FEATURE_PKU))) {
>  printf("PKU not enabled, exiting\n");
> -exit(1);
> +return report_summary();
>  }
>  
>  setup_vm();
> diff --git a/x86/pmu.c b/x86/pmu.c
> index 03f80190bb25..c68980044dee 100644
> --- a/x86/pmu.c
> +++ b/x86/pmu.c
> @@ -387,7 +387,7 @@ int main(int ac, char **av)
>  
>   if (!eax.split.version_id) {
>   printf("No pmu is detected!\n");
> - return 1;
> + return report_summary();
>   }
>   printf("PMU version: %d\n", eax.split.version_id);
>   printf("GP counters: %d\n", eax.split.num_counters);
> diff --git a/x86/smap.c b/x86/smap.c
> index d8a7ae82dc00..0aa44054bd48 100644
> --- a/x86/smap.c
> +++ b/x86/smap.c
> @@ -93,7 +93,7 @@ int main(int ac, char **av)
>  
>   if (!(cpuid_indexed(7, 0).b & (1 << X86_FEATURE_SMAP))) {
>   printf("SMAP not enabled, exiting\n");
> - exit(1);
> + return report_summary();
>   }
>  
>   setup_vm();
> diff --git a/x86/svm.c b/x86/svm.c
> index 1046ddf73732..ff1a0f34b4bf 100644
> --- a/x86/svm.c
> +++ b/x86/svm.c
> @@ -1064,7 +1064,7 @@ int main(int ac, char **av)
>  
>  if (!(cpuid(0x8001).c & 4)) {
>  printf("SVM not availble\n");
> -return 0;
> +return report_summary();
>  }
>  
>  setup_svm();
> diff --git a/x86/tsc.c b/x86/tsc.c
> index c71dc2a7abe0..ee247459fb42 100644
> --- a/x86/tsc.c
> +++ b/x86/tsc.c
> @@ -43,5 +43,5 @@ int main()
>   test_rdtscp(0x100);
>   } else
>   printf("rdtscp not supported\n");
> - return 0;
> + return report_summary();
>  }
> -- 
> 2.6.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH kvm-unit-tests 4/6] run_tests: generalize check

2015-12-14 Thread Andrew Jones
On Mon, Dec 14, 2015 at 10:24:19PM +0100, Radim Krčmář wrote:
> config attribute "check" is currently unused.
> Provide a simple implementation instead of removing it.
> 
> Signed-off-by: Radim Krčmář 
> ---
>  run_tests.sh | 15 ---
>  1 file changed, 4 insertions(+), 11 deletions(-)
> 
> diff --git a/run_tests.sh b/run_tests.sh
> index 4d813b9a7084..b0b064f2e341 100755
> --- a/run_tests.sh
> +++ b/run_tests.sh
> @@ -35,17 +35,10 @@ function run()
>  return
>  fi
>  
> -# check a file for a particular value before running a test
> -# the check line can contain multiple files to check separated by a space
> -# but each check parameter needs to be of the form =
> -for check_param in ${check[@]}; do
> -path=${check_param%%=*}
> -value=${check_param#*=}
> -if [ "$path" ] && [ "$(cat $path)" != "$value" ]; then
> -echo "skip $1 ($path not equal to $value)"
> -return
> -fi
> -done
> +eval $check || {
> +echo "skip $1 (failed \$($check))"
> +return
> +}

I think we should use "\e[33mSKIP\e[0m" for skip. Maybe we should create
pass(),fail(),skip() functions in order to make sure all callers use the
same prefix with the same color.

>  
>  cmdline="TESTNAME=$testname ACCEL=$accel ./$TEST_DIR-run $kernel -smp 
> $smp $opts"
>  if [ $verbose != 0 ]; then
> -- 
> 2.6.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH kvm-unit-tests 1/6] lib/report: allow test skipping

2015-12-14 Thread Andrew Jones
On Mon, Dec 14, 2015 at 04:00:19PM -0600, Andrew Jones wrote:
> On Mon, Dec 14, 2015 at 10:24:16PM +0100, Radim Krčmář wrote:
> > This patch allows us to explicitly mark a unit-test as skipped.
> > If all unit-tests were skipped, the whole test is reported as skipped as
> > well.  This also includes the case where no report()s were done, but
> > the test still ended with report_summary().
> > 
> > When the whole test is skipped, ./run_tests.sh prints "skip" instead of
> > green "PASS".
> > 
> > Return value of 77 is used to please Autotools.  I also renamed few
> > things in reporting code and chose to refactor a logic while at it.
> > 
> > Signed-off-by: Radim Krčmář 
> > ---
> >  lib/libcflat.h |  1 +
> >  lib/report.c   | 43 +--
> >  run_tests.sh   | 13 -
> >  3 files changed, 38 insertions(+), 19 deletions(-)
> > 
> > diff --git a/lib/libcflat.h b/lib/libcflat.h
> > index 9747ccdbc9f1..070818354ee1 100644
> > --- a/lib/libcflat.h
> > +++ b/lib/libcflat.h
> > @@ -61,6 +61,7 @@ void report_prefix_push(const char *prefix);
> >  void report_prefix_pop(void);
> >  void report(const char *msg_fmt, bool pass, ...);
> >  void report_xfail(const char *msg_fmt, bool xfail, bool pass, ...);
> > +void report_skip(const char *msg_fmt, ...);
> >  int report_summary(void);
> >  
> >  #define ARRAY_SIZE(_a) (sizeof(_a)/sizeof((_a)[0]))
> > diff --git a/lib/report.c b/lib/report.c
> > index 35e664108a92..e07baa347298 100644
> > --- a/lib/report.c
> > +++ b/lib/report.c
> > @@ -13,7 +13,7 @@
> >  #include "libcflat.h"
> >  #include "asm/spinlock.h"
> >  
> > -static unsigned int tests, failures, xfailures;
> > +static unsigned int tests, failures, xfailures, skipped;
> >  static char prefixes[256];
> >  static struct spinlock lock;
> >  
> > @@ -43,25 +43,27 @@ void report_prefix_pop(void)
> > spin_unlock();
> >  }
> >  
> > -void va_report_xfail(const char *msg_fmt, bool xfail, bool cond, va_list 
> > va)
> > +void va_report(const char *msg_fmt, bool pass, bool xfail, bool skip, 
> > va_list va)
> 
> Line greater than 80 char here. Yes, that was supposed to induce an eye
> roll. But... this file doesn't have any "long" lines yet, so we could
> continue avoiding them.
> 
> >  {
> > -   char *pass = xfail ? "XPASS" : "PASS";
> > -   char *fail = xfail ? "XFAIL" : "FAIL";
> > char buf[2000];
> > +   char *prefix = skip ? "SKIP"
> > +   : xfail ? (pass ? "XPASS" : "XFAIL")
> > +   : (pass ? "PASS"  : "FAIL");
> >  
> > spin_lock();
> >  
> > tests++;
> > -   printf("%s: ", cond ? pass : fail);
> > +   printf("%s: ", prefix);
> > puts(prefixes);
> > vsnprintf(buf, sizeof(buf), msg_fmt, va);
> > puts(buf);
> > puts("\n");
> > -   if (xfail && cond)
> > -   failures++;
> > -   else if (xfail)
> > +
> > +   if (skip)
> > +   skipped++;
> > +   else if (xfail && !pass)
> > xfailures++;
> > -   else if (!cond)
> > +   else if (xfail || !pass)
> > failures++;
> >  
> > spin_unlock();
> > @@ -71,7 +73,7 @@ void report(const char *msg_fmt, bool pass, ...)
> >  {
> > va_list va;
> > va_start(va, pass);
> > -   va_report_xfail(msg_fmt, false, pass, va);
> > +   va_report(msg_fmt, pass, false, false, va);
> > va_end(va);
> >  }
> >  
> > @@ -79,7 +81,15 @@ void report_xfail(const char *msg_fmt, bool xfail, bool 
> > pass, ...)
> >  {
> > va_list va;
> > va_start(va, pass);
> > -   va_report_xfail(msg_fmt, xfail, pass, va);
> > +   va_report(msg_fmt, pass, xfail, false, va);
> > +   va_end(va);
> > +}
> > +
> > +void report_skip(const char *msg_fmt, ...)
> > +{
> > +   va_list va;
> > +   va_start(va, msg_fmt);
> > +   va_report(msg_fmt, false, false, true, va);
> > va_end(va);
> >  }
> >  
> > @@ -89,9 +99,14 @@ int report_summary(void)
> >  
> > printf("\nSUMMARY: %d tests, %d unexpected failures", tests, failures);
> > if (xfailures)
> > -   printf(", %d expected failures\n", xfailures);
> > -   else
> > -   printf("\n");
> > +   printf(", %d expected failures", xfailures);
> > +   if (skipped)
> > +   printf(", %d skipped", skipped);
> > +   printf("\n");
> > +
> > +   if (tests == skipped)
> > +   return 77; /* blame AUTOTOOLS */
> > +
> > return failures > 0 ? 1 : 0;
> >  
> > spin_unlock();
> > diff --git a/run_tests.sh b/run_tests.sh
> > index fad22a935b00..4d813b9a7084 100755
> > --- a/run_tests.sh
> > +++ b/run_tests.sh
> > @@ -55,12 +55,15 @@ function run()
> >  # extra_params in the config file may contain backticks that need to be
> >  # expanded, so use eval to start qemu
> >  eval $cmdline >> test.log
> > +# The first bit of return value is too hard to use, just skip it.
> > +# Unit-tests' return value is shifted by one.
> > +case $(($? >> 1)) in
> > +0)  echo -ne "\e[32mPASS\e[0m" ;;
> > +77) echo -ne "skip" ;;
> 
> 

Re: [RFC PATCH 3/3] x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest

2015-12-14 Thread Alexander Duyck
On Mon, Dec 14, 2015 at 12:52 PM, Michael S. Tsirkin  wrote:
> On Mon, Dec 14, 2015 at 09:59:13AM -0800, Alexander Duyck wrote:
>> On Mon, Dec 14, 2015 at 9:20 AM, Michael S. Tsirkin  wrote:
>> > On Mon, Dec 14, 2015 at 08:34:00AM -0800, Alexander Duyck wrote:
>> >> > This way distro can use a guest agent to disable
>> >> > dirtying until before migration starts.
>> >>
>> >> Right.  For a v2 version I would definitely want to have some way to
>> >> limit the scope of this.  My main reason for putting this out here is
>> >> to start altering the course of discussions since it seems like were
>> >> weren't getting anywhere with the ixgbevf migration changes that were
>> >> being proposed.
>> >
>> > Absolutely, thanks for working on this.
>> >
>> >> >> + unsigned long pg_addr, start;
>> >> >> +
>> >> >> + start = (unsigned long)addr;
>> >> >> + pg_addr = PAGE_ALIGN(start + size);
>> >> >> + start &= ~(sizeof(atomic_t) - 1);
>> >> >> +
>> >> >> + /* trigger a write fault on each page, excluding first page */
>> >> >> + while ((pg_addr -= PAGE_SIZE) > start)
>> >> >> + atomic_add(0, (atomic_t *)pg_addr);
>> >> >> +
>> >> >> + /* trigger a write fault on first word of DMA */
>> >> >> + atomic_add(0, (atomic_t *)start);
>
> Actually, I have second thoughts about using atomic_add here,
> especially for _sync.
>
> Many architectures do
>
> #define ATOMIC_OP_RETURN(op, c_op)  \
> static inline int atomic_##op##_return(int i, atomic_t *v)  \
> {   \
> unsigned long flags;\
> int ret;\
> \
> raw_local_irq_save(flags);  \
> ret = (v->counter = v->counter c_op i); \
> raw_local_irq_restore(flags);   \
> \
> return ret; \
> }
>
> and this is not safe if device is still doing DMA to/from
> this memory.
>
> Generally, atomic_t is there for SMP effects, not for sync
> with devices.
>
> This is why I said you should do
> cmpxchg(pg_addr, 0xdead, 0xdead);
>
> Yes, we probably never actually want to run m68k within a VM,
> but let's not misuse interfaces like this.

Right now this implementation is for x86 only.  Any other architecture
currently reports dma_mark_dirty as an empty inline function.  The
reason why I chose the atomic_add for x86 is simply because it is
guaranteed dirty the cache line with relatively few instructions and
operands as all I have to have is the pointer and 0.

For the m68k we could implement it as a cmpxchg instead.  The general
thought here is that each architecture is probably going to have to do
it a little bit differently.

>> >> >
>> >> > start might not be aligned correctly for a cast to atomic_t.
>> >> > It's harmless to do this for any memory, so I think you should
>> >> > just do this for 1st byte of all pages including the first one.
>> >>
>> >> You may not have noticed it but I actually aligned start in the line
>> >> after pg_addr.
>> >
>> > Yes you did. alignof would make it a bit more noticeable.
>> >
>> >>  However instead of aligning to the start of the next
>> >> atomic_t I just masked off the lower bits so that we start at the
>> >> DWORD that contains the first byte of the starting address.  The
>> >> assumption here is that I cannot trigger any sort of fault since if I
>> >> have access to a given byte within a DWORD I will have access to the
>> >> entire DWORD.
>> >
>> > I'm curious where does this come from.  Isn't it true that access is
>> > controlled at page granularity normally, so you can touch beginning of
>> > page just as well?
>>
>> Yeah, I am pretty sure it probably is page granularity.  However my
>> thought was to try and stick to the start of the DMA as the last
>> access.  That way we don't pull in any more cache lines than we need
>> to in order to dirty the pages.  Usually the start of the DMA region
>> will contain some sort of headers or something that needs to be
>> accessed with the highest priority so I wanted to make certain that we
>> were forcing usable data into the L1 cache rather than just the first
>> cache line of the page where the DMA started.  If however the start of
>> a DMA was the start of the page there is nothing there to prevent
>> that.
>
> OK, maybe this helps. You should document all these tricks
> in code comments.

I'll try to get that taken care of for v2.

>> >>  I coded this up so that the spots where we touch the
>> >> memory should match up with addresses provided by the hardware to
>> >> perform the DMA 

Re: [Qemu-devel] [Patch V2 1/2] x86, mce: Basic support to add LMCE support to QEMU

2015-12-14 Thread Borislav Petkov
On Mon, Dec 14, 2015 at 02:11:46PM -0500, Raj, Ashok wrote:
> This is mostly harmless.. since the MCG_CAP space is shared and has no
> conflict between vendors. Also just the CAP being set has no effect.

Of course it does - we check SER_P in machine_check_poll() and when
I emulate an AMD guest and inject errors into it, error handling is
obviously wrong, see:

https://lkml.kernel.org/r/20151123150355.ge5...@pd.tnic

-- 
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
-- 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH kvm-unit-tests 1/6] lib/report: allow test skipping

2015-12-14 Thread Andrew Jones
On Mon, Dec 14, 2015 at 10:24:16PM +0100, Radim Krčmář wrote:
> This patch allows us to explicitly mark a unit-test as skipped.
> If all unit-tests were skipped, the whole test is reported as skipped as
> well.  This also includes the case where no report()s were done, but
> the test still ended with report_summary().
> 
> When the whole test is skipped, ./run_tests.sh prints "skip" instead of
> green "PASS".
> 
> Return value of 77 is used to please Autotools.  I also renamed few
> things in reporting code and chose to refactor a logic while at it.
> 
> Signed-off-by: Radim Krčmář 
> ---
>  lib/libcflat.h |  1 +
>  lib/report.c   | 43 +--
>  run_tests.sh   | 13 -
>  3 files changed, 38 insertions(+), 19 deletions(-)
> 
> diff --git a/lib/libcflat.h b/lib/libcflat.h
> index 9747ccdbc9f1..070818354ee1 100644
> --- a/lib/libcflat.h
> +++ b/lib/libcflat.h
> @@ -61,6 +61,7 @@ void report_prefix_push(const char *prefix);
>  void report_prefix_pop(void);
>  void report(const char *msg_fmt, bool pass, ...);
>  void report_xfail(const char *msg_fmt, bool xfail, bool pass, ...);
> +void report_skip(const char *msg_fmt, ...);
>  int report_summary(void);
>  
>  #define ARRAY_SIZE(_a) (sizeof(_a)/sizeof((_a)[0]))
> diff --git a/lib/report.c b/lib/report.c
> index 35e664108a92..e07baa347298 100644
> --- a/lib/report.c
> +++ b/lib/report.c
> @@ -13,7 +13,7 @@
>  #include "libcflat.h"
>  #include "asm/spinlock.h"
>  
> -static unsigned int tests, failures, xfailures;
> +static unsigned int tests, failures, xfailures, skipped;
>  static char prefixes[256];
>  static struct spinlock lock;
>  
> @@ -43,25 +43,27 @@ void report_prefix_pop(void)
>   spin_unlock();
>  }
>  
> -void va_report_xfail(const char *msg_fmt, bool xfail, bool cond, va_list va)
> +void va_report(const char *msg_fmt, bool pass, bool xfail, bool skip, 
> va_list va)

Line greater than 80 char here. Yes, that was supposed to induce an eye
roll. But... this file doesn't have any "long" lines yet, so we could
continue avoiding them.

>  {
> - char *pass = xfail ? "XPASS" : "PASS";
> - char *fail = xfail ? "XFAIL" : "FAIL";
>   char buf[2000];
> + char *prefix = skip ? "SKIP"
> + : xfail ? (pass ? "XPASS" : "XFAIL")
> + : (pass ? "PASS"  : "FAIL");
>  
>   spin_lock();
>  
>   tests++;
> - printf("%s: ", cond ? pass : fail);
> + printf("%s: ", prefix);
>   puts(prefixes);
>   vsnprintf(buf, sizeof(buf), msg_fmt, va);
>   puts(buf);
>   puts("\n");
> - if (xfail && cond)
> - failures++;
> - else if (xfail)
> +
> + if (skip)
> + skipped++;
> + else if (xfail && !pass)
>   xfailures++;
> - else if (!cond)
> + else if (xfail || !pass)
>   failures++;
>  
>   spin_unlock();
> @@ -71,7 +73,7 @@ void report(const char *msg_fmt, bool pass, ...)
>  {
>   va_list va;
>   va_start(va, pass);
> - va_report_xfail(msg_fmt, false, pass, va);
> + va_report(msg_fmt, pass, false, false, va);
>   va_end(va);
>  }
>  
> @@ -79,7 +81,15 @@ void report_xfail(const char *msg_fmt, bool xfail, bool 
> pass, ...)
>  {
>   va_list va;
>   va_start(va, pass);
> - va_report_xfail(msg_fmt, xfail, pass, va);
> + va_report(msg_fmt, pass, xfail, false, va);
> + va_end(va);
> +}
> +
> +void report_skip(const char *msg_fmt, ...)
> +{
> + va_list va;
> + va_start(va, msg_fmt);
> + va_report(msg_fmt, false, false, true, va);
>   va_end(va);
>  }
>  
> @@ -89,9 +99,14 @@ int report_summary(void)
>  
>   printf("\nSUMMARY: %d tests, %d unexpected failures", tests, failures);
>   if (xfailures)
> - printf(", %d expected failures\n", xfailures);
> - else
> - printf("\n");
> + printf(", %d expected failures", xfailures);
> + if (skipped)
> + printf(", %d skipped", skipped);
> + printf("\n");
> +
> + if (tests == skipped)
> + return 77; /* blame AUTOTOOLS */
> +
>   return failures > 0 ? 1 : 0;
>  
>   spin_unlock();
> diff --git a/run_tests.sh b/run_tests.sh
> index fad22a935b00..4d813b9a7084 100755
> --- a/run_tests.sh
> +++ b/run_tests.sh
> @@ -55,12 +55,15 @@ function run()
>  # extra_params in the config file may contain backticks that need to be
>  # expanded, so use eval to start qemu
>  eval $cmdline >> test.log
> +# The first bit of return value is too hard to use, just skip it.
> +# Unit-tests' return value is shifted by one.
> +case $(($? >> 1)) in
> +0)  echo -ne "\e[32mPASS\e[0m" ;;
> +77) echo -ne "skip" ;;

Why not "\e[31mSKIP\e[0m"? (and without those escape sequences echo doesn't
need -e)

> +*)  echo -ne "\e[31mFAIL\e[0m"
> +esac
>  
> -if [ $? -le 1 ]; then
> -echo -e "\e[32mPASS\e[0m $1"
> -else
> -echo -e 

Re: kvmclock doesn't work, help?

2015-12-14 Thread Marcelo Tosatti
On Mon, Dec 14, 2015 at 02:44:15PM +0100, Paolo Bonzini wrote:
> 
> 
> On 11/12/2015 22:57, Andy Lutomirski wrote:
> > I'm still not seeing the issue.
> > 
> > The formula is:
> > 
> > (((rdtsc - pvti->tsc_timestamp) * pvti->tsc_to_system_mul) >>
> > pvti->tsc_shift) + pvti->system_time
> > 
> > Obviously, if you reset pvti->tsc_timestamp to the current tsc value
> > after suspend/resume, you would also need to update system_time.
> > 
> > I don't see what this has to do with suspend/resume or with whether
> > the effective scale factor is greater than or less than one.  The only
> > suspend/resume interaction I can see is that, if the host allows the
> > guest-observed TSC value to jump (which is arguably a bug, what that's
> > not important here), it needs to update pvti before resuming the
> > guest.
> 
> Which is not an issue, since freezing obviously gets all CPUs out of
> guest mode.
> 
> Marcelo, can you provide an example with made-up values for tsc and pvti?

I meant "systemtime" at ^.

guest visible clock = systemtime (updated at time 0, guest initialization) + 
scaled tsc reads=LARGE VALUE.
  ^^
guest reads clock to memory at location A = scaled tsc read.
-> suspend resume event 
guest visible clock = systemtime (updated at time AFTER SUSPEND) + scaled tsc 
reads=0.
guest reads clock to memory at location B.

So before the suspend/resume event, the clock is the RAW TSC values
(scaled by kvmclock, but the frequency of the RAW TSC). 

After suspend/resume event, the clock is updated from the host
via get_kernel_ns(), which reads the corrected NTP frequency TSC.

So you switch the timebase, from a clock running at a given frequency,
to a clock running at another frequency (effective frequency).

Example:

RAW TSC NTP corrected TSC
t0  10  10
t1  20  19.99
t2  30  29.98
t3  40  39.97
t4  50  49.96

...

if you suddenly switch from RAW TSC to NTP corrected TSC, 
you can see what will happen.

Does that make sense?

> > suspend/resume event.
> > guest visible clock = tsc_timestamp (updated at time N) + scaled tsc
> > reads=0.


> 
> > Can you clarify concretely what goes wrong here?
> > 
> > (I'm also at a bit of a loss as to why this needs both system_time and
> > tsc_timestamp.  They're redundant in the sense that you could set
> > tsc_timestamp to zero and subtract (tsc_timestamp * tsc_to_system_mul) >>
> > tsc_shift to system_time without changing the result of the
> > calculation.)
> 
> You would have to ensure that all elements of pvti are rounded correctly
> whenever the base TSC is updated.  Doable, but it does seem simpler to
> keep subtract-TSC and add-nanoseconds separate.
> 
> Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH kvm-unit-tests 0/6] Improve the output of test runners

2015-12-14 Thread Andrew Jones
On Mon, Dec 14, 2015 at 10:24:15PM +0100, Radim Krčmář wrote:
> This series is a mix of patches that change the output of run_tests.sh
> and x86-run.  The output of ./run_tests.sh now looks like this:
> 
> > PASS apic (14 tests, 0 unexpected failures)
> > PASS ioapic (19 tests, 0 unexpected failures)
> > PASS smptest (1 tests, 0 unexpected failures)
> > PASS smptest3 (1 tests, 0 unexpected failures)
> > PASS vmexit_cpuid 
> > PASS vmexit_vmcall 
> > PASS vmexit_mov_from_cr8 
> > PASS vmexit_mov_to_cr8 
> > PASS vmexit_inl_pmtimer 
> > PASS vmexit_ipi 
> > PASS vmexit_ipi_halt 
> > PASS vmexit_ple_round_robin 
> > PASS access 
> > skip smap (0 tests, 0 unexpected failures)
> > skip pku (0 tests, 0 unexpected failures)
> > PASS emulator (132 tests, 0 unexpected failures, 1 skipped)
> > PASS eventinj (13 tests, 0 unexpected failures)
> > PASS hypercall (2 tests, 0 unexpected failures)
> > PASS idt_test (4 tests, 0 unexpected failures)
> > PASS msr (13 tests, 0 unexpected failures)
> > PASS pmu (67 tests, 0 unexpected failures, 1 expected failures)
> > PASS port80 
> > PASS realmode 
> > PASS s3 
> > PASS sieve 
> > PASS tsc (3 tests, 0 unexpected failures)
> > PASS tsc_adjust (5 tests, 0 unexpected failures)
> > PASS xsave (17 tests, 0 unexpected failures)
> > PASS rmap_chain 
> > skip svm (0 tests, 0 unexpected failures)
> > skip svm-disabled (0 tests, 0 unexpected failures)
> > skip taskswitch (i386 only)
> > skip taskswitch2 (i386 only)
> > PASS kvmclock_test 
> > PASS pcid (3 tests, 0 unexpected failures)
> > skip vmx (0 tests, 0 unexpected failures)
> > PASS debug (7 tests, 0 unexpected failures)
> > qemu-kvm: Property '.hv-synic' not found
> > skip hyperv_synic (failed $(echo quit | $qemu -enable-kvm -cpu 
> > kvm64,hv_synic -device hyperv-testdev -monitor stdio > /dev/null))

I'm not sure I like the (summary) addition. A summary of why we skip
would be useful, like the (i386 only) stuff, but otherwise it doesn't
seem necessary, and the "(failed $(echo quit | $qemu -enable-kvm -cpu
kvm64,hv_synic -device hyperv-testdev -monitor stdio > /dev/null))"
summary is a bit ugly, wrapping on many terminals.

Another comment that is series wide is that all these changes need to be
tested with 'make standalone' (and many of your patches will require
changes to scripts/mkstandalone.sh, for which you will never forgive
me for having written :-)

Thanks,
drew
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvmclock doesn't work, help?

2015-12-14 Thread Andy Lutomirski
On Mon, Dec 14, 2015 at 2:00 PM, Marcelo Tosatti  wrote:
> On Mon, Dec 14, 2015 at 02:44:15PM +0100, Paolo Bonzini wrote:
>>
>>
>> On 11/12/2015 22:57, Andy Lutomirski wrote:
>> > I'm still not seeing the issue.
>> >
>> > The formula is:
>> >
>> > (((rdtsc - pvti->tsc_timestamp) * pvti->tsc_to_system_mul) >>
>> > pvti->tsc_shift) + pvti->system_time
>> >
>> > Obviously, if you reset pvti->tsc_timestamp to the current tsc value
>> > after suspend/resume, you would also need to update system_time.
>> >
>> > I don't see what this has to do with suspend/resume or with whether
>> > the effective scale factor is greater than or less than one.  The only
>> > suspend/resume interaction I can see is that, if the host allows the
>> > guest-observed TSC value to jump (which is arguably a bug, what that's
>> > not important here), it needs to update pvti before resuming the
>> > guest.
>>
>> Which is not an issue, since freezing obviously gets all CPUs out of
>> guest mode.
>>
>> Marcelo, can you provide an example with made-up values for tsc and pvti?
>
> I meant "systemtime" at ^.
>
> guest visible clock = systemtime (updated at time 0, guest initialization) + 
> scaled tsc reads=LARGE VALUE.
>   ^^
> guest reads clock to memory at location A = scaled tsc read.
> -> suspend resume event
> guest visible clock = systemtime (updated at time AFTER SUSPEND) + scaled tsc 
> reads=0.
> guest reads clock to memory at location B.
>
> So before the suspend/resume event, the clock is the RAW TSC values
> (scaled by kvmclock, but the frequency of the RAW TSC).
>
> After suspend/resume event, the clock is updated from the host
> via get_kernel_ns(), which reads the corrected NTP frequency TSC.
>
> So you switch the timebase, from a clock running at a given frequency,
> to a clock running at another frequency (effective frequency).
>
> Example:
>
> RAW TSC NTP corrected TSC
> t0  10  10
> t1  20  19.99
> t2  30  29.98
> t3  40  39.97
> t4  50  49.96
>
> ...
>
> if you suddenly switch from RAW TSC to NTP corrected TSC,
> you can see what will happen.

Sure, but why would you ever switch from one to the other?  I'm still
not seeing the scenario under which this discontinuity is visible to
anything other than the kvmclock code itself.

The only things that need to be monotonic are the output from
vread_pvclock and the in-kernel equivalent, I think.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 3/3] x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest

2015-12-14 Thread Michael S. Tsirkin
On Mon, Dec 14, 2015 at 09:59:13AM -0800, Alexander Duyck wrote:
> On Mon, Dec 14, 2015 at 9:20 AM, Michael S. Tsirkin  wrote:
> > On Mon, Dec 14, 2015 at 08:34:00AM -0800, Alexander Duyck wrote:
> >> > This way distro can use a guest agent to disable
> >> > dirtying until before migration starts.
> >>
> >> Right.  For a v2 version I would definitely want to have some way to
> >> limit the scope of this.  My main reason for putting this out here is
> >> to start altering the course of discussions since it seems like were
> >> weren't getting anywhere with the ixgbevf migration changes that were
> >> being proposed.
> >
> > Absolutely, thanks for working on this.
> >
> >> >> + unsigned long pg_addr, start;
> >> >> +
> >> >> + start = (unsigned long)addr;
> >> >> + pg_addr = PAGE_ALIGN(start + size);
> >> >> + start &= ~(sizeof(atomic_t) - 1);
> >> >> +
> >> >> + /* trigger a write fault on each page, excluding first page */
> >> >> + while ((pg_addr -= PAGE_SIZE) > start)
> >> >> + atomic_add(0, (atomic_t *)pg_addr);
> >> >> +
> >> >> + /* trigger a write fault on first word of DMA */
> >> >> + atomic_add(0, (atomic_t *)start);

Actually, I have second thoughts about using atomic_add here,
especially for _sync.

Many architectures do

#define ATOMIC_OP_RETURN(op, c_op)  \
static inline int atomic_##op##_return(int i, atomic_t *v)  \
{   \
unsigned long flags;\
int ret;\
\
raw_local_irq_save(flags);  \
ret = (v->counter = v->counter c_op i); \
raw_local_irq_restore(flags);   \
\
return ret; \
}

and this is not safe if device is still doing DMA to/from
this memory.

Generally, atomic_t is there for SMP effects, not for sync
with devices.

This is why I said you should do
cmpxchg(pg_addr, 0xdead, 0xdead); 

Yes, we probably never actually want to run m68k within a VM,
but let's not misuse interfaces like this.


> >> >
> >> > start might not be aligned correctly for a cast to atomic_t.
> >> > It's harmless to do this for any memory, so I think you should
> >> > just do this for 1st byte of all pages including the first one.
> >>
> >> You may not have noticed it but I actually aligned start in the line
> >> after pg_addr.
> >
> > Yes you did. alignof would make it a bit more noticeable.
> >
> >>  However instead of aligning to the start of the next
> >> atomic_t I just masked off the lower bits so that we start at the
> >> DWORD that contains the first byte of the starting address.  The
> >> assumption here is that I cannot trigger any sort of fault since if I
> >> have access to a given byte within a DWORD I will have access to the
> >> entire DWORD.
> >
> > I'm curious where does this come from.  Isn't it true that access is
> > controlled at page granularity normally, so you can touch beginning of
> > page just as well?
> 
> Yeah, I am pretty sure it probably is page granularity.  However my
> thought was to try and stick to the start of the DMA as the last
> access.  That way we don't pull in any more cache lines than we need
> to in order to dirty the pages.  Usually the start of the DMA region
> will contain some sort of headers or something that needs to be
> accessed with the highest priority so I wanted to make certain that we
> were forcing usable data into the L1 cache rather than just the first
> cache line of the page where the DMA started.  If however the start of
> a DMA was the start of the page there is nothing there to prevent
> that.

OK, maybe this helps. You should document all these tricks
in code comments.

> >>  I coded this up so that the spots where we touch the
> >> memory should match up with addresses provided by the hardware to
> >> perform the DMA over the PCI bus.
> >
> > Yes but there's no requirement to do it like this from
> > virt POV. You just need to touch each page.
> 
> I know, but at the same time if we match up with the DMA then it is
> more likely that we avoid grabbing unneeded cache lines.  In the case
> of most drivers the data for headers and start is at the start of the
> DMA.  So if we dirty the cache line associated with the start of the
> DMA it will be pulled into the L1 cache and there is a greater chance
> that it may already be prefetched as well.
> 
> >> Also I intentionally ran from highest address to lowest since that way
> >> we don't risk pushing the first cache line of the DMA buffer out of
> >> the L1 cache due to the PAGE_SIZE stride.
> >

[PATCH kvm-unit-tests 0/6] Improve the output of test runners

2015-12-14 Thread Radim Krčmář
This series is a mix of patches that change the output of run_tests.sh
and x86-run.  The output of ./run_tests.sh now looks like this:

> PASS apic (14 tests, 0 unexpected failures)
> PASS ioapic (19 tests, 0 unexpected failures)
> PASS smptest (1 tests, 0 unexpected failures)
> PASS smptest3 (1 tests, 0 unexpected failures)
> PASS vmexit_cpuid 
> PASS vmexit_vmcall 
> PASS vmexit_mov_from_cr8 
> PASS vmexit_mov_to_cr8 
> PASS vmexit_inl_pmtimer 
> PASS vmexit_ipi 
> PASS vmexit_ipi_halt 
> PASS vmexit_ple_round_robin 
> PASS access 
> skip smap (0 tests, 0 unexpected failures)
> skip pku (0 tests, 0 unexpected failures)
> PASS emulator (132 tests, 0 unexpected failures, 1 skipped)
> PASS eventinj (13 tests, 0 unexpected failures)
> PASS hypercall (2 tests, 0 unexpected failures)
> PASS idt_test (4 tests, 0 unexpected failures)
> PASS msr (13 tests, 0 unexpected failures)
> PASS pmu (67 tests, 0 unexpected failures, 1 expected failures)
> PASS port80 
> PASS realmode 
> PASS s3 
> PASS sieve 
> PASS tsc (3 tests, 0 unexpected failures)
> PASS tsc_adjust (5 tests, 0 unexpected failures)
> PASS xsave (17 tests, 0 unexpected failures)
> PASS rmap_chain 
> skip svm (0 tests, 0 unexpected failures)
> skip svm-disabled (0 tests, 0 unexpected failures)
> skip taskswitch (i386 only)
> skip taskswitch2 (i386 only)
> PASS kvmclock_test 
> PASS pcid (3 tests, 0 unexpected failures)
> skip vmx (0 tests, 0 unexpected failures)
> PASS debug (7 tests, 0 unexpected failures)
> qemu-kvm: Property '.hv-synic' not found
> skip hyperv_synic (failed $(echo quit | $qemu -enable-kvm -cpu kvm64,hv_synic 
> -device hyperv-testdev -monitor stdio > /dev/null))

I don't like it too much, but it still seems like an improvement.

The main difference is in extra information for tests using
report_summary(), that smap and pku tests don't fail, pmu test isn't
completely skipped, and svm, vmx, and hyperv_synic are skipped.

The old one looked like this:

> PASS apic
> PASS ioapic
> PASS smptest
> PASS smptest3
> PASS vmexit_cpuid
> PASS vmexit_vmcall
> PASS vmexit_mov_from_cr8
> PASS vmexit_mov_to_cr8
> PASS vmexit_inl_pmtimer
> PASS vmexit_ipi
> PASS vmexit_ipi_halt
> PASS vmexit_ple_round_robin
> PASS access
> FAIL smap
> FAIL pku
> PASS emulator
> PASS eventinj
> PASS hypercall
> PASS idt_test
> PASS msr
> skip pmu (/proc/sys/kernel/nmi_watchdog not equal to 0)
> PASS port80
> PASS realmode
> PASS s3
> PASS sieve
> PASS tsc
> PASS tsc_adjust
> PASS xsave
> PASS rmap_chain
> PASS svm
> PASS svm-disabled
> skip taskswitch (i386 only)
> skip taskswitch2 (i386 only)
> PASS kvmclock_test
> PASS pcid
> PASS vmx
> PASS debug
> qemu-kvm: Property '.hv-synic' not found
> PASS hyperv_synic


Radim Krčmář (6):
  lib/report: allow test skipping
  x86/*: report that the test was skipped
  x86/pmu: expect failure with nmi_watchdog
  run_tests: generalize check
  x86/hyperv_synic: check for support before testing
  run_tests: print summary

 lib/libcflat.h |  1 +
 lib/report.c   | 43 +--
 run_tests.sh   | 35 ++-
 x86/apic.c |  7 +++
 x86/emulator.c |  2 +-
 x86/hyperv_synic.c |  2 +-
 x86/pku.c  |  2 +-
 x86/pmu.c  | 11 +--
 x86/smap.c |  2 +-
 x86/svm.c  |  2 +-
 x86/tsc.c  |  2 +-
 x86/unittests.cfg  |  4 ++--
 12 files changed, 68 insertions(+), 45 deletions(-)

-- 
2.6.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH kvm-unit-tests 1/6] lib/report: allow test skipping

2015-12-14 Thread Radim Krčmář
This patch allows us to explicitly mark a unit-test as skipped.
If all unit-tests were skipped, the whole test is reported as skipped as
well.  This also includes the case where no report()s were done, but
the test still ended with report_summary().

When the whole test is skipped, ./run_tests.sh prints "skip" instead of
green "PASS".

Return value of 77 is used to please Autotools.  I also renamed few
things in reporting code and chose to refactor a logic while at it.

Signed-off-by: Radim Krčmář 
---
 lib/libcflat.h |  1 +
 lib/report.c   | 43 +--
 run_tests.sh   | 13 -
 3 files changed, 38 insertions(+), 19 deletions(-)

diff --git a/lib/libcflat.h b/lib/libcflat.h
index 9747ccdbc9f1..070818354ee1 100644
--- a/lib/libcflat.h
+++ b/lib/libcflat.h
@@ -61,6 +61,7 @@ void report_prefix_push(const char *prefix);
 void report_prefix_pop(void);
 void report(const char *msg_fmt, bool pass, ...);
 void report_xfail(const char *msg_fmt, bool xfail, bool pass, ...);
+void report_skip(const char *msg_fmt, ...);
 int report_summary(void);
 
 #define ARRAY_SIZE(_a) (sizeof(_a)/sizeof((_a)[0]))
diff --git a/lib/report.c b/lib/report.c
index 35e664108a92..e07baa347298 100644
--- a/lib/report.c
+++ b/lib/report.c
@@ -13,7 +13,7 @@
 #include "libcflat.h"
 #include "asm/spinlock.h"
 
-static unsigned int tests, failures, xfailures;
+static unsigned int tests, failures, xfailures, skipped;
 static char prefixes[256];
 static struct spinlock lock;
 
@@ -43,25 +43,27 @@ void report_prefix_pop(void)
spin_unlock();
 }
 
-void va_report_xfail(const char *msg_fmt, bool xfail, bool cond, va_list va)
+void va_report(const char *msg_fmt, bool pass, bool xfail, bool skip, va_list 
va)
 {
-   char *pass = xfail ? "XPASS" : "PASS";
-   char *fail = xfail ? "XFAIL" : "FAIL";
char buf[2000];
+   char *prefix = skip ? "SKIP"
+   : xfail ? (pass ? "XPASS" : "XFAIL")
+   : (pass ? "PASS"  : "FAIL");
 
spin_lock();
 
tests++;
-   printf("%s: ", cond ? pass : fail);
+   printf("%s: ", prefix);
puts(prefixes);
vsnprintf(buf, sizeof(buf), msg_fmt, va);
puts(buf);
puts("\n");
-   if (xfail && cond)
-   failures++;
-   else if (xfail)
+
+   if (skip)
+   skipped++;
+   else if (xfail && !pass)
xfailures++;
-   else if (!cond)
+   else if (xfail || !pass)
failures++;
 
spin_unlock();
@@ -71,7 +73,7 @@ void report(const char *msg_fmt, bool pass, ...)
 {
va_list va;
va_start(va, pass);
-   va_report_xfail(msg_fmt, false, pass, va);
+   va_report(msg_fmt, pass, false, false, va);
va_end(va);
 }
 
@@ -79,7 +81,15 @@ void report_xfail(const char *msg_fmt, bool xfail, bool 
pass, ...)
 {
va_list va;
va_start(va, pass);
-   va_report_xfail(msg_fmt, xfail, pass, va);
+   va_report(msg_fmt, pass, xfail, false, va);
+   va_end(va);
+}
+
+void report_skip(const char *msg_fmt, ...)
+{
+   va_list va;
+   va_start(va, msg_fmt);
+   va_report(msg_fmt, false, false, true, va);
va_end(va);
 }
 
@@ -89,9 +99,14 @@ int report_summary(void)
 
printf("\nSUMMARY: %d tests, %d unexpected failures", tests, failures);
if (xfailures)
-   printf(", %d expected failures\n", xfailures);
-   else
-   printf("\n");
+   printf(", %d expected failures", xfailures);
+   if (skipped)
+   printf(", %d skipped", skipped);
+   printf("\n");
+
+   if (tests == skipped)
+   return 77; /* blame AUTOTOOLS */
+
return failures > 0 ? 1 : 0;
 
spin_unlock();
diff --git a/run_tests.sh b/run_tests.sh
index fad22a935b00..4d813b9a7084 100755
--- a/run_tests.sh
+++ b/run_tests.sh
@@ -55,12 +55,15 @@ function run()
 # extra_params in the config file may contain backticks that need to be
 # expanded, so use eval to start qemu
 eval $cmdline >> test.log
+# The first bit of return value is too hard to use, just skip it.
+# Unit-tests' return value is shifted by one.
+case $(($? >> 1)) in
+0)  echo -ne "\e[32mPASS\e[0m" ;;
+77) echo -ne "skip" ;;
+*)  echo -ne "\e[31mFAIL\e[0m"
+esac
 
-if [ $? -le 1 ]; then
-echo -e "\e[32mPASS\e[0m $1"
-else
-echo -e "\e[31mFAIL\e[0m $1"
-fi
+echo " $1"
 }
 
 function usage()
-- 
2.6.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH kvm-unit-tests 2/6] x86/*: report skipped tests

2015-12-14 Thread Radim Krčmář
No care to consistency or exhaustivity was given.

(svm-disabled test should be redone and it's weird that x86/hyperv_synic
 is about the only one that does report_skip when unsupported.)

Signed-off-by: Radim Krčmář 
---
 x86/apic.c | 7 +++
 x86/emulator.c | 2 +-
 x86/hyperv_synic.c | 2 +-
 x86/pku.c  | 2 +-
 x86/pmu.c  | 2 +-
 x86/smap.c | 2 +-
 x86/svm.c  | 2 +-
 x86/tsc.c  | 2 +-
 8 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/x86/apic.c b/x86/apic.c
index d4eec529e535..57af86de8f8c 100644
--- a/x86/apic.c
+++ b/x86/apic.c
@@ -27,7 +27,7 @@ static void tsc_deadline_timer_isr(isr_regs_t *regs)
 ++tdt_count;
 }
 
-static void start_tsc_deadline_timer(void)
+static void __test_tsc_deadline_timer(void)
 {
 handle_irq(TSC_DEADLINE_TIMER_VECTOR, tsc_deadline_timer_isr);
 irq_enable();
@@ -45,7 +45,6 @@ static int enable_tsc_deadline_timer(void)
 if (cpuid(1).c & (1 << 24)) {
 lvtt = TSC_DEADLINE_TIMER_MODE | TSC_DEADLINE_TIMER_VECTOR;
 apic_write(APIC_LVTT, lvtt);
-start_tsc_deadline_timer();
 return 1;
 } else {
 return 0;
@@ -55,9 +54,9 @@ static int enable_tsc_deadline_timer(void)
 static void test_tsc_deadline_timer(void)
 {
 if(enable_tsc_deadline_timer()) {
-printf("tsc deadline timer enabled\n");
+__test_tsc_deadline_timer();
 } else {
-printf("tsc deadline timer not detected\n");
+report_skip("tsc deadline timer not detected\n");
 }
 }
 
diff --git a/x86/emulator.c b/x86/emulator.c
index e5c1c6b9a2f3..b64a5fe0f3dc 100644
--- a/x86/emulator.c
+++ b/x86/emulator.c
@@ -1062,7 +1062,7 @@ static void illegal_movbe_handler(struct ex_regs *regs)
 static void test_illegal_movbe(void)
 {
if (!(cpuid(1).c & (1 << 22))) {
-   printf("SKIP: illegal movbe\n");
+   report_skip("illegal movbe");
return;
}
 
diff --git a/x86/hyperv_synic.c b/x86/hyperv_synic.c
index 18d1295bfb37..602b79392bfd 100644
--- a/x86/hyperv_synic.c
+++ b/x86/hyperv_synic.c
@@ -228,7 +228,7 @@ int main(int ac, char **av)
 
 report("Hyper-V SynIC test", ok);
 } else {
-report("Hyper-V SynIC is not supported", true);
+report_skip("Hyper-V SynIC is not supported");
 }
 
 return report_summary();
diff --git a/x86/pku.c b/x86/pku.c
index 0e00b9984d70..58971d21ed05 100644
--- a/x86/pku.c
+++ b/x86/pku.c
@@ -91,7 +91,7 @@ int main(int ac, char **av)
 
 if (!(cpuid_indexed(7, 0).c & (1 << X86_FEATURE_PKU))) {
 printf("PKU not enabled, exiting\n");
-exit(1);
+return report_summary();
 }
 
 setup_vm();
diff --git a/x86/pmu.c b/x86/pmu.c
index 03f80190bb25..c68980044dee 100644
--- a/x86/pmu.c
+++ b/x86/pmu.c
@@ -387,7 +387,7 @@ int main(int ac, char **av)
 
if (!eax.split.version_id) {
printf("No pmu is detected!\n");
-   return 1;
+   return report_summary();
}
printf("PMU version: %d\n", eax.split.version_id);
printf("GP counters: %d\n", eax.split.num_counters);
diff --git a/x86/smap.c b/x86/smap.c
index d8a7ae82dc00..0aa44054bd48 100644
--- a/x86/smap.c
+++ b/x86/smap.c
@@ -93,7 +93,7 @@ int main(int ac, char **av)
 
if (!(cpuid_indexed(7, 0).b & (1 << X86_FEATURE_SMAP))) {
printf("SMAP not enabled, exiting\n");
-   exit(1);
+   return report_summary();
}
 
setup_vm();
diff --git a/x86/svm.c b/x86/svm.c
index 1046ddf73732..ff1a0f34b4bf 100644
--- a/x86/svm.c
+++ b/x86/svm.c
@@ -1064,7 +1064,7 @@ int main(int ac, char **av)
 
 if (!(cpuid(0x8001).c & 4)) {
 printf("SVM not availble\n");
-return 0;
+return report_summary();
 }
 
 setup_svm();
diff --git a/x86/tsc.c b/x86/tsc.c
index c71dc2a7abe0..ee247459fb42 100644
--- a/x86/tsc.c
+++ b/x86/tsc.c
@@ -43,5 +43,5 @@ int main()
test_rdtscp(0x100);
} else
printf("rdtscp not supported\n");
-   return 0;
+   return report_summary();
 }
-- 
2.6.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH kvm-unit-tests 5/6] x86/hyperv_synic: check for support before testing

2015-12-14 Thread Radim Krčmář
It's not easy to distinguish successful unit-test from failed QEMU, so
we check for presence of the needed feature before hand.

x86/hyperv_synic is "skip" instead of "PASS", without KVM/QEMU support.

Signed-off-by: Radim Krčmář 
---
 Probably not the nicest solution around ...
 "dry_run_qemu_and_skip_if_it_fails=true" parameter in unittest.cfg
 might be better.

 x86/unittests.cfg | 1 +
 1 file changed, 1 insertion(+)

diff --git a/x86/unittests.cfg b/x86/unittests.cfg
index 6b94ad93dcf0..32fc5f7037e1 100644
--- a/x86/unittests.cfg
+++ b/x86/unittests.cfg
@@ -182,3 +182,4 @@ arch = x86_64
 file = hyperv_synic.flat
 smp = 2
 extra_params = -cpu kvm64,hv_synic -device hyperv-testdev
+check = echo quit | $qemu -enable-kvm -cpu kvm64,hv_synic -device 
hyperv-testdev -monitor stdio > /dev/null
-- 
2.6.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH kvm-unit-tests 4/6] run_tests: generalize check

2015-12-14 Thread Radim Krčmář
config attribute "check" is currently unused.
Provide a simple implementation instead of removing it.

Signed-off-by: Radim Krčmář 
---
 run_tests.sh | 15 ---
 1 file changed, 4 insertions(+), 11 deletions(-)

diff --git a/run_tests.sh b/run_tests.sh
index 4d813b9a7084..b0b064f2e341 100755
--- a/run_tests.sh
+++ b/run_tests.sh
@@ -35,17 +35,10 @@ function run()
 return
 fi
 
-# check a file for a particular value before running a test
-# the check line can contain multiple files to check separated by a space
-# but each check parameter needs to be of the form =
-for check_param in ${check[@]}; do
-path=${check_param%%=*}
-value=${check_param#*=}
-if [ "$path" ] && [ "$(cat $path)" != "$value" ]; then
-echo "skip $1 ($path not equal to $value)"
-return
-fi
-done
+eval $check || {
+echo "skip $1 (failed \$($check))"
+return
+}
 
 cmdline="TESTNAME=$testname ACCEL=$accel ./$TEST_DIR-run $kernel -smp $smp 
$opts"
 if [ $verbose != 0 ]; then
-- 
2.6.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH kvm-unit-tests 3/6] x86/pmu: expect failure with nmi_watchdog

2015-12-14 Thread Radim Krčmář
Host's nmi_watchdog takes one slot, making the "all counters" unit-test
fail.  We know exactly what happens, mark it as expected failure.

PMU test is now executed regardless of host_nmi_watchdog.

Signed-off-by: Radim Krčmář 
---
 x86/pmu.c | 9 -
 x86/unittests.cfg | 3 +--
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/x86/pmu.c b/x86/pmu.c
index c68980044dee..4ca93235b977 100644
--- a/x86/pmu.c
+++ b/x86/pmu.c
@@ -92,6 +92,7 @@ struct pmu_event {
 };
 
 static int num_counters;
+bool host_nmi_watchdog;
 
 char *buf;
 
@@ -291,7 +292,7 @@ static void check_counters_many(void)
if (!verify_counter([i]))
break;
 
-   report("all counters", i == n);
+   report_xfail("all counters", host_nmi_watchdog, i == n);
 }
 
 static void check_counter_overflow(void)
@@ -374,6 +375,7 @@ static void check_rdpmc(void)
 
 int main(int ac, char **av)
 {
+   int i;
struct cpuid id = cpuid(10);
 
setup_vm();
@@ -385,6 +387,11 @@ int main(int ac, char **av)
ebx.full = id.b;
edx.full = id.d;
 
+   /* XXX: horrible command line parsing */
+   for (i = 1; i < ac; i++)
+   if (!strcmp(av[i], "host_nmi_watchdog=1"))
+   host_nmi_watchdog = true;
+
if (!eax.split.version_id) {
printf("No pmu is detected!\n");
return report_summary();
diff --git a/x86/unittests.cfg b/x86/unittests.cfg
index c15c86df..6b94ad93dcf0 100644
--- a/x86/unittests.cfg
+++ b/x86/unittests.cfg
@@ -106,8 +106,7 @@ file = msr.flat
 
 [pmu]
 file = pmu.flat
-extra_params = -cpu host
-check = /proc/sys/kernel/nmi_watchdog=0
+extra_params = -cpu host -append "host_nmi_watchdog=`cat 
/proc/sys/kernel/nmi_watchdog`"
 
 [port80]
 file = port80.flat
-- 
2.6.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH kvm-unit-tests 6/6] run_tests: print summary

2015-12-14 Thread Radim Krčmář
Add shortened SUMMARY line into the output (in parentheses).

Might be interesting and hopefully won't break too many scripts.

Signed-off-by: Radim Krčmář 
---
 run_tests.sh | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/run_tests.sh b/run_tests.sh
index b0b064f2e341..d0b282f7b079 100755
--- a/run_tests.sh
+++ b/run_tests.sh
@@ -21,6 +21,7 @@ function run()
 local arch="$6"
 local check="$7"
 local accel="$8"
+local output
 
 if [ -z "$testname" ]; then
 return
@@ -47,7 +48,7 @@ function run()
 
 # extra_params in the config file may contain backticks that need to be
 # expanded, so use eval to start qemu
-eval $cmdline >> test.log
+output=`eval $cmdline`
 # The first bit of return value is too hard to use, just skip it.
 # Unit-tests' return value is shifted by one.
 case $(($? >> 1)) in
@@ -56,7 +57,11 @@ function run()
 *)  echo -ne "\e[31mFAIL\e[0m"
 esac
 
-echo " $1"
+echo -n " $1 "
+
+printf "%s\n" "$output" |
+  tee -a test.log |
+  sed 'x;$s/^SUMMARY: //;ta;$s/.*//p;d;:a s/^/(/;s/$/)/'
 }
 
 function usage()
-- 
2.6.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [Patch V2 1/2] x86, mce: Basic support to add LMCE support to QEMU

2015-12-14 Thread Raj, Ashok
On Mon, Dec 14, 2015 at 11:37:16PM +0100, Borislav Petkov wrote:
> On Mon, Dec 14, 2015 at 02:11:46PM -0500, Raj, Ashok wrote:
> > This is mostly harmless.. since the MCG_CAP space is shared and has no
> > conflict between vendors. Also just the CAP being set has no effect.
> 
> Of course it does - we check SER_P in machine_check_poll() and when
> I emulate an AMD guest and inject errors into it, error handling is
> obviously wrong, see:
> 
> https://lkml.kernel.org/r/20151123150355.ge5...@pd.tnic
> 

I can see how this hurts.. since the poller isn't doing cpu model specific 
stuff..?

in the LMCE case, even if you advertise MCG_LMCE_P in MCG_CAP, the guest kernel 
wont call intel_init_lmce() only from mce_intel.c.. so the same problem
won't happen.

but the issue Eduardo mentioned seems like the following.

New QEMU_LMCE + New KVM_LMCE + New_GUEST_LMCE - No problem

but if you were to migrage the Guest_LMCE to a non-LMCE supported KVM host
we could run into an issue.. 

is this the compatibility issue that you were looking to fix Eduardo?

Cheers,
Ashok
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html