Re: x86: Don't report guest userspace emulation error to userspace, why ?

2015-12-10 Thread Bandan Das
Paolo Bonzini  writes:

> On 10/12/2015 18:58, Bandan Das wrote:
>>> > Allowing userspace to stop the guest with an emulation failure is a
>> This one I don't :) Userspace started the guest after all, there are other
>> ways for it to kill the guest if it wanted to.
>
> I mean allowing guest userspace to stop the guest.

Sure! Userspace (Qemu) can just reenter the guest when it sees the failure.
Doing it in the host kvm seems overkill.

> Paolo
>
>>> > possible denial of service, similar to L2 stopping L1 with an emulation
>>> > failure.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: x86: move tracepoints outside extended quiescent state

2015-12-10 Thread Paolo Bonzini


On 10/12/2015 19:09, Borislav Petkov wrote:
> On Thu, Dec 10, 2015 at 06:38:57PM +0100, Paolo Bonzini wrote:
>> Invoking tracepoints within kvm_guest_enter/kvm_guest_exit causes a
>> lockdep splat.
>>
>> Cc: sta...@vger.kernel.org
>> Reported-by: Borislav Petkov 
>> Signed-off-by: Paolo Bonzini 
>> ---
>>  arch/x86/kvm/svm.c | 4 ++--
>>  arch/x86/kvm/vmx.c | 3 ++-
>>  arch/x86/kvm/x86.c | 2 +-
>>  3 files changed, 5 insertions(+), 4 deletions(-)
> 
> Looks like you missed some...

Yeah, wait_lapic_expire also have to be moved before __kvm_guest_enter.

Paolo

> [  144.296364] kvm: zapping shadow pages for mmio generation wraparound
> [  164.699053] kvm: zapping shadow pages for mmio generation wraparound
> [  312.115767] kvm: zapping shadow pages for mmio generation wraparound
> [  432.277585] kvm: zapping shadow pages for mmio generation wraparound
> 
> [  434.547820] ===
> [  434.552020] [ INFO: suspicious RCU usage. ]
> [  434.556223] 4.4.0-rc4+ #1 Not tainted
> [  434.559886] ---
> [  434.564072] arch/x86/kvm/trace.h:971 suspicious rcu_dereference_check() 
> usage!
> [  434.571303] 
>other info that might help us debug this:
> 
> [  434.579324] 
>RCU used illegally from idle CPU!
>rcu_scheduler_active = 1, debug_locks = 0
> [  434.590209] RCU used illegally from extended quiescent state!
> [  434.595971] 1 lock held by qemu-system-x86/2402:
> [  434.600596]  #0:  (>mutex){+.+.+.}, at: [] 
> vcpu_load+0x1c/0x80 [kvm]
> [  434.609146] 
>stack backtrace:
> [  434.613526] CPU: 4 PID: 2402 Comm: qemu-system-x86 Not tainted 4.4.0-rc4+ 
> #1
> [  434.620583] Hardware name: Dell Inc. Precision T3600/0PTTT9, BIOS A13 
> 05/11/2014
> [  434.627987]  0001 88042f79fcf0 813c2cfc 
> 880435aa
> [  434.635443]  88042f79fd20 810c5157 88042fd48000 
> 000295a85563
> [  434.642886]  000295de483f  88042f79fd58 
> a023ec6e
> [  434.650334] Call Trace:
> [  434.652804]  [] dump_stack+0x4e/0x82
> [  434.657950]  [] lockdep_rcu_suspicious+0xe7/0x120
> [  434.664239]  [] wait_lapic_expire+0xfe/0x1e0 [kvm]
> [  434.670606]  [] kvm_arch_vcpu_ioctl_run+0x76e/0x19c0 
> [kvm]
> [  434.677674]  [] ? kvm_arch_vcpu_ioctl_run+0x8b2/0x19c0 
> [kvm]
> [  434.684905]  [] ? mutex_lock_killable_nested+0x29c/0x4c0
> [  434.691792]  [] kvm_vcpu_ioctl+0x342/0x700 [kvm]
> [  434.697984]  [] ? __lock_is_held+0x4d/0x70
> [  434.703655]  [] ? __fget+0xfe/0x200
> [  434.708719]  [] do_vfs_ioctl+0x301/0x550
> [  434.714208]  [] ? __fget_light+0x2a/0x90
> [  434.719700]  [] SyS_ioctl+0x41/0x70
> [  434.724754]  [] entry_SYSCALL_64_fastpath+0x16/0x7a
> [  437.411818] kvm [2400]: vcpu0 unhandled rdmsr: 0x606
> [  437.416898] kvm [2400]: vcpu0 unhandled rdmsr: 0x34
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-10 Thread Alexander Duyck
On Thu, Dec 10, 2015 at 8:11 AM, Michael S. Tsirkin  wrote:
> On Thu, Dec 10, 2015 at 10:38:32PM +0800, Lan, Tianyu wrote:
>>
>>
>> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
>> >>Ideally, it is able to leave guest driver unmodified but it requires the
>> >>>hypervisor or qemu to aware the device which means we may need a driver in
>> >>>hypervisor or qemu to handle the device on behalf of guest driver.
>> >Can you answer the question of when do you use your code -
>> >at the start of migration or
>> >just before the end?
>>
>> Just before stopping VCPU in this version and inject VF mailbox irq to
>> notify the driver if the irq handler is installed.
>> Qemu side also will check this via the faked PCI migration capability
>> and driver will set the status during device open() or resume() callback.
>
> Right, this is the "good path" optimization. Whether this buys anything
> as compared to just sending reset to the device when VCPU is stopped
> needs to be measured. In any case, we probably do need a way to
> interrupt driver on destination to make it reconfigure the device -
> otherwise it might take seconds for it to notice.  And a way to make
> sure driver can handle this surprise reset so we can block migration if
> it can't.

The question is how do we handle the "bad path"?  From what I can tell
it seems like we would have to have the dirty page tracking for DMA
handled in the host in order to support that.  Otherwise we risk
corrupting the memory in the guest as there are going to be a few
stale pages that end up being in the guest.

The easiest way to probably flag a "bad path" migration would be to
emulate a Manually-operated Retention Latch being opened and closed on
the device.  It may even allow us to work with the desire to support a
means for doing a pause/resume as that would be a hot-plug event where
the latch was never actually opened.  Basically if the retention latch
is released and then re-closed it can be assumed that the device has
lost power and as a result been reset.  As such a normal hot-plug
controller would have to reconfigure the device in such an event.  The
key bit being that with the power being cycled on the port the
assumption is that the device has lost any existing state, and we
should emulate that as well by clearing any state Qemu might be
carrying such as the shadow of the MSI-X table.  In addition we could
also signal if the host supports the dirty page tracking via the IOMMU
so if needed the guest could trigger some sort of memory exception
handling due to the risk of memory corruption.

I would argue that we don't necessarily have to provide a means to
guarantee the driver can support a surprise removal/reset.  Worst case
scenario is that it would be equivalent to somebody pulling the plug
on an externally connected PCIe cage in a physical host.  I know the
Intel Ethernet drivers have already had to add support for surprise
removal due to the fact that such a scenario can occur on Thunderbolt
enabled platforms.  Since it is acceptable for physical hosts to have
such an event occur I think we could support the same type of failure
for direct assigned devices in guests.  That would be the one spot
where I would say it is up to the drivers to figure out how they are
going to deal with it since this is something that can occur for any
given driver on any given OS assuming it can be plugged into an
externally removable cage.

- Alex
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: x86: Don't report guest userspace emulation error to userspace, why ?

2015-12-10 Thread Paolo Bonzini
> Paolo Bonzini  writes:
> > On 10/12/2015 18:58, Bandan Das wrote:
> >>> > Allowing userspace to stop the guest with an emulation failure is a
> >> This one I don't :) Userspace started the guest after all, there are other
> >> ways for it to kill the guest if it wanted to.
> >
> > I mean allowing guest userspace to stop the guest.
> 
> Sure! Userspace (Qemu) can just reenter the guest when it sees the failure.
> Doing it in the host kvm seems overkill.

Most userspaces will get it wrong.  Doing it once makes sure that you
do it right.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: x86: Don't report guest userspace emulation error to userspace, why ?

2015-12-10 Thread Paolo Bonzini


On 09/12/2015 23:18, Bandan Das wrote:
> Commit a2b9e6c1a35afcc09:
> 
> KVM: x86: Don't report guest userspace emulation error to userspace
> 
> Commit fc3a9157d314 ("KVM: X86: Don't report L2 emulation failures to
> user-space") disabled the reporting of L2 (nested guest) emulation 
> failures to
> userspace due to race-condition between a vmexit and the instruction 
> emulator.
> The same rational applies also to userspace applications that are 
> permitted by
> the guest OS to access MMIO area or perform PIO.
> 
> This patch extends the current behavior - of injecting a #UD instead of
> reporting it to userspace - also for guest userspace code.
> 
> I searched the archives but failed in finding anything. Can someone please
> explain why this is needed ? Or, why not let userspace decide what to do based
> on the cpl, whether to continue execution or kill the guest ? Is the 
> assumption
> here that this is what userspace always wants ?

Not what userspace always wants, but what the guest kernel always wants.

Allowing userspace to stop the guest with an emulation failure is a
possible denial of service, similar to L2 stopping L1 with an emulation
failure.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/5] x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap

2015-12-10 Thread Paolo Bonzini


On 10/12/2015 00:12, Andy Lutomirski wrote:
> Signed-off-by: Andy Lutomirski 
> ---
>  arch/x86/entry/vdso/vclock_gettime.c  | 20 
>  arch/x86/entry/vdso/vdso-layout.lds.S |  3 ++-
>  arch/x86/entry/vdso/vdso2c.c  |  3 +++
>  arch/x86/entry/vdso/vma.c | 13 +
>  arch/x86/include/asm/pvclock.h|  9 +
>  arch/x86/include/asm/vdso.h   |  1 +
>  arch/x86/kernel/kvmclock.c|  5 +
>  7 files changed, 41 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
> b/arch/x86/entry/vdso/vclock_gettime.c
> index c325ba1bdddf..5dd363d54348 100644
> --- a/arch/x86/entry/vdso/vclock_gettime.c
> +++ b/arch/x86/entry/vdso/vclock_gettime.c
> @@ -36,6 +36,11 @@ static notrace cycle_t vread_hpet(void)
>  }
>  #endif
>  
> +#ifdef CONFIG_PARAVIRT_CLOCK
> +extern u8 pvclock_page
> + __attribute__((visibility("hidden")));
> +#endif
> +
>  #ifndef BUILD_VDSO32
>  
>  #include 
> @@ -62,23 +67,14 @@ notrace static long vdso_fallback_gtod(struct timeval 
> *tv, struct timezone *tz)
>  
>  #ifdef CONFIG_PARAVIRT_CLOCK
>  
> -static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
> +static notrace const struct pvclock_vsyscall_time_info *get_pvti0(void)
>  {
> - const struct pvclock_vsyscall_time_info *pvti_base;
> - int idx = cpu / (PAGE_SIZE/PVTI_SIZE);
> - int offset = cpu % (PAGE_SIZE/PVTI_SIZE);
> -
> - BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END);
> -
> - pvti_base = (struct pvclock_vsyscall_time_info *)
> - __fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx);
> -
> - return _base[offset];
> + return (const struct pvclock_vsyscall_time_info *)_page;
>  }
>  
>  static notrace cycle_t vread_pvclock(int *mode)
>  {
> - const struct pvclock_vcpu_time_info *pvti = _pvti(0)->pvti;
> + const struct pvclock_vcpu_time_info *pvti = _pvti0()->pvti;
>   cycle_t ret;
>   u64 tsc, pvti_tsc;
>   u64 last, delta, pvti_system_time;
> diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S 
> b/arch/x86/entry/vdso/vdso-layout.lds.S
> index de2c921025f5..4158acc17df0 100644
> --- a/arch/x86/entry/vdso/vdso-layout.lds.S
> +++ b/arch/x86/entry/vdso/vdso-layout.lds.S
> @@ -25,7 +25,7 @@ SECTIONS
>* segment.
>*/
>  
> - vvar_start = . - 2 * PAGE_SIZE;
> + vvar_start = . - 3 * PAGE_SIZE;
>   vvar_page = vvar_start;
>  
>   /* Place all vvars at the offsets in asm/vvar.h. */
> @@ -36,6 +36,7 @@ SECTIONS
>  #undef EMIT_VVAR
>  
>   hpet_page = vvar_start + PAGE_SIZE;
> + pvclock_page = vvar_start + 2 * PAGE_SIZE;
>  
>   . = SIZEOF_HEADERS;
>  
> diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c
> index 785d9922b106..491020b2826d 100644
> --- a/arch/x86/entry/vdso/vdso2c.c
> +++ b/arch/x86/entry/vdso/vdso2c.c
> @@ -73,6 +73,7 @@ enum {
>   sym_vvar_start,
>   sym_vvar_page,
>   sym_hpet_page,
> + sym_pvclock_page,
>   sym_VDSO_FAKE_SECTION_TABLE_START,
>   sym_VDSO_FAKE_SECTION_TABLE_END,
>  };
> @@ -80,6 +81,7 @@ enum {
>  const int special_pages[] = {
>   sym_vvar_page,
>   sym_hpet_page,
> + sym_pvclock_page,
>  };
>  
>  struct vdso_sym {
> @@ -91,6 +93,7 @@ struct vdso_sym required_syms[] = {
>   [sym_vvar_start] = {"vvar_start", true},
>   [sym_vvar_page] = {"vvar_page", true},
>   [sym_hpet_page] = {"hpet_page", true},
> + [sym_pvclock_page] = {"pvclock_page", true},
>   [sym_VDSO_FAKE_SECTION_TABLE_START] = {
>   "VDSO_FAKE_SECTION_TABLE_START", false
>   },
> diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
> index 64df47148160..aa828191c654 100644
> --- a/arch/x86/entry/vdso/vma.c
> +++ b/arch/x86/entry/vdso/vma.c
> @@ -100,6 +100,7 @@ static int map_vdso(const struct vdso_image *image, bool 
> calculate_addr)
>   .name = "[vvar]",
>   .pages = no_pages,
>   };
> + struct pvclock_vsyscall_time_info *pvti;
>  
>   if (calculate_addr) {
>   addr = vdso_addr(current->mm->start_stack,
> @@ -169,6 +170,18 @@ static int map_vdso(const struct vdso_image *image, bool 
> calculate_addr)
>   }
>  #endif
>  
> + pvti = pvclock_pvti_cpu0_va();
> + if (pvti && image->sym_pvclock_page) {
> + ret = remap_pfn_range(vma,
> +   text_start + image->sym_pvclock_page,
> +   __pa(pvti) >> PAGE_SHIFT,
> +   PAGE_SIZE,
> +   PAGE_READONLY);
> +
> + if (ret)
> + goto up_fail;
> + }
> +
>  up_fail:
>   if (ret)
>   current->mm->context.vdso = NULL;
> diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
> index 7a6bed5c08bc..3864398c7cb2 100644
> --- a/arch/x86/include/asm/pvclock.h
> +++ b/arch/x86/include/asm/pvclock.h
> @@ 

Re: [PATCH 5/5] x86/vdso: Enable vdso pvclock access on all vdso variants

2015-12-10 Thread Paolo Bonzini


On 10/12/2015 00:12, Andy Lutomirski wrote:
> Now that pvclock doesn't require access to the fixmap, all vdso
> variants can use it.
> 
> The kernel side isn't wired up for 32-bit kernels yet, but this
> covers 32-bit and x32 userspace on 64-bit kernels.
> 
> Signed-off-by: Andy Lutomirski 
> ---
>  arch/x86/entry/vdso/vclock_gettime.c | 91 
> 
>  1 file changed, 40 insertions(+), 51 deletions(-)
> 
> diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
> b/arch/x86/entry/vdso/vclock_gettime.c
> index 59a98c25bde7..8602f06c759f 100644
> --- a/arch/x86/entry/vdso/vclock_gettime.c
> +++ b/arch/x86/entry/vdso/vclock_gettime.c
> @@ -17,8 +17,10 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
> +#include 
>  
>  #define gtod ((vsyscall_gtod_data))
>  
> @@ -43,10 +45,6 @@ extern u8 pvclock_page
>  
>  #ifndef BUILD_VDSO32
>  
> -#include 
> -#include 
> -#include 
> -
>  notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
>  {
>   long ret;
> @@ -64,8 +62,42 @@ notrace static long vdso_fallback_gtod(struct timeval *tv, 
> struct timezone *tz)
>   return ret;
>  }
>  
> -#ifdef CONFIG_PARAVIRT_CLOCK
>  
> +#else
> +
> +notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
> +{
> + long ret;
> +
> + asm(
> + "mov %%ebx, %%edx \n"
> + "mov %2, %%ebx \n"
> + "call __kernel_vsyscall \n"
> + "mov %%edx, %%ebx \n"
> + : "=a" (ret)
> + : "0" (__NR_clock_gettime), "g" (clock), "c" (ts)
> + : "memory", "edx");
> + return ret;
> +}
> +
> +notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone 
> *tz)
> +{
> + long ret;
> +
> + asm(
> + "mov %%ebx, %%edx \n"
> + "mov %2, %%ebx \n"
> + "call __kernel_vsyscall \n"
> + "mov %%edx, %%ebx \n"
> + : "=a" (ret)
> + : "0" (__NR_gettimeofday), "g" (tv), "c" (tz)
> + : "memory", "edx");
> + return ret;
> +}
> +
> +#endif
> +
> +#ifdef CONFIG_PARAVIRT_CLOCK
>  static notrace const struct pvclock_vsyscall_time_info *get_pvti0(void)
>  {
>   return (const struct pvclock_vsyscall_time_info *)_page;
> @@ -109,9 +141,9 @@ static notrace cycle_t vread_pvclock(int *mode)
>   do {
>   version = pvti->version;
>  
> - /* This is also a read barrier, so we'll read version first. */
> - tsc = rdtsc_ordered();
> + smp_rmb();
>  
> + tsc = rdtsc_ordered();
>   pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
>   pvti_tsc_shift = pvti->tsc_shift;
>   pvti_system_time = pvti->system_time;
> @@ -126,7 +158,7 @@ static notrace cycle_t vread_pvclock(int *mode)
>   pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
>   pvti_tsc_shift);
>  
> - /* refer to tsc.c read_tsc() comment for rationale */
> + /* refer to vread_tsc() comment for rationale */
>   last = gtod->cycle_last;
>  
>   if (likely(ret >= last))
> @@ -136,49 +168,6 @@ static notrace cycle_t vread_pvclock(int *mode)
>  }
>  #endif
>  
> -#else
> -
> -notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
> -{
> - long ret;
> -
> - asm(
> - "mov %%ebx, %%edx \n"
> - "mov %2, %%ebx \n"
> - "call __kernel_vsyscall \n"
> - "mov %%edx, %%ebx \n"
> - : "=a" (ret)
> - : "0" (__NR_clock_gettime), "g" (clock), "c" (ts)
> - : "memory", "edx");
> - return ret;
> -}
> -
> -notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone 
> *tz)
> -{
> - long ret;
> -
> - asm(
> - "mov %%ebx, %%edx \n"
> - "mov %2, %%ebx \n"
> - "call __kernel_vsyscall \n"
> - "mov %%edx, %%ebx \n"
> - : "=a" (ret)
> - : "0" (__NR_gettimeofday), "g" (tv), "c" (tz)
> - : "memory", "edx");
> - return ret;
> -}
> -
> -#ifdef CONFIG_PARAVIRT_CLOCK
> -
> -static notrace cycle_t vread_pvclock(int *mode)
> -{
> - *mode = VCLOCK_NONE;
> - return 0;
> -}
> -#endif
> -
> -#endif
> -
>  notrace static cycle_t vread_tsc(void)
>  {
>   cycle_t ret = (cycle_t)rdtsc_ordered();
> 

Acked-by: Paolo Bonzini 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: live migration vs device assignment (motivation)

2015-12-10 Thread Michael S. Tsirkin
On Thu, Dec 10, 2015 at 11:04:54AM +0800, Lan, Tianyu wrote:
> 
> On 12/10/2015 4:07 AM, Michael S. Tsirkin wrote:
> >On Thu, Dec 10, 2015 at 12:26:25AM +0800, Lan, Tianyu wrote:
> >>On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:
> >>>I thought about what this is doing at the high level, and I do have some
> >>>value in what you are trying to do, but I also think we need to clarify
> >>>the motivation a bit more.  What you are saying is not really what the
> >>>patches are doing.
> >>>
> >>>And with that clearer understanding of the motivation in mind (assuming
> >>>it actually captures a real need), I would also like to suggest some
> >>>changes.
> >>
> >>Motivation:
> >>Most current solutions for migration with passthough device are based on
> >>the PCI hotplug but it has side affect and can't work for all device.
> >>
> >>For NIC device:
> >>PCI hotplug solution can work around Network device migration
> >>via switching VF and PF.
> >
> >This is just more confusion. hotplug is just a way to add and remove
> >devices. switching VF and PF is up to guest and hypervisor.
> 
> This is a combination. Because it's not able to migrate device state in
> the current world during migration(What we are doing), Exist solutions
> of migrating VM with passthough NIC relies on the PCI hotplug.

That's where you go wrong I think. This marketing speak about solution
of migrating VM with passthrough is just confusing people.

There's no way to do migration with device passthrough on KVM at the
moment, in particular because of lack of way for host to save and
restore device state, and you do not propose a way either.

So how do people migrate? Stop doing device passthrough.
So what I think your patches do is add ability to do the two things
in parallel: stop doing passthrough and start migration.
You still can not migrate with passthrough.

> Unplug VF
> before starting migration and then switch network from VF NIC to PV NIC
> in order to maintain the network connection.

Again, this is mixing unrelated things.  This switching is not really
related to migration. You can do this at any time for any number of
reasons.  If migration takes a lot of time and if you unplug before
migration, then switching to another interface might make sense.
But it's question of policy.

> Plug VF again after
> migration and then switch from PV back to VF. Bond driver provides a way to
> switch between PV and VF NIC automatically with save IP and MAC and so bond
> driver is more preferred.

Preferred over switching manually? As long as it works well, sure.  But
one can come up with other techniques.  For example, don't switch. Save
ip, mac etc, remove source device and add the destination one.  You were
also complaining that the switch took too long.

> >
> >>But switching network interface will introduce service down time.
> >>
> >>I tested the service down time via putting VF and PV interface
> >>into a bonded interface and ping the bonded interface during plug
> >>and unplug VF.
> >>1) About 100ms when add VF
> >>2) About 30ms when del VF
> >
> >OK and what's the source of the downtime?
> >I'm guessing that's just arp being repopulated.  So simply save and
> >re-populate it.
> >
> >There would be a much cleaner solution.
> >
> >Or maybe there's a timer there that just delays hotplug
> >for no reason. Fix it, everyone will benefit.
> >
> >>It also requires guest to do switch configuration.
> >
> >That's just wrong. if you want a switch, you need to
> >configure a switch.
> 
> I meant the config of switching operation between PV and VF.

I see. So sure, there are many ways to configure networking
on linux. You seem to see this as a downside and so want
to hardcode a single configuration into the driver.

> >
> >>These are hard to
> >>manage and deploy from our customers.
> >
> >So kernel want to remain flexible, and the stack is
> >configurable. Downside: customers need to deploy userspace
> >to configure it. Your solution: a hard-coded configuration
> >within kernel and hypervisor.  Sorry, this makes no sense.
> >If kernel is easier for you to deploy than userspace,
> >you need to rethink your deployment strategy.
> 
> This is one factor.
> 
> >
> >>To maintain PV performance during
> >>migration, host side also needs to assign a VF to PV device. This
> >>affects scalability.
> >
> >No idea what this means.
> >
> >>These factors block SRIOV NIC passthough usage in the cloud service and
> >>OPNFV which require network high performance and stability a lot.
> >
> >Everyone needs performance and scalability.
> >
> >>
> >>For other kind of devices, it's hard to work.
> >>We are also adding migration support for QAT(QuickAssist Technology) device.
> >>
> >>QAT device user case introduction.
> >>Server, networking, big data, and storage applications use QuickAssist
> >>Technology to offload servers from handling compute-intensive operations,
> >>such as:
> >>1) Symmetric cryptography functions including cipher operations and
> >>authentication 

Re: [PATCH 4/5] x86/vdso: Remove pvclock fixmap machinery

2015-12-10 Thread Paolo Bonzini


On 10/12/2015 00:12, Andy Lutomirski wrote:
> Signed-off-by: Andy Lutomirski 
> ---
>  arch/x86/entry/vdso/vclock_gettime.c |  1 -
>  arch/x86/entry/vdso/vma.c|  1 +
>  arch/x86/include/asm/fixmap.h|  5 -
>  arch/x86/include/asm/pvclock.h   |  5 -
>  arch/x86/kernel/kvmclock.c   |  6 --
>  arch/x86/kernel/pvclock.c| 24 
>  6 files changed, 1 insertion(+), 41 deletions(-)
> 
> diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
> b/arch/x86/entry/vdso/vclock_gettime.c
> index 5dd363d54348..59a98c25bde7 100644
> --- a/arch/x86/entry/vdso/vclock_gettime.c
> +++ b/arch/x86/entry/vdso/vclock_gettime.c
> @@ -45,7 +45,6 @@ extern u8 pvclock_page
>  
>  #include 
>  #include 
> -#include 
>  #include 
>  
>  notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
> diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
> index aa828191c654..b8f69e264ac4 100644
> --- a/arch/x86/entry/vdso/vma.c
> +++ b/arch/x86/entry/vdso/vma.c
> @@ -12,6 +12,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
> index f80d70009ff8..6d7d0e52ed5a 100644
> --- a/arch/x86/include/asm/fixmap.h
> +++ b/arch/x86/include/asm/fixmap.h
> @@ -19,7 +19,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #ifdef CONFIG_X86_32
>  #include 
>  #include 
> @@ -72,10 +71,6 @@ enum fixed_addresses {
>  #ifdef CONFIG_X86_VSYSCALL_EMULATION
>   VSYSCALL_PAGE = (FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT,
>  #endif
> -#ifdef CONFIG_PARAVIRT_CLOCK
> - PVCLOCK_FIXMAP_BEGIN,
> - PVCLOCK_FIXMAP_END = PVCLOCK_FIXMAP_BEGIN+PVCLOCK_VSYSCALL_NR_PAGES-1,
> -#endif
>  #endif
>   FIX_DBGP_BASE,
>   FIX_EARLYCON_MEM_BASE,
> diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
> index 3864398c7cb2..66df22b2e0c9 100644
> --- a/arch/x86/include/asm/pvclock.h
> +++ b/arch/x86/include/asm/pvclock.h
> @@ -100,10 +100,5 @@ struct pvclock_vsyscall_time_info {
>  } __attribute__((__aligned__(SMP_CACHE_BYTES)));
>  
>  #define PVTI_SIZE sizeof(struct pvclock_vsyscall_time_info)
> -#define PVCLOCK_VSYSCALL_NR_PAGES (((NR_CPUS-1)/(PAGE_SIZE/PVTI_SIZE))+1)
> -
> -int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i,
> -  int size);
> -struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu);
>  
>  #endif /* _ASM_X86_PVCLOCK_H */
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index ec1b06dc82d2..72cef58693c7 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -310,7 +310,6 @@ int __init kvm_setup_vsyscall_timeinfo(void)
>  {
>  #ifdef CONFIG_X86_64
>   int cpu;
> - int ret;
>   u8 flags;
>   struct pvclock_vcpu_time_info *vcpu_time;
>   unsigned int size;
> @@ -330,11 +329,6 @@ int __init kvm_setup_vsyscall_timeinfo(void)
>   return 1;
>   }
>  
> - if ((ret = pvclock_init_vsyscall(hv_clock, size))) {
> - put_cpu();
> - return ret;
> - }
> -
>   put_cpu();
>  
>   kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
> diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
> index 2f355d229a58..99bfc025111d 100644
> --- a/arch/x86/kernel/pvclock.c
> +++ b/arch/x86/kernel/pvclock.c
> @@ -140,27 +140,3 @@ void pvclock_read_wallclock(struct pvclock_wall_clock 
> *wall_clock,
>  
>   set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
>  }
> -
> -#ifdef CONFIG_X86_64
> -/*
> - * Initialize the generic pvclock vsyscall state.  This will allocate
> - * a/some page(s) for the per-vcpu pvclock information, set up a
> - * fixmap mapping for the page(s)
> - */
> -
> -int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i,
> -  int size)
> -{
> - int idx;
> -
> - WARN_ON (size != PVCLOCK_VSYSCALL_NR_PAGES*PAGE_SIZE);
> -
> - for (idx = 0; idx <= (PVCLOCK_FIXMAP_END-PVCLOCK_FIXMAP_BEGIN); idx++) {
> - __set_fixmap(PVCLOCK_FIXMAP_BEGIN + idx,
> -  __pa(i) + (idx*PAGE_SIZE),
> -  PAGE_KERNEL_VVAR);
> - }
> -
> - return 0;
> -}
> -#endif
> 

Acked-by: Paolo Bonzini 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/5] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

2015-12-10 Thread Paolo Bonzini


On 10/12/2015 00:12, Andy Lutomirski wrote:
> From: Andy Lutomirski 
> 
> The pvclock vdso code was too abstracted to understand easily and
> excessively paranoid.  Simplify it for a huge speedup.
> 
> This opens the door for additional simplifications, as the vdso no
> longer accesses the pvti for any vcpu other than vcpu 0.
> 
> Before, vclock_gettime using kvm-clock took about 45ns on my machine.
> With this change, it takes 29ns, which is almost as fast as the pure TSC
> implementation.
> 
> Signed-off-by: Andy Lutomirski 
> ---
>  arch/x86/entry/vdso/vclock_gettime.c | 81 
> 
>  1 file changed, 46 insertions(+), 35 deletions(-)
> 
> diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
> b/arch/x86/entry/vdso/vclock_gettime.c
> index ca94fa649251..c325ba1bdddf 100644
> --- a/arch/x86/entry/vdso/vclock_gettime.c
> +++ b/arch/x86/entry/vdso/vclock_gettime.c
> @@ -78,47 +78,58 @@ static notrace const struct pvclock_vsyscall_time_info 
> *get_pvti(int cpu)
>  
>  static notrace cycle_t vread_pvclock(int *mode)
>  {
> - const struct pvclock_vsyscall_time_info *pvti;
> + const struct pvclock_vcpu_time_info *pvti = _pvti(0)->pvti;
>   cycle_t ret;
> - u64 last;
> - u32 version;
> - u8 flags;
> - unsigned cpu, cpu1;
> -
> + u64 tsc, pvti_tsc;
> + u64 last, delta, pvti_system_time;
> + u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
>  
>   /*
> -  * Note: hypervisor must guarantee that:
> -  * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
> -  * 2. that per-CPU pvclock time info is updated if the
> -  *underlying CPU changes.
> -  * 3. that version is increased whenever underlying CPU
> -  *changes.
> +  * Note: The kernel and hypervisor must guarantee that cpu ID
> +  * number maps 1:1 to per-CPU pvclock time info.
> +  *
> +  * Because the hypervisor is entirely unaware of guest userspace
> +  * preemption, it cannot guarantee that per-CPU pvclock time
> +  * info is updated if the underlying CPU changes or that that
> +  * version is increased whenever underlying CPU changes.
>*
> +  * On KVM, we are guaranteed that pvti updates for any vCPU are
> +  * atomic as seen by *all* vCPUs.  This is an even stronger
> +  * guarantee than we get with a normal seqlock.
> +  *
> +  * On Xen, we don't appear to have that guarantee, but Xen still
> +  * supplies a valid seqlock using the version field.
> +
> +  * We only do pvclock vdso timing at all if
> +  * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
> +  * mean that all vCPUs have matching pvti and that the TSC is
> +  * synced, so we can just look at vCPU 0's pvti.
>*/
> - do {
> - cpu = __getcpu() & VGETCPU_CPU_MASK;
> - /* TODO: We can put vcpu id into higher bits of pvti.version.
> -  * This will save a couple of cycles by getting rid of
> -  * __getcpu() calls (Gleb).
> -  */
> -
> - pvti = get_pvti(cpu);
> -
> - version = __pvclock_read_cycles(>pvti, , );
> -
> - /*
> -  * Test we're still on the cpu as well as the version.
> -  * We could have been migrated just after the first
> -  * vgetcpu but before fetching the version, so we
> -  * wouldn't notice a version change.
> -  */
> - cpu1 = __getcpu() & VGETCPU_CPU_MASK;
> - } while (unlikely(cpu != cpu1 ||
> -   (pvti->pvti.version & 1) ||
> -   pvti->pvti.version != version));
> -
> - if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
> +
> + if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
>   *mode = VCLOCK_NONE;
> + return 0;
> + }
> +
> + do {
> + version = pvti->version;
> +
> + /* This is also a read barrier, so we'll read version first. */
> + tsc = rdtsc_ordered();
> +
> + pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
> + pvti_tsc_shift = pvti->tsc_shift;
> + pvti_system_time = pvti->system_time;
> + pvti_tsc = pvti->tsc_timestamp;
> +
> + /* Make sure that the version double-check is last. */
> + smp_rmb();
> + } while (unlikely((version & 1) || version != pvti->version));
> +
> + delta = tsc - pvti_tsc;
> + ret = pvti_system_time +
> + pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
> + pvti_tsc_shift);
>  
>   /* refer to tsc.c read_tsc() comment for rationale */
>   last = gtod->cycle_last;
> 

Reviewed-by: Paolo Bonzini 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  

[Bug 107561] 4.2 breaks PCI passthrough in QEMU/KVM

2015-12-10 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=107561

--- Comment #10 from Paolo Bonzini  ---
Another test on the 3GB machine: make it fast by nuking
kvm_mtrr_get_guest_memory_type (return MTRR_TYPE_WRBACK), then cat /proc/mtrr
from the guest.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 107561] 4.2 breaks PCI passthrough in QEMU/KVM

2015-12-10 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=107561

--- Comment #11 from Paolo Bonzini  ---
Created attachment 197071
  --> https://bugzilla.kernel.org/attachment.cgi?id=197071=edit
OVMF log

> 1) providing host dmesg

Of course this makes no sense, you already did.  And from the traces I suspect
that /proc/mtrr is the same on the two guests.  The culprit seems to be this
MTRR:

 qemu-system-x86-2228  [002]   127.763684: kvm_msr: msr_read 201 = 0xff8800
 qemu-system-x86-2228  [002]   127.763685: kvm_msr: msr_read 200 = 0x8000

which is set up by OVMF.  I'm attaching the OVMF output for reference (it can
be extracted from the traces with "trace-cmd report trace-3GBmem.dat |grep
0x402 | awk --non-decimal-data '{printf "%c", 0+$NF}' > OVMF.log").

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 2/4] VSOCK: Introduce virtio-vsock.ko

2015-12-10 Thread Alex Bennée

Stefan Hajnoczi  writes:

> From: Asias He 
>
> VM sockets virtio transport implementation. This module runs in guest
> kernel.

checkpatch warns on a bunch of whitespace/tab issues.

>
> Signed-off-by: Asias He 
> Signed-off-by: Stefan Hajnoczi 
> ---
> v2:
>  * Fix total_tx_buf accounting
>  * Add virtio_transport global mutex to prevent races
> ---
>  net/vmw_vsock/virtio_transport.c | 466 
> +++
>  1 file changed, 466 insertions(+)
>  create mode 100644 net/vmw_vsock/virtio_transport.c
>
> diff --git a/net/vmw_vsock/virtio_transport.c 
> b/net/vmw_vsock/virtio_transport.c
> new file mode 100644
> index 000..df65dca
> --- /dev/null
> +++ b/net/vmw_vsock/virtio_transport.c
> @@ -0,0 +1,466 @@
> +/*
> + * virtio transport for vsock
> + *
> + * Copyright (C) 2013-2015 Red Hat, Inc.
> + * Author: Asias He 
> + * Stefan Hajnoczi 
> + *
> + * Some of the code is take from Gerd Hoffmann 's
> + * early virtio-vsock proof-of-concept bits.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + */
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +static struct workqueue_struct *virtio_vsock_workqueue;
> +static struct virtio_vsock *the_virtio_vsock;
> +static DEFINE_MUTEX(the_virtio_vsock_mutex); /* protects the_virtio_vsock */
> +static void virtio_vsock_rx_fill(struct virtio_vsock *vsock);
> +
> +struct virtio_vsock {
> + /* Virtio device */
> + struct virtio_device *vdev;
> + /* Virtio virtqueue */
> + struct virtqueue *vqs[VSOCK_VQ_MAX];
> + /* Wait queue for send pkt */
> + wait_queue_head_t queue_wait;
> + /* Work item to send pkt */
> + struct work_struct tx_work;
> + /* Work item to recv pkt */
> + struct work_struct rx_work;
> + /* Mutex to protect send pkt*/
> + struct mutex tx_lock;
> + /* Mutex to protect recv pkt*/
> + struct mutex rx_lock;

Further down I got confused by what lock was what and exactly what was
being protected. If the receive and transmit paths touch separate things
it might be worth re-arranging the structure to make it clearer, eg:

   /* The transmit path is protected by tx_lock */
   struct mutex tx_lock;
   struct work_struct tx_work;
   ..
   ..

   /* The receive path is protected by rx_lock */
   wait_queue_head_t queue_wait;
   ..
   ..

 Which might make things a little clearer. Then all the redundant
 information in the comments can be removed. I don't need to know what
 is a Virtio device, virtqueue or wait_queue etc as they are implicit in
 the structure name.

> + /* Number of recv buffers */
> + int rx_buf_nr;
> + /* Number of max recv buffers */
> + int rx_buf_max_nr;
> + /* Used for global tx buf limitation */
> + u32 total_tx_buf;
> + /* Guest context id, just like guest ip address */
> + u32 guest_cid;
> +};
> +
> +static struct virtio_vsock *virtio_vsock_get(void)
> +{
> + return the_virtio_vsock;
> +}
> +
> +static u32 virtio_transport_get_local_cid(void)
> +{
> + struct virtio_vsock *vsock = virtio_vsock_get();
> +
> + return vsock->guest_cid;
> +}
> +
> +static int
> +virtio_transport_send_pkt(struct vsock_sock *vsk,
> +   struct virtio_vsock_pkt_info *info)
> +{
> + u32 src_cid, src_port, dst_cid, dst_port;
> + int ret, in_sg = 0, out_sg = 0;
> + struct virtio_transport *trans;
> + struct virtio_vsock_pkt *pkt;
> + struct virtio_vsock *vsock;
> + struct scatterlist hdr, buf, *sgs[2];
> + struct virtqueue *vq;
> + u32 pkt_len = info->pkt_len;
> + DEFINE_WAIT(wait);
> +
> + vsock = virtio_vsock_get();
> + if (!vsock)
> + return -ENODEV;
> +
> + src_cid = virtio_transport_get_local_cid();
> + src_port = vsk->local_addr.svm_port;
> + if (!info->remote_cid) {
> + dst_cid = vsk->remote_addr.svm_cid;
> + dst_port = vsk->remote_addr.svm_port;
> + } else {
> + dst_cid = info->remote_cid;
> + dst_port = info->remote_port;
> + }
> +
> + trans = vsk->trans;
> + vq = vsock->vqs[VSOCK_VQ_TX];
> +
> + if (pkt_len > VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE)
> + pkt_len = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE;
> + pkt_len = virtio_transport_get_credit(trans, pkt_len);
> + /* Do not send zero length OP_RW pkt*/
> + if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
> + return pkt_len;
> +
> + /* Respect global tx buf limitation */
> + mutex_lock(>tx_lock);
> + while (pkt_len + vsock->total_tx_buf > VIRTIO_VSOCK_MAX_TX_BUF_SIZE) {
> + prepare_to_wait_exclusive(>queue_wait, ,
> +   TASK_UNINTERRUPTIBLE);
> + mutex_unlock(>tx_lock);
> +

Re: x86: Don't report guest userspace emulation error to userspace, why ?

2015-12-10 Thread Bandan Das
Paolo Bonzini  writes:

>> Paolo Bonzini  writes:
>> > On 10/12/2015 18:58, Bandan Das wrote:
>> >>> > Allowing userspace to stop the guest with an emulation failure is a
>> >> This one I don't :) Userspace started the guest after all, there are other
>> >> ways for it to kill the guest if it wanted to.
>> >
>> > I mean allowing guest userspace to stop the guest.
>> 
>> Sure! Userspace (Qemu) can just reenter the guest when it sees the failure.
>> Doing it in the host kvm seems overkill.
>
> Most userspaces will get it wrong.  Doing it once makes sure that you
> do it right.

I don't buy that. As userspace, I prefer getting to know what error the guest
I launched hit and decide what to do. Well, atleast whenever I can. This seems
to be one such case.

> Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 107561] 4.2 breaks PCI passthrough in QEMU/KVM

2015-12-10 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=107561

--- Comment #9 from Paolo Bonzini  ---
Can you try the following on HostA:

1) providing *host* dmesg

2) disabling each MTRR starting from the first, until things are fast again. 
Then reboot and try disabling only the MTRR that made things fast again.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 107561] 4.2 breaks PCI passthrough in QEMU/KVM

2015-12-10 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=107561

--- Comment #12 from Paolo Bonzini  ---
Created attachment 197081
  --> https://bugzilla.kernel.org/attachment.cgi?id=197081=edit
patch to be tested

The bug happens when the VM is created with a maxphyaddr that doesn't match the
host.  This patch should fix your bug.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] KVM: arm/arm64: vgic: Fix kvm_vgic_map_is_active's dist check

2015-12-10 Thread Christoffer Dall
External inputs to the vgic from time to time need to poke into the
state of a virtual interrupt, the prime example is the architected timer
code.

Since the IRQ's active state can be represented in two places; the LR or
the distributor, we first loop over the LRs but if not active in the LRs
we just return if *any* IRQ is active on the VCPU in question.

This is of course bogus, as we should check if the specific IRQ in
quesiton is active on the distributor instead.

Reported-by: Eric Auger 
Signed-off-by: Christoffer Dall 
---
 virt/kvm/arm/vgic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
index 65461f8..7a2f449 100644
--- a/virt/kvm/arm/vgic.c
+++ b/virt/kvm/arm/vgic.c
@@ -1114,7 +1114,7 @@ bool kvm_vgic_map_is_active(struct kvm_vcpu *vcpu, struct 
irq_phys_map *map)
return true;
}
 
-   return dist_active_irq(vcpu);
+   return vgic_irq_is_active(vcpu, map->virt_irq);
 }
 
 /*
-- 
2.1.2.330.g565301e.dirty

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 01/11] qapi: Rename qjson.h to qobject-json.h

2015-12-10 Thread Eric Blake
We have two different JSON visitors in the tree; and having both
named 'qjson.h' can cause include confusion.  Rename the qapi
version.

Kill trailing whitespace in the renamed tests/check-qobject-json.c
to keep checkpatch.pl happy.

Signed-off-by: Eric Blake 
---
 MAINTAINERS   |  2 +-
 balloon.c |  2 +-
 block.c   |  2 +-
 block/archipelago.c   |  2 +-
 block/nbd.c   |  2 +-
 block/quorum.c|  2 +-
 blockjob.c|  2 +-
 hw/core/qdev.c|  2 +-
 hw/misc/pvpanic.c |  2 +-
 hw/net/virtio-net.c   |  2 +-
 include/qapi/qmp/{qjson.h => qobject-json.h}  |  0
 include/qapi/qmp/types.h  |  2 +-
 monitor.c |  2 +-
 qapi/qmp-event.c  |  2 +-
 qemu-img.c|  2 +-
 qga/main.c|  2 +-
 qobject/Makefile.objs |  3 ++-
 qobject/{qjson.c => qobject-json.c}   |  2 +-
 target-s390x/kvm.c|  2 +-
 tests/.gitignore  |  2 +-
 tests/Makefile|  8 
 tests/{check-qjson.c => check-qobject-json.c} | 14 +++---
 tests/libqtest.c  |  2 +-
 ui/spice-core.c   |  2 +-
 vl.c  |  2 +-
 25 files changed, 34 insertions(+), 33 deletions(-)
 rename include/qapi/qmp/{qjson.h => qobject-json.h} (100%)
 rename qobject/{qjson.c => qobject-json.c} (99%)
 rename tests/{check-qjson.c => check-qobject-json.c} (99%)

diff --git a/MAINTAINERS b/MAINTAINERS
index e8cee1e..c943ff4 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1155,7 +1155,7 @@ X: include/qapi/qmp/dispatch.h
 F: tests/check-qdict.c
 F: tests/check-qfloat.c
 F: tests/check-qint.c
-F: tests/check-qjson.c
+F: tests/check-qobject-json.c
 F: tests/check-qlist.c
 F: tests/check-qstring.c
 T: git git://repo.or.cz/qemu/qmp-unstable.git queue/qmp
diff --git a/balloon.c b/balloon.c
index 0f45d1b..5983b4f 100644
--- a/balloon.c
+++ b/balloon.c
@@ -31,7 +31,7 @@
 #include "trace.h"
 #include "qmp-commands.h"
 #include "qapi/qmp/qerror.h"
-#include "qapi/qmp/qjson.h"
+#include "qapi/qmp/qobject-json.h"

 static QEMUBalloonEvent *balloon_event_fn;
 static QEMUBalloonStatus *balloon_stat_fn;
diff --git a/block.c b/block.c
index 9971976..e611002 100644
--- a/block.c
+++ b/block.c
@@ -29,7 +29,7 @@
 #include "qemu/error-report.h"
 #include "qemu/module.h"
 #include "qapi/qmp/qerror.h"
-#include "qapi/qmp/qjson.h"
+#include "qapi/qmp/qobject-json.h"
 #include "sysemu/block-backend.h"
 #include "sysemu/sysemu.h"
 #include "qemu/notify.h"
diff --git a/block/archipelago.c b/block/archipelago.c
index 855655c..80a1bb5 100644
--- a/block/archipelago.c
+++ b/block/archipelago.c
@@ -56,7 +56,7 @@
 #include "qemu/thread.h"
 #include "qapi/qmp/qint.h"
 #include "qapi/qmp/qstring.h"
-#include "qapi/qmp/qjson.h"
+#include "qapi/qmp/qobject-json.h"
 #include "qemu/atomic.h"

 #include 
diff --git a/block/nbd.c b/block/nbd.c
index cd6a587..9009f4f 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -32,7 +32,7 @@
 #include "qemu/module.h"
 #include "qemu/sockets.h"
 #include "qapi/qmp/qdict.h"
-#include "qapi/qmp/qjson.h"
+#include "qapi/qmp/qobject-json.h"
 #include "qapi/qmp/qint.h"
 #include "qapi/qmp/qstring.h"

diff --git a/block/quorum.c b/block/quorum.c
index d162459..1e3a278 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -18,7 +18,7 @@
 #include "qapi/qmp/qdict.h"
 #include "qapi/qmp/qerror.h"
 #include "qapi/qmp/qint.h"
-#include "qapi/qmp/qjson.h"
+#include "qapi/qmp/qobject-json.h"
 #include "qapi/qmp/qlist.h"
 #include "qapi/qmp/qstring.h"
 #include "qapi-event.h"
diff --git a/blockjob.c b/blockjob.c
index 80adb9d..84361f7 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -31,7 +31,7 @@
 #include "block/block_int.h"
 #include "sysemu/block-backend.h"
 #include "qapi/qmp/qerror.h"
-#include "qapi/qmp/qjson.h"
+#include "qapi/qmp/qobject-json.h"
 #include "qemu/coroutine.h"
 #include "qmp-commands.h"
 #include "qemu/timer.h"
diff --git a/hw/core/qdev.c b/hw/core/qdev.c
index b3ad467..a98304d 100644
--- a/hw/core/qdev.c
+++ b/hw/core/qdev.c
@@ -31,7 +31,7 @@
 #include "qapi/error.h"
 #include "qapi/qmp/qerror.h"
 #include "qapi/visitor.h"
-#include "qapi/qmp/qjson.h"
+#include "qapi/qmp/qobject-json.h"
 #include "qemu/error-report.h"
 #include "hw/hotplug.h"
 #include "hw/boards.h"
diff --git a/hw/misc/pvpanic.c b/hw/misc/pvpanic.c
index 3709488..2571483 100644
--- a/hw/misc/pvpanic.c
+++ b/hw/misc/pvpanic.c
@@ -13,7 +13,7 @@
  */

 #include "qapi/qmp/qobject.h"
-#include "qapi/qmp/qjson.h"
+#include 

Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-10 Thread Dr. David Alan Gilbert
* Yang Zhang (yang.zhang...@gmail.com) wrote:
> On 2015/12/10 18:18, Dr. David Alan Gilbert wrote:
> >* Lan, Tianyu (tianyu@intel.com) wrote:
> >>On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:
> >>>I thought about what this is doing at the high level, and I do have some
> >>>value in what you are trying to do, but I also think we need to clarify
> >>>the motivation a bit more.  What you are saying is not really what the
> >>>patches are doing.
> >>>
> >>>And with that clearer understanding of the motivation in mind (assuming
> >>>it actually captures a real need), I would also like to suggest some
> >>>changes.
> >>
> >>Motivation:
> >>Most current solutions for migration with passthough device are based on
> >>the PCI hotplug but it has side affect and can't work for all device.
> >>
> >>For NIC device:
> >>PCI hotplug solution can work around Network device migration
> >>via switching VF and PF.
> >>
> >>But switching network interface will introduce service down time.
> >>
> >>I tested the service down time via putting VF and PV interface
> >>into a bonded interface and ping the bonded interface during plug
> >>and unplug VF.
> >>1) About 100ms when add VF
> >>2) About 30ms when del VF
> >>
> >>It also requires guest to do switch configuration. These are hard to
> >>manage and deploy from our customers. To maintain PV performance during
> >>migration, host side also needs to assign a VF to PV device. This
> >>affects scalability.
> >>
> >>These factors block SRIOV NIC passthough usage in the cloud service and
> >>OPNFV which require network high performance and stability a lot.
> >
> >Right, that I'll agree it's hard to do migration of a VM which uses
> >an SRIOV device; and while I think it should be possible to bond a virtio 
> >device
> >to a VF for networking and then hotplug the SR-IOV device I agree it's hard 
> >to manage.
> >
> >>For other kind of devices, it's hard to work.
> >>We are also adding migration support for QAT(QuickAssist Technology) device.
> >>
> >>QAT device user case introduction.
> >>Server, networking, big data, and storage applications use QuickAssist
> >>Technology to offload servers from handling compute-intensive operations,
> >>such as:
> >>1) Symmetric cryptography functions including cipher operations and
> >>authentication operations
> >>2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
> >>cryptography
> >>3) Compression and decompression functions including DEFLATE and LZS
> >>
> >>PCI hotplug will not work for such devices during migration and these
> >>operations will fail when unplug device.
> >
> >I don't understand that QAT argument; if the device is purely an offload
> >engine for performance, then why can't you fall back to doing the
> >same operations in the VM or in QEMU if the card is unavailable?
> >The tricky bit is dealing with outstanding operations.
> >
> >>So we are trying implementing a new solution which really migrates
> >>device state to target machine and won't affect user during migration
> >>with low service down time.
> >
> >Right, that's a good aim - the only question is how to do it.
> >
> >It looks like this is always going to need some device-specific code;
> >the question I see is whether that's in:
> > 1) qemu
> > 2) the host kernel
> > 3) the guest kernel driver
> >
> >The objections to this series seem to be that it needs changes to (3);
> >I can see the worry that the guest kernel driver might not get a chance
> >to run during the right time in migration and it's painful having to
> >change every guest driver (although your change is small).
> >
> >My question is what stage of the migration process do you expect to tell
> >the guest kernel driver to do this?
> >
> > If you do it at the start of the migration, and quiesce the device,
> > the migration might take a long time (say 30 minutes) - are you
> > intending the device to be quiesced for this long? And where are
> > you going to send the traffic?
> > If you are, then do you need to do it via this PCI trick, or could
> > you just do it via something higher level to quiesce the device.
> >
> > Or are you intending to do it just near the end of the migration?
> > But then how do we know how long it will take the guest driver to
> > respond?
> 
> Ideally, it is able to leave guest driver unmodified but it requires the
> hypervisor or qemu to aware the device which means we may need a driver in
> hypervisor or qemu to handle the device on behalf of guest driver.

Can you answer the question of when do you use your code -
   at the start of migration or
   just before the end?

> >It would be great if we could avoid changing the guest; but at least your 
> >guest
> >driver changes don't actually seem to be that hardware specific; could your
> >changes actually be moved to generic PCI level so they could be made
> >to work for lots of drivers?
> 
> It is impossible to use one common solution for all devices 

Re: [PATCH v6 10/21] KVM: ARM64: Add access handler for PMEVCNTRn and PMCCNTR register

2015-12-10 Thread Shannon Zhao
Hi Marc,

On 2015/12/9 0:30, Marc Zyngier wrote:
> On 08/12/15 12:47, Shannon Zhao wrote:
>> > From: Shannon Zhao 
>> > 
>> > Since the reset value of PMEVCNTRn or PMCCNTR is UNKNOWN, use
>> > reset_unknown for its reset handler. Add access handler which emulates
>> > writing and reading PMEVCNTRn or PMCCNTR register. When reading
>> > PMEVCNTRn or PMCCNTR, call perf_event_read_value to get the count value
>> > of the perf event.
>> > 
>> > Signed-off-by: Shannon Zhao 
>> > ---
>> >  arch/arm64/kvm/sys_regs.c | 107 
>> > +-
>> >  1 file changed, 105 insertions(+), 2 deletions(-)
>> > 
>> > diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
>> > index c116a1b..f7a73b5 100644
>> > --- a/arch/arm64/kvm/sys_regs.c
>> > +++ b/arch/arm64/kvm/sys_regs.c
>> > @@ -525,6 +525,12 @@ static bool access_pmu_regs(struct kvm_vcpu *vcpu,
>> >  
>> >if (p->is_write) {
>> >switch (r->reg) {
>> > +  case PMEVCNTR0_EL0 ... PMCCNTR_EL0: {
> Same problem as previously mentioned.
> 
>> > +  val = kvm_pmu_get_counter_value(vcpu,
>> > +  r->reg - PMEVCNTR0_EL0);
>> > +  vcpu_sys_reg(vcpu, r->reg) += (s64)p->regval - val;
>> > +  break;
>> > +  }

If I use a handler to handle these accesses to PMEVCNTRn and PMCCNTR
like below. It converts the register offset c14_PMEVCNTRn and c9_PMCCNTR
to PMEVCNTRn_EL0 and PMCCNTR_EL0, uniformly uses vcpu_sys_reg and
doesn't need to take care the big endian. What do you think about this?

static bool access_pmu_evcntr(struct kvm_vcpu *vcpu,
  struct sys_reg_params *p,
  const struct sys_reg_desc *r)
{
u64 idx, reg, val;

if (p->is_aarch32)
reg = r->reg / 2;
else
reg = r->reg;

switch (reg) {
case PMEVCNTR0_EL0 ... PMEVCNTR30_EL0: {
idx = reg - PMEVCNTR0_EL0;
break;
}
case PMCCNTR_EL0: {
idx = ARMV8_CYCLE_IDX;
break;
}
default:
idx = vcpu_sys_reg(vcpu, PMSELR_EL0) & ARMV8_COUNTER_MASK;
if (!pmu_counter_idx_valid(vcpu, idx))
return true;
reg = (idx == ARMV8_CYCLE_IDX) ? PMCCNTR_EL0 :
PMEVCNTR0_EL0 + idx;
break;
}

val = kvm_pmu_get_counter_value(vcpu, idx);
if (p->is_write)
vcpu_sys_reg(vcpu, reg) = (s64)p->regval - val;
else
p->regval = val;

return true;
}

Thanks,
-- 
Shannon

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-10 Thread Dr. David Alan Gilbert
* Lan, Tianyu (tianyu@intel.com) wrote:
> On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:
> >I thought about what this is doing at the high level, and I do have some
> >value in what you are trying to do, but I also think we need to clarify
> >the motivation a bit more.  What you are saying is not really what the
> >patches are doing.
> >
> >And with that clearer understanding of the motivation in mind (assuming
> >it actually captures a real need), I would also like to suggest some
> >changes.
> 
> Motivation:
> Most current solutions for migration with passthough device are based on
> the PCI hotplug but it has side affect and can't work for all device.
> 
> For NIC device:
> PCI hotplug solution can work around Network device migration
> via switching VF and PF.
> 
> But switching network interface will introduce service down time.
> 
> I tested the service down time via putting VF and PV interface
> into a bonded interface and ping the bonded interface during plug
> and unplug VF.
> 1) About 100ms when add VF
> 2) About 30ms when del VF
> 
> It also requires guest to do switch configuration. These are hard to
> manage and deploy from our customers. To maintain PV performance during
> migration, host side also needs to assign a VF to PV device. This
> affects scalability.
> 
> These factors block SRIOV NIC passthough usage in the cloud service and
> OPNFV which require network high performance and stability a lot.

Right, that I'll agree it's hard to do migration of a VM which uses
an SRIOV device; and while I think it should be possible to bond a virtio device
to a VF for networking and then hotplug the SR-IOV device I agree it's hard to 
manage.

> For other kind of devices, it's hard to work.
> We are also adding migration support for QAT(QuickAssist Technology) device.
> 
> QAT device user case introduction.
> Server, networking, big data, and storage applications use QuickAssist
> Technology to offload servers from handling compute-intensive operations,
> such as:
> 1) Symmetric cryptography functions including cipher operations and
> authentication operations
> 2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
> cryptography
> 3) Compression and decompression functions including DEFLATE and LZS
> 
> PCI hotplug will not work for such devices during migration and these
> operations will fail when unplug device.

I don't understand that QAT argument; if the device is purely an offload
engine for performance, then why can't you fall back to doing the
same operations in the VM or in QEMU if the card is unavailable?
The tricky bit is dealing with outstanding operations.

> So we are trying implementing a new solution which really migrates
> device state to target machine and won't affect user during migration
> with low service down time.

Right, that's a good aim - the only question is how to do it.

It looks like this is always going to need some device-specific code;
the question I see is whether that's in:
1) qemu
2) the host kernel
3) the guest kernel driver

The objections to this series seem to be that it needs changes to (3);
I can see the worry that the guest kernel driver might not get a chance
to run during the right time in migration and it's painful having to
change every guest driver (although your change is small).

My question is what stage of the migration process do you expect to tell
the guest kernel driver to do this?

If you do it at the start of the migration, and quiesce the device,
the migration might take a long time (say 30 minutes) - are you
intending the device to be quiesced for this long? And where are
you going to send the traffic?
If you are, then do you need to do it via this PCI trick, or could
you just do it via something higher level to quiesce the device.

Or are you intending to do it just near the end of the migration?
But then how do we know how long it will take the guest driver to
respond?

It would be great if we could avoid changing the guest; but at least your guest
driver changes don't actually seem to be that hardware specific; could your
changes actually be moved to generic PCI level so they could be made
to work for lots of drivers?

Dave
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 10/21] KVM: ARM64: Add access handler for PMEVCNTRn and PMCCNTR register

2015-12-10 Thread Marc Zyngier
Hi Shannon,

On 10/12/15 11:36, Shannon Zhao wrote:
> Hi Marc,
> 
> On 2015/12/9 0:30, Marc Zyngier wrote:
>> On 08/12/15 12:47, Shannon Zhao wrote:
 From: Shannon Zhao 

 Since the reset value of PMEVCNTRn or PMCCNTR is UNKNOWN, use
 reset_unknown for its reset handler. Add access handler which emulates
 writing and reading PMEVCNTRn or PMCCNTR register. When reading
 PMEVCNTRn or PMCCNTR, call perf_event_read_value to get the count value
 of the perf event.

 Signed-off-by: Shannon Zhao 
 ---
  arch/arm64/kvm/sys_regs.c | 107 
 +-
  1 file changed, 105 insertions(+), 2 deletions(-)

 diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
 index c116a1b..f7a73b5 100644
 --- a/arch/arm64/kvm/sys_regs.c
 +++ b/arch/arm64/kvm/sys_regs.c
 @@ -525,6 +525,12 @@ static bool access_pmu_regs(struct kvm_vcpu *vcpu,
  
if (p->is_write) {
switch (r->reg) {
 +  case PMEVCNTR0_EL0 ... PMCCNTR_EL0: {
>> Same problem as previously mentioned.
>>
 +  val = kvm_pmu_get_counter_value(vcpu,
 +  r->reg - PMEVCNTR0_EL0);
 +  vcpu_sys_reg(vcpu, r->reg) += (s64)p->regval - val;
 +  break;
 +  }
> 
> If I use a handler to handle these accesses to PMEVCNTRn and PMCCNTR
> like below. It converts the register offset c14_PMEVCNTRn and c9_PMCCNTR
> to PMEVCNTRn_EL0 and PMCCNTR_EL0, uniformly uses vcpu_sys_reg and
> doesn't need to take care the big endian. What do you think about this?
> 
> static bool access_pmu_evcntr(struct kvm_vcpu *vcpu,
>   struct sys_reg_params *p,
>   const struct sys_reg_desc *r)
> {
> u64 idx, reg, val;
> 
> if (p->is_aarch32)
> reg = r->reg / 2;

I'd prefer it if you actually decoded the reg itself. Something like:

if (p->is_aarch32) {
if (r->CRn == 9 && r->CRm == 13)
reg = (r->Op2 & 1) ? 0 : PMCCNTR_EL0;
if (r->CRn == 14 && (r->CRm & 0xc) == 8) {
reg = ((r->CRm & 3) << 2) & (r->Op2 & 7);
reg += PMEVCNTR0_EL0;
} else {
BUG();
}
} else {

}

And then you can get rid of the c14_PMVCNTR* and c9_PMCCNTR macros.
The only slightly ugly thing is this 0 value to represent PMXEVTYPER,
but that's what we already have with your "default" clause below.

> else
> reg = r->reg;
> 
> switch (reg) {
> case PMEVCNTR0_EL0 ... PMEVCNTR30_EL0: {
> idx = reg - PMEVCNTR0_EL0;
> break;
> }
> case PMCCNTR_EL0: {
> idx = ARMV8_CYCLE_IDX;
> break;
> }
> default:
> idx = vcpu_sys_reg(vcpu, PMSELR_EL0) & ARMV8_COUNTER_MASK;
> if (!pmu_counter_idx_valid(vcpu, idx))
> return true;
> reg = (idx == ARMV8_CYCLE_IDX) ? PMCCNTR_EL0 :
> PMEVCNTR0_EL0 + idx;
> break;
> }
> 
> val = kvm_pmu_get_counter_value(vcpu, idx);
> if (p->is_write)
> vcpu_sys_reg(vcpu, reg) = (s64)p->regval - val;

Maybe I don't have my head screwed in the right way, but as long as
we're only using u64 quantities, why do we need this s64 cast?

> else
> p->regval = val;
> 
> return true;
> }

Thanks,

M.
-- 
Jazz is not dead. It just smells funny...
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 1/4] VSOCK: Introduce virtio-vsock-common.ko

2015-12-10 Thread Alex Bennée

Stefan Hajnoczi  writes:

> From: Asias He 
>
> This module contains the common code and header files for the following
> virtio-vsock and virtio-vhost kernel modules.

General comment checkpatch has a bunch of warnings about 80 character
limits, extra braces and BUG_ON usage.

>
> Signed-off-by: Asias He 
> Signed-off-by: Stefan Hajnoczi 
> ---
> v3:
>  * Remove unnecessary 3-way handshake, just do REQUEST/RESPONSE instead
>of REQUEST/RESPONSE/ACK
>  * Remove SOCK_DGRAM support and focus on SOCK_STREAM first
>  * Only allow host->guest connections (same security model as latest
>VMware)
> v2:
>  * Fix peer_buf_alloc inheritance on child socket
>  * Notify other side of SOCK_STREAM disconnect (fixes shutdown
>semantics)
>  * Avoid recursive mutex_lock(tx_lock) for write_space (fixes deadlock)
>  * Define VIRTIO_VSOCK_TYPE_STREAM/DGRAM hardware interface constants
>  * Define VIRTIO_VSOCK_SHUTDOWN_RCV/SEND hardware interface constants
> ---
>  include/linux/virtio_vsock.h| 203 
>  include/uapi/linux/virtio_ids.h |   1 +
>  include/uapi/linux/virtio_vsock.h   |  87 
>  net/vmw_vsock/virtio_transport_common.c | 854 
> 
>  4 files changed, 1145 insertions(+)
>  create mode 100644 include/linux/virtio_vsock.h
>  create mode 100644 include/uapi/linux/virtio_vsock.h
>  create mode 100644 net/vmw_vsock/virtio_transport_common.c
>
> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> new file mode 100644
> index 000..e54eb45
> --- /dev/null
> +++ b/include/linux/virtio_vsock.h
> @@ -0,0 +1,203 @@
> +/*
> + * This header, excluding the #ifdef __KERNEL__ part, is BSD licensed so
> + * anyone can use the definitions to implement compatible
> drivers/servers:

Is anything in here actually exposed to userspace or the guest? The
#ifdef __KERNEL__ statement seems redundant for this file at least.

> + *
> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions
> + * are met:
> + * 1. Redistributions of source code must retain the above copyright
> + *notice, this list of conditions and the following disclaimer.
> + * 2. Redistributions in binary form must reproduce the above copyright
> + *notice, this list of conditions and the following disclaimer in the
> + *documentation and/or other materials provided with the distribution.
> + * 3. Neither the name of IBM nor the names of its contributors
> + *may be used to endorse or promote products derived from this software
> + *without specific prior written permission.
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS 
> IS''
> + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> + * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> + * SUCH DAMAGE.
> + *
> + * Copyright (C) Red Hat, Inc., 2013-2015
> + * Copyright (C) Asias He , 2013
> + * Copyright (C) Stefan Hajnoczi , 2015
> + */
> +
> +#ifndef _LINUX_VIRTIO_VSOCK_H
> +#define _LINUX_VIRTIO_VSOCK_H
> +
> +#include 
> +#include 
> +#include 
> +
> +#define VIRTIO_VSOCK_DEFAULT_MIN_BUF_SIZE128
> +#define VIRTIO_VSOCK_DEFAULT_BUF_SIZE(1024 * 256)
> +#define VIRTIO_VSOCK_DEFAULT_MAX_BUF_SIZE(1024 * 256)
> +#define VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE (1024 * 4)
> +#define VIRTIO_VSOCK_MAX_BUF_SIZE0xUL
> +#define VIRTIO_VSOCK_MAX_PKT_BUF_SIZE(1024 * 64)
> +#define VIRTIO_VSOCK_MAX_TX_BUF_SIZE (1024 * 1024 * 16)
> +#define VIRTIO_VSOCK_MAX_DGRAM_SIZE  (1024 * 64)
> +
> +struct vsock_transport_recv_notify_data;
> +struct vsock_transport_send_notify_data;
> +struct sockaddr_vm;
> +struct vsock_sock;
> +
> +enum {
> + VSOCK_VQ_CTRL   = 0,
> + VSOCK_VQ_RX = 1, /* for host to guest data */
> + VSOCK_VQ_TX = 2, /* for guest to host data */
> + VSOCK_VQ_MAX= 3,
> +};
> +
> +/* virtio transport socket state */
> +struct virtio_transport {
> + struct virtio_transport_pkt_ops *ops;
> + struct vsock_sock *vsk;
> +
> + u32 buf_size;
> + u32 buf_size_min;
> + u32 buf_size_max;
> +
> + struct mutex tx_lock;
> + struct mutex rx_lock;
> +
> + struct list_head 

Re: [PATCH] vgaarb: fix signal handling in vga_get()

2015-12-10 Thread David Herrmann
Hi

On Mon, Nov 30, 2015 at 3:17 AM, Kirill A. Shutemov
 wrote:
> There are few defects in vga_get() related to signal hadning:
>
>   - we shouldn't check for pending signals for TASK_UNINTERRUPTIBLE
> case;
>
>   - if we found pending signal we must remove ourself from wait queue
> and change task state back to running;
>
>   - -ERESTARTSYS is more appropriate, I guess.
>
> Signed-off-by: Kirill A. Shutemov 
> ---
>
> Alex, I try to get KVM with VGA passthrough working properly. I have i915
> (HD 4600) on the host and GTX 580 for the guest. The guest GPU is not
> capabale of EFI, so I have to use x-vga=on. It's kinda work with your
> patch for i915.enable_hd_vgaarb=1. But guest refuse to initialize the GPU
> after KVM was not shut down correctly, resulting in host crash like this:
>
> BUG: unable to handle kernel paging request at 880870187ed8
> IP: [] 0x880870187ed8
> PGD 2129067 PUD 800841e3
> Oops: 0011 [#1] PREEMPT SMP
> Modules linked in: iwlmvm iwlwifi
> CPU: 6 PID: 3983 Comm: qemu-system-x86 Not tainted 4.3.0-gentoo #6
> Hardware name: Gigabyte Technology Co., Ltd. Z87X-UD7 TH/Z87X-UD7 TH-CF, BIOS 
> F5a 06/12/2014
> task: 88087a91 ti: 8808632c task.ti: 8808632c
> RIP: 0010:[]  [] 0x880870187ed8
> RSP: 0018:8808632c3d08  EFLAGS: 00010006
> RAX: 880870187db0 RBX: 70187f58 RCX: 
> RDX:  RSI: 0003 RDI: 880870187db0
> RBP: 8808632c3d48 R08:  R09: 
> R10: 000103c0 R11: 0293 R12: 81ea03c8
> R13: 8104c7cb R14:  R15: 0003
> FS:  7f984f9b2700() GS:88089f38() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 880870187ed8 CR3: 0008645f8000 CR4: 001426e0
> Stack:
>  810cc83d 632c3d28  81ea03c0
>  0046 0003  
>  8808632c3d80 810cca44 88087af63800 0286
> Call Trace:
>  [] ? __wake_up_common+0x4d/0x80
>  [] __wake_up+0x34/0x50
>  [] __vga_put+0x73/0xd0
>  [] vga_put+0x54/0x80
>  [] vfio_pci_vga_rw+0x1d2/0x220
>  [] vfio_pci_rw+0x33/0x60
>  [] vfio_pci_write+0x17/0x20
>  [] vfio_device_fops_write+0x26/0x30
>  [] __vfs_write+0x23/0xe0
>  [] ? __vfs_read+0x23/0xd0
>  [] ? do_vfs_ioctl+0x2b5/0x490
>  [] vfs_write+0xa4/0x190
>  [] SyS_pwrite64+0x66/0xa0
>  [] entry_SYSCALL_64_fastpath+0x12/0x6a
> Code: 88 ff ff e0 7e 18 70 08 88 ff ff 00 8c 57 76 08 88 ff ff 20 7f 18 70 08 
> 88 ff ff 08 7f 18 70 08 88 ff ff 94 51 1a 81 ff ff ff ff <09> 00 00 00 00 00 
> 00 00 01 8c 57 76 08 88 ff ff 00 8c 57 76 08
> RIP  [] 0x880870187ed8
>  RSP 
> CR2: 880870187ed8
>
> The patch fixes the crash, but doesn't help with getting GPU in guest
> working again.
>
> Any ideas?
>
> ---
>  drivers/gpu/vga/vgaarb.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/vga/vgaarb.c b/drivers/gpu/vga/vgaarb.c
> index 3166e4bc4eb6..9abcaa53bd25 100644
> --- a/drivers/gpu/vga/vgaarb.c
> +++ b/drivers/gpu/vga/vgaarb.c
> @@ -395,8 +395,10 @@ int vga_get(struct pci_dev *pdev, unsigned int rsrc, int 
> interruptible)
> set_current_state(interruptible ?
>   TASK_INTERRUPTIBLE :
>   TASK_UNINTERRUPTIBLE);
> -   if (signal_pending(current)) {
> -   rc = -EINTR;
> +   if (interruptible && signal_pending(current)) {
> +   __set_current_state(TASK_RUNNING);
> +   remove_wait_queue(_wait_queue, );
> +   rc = -ERESTARTSYS;
> break;

All 3 points are valid, and the patch looks good to me:

Reviewed-by: David Herrmann 

However, there seems to be a race between vga_lock and putting the
thread asleep. We should fix that as well. See the hunk below
(completely untested.. why is VGA still in use? *sigh*).

Thanks
David

diff --git a/drivers/gpu/vga/vgaarb.c b/drivers/gpu/vga/vgaarb.c
index a0b4334..82cf1e3 100644
--- a/drivers/gpu/vga/vgaarb.c
+++ b/drivers/gpu/vga/vgaarb.c
@@ -359,8 +359,8 @@ static void __vga_put
 int vga_get(struct pci_dev *pdev, unsigned int rsrc, int interruptible)
 {
struct vga_device *vgadev, *conflict;
+   DECLARE_WAITQUEUE(wait, current);
unsigned long flags;
-   wait_queue_t wait;
int rc = 0;

vga_check_first_use();
@@ -371,6 +371,11 @@ int vga_get
return 0;

for (;;) {
+   add_wait_queue(_wait_queue, );
+   set_current_state(interruptible ?
+ TASK_INTERRUPTIBLE :
+ TASK_UNINTERRUPTIBLE);
+
spin_lock_irqsave(_lock, flags);
vgadev = 

Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-10 Thread Yang Zhang

On 2015/12/10 18:18, Dr. David Alan Gilbert wrote:

* Lan, Tianyu (tianyu@intel.com) wrote:

On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:

I thought about what this is doing at the high level, and I do have some
value in what you are trying to do, but I also think we need to clarify
the motivation a bit more.  What you are saying is not really what the
patches are doing.

And with that clearer understanding of the motivation in mind (assuming
it actually captures a real need), I would also like to suggest some
changes.


Motivation:
Most current solutions for migration with passthough device are based on
the PCI hotplug but it has side affect and can't work for all device.

For NIC device:
PCI hotplug solution can work around Network device migration
via switching VF and PF.

But switching network interface will introduce service down time.

I tested the service down time via putting VF and PV interface
into a bonded interface and ping the bonded interface during plug
and unplug VF.
1) About 100ms when add VF
2) About 30ms when del VF

It also requires guest to do switch configuration. These are hard to
manage and deploy from our customers. To maintain PV performance during
migration, host side also needs to assign a VF to PV device. This
affects scalability.

These factors block SRIOV NIC passthough usage in the cloud service and
OPNFV which require network high performance and stability a lot.


Right, that I'll agree it's hard to do migration of a VM which uses
an SRIOV device; and while I think it should be possible to bond a virtio device
to a VF for networking and then hotplug the SR-IOV device I agree it's hard to 
manage.


For other kind of devices, it's hard to work.
We are also adding migration support for QAT(QuickAssist Technology) device.

QAT device user case introduction.
Server, networking, big data, and storage applications use QuickAssist
Technology to offload servers from handling compute-intensive operations,
such as:
1) Symmetric cryptography functions including cipher operations and
authentication operations
2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
cryptography
3) Compression and decompression functions including DEFLATE and LZS

PCI hotplug will not work for such devices during migration and these
operations will fail when unplug device.


I don't understand that QAT argument; if the device is purely an offload
engine for performance, then why can't you fall back to doing the
same operations in the VM or in QEMU if the card is unavailable?
The tricky bit is dealing with outstanding operations.


So we are trying implementing a new solution which really migrates
device state to target machine and won't affect user during migration
with low service down time.


Right, that's a good aim - the only question is how to do it.

It looks like this is always going to need some device-specific code;
the question I see is whether that's in:
 1) qemu
 2) the host kernel
 3) the guest kernel driver

The objections to this series seem to be that it needs changes to (3);
I can see the worry that the guest kernel driver might not get a chance
to run during the right time in migration and it's painful having to
change every guest driver (although your change is small).

My question is what stage of the migration process do you expect to tell
the guest kernel driver to do this?

 If you do it at the start of the migration, and quiesce the device,
 the migration might take a long time (say 30 minutes) - are you
 intending the device to be quiesced for this long? And where are
 you going to send the traffic?
 If you are, then do you need to do it via this PCI trick, or could
 you just do it via something higher level to quiesce the device.

 Or are you intending to do it just near the end of the migration?
 But then how do we know how long it will take the guest driver to
 respond?


Ideally, it is able to leave guest driver unmodified but it requires the 
hypervisor or qemu to aware the device which means we may need a driver 
in hypervisor or qemu to handle the device on behalf of guest driver.




It would be great if we could avoid changing the guest; but at least your guest
driver changes don't actually seem to be that hardware specific; could your
changes actually be moved to generic PCI level so they could be made
to work for lots of drivers?


It is impossible to use one common solution for all devices unless the 
PCIE spec documents it clearly and i think one day it will be there. But 
before that, we need some workarounds on guest driver to make it work 
even it looks ugly.



--
best regards
yang
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: freeze host when injecting NMIs in the guest, at least in 4.4-rc4+

2015-12-10 Thread Paolo Bonzini


On 10/12/2015 17:44, Borislav Petkov wrote:
> Yap,
> 
> this is clearly a qemu/kvm issue. Lemme remove ext4 folks from CC. So
> here's what happens:
> 
> I boot a kvm guest, connect to its monitor (qemu is started with
> "-monitor pty") and on the monitor I issue a couple of times the "nmi"
> command. It doesn't explode immediately but it happens pretty often and
> when it happens, the *host*(!) freezes with some nasty corruption, see
> below.
> 
> Thoughts, suggestions, ideas?

Can you try it on Intel?

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: x86: Don't report guest userspace emulation error to userspace, why ?

2015-12-10 Thread Bandan Das
Paolo Bonzini  writes:

> On 09/12/2015 23:18, Bandan Das wrote:
>> Commit a2b9e6c1a35afcc09:
>> 
>> KVM: x86: Don't report guest userspace emulation error to userspace
>> 
>> Commit fc3a9157d314 ("KVM: X86: Don't report L2 emulation failures to
>> user-space") disabled the reporting of L2 (nested guest) emulation 
>> failures to
>> userspace due to race-condition between a vmexit and the instruction 
>> emulator.
>> The same rational applies also to userspace applications that are 
>> permitted by
>> the guest OS to access MMIO area or perform PIO.
>> 
>> This patch extends the current behavior - of injecting a #UD instead of
>> reporting it to userspace - also for guest userspace code.
>> 
>> I searched the archives but failed in finding anything. Can someone please
>> explain why this is needed ? Or, why not let userspace decide what to do 
>> based
>> on the cpl, whether to continue execution or kill the guest ? Is the 
>> assumption
>> here that this is what userspace always wants ?
>
> Not what userspace always wants, but what the guest kernel always wants.

Thanks Paolo, this one I agree.

> Allowing userspace to stop the guest with an emulation failure is a

This one I don't :) Userspace started the guest after all, there are other
ways for it to kill the guest if it wanted to.

> possible denial of service, similar to L2 stopping L1 with an emulation
> failure.
>
> Paolo
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: x86: Don't report guest userspace emulation error to userspace, why ?

2015-12-10 Thread Paolo Bonzini


On 10/12/2015 18:58, Bandan Das wrote:
>> > Allowing userspace to stop the guest with an emulation failure is a
> This one I don't :) Userspace started the guest after all, there are other
> ways for it to kill the guest if it wanted to.

I mean allowing guest userspace to stop the guest.

Paolo

>> > possible denial of service, similar to L2 stopping L1 with an emulation
>> > failure.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: x86: move tracepoints outside extended quiescent state

2015-12-10 Thread Borislav Petkov
On Thu, Dec 10, 2015 at 06:38:57PM +0100, Paolo Bonzini wrote:
> Invoking tracepoints within kvm_guest_enter/kvm_guest_exit causes a
> lockdep splat.
> 
> Cc: sta...@vger.kernel.org
> Reported-by: Borislav Petkov 
> Signed-off-by: Paolo Bonzini 
> ---
>  arch/x86/kvm/svm.c | 4 ++--
>  arch/x86/kvm/vmx.c | 3 ++-
>  arch/x86/kvm/x86.c | 2 +-
>  3 files changed, 5 insertions(+), 4 deletions(-)

Looks like you missed some...

[  144.296364] kvm: zapping shadow pages for mmio generation wraparound
[  164.699053] kvm: zapping shadow pages for mmio generation wraparound
[  312.115767] kvm: zapping shadow pages for mmio generation wraparound
[  432.277585] kvm: zapping shadow pages for mmio generation wraparound

[  434.547820] ===
[  434.552020] [ INFO: suspicious RCU usage. ]
[  434.556223] 4.4.0-rc4+ #1 Not tainted
[  434.559886] ---
[  434.564072] arch/x86/kvm/trace.h:971 suspicious rcu_dereference_check() 
usage!
[  434.571303] 
   other info that might help us debug this:

[  434.579324] 
   RCU used illegally from idle CPU!
   rcu_scheduler_active = 1, debug_locks = 0
[  434.590209] RCU used illegally from extended quiescent state!
[  434.595971] 1 lock held by qemu-system-x86/2402:
[  434.600596]  #0:  (>mutex){+.+.+.}, at: [] 
vcpu_load+0x1c/0x80 [kvm]
[  434.609146] 
   stack backtrace:
[  434.613526] CPU: 4 PID: 2402 Comm: qemu-system-x86 Not tainted 4.4.0-rc4+ #1
[  434.620583] Hardware name: Dell Inc. Precision T3600/0PTTT9, BIOS A13 
05/11/2014
[  434.627987]  0001 88042f79fcf0 813c2cfc 
880435aa
[  434.635443]  88042f79fd20 810c5157 88042fd48000 
000295a85563
[  434.642886]  000295de483f  88042f79fd58 
a023ec6e
[  434.650334] Call Trace:
[  434.652804]  [] dump_stack+0x4e/0x82
[  434.657950]  [] lockdep_rcu_suspicious+0xe7/0x120
[  434.664239]  [] wait_lapic_expire+0xfe/0x1e0 [kvm]
[  434.670606]  [] kvm_arch_vcpu_ioctl_run+0x76e/0x19c0 [kvm]
[  434.677674]  [] ? kvm_arch_vcpu_ioctl_run+0x8b2/0x19c0 
[kvm]
[  434.684905]  [] ? mutex_lock_killable_nested+0x29c/0x4c0
[  434.691792]  [] kvm_vcpu_ioctl+0x342/0x700 [kvm]
[  434.697984]  [] ? __lock_is_held+0x4d/0x70
[  434.703655]  [] ? __fget+0xfe/0x200
[  434.708719]  [] do_vfs_ioctl+0x301/0x550
[  434.714208]  [] ? __fget_light+0x2a/0x90
[  434.719700]  [] SyS_ioctl+0x41/0x70
[  434.724754]  [] entry_SYSCALL_64_fastpath+0x16/0x7a
[  437.411818] kvm [2400]: vcpu0 unhandled rdmsr: 0x606
[  437.416898] kvm [2400]: vcpu0 unhandled rdmsr: 0x34


-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-10 Thread Alexander Duyck
On Thu, Dec 10, 2015 at 6:38 AM, Lan, Tianyu  wrote:
>
>
> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
>>>
>>> Ideally, it is able to leave guest driver unmodified but it requires the
>>> >hypervisor or qemu to aware the device which means we may need a driver
>>> > in
>>> >hypervisor or qemu to handle the device on behalf of guest driver.
>>
>> Can you answer the question of when do you use your code -
>> at the start of migration or
>> just before the end?
>
>
> Just before stopping VCPU in this version and inject VF mailbox irq to
> notify the driver if the irq handler is installed.
> Qemu side also will check this via the faked PCI migration capability
> and driver will set the status during device open() or resume() callback.

The VF mailbox interrupt is a very bad idea.  Really the device should
be in a reset state on the other side of a migration.  It doesn't make
sense to have the interrupt firing if the device is not configured.
This is one of the things that is preventing you from being able to
migrate the device while the interface is administratively down or the
VF driver is not loaded.

My thought on all this is that it might make sense to move this
functionality into a PCI-to-PCI bridge device and make it a
requirement that all direct-assigned devices have to exist behind that
device in order to support migration.  That way you would be working
with a directly emulated device that would likely already be
supporting hot-plug anyway.  Then it would just be a matter of coming
up with a few Qemu specific extensions that you would need to add to
the device itself.  The same approach would likely be portable enough
that you could achieve it with PCIe as well via the same configuration
space being present on the upstream side of a PCIe port or maybe a
PCIe switch of some sort.

It would then be possible to signal via your vendor-specific PCI
capability on that device that all devices behind this bridge require
DMA page dirtying, you could use the configuration in addition to the
interrupt already provided for hot-plug to signal things like when you
are starting migration, and possibly even just extend the shpc
functionality so that if this capability is present you have the
option to pause/resume instead of remove/probe the device in the case
of certain hot-plug events.  The fact is there may be some use for a
pause/resume type approach for PCIe hot-plug in the near future
anyway.  From the sounds of it Apple has required it for all
Thunderbolt device drivers so that they can halt the device in order
to shuffle resources around, perhaps we should look at something
similar for Linux.

The other advantage behind grouping functions on one bridge is things
like reset domains.  The PCI error handling logic will want to be able
to reset any devices that experienced an error in the event of
something such as a surprise removal.  By grouping all of the devices
you could disable/reset/enable them as one logical group in the event
of something such as the "bad path" approach Michael has mentioned.

>>
 > >It would be great if we could avoid changing the guest; but at least
 > > your guest
 > >driver changes don't actually seem to be that hardware specific;
 > > could your
 > >changes actually be moved to generic PCI level so they could be made
 > >to work for lots of drivers?
>>>
>>> >
>>> >It is impossible to use one common solution for all devices unless the
>>> > PCIE
>>> >spec documents it clearly and i think one day it will be there. But
>>> > before
>>> >that, we need some workarounds on guest driver to make it work even it
>>> > looks
>>> >ugly.
>
>
> Yes, so far there is not hardware migration support and it's hard to modify
> bus level code. It also will block implementation on the Windows.

Please don't assume things.  Unless you have hard data from Microsoft
that says they want it this way lets just try to figure out what works
best for us for now and then we can start worrying about third party
implementations after we have figured out a solution that actually
works.

- Alex
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: freeze host when injecting NMIs in the guest, at least in 4.4-rc4+

2015-12-10 Thread Paolo Bonzini


On 10/12/2015 17:53, Borislav Petkov wrote:
> Just did, there it splats even when booting the guest, without even
> injecting NMIs:
> 
> [  113.233992] ===
> [  113.238192] [ INFO: suspicious RCU usage. ]
> [  113.242393] 4.4.0-rc4+ #1 Not tainted
> [  113.246056] ---
> [  113.250257] arch/x86/kvm/trace.h:29 suspicious rcu_dereference_check() 
> usage!
> [  113.257400] 
>other info that might help us debug this:
> 
> [  113.265432] 
>RCU used illegally from idle CPU!
>rcu_scheduler_active = 1, debug_locks = 0
> [  113.276321] RCU used illegally from extended quiescent state!
> [  113.282083] 1 lock held by qemu-system-x86/2275:
> [  113.286711]  #0:  (>mutex){+.+.+.}, at: [] 
> vcpu_load+0x1c/0x80 [kvm]
> [  113.295270] 
>stack backtrace:
> [  113.299644] CPU: 4 PID: 2275 Comm: qemu-system-x86 Not tainted 4.4.0-rc4+ 
> #1
> [  113.306706] Hardware name: Dell Inc. Precision T3600/0PTTT9, BIOS A13 
> 05/11/2014
> [  113.314122]  0001 88042f263d28 813c2cfc 
> 880432d53000
> [  113.321575]  88042f263d58 810c5157 88042f268000 
> 
> [  113.329027]   0001 88042f263e10 
> a01ee7c8
> [  113.336483] Call Trace:
> [  113.338939]  [] dump_stack+0x4e/0x82
> [  113.344098]  [] lockdep_rcu_suspicious+0xe7/0x120
> [  113.350385]  [] kvm_arch_vcpu_ioctl_run+0x16f8/0x19c0 
> [kvm]
> [  113.357544]  [] ? kvm_arch_vcpu_ioctl_run+0xd0/0x19c0 
> [kvm]
> [  113.364695]  [] kvm_vcpu_ioctl+0x342/0x700 [kvm]
> [  113.370896]  [] ? __lock_is_held+0x4d/0x70
> [  113.376583]  [] ? __fget+0xfe/0x200
> [  113.381662]  [] do_vfs_ioctl+0x301/0x550
> [  113.387156]  [] ? __fget_light+0x2a/0x90
> [  113.392654]  [] SyS_ioctl+0x41/0x70
> [  113.397739]  [] entry_SYSCALL_64_fastpath+0x16/0x7a
> [  113.755008] kvm: zapping shadow pages for mmio generation wraparound

Wow, this must have been there forever...

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Patch V2 1/2] x86,mce: Basic support to add LMCE support to QEMU

2015-12-10 Thread Ashok Raj
This patch adds basic enumeration, control msr's required to support
Local Machine Check Exception Support (LMCE).

- Added Local Machine Check definitions, changed MCG_CAP
- Added support for IA32_FEATURE_CONTROL.
- When delivering MCE to guest, we deliver to just a single CPU
  when guest OS has opted in to Local delivery.

Signed-off-by: Ashok Raj 
Tested-by: Gong Chen 
---
Resending with proper commit message for second patch

 target-i386/cpu.c |  8 
 target-i386/cpu.h |  8 ++--
 target-i386/kvm.c | 38 +++---
 3 files changed, 45 insertions(+), 9 deletions(-)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 11e5e39..167669a 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -2737,6 +2737,13 @@ static void mce_init(X86CPU *cpu)
 }
 }
 
+static void feature_control_init(X86CPU *cpu)
+{
+   CPUX86State *cenv = >env;
+
+   cenv->msr_ia32_feature_control = ((1<<20) | (1<<0));
+}
+
 #ifndef CONFIG_USER_ONLY
 static void x86_cpu_apic_create(X86CPU *cpu, Error **errp)
 {
@@ -2858,6 +2865,7 @@ static void x86_cpu_realizefn(DeviceState *dev, Error 
**errp)
 #endif
 
 mce_init(cpu);
+feature_control_init(cpu);
 
 #ifndef CONFIG_USER_ONLY
 if (tcg_enabled()) {
diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 84edfd0..a567d7a 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -282,8 +282,9 @@
 
 #define MCG_CTL_P   (1ULL<<8)   /* MCG_CAP register available */
 #define MCG_SER_P   (1ULL<<24) /* MCA recovery/new status bits */
+#define MCG_LMCE_P (1ULL<<27) /* Local Machine Check Supported */
 
-#define MCE_CAP_DEF (MCG_CTL_P|MCG_SER_P)
+#define MCE_CAP_DEF (MCG_CTL_P|MCG_SER_P|MCG_LMCE_P)
 #define MCE_BANKS_DEF   10
 
 #define MCG_CAP_BANKS_MASK 0xff
@@ -291,6 +292,7 @@
 #define MCG_STATUS_RIPV (1ULL<<0)   /* restart ip valid */
 #define MCG_STATUS_EIPV (1ULL<<1)   /* ip points to correct instruction */
 #define MCG_STATUS_MCIP (1ULL<<2)   /* machine check in progress */
+#define MCG_STATUS_LMCE (1ULL<<3)   /* Local MCE signaled */
 
 #define MCI_STATUS_VAL   (1ULL<<63)  /* valid error */
 #define MCI_STATUS_OVER  (1ULL<<62)  /* previous errors lost */
@@ -333,6 +335,7 @@
 #define MSR_MCG_CAP 0x179
 #define MSR_MCG_STATUS  0x17a
 #define MSR_MCG_CTL 0x17b
+#define MSR_MCG_EXT_CTL0x4d0
 
 #define MSR_P6_EVNTSEL0 0x186
 
@@ -892,7 +895,6 @@ typedef struct CPUX86State {
 
 uint64_t mcg_status;
 uint64_t msr_ia32_misc_enable;
-uint64_t msr_ia32_feature_control;
 
 uint64_t msr_fixed_ctr_ctrl;
 uint64_t msr_global_ctrl;
@@ -977,8 +979,10 @@ typedef struct CPUX86State {
 int64_t tsc_khz;
 void *kvm_xsave_buf;
 
+uint64_t msr_ia32_feature_control;
 uint64_t mcg_cap;
 uint64_t mcg_ctl;
+uint64_t mcg_ext_ctl;
 uint64_t mce_banks[MCE_BANKS_DEF*4];
 
 uint64_t tsc_aux;
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 6dc9846..c61fe1f 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -72,6 +72,7 @@ static bool has_msr_tsc_aux;
 static bool has_msr_tsc_adjust;
 static bool has_msr_tsc_deadline;
 static bool has_msr_feature_control;
+static bool has_msr_ext_mcg_ctl;
 static bool has_msr_async_pf_en;
 static bool has_msr_pv_eoi_en;
 static bool has_msr_misc_enable;
@@ -370,18 +371,30 @@ static void kvm_mce_inject(X86CPU *cpu, hwaddr paddr, int 
code)
 uint64_t status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN |
   MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S;
 uint64_t mcg_status = MCG_STATUS_MCIP;
+int flags = 0;
+CPUState *cs = CPU(cpu);
+
+/*
+ * We need to read back the value of MSR_EXT_MCG_CTL that was set by the
+ * guest kernel back into Qemu
+ */
+cpu_synchronize_state(cs);
+
+flags = cpu_x86_support_mca_broadcast(env) ? MCE_INJECT_BROADCAST : 0;
 
 if (code == BUS_MCEERR_AR) {
-status |= MCI_STATUS_AR | 0x134;
-mcg_status |= MCG_STATUS_EIPV;
+   status |= MCI_STATUS_AR | 0x134;
+   mcg_status |= MCG_STATUS_EIPV;
+   if (env->mcg_ext_ctl & 0x1) {
+   mcg_status |= MCG_STATUS_LMCE;
+   flags = 0; /* No Broadcast when LMCE is opted by guest */
+   }
 } else {
 status |= 0xc0;
 mcg_status |= MCG_STATUS_RIPV;
 }
 cpu_x86_inject_mce(NULL, cpu, 9, status, mcg_status, paddr,
-   (MCM_ADDR_PHYS << 6) | 0xc,
-   cpu_x86_support_mca_broadcast(env) ?
-   MCE_INJECT_BROADCAST : 0);
+  (MCM_ADDR_PHYS << 6) | 0xc, flags);
 }
 
 static void hardware_memory_error(void)
@@ -808,10 +821,14 @@ int kvm_arch_init_vcpu(CPUState *cs)
 
 c = cpuid_find_entry(_data.cpuid, 1, 0);
 if (c) {
-has_msr_feature_control = !!(c->ecx & CPUID_EXT_VMX) ||
-   

[Patch V2 2/2] x86, mce: Need to translate GPA to HPA to inject error in guest.

2015-12-10 Thread Ashok Raj
From: Gong Chen 

When we need to test error injection to a specific address using EINJ,
there needs to be a way to translate GPA to HPA. This will allow host EINJ
to inject error to test how guest behavior is when a bad address is consumed.
This permits guest OS to perform its own recovery.

Signed-off-by: Gong Chen 
---
Sorry about the spam :-(.
Resending with proper Commit Message. Previous had a bogus From. Fixed that.
before sending.

 hmp-commands.hx   | 14 ++
 include/exec/memory.h |  2 ++
 kvm-all.c | 24 
 memory.c  | 13 +
 monitor.c | 16 
 5 files changed, 69 insertions(+)
 mode change 100644 => 100755 include/exec/memory.h
 mode change 100644 => 100755 kvm-all.c
 mode change 100644 => 100755 memory.c
 mode change 100644 => 100755 monitor.c

diff --git a/hmp-commands.hx b/hmp-commands.hx
index bb52e4d..673c00e 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -444,6 +444,20 @@ Start gdbserver session (default @var{port}=1234)
 ETEXI
 
 {
+.name = "x-gpa2hva",
+.args_type= "fmt:/,addr:l",
+.params   = "/fmt addr",
+.help = "translate guest physical 'addr' to host virtual 
address, only for debugging",
+.mhandler.cmd = do_gpa2hva,
+},
+
+STEXI
+@item x-gpa2hva @var{addr}
+@findex x-gpa2hva
+Translate guest physical @var{addr} to host virtual address, only for 
debugging.
+ETEXI
+
+{
 .name   = "x",
 .args_type  = "fmt:/,addr:l",
 .params = "/fmt addr",
diff --git a/include/exec/memory.h b/include/exec/memory.h
old mode 100644
new mode 100755
index 0f07159..57d7bf8
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -222,6 +222,7 @@ struct MemoryListener {
hwaddr addr, hwaddr len);
 void (*coalesced_mmio_del)(MemoryListener *listener, MemoryRegionSection 
*section,
hwaddr addr, hwaddr len);
+int  (*translate_gpa2hva)(MemoryListener *listener, uint64_t paddr, 
uint64_t *vaddr);
 /* Lower = earlier (during add), later (during del) */
 unsigned priority;
 AddressSpace *address_space_filter;
@@ -1123,6 +1124,7 @@ void memory_global_dirty_log_start(void);
 void memory_global_dirty_log_stop(void);
 
 void mtree_info(fprintf_function mon_printf, void *f);
+int  memory_translate_gpa2hva(hwaddr paddr, uint64_t *vaddr);
 
 /**
  * memory_region_dispatch_read: perform a read directly to the specified
diff --git a/kvm-all.c b/kvm-all.c
old mode 100644
new mode 100755
index c648b81..cb029be
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -197,6 +197,29 @@ static KVMSlot 
*kvm_lookup_overlapping_slot(KVMMemoryListener *kml,
 return found;
 }
 
+
+static int kvm_translate_gpa2hva(MemoryListener *listener, uint64_t paddr, 
uint64_t *vaddr)
+{
+KVMState *s = kvm_state;
+KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, 
listener);
+KVMSlot *mem = NULL;
+int i;
+
+for (i = 0; i < s->nr_slots; i++) {
+mem = >slots[i];
+if (paddr >= mem->start_addr && paddr < mem->start_addr + 
mem->memory_size) {
+*vaddr = (uint64_t)mem->ram + paddr - mem->start_addr;
+break;
+   }
+}
+
+if (i == s->nr_slots) {
+fprintf(stderr, "fail to find target physical addr(%ld) in KVM memory 
range\n", paddr);
+   return 1;
+}
+return 0;
+}
+
 int kvm_physical_memory_addr_from_host(KVMState *s, void *ram,
hwaddr *phys_addr)
 {
@@ -902,6 +925,7 @@ void kvm_memory_listener_register(KVMState *s, 
KVMMemoryListener *kml,
 kml->listener.log_start = kvm_log_start;
 kml->listener.log_stop = kvm_log_stop;
 kml->listener.log_sync = kvm_log_sync;
+kml->listener.translate_gpa2hva = kvm_translate_gpa2hva;
 kml->listener.priority = 10;
 
 memory_listener_register(>listener, as);
diff --git a/memory.c b/memory.c
old mode 100644
new mode 100755
index e193658..979dcf8
--- a/memory.c
+++ b/memory.c
@@ -2294,6 +2294,19 @@ static const TypeInfo memory_region_info = {
 .instance_finalize  = memory_region_finalize,
 };
 
+int memory_translate_gpa2hva(hwaddr paddr, uint64_t *vaddr){
+MemoryListener *ml = NULL;
+int ret = 1;
+
+QTAILQ_FOREACH(ml, _listeners, link) {
+if(ml->translate_gpa2hva)
+ret = ml->translate_gpa2hva(ml, paddr, vaddr);
+   if(0 == ret)
+   break;
+}
+return ret;
+}
+
 static void memory_register_types(void)
 {
 type_register_static(_region_info);
diff --git a/monitor.c b/monitor.c
old mode 100644
new mode 100755
index 9a35d72..408e1fa
--- a/monitor.c
+++ b/monitor.c
@@ -76,6 +76,7 @@
 #include "qapi-event.h"
 #include "qmp-introspect.h"
 #include "sysemu/block-backend.h"
+#include "exec/memory.h"
 
 /* for hmp_info_irq/pic */
 #if defined(TARGET_SPARC)
@@ -1681,6 +1682,21 @@ static 

Re: [PATCH v3 07/22] arm64: KVM: Implement system register save/restore

2015-12-10 Thread Mario Smarduch
Hi Marc,

On 12/7/2015 2:53 AM, Marc Zyngier wrote:
> Implement the system register save/restore as a direct translation of
> the assembly code version.
> 
> Signed-off-by: Marc Zyngier 
> Reviewed-by: Christoffer Dall 
> ---
>  arch/arm64/kvm/hyp/Makefile|  1 +
>  arch/arm64/kvm/hyp/hyp.h   |  3 ++
>  arch/arm64/kvm/hyp/sysreg-sr.c | 90 
> ++
>  3 files changed, 94 insertions(+)
>  create mode 100644 arch/arm64/kvm/hyp/sysreg-sr.c
> 
> diff --git a/arch/arm64/kvm/hyp/Makefile b/arch/arm64/kvm/hyp/Makefile
> index 455dc0a..ec94200 100644
> --- a/arch/arm64/kvm/hyp/Makefile
> +++ b/arch/arm64/kvm/hyp/Makefile
> @@ -5,3 +5,4 @@
>  obj-$(CONFIG_KVM_ARM_HOST) += vgic-v2-sr.o
>  obj-$(CONFIG_KVM_ARM_HOST) += vgic-v3-sr.o
>  obj-$(CONFIG_KVM_ARM_HOST) += timer-sr.o
> +obj-$(CONFIG_KVM_ARM_HOST) += sysreg-sr.o
> diff --git a/arch/arm64/kvm/hyp/hyp.h b/arch/arm64/kvm/hyp/hyp.h
> index f213e46..778d56d 100644
> --- a/arch/arm64/kvm/hyp/hyp.h
> +++ b/arch/arm64/kvm/hyp/hyp.h
> @@ -38,5 +38,8 @@ void __vgic_v3_restore_state(struct kvm_vcpu *vcpu);
>  void __timer_save_state(struct kvm_vcpu *vcpu);
>  void __timer_restore_state(struct kvm_vcpu *vcpu);
>  
> +void __sysreg_save_state(struct kvm_cpu_context *ctxt);
> +void __sysreg_restore_state(struct kvm_cpu_context *ctxt);
> +
>  #endif /* __ARM64_KVM_HYP_H__ */
>  
> diff --git a/arch/arm64/kvm/hyp/sysreg-sr.c b/arch/arm64/kvm/hyp/sysreg-sr.c
> new file mode 100644
> index 000..add8fcb
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/sysreg-sr.c
> @@ -0,0 +1,90 @@
> +/*
> + * Copyright (C) 2012-2015 - ARM Ltd
> + * Author: Marc Zyngier 
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see .
> + */
> +
> +#include 
> +#include 
> +
> +#include 
> +
> +#include "hyp.h"
> +

I looked closer on some other ways to get better performance out of the
compiler. This code sequence performs about 35% faster for
__sysreg_save_state(..) for 5000 exits you save about 500mS or 100nS per exit.
This is on Juno.

register int volatile count asm("r2") = 0;

do {

} while(count);

I didn't test the restore function (ran out of time) but I suspect it should be
the same. The assembler pretty much uses all the GPRs, (a little too many, using
stp to push 4 pairs on the stack and restore) looking at the assembler it all
should execute out of order.

FWIW I gave this a try since compilers like to optimize loops. I used
'cntpct_el0' counter register to measure the intervals.


> +/* ctxt is already in the HYP VA space */
> +void __hyp_text __sysreg_save_state(struct kvm_cpu_context *ctxt)
> +{
> + ctxt->sys_regs[MPIDR_EL1]   = read_sysreg(vmpidr_el2);
> + ctxt->sys_regs[CSSELR_EL1]  = read_sysreg(csselr_el1);
> + ctxt->sys_regs[SCTLR_EL1]   = read_sysreg(sctlr_el1);
> + ctxt->sys_regs[ACTLR_EL1]   = read_sysreg(actlr_el1);
> + ctxt->sys_regs[CPACR_EL1]   = read_sysreg(cpacr_el1);
> + ctxt->sys_regs[TTBR0_EL1]   = read_sysreg(ttbr0_el1);
> + ctxt->sys_regs[TTBR1_EL1]   = read_sysreg(ttbr1_el1);
> + ctxt->sys_regs[TCR_EL1] = read_sysreg(tcr_el1);
> + ctxt->sys_regs[ESR_EL1] = read_sysreg(esr_el1);
> + ctxt->sys_regs[AFSR0_EL1]   = read_sysreg(afsr0_el1);
> + ctxt->sys_regs[AFSR1_EL1]   = read_sysreg(afsr1_el1);
> + ctxt->sys_regs[FAR_EL1] = read_sysreg(far_el1);
> + ctxt->sys_regs[MAIR_EL1]= read_sysreg(mair_el1);
> + ctxt->sys_regs[VBAR_EL1]= read_sysreg(vbar_el1);
> + ctxt->sys_regs[CONTEXTIDR_EL1]  = read_sysreg(contextidr_el1);
> + ctxt->sys_regs[TPIDR_EL0]   = read_sysreg(tpidr_el0);
> + ctxt->sys_regs[TPIDRRO_EL0] = read_sysreg(tpidrro_el0);
> + ctxt->sys_regs[TPIDR_EL1]   = read_sysreg(tpidr_el1);
> + ctxt->sys_regs[AMAIR_EL1]   = read_sysreg(amair_el1);
> + ctxt->sys_regs[CNTKCTL_EL1] = read_sysreg(cntkctl_el1);
> + ctxt->sys_regs[PAR_EL1] = read_sysreg(par_el1);
> + ctxt->sys_regs[MDSCR_EL1]   = read_sysreg(mdscr_el1);
> +
> + ctxt->gp_regs.regs.sp   = read_sysreg(sp_el0);
> + ctxt->gp_regs.regs.pc   = read_sysreg(elr_el2);
> + ctxt->gp_regs.regs.pstate   = read_sysreg(spsr_el2);
> + ctxt->gp_regs.sp_el1= read_sysreg(sp_el1);
> + ctxt->gp_regs.elr_el1   = 

Re: [PATCH v3 1/4] VSOCK: Introduce virtio-vsock-common.ko

2015-12-10 Thread Stefan Hajnoczi
On Thu, Dec 10, 2015 at 10:17:07AM +, Alex Bennée wrote:
> Stefan Hajnoczi  writes:
> 
> > From: Asias He 
> >
> > This module contains the common code and header files for the following
> > virtio-vsock and virtio-vhost kernel modules.
> 
> General comment checkpatch has a bunch of warnings about 80 character
> limits, extra braces and BUG_ON usage.

Will fix in the next verison.

> > diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> > new file mode 100644
> > index 000..e54eb45
> > --- /dev/null
> > +++ b/include/linux/virtio_vsock.h
> > @@ -0,0 +1,203 @@
> > +/*
> > + * This header, excluding the #ifdef __KERNEL__ part, is BSD licensed so
> > + * anyone can use the definitions to implement compatible
> > drivers/servers:
> 
> Is anything in here actually exposed to userspace or the guest? The
> #ifdef __KERNEL__ statement seems redundant for this file at least.

You are right.  I think the header was copied from a uapi file.

I'll compare against other virtio code and apply an appropriate header.

> > +void virtio_vsock_dumppkt(const char *func,  const struct virtio_vsock_pkt 
> > *pkt)
> > +{
> > +   pr_debug("%s: pkt=%p, op=%d, len=%d, %d:%d---%d:%d, len=%d\n",
> > +func, pkt,
> > +le16_to_cpu(pkt->hdr.op),
> > +le32_to_cpu(pkt->hdr.len),
> > +le32_to_cpu(pkt->hdr.src_cid),
> > +le32_to_cpu(pkt->hdr.src_port),
> > +le32_to_cpu(pkt->hdr.dst_cid),
> > +le32_to_cpu(pkt->hdr.dst_port),
> > +pkt->len);
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_vsock_dumppkt);
> 
> Why export this at all? The only users are in this file so you could
> make it static.

I'll make it static.

> > +u32 virtio_transport_get_credit(struct virtio_transport *trans, u32 credit)
> > +{
> > +   u32 ret;
> > +
> > +   mutex_lock(>tx_lock);
> > +   ret = trans->peer_buf_alloc - (trans->tx_cnt - trans->peer_fwd_cnt);
> > +   if (ret > credit)
> > +   ret = credit;
> > +   trans->tx_cnt += ret;
> > +   mutex_unlock(>tx_lock);
> > +
> > +   pr_debug("%s: ret=%d, buf_alloc=%d, peer_buf_alloc=%d,"
> > +"tx_cnt=%d, fwd_cnt=%d, peer_fwd_cnt=%d\n", __func__,
> 
> I think __func__ is superfluous here as the dynamic print code already
> has it and can print it when required. Having said that there seems to
> be plenty of code already in the kernel that uses __func__ :-/

I'll convert most printks to tracepoints in the next revision.

> > +u64 virtio_transport_get_max_buffer_size(struct vsock_sock *vsk)
> > +{
> > +   struct virtio_transport *trans = vsk->trans;
> > +
> > +   return trans->buf_size_max;
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_get_max_buffer_size);
> 
> All these accesses functions seem pretty simple. Maybe they should be
> inline header functions or even #define macros?

They are used as struct vsock_transport function pointers.  What is the
advantage to inlining them?

> > +int virtio_transport_notify_send_post_enqueue(struct vsock_sock *vsk,
> > +   ssize_t written, struct vsock_transport_send_notify_data *data)
> > +{
> > +   return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_notify_send_post_enqueue);
> 
> This makes me wonder if the calling code should be having
> if(transport->fn) checks rather than filling stuff out will null
> implementations but I guess that's a question better aimed at the
> maintainers.

I've considered it too.  I'll try to streamline this in the next
revision.

> > +/* We are under the virtio-vsock's vsock->rx_lock or
> > + * vhost-vsock's vq->mutex lock */
> > +void virtio_transport_recv_pkt(struct virtio_vsock_pkt *pkt)
> > +{
> > +   struct virtio_transport *trans;
> > +   struct sockaddr_vm src, dst;
> > +   struct vsock_sock *vsk;
> > +   struct sock *sk;
> > +
> > +   vsock_addr_init(, le32_to_cpu(pkt->hdr.src_cid), 
> > le32_to_cpu(pkt->hdr.src_port));
> > +   vsock_addr_init(, le32_to_cpu(pkt->hdr.dst_cid), 
> > le32_to_cpu(pkt->hdr.dst_port));
> > +
> > +   virtio_vsock_dumppkt(__func__, pkt);
> > +
> > +   if (le16_to_cpu(pkt->hdr.type) != VIRTIO_VSOCK_TYPE_STREAM) {
> > +   /* TODO send RST */
> 
> TODO's shouldn't make it into final submissions.
> 
> > +   goto free_pkt;
> > +   }
> > +
> > +   /* The socket must be in connected or bound table
> > +* otherwise send reset back
> > +*/
> > +   sk = vsock_find_connected_socket(, );
> > +   if (!sk) {
> > +   sk = vsock_find_bound_socket();
> > +   if (!sk) {
> > +   pr_debug("%s: can not find bound_socket\n", __func__);
> > +   virtio_vsock_dumppkt(__func__, pkt);
> > +   /* Ignore this pkt instead of sending reset back */
> > +   /* TODO send a RST unless this packet is a RST
> > (to avoid infinite loops) */
> 
> Ditto.

Thanks, I'll complete the RST code in the next revision.


signature.asc
Description: PGP signature


Re: [PATCH v3 2/4] VSOCK: Introduce virtio-vsock.ko

2015-12-10 Thread Stefan Hajnoczi
On Thu, Dec 10, 2015 at 09:23:25PM +, Alex Bennée wrote:
> Stefan Hajnoczi  writes:
> 
> > From: Asias He 
> >
> > VM sockets virtio transport implementation. This module runs in guest
> > kernel.
> 
> checkpatch warns on a bunch of whitespace/tab issues.

Will fix in the next version.

> > +struct virtio_vsock {
> > +   /* Virtio device */
> > +   struct virtio_device *vdev;
> > +   /* Virtio virtqueue */
> > +   struct virtqueue *vqs[VSOCK_VQ_MAX];
> > +   /* Wait queue for send pkt */
> > +   wait_queue_head_t queue_wait;
> > +   /* Work item to send pkt */
> > +   struct work_struct tx_work;
> > +   /* Work item to recv pkt */
> > +   struct work_struct rx_work;
> > +   /* Mutex to protect send pkt*/
> > +   struct mutex tx_lock;
> > +   /* Mutex to protect recv pkt*/
> > +   struct mutex rx_lock;
> 
> Further down I got confused by what lock was what and exactly what was
> being protected. If the receive and transmit paths touch separate things
> it might be worth re-arranging the structure to make it clearer, eg:
> 
>/* The transmit path is protected by tx_lock */
>struct mutex tx_lock;
>struct work_struct tx_work;
>..
>..
> 
>/* The receive path is protected by rx_lock */
>wait_queue_head_t queue_wait;
>..
>..
> 
>  Which might make things a little clearer. Then all the redundant
>  information in the comments can be removed. I don't need to know what
>  is a Virtio device, virtqueue or wait_queue etc as they are implicit in
>  the structure name.

Thanks, that is a nice idea.

> > +   mutex_lock(>tx_lock);
> > +   while ((ret = virtqueue_add_sgs(vq, sgs, out_sg, in_sg, pkt,
> > +   GFP_KERNEL)) < 0) {
> > +   prepare_to_wait_exclusive(>queue_wait, ,
> > + TASK_UNINTERRUPTIBLE);
> > +   mutex_unlock(>tx_lock);
> > +   schedule();
> > +   mutex_lock(>tx_lock);
> > +   finish_wait(>queue_wait, );
> > +   }
> > +   virtqueue_kick(vq);
> > +   mutex_unlock(>tx_lock);
> 
> What are we protecting with tx_lock here? See comments above about
> making the lock usage semantics clearer.

vq (vsock->vqs[VSOCK_VQ_TX]) is being protected.  Concurrent calls to
virtqueue_add_sgs() are not allowed.

> > +
> > +   return pkt_len;
> > +}
> > +
> > +static struct virtio_transport_pkt_ops virtio_ops = {
> > +   .send_pkt = virtio_transport_send_pkt,
> > +};
> > +
> > +static void virtio_vsock_rx_fill(struct virtio_vsock *vsock)
> > +{
> > +   int buf_len = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE;
> > +   struct virtio_vsock_pkt *pkt;
> > +   struct scatterlist hdr, buf, *sgs[2];
> > +   struct virtqueue *vq;
> > +   int ret;
> > +
> > +   vq = vsock->vqs[VSOCK_VQ_RX];
> > +
> > +   do {
> > +   pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
> > +   if (!pkt) {
> > +   pr_debug("%s: fail to allocate pkt\n", __func__);
> > +   goto out;
> > +   }
> > +
> > +   /* TODO: use mergeable rx buffer */
> 
> TODO's should end up in merged code.

Will fix in next revision.

> > +   pkt->buf = kmalloc(buf_len, GFP_KERNEL);
> > +   if (!pkt->buf) {
> > +   pr_debug("%s: fail to allocate pkt->buf\n", __func__);
> > +   goto err;
> > +   }
> > +
> > +   sg_init_one(, >hdr, sizeof(pkt->hdr));
> > +   sgs[0] = 
> > +
> > +   sg_init_one(, pkt->buf, buf_len);
> > +   sgs[1] = 
> > +   ret = virtqueue_add_sgs(vq, sgs, 0, 2, pkt, GFP_KERNEL);
> > +   if (ret)
> > +   goto err;
> > +   vsock->rx_buf_nr++;
> > +   } while (vq->num_free);
> > +   if (vsock->rx_buf_nr > vsock->rx_buf_max_nr)
> > +   vsock->rx_buf_max_nr = vsock->rx_buf_nr;
> > +out:
> > +   virtqueue_kick(vq);
> > +   return;
> > +err:
> > +   virtqueue_kick(vq);
> > +   virtio_transport_free_pkt(pkt);
> 
> You could free the pkt memory at the fail site and just have one exit path.

Okay, I agree the err label is of marginal use.  Let's get rid of it.

> 
> > +   return;
> > +}
> > +
> > +static void virtio_transport_send_pkt_work(struct work_struct *work)
> > +{
> > +   struct virtio_vsock *vsock =
> > +   container_of(work, struct virtio_vsock, tx_work);
> > +   struct virtio_vsock_pkt *pkt;
> > +   bool added = false;
> > +   struct virtqueue *vq;
> > +   unsigned int len;
> > +   struct sock *sk;
> > +
> > +   vq = vsock->vqs[VSOCK_VQ_TX];
> > +   mutex_lock(>tx_lock);
> > +   do {
> 
> You can move the declarations of pkt/len into the do block.

Okay.

> 
> > +   virtqueue_disable_cb(vq);
> > +   while ((pkt = virtqueue_get_buf(vq, )) != NULL) {
> 
> And the sk declaration here

Okay.

> > +static void virtio_transport_recv_pkt_work(struct work_struct *work)
> > +{
> > +   struct virtio_vsock *vsock =
> > +   container_of(work, struct virtio_vsock, rx_work);
> > +   struct 

Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-10 Thread Yang Zhang

On 2015/12/10 19:41, Dr. David Alan Gilbert wrote:

* Yang Zhang (yang.zhang...@gmail.com) wrote:

On 2015/12/10 18:18, Dr. David Alan Gilbert wrote:

* Lan, Tianyu (tianyu@intel.com) wrote:

On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:

I thought about what this is doing at the high level, and I do have some
value in what you are trying to do, but I also think we need to clarify
the motivation a bit more.  What you are saying is not really what the
patches are doing.

And with that clearer understanding of the motivation in mind (assuming
it actually captures a real need), I would also like to suggest some
changes.


Motivation:
Most current solutions for migration with passthough device are based on
the PCI hotplug but it has side affect and can't work for all device.

For NIC device:
PCI hotplug solution can work around Network device migration
via switching VF and PF.

But switching network interface will introduce service down time.

I tested the service down time via putting VF and PV interface
into a bonded interface and ping the bonded interface during plug
and unplug VF.
1) About 100ms when add VF
2) About 30ms when del VF

It also requires guest to do switch configuration. These are hard to
manage and deploy from our customers. To maintain PV performance during
migration, host side also needs to assign a VF to PV device. This
affects scalability.

These factors block SRIOV NIC passthough usage in the cloud service and
OPNFV which require network high performance and stability a lot.


Right, that I'll agree it's hard to do migration of a VM which uses
an SRIOV device; and while I think it should be possible to bond a virtio device
to a VF for networking and then hotplug the SR-IOV device I agree it's hard to 
manage.


For other kind of devices, it's hard to work.
We are also adding migration support for QAT(QuickAssist Technology) device.

QAT device user case introduction.
Server, networking, big data, and storage applications use QuickAssist
Technology to offload servers from handling compute-intensive operations,
such as:
1) Symmetric cryptography functions including cipher operations and
authentication operations
2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
cryptography
3) Compression and decompression functions including DEFLATE and LZS

PCI hotplug will not work for such devices during migration and these
operations will fail when unplug device.


I don't understand that QAT argument; if the device is purely an offload
engine for performance, then why can't you fall back to doing the
same operations in the VM or in QEMU if the card is unavailable?
The tricky bit is dealing with outstanding operations.


So we are trying implementing a new solution which really migrates
device state to target machine and won't affect user during migration
with low service down time.


Right, that's a good aim - the only question is how to do it.

It looks like this is always going to need some device-specific code;
the question I see is whether that's in:
 1) qemu
 2) the host kernel
 3) the guest kernel driver

The objections to this series seem to be that it needs changes to (3);
I can see the worry that the guest kernel driver might not get a chance
to run during the right time in migration and it's painful having to
change every guest driver (although your change is small).

My question is what stage of the migration process do you expect to tell
the guest kernel driver to do this?

 If you do it at the start of the migration, and quiesce the device,
 the migration might take a long time (say 30 minutes) - are you
 intending the device to be quiesced for this long? And where are
 you going to send the traffic?
 If you are, then do you need to do it via this PCI trick, or could
 you just do it via something higher level to quiesce the device.

 Or are you intending to do it just near the end of the migration?
 But then how do we know how long it will take the guest driver to
 respond?


Ideally, it is able to leave guest driver unmodified but it requires the
hypervisor or qemu to aware the device which means we may need a driver in
hypervisor or qemu to handle the device on behalf of guest driver.


Can you answer the question of when do you use your code -
at the start of migration or
just before the end?


Tianyu can answer this question. In my initial design, i prefer to put 
more modifications in hypervisor and Qemu, and the only involvement from 
guest driver is how to restore the state after migration. But I don't 
know the later implementation since i have left Intel.





It would be great if we could avoid changing the guest; but at least your guest
driver changes don't actually seem to be that hardware specific; could your
changes actually be moved to generic PCI level so they could be made
to work for lots of drivers?


It is impossible to use one common solution for all devices unless the PCIE
spec 

Re: [PATCH v6 10/21] KVM: ARM64: Add access handler for PMEVCNTRn and PMCCNTR register

2015-12-10 Thread Shannon Zhao


On 2015/12/10 20:07, Marc Zyngier wrote:
> Hi Shannon,
> 
> On 10/12/15 11:36, Shannon Zhao wrote:
>> > Hi Marc,
>> > 
>> > On 2015/12/9 0:30, Marc Zyngier wrote:
>>> >> On 08/12/15 12:47, Shannon Zhao wrote:
>  From: Shannon Zhao 
> 
>  Since the reset value of PMEVCNTRn or PMCCNTR is UNKNOWN, use
>  reset_unknown for its reset handler. Add access handler which 
>  emulates
>  writing and reading PMEVCNTRn or PMCCNTR register. When reading
>  PMEVCNTRn or PMCCNTR, call perf_event_read_value to get the count 
>  value
>  of the perf event.
> 
>  Signed-off-by: Shannon Zhao 
>  ---
>   arch/arm64/kvm/sys_regs.c | 107 
>  +-
>   1 file changed, 105 insertions(+), 2 deletions(-)
> 
>  diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
>  index c116a1b..f7a73b5 100644
>  --- a/arch/arm64/kvm/sys_regs.c
>  +++ b/arch/arm64/kvm/sys_regs.c
>  @@ -525,6 +525,12 @@ static bool access_pmu_regs(struct kvm_vcpu 
>  *vcpu,
>   
>   if (p->is_write) {
>   switch (r->reg) {
>  +case PMEVCNTR0_EL0 ... PMCCNTR_EL0: {
>>> >> Same problem as previously mentioned.
>>> >>
>  +val = kvm_pmu_get_counter_value(vcpu,
>  +r->reg - 
>  PMEVCNTR0_EL0);
>  +vcpu_sys_reg(vcpu, r->reg) += (s64)p->regval - 
>  val;
>  +break;
>  +}
>> > 
>> > If I use a handler to handle these accesses to PMEVCNTRn and PMCCNTR
>> > like below. It converts the register offset c14_PMEVCNTRn and c9_PMCCNTR
>> > to PMEVCNTRn_EL0 and PMCCNTR_EL0, uniformly uses vcpu_sys_reg and
>> > doesn't need to take care the big endian. What do you think about this?
>> > 
>> > static bool access_pmu_evcntr(struct kvm_vcpu *vcpu,
>> >   struct sys_reg_params *p,
>> >   const struct sys_reg_desc *r)
>> > {
>> > u64 idx, reg, val;
>> > 
>> > if (p->is_aarch32)
>> > reg = r->reg / 2;
> I'd prefer it if you actually decoded the reg itself. Something like:
> 
>   if (p->is_aarch32) {
>   if (r->CRn == 9 && r->CRm == 13)
>   reg = (r->Op2 & 1) ? 0 : PMCCNTR_EL0;
>   if (r->CRn == 14 && (r->CRm & 0xc) == 8) {
>   reg = ((r->CRm & 3) << 2) & (r->Op2 & 7);
>   reg += PMEVCNTR0_EL0;
>   } else {
>   BUG();
>   }
>   } else {
>   
>   }
> 
> And then you can get rid of the c14_PMVCNTR* and c9_PMCCNTR macros.
> The only slightly ugly thing is this 0 value to represent PMXEVTYPER,
> but that's what we already have with your "default" clause below.
> 
Ok, thanks.
>> > else
>> > reg = r->reg;
>> > 
>> > switch (reg) {
>> > case PMEVCNTR0_EL0 ... PMEVCNTR30_EL0: {
>> > idx = reg - PMEVCNTR0_EL0;
>> > break;
>> > }
>> > case PMCCNTR_EL0: {
>> > idx = ARMV8_CYCLE_IDX;
>> > break;
>> > }
>> > default:
>> > idx = vcpu_sys_reg(vcpu, PMSELR_EL0) & ARMV8_COUNTER_MASK;
>> > if (!pmu_counter_idx_valid(vcpu, idx))
>> > return true;
>> > reg = (idx == ARMV8_CYCLE_IDX) ? PMCCNTR_EL0 :
>> > PMEVCNTR0_EL0 + idx;
>> > break;
>> > }
>> > 
>> > val = kvm_pmu_get_counter_value(vcpu, idx);
>> > if (p->is_write)
>> > vcpu_sys_reg(vcpu, reg) = (s64)p->regval - val;
> Maybe I don't have my head screwed in the right way, but as long as
> we're only using u64 quantities, why do we need this s64 cast?
> 
In case p->regval less than val.
For example, at the first time it writes 0x8001 to
vcpu_sys_reg(vcpu, reg) and start a perf event which counts a value
1. Then we want to start the same perf event with writing 0x8001
to vcpu_sys_reg(vcpu, reg) too. At this time, p->regval(0x8001) is
less than val(0x8001 + 1).

Thanks,
-- 
Shannon

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: freeze host when injecting NMIs in the guest, at least in 4.4-rc4+

2015-12-10 Thread Borislav Petkov
On Thu, Dec 10, 2015 at 05:49:10PM +0100, Paolo Bonzini wrote:
> Can you try it on Intel?

Just did, there it splats even when booting the guest, without even
injecting NMIs:

[  113.233992] ===
[  113.238192] [ INFO: suspicious RCU usage. ]
[  113.242393] 4.4.0-rc4+ #1 Not tainted
[  113.246056] ---
[  113.250257] arch/x86/kvm/trace.h:29 suspicious rcu_dereference_check() usage!
[  113.257400] 
   other info that might help us debug this:

[  113.265432] 
   RCU used illegally from idle CPU!
   rcu_scheduler_active = 1, debug_locks = 0
[  113.276321] RCU used illegally from extended quiescent state!
[  113.282083] 1 lock held by qemu-system-x86/2275:
[  113.286711]  #0:  (>mutex){+.+.+.}, at: [] 
vcpu_load+0x1c/0x80 [kvm]
[  113.295270] 
   stack backtrace:
[  113.299644] CPU: 4 PID: 2275 Comm: qemu-system-x86 Not tainted 4.4.0-rc4+ #1
[  113.306706] Hardware name: Dell Inc. Precision T3600/0PTTT9, BIOS A13 
05/11/2014
[  113.314122]  0001 88042f263d28 813c2cfc 
880432d53000
[  113.321575]  88042f263d58 810c5157 88042f268000 

[  113.329027]   0001 88042f263e10 
a01ee7c8
[  113.336483] Call Trace:
[  113.338939]  [] dump_stack+0x4e/0x82
[  113.344098]  [] lockdep_rcu_suspicious+0xe7/0x120
[  113.350385]  [] kvm_arch_vcpu_ioctl_run+0x16f8/0x19c0 [kvm]
[  113.357544]  [] ? kvm_arch_vcpu_ioctl_run+0xd0/0x19c0 [kvm]
[  113.364695]  [] kvm_vcpu_ioctl+0x342/0x700 [kvm]
[  113.370896]  [] ? __lock_is_held+0x4d/0x70
[  113.376583]  [] ? __fget+0xfe/0x200
[  113.381662]  [] do_vfs_ioctl+0x301/0x550
[  113.387156]  [] ? __fget_light+0x2a/0x90
[  113.392654]  [] SyS_ioctl+0x41/0x70
[  113.397739]  [] entry_SYSCALL_64_fastpath+0x16/0x7a
[  113.755008] kvm: zapping shadow pages for mmio generation wraparound

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] kvm: x86: move tracepoints outside extended quiescent state

2015-12-10 Thread Paolo Bonzini
Invoking tracepoints within kvm_guest_enter/kvm_guest_exit causes a
lockdep splat.

Cc: sta...@vger.kernel.org
Reported-by: Borislav Petkov 
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/svm.c | 4 ++--
 arch/x86/kvm/vmx.c | 3 ++-
 arch/x86/kvm/x86.c | 2 +-
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 83a1c643f9a5..899c40f826dd 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3422,6 +3422,8 @@ static int handle_exit(struct kvm_vcpu *vcpu)
struct kvm_run *kvm_run = vcpu->run;
u32 exit_code = svm->vmcb->control.exit_code;
 
+   trace_kvm_exit(exit_code, vcpu, KVM_ISA_SVM);
+
if (!is_cr_intercept(svm, INTERCEPT_CR0_WRITE))
vcpu->arch.cr0 = svm->vmcb->save.cr0;
if (npt_enabled)
@@ -3892,8 +3894,6 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
vcpu->arch.regs[VCPU_REGS_RSP] = svm->vmcb->save.rsp;
vcpu->arch.regs[VCPU_REGS_RIP] = svm->vmcb->save.rip;
 
-   trace_kvm_exit(svm->vmcb->control.exit_code, vcpu, KVM_ISA_SVM);
-
if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI))
kvm_before_handle_nmi(>vcpu);
 
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index af823a388c19..6b5605607849 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -8042,6 +8042,8 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
u32 exit_reason = vmx->exit_reason;
u32 vectoring_info = vmx->idt_vectoring_info;
 
+   trace_kvm_exit(exit_reason, vcpu, KVM_ISA_VMX);
+
/*
 * Flush logged GPAs PML buffer, this will make dirty_bitmap more
 * updated. Another good is, in kvm_vm_ioctl_get_dirty_log, before
@@ -8668,7 +8670,6 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
vmx->loaded_vmcs->launched = 1;
 
vmx->exit_reason = vmcs_read32(VM_EXIT_REASON);
-   trace_kvm_exit(vmx->exit_reason, vcpu, KVM_ISA_VMX);
 
/*
 * the KVM_REQ_EVENT optimization bit is only on for one entry, and if
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eed32283d22c..2a8c035ec7fe 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6515,6 +6515,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
if (req_immediate_exit)
smp_send_reschedule(vcpu->cpu);
 
+   trace_kvm_entry(vcpu->vcpu_id);
__kvm_guest_enter();
 
if (unlikely(vcpu->arch.switch_db_regs)) {
@@ -6527,7 +6528,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_RELOAD;
}
 
-   trace_kvm_entry(vcpu->vcpu_id);
wait_lapic_expire(vcpu);
kvm_x86_ops->run(vcpu);
 
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] kvm: x86: move tracepoints outside extended quiescent state

2015-12-10 Thread Paolo Bonzini
Invoking tracepoints within kvm_guest_enter/kvm_guest_exit causes a
lockdep splat.

Cc: sta...@vger.kernel.org
Reported-by: Borislav Petkov 
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/svm.c | 4 ++--
 arch/x86/kvm/vmx.c | 3 ++-
 arch/x86/kvm/x86.c | 4 ++--
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 83a1c643f9a5..899c40f826dd 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3422,6 +3422,8 @@ static int handle_exit(struct kvm_vcpu *vcpu)
struct kvm_run *kvm_run = vcpu->run;
u32 exit_code = svm->vmcb->control.exit_code;
 
+   trace_kvm_exit(exit_code, vcpu, KVM_ISA_SVM);
+
if (!is_cr_intercept(svm, INTERCEPT_CR0_WRITE))
vcpu->arch.cr0 = svm->vmcb->save.cr0;
if (npt_enabled)
@@ -3892,8 +3894,6 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
vcpu->arch.regs[VCPU_REGS_RSP] = svm->vmcb->save.rsp;
vcpu->arch.regs[VCPU_REGS_RIP] = svm->vmcb->save.rip;
 
-   trace_kvm_exit(svm->vmcb->control.exit_code, vcpu, KVM_ISA_SVM);
-
if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI))
kvm_before_handle_nmi(>vcpu);
 
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index af823a388c19..6b5605607849 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -8042,6 +8042,8 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
u32 exit_reason = vmx->exit_reason;
u32 vectoring_info = vmx->idt_vectoring_info;
 
+   trace_kvm_exit(exit_reason, vcpu, KVM_ISA_VMX);
+
/*
 * Flush logged GPAs PML buffer, this will make dirty_bitmap more
 * updated. Another good is, in kvm_vm_ioctl_get_dirty_log, before
@@ -8668,7 +8670,6 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
vmx->loaded_vmcs->launched = 1;
 
vmx->exit_reason = vmcs_read32(VM_EXIT_REASON);
-   trace_kvm_exit(vmx->exit_reason, vcpu, KVM_ISA_VMX);
 
/*
 * the KVM_REQ_EVENT optimization bit is only on for one entry, and if
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eed32283d22c..b84ba4b17757 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6515,6 +6515,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
if (req_immediate_exit)
smp_send_reschedule(vcpu->cpu);
 
+   trace_kvm_entry(vcpu->vcpu_id);
+   wait_lapic_expire(vcpu);
__kvm_guest_enter();
 
if (unlikely(vcpu->arch.switch_db_regs)) {
@@ -6527,8 +6529,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_RELOAD;
}
 
-   trace_kvm_entry(vcpu->vcpu_id);
-   wait_lapic_expire(vcpu);
kvm_x86_ops->run(vcpu);
 
/*
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-10 Thread Lan, Tianyu



On 12/11/2015 12:11 AM, Michael S. Tsirkin wrote:

On Thu, Dec 10, 2015 at 10:38:32PM +0800, Lan, Tianyu wrote:



On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:

Ideally, it is able to leave guest driver unmodified but it requires the

hypervisor or qemu to aware the device which means we may need a driver in
hypervisor or qemu to handle the device on behalf of guest driver.

Can you answer the question of when do you use your code -
at the start of migration or
just before the end?


Just before stopping VCPU in this version and inject VF mailbox irq to
notify the driver if the irq handler is installed.
Qemu side also will check this via the faked PCI migration capability
and driver will set the status during device open() or resume() callback.


Right, this is the "good path" optimization. Whether this buys anything
as compared to just sending reset to the device when VCPU is stopped
needs to be measured. In any case, we probably do need a way to
interrupt driver on destination to make it reconfigure the device -
otherwise it might take seconds for it to notice.  And a way to make
sure driver can handle this surprise reset so we can block migration if
it can't.



Yes, we need such a way to notify driver about migration status and do
reset or restore operation on the destination machine. My original
design is to take advantage of device's irq to do that. Driver can tell
Qemu that which irq it prefers to handle such task and whether the irq
is enabled or bound with handler. We may discuss the detail in the other
thread.




It would be great if we could avoid changing the guest; but at least your guest
driver changes don't actually seem to be that hardware specific; could your
changes actually be moved to generic PCI level so they could be made
to work for lots of drivers?


It is impossible to use one common solution for all devices unless the PCIE
spec documents it clearly and i think one day it will be there. But before
that, we need some workarounds on guest driver to make it work even it looks
ugly.


Yes, so far there is not hardware migration support


VT-D supports setting dirty bit in the PTE in hardware.


Actually, this doesn't support in the current hardware.
VTD spec documents the dirty bit for first level translation which
requires devices to support DMA request with PASID(process
address space identifier). Most device don't support the feature.




and it's hard to modify
bus level code.


Why is it hard?


As Yang said, the concern is that PCI Spec doesn't document about how to 
do migration.





It also will block implementation on the Windows.


Implementation of what?  We are discussing motivation here, not
implementation.  E.g. windows drivers typically support surprise
removal, should you use that, you get some working code for free.  Just
stop worrying about it.  Make it work, worry about closed source
software later.


Dave


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/5] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

2015-12-10 Thread Ingo Molnar

* Paolo Bonzini  wrote:

> 
> 
> On 10/12/2015 00:12, Andy Lutomirski wrote:
> > From: Andy Lutomirski 
> > 
> > The pvclock vdso code was too abstracted to understand easily and
> > excessively paranoid.  Simplify it for a huge speedup.
> > 
> > This opens the door for additional simplifications, as the vdso no
> > longer accesses the pvti for any vcpu other than vcpu 0.
> > 
> > Before, vclock_gettime using kvm-clock took about 45ns on my machine.
> > With this change, it takes 29ns, which is almost as fast as the pure TSC
> > implementation.
> > 
> > Signed-off-by: Andy Lutomirski 
> > ---
> >  arch/x86/entry/vdso/vclock_gettime.c | 81 
> > 
> >  1 file changed, 46 insertions(+), 35 deletions(-)
> > 
> > diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
> > b/arch/x86/entry/vdso/vclock_gettime.c
> > index ca94fa649251..c325ba1bdddf 100644
> > --- a/arch/x86/entry/vdso/vclock_gettime.c
> > +++ b/arch/x86/entry/vdso/vclock_gettime.c
> > @@ -78,47 +78,58 @@ static notrace const struct pvclock_vsyscall_time_info 
> > *get_pvti(int cpu)
> >  
> >  static notrace cycle_t vread_pvclock(int *mode)
> >  {
> > -   const struct pvclock_vsyscall_time_info *pvti;
> > +   const struct pvclock_vcpu_time_info *pvti = _pvti(0)->pvti;
> > cycle_t ret;
> > -   u64 last;
> > -   u32 version;
> > -   u8 flags;
> > -   unsigned cpu, cpu1;
> > -
> > +   u64 tsc, pvti_tsc;
> > +   u64 last, delta, pvti_system_time;
> > +   u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
> >  
> > /*
> > -* Note: hypervisor must guarantee that:
> > -* 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
> > -* 2. that per-CPU pvclock time info is updated if the
> > -*underlying CPU changes.
> > -* 3. that version is increased whenever underlying CPU
> > -*changes.
> > +* Note: The kernel and hypervisor must guarantee that cpu ID
> > +* number maps 1:1 to per-CPU pvclock time info.
> > +*
> > +* Because the hypervisor is entirely unaware of guest userspace
> > +* preemption, it cannot guarantee that per-CPU pvclock time
> > +* info is updated if the underlying CPU changes or that that
> > +* version is increased whenever underlying CPU changes.
> >  *
> > +* On KVM, we are guaranteed that pvti updates for any vCPU are
> > +* atomic as seen by *all* vCPUs.  This is an even stronger
> > +* guarantee than we get with a normal seqlock.
> > +*
> > +* On Xen, we don't appear to have that guarantee, but Xen still
> > +* supplies a valid seqlock using the version field.
> > +
> > +* We only do pvclock vdso timing at all if
> > +* PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
> > +* mean that all vCPUs have matching pvti and that the TSC is
> > +* synced, so we can just look at vCPU 0's pvti.
> >  */
> > -   do {
> > -   cpu = __getcpu() & VGETCPU_CPU_MASK;
> > -   /* TODO: We can put vcpu id into higher bits of pvti.version.
> > -* This will save a couple of cycles by getting rid of
> > -* __getcpu() calls (Gleb).
> > -*/
> > -
> > -   pvti = get_pvti(cpu);
> > -
> > -   version = __pvclock_read_cycles(>pvti, , );
> > -
> > -   /*
> > -* Test we're still on the cpu as well as the version.
> > -* We could have been migrated just after the first
> > -* vgetcpu but before fetching the version, so we
> > -* wouldn't notice a version change.
> > -*/
> > -   cpu1 = __getcpu() & VGETCPU_CPU_MASK;
> > -   } while (unlikely(cpu != cpu1 ||
> > - (pvti->pvti.version & 1) ||
> > - pvti->pvti.version != version));
> > -
> > -   if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
> > +
> > +   if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
> > *mode = VCLOCK_NONE;
> > +   return 0;
> > +   }
> > +
> > +   do {
> > +   version = pvti->version;
> > +
> > +   /* This is also a read barrier, so we'll read version first. */
> > +   tsc = rdtsc_ordered();
> > +
> > +   pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
> > +   pvti_tsc_shift = pvti->tsc_shift;
> > +   pvti_system_time = pvti->system_time;
> > +   pvti_tsc = pvti->tsc_timestamp;
> > +
> > +   /* Make sure that the version double-check is last. */
> > +   smp_rmb();
> > +   } while (unlikely((version & 1) || version != pvti->version));
> > +
> > +   delta = tsc - pvti_tsc;
> > +   ret = pvti_system_time +
> > +   pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
> > +   pvti_tsc_shift);
> >  
> > /* refer to tsc.c read_tsc() comment for rationale */
> > last = gtod->cycle_last;
> > 
> 
> Reviewed-by: Paolo Bonzini 

Thanks. I've 

Re: live migration vs device assignment (motivation)

2015-12-10 Thread Lan, Tianyu

On 12/10/2015 4:38 PM, Michael S. Tsirkin wrote:

Let's assume you do save state and do have a way to detect
whether state matches a given hardware. For example,
driver could store firmware and hardware versions
in the state, and then on destination, retrieve them
and compare. It will be pretty common that you have a mismatch,
and you must not just fail migration. You need a way to recover,
maybe with more downtime.


Second, you can change the driver but you can not be sure it will have
the chance to run at all. Host overload is a common reason to migrate
out of the host.  You also can not trust guest to do the right thing.
So how long do you want to wait until you decide guest is not
cooperating and kill it?  Most people will probably experiment a bit and
then add a bit of a buffer. This is not robust at all.

Again, maybe you ask driver to save state, and if it does
not respond for a while, then you still migrate,
and driver has to recover on destination.


With the above in mind, you need to support two paths:
1. "good path": driver stores state on source, checks it on destination
detects a match and restores state into the device
2. "bad path": driver does not store state, or detects a mismatch
on destination. driver has to assume device was lost,
and reset it

So what I am saying is, implement bad path first. Then good path
is an optimization - measure whether it's faster, and by how much.



These sound reasonable. Driver should have ability to do such check
to ensure hardware or firmware coherence after migration and reset 
device when migration happens at some unexpected position.




Also, it would be nice if on the bad path there was a way
to switch to another driver entirely, even if that means
a bit more downtime. For example, have a way for driver to
tell Linux it has to re-do probing for the device.


Just glace the code of device core. device_reprobe() does what you said.

/**
 * device_reprobe - remove driver for a device and probe for a new  driver
 * @dev: the device to reprobe
 *
 * This function detaches the attached driver (if any) for the given
 * device and restarts the driver probing process.  It is intended
 * to use if probing criteria changed during a devices lifetime and
 * driver attachment should change accordingly.
 */
int device_reprobe(struct device *dev)





--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-10 Thread Lan, Tianyu



On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:

Ideally, it is able to leave guest driver unmodified but it requires the
>hypervisor or qemu to aware the device which means we may need a driver in
>hypervisor or qemu to handle the device on behalf of guest driver.

Can you answer the question of when do you use your code -
at the start of migration or
just before the end?


Just before stopping VCPU in this version and inject VF mailbox irq to
notify the driver if the irq handler is installed.
Qemu side also will check this via the faked PCI migration capability
and driver will set the status during device open() or resume() callback.




> >It would be great if we could avoid changing the guest; but at least your 
guest
> >driver changes don't actually seem to be that hardware specific; could your
> >changes actually be moved to generic PCI level so they could be made
> >to work for lots of drivers?

>
>It is impossible to use one common solution for all devices unless the PCIE
>spec documents it clearly and i think one day it will be there. But before
>that, we need some workarounds on guest driver to make it work even it looks
>ugly.


Yes, so far there is not hardware migration support and it's hard to 
modify bus level code. It also will block implementation on the Windows.



Dave


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-10 Thread Michael S. Tsirkin
On Thu, Dec 10, 2015 at 10:38:32PM +0800, Lan, Tianyu wrote:
> 
> 
> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
> >>Ideally, it is able to leave guest driver unmodified but it requires the
> >>>hypervisor or qemu to aware the device which means we may need a driver in
> >>>hypervisor or qemu to handle the device on behalf of guest driver.
> >Can you answer the question of when do you use your code -
> >at the start of migration or
> >just before the end?
> 
> Just before stopping VCPU in this version and inject VF mailbox irq to
> notify the driver if the irq handler is installed.
> Qemu side also will check this via the faked PCI migration capability
> and driver will set the status during device open() or resume() callback.

Right, this is the "good path" optimization. Whether this buys anything
as compared to just sending reset to the device when VCPU is stopped
needs to be measured. In any case, we probably do need a way to
interrupt driver on destination to make it reconfigure the device -
otherwise it might take seconds for it to notice.  And a way to make
sure driver can handle this surprise reset so we can block migration if
it can't.

> >
>  >It would be great if we could avoid changing the guest; but at least 
>  >your guest
>  >driver changes don't actually seem to be that hardware specific; could 
>  >your
>  >changes actually be moved to generic PCI level so they could be made
>  >to work for lots of drivers?
> >>>
> >>>It is impossible to use one common solution for all devices unless the PCIE
> >>>spec documents it clearly and i think one day it will be there. But before
> >>>that, we need some workarounds on guest driver to make it work even it 
> >>>looks
> >>>ugly.
> 
> Yes, so far there is not hardware migration support

VT-D supports setting dirty bit in the PTE in hardware.

> and it's hard to modify
> bus level code.

Why is it hard?

> It also will block implementation on the Windows.

Implementation of what?  We are discussing motivation here, not
implementation.  E.g. windows drivers typically support surprise
removal, should you use that, you get some working code for free.  Just
stop worrying about it.  Make it work, worry about closed source
software later.

> >Dave
> >
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] live migration vs device assignment (motivation)

2015-12-10 Thread Dr. David Alan Gilbert
* Lan, Tianyu (tianyu@intel.com) wrote:
> 
> 
> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
> >>Ideally, it is able to leave guest driver unmodified but it requires the
> >>>hypervisor or qemu to aware the device which means we may need a driver in
> >>>hypervisor or qemu to handle the device on behalf of guest driver.
> >Can you answer the question of when do you use your code -
> >at the start of migration or
> >just before the end?
> 
> Just before stopping VCPU in this version and inject VF mailbox irq to
> notify the driver if the irq handler is installed.
> Qemu side also will check this via the faked PCI migration capability
> and driver will set the status during device open() or resume() callback.

OK, hmm - I can see that would work in some cases; but:
   a) It wouldn't work if the guest was paused, the management can pause it 
before
 starting migration or during migration - so you might need to hook the 
pause
 as well;  so that's a bit complicated.

   b) How long does qemu wait for the guest to respond, and what does it do if
  the guest doesn't respond ?  How do we recover?

   c) How much work does the guest need to do at this point?

   d) It would be great if we could find a more generic way of telling the guest
  it's about to migrate rather than via the PCI registers of one device; 
imagine
  what happens if you have a few different devices using SR-IOV, we'd have 
to tell
  them all with separate interrupts.   Perhaps we could use a virtio 
channel or
  an ACPI event or something?

>  >It would be great if we could avoid changing the guest; but at least 
>  >your guest
>  >driver changes don't actually seem to be that hardware specific; could 
>  >your
>  >changes actually be moved to generic PCI level so they could be made
>  >to work for lots of drivers?
> >>>
> >>>It is impossible to use one common solution for all devices unless the PCIE
> >>>spec documents it clearly and i think one day it will be there. But before
> >>>that, we need some workarounds on guest driver to make it work even it 
> >>>looks
> >>>ugly.
> 
> Yes, so far there is not hardware migration support and it's hard to modify
> bus level code. It also will block implementation on the Windows.

Well, there was agraf's trick, although that's a lot more complicated at the 
qemu
level, but it should work with no guest modifications.  Michael's point about
dirty page tracking is neat, I think that simplifies it a bit if it can track 
dirty
pages.

Dave

> >Dave
> >
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: freeze host when injecting NMIs in the guest, at least in 4.4-rc4+

2015-12-10 Thread Borislav Petkov
Yap,

this is clearly a qemu/kvm issue. Lemme remove ext4 folks from CC. So
here's what happens:

I boot a kvm guest, connect to its monitor (qemu is started with
"-monitor pty") and on the monitor I issue a couple of times the "nmi"
command. It doesn't explode immediately but it happens pretty often and
when it happens, the *host*(!) freezes with some nasty corruption, see
below.

Thoughts, suggestions, ideas?

...
[   13.564192] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[  111.524485] kvm: zapping shadow pages for mmio generation wraparound
[  166.414836] EXT4-fs (sdb1): mounted filesystem with ordered data mode. Opts: 
(null)
[  172.408126] kvm: zapping shadow pages for mmio generation wraparound
[  200.912827] BUG: unable to handle kernel NULL pointer dereference at 
  (null)
[  200.920716] IP: [<  (null)>]   (null)
[  200.925796] PGD 0 
[  200.927833] Oops: 0010 [#1] PREEMPT SMP 
[  200.931258] Modules linked in: binfmt_misc ipv6 vfat fat fuse dm_crypt 
dm_mod kvm_amd kvm irqq
[  200.931274] CPU: 0 PID: 3936 Comm: qemu-system-x86 Not tainted 4.4.0-rc4+ #1
[  200.931275] Hardware name: To be filled by O.E.M. To be filled by 
O.E.M./M5A97 EVO R2.0, BIOS3
[  200.931276] task: 880037ba ti: 8804227d8000 task.ti: 
8804227d8000
[  200.931277] RIP: 0010:[<>]  [<  (null)>]   
(null)
[  200.931278] RSP: 0018:8804227dbcb0  EFLAGS: 00010292
[  200.931278] RAX: 880037ba RBX: 0292 RCX: 0003
[  200.931279] RDX: 880037ba0750 RSI:  RDI: 880037ba0001
[  200.931280] RBP: 8804227dbd08 R08: 0001 R09: 0011f89b6000
[  200.931280] R10: 825c4fd0 R11:  R12: 
[  200.931281] R13:  R14: 0001 R15: 
[  200.931282] FS:  7fec0faba700() GS:88042c60() 
knlGS:
[  200.931283] CS:  0010 DS:  ES:  CR0: 8005003b
[  200.931283] CR2:  CR3: b6566000 CR4: 000406f0
[  200.931283] Stack:
[  200.931285]  8121004e 8804  
0292
[  200.931287]  a0305d3c 880414612280 880414612210 
8804227dbe90
[  200.931288]  880414612130 880037ba 8121004e 
8804227dbd90
[  200.931288] Call Trace:
[  200.931293]  [] ? ext4_file_write_iter+0x8e/0x470
[  200.931295]  [] ? ext4_file_write_iter+0x8e/0x470
[  200.931298]  [] mutex_lock_nested+0x66/0x3d0
[  200.931299]  [] ? ext4_file_write_iter+0x8e/0x470
[  200.931301]  [] ? ext4_file_write_iter+0x8e/0x470
[  200.931303]  [] ? __lock_acquire+0x55a/0x1f80
[  200.931305]  [] ext4_file_write_iter+0x8e/0x470
[  200.931307]  [] ? __fget+0x11e/0x210
[  200.931309]  [] ? percpu_down_read+0x44/0x90
[  200.931311]  [] __vfs_write+0xad/0xe0
[  200.931313]  [] vfs_write+0xb5/0x1a0
[  200.931315]  [] SyS_write+0x52/0xc0
[  200.931317]  [] entry_SYSCALL_64_fastpath+0x16/0x6f
[  200.931319] Code:  Bad RIP value.
[  200.931320] RIP  [<  (null)>]   (null)
[  200.931321]  RSP 
[  200.931321] CR2: 
[  200.939353] ---[ end trace 70675f54590d7755 ]---
[  200.939370] note: qemu-system-x86[3936] exited with preempt_count 1


-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html