Re: [PATCH 2/2] KVM: Create debugfs dir and stat files for each VM

2015-11-30 Thread Janosch Frank

On 11/27/2015 09:42 PM, Tyler Baker wrote:

On 27 November 2015 at 10:53, Tyler Baker  wrote:

On 27 November 2015 at 09:08, Tyler Baker  wrote:

On 27 November 2015 at 00:54, Christian Borntraeger
 wrote:

On 11/26/2015 09:47 PM, Christian Borntraeger wrote:

On 11/26/2015 05:17 PM, Tyler Baker wrote:

Hi Christian,

The kernelci.org bot recently has been reporting kvm guest boot
failures[1] on various arm64 platforms in next-20151126. The bot
bisected[2] the failures to the commit in -next titled "KVM: Create
debugfs dir and stat files for each VM". I confirmed by reverting this
commit on top of next-20151126 it resolves the boot issue.

In this test case the host and guest are booted with the same kernel.
The host is booted over nfs, installs qemu (qemu-system arm64 2.4.0),
and launches a guest. The host is booting fine, but when the guest is
launched it errors with "Failed to retrieve host CPU features!". I
checked the host logs, and found an "Unable to handle kernel paging
request" splat[3] which occurs when the guest is attempting to start.

I scanned the patch in question but nothing obvious jumped out at me,
any thoughts?

Not really.
Do you have processes running that read the files in
/sys/kernel/debug/kvm/*?

If I read the ARM oops message correctly, it oopsed inside
__srcu_read_lock. There is actually nothing in there that can oops,
except the access to the preempt count. I am just guessing right now,
but maybe the preempt variable is no longer available (as the process
is gone). As long as a debugfs file is open, we hold a reference to
the kvm, which holds a reference to the mm, so the mm might be killed
after the process. But this is supposed to work, so maybe it's something
different. An objdump of __srcu_read_lock might help.

Hmm, the preempt thing is done in srcu_read_lock, but the crash is in
__srcu_read_lock. This function gets the srcu struct from mmu_notifier.c,
which must be present and is initialized during boot.


int __srcu_read_lock(struct srcu_struct *sp)
{
 int idx;

 idx = READ_ONCE(sp->completed) & 0x1;
 __this_cpu_inc(sp->per_cpu_ref->c[idx]);
 smp_mb(); /* B */  /* Avoid leaking the critical section. */
 __this_cpu_inc(sp->per_cpu_ref->seq[idx]);
 return idx;
}

Looking at the code I have no clue why the patch does make a difference.
Can you try to get an objdump -S for __srcu_read_lock?

Some other interesting finding below...

On the host, I do _not_ have any nodes under /sys/kernel/debug/kvm/

Running strace on the qemu command I use to launch the guest yields
the following.

[pid  5963] 1448649724.405537 mmap(NULL, 65536, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f6652a000
[pid  5963] 1448649724.405586 read(13, "MemTotal:   16414616
kB\nMemF"..., 1024) = 1024
[pid  5963] 1448649724.405699 close(13) = 0
[pid  5963] 1448649724.405755 munmap(0x7f6652a000, 65536) = 0
[pid  5963] 1448649724.405947 brk(0x2552f000) = 0x2552f000
[pid  5963] 1448649724.406148 openat(AT_FDCWD, "/dev/kvm",
O_RDWR|O_CLOEXEC) = 13
[pid  5963] 1448649724.406209 ioctl(13, KVM_CREATE_VM, 0) = -1 ENOMEM
(Cannot allocate memory)

If I comment out the call to kvm_create_vm_debugfs(kvm), the guest boots
fine. I put some printk's in the kvm_create_vm_debugfs() function and
it's returning -ENOMEM after it evaluates !kvm->debugfs_dentry. I was
chatting with some folks from the Linaro virtualization team and they
mentioned that ARM is a bit special: the same PID creates two VMs in
quick succession, the first one is a scratch VM, and the other is the
'real' VM. With that bit of info, I suspect we may be trying to create
the debugfs directory twice, and the second time it's failing because
it already exists.

Cheers,

Tyler


After a quick look into qemu I guess I've found the problem:
kvm_init creates a VM, does checking and self-initialization, and
then calls kvm_arch_init. The arch initialization indirectly
calls kvm_arm_create_scratch_host_vcpu, and that's where the
trouble begins, as it also creates a VM.

My assumption was that nobody would create multiple VMs under
the same PID. Christian and I are working on a solution on the
kernel side.

Cheers
Janosch

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] vhost: replace % with & on data path

2015-11-30 Thread kbuild test robot
Hi Michael,

[auto build test ERROR on: v4.4-rc3]
[also build test ERROR on: next-20151127]

url:
https://github.com/0day-ci/linux/commits/Michael-S-Tsirkin/vhost-replace-with-on-data-path/20151130-163704
config: s390-performance_defconfig (attached as .config)
reproduce:
wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=s390 

All errors (new ones prefixed by >>):

   drivers/vhost/vhost.c: In function 'vhost_get_vq_desc':
   drivers/vhost/vhost.c:1345:6: warning: unused variable 'ret' 
[-Wunused-variable]
 int ret;
 ^
   drivers/vhost/vhost.c:1344:13: warning: unused variable 'ring_head' 
[-Wunused-variable]
 __virtio16 ring_head;
^
   drivers/vhost/vhost.c:1341:24: warning: unused variable 'found' 
[-Wunused-variable]
 unsigned int i, head, found = 0;
   ^
   drivers/vhost/vhost.c:1341:18: warning: unused variable 'head' 
[-Wunused-variable]
 unsigned int i, head, found = 0;
 ^
   drivers/vhost/vhost.c:1341:15: warning: unused variable 'i' 
[-Wunused-variable]
 unsigned int i, head, found = 0;
  ^
   drivers/vhost/vhost.c:1340:20: warning: unused variable 'desc' 
[-Wunused-variable]
 struct vring_desc desc;
   ^
   drivers/vhost/vhost.c: At top level:
   drivers/vhost/vhost.c:1373:2: error: expected identifier or '(' before 'if'
 if (unlikely(__get_user(ring_head,
 ^
   In file included from include/uapi/linux/stddef.h:1:0,
from include/linux/stddef.h:4,
from include/uapi/linux/posix_types.h:4,
from include/uapi/linux/types.h:13,
from include/linux/types.h:5,
from include/uapi/asm-generic/fcntl.h:4,
from arch/s390/include/uapi/asm/fcntl.h:1,
from include/uapi/linux/fcntl.h:4,
from include/linux/fcntl.h:4,
from include/linux/eventfd.h:11,
from drivers/vhost/vhost.c:14:
>> arch/s390/include/asm/uaccess.h:250:2: error: expected identifier or '(' 
>> before ')' token
})
 ^
   include/linux/compiler.h:166:42: note: in definition of macro 'unlikely'
# define unlikely(x) __builtin_expect(!!(x), 0)
 ^
   drivers/vhost/vhost.c:1373:15: note: in expansion of macro '__get_user'
 if (unlikely(__get_user(ring_head,
  ^
   drivers/vhost/vhost.c:1381:2: warning: data definition has no type or 
storage class
 head = vhost16_to_cpu(vq, ring_head);
 ^
   drivers/vhost/vhost.c:1381:2: error: type defaults to 'int' in declaration 
of 'head' [-Werror=implicit-int]
   drivers/vhost/vhost.c:1381:24: error: 'vq' undeclared here (not in a 
function)
 head = vhost16_to_cpu(vq, ring_head);
   ^
   drivers/vhost/vhost.c:1381:28: error: 'ring_head' undeclared here (not in a 
function)
 head = vhost16_to_cpu(vq, ring_head);
   ^
   drivers/vhost/vhost.c:1384:2: error: expected identifier or '(' before 'if'
 if (unlikely(head >= vq->num)) {
 ^
   drivers/vhost/vhost.c:1391:2: warning: data definition has no type or 
storage class
 *out_num = *in_num = 0;
 ^
   drivers/vhost/vhost.c:1391:3: error: type defaults to 'int' in declaration 
of 'out_num' [-Werror=implicit-int]
 *out_num = *in_num = 0;
  ^
   drivers/vhost/vhost.c:1391:14: error: 'in_num' undeclared here (not in a 
function)
 *out_num = *in_num = 0;
 ^
   drivers/vhost/vhost.c:1392:2: error: expected identifier or '(' before 'if'
 if (unlikely(log))
 ^
   drivers/vhost/vhost.c:1395:2: warning: data definition has no type or 
storage class
 i = head;
 ^
   drivers/vhost/vhost.c:1395:2: error: type defaults to 'int' in declaration 
of 'i' [-Werror=implicit-int]
   drivers/vhost/vhost.c:1395:2: error: initializer element is not constant
   drivers/vhost/vhost.c:1396:2: error: expected identifier or '(' before 'do'
 do {
 ^
   drivers/vhost/vhost.c:1454:4: error: expected identifier or '(' before 
'while'
 } while ((i = next_desc(vq, &desc)) != -1);
   ^
   drivers/vhost/vhost.c:1457:4: error: expected '=', ',', ';', 'asm' or 
'__attribute__' before '->' token
 vq->last_avail_idx++;
   ^
   In file included from arch/s390/include/asm/bug.h:69:0,
from include/linux/bug.h:4,
from include/linux/thread_info.h:11,
from include/asm-generic/preempt.h:4,
from arch/s390/include/generated/asm/preempt.h:1,
from include/linux/preempt.h:59,
from include/linux/spinlock.h:50,
 

Re: [PATCH] vhost: replace % with & on data path

2015-11-30 Thread kbuild test robot
Hi Michael,

[auto build test ERROR on: v4.4-rc3]
[also build test ERROR on: next-20151127]

url:
https://github.com/0day-ci/linux/commits/Michael-S-Tsirkin/vhost-replace-with-on-data-path/20151130-163704
config: i386-randconfig-s1-201548 (attached as .config)
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All error/warnings (new ones prefixed by >>):

   drivers/vhost/vhost.c: In function 'vhost_get_vq_desc':
>> drivers/vhost/vhost.c:1345:6: warning: unused variable 'ret' 
>> [-Wunused-variable]
 int ret;
 ^
>> drivers/vhost/vhost.c:1344:13: warning: unused variable 'ring_head' 
>> [-Wunused-variable]
 __virtio16 ring_head;
^
>> drivers/vhost/vhost.c:1341:24: warning: unused variable 'found' 
>> [-Wunused-variable]
 unsigned int i, head, found = 0;
   ^
>> drivers/vhost/vhost.c:1341:18: warning: unused variable 'head' 
>> [-Wunused-variable]
 unsigned int i, head, found = 0;
 ^
>> drivers/vhost/vhost.c:1341:15: warning: unused variable 'i' 
>> [-Wunused-variable]
 unsigned int i, head, found = 0;
  ^
>> drivers/vhost/vhost.c:1340:20: warning: unused variable 'desc' 
>> [-Wunused-variable]
 struct vring_desc desc;
   ^
   drivers/vhost/vhost.c: At top level:
>> drivers/vhost/vhost.c:1373:2: error: expected identifier or '(' before 'if'
 if (unlikely(__get_user(ring_head,
 ^
   In file included from include/uapi/linux/stddef.h:1:0,
from include/linux/stddef.h:4,
from include/uapi/linux/posix_types.h:4,
from include/uapi/linux/types.h:13,
from include/linux/types.h:5,
from include/uapi/asm-generic/fcntl.h:4,
from arch/x86/include/uapi/asm/fcntl.h:1,
from include/uapi/linux/fcntl.h:4,
from include/linux/fcntl.h:4,
from include/linux/eventfd.h:11,
from drivers/vhost/vhost.c:14:
>> arch/x86/include/asm/uaccess.h:414:2: error: expected identifier or '(' 
>> before ')' token
})
 ^
   include/linux/compiler.h:137:45: note: in definition of macro 'unlikely'
#  define unlikely(x) (__builtin_constant_p(x) ? !!(x) : 
__branch_check__(x, 0))
^
   arch/x86/include/asm/uaccess.h:479:2: note: in expansion of macro 
'__get_user_nocheck'
 __get_user_nocheck((x), (ptr), sizeof(*(ptr)))
 ^
>> drivers/vhost/vhost.c:1373:15: note: in expansion of macro '__get_user'
 if (unlikely(__get_user(ring_head,
  ^
>> arch/x86/include/asm/uaccess.h:414:2: error: expected identifier or '(' 
>> before ')' token
})
 ^
   include/linux/compiler.h:137:53: note: in definition of macro 'unlikely'
#  define unlikely(x) (__builtin_constant_p(x) ? !!(x) : 
__branch_check__(x, 0))
^
   arch/x86/include/asm/uaccess.h:479:2: note: in expansion of macro 
'__get_user_nocheck'
 __get_user_nocheck((x), (ptr), sizeof(*(ptr)))
 ^
>> drivers/vhost/vhost.c:1373:15: note: in expansion of macro '__get_user'
 if (unlikely(__get_user(ring_head,
  ^
>> include/linux/compiler.h:126:4: error: expected identifier or '(' before ')' 
>> token
  })
   ^
   include/linux/compiler.h:137:58: note: in expansion of macro 
'__branch_check__'
#  define unlikely(x) (__builtin_constant_p(x) ? !!(x) : 
__branch_check__(x, 0))
 ^
>> drivers/vhost/vhost.c:1373:6: note: in expansion of macro 'unlikely'
 if (unlikely(__get_user(ring_head,
 ^
>> drivers/vhost/vhost.c:1381:2: warning: data definition has no type or 
>> storage class
 head = vhost16_to_cpu(vq, ring_head);
 ^
>> drivers/vhost/vhost.c:1381:2: error: type defaults to 'int' in declaration 
>> of 'head' [-Werror=implicit-int]
>> drivers/vhost/vhost.c:1381:24: error: 'vq' undeclared here (not in a 
>> function)
 head = vhost16_to_cpu(vq, ring_head);
   ^
>> drivers/vhost/vhost.c:1381:28: error: 'ring_head' undeclared here (not in a 
>> function)
 head = vhost16_to_cpu(vq, ring_head);
   ^
   drivers/vhost/vhost.c:1384:2: error: expected identifier or '(' before 'if'
 if (unlikely(head >= vq->num)) {
 ^
   In file included from include/uapi/linux/stddef.h:1:0,
from include/linux/stddef.h:4,
from include/uapi/linux/posix_types.h:4,
from include/uapi/linux/types.h:13,
from include/linux/types.h:5,
from include/u

Re: [PATCH] vhost: replace % with & on data path

2015-11-30 Thread Michael S. Tsirkin
On Mon, Nov 30, 2015 at 10:34:07AM +0200, Michael S. Tsirkin wrote:
> We know vring num is a power of 2, so use &
> to mask the high bits.
> 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  drivers/vhost/vhost.c | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 080422f..85f0f0a 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1366,10 +1366,12 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>   /* Only get avail ring entries after they have been exposed by guest. */
>   smp_rmb();
>  
> + }
> +

Oops. This sneaked in from an unrelated patch.
Pls ignore, will repost.

>   /* Grab the next descriptor number they're advertising, and increment
>* the index we've seen. */
>   if (unlikely(__get_user(ring_head,
> - &vq->avail->ring[last_avail_idx % vq->num]))) {
> + &vq->avail->ring[last_avail_idx & (vq->num - 1)]))) {
>   vq_err(vq, "Failed to read head: idx %d address %p\n",
>  last_avail_idx,
>  &vq->avail->ring[last_avail_idx % vq->num]);
> @@ -1489,7 +1491,7 @@ static int __vhost_add_used_n(struct vhost_virtqueue 
> *vq,
>   u16 old, new;
>   int start;
>  
> - start = vq->last_used_idx % vq->num;
> + start = vq->last_used_idx & (vq->num - 1);
>   used = vq->used->ring + start;
>   if (count == 1) {
>   if (__put_user(heads[0].id, &used->id)) {
> @@ -1531,7 +1533,7 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct 
> vring_used_elem *heads,
>  {
>   int start, n, r;
>  
> - start = vq->last_used_idx % vq->num;
> + start = vq->last_used_idx & (vq->num - 1);
>   n = vq->num - start;
>   if (n < count) {
>   r = __vhost_add_used_n(vq, heads, n);
> -- 
> MST


Re: [PATCH] vhost: replace % with & on data path

2015-11-30 Thread Michael S. Tsirkin
On Mon, Nov 30, 2015 at 12:42:49AM -0800, Joe Perches wrote:
> On Mon, 2015-11-30 at 10:34 +0200, Michael S. Tsirkin wrote:
> > We know vring num is a power of 2, so use &
> > to mask the high bits.
> []
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> []
> > @@ -1366,10 +1366,12 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
> > /* Only get avail ring entries after they have been exposed by guest. */
> > smp_rmb();
> >  
> > +   }
> 
> ?

Yes, I noticed this - I moved this chunk from the next patch
in my tree by mistake.

Will fix, thanks!

-- 
MST


Re: [PATCH v2] vhost: replace % with & on data path

2015-11-30 Thread Christian Borntraeger
On 11/30/2015 10:15 AM, Michael S. Tsirkin wrote:
> We know vring num is a power of 2, so use &
> to mask the high bits.

Makes a lot of sense and virtio_ring.c seems to use the same logic.

Acked-by: Christian Borntraeger 

> 
> Signed-off-by: Michael S. Tsirkin 
> ---
> 
> Changes from v1: drop an unrelated chunk
> 
>  drivers/vhost/vhost.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 080422f..ad2146a 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1369,7 +1369,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>   /* Grab the next descriptor number they're advertising, and increment
>* the index we've seen. */
>   if (unlikely(__get_user(ring_head,
> - &vq->avail->ring[last_avail_idx % vq->num]))) {
> + &vq->avail->ring[last_avail_idx & (vq->num - 1)]))) {
>   vq_err(vq, "Failed to read head: idx %d address %p\n",
>  last_avail_idx,
>  &vq->avail->ring[last_avail_idx % vq->num]);
> @@ -1489,7 +1489,7 @@ static int __vhost_add_used_n(struct vhost_virtqueue 
> *vq,
>   u16 old, new;
>   int start;
> 
> - start = vq->last_used_idx % vq->num;
> + start = vq->last_used_idx & (vq->num - 1);
>   used = vq->used->ring + start;
>   if (count == 1) {
>   if (__put_user(heads[0].id, &used->id)) {
> @@ -1531,7 +1531,7 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct 
> vring_used_elem *heads,
>  {
>   int start, n, r;
> 
> - start = vq->last_used_idx % vq->num;
> + start = vq->last_used_idx & (vq->num - 1);
>   n = vq->num - start;
>   if (n < count) {
>   r = __vhost_add_used_n(vq, heads, n);
> 



Re: [PATCH v2 04/21] arm64: KVM: Implement vgic-v3 save/restore

2015-11-30 Thread Alex Bennée

Marc Zyngier  writes:

> Implement the vgic-v3 save restore as a direct translation of
> the assembly code version.
>
> Signed-off-by: Marc Zyngier 
> ---
>  arch/arm64/kvm/hyp/Makefile |   1 +
>  arch/arm64/kvm/hyp/hyp.h|   3 +
>  arch/arm64/kvm/hyp/vgic-v3-sr.c | 222 
> 
>  3 files changed, 226 insertions(+)
>  create mode 100644 arch/arm64/kvm/hyp/vgic-v3-sr.c
>
> diff --git a/arch/arm64/kvm/hyp/Makefile b/arch/arm64/kvm/hyp/Makefile
> index d8d5968..d1e38ce 100644
> --- a/arch/arm64/kvm/hyp/Makefile
> +++ b/arch/arm64/kvm/hyp/Makefile
> @@ -3,3 +3,4 @@
>  #
>
>  obj-$(CONFIG_KVM_ARM_HOST) += vgic-v2-sr.o
> +obj-$(CONFIG_KVM_ARM_HOST) += vgic-v3-sr.o
> diff --git a/arch/arm64/kvm/hyp/hyp.h b/arch/arm64/kvm/hyp/hyp.h
> index 78f25c4..a31cb6e 100644
> --- a/arch/arm64/kvm/hyp/hyp.h
> +++ b/arch/arm64/kvm/hyp/hyp.h
> @@ -30,5 +30,8 @@
>  void __vgic_v2_save_state(struct kvm_vcpu *vcpu);
>  void __vgic_v2_restore_state(struct kvm_vcpu *vcpu);
>
> +void __vgic_v3_save_state(struct kvm_vcpu *vcpu);
> +void __vgic_v3_restore_state(struct kvm_vcpu *vcpu);
> +
>  #endif /* __ARM64_KVM_HYP_H__ */
>
> diff --git a/arch/arm64/kvm/hyp/vgic-v3-sr.c b/arch/arm64/kvm/hyp/vgic-v3-sr.c
> new file mode 100644
> index 000..b490db5
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/vgic-v3-sr.c
> @@ -0,0 +1,222 @@
> +/*
> + * Copyright (C) 2012-2015 - ARM Ltd
> + * Author: Marc Zyngier 
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see .
> + */
> +
> +#include 
> +#include 

This include starts spitting out compiler warnings due to use of
undefined barrier primitives. I'm not sure where the best place to:

 #include 

is. I added it to:

  arch/arm64/include/asm/arch_gicv3.h

> +#include 
> +
> +#include 
> +
> +#include "hyp.h"
> +
> +/*
> + * We store LRs in reverse order to let the CPU deal with streaming
> + * access. Use this macro to make it look saner...
> + */
> +#define LR_OFFSET(n) (15 - n)
> +
> +#define read_gicreg(r)   
> \
> + ({  \
> + u64 reg;\
> + asm volatile("mrs_s %0, " __stringify(r) : "=r" (reg)); \
> + reg;\
> + })
> +
> +#define write_gicreg(v,r)\
> + do {\
> + u64 __val = (v);\
> + asm volatile("msr_s " __stringify(r) ", %0" : : "r" (__val));\
> + } while (0)
> +
> +/* vcpu is already in the HYP VA space */
> +void __hyp_text __vgic_v3_save_state(struct kvm_vcpu *vcpu)
> +{
> + struct vgic_v3_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v3;
> + u64 val;
> + u32 nr_lr, nr_pri;
> +
> + /*
> +  * Make sure stores to the GIC via the memory mapped interface
> +  * are now visible to the system register interface.
> +  */
> + dsb(st);
> +
> + cpu_if->vgic_vmcr  = read_gicreg(ICH_VMCR_EL2);
> + cpu_if->vgic_misr  = read_gicreg(ICH_MISR_EL2);
> + cpu_if->vgic_eisr  = read_gicreg(ICH_EISR_EL2);
> + cpu_if->vgic_elrsr = read_gicreg(ICH_ELSR_EL2);
> +
> + write_gicreg(0, ICH_HCR_EL2);
> + val = read_gicreg(ICH_VTR_EL2);
> + nr_lr = val & 0xf;
> + nr_pri = ((u32)val >> 29) + 1;
> +
> + switch (nr_lr) {
> + case 15:
> + cpu_if->vgic_lr[LR_OFFSET(15)] = read_gicreg(ICH_LR15_EL2);
> + case 14:
> + cpu_if->vgic_lr[LR_OFFSET(14)] = read_gicreg(ICH_LR14_EL2);
> + case 13:
> + cpu_if->vgic_lr[LR_OFFSET(13)] = read_gicreg(ICH_LR13_EL2);
> + case 12:
> + cpu_if->vgic_lr[LR_OFFSET(12)] = read_gicreg(ICH_LR12_EL2);
> + case 11:
> + cpu_if->vgic_lr[LR_OFFSET(11)] = read_gicreg(ICH_LR11_EL2);
> + case 10:
> + cpu_if->vgic_lr[LR_OFFSET(10)] = read_gicreg(ICH_LR10_EL2);
> + case 9:
> + cpu_if->vgic_lr[LR_OFFSET(9)] = read_gicreg(ICH_LR9_EL2);
> + case 8:
> + cpu_if->vgic_lr[LR_OFFSET(8)] = read_gicreg(ICH_LR8_EL2);
> + case 7:
> + cpu_if->vgic_lr[LR_OFFSET(7)] = read_gicreg(ICH_LR7_EL2);
> + case 6:
> + 

Re: [PATCH v8 3/5] nvdimm acpi: build ACPI NFIT table

2015-11-30 Thread Michael S. Tsirkin
On Mon, Nov 16, 2015 at 06:51:01PM +0800, Xiao Guangrong wrote:
> NFIT is defined in ACPI 6.0: 5.2.25 NVDIMM Firmware Interface Table (NFIT)
> 
> Currently, we only support PMEM mode. Each device has 3 structures:
> - SPA structure, defines the PMEM region info
> 
> - MEM DEV structure, it has the @handle which is used to associate the
>   specified ACPI NVDIMM device we will introduce in a later patch.
>   Also we can happily ignore the memory device's interleave; the real
>   nvdimm hardware access is hidden behind the host
> 
> - DCR structure, it defines vendor ID used to associate specified vendor
>   nvdimm driver. Since we only implement PMEM mode this time, Command
>   window and Data window are not needed
> 
> The NVDIMM functionality is controlled by the parameter 'nvdimm-support',
> introduced for PIIX4_PM and ICH9-LPC; it is true by default and false
> on 2.4 and earlier versions to keep compatibility

Will need to make it false on 2.5 too.

Isn't there a device that needs to be created for this
to work?  It would be cleaner to just key off
the device presence; then we don't need compat gunk,
and further, people not using it don't get a
bunch of unused AML.


> Signed-off-by: Xiao Guangrong 
> ---
>  default-configs/i386-softmmu.mak   |   1 +
>  default-configs/x86_64-softmmu.mak |   1 +
>  hw/acpi/Makefile.objs  |   1 +
>  hw/acpi/ich9.c |  19 ++
>  hw/acpi/nvdimm.c   | 382 
> +
>  hw/acpi/piix4.c|   4 +
>  hw/i386/acpi-build.c   |   6 +
>  include/hw/acpi/ich9.h |   3 +
>  include/hw/i386/pc.h   |  12 +-
>  include/hw/mem/nvdimm.h|  12 ++
>  10 files changed, 440 insertions(+), 1 deletion(-)
>  create mode 100644 hw/acpi/nvdimm.c
> 
> diff --git a/default-configs/i386-softmmu.mak 
> b/default-configs/i386-softmmu.mak
> index 4c79d3b..53fb517 100644
> --- a/default-configs/i386-softmmu.mak
> +++ b/default-configs/i386-softmmu.mak
> @@ -47,6 +47,7 @@ CONFIG_IOAPIC=y
>  CONFIG_PVPANIC=y
>  CONFIG_MEM_HOTPLUG=y
>  CONFIG_NVDIMM=y
> +CONFIG_ACPI_NVDIMM=y
>  CONFIG_XIO3130=y
>  CONFIG_IOH3420=y
>  CONFIG_I82801B11=y
> diff --git a/default-configs/x86_64-softmmu.mak 
> b/default-configs/x86_64-softmmu.mak
> index e42d2fc..766c27c 100644
> --- a/default-configs/x86_64-softmmu.mak
> +++ b/default-configs/x86_64-softmmu.mak
> @@ -47,6 +47,7 @@ CONFIG_IOAPIC=y
>  CONFIG_PVPANIC=y
>  CONFIG_MEM_HOTPLUG=y
>  CONFIG_NVDIMM=y
> +CONFIG_ACPI_NVDIMM=y
>  CONFIG_XIO3130=y
>  CONFIG_IOH3420=y
>  CONFIG_I82801B11=y
> diff --git a/hw/acpi/Makefile.objs b/hw/acpi/Makefile.objs
> index 7d3230c..095597f 100644
> --- a/hw/acpi/Makefile.objs
> +++ b/hw/acpi/Makefile.objs
> @@ -2,6 +2,7 @@ common-obj-$(CONFIG_ACPI_X86) += core.o piix4.o pcihp.o
>  common-obj-$(CONFIG_ACPI_X86_ICH) += ich9.o tco.o
>  common-obj-$(CONFIG_ACPI_CPU_HOTPLUG) += cpu_hotplug.o
>  common-obj-$(CONFIG_ACPI_MEMORY_HOTPLUG) += memory_hotplug.o
> +common-obj-$(CONFIG_ACPI_NVDIMM) += nvdimm.o
>  common-obj-$(CONFIG_ACPI) += acpi_interface.o
>  common-obj-$(CONFIG_ACPI) += bios-linker-loader.o
>  common-obj-$(CONFIG_ACPI) += aml-build.o
> diff --git a/hw/acpi/ich9.c b/hw/acpi/ich9.c
> index 1c7fcfa..275796f 100644
> --- a/hw/acpi/ich9.c
> +++ b/hw/acpi/ich9.c
> @@ -307,6 +307,20 @@ static void ich9_pm_set_memory_hotplug_support(Object 
> *obj, bool value,
>  s->pm.acpi_memory_hotplug.is_enabled = value;
>  }
>  
> +static bool ich9_pm_get_nvdimm_support(Object *obj, Error **errp)
> +{
> +ICH9LPCState *s = ICH9_LPC_DEVICE(obj);
> +
> +return s->pm.nvdimm_acpi_state.is_enabled;
> +}
> +
> +static void ich9_pm_set_nvdimm_support(Object *obj, bool value, Error **errp)
> +{
> +ICH9LPCState *s = ICH9_LPC_DEVICE(obj);
> +
> +s->pm.nvdimm_acpi_state.is_enabled = value;
> +}
> +
>  static void ich9_pm_get_disable_s3(Object *obj, Visitor *v,
> void *opaque, const char *name,
> Error **errp)
> @@ -404,6 +418,7 @@ void ich9_pm_add_properties(Object *obj, ICH9LPCPMRegs 
> *pm, Error **errp)
>  {
>  static const uint32_t gpe0_len = ICH9_PMIO_GPE0_LEN;
>  pm->acpi_memory_hotplug.is_enabled = true;
> +pm->nvdimm_acpi_state.is_enabled = true;
>  pm->disable_s3 = 0;
>  pm->disable_s4 = 0;
>  pm->s4_val = 2;
> @@ -419,6 +434,10 @@ void ich9_pm_add_properties(Object *obj, ICH9LPCPMRegs 
> *pm, Error **errp)
>   ich9_pm_get_memory_hotplug_support,
>   ich9_pm_set_memory_hotplug_support,
>   NULL);
> +object_property_add_bool(obj, "nvdimm-support",
> + ich9_pm_get_nvdimm_support,
> + ich9_pm_set_nvdimm_support,
> + NULL);
>  object_property_add(obj, ACPI_PM_PROP_S3_DISABLED, "uint8",
>  

Re: [PATCH v8 4/5] nvdimm acpi: build ACPI nvdimm devices

2015-11-30 Thread Michael S. Tsirkin
On Mon, Nov 16, 2015 at 06:51:02PM +0800, Xiao Guangrong wrote:
> NVDIMM devices are defined in ACPI 6.0, 9.20 NVDIMM Devices
> 
> There is a root device under \_SB, and the specified NVDIMM devices are under
> the root device. Each NVDIMM device has _ADR, which returns its handle used to
> associate the MEMDEV structure in NFIT
> 
> Currently we do not support any function on _DSM; that means NVDIMM
> label data is not supported yet
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  hw/acpi/nvdimm.c | 85 
> 
>  1 file changed, 85 insertions(+)
> 
> diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
> index 98c004d..abe0daa 100644
> --- a/hw/acpi/nvdimm.c
> +++ b/hw/acpi/nvdimm.c
> @@ -367,6 +367,90 @@ static void nvdimm_build_nfit(GSList *device_list, 
> GArray *table_offsets,
>  g_array_free(structures, true);
>  }
>  
> +static void nvdimm_build_common_dsm(Aml *root_dev)
> +{
> +Aml *method, *ifctx, *function;
> +uint8_t byte_list[1];
> +
> +method = aml_method("NCAL", 4);

This "NCAL" needs a define as it's used
in multiple places. It's really just a DSM
implementation, right? Reflect this in the macro
name.

> +{

What's this doing?

> +function = aml_arg(2);
> +
> +/*
> + * function 0 is called to inquire what functions are supported by
> + * OSPM
> + */
> +ifctx = aml_if(aml_equal(function, aml_int(0)));
> +byte_list[0] = 0 /* No function Supported */;
> +aml_append(ifctx, aml_return(aml_buffer(1, byte_list)));
> +aml_append(method, ifctx);
> +
> +/* No function is supported yet. */
> +byte_list[0] = 1 /* Not Supported */;
> +aml_append(method, aml_return(aml_buffer(1, byte_list)));
> +}
> +aml_append(root_dev, method);
> +}
> +
> +static void nvdimm_build_nvdimm_devices(GSList *device_list, Aml *root_dev)
> +{
> +for (; device_list; device_list = device_list->next) {
> +DeviceState *dev = device_list->data;
> +int slot = object_property_get_int(OBJECT(dev), PC_DIMM_SLOT_PROP,
> +   NULL);
> +uint32_t handle = nvdimm_slot_to_handle(slot);
> +Aml *nvdimm_dev, *method;
> +
> +nvdimm_dev = aml_device("NV%02X", slot);
> +aml_append(nvdimm_dev, aml_name_decl("_ADR", aml_int(handle)));
> +
> +method = aml_method("_DSM", 4);
> +{
> +aml_append(method, aml_return(aml_call4("NCAL", aml_arg(0),
> +   aml_arg(1), aml_arg(2), aml_arg(3))));
> +}
> +aml_append(nvdimm_dev, method);
> +
> +aml_append(root_dev, nvdimm_dev);
> +}
> +}
> +
> +static void nvdimm_build_ssdt(GSList *device_list, GArray *table_offsets,
> +  GArray *table_data, GArray *linker)
> +{
> +Aml *ssdt, *sb_scope, *dev, *method;
> +
> +acpi_add_table(table_offsets, table_data);
> +
> +ssdt = init_aml_allocator();
> +acpi_data_push(ssdt->buf, sizeof(AcpiTableHeader));
> +
> +sb_scope = aml_scope("\\_SB");
> +
> +dev = aml_device("NVDR");
> +aml_append(dev, aml_name_decl("_HID", aml_string("ACPI0012")));

Pls add a comment explaining that ACPI0012 is NVDIMM root device.

Also - this will now appear for all users, e.g.
windows guests will prompt users for a driver.
Not nice if user didn't actually ask for nvdimm.

A simple solution is to default this functionality
to off by default.

> +
> +nvdimm_build_common_dsm(dev);
> +method = aml_method("_DSM", 4);
> +{
> +aml_append(method, aml_return(aml_call4("NCAL", aml_arg(0),
> +   aml_arg(1), aml_arg(2), aml_arg(3))));
> +}

Some duplication here, move above to a sub-function please.

> +aml_append(dev, method);
> +
> +nvdimm_build_nvdimm_devices(device_list, dev);
> +
> +aml_append(sb_scope, dev);
> +
> +aml_append(ssdt, sb_scope);
> +/* copy AML table into ACPI tables blob and patch header there */
> +g_array_append_vals(table_data, ssdt->buf->data, ssdt->buf->len);
> +build_header(linker, table_data,
> +(void *)(table_data->data + table_data->len - ssdt->buf->len),
> +"SSDT", ssdt->buf->len, 1, "NVDIMM");
> +free_aml_allocator();
> +}
> +
>  void nvdimm_build_acpi(GArray *table_offsets, GArray *table_data,
> GArray *linker)
>  {
> @@ -378,5 +462,6 @@ void nvdimm_build_acpi(GArray *table_offsets, GArray 
> *table_data,
>  return;
>  }
>  nvdimm_build_nfit(device_list, table_offsets, table_data, linker);
> +nvdimm_build_ssdt(device_list, table_offsets, table_data, linker);
>  g_slist_free(device_list);
>  }
> -- 
> 1.8.3.1
> 

[PATCH v2 2/9] drivers/hv: Move HV_SYNIC_STIMER_COUNT into Hyper-V UAPI x86 header

2015-11-30 Thread Andrey Smetanin
This constant is required for Hyper-V SynIC timer MSR
support by userspace (QEMU).

Signed-off-by: Andrey Smetanin 
Acked-by: K. Y. Srinivasan 
Reviewed-by: Roman Kagan 
CC: Gleb Natapov 
CC: Paolo Bonzini 
CC: "K. Y. Srinivasan" 
CC: Haiyang Zhang 
CC: Vitaly Kuznetsov 
CC: Roman Kagan 
CC: Denis V. Lunev 
CC: qemu-de...@nongnu.org
---
 arch/x86/include/uapi/asm/hyperv.h | 2 ++
 drivers/hv/hyperv_vmbus.h  | 2 --
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/uapi/asm/hyperv.h 
b/arch/x86/include/uapi/asm/hyperv.h
index 040d408..07981f0 100644
--- a/arch/x86/include/uapi/asm/hyperv.h
+++ b/arch/x86/include/uapi/asm/hyperv.h
@@ -269,4 +269,6 @@ typedef struct _HV_REFERENCE_TSC_PAGE {
 #define HV_SYNIC_SINT_AUTO_EOI (1ULL << 17)
 #define HV_SYNIC_SINT_VECTOR_MASK  (0xFF)
 
+#define HV_SYNIC_STIMER_COUNT  (4)
+
 #endif
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index e46e18c..f214e37 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -100,8 +100,6 @@ enum hv_cpuid_function {
 #define HVMSG_X64_APIC_EOI 0x80010004
 #define HVMSG_X64_LEGACY_FP_ERROR  0x80010005
 
-#define HV_SYNIC_STIMER_COUNT  (4)
-
 /* Define invalid partition identifier. */
 #define HV_PARTITION_ID_INVALID((u64)0x0)
 
-- 
2.4.3



[PATCH v2 1/9] drivers/hv: replace enum hv_message_type by u32

2015-11-30 Thread Andrey Smetanin
The size of enum hv_message_type, used inside struct hv_message and
hv_post_message, is not portable. Replace the enum with u32.

Signed-off-by: Andrey Smetanin 
CC: Gleb Natapov 
CC: Paolo Bonzini 
CC: "K. Y. Srinivasan" 
CC: Haiyang Zhang 
CC: Vitaly Kuznetsov 
CC: Roman Kagan 
CC: Denis V. Lunev 
CC: qemu-de...@nongnu.org

---
 drivers/hv/hv.c   |  4 ++--
 drivers/hv/hyperv_vmbus.h | 48 +++
 2 files changed, 25 insertions(+), 27 deletions(-)

diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
index 6341be8..dde7e1c 100644
--- a/drivers/hv/hv.c
+++ b/drivers/hv/hv.c
@@ -310,8 +310,8 @@ void hv_cleanup(void)
  * This involves a hypercall.
  */
 int hv_post_message(union hv_connection_id connection_id,
- enum hv_message_type message_type,
- void *payload, size_t payload_size)
+   u32 message_type,
+   void *payload, size_t payload_size)
 {
 
struct hv_input_post_message *aligned_msg;
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index 3782636..e46e18c 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -75,32 +75,30 @@ enum hv_cpuid_function {
 #define HV_EVENT_FLAGS_DWORD_COUNT (256 / sizeof(u32))
 
 /* Define hypervisor message types. */
-enum hv_message_type {
-   HVMSG_NONE  = 0x,
+#define HVMSG_NONE 0x
 
-   /* Memory access messages. */
-   HVMSG_UNMAPPED_GPA  = 0x8000,
-   HVMSG_GPA_INTERCEPT = 0x8001,
+/* Memory access messages. */
+#define HVMSG_UNMAPPED_GPA 0x8000
+#define HVMSG_GPA_INTERCEPT0x8001
 
-   /* Timer notification messages. */
-   HVMSG_TIMER_EXPIRED = 0x8010,
+/* Timer notification messages. */
+#define HVMSG_TIMER_EXPIRED0x8010
 
-   /* Error messages. */
-   HVMSG_INVALID_VP_REGISTER_VALUE = 0x8020,
-   HVMSG_UNRECOVERABLE_EXCEPTION   = 0x8021,
-   HVMSG_UNSUPPORTED_FEATURE   = 0x8022,
+/* Error messages. */
+#define HVMSG_INVALID_VP_REGISTER_VALUE0x8020
+#define HVMSG_UNRECOVERABLE_EXCEPTION  0x8021
+#define HVMSG_UNSUPPORTED_FEATURE  0x8022
 
-   /* Trace buffer complete messages. */
-   HVMSG_EVENTLOG_BUFFERCOMPLETE   = 0x8040,
+/* Trace buffer complete messages. */
+#define HVMSG_EVENTLOG_BUFFERCOMPLETE  0x8040
 
-   /* Platform-specific processor intercept messages. */
-   HVMSG_X64_IOPORT_INTERCEPT  = 0x8001,
-   HVMSG_X64_MSR_INTERCEPT = 0x80010001,
-   HVMSG_X64_CPUID_INTERCEPT   = 0x80010002,
-   HVMSG_X64_EXCEPTION_INTERCEPT   = 0x80010003,
-   HVMSG_X64_APIC_EOI  = 0x80010004,
-   HVMSG_X64_LEGACY_FP_ERROR   = 0x80010005
-};
+/* Platform-specific processor intercept messages. */
+#define HVMSG_X64_IOPORT_INTERCEPT 0x8001
+#define HVMSG_X64_MSR_INTERCEPT0x80010001
+#define HVMSG_X64_CPUID_INTERCEPT  0x80010002
+#define HVMSG_X64_EXCEPTION_INTERCEPT  0x80010003
+#define HVMSG_X64_APIC_EOI 0x80010004
+#define HVMSG_X64_LEGACY_FP_ERROR  0x80010005
 
 #define HV_SYNIC_STIMER_COUNT  (4)
 
@@ -174,7 +172,7 @@ union hv_message_flags {
 
 /* Define synthetic interrupt controller message header. */
 struct hv_message_header {
-   enum hv_message_type message_type;
+   u32 message_type;
u8 payload_size;
union hv_message_flags message_flags;
u8 reserved[2];
@@ -347,7 +345,7 @@ enum hv_call_code {
 struct hv_input_post_message {
union hv_connection_id connectionid;
u32 reserved;
-   enum hv_message_type message_type;
+   u32 message_type;
u32 payload_size;
u64 payload[HV_MESSAGE_PAYLOAD_QWORD_COUNT];
 };
@@ -579,8 +577,8 @@ extern int hv_init(void);
 extern void hv_cleanup(void);
 
 extern int hv_post_message(union hv_connection_id connection_id,
-enum hv_message_type message_type,
-void *payload, size_t payload_size);
+  u32 message_type,
+  void *payload, size_t payload_size);
 
 extern u16 hv_signal_event(void *con_id);
 
-- 
2.4.3



[PATCH v2 0/9] KVM: Hyper-V SynIC timers

2015-11-30 Thread Andrey Smetanin
Per Hyper-V specification (and as required by Hyper-V-aware guests),
SynIC provides 4 per-vCPU timers.  Each timer is programmed via a pair
of MSRs, and signals expiration by delivering a special format message
to the configured SynIC message slot and triggering the corresponding
synthetic interrupt.

Note: as implemented by this patch, all periodic timers are "lazy"
(i.e. if the vCPU wasn't scheduled for more than the timer period the
timer events are lost), regardless of the corresponding configuration
MSR.  If deemed necessary, the "catch up" mode (the timer period is
shortened until the timer catches up) will be implemented later.

Hyper-V SynIC timer support is required to load winhv.sys inside a
Windows guest, which guest VMBus devices depend on.

This patch series depends on the Hyper-V SynIC patches sent previously.

Changes v2:
* Hyper-V header patches split and fixed
* Use remainder to calculate periodic timer expiration time

Signed-off-by: Andrey Smetanin 
CC: Gleb Natapov 
CC: Paolo Bonzini 
CC: "K. Y. Srinivasan" 
CC: Haiyang Zhang 
CC: Vitaly Kuznetsov 
CC: Roman Kagan 
CC: Denis V. Lunev 
CC: qemu-de...@nongnu.org

Andrey Smetanin (9):
  drivers/hv: Replace enum hv_message_type by u32
  drivers/hv: Move HV_SYNIC_STIMER_COUNT into Hyper-V UAPI x86 header
  drivers/hv: Move struct hv_message into UAPI Hyper-V x86 header
  drivers/hv: Move struct hv_timer_message_payload into UAPI Hyper-V x86
header
  kvm/x86: Rearrange func's declarations inside Hyper-V header
  kvm/x86: Added Hyper-V vcpu_to_hv_vcpu()/hv_vcpu_to_vcpu() helpers
  kvm/x86: Hyper-V internal helper to read MSR HV_X64_MSR_TIME_REF_COUNT
  kvm/x86: Hyper-V SynIC message slot pending clearing at SINT ack
  kvm/x86: Hyper-V SynIC timers

 arch/x86/include/asm/kvm_host.h|  13 ++
 arch/x86/include/uapi/asm/hyperv.h |  90 ++
 arch/x86/kvm/hyperv.c  | 360 -
 arch/x86/kvm/hyperv.h  |  54 --
 arch/x86/kvm/x86.c |   9 +
 drivers/hv/hv.c|   4 +-
 drivers/hv/hyperv_vmbus.h  |  92 +-
 include/linux/kvm_host.h   |   3 +
 8 files changed, 516 insertions(+), 109 deletions(-)

-- 
2.4.3



[PATCH v2 7/9] kvm/x86: Hyper-V internal helper to read MSR HV_X64_MSR_TIME_REF_COUNT

2015-11-30 Thread Andrey Smetanin
This helper will also be used in the Hyper-V SynIC timers implementation.

Signed-off-by: Andrey Smetanin 
Reviewed-by: Roman Kagan 
CC: Gleb Natapov 
CC: Paolo Bonzini 
CC: "K. Y. Srinivasan" 
CC: Haiyang Zhang 
CC: Vitaly Kuznetsov 
CC: Roman Kagan 
CC: Denis V. Lunev 
CC: qemu-de...@nongnu.org
---
 arch/x86/kvm/hyperv.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 41869a9..9958926 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -335,6 +335,11 @@ static void synic_init(struct kvm_vcpu_hv_synic *synic)
}
 }
 
+static u64 get_time_ref_counter(struct kvm *kvm)
+{
+   return div_u64(get_kernel_ns() + kvm->arch.kvmclock_offset, 100);
+}
+
 void kvm_hv_vcpu_init(struct kvm_vcpu *vcpu)
 {
synic_init(vcpu_to_synic(vcpu));
@@ -576,11 +581,9 @@ static int kvm_hv_get_msr_pw(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
case HV_X64_MSR_HYPERCALL:
data = hv->hv_hypercall;
break;
-   case HV_X64_MSR_TIME_REF_COUNT: {
-   data =
-    div_u64(get_kernel_ns() + kvm->arch.kvmclock_offset, 100);
+   case HV_X64_MSR_TIME_REF_COUNT:
+   data = get_time_ref_counter(kvm);
break;
-   }
case HV_X64_MSR_REFERENCE_TSC:
data = hv->hv_tsc_page;
break;
-- 
2.4.3



[PATCH v2 6/9] kvm/x86: Added Hyper-V vcpu_to_hv_vcpu()/hv_vcpu_to_vcpu() helpers

2015-11-30 Thread Andrey Smetanin
Signed-off-by: Andrey Smetanin 
Reviewed-by: Roman Kagan 
CC: Gleb Natapov 
CC: Paolo Bonzini 
CC: "K. Y. Srinivasan" 
CC: Haiyang Zhang 
CC: Vitaly Kuznetsov 
CC: Roman Kagan 
CC: Denis V. Lunev 
CC: qemu-de...@nongnu.org
---
 arch/x86/kvm/hyperv.h | 20 ++--
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h
index 9483d49..d5d8217 100644
--- a/arch/x86/kvm/hyperv.h
+++ b/arch/x86/kvm/hyperv.h
@@ -24,21 +24,29 @@
 #ifndef __ARCH_X86_KVM_HYPERV_H__
 #define __ARCH_X86_KVM_HYPERV_H__
 
-static inline struct kvm_vcpu_hv_synic *vcpu_to_synic(struct kvm_vcpu *vcpu)
+static inline struct kvm_vcpu_hv *vcpu_to_hv_vcpu(struct kvm_vcpu *vcpu)
 {
-   return &vcpu->arch.hyperv.synic;
+   return &vcpu->arch.hyperv;
 }
 
-static inline struct kvm_vcpu *synic_to_vcpu(struct kvm_vcpu_hv_synic *synic)
+static inline struct kvm_vcpu *hv_vcpu_to_vcpu(struct kvm_vcpu_hv *hv_vcpu)
 {
-   struct kvm_vcpu_hv *hv;
struct kvm_vcpu_arch *arch;
 
-   hv = container_of(synic, struct kvm_vcpu_hv, synic);
-   arch = container_of(hv, struct kvm_vcpu_arch, hyperv);
+   arch = container_of(hv_vcpu, struct kvm_vcpu_arch, hyperv);
return container_of(arch, struct kvm_vcpu, arch);
 }
 
+static inline struct kvm_vcpu_hv_synic *vcpu_to_synic(struct kvm_vcpu *vcpu)
+{
+   return &vcpu->arch.hyperv.synic;
+}
+
+static inline struct kvm_vcpu *synic_to_vcpu(struct kvm_vcpu_hv_synic *synic)
+{
+   return hv_vcpu_to_vcpu(container_of(synic, struct kvm_vcpu_hv, synic));
+}
+
 int kvm_hv_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data, bool host);
 int kvm_hv_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
 
-- 
2.4.3



Re: [PATCH 2/2] KVM: Create debugfs dir and stat files for each VM

2015-11-30 Thread Tyler Baker
On 30 November 2015 at 00:38, Christian Borntraeger
 wrote:
> On 11/27/2015 09:42 PM, Tyler Baker wrote:
>> On 27 November 2015 at 10:53, Tyler Baker  wrote:
>>> On 27 November 2015 at 09:08, Tyler Baker  wrote:
 On 27 November 2015 at 00:54, Christian Borntraeger
  wrote:
> On 11/26/2015 09:47 PM, Christian Borntraeger wrote:
>> On 11/26/2015 05:17 PM, Tyler Baker wrote:
>>> Hi Christian,
>>>
>>> The kernelci.org bot recently has been reporting kvm guest boot
>>> failures[1] on various arm64 platforms in next-20151126. The bot
>>> bisected[2] the failures to the commit in -next titled "KVM: Create
>>> debugfs dir and stat files for each VM". I confirmed by reverting this
>>> commit on top of next-20151126 it resolves the boot issue.
>>>
>>> In this test case the host and guest are booted with the same kernel.
>>> The host is booted over nfs, installs qemu (qemu-system arm64 2.4.0),
>>> and launches a guest. The host is booting fine, but when the guest is
>>> launched it errors with "Failed to retrieve host CPU features!". I
>>> checked the host logs, and found an "Unable to handle kernel paging
>>> request" splat[3] which occurs when the guest is attempting to start.
>>>
>>> I scanned the patch in question but nothing obvious jumped out at me,
>>> any thoughts?
>>
>> Not really.
>> Do you have processes running that read the files in
>> /sys/kernel/debug/kvm/* ?
>>
>> If I read the arm oops message correctly it oopsed inside
>> __srcu_read_lock. There is actually nothing in there that can oops,
>> except the access to the preempt count. I am just guessing right now,
>> but maybe the preempt variable is no longer available (as the process
>> is gone). As long as a debugfs file is open, we hold a reference to
>> the kvm, which holds a reference to the mm, so the mm might be killed
>> after the process. But this is supposed to work, so maybe it's something
>> different. An objdump of __srcu_read_lock might help.
>
> Hmm, the preempt thing is done in srcu_read_lock, but the crash is in
> __srcu_read_lock. This function gets the srcu struct from mmu_notifier.c,
> which must be present and is initialized during boot.
>
>
> int __srcu_read_lock(struct srcu_struct *sp)
> {
> int idx;
>
> idx = READ_ONCE(sp->completed) & 0x1;
> __this_cpu_inc(sp->per_cpu_ref->c[idx]);
> smp_mb(); /* B */  /* Avoid leaking the critical section. */
> __this_cpu_inc(sp->per_cpu_ref->seq[idx]);
> return idx;
> }
>
> Looking at the code I have no clue why the patch makes a difference.
> Can you try to get an objdump -S for __srcu_read_lock?
>>>
>>> Some other interesting finding below...
>>>
>>> On the host, I do _not_ have any nodes under /sys/kernel/debug/kvm/
>>>
>>> Running strace on the qemu command I use to launch the guest yields
>>> the following.
>>>
>>> [pid  5963] 1448649724.405537 mmap(NULL, 65536, PROT_READ|PROT_WRITE,
>>> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f6652a000
>>> [pid  5963] 1448649724.405586 read(13, "MemTotal:   16414616
>>> kB\nMemF"..., 1024) = 1024
>>> [pid  5963] 1448649724.405699 close(13) = 0
>>> [pid  5963] 1448649724.405755 munmap(0x7f6652a000, 65536) = 0
>>> [pid  5963] 1448649724.405947 brk(0x2552f000) = 0x2552f000
>>> [pid  5963] 1448649724.406148 openat(AT_FDCWD, "/dev/kvm",
>>> O_RDWR|O_CLOEXEC) = 13
>>> [pid  5963] 1448649724.406209 ioctl(13, KVM_CREATE_VM, 0) = -1 ENOMEM
>>> (Cannot allocate memory)
>>
>> If I comment the call to kvm_create_vm_debugfs(kvm) the guest boots
>> fine. I put some printk's in the kvm_create_vm_debugfs() function and
>> it's returning -ENOMEM after it evaluates !kvm->debugfs_dentry. I was
>> chatting with some folks from the Linaro virtualization team and they
>> mentioned that ARM is a bit special as the same PID creates two vms in
>> quick succession, the first one is a scratch vm, and the other is the
>> 'real' vm. With that bit of info, I suspect we may be trying to create
>> the debugfs directory twice, and the second time it's failing because
>> it already exists.
>
> Hmmm, with a patched QEMU that calls VM_CREATE twice it errors out on s390
> with -ENOMEM (which it should not), but it errors out gracefully.
>
> Does the attached patch avoid the crash? (guest will not start, but qemu
> should error out gracefully with ENOMEM)

Yeah. I patched my host kernel and now the qemu guest launch errors
gracefully[1].

Cheers,

Tyler

[1] http://hastebin.com/rotiropayo.mel


[PATCH v2 5/9] kvm/x86: Rearrange func's declarations inside Hyper-V header

2015-11-30 Thread Andrey Smetanin
This rearrangement groups function declarations together
according to their functionality, so future additions
will be simpler.

Signed-off-by: Andrey Smetanin 
Reviewed-by: Roman Kagan 
CC: Gleb Natapov 
CC: Paolo Bonzini 
CC: "K. Y. Srinivasan" 
CC: Haiyang Zhang 
CC: Vitaly Kuznetsov 
CC: Roman Kagan 
CC: Denis V. Lunev 
CC: qemu-de...@nongnu.org
---
 arch/x86/kvm/hyperv.h | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h
index 315af4b..9483d49 100644
--- a/arch/x86/kvm/hyperv.h
+++ b/arch/x86/kvm/hyperv.h
@@ -24,14 +24,6 @@
 #ifndef __ARCH_X86_KVM_HYPERV_H__
 #define __ARCH_X86_KVM_HYPERV_H__
 
-int kvm_hv_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data, bool host);
-int kvm_hv_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
-bool kvm_hv_hypercall_enabled(struct kvm *kvm);
-int kvm_hv_hypercall(struct kvm_vcpu *vcpu);
-
-int kvm_hv_synic_set_irq(struct kvm *kvm, u32 vcpu_id, u32 sint);
-void kvm_hv_synic_send_eoi(struct kvm_vcpu *vcpu, int vector);
-
 static inline struct kvm_vcpu_hv_synic *vcpu_to_synic(struct kvm_vcpu *vcpu)
 {
return &vcpu->arch.hyperv.synic;
@@ -46,10 +38,18 @@ static inline struct kvm_vcpu *synic_to_vcpu(struct kvm_vcpu_hv_synic *synic)
arch = container_of(hv, struct kvm_vcpu_arch, hyperv);
return container_of(arch, struct kvm_vcpu, arch);
 }
-void kvm_hv_irq_routing_update(struct kvm *kvm);
 
-void kvm_hv_vcpu_init(struct kvm_vcpu *vcpu);
+int kvm_hv_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data, bool host);
+int kvm_hv_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
+
+bool kvm_hv_hypercall_enabled(struct kvm *kvm);
+int kvm_hv_hypercall(struct kvm_vcpu *vcpu);
 
+void kvm_hv_irq_routing_update(struct kvm *kvm);
+int kvm_hv_synic_set_irq(struct kvm *kvm, u32 vcpu_id, u32 sint);
+void kvm_hv_synic_send_eoi(struct kvm_vcpu *vcpu, int vector);
 int kvm_hv_activate_synic(struct kvm_vcpu *vcpu);
 
+void kvm_hv_vcpu_init(struct kvm_vcpu *vcpu);
+
 #endif
-- 
2.4.3



[PATCH v2 4/9] drivers/hv: Move struct hv_timer_message_payload into UAPI Hyper-V x86 header

2015-11-30 Thread Andrey Smetanin
This struct is required for the Hyper-V SynIC timers implementation inside
KVM and for upcoming Hyper-V VMBus support by userspace (QEMU), so place it
into the Hyper-V UAPI header.

Signed-off-by: Andrey Smetanin 
CC: Gleb Natapov 
CC: Paolo Bonzini 
CC: "K. Y. Srinivasan" 
CC: Haiyang Zhang 
CC: Vitaly Kuznetsov 
CC: Roman Kagan 
CC: Denis V. Lunev 
CC: qemu-de...@nongnu.org

---
 arch/x86/include/uapi/asm/hyperv.h | 8 
 drivers/hv/hyperv_vmbus.h  | 9 -
 2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/uapi/asm/hyperv.h b/arch/x86/include/uapi/asm/hyperv.h
index 76e503d..42278f8 100644
--- a/arch/x86/include/uapi/asm/hyperv.h
+++ b/arch/x86/include/uapi/asm/hyperv.h
@@ -345,4 +345,12 @@ struct hv_message_page {
struct hv_message sint_message[HV_SYNIC_SINT_COUNT];
 };
 
+/* Define timer message payload structure. */
+struct hv_timer_message_payload {
+   __u32 timer_index;
+   __u32 reserved;
+   __u64 expiration_time;  /* When the timer expired */
+   __u64 delivery_time;/* When the message was delivered */
+};
+
 #endif
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index 3f3756b..db60080 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -136,15 +136,6 @@ union hv_timer_config {
};
 };
 
-
-/* Define timer message payload structure. */
-struct hv_timer_message_payload {
-   u32 timer_index;
-   u32 reserved;
-   u64 expiration_time;/* When the timer expired */
-   u64 delivery_time;  /* When the message was delivered */
-};
-
 /* Define the number of message buffers associated with each port. */
 #define HV_PORT_MESSAGE_BUFFER_COUNT   (16)
 
-- 
2.4.3



[PATCH v2 8/9] kvm/x86: Hyper-V SynIC message slot pending clearing at SINT ack

2015-11-30 Thread Andrey Smetanin
The SynIC message protocol mandates that the message slot is claimed
by atomically setting message type to something other than HVMSG_NONE.
If another message is to be delivered while the slot is still busy,
the message pending flag is asserted to indicate to the guest that the
hypervisor wants to be notified when the slot is released.

To make sure the protocol works regardless of where the message
sources are (kernel or userspace), clear the pending flag on SINT ACK
notification, and let the message sources compete for the slot again.

Signed-off-by: Andrey Smetanin 
Reviewed-by: Roman Kagan 
CC: Gleb Natapov 
CC: Paolo Bonzini 
CC: "K. Y. Srinivasan" 
CC: Haiyang Zhang 
CC: Vitaly Kuznetsov 
CC: Roman Kagan 
CC: Denis V. Lunev 
CC: qemu-de...@nongnu.org
---
 arch/x86/kvm/hyperv.c| 31 +++
 include/linux/kvm_host.h |  2 ++
 2 files changed, 33 insertions(+)

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 9958926..6412b6b 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -27,6 +27,7 @@
 #include "hyperv.h"
 
 #include 
+#include 
 #include 
 #include 
 
@@ -116,13 +117,43 @@ static struct kvm_vcpu_hv_synic *synic_get(struct kvm *kvm, u32 vcpu_id)
return (synic->active) ? synic : NULL;
 }
 
+static void synic_clear_sint_msg_pending(struct kvm_vcpu_hv_synic *synic,
+   u32 sint)
+{
+   struct kvm_vcpu *vcpu = synic_to_vcpu(synic);
+   struct page *page;
+   gpa_t gpa;
+   struct hv_message *msg;
+   struct hv_message_page *msg_page;
+
+   gpa = synic->msg_page & PAGE_MASK;
+   page = kvm_vcpu_gfn_to_page(vcpu, gpa >> PAGE_SHIFT);
+   if (is_error_page(page)) {
+   vcpu_err(vcpu, "Hyper-V SynIC can't get msg page, gpa 0x%llx\n",
+gpa);
+   return;
+   }
+   msg_page = kmap_atomic(page);
+
+   msg = &msg_page->sint_message[sint];
+   msg->header.message_flags.msg_pending = 0;
+
+   kunmap_atomic(msg_page);
+   kvm_release_page_dirty(page);
+   kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
+}
+
 static void kvm_hv_notify_acked_sint(struct kvm_vcpu *vcpu, u32 sint)
 {
struct kvm *kvm = vcpu->kvm;
+   struct kvm_vcpu_hv_synic *synic = vcpu_to_synic(vcpu);
int gsi, idx;
 
vcpu_debug(vcpu, "Hyper-V SynIC acked sint %d\n", sint);
 
+   if (synic->msg_page & HV_SYNIC_SIMP_ENABLE)
+   synic_clear_sint_msg_pending(synic, sint);
+
idx = srcu_read_lock(&kvm->irq_srcu);
gsi = atomic_read(&vcpu_to_synic(vcpu)->sint_to_gsi[sint]);
if (gsi != -1)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 2911919..9b64c8c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -450,6 +450,8 @@ struct kvm {
 
 #define vcpu_debug(vcpu, fmt, ...) \
kvm_debug("vcpu%i " fmt, (vcpu)->vcpu_id, ## __VA_ARGS__)
+#define vcpu_err(vcpu, fmt, ...)   \
+   kvm_err("vcpu%i " fmt, (vcpu)->vcpu_id, ## __VA_ARGS__)
 
 static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
 {
-- 
2.4.3



[PATCH v2 9/9] kvm/x86: Hyper-V SynIC timers

2015-11-30 Thread Andrey Smetanin
Per Hyper-V specification (and as required by Hyper-V-aware guests),
SynIC provides 4 per-vCPU timers.  Each timer is programmed via a pair
of MSRs, and signals expiration by delivering a special format message
to the configured SynIC message slot and triggering the corresponding
synthetic interrupt.

Note: as implemented by this patch, all periodic timers are "lazy"
(i.e. if the vCPU wasn't scheduled for more than the timer period the
timer events are lost), regardless of the corresponding configuration
MSR.  If deemed necessary, the "catch up" mode (the timer period is
shortened until the timer catches up) will be implemented later.

Changes v2:
* Use remainder to calculate periodic timer expiration time

Signed-off-by: Andrey Smetanin 
Reviewed-by: Roman Kagan 
CC: Gleb Natapov 
CC: Paolo Bonzini 
CC: "K. Y. Srinivasan" 
CC: Haiyang Zhang 
CC: Vitaly Kuznetsov 
CC: Roman Kagan 
CC: Denis V. Lunev 
CC: qemu-de...@nongnu.org
---
 arch/x86/include/asm/kvm_host.h|  13 ++
 arch/x86/include/uapi/asm/hyperv.h |   6 +
 arch/x86/kvm/hyperv.c  | 318 -
 arch/x86/kvm/hyperv.h  |  24 +++
 arch/x86/kvm/x86.c |   9 ++
 include/linux/kvm_host.h   |   1 +
 6 files changed, 368 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8140077..a7c8987 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -379,6 +379,17 @@ struct kvm_mtrr {
struct list_head head;
 };
 
+/* Hyper-V SynIC timer */
+struct kvm_vcpu_hv_stimer {
+   struct hrtimer timer;
+   int index;
+   u64 config;
+   u64 count;
+   u64 exp_time;
+   struct hv_message msg;
+   bool msg_pending;
+};
+
 /* Hyper-V synthetic interrupt controller (SynIC)*/
 struct kvm_vcpu_hv_synic {
u64 version;
@@ -398,6 +409,8 @@ struct kvm_vcpu_hv {
s64 runtime_offset;
struct kvm_vcpu_hv_synic synic;
struct kvm_hyperv_exit exit;
+   struct kvm_vcpu_hv_stimer stimer[HV_SYNIC_STIMER_COUNT];
+   DECLARE_BITMAP(stimer_pending_bitmap, HV_SYNIC_STIMER_COUNT);
 };
 
 struct kvm_vcpu_arch {
diff --git a/arch/x86/include/uapi/asm/hyperv.h b/arch/x86/include/uapi/asm/hyperv.h
index 42278f8..71fce3f 100644
--- a/arch/x86/include/uapi/asm/hyperv.h
+++ b/arch/x86/include/uapi/asm/hyperv.h
@@ -353,4 +353,10 @@ struct hv_timer_message_payload {
__u64 delivery_time;/* When the message was delivered */
 };
 
+#define HV_STIMER_ENABLE   (1ULL << 0)
+#define HV_STIMER_PERIODIC (1ULL << 1)
+#define HV_STIMER_LAZY (1ULL << 2)
+#define HV_STIMER_AUTOENABLE   (1ULL << 3)
+#define HV_STIMER_SINT(config) (__u8)(((config) >> 16) & 0x0F)
+
 #endif
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 6412b6b..8ff8829 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -147,15 +147,32 @@ static void kvm_hv_notify_acked_sint(struct kvm_vcpu *vcpu, u32 sint)
 {
struct kvm *kvm = vcpu->kvm;
struct kvm_vcpu_hv_synic *synic = vcpu_to_synic(vcpu);
-   int gsi, idx;
+   struct kvm_vcpu_hv *hv_vcpu = vcpu_to_hv_vcpu(vcpu);
+   struct kvm_vcpu_hv_stimer *stimer;
+   int gsi, idx, stimers_pending;
 
vcpu_debug(vcpu, "Hyper-V SynIC acked sint %d\n", sint);
 
if (synic->msg_page & HV_SYNIC_SIMP_ENABLE)
synic_clear_sint_msg_pending(synic, sint);
 
+   /* Try to deliver pending Hyper-V SynIC timers messages */
+   stimers_pending = 0;
+   for (idx = 0; idx < ARRAY_SIZE(hv_vcpu->stimer); idx++) {
+   stimer = &hv_vcpu->stimer[idx];
+   if (stimer->msg_pending &&
+   (stimer->config & HV_STIMER_ENABLE) &&
+   HV_STIMER_SINT(stimer->config) == sint) {
+   set_bit(stimer->index,
+   hv_vcpu->stimer_pending_bitmap);
+   stimers_pending++;
+   }
+   }
+   if (stimers_pending)
+   kvm_make_request(KVM_REQ_HV_STIMER, vcpu);
+
idx = srcu_read_lock(&kvm->irq_srcu);
-   gsi = atomic_read(&vcpu_to_synic(vcpu)->sint_to_gsi[sint]);
+   gsi = atomic_read(&synic->sint_to_gsi[sint]);
if (gsi != -1)
kvm_notify_acked_gsi(kvm, gsi);
srcu_read_unlock(&kvm->irq_srcu, idx);
@@ -371,9 +388,268 @@ static u64 get_time_ref_counter(struct kvm *kvm)
return div_u64(get_kernel_ns() + kvm->arch.kvmclock_offset, 100);
 }
 
+static void stimer_mark_expired(struct kvm_vcpu_hv_stimer *stimer,
+   bool vcpu_kick)
+{
+   struct kvm_vcpu *vcpu = stimer_to_vcpu(stimer);
+
+   set_bit(stimer->index,
+   

[PATCH v2 3/9] drivers/hv: Move struct hv_message into UAPI Hyper-V x86 header

2015-11-30 Thread Andrey Smetanin
This struct is required for the Hyper-V SynIC timers implementation inside
KVM and for upcoming Hyper-V VMBus support by userspace (QEMU), so place it
into the Hyper-V UAPI header.

Signed-off-by: Andrey Smetanin 
Acked-by: K. Y. Srinivasan 
Reviewed-by: Roman Kagan 
CC: Gleb Natapov 
CC: Paolo Bonzini 
CC: "K. Y. Srinivasan" 
CC: Haiyang Zhang 
CC: Vitaly Kuznetsov 
CC: Roman Kagan 
CC: Denis V. Lunev 
CC: qemu-de...@nongnu.org
---
 arch/x86/include/uapi/asm/hyperv.h | 74 ++
 drivers/hv/hyperv_vmbus.h  | 73 -
 2 files changed, 74 insertions(+), 73 deletions(-)

diff --git a/arch/x86/include/uapi/asm/hyperv.h b/arch/x86/include/uapi/asm/hyperv.h
index 07981f0..76e503d 100644
--- a/arch/x86/include/uapi/asm/hyperv.h
+++ b/arch/x86/include/uapi/asm/hyperv.h
@@ -271,4 +271,78 @@ typedef struct _HV_REFERENCE_TSC_PAGE {
 
 #define HV_SYNIC_STIMER_COUNT  (4)
 
+/* Define synthetic interrupt controller message constants. */
+#define HV_MESSAGE_SIZE(256)
+#define HV_MESSAGE_PAYLOAD_BYTE_COUNT  (240)
+#define HV_MESSAGE_PAYLOAD_QWORD_COUNT (30)
+
+/* Define hypervisor message types. */
+#define HVMSG_NONE 0x
+
+/* Memory access messages. */
+#define HVMSG_UNMAPPED_GPA 0x8000
+#define HVMSG_GPA_INTERCEPT0x8001
+
+/* Timer notification messages. */
+#define HVMSG_TIMER_EXPIRED0x8010
+
+/* Error messages. */
+#define HVMSG_INVALID_VP_REGISTER_VALUE0x8020
+#define HVMSG_UNRECOVERABLE_EXCEPTION  0x8021
+#define HVMSG_UNSUPPORTED_FEATURE  0x8022
+
+/* Trace buffer complete messages. */
+#define HVMSG_EVENTLOG_BUFFERCOMPLETE  0x8040
+
+/* Platform-specific processor intercept messages. */
+#define HVMSG_X64_IOPORT_INTERCEPT 0x8001
+#define HVMSG_X64_MSR_INTERCEPT0x80010001
+#define HVMSG_X64_CPUID_INTERCEPT  0x80010002
+#define HVMSG_X64_EXCEPTION_INTERCEPT  0x80010003
+#define HVMSG_X64_APIC_EOI 0x80010004
+#define HVMSG_X64_LEGACY_FP_ERROR  0x80010005
+
+/* Define synthetic interrupt controller message flags. */
+union hv_message_flags {
+   __u8 asu8;
+   struct {
+   __u8 msg_pending:1;
+   __u8 reserved:7;
+   };
+};
+
+/* Define port identifier type. */
+union hv_port_id {
+   __u32 asu32;
+   struct {
+   __u32 id:24;
+   __u32 reserved:8;
+   } u;
+};
+
+/* Define synthetic interrupt controller message header. */
+struct hv_message_header {
+   __u32 message_type;
+   __u8 payload_size;
+   union hv_message_flags message_flags;
+   __u8 reserved[2];
+   union {
+   __u64 sender;
+   union hv_port_id port;
+   };
+};
+
+/* Define synthetic interrupt controller message format. */
+struct hv_message {
+   struct hv_message_header header;
+   union {
+   __u64 payload[HV_MESSAGE_PAYLOAD_QWORD_COUNT];
+   } u;
+};
+
+/* Define the synthetic interrupt message page layout. */
+struct hv_message_page {
+   struct hv_message sint_message[HV_SYNIC_SINT_COUNT];
+};
+
 #endif
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index f214e37..3f3756b 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -63,10 +63,6 @@ enum hv_cpuid_function {
 /* Define version of the synthetic interrupt controller. */
 #define HV_SYNIC_VERSION   (1)
 
-/* Define synthetic interrupt controller message constants. */
-#define HV_MESSAGE_SIZE(256)
-#define HV_MESSAGE_PAYLOAD_BYTE_COUNT  (240)
-#define HV_MESSAGE_PAYLOAD_QWORD_COUNT (30)
 #define HV_ANY_VP  (0x)
 
 /* Define synthetic interrupt controller flag constants. */
@@ -74,44 +70,9 @@ enum hv_cpuid_function {
 #define HV_EVENT_FLAGS_BYTE_COUNT  (256)
 #define HV_EVENT_FLAGS_DWORD_COUNT (256 / sizeof(u32))
 
-/* Define hypervisor message types. */
-#define HVMSG_NONE 0x
-
-/* Memory access messages. */
-#define HVMSG_UNMAPPED_GPA 0x8000
-#define HVMSG_GPA_INTERCEPT0x8001
-
-/* Timer notification messages. */
-#define HVMSG_TIMER_EXPIRED0x8010
-
-/* Error messages. */
-#define HVMSG_INVALID_VP_REGISTER_VALUE0x8020
-#define HVMSG_UNRECOVERABLE_EXCEPTION  0x8021
-#define HVMSG_UNSUPPORTED_FEATURE  0x8022
-
-/* Trace buffer complete messages. */
-#define HVMSG_EVENTLOG_BUFFERCOMPLETE  0x8040
-
-/* Platform-specific processor intercept messages. */
-#define HVMSG_X64_IOPORT_INTERCEPT 0x8001
-#define HVMSG_X64_MSR_INTERCEPT0x80010001
-#define 

[PATCH v2 0/3] target-i386: Use C struct for xsave area layout, offsets & sizes

2015-11-30 Thread Eduardo Habkost
target-i386/cpu.c:ext_save_area uses magic numbers for the xsave
area offsets and sizes, and target-i386/kvm.c:kvm_{put,get}_xsave()
uses offset macros and bit manipulation to access the xsave area.
This series changes both to use C structs for those operations.

I still need to figure out a way to write unit tests for the new
code. Maybe I will just copy and paste the new and old functions,
and test them locally (checking if they give the same results
when translating blobs of random bytes).

Changes v1 -> v2:
* Use uint8_t[8*n] instead of uint64_t[n] for register data
* Keep the QEMU_BUILD_BUG_ON lines

v1 -> v2 diff below:

  diff --git a/target-i386/cpu.h b/target-i386/cpu.h
  index 3d1d01e..41f55ef 100644
  --- a/target-i386/cpu.h
  +++ b/target-i386/cpu.h
  @@ -818,7 +818,7 @@ typedef union X86LegacyXSaveArea {
   uint32_t mxcsr;
   uint32_t mxcsr_mask;
   FPReg fpregs[8];
  -uint64_t xmm_regs[16][2];
  +uint8_t xmm_regs[16][16];
   };
   uint8_t data[512];
   } X86LegacyXSaveArea;
  @@ -831,7 +831,7 @@ typedef struct X86XSaveHeader {

   /* Ext. save area 2: AVX State */
   typedef struct XSaveAVX {
  -uint64_t ymmh[16][2];
  +uint8_t ymmh[16][16];
   } XSaveAVX;

   /* Ext. save area 3: BNDREG */
  @@ -852,12 +852,12 @@ typedef struct XSaveOpmask {

   /* Ext. save area 6: ZMM_Hi256 */
   typedef struct XSaveZMM_Hi256 {
  -uint64_t zmm_hi256[16][4];
  +uint8_t zmm_hi256[16][32];
   } XSaveZMM_Hi256;

   /* Ext. save area 7: Hi16_ZMM */
   typedef struct XSaveHi16_ZMM {
  -XMMReg hi16_zmm[16];
  +uint8_t hi16_zmm[16][64];
   } XSaveHi16_ZMM;

   typedef struct X86XSaveArea {
  diff --git a/target-i386/kvm.c b/target-i386/kvm.c
  index 5e7ec70..98249e4 100644
  --- a/target-i386/kvm.c
  +++ b/target-i386/kvm.c
  @@ -1203,6 +1203,43 @@ static int kvm_put_fpu(X86CPU *cpu)
   return kvm_vcpu_ioctl(CPU(cpu), KVM_SET_FPU, &fpu);
   }

  +#define XSAVE_FCW_FSW 0
  +#define XSAVE_FTW_FOP 1
  +#define XSAVE_CWD_RIP 2
  +#define XSAVE_CWD_RDP 4
  +#define XSAVE_MXCSR   6
  +#define XSAVE_ST_SPACE8
  +#define XSAVE_XMM_SPACE   40
  +#define XSAVE_XSTATE_BV   128
  +#define XSAVE_YMMH_SPACE  144
  +#define XSAVE_BNDREGS 240
  +#define XSAVE_BNDCSR  256
  +#define XSAVE_OPMASK  272
  +#define XSAVE_ZMM_Hi256   288
  +#define XSAVE_Hi16_ZMM416
  +
  +#define XSAVE_BYTE_OFFSET(word_offset) \
  +((word_offset)*sizeof(((struct kvm_xsave*)0)->region[0]))
  +
  +#define ASSERT_OFFSET(word_offset, field) \
  +QEMU_BUILD_BUG_ON(XSAVE_BYTE_OFFSET(word_offset) != \
  +  offsetof(X86XSaveArea, field))
  +
  +ASSERT_OFFSET(XSAVE_FCW_FSW, legacy.fcw);
  +ASSERT_OFFSET(XSAVE_FTW_FOP, legacy.ftw);
  +ASSERT_OFFSET(XSAVE_CWD_RIP, legacy.fpip);
  +ASSERT_OFFSET(XSAVE_CWD_RDP, legacy.fpdp);
  +ASSERT_OFFSET(XSAVE_MXCSR, legacy.mxcsr);
  +ASSERT_OFFSET(XSAVE_ST_SPACE, legacy.fpregs);
  +ASSERT_OFFSET(XSAVE_XMM_SPACE, legacy.xmm_regs);
  +ASSERT_OFFSET(XSAVE_XSTATE_BV, header.xstate_bv);
  +ASSERT_OFFSET(XSAVE_YMMH_SPACE, avx_state);
  +ASSERT_OFFSET(XSAVE_BNDREGS, bndreg_state);
  +ASSERT_OFFSET(XSAVE_BNDCSR, bndcsr_state);
  +ASSERT_OFFSET(XSAVE_OPMASK, opmask_state);
  +ASSERT_OFFSET(XSAVE_ZMM_Hi256, zmm_hi256_state);
  +ASSERT_OFFSET(XSAVE_Hi16_ZMM, hi16_zmm_state);
  +
   static int kvm_put_xsave(X86CPU *cpu)
   {
    CPUX86State *env = &cpu->env;
  @@ -1239,17 +1276,17 @@ static int kvm_put_xsave(X86CPU *cpu)
   sizeof env->opmask_regs);

   for (i = 0; i < CPU_NB_REGS; i++) {
   -X86LegacyXSaveArea *legacy = &xsave->legacy;
   -XSaveAVX *avx = &xsave->avx_state;
   -XSaveZMM_Hi256 *zmm_hi256 = &xsave->zmm_hi256_state;
   -stq_p(&legacy->xmm_regs[i][0], env->xmm_regs[i].XMM_Q(0));
   -stq_p(&legacy->xmm_regs[i][1], env->xmm_regs[i].XMM_Q(1));
   -stq_p(&avx->ymmh[i][0],env->xmm_regs[i].XMM_Q(2));
   -stq_p(&avx->ymmh[i][1],env->xmm_regs[i].XMM_Q(3));
   -stq_p(&zmm_hi256->zmm_hi256[i][0], env->xmm_regs[i].XMM_Q(4));
   -stq_p(&zmm_hi256->zmm_hi256[i][1], env->xmm_regs[i].XMM_Q(5));
   -stq_p(&zmm_hi256->zmm_hi256[i][2], env->xmm_regs[i].XMM_Q(6));
   -stq_p(&zmm_hi256->zmm_hi256[i][3], env->xmm_regs[i].XMM_Q(7));
  +uint8_t *xmm = xsave->legacy.xmm_regs[i];
  +uint8_t *ymmh = xsave->avx_state.ymmh[i];
  +uint8_t *zmmh = xsave->zmm_hi256_state.zmm_hi256[i];
  +stq_p(xmm, env->xmm_regs[i].XMM_Q(0));
  +stq_p(xmm+8,   env->xmm_regs[i].XMM_Q(1));
  +stq_p(ymmh,env->xmm_regs[i].XMM_Q(2));
  +stq_p(ymmh+8,  env->xmm_regs[i].XMM_Q(3));
  +stq_p(zmmh,env->xmm_regs[i].XMM_Q(4));
  +stq_p(zmmh+8,  env->xmm_regs[i].XMM_Q(5));
  +stq_p(zmmh+16, env->xmm_regs[i].XMM_Q(6));
  +stq_p(zmmh+24, env->xmm_regs[i].XMM_Q(7));
   }

   #ifdef TARGET_X86_64
  @@ -1625,17 +1662,17 @@ static int kvm_get_xsave(X86CPU *cpu)
   sizeof 

Re: [PATCH 2/2] KVM: Create debugfs dir and stat files for each VM

2015-11-30 Thread Alex Bennée

Janosch Frank  writes:

> On 11/27/2015 09:42 PM, Tyler Baker wrote:
>> On 27 November 2015 at 10:53, Tyler Baker  wrote:
>>> On 27 November 2015 at 09:08, Tyler Baker  wrote:
 On 27 November 2015 at 00:54, Christian Borntraeger
  wrote:
> On 11/26/2015 09:47 PM, Christian Borntraeger wrote:
>> On 11/26/2015 05:17 PM, Tyler Baker wrote:
>>> Hi Christian,
>>>
>>> The kernelci.org bot recently has been reporting kvm guest boot
>>> failures[1] on various arm64 platforms in next-20151126. The bot
>>> bisected[2] the failures to the commit in -next titled "KVM: Create
>>> debugfs dir and stat files for each VM". I confirmed by reverting this
>>> commit on top of next-20151126 it resolves the boot issue.
>>>

>>
> After a quick look into qemu I guess I've found the problem:
> kvm_init creates a vm, does checking and self initialization and
> then calls kvm_arch_init. The arch initialization indirectly
> calls kvm_arm_create_scratch_host_vcpu and that's where the
> trouble begins, as it also creates a VM.
>
> My assumption was that nobody would create multiple VMs under
> the same PID. Christian and I are working on a solution on the
> kernel side.

Yeah ARM is a little weird in that respect as the scratch VM is used to
probe capabilities. There is nothing in the API that says you can't have
multiple VMs per PID so I guess a better unique identifier is needed.
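One way to picture a "better unique identifier" is to combine the creator PID with a per-process sequence number, so a second VM created by the same process gets a distinct debugfs directory name. make_vm_dirname() below is a hypothetical userspace sketch of that idea, not the actual fix that landed in the kernel.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build a per-VM directory name as "<pid>-<seq>".
 * A real kernel implementation would need an atomic counter or locking
 * around vm_seq; this single-threaded sketch just uses a static int. */
static int vm_seq;

static int make_vm_dirname(char *buf, size_t len, int pid)
{
    return snprintf(buf, len, "%d-%d", pid, vm_seq++);
}
```

With such a scheme, the scratch VM and the real VM created by the same QEMU process no longer collide on the directory name.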

--
Alex Bennée
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] vhost: replace % with & on data path

2015-11-30 Thread David Miller
From: "Michael S. Tsirkin" 
Date: Mon, 30 Nov 2015 11:15:23 +0200

> We know vring num is a power of 2, so use &
> to mask the high bits.
> 
> Signed-off-by: Michael S. Tsirkin 

Acked-by: David S. Miller 
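The equivalence the patch relies on is easy to check in isolation: when num is a power of two, idx % num and idx & (num - 1) select the same ring slot, and the mask form avoids a division on the data path. A minimal sketch:

```c
#include <stdint.h>

/* Slot selection by modulo: always correct, but costs a division. */
static inline uint32_t ring_wrap_mod(uint32_t idx, uint32_t num)
{
    return idx % num;
}

/* Slot selection by mask: valid only when num is a power of 2. */
static inline uint32_t ring_wrap_mask(uint32_t idx, uint32_t num)
{
    return idx & (num - 1);
}
```

For num = 256, an index of 300 wraps to 44 under both forms; the vring spec guarantees the power-of-two size, which is what makes the substitution safe.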


RE: best way to create a snapshot of a running vm ?

2015-11-30 Thread Lentes, Bernd
Stefan wrote:

> 
> Hi Bernd,
> qemu-img cannot be used on the disk image when the VM is running.
> Please use virsh, it communicates with the running QEMU process and
> ensures that the snapshot is crash-consistent.
> 

Hi Stefan,

thanks for your answer.

I read that virsh internally uses qemu-img
(http://serverfault.com/questions/692435/qemu-img-snapshot-on-live-vm).
Is that true?


Bernd
   

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir'in Baerbel Brumme-Bothe
Geschaeftsfuehrer: Prof. Dr. Guenther Wess, Dr. Nikolaus Blum, Dr. Alfons Enhsen
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671



Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-30 Thread Alexander Duyck
On Sun, Nov 29, 2015 at 10:53 PM, Lan, Tianyu  wrote:
> On 11/26/2015 11:56 AM, Alexander Duyck wrote:
>>
>> > I am not saying you cannot modify the drivers, however what you are
>> doing is far too invasive.  Do you seriously plan on modifying all of
>> the PCI device drivers out there in order to allow any device that
>> might be direct assigned to a port to support migration?  I certainly
>> hope not.  That is why I have said that this solution will not scale.
>
>
> Current drivers are not migration friendly. If a driver wants to
> support migration, it needs to be changed.

Modifying all of the drivers directly will not solve the issue though.
This is why I have suggested looking at possibly implementing
something like dma_mark_clean() which is used for ia64 architectures
to mark pages that were DMAed in as clean.  In your case though you
would want to mark such pages as dirty so that the page migration will
notice them and move them over.
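The dirty-marking idea can be sketched in userspace under two illustrative assumptions: 4 KiB pages and a plain array of unsigned longs as the dirty bitmap. Marking the page containing a DMAed address is then a single bit set that the migration pass can later scan; both helper names below are made up for this sketch.

```c
#include <stdint.h>
#include <limits.h>

#define MINI_PAGE_SHIFT 12                          /* assume 4 KiB pages */
#define BITS_PER_ULONG  (sizeof(unsigned long) * CHAR_BIT)

/* Sketch of a dma_mark_dirty()-style helper: record that the page
 * containing addr was written by the device, so migration re-sends it. */
static void mark_page_dirty_sketch(unsigned long *bitmap, uint64_t addr)
{
    uint64_t pfn = addr >> MINI_PAGE_SHIFT;
    bitmap[pfn / BITS_PER_ULONG] |= 1UL << (pfn % BITS_PER_ULONG);
}

/* What the migration scanner would test for each page. */
static int page_is_dirty_sketch(const unsigned long *bitmap, uint64_t addr)
{
    uint64_t pfn = addr >> MINI_PAGE_SHIFT;
    return (bitmap[pfn / BITS_PER_ULONG] >> (pfn % BITS_PER_ULONG)) & 1;
}
```

The real kernel hook would sit in the swiotlb sync/unmap paths as described above, feeding the same kind of bitmap that the dirty-page tracker already consumes.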

> RFC PATCH V1 presented our ideas about how to deal with MMIO, ring and
> DMA tracking during migration. These are common for most drivers and
> they maybe problematic in the previous version but can be corrected later.

They can only be corrected if the underlying assumptions are correct
and they aren't.  Your solution would have never worked correctly.
The problem is you assume you can keep the device running when you are
migrating and you simply cannot.  At some point you will always have
to stop the device in order to complete the migration, and you cannot
stop it before you have stopped your page tracking mechanism.  So
unless the platform has an IOMMU that is somehow taking part in the
dirty page tracking you will not be able to stop the guest and then
the device, it will have to be the device and then the guest.

> Doing suspend and resume() may help to do migration easily but some
> devices requires low service down time. Especially network and I got
> that some cloud company promised less than 500ms network service downtime.

Honestly focusing on the downtime is getting the cart ahead of the
horse.  First you need to be able to do this without corrupting system
memory and regardless of the state of the device.  You haven't even
gotten to that state yet.  Last I knew the device had to be up in
order for your migration to even work.

Many devices are very state driven.  As such you cannot just freeze
them and restore them like you would regular device memory.  That is
where something like suspend/resume comes in because it already takes
care of getting the device ready for halt, and then resume.  Keep in
mind that those functions were meant to function on a device doing
something like a suspend to RAM or disk.  This is not too far off from
what a migration is doing since you need to halt the guest before you
move it.

As such the first step is to make it so that we can do the current
bonding approach with one change.  Specifically we want to leave the
device in the guest until the last portion of the migration instead of
having to remove it first.  To that end I would suggest focusing on
solving the DMA problem via something like a dma_mark_clean() type
solution as that would be one issue resolved and we all would see an
immediate gain instead of just those users of the ixgbevf driver.

> So I think performance effect also should be taken into account when we
> design the framework.

What you are proposing I would call premature optimization.  You need
to actually solve the problem before you can start optimizing things
and I don't see anything actually solved yet since your solution is
too unstable.

>>
>> What I am counter proposing seems like a very simple proposition.  It
>> can be implemented in two steps.
>>
>> 1.  Look at modifying dma_mark_clean().  It is a function called in
>> the sync and unmap paths of the lib/swiotlb.c.  If you could somehow
>> modify it to take care of marking the pages you unmap for Rx as being
>> dirty it will get you a good way towards your goal as it will allow
>> you to continue to do DMA while you are migrating the VM.
>>
>> 2.  Look at making use of the existing PCI suspend/resume calls that
>> are there to support PCI power management.  They have everything
>> needed to allow you to pause and resume DMA for the device before and
>> after the migration while retaining the driver state.  If you can
>> implement something that allows you to trigger these calls from the
>> PCI subsystem such as hot-plug then you would have a generic solution
>> that can be easily reproduced for multiple drivers beyond those
>> supported by ixgbevf.
>
>
> Glanced at the PCI hotplug code. The hotplug events are triggered by the
> PCI hotplug controller, and these events are defined in the controller spec.
> It's hard to add more events. Otherwise, we would also need to add some
> specific code to the PCI hotplug core, since it only adds and removes a
> PCI device when it gets events. It's also a challenge to modify Windows
> 

Re: [PATCH] vhost: replace % with & on data path

2015-11-30 Thread David Miller
From: "Michael S. Tsirkin" 
Date: Mon, 30 Nov 2015 10:34:07 +0200

> We know vring num is a power of 2, so use &
> to mask the high bits.
> 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  drivers/vhost/vhost.c | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 080422f..85f0f0a 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1366,10 +1366,12 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>   /* Only get avail ring entries after they have been exposed by guest. */
>   smp_rmb();
>  
> + }
> +

!!!


Re: [PATCH v4 05/21] KVM: ARM64: Add reset and access handlers for PMSELR register

2015-11-30 Thread Marc Zyngier
On Fri, 30 Oct 2015 14:21:47 +0800
Shannon Zhao  wrote:

> From: Shannon Zhao 
> 
> Since the reset value of PMSELR_EL0 is UNKNOWN, use reset_unknown for
> its reset handler. As it doesn't need to deal with the accessing action
> specially, it uses the default case to emulate writing and reading the PMSELR
> register.
> 
> Add a helper for CP15 registers reset to UNKNOWN.
> 
> Signed-off-by: Shannon Zhao 
> ---
>  arch/arm64/kvm/sys_regs.c | 5 +++--
>  arch/arm64/kvm/sys_regs.h | 8 
>  2 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> index 5b591d6..35d232e 100644
> --- a/arch/arm64/kvm/sys_regs.c
> +++ b/arch/arm64/kvm/sys_regs.c
> @@ -707,7 +707,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
> trap_raz_wi },
>   /* PMSELR_EL0 */
>   { Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b101),
> -   trap_raz_wi },
> +   access_pmu_regs, reset_unknown, PMSELR_EL0 },
>   /* PMCEID0_EL0 */
>   { Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b110),
> trap_raz_wi },
> @@ -998,7 +998,8 @@ static const struct sys_reg_desc cp15_regs[] = {
>   { Op1( 0), CRn( 9), CRm(12), Op2( 1), trap_raz_wi },
>   { Op1( 0), CRn( 9), CRm(12), Op2( 2), trap_raz_wi },
>   { Op1( 0), CRn( 9), CRm(12), Op2( 3), trap_raz_wi },
> - { Op1( 0), CRn( 9), CRm(12), Op2( 5), trap_raz_wi },
> + { Op1( 0), CRn( 9), CRm(12), Op2( 5), access_pmu_cp15_regs,
> +   reset_unknown_cp15, c9_PMSELR },
>   { Op1( 0), CRn( 9), CRm(12), Op2( 6), trap_raz_wi },
>   { Op1( 0), CRn( 9), CRm(12), Op2( 7), trap_raz_wi },
>   { Op1( 0), CRn( 9), CRm(13), Op2( 0), trap_raz_wi },
> diff --git a/arch/arm64/kvm/sys_regs.h b/arch/arm64/kvm/sys_regs.h
> index eaa324e..8afeff7 100644
> --- a/arch/arm64/kvm/sys_regs.h
> +++ b/arch/arm64/kvm/sys_regs.h
> @@ -110,6 +110,14 @@ static inline void reset_unknown(struct kvm_vcpu *vcpu,
>   vcpu_sys_reg(vcpu, r->reg) = 0x1de7ec7edbadc0deULL;
>  }
>  
> +static inline void reset_unknown_cp15(struct kvm_vcpu *vcpu,
> +   const struct sys_reg_desc *r)
> +{
> + BUG_ON(!r->reg);
> + BUG_ON(r->reg >= NR_COPRO_REGS);
> + vcpu_cp15(vcpu, r->reg) = 0xdecafbad;
> +}
> +
>  static inline void reset_val(struct kvm_vcpu *vcpu, const struct 
> sys_reg_desc *r)
>  {
>   BUG_ON(!r->reg);


Same remark here as the one I made earlier. I'm pretty sure we don't
call any CP15 reset because they are all shared with their 64bit
counterparts. The same thing goes for the whole series.

Thanks,

M.
-- 
Jazz is not dead. It just smells funny.


[PATCH 11/11] KVM: MMU: apply page track notifier

2015-11-30 Thread Xiao Guangrong
Register the notifier to receive write-track events so that we can update
our shadow page table.

It makes kvm_mmu_pte_write() the callback of the notifier; no functional
change is intended.

Signed-off-by: Xiao Guangrong 
---
 arch/x86/include/asm/kvm_host.h |  5 +++--
 arch/x86/kvm/mmu.c  | 19 +--
 arch/x86/kvm/x86.c  |  4 ++--
 3 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ea7907d..698577a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -658,6 +658,7 @@ struct kvm_arch {
 */
struct list_head active_mmu_pages;
struct list_head zapped_obsolete_pages;
+   struct kvm_page_track_notifier_node mmu_sp_tracker;
struct kvm_page_track_notifier_head track_notifier_head;
 
struct list_head assigned_dev_head;
@@ -953,6 +954,8 @@ void kvm_mmu_module_exit(void);
 void kvm_mmu_destroy(struct kvm_vcpu *vcpu);
 int kvm_mmu_create(struct kvm_vcpu *vcpu);
 void kvm_mmu_setup(struct kvm_vcpu *vcpu);
+void kvm_mmu_init_vm(struct kvm *kvm);
+void kvm_mmu_uninit_vm(struct kvm *kvm);
 void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
u64 dirty_mask, u64 nx_mask, u64 x_mask);
 
@@ -1092,8 +1095,6 @@ void kvm_pic_clear_all(struct kvm_pic *pic, int 
irq_source_id);
 
 void kvm_inject_nmi(struct kvm_vcpu *vcpu);
 
-void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
-  const u8 *new, int bytes);
 int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn);
 int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva);
 void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9f6a4ef..a420c43 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4325,8 +4325,8 @@ static u64 *get_written_sptes(struct kvm_mmu_page *sp, 
gpa_t gpa, int *nspte)
return spte;
 }
 
-void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
-  const u8 *new, int bytes)
+static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
+ const u8 *new, int bytes)
 {
gfn_t gfn = gpa >> PAGE_SHIFT;
struct kvm_mmu_page *sp;
@@ -4540,6 +4540,21 @@ void kvm_mmu_setup(struct kvm_vcpu *vcpu)
init_kvm_mmu(vcpu);
 }
 
+void kvm_mmu_init_vm(struct kvm *kvm)
+{
+   struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
+
+   node->track_write = kvm_mmu_pte_write;
+   kvm_page_track_register_notifier(kvm, node);
+}
+
+void kvm_mmu_uninit_vm(struct kvm *kvm)
+{
+   struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
+
+   kvm_page_track_unregister_notifier(kvm, node);
+}
+
 /* The return value indicates if tlb flush on all vcpus is needed. */
 typedef bool (*slot_level_handler) (struct kvm *kvm, struct kvm_rmap_head 
*rmap_head);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 64dbc69..adc031a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4327,7 +4327,6 @@ int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
ret = kvm_vcpu_write_guest(vcpu, gpa, val, bytes);
if (ret < 0)
return 0;
-   kvm_mmu_pte_write(vcpu, gpa, val, bytes);
kvm_page_track_write(vcpu, gpa, val, bytes);
return 1;
 }
@@ -4586,7 +4585,6 @@ static int emulator_cmpxchg_emulated(struct 
x86_emulate_ctxt *ctxt,
return X86EMUL_CMPXCHG_FAILED;
 
kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
-   kvm_mmu_pte_write(vcpu, gpa, new, bytes);
kvm_page_track_write(vcpu, gpa, new, bytes);
 
return X86EMUL_CONTINUE;
@@ -7694,6 +7692,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn);
 
kvm_page_track_init(kvm);
+   kvm_mmu_init_vm(kvm);
 
return 0;
 }
@@ -7821,6 +7820,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
kfree(kvm->arch.vioapic);
kvm_free_vcpus(kvm);
kfree(rcu_dereference_check(kvm->arch.apic_map, 1));
+   kvm_mmu_uninit_vm(kvm);
 }
 
 void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *free,
-- 
1.8.3.1



[PATCH v2 3/3] target-i386: kvm: Use X86XSaveArea struct for xsave save/load

2015-11-30 Thread Eduardo Habkost
Instead of using offset macros and bit operations in a uint32_t
array, use the X86XSaveArea struct to perform the loading/saving
operations in kvm_put_xsave() and kvm_get_xsave().

Signed-off-by: Eduardo Habkost 
---
Changes v1 -> v2:
* Use uint8_t pointers when loading/saving xmm, ymmh, zmmh,
  keeping the same load/save logic from previous code
* Keep the QEMU_BUILD_BUG_ON lines
---
 target-i386/kvm.c | 74 +++
 1 file changed, 36 insertions(+), 38 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index b8b336b..98249e4 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1243,9 +1243,8 @@ ASSERT_OFFSET(XSAVE_Hi16_ZMM, hi16_zmm_state);
 static int kvm_put_xsave(X86CPU *cpu)
 {
 CPUX86State *env = &cpu->env;
-struct kvm_xsave* xsave = env->kvm_xsave_buf;
+X86XSaveArea *xsave = env->kvm_xsave_buf;
 uint16_t cwd, swd, twd;
-uint8_t *xmm, *ymmh, *zmmh;
 int i, r;
 
 if (!has_xsave) {
@@ -1260,25 +1259,26 @@ static int kvm_put_xsave(X86CPU *cpu)
 for (i = 0; i < 8; ++i) {
 twd |= (!env->fptags[i]) << i;
 }
-xsave->region[XSAVE_FCW_FSW] = (uint32_t)(swd << 16) + cwd;
-xsave->region[XSAVE_FTW_FOP] = (uint32_t)(env->fpop << 16) + twd;
   -memcpy(&xsave->region[XSAVE_CWD_RIP], &env->fpip, sizeof(env->fpip));
   -memcpy(&xsave->region[XSAVE_CWD_RDP], &env->fpdp, sizeof(env->fpdp));
   -memcpy(&xsave->region[XSAVE_ST_SPACE], env->fpregs,
+xsave->legacy.fcw = cwd;
+xsave->legacy.fsw = swd;
+xsave->legacy.ftw = twd;
+xsave->legacy.fpop = env->fpop;
+xsave->legacy.fpip = env->fpip;
+xsave->legacy.fpdp = env->fpdp;
   +memcpy(&xsave->legacy.fpregs, env->fpregs,
 sizeof env->fpregs);
-xsave->region[XSAVE_MXCSR] = env->mxcsr;
   -*(uint64_t *)&xsave->region[XSAVE_XSTATE_BV] = env->xstate_bv;
   -memcpy(&xsave->region[XSAVE_BNDREGS], env->bnd_regs,
+xsave->legacy.mxcsr = env->mxcsr;
+xsave->header.xstate_bv = env->xstate_bv;
   +memcpy(&xsave->bndreg_state.bnd_regs, env->bnd_regs,
 sizeof env->bnd_regs);
   -memcpy(&xsave->region[XSAVE_BNDCSR], &env->bndcs_regs,
-sizeof(env->bndcs_regs));
   -memcpy(&xsave->region[XSAVE_OPMASK], env->opmask_regs,
+xsave->bndcsr_state.bndcsr = env->bndcs_regs;
   +memcpy(&xsave->opmask_state.opmask_regs, env->opmask_regs,
 sizeof env->opmask_regs);
 
   -xmm = (uint8_t *)&xsave->region[XSAVE_XMM_SPACE];
   -ymmh = (uint8_t *)&xsave->region[XSAVE_YMMH_SPACE];
   -zmmh = (uint8_t *)&xsave->region[XSAVE_ZMM_Hi256];
-for (i = 0; i < CPU_NB_REGS; i++, xmm += 16, ymmh += 16, zmmh += 32) {
+for (i = 0; i < CPU_NB_REGS; i++) {
+uint8_t *xmm = xsave->legacy.xmm_regs[i];
+uint8_t *ymmh = xsave->avx_state.ymmh[i];
+uint8_t *zmmh = xsave->zmm_hi256_state.zmm_hi256[i];
 stq_p(xmm, env->xmm_regs[i].XMM_Q(0));
 stq_p(xmm+8,   env->xmm_regs[i].XMM_Q(1));
 stq_p(ymmh,env->xmm_regs[i].XMM_Q(2));
@@ -1290,7 +1290,7 @@ static int kvm_put_xsave(X86CPU *cpu)
 }
 
 #ifdef TARGET_X86_64
   -memcpy(&xsave->region[XSAVE_Hi16_ZMM], &env->xmm_regs[16],
   +memcpy(&xsave->hi16_zmm_state.hi16_zmm, &env->xmm_regs[16],
 16 * sizeof env->xmm_regs[16]);
 #endif
 r = kvm_vcpu_ioctl(CPU(cpu), KVM_SET_XSAVE, xsave);
@@ -1626,9 +1626,8 @@ static int kvm_get_fpu(X86CPU *cpu)
 static int kvm_get_xsave(X86CPU *cpu)
 {
 CPUX86State *env = &cpu->env;
-struct kvm_xsave* xsave = env->kvm_xsave_buf;
+X86XSaveArea *xsave = env->kvm_xsave_buf;
 int ret, i;
-const uint8_t *xmm, *ymmh, *zmmh;
 uint16_t cwd, swd, twd;
 
 if (!has_xsave) {
@@ -1640,33 +1639,32 @@ static int kvm_get_xsave(X86CPU *cpu)
 return ret;
 }
 
-cwd = (uint16_t)xsave->region[XSAVE_FCW_FSW];
-swd = (uint16_t)(xsave->region[XSAVE_FCW_FSW] >> 16);
-twd = (uint16_t)xsave->region[XSAVE_FTW_FOP];
-env->fpop = (uint16_t)(xsave->region[XSAVE_FTW_FOP] >> 16);
+cwd = xsave->legacy.fcw;
+swd = xsave->legacy.fsw;
+twd = xsave->legacy.ftw;
+env->fpop = xsave->legacy.fpop;
 env->fpstt = (swd >> 11) & 7;
 env->fpus = swd;
 env->fpuc = cwd;
 for (i = 0; i < 8; ++i) {
 env->fptags[i] = !((twd >> i) & 1);
 }
   -memcpy(&env->fpip, &xsave->region[XSAVE_CWD_RIP], sizeof(env->fpip));
   -memcpy(&env->fpdp, &xsave->region[XSAVE_CWD_RDP], sizeof(env->fpdp));
   -env->mxcsr = xsave->region[XSAVE_MXCSR];
   -memcpy(env->fpregs, &xsave->region[XSAVE_ST_SPACE],
+env->fpip = xsave->legacy.fpip;
+env->fpdp = xsave->legacy.fpdp;
+env->mxcsr = xsave->legacy.mxcsr;
+memcpy(env->fpregs, >legacy.fpregs,
 sizeof env->fpregs);
   -env->xstate_bv = *(uint64_t *)&xsave->region[XSAVE_XSTATE_BV];
   -memcpy(env->bnd_regs, &xsave->region[XSAVE_BNDREGS],
+env->xstate_bv = xsave->header.xstate_bv;
   +memcpy(env->bnd_regs, &xsave->bndreg_state.bnd_regs,
 sizeof env->bnd_regs);
   -memcpy(&env->bndcs_regs, &xsave->region[XSAVE_BNDCSR],
-sizeof(env->bndcs_regs));
   -memcpy(env->opmask_regs, &xsave->region[XSAVE_OPMASK],
+

[PATCH 07/11] KVM: page track: add notifier support

2015-11-30 Thread Xiao Guangrong
A notifier list is introduced so that any node that wants to receive
track events can register on the list.

Two APIs are introduced here:
- kvm_page_track_register_notifier(): register the notifier to receive
  track events

- kvm_page_track_unregister_notifier(): stop receiving track events by
  unregistering the notifier

The callback node->track_write() is called when a write access to a
write-tracked page happens.
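Stripped of SRCU and locking, the registration/callback pattern described here boils down to a linked list of nodes whose write hooks fire on each write event. The following single-threaded userspace miniature (all names illustrative) shows just that core; the kernel version additionally protects writers with mmu_lock and readers with SRCU.

```c
#include <stdint.h>
#include <stddef.h>

/* A notifier node: list link plus the callback and some example state. */
struct track_node {
    struct track_node *next;
    void (*track_write)(struct track_node *n, uint64_t gpa, int bytes);
    int hits;                       /* example payload for the test */
};

static struct track_node *track_head;

/* Analogue of kvm_page_track_register_notifier(): push onto the list. */
static void track_register(struct track_node *n)
{
    n->next = track_head;
    track_head = n;
}

/* Analogue of kvm_page_track_write(): fan the event out to all nodes. */
static void track_write_event(uint64_t gpa, int bytes)
{
    for (struct track_node *n = track_head; n; n = n->next)
        if (n->track_write)
            n->track_write(n, gpa, bytes);
}

/* Example callback that just counts invocations. */
static void count_hits(struct track_node *n, uint64_t gpa, int bytes)
{
    (void)gpa; (void)bytes;
    n->hits++;
}
```

The MMU becomes one such node in the follow-up patch, with kvm_mmu_pte_write() playing the role of count_hits() here.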

Signed-off-by: Xiao Guangrong 
---
 arch/x86/include/asm/kvm_host.h   |  1 +
 arch/x86/include/asm/kvm_page_track.h | 39 
 arch/x86/kvm/page_track.c | 67 +++
 arch/x86/kvm/x86.c|  4 +++
 4 files changed, 111 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index afff1f1..0f7b940 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -658,6 +658,7 @@ struct kvm_arch {
 */
struct list_head active_mmu_pages;
struct list_head zapped_obsolete_pages;
+   struct kvm_page_track_notifier_head track_notifier_head;
 
struct list_head assigned_dev_head;
struct iommu_domain *iommu_domain;
diff --git a/arch/x86/include/asm/kvm_page_track.h 
b/arch/x86/include/asm/kvm_page_track.h
index f223201..6744234 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -6,6 +6,36 @@ enum kvm_page_track_mode {
KVM_PAGE_TRACK_MAX,
 };
 
+/*
+ * The notifier represented by @kvm_page_track_notifier_node is linked into
+ * the head which will be notified when guest is triggering the track event.
+ *
+ * Write access on the head is protected by kvm->mmu_lock, read access
+ * is protected by track_srcu.
+ */
+struct kvm_page_track_notifier_head {
+   struct srcu_struct track_srcu;
+   struct hlist_head track_notifier_list;
+};
+
+struct kvm_page_track_notifier_node {
+   struct hlist_node node;
+
+   /*
+* It is called when guest is writing the write-tracked page
+* and write emulation is finished at that time.
+*
+* @vcpu: the vcpu where the write access happened.
+* @gpa: the physical address written by guest.
+* @new: the data was written to the address.
+* @bytes: the written length.
+*/
+   void (*track_write)(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
+   int bytes);
+};
+
+void kvm_page_track_init(struct kvm *kvm);
+
 int kvm_page_track_create_memslot(struct kvm_memory_slot *slot,
  unsigned long npages);
 void kvm_page_track_free_memslot(struct kvm_memory_slot *free,
@@ -17,4 +47,13 @@ void kvm_page_track_remove_page(struct kvm *kvm, gfn_t gfn,
enum kvm_page_track_mode mode);
 bool kvm_page_track_check_mode(struct kvm_vcpu *vcpu, gfn_t gfn,
   enum kvm_page_track_mode mode);
+
+void
+kvm_page_track_register_notifier(struct kvm *kvm,
+struct kvm_page_track_notifier_node *n);
+void
+kvm_page_track_unregister_notifier(struct kvm *kvm,
+  struct kvm_page_track_notifier_node *n);
+void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
+ int bytes);
 #endif
diff --git a/arch/x86/kvm/page_track.c b/arch/x86/kvm/page_track.c
index dc2da12..84420df 100644
--- a/arch/x86/kvm/page_track.c
+++ b/arch/x86/kvm/page_track.c
@@ -165,3 +165,70 @@ bool kvm_page_track_check_mode(struct kvm_vcpu *vcpu, 
gfn_t gfn,
 
return !!ACCESS_ONCE(slot->arch.gfn_track[mode][index]);
 }
+
+void kvm_page_track_init(struct kvm *kvm)
+{
+   struct kvm_page_track_notifier_head *head;
+
+   head = &kvm->arch.track_notifier_head;
+   init_srcu_struct(&head->track_srcu);
+   INIT_HLIST_HEAD(&head->track_notifier_list);
+}
+
+/*
+ * register the notifier so that event interception for the tracked guest
+ * pages can be received.
+ */
+void
+kvm_page_track_register_notifier(struct kvm *kvm,
+struct kvm_page_track_notifier_node *n)
+{
+   struct kvm_page_track_notifier_head *head;
+
+   head = &kvm->arch.track_notifier_head;
+
+   spin_lock(&kvm->mmu_lock);
+   hlist_add_head_rcu(&n->node, &head->track_notifier_list);
+   spin_unlock(&kvm->mmu_lock);
+}
+
+/*
+ * stop receiving the event interception. It is the opposed operation of
+ * kvm_page_track_register_notifier().
+ */
+void
+kvm_page_track_unregister_notifier(struct kvm *kvm,
+  struct kvm_page_track_notifier_node *n)
+{
+   struct kvm_page_track_notifier_head *head;
+
+   head = &kvm->arch.track_notifier_head;
+
+   spin_lock(&kvm->mmu_lock);
+   hlist_del_rcu(&n->node);
+   spin_unlock(&kvm->mmu_lock);
+   synchronize_srcu(&head->track_srcu);
+}
+
+/*
+ * Notify the node that write access is intercepted and write emulation is
+ * finished at 

[PATCH 06/11] KVM: MMU: let page fault handler be aware tracked page

2015-11-30 Thread Xiao Guangrong
A page fault caused by a write access to a write-tracked page cannot be
fixed; it always needs to be emulated. page_fault_handle_page_track() is
the fast path we introduce here to skip holding the mmu-lock and walking
the shadow page table.

However, if the page table entry is not present, it is worth making it
present and read-only so that read accesses still succeed.

mmu_need_write_protect() needs to be adjusted to avoid the page becoming
writable when making the page table present or when syncing/prefetching
shadow page table entries.
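The fast-path condition can be sketched independently of KVM: reserved-bit faults belong to the MMIO path, only present+write faults can be tracking hits, and everything else falls through to the normal handler. The bit values below mirror the x86 #PF error code (present = bit 0, write = bit 1, rsvd = bit 3); is_gfn_tracked() is a stand-in for kvm_page_track_check_mode().

```c
#include <stdint.h>

#define PF_PRESENT (1u << 0)
#define PF_WRITE   (1u << 1)
#define PF_RSVD    (1u << 3)

/* Stand-in for kvm_page_track_check_mode(): pretend one gfn is tracked. */
static int is_gfn_tracked(uint64_t gfn)
{
    return gfn == 0x1234;
}

/* Sketch of page_fault_handle_page_track(): returns 1 when the fault
 * must go straight to emulation instead of being fixed in the MMU. */
static int fault_handled_by_track(uint32_t error_code, uint64_t gfn)
{
    if (error_code & PF_RSVD)
        return 0;               /* reserved-bit fault: MMIO path */
    if (!(error_code & PF_PRESENT) || !(error_code & PF_WRITE))
        return 0;               /* not-present or read fault: fixable */
    return is_gfn_tracked(gfn); /* write to a tracked gfn: emulate */
}
```

Note how a not-present fault deliberately returns 0: that is the "make the entry present and read-only" case described above.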

Signed-off-by: Xiao Guangrong 
---
 arch/x86/include/asm/kvm_page_track.h |  2 ++
 arch/x86/kvm/mmu.c| 44 +--
 arch/x86/kvm/page_track.c | 14 +++
 arch/x86/kvm/paging_tmpl.h|  3 +++
 4 files changed, 56 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_page_track.h 
b/arch/x86/include/asm/kvm_page_track.h
index 9cc17c6..f223201 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -15,4 +15,6 @@ void kvm_page_track_add_page(struct kvm *kvm, gfn_t gfn,
 enum kvm_page_track_mode mode);
 void kvm_page_track_remove_page(struct kvm *kvm, gfn_t gfn,
enum kvm_page_track_mode mode);
+bool kvm_page_track_check_mode(struct kvm_vcpu *vcpu, gfn_t gfn,
+  enum kvm_page_track_mode mode);
 #endif
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 39809b8..b23f9fc 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -41,6 +41,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * When setting this variable to true it enables Two-Dimensional-Paging
@@ -2456,25 +2457,29 @@ static void kvm_unsync_pages(struct kvm_vcpu *vcpu,  
gfn_t gfn)
}
 }
 
-static int mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
- bool can_unsync)
+static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
+  bool can_unsync)
 {
struct kvm_mmu_page *s;
bool need_unsync = false;
 
+   if (kvm_page_track_check_mode(vcpu, gfn, KVM_PAGE_TRACK_WRITE))
+   return true;
+
for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn) {
if (!can_unsync)
-   return 1;
+   return true;
 
if (s->role.level != PT_PAGE_TABLE_LEVEL)
-   return 1;
+   return true;
 
if (!s->unsync)
need_unsync = true;
}
if (need_unsync)
kvm_unsync_pages(vcpu, gfn);
-   return 0;
+
+   return false;
 }
 
 static bool kvm_is_mmio_pfn(pfn_t pfn)
@@ -3388,10 +3393,30 @@ int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 
addr, bool direct)
 }
 EXPORT_SYMBOL_GPL(handle_mmio_page_fault);
 
+static bool page_fault_handle_page_track(struct kvm_vcpu *vcpu,
+u32 error_code, gfn_t gfn)
+{
+   if (unlikely(error_code & PFERR_RSVD_MASK))
+   return false;
+
+   if (!(error_code & PFERR_PRESENT_MASK) ||
+ !(error_code & PFERR_WRITE_MASK))
+   return false;
+
+   /*
+* guest is writing the page which is write tracked which can
+* not be fixed by page fault handler.
+*/
+   if (kvm_page_track_check_mode(vcpu, gfn, KVM_PAGE_TRACK_WRITE))
+   return true;
+
+   return false;
+}
+
 static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
u32 error_code, bool prefault)
 {
-   gfn_t gfn;
+   gfn_t gfn = gva >> PAGE_SHIFT;
int r;
 
pgprintk("%s: gva %lx error %x\n", __func__, gva, error_code);
@@ -3403,13 +3428,15 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
return r;
}
 
+   if (page_fault_handle_page_track(vcpu, error_code, gfn))
+   return 1;
+
r = mmu_topup_memory_caches(vcpu);
if (r)
return r;
 
MMU_WARN_ON(!VALID_PAGE(vcpu->arch.mmu.root_hpa));
 
-   gfn = gva >> PAGE_SHIFT;
 
return nonpaging_map(vcpu, gva & PAGE_MASK,
 error_code, gfn, prefault);
@@ -3493,6 +3520,9 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
return r;
}
 
+   if (page_fault_handle_page_track(vcpu, error_code, gfn))
+   return 1;
+
r = mmu_topup_memory_caches(vcpu);
if (r)
return r;
diff --git a/arch/x86/kvm/page_track.c b/arch/x86/kvm/page_track.c
index ad510db..dc2da12 100644
--- a/arch/x86/kvm/page_track.c
+++ b/arch/x86/kvm/page_track.c
@@ -151,3 +151,17 @@ void kvm_page_track_remove_page(struct kvm *kvm, gfn_t gfn,
spin_unlock(&kvm->mmu_lock);
}
 }
+
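For reference, the bit tests in page_fault_handle_page_track() above can be sketched as a standalone check. The PFERR_* values below are assumptions mirroring the x86 page-fault error-code layout (present = bit 0, write = bit 1, reserved = bit 3), not taken from this patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values mirroring the x86 page-fault error code bits. */
#define PFERR_PRESENT_MASK (1u << 0)
#define PFERR_WRITE_MASK   (1u << 1)
#define PFERR_RSVD_MASK    (1u << 3)

/* A faulting access can only be a tracked write if it is a present,
 * non-reserved write access; anything else is handled normally. */
static bool fault_may_be_tracked_write(uint32_t error_code)
{
    if (error_code & PFERR_RSVD_MASK)
        return false;
    if (!(error_code & PFERR_PRESENT_MASK) ||
        !(error_code & PFERR_WRITE_MASK))
        return false;
    return true;
}
```

Only faults that pass this filter go on to the (more expensive) per-gfn tracking lookup.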

[PATCH 08/11] KVM: MMU: use page track for non-leaf shadow pages

2015-11-30 Thread Xiao Guangrong
Non-leaf shadow pages are always write-protected, so they can be
users of page tracking.

Signed-off-by: Xiao Guangrong 
---
 arch/x86/include/asm/kvm_page_track.h |  8 +
 arch/x86/kvm/mmu.c| 26 +---
 arch/x86/kvm/page_track.c | 58 +++
 3 files changed, 67 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
index 6744234..3447dac 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -41,8 +41,16 @@ int kvm_page_track_create_memslot(struct kvm_memory_slot *slot,
 void kvm_page_track_free_memslot(struct kvm_memory_slot *free,
 struct kvm_memory_slot *dont);
 
+void
+kvm_slot_page_track_add_page_nolock(struct kvm *kvm,
+   struct kvm_memory_slot *slot, gfn_t gfn,
+   enum kvm_page_track_mode mode);
 void kvm_page_track_add_page(struct kvm *kvm, gfn_t gfn,
 enum kvm_page_track_mode mode);
+void kvm_slot_page_track_remove_page_nolock(struct kvm *kvm,
+   struct kvm_memory_slot *slot,
+   gfn_t gfn,
+   enum kvm_page_track_mode mode);
 void kvm_page_track_remove_page(struct kvm *kvm, gfn_t gfn,
enum kvm_page_track_mode mode);
 bool kvm_page_track_check_mode(struct kvm_vcpu *vcpu, gfn_t gfn,
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index b23f9fc..5a2ca73 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -806,11 +806,17 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
struct kvm_memory_slot *slot;
gfn_t gfn;
 
+   kvm->arch.indirect_shadow_pages++;
gfn = sp->gfn;
slots = kvm_memslots_for_spte_role(kvm, sp->role);
slot = __gfn_to_memslot(slots, gfn);
+
+   /* non-leaf shadow pages are kept read-only. */
+   if (sp->role.level > PT_PAGE_TABLE_LEVEL)
+   return kvm_slot_page_track_add_page_nolock(kvm, slot, gfn,
+   KVM_PAGE_TRACK_WRITE);
+
kvm_mmu_gfn_disallow_lpage(slot, gfn);
-   kvm->arch.indirect_shadow_pages++;
 }
 
 static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
@@ -819,11 +825,15 @@ static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
struct kvm_memory_slot *slot;
gfn_t gfn;
 
+   kvm->arch.indirect_shadow_pages--;
gfn = sp->gfn;
slots = kvm_memslots_for_spte_role(kvm, sp->role);
slot = __gfn_to_memslot(slots, gfn);
+   if (sp->role.level > PT_PAGE_TABLE_LEVEL)
+   return kvm_slot_page_track_remove_page_nolock(kvm, slot, gfn,
+   KVM_PAGE_TRACK_WRITE);
+
kvm_mmu_gfn_allow_lpage(slot, gfn);
-   kvm->arch.indirect_shadow_pages--;
 }
 
 static bool __mmu_gfn_lpage_is_disallowed(gfn_t gfn, int level,
@@ -2140,12 +2150,18 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
hlist_add_head(&sp->hash_link,
&vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)]);
if (!direct) {
-   if (rmap_write_protect(vcpu, gfn))
+   /*
+* we should do write protection before syncing pages;
+* otherwise the content of the synced shadow page may
+* be inconsistent with the guest page table.
+*/
+   account_shadowed(vcpu->kvm, sp);
+
+   if (level == PT_PAGE_TABLE_LEVEL &&
+ rmap_write_protect(vcpu, gfn))
kvm_flush_remote_tlbs(vcpu->kvm);
if (level > PT_PAGE_TABLE_LEVEL && need_sync)
kvm_sync_pages(vcpu, gfn);
-
-   account_shadowed(vcpu->kvm, sp);
}
sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
init_shadow_page_table(sp);
diff --git a/arch/x86/kvm/page_track.c b/arch/x86/kvm/page_track.c
index 84420df..87554d3 100644
--- a/arch/x86/kvm/page_track.c
+++ b/arch/x86/kvm/page_track.c
@@ -77,6 +77,26 @@ static void update_gfn_track(struct kvm_memory_slot *slot, gfn_t gfn,
WARN_ON(val < 0);
 }
 
+void
+kvm_slot_page_track_add_page_nolock(struct kvm *kvm,
+   struct kvm_memory_slot *slot, gfn_t gfn,
+   enum kvm_page_track_mode mode)
+{
+   WARN_ON(!check_mode(mode));
+
+   update_gfn_track(slot, gfn, mode, 1);
+
+   /*
+* new track stops large page mapping for the
+* tracked page.
+*/
+   kvm_mmu_gfn_disallow_lpage(slot, gfn);
+
+   if (mode == KVM_PAGE_TRACK_WRITE)
+   if 
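The bookkeeping that kvm_slot_page_track_add_page_nolock() builds on is a per-gfn, per-mode reference count kept in the memslot. A minimal userspace sketch of that counting scheme follows, with a fixed-size array standing in for the memslot's gfn_track arrays (the names and sizes here are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdbool.h>

#define NPAGES      8
#define TRACK_WRITE 0

/* Hypothetical stand-in for the per-memslot gfn_track arrays: one
 * unsigned short counter per gfn per tracking mode. */
static unsigned short gfn_track[1][NPAGES];

/* count is +1 on add and -1 on remove, as in update_gfn_track(). */
static void track_update(int mode, unsigned gfn, int count)
{
    gfn_track[mode][gfn] += count;
}

/* A gfn is tracked while at least one tracker holds a reference. */
static bool track_check_mode(int mode, unsigned gfn)
{
    return gfn_track[mode][gfn] > 0;
}
```

Reference counting lets independent trackers (e.g. non-leaf shadow pages and a future GPU-shadowing user) share the same gfn without stepping on each other.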

[PATCH v2 1/3] target-i386: Define structs for layout of xsave area

2015-11-30 Thread Eduardo Habkost
Add structs that define the layout of the xsave areas used by
Intel processors. Add some QEMU_BUILD_BUG_ON lines to ensure the
structs match the XSAVE_* macros in target-i386/kvm.c and the
offsets and sizes at target-i386/cpu.c:ext_save_areas.

Signed-off-by: Eduardo Habkost 
---
Changes v1 -> v2:
* Use uint8_t[8*n] instead of uint64_t[n] for register data
---
 target-i386/cpu.h | 85 +++
 target-i386/kvm.c | 22 ++
 2 files changed, 107 insertions(+)

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 84edfd0..41f55ef 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -806,6 +806,91 @@ typedef struct {
 
 #define NB_OPMASK_REGS 8
 
+typedef union X86LegacyXSaveArea {
+struct {
+uint16_t fcw;
+uint16_t fsw;
+uint8_t ftw;
+uint8_t reserved;
+uint16_t fpop;
+uint64_t fpip;
+uint64_t fpdp;
+uint32_t mxcsr;
+uint32_t mxcsr_mask;
+FPReg fpregs[8];
+uint8_t xmm_regs[16][16];
+};
+uint8_t data[512];
+} X86LegacyXSaveArea;
+
+typedef struct X86XSaveHeader {
+uint64_t xstate_bv;
+uint64_t xcomp_bv;
+uint8_t reserved[48];
+} X86XSaveHeader;
+
+/* Ext. save area 2: AVX State */
+typedef struct XSaveAVX {
+uint8_t ymmh[16][16];
+} XSaveAVX;
+
+/* Ext. save area 3: BNDREG */
+typedef struct XSaveBNDREG {
+BNDReg bnd_regs[4];
+} XSaveBNDREG;
+
+/* Ext. save area 4: BNDCSR */
+typedef union XSaveBNDCSR {
+BNDCSReg bndcsr;
+uint8_t data[64];
+} XSaveBNDCSR;
+
+/* Ext. save area 5: Opmask */
+typedef struct XSaveOpmask {
+uint64_t opmask_regs[NB_OPMASK_REGS];
+} XSaveOpmask;
+
+/* Ext. save area 6: ZMM_Hi256 */
+typedef struct XSaveZMM_Hi256 {
+uint8_t zmm_hi256[16][32];
+} XSaveZMM_Hi256;
+
+/* Ext. save area 7: Hi16_ZMM */
+typedef struct XSaveHi16_ZMM {
+uint8_t hi16_zmm[16][64];
+} XSaveHi16_ZMM;
+
+typedef struct X86XSaveArea {
+X86LegacyXSaveArea legacy;
+X86XSaveHeader header;
+
+/* Extended save areas: */
+
+/* AVX State: */
+XSaveAVX avx_state;
+uint8_t padding[960-576-sizeof(XSaveAVX)];
+/* MPX State: */
+XSaveBNDREG bndreg_state;
+XSaveBNDCSR bndcsr_state;
+/* AVX-512 State: */
+XSaveOpmask opmask_state;
+XSaveZMM_Hi256 zmm_hi256_state;
+XSaveHi16_ZMM hi16_zmm_state;
+} X86XSaveArea;
+
+QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, avx_state) != 0x240);
+QEMU_BUILD_BUG_ON(sizeof(XSaveAVX) != 0x100);
+QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, bndreg_state) != 0x3c0);
+QEMU_BUILD_BUG_ON(sizeof(XSaveBNDREG) != 0x40);
+QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, bndcsr_state) != 0x400);
+QEMU_BUILD_BUG_ON(sizeof(XSaveBNDCSR) != 0x40);
+QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, opmask_state) != 0x440);
+QEMU_BUILD_BUG_ON(sizeof(XSaveOpmask) != 0x40);
+QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, zmm_hi256_state) != 0x480);
+QEMU_BUILD_BUG_ON(sizeof(XSaveZMM_Hi256) != 0x200);
+QEMU_BUILD_BUG_ON(offsetof(X86XSaveArea, hi16_zmm_state) != 0x680);
+QEMU_BUILD_BUG_ON(sizeof(XSaveHi16_ZMM) != 0x400);
+
 typedef enum TPRAccess {
 TPR_ACCESS_READ,
 TPR_ACCESS_WRITE,
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 6dc9846..b8b336b 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1218,6 +1218,28 @@ static int kvm_put_fpu(X86CPU *cpu)
 #define XSAVE_ZMM_Hi256   288
 #define XSAVE_Hi16_ZMM416
 
+#define XSAVE_BYTE_OFFSET(word_offset) \
+((word_offset)*sizeof(((struct kvm_xsave*)0)->region[0]))
+
+#define ASSERT_OFFSET(word_offset, field) \
+QEMU_BUILD_BUG_ON(XSAVE_BYTE_OFFSET(word_offset) != \
+  offsetof(X86XSaveArea, field))
+
+ASSERT_OFFSET(XSAVE_FCW_FSW, legacy.fcw);
+ASSERT_OFFSET(XSAVE_FTW_FOP, legacy.ftw);
+ASSERT_OFFSET(XSAVE_CWD_RIP, legacy.fpip);
+ASSERT_OFFSET(XSAVE_CWD_RDP, legacy.fpdp);
+ASSERT_OFFSET(XSAVE_MXCSR, legacy.mxcsr);
+ASSERT_OFFSET(XSAVE_ST_SPACE, legacy.fpregs);
+ASSERT_OFFSET(XSAVE_XMM_SPACE, legacy.xmm_regs);
+ASSERT_OFFSET(XSAVE_XSTATE_BV, header.xstate_bv);
+ASSERT_OFFSET(XSAVE_YMMH_SPACE, avx_state);
+ASSERT_OFFSET(XSAVE_BNDREGS, bndreg_state);
+ASSERT_OFFSET(XSAVE_BNDCSR, bndcsr_state);
+ASSERT_OFFSET(XSAVE_OPMASK, opmask_state);
+ASSERT_OFFSET(XSAVE_ZMM_Hi256, zmm_hi256_state);
+ASSERT_OFFSET(XSAVE_Hi16_ZMM, hi16_zmm_state);
+
 static int kvm_put_xsave(X86CPU *cpu)
 {
CPUX86State *env = &cpu->env;
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
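The QEMU_BUILD_BUG_ON checks in the patch above pin the struct layout at compile time. The same technique can be sketched with C11 _Static_assert on a cut-down area (legacy region, header, and AVX state only; field contents are elided, and the 0x240 offset is the standard XSAVE layout assumption):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef union LegacyArea {          /* 512-byte FXSAVE region */
    struct { uint16_t fcw, fsw; /* remaining fields elided */ };
    uint8_t data[512];
} LegacyArea;

typedef struct XSaveHeader {        /* 64-byte XSAVE header */
    uint64_t xstate_bv;
    uint64_t xcomp_bv;
    uint8_t  reserved[48];
} XSaveHeader;

typedef struct XSaveAVX { uint8_t ymmh[16][16]; } XSaveAVX;

typedef struct XSaveArea {
    LegacyArea  legacy;
    XSaveHeader header;
    XSaveAVX    avx_state;          /* first extended save area */
} XSaveArea;

/* Layout checks in the spirit of QEMU_BUILD_BUG_ON: fail the build,
 * not the run, if padding or field ordering ever drifts. */
_Static_assert(offsetof(XSaveArea, avx_state) == 0x240, "AVX offset");
_Static_assert(sizeof(XSaveAVX) == 0x100, "AVX size");
```

Because the asserts are evaluated by the compiler, a layout regression is caught before any guest state can be marshalled through the wrong offsets.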


[PATCH v2 2/3] target-i386: Use xsave structs for ext_save_area

2015-11-30 Thread Eduardo Habkost
This doesn't introduce any change in the code, as the offsets and
struct sizes match what was present in the table. This can be
validated by the QEMU_BUILD_BUG_ON lines in target-i386/cpu.h,
which ensure the struct sizes and offsets match the existing
values in ext_save_area.

Signed-off-by: Eduardo Habkost 
---
 target-i386/cpu.c | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 11e5e39..bc95437 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -458,17 +458,23 @@ typedef struct ExtSaveArea {
 
 static const ExtSaveArea ext_save_areas[] = {
 [2] = { .feature = FEAT_1_ECX, .bits = CPUID_EXT_AVX,
-.offset = 0x240, .size = 0x100 },
+.offset = offsetof(X86XSaveArea, avx_state),
+.size = sizeof(XSaveAVX) },
 [3] = { .feature = FEAT_7_0_EBX, .bits = CPUID_7_0_EBX_MPX,
-.offset = 0x3c0, .size = 0x40  },
+.offset = offsetof(X86XSaveArea, bndreg_state),
+.size = sizeof(XSaveBNDREG)  },
 [4] = { .feature = FEAT_7_0_EBX, .bits = CPUID_7_0_EBX_MPX,
-.offset = 0x400, .size = 0x40  },
+.offset = offsetof(X86XSaveArea, bndcsr_state),
+.size = sizeof(XSaveBNDCSR)  },
 [5] = { .feature = FEAT_7_0_EBX, .bits = CPUID_7_0_EBX_AVX512F,
-.offset = 0x440, .size = 0x40 },
+.offset = offsetof(X86XSaveArea, opmask_state),
+.size = sizeof(XSaveOpmask) },
 [6] = { .feature = FEAT_7_0_EBX, .bits = CPUID_7_0_EBX_AVX512F,
-.offset = 0x480, .size = 0x200 },
+.offset = offsetof(X86XSaveArea, zmm_hi256_state),
+.size = sizeof(XSaveZMM_Hi256) },
 [7] = { .feature = FEAT_7_0_EBX, .bits = CPUID_7_0_EBX_AVX512F,
-.offset = 0x680, .size = 0x400 },
+.offset = offsetof(X86XSaveArea, hi16_zmm_state),
+.size = sizeof(XSaveHi16_ZMM) },
 };
 
 const char *get_register_name_32(unsigned int reg)
-- 
2.1.0

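The design point of the patch above — deriving the table from offsetof()/sizeof() instead of hard-coded magic numbers — can be shown in isolation. The types below are illustrative stand-ins, not the real QEMU structs:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Cut-down stand-in for the xsave area: sizes chosen so the fields
 * land at the standard 0x240 (AVX) and 0x3c0 (BNDREG) offsets. */
typedef struct Area {
    uint8_t legacy[512];
    uint8_t header[64];
    uint8_t avx[256];
    uint8_t pad[128];
    uint8_t bndreg[64];
} Area;

typedef struct ExtSaveArea { unsigned offset, size; } ExtSaveArea;

/* Deriving the table from the struct keeps the two in sync by
 * construction: editing the struct automatically updates the table. */
static const ExtSaveArea ext_save_areas[] = {
    [2] = { .offset = offsetof(Area, avx),
            .size   = sizeof(((Area *)0)->avx) },
    [3] = { .offset = offsetof(Area, bndreg),
            .size   = sizeof(((Area *)0)->bndreg) },
};
```

With magic numbers, a struct change silently desynchronizes the table; with offsetof()/sizeof(), it cannot.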


Re: [PATCH v4 04/21] KVM: ARM64: Add reset and access handlers for PMCR_EL0 register

2015-11-30 Thread Marc Zyngier
On Fri, 30 Oct 2015 14:21:46 +0800
Shannon Zhao  wrote:

> From: Shannon Zhao 
> 
> Add reset handler which gets host value of PMCR_EL0 and make writable
> bits architecturally UNKNOWN except PMCR.E to zero. Add a common access
> handler for PMU registers which emulates writing and reading register
> and add emulation for PMCR.
> 
> Signed-off-by: Shannon Zhao 
> ---
>  arch/arm64/kvm/sys_regs.c | 106 
> +-
>  1 file changed, 104 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> index d03d3af..5b591d6 100644
> --- a/arch/arm64/kvm/sys_regs.c
> +++ b/arch/arm64/kvm/sys_regs.c
> @@ -33,6 +33,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  
> @@ -446,6 +447,67 @@ static void reset_mpidr(struct kvm_vcpu *vcpu, const 
> struct sys_reg_desc *r)
>   vcpu_sys_reg(vcpu, MPIDR_EL1) = (1ULL << 31) | mpidr;
>  }
>  
> +static void vcpu_sysreg_write(struct kvm_vcpu *vcpu,
> +   const struct sys_reg_desc *r, u64 val)
> +{
> + if (!vcpu_mode_is_32bit(vcpu))
> + vcpu_sys_reg(vcpu, r->reg) = val;
> + else
> + vcpu_cp15(vcpu, r->reg) = lower_32_bits(val);
> +}
> +
> +static void reset_pmcr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r)
> +{
> + u64 pmcr, val;
> +
> + asm volatile("mrs %0, pmcr_el0\n" : "=r" (pmcr));
> + /* Writable bits of PMCR_EL0 (ARMV8_PMCR_MASK) is reset to UNKNOWN
> +  * except PMCR.E resetting to zero.
> +  */
> + val = ((pmcr & ~ARMV8_PMCR_MASK) | (ARMV8_PMCR_MASK & 0xdecafbad))
> +   & (~ARMV8_PMCR_E);
> + vcpu_sysreg_write(vcpu, r, val);
> +}
> +
> +/* PMU registers accessor. */
> +static bool access_pmu_regs(struct kvm_vcpu *vcpu,
> + const struct sys_reg_params *p,
> + const struct sys_reg_desc *r)
> +{
> + unsigned long val;

I'd feel a lot more comfortable if this was a u64...

> +
> + if (p->is_write) {
> + switch (r->reg) {
> + case PMCR_EL0: {
> + /* Only update writeable bits of PMCR */
> + val = vcpu_sys_reg(vcpu, r->reg);
> + val &= ~ARMV8_PMCR_MASK;
> + val |= *vcpu_reg(vcpu, p->Rt) & ARMV8_PMCR_MASK;
> + vcpu_sys_reg(vcpu, r->reg) = val;
> + break;
> + }
> + default:
> + vcpu_sys_reg(vcpu, r->reg) = *vcpu_reg(vcpu, p->Rt);
> + break;
> + }
> + } else {
> + switch (r->reg) {
> + case PMCR_EL0: {
> + /* PMCR.P & PMCR.C are RAZ */
> + val = vcpu_sys_reg(vcpu, r->reg)
> +   & ~(ARMV8_PMCR_P | ARMV8_PMCR_C);
> + *vcpu_reg(vcpu, p->Rt) = val;
> + break;
> + }
> + default:
> + *vcpu_reg(vcpu, p->Rt) = vcpu_sys_reg(vcpu, r->reg);
> + break;
> + }
> + }
> +
> + return true;
> +}
> +
>  /* Silly macro to expand the DBG{BCR,BVR,WVR,WCR}n_EL1 registers in one go */
>  #define DBG_BCR_BVR_WCR_WVR_EL1(n)   \
>   /* DBGBVRn_EL1 */   \
> @@ -630,7 +692,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
>  
>   /* PMCR_EL0 */
>   { Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b000),
> -   trap_raz_wi },
> +   access_pmu_regs, reset_pmcr, PMCR_EL0, },
>   /* PMCNTENSET_EL0 */
>   { Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b001),
> trap_raz_wi },
> @@ -864,6 +926,45 @@ static const struct sys_reg_desc cp14_64_regs[] = {
>   { Op1( 0), CRm( 2), .access = trap_raz_wi },
>  };
>  
> +/* PMU CP15 registers accessor. */
> +static bool access_pmu_cp15_regs(struct kvm_vcpu *vcpu,
> +  const struct sys_reg_params *p,
> +  const struct sys_reg_desc *r)
> +{
> + unsigned long val;

... and this a u32.

> +
> + if (p->is_write) {
> + switch (r->reg) {
> + case c9_PMCR: {
> + /* Only update writeable bits of PMCR */
> + val = vcpu_cp15(vcpu, r->reg);
> + val &= ~ARMV8_PMCR_MASK;
> + val |= *vcpu_reg(vcpu, p->Rt) & ARMV8_PMCR_MASK;
> + vcpu_cp15(vcpu, r->reg) = val;
> + break;
> + }
> + default:
> + vcpu_cp15(vcpu, r->reg) = *vcpu_reg(vcpu, p->Rt);
> + break;
> + }
> + } else {
> + switch (r->reg) {
> + case c9_PMCR: {
> + /* PMCR.P & PMCR.C are RAZ */
> + 
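The reset_pmcr() logic quoted above follows a common pattern: read-only bits are taken from the host hardware value, writable bits reset to an arbitrary UNKNOWN pattern, and PMCR.E resets to zero. A standalone sketch of that computation (the mask and bit positions below are assumptions for illustration, not the kernel's definitions):

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: low bits of PMCR_EL0 are writable, E is bit 0,
 * and the upper bits (IMP/IDCODE/N) are read-only. */
#define ARMV8_PMCR_MASK 0x3fu
#define ARMV8_PMCR_E    (1u << 0)

/* Writable bits become an arbitrary UNKNOWN pattern, E becomes 0,
 * read-only bits are preserved from the host value. */
static uint64_t reset_pmcr_value(uint64_t host_pmcr)
{
    return ((host_pmcr & ~(uint64_t)ARMV8_PMCR_MASK) |
            (ARMV8_PMCR_MASK & 0xdecafbad)) &
           ~(uint64_t)ARMV8_PMCR_E;
}
```

Filling the writable bits with a junk pattern (rather than zero) makes guests that wrongly rely on a specific reset value fail fast.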

Re: [PATCH v4 08/21] KVM: ARM64: Add reset and access handlers for PMXEVTYPER register

2015-11-30 Thread Marc Zyngier
On Fri, 30 Oct 2015 14:21:50 +0800
Shannon Zhao  wrote:

> From: Shannon Zhao 
> 
> Since the reset value of PMXEVTYPER is UNKNOWN, use reset_unknown or
> reset_unknown_cp15 for its reset handler. Add access handler which
> emulates writing and reading PMXEVTYPER register. When writing to
> PMXEVTYPER, call kvm_pmu_set_counter_event_type to create a perf_event
> for the selected event type.
> 
> Signed-off-by: Shannon Zhao 
> ---
>  arch/arm64/kvm/sys_regs.c | 26 --
>  1 file changed, 24 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> index cb82b15..4e606ea 100644
> --- a/arch/arm64/kvm/sys_regs.c
> +++ b/arch/arm64/kvm/sys_regs.c
> @@ -491,6 +491,17 @@ static bool access_pmu_regs(struct kvm_vcpu *vcpu,
>  
>   if (p->is_write) {
>   switch (r->reg) {
> + case PMXEVTYPER_EL0: {
> + val = vcpu_sys_reg(vcpu, PMSELR_EL0);
> + kvm_pmu_set_counter_event_type(vcpu,
> +*vcpu_reg(vcpu, p->Rt),
> +val);

You are blindingly truncating 64bit values to u32. Is that intentional?

> + vcpu_sys_reg(vcpu, PMXEVTYPER_EL0) =
> +  *vcpu_reg(vcpu, p->Rt);
> + vcpu_sys_reg(vcpu, PMEVTYPER0_EL0 + val) =
> +  *vcpu_reg(vcpu, p->Rt);

Please do not break assignments like this, it makes the code
unreadable. I don't care what the 80-character police says... ;-)

> + break;
> + }
>   case PMCR_EL0: {
>   /* Only update writeable bits of PMCR */
>   val = vcpu_sys_reg(vcpu, r->reg);
> @@ -735,7 +746,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
> trap_raz_wi },
>   /* PMXEVTYPER_EL0 */
>   { Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1101), Op2(0b001),
> -   trap_raz_wi },
> +   access_pmu_regs, reset_unknown, PMXEVTYPER_EL0 },
>   /* PMXEVCNTR_EL0 */
>   { Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1101), Op2(0b010),
> trap_raz_wi },
> @@ -951,6 +962,16 @@ static bool access_pmu_cp15_regs(struct kvm_vcpu *vcpu,
>  
>   if (p->is_write) {
>   switch (r->reg) {
> + case c9_PMXEVTYPER: {
> + val = vcpu_cp15(vcpu, c9_PMSELR);
> + kvm_pmu_set_counter_event_type(vcpu,
> +*vcpu_reg(vcpu, p->Rt),
> +val);
> + vcpu_cp15(vcpu, c9_PMXEVTYPER) = *vcpu_reg(vcpu, p->Rt);
> + vcpu_cp15(vcpu, c14_PMEVTYPER0 + val) =
> +  *vcpu_reg(vcpu, p->Rt);
> + break;
> + }
>   case c9_PMCR: {
>   /* Only update writeable bits of PMCR */
>   val = vcpu_cp15(vcpu, r->reg);
> @@ -1024,7 +1045,8 @@ static const struct sys_reg_desc cp15_regs[] = {
>   { Op1( 0), CRn( 9), CRm(12), Op2( 7), access_pmu_cp15_regs,
> reset_pmceid, c9_PMCEID1 },
>   { Op1( 0), CRn( 9), CRm(13), Op2( 0), trap_raz_wi },
> - { Op1( 0), CRn( 9), CRm(13), Op2( 1), trap_raz_wi },
> + { Op1( 0), CRn( 9), CRm(13), Op2( 1), access_pmu_cp15_regs,
> +   reset_unknown_cp15, c9_PMXEVTYPER },
>   { Op1( 0), CRn( 9), CRm(13), Op2( 2), trap_raz_wi },
>   { Op1( 0), CRn( 9), CRm(14), Op2( 0), trap_raz_wi },
>   { Op1( 0), CRn( 9), CRm(14), Op2( 1), trap_raz_wi },

Thanks,

M.
-- 
Jazz is not dead. It just smells funny.


Re: [PATCH v4 18/21] KVM: ARM64: Add PMU overflow interrupt routing

2015-11-30 Thread Marc Zyngier
On Fri, 30 Oct 2015 14:22:00 +0800
Shannon Zhao  wrote:

> From: Shannon Zhao 
> 
> When calling perf_event_create_kernel_counter to create perf_event,
> assign a overflow handler. Then when perf event overflows, set
> irq_pending and call kvm_vcpu_kick() to sync the interrupt.
> 
> Signed-off-by: Shannon Zhao 
> ---
>  arch/arm/kvm/arm.c|  4 +++
>  include/kvm/arm_pmu.h |  4 +++
>  virt/kvm/arm/pmu.c| 76 
> ++-
>  3 files changed, 83 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> index 78b2869..9c0fec4 100644
> --- a/arch/arm/kvm/arm.c
> +++ b/arch/arm/kvm/arm.c
> @@ -28,6 +28,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define CREATE_TRACE_POINTS
>  #include "trace.h"
> @@ -551,6 +552,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct 
> kvm_run *run)
>  
>   if (ret <= 0 || need_new_vmid_gen(vcpu->kvm)) {
>   local_irq_enable();
> + kvm_pmu_sync_hwstate(vcpu);

This is very weird. Are you only injecting interrupts when a signal is
pending? I don't understand how this works...

>   kvm_vgic_sync_hwstate(vcpu);
>   preempt_enable();
>   kvm_timer_sync_hwstate(vcpu);
> @@ -598,6 +600,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct 
> kvm_run *run)
>   kvm_guest_exit();
>   trace_kvm_exit(kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
>  
> + kvm_pmu_post_sync_hwstate(vcpu);
> +
>   kvm_vgic_sync_hwstate(vcpu);
>  
>   preempt_enable();
> diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
> index acd025a..5e7f943 100644
> --- a/include/kvm/arm_pmu.h
> +++ b/include/kvm/arm_pmu.h
> @@ -39,6 +39,8 @@ struct kvm_pmu {
>  };
>  
>  #ifdef CONFIG_KVM_ARM_PMU
> +void kvm_pmu_sync_hwstate(struct kvm_vcpu *vcpu);
> +void kvm_pmu_post_sync_hwstate(struct kvm_vcpu *vcpu);

Please follow the current terminology: _flush_ on VM entry, _sync_ on
VM exit.

>  unsigned long kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u32 
> select_idx);
>  void kvm_pmu_disable_counter(struct kvm_vcpu *vcpu, u32 val);
>  void kvm_pmu_enable_counter(struct kvm_vcpu *vcpu, u32 val, bool all_enable);
> @@ -49,6 +51,8 @@ void kvm_pmu_set_counter_event_type(struct kvm_vcpu *vcpu, 
> u32 data,
>   u32 select_idx);
>  void kvm_pmu_handle_pmcr(struct kvm_vcpu *vcpu, u32 val);
>  #else
> +void kvm_pmu_sync_hwstate(struct kvm_vcpu *vcpu) {}
> +void kvm_pmu_post_sync_hwstate(struct kvm_vcpu *vcpu) {}
>  unsigned long kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u32 
> select_idx)
>  {
>   return 0;
> diff --git a/virt/kvm/arm/pmu.c b/virt/kvm/arm/pmu.c
> index 11d1bfb..6d48d9a 100644
> --- a/virt/kvm/arm/pmu.c
> +++ b/virt/kvm/arm/pmu.c
> @@ -21,6 +21,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  /**
>   * kvm_pmu_get_counter_value - get PMU counter value
> @@ -69,6 +70,78 @@ static void kvm_pmu_stop_counter(struct kvm_pmc *pmc)
>  }
>  
>  /**
> + * kvm_pmu_sync_hwstate - sync pmu state for cpu
> + * @vcpu: The vcpu pointer
> + *
> + * Inject virtual PMU IRQ if IRQ is pending for this cpu.
> + */
> +void kvm_pmu_sync_hwstate(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_pmu *pmu = &vcpu->arch.pmu;
> + u32 overflow;
> +
> + if (!vcpu_mode_is_32bit(vcpu))
> + overflow = vcpu_sys_reg(vcpu, PMOVSSET_EL0);
> + else
> + overflow = vcpu_cp15(vcpu, c9_PMOVSSET);
> +
> + if ((pmu->irq_pending || overflow != 0) && (pmu->irq_num != -1))
> + kvm_vgic_inject_irq(vcpu->kvm, vcpu->vcpu_id, pmu->irq_num, 1);
> +
> + pmu->irq_pending = false;
> +}
> +
> +/**
> + * kvm_pmu_post_sync_hwstate - post sync pmu state for cpu
> + * @vcpu: The vcpu pointer
> + *
> + * Inject virtual PMU IRQ if IRQ is pending for this cpu when back from 
> guest.
> + */
> +void kvm_pmu_post_sync_hwstate(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_pmu *pmu = &vcpu->arch.pmu;
> +
> + if (pmu->irq_pending && (pmu->irq_num != -1))
> + kvm_vgic_inject_irq(vcpu->kvm, vcpu->vcpu_id, pmu->irq_num, 1);
> +
> + pmu->irq_pending = false;
> +}
> +
> +/**
> + * When perf event overflows, set irq_pending and call kvm_vcpu_kick() to 
> inject
> + * the interrupt.
> + */
> +static void kvm_pmu_perf_overflow(struct perf_event *perf_event,
> +   struct perf_sample_data *data,
> +   struct pt_regs *regs)
> +{
> + struct kvm_pmc *pmc = perf_event->overflow_handler_context;
> + struct kvm_vcpu *vcpu = pmc->vcpu;
> + struct kvm_pmu *pmu = &vcpu->arch.pmu;
> + int idx = pmc->idx;
> +
> + if (!vcpu_mode_is_32bit(vcpu)) {
> + if ((vcpu_sys_reg(vcpu, PMINTENSET_EL1) >> idx) & 0x1) {
> + 

Re: [PATCH v4 21/21] KVM: ARM64: Add a new kvm ARM PMU device

2015-11-30 Thread Marc Zyngier
On Fri, 30 Oct 2015 14:22:03 +0800
Shannon Zhao  wrote:

> From: Shannon Zhao 
> 
> Add a new kvm device type KVM_DEV_TYPE_ARM_PMU_V3 for ARM PMU. Implement
> the kvm_device_ops for it.
> 
> Signed-off-by: Shannon Zhao 
> ---
>  Documentation/virtual/kvm/devices/arm-pmu.txt | 15 +
>  arch/arm64/include/uapi/asm/kvm.h |  3 +
>  include/linux/kvm_host.h  |  1 +
>  include/uapi/linux/kvm.h  |  2 +
>  virt/kvm/arm/pmu.c| 92 
> +++
>  virt/kvm/arm/vgic.c   |  8 +++
>  virt/kvm/arm/vgic.h   |  1 +
>  virt/kvm/kvm_main.c   |  4 ++
>  8 files changed, 126 insertions(+)
>  create mode 100644 Documentation/virtual/kvm/devices/arm-pmu.txt
> 
> diff --git a/Documentation/virtual/kvm/devices/arm-pmu.txt 
> b/Documentation/virtual/kvm/devices/arm-pmu.txt
> new file mode 100644
> index 000..49481c4
> --- /dev/null
> +++ b/Documentation/virtual/kvm/devices/arm-pmu.txt
> @@ -0,0 +1,15 @@
> +ARM Virtual Performance Monitor Unit (vPMU)
> +===
> +
> +Device types supported:
> +  KVM_DEV_TYPE_ARM_PMU_V3 ARM Performance Monitor Unit v3
> +
> +Instantiate one PMU instance for per VCPU through this API.
> +
> +Groups:
> +  KVM_DEV_ARM_PMU_GRP_IRQ
> +  Attributes:
> +A value describing the interrupt number of PMU overflow interrupt.
> +
> +  Errors:
> +-EINVAL: Value set is out of the expected range

What is the expected range?

> diff --git a/arch/arm64/include/uapi/asm/kvm.h 
> b/arch/arm64/include/uapi/asm/kvm.h
> index 0cd7b59..1309a93 100644
> --- a/arch/arm64/include/uapi/asm/kvm.h
> +++ b/arch/arm64/include/uapi/asm/kvm.h
> @@ -204,6 +204,9 @@ struct kvm_arch_memory_slot {
>  #define KVM_DEV_ARM_VGIC_GRP_CTRL4
>  #define   KVM_DEV_ARM_VGIC_CTRL_INIT 0
>  
> +/* Device Control API: ARM PMU */
> +#define KVM_DEV_ARM_PMU_GRP_IRQ  0
> +
>  /* KVM_IRQ_LINE irq field index values */
>  #define KVM_ARM_IRQ_TYPE_SHIFT   24
>  #define KVM_ARM_IRQ_TYPE_MASK0xff
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 1bef9e2..f6be696 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1122,6 +1122,7 @@ extern struct kvm_device_ops kvm_mpic_ops;
>  extern struct kvm_device_ops kvm_xics_ops;
>  extern struct kvm_device_ops kvm_arm_vgic_v2_ops;
>  extern struct kvm_device_ops kvm_arm_vgic_v3_ops;
> +extern struct kvm_device_ops kvm_arm_pmu_ops;
>  
>  #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
>  
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index a9256f0..f41e6b6 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1025,6 +1025,8 @@ enum kvm_device_type {
>  #define KVM_DEV_TYPE_FLICKVM_DEV_TYPE_FLIC
>   KVM_DEV_TYPE_ARM_VGIC_V3,
>  #define KVM_DEV_TYPE_ARM_VGIC_V3 KVM_DEV_TYPE_ARM_VGIC_V3
> + KVM_DEV_TYPE_ARM_PMU_V3,
> +#define  KVM_DEV_TYPE_ARM_PMU_V3 KVM_DEV_TYPE_ARM_PMU_V3
>   KVM_DEV_TYPE_MAX,
>  };
>  
> diff --git a/virt/kvm/arm/pmu.c b/virt/kvm/arm/pmu.c
> index d78ce7b..0a00d04 100644
> --- a/virt/kvm/arm/pmu.c
> +++ b/virt/kvm/arm/pmu.c
> @@ -19,10 +19,13 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
>  
> +#include "vgic.h"
> +
>  /**
>   * kvm_pmu_get_counter_value - get PMU counter value
>   * @vcpu: The vcpu pointer
> @@ -416,3 +419,92 @@ void kvm_pmu_set_counter_event_type(struct kvm_vcpu 
> *vcpu, u32 data,
>  
>   pmc->perf_event = event;
>  }
> +
> +static int kvm_arm_pmu_set_irq(struct kvm *kvm, int irq)
> +{
> + int j;
> + struct kvm_vcpu *vcpu;
> +
> + kvm_for_each_vcpu(j, vcpu, kvm) {
> + struct kvm_pmu *pmu = &vcpu->arch.pmu;
> +
> + kvm_debug("Set kvm ARM PMU irq: %d\n", irq);
> + pmu->irq_num = irq;
> + vgic_dist_irq_set_cfg(vcpu, irq, true);
> + }

So obviously, the irq must be a PPI, since all vcpus are getting the
same one. Worth documenting.

> +
> + return 0;
> +}
> +
> +static int kvm_arm_pmu_create(struct kvm_device *dev, u32 type)
> +{
> + int i, j;
> + struct kvm_vcpu *vcpu;
> + struct kvm *kvm = dev->kvm;
> +
> + kvm_for_each_vcpu(j, vcpu, kvm) {
> + struct kvm_pmu *pmu = &vcpu->arch.pmu;
> +
> + memset(pmu, 0, sizeof(*pmu));
> + for (i = 0; i < ARMV8_MAX_COUNTERS; i++) {
> + pmu->pmc[i].idx = i;
> + pmu->pmc[i].vcpu = vcpu;
> + pmu->pmc[i].bitmask = 0xUL;
> + }
> + pmu->irq_num = -1;
> + }

Surely this can be shared with the reset code?

> +
> + return 0;
> +}
> +
> +static void kvm_arm_pmu_destroy(struct kvm_device *dev)
> +{
> + kfree(dev);
> +}
> +
> 

[PATCH 10/11] KVM: MMU: clear write-flooding on the fast path of tracked page

2015-11-30 Thread Xiao Guangrong
If the page fault is caused by a write access to a write-tracked
page, the real shadow page walk is skipped, so we lose the chance
to clear write flooding for the shadow page structure the current
vcpu is using.

Fix it by locklessly walking the shadow page table to clear write
flooding on the shadow page structure outside of mmu-lock. To make
that safe, change the count to atomic_t.

Signed-off-by: Xiao Guangrong 
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/mmu.c  | 25 +
 arch/x86/kvm/paging_tmpl.h  |  4 +++-
 3 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0f7b940..ea7907d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -252,7 +252,7 @@ struct kvm_mmu_page {
 #endif
 
/* Number of writes since the last time traversal visited this page.  */
-   int write_flooding_count;
+   atomic_t write_flooding_count;
 };
 
 struct kvm_pio_request {
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f89e77f..9f6a4ef 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2081,7 +2081,7 @@ static void init_shadow_page_table(struct kvm_mmu_page *sp)
 
 static void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp)
 {
-   sp->write_flooding_count = 0;
+   atomic_set(&sp->write_flooding_count, 0);
 }
 
 static void clear_sp_write_flooding_count(u64 *spte)
@@ -2461,8 +2461,7 @@ static void __kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
kvm_mmu_mark_parents_unsync(sp);
 }
 
-static bool kvm_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn,
-bool can_unsync)
+static bool kvm_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn, bool can_unsync)
 {
struct kvm_mmu_page *s;
 
@@ -3419,6 +3418,23 @@ static bool page_fault_handle_page_track(struct kvm_vcpu *vcpu,
return false;
 }
 
+static void shadow_page_table_clear_flood(struct kvm_vcpu *vcpu, gva_t addr)
+{
+   struct kvm_shadow_walk_iterator iterator;
+   u64 spte;
+
+   if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
+   return;
+
+   walk_shadow_page_lockless_begin(vcpu);
+   for_each_shadow_entry_lockless(vcpu, addr, iterator, spte) {
+   clear_sp_write_flooding_count(iterator.sptep);
+   if (!is_shadow_present_pte(spte))
+   break;
+   }
+   walk_shadow_page_lockless_end(vcpu);
+}
+
 static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
u32 error_code, bool prefault)
 {
@@ -4246,7 +4262,8 @@ static bool detect_write_flooding(struct kvm_mmu_page *sp)
if (sp->role.level == PT_PAGE_TABLE_LEVEL)
return false;
 
-   return ++sp->write_flooding_count >= 3;
+   atomic_inc(&sp->write_flooding_count);
+   return atomic_read(&sp->write_flooding_count) >= 3;
 }
 
 /*
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index ac85682..97fe5ac 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -735,8 +735,10 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
return 0;
}
 
-   if (page_fault_handle_page_track(vcpu, error_code, walker.gfn))
+   if (page_fault_handle_page_track(vcpu, error_code, walker.gfn)) {
+   shadow_page_table_clear_flood(vcpu, addr);
return 1;
+   }
 
vcpu->arch.write_fault_to_shadow_pgtable = false;
 
-- 
1.8.3.1

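The atomic_t conversion above can be exercised in isolation with C11 atomics; the threshold of three writes matches detect_write_flooding(). This is an illustrative sketch, not the kernel's atomic API:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* The counter becomes atomic so the lockless fast path can clear it
 * concurrently with the mmu-lock path incrementing it. */
static atomic_int write_flooding_count;

/* Mirrors __clear_sp_write_flooding_count(). */
static void clear_write_flooding(void)
{
    atomic_store(&write_flooding_count, 0);
}

/* Mirrors detect_write_flooding(): report flooding once the page has
 * seen three writes since the last traversal cleared the count. */
static bool detect_write_flooding(void)
{
    atomic_fetch_add(&write_flooding_count, 1);
    return atomic_load(&write_flooding_count) >= 3;
}
```

With a plain int, the lockless clear on the tracked-page fast path would race with increments taken under mmu-lock; making the counter atomic keeps both paths safe without widening the lock.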


[PATCH 09/11] KVM: MMU: simplify mmu_need_write_protect

2015-11-30 Thread Xiao Guangrong
Now that all non-leaf shadow pages are page-tracked, if a gfn is not
tracked then no non-leaf shadow page for that gfn exists, so we can
directly mark the shadow pages of the gfn as unsync.

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 26 --
 1 file changed, 8 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5a2ca73..f89e77f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2461,41 +2461,31 @@ static void __kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
kvm_mmu_mark_parents_unsync(sp);
 }
 
-static void kvm_unsync_pages(struct kvm_vcpu *vcpu,  gfn_t gfn)
+static bool kvm_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn,
+bool can_unsync)
 {
struct kvm_mmu_page *s;
 
for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn) {
+   if (!can_unsync)
+   return true;
+
if (s->unsync)
continue;
WARN_ON(s->role.level != PT_PAGE_TABLE_LEVEL);
__kvm_unsync_page(vcpu, s);
}
+
+   return false;
 }
 
 static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
   bool can_unsync)
 {
-   struct kvm_mmu_page *s;
-   bool need_unsync = false;
-
if (kvm_page_track_check_mode(vcpu, gfn, KVM_PAGE_TRACK_WRITE))
return true;
 
-   for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn) {
-   if (!can_unsync)
-   return true;
-
-   if (s->role.level != PT_PAGE_TABLE_LEVEL)
-   return true;
-
-   if (!s->unsync)
-   need_unsync = true;
-   }
-   if (need_unsync)
-   kvm_unsync_pages(vcpu, gfn);
-
-   return false;
+   return kvm_unsync_pages(vcpu, gfn, can_unsync);
 }
 
 static bool kvm_is_mmio_pfn(pfn_t pfn)
-- 
1.8.3.1



Re: [PATCH v4 00/21] KVM: ARM64: Add guest PMU support

2015-11-30 Thread Marc Zyngier
On Fri, 30 Oct 2015 14:21:42 +0800
Shannon Zhao  wrote:

Hi Shannon,

> From: Shannon Zhao 
> 
> This patchset adds guest PMU support for KVM on ARM64. It takes
> trap-and-emulate approach. When guest wants to monitor one event, it
> will be trapped by KVM and KVM will call perf_event API to create a perf
> event and call relevant perf_event APIs to get the count value of event.

I've been through this whole series, and while this is shaping up nicely,
there are still a number of things that are a bit odd (interrupt
injection is one, the whole CP15 reset is another).

Can you please respin this soon? I'd really like to have this in for
4.5...

Thanks,

M.
-- 
Jazz is not dead. It just smells funny.


[PATCH 05/11] KVM: page track: introduce kvm_page_track_{add,remove}_page

2015-11-30 Thread Xiao Guangrong
These two functions are the user APIs:
- kvm_page_track_add_page(): add the page to the tracking pool; after
  that, the specified access on that page will be tracked

- kvm_page_track_remove_page(): remove the page from the tracking pool;
  the specified access on the page is no longer tracked after the last
  user is gone

Both of these are called under the protection of kvm->srcu or
kvm->slots_lock

Signed-off-by: Xiao Guangrong 
---
 arch/x86/include/asm/kvm_page_track.h |  5 ++
 arch/x86/kvm/page_track.c | 95 +++
 2 files changed, 100 insertions(+)

diff --git a/arch/x86/include/asm/kvm_page_track.h 
b/arch/x86/include/asm/kvm_page_track.h
index 347d5c9..9cc17c6 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -10,4 +10,9 @@ int kvm_page_track_create_memslot(struct kvm_memory_slot 
*slot,
  unsigned long npages);
 void kvm_page_track_free_memslot(struct kvm_memory_slot *free,
 struct kvm_memory_slot *dont);
+
+void kvm_page_track_add_page(struct kvm *kvm, gfn_t gfn,
+enum kvm_page_track_mode mode);
+void kvm_page_track_remove_page(struct kvm *kvm, gfn_t gfn,
+   enum kvm_page_track_mode mode);
 #endif
diff --git a/arch/x86/kvm/page_track.c b/arch/x86/kvm/page_track.c
index 0338d36..ad510db 100644
--- a/arch/x86/kvm/page_track.c
+++ b/arch/x86/kvm/page_track.c
@@ -56,3 +56,98 @@ void kvm_page_track_free_memslot(struct kvm_memory_slot 
*free,
if (!dont || free->arch.gfn_track != dont->arch.gfn_track)
page_track_slot_free(free);
 }
+
+static bool check_mode(enum kvm_page_track_mode mode)
+{
+   if (mode < 0 || mode >= KVM_PAGE_TRACK_MAX)
+   return false;
+
+   return true;
+}
+
+static void update_gfn_track(struct kvm_memory_slot *slot, gfn_t gfn,
+enum kvm_page_track_mode mode, int count)
+{
+   int index, val;
+
+   index = gfn_to_index(gfn, slot->base_gfn, PT_PAGE_TABLE_LEVEL);
+
+   slot->arch.gfn_track[mode][index] += count;
+   val = slot->arch.gfn_track[mode][index];
+   WARN_ON(val < 0);
+}
+
+/*
+ * add guest page to the tracking pool so that corresponding access on that
+ * page will be intercepted.
+ *
+ * It should be called under the protection of kvm->srcu or kvm->slots_lock
+ *
+ * @kvm: the guest instance we are interested in.
+ * @gfn: the guest page.
+ * @mode: tracking mode, currently only write track is supported.
+ */
+void kvm_page_track_add_page(struct kvm *kvm, gfn_t gfn,
+enum kvm_page_track_mode mode)
+{
+   struct kvm_memslots *slots;
+   struct kvm_memory_slot *slot;
+   int i;
+
+   WARN_ON(!check_mode(mode));
+
+   for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+   slots = __kvm_memslots(kvm, i);
+   slot = __gfn_to_memslot(slots, gfn);
+
+   spin_lock(&kvm->mmu_lock);
+   update_gfn_track(slot, gfn, mode, 1);
+
+   /*
+* new track stops large page mapping for the
+* tracked page.
+*/
+   kvm_mmu_gfn_disallow_lpage(slot, gfn);
+
+   if (mode == KVM_PAGE_TRACK_WRITE)
+   if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn))
+   kvm_flush_remote_tlbs(kvm);
+   spin_unlock(&kvm->mmu_lock);
+   }
+}
+
+/*
+ * remove the guest page from the tracking pool which stops the interception
+ * of corresponding access on that page. It is the opposed operation of
+ * kvm_page_track_add_page().
+ *
+ * It should be called under the protection of kvm->srcu or kvm->slots_lock
+ *
+ * @kvm: the guest instance we are interested in.
+ * @gfn: the guest page.
+ * @mode: tracking mode, currently only write track is supported.
+ */
+void kvm_page_track_remove_page(struct kvm *kvm, gfn_t gfn,
+   enum kvm_page_track_mode mode)
+{
+   struct kvm_memslots *slots;
+   struct kvm_memory_slot *slot;
+   int i;
+
+   WARN_ON(!check_mode(mode));
+
+   for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+   slots = __kvm_memslots(kvm, i);
+   slot = __gfn_to_memslot(slots, gfn);
+
+   spin_lock(&kvm->mmu_lock);
+   update_gfn_track(slot, gfn, mode, -1);
+
+   /*
+* allow large page mapping for the tracked page
+* after the tracker is gone.
+*/
+   kvm_mmu_gfn_allow_lpage(slot, gfn);
+   spin_unlock(&kvm->mmu_lock);
+   }
+}
-- 
1.8.3.1



[PATCH 00/11] KVM: x86: track guest page access

2015-11-30 Thread Xiao Guangrong
This patchset introduces a feature which allows us to track page access
in the guest. Currently, only write access tracking is implemented in
this version.

Four APIs are introduced:
- kvm_page_track_add_page(kvm, gfn, mode): the single guest page @gfn is
  added into the tracking pool of the guest instance represented by @kvm;
  @mode specifies which kind of access on @gfn is tracked

- kvm_page_track_remove_page(kvm, gfn, mode): the opposite operation of
  kvm_page_track_add_page(), which removes @gfn from the tracking pool.
  @gfn is no longer tracked after its last user is gone

- kvm_page_track_register_notifier(kvm, n): register a notifier so that
  the events triggered by page tracking are received; at that time, the
  n->track_write() callback will be called

- kvm_page_track_unregister_notifier(kvm, n): the opposite operation of
  kvm_page_track_register_notifier(), which unlinks the notifier and
  stops delivery of tracked events
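The notifier half of this API can be modeled in userspace as a simple singly linked callback list (a sketch only; `struct track_notifier` and these helpers are simplified stand-ins, not the kernel's kvm_page_track_notifier_node):

```c
#include <stddef.h>

/* Simplified stand-in for kvm_page_track_notifier_node. */
struct track_notifier {
	void (*track_write)(unsigned long gfn, struct track_notifier *n);
	struct track_notifier *next;
};

static struct track_notifier *track_head;	/* per-VM list in the real code */

static void track_register(struct track_notifier *n)
{
	n->next = track_head;
	track_head = n;
}

static void track_unregister(struct track_notifier *n)
{
	struct track_notifier **p;

	for (p = &track_head; *p; p = &(*p)->next) {
		if (*p == n) {
			*p = n->next;
			return;
		}
	}
}

/* Fan a write event out to every registered notifier. */
static void track_write_event(unsigned long gfn)
{
	struct track_notifier *n;

	for (n = track_head; n; n = n->next)
		n->track_write(gfn, n);
}

/* Demo notifier that sums the gfn values it was notified about. */
static unsigned long demo_sum;

static void demo_cb(unsigned long gfn, struct track_notifier *n)
{
	(void)n;
	demo_sum += gfn;
}

static unsigned long demo(void)
{
	struct track_notifier node = { demo_cb, NULL };

	track_register(&node);
	track_write_event(3);	/* delivered: demo_sum becomes 3 */
	track_unregister(&node);
	track_write_event(5);	/* no notifier left, nothing delivered */
	return demo_sum;
}
```

The register/unregister pair mirrors the add/remove symmetry of the page APIs: once the last notifier is unlinked, write events are simply no longer delivered.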

The first user of page track is non-leaf shadow page tables, as they are
always write protected. It also gains a performance improvement, because
page track speeds up the page fault handler for tracked pages. The
kernel-build performance results are as follows:

   before   after
real 461.63   real 455.48
user 4529.55  user 4557.88
sys 1995.39   sys 1922.57

Furthermore, it is the infrastructure for other kinds of shadow page
tables, such as the GPU shadow page table introduced in KVMGT (1) and
native nested IOMMU.

This patchset can be divided into two parts:
- patches 1 ~ 7 implement page tracking
- the other patches apply page tracking to non-leaf shadow page tables

(1): http://lkml.iu.edu/hypermail/linux/kernel/1510.3/01562.html

Xiao Guangrong (11):
  KVM: MMU: rename has_wrprotected_page to mmu_gfn_lpage_is_disallowed
  KVM: MMU: introduce kvm_mmu_gfn_{allow,disallow}_lpage
  KVM: MMU: introduce kvm_mmu_slot_gfn_write_protect
  KVM: page track: add the framework of guest page tracking
  KVM: page track: introduce kvm_page_track_{add,remove}_page
  KVM: MMU: let page fault handler be aware tracked page
  KVM: page track: add notifier support
  KVM: MMU: use page track for non-leaf shadow pages
  KVM: MMU: simplify mmu_need_write_protect
  KVM: MMU: clear write-flooding on the fast path of tracked page
  KVM: MMU: apply page track notifier

 Documentation/virtual/kvm/mmu.txt |   6 +-
 arch/x86/include/asm/kvm_host.h   |  12 +-
 arch/x86/include/asm/kvm_page_track.h |  67 +
 arch/x86/kvm/Makefile |   3 +-
 arch/x86/kvm/mmu.c| 199 +++
 arch/x86/kvm/mmu.h|   5 +
 arch/x86/kvm/page_track.c | 252 ++
 arch/x86/kvm/paging_tmpl.h|   5 +
 arch/x86/kvm/x86.c|  27 ++--
 9 files changed, 504 insertions(+), 72 deletions(-)
 create mode 100644 arch/x86/include/asm/kvm_page_track.h
 create mode 100644 arch/x86/kvm/page_track.c

-- 
1.8.3.1



[PATCH 01/11] KVM: MMU: rename has_wrprotected_page to mmu_gfn_lpage_is_disallowed

2015-11-30 Thread Xiao Guangrong
kvm_lpage_info->write_count is used to detect whether the large page
mapping for the gfn on the specified level is allowed; rename it to
disallow_lpage to reflect its purpose. We also rename
has_wrprotected_page() to mmu_gfn_lpage_is_disallowed() to make the code
clearer.

Later we will extend this mechanism for page tracking: if the gfn is
tracked, then a large mapping for that gfn on any level is not allowed.
The new name is more straightforward.

Signed-off-by: Xiao Guangrong 
---
 Documentation/virtual/kvm/mmu.txt |  6 +++---
 arch/x86/include/asm/kvm_host.h   |  2 +-
 arch/x86/kvm/mmu.c| 25 +
 arch/x86/kvm/x86.c| 14 --
 4 files changed, 25 insertions(+), 22 deletions(-)

diff --git a/Documentation/virtual/kvm/mmu.txt 
b/Documentation/virtual/kvm/mmu.txt
index daf9c0f..dda2e93 100644
--- a/Documentation/virtual/kvm/mmu.txt
+++ b/Documentation/virtual/kvm/mmu.txt
@@ -391,11 +391,11 @@ To instantiate a large spte, four constraints must be 
satisfied:
   write-protected pages
 - the guest page must be wholly contained by a single memory slot
 
-To check the last two conditions, the mmu maintains a ->write_count set of
+To check the last two conditions, the mmu maintains a ->disallow_lpage set of
 arrays for each memory slot and large page size.  Every write protected page
-causes its write_count to be incremented, thus preventing instantiation of
+causes its disallow_lpage to be incremented, thus preventing instantiation of
 a large spte.  The frames at the end of an unaligned memory slot have
-artificially inflated ->write_counts so they can never be instantiated.
+artificially inflated ->disallow_lpages so they can never be instantiated.
 
 Zapping all pages (page generation count)
 =
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8140077..5aa2dcc 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -606,7 +606,7 @@ struct kvm_vcpu_arch {
 };
 
 struct kvm_lpage_info {
-   int write_count;
+   int disallow_lpage;
 };
 
 struct kvm_arch_memory_slot {
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a1a3d19..61259ff 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -789,7 +789,7 @@ static void account_shadowed(struct kvm *kvm, struct 
kvm_mmu_page *sp)
slot = __gfn_to_memslot(slots, gfn);
for (i = PT_DIRECTORY_LEVEL; i <= PT_MAX_HUGEPAGE_LEVEL; ++i) {
linfo = lpage_info_slot(gfn, slot, i);
-   linfo->write_count += 1;
+   linfo->disallow_lpage += 1;
}
kvm->arch.indirect_shadow_pages++;
 }
@@ -807,31 +807,32 @@ static void unaccount_shadowed(struct kvm *kvm, struct 
kvm_mmu_page *sp)
slot = __gfn_to_memslot(slots, gfn);
for (i = PT_DIRECTORY_LEVEL; i <= PT_MAX_HUGEPAGE_LEVEL; ++i) {
linfo = lpage_info_slot(gfn, slot, i);
-   linfo->write_count -= 1;
-   WARN_ON(linfo->write_count < 0);
+   linfo->disallow_lpage -= 1;
+   WARN_ON(linfo->disallow_lpage < 0);
}
kvm->arch.indirect_shadow_pages--;
 }
 
-static int __has_wrprotected_page(gfn_t gfn, int level,
- struct kvm_memory_slot *slot)
+static bool __mmu_gfn_lpage_is_disallowed(gfn_t gfn, int level,
+ struct kvm_memory_slot *slot)
 {
struct kvm_lpage_info *linfo;
 
if (slot) {
linfo = lpage_info_slot(gfn, slot, level);
-   return linfo->write_count;
+   return !!linfo->disallow_lpage;
}
 
-   return 1;
+   return true;
 }
 
-static int has_wrprotected_page(struct kvm_vcpu *vcpu, gfn_t gfn, int level)
+static bool mmu_gfn_lpage_is_disallowed(struct kvm_vcpu *vcpu, gfn_t gfn,
+   int level)
 {
struct kvm_memory_slot *slot;
 
slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
-   return __has_wrprotected_page(gfn, level, slot);
+   return __mmu_gfn_lpage_is_disallowed(gfn, level, slot);
 }
 
 static int host_mapping_level(struct kvm *kvm, gfn_t gfn)
@@ -897,7 +898,7 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t 
large_gfn,
max_level = min(kvm_x86_ops->get_lpage_level(), host_level);
 
for (level = PT_DIRECTORY_LEVEL; level <= max_level; ++level)
-   if (__has_wrprotected_page(large_gfn, level, slot))
+   if (__mmu_gfn_lpage_is_disallowed(large_gfn, level, slot))
break;
 
return level - 1;
@@ -2511,7 +2512,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 * be fixed if guest refault.
 */
if (level > PT_PAGE_TABLE_LEVEL &&
-   has_wrprotected_page(vcpu, gfn, level))
+   mmu_gfn_lpage_is_disallowed(vcpu, gfn, level))

[PATCH 04/11] KVM: page track: add the framework of guest page tracking

2015-11-30 Thread Xiao Guangrong
The array gfn_track[mode][gfn] is introduced in the memory slot: for
every guest page it holds the tracking count on the different modes. If
the page is tracked then the count is increased; the page is no longer
tracked once the count reaches zero.

Two callbacks, kvm_page_track_create_memslot() and
kvm_page_track_free_memslot(), are implemented in this patch; they are
used internally to initialize and reclaim the memory of the array.

Currently, only write track mode is supported.
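The counting semantics can be modeled in plain C (a fixed-size array replaces the per-memslot kvm_kvzalloc()'d gfn_track; all names here are illustrative):

```c
#include <stdbool.h>

enum track_mode { TRACK_WRITE, TRACK_MAX };

#define NR_GFNS 8

/* Tracking count per mode per guest page, as in slot->arch.gfn_track. */
static int gfn_track[TRACK_MAX][NR_GFNS];

static void track_add(enum track_mode mode, unsigned long gfn)
{
	gfn_track[mode][gfn]++;
}

static void track_remove(enum track_mode mode, unsigned long gfn)
{
	gfn_track[mode][gfn]--;	/* the kernel WARN_ONs if this goes negative */
}

/* A page stays tracked while at least one user holds a reference. */
static bool is_tracked(enum track_mode mode, unsigned long gfn)
{
	return gfn_track[mode][gfn] > 0;
}

/* Two users add the page; it stays tracked until both are gone. */
static bool demo(void)
{
	bool still_tracked;

	track_add(TRACK_WRITE, 1);
	track_add(TRACK_WRITE, 1);
	track_remove(TRACK_WRITE, 1);
	still_tracked = is_tracked(TRACK_WRITE, 1);
	track_remove(TRACK_WRITE, 1);
	return still_tracked && !is_tracked(TRACK_WRITE, 1);
}
```

This is why add/remove must be balanced per user: the count, not a flag, decides when interception can finally be dropped.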

Signed-off-by: Xiao Guangrong 
---
 arch/x86/include/asm/kvm_host.h   |  2 ++
 arch/x86/include/asm/kvm_page_track.h | 13 
 arch/x86/kvm/Makefile |  3 +-
 arch/x86/kvm/page_track.c | 58 +++
 arch/x86/kvm/x86.c|  5 +++
 5 files changed, 80 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/include/asm/kvm_page_track.h
 create mode 100644 arch/x86/kvm/page_track.c

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5aa2dcc..afff1f1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define KVM_MAX_VCPUS 255
 #define KVM_SOFT_MAX_VCPUS 160
@@ -612,6 +613,7 @@ struct kvm_lpage_info {
 struct kvm_arch_memory_slot {
struct kvm_rmap_head *rmap[KVM_NR_PAGE_SIZES];
struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
+   int *gfn_track[KVM_PAGE_TRACK_MAX];
 };
 
 /*
diff --git a/arch/x86/include/asm/kvm_page_track.h 
b/arch/x86/include/asm/kvm_page_track.h
new file mode 100644
index 000..347d5c9
--- /dev/null
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -0,0 +1,13 @@
+#ifndef _ASM_X86_KVM_PAGE_TRACK_H
+#define _ASM_X86_KVM_PAGE_TRACK_H
+
+enum kvm_page_track_mode {
+   KVM_PAGE_TRACK_WRITE,
+   KVM_PAGE_TRACK_MAX,
+};
+
+int kvm_page_track_create_memslot(struct kvm_memory_slot *slot,
+ unsigned long npages);
+void kvm_page_track_free_memslot(struct kvm_memory_slot *free,
+struct kvm_memory_slot *dont);
+#endif
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index a1ff508..464fa47 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -13,9 +13,10 @@ kvm-$(CONFIG_KVM_ASYNC_PF)   += $(KVM)/async_pf.o
 
 kvm-y  += x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
   i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
-  hyperv.o
+  hyperv.o page_track.o
 
 kvm-$(CONFIG_KVM_DEVICE_ASSIGNMENT)+= assigned-dev.o iommu.o
+
 kvm-intel-y+= vmx.o pmu_intel.o
 kvm-amd-y  += svm.o pmu_amd.o
 
diff --git a/arch/x86/kvm/page_track.c b/arch/x86/kvm/page_track.c
new file mode 100644
index 000..0338d36
--- /dev/null
+++ b/arch/x86/kvm/page_track.c
@@ -0,0 +1,58 @@
+/*
+ * Support KVM guest page tracking
+ *
+ * This feature allows us to track page access in the guest. Currently, only
+ * write access is tracked.
+ *
+ * Copyright(C) 2015 Intel Corporation.
+ *
+ * Author:
+ *   Xiao Guangrong 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ */
+
+#include 
+#include 
+#include 
+
+#include "mmu.h"
+
+static void page_track_slot_free(struct kvm_memory_slot *slot)
+{
+   int i;
+
+   for (i = 0; i < KVM_PAGE_TRACK_MAX; i++)
+   if (slot->arch.gfn_track[i]) {
+   kvfree(slot->arch.gfn_track[i]);
+   slot->arch.gfn_track[i] = NULL;
+   }
+}
+
+int kvm_page_track_create_memslot(struct kvm_memory_slot *slot,
+ unsigned long npages)
+{
+   int  i, pages = gfn_to_index(slot->base_gfn + npages - 1,
+ slot->base_gfn, PT_PAGE_TABLE_LEVEL) + 1;
+
+   for (i = 0; i < KVM_PAGE_TRACK_MAX; i++) {
+   slot->arch.gfn_track[i] = kvm_kvzalloc(pages *
+   sizeof(*slot->arch.gfn_track[i]));
+   if (!slot->arch.gfn_track[i])
+   goto track_free;
+   }
+
+   return 0;
+
+track_free:
+   page_track_slot_free(slot);
+   return -ENOMEM;
+}
+
+void kvm_page_track_free_memslot(struct kvm_memory_slot *free,
+struct kvm_memory_slot *dont)
+{
+   if (!dont || free->arch.gfn_track != dont->arch.gfn_track)
+   page_track_slot_free(free);
+}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c04987e..ad4888a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7838,6 +7838,8 @@ void kvm_arch_free_memslot(struct kvm *kvm, struct 
kvm_memory_slot *free,
free->arch.lpage_info[i - 1] = NULL;
}
}
+
+   kvm_page_track_free_memslot(free, dont);

[PATCH 03/11] KVM: MMU: introduce kvm_mmu_slot_gfn_write_protect

2015-11-30 Thread Xiao Guangrong
Split rmap_write_protect() and introduce a function that abstracts the
write protection based on the slot.

This function will be used in a later patch.

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 16 +++-
 arch/x86/kvm/mmu.h |  2 ++
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 4b04d13..39809b8 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1336,23 +1336,29 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm 
*kvm,
kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
 }
 
-static bool rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
+bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
+   struct kvm_memory_slot *slot, u64 gfn)
 {
-   struct kvm_memory_slot *slot;
struct kvm_rmap_head *rmap_head;
int i;
bool write_protected = false;
 
-   slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
-
for (i = PT_PAGE_TABLE_LEVEL; i <= PT_MAX_HUGEPAGE_LEVEL; ++i) {
rmap_head = __gfn_to_rmap(gfn, i, slot);
-   write_protected |= __rmap_write_protect(vcpu->kvm, rmap_head, 
true);
+   write_protected |= __rmap_write_protect(kvm, rmap_head, true);
}
 
return write_protected;
 }
 
+static bool rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
+{
+   struct kvm_memory_slot *slot;
+
+   slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
+   return kvm_mmu_slot_gfn_write_protect(vcpu->kvm, slot, gfn);
+}
+
 static bool kvm_zap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head)
 {
u64 *sptep;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index de92bed..58fe98a 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -177,4 +177,6 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, 
gfn_t gfn_end);
 
 void kvm_mmu_gfn_disallow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
 void kvm_mmu_gfn_allow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
+bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
+   struct kvm_memory_slot *slot, u64 gfn);
 #endif
-- 
1.8.3.1



[PATCH 02/11] KVM: MMU: introduce kvm_mmu_gfn_{allow,disallow}_lpage

2015-11-30 Thread Xiao Guangrong
Abstract the common operations from account_shadowed() and
unaccount_shadowed(), then introduce kvm_mmu_gfn_disallow_lpage()
and kvm_mmu_gfn_allow_lpage()

These two functions will be used by page tracking in a later patch

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 38 +-
 arch/x86/kvm/mmu.h |  3 +++
 2 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 61259ff..4b04d13 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -776,21 +776,39 @@ static struct kvm_lpage_info *lpage_info_slot(gfn_t gfn,
	return &slot->arch.lpage_info[level - 2][idx];
 }
 
+static void update_gfn_disallow_lpage_count(struct kvm_memory_slot *slot,
+   gfn_t gfn, int count)
+{
+   struct kvm_lpage_info *linfo;
+   int i;
+
+   for (i = PT_DIRECTORY_LEVEL; i <= PT_MAX_HUGEPAGE_LEVEL; ++i) {
+   linfo = lpage_info_slot(gfn, slot, i);
+   linfo->disallow_lpage += count;
+   WARN_ON(linfo->disallow_lpage < 0);
+   }
+}
+
+void kvm_mmu_gfn_disallow_lpage(struct kvm_memory_slot *slot, gfn_t gfn)
+{
+   update_gfn_disallow_lpage_count(slot, gfn, 1);
+}
+
+void kvm_mmu_gfn_allow_lpage(struct kvm_memory_slot *slot, gfn_t gfn)
+{
+   update_gfn_disallow_lpage_count(slot, gfn, -1);
+}
+
 static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
struct kvm_memslots *slots;
struct kvm_memory_slot *slot;
-   struct kvm_lpage_info *linfo;
gfn_t gfn;
-   int i;
 
gfn = sp->gfn;
slots = kvm_memslots_for_spte_role(kvm, sp->role);
slot = __gfn_to_memslot(slots, gfn);
-   for (i = PT_DIRECTORY_LEVEL; i <= PT_MAX_HUGEPAGE_LEVEL; ++i) {
-   linfo = lpage_info_slot(gfn, slot, i);
-   linfo->disallow_lpage += 1;
-   }
+   kvm_mmu_gfn_disallow_lpage(slot, gfn);
kvm->arch.indirect_shadow_pages++;
 }
 
@@ -798,18 +816,12 @@ static void unaccount_shadowed(struct kvm *kvm, struct 
kvm_mmu_page *sp)
 {
struct kvm_memslots *slots;
struct kvm_memory_slot *slot;
-   struct kvm_lpage_info *linfo;
gfn_t gfn;
-   int i;
 
gfn = sp->gfn;
slots = kvm_memslots_for_spte_role(kvm, sp->role);
slot = __gfn_to_memslot(slots, gfn);
-   for (i = PT_DIRECTORY_LEVEL; i <= PT_MAX_HUGEPAGE_LEVEL; ++i) {
-   linfo = lpage_info_slot(gfn, slot, i);
-   linfo->disallow_lpage -= 1;
-   WARN_ON(linfo->disallow_lpage < 0);
-   }
+   kvm_mmu_gfn_allow_lpage(slot, gfn);
kvm->arch.indirect_shadow_pages--;
 }
 
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 55ffb7b..de92bed 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -174,4 +174,7 @@ static inline bool permission_fault(struct kvm_vcpu *vcpu, 
struct kvm_mmu *mmu,
 
 void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm);
 void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
+
+void kvm_mmu_gfn_disallow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
+void kvm_mmu_gfn_allow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
 #endif
-- 
1.8.3.1



Re: [PATCH v2 04/21] arm64: KVM: Implement vgic-v3 save/restore

2015-11-30 Thread Christoffer Dall
On Fri, Nov 27, 2015 at 06:49:58PM +, Marc Zyngier wrote:
> Implement the vgic-v3 save restore as a direct translation of
> the assembly code version.
> 
> Signed-off-by: Marc Zyngier 
> ---
>  arch/arm64/kvm/hyp/Makefile |   1 +
>  arch/arm64/kvm/hyp/hyp.h|   3 +
>  arch/arm64/kvm/hyp/vgic-v3-sr.c | 222 
> 
>  3 files changed, 226 insertions(+)
>  create mode 100644 arch/arm64/kvm/hyp/vgic-v3-sr.c
> 
> diff --git a/arch/arm64/kvm/hyp/Makefile b/arch/arm64/kvm/hyp/Makefile
> index d8d5968..d1e38ce 100644
> --- a/arch/arm64/kvm/hyp/Makefile
> +++ b/arch/arm64/kvm/hyp/Makefile
> @@ -3,3 +3,4 @@
>  #
>  
>  obj-$(CONFIG_KVM_ARM_HOST) += vgic-v2-sr.o
> +obj-$(CONFIG_KVM_ARM_HOST) += vgic-v3-sr.o
> diff --git a/arch/arm64/kvm/hyp/hyp.h b/arch/arm64/kvm/hyp/hyp.h
> index 78f25c4..a31cb6e 100644
> --- a/arch/arm64/kvm/hyp/hyp.h
> +++ b/arch/arm64/kvm/hyp/hyp.h
> @@ -30,5 +30,8 @@
>  void __vgic_v2_save_state(struct kvm_vcpu *vcpu);
>  void __vgic_v2_restore_state(struct kvm_vcpu *vcpu);
>  
> +void __vgic_v3_save_state(struct kvm_vcpu *vcpu);
> +void __vgic_v3_restore_state(struct kvm_vcpu *vcpu);
> +
>  #endif /* __ARM64_KVM_HYP_H__ */
>  
> diff --git a/arch/arm64/kvm/hyp/vgic-v3-sr.c b/arch/arm64/kvm/hyp/vgic-v3-sr.c
> new file mode 100644
> index 000..b490db5
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/vgic-v3-sr.c
> @@ -0,0 +1,222 @@
> +/*
> + * Copyright (C) 2012-2015 - ARM Ltd
> + * Author: Marc Zyngier 
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +
> +#include "hyp.h"
> +
> +/*
> + * We store LRs in reverse order to let the CPU deal with streaming
> + * access. Use this macro to make it look saner...
> + */
> +#define LR_OFFSET(n) (15 - n)
> +
> +#define read_gicreg(r)   
> \
> + ({  \
> + u64 reg;\
> + asm volatile("mrs_s %0, " __stringify(r) : "=r" (reg)); \
> + reg;\
> + })
> +
> +#define write_gicreg(v,r)\
> + do {\
> + u64 __val = (v);\
> + asm volatile("msr_s " __stringify(r) ", %0" : : "r" (__val));\
> + } while (0)

remind me what the msr_s and mrs_s do compared to msr and mrs?

are these the reason why we need separate macros to access the gic
registers compared to 'normal' sysregs?

> +
> +/* vcpu is already in the HYP VA space */
> +void __hyp_text __vgic_v3_save_state(struct kvm_vcpu *vcpu)
> +{
> + struct vgic_v3_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v3;
> + u64 val;
> + u32 nr_lr, nr_pri;
> +
> + /*
> +  * Make sure stores to the GIC via the memory mapped interface
> +  * are now visible to the system register interface.
> +  */
> + dsb(st);
> +
> + cpu_if->vgic_vmcr  = read_gicreg(ICH_VMCR_EL2);
> + cpu_if->vgic_misr  = read_gicreg(ICH_MISR_EL2);
> + cpu_if->vgic_eisr  = read_gicreg(ICH_EISR_EL2);
> + cpu_if->vgic_elrsr = read_gicreg(ICH_ELSR_EL2);
> +
> + write_gicreg(0, ICH_HCR_EL2);
> + val = read_gicreg(ICH_VTR_EL2);
> + nr_lr = val & 0xf;

this is not technically nr_lr, it's max_lr or max_lr_idx or something
like that.

> + nr_pri = ((u32)val >> 29) + 1;

nit: nr_pri_bits

> +
> + switch (nr_lr) {
> + case 15:
> + cpu_if->vgic_lr[LR_OFFSET(15)] = read_gicreg(ICH_LR15_EL2);
> + case 14:
> + cpu_if->vgic_lr[LR_OFFSET(14)] = read_gicreg(ICH_LR14_EL2);
> + case 13:
> + cpu_if->vgic_lr[LR_OFFSET(13)] = read_gicreg(ICH_LR13_EL2);
> + case 12:
> + cpu_if->vgic_lr[LR_OFFSET(12)] = read_gicreg(ICH_LR12_EL2);
> + case 11:
> + cpu_if->vgic_lr[LR_OFFSET(11)] = read_gicreg(ICH_LR11_EL2);
> + case 10:
> + cpu_if->vgic_lr[LR_OFFSET(10)] = read_gicreg(ICH_LR10_EL2);
> + case 9:
> + cpu_if->vgic_lr[LR_OFFSET(9)] = read_gicreg(ICH_LR9_EL2);
> + case 8:
> + cpu_if->vgic_lr[LR_OFFSET(8)] = read_gicreg(ICH_LR8_EL2);
> + case 7:
> + 

Re: [PATCH v2 01/21] arm64: Add macros to read/write system registers

2015-11-30 Thread Christoffer Dall
On Fri, Nov 27, 2015 at 06:49:55PM +, Marc Zyngier wrote:
> From: Mark Rutland 
> 
> Rather than crafting custom macros for reading/writing each system
> register, provide generic accessors, read_sysreg and write_sysreg, for
> this purpose.
> 
> Unlike read_cpuid, calls to read_exception_reg are never expected
> to be optimized away or replaced with synthetic values.

how does this comment about read_exception_reg relate to this patch?

> 
> Signed-off-by: Mark Rutland 
> Cc: Catalin Marinas 
> Cc: Marc Zyngier 
> Cc: Suzuki Poulose 
> Cc: Will Deacon 
> Signed-off-by: Marc Zyngier 
> ---
>  arch/arm64/include/asm/sysreg.h | 17 +
>  1 file changed, 17 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
> index d48ab5b..c9c283a 100644
> --- a/arch/arm64/include/asm/sysreg.h
> +++ b/arch/arm64/include/asm/sysreg.h
> @@ -20,6 +20,8 @@
>  #ifndef __ASM_SYSREG_H
>  #define __ASM_SYSREG_H
>  
> +#include 
> +
>  #include 
>  
>  /*
> @@ -208,6 +210,8 @@
>  
>  #else
>  
> +#include 
> +
>  asm(
>  ".irp
> num,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30\n"
>  ".equ__reg_num_x\\num, \\num\n"
> @@ -232,6 +236,19 @@ static inline void config_sctlr_el1(u32 clear, u32 set)
>   val |= set;
>   asm volatile("msr sctlr_el1, %0" : : "r" (val));
>  }
> +
> +#define read_sysreg(r) ({\
> + u64 __val;  \
> + asm volatile("mrs %0, " __stringify(r) : "=r" (__val)); \
> + __val;  \
> +})
> +
> +#define write_sysreg(v, r) do {  \
> + u64 __val = (u64)v; \
> + asm volatile("msr " __stringify(r) ", %0"   \
> +  : : "r" (__val));  \
> +} while (0)
> +
>  #endif
>  
>  #endif   /* __ASM_SYSREG_H */
> -- 
> 2.1.4
> 
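For reference, the __stringify() trick that read_sysreg/write_sysreg rely on can be demonstrated without inline assembly; below, a string-keyed lookup table stands in for the mrs/msr instructions (purely an illustration of the token-to-string step, not how the kernel accesses registers):

```c
#include <string.h>

#define __stringify_1(x)	#x
#define __stringify(x)		__stringify_1(x)

/* Fake "register file": two named 64-bit registers. */
static struct {
	const char *name;
	unsigned long long val;
} regs[] = {
	{ "sctlr_el1", 0 },
	{ "tcr_el1", 0 },
};

static unsigned long long *reg_slot(const char *name)
{
	for (unsigned int i = 0; i < sizeof(regs) / sizeof(regs[0]); i++) {
		if (!strcmp(regs[i].name, name))
			return &regs[i].val;
	}
	return 0;	/* unknown register name */
}

/*
 * The macro turns the bare register token into a string at compile
 * time, just as read_sysreg() pastes it into the mrs/msr instruction.
 */
#define read_reg(r)	(*reg_slot(__stringify(r)))
#define write_reg(v, r)	(*reg_slot(__stringify(r)) = (v))

static int demo(void)
{
	write_reg(42, sctlr_el1);	/* the bare token, not a string, is passed */
	return read_reg(sctlr_el1) == 42 && read_reg(tcr_el1) == 0;
}
```

The double-expansion (__stringify_1 inside __stringify) is what lets macro arguments be expanded before stringizing, which is the same reason the kernel defines it in two layers.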


Re: [PATCH v2 02/21] arm64: KVM: Add a HYP-specific header file

2015-11-30 Thread Christoffer Dall
On Fri, Nov 27, 2015 at 06:49:56PM +, Marc Zyngier wrote:
> In order to expose the various EL2 services that are private to
> the hypervisor, add a new hyp.h file.
> 
> So far, it only contains mundane things such as section annotation
> and VA manipulation.
> 
> Signed-off-by: Marc Zyngier 
> ---
>  arch/arm64/kvm/hyp/hyp.h | 31 +++
>  1 file changed, 31 insertions(+)
>  create mode 100644 arch/arm64/kvm/hyp/hyp.h
> 
> diff --git a/arch/arm64/kvm/hyp/hyp.h b/arch/arm64/kvm/hyp/hyp.h
> new file mode 100644
> index 000..dac843e
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/hyp.h
> @@ -0,0 +1,31 @@
> +/*
> + * Copyright (C) 2015 - ARM Ltd
> + * Author: Marc Zyngier 
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#ifndef __ARM64_KVM_HYP_H__
> +#define __ARM64_KVM_HYP_H__
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define __hyp_text __section(.hyp.text) notrace

why notrace?

> +
> +#define kern_hyp_va(v) (typeof(v))((unsigned long)v & HYP_PAGE_OFFSET_MASK)

should you have parenthesis around 'v' ?

> +
> +#endif /* __ARM64_KVM_HYP_H__ */
> +
> -- 
> 2.1.4
> 


Re: [PATCH v2 03/21] arm64: KVM: Implement vgic-v2 save/restore

2015-11-30 Thread Christoffer Dall
On Fri, Nov 27, 2015 at 06:49:57PM +, Marc Zyngier wrote:
> Implement the vgic-v2 save restore (mostly) as a direct translation
> of the assembly code version.
> 
> Signed-off-by: Marc Zyngier 
> ---
>  arch/arm64/kvm/Makefile |  1 +
>  arch/arm64/kvm/hyp/Makefile |  5 +++
>  arch/arm64/kvm/hyp/hyp.h|  3 ++
>  arch/arm64/kvm/hyp/vgic-v2-sr.c | 89 
> +
>  4 files changed, 98 insertions(+)
>  create mode 100644 arch/arm64/kvm/hyp/Makefile
>  create mode 100644 arch/arm64/kvm/hyp/vgic-v2-sr.c
> 
> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> index 1949fe5..d31e4e5 100644
> --- a/arch/arm64/kvm/Makefile
> +++ b/arch/arm64/kvm/Makefile
> @@ -10,6 +10,7 @@ KVM=../../../virt/kvm
>  ARM=../../../arch/arm/kvm
>  
>  obj-$(CONFIG_KVM_ARM_HOST) += kvm.o
> +obj-$(CONFIG_KVM_ARM_HOST) += hyp/
>  
>  kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o 
> $(KVM)/eventfd.o $(KVM)/vfio.o
>  kvm-$(CONFIG_KVM_ARM_HOST) += $(ARM)/arm.o $(ARM)/mmu.o $(ARM)/mmio.o
> diff --git a/arch/arm64/kvm/hyp/Makefile b/arch/arm64/kvm/hyp/Makefile
> new file mode 100644
> index 000..d8d5968
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/Makefile
> @@ -0,0 +1,5 @@
> +#
> +# Makefile for Kernel-based Virtual Machine module, HYP part
> +#
> +
> +obj-$(CONFIG_KVM_ARM_HOST) += vgic-v2-sr.o
> diff --git a/arch/arm64/kvm/hyp/hyp.h b/arch/arm64/kvm/hyp/hyp.h
> index dac843e..78f25c4 100644
> --- a/arch/arm64/kvm/hyp/hyp.h
> +++ b/arch/arm64/kvm/hyp/hyp.h
> @@ -27,5 +27,8 @@
>  
>  #define kern_hyp_va(v) (typeof(v))((unsigned long)v & HYP_PAGE_OFFSET_MASK)
>  
> +void __vgic_v2_save_state(struct kvm_vcpu *vcpu);
> +void __vgic_v2_restore_state(struct kvm_vcpu *vcpu);

should we call these flush/sync here now ?

> +
>  #endif /* __ARM64_KVM_HYP_H__ */
>  
> diff --git a/arch/arm64/kvm/hyp/vgic-v2-sr.c b/arch/arm64/kvm/hyp/vgic-v2-sr.c
> new file mode 100644
> index 000..29a5c1d
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/vgic-v2-sr.c
> @@ -0,0 +1,89 @@
> +/*
> + * Copyright (C) 2012-2015 - ARM Ltd
> + * Author: Marc Zyngier 
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see .
> + */
> +
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +
> +#include "hyp.h"
> +
> +/* vcpu is already in the HYP VA space */

should we annotate hyp pointers similarly to __user, or will that be
confusing when VHE enters the scene?

> +void __hyp_text __vgic_v2_save_state(struct kvm_vcpu *vcpu)
> +{
> + struct kvm *kvm = kern_hyp_va(vcpu->kvm);
> + struct vgic_v2_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v2;
> + struct vgic_dist *vgic = &kvm->arch.vgic;
> + void __iomem *base = kern_hyp_va(vgic->vctrl_base);
> + u32 __iomem *lr_base;
> + u32 eisr0, eisr1, elrsr0, elrsr1;
> + int i = 0, nr_lr;
> +
> + if (!base)
> + return;
> +
> + nr_lr = vcpu->arch.vgic_cpu.nr_lr;
> + cpu_if->vgic_vmcr = readl_relaxed(base + GICH_VMCR);
> + cpu_if->vgic_misr = readl_relaxed(base + GICH_MISR);
> + eisr0  = readl_relaxed(base + GICH_EISR0);
> + elrsr0 = readl_relaxed(base + GICH_ELRSR0);
> + if (unlikely(nr_lr > 32)) {
> + eisr1  = readl_relaxed(base + GICH_EISR1);
> + elrsr1 = readl_relaxed(base + GICH_ELRSR1);
> + } else {
> + eisr1 = elrsr1 = 0;
> + }
> +#ifdef CONFIG_CPU_BIG_ENDIAN
> + cpu_if->vgic_eisr  = ((u64)eisr0 << 32) | eisr1;
> + cpu_if->vgic_elrsr = ((u64)elrsr0 << 32) | elrsr1;
> +#else
> + cpu_if->vgic_eisr  = ((u64)eisr1 << 32) | eisr0;
> + cpu_if->vgic_elrsr = ((u64)elrsr1 << 32) | elrsr0;
> +#endif
> + cpu_if->vgic_apr= readl_relaxed(base + GICH_APR);
> +
> + writel_relaxed(0, base + GICH_HCR);
> +
> + lr_base = base + GICH_LR0;
> + do {
> + cpu_if->vgic_lr[i++] = readl_relaxed(lr_base++);
> + } while (--nr_lr);

why not a simple for-loop?

> +}
> +

copy the vcpu HYP VA comment down here.

> +void __hyp_text __vgic_v2_restore_state(struct kvm_vcpu *vcpu)
> +{
> + struct kvm *kvm = kern_hyp_va(vcpu->kvm);
> + struct vgic_v2_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v2;
> + struct vgic_dist *vgic = &kvm->arch.vgic;
> + void __iomem *base = kern_hyp_va(vgic->vctrl_base);
> + u32 __iomem *lr_base;
> + unsigned int i = 0, nr_lr;
> +
> + if (!base)
> + return;
> +
> + 

Re: [PATCH v2 05/21] arm64: KVM: Implement timer save/restore

2015-11-30 Thread Christoffer Dall
On Fri, Nov 27, 2015 at 06:49:59PM +, Marc Zyngier wrote:
> Implement the timer save restore as a direct translation of
> the assembly code version.
> 
> Signed-off-by: Marc Zyngier 
> ---
>  arch/arm64/kvm/hyp/Makefile  |  1 +
>  arch/arm64/kvm/hyp/hyp.h |  3 ++
>  arch/arm64/kvm/hyp/timer-sr.c| 71 
> 
>  include/clocksource/arm_arch_timer.h |  6 +++
>  4 files changed, 81 insertions(+)
>  create mode 100644 arch/arm64/kvm/hyp/timer-sr.c
> 
> diff --git a/arch/arm64/kvm/hyp/Makefile b/arch/arm64/kvm/hyp/Makefile
> index d1e38ce..455dc0a 100644
> --- a/arch/arm64/kvm/hyp/Makefile
> +++ b/arch/arm64/kvm/hyp/Makefile
> @@ -4,3 +4,4 @@
>  
>  obj-$(CONFIG_KVM_ARM_HOST) += vgic-v2-sr.o
>  obj-$(CONFIG_KVM_ARM_HOST) += vgic-v3-sr.o
> +obj-$(CONFIG_KVM_ARM_HOST) += timer-sr.o
> diff --git a/arch/arm64/kvm/hyp/hyp.h b/arch/arm64/kvm/hyp/hyp.h
> index a31cb6e..86aa5a2 100644
> --- a/arch/arm64/kvm/hyp/hyp.h
> +++ b/arch/arm64/kvm/hyp/hyp.h
> @@ -33,5 +33,8 @@ void __vgic_v2_restore_state(struct kvm_vcpu *vcpu);
>  void __vgic_v3_save_state(struct kvm_vcpu *vcpu);
>  void __vgic_v3_restore_state(struct kvm_vcpu *vcpu);
>  
> +void __timer_save_state(struct kvm_vcpu *vcpu);
> +void __timer_restore_state(struct kvm_vcpu *vcpu);
> +
>  #endif /* __ARM64_KVM_HYP_H__ */
>  
> diff --git a/arch/arm64/kvm/hyp/timer-sr.c b/arch/arm64/kvm/hyp/timer-sr.c
> new file mode 100644
> index 000..8e2209c
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/timer-sr.c
> @@ -0,0 +1,71 @@
> +/*
> + * Copyright (C) 2012-2015 - ARM Ltd
> + * Author: Marc Zyngier 
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see .
> + */
> +
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +
> +#include "hyp.h"
> +
> +/* vcpu is already in the HYP VA space */
> +void __hyp_text __timer_save_state(struct kvm_vcpu *vcpu)
> +{
> + struct kvm *kvm = kern_hyp_va(vcpu->kvm);
> + struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
> +
> + if (kvm->arch.timer.enabled) {
> + timer->cntv_ctl = read_sysreg(cntv_ctl_el0);
> + isb();
> + timer->cntv_cval = read_sysreg(cntv_cval_el0);
> + }
> +
> + /* Disable the virtual timer */
> + write_sysreg(0, cntv_ctl_el0);
> +
> + /* Allow physical timer/counter access for the host */
> + write_sysreg((read_sysreg(cnthctl_el2) | CNTHCTL_EL1PCTEN |
> +   CNTHCTL_EL1PCEN),
> +  cnthctl_el2);

nit: again I probably prefer reading cnthctl_el2 into a variable, modify
the bits and write it back, but it's no big deal.

> +
> + /* Clear cntvoff for the host */
> + write_sysreg(0, cntvoff_el2);

why do we do this when we've just disabled the timer?

> +}
> +
> +void __hyp_text __timer_restore_state(struct kvm_vcpu *vcpu)
> +{
> + struct kvm *kvm = kern_hyp_va(vcpu->kvm);
> + struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
> + u64 val;
> +
> + /*
> +  * Disallow physical timer access for the guest
> +  * Physical counter access is allowed
> +  */
> + val = read_sysreg(cnthctl_el2);
> + val &= ~CNTHCTL_EL1PCEN;
> + val |= CNTHCTL_EL1PCTEN;
> + write_sysreg(val, cnthctl_el2);
> +
> + if (kvm->arch.timer.enabled) {
> + write_sysreg(kvm->arch.timer.cntvoff, cntvoff_el2);
> + write_sysreg(timer->cntv_cval, cntv_cval_el0);
> + isb();
> + write_sysreg(timer->cntv_ctl, cntv_ctl_el0);
> + }
> +}
> diff --git a/include/clocksource/arm_arch_timer.h 
> b/include/clocksource/arm_arch_timer.h
> index 9916d0e..25d0914 100644
> --- a/include/clocksource/arm_arch_timer.h
> +++ b/include/clocksource/arm_arch_timer.h
> @@ -23,6 +23,12 @@
>  #define ARCH_TIMER_CTRL_IT_MASK  (1 << 1)
>  #define ARCH_TIMER_CTRL_IT_STAT  (1 << 2)
>  
> +#define CNTHCTL_EL1PCTEN (1 << 0)
> +#define CNTHCTL_EL1PCEN  (1 << 1)
> +#define CNTHCTL_EVNTEN   (1 << 2)
> +#define CNTHCTL_EVNTDIR  (1 << 3)
> +#define CNTHCTL_EVNTI(0xF << 4)
> +
>  enum arch_timer_reg {
>   ARCH_TIMER_REG_CTRL,
>   ARCH_TIMER_REG_TVAL,
> -- 
> 2.1.4
> 

Otherwise this looks good.

-Christoffer

Re: [PATCH v2 00/21] arm64: KVM: world switch in C

2015-11-30 Thread Christoffer Dall
On Fri, Nov 27, 2015 at 06:49:54PM +, Marc Zyngier wrote:
> Once upon a time, the KVM/arm64 world switch was a nice, clean, lean
> and mean piece of hand-crafted assembly code. Over time, features have
> crept in, the code has become harder to maintain, and the smallest
> change is a pain to introduce. The VHE patches are a prime example of
> why this doesn't work anymore.
> 
> This series rewrites most of the existing assembly code in C, but keeps
> the existing code structure in place (most function names will look
> familiar to the reader). The biggest change is that we don't have to
> deal with a static register allocation (the compiler does it for us),
> we can easily follow structure and pointers, and only the lowest level
> is still in assembly code. Oh, and a negative diffstat.
> 
> There is still a healthy dose of inline assembly (system register
> accessors, runtime code patching), but I've tried not to make it too
> invasive. The generated code, while not exactly brilliant, doesn't
look too shabby. I do expect a small performance degradation, but I
> believe this is something we can improve over time (my initial
> measurements don't show any obvious regression though).

I ran this through my experimental setup on m400 and got this:

BM              v4.4-rc2    v4.4-rc2-wsinc  overhead
--              --------    --------------  --------
Apache          5297.11     5243.77         101.02%
fio rand read   4354.33     4294.50         101.39%
fio rand write  2465.33     2231.33         110.49%
hackbench       17.48       19.78           113.16%
memcached       96442.69    101274.04       95.23%
TCP_MAERTS      5966.89     6029.72         98.96%
TCP_STREAM      6284.60     6351.74         98.94%
TCP_RR          15044.71    14324.03        105.03%
pbzip2 c        18.13       17.89           98.68%
pbzip2 d        11.42       11.45           100.26%
kernbench       50.13       50.28           100.30%
mysql 1         152.84      154.01          100.77%
mysql 2         98.12       98.94           100.84%
mysql 4         51.32       51.17           99.71%
mysql 8         27.31       27.70           101.42%
mysql 20        16.80       17.21           102.47%
mysql 100       13.71       14.11           102.92%
mysql 200       15.20       15.20           100.00%
mysql 400       17.16       17.16           100.00%

(you want to see this with a viewer that renders clear-text and tabs
properly)

What this tells me is that we do take a noticeable hit on the
world-switch path, which shows up in the TCP_RR and hackbench workloads,
which have a high precision in their output.

Note that the memcached number is well within its variability between
individual benchmark runs, where it varies by up to 12% of its average in over
80% of the executions.

I don't think this is a showstopper, though; we could consider looking
more closely at a breakdown of the world-switch path and verifying
if/where we are really taking a hit.

-Christoffer


Re: [PATCH v2 00/21] arm64: KVM: world switch in C

2015-11-30 Thread Mario Smarduch


On 11/30/2015 12:33 PM, Christoffer Dall wrote:
> On Fri, Nov 27, 2015 at 06:49:54PM +, Marc Zyngier wrote:
>> Once upon a time, the KVM/arm64 world switch was a nice, clean, lean
>> and mean piece of hand-crafted assembly code. Over time, features have
>> crept in, the code has become harder to maintain, and the smallest
>> change is a pain to introduce. The VHE patches are a prime example of
>> why this doesn't work anymore.
>>
>> This series rewrites most of the existing assembly code in C, but keeps
>> the existing code structure in place (most function names will look
>> familiar to the reader). The biggest change is that we don't have to
>> deal with a static register allocation (the compiler does it for us),
>> we can easily follow structure and pointers, and only the lowest level
>> is still in assembly code. Oh, and a negative diffstat.
>>
>> There is still a healthy dose of inline assembly (system register
>> accessors, runtime code patching), but I've tried not to make it too
>> invasive. The generated code, while not exactly brilliant, doesn't
>> look too shabby. I do expect a small performance degradation, but I
>> believe this is something we can improve over time (my initial
>> measurements don't show any obvious regression though).
> 
> I ran this through my experimental setup on m400 and got this:
> 
> BM              v4.4-rc2    v4.4-rc2-wsinc  overhead
> --              --------    --------------  --------
> Apache          5297.11     5243.77         101.02%
> fio rand read   4354.33     4294.50         101.39%
> fio rand write  2465.33     2231.33         110.49%
> hackbench       17.48       19.78           113.16%
> memcached       96442.69    101274.04       95.23%
> TCP_MAERTS      5966.89     6029.72         98.96%
> TCP_STREAM      6284.60     6351.74         98.94%
> TCP_RR          15044.71    14324.03        105.03%
> pbzip2 c        18.13       17.89           98.68%
> pbzip2 d        11.42       11.45           100.26%
> kernbench       50.13       50.28           100.30%
> mysql 1         152.84      154.01          100.77%
> mysql 2         98.12       98.94           100.84%
> mysql 4         51.32       51.17           99.71%
> mysql 8         27.31       27.70           101.42%
> mysql 20        16.80       17.21           102.47%
> mysql 100       13.71       14.11           102.92%
> mysql 200       15.20       15.20           100.00%
> mysql 400       17.16       17.16           100.00%
> 
> (you want to see this with a viewer that renders clear-text and tabs
> properly)
> 
> What this tells me is that we do take a noticeable hit on the
> world-switch path, which shows up in the TCP_RR and hackbench workloads,
> which have a high precision in their output.
> 
> Note that the memcached number is well within its variability between
> individual benchmark runs, where it varies by up to 12% of its average in over
> 80% of the executions.
> 
> I don't think this is a showstopper, though; we could consider looking
> more closely at a breakdown of the world-switch path and verifying
> if/where we are really taking a hit.
> 
> -Christoffer
> ___
> kvmarm mailing list
> kvm...@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
> 

I ran some of the lmbench 'micro benchmarks' - currently
the usleep one consistently stands out by about .4%, or an extra 300ns
per sleep. A few other ones have some outliers; I will look at these
closer. Tests were run on Juno.

- Mario


Re: [PATCH v4 08/21] KVM: ARM64: Add reset and access handlers for PMXEVTYPER register

2015-11-30 Thread Shannon Zhao


On 2015/12/1 2:12, Marc Zyngier wrote:
> On Fri, 30 Oct 2015 14:21:50 +0800
> Shannon Zhao  wrote:
> 
>> > From: Shannon Zhao 
>> > 
>> > Since the reset value of PMXEVTYPER is UNKNOWN, use reset_unknown or
>> > reset_unknown_cp15 for its reset handler. Add access handler which
>> > emulates writing and reading PMXEVTYPER register. When writing to
>> > PMXEVTYPER, call kvm_pmu_set_counter_event_type to create a perf_event
>> > for the selected event type.
>> > 
>> > Signed-off-by: Shannon Zhao 
>> > ---
>> >  arch/arm64/kvm/sys_regs.c | 26 --
>> >  1 file changed, 24 insertions(+), 2 deletions(-)
>> > 
>> > diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
>> > index cb82b15..4e606ea 100644
>> > --- a/arch/arm64/kvm/sys_regs.c
>> > +++ b/arch/arm64/kvm/sys_regs.c
>> > @@ -491,6 +491,17 @@ static bool access_pmu_regs(struct kvm_vcpu *vcpu,
>> >  
>> >if (p->is_write) {
>> >switch (r->reg) {
>> > +  case PMXEVTYPER_EL0: {
>> > +  val = vcpu_sys_reg(vcpu, PMSELR_EL0);
>> > +  kvm_pmu_set_counter_event_type(vcpu,
>> > + *vcpu_reg(vcpu, p->Rt),
>> > + val);
> You are blindingly truncating 64bit values to u32. Is that intentional?
> 
Yeah, the registers PMXEVTYPER_EL0 and PMSELR_EL0 are both 32bit.

-- 
Shannon



Re: [PATCH v4 00/21] KVM: ARM64: Add guest PMU support

2015-11-30 Thread Shannon Zhao
Hi Marc,

On 2015/12/1 2:34, Marc Zyngier wrote:
> On Fri, 30 Oct 2015 14:21:42 +0800
> Shannon Zhao  wrote:
> 
> Hi Shannon,
> 
>> > From: Shannon Zhao 
>> > 
>> > This patchset adds guest PMU support for KVM on ARM64. It takes
>> > trap-and-emulate approach. When guest wants to monitor one event, it
>> > will be trapped by KVM and KVM will call perf_event API to create a perf
>> > event and call relevant perf_event APIs to get the count value of event.
> I've been through this whole series, and while this is shaping nicely,
> there is still a number of things that are a bit odd (interrupt
> injection is one, the whole CP15 reset is another).
> 
> Can you please respin this soon? I'd really like to have this in for
> 4.5...

Thanks! I will respin it soon.

-- 
Shannon



Re: [PATCH v4 05/21] KVM: ARM64: Add reset and access handlers for PMSELR register

2015-11-30 Thread Shannon Zhao
Hi Marc,

On 2015/12/1 1:56, Marc Zyngier wrote:
> Same remark here as the one I made earlier. I'm pretty sure we don't
> call any CP15 reset because they are all shared with their 64bit
> counterparts. The same thing goes for the whole series.
Ok, I see. But within the 64bit reset function, it needs to update the
32bit register value, right? Because when accessing these 32bit
registers, it uses the offset c9_PM.

Thanks,
-- 
Shannon



Re: [PATCH net-next 3/3] vhost_net: basic polling support

2015-11-30 Thread Jason Wang


On 11/30/2015 06:44 PM, Michael S. Tsirkin wrote:
> On Wed, Nov 25, 2015 at 03:11:29PM +0800, Jason Wang wrote:
>> > This patch tries to poll for new added tx buffer or socket receive
>> > queue for a while at the end of tx/rx processing. The maximum time
>> > spent on polling were specified through a new kind of vring ioctl.
>> > 
>> > Signed-off-by: Jason Wang 
> One further enhancement would be to actually poll
> the underlying device. This should be reasonably
> straight-forward with macvtap (especially in the
> passthrough mode).
>
>

Yes, it is. I have some patches to do this by replacing
skb_queue_empty() with sk_busy_loop(), but for tap. Tests do not show
any improvement, only some regression. Maybe it's better to test macvtap.


Re: [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC

2015-11-30 Thread Lan, Tianyu



On 11/30/2015 4:01 PM, Michael S. Tsirkin wrote:

It is still not very clear what it is you are trying to achieve, and
whether your patchset achieves it.  You merely say "adding live
migration" but it seems pretty clear this isn't about being able to
migrate a guest transparently, since you are adding a host/guest
handshake.

This isn't about functionality either: I think that on KVM, it isn't
hard to live migrate if you can do a host/guest handshake, even today,
with no kernel changes:
1. before migration, expose a pv nic to guest (can be done directly on
   boot)
2. use e.g. a serial connection to move IP from an assigned device to pv nic
3. maybe move the mac as well
4. eject the assigned device
5. detect eject on host (QEMU generates a DEVICE_DELETED event when this
happens) and start migration



This looks like the bonding driver solution, which puts the PV NIC and
the VF into one bonded interface in active-backup mode. The bonding
driver switches from the VF to the PV NIC automatically when the VF is
unplugged during migration. This is the only solution currently
available for VF NIC migration, but it requires specific configuration
inside the guest OS and relies on the bonding driver, which prevents it
from working on Windows. On the performance side, putting the VF and the
virtio NIC under a bonded interface affects their performance even when
no migration is taking place. These factors block the use of VF NIC
passthrough in some use cases (especially in the cloud) which require
migration.

The solution we propose changes the NIC driver and QEMU. The guest OS
doesn't need to do anything special for migration. It is easy to deploy,
and since all guest-side changes are in the NIC driver, a NIC vendor can
implement migration support entirely in their driver.



Re: [PATCH net-next 2/3] vhost: introduce vhost_vq_more_avail()

2015-11-30 Thread Jason Wang


On 11/30/2015 04:22 PM, Michael S. Tsirkin wrote:
> On Wed, Nov 25, 2015 at 03:11:28PM +0800, Jason Wang wrote:
>> Signed-off-by: Jason Wang 
>> ---
>>  drivers/vhost/vhost.c | 26 +-
>>  drivers/vhost/vhost.h |  1 +
>>  2 files changed, 18 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index 163b365..b86c5aa 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -1633,10 +1633,25 @@ void vhost_add_used_and_signal_n(struct vhost_dev 
>> *dev,
>>  }
>>  EXPORT_SYMBOL_GPL(vhost_add_used_and_signal_n);
>>  
>> +bool vhost_vq_more_avail(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>> +{
>> +__virtio16 avail_idx;
>> +int r;
>> +
>> +r = __get_user(avail_idx, &vq->avail->idx);
>> +if (r) {
>> +vq_err(vq, "Failed to check avail idx at %p: %d\n",
>> +   &vq->avail->idx, r);
>> +return false;
> In patch 3 you are calling this under preempt disable,
> so this actually can fail and it isn't a VQ error.
>

Yes.

>> +}
>> +
>> +return vhost16_to_cpu(vq, avail_idx) != vq->avail_idx;
>> +}
>> +EXPORT_SYMBOL_GPL(vhost_vq_more_avail);
>> +
>>  /* OK, now we need to know about added descriptors. */
>>  bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>>  {
>> -__virtio16 avail_idx;
>>  int r;
>>  
>>  if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
>> @@ -1660,14 +1675,7 @@ bool vhost_enable_notify(struct vhost_dev *dev, 
>> struct vhost_virtqueue *vq)
>>  /* They could have slipped one in as we were doing that: make
>>   * sure it's written, then check again. */
>>  smp_mb();
>> -r = __get_user(avail_idx, &vq->avail->idx);
>> -if (r) {
>> -vq_err(vq, "Failed to check avail idx at %p: %d\n",
>> -   &vq->avail->idx, r);
>> -return false;
>> -}
>> -
>> -return vhost16_to_cpu(vq, avail_idx) != vq->avail_idx;
>> +return vhost_vq_more_avail(dev, vq);
>>  }
>>  EXPORT_SYMBOL_GPL(vhost_enable_notify);
>>  
> This path does need an error though.
> It's probably easier to just leave this call site alone.

Ok, will leave this function as is and remove the vq_err() in
vhost_vq_more_avail().

Thanks

>
>> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
>> index 43284ad..2f3c57c 100644
>> --- a/drivers/vhost/vhost.h
>> +++ b/drivers/vhost/vhost.h
>> @@ -159,6 +159,7 @@ void vhost_add_used_and_signal_n(struct vhost_dev *, 
>> struct vhost_virtqueue *,
>> struct vring_used_elem *heads, unsigned count);
>>  void vhost_signal(struct vhost_dev *, struct vhost_virtqueue *);
>>  void vhost_disable_notify(struct vhost_dev *, struct vhost_virtqueue *);
>> +bool vhost_vq_more_avail(struct vhost_dev *, struct vhost_virtqueue *);
>>  bool vhost_enable_notify(struct vhost_dev *, struct vhost_virtqueue *);
>>  
>>  int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
>> -- 
>> 2.5.0



[PATCH V2 0/3] basic busy polling support for vhost_net

2015-11-30 Thread Jason Wang
Hi all:

This series tries to add basic busy polling for vhost net. The idea is
simple: at the end of tx/rx processing, busy poll for a newly added tx
descriptor or on the rx receive socket for a while. The maximum amount
of time (in us) that may be spent on busy polling is specified via an
ioctl.

Test A were done through:

- 50 us as busy loop timeout
- Netperf 2.6
- Two machines with back to back connected ixgbe
- Guest with 1 vcpu and 1 queue

Results:
- For stream workloads, ioexits were reduced dramatically for
medium-sized (1024-2048) tx (at most -43%) and almost all rx (at most
-84%) as a result of polling. This more or less compensates for the
possibly wasted cpu cycles, which is probably why we can still see the
normalized throughput increase in some cases.
- Tx throughput increased (at most 50%) except for the huge write
(16384), and we can send more packets in that case (+tpkts were
increased).
- Very minor rx regression in some cases.
- Improvement on TCP_RR (at most 17%).

Guest TX:
size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
64/ 1/ +18%/ -10%/ +7%/ +11%/ 0%
64/ 2/ +14%/ -13%/ +7%/ +10%/ 0%
64/ 4/ +8%/ -17%/ +7%/ +9%/ 0%
64/ 8/ +11%/ -15%/ +7%/ +10%/ 0%
256/ 1/ +35%/ +9%/ +21%/ +12%/ -11%
256/ 2/ +26%/ +2%/ +20%/ +9%/ -10%
256/ 4/ +23%/ 0%/ +21%/ +10%/ -9%
256/ 8/ +23%/ 0%/ +21%/ +9%/ -9%
512/ 1/ +31%/ +9%/ +23%/ +18%/ -12%
512/ 2/ +30%/ +8%/ +24%/ +15%/ -10%
512/ 4/ +26%/ +5%/ +24%/ +14%/ -11%
512/ 8/ +32%/ +9%/ +23%/ +15%/ -11%
1024/ 1/ +39%/ +16%/ +29%/ +22%/ -26%
1024/ 2/ +35%/ +14%/ +30%/ +21%/ -22%
1024/ 4/ +34%/ +13%/ +32%/ +21%/ -25%
1024/ 8/ +36%/ +14%/ +32%/ +19%/ -26%
2048/ 1/ +50%/ +27%/ +34%/ +26%/ -42%
2048/ 2/ +43%/ +21%/ +36%/ +25%/ -43%
2048/ 4/ +41%/ +20%/ +37%/ +27%/ -43%
2048/ 8/ +40%/ +18%/ +35%/ +25%/ -42%
16384/ 1/ 0%/ -12%/ -1%/ +8%/ +15%
16384/ 2/ 0%/ -10%/ +1%/ +4%/ +5%
16384/ 4/ 0%/ -11%/ -3%/ 0%/ +3%
16384/ 8/ 0%/ -10%/ -4%/ 0%/ +1%

Guest RX:
size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
64/ 1/ -2%/ -21%/ +1%/ +2%/ -75%
64/ 2/ +1%/ -9%/ +12%/ 0%/ -55%
64/ 4/ 0%/ -6%/ +5%/ -1%/ -44%
64/ 8/ -5%/ -5%/ +7%/ -23%/ -50%
256/ 1/ -8%/ -18%/ +16%/ +15%/ -63%
256/ 2/ 0%/ -8%/ +9%/ -2%/ -26%
256/ 4/ 0%/ -7%/ -8%/ +20%/ -41%
256/ 8/ -8%/ -11%/ -9%/ -24%/ -78%
512/ 1/ -6%/ -19%/ +20%/ +18%/ -29%
512/ 2/ 0%/ -10%/ -14%/ -8%/ -31%
512/ 4/ -1%/ -5%/ -11%/ -9%/ -38%
512/ 8/ -7%/ -9%/ -17%/ -22%/ -81%
1024/ 1/ 0%/ -16%/ +12%/ +9%/ -11%
1024/ 2/ 0%/ -11%/ 0%/ +3%/ -30%
1024/ 4/ 0%/ -4%/ +2%/ +6%/ -15%
1024/ 8/ -3%/ -4%/ -8%/ -8%/ -70%
2048/ 1/ -8%/ -23%/ +36%/ +22%/ -11%
2048/ 2/ 0%/ -12%/ +1%/ +3%/ -29%
2048/ 4/ 0%/ -3%/ -17%/ -15%/ -84%
2048/ 8/ 0%/ -3%/ +1%/ -3%/ +10%
16384/ 1/ 0%/ -11%/ +4%/ +7%/ -22%
16384/ 2/ 0%/ -7%/ +4%/ +4%/ -33%
16384/ 4/ 0%/ -2%/ -2%/ -4%/ -23%
16384/ 8/ -1%/ -2%/ +1%/ -22%/ -40%

TCP_RR:
size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
1/ 1/ +11%/ -26%/ +11%/ +11%/ +10%
1/ 25/ +11%/ -15%/ +11%/ +11%/ 0%
1/ 50/ +9%/ -16%/ +10%/ +10%/ 0%
1/ 100/ +9%/ -15%/ +9%/ +9%/ 0%
64/ 1/ +11%/ -31%/ +11%/ +11%/ +11%
64/ 25/ +12%/ -14%/ +12%/ +12%/ 0%
64/ 50/ +11%/ -14%/ +12%/ +12%/ 0%
64/ 100/ +11%/ -15%/ +11%/ +11%/ 0%
256/ 1/ +11%/ -27%/ +11%/ +11%/ +10%
256/ 25/ +17%/ -11%/ +16%/ +16%/ -1%
256/ 50/ +16%/ -11%/ +17%/ +17%/ +1%
256/ 100/ +17%/ -11%/ +18%/ +18%/ +1%

Test B were done through:

- 50us as busy loop timeout
- Netperf 2.6
- Two machines with back to back connected ixgbe
- Two guests each wich 1 vcpu and 1 queue
- pin two vhost threads to the same cpu on host to simulate the cpu
contending

Results:
- In this radical case, we can still get at most a 14% improvement on
TCP_RR.
- For the guest tx stream, a minor improvement with at most a 5%
regression in the one-byte case. For the guest rx stream, at most a 5%
regression was seen.

Guest TX:
size /-+% /
1 /-5.55%/
64 /+1.11%/
256 /+2.33%/
512 /-0.03%/
1024 /+1.14%/
4096 /+0.00%/
16384/+0.00%/

Guest RX:
size /-+% /
1 /-5.11%/
64 /-0.55%/
256 /-2.35%/
512 /-3.39%/
1024 /+6.8% /
4096 /-0.01%/
16384/+0.00%/

TCP_RR:
size /-+% /
1 /+9.79% /
64 /+4.51% /
256 /+6.47% /
512 /-3.37% /
1024 /+6.15% /
4096 /+14.88%/
16384/-2.23% /

Changes from V1:
- Remove the buggy vq_error() in vhost_vq_more_avail().
- Leave vhost_enable_notify() untouched.

Changes from RFC V3:
- small tweak on the code to avoid multiple duplicate conditions in
critical path when busy loop is not enabled.
- Add the test result of multiple VMs

Changes from RFC V2:
- poll also at the end of rx handling
- factor out the polling logic and optimize the code a little bit
- add two ioctls to get and set the busy poll timeout
- test on ixgbe (which can give more stable and reproducable numbers)
instead of mlx4.

Changes from RFC V1:
- Add a comment for vhost_has_work() to explain why it could be
lockless
- Add param description for busyloop_timeout
- Split out the busy polling logic into a new helper
- Check and exit the loop when there's a pending signal
- Disable preemption during busy looping to make sure lock_clock() was
correctly used. 


[PATCH V2 3/3] vhost_net: basic polling support

2015-11-30 Thread Jason Wang
This patch tries to poll for a newly added tx buffer or on the socket
receive queue for a while at the end of tx/rx processing. The maximum
time spent on polling is specified through a new kind of vring ioctl.

Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c| 72 ++
 drivers/vhost/vhost.c  | 15 ++
 drivers/vhost/vhost.h  |  1 +
 include/uapi/linux/vhost.h | 11 +++
 4 files changed, 94 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9eda69e..ce6da77 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -287,6 +287,41 @@ static void vhost_zerocopy_callback(struct ubuf_info 
*ubuf, bool success)
rcu_read_unlock_bh();
 }
 
+static inline unsigned long busy_clock(void)
+{
+   return local_clock() >> 10;
+}
+
+static bool vhost_can_busy_poll(struct vhost_dev *dev,
+   unsigned long endtime)
+{
+   return likely(!need_resched()) &&
+  likely(!time_after(busy_clock(), endtime)) &&
+  likely(!signal_pending(current)) &&
+  !vhost_has_work(dev) &&
+  single_task_running();
+}
+
+static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
+   struct vhost_virtqueue *vq,
+   struct iovec iov[], unsigned int iov_size,
+   unsigned int *out_num, unsigned int *in_num)
+{
+   unsigned long uninitialized_var(endtime);
+
+   if (vq->busyloop_timeout) {
+   preempt_disable();
+   endtime = busy_clock() + vq->busyloop_timeout;
+   while (vhost_can_busy_poll(vq->dev, endtime) &&
+  !vhost_vq_more_avail(vq->dev, vq))
+   cpu_relax();
+   preempt_enable();
+   }
+
+   return vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
+out_num, in_num, NULL, NULL);
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
@@ -331,10 +366,9 @@ static void handle_tx(struct vhost_net *net)
  % UIO_MAXIOV == nvq->done_idx))
break;
 
-   head = vhost_get_vq_desc(vq, vq->iov,
-ARRAY_SIZE(vq->iov),
-&out, &in,
-NULL, NULL);
+   head = vhost_net_tx_get_vq_desc(net, vq, vq->iov,
+   ARRAY_SIZE(vq->iov),
+   &out, &in);
/* On error, stop handling until the next kick. */
if (unlikely(head < 0))
break;
@@ -435,6 +469,34 @@ static int peek_head_len(struct sock *sk)
return len;
 }
 
+static int vhost_net_peek_head_len(struct vhost_net *net, struct sock *sk)
+{
+   struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
+   struct vhost_virtqueue *vq = &nvq->vq;
+   unsigned long uninitialized_var(endtime);
+
+   if (vq->busyloop_timeout) {
+   mutex_lock(&vq->mutex);
+   vhost_disable_notify(&net->dev, vq);
+
+   preempt_disable();
+   endtime = busy_clock() + vq->busyloop_timeout;
+
+   while (vhost_can_busy_poll(&net->dev, endtime) &&
+  skb_queue_empty(&sk->sk_receive_queue) &&
+  !vhost_vq_more_avail(&net->dev, vq))
+   cpu_relax();
+
+   preempt_enable();
+
+   if (vhost_enable_notify(&net->dev, vq))
+   vhost_poll_queue(&vq->poll);
+   mutex_unlock(&vq->mutex);
+   }
+
+   return peek_head_len(sk);
+}
+
 /* This is a multi-buffer version of vhost_get_desc, that works if
  * vq has read descriptors only.
  * @vq - the relevant virtqueue
@@ -553,7 +615,7 @@ static void handle_rx(struct vhost_net *net)
vq->log : NULL;
mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);
 
-   while ((sock_len = peek_head_len(sock->sk))) {
+   while ((sock_len = vhost_net_peek_head_len(net, sock->sk))) {
sock_len += sock_hlen;
vhost_len = sock_len + vhost_hlen;
headcount = get_rx_bufs(vq, vq->heads, vhost_len,
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 4f45a03..b8ca873 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -285,6 +285,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
vq->memory = NULL;
vq->is_le = virtio_legacy_is_little_endian();
vhost_vq_reset_user_be(vq);
+   vq->busyloop_timeout = 0;
 }
 
 static int vhost_worker(void *data)
@@ -747,6 +748,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp)
struct 

BUG ALERT: ARM32 KVM does not work in 4.4-rc3

2015-11-30 Thread Pavel Fedin
 Hello!

 My project involves ARM64, but from time to time I also test ARM32 KVM. I have 
discovered that it stopped working in 4.4-rc3. The same virtual machine works 
perfectly under current kvmarm/next, but gets stuck at a random point under 
4.4-rc3 from linux-stable.
 I'm not sure that I have time to investigate this quickly, but I'll post new 
information as soon as I get it.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH V2 1/3] vhost: introduce vhost_has_work()

2015-11-30 Thread Jason Wang
This patch introduces a helper which gives a hint about whether or not
a work item is queued in the work list.

Signed-off-by: Jason Wang 
---
 drivers/vhost/vhost.c | 7 +++
 drivers/vhost/vhost.h | 1 +
 2 files changed, 8 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index eec2f11..163b365 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -245,6 +245,13 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 }
 EXPORT_SYMBOL_GPL(vhost_work_queue);
 
+/* A lockless hint for busy polling code to exit the loop */
+bool vhost_has_work(struct vhost_dev *dev)
+{
+   return !list_empty(&dev->work_list);
+}
+EXPORT_SYMBOL_GPL(vhost_has_work);
+
 void vhost_poll_queue(struct vhost_poll *poll)
 {
vhost_work_queue(poll->dev, &poll->work);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index d3f7674..43284ad 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -37,6 +37,7 @@ struct vhost_poll {
 
 void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
+bool vhost_has_work(struct vhost_dev *dev);
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
 unsigned long mask, struct vhost_dev *dev);
-- 
2.5.0



[PATCH V2 2/3] vhost: introduce vhost_vq_more_avail()

2015-11-30 Thread Jason Wang
Signed-off-by: Jason Wang 
---
 drivers/vhost/vhost.c | 13 +
 drivers/vhost/vhost.h |  1 +
 2 files changed, 14 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 163b365..4f45a03 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1633,6 +1633,19 @@ void vhost_add_used_and_signal_n(struct vhost_dev *dev,
 }
 EXPORT_SYMBOL_GPL(vhost_add_used_and_signal_n);
 
+bool vhost_vq_more_avail(struct vhost_dev *dev, struct vhost_virtqueue *vq)
+{
+   __virtio16 avail_idx;
+   int r;
+
+   r = __get_user(avail_idx, &vq->avail->idx);
+   if (r)
+   return false;
+
+   return vhost16_to_cpu(vq, avail_idx) != vq->avail_idx;
+}
+EXPORT_SYMBOL_GPL(vhost_vq_more_avail);
+
 /* OK, now we need to know about added descriptors. */
 bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 {
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 43284ad..2f3c57c 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -159,6 +159,7 @@ void vhost_add_used_and_signal_n(struct vhost_dev *, struct vhost_virtqueue *,
   struct vring_used_elem *heads, unsigned count);
 void vhost_signal(struct vhost_dev *, struct vhost_virtqueue *);
 void vhost_disable_notify(struct vhost_dev *, struct vhost_virtqueue *);
+bool vhost_vq_more_avail(struct vhost_dev *, struct vhost_virtqueue *);
 bool vhost_enable_notify(struct vhost_dev *, struct vhost_virtqueue *);
 
 int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
-- 
2.5.0



Re: [PATCH net-next 2/3] vhost: introduce vhost_vq_more_avail()

2015-11-30 Thread Michael S. Tsirkin
On Wed, Nov 25, 2015 at 03:11:28PM +0800, Jason Wang wrote:
> Signed-off-by: Jason Wang 
> ---
>  drivers/vhost/vhost.c | 26 +-
>  drivers/vhost/vhost.h |  1 +
>  2 files changed, 18 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 163b365..b86c5aa 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1633,10 +1633,25 @@ void vhost_add_used_and_signal_n(struct vhost_dev *dev,
>  }
>  EXPORT_SYMBOL_GPL(vhost_add_used_and_signal_n);
>  
> +bool vhost_vq_more_avail(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> +{
> + __virtio16 avail_idx;
> + int r;
> +
> + r = __get_user(avail_idx, &vq->avail->idx);
> + if (r) {
> + vq_err(vq, "Failed to check avail idx at %p: %d\n",
> +&vq->avail->idx, r);
> + return false;

In patch 3 you are calling this under preempt disable,
so this actually can fail and it isn't a VQ error.

> + }
> +
> + return vhost16_to_cpu(vq, avail_idx) != vq->avail_idx;
> +}
> +EXPORT_SYMBOL_GPL(vhost_vq_more_avail);
> +
>  /* OK, now we need to know about added descriptors. */
>  bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>  {
> - __virtio16 avail_idx;
>   int r;
>  
>   if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
> @@ -1660,14 +1675,7 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>   /* They could have slipped one in as we were doing that: make
>* sure it's written, then check again. */
>   smp_mb();
> - r = __get_user(avail_idx, &vq->avail->idx);
> - if (r) {
> - vq_err(vq, "Failed to check avail idx at %p: %d\n",
> -&vq->avail->idx, r);
> - return false;
> - }
> -
> - return vhost16_to_cpu(vq, avail_idx) != vq->avail_idx;
> + return vhost_vq_more_avail(dev, vq);
>  }
>  EXPORT_SYMBOL_GPL(vhost_enable_notify);
>  

This path does need an error though.
It's probably easier to just leave this call site alone.

> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 43284ad..2f3c57c 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -159,6 +159,7 @@ void vhost_add_used_and_signal_n(struct vhost_dev *, struct vhost_virtqueue *,
>  struct vring_used_elem *heads, unsigned count);
>  void vhost_signal(struct vhost_dev *, struct vhost_virtqueue *);
>  void vhost_disable_notify(struct vhost_dev *, struct vhost_virtqueue *);
> +bool vhost_vq_more_avail(struct vhost_dev *, struct vhost_virtqueue *);
>  bool vhost_enable_notify(struct vhost_dev *, struct vhost_virtqueue *);
>  
>  int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
> -- 
> 2.5.0


Re: [PATCH net-next 0/3] basic busy polling support for vhost_net

2015-11-30 Thread Michael S. Tsirkin
On Sun, Nov 29, 2015 at 10:31:10PM -0500, David Miller wrote:
> From: Jason Wang 
> Date: Wed, 25 Nov 2015 15:11:26 +0800
> 
> > This series tries to add basic busy polling for vhost net. The idea is
> > simple: at the end of tx/rx processing, busy poll for a newly added tx
> > descriptor or rx socket data for a while. The maximum amount of time
> > (in us) that may be spent on busy polling is specified via ioctl.
> > 
> > Test A were done through:
> > 
> > - 50 us as busy loop timeout
> > - Netperf 2.6
> > - Two machines with back to back connected ixgbe
> > - Guest with 1 vcpu and 1 queue
> > 
> > Results:
> > - For the stream workload, ioexits were reduced dramatically for medium
> >   tx sizes (1024-2048, at most -43%) and almost all rx (at most
> >   -84%) as a result of polling. This compensates for the possibly
> >   wasted cpu cycles more or less. That's probably why we can still see
> >   some increase in normalized throughput in some cases.
> > - Throughput of tx was increased (at most 50%) except for the huge
> >   write (16384). And we can send more packets in that case (+tpkts were
> >   increased).
> > - Very minor rx regression in some cases.
> > - Improvement on TCP_RR (at most 17%).
> 
> Michael are you going to take this?  It's touching vhost core as
> much as it is the vhost_net driver.

There's a minor bug there, but once it's fixed - I agree,
it belongs in the vhost tree.

-- 
MST


Re: [RFC PATCH V2 00/10] Qemu: Add live migration support for SRIOV NIC

2015-11-30 Thread Michael S. Tsirkin
On Tue, Nov 24, 2015 at 09:35:17PM +0800, Lan Tianyu wrote:
> This patchset is to propose a solution of adding live migration
> support for SRIOV NIC.
> 
> During migration, Qemu needs to let VF driver in the VM to know
> migration start and end. Qemu adds faked PCI migration capability
> to help to sync status between two sides during migration.
> 
> Qemu triggers VF's mailbox irq via sending MSIX msg when migration
> status is changed. VF driver tells Qemu its mailbox vector index
> via the new PCI capability. In some cases(NIC is suspended or closed),
> VF mailbox irq is freed and VF driver can disable irq injecting via
> new capability.   
> 
> VF driver will put down nic before migration and put up again on
> the target machine.

It is still not very clear what it is you are trying to achieve, and
whether your patchset achieves it.  You merely say "adding live
migration" but it seems pretty clear this isn't about being able to
migrate a guest transparently, since you are adding a host/guest
handshake.

This isn't about functionality either: I think that on KVM, it isn't
hard to live migrate if you can do a host/guest handshake, even today,
with no kernel changes:
1. before migration, expose a pv nic to guest (can be done directly on
  boot)
2. use e.g. a serial connection to move IP from an assigned device to pv nic
3. maybe move the mac as well
4. eject the assigned device
5. detect eject on host (QEMU generates a DEVICE_DELETED event when this
   happens) and start migration

Is this patchset a performance optimization then?
If yes it needs to be accompanied with some performance numbers.

-- 
MST


Re: [PATCH v8 4/5] nvdimm acpi: build ACPI nvdimm devices

2015-11-30 Thread Michael S. Tsirkin
On Mon, Nov 16, 2015 at 06:51:02PM +0800, Xiao Guangrong wrote:
> NVDIMM devices are defined in ACPI 6.0, section 9.20 (NVDIMM Devices)

Forgot to mention:

Pls put spec info in code comments near
relevant functions, not just the log.

> 
> There is a root device under \_SB and specified NVDIMM devices are under the
> root device. Each NVDIMM device has _ADR which returns its handle used to
> associate MEMDEV structure in NFIT
> 
> Currently, we do not support any function on _DSM, that means, NVDIMM
> label data has not been supported yet
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  hw/acpi/nvdimm.c | 85 
> 
>  1 file changed, 85 insertions(+)
> 
> diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
> index 98c004d..abe0daa 100644
> --- a/hw/acpi/nvdimm.c
> +++ b/hw/acpi/nvdimm.c
> @@ -367,6 +367,90 @@ static void nvdimm_build_nfit(GSList *device_list, GArray *table_offsets,
>  g_array_free(structures, true);
>  }
>  
> +static void nvdimm_build_common_dsm(Aml *root_dev)
> +{
> +Aml *method, *ifctx, *function;
> +uint8_t byte_list[1];
> +
> +method = aml_method("NCAL", 4);
> +{
> +function = aml_arg(2);
> +
> +/*
> + * function 0 is called to inquire what functions are supported by
> + * OSPM
> + */
> +ifctx = aml_if(aml_equal(function, aml_int(0)));
> +byte_list[0] = 0 /* No function Supported */;
> +aml_append(ifctx, aml_return(aml_buffer(1, byte_list)));
> +aml_append(method, ifctx);
> +
> +/* No function is supported yet. */
> +byte_list[0] = 1 /* Not Supported */;
> +aml_append(method, aml_return(aml_buffer(1, byte_list)));
> +}
> +aml_append(root_dev, method);
> +}
> +
> +static void nvdimm_build_nvdimm_devices(GSList *device_list, Aml *root_dev)
> +{
> +for (; device_list; device_list = device_list->next) {
> +DeviceState *dev = device_list->data;
> +int slot = object_property_get_int(OBJECT(dev), PC_DIMM_SLOT_PROP,
> +   NULL);
> +uint32_t handle = nvdimm_slot_to_handle(slot);
> +Aml *nvdimm_dev, *method;
> +
> +nvdimm_dev = aml_device("NV%02X", slot);
> +aml_append(nvdimm_dev, aml_name_decl("_ADR", aml_int(handle)));
> +
> +method = aml_method("_DSM", 4);
> +{
> +aml_append(method, aml_return(aml_call4("NCAL", aml_arg(0),
> +   aml_arg(1), aml_arg(2), aml_arg(3))));
> +}
> +aml_append(nvdimm_dev, method);
> +
> +aml_append(root_dev, nvdimm_dev);
> +}
> +}
> +
> +static void nvdimm_build_ssdt(GSList *device_list, GArray *table_offsets,
> +  GArray *table_data, GArray *linker)
> +{
> +Aml *ssdt, *sb_scope, *dev, *method;

So why don't we skip this completely if device list is empty?

> +
> +acpi_add_table(table_offsets, table_data);
> +
> +ssdt = init_aml_allocator();
> +acpi_data_push(ssdt->buf, sizeof(AcpiTableHeader));
> +
> +sb_scope = aml_scope("\\_SB");
> +
> +dev = aml_device("NVDR");
> +aml_append(dev, aml_name_decl("_HID", aml_string("ACPI0012")));
> +
> +nvdimm_build_common_dsm(dev);
> +method = aml_method("_DSM", 4);
> +{
> +aml_append(method, aml_return(aml_call4("NCAL", aml_arg(0),
> +   aml_arg(1), aml_arg(2), aml_arg(3))));
> +}
> +aml_append(dev, method);
> +
> +nvdimm_build_nvdimm_devices(device_list, dev);
> +
> +aml_append(sb_scope, dev);
> +
> +aml_append(ssdt, sb_scope);
> +/* copy AML table into ACPI tables blob and patch header there */
> +g_array_append_vals(table_data, ssdt->buf->data, ssdt->buf->len);
> +build_header(linker, table_data,
> +(void *)(table_data->data + table_data->len - ssdt->buf->len),
> +"SSDT", ssdt->buf->len, 1, "NVDIMM");
> +free_aml_allocator();
> +}
> +
>  void nvdimm_build_acpi(GArray *table_offsets, GArray *table_data,
> GArray *linker)
>  {
> @@ -378,5 +462,6 @@ void nvdimm_build_acpi(GArray *table_offsets, GArray *table_data,
>  return;
>  }
>  nvdimm_build_nfit(device_list, table_offsets, table_data, linker);
> +nvdimm_build_ssdt(device_list, table_offsets, table_data, linker);
>  g_slist_free(device_list);
>  }
> -- 
> 1.8.3.1
> 


Re: [for-2.6 PATCH 1/3] target-i386: Define structs for layout of xsave area

2015-11-30 Thread Paolo Bonzini


On 28/11/2015 20:56, Eduardo Habkost wrote:
> +/* Ext. save area 2: AVX State */
> +typedef struct XSaveAVX {
> +uint64_t ymmh[16][2];
> +} XSaveAVX;
> +

Because this is always little endian, I would write it as uint8_t[16][16].

> +/* Ext. save area 6: ZMM_Hi256 */
> +typedef struct XSaveZMM_Hi256 {
> +uint64_t zmm_hi256[16][4];
> +} XSaveZMM_Hi256;

Same here (uint8_t[16][32]).

> +/* Ext. save area 7: Hi16_ZMM */
> +typedef struct XSaveHi16_ZMM {
> +XMMReg hi16_zmm[16];
> +} XSaveHi16_ZMM;

This is uint8_t[16][64] or uint64_t[16][8].

Paolo


Re: [for-2.6 PATCH 0/3] target-i386: Use C struct for xsave area layout, offsets & sizes

2015-11-30 Thread Paolo Bonzini


On 28/11/2015 20:56, Eduardo Habkost wrote:
> I still need to figure out a way to write unit tests for the new
> code. Maybe I will just copy and paste the new and old functions,
> and test them locally (checking if they give the same results
> when translating blobs of random bytes).

Aren't the QEMU_BUILD_BUG_ON enough?  No need to delete them in patch 3,
though perhaps you can remove the #defines.

Paolo


Re: [PATCH v4 06/21] KVM: ARM64: Add reset and access handlers for PMCEID0 and PMCEID1 register

2015-11-30 Thread Marc Zyngier
On Fri, 30 Oct 2015 14:21:48 +0800
Shannon Zhao  wrote:

> From: Shannon Zhao 
> 
> Add reset handler which gets host value of PMCEID0 or PMCEID1. Since
> write action to PMCEID0 or PMCEID1 is ignored, add a new case for this.
> 
> Signed-off-by: Shannon Zhao 
> ---
>  arch/arm64/kvm/sys_regs.c | 29 +
>  1 file changed, 25 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> index 35d232e..cb82b15 100644
> --- a/arch/arm64/kvm/sys_regs.c
> +++ b/arch/arm64/kvm/sys_regs.c
> @@ -469,6 +469,19 @@ static void reset_pmcr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r)
>   vcpu_sysreg_write(vcpu, r, val);
>  }
>  
> +static void reset_pmceid(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r)
> +{
> + u64 pmceid;
> +
> + if (r->reg == PMCEID0_EL0 || r->reg == c9_PMCEID0)

That feels wrong. We should only reset the 64bit view of the sysregs,
as the 32bit view is directly mapped to it.

> + asm volatile("mrs %0, pmceid0_el0\n" : "=r" (pmceid));
> + else
> + /* PMCEID1_EL0 or c9_PMCEID1 */
> + asm volatile("mrs %0, pmceid1_el0\n" : "=r" (pmceid));
> +
> + vcpu_sysreg_write(vcpu, r, pmceid);
> +}
> +
>  /* PMU registers accessor. */
>  static bool access_pmu_regs(struct kvm_vcpu *vcpu,
>   const struct sys_reg_params *p,
> @@ -486,6 +499,9 @@ static bool access_pmu_regs(struct kvm_vcpu *vcpu,
>   vcpu_sys_reg(vcpu, r->reg) = val;
>   break;
>   }
> + case PMCEID0_EL0:
> + case PMCEID1_EL0:
> + return ignore_write(vcpu, p);
>   default:
>   vcpu_sys_reg(vcpu, r->reg) = *vcpu_reg(vcpu, p->Rt);
>   break;
> @@ -710,10 +726,10 @@ static const struct sys_reg_desc sys_reg_descs[] = {
> access_pmu_regs, reset_unknown, PMSELR_EL0 },
>   /* PMCEID0_EL0 */
>   { Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b110),
> -   trap_raz_wi },
> +   access_pmu_regs, reset_pmceid, PMCEID0_EL0 },
>   /* PMCEID1_EL0 */
>   { Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b111),
> -   trap_raz_wi },
> +   access_pmu_regs, reset_pmceid, PMCEID1_EL0 },
>   /* PMCCNTR_EL0 */
>   { Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1101), Op2(0b000),
> trap_raz_wi },
> @@ -943,6 +959,9 @@ static bool access_pmu_cp15_regs(struct kvm_vcpu *vcpu,
>   vcpu_cp15(vcpu, r->reg) = val;
>   break;
>   }
> + case c9_PMCEID0:
> + case c9_PMCEID1:
> + return ignore_write(vcpu, p);
>   default:
>   vcpu_cp15(vcpu, r->reg) = *vcpu_reg(vcpu, p->Rt);
>   break;
> @@ -1000,8 +1019,10 @@ static const struct sys_reg_desc cp15_regs[] = {
>   { Op1( 0), CRn( 9), CRm(12), Op2( 3), trap_raz_wi },
>   { Op1( 0), CRn( 9), CRm(12), Op2( 5), access_pmu_cp15_regs,
> reset_unknown_cp15, c9_PMSELR },
> - { Op1( 0), CRn( 9), CRm(12), Op2( 6), trap_raz_wi },
> - { Op1( 0), CRn( 9), CRm(12), Op2( 7), trap_raz_wi },
> + { Op1( 0), CRn( 9), CRm(12), Op2( 6), access_pmu_cp15_regs,
> +   reset_pmceid, c9_PMCEID0 },
> + { Op1( 0), CRn( 9), CRm(12), Op2( 7), access_pmu_cp15_regs,
> +   reset_pmceid, c9_PMCEID1 },

and as a consequence, this hunk should be reworked.

>   { Op1( 0), CRn( 9), CRm(13), Op2( 0), trap_raz_wi },
>   { Op1( 0), CRn( 9), CRm(13), Op2( 1), trap_raz_wi },
>   { Op1( 0), CRn( 9), CRm(13), Op2( 2), trap_raz_wi },

Thanks,

M.
-- 
Jazz is not dead. It just smells funny.


Re: [PATCH net-next 3/3] vhost_net: basic polling support

2015-11-30 Thread Michael S. Tsirkin
On Wed, Nov 25, 2015 at 03:11:29PM +0800, Jason Wang wrote:
> This patch tries to poll for new added tx buffer or socket receive
> queue for a while at the end of tx/rx processing. The maximum time
> spent on polling were specified through a new kind of vring ioctl.
> 
> Signed-off-by: Jason Wang 

One further enhancement would be to actually poll
the underlying device. This should be reasonably
straight-forward with macvtap (especially in the
passthrough mode).


> ---
>  drivers/vhost/net.c| 72 
> ++
>  drivers/vhost/vhost.c  | 15 ++
>  drivers/vhost/vhost.h  |  1 +
>  include/uapi/linux/vhost.h | 11 +++
>  4 files changed, 94 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 9eda69e..ce6da77 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -287,6 +287,41 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
>   rcu_read_unlock_bh();
>  }
>  
> +static inline unsigned long busy_clock(void)
> +{
> + return local_clock() >> 10;
> +}
> +
> +static bool vhost_can_busy_poll(struct vhost_dev *dev,
> + unsigned long endtime)
> +{
> + return likely(!need_resched()) &&
> +likely(!time_after(busy_clock(), endtime)) &&
> +likely(!signal_pending(current)) &&
> +!vhost_has_work(dev) &&
> +single_task_running();
> +}
> +
> +static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
> + struct vhost_virtqueue *vq,
> + struct iovec iov[], unsigned int iov_size,
> + unsigned int *out_num, unsigned int *in_num)
> +{
> + unsigned long uninitialized_var(endtime);
> +
> + if (vq->busyloop_timeout) {
> + preempt_disable();
> + endtime = busy_clock() + vq->busyloop_timeout;
> + while (vhost_can_busy_poll(vq->dev, endtime) &&
> +!vhost_vq_more_avail(vq->dev, vq))
> + cpu_relax();
> + preempt_enable();
> + }
> +
> + return vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
> +  out_num, in_num, NULL, NULL);
> +}
> +
>  /* Expects to be always run from workqueue - which acts as
>   * read-size critical section for our kind of RCU. */
>  static void handle_tx(struct vhost_net *net)
> @@ -331,10 +366,9 @@ static void handle_tx(struct vhost_net *net)
> % UIO_MAXIOV == nvq->done_idx))
>   break;
>  
> - head = vhost_get_vq_desc(vq, vq->iov,
> -  ARRAY_SIZE(vq->iov),
> -  &out, &in,
> -  NULL, NULL);
> + head = vhost_net_tx_get_vq_desc(net, vq, vq->iov,
> + ARRAY_SIZE(vq->iov),
> + &out, &in);
>   /* On error, stop handling until the next kick. */
>   if (unlikely(head < 0))
>   break;
> @@ -435,6 +469,34 @@ static int peek_head_len(struct sock *sk)
>   return len;
>  }
>  
> +static int vhost_net_peek_head_len(struct vhost_net *net, struct sock *sk)
> +{
> + struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
> + struct vhost_virtqueue *vq = &nvq->vq;
> + unsigned long uninitialized_var(endtime);
> +
> + if (vq->busyloop_timeout) {
> + mutex_lock(&vq->mutex);
> + vhost_disable_notify(&net->dev, vq);
> +
> + preempt_disable();
> + endtime = busy_clock() + vq->busyloop_timeout;
> +
> + while (vhost_can_busy_poll(&net->dev, endtime) &&
> +skb_queue_empty(&sk->sk_receive_queue) &&
> +!vhost_vq_more_avail(&net->dev, vq))
> + cpu_relax();
> +
> + preempt_enable();
> +
> + if (vhost_enable_notify(&net->dev, vq))
> + vhost_poll_queue(&vq->poll);
> + mutex_unlock(&vq->mutex);
> + }
> +
> + return peek_head_len(sk);
> +}
> +
>  /* This is a multi-buffer version of vhost_get_desc, that works if
>   *   vq has read descriptors only.
>   * @vq   - the relevant virtqueue
> @@ -553,7 +615,7 @@ static void handle_rx(struct vhost_net *net)
>   vq->log : NULL;
>   mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);
>  
> - while ((sock_len = peek_head_len(sock->sk))) {
> + while ((sock_len = vhost_net_peek_head_len(net, sock->sk))) {
>   sock_len += sock_hlen;
>   vhost_len = sock_len + vhost_hlen;
>   headcount = get_rx_bufs(vq, vq->heads, vhost_len,
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index b86c5aa..857af6c 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -285,6 +285,7 @@ 

RE: [PATCH v5 2/2] KVM: Make KVM_CAP_IRQFD dependent on KVM_CAP_IRQCHIP

2015-11-30 Thread Pavel Fedin
 Hello!

> > case KVM_CAP_INTERNAL_ERROR_DATA:
> >  #ifdef CONFIG_HAVE_KVM_MSI
> > case KVM_CAP_SIGNAL_MSI:
> > +   /* Fallthrough */
> >  #endif
> > +   case KVM_CAP_CHECK_EXTENSION_VM:
> > +   return 1;
> >  #ifdef CONFIG_HAVE_KVM_IRQFD
> > case KVM_CAP_IRQFD:
> > case KVM_CAP_IRQFD_RESAMPLE:
> > +   return kvm_vm_ioctl_check_extension(kvm, KVM_CAP_IRQCHIP);
> 
> This won't work for s390, as it doesn't have KVM_CAP_IRQCHIP but
> KVM_CAP_S390_IRQCHIP (which needs to be enabled).

 Thank you for the note, I didn't know about irqchip-specific capability codes. 
There's the same issue with PowerPC; now I understand why there's no 
KVM_CAP_IRQCHIP for them: they have KVM_CAP_IRQ_MPIC and KVM_CAP_IRQ_XICS, 
similar to S390.
 But isn't that just weird? I understand that perhaps we have some real need to 
distinguish between different irqchip types, but shouldn't the kernel also 
publish KVM_CAP_IRQCHIP, which stands just for "we support some irqchip 
virtualization"?
 Maybe we should just add this for PowerPC and S390, to make things less 
ambiguous?

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia



Re: [PATCH v4 06/21] KVM: ARM64: Add reset and access handlers for PMCEID0 and PMCEID1 register

2015-11-30 Thread Shannon Zhao
Hi Marc,

On 2015/11/30 19:42, Marc Zyngier wrote:
>> > +static void reset_pmceid(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r)
>> > +{
>> > +  u64 pmceid;
>> > +
>> > +  if (r->reg == PMCEID0_EL0 || r->reg == c9_PMCEID0)
> That feels wrong. We should only reset the 64bit view of the sysregs,
> as the 32bit view is directly mapped to it.
> 
Just to confirm: if the guest accesses c9_PMCEID0, will KVM trap this register
with the register index PMCEID0_EL0, or still as c9_PMCEID0?

-- 
Shannon



Re: [PATCH v5 2/2] KVM: Make KVM_CAP_IRQFD dependent on KVM_CAP_IRQCHIP

2015-11-30 Thread Cornelia Huck
On Mon, 30 Nov 2015 14:56:38 +0300
Pavel Fedin  wrote:

>  Hello!
> 
> > >   case KVM_CAP_INTERNAL_ERROR_DATA:
> > >  #ifdef CONFIG_HAVE_KVM_MSI
> > >   case KVM_CAP_SIGNAL_MSI:
> > > + /* Fallthrough */
> > >  #endif
> > > + case KVM_CAP_CHECK_EXTENSION_VM:
> > > + return 1;
> > >  #ifdef CONFIG_HAVE_KVM_IRQFD
> > >   case KVM_CAP_IRQFD:
> > >   case KVM_CAP_IRQFD_RESAMPLE:
> > > + return kvm_vm_ioctl_check_extension(kvm, KVM_CAP_IRQCHIP);
> > 
> > This won't work for s390, as it doesn't have KVM_CAP_IRQCHIP but
> > KVM_CAP_S390_IRQCHIP (which needs to be enabled).
> 
>  Thank you for the note, i didn't know about irqchip-specific capability 
> codes. There's the same issue with PowerPC, now i
> understand why there's no KVM_CAP_IRQCHIP for them. Because they have 
> KVM_CAP_IRQ_MPIC and KVM_CAP_IRQ_XICS, similar to S390.
>  But isn't it just weird? I understand that perhaps we have some real need to 
> distinguish between different irqchip types, but
> shouldn't the kernel also publish KVM_CAP_IRQCHIP, which stands just for "we 
> support some irqchip virtualization"?
>  May be we should just add this for PowerPC and S390, to make things less 
> ambiguous?

Note that we explicitly need to _enable_ the s390 cap (for
compatibility). I'd need to recall the exact details but I came to the
conclusion back then that I could not simply enable KVM_CAP_IRQCHIP for
s390 (and current qemu would fail to enable the s390 cap if we started
advertising KVM_CAP_IRQCHIP now).



Re: [PATCH v1 7/7] kvm/x86: Hyper-V SynIC timers

2015-11-30 Thread Roman Kagan
On Fri, Nov 27, 2015 at 11:49:40AM +0100, Paolo Bonzini wrote:

[ sorry missed your message on Friday, replying now ]

> On 27/11/2015 09:12, Roman Kagan wrote:
> >> > +n = div64_u64(time_now - stimer->exp_time, stimer->count) + 1;
> >> > +stimer->exp_time += n * stimer->count;
> > This is actually just a remainder calculation so I'd rather do it
> > directly with div64_u64_rem().
> 
> It took me a while to understand why it was a remainder. :)

It gets easier if you think of it this way: we've slipped a few whole
periods and the remainder of the slack into the current period, so
the time left till the next tick is ("count" is the timer period here)

  delta = count - slack % count
where
  slack = time_now - exp_time

This gives you immediately your

>   exp_time = time_now + (count - (time_now - exp_time) % count)

Roman.


Re: [PATCH v8 0/5] implement vNVDIMM

2015-11-30 Thread Michael S. Tsirkin
On Mon, Nov 16, 2015 at 06:50:58PM +0800, Xiao Guangrong wrote:
> This patchset can be found at:
>   https://github.com/xiaogr/qemu.git nvdimm-v8
> 
> It is based on pci branch on Michael's tree and the top commit is:
> commit e3a4e177d9 (migration/ram: fix build on 32 bit hosts).
> 
> Changelog in v8:
> We split the long patch series into the small parts, as you see now, this
> is the first part which enables NVDIMM without label data support.

Finally found some time to review this.  Very nice, this is making good
progress, and I think splitting it like this is a great idea.  I sent
some comments, most of them minor.

Thanks!

> The command line has been changed because some patches simplifying
> things have not been included in this series; you should specify the
> file size explicitly using parameters as follows:
>memory-backend-file,id=mem1,share,mem-path=/tmp/nvdimm1,size=10G \
>-device nvdimm,memdev=mem1,id=nv1
> 
> Changelog in v7:
> - changes from Vladimir Sementsov-Ogievskiy's comments:
>   1) let gethugepagesize() realize if fstat is failed instead of get
>  normal page size
>   2) rename  open_file_path to open_ram_file_path
>   3) better log the error message by using error_setg_errno
>   4) update commit in the commit log to explain hugepage detection on
>  Windows
> 
> - changes from Eduardo Habkost's comments:
>   1) use 'Error**' to collect error message for qemu_file_get_page_size()
>   2) move gethugepagesize() replacement to the same patch to make it
>  better for review
>   3) introduce qemu_get_file_size to unify the code with raw_getlength()
> 
> - changes from Stefan's comments:
>   1) check the memory region is large enough to contain DSM output
>  buffer
> 
> - changes from Eric Blake's comments:
>   1) update the shell command in the commit log to generate the patch
>  which drops 'pc-dimm' prefix
>   
> - others:
>   pick up Reviewed-by from Stefan, Vladimir Sementsov-Ogievskiy, and
>   Eric Blake.
> 
> Changelog in v6:
> - changes from Stefan's comments:
>   1) fix code style of struct naming by CamelCase way
>   2) fix offset + length overflow when read/write label data
>   3) compile hw/acpi/nvdimm.c for per target so that TARGET_PAGE_SIZE can
>  be used to replace getpagesize()
> 
> Changelog in v5:
> - changes from Michael's comments:
>   1) prefix nvdimm_ to everything in NVDIMM source files
>   2) make parsing _DSM Arg3 more clear
>   3) comment style fix
>   5) drop single used definition
>   6) fix dirty dsm buffer lost due to memory write happened on host
>   7) check dsm buffer if it is big enough to contain input data
>   8) use build_append_int_noprefix to store single value to GArray
> 
> - changes from Michael's and Igor's comments:
>   1) introduce 'nvdimm-support' parameter to control nvdimm
>  enablement and it is disabled for 2.4 and its earlier versions
>  to make live migration compatible
>   2) only reserve 1 RAM page and 4 bytes IO Port for NVDIMM ACPI
>  virtualization
> 
> - changes from Stefan's comments:
>   1) do endian adjustment for the buffer length
> 
> - changes from Bharata B Rao's comments:
>   1) fix compile on ppc
> 
> - others:
>   1) the buffer length is directly got from IO read rather than got
>  from dsm memory
>   2) fix dirty label data lost due to memory write happened on host
> 
> Changelog in v4:
> - changes from Michael's comments:
>   1) show the message, "Memory is not allocated from HugeTlbfs", if file
>  based memory is not allocated from hugetlbfs.
>   2) introduce function, acpi_get_nvdimm_state(), to get NVDIMMState
>  from Machine.
>   3) statically define UUID and make its operation more clear
>   4) use GArray to build device structures to avoid potential buffer
>  overflow
>   4) improve comments in the code
>   5) improve code style
> 
> - changes from Igor's comments:
>   1) add NVDIMM ACPI spec document
>   2) use serialized method to avoid Mutex
>   3) move NVDIMM ACPI's code to hw/acpi/nvdimm.c
>   4) introduce a common ASL method used by _DSM for all devices to reduce
>  ACPI size
>   5) handle UUID in ACPI AML code. BTW, I'd keep handling revision in QEMU;
>  it's better to upgrade QEMU to support Rev2 in the future
> 
> - changes from Stefan's comments:
>   1) copy input data from DSM memory to local buffer to avoid potential
>  issues as DSM memory is visible to guest. Output data is handled
>  in a similar way
> 
> - changes from Dan's comments:
>   1) drop static namespace as Linux has already supported label-less
>  nvdimm devices
> 
> - changes from Vladimir's comments:
>   1) print better message, "failed to get file size for %s, can't create
>  backend on it", if any file operation failed to obtain file size
> 
> - others:
>   create a git repo on github.com for better review/test
> 
> Also, thanks for Eric Blake's review on QAPI's side.
> 
> Thanks to all of you for reviewing this patchset.
> 
> Changelog in v3:
> There 

Re: [PATCH v2 04/21] arm64: KVM: Implement vgic-v3 save/restore

2015-11-30 Thread Marc Zyngier
On Mon, 30 Nov 2015 09:59:32 +
Alex Bennée  wrote:

> 
> Marc Zyngier  writes:
> 
> > Implement the vgic-v3 save restore as a direct translation of
> > the assembly code version.
> >
> > Signed-off-by: Marc Zyngier 
> > ---
> >  arch/arm64/kvm/hyp/Makefile |   1 +
> >  arch/arm64/kvm/hyp/hyp.h|   3 +
> >  arch/arm64/kvm/hyp/vgic-v3-sr.c | 222 
> > 
> >  3 files changed, 226 insertions(+)
> >  create mode 100644 arch/arm64/kvm/hyp/vgic-v3-sr.c
> >
> > diff --git a/arch/arm64/kvm/hyp/Makefile b/arch/arm64/kvm/hyp/Makefile
> > index d8d5968..d1e38ce 100644
> > --- a/arch/arm64/kvm/hyp/Makefile
> > +++ b/arch/arm64/kvm/hyp/Makefile
> > @@ -3,3 +3,4 @@
> >  #
> >
> >  obj-$(CONFIG_KVM_ARM_HOST) += vgic-v2-sr.o
> > +obj-$(CONFIG_KVM_ARM_HOST) += vgic-v3-sr.o
> > diff --git a/arch/arm64/kvm/hyp/hyp.h b/arch/arm64/kvm/hyp/hyp.h
> > index 78f25c4..a31cb6e 100644
> > --- a/arch/arm64/kvm/hyp/hyp.h
> > +++ b/arch/arm64/kvm/hyp/hyp.h
> > @@ -30,5 +30,8 @@
> >  void __vgic_v2_save_state(struct kvm_vcpu *vcpu);
> >  void __vgic_v2_restore_state(struct kvm_vcpu *vcpu);
> >
> > +void __vgic_v3_save_state(struct kvm_vcpu *vcpu);
> > +void __vgic_v3_restore_state(struct kvm_vcpu *vcpu);
> > +
> >  #endif /* __ARM64_KVM_HYP_H__ */
> >
> > diff --git a/arch/arm64/kvm/hyp/vgic-v3-sr.c b/arch/arm64/kvm/hyp/vgic-v3-sr.c
> > new file mode 100644
> > index 000..b490db5
> > --- /dev/null
> > +++ b/arch/arm64/kvm/hyp/vgic-v3-sr.c
> > @@ -0,0 +1,222 @@
> > +/*
> > + * Copyright (C) 2012-2015 - ARM Ltd
> > + * Author: Marc Zyngier 
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program.  If not, see <http://www.gnu.org/licenses/>.
> > + */
> > +
> > +#include 
> > +#include 
> 
> This include starts spitting out compiler warnings due to use of
> undefined barrier primitives. I'm not sure where the best place to:
> 
>  #include 
> 
> is. I added it to:
> 
>   arch/arm64/include/asm/arch_gicv3.h

I already have a couple of fixes queued to that effect in my tree,
hopefully heading for 4.4-rc4. If you pull the branch I have on korg,
you'll get the whole thing that should compile without warning.

Thanks,

M.
-- 
Without deviation from the norm, progress is not possible.


Re: [PATCH v5 2/2] KVM: Make KVM_CAP_IRQFD dependent on KVM_CAP_IRQCHIP

2015-11-30 Thread Cornelia Huck
On Mon, 30 Nov 2015 12:40:45 +0300
Pavel Fedin  wrote:

> Now at least ARM is able to determine whether the machine has
> virtualization support for irqchip or not at runtime. Obviously,
> irqfd requires irqchip.
> 
> Signed-off-by: Pavel Fedin 
> ---
>  virt/kvm/kvm_main.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 7873d6d..a057d5e 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2716,13 +2716,15 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>   case KVM_CAP_INTERNAL_ERROR_DATA:
>  #ifdef CONFIG_HAVE_KVM_MSI
>   case KVM_CAP_SIGNAL_MSI:
> + /* Fallthrough */
>  #endif
> + case KVM_CAP_CHECK_EXTENSION_VM:
> + return 1;
>  #ifdef CONFIG_HAVE_KVM_IRQFD
>   case KVM_CAP_IRQFD:
>   case KVM_CAP_IRQFD_RESAMPLE:
> + return kvm_vm_ioctl_check_extension(kvm, KVM_CAP_IRQCHIP);

This won't work for s390, as it doesn't have KVM_CAP_IRQCHIP but
KVM_CAP_S390_IRQCHIP (which needs to be enabled).

>  #endif
> - case KVM_CAP_CHECK_EXTENSION_VM:
> - return 1;
>  #ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
>   case KVM_CAP_IRQ_ROUTING:
>   return KVM_MAX_IRQ_ROUTES;



Re: [PATCH v2 08/21] arm64: KVM: Implement debug save/restore

2015-11-30 Thread Alex Bennée

Marc Zyngier  writes:

> Implement the debug save restore as a direct translation of
> the assembly code version.
>
> Signed-off-by: Marc Zyngier 
> ---
>  arch/arm64/kvm/hyp/Makefile   |   1 +
>  arch/arm64/kvm/hyp/debug-sr.c | 130 ++
>  arch/arm64/kvm/hyp/hyp.h  |   9 +++
>  3 files changed, 140 insertions(+)
>  create mode 100644 arch/arm64/kvm/hyp/debug-sr.c
>
> diff --git a/arch/arm64/kvm/hyp/Makefile b/arch/arm64/kvm/hyp/Makefile
> index ec94200..ec14cac 100644
> --- a/arch/arm64/kvm/hyp/Makefile
> +++ b/arch/arm64/kvm/hyp/Makefile
> @@ -6,3 +6,4 @@ obj-$(CONFIG_KVM_ARM_HOST) += vgic-v2-sr.o
>  obj-$(CONFIG_KVM_ARM_HOST) += vgic-v3-sr.o
>  obj-$(CONFIG_KVM_ARM_HOST) += timer-sr.o
>  obj-$(CONFIG_KVM_ARM_HOST) += sysreg-sr.o
> +obj-$(CONFIG_KVM_ARM_HOST) += debug-sr.o
> diff --git a/arch/arm64/kvm/hyp/debug-sr.c b/arch/arm64/kvm/hyp/debug-sr.c
> new file mode 100644
> index 000..a0b2b99
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/debug-sr.c
> @@ -0,0 +1,130 @@
> +/*
> + * Copyright (C) 2015 - ARM Ltd
> + * Author: Marc Zyngier 
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> > + * along with this program.  If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include 
> +#include 
> +
> +#include 
> +
> +#include "hyp.h"
> +
> +#define read_debug(r,n)  read_sysreg(r##n##_el1)
> +#define write_debug(v,r,n)   write_sysreg(v, r##n##_el1)
> +
> +#define save_debug(ptr,reg,nr)   
> \
> + switch (nr) {   \
> + case 15:ptr[15] = read_debug(reg, 15);  \
> + case 14:ptr[14] = read_debug(reg, 14);  \
> + case 13:ptr[13] = read_debug(reg, 13);  \
> + case 12:ptr[12] = read_debug(reg, 12);  \
> + case 11:ptr[11] = read_debug(reg, 11);  \
> + case 10:ptr[10] = read_debug(reg, 10);  \
> + case 9: ptr[9] = read_debug(reg, 9);\
> + case 8: ptr[8] = read_debug(reg, 8);\
> + case 7: ptr[7] = read_debug(reg, 7);\
> + case 6: ptr[6] = read_debug(reg, 6);\
> + case 5: ptr[5] = read_debug(reg, 5);\
> + case 4: ptr[4] = read_debug(reg, 4);\
> + case 3: ptr[3] = read_debug(reg, 3);\
> + case 2: ptr[2] = read_debug(reg, 2);\
> + case 1: ptr[1] = read_debug(reg, 1);\
> + default:ptr[0] = read_debug(reg, 0);\
> + }
> +
> +#define restore_debug(ptr,reg,nr)\
> + switch (nr) {   \
> + case 15:write_debug(ptr[15], reg, 15);  \
> + case 14:write_debug(ptr[14], reg, 14);  \
> + case 13:write_debug(ptr[13], reg, 13);  \
> + case 12:write_debug(ptr[12], reg, 12);  \
> + case 11:write_debug(ptr[11], reg, 11);  \
> + case 10:write_debug(ptr[10], reg, 10);  \
> + case 9: write_debug(ptr[9], reg, 9);\
> + case 8: write_debug(ptr[8], reg, 8);\
> + case 7: write_debug(ptr[7], reg, 7);\
> + case 6: write_debug(ptr[6], reg, 6);\
> + case 5: write_debug(ptr[5], reg, 5);\
> + case 4: write_debug(ptr[4], reg, 4);\
> + case 3: write_debug(ptr[3], reg, 3);\
> + case 2: write_debug(ptr[2], reg, 2);\
> + case 1: write_debug(ptr[1], reg, 1);\
> + default:write_debug(ptr[0], reg, 0);\
> + }
> +
> +void __hyp_text __debug_save_state(struct kvm_vcpu *vcpu,
> +struct kvm_guest_debug_arch *dbg,
> +struct kvm_cpu_context *ctxt)
> +{
> + if (vcpu->arch.debug_flags & KVM_ARM64_DEBUG_DIRTY) {
> + u64 aa64dfr0 = 

[PATCH] vhost: replace % with & on data path

2015-11-30 Thread Michael S. Tsirkin
We know vring num is a power of 2, so use &
to mask the high bits.

Signed-off-by: Michael S. Tsirkin 
---
 drivers/vhost/vhost.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 080422f..85f0f0a 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1366,10 +1366,12 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
/* Only get avail ring entries after they have been exposed by guest. */
smp_rmb();
 
+   }
+
/* Grab the next descriptor number they're advertising, and increment
 * the index we've seen. */
if (unlikely(__get_user(ring_head,
-   &vq->avail->ring[last_avail_idx % vq->num]))) {
+   &vq->avail->ring[last_avail_idx & (vq->num - 1)]))) {
vq_err(vq, "Failed to read head: idx %d address %p\n",
   last_avail_idx,
   &vq->avail->ring[last_avail_idx % vq->num]);
@@ -1489,7 +1491,7 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
u16 old, new;
int start;
 
-   start = vq->last_used_idx % vq->num;
+   start = vq->last_used_idx & (vq->num - 1);
used = vq->used->ring + start;
if (count == 1) {
if (__put_user(heads[0].id, &used->id)) {
@@ -1531,7 +1533,7 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
 {
int start, n, r;
 
-   start = vq->last_used_idx % vq->num;
+   start = vq->last_used_idx & (vq->num - 1);
n = vq->num - start;
if (n < count) {
r = __vhost_add_used_n(vq, heads, n);
-- 
MST


Re: [PATCH 2/2] KVM: Create debugfs dir and stat files for each VM

2015-11-30 Thread Christian Borntraeger
On 11/27/2015 09:42 PM, Tyler Baker wrote:
> On 27 November 2015 at 10:53, Tyler Baker  wrote:
>> On 27 November 2015 at 09:08, Tyler Baker  wrote:
>>> On 27 November 2015 at 00:54, Christian Borntraeger
>>>  wrote:
 On 11/26/2015 09:47 PM, Christian Borntraeger wrote:
> On 11/26/2015 05:17 PM, Tyler Baker wrote:
>> Hi Christian,
>>
>> The kernelci.org bot recently has been reporting kvm guest boot
>> failures[1] on various arm64 platforms in next-20151126. The bot
>> bisected[2] the failures to the commit in -next titled "KVM: Create
>> debugfs dir and stat files for each VM". I confirmed by reverting this
>> commit on top of next-20151126 it resolves the boot issue.
>>
>> In this test case the host and guest are booted with the same kernel.
>> The host is booted over nfs, installs qemu (qemu-system arm64 2.4.0),
>> and launches a guest. The host is booting fine, but when the guest is
>> launched it errors with "Failed to retrieve host CPU features!". I
>> checked the host logs, and found an "Unable to handle kernel paging
>> request" splat[3] which occurs when the guest is attempting to start.
>>
>> I scanned the patch in question but nothing obvious jumped out at me,
>> any thoughts?
>
> Not really.
> Do you have processing running that do read the files in 
> /sys/kernel/debug/kvm/* ?
>
> If I read the arm oops message correctly it oopsed inside
> __srcu_read_lock. there is actually nothing in there that can oops,
> except the access to the preempt count. I am just guessing right now,
> but maybe the preempt variable is no longer available (as the process
> is gone). As long as a debugfs file is open, we hold a reference to
> the kvm, which holds a reference to the mm, so the mm might be killed
> after the process. But this is supposed to work, so maybe its something
> different. An objdump of __srcu_read_lock might help.

 Hmm, the preempt thing is done in srcu_read_lock, but the crash is in
 __srcu_read_lock. This function gets the srcu struct from mmu_notifier.c,
 which must be present and is initialized during boot.


 int __srcu_read_lock(struct srcu_struct *sp)
 {
 int idx;

 idx = READ_ONCE(sp->completed) & 0x1;
 __this_cpu_inc(sp->per_cpu_ref->c[idx]);
 smp_mb(); /* B */  /* Avoid leaking the critical section. */
 __this_cpu_inc(sp->per_cpu_ref->seq[idx]);
 return idx;
 }

 Looking at the code I have no clue why the patch does make a difference.
 Can you try to get an objdump -S for __srcu_read_lock?
>>
>> Some other interesting finding below...
>>
>> On the host, I do _not_ have any nodes under /sys/kernel/debug/kvm/
>>
>> Running strace on the qemu command I use to launch the guest yields
>> the following.
>>
>> [pid  5963] 1448649724.405537 mmap(NULL, 65536, PROT_READ|PROT_WRITE,
>> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f6652a000
>> [pid  5963] 1448649724.405586 read(13, "MemTotal:   16414616
>> kB\nMemF"..., 1024) = 1024
>> [pid  5963] 1448649724.405699 close(13) = 0
>> [pid  5963] 1448649724.405755 munmap(0x7f6652a000, 65536) = 0
>> [pid  5963] 1448649724.405947 brk(0x2552f000) = 0x2552f000
>> [pid  5963] 1448649724.406148 openat(AT_FDCWD, "/dev/kvm",
>> O_RDWR|O_CLOEXEC) = 13
>> [pid  5963] 1448649724.406209 ioctl(13, KVM_CREATE_VM, 0) = -1 ENOMEM
>> (Cannot allocate memory)
> 
> If I comment the call to kvm_create_vm_debugfs(kvm) the guest boots
> fine. I put some printk's in the kvm_create_vm_debugfs() function and
> it's returning -ENOMEM after it evaluates !kvm->debugfs_dentry. I was
> chatting with some folks from the Linaro virtualization team and they
> mentioned that ARM is a bit special as the same PID creates two vms in
> quick succession, the first one is a scratch vm, and the other is the
> 'real' vm. With that bit of info, I suspect we may be trying to create
> the debugfs directory twice, and the second time it's failing because
> it already exists.

Hmmm, with a patched QEMU that calls VM_CREATE twice, it fails on s390
with -ENOMEM (which it should not), but at least it errors out gracefully.

Does the attached patch avoid the crash? (guest will not start, but qemu
should error out gracefully with ENOMEM)



diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f7d6c8f..b26472a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -671,12 +671,16 @@ static struct kvm *kvm_create_vm(unsigned long type)
 
 	r = kvm_create_vm_debugfs(kvm);
 	if (r)
-		goto out_err;
+		goto out_mmu;
 
 	preempt_notifier_inc();
 
 	return kvm;
 
+out_mmu:
+#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+	mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
+#endif
 out_err:
 	cleanup_srcu_struct(&kvm->irq_srcu);
 out_err_no_irq_srcu:


Re: [PATCH] vhost: replace % with & on data path

2015-11-30 Thread Joe Perches
On Mon, 2015-11-30 at 10:34 +0200, Michael S. Tsirkin wrote:
> We know vring num is a power of 2, so use &
> to mask the high bits.
[]
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
[]
> @@ -1366,10 +1366,12 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>   /* Only get avail ring entries after they have been exposed by guest. */
>   smp_rmb();
>  
> + }

?



Re: [PATCH v8 0/5] implement vNVDIMM

2015-11-30 Thread Stefan Hajnoczi
On Mon, Nov 16, 2015 at 06:50:58PM +0800, Xiao Guangrong wrote:
> This patchset can be found at:
>   https://github.com/xiaogr/qemu.git nvdimm-v8
> 
> It is based on pci branch on Michael's tree and the top commit is:
> commit e3a4e177d9 (migration/ram: fix build on 32 bit hosts).
> 
> Changelog in v8:
> We split the long patch series into smaller parts; as you can see, this
> is the first part, which enables NVDIMM without label data support.
> 
> The command line has been changed because some patches simplifying
> things have not been included in this series; you should specify the
> file size explicitly using parameters as follows:
>memory-backend-file,id=mem1,share,mem-path=/tmp/nvdimm1,size=10G \
>-device nvdimm,memdev=mem1,id=nv1
> 
> Changelog in v7:
> - changes from Vladimir Sementsov-Ogievskiy's comments:
>   1) let gethugepagesize() realize if fstat is failed instead of get
>  normal page size
>   2) rename  open_file_path to open_ram_file_path
>   3) better log the error message by using error_setg_errno
>   4) update commit in the commit log to explain hugepage detection on
>  Windows
> 
> - changes from Eduardo Habkost's comments:
>   1) use 'Error**' to collect error message for qemu_file_get_page_size()
>   2) move gethugepagesize() replacement to the same patch to make it
>  better for review
>   3) introduce qemu_get_file_size to unify the code with raw_getlength()
> 
> - changes from Stefan's comments:
>   1) check the memory region is large enough to contain DSM output
>  buffer
> 
> - changes from Eric Blake's comments:
>   1) update the shell command in the commit log to generate the patch
>  which drops 'pc-dimm' prefix
>   
> - others:
>   pick up Reviewed-by from Stefan, Vladimir Sementsov-Ogievskiy, and
>   Eric Blake.
> 
> Changelog in v6:
> - changes from Stefan's comments:
>   1) fix code style of struct naming by CamelCase way
>   2) fix offset + length overflow when read/write label data
>   3) compile hw/acpi/nvdimm.c for per target so that TARGET_PAGE_SIZE can
>  be used to replace getpagesize()
> 
> Changelog in v5:
> - changes from Michael's comments:
>   1) prefix nvdimm_ to everything in NVDIMM source files
>   2) make parsing _DSM Arg3 more clear
>   3) comment style fix
>   5) drop single used definition
>   6) fix dirty dsm buffer lost due to memory write happened on host
>   7) check dsm buffer if it is big enough to contain input data
>   8) use build_append_int_noprefix to store single value to GArray
> 
> - changes from Michael's and Igor's comments:
>   1) introduce 'nvdimm-support' parameter to control nvdimm
>  enablement and it is disabled for 2.4 and its earlier versions
>  to make live migration compatible
>   2) only reserve 1 RAM page and 4 bytes IO Port for NVDIMM ACPI
>  virtualization
> 
> - changes from Stefan's comments:
>   1) do endian adjustment for the buffer length
> 
> - changes from Bharata B Rao's comments:
>   1) fix compile on ppc
> 
> - others:
>   1) the buffer length is directly got from IO read rather than got
>  from dsm memory
>   2) fix dirty label data lost due to memory write happened on host
> 
> Changelog in v4:
> - changes from Michael's comments:
>   1) show the message, "Memory is not allocated from HugeTlbfs", if file
>  based memory is not allocated from hugetlbfs.
>   2) introduce function, acpi_get_nvdimm_state(), to get NVDIMMState
>  from Machine.
>   3) statically define UUID and make its operation more clear
>   4) use GArray to build device structures to avoid potential buffer
>  overflow
>   4) improve comments in the code
>   5) improve code style
> 
> - changes from Igor's comments:
>   1) add NVDIMM ACPI spec document
>   2) use serialized method to avoid Mutex
>   3) move NVDIMM ACPI's code to hw/acpi/nvdimm.c
>   4) introduce a common ASL method used by _DSM for all devices to reduce
>  ACPI size
>   5) handle UUID in ACPI AML code. BTW, I'd keep handling revision in QEMU;
>  it's better to upgrade QEMU to support Rev2 in the future
> 
> - changes from Stefan's comments:
>   1) copy input data from DSM memory to local buffer to avoid potential
>  issues as DSM memory is visible to guest. Output data is handled
>  in a similar way
> 
> - changes from Dan's comments:
>   1) drop static namespace as Linux has already supported label-less
>  nvdimm devices
> 
> - changes from Vladimir's comments:
>   1) print better message, "failed to get file size for %s, can't create
>  backend on it", if any file operation failed to obtain file size
> 
> - others:
>   create a git repo on github.com for better review/test
> 
> Also, thanks for Eric Blake's review on QAPI's side.
> 
> Thanks to all of you for reviewing this patchset.
> 
> Changelog in v3:
> There is huge change in this version, thank Igor, Stefan, Paolo, Eduardo,
> Michael for their valuable comments, the patchset finally gets better shape.
> - changes from Igor's comments:
>   

[PATCH v2] vhost: replace % with & on data path

2015-11-30 Thread Michael S. Tsirkin
We know vring num is a power of 2, so use &
to mask the high bits.

Signed-off-by: Michael S. Tsirkin 
---

Changes from v1: drop an unrelated chunk

 drivers/vhost/vhost.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 080422f..ad2146a 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1369,7 +1369,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
/* Grab the next descriptor number they're advertising, and increment
 * the index we've seen. */
if (unlikely(__get_user(ring_head,
-   &vq->avail->ring[last_avail_idx % vq->num]))) {
+   &vq->avail->ring[last_avail_idx & (vq->num - 1)]))) {
vq_err(vq, "Failed to read head: idx %d address %p\n",
   last_avail_idx,
   &vq->avail->ring[last_avail_idx % vq->num]);
@@ -1489,7 +1489,7 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
u16 old, new;
int start;
 
-   start = vq->last_used_idx % vq->num;
+   start = vq->last_used_idx & (vq->num - 1);
used = vq->used->ring + start;
if (count == 1) {
if (__put_user(heads[0].id, &used->id)) {
@@ -1531,7 +1531,7 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
 {
int start, n, r;
 
-   start = vq->last_used_idx % vq->num;
+   start = vq->last_used_idx & (vq->num - 1);
n = vq->num - start;
if (n < count) {
r = __vhost_add_used_n(vq, heads, n);
-- 
MST


[PATCH RFC v2] virtio: skip avail/used index reads

2015-11-30 Thread Michael S. Tsirkin
This adds a new vring feature bit: when enabled, host and guest poll the
available/used ring directly instead of looking at the index field
first.

To guarantee it is possible to detect updates, the high bits (above
vring.num - 1) in the ring head ID value are modified to match the index
bits - these change on each wrap-around.  Writer also XORs this with
0x8000 such that rings can be zero-initialized.

Reader is modified to ignore these high bits when looking
up descriptors.

The point is to reduce the number of cacheline misses
for both reads and writes.

I see a performance improvement of about 20% on multithreaded benchmarks
(e.g. virtio-test), but regression of about 2% on vring_bench.
I think this has to do with the fact that complete_multi_user
is implemented suboptimally.

TODO:
investigate single-threaded regression
look at more aggressive ring layout changes
better name for a feature flag
split the patch to make it easier to review

This is on top of the following patches in my tree:
virtio_ring: Shadow available ring flags & index
vhost: replace % with & on data path
tools/virtio: fix byteswap logic
tools/virtio: move list macro stubs

Signed-off-by: Michael S. Tsirkin 
---

Changes from v1:
add a missing chunk in vhost_get_vq_desc

 drivers/vhost/vhost.h|   3 +-
 include/linux/vringh.h   |   3 +
 include/uapi/linux/virtio_ring.h |   3 +
 drivers/vhost/vhost.c| 104 ++
 drivers/vhost/vringh.c   | 153 +--
 drivers/virtio/virtio_ring.c |  40 --
 tools/virtio/virtio_test.c   |  14 +++-
 7 files changed, 256 insertions(+), 64 deletions(-)

diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index d3f7674..aeeb15d 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -175,7 +175,8 @@ enum {
 (1ULL << VIRTIO_RING_F_EVENT_IDX) |
 (1ULL << VHOST_F_LOG_ALL) |
 (1ULL << VIRTIO_F_ANY_LAYOUT) |
-(1ULL << VIRTIO_F_VERSION_1)
+(1ULL << VIRTIO_F_VERSION_1) |
+(1ULL << VIRTIO_RING_F_POLL)
 };
 
 static inline bool vhost_has_feature(struct vhost_virtqueue *vq, int bit)
diff --git a/include/linux/vringh.h b/include/linux/vringh.h
index bc6c28d..13a9e3e 100644
--- a/include/linux/vringh.h
+++ b/include/linux/vringh.h
@@ -40,6 +40,9 @@ struct vringh {
/* Can we get away with weak barriers? */
bool weak_barriers;
 
+   /* Poll ring directly */
+   bool poll;
+
/* Last available index we saw (ie. where we're up to). */
u16 last_avail_idx;
 
diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index c072959..bf3ca1d 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -62,6 +62,9 @@
  * at the end of the used ring. Guest should ignore the used->flags field. */
 #define VIRTIO_RING_F_EVENT_IDX29
 
+/* Support ring polling */
+#define VIRTIO_RING_F_POLL 33
+
 /* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
 struct vring_desc {
/* Address (guest-physical). */
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index ad2146a..cdbabf5 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1346,25 +1346,27 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 
/* Check it isn't doing very strange things with descriptor numbers. */
last_avail_idx = vq->last_avail_idx;
-   if (unlikely(__get_user(avail_idx, &vq->avail->idx))) {
-   vq_err(vq, "Failed to access avail idx at %p\n",
-  &vq->avail->idx);
-   return -EFAULT;
-   }
-   vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
+   if (!vhost_has_feature(vq, VIRTIO_RING_F_POLL)) {
+   if (unlikely(__get_user(avail_idx, &vq->avail->idx))) {
+   vq_err(vq, "Failed to access avail idx at %p\n",
+  &vq->avail->idx);
+   return -EFAULT;
+   }
+   vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
 
-   if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
-   vq_err(vq, "Guest moved used index from %u to %u",
-  last_avail_idx, vq->avail_idx);
-   return -EFAULT;
-   }
+   if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
+   vq_err(vq, "Guest moved used index from %u to %u",
+  last_avail_idx, vq->avail_idx);
+   return -EFAULT;
+   }
 
-   /* If there's nothing new since last we looked, return invalid. */
-   if (vq->avail_idx == last_avail_idx)
-   return vq->num;
+   /* If there's 

Re: [PATCH] vhost: replace % with & on data path

2015-11-30 Thread kbuild test robot
Hi Michael,

[auto build test ERROR on: v4.4-rc3]
[also build test ERROR on: next-20151127]

url:
https://github.com/0day-ci/linux/commits/Michael-S-Tsirkin/vhost-replace-with-on-data-path/20151130-163704
config: x86_64-randconfig-s0-11301655 (attached as .config)
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All error/warnings (new ones prefixed by >>):

   drivers/vhost/vhost.c: In function 'vhost_get_vq_desc':
   drivers/vhost/vhost.c:1345:6: warning: unused variable 'ret' 
[-Wunused-variable]
 int ret;
 ^
   drivers/vhost/vhost.c:1344:13: warning: unused variable 'ring_head' 
[-Wunused-variable]
 __virtio16 ring_head;
^
   drivers/vhost/vhost.c:1341:24: warning: unused variable 'found' 
[-Wunused-variable]
 unsigned int i, head, found = 0;
   ^
   drivers/vhost/vhost.c:1341:18: warning: unused variable 'head' 
[-Wunused-variable]
 unsigned int i, head, found = 0;
 ^
   drivers/vhost/vhost.c:1341:15: warning: unused variable 'i' 
[-Wunused-variable]
 unsigned int i, head, found = 0;
  ^
   drivers/vhost/vhost.c:1340:20: warning: unused variable 'desc' 
[-Wunused-variable]
 struct vring_desc desc;
   ^
   In file included from include/uapi/linux/stddef.h:1:0,
from include/linux/stddef.h:4,
from include/uapi/linux/posix_types.h:4,
from include/uapi/linux/types.h:13,
from include/linux/types.h:5,
from include/uapi/asm-generic/fcntl.h:4,
from arch/x86/include/uapi/asm/fcntl.h:1,
from include/uapi/linux/fcntl.h:4,
from include/linux/fcntl.h:4,
from include/linux/eventfd.h:11,
from drivers/vhost/vhost.c:14:
   drivers/vhost/vhost.c: At top level:
>> include/linux/compiler.h:147:2: error: expected identifier or '(' before 'if'
 if (__builtin_constant_p((cond)) ? !!(cond) :   \
 ^
   include/linux/compiler.h:145:23: note: in expansion of macro '__trace_if'
#define if(cond, ...) __trace_if( (cond , ## __VA_ARGS__) )
  ^
>> drivers/vhost/vhost.c:1373:2: note: in expansion of macro 'if'
 if (unlikely(__get_user(ring_head,
 ^
   arch/x86/include/asm/uaccess.h:414:2: error: expected identifier or '(' 
before ')' token
})
 ^
   include/linux/compiler.h:147:28: note: in definition of macro '__trace_if'
 if (__builtin_constant_p((cond)) ? !!(cond) :   \
   ^
>> drivers/vhost/vhost.c:1373:2: note: in expansion of macro 'if'
 if (unlikely(__get_user(ring_head,
 ^
   drivers/vhost/vhost.c:1373:6: note: in expansion of macro 'unlikely'
 if (unlikely(__get_user(ring_head,
 ^
   arch/x86/include/asm/uaccess.h:479:2: note: in expansion of macro 
'__get_user_nocheck'
 __get_user_nocheck((x), (ptr), sizeof(*(ptr)))
 ^
   drivers/vhost/vhost.c:1373:15: note: in expansion of macro '__get_user'
 if (unlikely(__get_user(ring_head,
  ^
   arch/x86/include/asm/uaccess.h:414:2: error: expected identifier or '(' 
before ')' token
})
 ^
   include/linux/compiler.h:147:28: note: in definition of macro '__trace_if'
 if (__builtin_constant_p((cond)) ? !!(cond) :   \
   ^
>> drivers/vhost/vhost.c:1373:2: note: in expansion of macro 'if'
 if (unlikely(__get_user(ring_head,
 ^
   drivers/vhost/vhost.c:1373:6: note: in expansion of macro 'unlikely'
 if (unlikely(__get_user(ring_head,
 ^
   arch/x86/include/asm/uaccess.h:479:2: note: in expansion of macro 
'__get_user_nocheck'
 __get_user_nocheck((x), (ptr), sizeof(*(ptr)))
 ^
   drivers/vhost/vhost.c:1373:15: note: in expansion of macro '__get_user'
 if (unlikely(__get_user(ring_head,
  ^
   include/linux/compiler.h:126:4: error: expected identifier or '(' before ')' 
token
  })
   ^
   include/linux/compiler.h:147:28: note: in definition of macro '__trace_if'
 if (__builtin_constant_p((cond)) ? !!(cond) :   \
   ^
>> drivers/vhost/vhost.c:1373:2: note: in expansion of macro 'if'
 if (unlikely(__get_user(ring_head,
 ^
   include/linux/compiler.h:137:58: note: in expansion of macro 
'__branch_check__'
#  define unlikely(x) (__builtin_constant_p(x) ? !!(x) : 
__branch_check__(x, 0))
 ^
   drivers/vhost/vhost.c:1373:6: note: in expansion of macro 'unlikely'
 if (unlikely(__get_user(ring_head,
 ^
   arch/x86/include/asm/uaccess.h:414:2: error: expected identifier or '(' 
before ')' token
})
 ^
   include/linux/compiler.h:147:40: note: in definition of macro '__trace_if'
 if (__builtin_constant_p((cond)) ? !!(cond) :   \
   

Re: [for-2.6 PATCH 0/3] target-i386: Use C struct for xsave area layout, offsets & sizes

2015-11-30 Thread Eduardo Habkost
On Mon, Nov 30, 2015 at 12:21:23PM +0100, Paolo Bonzini wrote:
> 
> 
> On 28/11/2015 20:56, Eduardo Habkost wrote:
> > I still need to figure out a way to write unit tests for the new
> > code. Maybe I will just copy and paste the new and old functions,
> > and test them locally (checking if they give the same results
> > when translating blobs of random bytes).
> 
> Aren't the QEMU_BUILD_BUG_ON enough?  No need to delete them in patch 3,
> though perhaps you can remove the #defines.

Just wanted to be 100% sure. Even if the offsets are all correct,
I might have made other mistakes when translating the get/save
code.

About the QEMU_BUILD_BUG_ON lines, we can keep them if you like.
We could translate the uint32_t offsets to byte offsets after
patch 3/3, to make them easier to compare to the Intel docs.

-- 
Eduardo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC] virtio: skip avail/used index reads

2015-11-30 Thread Michael S. Tsirkin
This adds a new vring feature bit: when enabled, host and guest poll the
available/used ring directly instead of looking at the index field
first.

To guarantee it is possible to detect updates, the high bits (above
vring.num - 1) in the ring head ID value are modified to match the index
bits - these change on each wrap-around.  Writer also XORs this with
0x8000 such that rings can be zero-initialized.

Reader is modified to ignore these high bits when looking
up descriptors.

The point is to reduce the number of cacheline misses
for both reads and writes.
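The head-ID encoding described above can be sketched in plain C roughly as
follows (a minimal illustration with invented names; this is not code from
the patch, and it assumes num is a power of two):

```c
#include <stdint.h>

/* Writer: low bits carry the descriptor head; the high bits mirror the
 * ring index (they flip on every wrap-around) and are XORed with 0x8000
 * so that a zero-initialized ring never looks like a valid entry. */
static uint16_t encode_head(uint16_t head, uint16_t idx, uint16_t num)
{
	return (head & (num - 1)) | ((idx & ~(num - 1)) ^ 0x8000);
}

/* Reader: ignore the high bits when looking up descriptors. */
static uint16_t decode_head(uint16_t val, uint16_t num)
{
	return val & (num - 1);
}
```

A polling reader can detect a fresh entry without reading the index first,
because the stored value changes across each wrap-around even when the same
head number is reused.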

I see a speedup of about 20% on a multithreaded micro-benchmark
(virtio-test), but regression of about 2% on a single-threaded one
(vring_bench).  I think this has to do with the fact that
complete_multi_user is implemented suboptimally.

TODO:
investigate single-threaded regression
better name for a feature flag
split the patch to make it easier to review
look at more aggressive ring layout changes
write a spec patch

This is on top of the following patches in my tree:
virtio_ring: Shadow available ring flags & index
vhost: replace % with & on data path
tools/virtio: fix byteswap logic
tools/virtio: move list macro stubs

Signed-off-by: Michael S. Tsirkin 
---
 drivers/vhost/vhost.h|   3 +-
 include/linux/vringh.h   |   3 +
 include/uapi/linux/virtio_ring.h |   3 +
 drivers/vhost/vhost.c| 104 ++
 drivers/vhost/vringh.c   | 153 +--
 drivers/virtio/virtio_ring.c |  40 --
 tools/virtio/virtio_test.c   |  14 +++-
 7 files changed, 255 insertions(+), 65 deletions(-)

diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index d3f7674..aeeb15d 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -175,7 +175,8 @@ enum {
 (1ULL << VIRTIO_RING_F_EVENT_IDX) |
 (1ULL << VHOST_F_LOG_ALL) |
 (1ULL << VIRTIO_F_ANY_LAYOUT) |
-(1ULL << VIRTIO_F_VERSION_1)
+(1ULL << VIRTIO_F_VERSION_1) |
+(1ULL << VIRTIO_RING_F_POLL)
 };
 
 static inline bool vhost_has_feature(struct vhost_virtqueue *vq, int bit)
diff --git a/include/linux/vringh.h b/include/linux/vringh.h
index bc6c28d..13a9e3e 100644
--- a/include/linux/vringh.h
+++ b/include/linux/vringh.h
@@ -40,6 +40,9 @@ struct vringh {
/* Can we get away with weak barriers? */
bool weak_barriers;
 
+   /* Poll ring directly */
+   bool poll;
+
/* Last available index we saw (ie. where we're up to). */
u16 last_avail_idx;
 
diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index c072959..bf3ca1d 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -62,6 +62,9 @@
  * at the end of the used ring. Guest should ignore the used->flags field. */
 #define VIRTIO_RING_F_EVENT_IDX29
 
+/* Support ring polling */
+#define VIRTIO_RING_F_POLL 33
+
 /* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
 struct vring_desc {
/* Address (guest-physical). */
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 85f0f0a..cdbabf5 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1346,26 +1346,26 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 
/* Check it isn't doing very strange things with descriptor numbers. */
last_avail_idx = vq->last_avail_idx;
-   if (unlikely(__get_user(avail_idx, &vq->avail->idx))) {
-   vq_err(vq, "Failed to access avail idx at %p\n",
-  &vq->avail->idx);
-   return -EFAULT;
-   }
-   vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
-
-   if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
-   vq_err(vq, "Guest moved used index from %u to %u",
-  last_avail_idx, vq->avail_idx);
-   return -EFAULT;
-   }
+   if (!vhost_has_feature(vq, VIRTIO_RING_F_POLL)) {
+   if (unlikely(__get_user(avail_idx, &vq->avail->idx))) {
+   vq_err(vq, "Failed to access avail idx at %p\n",
+  &vq->avail->idx);
+   return -EFAULT;
+   }
+   vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
 
-   /* If there's nothing new since last we looked, return invalid. */
-   if (vq->avail_idx == last_avail_idx)
-   return vq->num;
+   if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
+   vq_err(vq, "Guest moved used index from %u to %u",
+  last_avail_idx, vq->avail_idx);
+   return -EFAULT;
+   }
 
-   /* Only get avail ring entries after they have been 

Re: best way to create a snapshot of a running vm ?

2015-11-30 Thread Stefan Hajnoczi
On Mon, Nov 30, 2015 at 12:36:56AM +0100, Lentes, Bernd wrote:
> what is the best way to create a snapshot of a running vm ? qemu-img or virsh 
> ?
> I#d like to create a snapshot which is copied afterwards by other means, e.g. 
> by a network based backup software.

Hi Bernd,
qemu-img cannot be used on the disk image while the VM is running.
Please use virsh; it communicates with the running QEMU process and
ensures that the snapshot is crash-consistent.

Stefan
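For example, an external disk-only snapshot that is safe to copy with a
network backup tool could be taken roughly like this (a sketch; the domain
name, snapshot name, and disk target are placeholders):

```
# crash-consistent external snapshot of the running guest
virsh snapshot-create-as --domain myvm backup-snap \
      --disk-only --atomic
# the original image is now a stable backing file; copy it with the
# backup software, then merge the overlay back into it:
virsh blockcommit myvm vda --active --pivot
```

--disk-only avoids saving guest RAM, and --atomic ensures the snapshot
either fully succeeds or fails without partial changes.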




[PATCH v5 0/2] KVM: arm/arm64: Allow to use KVM without in-kernel irqchip

2015-11-30 Thread Pavel Fedin
This patch set brings back functionality which was broken in v4.0.
Unfortunately, it is currently impossible to take advantage of the
virtual architected timer in this case; therefore a guest running in
such a restricted mode has to use some memory-mapped timer. But it is
still better than nothing.

Patch 0002 needs to be verified on the PowerPC architecture, because
I've got the impression that KVM_CAP_IRQCHIP is forgotten there.

v4 => v5:
- Tested on top of kvmarm/next
- Dropped already applied part
- Fixed minor checkpatch issues

v3 => v4:
- Revert to using a switch on the kvm_vgic_hyp_init() return code. I decided
  to keep the 'vgic_present = false' statement because it helps to understand
  the code.

v2 => v3:
- Improved commit messages, added references to commits where the respective
  functionality was broken
- Explicitly specify that the solution currently affects only vGIC and has
  nothing to do with timer.
- Fixed code style according to previous notes
- Removed ARM64 save/restore patch introduced in v2 because it was already
  obsolete for linux-next
- Modify KVM_CAP_IRQFD handling in correct place

v1 => v2:
- Do not use defensive approach in patch 0001. Use correct conditions in
  callers instead
- Added ARM64-specific code, without which attempt to run a VM ends in a
  HYP crash because of unset vGIC save/restore function pointers



Pavel Fedin (2):
  arm/arm64: KVM: Detect vGIC presence at runtime
  KVM: Make KVM_CAP_IRQFD dependent on KVM_CAP_IRQCHIP

 arch/arm/kvm/arm.c  | 22 --
 virt/kvm/kvm_main.c |  6 --
 2 files changed, 24 insertions(+), 4 deletions(-)

-- 
2.4.4



[PATCH v5 1/2] arm/arm64: KVM: Detect vGIC presence at runtime

2015-11-30 Thread Pavel Fedin
Before commit 662d9715840aef44dcb573b0f9fab9e8319c868a
("arm/arm64: KVM: Kill CONFIG_KVM_ARM_{VGIC,TIMER}") it was possible to
compile the kernel without vGIC and vTimer support. The commit message
mentions the possibility of detecting vGIC support at runtime, but this
has never been implemented.

This patch introduces a runtime check, restoring the lost functionality.
It again allows KVM to be used on hardware without a vGIC; the interrupt
controller has to be emulated in userspace in this case.

-ENODEV return code from probe function means there's no GIC at all.
-ENXIO happens when, for example, there is GIC node in the device tree,
but it does not specify vGIC resources. Normally this means that vGIC
hardware is defunct. Any other error code is still treated as full stop
because it might mean some really serious problems.

This patch does not touch any virtual timer code and assumes that timer
hardware is actually in place. Normally, on the boards in question, this
is true; however, since the vGIC is missing, it is impossible to
correctly deliver interrupts from the virtual timer. Since virtual timer
handling is under active redevelopment, handling it in userspace is out
of scope at the moment. The guest is currently expected to use some
memory-mapped timer, which can be emulated in userspace.

Signed-off-by: Pavel Fedin 
---
 arch/arm/kvm/arm.c | 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index eab83b2..d581756 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -61,6 +61,8 @@ static atomic64_t kvm_vmid_gen = ATOMIC64_INIT(1);
 static u8 kvm_next_vmid;
 static DEFINE_SPINLOCK(kvm_vmid_lock);
 
+static bool vgic_present;
+
 static void kvm_arm_set_running_vcpu(struct kvm_vcpu *vcpu)
 {
BUG_ON(preemptible());
@@ -132,7 +134,8 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
kvm->arch.vmid_gen = 0;
 
/* The maximum number of VCPUs is limited by the host's GIC model */
-   kvm->arch.max_vcpus = kvm_vgic_get_max_vcpus();
+   kvm->arch.max_vcpus = vgic_present ?
+   kvm_vgic_get_max_vcpus() : KVM_MAX_VCPUS;
 
return ret;
 out_free_stage2_pgd:
@@ -172,6 +175,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
int r;
switch (ext) {
case KVM_CAP_IRQCHIP:
+   r = vgic_present;
+   break;
case KVM_CAP_IOEVENTFD:
case KVM_CAP_DEVICE_CTRL:
case KVM_CAP_USER_MEMORY:
@@ -918,6 +923,8 @@ static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
 
switch (dev_id) {
case KVM_ARM_DEVICE_VGIC_V2:
+   if (!vgic_present)
+   return -ENXIO;
return kvm_vgic_addr(kvm, type, &dev_addr->addr, true);
default:
return -ENODEV;
@@ -932,6 +939,8 @@ long kvm_arch_vm_ioctl(struct file *filp,
 
switch (ioctl) {
case KVM_CREATE_IRQCHIP: {
+   if (!vgic_present)
+   return -ENXIO;
return kvm_vgic_create(kvm, KVM_DEV_TYPE_ARM_VGIC_V2);
}
case KVM_ARM_SET_DEVICE_ADDR: {
@@ -1116,8 +1125,17 @@ static int init_hyp_mode(void)
 * Init HYP view of VGIC
 */
err = kvm_vgic_hyp_init();
-   if (err)
+   switch (err) {
+   case 0:
+   vgic_present = true;
+   break;
+   case -ENODEV:
+   case -ENXIO:
+   vgic_present = false;
+   break;
+   default:
goto out_free_context;
+   }
 
/*
 * Init HYP architected timer support
-- 
2.4.4



[PATCH v5 2/2] KVM: Make KVM_CAP_IRQFD dependent on KVM_CAP_IRQCHIP

2015-11-30 Thread Pavel Fedin
Now at least ARM is able to determine at runtime whether the machine
has virtualization support for an irqchip. Obviously, irqfd requires
an irqchip.

Signed-off-by: Pavel Fedin 
---
 virt/kvm/kvm_main.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7873d6d..a057d5e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2716,13 +2716,15 @@ static long kvm_vm_ioctl_check_extension_generic(struct 
kvm *kvm, long arg)
case KVM_CAP_INTERNAL_ERROR_DATA:
 #ifdef CONFIG_HAVE_KVM_MSI
case KVM_CAP_SIGNAL_MSI:
+   /* Fallthrough */
 #endif
+   case KVM_CAP_CHECK_EXTENSION_VM:
+   return 1;
 #ifdef CONFIG_HAVE_KVM_IRQFD
case KVM_CAP_IRQFD:
case KVM_CAP_IRQFD_RESAMPLE:
+   return kvm_vm_ioctl_check_extension(kvm, KVM_CAP_IRQCHIP);
 #endif
-   case KVM_CAP_CHECK_EXTENSION_VM:
-   return 1;
 #ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
case KVM_CAP_IRQ_ROUTING:
return KVM_MAX_IRQ_ROUTES;
-- 
2.4.4



Re: [PATCH v2 08/21] arm64: KVM: Implement debug save/restore

2015-11-30 Thread Marc Zyngier
On Mon, 30 Nov 2015 12:00:24 +
Alex Bennée  wrote:

> 
> Marc Zyngier  writes:
> 
> > Implement the debug save restore as a direct translation of
> > the assembly code version.
> >
> > Signed-off-by: Marc Zyngier 
> > ---
> >  arch/arm64/kvm/hyp/Makefile   |   1 +
> >  arch/arm64/kvm/hyp/debug-sr.c | 130 
> > ++
> >  arch/arm64/kvm/hyp/hyp.h  |   9 +++
> >  3 files changed, 140 insertions(+)
> >  create mode 100644 arch/arm64/kvm/hyp/debug-sr.c
> >
> > diff --git a/arch/arm64/kvm/hyp/Makefile b/arch/arm64/kvm/hyp/Makefile
> > index ec94200..ec14cac 100644
> > --- a/arch/arm64/kvm/hyp/Makefile
> > +++ b/arch/arm64/kvm/hyp/Makefile
> > @@ -6,3 +6,4 @@ obj-$(CONFIG_KVM_ARM_HOST) += vgic-v2-sr.o
> >  obj-$(CONFIG_KVM_ARM_HOST) += vgic-v3-sr.o
> >  obj-$(CONFIG_KVM_ARM_HOST) += timer-sr.o
> >  obj-$(CONFIG_KVM_ARM_HOST) += sysreg-sr.o
> > +obj-$(CONFIG_KVM_ARM_HOST) += debug-sr.o
> > diff --git a/arch/arm64/kvm/hyp/debug-sr.c b/arch/arm64/kvm/hyp/debug-sr.c
> > new file mode 100644
> > index 000..a0b2b99
> > --- /dev/null
> > +++ b/arch/arm64/kvm/hyp/debug-sr.c
> > @@ -0,0 +1,130 @@
> > +/*
> > + * Copyright (C) 2015 - ARM Ltd
> > + * Author: Marc Zyngier 
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program.  If not, see <http://www.gnu.org/licenses/>.
> > + */
> > +
> > +#include 
> > +#include 
> > +
> > +#include 
> > +
> > +#include "hyp.h"
> > +
> > +#define read_debug(r,n)    read_sysreg(r##n##_el1)
> > +#define write_debug(v,r,n) write_sysreg(v, r##n##_el1)
> > +
> > +#define save_debug(ptr,reg,nr) 
> > \
> > +   switch (nr) {   \
> > +   case 15:ptr[15] = read_debug(reg, 15);  \
> > +   case 14:ptr[14] = read_debug(reg, 14);  \
> > +   case 13:ptr[13] = read_debug(reg, 13);  \
> > +   case 12:ptr[12] = read_debug(reg, 12);  \
> > +   case 11:ptr[11] = read_debug(reg, 11);  \
> > +   case 10:ptr[10] = read_debug(reg, 10);  \
> > +   case 9: ptr[9] = read_debug(reg, 9);\
> > +   case 8: ptr[8] = read_debug(reg, 8);\
> > +   case 7: ptr[7] = read_debug(reg, 7);\
> > +   case 6: ptr[6] = read_debug(reg, 6);\
> > +   case 5: ptr[5] = read_debug(reg, 5);\
> > +   case 4: ptr[4] = read_debug(reg, 4);\
> > +   case 3: ptr[3] = read_debug(reg, 3);\
> > +   case 2: ptr[2] = read_debug(reg, 2);\
> > +   case 1: ptr[1] = read_debug(reg, 1);\
> > +   default:ptr[0] = read_debug(reg, 0);\
> > +   }
> > +
> > +#define restore_debug(ptr,reg,nr)  \
> > +   switch (nr) {   \
> > +   case 15:write_debug(ptr[15], reg, 15);  \
> > +   case 14:write_debug(ptr[14], reg, 14);  \
> > +   case 13:write_debug(ptr[13], reg, 13);  \
> > +   case 12:write_debug(ptr[12], reg, 12);  \
> > +   case 11:write_debug(ptr[11], reg, 11);  \
> > +   case 10:write_debug(ptr[10], reg, 10);  \
> > +   case 9: write_debug(ptr[9], reg, 9);\
> > +   case 8: write_debug(ptr[8], reg, 8);\
> > +   case 7: write_debug(ptr[7], reg, 7);\
> > +   case 6: write_debug(ptr[6], reg, 6);\
> > +   case 5: write_debug(ptr[5], reg, 5);\
> > +   case 4: write_debug(ptr[4], reg, 4);\
> > +   case 3: write_debug(ptr[3], reg, 3);\
> > +   case 2: write_debug(ptr[2], reg, 2);\
> > +   case 1: write_debug(ptr[1], reg, 1);\
> > +   default:write_debug(ptr[0], reg, 0);\
> > +   }
> > +
> > +void __hyp_text __debug_save_state(struct kvm_vcpu *vcpu,
> > +  
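The save_debug()/restore_debug() macros above rely on deliberate switch
fall-through so that exactly registers 0..nr are transferred with a single
jump. A standalone C sketch of the same pattern (invented names, not the
HYP code):

```c
/* Copy entries 0..nr from reg[] to ptr[]; each case deliberately falls
 * through to the next lower one, mirroring the save_debug() macro. */
static void save_first_n(unsigned long *ptr, const unsigned long *reg, int nr)
{
	switch (nr) {
	case 3: ptr[3] = reg[3]; /* fall through */
	case 2: ptr[2] = reg[2]; /* fall through */
	case 1: ptr[1] = reg[1]; /* fall through */
	default: ptr[0] = reg[0];
	}
}
```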

Re: [PATCH v8 4/5] nvdimm acpi: build ACPI nvdimm devices

2015-11-30 Thread Xiao Guangrong



On 11/30/2015 06:32 PM, Michael S. Tsirkin wrote:

On Mon, Nov 16, 2015 at 06:51:02PM +0800, Xiao Guangrong wrote:

NVDIMM devices are defined in ACPI 6.0, section 9.20 (NVDIMM Devices)


Forgot to mention:

Pls put spec info in code comments near
relevant functions, not just the log.



Sure, good to me.


+
+static void nvdimm_build_ssdt(GSList *device_list, GArray *table_offsets,
+  GArray *table_data, GArray *linker)
+{
+Aml *ssdt, *sb_scope, *dev, *method;


So why don't we skip this completely if device list is empty?


Yes, it is exactly what we did:

 void nvdimm_build_acpi(GArray *table_offsets, GArray *table_data,
GArray *linker)
 {
 GSList *device_list;

 /* no NVDIMM device is plugged. */
 device_list = nvdimm_get_plugged_device_list();
 if (!device_list) {
 return;
 }
 nvdimm_build_nfit(device_list, table_offsets, table_data, linker);
+nvdimm_build_ssdt(device_list, table_offsets, table_data, linker);
 g_slist_free(device_list);
 }


Re: [PATCH v8 0/5] implement vNVDIMM

2015-11-30 Thread Xiao Guangrong



On 11/30/2015 06:38 PM, Michael S. Tsirkin wrote:

On Mon, Nov 16, 2015 at 06:50:58PM +0800, Xiao Guangrong wrote:

This patchset can be found at:
   https://github.com/xiaogr/qemu.git nvdimm-v8

It is based on pci branch on Michael's tree and the top commit is:
commit e3a4e177d9 (migration/ram: fix build on 32 bit hosts).

Changelog in v8:
We split the long patch series into the small parts, as you see now, this
is the first part which enables NVDIMM without label data support.


Finally found some time to review this.  Very nice, this is making good
progress, and I think to split it like this is a great idea.  I sent
some comments, most of them minor.



Thanks for your time and really happy to see you like it. :)


Re: [PATCH v8 0/5] implement vNVDIMM

2015-11-30 Thread Xiao Guangrong



On 11/30/2015 04:51 PM, Stefan Hajnoczi wrote:



Reviewed-by: Stefan Hajnoczi 



Thanks for your review Stefan. Will pick up your Reviewed-by in
the next version. :)


RE: [PATCH v5 2/2] KVM: Make KVM_CAP_IRQFD dependent on KVM_CAP_IRQCHIP

2015-11-30 Thread Pavel Fedin
 Hello!

> >  Thank you for the note, i didn't know about irqchip-specific capability 
> > codes. There's the
> > same issue with PowerPC, now i
> > understand why there's no KVM_CAP_IRQCHIP for them. Because they have 
> > KVM_CAP_IRQ_MPIC and
> > KVM_CAP_IRQ_XICS, similar to S390.
> >  But isn't it just weird? I understand that perhaps we have some real need 
> > to distinguish
> > between different irqchip types, but
> > shouldn't the kernel also publish KVM_CAP_IRQCHIP, which stands just for 
> > "we support some
> > irqchip virtualization"?
> >  May be we should just add this for PowerPC and S390, to make things less 
> > ambiguous?
> 
> Note that we explicitly need to _enable_ the s390 cap (for
> compatibility). I'd need to recall the exact details but I came to the
> conclusion back than that I could not simply enable KVM_CAP_IRQCHIP for
> s390 (and current qemu would fail to enable the s390 cap if we started
> advertising KVM_CAP_IRQCHIP now).

 OMG... I've looked at the code, what a mess...
 If I were implementing this, I'd simply introduce kvm_vm_enable_cap(s, 
KVM_CAP_IRQCHIP, 0),
which would be allowed to fail with -ENOSYS, so that backward compatibility is 
kept and an existing API is reused... But, well,
it's already impossible to unscramble an egg... :)
 Ok, I think in the current situation we could choose one of these ways (both are 
based on the fact that it's obvious that irqfd requires an
IRQCHIP).
 a) I look for an alternate way to report KVM_CAP_IRQFD dynamically, and maybe 
PowerPC and S390 follow this way.
 b) I simply drop it as it is, because current qemu knows about the dependency 
and does not try to use irqfd without an irqchip,
because there's simply no use for them. But, well, perhaps there would be an 
exception in vhost; I don't remember testing it.
 So what shall we do?

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia




Re: [PATCH v8 4/5] nvdimm acpi: build ACPI nvdimm devices

2015-11-30 Thread Xiao Guangrong



On 11/30/2015 06:30 PM, Michael S. Tsirkin wrote:

On Mon, Nov 16, 2015 at 06:51:02PM +0800, Xiao Guangrong wrote:

NVDIMM devices are defined in ACPI 6.0, section 9.20 (NVDIMM Devices)

There is a root device under \_SB, and the specified NVDIMM devices are
under that root device. Each NVDIMM device has an _ADR method which returns
the handle used to associate it with its MEMDEV structure in the NFIT

Currently, we do not support any function on _DSM; that means NVDIMM
label data is not supported yet

Signed-off-by: Xiao Guangrong 
---
  hw/acpi/nvdimm.c | 85 
  1 file changed, 85 insertions(+)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index 98c004d..abe0daa 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -367,6 +367,90 @@ static void nvdimm_build_nfit(GSList *device_list, GArray 
*table_offsets,
  g_array_free(structures, true);
  }

+static void nvdimm_build_common_dsm(Aml *root_dev)
+{
+Aml *method, *ifctx, *function;
+uint8_t byte_list[1];
+
+method = aml_method("NCAL", 4);


This "NCAL" needs a define as it's used
in multiple places. It's really just a DSM
implementation, right? Reflect this in the macro
name.



Yes, it is a common DSM method used by both root device and nvdimm devices.
I will do it like this:

#define NVDIMM_COMMON_DSM   "NCAL"


+{


What's this doing?



It is just a reminder that the code contained in these braces is a DSM
body, like a C function. However, I do not have a strong opinion on it and
will drop this style if you dislike it.


+function = aml_arg(2);
+
+/*
+ * function 0 is called to inquire what functions are supported by
+ * OSPM
+ */
+ifctx = aml_if(aml_equal(function, aml_int(0)));
+byte_list[0] = 0 /* No function Supported */;
+aml_append(ifctx, aml_return(aml_buffer(1, byte_list)));
+aml_append(method, ifctx);
+
+/* No function is supported yet. */
+byte_list[0] = 1 /* Not Supported */;
+aml_append(method, aml_return(aml_buffer(1, byte_list)));
+}
+aml_append(root_dev, method);
+}
+
+static void nvdimm_build_nvdimm_devices(GSList *device_list, Aml *root_dev)
+{
+for (; device_list; device_list = device_list->next) {
+DeviceState *dev = device_list->data;
+int slot = object_property_get_int(OBJECT(dev), PC_DIMM_SLOT_PROP,
+   NULL);
+uint32_t handle = nvdimm_slot_to_handle(slot);
+Aml *nvdimm_dev, *method;
+
+nvdimm_dev = aml_device("NV%02X", slot);
+aml_append(nvdimm_dev, aml_name_decl("_ADR", aml_int(handle)));
+
+method = aml_method("_DSM", 4);
+{
+aml_append(method, aml_return(aml_call4("NCAL", aml_arg(0),
+   aml_arg(1), aml_arg(2), aml_arg(3))));
+}
+aml_append(nvdimm_dev, method);
+
+aml_append(root_dev, nvdimm_dev);
+}
+}
+
+static void nvdimm_build_ssdt(GSList *device_list, GArray *table_offsets,
+  GArray *table_data, GArray *linker)
+{
+Aml *ssdt, *sb_scope, *dev, *method;
+
+acpi_add_table(table_offsets, table_data);
+
+ssdt = init_aml_allocator();
+acpi_data_push(ssdt->buf, sizeof(AcpiTableHeader));
+
+sb_scope = aml_scope("\\_SB");
+
+dev = aml_device("NVDR");
+aml_append(dev, aml_name_decl("_HID", aml_string("ACPI0012")));


Pls add a comment explaining that ACPI0012 is NVDIMM root device.


Okay, will add these comment:

/*
 * NVDIMM is introduced in ACPI 6.0 9.20 NVDIMM Devices which defines an NVDIMM
 * root device under _SB scope with a _HID of “ACPI0012”. For each NVDIMM 
present
 * or intended to be supported by platform, platform firmware also exposes an 
ACPI
 * Namespace Device under the root device.
 */



Also - this will now appear for all users, e.g.
windows guests will prompt users for a driver.
Not nice if user didn't actually ask for nvdimm.

A simple solution is to default this functionality
to off by default.



Okay, will disable nvdimm on default in the next version.


+
+nvdimm_build_common_dsm(dev);
+method = aml_method("_DSM", 4);
+{
+aml_append(method, aml_return(aml_call4("NCAL", aml_arg(0),
+   aml_arg(1), aml_arg(2), aml_arg(3))));
+}


Some duplication here, move above to a sub-function please.


Okay, will add a function named nvdimm_build_device_dsm() to do these
things.

Thanks!

