Re: [PATCH 5/7] userfaultfd: switch to exclusive wakeup for blocking reads

2015-06-16 Thread Andrea Arcangeli
On Mon, Jun 15, 2015 at 08:41:24PM -1000, Linus Torvalds wrote:
 On Mon, Jun 15, 2015 at 12:19 PM, Andrea Arcangeli aarca...@redhat.com 
 wrote:
 
  Yes, it would leave the other blocked, how is it different from having
  just 1 reader and it gets killed?
 
 Either is completely wrong. But the read() code can at least see that
 I'm returning early due to a signal, so I'll wake up any other
 waiters.
 
 Poll simply *cannot* do that. Because by definition poll always
 returns without actually clearing the thing that caused the wakeup.
 
 So for poll(), using exclusive waits is wrong very much by
 definition. For read(), you *can* use exclusive waits correctly, it
 just requires you to wake up others if you don't read all the data
 (either due to being killed by a signal, or because the read was
 incomplete).

There's no interface to do wakeone with poll, so frankly I haven't
thought much about it, but intuitively it didn't look radically
different as long as poll checks every fd revent it gets. If I were to
patch it to introduce wakeone in poll I'd think more about it of
course. Perhaps I've been over-optimistic here.
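
For reference, the correct exclusive-wait read pattern Linus is
describing, where the reader passes the wakeup on whenever it returns
early, would look roughly like this (a minimal sketch, not the actual
userfaultfd code; the queue layout and helpers are illustrative):

	static ssize_t uffd_read_sketch(struct file *file, char __user *buf,
					size_t count)
	{
		DECLARE_WAITQUEUE(wait, current);
		ssize_t ret;

		/* exclusive: only one sleeper is woken per event */
		add_wait_queue_exclusive(&ctx->fd_wqh, &wait);
		for (;;) {
			set_current_state(TASK_INTERRUPTIBLE);
			ret = consume_one_event(ctx, buf, count); /* illustrative */
			if (ret != -EAGAIN)
				break;
			if (signal_pending(current)) {
				ret = -ERESTARTSYS;
				break;
			}
			schedule();
		}
		__set_current_state(TASK_RUNNING);
		remove_wait_queue(&ctx->fd_wqh, &wait);

		/*
		 * Returning early (signal) or leaving events behind means
		 * the wakeup we absorbed must be handed to the next
		 * exclusive waiter, or it stays blocked forever.
		 */
		if (ret == -ERESTARTSYS || events_pending(ctx)) /* illustrative */
			wake_up(&ctx->fd_wqh);
		return ret;
	}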

 What does qemu have to do with anything?
 
 We don't implement kernel interfaces that are broken, and that can
 leave processes blocked when they shouldn't be blocked. We also don't
 implement kernel interfaces that only work with one program and then
 say if that program is broken, it's not our problem.

I'm testing with the stresstest application, not with qemu; qemu
cannot take advantage of this anyway because it uses a single thread
so far, and it uses poll, not blocking reads. The stresstest suite
listens to events with one thread per CPU, and it interleaves poll
usage with blocking reads at every bounce, and it's working correctly
so far.
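
The listener loop in that stresstest is essentially (illustrative
userspace sketch, not the actual test code; use_poll is per-thread
state and the fd mode toggling via fcntl() is omitted):

	for (;;) {
		struct pollfd pfd = { .fd = uffd, .events = POLLIN };
		struct uffd_msg msg;

		if (use_poll && poll(&pfd, 1, -1) < 0)
			err(1, "poll");
		if (read(uffd, &msg, sizeof(msg)) == sizeof(msg))
			resolve_userfault(&msg);	/* illustrative */
		use_poll = !use_poll;	/* bounce between poll and blocking read */
	}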

  I'm not saying doing wakeone is easy [...]
 
 Bullshit, Andrea.
 
 That's *exactly* what you said in the commit message for the broken
 patch that I complained about. And I quote:

Please don't quote me out of context, and quote me in full if you
quote me:

"I'm not saying doing wakeone is easy and it's enough to flip a switch
everywhere to get it everywhere"

In the above paragraph (which you quoted in truncated form) I was
talking in general, not specifically about userfaultfd. I meant that,
in general, wakeone is not easy.

 patch that I complained about. And I quote:
 
   Blocking reads can easily use exclusive wakeups. Poll in theory
 could too but there's no poll_wait_exclusive in common code yet

With "I'm not saying doing wakeone is easy and it's enough to flip a
switch everywhere to get it everywhere" I intended exactly to clarify
that "Blocking reads can easily use exclusive wakeups" was in the
context of userfaultfd only.

With regard to the patch, you still haven't said exactly what runtime
breakage I should expect from my simple change. The case you
mentioned about a thread that gets killed is irrelevant because a
userfault would get missed anyway if a task listening to userfaultfd
gets killed after it received any uffd_msg structure. Wakeone or
wakeall won't move the needle for that case. There's no broadcast of
userfaults to all readers; even with wakeall, only the first reader
that wakes up gets the messages, the others return to sleep
immediately.


Re: [PATCH 2/2] arm64: KVM: Add VCPU support for Qualcomm Technologies Kryo ARMv8 CPU

2015-06-16 Thread Marc Zyngier
In the future, please add the KVM/ARM maintainers on Cc.

On 12/06/15 22:57, Timur Tabi wrote:
 From: Shanker Donthineni shank...@codeaurora.org
 
 This patch enables assignment of a 32/64-bit guest VCPU to a
 Qualcomm Technologies ARMv8 CPU. It adds KVM_ARM_TARGET_QCOM_KRYO
 to the KVM target list and modifies kvm_target_cpu() to return
 KVM_ARM_TARGET_QCOM_KRYO when the CPU is running in AArch64 mode.
 
 Signed-off-by: Shanker Donthineni shank...@codeaurora.org
 Signed-off-by: Timur Tabi ti...@codeaurora.org
 ---
  arch/arm64/include/uapi/asm/kvm.h| 3 ++-
  arch/arm64/kvm/guest.c   | 6 ++
  arch/arm64/kvm/sys_regs_generic_v8.c | 2 ++
  3 files changed, 10 insertions(+), 1 deletion(-)
 
 diff --git a/arch/arm64/include/uapi/asm/kvm.h 
 b/arch/arm64/include/uapi/asm/kvm.h
 index d268320..426933e 100644
 --- a/arch/arm64/include/uapi/asm/kvm.h
 +++ b/arch/arm64/include/uapi/asm/kvm.h
 @@ -59,8 +59,9 @@ struct kvm_regs {
  #define KVM_ARM_TARGET_CORTEX_A57	2
  #define KVM_ARM_TARGET_XGENE_POTENZA	3
  #define KVM_ARM_TARGET_CORTEX_A53	4
 +#define KVM_ARM_TARGET_QCOM_KRYO 5
  
 -#define KVM_ARM_NUM_TARGETS  5
 +#define KVM_ARM_NUM_TARGETS  6
  
  /* KVM_ARM_SET_DEVICE_ADDR ioctl id encoding */
  #define KVM_ARM_DEVICE_TYPE_SHIFT	0
 diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
 index 9535bd5..836cf16 100644
 --- a/arch/arm64/kvm/guest.c
 +++ b/arch/arm64/kvm/guest.c
 @@ -291,6 +291,12 @@ int __attribute_const__ kvm_target_cpu(void)
   return KVM_ARM_TARGET_XGENE_POTENZA;
   };
   break;
 + case ARM_CPU_IMP_QCOM:
 + 	switch (part_number & QCOM_CPU_PART_MASK) {
 + case QCOM_CPU_PART_KRYO:
 + return KVM_ARM_TARGET_QCOM_KRYO;
 + }
 + break;
   };
  
   return -EINVAL;
 diff --git a/arch/arm64/kvm/sys_regs_generic_v8.c 
 b/arch/arm64/kvm/sys_regs_generic_v8.c
 index 475fd29..3712ea8 100644
 --- a/arch/arm64/kvm/sys_regs_generic_v8.c
 +++ b/arch/arm64/kvm/sys_regs_generic_v8.c
 @@ -94,6 +94,8 @@ static int __init sys_reg_genericv8_init(void)
 genericv8_target_table);
   kvm_register_target_sys_reg_table(KVM_ARM_TARGET_XGENE_POTENZA,
 genericv8_target_table);
 + kvm_register_target_sys_reg_table(KVM_ARM_TARGET_QCOM_KRYO,
 +   genericv8_target_table);
  
   return 0;
  }
 

That's getting slightly out of hand and I think we need to stop being
paranoid.

All our target CPUs are using the generic tables, and this looks like
unnecessary complexity as long as we choose to ignore
cross-microarchitecture migration (things like big.LITTLE or live
migration from one CPU type to another).

Let's just implement a default target that works for everything. Should
we support some CPUs in a more specific manner, it will then be worth
adding to that list. I believe Suzuki (cc-ed) has something brewing.

Thanks,

M.
-- 
Jazz is not dead. It just smells funny...


Re: [RFC PATCH] perf/kvm: Guest Symbol Resolution for powerpc

2015-06-16 Thread David Ahern

On 6/15/15 8:50 PM, Hemant Kumar wrote:

+/*
+ * Get the instruction pointer from the tracepoint data
+ */
+u64 arch__get_ip(struct perf_evsel *evsel, struct perf_sample *data)
+{
+	u64 tp_ip = data->ip;
+	int trap;
+
+	if (!strcmp(KVMPPC_EXIT, evsel->name)) {
+		trap = raw_field_value(evsel->tp_format, "trap",
+				       data->raw_data);
+
+		if (trap == HV_DECREMENTER)
+			tp_ip = raw_field_value(evsel->tp_format, "pc",
+						data->raw_data);
+	}
+	return tp_ip;
+}


You can tie a handler to an event; see builtin-trace.c for an example
(evsel->handler = handler). Then have the sample handler call it (e.g.,
see trace__process_sample). That way you don't have to check event
names on each pass like this and can just do event-based processing,
as sketched below.
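
The suggested pattern, roughly (a sketch based on how builtin-trace.c
wires its handlers; the powerpc handler name is illustrative):

	/* once, at session setup: resolve the evsel and attach a handler */
	evsel = perf_evlist__find_tracepoint_by_name(session->evlist,
						     "kvm_hv:kvm_guest_exit");
	if (evsel)
		evsel->handler = ppc__kvm_exit_handler;	/* illustrative */

	/* per sample: dispatch through the handler, no strcmp() needed */
	if (evsel->handler) {
		tracepoint_handler handler = evsel->handler;
		handler(evsel, sample);
	}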



+
+/*
+ * Get the HV and PR bits and accordingly, determine the cpumode
+ */
+u8 arch__get_cpumode(union perf_event *event, struct perf_evsel *evsel,
+		     struct perf_sample *data)
+{
+	unsigned long hv, pr, msr;
+	u8 cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
+
+	if (strcmp(KVMPPC_EXIT, evsel->name))
+		goto ret;
+
+	if (data->raw_data)
+		msr = raw_field_value(evsel->tp_format, "msr", data->raw_data);
+	else
+		goto ret;
+
+	hv = msr & ((long unsigned)1 << (PPC_MAX - HV_BIT));
+	pr = msr & ((long unsigned)1 << (PPC_MAX - PR_BIT));
+
+	if (!hv && pr)
+		cpumode = PERF_RECORD_MISC_GUEST_USER;
+	else
+		cpumode = PERF_RECORD_MISC_GUEST_KERNEL;
+ret:
+	return cpumode;
+}


Why isn't that set properly kernel side when the sample is generated?

David


[PATCH 1/5] vhost: use binary search instead of linear in find_region()

2015-06-16 Thread Igor Mammedov
For default region layouts performance stays the same as linear
search, i.e. it takes around 210ns on average for translate_desc(),
which inlines find_region().

But it scales better with a larger number of regions:
235ns for binary search vs 300ns for linear search with 55 memory
regions, and the values should stay about the same when the allowed
number of slots is increased to 509 like it has been done in kvm.

Signed-off-by: Igor Mammedov imamm...@redhat.com
---
 drivers/vhost/vhost.c | 38 --
 1 file changed, 28 insertions(+), 10 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2ee2826..a22f8c3 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -25,6 +25,7 @@
 #include <linux/kthread.h>
 #include <linux/cgroup.h>
 #include <linux/module.h>
+#include <linux/sort.h>
 
 #include "vhost.h"
 
@@ -590,6 +591,16 @@ int vhost_vq_access_ok(struct vhost_virtqueue *vq)
 }
 EXPORT_SYMBOL_GPL(vhost_vq_access_ok);
 
+static int vhost_memory_reg_sort_cmp(const void *p1, const void *p2)
+{
+	const struct vhost_memory_region *r1 = p1, *r2 = p2;
+	if (r1->guest_phys_addr < r2->guest_phys_addr)
+		return 1;
+	if (r1->guest_phys_addr > r2->guest_phys_addr)
+		return -1;
+	return 0;
+}
+
 static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 {
 	struct vhost_memory mem, *newmem, *oldmem;
@@ -609,9 +620,11 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 	memcpy(newmem, mem, size);
 	if (copy_from_user(newmem->regions, m->regions,
 			   mem.nregions * sizeof *m->regions)) {
-		kfree(newmem);
+		kvfree(newmem);
 		return -EFAULT;
 	}
+	sort(newmem->regions, newmem->nregions, sizeof(*newmem->regions),
+	     vhost_memory_reg_sort_cmp, NULL);
 
if (!memory_access_ok(d, newmem, 0)) {
kfree(newmem);
@@ -913,17 +926,22 @@ EXPORT_SYMBOL_GPL(vhost_dev_ioctl);
 static const struct vhost_memory_region *find_region(struct vhost_memory *mem,
 						     __u64 addr, __u32 len)
 {
-	struct vhost_memory_region *reg;
-	int i;
+	const struct vhost_memory_region *reg;
+	int start = 0, end = mem->nregions;
 
-	/* linear search is not brilliant, but we really have on the order of 6
-	 * regions in practice */
-	for (i = 0; i < mem->nregions; ++i) {
-		reg = mem->regions + i;
-		if (reg->guest_phys_addr <= addr &&
-		    reg->guest_phys_addr + reg->memory_size - 1 >= addr)
-			return reg;
+	while (start < end) {
+		int slot = start + (end - start) / 2;
+		reg = mem->regions + slot;
+		if (addr >= reg->guest_phys_addr)
+			end = slot;
+		else
+			start = slot + 1;
 	}
+
+	reg = mem->regions + start;
+	if (addr >= reg->guest_phys_addr &&
+	    reg->guest_phys_addr + reg->memory_size > addr)
+		return reg;
 	return NULL;
 }
 
-- 
1.8.3.1
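
For what it's worth, the comparator above sorts the regions in
descending order of guest_phys_addr, and the loop then converges on
the first slot whose start address is <= addr. The same invariant as
a standalone sketch (illustrative, outside the kernel):

	struct region { unsigned long long start, size; };

	/* regions[] must be sorted by start in descending order */
	static const struct region *find(const struct region *regions,
					 int n, unsigned long long addr)
	{
		int start = 0, end = n;

		while (start < end) {
			int slot = start + (end - start) / 2;

			if (addr >= regions[slot].start)
				end = slot;	/* candidate; keep looking left */
			else
				start = slot + 1;
		}
		if (start < n && addr >= regions[start].start &&
		    regions[start].start + regions[start].size > addr)
			return &regions[start];
		return NULL;
	}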



Re: [PATCH 0/6] x86: reduce paravirtualized spinlock overhead

2015-06-16 Thread Juergen Gross

AFAIK there are no outstanding questions for more than one month now.
I'd appreciate some feedback or accepting these patches.


Juergen

On 04/30/2015 12:53 PM, Juergen Gross wrote:

Paravirtualized spinlocks produce some overhead even if the kernel is
running on bare metal. The main reason is the more complex locking
and unlocking functions. Especially unlocking is no longer just one
instruction, but so complex that it is no longer inlined.

This patch series addresses this issue by adding two more pvops
functions to reduce the size of the inlined spinlock functions. When
running on bare metal unlocking is again basically one instruction.

Compile tested with CONFIG_PARAVIRT_SPINLOCKS on and off, 32 and 64
bits.

Functional testing on bare metal and as Xen dom0.

Correct patching verified by disassembly of active kernel.

Juergen Gross (6):
   x86: use macro instead of 0 for setting TICKET_SLOWPATH_FLAG
   x86: move decision about clearing slowpath flag into arch_spin_lock()
   x86: introduce new pvops function clear_slowpath
   x86: introduce new pvops function spin_unlock
   x86: switch config from UNINLINE_SPIN_UNLOCK to INLINE_SPIN_UNLOCK
   x86: remove no longer needed paravirt_ticketlocks_enabled

  arch/x86/Kconfig  |  1 -
  arch/x86/include/asm/paravirt.h   | 13 +
  arch/x86/include/asm/paravirt_types.h | 12 
  arch/x86/include/asm/spinlock.h   | 53 ---
  arch/x86/include/asm/spinlock_types.h |  3 +-
  arch/x86/kernel/kvm.c | 14 +
  arch/x86/kernel/paravirt-spinlocks.c  | 42 +--
  arch/x86/kernel/paravirt.c| 12 
  arch/x86/kernel/paravirt_patch_32.c   | 25 +
  arch/x86/kernel/paravirt_patch_64.c   | 24 
  arch/x86/xen/spinlock.c   | 23 +--
  include/linux/spinlock_api_smp.h  |  2 +-
  kernel/Kconfig.locks  |  7 +++--
  kernel/Kconfig.preempt|  3 +-
  kernel/locking/spinlock.c |  2 +-
  lib/Kconfig.debug |  1 -
  16 files changed, 154 insertions(+), 83 deletions(-)





[PATCH 5/5] vhost: translate_desc: optimization for desc.len < region size

2015-06-16 Thread Igor Mammedov
when translating descriptors they are typically less than
memory region that holds them and translated into 1 iov
enty, so it's not nessesary to check remaining length
twice and calculate used length and next address
in such cases.

so relace a remaining length and 'size' increment branches
with a single remaining length check and execute
next iov steps only when it needed.

It saves tiny 2% of translate_desc() execution time.

Signed-off-by: Igor Mammedov imamm...@redhat.com
---
PS:
I'm not sure if iov_size > 0 is always true; if it's not,
then it's better to drop this patch.
---
 drivers/vhost/vhost.c | 21 +
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 68c1c88..84c457d 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -,12 +,8 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
 	int ret = 0;
 
 	mem = vq->memory;
-	while ((u64)len > s) {
+	do {
 		u64 size;
-		if (unlikely(ret >= iov_size)) {
-			ret = -ENOBUFS;
-			break;
-		}
 		reg = find_region(mem, addr, len, &vq->cached_reg);
 		if (unlikely(!reg)) {
 			ret = -EFAULT;
 			break;
 		}
@@ -1124,13 +1120,22 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
 		}
 		_iov = iov + ret;
 		size = reg->memory_size - addr + reg->guest_phys_addr;
-		_iov->iov_len = min((u64)len - s, size);
 		_iov->iov_base = (void __user *)(unsigned long)
 			(reg->userspace_addr + addr - reg->guest_phys_addr);
+		++ret;
+		if (likely((u64)len - s <= size)) {
+			_iov->iov_len = (u64)len - s;
+			break;
+		}
+
+		if (unlikely(ret >= iov_size)) {
+			ret = -ENOBUFS;
+			break;
+		}
+		_iov->iov_len = size;
 		s += size;
 		addr += size;
-		++ret;
-	}
+	} while (1);
 
 	return ret;
 }
-- 
1.8.3.1



[PATCH 0/5] vhost: support upto 509 memory regions

2015-06-16 Thread Igor Mammedov
Series extends vhost to support up to 509 memory regions,
and adds some vhost:translate_desc() performance improvements
so it won't regress when memslots are increased to 509.

It fixes a running VM crashing during memory hotplug due
to vhost refusing to accept more than 64 memory regions.

It's only a host kernel side fix to make it work with QEMU
versions that support memory hotplug. But I'll continue
to work on a QEMU side solution to reduce the amount of
memory regions to make things even better.

Performance wise, for a guest with (in my case) 3 memory regions
and netperf's UDP_RR workload, translate_desc() execution
time as a share of the total workload is:

Memory      |1G RAM|cached|non cached
regions #   |  3   |  53  |  53
------------+------+------+----------
upstream    | 0.3% |  -   | 3.5%
this series | 0.2% | 0.5% | 0.7%

where the "non cached" column reflects a thrashing workload
with constant cache misses. More details on timing are in the
respective patches.

Igor Mammedov (5):
  vhost: use binary search instead of linear in find_region()
  vhost: extend memory regions allocation to vmalloc
  vhost: support upto 509 memory regions
  vhost: add per VQ memory region caching
  vhost: translate_desc: optimization for desc.len < region size

 drivers/vhost/vhost.c | 95 +--
 drivers/vhost/vhost.h |  1 +
 2 files changed, 71 insertions(+), 25 deletions(-)

-- 
1.8.3.1



[PATCH 3/5] vhost: support upto 509 memory regions

2015-06-16 Thread Igor Mammedov
since commit
 1d4e7e3 kvm: x86: increase user memory slots to 509

it became possible to use a bigger amount of memory
slots, which is used by memory hotplug for
registering hotplugged memory.
However QEMU crashes if it's used with more than ~60
pc-dimm devices and vhost-net since host kernel
in module vhost-net refuses to accept more than 65
memory regions.

Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509
to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.

Signed-off-by: Igor Mammedov imamm...@redhat.com
---
 drivers/vhost/vhost.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 99931a0..6a18c92 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -30,7 +30,7 @@
 #include "vhost.h"
 
 enum {
-   VHOST_MEMORY_MAX_NREGIONS = 64,
+   VHOST_MEMORY_MAX_NREGIONS = 509,
VHOST_MEMORY_F_LOG = 0x1,
 };
 
-- 
1.8.3.1



[PATCH 2/5] vhost: extend memory regions allocation to vmalloc

2015-06-16 Thread Igor Mammedov
With a large number of memory regions we could end up with
high-order allocations, and kmalloc could fail if the
host is under memory pressure.
Considering that the memory regions array is used on the hot path,
try harder to allocate using kmalloc, and if it fails resort
to vmalloc.
It's still better than just failing vhost_set_memory() and
causing a guest crash when new memory is hotplugged
to the guest.

I'll still look at a QEMU side solution to reduce the amount of
memory regions it feeds to vhost to make things even better,
but it doesn't hurt for the kernel to behave smarter and not
crash older QEMUs which could use a large amount of memory
regions.

Signed-off-by: Igor Mammedov imamm...@redhat.com
---
---
 drivers/vhost/vhost.c | 20 
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a22f8c3..99931a0 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -471,7 +471,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 	fput(dev->log_file);
 	dev->log_file = NULL;
 	/* No one will access memory at this point */
-	kfree(dev->memory);
+	kvfree(dev->memory);
 	dev->memory = NULL;
 	WARN_ON(!list_empty(&dev->work_list));
 	if (dev->worker) {
@@ -601,6 +601,18 @@ static int vhost_memory_reg_sort_cmp(const void *p1, const void *p2)
 	return 0;
 }
 
+static void *vhost_kvzalloc(unsigned long size)
+{
+	void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+
+	if (!n) {
+		n = vzalloc(size);
+		if (!n)
+			return ERR_PTR(-ENOMEM);
+	}
+	return n;
+}
+
 static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 {
 	struct vhost_memory mem, *newmem, *oldmem;
@@ -613,7 +625,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 		return -EOPNOTSUPP;
 	if (mem.nregions > VHOST_MEMORY_MAX_NREGIONS)
 		return -E2BIG;
-	newmem = kmalloc(size + mem.nregions * sizeof *m->regions, GFP_KERNEL);
+	newmem = vhost_kvzalloc(size + mem.nregions * sizeof(*m->regions));
 	if (!newmem)
 		return -ENOMEM;
 
@@ -627,7 +639,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 	     vhost_memory_reg_sort_cmp, NULL);
 
 	if (!memory_access_ok(d, newmem, 0)) {
-		kfree(newmem);
+		kvfree(newmem);
 		return -EFAULT;
 	}
 	oldmem = d->memory;
@@ -639,7 +651,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 		d->vqs[i]->memory = newmem;
 		mutex_unlock(&d->vqs[i]->mutex);
 	}
-	kfree(oldmem);
+	kvfree(oldmem);
 	return 0;
 }
 
-- 
1.8.3.1
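
One point worth noting about the pattern above: once the buffer can
come from either allocator, every free path has to go through
kvfree(), which is why the kfree() calls in vhost_dev_cleanup() and
vhost_set_memory() are switched over. In caller terms (a sketch of
the pairing, not new code):

	newmem = vhost_kvzalloc(size);	/* kmalloc first, vmalloc fallback */
	if (!newmem)
		return -ENOMEM;
	/* ... use newmem ... */
	kvfree(newmem);	/* correct for both kmalloc'd and vmalloc'd memory */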



Re: [PATCH V5 0/4] Consolidated KVM vPMU support for x86

2015-06-16 Thread Joerg Roedel
On Fri, Jun 12, 2015 at 01:34:52AM -0400, Wei Huang wrote:
 This patchset is directly applicable on kvm.git/queue.
 
 V5:
   * Remove the get_pmu_ops from sub_arch; instead define the pmu dispatcher
     in kvm_x86_ops->pmu_ops. The dispatcher is initialized in the sub-arch.
     The PMU interface functions are changed accordingly (suggested by
     Joerg Roedel).

Tested it again, works like a charm on my AMD system :) So for all patches
again:

Tested-by: Joerg Roedel jroe...@suse.de


Thanks,

Joerg



Re: [PATCH V5 2/4] KVM: x86/vPMU: Create vPMU interface for VMX and SVM

2015-06-16 Thread Joerg Roedel
On Fri, Jun 12, 2015 at 01:34:54AM -0400, Wei Huang wrote:
 This patch splits the existing vPMU code into a common vPMU interface (pmc.c)
 and Intel-specific vPMU code (pmu_intel.c) using the following steps:
 
 - Part of the architectural vPMU code is extracted and moved to the
   pmu_intel.c file. It is hooked up with the newly-defined intel_pmu_ops,
   which will be called from the pmu interface;
 - Create a dummy pmu_amd.c file for AMD SVM with empty functions;
 
 All architectural vPMU functions are now called via the PMU function
 dispatcher. This function dispatcher is defined in kvm_x86_ops->pmu_ops,
 which is initialized by the sub-arch. Also note that the Intel and AMD
 modules are now generated by combining their corresponding arch files
 (vmx.c/svm.c) and pmu files (pmu_intel.c/pmu_amd.c).
 
 Tested-by: Radim Krčmář rkrc...@redhat.com
 Signed-off-by: Wei Huang w...@redhat.com

Reviewed-by: Joerg Roedel jroe...@suse.de



[PATCH 4/5] vhost: add per VQ memory region caching

2015-06-16 Thread Igor Mammedov
This brings translate_desc() cost down to around 210ns
if the accessed descriptors are from the same memory region.

Signed-off-by: Igor Mammedov imamm...@redhat.com
---
that's the access pattern the netperf/iperf workloads showed during testing.
---
 drivers/vhost/vhost.c | 16 +---
 drivers/vhost/vhost.h |  1 +
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 6a18c92..68c1c88 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -200,6 +200,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	vq->call = NULL;
 	vq->log_ctx = NULL;
 	vq->memory = NULL;
+	vq->cached_reg = 0;
 }
 
 static int vhost_worker(void *data)
@@ -649,6 +650,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 	for (i = 0; i < d->nvqs; ++i) {
 		mutex_lock(&d->vqs[i]->mutex);
 		d->vqs[i]->memory = newmem;
+		d->vqs[i]->cached_reg = 0;
 		mutex_unlock(&d->vqs[i]->mutex);
 	}
 	kvfree(oldmem);
@@ -936,11 +938,17 @@ done:
 EXPORT_SYMBOL_GPL(vhost_dev_ioctl);
 
 static const struct vhost_memory_region *find_region(struct vhost_memory *mem,
-						     __u64 addr, __u32 len)
+						     __u64 addr, __u32 len,
+						     int *cached_reg)
 {
 	const struct vhost_memory_region *reg;
 	int start = 0, end = mem->nregions;
 
+	reg = mem->regions + *cached_reg;
+	if (likely(addr >= reg->guest_phys_addr &&
+		   reg->guest_phys_addr + reg->memory_size > addr))
+		return reg;
+
 	while (start < end) {
 		int slot = start + (end - start) / 2;
 		reg = mem->regions + slot;
@@ -952,8 +960,10 @@ static const struct vhost_memory_region *find_region(struct vhost_memory *mem,
 
 	reg = mem->regions + start;
 	if (addr >= reg->guest_phys_addr &&
-	    reg->guest_phys_addr + reg->memory_size > addr)
+	    reg->guest_phys_addr + reg->memory_size > addr) {
+		*cached_reg = start;
 		return reg;
+	}
 	return NULL;
 }
 
@@ -1107,7 +1117,7 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
 			ret = -ENOBUFS;
 			break;
 		}
-		reg = find_region(mem, addr, len);
+		reg = find_region(mem, addr, len, &vq->cached_reg);
 		if (unlikely(!reg)) {
 			ret = -EFAULT;
 			break;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 8c1c792..68bd00f 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -106,6 +106,7 @@ struct vhost_virtqueue {
/* Log write descriptors */
void __user *log_base;
struct vhost_log *log;
+   int cached_reg;
 };
 
 struct vhost_dev {
-- 
1.8.3.1



Re: [RFC PATCH] perf/kvm: Guest Symbol Resolution for powerpc

2015-06-16 Thread Arnaldo Carvalho de Melo
Em Tue, Jun 16, 2015 at 08:20:53AM +0530, Hemant Kumar escreveu:
 perf kvm {record|report} is used to record and report the performance
 profile of any workload on a guest. From the host, we can collect
 guest kernel statistics which is useful in finding out any contentions
 in guest kernel symbols for a certain workload.
 
 This feature is not available on powerpc because perf relies on the
 cycles event (a PMU event) to profile the guest. However, for powerpc,
 this can't be used from the host because the PMUs are controlled by the
 guest rather than the host.
 
 Due to this problem, we need a different approach to profile the
 workload in the guest. There exists a tracepoint kvm_hv:kvm_guest_exit
 in powerpc which is hit whenever any of the threads exit the guest
 context. The guest instruction pointer dumped along with this
 tracepoint data in the field "pc" can be used as the guest instruction
 pointer while postprocessing the trace data to map this IP to a symbol
 from guest.kallsyms.
 
 However, to have some kind of periodicity, we can't use all the kvm
 exits, rather exits which are bound to happen in certain intervals.
 HV_DECREMENTER Interrupt forces the threads to exit after an interval
 of 10 ms.
 
 This patch makes use of the kvm_guest_exit tracepoint and checks the
 exit reason for any kvm exit. If it is HV_DECREMENTER, then the
 instruction pointer dumped along with this tracepoint is retrieved and
 mapped with the guest kallsyms.
 
 This patch is a prototype asking for suggestions/comments as to whether
 the approach is right or is there any way better than this (like using
 a different event to profile for, etc) to profile the guest from the
 host.
 
 Thank You.
 
 Signed-off-by: Hemant Kumar hem...@linux.vnet.ibm.com
 ---
  tools/perf/arch/powerpc/Makefile|  1 +
  tools/perf/arch/powerpc/util/parse-tp.c | 55 +
  tools/perf/builtin-report.c |  9 ++
  tools/perf/util/event.c |  7 -
  tools/perf/util/evsel.c |  7 +
  tools/perf/util/evsel.h |  4 +++
  tools/perf/util/session.c   |  7 +++--
  7 files changed, 86 insertions(+), 4 deletions(-)
  create mode 100644 tools/perf/arch/powerpc/util/parse-tp.c
 
 diff --git a/tools/perf/arch/powerpc/Makefile 
 b/tools/perf/arch/powerpc/Makefile
 index 6f7782b..992a0d5 100644
 --- a/tools/perf/arch/powerpc/Makefile
 +++ b/tools/perf/arch/powerpc/Makefile
 @@ -4,3 +4,4 @@ LIB_OBJS += $(OUTPUT)arch/$(ARCH)/util/dwarf-regs.o
  LIB_OBJS += $(OUTPUT)arch/$(ARCH)/util/skip-callchain-idx.o
  endif
  LIB_OBJS += $(OUTPUT)arch/$(ARCH)/util/header.o
 +LIB_OBJS += $(OUTPUT)arch/$(ARCH)/util/parse-tp.o
 diff --git a/tools/perf/arch/powerpc/util/parse-tp.c 
 b/tools/perf/arch/powerpc/util/parse-tp.c
 new file mode 100644
 index 000..4c6e49c
 --- /dev/null
 +++ b/tools/perf/arch/powerpc/util/parse-tp.c
 @@ -0,0 +1,55 @@
 +#include "../../util/evsel.h"
 +#include "../../util/trace-event.h"
 +#include "../../util/session.h"
 +
 +#define KVMPPC_EXIT "kvm_hv:kvm_guest_exit"
 +#define HV_DECREMENTER 2432
 +#define HV_BIT 3
 +#define PR_BIT 49
 +#define PPC_MAX 63
 +
 +/*
 + * Get the instruction pointer from the tracepoint data
 + */
 +u64 arch__get_ip(struct perf_evsel *evsel, struct perf_sample *data)
 +{
 +	u64 tp_ip = data->ip;
 +	int trap;
 +
 +	if (!strcmp(KVMPPC_EXIT, evsel->name)) {

Can't you cache this somewhere? I.e. something like

static int kvmppc_exit = -1;

	if (evsel->attr.type != PERF_TYPE_TRACEPOINT)
		goto out;

	if (unlikely(kvmppc_exit == -1)) {
		if (strcmp(KVMPPC_EXIT, evsel->name))
			goto out;

		kvmppc_exit = evsel->attr.config;
	} else if (kvmppc_exit != evsel->attr.config)
		goto out;


 +	trap = raw_field_value(evsel->tp_format, "trap", data->raw_data);
 +
 +	if (trap == HV_DECREMENTER)
 +		tp_ip = raw_field_value(evsel->tp_format, "pc",
 +					data->raw_data);

out:

 + return tp_ip;
 +}


Also we have:

u64 perf_evsel__intval(struct perf_evsel *evsel,
		       struct perf_sample *sample, const char *name);

So:

	trap = perf_evsel__intval(evsel, sample, "trap");

And:

	tp_ip = perf_evsel__intval(evsel, sample, "pc");

Makes it a bit shorter and allows for optimizations in how to find that
field by name made at the evsel code.
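
Folding both suggestions together, the helper could end up as compact
as this (a sketch combining the cached-config check with
perf_evsel__intval(); not tested):

	u64 arch__get_ip(struct perf_evsel *evsel, struct perf_sample *sample)
	{
		static int kvmppc_exit = -1;	/* cached attr.config */

		if (evsel->attr.type != PERF_TYPE_TRACEPOINT)
			return sample->ip;

		if (kvmppc_exit == -1) {
			if (strcmp(KVMPPC_EXIT, evsel->name))
				return sample->ip;
			kvmppc_exit = evsel->attr.config;
		} else if (kvmppc_exit != (int)evsel->attr.config)
			return sample->ip;

		if (perf_evsel__intval(evsel, sample, "trap") == HV_DECREMENTER)
			return perf_evsel__intval(evsel, sample, "pc");

		return sample->ip;
	}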

- Arnaldo

 +
 +/*
 + * Get the HV and PR bits and accordingly, determine the cpumode
 + */
 +u8 arch__get_cpumode(union perf_event *event, struct perf_evsel *evsel,
 +		      struct perf_sample *data)
 +{
 +	unsigned long hv, pr, msr;
 +	u8 cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
 +
 +	if (strcmp(KVMPPC_EXIT, evsel->name))
 +		goto ret;
 +
 +	if (data->raw_data)
 +		msr = raw_field_value(evsel->tp_format, "msr", data->raw_data);
 +	else
 +   

Re: [PATCH 0/6] x86: reduce paravirtualized spinlock overhead

2015-06-16 Thread Thomas Gleixner
On Tue, 16 Jun 2015, Juergen Gross wrote:

 AFAIK there are no outstanding questions for more than one month now.
 I'd appreciate some feedback or accepting these patches.

They are against dead code, which will be gone soon. We switched over
to queued locks.

Thanks,

tglx



Re: [PATCH 4/5] kvmtool: Save datamatch as little endian in {add,del}_event

2015-06-16 Thread Will Deacon
On Mon, Jun 15, 2015 at 12:49:45PM +0100, Andreas Herrmann wrote:
 Without a dedicated endianness it's impossible to reliably find a match,
 e.g. in kernel/virt/kvm/eventfd.c ioeventfd_in_range.

Hmm, but shouldn't this be the endianness of the guest, rather than just
forcing things to little-endian?

Will


Re: [PATCH v2 01/11] KVM: arm: plug guest debug exploit

2015-06-16 Thread Will Deacon
On Sun, Jun 14, 2015 at 05:13:05PM +0100, zichao wrote:
 Marc and I are talking about how to plug the guest debug exploit in an
 easier way.
 
 I remember you mentioned that disabling monitor mode had proven to be
 extremely fragile in practice on 32-bit ARM SoCs. What if I save/restore
 the debug monitor mode on each switch between the guest and the host;
 would that be acceptable?

If you're just referring to DBGDSCRext, then you could give it a go, but
you'll certainly want to predicate any writes to that register on whether
or not hw_breakpoint managed to reset the debug regs on the host.

Like I said, accessing these registers always worries me, so I'd really
avoid it in KVM if you can. If not, you'll need to do extensive testing
on a bunch of platforms with and without the presence of external debug.

Will


Re: [PATCH 0/5] kvmtool: Misc fixes

2015-06-16 Thread Will Deacon
On Mon, Jun 15, 2015 at 12:49:41PM +0100, Andreas Herrmann wrote:
 Following some patches to fix misc issues found when testing the
 standalone kvmtool version.
 
 Please apply.

All applied, apart from the ioeventfd patch which I'm not sure about.

Will


Re: [PATCH] kvmtool: don't use PCI config space IRQ line field

2015-06-16 Thread Will Deacon
On Mon, Jun 15, 2015 at 11:45:38AM +0100, Andre Przywara wrote:
 On 06/05/2015 05:41 PM, Will Deacon wrote:
  On Thu, Jun 04, 2015 at 04:20:45PM +0100, Andre Przywara wrote:
  In PCI config space there is an interrupt line field (offset 0x3f),
  which is used to initially communicate the IRQ line number from
  firmware to the OS. _Hardware_ should never use this information,
  as the OS is free to write any information in there.
  But kvmtool uses this number when it triggers IRQs in the guest,
  which fails starting with Linux 3.19-rc1, where the PCI layer starts
  writing the virtual IRQ number in there.
 
  Fix that by storing the IRQ number in a separate field in
  struct virtio_pci, which is independent from the PCI config space
  and cannot be influenced by the guest.
  This fixes ARM/ARM64 guests using PCI with newer kernels.
 
  Signed-off-by: Andre Przywara andre.przyw...@arm.com
  ---
   include/kvm/virtio-pci.h | 8 
   virtio/pci.c | 9 ++---
   2 files changed, 14 insertions(+), 3 deletions(-)
 
  diff --git a/include/kvm/virtio-pci.h b/include/kvm/virtio-pci.h
  index c795ce7..b70cadd 100644
  --- a/include/kvm/virtio-pci.h
  +++ b/include/kvm/virtio-pci.h
  @@ -30,6 +30,14 @@ struct virtio_pci {
 u8  isr;
 u32 features;
   
  +  /*
  +   * We cannot rely on the INTERRUPT_LINE byte in the config space once
  +   * we have run guest code, as the OS is allowed to use that field
  +   * as a scratch pad to communicate between driver and PCI layer.
  +   * So store our legacy interrupt line number in here for internal use.
  +   */
  +  u8  legacy_irq_line;
  +
 /* MSI-X */
 u16 config_vector;
 u32 config_gsi;
  diff --git a/virtio/pci.c b/virtio/pci.c
  index 7556239..e17e5a9 100644
  --- a/virtio/pci.c
  +++ b/virtio/pci.c
   @@ -141,7 +141,7 @@ static bool virtio_pci__io_in(struct ioport *ioport, struct kvm_cpu *vcpu, u16 p
   		break;
   	case VIRTIO_PCI_ISR:
   		ioport__write8(data, vpci->isr);
   -		kvm__irq_line(kvm, vpci->pci_hdr.irq_line, VIRTIO_IRQ_LOW);
   +		kvm__irq_line(kvm, vpci->legacy_irq_line, VIRTIO_IRQ_LOW);
   		vpci->isr = VIRTIO_IRQ_LOW;
   		break;
   	default:
   @@ -299,7 +299,7 @@ int virtio_pci__signal_vq(struct kvm *kvm, struct virtio_device *vdev, u32 vq)
   		kvm__irq_trigger(kvm, vpci->gsis[vq]);
   	} else {
   		vpci->isr = VIRTIO_IRQ_HIGH;
   -		kvm__irq_trigger(kvm, vpci->pci_hdr.irq_line);
   +		kvm__irq_trigger(kvm, vpci->legacy_irq_line);
   	}
   	return 0;
    }
   @@ -323,7 +323,7 @@ int virtio_pci__signal_config(struct kvm *kvm, struct virtio_device *vdev)
   		kvm__irq_trigger(kvm, vpci->config_gsi);
   	} else {
   		vpci->isr = VIRTIO_PCI_ISR_CONFIG;
   -		kvm__irq_trigger(kvm, vpci->pci_hdr.irq_line);
   +		kvm__irq_trigger(kvm, vpci->legacy_irq_line);
   	}
    
   	return 0;
   @@ -422,6 +422,9 @@ int virtio_pci__init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
   	if (r < 0)
   		goto free_msix_mmio;
    
   +	/* save the IRQ that device__register() has allocated */
   +	vpci->legacy_irq_line = vpci->pci_hdr.irq_line;
  
  I'd rather we used the container_of trick that we do for virtio-mmio
  devices when assigning the irq in device__register. Then we can avoid
  this line completely.
 
 Not completely sure I get what you mean, I take it you want to assign
 legacy_irq_line in pci__assign_irq() directly (where the IRQ number is
 allocated).
 But this function is PCI generic code and is used by the VESA
 framebuffer and the shmem device on x86 as well. For those devices
 dev_hdr is not part of a struct virtio_pci, so we can't do container_of
 to assign the legacy_irq_line here directly.
 Admittedly this fix should apply to the other two users as well, but
 VESA does not use interrupts and pci-shmem is completely broken anyway,
 so I didn't bother to fix it in this regard.
 Would it be justified to provide an IRQ number field in struct
 device_header to address all users?
 
 Or what am I missing here?

If VESA and shmem are broken, they should either be fixed or removed.

If you fix them, then we could have separate virtual buses for virtio-pci
and emulated-pci (or whatever you want to call it). We could also have
a separate bus for passthrough-devices too.

However, that's quite a lot of work for a bug-fix, so I guess the easiest
thing is to extend your current hack to cover VESA and shmem too.
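
For reference, the container_of trick in question, as the virtio-mmio
side does it (sketch; the function and helper names follow kvmtool
conventions but the exact wiring is illustrative):

	static u8 virtio_mmio__assign_irq(struct device_header *dev_hdr)
	{
		/* dev_hdr is embedded in struct virtio_mmio, so the bus
		 * code can recover the device and store the IRQ directly */
		struct virtio_mmio *vmmio = container_of(dev_hdr,
							 struct virtio_mmio,
							 dev_hdr);

		vmmio->irq = irq__alloc_line();
		return vmmio->irq;
	}

As Andre notes, the PCI equivalent doesn't work as long as
pci__assign_irq() is shared by devices that embed the header in
different containing types.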

Will


Re: [PATCH 1/5] vhost: use binary search instead of linear in find_region()

2015-06-16 Thread Igor Mammedov
On Tue, 16 Jun 2015 23:07:24 +0200
Michael S. Tsirkin m...@redhat.com wrote:

 On Tue, Jun 16, 2015 at 06:33:35PM +0200, Igor Mammedov wrote:
  For default region layouts performance stays the same
  as linear search i.e. it takes around 210ns average for
  translate_desc() that inlines find_region().
  
  But it scales better with larger amount of regions,
  235ns BS vs 300ns LS with 55 memory regions
  and it will be about the same values when allowed number
  of slots is increased to 509 like it has been done in kvm.
  
  Signed-off-by: Igor Mammedov imamm...@redhat.com
  ---
   drivers/vhost/vhost.c | 38 --
   1 file changed, 28 insertions(+), 10 deletions(-)
  
  diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
  index 2ee2826..a22f8c3 100644
  --- a/drivers/vhost/vhost.c
  +++ b/drivers/vhost/vhost.c
  @@ -25,6 +25,7 @@
   #include <linux/kthread.h>
   #include <linux/cgroup.h>
   #include <linux/module.h>
  +#include <linux/sort.h>
   
   #include "vhost.h"
   
  @@ -590,6 +591,16 @@ int vhost_vq_access_ok(struct vhost_virtqueue *vq)
   }
   EXPORT_SYMBOL_GPL(vhost_vq_access_ok);
   
  +static int vhost_memory_reg_sort_cmp(const void *p1, const void *p2)
  +{
  +	const struct vhost_memory_region *r1 = p1, *r2 = p2;
  +	if (r1->guest_phys_addr < r2->guest_phys_addr)
  +		return 1;
  +	if (r1->guest_phys_addr > r2->guest_phys_addr)
  +		return -1;
  +	return 0;
  +}
  +
   static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
   {
   	struct vhost_memory mem, *newmem, *oldmem;
  @@ -609,9 +620,11 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
   	memcpy(newmem, mem, size);
   	if (copy_from_user(newmem->regions, m->regions,
   			   mem.nregions * sizeof *m->regions)) {
  -		kfree(newmem);
  +		kvfree(newmem);
   		return -EFAULT;
   	}
 
 What's this doing here?
Oops, it sneaked in from 2/5 when I was splitting patches.
I'll fix it up.

 
  +	sort(newmem->regions, newmem->nregions, sizeof(*newmem->regions),
  +	     vhost_memory_reg_sort_cmp, NULL);
   
   	if (!memory_access_ok(d, newmem, 0)) {
   		kfree(newmem);
  @@ -913,17 +926,22 @@ EXPORT_SYMBOL_GPL(vhost_dev_ioctl);
   static const struct vhost_memory_region *find_region(struct vhost_memory *mem,
   						     __u64 addr, __u32 len)
   {
  -	struct vhost_memory_region *reg;
  -	int i;
  +	const struct vhost_memory_region *reg;
  +	int start = 0, end = mem->nregions;
   
  -	/* linear search is not brilliant, but we really have on the order of 6
  -	 * regions in practice */
  -	for (i = 0; i < mem->nregions; ++i) {
  -		reg = mem->regions + i;
  -		if (reg->guest_phys_addr <= addr &&
  -		    reg->guest_phys_addr + reg->memory_size - 1 >= addr)
  -			return reg;
  +	while (start < end) {
  +		int slot = start + (end - start) / 2;
  +		reg = mem->regions + slot;
  +		if (addr >= reg->guest_phys_addr)
  +			end = slot;
  +		else
  +			start = slot + 1;
   	}
  +
  +	reg = mem->regions + start;
  +	if (addr >= reg->guest_phys_addr &&
  +	    reg->guest_phys_addr + reg->memory_size > addr)
  +		return reg;
   	return NULL;
   }
   
  -- 
  1.8.3.1



Re: [PATCH 3/5] vhost: support upto 509 memory regions

2015-06-16 Thread Michael S. Tsirkin
On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote:
 since commit
  1d4e7e3 kvm: x86: increase user memory slots to 509
 
 it became possible to use a bigger amount of memory
 slots, which is used by memory hotplug for
 registering hotplugged memory.
 However QEMU crashes if it's used with more than ~60
 pc-dimm devices and vhost-net since host kernel
 in module vhost-net refuses to accept more than 65
 memory regions.
 
 Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509

It was 64, not 65.

 to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.
 
 Signed-off-by: Igor Mammedov imamm...@redhat.com

Still thinking about this: can you reorder this to
be the last patch in the series please?

Also - 509?
I think if we are changing this, it'd be nice to
create a way for userspace to discover the support
and the # of regions supported.
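
(Purely as an illustration of the discovery idea: such an interface
might be a new device ioctl; no such ioctl exists at this point and
the name is hypothetical:

	case VHOST_GET_MEM_MAX_NREGIONS: {	/* hypothetical */
		int limit = VHOST_MEMORY_MAX_NREGIONS;

		if (copy_to_user(argp, &limit, sizeof(limit)))
			r = -EFAULT;
		break;
	}

so userspace could size its region list before calling
VHOST_SET_MEM_TABLE.)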


 ---
  drivers/vhost/vhost.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
 index 99931a0..6a18c92 100644
 --- a/drivers/vhost/vhost.c
 +++ b/drivers/vhost/vhost.c
 @@ -30,7 +30,7 @@
  #include vhost.h
  
  enum {
 - VHOST_MEMORY_MAX_NREGIONS = 64,
 + VHOST_MEMORY_MAX_NREGIONS = 509,
   VHOST_MEMORY_F_LOG = 0x1,
  };
  
 -- 
 1.8.3.1


Re: [PATCH 5/5] vhost: translate_desc: optimization for desc.len region size

2015-06-16 Thread Michael S. Tsirkin
On Tue, Jun 16, 2015 at 06:33:39PM +0200, Igor Mammedov wrote:
 when translating descriptors they are typically less than
 memory region that holds them and translated into 1 iov
 enty,

entry

 so it's not nessesary to check remaining length
 twice and calculate used length and next address
 in such cases.
 
 so relace

replace

 a remaining length and 'size' increment branches
 with a single remaining length check and execute
 next iov steps only when it needed.

it is needed

 
 It saves tiny 2% of translate_desc() execution time.


a tiny

 
 Signed-off-by: Igor Mammedov imamm...@redhat.com
 ---
 PS:
 I'm not sure if iov_size  0 is always true, if it's not
 then better to drop this patch.
 ---
  drivers/vhost/vhost.c | 21 +
  1 file changed, 13 insertions(+), 8 deletions(-)
 
 diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
 index 68c1c88..84c457d 100644
 --- a/drivers/vhost/vhost.c
 +++ b/drivers/vhost/vhost.c
  @@ -,12 +,8 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
   	int ret = 0;
   
   	mem = vq->memory;
  -	while ((u64)len > s) {
  +	do {
   		u64 size;
  -		if (unlikely(ret >= iov_size)) {
  -			ret = -ENOBUFS;
  -			break;
  -		}
   		reg = find_region(mem, addr, len, &vq->cached_reg);
   		if (unlikely(!reg)) {
   			ret = -EFAULT;
   			break;
   		}
  @@ -1124,13 +1120,22 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
   		}
   		_iov = iov + ret;
   		size = reg->memory_size - addr + reg->guest_phys_addr;
  -		_iov->iov_len = min((u64)len - s, size);
   		_iov->iov_base = (void __user *)(unsigned long)
   			(reg->userspace_addr + addr - reg->guest_phys_addr);
  +		++ret;
  +		if (likely((u64)len - s <= size)) {
  +			_iov->iov_len = (u64)len - s;
  +			break;
  +		}
  +
  +		if (unlikely(ret >= iov_size)) {
  +			ret = -ENOBUFS;
  +			break;
  +		}
  +		_iov->iov_len = size;
   		s += size;
   		addr += size;
  -		++ret;
  -	}
  +	} while (1);
   
   	return ret;
   }
 -- 
 1.8.3.1


Re: [PATCH 0/5] vhost: support upto 509 memory regions

2015-06-16 Thread Michael S. Tsirkin
On Tue, Jun 16, 2015 at 06:33:34PM +0200, Igor Mammedov wrote:
 Series extends vhost to support upto 509 memory regions,
 and adds some vhost:translate_desc() performance improvemnts
 so it won't regress when memslots are increased to 509.
 
 It fixes running VM crashing during memory hotplug due
 to vhost refusing accepting more than 64 memory regions.
 
 It's only host kernel side fix to make it work with QEMU
 versions that support memory hotplug. But I'll continue
 to work on QEMU side solution to reduce amount of memory
 regions to make things even better.

I'm concerned userspace work will be harder, in particular,
performance gains will be harder to measure.
How about a flag to disable caching?

 Performance wise for guest with (in my case 3 memory regions)
 and netperf's UDP_RR workload translate_desc() execution
 time from total workload takes:
 
 Memory  |1G RAM|cached|non cached
 regions #   |  3   |  53  |  53
 
 upstream| 0.3% |  -   | 3.5%
 
 this series | 0.2% | 0.5% | 0.7%
 
 where non cached column reflects trashing wokload
 with constant cache miss. More details on timing in
 respective patches.
 
 Igor Mammedov (5):
   vhost: use binary search instead of linear in find_region()
   vhost: extend memory regions allocation to vmalloc
   vhost: support upto 509 memory regions
   vhost: add per VQ memory region caching
    vhost: translate_desc: optimization for desc.len < region size
 
  drivers/vhost/vhost.c | 95 
 +--
  drivers/vhost/vhost.h |  1 +
  2 files changed, 71 insertions(+), 25 deletions(-)
 
 -- 
 1.8.3.1


Re: [PATCH 3/5] vhost: support upto 509 memory regions

2015-06-16 Thread Igor Mammedov
On Tue, 16 Jun 2015 23:14:20 +0200
Michael S. Tsirkin m...@redhat.com wrote:

 On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote:
  since commit
   1d4e7e3 kvm: x86: increase user memory slots to 509
  
  it became possible to use a bigger amount of memory
  slots, which is used by memory hotplug for
  registering hotplugged memory.
  However QEMU crashes if it's used with more than ~60
  pc-dimm devices and vhost-net since host kernel
  in module vhost-net refuses to accept more than 65
  memory regions.
  
  Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509
 
 It was 64, not 65.
 
  to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.
  
  Signed-off-by: Igor Mammedov imamm...@redhat.com
 
 Still thinking about this: can you reorder this to
 be the last patch in the series please?
sure

 
 Also - 509?
509 is the number of userspace memory slots in terms of KVM; I made it
match KVM's allotment of memory slots for the userspace side.

 I think if we are changing this, it'd be nice to
 create a way for userspace to discover the support
 and the # of regions supported.
That was my first idea before extending KVM's memslots:
teach the kernel to tell qemu this number so that QEMU
at least would be able to check if a new memory slot could
be added, but I was redirected to the more simple solution
of just extending vs overdoing things.
Currently QEMU supports up to ~250 memslots, so 509
is about twice as high as we need; it should work for the near
future, but eventually we might still teach the kernel and QEMU
to make things more robust.

 
 
  ---
   drivers/vhost/vhost.c | 2 +-
   1 file changed, 1 insertion(+), 1 deletion(-)
  
  diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
  index 99931a0..6a18c92 100644
  --- a/drivers/vhost/vhost.c
  +++ b/drivers/vhost/vhost.c
  @@ -30,7 +30,7 @@
   #include "vhost.h"
   
   enum {
  -   VHOST_MEMORY_MAX_NREGIONS = 64,
  +   VHOST_MEMORY_MAX_NREGIONS = 509,
  VHOST_MEMORY_F_LOG = 0x1,
   };
   
  -- 
  1.8.3.1



Re: [PATCH 1/5] vhost: use binary search instead of linear in find_region()

2015-06-16 Thread Michael S. Tsirkin
On Tue, Jun 16, 2015 at 06:33:35PM +0200, Igor Mammedov wrote:
 For default region layouts performance stays the same
 as linear search i.e. it takes around 210ns average for
 translate_desc() that inlines find_region().
 
 But it scales better with larger amount of regions,
 235ns BS vs 300ns LS with 55 memory regions
 and it will be about the same values when allowed number
 of slots is increased to 509 like it has been done in kvm.
 
 Signed-off-by: Igor Mammedov imamm...@redhat.com
 ---
  drivers/vhost/vhost.c | 38 --
  1 file changed, 28 insertions(+), 10 deletions(-)
 
 diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
 index 2ee2826..a22f8c3 100644
 --- a/drivers/vhost/vhost.c
 +++ b/drivers/vhost/vhost.c
  @@ -25,6 +25,7 @@
   #include <linux/kthread.h>
   #include <linux/cgroup.h>
   #include <linux/module.h>
  +#include <linux/sort.h>
   
   #include "vhost.h"
   
  @@ -590,6 +591,16 @@ int vhost_vq_access_ok(struct vhost_virtqueue *vq)
   }
   EXPORT_SYMBOL_GPL(vhost_vq_access_ok);
   
  +static int vhost_memory_reg_sort_cmp(const void *p1, const void *p2)
  +{
  +	const struct vhost_memory_region *r1 = p1, *r2 = p2;
  +	if (r1->guest_phys_addr < r2->guest_phys_addr)
  +		return 1;
  +	if (r1->guest_phys_addr > r2->guest_phys_addr)
  +		return -1;
  +	return 0;
  +}
  +
   static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
   {
   	struct vhost_memory mem, *newmem, *oldmem;
  @@ -609,9 +620,11 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
   	memcpy(newmem, mem, size);
   	if (copy_from_user(newmem->regions, m->regions,
   			   mem.nregions * sizeof *m->regions)) {
  -		kfree(newmem);
  +		kvfree(newmem);
   		return -EFAULT;
   	}

What's this doing here?

  +	sort(newmem->regions, newmem->nregions, sizeof(*newmem->regions),
  +	     vhost_memory_reg_sort_cmp, NULL);
   
   	if (!memory_access_ok(d, newmem, 0)) {
   		kfree(newmem);
  @@ -913,17 +926,22 @@ EXPORT_SYMBOL_GPL(vhost_dev_ioctl);
   static const struct vhost_memory_region *find_region(struct vhost_memory *mem,
   						     __u64 addr, __u32 len)
   {
  -	struct vhost_memory_region *reg;
  -	int i;
  +	const struct vhost_memory_region *reg;
  +	int start = 0, end = mem->nregions;
   
  -	/* linear search is not brilliant, but we really have on the order of 6
  -	 * regions in practice */
  -	for (i = 0; i < mem->nregions; ++i) {
  -		reg = mem->regions + i;
  -		if (reg->guest_phys_addr <= addr &&
  -		    reg->guest_phys_addr + reg->memory_size - 1 >= addr)
  -			return reg;
  +	while (start < end) {
  +		int slot = start + (end - start) / 2;
  +		reg = mem->regions + slot;
  +		if (addr >= reg->guest_phys_addr)
  +			end = slot;
  +		else
  +			start = slot + 1;
   	}
  +
  +	reg = mem->regions + start;
  +	if (addr >= reg->guest_phys_addr &&
  +	    reg->guest_phys_addr + reg->memory_size > addr)
  +		return reg;
   	return NULL;
   }
  
 -- 
 1.8.3.1


[PATCH v2 1/2] arm64: KVM: Optimize arm64 fp/simd save/restore

2015-06-16 Thread Mario Smarduch
This patch only saves and restores FP/SIMD registers on guest access. To do
this, the cptr_el2 FP/SIMD trap is set on guest entry and later checked on
exit. lmbench and hackbench show significant improvements; for 30-50% of
exits the FP/SIMD context is not saved/restored.

Signed-off-by: Mario Smarduch m.smard...@samsung.com
---
 arch/arm64/include/asm/kvm_arm.h |5 -
 arch/arm64/kvm/hyp.S |   46 +++---
 2 files changed, 47 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index ac6fafb..7605e09 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -171,10 +171,13 @@
 #define HSTR_EL2_TTEE	(1 << 16)
 #define HSTR_EL2_T(x)	(1 << x)
 
+/* Hyp Coprocessor Trap Register Shifts */
+#define CPTR_EL2_TFP_SHIFT 10
+
 /* Hyp Coprocessor Trap Register */
 #define CPTR_EL2_TCPAC	(1 << 31)
 #define CPTR_EL2_TTA	(1 << 20)
-#define CPTR_EL2_TFP	(1 << 10)
+#define CPTR_EL2_TFP	(1 << CPTR_EL2_TFP_SHIFT)
 
 /* Hyp Debug Configuration Register bits */
 #define MDCR_EL2_TDRA  (1  11)
diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
index 5befd01..de0788f 100644
--- a/arch/arm64/kvm/hyp.S
+++ b/arch/arm64/kvm/hyp.S
@@ -673,6 +673,15 @@
tbz \tmp, #KVM_ARM64_DEBUG_DIRTY_SHIFT, \target
 .endm
 
+/*
+ * Check cptr VFP/SIMD accessed bit, if set VFP/SIMD not accessed by guest.
+ */
+.macro skip_fpsimd_state tmp, target
+   mrs \tmp, cptr_el2
+	tbnz	\tmp, #CPTR_EL2_TFP_SHIFT, \target
+.endm
+
+
 .macro compute_debug_state target
// Compute debug state: If any of KDE, MDE or KVM_ARM64_DEBUG_DIRTY
// is set, we do a full save/restore cycle and disable trapping.
@@ -763,6 +772,7 @@
ldr x2, [x0, #VCPU_HCR_EL2]
msr hcr_el2, x2
mov x2, #CPTR_EL2_TTA
+   orr x2, x2, #CPTR_EL2_TFP
msr cptr_el2, x2
 
	mov	x2, #(1 << 15)	// Trap CP15 Cr=15
@@ -785,7 +795,6 @@
 .macro deactivate_traps
mov x2, #HCR_RW
msr hcr_el2, x2
-   msr cptr_el2, xzr
msr hstr_el2, xzr
 
mrs x2, mdcr_el2
@@ -912,6 +921,28 @@ __restore_fpsimd:
restore_fpsimd
ret
 
+switch_to_guest_fpsimd:
+	push	x4, lr
+
+   mrs x2, cptr_el2
+   bic x2, x2, #CPTR_EL2_TFP
+   msr cptr_el2, x2
+
+   mrs x0, tpidr_el2
+
+   ldr x2, [x0, #VCPU_HOST_CONTEXT]
+   kern_hyp_va x2
+   bl __save_fpsimd
+
+   add x2, x0, #VCPU_CONTEXT
+   bl __restore_fpsimd
+
+   pop x4, lr
+   pop x2, x3
+   pop x0, x1
+
+   eret
+
 /*
  * u64 __kvm_vcpu_run(struct kvm_vcpu *vcpu);
  *
@@ -932,7 +963,6 @@ ENTRY(__kvm_vcpu_run)
kern_hyp_va x2
 
save_host_regs
-   bl __save_fpsimd
bl __save_sysregs
 
compute_debug_state 1f
@@ -948,7 +978,6 @@ ENTRY(__kvm_vcpu_run)
add x2, x0, #VCPU_CONTEXT
 
bl __restore_sysregs
-   bl __restore_fpsimd
 
skip_debug_state x3, 1f
bl  __restore_debug
@@ -967,7 +996,9 @@ __kvm_vcpu_return:
add x2, x0, #VCPU_CONTEXT
 
save_guest_regs
+   skip_fpsimd_state x3, 1f
bl __save_fpsimd
+1:
bl __save_sysregs
 
skip_debug_state x3, 1f
@@ -986,7 +1017,11 @@ __kvm_vcpu_return:
kern_hyp_va x2
 
bl __restore_sysregs
+   skip_fpsimd_state x3, 1f
bl __restore_fpsimd
+1:
+   /* Clear FPSIMD and Trace trapping */
+   msr cptr_el2, xzr
 
skip_debug_state x3, 1f
// Clear the dirty flag for the next run, as all the state has
@@ -1201,6 +1236,11 @@ el1_trap:
 * x1: ESR
 * x2: ESR_EC
 */
+
+   /* Guest accessed VFP/SIMD registers, save host, restore Guest */
+   cmp x2, #ESR_ELx_EC_FP_ASIMD
+	b.eq	switch_to_guest_fpsimd
+
cmp x2, #ESR_ELx_EC_DABT_LOW
mov x0, #ESR_ELx_EC_IABT_LOW
	ccmp	x2, x0, #4, ne
-- 
1.7.9.5
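
In C-like pseudocode, the lazy switch that the assembly above
implements behaves as follows (a sketch for illustration only, not
real code):

	/* guest entry: keep host FP/SIMD state live, arm the trap */
	cptr_el2 |= CPTR_EL2_TFP;

	/* first guest FP/SIMD access traps to EL2
	 * (switch_to_guest_fpsimd): drop the trap, swap state once */
	cptr_el2 &= ~CPTR_EL2_TFP;
	save_fpsimd(host_ctxt);
	restore_fpsimd(guest_ctxt);

	/* guest exit: the trap bit still being set means the guest never
	 * touched FP/SIMD, so there is nothing to save or restore */
	if (!(cptr_el2 & CPTR_EL2_TFP)) {
		save_fpsimd(guest_ctxt);
		restore_fpsimd(host_ctxt);
	}
	cptr_el2 = 0;	/* clear FPSIMD and trace trapping for the host */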



[PATCH v2 2/2] arm: KVM: keep arm vfp/simd exit handling consistent with arm64

2015-06-16 Thread Mario Smarduch
After enhancing arm64 FP/SIMD exit handling, the FP/SIMD exit branch is
moved to guest trap handling. This keeps the exit handling flow
consistent between both architectures.

Signed-off-by: Mario Smarduch m.smard...@samsung.com
---
 arch/arm/kvm/interrupts.S |   12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/arm/kvm/interrupts.S b/arch/arm/kvm/interrupts.S
index 79caf79..fca2c56 100644
--- a/arch/arm/kvm/interrupts.S
+++ b/arch/arm/kvm/interrupts.S
@@ -363,10 +363,6 @@ hyp_hvc:
@ Check syndrome register
mrc p15, 4, r1, c5, c2, 0   @ HSR
lsr r0, r1, #HSR_EC_SHIFT
-#ifdef CONFIG_VFPv3
-   cmp r0, #HSR_EC_CP_0_13
-   beq switch_to_guest_vfp
-#endif
cmp r0, #HSR_EC_HVC
bne guest_trap  @ Not HVC instr.
 
@@ -406,6 +402,12 @@ THUMB( orr lr, #1)
 1: eret
 
 guest_trap:
+#ifdef CONFIG_VFPv3
+   /* Guest accessed VFP/SIMD registers, save host, restore Guest */
+   cmp r0, #HSR_EC_CP_0_13
+   beq switch_to_guest_fpsimd
+#endif
+
load_vcpu   @ Load VCPU pointer to r0
str r1, [vcpu, #VCPU_HSR]
 
@@ -478,7 +480,7 @@ guest_trap:
  * inject an undefined exception to the guest.
  */
 #ifdef CONFIG_VFPv3
-switch_to_guest_vfp:
+switch_to_guest_fpsimd:
load_vcpu   @ Load VCPU pointer to r0
push    {r3-r7}
 
-- 
1.7.9.5



Re: [PATCH 0/5] vhost: support upto 509 memory regions

2015-06-16 Thread Igor Mammedov
On Tue, 16 Jun 2015 23:16:07 +0200
Michael S. Tsirkin m...@redhat.com wrote:

 On Tue, Jun 16, 2015 at 06:33:34PM +0200, Igor Mammedov wrote:
  Series extends vhost to support up to 509 memory regions,
  and adds some vhost:translate_desc() performance improvements
  so it won't regress when memslots are increased to 509.
  
  It fixes a running VM crashing during memory hotplug due
  to vhost refusing to accept more than 64 memory regions.
  
  It's only a host-kernel-side fix, to make it work with QEMU
  versions that support memory hotplug. But I'll continue
  to work on a QEMU-side solution to reduce the number of
  memory regions to make things even better.
 
 I'm concerned userspace work will be harder, in particular,
 performance gains will be harder to measure.
it appears so, so far.

 How about a flag to disable caching?
I've tried to measure the cost of a cache miss, but without much luck;
the difference between the version with the cache and with caching
removed was within the margin of error (±10ns), i.e. not measurable
on my 5min/10*10^6 test workload.
Also I'm concerned that adding an extra fetch+branch for flag
checking will make things worse for the likely path of a cache hit,
so I'd avoid it if possible.

Or do you mean a simple global per-module flag to disable it, wrapping
the check in a static key so that skipping the cache is a cheap jump?
A rough sketch of that idea is below.
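
Something like this, perhaps (a sketch only -- vhost_cache_off, lookup(),
cached_reg and region_contains() are made-up illustrative names, not
code from the series):

	static struct static_key vhost_cache_off = STATIC_KEY_INIT_FALSE;

	static const struct vhost_memory_region *
	lookup(struct vhost_virtqueue *vq, __u64 addr)
	{
		/* patched to a plain no-op jump while the key is false */
		if (static_key_false(&vhost_cache_off))
			return find_region(vq->memory, addr); /* skip cache */

		if (vq->cached_reg && region_contains(vq->cached_reg, addr))
			return vq->cached_reg;                /* cheap hit */

		vq->cached_reg = find_region(vq->memory, addr);
		return vq->cached_reg;
	}

That way the likely cache-hit path pays a patched jump rather than a
runtime flag fetch+branch.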
 
  Performance wise for guest with (in my case 3 memory regions)
  and netperf's UDP_RR workload translate_desc() execution
  time from total workload takes:
  
  Memory  |1G RAM|cached|non cached
  regions #   |  3   |  53  |  53
  
  upstream| 0.3% |  -   | 3.5%
  
  this series | 0.2% | 0.5% | 0.7%
  
  where the non cached column reflects a thrashing workload
  with constant cache misses. More details on timing are in the
  respective patches.
  
  Igor Mammedov (5):
vhost: use binary search instead of linear in find_region()
vhost: extend memory regions allocation to vmalloc
vhost: support upto 509 memory regions
vhost: add per VQ memory region caching
vhost: translate_desc: optimization for desc.len  region size
  
   drivers/vhost/vhost.c | 95 +++++++++++++++++++++++++++++++++++----------
   drivers/vhost/vhost.h |  1 +
   2 files changed, 71 insertions(+), 25 deletions(-)
  
  -- 
  1.8.3.1
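
For readers not following the series: the lookup that patch 1 converts
to binary search would look roughly like this (a sketch, assuming the
regions array is kept sorted by guest_phys_addr; field names mirror the
existing vhost structs):

	static const struct vhost_memory_region *
	find_region(const struct vhost_memory *mem, __u64 addr)
	{
		int lo = 0, hi = mem->nregions - 1;

		while (lo <= hi) {
			int mid = lo + (hi - lo) / 2;
			const struct vhost_memory_region *reg = mem->regions + mid;

			if (addr < reg->guest_phys_addr)
				hi = mid - 1;   /* addr below this region */
			else if (addr >= reg->guest_phys_addr + reg->memory_size)
				lo = mid + 1;   /* addr above this region */
			else
				return reg;     /* addr inside this region */
		}
		return NULL;
	}

That's O(log n) instead of O(n), which is what keeps translate_desc()
flat as the region count grows from 64 to 509.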



[PATCH v2 0/2] arm/arm64: KVM: Optimize arm64 fp/simd, saves 30-50% on exits

2015-06-16 Thread Mario Smarduch
Currently we save/restore fp/simd on each exit. The first patch optimizes the
arm64 save/restore so that we only do it on guest access. hackbench and
several lmbench tests show anywhere from 30% to above 50% improvement
achieved.

In the second patch the 32-bit handler is updated to keep exit handling
consistent with the 64-bit code. A rough sketch of the scheme follows.
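
The scheme, in rough pseudocode (the real code is hyp assembly; this is
only a sketch of the idea):

	/* on world switch into the guest: leave the host fp/simd
	 * registers live and set CPTR_EL2.TFP so the guest's first
	 * fp/simd access traps */

	/* on an ESR_ELx_EC_FP_ASIMD trap from the guest: */
	clear CPTR_EL2.TFP;             /* stop trapping            */
	save host fp/simd state;        /* host saved only now      */
	restore guest fp/simd state;
	return to the guest;

	/* on exit back to the host: */
	if (CPTR_EL2.TFP still set)     /* guest never used fp/simd */
		skip the fp/simd save/restore;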

Changes since v1:
- Addressed Marc's comments
- Verified optimization improvements with lmbench and hackbench, updated 
  commit message

Mario Smarduch (2):
  Optimize arm64 skip 30-50% vfp/simd save/restore on exits
  keep arm vfp/simd exit handling in sync with arm64

 arch/arm/kvm/interrupts.S|   12 +-
 arch/arm64/include/asm/kvm_arm.h |5 -
 arch/arm64/kvm/hyp.S |   46 +++---
 3 files changed, 54 insertions(+), 9 deletions(-)

-- 
1.7.9.5



[PATCH v3 09/18] staging/lirc_serial: Remove TSC-based timing

2015-06-16 Thread Andy Lutomirski
It wasn't compiled in by default.  I suspect that the driver was and
still is broken, though -- it's calling udelay with a parameter
that's derived from loops_per_jiffy.

Cc: Jarod Wilson ja...@wilsonet.com
Cc: de...@driverdev.osuosl.org
Cc: Greg Kroah-Hartman gre...@linuxfoundation.org
Signed-off-by: Andy Lutomirski l...@kernel.org
---
 drivers/staging/media/lirc/lirc_serial.c | 63 ++----------------------------
 1 file changed, 4 insertions(+), 59 deletions(-)

diff --git a/drivers/staging/media/lirc/lirc_serial.c 
b/drivers/staging/media/lirc/lirc_serial.c
index dc7984455c3a..465796a686c4 100644
--- a/drivers/staging/media/lirc/lirc_serial.c
+++ b/drivers/staging/media/lirc/lirc_serial.c
@@ -327,9 +327,6 @@ static void safe_udelay(unsigned long usecs)
  * time
  */
 
-/* So send_pulse can quickly convert microseconds to clocks */
-static unsigned long conv_us_to_clocks;
-
 static int init_timing_params(unsigned int new_duty_cycle,
unsigned int new_freq)
 {
@@ -344,7 +341,6 @@ static int init_timing_params(unsigned int new_duty_cycle,
/* How many clocks in a microsecond?, avoiding long long divide */
work = loops_per_sec;
work *= 4295;  /* 4295 = 2^32 / 1e6 */
-   conv_us_to_clocks = work >> 32;
 
/*
 * Carrier period in clocks, approach good up to 32GHz clock,
@@ -357,10 +353,9 @@ static int init_timing_params(unsigned int new_duty_cycle,
pulse_width = period * duty_cycle / 100;
space_width = period - pulse_width;
	dprintk("in init_timing_params, freq=%d, duty_cycle=%d, "
-		"clk/jiffy=%ld, pulse=%ld, space=%ld, "
-		"conv_us_to_clocks=%ld\n",
+		"clk/jiffy=%ld, pulse=%ld, space=%ld\n",
	freq, duty_cycle, __this_cpu_read(cpu_info.loops_per_jiffy),
-		pulse_width, space_width, conv_us_to_clocks);
+		pulse_width, space_width);
return 0;
 }
 #else /* ! USE_RDTSC */
@@ -431,63 +426,14 @@ static long send_pulse_irdeo(unsigned long length)
return ret;
 }
 
-#ifdef USE_RDTSC
-/* Version that uses Pentium rdtsc instruction to measure clocks */
-
-/*
- * This version does sub-microsecond timing using rdtsc instruction,
- * and does away with the fudged LIRC_SERIAL_TRANSMITTER_LATENCY
- * Implicitly i586 architecture...  - Steve
- */
-
-static long send_pulse_homebrew_softcarrier(unsigned long length)
-{
-   int flag;
-   unsigned long target, start, now;
-
-   /* Get going quick as we can */
-   rdtscl(start);
-   on();
-   /* Convert length from microseconds to clocks */
-   length *= conv_us_to_clocks;
-   /* And loop till time is up - flipping at right intervals */
-   now = start;
-   target = pulse_width;
-   flag = 1;
-   /*
-* FIXME: This looks like a hard busy wait, without even an occasional,
-* polite, cpu_relax() call.  There's got to be a better way?
-*
-* The i2c code has the result of a lot of bit-banging work, I wonder if
-* there's something there which could be helpful here.
-*/
-   while ((now - start) < length) {
-   /* Delay till flip time */
-   do {
-   rdtscl(now);
-   } while ((now - start) < target);
-
-   /* flip */
-   if (flag) {
-   rdtscl(now);
-   off();
-   target += space_width;
-   } else {
-   rdtscl(now); on();
-   target += pulse_width;
-   }
-   flag = !flag;
-   }
-   rdtscl(now);
-   return ((now - start) - length) / conv_us_to_clocks;
-}
-#else /* ! USE_RDTSC */
 /* Version using udelay() */
 
 /*
  * here we use fixed point arithmetic, with 8
  * fractional bits.  that gets us within 0.1% or so of the right average
  * frequency, albeit with some jitter in pulse length - Steve
+ *
+ * This should use ndelay instead.
  */
 
 /* To match 8 fractional bits used for pulse/space length */
@@ -520,7 +466,6 @@ static long send_pulse_homebrew_softcarrier(unsigned long length)
}
return (actual-length)  8;
 }
-#endif /* USE_RDTSC */
 
 static long send_pulse_homebrew(unsigned long length)
 {
-- 
2.4.2



[PATCH v3 08/18] baycom_epp: Replace rdtscl() with native_read_tsc()

2015-06-16 Thread Andy Lutomirski
This is only used if BAYCOM_DEBUG is defined.

Cc: walter harms wha...@bfs.de
Cc: Ralf Baechle r...@linux-mips.org
Cc: Thomas Sailer t.sai...@alumni.ethz.ch
Cc: linux-h...@vger.kernel.org
Signed-off-by: Andy Lutomirski l...@kernel.org
---

I'm hoping for an ack for this to go through -tip.

 drivers/net/hamradio/baycom_epp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/hamradio/baycom_epp.c 
b/drivers/net/hamradio/baycom_epp.c
index 83c7cce0d172..44e5c3b5e0af 100644
--- a/drivers/net/hamradio/baycom_epp.c
+++ b/drivers/net/hamradio/baycom_epp.c
@@ -638,7 +638,7 @@ static int receive(struct net_device *dev, int cnt)
 #define GETTICK(x)\
 ({\
if (cpu_has_tsc)  \
-   rdtscl(x);\
+   x = (unsigned int)native_read_tsc();  \
 })
 #else /* __i386__ */
 #define GETTICK(x)
-- 
2.4.2



[PATCH v3 04/18] x86/tsc: Replace rdtscll with native_read_tsc

2015-06-16 Thread Andy Lutomirski
Now that the read_tsc paravirt hook is gone, rdtscll() is just a
wrapper around native_read_tsc().  Unwrap it.

Signed-off-by: Andy Lutomirski l...@kernel.org
---
 arch/x86/boot/compressed/aslr.c  | 2 +-
 arch/x86/include/asm/msr.h   | 3 ---
 arch/x86/include/asm/tsc.h   | 5 +----
 arch/x86/kernel/apb_timer.c  | 4 ++--
 arch/x86/kernel/apic/apic.c  | 8 ++++----
 arch/x86/kernel/cpu/mcheck/mce.c | 4 ++--
 arch/x86/kernel/espfix_64.c  | 2 +-
 arch/x86/kernel/hpet.c   | 4 ++--
 arch/x86/kernel/trace_clock.c| 2 +-
 arch/x86/kernel/tsc.c| 4 ++--
 arch/x86/kvm/vmx.c   | 2 +-
 arch/x86/lib/delay.c | 2 +-
 drivers/thermal/intel_powerclamp.c   | 4 ++--
 tools/power/cpupower/debug/kernel/cpufreq-test_tsc.c | 4 ++--
 14 files changed, 22 insertions(+), 28 deletions(-)

diff --git a/arch/x86/boot/compressed/aslr.c b/arch/x86/boot/compressed/aslr.c
index d7b1f655b3ef..ea33236190b1 100644
--- a/arch/x86/boot/compressed/aslr.c
+++ b/arch/x86/boot/compressed/aslr.c
@@ -82,7 +82,7 @@ static unsigned long get_random_long(void)
 
if (has_cpuflag(X86_FEATURE_TSC)) {
debug_putstr(" RDTSC");
-   rdtscll(raw);
+   raw = native_read_tsc();
 
random ^= raw;
use_i8254 = false;
diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index d1afac7df484..7273b74e0f99 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -192,9 +192,6 @@ do {   \
 #define rdtscl(low)\
((low) = (u32)native_read_tsc())
 
-#define rdtscll(val)   \
-   ((val) = native_read_tsc())
-
 #define rdtscp(low, high, aux) \
 do {\
unsigned long long _val = native_read_tscp((aux)); \
diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 3da1cc1218ac..b4883902948b 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -21,15 +21,12 @@ extern void disable_TSC(void);
 
 static inline cycles_t get_cycles(void)
 {
-   unsigned long long ret = 0;
-
 #ifndef CONFIG_X86_TSC
if (!cpu_has_tsc)
return 0;
 #endif
-   rdtscll(ret);
 
-   return ret;
+   return native_read_tsc();
 }
 
 extern void tsc_init(void);
diff --git a/arch/x86/kernel/apb_timer.c b/arch/x86/kernel/apb_timer.c
index 9fe111cc50f8..25efa534c4e4 100644
--- a/arch/x86/kernel/apb_timer.c
+++ b/arch/x86/kernel/apb_timer.c
@@ -263,7 +263,7 @@ static int apbt_clocksource_register(void)
 
/* Verify whether apbt counter works */
t1 = dw_apb_clocksource_read(clocksource_apbt);
-   rdtscll(start);
+   start = native_read_tsc();
 
/*
 * We don't know the TSC frequency yet, but waiting for
@@ -273,7 +273,7 @@ static int apbt_clocksource_register(void)
 */
do {
rep_nop();
-   rdtscll(now);
+   now = native_read_tsc();
} while ((now - start) < 200000UL);
 
/* APBT is the only always on clocksource, it has to work! */
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index dcb52850a28f..51af1ed1ae2e 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -457,7 +457,7 @@ static int lapic_next_deadline(unsigned long delta,
 {
u64 tsc;
 
-   rdtscll(tsc);
+   tsc = native_read_tsc();
wrmsrl(MSR_IA32_TSC_DEADLINE, tsc + (((u64) delta) * TSC_DIVISOR));
return 0;
 }
@@ -592,7 +592,7 @@ static void __init lapic_cal_handler(struct clock_event_device *dev)
unsigned long pm = acpi_pm_read_early();
 
if (cpu_has_tsc)
-   rdtscll(tsc);
+   tsc = native_read_tsc();
 
switch (lapic_cal_loops++) {
case 0:
@@ -1209,7 +1209,7 @@ void setup_local_APIC(void)
long long max_loops = cpu_khz ? cpu_khz : 1000000;
 
if (cpu_has_tsc)
-   rdtscll(tsc);
+   tsc = native_read_tsc();
 
if (disable_apic) {
disable_ioapic_support();
@@ -1293,7 +1293,7 @@ void setup_local_APIC(void)
}
if (queued) {
if (cpu_has_tsc && cpu_khz) {
-   rdtscll(ntsc);
+   ntsc = native_read_tsc();
max_loops = (cpu_khz << 10) - (ntsc - tsc);
} else
max_loops--;
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c 

[PATCH v3 07/18] x86/cpu/amd: Use the full 64-bit TSC to detect the 2.6.2 bug

2015-06-16 Thread Andy Lutomirski
This code is timing 100k indirect calls, so the added overhead of
counting the number of cycles elapsed as a 64-bit number should be
insignificant.  Drop the optimization of using a 32-bit count.

Signed-off-by: Andy Lutomirski l...@kernel.org
---
 arch/x86/kernel/cpu/amd.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 5bd3a99dc20b..c5ceec532799 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -107,7 +107,7 @@ static void init_amd_k6(struct cpuinfo_x86 *c)
const int K6_BUG_LOOP = 100;
int n;
void (*f_vide)(void);
-   unsigned long d, d2;
+   u64 d, d2;
 
printk(KERN_INFO AMD K6 stepping B detected - );
 
@@ -118,10 +118,10 @@ static void init_amd_k6(struct cpuinfo_x86 *c)
 
n = K6_BUG_LOOP;
f_vide = vide;
-   rdtscl(d);
+   d = native_read_tsc();
while (n--)
f_vide();
-   rdtscl(d2);
+   d2 = native_read_tsc();
d = d2-d;
 
if (d > 20*K6_BUG_LOOP)
-- 
2.4.2



[PATCH v3 05/18] x86/tsc: Remove the rdtscp and rdtscpll macros

2015-06-16 Thread Andy Lutomirski
They have no users.  Leave native_read_tscp, which seems potentially
useful despite also having no callers.

Signed-off-by: Andy Lutomirski l...@kernel.org
---
 arch/x86/include/asm/msr.h | 9 -
 1 file changed, 9 deletions(-)

diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index 7273b74e0f99..626f78199665 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -192,15 +192,6 @@ do {   \
 #define rdtscl(low)\
((low) = (u32)native_read_tsc())
 
-#define rdtscp(low, high, aux) \
-do {\
-   unsigned long long _val = native_read_tscp((aux)); \
-   (low) = (u32)_val;  \
-   (high) = (u32)(_val >> 32); \
-} while (0)
-
-#define rdtscpll(val, aux) (val) = native_read_tscp((aux))
-
 /*
  * 64-bit version of wrmsr_safe():
  */
-- 
2.4.2



[PATCH v3 01/18] x86/tsc: Inline native_read_tsc and remove __native_read_tsc

2015-06-16 Thread Andy Lutomirski
In cdc7957d1954 ("x86: move native_read_tsc() offline"),
native_read_tsc was moved out of line, presumably for some
now-obsolete vDSO-related reason.  Undo it.

The entire rdtsc, shl, or sequence is only 11 bytes, and calls via
rdtscl and similar helpers were already inlined.

Signed-off-by: Andy Lutomirski l...@kernel.org
---
 arch/x86/entry/vdso/vclock_gettime.c  | 2 +-
 arch/x86/include/asm/msr.h| 8 +++-----
 arch/x86/include/asm/pvclock.h| 2 +-
 arch/x86/include/asm/stackprotector.h | 2 +-
 arch/x86/include/asm/tsc.h| 2 +-
 arch/x86/kernel/apb_timer.c   | 4 ++--
 arch/x86/kernel/tsc.c | 6 ------
 7 files changed, 9 insertions(+), 17 deletions(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index 9793322751e0..972b488ac16a 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -186,7 +186,7 @@ notrace static cycle_t vread_tsc(void)
 * but no one has ever seen it happen.
 */
rdtsc_barrier();
-   ret = (cycle_t)__native_read_tsc();
+   ret = (cycle_t)native_read_tsc();
 
last = gtod->cycle_last;
 
diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index e6a707eb5081..88711470af7f 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -106,12 +106,10 @@ notrace static inline int native_write_msr_safe(unsigned int msr,
return err;
 }
 
-extern unsigned long long native_read_tsc(void);
-
 extern int rdmsr_safe_regs(u32 regs[8]);
 extern int wrmsr_safe_regs(u32 regs[8]);
 
-static __always_inline unsigned long long __native_read_tsc(void)
+static __always_inline unsigned long long native_read_tsc(void)
 {
DECLARE_ARGS(val, low, high);
 
@@ -181,10 +179,10 @@ static inline int rdmsrl_safe(unsigned msr, unsigned long long *p)
 }
 
 #define rdtscl(low)\
-   ((low) = (u32)__native_read_tsc())
+   ((low) = (u32)native_read_tsc())
 
 #define rdtscll(val)   \
-   ((val) = __native_read_tsc())
+   ((val) = native_read_tsc())
 
 #define rdpmc(counter, low, high)  \
 do {   \
diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index d6b078e9fa28..71bd485c2986 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -62,7 +62,7 @@ static inline u64 pvclock_scale_delta(u64 delta, u32 
mul_frac, int shift)
 static __always_inline
 u64 pvclock_get_nsec_offset(const struct pvclock_vcpu_time_info *src)
 {
-   u64 delta = __native_read_tsc() - src->tsc_timestamp;
+   u64 delta = native_read_tsc() - src->tsc_timestamp;
	return pvclock_scale_delta(delta, src->tsc_to_system_mul,
		   src->tsc_shift);
 }
diff --git a/arch/x86/include/asm/stackprotector.h 
b/arch/x86/include/asm/stackprotector.h
index c2e00bb2a136..bc5fa2af112e 100644
--- a/arch/x86/include/asm/stackprotector.h
+++ b/arch/x86/include/asm/stackprotector.h
@@ -72,7 +72,7 @@ static __always_inline void boot_init_stack_canary(void)
 * on during the bootup the random pool has true entropy too.
 */
get_random_bytes(&canary, sizeof(canary));
-   tsc = __native_read_tsc();
+   tsc = native_read_tsc();
canary += tsc + (tsc << 32UL);
 
current-stack_canary = canary;
diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 94605c0e9cee..fd11128faf25 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -42,7 +42,7 @@ static __always_inline cycles_t vget_cycles(void)
if (!cpu_has_tsc)
return 0;
 #endif
-   return (cycles_t)__native_read_tsc();
+   return (cycles_t)native_read_tsc();
 }
 
 extern void tsc_init(void);
diff --git a/arch/x86/kernel/apb_timer.c b/arch/x86/kernel/apb_timer.c
index ede92c3364d3..9fe111cc50f8 100644
--- a/arch/x86/kernel/apb_timer.c
+++ b/arch/x86/kernel/apb_timer.c
@@ -390,13 +390,13 @@ unsigned long apbt_quick_calibrate(void)
old = dw_apb_clocksource_read(clocksource_apbt);
old += loop;
 
-   t1 = __native_read_tsc();
+   t1 = native_read_tsc();
 
do {
new = dw_apb_clocksource_read(clocksource_apbt);
} while (new < old);
 
-   t2 = __native_read_tsc();
+   t2 = native_read_tsc();
 
shift = 5;
if (unlikely(loop >> shift == 0)) {
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 505449700e0c..e7710cd7ba00 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -308,12 +308,6 @@ unsigned long long
 sched_clock(void) __attribute__((alias(native_sched_clock)));
 #endif
 
-unsigned long long native_read_tsc(void)
-{
-   return __native_read_tsc();
-}
-EXPORT_SYMBOL(native_read_tsc);
-
 int check_tsc_unstable(void)
 {
return tsc_unstable;

Re: [PATCH v3 08/18] baycom_epp: Replace rdtscl() with native_read_tsc()

2015-06-16 Thread Thomas Sailer

Acked-by: Thomas Sailer t.sai...@alumni.ethz.ch

On 06/17/2015 02:35 AM, Andy Lutomirski wrote:

This is only used if BAYCOM_DEBUG is defined.

Cc: walter harms wha...@bfs.de
Cc: Ralf Baechle r...@linux-mips.org
Cc: Thomas Sailer t.sai...@alumni.ethz.ch
Cc: linux-h...@vger.kernel.org
Signed-off-by: Andy Lutomirski l...@kernel.org
---

I'm hoping for an ack for this to go through -tip.

  drivers/net/hamradio/baycom_epp.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/hamradio/baycom_epp.c 
b/drivers/net/hamradio/baycom_epp.c
index 83c7cce0d172..44e5c3b5e0af 100644
--- a/drivers/net/hamradio/baycom_epp.c
+++ b/drivers/net/hamradio/baycom_epp.c
@@ -638,7 +638,7 @@ static int receive(struct net_device *dev, int cnt)
  #define GETTICK(x)\
  ({\
if (cpu_has_tsc)  \
-   rdtscl(x);\
+   x = (unsigned int)native_read_tsc();  \
  })
  #else /* __i386__ */
  #define GETTICK(x)




Re: [RFC PATCH] perf/kvm: Guest Symbol Resolution for powerpc

2015-06-16 Thread David Ahern

On 6/16/15 7:24 PM, Hemant Kumar wrote:

Because, this depends on the kernel tracepoint kvm_hv:kvm_guest_exit.
perf_prepare_sample() on the kernel side sets the event->header.misc
field to
PERF_RECORD_MISC_KERNEL through perf_misc_flags(pt_regs). In case of
tracepoints which always get hit in the host kernel context, the
perf_misc_flags() will always return PERF_RECORD_MISC_KERNEL.

IMHO we will rather have to set the cpumode in the user space for this
tracepoint
and we can't depend on the event->header.misc field for this case.

What would you suggest?



oh, right you are using a tracepoint for this. It does not have the 
hooks to specify cpumode. Never mind.



[PATCH v3 17/18] x86/kvm/tsc: Drop extra barrier and use rdtsc_ordered in kvmclock

2015-06-16 Thread Andy Lutomirski
__pvclock_read_cycles had an unnecessary barrier.  Get rid of that
barrier and clean up the code by just using rdtsc_ordered().
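
The hazard the barriers guard against is roughly this (a simplified
sketch, not the exact kvmclock code):

	version = src->version;                 /* load A          */
	delta = rdtsc() - src->tsc_timestamp;   /* load B + RDTSC  */

Without a fence the CPU may execute the RDTSC before loads A and B, so
a concurrent host update of tsc_timestamp can make the computed time
jump backwards.  rdtsc_ordered() makes the RDTSC behave like an
ordinary load, which is why the explicit rdtsc_barrier() calls become
redundant.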

Cc: Paolo Bonzini pbonz...@redhat.com
Cc: Radim Krcmar rkrc...@redhat.com
Cc: Marcelo Tosatti mtosa...@redhat.com
Cc: kvm@vger.kernel.org
Signed-off-by: Andy Lutomirski l...@kernel.org
---

I'm hoping to get an ack for this to go in through -tip.  (Arguably
I'm the maintainer of this code given how it's used, but I should
still ask for an ack.)

arch/x86/include/asm/pvclock.h | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 6084bce345fc..cf2329ca4812 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -62,7 +62,18 @@ static inline u64 pvclock_scale_delta(u64 delta, u32 
mul_frac, int shift)
 static __always_inline
 u64 pvclock_get_nsec_offset(const struct pvclock_vcpu_time_info *src)
 {
-   u64 delta = rdtsc() - src->tsc_timestamp;
+   /*
+* Note: emulated platforms which do not advertise SSE2 support
+* break rdtsc_ordered, resulting in kvmclock not using the
+* necessary RDTSC barriers.  Without barriers, it is possible
+* that RDTSC instruction is executed before prior loads,
+* resulting in violation of monotonicity.
+*
+* On an SMP guest without SSE2, it's unclear how anything is
+* supposed to work correctly, though -- memory fences
+* (e.g. smp_mb) are important for more than just timing.
+*/
+   u64 delta = rdtsc_ordered() - src->tsc_timestamp;
	return pvclock_scale_delta(delta, src->tsc_to_system_mul,
		   src->tsc_shift);
 }
@@ -76,17 +87,9 @@ unsigned __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src,
u8 ret_flags;
 
	version = src->version;
-   /* Note: emulated platforms which do not advertise SSE2 support
-* result in kvmclock not using the necessary RDTSC barriers.
-* Without barriers, it is possible that RDTSC instruction reads from
-* the time stamp counter outside rdtsc_barrier protected section
-* below, resulting in violation of monotonicity.
-*/
-   rdtsc_barrier();
offset = pvclock_get_nsec_offset(src);
	ret = src->system_time + offset;
	ret_flags = src->flags;
-   rdtsc_barrier();
 
*cycles = ret;
*flags = ret_flags;
-- 
2.4.2



[PATCH v3 14/18] x86: Add rdtsc_ordered() and use it in trivial call sites

2015-06-16 Thread Andy Lutomirski
"rdtsc_barrier(); rdtsc()" is an unnecessary mouthful and requires
more thought than should be necessary.  Add an rdtsc_ordered()
helper and replace the trivial call sites with it.

This should not change generated code.  The duplication of the fence
asm is temporary.
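
For illustration, a hypothetical measurement snippet (not from the
patch):

	u64 t0, t1;

	t0 = rdtsc_ordered();   /* cannot be speculated above prior loads */
	do_work();
	t1 = rdtsc_ordered();
	/* t1 - t0 is a meaningful cycle count on a synced-TSC system */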

Signed-off-by: Andy Lutomirski l...@kernel.org
---
 arch/x86/entry/vdso/vclock_gettime.c | 16 ++--
 arch/x86/include/asm/msr.h   | 26 ++
 arch/x86/kernel/trace_clock.c|  7 +--
 arch/x86/kvm/x86.c   | 16 ++--
 arch/x86/lib/delay.c |  9 +++--
 5 files changed, 34 insertions(+), 40 deletions(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index 0340d93c18ca..ca94fa649251 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -175,20 +175,8 @@ static notrace cycle_t vread_pvclock(int *mode)
 
 notrace static cycle_t vread_tsc(void)
 {
-   cycle_t ret;
-   u64 last;
-
-   /*
-* Empirically, a fence (of type that depends on the CPU)
-* before rdtsc is enough to ensure that rdtsc is ordered
-* with respect to loads.  The various CPU manuals are unclear
-* as to whether rdtsc can be reordered with later loads,
-* but no one has ever seen it happen.
-*/
-   rdtsc_barrier();
-   ret = (cycle_t)rdtsc();
-
-   last = gtod-cycle_last;
+   cycle_t ret = (cycle_t)rdtsc_ordered();
+   u64 last = gtod->cycle_last;
 
if (likely(ret >= last))
return ret;
diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index ff0c120dafe5..02bdd6c65017 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -127,6 +127,32 @@ static __always_inline unsigned long long rdtsc(void)
return EAX_EDX_VAL(val, low, high);
 }
 
+/**
+ * rdtsc_ordered() - read the current TSC in program order
+ *
+ * rdtsc_ordered() returns the result of RDTSC as a 64-bit integer.
+ * It is ordered like a load to a global in-memory counter.  It should
+ * be impossible to observe non-monotonic rdtsc_unordered() behavior
+ * across multiple CPUs as long as the TSC is synced.
+ */
+static __always_inline unsigned long long rdtsc_ordered(void)
+{
+   /*
+* The RDTSC instruction is not ordered relative to memory
+* access.  The Intel SDM and the AMD APM are both vague on this
+* point, but empirically an RDTSC instruction can be
+* speculatively executed before prior loads.  An RDTSC
+* immediately after an appropriate barrier appears to be
+* ordered as a normal load, that is, it provides the same
+* ordering guarantees as reading from a global memory location
+* that some other imaginary CPU is updating continuously with a
+* time stamp.
+*/
+   alternative_2("", "mfence", X86_FEATURE_MFENCE_RDTSC,
+		  "lfence", X86_FEATURE_LFENCE_RDTSC);
+   return rdtsc();
+}
+
 static inline unsigned long long native_read_pmc(int counter)
 {
DECLARE_ARGS(val, low, high);
diff --git a/arch/x86/kernel/trace_clock.c b/arch/x86/kernel/trace_clock.c
index 67efb8c96fc4..80bb24d9b880 100644
--- a/arch/x86/kernel/trace_clock.c
+++ b/arch/x86/kernel/trace_clock.c
@@ -12,10 +12,5 @@
  */
 u64 notrace trace_clock_x86_tsc(void)
 {
-   u64 ret;
-
-   rdtsc_barrier();
-   ret = rdtsc();
-
-   return ret;
+   return rdtsc_ordered();
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b0afdc74c28a..dfccaf2f2e00 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1419,20 +1419,8 @@ EXPORT_SYMBOL_GPL(kvm_write_tsc);
 
 static cycle_t read_tsc(void)
 {
-   cycle_t ret;
-   u64 last;
-
-   /*
-* Empirically, a fence (of type that depends on the CPU)
-* before rdtsc is enough to ensure that rdtsc is ordered
-* with respect to loads.  The various CPU manuals are unclear
-* as to whether rdtsc can be reordered with later loads,
-* but no one has ever seen it happen.
-*/
-   rdtsc_barrier();
-   ret = (cycle_t)rdtsc();
-
-   last = pvclock_gtod_data.clock.cycle_last;
+   cycle_t ret = (cycle_t)rdtsc_ordered();
+   u64 last = pvclock_gtod_data.clock.cycle_last;
 
if (likely(ret >= last))
return ret;
diff --git a/arch/x86/lib/delay.c b/arch/x86/lib/delay.c
index f24bc59ab0a0..4453d52a143d 100644
--- a/arch/x86/lib/delay.c
+++ b/arch/x86/lib/delay.c
@@ -54,11 +54,9 @@ static void delay_tsc(unsigned long __loops)
 
preempt_disable();
cpu = smp_processor_id();
-   rdtsc_barrier();
-   bclock = rdtsc();
+   bclock = rdtsc_ordered();
for (;;) {
-   rdtsc_barrier();
-   now = rdtsc();
+   now = rdtsc_ordered();
if ((now - bclock) >= loops)
break;
 
@@ -79,8 

[PATCH v3 02/18] x86/msr/kvm: Remove vget_cycles()

2015-06-16 Thread Andy Lutomirski
The only caller was KVM's read_tsc.  The only difference between
vget_cycles and native_read_tsc was that vget_cycles returned zero
instead of crashing on TSC-less systems.  KVM already checks
vclock_mode before calling that function, so the extra check is
unnecessary.

(Off-topic, but the whole KVM clock host implementation is gross.
 IMO it should be rewritten.)

Signed-off-by: Andy Lutomirski l...@kernel.org
---
 arch/x86/include/asm/tsc.h | 13 -
 arch/x86/kvm/x86.c |  2 +-
 2 files changed, 1 insertion(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index fd11128faf25..3da1cc1218ac 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -32,19 +32,6 @@ static inline cycles_t get_cycles(void)
return ret;
 }
 
-static __always_inline cycles_t vget_cycles(void)
-{
-   /*
-* We only do VDSOs on TSC capable CPUs, so this shouldn't
-* access boot_cpu_data (which is not VDSO-safe):
-*/
-#ifndef CONFIG_X86_TSC
-   if (!cpu_has_tsc)
-   return 0;
-#endif
-   return (cycles_t)native_read_tsc();
-}
-
 extern void tsc_init(void);
 extern void mark_tsc_unstable(char *reason);
 extern int unsynchronized_tsc(void);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 26eaeb522cab..c26faf408bce 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1430,7 +1430,7 @@ static cycle_t read_tsc(void)
 * but no one has ever seen it happen.
 */
rdtsc_barrier();
-   ret = (cycle_t)vget_cycles();
+   ret = (cycle_t)native_read_tsc();
 
last = pvclock_gtod_data.clock.cycle_last;
 
-- 
2.4.2



[PATCH v3 06/18] x86/tsc: Use the full 64-bit tsc in tsc_delay

2015-06-16 Thread Andy Lutomirski
As a very minor optimization, tsc_delay was only using the low 32
bits of the TSC.  It's a delay function, so just use the whole
thing.

Signed-off-by: Andy Lutomirski l...@kernel.org
---
 arch/x86/lib/delay.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/lib/delay.c b/arch/x86/lib/delay.c
index 9a52ad0c0758..35115f3786a9 100644
--- a/arch/x86/lib/delay.c
+++ b/arch/x86/lib/delay.c
@@ -49,16 +49,16 @@ static void delay_loop(unsigned long loops)
 /* TSC based delay: */
 static void delay_tsc(unsigned long __loops)
 {
-   u32 bclock, now, loops = __loops;
+   u64 bclock, now, loops = __loops;
int cpu;
 
preempt_disable();
cpu = smp_processor_id();
rdtsc_barrier();
-   rdtscl(bclock);
+   bclock = native_read_tsc();
for (;;) {
rdtsc_barrier();
-   rdtscl(now);
+   now = native_read_tsc();
if ((now - bclock) >= loops)
break;
 
@@ -80,7 +80,7 @@ static void delay_tsc(unsigned long __loops)
loops -= (now - bclock);
cpu = smp_processor_id();
rdtsc_barrier();
-   rdtscl(bclock);
+   bclock = native_read_tsc();
}
}
preempt_enable();
-- 
2.4.2



[PATCH v3 03/18] x86/tsc/paravirt: Remove the read_tsc and read_tscp paravirt hooks

2015-06-16 Thread Andy Lutomirski
We've had read_tsc and read_tscp paravirt hooks since the very
beginning of paravirt, i.e., d3561b7fa0fb ("[PATCH] paravirt: header
and stubs for paravirtualisation").  AFAICT the only paravirt guest
implementation that ever replaced these calls was vmware, and it's
gone.  Arguably even vmware shouldn't have hooked rdtsc -- we fully
support systems that don't have a TSC at all, so there's no point
for a paravirt implementation to pretend that we have a TSC but to
replace it.

I also doubt that these hooks actually worked.  Calls to rdtscl and
rdtscll, which respected the hooks, were used seemingly
interchangeably with native_read_tsc, which did not.

Just remove them.  If anyone ever needs them again, they can try
to make a case for why they need them.

Before, on a paravirt config:
    text     data      bss      dec     hex filename
13426505  1827056 14508032 29761593 1c62039 vmlinux

After:
    text     data      bss      dec     hex filename
13426617  1827056 14508032 29761705 1c620a9 vmlinux

Signed-off-by: Andy Lutomirski l...@kernel.org
---
 arch/x86/include/asm/msr.h| 16 
 arch/x86/include/asm/paravirt.h   | 34 --
 arch/x86/include/asm/paravirt_types.h |  2 --
 arch/x86/kernel/paravirt.c|  2 --
 arch/x86/kernel/paravirt_patch_32.c   |  2 --
 arch/x86/xen/enlighten.c  |  3 ---
 6 files changed, 8 insertions(+), 51 deletions(-)

diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index 88711470af7f..d1afac7df484 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -178,12 +178,6 @@ static inline int rdmsrl_safe(unsigned msr, unsigned long long *p)
return err;
 }
 
-#define rdtscl(low)\
-   ((low) = (u32)native_read_tsc())
-
-#define rdtscll(val)   \
-   ((val) = native_read_tsc())
-
 #define rdpmc(counter, low, high)  \
 do {   \
u64 _l = native_read_pmc((counter));\
@@ -193,6 +187,14 @@ do {   \
 
 #define rdpmcl(counter, val) ((val) = native_read_pmc(counter))
 
+#endif /* !CONFIG_PARAVIRT */
+
+#define rdtscl(low)\
+   ((low) = (u32)native_read_tsc())
+
+#define rdtscll(val)   \
+   ((val) = native_read_tsc())
+
 #define rdtscp(low, high, aux) \
 do {\
unsigned long long _val = native_read_tscp((aux)); \
@@ -202,8 +204,6 @@ do {   \
 
 #define rdtscpll(val, aux) (val) = native_read_tscp((aux))
 
-#endif /* !CONFIG_PARAVIRT */
-
 /*
  * 64-bit version of wrmsr_safe():
  */
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index d143bfad45d7..c2be0375bcad 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -174,19 +174,6 @@ static inline int rdmsrl_safe(unsigned msr, unsigned long long *p)
return err;
 }
 
-static inline u64 paravirt_read_tsc(void)
-{
-   return PVOP_CALL0(u64, pv_cpu_ops.read_tsc);
-}
-
-#define rdtscl(low)\
-do {   \
-   u64 _l = paravirt_read_tsc();   \
-   low = (int)_l;  \
-} while (0)
-
-#define rdtscll(val) (val = paravirt_read_tsc())
-
 static inline unsigned long long paravirt_sched_clock(void)
 {
return PVOP_CALL0(unsigned long long, pv_time_ops.sched_clock);
@@ -215,27 +202,6 @@ do {   \
 
 #define rdpmcl(counter, val) ((val) = paravirt_read_pmc(counter))
 
-static inline unsigned long long paravirt_rdtscp(unsigned int *aux)
-{
-   return PVOP_CALL1(u64, pv_cpu_ops.read_tscp, aux);
-}
-
-#define rdtscp(low, high, aux) \
-do {   \
-   int __aux;  \
-   unsigned long __val = paravirt_rdtscp(&__aux);  \
-   (low) = (u32)__val; \
-   (high) = (u32)(__val  32);\
-   (aux) = __aux;  \
-} while (0)
-
-#define rdtscpll(val, aux) \
-do {   \
-   unsigned long __aux;\
-   val = paravirt_rdtscp(&__aux);  \
-   (aux) = __aux;  \
-} while (0)
-
 static inline void paravirt_alloc_ldt(struct desc_struct *ldt, unsigned 
entries)
 {
PVOP_VCALL2(pv_cpu_ops.alloc_ldt, ldt, entries);
diff --git a/arch/x86/include/asm/paravirt_types.h 

[PATCH v3 00/18] x86/tsc: Clean up rdtsc helpers

2015-06-16 Thread Andy Lutomirski
My sincere apologies for the spam.  I sent an unholy mixture of the
real patch set and an old poorly split-up patch set, and the result
was incomprehensible.  Here's what I meant to send.

After the some recent threads about rdtsc barriers, I remembered
that our RDTSC wrappers are a big mess.  Let's clean it up.

Currently we have rdtscl, rdtscll, native_read_tsc,
paravirt_read_tsc, and rdtsc_barrier.  For people who haven't
noticed rdtsc_barrier and who haven't carefully read the docs,
there's no indication that all of the other accessors have a giant
ordering gotcha.  The macro forms are ugly, and the paravirt
implementation is completely pointless.

rdtscl is particularly awful.  It reads the low bits.  There are no
performance critical users of just the low bits anywhere in the
kernel.

Clean it up.  After this patch set, there are exactly three
functions.  rdtsc_unordered() is a function that does a raw RDTSC
and returns a 64-bit number.  rdtsc_ordered() is a function that
does a properly ordered RDTSC for general-purpose use.
barrier_before_rdtsc() is exactly what it sounds like.
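
With the renames listed below, the surviving helpers end up being used
like this (illustrative, using the final names):

	u64 raw = rdtsc();              /* raw RDTSC: may be speculated,
	                                   can look non-monotonic across
	                                   CPUs                          */
	u64 ord = rdtsc_ordered();      /* ordered like a load to a
	                                   global in-memory counter      */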

Changes from v2:
 - Rename rdtsc_unordered to just rdtsc
 - Get rid of rdtsc_barrier entirely instead of renaming it
 - The KVM patch is new (see above)
 - Added some acks

Changes from v1:
 - None, except that I screwed up the v1 emails.

Andy Lutomirski (18):
  x86/tsc: Inline native_read_tsc and remove __native_read_tsc
  x86/msr/kvm: Remove vget_cycles()
  x86/tsc/paravirt: Remove the read_tsc and read_tscp paravirt hooks
  x86/tsc: Replace rdtscll with native_read_tsc
  x86/tsc: Remove the rdtscp and rdtscpll macros
  x86/tsc: Use the full 64-bit tsc in tsc_delay
  x86/cpu/amd: Use the full 64-bit TSC to detect the 2.6.2 bug
  baycom_epp: Replace rdtscl() with native_read_tsc()
  staging/lirc_serial: Remove TSC-based timing
  input/joystick/analog: Switch from rdtscl() to native_read_tsc()
  drivers/input/gameport: Replace rdtscl() with native_read_tsc()
  x86/tsc: Remove rdtscl()
  x86/tsc: Rename native_read_tsc() to rdtsc()
  x86: Add rdtsc_ordered() and use it in trivial call sites
  x86/tsc: Use rdtsc_ordered() in check_tsc_warp() and drop extra
barriers
  x86/tsc: In read_tsc, use rdtsc_ordered() instead of get_cycles()
  x86/kvm/tsc: Drop extra barrier and use rdtsc_ordered in kvmclock
  x86/tsc: Remove rdtsc_barrier()

 arch/x86/boot/compressed/aslr.c|  2 +-
 arch/x86/entry/vdso/vclock_gettime.c   | 16 +-
 arch/x86/include/asm/barrier.h | 11 
 arch/x86/include/asm/msr.h | 54 ---
 arch/x86/include/asm/paravirt.h| 34 
 arch/x86/include/asm/paravirt_types.h  |  2 -
 arch/x86/include/asm/pvclock.h | 21 
 arch/x86/include/asm/stackprotector.h  |  2 +-
 arch/x86/include/asm/tsc.h | 18 +--
 arch/x86/kernel/apb_timer.c|  8 +--
 arch/x86/kernel/apic/apic.c|  8 +--
 arch/x86/kernel/cpu/amd.c  |  6 +--
 arch/x86/kernel/cpu/mcheck/mce.c   |  4 +-
 arch/x86/kernel/espfix_64.c|  2 +-
 arch/x86/kernel/hpet.c |  4 +-
 arch/x86/kernel/paravirt.c |  2 -
 arch/x86/kernel/paravirt_patch_32.c|  2 -
 arch/x86/kernel/trace_clock.c  |  7 +--
 arch/x86/kernel/tsc.c  | 12 ++---
 arch/x86/kernel/tsc_sync.c | 14 +++--
 arch/x86/kvm/lapic.c   |  4 +-
 arch/x86/kvm/svm.c |  4 +-
 arch/x86/kvm/vmx.c |  4 +-
 arch/x86/kvm/x86.c | 26 +++--
 arch/x86/lib/delay.c   | 13 ++---
 arch/x86/um/asm/barrier.h  | 13 -
 arch/x86/xen/enlighten.c   |  3 --
 drivers/input/gameport/gameport.c  |  4 +-
 drivers/input/joystick/analog.c|  4 +-
 drivers/net/hamradio/baycom_epp.c  |  2 +-
 drivers/staging/media/lirc/lirc_serial.c   | 63 ++
 drivers/thermal/intel_powerclamp.c |  4 +-
 .../power/cpupower/debug/kernel/cpufreq-test_tsc.c |  4 +-
 33 files changed, 110 insertions(+), 267 deletions(-)

-- 
2.4.2



[PATCH v3 16/18] x86/tsc: In read_tsc, use rdtsc_ordered() instead of get_cycles()

2015-06-16 Thread Andy Lutomirski
There are two logical changes here.  First, this removes a check for
cpu_has_tsc.  That check is unnecessary, as we don't register the
TSC as a clocksource on systems that have no TSC.  Second, it adds a
barrier, thus preventing observable non-monotonicity.

I suspect that the missing barrier was never a problem in practice
because system calls themselves were heavy enough barriers to
prevent user code from observing time warps due to speculation.
(Without the corresponding barrier in the vDSO, however,
non-monotonicity is easy to detect.)

Signed-off-by: Andy Lutomirski l...@kernel.org
---
 arch/x86/kernel/tsc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 21d6e04e3e82..451bade0d320 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -961,7 +961,7 @@ static struct clocksource clocksource_tsc;
  */
 static cycle_t read_tsc(struct clocksource *cs)
 {
-   return (cycle_t)get_cycles();
+   return (cycle_t)rdtsc_ordered();
 }
 
 /*
-- 
2.4.2



[PATCH v3 18/18] x86/tsc: Remove rdtsc_barrier()

2015-06-16 Thread Andy Lutomirski
All callers have been converted to rdtsc_ordered().

Signed-off-by: Andy Lutomirski l...@kernel.org
---
 arch/x86/include/asm/barrier.h | 11 ---
 arch/x86/um/asm/barrier.h  | 13 -
 2 files changed, 24 deletions(-)

diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index e51a8f803f55..818cb8788225 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -91,15 +91,4 @@ do {   \
 #define smp_mb__before_atomic()barrier()
 #define smp_mb__after_atomic() barrier()
 
-/*
- * Stop RDTSC speculation. This is needed when you need to use RDTSC
- * (or get_cycles or vread that possibly accesses the TSC) in a defined
- * code region.
- */
-static __always_inline void rdtsc_barrier(void)
-{
-   alternative_2("", "mfence", X86_FEATURE_MFENCE_RDTSC,
-		  "lfence", X86_FEATURE_LFENCE_RDTSC);
-}
-
 #endif /* _ASM_X86_BARRIER_H */
diff --git a/arch/x86/um/asm/barrier.h b/arch/x86/um/asm/barrier.h
index b9531d343134..755481f14d90 100644
--- a/arch/x86/um/asm/barrier.h
+++ b/arch/x86/um/asm/barrier.h
@@ -45,17 +45,4 @@
 #define read_barrier_depends() do { } while (0)
 #define smp_read_barrier_depends() do { } while (0)
 
-/*
- * Stop RDTSC speculation. This is needed when you need to use RDTSC
- * (or get_cycles or vread that possibly accesses the TSC) in a defined
- * code region.
- *
- * (Could use an alternative three way for this if there was one.)
- */
-static inline void rdtsc_barrier(void)
-{
-   alternative_2("", "mfence", X86_FEATURE_MFENCE_RDTSC,
-		  "lfence", X86_FEATURE_LFENCE_RDTSC);
-}
-
 #endif
-- 
2.4.2



[PATCH v3 10/18] input/joystick/analog: Switch from rdtscl() to native_read_tsc()

2015-06-16 Thread Andy Lutomirski
This timing code is hideous, and this doesn't help.  It gets rid of
one of the last users of rdtscl, though.

Acked-by: Dmitry Torokhov dmitry.torok...@gmail.com
Cc: linux-in...@vger.kernel.org
Signed-off-by: Andy Lutomirski l...@kernel.org
---
 drivers/input/joystick/analog.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/input/joystick/analog.c b/drivers/input/joystick/analog.c
index 4284080e481d..f871b4f00056 100644
--- a/drivers/input/joystick/analog.c
+++ b/drivers/input/joystick/analog.c
@@ -143,7 +143,7 @@ struct analog_port {
 
 #include linux/i8253.h
 
-#define GET_TIME(x)	do { if (cpu_has_tsc) rdtscl(x); else x = get_time_pit(); } while (0)
+#define GET_TIME(x)	do { if (cpu_has_tsc) x = (unsigned int)native_read_tsc(); else x = get_time_pit(); } while (0)
 #define DELTA(x,y)	(cpu_has_tsc ? ((y) - (x)) : ((x) - (y) + ((x) < (y) ? PIT_TICK_RATE / HZ : 0)))
 #define TIME_NAME	(cpu_has_tsc?"TSC":"PIT")
 static unsigned int get_time_pit(void)
@@ -160,7 +160,7 @@ static unsigned int get_time_pit(void)
 return count;
 }
 #elif defined(__x86_64__)
-#define GET_TIME(x)	rdtscl(x)
+#define GET_TIME(x)	do { x = (unsigned int)native_read_tsc(); } while (0)
 #define DELTA(x,y)	((y)-(x))
 #define TIME_NAME	"TSC"
 #elif defined(__alpha__) || defined(CONFIG_MN10300) || defined(CONFIG_ARM) || defined(CONFIG_ARM64) || defined(CONFIG_TILE)
-- 
2.4.2



[PATCH v3 11/18] drivers/input/gameport: Replace rdtscl() with native_read_tsc()

2015-06-16 Thread Andy Lutomirski
It's unclear to me why this code exists in the first place.

Acked-by: Dmitry Torokhov dmitry.torok...@gmail.com
Cc: linux-in...@vger.kernel.org
Signed-off-by: Andy Lutomirski l...@kernel.org
---
 drivers/input/gameport/gameport.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/input/gameport/gameport.c 
b/drivers/input/gameport/gameport.c
index e853a2134680..abc0cb22e750 100644
--- a/drivers/input/gameport/gameport.c
+++ b/drivers/input/gameport/gameport.c
@@ -149,9 +149,9 @@ static int old_gameport_measure_speed(struct gameport 
*gameport)
 
for(i = 0; i  50; i++) {
local_irq_save(flags);
-   rdtscl(t1);
+   t1 = native_read_tsc();
for (t = 0; t  50; t++) gameport_read(gameport);
-   rdtscl(t2);
+   t2 = native_read_tsc();
local_irq_restore(flags);
udelay(i * 10);
if (t2 - t1 < tx) tx = t2 - t1;
-- 
2.4.2



[PATCH v3 15/18] x86/tsc: Use rdtsc_ordered() in check_tsc_warp() and drop extra barriers

2015-06-16 Thread Andy Lutomirski
Using get_cycles was unnecessary: check_tsc_warp() is not called on
TSC-less systems.  Replace "rdtsc_barrier(); get_cycles()" with
rdtsc_ordered().

While we're at it, make the somewhat more dangerous change of
removing barrier_before_rdtsc after RDTSC in the TSC warp check
code.  This should be okay, though -- the vDSO TSC code doesn't have
that barrier, so, if removing the barrier from the warp check would
cause us to detect a warp that we otherwise wouldn't detect, then we
have a genuine bug.

Signed-off-by: Andy Lutomirski l...@kernel.org
---
 arch/x86/kernel/tsc_sync.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/tsc_sync.c b/arch/x86/kernel/tsc_sync.c
index dd8d0791dfb5..78083bf23ed1 100644
--- a/arch/x86/kernel/tsc_sync.c
+++ b/arch/x86/kernel/tsc_sync.c
@@ -39,16 +39,15 @@ static cycles_t max_warp;
 static int nr_warps;
 
 /*
- * TSC-warp measurement loop running on both CPUs:
+ * TSC-warp measurement loop running on both CPUs.  This is not called
+ * if there is no TSC.
  */
 static void check_tsc_warp(unsigned int timeout)
 {
cycles_t start, now, prev, end;
int i;
 
-   rdtsc_barrier();
-   start = get_cycles();
-   rdtsc_barrier();
+   start = rdtsc_ordered();
/*
 * The measurement runs for 'timeout' msecs:
 */
@@ -63,9 +62,7 @@ static void check_tsc_warp(unsigned int timeout)
 */
arch_spin_lock(&sync_lock);
prev = last_tsc;
-   rdtsc_barrier();
-   now = get_cycles();
-   rdtsc_barrier();
+   now = rdtsc_ordered();
last_tsc = now;
arch_spin_unlock(&sync_lock);
 
@@ -126,7 +123,7 @@ void check_tsc_sync_source(int cpu)
 
/*
 * No need to check if we already know that the TSC is not
-* synchronized:
+* synchronized or if we have no TSC.
 */
if (unsynchronized_tsc())
return;
@@ -190,6 +187,7 @@ void check_tsc_sync_target(void)
 {
int cpus = 2;
 
+   /* Also aborts if there is no TSC. */
if (unsynchronized_tsc() || tsc_clocksource_reliable)
return;
 
-- 
2.4.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 13/18] x86/tsc: Rename native_read_tsc() to rdtsc()

2015-06-16 Thread Andy Lutomirski
Now that there is no paravirt TSC, the "native" is inappropriate.
The function does RDTSC, so give it the obvious name: rdtsc().

Suggested-by: Borislav Petkov b...@suse.de
Signed-off-by: Andy Lutomirski l...@kernel.org
---
 arch/x86/boot/compressed/aslr.c  |  2 +-
 arch/x86/entry/vdso/vclock_gettime.c |  2 +-
 arch/x86/include/asm/msr.h   | 11 ++-
 arch/x86/include/asm/pvclock.h   |  2 +-
 arch/x86/include/asm/stackprotector.h|  2 +-
 arch/x86/include/asm/tsc.h   |  2 +-
 arch/x86/kernel/apb_timer.c  |  8 
 arch/x86/kernel/apic/apic.c  |  8 
 arch/x86/kernel/cpu/amd.c|  4 ++--
 arch/x86/kernel/cpu/mcheck/mce.c |  4 ++--
 arch/x86/kernel/espfix_64.c  |  2 +-
 arch/x86/kernel/hpet.c   |  4 ++--
 arch/x86/kernel/trace_clock.c|  2 +-
 arch/x86/kernel/tsc.c|  4 ++--
 arch/x86/kvm/lapic.c |  4 ++--
 arch/x86/kvm/svm.c   |  4 ++--
 arch/x86/kvm/vmx.c   |  4 ++--
 arch/x86/kvm/x86.c   | 12 ++--
 arch/x86/lib/delay.c |  8 
 drivers/input/gameport/gameport.c|  4 ++--
 drivers/input/joystick/analog.c  |  4 ++--
 drivers/net/hamradio/baycom_epp.c|  2 +-
 drivers/thermal/intel_powerclamp.c   |  4 ++--
 tools/power/cpupower/debug/kernel/cpufreq-test_tsc.c |  4 ++--
 24 files changed, 58 insertions(+), 49 deletions(-)

diff --git a/arch/x86/boot/compressed/aslr.c b/arch/x86/boot/compressed/aslr.c
index ea33236190b1..6a9b96b4624d 100644
--- a/arch/x86/boot/compressed/aslr.c
+++ b/arch/x86/boot/compressed/aslr.c
@@ -82,7 +82,7 @@ static unsigned long get_random_long(void)
 
if (has_cpuflag(X86_FEATURE_TSC)) {
debug_putstr(" RDTSC");
-   raw = native_read_tsc();
+   raw = rdtsc();
 
random ^= raw;
use_i8254 = false;
diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index 972b488ac16a..0340d93c18ca 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -186,7 +186,7 @@ notrace static cycle_t vread_tsc(void)
 * but no one has ever seen it happen.
 */
rdtsc_barrier();
-   ret = (cycle_t)native_read_tsc();
+   ret = (cycle_t)rdtsc();
 
last = gtod-cycle_last;
 
diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index c89ed6ceed02..ff0c120dafe5 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -109,7 +109,16 @@ notrace static inline int native_write_msr_safe(unsigned int msr,
 extern int rdmsr_safe_regs(u32 regs[8]);
 extern int wrmsr_safe_regs(u32 regs[8]);
 
-static __always_inline unsigned long long native_read_tsc(void)
+/**
+ * rdtsc() - returns the current TSC without ordering constraints
+ *
+ * rdtsc() returns the result of RDTSC as a 64-bit integer.  The
+ * only ordering constraint it supplies is the ordering implied by
+ * asm volatile: it will put the RDTSC in the place you expect.  The
+ * CPU can and will speculatively execute that RDTSC, though, so the
+ * results can be non-monotonic if compared on different CPUs.
+ */
+static __always_inline unsigned long long rdtsc(void)
 {
DECLARE_ARGS(val, low, high);
 
diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 71bd485c2986..6084bce345fc 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -62,7 +62,7 @@ static inline u64 pvclock_scale_delta(u64 delta, u32 
mul_frac, int shift)
 static __always_inline
 u64 pvclock_get_nsec_offset(const struct pvclock_vcpu_time_info *src)
 {
-   u64 delta = native_read_tsc() - src-tsc_timestamp;
+   u64 delta = rdtsc() - src-tsc_timestamp;
return pvclock_scale_delta(delta, src-tsc_to_system_mul,
   src-tsc_shift);
 }
diff --git a/arch/x86/include/asm/stackprotector.h 
b/arch/x86/include/asm/stackprotector.h
index bc5fa2af112e..58505f01962f 100644
--- a/arch/x86/include/asm/stackprotector.h
+++ b/arch/x86/include/asm/stackprotector.h
@@ -72,7 +72,7 @@ static __always_inline void boot_init_stack_canary(void)
 * on during the bootup the random pool has true entropy too.
 */
get_random_bytes(canary, sizeof(canary));
-   tsc = native_read_tsc();
+   tsc = rdtsc();
canary += tsc + (tsc << 32UL);
 
current-stack_canary = canary;
diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index b4883902948b..3df7675debcf 100644
--- 

[PATCH v3 12/18] x86/tsc: Remove rdtscl()

2015-06-16 Thread Andy Lutomirski
It has no more callers, and it was never a very sensible interface
to begin with.  Users of the TSC should either read all 64 bits or
explicitly throw out the high bits.

Signed-off-by: Andy Lutomirski l...@kernel.org
---
 arch/x86/include/asm/msr.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index 626f78199665..c89ed6ceed02 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -189,9 +189,6 @@ do {   \
 
 #endif /* !CONFIG_PARAVIRT */
 
-#define rdtscl(low)\
-   ((low) = (u32)native_read_tsc())
-
 /*
  * 64-bit version of wrmsr_safe():
  */
-- 
2.4.2



[PATCH] KVM: Avoid warning user requested TSC rate below hardware speed when create VM.

2015-06-16 Thread Lan Tianyu
KVM populates max_tsc_khz with tsc_khz at arch init time on
constant-TSC machines and creates VMs with max_tsc_khz as the TSC rate.
However, tsc_khz may change while the tsc clocksource driver refines
its calibration. This causes VMs to be created with a slow TSC and
produces the following warning. To fix the issue, compare max_tsc_khz
with tsc_khz and update max_tsc_khz with the new value of tsc_khz if it
has changed by the time a VM is created.

[   94.916906] [ cut here ]
[   94.922127] WARNING: CPU: 0 PID: 824 at arch/x86/kvm/vmx.c:2272 
vmx_set_tsc_khz+0x3e/0x50()
[   94.931503] user requested TSC rate below hardware speed
[   94.937470] Modules linked in:
[   94.940923] CPU: 0 PID: 824 Comm: qemu-system-x86 Tainted: G  D W   
4.1.0-rc3+ #4
[   94.960350]  81f453f8 88027e9f3bc8 81b5eb8a 

[   94.968721]  88027e9f3c18 88027e9f3c08 810e6f8a 
8802
[   94.977103]  001d3300 88027e98 0001 
88027e98
[   94.985476] Call Trace:
[   94.988242]  [<ffffffff81b5eb8a>] dump_stack+0x45/0x57
[   94.994020]  [<ffffffff810e6f8a>] warn_slowpath_common+0x8a/0xc0
[   95.000772]  [<ffffffff810e7006>] warn_slowpath_fmt+0x46/0x50
[   95.007222]  [<ffffffff8104676e>] vmx_set_tsc_khz+0x3e/0x50
[   95.013488]  [<ffffffff810112f7>] kvm_set_tsc_khz.part.106+0xa7/0xe0
[   95.020618]  [<ffffffff8101e628>] kvm_arch_vcpu_init+0x208/0x230
[   95.027373]  [<ffffffff81003bf9>] kvm_vcpu_init+0xc9/0x110
[   95.033540]  [<ffffffff81049fd0>] vmx_create_vcpu+0x70/0xc30
[   95.039911]  [<ffffffff81049f80>] ? vmx_create_vcpu+0x20/0xc30
[   95.046476]  [<ffffffff8101dc9e>] kvm_arch_vcpu_create+0x3e/0x60
[   95.053233]  [<ffffffff81009f00>] kvm_vm_ioctl+0x1a0/0x770
[   95.059400]  [<ffffffff8129f395>] ? __fget+0x5/0x200
[   95.064991]  [<ffffffff8115b85f>] ? rcu_irq_exit+0x7f/0xd0
[   95.071157]  [<ffffffff81293448>] do_vfs_ioctl+0x308/0x540
[   95.077323]  [<ffffffff8129f301>] ? expand_files+0x1f1/0x280
[   95.083684]  [<ffffffff8147836b>] ? selinux_file_ioctl+0x5b/0x100
[   95.090538]  [<ffffffff81293701>] SyS_ioctl+0x81/0xa0
[   95.096218]  [<ffffffff81b6a72e>] system_call_fastpath+0x12/0x76
[   95.102974] ---[ end trace 08ade884081d9dd7 ]---

Link: https://bugzilla.kernel.org/show_bug.cgi?id=99861
Signed-off-by: Lan Tianyu tianyu@intel.com
---
 arch/x86/kvm/x86.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 43f0df7..6c7fefe 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7814,6 +7814,7 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 {
struct page *page;
struct kvm *kvm;
+   int cpu;
int r;
 
BUG_ON(vcpu->kvm == NULL);
@@ -7833,6 +7834,21 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
}
vcpu->arch.pio_data = page_address(page);
 
+   /*
+    * max_tsc_khz records tsc_khz at arch init time on constant-TSC
+    * machines. However, tsc_khz may change while the tsc clocksource
+    * driver refines its calibration, causing VMs to be created with a
+    * slow TSC and producing the warning above. To avoid that, check
+    * whether tsc_khz has changed here and update max_tsc_khz with the
+    * new value if so.
+    */
+   if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) && max_tsc_khz != tsc_khz) {
+           max_tsc_khz = tsc_khz;
+           pr_debug("kvm: max_tsc_khz is changed to %ld\n", max_tsc_khz);
+           for_each_online_cpu(cpu)
+                   smp_call_function_single(cpu, tsc_khz_changed, NULL, 1);
+   }
+
kvm_set_tsc_khz(vcpu, max_tsc_khz);
 
r = kvm_mmu_create(vcpu);
-- 
1.8.4.rc0.1.g8f6a3e5.dirty



Re: [PATCH 5/7] userfaultfd: switch to exclusive wakeup for blocking reads

2015-06-16 Thread Linus Torvalds
On Mon, Jun 15, 2015 at 12:19 PM, Andrea Arcangeli aarca...@redhat.com wrote:

 Yes, it would leave the other blocked, how is it different from having
 just 1 reader and it gets killed?

Either is completely wrong. But the read() code can at least see that
I'm returning early due to a signal, so I'll wake up any other
waiters.

Poll simply *cannot* do that. Because by definition poll always
returns without actually clearing the thing that caused the wakeup.

So for poll(), using exclusive waits is wrong very much by
definition. For read(), you *can* use exclusive waits correctly, it
just requires you to wake up others if you don't read all the data
(either due to being killed by a signal, or because the read was
incomplete).
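
The rule is mechanical enough to sketch. Below is a minimal
illustration of a blocking read() built on an exclusive wait; my_dev,
data_available() and copy_out_events() are invented for illustration,
so this is a sketch of the pattern, not the userfaultfd code. The
point is the two compensating wake_up() calls, without which another
exclusive waiter can be left sleeping on data it will never see:

static ssize_t my_dev_read(struct file *file, char __user *buf,
			   size_t count, loff_t *ppos)
{
	struct my_dev *dev = file->private_data;
	ssize_t ret;

	/* Exclusive wait: a wake_up() on dev->wq wakes only one reader. */
	ret = wait_event_interruptible_exclusive(dev->wq,
						 data_available(dev));
	if (ret) {
		/* Interrupted by a signal: pass the wakeup on. */
		wake_up(&dev->wq);
		return ret;
	}

	ret = copy_out_events(dev, buf, count);

	/* Incomplete read: data is left over, so wake another waiter. */
	if (data_available(dev))
		wake_up(&dev->wq);

	return ret;
}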

 If any qemu thread gets killed the thing is going to be noticeable,
 there's no fault-tolerance-double-thread for anything.

What does qemu have to do with anything?

We don't implement kernel interfaces that are broken, and that can
leave processes blocked when they shouldn't be blocked. We also don't
implement kernel interfaces that only work with one program and then
say "if that program is broken, it's not our problem".

 I'm not saying doing wakeone is easy [...]

Bullshit, Andrea.

That's *exactly* what you said in the commit message for the broken
patch that I complained about. And I quote:

  Blocking reads can easily use exclusive wakeups. Poll in theory
could too but there's no poll_wait_exclusive in common code yet

and I pointed out that your commit message was garbage, and that it's
not at all as easy as you claim, and that your patch was broken, and
your description was even more broken.

The whole "poll cannot use exclusive waits" has _nothing_ to do with us
not having poll_wait_exclusive(). Poll *fundamentally* cannot use
exclusive waits. Your commit message was garbage, and actively
misleading. Don't make excuses.
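
To see the contrast, compare the shape of a poll handler; my_dev and
data_available() are the same invented names as in the read() sketch
above. poll_wait() only registers the waiter and the handler returns a
readiness mask - nothing here consumes the event, so one woken poller
leaves the condition set and every other waiter must be woken too:

static unsigned int my_dev_poll(struct file *file, poll_table *wait)
{
	struct my_dev *dev = file->private_data;

	/* Registers on the waitqueue; never clears the event. */
	poll_wait(file, &dev->wq, wait);

	return data_available(dev) ? (POLLIN | POLLRDNORM) : 0;
}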

 Linus


Re: [PATCH 06/10] KVM: arm/arm64: vgic: Allow dynamic mapping of physical/virtual interrupts

2015-06-16 Thread Marc Zyngier
Hi Eric,

On 15/06/15 16:44, Eric Auger wrote:
 Hi Marc,
 On 06/08/2015 07:04 PM, Marc Zyngier wrote:
 In order to be able to feed physical interrupts to a guest, we need
 to be able to establish the virtual-physical mapping between the two
 worlds.

 The mapping is kept in an rbtree, indexed by virtual interrupts.

 Signed-off-by: Marc Zyngier marc.zyng...@arm.com
 ---
  include/kvm/arm_vgic.h |  18 
  virt/kvm/arm/vgic.c| 110 ++++++++++++++++++++++++++++++++++++++++++++++++++
  2 files changed, 128 insertions(+)

 diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
 index 4f9fa1d..33d121a 100644
 --- a/include/kvm/arm_vgic.h
 +++ b/include/kvm/arm_vgic.h
 @@ -159,6 +159,14 @@ struct vgic_io_device {
  struct kvm_io_device dev;
  };
  
 +struct irq_phys_map {
 +struct rb_node  node;
 +u32 virt_irq;
 +u32 phys_irq;
 +u32 irq;
 +boolactive;
 +};
 +
  struct vgic_dist {
  spinlock_t  lock;
  boolin_kernel;
 @@ -256,6 +264,10 @@ struct vgic_dist {
  struct vgic_vm_ops  vm_ops;
  struct vgic_io_device   dist_iodev;
  struct vgic_io_device   *redist_iodevs;
 +
 +/* Virtual irq to hwirq mapping */
 +spinlock_t  irq_phys_map_lock;
 +struct rb_root  irq_phys_map;
  };
  
  struct vgic_v2_cpu_if {
 @@ -307,6 +319,9 @@ struct vgic_cpu {
  struct vgic_v2_cpu_if   vgic_v2;
  struct vgic_v3_cpu_if   vgic_v3;
  };
 +
 +/* Protected by the distributor's irq_phys_map_lock */
 +struct rb_root  irq_phys_map;
  };
  
  #define LR_EMPTY0xff
 @@ -331,6 +346,9 @@ int kvm_vgic_inject_irq(struct kvm *kvm, int cpuid, 
 unsigned int irq_num,
  void vgic_v3_dispatch_sgi(struct kvm_vcpu *vcpu, u64 reg);
  int kvm_vgic_vcpu_pending_irq(struct kvm_vcpu *vcpu);
  int kvm_vgic_vcpu_active_irq(struct kvm_vcpu *vcpu);
 +struct irq_phys_map *vgic_map_phys_irq(struct kvm_vcpu *vcpu,
 +   int virt_irq, int irq);
 +int vgic_unmap_phys_irq(struct kvm_vcpu *vcpu, struct irq_phys_map *map);
  
  #define irqchip_in_kernel(k)	(!!((k)->arch.vgic.in_kernel))
  #define vgic_initialized(k)	(!!((k)->arch.vgic.nr_cpus))
 diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
 index 59ed7a3..c6604f2 100644
 --- a/virt/kvm/arm/vgic.c
 +++ b/virt/kvm/arm/vgic.c
 @@ -24,6 +24,7 @@
  #include <linux/of.h>
  #include <linux/of_address.h>
  #include <linux/of_irq.h>
 +#include <linux/rbtree.h>
  #include <linux/uaccess.h>
  
  #include <linux/irqchip/arm-gic.h>
 @@ -84,6 +85,8 @@ static void vgic_retire_disabled_irqs(struct kvm_vcpu *vcpu);
  static void vgic_retire_lr(int lr_nr, int irq, struct kvm_vcpu *vcpu);
  static struct vgic_lr vgic_get_lr(const struct kvm_vcpu *vcpu, int lr);
  static void vgic_set_lr(struct kvm_vcpu *vcpu, int lr, struct vgic_lr lr_desc);
 +static struct irq_phys_map *vgic_irq_map_search(struct kvm_vcpu *vcpu,
 +int virt_irq);
  
  static const struct vgic_ops *vgic_ops;
  static const struct vgic_params *vgic;
 @@ -1585,6 +1588,112 @@ static irqreturn_t vgic_maintenance_handler(int irq, void *data)
  return IRQ_HANDLED;
  }
  
 +static struct rb_root *vgic_get_irq_phys_map(struct kvm_vcpu *vcpu,
 + int virt_irq)
 +{
 +	if (virt_irq < VGIC_NR_PRIVATE_IRQS)
 +		return &vcpu->arch.vgic_cpu.irq_phys_map;
 +	else
 +		return &vcpu->kvm->arch.vgic.irq_phys_map;
 +}
 +
 +struct irq_phys_map *vgic_map_phys_irq(struct kvm_vcpu *vcpu,
 +   int virt_irq, int irq)
 +{
 +	struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
 +	struct rb_root *root = vgic_get_irq_phys_map(vcpu, virt_irq);
 +	struct rb_node **new = &root->rb_node, *parent = NULL;
 +	struct irq_phys_map *new_map;
 +	struct irq_desc *desc;
 +	struct irq_data *data;
 +	int phys_irq;
 +
 +	desc = irq_to_desc(irq);
 +	if (!desc) {
 +		kvm_err("kvm_arch_timer: can't obtain interrupt descriptor\n");
 +		return NULL;
 +	}
 +
 +	data = irq_desc_get_irq_data(desc);
 +	while (data->parent_data)
 +		data = data->parent_data;
 +
 +	phys_irq = data->hwirq;
 +
 +	spin_lock(&dist->irq_phys_map_lock);
 +
 +	/* Boilerplate rb_tree code */
 +	while (*new) {
 +		struct irq_phys_map *this;
 +
 +		this = container_of(*new, struct irq_phys_map, node);
 +		parent = *new;
 +		if (this->virt_irq < virt_irq)
 +			new = &(*new)->rb_left;
 +		else if (this->virt_irq > virt_irq)
 +			new = &(*new)->rb_right;
 +else {
 +new_map = this;
 in case the mapping already exists you don't update the mapping or
 return an error. Is that what you want here?

Calling the map function several times is not necessarily a bad idea, as
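
The "Boilerplate rb_tree code" above is the standard kernel rbtree
insert idiom. For reference, a generic sketch of the complete pattern
including the rb_link_node()/rb_insert_color() tail; struct foo and
its key field are made up here, this is not the vgic code:

#include <linux/rbtree.h>

struct foo {
	struct rb_node node;
	u32 key;
};

static struct foo *foo_insert(struct rb_root *root, struct foo *new)
{
	struct rb_node **link = &root->rb_node, *parent = NULL;

	while (*link) {
		struct foo *this = rb_entry(*link, struct foo, node);

		parent = *link;
		if (new->key < this->key)
			link = &(*link)->rb_left;
		else if (new->key > this->key)
			link = &(*link)->rb_right;
		else
			return this;	/* key already present: hand it back */
	}

	rb_link_node(&new->node, parent, link);	/* hook the node in */
	rb_insert_color(&new->node, root);	/* rebalance and recolor */
	return new;
}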

Re: [PATCH 06/10] KVM: arm/arm64: vgic: Allow dynamic mapping of physical/virtual interrupts

2015-06-16 Thread Eric Auger
On 06/16/2015 10:28 AM, Marc Zyngier wrote:
 Hi Eric,
 
 [...]
 in case the mapping already exists you don't update the mapping or
 return an error. Is that what you want here?
 
 Calling the map function several times is not necessarily a bad idea, as
 long 

Re: [PATCH] arm64: KVM: Optimize arm64 guest exit VFP/SIMD register save/restore

2015-06-16 Thread Marc Zyngier
On 16/06/15 04:04, Mario Smarduch wrote:
 On 06/15/2015 11:20 AM, Marc Zyngier wrote:
 On 15/06/15 19:04, Mario Smarduch wrote:
 On 06/15/2015 03:00 AM, Marc Zyngier wrote:
 Hi Mario,

 [ ... ]

 On 13/06/15 23:20, Mario Smarduch wrote:
 Currently VFP/SIMD registers are always saved and restored
 on Guest entry and exit.

 This patch only saves and restores VFP/SIMD registers on
 Guest access. To do this cptr_el2 VFP/SIMD trap is set
 on Guest entry and later checked on exit. This follows
 the ARMv7 VFPv3 implementation. Running an informal test
 there are high number of exits that don't access VFP/SIMD
 registers.

 It would be good to add some numbers here. How often do we exit without
 having touched the FPSIMD regs? For which workload?

 Lmbench is what I typically use, with an ssh server, i.e., causing
 page faults and interrupts - usually the registers are not touched.
 I'll run the tests again and define "usually".

 Any other loads you had in mind?

 Not really (apart from running hackbench, of course...;-). I'd just like
 to see the numbers in the commit message, so that we can document the
 improvement (and maybe track regressions).
 
 Hi Marc,
 here are some ballpark numbers.
 
 hackbench: about 30% of the time the optimized path is taken
 (for the 10*40 test).
 
 Lmbench3: upwards of 50% for context switching, memory bw,
 pipe, proc creation, sys call. There are a lot more tests
 but I limited it to these. In addition, other processes
 are running in the background (NTP, SSH, ...) doing their
 own thing.
 
 I added a tmp counter to kvm_vcpu_arch to count vfpsimd
 events.

That looks good. Please include these numbers in the commit message for
v2 of that patch.

Thanks,

M.
-- 
Jazz is not dead. It just smells funny...
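
The mechanism being measured is easy to model. A self-contained toy of
the lazy switch (every name below is hypothetical; the real patch
flips the trap bit in cptr_el2 rather than a bool):

#include <stdbool.h>
#include <stdio.h>

struct vcpu_state {
	bool guest_used_fpsimd;
	bool trap_enabled;
};

static void guest_entry(struct vcpu_state *v)
{
	v->guest_used_fpsimd = false;
	v->trap_enabled = true;		/* models arming the FPSIMD trap */
}

static void fpsimd_trap(struct vcpu_state *v)
{
	/* First guest FPSIMD access: switch the registers once, untrap. */
	puts("save host FPSIMD, restore guest FPSIMD");
	v->trap_enabled = false;
	v->guest_used_fpsimd = true;
}

static void guest_exit(struct vcpu_state *v)
{
	if (v->guest_used_fpsimd)	/* skipped on the common exit path */
		puts("save guest FPSIMD, restore host FPSIMD");
}

int main(void)
{
	struct vcpu_state v;

	guest_entry(&v);		/* exit with no FPSIMD use: no cost */
	guest_exit(&v);

	guest_entry(&v);		/* guest touches FPSIMD: one switch */
	fpsimd_trap(&v);
	guest_exit(&v);
	return 0;
}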


[PATCH 2/2] KVM: fix checkpatch.pl errors in kvm/coalesced_mmio.h

2015-06-16 Thread Kevin Mulvey
Tabs rather than spaces

Signed-off-by: Kevin Mulvey kmul...@linux.com
---
 virt/kvm/coalesced_mmio.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/coalesced_mmio.h b/virt/kvm/coalesced_mmio.h
index b280c20..5cbf190 100644
--- a/virt/kvm/coalesced_mmio.h
+++ b/virt/kvm/coalesced_mmio.h
@@ -24,9 +24,9 @@ struct kvm_coalesced_mmio_dev {
 int kvm_coalesced_mmio_init(struct kvm *kvm);
 void kvm_coalesced_mmio_free(struct kvm *kvm);
 int kvm_vm_ioctl_register_coalesced_mmio(struct kvm *kvm,
-   struct kvm_coalesced_mmio_zone *zone);
+					  struct kvm_coalesced_mmio_zone *zone);
 int kvm_vm_ioctl_unregister_coalesced_mmio(struct kvm *kvm,
- struct kvm_coalesced_mmio_zone *zone);
+					    struct kvm_coalesced_mmio_zone *zone);
 
 #else
 
-- 
2.4.3



[PATCH 1/2] KVM: fix checkpatch.pl errors in kvm/async_pf.h

2015-06-16 Thread Kevin Mulvey
fix brace spacing

Signed-off-by: Kevin Mulvey kmul...@linux.com
---
 virt/kvm/async_pf.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/async_pf.h b/virt/kvm/async_pf.h
index e7ef6447..ec4cfa2 100644
--- a/virt/kvm/async_pf.h
+++ b/virt/kvm/async_pf.h
@@ -29,8 +29,8 @@ void kvm_async_pf_deinit(void);
 void kvm_async_pf_vcpu_init(struct kvm_vcpu *vcpu);
 #else
 #define kvm_async_pf_init() (0)
-#define kvm_async_pf_deinit() do{}while(0)
-#define kvm_async_pf_vcpu_init(C) do{}while(0)
+#define kvm_async_pf_deinit() do {} while (0)
+#define kvm_async_pf_vcpu_init(C) do {} while (0)
 #endif
 
 #endif
-- 
2.4.3
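
Beyond checkpatch aesthetics, do {} while (0) is the usual idiom for a
statement-like no-op stub: it expands to a single statement that still
demands its trailing semicolon, so call sites compile identically
whether the real function or the stub is configured in. A runnable toy
(main() and have_async_pf are invented for illustration):

#include <stdio.h>

#define kvm_async_pf_vcpu_init(C) do {} while (0)

int main(void)
{
	int have_async_pf = 0;

	if (have_async_pf)
		kvm_async_pf_vcpu_init(NULL);	/* one statement, safe unbraced */
	else
		puts("async_pf disabled: the stub compiles to nothing");
	return 0;
}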



[GIT PULL] KVM fixes for 4.1-rc8

2015-06-16 Thread Marcelo Tosatti

Linus,

Please pull from

git://git.kernel.org/pub/scm/virt/kvm/kvm.git master

To receive the following KVM bug fix, which restores 
APIC migration functionality.


Radim Krčmář (1):
  KVM: x86: fix lapic.timer_mode on restore

 arch/x86/kvm/lapic.c |   26 --
 1 file changed, 16 insertions(+), 10 deletions(-)