[PATCH] kvm: x86: Fix kvm clock versioning.

2014-11-03 Thread Owen Hofmann
kvm updates the version number for the guest paravirt clock structure by
incrementing the version of its private copy. It does not read the guest
version, so it will write version = 2 in the first update for every new VM,
including after restoring a saved state. If guest state is saved while the
guest is reading the clock, the guest could later read and accept struct
fields and a guest TSC taken from two different updates. This patch changes
the code to increment the guest's version and write it back.

Signed-off-by: Owen Hofmann 
---
 arch/x86/kvm/x86.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0033df3..c25c9d8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1637,16 +1637,16 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
vcpu->last_guest_tsc = tsc_timestamp;
 
+   if (unlikely(kvm_read_guest_cached(v->kvm, &vcpu->pv_time,
+   &guest_hv_clock, sizeof(guest_hv_clock))))
+   return 0;
+
/*
 * The interface expects us to write an even number signaling that the
 * update is finished. Since the guest won't see the intermediate
 * state, we just increase by 2 at the end.
 */
-   vcpu->hv_clock.version += 2;
-
-   if (unlikely(kvm_read_guest_cached(v->kvm, &vcpu->pv_time,
-   &guest_hv_clock, sizeof(guest_hv_clock
-   return 0;
+   vcpu->hv_clock.version = guest_hv_clock.version + 2;
 
/* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
pvclock_flags = (guest_hv_clock.flags & PVCLOCK_GUEST_STOPPED);
-- 
2.1.0.rc2.206.gedb03e5
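
For context, the guest reads this structure with a seqlock-style version
check. The sketch below is simplified and illustrative (struct and field
names are not the actual guest code); it shows why restarting the version
sequence at 2 after a restore is dangerous: a guest saved mid-read can later
observe a matching even version around fields taken from two different
updates.

#include <stdint.h>

/* Simplified subset of the pvclock structure the host shares with the guest. */
struct pvclock_time_info {
	uint32_t version;		/* odd while the host is mid-update */
	uint64_t tsc_timestamp;
	uint64_t system_time;
	/* ... tsc_to_system_mul, tsc_shift, flags ... */
};

/*
 * Seqlock-style read: retry while the version is odd or changed between the
 * two reads.  Real guest code also scales the TSC delta; that is omitted here.
 */
uint64_t pvclock_read(const volatile struct pvclock_time_info *src)
{
	uint32_t version;
	uint64_t time;

	do {
		version = src->version;
		__atomic_thread_fence(__ATOMIC_ACQUIRE);  /* read fields only after version */
		time = src->system_time;                  /* + scaled TSC delta in real code */
		__atomic_thread_fence(__ATOMIC_ACQUIRE);
	} while ((version & 1) || version != src->version);

	return time;
}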



Re: [PATCH v13 09/11] pvqspinlock, x86: Add para-virtualization support

2014-11-03 Thread Waiman Long

On 11/03/2014 05:35 AM, Peter Zijlstra wrote:

On Wed, Oct 29, 2014 at 04:19:09PM -0400, Waiman Long wrote:

  arch/x86/include/asm/pvqspinlock.h|  411 +

I do wonder why all this needs to live in x86..



I haven't looked into the para-virtualization code in other 
architectures to see if my PV code is equally applicable there. That is 
why I put it under the x86 directory. If other architectures decide to 
use qspinlock with paravirtualization, we may need to pull out some 
common code, if any, back to kernel/locking.




+#ifdef CONFIG_QUEUE_SPINLOCK
+
+static __always_inline void pv_kick_cpu(int cpu)
+{
+   PVOP_VCALLEE1(pv_lock_ops.kick_cpu, cpu);
+}
+
+static __always_inline void pv_lockwait(u8 *lockbyte)
+{
+   PVOP_VCALLEE1(pv_lock_ops.lockwait, lockbyte);
+}
+
+static __always_inline void pv_lockstat(enum pv_lock_stats type)
+{
+   PVOP_VCALLEE1(pv_lock_ops.lockstat, type);
+}

Why are any of these PV ops? they're only called from other pv_*()
functions. What's the point of pv ops you only call from pv code?


It is the same reason that you made them PV ops in your patch. Even when 
PV is on, the code won't need to call any of the PV ops most of the 
time. So it is just a matter of optimizing the most common case at the 
expense of performance in the rare case that the CPU needs to be halted 
and woken up, which will be bad performance-wise anyway. However, if you 
think they should be regular function pointers, I am fine with that too.



+/*
+ * Queue Spinlock Para-Virtualization (PV) Support
+ *
+ * The PV support code for queue spinlock is roughly the same as that
+ * of the ticket spinlock.

Relative comments are bad, esp. since we'll make the ticket code go away
if this works, at which point this is a reference into a black hole.


Thanks for the suggestion, I will remove that when I revise the 
patch.



 Each CPU waiting for the lock will spin until it
+ * reaches a threshold. When that happens, it will put itself into a halted state
+ * so that the hypervisor can reuse the CPU cycles in other guests as
+ * well as return the held-up CPUs faster.





+/**
+ * queue_spin_lock - acquire a queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ *
+ * N.B. INLINE_SPIN_LOCK should not be enabled when PARAVIRT_SPINLOCK is on.

One should write a compile time fail for that, not a comment.


Will do that.
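
One possible shape of that check, as a hedged sketch (the exact Kconfig
symbol names would need to be confirmed against the tree):

#if defined(CONFIG_INLINE_SPIN_LOCK) && defined(CONFIG_PARAVIRT_SPINLOCKS)
#error "INLINE_SPIN_LOCK cannot be enabled together with PARAVIRT_SPINLOCKS"
#endif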


+ */
+static __always_inline void queue_spin_lock(struct qspinlock *lock)
+{
+   u32 val;
+
+   val = atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL);
+   if (likely(val == 0))
+   return;
+   if (static_key_false(&paravirt_spinlocks_enabled))
+   pv_queue_spin_lock_slowpath(lock, val);
+   else
+   queue_spin_lock_slowpath(lock, val);
+}

No, this is just vile... _that_ is what we have PV ops for. And at that
point it's the same function it was before the PV stuff, so that whole
inline thing is then gone.


I did that because in all the architectures except s390, the lock 
functions are not inlined. They live in the _raw_spin_lock* functions 
defined in kernel/locking/spinlock.c. The unlock functions, on the other 
hand, are all inlined except when PV spinlock is enabled. So adding a 
check for the jump label won't change the status quo.



+extern void queue_spin_unlock_slowpath(struct qspinlock *lock);
+
  /**
   * queue_spin_unlock - release a queue spinlock
   * @lock : Pointer to queue spinlock structure
   *
   * An effective smp_store_release() on the least-significant byte.
+ *
+ * Inlining of the unlock function is disabled when CONFIG_PARAVIRT_SPINLOCKS
+ * is defined. So _raw_spin_unlock() will be the only call site that will
+ * have to be patched.

Again, if you hard-rely on the non-inlining, make it a build failure, not a
comment.


Will do that.


   */
  static inline void queue_spin_unlock(struct qspinlock *lock)
  {
barrier();
+   if (!static_key_false(&paravirt_spinlocks_enabled)) {
+   native_spin_unlock(lock);
+   return;
+   }

+   /*
+* Need to atomically clear the lock byte to avoid racing with
+* queue head waiter trying to set _QLOCK_LOCKED_SLOWPATH.
+*/
+   if (unlikely(cmpxchg((u8 *)lock, _Q_LOCKED_VAL, 0) != _Q_LOCKED_VAL))
+   queue_spin_unlock_slowpath(lock);
+}

Idem, that static key stuff is wrong, use PV ops to switch between
unlock paths.


As said in my previous emails, the PV ops call site patching code 
doesn't work well with locking. First of all, the native_patch() 
function was called even in a KVM PV guest. Some unlock calls happened 
before the paravirt_spinlocks_enabled jump label was set up. It occurs 
to me that call site patching is done the first time the call site is 
used. At least for those early unlock calls, there is no way to figure 
out if it should use the native fast path or the PV slow path. The only 
possible workaround that I

KVM: x86: update masterclock on TSC writes

2014-11-03 Thread Marcelo Tosatti

When the guest writes to the TSC, the masterclock TSC copy must be updated
as well along with the TSC_OFFSET update, otherwise a 
negative tsc_timestamp is calculated at kvm_guest_time_update.

Signed-off-by: Marcelo Tosatti 

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0033df3..f52a887 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1248,7 +1248,7 @@ void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
if (!ka->use_master_clock)
do_request = 1;
 
-   if (!vcpus_matched && ka->use_master_clock)
+   if (ka->use_master_clock)
do_request = 1;
 
if (do_request)


Re: [PATCH v12 2/6] KVM: Add generic support for dirty page logging

2014-11-03 Thread Mario Smarduch
On 11/02/2014 07:01 PM, Takuya Yoshikawa wrote:
> On Thu, 30 Oct 2014 12:19:00 -0700
> Mario Smarduch  wrote:
> 
>> On 10/30/2014 05:14 AM, Cornelia Huck wrote:
>>> On Wed, 22 Oct 2014 15:34:07 -0700
>>> Mario Smarduch  wrote:
>>>
 This patch defines KVM_GENERIC_DIRTYLOG, and moves the dirty log read function
 to its own file, virt/kvm/dirtylog.c. x86 is updated to use the generic
 dirty log interface, selecting KVM_GENERIC_DIRTYLOG in its Kconfig and 
 makefile. No other architectures are affected; each uses its own version.
 This changed from the previous patch revision, where non-generic architectures 
 were modified.

 In a subsequent patch, armv7 does the same thing. All other architectures
 continue to use the architecture-defined version.

>>>
>>> Hm.
>>>
>>> "The x86 specific version of dirty page logging is generic enough to be
>>> used by other architectures, notably ARMv7. So let's move the x86 code
>>> under virt/kvm/ and make it depend on KVM_GENERIC_DIRTYLOG. Other
>>> architectures continue to use their own implementations."
>>>
>>> ?
>>
>> I'll update both patches with the more concise descriptions.
> 
> I don't think it's so generic.
> 
> Especially, the relationship between mmu_lock and TLB flushing has
> been changed a few times for optimizing x86 mmu code, and it may change
> in the future again.
> 
> Since how mmu_lock is used depends on each arch, it's not so
> simple to make the function generic IMO.
> 
> Thanks,
>   Takuya

Hi Takuya,

If the bitmap can be processed locklessly, then it's not reusable, for sure.

The interface is not "too generic", but "generic enough" for arm/armv8 to
reuse.
For TLB flushing, I would think optimizations would wind up in the arch subtree.
I'm not convinced that the location of a couple of lines makes the
interface non-reusable; going back to 3.10, it has hardly changed.

- Mario

> 
>>
>> Thanks.
>>
>>>

 Signed-off-by: Mario Smarduch 
 ---
  arch/x86/include/asm/kvm_host.h |3 --
  arch/x86/kvm/Kconfig|1 +
  arch/x86/kvm/Makefile   |1 +
  arch/x86/kvm/x86.c  |   86 --
  include/linux/kvm_host.h|4 ++
  virt/kvm/Kconfig|3 ++
  virt/kvm/dirtylog.c |  112 
 +++
  7 files changed, 121 insertions(+), 89 deletions(-)
  create mode 100644 virt/kvm/dirtylog.c

>>>
 diff --git a/virt/kvm/dirtylog.c b/virt/kvm/dirtylog.c
 new file mode 100644
 index 000..67a
 --- /dev/null
 +++ b/virt/kvm/dirtylog.c
 @@ -0,0 +1,112 @@
 +/*
 + * kvm generic dirty logging support, used by architectures that share
 + * comman dirty page logging implementation.
>>>
>>> s/comman/common/
>>>
>>> The approach looks sane to me, especially as it does not change other
>>> architectures needlessly.
>>>
>>



Re: [PATCH V1 0/4] KVM vPMU support for x86

2014-11-03 Thread Wei Huang


On 11/03/2014 12:39 PM, Radim Krčmář wrote:
> 2014-11-03 12:23-0600, Wei Huang:
>>
>>
>> On 11/03/2014 11:56 AM, Radim Krčmář wrote:
>>> 2014-10-31 12:05-0400, Wei Huang:
 Currently KVM only supports vPMU for Intel platforms. This patch set 
 enables vPMU support for the AMD platform by creating a common PMU
 interface for x86. The PMU calls from guest VMs are dispatched
 to corresponding functions defined in arch specific files.
>>>
>>> The functionality looks good, so I just want to verify the basic design:
>>> why don't we emulate AMD PMU on Intel, and vice versa?
>>> (Underlying PERF_COUNTs are identical in both.)
>>
>> Thanks. The underlying perf counters can be very different between AMD
>> and Intel. I think we can emulate AMD on Intel, or vice versa, for some
>> common perfmon_events (such as PERF_COUNT_HW_CPU_CYCLES). But as soon as
>> guest VMs access raw counters (see PERF_TYPE_RAW), we can't emulate them
>> anymore.
> 
> Thanks, I guess raw counters are used more than I thought, so code
> complexity would overshadow the gain of having at least something.
To be honest I did try. But it became a big rat-hole as soon as I tried
to abstract them. This problem also applies to Intel CPUs between
generations, if perf counters are not arch_counters.

-Wei
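
To illustrate the distinction Wei describes (a hedged sketch, not code from
the series): common events can be translated through a per-vendor table, in
the spirit of the intel_arch_events/amd_event_mapping tables quoted elsewhere
in this thread, into generic perf events, while a PERF_TYPE_RAW config carries
vendor-specific bits that have no such mapping.

#include <stdint.h>

/* Illustrative per-vendor event table entry. */
struct hw_event_map {
	uint8_t eventsel;
	uint8_t unit_mask;
	int     generic_event;	/* e.g. PERF_COUNT_HW_CPU_CYCLES */
};

#define GENERIC_EVENT_NONE (-1)

/*
 * Translate a guest eventsel/umask pair via the guest vendor's table; this
 * works for the handful of common events.  A raw config bypasses any table,
 * so it cannot be re-encoded for the other vendor.
 */
static int map_common_event(const struct hw_event_map *tbl, int n,
			    uint8_t eventsel, uint8_t unit_mask)
{
	int i;

	for (i = 0; i < n; i++)
		if (tbl[i].eventsel == eventsel && tbl[i].unit_mask == unit_mask)
			return tbl[i].generic_event;

	return GENERIC_EVENT_NONE;	/* would fall back to raw: not emulatable cross-vendor */
}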

> 
> And then, there is the always perfect, "who cares" :)




Re: [PATCH v12 2/6] KVM: Add generic support for dirty page logging

2014-11-03 Thread Mario Smarduch
On 11/01/2014 03:12 AM, James Hogan wrote:
> Hi Mario,
> 
> On Wed, Oct 22, 2014 at 03:34:07PM -0700, Mario Smarduch wrote:
>> +/**
>> + * kvm_vm_ioctl_get_dirty_log - get and clear the log of dirty pages in a 
>> slot
>> + * @kvm: kvm instance
>> + * @log: slot id and address to which we copy the log
>> + *
>> + * We need to keep it in mind that VCPU threads can write to the bitmap
>> + * concurrently.  So, to avoid losing data, we keep the following order for
>> + * each bit:
>> + *
>> + *   1. Take a snapshot of the bit and clear it if needed.
>> + *   2. Write protect the corresponding page.
>> + *   3. Flush TLB's if needed.
>> + *   4. Copy the snapshot to the userspace.
>> + *
>> + * Between 2 and 3, the guest may write to the page using the remaining TLB
>> + * entry.  This is not a problem because the page will be reported dirty at
>> + * step 4 using the snapshot taken before and step 3 ensures that successive
>> + * writes will be logged for the next call.
>> + */
>> +int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
>> +{
>> +int r;
>> +struct kvm_memory_slot *memslot;
>> +unsigned long n, i;
>> +unsigned long *dirty_bitmap;
>> +unsigned long *dirty_bitmap_buffer;
>> +bool is_dirty = false;
>> +
>> +mutex_lock(&kvm->slots_lock);
>> +
>> +r = -EINVAL;
>> +if (log->slot >= KVM_USER_MEM_SLOTS)
>> +goto out;
>> +
>> +memslot = id_to_memslot(kvm->memslots, log->slot);
>> +
>> +dirty_bitmap = memslot->dirty_bitmap;
>> +r = -ENOENT;
>> +if (!dirty_bitmap)
>> +goto out;
>> +
>> +n = kvm_dirty_bitmap_bytes(memslot);
>> +
>> +dirty_bitmap_buffer = dirty_bitmap + n / sizeof(long);
>> +memset(dirty_bitmap_buffer, 0, n);
>> +
>> +spin_lock(&kvm->mmu_lock);
>> +
>> +for (i = 0; i < n / sizeof(long); i++) {
>> +unsigned long mask;
>> +gfn_t offset;
>> +
>> +if (!dirty_bitmap[i])
>> +continue;
>> +
>> +is_dirty = true;
>> +
>> +mask = xchg(&dirty_bitmap[i], 0);
>> +dirty_bitmap_buffer[i] = mask;
>> +
>> +offset = i * BITS_PER_LONG;
>> +kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
>> +}
>> +
>> +spin_unlock(&kvm->mmu_lock);
>> +
>> +/* See the comments in kvm_mmu_slot_remove_write_access(). */
>> +lockdep_assert_held(&kvm->slots_lock);
>> +
>> +/*
>> + * All the TLBs can be flushed out of mmu lock, see the comments in
>> + * kvm_mmu_slot_remove_write_access().
>> + */
>> +if (is_dirty)
>> +kvm_flush_remote_tlbs(kvm);
>> +
>> +r = -EFAULT;
>> +if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
>> +goto out;
> 
> AFAICT all of the arch implementations of kvm_vm_ioctl_get_dirty_log()
> except x86 and ppc hv (i.e. ia60, mips, ppc pv, s390) already make use
> of the existing generic function kvm_get_dirty_log() to help implement
> their kvm_vm_ioctl_get_dirty_log functions, which all look pretty
> similar now except for TLB flushing.
> 
> Would they not be a better base for a generic
> kvm_vm_ioctl_get_dirty_log()?

But kvm_get_dirty_log() is just a helper for handling the dirty bitmap after
the arch-specific dirty page logging updates. There is not much to reuse here
other than adding an ARM sync version and then calling kvm_get_dirty_log().
But the sync would be pretty much identical to our current
kvm_vm_ioctl_get_dirty_log(). The purpose of these patches is to reuse some of
the dirty page logging logic.
> 
> It feels a bit wrong to add a generic higher level function which
> doesn't make use of the existing generic lower level abstraction.
> 
> (Apologies if this has already been brought up in previous versions of
> the patchset, I haven't been tracking them).

Yeah, there is no MAINTAINERS entry for MIPS KVM, per my earlier email.
> 
> Cheers
> James
> 
>> +
>> +r = 0;
>> +out:
>> +mutex_unlock(&kvm->slots_lock);
>> +return r;
>> +}



Re: [PATCH V1 0/4] KVM vPMU support for x86

2014-11-03 Thread Radim Krčmář
2014-11-03 12:23-0600, Wei Huang:
> 
> 
> On 11/03/2014 11:56 AM, Radim Krčmář wrote:
> > 2014-10-31 12:05-0400, Wei Huang:
> >> Currently KVM only supports vPMU for Intel platforms. This patch set 
> >> enables vPMU support for the AMD platform by creating a common PMU
> >> interface for x86. The PMU calls from guest VMs are dispatched
> >> to corresponding functions defined in arch specific files.
> > 
> > The functionality looks good, so I just want to verify the basic design:
> > why don't we emulate AMD PMU on Intel, and vice versa?
> > (Underlying PERF_COUNTs are identical in both.)
> 
> Thanks. The underlying perf counters can be very different between AMD
> and Intel. I think we can emulate AMD on Intel, or vice versa, for some
> common perfmon_events (such as PERF_COUNT_HW_CPU_CYCLES). But as soon as
> guest VMs access raw counters (see PERF_TYPE_RAW), we can't emulate them
> anymore.

Thanks, I guess raw counters are used more than I thought, so code
complexity would overshadow the gain of having at least something.

And then, there is the always perfect, "who cares" :)


Re: [PATCH V1 2/4] KVM: x86/vPMU: Convert pmu.c code into Intel specific code

2014-11-03 Thread Radim Krčmář
2014-10-31 12:05-0400, Wei Huang:
> This patch converts existing pmu.c into Intel specific code and hooks
> up with the PMU interface using the following steps:
> 
> - Convert pmu.c to pmu_intel.c; All public PMU functions are renamed
>   and hooked up with the newly defined intel_pmu_ops.
> - Create a corresponding pmu_amd.c file with empty functions for AMD
>   arch.
> - The PMU function pointer, kvm_pmu_ops, is initialized by calling
>   kvm_x86_ops->get_pmu_ops().
> - To reduce the code size, Intel and AMD modules are now generated
>   from their corresponding arch and PMU files; meanwhile, due
>   to this arrangement, several functions are exposed as public ones
>   to allow calling from PMU code.

(Patch would have been easier to review as two patches, where the first
 one just renames pmu to pmu_intel and the second one does the split.)

> Signed-off-by: Wei Huang 
> ---

---
The rest is an idea for consideration ...
Please consider everything from now to be in parentheses
= ignore if you are not enjoying shuffling code around.

> +
> +static unsigned find_arch_event(struct kvm_pmu *pmu, u8 event_select,
> + u8 unit_mask)

The prototype could be extended with
  struct kvm_event_hw_type_mapping, size_t nr_events
and reused for AMD as well.

> +{
> + int i;
> +
> + for (i = 0; i < ARRAY_SIZE(intel_arch_events); i++)
> + if (intel_arch_events[i].eventsel == event_select
> + && intel_arch_events[i].unit_mask == unit_mask
> + && (pmu->available_event_types & (1 << i)))

pmu->available_event_types would be -1 on AMD.

> + break;
> +
> + if (i == ARRAY_SIZE(intel_arch_events))
> + return PERF_COUNT_HW_MAX;
> +
> + return intel_arch_events[i].event_type;
> +}
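
A hedged sketch of the shared helper suggested above (assuming it would move
to pmu.c/pmu.h, and that AMD would pass pmu->available_event_types == -1 so
the availability check becomes a no-op there):

static unsigned find_arch_event(struct kvm_pmu *pmu,
				struct kvm_event_hw_type_mapping *events,
				size_t nr_events,
				u8 event_select, u8 unit_mask)
{
	size_t i;

	for (i = 0; i < nr_events; i++)
		if (events[i].eventsel == event_select
		    && events[i].unit_mask == unit_mask
		    && (pmu->available_event_types & (1 << i)))
			return events[i].event_type;

	return PERF_COUNT_HW_MAX;
}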

> +static void reprogram_counter(struct kvm_pmc *pmc, u32 type,
> + unsigned config, bool exclude_user, bool exclude_kernel,
> + bool intr, bool in_tx, bool in_tx_cp)

This function could drop 'bool in_tx' and 'bool in_tx_cp', because it
already accepts config, so these flags can already be included there.
It has only one caller that uses them anyway.

> +{
> + struct perf_event *event;
> + struct perf_event_attr attr = {
> + .type = type,
> + .size = sizeof(attr),
> + .pinned = true,
> + .exclude_idle = true,
> + .exclude_host = 1,
> + .exclude_user = exclude_user,
> + .exclude_kernel = exclude_kernel,
> + .config = config,
> + };
> + if (in_tx)
> + attr.config |= HSW_IN_TX;
> + if (in_tx_cp)
> + attr.config |= HSW_IN_TX_CHECKPOINTED;

And after dropping this, it is identical to AMD.

> +
> + attr.sample_period = (-pmc->counter) & pmc_bitmask(pmc);
> +
> + event = perf_event_create_kernel_counter(&attr, -1, current,
> +  intr ? kvm_perf_overflow_intr :
> +  kvm_perf_overflow, pmc);
> + if (IS_ERR(event)) {
> + printk_once("kvm: pmu event creation failed %ld\n",
> + PTR_ERR(event));
> + return;
> + }
> +
> + pmc->perf_event = event;
> + clear_bit(pmc->idx, (unsigned long*)&pmc->vcpu->arch.pmu.reprogram_pmi);
> +}
> +
> +static void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel)
> +{

Because the two functions that this one uses have been merged, we could
propagate the changes and reuse this one with AMD as well.

> + unsigned config, type = PERF_TYPE_RAW;
> + u8 event_select, unit_mask;
> +
> + if (eventsel & ARCH_PERFMON_EVENTSEL_PIN_CONTROL)
> + printk_once("kvm pmu: pin control bit is ignored\n");
> +
> + pmc->eventsel = eventsel;
> +
> + stop_counter(pmc);
> +
> + if (!(eventsel & ARCH_PERFMON_EVENTSEL_ENABLE) || !pmc_enabled(pmc))
> + return;
> +
> + event_select = eventsel & ARCH_PERFMON_EVENTSEL_EVENT;
> + unit_mask = (eventsel & ARCH_PERFMON_EVENTSEL_UMASK) >> 8;
> +
> + if (!(eventsel & (ARCH_PERFMON_EVENTSEL_EDGE |
> + ARCH_PERFMON_EVENTSEL_INV |
> + ARCH_PERFMON_EVENTSEL_CMASK |
> + HSW_IN_TX |
> + HSW_IN_TX_CHECKPOINTED))) {
> + config = find_arch_event(&pmc->vcpu->arch.pmu, event_select,
> + unit_mask);
> + if (config != PERF_COUNT_HW_MAX)
> + type = PERF_TYPE_HARDWARE;
> + }
> +
> + if (type == PERF_TYPE_RAW)
> + config = eventsel & X86_RAW_EVENT_MASK;
> +
> + reprogram_counter(pmc, type, config,
> + !(eventsel & ARCH_PERFMON_EVENTSEL_USR),
> + !(eventsel & ARCH_PERFMON_EVENTSEL_OS),
> + eventsel & ARCH_PERFMON_EVENTSEL_INT,
> + (eventsel 

Re: [PATCH V1 0/4] KVM vPMU support for x86

2014-11-03 Thread Wei Huang


On 11/03/2014 11:56 AM, Radim Krčmář wrote:
> 2014-10-31 12:05-0400, Wei Huang:
>> Currently KVM only supports vPMU for Intel platforms. This patch set 
>> enables vPMU support for the AMD platform by creating a common PMU
>> interface for x86. The PMU calls from guest VMs are dispatched
>> to corresponding functions defined in arch specific files.
> 
> The functionality looks good, so I just want to verify the basic design:
> why don't we emulate AMD PMU on Intel, and vice versa?
> (Underlying PERF_COUNTs are identical in both.)

Thanks. The underlying perf counters can be very different between AMD
and Intel. I think we can emulate AMD on Intel, or vice versa, for some
common perfmon_events (such as PERF_COUNT_HW_CPU_CYCLES). But as soon as
guest VMs access raw counters (see PERF_TYPE_RAW), we can't emulate them
anymore.

> 
>> V1:
>>   * Adopt the file layout suggested by Radim Krčmář
> 
> (I'll still advocate for more separation, sorry.)
> 
>>   * Link arch module with its specific PMU file
>>
>> RFC:
>>   * Initial version for RFC
>>
>> Wei Huang (4):
>>   KVM: x86/vPMU: Define kvm_pmu_ops to support vPMU function dispatch
>>   KVM: x86/vPMU: Convert pmu.c code into Intel specific code
>>   KVM: x86/vPMU: Implement AMD PMU support for KVM
>>   KVM: x86/vPMU: Enable PMU handling for AMD PERFCTRn and EVNTSELn MSRs
>>
>>  arch/x86/include/asm/kvm_host.h |  52 ++--
>>  arch/x86/kvm/Makefile   |   6 +-
>>  arch/x86/kvm/cpuid.c|   2 +-
>>  arch/x86/kvm/lapic.c|   1 +
>>  arch/x86/kvm/pmu.c  | 576 
>> ---
>>  arch/x86/kvm/pmu_amd.c  | 390 ++
>>  arch/x86/kvm/pmu_intel.c| 586 
>> 
>>  arch/x86/kvm/svm.c  |   7 +
>>  arch/x86/kvm/vmx.c  |   7 +
>>  arch/x86/kvm/x86.c  |  74 ++---
>>  10 files changed, 1070 insertions(+), 631 deletions(-)
>>  delete mode 100644 arch/x86/kvm/pmu.c
>>  create mode 100644 arch/x86/kvm/pmu_amd.c
>>  create mode 100644 arch/x86/kvm/pmu_intel.c
>>
>> -- 
>> 1.8.3.1
>>


Re: [PATCH V1 3/4] KVM: x86/vPMU: Implement AMD PMU support for KVM

2014-11-03 Thread Radim Krčmář
2014-10-31 12:05-0400, Wei Huang:
> This patch implements vPMU for the AMD platform. The design piggybacks
> on the existing Intel structs (kvm_pmu and kvm_pmc), but only uses
> the parts of generic counters. The kvm_pmu_ops interface is also
> initialized in this patch.
> 
> Signed-off-by: Wei Huang 
> ---
>  arch/x86/kvm/pmu_amd.c | 332 
> ++---
>  1 file changed, 318 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/x86/kvm/pmu_amd.c b/arch/x86/kvm/pmu_amd.c
> index 6d6e1c3..19cfe26 100644
> --- a/arch/x86/kvm/pmu_amd.c
> +++ b/arch/x86/kvm/pmu_amd.c
> @@ -16,58 +16,362 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include "x86.h"
>  #include "cpuid.h"
>  #include "lapic.h"
>  
> -void amd_pmu_cpuid_update(struct kvm_vcpu *vcpu)
> +/* duplicated from amd_perfmon_event_map, K7 and above should work */
> +static struct kvm_event_hw_type_mapping amd_event_mapping[] = {
> + [0] = { 0x76, 0x00, PERF_COUNT_HW_CPU_CYCLES },
> + [1] = { 0xc0, 0x00, PERF_COUNT_HW_INSTRUCTIONS },
> + [2] = { 0x80, 0x00, PERF_COUNT_HW_CACHE_REFERENCES },
> + [3] = { 0x81, 0x00, PERF_COUNT_HW_CACHE_MISSES },

> + [4] = { 0xc4, 0x00, PERF_COUNT_HW_BRANCH_INSTRUCTIONS },
> + [5] = { 0xc5, 0x00, PERF_COUNT_HW_BRANCH_MISSES },

amd_perfmon_event_map has 0xc2 and 0xc3.

> + [6] = { 0xd0, 0x00, PERF_COUNT_HW_STALLED_CYCLES_FRONTEND },
> + [7] = { 0xd1, 0x00, PERF_COUNT_HW_STALLED_CYCLES_BACKEND },
> +};
> +
> +static inline bool pmc_is_gp(struct kvm_pmc *pmc)
>  {
> + return pmc->type == KVM_PMC_GP;
>  }

What is the benefit of having the same implementation in both files?

I'd prefer to have it in pmu.h and I'll try to note all functions that
look identical.  As always, you can ignore anything in parentheses,
there are only a few real issues with this patch.

>  
> -int amd_pmu_check_pmc(struct kvm_vcpu *vcpu, unsigned pmc)
> +static inline struct kvm_pmc *get_gp_pmc(struct kvm_pmu *pmu, u32 msr,
> +  u32 base)

(pmu.h)

>  {
> - return 1;
> + if (msr >= base && msr < base + pmu->nr_arch_gp_counters)
> + return &pmu->gp_counters[msr - base];
> +
> + return NULL;
>  }
>  
> -int amd_pmu_read_pmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data)
> +static inline u64 pmc_bitmask(struct kvm_pmc *pmc)

(pmu.h)

>  {
> - return 1;
> + struct kvm_pmu *pmu = &pmc->vcpu->arch.pmu;
> +
> + return pmu->counter_bitmask[pmc->type];
>  }
>  
> -int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> +static struct kvm_pmc *global_idx_to_pmc(struct kvm_pmu *pmu, int idx)
>  {
> - return 1;
> + return get_gp_pmc(pmu, MSR_K7_EVNTSEL0 + idx, MSR_K7_EVNTSEL0);
>  }
>  
> -int amd_pmu_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data)
> +void amd_deliver_pmi(struct kvm_vcpu *vcpu)

static.

(pmu.c, rename to deliver_pmi and reuse for intel.)

>  {
> - return 1;
> + if (vcpu->arch.apic)
> + kvm_apic_local_deliver(vcpu->arch.apic, APIC_LVTPC);
>  }
>  
> -bool amd_is_pmu_msr(struct kvm_vcpu *vcpu, u32 msr)
> +static void trigger_pmi(struct irq_work *irq_work)

(pmu.c)

>  {
> - return 0;
> + struct kvm_pmu *pmu = container_of(irq_work, struct kvm_pmu,
> + irq_work);
> + struct kvm_vcpu *vcpu = container_of(pmu, struct kvm_vcpu,
> + arch.pmu);
> +
> + amd_deliver_pmi(vcpu);
>  }
>  
> -void amd_deliver_pmi(struct kvm_vcpu *vcpu)
> +static u64 read_pmc(struct kvm_pmc *pmc)

(pmu.c)

> +{
> + u64 counter, enabled, running, result, delta, prev;
> +
> + counter = pmc->counter;
> + prev = pmc->counter;
> +
> + if (pmc->perf_event) {
> + delta = perf_event_read_value(pmc->perf_event,
> +   &enabled, &running);
> + counter += delta;
> + }
> +
> + result = counter & pmc_bitmask(pmc);
> +
> + return result;
> +}

No.  The one intel uses is better.
(This one looks like a relic from experimenting.)

> +
> +static void stop_counter(struct kvm_pmc *pmc)

(pmu.c)

> +{
> + if (pmc->perf_event) {
> + pmc->counter = read_pmc(pmc);
> + perf_event_release_kernel(pmc->perf_event);
> + pmc->perf_event = NULL;
> + }
> +}
> +
> +static unsigned find_hw_type_event(struct kvm_pmu *pmu, u8 event_select,
> + u8 unit_mask)
> +{

(Intel calls this find_arch_event.)

> + int i;
> +
> + for (i = 0; i < ARRAY_SIZE(amd_event_mapping); i++)
> + if (amd_event_mapping[i].eventsel == event_select
> + && amd_event_mapping[i].unit_mask == unit_mask)

(And it could be less extensible, but shorter:
return amd_event_mapping[i].event_type;
}
return PERF_COUNT_HW_MAX;
  }
)

> + break;
> +
> + if (i == ARRAY_SIZE(amd_event_mapping))
> + return PERF_COUNT_HW_MAX;
> +
> +  

Re: [PATCH V1 0/4] KVM vPMU support for x86

2014-11-03 Thread Radim Krčmář
2014-10-31 12:05-0400, Wei Huang:
> Currently KVM only supports vPMU for Intel platforms. This patch set 
> enables vPMU support for the AMD platform by creating a common PMU
> interface for x86. The PMU calls from guest VMs are dispatched
> to corresponding functions defined in arch specific files.

The functionality looks good, so I just want to verify the basic design:
why don't we emulate AMD PMU on Intel, and vice versa?
(Underlying PERF_COUNTs are identical in both.)

> V1:
>   * Adopt the file layout suggested by Radim Krčmář

(I'll still advocate for more separation, sorry.)

>   * Link arch module with its specific PMU file
> 
> RFC:
>   * Initial version for RFC
> 
> Wei Huang (4):
>   KVM: x86/vPMU: Define kvm_pmu_ops to support vPMU function dispatch
>   KVM: x86/vPMU: Convert pmu.c code into Intel specific code
>   KVM: x86/vPMU: Implement AMD PMU support for KVM
>   KVM: x86/vPMU: Enable PMU handling for AMD PERFCTRn and EVNTSELn MSRs
> 
>  arch/x86/include/asm/kvm_host.h |  52 ++--
>  arch/x86/kvm/Makefile   |   6 +-
>  arch/x86/kvm/cpuid.c|   2 +-
>  arch/x86/kvm/lapic.c|   1 +
>  arch/x86/kvm/pmu.c  | 576 ---
>  arch/x86/kvm/pmu_amd.c  | 390 ++
>  arch/x86/kvm/pmu_intel.c| 586 
> 
>  arch/x86/kvm/svm.c  |   7 +
>  arch/x86/kvm/vmx.c  |   7 +
>  arch/x86/kvm/x86.c  |  74 ++---
>  10 files changed, 1070 insertions(+), 631 deletions(-)
>  delete mode 100644 arch/x86/kvm/pmu.c
>  create mode 100644 arch/x86/kvm/pmu_amd.c
>  create mode 100644 arch/x86/kvm/pmu_intel.c
> 
> -- 
> 1.8.3.1
> 


[Bug 87591] Host will call trace when loading igbvf.

2014-11-03 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=87591

Yinghai Lu  changed:

   What|Removed |Added

 CC||ying...@kernel.org

--- Comment #2 from Yinghai Lu  ---
https://lkml.org/lkml/2014/10/29/880



Re: [RFC PATCH 1/4] vhost: add VHOST_VRING_F_BYTESWAP flag

2014-11-03 Thread Michael S. Tsirkin
On Wed, Oct 29, 2014 at 09:38:42AM +0100, Cédric Le Goater wrote:
> The VHOST_VRING_F_BYTESWAP flag will be used by the host to byteswap
> the vring data when the guest and the host have a different endian
> order.
> 
> Signed-off-by: Cédric Le Goater 

I don't think it's a good API.
You should ask for a specific header format, not for a swap.

> ---
>  drivers/vhost/vhost.c  |5 -
>  drivers/vhost/vhost.h  |1 +
>  include/uapi/linux/vhost.h |3 +++
>  3 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index c90f4374442a..72c21b790ba3 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -199,6 +199,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>   vq->call = NULL;
>   vq->log_ctx = NULL;
>   vq->memory = NULL;
> + vq->byteswap = 0;
>  }
>  
>  static int vhost_worker(void *data)
> @@ -701,7 +702,8 @@ long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, 
> void __user *argp)
>   r = -EFAULT;
>   break;
>   }
> - if (a.flags & ~(0x1 << VHOST_VRING_F_LOG)) {
> + if (a.flags & ~(0x1 << VHOST_VRING_F_LOG |
> + 0x1 << VHOST_VRING_F_BYTESWAP)) {
>   r = -EOPNOTSUPP;
>   break;
>   }
> @@ -747,6 +749,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, 
> void __user *argp)
>   vq->avail = (void __user *)(unsigned long)a.avail_user_addr;
>   vq->log_addr = a.log_guest_addr;
>   vq->used = (void __user *)(unsigned long)a.used_user_addr;
> + vq->byteswap = !!(a.flags & (0x1 << VHOST_VRING_F_BYTESWAP));
>   break;
>   case VHOST_SET_VRING_KICK:
>   if (copy_from_user(&f, argp, sizeof f)) {
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 3eda654b8f5a..ab25b7d0720d 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -110,6 +110,7 @@ struct vhost_virtqueue {
>   /* Log write descriptors */
>   void __user *log_base;
>   struct vhost_log *log;
> + bool byteswap;
>  };
>  
>  struct vhost_dev {
> diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
> index bb6a5b4cb3c5..6a8c2b325c44 100644
> --- a/include/uapi/linux/vhost.h
> +++ b/include/uapi/linux/vhost.h
> @@ -34,6 +34,9 @@ struct vhost_vring_addr {
>   /* Flag values: */
>   /* Whether log address is valid. If set enables logging. */
>  #define VHOST_VRING_F_LOG 0
> + /* Whether vring memory accesses should be byte-swapped.
> +  * required when the guest has a different endianness */
> +#define VHOST_VRING_F_BYTESWAP 1
>  
>   /* Start of array of descriptors (virtually contiguous) */
>   __u64 desc_user_addr;
> -- 
> 1.7.10.4


Re: [RFC PATCH 4/4] vhost_net: byteswap virtio_net header

2014-11-03 Thread Michael S. Tsirkin
On Wed, Oct 29, 2014 at 09:38:45AM +0100, Cédric Le Goater wrote:
> Signed-off-by: Cédric Le Goater 

This patch casts userspace pointers to void *, which is generally unsafe.
In particular, sparse will warn.

> ---
>  drivers/vhost/net.c |   39 ++-
>  1 file changed, 30 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 8dae2f724a35..f2d5f585dae9 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -18,6 +18,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -329,6 +330,14 @@ static void vhost_zerocopy_callback(struct ubuf_info 
> *ubuf, bool success)
>   rcu_read_unlock_bh();
>  }
>  
> +static void virtio_net_hdr_swap(struct virtio_net_hdr *hdr)
> +{
> + hdr->hdr_len = swab16(hdr->hdr_len);
> + hdr->gso_size= swab16(hdr->gso_size);
> + hdr->csum_start  = swab16(hdr->csum_start);
> + hdr->csum_offset = swab16(hdr->csum_offset);
> +}
> +
>  /* Expects to be always run from workqueue - which acts as
>   * read-size critical section for our kind of RCU. */
>  static void handle_tx(struct vhost_net *net)
> @@ -346,7 +355,7 @@ static void handle_tx(struct vhost_net *net)
>   .msg_flags = MSG_DONTWAIT,
>   };
>   size_t len, total_len = 0;
> - int err;
> + int err, has_vnet_hdr;
>   size_t hdr_size;
>   struct socket *sock;
>   struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
> @@ -359,6 +368,7 @@ static void handle_tx(struct vhost_net *net)
>  
>   vhost_disable_notify(&net->dev, vq);
>  
> + has_vnet_hdr = vhost_has_feature(vq, VHOST_NET_F_VIRTIO_NET_HDR);
>   hdr_size = nvq->vhost_hlen;
>   zcopy = nvq->ubufs;
>  
> @@ -406,6 +416,8 @@ static void handle_tx(struct vhost_net *net)
>   break;
>   }
>  
> + if (!has_vnet_hdr && vq->byteswap)
> + virtio_net_hdr_swap((void *) vq->iov[0].iov_base);
>   zcopy_used = zcopy && len >= VHOST_GOODCOPY_LEN
>  && (nvq->upend_idx + 1) % UIO_MAXIOV !=
> nvq->done_idx
> @@ -570,7 +582,7 @@ static void handle_rx(struct vhost_net *net)
>   .hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE
>   };
>   size_t total_len = 0;
> - int err, mergeable;
> + int err, mergeable, has_vnet_hdr;
>   s16 headcount;
>   size_t vhost_hlen, sock_hlen;
>   size_t vhost_len, sock_len;
> @@ -588,6 +600,7 @@ static void handle_rx(struct vhost_net *net)
>   vq_log = unlikely(vhost_has_feature(vq, VHOST_F_LOG_ALL)) ?
>   vq->log : NULL;
>   mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);
> + has_vnet_hdr = vhost_has_feature(vq, VHOST_NET_F_VIRTIO_NET_HDR);
>  
>   while ((sock_len = peek_head_len(sock->sk))) {
>   sock_len += sock_hlen;
> @@ -638,6 +651,9 @@ static void handle_rx(struct vhost_net *net)
>   vhost_discard_vq_desc(vq, headcount);
>   continue;
>   }
> +
> + if (!has_vnet_hdr && vq->byteswap)
> + virtio_net_hdr_swap((void *) vq->iov[0].iov_base);
>   if (unlikely(vhost_hlen) &&
>   memcpy_toiovecend(nvq->hdr, (unsigned char *)&hdr, 0,
> vhost_hlen)) {
> @@ -646,13 +662,18 @@ static void handle_rx(struct vhost_net *net)
>   break;
>   }
>   /* TODO: Should check and handle checksum. */
> - if (likely(mergeable) &&
> - memcpy_toiovecend(nvq->hdr, (unsigned char *)&headcount,
> -   offsetof(typeof(hdr), num_buffers),
> -   sizeof hdr.num_buffers)) {
> - vq_err(vq, "Failed num_buffers write");
> - vhost_discard_vq_desc(vq, headcount);
> - break;
> + if (likely(mergeable)) {
> + __u16 tmp = headcount;
> +
> + if (vq->byteswap)
> + tmp = swab16(headcount);
> + if (memcpy_toiovecend(nvq->hdr, (unsigned char *)&tmp,
> + offsetof(typeof(hdr), num_buffers),
> +   sizeof(hdr.num_buffers))) {
> + vq_err(vq, "Failed num_buffers write");
> + vhost_discard_vq_desc(vq, headcount);
> + break;
> + }
>   }
>   vhost_add_used_and_signal_n(&net->dev, vq, vq->heads,
>   headcount);
> -- 
> 1.7.10.4


Re: [RFC PATCH 0/4] vhost_net: support for cross endian guests

2014-11-03 Thread Michael S. Tsirkin
On Wed, Oct 29, 2014 at 09:38:41AM +0100, Cédric Le Goater wrote:
> This patchset adds a VHOST_VRING_F_BYTESWAP flag to inform the host 
> to byteswap data of the vring when the guest and the host have a 
> different endian order. The flag is stored at initialization in an 
> attribute of the virtio queues. It is then used to byteswap, or not, 
> the vring indexes and descriptors shared with the guest OS.
> 
> The last patch adds the byteswapping of the virtio_net header as it 
> is done in qemu.

Hi Cedric,

Thanks for submitting this.
One general problem with this approach is that
it adds overhead unconditionally, e.g. for x86 on x86.

I will in a couple of days post a patch adding virtio 1.0
support for vhost.

This will serve as a better basis for cross-endian support in
vhost, on top.

I'll try to remember to Cc you.

> The patches apply on linux-3.18-rc2 and the tests were done on PowerPC 
> using the following hosts :
>  
>fedora21/ppc64, utopic/ppc64le  
> 
> with various guests : 
> 
>trusty/ppc64le, utopic/ppc64le, debian/ppc64le, 
>rhel6.5/ppc64, fedora21/ppc64, debian/ppc64
> 
> Regression tests for x86_64 were done on a Debian host using rhel6.6,
> fedora20 and debian guests.
>   
> 
> Cédric Le Goater (4):
>   vhost: add VHOST_VRING_F_BYTESWAP flag
>   vhost: add byteswap routines
>   vhost: byteswap virtqueue attributes
>   vhost_net: byteswap virtio_net header
> 
>  drivers/vhost/net.c|   39 +---
>  drivers/vhost/vhost.c  |  150 
> 
>  drivers/vhost/vhost.h  |1 +
>  include/uapi/linux/vhost.h |3 +
>  4 files changed, 171 insertions(+), 22 deletions(-)
> 
> -- 
> 1.7.10.4


Re: [RFC PATCH 3/4] vhost: byteswap virtqueue attributes

2014-11-03 Thread Cornelia Huck
On Wed, 29 Oct 2014 09:38:44 +0100
Cédric Le Goater  wrote:

> The virtqueue structure shares a few attributes with the guest OS
> which need to be byteswapped when the endian order of the host is
> different.
> 
> This patch uses the vq->byteswap attribute to decide whether to
> byteswap or not data being accessed in the guest memory.
> 
> Signed-off-by: Cédric Le Goater 
> ---
>  drivers/vhost/vhost.c |   52 
> +
>  1 file changed, 40 insertions(+), 12 deletions(-)
> 

> +static int __copyhead_to_user(struct vhost_virtqueue *vq,
> + struct vring_used_elem *heads,
> + struct vring_used_elem __user *used,
> + unsigned count)

__copy_used_elems_to_user() ?

> +{
> + int i;
> +
> + for (i = 0; i < count; i++) {
> + if (__vq_put_user(vq, heads[i].id, &used[i].id)) {
> + vq_err(vq, "Failed to write used id");
> + return -EFAULT;
> + }
> + if (__vq_put_user(vq, heads[i].len, &used[i].len)) {
> + vq_err(vq, "Failed to write used len");
> + return -EFAULT;
> + }
> + }

Is there a number of elements where it would be more efficient to
byteswap the used elements first and then do __copy_to_user() in one
go? Depends on the backend, I guess.

> + return 0;
> +}
> +
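
A hedged sketch of the batched variant hinted at above (names and the chunk
size are illustrative, not from the patch; the vq->byteswap conditional and
error reporting are omitted, and vq is kept only to mirror the original
helper's signature): swap a bounded chunk into a local buffer, then push it
with a single __copy_to_user().

static int __copyhead_to_user_batched(struct vhost_virtqueue *vq,
				      struct vring_used_elem *heads,
				      struct vring_used_elem __user *used,
				      unsigned count)
{
	struct vring_used_elem tmp[16];	/* chunk size chosen arbitrarily here */
	unsigned done = 0;

	while (done < count) {
		unsigned n = min(count - done, (unsigned)ARRAY_SIZE(tmp)), i;

		/* byteswap a chunk locally ... */
		for (i = 0; i < n; i++) {
			tmp[i].id  = swab32(heads[done + i].id);
			tmp[i].len = swab32(heads[done + i].len);
		}
		/* ... then copy it to guest memory in one go */
		if (__copy_to_user(&used[done], tmp, n * sizeof(*tmp)))
			return -EFAULT;
		done += n;
	}
	return 0;
}

Whether this wins over per-field puts likely depends on the backend and the
typical number of used elements, as noted.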



[PATCH v4 4/6] hw_random: fix unregister race.

2014-11-03 Thread Amos Kong
From: Rusty Russell 

The previous patch added one potential problem: we can still be
reading from a hwrng when it's unregistered.  Add a wait for zero
in the hwrng_unregister path.

v4: add cleanup_done flag to ensure that cleanup is done

Signed-off-by: Rusty Russell 
Signed-off-by: Amos Kong 
---
 drivers/char/hw_random/core.c | 8 
 include/linux/hw_random.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/drivers/char/hw_random/core.c b/drivers/char/hw_random/core.c
index 27ad6b4..c31bf91 100644
--- a/drivers/char/hw_random/core.c
+++ b/drivers/char/hw_random/core.c
@@ -60,6 +60,7 @@ static DEFINE_MUTEX(rng_mutex);
 static DEFINE_MUTEX(reading_mutex);
 static int data_avail;
 static u8 *rng_buffer, *rng_fillbuf;
+static DECLARE_WAIT_QUEUE_HEAD(rng_done);
 static unsigned short current_quality;
 static unsigned short default_quality; /* = 0; default to "off" */
 
@@ -98,6 +99,8 @@ static inline void cleanup_rng(struct kref *kref)
 
if (rng->cleanup)
rng->cleanup(rng);
+   rng->cleanup_done = true;
+   wake_up_all(&rng_done);
 }
 
 static void set_current_rng(struct hwrng *rng)
@@ -536,6 +539,11 @@ void hwrng_unregister(struct hwrng *rng)
kthread_stop(hwrng_fill);
} else
mutex_unlock(&rng_mutex);
+
+   /* Just in case rng is reading right now, wait. */
+   wait_event(rng_done, rng->cleanup_done &&
+  atomic_read(&rng->ref.refcount) == 0);
+
 }
 EXPORT_SYMBOL_GPL(hwrng_unregister);
 
diff --git a/include/linux/hw_random.h b/include/linux/hw_random.h
index c212e71..7832e50 100644
--- a/include/linux/hw_random.h
+++ b/include/linux/hw_random.h
@@ -46,6 +46,7 @@ struct hwrng {
/* internal. */
struct list_head list;
struct kref ref;
+   bool cleanup_done;
 };
 
 /** Register a new Hardware Random Number Generator driver. */
-- 
1.9.3



[PATCH v4 1/6] hw_random: place mutex around read functions and buffers.

2014-11-03 Thread Amos Kong
From: Rusty Russell 

There's currently a big lock around everything, and it means that we
can't query sysfs (eg /sys/devices/virtual/misc/hw_random/rng_current)
while the rng is reading.  This is a real problem when the rng is slow
or blocked (e.g. virtio_rng with qemu's default /dev/random backend).

This patch doesn't fix that yet (it leaves the current lock untouched); it
just adds a lock to protect the read function and the static buffers, in
preparation for the transition.

Signed-off-by: Rusty Russell 
---
 drivers/char/hw_random/core.c | 20 +---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/char/hw_random/core.c b/drivers/char/hw_random/core.c
index aa30a25..b1b6042 100644
--- a/drivers/char/hw_random/core.c
+++ b/drivers/char/hw_random/core.c
@@ -53,7 +53,10 @@
 static struct hwrng *current_rng;
 static struct task_struct *hwrng_fill;
 static LIST_HEAD(rng_list);
+/* Protects rng_list and current_rng */
 static DEFINE_MUTEX(rng_mutex);
+/* Protects rng read functions, data_avail, rng_buffer and rng_fillbuf */
+static DEFINE_MUTEX(reading_mutex);
 static int data_avail;
 static u8 *rng_buffer, *rng_fillbuf;
 static unsigned short current_quality;
@@ -81,7 +84,9 @@ static void add_early_randomness(struct hwrng *rng)
unsigned char bytes[16];
int bytes_read;
 
+   mutex_lock(&reading_mutex);
bytes_read = rng_get_data(rng, bytes, sizeof(bytes), 1);
+   mutex_unlock(&reading_mutex);
if (bytes_read > 0)
add_device_randomness(bytes, bytes_read);
 }
@@ -128,6 +133,7 @@ static inline int rng_get_data(struct hwrng *rng, u8 
*buffer, size_t size,
int wait) {
int present;
 
+   BUG_ON(!mutex_is_locked(&reading_mutex));
if (rng->read)
return rng->read(rng, (void *)buffer, size, wait);
 
@@ -160,13 +166,14 @@ static ssize_t rng_dev_read(struct file *filp, char 
__user *buf,
goto out_unlock;
}
 
+   mutex_lock(&reading_mutex);
if (!data_avail) {
bytes_read = rng_get_data(current_rng, rng_buffer,
rng_buffer_size(),
!(filp->f_flags & O_NONBLOCK));
if (bytes_read < 0) {
err = bytes_read;
-   goto out_unlock;
+   goto out_unlock_reading;
}
data_avail = bytes_read;
}
@@ -174,7 +181,7 @@ static ssize_t rng_dev_read(struct file *filp, char __user 
*buf,
if (!data_avail) {
if (filp->f_flags & O_NONBLOCK) {
err = -EAGAIN;
-   goto out_unlock;
+   goto out_unlock_reading;
}
} else {
len = data_avail;
@@ -186,7 +193,7 @@ static ssize_t rng_dev_read(struct file *filp, char __user 
*buf,
if (copy_to_user(buf + ret, rng_buffer + data_avail,
len)) {
err = -EFAULT;
-   goto out_unlock;
+   goto out_unlock_reading;
}
 
size -= len;
@@ -194,6 +201,7 @@ static ssize_t rng_dev_read(struct file *filp, char __user 
*buf,
}
 
mutex_unlock(&rng_mutex);
+   mutex_unlock(&reading_mutex);
 
if (need_resched())
schedule_timeout_interruptible(1);
@@ -208,6 +216,9 @@ out:
 out_unlock:
mutex_unlock(&rng_mutex);
goto out;
+out_unlock_reading:
+   mutex_unlock(&reading_mutex);
+   goto out_unlock;
 }
 
 
@@ -348,13 +359,16 @@ static int hwrng_fillfn(void *unused)
while (!kthread_should_stop()) {
if (!current_rng)
break;
+   mutex_lock(&reading_mutex);
rc = rng_get_data(current_rng, rng_fillbuf,
  rng_buffer_size(), 1);
+   mutex_unlock(&reading_mutex);
if (rc <= 0) {
pr_warn("hwrng: no data available\n");
msleep_interruptible(1);
continue;
}
+   /* Outside lock, sure, but y'know: randomness. */
add_hwgenerator_randomness((void *)rng_fillbuf, rc,
   rc * current_quality * 8 >> 10);
}
-- 
1.9.3



[PATCH v4 5/6] hw_random: don't double-check old_rng.

2014-11-03 Thread Amos Kong
From: Rusty Russell 

Interesting anti-pattern.

Signed-off-by: Rusty Russell 
---
 drivers/char/hw_random/core.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/char/hw_random/core.c b/drivers/char/hw_random/core.c
index c31bf91..fc5de7d 100644
--- a/drivers/char/hw_random/core.c
+++ b/drivers/char/hw_random/core.c
@@ -480,14 +480,13 @@ int hwrng_register(struct hwrng *rng)
}
 
old_rng = current_rng;
+   err = 0;
if (!old_rng) {
err = hwrng_init(rng);
if (err)
goto out_unlock;
set_current_rng(rng);
-   }
-   err = 0;
-   if (!old_rng) {
+
err = register_miscdev();
if (err) {
drop_current_rng();
-- 
1.9.3



[PATCH v4 3/6] hw_random: use reference counts on each struct hwrng.

2014-11-03 Thread Amos Kong
From: Rusty Russell 

current_rng holds one reference, and we bump it every time we want
to do a read from it.

This means we only hold the rng_mutex to grab or drop a reference,
so accessing /sys/devices/virtual/misc/hw_random/rng_current doesn't
block on read of /dev/hwrng.

Using a kref is overkill (we're always under the rng_mutex), but
a standard pattern.

This also solves the problem that the hwrng_fillfn thread was
accessing current_rng without a lock, which could change (eg. to NULL)
underneath it.

v4: decrease last reference for triggering the cleanup
v3: initialize kref (thanks Amos Kong)
v2: fix missing put_rng() on exit path (thanks Amos Kong)
Signed-off-by: Rusty Russell 
Signed-off-by: Amos Kong 
---
 drivers/char/hw_random/core.c | 142 +-
 include/linux/hw_random.h |   2 +
 2 files changed, 101 insertions(+), 43 deletions(-)

diff --git a/drivers/char/hw_random/core.c b/drivers/char/hw_random/core.c
index a0905c8..27ad6b4 100644
--- a/drivers/char/hw_random/core.c
+++ b/drivers/char/hw_random/core.c
@@ -42,6 +42,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 
@@ -91,6 +92,65 @@ static void add_early_randomness(struct hwrng *rng)
add_device_randomness(bytes, bytes_read);
 }
 
+static inline void cleanup_rng(struct kref *kref)
+{
+   struct hwrng *rng = container_of(kref, struct hwrng, ref);
+
+   if (rng->cleanup)
+   rng->cleanup(rng);
+}
+
+static void set_current_rng(struct hwrng *rng)
+{
+   BUG_ON(!mutex_is_locked(&rng_mutex));
+   kref_get(&rng->ref);
+   current_rng = rng;
+}
+
+static void drop_current_rng(void)
+{
+   struct hwrng *rng = current_rng;
+
+   BUG_ON(!mutex_is_locked(&rng_mutex));
+   if (!current_rng)
+   return;
+
+   /* release current_rng reference */
+   kref_put(&current_rng->ref, cleanup_rng);
+   current_rng = NULL;
+
+   /* decrease last reference for triggering the cleanup */
+   kref_put(&rng->ref, cleanup_rng);
+}
+
+/* Returns ERR_PTR(), NULL or refcounted hwrng */
+static struct hwrng *get_current_rng(void)
+{
+   struct hwrng *rng;
+
+   if (mutex_lock_interruptible(&rng_mutex))
+   return ERR_PTR(-ERESTARTSYS);
+
+   rng = current_rng;
+   if (rng)
+   kref_get(&rng->ref);
+
+   mutex_unlock(&rng_mutex);
+   return rng;
+}
+
+static void put_rng(struct hwrng *rng)
+{
+   /*
+* Hold rng_mutex here so we serialize in case they set_current_rng
+* on rng again immediately.
+*/
+   mutex_lock(&rng_mutex);
+   if (rng)
+   kref_put(&rng->ref, cleanup_rng);
+   mutex_unlock(&rng_mutex);
+}
+
 static inline int hwrng_init(struct hwrng *rng)
 {
if (rng->init) {
@@ -110,13 +170,9 @@ static inline int hwrng_init(struct hwrng *rng)
if (current_quality > 0 && !hwrng_fill)
start_khwrngd();
 
-   return 0;
-}
+   kref_init(&rng->ref);
 
-static inline void hwrng_cleanup(struct hwrng *rng)
-{
-   if (rng && rng->cleanup)
-   rng->cleanup(rng);
+   return 0;
 }
 
 static int rng_dev_open(struct inode *inode, struct file *filp)
@@ -154,21 +210,22 @@ static ssize_t rng_dev_read(struct file *filp, char 
__user *buf,
ssize_t ret = 0;
int err = 0;
int bytes_read, len;
+   struct hwrng *rng;
 
while (size) {
-   if (mutex_lock_interruptible(&rng_mutex)) {
-   err = -ERESTARTSYS;
+   rng = get_current_rng();
+   if (IS_ERR(rng)) {
+   err = PTR_ERR(rng);
goto out;
}
-
-   if (!current_rng) {
+   if (!rng) {
err = -ENODEV;
-   goto out_unlock;
+   goto out;
}
 
mutex_lock(&reading_mutex);
if (!data_avail) {
-   bytes_read = rng_get_data(current_rng, rng_buffer,
+   bytes_read = rng_get_data(rng, rng_buffer,
rng_buffer_size(),
!(filp->f_flags & O_NONBLOCK));
if (bytes_read < 0) {
@@ -200,8 +257,8 @@ static ssize_t rng_dev_read(struct file *filp, char __user 
*buf,
ret += len;
}
 
-   mutex_unlock(&rng_mutex);
mutex_unlock(&reading_mutex);
+   put_rng(rng);
 
if (need_resched())
schedule_timeout_interruptible(1);
@@ -213,12 +270,11 @@ static ssize_t rng_dev_read(struct file *filp, char 
__user *buf,
}
 out:
return ret ? : err;
-out_unlock:
-   mutex_unlock(&rng_mutex);
-   goto out;
+
 out_unlock_reading:
mutex_unlock(&reading_mutex);
-   goto out_unlock;
+   put_rng(rng);
+   goto out;
 }
 
 
@@ -257,8 +313,8 @

[PATCH v4 0/6] fix hw_random stuck

2014-11-03 Thread Amos Kong
When I hot-unplug a busy virtio-rng device or try to access
hwrng attributes in a non-SMP guest, it gets stuck.

My hotplug tests:

| test 0:
|   hotunplug rng device from qemu monitor
|
| test 1:
|   guest) # dd if=/dev/hwrng of=/dev/null &
|   hotunplug rng device from qemu monitor
|
| test 2:
|   guest) # dd if=/dev/random of=/dev/null &
|   hotunplug rng device from qemu monitor
|
| test 4:
|   guest) # dd if=/dev/hwrng of=/dev/null &
|   cat /sys/devices/virtual/misc/hw_random/rng_*
|
| test 5:
|   guest) # dd if=/dev/hwrng of=/dev/null
|   cancel dd process after 10 seconds
|   guest) # dd if=/dev/hwrng of=/dev/null &
|   hotunplug rng device from qemu monitor
|
| test 6:
|   use a fifo as rng backend, execute test 0 ~ 5 with no input of fifo


V4: update patch 4 to fix corruption, decrease the last reference to trigger
the cleanup, fix the unregister race pointed out by Herbert
V3: initialize kref to 1
V2: added patch 2 to fix a deadlock, update patch 3 to fix a reference
counting issue

Amos Kong (1):
  hw_random: move some code out mutex_lock for avoiding underlying
deadlock

Rusty Russell (5):
  hw_random: place mutex around read functions and buffers.
  hw_random: use reference counts on each struct hwrng.
  hw_random: fix unregister race.
  hw_random: don't double-check old_rng.
  hw_random: don't init list element we're about to add to list.

 drivers/char/hw_random/core.c | 176 ++
 include/linux/hw_random.h |   3 +
 2 files changed, 129 insertions(+), 50 deletions(-)

-- 
1.9.3



[PATCH v4 6/6] hw_random: don't init list element we're about to add to list.

2014-11-03 Thread Amos Kong
From: Rusty Russell 

Another interesting anti-pattern.

Signed-off-by: Rusty Russell 
---
 drivers/char/hw_random/core.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/char/hw_random/core.c b/drivers/char/hw_random/core.c
index fc5de7d..b2cc8a1 100644
--- a/drivers/char/hw_random/core.c
+++ b/drivers/char/hw_random/core.c
@@ -493,7 +493,6 @@ int hwrng_register(struct hwrng *rng)
goto out_unlock;
}
}
-   INIT_LIST_HEAD(&rng->list);
list_add_tail(&rng->list, &rng_list);
 
if (old_rng && !rng->init) {
-- 
1.9.3



[PATCH v4 2/6] hw_random: move some code out mutex_lock for avoiding underlying deadlock

2014-11-03 Thread Amos Kong
In the next patch, we use reference counting for each struct hwrng;
changing the reference count also requires taking the mutex lock. If,
before releasing the lock, we try to stop a kthread that is waiting to
take the lock in order to decrease the reference count, a deadlock will
occur.

Signed-off-by: Amos Kong 
---
 drivers/char/hw_random/core.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/char/hw_random/core.c b/drivers/char/hw_random/core.c
index b1b6042..a0905c8 100644
--- a/drivers/char/hw_random/core.c
+++ b/drivers/char/hw_random/core.c
@@ -474,12 +474,12 @@ void hwrng_unregister(struct hwrng *rng)
}
}
if (list_empty(&rng_list)) {
+   mutex_unlock(&rng_mutex);
unregister_miscdev();
if (hwrng_fill)
kthread_stop(hwrng_fill);
-   }
-
-   mutex_unlock(&rng_mutex);
+   } else
+   mutex_unlock(&rng_mutex);
 }
 EXPORT_SYMBOL_GPL(hwrng_unregister);
 
-- 
1.9.3



Re: [RFC PATCH 2/4] vhost: add byteswap routines

2014-11-03 Thread Cornelia Huck
On Wed, 29 Oct 2014 09:38:43 +0100
Cédric Le Goater  wrote:

> This patch adds a few helper routines around get_user and put_user
> to ease byteswapping.
> 
> Signed-off-by: Cédric Le Goater 
> ---
> 
> I am not sure these routines belong in this file. There is room for
> improvement to remove the ugly switch (sizeof(*(ptr))). Please
> comment!
> 
>  drivers/vhost/vhost.c |   93 
> +
>  1 file changed, 93 insertions(+)
> 

I'm wondering whether it would be a good idea to introduce generic
{get,put}_user_reversed() routines.

That way, you would push the switch (sizeof(*(ptr))) into uaccess code,
where it needs to be done anyway. It also makes it possible to add
optimizations there.
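
To make the idea concrete, here is a rough sketch of what such a generic helper
could look like if the sizeof() dispatch lived in common code; this is an assumed
shape for illustration, not an existing uaccess API:

#include <linux/swab.h>
#include <linux/uaccess.h>

/* Hypothetical get_user_reversed(): fetch a user-space value and byteswap it,
 * dispatching on sizeof() so callers don't have to open-code the switch.
 */
#define get_user_reversed(x, ptr)                                       \
({                                                                      \
        __typeof__(*(ptr)) __gur_val;                                   \
        int __gur_ret = get_user(__gur_val, (ptr));                     \
                                                                        \
        if (!__gur_ret) {                                               \
                switch (sizeof(*(ptr))) {                               \
                case 1: break;                                          \
                case 2: __gur_val = swab16(__gur_val); break;           \
                case 4: __gur_val = swab32(__gur_val); break;           \
                case 8: __gur_val = swab64(__gur_val); break;           \
                }                                                       \
        }                                                               \
        (x) = __gur_val;                                                \
        __gur_ret;                                                      \
})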



[PATCH] KVM: PPC: Book3S HV: ptes are big endian

2014-11-03 Thread Cédric Le Goater
When the guest is restored from qemu, the kvm_get_htab_header structures
are in native endian, but the ptes are big endian.

This patch fixes restore on a KVM LE host. Qemu also needs a fix for
this :

 http://lists.nongnu.org/archive/html/qemu-ppc/2014-11/msg8.html

Signed-off-by: Cédric Le Goater 
Cc: Paul Mackerras 
Cc: Alexey Kardashevskiy 
Cc: Gregory Kurz 

---

 Tested on 3.17-rc7 with LE and BE host.

 

 arch/powerpc/kvm/book3s_64_mmu_hv.c |2 ++
 1 file changed, 2 insertions(+)

Index: linux-3.18-hv.git/arch/powerpc/kvm/book3s_64_mmu_hv.c
===
--- linux-3.18-hv.git.orig/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ linux-3.18-hv.git/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -1542,6 +1542,8 @@ static ssize_t kvm_htab_write(struct fil
err = -EFAULT;
if (__get_user(v, lbuf) || __get_user(r, lbuf + 1))
goto out;
+   v = be64_to_cpu(v);
+   r = be64_to_cpu(r);
err = -EINVAL;
if (!(v & HPTE_V_VALID))
goto out;



Re: [PATCH 3/5] hw_random: fix unregister race.

2014-11-03 Thread Amos Kong
On Tue, Oct 21, 2014 at 10:15:23PM +0800, Herbert Xu wrote:
> On Thu, Sep 18, 2014 at 12:18:24PM +0930, Rusty Russell wrote:
> > The previous patch added one potential problem: we can still be
> > reading from a hwrng when it's unregistered.  Add a wait for zero
> > in the hwrng_unregister path.
> > 
> > Signed-off-by: Rusty Russell 
> > ---
> >  drivers/char/hw_random/core.c | 5 +
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/drivers/char/hw_random/core.c b/drivers/char/hw_random/core.c
> > index dc9092a1075d..b4a21e9521cf 100644
> > --- a/drivers/char/hw_random/core.c
> > +++ b/drivers/char/hw_random/core.c
> > @@ -60,6 +60,7 @@ static DEFINE_MUTEX(rng_mutex);
> >  static DEFINE_MUTEX(reading_mutex);
> >  static int data_avail;
> >  static u8 *rng_buffer, *rng_fillbuf;
> > +static DECLARE_WAIT_QUEUE_HEAD(rng_done);
> >  static unsigned short current_quality;
> >  static unsigned short default_quality; /* = 0; default to "off" */
> >  
> > @@ -98,6 +99,7 @@ static inline void cleanup_rng(struct kref *kref)
> >  
> > if (rng->cleanup)
> > rng->cleanup(rng);

rng->cleanup_done = true;

> > +   wake_up_all(&rng_done);
> >  }
> >  
> >  static void set_current_rng(struct hwrng *rng)
> > @@ -529,6 +531,9 @@ void hwrng_unregister(struct hwrng *rng)
> > }
> >  
> > mutex_unlock(&rng_mutex);
> > +
> > +   /* Just in case rng is reading right now, wait. */
> > +   wait_event(rng_done, atomic_read(&rng->ref.refcount) == 0);

Hi Rusty,
 
After initialization (kref_init()), the refcount is 1, so we need one
more kref_put() after each drop_current_rng() to release the last
reference; only then will the cleanup function be called.


> While it's obviously better than what we have now, I don't believe
> this is 100% safe as the cleanup function might still be running
> even after the ref count hits zero.  Once we return from this function
> the module may be unloaded so we need to ensure that nothing is
> running at this point.

I found that wait_event() can still pass and the unregister can finish
even though the cleanup function hasn't been called (wake_up_all() isn't
called). So I added a cleanup_done flag to indicate that the rng device
has been cleaned up.


+   /* Just in case rng is reading right now, wait. */
+   wait_event(rng_done, rng->cleanup_done &&
+  atomic_read(&rng->ref.refcount) == 0);

I will post the new v4 later.
 
> Cheers,
> -- 
> Email: Herbert Xu 
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

-- 
Amos.
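
For reference, the combined pattern under discussion looks roughly like this;
cleanup_done is the new flag assumed to be added to struct hwrng, and the list
handling and error paths are elided:

static inline void cleanup_rng(struct kref *kref)
{
        struct hwrng *rng = container_of(kref, struct hwrng, ref);

        if (rng->cleanup)
                rng->cleanup(rng);

        /* Publish completion before waking the unregister path. */
        rng->cleanup_done = true;
        wake_up_all(&rng_done);
}

void hwrng_unregister(struct hwrng *rng)
{
        mutex_lock(&rng_mutex);
        /* ... remove rng from rng_list and switch current_rng ... */
        kref_put(&rng->ref, cleanup_rng);       /* drop the initial reference */
        mutex_unlock(&rng_mutex);

        /* Just in case rng is being read right now, wait. */
        wait_event(rng_done, rng->cleanup_done &&
                             atomic_read(&rng->ref.refcount) == 0);
}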


Re: [question] updating the base image for all clones which have been running for months

2014-11-03 Thread Zhang Haoyu
>> Hi, all
>> 
>> I used base image A to clone so many vm, 
>> after running for months, each vm has its own private applications and data,
>> which maybe different from each other.
>> Now, I want to install some applications for all of the clones,
>> what should I do?
>
>Install the applications on each clone separately, or use some other
>method to make it available (like installing on a shared network
>resource).
>
Could you detail "installing on a shared network resource"?

Thanks,
Zhang Haoyu
>> Can I rebase image A to B which have the applications to be installed,
>> then change the base image to B for all clones?
>
>The problem is that rebase works on the block level, not on the file
>system level. 
Yes, this is the point.

>Changing the backing file won't produce a correctly
>working guest, it causes file system corruption.
>
>Kevin



Re: [PATCH] target-i386: add Intel AVX-512 support

2014-11-03 Thread Michael S. Tsirkin
On Mon, Nov 03, 2014 at 12:31:42PM +0100, Paolo Bonzini wrote:
> 
> On 02/11/2014 11:19, Michael S. Tsirkin wrote:
> > > Add AVX512 feature bits, register definition and corresponding
> > > xsave/vmstate support.
> > > 
> > > Signed-off-by: Chao Peng 
> > 
> > Thanks!
> > 
> > As this was first posted after soft freeze, please
> > resubmit after 2.2 is out.
> > 
> > See schedule http://wiki.qemu.org/Planning/2.2
> 
> Actually this is already in.
> 
> While "Planning/2.2" says "all features should have patches posted on
> the mailing list", the actual page describing soft feature freeze is
> more lenient:
> 
>By the date of the soft feature freeze, any non-trivial feature
>should have some code posted to the qemu-devel mailing list if it's
>targeting a given release. Major features, and in particular
>features with a high likelihood of breaking things, should already
>be in the process of being merged.
> 
> This patch was fairly trivial, as it only adds a few memcpys and a
> subsection.  Likelihood of breaking things is basically zero because
> AVX512 does not yet exist in silicon.  Eduardo reviewed the patch
> promptly and agreed with putting it in 2.2, so that's what I did.
> 
> Paolo

Oh, that's fine.
I was not trying to argue; I didn't notice it was already in
and thought I was being asked to merge it.
It's always a judgement call, and I trust you did the right thing.

Thanks and sorry about the noise.

-- 
MST


Re: [question] updating the base image for all clones which have been running for months

2014-11-03 Thread Kevin Wolf
Am 03.11.2014 um 13:04 hat Zhang Haoyu geschrieben:
> Hi, all
> 
> I used base image A to clone so many vm, 
> after running for months, each vm has its own private applications and data,
> which maybe different from each other.
> Now, I want to install some applications for all of the clones,
> what should I do?

Install the applications on each clone separately, or use some other
method to make it available (like installing on a shared network
resource).

> Can I rebase image A to B which have the applications to be installed,
> then change the base image to B for all clones?

The problem is that rebase works on the block level, not on the file
system level. Changing the backing file won't produce a correctly
working guest; it causes file system corruption.

Kevin


[question] updating the base image for all clones which have been running for months

2014-11-03 Thread Zhang Haoyu
Hi, all

I used base image A to clone many VMs;
after running for months, each VM has its own private applications and data,
which may differ from VM to VM.
Now I want to install some applications on all of the clones.
What should I do?

Can I rebase image A to a new image B that has the applications installed,
then change the base image to B for all clones?
I don't think this can work; the clone images may become corrupted later.
And what will happen if the clone image does not have enough space?
Any ideas?

My fallback idea is to push the applications from a central server to each
clone's guest agent, which would be responsible for installing them.

Thanks,
Zhang Haoyu



Re: [Qemu-devel] [PATCH RFC 05/11] virtio: introduce legacy virtio devices

2014-11-03 Thread Cornelia Huck
On Thu, 30 Oct 2014 23:29:50 +0100
Greg Kurz  wrote:

> On Thu, 30 Oct 2014 19:02:01 +0100
> Cornelia Huck  wrote:
> 
> > On Tue, 28 Oct 2014 16:40:18 +0100
> > Greg Kurz  wrote:
> > 
> > > On Tue,  7 Oct 2014 16:40:01 +0200
> > > Cornelia Huck  wrote:
> > > 
> > > > Introduce a helper function to indicate  whether a virtio device is
> > > > operating in legacy or virtio standard mode.
> > > > 
> > > > It may be used to make decisions about the endianess of virtio accesses
> > > > and other virtio-1 specific changes, enabling us to support transitional
> > > > devices.
> > > > 
> > > > Reviewed-by: Thomas Huth 
> > > > Signed-off-by: Cornelia Huck 
> > > > ---
> > > >  hw/virtio/virtio.c|6 +-
> > > >  include/hw/virtio/virtio-access.h |4 
> > > >  include/hw/virtio/virtio.h|   13 +++--
> > > >  3 files changed, 20 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> > > > index 7aaa953..e6ae3a0 100644
> > > > --- a/hw/virtio/virtio.c
> > > > +++ b/hw/virtio/virtio.c
> > > > @@ -883,7 +883,11 @@ static bool virtio_device_endian_needed(void 
> > > > *opaque)
> > > >  VirtIODevice *vdev = opaque;
> > > > 
> > > >  assert(vdev->device_endian != VIRTIO_DEVICE_ENDIAN_UNKNOWN);
> > > > -return vdev->device_endian != virtio_default_endian();
> > > > +if (virtio_device_is_legacy(vdev)) {
> > > > +return vdev->device_endian != virtio_default_endian();
> > > > +}
> > > > +/* Devices conforming to VIRTIO 1.0 or later are always LE. */
> > > > +return vdev->device_endian != VIRTIO_DEVICE_ENDIAN_LITTLE;
> > > >  }
> > > > 
> > > 
> > > Shouldn't we have some code doing the following somewhere ?
> > > 
> > > if (!virtio_device_is_legacy(vdev)) {
> > > vdev->device_endian = VIRTIO_DEVICE_ENDIAN_LITTLE;
> > > }
> > > 
> > > also, since virtio-1 is LE only, do we expect device_endian to
> > > be different from VIRTIO_DEVICE_ENDIAN_LITTLE ?
> > 
> > device_endian should not depend on whether the device is legacy or not.
> > virtio_is_big_endian always returns false for virtio-1 devices, though.
> 
> Sorry, I had missed the virtio_is_big_endian() change: that makes
> device_endian a legacy-virtio-only matter.
> So why would we care to migrate the endian subsection when we have a
> virtio-1 device? Shouldn't virtio_device_endian_needed() return false
> for virtio-1?

Indeed, we can just leave device_endian at the default value for
virtio-1 devices.
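
In other words, something along these lines should suffice; a minimal sketch using
the helper names from the patch (virtio_device_is_legacy(), virtio_default_endian()),
not the final code:

static bool virtio_device_endian_needed(void *opaque)
{
    VirtIODevice *vdev = opaque;

    /* virtio-1 devices are always little-endian: nothing to migrate. */
    if (!virtio_device_is_legacy(vdev)) {
        return false;
    }
    assert(vdev->device_endian != VIRTIO_DEVICE_ENDIAN_UNKNOWN);
    return vdev->device_endian != virtio_default_endian();
}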



Re: [PATCH] target-i386: add Intel AVX-512 support

2014-11-03 Thread Paolo Bonzini

On 02/11/2014 11:19, Michael S. Tsirkin wrote:
> > Add AVX512 feature bits, register definition and corresponding
> > xsave/vmstate support.
> > 
> > Signed-off-by: Chao Peng 
> 
> Thanks!
> 
> As this was first posted after soft freeze, please
> resubmit after 2.2 is out.
> 
> See schedule http://wiki.qemu.org/Planning/2.2

Actually this is already in.

While "Planning/2.2" says "all features should have patches posted on
the mailing list", the actual page describing soft feature freeze is
more lenient:

   By the date of the soft feature freeze, any non-trivial feature
   should have some code posted to the qemu-devel mailing list if it's
   targeting a given release. Major features, and in particular
   features with a high likelihood of breaking things, should already
   be in the process of being merged.

This patch was fairly trivial, as it only adds a few memcpys and a
subsection.  Likelihood of breaking things is basically zero because
AVX512 does not yet exist in silicon.  Eduardo reviewed the patch
promptly and agreed with putting it in 2.2, so that's what I did.

Paolo


[PULL 0/1] KVM: s390: fix for 3.18-rc

2014-11-03 Thread Cornelia Huck
Hi,

one fix for a build warning regression introduced during the merge
window with some virtio changes. Rusty wanted me to send this through
the kvm/s390 tree. Branch is against 3.18-rc3.

The following changes since commit 0df1f2487d2f0d04703f142813d53615d62a1da4:

  Linux 3.18-rc3 (2014-11-02 15:01:51 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux.git 
tags/kvm-s390-20141103

for you to fetch changes up to f7ceb0dfec43d2d4e2373d02968f8fb58c6858f7:

  KVM: s390: virtio_ccw: remove unused variable (2014-11-03 10:43:00 +0100)


Fix a build warning in virtio-ccw introduced during the merge window.



Sebastian Ott (1):
  KVM: s390: virtio_ccw: remove unused variable

 drivers/s390/kvm/virtio_ccw.c |1 -
 1 file changed, 1 deletion(-)

-- 
1.7.9.5



[PULL 1/1] KVM: s390: virtio_ccw: remove unused variable

2014-11-03 Thread Cornelia Huck
From: Sebastian Ott 

Fix this warning:
drivers/s390/kvm/virtio_ccw.c: In function ‘virtio_ccw_int_handler’:
drivers/s390/kvm/virtio_ccw.c:891:24: warning: unused variable ‘drv’ 
[-Wunused-variable]
struct virtio_driver *drv;

Signed-off-by: Sebastian Ott 
Acked-by: Cornelia Huck 
Acked-by: Michael S. Tsirkin 
Acked-by: Rusty Russell 
Signed-off-by: Cornelia Huck 
---
 drivers/s390/kvm/virtio_ccw.c |1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/s390/kvm/virtio_ccw.c b/drivers/s390/kvm/virtio_ccw.c
index 6cbe6ef..bda52f1 100644
--- a/drivers/s390/kvm/virtio_ccw.c
+++ b/drivers/s390/kvm/virtio_ccw.c
@@ -888,7 +888,6 @@ static void virtio_ccw_int_handler(struct ccw_device *cdev,
struct virtio_ccw_device *vcdev = dev_get_drvdata(&cdev->dev);
int i;
struct virtqueue *vq;
-   struct virtio_driver *drv;
 
if (!vcdev)
return;
-- 
1.7.9.5



Re: [PATCH v13 09/11] pvqspinlock, x86: Add para-virtualization support

2014-11-03 Thread Peter Zijlstra
On Wed, Oct 29, 2014 at 04:19:09PM -0400, Waiman Long wrote:
>  arch/x86/include/asm/pvqspinlock.h|  411 
> +

I do wonder why all this needs to live in x86..

>  
> +#ifdef CONFIG_QUEUE_SPINLOCK
> +
> +static __always_inline void pv_kick_cpu(int cpu)
> +{
> + PVOP_VCALLEE1(pv_lock_ops.kick_cpu, cpu);
> +}
> +
> +static __always_inline void pv_lockwait(u8 *lockbyte)
> +{
> + PVOP_VCALLEE1(pv_lock_ops.lockwait, lockbyte);
> +}
> +
> +static __always_inline void pv_lockstat(enum pv_lock_stats type)
> +{
> + PVOP_VCALLEE1(pv_lock_ops.lockstat, type);
> +}

Why are any of these PV ops? they're only called from other pv_*()
functions. What's the point of pv ops you only call from pv code?

> +/*
> + *   Queue Spinlock Para-Virtualization (PV) Support
> + *
> + * The PV support code for queue spinlock is roughly the same as that
> + * of the ticket spinlock.

Relative comments are bad, esp. since we'll make the ticket code go away
if this works, at which point this is a reference into a black hole.

> Each CPU waiting for the lock will spin until it
> + * reaches a threshold. When that happens, it will put itself to a halt state
> + * so that the hypervisor can reuse the CPU cycles in some other guests as
> + * well as returning other hold-up CPUs faster.




> +/**
> + * queue_spin_lock - acquire a queue spinlock
> + * @lock: Pointer to queue spinlock structure
> + *
> + * N.B. INLINE_SPIN_LOCK should not be enabled when PARAVIRT_SPINLOCK is on.

One should write a compile time fail for that, not a comment.

> + */
> +static __always_inline void queue_spin_lock(struct qspinlock *lock)
> +{
> + u32 val;
> +
> + val = atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL);
> + if (likely(val == 0))
> + return;
> + if (static_key_false(¶virt_spinlocks_enabled))
> + pv_queue_spin_lock_slowpath(lock, val);
> + else
> + queue_spin_lock_slowpath(lock, val);
> +}

No, this is just vile.. _that_ is what we have PV ops for. And at that
point it's the same function it was before the PV stuff, so that whole
inline thing is then gone.

> +extern void queue_spin_unlock_slowpath(struct qspinlock *lock);
> +
>  /**
>   * queue_spin_unlock - release a queue spinlock
>   * @lock : Pointer to queue spinlock structure
>   *
>   * An effective smp_store_release() on the least-significant byte.
> + *
> + * Inlining of the unlock function is disabled when CONFIG_PARAVIRT_SPINLOCKS
> + * is defined. So _raw_spin_unlock() will be the only call site that will
> + * have to be patched.

Again, if you hard-rely on this not being inlined, make it a build failure,
not a comment.

>   */
>  static inline void queue_spin_unlock(struct qspinlock *lock)
>  {
>   barrier();
> + if (!static_key_false(¶virt_spinlocks_enabled)) {
> + native_spin_unlock(lock);
> + return;
> + }
>  
> + /*
> +  * Need to atomically clear the lock byte to avoid racing with
> +  * queue head waiter trying to set _QLOCK_LOCKED_SLOWPATH.
> +  */
> + if (unlikely(cmpxchg((u8 *)lock, _Q_LOCKED_VAL, 0) != _Q_LOCKED_VAL))
> + queue_spin_unlock_slowpath(lock);
> +}

Idem, that static key stuff is wrong, use PV ops to switch between
unlock paths.

> @@ -354,7 +394,7 @@ queue:
>* if there was a previous node; link it and wait until reaching the
>* head of the waitqueue.
>*/
> - if (old & _Q_TAIL_MASK) {
> + if (!pv_link_and_wait_node(old, node) && (old & _Q_TAIL_MASK)) {
>   prev = decode_tail(old);
>   ACCESS_ONCE(prev->next) = node;
> @@ -369,9 +409,11 @@ queue:
>*
>* *,x,y -> *,0,0
>*/
> - while ((val = smp_load_acquire(&lock->val.counter)) &
> - _Q_LOCKED_PENDING_MASK)
> + val = pv_wait_head(lock, node);
> + while (val & _Q_LOCKED_PENDING_MASK) {
>   cpu_relax();
> + val = smp_load_acquire(&lock->val.counter);
> + }
>  
>   /*
>* claim the lock:

Please make the pv_*() calls return void and reduce to NOPs. This keeps
the logic invariant of the pv stuff.
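
Going back to the compile-time-failure suggestion above, a minimal sketch, assuming
the existing CONFIG_INLINE_SPIN_LOCK and CONFIG_PARAVIRT_SPINLOCKS symbols:

/* Refuse to build if the lock fast path would be inlined while the paravirt
 * slow paths are compiled in.
 */
#if defined(CONFIG_PARAVIRT_SPINLOCKS) && defined(CONFIG_INLINE_SPIN_LOCK)
#error "CONFIG_INLINE_SPIN_LOCK cannot be enabled with CONFIG_PARAVIRT_SPINLOCKS"
#endif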


Re: [PATCH v2 04/15] arm/arm64: KVM: wrap 64 bit MMIO accesses with two 32 bit ones

2014-11-03 Thread Christoffer Dall
On Fri, Oct 31, 2014 at 01:49:12PM +, Andre Przywara wrote:
> Hi Christoffer,
> 
> On 15/10/14 17:26, Christoffer Dall wrote:
> > On Thu, Aug 21, 2014 at 02:06:45PM +0100, Andre Przywara wrote:

[...]

> >> +   * 8 bytes long, caused by a 64-bit access
> >> +   */
> >> +
> >> +  mmio32.len = 4;
> >> +  mmio32.is_write = mmio->is_write;
> >> +
> >> +  mmio32.phys_addr = mmio->phys_addr + 4;
> >> +  if (mmio->is_write)
> >> +  *(u32 *)mmio32.data = data32[1];
> >> +  ret = range->handle_mmio(vcpu, &mmio32, offset + 4);
> >> +  if (!mmio->is_write)
> >> +  data32[1] = *(u32 *)mmio32.data;
> >> +
> >> +  mmio32.phys_addr = mmio->phys_addr;
> >> +  if (mmio->is_write)
> >> +  *(u32 *)mmio32.data = data32[0];
> >> +  ret |= range->handle_mmio(vcpu, &mmio32, offset);
> >> +  if (!mmio->is_write)
> >> +  data32[0] = *(u32 *)mmio32.data;
> > 
> > won't this break on a BE system?
> 
> Mmh, I remember having discussed this with Marc before, but I see that
> it looks suspicious. This whole endianness thing is even more confusing
> since the GIC is always LE and the guest as well as KVM already do swapping.
> So I rewrote the above function to avoid explicit endianness assumptions,
> but I am still struggling to get it tested successfully on a bi-endian setup.
> As I don't want to hold back the newer patches any longer, I will try to
> debug this next week; meanwhile I won't claim that bi-endianness is supported
> for the new series.
> 
Well, you're writing code here that just won't work on a big-endian
system. Regardless of our ability to test things, you need to at least
put a big fat comment saying "TODO FIXME BROKEN: breaks on BE systems";
but I'm not sure I would ack that, given we know it's broken, so I
strongly recommend you do a best-effort implementation for now, even
without an environment to test it in.

-Christoffer
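
For reference, one endianness-agnostic way to split the access is to work on an
explicit 64-bit value with shifts instead of indexing the data buffer as a u32
array; the sketch below is illustrative only (handle_mmio32() is a placeholder,
not the vgic API) and assumes the 64-bit value is already in CPU byte order and
that the device registers are little-endian:

static int split_mmio64(struct kvm_vcpu *vcpu, phys_addr_t addr, bool is_write,
                        u64 *data,
                        int (*handle_mmio32)(struct kvm_vcpu *vcpu,
                                             phys_addr_t addr,
                                             bool is_write, u32 *val))
{
        u32 lo = (u32)*data;            /* low half of an LE register pair */
        u32 hi = (u32)(*data >> 32);    /* high half, at offset +4 */
        int ret;

        ret = handle_mmio32(vcpu, addr, is_write, &lo);
        ret |= handle_mmio32(vcpu, addr + 4, is_write, &hi);

        if (!is_write)
                *data = ((u64)hi << 32) | lo;

        return ret;
}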