RE: [v3 00/26] Add VT-d Posted-Interrupts support

2015-01-05 Thread Wu, Feng
Ping...

Hi Joerg & David,

Could you please have a look at the IOMMU part of this series (patches 02-04,
06-09, and 26)?

Hi Thomas, Ingo, & Peter,

Could you please have a look at this series, especially for patch 01, 05, 21?

Thanks,
Feng

> -Original Message-
> From: Wu, Feng
> Sent: Friday, December 12, 2014 11:15 PM
> To: t...@linutronix.de; mi...@redhat.com; h...@zytor.com; x...@kernel.org;
> g...@kernel.org; pbonz...@redhat.com; dw...@infradead.org;
> j...@8bytes.org; alex.william...@redhat.com; jiang@linux.intel.com
> Cc: eric.au...@linaro.org; linux-ker...@vger.kernel.org;
> io...@lists.linux-foundation.org; kvm@vger.kernel.org; Wu, Feng
> Subject: [v3 00/26] Add VT-d Posted-Interrupts support
> 
> VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
> With VT-d Posted-Interrupts enabled, external interrupts from
> direct-assigned devices can be delivered to guests without VMM
> intervention when guest is running in non-root mode.
> 
> You can find the VT-d Posted-Interrupts spec at the following URL:
> http://www.intel.com/content/www/us/en/intelligent-systems/intel-technolog
> y/vt-directed-io-spec.html
> 
> v1->v2:
> * Use the VFIO framework to enable this feature; the VFIO part of this series
>   is based on Eric's series "[PATCH v3 0/8] KVM-VFIO IRQ forward control"
> * Rebase this patchset on
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git,
>   then revise some irq logic based on the new hierarchy irqdomain patches
> provided
>   by Jiang Liu 
> 
> v2->v3:
> * Adjust the Posted-interrupts Descriptor updating logic when vCPU is
>   preempted or blocked.
> * KVM_DEV_VFIO_DEVICE_POSTING_IRQ -->
> KVM_DEV_VFIO_DEVICE_POST_IRQ
> * __KVM_HAVE_ARCH_KVM_VFIO_POSTING -->
> __KVM_HAVE_ARCH_KVM_VFIO_POST
> * Add KVM_DEV_VFIO_DEVICE_UNPOST_IRQ attribute for VFIO irq, which
>   can be used to change back to remapping mode.
> * Fix typo
> 
> This patch series is made of the following groups:
> 1-6: Preparatory changes in the iommu and irq components, based on the
>  new hierarchy irqdomain logic.
> 7-9, 26: IOMMU changes for VT-d Posted-Interrupts, such as feature
>   detection and a command line parameter.
> 10-17, 22-25: Changes related to KVM itself.
> 18-20: Changes in the VFIO component; this part was previously sent out as
> "[RFC PATCH v2 0/2] kvm-vfio: implement the vfio skeleton for VT-d
> Posted-Interrupts"
> 21: x86 irq related changes
> 
> Feng Wu (26):
>   genirq: Introduce irq_set_vcpu_affinity() to target an interrupt to a
> VCPU
>   iommu: Add new member capability to struct irq_remap_ops
>   iommu, x86: Define new irte structure for VT-d Posted-Interrupts
>   iommu, x86: Implement irq_set_vcpu_affinity for intel_ir_chip
>   x86, irq: Implement irq_set_vcpu_affinity for pci_msi_ir_controller
>   iommu, x86: No need to migrate irq for VT-d Posted-Interrupts
>   iommu, x86: Add cap_pi_support() to detect VT-d PI capability
>   iommu, x86: Add intel_irq_remapping_capability() for Intel
>   iommu, x86: define irq_remapping_cap()
>   KVM: change struct pi_desc for VT-d Posted-Interrupts
>   KVM: Add some helper functions for Posted-Interrupts
>   KVM: Initialize VT-d Posted-Interrupts Descriptor
>   KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI
>   KVM: Get Posted-Interrupts descriptor address from struct kvm_vcpu
>   KVM: add interfaces to control PI outside vmx
>   KVM: Make struct kvm_irq_routing_table accessible
>   KVM: make kvm_set_msi_irq() public
>   KVM: kvm-vfio: User API for VT-d Posted-Interrupts
>   KVM: kvm-vfio: implement the VFIO skeleton for VT-d Posted-Interrupts
>   KVM: x86: kvm-vfio: VT-d posted-interrupts setup
>   x86, irq: Define a global vector for VT-d Posted-Interrupts
>   KVM: Define a wakeup worker thread for vCPU
>   KVM: Update Posted-Interrupts Descriptor when vCPU is preempted
>   KVM: Update Posted-Interrupts Descriptor when vCPU is blocked
>   KVM: Suppress posted-interrupt when 'SN' is set
>   iommu/vt-d: Add a command line parameter for VT-d posted-interrupts
> 
>  Documentation/kernel-parameters.txt|   1 +
>  Documentation/virtual/kvm/devices/vfio.txt |   9 ++
>  arch/x86/include/asm/entry_arch.h  |   2 +
>  arch/x86/include/asm/hardirq.h |   1 +
>  arch/x86/include/asm/hw_irq.h  |   2 +
>  arch/x86/include/asm/irq_remapping.h   |  11 ++
>  arch/x86/include/asm/irq_vectors.h |   1 +
>  arch/x86/include/asm/kvm_host.h|  12 ++
>  arch/x86/kernel/apic/msi.c |   1 +
>  arch/x86/kernel/entry_64.S |   2 +
>  arch/x86/kernel/irq.c  |  27 
>  arch/x86/kernel/irqinit.c  |   2 +
>  arch/x86/kvm/Makefile  |   2 +-
>  arch/x86/kvm/kvm_vfio_x86.c|  77 +
>  arch/x86/kvm/vmx.c | 244
> -
>  arch/x86/kvm/x86.c |  22 ++-
>  drivers/iommu/in

Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

2015-01-05 Thread Andy Lutomirski
On Mon, Jan 5, 2015 at 2:48 PM, Marcelo Tosatti  wrote:
> On Mon, Jan 05, 2015 at 02:38:46PM -0800, Andy Lutomirski wrote:
>> On Mon, Jan 5, 2015 at 11:17 AM, Marcelo Tosatti  wrote:
>> > On Mon, Jan 05, 2015 at 10:56:07AM -0800, Andy Lutomirski wrote:
>> >> On Mon, Jan 5, 2015 at 7:25 AM, Marcelo Tosatti  
>> >> wrote:
>> >> > On Mon, Dec 22, 2014 at 04:39:57PM -0800, Andy Lutomirski wrote:
>> >> >> The pvclock vdso code was too abstracted to understand easily and
>> >> >> excessively paranoid.  Simplify it for a huge speedup.
>> >> >>
>> >> >> This opens the door for additional simplifications, as the vdso no
>> >> >> longer accesses the pvti for any vcpu other than vcpu 0.
>> >> >>
>> >> >> Before, vclock_gettime using kvm-clock took about 64ns on my machine.
>> >> >> With this change, it takes 19ns, which is almost as fast as the pure 
>> >> >> TSC
>> >> >> implementation.
>> >> >>
>> >> >> Signed-off-by: Andy Lutomirski 
>> >> >> ---
>> >> >>  arch/x86/vdso/vclock_gettime.c | 82 
>> >> >> --
>> >> >>  1 file changed, 47 insertions(+), 35 deletions(-)
>> >> >>
>> >> >> diff --git a/arch/x86/vdso/vclock_gettime.c 
>> >> >> b/arch/x86/vdso/vclock_gettime.c
>> >> >> index 9793322751e0..f2e0396d5629 100644
>> >> >> --- a/arch/x86/vdso/vclock_gettime.c
>> >> >> +++ b/arch/x86/vdso/vclock_gettime.c
>> >> >> @@ -78,47 +78,59 @@ static notrace const struct 
>> >> >> pvclock_vsyscall_time_info *get_pvti(int cpu)
>> >> >>
>> >> >>  static notrace cycle_t vread_pvclock(int *mode)
>> >> >>  {
>> >> >> - const struct pvclock_vsyscall_time_info *pvti;
>> >> >> + const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
>> >> >>   cycle_t ret;
>> >> >> - u64 last;
>> >> >> - u32 version;
>> >> >> - u8 flags;
>> >> >> - unsigned cpu, cpu1;
>> >> >> -
>> >> >> + u64 tsc, pvti_tsc;
>> >> >> + u64 last, delta, pvti_system_time;
>> >> >> + u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
>> >> >>
>> >> >>   /*
>> >> >> -  * Note: hypervisor must guarantee that:
>> >> >> -  * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
>> >> >> -  * 2. that per-CPU pvclock time info is updated if the
>> >> >> -  *underlying CPU changes.
>> >> >> -  * 3. that version is increased whenever underlying CPU
>> >> >> -  *changes.
>> >> >> +  * Note: The kernel and hypervisor must guarantee that cpu ID
>> >> >> +  * number maps 1:1 to per-CPU pvclock time info.
>> >> >> +  *
>> >> >> +  * Because the hypervisor is entirely unaware of guest userspace
>> >> >> +  * preemption, it cannot guarantee that per-CPU pvclock time
>> >> >> +  * info is updated if the underlying CPU changes or that that
>> >> >> +  * version is increased whenever underlying CPU changes.
>> >> >> +  *
>> >> >> +  * On KVM, we are guaranteed that pvti updates for any vCPU are
>> >> >> +  * atomic as seen by *all* vCPUs.  This is an even stronger
>> >> >> +  * guarantee than we get with a normal seqlock.
>> >> >>*
>> >> >> +  * On Xen, we don't appear to have that guarantee, but Xen still
>> >> >> +  * supplies a valid seqlock using the version field.
>> >> >> +
>> >> >> +  * We only do pvclock vdso timing at all if
>> >> >> +  * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
>> >> >> +  * mean that all vCPUs have matching pvti and that the TSC is
>> >> >> +  * synced, so we can just look at vCPU 0's pvti.
>> >> >>*/
>> >> >
>> >> > Can Xen guarantee that ?
>> >>
>> >> I think so, vacuously.  Xen doesn't seem to set PVCLOCK_TSC_STABLE_BIT
>> >> at all.  I have no idea going forward, though.
>> >>
>> >> Xen people?
>> >>
>> >> >
>> >> >> - do {
>> >> >> - cpu = __getcpu() & VGETCPU_CPU_MASK;
>> >> >> - /* TODO: We can put vcpu id into higher bits of 
>> >> >> pvti.version.
>> >> >> -  * This will save a couple of cycles by getting rid of
>> >> >> -  * __getcpu() calls (Gleb).
>> >> >> -  */
>> >> >> -
>> >> >> - pvti = get_pvti(cpu);
>> >> >> -
>> >> >> - version = __pvclock_read_cycles(&pvti->pvti, &ret, 
>> >> >> &flags);
>> >> >> -
>> >> >> - /*
>> >> >> -  * Test we're still on the cpu as well as the version.
>> >> >> -  * We could have been migrated just after the first
>> >> >> -  * vgetcpu but before fetching the version, so we
>> >> >> -  * wouldn't notice a version change.
>> >> >> -  */
>> >> >> - cpu1 = __getcpu() & VGETCPU_CPU_MASK;
>> >> >> - } while (unlikely(cpu != cpu1 ||
>> >> >> -   (pvti->pvti.version & 1) ||
>> >> >> -   pvti->pvti.version != version));
>> >> >> -
>> >> >> - if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
>> >> >> +
>> >> >> + if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
>> >> >>



Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

2015-01-05 Thread Marcelo Tosatti
On Mon, Jan 05, 2015 at 02:38:46PM -0800, Andy Lutomirski wrote:
> On Mon, Jan 5, 2015 at 11:17 AM, Marcelo Tosatti  wrote:
> > On Mon, Jan 05, 2015 at 10:56:07AM -0800, Andy Lutomirski wrote:
> >> On Mon, Jan 5, 2015 at 7:25 AM, Marcelo Tosatti  
> >> wrote:
> >> > On Mon, Dec 22, 2014 at 04:39:57PM -0800, Andy Lutomirski wrote:
> >> >> The pvclock vdso code was too abstracted to understand easily and
> >> >> excessively paranoid.  Simplify it for a huge speedup.
> >> >>
> >> >> This opens the door for additional simplifications, as the vdso no
> >> >> longer accesses the pvti for any vcpu other than vcpu 0.
> >> >>
> >> >> Before, vclock_gettime using kvm-clock took about 64ns on my machine.
> >> >> With this change, it takes 19ns, which is almost as fast as the pure TSC
> >> >> implementation.
> >> >>
> >> >> Signed-off-by: Andy Lutomirski 
> >> >> ---
> >> >>  arch/x86/vdso/vclock_gettime.c | 82 
> >> >> --
> >> >>  1 file changed, 47 insertions(+), 35 deletions(-)
> >> >>
> >> >> diff --git a/arch/x86/vdso/vclock_gettime.c 
> >> >> b/arch/x86/vdso/vclock_gettime.c
> >> >> index 9793322751e0..f2e0396d5629 100644
> >> >> --- a/arch/x86/vdso/vclock_gettime.c
> >> >> +++ b/arch/x86/vdso/vclock_gettime.c
> >> >> @@ -78,47 +78,59 @@ static notrace const struct 
> >> >> pvclock_vsyscall_time_info *get_pvti(int cpu)
> >> >>
> >> >>  static notrace cycle_t vread_pvclock(int *mode)
> >> >>  {
> >> >> - const struct pvclock_vsyscall_time_info *pvti;
> >> >> + const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
> >> >>   cycle_t ret;
> >> >> - u64 last;
> >> >> - u32 version;
> >> >> - u8 flags;
> >> >> - unsigned cpu, cpu1;
> >> >> -
> >> >> + u64 tsc, pvti_tsc;
> >> >> + u64 last, delta, pvti_system_time;
> >> >> + u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
> >> >>
> >> >>   /*
> >> >> -  * Note: hypervisor must guarantee that:
> >> >> -  * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
> >> >> -  * 2. that per-CPU pvclock time info is updated if the
> >> >> -  *underlying CPU changes.
> >> >> -  * 3. that version is increased whenever underlying CPU
> >> >> -  *changes.
> >> >> +  * Note: The kernel and hypervisor must guarantee that cpu ID
> >> >> +  * number maps 1:1 to per-CPU pvclock time info.
> >> >> +  *
> >> >> +  * Because the hypervisor is entirely unaware of guest userspace
> >> >> +  * preemption, it cannot guarantee that per-CPU pvclock time
> >> >> +  * info is updated if the underlying CPU changes or that that
> >> >> +  * version is increased whenever underlying CPU changes.
> >> >> +  *
> >> >> +  * On KVM, we are guaranteed that pvti updates for any vCPU are
> >> >> +  * atomic as seen by *all* vCPUs.  This is an even stronger
> >> >> +  * guarantee than we get with a normal seqlock.
> >> >>*
> >> >> +  * On Xen, we don't appear to have that guarantee, but Xen still
> >> >> +  * supplies a valid seqlock using the version field.
> >> >> +
> >> >> +  * We only do pvclock vdso timing at all if
> >> >> +  * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
> >> >> +  * mean that all vCPUs have matching pvti and that the TSC is
> >> >> +  * synced, so we can just look at vCPU 0's pvti.
> >> >>*/
> >> >
> >> > Can Xen guarantee that ?
> >>
> >> I think so, vacuously.  Xen doesn't seem to set PVCLOCK_TSC_STABLE_BIT
> >> at all.  I have no idea going forward, though.
> >>
> >> Xen people?
> >>
> >> >
> >> >> - do {
> >> >> - cpu = __getcpu() & VGETCPU_CPU_MASK;
> >> >> - /* TODO: We can put vcpu id into higher bits of 
> >> >> pvti.version.
> >> >> -  * This will save a couple of cycles by getting rid of
> >> >> -  * __getcpu() calls (Gleb).
> >> >> -  */
> >> >> -
> >> >> - pvti = get_pvti(cpu);
> >> >> -
> >> >> - version = __pvclock_read_cycles(&pvti->pvti, &ret, 
> >> >> &flags);
> >> >> -
> >> >> - /*
> >> >> -  * Test we're still on the cpu as well as the version.
> >> >> -  * We could have been migrated just after the first
> >> >> -  * vgetcpu but before fetching the version, so we
> >> >> -  * wouldn't notice a version change.
> >> >> -  */
> >> >> - cpu1 = __getcpu() & VGETCPU_CPU_MASK;
> >> >> - } while (unlikely(cpu != cpu1 ||
> >> >> -   (pvti->pvti.version & 1) ||
> >> >> -   pvti->pvti.version != version));
> >> >> -
> >> >> - if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
> >> >> +
> >> >> + if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
> >> >>   *mode = VCLOCK_NONE;
> >> >> + return 0;
> >> >> + }
> >> >
> >> > This check must be performed after reading a stable pvti.
> >> >
> >>
> >> We

Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

2015-01-05 Thread Andy Lutomirski
On Mon, Jan 5, 2015 at 11:17 AM, Marcelo Tosatti  wrote:
> On Mon, Jan 05, 2015 at 10:56:07AM -0800, Andy Lutomirski wrote:
>> On Mon, Jan 5, 2015 at 7:25 AM, Marcelo Tosatti  wrote:
>> > On Mon, Dec 22, 2014 at 04:39:57PM -0800, Andy Lutomirski wrote:
>> >> The pvclock vdso code was too abstracted to understand easily and
>> >> excessively paranoid.  Simplify it for a huge speedup.
>> >>
>> >> This opens the door for additional simplifications, as the vdso no
>> >> longer accesses the pvti for any vcpu other than vcpu 0.
>> >>
>> >> Before, vclock_gettime using kvm-clock took about 64ns on my machine.
>> >> With this change, it takes 19ns, which is almost as fast as the pure TSC
>> >> implementation.
>> >>
>> >> Signed-off-by: Andy Lutomirski 
>> >> ---
>> >>  arch/x86/vdso/vclock_gettime.c | 82 
>> >> --
>> >>  1 file changed, 47 insertions(+), 35 deletions(-)
>> >>
>> >> diff --git a/arch/x86/vdso/vclock_gettime.c 
>> >> b/arch/x86/vdso/vclock_gettime.c
>> >> index 9793322751e0..f2e0396d5629 100644
>> >> --- a/arch/x86/vdso/vclock_gettime.c
>> >> +++ b/arch/x86/vdso/vclock_gettime.c
>> >> @@ -78,47 +78,59 @@ static notrace const struct 
>> >> pvclock_vsyscall_time_info *get_pvti(int cpu)
>> >>
>> >>  static notrace cycle_t vread_pvclock(int *mode)
>> >>  {
>> >> - const struct pvclock_vsyscall_time_info *pvti;
>> >> + const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
>> >>   cycle_t ret;
>> >> - u64 last;
>> >> - u32 version;
>> >> - u8 flags;
>> >> - unsigned cpu, cpu1;
>> >> -
>> >> + u64 tsc, pvti_tsc;
>> >> + u64 last, delta, pvti_system_time;
>> >> + u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
>> >>
>> >>   /*
>> >> -  * Note: hypervisor must guarantee that:
>> >> -  * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
>> >> -  * 2. that per-CPU pvclock time info is updated if the
>> >> -  *underlying CPU changes.
>> >> -  * 3. that version is increased whenever underlying CPU
>> >> -  *changes.
>> >> +  * Note: The kernel and hypervisor must guarantee that cpu ID
>> >> +  * number maps 1:1 to per-CPU pvclock time info.
>> >> +  *
>> >> +  * Because the hypervisor is entirely unaware of guest userspace
>> >> +  * preemption, it cannot guarantee that per-CPU pvclock time
>> >> +  * info is updated if the underlying CPU changes or that that
>> >> +  * version is increased whenever underlying CPU changes.
>> >> +  *
>> >> +  * On KVM, we are guaranteed that pvti updates for any vCPU are
>> >> +  * atomic as seen by *all* vCPUs.  This is an even stronger
>> >> +  * guarantee than we get with a normal seqlock.
>> >>*
>> >> +  * On Xen, we don't appear to have that guarantee, but Xen still
>> >> +  * supplies a valid seqlock using the version field.
>> >> +
>> >> +  * We only do pvclock vdso timing at all if
>> >> +  * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
>> >> +  * mean that all vCPUs have matching pvti and that the TSC is
>> >> +  * synced, so we can just look at vCPU 0's pvti.
>> >>*/
>> >
>> > Can Xen guarantee that ?
>>
>> I think so, vacuously.  Xen doesn't seem to set PVCLOCK_TSC_STABLE_BIT
>> at all.  I have no idea going forward, though.
>>
>> Xen people?
>>
>> >
>> >> - do {
>> >> - cpu = __getcpu() & VGETCPU_CPU_MASK;
>> >> - /* TODO: We can put vcpu id into higher bits of 
>> >> pvti.version.
>> >> -  * This will save a couple of cycles by getting rid of
>> >> -  * __getcpu() calls (Gleb).
>> >> -  */
>> >> -
>> >> - pvti = get_pvti(cpu);
>> >> -
>> >> - version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
>> >> -
>> >> - /*
>> >> -  * Test we're still on the cpu as well as the version.
>> >> -  * We could have been migrated just after the first
>> >> -  * vgetcpu but before fetching the version, so we
>> >> -  * wouldn't notice a version change.
>> >> -  */
>> >> - cpu1 = __getcpu() & VGETCPU_CPU_MASK;
>> >> - } while (unlikely(cpu != cpu1 ||
>> >> -   (pvti->pvti.version & 1) ||
>> >> -   pvti->pvti.version != version));
>> >> -
>> >> - if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
>> >> +
>> >> + if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
>> >>   *mode = VCLOCK_NONE;
>> >> + return 0;
>> >> + }
>> >
>> > This check must be performed after reading a stable pvti.
>> >
>>
>> We can even read it in the middle, guarded by the version checks.
>> I'll do that for v2.
>>
>> >> +
>> >> + do {
>> >> + version = pvti->version;
>> >> +
>> >> + /* This is also a read barrier, so we'll read version 
>> >> first. */
>> >> + rdtsc_barrier();
>> >> +

Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

2015-01-05 Thread Marcelo Tosatti
On Mon, Jan 05, 2015 at 10:56:07AM -0800, Andy Lutomirski wrote:
> On Mon, Jan 5, 2015 at 7:25 AM, Marcelo Tosatti  wrote:
> > On Mon, Dec 22, 2014 at 04:39:57PM -0800, Andy Lutomirski wrote:
> >> The pvclock vdso code was too abstracted to understand easily and
> >> excessively paranoid.  Simplify it for a huge speedup.
> >>
> >> This opens the door for additional simplifications, as the vdso no
> >> longer accesses the pvti for any vcpu other than vcpu 0.
> >>
> >> Before, vclock_gettime using kvm-clock took about 64ns on my machine.
> >> With this change, it takes 19ns, which is almost as fast as the pure TSC
> >> implementation.
> >>
> >> Signed-off-by: Andy Lutomirski 
> >> ---
> >>  arch/x86/vdso/vclock_gettime.c | 82 
> >> --
> >>  1 file changed, 47 insertions(+), 35 deletions(-)
> >>
> >> diff --git a/arch/x86/vdso/vclock_gettime.c 
> >> b/arch/x86/vdso/vclock_gettime.c
> >> index 9793322751e0..f2e0396d5629 100644
> >> --- a/arch/x86/vdso/vclock_gettime.c
> >> +++ b/arch/x86/vdso/vclock_gettime.c
> >> @@ -78,47 +78,59 @@ static notrace const struct pvclock_vsyscall_time_info 
> >> *get_pvti(int cpu)
> >>
> >>  static notrace cycle_t vread_pvclock(int *mode)
> >>  {
> >> - const struct pvclock_vsyscall_time_info *pvti;
> >> + const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
> >>   cycle_t ret;
> >> - u64 last;
> >> - u32 version;
> >> - u8 flags;
> >> - unsigned cpu, cpu1;
> >> -
> >> + u64 tsc, pvti_tsc;
> >> + u64 last, delta, pvti_system_time;
> >> + u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
> >>
> >>   /*
> >> -  * Note: hypervisor must guarantee that:
> >> -  * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
> >> -  * 2. that per-CPU pvclock time info is updated if the
> >> -  *underlying CPU changes.
> >> -  * 3. that version is increased whenever underlying CPU
> >> -  *changes.
> >> +  * Note: The kernel and hypervisor must guarantee that cpu ID
> >> +  * number maps 1:1 to per-CPU pvclock time info.
> >> +  *
> >> +  * Because the hypervisor is entirely unaware of guest userspace
> >> +  * preemption, it cannot guarantee that per-CPU pvclock time
> >> +  * info is updated if the underlying CPU changes or that that
> >> +  * version is increased whenever underlying CPU changes.
> >> +  *
> >> +  * On KVM, we are guaranteed that pvti updates for any vCPU are
> >> +  * atomic as seen by *all* vCPUs.  This is an even stronger
> >> +  * guarantee than we get with a normal seqlock.
> >>*
> >> +  * On Xen, we don't appear to have that guarantee, but Xen still
> >> +  * supplies a valid seqlock using the version field.
> >> +
> >> +  * We only do pvclock vdso timing at all if
> >> +  * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
> >> +  * mean that all vCPUs have matching pvti and that the TSC is
> >> +  * synced, so we can just look at vCPU 0's pvti.
> >>*/
> >
> > Can Xen guarantee that ?
> 
> I think so, vacuously.  Xen doesn't seem to set PVCLOCK_TSC_STABLE_BIT
> at all.  I have no idea going forward, though.
> 
> Xen people?
> 
> >
> >> - do {
> >> - cpu = __getcpu() & VGETCPU_CPU_MASK;
> >> - /* TODO: We can put vcpu id into higher bits of pvti.version.
> >> -  * This will save a couple of cycles by getting rid of
> >> -  * __getcpu() calls (Gleb).
> >> -  */
> >> -
> >> - pvti = get_pvti(cpu);
> >> -
> >> - version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
> >> -
> >> - /*
> >> -  * Test we're still on the cpu as well as the version.
> >> -  * We could have been migrated just after the first
> >> -  * vgetcpu but before fetching the version, so we
> >> -  * wouldn't notice a version change.
> >> -  */
> >> - cpu1 = __getcpu() & VGETCPU_CPU_MASK;
> >> - } while (unlikely(cpu != cpu1 ||
> >> -   (pvti->pvti.version & 1) ||
> >> -   pvti->pvti.version != version));
> >> -
> >> - if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
> >> +
> >> + if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
> >>   *mode = VCLOCK_NONE;
> >> + return 0;
> >> + }
> >
> > This check must be performed after reading a stable pvti.
> >
> 
> We can even read it in the middle, guarded by the version checks.
> I'll do that for v2.
> 
> >> +
> >> + do {
> >> + version = pvti->version;
> >> +
> >> + /* This is also a read barrier, so we'll read version first. 
> >> */
> >> + rdtsc_barrier();
> >> + tsc = __native_read_tsc();
> >> +
> >> + pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
> >> + pvti_tsc_shift = pvti->tsc_shift;
> >> + 

Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

2015-01-05 Thread Paolo Bonzini


On 05/01/2015 19:56, Andy Lutomirski wrote:
>> > 1) State: all pvtis marked as PVCLOCK_TSC_STABLE_BIT.
>> > 1) Update request for all vcpus, for a TSC_STABLE_BIT -> ~TSC_STABLE_BIT
>> > transition.
>> > 2) vCPU-1 updates its pvti with new values.
>> > 3) vCPU-0 still has not updated its pvti with new values.
>> > 4) vCPU-1 VM-enters, uses vCPU-0 values, even though it has been
>> > notified of a TSC_STABLE_BIT -> ~TSC_STABLE_BIT transition.
>> >
>> > The update is not actually atomic across all vCPUs, its atomic in
>> > the sense of not allowing visibility of distinct
>> > system_timestamp/tsc_timestamp values.
>> >
> Hmm.  In step 4, is there a guarantee that vCPU-0 won't VM-enter until
> it gets marked unstable?  Otherwise the vdso could could just as
> easily be called from vCPU-1, migrated to vCPU-0, read the data
> complete with stale stable bit, and get migrated back to vCPU-1.
> 
> But I thought that KVM currently froze all vCPUs when updating pvti
> for any of them.  How can this happen?  I admit I don't really
> understand the update request code.

That was also my understanding.  I thought this was the point of
kvm_make_mclock_inprogress_request/KVM_REQ_MCLOCK_INPROGRESS.

Disabling TSC_STABLE_BIT is triggered by pvclock_gtod_update_fn but it
happens in kvm_gen_update_masterclock, and no guest entries will happen
in the meanwhile.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] x86_64,entry: Fix RCX for traced syscalls

2015-01-05 Thread Andy Lutomirski
On Mon, Jan 5, 2015 at 4:59 AM, Borislav Petkov  wrote:
> On Fri, Nov 07, 2014 at 03:58:17PM -0800, Andy Lutomirski wrote:
>> The int_ret_from_sys_call and syscall tracing code disagrees with
>> the sysret path as to the value of RCX.
>>
>> The Intel SDM, the AMD APM, and my laptop all agree that sysret
>> returns with RCX == RIP.  The syscall tracing code does not respect
>> this property.
>>
>> For example, this program:
>>
>> int main()
>> {
>>   extern const char syscall_rip[];
>>   unsigned long rcx = 1;
>>   unsigned long orig_rcx = rcx;
>>   asm ("mov $-1, %%eax\n\t"
>>"syscall\n\t"
>>"syscall_rip:"
>>: "+c" (rcx) : : "r11");
>>   printf("syscall: RCX = %lX  RIP = %lX  orig RCX = %lx\n",
>>  rcx, (unsigned long)syscall_rip, orig_rcx);
>>   return 0;
>> }
>>
>> prints:
>> syscall: RCX = 400556  RIP = 400556  orig RCX = 1
>>
>> Running it under strace gives this instead:
>> syscall: RCX =   RIP = 400556  orig RCX = 1
>
> I can trigger the same even without tracing it:
>
> syscall: RCX =   RIP = 40052C  orig RCX = 1

Do you have context tracking on?

>
>> This changes FIXUP_TOP_OF_STACK to match sysret, causing the test to
>> show RCX == RIP even under strace.
>>
>> Signed-off-by: Andy Lutomirski 
>> ---
>>  arch/x86/kernel/entry_64.S | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
>> index df088bb03fb3..3710b8241945 100644
>> --- a/arch/x86/kernel/entry_64.S
>> +++ b/arch/x86/kernel/entry_64.S
>> @@ -143,7 +143,8 @@ ENDPROC(native_usergs_sysret64)
>>   movq \tmp,RSP+\offset(%rsp)
>>   movq $__USER_DS,SS+\offset(%rsp)
>>   movq $__USER_CS,CS+\offset(%rsp)
>> - movq $-1,RCX+\offset(%rsp)
>> + movq RIP+\offset(%rsp),\tmp  /* get rip */
>> + movq \tmp,RCX+\offset(%rsp)  /* copy it to rcx as sysret would do */
>>   movq R11+\offset(%rsp),\tmp  /* get eflags */
>>   movq \tmp,EFLAGS+\offset(%rsp)
>>   .endm
>> --
>
> For some reason this patch is causing ata resets on my box, see the
> end of this mail. So something's not kosher yet. If I boot the kernel
> without it, it all seems ok.
>
> Btw, this change was introduced in 2002: it used to return rIP in
> %rcx before, but it got changed to return -1 for rIP for some reason.


Thanks!  I assume that's in the historical tree?

[...]

>
> ---
>
> [  180.059170] ata1.00: exception Emask 0x0 SAct 0x7fff SErr 0x0 action 
> 0x6 frozen
> [  180.066873] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.072158] ata1.00: cmd 61/08:00:a8:ac:d9/00:00:23:00:00/40 tag 0 ncq 
> 4096 out
> [  180.072158]  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 
> (timeout)

That's really weird.  The only thing I can think of is that somehow we
returned to user mode without enabling interrupts.  This leads me to
wonder: why do we save eflags in the R11 pt_regs slot?  This seems
entirely backwards, not to mention that it accounts for two
instructions in each of FIXUP_TOP_OF_STACK and RESTORE_TOP_OF_STACK
for no apparent reason whatsoever.

Can you send the full output from syscall_exit_regs_64 from here:

https://gitorious.org/linux-test-utils/linux-clock-tests/source/34884122b6ebe81d9b96e3e5128b6d6d95082c6e:

with the patch applied (assuming it even gets that far for you)?  I
see results like:

[NOTE]syscall : orig RCX = 1  ss = 2b  orig_ss = 6b  flags =
217  orig_flags = 217

which seems fine.

Are you seeing this with the whole series applied or with only this patch?

--Andy

> [  180.086912] ata1.00: status: { DRDY }
> [  180.090591] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.095846] ata1.00: cmd 61/08:08:18:ae:d9/00:00:23:00:00/40 tag 1 ncq 
> 4096 out
> [  180.095846]  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 
> (timeout)
> [  180.110603] ata1.00: status: { DRDY }
> [  180.114283] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.119539] ata1.00: cmd 61/10:10:f0:b1:d9/00:00:23:00:00/40 tag 2 ncq 
> 8192 out
> [  180.119539]  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 
> (timeout)
> [  180.134292] ata1.00: status: { DRDY }
> [  180.137973] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.143226] ata1.00: cmd 61/08:18:00:98:18/00:00:1d:00:00/40 tag 3 ncq 
> 4096 out
> [  180.143226]  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 
> (timeout)
> [  180.158105] ata1.00: status: { DRDY }
> [  180.161809] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.167071] ata1.00: cmd 61/10:20:18:98:18/00:00:1d:00:00/40 tag 4 ncq 
> 8192 out
> [  180.167071]  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 
> (timeout)
> [  180.181822] ata1.00: status: { DRDY }
> [  180.185503] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.190756] ata1.00: cmd 61/a0:28:e0:7c:5d/25:00:1d:00:00/40 tag 5 ncq 
> 4931584 out
> [  180.190756]  res 40/00:00:00:00:00/00:00:00:00

Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

2015-01-05 Thread Andy Lutomirski
On Mon, Jan 5, 2015 at 7:25 AM, Marcelo Tosatti  wrote:
> On Mon, Dec 22, 2014 at 04:39:57PM -0800, Andy Lutomirski wrote:
>> The pvclock vdso code was too abstracted to understand easily and
>> excessively paranoid.  Simplify it for a huge speedup.
>>
>> This opens the door for additional simplifications, as the vdso no
>> longer accesses the pvti for any vcpu other than vcpu 0.
>>
>> Before, vclock_gettime using kvm-clock took about 64ns on my machine.
>> With this change, it takes 19ns, which is almost as fast as the pure TSC
>> implementation.
>>
>> Signed-off-by: Andy Lutomirski 
>> ---
>>  arch/x86/vdso/vclock_gettime.c | 82 
>> --
>>  1 file changed, 47 insertions(+), 35 deletions(-)
>>
>> diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
>> index 9793322751e0..f2e0396d5629 100644
>> --- a/arch/x86/vdso/vclock_gettime.c
>> +++ b/arch/x86/vdso/vclock_gettime.c
>> @@ -78,47 +78,59 @@ static notrace const struct pvclock_vsyscall_time_info 
>> *get_pvti(int cpu)
>>
>>  static notrace cycle_t vread_pvclock(int *mode)
>>  {
>> - const struct pvclock_vsyscall_time_info *pvti;
>> + const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
>>   cycle_t ret;
>> - u64 last;
>> - u32 version;
>> - u8 flags;
>> - unsigned cpu, cpu1;
>> -
>> + u64 tsc, pvti_tsc;
>> + u64 last, delta, pvti_system_time;
>> + u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
>>
>>   /*
>> -  * Note: hypervisor must guarantee that:
>> -  * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
>> -  * 2. that per-CPU pvclock time info is updated if the
>> -  *underlying CPU changes.
>> -  * 3. that version is increased whenever underlying CPU
>> -  *changes.
>> +  * Note: The kernel and hypervisor must guarantee that cpu ID
>> +  * number maps 1:1 to per-CPU pvclock time info.
>> +  *
>> +  * Because the hypervisor is entirely unaware of guest userspace
>> +  * preemption, it cannot guarantee that per-CPU pvclock time
>> +  * info is updated if the underlying CPU changes or that that
>> +  * version is increased whenever underlying CPU changes.
>> +  *
>> +  * On KVM, we are guaranteed that pvti updates for any vCPU are
>> +  * atomic as seen by *all* vCPUs.  This is an even stronger
>> +  * guarantee than we get with a normal seqlock.
>>*
>> +  * On Xen, we don't appear to have that guarantee, but Xen still
>> +  * supplies a valid seqlock using the version field.
>> +
>> +  * We only do pvclock vdso timing at all if
>> +  * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
>> +  * mean that all vCPUs have matching pvti and that the TSC is
>> +  * synced, so we can just look at vCPU 0's pvti.
>>*/
>
> Can Xen guarantee that ?

I think so, vacuously.  Xen doesn't seem to set PVCLOCK_TSC_STABLE_BIT
at all.  I have no idea going forward, though.

Xen people?

>
>> - do {
>> - cpu = __getcpu() & VGETCPU_CPU_MASK;
>> - /* TODO: We can put vcpu id into higher bits of pvti.version.
>> -  * This will save a couple of cycles by getting rid of
>> -  * __getcpu() calls (Gleb).
>> -  */
>> -
>> - pvti = get_pvti(cpu);
>> -
>> - version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
>> -
>> - /*
>> -  * Test we're still on the cpu as well as the version.
>> -  * We could have been migrated just after the first
>> -  * vgetcpu but before fetching the version, so we
>> -  * wouldn't notice a version change.
>> -  */
>> - cpu1 = __getcpu() & VGETCPU_CPU_MASK;
>> - } while (unlikely(cpu != cpu1 ||
>> -   (pvti->pvti.version & 1) ||
>> -   pvti->pvti.version != version));
>> -
>> - if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
>> +
>> + if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
>>   *mode = VCLOCK_NONE;
>> + return 0;
>> + }
>
> This check must be performed after reading a stable pvti.
>

We can even read it in the middle, guarded by the version checks.
I'll do that for v2.

>> +
>> + do {
>> + version = pvti->version;
>> +
>> + /* This is also a read barrier, so we'll read version first. */
>> + rdtsc_barrier();
>> + tsc = __native_read_tsc();
>> +
>> + pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
>> + pvti_tsc_shift = pvti->tsc_shift;
>> + pvti_system_time = pvti->system_time;
>> + pvti_tsc = pvti->tsc_timestamp;
>> +
>> + /* Make sure that the version double-check is last. */
>> + smp_rmb();
>> + } while (unlikely((version & 1) || version != pvti->version));
>> +
>> + delta = tsc - pvti_tsc;
>> + 

Re: [patch 2/3] KVM: x86: add option to advance tscdeadline hrtimer expiration

2015-01-05 Thread Radim Krcmar
2015-01-05 19:12+0100, Radim Krcmar:
>  (Right now, __delay() = delay_tsc()
> whenever the hardware has TSC, regardless of stability, thus always.)

(For quantifiers' sake, there also is 'tsc_disabled' variable.)
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 2/3] KVM: x86: add option to advance tscdeadline hrtimer expiration

2015-01-05 Thread Radim Krcmar
2014-12-23 15:58-0500, Marcelo Tosatti:
> For the hrtimer which emulates the tscdeadline timer in the guest,
> add an option to advance expiration, and busy spin on VM-entry waiting
> for the actual expiration time to elapse.
> 
> This allows achieving low latencies in cyclictest (or any scenario 
> which requires strict timing regarding timer expiration).
> 
> Reduces average cyclictest latency from 12us to 8us
> on Core i5 desktop.
> 
> Note: this option requires tuning to find the appropriate value 
> for a particular hardware/guest combination. One method is to measure the 
> average delay between apic_timer_fn and VM-entry. 
> Another method is to start with 1000ns, and increase the value
> in say 500ns increments until avg cyclictest numbers stop decreasing.
> 
> Signed-off-by: Marcelo Tosatti 

Reviewed-by: Radim Krčmář 

(Other patches weren't touched, so my previous Reviewed-by holds.)

> +++ kvm/arch/x86/kvm/x86.c
> @@ -108,6 +108,10 @@ EXPORT_SYMBOL_GPL(kvm_max_guest_tsc_khz)
>  static u32 tsc_tolerance_ppm = 250;
>  module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
>  
> +/* lapic timer advance (tscdeadline mode only) in nanoseconds */
> +unsigned int lapic_timer_advance_ns = 0;
> +module_param(lapic_timer_advance_ns, uint, S_IRUGO | S_IWUSR);
> +
>  static bool backwards_tsc_observed = false;
>  
>  #define KVM_NR_SHARED_MSRS 16
> @@ -5625,6 +5629,10 @@ static void kvm_timer_init(void)
>   __register_hotcpu_notifier(&kvmclock_cpu_notifier_block);
>   cpu_notifier_register_done();
>  
> + if (check_tsc_unstable() && lapic_timer_advance_ns) {
> + pr_info("kvm: unstable TSC, disabling lapic_timer_advance_ns\n");
> + lapic_timer_advance_ns = 0;

Does unstable TSC invalidate this feature?
(lapic_timer_advance_ns can be overridden, so we don't differentiate
 workflows that calibrate after starting with 0.)

And the cover letter is a bit misleading:  the condition does nothing to
guarantee a TSC-based __delay() loop.  (Right now, __delay() = delay_tsc()
whenever the hardware has TSC, regardless of stability, thus always.)


Kernel options for virtio-net

2015-01-05 Thread Brady Dean
I have a base Linux From Scratch installation in Virtualbox and I need
virtio-net in the kernel so I can use the virtio-net adapter through
Virtualbox.

I enabled the options listed here: www.linux-kvm.org/page/Virtio but
the network interface does not show up.

I was wondering if there are more kernel options I need to enable and
if there are any KVM packages I need to install in the guest.

I am using kernel 3.16.2.

Thanks a lot,

Brady
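[Editor's note: for reference, on a 3.16-era kernel the options that the
linux-kvm.org page refers to are roughly the following. This is a sketch,
not a guaranteed-complete list; the PCI transport is required because
VirtualBox exposes its paravirtualized NIC as a virtio-over-PCI device,
and building the drivers in (=y) avoids needing them in an initramfs,
which a minimal LFS guest may not have.]

```
CONFIG_VIRTIO=y
CONFIG_VIRTIO_PCI=y      # virtio-over-PCI transport, needed for VirtualBox
CONFIG_VIRTIO_RING=y     # folded into CONFIG_VIRTIO in later kernels
CONFIG_VIRTIO_NET=y
```

With the transport built in, `lspci` in the guest should list a virtio
network device even before the net driver binds, which helps tell a
missing-transport problem from a missing-driver one.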


Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

2015-01-05 Thread Marcelo Tosatti
On Mon, Dec 22, 2014 at 04:39:57PM -0800, Andy Lutomirski wrote:
> The pvclock vdso code was too abstracted to understand easily and
> excessively paranoid.  Simplify it for a huge speedup.
> 
> This opens the door for additional simplifications, as the vdso no
> longer accesses the pvti for any vcpu other than vcpu 0.
> 
> Before, vclock_gettime using kvm-clock took about 64ns on my machine.
> With this change, it takes 19ns, which is almost as fast as the pure TSC
> implementation.
> 
> Signed-off-by: Andy Lutomirski 
> ---
>  arch/x86/vdso/vclock_gettime.c | 82 
> --
>  1 file changed, 47 insertions(+), 35 deletions(-)
> 
> diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
> index 9793322751e0..f2e0396d5629 100644
> --- a/arch/x86/vdso/vclock_gettime.c
> +++ b/arch/x86/vdso/vclock_gettime.c
> @@ -78,47 +78,59 @@ static notrace const struct pvclock_vsyscall_time_info 
> *get_pvti(int cpu)
>  
>  static notrace cycle_t vread_pvclock(int *mode)
>  {
> - const struct pvclock_vsyscall_time_info *pvti;
> + const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
>   cycle_t ret;
> - u64 last;
> - u32 version;
> - u8 flags;
> - unsigned cpu, cpu1;
> -
> + u64 tsc, pvti_tsc;
> + u64 last, delta, pvti_system_time;
> + u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
>  
>   /*
> -  * Note: hypervisor must guarantee that:
> -  * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
> -  * 2. that per-CPU pvclock time info is updated if the
> -  *underlying CPU changes.
> -  * 3. that version is increased whenever underlying CPU
> -  *changes.
> +  * Note: The kernel and hypervisor must guarantee that cpu ID
> +  * number maps 1:1 to per-CPU pvclock time info.
> +  *
> +  * Because the hypervisor is entirely unaware of guest userspace
> +  * preemption, it cannot guarantee that per-CPU pvclock time
> +  * info is updated if the underlying CPU changes or that that
> +  * version is increased whenever underlying CPU changes.
> +  *
> +  * On KVM, we are guaranteed that pvti updates for any vCPU are
> +  * atomic as seen by *all* vCPUs.  This is an even stronger
> +  * guarantee than we get with a normal seqlock.
>*
> +  * On Xen, we don't appear to have that guarantee, but Xen still
> +  * supplies a valid seqlock using the version field.
> +
> +  * We only do pvclock vdso timing at all if
> +  * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
> +  * mean that all vCPUs have matching pvti and that the TSC is
> +  * synced, so we can just look at vCPU 0's pvti.
>*/

Can Xen guarantee that ?

> - do {
> - cpu = __getcpu() & VGETCPU_CPU_MASK;
> - /* TODO: We can put vcpu id into higher bits of pvti.version.
> -  * This will save a couple of cycles by getting rid of
> -  * __getcpu() calls (Gleb).
> -  */
> -
> - pvti = get_pvti(cpu);
> -
> - version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
> -
> - /*
> -  * Test we're still on the cpu as well as the version.
> -  * We could have been migrated just after the first
> -  * vgetcpu but before fetching the version, so we
> -  * wouldn't notice a version change.
> -  */
> - cpu1 = __getcpu() & VGETCPU_CPU_MASK;
> - } while (unlikely(cpu != cpu1 ||
> -   (pvti->pvti.version & 1) ||
> -   pvti->pvti.version != version));
> -
> - if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
> +
> + if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
>   *mode = VCLOCK_NONE;
> + return 0;
> + }

This check must be performed after reading a stable pvti.

> +
> + do {
> + version = pvti->version;
> +
> + /* This is also a read barrier, so we'll read version first. */
> + rdtsc_barrier();
> + tsc = __native_read_tsc();
> +
> + pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
> + pvti_tsc_shift = pvti->tsc_shift;
> + pvti_system_time = pvti->system_time;
> + pvti_tsc = pvti->tsc_timestamp;
> +
> + /* Make sure that the version double-check is last. */
> + smp_rmb();
> + } while (unlikely((version & 1) || version != pvti->version));
> +
> + delta = tsc - pvti_tsc;
> + ret = pvti_system_time +
> + pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
> + pvti_tsc_shift);

The following is possible:

1) State: all pvtis marked as PVCLOCK_TSC_STABLE_BIT.
1) Update request for all vcpus, for a TSC_STABLE_BIT -> ~TSC_STABLE_BIT
transition.
2) vCPU-1 updates its pvti with new values.
3) vCPU-0 still has not updated it

Re: [PATCH 1/3] x86_64,entry: Fix RCX for traced syscalls

2015-01-05 Thread Borislav Petkov
On Fri, Nov 07, 2014 at 03:58:17PM -0800, Andy Lutomirski wrote:
> The int_ret_from_sys_call and syscall tracing code disagrees with
> the sysret path as to the value of RCX.
> 
> The Intel SDM, the AMD APM, and my laptop all agree that sysret
> returns with RCX == RIP.  The syscall tracing code does not respect
> this property.
> 
> For example, this program:
> 
> #include <stdio.h>
> 
> int main()
> {
>   extern const char syscall_rip[];
>   unsigned long rcx = 1;
>   unsigned long orig_rcx = rcx;
>   asm ("mov $-1, %%eax\n\t"
>"syscall\n\t"
>"syscall_rip:"
>: "+c" (rcx) : : "r11");
>   printf("syscall: RCX = %lX  RIP = %lX  orig RCX = %lx\n",
>  rcx, (unsigned long)syscall_rip, orig_rcx);
>   return 0;
> }
> 
> prints:
> syscall: RCX = 400556  RIP = 400556  orig RCX = 1
> 
> Running it under strace gives this instead:
> syscall: RCX =   RIP = 400556  orig RCX = 1

I can trigger the same even without tracing it:

syscall: RCX =   RIP = 40052C  orig RCX = 1

> This changes FIXUP_TOP_OF_STACK to match sysret, causing the test to
> show RCX == RIP even under strace.
> 
> Signed-off-by: Andy Lutomirski 
> ---
>  arch/x86/kernel/entry_64.S | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index df088bb03fb3..3710b8241945 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -143,7 +143,8 @@ ENDPROC(native_usergs_sysret64)
>   movq \tmp,RSP+\offset(%rsp)
>   movq $__USER_DS,SS+\offset(%rsp)
>   movq $__USER_CS,CS+\offset(%rsp)
> - movq $-1,RCX+\offset(%rsp)
> + movq RIP+\offset(%rsp),\tmp  /* get rip */
> + movq \tmp,RCX+\offset(%rsp)  /* copy it to rcx as sysret would do */
>   movq R11+\offset(%rsp),\tmp  /* get eflags */
>   movq \tmp,EFLAGS+\offset(%rsp)
>   .endm
> --

For some reason this patch is causing ata resets on my box, see the
end of this mail. So something's not kosher yet. If I boot the kernel
without it, it all seems ok.

Btw, this behavior got introduced in 2002: the code used to return rIP in
%rcx before, but it got changed to stuff -1 into %rcx for some reason.

commit af53c7a2c81399b805b6d4eff887401a5e50feef
Author: Andi Kleen 
Date:   Fri Apr 19 20:23:17 2002 -0700

[PATCH] x86-64 architecture specific sync for 2.5.8

This patch brings 2.5.8 in sync with the x86-64 2.4 development tree again
(excluding device drivers)

It has lots of bug fixes and enhancements. It only touches architecture
specific files.

...

diff --git a/arch/x86_64/kernel/entry.S b/arch/x86_64/kernel/entry.S
index 6b98b90891f4..16c6e3faf5a7 100644
--- a/arch/x86_64/kernel/entry.S
+++ b/arch/x86_64/kernel/entry.S
@@ -5,7 +5,7 @@
  *  Copyright (C) 2000, 2001, 2002  Andi Kleen SuSE Labs
  *  Copyright (C) 2000  Pavel Machek 
  * 
- *  $Id: entry.S,v 1.66 2001/11/11 17:47:47 ak Exp $   
+ *  $Id$
  */
 
 /*
@@ -39,8 +39,7 @@
 #include 
 #include 
 #include 
-   
-#define RIP_SYMBOL_NAME(x) x(%rip)
+#include 
 
.code64
 
@@ -67,8 +66,7 @@
 movq \tmp,RSP(%rsp)
 movq $__USER_DS,SS(%rsp)
 movq $__USER_CS,CS(%rsp)
-movq RCX(%rsp),\tmp  /* get return address */
-movq \tmp,RIP(%rsp)
+movq $-1,RCX(%rsp)
 movq R11(%rsp),\tmp  /* get eflags */
 movq \tmp,EFLAGS(%rsp)
 .endm
@@ -76,8 +74,6 @@
 .macro RESTORE_TOP_OF_STACK tmp,offset=0
 movq RSP-\offset(%rsp),\tmp
 movq \tmp,PDAREF(pda_oldrsp)
-movq RIP-\offset(%rsp),\tmp
-movq \tmp,RCX-\offset(%rsp)
 movq EFLAGS-\offset(%rsp),\tmp
 movq \tmp,R11-\offset(%rsp)
 .endm

---

[  180.059170] ata1.00: exception Emask 0x0 SAct 0x7fff SErr 0x0 action 0x6 frozen
[  180.066873] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.072158] ata1.00: cmd 61/08:00:a8:ac:d9/00:00:23:00:00/40 tag 0 ncq 4096 out
[  180.072158]  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.086912] ata1.00: status: { DRDY }
[  180.090591] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.095846] ata1.00: cmd 61/08:08:18:ae:d9/00:00:23:00:00/40 tag 1 ncq 4096 out
[  180.095846]  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.110603] ata1.00: status: { DRDY }
[  180.114283] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.119539] ata1.00: cmd 61/10:10:f0:b1:d9/00:00:23:00:00/40 tag 2 ncq 8192 out
[  180.119539]  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.134292] ata1.00: status: { DRDY }
[  180.137973] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.143226] ata1.00: cmd 61/08:18:00:98:18/00:00:1d:00:00/40 tag 3 ncq 4096 out
[  180.143226]  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.158105] ata1.00: status: { DRDY }
[  180.161809] ata1.00: failed command: WRITE FPDMA

Re: cpu frequency

2015-01-05 Thread Nerijus Baliunas
Nerijus Baliunas  users.sourceforge.net> writes:

> Paolo Bonzini  redhat.com> writes:
> 
> > In the case of Windows it's probably some timing loop that is executed
> > at startup, and the result depends on frequency scaling in the host.
> > Try adding this to the XML in the meanwhile, and see if the control
> > panel shows the same value:
> > 
> > Inside :
> > 
> >   
> > 
> >   
> > 
> > Inside :
> > 
> >   
> 
> So far Control Panel -> System shows CPU as 2.2 GHz, I rebooted once. So it 
> seems OK.

Unfortunately after the host reboot the problem reappeared once. It helped to 
reboot the VM. Any ideas what else to try?

Regards,
Nerijus




Re: Fw: Benchmarking for vhost polling patch

2015-01-05 Thread Michael S. Tsirkin
Hi Razya,
Thanks for the update.
So that's reasonable I think, and I think it makes sense
to keep working on this in isolation - it's more
manageable at this size.

The big questions in my mind:
- What happens if system is lightly loaded?
  E.g. a ping/pong benchmark. How much extra CPU are
  we wasting?
- We see the best performance on your system is with 10usec worth of polling.
  It's OK to be able to tune it for best performance, but
  most people don't have the time or the inclination.
  So what would be the best value for other CPUs?
- Should this be tunable from userspace per vhost instance?
  Why is it only tunable globally?
- How bad is it if you don't pin vhost and vcpu threads?
  Is the scheduler smart enough to pull them apart?
- What happens in overcommit scenarios? Does polling make things
  much worse?
  Clearly polling will work worse if e.g. vhost and vcpu
  share the host cpu. How can we avoid conflicts?

  For two last questions, better cooperation with host scheduler will
  likely help here.
  See e.g.  http://thread.gmane.org/gmane.linux.kernel/1771791/focus=1772505
  I'm currently looking at pushing something similar upstream,
  if it goes in vhost polling can do something similar.

Any data points to shed light on these questions?

On Thu, Jan 01, 2015 at 02:59:21PM +0200, Razya Ladelsky wrote:
> Hi Michael,
> Just a follow up on the polling patch numbers,..
> Please let me know if you find these numbers satisfying enough to continue 
> with submitting this patch.
> Otherwise - we'll have this patch submitted as part of the larger Elvis 
> patch set rather than independently.
> Thank you,
> Razya 
> 
> - Forwarded by Razya Ladelsky/Haifa/IBM on 01/01/2015 09:37 AM -
> 
> From:   Razya Ladelsky/Haifa/IBM@IBMIL
> To: m...@redhat.com
> Cc: 
> Date:   25/11/2014 02:43 PM
> Subject:Re: Benchmarking for vhost polling patch
> Sent by:kvm-ow...@vger.kernel.org
> 
> 
> 
> Hi Michael,
> 
> > Hi Razya,
> > On the netperf benchmark, it looks like polling=10 gives a modest but
> > measureable gain.  So from that perspective it might be worth it if it's
> > not too much code, though we'll need to spend more time checking the
> > macro effect - we barely moved the needle on the macro benchmark and
> > that is suspicious.
> 
> I ran memcached with various values for the key & value arguments, and 
> managed to see a bigger impact of polling than when I used the default 
> values,
> Here are the numbers:
> 
> key=250      TPS      net    vhost  vm    TPS/cpu  TPS/CPU
> value=2048            rate   util   util           change
> 
> polling=0   101540   103.0  46   100   695.47
> polling=5   136747   123.0  83   100   747.25   0.074440609
> polling=7   140722   125.7  84   100   764.79   0.099663658
> polling=10  141719   126.3  87   100   757.85   0.089688003
> polling=15  142430   127.1  90   100   749.63   0.077863015
> polling=25  146347   128.7  95   100   750.49   0.079107993
> polling=50  150882   131.1  100  100   754.41   0.084733701
> 
> Macro benchmarks are less I/O intensive than the micro benchmark, which is 
> why 
> we can expect less impact for polling as compared to netperf. 
> However, as shown above, we managed to get 10% TPS/CPU improvement with 
> the 
> polling patch.
> 
> > Is there a chance you are actually trading latency for throughput?
> > do you observe any effect on latency?
> 
> No.
> 
> > How about trying some other benchmark, e.g. NFS?
> > 
> 
> Tried, but didn't have enough I/O produced (vhost was at most at 15% util)

OK but was there a regression in this case?


> > 
> > Also, I am wondering:
> > 
> > since vhost thread is polling in kernel anyway, shouldn't
> > we try and poll the host NIC?
> > that would likely reduce at least the latency significantly,
> > won't it?
> > 
> 
> Yes, it could be a great addition at some point, but needs a thorough 
> investigation. In any case, not a part of this patch...
> 
> Thanks,
> Razya
> 