Re: [PATCH v4 06/10] KVM: MMU: fast path of handling guest page fault
On 05/05/2012 10:08 PM, Marcelo Tosatti wrote:

I am confused with '_everywhere_', does it mean all of the paths that read/update the spte? Why not only verify the paths which depend on is_writable_pte()?

I meant any path that updates from present->present.

OK, got it. So let us focus on mmu_spte_update() only. :)

For the reason that it is easy to verify that it is correct? But these paths are safe since they do not care about PT_WRITABLE_MASK at all. What these paths do care about is that the Dirty bit and Accessed bit are not lost; that is why we always treat the spte as volatile if it can be updated out of mmu-lock.

For further development? We can add a delta comment for is_writable_pte() to warn developers to use it more carefully. It is also very hard to verify the spte everywhere. :(

Actually, the current code that cares about PT_WRITABLE_MASK is just for tlb flush; maybe we can fold it into mmu_spte_update.

[ There are three ways to modify a spte: present -> nonpresent, nonpresent -> present, present -> present. But we only need to care about present -> present for lockless. ]

Also need to take memory ordering into account, which was not an issue before. So it is not only TLB flush.

It seems we do not need an explicit barrier; we always use atomic xchg to update the spte, which has already guaranteed the memory ordering. In mmu_spte_update():

	/* The return value indicates whether the tlb needs to be flushed. */
	static bool mmu_spte_update(u64 *sptep, u64 new_spte)
	{
		u64 old_spte = *sptep;
		bool flush = false;

		old_spte = xchg(sptep, new_spte);

		if (is_writable_pte(old_spte) && !is_writable_pte(new_spte))
			flush = true;
		...
	}

	/*
	 * Return true means we need to flush tlbs caused by changing the
	 * spte from writable to read-only.
	 */
	bool mmu_update_spte(u64 *sptep, u64 spte)
	{
		u64 last_spte, old_spte = *sptep;
		bool flush = false;

		last_spte = xchg(sptep, spte);

		if ((is_writable_pte(last_spte) ||
		     spte_has_updated_lockless(old_spte, last_spte)) &&
		    !is_writable_pte(spte))
			flush = true;

		/* track Dirty/Accessed bit ... */

		return flush;
	}

Furthermore, the style of "if (spte-has-changed) goto beginning" is feasible in set_spte since this path is a fast path. (It can speed up mmu_need_write_protect.)

What do you mean exactly?

It would be better if all these complications introduced by lockless updates could be avoided, say by using A/D bits as Avi suggested.

Anyway, i do not object to it if we have a better way to do these, but ..
--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/4] KVM: Add APIs for unlocked TLB flush
On 05/03/2012 10:11 PM, Avi Kivity wrote: On 05/03/2012 04:23 PM, Xiao Guangrong wrote: On 05/03/2012 07:22 PM, Avi Kivity wrote:

Currently we flush the TLB while holding mmu_lock. This increases the lock hold time by the IPI round-trip time, increasing contention, and makes dropping the lock (for latency reasons) harder.

This patch changes TLB management to be usable locklessly, introducing the following APIs:

  kvm_mark_tlb_dirty() - mark the TLB as containing stale entries
  kvm_cond_flush_remote_tlbs() - flush the TLB if it was marked as dirty

These APIs can be used without holding mmu_lock (though if the TLB became stale due to shadow page table modifications, typically it will need to be called with the lock held to prevent other threads from seeing the modified page tables with the TLB unmarked and unflushed).

Signed-off-by: Avi Kivity a...@redhat.com
---
 Documentation/virtual/kvm/locking.txt | 14 ++
 arch/x86/kvm/paging_tmpl.h            |  4 ++--
 include/linux/kvm_host.h              | 22 +-
 virt/kvm/kvm_main.c                   | 29 -
 4 files changed, 57 insertions(+), 12 deletions(-)

diff --git a/Documentation/virtual/kvm/locking.txt b/Documentation/virtual/kvm/locking.txt
index 3b4cd3b..f6c90479 100644
--- a/Documentation/virtual/kvm/locking.txt
+++ b/Documentation/virtual/kvm/locking.txt
@@ -23,3 +23,17 @@
 Arch:		x86
 Protects:	- kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
 		- tsc offset in vmcb
 Comment:	'raw' because updating the tsc offsets must not be preempted.
+
+3. TLB control
+--
+
+The following APIs should be used for TLB control:
+
+ - kvm_mark_tlb_dirty() - indicates that the TLB is out of sync wrt
+   either guest or host page tables.
+ - kvm_flush_remote_tlbs() - unconditionally flush the tlbs
+ - kvm_cond_flush_remote_tlbs() - flush the TLBs if previously marked
+
+These may be used without mmu_lock, though kvm_mark_tlb_dirty() needs to be
+used while holding mmu_lock if it is called due to host page table changes
+(contrast to guest page table changes).
In these patches, it seems that kvm_mark_tlb_dirty is always used under the protection of mmu-lock, yes?

Correct. It's possible we'll find a use outside mmu_lock, but this isn't likely.

If we need to call kvm_mark_tlb_dirty outside mmu-lock, just use kvm_flush_remote_tlbs instead:

	if (need-flush-tlb)
		flush = true;

	do something...

	if (flush)
		kvm_flush_remote_tlbs()

If both kvm_mark_tlb_dirty and kvm_cond_flush_remote_tlbs are used outside of mmu-lock, i think we can use kvm_flush_remote_tlbs instead. If it is so, dirtied_count/flushed_count need not be atomic.

But we always mark with mmu_lock held.

Yes, so we can change kvm_mark_tlb_dirty to:

+static inline void kvm_mark_tlb_dirty(struct kvm *kvm)
+{
+	/*
+	 * Make any changes to the page tables visible to remote flushers.
+	 */
+	smp_mb();
+	kvm->tlb_state.dirtied_count++;
+}
Re: possible circular locking dependency
On 05/07/2012 06:47 AM, Paul E. McKenney wrote: On Sun, May 06, 2012 at 11:34:39PM +0300, Sergey Senozhatsky wrote: On (05/06/12 09:42), Paul E. McKenney wrote: On Sun, May 06, 2012 at 11:55:30AM +0300, Avi Kivity wrote: On 05/03/2012 11:02 PM, Sergey Senozhatsky wrote: Hello, 3.4-rc5

Whoa. Looks like inconsistent locking between cpufreq and synchronize_srcu_expedited(). kvm triggered this because it is one of the few users of synchronize_srcu_expedited(), but I don't think it is doing anything wrong directly. Dave, Paul?

SRCU hasn't changed much in mainline for quite some time. Holding the hotplug mutex across a synchronize_srcu() is a bad idea, though. However, there is a reworked implementation (courtesy of Lai Jiangshan) in -rcu that does not acquire the hotplug mutex. Could you try that out?

Paul, should I try solely -rcu or are there several commits to pick up and apply on top of the -linus tree?

If you want the smallest possible change, take the rcu/srcu branch of -rcu. If you want the works, take the rcu/next branch of -rcu. You can find -rcu at: git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git

To make the difference even smaller, merge the above branch with v3.4-rc5.

-- error compiling committee.c: too many arguments to function
RE: [PATCH] kvm: Enable device LTR/OBFF capibility before doing guest device assignment
-Original Message- From: Avi Kivity [mailto:a...@redhat.com] Sent: Sunday, May 06, 2012 11:34 PM To: Xudong Hao Cc: mtosa...@redhat.com; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; Zhang, Xiantao; Hao, Xudong; Alex Williamson Subject: Re: [PATCH] kvm: Enable device LTR/OBFF capability before doing guest device assignment

On 05/06/2012 06:24 PM, Xudong Hao wrote:

Enable device LTR/OBFF capability before doing device assignment, so that the guest can benefit from them.

cc += Alex

@@ -166,6 +166,10 @@ int kvm_assign_device(struct kvm *kvm,
 	if (pdev == NULL)
 		return -ENODEV;

+	/* Enable some device capabilities before doing device assignment,
+	 * so that the guest can benefit from them.
+	 */
+	kvm_iommu_enable_dev_caps(pdev);
 	r = iommu_attach_device(domain, pdev->dev);

Suppose we fail here. Do we need to disable_dev_caps()?

I don't think so. When a device is assigned to a guest, it is owned by the pci-stub driver, so iommu_attach_device() failing here does not affect anything. If the host wants to use it, the host device driver has its own enable/disable of the dev_caps.

 	if (r) {
 		printk(KERN_ERR "assign device %x:%x:%x.%x failed",

@@ -228,6 +232,7 @@ int kvm_deassign_device(struct kvm *kvm,
 		PCI_SLOT(assigned_dev->host_devfn),
 		PCI_FUNC(assigned_dev->host_devfn));

+	kvm_iommu_disable_dev_caps(pdev);
 	return 0;
 }

@@ -351,3 +356,30 @@ int kvm_iommu_unmap_guest(struct kvm *kvm)
 	iommu_domain_free(domain);
 	return 0;
 }
+
+static void kvm_iommu_enable_dev_caps(struct pci_dev *pdev)
+{
+	/* set default value */
+	unsigned long type = PCI_EXP_OBFF_SIGNAL_ALWAYS;
+	int snoop_lat_ns = 1024, nosnoop_lat_ns = 1024;

Where does this magic number come from?

The number is the max value that the register supports; it is set as a default here. We do not have any device at this point and do not know what the proper value is, so it sets a default value first.

+
+	/* LTR (latency tolerance reporting) allows devices to send
+	 * messages to the root complex indicating their latency
+	 * tolerance for snooped and unsnooped memory transactions.
+	 */
+	pci_enable_ltr(pdev);
+	pci_set_ltr(pdev, snoop_lat_ns, nosnoop_lat_ns);
+
+	/* OBFF (optimized buffer flush/fill), where supported,
+	 * can help improve energy efficiency by giving devices
+	 * information about when interrupts and other activity
+	 * will have a reduced power impact.
+	 */
+	pci_enable_obff(pdev, type);
+}
+
+static void kvm_iommu_disable_dev_caps(struct pci_dev *pdev)
+{
+	pci_disable_obff(pdev);
+	pci_disable_ltr(pdev);
+}

Do we need to communicate something about these capabilities to the guest?

I guess you mean that the host doesn't know whether the guest wants to enable them, right? The new LTR/OBFF features are supposed to be enabled by the guest if the platform and device support them.

-- error compiling committee.c: too many arguments to function
Re: [PATCH 1/4] KVM: Add APIs for unlocked TLB flush
On 05/07/2012 10:06 AM, Xiao Guangrong wrote: On 05/03/2012 10:11 PM, Avi Kivity wrote: On 05/03/2012 04:23 PM, Xiao Guangrong wrote: On 05/03/2012 07:22 PM, Avi Kivity wrote:

Currently we flush the TLB while holding mmu_lock. This increases the lock hold time by the IPI round-trip time, increasing contention, and makes dropping the lock (for latency reasons) harder.

This patch changes TLB management to be usable locklessly, introducing the following APIs:

  kvm_mark_tlb_dirty() - mark the TLB as containing stale entries
  kvm_cond_flush_remote_tlbs() - flush the TLB if it was marked as dirty

These APIs can be used without holding mmu_lock (though if the TLB became stale due to shadow page table modifications, typically it will need to be called with the lock held to prevent other threads from seeing the modified page tables with the TLB unmarked and unflushed).

Signed-off-by: Avi Kivity a...@redhat.com
---
 Documentation/virtual/kvm/locking.txt | 14 ++
 arch/x86/kvm/paging_tmpl.h            |  4 ++--
 include/linux/kvm_host.h              | 22 +-
 virt/kvm/kvm_main.c                   | 29 -
 4 files changed, 57 insertions(+), 12 deletions(-)

diff --git a/Documentation/virtual/kvm/locking.txt b/Documentation/virtual/kvm/locking.txt
index 3b4cd3b..f6c90479 100644
--- a/Documentation/virtual/kvm/locking.txt
+++ b/Documentation/virtual/kvm/locking.txt
@@ -23,3 +23,17 @@
 Arch:		x86
 Protects:	- kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
 		- tsc offset in vmcb
 Comment:	'raw' because updating the tsc offsets must not be preempted.
+
+3. TLB control
+--
+
+The following APIs should be used for TLB control:
+
+ - kvm_mark_tlb_dirty() - indicates that the TLB is out of sync wrt
+   either guest or host page tables.
+ - kvm_flush_remote_tlbs() - unconditionally flush the tlbs
+ - kvm_cond_flush_remote_tlbs() - flush the TLBs if previously marked
+
+These may be used without mmu_lock, though kvm_mark_tlb_dirty() needs to be
+used while holding mmu_lock if it is called due to host page table changes
+(contrast to guest page table changes).

In these patches, it seems that kvm_mark_tlb_dirty is always used under the protection of mmu-lock, yes?

Correct. It's possible we'll find a use outside mmu_lock, but this isn't likely.

If we need to call kvm_mark_tlb_dirty outside mmu-lock, just use kvm_flush_remote_tlbs instead:

	if (need-flush-tlb)
		flush = true;

	do something...

	if (flush)
		kvm_flush_remote_tlbs()

It depends on how need-flush-tlb is computed. If it depends on sptes, then we must use kvm_cond_flush_remote_tlbs().

If both kvm_mark_tlb_dirty and kvm_cond_flush_remote_tlbs are used outside of mmu-lock, i think we can use kvm_flush_remote_tlbs instead. If it is so, dirtied_count/flushed_count need not be atomic.

But we always mark with mmu_lock held.

Yes, so we can change kvm_mark_tlb_dirty to:

+static inline void kvm_mark_tlb_dirty(struct kvm *kvm)
+{
+	/*
+	 * Make any changes to the page tables visible to remote flushers.
+	 */
+	smp_mb();
+	kvm->tlb_state.dirtied_count++;
+}

Yes. We'll have to change it again if we ever dirty sptes outside the lock, but that's okay.

-- error compiling committee.c: too many arguments to function
Re: kvm: KVM internal error. Suberror: 1
On 05/06/2012 08:19 PM, Sasha Levin wrote:

Hi all, During some fuzzing with trinity in a KVM guest running on qemu, I got the following error:

KVM internal error. Suberror: 1
emulation failure
RAX= RBX=8800284108e0 RCX=0001 RDX=84482008
RSI=1030 RDI=8180 RBP=880028723d38 RSP=880028723ce8
R8 =0206 R9 =f7e80206 R10= R11=
R12=88002841 R13=846ba1c0 R14=84a74970 R15=9530
RIP=8111c862 RFL=00010046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =
CS =0010 00a09b00 DPL=0 CS64 [-RA]
SS =0018 00c09300 DPL=0 DS [-WA]
DS =
FS = 7f955873b700
GS = 880035a0
LDT=
TR =0040 880035bd2480 2087 8b00 DPL=0 TSS64-busy
GDT= 880035a04000 007f
IDT= 8436a000 0fff
CR0=8005003b CR2=7f5cfdad0518 CR3=1a154000 CR4=000407e0
DR0= DR1= DR2= DR3=
DR6=0ff0 DR7=0400
EFER=0d01
Code=66 90 e8 7b 97 ff ff b8 01 00 00 00 eb 1c 0f 1f 40 00 31 c0 83 3d 97 9f c7 02 00 0f 95 c0 eb 0a 66 90 31 c0 66 0f 1f 44 00 00 48 8b 5d d8 4c 8b 65 e0

KVM internal error. Suberror: 1
emulation failure

This is cmpl $0x0,0x2c79f97(%rip) # 0x83d96800. I don't understand why it failed, we do emulate cmp. I'll try to write a unit test for it.

RAX=88000d5f8000 RBX=88000d600010 RCX=0001 RDX=
RSI=0001 RDI=88000d5f8000 RBP=88000d601ec8 RSP=88000d601ec8
R8 =0001 R9 = R10= R11=
R12=83fed960 R13= R14= R15=
RIP=8107d696 RFL=0286 [--S--P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =
CS =0010 00a09b00 DPL=0 CS64 [-RA]
SS =0018 00c09300 DPL=0 DS [-WA]
DS =
FS =
GS = 88002980
LDT=
TR =0040 8800299d2480 2087 8b00 DPL=0 TSS64-busy
GDT= 880029804000 007f
IDT= 8436a000 0fff
CR0=8005003b CR2=7fcfa03f9e9c CR3=03a1c000 CR4=000407e0
DR0= DR1= DR2= DR3=
DR6=0ff0 DR7=0400
EFER=0d01
Code=89 e5 fb c9 c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb f4 c9 c3 0f 1f 84 00 00 00 00 00 55 48 89 e5 f4 c9 c3 66 0f 1f 84 00 00 00 00 00 55 8b 07 48

KVM internal error.
Suberror: 1
emulation failure
RAX=88000d5db000 RBX=88000d5ce010 RCX=0001 RDX=
RSI=0001 RDI=88000d5db000 RBP=88000d5cfec8 RSP=88000d5cfec8
R8 =0001 R9 = R10= R11=
R12=83fed960 R13= R14= R15=
RIP=8107d696 RFL=0286 [--S--P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =
CS =0010 00a09b00 DPL=0 CS64 [-RA]
SS =0018 00c09300 DPL=0 DS [-WA]
DS =
FS =
GS = 88001b80
LDT=
TR =0040 88001b9d2480 2087 8b00 DPL=0 TSS64-busy
GDT= 88001b804000 007f
IDT= 8436a000 0fff
CR0=8005003b CR2=7fcfa076b518 CR3=1a148000 CR4=000407e0
DR0= DR1= DR2= DR3=
DR6=0ff0 DR7=0400
EFER=0d01
Code=89 e5 fb c9 c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb f4 c9 c3 0f 1f 84 00 00 00 00 00 55 48 89 e5 f4 c9 c3 66 0f 1f 84 00 00 00 00 00 55 8b 07 48

The assembly doesn't quite make sense, and the fact that I got 3 of these in a row makes me believe that it isn't an actual emulation error, but something else.

The assembly makes sense, it's sti; hlt; leaveq. What doesn't make sense is that we have to emulate leaveq - rsp and rbp point at normal memory as far as I can tell. The fact that it often happens after hlt makes me suspect interrupts are involved. Please run this again with a trace so we see what happens prior to the error.

-- error compiling committee.c: too many arguments to function
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
* Raghavendra K T raghavendra...@linux.vnet.ibm.com wrote:

This series replaces the existing paravirtualized spinlock mechanism with a paravirtualized ticketlock mechanism. The series provides implementations for both Xen and KVM. (Targeted for the 3.5 window.)

Note: This needs the debugfs changes patch that should be in Xen / linux-next: https://lkml.org/lkml/2012/3/30/687

Changes in V8:
 - Rebased patches to 3.4-rc4
 - Combined the KVM changes with the ticketlock + Xen changes (Ingo)
 - Removed CAP_PV_UNHALT since it is redundant (Avi). But note that we need a newer qemu which uses the KVM_GET_SUPPORTED_CPUID ioctl.
 - Rewrote the GET_MP_STATE condition (Avi)
 - Made pv_unhalt a bool (Avi)
 - Moved the pv_unhalt reset code out to vcpu_run from vcpu_block (Gleb)
 - Documentation changes (Rob Landley)
 - Added a printk to recognize that paravirt spinlock is enabled (Nikunj)
 - Moved the kick hypercall out of CONFIG_PARAVIRT_SPINLOCK so that it can be used for other optimizations such as flush_tlb_ipi_others etc. (Nikunj)

Ticket locks have an inherent problem in a virtualized case, because the vCPUs are scheduled rather than running concurrently (ignoring gang scheduled vCPUs). This can result in catastrophic performance collapses when the vCPU scheduler doesn't schedule the correct next vCPU, and ends up scheduling a vCPU which burns its entire timeslice spinning. (Note that this is not the same problem as lock-holder preemption, which this series also addresses; that's also a problem, but not catastrophic.)

(See Thomas Friebel's talk "Prevent Guests from Spinning Around", http://www.xen.org/files/xensummitboston08/LHP.pdf, for more details.)

Currently we deal with this by having PV spinlocks, which add a layer of indirection in front of all the spinlock functions, and define a completely new implementation for Xen (and for other pvops users, but there are none at present).
PV ticketlocks keep the existing ticketlock implementation (fastpath) as-is, but add a couple of pvops for the slow paths:

 - If a CPU has been waiting for a spinlock for SPIN_THRESHOLD iterations, then call out to the __ticket_lock_spinning() pvop, which allows a backend to block the vCPU rather than spinning. This pvop can set the lock into slowpath state.

 - When releasing a lock, if it is in slowpath state, then call __ticket_unlock_kick() to kick the next vCPU in line awake. If the lock is no longer in contention, it also clears the slowpath flag.

The slowpath state is stored in the LSB of the lock tail ticket. This has the effect of reducing the max number of CPUs by half (so, a small ticket can deal with 128 CPUs, and a large ticket 32768).

For KVM, one hypercall is introduced in the hypervisor, which allows a vcpu to kick another vcpu out of halt state. The blocking of the vcpu is done using halt() in the (lock_spinning) slowpath.

Overall, it results in a large reduction in code, it makes the native and virtualized cases closer, and it removes a layer of indirection around all the spinlock functions. The fast path (taking an uncontended lock which isn't in slowpath state) is optimal, identical to the non-paravirtualized case.
The inner part of the ticket lock code becomes:

	inc = xadd(&lock->tickets, inc);
	inc.tail &= ~TICKET_SLOWPATH_FLAG;

	if (likely(inc.head == inc.tail))
		goto out;

	for (;;) {
		unsigned count = SPIN_THRESHOLD;

		do {
			if (ACCESS_ONCE(lock->tickets.head) == inc.tail)
				goto out;
			cpu_relax();
		} while (--count);
		__ticket_lock_spinning(lock, inc.tail);
	}
out:	barrier();

which results in:

	push   %rbp
	mov    %rsp,%rbp
	mov    $0x200,%eax
	lock xadd %ax,(%rdi)
	movzbl %ah,%edx
	cmp    %al,%dl
	jne    1f	# Slowpath if lock in contention
	pop    %rbp
	retq

	### SLOWPATH START
1:	and    $-2,%edx
	movzbl %dl,%esi
2:	mov    $0x800,%eax
	jmp    4f
3:	pause
	sub    $0x1,%eax
	je     5f
4:	movzbl (%rdi),%ecx
	cmp    %cl,%dl
	jne    3b
	pop    %rbp
	retq
5:	callq  *__ticket_lock_spinning
	jmp    2b
	### SLOWPATH END

with CONFIG_PARAVIRT_SPINLOCKS=n, the code has changed slightly, where the fastpath case is straight through (taking the lock without contention), and the spin loop is out of line:

	push   %rbp
	mov    %rsp,%rbp
	mov    $0x100,%eax
	lock xadd %ax,(%rdi)
	movzbl %ah,%edx
	cmp    %al,%dl
	jne    1f
	pop    %rbp
	retq

	### SLOWPATH START
1:	pause
	movzbl (%rdi),%eax
	cmp    %dl,%al
	jne    1b
	pop    %rbp
	retq
	### SLOWPATH END

The unlock code is complicated by the need to
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 11:29 AM, Ingo Molnar wrote: This is looking pretty good and complete now - any objections from anyone to trying this out in a separate x86 topic tree? No objections, instead an Acked-by: Avi Kivity a...@redhat.com -- error compiling committee.c: too many arguments to function
Re: [PATCH] KVM: X86: Add mmx movq emulation
On Sun, May 06, 2012 at 01:08:05PM +0300, Avi Kivity wrote: On 05/04/2012 02:47 PM, Joerg Roedel wrote:

Add support for the MMX versions of the movq instructions to the instruction emulator. Also handle possible exceptions they may cause.

This is already in (cbe2c9d30, e59717550e). Are you looking at master instead of next?

Right, I was looking at master. Probably I should have re-read your mail about the new git workflow.

Since you've just thought of the issues involved, I'd appreciate a review of the commits above, both wrt correctness and maintainability.

The patches above look correct to me. In fact cbe2c9d30 is more general than my implementation because it fetches all possible mmx operands. My implementation, on the other side, should be a bit faster because it checks for FP exceptions directly when the registers are accessed, which saves one get_fpu/put_fpu cycle (and an fwait instruction).

In fact I already see one difference - my patches do reg &= 7, while your patches generate #UD for %mm8-%mm15. Your version is correct. Documentation says that REX prefixes are ignored where not supported or misplaced. I also tried that directly on hardware and it works as documented and implemented in KVM.

Regards, Joerg

-- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632
Re: [PATCH] KVM: X86: Add mmx movq emulation
On 05/07/2012 11:56 AM, Joerg Roedel wrote:

Since you've just thought of the issues involved, I'd appreciate a review of the commits above, both wrt correctness and maintainability.

The patches above look correct to me. In fact cbe2c9d30 is more general than my implementation because it fetches all possible mmx operands. My implementation, on the other side, should be a bit faster because it checks for FP exceptions directly when the registers are accessed, which saves one get_fpu/put_fpu cycle (and an fwait instruction).

The get_fpu/put_fpu are nops (unless we schedule in between), since put_fpu() doesn't really unload the fpu. You're correct about the fwait; my motivation was to get the #MF exception early instead of doing the accesses first.

In fact I already see one difference - my patches do reg &= 7, while your patches generate #UD for %mm8-%mm15. Your version is correct. Documentation says that REX prefixes are ignored where not supported or misplaced. I also tried that directly on hardware and it works as documented and implemented in KVM.

Thanks for verifying.

-- error compiling committee.c: too many arguments to function
Re: [PATCH] KVM: X86: Remove stale values from ctxt-memop before emulation
On Sun, May 06, 2012 at 11:21:52AM +0300, Avi Kivity wrote:

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index d4bf50c..1b516ec 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -3937,6 +3937,7 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
 	struct opcode opcode;

 	ctxt->memop.type = OP_NONE;
+	ctxt->memop.val = 0;
 	ctxt->memopp = NULL;
 	ctxt->_eip = ctxt->eip;
 	ctxt->fetch.start = ctxt->_eip;

This only works for long sized values - it doesn't initialize val64 on i386, for example. So I think it's better to change bsr (and family) to use emulate_2op_SrcV_nobyte() instead (which has the added benefit of using the same values as the processor for the undefined bits).

Right, that's a better solution. How about the attached patch? The zf check shouldn't be necessary anymore because the generated assembly uses dst.val as input and output, so writeback shouldn't do anything wrong. The bsr and bsf unit tests all pass again with this patch.

	Joerg

From e9262f18e90111d32b584084c0b5564cbd728d65 Mon Sep 17 00:00:00 2001
From: Joerg Roedel joerg.roe...@amd.com
Date: Mon, 7 May 2012 12:05:28 +0200
Subject: [PATCH] KVM: X86: convert bsf/bsr instructions to emulate_2op_SrcV_nobyte()

The instruction emulation for bsrw is broken in KVM because the code always uses bsr with 32 or 64 bit operand size for emulation. Fix that by using the emulate_2op_SrcV_nobyte() macro to use the guest operand size for emulation.
Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/kvm/emulate.c | 26 ++
 1 file changed, 2 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 0d151e2..a6f8488 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -3134,35 +3134,13 @@ static int em_btc(struct x86_emulate_ctxt *ctxt)

 static int em_bsf(struct x86_emulate_ctxt *ctxt)
 {
-	u8 zf;
-
-	__asm__ ("bsf %2, %0; setz %1"
-		 : "=r"(ctxt->dst.val), "=q"(zf)
-		 : "r"(ctxt->src.val));
-
-	ctxt->eflags &= ~X86_EFLAGS_ZF;
-	if (zf) {
-		ctxt->eflags |= X86_EFLAGS_ZF;
-		/* Disable writeback. */
-		ctxt->dst.type = OP_NONE;
-	}
+	emulate_2op_SrcV_nobyte(ctxt, "bsf");
 	return X86EMUL_CONTINUE;
 }

 static int em_bsr(struct x86_emulate_ctxt *ctxt)
 {
-	u8 zf;
-
-	__asm__ ("bsr %2, %0; setz %1"
-		 : "=r"(ctxt->dst.val), "=q"(zf)
-		 : "r"(ctxt->src.val));
-
-	ctxt->eflags &= ~X86_EFLAGS_ZF;
-	if (zf) {
-		ctxt->eflags |= X86_EFLAGS_ZF;
-		/* Disable writeback. */
-		ctxt->dst.type = OP_NONE;
-	}
+	emulate_2op_SrcV_nobyte(ctxt, "bsr");
 	return X86EMUL_CONTINUE;
 }
--
1.7.9.5

-- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632
Re: [PATCH] KVM: X86: Remove stale values from ctxt-memop before emulation
On 05/07/2012 01:12 PM, Joerg Roedel wrote: On Sun, May 06, 2012 at 11:21:52AM +0300, Avi Kivity wrote:

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index d4bf50c..1b516ec 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -3937,6 +3937,7 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
 	struct opcode opcode;

 	ctxt->memop.type = OP_NONE;
+	ctxt->memop.val = 0;
 	ctxt->memopp = NULL;
 	ctxt->_eip = ctxt->eip;
 	ctxt->fetch.start = ctxt->_eip;

This only works for long sized values - it doesn't initialize val64 on i386, for example. So I think it's better to change bsr (and family) to use emulate_2op_SrcV_nobyte() instead (which has the added benefit of using the same values as the processor for the undefined bits).

Right, that's a better solution. How about the attached patch? The zf check shouldn't be necessary anymore because the generated assembly uses dst.val as input and output, so writeback shouldn't do anything wrong. The bsr and bsf unit tests all pass again with this patch.

From e9262f18e90111d32b584084c0b5564cbd728d65 Mon Sep 17 00:00:00 2001
From: Joerg Roedel joerg.roe...@amd.com
Date: Mon, 7 May 2012 12:05:28 +0200
Subject: [PATCH] KVM: X86: convert bsf/bsr instructions to emulate_2op_SrcV_nobyte()

The instruction emulation for bsrw is broken in KVM because the code always uses bsr with 32 or 64 bit operand size for emulation. Fix that by using the emulate_2op_SrcV_nobyte() macro to use the guest operand size for emulation.

It looks fine. Do you know what triggered this regression? (For figuring out if it's 3.4 material.)

-- error compiling committee.c: too many arguments to function
Re: [PATCH] KVM: X86: Add mmx movq emulation
On Sun, May 06, 2012 at 01:08:05PM +0300, Avi Kivity wrote:

This is already in (cbe2c9d30, e59717550e). Are you looking at master instead of next?

Btw. your mail about the new git-workflow states something about an auto-next branch. But I don't see that branch in the KVM tree (looking at git://git.kernel.org/pub/scm/virt/kvm/kvm.git). Is there another branch that contains all fixes and everything for the next merge window? Basically I am looking for a branch which has the new master and next merged.

Thanks, Joerg
Re: [PATCH] KVM: X86: Add mmx movq emulation
On 05/07/2012 01:28 PM, Joerg Roedel wrote:

Btw. your mail about the new git-workflow states something about an auto-next branch. But I don't see that branch in the KVM tree (looking at git://git.kernel.org/pub/scm/virt/kvm/kvm.git).

We forgot to generate it.

Is there another branch that contains all fixes and everything for the next merge window? Basically I am looking for a branch which has the new master and next merged.

Right. I'll get my scripts to generate it. (btw: auto-next = merge(upstream, master, next)).

-- error compiling committee.c: too many arguments to function
Re: [PATCH RFC 0/5] apic: eoi optimization support
* Michael S. Tsirkin m...@redhat.com wrote:

I'm looking at reducing the interrupt overhead for virtualized guests: some workloads spend a large part of their time processing interrupts. This patchset supplies infrastructure to reduce the IRQ ack overhead on x86: the idea is to add an eoi_write callback that we can then optimize without touching other apic functionality.

The main user will be kvm: on kvm, an EOI write from the guest causes an expensive exit to the host; we can avoid this using shared memory, as the last patch in the series demonstrates. But I also wrote a micro-optimized version for the regular x2apic: this shaves off a branch and about 9 instructions from EOI when x2apic is used, and a comment in ack_APIC_irq implies that someone counted instructions there, at some point. Also included in the patchset are a couple of trivial macro fixes.

The patches work fine on my boxes and I did look at the objdump output to verify that the generated code for the micro-optimization patch looks right and actually is shorter. Some benchmark results below (not sure what kind of testing is the most appropriate) show a tiny but measurable improvement. The tests were run on an AMD box with 24 cpus.

- A clean kernel build after reboot shows a tiny but measurable improvement in system time, which means lower CPU overhead (though not measurable in total time - that is dominated by user time and fluctuates too much):

linux# reboot -f
...
linux# make clean
linux# time make -j 64 LOCALVERSION= 2>&1 > /dev/null

Before:
real	2m52.244s
user	35m53.833s
sys	6m7.194s

After:
real	2m52.827s
user	35m48.916s
sys	6m2.305s

- perf micro-benchmarks seem to consistently show a tiny improvement in total time as well, but it's below the confidence level of 3 std deviations:

# ./tools/perf/perf stat --sync --repeat 100 --null perf bench sched messaging
...
0.414666797 seconds time elapsed ( +- 1.29% ) Performance counter stats for 'perf bench sched messaging' (100 runs): 0.395370891 seconds time elapsed ( +- 1.04% ) # ./tools/perf/perf stat --sync --repeat 100 --null perf bench sched pipe -l 1 0.307019664 seconds time elapsed ( +- 0.10% ) 0.304738024 seconds time elapsed ( +- 0.08% ) The patches are against 3.4-rc3 - let me know if I need to rebase. I think patches 1-2 are definitely a good idea, and patches 3-4 might be a good idea. Please review, and consider patches 1-4 for linux 3.5. Thanks, MST Michael S. Tsirkin (5): apic: fix typo EIO_ACK -> EOI_ACK and document apic: use symbolic APIC_EOI_ACK x86: add apic->eoi_write callback x86: eoi micro-optimization kvm_para: guest side for eoi avoidance arch/x86/include/asm/apic.h | 22 -- arch/x86/include/asm/apicdef.h | 2 +- arch/x86/include/asm/bitops.h | 6 ++- arch/x86/include/asm/kvm_para.h | 2 + arch/x86/kernel/apic/apic_flat_64.c | 2 + arch/x86/kernel/apic/apic_noop.c | 1 + arch/x86/kernel/apic/apic_numachip.c | 1 + arch/x86/kernel/apic/bigsmp_32.c | 1 + arch/x86/kernel/apic/es7000_32.c | 2 + arch/x86/kernel/apic/numaq_32.c | 1 + arch/x86/kernel/apic/probe_32.c | 1 + arch/x86/kernel/apic/summit_32.c | 1 + arch/x86/kernel/apic/x2apic_cluster.c | 1 + arch/x86/kernel/apic/x2apic_phys.c | 1 + arch/x86/kernel/apic/x2apic_uv_x.c | 1 + arch/x86/kernel/kvm.c | 51 ++-- arch/x86/platform/visws/visws_quirks.c | 2 +- 17 files changed, 88 insertions(+), 10 deletions(-) No objections from the x86 side. In terms of advantages, could you please create perf stat runs that count the number of MMIOs or so? That should show a pretty obvious improvement - and that is enough as proof, no need to try to reproduce the performance win in a noisy benchmark. Thanks, Ingo
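The "below the confidence level of 3 std deviations" claim can be checked directly from the numbers reported above. A quick sketch (an assumption here: perf's "+- N%" is treated as the relative standard deviation of the mean, which is how `perf stat --repeat` reports it):

```python
import math

def mean_sd(mean, pct):
    """Convert perf's 'x seconds ( +- y% )' output to (mean, absolute sd)."""
    return mean, mean * pct / 100.0

# perf bench sched messaging, 100 runs (figures quoted in the mail)
before, s_before = mean_sd(0.414666797, 1.29)
after,  s_after  = mean_sd(0.395370891, 1.04)

diff = before - after                          # observed improvement, seconds
sigma = math.sqrt(s_before**2 + s_after**2)    # sd of the difference

print(f"improvement {diff:.4f}s = {diff / sigma:.2f} sigma")
# The improvement is positive but the significance works out just under
# 3 sigma, matching the mail's statement.
```

So the measured ~19 ms improvement sits at roughly 2.9 sigma, which is why it is described as real-looking but below the 3-sigma threshold.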
Re: [PATCH] KVM: X86: Remove stale values from ctxt->memop before emulation
On Mon, May 07, 2012 at 01:18:01PM +0300, Avi Kivity wrote: On 05/07/2012 01:12 PM, Joerg Roedel wrote: The instruction emulation for bsrw is broken in KVM because the code always uses bsr with 32 or 64 bit operand size for emulation. Fix that by using the emulate_2op_SrcV_nobyte() macro to use the guest operand size for emulation. It looks fine. Do you know what triggered this regression? (for figuring out if it's 3.4 material) Looks like it is 3.4 (and -stable) material. I tested a few older kernels and the test passes on 3.0 but fails on 3.2 and later kernels (I have not tested 3.1). Joerg -- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632
Re: [PATCH] KVM: X86: Add mmx movq emulation
On Mon, May 07, 2012 at 01:30:48PM +0300, Avi Kivity wrote: On 05/07/2012 01:28 PM, Joerg Roedel wrote: On Sun, May 06, 2012 at 01:08:05PM +0300, Avi Kivity wrote: This is already in (cbe2c9d30, e59717550e). Are you looking at master instead of next? Btw. your mail about the new git-workflow states something about an auto-next branch. But I don't see that branch in the KVM tree (looking at git://git.kernel.org/pub/scm/virt/kvm/kvm.git). We forgot to generate it. Is there another branch that contains all fixes and everything for the next merge window? Basically I am looking for a branch which has the new master and next merged. Right. I'll get my scripts to generate it. (btw: auto-next = merge(upstream, master, next)). Cool thanks. That is perfect for our internal testing :) Joerg
Re: [PATCH RFC 0/5] apic: eoi optimization support
On Mon, May 07, 2012 at 12:35:12PM +0200, Ingo Molnar wrote: Michael S. Tsirkin (5): apic: fix typo EIO_ACK -> EOI_ACK and document apic: use symbolic APIC_EOI_ACK x86: add apic->eoi_write callback x86: eoi micro-optimization kvm_para: guest side for eoi avoidance arch/x86/include/asm/apic.h | 22 -- arch/x86/include/asm/apicdef.h | 2 +- arch/x86/include/asm/bitops.h | 6 ++- arch/x86/include/asm/kvm_para.h | 2 + arch/x86/kernel/apic/apic_flat_64.c | 2 + arch/x86/kernel/apic/apic_noop.c | 1 + arch/x86/kernel/apic/apic_numachip.c | 1 + arch/x86/kernel/apic/bigsmp_32.c | 1 + arch/x86/kernel/apic/es7000_32.c | 2 + arch/x86/kernel/apic/numaq_32.c | 1 + arch/x86/kernel/apic/probe_32.c | 1 + arch/x86/kernel/apic/summit_32.c | 1 + arch/x86/kernel/apic/x2apic_cluster.c | 1 + arch/x86/kernel/apic/x2apic_phys.c | 1 + arch/x86/kernel/apic/x2apic_uv_x.c | 1 + arch/x86/kernel/kvm.c | 51 ++-- arch/x86/platform/visws/visws_quirks.c | 2 +- 17 files changed, 88 insertions(+), 10 deletions(-) No objections from the x86 side. Is kvm.git a good tree to merge this through? In terms of advantages, could you please create perf stat runs that count the number of MMIOs or so? That should show a pretty obvious improvement - and that is enough as proof, no need to try to reproduce the performance win in a noisy benchmark. You mean with kvm PV, right? On real hardware the micro-optimization removes branches and maybe cache-misses, but I don't see why it would reduce MMIOs. -- MST
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 02:02 PM, Avi Kivity wrote: On 05/07/2012 11:29 AM, Ingo Molnar wrote: This is looking pretty good and complete now - any objections from anyone to trying this out in a separate x86 topic tree? No objections, instead an Acked-by: Avi Kivity a...@redhat.com Thank you. Here is a benchmark result with the patches. 3 guests with 8VCPU, 8GB RAM; 1 used for kernbench (kernbench -f -H -M -o 20), the others for cpuhog (shell script: while true with an instruction), unpinned scenario. 1x: no hogs 2x: 8 hogs in one guest 3x: 8 hogs each in two guests BASE: 3.4-rc4 vanilla with CONFIG_PARAVIRT_SPINLOCK=n BASE+patch: 3.4-rc4 + debugfs + pv patches with CONFIG_PARAVIRT_SPINLOCK=y Machine: IBM xSeries with Intel(R) Xeon(R) x5570 2.93GHz CPU (non-PLE) with 8 cores, 64GB RAM. (Less is better. Below is time elapsed in sec for x86_64_defconfig (3+3 runs)).
              BASE                 BASE+patch           %improvement
              mean (sd)            mean (sd)
case 1x:      66.0566 (74.0304)    61.3233 (68.8299)    7.16552
case 2x:      1253.2  (1795.74)    131.606 (137.358)    89.4984
case 3x:      3431.04 (5297.26)    134.964 (149.861)    96.0664
Will be working on further analysis with other benchmarks (pgbench/sysbench/ebizzy...) and further optimization.
Re: [PATCH RFC 0/5] apic: eoi optimization support
* Michael S. Tsirkin m...@redhat.com wrote: On Mon, May 07, 2012 at 12:35:12PM +0200, Ingo Molnar wrote: Michael S. Tsirkin (5): apic: fix typo EIO_ACK -> EOI_ACK and document apic: use symbolic APIC_EOI_ACK x86: add apic->eoi_write callback x86: eoi micro-optimization kvm_para: guest side for eoi avoidance arch/x86/include/asm/apic.h | 22 -- arch/x86/include/asm/apicdef.h | 2 +- arch/x86/include/asm/bitops.h | 6 ++- arch/x86/include/asm/kvm_para.h | 2 + arch/x86/kernel/apic/apic_flat_64.c | 2 + arch/x86/kernel/apic/apic_noop.c | 1 + arch/x86/kernel/apic/apic_numachip.c | 1 + arch/x86/kernel/apic/bigsmp_32.c | 1 + arch/x86/kernel/apic/es7000_32.c | 2 + arch/x86/kernel/apic/numaq_32.c | 1 + arch/x86/kernel/apic/probe_32.c | 1 + arch/x86/kernel/apic/summit_32.c | 1 + arch/x86/kernel/apic/x2apic_cluster.c | 1 + arch/x86/kernel/apic/x2apic_phys.c | 1 + arch/x86/kernel/apic/x2apic_uv_x.c | 1 + arch/x86/kernel/kvm.c | 51 ++-- arch/x86/platform/visws/visws_quirks.c | 2 +- 17 files changed, 88 insertions(+), 10 deletions(-) No objections from the x86 side. Is kvm.git a good tree to merge this through? Fine with me, but I haven't checked how widely it conflicts with existing bits: by the looks of it most of the linecount is on the core x86 side, while the kvm change is well concentrated. In terms of advantages, could you please create perf stat runs that count the number of MMIOs or so? That should show a pretty obvious improvement - and that is enough as proof, no need to try to reproduce the performance win in a noisy benchmark. You mean with kvm PV, right? On real hardware the micro-optimization removes branches and maybe cache-misses, but I don't see why it would reduce MMIOs. Yeah, on KVM. On real hw I doubt it's measurable. Thanks, Ingo
KVM call agenda for May, Tuesday 8th
Hi Please send in any agenda items you are interested in covering. Thanks, Juan.
Re: [PATCH RFC 0/5] apic: eoi optimization support
On 05/07/2012 02:40 PM, Ingo Molnar wrote: No objections from the x86 side. Is kvm.git a good tree to merge this through? Fine with me, but I haven't checked how widely it conflicts with existing bits: by the looks of it most of the linecount is on the core x86 side, while the kvm change is well concentrated. I don't see a problem with merging through tip.git - we're close to the next merge window, and the guest side rarely causes conflicts. But please don't apply the last patch yet, I want to review it more closely (esp. with the host side). -- error compiling committee.c: too many arguments to function
Re: [PATCH RFC 0/5] apic: eoi optimization support
* Avi Kivity a...@redhat.com wrote: On 05/07/2012 02:40 PM, Ingo Molnar wrote: No objections from the x86 side. Is kvm.git a good tree to merge this through? Fine with me, but I haven't checked how widely it conflicts with existing bits: by the looks of it most of the linecount is on the core x86 side, while the kvm change is well concentrated. I don't see a problem with merging through tip.git - we're close to the next merge window, and the guest side rarely causes conflicts. But please don't apply the last patch yet, I want to review it more closely (esp. with the host side). That last patch was marked don't apply yet, so I definitely planned on another iteration that incorporates all the feedback that has been given. Thanks, Ingo
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 01:58 PM, Raghavendra K T wrote: On 05/07/2012 02:02 PM, Avi Kivity wrote: On 05/07/2012 11:29 AM, Ingo Molnar wrote: This is looking pretty good and complete now - any objections from anyone to trying this out in a separate x86 topic tree? No objections, instead an Acked-by: Avi Kivity a...@redhat.com Thank you. Here is a benchmark result with the patches. 3 guests with 8VCPU, 8GB RAM; 1 used for kernbench (kernbench -f -H -M -o 20), the others for cpuhog (shell script: while true with an instruction), unpinned scenario. 1x: no hogs 2x: 8 hogs in one guest 3x: 8 hogs each in two guests BASE: 3.4-rc4 vanilla with CONFIG_PARAVIRT_SPINLOCK=n BASE+patch: 3.4-rc4 + debugfs + pv patches with CONFIG_PARAVIRT_SPINLOCK=y Machine: IBM xSeries with Intel(R) Xeon(R) x5570 2.93GHz CPU (non-PLE) with 8 cores, 64GB RAM. (Less is better. Below is time elapsed in sec for x86_64_defconfig (3+3 runs)).
              BASE                 BASE+patch           %improvement
              mean (sd)            mean (sd)
case 1x:      66.0566 (74.0304)    61.3233 (68.8299)    7.16552
case 2x:      1253.2  (1795.74)    131.606 (137.358)    89.4984
case 3x:      3431.04 (5297.26)    134.964 (149.861)    96.0664
You're calculating the improvement incorrectly. In the last case, it's not 96%, rather it's 2400% (25x). Similarly the second case is about 900% faster. -- error compiling committee.c: too many arguments to function
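Avi's correction, in numbers: a "% time reduction" and an "N times faster" speedup are different quantities, and for large wins the former saturates near 100% while the latter keeps growing. A quick sketch using the case 2x and 3x figures from the table above:

```python
cases = {
    "2x": (1253.2, 131.606),    # elapsed seconds: (BASE, BASE+patch)
    "3x": (3431.04, 134.964),
}

for name, (base, patched) in cases.items():
    reduction = (base - patched) / base * 100   # what the script reported
    speedup = base / patched                    # what "N times faster" means
    print(f"case {name}: {reduction:.1f}% less time = {speedup:.1f}x faster")
# case 2x: ~89.5% less time is a ~9.5x speedup (about 900% faster);
# case 3x: ~96.1% less time is a ~25x speedup (about 2400% faster).
```

This is why reporting the ratio (10x, 25x) conveys the magnitude far better than the percentage.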
Re: [Qemu-devel] KVM call agenda for May, Tuesday 8th
On 05/07/2012 06:47 AM, Juan Quintela wrote: Hi Please send in any agenda items you are interested in covering. - Status of the 1.1 release Regards, Anthony Liguori Thanks, Juan.
Re: [Qemu-devel] KVM call agenda for May, Tuesday 8th
On 05/07/2012 06:47 AM, Juan Quintela wrote: Hi Please send in any agenda items you are interested in covering. - QEMU documentation qemu-doc.texi is in a pretty awful state. I'm wondering if anyone has any ideas about how we can improve it. One thing we could do is move the entire contents of it to the wiki to allow for broader editing. I'd also be really happy to have a documentation submaintainer if anyone is interested in the role. Other ideas? Regards, Anthony Liguori Thanks, Juan.
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 05:36 PM, Avi Kivity wrote: On 05/07/2012 01:58 PM, Raghavendra K T wrote: On 05/07/2012 02:02 PM, Avi Kivity wrote: On 05/07/2012 11:29 AM, Ingo Molnar wrote: This is looking pretty good and complete now - any objections from anyone to trying this out in a separate x86 topic tree? No objections, instead an Acked-by: Avi Kivity a...@redhat.com [...] (Less is better. Below is time elapsed in sec for x86_64_defconfig (3+3 runs)).
              BASE                 BASE+patch           %improvement
              mean (sd)            mean (sd)
case 1x:      66.0566 (74.0304)    61.3233 (68.8299)    7.16552
case 2x:      1253.2  (1795.74)    131.606 (137.358)    89.4984
case 3x:      3431.04 (5297.26)    134.964 (149.861)    96.0664
You're calculating the improvement incorrectly. In the last case, it's not 96%, rather it's 2400% (25x). Similarly the second case is about 900% faster. You are right, my %improvement was intended to read like: 1) base takes 100 sec => patch takes 93 sec 2) base takes 100 sec => patch takes 11 sec 3) base takes 100 sec => patch takes 4 sec. The above is more confusing (and incorrect!). Better is what you told, which boils down to 10x and 25x improvement in case 2 and case 3. And IMO, this *really* gives the feeling of the magnitude of improvement with the patches. I'll change the script to report that way :).
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 04:20 PM, Raghavendra K T wrote: On 05/07/2012 05:36 PM, Avi Kivity wrote: On 05/07/2012 01:58 PM, Raghavendra K T wrote: On 05/07/2012 02:02 PM, Avi Kivity wrote: On 05/07/2012 11:29 AM, Ingo Molnar wrote: This is looking pretty good and complete now - any objections from anyone to trying this out in a separate x86 topic tree? No objections, instead an Acked-by: Avi Kivity a...@redhat.com [...] (Less is better. Below is time elapsed in sec for x86_64_defconfig (3+3 runs)).
              BASE                 BASE+patch           %improvement
              mean (sd)            mean (sd)
case 1x:      66.0566 (74.0304)    61.3233 (68.8299)    7.16552
case 2x:      1253.2  (1795.74)    131.606 (137.358)    89.4984
case 3x:      3431.04 (5297.26)    134.964 (149.861)    96.0664
You're calculating the improvement incorrectly. In the last case, it's not 96%, rather it's 2400% (25x). Similarly the second case is about 900% faster. You are right, my %improvement was intended to read like: 1) base takes 100 sec => patch takes 93 sec 2) base takes 100 sec => patch takes 11 sec 3) base takes 100 sec => patch takes 4 sec. The above is more confusing (and incorrect!). Better is what you told, which boils down to 10x and 25x improvement in case 2 and case 3. And IMO, this *really* gives the feeling of the magnitude of improvement with the patches. I'll change the script to report that way :). btw, this is on non-PLE hardware, right? What are the numbers for PLE? -- error compiling committee.c: too many arguments to function
Re: Debug or diagnose tools for individual guest VM
On 05/07/2012 06:01 AM, Hailong Yang wrote: Dear all, I would like to know are there any appropriate tools for debugging or diagnosing individual guest VM performance. Similar to kvm_stat, but could distinguish information for each guest VM. You could do something like 'perf top -e kvm:\* -p pid' -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] KVM call agenda for May, Tuesday 8th
- QEMU documentation qemu-doc.texi is in a pretty awful state. I'm wondering if anyone has any ideas about how we can improve it. One thing we could do is move the entire contents of it to the wiki to allow for broader editing. What's qemu-tech.texi status? ;) I'd also be really happy to have a documentation submaintainer if anyone is interested in the role. Other ideas? IMHO, one of the problems is there are documents scattered out there, not just in one place. There are too many links on http://wiki.qemu.org/Manual. :/ If people can focus on one document, then it's easier to get it into good shape. Regards, chenwj -- Wei-Ren Chen (陳韋任) Computer Systems Lab, Institute of Information Science, Academia Sinica, Taiwan (R.O.C.) Tel:886-2-2788-3799 #1667 Homepage: http://people.cs.nctu.edu.tw/~chenwj
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 06:52 PM, Avi Kivity wrote: On 05/07/2012 04:20 PM, Raghavendra K T wrote: On 05/07/2012 05:36 PM, Avi Kivity wrote: On 05/07/2012 01:58 PM, Raghavendra K T wrote: On 05/07/2012 02:02 PM, Avi Kivity wrote: On 05/07/2012 11:29 AM, Ingo Molnar wrote: This is looking pretty good and complete now - any objections from anyone to trying this out in a separate x86 topic tree? No objections, instead an Acked-by: Avi Kivity a...@redhat.com [...] (Less is better. Below is time elapsed in sec for x86_64_defconfig (3+3 runs)).
              BASE                 BASE+patch           %improvement
              mean (sd)            mean (sd)
case 1x:      66.0566 (74.0304)    61.3233 (68.8299)    7.16552
case 2x:      1253.2  (1795.74)    131.606 (137.358)    89.4984
case 3x:      3431.04 (5297.26)    134.964 (149.861)    96.0664
You're calculating the improvement incorrectly. In the last case, it's not 96%, rather it's 2400% (25x). Similarly the second case is about 900% faster. You are right, my %improvement was intended to read like: 1) base takes 100 sec => patch takes 93 sec 2) base takes 100 sec => patch takes 11 sec 3) base takes 100 sec => patch takes 4 sec. The above is more confusing (and incorrect!). Better is what you told, which boils down to 10x and 25x improvement in case 2 and case 3. And IMO, this *really* gives the feeling of the magnitude of improvement with the patches. I'll change the script to report that way :). btw, this is on non-PLE hardware, right? What are the numbers for PLE? Sure. I'll get hold of a PLE machine and come up with the numbers soon, but I'll expect the improvement around 1-3% as it was in the last version.
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
* Raghavendra K T raghavendra...@linux.vnet.ibm.com [2012-05-07 19:08:51]: I'll get hold of a PLE machine and come up with the numbers soon, but I'll expect the improvement around 1-3% as it was in the last version. Deferring preemption (when the vcpu is holding a lock) may give us better than 1-3% results on PLE hardware. Something worth trying IMHO. - vatsa
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 04:46 PM, Srivatsa Vaddagiri wrote: * Raghavendra K T raghavendra...@linux.vnet.ibm.com [2012-05-07 19:08:51]: I'll get hold of a PLE machine and come up with the numbers soon, but I'll expect the improvement around 1-3% as it was in the last version. Deferring preemption (when the vcpu is holding a lock) may give us better than 1-3% results on PLE hardware. Something worth trying IMHO. Is the improvement so low because PLE is interfering with the patch, or because PLE already does a good job? -- error compiling committee.c: too many arguments to function
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 07:19 PM, Avi Kivity wrote: On 05/07/2012 04:46 PM, Srivatsa Vaddagiri wrote: * Raghavendra K T raghavendra...@linux.vnet.ibm.com [2012-05-07 19:08:51]: I'll get hold of a PLE machine and come up with the numbers soon, but I'll expect the improvement around 1-3% as it was in the last version. Deferring preemption (when the vcpu is holding a lock) may give us better than 1-3% results on PLE hardware. Something worth trying IMHO. Is the improvement so low because PLE is interfering with the patch, or because PLE already does a good job? It is because PLE already does a good job (of not burning cpu). The 1-3% improvement is because the patchset knows at least who is next to hold the lock, which is lacking in PLE.
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
* Avi Kivity a...@redhat.com [2012-05-07 16:49:25]: Deferring preemption (when the vcpu is holding a lock) may give us better than 1-3% results on PLE hardware. Something worth trying IMHO. Is the improvement so low because PLE is interfering with the patch, or because PLE already does a good job? I think it's the latter (PLE already doing a good job). - vatsa
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 07:16 PM, Srivatsa Vaddagiri wrote: * Raghavendra K T raghavendra...@linux.vnet.ibm.com [2012-05-07 19:08:51]: I'll get hold of a PLE machine and come up with the numbers soon, but I'll expect the improvement around 1-3% as it was in the last version. Deferring preemption (when the vcpu is holding a lock) may give us better than 1-3% results on PLE hardware. Something worth trying IMHO. Yes, sure. I'll take this up, along with any further scalability improvements possible.
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 04:53 PM, Raghavendra K T wrote: Is the improvement so low because PLE is interfering with the patch, or because PLE already does a good job? It is because PLE already does a good job (of not burning cpu). The 1-3% improvement is because the patchset knows at least who is next to hold the lock, which is lacking in PLE. Not good. Solving a problem in software that is already solved by hardware? It's okay if there are no costs involved, but here we're introducing a new ABI that we'll have to maintain for a long time. -- error compiling committee.c: too many arguments to function
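For context on "knows who is next to hold the lock": a ticket lock serves waiters in strict FIFO order, so on release the next holder is exactly identifiable, and a pv implementation can kick that one vcpu instead of letting PLE hardware guess which spinner to yield to. A minimal toy sketch (Python stand-in for the kernel's atomic ticket lock; the condition variable here stands in for the halt/kick hypercalls, and is an illustration, not the kernel's code):

```python
import threading

class TicketLock:
    """Toy ticket lock: FIFO, and the releaser knows exactly who is next."""
    def __init__(self):
        self._cv = threading.Condition()
        self._next_ticket = 0   # fetched-and-incremented atomically in real code
        self._now_serving = 0

    def acquire(self):
        with self._cv:
            ticket = self._next_ticket
            self._next_ticket += 1
            while self._now_serving != ticket:
                self._cv.wait()      # pv-spinlocks would halt the vcpu here

    def release(self):
        with self._cv:
            self._now_serving += 1
            # The next holder is precisely ticket == _now_serving; a pv
            # implementation can kick that one vcpu rather than all waiters.
            self._cv.notify_all()

# Quick check: concurrent increments stay consistent under the lock.
lock, counter = TicketLock(), 0

def worker():
    global counter
    for _ in range(1000):
        lock.acquire()
        counter += 1
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 4000
```

PLE only detects that some vcpu is spinning and yields it; the ticket structure above is what lets the paravirt patches target the successor directly, which is where the extra 1-3% comes from.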
Re: [Qemu-devel] Adding an IPMI BMC device to KVM
On 05/06/2012 09:39 AM, Avi Kivity wrote: On 05/06/2012 05:35 PM, Anthony Liguori wrote: On 05/06/2012 08:11 AM, Avi Kivity wrote: libvirt is essentially the BMC for a virtual guest. I would suggest looking at implementing an IPMI interface to libvirt and exposing it to the guest through a USB RNDIS device. That's the first option. One unanswered question is what to do when the guest is down? Someone should listen for IPMI events, but we can't make it libvirt unconditionally, since many instances of libvirt are active at any one time. Note the IPMI external interface needs to be migrated, like any other. For all intents and purposes, the BMC/RSA is a separate physical machine. If you really wanted to model it, you would launch two instances of QEMU. The BMC instance would have a virtual NIC and would share a USB bus with the slave QEMU instance (probably via USBoIP). The USB bus is how the BMC exposes IPMI to the guest (via a USB rndis adapter), remote media, etc. I believe some BMC's also expose IPMI over i2c but that's pretty low bandwidth. At any rate, you would have some sort of virtual hardware device that essentially spoke QMP to the slave instance. You could just do virtio-serial and call it a day actually. It really boils down to what you are trying to do. If you want to just get some piece of software working that expects to do IPMI, the easiest thing to do is run IPMI in the host and use a USB rndis interface to interact with it. I don't think there's a tremendous amount of value in QEMU making itself look like an IBM IMM or whatever HP/Dell's equivalent is. As I said, these stacks are hugely complicated and there are better ways of doing out of band management (like talk to libvirt directly). So what's really the use case here? Would an IPMI - libvirt bridge get you what you need? I really think that's the best path forward. 
Regards, Anthony Liguori
Re: [Qemu-devel] Adding an IPMI BMC device to KVM
On 05/07/2012 05:30 PM, Anthony Liguori wrote: On 05/06/2012 09:39 AM, Avi Kivity wrote: On 05/06/2012 05:35 PM, Anthony Liguori wrote: On 05/06/2012 08:11 AM, Avi Kivity wrote: libvirt is essentially the BMC for a virtual guest. I would suggest looking at implementing an IPMI interface to libvirt and exposing it to the guest through a USB RNDIS device. That's the first option. One unanswered question is what to do when the guest is down? Someone should listen for IPMI events, but we can't make it libvirt unconditionally, since many instances of libvirt are active at any one time. Note the IPMI external interface needs to be migrated, like any other. For all intents and purposes, the BMC/RSA is a separate physical machine. That's true for any other card on a machine. If you really wanted to model it, you would launch two instances of QEMU. The BMC instance would have a virtual NIC and would share a USB bus with the slave QEMU instance (probably via USBoIP). The USB bus is how the BMC exposes IPMI to the guest (via a USB rndis adapter), remote media, etc. I believe some BMC's also expose IPMI over i2c but that's pretty low bandwidth. That is one way to do it. Figure out the interactions between two different parts in a machine, define an interface for them to communicate, and split them into two processes. We don't usually do that; I believe your motivation is that the two have different power domains (but then so do NICs with wake-on-LAN support). At any rate, you would have some sort of virtual hardware device that essentially spoke QMP to the slave instance. You could just do virtio-serial and call it a day actually. Sorry I lost you. Which is the master and which is the slave? It really boils down to what you are trying to do. If you want to just get some piece of software working that expects to do IPMI, the easiest thing to do is run IPMI in the host and use a USB rndis interface to interact with it. That would be most strange. 
A remote client connecting to the IPMI interface would control the power level of the host, not the guest. I don't think there's a tremendous amount of value in QEMU making itself look like an IBM IMM or whatever HP/Dell's equivalent is. As I said, these stacks are hugely complicated and there are better ways of doing out of band management (like talk to libvirt directly). I have to agree here. -- error compiling committee.c: too many arguments to function
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 07:28 PM, Avi Kivity wrote: On 05/07/2012 04:53 PM, Raghavendra K T wrote: Is the improvement so low because PLE is interfering with the patch, or because PLE already does a good job? It is because PLE already does a good job (of not burning cpu). The 1-3% improvement is because the patchset knows at least who is next to hold the lock, which is lacking in PLE. Not good. Solving a problem in software that is already solved by hardware? It's okay if there are no costs involved, but here we're introducing a new ABI that we'll have to maintain for a long time. Hmm, agree that being a step ahead of mighty hardware (and just an improvement of 1-3%) is no good for the long term (where PLE is the future). Having said that, it is hard for me to resist saying: the bottleneck is somewhere else on PLE machines, and IMHO the answer would be a combination of paravirt-spinlock + pv-flush-tlb. But I need to come up with good numbers to argue in favour of the claim. PS: Nikunj had experimented that pv-flush-tlb + paravirt-spinlock is a win on PLE where neither of them alone could prove the benefit.
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 05:52 PM, Avi Kivity wrote: Having said that, it is hard for me to resist saying: the bottleneck is somewhere else on PLE machines, and IMHO the answer would be a combination of paravirt-spinlock + pv-flush-tlb. But I need to come up with good numbers to argue in favour of the claim. PS: Nikunj had experimented that pv-flush-tlb + paravirt-spinlock is a win on PLE where neither of them alone could prove the benefit. I'd like to see those numbers, then. Note: it's probably best to try very wide guests, where the overhead of iterating on all vcpus begins to show. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Adding an IPMI BMC device to KVM
On 05/07/2012 09:44 AM, Avi Kivity wrote: On 05/07/2012 05:30 PM, Anthony Liguori wrote: On 05/06/2012 09:39 AM, Avi Kivity wrote: On 05/06/2012 05:35 PM, Anthony Liguori wrote: On 05/06/2012 08:11 AM, Avi Kivity wrote: libvirt is essentially the BMC for a virtual guest. I would suggest looking at implementing an IPMI interface to libvirt and exposing it to the guest through a USB RNDIS device. That's the first option. One unanswered question is what to do when the guest is down? Someone should listen for IPMI events, but we can't make it libvirt unconditionally, since many instances of libvirt are active at any one time. Note the IPMI external interface needs to be migrated, like any other. For all intents and purposes, the BMC/RSA is a separate physical machine. That's true for any other card on a machine. It has a separate power source for all intents and purposes. If you think of it in QOM terms, what connects the nodes together ultimately is the Vcc pin that travels across all devices. The RTC is a little special because it has a battery backed CMOS/clock but it's also handled specially. The BMC does not share Vcc. It's no different than a separate physical box. It just shares a couple buses. If you really wanted to model it, you would launch two instances of QEMU. The BMC instance would have a virtual NIC and would share a USB bus with the slave QEMU instance (probably via USBoIP). The USB bus is how the BMC exposes IPMI to the guest (via a USB rndis adapter), remote media, etc. I believe some BMC's also expose IPMI over i2c but that's pretty low bandwidth. That is one way to do it. Figure out the interactions between two different parts in a machine, define an interface for them to communicate, and split them into two processes. We don't usually do that; I believe your motivation is that the two have different power domains (but then so do NICs with wake-on-LAN support). The power still comes from the PCI bus. Think of something like a blade center. 
Each individual blade does not have its own BMC. There's a single common BMC that provides an IPMI interface for all blades. Yet each blade still sees an IPMI interface via a USB rndis device. You can rip out the memory, PCI devices, etc. from a box while the power is on and the BMC will be unaffected. At any rate, you would have some sort of virtual hardware device that essentially spoke QMP to the slave instance. You could just do virtio-serial and call it a day actually. Sorry I lost you. Which is the master and which is the slave? The BMC is the master, system being controlled is the slave. It really boils down to what you are trying to do. If you want to just get some piece of software working that expects to do IPMI, the easiest thing to do is run IPMI in the host and use a USB rndis interface to interact with it. That would be most strange. A remote client connecting to the IPMI interface would control the power level of the host, not the guest. IPMI with a custom backend is what I mean. That's what I mean by an IPMI - libvirt bridge. You could build a libvirt client that exposes an IPMI interface and when you issue IPMI commands, it translates them into libvirt operations. This can run as a normal process on the host and then network it to the guest via an emulated USB rndis device. Existing software on the guest shouldn't be able to tell the difference as long as it doesn't try to use I2C to talk to the BMC. I don't think there's a tremendous amount of value in QEMU making itself look like an IBM IMM or whatever HP/Dell's equivalent is. As I said, these stacks are hugely complicated and there are better ways of doing out of band management (like talk to libvirt directly). I have to agree here. Regards, Anthony Liguori
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 05:47 PM, Raghavendra K T wrote: Not good. Solving a problem in software that is already solved by hardware? It's okay if there are no costs involved, but here we're introducing a new ABI that we'll have to maintain for a long time. Hmm agree that being a step ahead of mighty hardware (and just an improvement of 1-3%) is no good for long term (where PLE is future). PLE is the present, not the future. It was introduced on later Nehalems and is present on all Westmeres. Two more processor generations have passed meanwhile. The AMD equivalent was also introduced around that timeframe. Having said that, it is hard for me to resist saying: the bottleneck is somewhere else on PLE machines and IMHO the answer would be a combination of paravirt-spinlock + pv-flush-tlb. But I need to come up with good numbers to argue in favour of the claim. PS: Nikunj had experimented that pv-flush-tlb + paravirt-spinlock is a win on PLE where only one of them alone could not prove the benefit. I'd like to see those numbers, then. Ingo, please hold on the kvm-specific patches, meanwhile.
Re: [Qemu-devel] Adding an IPMI BMC device to KVM
On 05/07/2012 05:55 PM, Anthony Liguori wrote: For all intents and purposes, the BMC/RSA is a separate physical machine. That's true for any other card on a machine. It has a separate power source for all intents and purposes. If you think of it in QOM terms, what connects the nodes together ultimately is the Vcc pin that travels across all devices. The RTC is a little special because it has a battery backed CMOS/clock but it's also handled specially. And we fail to emulate it correctly as well, wrt. alarms. The BMC does not share Vcc. It's no different than a separate physical box. It just shares a couple buses. It controls the main power plane, reset line, can read VGA and emulate keyboard, seems pretty well integrated. That is one way to do it. Figure out the interactions between two different parts in a machine, define an interface for them to communicate, and split them into two processes. We don't usually do that; I believe your motivation is that the two have different power domains (but then so do NICs with wake-on-LAN support). The power still comes from the PCI bus. Maybe. But it's on when the rest of the machine is off. So Vcc is not shared. Think of something like a blade center. Each individual blade does not have its own BMC. There's a single common BMC that provides an IPMI interface for all blades. Yet each blade still sees an IPMI interface via a USB rndis device. You can rip out the memory, PCI devices, etc. from a box while the power is on and the BMC will be unaffected. At any rate, you would have some sort of virtual hardware device that essentially spoke QMP to the slave instance. You could just do virtio-serial and call it a day actually. Sorry I lost you. Which is the master and which is the slave? The BMC is the master, system being controlled is the slave. Ah okay. It also has to read the VGA output (say via vnc) and supply keyboard input (via sendkey). It really boils down to what you are trying to do.
If you want to just get some piece of software working that expects to do IPMI, the easiest thing to do is run IPMI in the host and use a USB rndis interface to interact with it. That would be most strange. A remote client connecting to the IPMI interface would control the power level of the host, not the guest. IPMI with a custom backend is what I mean. That's what I mean by an IPMI - libvirt bridge. You could build a libvirt client that exposes an IPMI interface and when you issue IPMI commands, it translates them into libvirt operations. This can run as a normal process on the host and then network it to the guest via an emulated USB rndis device. Existing software on the guest shouldn't be able to tell the difference as long as it doesn't try to use I2C to talk to the BMC. I still like the single process solution, it is more in line with the rest of qemu and handles live migration better. But even better would be not to do this at all, and satisfy the remote management requirements using the existing tools.
Re: Adding an IPMI BMC device to KVM
Then should we also emulate one AMM virtual device? one fsp virtual device? one IVE virtual device? On Sat, May 5, 2012 at 3:10 AM, Corey Minyard tcminy...@gmail.com wrote: I've been working on adding an IPMI BMC as a virtual device under KVM.  I'm doing this for two primary reasons, one to have a better test environment than what I have now for testing IPMI issues, and second to be able to better simulate a legacy environment for customers porting legacy software. For those that don't know, IPMI is a system management interface.  Generally systems with IPMI have a small microcontroller, called a BMC, that is always on when the board is powered.  The BMC is capable of controlling power and reset on the board, and it is hooked to sensors on the board (voltage, current, temperature, the presence of things like DIMMS and power supplies, intrusion detection, and a host of other things).  The main processor on a system can communicate with the BMC over a device.  Often these systems also have a LAN interface that lets you control the system remotely even when it's off. In addition, IPMI provides access to FRU (Field Replaceable Unit) data that describes the components of the system that can be replaced.  It also has data records that describe the sensor, so it is possible to directly interpret the sensor data and know what the sensor is measuring without outside data. I've been struggling a bit with how to implement this.  There is a lot of configuration information, and you need ways to simulate the sensors.  This type of interface is a little sensitive, since it has direct access to the reset and power control of a system. I was at first considering having the BMC be an external program that KVM connected to over a chardev, with possibly a simulated LAN interface.  This has the advantage that the BMC can run even when KVM is down.  It could even start up KVM for a power up, though I'm not sure how valuable that would be. 
Plus it could be used for other virtual machines.  However, that means there is an interface to KVM over a chardev that could do nasty things, and even be a possible intrusion point.  It also means there is a separate program to maintain. You could also include the BMC inside of KVM and run it as a separate thread. That way there doesn't have to be an insecure interface.  But the BMC will need a lot of configuration data and this will add a bunch of code to KVM that's only tangentially related to it.  And you would still need a way to simulate setting sensors and such for testing things. Either way, is this interesting for including into KVM?  Does anyone have any opinions on the possible ways to implement this? Thanks, -corey -- Regards, Zhi Yong Wu
Re: [Qemu-devel] Adding an IPMI BMC device to KVM
On 05/07/2012 10:11 AM, Avi Kivity wrote: On 05/07/2012 05:55 PM, Anthony Liguori wrote: For all intents and purposes, the BMC/RSA is a separate physical machine. That's true for any other card on a machine. It has a separate power source for all intents and purposes. If you think of it in QOM terms, what connects the nodes together ultimately is the Vcc pin that travels across all devices. The RTC is a little special because it has a battery backed CMOS/clock but it's also handled specially. And we fail to emulate it correctly as well, wrt. alarms. The BMC does not share Vcc. It's no different than a separate physical box. It just shares a couple buses. It controls the main power plane, reset line, can read VGA and emulate keyboard, seems pretty well integrated. Emulating the keyboard is done through USB. How the VGA thing works is very vendor dependent. The VGA snooping can happen as part of the display path (essentially connected via a VGA cable) or it can be a side-band using a special graphics adapter. I think QEMU VNC emulation is a pretty good analogy actually. That is one way to do it. Figure out the interactions between two different parts in a machine, define an interface for them to communicate, and split them into two processes. We don't usually do that; I believe your motivation is that the two have different power domains (but then so do NICs with wake-on-LAN support). The power still comes from the PCI bus. Maybe. But it's on when the rest of the machine is off. So Vcc is not shared. That's all plumbed through the PCI bus FWIW. Think of something like a blade center. Each individual blade does not have its own BMC. There's a single common BMC that provides an IPMI interface for all blades. Yet each blade still sees an IPMI interface via a USB rndis device. You can rip out the memory, PCI devices, etc. from a box while the power is on and the BMC will be unaffected.
At any rate, you would have some sort of virtual hardware device that essentially spoke QMP to the slave instance. You could just do virtio-serial and call it a day actually. Sorry I lost you. Which is the master and which is the slave? The BMC is the master, system being controlled is the slave. Ah okay. It also has to read the VGA output (say via vnc) and supply keyboard input (via sendkey). Right, QMP + VNC is a pretty accurate analogy. It really boils down to what you are trying to do. If you want to just get some piece of software working that expects to do IPMI, the easiest thing to do is run IPMI in the host and use a USB rndis interface to interact with it. That would be most strange. A remote client connecting to the IPMI interface would control the power level of the host, not the guest. IPMI with a custom backend is what I mean. That's what I mean by an IPMI - libvirt bridge. You could build a libvirt client that exposes an IPMI interface and when you issue IPMI commands, it translates them into libvirt operations. This can run as a normal process on the host and then network it to the guest via an emulated USB rndis device. Existing software on the guest shouldn't be able to tell the difference as long as it doesn't try to use I2C to talk to the BMC. I still like the single process solution, it is more in line with the rest of qemu and handles live migration better. Two QEMU processes could be migrated in unison if you really wanted to support that... With qemu-system-mips/sh4 you could probably even run the real BMC software stack if you were so inclined :-) But even better would be not to do this at all, and satisfy the remote management requirements using the existing tools. Right. Regards, Anthony Liguori
RE: [PATCH] kvm: Enable device LTR/OBFF capability before doing guest device assignment
On Mon, 2012-05-07 at 07:58 +, Hao, Xudong wrote: -Original Message- From: Avi Kivity [mailto:a...@redhat.com] Sent: Sunday, May 06, 2012 11:34 PM To: Xudong Hao Cc: mtosa...@redhat.com; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; Zhang, Xiantao; Hao, Xudong; Alex Williamson Subject: Re: [PATCH] kvm: Enable device LTR/OBFF capability before doing guest device assignment On 05/06/2012 06:24 PM, Xudong Hao wrote: Enable device LTR/OBFF capability before doing device assignment, so that the guest can benefit from them. cc += Alex @@ -166,6 +166,10 @@ int kvm_assign_device(struct kvm *kvm, if (pdev == NULL) return -ENODEV; + /* Enable some device capabilities before doing device assignment, +* so that the guest can benefit from them. +*/ + kvm_iommu_enable_dev_caps(pdev); r = iommu_attach_device(domain, pdev->dev); Suppose we fail here. Do we need to disable_dev_caps()? If kvm_assign_device() fails we'll try to restore the state we saved in kvm_vm_ioctl_assign_device(), so ltr/obff should be brought back to their initial state. I don't think so. When a device is assigned to a guest, it's owned by a pci-stub driver; an attach_device failure here does not affect anything. If the host wants to use it, the host device driver has its own enable/disable dev_caps. Why is device assignment unique here? If there's a default value that's known to be safe, why doesn't pci_enable_device set it for everyone? Host drivers can fine tune the value later if they want.
if (r) { printk(KERN_ERR "assign device %x:%x:%x.%x failed", @@ -228,6 +232,7 @@ int kvm_deassign_device(struct kvm *kvm, PCI_SLOT(assigned_dev->host_devfn), PCI_FUNC(assigned_dev->host_devfn)); + kvm_iommu_disable_dev_caps(pdev); return 0; } @@ -351,3 +356,30 @@ int kvm_iommu_unmap_guest(struct kvm *kvm) iommu_domain_free(domain); return 0; } + +static void kvm_iommu_enable_dev_caps(struct pci_dev *pdev) +{ + /* set default value */ + unsigned long type = PCI_EXP_OBFF_SIGNAL_ALWAYS; + int snoop_lat_ns = 1024, nosnoop_lat_ns = 1024; Where does this magic number come from? The number is the max value that the register supports; it is set as the default here because we do not have any device here and we do not know what the proper value is, so it sets a default value first. The register is composed of latency scale and latency value fields. 1024 is simply the largest value the latency value can hold (+1). The scale field allows latencies up to 34,326,183,936ns to be specified, so please explain how 1024 is a universal default. + + /* LTR (Latency tolerance reporting) allows devices to send +* messages to the root complex indicating their latency +* tolerance for snooped/unsnooped memory transactions. +*/ + pci_enable_ltr(pdev); + pci_set_ltr(pdev, snoop_lat_ns, nosnoop_lat_ns); + + /* OBFF (optimized buffer flush/fill), where supported, +* can help improve energy efficiency by giving devices +* information about when interrupts and other activity +* will have a reduced power impact. +*/ + pci_enable_obff(pdev, type); +} + +static void kvm_iommu_disable_dev_caps(struct pci_dev *pdev) +{ + pci_disable_obff(pdev); + pci_disable_ltr(pdev); +} Do we need to communicate something about these capabilities to the guest? I guess you mean that the host doesn't know whether the guest wants to enable them, right? The new LTR/OBFF features are supposed to be enabled by the guest if the platform and device support them.
It looks like ltr is a two part mechanism; the capability and enable live in the pci express capability, but the tuning registers live in extended capability space. The guest doesn't yet have access to the latter since we don't have an express chipset. The capability and enable are read-only to the guest currently, same for obff. Thanks, Alex
Re: [PATCH] PCI: save/restore max Latency Value for device LTR
On Sun, May 6, 2012 at 8:11 AM, Xudong Hao xudong@linux.intel.com wrote: LTR: Save Max snoop/no-snoop Latency Value in pci_save_pcie_state, and restore them in pci_restore_pcie_state. Signed-off-by: Xudong Hao xudong@intel.com --- drivers/pci/pci.c | 12  1 files changed, 12 insertions(+), 0 deletions(-) diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 111569c..c8573c3 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -875,6 +875,12 @@ static int pci_save_pcie_state(struct pci_dev *dev) pci_read_config_word(dev, pos + PCI_EXP_LNKCTL2, &cap[i++]); if (pcie_cap_has_sltctl2(dev->pcie_type, flags)) pci_read_config_word(dev, pos + PCI_EXP_SLTCTL2, &cap[i++]); + if (pci_ltr_supported(dev)) { + pci_read_config_word(dev, pos + PCI_LTR_MAX_SNOOP_LAT, + &cap[i++]); + pci_read_config_word(dev, pos + PCI_LTR_MAX_NOSNOOP_LAT, + &cap[i++]); + } return 0; } @@ -908,6 +914,12 @@ static void pci_restore_pcie_state(struct pci_dev *dev) pci_write_config_word(dev, pos + PCI_EXP_LNKCTL2, cap[i++]); if (pcie_cap_has_sltctl2(dev->pcie_type, flags)) pci_write_config_word(dev, pos + PCI_EXP_SLTCTL2, cap[i++]); + if (pci_ltr_supported(dev)) { + pci_write_config_word(dev, pos + PCI_LTR_MAX_SNOOP_LAT, + cap[i++]); + pci_write_config_word(dev, pos + PCI_LTR_MAX_NOSNOOP_LAT, + cap[i++]); + } } This doesn't make any sense to me. pos is the offset of the PCI Express Capability (identifier 10h). LTR is a separate extended capability (identifier 18h), so you at least have to look up its offset. Bjorn
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
* Avi Kivity a...@redhat.com wrote: PS: Nikunj had experimented that pv-flush-tlb + paravirt-spinlock is a win on PLE where only one of them alone could not prove the benefit. I'd like to see those numbers, then. Ingo, please hold on the kvm-specific patches, meanwhile. I'll hold off on the whole thing - frankly, we don't want this kind of Xen-only complexity. If KVM can make use of PLE then Xen ought to be able to do it as well. If both Xen and KVM make good use of it then that's a different matter. Thanks, Ingo
Re: [Qemu-devel] Adding an IPMI BMC device to KVM
I think we are getting a little out of hand here, and we are mixing up concepts :). There are lots of things IPMI *can* do (including serial access, VGA snooping, LAN access, etc.) but I don't see any value in that. The main thing here is to emulate the interface to the guest. OOB management is really more appropriately handled with libvirt. How the BMC integrates into the hardware varies a *lot* between systems, but it's really kind of irrelevant. (Well, almost irrelevant, BMCs can provide a direct I2C messaging capability, and that may matter.) A guest can have one (or more) of a number of interfaces (that are all fairly bad, unfortunately). The standard ones are called KCS, BT and SMIC and they generally are directly on the ISA bus, but are in memory on non-x86 boxes (and on some x86 boxes) and sometimes on the PCI bus. Some systems also have interfaces over I2C, but that hasn't really caught on. Others have interfaces over serial ports, and that unfortunately has caught on in the ATCA world. And there are at least 3 different basic types of serial port interfaces with sub-variants :(. I'm not sure what the USB rndis device is, but I'll look at it. But there is no IPMI over USB. The big things a guest can do are sensor management, watchdog timer, reset, and power control. In complicated IPMI-based systems like ATCA, a guest may want to send messages through its local IPMI controller to other guests' IPMI controllers or to a central BMC that runs an entire chassis of systems. So that may need to be supported, depending on what people want to do and how hard they want to work on it. My proposal is to start small, with just a local interface, watchdog timer, sensors and power control. But have an architecture that would allow external LAN access, tying BMCs in different qemu instances together, perhaps serial over IPMI, and other things of that nature.
-corey On 05/07/2012 10:21 AM, Anthony Liguori wrote: On 05/07/2012 10:11 AM, Avi Kivity wrote: On 05/07/2012 05:55 PM, Anthony Liguori wrote: For all intents and purposes, the BMC/RSA is a separate physical machine. That's true for any other card on a machine. It has a separate power source for all intents and purposes. If you think of it in QOM terms, what connects the nodes together ultimately is the Vcc pin that travels across all devices. The RTC is a little special because it has a battery backed CMOS/clock but it's also handled specially. And we fail to emulate it correctly as well, wrt. alarms. The BMC does not share Vcc. It's no different than a separate physical box. It just shares a couple buses. It controls the main power plane, reset line, can read VGA and emulate keyboard, seems pretty well integrated. Emulating the keyboard is done through USB. How the VGA thing works is very vendor dependent. The VGA snooping can happen as part of the display path (essentially connected via a VGA cable) or it can be a side-band using a special graphics adapter. I think QEMU VNC emulation is a pretty good analogy actually. That is one way to do it. Figure out the interactions between two different parts in a machine, define an interface for them to communicate, and split them into two processes. We don't usually do that; I believe your motivation is that the two have different power domains (but then so do NICs with wake-on-LAN support). The power still comes from the PCI bus. Maybe. But it's on when the rest of the machine is off. So Vcc is not shared. That's all plumbed through the PCI bus FWIW. Think of something like a blade center. Each individual blade does not have its own BMC. There's a single common BMC that provides an IPMI interface for all blades. Yet each blade still sees an IPMI interface via a USB rndis device. You can rip out the memory, PCI devices, etc. from a box while the power is on and the BMC will be unaffected.
At any rate, you would have some sort of virtual hardware device that essentially spoke QMP to the slave instance. You could just do virtio-serial and call it a day actually. Sorry I lost you. Which is the master and which is the slave? The BMC is the master, system being controlled is the slave. Ah okay. It also has to read the VGA output (say via vnc) and supply keyboard input (via sendkey). Right, QMP + VNC is a pretty accurate analogy. It really boils down to what you are trying to do. If you want to just get some piece of software working that expects to do IPMI, the easiest thing to do is run IPMI in the host and use a USB rndis interface to interact with it. That would be most strange. A remote client connecting to the IPMI interface would control the power level of the host, not the guest. IPMI with a custom backend is what I mean. That's what I mean by an IPMI - libvirt bridge. You could build a libvirt client that exposes an IPMI interface and when you issue IPMI commands, it translate it to
Semantics of -cpu host (was Re: [Qemu-devel] [PATCH 2/2] Expose tsc deadline timer cpuid to guest)
Andre? Are you able to help to answer the question below? I would like to clarify what's the expected behavior of -cpu host to be able to continue working on it. I believe the code will need to be fixed on either case, but first we need to figure out what are the expectations/requirements, to know _which_ changes will be needed. On Tue, Apr 24, 2012 at 02:19:25PM -0300, Eduardo Habkost wrote: (CCing Andre Przywara, in case he can help to clarify what's the expected meaning of -cpu host) [...] I am not sure I understand what you are proposing. Let me explain the use case I am thinking about: - Feature FOO is of type (A) (e.g. just a new instruction set that doesn't require additional userspace support) - User has a Qemu version that doesn't know anything about feature FOO - User gets a new CPU that supports feature FOO - User gets a new kernel that supports feature FOO (i.e. has FOO in GET_SUPPORTED_CPUID) - User does _not_ upgrade Qemu. - User expects to get feature FOO enabled if using -cpu host, without upgrading Qemu. The problem here is: to support the above use-case, userspace needs a probing mechanism that can differentiate _new_ (previously unknown) features that are in group (A) (safe to blindly enable) from features that are in group (B) (that can't be enabled without an userspace upgrade). In short, it becomes a problem if we consider the following case: - Feature BAR is of type (B) (it can't be enabled without extra userspace support) - User has a Qemu version that doesn't know anything about feature BAR - User gets a new CPU that supports feature BAR - User gets a new kernel that supports feature BAR (i.e. has BAR in GET_SUPPORTED_CPUID) - User does _not_ upgrade Qemu. - User simply shouldn't get feature BAR enabled, even if using -cpu host, otherwise Qemu would break. If userspace always limited itself to features it knows about, it would be really easy to implement the feature without any new probing mechanism from the kernel.
But that's not how I think users expect -cpu host to work. Maybe I am wrong, I don't know. I am CCing Andre, who introduced the -cpu host feature, in case he can explain what's the expected semantics on the cases above. -- Eduardo
Re: [Qemu-devel] Adding an IPMI BMC device to KVM
FWIW, the idea of an IPMI interface to VMs was proposed for libvirt not too long ago. See: https://bugzilla.redhat.com/show_bug.cgi?id=815136 Dave On Mon, May 07, 2012 at 01:07:45PM -0500, Corey Minyard wrote: I think we are getting a little out of hand here, and we are mixing up concepts :). There are lots of things IPMI *can* do (including serial access, VGA snooping, LAN access, etc.) but I don't see any value in that. The main thing here is to emulate the interface to the guest. OOB management is really more appropriately handled with libvirt. How the BMC integrates into the hardware varies a *lot* between systems, but it's really kind of irrelevant. (Well, almost irrelevant, BMCs can provide a direct I2C messaging capability, and that may matter.) A guest can have one (or more) of a number of interfaces (that are all fairly bad, unfortunately). The standard ones are called KCS, BT and SMIC and they generally are directly on the ISA bus, but are in memory on non-x86 boxes (and on some x86 boxes) and sometimes on the PCI bus. Some systems also have interfaces over I2C, but that hasn't really caught on. Others have interfaces over serial ports, and that unfortunately has caught on in the ATCA world. And there are at least 3 different basic types of serial port interfaces with sub-variants :(. I'm not sure what the USB rndis device is, but I'll look at it. But there is no IPMI over USB. The big things a guest can do are sensor management, watchdog timer, reset, and power control. In complicated IPMI-based systems like ATCA, a guest may want to send messages through its local IPMI controller to other guests' IPMI controllers or to a central BMC that runs an entire chassis of systems. So that may need to be supported, depending on what people want to do and how hard they want to work on it. My proposal is to start small, with just a local interface, watchdog timer, sensors and power control.
But have an architecture that would allow external LAN access, tying BMCs in different qemu instances together, perhaps serial over IPMI, and other things of that nature. -corey On 05/07/2012 10:21 AM, Anthony Liguori wrote: On 05/07/2012 10:11 AM, Avi Kivity wrote: On 05/07/2012 05:55 PM, Anthony Liguori wrote: For all intents and purposes, the BMC/RSA is a separate physical machine. That's true for any other card on a machine. It has a separate power source for all intents and purposes. If you think of it in QOM terms, what connects the nodes together ultimately is the Vcc pin that travels across all devices. The RTC is a little special because it has a battery backed CMOS/clock but it's also handled specially. And we fail to emulate it correctly as well, wrt. alarms. The BMC does not share Vcc. It's no different than a separate physical box. It just shares a couple buses. It controls the main power place, reset line, can read VGA and emulate keyboard, seems pretty well integrated. Emulating the keyboard is done through USB. How the VGA thing works is very vendor dependent. The VGA snooping can happen as part of the display path (essentially connected via a VGA cable) or it can be a side-band using a special graphics adapter. I think QEMU VNC emulation is a pretty good analogy actually. That is one way to do it. Figure out the interactions between two different parts in a machine, define an interface for them to communicate, and split them into two processes. We don't usually do that; I believe your motivation is that the two have different power domains (but then so do NICs with wake-on-LAN support). The power still comes from the PCI bus. Maybe. But it's on when the rest of the machine is off. So Vcc is not shared. That's all plumbed through the PCI bus FWIW. Think of something like a blade center. Each individual blade does not have it's own BMC. There's a single common BMC that provides an IPMI interface for all blades. 
Yet each blade still sees an IPMI interface via a USB rndis device. You can rip out the memory, PCI devices, etc. from a box while the power is on and the BMC will be unaffected. At any rate, you would have some sort of virtual hardware device that essentially spoke QMP to the slave instance. You could just do virtio-serial and call it a day actually. Sorry I lost you. Which is the master and which is the slave? The BMC is the master, system being controlled is the slave. Ah okay. It also has to read the VGA output (say via vnc) and supply keyboard input (via sendkey). Right, QMP + VNC is a pretty accurate analogy. It really boils down to what you are trying to do. If you want to just get some piece of software working that expects to do IPMI, the easiest thing to do is run IPMI in the host and use a USB rndis interface to interact with it. That would be most strange. A remote client connecting to the IPMI interface
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On Mon, 7 May 2012, Ingo Molnar wrote: * Avi Kivity a...@redhat.com wrote: PS: Nikunj had experimented that pv-flush tlb + paravirt-spinlock is a win on PLE where only one of them alone could not prove the benefit. I'd like to see those numbers, then. Ingo, please hold on the kvm-specific patches, meanwhile. I'll hold off on the whole thing - frankly, we don't want this kind of Xen-only complexity. If KVM can make use of PLE then Xen ought to be able to do it as well. If both Xen and KVM make good use of it then that's a different matter. Aside of that, it's kinda strange that a dude named Nikunj is referenced in the argument chain, but I can't find him on the CC list. Thanks, tglx -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Adding an IPMI BMC device to KVM
On 05/07/2012 02:45 PM, Dave Allan wrote: FWIW, the idea of an IPMI interface to VMs was proposed for libvirt not too long ago. See: https://bugzilla.redhat.com/show_bug.cgi?id=815136 Well, it wouldn't be too hard to do. I already have working emulation code that does the IPMI LAN interface (including the IPMI 2.0 stuff for more reasonable security). I have a KCS interface and a minimal IPMI controller working in KVM, though I'm not quite sure the best final way to hook it in. Configuration is going to be the hardest part, but a minimal configuration for providing basic management would be easy. -corey Dave On Mon, May 07, 2012 at 01:07:45PM -0500, Corey Minyard wrote: I think we are getting a little out of hand here, and we are mixing up concepts :). There are lots of things IPMI *can* do (including serial access, VGA snooping, LAN access, etc.) but I don't see any value in that. The main thing here is to emulate the interface to the guest. OOB management is really more appropriately handled with libvirt. How the BMC integrates into the hardware varies a *lot* between systems, but it's really kind of irrelevant. (Well, almost irrelevant, BMCs can provide a direct I2C messaging capability, and that may matter.) A guest can have one (or more) of a number of interfaces (that are all fairly bad, unfortunately). The standard ones are called KCS, BT and SMIC and they generally are directly on the ISA bus, but are in memory on non-x86 boxes (and on some x86 boxes) and sometimes on the PCI bus. Some systems also have interfaces over I2C, but that hasn't really caught on. Others have interfaces over serial ports, and that unfortunately has caught on in the ATCA world. And there are at least 3 different basic types of serial port interfaces with sub-variants :(. I'm not sure what the USB rndis device is, but I'll look at it. But there is no IPMI over USB. The big things a guest can do are sensor management, watchdog timer, reset, and power control. 
In complicated IPMI-based systems like ATCA, a guest may want to send messages through its local IPMI controller to other guests' IPMI controllers or to a central BMC that runs an entire chassis of systems. So that may need to be supported, depending on what people want to do and how hard they want to work on it. My proposal is to start small, with just a local interface, watchdog timer, sensors and power control. But have an architecture that would allow external LAN access, tying BMCs in different qemu instances together, perhaps serial over IPMI, and other things of that nature. -corey On 05/07/2012 10:21 AM, Anthony Liguori wrote: On 05/07/2012 10:11 AM, Avi Kivity wrote: On 05/07/2012 05:55 PM, Anthony Liguori wrote: For all intents and purposes, the BMC/RSA is a separate physical machine. That's true for any other card on a machine. It has a separate power source for all intents and purposes. If you think of it in QOM terms, what connects the nodes together ultimately is the Vcc pin that travels across all devices. The RTC is a little special because it has a battery backed CMOS/clock but it's also handled specially. And we fail to emulate it correctly as well, wrt. alarms. The BMC does not share Vcc. It's no different than a separate physical box. It just shares a couple buses. It controls the main power plane, reset line, can read VGA and emulate keyboard, seems pretty well integrated. Emulating the keyboard is done through USB. How the VGA thing works is very vendor dependent. The VGA snooping can happen as part of the display path (essentially connected via a VGA cable) or it can be a side-band using a special graphics adapter. I think QEMU VNC emulation is a pretty good analogy actually. That is one way to do it. Figure out the interactions between two different parts in a machine, define an interface for them to communicate, and split them into two processes. 
We don't usually do that; I believe your motivation is that the two have different power domains (but then so do NICs with wake-on-LAN support). The power still comes from the PCI bus. Maybe. But it's on when the rest of the machine is off. So Vcc is not shared. That's all plumbed through the PCI bus FWIW. Think of something like a blade center. Each individual blade does not have its own BMC. There's a single common BMC that provides an IPMI interface for all blades. Yet each blade still sees an IPMI interface via a USB rndis device. You can rip out the memory, PCI devices, etc. from a box while the power is on and the BMC will be unaffected. At any rate, you would have some sort of virtual hardware device that essentially spoke QMP to the slave instance. You could just do virtio-serial and call it a day actually. Sorry I lost you. Which is the master and which is the slave? The BMC is the master, system being controlled is the slave. Ah okay. It also has to read the VGA output (say via vnc) and supply keyboard input (via
[PULL net-next] macvtap, vhost and virtio tools updates
There are mostly bugfixes here. I hope to merge some more patches by 3.5; in particular, vlan support fixes are waiting for Eric's ack, and a version of the tracepoint patch might be ready in time, but let's merge what's ready so it's testable. This includes a ton of zerocopy fixes by Jason - good stuff but too intrusive for 3.4, and zerocopy is experimental anyway. virtio has supported delayed interrupts for a while now, so adding support to the virtio tool made sense. Please pull into net-next and merge for 3.5. Thanks! MST

The following changes since commit e4ae004b84b315dd4b762e474f97403eac70f76a:

  netem: add ECN capability (2012-05-01 09:39:48 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost-net-next

for you to fetch changes up to c70aa540c7a9f67add11ad3161096fb95233aa2e:

  vhost: zerocopy: poll vq in zerocopy callback (2012-05-02 18:22:25 +0300)

Jason Wang (9):
  macvtap: zerocopy: fix offset calculation when building skb
  macvtap: zerocopy: fix truesize underestimation
  macvtap: zerocopy: put page when fail to get all requested user pages
  macvtap: zerocopy: set SKBTX_DEV_ZEROCOPY only when skb is built successfully
  macvtap: zerocopy: validate vectors before building skb
  vhost_net: zerocopy: fix possible NULL pointer dereference of vq->bufs
  vhost_net: re-poll only on EAGAIN or ENOBUFS
  vhost_net: zerocopy: adding and signalling immediately when fully copied
  vhost: zerocopy: poll vq in zerocopy callback

Michael S. Tsirkin (1):
  virtio/tools: add delayed interupt mode

 drivers/net/macvtap.c       | 57 ++-
 drivers/vhost/net.c         |  7 -
 drivers/vhost/vhost.c       |  1 +
 tools/virtio/linux/virtio.h |  1 +
 tools/virtio/virtio_test.c  | 26 ---
 5 files changed, 69 insertions(+), 23 deletions(-)
Re: [PATCH 1/2] vhost: basic tracepoints
On Tue, Apr 10, 2012 at 10:58:19AM +0800, Jason Wang wrote: To help with performance optimizations and debugging, this patch adds tracepoints for vhost. Note that the tracepoints are only for vhost; the net code is not touched. Two kinds of activities are traced: virtio and vhost work. Signed-off-by: Jason Wang jasow...@redhat.com Thanks for looking into this. Some questions: Do we need to prefix traces with vhost_virtio_? How about a trace for enabling/disabling interrupts? Trace for a suppressed interrupt? I think we need a vq # not a pointer. Also need some id for when there are many guests. Use the vhost thread name (includes owner pid)? Its pid? Both? Also, traces do add very small overhead when compiled but not enabled, mainly due to increasing register pressure. Need to test to make sure perf is not hurt. Some traces are just for debugging, so build them on debug kernels only? Further, there are many events; some are rare, some are common. I think we need some naming scheme so that the really useful and low overhead stuff is easier to enable, ignoring the rarely useful/higher overhead traces.
---
 drivers/vhost/trace.h | 153 +
 drivers/vhost/vhost.c | 17 +
 2 files changed, 168 insertions(+), 2 deletions(-)
 create mode 100644 drivers/vhost/trace.h

diff --git a/drivers/vhost/trace.h b/drivers/vhost/trace.h
new file mode 100644
index 000..0423899
--- /dev/null
+++ b/drivers/vhost/trace.h
@@ -0,0 +1,153 @@
+#if !defined(_TRACE_VHOST_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_VHOST_H
+
+#include <linux/tracepoint.h>
+#include "vhost.h"
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM vhost
+
+/*
+ * Tracepoint for updating used flag. 
+ */
+TRACE_EVENT(vhost_virtio_update_used_flags,
+	TP_PROTO(struct vhost_virtqueue *vq),
+	TP_ARGS(vq),
+
+	TP_STRUCT__entry(
+		__field(struct vhost_virtqueue *, vq)
+		__field(u16, used_flags)
+	),
+
+	TP_fast_assign(
+		__entry->vq = vq;
+		__entry->used_flags = vq->used_flags;
+	),
+
+	TP_printk("vhost update used flag %x to vq %p notify %s",
+		  __entry->used_flags, __entry->vq,
+		  (__entry->used_flags & VRING_USED_F_NO_NOTIFY) ?
+		  "disabled" : "enabled")
+);
+
+/*
+ * Tracepoint for updating avail event.
+ */
+TRACE_EVENT(vhost_virtio_update_avail_event,
+	TP_PROTO(struct vhost_virtqueue *vq),
+	TP_ARGS(vq),
+
+	TP_STRUCT__entry(
+		__field(struct vhost_virtqueue *, vq)
+		__field(u16, avail_idx)
+	),
+
+	TP_fast_assign(
+		__entry->vq = vq;
+		__entry->avail_idx = vq->avail_idx;
+	),
+
+	TP_printk("vhost update avail idx %u(%u) for vq %p",
+		  __entry->avail_idx, __entry->avail_idx %
+		  __entry->vq->num, __entry->vq)
+);
+
+/*
+ * Tracepoint for processing descriptor.
+ */
+TRACE_EVENT(vhost_virtio_get_vq_desc,
+	TP_PROTO(struct vhost_virtqueue *vq, unsigned int index,
+		 unsigned out, unsigned int in),
+	TP_ARGS(vq, index, out, in),
+
+	TP_STRUCT__entry(
+		__field(struct vhost_virtqueue *, vq)
+		__field(unsigned int, head)
+		__field(unsigned int, out)
+		__field(unsigned int, in)
+	),
+
+	TP_fast_assign(
+		__entry->vq = vq;
+		__entry->head = index;
+		__entry->out = out;
+		__entry->in = in;
+	),
+
+	TP_printk("vhost get vq %p head %u out %u in %u",
+		  __entry->vq, __entry->head, __entry->out, __entry->in)
+
+);
+
+/*
+ * Tracepoint for signal guest. 
+ */
+TRACE_EVENT(vhost_virtio_signal,
+	TP_PROTO(struct vhost_virtqueue *vq),
+	TP_ARGS(vq),
+
+	TP_STRUCT__entry(
+		__field(struct vhost_virtqueue *, vq)
+	),
+
+	TP_fast_assign(
+		__entry->vq = vq;
+	),
+
+	TP_printk("vhost signal vq %p", __entry->vq)
+);
+
+DECLARE_EVENT_CLASS(vhost_work_template,
+	TP_PROTO(struct vhost_dev *dev, struct vhost_work *work),
+	TP_ARGS(dev, work),
+
+	TP_STRUCT__entry(
+		__field(struct vhost_dev *, dev)
+		__field(struct vhost_work *, work)
+		__field(void *, function)
+	),
+
+	TP_fast_assign(
+		__entry->dev = dev;
+		__entry->work = work;
+		__entry->function = work->fn;
+	),
+
+	TP_printk("%pf for work %p dev %p",
+		  __entry->function, __entry->work, __entry->dev)
+);
+
+DEFINE_EVENT(vhost_work_template, vhost_work_queue_wakeup,
+	TP_PROTO(struct vhost_dev *dev, struct vhost_work *work),
+	TP_ARGS(dev, work));
+
+DEFINE_EVENT(vhost_work_template, vhost_work_queue_coalesce,
+	TP_PROTO(struct vhost_dev *dev, struct vhost_work *work),
+	TP_ARGS(dev, work));
+
+DEFINE_EVENT(vhost_work_template,
Fwd: Re: Adding an IPMI BMC device to KVM
Resending to the list, plain text only. Sorry about that... Original Message Subject: Re: Adding an IPMI BMC device to KVM Date: Mon, 07 May 2012 16:57:06 -0500 From: Corey Minyard tcminy...@gmail.com Reply-To: miny...@acm.org To: Anthony Liguori anth...@codemonkey.ws CC: kvm@vger.kernel.org, Avi Kivity a...@redhat.com On 05/05/2012 09:29 AM, Anthony Liguori wrote: You could just use a USB rndis device and then run an IPMI server on the host. That is probably the simplest way to test. I'm trying to figure out how RNDIS would be helpful here. You can't have the guest talk USB for this, it's going to talk the standard interfaces. However, I am loath to create my own protocol for talking between the BMC and QEMU. Perhaps RNDIS could be useful, but it's massive overkill. It also doesn't provide any security. If there were client and server libraries that would run over a socket, that would be tempting. I'm wondering if security is that big a deal, though. If you use a chardev, and you have QEMU make a connection to the BMC, maybe that would be ok. BMCs typically run a full OS like Linux making emulation as a device in QEMU prohibitively hard. BMCs are generally very simple 8-bit microcontrollers, unless they are something like an ATCA shelf manager. Emulation is pretty easy. Well, emulation of an ATCA shelf manager wouldn't be easy, but that's hopefully not required in the near future. -corey Regards, Anthony Liguori On May 4, 2012 2:10 PM, Corey Minyard tcminy...@gmail.com mailto:tcminy...@gmail.com wrote: I've been working on adding an IPMI BMC as a virtual device under KVM. I'm doing this for two primary reasons, one to have a better test environment than what I have now for testing IPMI issues, and second to be able to better simulate a legacy environment for customers porting legacy software. For those that don't know, IPMI is a system management interface. 
Generally systems with IPMI have a small microcontroller, called a BMC, that is always on when the board is powered. The BMC is capable of controlling power and reset on the board, and it is hooked to sensors on the board (voltage, current, temperature, the presence of things like DIMMS and power supplies, intrusion detection, and a host of other things). The main processor on a system can communicate with the BMC over a device. Often these systems also have a LAN interface that lets you control the system remotely even when it's off. In addition, IPMI provides access to FRU (Field Replaceable Unit) data that describes the components of the system that can be replaced. It also has data records that describe the sensor, so it is possible to directly interpret the sensor data and know what the sensor is measuring without outside data. I've been struggling a bit with how to implement this. There is a lot of configuration information, and you need ways to simulate the sensors. This type of interface is a little sensitive, since it has direct access to the reset and power control of a system. I was at first considering having the BMC be an external program that KVM connected to over a chardev, with possibly a simulated LAN interface. This has the advantage that the BMC can run even when KVM is down. It could even start up KVM for a power up, though I'm not sure how valuable that would be. Plus it could be used for other virtual machines. However, that means there is an interface to KVM over a chardev that could do nasty things, and even be a possible intrusion point. It also means there is a separate program to maintain. You could also include the BMC inside of KVM and run it as a separate thread. That way there doesn't have to be an insecure interface. But the BMC will need a lot of configuration data and this will add a bunch of code to KVM that's only tangentially related to it. And you would still need a way to simulate setting sensors and such for testing things. 
Either way, is this interesting for including into KVM? Does anyone have any opinions on the possible ways to implement this? Thanks, -corey
Re: possible circular locking dependency
On (05/07/12 10:52), Avi Kivity wrote: On 05/07/2012 06:47 AM, Paul E. McKenney wrote: On Sun, May 06, 2012 at 11:34:39PM +0300, Sergey Senozhatsky wrote: On (05/06/12 09:42), Paul E. McKenney wrote: On Sun, May 06, 2012 at 11:55:30AM +0300, Avi Kivity wrote: On 05/03/2012 11:02 PM, Sergey Senozhatsky wrote: Hello, 3.4-rc5 Whoa. Looks like inconsistent locking between cpufreq and synchronize_srcu_expedited(). kvm triggered this because it is one of the few users of synchronize_srcu_expedited(), but I don't think it is doing anything wrong directly. Dave, Paul? SRCU hasn't changed much in mainline for quite some time. Holding the hotplug mutex across a synchronize_srcu() is a bad idea, though. However, there is a reworked implementation (courtesy of Lai Jiangshan) in -rcu that does not acquire the hotplug mutex. Could you try that out? Paul, should I try solely -rcu, or are there several commits to pick up and apply on top of the -linus tree? If you want the smallest possible change, take the rcu/srcu branch of -rcu. If you want the works, take the rcu/next branch of -rcu. You can find -rcu at: git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git To make the difference even smaller, merge the above branch with v3.4-rc5. I'm unable to reproduce the issue on 3.4-rc6 so far. So I guess this will take some time. Sergey
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 06:49 AM, Avi Kivity wrote: On 05/07/2012 04:46 PM, Srivatsa Vaddagiri wrote: * Raghavendra K T raghavendra...@linux.vnet.ibm.com [2012-05-07 19:08:51]: I'll get hold of a PLE machine and come up with the numbers soon, but I'll expect the improvement around 1-3% as it was in the last version. Deferring preemption (when vcpu is holding lock) may give us better than 1-3% results on PLE hardware. Something worth trying IMHO. Is the improvement so low, because PLE is interfering with the patch, or because PLE already does a good job? How does PLE help with ticket scheduling on unlock? I thought it would just help with the actual spin loops. J
Re: [Qemu-devel] Adding an IPMI BMC device to KVM
On 05/07/2012 01:07 PM, Corey Minyard wrote: I think we are getting a little out of hand here, and we are mixing up concepts :). There are lots of things IPMI *can* do (including serial access, VGA snooping, LAN access, etc.) but I don't see any value in that. The main thing here is to emulate the interface to the guest. OOB management is really more appropriately handled with libvirt. How the BMC integrates into the hardware varies a *lot* between systems, but it's really kind of irrelevant. (Well, almost irrelevant, BMCs can provide a direct I2C messaging capability, and that may matter.) A guest can have one (or more) of a number of interfaces (that are all fairly bad, unfortunately). The standard ones are called KCS, BT and SMIC and they generally are directly on the ISA bus, but are in memory on non-x86 boxes (and on some x86 boxes) and sometimes on the PCI bus. Some systems also have interfaces over I2C, but that hasn't really caught on. Others have interfaces over serial ports, and that unfortunately has caught on in the ATCA world. And there are at least 3 different basic types of serial port interfaces with sub-variants :(. I'm not sure what the USB rndis device is, but I'll look at it. But there is no IPMI over USB. USB rndis == USB network adapter. It's just seen by the machine as IPMI over LAN. Regards, Anthony Liguori
Re: Semantics of -cpu host (was Re: [Qemu-devel] [PATCH 2/2] Expose tsc deadline timer cpuid to guest)
On 07.05.2012, at 20:21, Eduardo Habkost wrote: Andre? Are you able to help to answer the question below? I would like to clarify what's the expected behavior of -cpu host to be able to continue working on it. I believe the code will need to be fixed on either case, but first we need to figure out what are the expectations/requirements, to know _which_ changes will be needed. On Tue, Apr 24, 2012 at 02:19:25PM -0300, Eduardo Habkost wrote: (CCing Andre Przywara, in case he can help to clarify what's the expected meaning of -cpu host) [...] I am not sure I understand what you are proposing. Let me explain the use case I am thinking about: - Feature FOO is of type (A) (e.g. just a new instruction set that doesn't require additional userspace support) - User has a Qemu version that doesn't know anything about feature FOO - User gets a new CPU that supports feature FOO - User gets a new kernel that supports feature FOO (i.e. has FOO in GET_SUPPORTED_CPUID) - User does _not_ upgrade Qemu. - User expects to get feature FOO enabled if using -cpu host, without upgrading Qemu. The problem here is: to support the above use-case, userspace needs a probing mechanism that can differentiate _new_ (previously unknown) features that are in group (A) (safe to blindly enable) from features that are in group (B) (that can't be enabled without an userspace upgrade). In short, it becomes a problem if we consider the following case: - Feature BAR is of type (B) (it can't be enabled without extra userspace support) - User has a Qemu version that doesn't know anything about feature BAR - User gets a new CPU that supports feature BAR - User gets a new kernel that supports feature BAR (i.e. has BAR in GET_SUPPORTED_CPUID) - User does _not_ upgrade Qemu. - User simply shouldn't get feature BAR enabled, even if using -cpu host, otherwise Qemu would break. 
If userspace always limited itself to features it knows about, it would be really easy to implement the feature without any new probing mechanism from the kernel. But that's not how I think users expect -cpu host to work. Maybe I am wrong, I don't know. I am CCing Andre, who introduced the -cpu host feature, in case he can explain what's the expected semantics on the cases above. Can you think of any feature that'd go into category B? All features I'm aware of work fine (without migration, but that one is moot for -cpu host anyway) as long as the host kvm implementation is fine with it (GET_SUPPORTED_CPUID). So they'd be category A. Alex
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/08/2012 04:45 AM, Jeremy Fitzhardinge wrote: On 05/07/2012 06:49 AM, Avi Kivity wrote: On 05/07/2012 04:46 PM, Srivatsa Vaddagiri wrote: * Raghavendra K T raghavendra...@linux.vnet.ibm.com [2012-05-07 19:08:51]: I'll get hold of a PLE machine and come up with the numbers soon, but I'll expect the improvement around 1-3% as it was in the last version. Deferring preemption (when vcpu is holding lock) may give us better than 1-3% results on PLE hardware. Something worth trying IMHO. Is the improvement so low, because PLE is interfering with the patch, or because PLE already does a good job? How does PLE help with ticket scheduling on unlock? I thought it would just help with the actual spin loops. Hmm. This strikes something to me. I think I should replace the 'while 1' hog with some *real job* to measure the over-commit case. I hope to see greater improvements because of the fairness and scheduling of the patch-set. Maybe all along I was measuring something equal to the 1x case.
Re: [PATCH 0/4] Unlocked TLB flush
On Thu, May 03, 2012 at 02:22:58PM +0300, Avi Kivity wrote: This patchset implements unlocked TLB flushing for KVM. An operation that generates stale TLB entries can mark the TLB as dirty instead of flushing immediately, and then flush after releasing mmu_lock but before returning to the guest or the caller. A few call sites are converted too. Note not all call sites are easily convertible; as an example, sync_page() must flush before reading the guest page table. Huh? Are you referring to:

 * Note:
 * We should flush all tlbs if spte is dropped even though guest is
 * responsible for it. Since if we don't, kvm_mmu_notifier_invalidate_page
 * and kvm_mmu_notifier_invalidate_range_start detect the mapping page isn't
 * used by guest then tlbs are not flushed, so guest is allowed to access the
 * freed pages.
 * And we increase kvm->tlbs_dirty to delay tlbs flush in this case.

With an increased dirtied_count the flush can be performed by kvm_mmu_notifier_invalidate_page.
Re: [PATCH 1/4] KVM: Add APIs for unlocked TLB flush
On Mon, May 07, 2012 at 10:59:04AM +0300, Avi Kivity wrote: On 05/07/2012 10:06 AM, Xiao Guangrong wrote: On 05/03/2012 10:11 PM, Avi Kivity wrote: On 05/03/2012 04:23 PM, Xiao Guangrong wrote: On 05/03/2012 07:22 PM, Avi Kivity wrote: Currently we flush the TLB while holding mmu_lock. This increases the lock hold time by the IPI round-trip time, increasing contention, and makes dropping the lock (for latency reasons) harder. This patch changes TLB management to be usable locklessly, introducing the following APIs: kvm_mark_tlb_dirty() - mark the TLB as containing stale entries kvm_cond_flush_remote_tlbs() - flush the TLB if it was marked as dirty These APIs can be used without holding mmu_lock (though if the TLB became stale due to shadow page table modifications, typically it will need to be called with the lock held to prevent other threads from seeing the modified page tables with the TLB unmarked and unflushed)/ Signed-off-by: Avi Kivity a...@redhat.com --- Documentation/virtual/kvm/locking.txt | 14 ++ arch/x86/kvm/paging_tmpl.h|4 ++-- include/linux/kvm_host.h | 22 +- virt/kvm/kvm_main.c | 29 - 4 files changed, 57 insertions(+), 12 deletions(-) diff --git a/Documentation/virtual/kvm/locking.txt b/Documentation/virtual/kvm/locking.txt index 3b4cd3b..f6c90479 100644 --- a/Documentation/virtual/kvm/locking.txt +++ b/Documentation/virtual/kvm/locking.txt @@ -23,3 +23,17 @@ Arch:x86 Protects: - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset} - tsc offset in vmcb Comment: 'raw' because updating the tsc offsets must not be preempted. + +3. TLB control +-- + +The following APIs should be used for TLB control: + + - kvm_mark_tlb_dirty() - indicates that the TLB is out of sync wrt + either guest or host page tables. 
+ - kvm_flush_remote_tlbs() - unconditionally flush the tlbs
+ - kvm_cond_flush_remote_tlbs() - flush the TLBs if previously marked
+
+These may be used without mmu_lock, though kvm_mark_tlb_dirty() needs to be
+used while holding mmu_lock if it is called due to host page table changes
+(contrast to guest page table changes).

In these patches, it seems that kvm_mark_tlb_dirty is always used under the protection of mmu-lock, yes? Correct. It's possible we'll find a use outside mmu_lock, but this isn't likely. If we need to call kvm_mark_tlb_dirty outside mmu-lock, just use kvm_flush_remote_tlbs instead:

	if (need-flush-tlb)
		flush = true;
	do something...
	if (flush)
		kvm_flush_remote_tlbs

It depends on how need-flush-tlb is computed. If it depends on sptes, then we must use kvm_cond_flush_remote_tlbs(). If both kvm_mark_tlb_dirty and kvm_cond_flush_remote_tlbs are used out of mmu-lock, I think we can use kvm_flush_remote_tlbs instead. If it is so, dirtied_count/flushed_count need not be atomic. But we always mark with mmu_lock held. Yes, so, we can change kvm_mark_tlb_dirty to:

+static inline void kvm_mark_tlb_dirty(struct kvm *kvm)
+{
+	/*
+	 * Make any changes to the page tables visible to remote flushers.
+	 */
+	smp_mb();
+	kvm->tlb_state.dirtied_count++;
+}

Yes. We'll have to change it again if we ever dirty sptes outside the lock, but that's okay. Please don't. There are readers outside mmu_lock, so it should be atomic.
Re: [PATCH 1/4] KVM: Add APIs for unlocked TLB flush
On Thu, May 03, 2012 at 05:11:01PM +0300, Avi Kivity wrote: On 05/03/2012 04:23 PM, Xiao Guangrong wrote: On 05/03/2012 07:22 PM, Avi Kivity wrote: Currently we flush the TLB while holding mmu_lock. This increases the lock hold time by the IPI round-trip time, increasing contention, and makes dropping the lock (for latency reasons) harder. This patch changes TLB management to be usable locklessly, introducing the following APIs: kvm_mark_tlb_dirty() - mark the TLB as containing stale entries kvm_cond_flush_remote_tlbs() - flush the TLB if it was marked as dirty These APIs can be used without holding mmu_lock (though if the TLB became stale due to shadow page table modifications, typically it will need to be called with the lock held to prevent other threads from seeing the modified page tables with the TLB unmarked and unflushed)/ Signed-off-by: Avi Kivity a...@redhat.com --- Documentation/virtual/kvm/locking.txt | 14 ++ arch/x86/kvm/paging_tmpl.h|4 ++-- include/linux/kvm_host.h | 22 +- virt/kvm/kvm_main.c | 29 - 4 files changed, 57 insertions(+), 12 deletions(-) diff --git a/Documentation/virtual/kvm/locking.txt b/Documentation/virtual/kvm/locking.txt index 3b4cd3b..f6c90479 100644 --- a/Documentation/virtual/kvm/locking.txt +++ b/Documentation/virtual/kvm/locking.txt @@ -23,3 +23,17 @@ Arch: x86 Protects:- kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset} - tsc offset in vmcb Comment: 'raw' because updating the tsc offsets must not be preempted. + +3. TLB control +-- + +The following APIs should be used for TLB control: + + - kvm_mark_tlb_dirty() - indicates that the TLB is out of sync wrt + either guest or host page tables. + - kvm_flush_remote_tlbs() - unconditionally flush the tlbs + - kvm_cond_flush_remote_tlbs() - flush the TLBs if previously marked + +These may be used without mmu_lock, though kvm_mark_tlb_dirty() needs to be +used while holding mmu_lock if it is called due to host page table changes +(contrast to guest page table changes). 
In these patches, it seems that kvm_mark_tlb_dirty is always used under the protection of mmu-lock, yes?

Correct. It's possible we'll find a use outside mmu_lock, but this isn't likely.

If both kvm_mark_tlb_dirty and kvm_cond_flush_remote_tlbs are used out of mmu-lock, I think we can use kvm_flush_remote_tlbs instead. If it is so, dirtied_count/flushed_count need not be atomic.

But we always mark with mmu_lock held.

And, it seems there is a bug:

  VCPU 0                              VCPU 1
  hold mmu-lock
  zap spte which points to 'gfn'
  mark_tlb_dirty
  release mmu-lock
                                      hold mmu-lock
                                      rmap_write_protect: see no spte
                                        pointing to gfn
                                      tlb is not flushed
                                      release mmu-lock

                                      write gfn by guest
                                      OOPS!!!
  kvm_cond_flush_remote_tlbs

We need to call kvm_cond_flush_remote_tlbs in rmap_write_protect unconditionally?

Yes, that's the point. We change

  spin_lock(mmu_lock)
  conditionally flush the tlb
  spin_unlock(mmu_lock)

to

  spin_lock(mmu_lock)
  conditionally mark the tlb as dirty
  spin_unlock(mmu_lock)
  kvm_cond_flush_remote_tlbs()

but that means the entire codebase has to be converted.

Is there any other site which expects sptes and TLBs to be in sync, other than rmap_write_protect? Please convert the flush_remote_tlbs at the end of set_spte (RW->RO) to mark_dirty.

Looks good in general (the patchset is incomplete though). One thing that is annoying is that there is no guarantee of progress for the flushed_count increment (it can, in theory, always race with a mark_dirty). But since that is only a performance, and not a correctness, aspect, it is fine. It has the advantage that it requires code to explicitly document where the TLB must be flushed, and the sites which expect sptes to be in sync with TLBs.

--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] Unlocked TLB flush
On Mon, May 07, 2012 at 10:25:34PM -0300, Marcelo Tosatti wrote:
On Thu, May 03, 2012 at 02:22:58PM +0300, Avi Kivity wrote:

This patchset implements unlocked TLB flushing for KVM. An operation that generates stale TLB entries can mark the TLB as dirty instead of flushing immediately, and then flush after releasing mmu_lock but before returning to the guest or the caller. A few call sites are converted too. Note not all call sites are easily convertible; as an example, sync_page() must flush before reading the guest page table.

Huh? Are you referring to:

  /*
   * Note:
   * We should flush all tlbs if spte is dropped even though guest is
   * responsible for it. Since if we don't, kvm_mmu_notifier_invalidate_page
   * and kvm_mmu_notifier_invalidate_range_start detect the mapping page isn't
   * used by guest then tlbs are not flushed, so guest is allowed to access the
   * freed pages.
   * And we increase kvm->tlbs_dirty to delay tlbs flush in this case.
   */

With an increased dirtied_count the flush can be performed by kvm_mmu_notifier_invalidate_page. Which is what patch 1 does. Your comment regarding sync_page() above is what is outdated, unless I am missing something.
Re: virtio 3.4 patches
On Mon, 7 May 2012 08:30:21 +0300, Michael S. Tsirkin m...@redhat.com wrote: Hi Rusty, please also pick two fixes from for_linus tag on my tree I think they should be sent to Linus for 3.4: http://git.kernel.org/?p=linux/kernel/git/mst/vhost.git;a=tag;h=refs/tags/for_linus git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git refs/tags/for_linus Done! Thanks so much for handling virtio while I was away; knowing it was safely in your hands allowed me to relax without reservation. My wife thanks you, too :) Cheers, Rusty.
Re: [PATCH] virtio_blk: Drop unused request tracking list
On Fri, 30 Mar 2012 11:24:10 +0800, Asias He as...@redhat.com wrote:

Benchmark shows small performance improvement on fusion io device.

Before:
  seq-read : io=1,024MB, bw=19,982KB/s, iops=39,964, runt= 52475msec
  seq-write: io=1,024MB, bw=20,321KB/s, iops=40,641, runt= 51601msec
  rnd-read : io=1,024MB, bw=15,404KB/s, iops=30,808, runt= 68070msec
  rnd-write: io=1,024MB, bw=14,776KB/s, iops=29,552, runt= 70963msec

After:
  seq-read : io=1,024MB, bw=20,343KB/s, iops=40,685, runt= 51546msec
  seq-write: io=1,024MB, bw=20,803KB/s, iops=41,606, runt= 50404msec
  rnd-read : io=1,024MB, bw=16,221KB/s, iops=32,442, runt= 64642msec
  rnd-write: io=1,024MB, bw=15,199KB/s, iops=30,397, runt= 68991msec

Signed-off-by: Asias He as...@redhat.com

Thanks. It didn't really need a benchmark to justify this cleanup, but you certainly get points for thoroughness!

Applied,
Rusty.
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 08:22 PM, Avi Kivity wrote:
On 05/07/2012 05:47 PM, Raghavendra K T wrote:

Not good. Solving a problem in software that is already solved by hardware? It's okay if there are no costs involved, but here we're introducing a new ABI that we'll have to maintain for a long time.

Hmm, agree that being a step ahead of mighty hardware (and just an improvement of 1-3%) is no good for the long term (where PLE is the future).

PLE is the present, not the future. It was introduced on later Nehalems and is present on all Westmeres. Two more processor generations have passed meanwhile. The AMD equivalent was also introduced around that timeframe.

Having said that, it is hard for me to resist saying: the bottleneck is somewhere else on PLE machines, and IMHO the answer would be a combination of paravirt-spinlock + pv-flush-tlb. But I need to come up with good numbers to argue in favour of the claim.

PS: Nikunj had experimented that pv-flush-tlb + paravirt-spinlock is a win on PLE machines where only one of them alone could not prove the benefit.

I'd like to see those numbers, then. Ingo, please hold on the kvm-specific patches, meanwhile.

Hmm. I think I messed up the fact while saying 1-3% improvement on PLE. Going by what I had posted in https://lkml.org/lkml/2012/4/5/73 (with correct calculation):

  1x  70.475  (85.6979)   63.5033 (72.7041)  15.7%
  2x  110.971 (132.829)   105.099 (128.738)   5.56%
  3x  150.265 (184.766)   138.341 (172.69)    8.62%

It was around 12% with the optimization patch posted separately with that (that one needs more experiments though). But anyway, I will come up with results for the current patch series.