Re: Guest floppy regression hits qemu-kvm, qemu is not affected

2011-11-20 Thread Avi Kivity
On 11/18/2011 04:06 PM, Lucas Meneghel Rodrigues wrote:
 Hi guys,

 Today during the last 'sanity' qemu-kvm testing, we've noticed a
 recurring problem: guest OS does not see the floppy, making the
 windows installs time out. This problem has been extensively discussed
 on qemu and qemu is fine here, problem is specific with qemu-kvm.

 http://comments.gmane.org/gmane.comp.emulators.qemu/123928

 So, this is something to go on your radars.

What was the last qemu-kvm commit hash that showed the problem?

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


KVM updates for 3.2-rc2

2011-11-20 Thread Avi Kivity
Linus, please pull from:

  git://git.kernel.org/pub/scm/virt/kvm/kvm.git kvm-updates/3.2

This includes a ppc ABI breakage fix, s390 fixes, a tracing/kvmclock
conflict fix, and the implementation of guest-only/host-only profiling
for Intel.  The latter is not strictly a fix, and missed the merge
window due to a combination of a dependency on tip.git, kernel.org being
offline, and me forgetting all about it.  However it would be good to
have the feature implemented fully instead of just on AMD as it is at
present.

Alexander Graf (1):
  Revert KVM: PPC: Add support for explicit HIOR setting

Avi Kivity (1):
  KVM guest: prevent tracing recursion with kvmclock

Christian Borntraeger (2):
  KVM: s390: Fix tprot locking
  KVM: s390: announce SYNC_MMU

Cornelia Huck (2):
  KVM: s390: Fix RUNNING flag misinterpretation
  KVM: s390: handle SIGP sense running intercepts

Gleb Natapov (3):
  KVM: VMX: add support for switching of PERF_GLOBAL_CTRL
  KVM: VMX: Add support for guest/host-only profiling
  KVM: VMX: Check for automatic switch msr table overflow

 arch/powerpc/include/asm/kvm.h|8 --
 arch/powerpc/include/asm/kvm_book3s.h |2 -
 arch/powerpc/kvm/book3s_pr.c  |   14 +---
 arch/powerpc/kvm/powerpc.c|1 -
 arch/s390/include/asm/kvm_host.h  |3 +-
 arch/s390/kvm/diag.c  |2 +-
 arch/s390/kvm/intercept.c |3 +-
 arch/s390/kvm/interrupt.c |1 +
 arch/s390/kvm/kvm-s390.c  |   12 ++-
 arch/s390/kvm/priv.c  |   10 ++-
 arch/s390/kvm/sigp.c  |   45 +++-
 arch/x86/kernel/kvmclock.c|5 +-
 arch/x86/kvm/vmx.c|  131
++---
 include/linux/kvm.h   |1 -
 14 files changed, 189 insertions(+), 49 deletions(-)

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[GIT PULL] KVM updates for 3.2-rc2

2011-11-20 Thread Avi Kivity
(now with [GIT PULL] for filters)

Linus, please pull from:

  git://git.kernel.org/pub/scm/virt/kvm/kvm.git kvm-updates/3.2

This includes a ppc ABI breakage fix, s390 fixes, a tracing/kvmclock
conflict fix, and the implementation of guest-only/host-only profiling
for Intel.  The latter is not strictly a fix, and missed the merge
window due to a combination of a dependency on tip.git, kernel.org being
offline, and me forgetting all about it.  However it would be good to
have the feature implemented fully instead of just on AMD as it is at
present.

Alexander Graf (1):
  Revert KVM: PPC: Add support for explicit HIOR setting

Avi Kivity (1):
  KVM guest: prevent tracing recursion with kvmclock

Christian Borntraeger (2):
  KVM: s390: Fix tprot locking
  KVM: s390: announce SYNC_MMU

Cornelia Huck (2):
  KVM: s390: Fix RUNNING flag misinterpretation
  KVM: s390: handle SIGP sense running intercepts

Gleb Natapov (3):
  KVM: VMX: add support for switching of PERF_GLOBAL_CTRL
  KVM: VMX: Add support for guest/host-only profiling
  KVM: VMX: Check for automatic switch msr table overflow

 arch/powerpc/include/asm/kvm.h|8 --
 arch/powerpc/include/asm/kvm_book3s.h |2 -
 arch/powerpc/kvm/book3s_pr.c  |   14 +---
 arch/powerpc/kvm/powerpc.c|1 -
 arch/s390/include/asm/kvm_host.h  |3 +-
 arch/s390/kvm/diag.c  |2 +-
 arch/s390/kvm/intercept.c |3 +-
 arch/s390/kvm/interrupt.c |1 +
 arch/s390/kvm/kvm-s390.c  |   12 ++-
 arch/s390/kvm/priv.c  |   10 ++-
 arch/s390/kvm/sigp.c  |   45 +++-
 arch/x86/kernel/kvmclock.c|5 +-
 arch/x86/kvm/vmx.c|  131
++---
 include/linux/kvm.h   |1 -
 14 files changed, 189 insertions(+), 49 deletions(-)

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH v2] ivshmem: add a new PIO BAR3(Doorbell) besides MMIO BAR0 to reduce notification time

2011-11-20 Thread Avi Kivity
On 11/18/2011 07:50 AM, zanghongy...@huawei.com wrote:
 From: Hongyong Zang zanghongy...@huawei.com

 This patch, adds a PIO BAR3 for guest notifying qemu. And we find the new 
 notification way of PIO BAR3 reduces 30% time in comparison with the original 
 MMIO BAR0 way.

Please update the spec, and split the patch into an infrastructure patch
(for long eventfds in the memory API) and an ivshmem patch.

  
  //#define DEBUG_IVSHMEM
  #ifdef DEBUG_IVSHMEM
 @@ -57,8 +58,10 @@ typedef struct IVShmemState {
  CharDriverState **eventfd_chr;
  CharDriverState *server_chr;
  MemoryRegion ivshmem_mmio;
 +MemoryRegion ivshmem_pio;
  
  pcibus_t mmio_addr;
 +pcibus_t pio_addr;
  /* We might need to register the BAR before we actually have the memory.
   * So prepare a container MemoryRegion for the BAR immediately and
   * add a subregion when we have the memory.
 @@ -234,7 +237,7 @@ static uint64_t ivshmem_io_read(void *opaque, 
 target_phys_addr_t addr,
  return ret;
  }
  
 -static const MemoryRegionOps ivshmem_mmio_ops = {
 +static const MemoryRegionOps ivshmem_io_ops = {
  .read = ivshmem_io_read,
  .write = ivshmem_io_write,
  .endianness = DEVICE_NATIVE_ENDIAN,
 @@ -348,6 +351,8 @@ static void close_guest_eventfds(IVShmemState *s, int 
 posn)
  for (i = 0; i  guest_curr_max; i++) {
  kvm_set_ioeventfd_mmio_long(s-peers[posn].eventfds[i],
  s-mmio_addr + DOORBELL, (posn  16) | i, 0);
 +kvm_set_ioeventfd_pio_long(s-peers[posn].eventfds[i],
 +s-pio_addr + DOORBELL, (posn  16) | i, 0);

This really shouldn't be needed - the memory API should take care of it.

  close(s-peers[posn].eventfds[i]);
  }
  
 @@ -367,6 +372,12 @@ static void setup_ioeventfds(IVShmemState *s) {
true,
(i  16) | j,
s-peers[i].eventfds[j]);
 +memory_region_add_eventfd(s-ivshmem_pio,
 +  DOORBELL,
 +  4,
 +  true,
 +  (i  16) | j,
 +  s-peers[i].eventfds[j]);
  }
  }

Where is the memory API support for this?

  }
 @@ -495,6 +506,10 @@ static void ivshmem_read(void *opaque, const uint8_t * 
 buf, int flags)
  (incoming_posn  16) | guest_max_eventfd, 1)  0) {
  fprintf(stderr, ivshmem: ioeventfd not available\n);
  }
 +if (kvm_set_ioeventfd_pio_long(incoming_fd, s-pio_addr + DOORBELL,
 +(incoming_posn  16) | guest_max_eventfd, 1)  0) {
 +fprintf(stderr, ivshmem: ioeventfd not available\n);
 +}
  }

Nor should this be needed.

Please make BAR 3 disappear if started with -M pc-1.0.


-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [RFC v2 PATCH] kvm tools, qcow: Add the support for copy-on-write clusters

2011-11-20 Thread Pekka Enberg
On Sun, 2011-11-20 at 14:14 +0800, Lan, Tianyu wrote:
  OK. Thx.
  But fsync is too slow. I try to find a way to sync a range of file. 
  Are there any solutions to meet my purpose?

On Sun, 2011-11-20 at 08:23 +0200, Sasha Levin wrote:
 fdatasync() is as good as it'll get.
 
 tbh, maybe we should just consider opening QCOW images with O_SYNC and
 just get it over with?

No, lets not do that. It's easier to improve the performance of correct
code that doesn't use O_SYNC.

Pekka

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 3/6] KVM: introduce kvm_for_each_memslot macro

2011-11-20 Thread Avi Kivity
On 11/18/2011 11:18 AM, Xiao Guangrong wrote:
 index bb8728e..10524c0 100644
 --- a/include/linux/kvm_host.h
 +++ b/include/linux/kvm_host.h
 @@ -307,6 +307,10 @@ static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm 
 *kvm, int i)
(vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
idx++)

 +#define kvm_for_each_memslot(slots, memslot, i)  \
 + for (i = 0; i  (slots)-nmemslots\
 +   ({ memslot = (slots)-memslots[i]; 1; }); i++)
 +


Statement expression not needed, you can use the comma operator:

  i  (slots)-nmemslots  (memslot = @(slots)-memslots[i], true)

or even

  memslot = (slots)-memslots[i], i  (slots)-nmemslots

or just kill i and make memslot the loop variable.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 5/6] KVM: sort memslots by its size and use line search

2011-11-20 Thread Avi Kivity
On 11/18/2011 11:19 AM, Xiao Guangrong wrote:
 Sort memslots base on its size and use line search to find it, so the larger
 memslots have better fit

 The idea is from Avi

 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 ---
  include/linux/kvm_host.h |   22 +---
  virt/kvm/kvm_main.c  |   82 -
  2 files changed, 75 insertions(+), 29 deletions(-)

 diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
 index a0e4d63..83b396a 100644
 --- a/include/linux/kvm_host.h
 +++ b/include/linux/kvm_host.h
 @@ -230,8 +230,12 @@ struct kvm_irq_routing_table {};
  #define KVM_MEM_SLOTS_NUM (KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS)
  #endif

 +/*
 + * Note:
 + * memslots are not sorted by id anymore, please use id_to_memslot()
 + * to get the memslot by its id.
 + */
  struct kvm_memslots {
 - int nmemslots;
   u64 generation;
   struct kvm_memory_slot memslots[KVM_MEM_SLOTS_NUM];
  };
 @@ -307,9 +311,10 @@ static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm 
 *kvm, int i)
(vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
idx++)

 -#define kvm_for_each_memslot(slots, memslot, i)  \
 - for (i = 0; i  (slots)-nmemslots\
 -   ({ memslot = (slots)-memslots[i]; 1; }); i++)
 +#define kvm_for_each_memslot(slots, memslot, i)  \
 + for (i = 0; i  KVM_MEM_SLOTS_NUM \
 +   ({ memslot = (slots)-memslots[i]; 1; })   \
 +   memslot-npages != 0; i++)

You might allocate an always-empty memslot at the end and simplify the
termination condition.


  int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id);
  void kvm_vcpu_uninit(struct kvm_vcpu *vcpu);
 @@ -335,7 +340,14 @@ static inline struct kvm_memslots *kvm_memslots(struct 
 kvm *kvm)
  static inline struct kvm_memory_slot *
  id_to_memslot(struct kvm_memslots *slots, int id)
  {
 - return slots-memslots[id];
 + int i;
 +
 + for (i = 0; i  KVM_MEM_SLOTS_NUM; i++)
 + if (slots-memslots[i].id == id)
 + return slots-memslots[i];
 +

Is that in any hot path?  we could make an array for doing this translation.

Alex wants to increase the memslot count, so this could be a long loop.


-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 5/6] KVM: sort memslots by its size and use line search

2011-11-20 Thread Avi Kivity
On 11/20/2011 01:26 PM, Avi Kivity wrote:
 
   int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id);
   void kvm_vcpu_uninit(struct kvm_vcpu *vcpu);
  @@ -335,7 +340,14 @@ static inline struct kvm_memslots *kvm_memslots(struct 
  kvm *kvm)
   static inline struct kvm_memory_slot *
   id_to_memslot(struct kvm_memslots *slots, int id)
   {
  -   return slots-memslots[id];
  +   int i;
  +
  +   for (i = 0; i  KVM_MEM_SLOTS_NUM; i++)
  +   if (slots-memslots[i].id == id)
  +   return slots-memslots[i];
  +

 Is that in any hot path?  we could make an array for doing this translation.

Never mind, I see patch 6.  I should have known you wouldn't leave it
like that.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/6] KVM: optimize memslots searching

2011-11-20 Thread Avi Kivity
On 11/18/2011 11:16 AM, Xiao Guangrong wrote:
 This is the more work base on my v1 patchset which is posted some months ago,
 it can be found at:
 https://lkml.org/lkml/2011/2/22/68

 Change log:
 - sort memslots base on its size and do the line search instead of binary
   search base on gfn, it is from Avi's idea.
 - in order to reduce cache footprint, memslots are sorted in the array of
   kvm-memslots-memslots[] and introduce a table to map slot id to index in
   the array

 There is the performance result:

 autotest for RHEL.6.1 setup/boot/reboot/shutdown(average):
 ept=1:before: 449.5   after: 447.8
 ept=0:before: 532.7   after: 529.8

 kernbench(average):
 ept=1:before: 127.94  after: 126.98
 ept=0:before: 196.85  after: 189.66

Looks good, had a couple of comments but they're really minor.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 3/4] [PATCH] kvm: Fix tprot locking

2011-11-20 Thread Avi Kivity
On 11/17/2011 01:15 PM, Martin Schwidefsky wrote:
 On Thu, 17 Nov 2011 12:27:41 +0200
 Avi Kivity a...@redhat.com wrote:

  On 11/17/2011 12:00 PM, Carsten Otte wrote:
   From: Christian Borntraeger borntrae...@de.ibm.com 
  
   There is a potential host deadlock in the tprot intercept handling.
   We must not hold the mmap semaphore while resolving the guest
   address. If userspace is remapping, then the memory detection in
   the guest is broken anyway so we can safely separate the 
   address translation from walking the vmas.
  
   Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com 
   Signed-off-by: Carsten Otte co...@de.ibm.com
   ---
  
arch/s390/kvm/priv.c |   10 --
1 file changed, 8 insertions(+), 2 deletions(-)
  
   diff -urpN linux-2.6/arch/s390/kvm/priv.c 
   linux-2.6-patched/arch/s390/kvm/priv.c
   --- linux-2.6/arch/s390/kvm/priv.c2011-10-24 09:10:05.0 
   +0200
   +++ linux-2.6-patched/arch/s390/kvm/priv.c2011-11-17 
   10:03:53.0 +0100
   @@ -336,6 +336,7 @@ static int handle_tprot(struct kvm_vcpu
 u64 address1 = disp1 + base1 ? vcpu-arch.guest_gprs[base1] : 0;
 u64 address2 = disp2 + base2 ? vcpu-arch.guest_gprs[base2] : 0;
 struct vm_area_struct *vma;
   + unsigned long user_address;

 vcpu-stat.instruction_tprot++;

   @@ -349,9 +350,14 @@ static int handle_tprot(struct kvm_vcpu
 return -EOPNOTSUPP;


   + /* we must resolve the address without holding the mmap semaphore.
   +  * This is ok since the userspace hypervisor is not supposed to change
   +  * the mapping while the guest queries the memory. Otherwise the guest
   +  * might crash or get wrong info anyway. */
   + user_address = (unsigned long) __guestaddr_to_user(vcpu, address1);
   +
 down_read(current-mm-mmap_sem);
   - vma = find_vma(current-mm,
   - (unsigned long) __guestaddr_to_user(vcpu, address1));
   + vma = find_vma(current-mm, user_address);
 if (!vma) {
 up_read(current-mm-mmap_sem);
 return kvm_s390_inject_program_int(vcpu, PGM_ADDRESSING);
  
  
  Unrelated to the patch, but I'm curious: it looks like __gmap_fault()
  dereferences the guest page table?  How can it assume that it is mapped?

 The gmap code does not assume that the code is mapped. If the individual
 MB has not been mapped in the guest address space or the target memory
 is not mapped in the process address space __gmap_fault() returns -EFAULT. 

  I'm probably misreading the code.

I did misread it - I assumed it was guest page table, whereas those are
host page tables mapping guest memory (here, guest page table mean
guest-managed virt to phys translation).


-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 3/4] [PATCH] kvm: Fix tprot locking

2011-11-20 Thread Avi Kivity
On 11/17/2011 01:32 PM, Martin Schwidefsky wrote:
 On Thu, 17 Nov 2011 12:15:52 +0100
 Martin Schwidefsky schwidef...@de.ibm.com wrote:

  On Thu, 17 Nov 2011 12:27:41 +0200
  Avi Kivity a...@redhat.com wrote:
  
   On 11/17/2011 12:00 PM, Carsten Otte wrote:
From: Christian Borntraeger borntrae...@de.ibm.com 
   
There is a potential host deadlock in the tprot intercept handling.
We must not hold the mmap semaphore while resolving the guest
address. If userspace is remapping, then the memory detection in
the guest is broken anyway so we can safely separate the 
address translation from walking the vmas.
   
Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com 
Signed-off-by: Carsten Otte co...@de.ibm.com
---
   
 arch/s390/kvm/priv.c |   10 --
 1 file changed, 8 insertions(+), 2 deletions(-)
   
diff -urpN linux-2.6/arch/s390/kvm/priv.c 
linux-2.6-patched/arch/s390/kvm/priv.c
--- linux-2.6/arch/s390/kvm/priv.c  2011-10-24 09:10:05.0 
+0200
+++ linux-2.6-patched/arch/s390/kvm/priv.c  2011-11-17 
10:03:53.0 +0100
@@ -336,6 +336,7 @@ static int handle_tprot(struct kvm_vcpu
u64 address1 = disp1 + base1 ? vcpu-arch.guest_gprs[base1] : 0;
u64 address2 = disp2 + base2 ? vcpu-arch.guest_gprs[base2] : 0;
struct vm_area_struct *vma;
+   unsigned long user_address;
 
vcpu-stat.instruction_tprot++;
 
@@ -349,9 +350,14 @@ static int handle_tprot(struct kvm_vcpu
return -EOPNOTSUPP;
 
 
+   /* we must resolve the address without holding the mmap 
semaphore.
+* This is ok since the userspace hypervisor is not supposed to 
change
+* the mapping while the guest queries the memory. Otherwise 
the guest
+* might crash or get wrong info anyway. */
+   user_address = (unsigned long) __guestaddr_to_user(vcpu, 
address1);
+
down_read(current-mm-mmap_sem);
-   vma = find_vma(current-mm,
-   (unsigned long) __guestaddr_to_user(vcpu, 
address1));
+   vma = find_vma(current-mm, user_address);
if (!vma) {
up_read(current-mm-mmap_sem);
return kvm_s390_inject_program_int(vcpu, 
PGM_ADDRESSING);
   
   
   Unrelated to the patch, but I'm curious: it looks like __gmap_fault()
   dereferences the guest page table?  How can it assume that it is mapped?
  
  The gmap code does not assume that the code is mapped. If the individual
  MB has not been mapped in the guest address space or the target memory
  is not mapped in the process address space __gmap_fault() returns -EFAULT. 
  
   I'm probably misreading the code.
   
   A little closer to the patch, x86 handles the same issue by calling
   get_user_pages_fast().  This should be more scalable than bouncing
   mmap_sem, something to consider.
  
  I don't think that the frequency of asynchronous page faults will make
  it necessary to use get_user_pages_fast(). We are talking about the
  case where I/O is necessary to provide the page that the guest accessed.
  
  The advantage of the way s390 does things is that after __gmap_fault
  translated the guest address to a user space address we can just do a
  standard page fault for the user space process. Only if that requires
  I/O we go the long way. Makes sense?

 Hmm, Carsten just made me aware that your question is not about pfault,
 it is about the standard case of a guest fault.

 For normal guest faults we use a cool trick that the s390 hardware
 allows us. We have the paging table for the kvm process and we have the
 guest page table for execution in the virtualized context. The trick is
 that the guest page table reuses the lowest level of the process page
 table. A fault that sets a pte in the process page table will
 automatically make that pte visible in the guest page table as well
 if the memory region has been mapped in the higher order page tables.
 Even the invalidation of a pte will automatically (!!) remove the
 referenced page from the guest page table as well, including the TLB
 entries on all cpus. The IPTE instruction is your friend :-)
 That is why we don't need mm notifiers.

Yes, that explains it perfectly.  I congratulate you on having such
friendly hardware...

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/6] KVM: optimize memslots searching

2011-11-20 Thread Avi Kivity
On 11/20/2011 01:29 PM, Avi Kivity wrote:
 On 11/18/2011 11:16 AM, Xiao Guangrong wrote:
  This is the more work base on my v1 patchset which is posted some months 
  ago,
  it can be found at:
  https://lkml.org/lkml/2011/2/22/68
 
  Change log:
  - sort memslots base on its size and do the line search instead of binary
search base on gfn, it is from Avi's idea.
  - in order to reduce cache footprint, memslots are sorted in the array of
kvm-memslots-memslots[] and introduce a table to map slot id to index in
the array
 
  There is the performance result:
 
  autotest for RHEL.6.1 setup/boot/reboot/shutdown(average):
  ept=1:  before: 449.5   after: 447.8
  ept=0:  before: 532.7   after: 529.8
 
  kernbench(average):
  ept=1:  before: 127.94  after: 126.98
  ept=0:  before: 196.85  after: 189.66

 Looks good, had a couple of comments but they're really minor.


btw, this patchset touches a lot of common code.  Did you crossbuild for
other kvm archs?

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


KVM device assignment and user privileges

2011-11-20 Thread Sasha Levin
Hi all,

I've been working on adding device assignment to KVM tools, and started
with the basics of just getting a device assigned using the
KVM_ASSIGN_PCI_DEVICE ioctl.

What I've figured is that unprivileged users can request any PCI device
to be assigned to him, including devices which he shouldn't be touching.

In my case, it happened with the VGA card, where an unprivileged user
simply called KVM_ASSIGN_PCI_DEVICE with the bus, seg and fn of the VGA
card and caused the display on the host to go apeshit.

Was it supposed to work this way? I couldn't find any security checks in
the code paths of KVM_ASSIGN_PCI_DEVICE and it looks like any user can
invoke it with any parameters he'd want - enabling him to kill the host.

-- 

Sasha.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM device assignment and user privileges

2011-11-20 Thread Avi Kivity
On 11/20/2011 04:58 PM, Sasha Levin wrote:
 Hi all,

 I've been working on adding device assignment to KVM tools, and started
 with the basics of just getting a device assigned using the
 KVM_ASSIGN_PCI_DEVICE ioctl.

 What I've figured is that unprivileged users can request any PCI device
 to be assigned to him, including devices which he shouldn't be touching.

 In my case, it happened with the VGA card, where an unprivileged user
 simply called KVM_ASSIGN_PCI_DEVICE with the bus, seg and fn of the VGA
 card and caused the display on the host to go apeshit.

 Was it supposed to work this way? 

No, of course not.

 I couldn't find any security checks in
 the code paths of KVM_ASSIGN_PCI_DEVICE and it looks like any user can
 invoke it with any parameters he'd want - enabling him to kill the host.

Alex, Chris?



-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode

2011-11-20 Thread Roopa Prabhu



On 11/17/11 4:15 PM, Ben Hutchings bhutchi...@solarflare.com wrote:

 Sorry to come to this rather late.
 
 On Tue, 2011-11-08 at 23:55 -0800, Roopa Prabhu wrote:
 [...]
 v2 - v3
 - Moved set and get filter ops from rtnl_link_ops to netdev_ops
 - Support for SRIOV VFs.
 [Note: The get filters msg (in the way current get rtnetlink handles
 it) might get too big for SRIOV vfs. This patch follows existing
 sriov 
 vf get code and tries to accomodate filters for all VF's in a PF.
 And for the SRIOV case I have only tested the fact that the VF
 arguments are getting delivered to rtnetlink correctly. The code
 follows existing sriov vf handling code so rest of it should work
 fine]
 [...]
 
 This is already broken for large numbers of VFs, and increasing the
 amount of information per VF is going to make the situation worse.  I am
 no netlink expert but I think that the current approach of bundling all
 information about an interface in a single message may not be
 sustainable.

Yes agreed. I have the same concern.

 
 Also, I'm unclear on why this interface is to be used to set filtering
 for the (PF) net device as well as for related VFs.  Doesn't that
 duplicate the functionality of ndo_set_rx_mode and
 ndo_vlan_rx_{add,kill}_vid?
 
Yes..I have thought about this. But the reason the final version is the way
it is because its trying to accommodate sriov and non sriov cases because I
was just trying to make the netlink interface available to as many use cases
that might need it.

I just wanted to bring up the original intent of this patch.
Which was to add support for TUNSETTXFILTER to macvtap so that it can do
filtering instead of putting the lower dev (physical dev) in promiscuous
mode (This part really does not care if the lowerdev is an SRIOV VF or not).
And the focus was on macvlan passthru mode because it is the simplest case
to solve (you have to just pass the filters to lowerdev device/driver).
Now this may seem like It can be done with existing set_rx_mode/add_vlan_id
etc (which are essentially the mechanisms I am using in the macvlan driver
to send the filters to lowerdev for passthru mode), but it might not be the
case for other macvlan modes. Macvlan device might need to maintain and do a
filter lookup like the tap driver does today. And the only reason SRIOV came
up in the original patch was because PASSTHRU mode of macvlan was added for
SRIOV use case, though it really does not care if the lowerdev is an SRIOV
VF or not.


Instead of implementing TUNSETTXFILTER, michael had suggested netlink
interface instead. 
When implementing the netlink interface, I did go back and forth in deciding
whether this should be on every netdev or only virtual devices that support
rtnl link ops. And the existence of set_rx_mode and add_vlan_id netdev ops
Was the reason for confusion. So the next version implemented it as rtnl
link ops because all I really want is a mechanism like TUNSETTXFILTER which
can set/get filters for virtual devices that need to do filtering by
themselves. But restricting this interface for only virtual devices did not
make great sense so when greg pointed it out that he will need it for VF
netdevs, I was happy to move it to netdev ops.

And the only reason this patch works on both PF and VF if the final version
is because, its trying to accommodate both SRIOV and non-SRIOV devices.
So by saying PF and VF, for me it really means SRIOV VF and any other netdev
devices. So I intentionally did not put PF or VF in the name of the op.

Thanks,
Roopa

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv4 0/3] acpi: DSDT/SSDT runtime patching

2011-11-20 Thread Michael S. Tsirkin
Here's an updated revision of acpi runtime patching patchset.
Lightly tested.

changes in v4:
- split PCI hotplug handling out to a separate SSDT

Changes in v3:
- change ssdt generation code to get rid of hardcoded offsets
- enhancements to acpi_extract: add more extract methods
ACPI_EXTRACT_NAME_WORD_CONST - extract a Word Const object from 
Name()
ACPI_EXTRACT_NAME_BYTE_CONST - extract a Byte Const object from 
Name()
ACPI_EXTRACT_PROCESSOR_START - start of Processor() block
ACPI_EXTRACT_PROCESSOR_STRING - extract a NameString from 
Processor()
ACPI_EXTRACT_PROCESSOR_END - offset at last byte of Processor() 
+ 1

Changes in v2:
- tools rewritten in python
- Original ASL retains _EJ0 methods, BIOS patches that to EJ0_
- generic ACP_EXTRACT infrastructure that can match Method
  and Name Operators
- instead of matching specific method name, insert tags
  in original DSL source and match that to AML

-

Here's a bug: guest thinks it can eject VGA device and ISA bridge.

[root@dhcp74-172 ~]#lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 PCI bridge: Red Hat, Inc. Device 0001
00:04.0 Ethernet controller: Qumranet, Inc. Virtio network device
00:05.0 SCSI storage controller: Qumranet, Inc. Virtio block device
01:00.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet 
Controller (rev 03)

[root@dhcp74-172 ~]# ls /sys/bus/pci/slots/1/
adapter  address  attention  latch  module  power
[root@dhcp74-172 ~]# ls /sys/bus/pci/slots/2/
adapter  address  attention  latch  module  power

[root@dhcp74-172 ~]# echo 0  /sys/bus/pci/slots/2/power 
[root@dhcp74-172 ~]# lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:03.0 PCI bridge: Red Hat, Inc. Device 0001
00:04.0 Ethernet controller: Qumranet, Inc. Virtio network device
00:05.0 SCSI storage controller: Qumranet, Inc. Virtio block device
01:00.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet 
Controller (rev 03)

This is wrong because slots 1 and 2 are marked as not hotpluggable
in qemu.

The reason is that our acpi tables declare both _RMV with value 0,
and _EJ0 method for these slots. What happens in this case
is undocumented by ACPI spec, so linux ignores _RMV,
and windows seems to ignore _EJ0.

The correct way to suppress hotplug is not to have _EJ0,
so this is what this patch does: it probes PIIX and
modifies DSDT to match.

With these patches applied, we get:

[root@dhcp74-172 ~]# ls /sys/bus/pci/slots/1/
address
[root@dhcp74-172 ~]# ls /sys/bus/pci/slots/2/
address





Michael S. Tsirkin (3):
  acpi: add ssdt for pci hotplug
  acpi: EJ0 method name patching
  acpi: remove _RMV

 Makefile   |2 +-
 src/acpi-dsdt.dsl  |  132 +--
 src/acpi.c |   40 
 src/ssdt-pcihp.dsl |   98 ++
 4 files changed, 142 insertions(+), 130 deletions(-)
 create mode 100644 src/ssdt-pcihp.dsl

-- 
1.7.5.53.gc233e
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv4 1/3] acpi: add ssdt for pci hotplug

2011-11-20 Thread Michael S. Tsirkin
The point of this split is to make runtime patching easier.

DSDT is required to supply: PCI0 - PCI root device object;
PCEJ - Method object to eject a PCI slot.
Additionally, SSDT is required to supply PCNT - Method object to notify
OSPM of a PCI slot event.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 Makefile   |2 +-
 src/acpi-dsdt.dsl  |  117 +
 src/acpi.c |   12 +
 src/ssdt-pcihp.dsl |  132 
 4 files changed, 148 insertions(+), 115 deletions(-)
 create mode 100644 src/ssdt-pcihp.dsl

diff --git a/Makefile b/Makefile
index 540f1ea..ef8e15d 100644
--- a/Makefile
+++ b/Makefile
@@ -200,7 +200,7 @@ src/%.hex: src/%.dsl ./tools/acpi_extract_preprocess.py 
./tools/acpi_extract.py
$(Q)./tools/acpi_extract.py $(OUT)$*.lst  $(OUT)$*.off
$(Q)cat $(OUT)$*.off  $@
 
-$(OUT)ccode32flat.o: src/acpi-dsdt.hex src/ssdt-proc.hex
+$(OUT)ccode32flat.o: src/acpi-dsdt.hex src/ssdt-proc.hex src/ssdt-pcihp.hex
 
 ### Kconfig rules
 export HOSTCC := $(CC)
diff --git a/src/acpi-dsdt.dsl b/src/acpi-dsdt.dsl
index b9b06f2..f97d463 100644
--- a/src/acpi-dsdt.dsl
+++ b/src/acpi-dsdt.dsl
@@ -474,90 +474,15 @@ DefinitionBlock (
 Return (0x0)
 }
 
-#define gen_pci_device(slot)\
-Device(SL##slot) {  \
-Name (_ADR, 0x##slot##) \
-Method (_RMV) { Return (PRMV(0x##slot)) }   \
-Name (_SUN, 0x##slot)   \
-}
-
-/* VGA (slot 1) and ISA bus (slot 2) defined above */
-gen_pci_device(03)
-gen_pci_device(04)
-gen_pci_device(05)
-gen_pci_device(06)
-gen_pci_device(07)
-gen_pci_device(08)
-gen_pci_device(09)
-gen_pci_device(0a)
-gen_pci_device(0b)
-gen_pci_device(0c)
-gen_pci_device(0d)
-gen_pci_device(0e)
-gen_pci_device(0f)
-gen_pci_device(10)
-gen_pci_device(11)
-gen_pci_device(12)
-gen_pci_device(13)
-gen_pci_device(14)
-gen_pci_device(15)
-gen_pci_device(16)
-gen_pci_device(17)
-gen_pci_device(18)
-gen_pci_device(19)
-gen_pci_device(1a)
-gen_pci_device(1b)
-gen_pci_device(1c)
-gen_pci_device(1d)
-gen_pci_device(1e)
-gen_pci_device(1f)
-
-/* Methods called by bulk generated hotplug devices below */
+/* Methods called by hotplug devices */
 Method (PCEJ, 1, NotSerialized) {
 // _EJ0 method - eject callback
 Store(ShiftLeft(1, Arg0), B0EJ)
 Return (0x0)
 }
 
-/* Bulk generated PCI hotplug devices */
-#define hotplug_slot(slot)  \
-Device (S##slot) {  \
-   Name (_ADR, 0x##slot##)  \
-   Method (_EJ0, 1) { Return(PCEJ(0x##slot)) }  \
-   Name (_SUN, 0x##slot)\
-}
-
-hotplug_slot(01)
-hotplug_slot(02)
-hotplug_slot(03)
-hotplug_slot(04)
-hotplug_slot(05)
-hotplug_slot(06)
-hotplug_slot(07)
-hotplug_slot(08)
-hotplug_slot(09)
-hotplug_slot(0a)
-hotplug_slot(0b)
-hotplug_slot(0c)
-hotplug_slot(0d)
-hotplug_slot(0e)
-hotplug_slot(0f)
-hotplug_slot(10)
-hotplug_slot(11)
-hotplug_slot(12)
-hotplug_slot(13)
-hotplug_slot(14)
-hotplug_slot(15)
-hotplug_slot(16)
-hotplug_slot(17)
-hotplug_slot(18)
-hotplug_slot(19)
-hotplug_slot(1a)
-hotplug_slot(1b)
-hotplug_slot(1c)
-hotplug_slot(1d)
-hotplug_slot(1e)
-hotplug_slot(1f)
+   /* Hotplug notification method supplied by SSDT */
+   External (\_SB.PCI0.PCNT, MethodObj)
 
 /* PCI hotplug notify method */
 Method(PCNF, 0) {
@@ -575,42 +500,6 @@ DefinitionBlock (
 Return(One)
 }
 
-#define gen_pci_hotplug(slot)   \
-If (LEqual(Arg0, 0x##slot)) { Notify(S##slot, Arg1) }
-
-Method(PCNT, 2) {
-gen_pci_hotplug(01)
-gen_pci_hotplug(02)
-gen_pci_hotplug(03)
-gen_pci_hotplug(04)
-gen_pci_hotplug(05)
-gen_pci_hotplug(06)
-gen_pci_hotplug(07)
-gen_pci_hotplug(08)
-gen_pci_hotplug(09)
-gen_pci_hotplug(0a)
-gen_pci_hotplug(0b)
-gen_pci_hotplug(0c)
-gen_pci_hotplug(0d)
-gen_pci_hotplug(0e)
-gen_pci_hotplug(0f)
-gen_pci_hotplug(10)
-gen_pci_hotplug(11)
-gen_pci_hotplug(12)
-

[PATCHv4 2/3] acpi: EJ0 method name patching

2011-11-20 Thread Michael S. Tsirkin
Modify ACPI to only supply _EJ0 methods for PCI slots that support hotplug.

This is done by runtime patching:
- Instrument SSDT ASL code with ACPI_EXTRACT directives
  tagging _EJ0 and _ADR fields.
- At compile time, tools/acpi_extract.py looks for these methods
  in ASL source finds the matching AML, and stores the offsets
  of these methods in tables named aml_ej0_name and aml_adr_dword.
- At run time, go over aml_ej0_name, use aml_adr_dword
  to get slot information and check which slots support hotplug.

  If hotplug is disabled, we patch the _EJ0 NameString in ACPI table,
  replacing _EJ0 with EJ0_.

  Note that this has the same checksum, but is ignored by OSPM.

Note: the method used is robust in that we don't need
to change any offsets manually in case of ASL code changes.
As all parsing is done at compile time, any unexpected input causes
build failure, not a runtime failure.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 src/acpi.c |   44 
 src/ssdt-pcihp.dsl |6 ++
 2 files changed, 42 insertions(+), 8 deletions(-)

diff --git a/src/acpi.c b/src/acpi.c
index c5147d9..107469f 100644
--- a/src/acpi.c
+++ b/src/acpi.c
@@ -484,12 +484,42 @@ build_ssdt(void)
 return ssdt;
 }
 
-static void*
-copy_table(void *table, int size)
+#include ssdt-pcihp.hex
+
+#define PCI_RMV_BASE 0xae0c
+
+extern void link_time_assertion(void);
+
+static void* build_pcihp(void)
 {
-u8 *copy = malloc_high(size);
-memcpy(copy, table, size);
-return copy;
+u32 rmvc_pcrm;
+int i;
+
+u8 *ssdt = malloc_high(sizeof ssdp_pcihp_aml);
+memcpy(ssdt, ssdp_pcihp_aml, sizeof ssdp_pcihp_aml);
+
+/* Runtime patching of EJ0: to disable hotplug for a slot,
+ * replace the method name: _EJ0 by EJ0_. */
+if (ARRAY_SIZE(aml_ej0_name) != ARRAY_SIZE(aml_adr_dword)) {
+link_time_assertion();
+}
+
+rmvc_pcrm = inl(PCI_RMV_BASE);
+for (i = 0; i  ARRAY_SIZE(aml_ej0_name); ++i) {
+/* Slot is in byte 2 in _ADR */
+u8 slot = ssdp_pcihp_aml[aml_adr_dword[i] + 2]  0x1F;
+   /* Sanity check */
+if (memcmp(ssdp_pcihp_aml + aml_ej0_name[i], _EJ0, 4)) {
+warn_internalerror();
+free(ssdt);
+return NULL;
+}
+if (!(rmvc_pcrm  (0x1  slot))) {
+memcpy(ssdt + aml_ej0_name[i], EJ0_, 4);
+}
+}
+
+return ssdt;
 }
 
 #define HPET_SIGNATURE 0x54455048 // HPET
@@ -642,8 +672,6 @@ static const struct pci_device_id acpi_find_tbl[] = {
 
 struct rsdp_descriptor *RsdpAddr;
 
-#include ssdt-pcihp.hex
-
 #define MAX_ACPI_TABLES 20
 void
 acpi_bios_init(void)
@@ -675,7 +703,7 @@ acpi_bios_init(void)
 ACPI_INIT_TABLE(build_madt());
 ACPI_INIT_TABLE(build_hpet());
 ACPI_INIT_TABLE(build_srat());
-ACPI_INIT_TABLE(copy_table(ssdp_pcihp_aml, sizeof ssdp_pcihp_aml));
+ACPI_INIT_TABLE(build_pcihp());
 
 u16 i, external_tables = qemu_cfg_acpi_additional_tables();
 
diff --git a/src/ssdt-pcihp.dsl b/src/ssdt-pcihp.dsl
index 72d1bb7..cc96ffc 100644
--- a/src/ssdt-pcihp.dsl
+++ b/src/ssdt-pcihp.dsl
@@ -53,9 +53,15 @@ DefinitionBlock (ssdt-pcihp.aml, SSDT, 0x01, BXPC, 
BXSSDTPCIHP, 0x1)
 gen_pci_device(1f)
 
 /* Bulk generated PCI hotplug devices */
+// Method _EJ0 can be patched by BIOS to EJ0_
+// at runtime, if the slot is detected to not support hotplug.
+// Extract the offset of the address dword and the
+// _EJ0 name to allow this patching.
 #define hotplug_slot(slot)  \
 Device (S##slot) {  \
+   ACPI_EXTRACT_NAME_DWORD_CONST aml_adr_dword  \
Name (_ADR, 0x##slot##)  \
+   ACPI_EXTRACT_METHOD_STRING aml_ej0_name  \
Method (_EJ0, 1) { Return(PCEJ(0x##slot)) }  \
Name (_SUN, 0x##slot)\
 }
-- 
1.7.5.53.gc233e

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv4 3/3] acpi: remove _RMV

2011-11-20 Thread Michael S. Tsirkin
The macro gen_pci_device is used to add _RMV
method to a slot device so it is no longer needed:
presence of _EJ0 now indicates that the slot is ejectable.
It is also placing two devices with the same _ADR
on the same bus, which isn't defined by the ACPI spec.
So let's remove it.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 src/acpi-dsdt.dsl  |   15 ---
 src/ssdt-pcihp.dsl |   40 
 2 files changed, 0 insertions(+), 55 deletions(-)

diff --git a/src/acpi-dsdt.dsl b/src/acpi-dsdt.dsl
index f97d463..aff3f48 100644
--- a/src/acpi-dsdt.dsl
+++ b/src/acpi-dsdt.dsl
@@ -132,12 +132,6 @@ DefinitionBlock (
 B0EJ, 32,
 }
 
-OperationRegion(RMVC, SystemIO, 0xae0c, 0x04)
-Field(RMVC, DWordAcc, NoLock, WriteAsZeros)
-{
-PCRM, 32,
-}
-
 Name (_CRS, ResourceTemplate ()
 {
 WordBusNumber (ResourceProducer, MinFixed, MaxFixed, PosDecode,
@@ -239,7 +233,6 @@ DefinitionBlock (
  Return (0x00)
  }
  }
- Method(_RMV) { Return (0x00) }
 }
 }
 
@@ -251,7 +244,6 @@ DefinitionBlock (
 Scope(\_SB.PCI0) {
 Device (ISA) {
 Name (_ADR, 0x0001)
-Method(_RMV) { Return (0x00) }
 
 /* PIIX PCI to ISA irq remapping */
 OperationRegion (P40C, PCI_Config, 0x60, 0x04)
@@ -466,13 +458,6 @@ DefinitionBlock (
 
 Scope(\_SB.PCI0) {
 /* Methods called by bulk generated PCI devices below */
-Method (PRMV, 1, NotSerialized) {
-// _RMV method - check if device can be removed
-If (And(\_SB.PCI0.PCRM, ShiftLeft(1, Arg0))) {
-Return (0x1)
-}
-Return (0x0)
-}
 
 /* Methods called by hotplug devices */
 Method (PCEJ, 1, NotSerialized) {
diff --git a/src/ssdt-pcihp.dsl b/src/ssdt-pcihp.dsl
index cc96ffc..4b435b8 100644
--- a/src/ssdt-pcihp.dsl
+++ b/src/ssdt-pcihp.dsl
@@ -9,49 +9,9 @@ DefinitionBlock (ssdt-pcihp.aml, SSDT, 0x01, BXPC, 
BXSSDTPCIHP, 0x1)
 
 /* Objects supplied by DSDT */
 External (\_SB.PCI0, DeviceObj)
-External (\_SB.PCI0.PRMV, MethodObj)
 External (\_SB.PCI0.PCEJ, MethodObj)
 
 Scope(\_SB.PCI0) {
-
-#define gen_pci_device(slot)\
-Device(SL##slot) {  \
-Name (_ADR, 0x##slot##) \
-Method (_RMV) { Return (PRMV(0x##slot)) }   \
-Name (_SUN, 0x##slot)   \
-}
-
-/* VGA (slot 1) and ISA bus (slot 2) defined in DSDT */
-gen_pci_device(03)
-gen_pci_device(04)
-gen_pci_device(05)
-gen_pci_device(06)
-gen_pci_device(07)
-gen_pci_device(08)
-gen_pci_device(09)
-gen_pci_device(0a)
-gen_pci_device(0b)
-gen_pci_device(0c)
-gen_pci_device(0d)
-gen_pci_device(0e)
-gen_pci_device(0f)
-gen_pci_device(10)
-gen_pci_device(11)
-gen_pci_device(12)
-gen_pci_device(13)
-gen_pci_device(14)
-gen_pci_device(15)
-gen_pci_device(16)
-gen_pci_device(17)
-gen_pci_device(18)
-gen_pci_device(19)
-gen_pci_device(1a)
-gen_pci_device(1b)
-gen_pci_device(1c)
-gen_pci_device(1d)
-gen_pci_device(1e)
-gen_pci_device(1f)
-
 /* Bulk generated PCI hotplug devices */
 // Method _EJ0 can be patched by BIOS to EJ0_
 // at runtime, if the slot is detected to not support hotplug.
-- 
1.7.5.53.gc233e
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Failure to compile qemu-kvm - In file included from /home/ykaul/qemu-kvm/hw/pci.c:36:0: ./qmp-commands.h:3:1: error: expected identifier or ‘(’ before ‘{’ token

2011-11-20 Thread Yaniv Kaul
Latest git hash  1d1c8a498b7ce5c5636f1014f7ad18aa4e1acc0a (though I 
think it happened before that).

./configure command line:
--target-list=x86_64-softmmu --enable-vnc-thread --audio-card-list=hda 
--disable-nptl --disable-guest-base --enable-spice --disable-werror 
--disable-usb-redir --disable-smartcard-nss


On F16/x64. qemu latest compiles fine with those parameters.


...

  CCx86_64-softmmu/pci.o
In file included from /home/ykaul/qemu-kvm/hw/pci.c:36:0:
./qmp-commands.h:3:1: error: expected identifier or ‘(’ before ‘{’ token
./qmp-commands.h:10:2: error: expected identifier or ‘(’ before ‘,’ token
./qmp-commands.h:20:2: error: expected identifier or ‘(’ before ‘,’ token
./qmp-commands.h:30:2: error: expected identifier or ‘(’ before ‘,’ token
./qmp-commands.h:40:2: error: expected identifier or ‘(’ before ‘,’ token
./qmp-commands.h:50:2: error: expected identifier or ‘(’ before ‘,’ token
./qmp-commands.h:60:2: error: expected identifier or ‘(’ before ‘,’ token
./qmp-commands.h:70:2: error: expected identifier or ‘(’ before ‘,’ token
./qmp-commands.h:80:2: error: expected identifier or ‘(’ before ‘,’ token
./qmp-commands.h:90:2: error: expected identifier or ‘(’ before ‘,’ token
./qmp-commands.h:100:2: error: expected identifier or ‘(’ before ‘,’ token
./qmp-commands.h:110:2: error: expected identifier or ‘(’ before ‘,’ token
./qmp-commands.h:120:2: error: expected identifier or ‘(’ before ‘,’ token
./qmp-commands.h:130:2: error: expected identifier or ‘(’ before ‘,’ token
./qmp-commands.h:140:2: error: expected identifier or ‘(’ before ‘,’ token
./qmp-commands.h:154:2: error: expected identifier or ‘(’ before ‘,’ token
./qmp-commands.h:164:2: error: expected identifier or ‘(’ before ‘,’ token
./qmp-commands.h:174:2: error: expected identifier or ‘(’ before ‘,’ token
./qmp-commands.h:184:2: error: expected identifier or ‘(’ before ‘,’ token
./qmp-commands.h:194:2: error: expected identifier or ‘(’ before ‘,’ token
./qmp-commands.h:204:2: error: expected identifier or ‘(’ before ‘,’ token
./qmp-commands.h:214:2: error: expected identifier or ‘(’ before ‘,’ token
./qmp-commands.h:224:2: error: expected identifier or ‘(’ before ‘,’ token
./qmp-commands.h:233:2: error: expected identifier or ‘(’ before ‘,’ token
./qmp-commands.h:244:2: error: expected identifier or ‘(’ before ‘,’ token
./qmp-commands.h:254:2: error: expected identifier or ‘(’ before ‘,’ token
./qmp-commands.h:264:2: error: expected identifier or ‘(’ before ‘,’ token
./qmp-commands.h:274:2: error: expected identifier or ‘(’ before ‘,’ token
./qmp-commands.h:284:2: error: expected identifier or ‘(’ before ‘,’ token
./qmp-commands.h:294:2: error: expected identifier or ‘(’ before ‘,’ token
./qmp-commands.h:304:2: error: expected identifier or ‘(’ before ‘,’ token
./qmp-commands.h:314:2: error: expected identifier or ‘(’ before ‘,’ token
./qmp-commands.h:323:2: error: expected identifier or ‘(’ before ‘,’ token
./qmp-commands.h:333:2: error: expected identifier or ‘(’ before ‘,’ token
/home/ykaul/qemu-kvm/hw/pci.c:1431:14: warning: no previous prototype 
for ‘qmp_query_pci’ [-Wmissing-prototypes]


Thanks,
Y.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv4 0/3] acpi: DSDT/SSDT runtime patching

2011-11-20 Thread Kevin O'Connor
On Sun, Nov 20, 2011 at 07:56:43PM +0200, Michael S. Tsirkin wrote:
 Here's an updated revision of acpi runtime patching patchset.
 Lightly tested.

It looks good to me.

-Kevin
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 3/6] KVM: introduce kvm_for_each_memslot macro

2011-11-20 Thread Takuya Yoshikawa

(2011/11/20 20:21), Avi Kivity wrote:

On 11/18/2011 11:18 AM, Xiao Guangrong wrote:

index bb8728e..10524c0 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -307,6 +307,10 @@ static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm 
*kvm, int i)
 (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
 idx++)

+#define kvm_for_each_memslot(slots, memslot, i)\
+   for (i = 0; i  (slots)-nmemslots  \
+ ({ memslot =(slots)-memslots[i]; 1; }); i++)
+



Statement expression not needed, you can use the comma operator:

   i  (slots)-nmemslots  (memslot = @(slots)-memslots[i], true)

or even

   memslot =(slots)-memslots[i], i  (slots)-nmemslots

or just kill i and make memslot the loop variable.



Do you have any preference for the arguments ordering?

I think placing the target one, memslot in this case, first is conventional in
the kernel code, except when we want to place kvm or something like that.

But in kvm code, there seems to be some difference.

Takuya
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-20 Thread David Gibson
On Fri, Nov 18, 2011 at 01:32:56PM -0700, Alex Williamson wrote:
 On Thu, 2011-11-17 at 11:02 +1100, David Gibson wrote:
  On Tue, Nov 15, 2011 at 11:01:28AM -0700, Alex Williamson wrote:
   On Tue, 2011-11-15 at 17:34 +1100, David Gibson wrote:
On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
 snip
 +Groups, Devices, IOMMUs, oh my
 +---
 +
 +A fundamental component of VFIO is the notion of IOMMU groups.  
 IOMMUs
 +can't always distinguish transactions from each individual device in
 +the system.  Sometimes this is because of the IOMMU design, such as 
 with
 +PEs, other times it's caused by the I/O topology, for instance a
 +PCIe-to-PCI bridge masking all devices behind it.  We call the sets 
 of
 +devices created by these restictions IOMMU groups (or just groups 
 for
 +this document).
 +
 +The IOMMU cannot distiguish transactions between the individual 
 devices
 +within the group, therefore the group is the basic unit of ownership 
 for
 +a userspace process.  Because of this, groups are also the primary
 +interface to both devices and IOMMU domains in VFIO.
 +
 +The VFIO representation of groups is created as devices are added 
 into
 +the framework by a VFIO bus driver.  The vfio-pci module is an 
 example
 +of a bus driver.  This module registers devices along with a set of 
 bus
 +specific callbacks with the VFIO core.  These callbacks provide the
 +interfaces later used for device access.  As each new group is 
 created,
 +as determined by iommu_device_group(), VFIO creates a 
 /dev/vfio/$GROUP
 +character device.

Ok.. so, the fact that it's called vfio-pci suggests that the VFIO
bus driver is per bus type, not per bus instance.   But grouping
constraints could be per bus instance, if you have a couple of
different models of PCI host bridge with IOMMUs of different
capabilities built in, for example.
   
   Yes, vfio-pci manages devices on the pci_bus_type; per type, not per bus
   instance.
  
  Ok, how can that work.  vfio-pci is responsible for generating the
  groupings, yes?  For which it needs to know the iommu/host bridge's
  isolation capabilities, which vary depending on the type of host
  bridge.
 
 No, grouping is done at the iommu driver level.  vfio gets groupings via
 iomm_device_group(), which uses the iommu_ops for the bus_type of the
 requested device.  I'll attempt to clarify where groups come from in the
 documentation.

Hrm, but still per bus type, not bus instance.  Hrm.  Yeah, I need to
look at the earlier iommu patches in more detail.

[snip]
  Yeah, I'm starting to get my head around the model, but the current
  description of it doesn't help very much.  In particular the terms
  merge and unmerge lead one to the wrong mental model, I think.
  
The semantics of merge and unmerge under those names are really
non-obvious.  Merge kind of has to merge two whole metagroups, but
it's unclear if unmerge reverses one merge, or just takes out one
(atom) group.  These operations need better names, at least.
   
   Christian suggested a change to UNMERGE that we do not need to
   specify a group to unmerge from.  This makes it more like a list
   implementation except there's no defined list_head.  Any member of the
   list can pull in a new entry.  Calling UNMERGE on any member extracts
   that member.
  
  I think that's a good idea, but unmerge is not a good word for it.
 
 I can't think of better, if you can, please suggest.

Well, I think addgroup and removegroup would be better than merge and
unmerge, although they have their own problems.

Then it's unclear what order you can do various operations, and which
order you can open and close various things.  You can kind of figure
it out but it takes far more thinking than it should.


So at the _very_ least, we need to invent new terminology and find a
much better way of describing this API's semantics.  I still think an
entirely different interface, where metagroups are created from
outside with a lifetime that's not tied to an fd would be a better
idea.
   
   As we've discussed previously, configfs provides part of this, but has
   no ioctl support.  It doesn't make sense to me to go play with groups in
   configfs, but then still interact with them via a char dev.
  
  Why not?  You configure, say, loopback devices with losetup, then use
  them as a block device.  Similar with nbd.  You can configure serial
  devices with setserial, then use them as a char dev.
  
It also
   splits the ownership model 
  
  I'm not even sure what that means.
  
   and makes it harder to enforce who gets to
   interact with the devices vs who gets to manipulate groups.
  
  How so.
 
 Let's map out what a configfs interface would 

Re: [PATCH v2 3/6] KVM: introduce kvm_for_each_memslot macro

2011-11-20 Thread Xiao Guangrong
On 11/20/2011 07:21 PM, Avi Kivity wrote:

 On 11/18/2011 11:18 AM, Xiao Guangrong wrote:
 index bb8728e..10524c0 100644
 --- a/include/linux/kvm_host.h
 +++ b/include/linux/kvm_host.h
 @@ -307,6 +307,10 @@ static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm 
 *kvm, int i)
   (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
   idx++)

 +#define kvm_for_each_memslot(slots, memslot, i) \
 +for (i = 0; i  (slots)-nmemslots\
 +  ({ memslot = (slots)-memslots[i]; 1; }); i++)
 +

 
 Statement expression not needed, you can use the comma operator:
 
   i  (slots)-nmemslots  (memslot = @(slots)-memslots[i], true)
 
 or even
 
   memslot = (slots)-memslots[i], i  (slots)-nmemslots
 
 or just kill i and make memslot the loop variable.
 


Good, will fix it in the next version, thanks!

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] kvm tools: Implement multiple VQ for virtio-net

2011-11-20 Thread Rusty Russell
On Wed, 16 Nov 2011 09:23:17 +0200, Michael S. Tsirkin m...@redhat.com 
wrote:
 On Wed, Nov 16, 2011 at 10:34:42AM +1030, Rusty Russell wrote:
  On Mon, 14 Nov 2011 15:05:07 +0200, Michael S. Tsirkin m...@redhat.com 
  wrote:
   On Mon, Nov 14, 2011 at 02:25:17PM +0200, Pekka Enberg wrote:
On Mon, Nov 14, 2011 at 4:04 AM, Asias He asias.he...@gmail.com wrote:
 Why both the bandwidth and latency performance are dropping so 
 dramatically
 with multiple VQ?

What's the expected benefit from multiple VQs
   
   Heh, the original patchset didn't mention this :) It really should.
   They are supposed to speed up networking for high smp guests.
  
  If we have one queue per guest CPU, does this allow us to run lockless?
  
  Thanks,
  Rusty.
 
 LLTX? It's supposed to be deprecated, isn't it?

I was referring back to Subject: virtio net lockless ring which
Stephen sent back in June, nothing more specific.

I assumed from his query that this was still an active area of
exploration...

Stephen?

Thanks,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 1/5] virtio-pci: flexible configuration layout

2011-11-20 Thread Rusty Russell
On Tue, 15 Nov 2011 01:43:13 +0200, Sasha Levin levinsasha...@gmail.com wrote:
 From: Michael S. Tsirkin m...@redhat.com
 
 Add a flexible mechanism to specify virtio configuration layout, using
 pci vendor-specific capability.  A separate capability is used for each
 of common, device specific and data-path accesses.

OK, a couple of minor suggestions:

 +static int virtio_pci_iomap(struct virtio_pci_device *vp_dev)
 +{
 + vp_dev-isr_map = virtio_pci_map_cfg(vp_dev,
 +  VIRTIO_PCI_CAP_ISR_CFG,
 +  sizeof(u8));
 + vp_dev-notify_map = virtio_pci_map_cfg(vp_dev,
 + VIRTIO_PCI_CAP_NOTIFY_CFG,
 + sizeof(u16));
 + vp_dev-common_map = virtio_pci_map_cfg(vp_dev,
 + VIRTIO_PCI_CAP_COMMON_CFG,
 + sizeof(u32));
 + vp_dev-device_map = virtio_pci_map_cfg(vp_dev,
 + VIRTIO_PCI_CAP_DEVICE_CFG,
 + sizeof(u8));
 +
 + if (!vp_dev-notify_map || !vp_dev-common_map ||
 + !vp_dev-device_map) {
 + /*
 +  * If not all capabilities present, map legacy PIO.
 +  * Legacy access is at BAR 0. We never need to map
 +  * more than 256 bytes there, since legacy config space
 +  * used PIO which has this size limit.
 +  * */
 + vp_dev-legacy_map = pci_iomap(vp_dev-pci_dev, 0, 256);
 + if (!vp_dev-legacy_map) {
 + dev_err(vp_dev-vdev.dev, Unable to map legacy PIO);
 + goto err;
 + }
 + }

Please don't mix the two.  Ever, under any circumstances.  If it helps,
put CONFIG_VIRTIO_PCI_LEGACY #ifdefs in there:

vp_dev-common_map = virtio_pci_map_cfg(vp_dev,
VIRTIO_PCI_CAP_COMMON_CFG,
sizeof(u32));
/* Something horribly wrong? */
if (IS_ERR(vp_dev-common_map)) {
err = PTR_ERR(vp_dev-common_map);
goto fail;
}

/* Not found? */
if (!vp_dev-common_map)
return legacy_setup(vp_dev, ...);

...

Practice has shown that if we allow something, it'll be done.  We should
break horribly on malformed devices, and really really only fall back
when we have a well-formed, but old, device.

And various other legacy paths can happily use:

#ifdef CONFIG_VIRTIO_PCI_LEGACY
#define is_legacy(vp_dev) ((vp_dev)-legacy_map != NULL)
#else
#define is_legacy(vp_dev) false
#endif

Here's suggested Kconfig entry:

config VIRTIO_PCI_LEGACY
bool
default y
depends on VIRTIO_PCI
---help---
  Look out into your driveway.  Do you have a flying car?  If
  so, you can happily disable this option and virtio will not
  break.  Otherwise, leave it set.  Unless you're testing what
  life will be like in The Future.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5 of 5] virtio: expose added descriptors immediately

2011-11-20 Thread Rusty Russell
On Wed, 16 Nov 2011 09:18:38 +0200, Michael S. Tsirkin m...@redhat.com 
wrote:
 My unlocked kick patches will trip this warning: they make
 virtio-net do add + get without kick.

Heh, it's a good sign if they do, since that means you're running really
well :)

 I think block with unlocked kick can trip it too:
 add, lock is dropped and then an interrupt can get.
 
 We also don't need a kick each num - each 2^15 is enough.
 Why don't we do this at start of add_buf:
 if (vq-num_added = 0x7fff)
   return -ENOSPC;

The warning was there in case a driver is never doing a kick, and
getting away with it (mostly) because the device is polling.  Let's not
penalize good drivers to catch bad ones.

How about we do this properly, like so:

From: Rusty Russell ru...@rustcorp.com.au
Subject: virtio: add debugging if driver doesn't kick.

Under the existing #ifdef DEBUG, check that they don't have more than
1/10 of a second between an add_buf() and a
virtqueue_notify()/virtqueue_kick_prepare() call.

We could get false positives on a really busy system, but good for
development.

Signed-off-by: Rusty Russell ru...@rustcorp.com.au
---
 drivers/virtio/virtio_ring.c |   31 +++
 1 file changed, 31 insertions(+)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -22,6 +22,7 @@
 #include linux/device.h
 #include linux/slab.h
 #include linux/module.h
+#include linux/hrtimer.h
 
 /* virtio guest is communicating with a virtual device that actually runs on
  * a host processor.  Memory barriers are used to control SMP effects. */
@@ -102,6 +103,10 @@ struct vring_virtqueue
 #ifdef DEBUG
/* They're supposed to lock for us. */
unsigned int in_use;
+
+   /* Figure out if their kicks are too delayed. */
+   bool last_add_time_valid;
+   ktime_t last_add_time;
 #endif
 
/* Tokens for callbacks. */
@@ -192,6 +197,19 @@ int virtqueue_add_buf(struct virtqueue *
 
BUG_ON(data == NULL);
 
+#ifdef DEBUG
+   {
+   ktime_t now = ktime_get();
+
+   /* No kick or get, with .1 second between?  Warn. */
+   if (vq-last_add_time_valid)
+   WARN_ON(ktime_to_ms(ktime_sub(now, vq-last_add_time))
+100);
+   vq-last_add_time = now;
+   vq-last_add_time_valid = true;
+   }
+#endif
+
/* If the host supports indirect descriptor tables, and we have multiple
 * buffers, then go indirect. FIXME: tune this threshold */
if (vq-indirect  (out + in)  1  vq-num_free) {
@@ -291,6 +309,14 @@ bool virtqueue_kick_prepare(struct virtq
new = vq-vring.avail-idx;
vq-num_added = 0;
 
+#ifdef DEBUG
+   if (vq-last_add_time_valid) {
+   WARN_ON(ktime_to_ms(ktime_sub(ktime_get(),
+ vq-last_add_time))  100);
+   }
+   vq-last_add_time_valid = false;
+#endif
+
if (vq-event) {
needs_kick = vring_need_event(vring_avail_event(vq-vring),
  new, old);
@@ -428,6 +454,10 @@ void *virtqueue_get_buf(struct virtqueue
virtio_mb();
}
 
+#ifdef DEBUG
+   vq-last_add_time_valid = false;
+#endif
+
END_USE(vq);
return ret;
 }
@@ -611,6 +641,7 @@ struct virtqueue *vring_new_virtqueue(un
list_add_tail(vq-vq.list, vdev-vqs);
 #ifdef DEBUG
vq-in_use = false;
+   vq-last_add_time_valid = false;
 #endif
 
vq-indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC);
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 5/6] KVM: sort memslots by its size and use line search

2011-11-20 Thread Xiao Guangrong
On 11/20/2011 07:26 PM, Avi Kivity wrote:

 On 11/18/2011 11:19 AM, Xiao Guangrong wrote:
 Sort memslots base on its size and use line search to find it, so the larger
 memslots have better fit

 The idea is from Avi

 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 ---
  include/linux/kvm_host.h |   22 +---
  virt/kvm/kvm_main.c  |   82 
 -
  2 files changed, 75 insertions(+), 29 deletions(-)

 diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
 index a0e4d63..83b396a 100644
 --- a/include/linux/kvm_host.h
 +++ b/include/linux/kvm_host.h
 @@ -230,8 +230,12 @@ struct kvm_irq_routing_table {};
  #define KVM_MEM_SLOTS_NUM (KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS)
  #endif

 +/*
 + * Note:
 + * memslots are not sorted by id anymore, please use id_to_memslot()
 + * to get the memslot by its id.
 + */
  struct kvm_memslots {
 -int nmemslots;
  u64 generation;
  struct kvm_memory_slot memslots[KVM_MEM_SLOTS_NUM];
  };
 @@ -307,9 +311,10 @@ static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm 
 *kvm, int i)
   (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
   idx++)

 -#define kvm_for_each_memslot(slots, memslot, i) \
 -for (i = 0; i  (slots)-nmemslots\
 -  ({ memslot = (slots)-memslots[i]; 1; }); i++)
 +#define kvm_for_each_memslot(slots, memslot, i) \
 +for (i = 0; i  KVM_MEM_SLOTS_NUM \
 +  ({ memslot = (slots)-memslots[i]; 1; })   \
 +  memslot-npages != 0; i++)
 
 You might allocate an always-empty memslot at the end and simplify the
 termination condition.
 


The slots are sorted by its size, all empty slots are at the end of the array,
so we can simplify break the loop if meet a empty slot.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/6] KVM: optimize memslots searching

2011-11-20 Thread Xiao Guangrong
On 11/20/2011 08:12 PM, Avi Kivity wrote:

 On 11/20/2011 01:29 PM, Avi Kivity wrote:
 On 11/18/2011 11:16 AM, Xiao Guangrong wrote:
 This is the more work base on my v1 patchset which is posted some months 
 ago,
 it can be found at:
 https://lkml.org/lkml/2011/2/22/68

 Change log:
 - sort memslots base on its size and do the line search instead of binary
   search base on gfn, it is from Avi's idea.
 - in order to reduce cache footprint, memslots are sorted in the array of
   kvm-memslots-memslots[] and introduce a table to map slot id to index in
   the array

 There is the performance result:

 autotest for RHEL.6.1 setup/boot/reboot/shutdown(average):
 ept=1:  before: 449.5   after: 447.8
 ept=0:  before: 532.7   after: 529.8

 kernbench(average):
 ept=1:  before: 127.94  after: 126.98
 ept=0:  before: 196.85  after: 189.66

 Looks good, had a couple of comments but they're really minor.

 
 btw, this patchset touches a lot of common code.  Did you crossbuild for
 other kvm archs?
 


No :-), but i will build cross compilation environment and test it next time,
thanks!

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM device assignment and user privileges

2011-11-20 Thread Chris Wright
* Avi Kivity (a...@redhat.com) wrote:
 On 11/20/2011 04:58 PM, Sasha Levin wrote:
  Hi all,
 
  I've been working on adding device assignment to KVM tools, and started
  with the basics of just getting a device assigned using the
  KVM_ASSIGN_PCI_DEVICE ioctl.
 
  What I've figured is that unprivileged users can request any PCI device
  to be assigned to him, including devices which he shouldn't be touching.
 
  In my case, it happened with the VGA card, where an unprivileged user
  simply called KVM_ASSIGN_PCI_DEVICE with the bus, seg and fn of the VGA
  card and caused the display on the host to go apeshit.
 
  Was it supposed to work this way? 
 
 No, of course not.

Indeed.  A device is typically owned by a host OS driver which precludes
device assignment from working.  If it's not, the unprivilged guest
will not have access to the device's config space or resource bars as
they are only rw for a privileged user.  And similarly, /dev/kvm was
typically left as 0644.  As you can see, it's fragile.

  I couldn't find any security checks in
  the code paths of KVM_ASSIGN_PCI_DEVICE and it looks like any user can
  invoke it with any parameters he'd want - enabling him to kill the host.
 
 Alex, Chris?

The security checks were removed some time back.  The expectation was
that there was nothing an unprivleged user could usefully do w/ the
assign device ioctl, and the assign irq ioctl only works after assign
device.  It's built on an overly fragile set of assumptions, however.
Avi, the simplest short term thing to do now might be simply revert:

48bb09e KVM: remove CAP_SYS_RAWIO requirement from kvm_vm_ioctl_assign_irq

While it's a regression for existing unprivileged users it's better than
a hole.  And in the meantime, we can come up w/ something better to
replace with.

thanks,
-chris
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] kvm tools, qcow: Add the support for copy-on-write clusters

2011-11-20 Thread Lan Tianyu
When meeting request to write the cluster without copied flag,
allocate a new cluster and write original data with modification
to the new cluster. This also adds support for the writing operation
of the qcow2 compressed image. After testing, image file can pass
through qemu-img check. The performance is needed to be improved.

Signed-off-by: Lan Tianyu tianyu@intel.com
---
 tools/kvm/disk/qcow.c|  411 +-
 tools/kvm/include/kvm/qcow.h |2 +
 2 files changed, 285 insertions(+), 128 deletions(-)

diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c
index 680b37d..723faaa 100644
--- a/tools/kvm/disk/qcow.c
+++ b/tools/kvm/disk/qcow.c
@@ -20,6 +20,16 @@
 #include linux/kernel.h
 #include linux/types.h
 
+
+static inline int qcow_pwrite_sync(int fd,
+   void *buf, size_t count, off_t offset)
+{
+   if (pwrite_in_full(fd, buf, count, offset)  0)
+   return -1;
+
+   return fdatasync(fd);
+}
+
 static int l2_table_insert(struct rb_root *root, struct qcow_l2_table *new)
 {
struct rb_node **link = (root-rb_node), *parent = NULL;
@@ -101,7 +111,8 @@ static int qcow_l2_cache_write(struct qcow *q, struct 
qcow_l2_table *c)
 
size = 1  header-l2_bits;
 
-   if (pwrite_in_full(q-fd, c-table, size * sizeof(u64), c-offset)  0)
+   if (qcow_pwrite_sync(q-fd, c-table,
+   size * sizeof(u64), c-offset)  0)
return -1;
 
c-dirty = 0;
@@ -122,9 +133,6 @@ static int cache_table(struct qcow *q, struct qcow_l2_table 
*c)
 */
lru = list_first_entry(l1t-lru_list, struct qcow_l2_table, 
list);
 
-   if (qcow_l2_cache_write(q, lru)  0)
-   goto error;
-
/* Remove the node from the cache */
rb_erase(lru-node, r);
list_del_init(lru-list);
@@ -518,14 +526,6 @@ static inline u64 file_size(int fd)
return st.st_size;
 }
 
-static inline int qcow_pwrite_sync(int fd, void *buf, size_t count, off_t 
offset)
-{
-   if (pwrite_in_full(fd, buf, count, offset)  0)
-   return -1;
-
-   return fdatasync(fd);
-}
-
 /* Writes a level 2 table at the end of the file. */
 static u64 qcow_write_l2_table(struct qcow *q, u64 *table)
 {
@@ -601,7 +601,8 @@ static int write_refcount_block(struct qcow *q, struct 
qcow_refcount_block *rfb)
if (!rfb-dirty)
return 0;
 
-   if (pwrite_in_full(q-fd, rfb-entries, rfb-size * sizeof(u16), 
rfb-offset)  0)
+   if (qcow_pwrite_sync(q-fd, rfb-entries,
+   rfb-size * sizeof(u16), rfb-offset)  0)
return -1;
 
rfb-dirty = 0;
@@ -618,9 +619,6 @@ static int cache_refcount_block(struct qcow *q, struct 
qcow_refcount_block *c)
if (rft-nr_cached == MAX_CACHE_NODES) {
lru = list_first_entry(rft-lru_list, struct 
qcow_refcount_block, list);
 
-   if (write_refcount_block(q, lru)  0)
-   goto error;
-
rb_erase(lru-node, r);
list_del_init(lru-list);
rft-nr_cached--;
@@ -706,6 +704,11 @@ static struct qcow_refcount_block 
*qcow_read_refcount_block(struct qcow *q, u64
 
rfb_offset = be64_to_cpu(rft-rf_table[rft_idx]);
 
+   if (!rfb_offset) {
+   pr_warning(Don't support to grow refcount table);
+   return NULL;
+   }
+
rfb = refcount_block_search(q, rfb_offset);
if (rfb)
return rfb;
@@ -728,35 +731,138 @@ error_free_rfb:
return NULL;
 }
 
-/*
- * QCOW file might grow during a write operation. Not only data but metadata is
- * also written at the end of the file. Therefore it is necessary to ensure
- * every write is committed to disk. Hence we use uses qcow_pwrite_sync() to
- * synchronize the in-core state of QCOW image to disk.
- *
- * We also try to restore the image to a consistent state if the metdata
- * operation fails. The two metadat operations are: level 1 and level 2 table
- * update. If either of them fails the image is truncated to a consistent 
state.
+static u16 qcow_get_refcount(struct qcow *q, u64 clust_idx)
+{
+   struct qcow_refcount_block *rfb = NULL;
+   struct qcow_header *header = q-header;
+   u64 rfb_idx;
+
+   rfb = qcow_read_refcount_block(q, clust_idx);
+   if (!rfb) {
+   pr_warning(Error while reading refcount table);
+   return -1;
+   }
+
+   rfb_idx = clust_idx  (((1ULL 
+   (header-cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1));
+
+   if (rfb_idx = rfb-size) {
+   pr_warning(L1: refcount block index out of bounds);
+   return -1;
+   }
+
+   return be16_to_cpu(rfb-entries[rfb_idx]);
+}
+
+static int update_cluster_refcount(struct qcow *q, u64 clust_idx, u16 append)
+{
+   struct qcow_refcount_block *rfb = NULL;
+   struct qcow_header *header = q-header;
+   u16 

Re: [PATCH 01/11] KVM: PPC: Add memory-mapping support for PCI passthrough and emulation

2011-11-20 Thread Avi Kivity
On 11/17/2011 12:52 AM, Paul Mackerras wrote:
 From: Benjamin Herrenschmidt b...@kernel.crashing.org

 This adds support for adding PCI device I/O regions to the guest memory
 map, and for trapping guest accesses to emulated MMIO regions and
 delivering them to qemu for MMIO emulation.  To trap guest accesses to
 emulated MMIO regions, we reserve key 31 for the hypervisor's use and
 set the VPM1 bit in LPCR, which sends all page faults to the host.
 Any page fault that is not a key fault gets reflected immediately to the
 guest.  We set HPTEs for emulated MMIO regions to have key = 31, and
 don't allow the guest to create HPTEs with key = 31.  Any page fault
 that is a key fault with key = 31 is then a candidate for MMIO
 emulation and thus gets sent up to qemu.  We also load the instruction
 that caused the fault for use later when qemu has done the emulation.

 [pau...@samba.org: Cleaned up, moved kvmppc_book3s_hv_emulate_mmio()
  to book3s_64_mmu_hv.c]


 + /*
 +  * XXX WARNING: We do not know for sure whether the instruction we just
 +  * read from memory is the same that caused the fault in the first
 +  * place. We don't have a problem with the guest shooting itself in
 +  * the foot that way, however we must be careful that we enforce
 +  * the write permission based on the instruction we are actually
 +  * emulating, not based on dsisr. Unfortunately, the KVM code for
 +  * instruction emulation isn't smart enough for that to work
 +  * so right now we just do it badly and racily, but that will need
 +  * fixing
 +  */
 +

Ouch, I assume this will be fixed before merging?

  }
  
  int kvmppc_core_prepare_memory_region(struct kvm *kvm,
 - struct kvm_userspace_memory_region *mem)
 +   struct kvm_memory_slot *memslot,
 +   struct kvm_userspace_memory_region *mem)
  {
   unsigned long psize, porder;
   unsigned long i, npages, totalpages;
   unsigned long pg_ix;
   struct kvmppc_pginfo *pginfo;
 - unsigned long hva;
   struct kvmppc_rma_info *ri = NULL;
 + struct vm_area_struct *vma;
   struct page *page;
 + unsigned long hva;
 +
 + /*
 +  * This could be an attempt at adding memory or it could be MMIO
 +  * pass-through. We need to treat them differently but the only
 +  * way for us to know what it is is to look at the VMA and play
 +  * guess work so let's just do that
 +  */

There is no the VMA.  There could be multiple VMAs, or none (with the
mmap() coming afterwards).  You could do all the checks you want here,
only to have host userspace remap it under your feet.  This needs to be
done on a per-page basis at fault time.

Please see the corresponding x86 code (warning: convoluted), which has a
similar problem (though I have no idea if you can use a similar solution).

 +
 + /*
 +  * We require read  write permission as we cannot yet
 +  * enforce guest read-only protection or no access.
 +  */
 + if ((vma-vm_flags  (VM_READ | VM_WRITE)) !=
 + (VM_READ | VM_WRITE))
 + goto err_unlock;

This, too, must be done at get_user_pages() time.

What happens if mmu notifiers tell you to write protect a page?

  void kvm_arch_commit_memory_region(struct kvm *kvm,
 diff --git a/include/linux/kvm.h b/include/linux/kvm.h
 index c107fae..774b04d 100644
 --- a/include/linux/kvm.h
 +++ b/include/linux/kvm.h
 @@ -105,6 +105,9 @@ struct kvm_userspace_memory_region {
  #define KVM_MEM_LOG_DIRTY_PAGES  1UL
  #define KVM_MEMSLOT_INVALID  (1UL  1)
  
 +/* Kernel internal use */
 +#define KVM_MEMSLOT_IO(1UL  31)
 +

Please define it internally then (and leave a comment so we don't
overlap it).

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 10/11] KVM: PPC: Implement MMU notifiers

2011-11-20 Thread Avi Kivity
On 11/17/2011 01:52 AM, Paul Mackerras wrote:
 This implements the low-level functions called by the MMU notifiers in
 the generic KVM code, and defines KVM_ARCH_WANT_MMU_NOTIFIER if
 CONFIG_KVM_BOOK3S_64_HV so that the generic KVM MMU notifiers get
 included.

 That means we also have to take notice of when PTE invalidations are
 in progress, as indicated by mmu_notifier_retry().  In kvmppc_h_enter,
 if any invalidation is in progress we just install a non-present HPTE.
 In kvmppc_book3s_hv_page_fault, if an invalidation is in progress we
 just return without resolving the guest, causing it to encounter another
 page fault immediately.  This is better than spinning inside
 kvmppc_book3s_hv_page_fault because this way the guest can get preempted
 by a hypervisor decrementer interrupt without us having to do any
 special checks.

 We currently maintain a referenced bit in the rmap array, and when we
 clear it, we make all the HPTEs that map the corresponding page be
 non-present, as if the page were invalidated.  In future we could use
 the hardware reference bit in the guest HPT instead.

 The kvm_set_spte_hva function is implemented as kvm_unmap_hva.  The
 former appears to be unused anyway.

This is mostly used for COW (after ksm, not fork).  So if you want to
use ksm, this avoids an exit.

 This all means that on processors that support virtual partition
 memory (POWER7), we can claim support for the KVM_CAP_SYNC_MMU
 capability, and we no longer have to pin all the guest memory.



-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html