Re: XP smp using a lot of CPU

2009-05-13 Thread Avi Kivity

Ross Boylan wrote:
> I just installed XP into a new VM, specifying -smp 2 for the machine.
> According to top, it's using nearly 200% of a cpu even when I'm not
> doing anything.
>
> Is this real CPU usage, or just a reporting problem (just as my disk
> image is big according to ls, but isn't really)?
>
> If it's real, is there anything I can do about it?
>
> kvm 0.7.2 on Debian Lenny (but 2.6.29 kernel), amd64.  Xeon chips; 32
> bit version of XP pro installed, now fully patched (including the
> Windows Genuine Advantage stuff, though I cancelled it when it wanted to
> run).
>
> Task manager in XP shows virtually no CPU usage.
>
> Please cc me on responses.

I'm guessing Windows uses a pio port to sleep, which kvm doesn't
support.  Can you provide kvm_stat output?


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Enable dirty logging for all regions during migration

2009-05-13 Thread Avi Kivity

Glauber Costa wrote:
> From: Glauber de Oliveira Costa glom...@redhat.com
>
> In current calculations, we are not activating dirty logging
> for all regions, leading migration to fail. This problem was
> already raised by Yaniv Kamay a while ago. The proposed
> solution at the time (not merged), was a calculation to convert
> from target_phys_addr_t to ram_addr_t, which the dirty logging code
> expects.
>
> Avi noticed that enabling dirty logging for the region 0 - -1ULL
> would do the trick. As I hit the problem, I can confirm it does.
>
> This patch, therefore, goes with this simpler approach. Before
> this patch, migration fails. With this patch, simple migration
> tests succeed.

Applied, thanks.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH v3] kvm: Use a bitmap for tracking used GSIs

2009-05-13 Thread Michael S. Tsirkin
On Tue, May 12, 2009 at 04:07:15PM -0600, Alex Williamson wrote:
> We're currently using a counter to track the most recent GSI we've
> handed out.  This quickly hits KVM_MAX_IRQ_ROUTES when using device
> assignment with a driver that regularly toggles the MSI enable bit.

BTW, which driver does that? Any idea why?

-- 
MST


Re: [PATCH v3] kvm: Use a bitmap for tracking used GSIs

2009-05-13 Thread Michael S. Tsirkin
On Tue, May 12, 2009 at 04:07:15PM -0600, Alex Williamson wrote:
> We're currently using a counter to track the most recent GSI we've
> handed out.  This quickly hits KVM_MAX_IRQ_ROUTES when using device
> assignment with a driver that regularly toggles the MSI enable bit.
> This can mean only a few minutes of usable run time.  Instead, track
> used GSIs in a bitmap.
>
> Signed-off-by: Alex Williamson alex.william...@hp.com
> ---
>
>  v2: Added mutex to protect gsi bitmap
>  v3: Updated for comments from Michael Tsirkin
>      No longer depends on [PATCH] kvm: device-assignment: Catch GSI overflow
>
>  hw/device-assignment.c  |    4 ++
>  kvm/libkvm/kvm-common.h |    4 ++
>  kvm/libkvm/libkvm.c     |   83 +--
>  kvm/libkvm/libkvm.h     |   10 ++
>  4 files changed, 88 insertions(+), 13 deletions(-)
>
> diff --git a/hw/device-assignment.c b/hw/device-assignment.c
> index a7365c8..a6cc9b9 100644
> --- a/hw/device-assignment.c
> +++ b/hw/device-assignment.c
> @@ -561,8 +561,10 @@ static void free_dev_irq_entries(AssignedDevice *dev)
>  {
>      int i;
>
> -    for (i = 0; i < dev->irq_entries_nr; i++)
> +    for (i = 0; i < dev->irq_entries_nr; i++) {
>          kvm_del_routing_entry(kvm_context, &dev->entry[i]);
> +        kvm_free_irq_route_gsi(kvm_context, dev->entry[i].gsi);
> +    }
>      free(dev->entry);
>      dev->entry = NULL;
>      dev->irq_entries_nr = 0;
> diff --git a/kvm/libkvm/kvm-common.h b/kvm/libkvm/kvm-common.h
> index 591fb53..4b3cb51 100644
> --- a/kvm/libkvm/kvm-common.h
> +++ b/kvm/libkvm/kvm-common.h
> @@ -66,8 +66,10 @@ struct kvm_context {
>  #ifdef KVM_CAP_IRQ_ROUTING
>          struct kvm_irq_routing *irq_routes;
>          int nr_allocated_irq_routes;
> +        void *used_gsi_bitmap;
> +        int max_gsi;
> +        pthread_mutex_t gsi_mutex;
>  #endif
> -        int max_used_gsi;
>  };
>
>  int kvm_alloc_kernel_memory(kvm_context_t kvm, unsigned long memory,
> diff --git a/kvm/libkvm/libkvm.c b/kvm/libkvm/libkvm.c
> index ba0a5d1..3d7ab75 100644
> --- a/kvm/libkvm/libkvm.c
> +++ b/kvm/libkvm/libkvm.c
> @@ -35,6 +35,7 @@
>  #include <errno.h>
>  #include <sys/ioctl.h>
>  #include <inttypes.h>
> +#include <pthread.h>
>  #include "libkvm.h"
>
>  #if defined(__x86_64__) || defined(__i386__)
> @@ -65,6 +66,8 @@
>  int kvm_abi = EXPECTED_KVM_API_VERSION;
>  int kvm_page_size;
>
> +static inline void set_bit(uint32_t *buf, unsigned int bit);
> +
>  struct slot_info {
>          unsigned long phys_addr;
>          unsigned long len;
> @@ -286,6 +289,9 @@ kvm_context_t kvm_init(struct kvm_callbacks *callbacks,
>          int fd;
>          kvm_context_t kvm;
>          int r;
> +#ifdef KVM_CAP_IRQ_ROUTING

Let's kill all these ifdefs. Or at least, let's not add them.

-- 
MST


Re: Best choice for copy/clone/snapshot

2009-05-13 Thread Avi Kivity

Ross Boylan wrote:
> First, I have a feeling this might be a question I could ask on a qemu
> list.

It is.

> Is there a way for me to tell which questions should go where?

If the question is equally valid for qemu and qemu-kvm, then qemu-devel
is the correct forum.

> Is it OK to ask here?

Sure, we aren't sticklers for this sort of thing.

> As I install software onto a system I want to preserve its state--just
> the disk state---at various points so I can go back.  What is the best
> way to do this?

LVM snapshots.  Read up on the 'lvcreate -s' command and option.

> First, I think I could just make a copy of the virtual disk, although I
> haven't seen this suggested anywhere.  I assume this will work if the VM
> is off;

Yes.

> are there other circumstances in which it is safe?

You could suspend the guest, either by having it sleep, or externally
using ctrl-Z.

> Since my
> original virtual disk file isn't really occupying its nominal space, I
> assume this will be true of the copy too.
>
> Second, kvm-img could create a copy on write image.  There are several
> things I don't understand about this.  Suppose I go
> kvm-img -b A.img  B.img
>
> If I then go on and use A.img as I did before, changing what is on disk,
> have I screwed up B.img?

Yes.  If you use an image as a backing store, you promise not to change
it.  Use B.img instead.

> Do A.img or B.img have to be qcow2 format?  I created a raw image for
> portability.

Only B.img, though it works better if both are qcow2s.

> Suppose I work for awhile installing new stuff on B.img, and then want
> to preserve the state.  Is
> kvm-img -b B.img C.img
> sensible, or is this kind of recursive operation (B.img is already the
> copy on write version of A.img) not OK?

Should work.

> Does ‘commit [-f fmt] filename’, documented as
> Commit the changes recorded in filename in its base image.
> mean commit the recorded changes TO its base image?

Yes.  It was broken until recently, so use with caution.

> Here are some other things I think I don't want to do.  Please let me
> know if I'm mistaken.
>
> -snapshot on the kvm command line: nothing persistent comes of this
> (maybe if you commit you update the original image, but you don't get
> 2).

Right.

> snapshot in the monitor: this snapshots the non-disk state of the VM;
> further, that state is not guaranteed to work if you later change what
> is on the disk.  I think kvm-img snapshot also accesses these
> facilities.

It snapshots both the disk and non-disk state.  You have to use qcow2
for this.
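
On the question of whether a copy of the image keeps the "not really
occupying its nominal space" property: that depends on whether the file is
sparse and on whether the copy tool preserves holes.  A minimal sketch of how
to check, assuming a Linux host where st_blocks counts 512-byte units; the
helper names make_sparse_file() and image_usage() are made up for this
illustration and are not part of kvm-img:

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create (or truncate) a file of `size` bytes without writing any data,
 * so the filesystem is free to leave it as one big hole. */
static int make_sparse_file(const char *path, off_t size)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    int rc = ftruncate(fd, size);
    close(fd);
    return rc;
}

/* Report the apparent size (what ls -l shows) and the space actually
 * allocated on disk.  Assumes st_blocks is in 512-byte units (Linux). */
static int image_usage(const char *path, long long *apparent,
                       long long *allocated)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
    *apparent = (long long)st.st_size;
    *allocated = (long long)st.st_blocks * 512;
    return 0;
}
```

Running image_usage() on the original and on the copy shows whether the copy
stayed sparse; if the allocated figure on the copy is close to the apparent
size, the copy tool filled the holes in.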



--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH][KVM-AUTOTEST] Add custom install option for kvm_install

2009-05-13 Thread Avi Kivity

Michael Goldish wrote:
> That is, assuming you want to
> - install KVM from F10 branch
> - run all tests
> - install KVM from F11 branch
> - run all tests
> - install KVM from devel branch
> - run all tests
>
> If you meant something different please correct me.

Note kvm is moving to split userspace/kernel packaging, so this can be
useful to test version compatibility.

i.e. test kvm-kmod A, B, C vs qemu-kvm X, Y, Z.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH RFC 0/2] qemu-kvm: MSI-X support

2009-05-13 Thread Avi Kivity

Michael S. Tsirkin wrote:
> It seems that if I just call apic_deliver_irq each time
> I want to send MSI, things will work.
>
> However, large part of the msix code is managing IRQs versus kernel,
> and I'm not sure it's a wise investment of effort to rip it all out. So
> IMHO, what's missing is API that abstracts managing irq routes in kvm,
> specifically abstract this stuff in some way:
> kvm_get_irq_route_gsi
> kvm_add_routing_entry
> kvm_del_routing_entry
> kvm_commit_irq_routes

All these are just games with qemu_irq objects.  Should be a lot simpler
in userspace.

> kvm_set_irq

qemu_set_irq().

> How hard is that?

Should be pretty easy, once you get the hang of qemu_irq.

> For now, this API could be a stub that just stores the routes somewhere,
> and set_irq would call the local apic emulation, along the lines of:
>
> uint8_t dest = (addr_lo & MSI_ADDR_DEST_ID_MASK)
>                >> MSI_ADDR_DEST_ID_SHIFT;
> uint8_t vector = (addr_hi & MSI_DATA_VECTOR_MASK)
>                  >> MSI_DATA_VECTOR_SHIFT;
> uint8_t dest_mode = (addr_lo >> MSI_ADDR_DEST_MODE_SHIFT) & 0x1;
> uint8_t trigger_mode = (data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
> uint8_t delivery_mode = (data >> MSI_DATA_DELIVERY_MODE_SHIFT) &
>                         0x7;
> apic_deliver_irq(dest, dest_mode, delivery_mode, vector, 0,
>                  trigger_mode);

qemu_set_irq() eventually calls a callback that you specify; just set it
to look up the entry and call apic_deliver_irq.
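
The field extraction such a callback would do before calling
apic_deliver_irq can be sketched on its own.  This is only an illustration:
the constants below are written out from the architectural x86 MSI layout
rather than taken from any qemu or kernel header, msi_decode() and
struct msi_fields are hypothetical names, and the vector is read from the
data word, as the MSI_DATA_VECTOR_* names suggest.

```c
#include <assert.h>
#include <stdint.h>

/* x86 MSI message layout, spelled out locally for this sketch.
 * Verify against your own headers before relying on the values. */
#define MSI_ADDR_DEST_ID_SHIFT        12
#define MSI_ADDR_DEST_ID_MASK         0x000ff000u
#define MSI_ADDR_DEST_MODE_SHIFT      2
#define MSI_DATA_VECTOR_SHIFT         0
#define MSI_DATA_VECTOR_MASK          0x000000ffu
#define MSI_DATA_DELIVERY_MODE_SHIFT  8
#define MSI_DATA_TRIGGER_SHIFT        15

/* Hypothetical container for the decoded fields. */
struct msi_fields {
    uint8_t dest;           /* destination APIC ID, address bits 19:12 */
    uint8_t dest_mode;      /* 0 = physical, 1 = logical, address bit 2 */
    uint8_t vector;         /* interrupt vector, data bits 7:0 */
    uint8_t delivery_mode;  /* data bits 10:8 */
    uint8_t trigger_mode;   /* 0 = edge, 1 = level, data bit 15 */
};

static struct msi_fields msi_decode(uint32_t addr_lo, uint32_t data)
{
    struct msi_fields f;

    f.dest = (addr_lo & MSI_ADDR_DEST_ID_MASK) >> MSI_ADDR_DEST_ID_SHIFT;
    f.dest_mode = (addr_lo >> MSI_ADDR_DEST_MODE_SHIFT) & 0x1;
    f.vector = (data & MSI_DATA_VECTOR_MASK) >> MSI_DATA_VECTOR_SHIFT;
    f.delivery_mode = (data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x7;
    f.trigger_mode = (data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
    return f;
}
```

The decoded fields map one-to-one onto the arguments of the
apic_deliver_irq call quoted above.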


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH 1/3] virtio: find_vqs/del_vqs virtio operations

2009-05-13 Thread Michael S. Tsirkin
On Wed, May 13, 2009 at 10:47:08AM +0930, Rusty Russell wrote:
> On Wed, 13 May 2009 01:03:30 am Michael S. Tsirkin wrote:
> > On Wed, May 13, 2009 at 12:00:02AM +0930, Rusty Russell wrote
> > > and perhaps consider
> > > varargs for the callbacks (or would that be too horrible at the
> > > implementation end?)
> > >
> > > Thanks,
> > > Rusty.
> >
> > Ugh ... I think it will be. And AFAIK gcc generates a lot of code
> > for varargs - not something we want to do in each interrupt handler.
>
> Err, no I mean for find_vqs:  eg.
>       (block device)
>       err = vdev->config->find_vqs(vdev, 1, &vblk->vq, blk_done);
>
>       (net device)
>       err = vdev->config->find_vqs(vdev, 3, vqs, skb_recv_done,
>                                    skb_xmit_done, NULL);
>
> A bit neater for the single-queue case.
>
> Cheers,
> Rusty.

Oh. I see. But it becomes messy now that we also need to pass in the
names, and we lose type safety.
Let's just add a helper function for the single vq case?

static inline struct virtqueue *virtio_find_vq(struct virtio_device *vdev,
                                               vq_callback_t *c, const char *n)
{
        vq_callback_t *callbacks[] = { c };
        const char *names[] = { n };
        struct virtqueue *vq;
        int err = vdev->config->find_vqs(vdev, 1, &vq, callbacks, names);
        if (err < 0)
                return ERR_PTR(err);
        return vq;
}


-- 
MST


Re: Network I/O performance

2009-05-13 Thread Avi Kivity

Fischer, Anna wrote:
> I am running KVM with Fedora Core 8 on a 2.6.23 32-bit kernel. I use the
> tun/tap device model and the Linux bridge kernel module to connect my VM to the
> network. I have 2 10G Intel 82598 network devices (with the ixgbe driver)
> attached to my machine and I want to do packet routing in my VM (the VM has two
> virtual network interfaces configured). Analysing the network performance of
> the standard QEMU emulated NICs, I get less than 1G of throughput on those 10G
> links. Surprisingly though, I don't really see CPU utilization being maxed out.
> This is a dual core machine, and mpstat shows me that both CPUs are about 40%
> idle. My VM is more or less unresponsive due to the high network processing
> load while the host OS still seems to be in good shape. How can I best tune
> this setup to achieve best possible performance with KVM? I know there is
> virtIO and I know there is PCI pass-through, but those models are not an option
> for me right now.

How many cpus are assigned to the guest?  If only one, then 40% idle
equates to 100% of a core for the guest and 20% for housekeeping.

If this is the case, you could try pinning the vcpu thread (info cpus
from the monitor) to one core.  You should then see 100%/20% cpu load
distribution.

wrt emulated NIC performance, I'm guessing you're not doing tcp?  If you
were we might do something with TSO.
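
Pinning can be done from the shell, but for reference this is roughly what it
amounts to at the system-call level on Linux.  A minimal sketch using
sched_setaffinity(2); the helper names pin_thread_to_cpu() and
first_allowed_cpu() are made up here, and passing the thread ID reported by
"info cpus" as tid is an assumption about how you would use it:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <sys/types.h>

/* Restrict the thread with kernel thread ID `tid` (0 means the calling
 * thread) to a single CPU.  Returns 0 on success, -1 with errno set. */
static int pin_thread_to_cpu(pid_t tid, int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(tid, sizeof(set), &set);
}

/* Return the lowest-numbered CPU the thread may currently run on,
 * or -1 if the affinity mask cannot be read. */
static int first_allowed_cpu(pid_t tid)
{
    cpu_set_t set;

    if (sched_getaffinity(tid, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}
```

With the vcpu thread confined to one core, per-core figures from mpstat
become much easier to attribute to guest versus housekeeping work.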


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.




Re: [PATCH 1/3] virtio: find_vqs/del_vqs virtio operations

2009-05-13 Thread Avi Kivity

Michael S. Tsirkin wrote:
> On Wed, May 13, 2009 at 10:47:08AM +0930, Rusty Russell wrote:
> > On Wed, 13 May 2009 01:03:30 am Michael S. Tsirkin wrote:
> > > On Wed, May 13, 2009 at 12:00:02AM +0930, Rusty Russell wrote
> > > > and perhaps consider
> > > > varargs for the callbacks (or would that be too horrible at the
> > > > implementation end?)
> > > >
> > > > Thanks,
> > > > Rusty.
> > >
> > > Ugh ... I think it will be. And AFAIK gcc generates a lot of code
> > > for varargs - not something we want to do in each interrupt handler.
> >
> > Err, no I mean for find_vqs:  eg.
> >       (block device)
> >       err = vdev->config->find_vqs(vdev, 1, &vblk->vq, blk_done);
> >
> >       (net device)
> >       err = vdev->config->find_vqs(vdev, 3, vqs, skb_recv_done,
> >                                    skb_xmit_done, NULL);
> >
> > A bit neater for the single-queue case.
> >
> > Cheers,
> > Rusty.
>
> Oh. I see. But it becomes messy now that we also need to pass in the
> names, and we lose type safety.
> Let's just add a helper function for the single vq case?
>
> static inline struct virtqueue *virtio_find_vq(struct virtio_device *vdev,
>                                                vq_callback_t *c, const char *n)
> {
>         vq_callback_t *callbacks[] = { c };
>         const char *names[] = { n };
>         struct virtqueue *vq;
>         int err = vdev->config->find_vqs(vdev, 1, &vq, callbacks, names);
>         if (err < 0)
>                 return ERR_PTR(err);
>         return vq;
> }

Much saner.
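
To see the helper pattern in isolation, here is a userspace mock-up.  It is
only a sketch: the struct definitions are cut-down stand-ins for the kernel
types, ERR_PTR/PTR_ERR/IS_ERR are simplified local versions, and
demo_find_vqs() and failing_find_vqs() are hypothetical transports invented
for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace versions of the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long err) { return (void *)err; }
static inline long PTR_ERR(const void *p) { return (long)p; }
static inline int IS_ERR(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Cut-down stand-ins for the virtio types. */
struct virtqueue { const char *name; };
typedef void vq_callback_t(struct virtqueue *);
struct virtio_device;
struct virtio_config_ops {
    int (*find_vqs)(struct virtio_device *vdev, unsigned nvqs,
                    struct virtqueue *vqs[], vq_callback_t *callbacks[],
                    const char *names[]);
};
struct virtio_device { const struct virtio_config_ops *config; };

/* The single-queue helper: wraps the array-based operation and folds
 * the error code into the returned pointer. */
static inline struct virtqueue *virtio_find_vq(struct virtio_device *vdev,
                                               vq_callback_t *c, const char *n)
{
    vq_callback_t *callbacks[] = { c };
    const char *names[] = { n };
    struct virtqueue *vq;
    int err = vdev->config->find_vqs(vdev, 1, &vq, callbacks, names);

    if (err < 0)
        return ERR_PTR(err);
    return vq;
}

/* Hypothetical transport that hands out one static queue. */
static struct virtqueue demo_vq;
static int demo_find_vqs(struct virtio_device *vdev, unsigned nvqs,
                         struct virtqueue *vqs[], vq_callback_t *callbacks[],
                         const char *names[])
{
    (void)vdev;
    (void)callbacks;
    if (nvqs != 1)
        return -ENOENT;
    demo_vq.name = names[0];
    vqs[0] = &demo_vq;
    return 0;
}

/* Hypothetical transport that always fails, to exercise the error path. */
static int failing_find_vqs(struct virtio_device *vdev, unsigned nvqs,
                            struct virtqueue *vqs[], vq_callback_t *callbacks[],
                            const char *names[])
{
    (void)vdev; (void)nvqs; (void)vqs; (void)callbacks; (void)names;
    return -ENOMEM;
}
```

Compared with the varargs variant, the callback and the name stay typed
parameters, and a single-queue driver only has to check IS_ERR() on the
result.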

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH 0/2] Deal with shadow interrupts after emulated instructions

2009-05-13 Thread Avi Kivity

Glauber Costa wrote:
> Same story, more avi's comments merged.

Applied, thanks.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [patch 0/3] locking fixes / cr3 validation v3

2009-05-13 Thread Avi Kivity

mtosa...@redhat.com wrote:
> Addressing comments.

Applied all.  But please fix your From: header.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [patch 1/1] kvm: expand on help info to specify kvm intel and amd module names

2009-05-13 Thread Avi Kivity

a...@linux-foundation.org wrote:
> From: Robert P. J. Day rpj...@crashcourse.ca
>
> Signed-off-by: Robert P. J. Day rpj...@crashcourse.ca
> Cc: Avi Kivity a...@redhat.com
> Signed-off-by: Andrew Morton a...@linux-foundation.org

Applied, thanks.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH 0/6] kvm-s390: collection of kvm-s390 fixes - v3

2009-05-13 Thread Avi Kivity

ehrha...@linux.vnet.ibm.com wrote:
> From: Christian Ehrhardt ehrha...@de.ibm.com
>
> *updates in v3*
> - fix memory slot vs. run uses trylock to avoid a potential livelock
> - fix memory slot vs. run checks if it is the first and only memslot registered
>
> *updates in v2*
> - hrtimer wakeup use a more accurate calculation
> - unlink vcpu uses smp_mb so the pointer is really zero when the page is freed
>
> This is a collection of fixes for kvm-s390 that originate from several tests
> made in the last few months. They have now been tested for a while and should
> be ready to be merged.

Applied all, thanks.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH 4/4] Userspace changes for KVM HPET (v3)

2009-05-13 Thread Avi Kivity

Beth Kon wrote:
> Beth Kon wrote:
> > Avi Kivity wrote:
> > > Beth Kon wrote:
> > > > Signed-off-by: Beth Kon e...@us.ibm.com
> > > >
> > > > diff --git a/hw/hpet.c b/hw/hpet.c
> > > > index c7945ec..100abf5 100644
> > > > --- a/hw/hpet.c
> > > > +++ b/hw/hpet.c
> > > > @@ -30,6 +30,7 @@
> > > >  #include "console.h"
> > > >  #include "qemu-timer.h"
> > > >  #include "hpet_emul.h"
> > > > +#include "qemu-kvm.h"
> > > >
> > > >  //#define HPET_DEBUG
> > > >
> > > >  #ifdef HPET_DEBUG
> > > > @@ -48,6 +49,28 @@ uint32_t hpet_in_legacy_mode(void)
> > > >      return 0;
> > > >  }
> > > >
> > > > +static void hpet_legacy_enable(void)
> > > > +{
> > > > +    if (qemu_kvm_pit_in_kernel()) {
> > > > +        kvm_kpit_disable();
> > > > +        dprintf("qemu: hpet disabled kernel pit\n");
> > > > +    } else {
> > > > +        hpet_pit_disable();
> > > > +        dprintf("qemu: hpet disabled userspace pit\n");
> > > > +    }
> > > > +}
> > > > +
> > > > +static void hpet_legacy_disable(void)
> > > > +{
> > > > +    if (qemu_kvm_pit_in_kernel()) {
> > > > +        kvm_kpit_enable();
> > > > +        dprintf("qemu: hpet enabled kernel pit\n");
> > > > +    } else {
> > > > +        hpet_pit_enable();
> > > > +        dprintf("qemu: hpet enabled userspace pit\n");
> > > > +    }
> > > > +}
> > >
> > > I think it's better to move these into hpet_pit_disable() and
> > > hpet_pit_enable().  This avoids changing the calls below, and puts
> > > pit stuff in i8254.c instead of hpet.c.
> > >
> > > Might also need to be called from hpet_load(); probably a problem in
> > > upstream as well.
> >
> > My assumption about hpet_load was that the correct pit state would be
> > established via pit_load (since all saves/loads are done together).
> > But when I wrote this, I was thinking only about the userspace pit
> > (for qemu). I'm not sure how the load concept applies to kernel
> > state.  Do I need to explicitly re-enable or disable the kernel pit
> > during load?
>
> Looking further at the code, it looks like kvm_pit_load should take
> care of this. Agree?

It doesn't save/load the enabled bit, does it?

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH 4/4] Userspace changes for KVM HPET (v3)

2009-05-13 Thread Avi Kivity

Avi Kivity wrote:
> > > My assumption about hpet_load was that the correct pit state would
> > > be established via pit_load (since all saves/loads are done
> > > together).  But when I wrote this, I was thinking only about the
> > > userspace pit (for qemu). I'm not sure how the load concept
> > > applies to kernel state.  Do I need to explicitly re-enable or
> > > disable the kernel pit during load?
> >
> > Looking further at the code, it looks like kvm_pit_load should take
> > care of this. Agree?
>
> It doesn't save/load the enabled bit, does it?

Also, we might migrate between a host with pit-in-kernel and a host with
pit-in-userspace, so this should be handled at the pit level, not kvm.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [GIT PULL] KVM fixes for 2.6.30rc3

2009-05-13 Thread Avi Kivity

Avi Kivity wrote:
> Linus, please pull repo and branch at

Typo in $subject, branch is against recent git.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



RE: Enable IRQ windows after exception injection if there are pending virq

2009-05-13 Thread Dong, Eddie
Gleb Natapov wrote:
> On Tue, May 12, 2009 at 11:06:39PM +0800, Dong, Eddie wrote:
> > I didn't run many tests since our PTS system has stopped working due to
> > KVM userspace build changes. But since the logic is pretty simple, I
> > want to post here to see comments. Thx, eddie
> >
> > If there is a pending irq after a virtual exception is injected,
> > KVM needs to enable IRQ window to trap back earlier once
> > the exception is handled.
>
> I already posted patch to do that
> http://patchwork.kernel.org/patch/21830/ Is your patch different?

Is it based on the idea I mentioned to you in private mail (April 27), or a
novel one?



Re: [PATCH -tip v5 1/7] x86: instruction decorder API

2009-05-13 Thread Gleb Natapov
On Fri, May 08, 2009 at 08:48:42PM -0400, Masami Hiramatsu wrote:
> +++ b/arch/x86/scripts/gen-insn-attr-x86.awk
> @@ -0,0 +1,314 @@
> +#!/bin/awk -f
On some distributions (debian) it is /usr/bin/awk.

--
Gleb.


[PATCH] kvm: user: include arch specific headers from $(KERNELDIR)

2009-05-13 Thread Mark McLoughlin
Currently we only include $(KERNELDIR)/include in CFLAGS,
but we also need $(KERNELDIR)/arch/$(arch)/include or else
we'll get mis-matched headers.

Signed-off-by: Mark McLoughlin mar...@redhat.com
---
 kvm/user/config-i386.mak   |1 -
 kvm/user/config-ia64.mak   |1 +
 kvm/user/config-powerpc.mak|1 +
 kvm/user/config-x86-common.mak |2 ++
 kvm/user/config-x86_64.mak |1 -
 5 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/kvm/user/config-i386.mak b/kvm/user/config-i386.mak
index 09175d5..eebb9de 100644
--- a/kvm/user/config-i386.mak
+++ b/kvm/user/config-i386.mak
@@ -3,7 +3,6 @@ cstart.o = $(TEST_DIR)/cstart.o
 bits = 32
 ldarch = elf32-i386
 CFLAGS += -D__i386__
-CFLAGS += -I $(KERNELDIR)/include
 
 tests=
 
diff --git a/kvm/user/config-ia64.mak b/kvm/user/config-ia64.mak
index c4c639e..e8803a0 100644
--- a/kvm/user/config-ia64.mak
+++ b/kvm/user/config-ia64.mak
@@ -2,6 +2,7 @@ bits = 64
 CFLAGS += -m64
 CFLAGS += -D__ia64__
 CFLAGS += -I $(KERNELDIR)/include
+CFLAGS += -I $(KERNELDIR)/arch/ia64/include
 
 all:
 
diff --git a/kvm/user/config-powerpc.mak b/kvm/user/config-powerpc.mak
index dd7ef54..589aa61 100644
--- a/kvm/user/config-powerpc.mak
+++ b/kvm/user/config-powerpc.mak
@@ -1,4 +1,5 @@
 CFLAGS += -I $(KERNELDIR)/include
+CFLAGS += -I $(KERNELDIR)/arch/powerpc/include
 CFLAGS += -Wa,-mregnames -I test/lib
 CFLAGS += -ffreestanding
 
diff --git a/kvm/user/config-x86-common.mak b/kvm/user/config-x86-common.mak
index e789fd4..8d8fadf 100644
--- a/kvm/user/config-x86-common.mak
+++ b/kvm/user/config-x86-common.mak
@@ -12,6 +12,8 @@ cflatobjs += \
 $(libcflat): LDFLAGS += -nostdlib
 $(libcflat): CFLAGS += -ffreestanding -I test/lib
 
+CFLAGS += -I $(KERNELDIR)/include
+CFLAGS += -I $(KERNELDIR)/arch/x86/include
 CFLAGS += -m$(bits)
 
 FLATLIBS = test/lib/libcflat.a $(libgcc)
diff --git a/kvm/user/config-x86_64.mak b/kvm/user/config-x86_64.mak
index b50b540..d88f54c 100644
--- a/kvm/user/config-x86_64.mak
+++ b/kvm/user/config-x86_64.mak
@@ -3,7 +3,6 @@ cstart.o = $(TEST_DIR)/cstart64.o
 bits = 64
 ldarch = elf64-x86-64
 CFLAGS += -D__x86_64__
-CFLAGS += -I $(KERNELDIR)/include
 
 tests = $(TEST_DIR)/access.flat $(TEST_DIR)/irq.flat $(TEST_DIR)/sieve.flat \
   $(TEST_DIR)/simple.flat $(TEST_DIR)/stringio.flat \
-- 
1.6.0.6



[PATCH][Resend] Fix Warnining in arch/x86/kvm/vmx.c

2009-05-13 Thread Subrata Modak
Hi Avi/Yaniv,

With gcc --version 4.4.1 20090429 (prerelease)

I get the following warning:
arch/x86/kvm/vmx.c: In function ‘vmx_intr_assist’:
arch/x86/kvm/vmx.c:3233: warning: ‘max_irr’ may be used uninitialized in 
this function
arch/x86/kvm/vmx.c:3233: note: ‘max_irr’ was declared here

Investigation found that:

3231 static void update_tpr_threshold(struct kvm_vcpu *vcpu)
3232 {
3233         int max_irr, tpr;
3234
3235         if (!vm_need_tpr_shadow(vcpu->kvm))
3236                 return;
3237
3238         if (!kvm_lapic_enabled(vcpu) ||
3239             ((max_irr = kvm_lapic_find_highest_irr(vcpu)) == -1)) {

(max_irr = kvm_lapic_find_highest_irr(vcpu)) == -1

may not get a chance to evaluate if:

!kvm_lapic_enabled(vcpu)

evaluates to true (as the expressions are Or-ed).

3240                 vmcs_write32(TPR_THRESHOLD, 0);
3241                 return;
3242         }
3243
3244         tpr = (kvm_lapic_get_cr8(vcpu) & 0x0f) << 4;
3245         vmcs_write32(TPR_THRESHOLD, (max_irr > tpr) ? tpr >> 4 : max_irr >> 4);

Using (max_irr > tpr) and max_irr >> 4, without max_irr getting initialized can
cause trouble.

3246 }

I would like to propose a small fix for this by interchanging the
operands in ||, so that max_irr is initialized in all instances,
and, the warning fades away, without compromising the criteria of
conditional evaluation inside if().

Signed-Off-By: Subrata Modak subr...@linux.vnet.ibm.com,
To: Avi Kivity a...@qumranet.com
To: Yaniv Kamay  ya...@qumranet.com
To: kvm@vger.kernel.org
Cc: Balbir Singh bal...@linux.vnet.ibm.com
Cc: Sachin P Sant sach...@linux.vnet.ibm.com
Subject: [PATCH][Resend] Fix Warnining in arch/x86/kvm/vmx.c
---

--- a/arch/x86/kvm/vmx.c2009-05-12 15:28:42.0 +0530
+++ b/arch/x86/kvm/vmx.c2009-05-12 15:51:02.0 +0530
@@ -3235,8 +3235,8 @@ static void update_tpr_threshold(struct 
        if (!vm_need_tpr_shadow(vcpu->kvm))
                return;
 
-       if (!kvm_lapic_enabled(vcpu) ||
-           ((max_irr = kvm_lapic_find_highest_irr(vcpu)) == -1)) {
+       if (((max_irr = kvm_lapic_find_highest_irr(vcpu)) == -1) ||
+           !kvm_lapic_enabled(vcpu)) {
                vmcs_write32(TPR_THRESHOLD, 0);
                return;
        }

---
Regards--
Subrata



Re: [PATCH -tip v5 1/7] x86: instruction decorder API

2009-05-13 Thread Przemysław Pawełczyk
On Wed, May 13, 2009 at 10:23, Gleb Natapov g...@redhat.com wrote:
> On Fri, May 08, 2009 at 08:48:42PM -0400, Masami Hiramatsu wrote:
> > +++ b/arch/x86/scripts/gen-insn-attr-x86.awk
> > @@ -0,0 +1,314 @@
> > +#!/bin/awk -f
> On some distributions (debian) it is /usr/bin/awk.

True, but on most of them (all?) there is also an appropriate link in /bin.
If shebang could have more than one argument, then '/usr/bin/env awk
-f' would be the best solution I think.

-- 
Przemysław Pawełczyk


Re: [PATCH][Resend] Fix Warnining in arch/x86/kvm/vmx.c

2009-05-13 Thread Avi Kivity

Subrata Modak wrote:
> Hi Avi/Yaniv,
>
> With gcc --version 4.4.1 20090429 (prerelease)
>
> I get the following warning:
> arch/x86/kvm/vmx.c: In function ‘vmx_intr_assist’:
> arch/x86/kvm/vmx.c:3233: warning: ‘max_irr’ may be used uninitialized in this
> function
> arch/x86/kvm/vmx.c:3233: note: ‘max_irr’ was declared here
>
> Investigation found that:
>
> 3231 static void update_tpr_threshold(struct kvm_vcpu *vcpu)
> 3232 {
> 3233         int max_irr, tpr;
> 3234
> 3235         if (!vm_need_tpr_shadow(vcpu->kvm))
> 3236                 return;
> 3237
> 3238         if (!kvm_lapic_enabled(vcpu) ||
> 3239             ((max_irr = kvm_lapic_find_highest_irr(vcpu)) == -1)) {

This function no longer exists; can you check if the current code is
susceptible?

> (max_irr = kvm_lapic_find_highest_irr(vcpu)) == -1
>
> may not get a chance to evaluate if:
>
> !kvm_lapic_enabled(vcpu)
>
> evaluates to true (as the expressions are Or-ed).
>
> 3240                 vmcs_write32(TPR_THRESHOLD, 0);
> 3241                 return;
> 3242         }
> 3243
> 3244         tpr = (kvm_lapic_get_cr8(vcpu) & 0x0f) << 4;
> 3245         vmcs_write32(TPR_THRESHOLD, (max_irr > tpr) ? tpr >> 4 : max_irr >> 4);
>
> Using (max_irr > tpr) and max_irr >> 4, without max_irr getting initialized can
> cause trouble.

With !kvm_lapic_enabled(), TPR_THRESHOLD is meaningless.
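
For what it's worth, C's short-circuit rules already guarantee that max_irr
is assigned on every path that goes on to read it: when the left operand of
|| is true the function returns early, and when it is false the right
operand (the assignment) has to run.  A standalone sketch of the same control
flow, with made-up stub functions in place of kvm_lapic_enabled() and
kvm_lapic_find_highest_irr() and a return value in place of the vmcs write:

```c
#include <assert.h>

/* Hypothetical stand-ins for the lapic helpers; find_calls counts how
 * often the right-hand operand of || was actually evaluated. */
static int find_calls;
static int lapic_enabled_stub(int enabled) { return enabled; }
static int find_highest_irr_stub(int irr) { find_calls++; return irr; }

/* Same shape as update_tpr_threshold(), returning the value that would
 * be written to TPR_THRESHOLD. */
static int tpr_threshold(int lapic_enabled, int highest_irr, int cr8)
{
    int max_irr, tpr;

    if (!lapic_enabled_stub(lapic_enabled) ||
        ((max_irr = find_highest_irr_stub(highest_irr)) == -1))
        return 0;   /* early exit: max_irr may be unset, but is never read */

    /* Getting here means the assignment above was evaluated. */
    tpr = (cr8 & 0x0f) << 4;
    return (max_irr > tpr) ? tpr >> 4 : max_irr >> 4;
}
```

Swapping the operands, as the proposed patch does, would silence the warning
but also changes behaviour slightly: the highest-IRR lookup would then run
even when the lapic is disabled.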


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH -tip v5 1/7] x86: instruction decorder API

2009-05-13 Thread Gleb Natapov
On Wed, May 13, 2009 at 11:35:16AM +0200, Przemysław Pawełczyk wrote:
> On Wed, May 13, 2009 at 10:23, Gleb Natapov g...@redhat.com wrote:
> > On Fri, May 08, 2009 at 08:48:42PM -0400, Masami Hiramatsu wrote:
> > > +++ b/arch/x86/scripts/gen-insn-attr-x86.awk
> > > @@ -0,0 +1,314 @@
> > > +#!/bin/awk -f
> > On some distributions (debian) it is /usr/bin/awk.
>
> True, but on most of them (all?) there is also an appropriate link in /bin.
Nope, not on debian testing. Although I assume if kernel compilation
will start to fail it will appear :)

> If shebang could have more than one argument, then '/usr/bin/env awk
> -f' would be the best solution I think.
 

--
Gleb.


Re: [PATCH v4] kvm: Use a bitmap for tracking used GSIs

2009-05-13 Thread Avi Kivity

Alex Williamson wrote:
> We're currently using a counter to track the most recent GSI we've
> handed out.  This quickly hits KVM_MAX_IRQ_ROUTES when using device
> assignment with a driver that regularly toggles the MSI enable bit.
> This can mean only a few minutes of usable run time.  Instead, track
> used GSIs in a bitmap.
>
> Signed-off-by: Alex Williamson alex.william...@hp.com
> ---
>
>  v2: Added mutex to protect gsi bitmap

Why is the mutex needed?  We already have mutex protection in qemu.

How often does the driver enable/disable the MSI (and, do you know why)?
If it's often enough it may justify kernel support.  (We'll need this
patch in any case for kernels without this new support).


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.
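
To illustrate the change discussed in this thread: a counter that only ever increases runs out once the limit is reached, while a bitmap lets freed GSIs be handed out again. The sketch below is a simplified model in Python, not the actual patch; the class name `GsiBitmap` and the tiny limit of 8 are assumptions chosen for illustration.

```python
class GsiBitmap:
    """Track used GSIs in a bitmap so that freed numbers can be reused.

    Illustrative model only; names and sizes are assumptions, not the
    code from the patch under review.
    """

    def __init__(self, max_gsi):
        self.max_gsi = max_gsi
        self.bits = 0  # bit n set means GSI n is in use

    def alloc(self):
        """Return the lowest free GSI, or raise if all are in use."""
        for gsi in range(self.max_gsi):
            if not (self.bits >> gsi) & 1:
                self.bits |= 1 << gsi
                return gsi
        raise RuntimeError("no free GSI")

    def free(self, gsi):
        """Mark a GSI as free again."""
        self.bits &= ~(1 << gsi)


# A driver that toggles MSI allocates and frees one GSI on every toggle.
gsis = GsiBitmap(max_gsi=8)
for _ in range(100):
    gsi = gsis.alloc()
    gsis.free(gsi)
print(gsis.alloc())
```

With a plain counter, the loop above would have failed after eight iterations; with the bitmap, the same slot is reused each time.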



Re: event injection MACROs

2009-05-13 Thread Avi Kivity

Dong, Eddie wrote:

I noticed that the macros for SVM vmcb->control.event_inj and VMX VM_EXIT_INTR_INFO 
are almost the same. I need to query the event injection state in common 
code, so I plan to expose this register read/write to x86.c.  Should we define a new 
common KVM format for event_inj/VM_EXIT_INTR_INFO, or just move the 
original macros to kvm_host.h?

  


This is dangerous if additional bits or field values are defined by 
either architecture.  Better to use accessors.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.
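
The accessor suggestion can be sketched as follows: common code asks each architecture-specific wrapper for the fields it needs and never looks at the raw bits, so either layout can later grow new bits without affecting callers. This is a minimal Python model, not KVM code; the class names are hypothetical, and the bit positions (vector in bits 7:0, valid flag in bit 31) are stated here as assumptions about both formats.

```python
VECTOR_MASK = 0xFF   # assumption: vector lives in bits 7:0 in both layouts
VALID_BIT = 1 << 31  # assumption: valid flag is bit 31 in both layouts


class SvmEventInj:
    """Hypothetical wrapper around a raw SVM event_inj value."""

    def __init__(self, raw):
        self.raw = raw

    def is_valid(self):
        return bool(self.raw & VALID_BIT)

    def vector(self):
        return self.raw & VECTOR_MASK


class VmxIntrInfo:
    """Hypothetical wrapper around a raw VMX interruption-info value."""

    def __init__(self, raw):
        self.raw = raw

    def is_valid(self):
        return bool(self.raw & VALID_BIT)

    def vector(self):
        return self.raw & VECTOR_MASK


def pending_vector(event):
    """Common code: works with either wrapper and never touches raw bits."""
    return event.vector() if event.is_valid() else None


print(pending_vector(SvmEventInj(0x80000000 | 14)))
print(pending_vector(VmxIntrInfo(0x0000000E)))
```

If one architecture later defines a field the other lacks, only its own wrapper changes.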



Re: [ANNOUNCE] qemu-kvm-0.10.4

2009-05-13 Thread Farkas Levente
Mark McLoughlin wrote:
 On Tue, 2009-05-12 at 22:30 +0200, Farkas Levente wrote:
 Avi Kivity wrote:
 qemu-kvm-0.10.4 is now available.  This is the first release of the 0.10
 stable branch of qemu-kvm.  The qemu-kvm 0.10.4 includes all of the
 features and fixes of qemu-0.10.4, plus adaptations for improved kvm
 support.

 Note that qemu-kvm releases do not include the kvm external modules
 (kvm*.ko); you can use the modules provided by your distribution,
 modules from the development releases (kvm-xx), or from the kvm-kmod
 stable branch releases once they become available.

 As this is the first release of this branch there is no changelog;
 qemu-kvm-0.10.4 is roughly equivalent (but is not identical) to qemu
 from kvm-84.
 Is this the plan, i.e. the stable userspace will be roughly kvm-84?
 What's the plan for the kvm-kmod release date, and will it also be somewhere around kvm-84?
 
 AIUI, the plan is:
 
   - There will be stable releases of qemu-kvm in sync with qemu 
 upstream releases - e.g. you can expect a qemu-kvm-0.11.0 release
 shortly after qemu-0.11.0 is released
 
   - There will be no stable releases, as such, of the kernel module. 
 You should use upstream linux releases instead - e.g. the latest
 stable release is 2.6.29.2
 
   - The kvm-XX releases are development snapshots of the kvm.git and 
 qemu-kvm.git code
 
 For example, in Fedora, our plan is that we will ship the kvm.ko
 included in upstream linux releases and the qemu-kvm stable releases[1].
 We may include qemu-kvm from kvm-XX releases during the development of
 the next Fedora release, but only as a preview of the next qemu-kvm
 stable release.

And what's the plan for RHEL 5.4? In that case the latest stable kernel
can't be used, since the 5.x series is always 2.6.18-based, and if there is
no stable kvm-kmod branch, then...?

-- 
  Levente   Si vis pacem para bellum!


Re: Enable IRQ windows after exception injection if there are pending virq

2009-05-13 Thread Gleb Natapov
On Wed, May 13, 2009 at 03:45:37PM +0800, Dong, Eddie wrote:
 Gleb Natapov wrote:
  On Tue, May 12, 2009 at 11:06:39PM +0800, Dong, Eddie wrote:
  
  I didn't run many tests, since our PTS system has stopped working due to
  KVM userspace build changes. But since the logic is pretty simple, I want to
  post it here for comments. Thx, eddie 
  
  
  
  
  If there is a pending irq after a virtual exception is injected,
  KVM needs to enable the IRQ window to trap back earlier once
  the exception is handled.
  
  I already posted a patch to do that:
  http://patchwork.kernel.org/patch/21830/ Is your patch different?
  
 Is it based on the idea I mentioned to you in private mail (April 27), or a 
 novel one?
 
Yes. It fixes the bug you pointed out.

--
Gleb.


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-13 Thread Gregory Haskins
Anthony Liguori wrote:
 Gregory Haskins wrote:

 So, yes, the delta from PIO to HC is 350ns.  Yes, this is a ~1.4%
 improvement.  So what?  It's still an improvement.  If that improvement
 were for free, would you object?  And we all know that this change isn't
 free because we have to change some code (+128/-0, to be exact).  But
 what is it specifically you are objecting to in the first place?  Adding
 hypercall support as a pv_ops primitive isn't exactly hard or complex,
 or even very much code.
   

 Where does 25us come from?  The number you post below are 33us and 66us.

snip

The 25us is approximately the max from an in-kernel harness strapped
directly to the driver gathered informally during testing.  The 33us is
from formally averaging multiple runs of a userspace socket app in
preparation for publishing.  I consider the 25us the target goal since
there is obviously overhead that a socket application deals with that
theoretically a guest bypasses with the tap-device.  Note that the
socket application often sees < 30us itself...this was just a
particularly slow set of runs that day.

Note that this is why I express the impact as approximately (e.g.
~4%).  Sorry for the confusion.

-Greg
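
As a worked check of the figures in this exchange: a 350ns saving measured against a 25us round trip is 1.4%, and the share shrinks as the round trip grows. The sketch below just redoes that arithmetic; the function name is hypothetical, and the 25us, 33us and 66us inputs are the values quoted in the thread.

```python
def improvement_percent(saved_ns, round_trip_us):
    """Saving as a percentage of the round-trip time.

    Hypothetical helper; inputs are the numbers quoted in the thread.
    """
    return 100.0 * saved_ns / (round_trip_us * 1000.0)


# 350ns is the quoted PIO-to-hypercall delta.
for rtt_us in (25, 33, 66):
    print("%d us round trip: %.2f%%" % (rtt_us, improvement_percent(350, rtt_us)))
```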





[PATCH][KVM-AUTOTEST] TAP network support in kvm-autotest

2009-05-13 Thread Jason Wang
Hi All:
This patch adds tap network support to kvm-autotest. With it, multiple NICs 
can be connected to different bridges. A public bridge is important for 
testing real network traffic and migration. The patch gives each NIC a 
randomly generated MAC address. The IP address required by the tests can be 
probed dynamically through nmap/arp. Only the IP address of the first NIC is 
used during the test.
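
The ARP-based probing mentioned above could look roughly like the following sketch, which searches captured `arp -a` output for a MAC address and returns the IP address on the same line. The helper name `ip_for_mac` and the sample output line are assumptions for illustration; the exact `arp -a` format varies between systems.

```python
import re


def ip_for_mac(arp_output, macaddr):
    """Return the first IP address listed next to macaddr, or None.

    Hypothetical helper; assumes each entry carries the IP address in
    parentheses, as in the sample line below.
    """
    wanted = macaddr.lower()
    for line in arp_output.splitlines():
        if wanted in line.lower():
            match = re.search(r"\((\d+\.\d+\.\d+\.\d+)\)", line)
            if match:
                return match.group(1)
    return None


# Assumed sample of 'arp -a' output; real output may be formatted differently.
sample = "? (192.168.122.45) at 00:16:30:01:02:03 [ether] on switch\n"
print(ip_for_mac(sample, "00:16:30:01:02:03"))
```

Passing the command output in as a string keeps the parsing step testable without a running guest.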

Example:
nics = nic1 nic2
network = bridge
bridge = switch
ifup =/etc/qemu-ifup-switch
ifdown =/etc/qemu-ifdown-switch

This gives the virtual machine two NICs, both connected to a bridge named 
'switch'. The ifup/ifdown scripts are also specified.

Another Example:
nics = nic1 nic2
network = bridge
bridge = switch
bridge_nic2 = virbr0
ifup =/etc/qemu-ifup-switch
ifup_nic2 = /etc/qemu-ifup-virbr0

This gives the virtual machine two NICs: nic1 is connected to 
bridge 'switch' and nic2 is connected to bridge 'virbr0'.

Public (bridged) and user-mode NICs can also be mixed:
nics = nic1 nic2
network = bridge
network_nic2 = user

Looking forward to comments and suggestions.
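
The per-NIC overrides in the examples above (`bridge_nic2`, `ifup_nic2`, `network_nic2`) follow a simple rule: a key suffixed with the NIC name wins over the plain key. A minimal sketch of that lookup, assuming a flat dictionary of parameters; `nic_param` is a hypothetical helper and is not the `kvm_utils.get_sub_dict` function used by the patch.

```python
def nic_param(params, nic_name, key, default=None):
    """Look up key for one NIC, preferring 'key_<nic>' over the plain key.

    Hypothetical helper written for illustration only.
    """
    specific = "%s_%s" % (key, nic_name)
    if specific in params:
        return params[specific]
    return params.get(key, default)


# Values copied from the second example in this message.
params = {
    "nics": "nic1 nic2",
    "network": "bridge",
    "bridge": "switch",
    "bridge_nic2": "virbr0",
    "ifup": "/etc/qemu-ifup-switch",
    "ifup_nic2": "/etc/qemu-ifup-virbr0",
}
for nic in params["nics"].split():
    print(nic, nic_param(params, nic, "bridge"), nic_param(params, nic, "ifup"))
```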

From: jason jasow...@redhat.com
Date: Wed, 13 May 2009 16:15:28 +0800
Subject: [PATCH] Add tap networking support.

---
 client/tests/kvm_runtest_2/kvm_utils.py |7 +++
 client/tests/kvm_runtest_2/kvm_vm.py|   74 ++-
 2 files changed, 69 insertions(+), 12 deletions(-)

diff --git a/client/tests/kvm_runtest_2/kvm_utils.py b/client/tests/kvm_runtest_2/kvm_utils.py
index be8ad95..0d1f7f8 100644
--- a/client/tests/kvm_runtest_2/kvm_utils.py
+++ b/client/tests/kvm_runtest_2/kvm_utils.py
@@ -773,3 +773,10 @@ def md5sum_file(filename, size=None):
 size -= len(data)
 f.close()
 return o.hexdigest()
+
+def random_mac():
+    mac=[0x00,0x16,0x30,
+         random.randint(0x00,0x09),
+         random.randint(0x00,0x09),
+         random.randint(0x00,0x09)]
+    return ':'.join(map(lambda x: "%02x" % x,mac))
diff --git a/client/tests/kvm_runtest_2/kvm_vm.py b/client/tests/kvm_runtest_2/kvm_vm.py
index fab839f..ea7dab6 100644
--- a/client/tests/kvm_runtest_2/kvm_vm.py
+++ b/client/tests/kvm_runtest_2/kvm_vm.py
@@ -105,6 +105,10 @@ class VM:
         self.qemu_path = qemu_path
         self.image_dir = image_dir
         self.iso_dir = iso_dir
+        self.macaddr = []
+        for nic_name in kvm_utils.get_sub_dict_names(params,"nics"):
+            macaddr = kvm_utils.random_mac()
+            self.macaddr.append(macaddr)
 
     def verify_process_identity(self):
         """Make sure .pid really points to the original qemu process.
@@ -189,9 +193,25 @@ class VM:
         for nic_name in kvm_utils.get_sub_dict_names(params, "nics"):
             nic_params = kvm_utils.get_sub_dict(params, nic_name)
             qemu_cmd += " -net nic,vlan=%d" % vlan
+            net = nic_params.get("network")
+            if net == "bridge":
+                qemu_cmd += ",macaddr=%s" % self.macaddr[vlan]
             if nic_params.get("nic_model"):
                 qemu_cmd += ",model=%s" % nic_params.get("nic_model")
-            qemu_cmd += " -net user,vlan=%d" % vlan
+            if net == "bridge":
+                qemu_cmd += " -net tap,vlan=%d" % vlan
+                ifup = nic_params.get("ifup")
+                if ifup:
+                    qemu_cmd += ",script=%s" % ifup
+                else:
+                    qemu_cmd += ",script=/etc/qemu-ifup"
+                ifdown = nic_params.get("ifdown")
+                if ifdown:
+                    qemu_cmd += ",downscript=%s" % ifdown
+                else:
+                    qemu_cmd += ",downscript=no"
+            else:
+                qemu_cmd += " -net user,vlan=%d" % vlan
             vlan += 1
 
         mem = params.get("mem")
@@ -206,11 +226,11 @@ class VM:
         extra_params = params.get("extra_params")
         if extra_params:
             qemu_cmd += " %s" % extra_params
-
+
         for redir_name in kvm_utils.get_sub_dict_names(params, "redirs"):
             redir_params = kvm_utils.get_sub_dict(params, redir_name)
             guest_port = int(redir_params.get("guest_port"))
-            host_port = self.get_port(guest_port)
+            host_port = self.get_port(guest_port,True)
             qemu_cmd += " -redir tcp:%s::%s" % (host_port, guest_port)
 
         if params.get("display") == "vnc":
@@ -467,27 +487,57 @@ class VM:
         If port redirection is used, return 'localhost' (the guest has no IP
         address of its own).  Otherwise return the guest's IP address.
         """
-        # Currently redirection is always used, so return 'localhost'
-        return "localhost"
+        if self.params.get("network") == "bridge":
+            # probing ip address through arp
+            bridge_name = self.params['bridge']
+            macaddr = self.macaddr[0]
+            lines = os.popen("arp -a").readlines()
+            for line in lines:
+                if macaddr in line:
+                    return 

Re: [PATCH 1/3] virtio: find_vqs/del_vqs virtio operations

2009-05-13 Thread Rusty Russell
On Wed, 13 May 2009 04:48:34 pm Michael S. Tsirkin wrote:
 Let's just add a helper function for the single vq case?

 static inline struct virtqueue *virtio_find_vq(struct virtio_device *vdev,
  vq_callback_t *c, const char *n)

virtio_find_single_vq() to emphasize the singular nature, and it looks good.

Thanks!
Rusty.


Re: [PATCH 1/1] KVM: Fix potentially recursively get kvm lock

2009-05-13 Thread Gleb Natapov
On Tue, May 12, 2009 at 11:30:21AM -0300, Marcelo Tosatti wrote:
 On Tue, May 12, 2009 at 10:13:36PM +0800, Yang, Sheng wrote:
+   mutex_unlock(&kvm->lock);
  
   assigned_dev list is protected by kvm->lock. So you could have another
   ioctl adding to it at the same time you're searching.
  
  Oh, yes... My fault... 
  
   Could either have a separate kvm->assigned_devs_lock, to protect
   kvm->arch.assigned_dev_head (users are ioctls that manipulate it), or
   change the IRQ injection to use a separate spinlock, kill the workqueue
   and call kvm_set_irq from the assigned device interrupt handler.
  
  I prefer the latter, though it needs more work. But is the only reason for 
  putting a workqueue here that kvm->lock is a mutex? I can't believe it... If so, I 
  think we made a big mistake - we have had to fix all kinds of racy problems 
  caused by this, only to find in the end that it's unnecessary... 
 
 One issue is that kvm_set_irq can take too long while interrupts are
 blocked, and you'd have to disable interrupts in other contexts that
 inject interrupts (say qemu->ioctl(SET_INTERRUPT)->...->), so all I can
 see is a tradeoff.
 
 guess mode on
 
 But the interrupt injection path seems to be pretty short and efficient
 to happen in host interrupt context.
 
 guess mode off
 
 Avi, Gleb?
 
The interrupt injection path also uses IRQ routing data structures, so access
to them should be protected by the same lock. And of course in-kernel
device (apic/ioapic/pic) mmio is done holding this lock, so interrupt
injection cannot happen in parallel with device reconfiguration. Maybe
we want more parallelism here.

  Maybe another reason is kvm_kick_vcpu(), but that has already been fixed by you.
 
 Note you tested the spinlock_irq patch with GigE and there was no
 significant performance regression right?
 
  
  Continue to check the code...
  
  -- 
  regards
  Yang, Sheng

--
Gleb.


Re: [PATCH 01/10] Unprotect a page if #PF happens during NMI injection.

2009-05-13 Thread Avi Kivity

Gleb Natapov wrote:

It is done for exception and interrupt already.
  


Applied, thanks.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH v3] kvm: Use a bitmap for tracking used GSIs

2009-05-13 Thread Alex Williamson
On Wed, 2009-05-13 at 10:03 +0300, Michael S. Tsirkin wrote:
 On Tue, May 12, 2009 at 04:07:15PM -0600, Alex Williamson wrote:
  We're currently using a counter to track the most recent GSI we've
  handed out.  This quickly hits KVM_MAX_IRQ_ROUTES when using device
  assignment with a driver that regularly toggles the MSI enable bit.
 
 BTW, which driver does that? Any idea why?

I've seen it from both e1000e and qla2xxx.  I assumed it was some kind
of interrupt mitigation since the devices seem to work fine otherwise.

Alex



Re: [PATCH v3] kvm: Use a bitmap for tracking used GSIs

2009-05-13 Thread Alex Williamson
On Wed, 2009-05-13 at 10:04 +0300, Michael S. Tsirkin wrote:
 On Tue, May 12, 2009 at 04:07:15PM -0600, Alex Williamson wrote:
  @@ -286,6 +289,9 @@ kvm_context_t kvm_init(struct kvm_callbacks *callbacks,
  int fd;
  kvm_context_t kvm;
  int r;
  +#ifdef KVM_CAP_IRQ_ROUTING
 
 Let's kill all these ifdefs. Or at least, let's not add them.

AFAICT, they're still used both for builds against older kernels and
architectures that don't support it.  Hollis just added the one around
kvm_get_irq_route_gsi() 10 days ago to fix ppc build.  Has it since been
deprecated?  Thanks,

Alex




Re: [PATCH v4] kvm: Use a bitmap for tracking used GSIs

2009-05-13 Thread Alex Williamson
On Wed, 2009-05-13 at 12:47 +0300, Avi Kivity wrote:
 Alex Williamson wrote:
  We're currently using a counter to track the most recent GSI we've
  handed out.  This quickly hits KVM_MAX_IRQ_ROUTES when using device
  assignment with a driver that regularly toggles the MSI enable bit.
  This can mean only a few minutes of usable run time.  Instead, track
  used GSIs in a bitmap.
 
  Signed-off-by: Alex Williamson alex.william...@hp.com
  ---
 
   v2: Added mutex to protect gsi bitmap

 
 Why is the mutex needed?  We already have mutex protection in qemu.

If it's unneeded, I'll happily remove it.  I was assuming in a guest
with multiple devices these could come in parallel, but maybe the guest
is already serialized for config space accesses via cfc/cf8.

 How often does the driver enable/disable the MSI (and, do you know why)?  
 If it's often enough it may justify kernel support.  (We'll need this 
 patch in any case for kernels without this new support).

Seems like multiple times per second.  I don't know why.  Now I'm
starting to get curious why nobody else seems to be hitting this.  I'm
seeing it on an e1000e NIC and Qlogic fibre channel.  Is everyone else
using MSI-X or regular interrupts vs MSI?

Alex




Re: [PATCH v4] kvm: Use a bitmap for tracking used GSIs

2009-05-13 Thread Avi Kivity

Alex Williamson wrote:

On Wed, 2009-05-13 at 12:47 +0300, Avi Kivity wrote:
  

Alex Williamson wrote:


We're currently using a counter to track the most recent GSI we've
handed out.  This quickly hits KVM_MAX_IRQ_ROUTES when using device
assignment with a driver that regularly toggles the MSI enable bit.
This can mean only a few minutes of usable run time.  Instead, track
used GSIs in a bitmap.

Signed-off-by: Alex Williamson alex.william...@hp.com
---

 v2: Added mutex to protect gsi bitmap
  
  

Why is the mutex needed?  We already have mutex protection in qemu.



If it's unneeded, I'll happily remove it.  I was assuming in a guest
with multiple devices these could come in parallel, but maybe the guest
is already serialized for config space accesses via cfc/cf8.
  


The guest may or may not be serialized; we can't rely on that.  But qemu 
is, and we can.


  
How often does the driver enable/disable the MSI (and, do you know why)?  
If it's often enough it may justify kernel support.  (We'll need this 
patch in any case for kernels without this new support).



Seems like multiple times per second.  I don't know why.  Now I'm
starting to get curious why nobody else seems to be hitting this.  I'm
seeing it on an e1000e NIC and Qlogic fibre channel.  Is everyone else
using MSI-X or regular interrupts vs MSI?
  


When you say multiple times, is it several, or a lot more?

Maybe it is NAPI?

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH v4] kvm: Use a bitmap for tracking used GSIs

2009-05-13 Thread Alex Williamson
On Wed, 2009-05-13 at 15:35 +0300, Avi Kivity wrote:
 Alex Williamson wrote:
  On Wed, 2009-05-13 at 12:47 +0300, Avi Kivity wrote:

  Alex Williamson wrote:
  
  We're currently using a counter to track the most recent GSI we've
  handed out.  This quickly hits KVM_MAX_IRQ_ROUTES when using device
  assignment with a driver that regularly toggles the MSI enable bit.
  This can mean only a few minutes of usable run time.  Instead, track
  used GSIs in a bitmap.
 
  Signed-off-by: Alex Williamson alex.william...@hp.com
  ---
 
   v2: Added mutex to protect gsi bitmap


  Why is the mutex needed?  We already have mutex protection in qemu.
  
 
  If it's unneeded, I'll happily remove it.  I was assuming in a guest
  with multiple devices these could come in parallel, but maybe the guest
  is already serialized for config space accesses via cfc/cf8.

 
 The guest may or may not be serialized; we can't rely on that.  But qemu 
 is, and we can.

Ok, I'll drop the mutex here.
   
  How often does the driver enable/disable the MSI (and, do you know why)?  
  If it's often enough it may justify kernel support.  (We'll need this 
  patch in any case for kernels without this new support).
  
 
  Seems like multiple times per second.  I don't know why.  Now I'm
  starting to get curious why nobody else seems to be hitting this.  I'm
  seeing it on an e1000e NIC and Qlogic fibre channel.  Is everyone else
  using MSI-X or regular interrupts vs MSI?

 
 When you say multiple times, it is several, or a lot more?
 
 Maybe it is NAPI?

The system would run out of the ~1000 available GSIs in a minute or two
with just an e1000e available to the guest.  So that's something on the
order of 10/s.  This also causes a printk in the host every time the
interrupt is enabled, which can't help performance and gets pretty
annoying for syslog.  I was guessing some kind of interrupt mitigation,
such as NAPI, but a qlogic FC card seems to do it too (seemingly at a
slower rate).

Alex
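
The estimate above can be checked with simple arithmetic: if every MSI toggle consumes a GSI that is never returned, roughly 1000 GSIs at about 10 toggles per second last on the order of 100 seconds. The sketch below only restates that calculation; the function name is hypothetical and both inputs are the approximate figures quoted in this message.

```python
def seconds_until_exhausted(available_gsis, toggles_per_second):
    """Time until a never-reusing counter runs out of GSIs.

    Hypothetical helper; inputs are the rough figures from the message.
    """
    return available_gsis / float(toggles_per_second)


seconds = seconds_until_exhausted(1000, 10)
print("%.0f seconds, about %.1f minutes" % (seconds, seconds / 60.0))
```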




Re: virtio net regression

2009-05-13 Thread Antoine Martin
Re-sending as this does not seem to have made it to the list.

Antoine Martin wrote:
 Hi,
 
 Here is another one, any ideas?
 These oopses do look quite deep. Is it normal to end up in tcp_send_ack
 from pdflush??
 
 Cheers
 Antoine
 
 [929492.154634] pdflush: page allocation failure. order:0, mode:0x20
 [929492.154637] Pid: 291, comm: pdflush Not tainted 2.6.29.2 #5
 [929492.154639] Call Trace:
 [929492.154641]  IRQ  [8027e8bc]
 __alloc_pages_internal+0x3e1/0x401
 [929492.154649]  [8055b5ea] try_fill_recv+0xa1/0x182
 [929492.154652]  [8055c1fc] virtnet_poll+0x533/0x5ab
 [929492.154655]  [80632bba] net_rx_action+0x70/0x143
 [929492.154658]  [8023f18c] __do_softirq+0x83/0x123
 [929492.154661]  [8020d35c] call_softirq+0x1c/0x28
 [929492.154664]  [8020e2c0] do_softirq+0x3c/0x85
 [929492.154666]  [8023eea3] irq_exit+0x3f/0x7a
 [929492.154668]  [8020e59c] do_IRQ+0x12b/0x14f
 [929492.154670]  [8020cad3] ret_from_intr+0x0/0x29
 [929492.154672]  EOI  [802c22b1]
 __set_page_dirty_buffers+0x0/0x8f
 [929492.154677]  [8031702b] bget_one+0x0/0xb
 [929492.154680]  [80316fa2] walk_page_buffers+0x2/0x8b
 [929492.154682]  [803185bc] ext3_ordered_writepage+0xae/0x134
 [929492.154685]  [8027ea46] __writepage+0xa/0x25
 [929492.154687]  [8027f19f] write_cache_pages+0x206/0x322
 [929492.154689]  [8027ea3c] __writepage+0x0/0x25
 [929492.154691]  [8027f2fe] do_writepages+0x27/0x2d
 [929492.154694]  [802bd3f6] __writeback_single_inode+0x1a7/0x3b5
 [929492.154696]  [8020a68c] __switch_to+0xb4/0x38c
 [929492.154698]  [802bda76] generic_sync_sb_inodes+0x2a7/0x458
 [929492.154701]  [802bde00] writeback_inodes+0x8d/0xe6
 [929492.154704]  [807296e2] _spin_lock+0x5/0x7
 [929492.155056]  [8027f432] wb_kupdate+0x9f/0x116
 [929492.155058]  [80280095] pdflush+0x14b/0x202
 [929492.155061]  [8027f393] wb_kupdate+0x0/0x116
 [929492.155063]  [8027ff4a] pdflush+0x0/0x202
 [929492.155065]  [8027ff4a] pdflush+0x0/0x202
 [929492.155068]  [8024c127] kthread+0x47/0x73
 [929492.155070]  [8020d25a] child_rip+0xa/0x20
 [929492.155072]  [8024c0e0] kthread+0x0/0x73
 [929492.183142]  [8020d250] child_rip+0x0/0x20
 [929492.183145] Mem-Info:
 [929492.183147] DMA per-cpu:
 [929492.183149] CPU0: hi:0, btch:   1 usd:   0
 [929492.183151] DMA32 per-cpu:
 [929492.183154] CPU0: hi:  186, btch:  31 usd: 184
 [929492.183158] Active_anon:2755 active_file:39849 inactive_anon:2972
 [929492.183159]  inactive_file:70353 unevictable:0 dirty:4172
 writeback:1580 unstable:0
 [929492.183161]  free:734 slab:5619 mapped:15047 pagetables:927 bounce:0
 [929492.183166] DMA free:1968kB min:28kB low:32kB high:40kB
 active_anon:0kB inactive_anon:40kB active_file:2116kB
 inactive_file:1880kB unevictable:0kB present:5448kB pages_scanned:0
 all_unreclaimable? no
 [929492.183169] lowmem_reserve[]: 0 489 489 489
 [929492.183176] DMA32 free:968kB min:2812kB low:3512kB high:4216kB
 active_anon:11020kB inactive_anon:11848kB active_file:157280kB
 inactive_file:279532kB unevictable:0kB present:500896kB pages_scanned:0
 all_unreclaimable? no
 [929492.183180] lowmem_reserve[]: 0 0 0 0
 [929492.183183] DMA: 6*4kB 2*8kB 3*16kB 1*32kB 1*64kB 2*128kB 0*256kB
 1*512kB 1*1024kB 0*2048kB 0*4096kB = 1976kB
 [929492.183235] DMA32: 0*4kB 1*8kB 0*16kB 0*32kB 1*64kB 3*128kB 2*256kB
 0*512kB 0*1024kB 0*2048kB 0*4096kB = 968kB
 [929492.183244] 110992 total pagecache pages
 [929492.183246] 739 pages in swap cache
 [929492.183248] Swap cache stats: add 8996, delete 8257, find 92604/93191
 [929492.183250] Free swap  = 1040016kB
 [929492.183252] Total swap = 1048568kB
 [929492.186003] 131056 pages RAM
 [929492.186006] 4799 pages reserved
 [929492.186007] 44697 pages shared
 [929492.186008] 90516 pages non-shared
 [930274.380075] eth0: no IPv6 routers present
 
 
 
 
 
 
 
 Antoine Martin wrote:
 Hi

 Still getting (some but less) network issues with a 2.6.28.9 host.

 Found quite a few of these call traces in the 2.6.29.1 guests:
 Guest has 512MB of memory and was not all that busy (just network
 traffic), so I don't understand why it would fail to allocate a page...


 [701453.834571] kjournald: page allocation failure. order:0, mode:0x4020
 [701453.834574] Pid: 4806, comm: kjournald Not tainted 2.6.29.1 #4
 [701453.834576] Call Trace:
 [701453.834578]  IRQ  [8027fa48]
 __alloc_pages_internal+0x3e1/0x401
 [701453.834586]  [802a1ad4] __slab_alloc+0x17f/0x4ca
 [701453.834590]  [8067e322] tcp_send_ack+0x23/0x105
 [701453.834592]  [8067e322] tcp_send_ack+0x23/0x105
 [701453.834595]  [802a2e66] __kmalloc_track_caller+0xac/0xe1
 [701453.834598]  [8062f97e] __alloc_skb+0x61/0x11e
 [701453.834600]  [8067e322] tcp_send_ack+0x23/0x105
 [701453.834603]  [8067c374] tcp_rcv_established+0x6c7/0x9e6
 

Re: [PATCH v4] kvm: Use a bitmap for tracking used GSIs

2009-05-13 Thread Avi Kivity

Alex Williamson wrote:


When you say multiple times, it is several, or a lot more?

Maybe it is NAPI?



The system would run out of the ~1000 available GSIs in a minute or two
with just an e1000e available to the guest.  So that's something on the
order of 10/s.  This also causes a printk in the host every time the
interrupt is enabled, which can't help performance and gets pretty
annoying for syslog.  I was guessing some kind of interrupt mitigation,
such as NAPI, but a qlogic FC card seems to do it too (seemingly at a
slower rate).
  


I see.  And what is the path by which it is disabled?  The mask bit in 
the MSI entry?


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH v4] kvm: Use a bitmap for tracking used GSIs

2009-05-13 Thread Alex Williamson
On Wed, 2009-05-13 at 16:00 +0300, Avi Kivity wrote:
 Alex Williamson wrote:
 
  When you say multiple times, it is several, or a lot more?
 
  Maybe it is NAPI?
  
 
  The system would run out of the ~1000 available GSIs in a minute or two
  with just an e1000e available to the guest.  So that's something on the
  order of 10/s.  This also causes a printk in the host every time the
  interrupt is enabled, which can't help performance and gets pretty
  annoying for syslog.  I was guessing some kind of interrupt mitigation,
  such as NAPI, but a qlogic FC card seems to do it too (seemingly at a
  slower rate).

 
 I see.  And what is the path by which it is disabled?  The mask bit in 
 the MSI entry?

Yes, I believe the only path is via a write to the MSI capability in the
PCI config space.

Alex



Re: [PATCH 1/1] KVM: Fix potentially recursively get kvm lock

2009-05-13 Thread Marcelo Tosatti
On Wed, May 13, 2009 at 10:07:54AM +0800, Yang, Sheng wrote:
  KVM: workaround workqueue / deassign_host_irq deadlock
 
  I think I'm running into the following deadlock in the kvm kernel module
  when trying to use device assignment:
 
  CPU A                               CPU B
  kvm_vm_ioctl_deassign_dev_irq()
    mutex_lock(&kvm->lock);           worker_thread()
    -> kvm_deassign_irq()             -> kvm_assigned_dev_interrupt_work_handler()
      -> deassign_host_irq()             mutex_lock(&kvm->lock);
        -> cancel_work_sync() [blocked]
 
  Workaround the issue by dropping kvm->lock for cancel_work_sync().
 
  Reported-by: Alex Williamson alex.william...@hp.com
  From: Sheng Yang sheng.y...@intel.com
  Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 
 Another calling path (kvm_free_all_assigned_devices()) doesn't hold kvm->lock... 
 It seems it needs the lock to traverse the assigned dev list?

Sheng,

The task executing the deassign irq ioctl has a reference to the vm
instance. This solution is just temporary, though: once the locks are
split, dropping kvm->lock around cancel_work_sync will no longer be
necessary.
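
The deadlock and its workaround can be modelled outside the kernel. In the sketch below a Python thread stands in for the work handler and a `threading.Lock` stands in for kvm->lock; the function names are hypothetical and this is only an analogy for the pattern, not the KVM code. The point is that the waiting step (the counterpart of cancel_work_sync()) happens only after the lock has been dropped.

```python
import threading

lock = threading.Lock()  # stands in for kvm->lock (analogy only)
results = []


def work_handler():
    # Like the interrupt work handler, the worker needs the lock to run.
    with lock:
        results.append("work done")


def deassign(worker):
    lock.acquire()
    # ... bookkeeping that needs the lock would go here ...
    lock.release()
    # Waiting with the lock held would deadlock if the worker is blocked
    # on it, so the wait happens only after the lock has been dropped.
    worker.join()
    with lock:
        results.append("deassigned")


worker = threading.Thread(target=work_handler)
worker.start()
deassign(worker)
print(results)
```

Moving `worker.join()` before `lock.release()` reproduces the hang whenever the worker has not yet taken the lock.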



Re: [PATCH -tip v5 0/7] tracing: kprobe-based event tracer and x86 instruction decoder

2009-05-13 Thread Ingo Molnar

* Masami Hiramatsu mhira...@redhat.com wrote:

 Ingo Molnar wrote:
  * Masami Hiramatsu mhira...@redhat.com wrote:
  
  Hi,
 
  Here are the patches of kprobe-based event tracer for x86, version 
  5, which allows you to probe various kernel events through ftrace 
  interface.
 
  This version supports only x86(-32/-64) (but porting it on other 
  arch just needs kprobes/kretprobes and register and stack access 
  APIs).
 
  This patchset also includes x86(-64) instruction decoder which 
  supports non-SSE/FP opcodes and includes x86 opcode map. I think 
  it will be possible to share this opcode map with KVM's decoder.
 
  This series can be applied on the latest linux-2.6-tip tree.
 
  This patchset includes following changes:
  - Add x86 instruction decoder [1/7]
  - Check insertion point safety in kprobe [2/7]
  - Cleanup fix_riprel() with insn decoder [3/7]
  - Add kprobe-tracer plugin [4/7]
  - Fix kernel_trap_sp() on x86 according to systemtap runtime. [5/7]
  - Add arch-dep register and stack fetching functions [6/7]
  - Support fetching various status (register/stack/memory/etc.) [7/7]
 
  Future items:
  - .init function tracing support.
  - Support primitive types(long, ulong, int, uint, etc) for args.
  
  Ok, this looks pretty complete already.
  
  Two high-level comments:
  
   - There's no self-test - would it be possible to add one? See 
 trace_selftest* in kernel/trace/
  
   - No generic integration.
 
 Hmm, Ingo, could you tell me what I can do for the integration? 
 Do you mean that I should use filters?

yeah, that - and for the tracepoints to show up under 
/debug/tracing/events/. They'd in essence be 'flexible', dynamic 
event tracepoints that extend upon existing, built-in tracepoints. 
To user-space tools the two would show up in a very similar way and 
with a similar usage (once they are injected).

Ingo


Re: [PATCH v4] kvm: Use a bitmap for tracking used GSIs

2009-05-13 Thread Michael S. Tsirkin
On Wed, May 13, 2009 at 07:11:16AM -0600, Alex Williamson wrote:
 On Wed, 2009-05-13 at 16:00 +0300, Avi Kivity wrote:
  Alex Williamson wrote:
  
   When you say multiple times, it is several, or a lot more?
  
   Maybe it is NAPI?
   
  
   The system would run out of the ~1000 available GSIs in a minute or two
   with just an e1000e available to the guest.  So that's something on the
   order of 10/s.  This also causes a printk in the host every time the
   interrupt is enabled, which can't help performance and gets pretty
   annoying for syslog.  I was guessing some kind of interrupt mitigation,
   such as NAPI, but a qlogic FC card seems to do it too (seemingly at a
   slower rate).
 
  
  I see.  And what is the path by which it is disabled?  The mask bit in 
  the MSI entry?
 
 Yes, I believe the only path is via a write to the MSI capability in the
 PCI config space.
 
 Alex

Very surprising: I haven't seen any driver disable MSI except on the device
destructor path. Is this a Linux guest?

-- 
MST


RE: Implement generic double fault generation mechanism

2009-05-13 Thread Dong, Eddie
 That is OK, You can send two patches. The first one will WARN_ON and
 overwrite exception like the current code does. And the second one
 will remove WARN_ON explaining that this case is actually possible to
 trigger from a guest.
 
It sounds like you would rather not provide this additional patch, so here it is,
to remove the blocking issue. My basic position is still the same as in my
previous mail, but I am neutral either way.

Thx, eddie

Signed-off-by: Eddie Dong eddie.d...@intel.com

Overwriting the former event may help forward progress
when multiple exceptions/interrupts happen serially.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d0e75a2..b3de5d2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -183,11 +183,7 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
 	int class1, class2;
 
 	if (!vcpu->arch.exception.pending) {
-		vcpu->arch.exception.pending = true;
-		vcpu->arch.exception.has_error_code = has_error;
-		vcpu->arch.exception.nr = nr;
-		vcpu->arch.exception.error_code = error_code;
-		return;
+		goto out;
 	}
 
 	/* to check exception */
@@ -208,9 +204,15 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
 		vcpu->arch.exception.has_error_code = true;
 		vcpu->arch.exception.nr = DF_VECTOR;
 		vcpu->arch.exception.error_code = 0;
+		return;
 	} else
 		printk(KERN_ERR "Exception 0x%x on 0x%x happens serially\n",
 			prev_nr, nr);
+out:
+	vcpu->arch.exception.pending = true;
+	vcpu->arch.exception.has_error_code = has_error;
+	vcpu->arch.exception.nr = nr;
+	vcpu->arch.exception.error_code = error_code;
 }
 
 void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr)

serial_irq.patch
Description: serial_irq.patch
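
The control flow of the patch above can be modelled outside the kernel. The following is a minimal sketch, not the KVM code itself: struct pending_exception, classify() and queue_exception() are hypothetical names, and the benign/contributory/page-fault classification is an assumption taken from the x86 architecture's double-fault rules, since the classification code is not shown in this mail. The triple-fault case (a fault while a double fault is already pending) is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* x86 exception vectors used by this sketch. */
enum { DE_VECTOR = 0, DF_VECTOR = 8, TS_VECTOR = 10, NP_VECTOR = 11,
       SS_VECTOR = 12, GP_VECTOR = 13, PF_VECTOR = 14 };

enum exc_class { EXC_BENIGN, EXC_CONTRIBUTORY, EXC_PF };

/* Hypothetical stand-in for the vcpu's single pending-exception slot. */
struct pending_exception {
	bool pending;
	bool has_error_code;
	unsigned nr;
	unsigned error_code;
};

static enum exc_class classify(unsigned nr)
{
	switch (nr) {
	case PF_VECTOR:
		return EXC_PF;
	case DE_VECTOR: case TS_VECTOR: case NP_VECTOR:
	case SS_VECTOR: case GP_VECTOR:
		return EXC_CONTRIBUTORY;
	default:
		return EXC_BENIGN;
	}
}

static void queue_exception(struct pending_exception *e, unsigned nr,
			    bool has_error, unsigned error_code)
{
	assert(nr < 32);	/* architectural exception vectors only */

	if (e->pending) {
		enum exc_class prev = classify(e->nr);
		enum exc_class cur = classify(nr);

		if ((prev == EXC_CONTRIBUTORY && cur == EXC_CONTRIBUTORY) ||
		    (prev == EXC_PF && cur != EXC_BENIGN)) {
			/* Merge both events into a double fault and stop. */
			e->has_error_code = true;
			e->nr = DF_VECTOR;
			e->error_code = 0;
			return;
		}
		/* Otherwise the new event overwrites the former one below. */
	}

	e->pending = true;
	e->has_error_code = has_error;
	e->nr = nr;
	e->error_code = error_code;
}
```

Queuing #GP and then #NP in this model leaves a pending #DF with error code 0, while queuing a benign #BP and then #PF leaves the #PF pending, which is the overwrite behaviour the patch introduces.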


Re: [PATCH v4] kvm: Use a bitmap for tracking used GSIs

2009-05-13 Thread Alex Williamson
On Wed, 2009-05-13 at 16:55 +0300, Michael S. Tsirkin wrote:
 On Wed, May 13, 2009 at 07:11:16AM -0600, Alex Williamson wrote:
  On Wed, 2009-05-13 at 16:00 +0300, Avi Kivity wrote:
   Alex Williamson wrote:
   
When you say multiple times, it is several, or a lot more?
   
Maybe it is NAPI?

   
The system would run out of the ~1000 available GSIs in a minute or two
with just an e1000e available to the guest.  So that's something on the
order of 10/s.  This also causes a printk in the host every time the
interrupt is enabled, which can't help performance and gets pretty
annoying for syslog.  I was guessing some kind of interrupt mitigation,
such as NAPI, but a qlogic FC card seems to do it too (seemingly at a
slower rate).
  
   
   I see.  And what is the path by which it is disabled?  The mask bit in 
   the MSI entry?
  
  Yes, I believe the only path is via a write to the MSI capability in the
  PCI config space.
  
  Alex
 
 Very surprising: I haven't seen any driver disable MSI except on the device
 destructor path. Is this a Linux guest?

Yes, Debian 2.6.26 kernel.  I'll check if it behaves the same on newer
upstream kernels and try to figure out why it's doing it.

Alex



RE: event injection MACROs

2009-05-13 Thread Dong, Eddie
Avi Kivity wrote:
 Dong, Eddie wrote:
 I noticed the MACRO for SVM vmcb->control.event_inj and VMX
 VM_EXIT_INTR_INFO are almost same, I have a need to query the event
 injection situation in common code so plan to expose this register
 read/write to x86.c.  Should we define a new format for
 evtinj/VM_EXIT_INTR_INFO as common KVM format, or just move those
 original MACRO to kvm_host.h? 
 
 
 
 This is dangerous if additional bits or field values are defined by
 either architecture.  Better to use accessors.

OK.
Also, back to Gleb's question: the reason I want to do that is to simplify the
event generation mechanism in current KVM.

Today KVM uses an additional layer of exception/NMI/interrupt state, such as
vcpu->arch.exception.pending, vcpu->arch.interrupt.pending and
vcpu->arch.nmi_injected.
All of those additional layers exist because they compete for the single
VM_ENTRY_INTR_INFO_FIELD write that injects the event. Both SVM and VMX have
only one resource for injecting a virtual event, but KVM generates 3 categories
of events in parallel, which requires additional logic to arbitrate among them.
One example is that exceptions have higher priority than NMI/IRQ injection in
the current code, which is not true in reality.

Another issue is that a failed event from a previous injection, say an IRQ or
NMI, may be discarded if a virtual exception happens during EXIT handling. With
the generic double fault handling patch, this case should be handled normally.

Will post RFC soon.
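
Avi's suggestion to use accessors rather than sharing the raw macros could look roughly like the sketch below. It is only an illustration: struct injected_event, event_encode() and event_decode() are hypothetical names, and the bit layout (vector in bits 7:0, type in bits 10:8, error-code flag in bit 11, valid flag in bit 31) is an assumption of this sketch standing in for one architecture's private format. Common code would only ever see the decoded struct, so either architecture could add bits later without affecting the other.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical architecture-neutral view of an injected event. */
struct injected_event {
	bool valid;
	bool has_error_code;
	uint8_t vector;
	uint8_t type;		/* 3-bit event type */
};

/* Layout assumed for this sketch; each architecture would keep its own. */
#define EVT_VECTOR_MASK		0x000000ffu
#define EVT_TYPE_SHIFT		8
#define EVT_TYPE_MASK		0x00000700u
#define EVT_ERR_CODE_BIT	(1u << 11)
#define EVT_VALID_BIT		(1u << 31)

/* Accessor: pack the neutral view into the raw register format. */
static uint32_t event_encode(const struct injected_event *ev)
{
	uint32_t raw = ev->vector;

	assert(ev->type <= 7);
	raw |= (uint32_t)ev->type << EVT_TYPE_SHIFT;
	if (ev->has_error_code)
		raw |= EVT_ERR_CODE_BIT;
	if (ev->valid)
		raw |= EVT_VALID_BIT;
	return raw;
}

/* Accessor: unpack the raw register format into the neutral view. */
static struct injected_event event_decode(uint32_t raw)
{
	struct injected_event ev;

	ev.valid = (raw & EVT_VALID_BIT) != 0;
	ev.has_error_code = (raw & EVT_ERR_CODE_BIT) != 0;
	ev.vector = (uint8_t)(raw & EVT_VECTOR_MASK);
	ev.type = (uint8_t)((raw & EVT_TYPE_MASK) >> EVT_TYPE_SHIFT);
	return ev;
}
```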

Thx, eddie


Re: [PATCH v4] kvm: Use a bitmap for tracking used GSIs

2009-05-13 Thread Michael S. Tsirkin
On Tue, May 12, 2009 at 10:41:29PM -0600, Alex Williamson wrote:
 + gsi_count = kvm_get_gsi_count(kvm);
 + /* Round up so we can search ints using ffs */
 + gsi_bytes = ((gsi_count + 31) / 32) * 4;
 + kvm->used_gsi_bitmap = malloc(gsi_bytes);

What happens on error in kvm_get_gsi_count?
gsi_count will be negative ..
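
To make the concern concrete, here is a minimal sketch of the sizing step with the error path handled. gsi_bitmap_bytes() and gsi_bitmap_alloc() are hypothetical helpers, not libkvm functions, and propagating the negative count to the caller is one possible policy chosen for illustration; skipping the allocation is another.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Bytes for a bitmap of gsi_count bits, rounded up to whole 32-bit words. */
static size_t gsi_bitmap_bytes(int gsi_count)
{
	return (((size_t)gsi_count + 31) / 32) * 4;
}

/*
 * Hypothetical helper: allocate a zeroed bitmap for gsi_count GSIs.
 * A negative count (an error code from the query) is returned unchanged
 * and nothing is allocated; a zero count allocates nothing either.
 */
static int gsi_bitmap_alloc(int gsi_count, uint32_t **bitmap, int *max_gsi)
{
	size_t bytes;

	*bitmap = NULL;
	*max_gsi = 0;
	if (gsi_count < 0)
		return gsi_count;
	if (gsi_count == 0)
		return 0;

	bytes = gsi_bitmap_bytes(gsi_count);
	assert(bytes % sizeof(uint32_t) == 0);
	*bitmap = calloc(1, bytes);
	if (!*bitmap)
		return -ENOMEM;
	*max_gsi = (int)(bytes * 8);	/* bits available, a multiple of 32 */
	return 0;
}
```

With roughly 1000 GSIs this rounds up to 128 bytes, i.e. 1024 bits, so the extra bits beyond the real count have to be marked as used before the bitmap is searched.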

-- 
MST


Re: [PATCH v4] kvm: Use a bitmap for tracking used GSIs

2009-05-13 Thread Alex Williamson
On Wed, 2009-05-13 at 08:15 -0600, Alex Williamson wrote:
 On Wed, 2009-05-13 at 16:55 +0300, Michael S. Tsirkin wrote:
  On Wed, May 13, 2009 at 07:11:16AM -0600, Alex Williamson wrote:
   On Wed, 2009-05-13 at 16:00 +0300, Avi Kivity wrote:
Alex Williamson wrote:

 When you say multiple times, it is several, or a lot more?

 Maybe it is NAPI?
 

 The system would run out of the ~1000 available GSIs in a minute or 
 two
 with just an e1000e available to the guest.  So that's something on 
 the
 order of 10/s.  This also causes a printk in the host every time the
 interrupt is enabled, which can't help performance and gets pretty
 annoying for syslog.  I was guessing some kind of interrupt 
 mitigation,
 such as NAPI, but a qlogic FC card seems to do it too (seemingly at a
 slower rate).
   

I see.  And what is the path by which it is disabled?  The mask bit in 
the MSI entry?
   
   Yes, I believe the only path is via a write to the MSI capability in the
   PCI config space.
   
   Alex
  
  Very surprising: I haven't seen any driver disable MSI except on the device
  destructor path. Is this a Linux guest?
 
 Yes, Debian 2.6.26 kernel.  I'll check if it behaves the same on newer
 upstream kernels and try to figure out why it's doing it.

Updating the guest to 2.6.29 seems to fix the interrupt toggling.  So
it's either something in older kernels or something debian introduced,
but that seems unlikely.

Alex



Re: [PATCH -tip v5 1/7] x86: instruction decorder API

2009-05-13 Thread Masami Hiramatsu
Gleb Natapov wrote:
 On Wed, May 13, 2009 at 11:35:16AM +0200, Przemysław Pawełczyk wrote:
 On Wed, May 13, 2009 at 10:23, Gleb Natapov g...@redhat.com wrote:
 On Fri, May 08, 2009 at 08:48:42PM -0400, Masami Hiramatsu wrote:
 +++ b/arch/x86/scripts/gen-insn-attr-x86.awk
 @@ -0,0 +1,314 @@
 +#!/bin/awk -f
 On some distributions (debian) it is /usr/bin/awk.
 True, but on most of them (all?) there is also an appropriate link in /bin.
 Nope, not on debian testing. Although I assume if kernel compilation
 will start to fail it will appear :)
 
 If shebang could have more than one argument, then '/usr/bin/env awk
 -f' would be the best solution I think.

Ah, I see.
Actually, it will be executed from Makefile with 'awk -f'.

 --- a/arch/x86/lib/Makefile
 +++ b/arch/x86/lib/Makefile
 @@ -2,12 +2,21 @@
  # Makefile for x86 specific library files.
  #
  
 +quiet_cmd_inat_tables = GEN $@
 +  cmd_inat_tables = awk -f $(srctree)/arch/x86/scripts/gen-insn-attr-x86.awk $(srctree)/arch/x86/lib/x86-opcode-map.txt > $@
 +

So, if awk is on the PATH, it will pass.
Maybe, I need to add 'HOSTAWK = awk' line in Makefile.

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhira...@redhat.com



Re: [RFC PATCH 0/3] generic hypercall support

2009-05-13 Thread Gregory Haskins
Anthony Liguori wrote:
 Gregory Haskins wrote:
 I specifically generalized my statement above because #1 I assume
 everyone here is smart enough to convert that nice round unit into the
 relevant figure.  And #2, there are multiple potential latency sources
 at play which we need to factor in when looking at the big picture.  For
 instance, the difference between PF exit, and an IO exit (2.58us on x86,
 to be precise).  Or whether you need to take a heavy-weight exit.  Or a
 context switch to qemu, the kernel, back to qemu, and back to the
 vcpu.  Or acquire a mutex.  Or get head-of-lined on the VGA models
 IO. I know you wish that this whole discussion would just go away,
 but these
 little 300ns here, 1600ns there really add up in aggregate despite
 your dismissive attitude towards them.  And it doesn't take much to
 affect the results in a measurable way.  As stated, each 1us costs
 ~4%. My motivation is to reduce as many of these sources as possible.

 So, yes, the delta from PIO to HC is 350ns.  Yes, this is a ~1.4%
 improvement.  So what?  It's still an improvement.  If that improvement
 were for free, would you object?  And we all know that this change isn't
 free because we have to change some code (+128/-0, to be exact).  But
 what is it specifically you are objecting to in the first place?  Adding
 hypercall support as a pv_ops primitive isn't exactly hard or complex,
 or even very much code.
   

 Where does 25us come from?  The number you post below are 33us and
 66us.  This is part of what's frustrating me in this thread.  Things
 are way too theoretical.  Saying that if packet latency was 25us,
 then it would be a 1.4% improvement is close to misleading.
[ answered in the last reply ]

 The numbers you've posted are also measuring on-box speeds.  What
 really matters are off-box latencies and that's just going to exaggerate.

I'm not 100% clear on what you mean with on-box vs off-box.  These
figures were gathered between two real machines connected via 10GE
cross-over cable.  The 5.8Gb/s and 33us (25us) values were gathered
sending real data between these hosts.  This sounds off-box to me, but
I am not sure I truly understand your assertion.



 IIUC, if you switched vbus to using PIO today, you would go from 66us
 to 65.65, which you'd round to 66us for on-box latencies.  Even if
 you didn't round, it's a 0.5% improvement in latency.

I think part of what you are missing is that in order to create vbus, I
needed to _create_ an in-kernel hook from scratch since there were no
existing methods. Since I measured HC to be superior in performance (if
by only a little), I wasn't going to choose the slower way if there
wasn't a reason, and at the time I didn't see one.  Now after community
review, perhaps we do have a reason, but that is the point of the review
process.  So now we can push something like iofd as a PIO hook instead. 
But either way, something needed to be created.



 Adding hypercall support as a pv_ops primitive is adding a fair bit of
 complexity.  You need a hypercall fd mechanism to plumb this down to
 userspace otherwise, you can't support migration from in-kernel
 backend to non in-kernel backend.

I respectfully disagree.   This is orthogonal to the simple issue of the
IO type for the exit.  Where you *do* have a point is that the bigger
benefit comes from in-kernel termination (like the iofd stuff I posted
yesterday).  However, in-kernel termination is not strictly necessary to
exploit some reduction in overhead in the IO latency.  In either case we
know we can shave off about 2.56us from an MMIO.

Since I formally measured MMIO rtt to userspace yesterday, we now know
that we can do qemu-mmio in about 110k IOPS, 9.09us rtt.  Switching to
pv_io_ops->mmio() alone would be a boost to approximately 153k IOPS,
6.53us rtt.  This would have a tangible benefit to all models without
any hypercall plumbing screwing up migration.  Therefore I still stand
by the assertion that the hypercall discussion alone doesn't add very
much complexity.
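
The figures quoted in this thread follow from simple arithmetic, sketched below. The helper names iops_from_rtt_us() and percent_of() are made up for illustration, and the model assumes strictly serialized round trips, which is an assumption of this sketch rather than something measured here.

```c
#include <assert.h>

/* Operations per second when each operation is one serialized round trip. */
static double iops_from_rtt_us(double rtt_us)
{
	assert(rtt_us > 0.0);
	return 1e6 / rtt_us;
}

/* Share of total_us taken by delta_us, in percent. */
static double percent_of(double delta_us, double total_us)
{
	assert(total_us > 0.0);
	return 100.0 * delta_us / total_us;
}
```

For example, iops_from_rtt_us(9.09) is about 110k, and shaving 2.56us gives iops_from_rtt_us(6.53), about 153k. Against the 25us packet latency discussed earlier, percent_of(0.35, 25.0) is 1.4 and percent_of(1.0, 25.0) is 4.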

   You need some way to allocate hypercalls to particular devices which
 so far, has been completely ignored.

I'm sorry, but that's not true.  Vbus already handles this mapping.


   I've already mentioned why hypercalls are also unfortunate from a
 guest perspective.  They require kernel patching and this is almost
 certainly going to break at least Vista as a guest.  Certainly Windows 7.

Yes, you have a point here.


 So it's not at all fair to trivialize the complexity introduced here. 
 I'm simply asking for justification to introduce this complexity.  I
 don't see why this is unfair for me to ask.

In summary, I don't think there is really much complexity being added
because this stuff really doesn't depend on the hypercallfd (iofd)
interface in order to have some benefit, as you assert above.  The
hypercall page is a good point for attestation, but that issue exists
already today and is not a newly created issue by this proposal.


 As 

[PATCH v5] kvm: Use a bitmap for tracking used GSIs

2009-05-13 Thread Alex Williamson
We're currently using a counter to track the most recent GSI we've
handed out.  This quickly hits KVM_MAX_IRQ_ROUTES when using device
assignment with a driver that regularly toggles the MSI enable bit.
This can mean only a few minutes of usable run time.  Instead, track
used GSIs in a bitmap.

Signed-off-by: Alex Williamson alex.william...@hp.com
---

 v2: Added mutex to protect gsi bitmap
 v3: Updated for comments from Michael Tsirkin
 No longer depends on [PATCH] kvm: device-assignment: Catch GSI overflow
 v4: Fix gsi_bytes calculation noted by Sheng Yang
 v5: Remove mutex per Avi
 Fix negative gsi_count path per Michael
 Remove KVM_CAP_IRQ_ROUTING per Michael, ppc should still be protected
 by the KVM_IOAPIC_NUM_PINS check

 hw/device-assignment.c  |4 ++-
 kvm/libkvm/kvm-common.h |3 +-
 kvm/libkvm/libkvm.c |   74 ++-
 kvm/libkvm/libkvm.h |   10 ++
 4 files changed, 75 insertions(+), 16 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index a7365c8..a6cc9b9 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -561,8 +561,10 @@ static void free_dev_irq_entries(AssignedDevice *dev)
 {
 int i;
 
-    for (i = 0; i < dev->irq_entries_nr; i++)
+    for (i = 0; i < dev->irq_entries_nr; i++) {
         kvm_del_routing_entry(kvm_context, &dev->entry[i]);
+        kvm_free_irq_route_gsi(kvm_context, dev->entry[i].gsi);
+    }
     free(dev->entry);
     dev->entry = NULL;
     dev->irq_entries_nr = 0;
diff --git a/kvm/libkvm/kvm-common.h b/kvm/libkvm/kvm-common.h
index 591fb53..c95c591 100644
--- a/kvm/libkvm/kvm-common.h
+++ b/kvm/libkvm/kvm-common.h
@@ -67,7 +67,8 @@ struct kvm_context {
struct kvm_irq_routing *irq_routes;
int nr_allocated_irq_routes;
 #endif
-   int max_used_gsi;
+   void *used_gsi_bitmap;
+   int max_gsi;
 };
 
 int kvm_alloc_kernel_memory(kvm_context_t kvm, unsigned long memory,
diff --git a/kvm/libkvm/libkvm.c b/kvm/libkvm/libkvm.c
index ba0a5d1..74fb59b 100644
--- a/kvm/libkvm/libkvm.c
+++ b/kvm/libkvm/libkvm.c
@@ -61,10 +61,13 @@
 #define DPRINTF(fmt, args...) do {} while (0)
 #endif
 
+#define min(x,y) ((x) < (y) ? (x) : (y))
 
 int kvm_abi = EXPECTED_KVM_API_VERSION;
 int kvm_page_size;
 
+static inline void set_bit(uint32_t *buf, unsigned int bit);
+
 struct slot_info {
unsigned long phys_addr;
unsigned long len;
@@ -285,7 +288,7 @@ kvm_context_t kvm_init(struct kvm_callbacks *callbacks,
 {
int fd;
kvm_context_t kvm;
-   int r;
+   int r, gsi_count;
 
 	fd = open("/dev/kvm", O_RDWR);
if (fd == -1) {
@@ -323,6 +326,28 @@ kvm_context_t kvm_init(struct kvm_callbacks *callbacks,
 	kvm->no_irqchip_creation = 0;
 	kvm->no_pit_creation = 0;
 
+	gsi_count = kvm_get_gsi_count(kvm);
+	if (gsi_count > 0) {
+		int gsi_bytes, i;
+
+		/* Round up so we can search ints using ffs */
+		gsi_bytes = ((gsi_count + 31) / 32) * 4;
+		kvm->used_gsi_bitmap = malloc(gsi_bytes);
+		if (!kvm->used_gsi_bitmap)
+			goto out_close;
+		memset(kvm->used_gsi_bitmap, 0, gsi_bytes);
+		kvm->max_gsi = gsi_bytes * 8;
+
+		/* Mark all the IOAPIC pin GSIs and any over-allocated
+		* GSIs as already in use. */
+#ifdef KVM_IOAPIC_NUM_PINS
+		for (i = 0; i < min(KVM_IOAPIC_NUM_PINS, gsi_count); i++)
+			set_bit(kvm->used_gsi_bitmap, i);
+#endif
+		for (i = gsi_count; i < kvm->max_gsi; i++)
+			set_bit(kvm->used_gsi_bitmap, i);
+	}
+
return kvm;
  out_close:
close(fd);
@@ -1298,8 +1323,6 @@ int kvm_add_routing_entry(kvm_context_t kvm,
 	new->flags = entry->flags;
 	new->u = entry->u;
 
-	if (entry->gsi > kvm->max_used_gsi)
-		kvm->max_used_gsi = entry->gsi;
return 0;
 #else
return -ENOSYS;
@@ -1404,19 +1427,42 @@ int kvm_commit_irq_routes(kvm_context_t kvm)
 #endif
 }
 
+static inline void set_bit(uint32_t *buf, unsigned int bit)
+{
+	buf[bit / 32] |= 1U << (bit % 32);
+}
+
+static inline void clear_bit(uint32_t *buf, unsigned int bit)
+{
+	buf[bit / 32] &= ~(1U << (bit % 32));
+}
+
+static int kvm_find_free_gsi(kvm_context_t kvm)
+{
+	int i, bit, gsi;
+	uint32_t *buf = kvm->used_gsi_bitmap;
+
+	for (i = 0; i < kvm->max_gsi / 32; i++) {
+		bit = ffs(~buf[i]);
+		if (!bit)
+			continue;
+
+		gsi = bit - 1 + i * 32;
+		set_bit(buf, gsi);
+		return gsi;
+	}
+
+	return -ENOSPC;
+}
+
 int kvm_get_irq_route_gsi(kvm_context_t kvm)
 {
-#ifdef KVM_CAP_IRQ_ROUTING
-	if (kvm->max_used_gsi >= KVM_IOAPIC_NUM_PINS)  {
-	    if (kvm->max_used_gsi <= kvm_get_gsi_count(kvm))
-                return kvm->max_used_gsi + 1;
-else
-   

Re: [PATCH -tip v5 1/7] x86: instruction decorder API

2009-05-13 Thread Gleb Natapov
On Wed, May 13, 2009 at 10:35:55AM -0400, Masami Hiramatsu wrote:
 Gleb Natapov wrote:
  On Wed, May 13, 2009 at 11:35:16AM +0200, Przemysław Pawełczyk wrote:
  On Wed, May 13, 2009 at 10:23, Gleb Natapov g...@redhat.com wrote:
  On Fri, May 08, 2009 at 08:48:42PM -0400, Masami Hiramatsu wrote:
  +++ b/arch/x86/scripts/gen-insn-attr-x86.awk
  @@ -0,0 +1,314 @@
  +#!/bin/awk -f
  On some distributions (debian) it is /usr/bin/awk.
  True, but on most of them (all?) there is also an appropriate link in /bin.
  Nope, not on debian testing. Although I assume if kernel compilation
  will start to fail it will appear :)
  
  If shebang could have more than one argument, then '/usr/bin/env awk
  -f' would be the best solution I think.
 
 Ah, I see.
 Actually, it will be executed from Makefile with 'awk -f'.
 
  --- a/arch/x86/lib/Makefile
  +++ b/arch/x86/lib/Makefile
  @@ -2,12 +2,21 @@
   # Makefile for x86 specific library files.
   #
   
  +quiet_cmd_inat_tables = GEN $@
  +  cmd_inat_tables = awk -f $(srctree)/arch/x86/scripts/gen-insn-attr-x86.awk $(srctree)/arch/x86/lib/x86-opcode-map.txt > $@
  +
 
 So, if awk is on the PATH, it will pass.
Ah, that is good enough, I think. I tried to run the script manually.

 Maybe, I need to add 'HOSTAWK = awk' line in Makefile.
 

--
Gleb.


Re: Best choice for copy/clone/snapshot

2009-05-13 Thread Ross Boylan
Thanks for all the info.  I have one follow up.
On Wed, 2009-05-13 at 10:07 +0300, Avi Kivity wrote:
 
  As I install software onto a system I want to preserve its
 state--just
  the disk state---at various points so I can go back.  What is the
 best
  way to do this?

 
 LVM snapshots.  Read up on the 'lvcreate -s' command and option.
I may have been unclear.  I meant as I install software on the VM.
Since some of them are running Windows, they can't do LVM.  I am running
LVM on my host Linux system.

Or are you suggesting that I put the image files on a snapshottable
partition?  Over time the snapshot seems likely to accumulate a lot of
original sectors that don't involve the disk image I care about.

Or do you mean I should back each virtual disk with an LVM volume?  That
does seem cleaner; I've just been following the docs and they use
regular files.  They say I can't just use a raw partition, but maybe
kvm-img -f qcow2 /dev/MyVolumeGroup/Volume10 ?
Does that give better performance?  The one drawback I see is that I'd
have to really take the space I wanted, rather than having it only
notionally reserved for a file.  I'm not sure how growing the logical
volume would interact with qcow...

Ross



RE: Network I/O performance

2009-05-13 Thread Fischer, Anna
 Subject: Re: Network I/O performance
 
 Fischer, Anna wrote:
  I am running KVM with Fedora Core 8 on a 2.6.23 32-bit kernel. I use
 the tun/tap device model and the Linux bridge kernel module to connect
 my VM to the network. I have 2 10G Intel 82598 network devices (with
 the ixgbe driver) attached to my machine and I want to do packet
 routing in my VM (the VM has two virtual network interfaces
 configured). Analysing the network performance of the standard QEMU
 emulated NICs, I get less than 1G of throughput on those 10G links.
 Surprisingly though, I don't really see CPU utilization being maxed
 out. This is a dual core machine, and mpstat shows me that both CPUs
 are about 40% idle. My VM is more or less unresponsive due to the high
 network processing load while the host OS still seems to be in good
 shape. How can I best tune this setup to achieve best possible
 performance with KVM? I know there is virtIO and I know there is PCI
 pass-through, but those models are not an option for me right now.
 
 
 How many cpus are assigned to the guest?  If only one, then 40% idle
 equates to 100% of a core for the guest and 20% for housekeeping.

No, the machine has a dual core CPU and I have configured the guest with 2 
CPUs. So I would want to see KVM using up to 200% of CPU, ideally. There is 
nothing else running on that machine.
 
 If this is the case, you could try pinning the vcpu thread (info cpus
 from the monitor) to one core.  You should then see 100%/20% cpu load
 distribution.
 
 wrt emulated NIC performance, I'm guessing you're not doing tcp?  If
 you
 were we might do something with TSO.

No, I am measuring UDP throughput performance. I have now tried using a 
different NIC model, and the e1000 model seems to achieve slightly better 
performance (CPU goes up to 110% only though). I have also been running virtio 
now, and while its performance with 2.6.20 was very poor too, when changing the 
guest kernel to 2.6.30, I get a reasonable performance and higher CPU 
utilization (e.g. it goes up to 180-190%). I have to throttle the incoming 
bandwidth though, because as soon as I go over a certain threshold, CPU goes 
back down to 90% and throughput goes down too. 

I have not seen this with Xen/VMware where I mostly managed to max out CPU 
completely before throughput performance did not go up anymore.

I have also realized that when using the tun/tap configuration with a bridge, 
packets are replicated on all tap devices when QEMU writes packets to the tun 
interface. I guess this is a limitation of tun/tap as it does not know to which 
tap device the packet has to go to. The tap device then eventually drops 
packets when the destination MAC is not its own, but it still receives the 
packet which causes more overhead in the system overall.

I have not yet experimented much with pinning VCPU threads to cores. I will do 
that as well.



Re: [PATCH v5] kvm: Use a bitmap for tracking used GSIs

2009-05-13 Thread Michael S. Tsirkin
On Wed, May 13, 2009 at 09:13:38AM -0600, Alex Williamson wrote:
 @@ -323,6 +326,28 @@ kvm_context_t kvm_init(struct kvm_callbacks *callbacks,
   kvm->no_irqchip_creation = 0;
   kvm->no_pit_creation = 0;
  
 + gsi_count = kvm_get_gsi_count(kvm);
 + if (gsi_count > 0) {
 + int gsi_bytes, i;
 +
 + /* Round up so we can search ints using ffs */
 + gsi_bytes = ((gsi_count + 31) / 32) * 4;

Let's take ALIGN macro from linux/kernel.h?

 + kvm->used_gsi_bitmap = malloc(gsi_bytes);
 + if (!kvm->used_gsi_bitmap)
 + goto out_close;
 + memset(kvm->used_gsi_bitmap, 0, gsi_bytes);
 + kvm->max_gsi = gsi_bytes * 8;
 +
 + /* Mark all the IOAPIC pin GSIs and any over-allocated
 + * GSIs as already in use. */

Align '*'s please.

 +#ifdef KVM_IOAPIC_NUM_PINS

I think we should just export
#define KVM_IOAPIC_NUM_PINS 0
for ppc in kernel headers (or in libkvm),
and get rid of this ifdef completely.

Avi, agree?

 + for (i = 0; i < min(KVM_IOAPIC_NUM_PINS, gsi_count); i++)
 + set_bit(kvm->used_gsi_bitmap, i);
 +#endif
 + for (i = gsi_count; i < kvm->max_gsi; i++)
 + set_bit(kvm->used_gsi_bitmap, i);
 + }
 +
   return kvm;
   out_close:
   close(fd);

-- 
MST


Re: [RFC + PATCHES] Work to get KVM autotest upstream

2009-05-13 Thread Michael Goldish
The patches look good, but I haven't tested them yet to make sure
they leave everything at a functional state (will test them and let
you know).

I have a somewhat related question: how is KVM-Autotest development
going to proceed after the upstream merge? Currently I have
comfortable access to our repository at TLV, and on good days I push
as many as 20 patches per day. Should I submit all patches to the
Autotest mailing list after the merge, or are we going to work with
pull requests, or some other way? Will we work with git or svn?

Thanks,
Michael

- Original Message -
From: Lucas Meneghel Rodrigues mrodr...@redhat.com
To: kvm@vger.kernel.org
Sent: Wednesday, May 13, 2009 4:37:40 PM (GMT+0200) Auto-Detected
Subject: [RFC + PATCHES] Work to get KVM autotest upstream

These are the patches I have so far related to the work to get kvm
autotest in shape for upstream merge. Please note that once the patches
are applied, the kvm_runtest_2 directory should be placed on a fresh svn
trunk checkout to work, so there's a little bit of tweaking to get them
working.

That said, this hasn't had enough testing. I am posting them here only
if someone wants to take a look at them.

Cheers,
-- 
Lucas Meneghel Rodrigues
Software Engineer (QE)
Red Hat - Emerging Technologies


Re: [PATCH v5] kvm: Use a bitmap for tracking used GSIs

2009-05-13 Thread Alex Williamson
On Wed, 2009-05-13 at 19:05 +0300, Michael S. Tsirkin wrote:
 On Wed, May 13, 2009 at 09:13:38AM -0600, Alex Williamson wrote:
  @@ -323,6 +326,28 @@ kvm_context_t kvm_init(struct kvm_callbacks *callbacks,
  kvm->no_irqchip_creation = 0;
  kvm->no_pit_creation = 0;
   
  +   gsi_count = kvm_get_gsi_count(kvm);
  +   if (gsi_count > 0) {
  +   int gsi_bytes, i;
  +
  +   /* Round up so we can search ints using ffs */
  +   gsi_bytes = ((gsi_count + 31) / 32) * 4;
 
 Let's take ALIGN macro from linux/kernel.h?

It's already defined in libkvm.c, I'll just move it up in the file.
There's also a BITMAP_SIZE macro by it that looks like it can be nuked.

  +   kvm->used_gsi_bitmap = malloc(gsi_bytes);
  +   if (!kvm->used_gsi_bitmap)
  +   goto out_close;
  +   memset(kvm->used_gsi_bitmap, 0, gsi_bytes);
  +   kvm->max_gsi = gsi_bytes * 8;
  +
  +   /* Mark all the IOAPIC pin GSIs and any over-allocated
  +   * GSIs as already in use. */
 
 Align '*'s please.

Argh, fixed.

  +#ifdef KVM_IOAPIC_NUM_PINS
 
 I think we should just export
 #define KVM_IOAPIC_NUM_PINS 0
 for ppc in kernel headers (or in libkvm),
 and get rid of this ifdef completely.

Ok, I'll add an #ifndef and make it zero in libkvm.c.  It can be cleaned
out further from there.  Thanks,

Alex
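
Putting the review comments together, the allocator could be sketched in user space roughly as follows. This is an illustration under stated assumptions, not the libkvm code: struct gsi_map and the gsi_map_* functions are hypothetical, GSI_ALIGN stands in for the ALIGN macro mentioned above, the caller supplies the storage, and the KVM_IOAPIC_NUM_PINS fallback of 0 mirrors the #ifndef idea for architectures without an IOAPIC.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <strings.h>	/* ffs() */

#ifndef KVM_IOAPIC_NUM_PINS
#define KVM_IOAPIC_NUM_PINS 0	/* fallback when no IOAPIC is defined */
#endif

/* Round x up to a multiple of y, where y is a power of two. */
#define GSI_ALIGN(x, y) (((x) + (y) - 1) & ~((y) - 1))

struct gsi_map {
	uint32_t *bits;	/* one bit per GSI, set means in use */
	int max_gsi;	/* number of bits, a multiple of 32 */
};

static void gsi_set_bit(uint32_t *buf, unsigned int bit)
{
	buf[bit / 32] |= 1U << (bit % 32);
}

static void gsi_clear_bit(uint32_t *buf, unsigned int bit)
{
	buf[bit / 32] &= ~(1U << (bit % 32));
}

/* storage must hold GSI_ALIGN(gsi_count, 32) / 32 words. */
static void gsi_map_init(struct gsi_map *m, uint32_t *storage,
			 int gsi_count, int ioapic_pins)
{
	int i;

	assert(gsi_count > 0);
	m->bits = storage;
	m->max_gsi = GSI_ALIGN(gsi_count, 32);
	for (i = 0; i < m->max_gsi / 32; i++)
		m->bits[i] = 0;
	/* Reserve the IOAPIC pins and the padding bits past gsi_count. */
	for (i = 0; i < ioapic_pins && i < gsi_count; i++)
		gsi_set_bit(m->bits, i);
	for (i = gsi_count; i < m->max_gsi; i++)
		gsi_set_bit(m->bits, i);
}

/* Return the lowest free GSI and mark it used, or -ENOSPC. */
static int gsi_map_get(struct gsi_map *m)
{
	int i, bit;

	for (i = 0; i < m->max_gsi / 32; i++) {
		bit = ffs((int)~m->bits[i]);
		if (!bit)
			continue;
		bit += i * 32 - 1;
		gsi_set_bit(m->bits, bit);
		return bit;
	}
	return -ENOSPC;
}

/* Release a GSI so that a later gsi_map_get() can hand it out again. */
static void gsi_map_put(struct gsi_map *m, int gsi)
{
	if (gsi >= 0 && gsi < m->max_gsi)
		gsi_clear_bit(m->bits, gsi);
}
```

A caller would typically pass KVM_IOAPIC_NUM_PINS as ioapic_pins. Because freed bits are reused, a guest driver that toggles MSI repeatedly keeps receiving the same few GSIs instead of exhausting the range.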



[PATCH 0/2] Add serial number support for virtio_blk, V2

2009-05-13 Thread john cooper

[Resend of earlier patch: 1/2 rebased to qemu-kvm,
2/2 minor tweak]

This patch allows passing of a virtio_blk drive
serial number from qemu into a guest's virtio_blk
driver, and provides a means to access the serial
number from a guest's userspace.

Equivalent functionality currently exists for IDE
and SCSI, however it is not yet implemented for
virtio.  Scenarios exist where guest code relies
on a unique drive serial number to correctly
identify the machine environment in which it
exists.

The following two patches implement the above:

   qemu-vblk-serial-2.patch

which provides the qemu missing bits to interpret
a '-drive .. serial=XYZ ..' flag, and:

   virtio_blk-serial-2.patch

which extracts this information and makes it
available to guest userspace via ioctl.

Attached to this patch header is a trivial example
program which retrieves the serial number from
guest userspace.

The above patches are relative to qemu-kvm.git and
2.6.29.3 respectively.

-john

--
john.coo...@redhat.com

/* example: retrieve serial number from virtio block device
 */
#include <stdio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/virtio_blk.h>

/* true for anything outside the printable, non-space ASCII range */
#define iswhite(c)	(!('!' <= (c) && (c) <= '~'))

#ifndef VBLK_GET_SN
#define VBLK_GET_SN ((unsigned int)('V' << 24 | 'B' << 16 | 'L' << 8 | 'K'))
#endif

/* get virtblk drive serial#
 */
int main(int ac, char **av)
{
	int fd, nb, i;
	unsigned char sn[30];
	unsigned char *p;

	/* the first byte tells the driver how large the buffer is */
	sn[0] = sizeof (sn);
	if ((fd = open("/dev/vda", O_RDONLY)) < 0)
		perror("can't open device"), exit(1);
	else if ((nb = ioctl(fd, VBLK_GET_SN, sn)) < 0)
		perror("can't ioctl device"), exit(1);
	printf("returned %d bytes:\n", nb);
	/* hex dump, then the same bytes as text with '.' for non-printables */
	for (p = sn, i = nb; 0 <= --i; ++p)
		printf("%02x%c", *p, i ? ' ' : '\t');
	for (p = sn, i = nb; 0 <= --i; ++p)
		printf("%c%s", iswhite(*p) ? '.' : *p, i ? "" : "\n");
	return (0);
}


Re: [RFC + PATCHES] Work to get KVM autotest upstream

2009-05-13 Thread Lucas Meneghel Rodrigues
On Wed, 2009-05-13 at 12:23 -0400, Michael Goldish wrote:
 The patches look good, but I haven't tested them yet to make sure
 they leave everything at a functional state (will test them and let
 you know).

Thanks Michael! I will start testing this more thoroughly today,
since we finally got 0.10 in shape.

 I have a somewhat related question: how is KVM-Autotest development
 going to proceed after the upstream merge? Currently I have
 comfortable access to our repository at TLV, and on good days I push
 as many as 20 patches per day. Should I submit all patches to the
 Autotest mailing list after the merge, or are we going to work with
 pull requests, or some other way? Will we work with git or svn?

Here is my plan: For people inside our team, with access to the git tree
we can just pull stuff to the git tree and on a given time basis I can
pick up the patches and send them altogether to the KVM and autotest
mailing list, wait for reviews and then check them.

If you are already used to send all your changes to the KVM mailing list
though, this would pose little or no change to you, just send an
additional cc to the autotest mailing list.

What do you think?

 Thanks,
 Michael
 
 - Original Message -
 From: Lucas Meneghel Rodrigues mrodr...@redhat.com
 To: kvm@vger.kernel.org
 Sent: Wednesday, May 13, 2009 4:37:40 PM (GMT+0200) Auto-Detected
 Subject: [RFC + PATCHES] Work to get KVM autotest upstream
 
 These are the patches I have so far related to the work to get kvm
 autotest in shape for upstream merge. Please note that once the patches
 are applied, the kvm_runtest_2 directory should be placed on a fresh svn
 trunk checkout to work, so there's a little bit of tweaking to get them
 working.
 
 That said, this haven't had enough testing. I am posting them here only
 if someone wants to take a look at them.
 
 Cheers,
-- 
Lucas Meneghel Rodrigues
Software Engineer (QE)
Red Hat - Emerging Technologies



[PATCH 1/2] Add serial number support for virtio_blk, V2

2009-05-13 Thread john cooper


--
john.coo...@redhat.com

diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index dad4ef0..90825a8 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -25,6 +25,7 @@ typedef struct VirtIOBlock
     BlockDriverState *bs;
     VirtQueue *vq;
     void *rq;
+    char serial_str[BLOCK_SERIAL_STRLEN + 1];
 } VirtIOBlock;
 
 static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
@@ -285,6 +286,8 @@ static void virtio_blk_reset(VirtIODevice *vdev)
     qemu_aio_flush();
 }
 
+/* coalesce internal state, copy to pci i/o region 0
+ */
 static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config)
 {
     VirtIOBlock *s = to_virtio_blk(vdev);
@@ -299,11 +302,13 @@ static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config)
     stw_raw(&blkcfg.cylinders, cylinders);
     blkcfg.heads = heads;
     blkcfg.sectors = secs;
+    memcpy(blkcfg.serial, s->serial_str, sizeof (blkcfg.serial));
     memcpy(config, &blkcfg, sizeof(blkcfg));
 }
 
 static uint32_t virtio_blk_get_features(VirtIODevice *vdev)
 {
+    VirtIOBlock *s = to_virtio_blk(vdev);
     uint32_t features = 0;
 
     features |= (1 << VIRTIO_BLK_F_SEG_MAX);
@@ -311,6 +316,8 @@ static uint32_t virtio_blk_get_features(VirtIODevice *vdev)
 #ifdef __linux__
     features |= (1 << VIRTIO_BLK_F_SCSI);
 #endif
+    if (strcmp(s->serial_str, "0"))
+        features |= 1 << VIRTIO_BLK_F_SN;
 
     return features;
 }
@@ -353,6 +360,7 @@ void *virtio_blk_init(PCIBus *bus, BlockDriverState *bs)
     VirtIOBlock *s;
     int cylinders, heads, secs;
     static int virtio_blk_id;
+    char *ps = drive_get_serial(bs);
 
     s = (VirtIOBlock *)virtio_init_pci(bus, "virtio-blk",
                                        PCI_VENDOR_ID_REDHAT_QUMRANET,
@@ -369,6 +377,10 @@ void *virtio_blk_init(PCIBus *bus, BlockDriverState *bs)
     s->vdev.reset = virtio_blk_reset;
     s->bs = bs;
     s->rq = NULL;
+    if (strlen(ps))
+        strncpy(s->serial_str, ps, sizeof (s->serial_str));
+    else
+        snprintf(s->serial_str, sizeof (s->serial_str), "0");
     bs->private = &s->vdev.pci_dev;
     bdrv_guess_geometry(s->bs, &cylinders, &heads, &secs);
     bdrv_set_geometry_hint(s->bs, cylinders, heads, secs);
diff --git a/hw/virtio-blk.h b/hw/virtio-blk.h
index 5ef6c36..3229394 100644
--- a/hw/virtio-blk.h
+++ b/hw/virtio-blk.h
@@ -31,6 +31,7 @@
 #define VIRTIO_BLK_F_RO 5   /* Disk is read-only */
 #define VIRTIO_BLK_F_BLK_SIZE   6   /* Block size of disk is available*/
 #define VIRTIO_BLK_F_SCSI   7   /* Supports scsi command passthru */
+#define VIRTIO_BLK_F_SN 8   /* serial number supported */
 
 struct virtio_blk_config
 {
@@ -40,6 +41,8 @@ struct virtio_blk_config
 uint16_t cylinders;
 uint8_t heads;
 uint8_t sectors;
+uint32_t _blk_size;/* structure pad, currently unused */
+uint8_t serial[BLOCK_SERIAL_STRLEN];
 } __attribute__((packed));
 
 /* These two define direction. */
diff --git a/sysemu.h b/sysemu.h
index 1f45fd6..185b4e3 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -141,6 +141,8 @@ typedef enum {
 BLOCK_ERR_STOP_ANY
 } BlockInterfaceErrorAction;
 
+#define BLOCK_SERIAL_STRLEN 20
+
 typedef struct DriveInfo {
 BlockDriverState *bdrv;
 BlockInterfaceType type;
@@ -149,7 +151,7 @@ typedef struct DriveInfo {
 int used;
 int drive_opt_idx;
 BlockInterfaceErrorAction onerror;
-char serial[21];
+char serial[BLOCK_SERIAL_STRLEN + 1];
 } DriveInfo;
 
 #define MAX_IDE_DEVS	2


[PATCH 2/2] Add serial number support for virtio_blk, V2

2009-05-13 Thread john cooper


--
john.coo...@redhat.com

 drivers/block/virtio_blk.c |   35 ---
 include/linux/virtio_blk.h |   10 ++
 2 files changed, 42 insertions(+), 3 deletions(-)
=
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -146,12 +146,40 @@ static void do_virtblk_request(struct re
 		vblk->vq->vq_ops->kick(vblk->vq);
 }
 
+/* user passes the address of a char[] for serial# return, and has set char[0]
+ * to the array size.  copy serial# to this char[] and return number of
+ * characters copied excluding any trailing '\0' pad chars in buffer.
+ */
+static int get_virtblk_sn(struct block_device *bdev, void *buf)
+{
+	struct virtio_blk *vblk = bdev->bd_disk->private_data;
+	unsigned char serial[BLOCK_SERIAL_STRLEN];
+	unsigned char snlen;
+	int rv;
+
+	if (copy_from_user(&snlen, buf, sizeof (snlen)))
+		rv = -EFAULT;
+	else if ((rv = virtio_config_val(vblk->vdev, VIRTIO_BLK_F_SN,
+		offsetof(struct virtio_blk_config, serial), &serial)))
+			;
+	else if (copy_to_user(buf, serial,
+		snlen = min(snlen, (unsigned char)sizeof (serial))))
+			rv = -EFAULT;
+	else
+		for (rv = 0; rv < snlen; ++rv)
+			if (!serial[rv])
+				break;
+	return (rv);
+}
+
 static int virtblk_ioctl(struct block_device *bdev, fmode_t mode,
 			 unsigned cmd, unsigned long data)
 {
-	return scsi_cmd_ioctl(bdev->bd_disk->queue,
-			  bdev->bd_disk, mode, cmd,
-			  (void __user *)data);
+	if (cmd == VBLK_GET_SN)
+		return (get_virtblk_sn(bdev, (void __user *)data));
+	else
+		return scsi_cmd_ioctl(bdev->bd_disk->queue, bdev->bd_disk,
+			mode, cmd, (void __user *)data);
 }
 
 /* We provide getgeo only to please some old bootloader/partitioning tools */
@@ -356,6 +384,7 @@ static struct virtio_device_id id_table[
 static unsigned int features[] = {
 	VIRTIO_BLK_F_BARRIER, VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX,
 	VIRTIO_BLK_F_GEOMETRY, VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
+	VIRTIO_BLK_F_SN
 };
 
 static struct virtio_driver virtio_blk = {
=
--- a/include/linux/virtio_blk.h
+++ b/include/linux/virtio_blk.h
@@ -15,7 +15,16 @@
 #define VIRTIO_BLK_F_GEOMETRY	4	/* Legacy geometry available  */
 #define VIRTIO_BLK_F_RO		5	/* Disk is read-only */
 #define VIRTIO_BLK_F_BLK_SIZE	6	/* Block size of disk is available*/
+#define VIRTIO_BLK_F_SN		8	/* serial number supported */
 
+/* ioctl cmd to retrieve serial#
+*/
+#define VBLK_GET_SN ((unsigned int)('V' << 24 | 'B' << 16 | 'L' << 8 | 'K'))
+
+#define BLOCK_SERIAL_STRLEN 20
+
+/* mapped into pci i/o region 0
+ */
 struct virtio_blk_config
 {
 	/* The capacity (in 512-byte sectors). */
@@ -32,6 +41,7 @@ struct virtio_blk_config
 	} geometry;
 	/* block size of device (if VIRTIO_BLK_F_BLK_SIZE) */
 	__u32 blk_size;
+	__u8 serial[BLOCK_SERIAL_STRLEN];
 } __attribute__((packed));
 
 /* These two define direction. */


Re: Best choice for copy/clone/snapshot

2009-05-13 Thread Charles Duffy

Ross Boylan wrote:

> Or do you mean I should back each virtual disk with an LVM volume?

Yes, this option is what was meant.

> That does seem cleaner; I've just been following the docs and they
> use regular files. They say I can't just use a raw partition, but
> maybe kvm-img -f qcow2 /dev/MyVolumeGroup/Volume10 ?

While new versions of qcow2 have some extensions that let the
last-written sector be tracked for use on device-backed partitions, the
expectation is that you'll (really) just use the raw partition; qcow2
more than takes back the performance gain from getting your host
filesystem out of the loop.

> I'm not sure how growing the logical
> volume would interact with qcow...

Right -- folks going this route use raw rather than qcow, so it's just a
matter of resizing the partitions / filesystems within the guest.




RE: Problem doing pci passthrough of the network card without VT-d

2009-05-13 Thread Fischer, Anna
Are you expecting this to work using the 1:1 mapping for direct device
assignment? I use a similar setup (e.g. dma=none and no VT-d) but a different
NIC (Intel 82598 10G) and a different driver (ixgbe). I see the same messages,
and I also can't get the device to work in the guest (while it does work in the
host OS). In fact I don't get any errors on the guest side, so it is hard to
track what is wrong. No I/O is happening. The guest cannot transmit/receive
any packets to/from those NICs. The interface packet counters stay at 0.

I see an error in QEMU saying "invalid memtype", and it also seems to have
trouble assigning IRQs. assigned_dev_enabled_msix() fails with "Invalid
Argument", but on the guest side I can see that MSI-X is configured properly
under /proc/interrupts.

I use the latest KVM 2.6.30 tree in both host OS and guest OS.

 -Original Message-
 From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On
 Behalf Of Passera, Pablo R
 Sent: 12 May 2009 11:22
 To: kvm@vger.kernel.org
 Subject: RE: Problem doing pci passthrough of the network card without
 VT-d

 One update on this. I disabled VT-d from the BIOS and now I am not
 getting the DMAR error messages in dmesg, but the board still does not
 work on the guest. Any help is welcomed.

 e1000e 0000:00:19.0: PCI INT A disabled
 pci-stub 0000:00:19.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
 pci-stub 0000:00:19.0: irq 29 for MSI/MSI-X
 pci-stub 0000:00:19.0: irq 29 for MSI/MSI-X
 pci-stub 0000:00:19.0: irq 29 for MSI/MSI-X
 pci-stub 0000:00:19.0: irq 29 for MSI/MSI-X
 pci-stub 0000:00:19.0: irq 29 for MSI/MSI-X
 pci-stub 0000:00:19.0: irq 29 for MSI/MSI-X
 pci-stub 0000:00:19.0: irq 29 for MSI/MSI-X
 pci-stub 0000:00:19.0: irq 29 for MSI/MSI-X
 pci-stub 0000:00:19.0: irq 29 for MSI/MSI-X
 pci-stub 0000:00:19.0: irq 29 for MSI/MSI-X

 Regards,
 Pablo

 -Original Message-
 From: Passera, Pablo R
 Sent: Tuesday, May 12, 2009 12:14 PM
 To: kvm@vger.kernel.org
 Subject: Problem doing pci passthrough of the network card without VT-
 d
 
 Hi List,
I am having problems to do pci passthrough to a network card
 without using VT-d. The card is present in the guest but with a
 different model (Intel Corporation 82801I Gigabit Ethernet Controller
 (rev 2)) and it does not work. The qemu line that I used is:
 
 ./devel/bin/qemu-system-x86_64 -hda ./dm.img -m 256 -pcidevice
 host=00:19.0,dma=none -net none
 
 Before running qemu I did
 
 echo "8086 294c" > /sys/bus/pci/drivers/pci-stub/new_id
 echo 0000:00:19.0 > /sys/bus/pci/drivers/e1000e/unbind
 echo 0000:00:19.0 > /sys/bus/pci/drivers/pci-stub/bind
 
 This is the lspci -tv output
 
 -[:00]-+-00.0  Intel Corporation 82X38/X48 Express DRAM Controller
+-01.0-[:01]00.0  nVidia Corporation G80 [GeForce
 8800 GTX]
+-19.0  Intel Corporation 82566DC-2 Gigabit Network
 Connection
+-1a.0  Intel Corporation 82801I (ICH9 Family) USB UHCI
 Controller #4
+-1a.1  Intel Corporation 82801I (ICH9 Family) USB UHCI
 Controller #5
+-1a.2  Intel Corporation 82801I (ICH9 Family) USB UHCI
 Controller #6
+-1a.7  Intel Corporation 82801I (ICH9 Family) USB2 EHCI
 Controller #2
+-1b.0  Intel Corporation 82801I (ICH9 Family) HD Audio
 Controller
+-1c.0-[:02]--
+-1c.4-[:03]00.0  Marvell Technology Group Ltd.
 88SE6121 SATA II Controller
+-1d.0  Intel Corporation 82801I (ICH9 Family) USB UHCI
 Controller #1
+-1d.1  Intel Corporation 82801I (ICH9 Family) USB UHCI
 Controller #2
+-1d.2  Intel Corporation 82801I (ICH9 Family) USB UHCI
 Controller #3
+-1d.7  Intel Corporation 82801I (ICH9 Family) USB2 EHCI
 Controller #1
+-1e.0-[:04]03.0  Texas Instruments TSB43AB22/A
 IEEE-
 1394a-2000 Controller (PHY/Link)
+-1f.0  Intel Corporation 82801IR (ICH9R) LPC Interface
 Controller
+-1f.2  Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 4
 port
 SATA IDE Controller
+-1f.3  Intel Corporation 82801I (ICH9 Family) SMBus
 Controller
\-1f.5  Intel Corporation 82801I (ICH9 Family) 2 port SATA
 IDE Controller
 
 
 I am getting the following error in host dmesg
 
 e1000e 0000:00:19.0: PCI INT A disabled
 pci-stub 0000:00:19.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
 pci-stub 0000:00:19.0: irq 29 for MSI/MSI-X
 pci-stub 0000:00:19.0: irq 29 for MSI/MSI-X
 pci-stub 0000:00:19.0: irq 29 for MSI/MSI-X
 pci-stub 0000:00:19.0: irq 29 for MSI/MSI-X
 pci-stub 0000:00:19.0: irq 29 for MSI/MSI-X
 DMAR:[DMA Read] Request device [00:19.0] fault addr baee000
 DMAR:[fault reason 02] Present bit in context entry is clear
 pci-stub 0000:00:19.0: irq 29 for MSI/MSI-X
 pci-stub 0000:00:19.0: irq 29 for MSI/MSI-X
 pci-stub 0000:00:19.0: irq 29 for MSI/MSI-X
 pci-stub 0000:00:19.0: irq 29 for MSI/MSI-X
 pci-stub 0000:00:19.0: irq 29 for MSI/MSI-X
 DMAR:[DMA Read] Request device [00:19.0] fault 

[PATCH v6] kvm: Use a bitmap for tracking used GSIs

2009-05-13 Thread Alex Williamson
We're currently using a counter to track the most recent GSI we've
handed out.  This quickly hits KVM_MAX_IRQ_ROUTES when using device
assignment with a driver that regularly toggles the MSI enable bit.
This can mean only a few minutes of usable run time.  Instead, track
used GSIs in a bitmap.

Signed-off-by: Alex Williamson <alex.william...@hp.com>
---

 v2: Added mutex to protect gsi bitmap
 v3: Updated for comments from Michael Tsirkin
 No longer depends on [PATCH] kvm: device-assignment: Catch GSI overflow
 v4: Fix gsi_bytes calculation noted by Sheng Yang
 v5: Remove mutex per Avi
 Fix negative gsi_count path per Michael
 Remove KVM_CAP_IRQ_ROUTING per Michael, ppc should still be protected
 by the KVM_IOAPIC_NUM_PINS check
 v6: Make use of ALIGN macro, per Michael
 Define KVM_IOAPIC_NUM_PINS if not already, per Michael
 Fix comment indent, per Michael
 Remove unused BITMAP_SIZE macro

 hw/device-assignment.c  |4 ++
 kvm/libkvm/kvm-common.h |3 +-
 kvm/libkvm/libkvm.c |   80 +--
 kvm/libkvm/libkvm.h |   10 ++
 4 files changed, 78 insertions(+), 19 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index a7365c8..a6cc9b9 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -561,8 +561,10 @@ static void free_dev_irq_entries(AssignedDevice *dev)
 {
     int i;
 
-    for (i = 0; i < dev->irq_entries_nr; i++)
+    for (i = 0; i < dev->irq_entries_nr; i++) {
         kvm_del_routing_entry(kvm_context, &dev->entry[i]);
+        kvm_free_irq_route_gsi(kvm_context, dev->entry[i].gsi);
+    }
     free(dev->entry);
     dev->entry = NULL;
     dev->irq_entries_nr = 0;
diff --git a/kvm/libkvm/kvm-common.h b/kvm/libkvm/kvm-common.h
index 591fb53..c95c591 100644
--- a/kvm/libkvm/kvm-common.h
+++ b/kvm/libkvm/kvm-common.h
@@ -67,7 +67,8 @@ struct kvm_context {
struct kvm_irq_routing *irq_routes;
int nr_allocated_irq_routes;
 #endif
-   int max_used_gsi;
+   void *used_gsi_bitmap;
+   int max_gsi;
 };
 
 int kvm_alloc_kernel_memory(kvm_context_t kvm, unsigned long memory,
diff --git a/kvm/libkvm/libkvm.c b/kvm/libkvm/libkvm.c
index ba0a5d1..70857c7 100644
--- a/kvm/libkvm/libkvm.c
+++ b/kvm/libkvm/libkvm.c
@@ -61,10 +61,18 @@
 #define DPRINTF(fmt, args...) do {} while (0)
 #endif
 
+#define MIN(x,y) ((x) < (y) ? (x) : (y))
+#define ALIGN(x, y) (((x)+(y)-1) & ~((y)-1))
+
+#ifndef KVM_IOAPIC_NUM_PINS
+#define KVM_IOAPIC_NUM_PINS 0
+#endif
 
 int kvm_abi = EXPECTED_KVM_API_VERSION;
 int kvm_page_size;
 
+static inline void set_bit(uint32_t *buf, unsigned int bit);
+
 struct slot_info {
 	unsigned long phys_addr;
 	unsigned long len;
@@ -285,7 +293,7 @@ kvm_context_t kvm_init(struct kvm_callbacks *callbacks,
 {
 	int fd;
 	kvm_context_t kvm;
-	int r;
+	int r, gsi_count;
 
 	fd = open("/dev/kvm", O_RDWR);
 	if (fd == -1) {
@@ -323,6 +331,26 @@ kvm_context_t kvm_init(struct kvm_callbacks *callbacks,
 	kvm->no_irqchip_creation = 0;
 	kvm->no_pit_creation = 0;
 
+	gsi_count = kvm_get_gsi_count(kvm);
+	if (gsi_count > 0) {
+		int gsi_bits, i;
+
+		/* Round up so we can search ints using ffs */
+		gsi_bits = ALIGN(gsi_count, 32);
+		kvm->used_gsi_bitmap = malloc(gsi_bits / 8);
+		if (!kvm->used_gsi_bitmap)
+			goto out_close;
+		memset(kvm->used_gsi_bitmap, 0, gsi_bits / 8);
+		kvm->max_gsi = gsi_bits;
+
+		/* Mark all the IOAPIC pin GSIs and any over-allocated
+		 * GSIs as already in use. */
+		for (i = 0; i < MIN(KVM_IOAPIC_NUM_PINS, gsi_count); i++)
+			set_bit(kvm->used_gsi_bitmap, i);
+		for (i = gsi_count; i < gsi_bits; i++)
+			set_bit(kvm->used_gsi_bitmap, i);
+	}
+
 	return kvm;
  out_close:
 	close(fd);
@@ -626,9 +654,6 @@ int kvm_get_dirty_pages(kvm_context_t kvm, unsigned long phys_addr, void *buf)
 	return kvm_get_map(kvm, KVM_GET_DIRTY_LOG, slot, buf);
 }
 
-#define ALIGN(x, y)  (((x)+(y)-1) & ~((y)-1))
-#define BITMAP_SIZE(m) (ALIGN(((m)/PAGE_SIZE), sizeof(long) * 8) / 8)
-
 int kvm_get_dirty_pages_range(kvm_context_t kvm, unsigned long phys_addr,
 			      unsigned long len, void *buf, void *opaque,
 			      int (*cb)(unsigned long start, unsigned long len,
@@ -1298,8 +1323,6 @@ int kvm_add_routing_entry(kvm_context_t kvm,
 	new->flags = entry->flags;
 	new->u = entry->u;
 
-	if (entry->gsi > kvm->max_used_gsi)
-		kvm->max_used_gsi = entry->gsi;
 	return 0;
 #else
 	return -ENOSYS;
@@ -1404,19 +1427,42 @@ int kvm_commit_irq_routes(kvm_context_t kvm)
 #endif
 }
 
+static inline void set_bit(uint32_t *buf, unsigned int bit)
+{
+	buf[bit / 32] |= 1U << (bit % 32);
+}
+
+static inline void clear_bit(uint32_t 

RE: Problem doing pci passthrough of the network card without VT-d

2009-05-13 Thread Passera, Pablo R
Hi Anna,

> Are you expecting this to work using the 1:1 mapping for direct device
> assignment?
Actually, I want to use the current qemu implementation for this. AFAIK, from
the code it seems that qemu mmaps the device memory into the qemu pci subsystem
memory space. Is this correct?

> In fact I don't get any errors on the guest side, so it is hard to track what
> is wrong.
In the guest I am getting an error in dmesg saying "Detected Tx Unit Hang".

> I see an error in QEMU saying invalid memtype, and it also seems to have
> trouble assigning IRQs.
The only error I am seeing in qemu is

assigned_dev_iomem_map: e_phys=f202 r_virt=0x7f95bca9a000 type=0 
len=0002 region_num=0
BUG: kvm_destroy_phys_mem: invalid parameters (slot=-1)

Regards,
Pablo


kvm-autotest: The automation plans?

2009-05-13 Thread sudhir kumar
Hi Uri/Lucas,

Do you have any plans for enhancing kvm-autotest?
I am looking mainly at the following two aspects:

(1).
We have standalone migration only. Are there any plans to enhance
kvm-autotest so that we can trigger migration while a workload is
running?
Something like this:
Start a workload (maybe n instances of it).
Let the test execute for some time.
Trigger migration.
Log into the target.
Check if the migration is successful.
Check if the test results are consistent.

(2).
How can we run N parallel instances of a test? Will the current
configuration be able to support it easily?

Please provide your thoughts on the above features.

-- 
Sudhir Kumar


Re: kvm-autotest: The automation plans?

2009-05-13 Thread Michael Goldish

----- "sudhir kumar" <smalik...@gmail.com> wrote:

> Hi Uri/Lucas,
>
> Do you have any plans for enhancing kvm-autotest?
> I was looking mainly on the following 2 aspects:
>
> (1).
> we have standalone migration only. Is there any plans of enhancing
> kvm-autotest so that we can trigger migration while a workload is
> running?
> Something like this:
> Start a workload(may be n instances of it).
> let the test execute for some time.
> Trigger migration.
> Log into the target.
> Check if the migration is succesful
> Check if the test results are consistent.

Yes, we have plans to implement such functionality. It shouldn't be
hard, but we need to give it some thought in order to implement it as
elegantly as possible.

> (2).
> How can we run N parallel instances of a test? Will the current
> configuration  be easily able to support it?

I currently have some experimental patches that allow running of
several parallel queues of tests. But what exactly do you mean by
N parallel instances of a test? Do you mean N queues? Please provide
an example so I can get a better idea.

Thanks,
Michael


Re: changing guest CD

2009-05-13 Thread Stuart Jansen
On Mon, 2009-05-11 at 17:13 -0500, Anthony Liguori wrote:
> Stuart Jansen wrote:
> > Does KVM support changing the CD in a running guest's disc drive? I've
> > tried to do it using the qemu monitor, but so far haven't been able to.
> > I've seen rumor and innuendo that KVM can't change the disc in a running
> > system, but no official confirmation yet.
> >
> > If KVM doesn't support changing the disc in a running system, what would
> > be required to support it?
> >
>
> It does via the change command.  What did you try and how did it fail?

I've been using both libvirt and raw qemu monitor with Fedora 11 KVM
RPMs. After further testing, while F11 doesn't work, F10 does. Guess
it's time to see if Fedora bugzilla already has a report.

-- 
XML is like violence: if it doesn't solve your problem, you aren't
using enough of it. - Chris Maden



Re: [KVM-AUTOTEST][PATCH] timedrift support

2009-05-13 Thread Lucas Meneghel Rodrigues
On Tue, 2009-05-12 at 21:07 +0800, Bear Yang wrote:
> Sorry forgot to attach my new patch.
> Bear Yang wrote:
> > Hi Lucas:
> > First, I want to say really thanks for your kindly,carefully words and
> > suggestions. now,  I modified my scripts follow your opinions.
> > 1. Add the genload to timedrift, but I am not sure whether it is right
> > or not to add the information  CVS relevant. If it is not necessary. I
> > will remove them next time.

Yes, we can remove the CVS related info, they just got mailed to you
because I got the code from a fresh LTP CVS checkout!

> > 2. Replace the API os.system to utils.system
> > 3. Replace the API os.environ.get('HOSTNAME') to socket.gethostname()
> > 4. for the snippet of the code below:
> > +if utils.system(ntp_cmd, ignore_status=True) != 0:
> > +    raise error.TestFail, "NTP server has not starting correctly..."
> >
> > Your suggestion is "Instead of the if clause we'd put a try/except
> > block", but I am not clear how to do it. Would you please give me some
> > guides for this. Sorry.

You could re-write the above if statement using the form:

try:
    utils.system(ntp_cmd)
except:
    raise error.TestFail("NTP server has not started correctly")

Some comments:

 1) The try/except block works because utils.system already throws an
exception when the exit code is different from 0.
 2) The form

raise error.TestFail("NTP server has not started correctly")

is preferred on the upstream project over the equivalent

raise error.TestFail, "NTP server has not started correctly"

But on kvm autotest we are adopting the latter, so don't worry and keep
all the raises the way they are in your original patch. This was
just a side comment.
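
To make the pattern in comment 1 concrete outside of autotest, here is a small self-contained sketch. It does not use the autotest API: system() and TestFail below are stand-ins written for this illustration (assumptions, not the real utils.system or error.TestFail), with subprocess.check_call playing the role of a helper that raises on a non-zero exit code.

```python
import subprocess
import sys


class TestFail(Exception):
    """Stand-in for autotest's error.TestFail (assumption for this sketch)."""


def system(argv):
    """Stand-in for utils.system: raises CalledProcessError on non-zero exit."""
    return subprocess.check_call(argv)


def start_service(argv):
    """Run a command and turn any non-zero exit into a test failure."""
    try:
        system(argv)
    except subprocess.CalledProcessError:
        raise TestFail("NTP server has not started correctly")


if __name__ == "__main__":
    # A command that exits with status 0 passes silently.
    start_service([sys.executable, "-c", "raise SystemExit(0)"])
```

Catching the specific exception type, rather than using a bare except, keeps unrelated errors such as KeyboardInterrupt from being reported as a failed service start.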

> > Other thing about functional the clauses which to get vm handle below:
> >
> > +# get vm handle
> > +vm = kvm_utils.env_get_vm(env,params.get("main_vm"))
> > +if not vm:
> > +    raise error.TestError, "VM object not found in environment"
> > +if not vm.is_alive():
> > +    raise error.TestError, "VM seems to be dead; Test requires a
> > living VM"
> >
> > I agree with you on this point, I remember that somebody to do this
> > before. but seems upstream not accept his modification.

Ok, will take a look at this. By the way, when you have an updated patch
please let us know!

Thank you very much,

-- 
Lucas Meneghel Rodrigues
Software Engineer (QE)
Red Hat - Emerging Technologies



Re: kvm-autotest: The automation plans?

2009-05-13 Thread Lucas Meneghel Rodrigues
On Wed, 2009-05-13 at 23:21 +0530, sudhir kumar wrote:
> Hi Uri/Lucas,
>
> Do you have any plans for enhancing kvm-autotest?
> I was looking mainly on the following 2 aspects:

Hi Sudhir, Michael has answered your two questions a lot better than I
possibly could. So please keep in touch and send your ideas so we can
consider implementing them in our tests!

Thank you very much,

-- 
Lucas Meneghel Rodrigues
Software Engineer (QE)
Red Hat - Emerging Technologies



Re: [PATCH v6] kvm: Use a bitmap for tracking used GSIs

2009-05-13 Thread Michael S. Tsirkin
On Wed, May 13, 2009 at 11:28:16AM -0600, Alex Williamson wrote:
> We're currently using a counter to track the most recent GSI we've
> handed out.  This quickly hits KVM_MAX_IRQ_ROUTES when using device
> assignment with a driver that regularly toggles the MSI enable bit.
> This can mean only a few minutes of usable run time.  Instead, track
> used GSIs in a bitmap.
>
> Signed-off-by: Alex Williamson <alex.william...@hp.com>

Acked-by: Michael S. Tsirkin <m...@redhat.com>
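
For anyone following the thread without the full patch, the idea of replacing a monotonically increasing counter with a bitmap can be sketched in a few lines of Python. This is an illustration of the technique only, not the libkvm code: the class and method names are hypothetical, the -1 "no free GSI" return value is an assumption (the posted patch is cut off before its allocation function), and the reserved pin count is just an example argument.

```python
class GsiBitmap:
    """Track used GSIs in 32-bit words so freed numbers can be reused."""

    def __init__(self, gsi_count, reserved_pins):
        # Round up to a multiple of 32, like ALIGN(gsi_count, 32).
        self.max_gsi = (gsi_count + 31) & ~31
        self.words = [0] * (self.max_gsi // 32)
        # Mark the reserved low GSIs and the padding bits as in use.
        for i in range(min(reserved_pins, gsi_count)):
            self._set(i)
        for i in range(gsi_count, self.max_gsi):
            self._set(i)

    def _set(self, bit):
        self.words[bit // 32] |= 1 << (bit % 32)

    def _clear(self, bit):
        self.words[bit // 32] &= ~(1 << (bit % 32))

    def alloc(self):
        """Return the lowest free GSI, or -1 if none is left (assumed value)."""
        for idx, word in enumerate(self.words):
            free = ~word & 0xFFFFFFFF
            if free:
                # Lowest set bit of 'free', equivalent to ffs(free) - 1 in C.
                bit = (free & -free).bit_length() - 1
                gsi = idx * 32 + bit
                self._set(gsi)
                return gsi
        return -1

    def free(self, gsi):
        self._clear(gsi)
```

With a counter, a driver that toggles MSI repeatedly consumes a new number each time; with the bitmap, alloc() after free() hands the same number back, so the pool is exhausted only when that many GSIs are in use at once.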

-- 
MST


[PATCHv5 0/3] virtio: MSI-X support

2009-05-13 Thread Michael S. Tsirkin
Here's the latest draft of virtio patches.
This is on top of Rusty's recent virtqueue list + name patch.

Michael S. Tsirkin (3):
  virtio: find_vqs/del_vqs virtio operations
  virtio_pci: split up vp_interrupt
  virtio_pci: optional MSI-X support

 drivers/block/virtio_blk.c  |6 +-
 drivers/char/hw_random/virtio-rng.c |6 +-
 drivers/char/virtio_console.c   |   26 ++--
 drivers/lguest/lguest_device.c  |   36 -
 drivers/net/virtio_net.c|   45 ++---
 drivers/s390/kvm/kvm_virtio.c   |   36 -
 drivers/virtio/virtio_balloon.c |   27 ++--
 drivers/virtio/virtio_pci.c |  301 ++-
 include/linux/virtio_config.h   |   46 --
 include/linux/virtio_pci.h  |   10 +-
 net/9p/trans_virtio.c   |2 +-
 11 files changed, 423 insertions(+), 118 deletions(-)


[PATCHv5 1/3] virtio: find_vqs/del_vqs virtio operations

2009-05-13 Thread Michael S. Tsirkin
This replaces find_vq/del_vq with find_vqs/del_vqs virtio operations,
and updates all drivers. This is needed for MSI support, because MSI
needs to know the total number of vectors upfront.

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
---
 drivers/block/virtio_blk.c  |6 ++--
 drivers/char/hw_random/virtio-rng.c |6 ++--
 drivers/char/virtio_console.c   |   26 ---
 drivers/lguest/lguest_device.c  |   36 +-
 drivers/net/virtio_net.c|   45 +
 drivers/s390/kvm/kvm_virtio.c   |   36 +-
 drivers/virtio/virtio_balloon.c |   27 
 drivers/virtio/virtio_pci.c |   37 ++-
 include/linux/virtio_config.h   |   46 ++
 net/9p/trans_virtio.c   |2 +-
 10 files changed, 180 insertions(+), 87 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 8f7c956..c9f5627 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -224,7 +224,7 @@ static int virtblk_probe(struct virtio_device *vdev)
 	sg_init_table(vblk->sg, vblk->sg_elems);

 	/* We expect one virtqueue, for output. */
-	vblk->vq = vdev->config->find_vq(vdev, 0, blk_done, "requests");
+	vblk->vq = virtio_find_single_vq(vdev, blk_done, "requests");
 	if (IS_ERR(vblk->vq)) {
 		err = PTR_ERR(vblk->vq);
 		goto out_free_vblk;
@@ -323,7 +323,7 @@ out_put_disk:
 out_mempool:
 	mempool_destroy(vblk->pool);
 out_free_vq:
-	vdev->config->del_vq(vblk->vq);
+	vdev->config->del_vqs(vdev);
 out_free_vblk:
 	kfree(vblk);
 out:
@@ -344,7 +344,7 @@ static void virtblk_remove(struct virtio_device *vdev)
 	blk_cleanup_queue(vblk->disk->queue);
 	put_disk(vblk->disk);
 	mempool_destroy(vblk->pool);
-	vdev->config->del_vq(vblk->vq);
+	vdev->config->del_vqs(vdev);
 	kfree(vblk);
 }
 
diff --git a/drivers/char/hw_random/virtio-rng.c b/drivers/char/hw_random/virtio-rng.c
index 2aeafce..f2041fe 100644
--- a/drivers/char/hw_random/virtio-rng.c
+++ b/drivers/char/hw_random/virtio-rng.c
@@ -94,13 +94,13 @@ static int virtrng_probe(struct virtio_device *vdev)
 	int err;

 	/* We expect a single virtqueue. */
-	vq = vdev->config->find_vq(vdev, 0, random_recv_done, "input");
+	vq = virtio_find_single_vq(vdev, random_recv_done, "input");
 	if (IS_ERR(vq))
 		return PTR_ERR(vq);

 	err = hwrng_register(&virtio_hwrng);
 	if (err) {
-		vdev->config->del_vq(vq);
+		vdev->config->del_vqs(vdev);
 		return err;
 	}

@@ -112,7 +112,7 @@ static void virtrng_remove(struct virtio_device *vdev)
 {
 	vdev->config->reset(vdev);
 	hwrng_unregister(&virtio_hwrng);
-	vdev->config->del_vq(vq);
+	vdev->config->del_vqs(vdev);
 }

 static struct virtio_device_id id_table[] = {
diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index 58684e4..c74dacf 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -188,6 +188,9 @@ static void hvc_handle_input(struct virtqueue *vq)
  * Finally we put our input buffer in the input queue, ready to receive. */
 static int __devinit virtcons_probe(struct virtio_device *dev)
 {
+	vq_callback_t *callbacks[] = { hvc_handle_input, NULL};
+	const char *names[] = { "input", "output" };
+	struct virtqueue *vqs[2];
 	int err;

 	vdev = dev;
@@ -199,20 +202,15 @@ static int __devinit virtcons_probe(struct virtio_device *dev)
 		goto fail;
 	}

-	/* Find the input queue. */
+	/* Find the queues. */
 	/* FIXME: This is why we want to wean off hvc: we do nothing
 	 * when input comes in. */
-	in_vq = vdev->config->find_vq(vdev, 0, hvc_handle_input, "input");
-	if (IS_ERR(in_vq)) {
-		err = PTR_ERR(in_vq);
+	err = vdev->config->find_vqs(vdev, 2, vqs, callbacks, names);
+	if (err)
 		goto free;
-	}

-	out_vq = vdev->config->find_vq(vdev, 1, NULL, "output");
-	if (IS_ERR(out_vq)) {
-		err = PTR_ERR(out_vq);
-		goto free_in_vq;
-	}
+	in_vq = vqs[0];
+	out_vq = vqs[1];

 	/* Start using the new console output. */
 	virtio_cons.get_chars = get_chars;
@@ -233,17 +231,15 @@ static int __devinit virtcons_probe(struct virtio_device *dev)
 	hvc = hvc_alloc(0, 0, &virtio_cons, PAGE_SIZE);
 	if (IS_ERR(hvc)) {
 		err = PTR_ERR(hvc);
-		goto free_out_vq;
+		goto free_vqs;
 	}

 	/* Register the input buffer the first time. */
 	add_inbuf();
 	return 0;

-free_out_vq:
-	vdev->config->del_vq(out_vq);
-free_in_vq:
-	vdev->config->del_vq(in_vq);
+free_vqs:
+	vdev->config->del_vqs(vdev);
 free:
  

[PATCHv5 2/3] virtio_pci: split up vp_interrupt

2009-05-13 Thread Michael S. Tsirkin
This reorganizes virtio-pci code in vp_interrupt slightly, so that
it's easier to add per-vq MSI support on top.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 drivers/virtio/virtio_pci.c |   53 +++---
 1 files changed, 34 insertions(+), 19 deletions(-)

diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index 027f13f..951e673 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -164,6 +164,37 @@ static void vp_notify(struct virtqueue *vq)
 	iowrite16(info->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY);
 }

+/* Handle a configuration change: Tell driver if it wants to know. */
+static irqreturn_t vp_config_changed(int irq, void *opaque)
+{
+	struct virtio_pci_device *vp_dev = opaque;
+	struct virtio_driver *drv;
+	drv = container_of(vp_dev->vdev.dev.driver,
+			   struct virtio_driver, driver);
+
+	if (drv && drv->config_changed)
+		drv->config_changed(&vp_dev->vdev);
+	return IRQ_HANDLED;
+}
+
+/* Notify all virtqueues on an interrupt. */
+static irqreturn_t vp_vring_interrupt(int irq, void *opaque)
+{
+	struct virtio_pci_device *vp_dev = opaque;
+	struct virtio_pci_vq_info *info;
+	irqreturn_t ret = IRQ_NONE;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vp_dev->lock, flags);
+	list_for_each_entry(info, &vp_dev->virtqueues, node) {
+		if (vring_interrupt(irq, info->vq) == IRQ_HANDLED)
+			ret = IRQ_HANDLED;
+	}
+	spin_unlock_irqrestore(&vp_dev->lock, flags);
+
+	return ret;
+}
+
 /* A small wrapper to also acknowledge the interrupt when it's handled.
  * I really need an EIO hook for the vring so I can ack the interrupt once we
  * know that we'll be handling the IRQ but before we invoke the callback since
@@ -173,9 +204,6 @@ static void vp_notify(struct virtqueue *vq)
 static irqreturn_t vp_interrupt(int irq, void *opaque)
 {
 	struct virtio_pci_device *vp_dev = opaque;
-	struct virtio_pci_vq_info *info;
-	irqreturn_t ret = IRQ_NONE;
-	unsigned long flags;
 	u8 isr;

 	/* reading the ISR has the effect of also clearing it so it's very
@@ -187,23 +215,10 @@ static irqreturn_t vp_interrupt(int irq, void *opaque)
 		return IRQ_NONE;

 	/* Configuration change?  Tell driver if it wants to know. */
-	if (isr & VIRTIO_PCI_ISR_CONFIG) {
-		struct virtio_driver *drv;
-		drv = container_of(vp_dev->vdev.dev.driver,
-				   struct virtio_driver, driver);
-
-		if (drv && drv->config_changed)
-			drv->config_changed(&vp_dev->vdev);
-	}
+	if (isr & VIRTIO_PCI_ISR_CONFIG)
+		vp_config_changed(irq, opaque);

-	spin_lock_irqsave(&vp_dev->lock, flags);
-	list_for_each_entry(info, &vp_dev->virtqueues, node) {
-		if (vring_interrupt(irq, info->vq) == IRQ_HANDLED)
-			ret = IRQ_HANDLED;
-	}
-	spin_unlock_irqrestore(&vp_dev->lock, flags);
-
-	return ret;
+	return vp_vring_interrupt(irq, opaque);
 }

 /* the config->find_vq() implementation */
-- 
1.6.3.rc3.1.g830204
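
The shape of the split can be mimicked in plain C. This is a hedged sketch only: `fake_dev`, the `FAKE_*` constants and all three functions are hypothetical, the bit values are assumptions, and locking is left out. It shows a shared handler that reads a read-to-clear status once and then delegates to two handlers that could each also be registered on their own vector.

```c
#include <stdint.h>

/* Hypothetical status bits and return codes for this sketch. */
#define FAKE_ISR_QUEUE   0x1
#define FAKE_ISR_CONFIG  0x2
enum { FAKE_IRQ_NONE = 0, FAKE_IRQ_HANDLED = 1 };

struct fake_dev {
    uint8_t isr;          /* pending status; reading clears it */
    int pending_queues;   /* queues that currently have work */
    int config_events;    /* how often the config handler ran */
    int queue_events;     /* how many queues were serviced */
};

/* Handler for configuration changes only. */
static int config_changed(struct fake_dev *d)
{
    d->config_events++;
    return FAKE_IRQ_HANDLED;
}

/* Handler for virtqueues only: services every queue with work. */
static int vring_interrupt_all(struct fake_dev *d)
{
    int ret = FAKE_IRQ_NONE;

    for (; d->pending_queues > 0; d->pending_queues--) {
        d->queue_events++;
        ret = FAKE_IRQ_HANDLED;
    }
    return ret;
}

/* Shared-line wrapper: read and clear the status, then delegate.
 * As in the patch, the return value is that of the queue handler. */
int shared_interrupt(struct fake_dev *d)
{
    uint8_t isr = d->isr;

    d->isr = 0;
    if (!isr)
        return FAKE_IRQ_NONE;
    if (isr & FAKE_ISR_CONFIG)
        config_changed(d);
    return vring_interrupt_all(d);
}
```

Keeping the two inner handlers free of any status-register access is what would let a per-vector setup call them directly.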



[PATCHv5 3/3] virtio_pci: optional MSI-X support

2009-05-13 Thread Michael S. Tsirkin
This implements optional MSI-X support in virtio_pci.
MSI-X is used whenever the host supports at least 2 MSI-X
vectors: 1 for configuration changes and 1 for virtqueues.
Per-virtqueue vectors are allocated if enough vectors are
available.
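
The fallback order described above can be sketched in user-space C. The sketch assumes a made-up host model: `mock_enable_msix` is a hypothetical stand-in that grants a request only when it fits, and is not the behaviour of the real `pci_enable_msix()`; `pick_msix_vectors` and `vectors_for` are likewise illustrative names.

```c
#include <stddef.h>

/* Hypothetical host model: a request succeeds (returns 0) only when
 * it does not exceed the number of vectors the device exposes. */
static int mock_enable_msix(int available, int requested)
{
    return requested <= available ? 0 : -1;
}

/* Try each option in order. Returns the vector count that was
 * granted, or -1 when nothing fits and the caller would have to
 * fall back to a regular shared interrupt. */
int pick_msix_vectors(int available, const int *options, size_t noptions)
{
    size_t i;

    for (i = 0; i < noptions; ++i)
        if (mock_enable_msix(available, options[i]) == 0)
            return options[i];
    return -1;
}

/* Option list in the order the description gives: one vector per
 * queue plus one for config changes, then the two-vector minimum. */
int vectors_for(int available, int max_vqs)
{
    int options[2];

    options[0] = max_vqs + 1;
    options[1] = 2;
    return pick_msix_vectors(available, options, 2);
}
```

For a device with three queues, four available vectors give every queue its own vector, three or two give the shared layout, and one gives no MSI-X at all in this model.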

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 drivers/virtio/virtio_pci.c |  227 +++
 include/linux/virtio_pci.h  |   10 ++-
 2 files changed, 217 insertions(+), 20 deletions(-)

diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index 951e673..65627a4 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -42,6 +42,26 @@ struct virtio_pci_device
 	/* a list of queues so we can dispatch IRQs */
 	spinlock_t lock;
 	struct list_head virtqueues;
+
+	/* MSI-X support */
+	int msix_enabled;
+	int intx_enabled;
+	struct msix_entry *msix_entries;
+	/* Name strings for interrupts. This size should be enough,
+	 * and I'm too lazy to allocate each name separately. */
+	char (*msix_names)[256];
+	/* Number of available vectors */
+	unsigned msix_vectors;
+	/* Vectors allocated */
+	unsigned msix_used_vectors;
+};
+
+/* Constants for MSI-X */
+/* Use first vector for configuration changes, second and the rest for
+ * virtqueues Thus, we need at least 2 vectors for MSI. */
+enum {
+	VP_MSIX_CONFIG_VECTOR = 0,
+	VP_MSIX_VQ_VECTOR = 1,
 };

 struct virtio_pci_vq_info
@@ -60,6 +80,9 @@ struct virtio_pci_vq_info

 	/* the list node for the virtqueues list */
 	struct list_head node;
+
+	/* MSI-X vector (or none) */
+	unsigned vector;
 };

 /* Qumranet donated their vendor ID for devices 0x1000 thru 0x10FF. */
@@ -109,7 +132,8 @@ static void vp_get(struct virtio_device *vdev, unsigned offset,
 		   void *buf, unsigned len)
 {
 	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
-	void __iomem *ioaddr = vp_dev->ioaddr + VIRTIO_PCI_CONFIG + offset;
+	void __iomem *ioaddr = vp_dev->ioaddr +
+				VIRTIO_PCI_CONFIG(vp_dev) + offset;
 	u8 *ptr = buf;
 	int i;

@@ -123,7 +147,8 @@ static void vp_set(struct virtio_device *vdev, unsigned offset,
 		   const void *buf, unsigned len)
 {
 	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
-	void __iomem *ioaddr = vp_dev->ioaddr + VIRTIO_PCI_CONFIG + offset;
+	void __iomem *ioaddr = vp_dev->ioaddr +
+				VIRTIO_PCI_CONFIG(vp_dev) + offset;
 	const u8 *ptr = buf;
 	int i;

@@ -221,7 +246,121 @@ static irqreturn_t vp_interrupt(int irq, void *opaque)
 	return vp_vring_interrupt(irq, opaque);
 }

-/* the config->find_vq() implementation */
+static void vp_free_vectors(struct virtio_device *vdev) {
+	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+	int i;
+
+	if (vp_dev->intx_enabled) {
+		free_irq(vp_dev->pci_dev->irq, vp_dev);
+		vp_dev->intx_enabled = 0;
+	}
+
+	for (i = 0; i < vp_dev->msix_used_vectors; ++i)
+		free_irq(vp_dev->msix_entries[i].vector, vp_dev);
+	vp_dev->msix_used_vectors = 0;
+
+	if (vp_dev->msix_enabled) {
+		/* Disable the vector used for configuration */
+		iowrite16(VIRTIO_MSI_NO_VECTOR,
+			  vp_dev->ioaddr + VIRTIO_MSI_CONFIG_VECTOR);
+		/* Flush the write out to device */
+		ioread16(vp_dev->ioaddr + VIRTIO_MSI_CONFIG_VECTOR);
+
+		vp_dev->msix_enabled = 0;
+		pci_disable_msix(vp_dev->pci_dev);
+	}
+}
+
+static int vp_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
+			  int *options, int noptions)
+{
+	int i;
+	for (i = 0; i < noptions; ++i)
+		if (!pci_enable_msix(dev, entries, options[i]))
+			return options[i];
+	return -EBUSY;
+}
+
+static int vp_request_vectors(struct virtio_device *vdev, unsigned max_vqs)
+{
+	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+	const char *name = dev_name(&vp_dev->vdev.dev);
+	unsigned i, v;
+	int err = -ENOMEM;
+	/* We want at most one vector per queue and one for config changes.
+	 * Fallback to separate vectors for config and a shared for queues.
+	 * Finally fall back to regular interrupts. */
+	int options[] = { max_vqs + 1, 2 };
+	int nvectors = max(options[0], options[1]);
+
+	vp_dev->msix_entries = kmalloc(nvectors * sizeof *vp_dev->msix_entries,
+				       GFP_KERNEL);
+	if (!vp_dev->msix_entries)
+		goto error_entries;
+	vp_dev->msix_names = kmalloc(nvectors * sizeof *vp_dev->msix_names,
+				     GFP_KERNEL);
+	if (!vp_dev->msix_names)
+		goto error_names;
+
+	for (i = 0; i < nvectors; ++i)
+ 

kvm-85 sometimes not starting on 2.6.30-rc5

2009-05-13 Thread Nikola Ciprich
Hi,
Sometimes, trying to start kvm on 2.6.30-rc5 (with kvm module v85, userspace
v85) fails with:
kvm_create_vm: Interrupted system call
Could not create KVM context

and the following backtrace appears in dmesg:
[  309.546138] BUG: MAX_LOCK_DEPTH too low!
[  309.549964] turning off the locking correctness validator.
[  309.549964] Pid: 2833, comm: qemu-kvm Not tainted 2.6.30lb.00_01_PRE08 #1
[  309.549964] Call Trace:
[  309.549964]  [80269aa9] __lock_acquire+0x4a9/0xb70
[  309.549964]  [802c54ef] ? mm_take_all_locks+0x2f/0x130
[  309.549964]  [8026b825] lock_acquire+0xa5/0x150
[  309.549964]  [802c55ac] ? mm_take_all_locks+0xec/0x130
[  309.549964]  [80505c96] _spin_lock_nest_lock+0x36/0x50
[  309.549964]  [802c55ac] ? mm_take_all_locks+0xec/0x130
[  309.549964]  [802c55ac] mm_take_all_locks+0xec/0x130
[  309.549964]  [802d43ab] do_mmu_notifier_register+0x7b/0x1d0
[  309.549964]  [802d451e] mmu_notifier_register+0xe/0x10
[  309.549964]  [a02a8dd9] kvm_dev_ioctl+0x189/0x2f0 [kvm]
[  309.549964]  [802f0171] vfs_ioctl+0x31/0x90
[  309.549964]  [802f03fb] do_vfs_ioctl+0x22b/0x550
[  309.549964]  [802f07a2] sys_ioctl+0x82/0xa0
[  309.549964]  [8020b442] system_call_fastpath+0x16/0x1b

It happened to me when I didn't have the storage with the kernel mounted.
Further attempts are usually successful.
BR
nik


-- 
-
Nikola CIPRICH
LinuxBox.cz, s.r.o.
28. rijna 168, 709 01 Ostrava

tel.:   +420 596 603 142
fax:+420 596 621 273
mobil:  +420 777 093 799

www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: ser...@linuxbox.cz
-


[PATCH] don't use a 32-bit bit type as offset argument.

2009-05-13 Thread Glauber Costa
In the call path of kvm_get_dirty_pages_log_range(),
its caller kvm_get_dirty_bitmap_cb() passes the
target_phys_addr_t both as start_addr and the offset.
So, using int will make dirty tracking over 4G fail
completely.

Of course we should be using qemu types in
here, so please don't get me started on this. The whole
file is wrong already ;)

Signed-off-by: Glauber Costa glom...@redhat.com
---
 qemu-kvm.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/qemu-kvm.c b/qemu-kvm.c
index f55cee8..27c37b5 100644
--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -1201,7 +1201,7 @@ int kvm_physical_memory_set_dirty_tracking(int enable)
 /* get kvm's dirty pages bitmap and update qemu's */
 static int kvm_get_dirty_pages_log_range(unsigned long start_addr,
  unsigned char *bitmap,
- unsigned int offset,
+ unsigned long offset,
  unsigned long mem_size)
 {
 unsigned int i, j, n=0;
-- 
1.5.6.6
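
The truncation this patch addresses can be reproduced in isolation. A minimal sketch, assuming fixed-width types in place of the real ones: `uint32_t` stands in for the old `unsigned int` parameter and `uint64_t` for `unsigned long` on an LP64 host; the function names are hypothetical.

```c
#include <stdint.h>

/* Callee with a 32-bit offset parameter, like the old signature. */
static uint64_t offset_via_32bit(uint32_t offset)
{
    return offset;
}

/* Callee with a 64-bit offset parameter, like the new signature on
 * a 64-bit host. */
static uint64_t offset_via_64bit(uint64_t offset)
{
    return offset;
}

/* Returns 1 when the offset arrives unchanged through the 32-bit
 * parameter. The conversion keeps only the low 32 bits, so anything
 * at or above 4G comes out wrong. */
int offset_preserved_32(uint64_t phys_offset)
{
    return offset_via_32bit((uint32_t)phys_offset) == phys_offset;
}

/* Same check through the 64-bit parameter. */
int offset_preserved_64(uint64_t phys_offset)
{
    return offset_via_64bit(phys_offset) == phys_offset;
}

/* The value the 32-bit callee actually sees for a given offset. */
uint64_t offset_seen_32(uint64_t phys_offset)
{
    return offset_via_32bit((uint32_t)phys_offset);
}
```

An offset of 0x100001000 is seen as 0x1000 through the 32-bit parameter, so dirty bits for memory above 4G would be attributed to low memory. Note that `unsigned long` is itself 32 bits wide on a 32-bit host, so the sketch says nothing about that case.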



Re: [RFC + PATCHES] Work to get KVM autotest upstream

2009-05-13 Thread Michael Goldish

- Lucas Meneghel Rodrigues mrodr...@redhat.com wrote:

> On Wed, 2009-05-13 at 12:23 -0400, Michael Goldish wrote:
> > The patches look good, but I haven't tested them yet to make sure
> > they leave everything at a functional state (will test them and let
> > you know).
>
> Thanks Michael! I will start to give more thorough test on this today,
> since we finally got 0.10 in shape.
>
> > I have a somewhat related question: how is KVM-Autotest development
> > going to proceed after the upstream merge? Currently I have
> > comfortable access to our repository at TLV, and on good days I push
> > as many as 20 patches per day. Should I submit all patches to the
> > Autotest mailing list after the merge, or are we going to work with
> > pull requests, or some other way? Will we work with git or svn?
>
> Here is my plan: For people inside our team, with access to the git tree
> we can just pull stuff to the git tree and on a given time basis I can
> pick up the patches and send them altogether to the KVM and autotest
> mailing list, wait for reviews and then check them.

I think it would be nice to have a 'fast' development channel like
directly pulling from a git tree.

> If you are already used to send all your changes to the KVM mailing list
> though, this would pose little or no change to you, just send an
> additional cc to the autotest mailing list.
>
> What do you think?

So far we've kept development mostly internal in TLV, so I'm not quite
used to passing my commits through the mailing list. Will this be
necessary? I'm worried it might slow down development to a grinding halt.

Thanks,
Michael


Re: XP smp using a lot of CPU

2009-05-13 Thread Erik Rull

Hi all,

very very interesting.

I have a similar problem but the other way round.
If my XP runs up to 100% CPU usage, top on the Linux host reports only
33% CPU usage. I would expect around 50%, because I only provide one core
for the guest. I already increased the process priority of qemu and the I/O
priority; nothing helped. The rest of the CPU is nearly idle, with no
excessive disk access this time :-)


Any idea what this could be?

Best regards,

Erik


Ross Boylan wrote:

I just installed XP into a new VM, specifying -smp 2 for the machine.
According to top, it's using nearly 200% of a cpu even when I'm not
doing anything.

Is this real CPU useage, or just a reporting problem (just as my disk
image is big according to ls, but isn't really)?

If it's real, is there anything I can do about it?

kvm 0.7.2 on Debian Lenny (but 2.6.29 kernel), amd64.  Xeon chips; 32
bit version of XP pro installed, now fully patched (including the
Windows Genuine Advantage stuff, though I cancelled it when it wanted to
run).  


Task manager in XP shows virtually no CPU useage.

Please cc me on responses.

Thanks for any assistance.




Re: [PATCH] don't use a 32-bit bit type as offset argument.

2009-05-13 Thread Ryan Harper
* Glauber Costa glom...@redhat.com [2009-05-13 14:22]:
> In the call path of kvm_get_dirty_pages_log_range(),
> its caller kvm_get_dirty_bitmap_cb() passes the
> target_phys_addr_t both as start_addr and the offset.
> So, using int will make dirty tracking over 4G fail
> completely.

Does this patch fix something like 32-bit migration with >4G?  Seems
like it might.

>
> Of course we should be using qemu types in
> here, so please don't get me started on this. The whole
> file is wrong already ;)
>
> Signed-off-by: Glauber Costa glom...@redhat.com
> ---
>  qemu-kvm.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/qemu-kvm.c b/qemu-kvm.c
> index f55cee8..27c37b5 100644
> --- a/qemu-kvm.c
> +++ b/qemu-kvm.c
> @@ -1201,7 +1201,7 @@ int kvm_physical_memory_set_dirty_tracking(int enable)
>  /* get kvm's dirty pages bitmap and update qemu's */
>  static int kvm_get_dirty_pages_log_range(unsigned long start_addr,
>                                           unsigned char *bitmap,
> -                                         unsigned int offset,
> +                                         unsigned long offset,
>                                           unsigned long mem_size)
>  {
>      unsigned int i, j, n=0;
> --
> 1.5.6.6
>

-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
ry...@us.ibm.com


Re: [PATCH] don't use a 32-bit bit type as offset argument.

2009-05-13 Thread Glauber Costa
On Wed, May 13, 2009 at 5:23 PM, Ryan Harper ry...@us.ibm.com wrote:
> * Glauber Costa glom...@redhat.com [2009-05-13 14:22]:
>> In the call path of kvm_get_dirty_pages_log_range(),
>> its caller kvm_get_dirty_bitmap_cb() passes the
>> target_phys_addr_t both as start_addr and the offset.
>> So, using int will make dirty tracking over 4G fail
>> completely.
>
> Does this patch fix something like 32-bit migration with >4G?  Seems
> like it might.

It fixes general > 4G migration. I tested a 64-bit guest on a 64-bit host,
and it did not work before this patch.



-- 
Glauber  Costa.
Free as in Freedom
http://glommer.net

The less confident you are, the more serious you have to act.


Re: [PATCH] kvm: user: include arch specific headers from $(KERNELDIR)

2009-05-13 Thread Arnd Bergmann
On Wednesday 13 May 2009 08:32:21 Mark McLoughlin wrote:
> Currently we only include $(KERNELDIR)/include in CFLAGS,
> but we also have $(KERNELDIR)/arch/$(arch)/include or else
> we'll get mis-matched headers.
>

I think this is fundamentally wrong. User files should never directly
access kernel headers, because they are postprocessed in various
ways in order to get files that are valid in user space, e.g. __user
annotations are removed.

The three possible sources for kernel headers are:

/usr/include
- system provided headers, may be older than the running kernel
/lib/modules/$(uname -r)/build/usr/include
- user space headers for the currently running kernel
$(KERNELDIR)/usr/include
- user space headers from a configured kernel tree after
  'make headers_install'

Arnd 


Re: [PATCH] don't use a 32-bit bit type as offset argument.

2009-05-13 Thread Anthony Liguori

Glauber Costa wrote:

> In the call path of kvm_get_dirty_pages_log_range(),
> its caller kvm_get_dirty_bitmap_cb() passes the
> target_phys_addr_t both as start_addr and the offset.
> So, using int will make dirty tracking over 4G fail
> completely.
>
> Of course we should be using qemu types in
> here, so please don't get me started on this. The whole
> file is wrong already ;)

:-)

> Signed-off-by: Glauber Costa glom...@redhat.com

Good candidate for stable too.

Regards,

Anthony Liguori



Re: [PATCH v4] kvm: Use a bitmap for tracking used GSIs

2009-05-13 Thread Alex Williamson
On Wed, 2009-05-13 at 08:33 -0600, Alex Williamson wrote:
> On Wed, 2009-05-13 at 08:15 -0600, Alex Williamson wrote:
> > On Wed, 2009-05-13 at 16:55 +0300, Michael S. Tsirkin wrote:
> > > Very surprising: I haven't seen any driver disable MSI except on device
> > > destructor path. Is this a linux guest?
> >
> > Yes, Debian 2.6.26 kernel.  I'll check if it behaves the same on newer
> > upstream kernels and try to figure out why it's doing it.
>
> Updating the guest to 2.6.29 seems to fix the interrupt toggling.  So
> it's either something in older kernels or something debian introduced,
> but that seems unlikely.

For the curious, this was fixed prior to 2.6.27-rc1 by this:

commit ce6fce4295ba727b36fdc73040e444bd1aae64cd
Author: Matthew Wilcox
Date:   Fri Jul 25 15:42:58 2008 -0600

PCI MSI: Don't disable MSIs if the mask bit isn't supported

David Vrabel has a device which generates an interrupt storm on the INTx
pin if we disable MSI interrupts altogether.  Masking interrupts is only
a performance optimisation, so we can ignore the request to mask the
interrupt.

It looks like without the maskbit attribute on MSI, the default way to
mask an MSI interrupt was to toggle the MSI enable bit.  This was
introduced in 58e0543e8f355b32f0778a18858b255adb7402ae, so its lifespan
was probably 2.6.21 - 2.6.26.

Alex



Re: [PATCH 5/5] add ksm kernel shared memory driver.

2009-05-13 Thread Andrew Morton
On Mon, 20 Apr 2009 04:36:06 +0300
Izik Eidus iei...@redhat.com wrote:

> Ksm is driver that allow merging identical pages between one or more
> applications in way unvisible to the application that use it.
> Pages that are merged are marked as readonly and are COWed when any
> application try to change them.
>
> Ksm is used for cases where using fork() is not suitable,
> one of this cases is where the pages of the application keep changing
> dynamicly and the application cannot know in advance what pages are
> going to be identical.
>
> Ksm works by walking over the memory pages of the applications it
> scan in order to find identical pages.
> It uses a two sorted data strctures called stable and unstable trees
> to find in effective way the identical pages.
>
> When ksm finds two identical pages, it marks them as readonly and merges
> them into single one page,
> after the pages are marked as readonly and merged into one page, linux
> will treat this pages as normal copy_on_write pages and will fork them
> when write access will happen to them.
>
> Ksm scan just memory areas that were registred to be scanned by it.
>
> ...
> + copy_user_highpage(kpage, page1, addr1, vma);
> ...

Breaks ppc64 allmodconfig because that architecture doesn't export its
copy_user_page() to modules.

Architectures are inconsistent about this.  x86 _does_ export it,
because it bounces it to the exported copy_page().

So can I ask that you sit down and work out upon which architectures it
really makes sense to offer KSM?  Disallow the others in Kconfig and
arrange for copy_user_highpage() to be available on the allowed architectures?

Thanks.


Re: [PATCH 5/5] add ksm kernel shared memory driver.

2009-05-13 Thread Chris Wright
* Andrew Morton (a...@linux-foundation.org) wrote:
> Breaks ppc64 allmodconfig because that architecture doesn't export its
> copy_user_page() to modules.

Things like this and updating to use madvise() I think all point towards
s/tristate/bool/.  I don't think CONFIG_KSM=M has huge benefit.

thanks,
-chris


Re: [RFC + PATCHES] Work to get KVM autotest upstream

2009-05-13 Thread Glauber Costa
On Wed, May 13, 2009 at 5:21 PM, Ryan Harper ry...@us.ibm.com wrote:
> * Michael Goldish mgold...@redhat.com [2009-05-13 14:54]:
>>
>> - Lucas Meneghel Rodrigues mrodr...@redhat.com wrote:
>>
>> > On Wed, 2009-05-13 at 12:23 -0400, Michael Goldish wrote:
>> > > The patches look good, but I haven't tested them yet to make sure
>> > > they leave everything at a functional state (will test them and let
>> > > you know).
>> >
>> > Thanks Michael! I will start to give more thorough test on this today,
>> > since we finally got 0.10 in shape.
>> >
>> > > I have a somewhat related question: how is KVM-Autotest development
>> > > going to proceed after the upstream merge? Currently I have
>> > > comfortable access to our repository at TLV, and on good days I push
>> > > as many as 20 patches per day. Should I submit all patches to the
>> > > Autotest mailing list after the merge, or are we going to work with
>> > > pull requests, or some other way? Will we work with git or svn?
>> >
>> > Here is my plan: For people inside our team, with access to the git tree
>> > we can just pull stuff to the git tree and on a given time basis I can
>> > pick up the patches and send them altogether to the KVM and autotest
>> > mailing list, wait for reviews and then check them.
>>
>> I think it would be nice to have a 'fast' development channel like
>> directly pulling from a git tree.
>>
>> > If you are already used to send all your changes to the KVM mailing list
>> > though, this would pose little or no change to you, just send an
>> > additional cc to the autotest mailing list.
>> >
>> > What do you think?
>>
>> So far we've kept development mostly internal in TLV, so I'm not quite
>> used to passing my commits through the mailing list. Will this be
>> necessary? I'm worried it might slow down development to a grinding halt.
>
> I'd definitely like to see patches to the list before committing; we do
> the same for kvm, qemu etc, not sure why kvm-autotest should be any
> different.  On the other hand, it's not currently being done that way
> and I'm not losing any sleep over it; it's easy enough to git log
> and email the list if you break something or think something should be
> done differently.



If you have, or can have, a publicly visible git tree with your changes, you
can generate pull requests from time to time. Then the job of the maintainer
will be only to sanitize your tree, make sure it is in overall good shape, and
merge it to the main stream.


-- 
Glauber  Costa.
Free as in Freedom
http://glommer.net

The less confident you are, the more serious you have to act.


Re: [PATCH 5/5] add ksm kernel shared memory driver.

2009-05-13 Thread Izik Eidus

Anthony Liguori wrote:

> Chris Wright wrote:
>> * Andrew Morton (a...@linux-foundation.org) wrote:
>>> Breaks ppc64 allmodconfig because that architecture doesn't export its
>>> copy_user_page() to modules.
>>
>> Things like this and updating to use madvise() I think all point towards
>> s/tristate/bool/.  I don't think CONFIG_KSM=M has huge benefit.
>
> I agree.

I am about to send the madvise patch, which will stop it from being a
module anyway...

> Regards,
>
> Anthony Liguori




Re: [PATCH 5/5] add ksm kernel shared memory driver.

2009-05-13 Thread Izik Eidus

Andrew Morton wrote:

> On Mon, 20 Apr 2009 04:36:06 +0300
> Izik Eidus iei...@redhat.com wrote:
>
>> Ksm is driver that allow merging identical pages between one or more
>> applications in way unvisible to the application that use it.
>> Pages that are merged are marked as readonly and are COWed when any
>> application try to change them.
>>
>> Ksm is used for cases where using fork() is not suitable,
>> one of this cases is where the pages of the application keep changing
>> dynamicly and the application cannot know in advance what pages are
>> going to be identical.
>>
>> Ksm works by walking over the memory pages of the applications it
>> scan in order to find identical pages.
>> It uses a two sorted data strctures called stable and unstable trees
>> to find in effective way the identical pages.
>>
>> When ksm finds two identical pages, it marks them as readonly and merges
>> them into single one page,
>> after the pages are marked as readonly and merged into one page, linux
>> will treat this pages as normal copy_on_write pages and will fork them
>> when write access will happen to them.
>>
>> Ksm scan just memory areas that were registred to be scanned by it.
>>
>> ...
>> +   copy_user_highpage(kpage, page1, addr1, vma);
>> ...
>
> Breaks ppc64 allmodconfig because that architecture doesn't export its
> copy_user_page() to modules.
>
> Architectures are inconsistent about this.  x86 _does_ export it,
> because it bounces it to the exported copy_page().
>
> So can I ask that you sit down and work out upon which architectures it
> really makes sense to offer KSM?  Disallow the others in Kconfig and
> arrange for copy_user_highpage() to be available on the allowed architectures?

Hi

Is there some way (a script) that I can run to compile this code for
every possible arch?

(I don't mind allowing it just for archs that support virtualization -
x86, ia64, powerpc, s390 - but is that the right thing to do?)

> Thanks.




[PATCH] bios: Fix MADT corruption and RSDT size when using -acpitable

2009-05-13 Thread Vincent Minet
External ACPI tables are counted twice for the RSDT size and the load
address for the first external table is in the MADT (interrupt override
entries are overwritten).

Signed-off-by: Vincent Minet vinc...@vincent-minet.net
---
 kvm/bios/rombios32.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kvm/bios/rombios32.c b/kvm/bios/rombios32.c
index cbd5f15..289361b 100755
--- a/kvm/bios/rombios32.c
+++ b/kvm/bios/rombios32.c
@@ -1626,7 +1626,7 @@ void acpi_bios_init(void)
 addr = base_addr = ram_size - ACPI_DATA_SIZE;
 rsdt_addr = addr;
 rsdt = (void *)(addr);
-rsdt_size = sizeof(*rsdt) + external_tables * 4;
+rsdt_size = sizeof(*rsdt);
 addr += rsdt_size;
 
 fadt_addr = addr;
@@ -1787,6 +1787,7 @@ void acpi_bios_init(void)
 }
 int_override++;
 madt_size += sizeof(struct madt_int_override);
+addr += sizeof(struct madt_int_override);
 }
 acpi_build_table_header((struct acpi_table_header *)madt,
 "APIC", madt_size, 1);
-- 
1.6.3



Re: [PATCH] bios: Fix MADT corruption and RSDT size when using -acpitable

2009-05-13 Thread Anthony Liguori

Vincent Minet wrote:

External ACPI tables are counted twice for the RSDT size and the load
address for the first external table is in the MADT (interrupt override
entries are overwritten).

Signed-off-by: Vincent Minet vinc...@vincent-minet.net
  


Beth,

I think you had a patch attempting to address the same issue.  It was a 
bit more involved though.


Which is the proper fix and are they both to the same problem?

Regards,

Anthony Liguori


---
 kvm/bios/rombios32.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kvm/bios/rombios32.c b/kvm/bios/rombios32.c
index cbd5f15..289361b 100755
--- a/kvm/bios/rombios32.c
+++ b/kvm/bios/rombios32.c
@@ -1626,7 +1626,7 @@ void acpi_bios_init(void)
 addr = base_addr = ram_size - ACPI_DATA_SIZE;
 rsdt_addr = addr;
 rsdt = (void *)(addr);
-rsdt_size = sizeof(*rsdt) + external_tables * 4;
+rsdt_size = sizeof(*rsdt);
 addr += rsdt_size;
 
 fadt_addr = addr;

@@ -1787,6 +1787,7 @@ void acpi_bios_init(void)
 }
 int_override++;
 madt_size += sizeof(struct madt_int_override);
+addr += sizeof(struct madt_int_override);
 }
 acpi_build_table_header((struct acpi_table_header *)madt,
 "APIC", madt_size, 1);
  




Re: [RFC + PATCHES] Work to get KVM autotest upstream

2009-05-13 Thread Glauber Costa
On Wed, May 13, 2009 at 9:19 PM, Anthony Liguori anth...@codemonkey.ws wrote:
 Glauber Costa wrote:

 On Wed, May 13, 2009 at 5:21 PM, Ryan Harper ry...@us.ibm.com wrote:


 I'd definitely like to see patches to the list before committing; we do
 the same for kvm, qemu etc, not sure why kvm-autotest should be any
 different.  On the other hand, it's not currently being done that way
 and I'm not losing any sleep over it; it's easy enough to git log and
 email the list if you break something or think something should be
 done differently.




 If you have, or can have, a publicly visible git tree with your changes, you
 can generate pull requests from time to time. Then the job of the maintainer
 will be only to sanitize your tree, make sure it is in overall good shape, and
 merge it to the main stream.


 The advantage to posting non-trivial patches (beyond review) is that it
 helps people learn about how things are being developed and makes it easier
 for others to get involved.  It forces a lot of the design discussions to
 happen on the mailing list.


+5
Note that I'm not against it by any means. I'm all for posting to mailing lists.


Re: [PATCH 5/5] add ksm kernel shared memory driver.

2009-05-13 Thread Tony Breeds
On Thu, May 14, 2009 at 03:15:05AM +0300, Izik Eidus wrote:

 Hi

 Is there some way (a script) I can run that will let me compile this  
 code for every possible arch?

Segher Boessenkool has a tool for building cross toolchains and the kernel
at git://git.infradead.org/users/segher/buildall.git  You can save
yourself some time (and pain) and use the built toolchains at:
http://bakeyournoodle.com/cross

If there is any interest I can get these toolchains hosted on a faster machine
(say kernel.org)

Yours Tony


Re: [PATCHv5 3/3] virtio_pci: optional MSI-X support

2009-05-13 Thread Anthony Liguori

Michael S. Tsirkin wrote:

This implements optional MSI-X support in virtio_pci.
MSI-X is used whenever the host supports at least 2 MSI-X
vectors: 1 for configuration changes and 1 for virtqueues.
Per-virtqueue vectors are allocated if enough vectors are
available.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
  

Acked-by: Anthony Liguori aligu...@us.ibm.com

Regards,

Anthony Liguori
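
For readers following the vector policy in the patch description, here is a
minimal sketch of the decision it describes. The enum, the function name and
the "one config vector plus one per virtqueue" threshold are assumptions made
for illustration; they are not taken from the virtio_pci patch itself.

```c
#include <assert.h>

/* Hypothetical names; the patch does not define this enum. */
enum vec_mode {
	VEC_MODE_LEGACY,	/* fewer than 2 MSI-X vectors: no MSI-X */
	VEC_MODE_SHARED,	/* 1 config vector + 1 vector for all virtqueues */
	VEC_MODE_PER_VQ		/* 1 config vector + 1 vector per virtqueue */
};

/*
 * Pick an interrupt layout from the number of MSI-X vectors the host
 * offers and the number of virtqueues the device has.
 */
static enum vec_mode choose_vec_mode(unsigned int host_vectors,
				     unsigned int nvqs)
{
	assert(nvqs > 0);

	if (host_vectors < 2)
		return VEC_MODE_LEGACY;
	/* Assumed threshold: one vector per queue plus the config vector. */
	if (host_vectors >= nvqs + 1)
		return VEC_MODE_PER_VQ;
	return VEC_MODE_SHARED;
}
```

With three virtqueues, a host offering two or three vectors would end up in
the shared mode, and a host offering four or more would get per-virtqueue
vectors.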


[PATCH v4 resend 5/6] VT-d: cleanup iommu_flush_iotlb_psi and flush_unmaps

2009-05-13 Thread Yu Zhao
Make iommu_flush_iotlb_psi() and flush_unmaps() more readable.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/intel-iommu.c |   46 +---
 1 files changed, 22 insertions(+), 24 deletions(-)

diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index 001b328..a2cbc01 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -968,30 +968,27 @@ static int __iommu_flush_iotlb(struct intel_iommu *iommu, u16 did,
 static int iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did,
u64 addr, unsigned int pages, int non_present_entry_flush)
 {
-   unsigned int mask;
+   int rc;
+   unsigned int mask = ilog2(__roundup_pow_of_two(pages));
 
BUG_ON(addr & (~VTD_PAGE_MASK));
BUG_ON(pages == 0);
 
-   /* Fallback to domain selective flush if no PSI support */
-   if (!cap_pgsel_inv(iommu->cap))
-   return iommu->flush.flush_iotlb(iommu, did, 0, 0,
-   DMA_TLB_DSI_FLUSH,
-   non_present_entry_flush);
-
/*
+* Fallback to domain selective flush if no PSI support or the size is
+* too big.
 * PSI requires page size to be 2 ^ x, and the base address is naturally
 * aligned to the size
 */
-   mask = ilog2(__roundup_pow_of_two(pages));
-   /* Fallback to domain selective flush if size is too big */
-   if (mask > cap_max_amask_val(iommu->cap))
-   return iommu->flush.flush_iotlb(iommu, did, 0, 0,
-   DMA_TLB_DSI_FLUSH, non_present_entry_flush);
-
-   return iommu->flush.flush_iotlb(iommu, did, addr, mask,
-   DMA_TLB_PSI_FLUSH,
-   non_present_entry_flush);
+   if (!cap_pgsel_inv(iommu->cap) || mask > cap_max_amask_val(iommu->cap))
+   rc = iommu->flush.flush_iotlb(iommu, did, 0, 0,
+   DMA_TLB_DSI_FLUSH,
+   non_present_entry_flush);
+   else
+   rc = iommu->flush.flush_iotlb(iommu, did, addr, mask,
+   DMA_TLB_PSI_FLUSH,
+   non_present_entry_flush);
+   return rc;
 }
 
 static void iommu_disable_protect_mem_regions(struct intel_iommu *iommu)
@@ -2214,15 +2211,16 @@ static void flush_unmaps(void)
if (!iommu)
continue;
 
-   if (deferred_flush[i].next) {
-   iommu->flush.flush_iotlb(iommu, 0, 0, 0,
-DMA_TLB_GLOBAL_FLUSH, 0);
-   for (j = 0; j < deferred_flush[i].next; j++) {
-   __free_iova(&deferred_flush[i].domain[j]->iovad,
-   deferred_flush[i].iova[j]);
-   }
-   deferred_flush[i].next = 0;
+   if (!deferred_flush[i].next)
+   continue;
+
+   iommu->flush.flush_iotlb(iommu, 0, 0, 0,
+DMA_TLB_GLOBAL_FLUSH, 0);
+   for (j = 0; j < deferred_flush[i].next; j++) {
+   __free_iova(&deferred_flush[i].domain[j]->iovad,
+   deferred_flush[i].iova[j]);
}
+   deferred_flush[i].next = 0;
}
 
list_size = 0;
-- 
1.5.6.4
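
The choice between a page-selective (PSI) and a domain-selective (DSI) flush
in iommu_flush_iotlb_psi() can be modelled in userspace as below. This is a
sketch only: psi_mask() and use_psi() are hypothetical helpers, with psi_mask()
standing in for ilog2(__roundup_pow_of_two(pages)) and the has_psi and
max_amask parameters standing in for the cap_pgsel_inv() and
cap_max_amask_val() capability checks.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Smallest x such that (1 << x) >= pages. PSI invalidates 2^x pages,
 * so the page count is rounded up to a power of two first.
 */
static unsigned int psi_mask(unsigned int pages)
{
	unsigned int mask = 0;

	assert(pages != 0);	/* mirrors BUG_ON(pages == 0) */

	while ((1ull << mask) < pages)
		mask++;
	return mask;
}

/*
 * True when a page-selective flush can be used; false means fall back
 * to a domain-selective flush, either because the hardware lacks PSI
 * or because the range is larger than the hardware's maximum mask.
 */
static bool use_psi(bool has_psi, unsigned int max_amask, unsigned int pages)
{
	return has_psi && psi_mask(pages) <= max_amask;
}
```

For example, with a maximum address mask of 3 a request for 8 pages can use
PSI, while a request for 9 pages rounds up to 16 (mask 4) and would fall back
to DSI.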



[PATCH v4 resend 0/6] ATS capability support for Intel IOMMU

2009-05-13 Thread Yu Zhao
This patch series implements Address Translation Service support for
the Intel IOMMU. The PCIe Endpoint that supports ATS capability can
request the DMA address translation from the IOMMU and cache the
translation itself. This can alleviate IOMMU TLB pressure and improve
the hardware performance in the I/O virtualization environment.

The ATS is one of PCI-SIG I/O Virtualization (IOV) Specifications. The
spec can be found at: http://www.pcisig.com/specifications/iov/ats/
(it requires membership).


Changelog:
v3 -> v4
  1, coding style fixes (Grant Grundler)
  2, support the Virtual Function ATS capability

v2 -> v3
  1, throw error message if VT-d hardware detects invalid descriptor
 on Queued Invalidation interface (David Woodhouse)
  2, avoid using pci_find_ext_capability every time when reading ATS
 Invalidate Queue Depth (Matthew Wilcox)

v1 -> v2
  added 'static' prefix to a local LIST_HEAD (Andrew Morton)

Yu Zhao (6):
  PCI: support the ATS capability
  PCI: handle Virtual Function ATS enabling
  VT-d: parse ATSR in DMA Remapping Reporting Structure
  VT-d: add device IOTLB invalidation support
  VT-d: cleanup iommu_flush_iotlb_psi and flush_unmaps
  VT-d: support the device IOTLB

 drivers/pci/dmar.c  |  189 +++---
 drivers/pci/intel-iommu.c   |  140 ++--
 drivers/pci/iov.c   |  155 ++--
 drivers/pci/pci.h   |   39 +
 include/linux/dmar.h|9 ++
 include/linux/intel-iommu.h |   16 -
 include/linux/pci.h |2 +
 include/linux/pci_regs.h|   10 +++
 8 files changed, 515 insertions(+), 45 deletions(-)



[PATCH v4 resend 1/6] PCI: support the ATS capability

2009-05-13 Thread Yu Zhao
The PCIe ATS capability lets the Endpoint request DMA address
translations from the IOMMU and cache them on the device side,
which alleviates IOMMU pressure and improves hardware performance
in an I/O virtualization environment.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/iov.c|  105 ++
 drivers/pci/pci.h|   37 
 include/linux/pci.h  |2 +
 include/linux/pci_regs.h |   10 
 4 files changed, 154 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index b497daa..0a7a1b4 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -5,6 +5,7 @@
  *
  * PCI Express I/O Virtualization (IOV) support.
  *   Single Root IOV 1.0
+ *   Address Translation Service 1.0
  */
 
 #include <linux/pci.h>
@@ -679,3 +680,107 @@ irqreturn_t pci_sriov_migration(struct pci_dev *dev)
return sriov_migration(dev) ? IRQ_HANDLED : IRQ_NONE;
 }
 EXPORT_SYMBOL_GPL(pci_sriov_migration);
+
+static int ats_alloc_one(struct pci_dev *dev, int ps)
+{
+   int pos;
+   u16 cap;
+   struct pci_ats *ats;
+
+   pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS);
+   if (!pos)
+   return -ENODEV;
+
+   ats = kzalloc(sizeof(*ats), GFP_KERNEL);
+   if (!ats)
+   return -ENOMEM;
+
+   ats->pos = pos;
+   ats->stu = ps;
+   pci_read_config_word(dev, pos + PCI_ATS_CAP, &cap);
+   ats->qdep = PCI_ATS_CAP_QDEP(cap) ? PCI_ATS_CAP_QDEP(cap) :
+   PCI_ATS_MAX_QDEP;
+   dev->ats = ats;
+
+   return 0;
+}
+
+static void ats_free_one(struct pci_dev *dev)
+{
+   kfree(dev->ats);
+   dev->ats = NULL;
+}
+
+/**
+ * pci_enable_ats - enable the ATS capability
+ * @dev: the PCI device
+ * @ps: the IOMMU page shift
+ *
+ * Returns 0 on success, or negative on failure.
+ */
+int pci_enable_ats(struct pci_dev *dev, int ps)
+{
+   int rc;
+   u16 ctrl;
+
+   BUG_ON(dev->ats);
+
+   if (ps < PCI_ATS_MIN_STU)
+   return -EINVAL;
+
+   rc = ats_alloc_one(dev, ps);
+   if (rc)
+   return rc;
+
+   ctrl = PCI_ATS_CTRL_ENABLE;
+   ctrl |= PCI_ATS_CTRL_STU(ps - PCI_ATS_MIN_STU);
+   pci_write_config_word(dev, dev->ats->pos + PCI_ATS_CTRL, ctrl);
+
+   return 0;
+}
+
+/**
+ * pci_disable_ats - disable the ATS capability
+ * @dev: the PCI device
+ */
+void pci_disable_ats(struct pci_dev *dev)
+{
+   u16 ctrl;
+
+   BUG_ON(!dev->ats);
+
+   pci_read_config_word(dev, dev->ats->pos + PCI_ATS_CTRL, &ctrl);
+   ctrl &= ~PCI_ATS_CTRL_ENABLE;
+   pci_write_config_word(dev, dev->ats->pos + PCI_ATS_CTRL, ctrl);
+
+   ats_free_one(dev);
+}
+
+/**
+ * pci_ats_queue_depth - query the ATS Invalidate Queue Depth
+ * @dev: the PCI device
+ *
+ * Returns the queue depth on success, or negative on failure.
+ *
+ * The ATS spec uses 0 in the Invalidate Queue Depth field to
+ * indicate that the function can accept 32 Invalidate Request.
+ * But here we use the `real' values (i.e. 1~32) for the Queue
+ * Depth.
+ */
+int pci_ats_queue_depth(struct pci_dev *dev)
+{
+   int pos;
+   u16 cap;
+
+   if (dev->ats)
+   return dev->ats->qdep;
+
+   pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS);
+   if (!pos)
+   return -ENODEV;
+
+   pci_read_config_word(dev, pos + PCI_ATS_CAP, &cap);
+
+   return PCI_ATS_CAP_QDEP(cap) ? PCI_ATS_CAP_QDEP(cap) :
+  PCI_ATS_MAX_QDEP;
+}
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index d03f6b9..3c2ec64 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -229,6 +229,13 @@ struct pci_sriov {
u8 __iomem *mstate; /* VF Migration State Array */
 };
 
+/* Address Translation Service */
+struct pci_ats {
+   int pos;/* capability position */
+   int stu;/* Smallest Translation Unit */
+   int qdep;   /* Invalidate Queue Depth */
+};
+
 #ifdef CONFIG_PCI_IOV
 extern int pci_iov_init(struct pci_dev *dev);
 extern void pci_iov_release(struct pci_dev *dev);
@@ -236,6 +243,20 @@ extern int pci_iov_resource_bar(struct pci_dev *dev, int resno,
enum pci_bar_type *type);
 extern void pci_restore_iov_state(struct pci_dev *dev);
 extern int pci_iov_bus_range(struct pci_bus *bus);
+
+extern int pci_enable_ats(struct pci_dev *dev, int ps);
+extern void pci_disable_ats(struct pci_dev *dev);
+extern int pci_ats_queue_depth(struct pci_dev *dev);
+/**
+ * pci_ats_enabled - query the ATS status
+ * @dev: the PCI device
+ *
+ * Returns 1 if ATS capability is enabled, or 0 if not.
+ */
+static inline int pci_ats_enabled(struct pci_dev *dev)
+{
+   return !!dev->ats;
+}
 #else
 static inline int pci_iov_init(struct pci_dev *dev)
 {
@@ -257,6 +278,22 @@ static inline int pci_iov_bus_range(struct pci_bus *bus)
 {
return 0;
 }
+
+static inline int 

[PATCH v4 resend 3/6] VT-d: parse ATSR in DMA Remapping Reporting Structure

2009-05-13 Thread Yu Zhao
Parse the Root Port ATS Capability Reporting Structure in the DMA
Remapping Reporting Structure ACPI table.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/dmar.c  |  112 --
 include/linux/dmar.h|9 
 include/linux/intel-iommu.h |1 +
 3 files changed, 116 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c
index fa3a113..eaa405f 100644
--- a/drivers/pci/dmar.c
+++ b/drivers/pci/dmar.c
@@ -267,6 +267,84 @@ rmrr_parse_dev(struct dmar_rmrr_unit *rmrru)
}
return ret;
 }
+
+static LIST_HEAD(dmar_atsr_units);
+
+static int __init dmar_parse_one_atsr(struct acpi_dmar_header *hdr)
+{
+   struct acpi_dmar_atsr *atsr;
+   struct dmar_atsr_unit *atsru;
+
+   atsr = container_of(hdr, struct acpi_dmar_atsr, header);
+   atsru = kzalloc(sizeof(*atsru), GFP_KERNEL);
+   if (!atsru)
+   return -ENOMEM;
+
+   atsru->hdr = hdr;
+   atsru->include_all = atsr->flags & 0x1;
+
+   list_add(&atsru->list, &dmar_atsr_units);
+
+   return 0;
+}
+
+static int __init atsr_parse_dev(struct dmar_atsr_unit *atsru)
+{
+   int rc;
+   struct acpi_dmar_atsr *atsr;
+
+   if (atsru->include_all)
+   return 0;
+
+   atsr = container_of(atsru->hdr, struct acpi_dmar_atsr, header);
+   rc = dmar_parse_dev_scope((void *)(atsr + 1),
+   (void *)atsr + atsr->header.length,
+   &atsru->devices_cnt, &atsru->devices,
+   atsr->segment);
+   if (rc || !atsru->devices_cnt) {
+   list_del(&atsru->list);
+   kfree(atsru);
+   }
+
+   return rc;
+}
+
+int dmar_find_matched_atsr_unit(struct pci_dev *dev)
+{
+   int i;
+   struct pci_bus *bus;
+   struct acpi_dmar_atsr *atsr;
+   struct dmar_atsr_unit *atsru;
+
+   list_for_each_entry(atsru, &dmar_atsr_units, list) {
+   atsr = container_of(atsru->hdr, struct acpi_dmar_atsr, header);
+   if (atsr->segment == pci_domain_nr(dev->bus))
+   goto found;
+   }
+
+   return 0;
+
+found:
+   for (bus = dev->bus; bus; bus = bus->parent) {
+   struct pci_dev *bridge = bus->self;
+
+   if (!bridge || !bridge->is_pcie ||
+   bridge->pcie_type == PCI_EXP_TYPE_PCI_BRIDGE)
+   return 0;
+
+   if (bridge->pcie_type == PCI_EXP_TYPE_ROOT_PORT) {
+   for (i = 0; i < atsru->devices_cnt; i++)
+   if (atsru->devices[i] == bridge)
+   return 1;
+   break;
+   }
+   }
+
+   if (atsru->include_all)
+   return 1;
+
+   return 0;
+}
 #endif
 
 static void __init
@@ -274,22 +352,28 @@ dmar_table_print_dmar_entry(struct acpi_dmar_header *header)
 {
struct acpi_dmar_hardware_unit *drhd;
struct acpi_dmar_reserved_memory *rmrr;
+   struct acpi_dmar_atsr *atsr;
 
switch (header->type) {
case ACPI_DMAR_TYPE_HARDWARE_UNIT:
-   drhd = (struct acpi_dmar_hardware_unit *)header;
+   drhd = container_of(header, struct acpi_dmar_hardware_unit,
+   header);
printk (KERN_INFO PREFIX
-   "DRHD (flags: 0x%08x)base: 0x%016Lx\n",
-   drhd->flags, (unsigned long long)drhd->address);
+   "DRHD base: %#016Lx flags: %#x\n",
+   (unsigned long long)drhd->address, drhd->flags);
break;
case ACPI_DMAR_TYPE_RESERVED_MEMORY:
-   rmrr = (struct acpi_dmar_reserved_memory *)header;
-
+   rmrr = container_of(header, struct acpi_dmar_reserved_memory,
+   header);
printk (KERN_INFO PREFIX
-   "RMRR base: 0x%016Lx end: 0x%016Lx\n",
+   "RMRR base: %#016Lx end: %#016Lx\n",
(unsigned long long)rmrr->base_address,
(unsigned long long)rmrr->end_address);
break;
+   case ACPI_DMAR_TYPE_ATSR:
+   atsr = container_of(header, struct acpi_dmar_atsr, header);
+   printk(KERN_INFO PREFIX "ATSR flags: %#x\n", atsr->flags);
+   break;
}
 }
 
@@ -363,6 +447,11 @@ parse_dmar_table(void)
ret = dmar_parse_one_rmrr(entry_header);
 #endif
break;
+   case ACPI_DMAR_TYPE_ATSR:
+#ifdef CONFIG_DMAR
+   ret = dmar_parse_one_atsr(entry_header);
+#endif
+   break;
default:
printk(KERN_WARNING PREFIX
"Unknown DMAR structure type\n");
@@ -431,11 +520,19 @@ int __init dmar_dev_scope_init(void)
 #ifdef CONFIG_DMAR
{
struct 

[PATCH v4 resend 2/6] PCI: handle Virtual Function ATS enabling

2009-05-13 Thread Yu Zhao
The SR-IOV spec requires that the Smallest Translation Unit and
the Invalidate Queue Depth fields in the Virtual Function ATS
capability are hardwired to 0. If a function is a Virtual Function,
then set its Physical Function's STU before enabling the ATS.

Signed-off-by: Yu Zhao yu.z...@intel.com
---
 drivers/pci/iov.c |   66 +---
 drivers/pci/pci.h |4 ++-
 2 files changed, 55 insertions(+), 15 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 0a7a1b4..4151404 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -491,10 +491,10 @@ found:
 
if (pdev)
iov->dev = pci_dev_get(pdev);
-   else {
+   else
iov->dev = dev;
-   mutex_init(&iov->lock);
-   }
+
+   mutex_init(&iov->lock);
 
dev->sriov = iov;
dev->is_physfn = 1;
@@ -514,11 +514,11 @@ static void sriov_release(struct pci_dev *dev)
 {
BUG_ON(dev->sriov->nr_virtfn);
 
-   if (dev == dev->sriov->dev)
-   mutex_destroy(&dev->sriov->lock);
-   else
+   if (dev != dev->sriov->dev)
pci_dev_put(dev->sriov->dev);
 
+   mutex_destroy(&dev->sriov->lock);
+
kfree(dev->sriov);
dev->sriov = NULL;
 }
@@ -723,19 +723,40 @@ int pci_enable_ats(struct pci_dev *dev, int ps)
int rc;
u16 ctrl;
 
-   BUG_ON(dev->ats);
+   BUG_ON(dev->ats && dev->ats->is_enabled);
 
if (ps < PCI_ATS_MIN_STU)
return -EINVAL;
 
-   rc = ats_alloc_one(dev, ps);
-   if (rc)
-   return rc;
+   if (dev->is_physfn || dev->is_virtfn) {
+   struct pci_dev *pdev = dev->is_physfn ? dev : dev->physfn;
+
+   mutex_lock(&pdev->sriov->lock);
+   if (pdev->ats)
+   rc = pdev->ats->stu == ps ? 0 : -EINVAL;
+   else
+   rc = ats_alloc_one(pdev, ps);
+
+   if (!rc)
+   pdev->ats->ref_cnt++;
+   mutex_unlock(&pdev->sriov->lock);
+   if (rc)
+   return rc;
+   }
+
+   if (!dev->is_physfn) {
+   rc = ats_alloc_one(dev, ps);
+   if (rc)
+   return rc;
+   }
 
ctrl = PCI_ATS_CTRL_ENABLE;
-   ctrl |= PCI_ATS_CTRL_STU(ps - PCI_ATS_MIN_STU);
+   if (!dev->is_virtfn)
+   ctrl |= PCI_ATS_CTRL_STU(ps - PCI_ATS_MIN_STU);
pci_write_config_word(dev, dev->ats->pos + PCI_ATS_CTRL, ctrl);
 
+   dev->ats->is_enabled = 1;
+
return 0;
 }
 
@@ -747,13 +768,26 @@ void pci_disable_ats(struct pci_dev *dev)
 {
u16 ctrl;
 
-   BUG_ON(!dev->ats);
+   BUG_ON(!dev->ats || !dev->ats->is_enabled);
 
pci_read_config_word(dev, dev->ats->pos + PCI_ATS_CTRL, &ctrl);
ctrl &= ~PCI_ATS_CTRL_ENABLE;
pci_write_config_word(dev, dev->ats->pos + PCI_ATS_CTRL, ctrl);
 
-   ats_free_one(dev);
+   dev->ats->is_enabled = 0;
+
+   if (dev->is_physfn || dev->is_virtfn) {
+   struct pci_dev *pdev = dev->is_physfn ? dev : dev->physfn;
+
+   mutex_lock(&pdev->sriov->lock);
+   pdev->ats->ref_cnt--;
+   if (!pdev->ats->ref_cnt)
+   ats_free_one(pdev);
+   mutex_unlock(&pdev->sriov->lock);
+   }
+
+   if (!dev->is_physfn)
+   ats_free_one(dev);
 }
 
 /**
@@ -765,13 +799,17 @@ void pci_disable_ats(struct pci_dev *dev)
  * The ATS spec uses 0 in the Invalidate Queue Depth field to
  * indicate that the function can accept 32 Invalidate Request.
  * But here we use the `real' values (i.e. 1~32) for the Queue
- * Depth.
+ * Depth; and 0 indicates the function shares the Queue with
+ * other functions (doesn't exclusively own a Queue).
  */
 int pci_ats_queue_depth(struct pci_dev *dev)
 {
int pos;
u16 cap;
 
+   if (dev->is_virtfn)
+   return 0;
+
if (dev->ats)
return dev->ats->qdep;
 
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 3c2ec64..f73bcbe 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -234,6 +234,8 @@ struct pci_ats {
int pos;/* capability position */
int stu;/* Smallest Translation Unit */
int qdep;   /* Invalidate Queue Depth */
+   int ref_cnt;/* Physical Function reference count */
+   int is_enabled:1;   /* Enable bit is set */
 };
 
 #ifdef CONFIG_PCI_IOV
@@ -255,7 +257,7 @@ extern int pci_ats_queue_depth(struct pci_dev *dev);
  */
 static inline int pci_ats_enabled(struct pci_dev *dev)
 {
-   return !!dev->ats;
+   return dev->ats && dev->ats->is_enabled;
 }
 #else
 static inline int pci_iov_init(struct pci_dev *dev)
-- 
1.5.6.4
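
The Queue Depth and STU handling described in this patch can be sketched as
plain C, outside the kernel. The register layout below (5-bit QDEP field,
5-bit STU field, enable bit 15, minimum STU of 12, maximum depth of 32) is an
assumption: the pci_regs.h hunk that defines the real PCI_ATS_* macros is not
shown in this thread, so the ATS_* names here are placeholders.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed field layout; placeholders for the PCI_ATS_* macros. */
#define ATS_CAP_QDEP(x)		((x) & 0x1f)
#define ATS_MAX_QDEP		32
#define ATS_MIN_STU		12
#define ATS_CTRL_ENABLE		0x8000
#define ATS_CTRL_STU(x)		((x) & 0x1f)

/*
 * Queue depth as reported to callers: 1..32 for a function that owns
 * its queue (a raw field value of 0 means 32), and 0 for a Virtual
 * Function, which shares the queue of its Physical Function.
 */
static int ats_queue_depth(uint16_t cap, int is_virtfn)
{
	if (is_virtfn)
		return 0;
	return ATS_CAP_QDEP(cap) ? ATS_CAP_QDEP(cap) : ATS_MAX_QDEP;
}

/*
 * Control word written when enabling ATS. The STU field is hardwired
 * to 0 on a Virtual Function, so it is only filled in otherwise.
 */
static uint16_t ats_ctrl_word(int ps, int is_virtfn)
{
	uint16_t ctrl = ATS_CTRL_ENABLE;

	assert(ps >= ATS_MIN_STU);

	if (!is_virtfn)
		ctrl |= ATS_CTRL_STU(ps - ATS_MIN_STU);
	return ctrl;
}
```

Under these assumptions a page shift of 14 on a Physical Function yields a
control word of 0x8002, while the same call for a Virtual Function yields
0x8000.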


