Re: [Qemu-devel] [RFC PATCH v5 0/4] Separate thread for VM migration

2011-08-24 Thread Avi Kivity

On 08/25/2011 09:29 AM, Umesh Deshpande wrote:

Jitterd Test
I ran jitterd in a migrating VM of size 8GB with and without the patch
series.

./jitterd -f -m 1 -p 100 -r 40

That is, report jitter greater than 400 ms over an interval of 40 seconds.


Jitter in ms. with the migration thread:

Run    Total (Peak)
1      No chatter
2      No chatter
3      No chatter
4      409 (360)

Jitter in ms. without the migration thread:

Run    Total (Peak)
1      4663 (2413)
2      643 (423)
3      1973 (1817)
4      3908 (3772)

--
Flood ping test: ping to the migrating VM from a third machine (data
over 3 runs)

Latency (ms) ping to a non-migrating VM: Avg 0.156, Max 0.96
Latency (ms) with migration thread:      Avg 0.215, Max 280
Latency (ms) without migration thread:   Avg 6.47,  Max 4562



Very impressive numbers.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



KSM Unstable tree question

2011-08-24 Thread Prateek Sharma
Hello everyone.
I've been trying to understand how KSM works (I want to make some
modifications / implement some optimizations).
One thing that struck me as odd was the high number of calls to
remove_rmap_item_from_tree.
In particular, this instance in cmp_and_merge_page:

/*
 * As soon as we merge this page, we want to remove the
 * rmap_item of the page we have merged with from the unstable
 * tree, and insert it instead as new node in the stable tree.
 */
if (kpage) {
remove_rmap_item_from_tree(tree_rmap_item);

lock_page(kpage);
stable_node = stable_tree_insert(kpage);
if (stable_node) {
stable_tree_append(tree_rmap_item, stable_node);
stable_tree_append(rmap_item, stable_node);
}

Here, from what I understand, we've found a match in the unstable tree and
are adding a stable node to the stable tree.
My question is: why do we need to remove the rmap_item from the unstable
tree here? At the end of a scan we erase the unstable tree
anyway. Also, all searches first consider the stable tree, and then
the unstable tree.
What will happen if we find a match in the unstable tree and simply
update tree_rmap_item to point to a stable_node?
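
For concreteness, the change I'm wondering about would look roughly like
this against the snippet above (hypothetical and untested; whether the
rmap_item left linked in the unstable rbtree stays consistent is exactly
the question):

 	if (kpage) {
-		remove_rmap_item_from_tree(tree_rmap_item);
-
 		lock_page(kpage);
 		stable_node = stable_tree_insert(kpage);
 		if (stable_node) {
 			stable_tree_append(tree_rmap_item, stable_node);
 			stable_tree_append(rmap_item, stable_node);
 		}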

Thanks for reading. I'd love to share the ideas I have for (attempting
to) improve KSM, if anyone is interested.

Prateek


Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-24 Thread Asias He
On Thu, Aug 25, 2011 at 1:54 PM, Pekka Enberg  wrote:
>
> On 8/25/11 8:34 AM, Asias He wrote:
>
> Hi, David
>
> On Thu, Aug 25, 2011 at 6:25 AM, David Evensky  wrote:
>>
>>
>> This patch adds a PCI device that provides PCI device memory to the
>> guest. This memory in the guest exists as a shared memory segment in
>> the host. This is similar to the memory sharing capability of Nahanni
>> (ivshmem) available in QEMU. In this case, the shared memory segment
>> is exposed as a PCI BAR only.
>>
>> A new command line argument is added as:
>>    --shmem pci:0xc800:16MB:handle=/newmem:create
>>
>> which will set the PCI BAR at 0xc800; the shared memory segment
>> and the region pointed to by the BAR will be 16MB. On the host side
>> the shm_open handle will be '/newmem', and the kvm tool will create
>> the shared segment, set its size, and initialize it. If the size,
>> handle, or create flag are absent, they will default to 16MB,
>> handle=/kvm_shmem, and create will be false.
>
> I think it's better to use a default BAR address if the user does not
> specify one as well.
> This way,
>
> ./kvm --shmem
>
> will work with default values with zero configuration.
>
> Does that sort of thing make sense here? It's a special purpose device
> and the guest is expected to ioremap() the memory so it needs to
> know the BAR.

I mean a default BAR address for the --shmem device.  Yes, the guest needs to
know this address, but even if we specify the address on the command line the
guest still does not know it, no? So having a default BAR address does no harm.

>> The address family
>> 'pci:' is also optional, as it is the only address family currently
>> supported. Only a single --shmem is supported at this time.
>
> So, let's drop the 'pci:' prefix.
>
> That means the user interface will change if someone adds new address
> families. So we should keep the prefix, no?

We can have a more flexible option format which does not depend on the order
of the args, e.g.:

--shmem bar=0xc800,size=16MB,handle=/newmem,ops=create,type=pci

If the user does not specify a sub-arg, just use its default.
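
For illustration, a rough sketch of such a key=value parser (hypothetical,
not the kvm tool's actual code; the default values are illustrative and
size-suffix handling is omitted for brevity):

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct shmem_opts {
	unsigned long long bar, size;
	const char *handle;
	int create;
};

static int parse_shmem_arg(char *arg, struct shmem_opts *o)
{
	char *tok;

	/* defaults apply for every sub-arg that is absent */
	o->bar = 0xc8000000ULL;
	o->size = 16ULL << 20;
	o->handle = "/kvm_shmem";
	o->create = 0;

	while ((tok = strsep(&arg, ",")) != NULL) {
		if (!strncmp(tok, "bar=", 4))
			o->bar = strtoull(tok + 4, NULL, 0);
		else if (!strncmp(tok, "size=", 5))
			o->size = strtoull(tok + 5, NULL, 0);
		else if (!strncmp(tok, "handle=", 7))
			o->handle = tok + 7;
		else if (!strcmp(tok, "ops=create"))
			o->create = 1;
		else if (!strncmp(tok, "type=", 5))
			;	/* only "pci" exists today, accept and ignore */
		else
			return -1;
	}
	return 0;
}

int main(void)
{
	char arg[] = "size=16777216,handle=/newmem,ops=create";
	struct shmem_opts o;

	if (!parse_shmem_arg(arg, &o))
		printf("bar=%#llx size=%llu handle=%s create=%d\n",
		       o.bar, o.size, o.handle, o.create);
	return 0;
}

With that, a bare "./kvm --shmem", or any subset of sub-args in any order,
would still work with zero configuration.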

--
Asias He


Re: [Qemu-devel] [RFC PATCH v5 0/4] Separate thread for VM migration

2011-08-24 Thread Umesh Deshpande

Jitterd Test
I ran jitterd in a migrating VM of size 8GB with and without the patch series.

./jitterd -f -m 1 -p 100 -r 40

That is, report jitter greater than 400 ms over an interval of 40 seconds.


Jitter in ms. with the migration thread:

Run    Total (Peak)
1      No chatter
2      No chatter
3      No chatter
4      409 (360)

Jitter in ms. without the migration thread:

Run    Total (Peak)
1      4663 (2413)
2      643 (423)
3      1973 (1817)
4      3908 (3772)

--
Flood ping test: ping to the migrating VM from a third machine (data
over 3 runs)

Latency (ms) ping to a non-migrating VM: Avg 0.156, Max 0.96
Latency (ms) with migration thread:      Avg 0.215, Max 280
Latency (ms) without migration thread:   Avg 6.47,  Max 4562

- Umesh


On 08/24/2011 01:19 PM, Anthony Liguori wrote:

On 08/23/2011 10:12 PM, Umesh Deshpande wrote:
The following patch series deals with VCPU and iothread starvation during
the migration of a guest. Currently the iothread is responsible for
performing the guest migration. It holds qemu_mutex during the migration
and doesn't allow the VCPU to enter qemu mode, delaying its return to the
guest. The guest migration, executed as an iohandler, also delays the
execution of other iohandlers.
In the following patch series,


Can you please include detailed performance data with and without this 
series?


Perhaps runs of migration with jitterd running in the guest.

Regards,

Anthony Liguori



The migration has been moved to a separate thread to
reduce the qemu_mutex contention and iohandler starvation.

Umesh Deshpande (4):
   MRU ram block list
   migration thread mutex
   separate migration bitmap
   separate migration thread

 arch_init.c         |   38
 buffered_file.c     |   75
 cpu-all.h           |   42
 exec.c              |   97
 migration.c         |  122
 migration.h         |    9
 qemu-common.h       |    2
 qemu-thread-posix.c |   10
 qemu-thread.h       |    1
 savevm.c            |    5
 10 files changed, 297 insertions(+), 104 deletions(-)
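
For readers skimming the series, a minimal self-contained pthreads sketch
of the pattern being applied (plain pthreads, not actual QEMU code): the
migration loop runs in its own thread under a dedicated migration mutex,
so VCPUs and iohandlers no longer serialize on the global lock for the
whole transfer.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t migration_lock = PTHREAD_MUTEX_INITIALIZER;
static bool done;

static void *migration_thread(void *arg)
{
	while (1) {
		pthread_mutex_lock(&migration_lock);	/* not the global lock */
		bool finished = done;	/* send one chunk of dirty RAM here */
		pthread_mutex_unlock(&migration_lock);
		if (finished)
			break;
		usleep(1000);		/* yield between chunks */
	}
	return NULL;
}

int main(void)
{
	pthread_t tid;

	pthread_create(&tid, NULL, migration_thread, NULL);
	/* the main (I/O) thread keeps servicing handlers, then signals completion */
	sleep(1);
	pthread_mutex_lock(&migration_lock);
	done = true;
	pthread_mutex_unlock(&migration_lock);
	pthread_join(tid, NULL);
	printf("migration finished without holding the global lock\n");
	return 0;
}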







Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-24 Thread David Evensky
On Thu, Aug 25, 2011 at 09:02:56AM +0300, Pekka Enberg wrote:
> On Thu, Aug 25, 2011 at 1:25 AM, David Evensky  wrote:
> > +       if (*next == '\0')
> > +               p = next;
> > +       else
> > +               p = next + 1;
> > +       /* parse out size */
> > +       base = 10;
> > +       if (strcasestr(p, "0x"))
> > +               base = 16;
> > +       size = strtoll(p, &next, base);
> > +       if (next == p && size == 0) {
> > +               pr_info("shmem: no size specified, using default.");
> > +               size = default_size;
> > +       }
> > +       /* look for [KMGkmg][Bb]*  uses base 2. */
> > +       int skip_B = 0;
> > +       if (strspn(next, "KMGkmg")) {   /* might have a prefix */
> > +               if (*(next + 1) == 'B' || *(next + 1) == 'b')
> > +                       skip_B = 1;
> > +               switch (*next) {
> > +               case 'K':
> > +               case 'k':
> > +                       size = size << KB_SHIFT;
> > +                       break;
> > +               case 'M':
> > +               case 'm':
> > +                       size = size << MB_SHIFT;
> > +                       break;
> > +               case 'G':
> > +               case 'g':
> > +                       size = size << GB_SHIFT;
> > +                       break;
> > +               default:
> > +                       die("shmem: bug in detecting size prefix.");
> > +                       break;
> > +               }
> 
> There's some nice code in perf to parse sizes like this. We could just
> steal that.

That sounds good to me.
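
Something in the spirit of perf's size parsing might look roughly like
this (a sketch only; perf's actual helper, e.g. perf_atoll() in
tools/perf/util/string.c, differs in detail):

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

static long long parse_size(const char *str)
{
	char *end;
	long long val = strtoll(str, &end, 0);	/* base 0 handles 0x itself */

	switch (tolower((unsigned char)*end)) {
	case 'k': val <<= 10; end++; break;
	case 'm': val <<= 20; end++; break;
	case 'g': val <<= 30; end++; break;
	}
	if (tolower((unsigned char)*end) == 'b')
		end++;				/* allow an optional trailing B */
	return *end == '\0' ? val : -1;		/* reject trailing junk */
}

int main(void)
{
	printf("%lld\n", parse_size("16MB"));	/* prints 16777216 */
	return 0;
}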

> > +inline void fill_mem(void *buf, size_t buf_size, char *fill, size_t fill_len)
> > +{
> > +       size_t i;
> > +
> > +       if (fill_len == 1) {
> > +               memset(buf, fill[0], buf_size);
> > +       } else {
> > +               if (buf_size > fill_len) {
> > +                       for (i = 0; i < buf_size - fill_len; i += fill_len)
> > +                               memcpy(((char *)buf) + i, fill, fill_len);
> > +                       memcpy(buf + i, fill, buf_size - i);
> > +               } else {
> > +                       memcpy(buf, fill, buf_size);
> > +               }
> > +       }
> > +}
> 
> Can we do a memset_pattern4() type of interface instead? I think it's
> mostly pointless to try to support arbitrary-length 'fill'.

Yeah, I can see how the arbitrary fill thing might be too cute. It
certainly isn't necessary.


Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-24 Thread Pekka Enberg
On Thu, Aug 25, 2011 at 1:25 AM, David Evensky  wrote:
> +       if (*next == '\0')
> +               p = next;
> +       else
> +               p = next + 1;
> +       /* parse out size */
> +       base = 10;
> +       if (strcasestr(p, "0x"))
> +               base = 16;
> +       size = strtoll(p, &next, base);
> +       if (next == p && size == 0) {
> +               pr_info("shmem: no size specified, using default.");
> +               size = default_size;
> +       }
> +       /* look for [KMGkmg][Bb]*  uses base 2. */
> +       int skip_B = 0;
> +       if (strspn(next, "KMGkmg")) {   /* might have a prefix */
> +               if (*(next + 1) == 'B' || *(next + 1) == 'b')
> +                       skip_B = 1;
> +               switch (*next) {
> +               case 'K':
> +               case 'k':
> +                       size = size << KB_SHIFT;
> +                       break;
> +               case 'M':
> +               case 'm':
> +                       size = size << MB_SHIFT;
> +                       break;
> +               case 'G':
> +               case 'g':
> +                       size = size << GB_SHIFT;
> +                       break;
> +               default:
> +                       die("shmem: bug in detecting size prefix.");
> +                       break;
> +               }

There's some nice code in perf to parse sizes like this. We could just
steal that.

> +inline void fill_mem(void *buf, size_t buf_size, char *fill, size_t fill_len)
> +{
> +       size_t i;
> +
> +       if (fill_len == 1) {
> +               memset(buf, fill[0], buf_size);
> +       } else {
> +               if (buf_size > fill_len) {
> +                       for (i = 0; i < buf_size - fill_len; i += fill_len)
> +                               memcpy(((char *)buf) + i, fill, fill_len);
> +                       memcpy(buf + i, fill, buf_size - i);
> +               } else {
> +                       memcpy(buf, fill, buf_size);
> +               }
> +       }
> +}

Can we do a memset_pattern4() type of interface instead? I think it's
mostly pointless to try to support arbitrary-length 'fill'.
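
For reference, a minimal sketch of such an interface, modeled on Darwin's
memset_pattern4() (this is not existing kvm tool code):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* fill buf with a repeating 4-byte pattern, like memset_pattern4() */
static void fill_pattern4(void *buf, const void *pattern, size_t len)
{
	uint32_t pat;
	size_t i;

	memcpy(&pat, pattern, sizeof(pat));
	for (i = 0; i + sizeof(pat) <= len; i += sizeof(pat))
		memcpy((char *)buf + i, &pat, sizeof(pat));
	memcpy((char *)buf + i, &pat, len - i);	/* partial tail */
}

int main(void)
{
	char buf[10];

	fill_pattern4(buf, "abcd", sizeof(buf));
	fwrite(buf, 1, sizeof(buf), stdout);	/* prints abcdabcdab */
	putchar('\n');
	return 0;
}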


Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-24 Thread David Evensky

I don't know if there is a PCI card that only provides a region
of memory. I'm not really trying to provide emulation for a known
piece of hardware, so I picked values that weren't being used since
there didn't appear to be an 'unknown'. I'll ask around.

\dae

On Thu, Aug 25, 2011 at 08:41:43AM +0300, Avi Kivity wrote:
> On 08/25/2011 01:25 AM, David Evensky wrote:
> >  #define PCI_DEVICE_ID_VIRTIO_BLN                0x1005
> >  #define PCI_DEVICE_ID_VIRTIO_P9                 0x1009
> >  #define PCI_DEVICE_ID_VESA                      0x2000
> >+#define PCI_DEVICE_ID_PCI_SHMEM                  0x0001
> >
> >  #define PCI_VENDOR_ID_REDHAT_QUMRANET           0x1af4
> >+#define PCI_VENDOR_ID_PCI_SHMEM                  0x0001
> >  #define PCI_SUBSYSTEM_VENDOR_ID_REDHAT_QUMRANET 0x1af4
> >
> >
> 
> Please use a real-life vendor ID from http://www.pcidatabase.com.
> If you're following an existing spec, you should pick the vendor ID
> matching the device you're emulating.  If not, as seems to be the
> case here, you need your own, or permission from an existing owner
> of a vendor ID.
> 
> -- 
> I have a truly marvellous patch that fixes the bug which this
> signature is too narrow to contain.
> 


Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-24 Thread David Evensky
On Thu, Aug 25, 2011 at 08:06:34AM +0300, Pekka Enberg wrote:
> On Wed, 2011-08-24 at 21:49 -0700, David Evensky wrote:
> > On Wed, Aug 24, 2011 at 10:27:18PM -0500, Alexander Graf wrote:
> > > 
> > > On 24.08.2011, at 17:25, David Evensky wrote:
> > > 
> > > > 
> > > > 
> > > > This patch adds a PCI device that provides PCI device memory to the
> > > > guest. This memory in the guest exists as a shared memory segment in
> > > > the host. This is similar memory sharing capability of Nahanni
> > > > (ivshmem) available in QEMU. In this case, the shared memory segment
> > > > is exposed as a PCI BAR only.
> > > > 
> > > > A new command line argument is added as:
> > > >--shmem pci:0xc800:16MB:handle=/newmem:create
> > > > 
> > > > which will set the PCI BAR at 0xc800, the shared memory segment
> > > > and the region pointed to by the BAR will be 16MB. On the host side
> > > > the shm_open handle will be '/newmem', and the kvm tool will create
> > > > the shared segment, set its size, and initialize it. If the size,
> > > > handle, or create flag are absent, they will default to 16MB,
> > > > handle=/kvm_shmem, and create will be false. The address family,
> > > > 'pci:' is also optional as it is the only address family currently
> > > > supported. Only a single --shmem is supported at this time.
> > > 
> > > Did you have a look at ivshmem? It does that today, but also gives
> > you an IRQ line so the guests can poke each other. For something as
> > simple as this, I don't see why we'd need two competing
> > implementations.
> > 
> > Isn't ivshmem in QEMU? If so, then I don't think there is any
> > competition. How do you feel that these are competing?
> 
> It's obviously not competing. One thing you might want to consider is
> making the guest interface compatible with ivshmem. Is there any reason
> we shouldn't do that? I don't consider that a requirement, just nice to
> have.

I think it depends on what the goal is. For us, just having a hunk of
memory shared between the host and guests that the guests can ioremap
provides a lot. I don't think having the rest of ivshmem's guest interface
would impact our use above, but I haven't tested things with
QEMU to verify that.

\dae


> 
>   Pekka
> 


Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-24 Thread Avi Kivity

On 08/25/2011 01:25 AM, David Evensky wrote:

  #define PCI_DEVICE_ID_VIRTIO_BLN                0x1005
  #define PCI_DEVICE_ID_VIRTIO_P9                 0x1009
  #define PCI_DEVICE_ID_VESA                      0x2000
 +#define PCI_DEVICE_ID_PCI_SHMEM                  0x0001

  #define PCI_VENDOR_ID_REDHAT_QUMRANET           0x1af4
 +#define PCI_VENDOR_ID_PCI_SHMEM                  0x0001
  #define PCI_SUBSYSTEM_VENDOR_ID_REDHAT_QUMRANET 0x1af4




Please use a real-life vendor ID from http://www.pcidatabase.com.  If 
you're following an existing spec, you should pick the vendor ID 
matching the device you're emulating.  If not, as seems to be the case 
here, you need your own, or permission from an existing owner of a 
vendor ID.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-24 Thread Alexander Graf

On 25.08.2011, at 00:37, Pekka Enberg wrote:

> On 8/25/11 8:22 AM, Alexander Graf wrote:
>> 
>> On 25.08.2011, at 00:11, Pekka Enberg wrote:
>> 
>>> On Wed, 2011-08-24 at 23:52 -0500, Alexander Graf wrote:
> Isn't ivshmem in QEMU? If so, then I don't think there is any
> competition. How do you feel that these are competing?
 
 Well, it means that inside the guest you will have two different
 devices depending on whether you're using QEMU or kvm-tool. I don't see
 the point in exposing different devices to the guest just because of
 NIH. Why should a guest care which device emulation framework you're
 using?
>>> 
>>> It's a pretty special-purpose device that requires user configuration so
>>> I don't consider QEMU compatibility to be mandatory. It'd be nice to
>>> have but not something to bend over backwards for.
>> 
>> Well, the nice thing is that you would get the guest side for free:
>> 
>> http://gitorious.org/nahanni/guest-code/blobs/master/kernel_module/uio/uio_ivshmem.c
>> 
>> You also didn't invent your own virtio protocol, no? :)
> 
> No, because virtio drivers are in the Linux kernel proper. Is ivshmem in
> the kernel tree or planned to be merged at some point?

*shrug* Let's ask Cam.


Alex



Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-24 Thread Pekka Enberg

On 8/25/11 8:22 AM, Alexander Graf wrote:


On 25.08.2011, at 00:11, Pekka Enberg wrote:


On Wed, 2011-08-24 at 23:52 -0500, Alexander Graf wrote:

Isn't ivshmem in QEMU? If so, then I don't think there is any
competition. How do you feel that these are competing?


Well, it means that inside the guest you will have two different
devices depending on whether you're using QEMU or kvm-tool. I don't see
the point in exposing different devices to the guest just because of
NIH. Why should a guest care which device emulation framework you're
using?


It's a pretty special-purpose device that requires user configuration so
I don't consider QEMU compatibility to be mandatory. It'd be nice to
have but not something to bend over backwards for.


Well, the nice thing is that you would get the guest side for free:

http://gitorious.org/nahanni/guest-code/blobs/master/kernel_module/uio/uio_ivshmem.c

You also didn't invent your own virtio protocol, no? :)


No, because virtio drivers are in the Linux kernel proper. Is ivshmem in
the kernel tree or planned to be merged at some point?

Pekka


Re: [Qemu-devel] Guest kernel device compatibility auto-detection

2011-08-24 Thread Avi Kivity

On 08/25/2011 08:21 AM, Sasha Levin wrote:

Hi,

Currently when we run the guest we treat it as a black box, we're not
quite sure what it's going to start and whether it supports the same
features we expect it to support when running it from the host.

This forces us to start the guest with the safest defaults possible, for
example: '-drive file=my_image.qcow2' will be started with slow IDE
emulation even though the guest is capable of virtio.

I'm currently working on a method to try and detect whether the guest
kernel has specific configurations enabled and either warn the user if
we know the kernel is not going to properly work or use better defaults
if we know some advanced features are going to work.

How am I planning to do it? First, we'll try finding which kernel the
guest is going to boot (easy when the user does '-kernel', less easy when
the user boots an image). For simplicity's sake I'll stick with the
'-kernel' option for now.

Once we have the kernel we can do two things:
  1. See if the kernel was built with CONFIG_IKCONFIG.

  2. Try finding the System.map that belongs to the kernel; it's
provided with all distro kernels, so we can expect it to be around. If we
find it, we repeat the same process as in #1.

If we found one of the above, we start matching config sets ("we need
a,b,c,d for virtio, let's see if it's all there"). Once we find a good
config set, we use it for defaults. If we didn't find a good config set
we warn the user and don't even bother starting the guest.

If we couldn't find either, we can just default to whatever we have as
defaults now.


To sum it up, I was wondering if this approach has been considered
before and whether it sounds interesting enough to try.



This is a similar problem to p2v or v2v - taking a guest that used to 
run on physical or virtual hardware, and modifying it to run on 
(different) virtual hardware.  The first step is what you're looking for 
- detecting what the guest currently supports.


You can look at http://libguestfs.org/virt-v2v/ for an example.  I'm 
also copying Richard Jones, who maintains libguestfs, which does the 
actual poking around in the guest.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Guest kernel device compatibility auto-detection

2011-08-24 Thread Sasha Levin
Hi,

Currently when we run the guest we treat it as a black box, we're not
quite sure what it's going to start and whether it supports the same
features we expect it to support when running it from the host.

This forces us to start the guest with the safest defaults possible, for
example: '-drive file=my_image.qcow2' will be started with slow IDE
emulation even though the guest is capable of virtio.

I'm currently working on a method to try and detect whether the guest
kernel has specific configurations enabled and either warn the user if
we know the kernel is not going to properly work or use better defaults
if we know some advanced features are going to work.

How am I planning to do it? First, we'll try finding which kernel the
guest is going to boot (easy when the user does '-kernel', less easy when
the user boots an image). For simplicity's sake I'll stick with the
'-kernel' option for now.

Once we have the kernel we can do two things:
 1. See if the kernel was built with CONFIG_IKCONFIG.

 2. Try finding the System.map that belongs to the kernel; it's
provided with all distro kernels, so we can expect it to be around. If we
find it, we repeat the same process as in #1.

If we found one of the above, we start matching config sets ("we need
a,b,c,d for virtio, let's see if it's all there"). Once we find a good
config set, we use it for defaults. If we didn't find a good config set
we warn the user and don't even bother starting the guest.

If we couldn't find either, we can just default to whatever we have as
defaults now.
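
For illustration, a rough sketch of that lookup in shell (it assumes the
kernel tree's scripts/extract-ikconfig helper and a System.map sitting
next to the kernel image; virtblk_probe is just one plausible symbol to
probe for):

#!/bin/bash
KERNEL=$1

# 1. Kernel built with CONFIG_IKCONFIG: read the embedded .config.
if CONF=$(scripts/extract-ikconfig "$KERNEL" 2>/dev/null); then
	echo "$CONF" | grep -q '^CONFIG_VIRTIO_BLK=[ym]' &&
		echo "guest supports virtio-blk"
	exit 0
fi

# 2. Otherwise, look for a System.map next to the kernel and match
#    a symbol the virtio-blk driver would bring in.
MAP=$(dirname "$KERNEL")/System.map
grep -qw virtblk_probe "$MAP" 2>/dev/null &&
	echo "guest supports virtio-blk (via System.map)"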


To sum it up, I was wondering if this approach has been considered
before and whether it sounds interesting enough to try.

-- 

Sasha.



Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-24 Thread Pekka Enberg
On Wed, 2011-08-24 at 23:52 -0500, Alexander Graf wrote:
> > Isn't ivshmem in QEMU? If so, then I don't think there is any
> > competition. How do you feel that these are competing?
> 
> Well, it means that inside the guest you will have two different
> devices depending on whether you're using QEMU or kvm-tool. I don't see
> the point in exposing different devices to the guest just because of
> NIH. Why should a guest care which device emulation framework you're
> using?

It's a pretty special-purpose device that requires user configuration so
I don't consider QEMU compatibility to be mandatory. It'd be nice to
have but not something to bend over backwards for.

Pekka



Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-24 Thread Pekka Enberg
On Wed, 2011-08-24 at 21:49 -0700, David Evensky wrote:
> On Wed, Aug 24, 2011 at 10:27:18PM -0500, Alexander Graf wrote:
> > 
> > On 24.08.2011, at 17:25, David Evensky wrote:
> > 
> > > 
> > > 
> > > This patch adds a PCI device that provides PCI device memory to the
> > > guest. This memory in the guest exists as a shared memory segment in
> > > the host. This is similar to the memory sharing capability of Nahanni
> > > (ivshmem) available in QEMU. In this case, the shared memory segment
> > > is exposed as a PCI BAR only.
> > > 
> > > A new command line argument is added as:
> > >--shmem pci:0xc800:16MB:handle=/newmem:create
> > > 
> > > which will set the PCI BAR at 0xc800; the shared memory segment
> > > and the region pointed to by the BAR will be 16MB. On the host side
> > > the shm_open handle will be '/newmem', and the kvm tool will create
> > > the shared segment, set its size, and initialize it. If the size,
> > > handle, or create flag are absent, they will default to 16MB,
> > > handle=/kvm_shmem, and create will be false. The address family
> > > 'pci:' is also optional, as it is the only address family currently
> > > supported. Only a single --shmem is supported at this time.
> > 
> > Did you have a look at ivshmem? It does that today, but also gives
> you an IRQ line so the guests can poke each other. For something as
> simple as this, I don't see why we'd need two competing
> implementations.
> 
> Isn't ivshmem in QEMU? If so, then I don't think there is any
> competition. How do you feel that these are competing?

It's obviously not competing. One thing you might want to consider is
making the guest interface compatible with ivshmem. Is there any reason
we shouldn't do that? I don't consider that a requirement, just nice to
have.

Pekka



Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-24 Thread Alexander Graf

On 24.08.2011, at 23:49, David Evensky wrote:

> On Wed, Aug 24, 2011 at 10:27:18PM -0500, Alexander Graf wrote:
>> 
>> On 24.08.2011, at 17:25, David Evensky wrote:
>> 
>>> 
>>> 
>>> This patch adds a PCI device that provides PCI device memory to the
>>> guest. This memory in the guest exists as a shared memory segment in
>>> the host. This is similar to the memory sharing capability of Nahanni
>>> (ivshmem) available in QEMU. In this case, the shared memory segment
>>> is exposed as a PCI BAR only.
>>> 
>>> A new command line argument is added as:
>>>   --shmem pci:0xc800:16MB:handle=/newmem:create
>>> 
>>> which will set the PCI BAR at 0xc800; the shared memory segment
>>> and the region pointed to by the BAR will be 16MB. On the host side
>>> the shm_open handle will be '/newmem', and the kvm tool will create
>>> the shared segment, set its size, and initialize it. If the size,
>>> handle, or create flag are absent, they will default to 16MB,
>>> handle=/kvm_shmem, and create will be false. The address family
>>> 'pci:' is also optional, as it is the only address family currently
>>> supported. Only a single --shmem is supported at this time.
>> 
>> Did you have a look at ivshmem? It does that today, but also gives you an 
>> IRQ line so the guests can poke each other. For something as simple as this, 
>> I don't see why we'd need two competing implementations.
> 
> Isn't ivshmem in QEMU? If so, then I don't think there is any
> competition. How do you feel that these are competing?

Well, it means that inside the guest you will have two different devices 
depending on whether you're using QEMU or kvm-tool. I don't see the point in 
exposing different devices to the guest just because of NIH. Why should a guest 
care which device emulation framework you're using?


Alex



Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-24 Thread David Evensky
On Wed, Aug 24, 2011 at 10:27:18PM -0500, Alexander Graf wrote:
> 
> On 24.08.2011, at 17:25, David Evensky wrote:
> 
> > 
> > 
> > This patch adds a PCI device that provides PCI device memory to the
> > guest. This memory in the guest exists as a shared memory segment in
> > the host. This is similar to the memory sharing capability of Nahanni
> > (ivshmem) available in QEMU. In this case, the shared memory segment
> > is exposed as a PCI BAR only.
> > 
> > A new command line argument is added as:
> >--shmem pci:0xc800:16MB:handle=/newmem:create
> > 
> > which will set the PCI BAR at 0xc800; the shared memory segment
> > and the region pointed to by the BAR will be 16MB. On the host side
> > the shm_open handle will be '/newmem', and the kvm tool will create
> > the shared segment, set its size, and initialize it. If the size,
> > handle, or create flag are absent, they will default to 16MB,
> > handle=/kvm_shmem, and create will be false. The address family
> > 'pci:' is also optional, as it is the only address family currently
> > supported. Only a single --shmem is supported at this time.
> 
> Did you have a look at ivshmem? It does that today, but also gives you an IRQ 
> line so the guests can poke each other. For something as simple as this, I 
> don't see why we'd need two competing implementations.

Isn't ivshmem in QEMU? If so, then I don't think there is any
competition. How do you feel that these are competing?

\dae

> 
> 
> Alex
> 
> 


Re: [PATCH 11/11] KVM: MMU: improve write flooding detected

2011-08-24 Thread Avi Kivity

On 08/25/2011 05:04 AM, Marcelo Tosatti wrote:

>
>  It could increase the flood count independently of the accessed bit of
>  the spte being updated, zapping after 3 attempts as it is now.
>
>  But additionally reset the flood count if the gpte appears to be valid
>  (points to an existent gfn if the present bit is set, or if it's zeroed).

Well, not zero, as that's a common pattern for non-ptes.



On 32-bit with 4GB RAM, practically anything is a valid gpte.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: about vEOI optimization

2011-08-24 Thread Avi Kivity

On 08/25/2011 05:24 AM, Tian, Kevin wrote:

>
>  Another option is the hyper-V EOI support, which can also eliminate the
>  EOI exit when no additional interrupt is pending.  This can improve EOI
>  performance even more.
>

yes, and this is an orthogonal option.

So if you agree, I'll send out an updated patch atop their work.




Thanks.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH] KVM: Add wrapper script around Qemu to test kernels

2011-08-24 Thread Américo Wang
On Thu, Aug 25, 2011 at 4:35 AM, Alexander Graf  wrote:
>
> On 24.08.2011, at 00:31, Américo Wang wrote:
>
>> On Wed, Aug 24, 2011 at 1:19 PM, Pekka Enberg  wrote:
>>>
>>> It's nice to see such an honest attempt at improving QEMU usability, 
>>> Alexander!
>>>
>>> One comment: in my experience, having shell scripts under
>>> Documentation reduces the likelihood that people actually discover
>>> them so you might want to consider putting it under scripts or tools.
>>>
>>
>> I was going to give the same suggestion, +1 for tools/ directory.
>
> Well, scripts/ is a flat directory where I can just throw in the script.
> tools/ however is split by tool, and creating a whole new directory for
> only a single script sounds a bit like overkill to me. I'll move it to
> scripts/ for now :)

How about the tools/testing/ directory?

scripts/ is mainly for the tools/utilities we use to build the kernel or
do kernel dev; it is not so suitable for your script, IMHO.

Thanks.


Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-24 Thread Alexander Graf

On 24.08.2011, at 17:25, David Evensky wrote:

> 
> 
> This patch adds a PCI device that provides PCI device memory to the
> guest. This memory in the guest exists as a shared memory segment in
> the host. This is similar to the memory sharing capability of Nahanni
> (ivshmem) available in QEMU. In this case, the shared memory segment
> is exposed as a PCI BAR only.
> 
> A new command line argument is added as:
>--shmem pci:0xc800:16MB:handle=/newmem:create
> 
> which will set the PCI BAR at 0xc800; the shared memory segment
> and the region pointed to by the BAR will be 16MB. On the host side
> the shm_open handle will be '/newmem', and the kvm tool will create
> the shared segment, set its size, and initialize it. If the size,
> handle, or create flag are absent, they will default to 16MB,
> handle=/kvm_shmem, and create will be false. The address family
> 'pci:' is also optional, as it is the only address family currently
> supported. Only a single --shmem is supported at this time.

Did you have a look at ivshmem? It does that today, but also gives you an IRQ 
line so the guests can poke each other. For something as simple as this, I 
don't see why we'd need two competing implementations.


Alex



RE: about vEOI optimization

2011-08-24 Thread Tian, Kevin
> From: Avi Kivity [mailto:a...@redhat.com]
> Sent: Wednesday, August 24, 2011 6:00 PM
> 
> On 08/23/2011 11:09 AM, Tian, Kevin wrote:
> > Hi, Avi,
> >
> > Both Eddie and Marcello once suggested vEOI optimization by skipping
> > heavy-weight instruction decode, which reduces vEOI overhead greatly:
> >
> > http://www.mail-archive.com/kvm@vger.kernel.org/msg18619.html
> > http://www.spinics.net/lists/kvm/msg36691.html
> >
> > Though virtual x2apic serves a similar purpose, it depends on the guest OS
> > supporting x2apic. Many Windows versions don't support x2apic, though,
> > including Win7, Windows Server before 2008 R2, etc. Given that
> > virtualization needs to support various OS versions, is there any chance
> > to incorporate the above vEOI optimization in KVM as an alternative, to
> > boost performance when the guest doesn't support x2apic?
> >
> 
> Yes.  There was a problem with the guest using MOVSD or STOSD to write
> the EOI; if we don't emulate, then registers don't get updated.  I guess
> we can ignore it since no sane guest will use those instructions for EOI.

yes, sane guests all use MOV for EOI. BTW, Xen has had a similar
acceleration integrated for several releases. When we measure 10G SR-IOV
network performance, vEOI access overhead can be up to 6%-8% depending on
the interrupt rate, which is one factor causing KVM to lag behind.
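
For illustration, a rough sketch of the fast path in question (not the
actual patch): on an APIC-access exit, recognize a write to the EOI
register by its offset and complete it without running the full
instruction emulator. As noted above, this is safe in practice only
because sane guests use a plain MOV for EOI.

#include <stdbool.h>

#define APIC_EOI_OFFSET 0xB0UL

static bool try_fast_eoi(unsigned long apic_base, unsigned long fault_gpa,
			 bool is_write, void (*apic_set_eoi)(void))
{
	if (!is_write || fault_gpa - apic_base != APIC_EOI_OFFSET)
		return false;		/* fall back to full emulation */
	apic_set_eoi();			/* the value written is ignored */
	return true;
}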

> 
> Another option is the hyper-V EOI support, which can also eliminate the
> EOI exit when no additional interrupt is pending.  This can improve EOI
> performance even more.
> 

yes, and this is an orthogonal option.

So if you agree, I'll send out an updated patch atop their work.

Thanks
Kevin


Re: [PATCH 11/11] KVM: MMU: improve write flooding detected

2011-08-24 Thread Marcelo Tosatti
On Wed, Aug 24, 2011 at 05:05:40PM -0300, Marcelo Tosatti wrote:
> On Wed, Aug 24, 2011 at 04:16:52AM +0800, Xiao Guangrong wrote:
> > On 08/24/2011 03:09 AM, Marcelo Tosatti wrote:
> > > On Wed, Aug 24, 2011 at 12:32:32AM +0800, Xiao Guangrong wrote:
> > >> On 08/23/2011 08:38 PM, Marcelo Tosatti wrote:
> > >>
> >  And, I think there are no problems since: if the spte without the
> >  accessed bit is written frequently, it means the guest page table is
> >  accessed infrequently, or the guest page table is not accessed during
> >  the writing; in this case, zapping this shadow page is not bad.
> > >>>
> > >>> Think of the following scenario:
> > >>>
> > >>> 1) page fault, spte with accessed bit is created from gpte at 
> > >>> gfnA+indexA.
> > >>> 2) write to gfnA+indexA, spte has accessed bit set, write_flooding_count
> > >>> is not increased.
> > >>> 3) repeat
> > >>>
> > >>
> > >> I think the result is just what we hoped for: we do not want to zap the
> > >> shadow page because the spte is currently used by the guest, and it will
> > >> also be used in the next repetition. So not increasing
> > >> 'write_flooding_count' is a good choice.
> > > 
> > > Its not used. Step 2) is write to write protected shadow page at
> > > gfnA.
> > > 
> > >> Let's consider what will happen if we increase 'write_flooding_count':
> > >> 1: after three repetitions, zap the shadow page
> > >> 2: in step 1, we will alloc a new shadow page for gpte at gfnA+indexA
> > >> 3: in step 2, the flooding count is increased, so after 3 repetitions,
> > >>    the shadow page can be zapped again; repeat 1 to 3.
> > > 
> > > The shadow page will not be zapped because the spte created from
> > > gfnA+indexA has the accessed bit set:
> > > 
> > >if (spte && !(*spte & shadow_accessed_mask))
> > >sp->write_flooding_count++;
> > >else
> > >sp->write_flooding_count = 0;
> > > 
> > 
> > Ah, I see, I thought it was "repeat"ed on the same spte; that was my mistake.
> > 
> > Yes, in this case the sp is not zapped, but it is hard to know whether the
> > gfn is used as a gpte just from the writes; for example, the guest can
> > change the mapping address or the status bits, and so on... The sp can be
> > zapped if the guest writes it again (at the same address), which I think is
> > acceptable; anyway, it is just a speculative way to zap unused
> > shadow pages... your opinion?
> 
> It could increase the flood count independently of the accessed bit of
> the spte being updated, zapping after 3 attempts as it is now.
> 
> But additionally reset the flood count if the gpte appears to be valid
> (points to an existent gfn if the present bit is set, or if it's zeroed).

Well, not zero, as that's a common pattern for non-ptes.
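
To make the proposal concrete, a self-contained sketch (not the actual
KVM code; gpte_looks_valid() is a hypothetical stand-in for the validity
test being discussed, and the layout constants are illustrative):

#include <stdbool.h>

struct shadow_page {
	int write_flooding_count;
};

/* hypothetical predicate: does the written value plausibly look like a
 * valid gpte? As noted above, zero must not count, since zero is a
 * common pattern for non-ptes. */
static bool gpte_looks_valid(unsigned long long gpte, unsigned long long max_gfn)
{
	bool present = gpte & 1;
	unsigned long long gfn = gpte >> 12;

	return present && gfn <= max_gfn;
}

/* bump the count on every write, independent of the spte accessed bit,
 * but reset it when the write looks like a real gpte update */
static bool should_zap(struct shadow_page *sp, unsigned long long gpte,
		       unsigned long long max_gfn)
{
	if (gpte_looks_valid(gpte, max_gfn)) {
		sp->write_flooding_count = 0;
		return false;
	}
	return ++sp->write_flooding_count >= 3;	/* zap after 3, as now */
}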



Re: Emulating LWZU Instruction for e500 powerpc

2011-08-24 Thread Alexander Graf

On 19.08.2011, at 06:45, Aashish Mittal wrote:

> Hi
> I'm trying to emulate the lwzu instruction in e500 PowerPC KVM for my
> project. I've removed the read and write privileges from the TLB entries of
> certain guest pages. So when I try to emulate the lwzu instruction I get a
> kernel panic while mounting the guest filesystem during boot.
> 
> attempt to access beyond end of device
> ram0: rw=0, want=75703268, limit=262144
> 
> To confirm that the emulation is faulty, what I'm trying now is: at the time
> of a DATA STORAGE exit on a marked page caused by an lwzu instruction, I
> patch the next instruction with one that will raise an INTERRUPT PROGRAM
> EXCEPTION and get trapped in KVM; then I revert the old read and write
> privileges of this page and resume the guest, so that this lwzu instruction
> can run natively. I'm expecting the immediately following instruction to
> raise the INTERRUPT PROGRAM EXCEPTION, but all I'm getting are DATA STORAGE
> exits at other pages marked by me, and DTLB and ITLB misses at other
> addresses.
> 
> I've made sure to flush the icache after patching, using
> flush_icache_range.
> 
> Error Log :
> Emulating a lwzu instruction on pc 0xc00161ac && eaddr 0xc05742f0
> Original Instruction is 0x90e60004 at pc: 0xc00161b0
> Modified Instruction is 0x7ce000a6 at pc: 0xc00161b0 
> Exit : Interrupt DATA STORAGE at pc 0xc000f210 on eaddr:0xc000f228 
> instruction: 
> 0x8085001c
> 
> Why am I not getting an INTERRUPT PROGRAM EXCEPTION immediately on the next
> instruction?

Hrm. Are you sure you're actually modifying the instruction? This looks like 
you're running Linux, so you could try and just put a "b ." instruction right 
after the instruction you're trying to patch up and examine memory from Qemu :)


Alex



[PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-24 Thread David Evensky


This patch adds a PCI device that provides PCI device memory to the
guest. This memory in the guest exists as a shared memory segment in
the host. This is similar to the memory sharing capability of Nahanni
(ivshmem) available in QEMU. In this case, the shared memory segment
is exposed as a PCI BAR only.

A new command line argument is added as:
--shmem pci:0xc800:16MB:handle=/newmem:create

which will set the PCI BAR at 0xc800; the shared memory segment
and the region pointed to by the BAR will be 16MB. On the host side
the shm_open handle will be '/newmem', and the kvm tool will create
the shared segment, set its size, and initialize it. If the size,
handle, or create flag are absent, they will default to 16MB,
handle=/kvm_shmem, and create will be false. The address family
'pci:' is also optional, as it is the only address family currently
supported. Only a single --shmem is supported at this time.

Signed-off-by: David Evensky 

diff -uprN -X linux-kvm/Documentation/dontdiff linux-kvm/tools/kvm/builtin-run.c linux-kvm_pci_shmem/tools/kvm/builtin-run.c
--- linux-kvm/tools/kvm/builtin-run.c   2011-08-24 10:21:22.342077674 -0700
+++ linux-kvm_pci_shmem/tools/kvm/builtin-run.c 2011-08-24 14:17:33.190451297 -0700
@@ -28,6 +28,8 @@
 #include "kvm/sdl.h"
 #include "kvm/vnc.h"
 #include "kvm/guest_compat.h"
+#include "shmem-util.h"
+#include "kvm/pci-shmem.h"
 
 #include 
 
@@ -52,6 +54,8 @@
 #define DEFAULT_SCRIPT "none"
 
 #define MB_SHIFT   (20)
+#define KB_SHIFT   (10)
+#define GB_SHIFT   (30)
 #define MIN_RAM_SIZE_MB(64ULL)
 #define MIN_RAM_SIZE_BYTE  (MIN_RAM_SIZE_MB << MB_SHIFT)
 
@@ -151,6 +155,130 @@ static int virtio_9p_rootdir_parser(cons
return 0;
 }
 
+static int shmem_parser(const struct option *opt, const char *arg, int unset)
+{
+   const uint64_t default_size = SHMEM_DEFAULT_SIZE;
+   const uint64_t default_phys_addr = SHMEM_DEFAULT_ADDR;
+   const char *default_handle = SHMEM_DEFAULT_HANDLE;
+   enum { PCI, UNK } addr_type = PCI;
+   uint64_t phys_addr;
+   uint64_t size;
+   char *handle = NULL;
+   int create = 0;
+   const char *p = arg;
+   char *next;
+   int base = 10;
+   int verbose = 0;
+
+   const int skip_pci = strlen("pci:");
+   if (verbose)
+   pr_info("shmem_parser(%p,%s,%d)", opt, arg, unset);
+   /* parse out optional addr family */
+   if (strcasestr(p, "pci:")) {
+   p += skip_pci;
+   addr_type = PCI;
+   } else if (strcasestr(p, "mem:")) {
+   die("I can't add to E820 map yet.\n");
+   }
+   /* parse out physical addr */
+   base = 10;
+   if (strcasestr(p, "0x"))
+   base = 16;
+   phys_addr = strtoll(p, &next, base);
+   if (next == p && phys_addr == 0) {
+   pr_info("shmem: no physical addr specified, using default.");
+   phys_addr = default_phys_addr;
+   }
+   if (*next != ':' && *next != '\0')
+   die("shmem: unexpected chars after phys addr.\n");
+   if (*next == '\0')
+   p = next;
+   else
+   p = next + 1;
+   /* parse out size */
+   base = 10;
+   if (strcasestr(p, "0x"))
+   base = 16;
+   size = strtoll(p, &next, base);
+   if (next == p && size == 0) {
+   pr_info("shmem: no size specified, using default.");
+   size = default_size;
+   }
+   /* look for [KMGkmg][Bb]*  uses base 2. */
+   int skip_B = 0;
+   if (strspn(next, "KMGkmg")) {   /* might have a prefix */
+   if (*(next + 1) == 'B' || *(next + 1) == 'b')
+   skip_B = 1;
+   switch (*next) {
+   case 'K':
+   case 'k':
+   size = size << KB_SHIFT;
+   break;
+   case 'M':
+   case 'm':
+   size = size << MB_SHIFT;
+   break;
+   case 'G':
+   case 'g':
+   size = size << GB_SHIFT;
+   break;
+   default:
+   die("shmem: bug in detecting size prefix.");
+   break;
+   }
+   next += 1 + skip_B;
+   }
+   if (*next != ':' && *next != '\0') {
+   die("shmem: unexpected chars after phys size. <%c><%c>\n",
+   *next, *p);
+   }
+   if (*next == '\0')
+   p = next;
+   else
+   p = next + 1;
+   /* parse out optional shmem handle */
+   const int skip_handle = strlen("handle=");
+   next = strcasestr(p, "handle=");
+   if (*p && next) {
+   if (p != next)
+   die("unexpected chars before handle\n");
+   p += skip_handle;
+   next = strchrnul(p, ':');
+   if (next - p) {
+

[PATCH] KVM: Add wrapper script around QEMU to test kernels

2011-08-24 Thread Alexander Graf
On LinuxCon I had a nice chat with Linus on what he thinks kvm-tool
would be doing and what he expects from it. Basically he wants a
small and simple tool he and other developers can run to try out and
see if the kernel they just built actually works.

Fortunately, QEMU can do that today already! The only piece that was
missing was the "simple" piece of the equation, so here is a script
that wraps around QEMU and executes a kernel you just built.

If you do have KVM around and are not cross-compiling, it will use
KVM. But if you don't, you can still fall back to emulation mode and
at least check if your kernel still does what you expect. I only
implemented support for s390x and ppc there, but it's easily extensible
to more platforms, as QEMU can emulate (and virtualize) pretty much
any platform out there.

If you don't have qemu installed, please do so before using this script. Your
distro should provide a package for it (might even call it "kvm"). If not,
just compile it from source - it's not hard!

To quickly get going, just execute the following as user:

$ ./Documentation/run-qemu.sh -r / -a init=/bin/bash

This will drop you into a shell on your rootfs.

Happy hacking!

Signed-off-by: Alexander Graf 

---

v1 -> v2:

  - fix naming of QEMU
  - use grep -q for has_config
  - support multiple -a args
  - spawn gdb on execution
  - pass through qemu options
  - dont use qemu-system-x86_64 on i386
  - add funny sentence to startup text
  - more helpful error messages
---
 scripts/run-qemu.sh |  334 +++
 1 files changed, 334 insertions(+), 0 deletions(-)
 create mode 100755 scripts/run-qemu.sh

diff --git a/scripts/run-qemu.sh b/scripts/run-qemu.sh
new file mode 100755
index 000..5d4e185
--- /dev/null
+++ b/scripts/run-qemu.sh
@@ -0,0 +1,334 @@
+#!/bin/bash
+#
+# QEMU Launcher
+#
+# This script enables simple use of the KVM and QEMU tool stack for
+# easy kernel testing. It allows to pass either a host directory to
+# the guest or a disk image. Example usage:
+#
+# Run the host root fs inside a VM:
+#
+# $ ./scripts/run-qemu.sh -r /
+#
+# Run the same with SDL:
+#
+# $ ./scripts/run-qemu.sh -r / --sdl
+# 
+# Or with a PPC build:
+#
+# $ ARCH=ppc ./scripts/run-qemu.sh -r /
+# 
+# PPC with a mac99 model by passing options to QEMU:
+#
+# $ ARCH=ppc ./scripts/run-qemu.sh -r / -- -M mac99
+#
+
+USE_SDL=
+USE_VNC=
+USE_GDB=1
+KERNEL_BIN=arch/x86/boot/bzImage
+MON_STDIO=
+KERNEL_APPEND2=
+SERIAL=ttyS0
+SERIAL_KCONFIG=SERIAL_8250
+BASENAME=$(basename "$0")
+
+function usage() {
+   echo "
+$BASENAME allows you to execute a virtual machine with the Linux kernel
+that you just built. To only execute a simple VM, you can just run it
+on your root fs with \"-r / -a init=/bin/bash\"
+
+   -a, --append parameters
+   Append the given parameters to the kernel command line.
+
+   -d, --disk image
+   Add the image file as disk into the VM.
+
+   -D, --no-gdb
+   Don't run an xterm with gdb attached to the guest.
+
+   -r, --root directory
+   Use the specified directory as root directory inside the guest.
+
+   -s, --sdl
+   Enable SDL graphical output.
+
+   -S, --smp cpus
+   Set number of virtual CPUs.
+
+   -v, --vnc
+   Enable VNC graphical output.
+
+Examples:
+
+   Run the host root fs inside a VM:
+   $ ./scripts/run-qemu.sh -r /
+
+   Run the same with SDL:
+   $ ./scripts/run-qemu.sh -r / --sdl
+   
+   Or with a PPC build:
+   $ ARCH=ppc ./scripts/run-qemu.sh -r /
+   
+   PPC with a mac99 model by passing options to QEMU:
+   $ ARCH=ppc ./scripts/run-qemu.sh -r / -- -M mac99
+"
+}
+
+function require_config() {
+   if [ "$(grep CONFIG_$1=y .config)" ]; then
+   return
+   fi
+
+   echo "You need to enable CONFIG_$1 for run-qemu to work properly"
+   exit 1
+}
+
+function has_config() {
+   grep -q "CONFIG_$1=y" .config
+}
+
+function drive_if() {
+   if has_config VIRTIO_BLK; then
+   echo virtio
+   elif has_config ATA_PIIX; then
+   echo ide
+   else
+   echo "\
+Your kernel must have either VIRTIO_BLK or ATA_PIIX
+enabled for block device assignment" >&2
+   exit 1
+   fi
+}
+
+GETOPT=`getopt -o a:d:Dhr:sS:v --long \
+   append:,disk:,no-gdb,help,root:,sdl,smp:,vnc \
+   -n "$(basename \"$0\")" -- "$@"`
+
+if [ $? != 0 ]; then
+   echo "Terminating..." >&2
+   exit 1
+fi
+
+eval set -- "$GETOPT"
+
+while true; do
+   case "$1" in
+   -a|--append)
+   KERNEL_APPEND2="$KERNEL_APPEND2 $2"
+   shift 2
+   ;;
+   -d|--disk)
+   QEMU_OPTIONS="$QEMU_OPTIONS -drive \
+   file=$2,if=$(drive_if),cache=unsafe"
+   USE_DISK=1
+   shift
+   ;;
+   -D|--no-gdb)
+   USE_GD

Re: [Qemu-devel] [PATCH] KVM: Add wrapper script around Qemu to test kernels

2011-08-24 Thread Alexander Graf

On 24.08.2011, at 12:40, Blue Swirl wrote:

> On Tue, Aug 23, 2011 at 10:16 PM, Alexander Graf  wrote:
>> On LinuxCon I had a nice chat with Linus on what he thinks kvm-tool
>> would be doing and what he expects from it. Basically he wants a
>> small and simple tool he and other developers can run to try out and
>> see if the kernel they just built actually works.
>> 
>> Fortunately, Qemu can do that today already! The only piece that was
>> missing was the "simple" piece of the equation, so here is a script
>> that wraps around Qemu and executes a kernel you just built.
>> 
>> If you do have KVM around and are not cross-compiling, it will use
>> KVM. But if you don't, you can still fall back to emulation mode and
>> at least check if your kernel still does what you expect. I only
>> implemented support for s390x and ppc there, but it's easily extensible
>> to more platforms, as Qemu can emulate (and virtualize) pretty much
>> any platform out there.
>> 
>> If you don't have qemu installed, please do so before using this script. Your
>> distro should provide a package for it (might even call it "kvm"). If not,
>> just compile it from source - it's not hard!
>> 
>> To quickly get going, just execute the following as user:
>> 
>>$ ./Documentation/run-qemu.sh -r / -a init=/bin/bash
>> 
>> This will drop you into a shell on your rootfs.
>> 
>> Happy hacking!
>> 
>> Signed-off-by: Alexander Graf 
>> ---
>>  Documentation/run-qemu.sh |  284
>>  1 files changed, 284 insertions(+), 0 deletions(-)
>>  create mode 100755 Documentation/run-qemu.sh
>> 
>> diff --git a/Documentation/run-qemu.sh b/Documentation/run-qemu.sh
>> new file mode 100755
>> index 000..0bac924
>> --- /dev/null
>> +++ b/Documentation/run-qemu.sh
>> @@ -0,0 +1,284 @@
>> +#!/bin/bash
>> +#
>> +# QEMU Launcher
>> +#
>> +# This script enables simple use of the KVM and Qemu tool stack for
> 
> QEMU
> 
>> +# easy kernel testing. It allows to pass either a host directory to
>> +# the guest or a disk image. Example usage:
>> +#
>> +# Run the host root fs inside a VM:
>> +#
>> +# $ ./Documentation/run-qemu.sh -r /
>> +#
>> +# Run the same with SDL:
>> +#
>> +# $ ./Documentation/run-qemu.sh -r / --sdl
>> +#
>> +# Or with a PPC build:
>> +#
>> +# $ ARCH=ppc ./Documentation/run-qemu.sh -r /
>> +#
>> +#
>> +
>> +USE_SDL=
>> +USE_VNC=
>> +KERNEL_BIN=arch/x86/boot/bzImage
>> +MON_STDIO=
>> +KERNEL_APPEND2=
>> +SERIAL=ttyS0
>> +SERIAL_KCONFIG=SERIAL_8250
>> +
>> +function usage() {
>> +   echo "
>> +Run-Qemu allows you to execute a virtual machine with the Linux kernel
> 
> run-qemu.sh or $0
> 
>> +that you just built. To only execute a simple VM, you can just run it
>> +on your root fs with \"-r / -a init=/bin/bash\"
>> +
>> +   -a, --append parameters
>> +   Append the given parameters to the kernel command line
>> +
>> +   -d, --disk image
>> +   Add the image file as disk into the VM
>> +
>> +   -r, --root directory
>> +   Use the specified directory as root directory inside the 
>> guest.
>> +
>> +   -s, --sdl
>> +   Enable SDL graphical output.
>> +
>> +   -S, --smp cpus
>> +   Set number of virtual CPUs
>> +
>> +   -v, --vnc
>> +   Enable VNC graphical output.
>> +
>> +Examples:
>> +
>> +   Run the host root fs inside a VM:
>> +   $ ./Documentation/run-qemu.sh -r /
>> +
>> +   Run the same with SDL:
>> +   $ ./Documentation/run-qemu.sh -r / --sdl
>> +
>> +   Or with a PPC build:
>> +   $ ARCH=ppc ./Documentation/run-qemu.sh -r /
>> +"
>> +}
>> +
>> +function require_config() {
>> +   if [ "$(grep CONFIG_$1=y .config)" ]; then
>> +   return
>> +   fi
>> +
>> +   echo "You need to enable CONFIG_$1 for run-qemu to work properly"
>> +   exit 1
>> +}
>> +
>> +function has_config() {
>> +   grep "CONFIG_$1=y" .config
>> +}
>> +
>> +function drive_if() {
>> +   if [ "$(has_config VIRTIO_BLK)" ]; then
>> +   echo virtio
>> +   elif [ "$(has_config ATA_PIIX)" ]; then
>> +   echo ide
>> +   else
>> +   echo "\
>> +Your kernel must have either VIRTIO_BLK or ATA_PIIX
>> +enabled for block device assignment" >&2
>> +   exit 1
>> +   fi
>> +}
>> +
>> +GETOPT=`getopt -o a:d:hr:sS:v --long append,disk:,help,root:,sdl,smp:,vnc \
>> +   -n "$(basename \"$0\")" -- "$@"`
>> +
>> +if [ $? != 0 ]; then
>> +   echo "Terminating..." >&2
>> +   exit 1
>> +fi
>> +
>> +eval set -- "$GETOPT"
>> +
>> +while true; do
>> +   case "$1" in
>> +   -a|--append)
>> +   KERNEL_APPEND2="$2"
>> +   shift 2
>> +   ;;
>> +   -d|--disk)
>> +   QEMU_OPTIONS="$QEMU_OPTIONS -drive \
>> +   file=$2,if=$(drive_if),cache=unsafe"
>> +   USE_DISK=1
>> +   shift 2
>> +   ;;
>> +   -h|--help)
>> +  

Re: kvm PCI assignment & VFIO ramblings

2011-08-24 Thread Alex Williamson
Joerg,

Is this roughly what you're thinking of for the iommu_group component?
Adding a dev_to_group iommu ops callback lets us consolidate the sysfs
support in the iommu base.  Would AMD-Vi do something similar (or
exactly the same) for group #s?  Thanks,

Alex

Signed-off-by: Alex Williamson 

diff --git a/drivers/base/iommu.c b/drivers/base/iommu.c
index 6e6b6a1..6b54c1a 100644
--- a/drivers/base/iommu.c
+++ b/drivers/base/iommu.c
@@ -17,20 +17,56 @@
  */
 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 
 static struct iommu_ops *iommu_ops;
 
+static ssize_t show_iommu_group(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   return sprintf(buf, "%lx", iommu_dev_to_group(dev));
+}
+static DEVICE_ATTR(iommu_group, S_IRUGO, show_iommu_group, NULL);
+
+static int add_iommu_group(struct device *dev, void *unused)
+{
+   if (iommu_dev_to_group(dev) >= 0)
+   return device_create_file(dev, &dev_attr_iommu_group);
+
+   return 0;
+}
+
+static int device_notifier(struct notifier_block *nb,
+  unsigned long action, void *data)
+{
+   struct device *dev = data;
+
+   if (action == BUS_NOTIFY_ADD_DEVICE)
+   return add_iommu_group(dev, NULL);
+
+   return 0;
+}
+
+static struct notifier_block device_nb = {
+   .notifier_call = device_notifier,
+};
+
 void register_iommu(struct iommu_ops *ops)
 {
if (iommu_ops)
BUG();
 
iommu_ops = ops;
+
+   /* FIXME - non-PCI, really want for_each_bus() */
+   bus_register_notifier(&pci_bus_type, &device_nb);
+   bus_for_each_dev(&pci_bus_type, NULL, NULL, add_iommu_group);
 }
 
 bool iommu_found(void)
@@ -94,6 +130,14 @@ int iommu_domain_has_cap(struct iommu_domain *domain,
 }
 EXPORT_SYMBOL_GPL(iommu_domain_has_cap);
 
+long iommu_dev_to_group(struct device *dev)
+{
+   if (iommu_ops->dev_to_group)
+   return iommu_ops->dev_to_group(dev);
+   return -ENODEV;
+}
+EXPORT_SYMBOL_GPL(iommu_dev_to_group);
+
 int iommu_map(struct iommu_domain *domain, unsigned long iova,
  phys_addr_t paddr, int gfp_order, int prot)
 {
diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index f02c34d..477259c 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -404,6 +404,7 @@ static int dmar_map_gfx = 1;
 static int dmar_forcedac;
 static int intel_iommu_strict;
 static int intel_iommu_superpage = 1;
+static int intel_iommu_no_mf_groups;
 
 #define DUMMY_DEVICE_DOMAIN_INFO ((struct device_domain_info *)(-1))
 static DEFINE_SPINLOCK(device_domain_lock);
@@ -438,6 +439,10 @@ static int __init intel_iommu_setup(char *str)
printk(KERN_INFO
"Intel-IOMMU: disable supported super page\n");
intel_iommu_superpage = 0;
+   } else if (!strncmp(str, "no_mf_groups", 12)) {
+   printk(KERN_INFO
+   "Intel-IOMMU: disable separate groups for 
multifunction devices\n");
+   intel_iommu_no_mf_groups = 1;
}
 
str += strcspn(str, ",");
@@ -3902,6 +3907,52 @@ static int intel_iommu_domain_has_cap(struct 
iommu_domain *domain,
return 0;
 }
 
+/* Group numbers are arbitrary.  Devices with the same group number
+ * indicate the iommu cannot differentiate between them.  To avoid
+ * tracking used groups we just use the seg|bus|devfn of the lowest
+ * level we're able to differentiate devices */
+static long intel_iommu_dev_to_group(struct device *dev)
+{
+   struct pci_dev *pdev = to_pci_dev(dev);
+   struct pci_dev *bridge;
+   union {
+   struct {
+   u8 devfn;
+   u8 bus;
+   u16 segment;
+   } pci;
+   u32 group;
+   } id;
+
+   if (iommu_no_mapping(dev))
+   return -ENODEV;
+
+   id.pci.segment = pci_domain_nr(pdev->bus);
+   id.pci.bus = pdev->bus->number;
+   id.pci.devfn = pdev->devfn;
+
+   if (!device_to_iommu(id.pci.segment, id.pci.bus, id.pci.devfn))
+   return -ENODEV;
+
+   bridge = pci_find_upstream_pcie_bridge(pdev);
+   if (bridge) {
+   if (pci_is_pcie(bridge)) {
+   id.pci.bus = bridge->subordinate->number;
+   id.pci.devfn = 0;
+   } else {
+   id.pci.bus = bridge->bus->number;
+   id.pci.devfn = bridge->devfn;
+   }
+   }
+
+   /* Virtual functions always get their own group */
+   if (!pdev->is_virtfn && intel_iommu_no_mf_groups)
+   id.pci.devfn = PCI_DEVFN(PCI_SLOT(id.pci.devfn), 0);
+
+   /* FIXME - seg # >= 0x8000 on 32b */
+   return id.group;
+}
+
 static struct iommu_ops intel_iommu_ops = {
.domain_init= in

Re: [PATCH] KVM: Add wrapper script around Qemu to test kernels

2011-08-24 Thread Alexander Graf

On 24.08.2011, at 04:16, Jan Kiszka wrote:

> On 2011-08-24 10:25, Avi Kivity wrote:
>> On 08/24/2011 01:16 AM, Alexander Graf wrote:
>>> +"
>>> +echo "\
>>> +Your guest is bound to the current foreground shell. To quit the guest,
>>> +please use Ctrl-A x"
>>> +echo "  Executing: $QEMU_BIN $QEMU_OPTIONS -append \"$KERNEL_APPEND\"
>>> -smp $SMP"
>>> +echo
>>> +
>>> +exec $QEMU_BIN $QEMU_OPTIONS -append "$KERNEL_APPEND -smp $SMP"
>> 
>> Would be nice to support launching gdb in a separate terminal with
>> vmlinux already loaded, and already attached to qemu.
> 
> + loading a python script into gdb to pull in module symbols. There are
> a few implementations floating around (including my own one).

I'll leave that part to you then :). I haven't figured out a nice way to
get modules into the VM yet, anyway.

> It would also be nice if one could append QEMU (note the capitalization
> BTW) options to the script, maybe everything after a '--' separator.

Good point :)


Alex



Re: [PATCH] KVM: Add wrapper script around Qemu to test kernels

2011-08-24 Thread Alexander Graf

On 24.08.2011, at 00:31, Américo Wang wrote:

> On Wed, Aug 24, 2011 at 1:19 PM, Pekka Enberg  wrote:
>> 
>> It's nice to see such an honest attempt at improving QEMU usability, 
>> Alexander!
>> 
>> One comment: in my experience, having shell scripts under
>> Documentation reduces the likelihood that people actually discover
>> them so you might want to consider putting it under scripts or tools.
>> 
> 
> I was going to give the same suggestion, +1 for tools/ directory.

Well, scripts/ is a flat directory where I can just throw in the script.
tools/, however, is split by tool, and creating a whole new directory for only
a single script sounds a bit like overkill to me. I'll move it to scripts/ for now :)


Alex



Re: [PATCH 11/11] KVM: MMU: improve write flooding detected

2011-08-24 Thread Marcelo Tosatti
On Wed, Aug 24, 2011 at 04:16:52AM +0800, Xiao Guangrong wrote:
> On 08/24/2011 03:09 AM, Marcelo Tosatti wrote:
> > On Wed, Aug 24, 2011 at 12:32:32AM +0800, Xiao Guangrong wrote:
> >> On 08/23/2011 08:38 PM, Marcelo Tosatti wrote:
> >>
>  And I think there is no problem, since if the spte without the accessed
>  bit is written frequently, it means the guest page table is accessed
>  infrequently, or is not accessed during the writing; in that case,
>  zapping this shadow page is not bad.
> >>>
> >>> Think of the following scenario:
> >>>
> >>> 1) page fault, spte with accessed bit is created from gpte at gfnA+indexA.
> >>> 2) write to gfnA+indexA, spte has accessed bit set, write_flooding_count
> >>> is not increased.
> >>> 3) repeat
> >>>
> >>
> >> I think the result is just what we hoped for: we do not want to zap the
> >> shadow page because the spte is currently used by the guest, and it will
> >> also be used in the next repetition. So not increasing
> >> 'write_flooding_count' is a good choice.
> > 
> > Its not used. Step 2) is write to write protected shadow page at
> > gfnA.
> > 
> >> Let's consider what will happen if we increase 'write_flooding_count':
> >> 1: after three repetitions, zap the shadow page
> >> 2: in step 1, we will allocate a new shadow page for the gpte at gfnA+indexA
> >> 3: in step 2, the flooding count is increased, so after 3 repetitions the
> >>    shadow page can be zapped again; repeat 1 to 3.
> > 
> > The shadow page will not be zapped because the spte created from
> > gfnA+indexA has the accessed bit set:
> > 
> >if (spte && !(*spte & shadow_accessed_mask))
> >sp->write_flooding_count++;
> >else
> >sp->write_flooding_count = 0;
> > 
> 
> Ah, I see, I thought it was repeated on the same spte; my mistake.
> 
> Yes, in this case the sp is not zapped, but it is hard to know from
> writes alone that the gfn is no longer used as a gpte; for example, the
> guest can change the mapping address or the status bits, and so on... The
> sp can be zapped if the guest writes it again (at the same address); I
> think that is acceptable. Anyway, it is just a speculative way to zap
> unused shadow pages... your opinion?

It could increase the flood count independently of the accessed bit of
the spte being updated, zapping after 3 attempts as it is now.

But additionally reset the flood count if the gpte appears to be valid
(points to an existent gfn if the present bit is set, or if it is zeroed).
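
Something along these lines (untested sketch, not a patch; gpte_valid()
is a made-up helper for the validity check described above):

	/* count every write to the shadowed gfn, ignoring the accessed bit */
	if (gpte_valid(vcpu->kvm, gpte))
		sp->write_flooding_count = 0;	/* looks like a real pte update */
	else
		++sp->write_flooding_count;

	if (sp->write_flooding_count >= 3)
		return true;			/* caller zaps the page, as today */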

> >> The result is that the shadow page for gfnA is allocated and zapped
> >> again and again, yes?
> > 
> > The point is you cannot rely on the accessed bit of sptes that have been
> > instantiated with the accessed bit set to decide whether or not to zap.
> > Because the accessed bit will only be cleared on host memory pressure.
> > 
> 
> Yes, the accessed bit is a coarse way to track gpte accesses; however,
> at least it can indicate whether the gfn was accessed over a period of
> time in most cases, for example, from when it is speculated to when it
> is written, or from when it is zapped to when it is written. I think it
> is not too bad.
> 
> Do you have ideas to improve this?




Re: [PATCH v3 0/7] qemu-kvm: device assignment cleanups and upstream diff reductions

2011-08-24 Thread Marcelo Tosatti
On Tue, Aug 23, 2011 at 07:32:56PM +0200, Jan Kiszka wrote:
> Rebased version of the previous round.
> 
> Jan Kiszka (7):
>   pci-assign: Fix kvm_deassign_irq handling in assign_irq
>   pci-assign: Update legacy interrupts only if used
>   pci-assign: Drop libpci header dependency
>   pci-assign: Refactor calc_assigned_dev_id
>   pci-assign: Track MSI/MSI-X capability position, clean up related
> code
>   pci-assign: Generic config space access management
>   qemu-kvm: Resolve PCI upstream diffs

Applied 1-6, thanks.



Re: [Qemu-devel] [PATCH] KVM: Add wrapper script around Qemu to test kernels

2011-08-24 Thread Avi Kivity

On 08/24/2011 08:40 PM, Blue Swirl wrote:

>  +
>  +   # Qemu's own build system calls it qemu-system-x86_64
>  +   [ "$QEMU_BIN" ] || QEMU_BIN=$(which qemu-system-x86_64 2>/dev/null)

If you run qemu-system-x86_64 on an i386 host, will it use kvm at all?


I think it will, and that's actually a bug, since kvm doesn't support 
virtualizing long mode on a 32-bit host.



--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH] kvm tools: Improve init within a custom filesystem

2011-08-24 Thread Pekka Enberg
On Wed, Aug 24, 2011 at 9:51 PM, Avi Kivity  wrote:
> On 08/24/2011 08:41 PM, Sasha Levin wrote:
>>
>> On Wed, 2011-08-24 at 20:30 +0300, Avi Kivity wrote:
>> >  On 08/24/2011 07:19 PM, Sasha Levin wrote:
>> >  >  This patch adds the following improvements:
>> >  >
>> >  >    * Automatically start dhcpcd. Since we provide usermode networking
>> >  >  we should make it fully transparent to the user.
>> >
>> >  On my hosts, I have dhclient instead of dhcpcd.
>> >
>>
>> I was wondering if we should bring our own tiny dhcp client instead of
>> assuming the host has one.
>>
>> Would it be better than assuming the host has it and then trying to
>> figure out which one?
>>
>
> You don't really need a dhcp client, since you already have a communication
> channel - the kernel command line.  Read the IP address and other info from
> there, and poke it into the interface.
>
> There is also the ip= kernel parameter, but I don't know if it works with a
> modular network driver.
>
> I suggest something like "kvmtool.nic.$macaddr=$ip/$netmask
> kvmtool.defaultroute=$gateway" - this is interface name agnostic.

We had "ip=dhcp" enabled for a while:

https://github.com/penberg/linux-kvm/commit/f0aec23a91368e916a53e6072f2173bb481b1544

Unfortunately the option makes nfsroot override the 9p rootfs. I guess
we could just fix that.

 Pekka


Re: [PATCH] kvm tools: Improve init within a custom filesystem

2011-08-24 Thread Avi Kivity

On 08/24/2011 08:41 PM, Sasha Levin wrote:

On Wed, 2011-08-24 at 20:30 +0300, Avi Kivity wrote:
>  On 08/24/2011 07:19 PM, Sasha Levin wrote:
>  >  This patch adds the following improvements:
>  >
>  >* Automatically start dhcpcd. Since we provide usermode networking
>  >  we should make it fully transparent to the user.
>
>  On my hosts, I have dhclient instead of dhcpcd.
>

I was wondering if we should bring our own tiny dhcp client instead of
assuming the host has one.

Would it be better than assuming the host has it and then trying to
figure out which one?



You don't really need a dhcp client, since you already have a 
communication channel - the kernel command line.  Read the IP address 
and other info from there, and poke it into the interface.


There is also the ip= kernel parameter, but I don't know if it works 
with a modular network driver.


I suggest something like "kvmtool.nic.$macaddr=$ip/$netmask 
kvmtool.defaultroute=$gateway" - this is interface name agnostic.
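
For illustration, the guest init could pick that up from /proc/cmdline
with something like the sketch below (the "kvmtool.nic." name is just
the proposal above, not an existing interface):

	#include <stdio.h>
	#include <string.h>

	/* expects mac[18], ip[16], mask[3]; returns 0 on success */
	static int parse_nic_param(char *mac, char *ip, char *mask)
	{
		char cmdline[4096], *tok;
		FILE *f = fopen("/proc/cmdline", "r");

		if (!f)
			return -1;
		if (!fgets(cmdline, sizeof(cmdline), f)) {
			fclose(f);
			return -1;
		}
		fclose(f);

		/* e.g. kvmtool.nic.52:54:00:12:34:56=192.168.33.15/24 */
		for (tok = strtok(cmdline, " \n"); tok; tok = strtok(NULL, " \n"))
			if (sscanf(tok, "kvmtool.nic.%17[^=]=%15[^/]/%2s",
				   mac, ip, mask) == 3)
				return 0;
		return -1;
	}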


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH] kvm tools: Improve init within a custom filesystem

2011-08-24 Thread Pekka Enberg
On Wed, Aug 24, 2011 at 8:41 PM, Sasha Levin  wrote:
>> On my hosts, I have dhclient instead of dhcpcd.
>
> I was wondering if we should bring our own tiny dhcp client instead of
> assuming the host has one.
>
> Would it be better than assuming the host has it and then trying to
> figure out which one?

That'd be awesome if we can keep it small and clean.


Re: [PATCH] kvm tools: Improve init within a custom filesystem

2011-08-24 Thread Sasha Levin
On Wed, 2011-08-24 at 20:30 +0300, Avi Kivity wrote:
> On 08/24/2011 07:19 PM, Sasha Levin wrote:
> > This patch adds the following improvements:
> >
> >   * Automatically start dhcpcd. Since we provide usermode networking
> > we should make it fully transparent to the user.
> 
> On my hosts, I have dhclient instead of dhcpcd.
> 

I was wondering if we should bring our own tiny dhcp client instead of
assuming the host has one.

Would it be better than assuming the host has it and then trying to
figure out which one?

> >
> > +   puts("Running dhcpcd...");
> > +
> > +   system("dhcpcd -z eth* -A");
> > +
> > puts("Starting '/bin/sh'...");
> >
> 
> Better not  depend on interface names, instead get the interface names 
> from the kernel.
> 

Will do.

-- 

Sasha.



Re: [Qemu-devel] [PATCH] KVM: Add wrapper script around Qemu to test kernels

2011-08-24 Thread Blue Swirl
On Tue, Aug 23, 2011 at 10:16 PM, Alexander Graf  wrote:
> On LinuxCon I had a nice chat with Linus on what he thinks kvm-tool
> would be doing and what he expects from it. Basically he wants a
> small and simple tool he and other developers can run to try out and
> see if the kernel they just built actually works.
>
> Fortunately, Qemu can do that today already! The only piece that was
> missing was the "simple" piece of the equation, so here is a script
> that wraps around Qemu and executes a kernel you just built.
>
> If you do have KVM around and are not cross-compiling, it will use
> KVM. But if you don't, you can still fall back to emulation mode and
> at least check if your kernel still does what you expect. I only
> implemented support for s390x and ppc there, but it's easily extensible
> to more platforms, as Qemu can emulate (and virtualize) pretty much
> any platform out there.
>
> If you don't have qemu installed, please install it before using this script. Your
> distro should provide a package for it (might even call it "kvm"). If not,
> just compile it from source - it's not hard!
>
> To quickly get going, just execute the following as user:
>
>    $ ./Documentation/run-qemu.sh -r / -a init=/bin/bash
>
> This will drop you into a shell on your rootfs.
>
> Happy hacking!
>
> Signed-off-by: Alexander Graf 
> ---
>  Documentation/run-qemu.sh |  284 
> +
>  1 files changed, 284 insertions(+), 0 deletions(-)
>  create mode 100755 Documentation/run-qemu.sh
>
> diff --git a/Documentation/run-qemu.sh b/Documentation/run-qemu.sh
> new file mode 100755
> index 000..0bac924
> --- /dev/null
> +++ b/Documentation/run-qemu.sh
> @@ -0,0 +1,284 @@
> +#!/bin/bash
> +#
> +# QEMU Launcher
> +#
> +# This script enables simple use of the KVM and Qemu tool stack for

QEMU

> +# easy kernel testing. It allows you to pass either a host directory to
> +# the guest or a disk image. Example usage:
> +#
> +# Run the host root fs inside a VM:
> +#
> +# $ ./Documentation/run-qemu.sh -r /
> +#
> +# Run the same with SDL:
> +#
> +# $ ./Documentation/run-qemu.sh -r / --sdl
> +#
> +# Or with a PPC build:
> +#
> +# $ ARCH=ppc ./Documentation/run-qemu.sh -r /
> +#
> +#
> +
> +USE_SDL=
> +USE_VNC=
> +KERNEL_BIN=arch/x86/boot/bzImage
> +MON_STDIO=
> +KERNEL_APPEND2=
> +SERIAL=ttyS0
> +SERIAL_KCONFIG=SERIAL_8250
> +
> +function usage() {
> +       echo "
> +Run-Qemu allows you to execute a virtual machine with the Linux kernel

run-qemu.sh or $0

> +that you just built. To only execute a simple VM, you can just run it
> +on your root fs with \"-r / -a init=/bin/bash\"
> +
> +       -a, --append parameters
> +               Append the given parameters to the kernel command line
> +
> +       -d, --disk image
> +               Add the image file as disk into the VM
> +
> +       -r, --root directory
> +               Use the specified directory as root directory inside the 
> guest.
> +
> +       -s, --sdl
> +               Enable SDL graphical output.
> +
> +       -S, --smp cpus
> +               Set number of virtual CPUs
> +
> +       -v, --vnc
> +               Enable VNC graphical output.
> +
> +Examples:
> +
> +       Run the host root fs inside a VM:
> +       $ ./Documentation/run-qemu.sh -r /
> +
> +       Run the same with SDL:
> +       $ ./Documentation/run-qemu.sh -r / --sdl
> +
> +       Or with a PPC build:
> +       $ ARCH=ppc ./Documentation/run-qemu.sh -r /
> +"
> +}
> +
> +function require_config() {
> +       if [ "$(grep CONFIG_$1=y .config)" ]; then
> +               return
> +       fi
> +
> +       echo "You need to enable CONFIG_$1 for run-qemu to work properly"
> +       exit 1
> +}
> +
> +function has_config() {
> +       grep "CONFIG_$1=y" .config
> +}
> +
> +function drive_if() {
> +       if [ "$(has_config VIRTIO_BLK)" ]; then
> +               echo virtio
> +       elif [ "$(has_config ATA_PIIX)" ]; then
> +               echo ide
> +       else
> +               echo "\
> +Your kernel must have either VIRTIO_BLK or ATA_PIIX
> +enabled for block device assignment" >&2
> +               exit 1
> +       fi
> +}
> +
> +GETOPT=`getopt -o a:d:hr:sS:v --long append,disk:,help,root:,sdl,smp:,vnc \
> +       -n "$(basename \"$0\")" -- "$@"`
> +
> +if [ $? != 0 ]; then
> +       echo "Terminating..." >&2
> +       exit 1
> +fi
> +
> +eval set -- "$GETOPT"
> +
> +while true; do
> +       case "$1" in
> +       -a|--append)
> +               KERNEL_APPEND2="$2"
> +               shift 2
> +               ;;
> +       -d|--disk)
> +               QEMU_OPTIONS="$QEMU_OPTIONS -drive \
> +                       file=$2,if=$(drive_if),cache=unsafe"
> +               USE_DISK=1
> +               shift 2
> +               ;;
> +       -h|--help)
> +               usage
> +               exit 0
> +               ;;
> +       -r|--root)
> +               ROOTFS="$2"
> +               shift 2
> +               ;;
> +       -s|--sdl)
> +               USE_SDL=1
> +               sh

Re: [PATCH] kvm tools: Improve init within a custom filesystem

2011-08-24 Thread Avi Kivity

On 08/24/2011 07:19 PM, Sasha Levin wrote:

This patch adds the following improvements:

  * Automatically start dhcpcd. Since we provide usermode networking
we should make it fully transparent to the user.


On my hosts, I have dhclient instead of dhcpcd.



+   puts("Running dhcpcd...");
+
+   system("dhcpcd -z eth* -A");
+
puts("Starting '/bin/sh'...");



Better not depend on interface names; instead get the interface names
from the kernel.
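
E.g. by walking /sys/class/net instead of hard-coding "eth*"; a minimal
sketch (error handling kept to a minimum):

	#include <dirent.h>
	#include <string.h>

	static void for_each_netdev(void (*fn)(const char *name))
	{
		struct dirent *de;
		DIR *dir = opendir("/sys/class/net");

		if (!dir)
			return;
		while ((de = readdir(dir)) != NULL) {
			if (de->d_name[0] == '.' || !strcmp(de->d_name, "lo"))
				continue;	/* skip ".", ".." and loopback */
			fn(de->d_name);
		}
		closedir(dir);
	}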


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [Qemu-devel] [RFC PATCH v5 0/4] Separate thread for VM migration

2011-08-24 Thread Anthony Liguori

On 08/23/2011 10:12 PM, Umesh Deshpande wrote:

The following patch series deals with VCPU and iothread starvation during the
migration of a guest. Currently the iothread is responsible for performing the
guest migration. It holds qemu_mutex during the migration, which prevents VCPUs
from entering qemu mode and delays their return to the guest. The guest
migration, executed as an iohandler, also delays the execution of other
iohandlers.
In the following patch series,


Can you please include detailed performance data with and without this 
series?


Perhaps runs of migration with jitterd running in the guest.

Regards,

Anthony Liguori



The migration has been moved to a separate thread to
reduce the qemu_mutex contention and iohandler starvation.

Umesh Deshpande (4):
   MRU ram block list
   migration thread mutex
   separate migration bitmap
   separate migration thread

  arch_init.c |   38 
  buffered_file.c |   75 +--
  cpu-all.h   |   42 +
  exec.c  |   97 ++--
  migration.c |  122 +-
  migration.h |9 
  qemu-common.h   |2 +
  qemu-thread-posix.c |   10 
  qemu-thread.h   |1 +
  savevm.c|5 --
  10 files changed, 297 insertions(+), 104 deletions(-)





[PATCH] kvm tools: Improve init within a custom filesystem

2011-08-24 Thread Sasha Levin
This patch adds the following improvements:

 * Automatically start dhcpcd. Since we provide usermode networking
we should make it fully transparent to the user.

 * Mount more kernel filesystems such as debugfs and shm.

Signed-off-by: Sasha Levin 
---
 tools/kvm/builtin-setup.c |4 
 tools/kvm/guest/init.c|7 +++
 2 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/tools/kvm/builtin-setup.c b/tools/kvm/builtin-setup.c
index 3c6ad48..347398d 100644
--- a/tools/kvm/builtin-setup.c
+++ b/tools/kvm/builtin-setup.c
@@ -100,12 +100,16 @@ error_close_in:
 
 static const char *guestfs_dirs[] = {
"/dev",
+   "/dev/pts",
+   "/dev/shm",
"/etc",
"/home",
"/host",
"/proc",
"/root",
"/sys",
+   "/sys/kernel",
+   "/sys/kernel/debug",
"/var",
"/var/lib",
"/virt",
diff --git a/tools/kvm/guest/init.c b/tools/kvm/guest/init.c
index 837acfb..caa671d 100644
--- a/tools/kvm/guest/init.c
+++ b/tools/kvm/guest/init.c
@@ -22,6 +22,9 @@ static void do_mounts(void)
mount("", "/sys", "sysfs", 0, NULL);
mount("proc", "/proc", "proc", 0, NULL);
mount("devtmpfs", "/dev", "devtmpfs", 0, NULL);
+   mount("debugfs", "/sys/kernel/debug", "debugfs", 0, NULL);
+   mount("shm", "/dev/shm", "tmpfs", 0, NULL);
+   mount("devpts", "/dev/pts", "devpts", 0, NULL);
 }
 
 int main(int argc, char *argv[])
@@ -30,6 +33,10 @@ int main(int argc, char *argv[])
 
do_mounts();
 
+   puts("Running dhcpcd...");
+
+   system("dhcpcd -z eth* -A");
+
puts("Starting '/bin/sh'...");
 
run_process("/bin/sh");
-- 
1.7.6



Re: vfio/dev-assignment: potential pci_block_user_cfg_access nesting

2011-08-24 Thread Alex Williamson
On Wed, 2011-08-24 at 11:09 +0200, Jan Kiszka wrote:
> On 2011-08-24 00:05, Alex Williamson wrote:
> > On Tue, 2011-08-23 at 15:31 +0200, Jan Kiszka wrote:
> >> Hi Alex,
> >>
> >> just ran into some corner case with my reanimated IRQ sharing patches
> >> that may affect vfio as well:
> >>
> >> How are vfio_enable/disable_intx synchronized against all other possible
> >> spots that call pci_block_user_cfg_access?
> >>
> >> I hit the recursion bug check in pci_block_user_cfg_access with my code
> >> which takes the user_cfg lock like vfio does. It likely races with
> >> pci_reset_function here - and should do so in vfio as well.
> > 
> > So the race is that we're doing a pci_reset_function and while we've got
> > pci_block_user_cfg_access set, an interrupt comes in (maybe from a
> > device sharing the interrupt line), and we hit the BUG_ON when trying to
> > nest pci_block_user_cfg_access?
> 
> Most probably the scenario I was seeing, but I didn't debug it in all
> detail as it already locked up my notebook twice.
> 
> > 
> >> Just taking some lock would mean having to run pci_reset_function with
> >> IRQs disabled to synchronize with the IRQ handler (not sure if that is
> >> possible at all). Alternatively, we would have to disable the interrupt
> >> line or deregister the IRQ while resetting. Or we perform INTx mask
> >> manipulation in an unsynchronized fashion, resolving races with user
> >> space differently (still need to think about this option).
> >>
> >> Any other thoughts?
> > 
> > I think this is a bit easier for vfio since the reset is already routed
> > through a vfio ioctl.  We can just use a mutex between the two, reset
> > would wait on the mutex while the interrupt handler would skip masking
> > of a shared interrupt if it can't get the mutex (hopefully the interrupt
> > is really for a shared device or we squelch it via the reset before we
> > trigger the spurious interrupt counter).
> > 
> > I think the only path for kvm assignment that doesn't involve also
> > rerouting the reset through a kvm ioctl would have to be avoiding the
> > problem in userspace.  We'd have to unregister the interrupt handler,
> > reset, then re-register.  That sounds pretty heavy, but the reset is
> > already a slow process.  Thanks,
> 
> I don't think we can allow userspace to potentially crash the host.
> 
> Anyway, this problem turns out to be way more generic. Just run two
> "echo 1 > /sys/bus/pci/.../reset" loops on the same device in parallel.
> But be warned, you will have to reboot that box afterward.
> 
> Maybe this very creative interface of pci_block_user_cfg_access was once
> OK when only the IPR SCSI driver used it. But by the time it grew beyond
> that use case, it became hopelessly broken (well, open-coded
> locking...). We need to redesign it, synchronizing users that can sleep
> via a simple mutex and addressing access to the status/command word
> separately via an IRQ-save spinlock (as we need that service in hard IRQ
> handlers).

Yep, that sounds like the best path.  pci_block_user_cfg_access is at
best "fragile" in it's current implementation.  Thanks,

Alex




Re: kvm PCI assignment & VFIO ramblings

2011-08-24 Thread Alex Williamson
On Wed, 2011-08-24 at 10:52 +0200, Roedel, Joerg wrote:
> On Tue, Aug 23, 2011 at 01:08:29PM -0400, Alex Williamson wrote:
> > On Tue, 2011-08-23 at 15:14 +0200, Roedel, Joerg wrote:
> 
> > > Handling it through fds is a good idea. This makes sure that everything
> > > belongs to one process. I am not really sure yet if we go the way to
> > > just bind plain groups together or if we create meta-groups. The
> > > meta-groups thing seems somewhat cleaner, though.
> > 
> > I'm leaning towards binding because we need to make it dynamic, but I
> > don't really have a good picture of the lifecycle of a meta-group.
> 
> In my view the life-cycle of the meta-group is a subrange of the
> qemu-instance's life-cycle.

I guess I mean the lifecycle of a super-group that's actually exposed as
a new group in sysfs.  Who creates it?  How?  How are groups dynamically
added and removed from the super-group?  The group merging makes sense
to me because it's largely just an optimization that qemu will try to
merge groups.  If it works, great.  If not, it manages them separately.
When all the devices from a group are unplugged, unmerge the group if
necessary.

> > > Putting the process to sleep (which would be uninterruptible) seems bad.
> > > The process would sleep until the guest releases the device-group, which
> > > can take days or months.
> > > The best thing (and the most intrusive :-) ) is to change PCI core to
> > > allow unbindings to fail, I think. But this probably further complicates
> > > the way to upstream VFIO...
> > 
> > Yes, it's not ideal but I think it's sufficient for now and if we later
> > get support for returning an error from release, we can set a timeout
> > after notifying the user to make use of that.  Thanks,
> 
> Ben had the idea of just forcing a hard unplug of this device from the
> guest. That's probably the best way to deal with that, I think. VFIO
> sends a notification to qemu that the device is gone and qemu informs
> the guest in some way about it.

We need to try the polite method of attempting to hot unplug the device
from qemu first, which the current vfio code already implements.  We can
then escalate if it doesn't respond.  The current code calls abort in
qemu if the guest doesn't respond, but I agree we should also be
enforcing this at the kernel interface.  I think the problem with the
hard-unplug is that we don't have a good revoke mechanism for the mmio
mmaps.  Thanks,

Alex



Re: [AUTOTEST][KVM][PATCH] Add test for testing of killing guest when network is under usage.

2011-08-24 Thread Lukáš Doktor

Hi Jiří,

Do you have any further plans for this test? I'm not convinced that
netperf is necessary just as a stress generator. You could use netcat or a
simple python UDP send/recv (flood attack ;-) ).
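
For example, a trivial UDP flooder is already enough as load
(illustrative only; the target address and port are made up):

	#include <arpa/inet.h>
	#include <netinet/in.h>
	#include <string.h>
	#include <sys/socket.h>

	int main(void)
	{
		struct sockaddr_in dst = { .sin_family = AF_INET,
					   .sin_port = htons(12345) };
		char buf[1400];
		int s = socket(AF_INET, SOCK_DGRAM, 0);

		if (s < 0)
			return 1;
		inet_pton(AF_INET, "192.168.122.10", &dst.sin_addr);
		memset(buf, 'x', sizeof(buf));
		for (;;)	/* flood until killed */
			sendto(s, buf, sizeof(buf), 0,
			       (struct sockaddr *)&dst, sizeof(dst));
	}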


On 17.8.2011 16:17, Jiří Župka wrote:

This patch contains two tests.
1) Try to kill the guest while the guest network is under load.
2) Try to kill the guest after repeatedly adding and removing network drivers.

Signed-off-by: Jiří Župka
---
  client/tests/kvm/tests_base.cfg.sample|   23 +
  client/virt/tests/netstress_kill_guest.py |  146 +
  2 files changed, 169 insertions(+), 0 deletions(-)
  create mode 100644 client/virt/tests/netstress_kill_guest.py

diff --git a/client/tests/kvm/tests_base.cfg.sample 
b/client/tests/kvm/tests_base.cfg.sample
index ec1b48d..2c88088 100644
--- a/client/tests/kvm/tests_base.cfg.sample
+++ b/client/tests/kvm/tests_base.cfg.sample
@@ -845,6 +845,29 @@ variants:
  restart_vm = yes
  kill_vm_on_error = yes

+- netstress_kill_guest: install setup unattended_install.cdrom
+only Linux
+type = netstress_kill_guest
+image_snapshot = yes
+nic_mode = tap
+# There should be enough vms to build the topology.
+variants:
+-driver:
+mode = driver
+-load:
+mode = load
+netperf_files = netperf-2.4.5.tar.bz2 wait_before_data.patch
+packet_size = 1500
+setup_cmd = "cd %s&&  tar xvfj netperf-2.4.5.tar.bz2&&  cd netperf-2.4.5&&  patch 
-p0<  ../wait_before_data.patch&&  ./configure&&  make"
+clean_cmd = " while killall -9 netserver; do True test; done;"
+netserver_cmd =  %s/netperf-2.4.5/src/netserver
+netperf_cmd = %s/netperf-2.4.5/src/netperf -t %s -H %s -l 60 
-- -m %s
+variants:
+- vhost:
+netdev_extra_params = "vhost=on"



You might add a "modprobe vhost-net" command, as vhost-net might not be
loaded by default.



+- vhost-no:
+netdev_extra_params = ""
+
  - set_link: install setup image_copy unattended_install.cdrom
  type = set_link
  test_timeout = 1000
diff --git a/client/virt/tests/netstress_kill_guest.py 
b/client/virt/tests/netstress_kill_guest.py
new file mode 100644
index 000..7daec95
--- /dev/null
+++ b/client/virt/tests/netstress_kill_guest.py
@@ -0,0 +1,146 @@
+import logging, os, signal, re, time
+from autotest_lib.client.common_lib import error
+from autotest_lib.client.bin import utils
+from autotest_lib.client.virt import aexpect, virt_utils
+
+
+def run_netstress_kill_guest(test, params, env):
+"""
+Try to stop the network interface in a VM while another VM tries to communicate.
+
+@param test: kvm test object
+@param params: Dictionary with the test parameters
+@param env: Dictionary with test environment.
+"""
+def get_corespond_ip(ip):
+"""
+Get the local IP address used to contact the given remote IP.
+
+@param ip: Remote IP
+@return: Corresponding local IP.
+"""
+result = utils.run("ip route get %s" % (ip)).stdout
+ip = re.search("src (.+)", result)
+if ip is not None:
+ip = ip.groups()[0]
+return ip
+
+
+def get_ethernet_driver(session):
+"""
+Get driver of network cards.
+
+@param session: session to machine
+"""
+modules = []
+out = session.cmd("ls -l /sys/class/net/*/device/driver/module")
+for module in out.split("\n"):
+modules.append(module.split("/")[-1])
+modules.remove("")
+return set(modules)
+
+
+def kill_and_check(vm):
+vm_pid = vm.get_pid()
+vm.destroy(gracefully=False)
+time.sleep(2)
+try:
+os.kill(vm_pid, 0)
+logging.error("VM is not dead.")
+raise error.TestFail("Problem with killing guest.")
+except OSError:
+logging.info("VM is dead.")
+
+
+def netload_kill_problem(session_serial):


I think you should clean up this function. I believe it would be better and
more readable if you first get all the params/variables, then prepare
the host/guests, and after all of this start the guest. See the comments
below...



+netperf_dir = os.path.join(os.environ['AUTODIR'], "tests/netperf2")
+setup_cmd = params.get("setup_cmd")
+clean_cmd = params.get("clean_cmd")
+
+firewall_flush = "iptables -F"
+session_serial.cmd_output(firewall_flush)
+try:
+utils.run("iptables -F")
You have the firewall_flush command string, why not use it here too? Also,
you should either warn everywhere or not at all... (you log the failure
when flushing the guest but not here.)



+except:
+pass
+
+for i in params.get("netperf_files").split():
+vm.copy_files_to(os.path.join(netperf_dir, i), "/tmp")
+
+t

Re: kvm PCI assignment & VFIO ramblings

2011-08-24 Thread Alex Williamson
On Wed, 2011-08-24 at 10:43 +0200, Joerg Roedel wrote:
> On Tue, Aug 23, 2011 at 03:30:06PM -0400, Alex Williamson wrote:
> > On Tue, 2011-08-23 at 07:01 +1000, Benjamin Herrenschmidt wrote:
> 
> > > Could be tho in what form ? returning sysfs pathes ?
> > 
> > I'm at a loss there, please suggest.  I think we need an ioctl that
> > returns some kind of array of devices within the group and another that
> > maybe takes an index from that array and returns an fd for that device.
> > A sysfs path string might be a reasonable array element, but it sounds
> > like a pain to work with.
> 
> Limiting to PCI we can just pass the BDF as the argument to obtain the
> device-fd. For a more generic solution we need a unique identifier in
> some way which is unique across all 'struct device' instances in the
> system. As far as I know we don't have that yet (besides the sysfs-path)
> so we either add that or stick with bus-specific solutions.
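
(For illustration, the bus-specific variant could be as simple as

	/* ioctl name and string key invented here */
	device_fd = ioctl(group_fd, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");

with the BDF string as the lookup key.)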
> 
> > > 1:1 process has the advantage of linking to an -mm which makes the whole
> > > mmu notifier business doable. How do you want to track down mappings and
> > > do the second level translation in the case of explicit map/unmap (like
> > > on power) if you are not tied to an mm_struct ?
> > 
> > Right, I threw away the mmu notifier code that was originally part of
> > vfio because we can't do anything useful with it yet on x86.  I
> > definitely don't want to prevent it where it makes sense though.  Maybe
> > we just record current->mm on open and restrict subsequent opens to the
> > same.
> 
> Hmm, I think we need io-page-fault support in the iommu-api then.

Yeah, when we can handle iommu page faults, this gets more interesting.

> > > Another aspect I don't see discussed is how we represent these things to
> > > the guest.
> > > 
> > > On Power for example, I have a requirement that a given iommu domain is
> > > represented by a single dma window property in the device-tree. What
> > > that means is that that property needs to be either in the node of the
> > > device itself if there's only one device in the group or in a parent
> > > node (ie a bridge or host bridge) if there are multiple devices.
> > > 
> > > Now I do -not- want to go down the path of simulating P2P bridges,
> > > besides we'll quickly run out of bus numbers if we go there.
> > > 
> > > For us the most simple and logical approach (which is also what pHyp
> > > uses and what Linux handles well) is really to expose a given PCI host
> > > bridge per group to the guest. Believe it or not, it makes things
> > > easier :-)
> > 
> > I'm all for easier.  Why does exposing the bridge use less bus numbers
> > than emulating a bridge?
> > 
> > On x86, I want to maintain that our default assignment is at the device
> > level.  A user should be able to pick single or multiple devices from
> > across several groups and have them all show up as individual,
> > hotpluggable devices on bus 0 in the guest.  Not surprisingly, we've
> > also seen cases where users try to attach a bridge to the guest,
> > assuming they'll get all the devices below the bridge, so I'd be in
> > favor of making this "just work" if possible too, though we may have to
> > prevent hotplug of those.
> 
> A side-note: Might it be better to expose assigned devices in a guest on
> a separate bus? This will make it easier to emulate an IOMMU for the
> guest inside qemu.

I think we want that option, sure.  A lot of guests aren't going to
support hotplugging buses though, so I think our default
map-the-entire-guest model should still be using bus 0.  The ACPI gets
a lot more complicated for that model too; dynamic SSDTs?  Thanks,

Alex



Re: kvm PCI assignment & VFIO ramblings

2011-08-24 Thread Alex Williamson
On Wed, 2011-08-24 at 09:51 +1000, Benjamin Herrenschmidt wrote:
> > > For us the most simple and logical approach (which is also what pHyp
> > > uses and what Linux handles well) is really to expose a given PCI host
> > > bridge per group to the guest. Believe it or not, it makes things
> > > easier :-)
> > 
> > I'm all for easier.  Why does exposing the bridge use less bus numbers
> > than emulating a bridge?
> 
> Because a host bridge doesn't look like a PCI-to-PCI bridge at all for
> us. It's an entirely separate domain with its own bus number space
> (unlike most x86 setups).

Ok, I missed the "host" bridge.

> In fact we have some problems afaik in qemu today with the concept of
> PCI domains, for example, I think qemu has assumptions about a single
> shared IO space domain which isn't true for us (each PCI host bridge
> provides a distinct IO space domain starting at 0). We'll have to fix
> that, but it's not a huge deal.

Yep, I've seen similar on ia64 systems.

> So for each "group" we'd expose in the guest an entire separate PCI
> domain space with its own IO, MMIO etc... spaces, handed off from a
> single device-tree "host bridge" which doesn't itself appear in the
> config space, doesn't need any emulation of any config space etc...
> 
> > On x86, I want to maintain that our default assignment is at the device
> > level.  A user should be able to pick single or multiple devices from
> > across several groups and have them all show up as individual,
> > hotpluggable devices on bus 0 in the guest.  Not surprisingly, we've
> > also seen cases where users try to attach a bridge to the guest,
> > assuming they'll get all the devices below the bridge, so I'd be in
> > favor of making this "just work" if possible too, though we may have to
> > prevent hotplug of those.
> >
> > Given the device requirement on x86 and since everything is a PCI device
> > on x86, I'd like to keep a qemu command line something like -device
> > vfio,host=00:19.0.  I assume that some of the iommu properties, such as
> > dma window size/address, will be query-able through an architecture
> > specific (or general if possible) ioctl on the vfio group fd.  I hope
> > that will help the specification, but I don't fully understand what all
> > remains.  Thanks,
> 
> Well, for iommu there's a couple of different issues here but yes,
> basically on one side we'll have some kind of ioctl to know what segment
> of the device(s) DMA address space is assigned to the group and we'll
> need to represent that to the guest via a device-tree property in some
> kind of "parent" node of all the devices in that group.
> 
> We -might- be able to implement some kind of hotplug of individual
> devices of a group under such a PHB (PCI Host Bridge), I don't know for
> sure yet, some of that PAPR stuff is pretty arcane, but basically, for
> all intents and purposes, we really want a group to be represented as a
> PHB in the guest.
> 
> We cannot arbitrary have individual devices of separate groups be
> represented in the guest as siblings on a single simulated PCI bus.

I think the vfio kernel layer we're describing easily supports both.
This is just a matter of adding qemu-vfio code to expose different
topologies based on group iommu capabilities and mapping mode.  Thanks,

Alex




[RFC/PATCH] kvm tools: Introduce 'kvm setup' command

2011-08-24 Thread Pekka Enberg
This patch implements 'kvm setup' command that can be used to setup a guest
filesystem that shares system libraries and binaries from host filesystem in
read-only mode.

You can setup a new shared rootfs guest with:

  ./kvm setup -n default

and launch it with:

  ./kvm run --9p /,hostfs -p "init=virt/init" -d ~/.kvm-tools/default/

We want to teach 'kvm run' to be able to launch guest filesystems by name in
the future. Furthermore, 'kvm run' should set up a 'default' filesystem and use
it by default unless the user specifies otherwise.

Cc: Asias He 
Cc: Cyrill Gorcunov 
Cc: Ingo Molnar 
Cc: Sasha Levin 
Signed-off-by: Pekka Enberg 
---
 tools/kvm/.gitignore  |1 +
 tools/kvm/Documentation/kvm-setup.txt |   15 +++
 tools/kvm/Makefile|   11 ++-
 tools/kvm/builtin-setup.c |  184 +
 tools/kvm/command-list.txt|1 +
 tools/kvm/guest/init.c|   40 +++
 tools/kvm/include/kvm/builtin-setup.h |7 ++
 tools/kvm/kvm-cmd.c   |2 +
 8 files changed, 259 insertions(+), 2 deletions(-)
 create mode 100644 tools/kvm/Documentation/kvm-setup.txt
 create mode 100644 tools/kvm/builtin-setup.c
 create mode 100644 tools/kvm/guest/init.c
 create mode 100644 tools/kvm/include/kvm/builtin-setup.h

diff --git a/tools/kvm/.gitignore b/tools/kvm/.gitignore
index 852d052..6ace4ec 100644
--- a/tools/kvm/.gitignore
+++ b/tools/kvm/.gitignore
@@ -6,4 +6,5 @@ tags
 include/common-cmds.h
 tests/boot/boot_test.iso
 tests/boot/rootfs/
+guest/init
 KVMTOOLS-VERSION-FILE
diff --git a/tools/kvm/Documentation/kvm-setup.txt 
b/tools/kvm/Documentation/kvm-setup.txt
new file mode 100644
index 000..c845d17
--- /dev/null
+++ b/tools/kvm/Documentation/kvm-setup.txt
@@ -0,0 +1,15 @@
+kvm-setup(1)
+
+
+NAME
+
+kvm-setup - Setup a new virtual machine
+
+SYNOPSIS
+
+[verse]
+'kvm setup '
+
+DESCRIPTION
+---
+The command sets up a virtual machine.
diff --git a/tools/kvm/Makefile b/tools/kvm/Makefile
index 669386f..25cbd7e 100644
--- a/tools/kvm/Makefile
+++ b/tools/kvm/Makefile
@@ -20,6 +20,8 @@ TAGS  := ctags
 
 PROGRAM:= kvm
 
+GUEST_INIT := guest/init
+
 OBJS   += builtin-balloon.o
 OBJS   += builtin-debug.o
 OBJS   += builtin-help.o
@@ -28,6 +30,7 @@ OBJS  += builtin-stat.o
 OBJS   += builtin-pause.o
 OBJS   += builtin-resume.o
 OBJS   += builtin-run.o
+OBJS   += builtin-setup.o
 OBJS   += builtin-stop.o
 OBJS   += builtin-version.o
 OBJS   += cpuid.o
@@ -159,7 +162,7 @@ WARNINGS += -Wunused-result
 
 CFLAGS += $(WARNINGS)
 
-all: $(PROGRAM)
+all: $(PROGRAM) $(GUEST_INIT)
 
 KVMTOOLS-VERSION-FILE:
@$(SHELL_PATH) util/KVMTOOLS-VERSION-GEN $(OUTPUT)
@@ -169,6 +172,10 @@ $(PROGRAM): $(DEPS) $(OBJS)
$(E) "  LINK" $@
$(Q) $(CC) $(OBJS) $(LIBS) -o $@
 
+$(GUEST_INIT): guest/init.c
+   $(E) "  LINK" $@
+   $(Q) $(CC) -static guest/init.c -o $@
+
 $(DEPS):
 
 %.d: %.c
@@ -240,7 +247,7 @@ clean:
$(Q) rm -f bios/bios-rom.h
$(Q) rm -f tests/boot/boot_test.iso
$(Q) rm -rf tests/boot/rootfs/
-   $(Q) rm -f $(DEPS) $(OBJS) $(PROGRAM)
+   $(Q) rm -f $(DEPS) $(OBJS) $(PROGRAM) $(GUEST_INIT)
$(Q) rm -f cscope.*
$(Q) rm -f $(KVM_INCLUDE)/common-cmds.h
$(Q) rm -f KVMTOOLS-VERSION-FILE
diff --git a/tools/kvm/builtin-setup.c b/tools/kvm/builtin-setup.c
new file mode 100644
index 000..4280fe0
--- /dev/null
+++ b/tools/kvm/builtin-setup.c
@@ -0,0 +1,184 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define KVM_PID_FILE_PATH  "/.kvm-tools/"
+#define HOME_DIR   getenv("HOME")
+
+static const char *instance_name;
+
+static const char * const setup_usage[] = {
+   "kvm setup [-n name]",
+   NULL
+};
+
+static const struct option setup_options[] = {
+   OPT_GROUP("General options:"),
+   OPT_STRING('n', "name", &instance_name, "name", "Instance name"),
+   OPT_END()
+};
+
+static void parse_setup_options(int argc, const char **argv)
+{
+   while (argc != 0) {
+   argc = parse_options(argc, argv, setup_options, setup_usage,
+   PARSE_OPT_STOP_AT_NON_OPTION);
+   if (argc != 0)
+   kvm_setup_help();
+   }
+}
+
+void kvm_setup_help(void)
+{
+   usage_with_options(setup_usage, setup_options);
+}
+
+static int copy_file(const char *from, const char *to)
+{
+   int in_fd, out_fd;
+   void *src, *dst;
+   struct stat st;
+   int err = -1;
+
+   in_fd = open(from, O_RDONLY);
+   if (in_fd < 0)
+   return err;
+
+   if (fstat(in_fd, &st) < 0)
+   goto error_close_in;
+
+   out_fd = open(to, O_RDWR | O_CREAT | O_TRUNC, st.st_mode & 
(S_IRWXU|S_IRWXG|S_IRWXO));
+   

Re: A non-responsive guest problem

2011-08-24 Thread Paul
Hi,

Sometimes this problem happened within one day, but sometimes it was very
difficult to reproduce.
Previously the clock source of the guest was kvm-clock. Now I changed
it to tsc, and the problem hasn't occurred so far. Is it related to the
clock source? I find that there are some recent bug fixes for kvm-clock
(e.g.,
http://www.spinics.net/lists/stable-commits/msg11942.html) Anyway, I
will update KVM later.

Thanks,
Paul

On Wed, Aug 24, 2011 at 6:24 PM, Stefan Hajnoczi  wrote:
>
> On Wed, Aug 24, 2011 at 10:02 AM, Xiao Guangrong
>  wrote:
> > On 08/24/2011 04:40 PM, Paul wrote:
> >> Hi,
> >>
> >> I captured the output of pidstat when the problem was reproduced:
> >>
> >> bash-4.1# pidstat -p $PID 8966
> >> Linux 2.6.32-71.el6.x86_64 (test)     07/24/11        _x86_64_        (4 
> >> CPU)
> >>
> >> 16:25:15          PID    %usr %system  %guest    %CPU   CPU  Command
> >> 16:25:15         8966    0.14   55.04  115.41  170.59     1  qemu-kvm
> >>
> >
> > I have tried to reproduce it, but it was failed. I am using the
> > current KVM code. I suggest you to test it by the new code if possible.
>
> Yes, that's a good idea.  The issue might already be fixed.  But if
> this is hard to reproduce then perhaps keep the spinning guest around
> a bit longer so we can poke at it and figure out what is happening.
>
> The pidstat output shows us that it's the guest that is spinning, not
> qemu-kvm userspace.
>
> The system time (time spent in host kernel) is also quite high so
> running kvm_stat might show some interesting KVM events happening.
>
> Stefan


Re: [PATCH v3 7/7] qemu-kvm: Resolve PCI upstream diffs

2011-08-24 Thread Jan Kiszka
On 2011-08-23 19:33, Jan Kiszka wrote:
> Device assignment no longer peeks into config_map, so we can drop all
> the related changes and sync the PCI core with upstream.
> 
> Signed-off-by: Jan Kiszka 
> ---
>  hw/pci.c |   29 +++--
>  hw/pci.h |7 +--
>  2 files changed, 20 insertions(+), 16 deletions(-)
> 
> diff --git a/hw/pci.c b/hw/pci.c
> index 5c87a62..0225888 100644
> --- a/hw/pci.c
> +++ b/hw/pci.c
> @@ -806,7 +806,7 @@ static void pci_config_alloc(PCIDevice *pci_dev)
>  pci_dev->cmask = g_malloc0(config_size);
>  pci_dev->wmask = g_malloc0(config_size);
>  pci_dev->w1cmask = g_malloc0(config_size);
> -pci_dev->config_map = g_malloc0(config_size);
> +pci_dev->used = g_malloc0(config_size);
>  }
>  
>  static void pci_config_free(PCIDevice *pci_dev)
> @@ -815,7 +815,7 @@ static void pci_config_free(PCIDevice *pci_dev)
>  g_free(pci_dev->cmask);
>  g_free(pci_dev->wmask);
>  g_free(pci_dev->w1cmask);
> -g_free(pci_dev->config_map);
> +g_free(pci_dev->used);
>  }
>  
>  /* -1 for devfn means auto assign */
> @@ -846,8 +846,6 @@ static PCIDevice *do_pci_register_device(PCIDevice 
> *pci_dev, PCIBus *bus,
>  pci_dev->irq_state = 0;
>  pci_config_alloc(pci_dev);
>  
> -memset(pci_dev->config_map, 0xff, PCI_CONFIG_HEADER_SIZE);
> -
>  pci_config_set_vendor_id(pci_dev->config, info->vendor_id);
>  pci_config_set_device_id(pci_dev->config, info->device_id);
>  pci_config_set_revision(pci_dev->config, info->revision);
> @@ -1887,7 +1885,7 @@ static int pci_find_space(PCIDevice *pdev, uint8_t size)
>  int offset = PCI_CONFIG_HEADER_SIZE;
>  int i;
>  for (i = PCI_CONFIG_HEADER_SIZE; i < config_size; ++i)
> -if (pdev->config_map[i])
> +if (pdev->used[i])
>  offset = i + 1;
>  else if (i - offset + 1 == size)
>  return offset;
> @@ -2062,13 +2060,13 @@ int pci_add_capability(PCIDevice *pdev, uint8_t 
> cap_id,
>  int i;
>  
>  for (i = offset; i < offset + size; i++) {
> -if (pdev->config_map[i]) {
> +if (pdev->used[i]) {
>  fprintf(stderr, "ERROR: %04x:%02x:%02x.%x "
>  "Attempt to add PCI capability %x at offset "
>  "%x overlaps existing capability %x at offset %x\n",
>  pci_find_domain(pdev->bus), pci_bus_num(pdev->bus),
>  PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn),
> -cap_id, offset, pdev->config_map[i], i);
> +cap_id, offset, pdev->used[i], i);
>  return -EINVAL;

This hunk is actually not equivalent, so the whole patch should be
skipped for now. I'll post a new version once that error detection has
been upstreamed and merged back.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


[Bug 39412] Win Vista and Win2K8 guests' network breaks down

2011-08-24 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=39412


Florian Mickler  changed:

   What|Removed |Added

 Status|CLOSED  |NEEDINFO
 Resolution|CODE_FIX|




--- Comment #7 from Florian Mickler   2011-08-24 12:29:52 
---
As far as I can see, the master branch of kvm.git is based upon v3.0.
In fact none of the 3 commits mentioned above are contained in Linus' tree.

Can you please test Linus' tree (i.e. the mainline kernel, currently at
v3.1-rc3) and verify that it works (or doesn't)?



Re: kvm PCI assignment & VFIO ramblings

2011-08-24 Thread Roedel, Joerg
On Wed, Aug 24, 2011 at 05:33:00AM -0400, David Gibson wrote:
> On Wed, Aug 24, 2011 at 11:14:26AM +0200, Roedel, Joerg wrote:

> > I don't see a reason to make this meta-grouping static. It would harm
> > flexibility on x86. I think it makes things easier on power but there
> > are options on that platform to get the dynamic solution too.
> 
> I think several people are misreading what Ben means by "static".  I
> would prefer to say 'persistent', in that the meta-groups lifetime is
> not tied to an fd, but they can be freely created, altered and removed
> during runtime.

Even if it can be altered at runtime, from a usability perspective it is
certainly best to handle these groups directly in qemu. Or are there
strong reasons to do it somewhere else?

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: A non-responsive guest problem

2011-08-24 Thread Stefan Hajnoczi
On Wed, Aug 24, 2011 at 10:02 AM, Xiao Guangrong
 wrote:
> On 08/24/2011 04:40 PM, Paul wrote:
>> Hi,
>>
>> I captured the output of pidstat when the problem was reproduced:
>>
>> bash-4.1# pidstat -p $PID 8966
>> Linux 2.6.32-71.el6.x86_64 (test)     07/24/11        _x86_64_        (4 CPU)
>>
>> 16:25:15          PID    %usr %system  %guest    %CPU   CPU  Command
>> 16:25:15         8966    0.14   55.04  115.41  170.59     1  qemu-kvm
>>
>
> I have tried to reproduce it, but it was failed. I am using the
> current KVM code. I suggest you to test it by the new code if possible.

Yes, that's a good idea.  The issue might already be fixed.  But if
this is hard to reproduce then perhaps keep the spinning guest around
a bit longer so we can poke at it and figure out what is happening.

The pidstat output shows us that it's the guest that is spinning, not
qemu-kvm userspace.

The system time (time spent in host kernel) is also quite high so
running kvm_stat might show some interesting KVM events happening.

Stefan


Re: about vEOI optimization

2011-08-24 Thread Avi Kivity

On 08/23/2011 11:09 AM, Tian, Kevin wrote:

Hi, Avi,

Both Eddie and Marcelo once suggested a vEOI optimization that skips the
heavy-weight instruction decode, which greatly reduces vEOI overhead:

http://www.mail-archive.com/kvm@vger.kernel.org/msg18619.html
http://www.spinics.net/lists/kvm/msg36691.html

Though virtual x2apic serves a similar purpose, it depends on the guest OS
supporting x2apic. Many Windows versions don't support x2apic, including
Win7 and Windows Server releases before 2008 R2. Given that virtualization
needs to support various OS versions, is there any chance of incorporating
the above vEOI optimization in KVM as an alternative, to boost performance
when the guest doesn't support x2apic?



Yes.  There was a problem with the guest using MOVSD or STOSD to write 
the EOI; if we don't emulate, then registers don't get updated.  I guess 
we can ignore it since no sane guest will use those instructions for EOI.
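For the record, a rough sketch of the fast path (all helper names here
are invented for illustration; this is not the patch from the links
above, just the shape of where the decode would be skipped):

static int fast_eoi_write(struct kvm_vcpu *vcpu, gpa_t gpa)
{
	/* Only short-cut writes that land on the EOI register. */
	if (apic_offset_of(gpa) != APIC_EOI)
		return 0;

	/*
	 * MOVSD/STOSD & co. update guest registers as a side effect,
	 * so they must still take the full decode path -- the caveat
	 * discussed above.
	 */
	if (!insn_is_plain_mov(vcpu))
		return 0;

	apic_fast_eoi(vcpu);	/* ack the highest in-service vector */
	skip_insn(vcpu);	/* advance RIP; no decode needed */
	return 1;		/* handled */
}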


Another option is the Hyper-V EOI support, which can also eliminate the
EOI exit when no additional interrupt is pending.  This can improve EOI 
performance even more.



--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: kvm PCI assignment & VFIO ramblings

2011-08-24 Thread David Gibson
On Wed, Aug 24, 2011 at 11:14:26AM +0200, Roedel, Joerg wrote:
> On Tue, Aug 23, 2011 at 12:54:27PM -0400, aafabbri wrote:
> > On 8/23/11 4:04 AM, "Joerg Roedel"  wrote:
> > > That makes uiommu basically the same as the meta-groups, right?
> > 
> > Yes, functionality seems the same, thus my suggestion to keep uiommu
> > explicit.  Is there some need for group-groups besides defining sets of
> > groups which share IOMMU resources?
> > 
> > I do all this stuff (bringing up sets of devices which may share IOMMU
> > domain) dynamically from C applications.  I don't really want some static
> > (boot-time or sysfs fiddling) supergroup config unless there is a good
> > reason KVM/power needs it.
> > 
> > As you say in your next email, doing it all from ioctls is very easy,
> > programmatically.
> 
> I don't see a reason to make this meta-grouping static. It would harm
> flexibility on x86. I think it makes things easier on power but there
> are options on that platform to get the dynamic solution too.

I think several people are misreading what Ben means by "static".  I
would prefer to say 'persistent', in that the meta-groups' lifetime is
not tied to an fd, but they can be freely created, altered and removed
during runtime.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson


Re: [PATCH] pcnet-pci: fix wrong opaque given to I/O accessors

2011-08-24 Thread Avi Kivity

On 08/22/2011 05:16 PM, Gerhard Wiesinger wrote:

Hello Avi,

Thanks, fixed. OK, maybe some credits :-)

Acked-by: Gerhard Wiesinger 

This pattern is still present at the following places (maybe some further
problems!!!) and I guess it has to be fixed, too:


grep -ir 'ops, s, "' .
./hw/rtl8139.c:memory_region_init_io(&s->bar_io, &rtl8139_io_ops, s, "rtl8139", 0x100);
./hw/rtl8139.c:memory_region_init_io(&s->bar_mem, &rtl8139_mmio_ops, s, "rtl8139", 0x100);


Usually, when you have

  memory_region_init_io(&s->something, ..., s, ...)

it means everything is fine.  Lance/pcnet is special in this regard.
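For reference, a minimal sketch of the fine case -- a hypothetical
device where the opaque handed to memory_region_init_io() is exactly the
state structure the callbacks cast it back to:

typedef struct MyDevState {
    MemoryRegion mmio;
    uint32_t reg;
} MyDevState;

static uint64_t mydev_read(void *opaque, target_phys_addr_t addr,
                           unsigned size)
{
    MyDevState *s = opaque;    /* same pointer passed at init time */
    return s->reg;
}

static void mydev_write(void *opaque, target_phys_addr_t addr,
                        uint64_t data, unsigned size)
{
    MyDevState *s = opaque;
    s->reg = data;
}

static const MemoryRegionOps mydev_ops = {
    .read       = mydev_read,
    .write      = mydev_write,
    .endianness = DEVICE_LITTLE_ENDIAN,
};

static void mydev_init_mmio(MyDevState *s)
{
    /* state == opaque: the pattern the grep above matches on. */
    memory_region_init_io(&s->mmio, &mydev_ops, s, "mydev", 0x100);
}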

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH] KVM: Add wrapper script around Qemu to test kernels

2011-08-24 Thread Jan Kiszka
On 2011-08-24 10:25, Avi Kivity wrote:
> On 08/24/2011 01:16 AM, Alexander Graf wrote:
>> +"
>> +echo "\
>> +Your guest is bound to the current foreground shell. To quit the guest,
>> +please use Ctrl-A x"
>> +echo "  Executing: $QEMU_BIN $QEMU_OPTIONS -append \"$KERNEL_APPEND\" -smp $SMP"
>> +echo
>> +
>> +exec $QEMU_BIN $QEMU_OPTIONS -append "$KERNEL_APPEND -smp $SMP"
> 
> Would be nice to support launching gdb in a separate terminal with
> vmlinux already loaded, and already attached to qemu.

+ loading a python script into gdb to pull in module symbols. There are
a few implementations floating around (including my own).

It would also be nice if one could append QEMU (note the capitalization
BTW) options to the script, maybe everything after a '--' separator.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: kvm PCI assignment & VFIO ramblings

2011-08-24 Thread Roedel, Joerg
On Tue, Aug 23, 2011 at 12:54:27PM -0400, aafabbri wrote:
> On 8/23/11 4:04 AM, "Joerg Roedel"  wrote:
> > That makes uiommu basically the same as the meta-groups, right?
> 
> Yes, functionality seems the same, thus my suggestion to keep uiommu
> explicit.  Is there some need for group-groups besides defining sets of
> groups which share IOMMU resources?
> 
> I do all this stuff (bringing up sets of devices which may share IOMMU
> domain) dynamically from C applications.  I don't really want some static
> (boot-time or sysfs fiddling) supergroup config unless there is a good
> reason KVM/power needs it.
> 
> As you say in your next email, doing it all from ioctls is very easy,
> programmatically.

I don't see a reason to make this meta-grouping static. It would harm
flexibility on x86. I think it makes things easier on power but there
are options on that platform to get the dynamic solution too.

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: kvm PCI assignment & VFIO ramblings

2011-08-24 Thread Joerg Roedel
On Tue, Aug 23, 2011 at 01:33:14PM -0400, Aaron Fabbri wrote:
> On 8/23/11 10:01 AM, "Alex Williamson"  wrote:
> > The iommu domain would probably be allocated when the first device is
> > bound to vfio.  As each device is bound, it gets attached to the group.
> > DMAs are done via an ioctl on the group.
> > 
> > I think group + uiommu leads to effectively reliving most of the
> > problems with the current code.  The only benefit is the group
> > assignment to enforce hardware restrictions.  We still have the problem
> > that uiommu open() = iommu_domain_alloc(), whose properties are
> > meaningless without attached devices (groups).  Which I think leads to
> > the same awkward model of attaching groups to define the domain, then we
> > end up doing mappings via the group to enforce ordering.
> 
> Is there a better way to allow groups to share an IOMMU domain?
> 
> Maybe, instead of having an ioctl to allow a group A to inherit the same
> iommu domain as group B, we could have an ioctl to fully merge two groups
> (could be what Ben was thinking):
> 
> A.ioctl(MERGE_TO_GROUP, B)
> 
> The group A now goes away and its devices join group B.  If A ever had an
> iommu domain assigned (and buffers mapped?) we fail.
> 
> Groups cannot get smaller (they are defined as minimum granularity of an
> IOMMU, initially).  They can get bigger if you want to share IOMMU
> resources, though.
> 
> Any downsides to this approach?

As long as this is a 2-way road it's fine. There must be a way to split
the groups again after the guest exits. But then we are again at the
super-groups (aka meta-groups, aka uiommu) point.
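To make that concrete, a hypothetical userspace flow could look like the
sketch below (the request code and names are invented for illustration,
not an existing ABI):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Invented request code; a matching UNMERGE would be needed to make
 * this the 2-way road described above. */
#define VFIO_GROUP_MERGE	_IOW(';', 100, int)

static int merge_groups(const char *path_a, const char *path_b)
{
	int a = open(path_a, O_RDWR);
	int b = open(path_b, O_RDWR);
	int ret = -1;

	if (a >= 0 && b >= 0)
		/* "A.ioctl(MERGE_TO_GROUP, B)": A's devices join B;
		 * this must fail if A already has an iommu domain
		 * assigned (and buffers mapped). */
		ret = ioctl(a, VFIO_GROUP_MERGE, b);

	if (a >= 0)
		close(a);
	if (b >= 0)
		close(b);
	return ret;
}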

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



[PATCH 2/2] kvm tools: Separate pci layer out of virtio-console

2011-08-24 Thread Sasha Levin
Signed-off-by: Sasha Levin 
---
 tools/kvm/virtio/console.c |  213 +++-
 1 files changed, 71 insertions(+), 142 deletions(-)

diff --git a/tools/kvm/virtio/console.c b/tools/kvm/virtio/console.c
index cb5a15e..035554c 100644
--- a/tools/kvm/virtio/console.c
+++ b/tools/kvm/virtio/console.c
@@ -11,6 +11,7 @@
 #include "kvm/threadpool.h"
 #include "kvm/irq.h"
 #include "kvm/guest_compat.h"
+#include "kvm/virtio-pci.h"
 
 #include 
 #include 
@@ -29,28 +30,13 @@
 #define VIRTIO_CONSOLE_RX_QUEUE0
 #define VIRTIO_CONSOLE_TX_QUEUE1
 
-static struct pci_device_header virtio_console_pci_device = {
-   .vendor_id  = PCI_VENDOR_ID_REDHAT_QUMRANET,
-   .device_id  = PCI_DEVICE_ID_VIRTIO_CONSOLE,
-   .header_type= PCI_HEADER_TYPE_NORMAL,
-   .revision_id= 0,
-   .class  = 0x078000,
-   .subsys_vendor_id   = PCI_SUBSYSTEM_VENDOR_ID_REDHAT_QUMRANET,
-   .subsys_id  = VIRTIO_ID_CONSOLE,
-};
-
 struct con_dev {
pthread_mutex_t mutex;
 
+   struct virtio_pci   vpci;
struct virt_queue   vqs[VIRTIO_CONSOLE_NUM_QUEUES];
-   struct virtio_console_configconsole_config;
-   u32 host_features;
-   u32 guest_features;
-   u16 config_vector;
-   u8  status;
-   u8  isr;
-   u16 queue_selector;
-   u16 base_addr;
+   struct virtio_console_configconfig;
+   u32 features;
int compat_id;
 
struct thread_pool__job jobs[VIRTIO_CONSOLE_NUM_QUEUES];
@@ -59,13 +45,11 @@ struct con_dev {
 static struct con_dev cdev = {
.mutex  = PTHREAD_MUTEX_INITIALIZER,
 
-   .console_config = {
+   .config = {
.cols   = 80,
.rows   = 24,
.max_nr_ports   = 1,
},
-
-   .host_features  = 0,
 };
 
 /*
@@ -87,7 +71,7 @@ static void virtio_console__inject_interrupt_callback(struct kvm *kvm, void *par
head = virt_queue__get_iov(vq, iov, &out, &in, kvm);
len = term_getc_iov(CONSOLE_VIRTIO, iov, in);
virt_queue__set_used_elem(vq, head, len);
-   virt_queue__trigger_irq(vq, virtio_console_pci_device.irq_line, &cdev.isr, kvm);
+   virtio_pci__signal_vq(kvm, &cdev.vpci, vq - cdev.vqs);
}
 
mutex_unlock(&cdev.mutex);
@@ -98,65 +82,6 @@ void virtio_console__inject_interrupt(struct kvm *kvm)
thread_pool__do_job(&cdev.jobs[VIRTIO_CONSOLE_RX_QUEUE]);
 }
 
-static bool virtio_console_pci_io_device_specific_in(void *data, unsigned long offset, int size)
-{
-   u8 *config_space = (u8 *) &cdev.console_config;
-
-   if (size != 1)
-   return false;
-
-   if ((offset - VIRTIO_MSI_CONFIG_VECTOR) > sizeof(struct virtio_console_config))
-   pr_error("config offset is too big: %li", offset - VIRTIO_MSI_CONFIG_VECTOR);
-
-   ioport__write8(data, config_space[offset - VIRTIO_MSI_CONFIG_VECTOR]);
-
-   return true;
-}
-
-static bool virtio_console_pci_io_in(struct ioport *ioport, struct kvm *kvm, u16 port, void *data, int size)
-{
-   unsigned long offset = port - cdev.base_addr;
-   bool ret = true;
-
-   mutex_lock(&cdev.mutex);
-
-   switch (offset) {
-   case VIRTIO_PCI_HOST_FEATURES:
-   ioport__write32(data, cdev.host_features);
-   break;
-   case VIRTIO_PCI_GUEST_FEATURES:
-   ret = false;
-   break;
-   case VIRTIO_PCI_QUEUE_PFN:
-   ioport__write32(data, cdev.vqs[cdev.queue_selector].pfn);
-   break;
-   case VIRTIO_PCI_QUEUE_NUM:
-   ioport__write16(data, VIRTIO_CONSOLE_QUEUE_SIZE);
-   break;
-   case VIRTIO_PCI_QUEUE_SEL:
-   case VIRTIO_PCI_QUEUE_NOTIFY:
-   ret = false;
-   break;
-   case VIRTIO_PCI_STATUS:
-   ioport__write8(data, cdev.status);
-   break;
-   case VIRTIO_PCI_ISR:
-   ioport__write8(data, cdev.isr);
-   kvm__irq_line(kvm, virtio_console_pci_device.irq_line, VIRTIO_IRQ_LOW);
-   cdev.isr = VIRTIO_IRQ_LOW;
-   break;
-   case VIRTIO_MSI_CONFIG_VECTOR:
-   ioport__write16(data, cdev.config_vector);
-   break;
-   default:
-   ret = virtio_console_pci_io_device_specific_in(data, offset, size);
-   };
-
-   mutex_unlock(&cdev.mutex);
-
-   return ret;
-}
-
 static void virtio_console_handle_callback(struct kvm *kvm,

[PATCH 1/2] kvm tools: Move ioeventfd registration to virtio-pci

2011-08-24 Thread Sasha Levin
This patch removes ioeventfd registration from the individual devices and
moves it to a single place in the virtio-pci layer.

Signed-off-by: Sasha Levin 
---
 tools/kvm/include/kvm/virtio-pci.h |6 ++
 tools/kvm/virtio/9p.c  |   21 -
 tools/kvm/virtio/balloon.c |   20 +---
 tools/kvm/virtio/blk.c |   21 -
 tools/kvm/virtio/net.c |   19 ---
 tools/kvm/virtio/pci.c |   33 +
 tools/kvm/virtio/rng.c |   15 ---
 7 files changed, 40 insertions(+), 95 deletions(-)

diff --git a/tools/kvm/include/kvm/virtio-pci.h b/tools/kvm/include/kvm/virtio-pci.h
index 0c2a035..ce44e84 100644
--- a/tools/kvm/include/kvm/virtio-pci.h
+++ b/tools/kvm/include/kvm/virtio-pci.h
@@ -22,6 +22,11 @@ struct virtio_pci_ops {
int (*get_size_vq)(struct kvm *kvm, void *dev, u32 vq);
 };
 
+struct virtio_pci_ioevent_param {
+   struct virtio_pci   *vpci;
+   u32 vq;
+};
+
 struct virtio_pci {
struct pci_device_header pci_hdr;
struct virtio_pci_ops   ops;
@@ -43,6 +48,7 @@ struct virtio_pci {
 
/* virtio queue */
u16 queue_selector;
+   struct virtio_pci_ioevent_param ioeventfds[VIRTIO_PCI_MAX_VQ];
 };
 
 int virtio_pci__init(struct kvm *kvm, struct virtio_pci *vpci, void *dev,
diff --git a/tools/kvm/virtio/9p.c b/tools/kvm/virtio/9p.c
index 1682e64..0dffc7a 100644
--- a/tools/kvm/virtio/9p.c
+++ b/tools/kvm/virtio/9p.c
@@ -2,7 +2,6 @@
 #include "kvm/ioport.h"
 #include "kvm/util.h"
 #include "kvm/threadpool.h"
-#include "kvm/ioeventfd.h"
 #include "kvm/irq.h"
 #include "kvm/virtio-9p.h"
 #include "kvm/guest_compat.h"
@@ -1116,13 +1115,6 @@ static void virtio_p9_do_io(struct kvm *kvm, void *param)
}
 }
 
-static void ioevent_callback(struct kvm *kvm, void *param)
-{
-   struct p9_dev_job *job = param;
-
-   thread_pool__do_job(&job->job_id);
-}
-
 static void set_config(struct kvm *kvm, void *dev, u8 data, u32 offset)
 {
struct p9_dev *p9dev = dev;
@@ -1155,7 +1147,6 @@ static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 pfn)
struct p9_dev_job *job;
struct virt_queue *queue;
void *p;
-   struct ioevent ioevent;
 
compat__remove_message(p9dev->compat_id);
 
@@ -1172,18 +1163,6 @@ static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 pfn)
};
thread_pool__init_job(&job->job_id, kvm, virtio_p9_do_io, job);
 
-   ioevent = (struct ioevent) {
-   .io_addr= p9dev->vpci.base_addr + VIRTIO_PCI_QUEUE_NOTIFY,
-   .io_len = sizeof(u16),
-   .fn = ioevent_callback,
-   .fn_ptr = &p9dev->jobs[vq],
-   .datamatch  = vq,
-   .fn_kvm = kvm,
-   .fd = eventfd(0, 0),
-   };
-
-   ioeventfd__add_event(&ioevent);
-
return 0;
 }
 
diff --git a/tools/kvm/virtio/balloon.c b/tools/kvm/virtio/balloon.c
index 6b93121..0f24539 100644
--- a/tools/kvm/virtio/balloon.c
+++ b/tools/kvm/virtio/balloon.c
@@ -7,7 +7,6 @@
 #include "kvm/kvm.h"
 #include "kvm/pci.h"
 #include "kvm/threadpool.h"
-#include "kvm/ioeventfd.h"
 #include "kvm/guest_compat.h"
 #include "kvm/virtio-pci.h"
 
@@ -21,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define NUM_VIRT_QUEUES3
 #define VIRTIO_BLN_QUEUE_SIZE  128
@@ -125,11 +125,6 @@ static void virtio_bln_do_io(struct kvm *kvm, void *param)
}
 }
 
-static void ioevent_callback(struct kvm *kvm, void *param)
-{
-   thread_pool__do_job(param);
-}
-
 static int virtio_bln__collect_stats(void)
 {
u64 tmp;
@@ -230,7 +225,6 @@ static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 pfn)
struct bln_dev *bdev = dev;
struct virt_queue *queue;
void *p;
-   struct ioevent ioevent;
 
compat__remove_message(bdev->compat_id);
 
@@ -241,18 +235,6 @@ static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 pfn)
thread_pool__init_job(&bdev->jobs[vq], kvm, virtio_bln_do_io, queue);
	vring_init(&queue->vring, VIRTIO_BLN_QUEUE_SIZE, p, VIRTIO_PCI_VRING_ALIGN);
 
-   ioevent = (struct ioevent) {
-   .io_addr= bdev->vpci.base_addr + VIRTIO_PCI_QUEUE_NOTIFY,
-   .io_len = sizeof(u16),
-   .fn = ioevent_callback,
-   .fn_ptr = &bdev->jobs[vq],
-   .datamatch  = vq,
-   .fn_kvm = kvm,
-   .fd = eventfd(0, 0),
-   };
-
-   ioeventfd__add_event(&ioevent);
-
return 0;
 }
 
diff --git a/tools/kvm/virtio/blk.c b/tools/kvm/virtio/blk.c
index 2e047d7..5f312b5 100644
--- a/tools/kvm/virtio/blk.c
+++ b/tools/kvm/virtio/blk.c
@@ -122,13 +122,6 @@ static void virtio_blk_do_io(struct kvm *kvm, struct 

Re: vfio/dev-assignment: potential pci_block_user_cfg_access nesting

2011-08-24 Thread Jan Kiszka
On 2011-08-24 00:05, Alex Williamson wrote:
> On Tue, 2011-08-23 at 15:31 +0200, Jan Kiszka wrote:
>> Hi Alex,
>>
>> just ran into some corner case with my reanimated IRQ sharing patches
>> that may affect vfio as well:
>>
>> How are vfio_enable/disable_intx synchronized against all other possible
>> spots that call pci_block_user_cfg_access?
>>
>> I hit the recursion bug check in pci_block_user_cfg_access with my code
>> which takes the user_cfg lock like vfio does. It likely races with
>> pci_reset_function here - and should do so in vfio as well.
> 
> So the race is that we're doing a pci_reset_function and while we've got
> pci_block_user_cfg_access set, an interrupt comes in (maybe from a
> device sharing the interrupt line), and we hit the BUG_ON when trying to
> nest pci_block_user_cfg_access?

Most probably the scenario I was seeing, but I didn't debug it in all
detail as it already locked up my notebook twice.

> 
>> Just taking some lock would mean having to run pci_reset_function with
>> IRQs disabled to synchronize with the IRQ handler (not sure if that is
>> possible at all). Alternatively, we would have to disable the interrupt
>> line or deregister the IRQ while resetting. Or we perform INTx mask
>> manipulation in an unsynchronized fashion, resolving races with user
>> space differently (still need to think about this option).
>>
>> Any other thoughts?
> 
> I think this is a bit easier for vfio since the reset is already routed
> through a vfio ioctl.  We can just use a mutex between the two, reset
> would wait on the mutex while the interrupt handler would skip masking
> of a shared interrupt if it can't get the mutex (hopefully the interrupt
> is really for a shared device or we squelch it via the reset before we
> trigger the spurious interrupt counter).
> 
> I think the only path for kvm assignment that doesn't involve also
> rerouting the reset through a kvm ioctl would have to be avoiding the
> problem in userspace.  We'd have to unregister the interrupt handler,
> reset, then re-register.  That sounds pretty heavy, but the reset is
> already a slow process.  Thanks,

I don't think we can allow userspace to potentially crash the host.

Anyway, this problem turns out to be way more generic. Just run two
"echo 1 > /sys/bus/pci/.../reset" loops on the same device in parallel.
But be warned, you will have to reboot that box afterward.

Maybe this very creative interface of pci_block_user_cfg_access was once
OK when only the IPR SCSI driver used it. But by the time it grew beyond
that use case, it became hopelessly broken (well, open-coded
locking...). We need to redesign it, synchronizing users that can sleep
via a simple mutex and addressing access to the status/command word
separately via an IRQ-save spinlock (as we need that service in hard IRQ
handlers).
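To illustrate, a rough sketch of that split (structure and names
invented, just the shape of the two-lock scheme, not a worked-out
interface):

/* Sleepable users (pci_reset_function() etc.) serialize on the mutex;
 * the command/status word gets its own IRQ-save spinlock, so INTx
 * masking from a hard IRQ handler no longer has to enter
 * pci_block_user_cfg_access() and trip the recursion BUG_ON. */
struct pci_user_cfg_sync {
	struct mutex	block_mutex;	/* sleepable config blocking */
	spinlock_t	cmd_lock;	/* PCI_COMMAND / PCI_STATUS only */
};

static void mask_intx(struct pci_dev *pdev,
		      struct pci_user_cfg_sync *sync)
{
	unsigned long flags;
	u16 cmd;

	spin_lock_irqsave(&sync->cmd_lock, flags);
	pci_read_config_word(pdev, PCI_COMMAND, &cmd);
	pci_write_config_word(pdev, PCI_COMMAND,
			      cmd | PCI_COMMAND_INTX_DISABLE);
	spin_unlock_irqrestore(&sync->cmd_lock, flags);
}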

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: A non-responsive guest problem

2011-08-24 Thread Xiao Guangrong
On 08/24/2011 04:40 PM, Paul wrote:
> Hi,
> 
> I captured the output of pidstat when the problem was reproduced:
> 
> bash-4.1# pidstat -p $PID 8966
> Linux 2.6.32-71.el6.x86_64 (test)     07/24/11        _x86_64_        (4 CPU)
> 
> 16:25:15          PID    %usr %system  %guest    %CPU   CPU  Command
> 16:25:15         8966    0.14   55.04  115.41  170.59     1  qemu-kvm
> 

I have tried to reproduce it, but failed. I am using the
current KVM code. I suggest you test with the newer code if possible.


Re: kvm PCI assignment & VFIO ramblings

2011-08-24 Thread Roedel, Joerg
On Tue, Aug 23, 2011 at 07:35:37PM -0400, Benjamin Herrenschmidt wrote:
> On Tue, 2011-08-23 at 15:18 +0200, Roedel, Joerg wrote:

> > Hmm, good idea. But as far as I know the hotplug-event needs to be in
> > the guest _before_ the device is actually unplugged (so that the guest
> > can unbind its driver first). That somehow brings back the sleep-idea
> > and the timeout in the .release function.
> 
> That's for normal assisted hotplug, but don't we support hard hotplug ?
> I mean, things like cardbus, thunderbolt (if we ever support that)
> etc... will need it and some platforms do support hard hotplug of PCIe
> devices.
> 
> (That's why drivers should never spin on MMIO waiting for a 1 bit to
> clear without a timeout :-)

Right, that's probably the best semantics for this issue then. The worst
thing that happens is that the admin crashes the guest.

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: kvm PCI assignment & VFIO ramblings

2011-08-24 Thread Roedel, Joerg
On Tue, Aug 23, 2011 at 01:08:29PM -0400, Alex Williamson wrote:
> On Tue, 2011-08-23 at 15:14 +0200, Roedel, Joerg wrote:

> > Handling it through fds is a good idea. This makes sure that everything
> > belongs to one process. I am not really sure yet if we go the way to
> > just bind plain groups together or if we create meta-groups. The
> > meta-groups thing seems somewhat cleaner, though.
> 
> I'm leaning towards binding because we need to make it dynamic, but I
> don't really have a good picture of the lifecycle of a meta-group.

In my view the life-cycle of the meta-group is a subrange of the
qemu-instance's life-cycle.

> > Putting the process to sleep (which would be uninterruptible) seems bad.
> > The process would sleep until the guest releases the device-group, which
> > can take days or months.
> > The best thing (and the most intrusive :-) ) is to change PCI core to
> > allow unbindings to fail, I think. But this probably further complicates
> > the way to upstream VFIO...
> 
> Yes, it's not ideal but I think it's sufficient for now and if we later
> get support for returning an error from release, we can set a timeout
> after notifying the user to make use of that.  Thanks,

Ben had the idea of just forcing a hard-unplug of this device from the
guest. That's probably the best way to deal with it, I think. VFIO
sends a notification to qemu that the device is gone, and qemu informs
the guest about it in some way.

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: kvm PCI assignment & VFIO ramblings

2011-08-24 Thread Joerg Roedel
On Tue, Aug 23, 2011 at 03:30:06PM -0400, Alex Williamson wrote:
> On Tue, 2011-08-23 at 07:01 +1000, Benjamin Herrenschmidt wrote:

> > Could be tho in what form ? returning sysfs pathes ?
> 
> I'm at a loss there, please suggest.  I think we need an ioctl that
> returns some kind of array of devices within the group and another that
> maybe takes an index from that array and returns an fd for that device.
> A sysfs path string might be a reasonable array element, but it sounds
> like a pain to work with.

Limiting to PCI, we can just pass the BDF as the argument to obtain the
device-fd. For a more generic solution we need an identifier which is
unique across all 'struct device' instances in the system. As far as I
know we don't have that yet (besides the sysfs-path), so we either add
that or stick with bus-specific solutions.
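For the PCI-only variant, obtaining the device-fd could be as simple as
the sketch below (ioctl name and number invented for illustration, not
an existing interface):

#include <sys/ioctl.h>

/* Invented ioctl: ask the group fd for a device fd by PCI address. */
#define VFIO_GROUP_GET_DEVICE_FD	_IOW(';', 102, char *)

/* e.g. get_device_fd(group_fd, "0000:06:0d.0") */
static int get_device_fd(int group_fd, const char *bdf)
{
	return ioctl(group_fd, VFIO_GROUP_GET_DEVICE_FD, bdf);
}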

> > 1:1 process has the advantage of linking to an -mm which makes the whole
> > mmu notifier business doable. How do you want to track down mappings and
> > do the second level translation in the case of explicit map/unmap (like
> > on power) if you are not tied to an mm_struct ?
> 
> Right, I threw away the mmu notifier code that was originally part of
> vfio because we can't do anything useful with it yet on x86.  I
> definitely don't want to prevent it where it makes sense though.  Maybe
> we just record current->mm on open and restrict subsequent opens to the
> same.

Hmm, I think we need io-page-fault support in the iommu-api then.
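Roughly along these lines (sketch only; the callback shape is invented,
not an existing iommu-api interface):

/* The iommu driver reports a fault on (domain, iova); the handler --
 * vfio in this case -- could pin and map the missing page and return
 * 0 to request a retry of the faulting transaction. */
typedef int (*iommu_fault_handler_t)(struct iommu_domain *domain,
				     struct device *dev,
				     unsigned long iova, int flags);

void iommu_set_fault_handler(struct iommu_domain *domain,
			     iommu_fault_handler_t handler);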

> > Another aspect I don't see discussed is how we represent these things to
> > the guest.
> > 
> > On Power for example, I have a requirement that a given iommu domain is
> > represented by a single dma window property in the device-tree. What
> > that means is that that property needs to be either in the node of the
> > device itself if there's only one device in the group or in a parent
> > node (ie a bridge or host bridge) if there are multiple devices.
> > 
> > Now I do -not- want to go down the path of simulating P2P bridges,
> > besides we'll quickly run out of bus numbers if we go there.
> > 
> > For us the most simple and logical approach (which is also what pHyp
> > uses and what Linux handles well) is really to expose a given PCI host
> > bridge per group to the guest. Believe it or not, it makes things
> > easier :-)
> 
> I'm all for easier.  Why does exposing the bridge use less bus numbers
> than emulating a bridge?
> 
> On x86, I want to maintain that our default assignment is at the device
> level.  A user should be able to pick single or multiple devices from
> across several groups and have them all show up as individual,
> hotpluggable devices on bus 0 in the guest.  Not surprisingly, we've
> also seen cases where users try to attach a bridge to the guest,
> assuming they'll get all the devices below the bridge, so I'd be in
> favor of making this "just work" if possible too, though we may have to
> prevent hotplug of those.

A side-note: Might it be better to expose assigned devices in a guest on
a separate bus? This will make it easier to emulate an IOMMU for the
guest inside qemu.


Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: A non-responsive guest problem

2011-08-24 Thread Paul
Hi,

I captured the output of pidstat when the problem was reproduced:

bash-4.1# pidstat -p $PID 8966
Linux 2.6.32-71.el6.x86_64 (test)     07/24/11        _x86_64_        (4 CPU)

16:25:15          PID    %usr %system  %guest    %CPU   CPU  Command
16:25:15         8966    0.14   55.04  115.41  170.59     1  qemu-kvm

Thanks,
Paul

On Tue, Aug 23, 2011 at 6:09 PM, Stefan Hajnoczi  wrote:
>
> On Tue, Aug 23, 2011 at 9:10 AM, Paul  wrote:
> > From the trace messages, it seemed there were no interrupts for the guest.
> > I also tried sysrq, but it didn't work. I doubt that kvm-qemu entered
> > some infinite loop.
>
> The fact that a fresh VNC connection to the guest works (but the mouse
> doesn't move) means that qemu-kvm itself is not completely locked up.
> The VNC server runs in a qemu-kvm thread.
>
> So this seems to be a problem inside the guest that causes it to
> consume 100% CPU.
>
> One way to confirm this is to run pidstat(1):
> $ pidstat -p $PID 1
> 11:05:51          PID    %usr %system  %guest    %CPU   CPU  Command
> 11:06:05        26994   65.00    0.00   98.00  163.00     1  kvm
>
> The %guest value is the percentage spent executing guest code.  The
> %usr time is the percentage spent executing qemu-kvm userspace code.
> I'm guessing you will see >80% %guest.
>
> In my example I was running while true; do true; done inside the guest :).
>
> Perhaps Avi can suggest kvm_stat or other techniques to discover what
> exactly this guest is doing.
>
> Stefan


Re: [PATCH] KVM: Add wrapper script around Qemu to test kernels

2011-08-24 Thread Avi Kivity

On 08/24/2011 01:16 AM, Alexander Graf wrote:

At LinuxCon I had a nice chat with Linus about what he thinks kvm-tool
would be doing and what he expects from it. Basically he wants a
small and simple tool he and other developers can run to try out and
see if the kernel they just built actually works.

Fortunately, Qemu can do that today already! The only piece that was
missing was the "simple" piece of the equation, so here is a script
that wraps around Qemu and executes a kernel you just built.

If you do have KVM around and are not cross-compiling, it will use
KVM. But if you don't, you can still fall back to emulation mode and
at least check if your kernel still does what you expect. I only
implemented support for s390x and ppc there, but it's easily extensible
to more platforms, as Qemu can emulate (and virtualize) pretty much
any platform out there.

If you don't have qemu installed, please install it before using this script. Your
distro should provide a package for it (might even call it "kvm"). If not,
just compile it from source - it's not hard!

To quickly get going, just execute the following as user:

 $ ./Documentation/run-qemu.sh -r / -a init=/bin/bash

This will drop you into a shell on your rootfs.

Happy hacking!

+
+function has_config() {
+   grep "CONFIG_$1=y" .config
+}


grep -q ?


+   case "$1" in
+   -a|--append)
+   KERNEL_APPEND2="$2"


Might want to append to KERNEL_APPEND2, so you could have multiple -a args.


+echo "
+   # Linux Qemu launcher #
+
+This script executes your currently built Linux kernel using Qemu. If KVM is
+available, it will also use KVM for fast virtualization of your guest.
+
+The intent is to make it very easy to run your kernel. If you need to do more
+advanced things, such as passing through real devices, please take the command
+line shown below and modify it to your needs. This tool is for simplicity, not
+world dominating functionality coverage.


Device assignment could be useful for driver developers, yes.


+"
+echo "\
+Your guest is bound to the current foreground shell. To quit the guest,
+please use Ctrl-A x"
+echo "  Executing: $QEMU_BIN $QEMU_OPTIONS -append \"$KERNEL_APPEND\" -smp 
$SMP"
+echo
+
+exec $QEMU_BIN $QEMU_OPTIONS -append "$KERNEL_APPEND -smp $SMP"


Would be nice to support launching gdb in a separate terminal with 
vmlinux already loaded, and already attached to qemu.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
