Re: [PATCH 5 of 5] virtio: expose added descriptors immediately

2011-11-23 Thread Michael S. Tsirkin
On Wed, Nov 23, 2011 at 11:49:01AM +1030, Rusty Russell wrote:
 On Tue, 22 Nov 2011 08:29:08 +0200, Michael S. Tsirkin m...@redhat.com 
 wrote:
  On Tue, Nov 22, 2011 at 11:03:04AM +1030, Rusty Russell wrote:
   - /* If you haven't kicked in this long, you're probably doing something
   -  * wrong. */
   - WARN_ON(vq->num_added > vq->vring.num);
   + /* This is very unlikely, but theoretically possible.  Kick
   +  * just in case. */
   + if (unlikely(vq->num_added == 65535))
  
  This is 0xffff but why use the decimal notation?
 
 Interesting.  Why use hex?  Feels more like binary?

Just easier to see it's the largest 16 bit number.

 But I've changed it to (1 << 16) - 1 to be clear.

That's even better.

   + virtqueue_kick(_vq);

 pr_debug("Added buffer head %i to %p\n", head, vq);
 END_USE(vq);
  
  We also still need to reset vq-num_added, right?
 
 virtqueue_kick does that for us.
 
 Cheers,
 Rusty.

Right.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout

2011-11-23 Thread Michael S. Tsirkin
On Wed, Nov 23, 2011 at 01:02:22PM +1030, Rusty Russell wrote:
 +/* Fields in VIRTIO_PCI_CAP_COMMON_CFG: */
 +struct virtio_pci_common_cfg {
 + /* About the whole device. */
 + __u64 device_features;  /* read-only */
 + __u64 guest_features;   /* read-write */

We currently require atomic accesses to common fields.
Some architectures might not have such for 64 bit,
so these need to be split I think ...

 + __u64 queue_address;/* read-write */
 + __u16 msix_config;  /* read-write */
 + __u8 device_status; /* read-write */
 + __u8 unused;
 +
 + /* About a specific virtqueue. */
 + __u16 queue_select; /* read-write */
 + __u16 queue_align;  /* read-write, power of 2. */
 + __u16 queue_size;   /* read-write, power of 2. */
 + __u16 queue_msix_vector;/* read-write */
 +};
  #endif
 
 


Re: About hotplug multifunction

2011-11-23 Thread Amos Kong

On 09/09/11 15:08, Amos Kong wrote:

Hello all,

I'm working on hotplug pci multifunction.

1. qemu cmdline:
./x86_64-softmmu/qemu-system-x86_64 -snapshot -m 2000 \
/home/kvm_autotest_root/images/rhel61-64-virtio.qcow2 -vnc :0 -monitor \
unix:/tmp/a,server,nowait --enable-kvm -net none

2. script to add virtio-blk devices:
for i in `seq 1 7` 0;do
qemu-img create /tmp/resize$i.qcow2 1G -f qcow2
echo drive_add 0x6.$i id=drv$i,if=none,file=/tmp/resize$i.qcow2 | nc -U /tmp/a
echo device_add virtio-blk-pci,id=dev$i,drive=drv$i,addr=0x6.$i,multifunction=on | nc -U /tmp/a
done

3. script to add virtio-nic devices:
for i in `seq 1 7` 0;do
echo netdev_add tap,id=drv$i | nc -U /tmp/a
echo device_add virtio-net-pci,id=dev$i,netdev=drv$i,addr=0x6.$i,multifunction=on | nc -U /tmp/a
done

4. current qemu behaviors
4.1. add func 1~7 one by one, then add func 0
virtio-nic : success, all funcs are added
virtio-blk : success

4.2. add func 0~7 one by one
virtio-nic : failed, only func 0 is added
virtio-blk : success

4.3. removing any single func in monitor
virtio-nic: func 0 is not found in 'lspci', but funcs 1~7 and eth1~eth7 still exist.
virtio-blk: func 0 is not found in 'lspci', but funcs 1~7 still exist. The device
/dev/vda disappears, while vdb,vdc,vde,vdf,vdg,vdh,vdi,vdj still exist. If I re-add
all 8 funcs to the guest, they all work.
   # lspci (00:06.1 ~ 00:06.7 exist, 00:06.0 doesn't exist)
   00:06.1 SCSI storage controller: Red Hat, Inc Virtio block device (rev ff)

QEMU sends an ACPI event to the guest, and the guest then removes all funcs in the slot.
linux-2.6/drivers/pci/hotplug/acpiphp_glue.c:
static int disable_device(struct acpiphp_slot *slot) {
	list_for_each_entry(func, &slot->funcs, sibling) {
	...


Hello all,

I re-tested multifunction hotplug with WinXP, RHEL6 and Win7 guests
(upstream qemu-kvm).


For all the guests, func 0 needs to be added, otherwise no func can be
found in the guest.
RHEL6 needs func 0 to be added last; otherwise the funcs added after
hot-adding func 0 cannot be found in the guest.

For WinXP and Win7, func 0 can be added at any time, and all the funcs can be found in the guest.

This is the test result (hotplug order of funcs vs. funcs found in the
guest):

hotplug order   xp  win7rhel6
--  -
func 0~70~7 0~7 0~7
func 0~50~6 0~6 0~6
func 1~4,0,50~5 0~5 0~4
func 1~4no  no  no

This looks like a 'problem' in the Linux guest. Or could we accept adding
func 0 after the other funcs for Linux guests?


Thanks,
Amos


Questions:
1. why can funcs 1~7 still be found after hot-remove? is it the same on real hardware?
2. why can funcs 1~7 not be added to the guest (when adding func 0~7 one by one)?
3. how about this interface for hotplugging/hot-unplugging a multifunction device:
1) Add funcs 1~7 by monitor, add func 0, then send an ACPI event to notify
the guest
2) Remove func 0, send an ACPI event to the guest. (all funcs can be removed)
4. what does revision 0xff stand for?

Thanks in advance,
Amos




Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout

2011-11-23 Thread Michael S. Tsirkin
On Wed, Nov 23, 2011 at 01:02:22PM +1030, Rusty Russell wrote:
 On Tue, 22 Nov 2011 20:36:22 +0200, Michael S. Tsirkin m...@redhat.com 
 wrote:
  Here's an updated version.
  I'm alternating between updating the spec and the driver,
  spec update to follow.
 
 Don't touch the spec yet, we have a long way to go :(
 
 I want the ability for driver to set the ring size, and the device to
 set the alignment.

Did you mean driver to be able to set the alignment? This
is what BIOS guys want - after BIOS completes, guest driver gets handed
control and sets its own alignment to save memory.

 That's a bigger change than you have here.

Why can't we just add the new registers at the end?
With the new capability, we have as much space as we like for that.

 I imagine it almost rips the driver into two completely different drivers.

If you insist on moving all the rest of registers around, certainly. But
why do this?

 This is the kind of thing I had in mind, for the header.  Want me to
 code up the rest?
 
 (I've avoided adding the constants for the new layout: a struct is more
 compact and more descriptive).
 
 Cheers,
 Rusty.

Renaming constants in exported headers will break userspace builds.
Do we care? Why not?

 diff --git a/include/linux/virtio_pci.h b/include/linux/virtio_pci.h
 --- a/include/linux/virtio_pci.h
 +++ b/include/linux/virtio_pci.h
 @@ -42,56 +42,74 @@
  #include <linux/virtio_config.h>
  
  /* A 32-bit r/o bitmask of the features supported by the host */
 -#define VIRTIO_PCI_HOST_FEATURES 0
 +#define VIRTIO_PCI_LEGACY_HOST_FEATURES  0
  
  /* A 32-bit r/w bitmask of features activated by the guest */
 -#define VIRTIO_PCI_GUEST_FEATURES4
 +#define VIRTIO_PCI_LEGACY_GUEST_FEATURES 4
  
  /* A 32-bit r/w PFN for the currently selected queue */
 -#define VIRTIO_PCI_QUEUE_PFN 8
 +#define VIRTIO_PCI_LEGACY_QUEUE_PFN  8
  
  /* A 16-bit r/o queue size for the currently selected queue */
 -#define VIRTIO_PCI_QUEUE_NUM 12
 +#define VIRTIO_PCI_LEGACY_QUEUE_NUM  12
  
  /* A 16-bit r/w queue selector */
 -#define VIRTIO_PCI_QUEUE_SEL 14
 +#define VIRTIO_PCI_LEGACY_QUEUE_SEL  14
  
  /* A 16-bit r/w queue notifier */
 -#define VIRTIO_PCI_QUEUE_NOTIFY  16
 +#define VIRTIO_PCI_LEGACY_QUEUE_NOTIFY   16
  
  /* An 8-bit device status register.  */
 -#define VIRTIO_PCI_STATUS18
 +#define VIRTIO_PCI_LEGACY_STATUS 18
  
  /* An 8-bit r/o interrupt status register.  Reading the value will return the
   * current contents of the ISR and will also clear it.  This is effectively
   * a read-and-acknowledge. */
 -#define VIRTIO_PCI_ISR   19
 -
 -/* The bit of the ISR which indicates a device configuration change. */
 -#define VIRTIO_PCI_ISR_CONFIG0x2
 +#define VIRTIO_PCI_LEGACY_ISR19
  
  /* MSI-X registers: only enabled if MSI-X is enabled. */
  /* A 16-bit vector for configuration changes. */
 -#define VIRTIO_MSI_CONFIG_VECTOR20
 +#define VIRTIO_MSI_LEGACY_CONFIG_VECTOR20
  /* A 16-bit vector for selected queue notifications. */
 -#define VIRTIO_MSI_QUEUE_VECTOR 22
 -/* Vector value used to disable MSI for queue */
 -#define VIRTIO_MSI_NO_VECTOR		0xffff
 +#define VIRTIO_MSI_LEGACY_QUEUE_VECTOR 22
  
  /* The remaining space is defined by each driver as the per-driver
   * configuration space */
 -#define VIRTIO_PCI_CONFIG(dev)		((dev)->msix_enabled ? 24 : 20)
 +#define VIRTIO_PCI_LEGACY_CONFIG(dev)	((dev)->msix_enabled ? 24 : 20)
 +
 +/* How many bits to shift physical queue address written to QUEUE_PFN.
 + * 12 is historical, and due to x86 page size. */
 +#define VIRTIO_PCI_LEGACY_QUEUE_ADDR_SHIFT   12
 +
 +/* The alignment to use between consumer and producer parts of vring.
 + * x86 pagesize again. */
 +#define VIRTIO_PCI_LEGACY_VRING_ALIGN4096
 +
 +#ifndef __KERNEL__
 +/* Don't break compile of old userspace code.  These will go away. */
 +#define VIRTIO_PCI_HOST_FEATURES VIRTIO_PCI_LEGACY_HOST_FEATURES
 +#define VIRTIO_PCI_GUEST_FEATURES VIRTIO_PCI_LEGACY_GUEST_FEATURES
 +#define VIRTIO_PCI_LEGACY_QUEUE_PFN VIRTIO_PCI_QUEUE_PFN
 +#define VIRTIO_PCI_LEGACY_QUEUE_NUM VIRTIO_PCI_QUEUE_NUM
 +#define VIRTIO_PCI_LEGACY_QUEUE_SEL VIRTIO_PCI_QUEUE_SEL
 +#define VIRTIO_PCI_LEGACY_QUEUE_NOTIFY VIRTIO_PCI_QUEUE_NOTIFY
 +#define VIRTIO_PCI_LEGACY_STATUS VIRTIO_PCI_STATUS
 +#define VIRTIO_PCI_LEGACY_ISR VIRTIO_PCI_ISR
 +#define VIRTIO_MSI_LEGACY_CONFIG_VECTOR VIRTIO_MSI_CONFIG_VECTOR
 +#define VIRTIO_MSI_LEGACY_QUEUE_VECTOR VIRTIO_MSI_QUEUE_VECTOR
 +#define VIRTIO_PCI_LEGACY_CONFIG(dev) VIRTIO_PCI_CONFIG(dev)
 +#define VIRTIO_PCI_LEGACY_QUEUE_ADDR_SHIFT VIRTIO_PCI_QUEUE_ADDR_SHIFT
 +#define VIRTIO_PCI_LEGACY_VRING_ALIGN VIRTIO_PCI_VRING_ALIGN
 +#endif /* ...!KERNEL */
  
  /* Virtio ABI version, this must match exactly */
  #define VIRTIO_PCI_ABI_VERSION	0

Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout

2011-11-23 Thread Sasha Levin
On Wed, 2011-11-23 at 10:49 +0200, Michael S. Tsirkin wrote:
 On Wed, Nov 23, 2011 at 01:02:22PM +1030, Rusty Russell wrote:
  +/* Fields in VIRTIO_PCI_CAP_COMMON_CFG: */
  +struct virtio_pci_common_cfg {
  +   /* About the whole device. */
  +   __u64 device_features;  /* read-only */
  +   __u64 guest_features;   /* read-write */
 
 We currently require atomic accesses to common fields.
 Some architectures might not have such for 64 bit,
 so these need to be split I think ...

We can consider stealing the feature implementation from virtio-mmio:
Use the first 32 bits as a selector and the last as the features
themselves.

It's more complex to work with, but it provides 2**32 * 32 feature bits
(which should be enough for a while) and it solves the atomic access
issue.

  +   __u64 queue_address;/* read-write */
  +   __u16 msix_config;  /* read-write */
  +   __u8 device_status; /* read-write */
  +   __u8 unused;
  +
  +   /* About a specific virtqueue. */
  +   __u16 queue_select; /* read-write */
  +   __u16 queue_align;  /* read-write, power of 2. */
  +   __u16 queue_size;   /* read-write, power of 2. */
  +   __u16 queue_msix_vector;/* read-write */
  +};
   #endif
  
  

-- 

Sasha.



Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout

2011-11-23 Thread Sasha Levin
On Wed, 2011-11-23 at 13:02 +1030, Rusty Russell wrote:
 On Tue, 22 Nov 2011 20:36:22 +0200, Michael S. Tsirkin m...@redhat.com 
 wrote:
  Here's an updated version.
  I'm alternating between updating the spec and the driver,
  spec update to follow.
 
 Don't touch the spec yet, we have a long way to go :(
 
 I want the ability for driver to set the ring size, and the device to
 set the alignment.  That's a bigger change than you have here.  I
 imagine it almost rips the driver into two completely different drivers.
 
 This is the kind of thing I had in mind, for the header.  Want me to
 code up the rest?
 
 (I've avoided adding the constants for the new layout: a struct is more
 compact and more descriptive).
 
 Cheers,
 Rusty.
 
 diff --git a/include/linux/virtio_pci.h b/include/linux/virtio_pci.h
 --- a/include/linux/virtio_pci.h
 +++ b/include/linux/virtio_pci.h
 @@ -42,56 +42,74 @@
  #include <linux/virtio_config.h>
  
  /* A 32-bit r/o bitmask of the features supported by the host */
 -#define VIRTIO_PCI_HOST_FEATURES 0
 +#define VIRTIO_PCI_LEGACY_HOST_FEATURES  0
  
  /* A 32-bit r/w bitmask of features activated by the guest */
 -#define VIRTIO_PCI_GUEST_FEATURES4
 +#define VIRTIO_PCI_LEGACY_GUEST_FEATURES 4
  
  /* A 32-bit r/w PFN for the currently selected queue */
 -#define VIRTIO_PCI_QUEUE_PFN 8
 +#define VIRTIO_PCI_LEGACY_QUEUE_PFN  8
  
  /* A 16-bit r/o queue size for the currently selected queue */
 -#define VIRTIO_PCI_QUEUE_NUM 12
 +#define VIRTIO_PCI_LEGACY_QUEUE_NUM  12
  
  /* A 16-bit r/w queue selector */
 -#define VIRTIO_PCI_QUEUE_SEL 14
 +#define VIRTIO_PCI_LEGACY_QUEUE_SEL  14
  
  /* A 16-bit r/w queue notifier */
 -#define VIRTIO_PCI_QUEUE_NOTIFY  16
 +#define VIRTIO_PCI_LEGACY_QUEUE_NOTIFY   16
  
  /* An 8-bit device status register.  */
 -#define VIRTIO_PCI_STATUS18
 +#define VIRTIO_PCI_LEGACY_STATUS 18
  
  /* An 8-bit r/o interrupt status register.  Reading the value will return the
   * current contents of the ISR and will also clear it.  This is effectively
   * a read-and-acknowledge. */
 -#define VIRTIO_PCI_ISR   19
 -
 -/* The bit of the ISR which indicates a device configuration change. */
 -#define VIRTIO_PCI_ISR_CONFIG0x2
 +#define VIRTIO_PCI_LEGACY_ISR19
  
  /* MSI-X registers: only enabled if MSI-X is enabled. */
  /* A 16-bit vector for configuration changes. */
 -#define VIRTIO_MSI_CONFIG_VECTOR20
 +#define VIRTIO_MSI_LEGACY_CONFIG_VECTOR20
  /* A 16-bit vector for selected queue notifications. */
 -#define VIRTIO_MSI_QUEUE_VECTOR 22
 -/* Vector value used to disable MSI for queue */
 -#define VIRTIO_MSI_NO_VECTOR		0xffff
 +#define VIRTIO_MSI_LEGACY_QUEUE_VECTOR 22
  
  /* The remaining space is defined by each driver as the per-driver
   * configuration space */
 -#define VIRTIO_PCI_CONFIG(dev)		((dev)->msix_enabled ? 24 : 20)
 +#define VIRTIO_PCI_LEGACY_CONFIG(dev)	((dev)->msix_enabled ? 24 : 20)
 +
 +/* How many bits to shift physical queue address written to QUEUE_PFN.
 + * 12 is historical, and due to x86 page size. */
 +#define VIRTIO_PCI_LEGACY_QUEUE_ADDR_SHIFT   12
 +
 +/* The alignment to use between consumer and producer parts of vring.
 + * x86 pagesize again. */
 +#define VIRTIO_PCI_LEGACY_VRING_ALIGN4096
 +
 +#ifndef __KERNEL__
 +/* Don't break compile of old userspace code.  These will go away. */

Maybe wrap it in:
#ifndef VIRTIO_PCI_NO_LEGACY

#warning Please stop using old defines

[...]

 +#define VIRTIO_PCI_HOST_FEATURES VIRTIO_PCI_LEGACY_HOST_FEATURES
 +#define VIRTIO_PCI_GUEST_FEATURES VIRTIO_PCI_LEGACY_GUEST_FEATURES
 +#define VIRTIO_PCI_LEGACY_QUEUE_PFN VIRTIO_PCI_QUEUE_PFN
 +#define VIRTIO_PCI_LEGACY_QUEUE_NUM VIRTIO_PCI_QUEUE_NUM
 +#define VIRTIO_PCI_LEGACY_QUEUE_SEL VIRTIO_PCI_QUEUE_SEL
 +#define VIRTIO_PCI_LEGACY_QUEUE_NOTIFY VIRTIO_PCI_QUEUE_NOTIFY
 +#define VIRTIO_PCI_LEGACY_STATUS VIRTIO_PCI_STATUS
 +#define VIRTIO_PCI_LEGACY_ISR VIRTIO_PCI_ISR
 +#define VIRTIO_MSI_LEGACY_CONFIG_VECTOR VIRTIO_MSI_CONFIG_VECTOR
 +#define VIRTIO_MSI_LEGACY_QUEUE_VECTOR VIRTIO_MSI_QUEUE_VECTOR
 +#define VIRTIO_PCI_LEGACY_CONFIG(dev) VIRTIO_PCI_CONFIG(dev)
 +#define VIRTIO_PCI_LEGACY_QUEUE_ADDR_SHIFT VIRTIO_PCI_QUEUE_ADDR_SHIFT
 +#define VIRTIO_PCI_LEGACY_VRING_ALIGN VIRTIO_PCI_VRING_ALIGN

#endif /* VIRTIO_PCI_NO_LEGACY */

 +#endif /* ...!KERNEL */
  
  /* Virtio ABI version, this must match exactly */
  #define VIRTIO_PCI_ABI_VERSION   0
  
 -/* How many bits to shift physical queue address written to QUEUE_PFN.
 - * 12 is historical, and due to x86 page size. */
 -#define VIRTIO_PCI_QUEUE_ADDR_SHIFT  12
 +/* Vector value used to disable MSI for queue */
 +#define VIRTIO_MSI_NO_VECTOR		0xffff
  
 -/* The alignment to use between consumer and producer parts of vring.
 - 

Re: Memory sync algorithm during migration

2011-11-23 Thread Oliver Hookins
On Tue, Nov 22, 2011 at 02:04:17PM +0100, Oliver Hookins wrote:
 On Tue, Nov 22, 2011 at 10:31:58AM +0100, ext Juan Quintela wrote:
  Oliver Hookins oliver.hook...@nokia.com wrote:
   On Tue, Nov 15, 2011 at 11:47:58AM +0100, ext Juan Quintela wrote:
   Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp wrote:
Adding qemu-devel ML to CC.
   
Your question should have been sent to qemu-devel ML because the logic
is implemented in QEMU, not KVM.
   
(2011/11/11 1:35), Oliver Hookins wrote:
Hi,
   
I am performing some benchmarks on KVM migration on two different 
types of VM.
One has 4GB RAM and the other 32GB. More or less idle, the 4GB VM 
takes about 20
seconds to migrate on our hardware while the 32GB VM takes about a 
minute.
   
With a reasonable amount of memory activity going on (in the hundreds 
of MB per
second) the 32GB VM takes 3.5 minutes to migrate, but the 4GB VM never
completes. Intuitively this tells me there is some watermarking of 
dirty pages
going on that is not particularly efficient when the dirty pages 
ratio is high
compared to total memory, but I may be completely incorrect.
   
You can change the ratio IIRC.
Hopefully, someone who knows well about QEMU will tell you better ways.
   
   Takuya
   
   
Could anybody fill me in on what might be going on here? We're using 
libvirt
0.8.2 and kvm-83-224.el5.centos.1
   
   This is pretty old qemu/kvm code base.
   In principle, it makes no sense that with 32GB RAM migration finishes,
   and with 4GB RAM it is unable (intuitively it should be, if ever, the
   other way around).
   
   Do you have an easy test that makes the problem easily reproducible?
    Have you tried upstream qemu.git? (some improvements in that department).
  
   I've just tried the qemu-kvm 0.14.1 tag which seems to be the latest that 
   builds
   on my platform. For some strange reason migrations always seem to fail in 
   one
    direction with "Unknown savevm section or instance 'hpet' 0" messages.
  
  What is your platform?  This seems like you are running with hpet in one
  side, but without it in the other.  What command line are you using?
 
 Yes, my mistake. We were also testing later kernels and my test machines 
 managed
 to get out of sync. One had support for hpet clocksource but the other one
 didn't.
 
  
   This seems to point to different migration protocols on either end but 
   they are
   both running the same version of qemu-kvm I built. Does this ring any 
   bells for
   anyone?
  
  Command line mismatch.  But, what is your platform?
 
 CentOS5.6. Now running the VMs through qemu-kvm 0.14.1, unloaded migrations 
 take
 about half the time but with memory I/O load now both VMs never complete the
 migration. In practical terms I'm writing about 50MB/s into memory and we 
 have a
 10Gbps network (and I've seen real speeds up to 8-9Gbps on the wire) so there
 should be enough capacity to sync up the dirty pages.
 
 So now the 32GB and 4GB VMs have matching behaviour (which makes more sense) 
 but
 I'm not any closer to figuring out what is going on.

Resending as it got rejected the first time:

I have not set the maximum permissible migration time but that default
certainly points to a possibility of a solution. As for the writing semantics
it was a straight dd from disk to /dev/shm so I can't speak for the kernel but
naively I would think it may be contiguous space.


Re: IOAPIC doesn't handle byte writes

2011-11-23 Thread Avi Kivity
On 11/22/2011 05:06 PM, Julian Stecklina wrote:
 Hello,

[please don't drop cc list]

 Avi Kivity wrote:
  Care to post a patch instead?

 Sure. Never hacked KVM, though. Is there a particular reason why the
 void *val argument to ioapic_mmio_read/_write is only dereferenced when
 ioapic->lock is not held?

No particular reason.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH] Allow aligned byte and word writes to IOAPIC registers.

2011-11-23 Thread Avi Kivity
On 11/22/2011 06:09 PM, Julian Stecklina wrote:
 This fixes byte accesses to IOAPIC_REG_SELECT as mandated by at least the
 ICH10 and Intel Series 5 chipset specs. It also makes ioapic_mmio_write
 consistent with ioapic_mmio_read, which also allows byte and word accesses.


Your patch indents with spaces, while Linux uses tabs for indents.

 Signed-off-by: Julian Stecklina j...@alien8.de
 ---
  virt/kvm/ioapic.c |   17 +
  1 files changed, 13 insertions(+), 4 deletions(-)

 diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
 index 3eed61e..e94ef6ba 100644
 --- a/virt/kvm/ioapic.c
 +++ b/virt/kvm/ioapic.c
 @@ -332,9 +332,18 @@ static int ioapic_mmio_write(struct kvm_io_device *this, 
 gpa_t addr, int len,
(void*)addr, len, val);
   ASSERT(!(addr & 0xf));  /* check alignment */
  
 - if (len == 4 || len == 8)
 - data = *(u32 *) val;
 - else {
 +switch (len) {
 +case 8:
 +case 4:
 +data = *(u32 *) val;
 +break;
 +case 2:
 +data = *(u16 *) val;
 +break;
 +case 1:
 +data = *(u8  *) val;
 +break;
 +default:
   printk(KERN_WARNING "ioapic: Unsupported size %d\n", len);
   return 0;
   }
 @@ -343,7 +352,7 @@ static int ioapic_mmio_write(struct kvm_io_device *this, 
 gpa_t addr, int len,
   spin_lock(&ioapic->lock);
   switch (addr) {
   case IOAPIC_REG_SELECT:
  - ioapic->ioregsel = data;
  + ioapic->ioregsel = data & 0xFF; /* 8-bit register */
   break;
  
   case IOAPIC_REG_WINDOW:

This is a bit over-permissive in that it allows 8-byte writes to the
IOWIN register.  I guess it's okay though.

-- 
error compiling committee.c: too many arguments to function



SMP Example using KVM only ?

2011-11-23 Thread Mian M. Hamayun

Hello Everyone,

I am currently using kvm-85 (I know it's pretty old ... :)) and it has an
smptest example in test/x86/.
This example works for 1 CPU but doesn't work for more than one CPU.
This issue has already been discussed on the KVM mailing list.
And I 'think' this example has been removed from the later versions of 
KVM/QEMU-KVM code base.


I have also tested the latest qemu-kvm with multiple processors, it 
works, but the code base is quite large.
What I am interested in is only using KVM for my platform, with smp 
support, and without QEMU.


Is there a simple example for testing KVM with SMP support? Or could
someone outline the changes required to make the smptest example work
in kvm-85?


Many Thanks,
Mian Hamayun






Re: nested virtualization on Intel and needed cpu flags to pass

2011-11-23 Thread Nadav Har'El
On Tue, Nov 22, 2011, Gianluca Cecchi wrote about "nested virtualization on
Intel and needed cpu flags to pass":
 I'm going to test nested virtualization on Intel with Fedora 16 host.
...
 [root@f16 ~]# uname -r
 3.1.1-2.fc16.x86_64

This Linux version indeed has nested VMX, while

 # uname -r
 2.6.40.6-0.fc15.x86_64

doesn't.

 Based on docs, the nested option is disabled by default on Intel

Indeed. You need to load the kvm_intel module with the nested=1
option. You also need to tell qemu that the emulated CPU will advertise
VMX - the simplest way to do this is with -cpu host option to qemu.
It seems you did all of this correctly.

 Preliminary results are not so good.
 I created an F16 guest (f16vm), with default options
..
 Inside the guest I create a windows xp with default values proposed by
..
 Until now not able to complete installation

Unfortunately, this is a known bug - which I promised to work on, but
haven't yet got around to :(
nested-vmx.txt explicitly lists under known limitations that: "The
current code supports running Linux guests under KVM guests."

 more tests to come with newer kernel 3.1.1-2.fc16.x86_64
 But before proceeding, probably I need to adjust particular features
 to enable/disable about
 cpu to pass to f16vm guest...

I don't think this is the problem. The underlying problem is that the
VMX spec is very complex, and ideally nested VMX should correctly emulate
each and every bit and each and every corner of it. Because all our
testing was done with KVM L1s and Linux L2s, all the paths that get
exercised in that case were tested and their bugs ironed out - but it is
possible that Windows L2s end up excercising slightly different code
paths that still have bugs, and those need to be fixed.

Unfortunately, I doubt a newer kernel will solve your problem. I think
there are genuine bugs that will have to be fixed.

 What are current advises about cpu flags to pass to optimally use
 nested virtualization with intel at this time?

I don't think there are any such guidelines. The only thing you really
need is -cpu qemu64,+vmx (replace qemu64 with whatever you want) to
advertise the existence of VMX.

-- 
Nadav Har'El| Wednesday, Nov 23 2011, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |When you lose, don't lose the lesson.
http://nadav.harel.org.il   |


Re: SMP Example using KVM only ?

2011-11-23 Thread Avi Kivity
On 11/23/2011 12:32 PM, Mian M. Hamayun wrote:
 Hello Everyone,

 I am currently using kvm-85 (I know it's pretty old ... :)) and it has
 an smptest example in test/x86/.
 This example works for 1 CPU but doesn't work for more than one CPU.
 This issue has already been discussed on KVM Mailing list.
 And I 'think' this example has been removed from the later versions of
 KVM/QEMU-KVM code base.

 I have also tested the latest qemu-kvm with multiple processors, it
 works, but the code base is quite large.
 What I am interested in is only using KVM for my platform, with smp
 support, and without QEMU.

 Is there a simple example for testing KVM with SMP support ? or if
 someone could outline the changes required to make the smptest
 example work in kvm-85.


You can take a look at kvm-tool,
git://github.com/penberg/linux-kvm.git.  While it's not exactly trivial,
it's a lot simpler than qemu.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH] Allow aligned byte and word writes to IOAPIC registers.

2011-11-23 Thread Nadav Har'El
On Wed, Nov 23, 2011, Avi Kivity wrote about "Re: [PATCH] Allow aligned byte
and word writes to IOAPIC registers.":
 On 11/22/2011 06:09 PM, Julian Stecklina wrote:
  This fixes byte accesses to IOAPIC_REG_SELECT as mandated by at least the
  ICH10 and Intel Series 5 chipset specs. It also makes ioapic_mmio_write
  consistent with ioapic_mmio_read, which also allows byte and word accesses.
 
 
 Your patch indents with spaces, while Linux uses tabs for indents.

You might want to run scripts/checkpatch.pl on your patches before
sending them - this will catch this and many other stylistic faux pas.

-- 
Nadav Har'El| Wednesday, Nov 23 2011, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Why do doctors call what they do
http://nadav.harel.org.il   |practice? Think about it.


Re: [PATCH] Allow aligned byte and word writes to IOAPIC registers.

2011-11-23 Thread Julian Stecklina
On Mi, 2011-11-23 at 12:47 +0200, Avi Kivity wrote:
 On 11/22/2011 06:09 PM, Julian Stecklina wrote:
  This fixes byte accesses to IOAPIC_REG_SELECT as mandated by at least the
  ICH10 and Intel Series 5 chipset specs. It also makes ioapic_mmio_write
  consistent with ioapic_mmio_read, which also allows byte and word accesses.
 
 
 Your patch indents with spaces, while Linux uses tabs for indents.

True. I'll post a revised version in a minute. I think this is my second
patch to Linux in total, so I am slowly getting there... Thanks for the
patience.

Regards, Julian

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 09/14] KVM: PPC: Add generic single register ioctls

2011-11-23 Thread Marcelo Tosatti
On Wed, Nov 16, 2011 at 12:45:45AM +0100, Alexander Graf wrote:
 
 On 10.11.2011, at 18:35, Marcelo Tosatti wrote:
 
  On Thu, Nov 10, 2011 at 05:49:42PM +0100, Alexander Graf wrote:
  Documentation/virtual/kvm/api.txt |   47 
  ++
  arch/powerpc/kvm/powerpc.c|   51 
  +
  include/linux/kvm.h   |   32 +++
  3 files changed, 130 insertions(+), 0 deletions(-)
  I don't see the benefit of this generalization, the current structure 
  where
  context information is hardcoded in the data transmitted works well.
  
  Well, unfortunately it doesn't work quite as well for us because we
  are a much more evolving platform. Also, there are a lot of edges
  and corners of the architecture that simply aren't implemented in
  KVM as of now. I want to have something extensible enough so we
  don't break the ABI along the way.
  
  You still have to agree on format between userspace and kernel, right?
  If either party fails to conform to that, you're doomed.
 
 Yes, but we can shove registers back and forth without allocating 8kb of ram 
 each time. If all we need to do is poke one register, we poke one register. 
 If we poke 10, we poke the 10 we need to touch.
 
  The problem with two interfaces is potential ambiguity: is
  register X implemented through KVM_GET_ONE_REG and also through
  KVM_GET_XYZ_REGISTER_SET ? If its accessible by two interfaces, what is
  the register writeback order? Is there a plan to convert, etc.
 
 Why writeback order? Register modification operations should always happen 
 from the same thread the vCPU would run on at the end of the day, no?

Yes, but there is a specified order which the registers must be written
back, in case there are dependencies between them (the QEMU x86's code
does its best to document these dependencies).

All i'm saying is that two distinct interfaces make it potentially
confusing for the programmer. That said, its up to Avi to decide.

 Avi wanted to go as far as making this a syscall interface even.

Ok, then full conversion work must be done.

  If you agree these concerns are valid, perhaps this interface can be PPC
  specific.
 
 I can always make it PPC specific, but I believe it would make sense as a 
 generic interface for everyone, similar to how ENABLE_CAP can make sense for 
 any arch.
 
 
 Alex


[PATCHv2] KVM: Allow aligned byte and word writes to IOAPIC registers.

2011-11-23 Thread Julian Stecklina
This fixes byte accesses to IOAPIC_REG_SELECT as mandated by at least the
ICH10 and Intel Series 5 chipset specs. It also makes ioapic_mmio_write
consistent with ioapic_mmio_read, which also allows byte and word accesses.

Signed-off-by: Julian Stecklina j...@alien8.de
---
 virt/kvm/ioapic.c |   15 ---
 1 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index 3eed61e..71e2253 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -332,9 +332,18 @@ static int ioapic_mmio_write(struct kvm_io_device *this, 
gpa_t addr, int len,
 (void*)addr, len, val);
ASSERT(!(addr & 0xf));  /* check alignment */
 
-   if (len == 4 || len == 8)
+   switch (len) {
+   case 8:
+   case 4:
data = *(u32 *) val;
-   else {
+   break;
+   case 2:
+   data = *(u16 *) val;
+   break;
+   case 1:
+   data = *(u8  *) val;
+   break;
+   default:
printk(KERN_WARNING "ioapic: Unsupported size %d\n", len);
return 0;
}
@@ -343,7 +352,7 @@ static int ioapic_mmio_write(struct kvm_io_device *this, 
gpa_t addr, int len,
spin_lock(&ioapic->lock);
switch (addr) {
case IOAPIC_REG_SELECT:
-   ioapic->ioregsel = data;
+   ioapic->ioregsel = data & 0xFF; /* 8-bit register */
break;
 
case IOAPIC_REG_WINDOW:
-- 
1.7.7.3



Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding

2011-11-23 Thread Andrea Arcangeli
Hi!

On Mon, Nov 21, 2011 at 07:51:21PM -0600, Anthony Liguori wrote:
 Fundamentally, the entity that should be deciding what memory should be 
 present 
 and where it should located is the kernel.  I'm fundamentally opposed to 
 trying 
 to make QEMU override the scheduler/mm by using cpu or memory pinning in QEMU.
 
  From what I can tell about ms_mbind(), it just uses process knowledge to 
 bind 
 specific areas of memory to a memsched group and let's the kernel decide what 
 to 
 do with that knowledge.  This is exactly the type of interface that QEMU 
 should 
 be using.
 
 QEMU should tell the kernel enough information such that the kernel can make 
 good decisions.  QEMU should not be the one making the decisions.

True, QEMU won't have to decide where the memory and vcpus should be
located (but hey it wouldn't need to decide that even if you use
cpusets, you can use relative mbind with cpusets, the admin or a
cpuset job scheduler could decide) but it's still QEMU making the
decision of what memory and which vcpus threads to
ms_mbind/ms_tbind. Think how you're going to create the input of those
syscalls...

If it wasn't qemu to decide that, qemu wouldn't be required to scan
the whole host physical numa (cpu/memory) topology in order to create
the input arguments of ms_mbind/ms_tbind. And when you migrate the
VM to another host, the whole vtopology may be counter-productive
because the kernel isn't automatically detecting the numa affinity
between threads and the guest vtopology will stick to whatever numa
_physical_ topology that was seen on the first node where the VM was
created.

I doubt that the assumption that all cloud nodes will have the same
physical numa topology is reasonable.

Furthermore to get the same benefits that qemu gets on host by using
ms_mbind/ms_tbind, every single guest application should be modified
to scan the guest vtopology and call ms_mbind/ms_tbind too (or use the
hard bindings which is what we try to avoid).

I think it's unreasonable to expect all applications to use
ms_mbind/ms_tbind in the guest, at best guest apps will use cpusets or
wrappers, few apps will be modified for sys_ms_tbind/mbind.

You can always have the supercomputer case with just one app that is
optimized and a single VM spanning over the whole host, but in that
scenarios hard bindings would work perfectly too.

In my view the trouble of the numa hard bindings is not the fact
they're hard and qemu has to also decide the location (in fact it
doesn't need to decide the location if you use cpusets and relative
mbinds). The bigger problem is the fact either the admin or the app
developer has to explicitly scan the numa physical topology (both cpus
and memory) and tell the kernel how much memory to bind to each
thread. ms_mbind/ms_tbind only partially solve that problem. They're
similar to the mbind MPOL_F_RELATIVE_NODES with cpusets, except you
don't need an admin or a cpuset-job-scheduler (or a perl script) to
redistribute the hardware resources.

Now dealing with bindings isn't big deal for qemu, in fact this API is
pretty much ideal for qemu, but it won't make life substantially
easier than if compared to hard bindings. Simply the management code
that is now done with a perl script will have to be moved in the
kernel. It looks an incremental improvement compared to the relative
mbind+cpuset, but I'm unsure if it's the best we could aim for and
what we really need in virt considering we deal with VM migration too.

The real long term design to me is not to add more syscalls, and
initially handling the case of a process/VM spanning not more than one
node in thread number and amount of memory. That's not too hard and in
fact I have benchmarks for the scheduler already showing it to work
pretty well (it's creating a too strict affinity but it can be relaxed
to be more useful). Then later add some mechanism (simplest is the
page fault at low frequency) to create a
guest_vcpu_thread->host_memory affinity and have a paravirtualized
interface that tells the guest scheduler to group CPUs.

If the guest scheduler runs free and is allowed to move threads
randomly without any paravirtualized interface that controls the CPU
thread migration in the guest scheduler, the thread-memory affinity
on host will be hopeless. But with a paravirtualized interface to make
a guest thread stick to vcpu0/1/2/3 and not going into vcpu4/5/6/7,
will allow to create a more meaningful guest_thread->physical_ram
affinity on host through KVM page faults. And then this will work also
with VM migration and without having to create a vtopology in guest.

And for apps running in guest no paravirt will be needed of course.

The reason paravirt would be needed for qemu-kvm with a full automatic
thread-memory affinity is that the vcpu threads are magic. What runs
in the vcpu thread are guest threads. And those can move through the
guest CPU scheduler from vcpu0 to vcpu7. If that happens and we've 4
physical cpu for each physical node, any affinity 

Re: [PATCH 02/10] nEPT: MMU context for nested EPT

2011-11-23 Thread Nadav Har'El
On Sun, Nov 13, 2011, Orit Wasserman wrote about Re: [PATCH 02/10] nEPT: MMU 
context for nested EPT:
 Maybe this patch can help, this is roughly what Avi wants (I hope) done very 
 quickly.
 I'm sorry I don't have a setup to run nested VMX at the moment so I can't
 test it.

Hi Orit, thanks for the code - I'm now working on incorporating
something based on this into my patch. However, I do have a question:

 +static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
 +{
  + int r = kvm_init_shadow_mmu(vcpu, &vcpu->arch.mmu);
  +
  + vcpu->arch.nested_mmu.gva_to_gpa = EPT_gva_to_gpa_nested;
 +
 + return r;
 +}

I didn't see you actually call this function anywhere - how is it
supposed to work?

The way I understand the current code, kvm_mmu_reset_context() calls
init_kvm_mmu() which (in our case) calls init_kvm_nested_mmu().
I think the above gva_to_gpa setting should be there - right?
It seems we need a fifth case in that function.
But at that point in mmu.c, how will I be able to check if this is the
nested EPT case? Do you have any suggestion?

Thanks,
Nadav.

-- 
Nadav Har'El| Wednesday, Nov 23 2011, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |This message contains 100% recycled
http://nadav.harel.org.il   |characters.


[PATCH 1/2] KVM: Use kmemdup() instead of kzalloc/memcpy

2011-11-23 Thread Sasha Levin
Switch to kmemdup() in two places to shorten the code and avoid possible bugs.

Cc: Avi Kivity a...@redhat.com
Cc: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 arch/x86/kvm/x86.c  |4 ++--
 virt/kvm/kvm_main.c |7 +++
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9eff4af..79254f3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3499,10 +3499,10 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
memset(dirty_bitmap, 0, n);
 
r = -ENOMEM;
-   slots = kzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
+   slots = kmemdup(kvm->memslots, sizeof(*kvm->memslots), GFP_KERNEL);
if (!slots)
goto out;
-   memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
+
slots->memslots[log->slot].dirty_bitmap = dirty_bitmap;
slots->generation++;
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4c5b9a2..6f0a5e1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2538,13 +2538,12 @@ int kvm_io_bus_unregister_dev(struct kvm *kvm, enum 
kvm_bus bus_idx,
int i, r;
struct kvm_io_bus *new_bus, *bus;
 
-   new_bus = kzalloc(sizeof(struct kvm_io_bus), GFP_KERNEL);
+   bus = kvm->buses[bus_idx];
+
+   new_bus = kmemdup(bus, sizeof(*bus), GFP_KERNEL);
if (!new_bus)
return -ENOMEM;
 
-   bus = kvm->buses[bus_idx];
-   memcpy(new_bus, bus, sizeof(struct kvm_io_bus));
-
r = -ENOENT;
for (i = 0; i < new_bus->dev_count; i++)
if (new_bus->range[i].dev == dev) {
-- 
1.7.8.rc3



[PATCH 2/2] KVM: Use memdup_user instead of kmalloc/copy_from_user

2011-11-23 Thread Sasha Levin
Switch to using memdup_user when possible. This makes the code smaller
and more compact, and prevents errors.

Cc: Avi Kivity a...@redhat.com
Cc: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 arch/x86/kvm/x86.c  |   82 +-
 virt/kvm/kvm_main.c |   29 +++--
 2 files changed, 47 insertions(+), 64 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 79254f3..ee09ae4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1362,12 +1362,11 @@ static int xen_hvm_config(struct kvm_vcpu *vcpu, u64 
data)
if (page_num >= blob_size)
goto out;
r = -ENOMEM;
-   page = kzalloc(PAGE_SIZE, GFP_KERNEL);
-   if (!page)
+   page = memdup_user(blob_addr + (page_num * PAGE_SIZE), PAGE_SIZE);
+   if (IS_ERR(page)) {
+   r = PTR_ERR(page);
goto out;
-   r = -EFAULT;
-   if (copy_from_user(page, blob_addr + (page_num * PAGE_SIZE), PAGE_SIZE))
-   goto out_free;
+   }
if (kvm_write_guest(kvm, page_addr, page, PAGE_SIZE))
goto out_free;
r = 0;
@@ -2041,15 +2040,12 @@ static int msr_io(struct kvm_vcpu *vcpu, struct 
kvm_msrs __user *user_msrs,
if (msrs.nmsrs >= MAX_IO_MSRS)
goto out;
 
-   r = -ENOMEM;
size = sizeof(struct kvm_msr_entry) * msrs.nmsrs;
-   entries = kmalloc(size, GFP_KERNEL);
-   if (!entries)
+   entries = memdup_user(user_msrs->entries, size);
+   if (IS_ERR(entries)) {
+   r = PTR_ERR(entries);
goto out;
-
-   r = -EFAULT;
-   if (copy_from_user(entries, user_msrs->entries, size))
-   goto out_free;
+   }
 
r = n = __msr_io(vcpu, &msrs, entries, do_msr);
if (r < 0)
@@ -3043,13 +3039,12 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
r = -EINVAL;
if (!vcpu->arch.apic)
goto out;
-   u.lapic = kmalloc(sizeof(struct kvm_lapic_state), GFP_KERNEL);
-   r = -ENOMEM;
-   if (!u.lapic)
-   goto out;
-   r = -EFAULT;
-   if (copy_from_user(u.lapic, argp, sizeof(struct 
kvm_lapic_state)))
+   u.lapic = memdup_user(argp, sizeof(*u.lapic));
+   if (IS_ERR(u.lapic)) {
+   r = PTR_ERR(u.lapic);
goto out;
+   }
+
r = kvm_vcpu_ioctl_set_lapic(vcpu, u.lapic);
if (r)
goto out;
@@ -3228,14 +3223,11 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
break;
}
case KVM_SET_XSAVE: {
-   u.xsave = kzalloc(sizeof(struct kvm_xsave), GFP_KERNEL);
-   r = -ENOMEM;
-   if (!u.xsave)
-   break;
-
-   r = -EFAULT;
-   if (copy_from_user(u.xsave, argp, sizeof(struct kvm_xsave)))
-   break;
+   u.xsave = memdup_user(argp, sizeof(*u.xsave));
+   if (IS_ERR(u.xsave)) {
+   r = PTR_ERR(u.xsave);
+   goto out;
+   }
 
r = kvm_vcpu_ioctl_x86_set_xsave(vcpu, u.xsave);
break;
@@ -3256,15 +3248,11 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
break;
}
case KVM_SET_XCRS: {
-   u.xcrs = kzalloc(sizeof(struct kvm_xcrs), GFP_KERNEL);
-   r = -ENOMEM;
-   if (!u.xcrs)
-   break;
-
-   r = -EFAULT;
-   if (copy_from_user(u.xcrs, argp,
-  sizeof(struct kvm_xcrs)))
-   break;
+   u.xcrs = memdup_user(argp, sizeof(*u.xcrs));
+   if (IS_ERR(u.xcrs)) {
+   r = PTR_ERR(u.xcrs);
+   goto out;
+   }
 
r = kvm_vcpu_ioctl_x86_set_xcrs(vcpu, u.xcrs);
break;
@@ -3659,14 +3647,14 @@ long kvm_arch_vm_ioctl(struct file *filp,
}
case KVM_GET_IRQCHIP: {
/* 0: PIC master, 1: PIC slave, 2: IOAPIC */
-   struct kvm_irqchip *chip = kmalloc(sizeof(*chip), GFP_KERNEL);
+   struct kvm_irqchip *chip;
 
-   r = -ENOMEM;
-   if (!chip)
+   chip = memdup_user(argp, sizeof(*chip));
+   if (IS_ERR(chip)) {
+   r = PTR_ERR(chip);
goto out;
-   r = -EFAULT;
-   if (copy_from_user(chip, argp, sizeof *chip))
-   goto get_irqchip_out;
+   }
+
r = -ENXIO;
if (!irqchip_in_kernel(kvm))
goto get_irqchip_out;
@@ -3685,14 +3673,14 @@ long kvm_arch_vm_ioctl(struct file *filp,
}
case KVM_SET_IRQCHIP: {

Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout

2011-11-23 Thread Michael S. Tsirkin
On Wed, Nov 23, 2011 at 10:46:40AM +0200, Michael S. Tsirkin wrote:
 On Wed, Nov 23, 2011 at 01:02:22PM +1030, Rusty Russell wrote:
  On Tue, 22 Nov 2011 20:36:22 +0200, Michael S. Tsirkin m...@redhat.com 
  wrote:
   Here's an updated vesion.
   I'm alternating between updating the spec and the driver,
   spec update to follow.
  
  Don't touch the spec yet, we have a long way to go :(
  
  I want the ability for driver to set the ring size, and the device to
  set the alignment.
 
 Did you mean driver to be able to set the alignment? This
 is what BIOS guys want - after BIOS completes, guest driver gets handed
 control and sets its own alignment to save memory.
 
  That's a bigger change than you have here.
 
 Why can't we just add the new registers at the end?
 With the new capability, we have as much space as we like for that.

Didn't have the time to finish the patch today, but just to clarify,
we can apply something like the below on top of my patch,
and then config_len can be checked to see whether the new fields
like alignment are available.

We probably can make the alignment smaller and save some memory -
the default value could set the default alignment.

Ring size, naturally, can just be made writeable - BIOS can put a
small value there, but linux guest would just use the defaults
so no need for any code changes at all.

As a bonus, the capability length is verified to be large enough.
The change's pretty small, isn't it?

diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index 681347b..39433d3 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -37,8 +37,9 @@ struct virtio_pci_device
struct virtio_device vdev;
struct pci_dev *pci_dev;
 
-   /* the IO address for the common PCI config space */
+   /* the IO address and length for the common PCI config space */
void __iomem *ioaddr;
+   unsigned config_len;
/* the IO address for device specific config */
void __iomem *ioaddr_device;
/* the IO address to use for notifications operations */
@@ -101,22 +102,24 @@ static void __iomem *virtio_pci_legacy_map(struct 
virtio_pci_device *vp_dev)
 #endif
 
 /*
- * With PIO, device-specific config moves as MSI-X is enabled/disabled.
- * Use this accessor to keep pointer to that config in sync.
+ * With PIO, device-specific config moves as MSI-X is enabled/disabled,
+ * configuration region length changes as well.  Use this accessor to keep
+ * pointer to that config and common config length, in sync.
  */
 static void virtio_pci_set_msix_enabled(struct virtio_pci_device *vp_dev, int 
enabled)
 {
void __iomem* m;
vp_dev->msix_enabled = enabled;
m = virtio_pci_legacy_map(vp_dev);
-   if (m)
+   if (m) {
vp_dev->ioaddr_device = m + VIRTIO_PCI_CONFIG(vp_dev);
-   else
+   vp_dev->config_len = VIRTIO_PCI_CONFIG(vp_dev);
+   } else
vp_dev->ioaddr_device = vp_dev->device_map;
 }
 
 static void __iomem *virtio_pci_map_cfg(struct virtio_pci_device *vp_dev, u8 
cap_id,
-   u32 align)
+   u32 align, unsigned *lenp)
 {
 u32 size;
 u32 offset;
@@ -160,12 +163,16 @@ static void __iomem *virtio_pci_map_cfg(struct 
virtio_pci_device *vp_dev, u8 cap
 offset &= VIRTIO_PCI_CAP_CFG_OFF_MASK;
/* Align offset appropriately */
offset &= ~(align - 1);
+   if (lenp && size < *lenp)
+   goto err;
 
/* It's possible for a device to supply a huge config space,
 * but we'll never need to map more than a page ATM. */
p = pci_iomap_range(dev, bir, offset, size, PAGE_SIZE);
if (!p)
dev_err(&vp_dev->vdev.dev, "Unable to map virtio pci memory");
+   if (lenp)
+   *lenp = min(size, PAGE_SIZE);
return p;
 err:
dev_err(&vp_dev->vdev.dev, "Unable to parse virtio pci capability");
@@ -188,22 +195,24 @@ static void virtio_pci_iounmap(struct virtio_pci_device 
*vp_dev)
 
 static int virtio_pci_iomap(struct virtio_pci_device *vp_dev)
 {
+   unsigned common_len = VIRTIO_PCI_COMMON_CFG_MINSIZE;
vp_dev->isr_map = virtio_pci_map_cfg(vp_dev,
 VIRTIO_PCI_CAP_ISR_CFG,
-sizeof(u8));
+sizeof(u8), NULL);
vp_dev->notify_map = virtio_pci_map_cfg(vp_dev,
VIRTIO_PCI_CAP_NOTIFY_CFG,
-   sizeof(u16));
+   sizeof(u16), NULL);
vp_dev->common_map = virtio_pci_map_cfg(vp_dev,
VIRTIO_PCI_CAP_COMMON_CFG,
-   sizeof(u32));
+   sizeof(u32), &common_len);

Re: [PATCH 02/10] nEPT: MMU context for nested EPT

2011-11-23 Thread Nadav Har'El
On Wed, Nov 23, 2011, Nadav Har'El wrote about Re: [PATCH 02/10] nEPT: MMU 
context for nested EPT:
  +static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
  +{
  +   int r = kvm_init_shadow_mmu(vcpu, &vcpu->arch.mmu);
  +
  +   vcpu->arch.nested_mmu.gva_to_gpa = EPT_gva_to_gpa_nested;
  +
  +   return r;
  +}
..
 I didn't see you actually call this function anywhere - how is it
 supposed to work?
..
 It seems we need a fifth case in that function.
..

On second thought, why is this modifying nested_mmu.gva_to_gpa, and not
mmu.gva_to_gpa? Isn't the nested_mmu the L2 CR3, which is *not* in EPT
format, and what we really want to change is the outer mmu, which is
EPT12 and is indeed in EPT format?
Or am I missing something?

Thanks,
Nadav.

-- 
Nadav Har'El| Wednesday, Nov 23 2011, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |My password is my dog's name. His name is
http://nadav.harel.org.il   |a#j!4@h, but I change it every month.


[PATCH] KVM: Move cpuid code to new file

2011-11-23 Thread Avi Kivity
The cpuid code has grown; put it into a separate file.

Signed-off-by: Avi Kivity a...@redhat.com
---
 arch/x86/kvm/Makefile |2 +-
 arch/x86/kvm/cpuid.c  |  629 
 arch/x86/kvm/cpuid.h  |   46 
 arch/x86/kvm/lapic.c  |1 +
 arch/x86/kvm/vmx.c|1 +
 arch/x86/kvm/x86.c|  638 +
 arch/x86/kvm/x86.h|5 +-
 7 files changed, 683 insertions(+), 639 deletions(-)
 create mode 100644 arch/x86/kvm/cpuid.c
 create mode 100644 arch/x86/kvm/cpuid.h

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index f15501f..161b76a 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -12,7 +12,7 @@ kvm-$(CONFIG_IOMMU_API)   += $(addprefix 
../../../virt/kvm/, iommu.o)
 kvm-$(CONFIG_KVM_ASYNC_PF) += $(addprefix ../../../virt/kvm/, async_pf.o)
 
 kvm-y  += x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
-  i8254.o timer.o
+  i8254.o timer.o cpuid.o
 kvm-intel-y+= vmx.o
 kvm-amd-y  += svm.o
 
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
new file mode 100644
index 000..bbaa6d8
--- /dev/null
+++ b/arch/x86/kvm/cpuid.c
@@ -0,0 +1,629 @@
+/*
+ * Kernel-based Virtual Machine driver for Linux
+ * cpuid support routines
+ *
+ * derived from arch/x86/kvm/x86.c
+ *
+ * Copyright 2011 Red Hat, Inc. and/or its affiliates.
+ * Copyright IBM Corporation, 2008
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include linux/kvm_host.h
+#include linux/module.h
+#include asm/user.h
+#include asm/xsave.h
+#include cpuid.h
+#include lapic.h
+#include mmu.h
+#include trace.h
+
+void kvm_update_cpuid(struct kvm_vcpu *vcpu)
+{
+   struct kvm_cpuid_entry2 *best;
+   struct kvm_lapic *apic = vcpu->arch.apic;
+   u32 timer_mode_mask;
+
+   best = kvm_find_cpuid_entry(vcpu, 1, 0);
+   if (!best)
+   return;
+
+   /* Update OSXSAVE bit */
+   if (cpu_has_xsave && best->function == 0x1) {
+   best->ecx &= ~(bit(X86_FEATURE_OSXSAVE));
+   if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE))
+   best->ecx |= bit(X86_FEATURE_OSXSAVE);
+   }
+
+   if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
+   best->function == 0x1) {
+   best->ecx |= bit(X86_FEATURE_TSC_DEADLINE_TIMER);
+   timer_mode_mask = 3 << 17;
+   } else
+   timer_mode_mask = 1 << 17;
+
+   if (apic)
+   apic->lapic_timer.timer_mode_mask = timer_mode_mask;
+}
+
+static int is_efer_nx(void)
+{
+   unsigned long long efer = 0;
+
+   rdmsrl_safe(MSR_EFER, efer);
+   return efer & EFER_NX;
+}
+
+static void cpuid_fix_nx_cap(struct kvm_vcpu *vcpu)
+{
+   int i;
+   struct kvm_cpuid_entry2 *e, *entry;
+
+   entry = NULL;
+   for (i = 0; i < vcpu->arch.cpuid_nent; ++i) {
+   e = &vcpu->arch.cpuid_entries[i];
+   if (e->function == 0x80000001) {
+   entry = e;
+   break;
+   }
+   }
+   if (entry && (entry->edx & (1 << 20)) && !is_efer_nx()) {
+   entry->edx &= ~(1 << 20);
+   printk(KERN_INFO "kvm: guest NX capability removed\n");
+   }
+}
+
+/* when an old userspace process fills a new kernel module */
+int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu,
+struct kvm_cpuid *cpuid,
+struct kvm_cpuid_entry __user *entries)
+{
+   int r, i;
+   struct kvm_cpuid_entry *cpuid_entries;
+
+   r = -E2BIG;
+   if (cpuid->nent > KVM_MAX_CPUID_ENTRIES)
+   goto out;
+   r = -ENOMEM;
+   cpuid_entries = vmalloc(sizeof(struct kvm_cpuid_entry) * cpuid->nent);
+   if (!cpuid_entries)
+   goto out;
+   r = -EFAULT;
+   if (copy_from_user(cpuid_entries, entries,
+  cpuid->nent * sizeof(struct kvm_cpuid_entry)))
+   goto out_free;
+   for (i = 0; i < cpuid->nent; i++) {
+   vcpu->arch.cpuid_entries[i].function = cpuid_entries[i].function;
+   vcpu->arch.cpuid_entries[i].eax = cpuid_entries[i].eax;
+   vcpu->arch.cpuid_entries[i].ebx = cpuid_entries[i].ebx;
+   vcpu->arch.cpuid_entries[i].ecx = cpuid_entries[i].ecx;
+   vcpu->arch.cpuid_entries[i].edx = cpuid_entries[i].edx;
+   vcpu->arch.cpuid_entries[i].index = 0;
+   vcpu->arch.cpuid_entries[i].flags = 0;
+   vcpu->arch.cpuid_entries[i].padding[0] = 0;
+   vcpu->arch.cpuid_entries[i].padding[1] = 0;
+   vcpu->arch.cpuid_entries[i].padding[2] = 0;
+   }
+   vcpu->arch.cpuid_nent = cpuid->nent;
+   cpuid_fix_nx_cap(vcpu);
+   r = 0;
+   kvm_apic_set_version(vcpu);
+   kvm_x86_ops->cpuid_update(vcpu);
+   

Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding

2011-11-23 Thread Alexander Graf

On 11/23/2011 04:03 PM, Andrea Arcangeli wrote:

Hi!

On Mon, Nov 21, 2011 at 07:51:21PM -0600, Anthony Liguori wrote:

Fundamentally, the entity that should be deciding what memory should be present
and where it should located is the kernel.  I'm fundamentally opposed to trying
to make QEMU override the scheduler/mm by using cpu or memory pinning in QEMU.

  From what I can tell about ms_mbind(), it just uses process knowledge to bind
specific areas of memory to a memsched group and let's the kernel decide what to
do with that knowledge.  This is exactly the type of interface that QEMU should
be using.

QEMU should tell the kernel enough information such that the kernel can make
good decisions.  QEMU should not be the one making the decisions.

True, QEMU won't have to decide where the memory and vcpus should be
located (but hey it wouldn't need to decide that even if you use
cpusets, you can use relative mbind with cpusets, the admin or a
cpuset job scheduler could decide) but it's still QEMU making the
decision of what memory and which vcpus threads to
ms_mbind/ms_tbind. Think how you're going to create the input of those
syscalls...

If it wasn't qemu to decide that, qemu wouldn't be required to scan
the whole host physical numa (cpu/memory) topology in order to create
the input arguments of ms_mbind/ms_tbind. And when you migrate the
VM to another host, the whole vtopology may be counter-productive
because the kernel isn't automatically detecting the numa affinity
between threads and the guest vtopology will stick to whatever numa
_physical_ topology that was seen on the first node where the VM was
created.

I doubt that the assumption that all cloud nodes will have the same
physical numa topology is reasonable.

Furthermore to get the same benefits that qemu gets on host by using
ms_mbind/ms_tbind, every single guest application should be modified
to scan the guest vtopology and call ms_mbind/ms_tbind too (or use the
hard bindings which is what we try to avoid).

I think it's unreasonable to expect all applications to use
ms_mbind/ms_tbind in the guest, at best guest apps will use cpusets or
wrappers, few apps will be modified for sys_ms_tbind/mbind.

You can always have the supercomputer case with just one app that is
optimized and a single VM spanning over the whole host, but in that
scenarios hard bindings would work perfectly too.

In my view the trouble of the numa hard bindings is not the fact
they're hard and qemu has to also decide the location (in fact it
doesn't need to decide the location if you use cpusets and relative
mbinds). The bigger problem is the fact either the admin or the app
developer has to explicitly scan the numa physical topology (both cpus
and memory) and tell the kernel how much memory to bind to each
thread. ms_mbind/ms_tbind only partially solve that problem. They're
similar to the mbind MPOL_F_RELATIVE_NODES with cpusets, except you
don't need an admin or a cpuset-job-scheduler (or a perl script) to
redistribute the hardware resources.


Well yeah, of course the guest needs to see some topology. I don't see 
why we'd have to actually scan the host for this though. All we need to 
tell the kernel is "this memory region is close to that thread".


So if you define -numa node,mem=1G,cpus=0 then QEMU should be able to 
tell the kernel that this GB of RAM actually is close to that vCPU thread.


Of course the admin still needs to decide how to split up memory. That's 
the deal with emulating real hardware. You get the interfaces hardware 
gets :). However, if you follow a reasonable default strategy such as 
numa splitting your RAM into equal chunks between guest vCPUs you're 
probably close enough to optimal usage models. Or at least you could 
have a close enough approximation of how this mapping could work for the 
_guest_ regardless of the host and when you migrate it somewhere else it 
should also work reasonably well.




Now dealing with bindings isn't big deal for qemu, in fact this API is
pretty much ideal for qemu, but it won't make life substantially
easier than if compared to hard bindings. Simply the management code
that is now done with a perl script will have to be moved in the
kernel. It looks an incremental improvement compared to the relative
mbind+cpuset, but I'm unsure if it's the best we could aim for and
what we really need in virt considering we deal with VM migration too.

The real long term design to me is not to add more syscalls, and
initially handling the case of a process/VM spanning not more than one
node in thread number and amount of memory. That's not too hard and in
fact I have benchmarks for the scheduler already showing it to work
pretty well (it's creating a too strict affinity but it can be relaxed
to be more useful). Then later add some mechanism (simplest is the
page fault at low frequency) to create a
guest_vcpu_thread->host_memory affinity and have a paravirtualized
interface that tells the guest scheduler to group CPUs.

If the guest 

Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding

2011-11-23 Thread Andrea Arcangeli
On Wed, Nov 23, 2011 at 07:34:37PM +0100, Alexander Graf wrote:
 So if you define -numa node,mem=1G,cpus=0 then QEMU should be able to 
 tell the kernel that this GB of RAM actually is close to that vCPU thread.
 Of course the admin still needs to decide how to split up memory. That's 
 the deal with emulating real hardware. You get the interfaces hardware 
 gets :). However, if you follow a reasonable default strategy such as 

The problem is how do you decide the parameter -numa
node,mem=1G,cpus=0.

Real hardware exists when the VM starts. But then the VM can be
migrated. Or the VM may have to be split in the middle of two nodes
regardless of the -numa node,mem=1G,cpus=0-1 to avoid swapping, so
there may be two 512M nodes with 1 cpu each instead of 1 NUMA node
with 1G and 2 cpus.

Especially by relaxing the hard bindings and using ms_mbind/tbind, the
vtopology you create won't match real hardware because you don't know
the real hardware that you will get.

 numa splitting your RAM into equal chunks between guest vCPUs you're 
 probably close enough to optimal usage models. Or at least you could 
 have a close enough approximation of how this mapping could work for the 
 _guest_ regardless of the host and when you migrate it somewhere else it 
 should also work reasonably well.

If you enforce these assumptions and the admin has still again to
choose the -numa node,mem=1G parameters after checking the physical
numa topology and make sure the vtopology can match the real physical
topology and that the guest runs on real hardware, it's not very
different from using hard bindings, hard bindings enforces the real
hardware so there's no way it can go wrong. I mean you still need
some NUMA topology knowledge outside of QEMU to be sure you get real
hardware out of the vtopology.

Ok cpusets would restrict the availability of idle cpus, so there's a
slight improvement in maximizing idle CPU usage (it's better to run
50% slower than not to run at all), but that could be achieved also by
a relax of the cpuset semantics (if that's not already available).

 So you want to basically dynamically create NUMA topologies from the 
 runtime behavior of the guest? What if it changes over time?

Yes, just I wouldn't call it NUMA topologies or it looks like a
vtopology, and the vtopology is something fixed at boot time, the sort
of thing created by using a command line like -numa
node,mem=1G,cpu=0.

I wouldn't try to give the guest any memory topology, just the vcpus
are magic threads that don't behave like normal threads in memory
affinity terms. So they need a paravirtualization layer to be dealt
with.

The fact vcpu0 accessed 10 pages right now, doesn't mean there's a
real affinity between vcpu0 and those N pages if the guest scheduler
is free to migrate anything anywhere. The guest thread running in the
vcpu0 may be migrated to the vcpu7 which may belong to a different
physical node. So if we want to automatically detect thread-memory
affinity between vcpus and guest memory, we also need to group the
guest threads in certain vcpus and prevent those cpu migrations. The
thread in the guest would better stick to vcpu0/1/2/3 (instead of
migration to vcpu4/5/6/7) if vcpu0/1/2/3 have affinity with the same
memory which fits in one node. That can only be told dynamically from
KVM to the guest OS scheduler as we may migrate virtual machines or we
may move the memory.

Take the example of 3 VMs of 2.5G ram each on a 8G system with 2 nodes
(4G per node). Suppose one of the two VMs that have all their 2.5G
allocated in a single node quits. Then the VM that was split across
the two nodes will be memory-migrated to fit in one node. So far so
good, but then KVM should tell the guest OS scheduler that it should
stop grouping vcpus and all vcpus are equal and all guest threads can
be migrated to any vcpu.

I don't see a way to do those things with a vtopology fixed at boot.

 I actually like the idea of just telling the kernel how close memory 
 will be to a thread. Sure, you can handle this basically by shoving your 
 scheduler into user space, but isn't managing processes what a kernel is 
 supposed to do in the first place?

Assume you're not in virt and you just want to tell the kernel that
thread A uses memory range A and thread B uses memory range B. If
memory range A fits in one node you're ok. But if memory range A now
spans two nodes (maybe to avoid swapping), you're still screwed and
you won't give the kernel enough information on the real runtime
affinity that thread A has on the memory. Now if statistically the
accesses to memory A are all equal, it won't make a difference, but if
you end up using half of memory A 99% of the time, it will not work as
well.

This is especially a problem for KVM because statistically the
accesses to the memory given to vcpu0 won't be equal. 50% of it may
not be used at all and just have pagecache sitting there, or even free
memory, so we can do better if that memory is split across two nodes
to avoid swapping, if we 

Re: [RFC PATCH 11/11] KVM: PPC: Eliminate global spinlock in kvmppc_h_enter

2011-11-23 Thread Marcelo Tosatti
On Thu, Nov 17, 2011 at 10:55:49AM +1100, Paul Mackerras wrote:
 From dfd5bcfac841f8a36593edf60d9fb15e0d633287 Mon Sep 17 00:00:00 2001
 From: Paul Mackerras pau...@samba.org
 Date: Mon, 14 Nov 2011 13:30:38 +1100
 Subject: 
 
 Currently, kvmppc_h_enter takes a spinlock that is global to the guest,
 kvm-mmu_lock, in order to check for pending PTE invalidations safely.
 On some workloads, kvmppc_h_enter is called heavily and the use of a
 global spinlock could compromise scalability.  We already use a per-
 guest page spinlock in the form of the bit spinlock on the rmap chain,
 and this gives us synchronization with the PTE invalidation side, which
 also takes the bit spinlock on the rmap chain for each page being
 invalidated.  Thus it is sufficient to check for pending invalidations
 while the rmap chain bit spinlock is held.  However, now we require
 barriers in mmu_notifier_retry() and in the places where
 mmu_notifier_count and mmu_notifier_seq are updated, since we can now
 call mmu_notifier_retry() concurrently with updates to those fields.
 
 Signed-off-by: Paul Mackerras pau...@samba.org
 ---
 Cc'd to kvm@vger.kernel.org for review of the generic kvm changes.

Looks good to me.
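The protocol the commit message describes — snapshot a sequence number, do the lookup under the rmap-chain bit lock, retry if an invalidation ran concurrently — can be sketched in a few lines. This is only an illustrative model with made-up names, not the real kvm fields; the real code additionally needs the memory barriers that the patch adds, which a single-threaded sketch cannot show.

```c
/* Illustrative model of the mmu_notifier sequence/count protocol.
 * The invalidation side raises `count` while it runs and bumps `seq`
 * when it finishes; the fault side snapshots `seq` first and retries
 * if an invalidation is in flight or completed since the snapshot. */
struct mmu_state {
	unsigned long seq;	/* incremented when an invalidation completes */
	long count;		/* non-zero while an invalidation is running */
};

static void invalidate_range_start(struct mmu_state *s) { s->count++; }
static void invalidate_range_end(struct mmu_state *s)   { s->count--; s->seq++; }

/* Returns non-zero if the fault path must redo its lookup. */
static int notifier_retry(const struct mmu_state *s, unsigned long snapshot)
{
	return s->count != 0 || s->seq != snapshot;
}
```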

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/11] KVM: PPC: Update Book3S HV memory handling

2011-11-23 Thread Marcelo Tosatti
On Sat, Nov 19, 2011 at 08:54:24AM +1100, Paul Mackerras wrote:
 On Fri, Nov 18, 2011 at 02:57:11PM +0100, Alexander Graf wrote:
  
  This touches areas that I'm sure non-PPC people would want to see as
  well. Could you please CC kvm@vger too next time?
  
  Avi, Marcelo, mind to review some of the bits in this patch set? :)
 
 I did cc the last patch (the one that adds barriers in the MMU
 notifier sequence/count logic) to kvm@vger.  Do you mean I should cc
 the whole series?  The only other thing touching generic code is the
 addition of the KVM_MEMSLOT_IO flag in the first patch.

I don't see such flag anywhere in the patches in this thread?

 I'm hoping the extra barriers will be OK since they are no-ops on
 x86.  In fact I now think that the smp_wmbs I added to
 kvm_mmu_notifier_invalidate_page and kvm_mmu_notifier_change_pte
 aren't in fact necessary, since it's only necessary to ensure that the
 sequence number increase is visible before the point where
 kvm_unmap_hva or kvm_set_spte_hva unlock the bitlock on the first rmap
 chain they look at, which will be ensured anyway by the barrier before
 the unlock.
 
 Paul.


Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout

2011-11-23 Thread Rusty Russell
On Wed, 23 Nov 2011 11:38:40 +0200, Sasha Levin levinsasha...@gmail.com wrote:
 On Wed, 2011-11-23 at 10:49 +0200, Michael S. Tsirkin wrote:
  On Wed, Nov 23, 2011 at 01:02:22PM +1030, Rusty Russell wrote:
   +/* Fields in VIRTIO_PCI_CAP_COMMON_CFG: */
   +struct virtio_pci_common_cfg {
   + /* About the whole device. */
   + __u64 device_features;  /* read-only */
   + __u64 guest_features;   /* read-write */
  
  We currently require atomic accesses to common fields.
  Some architectures might not have such for 64 bit,
  so these need to be split I think ...
 
 We can consider stealing the feature implementation from virtio-mmio:
 Use the first 32 bits as a selector and the last as the features
 themselves.
 
 It's more complex to work with, but it provides 2**32 32 feature bits
 (which should be enough for a while) and it solves the atomic access
 issue.

That works.  I don't expect we'll need more than 64 features given that
virtio_net hasn't seen a new one in over a year, but it's gone from 5 to
18 in 4 years, so another 32 bits at that rate only gets us another decade.

Unfortunately, it doesn't solve the queue_address problem.

We currently multiply the (32-bit) address by 4096 (the alignment).  If
drivers are going to reduce alignment, that makes it trickier, hence the
change here to a 64-bit field.  Simplicity.

We could cheat and assert that a 32-bit write there implies 0 in the
upper bits, and hope that all 64 bit platforms can do a 64-bit atomic
write.  Or insist the value not be interpreted until the guest writes its
(first) feature bit.

Perhaps we really need an "I'm done configuring!" trigger, now that we
can't use the feature bit field for that.

Thoughts welcome...
Rusty.
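Sasha's selector/window idea can be modeled in a few lines. This is only an illustrative sketch with made-up names (borrowed in spirit from virtio-mmio), showing how two 32-bit accesses reconstruct a 64-bit feature word so that no single register access ever needs 64-bit atomicity:

```c
#include <stdint.h>

/* Illustrative model of the selector/window scheme: the driver writes
 * a 32-bit bank selector, then reads a single 32-bit window, so no
 * access ever needs 64-bit atomicity.  All names are hypothetical. */
#define FEATURE_BANKS 2		/* assumption: 64 feature bits = 2 banks */

struct feature_regs {
	uint32_t sel;			/* which 32-bit bank the window shows */
	uint32_t bank[FEATURE_BANKS];	/* device's feature-bit banks */
};

/* Reassemble the full 64-bit feature word with two 32-bit reads. */
static uint64_t read_features64(struct feature_regs *r)
{
	r->sel = 0;
	uint64_t lo = r->bank[r->sel];
	r->sel = 1;
	uint64_t hi = r->bank[r->sel];
	return lo | (hi << 32);
}
```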


Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout

2011-11-23 Thread Rusty Russell
On Wed, 23 Nov 2011 10:46:41 +0200, Michael S. Tsirkin m...@redhat.com 
wrote:
 On Wed, Nov 23, 2011 at 01:02:22PM +1030, Rusty Russell wrote:
  On Tue, 22 Nov 2011 20:36:22 +0200, Michael S. Tsirkin m...@redhat.com 
  wrote:
   Here's an updated vesion.
   I'm alternating between updating the spec and the driver,
   spec update to follow.
  
  Don't touch the spec yet, we have a long way to go :(
  
  I want the ability for driver to set the ring size, and the device to
  set the alignment.
 
 Did you mean driver to be able to set the alignment? This
 is what BIOS guys want - after BIOS completes, guest driver gets handed
 control and sets its own alignment to save memory.

Yep, sorry.

But we really do want the guest to set the ring size.  Because it has to
be guest-physical-contiguous, the host currently sets a very small ring,
because the guest is useless if it can't allocate.

Either way, it's now the driver's responsibility to write those fields.

  That's a bigger change than you have here.
 
 Why can't we just add the new registers at the end?
 With the new capability, we have as much space as we like for that.

We could, for sure.

  I imagine it almost rips the driver into two completely different drivers.
 
 If you insist on moving all the rest of registers around, certainly. But
 why do this?

Because I suspect we'll be different enough anyway, once we change the
way we allocate the ring, and write the alignment.  It'll be *clearer*
to have two completely separate paths than to fill with if() statements.
And a rewrite won't hurt the driver.

But to be honest I don't really care about the Linux driver: we're
steeped in this stuff and we'll get it right.  But I'm *terrified* of
making the spec more complex; implementations will get it wrong.  I
*really* want to banish the legacy stuff to an appendix where no one
will ever know it's there :)

 Renaming constants in exported headers will break userspace builds.
 Do we care? Why not?

As the patch shows, I decided not to do that.  It's a nice heads-up, but
breaking older versions of the code is just mean.  Hence this:

  +#ifndef __KERNEL__
  +/* Don't break compile of old userspace code.  These will go away. */
  +#define VIRTIO_PCI_HOST_FEATURES VIRTIO_PCI_LEGACY_HOST_FEATURES
  +#define VIRTIO_PCI_GUEST_FEATURES VIRTIO_PCI_LEGACY_GUEST_FEATURES
  +#define VIRTIO_PCI_LEGACY_QUEUE_PFN VIRTIO_PCI_QUEUE_PFN
  +#define VIRTIO_PCI_LEGACY_QUEUE_NUM VIRTIO_PCI_QUEUE_NUM
  +#define VIRTIO_PCI_LEGACY_QUEUE_SEL VIRTIO_PCI_QUEUE_SEL
  +#define VIRTIO_PCI_LEGACY_QUEUE_NOTIFY VIRTIO_PCI_QUEUE_NOTIFY
  +#define VIRTIO_PCI_LEGACY_STATUS VIRTIO_PCI_STATUS
  +#define VIRTIO_PCI_LEGACY_ISR VIRTIO_PCI_ISR
  +#define VIRTIO_MSI_LEGACY_CONFIG_VECTOR VIRTIO_MSI_CONFIG_VECTOR
  +#define VIRTIO_MSI_LEGACY_QUEUE_VECTOR VIRTIO_MSI_QUEUE_VECTOR
  +#define VIRTIO_PCI_LEGACY_CONFIG(dev) VIRTIO_PCI_CONFIG(dev)
  +#define VIRTIO_PCI_LEGACY_QUEUE_ADDR_SHIFT VIRTIO_PCI_QUEUE_ADDR_SHIFT
  +#define VIRTIO_PCI_LEGACY_VRING_ALIGN VIRTIO_PCI_VRING_ALIGN
  +#endif /* ...!KERNEL */

...
  +/* Fields in VIRTIO_PCI_CAP_COMMON_CFG: */
  +struct virtio_pci_common_cfg {
  +   /* About the whole device. */
  +   __u64 device_features;  /* read-only */
  +   __u64 guest_features;   /* read-write */
  +   __u64 queue_address;/* read-write */
  +   __u16 msix_config;  /* read-write */
  +   __u8 device_status; /* read-write */
  +   __u8 unused;
  +
  +   /* About a specific virtqueue. */
  +   __u16 queue_select; /* read-write */
  +   __u16 queue_align;  /* read-write, power of 2. */
  +   __u16 queue_size;   /* read-write, power of 2. */
  +   __u16 queue_msix_vector;/* read-write */
  +};
 
 Slightly confusing as the registers are in fact little endian ...

Good point, should mark them appropriately with __le16.  That makes it
even clearer.

Thanks,
Rusty.
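For illustration, this is what the __le16 annotation buys: readers must go through a conversion helper rather than a plain load. A minimal sketch — the helper name mimics the kernel's le16_to_cpu, hence the suffix to mark it as a stand-in:

```c
#include <stdint.h>

/* Sketch: assembling a little-endian 16-bit register value
 * byte-by-byte gives the right answer on any host endianness. */
static uint16_t le16_to_cpu_sketch(const uint8_t b[2])
{
	return (uint16_t)(b[0] | ((uint16_t)b[1] << 8));
}
```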


Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout

2011-11-23 Thread Michael S. Tsirkin
On Thu, Nov 24, 2011 at 11:06:44AM +1030, Rusty Russell wrote:
   +/* Fields in VIRTIO_PCI_CAP_COMMON_CFG: */
   +struct virtio_pci_common_cfg {
   + /* About the whole device. */
   + __u64 device_features;  /* read-only */
   + __u64 guest_features;   /* read-write */
   + __u64 queue_address;/* read-write */
   + __u16 msix_config;  /* read-write */
   + __u8 device_status; /* read-write */
   + __u8 unused;
   +
   + /* About a specific virtqueue. */
   + __u16 queue_select; /* read-write */
   + __u16 queue_align;  /* read-write, power of 2. */
   + __u16 queue_size;   /* read-write, power of 2. */
   + __u16 queue_msix_vector;/* read-write */
   +};
  
  Slightly confusing as the registers are in fact little endian ...
 
 Good point, should mark them appropriately with __le16.  That makes it
 even clearer.
 
 Thanks,
 Rusty.

Do we still require atomic access to fields in common cfg?
If yes it's a problem, as some systems don't have atomic 64 bit
accesses. If no, implementations might get harder.

-- 
MST


[PATCH v3] kvm tools, qcow: Add the support for copy-on-write cluster

2011-11-23 Thread Lan Tianyu
When a write request targets a cluster whose copied flag is not set,
allocate a new cluster and write the original data, with the
modification, to the new cluster. This also adds support for writing
to qcow2 compressed images. After testing, the image file can pass
qemu-img check. The performance still needs to be improved.

Signed-off-by: Lan Tianyu tianyu@intel.com
---
 tools/kvm/disk/qcow.c|  456 +++---
 tools/kvm/include/kvm/qcow.h |2 +
 2 files changed, 292 insertions(+), 166 deletions(-)

diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c
index 680b37d..7cf6fda 100644
--- a/tools/kvm/disk/qcow.c
+++ b/tools/kvm/disk/qcow.c
@@ -20,6 +20,16 @@
 #include <linux/kernel.h>
 #include <linux/types.h>
 
+
+static inline int qcow_pwrite_sync(int fd,
+	void *buf, size_t count, off_t offset)
+{
+	if (pwrite_in_full(fd, buf, count, offset) < 0)
+		return -1;
+
+	return fdatasync(fd);
+}
+
 static int l2_table_insert(struct rb_root *root, struct qcow_l2_table *new)
 {
	struct rb_node **link = &(root->rb_node), *parent = NULL;
@@ -101,7 +111,8 @@ static int qcow_l2_cache_write(struct qcow *q, struct qcow_l2_table *c)
 
	size = 1 << header->l2_bits;
 
-	if (pwrite_in_full(q->fd, c->table, size * sizeof(u64), c->offset) < 0)
+	if (qcow_pwrite_sync(q->fd, c->table,
+			size * sizeof(u64), c->offset) < 0)
		return -1;
 
	c->dirty = 0;
@@ -122,9 +133,6 @@ static int cache_table(struct qcow *q, struct qcow_l2_table *c)
	 */
	lru = list_first_entry(&l1t->lru_list, struct qcow_l2_table, list);
 
-	if (qcow_l2_cache_write(q, lru) < 0)
-		goto error;
-
	/* Remove the node from the cache */
	rb_erase(&lru->node, r);
	list_del_init(&lru->list);
@@ -508,47 +516,6 @@ static ssize_t qcow_read_sector(struct disk_image *disk, u64 sector,
	return total;
 }
 
-static inline u64 file_size(int fd)
-{
-	struct stat st;
-
-	if (fstat(fd, &st) < 0)
-		return 0;
-
-	return st.st_size;
-}
-
-static inline int qcow_pwrite_sync(int fd, void *buf, size_t count, off_t offset)
-{
-	if (pwrite_in_full(fd, buf, count, offset) < 0)
-		return -1;
-
-	return fdatasync(fd);
-}
-
-/* Writes a level 2 table at the end of the file. */
-static u64 qcow_write_l2_table(struct qcow *q, u64 *table)
-{
-	struct qcow_header *header = q->header;
-	u64 clust_sz;
-	u64 f_sz;
-	u64 off;
-	u64 sz;
-
-	f_sz	= file_size(q->fd);
-	if (!f_sz)
-		return 0;
-
-	sz	= 1 << header->l2_bits;
-	clust_sz	= 1 << header->cluster_bits;
-	off	= ALIGN(f_sz, clust_sz);
-
-	if (pwrite_in_full(q->fd, table, sz * sizeof(u64), off) < 0)
-		return 0;
-
-	return off;
-}
-
 static void refcount_table_free_cache(struct qcow_refcount_table *rft)
 {
	struct rb_root *r = &rft->root;
@@ -601,7 +568,8 @@ static int write_refcount_block(struct qcow *q, struct qcow_refcount_block *rfb)
	if (!rfb->dirty)
		return 0;
 
-	if (pwrite_in_full(q->fd, rfb->entries, rfb->size * sizeof(u16), rfb->offset) < 0)
+	if (qcow_pwrite_sync(q->fd, rfb->entries,
+			rfb->size * sizeof(u16), rfb->offset) < 0)
		return -1;
 
	rfb->dirty = 0;
@@ -618,9 +586,6 @@ static int cache_refcount_block(struct qcow *q, struct qcow_refcount_block *c)
	if (rft->nr_cached == MAX_CACHE_NODES) {
		lru = list_first_entry(&rft->lru_list, struct qcow_refcount_block, list);
 
-		if (write_refcount_block(q, lru) < 0)
-			goto error;
-
		rb_erase(&lru->node, r);
		list_del_init(&lru->list);
		rft->nr_cached--;
@@ -706,6 +671,11 @@ static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64
 
	rfb_offset = be64_to_cpu(rft->rf_table[rft_idx]);
 
+	if (!rfb_offset) {
+		pr_warning("Don't support to grow refcount table");
+		return NULL;
+	}
+
	rfb = refcount_block_search(q, rfb_offset);
	if (rfb)
		return rfb;
@@ -728,35 +698,140 @@ error_free_rfb:
	return NULL;
 }
 
+static u16 qcow_get_refcount(struct qcow *q, u64 clust_idx)
+{
+	struct qcow_refcount_block *rfb = NULL;
+	struct qcow_header *header = q->header;
+	u64 rfb_idx;
+
+	rfb = qcow_read_refcount_block(q, clust_idx);
+	if (!rfb) {
+		pr_warning("Error while reading refcount table");
+		return -1;
+	}
+
+	rfb_idx = clust_idx & (((1ULL <<
+		(header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1));
+
+	if (rfb_idx >= rfb->size) {
+		pr_warning("L1: refcount block index out of bounds");
+		return -1;
+	}
+
+
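[The patch is truncated here in the archive.] The refcount-index arithmetic in qcow_get_refcount() above can be sketched standalone. This assumes QCOW_REFCOUNT_BLOCK_SHIFT is 1 (refcounts are u16, so a cluster of 2^cluster_bits bytes holds 2^(cluster_bits - 1) entries); the helper names are illustrative, not from the patch:

```c
#include <stdint.h>

/* Illustrative split of a cluster index into refcount-table and
 * refcount-block indices, mirroring the mask in qcow_get_refcount(). */
#define QCOW_REFCOUNT_BLOCK_SHIFT 1	/* assumption: u16 refcounts */

static uint64_t rfb_index(uint64_t clust_idx, unsigned cluster_bits)
{
	/* index within the refcount block: low bits of the cluster index */
	return clust_idx & ((1ULL << (cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1);
}

static uint64_t rft_index(uint64_t clust_idx, unsigned cluster_bits)
{
	/* index into the refcount table: the remaining high bits */
	return clust_idx >> (cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT);
}
```

With 64K clusters (cluster_bits = 16), each refcount block covers 32768 clusters.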

[PATCH v4] kvm tools, qcow: Add the support for copy-on-write cluster

2011-11-23 Thread Lan Tianyu
When a write request targets a cluster whose copied flag is not set,
allocate a new cluster and write the original data, with the
modification, to the new cluster. This also adds support for writing
to qcow2 compressed images. After testing, the image file can pass
qemu-img check. The performance still needs to be improved.

Signed-off-by: Lan Tianyu tianyu@intel.com
---
 tools/kvm/disk/qcow.c|  456 +++---
 tools/kvm/include/kvm/qcow.h |2 +
 2 files changed, 292 insertions(+), 166 deletions(-)

diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c
index 680b37d..400aae8 100644
--- a/tools/kvm/disk/qcow.c
+++ b/tools/kvm/disk/qcow.c
@@ -20,6 +20,16 @@
 #include <linux/kernel.h>
 #include <linux/types.h>
 
+
+static inline int qcow_pwrite_sync(int fd,
+	void *buf, size_t count, off_t offset)
+{
+	if (pwrite_in_full(fd, buf, count, offset) < 0)
+		return -1;
+
+	return fdatasync(fd);
+}
+
 static int l2_table_insert(struct rb_root *root, struct qcow_l2_table *new)
 {
	struct rb_node **link = &(root->rb_node), *parent = NULL;
@@ -101,7 +111,8 @@ static int qcow_l2_cache_write(struct qcow *q, struct qcow_l2_table *c)
 
	size = 1 << header->l2_bits;
 
-	if (pwrite_in_full(q->fd, c->table, size * sizeof(u64), c->offset) < 0)
+	if (qcow_pwrite_sync(q->fd, c->table,
+			size * sizeof(u64), c->offset) < 0)
		return -1;
 
	c->dirty = 0;
@@ -122,9 +133,6 @@ static int cache_table(struct qcow *q, struct qcow_l2_table *c)
	 */
	lru = list_first_entry(&l1t->lru_list, struct qcow_l2_table, list);
 
-	if (qcow_l2_cache_write(q, lru) < 0)
-		goto error;
-
	/* Remove the node from the cache */
	rb_erase(&lru->node, r);
	list_del_init(&lru->list);
@@ -508,47 +516,6 @@ static ssize_t qcow_read_sector(struct disk_image *disk, u64 sector,
	return total;
 }
 
-static inline u64 file_size(int fd)
-{
-	struct stat st;
-
-	if (fstat(fd, &st) < 0)
-		return 0;
-
-	return st.st_size;
-}
-
-static inline int qcow_pwrite_sync(int fd, void *buf, size_t count, off_t offset)
-{
-	if (pwrite_in_full(fd, buf, count, offset) < 0)
-		return -1;
-
-	return fdatasync(fd);
-}
-
-/* Writes a level 2 table at the end of the file. */
-static u64 qcow_write_l2_table(struct qcow *q, u64 *table)
-{
-	struct qcow_header *header = q->header;
-	u64 clust_sz;
-	u64 f_sz;
-	u64 off;
-	u64 sz;
-
-	f_sz	= file_size(q->fd);
-	if (!f_sz)
-		return 0;
-
-	sz	= 1 << header->l2_bits;
-	clust_sz	= 1 << header->cluster_bits;
-	off	= ALIGN(f_sz, clust_sz);
-
-	if (pwrite_in_full(q->fd, table, sz * sizeof(u64), off) < 0)
-		return 0;
-
-	return off;
-}
-
 static void refcount_table_free_cache(struct qcow_refcount_table *rft)
 {
	struct rb_root *r = &rft->root;
@@ -601,7 +568,8 @@ static int write_refcount_block(struct qcow *q, struct qcow_refcount_block *rfb)
	if (!rfb->dirty)
		return 0;
 
-	if (pwrite_in_full(q->fd, rfb->entries, rfb->size * sizeof(u16), rfb->offset) < 0)
+	if (qcow_pwrite_sync(q->fd, rfb->entries,
+			rfb->size * sizeof(u16), rfb->offset) < 0)
		return -1;
 
	rfb->dirty = 0;
@@ -618,9 +586,6 @@ static int cache_refcount_block(struct qcow *q, struct qcow_refcount_block *c)
	if (rft->nr_cached == MAX_CACHE_NODES) {
		lru = list_first_entry(&rft->lru_list, struct qcow_refcount_block, list);
 
-		if (write_refcount_block(q, lru) < 0)
-			goto error;
-
		rb_erase(&lru->node, r);
		list_del_init(&lru->list);
		rft->nr_cached--;
@@ -706,6 +671,11 @@ static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64
 
	rfb_offset = be64_to_cpu(rft->rf_table[rft_idx]);
 
+	if (!rfb_offset) {
+		pr_warning("Don't support to grow refcount table");
+		return NULL;
+	}
+
	rfb = refcount_block_search(q, rfb_offset);
	if (rfb)
		return rfb;
@@ -728,35 +698,140 @@ error_free_rfb:
	return NULL;
 }
 
+static u16 qcow_get_refcount(struct qcow *q, u64 clust_idx)
+{
+	struct qcow_refcount_block *rfb = NULL;
+	struct qcow_header *header = q->header;
+	u64 rfb_idx;
+
+	rfb = qcow_read_refcount_block(q, clust_idx);
+	if (!rfb) {
+		pr_warning("Error while reading refcount table");
+		return -1;
+	}
+
+	rfb_idx = clust_idx & (((1ULL <<
+		(header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1));
+
+	if (rfb_idx >= rfb->size) {
+		pr_warning("L1: refcount block index out of bounds");
+		return -1;
+	}
+
+

Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout

2011-11-23 Thread Michael S. Tsirkin
On Thu, Nov 24, 2011 at 11:06:44AM +1030, Rusty Russell wrote:
 Because I suspect we'll be different enough anyway, once we change the
 way we allocate the ring, and write the alignment.

Well, it'd be easy to just add pa_high.

 It'll be *clearer* to have two completely separate paths than to fill
 with if() statements.

Well, look at my patches. See how each new feature basically adds *one*
if statement.

Yes, we could declare all existing features to be required,
this is what you are advocating. But in a couple of years
more if statements have accumulated and we still don't have
a flying car. Meanwhile the driver has a huge legacy section
which is a huge pain to maintain.

 And a rewrite won't hurt the driver.

Wow. Is everyone going for the rewrite these days?
After a lot of work we will be back at the point where we left off,
with a minor tweak to enable a huge number of slow devices.
Meanwhile we have no time to work on real problems,
such as ring fragmentation that Avi pointed out,
support for very large requests.


And yes it will hurt, it will hurt bisectability and testability.
What you propose is a huge patch that just changes all of it.
If there's a breakage, bisect just points at a rewrite.
What I would like to see is incremental patches. So I would like to
1. Split driver config from common config
2. Split isr/notifications out too
3. Copy config in MMIO
4. Add a new config
5. Add length checks
6. Add 64 bit features
7. Add alignment programming

At each point we can bisect.  It worked well for event index, qemu had a
flag to turn it off, each time someone came in and went 'maybe it's
event index' we just tested without.

 But to be honest I don't really care about the Linux driver: we're
 steeped in this stuff and we'll get it right.

Maybe. What about windows drivers?

 But I'm *terrified* of making the spec more complex;

All you do is move stuff around. Why do you think it simplifies the spec
so much?

 implementations will get it wrong.

Just to clarify: you assume that if virtio pci spec is changed, then
there will be many more new implementations than existing ones, so it is
worth it to make life (temporarily, for several short years) harder for
all the existing ones?

Look at the qemu side for example. The way I proposed - with optional
capabilities, each one with a separate fallback - the change can be
implemented very gradually.  Each step will be bisectable, and testable.

 I *really* want to banish the legacy stuff to an appendix where no one
 will ever know it's there :)

This does not make life easy for implementations as they
need to implement it anyway. What makes life easy is
making new features optional, so you pick just what you like.

For example, we might want to backport some new feature into
rhel6.X at some point. I would like the diff to be small,
and easily understandable in its impact.


-- 
MST

