Re: qemu-kvm requires apic initialized before vcpu main loop

2009-12-10 Thread Jan Kiszka
Gleb Natapov wrote:
 On Wed, Dec 09, 2009 at 10:01:36PM +0100, Jan Kiszka wrote:
 Gleb Natapov wrote:
 On Wed, Dec 09, 2009 at 09:09:54PM +0100, Jan Kiszka wrote:
 Gleb Natapov wrote:
 On Wed, Dec 09, 2009 at 07:23:38PM +0100, Jan Kiszka wrote:
 Marcelo Tosatti wrote:
 Otherwise a zero apic base is loaded into KVM, which results
 in interrupts being lost until a proper apic base with enabled 
 bit set is loaded.

 Fixes WinXP migration in qemu-kvm origin/next.

 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

 diff --git a/hw/apic.c b/hw/apic.c
 index 627ff98..45a4d2b 100644
 --- a/hw/apic.c
 +++ b/hw/apic.c
 @@ -1131,6 +1131,11 @@ int apic_init(CPUState *env)
  vmstate_register(s->idx, &vmstate_apic, s);
  qemu_register_reset(apic_reset, s);
  
 +/* apic_reset must be called before the vcpu threads are initialized
 + * and load registers, in qemu-kvm.
 + */
 +apic_reset(s);
 +
  local_apics[s->idx] = s;
  return 0;
  }
 Heals the issue I saw with Win2003 Server as well.

 Looks all a bit messy, though. Hope we can establish a more regular and
 less fragile model in the mid-term. I wonder if it wouldn't be better to
 do write-back of the local APIC state along with the register state on
 vmrun (and only there!). The same would apply to things like mpstate,
 Write-back of mp state there is a bug and introduces races. Doing write-back
 of the whole APIC state there looks like a recipe for disaster.
 Please read the full suggestion: We will only write-back if we were
 going through a reset or vmload before. That removes the ugly kvm hooks
 from generic code and ensures proper ordering w.r.t. other write-backs.
 IMHO, anything else will continue to cause headaches like the above for us.

 We can't postpone APIC loading till vmrun. This will race with
 devices/other vcpus sending interrupts to the vcpu. APIC state of all
 vcpus should be up-to-date _before_ any vcpu or main loop starts
 running.
 Simple to solve, just add another write-back point: vm_start.

 So what's the point of having write-back for the APIC in vmrun? It is
 always wrong to do it there.

Write-back is arch-specific and will never show up directly in vmrun or
other generic code. Will update my original patch to make this
discussion more concrete.
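The deferred write-back model discussed above (mark arch state dirty on reset or vmload, flush it to KVM only at well-defined points such as vm_start) can be illustrated with a minimal sketch. All names here (apic_mark_dirty, kvm_put_apic, apic_state_dirty) are hypothetical stand-ins, not actual qemu-kvm API:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of the discussed model: the APIC state is written
 * back to the kernel only at well-defined points, guarded by a dirty
 * flag, instead of via ad-hoc kvm hooks in generic code. */
static bool apic_state_dirty;
static int puts_done;                 /* counts actual write-backs */

static void kvm_put_apic(void)        /* stand-in for the real ioctl path */
{
    puts_done++;
}

static void apic_mark_dirty(void)     /* called from reset or vmload */
{
    apic_state_dirty = true;
}

static void apic_writeback_if_dirty(void)  /* called from vm_start/vmrun */
{
    if (apic_state_dirty) {
        kvm_put_apic();
        apic_state_dirty = false;
    }
}
```

The point of the flag is that repeated write-back points stay cheap: only the first call after a reset or vmload actually touches the kernel.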

Jan





Re: [PATCH] KVM: VMX: Add instruction rdtscp support for guest

2009-12-10 Thread Avi Kivity

On 12/10/2009 03:17 AM, Sheng Yang wrote:
Yeah, I realized this later last night... Though it passed my testing 
program.


Please post the test program as well... it's now easy to extend 
kvm/user/test since it runs with qemu -kernel.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: qemu-kvm requires apic initialized before vcpu main loop

2009-12-10 Thread Avi Kivity

On 12/10/2009 11:33 AM, Avi Kivity wrote:

On 12/09/2009 07:46 PM, Marcelo Tosatti wrote:

Otherwise a zero apic base is loaded into KVM, which results
in interrupts being lost until a proper apic base with enabled
bit set is loaded.

Fixes WinXP migration in qemu-kvm origin/next.


Thanks, applied.



btw, generalization is still welcome, I agree that we need something 
better here.  I applied it since the bug is blocking development.


--
error compiling committee.c: too many arguments to function



Re: [PATCH] Enable non page boundary BAR device assignment

2009-12-10 Thread Michael S. Tsirkin
On Thu, Dec 10, 2009 at 07:16:04AM +0200, Muli Ben-Yehuda wrote:
 On Wed, Dec 09, 2009 at 06:38:54PM +0100, Alexander Graf wrote:
 
  While trying to get device passthrough working with an emulex hba,
  kvm refused to pass it through because it has a BAR of 256 bytes:
 
  Region 0: Memory at d210 (64-bit, non-prefetchable) [size=4K]
  Region 2: Memory at d2101000 (64-bit, non-prefetchable) [size=256]
  Region 4: I/O ports at b100 [size=256]
 
  Since the page boundary is an arbitrary optimization to allow 1:1
  mapping of physical to virtual addresses, we can still take the old
  MMIO callback route.
 
 So let's add a second code path that allows for size & 0xFFF != 0
  sized regions by looping it through userspace.
 
 That makes sense in general *but* the 4K-aligned check isn't just an
 optimization, it also has a security implication. Consider the
 theoretical case where a multi-function device has BARs for two
 functions on the same page (within a 4K boundary), and each function
 is assigned to a different guest. With your current patch both guests
 will be able to write to each other's BARs. Another case is where a
 device has a bug and you must not write beyond the BAR or Bad Things
 Happen. With this patch an *unprivileged* guest could exploit that bug
 and make bad things happen.
 
 This can be fixed if the slow userspace mmio path checks that all MMIO
 accesses by a guest fall within the portion of the page that is
 assigned to it.
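The check Muli asks for can be sketched as a simple range predicate: every slow-path MMIO access must fall entirely inside the assigned BAR's slice of the shared page. This is an illustrative sketch, not the actual qemu assigned-device code; bar_base/bar_size are assumed fields:

```c
#include <assert.h>
#include <stdint.h>

/* Return 1 iff the access [addr, addr+len) lies entirely within the
 * assigned BAR [bar_base, bar_base+bar_size); rejects wrap-around. */
static int mmio_access_ok(uint64_t bar_base, uint64_t bar_size,
                          uint64_t addr, uint64_t len)
{
    if (addr < bar_base || len == 0 || len > bar_size) {
        return 0;
    }
    /* subtraction form avoids overflow in addr + len */
    return addr - bar_base <= bar_size - len;
}
```

With such a check in the userspace exit path, a guest sharing a 4K page with another function's BAR cannot read or write outside its own 256-byte region.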

This patch seems to implement range checks correctly,
let me know if I am missing something.

One also notes that we currently link qemu with libpci,
which I think requires the admin cap to work.
However, in the future we might extend this to
also support getting device fds over a unix socket
from a higher privileged process.

If or when this is done, we will have to be
extra careful when passing a
device file descriptor to an unprivileged qemu process if
the BARs are less than a full page in size: mapping
such a BAR will allow qemu access outside this BAR.

A possible solution to this problem
if/when it arises would be adding yet another sysfs file
for each resource, which would allow read/write but not
mmap access, and perform range checks in the kernel.


 Cheers,
 Muli
 -- 
 Muli Ben-Yehuda | m...@il.ibm.com | +972-4-8281080
 Manager, Virtualization and Systems Architecture
 Master Inventor, IBM Research -- Haifa
 Second Workshop on I/O Virtualization (WIOV '10):
 http://sysrun.haifa.il.ibm.com/hrl/wiov2010/


Re: [PATCH] Enable non page boundary BAR device assignment

2009-12-10 Thread Alexander Graf

On 10.12.2009, at 10:43, Michael S. Tsirkin wrote:

 On Thu, Dec 10, 2009 at 07:16:04AM +0200, Muli Ben-Yehuda wrote:
 On Wed, Dec 09, 2009 at 06:38:54PM +0100, Alexander Graf wrote:
 
 While trying to get device passthrough working with an emulex hba,
 kvm refused to pass it through because it has a BAR of 256 bytes:
 
Region 0: Memory at d210 (64-bit, non-prefetchable) [size=4K]
Region 2: Memory at d2101000 (64-bit, non-prefetchable) [size=256]
Region 4: I/O ports at b100 [size=256]
 
 Since the page boundary is an arbitrary optimization to allow 1:1
 mapping of physical to virtual addresses, we can still take the old
 MMIO callback route.
 
  So let's add a second code path that allows for size & 0xFFF != 0
 sized regions by looping it through userspace.
 
 That makes sense in general *but* the 4K-aligned check isn't just an
 optimization, it also has a security implication. Consider the
  theoretical case where a multi-function device has BARs for two
 functions on the same page (within a 4K boundary), and each function
 is assigned to a different guest. With your current patch both guests
 will be able to write to each other's BARs. Another case is where a
 device has a bug and you must not write beyond the BAR or Bad Things
 Happen. With this patch an *unprivileged* guest could exploit that bug
 and make bad things happen.
 
 This can be fixed if the slow userspace mmio path checks that all MMIO
 accesses by a guest fall within the portion of the page that is
 assigned to it.
 
 This patch seems to implement range checks correctly,
 let me know if I am missing something.
 
 One also notes that we currently link qemu with libpci
 which I think requires admin cap to work.
 However, in the future we might extend this to
 also support getting device fds over a unix socket
  from a higher privileged process.
 
 If or when this is done, we will have to be
 extra careful when passing
  device file descriptor to an unprivileged qemu process if
 the BARs are less than full page in size: mapping
 such BAR will allow qemu access outside this BAR.
 
 A possible solution to this problem
 if/when it arises would be adding yet another sysfs file
 for each resource, which would allow read/write but not
 mmap access, and perform range checks in the kernel.

Sounds like the best solution to this problem, yeah. Though we'd only need 
those for non-page-boundary BARs. So I guess the best would be to always export 
them from the kernel, but only use them when BAR & (PAGE_SIZE-1).
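Alex's policy above reduces to a one-line decision: mmap the region directly only when it covers whole pages, otherwise fall back to the slow, range-checked path. A hedged sketch, with an assumed 4K page size rather than the kernel's PAGE_SIZE macro:

```c
#include <assert.h>

/* Illustrative page size; the real code would use PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096ULL

enum map_kind { MAP_DIRECT, MAP_SLOW_PATH };

/* A BAR whose size is not a whole multiple of the page size shares a
 * page with foreign registers, so it must take the slow path. */
static enum map_kind choose_mapping(unsigned long long bar_size)
{
    if (bar_size & (SKETCH_PAGE_SIZE - 1)) {
        return MAP_SLOW_PATH;
    }
    return MAP_DIRECT;
}
```

For the emulex hba in the original report, Region 0 (4K) would map directly while Region 2 (256 bytes) would be looped through userspace.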

Either way, FWIW the device assignment stuff needs to be completely rewritten 
for qemu upstream anyway. So while it's good to collect ideas for now, let's 
not put too much effort code-wise into the current code (unless it doesn't 
work).

Alex


Re: [PATCH] Enable non page boundary BAR device assignment

2009-12-10 Thread Alexander Graf

On 10.12.2009, at 10:52, Alexander Graf wrote:

 
 On 10.12.2009, at 10:43, Michael S. Tsirkin wrote:
 
 On Thu, Dec 10, 2009 at 07:16:04AM +0200, Muli Ben-Yehuda wrote:
 On Wed, Dec 09, 2009 at 06:38:54PM +0100, Alexander Graf wrote:
 
 While trying to get device passthrough working with an emulex hba,
 kvm refused to pass it through because it has a BAR of 256 bytes:
 
   Region 0: Memory at d210 (64-bit, non-prefetchable) [size=4K]
   Region 2: Memory at d2101000 (64-bit, non-prefetchable) [size=256]
   Region 4: I/O ports at b100 [size=256]
 
 Since the page boundary is an arbitrary optimization to allow 1:1
 mapping of physical to virtual addresses, we can still take the old
 MMIO callback route.
 
  So let's add a second code path that allows for size & 0xFFF != 0
 sized regions by looping it through userspace.
 
 That makes sense in general *but* the 4K-aligned check isn't just an
 optimization, it also has a security implication. Consider the
  theoretical case where a multi-function device has BARs for two
 functions on the same page (within a 4K boundary), and each function
 is assigned to a different guest. With your current patch both guests
 will be able to write to each other's BARs. Another case is where a
 device has a bug and you must not write beyond the BAR or Bad Things
 Happen. With this patch an *unprivileged* guest could exploit that bug
 and make bad things happen.
 
 This can be fixed if the slow userspace mmio path checks that all MMIO
 accesses by a guest fall within the portion of the page that is
 assigned to it.
 
 This patch seems to implement range checks correctly,
 let me know if I am missing something.
 
 One also notes that we currently link qemu with libpci
 which I think requires admin cap to work.
 However, in the future we might extend this to
 also support getting device fds over a unix socket
  from a higher privileged process.
 
 If or when this is done, we will have to be
 extra careful when passing
  device file descriptor to an unprivileged qemu process if
 the BARs are less than full page in size: mapping
 such BAR will allow qemu access outside this BAR.
 
 A possible solution to this problem
 if/when it arises would be adding yet another sysfs file
 for each resource, which would allow read/write but not
 mmap access, and perform range checks in the kernel.
 
 Sounds like the best solution to this problem, yeah. Though we'd only need 
 those for non-page-boundary BARs. So I guess the best would be to always 
  export them from the kernel, but only use them when BAR & (PAGE_SIZE-1).

Hm, or add read/write fd functions that always do boundary checks to the 
existing interface and only allow mmap on size >= PAGE_SIZE. Or only allow 
non-aligned mmap when the admin cap is present.

Alex


[Autotest] [PATCH] KVM test: subtest stress_boot: Fix a bug that cloned VMs are not screendumped

2009-12-10 Thread Yolkfull Chow
We just used vm.create() to create those cloned VMs but ignored catching
screendumps of them. This patch fixes this problem.

Signed-off-by: Yolkfull Chow yz...@redhat.com
---
 client/tests/kvm/tests/stress_boot.py |8 +++-
 1 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/client/tests/kvm/tests/stress_boot.py 
b/client/tests/kvm/tests/stress_boot.py
index 2a2e933..0b5ec02 100644
--- a/client/tests/kvm/tests/stress_boot.py
+++ b/client/tests/kvm/tests/stress_boot.py
@@ -1,6 +1,6 @@
 import logging, time
 from autotest_lib.client.common_lib import error
-import kvm_subprocess, kvm_test_utils, kvm_utils
+import kvm_subprocess, kvm_test_utils, kvm_utils, kvm_preprocessing
 
 
 def run_stress_boot(tests, params, env):
@@ -39,11 +39,9 @@ def run_stress_boot(tests, params, env):
  vm_params["address_index"] = str(address_index)
 curr_vm = vm.clone(vm_name, vm_params)
 kvm_utils.env_register_vm(env, vm_name, curr_vm)
-params['vms'] += " " + vm_name
-
  logging.info("Booting guest #%d" % num)
-if not curr_vm.create():
-raise error.TestFail("Cannot create VM #%d" % num)
+kvm_preprocessing.preprocess_vm(tests, vm_params, env, vm_name)
+params['vms'] += " " + vm_name
 
  curr_vm_session = kvm_utils.wait_for(curr_vm.remote_login, 240, 0, 2)
 if not curr_vm_session:
-- 
1.6.5.3
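The patch above relies on the polling helper kvm_utils.wait_for. A minimal sketch of its semantics, inferred from the call site (func, timeout, start delay, step) rather than from the actual kvm_utils source:

```python
import time


def wait_for(func, timeout, first=0.0, step=1.0):
    """Poll func() until it returns a truthy value or timeout expires.

    Returns func()'s value on success, None on timeout. Sketch of the
    assumed kvm_utils.wait_for contract, not the real implementation.
    """
    time.sleep(first)            # optional initial delay
    end_time = time.time() + timeout
    while time.time() < end_time:
        result = func()
        if result:
            return result
        time.sleep(step)         # back off between polls
    return None
```

So `wait_for(curr_vm.remote_login, 240, 0, 2)` retries the login every 2 seconds for up to 240 seconds, returning the session object or None.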



Re: [PATCH] Enable non page boundary BAR device assignment

2009-12-10 Thread Muli Ben-Yehuda
On Thu, Dec 10, 2009 at 10:35:41AM +0100, Alexander Graf wrote:

 That's exactly what this patch does. It only allows access to the
 region defined in the BAR.

Sorry, missed it!

Cheers,
Muli


Re: [PATCH] Enable non page boundary BAR device assignment

2009-12-10 Thread Muli Ben-Yehuda
On Thu, Dec 10, 2009 at 10:52:46AM +0100, Alexander Graf wrote:

 Either way, FWIW the device assignment stuff needs to be completely
 rewritten for qemu upstream anyways. So while it's good to collect
  ideas for now, let's not put too much effort code-wise into the
 current code (unless it doesn't work).

What do you have in mind for such a rewrite?

Cheers,
Muli


Re: [PATCH] Enable non page boundary BAR device assignment

2009-12-10 Thread Michael S. Tsirkin
On Thu, Dec 10, 2009 at 11:08:58AM +0100, Alexander Graf wrote:
 
 On 10.12.2009, at 10:52, Alexander Graf wrote:
 
  
  On 10.12.2009, at 10:43, Michael S. Tsirkin wrote:
  
  On Thu, Dec 10, 2009 at 07:16:04AM +0200, Muli Ben-Yehuda wrote:
  On Wed, Dec 09, 2009 at 06:38:54PM +0100, Alexander Graf wrote:
  
  While trying to get device passthrough working with an emulex hba,
  kvm refused to pass it through because it has a BAR of 256 bytes:
  
Region 0: Memory at d210 (64-bit, non-prefetchable) [size=4K]
Region 2: Memory at d2101000 (64-bit, non-prefetchable) [size=256]
Region 4: I/O ports at b100 [size=256]
  
  Since the page boundary is an arbitrary optimization to allow 1:1
  mapping of physical to virtual addresses, we can still take the old
  MMIO callback route.
  
   So let's add a second code path that allows for size & 0xFFF != 0
  sized regions by looping it through userspace.
  
  That makes sense in general *but* the 4K-aligned check isn't just an
  optimization, it also has a security implication. Consider the
   theoretical case where a multi-function device has BARs for two
  functions on the same page (within a 4K boundary), and each function
  is assigned to a different guest. With your current patch both guests
  will be able to write to each other's BARs. Another case is where a
  device has a bug and you must not write beyond the BAR or Bad Things
  Happen. With this patch an *unprivileged* guest could exploit that bug
  and make bad things happen.
  
  This can be fixed if the slow userspace mmio path checks that all MMIO
  accesses by a guest fall within the portion of the page that is
  assigned to it.
  
  This patch seems to implement range checks correctly,
  let me know if I am missing something.
  
  One also notes that we currently link qemu with libpci
  which I think requires admin cap to work.
  However, in the future we might extend this to
  also support getting device fds over a unix socket
   from a higher privileged process.
  
  If or when this is done, we will have to be
  extra careful when passing
   device file descriptor to an unprivileged qemu process if
  the BARs are less than full page in size: mapping
  such BAR will allow qemu access outside this BAR.
  
  A possible solution to this problem
  if/when it arises would be adding yet another sysfs file
  for each resource, which would allow read/write but not
  mmap access, and perform range checks in the kernel.
  
  Sounds like the best solution to this problem, yeah. Though we'd only need 
  those for non-page-boundary BARs. So I guess the best would be to always 
  export them from the kernel, but only use them when BAR & (PAGE_SIZE-1).
 
 Hm, or add read/write fd functions that always do boundary checks to the 
  existing interface and only allow mmap on size >= PAGE_SIZE. Or only allow 
 non-aligned mmap when the admin cap is present.
 
 Alex

This might break existing applications.
We don't want that.


Re: [PATCH] Enable non page boundary BAR device assignment

2009-12-10 Thread Alexander Graf

On 10.12.2009, at 11:23, Muli Ben-Yehuda wrote:

 On Thu, Dec 10, 2009 at 10:52:46AM +0100, Alexander Graf wrote:
 
 Either way, FWIW the device assignment stuff needs to be completely
 rewritten for qemu upstream anyways. So while it's good to collect
  ideas for now, let's not put too much effort code-wise into the
 current code (unless it doesn't work).
 
 What do you have in mind for such a rewrite?

I'd like to see it better abstracted and more versatile. I don't see an obvious 
reason why we shouldn't be able to use a physical device in a TCG target :-).

Alex


Re: [PATCH] Enable non page boundary BAR device assignment

2009-12-10 Thread Alexander Graf

On 10.12.2009, at 11:27, Michael S. Tsirkin wrote:

 On Thu, Dec 10, 2009 at 11:08:58AM +0100, Alexander Graf wrote:
 
 On 10.12.2009, at 10:52, Alexander Graf wrote:
 
 
 On 10.12.2009, at 10:43, Michael S. Tsirkin wrote:
 
 On Thu, Dec 10, 2009 at 07:16:04AM +0200, Muli Ben-Yehuda wrote:
 On Wed, Dec 09, 2009 at 06:38:54PM +0100, Alexander Graf wrote:
 
 While trying to get device passthrough working with an emulex hba,
 kvm refused to pass it through because it has a BAR of 256 bytes:
 
  Region 0: Memory at d210 (64-bit, non-prefetchable) [size=4K]
  Region 2: Memory at d2101000 (64-bit, non-prefetchable) [size=256]
  Region 4: I/O ports at b100 [size=256]
 
 Since the page boundary is an arbitrary optimization to allow 1:1
 mapping of physical to virtual addresses, we can still take the old
 MMIO callback route.
 
  So let's add a second code path that allows for size & 0xFFF != 0
 sized regions by looping it through userspace.
 
 That makes sense in general *but* the 4K-aligned check isn't just an
 optimization, it also has a security implication. Consider the
  theoretical case where a multi-function device has BARs for two
 functions on the same page (within a 4K boundary), and each function
 is assigned to a different guest. With your current patch both guests
 will be able to write to each other's BARs. Another case is where a
 device has a bug and you must not write beyond the BAR or Bad Things
 Happen. With this patch an *unprivileged* guest could exploit that bug
 and make bad things happen.
 
 This can be fixed if the slow userspace mmio path checks that all MMIO
 accesses by a guest fall within the portion of the page that is
 assigned to it.
 
 This patch seems to implement range checks correctly,
 let me know if I am missing something.
 
 One also notes that we currently link qemu with libpci
 which I think requires admin cap to work.
 However, in the future we might extend this to
 also support getting device fds over a unix socket
  from a higher privileged process.
 
 If or when this is done, we will have to be
 extra careful when passing
  device file descriptor to an unprivileged qemu process if
 the BARs are less than full page in size: mapping
 such BAR will allow qemu access outside this BAR.
 
 A possible solution to this problem
 if/when it arises would be adding yet another sysfs file
 for each resource, which would allow read/write but not
 mmap access, and perform range checks in the kernel.
 
 Sounds like the best solution to this problem, yeah. Though we'd only need 
 those for non-page-boundary BARs. So I guess the best would be to always 
  export them from the kernel, but only use them when BAR & (PAGE_SIZE-1).
 
 Hm, or add read/write fd functions that always do boundary checks to the 
  existing interface and only allow mmap on size >= PAGE_SIZE. Or only allow 
 non-aligned mmap when the admin cap is present.
 
 Alex
 
 This might break existing applications.
 We don't want that.

Well currently you can't mmap the resource at all without at least r/w rights 
on the file, right?
But yeah, we'd probably change behavior that could break someone - sigh.

Alex


Re: [PATCH] Enable non page boundary BAR device assignment

2009-12-10 Thread Avi Kivity

On 12/09/2009 11:06 PM, Alexander Graf wrote:


Am 09.12.2009 um 21:49 schrieb Michael S. Tsirkin m...@redhat.com:


On Wed, Dec 09, 2009 at 06:38:54PM +0100, Alexander Graf wrote:

While trying to get device passthrough working with an emulex hba, kvm
refused to pass it through because it has a BAR of 256 bytes:

   Region 0: Memory at d210 (64-bit, non-prefetchable) [size=4K]
   Region 2: Memory at d2101000 (64-bit, non-prefetchable) [size=256]
   Region 4: I/O ports at b100 [size=256]

Since the page boundary is an arbitrary optimization to allow 1:1 mapping of
physical to virtual addresses, we can still take the old MMIO callback route.

So let's add a second code path that allows for size & 0xFFF != 0 sized
regions by looping it through userspace.

I verified that it works by passing through an e1000 with this additional
patch applied and the card acted the same way it did without this patch:

map_func = assigned_dev_iomem_map;
-if (cur_region->size & 0xFFF) {
+if (i != PCI_ROM_SLOT) {
 fprintf(stderr, "PCI region %d at address 0x%llx 

Signed-off-by: Alexander Graf ag...@suse.de


Good idea.

One thing I am concerned with, is that some users might
see performance degradation and not see the error message.
Maybe we can have a flag to optionally fail unless
passthrough can be fast? I'm not really sure it's worth it
to add such a flag though - what do others think?


It only kicks in on small BARs which are usually not _that_ 
performance critical. We're mostly talking about ack and status bits.




Well, ack and status are pretty important if accessed every interrupt.


Also, not failing > failing IMHO :). Even if it's not as fast as native.


Yes.

--
error compiling committee.c: too many arguments to function



Re: [PATCH] Enable non page boundary BAR device assignment

2009-12-10 Thread Muli Ben-Yehuda
On Thu, Dec 10, 2009 at 11:31:01AM +0100, Alexander Graf wrote:

  What do you have in mind for such a rewrite?
 
 I'd like to see it more well-abstracted and versatile. I don't see
 an obvious reason why we shouldn't be able to use a physical device
 in a TCG target :-).

mmio and pio are easy; for DMA you'd need an IOMMU for security (or
whatever uio does, just for translation), and interrupts you probably
get for free from uio. Seems eminently doable to me. Why you'd want to
is another matter :-)

Cheers,
Muli


Re: [PATCH] Enable non page boundary BAR device assignment

2009-12-10 Thread Michael S. Tsirkin
On Thu, Dec 10, 2009 at 11:31:54AM +0100, Alexander Graf wrote:
 
 On 10.12.2009, at 11:27, Michael S. Tsirkin wrote:
 
  On Thu, Dec 10, 2009 at 11:08:58AM +0100, Alexander Graf wrote:
  
  On 10.12.2009, at 10:52, Alexander Graf wrote:
  
  
  On 10.12.2009, at 10:43, Michael S. Tsirkin wrote:
  
  On Thu, Dec 10, 2009 at 07:16:04AM +0200, Muli Ben-Yehuda wrote:
  On Wed, Dec 09, 2009 at 06:38:54PM +0100, Alexander Graf wrote:
  
  While trying to get device passthrough working with an emulex hba,
  kvm refused to pass it through because it has a BAR of 256 bytes:
  
   Region 0: Memory at d210 (64-bit, non-prefetchable) [size=4K]
   Region 2: Memory at d2101000 (64-bit, non-prefetchable) [size=256]
   Region 4: I/O ports at b100 [size=256]
  
  Since the page boundary is an arbitrary optimization to allow 1:1
  mapping of physical to virtual addresses, we can still take the old
  MMIO callback route.
  
   So let's add a second code path that allows for size & 0xFFF != 0
  sized regions by looping it through userspace.
  
  That makes sense in general *but* the 4K-aligned check isn't just an
  optimization, it also has a security implication. Consider the
   theoretical case where a multi-function device has BARs for two
  functions on the same page (within a 4K boundary), and each function
  is assigned to a different guest. With your current patch both guests
  will be able to write to each other's BARs. Another case is where a
  device has a bug and you must not write beyond the BAR or Bad Things
  Happen. With this patch an *unprivileged* guest could exploit that bug
  and make bad things happen.
  
  This can be fixed if the slow userspace mmio path checks that all MMIO
  accesses by a guest fall within the portion of the page that is
  assigned to it.
  
  This patch seems to implement range checks correctly,
  let me know if I am missing something.
  
  One also notes that we currently link qemu with libpci
  which I think requires admin cap to work.
  However, in the future we might extend this to
  also support getting device fds over a unix socket
   from a higher privileged process.
  
  If or when this is done, we will have to be
  extra careful when passing
   device file descriptor to an unprivileged qemu process if
  the BARs are less than full page in size: mapping
  such BAR will allow qemu access outside this BAR.
  
  A possible solution to this problem
  if/when it arises would be adding yet another sysfs file
  for each resource, which would allow read/write but not
  mmap access, and perform range checks in the kernel.
  
  Sounds like the best solution to this problem, yeah. Though we'd only 
  need those for non-page-boundary BARs. So I guess the best would be to 
   always export them from the kernel, but only use them when BAR & 
   (PAGE_SIZE-1).
  
  Hm, or add read/write fd functions that always do boundary checks to the 
   existing interface and only allow mmap on size >= PAGE_SIZE. Or only allow 
  non-aligned mmap when the admin cap is present.
  
  Alex
  
  This might break existing applications.
  We don't want that.
 
 Well currently you can't mmap the resource at all without at least r/w rights 
 on the file, right?

You could have dropped the cap or got the fd from another
process.

 But yeah, we'd probably change behavior that could break someone - sigh.
 
 Alex


Re: [PATCH] Enable non page boundary BAR device assignment

2009-12-10 Thread Michael S. Tsirkin
On Thu, Dec 10, 2009 at 12:37:37PM +0200, Muli Ben-Yehuda wrote:
 On Thu, Dec 10, 2009 at 11:31:01AM +0100, Alexander Graf wrote:
 
   What do you have in mind for such a rewrite?
  
  I'd like to see it more well-abstracted and versatile. I don't see
  an obvious reason why we shouldn't be able to use a physical device
  in a TCG target :-).
 
 mmio and pio are easy, DMA you'd need an IOMMU for security, or
 whatever uio does just for translation,

uio currently does not support DMA, but I plan to fix this.

 and interrupts you probably
 get for free from uio. Seems eminently doable to me. Why you'd want to
 is another matter :-)
 
 Cheers,
 Muli

The list above ignores the biggest issue: you would have to change TCG
code generation to make this work.

For example, I think a read memory barrier is currently ignored in
translation, and host CPU will reorder reads.  Some drivers might also
rely on ordering guarantees that depend on CPU cacheline sizes.  Atomics
are another bag of tricks, but I expect atomics on DMA memory are not
widely used.

I am not sure this problem is solvable unless host and guest
architectures are very similar.
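The driver pattern MST alludes to can be sketched on the host side: load a status flag, issue a read barrier, then load the payload. If the guest's barrier instruction were dropped during TCG translation, a weakly ordered host CPU could reorder the two loads. The barrier macro below is a host-side stand-in (a GCC builtin full barrier), purely illustrative:

```c
#include <assert.h>

/* Full barrier as a stand-in for the guest driver's rmb(). */
#define rmb() __sync_synchronize()

static volatile int status_ready;  /* device sets this when data is valid */
static volatile int payload;       /* device-written data */

/* Typical driver read path: check the flag, fence, then read the data.
 * If the fence disappears in translation, the payload load may be
 * reordered before the flag check on the host CPU. */
static int read_device(void)
{
    if (!status_ready) {
        return -1;                 /* nothing to read yet */
    }
    rmb();
    return payload;
}
```

Single-threaded code cannot demonstrate the reordering itself; the sketch only shows where the barrier sits in the pattern that translation must preserve.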

-- 
MST


Re: qemu-kvm requires apic initialized before vcpu main loop

2009-12-10 Thread Glauber Costa
On Wed, Dec 09, 2009 at 06:21:38PM -0200, Marcelo Tosatti wrote:
 On Wed, Dec 09, 2009 at 08:00:41PM +0100, Jan Kiszka wrote:
  Glauber Costa wrote:
   On Wed, Dec 09, 2009 at 03:46:54PM -0200, Marcelo Tosatti wrote:
   Otherwise a zero apic base is loaded into KVM, which results
   in interrupts being lost until a proper apic base with enabled 
   bit set is loaded.
  
   Fixes WinXP migration in qemu-kvm origin/next.
  
   Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
  
   diff --git a/hw/apic.c b/hw/apic.c
   index 627ff98..45a4d2b 100644
   --- a/hw/apic.c
   +++ b/hw/apic.c
   @@ -1131,6 +1131,11 @@ int apic_init(CPUState *env)
 vmstate_register(s->idx, &vmstate_apic, s);
qemu_register_reset(apic_reset, s);

   +/* apic_reset must be called before the vcpu threads are initialized
   + * and load registers, in qemu-kvm.
   + */
   +apic_reset(s);
   +
   But by doing this, the system-wide reset will re-reset the apic,
   possibly losing some other information.
   
   Also, system_reset happens before we signal system_ready (or at least
   should). This means the vcpus should not be running and producing
   anything useful yet. So how does it happen, in the first place?
  
  There is
  
  kvm_arch_load_regs(env);
  
  before qemu_cond_wait in ap_main_loop. Probably part of the reason. Why
  is it there?
 
 Hum ... see how qemu_kvm_load_lapic depends on kvm_vcpu_inited.
 
 kvm_vcpu_ioctl_set_lapic -> kvm_apic_post_state_restore relies on
 proper apicbase set (maybe other reasons too).

Have you tried getting rid of kvm_vcpu_inited()? Now that we are doing
reset after vcpu creation, it is quite possible that it won't be needed
anymore.

Btw, the whole point of this exercise is to try to diminish opportunities
for nasty things like that.


Re: [PATCH] Enable non page boundary BAR device assignment

2009-12-10 Thread Alexander Graf

On 10.12.2009, at 11:56, Michael S. Tsirkin wrote:

 On Thu, Dec 10, 2009 at 12:37:37PM +0200, Muli Ben-Yehuda wrote:
 On Thu, Dec 10, 2009 at 11:31:01AM +0100, Alexander Graf wrote:
 
 What do you have in mind for such a rewrite?
 
 I'd like to see it more well-abstracted and versatile. I don't see
 an obvious reason why we shouldn't be able to use a physical device
 in a TCG target :-).
 
 mmio and pio are easy, DMA you'd need an IOMMU for security, or
 whatever uio does just for translation,
 
 uio currently does not support DMA, but I plan to fix this
 
 and interrupts you probably
 get for free from uio. Seems eminently doable to me. Why you'd want to
 is another matter :-)
 
 Cheers,
 Muli
 
 The list above ignores the biggest issue: you would have to change TCG
 code generation to make this work.
 
 For example, I think a read memory barrier is currently ignored in
 translation, and host CPU will reorder reads.  Some drivers might also
 rely on ordering guarantees that depend on CPU cacheline sizes.  Atomics
 is another bag of tricks but I expect atomics on a DMA memory are not
 widely used.

Since we'd use the mmio callbacks for MMIO we'd be strictly ordered, no?

Alex


Re: Networking-related crash?

2009-12-10 Thread Patrick McHardy
Eric Dumazet wrote:
 Le 09/12/2009 16:11, Avi Kivity a écrit :
 On 12/09/2009 03:46 PM, Adam Huffman wrote:
 I've been seeing lots of crashes on a new Dell Precision T7500,
 running the KVM in Fedora 12.  Finally managed to capture an Oops,
 which is shown below (hand-transcribed):

 BUG: unable to handle kernel paging request at 00200200
 IP: [8139aab7] destroy_conntrack+0x82/0x11f
 PGD 332d0e067 PUD 33453c067 PMD 0
 RIP: 0010:[8139aab7]  [8139aab7]
 destroy_conntrack+0x82/0x11f
 RSP: 0018:c9803bf0  EFLAGS: 00010202
 RAX: 8001 RBX: 816fb1a0 RCX: 752f
 RDX: 00200200 RSI: 0011 RDI: 816fb1a0
 RBP: c9803c00 R08: 880336699438 R09: 00aaa5e0
 R10: 0002f54189d5 R11: 0001 R12: 819a92e0
 R13: a029adcc R14:  R15: 880632866c38
 FS:  7fdd34b17710() GS:c980()
 knlGS:
 CS:  0010 DS: 002B ES: 002B CR0: 80050033
 CR2: 00200200 CR3: 0003349c CR4: 26e0
 DR0:  DR1:  DR2: 
 DR3:  DR6: 0ff0 DR7: 0400
 Process qemu-kvm (pid: 1759, threadinfo 88062e9e8000, task
 880634945e00)
 Stack:
   880632866c00 880634640c30 c9803c10 813989c2
 0  c9803c30 81374092 c9803c30 880632866c00
 0  c9803c50 81373dd3 0002 880632866c00
 Call Trace:
   IRQ
   [813989c2] nf_conntrack_destroy+0x1b/0x1d
   [81374092] skb_release_head_state+0x95/0xd7
   [81373dd3] __kfree_skb+0x16/0x81
   [81373ed7] kfree_skb+0x6a/0x72
   [a029adcc] ip6_mc_input+0x220/0x230 [ipv6]
   [a029a3d1] ip6_rcv_finish+0x27/0x2b [ipv6]
   [a029a763] ipv6_rcv+0x38e/0x3e5 [ipv6]
   [8137bd91] netif_receive_skb+0x402/0x427
   ...

 crash in:
   48 8b 43 08             mov    0x8(%rbx),%rax
   a8 01                   test   $0x1,%al
   48 89 02                mov    %rax,(%rdx)      <== HERE  RDX=0x200200 (LIST_POISON2)
   75 04                   jne    1f
   48 89 50 08             mov    %rdx,0x8(%rax)
 1:48 c7 43 10 00 02 20    movq   $0x200200,0x10(%rbx)
 
   if (!nf_ct_is_confirmed(ct)) {
       BUG_ON(hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode));
       hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);   <== HERE
   }
   NF_CT_STAT_INC(net, delete);


I can't spot the problem. Adam, please send me your .config file.



Re: [PATCH] Enable non page boundary BAR device assignment

2009-12-10 Thread Michael S. Tsirkin
On Thu, Dec 10, 2009 at 12:09:13PM +0100, Alexander Graf wrote:
 
 On 10.12.2009, at 11:56, Michael S. Tsirkin wrote:
 
  On Thu, Dec 10, 2009 at 12:37:37PM +0200, Muli Ben-Yehuda wrote:
  On Thu, Dec 10, 2009 at 11:31:01AM +0100, Alexander Graf wrote:
  
  What do you have in mind for such a rewrite?
  
  I'd like to see it more well-abstracted and versatile. I don't see
  an obvious reason why we shouldn't be able to use a physical device
  in a TCG target :-).
  
  mmio and pio are easy, DMA you'd need an IOMMU for security, or
  whatever uio does just for translation,
  
  uio currently does not support DMA, but I plan to fix this
  
  and interrupts you probably
  get for free from uio. Seems eminently doable to me. Why you'd want to
  is another matter :-)
  
  Cheers,
  Muli
  
  The list above ignores the biggest issue: you would have to change TCG
  code generation to make this work.
  
  For example, I think a read memory barrier is currently ignored in
  translation, and host CPU will reorder reads.  Some drivers might also
  rely on ordering guarantees that depend on CPU cacheline sizes.  Atomics
  is another bag of tricks but I expect atomics on a DMA memory are not
  widely used.
 
 Since we'd use the mmio callbacks for MMIO we'd be strictly ordered, no?
 
 Alex

Not unless you issue appropriate host memory barriers in the mmio
callbacks (kvm currently uses a lock for this, which has an implicit
barrier, but I do not think TCG does this).

But even then, it depends on the device: for some devices, DMA memory
reads/writes might depend on each other. Look at virtio as an example; a
real device might have the same semantics.  As a simpler example, some
devices DMA the following into a ring in host memory to signal that data
is available:
- valid tag
- data length
The host will read the tag and, when it's valid, use the data length.
But e.g. on Intel this only works well if these share a cache line;
otherwise the data read might bypass the tag read and you get invalid data.

-- 
MST


Re: [PATCH] Enable non page boundary BAR device assignment

2009-12-10 Thread Muli Ben-Yehuda
On Thu, Dec 10, 2009 at 12:56:56PM +0200, Michael S. Tsirkin wrote:

  mmio and pio are easy, DMA you'd need an IOMMU for security, or
  whatever uio does just for translation,
 
 uio currently does not support DMA, but I plan to fix this

With or without an IOMMU?

  and interrupts you probably get for free from uio. Seems eminently
  doable to me. Why you'd want to is another matter :-)
 
 The list above ignores the biggest issue: you would have to change
 TCG code generation to make this work.

Yep, I know nothing about TCG, only looking at this from the device
interaction side.

 I am not sure this problem is solvable unless host and guest
 architectures are very similar.

Now you are ignoring the most interesting issue, namely, why would you
want to solve it? What is the value of device assignment for TCG
targets?

Cheers,
Muli
-- 
Muli Ben-Yehuda | m...@il.ibm.com | +972-4-8281080
Manager, Virtualization and Systems Architecture
Master Inventor, IBM Research -- Haifa
Second Workshop on I/O Virtualization (WIOV '10):
http://sysrun.haifa.il.ibm.com/hrl/wiov2010/


Re: [PATCH] Enable non page boundary BAR device assignment

2009-12-10 Thread Michael S. Tsirkin
On Thu, Dec 10, 2009 at 12:34:38PM +0100, Alexander Graf wrote:
 
 On 10.12.2009, at 12:28, Muli Ben-Yehuda wrote:
 
  On Thu, Dec 10, 2009 at 12:56:56PM +0200, Michael S. Tsirkin wrote:
  
  mmio and pio are easy, DMA you'd need an IOMMU for security, or
  whatever uio does just for translation,
  
  uio currently does not support DMA, but I plan to fix this
  
  With or without an IOMMU?
  
  and interrupts you probably get for free from uio. Seems eminently
  doable to me. Why you'd want to is another matter :-)
  
  The list above ignores the biggest issue: you would have to change
  TCG code generation to make this work.
  
  Yep, I know nothing about TCG, only looking at this from the device
  interaction side.
  
  I am not sure this problem is solvable unless host and guest
  architectures are very similar.
  
  Now you are ignoring the most interesting issue, namely, why would you
  want to solve it? What is the value of device assignment for TCG
  targets?
 
 Why would you want to emulate a machine at all? ;-)
 
 Imagine you were a hardware manufacturer who develops some self-made 
 hardware. That manufacturer of course has x86 boxes. Chances are pretty 
 low that he has PPC ones, and should we ever get MMIO/PCI on S390, he 
 definitely does not have those.
 
 Still, while developing his driver it'd be really valuable to know that it 
 works on other architectures as well. Especially since he can claim support 
 for architectures he doesn't have himself. At least until the first customer 
 arrives :-).
 
 While this is a scenario I just made up and don't really have myself, its 
 goal is to illustrate why we shouldn't close doors that don't have to be 
 closed. If TCG doesn't deal with it properly, that doesn't mean we shouldn't 
 work on making the other end compatible.
 

Well, it does IMO mean that such a project should not be a blocker for
passthrough in upstream qemu, after fixing endian-ness issues and
generally cleaning up the code.


 Alex


Re: [PATCH] Enable non page boundary BAR device assignment

2009-12-10 Thread Gleb Natapov
On Thu, Dec 10, 2009 at 01:21:40PM +0200, Michael S. Tsirkin wrote:
 On Thu, Dec 10, 2009 at 12:09:13PM +0100, Alexander Graf wrote:
  
  On 10.12.2009, at 11:56, Michael S. Tsirkin wrote:
  
   On Thu, Dec 10, 2009 at 12:37:37PM +0200, Muli Ben-Yehuda wrote:
   On Thu, Dec 10, 2009 at 11:31:01AM +0100, Alexander Graf wrote:
   
   What do you have in mind for such a rewrite?
   
   I'd like to see it more well-abstracted and versatile. I don't see
   an obvious reason why we shouldn't be able to use a physical device
   in a TCG target :-).
   
   mmio and pio are easy, DMA you'd need an IOMMU for security, or
   whatever uio does just for translation,
   
   uio currently does not support DMA, but I plan to fix this
   
   and interrupts you probably
   get for free from uio. Seems eminently doable to me. Why you'd want to
   is another matter :-)
   
   Cheers,
   Muli
   
   The list above ignores the biggest issue: you would have to change TCG
   code generation to make this work.
   
   For example, I think a read memory barrier is currently ignored in
   translation, and host CPU will reorder reads.  Some drivers might also
   rely on ordering guarantees that depend on CPU cacheline sizes.  Atomics
   is another bag of tricks but I expect atomics on a DMA memory are not
   widely used.
  
  Since we'd use the mmio callbacks for MMIO we'd be strictly ordered, no?
  
  Alex
 
 Not unless you issue appropriate host memory barriers on mmio callbacks (kvm 
 currently
 uses a lock for this, which has an implicit barrier, but I do not
 think TCG does this).
 
 But even then, it depends on the device, for some devices DMA memory
 reads/writes might depend on each other. Look at virtio as an example, a
 real device might have the same semantics.  As a simpler example, some
 devices DMA the following into ring in host memory to signal data
 available:
 - valid tag
 - data length
 host will read tag, and when it's valid use data length,
Maybe it should be:
- data length
- valid tag
Otherwise how can you guarantee that, at the time the tag is valid, the
data length is already up-to-date and not still in the process of being
written? And on an arch such as Altix, DMA from device to memory can
arrive out of order if the memory is not mapped in a special way, but
then DMA is slow.

--
Gleb.


Re: libvirk KVM save (migrate) terribly slow

2009-12-10 Thread Pierre Riteau
On 10 déc. 2009, at 12:44, Daniel P. Berrange wrote:

 On Thu, Dec 10, 2009 at 08:05:12AM +0100, Nikola Ciprich wrote:
 Hi,
 I noticed that using libvirt to save a running domain is terribly slow:
 it can take even 10 minutes to save a VM with 2GB RAM, although the
 host is almost idle (8 CPUs, 16GB RAM). If I understand correctly,
 libvirt first pauses the VM and then migrates it to a file (optionally
 through gzip).
 Restoring it back to a running state is almost instant.
 I guess this is not what should be expected; is there something
 particular I should check?
 
 This matches behaviour that I see when saving VMs. QEMU takes a really
 very long time to save itself using migrate exec:.  It is not CPU limited
 even in gzip mode - there is near zero CPU usage while this is going on.
 QEMU just seems to be very slow at sending the data. I've not had time to
 investigate whether this is a flaw in QEMU's exec: migration handling,
 or whether it is being hurt by too small pipe buffers, or something else
 entirely.  I must say I'm not entirely convinced that it is a good idea
 for QEMU to be mixing use of FILE * / popen, with non-blocking I/O, but
 that could be a red-herring.
 
 Daniel

I've reported this issue back when the regression was introduced in qemu.
Anthony had the same idea (not mixing popen with non-blocking I/O), but no 
solution was found at the time.
I haven't tried recently (I'm using TCP migration only), but the problem
must still be there.

The thread on qemu-devel:
http://lists.gnu.org/archive/html/qemu-devel/2009-08/msg01557.html
http://lists.gnu.org/archive/html/qemu-devel/2009-09/msg00020.html

-- 
Pierre Riteau -- http://perso.univ-rennes1.fr/pierre.riteau/



Installing kernel headers in kvm-kmod

2009-12-10 Thread Anthony Liguori

QEMU 0.12.0-rc1 does not support KVM
https://bugs.launchpad.net/bugs/494500

Boils down to the fact that 1) we don't include kernel headers in qemu (whereas 
qemu-kvm does) and 2) kvm-kmod does not install those headers on make install.

I think we've discussed (2) as being the preferred solution.  Does everyone 
agree with that?  Anyone care to volunteer to make the change? :-)

Regards,

Anthony Liguori



Re: Installing kernel headers in kvm-kmod

2009-12-10 Thread Avi Kivity

On 12/10/2009 03:07 PM, Anthony Liguori wrote:

QEMU 0.12.0-rc1 does not support KVM
https://bugs.launchpad.net/bugs/494500

Boils down to the fact that 1) we don't include kernel headers in qemu 
(whereas qemu-kvm does) and 2) kvm-kmod does not install those headers 
on make install.


I think we've discussed (2) as being the preferred solution.  Does 
everyone agree with that?  Anyone care to volunteer to make the 
change? :-)




While I definitely agree with (2), for the bug you cite Ubuntu should 
backport KVM_CAP_DESTROY_MEMORY_REGION_WORKS to their kernel (and 
headers).  Distributions shouldn't require kvm-kmod.


--
error compiling committee.c: too many arguments to function



Re: Installing kernel headers in kvm-kmod

2009-12-10 Thread Jan Kiszka
Anthony Liguori wrote:
 QEMU 0.12.0-rc1 does not support KVM
 https://bugs.launchpad.net/bugs/494500
 
 Boils down to the fact that 1) we don't include kernel headers in qemu
 (whereas qemu-kvm does) and 2) kvm-kmod does not install those headers
 on make install.
 
 I think we've discussed (2) as being the preferred solution.  Does
 everyone agree with that?  Anyone care to volunteer to make the change? :-)
 

I've pushed a half-tested approach into kvm-kmod's next branch. Feel
free to test/fix/enhance it.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


problems with domain installation

2009-12-10 Thread Tomas Macek
Hi, I'm using Red Hat Enterprise Linux 5.4 with the newest Red Hat
kvm/qemu packages installed, for testing purposes.
I'm trying to install Debian lenny from an ISO on a machine with only
ssh access, using the virt-install python script, and the output is like
this:


-
[r...@localhost ~]# virt-install --connect qemu:///system -n debnet -f 
~/debnet.qcow2 -s 3 -r 128 --accelerate --network=bridge:br0 --hvm 
--location http://ftp.us.debian.org/debian/dists/etch/main/installer-i386/ 
--nographics -x console=/dev/pts/3



Starting install...
Retrieving file MANIFEST... 
| 1.7 kB 00:00
Retrieving file linux... 
| 1.2 MB 00:09
Retrieving file initrd.gz... 
| 4.1 MB 00:32
Creating storage file... 
| 3.0 GB 00:00
Creating domain... 
|0 B 00:00

Connected to domain debnet
Escape character is ^]

Domain installation still in progress. You can reconnect to
the console to complete the installation process.
--

The problem is that the domain debnet is created (it can be seen with
virsh list), but I'm still unable to connect to its console, either with
virsh console debnet or with minicom -op /dev/pts/3 (I got the
/dev/pts/3 from the virsh ttyconsole debnet command).

The same was the result when I tried to install Fedora 12.

I need to run the installation from the command line, not from an X
window interface like virt-manager, but the only thing I ever see as
output is Escape character is ^].
I've spent a lot of time googling what I'm doing wrong, but the only
replies I've found were ... the vnc access is better, use it ... or that I
should redirect the installation console to a serial device - I think I'm
already doing that with -x console=/dev/pts/3, so what's wrong?


Best regards, Tomas


Re: Installing kernel headers in kvm-kmod

2009-12-10 Thread Avi Kivity

On 12/10/2009 04:50 PM, Arnd Bergmann wrote:

On Thursday 10 December 2009, Jan Kiszka wrote:
   

Anthony Liguori wrote:
 

QEMU 0.12.0-rc1 does not support KVM
https://bugs.launchpad.net/bugs/494500

Boils down to the fact that 1) we don't include kernel headers in qemu
(whereas qemu-kvm does) and 2) kvm-kmod does not install those headers
on make install.

I think we've discussed (2) as being the preferred solution.  Does
everyone agree with that?  Anyone care to volunteer to make the change? :-)

   

I've pushed a half-tested approach into kvm-kmod's next branch. Feel
free to test/fix/enhance it.
 

This would work, but installing to /usr/include/linux/kvm.h will confuse
distro package managers a lot, because that location belongs to the glibc
or libc-linux-headers or some other package already.

If you want to install the headers from kvm-kmod, I would recommend
doing it in a different path, e.g. /usr/include/kvm-kmod/{linux,asm}.
   


Maybe even /usr/local/include/kvm-kmod-$version/, and a symlink 
/usr/local/include/kvm-kmod.



qemu can then add -I/usr/include/kvm-kmod to its default include
path and get the kvm-kmod version if that's installed or the distro
version otherwise.

It may also be useful to do the equivalent of 'make headers_install'
from the kernel, to remove all #ifdef __KERNEL__ sections and
sparse annotations from the header files, but it should also work
without that.
   


Well, qemu.git needs __user removed.

--
error compiling committee.c: too many arguments to function



Re: Installing kernel headers in kvm-kmod

2009-12-10 Thread Jan Kiszka
Arnd Bergmann wrote:
 On Thursday 10 December 2009, Jan Kiszka wrote:
 Anthony Liguori wrote:
 QEMU 0.12.0-rc1 does not support KVM
 https://bugs.launchpad.net/bugs/494500

 Boils down to the fact that 1) we don't include kernel headers in qemu
 (whereas qemu-kvm does) and 2) kvm-kmod does not install those headers
 on make install.

 I think we've discussed (2) as being the preferred solution.  Does
 everyone agree with that?  Anyone care to volunteer to make the change? :-)

 I've pushed a half-tested approach into kvm-kmod's next branch. Feel
 free to test/fix/enhance it.
 
 This would work, but installing to /usr/include/linux/kvm.h will confuse
 distro package managers a lot, because that location belongs to the glibc
 or libc-linux-headers or some other package already.
 
 If you want to install the headers from kvm-kmod, I would recommend
 doing it in a different path, e.g. /usr/include/kvm-kmod/{linux,asm}.
 qemu can then add -I/usr/include/kvm-kmod to its default include
 path and get the kvm-kmod version if that's installed or the distro
 version otherwise.

Good point. /usr/include/kvm-kmod would be ok for me unless someone
wants them elsewhere.

 
 It may also be useful to do the equivalent of 'make headers_install'
 from the kernel, to remove all #ifdef __KERNEL__ sections and
 sparse annotations from the header files, but it should also work
 without that.

Yes, I think it's better to let the sync source install those headers
for us, then pick up those cleaned versions, carry them in kvm-kmod in
addition to the existing ones and finally install them.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: Installing kernel headers in kvm-kmod

2009-12-10 Thread Arnd Bergmann
On Thursday 10 December 2009, Avi Kivity wrote:
 Maybe even /usr/local/include/kvm-kmod-$version/, and a symlink 
 /usr/local/include/kvm-kmod.

Depends on how fine-grained you want to do the packaging.
Most distributions split packages between code and development
packages. The kvm-kmod code is the kernel module, so you want
to be able to install it for multiple kernels simultaneously.

Building the package only requires one version of the header
and does not depend on the underlying kernel version, only
on the version of the module, so it's reasonable to install only
one version as the -dev package, and have a dependency
in there to match the module version with the header version.

The most complex setup would split the development package
into one per kernel version and/or module version, plus an
extra package for the module version containing only the
symlink. I wouldn't go there.

  It may also be useful to do the equivalent of 'make headers_install'
  from the kernel, to remove all #ifdef __KERNEL__ sections and
  sparse annotations from the header files, but it should also work
  without that.
 
 
 Well, qemu.git needs __user removed.

This one is taken care of by kvm_kmod in the sync script, though it
would be cleaner to only do it for the installed version of the header,
not for the one used to build kvm.ko.

Arnd 


[ kvm-Bugs-2912026 ] kvm87 crash on fedora 10

2009-12-10 Thread SourceForge.net
Bugs item #2912026, was opened at 2009-12-10 16:31
Message generated for change (Comment added) made by avik
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2912026&group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Closed
Resolution: Out of Date
Priority: 5
Private: No
Submitted By: https://www.google.com/accounts ()
Assigned to: Nobody/Anonymous (nobody)
Summary: kvm87 crash on fedora 10

Initial Comment:
kvm keeps crashing. Sometimes right away, sometimes several seconds after 
start. dmesg gives:

vmwrite error: reg 6c00 value 80050033 (err 27648)
Pid: 4545, comm: qemu Not tainted 2.6.27.25-170.2.72.fc10.i686.PAE #1
 [c06b5884] ? printk+0xf/0x13
 [f910e4cd] vmwrite_error+0x25/0x2f [kvm_intel]
 [f910e4f7] vmcs_writel+0x20/0x24 [kvm_intel]
 [f910e7eb] vmx_vcpu_run+0x1e/0x196 [kvm_intel]
 [f90821c1] kvm_arch_vcpu_ioctl_run+0x3dc/0x5ee [kvm]
 [f907c3af] kvm_vcpu_ioctl+0xfe/0x3da [kvm]
 [c04292da] ? enqueue_entity+0x203/0x20b
 [c0424848] ? resched_task+0x3a/0x6e
 [c042b546] ? check_preempt_wakeup+0x1bb/0x1c3
 [c06b76eb] ? _spin_unlock_irqrestore+0x22/0x38
 [c042d1c9] ? try_to_wake_up+0x230/0x23b
 [c042d1df] ? default_wake_function+0xb/0xd
 [c042594e] ? __wake_up_common+0x35/0x5b
 [c0426f63] ? __wake_up+0x31/0x3b
 [c044dad4] ? wake_futex+0x1f/0x2c
 [c044db9d] ? futex_wake+0xbc/0xc6
 [c042752b] ? __dequeue_entity+0x73/0x7b
 [c04275be] ? set_next_entity+0x8b/0xf7
 [f907d901] ? kvm_vm_ioctl+0x1db/0x1ee [kvm]
 [c042fff7] ? finish_task_switch+0x2f/0xb0
 [c06b60c7] ? schedule+0x6ee/0x70d
 [f907c2b1] ? kvm_vcpu_ioctl+0x0/0x3da [kvm]
 [c04a2a16] vfs_ioctl+0x22/0x69
 [c04a2c98] do_vfs_ioctl+0x23b/0x247
 [c0466d4e] ? audit_syscall_entry+0xf9/0x123
 [c04a2ce4] sys_ioctl+0x40/0x5c
 [c0408c8a] syscall_call+0x7/0xb
 ===

--

Comment By: Avi Kivity (avik)
Date: 2009-12-10 18:29

Message:
kvm-87 is outdated (and so is F10).  I recommend using the modules provided
by F10 (and upgrading to F12).

--

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2912026&group_id=180599


Re: Installing kernel headers in kvm-kmod

2009-12-10 Thread Jan Kiszka
Arnd Bergmann wrote:
 On Thursday 10 December 2009, Avi Kivity wrote:
 Maybe even /usr/local/include/kvm-kmod-$version/, and a symlink 
 /usr/local/include/kvm-kmod.
 
 Depends on how fine-grained you want to do the packaging.
 Most distributions split packages between code and development
 packages. The kvm-kmod code is the kernel module, so you want
 to be able to install it for multiple kernels simultaneously.
 
 Building the package only requires one version of the header
 and does not depend on the underlying kernel version, only
 on the version of the module, so it's reasonable to install only
 one version as the -dev package, and have a dependency
 in there to match the module version with the header version.
 
 The most complex setup would split the development package
 into one per kernel version and/or module version, plus an
 extra package for the module version containing only the
 symlink. I wouldn't go there.

I've just (forced-)pushed the simple version with
/usr/include/kvm-kmod as destination. The user headers are now stored
under usr/include in the kvm-kmod sources and installed from there.

 
 It may also be useful to do the equivalent of 'make headers_install'
 from the kernel, to remove all #ifdef __KERNEL__ sections and
 sparse annotations from the header files, but it should also work
 without that.

 Well, qemu.git needs __user removed.
 
 This one is taken care of by kvm_kmod in the sync script, though it
 would be cleaner to only do it for the installed version of the header,
 not for the one used to build kvm.ko.

It's easy to drop, but I wonder why it was introduced. To allow reusing
the headers for user space?

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: Installing kernel headers in kvm-kmod

2009-12-10 Thread Avi Kivity

On 12/10/2009 06:42 PM, Jan Kiszka wrote:


I've just (forced-)pushed the simple version with
/usr/include/kvm-kmod as destination. The user headers are now stored
under usr/include in the kvm-kmod sources and installed from there.
   


It's customary to install to /usr/local, not to /usr (qemu does the same).

--
error compiling committee.c: too many arguments to function



Re: Installing kernel headers in kvm-kmod

2009-12-10 Thread Jan Kiszka
Avi Kivity wrote:
 On 12/10/2009 06:42 PM, Jan Kiszka wrote:
 I've just (forced-)pushed the simple version with
 /usr/include/kvm-kmod as destination. The user headers are now stored
 under usr/include in the kvm-kmod sources and installed from there.

 
 It's customary to install to /usr/local, not to /usr (qemu does the same).

Adjusted accordingly. Moreover, I only install the target arch's header now.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Nested VMX support v4

2009-12-10 Thread oritw
Avi,   
We have addressed all of the comments, please apply.   

The following patches implement nested VMX support. The patches enable a guest
to use the VMX APIs in order to run its own nested guest (i.e., enable running
other hypervisors which use VMX under KVM). The current patches support running
Linux under a nested KVM using shadow page tables (with bypass_guest_pf
disabled). Reworking EPT support to mesh cleanly with the current shadow paging
design per Avi's comments is a work-in-progress.

The current patches support multiple nested hypervisors, which can run
multiple guests. Only 64-bit nested hypervisors are supported. SMP is
supported. Additional patches for running Windows under nested KVM, and
Linux under nested VMware server, are currently running in the lab, and
will be sent as a follow-on patchset.

These patches were written by:
 Orit Wasserman, oritw at il.ibm.com
 Ben-Ami Yassor, benami at il.ibm.com
 Abel Gordon, abelg at il.ibm.com
 Muli Ben-Yehuda, muli at il.ibm.com

With contributions by:
 Anthony Liguori, aliguori at us.ibm.com
 Mike Day, mdday at us.ibm.com

This work was inspired by the nested SVM support by Alexander Graf and Joerg
Roedel.

Changes since v3:
   Added support for 32-bit nested guests
   Added support for multiple nested guests
   Added support for multiple nested hypervisors
   Implemented VMX instruction decoding
   Implemented CR0.TS handling for nested


[PATCH 1/7] Nested VMX patch 1 implements vmon and vmoff

2009-12-10 Thread oritw
From: Orit Wasserman or...@il.ibm.com

---
 arch/x86/kvm/svm.c |3 -
 arch/x86/kvm/vmx.c |  265 +++-
 arch/x86/kvm/x86.c |   11 ++-
 arch/x86/kvm/x86.h |2 +
 4 files changed, 274 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 3de0b37..3f63cdd 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -121,9 +121,6 @@ static int npt = 1;
 
 module_param(npt, int, S_IRUGO);
 
-static int nested = 1;
-module_param(nested, int, S_IRUGO);
-
 static void svm_flush_tlb(struct kvm_vcpu *vcpu);
 static void svm_complete_interrupts(struct vcpu_svm *svm);
 
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 9a0a2cf..2726a6c 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -92,6 +92,16 @@ struct shared_msr_entry {
u64 mask;
 };
 
+struct __attribute__ ((__packed__)) level_state {
+};
+
+struct nested_vmx {
+   /* Has the level1 guest done vmxon? */
+   bool vmxon;
+   /* Level 1 state for switching to level 2 and back */
+   struct level_state *l1_state;
+};
+
 struct vcpu_vmx {
struct kvm_vcpu   vcpu;
struct list_head  local_vcpus_link;
@@ -136,6 +146,9 @@ struct vcpu_vmx {
ktime_t entry_time;
s64 vnmi_blocked_time;
u32 exit_reason;
+
+   /* Nested vmx */
+   struct nested_vmx nested;
 };
 
 static inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu)
@@ -201,6 +214,7 @@ static struct kvm_vmx_segment_field {
 static u64 host_efer;
 
 static void ept_save_pdptrs(struct kvm_vcpu *vcpu);
+static int create_l1_state(struct kvm_vcpu *vcpu);
 
 /*
  * Keep MSR_K6_STAR at the end, as setup_msrs() will try to optimize it
@@ -961,6 +975,95 @@ static void guest_write_tsc(u64 guest_tsc, u64 host_tsc)
 }
 
 /*
+ * Handles msr read for nested virtualization
+ */
+static int nested_vmx_get_msr(struct kvm_vcpu *vcpu, u32 msr_index,
+ u64 *pdata)
+{
+   u64 vmx_msr = 0;
+
+   switch (msr_index) {
+   case MSR_IA32_FEATURE_CONTROL:
+   *pdata = 0;
+   break;
+   case MSR_IA32_VMX_BASIC:
+   *pdata = 0;
+   rdmsrl(MSR_IA32_VMX_BASIC, vmx_msr);
+   *pdata = (vmx_msr & 0x00cf);
+   break;
+   case MSR_IA32_VMX_PINBASED_CTLS:
+   rdmsrl(MSR_IA32_VMX_PINBASED_CTLS, vmx_msr);
+   *pdata = (PIN_BASED_EXT_INTR_MASK & vmcs_config.pin_based_exec_ctrl) |
+   (PIN_BASED_NMI_EXITING & vmcs_config.pin_based_exec_ctrl) |
+   (PIN_BASED_VIRTUAL_NMIS & vmcs_config.pin_based_exec_ctrl);
+   break;
+   case MSR_IA32_VMX_PROCBASED_CTLS:
+   {
+   u32 vmx_msr_high, vmx_msr_low;
+   u32 control = CPU_BASED_HLT_EXITING |
+#ifdef CONFIG_X86_64
+   CPU_BASED_CR8_LOAD_EXITING |
+   CPU_BASED_CR8_STORE_EXITING |
+#endif
+   CPU_BASED_CR3_LOAD_EXITING |
+   CPU_BASED_CR3_STORE_EXITING |
+   CPU_BASED_USE_IO_BITMAPS |
+   CPU_BASED_MOV_DR_EXITING |
+   CPU_BASED_USE_TSC_OFFSETING |
+   CPU_BASED_INVLPG_EXITING |
+   CPU_BASED_TPR_SHADOW |
+   CPU_BASED_USE_MSR_BITMAPS |
+   CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
+
+   rdmsr(MSR_IA32_VMX_PROCBASED_CTLS, vmx_msr_low, vmx_msr_high);
+
+   control &= vmx_msr_high; /* bit == 0 in high word == must be zero */
+   control |= vmx_msr_low;  /* bit == 1 in low word  == must be one  */
+
+   *pdata = (CPU_BASED_HLT_EXITING & control) |
+#ifdef CONFIG_X86_64
+   (CPU_BASED_CR8_LOAD_EXITING & control) |
+   (CPU_BASED_CR8_STORE_EXITING & control) |
+#endif
+   (CPU_BASED_CR3_LOAD_EXITING & control) |
+   (CPU_BASED_CR3_STORE_EXITING & control) |
+   (CPU_BASED_USE_IO_BITMAPS & control) |
+   (CPU_BASED_MOV_DR_EXITING & control) |
+   (CPU_BASED_USE_TSC_OFFSETING & control) |
+   (CPU_BASED_INVLPG_EXITING & control);
+
+   if (cpu_has_secondary_exec_ctrls())
+   *pdata |= CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
+
+   if (vm_need_tpr_shadow(vcpu->kvm))
+   *pdata |= CPU_BASED_TPR_SHADOW;
+   break;
+   }
+   case MSR_IA32_VMX_EXIT_CTLS:
+   *pdata = 0;
+#ifdef CONFIG_X86_64
+   *pdata |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
+#endif
+   break;
+   case MSR_IA32_VMX_ENTRY_CTLS:
+   *pdata = 0;
+   break;
+   case MSR_IA32_VMX_PROCBASED_CTLS2:
+   *pdata = 0;
+   if (vm_need_virtualize_apic_accesses(vcpu->kvm))
+ 

[PATCH 2/7] Nested VMX patch 2 implements vmclear

2009-12-10 Thread oritw
From: Orit Wasserman or...@il.ibm.com

---
 arch/x86/kvm/vmx.c |  235 +++-
 arch/x86/kvm/x86.c |5 +-
 arch/x86/kvm/x86.h |3 +
 3 files changed, 240 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 2726a6c..a7ffd5e 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -93,13 +93,39 @@ struct shared_msr_entry {
 };
 
 struct __attribute__ ((__packed__)) level_state {
+   /* Has the level1 guest done vmclear? */
+   bool vmclear;
+};
+
+/*
+ * This structure is mapped to guest memory.
+ * It is packed in order to preserve the binary content
+ * after live migration.
+ * If there are changes in the content or layout, the revision_id must be
+ * updated.
+ */
+struct __attribute__ ((__packed__)) nested_vmcs_page {
+   u32 revision_id;
+   u32 abort;
+   struct level_state l2_state;
+};
+
+struct nested_vmcs_list {
+   struct list_head list;
+   gpa_t vmcs_addr;
+   struct vmcs *l2_vmcs;
 };
 
 struct nested_vmx {
/* Has the level1 guest done vmxon? */
bool vmxon;
+   /* What is the location of the current vmcs l1 keeps for l2 */
+   gpa_t current_vmptr;
/* Level 1 state for switching to level 2 and back */
struct level_state *l1_state;
+   /* list of vmcs for each l2 guest created by l1 */
+   struct list_head l2_vmcs_list;
+   /* l2 page corresponding to the current vmcs set by l1 */
+   struct nested_vmcs_page *current_l2_page;
 };
 
 struct vcpu_vmx {
@@ -156,6 +182,76 @@ static inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu)
return container_of(vcpu, struct vcpu_vmx, vcpu);
 }
 
+static struct page *nested_get_page(struct kvm_vcpu *vcpu,
+   u64 vmcs_addr)
+{
+   struct page *vmcs_page = NULL;
+
+   down_read(&current->mm->mmap_sem);
+   vmcs_page = gfn_to_page(vcpu->kvm, vmcs_addr >> PAGE_SHIFT);
+   up_read(&current->mm->mmap_sem);
+
+   if (is_error_page(vmcs_page)) {
+   printk(KERN_ERR "%s error allocating page 0x%llx\n",
+  __func__, vmcs_addr);
+   kvm_release_page_clean(vmcs_page);
+   return NULL;
+   }
+
+   return vmcs_page;
+
+}
+
+static int nested_map_current(struct kvm_vcpu *vcpu)
+{
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   struct page *vmcs_page =
+   nested_get_page(vcpu, vmx->nested.current_vmptr);
+   struct nested_vmcs_page *mapped_page;
+
+   if (vmcs_page == NULL) {
+   printk(KERN_INFO "%s: failure in nested_get_page\n", __func__);
+   return 0;
+   }
+
+   if (vmx->nested.current_l2_page) {
+   printk(KERN_INFO "%s: shadow vmcs already mapped\n", __func__);
+   WARN_ON(1);
+   return 0;
+   }
+
+   mapped_page = kmap_atomic(vmcs_page, KM_USER0);
+
+   if (!mapped_page) {
+   printk(KERN_INFO "%s: error in kmap_atomic\n", __func__);
+   return 0;
+   }
+
+   vmx->nested.current_l2_page = mapped_page;
+
+   return 1;
+}
+
+static void nested_unmap_current(struct kvm_vcpu *vcpu)
+{
+   struct page *page;
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+   if (!vmx->nested.current_l2_page) {
+   printk(KERN_INFO "Shadow vmcs already unmapped\n");
+   WARN_ON(1);
+   return;
+   }
+
+   page = kmap_atomic_to_page(vmx->nested.current_l2_page);
+
+   kunmap_atomic(vmx->nested.current_l2_page, KM_USER0);
+
+   kvm_release_page_dirty(page);
+
+   vmx->nested.current_l2_page = NULL;
+}
+
 static int init_rmode(struct kvm *kvm);
 static u64 construct_eptp(unsigned long root_hpa);
 
@@ -1144,6 +1240,35 @@ static int nested_vmx_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
return 0;
 }
 
+static int read_guest_vmcs_gpa(struct kvm_vcpu *vcpu, gva_t gva, u64 *gentry)
+{
+   int r = 0;
+   uint size;
+
+   *gentry = 0;
+
+   if (is_long_mode(vcpu))
+   size = sizeof(u64);
+   else
+   size = sizeof(u32);
+
+   r = kvm_read_guest_virt(gva, gentry,
+   size, vcpu);
+   if (r) {
+   printk(KERN_ERR "%s cannot read guest vmcs addr %lx : %d\n",
+  __func__, vcpu->arch.regs[VCPU_REGS_RAX], r);
+   return r;
+   }
+
+   if (!IS_ALIGNED(*gentry, PAGE_SIZE)) {
+   printk(KERN_DEBUG "%s addr %llx not aligned\n",
+  __func__, *gentry);
+   return 1;
+   }
+
+   return 0;
+}
+
 /*
  * Writes msr value into into the appropriate register.
  * Returns 0 on success, non-0 otherwise.
@@ -1316,6 +1441,7 @@ static int create_l1_state(struct kvm_vcpu *vcpu)
} else
return 0;
 
+   INIT_LIST_HEAD(&(vmx->nested.l2_vmcs_list));
return 0;
 }
 
@@ -1488,15 +1614,35 @@ static void free_vmcs(struct vmcs 

[PATCH 3/7] Nested VMX patch 3 implements vmptrld and vmptrst

2009-12-10 Thread oritw
From: Orit Wasserman or...@il.ibm.com

---
 arch/x86/kvm/vmx.c |  292 ++--
 arch/x86/kvm/x86.c |6 +-
 arch/x86/kvm/x86.h |3 +
 3 files changed, 289 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a7ffd5e..46a4f3a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -92,9 +92,142 @@ struct shared_msr_entry {
u64 mask;
 };
 
+struct __attribute__ ((__packed__)) shadow_vmcs {
+   u16 virtual_processor_id;
+   u16 guest_es_selector;
+   u16 guest_cs_selector;
+   u16 guest_ss_selector;
+   u16 guest_ds_selector;
+   u16 guest_fs_selector;
+   u16 guest_gs_selector;
+   u16 guest_ldtr_selector;
+   u16 guest_tr_selector;
+   u16 host_es_selector;
+   u16 host_cs_selector;
+   u16 host_ss_selector;
+   u16 host_ds_selector;
+   u16 host_fs_selector;
+   u16 host_gs_selector;
+   u16 host_tr_selector;
+   u64 io_bitmap_a;
+   u64 io_bitmap_b;
+   u64 msr_bitmap;
+   u64 vm_exit_msr_store_addr;
+   u64 vm_exit_msr_load_addr;
+   u64 vm_entry_msr_load_addr;
+   u64 tsc_offset;
+   u64 virtual_apic_page_addr;
+   u64 apic_access_addr;
+   u64 ept_pointer;
+   u64 guest_physical_address;
+   u64 vmcs_link_pointer;
+   u64 guest_ia32_debugctl;
+   u64 guest_ia32_pat;
+   u64 guest_pdptr0;
+   u64 guest_pdptr1;
+   u64 guest_pdptr2;
+   u64 guest_pdptr3;
+   u64 host_ia32_pat;
+   u32 pin_based_vm_exec_control;
+   u32 cpu_based_vm_exec_control;
+   u32 exception_bitmap;
+   u32 page_fault_error_code_mask;
+   u32 page_fault_error_code_match;
+   u32 cr3_target_count;
+   u32 vm_exit_controls;
+   u32 vm_exit_msr_store_count;
+   u32 vm_exit_msr_load_count;
+   u32 vm_entry_controls;
+   u32 vm_entry_msr_load_count;
+   u32 vm_entry_intr_info_field;
+   u32 vm_entry_exception_error_code;
+   u32 vm_entry_instruction_len;
+   u32 tpr_threshold;
+   u32 secondary_vm_exec_control;
+   u32 vm_instruction_error;
+   u32 vm_exit_reason;
+   u32 vm_exit_intr_info;
+   u32 vm_exit_intr_error_code;
+   u32 idt_vectoring_info_field;
+   u32 idt_vectoring_error_code;
+   u32 vm_exit_instruction_len;
+   u32 vmx_instruction_info;
+   u32 guest_es_limit;
+   u32 guest_cs_limit;
+   u32 guest_ss_limit;
+   u32 guest_ds_limit;
+   u32 guest_fs_limit;
+   u32 guest_gs_limit;
+   u32 guest_ldtr_limit;
+   u32 guest_tr_limit;
+   u32 guest_gdtr_limit;
+   u32 guest_idtr_limit;
+   u32 guest_es_ar_bytes;
+   u32 guest_cs_ar_bytes;
+   u32 guest_ss_ar_bytes;
+   u32 guest_ds_ar_bytes;
+   u32 guest_fs_ar_bytes;
+   u32 guest_gs_ar_bytes;
+   u32 guest_ldtr_ar_bytes;
+   u32 guest_tr_ar_bytes;
+   u32 guest_interruptibility_info;
+   u32 guest_activity_state;
+   u32 guest_sysenter_cs;
+   u32 host_ia32_sysenter_cs;
+   unsigned long cr0_guest_host_mask;
+   unsigned long cr4_guest_host_mask;
+   unsigned long cr0_read_shadow;
+   unsigned long cr4_read_shadow;
+   unsigned long cr3_target_value0;
+   unsigned long cr3_target_value1;
+   unsigned long cr3_target_value2;
+   unsigned long cr3_target_value3;
+   unsigned long exit_qualification;
+   unsigned long guest_linear_address;
+   unsigned long guest_cr0;
+   unsigned long guest_cr3;
+   unsigned long guest_cr4;
+   unsigned long guest_es_base;
+   unsigned long guest_cs_base;
+   unsigned long guest_ss_base;
+   unsigned long guest_ds_base;
+   unsigned long guest_fs_base;
+   unsigned long guest_gs_base;
+   unsigned long guest_ldtr_base;
+   unsigned long guest_tr_base;
+   unsigned long guest_gdtr_base;
+   unsigned long guest_idtr_base;
+   unsigned long guest_dr7;
+   unsigned long guest_rsp;
+   unsigned long guest_rip;
+   unsigned long guest_rflags;
+   unsigned long guest_pending_dbg_exceptions;
+   unsigned long guest_sysenter_esp;
+   unsigned long guest_sysenter_eip;
+   unsigned long host_cr0;
+   unsigned long host_cr3;
+   unsigned long host_cr4;
+   unsigned long host_fs_base;
+   unsigned long host_gs_base;
+   unsigned long host_tr_base;
+   unsigned long host_gdtr_base;
+   unsigned long host_idtr_base;
+   unsigned long host_ia32_sysenter_esp;
+   unsigned long host_ia32_sysenter_eip;
+   unsigned long host_rsp;
+   unsigned long host_rip;
+};
+
+
 struct __attribute__ ((__packed__)) level_state {
/* Has the level1 guest done vmclear? */
bool vmclear;
+
+   u64 io_bitmap_a;
+   u64 io_bitmap_b;
+   u64 msr_bitmap;
+
+   bool first_launch;
 };
 
 /*
@@ -122,6 +255,8 @@ struct nested_vmx {
gpa_t current_vmptr;
/* Level 1 

[PATCH 4/7] Nested VMX patch 4 implements vmread and vmwrite

2009-12-10 Thread oritw
From: Orit Wasserman or...@il.ibm.com

---
 arch/x86/kvm/vmx.c |  670 +++-
 1 files changed, 660 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 46a4f3a..8745d44 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -239,6 +239,7 @@ struct __attribute__ ((__packed__)) level_state {
 struct __attribute__ ((__packed__)) nested_vmcs_page {
u32 revision_id;
u32 abort;
+   struct shadow_vmcs shadow_vmcs;
struct level_state l2_state;
 };
 
@@ -263,6 +264,55 @@ struct nested_vmx {
struct nested_vmcs_page *current_l2_page;
 };
 
+enum vmcs_field_type {
+   VMCS_FIELD_TYPE_U16 = 0,
+   VMCS_FIELD_TYPE_U64 = 1,
+   VMCS_FIELD_TYPE_U32 = 2,
+   VMCS_FIELD_TYPE_ULONG = 3
+};
+
+#define VMCS_FIELD_LENGTH_OFFSET 13
+#define VMCS_FIELD_LENGTH_MASK 0x6000
+
+/*
+  Returns VMCS Field type
+*/
+static inline int vmcs_field_type(unsigned long field)
+{
+   /* For a 32-bit L1 using the HIGH field */
+   if (0x1 & field)
+   return VMCS_FIELD_TYPE_U32;
+
+   return (VMCS_FIELD_LENGTH_MASK & field) >> 13;
+}
+
+/*
+  Returncs VMCS field size in bits
+*/
+static inline int vmcs_field_size(int field_type, struct kvm_vcpu *vcpu)
+{
+   switch (field_type) {
+   case VMCS_FIELD_TYPE_U16:
+   return 2;
+   case VMCS_FIELD_TYPE_U32:
+   return 4;
+   case VMCS_FIELD_TYPE_U64:
+   return 8;
+   case VMCS_FIELD_TYPE_ULONG:
+#ifdef CONFIG_X86_64
+   if (is_long_mode(vcpu))
+   return 8;
+   else
+   return 4;
+#else
+   return 4;
+#endif
+   }
+
+   printk(KERN_INFO "WARNING: invalid field type %d\n", field_type);
+   return 0;
+}
+
 struct vcpu_vmx {
struct kvm_vcpu   vcpu;
struct list_head  local_vcpus_link;
@@ -317,6 +367,411 @@ static inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu)
return container_of(vcpu, struct vcpu_vmx, vcpu);
 }
 
+static inline struct shadow_vmcs *get_shadow_vmcs(struct kvm_vcpu *vcpu)
+{
+   WARN_ON(!to_vmx(vcpu)->nested.current_l2_page);
+   return &(to_vmx(vcpu)->nested.current_l2_page->shadow_vmcs);
+}
+
+#define SHADOW_VMCS_OFFSET(x) offsetof(struct shadow_vmcs, x)
+
+static unsigned short vmcs_field_to_offset_table[HOST_RIP+1] = {
+
+   [VIRTUAL_PROCESSOR_ID] =
+   SHADOW_VMCS_OFFSET(virtual_processor_id),
+   [GUEST_ES_SELECTOR] =
+   SHADOW_VMCS_OFFSET(guest_es_selector),
+   [GUEST_CS_SELECTOR] =
+   SHADOW_VMCS_OFFSET(guest_cs_selector),
+   [GUEST_SS_SELECTOR] =
+   SHADOW_VMCS_OFFSET(guest_ss_selector),
+   [GUEST_DS_SELECTOR] =
+   SHADOW_VMCS_OFFSET(guest_ds_selector),
+   [GUEST_FS_SELECTOR] =
+   SHADOW_VMCS_OFFSET(guest_fs_selector),
+   [GUEST_GS_SELECTOR] =
+   SHADOW_VMCS_OFFSET(guest_gs_selector),
+   [GUEST_LDTR_SELECTOR] =
+   SHADOW_VMCS_OFFSET(guest_ldtr_selector),
+   [GUEST_TR_SELECTOR] =
+   SHADOW_VMCS_OFFSET(guest_tr_selector),
+   [HOST_ES_SELECTOR] =
+   SHADOW_VMCS_OFFSET(host_es_selector),
+   [HOST_CS_SELECTOR] =
+   SHADOW_VMCS_OFFSET(host_cs_selector),
+   [HOST_SS_SELECTOR] =
+   SHADOW_VMCS_OFFSET(host_ss_selector),
+   [HOST_DS_SELECTOR] =
+   SHADOW_VMCS_OFFSET(host_ds_selector),
+   [HOST_FS_SELECTOR] =
+   SHADOW_VMCS_OFFSET(host_fs_selector),
+   [HOST_GS_SELECTOR] =
+   SHADOW_VMCS_OFFSET(host_gs_selector),
+   [HOST_TR_SELECTOR] =
+   SHADOW_VMCS_OFFSET(host_tr_selector),
+   [IO_BITMAP_A] =
+   SHADOW_VMCS_OFFSET(io_bitmap_a),
+   [IO_BITMAP_A_HIGH] =
+   SHADOW_VMCS_OFFSET(io_bitmap_a)+4,
+   [IO_BITMAP_B] =
+   SHADOW_VMCS_OFFSET(io_bitmap_b),
+   [IO_BITMAP_B_HIGH] =
+   SHADOW_VMCS_OFFSET(io_bitmap_b)+4,
+   [MSR_BITMAP] =
+   SHADOW_VMCS_OFFSET(msr_bitmap),
+   [MSR_BITMAP_HIGH] =
+   SHADOW_VMCS_OFFSET(msr_bitmap)+4,
+   [VM_EXIT_MSR_STORE_ADDR] =
+   SHADOW_VMCS_OFFSET(vm_exit_msr_store_addr),
+   [VM_EXIT_MSR_STORE_ADDR_HIGH] =
+   SHADOW_VMCS_OFFSET(vm_exit_msr_store_addr)+4,
+   [VM_EXIT_MSR_LOAD_ADDR] =
+   SHADOW_VMCS_OFFSET(vm_exit_msr_load_addr),
+   [VM_EXIT_MSR_LOAD_ADDR_HIGH] =
+   SHADOW_VMCS_OFFSET(vm_exit_msr_load_addr)+4,
+   [VM_ENTRY_MSR_LOAD_ADDR] =
+   SHADOW_VMCS_OFFSET(vm_entry_msr_load_addr),
+   [VM_ENTRY_MSR_LOAD_ADDR_HIGH] =
+   SHADOW_VMCS_OFFSET(vm_entry_msr_load_addr)+4,
+   [TSC_OFFSET] =
+   SHADOW_VMCS_OFFSET(tsc_offset),
+   [TSC_OFFSET_HIGH] =
+   SHADOW_VMCS_OFFSET(tsc_offset)+4,
+   

[PATCH 5/7] Nested VMX patch 5 Simplify fpu handling

2009-12-10 Thread oritw
From: Orit Wasserman or...@il.ibm.com

---
 arch/x86/kvm/vmx.c |   27 +--
 1 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 8745d44..de1f596 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1244,8 +1244,6 @@ static void update_exception_bitmap(struct kvm_vcpu *vcpu)
u32 eb;
 
eb = (1u << PF_VECTOR) | (1u << UD_VECTOR) | (1u << MC_VECTOR);
-   if (!vcpu->fpu_active)
-   eb |= 1u << NM_VECTOR;
/*
 * Unconditionally intercept #DB so we can maintain dr6 without
 * reading it every exit.
@@ -1463,10 +1461,6 @@ static void vmx_fpu_activate(struct kvm_vcpu *vcpu)
if (vcpu->fpu_active)
return;
vcpu->fpu_active = 1;
-   vmcs_clear_bits(GUEST_CR0, X86_CR0_TS);
-   if (vcpu->arch.cr0 & X86_CR0_TS)
-   vmcs_set_bits(GUEST_CR0, X86_CR0_TS);
-   update_exception_bitmap(vcpu);
 }
 
 static void vmx_fpu_deactivate(struct kvm_vcpu *vcpu)
@@ -1474,8 +1468,6 @@ static void vmx_fpu_deactivate(struct kvm_vcpu *vcpu)
if (!vcpu->fpu_active)
return;
vcpu->fpu_active = 0;
-   vmcs_set_bits(GUEST_CR0, X86_CR0_TS);
-   update_exception_bitmap(vcpu);
 }
 
 static unsigned long vmx_get_rflags(struct kvm_vcpu *vcpu)
@@ -2715,8 +2707,10 @@ static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 
vmx_flush_tlb(vcpu);
vmcs_writel(GUEST_CR3, guest_cr3);
-   if (vcpu->arch.cr0 & X86_CR0_PE)
-   vmx_fpu_deactivate(vcpu);
+   if (vcpu->arch.cr0 & X86_CR0_PE) {
+   if (guest_cr3 != vmcs_readl(GUEST_CR3))
+   vmx_fpu_deactivate(vcpu);
+   }
 }
 
 static void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
@@ -5208,6 +5202,19 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
if (vcpu->arch.switch_db_regs)
get_debugreg(vcpu->arch.dr6, 6);
 
+   if (vcpu->fpu_active) {
+   if (vmcs_readl(CR0_READ_SHADOW) & X86_CR0_TS)
+   vmcs_set_bits(GUEST_CR0, X86_CR0_TS);
+   else
+   vmcs_clear_bits(GUEST_CR0, X86_CR0_TS);
+   vmcs_write32(EXCEPTION_BITMAP,
+vmcs_read32(EXCEPTION_BITMAP) & ~(1u << NM_VECTOR));
+   } else {
+   vmcs_set_bits(GUEST_CR0, X86_CR0_TS);
+   vmcs_write32(EXCEPTION_BITMAP,
+vmcs_read32(EXCEPTION_BITMAP) | (1u << NM_VECTOR));
+   }
+
vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
if (vmx->rmode.irq.pending)
fixup_rmode_irq(vmx);
-- 
1.6.0.4



[PATCH 6/7] Nested VMX patch 6 implements vmlaunch and vmresume

2009-12-10 Thread oritw
From: Orit Wasserman or...@il.ibm.com

---
 arch/x86/kvm/vmx.c |  890 +++-
 1 files changed, 873 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index de1f596..0d36b49 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -223,10 +223,16 @@ struct __attribute__ ((__packed__)) level_state {
/* Has the level1 guest done vmclear? */
bool vmclear;
 
+   u64 shadow_efer;
+   unsigned long cr3;
+   unsigned long cr4;
+
u64 io_bitmap_a;
u64 io_bitmap_b;
u64 msr_bitmap;
 
+   int cpu;
+   int launched;
bool first_launch;
 };
 
@@ -254,10 +260,14 @@ struct nested_vmx {
bool vmxon;
/* What is the location of the current vmcs l1 keeps for l2 */
gpa_t current_vmptr;
+   /* Are we running nested guest */
+   bool nested_mode;
/* Level 1 state for switching to level 2 and back */
struct level_state *l1_state;
/* Level 1 shadow vmcs for switching to level 2 and back */
struct shadow_vmcs *l1_shadow_vmcs;
+   /* Level 1 vmcs loaded into the processor */
+   struct vmcs *l1_vmcs;
/* list of vmcs for each l2 guest created by l1 */
struct list_head l2_vmcs_list;
/* l2 page corresponding to the current vmcs set by l1 */
@@ -287,7 +297,7 @@ static inline int vmcs_field_type(unsigned long field)
 }
 
 /*
-  Returncs VMCS field size in bits
+  Returns VMCS field size in bits
 */
 static inline int vmcs_field_size(int field_type, struct kvm_vcpu *vcpu)
 {
@@ -313,6 +323,10 @@ static inline int vmcs_field_size(int field_type, struct kvm_vcpu *vcpu)
return 0;
 }
 
+#define NESTED_VM_EXIT_CONTROLS_MASK (~(VM_EXIT_LOAD_IA32_PAT | \
+   VM_EXIT_SAVE_IA32_PAT))
+#define NESTED_VM_ENTRY_CONTROLS_MASK (~(VM_ENTRY_LOAD_IA32_PAT | \
+VM_ENTRY_IA32E_MODE))
 struct vcpu_vmx {
struct kvm_vcpu   vcpu;
struct list_head  local_vcpus_link;
@@ -892,7 +906,11 @@ static struct kvm_vmx_segment_field {
 static u64 host_efer;
 
 static void ept_save_pdptrs(struct kvm_vcpu *vcpu);
+
+static int nested_vmx_check_permission(struct kvm_vcpu *vcpu);
 static int create_l1_state(struct kvm_vcpu *vcpu);
+static int create_l2_state(struct kvm_vcpu *vcpu);
+static int launch_guest(struct kvm_vcpu *vcpu, bool launch);
 
 /*
  * Keep MSR_K6_STAR at the end, as setup_msrs() will try to optimize it
@@ -993,6 +1011,18 @@ static inline bool cpu_has_vmx_ept_2m_page(void)
return !!(vmx_capability.ept & VMX_EPT_2MB_PAGE_BIT);
 }
 
+static inline int is_exception(u32 intr_info)
+{
+   return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
+   == (INTR_TYPE_HARD_EXCEPTION | INTR_INFO_VALID_MASK);
+}
+
+static inline int is_nmi(u32 intr_info)
+{
+   return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
+   == (INTR_TYPE_NMI_INTR | INTR_INFO_VALID_MASK);
+}
+
 static inline int cpu_has_vmx_invept_individual_addr(void)
 {
return !!(vmx_capability.ept & VMX_EPT_EXTENT_INDIVIDUAL_BIT);
@@ -1049,6 +1079,51 @@ static inline bool report_flexpriority(void)
return flexpriority_enabled;
 }
 
+static inline int nested_cpu_has_vmx_tpr_shadow(struct  kvm_vcpu *vcpu)
+{
+   return cpu_has_vmx_tpr_shadow() &&
+   get_shadow_vmcs(vcpu)->cpu_based_vm_exec_control &
+   CPU_BASED_TPR_SHADOW;
+}
+
+static inline int nested_cpu_has_secondary_exec_ctrls(struct kvm_vcpu *vcpu)
+{
+   return cpu_has_secondary_exec_ctrls() &&
+   get_shadow_vmcs(vcpu)->cpu_based_vm_exec_control &
+   CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
+}
+
+static inline bool nested_vm_need_virtualize_apic_accesses(struct kvm_vcpu
+  *vcpu)
+{
+   return get_shadow_vmcs(vcpu)->secondary_vm_exec_control &
+   SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+}
+
+static inline int nested_cpu_has_vmx_ept(struct kvm_vcpu *vcpu)
+{
+   return get_shadow_vmcs(vcpu)->
+   secondary_vm_exec_control & SECONDARY_EXEC_ENABLE_EPT;
+}
+
+static inline int nested_cpu_has_vmx_vpid(struct kvm_vcpu *vcpu)
+{
+   return get_shadow_vmcs(vcpu)->secondary_vm_exec_control &
+   SECONDARY_EXEC_ENABLE_VPID;
+}
+
+static inline int nested_cpu_has_vmx_pat(struct kvm_vcpu *vcpu)
+{
+   return get_shadow_vmcs(vcpu)->vm_entry_controls &
+   VM_ENTRY_LOAD_IA32_PAT;
+}
+
+static inline int nested_cpu_has_vmx_msr_bitmap(struct kvm_vcpu *vcpu)
+{
+   return get_shadow_vmcs(vcpu)->cpu_based_vm_exec_control &
+   CPU_BASED_USE_MSR_BITMAPS;
+}
+
 static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr)
 {
int i;
@@ -1390,6 +1465,8 @@ static void vmx_load_host_state(struct vcpu_vmx *vmx)
preempt_enable();
 }
 
+static void 

[PATCH 7/7] Nested VMX patch 7 handling of nested guest exits

2009-12-10 Thread oritw
From: Orit Wasserman or...@il.ibm.com

---
 arch/x86/kvm/vmx.c |  521 +++-
 1 files changed, 515 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 0d36b49..203f016 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -262,6 +262,10 @@ struct nested_vmx {
gpa_t current_vmptr;
/* Are we running nested guest */
bool nested_mode;
+   /* L1 requested VMLAUNCH or VMRESUME but we didn't run L2 yet */
+   bool nested_run_pending;
+   /* flag indicating if there was a valid IDT after exiting from l2 */
+   bool valid_idt_vectoring_info;
/* Level 1 state for switching to level 2 and back */
struct level_state *l1_state;
/* Level 1 shadow vmcs for switching to level 2 and back */
@@ -908,9 +912,16 @@ static u64 host_efer;
 static void ept_save_pdptrs(struct kvm_vcpu *vcpu);
 
 static int nested_vmx_check_permission(struct kvm_vcpu *vcpu);
+static int nested_vmx_check_exception(struct vcpu_vmx *vmx, unsigned nr,
+ bool has_error_code, u32 error_code);
+static int nested_vmx_intr(struct kvm_vcpu *vcpu);
 static int create_l1_state(struct kvm_vcpu *vcpu);
 static int create_l2_state(struct kvm_vcpu *vcpu);
 static int launch_guest(struct kvm_vcpu *vcpu, bool launch);
+static int nested_vmx_exit_handled_msr(struct kvm_vcpu *vcpu);
+static int nested_vmx_exit_handled(struct kvm_vcpu *vcpu, bool kvm_override);
+static int nested_vmx_vmexit(struct kvm_vcpu *vcpu,
+bool is_interrupt);
 
 /*
  * Keep MSR_K6_STAR at the end, as setup_msrs() will try to optimize it
@@ -1467,6 +1478,8 @@ static void vmx_load_host_state(struct vcpu_vmx *vmx)
 
 static void vmx_fpu_deactivate(struct kvm_vcpu *vcpu);
 
+int load_vmcs_host_state(struct shadow_vmcs *src);
+
 /*
  * Switches to specified vcpu, until a matching vcpu_put(), but assumes
  * vcpu mutex is already taken.
@@ -1503,6 +1516,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
if (vcpu->cpu != cpu) {
struct descriptor_table dt;
unsigned long sysenter_esp;
+   struct shadow_vmcs *l1_shadow_vmcs = vmx->nested.l1_shadow_vmcs;
 
vcpu->cpu = cpu;
/*
@@ -1525,6 +1539,22 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
new_offset = vmcs_read64(TSC_OFFSET) + delta;
vmcs_write64(TSC_OFFSET, new_offset);
}
+
+   if (l1_shadow_vmcs != NULL) {
+   l1_shadow_vmcs->host_tr_base =
+   vmcs_readl(HOST_TR_BASE);
+   l1_shadow_vmcs->host_gdtr_base =
+   vmcs_readl(HOST_GDTR_BASE);
+   l1_shadow_vmcs->host_ia32_sysenter_esp =
+   vmcs_readl(HOST_IA32_SYSENTER_ESP);
+
+   if (tsc_this < vcpu->arch.host_tsc)
+   l1_shadow_vmcs->tsc_offset =
+   vmcs_read64(TSC_OFFSET);
+
+   if (vmx->nested.nested_mode)
+   load_vmcs_host_state(l1_shadow_vmcs);
+   }
}
 }
 
@@ -1611,6 +1641,9 @@ static void vmx_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
struct vcpu_vmx *vmx = to_vmx(vcpu);
u32 intr_info = nr | INTR_INFO_VALID_MASK;
 
+   if (nested_vmx_check_exception(vmx, nr, has_error_code, error_code))
+   return;
+
if (has_error_code) {
vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, error_code);
intr_info |= INTR_INFO_DELIVER_CODE_MASK;
@@ -2185,9 +2218,6 @@ int load_vmcs_common(struct shadow_vmcs *src)
if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
vmcs_write64(GUEST_IA32_PAT, src->guest_ia32_pat);

-   if (src->vm_entry_msr_load_count < 512)
-   vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, src->vm_entry_msr_load_count);
-
vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, src->vm_entry_intr_info_field);
vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
 src->vm_entry_exception_error_code);
@@ -3794,6 +3824,11 @@ static void enable_irq_window(struct kvm_vcpu *vcpu)
 {
u32 cpu_based_vm_exec_control;
 
+   if (to_vmx(vcpu)->nested.nested_mode) {
+   nested_vmx_intr(vcpu);
+   return;
+   }
+
cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control);
@@ -3922,6 +3957,11 @@ static void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked)
 
 static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
 {
+   if (to_vmx(vcpu)->nested.nested_mode) {
+   if (!nested_vmx_intr(vcpu))
+  

Re: [PATCH 2/2] qemu-kvm: x86: Add support for VCPU event states

2009-12-10 Thread Marcelo Tosatti
On Tue, Nov 24, 2009 at 08:08:37PM -0200, Marcelo Tosatti wrote:
 On Sun, Nov 15, 2009 at 04:41:26PM +0100, Jan Kiszka wrote:
  Avi Kivity wrote:
   On 11/15/2009 05:02 PM, Jan Kiszka wrote:
  
   Where should I add /* next version to use: 13 */, and who will take
   care that this comment will also be kept up to date? The CPU vmstate is
   already ordered according to logical groups, just look at earlier field.
   Only recent KVM additions happened to create some version ordering as
   well.
  
   
   Er, now I'm confused.  11 and 12 indeed do already exist, so how can you
   update 11 retroactively?
  
  Oh, right, good that we discuss this. My patch dated back before the
  kvmclock addition, which IMHO incorrectly bumped the version numbers. I
  think the current policy in upstream is that we only increment once per
  qemu release, not per bit added.
  
   
   Shouldn't you create 13 now?
  
  No, I rather think Glauber's 12 should be downgraded to 11 - unless it
  misses the qemu merge windows for 0.12. My extensions definitely target
  that release, thus will likely carry 11 in upstream. And we should
  really try to avoid diverging again.
 
 Agree.
 
 Anthony,
 
 Are Glauber patches going to make it for 0.12, so we can revert current
 qemu-kvm's CPU_SAVE_VERSION from 12 to 11?
 
 Thanks.

Anthony,

I still don't see this patch in for qemu 0.12. Please merge it.

Date: Thu, 22 Oct 2009 10:26:56 -0200
From: Glauber Costa glom...@redhat.com
To: qemu-de...@nongnu.org
Subject: [Qemu-devel] [PATCH] v2: properly save kvm system time msr
registers
Message-Id: 1256214416-20554-1-git-send-email-glom...@redhat.com

TIA



Re: Installing kernel headers in kvm-kmod

2009-12-10 Thread Arnd Bergmann
On Thursday 10 December 2009 17:14:40 Jan Kiszka wrote:
 Avi Kivity wrote:
  On 12/10/2009 06:42 PM, Jan Kiszka wrote:
  I've just (forced-)pushed the simple version with
  /usr/include/kvm-kmod as destination. The user headers are now stored
  under usr/include in the kvm-kmod sources and installed from there.
 
  
  It's customary to install to /usr/local, not to /usr (qemu does the same).

Right. Specifically, an install from source should go to /usr/local/include
by default, while a distro package should override the path to go to
/usr/include, which the current version easily allows.

This also means that qemu will have to look in three places now,
/usr/local/include/kvm-kmod, /usr/include/kvm-kmod and /usr/include.
Adding /usr/local/include probably doesn't hurt but should not be
necessary.

 Adjusted accordingly. Moreover, I only install the target arch's header now.

Looks good now.

Arnd


Re: Installing kernel headers in kvm-kmod

2009-12-10 Thread Avi Kivity

On 12/10/2009 10:26 PM, Arnd Bergmann wrote:
 On Thursday 10 December 2009 17:14:40 Jan Kiszka wrote:
  Avi Kivity wrote:
   On 12/10/2009 06:42 PM, Jan Kiszka wrote:
   I've just (forced-)pushed the simple version with
   /usr/include/kvm-kmod as destination. The user headers are now stored
   under usr/include in the kvm-kmod sources and installed from there.
  
   It's customary to install to /usr/local, not to /usr (qemu does the same).
 
 Right. Specifically, an install from source should go to /usr/local/include
 by default, while a distro package should override the path to go to
 /usr/include, which the current version easily allows.
 
 This also means that qemu will have to look in three places now,
 /usr/local/include/kvm-kmod, /usr/include/kvm-kmod and /usr/include.
 Adding /usr/local/include probably doesn't hurt but should not be
 necessary.

The only icky bit is that /usr/local/include/kvm-kmod will stick around
after the user forgets about it and switches to the kernel headers.  I
don't see a way around it (it's the generic uninstall problem), so I
think we should just live with it.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Parallel Port question

2009-12-10 Thread Erik Rull

Hi all,

I'm currently running kvm-88 with parallel port forwarding and Windows XP as 
guest. The Windows XP guest talks directly to the hardware registers of the 
parallel port and tries to program a piece of hardware with a proprietary 
software tool. The hardware is connected via the parallel port and a JTAG 
interface. (The chip ID is recognized correctly over the port.)


Running Windows natively on the same system, programming succeeds. If I run 
it within the Windows guest, the Busy flag of the parallel port seems to 
react too slowly: the software expects a low signal, but the bit is still 
set.


Are there any possibilities to enable more direct access to the parallel 
port? Can I somehow map the block of I/O registers directly into the guest 
for access? And what does the Linux host driver layer do with the parallel 
port bits -- are they modified within the driver?
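For reference, a first thing to check on the host side is which kernel driver has claimed the port's I/O range (this assumes Linux's /proc interface; 0x378 is the legacy LPT1 base, so adjust for your hardware):

```shell
# List who owns the legacy parallel port I/O range on the host.
# If parport/parport_pc shows up, the host driver sits between the
# guest and the wires and may be what delays status bits like Busy.
grep -iE 'parport|0378' /proc/ioports || echo "no parallel port registered"
```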


Thanks!

Best regards,

Erik



Re: Installing kernel headers in kvm-kmod

2009-12-10 Thread Anthony Liguori

Avi Kivity wrote:
> On 12/10/2009 04:50 PM, Arnd Bergmann wrote:
>> On Thursday 10 December 2009, Jan Kiszka wrote:
>>> Anthony Liguori wrote:
>>>> QEMU 0.12.0-rc1 does not support KVM
>>>> https://bugs.launchpad.net/bugs/494500
>>>>
>>>> Boils down to the fact that 1) we don't include kernel headers in qemu
>>>> (whereas qemu-kvm does) and 2) kvm-kmod does not install those headers
>>>> on make install.
>>>>
>>>> I think we've discussed (2) as being the preferred solution.  Does
>>>> everyone agree with that?  Anyone care to volunteer to make the
>>>> change? :-)
>>> I've pushed a half-tested approach into kvm-kmod's next branch. Feel
>>> free to test/fix/enhance it.
>> This would work, but installing to /usr/include/linux/kvm.h will confuse
>> distro package managers a lot, because that location belongs to the glibc
>> or libc-linux-headers or some other package already.
>>
>> If you want to install the headers from kvm-kmod, I would recommend
>> doing it in a different path, e.g. /usr/include/kvm-kmod/{linux,asm}.
> Maybe even /usr/local/include/kvm-kmod-$version/, and a symlink
> /usr/local/include/kvm-kmod.

A pkg-config file would be nice.  Then we need no symlink.  Makes qemu
interaction saner.
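For what it's worth, a minimal sketch of such a kvm-kmod.pc (the version, prefix, and install location here are all hypothetical):

```shell
# Hypothetical kvm-kmod.pc as a make-install rule might generate it;
# dropped into a pkg-config search path, it would let qemu's configure
# run `pkg-config --cflags kvm-kmod` instead of probing include dirs.
cat > kvm-kmod.pc << 'EOF'
prefix=/usr/local
includedir=${prefix}/include/kvm-kmod-2.6.32

Name: kvm-kmod
Description: KVM userspace API headers shipped with kvm-kmod
Version: 2.6.32
Cflags: -I${includedir}
EOF
```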


Regards,

Anthony Liguori


Memory under KVM?

2009-12-10 Thread Christian Fernandez
Hi everyone, I'm new to the list and I have a couple of questions that we 
are wondering about here at work.
We have noticed that the KVM processes on the host take much more memory 
than the memory we have told the VM to use. A rough example: if we tell KVM 
to use 2 GB for one VM, it ends up showing in the host's process list for 
that VM as 3 GB or more.
Why do I ask? We need to figure out how much memory to add to our host 
server so we can calculate the number of VMs we can run there.
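One way to see the gap in question is to compare the VM's configured size against what the host actually accounts for the qemu-kvm process. A rough sketch, shown here against the current shell's PID as a stand-in:

```shell
# VmSize is the process's virtual size (guest RAM plus qemu's own code,
# heap, video RAM, and mmap'd ROMs); VmRSS is what is actually resident.
pid=$$   # replace with the qemu-kvm PID of the VM in question
grep -E '^Vm(Size|RSS):' /proc/$pid/status
```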


Thanks for the help






[PATCH] Inform users about busy device assignment attempt

2009-12-10 Thread Alexander Graf
When using -pcidevice on a device that is already in use by a kernel driver
all the user gets is the following (very useful) information:

  Failed to assign device 04:00.0 : Device or resource busy
  Failed to deassign device 04:00.0 : Invalid argument
  Error initializing device pci-assign

Since I usually prefer to have my computer do the thinking for me, I figured
it might be a good idea to check and see if a device is actually used by a
driver. If so, tell the user.

So with this patch applied you get the following output:

  Failed to assign device 04:00.0 : Device or resource busy
  *** The driver 'igb' is occupying your device 04:00.0.
  ***
  *** You can try the following commands to free it:
  ***
  *** $ echo 8086 150a > /sys/bus/pci/drivers/pci-stub/new_id
  *** $ echo 0000:04:00.0 > /sys/bus/pci/drivers/igb/unbind
  *** $ echo 0000:04:00.0 > /sys/bus/pci/drivers/pci-stub/bind
  *** $ echo 8086 150a > /sys/bus/pci/drivers/pci-stub/remove_id
  ***
  Failed to deassign device 04:00.0 : Invalid argument
  Error initializing device pci-assign

That should keep people like me from doing the most obvious misuses :-).

CC: Daniel P. Berrange berra...@redhat.com
Signed-off-by: Alexander Graf ag...@suse.de

---

v1 - v2:

  - add more helpful guidance thanks to Daniel Berrange

v2 - v3:

  - clear name variable before using it, thus 0-terminating the string
  - fix region numbers
  - use correct unbind/bind names
---
 hw/device-assignment.c |  109 +---
 1 files changed, 85 insertions(+), 24 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 5cee929..98faa83 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -564,14 +564,44 @@ static int assigned_dev_register_regions(PCIRegion *io_regions,
 return 0;
 }
 
+static int get_real_id(const char *devpath, const char *idname, uint16_t *val)
+{
+    FILE *f;
+    char name[128];
+    long id;
+
+    snprintf(name, sizeof(name), "%s%s", devpath, idname);
+    f = fopen(name, "r");
+    if (f == NULL) {
+        fprintf(stderr, "%s: %s: %m\n", __func__, name);
+        return -1;
+    }
+    if (fscanf(f, "%li\n", &id) == 1) {
+        *val = id;
+    }
+    fclose(f);
+
+    return 0;
+}
+
+static int get_real_vendor_id(const char *devpath, uint16_t *val)
+{
+    return get_real_id(devpath, "vendor", val);
+}
+
+static int get_real_device_id(const char *devpath, uint16_t *val)
+{
+    return get_real_id(devpath, "device", val);
+}
+
 static int get_real_device(AssignedDevice *pci_dev, uint8_t r_bus,
                            uint8_t r_dev, uint8_t r_func)
 {
     char dir[128], name[128];
-    int fd, r = 0;
+    int fd, r = 0, v;
     FILE *f;
     unsigned long long start, end, size, flags;
-    unsigned long id;
+    uint16_t id;
     struct stat statbuf;
     PCIRegion *rp;
     PCIDevRegions *dev = &pci_dev->real_device;
@@ -637,31 +667,21 @@ again:
 
 fclose(f);
 
-    /* read and fill device ID */
-    snprintf(name, sizeof(name), "%svendor", dir);
-    f = fopen(name, "r");
-    if (f == NULL) {
-        fprintf(stderr, "%s: %s: %m\n", __func__, name);
+    /* read and fill vendor ID */
+    v = get_real_vendor_id(dir, &id);
+    if (v) {
         return 1;
     }
-    if (fscanf(f, "%li\n", &id) == 1) {
-        pci_dev->dev.config[0] = id & 0xff;
-        pci_dev->dev.config[1] = (id & 0xff00) >> 8;
-    }
-    fclose(f);
+    pci_dev->dev.config[0] = id & 0xff;
+    pci_dev->dev.config[1] = (id & 0xff00) >> 8;
 
-    /* read and fill vendor ID */
-    snprintf(name, sizeof(name), "%sdevice", dir);
-    f = fopen(name, "r");
-    if (f == NULL) {
-        fprintf(stderr, "%s: %s: %m\n", __func__, name);
+    /* read and fill device ID */
+    v = get_real_device_id(dir, &id);
+    if (v) {
         return 1;
     }
-    if (fscanf(f, "%li\n", &id) == 1) {
-        pci_dev->dev.config[2] = id & 0xff;
-        pci_dev->dev.config[3] = (id & 0xff00) >> 8;
-    }
-    fclose(f);
+    pci_dev->dev.config[2] = id & 0xff;
+    pci_dev->dev.config[3] = (id & 0xff00) >> 8;
 
     /* dealing with virtual function device */
     snprintf(name, sizeof(name), "%sphysfn/", dir);
@@ -739,7 +759,9 @@ static uint32_t calc_assigned_dev_id(uint8_t bus, uint8_t devfn)
 static int assign_device(AssignedDevice *dev)
 {
     struct kvm_assigned_pci_dev assigned_dev_data;
-    int r;
+    char name[128], dir[128], driver[128], *ns;
+    uint16_t vendor_id, device_id;
+    int r, v;
 
     memset(&assigned_dev_data, 0, sizeof(assigned_dev_data));
     assigned_dev_data.assigned_dev_id  =
@@ -761,9 +783,48 @@ static int assign_device(AssignedDevice *dev)
 #endif
 
     r = kvm_assign_pci_device(kvm_context, &assigned_dev_data);
-    if (r < 0)
+    if (r < 0) {
         fprintf(stderr, "Failed to assign device \"%s\" : %s\n",
                 dev->dev.qdev.id, strerror(-r));
+
+        snprintf(dir, sizeof(dir),
+                 "/sys/bus/pci/devices/0000:%02x:%02x.%x/",
+                 dev->host.bus, dev->host.dev, dev->host.func);
+
+

[PATCH] Enable non page boundary BAR device assignment

2009-12-10 Thread Alexander Graf
While trying to get device passthrough working with an emulex hba, kvm
refused to pass it through because it has a BAR of 256 bytes:

Region 0: Memory at d210 (64-bit, non-prefetchable) [size=4K]
Region 2: Memory at d2101000 (64-bit, non-prefetchable) [size=256]
Region 4: I/O ports at b100 [size=256]

Since the page boundary is an arbitrary optimization to allow 1:1 mapping of
physical to virtual addresses, we can still take the old MMIO callback route.

So let's add a second code path that allows regions with (size & 0xFFF) != 0
by looping accesses through userspace.
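The condition the patch keys on is simply whether the BAR size is a whole number of 4K pages; a quick illustration of the check:

```shell
# A BAR can be handed to the guest via direct mmap only if its size is
# a multiple of the page size; anything smaller (like the 256-byte BAR
# above) must take the slow MMIO-callback path through userspace.
for size in 256 4096 16384; do
    if [ $(( size & 0xFFF )) -ne 0 ]; then
        echo "size=$size: slow userspace path"
    else
        echo "size=$size: direct mmap path"
    fi
done
```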

I verified that it works by passing through an e1000 with this additional patch
applied and the card acted the same way it did without this patch:

         map_func = assigned_dev_iomem_map;
-        if (cur_region->size & 0xFFF) {
+        if (i != PCI_ROM_SLOT) {
             fprintf(stderr, "PCI region %d at address 0x%llx "

Signed-off-by: Alexander Graf ag...@suse.de

---

v1 - v2:

  - don't use map_func function pointer
  - use the same code for mmap on fast and slow path
---
 hw/device-assignment.c |  123 +---
 1 files changed, 116 insertions(+), 7 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 13a86bb..5cee929 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -148,6 +148,105 @@ static uint32_t assigned_dev_ioport_readl(void *opaque, uint32_t addr)
 return value;
 }
 
+static uint32_t slow_bar_readb(void *opaque, target_phys_addr_t addr)
+{
+    AssignedDevRegion *d = opaque;
+    uint8_t *in = (uint8_t *)(d->u.r_virtbase + addr);
+    uint32_t r = -1;
+
+    r = *in;
+    DEBUG("slow_bar_readb addr=0x" TARGET_FMT_plx " val=0x%02x\n", addr, r);
+
+    return r;
+}
+
+static uint32_t slow_bar_readw(void *opaque, target_phys_addr_t addr)
+{
+    AssignedDevRegion *d = opaque;
+    uint16_t *in = (uint16_t *)(d->u.r_virtbase + addr);
+    uint32_t r = -1;
+
+    r = *in;
+    DEBUG("slow_bar_readw addr=0x" TARGET_FMT_plx " val=0x%04x\n", addr, r);
+
+    return r;
+}
+
+static uint32_t slow_bar_readl(void *opaque, target_phys_addr_t addr)
+{
+    AssignedDevRegion *d = opaque;
+    uint32_t *in = (uint32_t *)(d->u.r_virtbase + addr);
+    uint32_t r = -1;
+
+    r = *in;
+    DEBUG("slow_bar_readl addr=0x" TARGET_FMT_plx " val=0x%08x\n", addr, r);
+
+    return r;
+}
+
+static void slow_bar_writeb(void *opaque, target_phys_addr_t addr, uint32_t val)
+{
+    AssignedDevRegion *d = opaque;
+    uint8_t *out = (uint8_t *)(d->u.r_virtbase + addr);
+
+    DEBUG("slow_bar_writeb addr=0x" TARGET_FMT_plx " val=0x%02x\n", addr, val);
+    *out = val;
+}
+
+static void slow_bar_writew(void *opaque, target_phys_addr_t addr, uint32_t val)
+{
+    AssignedDevRegion *d = opaque;
+    uint16_t *out = (uint16_t *)(d->u.r_virtbase + addr);
+
+    DEBUG("slow_bar_writew addr=0x" TARGET_FMT_plx " val=0x%04x\n", addr, val);
+    *out = val;
+}
+
+static void slow_bar_writel(void *opaque, target_phys_addr_t addr, uint32_t val)
+{
+    AssignedDevRegion *d = opaque;
+    uint32_t *out = (uint32_t *)(d->u.r_virtbase + addr);
+
+    DEBUG("slow_bar_writel addr=0x" TARGET_FMT_plx " val=0x%08x\n", addr, val);
+    *out = val;
+}
+
+static CPUWriteMemoryFunc * const slow_bar_write[] = {
+slow_bar_writeb,
+slow_bar_writew,
+slow_bar_writel
+};
+
+static CPUReadMemoryFunc * const slow_bar_read[] = {
+slow_bar_readb,
+slow_bar_readw,
+slow_bar_readl
+};
+
+static void assigned_dev_iomem_map_slow(PCIDevice *pci_dev, int region_num,
+                                        pcibus_t e_phys, pcibus_t e_size,
+                                        int type)
+{
+    AssignedDevice *r_dev = container_of(pci_dev, AssignedDevice, dev);
+    AssignedDevRegion *region = &r_dev->v_addrs[region_num];
+    PCIRegion *real_region = &r_dev->real_device.regions[region_num];
+    int m;
+
+    DEBUG("slow map\n");
+    m = cpu_register_io_memory(slow_bar_read, slow_bar_write, region);
+    cpu_register_physical_memory(e_phys, e_size, m);
+
+    /* MSI-X MMIO page */
+    if ((e_size > 0) &&
+        real_region->base_addr <= r_dev->msix_table_addr &&
+        real_region->base_addr + real_region->size >= r_dev->msix_table_addr) {
+        int offset = r_dev->msix_table_addr - real_region->base_addr;
+
+        cpu_register_physical_memory(e_phys + offset,
+                                     TARGET_PAGE_SIZE, r_dev->mmio_index);
+    }
+}
+
 static void assigned_dev_iomem_map(PCIDevice *pci_dev, int region_num,
pcibus_t e_phys, pcibus_t e_size, int type)
 {
@@ -381,15 +480,22 @@ static int assigned_dev_register_regions(PCIRegion *io_regions,
 
     /* handle memory io regions */
     if (cur_region->type & IORESOURCE_MEM) {
+        int slow_map = 0;
         int t = cur_region->type & IORESOURCE_PREFETCH
                 ? PCI_BASE_ADDRESS_MEM_PREFETCH
                 : PCI_BASE_ADDRESS_SPACE_MEMORY;
+
         if (cur_region->size & 0xFFF) {

KVM networking performance

2009-12-10 Thread Jan Houstek

Hello all.

As there's very little info about network performance in KVM on the web, 
please share your experiences or findings about running KVM guests with 
non-trivial network traffic (iSCSI access, file servers, etc.).


- Do you enable KVM guests to directly use host's PCI network interface 
via VT-d? Is performance comparable to native performance in host?


- Do you use virtio for emulated interfaces? Does it really improve
performance?
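For readers comparing setups: a typical invocation of that era enabling a virtio NIC on a host tap device looked roughly like this (disk image, interface name, MAC address, and the ifup script path are all placeholders):

```shell
# Sketch only: virtio NIC attached to the host via tap0, which an
# /etc/qemu-ifup script would add to a bridge.
qemu-system-x86_64 -enable-kvm -m 1024 -hda guest.img \
    -net nic,model=virtio,macaddr=52:54:00:12:34:56 \
    -net tap,ifname=tap0,script=/etc/qemu-ifup
```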

- Do you use VDE? For switching between guests inside the same host, 
which has better performance -- VDE or TAP connected by bridge in 
host's kernel?


- For connection between guests and real network, which has better 
performance -- routing or bridging in the host?


Any other findings to this topic are welcome.

Regards,

-- HH

