Re: [PATCH 0/3] KVM_VCPU_GET_REG_LIST API

2012-10-19 Thread Rusty Russell
Rusty Russell ru...@rustcorp.com.au writes:
 Avi Kivity a...@redhat.com writes:
 On 09/05/2012 10:58 AM, Rusty Russell wrote:
 This is the generic part of the KVM_SET_ONE_REG/KVM_GET_ONE_REG
 enhancements which ARM wants, rebased onto kvm/next.

 This was stalled for so long it needs rebasing again, sorry.

 But otherwise I'm happy to apply.

 Ok, will rebase and re-test against kvm-next.

Wait, what?  kvm/arm isn't in kvm-next?

This will produce a needless clash with that, which is more important
than this cleanup.  I'll rebase this as soon as that is merged.

Christoffer, is there anything I can help with?

Cheers,
Rusty.


Re: [PATCH] vhost-blk: Add vhost-blk support v2

2012-10-19 Thread Asias He
On 10/18/2012 12:20 PM, Rusty Russell wrote:
 Asias He as...@redhat.com writes:
 +#define BLK_HDR   0

 What's this for, exactly? Please add a comment.

 The block header is in the first, separate buffer.
 
 Please don't assume this!  We're trying to fix all the assumptions in
 qemu at the moment.
 
 vhost_net handles this correctly, taking bytes off the descriptor chain
 as required.


Well, I will switch to making no assumptions about the buffer layout in
the next version.
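
(For the record, reading the header without layout assumptions is just a
copy across iovec boundaries; a minimal userspace-style sketch with
illustrative names, not the actual vhost code:

#include <string.h>
#include <sys/uio.h>

/* Copy the first 'len' bytes out of an iovec chain, no matter how the
 * guest split its buffers.  Returns the number of bytes actually copied. */
static size_t copy_from_iov(void *dst, size_t len,
                            const struct iovec *iov, int iovcnt)
{
        char *out = dst;
        size_t copied = 0;

        while (copied < len && iovcnt-- > 0) {
                size_t n = iov->iov_len;

                if (n > len - copied)
                        n = len - copied;
                memcpy(out + copied, iov->iov_base, n);
                copied += n;
                iov++;
        }
        return copied;
}

vhost_net does the equivalent on the descriptor chain.)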


-- 
Asias


Hardware for KVM host

2012-10-19 Thread Lukas Laukamp

Hey all,

at the moment I am looking for hardware for a KVM host, but on a few 
points I am really having trouble putting together a good setup.


My situation is the following: I need a server with minimal energy 
consumption, support for PCI passthrough (so an IOMMU is needed) and good 
support for nested virtualization (KVM, Xen and VMware).


So if I understood everything correctly, I need a CPU and a chipset on the 
board which support an IOMMU. Which consumer boards do, and which of them 
work well with KVM?


On the nested virtualization topic, I read that EPT is needed in the case 
of Intel and RVI in the case of AMD. I also read that AMD's RVI is 
supposed to be better than Intel's EPT and to bring a good speedup; 
Wikipedia mentions a VMware research whitepaper and Red Hat tests 
(http://en.wikipedia.org/wiki/Rapid_Virtualization_Indexing). It also 
seems that implementing nested virtualization is easier with AMD's RVI 
than with Intel's EPT. So which CPU with which chipset (board) should I 
prefer to get these features?


For RAM I am thinking about 32GB, because my experience shows that 
virtualization needs a lot of RAM. Would that be a good amount when the 
VM host runs a few VMs with simple server services (HTTP, SMTP etc.)?


I want to run VMs with a lot of I/O, so I am thinking hard about the 
storage setup. I want a few TB of space, for example four 3TB hard disks 
in RAID5. All virtualization solutions (Xen, VMware) recommend hardware 
RAID. I would also prefer hardware RAID, but would software RAID hurt I/O 
performance very much? And would it make a difference to use software 
RAID with mdadm (for example on an ext4 filesystem) rather than native 
RAID support as in btrfs?
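
For the mdadm variant I would create the array roughly like this (device 
names are only placeholders for the four disks):

mdadm --create /dev/md0 --level=5 --raid-devices=4 \
      /dev/sdb /dev/sdc /dev/sdd /dev/sde
mkfs.ext4 /dev/md0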


Which modern hardware RAID controllers are fully supported by Linux, and 
are there recommendations for a particular interface (SATA, SAS etc.)?


I don't have much experience with KVM-over-IP management, but I want to 
start using it. Are there special requirements the board or operating 
system must fulfil? Does anyone here have experience with KVM-over-IP 
management and can recommend adapters or cards?


Regarding the price I am targeting: it would be great to keep the cost 
between 800 and 1400€ for the host system, excluding the KVM-over-IP 
console. If it has to be more, that is OK as long as it stays below 2500€.


Best Regards


RE: Shared IRQ with PCI Passthrough?

2012-10-19 Thread Veruca Salt




 To: kvm@vger.kernel.org
 From: msch...@gmx.eu
 Subject: Re: Shared IRQ with PCI Passthrough?
 Date: Thu, 18 Oct 2012 20:09:56 +0000

 Jan Kiszka jan.kiszka at siemens.com writes:

 
  On 2012-10-15 11:07, Marco wrote:
   Jan Kiszka jan.kiszka at siemens.com writes:
  
   
  
  
   Nope, there is no IRQ sharing support for assigned devices in any public
   version so far. I'm on it, but some issues remain to be solved.
  
   Jan
  
  
  
   Hi, any news on this? I own an Intel DQ67OW that has the same issue. No PCI
   passthrough possible with KVM when USB is active.
 
We encountered severe problems with the DQ67OW; it proved all but impossible to 
pass through USB, as you have to pass both PCI bridges through as well, in which 
case you get boot problems, with actual PCI cards being bounced from host to 
guest and back several times.

We worked around this (our eventual kernel was pretty unstable in any case, and 
the IOMMU crashed it) by using the KVM ehci script file to load USB 2 emulation. 
We don't use passthrough at all in the current iteration.

However, while our kernel is the main reason, the DQ67OW board has awkward 
IRQs, particularly in 64-bit, which would have nixed the IOMMU for us in any case.

  Supported by qemu-kvm-1.2 and Linux >= 3.4. But not all devices play
  well with it, so your mileage may vary.
 
  Jan
 


 Unfortunately I had no luck trying on Ubuntu Quantal (qemu-kvm 1.2 and Linux
 3.5). Exactly the same error message as before:

 Failed to assign irq for hostdev0: Input/output error
 Perhaps you are assigning a device that shares an IRQ with another device?

 Marco



[PATCH] KVM: x86: fix vcpu->mmio_fragments overflow

2012-10-19 Thread Xiao Guangrong
After commit b3356bf0dbb349 ("KVM: emulator: optimize rep ins handling"),
the pieces of I/O data can be collected and written to guest memory or
MMIO together.

Unfortunately, KVM splits the MMIO access into 8-byte pieces and stores
them in vcpu->mmio_fragments. If the guest uses "rep ins" to move large
data, it will overflow vcpu->mmio_fragments.

The bug can be exposed by isapc (-M isapc):

[23154.818733] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
[ ... ]
[23154.858083] Call Trace:
[23154.859874]  [<ffffffffa04f0e17>] kvm_get_cr8+0x1d/0x28 [kvm]
[23154.861677]  [<ffffffffa04fa6d4>] kvm_arch_vcpu_ioctl_run+0xcda/0xe45 [kvm]
[23154.863604]  [<ffffffffa04f5a1a>] ? kvm_arch_vcpu_load+0x17b/0x180 [kvm]


Actually, since the MMIO access is always contiguous, we can use a single
mmio_fragment to store a large MMIO access and split it only when we pass
the MMIO exit info to userspace. After that, we need at most two entries
to store the MMIO info for an access that crosses MMIO pages.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/x86.c   |  127 +-
 include/linux/kvm_host.h |   16 +-
 2 files changed, 84 insertions(+), 59 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8b90dd5..41ceb51 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3779,9 +3779,6 @@ static int read_exit_mmio(struct kvm_vcpu *vcpu, gpa_t gpa,
 static int write_exit_mmio(struct kvm_vcpu *vcpu, gpa_t gpa,
   void *val, int bytes)
 {
-	struct kvm_mmio_fragment *frag = &vcpu->mmio_fragments[0];
-
-	memcpy(vcpu->run->mmio.data, frag->data, frag->len);
return X86EMUL_CONTINUE;
 }

@@ -3799,6 +3796,64 @@ static const struct read_write_emulator_ops write_emultor = {
.write = true,
 };

+static bool get_current_mmio_info(struct kvm_vcpu *vcpu, gpa_t *gpa,
+				  unsigned *len, void **data)
+{
+	struct kvm_mmio_fragment *frag;
+	int cur = vcpu->mmio_cur_fragment;
+
+	if (cur >= vcpu->mmio_nr_fragments)
+		return false;
+
+	frag = &vcpu->mmio_fragments[cur];
+	if (frag->pos >= frag->len) {
+		if (++vcpu->mmio_cur_fragment >= vcpu->mmio_nr_fragments)
+			return false;
+		frag++;
+	}
+
+	*gpa = frag->gpa + frag->pos;
+	*data = frag->data + frag->pos;
+	*len = min(8u, frag->len - frag->pos);
+	return true;
+}
+
+static void complete_current_mmio(struct kvm_vcpu *vcpu)
+{
+	struct kvm_mmio_fragment *frag;
+	gpa_t gpa;
+	unsigned len;
+	void *data;
+
+	get_current_mmio_info(vcpu, &gpa, &len, &data);
+
+	if (!vcpu->mmio_is_write)
+		memcpy(data, vcpu->run->mmio.data, len);
+
+	/* Increase frag->pos to switch to the next mmio. */
+	frag = &vcpu->mmio_fragments[vcpu->mmio_cur_fragment];
+	frag->pos += len;
+}
+
+static bool vcpu_fill_mmio_exit_info(struct kvm_vcpu *vcpu)
+{
+	gpa_t gpa;
+	unsigned len;
+	void *data;
+
+	if (!get_current_mmio_info(vcpu, &gpa, &len, &data))
+		return false;
+
+	vcpu->run->mmio.len = len;
+	vcpu->run->mmio.is_write = vcpu->mmio_is_write;
+	vcpu->run->exit_reason = KVM_EXIT_MMIO;
+	vcpu->run->mmio.phys_addr = gpa;
+
+	if (vcpu->mmio_is_write)
+		memcpy(vcpu->run->mmio.data, data, len);
+	return true;
+}
+
 static int emulator_read_write_onepage(unsigned long addr, void *val,
   unsigned int bytes,
   struct x86_exception *exception,
@@ -3834,18 +3889,12 @@ mmio:
bytes -= handled;
val += handled;

-	while (bytes) {
-		unsigned now = min(bytes, 8U);
-
-		frag = &vcpu->mmio_fragments[vcpu->mmio_nr_fragments++];
-		frag->gpa = gpa;
-		frag->data = val;
-		frag->len = now;
-
-		gpa += now;
-		val += now;
-		bytes -= now;
-	}
+	WARN_ON(vcpu->mmio_nr_fragments >= KVM_MAX_MMIO_FRAGMENTS);
+	frag = &vcpu->mmio_fragments[vcpu->mmio_nr_fragments++];
+	frag->pos = 0;
+	frag->gpa = gpa;
+	frag->data = val;
+	frag->len = bytes;
return X86EMUL_CONTINUE;
 }

@@ -3855,7 +3904,6 @@ int emulator_read_write(struct x86_emulate_ctxt *ctxt, unsigned long addr,
 		       const struct read_write_emulator_ops *ops)
 {
 	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
-	gpa_t gpa;
 	int rc;
 
 	if (ops->read_write_prepare &&
@@ -3887,17 +3935,13 @@ int emulator_read_write(struct x86_emulate_ctxt *ctxt, unsigned long addr,
 	if (!vcpu->mmio_nr_fragments)
 		return rc;
 
-	gpa = vcpu->mmio_fragments[0].gpa;
-
 	vcpu->mmio_needed = 1;
 	vcpu->mmio_cur_fragment = 0;
+	vcpu->mmio_is_write = ops->write;
 
-	vcpu->run->mmio.len = vcpu->mmio_fragments[0].len;
-	

[PATCH] emulator test: add rep ins mmio access test

2012-10-19 Thread Xiao Guangrong
Add a test to trigger the bug where "rep ins" causes vcpu->mmio_fragments
to overflow while moving large data from an I/O port to MMIO.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 x86/emulator.c |   14 ++
 1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/x86/emulator.c b/x86/emulator.c
index 24b33d1..0735405 100644
--- a/x86/emulator.c
+++ b/x86/emulator.c
@@ -731,6 +731,18 @@ static void test_crosspage_mmio(volatile uint8_t *mem)
 report("cross-page mmio write", mem[4095] == 0xaa && mem[4096] == 0x88);
 }

+static void test_string_io_mmio(volatile uint8_t *mem)
+{
+	/* Cross MMIO pages.*/
+	volatile uint8_t *mmio = mem + 4032;
+
+	asm volatile("outw %%ax, %%dx \n\t" : : "a"(0x9999), "d"(TESTDEV_IO_PORT));
+
+	asm volatile("cld; rep insb" : : "d"(TESTDEV_IO_PORT), "D"(mmio), "c"(1024));
+
+	report("string_io_mmio", mmio[1023] == 0x99);
+}
+
 static void test_lgdt_lidt(volatile uint8_t *mem)
 {
 struct descriptor_table_ptr orig, fresh = {};
@@ -878,6 +890,8 @@ int main()

test_crosspage_mmio(mem);

+   test_string_io_mmio(mem);
+
	printf("\nSUMMARY: %d tests, %d failures\n", tests, fails);
return fails ? 1 : 0;
 }
-- 
1.7.7.6



Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-19 Thread Raghavendra K T

On 10/18/2012 06:09 PM, Avi Kivity wrote:

On 10/09/2012 08:51 PM, Raghavendra K T wrote:

Here is the summary:
We do get good benefit by increasing ple window. Though we don't
see good benefit for kernbench and sysbench, for ebizzy, we get huge
improvement for 1x scenario. (almost 2/3rd of ple disabled case).

Let me know if you think we can increase the default ple_window
itself to 16k.



I think so, there is no point running with untuned defaults.



Okay.



I can respin the whole series including this default ple_window change.


It can come as a separate patch.


Yes. Will spin it separately.





I also have the perf kvm top result for both ebizzy and kernbench.
I think they are along the expected lines now.

Improvements


16 core PLE machine with 16 vcpu guest

base = 3.6.0-rc5 + ple handler optimization patches
base_pleopt_16k = base + ple_window = 16k
base_pleopt_32k = base + ple_window = 32k
base_pleopt_nople = base + ple_gap = 0
kernbench, hackbench, sysbench (time in sec lower is better)
ebizzy (rec/sec higher is better)

% improvements w.r.t base (ple_window = 4k)
---------------+-----------------+-----------------+-------------------+
               | base_pleopt_16k | base_pleopt_32k | base_pleopt_nople |
---------------+-----------------+-----------------+-------------------+
kernbench_1x   |     0.42371     |     1.15164     |      0.09320      |
kernbench_2x   |    -1.40981     |   -17.48282     |   -570.77053      |
---------------+-----------------+-----------------+-------------------+
sysbench_1x    |    -0.92367     |     0.24241     |     -0.27027      |
sysbench_2x    |    -2.22706     |    -0.30896     |     -1.27573      |
sysbench_3x    |    -0.75509     |     0.09444     |     -2.97756      |
---------------+-----------------+-----------------+-------------------+
ebizzy_1x      |    54.99976     |    67.29460     |     74.14076      |
ebizzy_2x      |    -8.83386     |   -27.38403     |    -96.22066      |
---------------+-----------------+-----------------+-------------------+


So it seems we want dynamic PLE windows.  As soon as we enter overcommit
we need to decrease the window.



Okay.
I have a rough idea of the implementation. I'll try that after the
V2 experiments are over.
So in brief, I have these in my queue, priority-wise:

1) V2 version of this patch series (in progress)
2) default PLE window
3) preemption notifiers
4) PV spinlocks
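
(For illustration, the dynamic window idea from above could be as simple
as the sketch below; every name here is hypothetical, just to show the
shape of it, not an actual patch:

static void grow_ple_window(struct kvm_vcpu *vcpu)
{
        /* spinning is cheap at 1x: back off from exiting */
        vcpu->ple_window = min(vcpu->ple_window * 2, ple_window_max);
}

static void shrink_ple_window(struct kvm_vcpu *vcpu)
{
        /* overcommit detected: exit earlier so we can yield */
        vcpu->ple_window = max(vcpu->ple_window / 2, ple_window_min);
}

with shrink_ple_window() called when the PLE handler actually finds a
preempted vcpu to yield to, and grow_ple_window() otherwise.)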



Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-19 Thread Raghavendra K T

On 10/15/2012 08:04 PM, Andrew Theurer wrote:

On Mon, 2012-10-15 at 17:40 +0530, Raghavendra K T wrote:

On 10/11/2012 01:06 AM, Andrew Theurer wrote:

On Wed, 2012-10-10 at 23:24 +0530, Raghavendra K T wrote:

On 10/10/2012 08:29 AM, Andrew Theurer wrote:

On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:

* Avi Kivity a...@redhat.com [2012-10-04 17:00:28]:


On 10/04/2012 03:07 PM, Peter Zijlstra wrote:

On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:



[...]

A big concern I have (if this is 1x overcommit) for ebizzy is that it
has just terrible scalability to begin with.  I do not think we should
try to optimize such a bad workload.



I think my way of running dbench has some flaw, so I went to ebizzy.
Could you let me know how you generally run dbench?


I mount a tmpfs and then specify that mount for dbench to run on.  This
eliminates all IO.  I use a 300 second run time and number of threads is
equal to number of vcpus.  All of the VMs of course need to have a
synchronized start.
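
(Roughly, per VM, something like:

mount -t tmpfs tmpfs /mnt/dbench
dbench -t 300 -D /mnt/dbench $NR_VCPUS

with all the VMs kicked off in sync.)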

I would also make sure you are using a recent kernel for dbench, where
the dcache scalability is much improved.  Without any lock-holder
preemption, the time in spin_lock should be very low:



  21.54%  78016 dbench  [kernel.kallsyms]   [k] copy_user_generic_unrolled
   3.51%  12723 dbench  libc-2.12.so        [.] __strchr_sse42
   2.81%  10176 dbench  dbench              [.] child_run
   2.54%   9203 dbench  [kernel.kallsyms]   [k] _raw_spin_lock
   2.33%   8423 dbench  dbench              [.] next_token
   2.02%   7335 dbench  [kernel.kallsyms]   [k] __d_lookup_rcu
   1.89%   6850 dbench  libc-2.12.so        [.] __strstr_sse42
   1.53%   5537 dbench  libc-2.12.so        [.] __memset_sse2
   1.47%   5337 dbench  [kernel.kallsyms]   [k] link_path_walk
   1.40%   5084 dbench  [kernel.kallsyms]   [k] kmem_cache_alloc
   1.38%   5009 dbench  libc-2.12.so        [.] memmove
   1.24%   4496 dbench  libc-2.12.so        [.] vfprintf
   1.15%   4169 dbench  [kernel.kallsyms]   [k] __audit_syscall_exit




Hi Andrew,
I ran the test with dbench with tmpfs. I do not see any improvements in
dbench for 16k ple window.

So it seems apart from ebizzy no workload benefited by that. and I
agree that, it may not be good to optimize for ebizzy.
I shall drop changing to 16k default window and continue with other
original patch series. Need to experiment with latest kernel.


Thanks for running this again.  I do believe there are some workloads that,
when run at 1x overcommit, would benefit from a larger ple_window [with
the current ple handling code], but I also do not want to potentially
degrade 1x with a larger window.  I do, however, think there may be
another option.  I have not fully worked this out, but I think I am on
to something.

I decided to revert back to just a yield() instead of a yield_to().  My
motivation was that yield_to() [for large VMs] is like a dog chasing its
tail, round and round we go   Just yield(), in particular a yield()
which results in yielding to something -other- than the current VM's
vcpus, helps synchronize the execution of sibling vcpus by deferring
them until the lock holder vcpu is running again.  The more we can do to
get all vcpus running at the same time, the far less we deal with the
preemption problem.  The other benefit is that yield() is far, far lower
overhead than yield_to()

This does assume that vcpus from same VM do not share same runqueues.
Yielding to a sibling vcpu with yield() is not productive for larger VMs
in the same way that yield_to() is not.  My recent results include
restricting vcpu placement so that sibling vcpus do not get to run on
the same runqueue.  I do believe we could implement an initial placement
and load balance policy to strive for this restriction (making it purely
optional, but I bet could also help user apps which use spin locks).

For 1x VMs which still vm_exit due to PLE, I believe we could probably
just leave the ple_window alone, as long as we mostly use yield()
instead of yield_to().  The problem with the unneeded exits in this case
has been the overhead in routines leading up to yield_to() and the
yield_to() itself.  If we use yield() most of the time, this overhead
will go away.

Here is a comparison of yield_to() and yield():

dbench with 20-way VMs, 8 of them on 80-way host:

no PLE                 426 +/- 11.03%
no PLE w/ gangsched  32001 +/-   .37%
PLE with yield()     29207 +/-   .28%
PLE with yield_to()   8175 +/-  1.37%

Yield() is far and away better than yield_to() here and almost approaches
gang sched result.  Here is a link for the perf sched map bitmap:

https://docs.google.com/open?id=0B6tfUNlZ-14weXBfVnFFZGw1akU

The thrashing is way down and sibling vcpus tend to run together,
approximating the behavior of the gang 

Re: I/O errors in guest OS after repeated migration

2012-10-19 Thread Guido Winkelmann
Am Donnerstag, 18. Oktober 2012, 18:05:39 schrieb Avi Kivity:
 On 10/18/2012 05:50 PM, Guido Winkelmann wrote:
  Am Mittwoch, 17. Oktober 2012, 13:25:45 schrieb Brian Jackson:
  On Wednesday, October 17, 2012 10:45:14 AM Guido Winkelmann wrote:
    vda1, logical block 1858771
    Oct 17 17:12:04 localhost kernel: [  212.070600] Buffer I/O error on device vda1, logical block 1858772
    Oct 17 17:12:04 localhost kernel: [  212.070602] Buffer I/O error on device vda1, logical block 1858773
    Oct 17 17:12:04 localhost kernel: [  212.070605] Buffer I/O error on device vda1, logical block 1858774
    Oct 17 17:12:04 localhost kernel: [  212.070607] Buffer I/O error on device vda1, logical block 1858775
    Oct 17 17:12:04 localhost kernel: [  212.070610] Buffer I/O error on device vda1, logical block 1858776
    Oct 17 17:12:04 localhost kernel: [  212.070612] Buffer I/O error on device vda1, logical block 1858777
    Oct 17 17:12:04 localhost kernel: [  212.070615] Buffer I/O error on device vda1, logical block 1858778
    Oct 17 17:12:04 localhost kernel: [  212.070617] Buffer I/O error on device vda1, logical block 1858779
   
   (I was writing a large file at the time, to make sure I actually catch
   I/O
   errors as they happen)
  
  What about newer versions of qemu/kvm? But of course if those work, your
  next task is going to be git bisect it or file a bug with your distro
  that
  is using an ancient version of qemu/kvm.
  
  I've just upgraded both hosts to qemu-kvm 1.2.0
  (qemu-1.2.0-14.fc17.x86_64,
  built from spec files under http://pkgs.fedoraproject.org/cgit/qemu.git/).
  
  The bug is still there.
 
 If you let the guest go idle (no I/O), then migrate it, then restart the
 I/O, do the errors show?

Just tested - yes, they do.

Guido


Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-19 Thread Andrew Theurer
On Fri, 2012-10-19 at 14:00 +0530, Raghavendra K T wrote:
 On 10/15/2012 08:04 PM, Andrew Theurer wrote:
  On Mon, 2012-10-15 at 17:40 +0530, Raghavendra K T wrote:
  On 10/11/2012 01:06 AM, Andrew Theurer wrote:
  On Wed, 2012-10-10 at 23:24 +0530, Raghavendra K T wrote:
  On 10/10/2012 08:29 AM, Andrew Theurer wrote:
  On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
  * Avi Kivity a...@redhat.com [2012-10-04 17:00:28]:
 
  On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
  On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
 
  [...]
  A big concern I have (if this is 1x overcommit) for ebizzy is that it
  has just terrible scalability to begin with.  I do not think we should
  try to optimize such a bad workload.
 
 
  I think my way of running dbench has some flaw, so I went to ebizzy.
  Could you let me know how you generally run dbench?
 
  I mount a tmpfs and then specify that mount for dbench to run on.  This
  eliminates all IO.  I use a 300 second run time and number of threads is
  equal to number of vcpus.  All of the VMs of course need to have a
  synchronized start.
 
  I would also make sure you are using a recent kernel for dbench, where
  the dcache scalability is much improved.  Without any lock-holder
  preemption, the time in spin_lock should be very low:
 
 
    21.54%  78016 dbench  [kernel.kallsyms]   [k] copy_user_generic_unrolled
     3.51%  12723 dbench  libc-2.12.so        [.] __strchr_sse42
     2.81%  10176 dbench  dbench              [.] child_run
     2.54%   9203 dbench  [kernel.kallsyms]   [k] _raw_spin_lock
     2.33%   8423 dbench  dbench              [.] next_token
     2.02%   7335 dbench  [kernel.kallsyms]   [k] __d_lookup_rcu
     1.89%   6850 dbench  libc-2.12.so        [.] __strstr_sse42
     1.53%   5537 dbench  libc-2.12.so        [.] __memset_sse2
     1.47%   5337 dbench  [kernel.kallsyms]   [k] link_path_walk
     1.40%   5084 dbench  [kernel.kallsyms]   [k] kmem_cache_alloc
     1.38%   5009 dbench  libc-2.12.so        [.] memmove
     1.24%   4496 dbench  libc-2.12.so        [.] vfprintf
     1.15%   4169 dbench  [kernel.kallsyms]   [k] __audit_syscall_exit
 
 
  Hi Andrew,
  I ran the test with dbench with tmpfs. I do not see any improvements in
  dbench for 16k ple window.
 
  So it seems apart from ebizzy no workload benefited by that. and I
  agree that, it may not be good to optimize for ebizzy.
  I shall drop changing to 16k default window and continue with other
  original patch series. Need to experiment with latest kernel.
 
  Thanks for running this again.  I do believe there are some workloads that,
  when run at 1x overcommit, would benefit from a larger ple_window [with
  the current ple handling code], but I also do not want to potentially
  degrade 1x with a larger window.  I do, however, think there may be
  another option.  I have not fully worked this out, but I think I am on
  to something.
 
  I decided to revert back to just a yield() instead of a yield_to().  My
  motivation was that yield_to() [for large VMs] is like a dog chasing its
  tail, round and round we go   Just yield(), in particular a yield()
  which results in yielding to something -other- than the current VM's
  vcpus, helps synchronize the execution of sibling vcpus by deferring
  them until the lock holder vcpu is running again.  The more we can do to
  get all vcpus running at the same time, the far less we deal with the
  preemption problem.  The other benefit is that yield() is far, far lower
  overhead than yield_to()
 
  This does assume that vcpus from same VM do not share same runqueues.
  Yielding to a sibling vcpu with yield() is not productive for larger VMs
  in the same way that yield_to() is not.  My recent results include
  restricting vcpu placement so that sibling vcpus do not get to run on
  the same runqueue.  I do believe we could implement an initial placement
  and load balance policy to strive for this restriction (making it purely
  optional, but I bet could also help user apps which use spin locks).
 
  For 1x VMs which still vm_exit due to PLE, I believe we could probably
  just leave the ple_window alone, as long as we mostly use yield()
  instead of yield_to().  The problem with the unneeded exits in this case
  has been the overhead in routines leading up to yield_to() and the
  yield_to() itself.  If we use yield() most of the time, this overhead
  will go away.
 
  Here is a comparison of yield_to() and yield():
 
  dbench with 20-way VMs, 8 of them on 80-way host:
 
  no PLE                 426 +/- 11.03%
  no PLE w/ gangsched  32001 +/-   .37%
  PLE with yield()     29207 +/-   .28%
  PLE with yield_to()   8175 +/-  1.37%
 
  Yield() is far and away better than yield_to() here and almost approaches
  

Re: KVM on NFS

2012-10-19 Thread Banyan He
MySQL might not run well on a local disk in the VM, and it is actually 
not a good idea to run it on an NFS filesystem. The startup should not be 
a problem; once the data grows, it will become a problem for you. Try to 
cache as much as possible; that will save you from a lot of trouble.
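
For example, in my.cnf (the values are only illustrative and depend on 
the VM's RAM):

[mysqld]
innodb_buffer_pool_size = 2G
innodb_flush_method     = O_DIRECT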



Banyan He
Blog: http://www.rootong.com
Email: ban...@rootong.com

On 2012-10-17 6:46 PM, Avi Kivity wrote:

On 10/17/2012 11:20 AM, Andrew Holway wrote:

Hello,

I am testing KVM on an Oracle NFS box that I have.

Does the list have any advice on best practice? I remember reading that there 
are things you can do with I/O schedulers to make it more efficient.

My VMs will primarily be running mysql databases. I am currently using o_direct.


O_DIRECT is good.  I/O schedulers don't affect NFS so no need to tune
anything on the host.  You might experiment with switching to the
deadline scheduler in the guest.
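
(For example, inside the guest, assuming a virtio disk vda:

echo deadline > /sys/block/vda/queue/scheduler

or boot the guest with elevator=deadline.)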






Re: [PATCH 0/3] KVM_VCPU_GET_REG_LIST API

2012-10-19 Thread Christoffer Dall
On Fri, Oct 19, 2012 at 2:19 AM, Rusty Russell ru...@rustcorp.com.au wrote:
 Rusty Russell ru...@rustcorp.com.au writes:
 Avi Kivity a...@redhat.com writes:
 On 09/05/2012 10:58 AM, Rusty Russell wrote:
 This is the generic part of the KVM_SET_ONE_REG/KVM_GET_ONE_REG
 enhancements which ARM wants, rebased onto kvm/next.

 This was stalled for so long it needs rebasing again, sorry.

 But otherwise I'm happy to apply.

 Ok, will rebase and re-test against kvm-next.

 Wait, what?  kvm/arm isn't in kvm-next?

 This will produce a needless clash with that, which is more important
 than this cleanup.  I'll rebase this as soon as that is merged.

 Christoffer, is there anything I can help with?


There are some worries about duplicating functionality on the ARM side
of things.

Specifically there are worries about the instruction decoding for the
mmio instructions. My cycles are unfortunately too limited to change
this right now and I'm also not sure I agree things will turn out
nicer by unifying all decoding into a large complicated space ship,
but it would be great if you could take a look. This discussion seems
to be a good place to start:

https://lists.cs.columbia.edu/pipermail/kvmarm/2012-September/003447.html

Thanks!
-Christoffer


[PATCH] kvm, async_pf: exit idleness when handling KVM_PV_REASON_PAGE_NOT_PRESENT

2012-10-19 Thread Sasha Levin
KVM_PV_REASON_PAGE_NOT_PRESENT kicks the cpu out of idleness, but we haven't
marked that spot as an exit from idleness.

Not doing so can cause RCU warnings such as:

[  732.788386] ===============================
[  732.789803] [ INFO: suspicious RCU usage. ]
[  732.790032] 3.7.0-rc1-next-20121019-sasha-2-g6d8d02d-dirty #63 Tainted: G        W
[  732.790032] -------------------------------
[  732.790032] include/linux/rcupdate.h:738 rcu_read_lock() used illegally while idle!
[  732.790032]
[  732.790032] other info that might help us debug this:
[  732.790032]
[  732.790032]
[  732.790032] RCU used illegally from idle CPU!
[  732.790032] rcu_scheduler_active = 1, debug_locks = 1
[  732.790032] RCU used illegally from extended quiescent state!
[  732.790032] 2 locks held by trinity-child31/8252:
[  732.790032]  #0:  (&rq->lock){-.-.-.}, at: [<ffffffff83a67528>] __schedule+0x178/0x8f0
[  732.790032]  #1:  (rcu_read_lock){.+.+..}, at: [<ffffffff81152bde>] cpuacct_charge+0xe/0x200
[  732.790032]
[  732.790032] stack backtrace:
[  732.790032] Pid: 8252, comm: trinity-child31 Tainted: G        W    3.7.0-rc1-next-20121019-sasha-2-g6d8d02d-dirty #63
[  732.790032] Call Trace:
[  732.790032]  [<ffffffff8118266b>] lockdep_rcu_suspicious+0x10b/0x120
[  732.790032]  [<ffffffff81152c60>] cpuacct_charge+0x90/0x200
[  732.790032]  [<ffffffff81152bde>] ? cpuacct_charge+0xe/0x200
[  732.790032]  [<ffffffff81158093>] update_curr+0x1a3/0x270
[  732.790032]  [<ffffffff81158a6a>] dequeue_entity+0x2a/0x210
[  732.790032]  [<ffffffff81158ea5>] dequeue_task_fair+0x45/0x130
[  732.790032]  [<ffffffff8114ae29>] dequeue_task+0x89/0xa0
[  732.790032]  [<ffffffff8114bb9e>] deactivate_task+0x1e/0x20
[  732.790032]  [<ffffffff83a67c29>] __schedule+0x879/0x8f0
[  732.790032]  [<ffffffff8117e20d>] ? trace_hardirqs_off+0xd/0x10
[  732.790032]  [<ffffffff810a37a5>] ? kvm_async_pf_task_wait+0x1d5/0x2b0
[  732.790032]  [<ffffffff83a67cf5>] schedule+0x55/0x60
[  732.790032]  [<ffffffff810a37c4>] kvm_async_pf_task_wait+0x1f4/0x2b0
[  732.790032]  [<ffffffff81139e50>] ? abort_exclusive_wait+0xb0/0xb0
[  732.790032]  [<ffffffff81139c25>] ? prepare_to_wait+0x25/0x90
[  732.790032]  [<ffffffff810a3a66>] do_async_page_fault+0x56/0xa0
[  732.790032]  [<ffffffff83a6a6e8>] async_page_fault+0x28/0x30

Signed-off-by: Sasha Levin sasha.le...@oracle.com
---
 arch/x86/kernel/kvm.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index b3e5e51..4180a87 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -247,7 +247,10 @@ do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
break;
case KVM_PV_REASON_PAGE_NOT_PRESENT:
/* page is swapped out by the host. */
+   rcu_irq_enter();
+   exit_idle();
kvm_async_pf_task_wait((u32)read_cr2());
+   rcu_irq_exit();
break;
case KVM_PV_REASON_PAGE_READY:
rcu_irq_enter();
-- 
1.7.12.3



Re: [RFC PATCH v3 06/19] Implement -dimm command line option

2012-10-19 Thread Blue Swirl
On Thu, Oct 18, 2012 at 12:33 PM, Avi Kivity a...@redhat.com wrote:
 On 10/18/2012 11:27 AM, Vasilis Liaskovitis wrote:
 On Wed, Oct 17, 2012 at 12:03:51PM +0200, Avi Kivity wrote:
 On 10/17/2012 11:19 AM, Vasilis Liaskovitis wrote:
 
  I don't think so, but probably there's a limit of DIMMs that real
  controllers have, something like 8 max.
 
  In the case of i440fx specifically, do you mean that we should model the DRB
  (Dram row boundary registers in section 3.2.19 of the i440fx spec)?
 
  The i440fx DRB registers only support up to 8 DRAM rows (let's say 1 row
  maps 1-1 to a DimmDevice for this discussion) and only support up to 2GB
  of memory afaict (bit 31 and above is ignored).
 
  I'd rather not model this part of the i440fx - having only 8 DIMMs seems too
  restrictive. The rest of the patchset supports up to 255 DIMMs so it would be
  a waste imho to model an old pc memory controller that only supports 8 DIMMs.
 
  There was also an old discussion about i440fx modeling here:
  https://lists.nongnu.org/archive/html/qemu-devel/2011-07/msg02705.html
  the general direction was that i440fx is too old and we don't want to
  precisely emulate the DRB registers, since they lack flexibility.
 
  Possible solutions:
 
  1) is there a newer and more flexible chipset that we could model?

 Look for q35 on this list.

 thanks, I'll take a look. It sounds like the other options below are more
 straightforward now, but let me know if you prefer q35 integration as a
 priority.

 At least validate that what you're doing fits with how q35 works.


 
  We could for example model:
  - an 8-bit non-cumulative register for each DIMM, denoting how many 128MB
  chunks it contains. This allows 32GB for each DIMM, and with 255 DIMMs we
  describe a bit less than 8TB. These registers require 255 bytes.
  - a 16-bit cumulative register for each DIMM, again for 128MB chunks. This
  allows us to describe 8TB of memory (but the registers take up double the
  space, because they describe cumulative memory amounts).

 There is no reason to save space.  Why not have two 64-bit registers per
 DIMM, one describing the size and the other the base address, both in
 bytes?  Use a few low order bits for control.

 Do we want this generic scheme above to be tied into the i440fx/pc machine?

 Yes.  q35 should work according to its own specifications.

 Or have it as a separate generic memory bus / pmc usable by others (e.g. in
 hw/dimm.c)?
 The 64-bit values you describe are already part of DimmDevice properties, but
 they are not hardware registers described as part of a chipset.

 In terms of control bits, did you want to mimic some other chipset 
 registers? -
 any examples would be useful.

 I don't have any real requirements.  Just make it simple and easily
 accessible to ACPI code.



 
  3) let everything be handled/abstracted by dimmbus - the chipset DRB
  modelling is not done (at least for i440fx, other machines could). This is
  the least precise in terms of emulation. On the other hand, if we are not
  really trying to emulate the real (too restrictive) hardware, does it matter?

 We could emulate base memory using the chipset, and extra memory using
 the scheme above.  This allows guests that are tied to the chipset to
 work, and guests that have more awareness (seabios) to use the extra
 features.

 But if we use the real i440fx pmc DRBs for base memory, this means base
 memory would be <= 2GB, right?

 Sounds like we'd need to change the DRBs anyway to describe useful amounts of
 base memory (e.g. 512MB chunks and checking against address lines [36:29] can
 describe base memory up to 64GB, though that's still limiting for very large
 VMs). But we'd be diverting from the real hardware again.

 Then there's no point.  Modelling real hardware allows guests written to
 work against that hardware to function correctly.  If you diverge, they
 won't.

The guest is also unlikely to want to reprogram the memory controller.



 Then we can model base memory with tweaked i440fx pmc's DRB registers - we
 could only use DRB[0] (one DIMM describing all of base memory) or more.

 DIMMs would be allowed to be hotplugged in the generic mem-controller scheme
 only (unless it makes sense to allow hotplug in the remaining pmc DRBs and
 start using the generic scheme once we run out of emulated DRBs).


 440fx seems a lost cause, so we can go wild and just implement pv dimms.

Maybe. But what would be a PV DIMM? Do we need any DIMM-like
granularity at all, or could the guest instead be told to use a list of
RAM regions with arbitrary start and end addresses? Isn't ballooning
also related?

  For q35 I'd like to stay within the spec.

That may not last forever when machines have terabytes of memory.
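
(For what it's worth, the "two 64-bit registers per DIMM" idea mentioned
above could look roughly like the sketch below; purely illustrative, not
any real chipset's layout:

#include <stdint.h>

/* Hypothetical per-DIMM register pair, byte-granular. */
struct dimm_reg {
        uint64_t base;          /* guest-physical base address */
        uint64_t size;          /* size in bytes; low bits reserved
                                   for control, e.g. bit 0 = enabled */
};

ACPI code would then read two aligned 64-bit values per slot instead of
decoding cumulative row-boundary registers.)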


 --
 error compiling committee.c: too many arguments to function

Re: [RFC PATCH 1/3] KVM: ARM: Introduce KVM_INIT_IRQCHIP ioctl

2012-10-19 Thread Christoffer Dall
On Thu, Oct 18, 2012 at 8:20 AM, Avi Kivity a...@redhat.com wrote:
 On 10/14/2012 02:04 AM, Christoffer Dall wrote:
 Used to initialize the in-kernel interrupt controller. On ARM we need to
 map the virtual generic interrupt controller (vGIC) into the guest's
 physical address space so the guest can access the virtual cpu
 interface. This must be done after the IRQ chip is created and after a
 base address has been provided for the emulated platform (patch is
 following), but before the CPU is initially run.


 +4.79 KVM_INIT_IRQCHIP
 +
 +Capability: KVM_CAP_INIT_IRQCHIP
 +Architectures: arm
 +Type: vm ioctl
 +Parameters: none
 +Returns: 0 on success, -1 on error
 +
 +Initialize the in-kernel interrupt controller. On ARM we need to map the
 +virtual generic interrupt controller (vGIC) into the guest's physical
 +address space so the guest can access the virtual cpu interface. This must be
 +done after the IRQ chip is created and after a base address has been provided
 +for the emulated platform (see KVM_SET_DEVICE_ADDRESS), but before the CPU is
 +initially run.
 +

 What enforces this?

 Can it be done automatically?  issue a
 kvm_make_request(KVM_REQ_INIT_IRQCHIP) on vcpu creation, and you'll
 automatically be notified before the first guest entry.

 Having an ioctl that must be called after point A but before point B
 seems pointless, when A and B are both known.
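
 (In code, that pattern is roughly the sketch below; KVM_REQ_INIT_IRQCHIP
 is a hypothetical request bit here, while kvm_make_request() and
 kvm_check_request() are the existing helpers:

 /* at vcpu creation */
 kvm_make_request(KVM_REQ_INIT_IRQCHIP, vcpu);

 /* in the run loop, before the first guest entry */
 if (kvm_check_request(KVM_REQ_INIT_IRQCHIP, vcpu))
         kvm_vgic_init(vcpu->kvm);

 so userspace never has to order an extra ioctl between A and B.)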


I reworked this according to your comments, patches on the way.

thanks for the input.

 +
  5. The kvm_run structure
  

 diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
 index f8c377b..85c76e4 100644
 --- a/arch/arm/kvm/arm.c
 +++ b/arch/arm/kvm/arm.c
 @@ -195,6 +195,7 @@ int kvm_dev_ioctl_check_extension(long ext)
   switch (ext) {
  #ifdef CONFIG_KVM_ARM_VGIC
   case KVM_CAP_IRQCHIP:
 + case KVM_CAP_INIT_IRQCHIP:

 This could be part of a baseline, if you don't envision ever taking it out.



Re: [RFC PATCH 2/3] KVM: ARM: Introduce KVM_SET_DEVICE_ADDRESS ioctl

2012-10-19 Thread Christoffer Dall
On Wed, Oct 17, 2012 at 4:29 PM, Peter Maydell peter.mayd...@linaro.org wrote:
 On 14 October 2012 01:04, Christoffer Dall
 c.d...@virtualopensystems.com wrote:
 On ARM (and possibly other architectures) some bits are specific to the
 model being emulated for the guest and user space needs a way to tell
 the kernel about those bits.  An example is mmio device base addresses,
 where KVM must know the base address for a given device to properly
 emulate mmio accesses within a certain address range or directly map a
 device with virtualization extensions into the guest address space.

 We try to make this API slightly more generic than for our specific use,
 but so far only the VGIC uses this feature.

 Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com
 ---
  Documentation/virtual/kvm/api.txt |   30 ++
  arch/arm/include/asm/kvm.h|   13 +
  arch/arm/include/asm/kvm_mmu.h|1 +
  arch/arm/include/asm/kvm_vgic.h   |6 ++
  arch/arm/kvm/arm.c|   31 ++-
  arch/arm/kvm/vgic.c   |   34 +++---
  include/linux/kvm.h   |8 
  7 files changed, 119 insertions(+), 4 deletions(-)

 diff --git a/Documentation/virtual/kvm/api.txt 
 b/Documentation/virtual/kvm/api.txt
 index 26e953d..30ddcac 100644
 --- a/Documentation/virtual/kvm/api.txt
 +++ b/Documentation/virtual/kvm/api.txt
  @@ -2118,6 +2118,36 @@ for the emulated platform (see KVM_SET_DEVICE_ADDRESS), but before the CPU is
   initially run.


 +4.80 KVM_SET_DEVICE_ADDRESS
 +
 +Capability: KVM_CAP_SET_DEVICE_ADDRESS
 +Architectures: arm
 +Type: vm ioctl
 +Parameters: struct kvm_device_address (in)
 +Returns: 0 on success, -1 on error
 +Errors:
 +  ENODEV: The device id is unknwown

 unknown

 +  ENXIO:  Device not supported in configuration

 in this configuration ? (I'm guessing this is for you tried to
 map a GIC when this CPU doesn't have a GIC and similar errors?)

 +  E2BIG:  Address outside of guest physical address space

 I would say outside rather than outside of here.

 +
 +struct kvm_device_address {
 +   __u32 id;
 +   __u64 addr;
 +};
 +
 +Specify a device address in the guest's physical address space where guests
 +can access emulated or directly exposed devices, which the host kernel needs
 +to know about. The id field is an architecture specific identifier for a
 +specific device.
 +
 +ARM divides the id field into two parts, a device ID and an address type id

 We should be consistent about whether ID is capitalised or not.


indeed

 +specific to the individual device.
 +
 +  bits:  | 31...16 | 15...0 |
 +  field: | device id   |  addr type id  |

 This doesn't say whether userspace is allowed to make this ioctl
 multiple times for the same device. This could be any of:
  * undefined behaviour
  * second call fails with some errno
  * second call overrides first one


I added an error condition EEXIST, but since this is trying to not be
arm-vgic specific this is really up to the individual device - maybe
we can have some polymorphic device that moves around later.

 It also doesn't say that you're supposed to call this after CREATE
 and before INIT of the irqchip. (Nor does it say what happens if
 you call it at some other time.)


same non-device specific argument as above.

Thanks,
-Christoffer


Re: [RFC PATCH 2/3] KVM: ARM: Introduce KVM_SET_DEVICE_ADDRESS ioctl

2012-10-19 Thread Peter Maydell
On 19 October 2012 19:46, Christoffer Dall
c.d...@virtualopensystems.com wrote:
 On Wed, Oct 17, 2012 at 4:29 PM, Peter Maydell peter.mayd...@linaro.org 
 wrote:
 This doesn't say whether userspace is allowed to make this ioctl
 multiple times for the same device. This could be any of:
  * undefined behaviour
  * second call fails with some errno
  * second call overrides first one


 I added an error condition EEXIST, but since this is trying to not be
 arm-vgic specific this is really up to the individual device - maybe
 we can have some polymorphic device that moves around later.

 It also doesn't say that you're supposed to call this after CREATE
 and before INIT of the irqchip. (Nor does it say what happens if
 you call it at some other time.)


 same non-device specific argument as above.

We could have a section in the docs that says On ARM platforms
there are devices X and Y and they have such-and-such properties
and requirements [and other devices later can have further docs
as appropriate].

-- PMM


Re: [RFC PATCH 2/3] KVM: ARM: Introduce KVM_SET_DEVICE_ADDRESS ioctl

2012-10-19 Thread Christoffer Dall
On Fri, Oct 19, 2012 at 4:24 PM, Peter Maydell peter.mayd...@linaro.org wrote:
 On 19 October 2012 19:46, Christoffer Dall
 c.d...@virtualopensystems.com wrote:
 On Wed, Oct 17, 2012 at 4:29 PM, Peter Maydell peter.mayd...@linaro.org 
 wrote:
 This doesn't say whether userspace is allowed to make this ioctl
 multiple times for the same device. This could be any of:
  * undefined behaviour
  * second call fails with some errno
  * second call overrides first one


 I added an error condition EEXIST, but since this is trying to not be
 arm-vgic specific this is really up to the individual device - maybe
 we can have some polymorphic device that moves around later.

 It also doesn't say that you're supposed to call this after CREATE
 and before INIT of the irqchip. (Nor does it say what happens if
 you call it at some other time.)


 same non-device specific argument as above.

 We could have a section in the docs that says On ARM platforms
 there are devices X and Y and they have such-and-such properties
 and requirements [and other devices later can have further docs
 as appropriate].

sure, I can add that.

-Christoffer


Re: [RFC PATCH 2/3] KVM: ARM: Introduce KVM_SET_DEVICE_ADDRESS ioctl

2012-10-19 Thread Christoffer Dall
On Fri, Oct 19, 2012 at 4:27 PM, Christoffer Dall
c.d...@virtualopensystems.com wrote:
 On Fri, Oct 19, 2012 at 4:24 PM, Peter Maydell peter.mayd...@linaro.org 
 wrote:
 On 19 October 2012 19:46, Christoffer Dall
 c.d...@virtualopensystems.com wrote:
 On Wed, Oct 17, 2012 at 4:29 PM, Peter Maydell peter.mayd...@linaro.org 
 wrote:
 This doesn't say whether userspace is allowed to make this ioctl
 multiple times for the same device. This could be any of:
  * undefined behaviour
  * second call fails with some errno
  * second call overrides first one


 I added an error condition EEXIST, but since this is trying to not be
 arm-vgic specific this is really up to the individual device - maybe
 we can have some polymorphic device that moves around later.

 It also doesn't say that you're supposed to call this after CREATE
 and before INIT of the irqchip. (Nor does it say what happens if
 you call it at some other time.)


 same non-device specific argument as above.

 We could have a section in the docs that says On ARM platforms
 there are devices X and Y and they have such-and-such properties
 and requirements [and other devices later can have further docs
 as appropriate].



diff --git a/Documentation/virtual/kvm/api.txt
b/Documentation/virtual/kvm/api.txt
index 65aacc5..1380885 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2131,6 +2131,12 @@ specific to the individual device.
   bits:  | 31...16 | 15...0 |
   field: | device id   |  addr type id  |

+ARM currently only requires this when using the in-kernel GIC support for the
+hardware vGIC features, using KVM_ARM_DEVICE_VGIC_V2 as the device id.  When
+setting the base address for the guest's mapping of the vGIC virtual CPU
+and distributor interface, the ioctl must be called after calling
+KVM_CREATE_IRQCHIP, but before calling KVM_RUN on any of the VCPUs.  Calling
+this ioctl twice for any of the base addresses will return -EEXIST.


 5. The kvm_run structure


Re: [PATCH] kvm, async_pf: exit idleness when handling KVM_PV_REASON_PAGE_NOT_PRESENT

2012-10-19 Thread Paul E. McKenney
On Fri, Oct 19, 2012 at 12:11:55PM -0400, Sasha Levin wrote:
 KVM_PV_REASON_PAGE_NOT_PRESENT kicks the cpu out of idleness, but we haven't
 marked that spot as an exit from idleness.
 
 Not doing so can cause RCU warnings such as:
 
 [  732.788386] ===============================
 [  732.789803] [ INFO: suspicious RCU usage. ]
 [  732.790032] 3.7.0-rc1-next-20121019-sasha-2-g6d8d02d-dirty #63 Tainted: G        W
 [  732.790032] -------------------------------
 [  732.790032] include/linux/rcupdate.h:738 rcu_read_lock() used illegally while idle!
 [  732.790032]
 [  732.790032] other info that might help us debug this:
 [  732.790032]
 [  732.790032]
 [  732.790032] RCU used illegally from idle CPU!
 [  732.790032] rcu_scheduler_active = 1, debug_locks = 1
 [  732.790032] RCU used illegally from extended quiescent state!
 [  732.790032] 2 locks held by trinity-child31/8252:
 [  732.790032]  #0:  (&rq->lock){-.-.-.}, at: [<ffffffff83a67528>] __schedule+0x178/0x8f0
 [  732.790032]  #1:  (rcu_read_lock){.+.+..}, at: [<ffffffff81152bde>] cpuacct_charge+0xe/0x200
 [  732.790032]
 [  732.790032] stack backtrace:
 [  732.790032] Pid: 8252, comm: trinity-child31 Tainted: G        W    3.7.0-rc1-next-20121019-sasha-2-g6d8d02d-dirty #63
 [  732.790032] Call Trace:
 [  732.790032]  [<ffffffff8118266b>] lockdep_rcu_suspicious+0x10b/0x120
 [  732.790032]  [<ffffffff81152c60>] cpuacct_charge+0x90/0x200
 [  732.790032]  [<ffffffff81152bde>] ? cpuacct_charge+0xe/0x200
 [  732.790032]  [<ffffffff81158093>] update_curr+0x1a3/0x270
 [  732.790032]  [<ffffffff81158a6a>] dequeue_entity+0x2a/0x210
 [  732.790032]  [<ffffffff81158ea5>] dequeue_task_fair+0x45/0x130
 [  732.790032]  [<ffffffff8114ae29>] dequeue_task+0x89/0xa0
 [  732.790032]  [<ffffffff8114bb9e>] deactivate_task+0x1e/0x20
 [  732.790032]  [<ffffffff83a67c29>] __schedule+0x879/0x8f0
 [  732.790032]  [<ffffffff8117e20d>] ? trace_hardirqs_off+0xd/0x10
 [  732.790032]  [<ffffffff810a37a5>] ? kvm_async_pf_task_wait+0x1d5/0x2b0
 [  732.790032]  [<ffffffff83a67cf5>] schedule+0x55/0x60
 [  732.790032]  [<ffffffff810a37c4>] kvm_async_pf_task_wait+0x1f4/0x2b0
 [  732.790032]  [<ffffffff81139e50>] ? abort_exclusive_wait+0xb0/0xb0
 [  732.790032]  [<ffffffff81139c25>] ? prepare_to_wait+0x25/0x90
 [  732.790032]  [<ffffffff810a3a66>] do_async_page_fault+0x56/0xa0
 [  732.790032]  [<ffffffff83a6a6e8>] async_page_fault+0x28/0x30
 
 Signed-off-by: Sasha Levin sasha.le...@oracle.com

Acked-by: Paul E. McKenney paul...@linux.vnet.ibm.com

 ---
  arch/x86/kernel/kvm.c | 3 +++
  1 file changed, 3 insertions(+)
 
 diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
 index b3e5e51..4180a87 100644
 --- a/arch/x86/kernel/kvm.c
 +++ b/arch/x86/kernel/kvm.c
 @@ -247,7 +247,10 @@ do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
   break;
   case KVM_PV_REASON_PAGE_NOT_PRESENT:
   /* page is swapped out by the host. */
 + rcu_irq_enter();
 + exit_idle();
   kvm_async_pf_task_wait((u32)read_cr2());
 + rcu_irq_exit();
   break;
   case KVM_PV_REASON_PAGE_READY:
   rcu_irq_enter();
 -- 
 1.7.12.3
 



Re: How to do fast accesses to LAPIC TPR under kvm?

2012-10-19 Thread Stefan Fritsch
On Thursday 18 October 2012, Avi Kivity wrote:
 On 10/18/2012 11:35 AM, Gleb Natapov wrote:
  You misunderstood the description. V_INTR_MASKING=1 means that
  CR8 writes are not propagated to real HW APIC.
  
  But KVM does not trap access to CR8 unconditionally. It enables
  CR8 intercept only when there is pending interrupt in IRR that
  cannot be immediately delivered due to current TPR value. This
  should eliminate 99% of CR8 intercepts.
 
 Right.  You will need to expose the alternate encoding of cr8 (IIRC
 lock mov reg, cr0) on AMD via cpuid, but otherwise it should just
 work.  Be aware that this will break cross-vendor migration.

I get an exception and I am not sure why:

kvm_entry: vcpu 0
kvm_exit: reason write_cr8 rip 0xd0203788 info 0 0
kvm_emulate_insn: 0:d0203788: f0 0f 22 c0 (prot32)
kvm_inj_exception: #UD (0x0)

This is qemu-kvm 1.1.2 on Linux 3.2.

When I look at arch/x86/kvm/emulate.c (both the current and the v3.2 
version), I don't see any special case handling for "lock mov reg, 
cr0" to mean "mov reg, cr8".

Before I spend lots of time on debugging my code, can you verify if 
the alternate encoding of cr8 is actually supported in kvm or if it is 
maybe missing? Thanks in advance.

Cheers,
Stefan


1.1.1 -> 1.1.2 migrate/managedsave issue

2012-10-19 Thread Doug Goldstein
I'm using libvirt 0.10.2 and I had qemu-kvm 1.1.1 running all my VMs.
I used libvirt's managedsave command to pause all the VMs and write
them to disk. When I brought the machine up again I had upgraded to
qemu-kvm 1.1.2, and attempting to resume the VMs from their saved state
unfortunately fails. During the life of the VM I did not attempt to
adjust the amount of memory it had via the balloon device, unless of
course libvirt did so behind the scenes on me. Below is the command line
invocation and the error:

LC_ALL=C 
PATH=/bin:/sbin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/opt/bin
HOME=/root USER=root QEMU_AUDIO_DRV=spice /usr/bin/qemu-kvm -name expo
-S -M pc-1.0 -cpu
Penryn,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
-enable-kvm -m 1024 -smp 1,sockets=1,cores=1,threads=1 -uuid
19034754-aa3f-9671-d247-1bc53134e3f0 -no-user-config -nodefaults
-chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/expo.monitor,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc
-no-shutdown -device
virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -device
piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive
file=/var/lib/libvirt/images/expo.img,if=none,id=drive-ide0-0-0,format=raw,cache=none
-device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1
-netdev tap,fd=23,id=hostnet0 -device
rtl8139,netdev=hostnet0,id=net0,mac=52:54:00:0b:29:d9,bus=pci.0,addr=0x3
-chardev pty,id=charserial0 -device
isa-serial,chardev=charserial0,id=serial0 -chardev
spicevmc,id=charchannel0,name=vdagent -device
virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0
-spice port=5901,addr=127.0.0.1,disable-ticketing -vga qxl -global
qxl-vga.vram_size=67108864 -incoming fd:20 -device
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5
char device redirected to /dev/pts/7
qemu: warning: error while loading state for instance 0x0 of device 'ram'
load of migration failed

Let me know what specifics I can provide to make this easier to debug.

Thanks.
-- 
Doug Goldstein


[PATCH 0/2] KVM: ARM: Get rid of hardcoded VGIC addresses

2012-10-19 Thread Christoffer Dall
We need a way to specify the address at which we expect VMs to access
the interrupt controller (both the emulated distributor and the hardware
interface supporting virtualization).  User space should decide on this
address as user space decides on an emulated board and loads a device
tree describing these details directly to the guest.

We introduce a new ioctl, KVM_SET_DEVICE_ADDRESS, that lets user space
provide a base address for a device based on exported device ids.  For
now, this is only supported for the ARM vgic.  User space provides this
address after creating the IRQ chip and KVM performs the required
mappings for a VM on the first execution of a VCPU.
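
Illustrative userspace usage (the base address below is a made-up
example; the constants are defined in the patches that follow):

        struct kvm_device_address dev_addr = {
                .id   = (KVM_ARM_DEVICE_VGIC_V2 << KVM_DEVICE_ID_SHIFT) |
                        KVM_VGIC_V2_ADDR_TYPE_DIST,
                .addr = 0x2c001000,     /* guest-physical distributor base */
        };

        if (ioctl(vm_fd, KVM_SET_DEVICE_ADDRESS, &dev_addr) < 0)
                perror("KVM_SET_DEVICE_ADDRESS");

called after KVM_CREATE_IRQCHIP and before any KVM_RUN.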

Christoffer Dall (2):
  KVM: ARM: Introduce KVM_SET_DEVICE_ADDRESS ioctl
  KVM: ARM: Defer parts of the vgic init until first KVM_RUN

 Documentation/virtual/kvm/api.txt |   37 ++
 arch/arm/include/asm/kvm.h|   13 +
 arch/arm/include/asm/kvm_mmu.h|2 +
 arch/arm/include/asm/kvm_vgic.h   |   27 --
 arch/arm/kvm/arm.c|   41 ++-
 arch/arm/kvm/vgic.c   |   99 +
 include/linux/kvm.h   |8 +++
 7 files changed, 201 insertions(+), 26 deletions(-)

-- 
1.7.9.5



[PATCH 1/2] KVM: ARM: Introduce KVM_SET_DEVICE_ADDRESS ioctl

2012-10-19 Thread Christoffer Dall
On ARM (and possibly other architectures) some bits are specific to the
model being emulated for the guest and user space needs a way to tell
the kernel about those bits.  An example is mmio device base addresses,
where KVM must know the base address for a given device to properly
emulate mmio accesses within a certain address range or directly map a
device with virtualization extensions into the guest address space.

We try to make this API slightly more generic than for our specific use,
but so far only the VGIC uses this feature.

Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com
---
 Documentation/virtual/kvm/api.txt |   37 +
 arch/arm/include/asm/kvm.h|   13 +
 arch/arm/include/asm/kvm_mmu.h|2 ++
 arch/arm/include/asm/kvm_vgic.h   |6 ++
 arch/arm/kvm/arm.c|   31 ++-
 arch/arm/kvm/vgic.c   |   25 +
 include/linux/kvm.h   |8 
 7 files changed, 121 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 0aa4d83..dae4f05 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2102,6 +2102,43 @@ This ioctl returns the guest registers that are supported for the
 KVM_GET_ONE_REG/KVM_SET_ONE_REG calls.
 
 
+4.80 KVM_SET_DEVICE_ADDRESS
+
+Capability: KVM_CAP_SET_DEVICE_ADDRESS
+Architectures: arm
+Type: vm ioctl
+Parameters: struct kvm_device_address (in)
+Returns: 0 on success, -1 on error
+Errors:
+  ENODEV: The device id is unknown
+  ENXIO:  Device not supported on current system
+  EEXIST: Address already set
+  E2BIG:  Address outside guest physical address space
+
+struct kvm_device_address {
+   __u32 id;
+   __u64 addr;
+};
+
+Specify a device address in the guest's physical address space where guests
+can access emulated or directly exposed devices, which the host kernel needs
+to know about. The id field is an architecture specific identifier for a
+specific device.
+
+ARM divides the id field into two parts, a device id and an address type id
+specific to the individual device.
+
+  bits:  | 31...16 | 15...0 |
+  field: | device id   |  addr type id  |
+
+ARM currently only requires this when using the in-kernel GIC support for the
+hardware vGIC features, using KVM_ARM_DEVICE_VGIC_V2 as the device id.  When
+setting the base address for the guest's mapping of the vGIC virtual CPU
+and distributor interface, the ioctl must be called after calling
+KVM_CREATE_IRQCHIP, but before calling KVM_RUN on any of the VCPUs.  Calling
+this ioctl twice for any of the base addresses will return -EEXIST.
+
+
 5. The kvm_run structure
 
 
diff --git a/arch/arm/include/asm/kvm.h b/arch/arm/include/asm/kvm.h
index fb41608..a7ae073 100644
--- a/arch/arm/include/asm/kvm.h
+++ b/arch/arm/include/asm/kvm.h
@@ -42,6 +42,19 @@ struct kvm_regs {
 #define KVM_ARM_TARGET_CORTEX_A15  0
 #define KVM_ARM_NUM_TARGETS1
 
+/* KVM_SET_DEVICE_ADDRESS ioctl id encoding */
+#define KVM_DEVICE_TYPE_SHIFT  0
+#define KVM_DEVICE_TYPE_MASK   (0xffff << KVM_DEVICE_TYPE_SHIFT)
+#define KVM_DEVICE_ID_SHIFT    16
+#define KVM_DEVICE_ID_MASK     (0xffff << KVM_DEVICE_ID_SHIFT)
+
+/* Supported device IDs */
+#define KVM_ARM_DEVICE_VGIC_V2 0
+
+/* Supported VGIC address types  */
+#define KVM_VGIC_V2_ADDR_TYPE_DIST 0
+#define KVM_VGIC_V2_ADDR_TYPE_CPU  1
+
 struct kvm_vcpu_init {
__u32 target;
__u32 features[7];
diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 9bd0508..0800531 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -26,6 +26,8 @@
  * To save a bit of memory and to avoid alignment issues we assume 39-bit IPA
  * for now, but remember that the level-1 table must be aligned to its size.
  */
+#define KVM_PHYS_SHIFT (38)
+#define KVM_PHYS_MASK  ((1ULL << KVM_PHYS_SHIFT) - 1)
 #define PTRS_PER_PGD2  512
 #define PGD2_ORDER get_order(PTRS_PER_PGD2 * sizeof(pgd_t))
 
diff --git a/arch/arm/include/asm/kvm_vgic.h b/arch/arm/include/asm/kvm_vgic.h
index 588c637..a688132 100644
--- a/arch/arm/include/asm/kvm_vgic.h
+++ b/arch/arm/include/asm/kvm_vgic.h
@@ -242,6 +242,7 @@ struct kvm_exit_mmio;
 
 #ifdef CONFIG_KVM_ARM_VGIC
 int kvm_vgic_hyp_init(void);
+int kvm_vgic_set_addr(struct kvm *kvm, unsigned long type, u64 addr);
 int kvm_vgic_init(struct kvm *kvm);
 void kvm_vgic_vcpu_init(struct kvm_vcpu *vcpu);
 void kvm_vgic_sync_to_cpu(struct kvm_vcpu *vcpu);
@@ -261,6 +262,11 @@ static inline int kvm_vgic_hyp_init(void)
return 0;
 }
 
+static inline int kvm_vgic_set_addr(struct kvm *kvm, unsigned long type, u64 addr)
+{
+   return 0;
+}
+
 static inline int kvm_vgic_init(struct kvm *kvm)
 {
return 0;
diff --git a/arch/arm/kvm/arm.c 
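
As a standalone illustration of the id encoding above, a small sketch
that builds on its own (the constants are repeated locally, the 0xffff
masks follow the header hunk above, and the program itself is purely
hypothetical):

#include <stdio.h>
#include <stdint.h>

#define KVM_DEVICE_TYPE_SHIFT   0
#define KVM_DEVICE_TYPE_MASK    (0xffffU << KVM_DEVICE_TYPE_SHIFT)
#define KVM_DEVICE_ID_SHIFT     16
#define KVM_DEVICE_ID_MASK      (0xffffU << KVM_DEVICE_ID_SHIFT)

#define KVM_ARM_DEVICE_VGIC_V2      0
#define KVM_VGIC_V2_ADDR_TYPE_CPU   1

int main(void)
{
	/* Compose an id: vgic device, cpu interface address type. */
	uint32_t id = (KVM_ARM_DEVICE_VGIC_V2 << KVM_DEVICE_ID_SHIFT) |
		      (KVM_VGIC_V2_ADDR_TYPE_CPU << KVM_DEVICE_TYPE_SHIFT);

	/* Decompose it again, as the kernel side would. */
	uint32_t dev  = (id & KVM_DEVICE_ID_MASK) >> KVM_DEVICE_ID_SHIFT;
	uint32_t type = (id & KVM_DEVICE_TYPE_MASK) >> KVM_DEVICE_TYPE_SHIFT;

	printf("device id %u, addr type id %u\n", dev, type);	/* 0, 1 */
	return 0;
}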

[PATCH 2/2] KVM: ARM: Defer parts of the vgic init until first KVM_RUN

2012-10-19 Thread Christoffer Dall
The vgic virtual cpu and emulated distributor interfaces must
be mapped at a given physical address in the guest.  This address is
provided through the KVM_SET_DEVICE_ADDRESS ioctl, which happens after
the KVM_CREATE_IRQCHIP ioctl is called, but before the first VCPU is
executed through KVM_RUN.  We create the vgic on KVM_CREATE_IRQCHIP, but
query kvm_vgic_ready(kvm), which checks if the vgic.vctrl_base field has
been set, before we execute a VCPU, and if it has not been set, we call
kvm_vgic_init, which takes care of the remaining setup.

We use the IS_VGIC_ADDR_UNDEF() macro, which compares to the
VGIC_ADDR_UNDEF constant, to check if an address has been set;  it's
unlikely that a device will sit at address 0, but since this code runs
as part of the main kernel boot procedure whenever the support is
enabled in the config, I'm being paranoid.
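
A sketch of the convention (not necessarily the exact definitions used
in vgic.c):

#define VGIC_ADDR_UNDEF		((phys_addr_t)(-1))
#define IS_VGIC_ADDR_UNDEF(a)	((a) == VGIC_ADDR_UNDEF)

i.e. an all-ones physical address marks a base that has not been set.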

The distributor and vcpu base addresses used to be a per-host setting
global for all VMs, but this is not a requirement and when we want to
emulate several boards on a single host, we need the flexibility of
storing these guest addresses on a per-VM basis.

Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com
---
 arch/arm/include/asm/kvm_vgic.h |   21 --
 arch/arm/kvm/arm.c  |   10 -
 arch/arm/kvm/vgic.c |   82 +++
 3 files changed, 84 insertions(+), 29 deletions(-)

diff --git a/arch/arm/include/asm/kvm_vgic.h b/arch/arm/include/asm/kvm_vgic.h
index a688132..2de167f 100644
--- a/arch/arm/include/asm/kvm_vgic.h
+++ b/arch/arm/include/asm/kvm_vgic.h
@@ -154,13 +154,14 @@ static inline void vgic_bytemap_set_irq_val(struct vgic_bytemap *x,
 struct vgic_dist {
 #ifdef CONFIG_KVM_ARM_VGIC
spinlock_t  lock;
+   boolready;
 
/* Virtual control interface mapping */
	void __iomem	*vctrl_base;
 
-   /* Distributor mapping in the guest */
-   unsigned long   vgic_dist_base;
-   unsigned long   vgic_dist_size;
+   /* Distributor and vcpu interface mapping in the guest */
+   phys_addr_t vgic_dist_base;
+   phys_addr_t vgic_cpu_base;
 
/* Distributor enabled */
u32 enabled;
@@ -243,6 +244,7 @@ struct kvm_exit_mmio;
 #ifdef CONFIG_KVM_ARM_VGIC
 int kvm_vgic_hyp_init(void);
 int kvm_vgic_set_addr(struct kvm *kvm, unsigned long type, u64 addr);
+int kvm_vgic_create(struct kvm *kvm);
 int kvm_vgic_init(struct kvm *kvm);
 void kvm_vgic_vcpu_init(struct kvm_vcpu *vcpu);
 void kvm_vgic_sync_to_cpu(struct kvm_vcpu *vcpu);
@@ -252,8 +254,9 @@ int kvm_vgic_inject_irq(struct kvm *kvm, int cpuid, unsigned int irq_num,
 int kvm_vgic_vcpu_pending_irq(struct kvm_vcpu *vcpu);
 bool vgic_handle_mmio(struct kvm_vcpu *vcpu, struct kvm_run *run,
  struct kvm_exit_mmio *mmio);
+bool irqchip_in_kernel(struct kvm *kvm);
 
-#define irqchip_in_kernel(k)   (!!((k)->arch.vgic.vctrl_base))
+#define vgic_initialized(k)    ((k)->arch.vgic.ready)
 #define vgic_active_irq(v)     (atomic_read(&(v)->arch.vgic_cpu.irq_active_count) == 0)
 
 #else
@@ -267,6 +270,11 @@ static inline int kvm_vgic_set_addr(struct kvm *kvm, unsigned long type, u64 add
return 0;
 }
 
+static inline int kvm_vgic_create(struct kvm *kvm)
+{
+   return 0;
+}
+
 static inline int kvm_vgic_init(struct kvm *kvm)
 {
return 0;
@@ -298,6 +306,11 @@ static inline int irqchip_in_kernel(struct kvm *kvm)
return 0;
 }
 
+static inline bool kvm_vgic_initialized(struct kvm *kvm)
+{
+   return true;
+}
+
 static inline int vgic_active_irq(struct kvm_vcpu *vcpu)
 {
return 0;
diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 282794e..d64783e 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -636,6 +636,14 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
	if (unlikely(vcpu->arch.target < 0))
return -ENOEXEC;
 
+	/* Initialize the VGIC before running the vcpu */
+	if (unlikely(irqchip_in_kernel(vcpu->kvm) &&
+		     !vgic_initialized(vcpu->kvm))) {
+		ret = kvm_vgic_init(vcpu->kvm);
+		if (ret)
+			return ret;
+	}
+
	if (run->exit_reason == KVM_EXIT_MMIO) {
		ret = kvm_handle_mmio_return(vcpu, vcpu->run);
if (ret)
@@ -889,7 +897,7 @@ long kvm_arch_vm_ioctl(struct file *filp,
 #ifdef CONFIG_KVM_ARM_VGIC
case KVM_CREATE_IRQCHIP: {
if (vgic_present)
-   return kvm_vgic_init(kvm);
+   return kvm_vgic_create(kvm);
else
return -EINVAL;
}
diff --git a/arch/arm/kvm/vgic.c b/arch/arm/kvm/vgic.c
index d63b7f8..fa591db 100644
--- a/arch/arm/kvm/vgic.c
+++ b/arch/arm/kvm/vgic.c
@@ -65,12 +65,17 @@
  *   interrupt line to be sampled again.
  */
 
-/* Temporary hacks, need to be provided by userspace