Re: [PATCH 05/29] nVMX: Implement reading and writing of VMX MSRs
On Sun, Jan 30, 2011, Avi Kivity wrote about "Re: [PATCH 05/29] nVMX: Implement reading and writing of VMX MSRs":
> > +	case MSR_IA32_VMX_TRUE_PINBASED_CTLS:
> > +	case MSR_IA32_VMX_PINBASED_CTLS:
> > +		vmx_msr_low = CORE2_PINBASED_CTLS_MUST_BE_ONE;
> > +		vmx_msr_high = CORE2_PINBASED_CTLS_MUST_BE_ONE |
> > +			PIN_BASED_EXT_INTR_MASK |
> > +			PIN_BASED_NMI_EXITING |
> > +			PIN_BASED_VIRTUAL_NMIS;
>
> Can we actually support PIN_BASED_VIRTUAL_NMIs on hosts which don't
> support them?  Maybe better to drop for the initial version.

Thanks, I'll look into this. You already found this problem in June, and
it's already in my bugzilla. Just wanted to let you know that I'm taking
all your previous comments seriously, and not forgetting any of them.
Since you mention this one again, I'm increasing its priority, so I'll fix
it before the next version of the patches.

> > +static int vmx_set_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
> > +{
> > +	if (!nested_vmx_allowed(vcpu))
> > +		return 0;
> > +
> > +	/*
> > +	 * according to the spec, VMX capability MSRs are read-only; an
> > +	 * attempt to write them (with WRMSR) produces a #GP(0).
> > +	 */
> > +	if (msr_index >= MSR_IA32_VMX_BASIC &&
> > +	    msr_index <= MSR_IA32_VMX_TRUE_ENTRY_CTLS) {
> > +		kvm_queue_exception_e(vcpu, GP_VECTOR, 0);
> > +		return 1;
>
> Can just drop this part, #GP is the default response.

Right, thanks, I see that now. I'll remove the extra code, but leave a
comment.

-- 
Nadav Har'El                  | Monday, Jan 31 2011, 26 Shevat 5771
n...@math.technion.ac.il     | Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il    | If the universe is expanding, why can't I
                              | find a parking space?
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 05/29] nVMX: Implement reading and writing of VMX MSRs
On 01/31/2011 10:57 AM, Nadav Har'El wrote:
> On Sun, Jan 30, 2011, Avi Kivity wrote about "Re: [PATCH 05/29] nVMX: Implement reading and writing of VMX MSRs":
> > > +	case MSR_IA32_VMX_TRUE_PINBASED_CTLS:
> > > +	case MSR_IA32_VMX_PINBASED_CTLS:
> > > +		vmx_msr_low = CORE2_PINBASED_CTLS_MUST_BE_ONE;
> > > +		vmx_msr_high = CORE2_PINBASED_CTLS_MUST_BE_ONE |
> > > +			PIN_BASED_EXT_INTR_MASK |
> > > +			PIN_BASED_NMI_EXITING |
> > > +			PIN_BASED_VIRTUAL_NMIS;
> >
> > Can we actually support PIN_BASED_VIRTUAL_NMIs on hosts which don't
> > support them?  Maybe better to drop for the initial version.
>
> Thanks, I'll look into this. You already found this problem in June, and
> it's already in my bugzilla. Just wanted to let you know that I'm taking
> all your previous comments seriously, and not forgetting any of them.
> Since you mention this one again, I'm increasing its priority, so I'll
> fix it before the next version of the patches.

Thanks.  Since I see many other comments are still unaddressed, I'll stop
reviewing until you let me know that you have a version that is ready for
review.

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH 3/3] Provide control over unmapped pages (v4)
* KAMEZAWA Hiroyuki <kamezawa.hir...@jp.fujitsu.com> [2011-01-31 08:58:53]:

> On Fri, 28 Jan 2011 09:20:02 -0600 (CST)
> Christoph Lameter <c...@linux.com> wrote:
>
> > On Fri, 28 Jan 2011, KAMEZAWA Hiroyuki wrote:
> >
> > > > I see it as a tradeoff of when to check? add_to_page_cache or when
> > > > we want more free memory (due to allocation). It is OK to wakeup
> > > > kswapd while allocating memory, somehow for this purpose (global
> > > > page cache), add_to_page_cache or add_to_page_cache_locked does not
> > > > seem the right place to hook into. I'd be open to
> > > > comments/suggestions though from others as well.
> > >
> > > I don't like adding a hook here. AND I don't want to run kswapd
> > > because 'kswapd' has been a sign that there is memory shortage.
> > > (reusing code is ok.)
> > >
> > > How about adding a new daemon ? Recently, khugepaged, ksmd work for
> > > managing memory. Adding one more daemon for a special purpose is not
> > > very bad, I think. Then, you can do
> > >  - wake up without hook
> > >  - throttle its work.
> > >  - balance the whole system rather than zone.
> > >    I think per-node balance is enough...
> >
> > I think we already have enough kernel daemons floating around. They
> > are multiplying in an amazing way. What would be useful is to map all
> > the memory management background stuff into a process. Maybe call this
> > memd instead? Perhaps we can fold khugepaged into kswapd as well etc.
>
> Is making kswapd slow for this additional work, requested by the user,
> not by the system, a good thing ? I think a workqueue works well enough,
> it scales based on workloads, if using a thread is bad.

Making it slow is a generic statement; kswapd is supposed to do background
reclaim, in this case a special request for unmapped pages, specifically
and deliberately requested by the admin via a boot option.

-- 
	Three Cheers,
	Balbir
Re: [PATCH 07/29] nVMX: Hold a vmcs02 for each vmcs12
Hi,

On Sun, Jan 30, 2011, Avi Kivity wrote about "Re: [PATCH 07/29] nVMX: Hold a vmcs02 for each vmcs12":
> > +/*
> > + * Allocate an L0 VMCS (vmcs02) for the current L1 VMCS (vmcs12), if one
> > + * does not already exist. The allocation is done in L0 memory, so to avoid
> > + * denial-of-service attack by guests, we limit the number of concurrently-
> > + * allocated vmcss. A well-behaving L1 will VMCLEAR unused vmcs12s and not
> > + * trigger this limit.
>
> No, it won't.  If you run N guests on a single-cpu kvm host, you'll have
> N active VMCSs.

Of course. What I said was that *unused* vmcs12s (in the sense that they
don't describe any active guest) will normally be unloaded (VMCLEARed) by
L1 and so will not take up space. Only VMCSs actually being used to run
guests will take up space. If you have N running guests, then right,
you'll have N VMCSs.

I put the limit at 256, which on one hand allows L1 to run 256 L2s (which
I think is well above what people will normally run on one CPU), and on
the other hand limits the amount of damage that a malicious L1 can do: At
worst it can cause the host to allocate (and pin) 256 extra VMCSs, which
sum up to 1 MB.

I thought that this compromise was good enough, and you didn't, and
although I still don't understand why, I promise I will change it. I'll
make this change my top priority now.

> > +static void __nested_free_saved_vmcs(void *arg)
> > +{
> > +	struct saved_vmcs *saved_vmcs = arg;
> > +	int cpu = raw_smp_processor_id();
> > +
> > +	if (saved_vmcs->cpu == cpu) /* TODO: how can this not be the case? */
> > +		vmcs_clear(saved_vmcs->vmcs);
>
> This check will always be true.

This is what I thought too... I call this function on the saved_vmcs->cpu
cpu, so there's no reason why it would find itself being called on a
different cpu.

The only reason I added this sanity check was that __vcpu_clear has the
same one, and there too it seemed redundant, and I thought maybe I was
missing something. Do you know why __vcpu_clear needs this test?

> > +	if (per_cpu(current_vmcs, cpu) == saved_vmcs->vmcs)
> > +		per_cpu(current_vmcs, cpu) = NULL;
>
> And this will always be false, no?  Unless you free a vmcs02 while you
> use it?  Don't you always switch back to vmcs01 prior to freeing?
> ...
> Maybe this is the counterexample - we kill a vcpu while it is in nested
> mode.

Right, this is what I had in mind.

-- 
Nadav Har'El                  | Monday, Jan 31 2011, 26 Shevat 5771
n...@math.technion.ac.il     | Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il    | I planted some bird seed. A bird came up.
                              | Now I don't know what to feed it...
Re: [PATCH 07/29] nVMX: Hold a vmcs02 for each vmcs12
On 01/31/2011 11:26 AM, Nadav Har'El wrote:
> On Sun, Jan 30, 2011, Avi Kivity wrote about "Re: [PATCH 07/29] nVMX: Hold a vmcs02 for each vmcs12":
> > > +/*
> > > + * Allocate an L0 VMCS (vmcs02) for the current L1 VMCS (vmcs12), if one
> > > + * does not already exist. The allocation is done in L0 memory, so to avoid
> > > + * denial-of-service attack by guests, we limit the number of concurrently-
> > > + * allocated vmcss. A well-behaving L1 will VMCLEAR unused vmcs12s and not
> > > + * trigger this limit.
> >
> > No, it won't.  If you run N guests on a single-cpu kvm host, you'll
> > have N active VMCSs.
>
> Of course. What I said was that *unused* vmcs12s (in the sense that they
> don't describe any active guest) will normally be unloaded (VMCLEARed)
> by L1 and so will not take up space. Only VMCSs actually being used to
> run guests will take up space. If you have N running guests, then right,
> you'll have N VMCSs.
>
> I put the limit at 256, which on one hand allows L1 to run 256 L2s
> (which I think is well above what people will normally run on one CPU),
> and on the other hand limits the amount of damage that a malicious L1
> can do: At worst it can cause the host to allocate (and pin) 256 extra
> VMCSs, which sum up to 1 MB.
>
> I thought that this compromise was good enough, and you didn't, and
> although I still don't understand why, I promise I will change it. I'll
> make this change my top priority now.

VM crashes on legal guest behaviour are bad; and since it's easy to avoid,
why do it?

> > > +static void __nested_free_saved_vmcs(void *arg)
> > > +{
> > > +	struct saved_vmcs *saved_vmcs = arg;
> > > +	int cpu = raw_smp_processor_id();
> > > +
> > > +	if (saved_vmcs->cpu == cpu) /* TODO: how can this not be the case? */
> > > +		vmcs_clear(saved_vmcs->vmcs);
> >
> > This check will always be true.
>
> This is what I thought too... I call this function on the
> saved_vmcs->cpu cpu, so there's no reason why it would find itself being
> called on a different cpu.
>
> The only reason I added this sanity check was that __vcpu_clear has the
> same one, and there too it seemed redundant, and I thought maybe I was
> missing something. Do you know why __vcpu_clear needs this test?

__vcpu_clear() can race with itself (called from the cpu migration path
vs. the cpu offline path).

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH 01/22] Prevent abortion on multiple VCPU kicks
On 01/27/2011 03:09 PM, Jan Kiszka wrote:
> If we call qemu_cpu_kick more than once before the target was able to
> process the signal, pthread_kill will fail, and qemu will abort. Prevent
> this by avoiding the redundant signal.

Doesn't fit with the manual page (or with the idea that signals are
asynchronous):

NAME
       pthread_kill - send a signal to a thread
...
ERRORS
       ESRCH  No thread with the ID thread could be found.
       EINVAL An invalid signal was specified.

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH 04/22] Leave inner main_loop faster on pending requests
On 01/27/2011 03:09 PM, Jan Kiszka wrote:
> If there is any pending request that requires us to leave the inner loop
> of main_loop, make sure we do this as soon as possible by enforcing
> non-blocking IO processing.
>
> While at it, move variable definitions out of the inner loop to improve
> readability.
>
> Signed-off-by: Jan Kiszka <jan.kis...@siemens.com>
> ---
>  vl.c |   11 +++++++----
>  1 files changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/vl.c b/vl.c
> index 5fad700..2ebc55b 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -1384,18 +1384,21 @@ qemu_irq qemu_system_powerdown;
>
>  static void main_loop(void)
>  {
> +    bool nonblocking = false;
> +#ifdef CONFIG_PROFILER
> +    int64_t ti;
> +#endif
>      int r;
>
>      qemu_main_loop_start();
>
>      for (;;) {
>          do {
> -            bool nonblocking = false;
> -#ifdef CONFIG_PROFILER
> -            int64_t ti;
> -#endif
>  #ifndef CONFIG_IOTHREAD
>              nonblocking = cpu_exec_all();
> +            if (!vm_can_run()) {
> +                nonblocking = true;
> +            }

Doesn't this cause vmstop to spin?  We'll never execute
main_loop_wait(false) if I read the code correctly?

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH v3 14/22] kvm: Fix race between timer signals and vcpu entry under !IOTHREAD
On 01/27/2011 04:33 PM, Jan Kiszka wrote:
> Found by Stefan Hajnoczi: There is a race in kvm_cpu_exec between
> checking for exit_request on vcpu entry and timer signals arriving
> before KVM starts to catch them. Plug it by blocking both timer related
> signals also on !CONFIG_IOTHREAD and process those via signalfd.
>
> As this fix depends on real signalfd support (otherwise the timer
> signals only kick the compat helper thread, and the main thread hangs),
> we need to detect the invalid constellation and abort configure.
>
> Signed-off-by: Jan Kiszka <jan.kis...@siemens.com>
> CC: Stefan Hajnoczi <stefa...@linux.vnet.ibm.com>
> ---
>
> I don't want to invest that much into !IOTHREAD anymore, so let's see if
> the proposed catch&abort is acceptable.

I don't understand the dependency on signalfd.  The normal way of doing
things, either waiting for the signal in sigtimedwait() or in
ioctl(KVM_RUN), works with SIGALRM just fine.

-- 
error compiling committee.c: too many arguments to function
KVM call agenda for Feb 1
Please send in any agenda items you are interested in covering.

Thanks, Juan.
Re: [PATCH 0/29] nVMX: Nested VMX, v8
On Fri, Jan 28, 2011, Juerg Haefliger wrote about "Re: [PATCH 0/29] nVMX: Nested VMX, v8":
> This branch doesn't even compile:
> ...
>   CC [M]  drivers/staging/smbfs/dir.o
> drivers/staging/smbfs/dir.c:286: error: static declaration of

I tried to compile this branch with the default .config (answering the
defaults to every question in "make config"), and it compiled without any
error.

However, the default configuration is *not* to compile the code you had
problems in, drivers/staging/smbfs, because by default CONFIG_STAGING is
off. When I went to enable CONFIG_STAGING and then SMB_FS, I was given a
big warning that smbfs is OBSOLETE, please use CIFS. Is there a specific
reason why you want to use smbfs, not cifs?

Anyway, when I did enable SMB_FS, and compiled, sure enough I got the same
error that you did. This definitely has nothing to do with the nested VMX
patches. If smbfs isn't essential to you, please try again without it
(turn off SMB_FS in your .config). If it is, maybe you should report this
problem to the smbfs maintainers?

Thanks,
Nadav.

-- 
Nadav Har'El                  | Monday, Jan 31 2011, 26 Shevat 5771
n...@math.technion.ac.il     | Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il    | This box was intentionally left blank.
Re: [PATCH 17/22] kvm: Move irqchip event processing out of inner loop
On 01/27/2011 03:10 PM, Jan Kiszka wrote:
> Align with qemu-kvm and prepare for the IO exit fix: There is no need to
> run kvm_arch_process_irqchip_events in the inner VCPU loop. Any state
> change this service processes will first cause an exit from kvm_cpu_exec
> anyway. And we will have to reenter the kernel on IO exits
> unconditionally, something that the current logic prevents.
>
> Signed-off-by: Jan Kiszka <jan.kis...@siemens.com>
> ---
>  kvm-all.c |   11 ++++++-----
>  1 files changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/kvm-all.c b/kvm-all.c
> index 5bfa8c0..46ecc1c 100644
> --- a/kvm-all.c
> +++ b/kvm-all.c
> @@ -892,6 +892,12 @@ int kvm_cpu_exec(CPUState *env)
>
>      DPRINTF("kvm_cpu_exec()\n");
>
> +    if (kvm_arch_process_irqchip_events(env)) {
> +        env->exit_request = 0;
> +        env->exception_index = EXCP_HLT;
> +        return 0;
> +    }
> +
>      do {
>  #ifndef CONFIG_IOTHREAD
>          if (env->exit_request) {
> @@ -901,11 +907,6 @@ int kvm_cpu_exec(CPUState *env)
>          }

We check for ->exit_request here

>  #endif
>
> -        if (kvm_arch_process_irqchip_events(env)) {
> -            ret = 0;
> -            break;
> -        }
> -

But this checks for ->interrupt_request.  What ensures that we exit when
->interrupt_request is set?

>          if (env->kvm_vcpu_dirty) {
>              kvm_arch_put_registers(env, KVM_PUT_RUNTIME_STATE);
>              env->kvm_vcpu_dirty = 0;

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH 00/22] [uq/master] Patch queue, part II
On 01/27/2011 03:09 PM, Jan Kiszka wrote:
> This second round of patches focuses on issues in cpus.c, primarily
> signal related. The highlights are
>
>  - Add missing KVM_RUN continuation after I/O exits
>  - Fix for timer signal race in KVM entry code under !CONFIG_IOTHREAD
>    (based on Stefan's findings)
>  - MCE signal processing under !CONFIG_IOTHREAD
>  - Prevent abortion on multiple VCPU kicks
>  - Fixed synchronous (ie. VCPU-issued) reset request processing
>
> Topics remaining for a third round are cleanups and refactorings of the
> KVM VCPU entry/exit path, MCE rework, and a few assorted cleanups. But I
> will wait at least until these bits here are ready for an upstream
> merge.

Pretty nice.  Is this standalone or does it depend on unmerged changes?

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH v2 2/6] KVM-HV: KVM Steal time implementation
On Fri, 2011-01-28 at 14:52 -0500, Glauber Costa wrote:
> +		u64 to = (get_kernel_ns() - vcpu->arch.this_time_out);
> +		/*
> +		 * using nanoseconds introduces noise, which accumulates easily
> +		 * leading to big steal time values. We want, however, to keep the
> +		 * interface nanosecond-based for future-proofness.
> +		 */
> +		to /= NSEC_PER_USEC;
> +		to *= NSEC_PER_USEC;

This just doesn't make any sense at all: you use the most expensive and
accurate kernel interface to get ns resolution timestamps (ktime_get),
and then you truncate the result to usec for some totally weird and
unexplained reason.

If ktime_get is wrong, all our time-keeping is out the window.
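For reference, the truncation Peter is objecting to is plain integer
rounding: dividing and re-multiplying by NSEC_PER_USEC discards the
sub-microsecond remainder while keeping the value in nanosecond units. A
standalone sketch of just that arithmetic (not the patch itself):

```c
#include <stdint.h>

#define NSEC_PER_USEC 1000ULL

/* Round a nanosecond delta down to whole-microsecond granularity, the
 * way the quoted hunk does: integer division drops the remainder. */
static uint64_t trunc_to_usec(uint64_t ns)
{
    ns /= NSEC_PER_USEC;  /* e.g. 1234567 ns -> 1234 us */
    ns *= NSEC_PER_USEC;  /* back to ns units: 1234000 ns */
    return ns;
}
```

So up to 999 ns of measured steal time is thrown away on every update,
which is exactly the accuracy loss being questioned.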
Re: [PATCH v2 4/6] KVM-GST: KVM Steal time registration
On Fri, 2011-01-28 at 14:52 -0500, Glauber Costa wrote:
> +	/*
> +	 * using nanoseconds introduces noise, which accumulates easily
> +	 * leading to big steal time values. We want, however, to keep the
> +	 * interface nanosecond-based for future-proofness. The hypervisor may
> +	 * adopt a similar strategy, but we can't rely on that.
> +	 */
> +	delta /= NSEC_PER_MSEC;
> +	delta *= NSEC_PER_MSEC;

I really doubt that claim..
Re: [PATCH 01/22] Prevent abortion on multiple VCPU kicks
On 2011-01-31 10:44, Avi Kivity wrote:
> On 01/27/2011 03:09 PM, Jan Kiszka wrote:
> > If we call qemu_cpu_kick more than once before the target was able to
> > process the signal, pthread_kill will fail, and qemu will abort.
> > Prevent this by avoiding the redundant signal.
>
> Doesn't fit with the manual page (or with the idea that signals are
> asynchronous):
>
> NAME
>        pthread_kill - send a signal to a thread
> ...
> ERRORS
>        ESRCH  No thread with the ID thread could be found.
>        EINVAL An invalid signal was specified.

Valid remark, but I was receiving EAGAIN for blocked RT signals. I don't
know if this is Linux-specific. A quick glance at the man pages did not
reveal if this is allowed, or at least a gray area.

However, even when selectively ignoring this, it's more efficient to catch
the redundant signaling in user space.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: [PATCH 04/22] Leave inner main_loop faster on pending requests
On 2011-01-31 10:52, Avi Kivity wrote:
> On 01/27/2011 03:09 PM, Jan Kiszka wrote:
> > If there is any pending request that requires us to leave the inner
> > loop of main_loop, make sure we do this as soon as possible by
> > enforcing non-blocking IO processing.
> >
> > While at it, move variable definitions out of the inner loop to
> > improve readability.
> >
> > Signed-off-by: Jan Kiszka <jan.kis...@siemens.com>
> > ---
> >  vl.c |   11 +++++++----
> >  1 files changed, 7 insertions(+), 4 deletions(-)
> >
> > diff --git a/vl.c b/vl.c
> > index 5fad700..2ebc55b 100644
> > --- a/vl.c
> > +++ b/vl.c
> > @@ -1384,18 +1384,21 @@ qemu_irq qemu_system_powerdown;
> >
> >  static void main_loop(void)
> >  {
> > +    bool nonblocking = false;
> > +#ifdef CONFIG_PROFILER
> > +    int64_t ti;
> > +#endif
> >      int r;
> >
> >      qemu_main_loop_start();
> >
> >      for (;;) {
> >          do {
> > -            bool nonblocking = false;
> > -#ifdef CONFIG_PROFILER
> > -            int64_t ti;
> > -#endif
> >  #ifndef CONFIG_IOTHREAD
> >              nonblocking = cpu_exec_all();
> > +            if (!vm_can_run()) {
> > +                nonblocking = true;
> > +            }
>
> Doesn't this cause vmstop to spin?  We'll never execute
> main_loop_wait(false) if I read the code correctly?

The code path is not changed, we just poll instead of wait in
main_loop_wait.

Also, I didn't get your error scenario yet. Even if we left the loop here,
what magic would main_loop_wait do to vmstop processing? The stop request
is handled outside the loop, that's why we should leave ASAP.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: [PATCH v2 5/6] KVM-GST: adjust scheduler cpu power
On Fri, 2011-01-28 at 14:52 -0500, Glauber Costa wrote:
> +#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> +static DEFINE_PER_CPU(u64, cpu_steal_time);
> +
> +#ifndef CONFIG_64BIT
> +static DEFINE_PER_CPU(seqcount_t, steal_time_seq);
> +
> +static inline void steal_time_write_begin(void)
> +{
> +	__this_cpu_inc(steal_time_seq.sequence);
> +	smp_wmb();
> +}
> +
> +static inline void steal_time_write_end(void)
> +{
> +	smp_wmb();
> +	__this_cpu_inc(steal_time_seq.sequence);
> +}
> +
> +static inline u64 steal_time_read(int cpu)
> +{
> +	u64 steal_time;
> +	unsigned seq;
> +
> +	do {
> +		seq = read_seqcount_begin(&per_cpu(steal_time_seq, cpu));
> +		steal_time = per_cpu(cpu_steal_time, cpu);
> +	} while (read_seqcount_retry(&per_cpu(steal_time_seq, cpu), seq));
> +
> +	return steal_time;
> +}
> +#else /* CONFIG_64BIT */
> +static inline void steal_time_write_begin(void)
> +{
> +}
> +
> +static inline void steal_time_write_end(void)
> +{
> +}
> +
> +static inline u64 steal_time_read(int cpu)
> +{
> +	return per_cpu(cpu_steal_time, cpu);
> +}

> @@ -3536,6 +3592,11 @@ static int touch_steal_time(int is_idle)
>
>  	if (st) {
>  		account_steal_time(st);
> +#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> +		steal_time_write_begin();
> +		__this_cpu_add(cpu_steal_time, steal);
> +		steal_time_write_end();
> +#endif
>  		return 1;
>  	}
>  	return 0;

Why replicate all the logic you've already got in patch 4? That too is
reading steal time in a loop in kvm_account_steal_time(), why not extend
that interface to take a cpu argument and be done with it?
Re: [PATCH v2 5/6] KVM-GST: adjust scheduler cpu power
On Mon, 2011-01-31 at 12:25 +0100, Peter Zijlstra wrote:
> On Fri, 2011-01-28 at 14:52 -0500, Glauber Costa wrote:
> > +#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> > +static DEFINE_PER_CPU(u64, cpu_steal_time);
> [...]
> > @@ -3536,6 +3592,11 @@ static int touch_steal_time(int is_idle)
> >
> >  	if (st) {
> >  		account_steal_time(st);
> > +#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> > +		steal_time_write_begin();
> > +		__this_cpu_add(cpu_steal_time, steal);
> > +		steal_time_write_end();
> > +#endif
> >  		return 1;
> >  	}
> >  	return 0;
>
> Why replicate all the logic you've already got in patch 4? That too is
> reading steal time in a loop in kvm_account_steal_time(), why not extend
> that interface to take a cpu argument and be done with it?

Also, this relies on touch_steal_time() to have run, which is not at all
tied to calling update_rq_clock(). We can call update_rq_clock() many many
times between ticks, as can the stealtime increase, because the vcpu can
schedule many many times between ticks.
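The 32-bit fallback being discussed exists because a 64-bit load can tear
on a 32-bit host: a reader may see the low half of the old value and the
high half of the new one. A simplified user-space analogue of the
seqcount pattern (plain C11 atomics, not the kernel API, and glossing
over some memory-ordering subtleties of the real seqlock):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Writer bumps the sequence around each update, so it is odd while a
 * write is in progress. Readers retry if they see an odd sequence or a
 * sequence change across the read, which catches torn 64-bit values. */
static atomic_uint seq;
static uint32_t val_lo, val_hi;   /* 64-bit value stored as two halves */

static void write_val(uint64_t v)
{
    atomic_fetch_add_explicit(&seq, 1, memory_order_release); /* -> odd */
    val_lo = (uint32_t)v;
    val_hi = (uint32_t)(v >> 32);
    atomic_fetch_add_explicit(&seq, 1, memory_order_release); /* -> even */
}

static uint64_t read_val(void)
{
    unsigned s1, s2;
    uint64_t v;

    do {
        s1 = atomic_load_explicit(&seq, memory_order_acquire);
        v = ((uint64_t)val_hi << 32) | val_lo;
        s2 = atomic_load_explicit(&seq, memory_order_acquire);
    } while (s1 != s2 || (s1 & 1));   /* retry torn/in-progress reads */
    return v;
}
```

On 64-bit hosts the whole dance disappears, which is why the quoted patch
compiles the begin/end helpers to nothing under CONFIG_64BIT.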
Re: [PATCH v3 14/22] kvm: Fix race between timer signals and vcpu entry under !IOTHREAD
On 2011-01-31 11:03, Avi Kivity wrote:
> On 01/27/2011 04:33 PM, Jan Kiszka wrote:
> > Found by Stefan Hajnoczi: There is a race in kvm_cpu_exec between
> > checking for exit_request on vcpu entry and timer signals arriving
> > before KVM starts to catch them. Plug it by blocking both timer
> > related signals also on !CONFIG_IOTHREAD and process those via
> > signalfd.
> >
> > As this fix depends on real signalfd support (otherwise the timer
> > signals only kick the compat helper thread, and the main thread
> > hangs), we need to detect the invalid constellation and abort
> > configure.
> >
> > Signed-off-by: Jan Kiszka <jan.kis...@siemens.com>
> > CC: Stefan Hajnoczi <stefa...@linux.vnet.ibm.com>
> > ---
> >
> > I don't want to invest that much into !IOTHREAD anymore, so let's see
> > if the proposed catch&abort is acceptable.
>
> I don't understand the dependency on signalfd.  The normal way of doing
> things, either waiting for the signal in sigtimedwait() or in
> ioctl(KVM_RUN), works with SIGALRM just fine.

And how would you be kicked out of the select() call if it is waiting with
a timeout? We only have a single thread here.

The only alternative is Stefan's original proposal. But that required
fiddling with the signal mask twice per KVM_RUN.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: [PATCH 17/22] kvm: Move irqchip event processing out of inner loop
On 2011-01-31 11:08, Avi Kivity wrote:
> On 01/27/2011 03:10 PM, Jan Kiszka wrote:
> > Align with qemu-kvm and prepare for the IO exit fix: There is no need
> > to run kvm_arch_process_irqchip_events in the inner VCPU loop. Any
> > state change this service processes will first cause an exit from
> > kvm_cpu_exec anyway. And we will have to reenter the kernel on IO
> > exits unconditionally, something that the current logic prevents.
> >
> > @@ -892,6 +892,12 @@ int kvm_cpu_exec(CPUState *env)
> >
> >      DPRINTF("kvm_cpu_exec()\n");
> >
> > +    if (kvm_arch_process_irqchip_events(env)) {
> > +        env->exit_request = 0;
> > +        env->exception_index = EXCP_HLT;
> > +        return 0;
> > +    }
> > +
> >      do {
> >  #ifndef CONFIG_IOTHREAD
> >          if (env->exit_request) {
>
> We check for ->exit_request here
>
> >  #endif
> >
> > -        if (kvm_arch_process_irqchip_events(env)) {
> > -            ret = 0;
> > -            break;
> > -        }
> > -
>
> But this checks for ->interrupt_request.  What ensures that we exit when
> ->interrupt_request is set?

Good question, need to check again. But if that turns out to be an issue,
qemu-kvm would be broken as well. I'm just aligning the code here.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: Google Summer of Code 2011
On Sun, 30 Jan 2011 16:06:20 +0100
Alexander Graf <ag...@suse.de> wrote:

> On 28.01.2011, at 21:10, Luiz Capitulino wrote:
> > Hi there,
> >
> > GSoC 2011 has been announced[1]. As we were pretty successful last
> > year, I think we should participate again. I've already created a wiki
> > page: http://wiki.qemu.org/Google_Summer_of_Code_2011
> >
> > We should now populate it with projects, and people willing to be
> > mentors should say so (or just add a project)[2].
> >
> > Also, I'd like to do something different this year: I'd like to invite
> > libvirt people to join. There are two ways of doing this:
> >
> >  1. They join the program as a regular mentoring organization, or
> >  2. They join with QEMU
> >
> > The second option means that libvirt can suggest and run its own
> > projects (preferably with QEMU relevance), but from a GSoC
> > perspective, the project will be part of the QEMU org.
>
> Keep in mind that every full org gets a free trip to the west coast for
> 2 people ;). So splitting up means we could almost do a mini-summit at
> the google campus on google's expenses ;).

Actually, they have a limited budget and if you live too far (say, in
Brazil), the trip might not be 100% free :)

> Please coordinate that with Carol. Apparently traction for GSOC is
> declining (according to last year's summit). So there might be plenty of
> available slots this year. So I'd say sign up separately for now and if
> you don't get accepted, just join forces with us!

Yes, that's a good plan and I fully agree that we get more benefits if we
apply separately. It's a call to libvirt's people.
Re: [RFC -v7 PATCH 3/7] sched: use a buddy to implement yield_task_fair
On Wed, 2011-01-26 at 17:21 -0500, Rik van Riel wrote:
> +static struct sched_entity *__pick_second_entity(struct cfs_rq *cfs_rq)
> +{
> +	struct rb_node *left = cfs_rq->rb_leftmost;
> +	struct rb_node *second;
> +
> +	if (!left)
> +		return NULL;
> +
> +	second = rb_next(left);
> +
> +	if (!second)
> +		second = left;
> +
> +	return rb_entry(second, struct sched_entity, run_node);
> +}

I would still prefer:

  sed -i 's/__pick_next_entity/__pick_first_entity/g' kernel/sched*.[ch]

and:

static struct sched_entity *__pick_next_entity(struct sched_entity *se)
{
	struct rb_node *next = rb_next(&se->run_node);

	if (!next)
		return NULL;

	return rb_entry(next, struct sched_entity, run_node);
}

It's simpler and more versatile.

>  static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
>  {
>  	struct sched_entity *se = __pick_next_entity(cfs_rq);
>  	struct sched_entity *left = se;
>
> -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
> -		se = cfs_rq->next;
> +	/*
> +	 * Avoid running the skip buddy, if running something else can
> +	 * be done without getting too unfair.
> +	 */
> +	if (cfs_rq->skip == se) {
> +		struct sched_entity *second = __pick_second_entity(cfs_rq);

which then becomes:

		struct sched_entity *next = __pick_next_entity(se);

> +		if (wakeup_preempt_entity(second, left) < 1)

oops when !second. (!next)

> +			se = second;
> +	}

> -/*
> - * sched_yield() support is very simple - we dequeue and enqueue.
> - *
> - * If compat_yield is turned on then we requeue to the end of the tree.
> - */
> -static void yield_task_fair(struct rq *rq)
> -{
> -	struct task_struct *curr = rq->curr;
> -	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
> -	struct sched_entity *rightmost, *se = &curr->se;
> -
> -	/*
> -	 * Are we the only task in the tree?
> -	 */
> -	if (unlikely(rq->nr_running == 1))
> -		return;
> -
> -	clear_buddies(cfs_rq, se);
> -
> -	if (likely(!sysctl_sched_compat_yield) && curr->policy != SCHED_BATCH) {
> -		update_rq_clock(rq);
> -		/*
> -		 * Update run-time statistics of the 'current'.
> -		 */
> -		update_curr(cfs_rq);
> -
> -		return;
> -	}
> -	/*
> -	 * Find the rightmost entry in the rbtree:
> -	 */
> -	rightmost = __pick_last_entity(cfs_rq);
> -	/*
> -	 * Already in the rightmost position?
> -	 */
> -	if (unlikely(!rightmost || entity_before(rightmost, se)))
> -		return;
> -
> -	/*
> -	 * Minimally necessary key value to be last in the tree:
> -	 * Upon rescheduling, sched_class::put_prev_task() will place
> -	 * 'current' within the tree based on its new key value.
> -	 */
> -	se->vruntime = rightmost->vruntime + 1;
> -}

You seem to have lost the hunk removing sysctl_sched_compat_yield from
kernel/sysctl.c :-)
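For readers following along outside the kernel tree, the skip-buddy selection being debated can be modelled in a few lines of plain C. This is a toy sketch with invented names and an assumed fairness granularity, not the scheduler code itself:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a sched_entity: only the virtual runtime matters. */
struct entity { unsigned long long vruntime; };

/* Assumed fairness granularity; the real kernel derives this differently. */
#define GRAN 1000ULL

/*
 * Rough analogue of wakeup_preempt_entity(): nonzero means picking 'cand'
 * over 'first' would be too unfair. A NULL candidate mirrors the !second
 * case flagged in the review.
 */
static int too_unfair(const struct entity *cand, const struct entity *first)
{
    if (!cand)
        return 1;
    return cand->vruntime > first->vruntime + GRAN;
}

/*
 * Pick the leftmost entity unless it is the skip buddy (the task that just
 * yielded) and the runner-up can run without being too unfair.
 */
static const struct entity *pick(const struct entity *first,
                                 const struct entity *second,
                                 const struct entity *skip)
{
    const struct entity *se = first;

    if (se == skip && !too_unfair(second, first))
        se = second;
    return se;
}
```

The NULL check in too_unfair() is exactly the "oops when !second" fix requested above: without it, a tree with a single runnable task would dereference a NULL runner-up.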
Re: [RFC -v7 PATCH 4/7] Add yield_to(task, preempt) functionality.
On Wed, 2011-01-26 at 17:21 -0500, Rik van Riel wrote:
> +bool __sched yield_to(struct task_struct *p, bool preempt)
> +{
> +	struct task_struct *curr = current;
> +	struct rq *rq, *p_rq;
> +	unsigned long flags;
> +	bool yielded = 0;
> +
> +	local_irq_save(flags);
> +	rq = this_rq();
> +
> +again:
> +	p_rq = task_rq(p);
> +	double_rq_lock(rq, p_rq);
> +	while (task_rq(p) != p_rq) {
> +		double_rq_unlock(rq, p_rq);
> +		goto again;
> +	}
> +
> +	if (!curr->sched_class->yield_to_task)
> +		goto out;
> +
> +	if (curr->sched_class != p->sched_class)
> +		goto out;
> +
> +	if (task_running(p_rq, p) || p->state)
> +		goto out;
> +
> +	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
> +
> +out:
> +	double_rq_unlock(rq, p_rq);
> +	local_irq_restore(flags);
> +
> +	if (yielded)
> +		yield();
> +
> +	return yielded;
> +}
> +EXPORT_SYMBOL_GPL(yield_to);

yield() will again acquire rq->lock.. why not simply have ->yield_to_task()
do everything required and make that an unconditional schedule()?
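The again:/double_rq_lock() retry pattern in the patch above is worth isolating. Below is a minimal userspace sketch of the same shape using pthread mutexes; the struct names and the address-ordered locking are illustrative assumptions, not the kernel's implementation:

```c
#include <assert.h>
#include <pthread.h>

/* Toy runqueue and task, invented for illustration. */
struct rq   { pthread_mutex_t lock; };
struct task { struct rq *rq; };        /* queue the task currently lives on */

/* Lock two runqueues in a fixed (address) order to avoid ABBA deadlock. */
static void double_rq_lock(struct rq *a, struct rq *b)
{
    if (a == b) {
        pthread_mutex_lock(&a->lock);
    } else if (a < b) {
        pthread_mutex_lock(&a->lock);
        pthread_mutex_lock(&b->lock);
    } else {
        pthread_mutex_lock(&b->lock);
        pthread_mutex_lock(&a->lock);
    }
}

static void double_rq_unlock(struct rq *a, struct rq *b)
{
    pthread_mutex_unlock(&a->lock);
    if (a != b)
        pthread_mutex_unlock(&b->lock);
}

/*
 * The task may migrate between reading p->rq and acquiring the locks, so
 * re-check after locking and retry -- the same shape as the "again:" loop
 * in yield_to() above.
 */
static struct rq *lock_task_rq(struct rq *my_rq, struct task *p)
{
    struct rq *p_rq;

again:
    p_rq = p->rq;
    double_rq_lock(my_rq, p_rq);
    if (p->rq != p_rq) {
        double_rq_unlock(my_rq, p_rq);
        goto again;
    }
    return p_rq;
}
```

The revalidation after locking is the essential part: the snapshot of p->rq taken before double_rq_lock() may already be stale by the time both locks are held.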
Re: [RFC -v7 PATCH 5/7] export pid symbols needed for kvm_vcpu_on_spin
On Wed, 2011-01-26 at 17:23 -0500, Rik van Riel wrote:
> Export the symbols required for a race-free kvm_vcpu_on_spin.
>
> Avi, you asked for an example of why I hated KVM as a module :-)
>
> Signed-off-by: Rik van Riel r...@redhat.com
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 3b159c5..adc8f47 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -191,6 +191,7 @@ void __put_task_struct(struct task_struct *tsk)
>  	if (!profile_handoff_task(tsk))
>  		free_task(tsk);
>  }
> +EXPORT_SYMBOL_GPL(__put_task_struct);
>
>  /*
>   * macro override instead of weak attribute alias, to workaround
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 39b65b6..02f2212 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -435,6 +435,7 @@ struct pid *get_task_pid(struct task_struct *task, enum pid_type type)
>  	rcu_read_unlock();
>  	return pid;
>  }
> +EXPORT_SYMBOL_GPL(get_task_pid);
>
>  struct task_struct *get_pid_task(struct pid *pid, enum pid_type type)
>  {
> @@ -446,6 +447,7 @@ struct task_struct *get_pid_task(struct pid *pid, enum pid_type type)
>  	rcu_read_unlock();
>  	return result;
>  }
> +EXPORT_SYMBOL_GPL(get_pid_task);
>
>  struct pid *find_get_pid(pid_t nr)
>  {
Re: [PATCH 00/22] [uq/master] Patch queue, part II
On 2011-01-31 11:12, Avi Kivity wrote: On 01/27/2011 03:09 PM, Jan Kiszka wrote: This second round of patches focus on issues in cpus.c, primarily signal related. The highlights are - Add missing KVM_RUN continuation after I/O exits - Fix for timer signal race in KVM entry code under !CONFIG_IOTHREAD (based on Stefan's findings) - MCE signal processing under !CONFIG_IOTHREAD - Prevent abortion on multiple VCPU kicks - Fixed synchronous (ie. VCPU-issued) reset request processing Topics remaining for a third round are cleanups and refactorings of the KVM VCPU entry/exit path, MCE rework, and a few assorted cleanups. But I will wait at least until these bits here are ready for an upstream merge. Pretty nice. Is this standalone or does it depend on unmerged changes? It only depends on my previous patches in current uq/master. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [PATCH v3 14/22] kvm: Fix race between timer signals and vcpu entry under !IOTHREAD
On Mon, Jan 31, 2011 at 11:27 AM, Jan Kiszka jan.kis...@siemens.com wrote: On 2011-01-31 11:03, Avi Kivity wrote: On 01/27/2011 04:33 PM, Jan Kiszka wrote: Found by Stefan Hajnoczi: There is a race in kvm_cpu_exec between checking for exit_request on vcpu entry and timer signals arriving before KVM starts to catch them. Plug it by blocking both timer related signals also on !CONFIG_IOTHREAD and process those via signalfd. As this fix depends on real signalfd support (otherwise the timer signals only kick the compat helper thread, and the main thread hangs), we need to detect the invalid constellation and abort configure. Signed-off-by: Jan Kiszkajan.kis...@siemens.com CC: Stefan Hajnoczistefa...@linux.vnet.ibm.com --- I don't want to invest that much into !IOTHREAD anymore, so let's see if the proposed catchabort is acceptable. I don't understand the dependency on signalfd. The normal way of doing things, either waiting for the signal in sigtimedwait() or in ioctl(KVM_RUN), works with SIGALRM just fine. And how would you be kicked out of the select() call if it is waiting with a timeout? We only have a single thread here. The only alternative is Stefan's original proposal. But that required fiddling with the signal mask twice per KVM_RUN. I think my original patch messed with the sigmask in the wrong place, as you mentioned doing it twice per KVM_RUN isn't a good idea. I wonder if we can enable SIGALRM only in blocking calls and guest code execution but without signalfd. It might be possible, I don't see an immediate problem with doing that, we might have to use pselect(2) or similar in a few places. Stefan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [PATCH v3 14/22] kvm: Fix race between timer signals and vcpu entry under !IOTHREAD
On 2011-01-31 13:13, Stefan Hajnoczi wrote: On Mon, Jan 31, 2011 at 11:27 AM, Jan Kiszka jan.kis...@siemens.com wrote: On 2011-01-31 11:03, Avi Kivity wrote: On 01/27/2011 04:33 PM, Jan Kiszka wrote: Found by Stefan Hajnoczi: There is a race in kvm_cpu_exec between checking for exit_request on vcpu entry and timer signals arriving before KVM starts to catch them. Plug it by blocking both timer related signals also on !CONFIG_IOTHREAD and process those via signalfd. As this fix depends on real signalfd support (otherwise the timer signals only kick the compat helper thread, and the main thread hangs), we need to detect the invalid constellation and abort configure. Signed-off-by: Jan Kiszkajan.kis...@siemens.com CC: Stefan Hajnoczistefa...@linux.vnet.ibm.com --- I don't want to invest that much into !IOTHREAD anymore, so let's see if the proposed catchabort is acceptable. I don't understand the dependency on signalfd. The normal way of doing things, either waiting for the signal in sigtimedwait() or in ioctl(KVM_RUN), works with SIGALRM just fine. And how would you be kicked out of the select() call if it is waiting with a timeout? We only have a single thread here. The only alternative is Stefan's original proposal. But that required fiddling with the signal mask twice per KVM_RUN. I think my original patch messed with the sigmask in the wrong place, as you mentioned doing it twice per KVM_RUN isn't a good idea. I wonder if we can enable SIGALRM only in blocking calls and guest code execution but without signalfd. It might be possible, I don't see an immediate problem with doing that, we might have to use pselect(2) or similar in a few places. My main concern about alternative approaches is that IOTHREAD is about to become the default, and hardly anyone (of the few upstream KVM users) will run without it in the foreseeable future. The next step will be the removal of any !CONFIG_IOTHREAD section. 
So, how much do we want to invest here (provided my proposal has no remaining issues)?

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: [PATCH 17/22] kvm: Move irqchip event processing out of inner loop
On 2011-01-31 12:36, Jan Kiszka wrote:
> On 2011-01-31 11:08, Avi Kivity wrote:
>> On 01/27/2011 03:10 PM, Jan Kiszka wrote:
>>> Align with qemu-kvm and prepare for IO exit fix: There is no need to run
>>> kvm_arch_process_irqchip_events in the inner VCPU loop. Any state change
>>> this service processes will first cause an exit from kvm_cpu_exec
>>> anyway. And we will have to reenter the kernel on IO exits
>>> unconditionally, something that the current logic prevents.
>>>
>>> Signed-off-by: Jan Kiszka jan.kis...@siemens.com
>>> ---
>>>  kvm-all.c | 11 ++-
>>>  1 files changed, 6 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/kvm-all.c b/kvm-all.c
>>> index 5bfa8c0..46ecc1c 100644
>>> --- a/kvm-all.c
>>> +++ b/kvm-all.c
>>> @@ -892,6 +892,12 @@ int kvm_cpu_exec(CPUState *env)
>>>      DPRINTF("kvm_cpu_exec()\n");
>>>
>>> +    if (kvm_arch_process_irqchip_events(env)) {
>>> +        env->exit_request = 0;
>>> +        env->exception_index = EXCP_HLT;
>>> +        return 0;
>>> +    }
>>> +
>>>      do {
>>>  #ifndef CONFIG_IOTHREAD
>>>          if (env->exit_request) {
>>> @@ -901,11 +907,6 @@ int kvm_cpu_exec(CPUState *env)
>>>          }
>>
>> We check for ->exit_request here
>>
>>>  #endif
>>> -        if (kvm_arch_process_irqchip_events(env)) {
>>> -            ret = 0;
>>> -            break;
>>> -        }
>>> -
>>
>> But this checks for ->interrupt_request. What ensures that we exit when
>> ->interrupt_request is set?
>
> Good question, need to check again. But if that turns out to be an issue,
> qemu-kvm would be broken as well. I'm just aligning the code here.

The only thing we miss by moving process_irqchip_events is a self-INIT of
an AP - if such a thing exists in real life. In that case, the AP would
cause a reset of itself, followed by a transition to HALT state. A
self-SIPI has no effect as A) a CPU can't send a SIPI from wait-on-SIPI
state and B) SIPIs are ignored in any other state.

Will post a version that additionally checks for pending INIT as well and
injects a self-IPI then.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: [PATCH 2/3] KVM: Emulate MSI-X table in kernel
On 01/30/2011 06:38 AM, Sheng Yang wrote:
> (Sorry, missed this mail...)
>
> On Mon, Jan 17, 2011 at 02:29:44PM +0200, Avi Kivity wrote:
>> On 01/06/2011 12:19 PM, Sheng Yang wrote:
>>> Then we can support mask bit operation of assigned devices now.
>>>
>>> +int kvm_assigned_device_update_msix_mask_bit(struct kvm *kvm,
>>> +		int assigned_dev_id, int entry, bool mask)
>>> +{
>>> +	int r = -EFAULT;
>>> +	struct kvm_assigned_dev_kernel *adev;
>>> +	int i;
>>> +
>>> +	if (!irqchip_in_kernel(kvm))
>>> +		return r;
>>> +
>>> +	mutex_lock(&kvm->lock);
>>> +	adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
>>> +				     assigned_dev_id);
>>> +	if (!adev)
>>> +		goto out;
>>> +
>>> +	for (i = 0; i < adev->entries_nr; i++)
>>> +		if (adev->host_msix_entries[i].entry == entry) {
>>> +			if (mask)
>>> +				disable_irq_nosync(
>>> +					adev->host_msix_entries[i].vector);
>>
>> Is it okay to call disable_irq_nosync() here? IIRC we don't check the
>> mask bit on irq delivery, so we may forward an interrupt to the guest
>> after the mask bit was set.
>>
>> What does pci say about the mask bit? when does it take effect?
>>
>> Another question is whether disable_irq_nosync() actually programs the
>> device mask bit, or not. If it does, then it's slow, and it may be
>> better to leave interrupts enabled but have an internal pending bit. If
>> it doesn't program the mask bit, it's fine.
>
> I think Michael and Jan had explained this.
>
>>> +			else
>>> +				enable_irq(adev->host_msix_entries[i].vector);
>>> +			r = 0;
>>> +			break;
>>> +		}
>>> +out:
>>> +	mutex_unlock(&kvm->lock);
>>> +	return r;
>>> +}
>>> +
>>> +static int msix_table_mmio_read(struct kvm_io_device *this, gpa_t addr,
>>> +				int len, void *val)
>>> +{
>>> +	struct kvm_msix_mmio_dev *mmio_dev =
>>> +		container_of(this, struct kvm_msix_mmio_dev, table_dev);
>>> +	struct kvm_msix_mmio *mmio;
>>> +	int idx, ret = 0, entry, offset, r;
>>> +
>>> +	mutex_lock(&mmio_dev->lock);
>>> +	idx = get_mmio_table_index(mmio_dev, addr, len);
>>> +	if (idx < 0) {
>>> +		ret = -EOPNOTSUPP;
>>> +		goto out;
>>> +	}
>>> +	if ((addr & 0x3) || (len != 4 && len != 8))
>>> +		goto out;
>>> +
>>> +	offset = addr & 0xf;
>>> +	if (offset == PCI_MSIX_ENTRY_VECTOR_CTRL && len == 8)
>>> +		goto out;
>>> +
>>> +	mmio = &mmio_dev->mmio[idx];
>>> +	entry = (addr - mmio->table_base_addr) / PCI_MSIX_ENTRY_SIZE;
>>> +	r = copy_from_user(val, (void __user *)(mmio->table_base_va +
>>> +			entry * PCI_MSIX_ENTRY_SIZE + offset), len);
>>> +	if (r)
>>> +		goto out;
>>
>> and return ret == 0?
>
> Yes. This operation should be handled by in-kernel MSI-X MMIO. So we
> return 0 in order to omit this action. We can add warning to it later.

But it failed. We need to return -EFAULT. The same as above.

>>> +
>>> +	if ((offset < PCI_MSIX_ENTRY_VECTOR_CTRL && len == 4) ||
>>> +	    (offset < PCI_MSIX_ENTRY_DATA && len == 8))
>>> +		ret = -ENOTSYNC;
>>
>> goto out?
>
> No. This judgement only checks if MSI data/address was touched. And the
> line below would check if we need to operate the mask bit. Because in
> theory the guest can use len=8 to modify MSI-X data and ctrl at the same
> time.

Ok, makes sense.

-- 
error compiling committee.c: too many arguments to function
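The entry/offset decoding discussed above follows directly from the MSI-X table layout (16-byte entries, vector control dword at offset 12). A standalone sketch, with the struct name and helper invented for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* MSI-X table entry layout per the PCI spec: 16 bytes per entry. */
#define PCI_MSIX_ENTRY_SIZE         16
#define PCI_MSIX_ENTRY_LOWER_ADDR    0
#define PCI_MSIX_ENTRY_UPPER_ADDR    4
#define PCI_MSIX_ENTRY_DATA          8
#define PCI_MSIX_ENTRY_VECTOR_CTRL  12

struct msix_access {
    int entry;          /* table entry the access hits */
    int offset;         /* byte offset within that entry */
    int touches_ctrl;   /* does it cover the vector-control dword? */
};

/*
 * Decode a table-relative address/len pair the way the code under review
 * does: entry index from the address, dword offset from the low bits, and
 * whether a 4- or 8-byte access reaches the vector control word (where
 * the mask bit lives).
 */
static struct msix_access decode(uint64_t addr, int len)
{
    struct msix_access a;

    a.entry  = (int)(addr / PCI_MSIX_ENTRY_SIZE);
    a.offset = (int)(addr & 0xf);
    a.touches_ctrl = (a.offset == PCI_MSIX_ENTRY_VECTOR_CTRL) ||
                     (a.offset == PCI_MSIX_ENTRY_DATA && len == 8);
    return a;
}
```

The len == 8 case in touches_ctrl is the situation Sheng describes: a single 8-byte guest write starting at the data dword modifies both the MSI data and the vector control in one access.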
Re: [PATCH 01/22] Prevent abortion on multiple VCPU kicks
On 01/31/2011 01:19 PM, Jan Kiszka wrote:
> On 2011-01-31 10:44, Avi Kivity wrote:
>> On 01/27/2011 03:09 PM, Jan Kiszka wrote:
>>> If we call qemu_cpu_kick more than once before the target was able to
>>> process the signal, pthread_kill will fail, and qemu will abort.
>>> Prevent this by avoiding the redundant signal.
>>
>> Doesn't fit with the manual page (or with the idea that signals are
>> asynchronous):
>>
>> NAME
>>        pthread_kill - send a signal to a thread
>> ...
>> ERRORS
>>        ESRCH  No thread with the ID thread could be found.
>>        EINVAL An invalid signal was specified.
>
> Valid remark, but I was receiving EAGAIN for blocked RT signals. Don't
> know if this is Linux-specific. A quick glance at the man pages did not
> reveal if this is allowed or at least gray area.

	} else if (!is_si_special(info)) {
		if (sig >= SIGRTMIN && info->si_code != SI_USER) {
			/*
			 * Queue overflow, abort.  We may abort if the
			 * signal was rt and sent by user using something
			 * other than kill().
			 */
			trace_signal_overflow_fail(sig, group, info);
			return -EAGAIN;
		}

> However, even when selectively ignoring this, it's more efficient to
> catch the redundant signaling in user space.

Yes.

-- 
error compiling committee.c: too many arguments to function
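Catching the redundant signal in user space, as agreed above, can be done with a per-vCPU pending flag so that only the 0 -> 1 transition actually sends the signal. A hypothetical sketch (names invented; real code would call pthread_kill where the counter is bumped):

```c
#include <assert.h>
#include <stdatomic.h>

/* One pending-kick flag per vCPU; names are invented for this sketch. */
struct vcpu {
    atomic_int kick_pending;
    int signals_sent;          /* stands in for pthread_kill() calls */
};

/*
 * Deliver the kick only on the 0 -> 1 transition of the flag, so a burst
 * of kicks queues at most one RT signal for the target thread and the
 * kernel's RT-signal queue can never overflow with EAGAIN.
 */
static void cpu_kick(struct vcpu *v)
{
    if (atomic_exchange(&v->kick_pending, 1) == 0)
        v->signals_sent++;     /* real code: pthread_kill(tid, SIG_IPI) */
}

/* Run by the vCPU thread once the signal has been consumed. */
static void cpu_kick_done(struct vcpu *v)
{
    atomic_store(&v->kick_pending, 0);
}
```

The atomic exchange makes the check-and-set a single step, so two concurrent kickers cannot both observe the flag as clear.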
Re: [PATCH 04/22] Leave inner main_loop faster on pending requests
On 01/31/2011 01:22 PM, Jan Kiszka wrote: On 2011-01-31 10:52, Avi Kivity wrote: On 01/27/2011 03:09 PM, Jan Kiszka wrote: If there is any pending request that requires us to leave the inner loop if main_loop, makes sure we do this as soon as possible by enforcing non-blocking IO processing. At this change, move variable definitions out of the inner loop to improve readability. Signed-off-by: Jan Kiszkajan.kis...@siemens.com --- vl.c | 11 +++ 1 files changed, 7 insertions(+), 4 deletions(-) diff --git a/vl.c b/vl.c index 5fad700..2ebc55b 100644 --- a/vl.c +++ b/vl.c @@ -1384,18 +1384,21 @@ qemu_irq qemu_system_powerdown; static void main_loop(void) { +bool nonblocking = false; +#ifdef CONFIG_PROFILER +int64_t ti; +#endif int r; qemu_main_loop_start(); for (;;) { do { -bool nonblocking = false; -#ifdef CONFIG_PROFILER -int64_t ti; -#endif #ifndef CONFIG_IOTHREAD nonblocking = cpu_exec_all(); +if (!vm_can_run()) { +nonblocking = true; +} Doesn't this cause vmstop to spin? We'll never execute main_loop_wait(false) if I read the code correctly? The code path is not changed, we just poll instead of wait in main_loop_wait. Where do we wait then? Also, I didn't get your error scenario yet. Even if we left the loop here, what magic would main_loop_wait do to vmstop processing? The stop request is handled outside the loop, that's why we should leave ASAP. I'm just missing the alternate place where we sleep. If there's no such place, we spin. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 14/22] kvm: Fix race between timer signals and vcpu entry under !IOTHREAD
On 01/31/2011 01:27 PM, Jan Kiszka wrote: On 2011-01-31 11:03, Avi Kivity wrote: On 01/27/2011 04:33 PM, Jan Kiszka wrote: Found by Stefan Hajnoczi: There is a race in kvm_cpu_exec between checking for exit_request on vcpu entry and timer signals arriving before KVM starts to catch them. Plug it by blocking both timer related signals also on !CONFIG_IOTHREAD and process those via signalfd. As this fix depends on real signalfd support (otherwise the timer signals only kick the compat helper thread, and the main thread hangs), we need to detect the invalid constellation and abort configure. Signed-off-by: Jan Kiszkajan.kis...@siemens.com CC: Stefan Hajnoczistefa...@linux.vnet.ibm.com --- I don't want to invest that much into !IOTHREAD anymore, so let's see if the proposed catchabort is acceptable. I don't understand the dependency on signalfd. The normal way of doing things, either waiting for the signal in sigtimedwait() or in ioctl(KVM_RUN), works with SIGALRM just fine. And how would you be kicked out of the select() call if it is waiting with a timeout? We only have a single thread here. If we use signalfd() (either kernel provided or thread+pipe), we kick out of select by select()ing it (though I don't see how it works without an iothread, since an fd can't stop a vcpu unless you enable SIGIO on it, which is silly for signalfd) If you leave it as a naked signal, then it can break out of either pselect() or vcpu. Since the goal is to drop !CONFIG_IOTHREAD, the first path seems better, I just don't understand the problem with emulated signalfd(). -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] KVM: Add documents for MSI-X MMIO API
On 01/26/2011 11:05 AM, Sheng Yang wrote:
> On Tuesday 25 January 2011 20:47:38 Avi Kivity wrote:
>> On 01/19/2011 10:21 AM, Sheng Yang wrote:
>>> We already got a guest MMIO address for that in the exit information.
>>> I've created a chain of handlers in qemu to handle it.
>>
>> But we already decoded the table and entry...
>
> But the handler is still wrapped by vcpu_mmio_write(), as a part of
> MMIO. So it's not quite handy to get the table and entry out.
>
>> The kernel handler can create a new kvm_run exit description.
>
> Also the updater in userspace can share most of the logic with the
> ordinary userspace MMIO handler, which takes an address as parameter.
> So I think we don't need to pass the decoded table_id and entry to
> userspace.
>
>> It's mixing layers, which always leads to trouble. For one, the user
>> handler shouldn't do anything with the write since the kernel already
>> wrote it into the table. For another, if two vcpus write to the same
>> entry simultaneously, you could see different ordering in the kernel
>> and userspace, and get inconsistent results.
>
> The shared logic is not about writing, but about interpreting what's
> written. The old MMIO handler would write the data, then interpret it;
> our new MMIO handler would only share the logic of interpretation. I
> think that's fair enough?

It doesn't make sense from an API point of view. You registered a table
of entries; you expect an exit on that table to point to the table and
entry that got changed.

-- 
error compiling committee.c: too many arguments to function
Re: [RFC -v7 PATCH 5/7] export pid symbols needed for kvm_vcpu_on_spin
On 01/31/2011 01:51 PM, Peter Zijlstra wrote: On Wed, 2011-01-26 at 17:23 -0500, Rik van Riel wrote: Export the symbols required for a race-free kvm_vcpu_on_spin. Avi, you asked for an example of why I hated KVM as a module :-) Why do you dislike exports so much? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [PATCH v3 14/22] kvm: Fix race between timer signals and vcpu entry under !IOTHREAD
On Mon, Jan 31, 2011 at 12:18 PM, Jan Kiszka jan.kis...@siemens.com wrote: On 2011-01-31 13:13, Stefan Hajnoczi wrote: On Mon, Jan 31, 2011 at 11:27 AM, Jan Kiszka jan.kis...@siemens.com wrote: On 2011-01-31 11:03, Avi Kivity wrote: On 01/27/2011 04:33 PM, Jan Kiszka wrote: Found by Stefan Hajnoczi: There is a race in kvm_cpu_exec between checking for exit_request on vcpu entry and timer signals arriving before KVM starts to catch them. Plug it by blocking both timer related signals also on !CONFIG_IOTHREAD and process those via signalfd. As this fix depends on real signalfd support (otherwise the timer signals only kick the compat helper thread, and the main thread hangs), we need to detect the invalid constellation and abort configure. Signed-off-by: Jan Kiszkajan.kis...@siemens.com CC: Stefan Hajnoczistefa...@linux.vnet.ibm.com --- I don't want to invest that much into !IOTHREAD anymore, so let's see if the proposed catchabort is acceptable. I don't understand the dependency on signalfd. The normal way of doing things, either waiting for the signal in sigtimedwait() or in ioctl(KVM_RUN), works with SIGALRM just fine. And how would you be kicked out of the select() call if it is waiting with a timeout? We only have a single thread here. The only alternative is Stefan's original proposal. But that required fiddling with the signal mask twice per KVM_RUN. I think my original patch messed with the sigmask in the wrong place, as you mentioned doing it twice per KVM_RUN isn't a good idea. I wonder if we can enable SIGALRM only in blocking calls and guest code execution but without signalfd. It might be possible, I don't see an immediate problem with doing that, we might have to use pselect(2) or similar in a few places. My main concern about alternative approaches is that IOTHREAD is about to become the default, and hardly anyone (of the few upstream KVM users) will run without it in the foreseeable future. 
The next step will be the removal of any !CONFIG_IOTHREAD section. So, how much do we want to invest here (provided my proposal has not remaining issues)? Yes, you're right. I'm not volunteering to dig more into this, the best case would be to switch to a non-I/O thread world that works for everybody. Stefan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC -v7 PATCH 5/7] export pid symbols needed for kvm_vcpu_on_spin
On Mon, 2011-01-31 at 15:26 +0200, Avi Kivity wrote:
> On 01/31/2011 01:51 PM, Peter Zijlstra wrote:
>> On Wed, 2011-01-26 at 17:23 -0500, Rik van Riel wrote:
>>> Export the symbols required for a race-free kvm_vcpu_on_spin.
>>>
>>> Avi, you asked for an example of why I hated KVM as a module :-)
>
> Why do you dislike exports so much?

Because I hate modules [*]. Since we sadly cannot undo those, limiting
the module interface is all we've got, and KVM is one that pushes the
module interface further and further. Many people see an EXPORT as an
acknowledgement that it's OK to use the interface, even when it's not.

[*] they enable the binary junk thing.
Re: [RFC -v7 PATCH 5/7] export pid symbols needed for kvm_vcpu_on_spin
On 01/31/2011 03:43 PM, Peter Zijlstra wrote: On Mon, 2011-01-31 at 15:26 +0200, Avi Kivity wrote: On 01/31/2011 01:51 PM, Peter Zijlstra wrote: On Wed, 2011-01-26 at 17:23 -0500, Rik van Riel wrote: Export the symbols required for a race-free kvm_vcpu_on_spin. Avi, you asked for an example of why I hated KVM as a module :-) Why do you dislike exports so much? Because I hate modules [*], Well, you could have just said you hate KVM as a module because it is a module, and saved one indirection. since we sadly cannot undo those, limiting the module interface is all we've got, and KVM is one that pushes the module interface further and further. Many people see an EXPORT as a acknowledgement that its OK to use the interface, even when its not. [*] they enable the binary junk thing. Then we need to stop binary modules, not castrate modules as a whole. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM: MMU: update sp-gfns on pte update path
On Tue, Jan 25, 2011 at 07:36:02PM +0200, Avi Kivity wrote:
> On 01/25/2011 07:12 PM, Marcelo Tosatti wrote:
>>> Should be done by a call to kvm_mmu_page_set_gfn(). But I don't
>>> understand how it could become inconsistent in the first place.
>>
>>	if (is_rmap_spte(*sptep)) {
>>		/*
>>		 * If we overwrite a PTE page pointer with a 2MB PMD, unlink
>>		 * the parent of the now unreachable PTE.
>>		 */
>>		if (level > PT_PAGE_TABLE_LEVEL && !is_large_pte(*sptep)) {
>>			struct kvm_mmu_page *child;
>>			u64 pte = *sptep;
>>
>>			child = page_header(pte & PT64_BASE_ADDR_MASK);
>>			mmu_page_remove_parent_pte(child, sptep);
>>			__set_spte(sptep, shadow_trap_nonpresent_pte);
>>			kvm_flush_remote_tlbs(vcpu->kvm);
>>		} else if (pfn != spte_to_pfn(*sptep)) {
>>			pgprintk("hfn old %llx new %llx\n",
>>				 spte_to_pfn(*sptep), pfn);
>>			drop_spte(vcpu->kvm, sptep, shadow_trap_nonpresent_pte);
>>			kvm_flush_remote_tlbs(vcpu->kvm);
>>		} else
>>			was_rmapped = 1;
>>	}
>>
>> If we set was_rmapped, that means rmap_add() was previously called for
>> this spte/gfn/pfn pair, and all that changes is permissions, no?
>
> What if pfn is the same but gfn differs?

Could be. Any way to verify if this was the case?

> Isn't it nicer to have it detected by the test above and do the
> drop_spte()/kvm_flush_remote_tlbs() thing instead?

It could not be the case. If the spte is updated to point to a new gfn,
and the rmap is not updated:

1. rmap[A] = spte
   sp->gfns[i] = A
   spte points to gfn A

2. rmap[A] = spte
   sp->gfns[i] = A
   spte points to gfn B

rmap_remove(spte) will succeed (as in not crash). In case gfn A's slot is
removed, all shadow pages will be destroyed. So what can fail from this
point on are operations on gfn B, such as rmap_write_protect(B).

Yes, it's nicer (and correct) to do it at drop_spte. Will resubmit.

However, I still have no explanation for Nicolas' BUGs... ideas?
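The two-state enumeration above can be made concrete with a toy reverse map. In this sketch (invented names, a single spte), repointing the spte without updating the rmap leaves operations on gfn B unable to find the spte, exactly the rmap_write_protect(B) failure described:

```c
#include <assert.h>

#define NGFNS   4
#define NO_SPTE (-1)

/* Toy model, invented for illustration: rmap indexed by gfn, one spte. */
static int rmap[NGFNS] = { NO_SPTE, NO_SPTE, NO_SPTE, NO_SPTE };
static int spte_gfn;           /* the gfn the spte actually points to */

/* rmap_add() analogue: file the spte under the gfn it maps (state 1). */
static void map(int gfn, int spte)
{
    rmap[gfn] = spte;
    spte_gfn = gfn;
}

/*
 * The buggy transition: repoint the spte to a new gfn WITHOUT moving the
 * rmap entry -- state 2 in the enumeration above.
 */
static void buggy_repoint(int new_gfn)
{
    spte_gfn = new_gfn;        /* rmap[] left stale under the old gfn */
}
```

Removing the stale entry still "works" (it is filed under gfn A, so rmap_remove does not crash), which matches the observation that the failure mode is silent until something operates on gfn B.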
Re: [PATCH v3 14/22] kvm: Fix race between timer signals and vcpu entry under !IOTHREAD
On 2011-01-31 14:22, Avi Kivity wrote: On 01/31/2011 01:27 PM, Jan Kiszka wrote: On 2011-01-31 11:03, Avi Kivity wrote: On 01/27/2011 04:33 PM, Jan Kiszka wrote: Found by Stefan Hajnoczi: There is a race in kvm_cpu_exec between checking for exit_request on vcpu entry and timer signals arriving before KVM starts to catch them. Plug it by blocking both timer related signals also on !CONFIG_IOTHREAD and process those via signalfd. As this fix depends on real signalfd support (otherwise the timer signals only kick the compat helper thread, and the main thread hangs), we need to detect the invalid constellation and abort configure. Signed-off-by: Jan Kiszkajan.kis...@siemens.com CC: Stefan Hajnoczistefa...@linux.vnet.ibm.com --- I don't want to invest that much into !IOTHREAD anymore, so let's see if the proposed catchabort is acceptable. I don't understand the dependency on signalfd. The normal way of doing things, either waiting for the signal in sigtimedwait() or in ioctl(KVM_RUN), works with SIGALRM just fine. And how would you be kicked out of the select() call if it is waiting with a timeout? We only have a single thread here. If we use signalfd() (either kernel provided or thread+pipe), we kick out of select by select()ing it (though I don't see how it works without an iothread, since an fd can't stop a vcpu unless you enable SIGIO on it, which is silly for signalfd) If you leave it as a naked signal, then it can break out of either pselect() or vcpu. Since the goal is to drop !CONFIG_IOTHREAD, the first path seems better, I just don't understand the problem with emulated signalfd(). With the emulated signalfd, there won't be any signal for the VCPU while in KVM_RUN. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 04/22] Leave inner main_loop faster on pending requests
On 2011-01-31 14:17, Avi Kivity wrote: On 01/31/2011 01:22 PM, Jan Kiszka wrote: On 2011-01-31 10:52, Avi Kivity wrote: On 01/27/2011 03:09 PM, Jan Kiszka wrote: If there is any pending request that requires us to leave the inner loop if main_loop, makes sure we do this as soon as possible by enforcing non-blocking IO processing. At this change, move variable definitions out of the inner loop to improve readability. Signed-off-by: Jan Kiszkajan.kis...@siemens.com --- vl.c | 11 +++ 1 files changed, 7 insertions(+), 4 deletions(-) diff --git a/vl.c b/vl.c index 5fad700..2ebc55b 100644 --- a/vl.c +++ b/vl.c @@ -1384,18 +1384,21 @@ qemu_irq qemu_system_powerdown; static void main_loop(void) { +bool nonblocking = false; +#ifdef CONFIG_PROFILER +int64_t ti; +#endif int r; qemu_main_loop_start(); for (;;) { do { -bool nonblocking = false; -#ifdef CONFIG_PROFILER -int64_t ti; -#endif #ifndef CONFIG_IOTHREAD nonblocking = cpu_exec_all(); +if (!vm_can_run()) { +nonblocking = true; +} Doesn't this cause vmstop to spin? We'll never execute main_loop_wait(false) if I read the code correctly? The code path is not changed, we just poll instead of wait in main_loop_wait. Where do we wait then? Also, I didn't get your error scenario yet. Even if we left the loop here, what magic would main_loop_wait do to vmstop processing? The stop request is handled outside the loop, that's why we should leave ASAP. I'm just missing the alternate place where we sleep. If there's no such place, we spin. Probably you are misled by the name of vm_can_run. It should better be renamed to requests_pending or something like that - iow, it is only true if some request is pending and not if just the vm is in stop mode. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC -v7 PATCH 3/7] sched: use a buddy to implement yield_task_fair
On 01/31/2011 06:47 AM, Peter Zijlstra wrote: On Wed, 2011-01-26 at 17:21 -0500, Rik van Riel wrote:

+static struct sched_entity *__pick_second_entity(struct cfs_rq *cfs_rq)
+{
+	struct rb_node *left = cfs_rq->rb_leftmost;
+	struct rb_node *second;
+
+	if (!left)
+		return NULL;
+
+	second = rb_next(left);
+
+	if (!second)
+		second = left;
+
+	return rb_entry(second, struct sched_entity, run_node);
+}

I would still prefer: sed -i 's/__pick_next_entity/__pick_first_entity/g' kernel/sched*.[ch]

Ahhh, now I understand what you want. I'll get right on it.

You seem to have lost the hunk removing sysctl_sched_compat_yield from kernel/sysctl.c :-)

Let me take care of that, too.

-- All rights reversed
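The fallback logic in the quoted hunk — take the second-leftmost entity, but fall back to the leftmost when it is the only one — can be shown without an rbtree. This is an illustrative analog, not kernel code: a plain sorted array stands in for the in-order walk that rb_next() performs over the cfs_rq:

```c
#include <stddef.h>

/* Analog of __pick_second_entity(): given the in-order sequence of
 * runnable entities (array stands in for the cfs_rq rbtree), return
 * the second leftmost, falling back to the leftmost when it is alone.
 * Hypothetical simplification for illustration only. */
static const int *pick_second(const int *ents, size_t n)
{
    if (n == 0)
        return NULL;        /* no leftmost node at all */
    if (n == 1)
        return &ents[0];    /* "second" falls back to the leftmost */
    return &ents[1];
}
```

The fallback matters for yield_task_fair: if the yielding task is the only runnable entity, picking "second" must still return something schedulable rather than NULL.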
Re: [PATCH 17/22] kvm: Move irqchip event processing out of inner loop
On 2011-01-31 14:04, Jan Kiszka wrote: On 2011-01-31 12:36, Jan Kiszka wrote: On 2011-01-31 11:08, Avi Kivity wrote: On 01/27/2011 03:10 PM, Jan Kiszka wrote:

Align with qemu-kvm and prepare for IO exit fix: There is no need to run kvm_arch_process_irqchip_events in the inner VCPU loop. Any state change this service processes will first cause an exit from kvm_cpu_exec anyway. And we will have to reenter the kernel on IO exits unconditionally, something that the current logic prevents.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 kvm-all.c | 11 ++-
 1 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index 5bfa8c0..46ecc1c 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -892,6 +892,12 @@ int kvm_cpu_exec(CPUState *env)
     DPRINTF("kvm_cpu_exec()\n");
+    if (kvm_arch_process_irqchip_events(env)) {
+        env->exit_request = 0;
+        env->exception_index = EXCP_HLT;
+        return 0;
+    }
+
     do {
 #ifndef CONFIG_IOTHREAD
         if (env->exit_request) {
@@ -901,11 +907,6 @@ int kvm_cpu_exec(CPUState *env)
         }

We check for ->exit_request here

 #endif
-        if (kvm_arch_process_irqchip_events(env)) {
-            ret = 0;
-            break;
-        }
-

But this checks for ->interrupt_request. What ensures that we exit when ->interrupt_request is set?

Good question, need to check again. But if that turns out to be an issue, qemu-kvm would be broken as well. I'm just aligning the code here. The only thing we miss by moving process_irqchip_events is a self-INIT of an AP - if such thing exists in real life. In that case, the AP would cause a reset of itself, followed by a transition to HALT state.

I checked again with the Intel spec, and a self-INIT is invalid (at least when specified via shorthand). So I'm under the impression now that we can safely ignore this case and leave the patch as is. Any different views?

Jan
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
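The control flow the patch establishes — check irqchip events once, before the inner KVM_RUN loop, and bail out halted if they demand it — can be sketched as follows. All names and the halted flag are simplified stand-ins, not the real kvm-all.c code (EXCP_HLT's value matches QEMU's definition, but the rest is illustrative):

```c
#include <stdbool.h>

enum { EXCP_NONE = 0, EXCP_HLT = 0x10001 };  /* 0x10001 as in QEMU */

/* Stand-in for kvm_arch_process_irqchip_events() reporting that the
 * vcpu should halt (e.g. INIT/SIPI processing left it unrunnable). */
static bool halted_by_irqchip_events;

/* Sketch of the reordered kvm_cpu_exec(): the irqchip-event check
 * happens once up front instead of inside every loop iteration. */
static int cpu_exec_sketch(int *exception_index)
{
    if (halted_by_irqchip_events) {
        *exception_index = EXCP_HLT;   /* leave before entering KVM_RUN */
        return 0;
    }
    *exception_index = EXCP_NONE;
    /* inner do { ioctl(KVM_RUN); handle exit; } while (...) loop here */
    return 0;
}
```

The argument for hoisting the check is that any state change the service processes will already have forced an exit from the loop, so re-checking on every iteration only prevents the unconditional reentry needed for the IO exit fix.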
Re: [PATCH v3 14/22] kvm: Fix race between timer signals and vcpu entry under !IOTHREAD
On 01/31/2011 04:31 PM, Jan Kiszka wrote:

And how would you be kicked out of the select() call if it is waiting with a timeout? We only have a single thread here.

If we use signalfd() (either kernel provided or thread+pipe), we kick out of select by select()ing it (though I don't see how it works without an iothread, since an fd can't stop a vcpu unless you enable SIGIO on it, which is silly for signalfd). If you leave it as a naked signal, then it can break out of either pselect() or vcpu. Since the goal is to drop !CONFIG_IOTHREAD, the first path seems better, I just don't understand the problem with emulated signalfd().

With the emulated signalfd, there won't be any signal for the VCPU while in KVM_RUN.

I see it now - with a real signalfd, kvm unmasks the signal, and that takes precedence over signalfd and exits the vcpu.

-- error compiling committee.c: too many arguments to function
Re: [PATCH 17/22] kvm: Move irqchip event processing out of inner loop
On Mon, Jan 31, 2011 at 04:40:34PM +0100, Jan Kiszka wrote:

I checked again with the Intel spec, and a self-INIT is invalid (at least when specified via shorthand). So I'm under the impression now that we can safely ignore this case and leave the patch as is. Any different views?

IIRC if you don't use shorthand you can send INIT to self.

-- Gleb.
Re: [PATCH 17/22] kvm: Move irqchip event processing out of inner loop
On 2011-01-31 17:38, Gleb Natapov wrote: On Mon, Jan 31, 2011 at 04:40:34PM +0100, Jan Kiszka wrote:

I checked again with the Intel spec, and a self-INIT is invalid (at least when specified via shorthand). So I'm under the impression now that we can safely ignore this case and leave the patch as is. Any different views?

IIRC if you don't use shorthand you can send INIT to self.

We didn't care so far (in qemu-kvm), do you think we should?

Jan
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: [PATCH 17/22] kvm: Move irqchip event processing out of inner loop
On 2011-01-31 17:41, Jan Kiszka wrote: On 2011-01-31 17:38, Gleb Natapov wrote:

IIRC if you don't use shorthand you can send INIT to self.

We didn't care so far (in qemu-kvm), do you think we should?

...and the kernel model should have barked INIT on a runnable vcpu in such cases (BTW, that's user triggerable and should likely be rate limited).

Jan
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: [PATCH 17/22] kvm: Move irqchip event processing out of inner loop
On Mon, Jan 31, 2011 at 05:41:24PM +0100, Jan Kiszka wrote: On 2011-01-31 17:38, Gleb Natapov wrote:

IIRC if you don't use shorthand you can send INIT to self.

We didn't care so far (in qemu-kvm), do you think we should?

Doesn't kernel lapic emulation support this?

-- Gleb.
Re: [PATCH 17/22] kvm: Move irqchip event processing out of inner loop
On 2011-01-31 17:50, Gleb Natapov wrote:

Doesn't kernel lapic emulation support this?

See my other mail: it supports it, but it apparently doesn't expect this to happen.

Jan
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: [PATCH 17/22] kvm: Move irqchip event processing out of inner loop
On Mon, Jan 31, 2011 at 05:52:13PM +0100, Jan Kiszka wrote:

See my other mail: it supports it, but it apparently doesn't expect this to happen.

I saw it, but I do not understand why we print this message. Maybe it was used for debugging in early stages of KVM development.

-- Gleb.
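For reference, the "shorthand" the thread keeps returning to is the destination shorthand field of the local APIC's Interrupt Command Register. Per the Intel SDM layout, delivery mode occupies ICR bits 8-10 (INIT = 101b) and destination shorthand bits 18-19 (self = 01b); the SDM lists INIT combined with the self shorthand as an invalid combination. A small decode sketch (field positions are from the SDM, but the helper itself is hypothetical):

```c
#include <stdint.h>

/* ICR field extraction per the Intel SDM layout. */
#define ICR_DELIVERY_MODE(icr)  (((icr) >> 8) & 0x7u)   /* 101b = INIT  */
#define ICR_SHORTHAND(icr)      (((icr) >> 18) & 0x3u)  /* 01b  = self  */

/* Return 1 for the invalid "INIT to self via shorthand" combination. */
static int icr_is_self_init(uint32_t icr)
{
    return ICR_DELIVERY_MODE(icr) == 0x5 && ICR_SHORTHAND(icr) == 0x1;
}
```

Gleb's point is that with shorthand 00b and an explicit destination equal to the sender's own APIC ID, the same INIT can still reach the sending CPU, which is the corner case the moved check no longer observes mid-loop.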
Re: KVM call agenda for Feb 1
On 2011-01-31 11:02, Juan Quintela wrote:

Please send in any agenda items you are interested in covering.

o KVM upstream merge: status, plans, coordination

Jan
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: [RFC -v7 PATCH 4/7] Add yield_to(task, preempt) functionality.
On 01/31/2011 06:49 AM, Peter Zijlstra wrote: On Wed, 2011-01-26 at 17:21 -0500, Rik van Riel wrote:

+	if (yielded)
+		yield();
+
+	return yielded;
+}
+EXPORT_SYMBOL_GPL(yield_to);

yield() will again acquire rq->lock.. why not simply have ->yield_to_task() do everything required and make that an unconditional schedule()?

I wanted to reduce code duplication, since this code path should not be taken all that frequently. But hey, you're the maintainer, so I'll implement it your way :) The patches should be on the way later today :)

-- All rights reversed
Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts
On Mon, Jan 24, 2011 at 10:37:56PM -0700, Alex Williamson wrote: On Mon, 2011-01-24 at 08:44 -0700, Alex Williamson wrote: I'll look at how we might be able to allocate slots on demand. Thanks,

Here's a first cut just to see if this looks agreeable. This allows the slot array to grow on demand. This works with current userspace, as well as userspace trivially modified to double KVMState.slots and hotplugging enough pci-assign devices to exceed the previous limit (w/ & w/o ept). Hopefully I got the rcu bits correct. Does this look like the right path? If so, I'll work on removing the fixed limit from userspace next. Thanks,

Alex

kvm: Allow memory slot array to grow on demand

Remove the fixed KVM_MEMORY_SLOTS limit, allowing the slot array to grow on demand. Private slots are now allocated at the front instead of the end. Only x86 seems to use private slots, so this is now zero for all other archs. The memslots pointer is already updated using rcu, so changing the size of the array when it's replaced is straightforward. x86 also keeps a bitmap of slots used by a kvm_mmu_page, which requires a shadow tlb flush whenever we increase the number of slots. This forces the pages to be rebuilt with the new bitmap size.
Signed-off-by: Alex Williamson alex.william...@redhat.com
---
 arch/ia64/include/asm/kvm_host.h    |    4 --
 arch/ia64/kvm/kvm-ia64.c            |    2 +
 arch/powerpc/include/asm/kvm_host.h |    3 --
 arch/s390/include/asm/kvm_host.h    |    3 --
 arch/x86/include/asm/kvm_host.h     |    3 +-
 arch/x86/include/asm/vmx.h          |    6 ++-
 arch/x86/kvm/mmu.c                  |    7 +++-
 arch/x86/kvm/x86.c                  |    6 ++-
 include/linux/kvm_host.h            |    7 +++-
 virt/kvm/kvm_main.c                 |   65 ---
 10 files changed, 63 insertions(+), 43 deletions(-)

diff --git a/arch/ia64/include/asm/kvm_host.h b/arch/ia64/include/asm/kvm_host.h
index 2689ee5..11d0ab2 100644
--- a/arch/ia64/include/asm/kvm_host.h
+++ b/arch/ia64/include/asm/kvm_host.h
@@ -23,10 +23,6 @@
 #ifndef __ASM_KVM_HOST_H
 #define __ASM_KVM_HOST_H
-#define KVM_MEMORY_SLOTS 32
-/* memory slots that does not exposed to userspace */
-#define KVM_PRIVATE_MEM_SLOTS 4
-
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
 /* define exit reasons from vmm to kvm*/

diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
index 70d224d..f1adda2 100644
--- a/arch/ia64/kvm/kvm-ia64.c
+++ b/arch/ia64/kvm/kvm-ia64.c
@@ -1814,7 +1814,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 	mutex_lock(&kvm->slots_lock);
 	r = -EINVAL;
-	if (log->slot >= KVM_MEMORY_SLOTS)
+	if (log->slot >= kvm->memslots->nmemslots)
 		goto out;
 	memslot = &kvm->memslots->memslots[log->slot];

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index bba3b9b..dc80057 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -29,9 +29,6 @@
 #include <asm/kvm_asm.h>
 #define KVM_MAX_VCPUS 1
-#define KVM_MEMORY_SLOTS 32
-/* memory slots that does not exposed to userspace */
-#define KVM_PRIVATE_MEM_SLOTS 4
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 1

diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index cef7dbf..92a964c 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -20,9 +20,6 @@
 #include <asm/cpu.h>
 #define KVM_MAX_VCPUS 64
-#define KVM_MEMORY_SLOTS 32
-/* memory slots that does not exposed to userspace */
-#define KVM_PRIVATE_MEM_SLOTS 4
 struct sca_entry {
 	atomic_t scn;

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ffd7f8d..df1382c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -27,7 +27,6 @@
 #include <asm/msr-index.h>
 #define KVM_MAX_VCPUS 64
-#define KVM_MEMORY_SLOTS 32
 /* memory slots that does not exposed to userspace */
 #define KVM_PRIVATE_MEM_SLOTS 4
@@ -207,7 +206,7 @@ struct kvm_mmu_page {
 	 * One bit set per slot which has memory
 	 * in this shadow page.
 	 */
-	DECLARE_BITMAP(slot_bitmap, KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS);
+	unsigned long *slot_bitmap;
 	bool multimapped;	/* More than one parent_pte? */
 	bool unsync;
 	int root_count;		/* Currently serving as active root */

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 84471b8..7fd8c89 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -370,9 +370,9 @@ enum vmcs_field {
 #define AR_RESERVD_MASK 0xfffe0f00
-#define TSS_PRIVATE_MEMSLOT (KVM_MEMORY_SLOTS + 0)
-#define APIC_ACCESS_PAGE_PRIVATE_MEMSLOT (KVM_MEMORY_SLOTS + 1)
-#define IDENTITY_PAGETABLE_PRIVATE_MEMSLOT (KVM_MEMORY_SLOTS + 2)
+#define
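The grow-on-demand scheme the patch describes can be sketched in userspace without the RCU machinery. This is a simplified stand-in (ints instead of struct kvm_memory_slot, and the RCU publish/free steps reduced to a comment), not the kernel implementation:

```c
#include <stdlib.h>
#include <string.h>

/* Slot array whose size travels with the pointer, as in the patch. */
struct memslots {
    size_t nmemslots;
    int slots[];            /* stands in for struct kvm_memory_slot[] */
};

/* Allocate a larger copy and return it.  In the kernel the new array
 * would be published with rcu_assign_pointer() and the old one freed
 * only after synchronize_srcu(); here the caller owns both copies. */
static struct memslots *grow_memslots(struct memslots *old, size_t want)
{
    size_t n = old ? old->nmemslots : 0;
    struct memslots *ns;

    if (want <= n)
        return old;         /* already large enough, keep current array */
    ns = malloc(sizeof(*ns) + want * sizeof(ns->slots[0]));
    if (!ns)
        return NULL;
    ns->nmemslots = want;
    memset(ns->slots, 0, want * sizeof(ns->slots[0]));
    if (old)
        memcpy(ns->slots, old->slots, n * sizeof(ns->slots[0]));
    return ns;
}
```

Readers traversing the old array keep a consistent (pointer, size) pair the whole time, which is why bounds checks like `log->slot >= kvm->memslots->nmemslots` can replace the fixed KVM_MEMORY_SLOTS constant.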
Re: [PATCH] V2 Handle guest access to BBL_CR_CTL3 MSR
On Fri, Jan 21, 2011 at 12:21:00AM -0500, john cooper wrote:

[Resubmit of prior version which contained a wayward patch hunk. Thanks Marcelo]

A correction to Intel cpu model CPUID data (patch queued) caused winxp to BSOD when booted with a Penryn model. This was traced to the CPUID model field correction from 6 -> 23 (as is proper for a Penryn class of cpu). Only in this case does the problem surface.

The cause for this failure is winxp accessing the BBL_CR_CTL3 MSR which is unsupported by current kvm, appears to be a legacy MSR not fully characterized yet existing in current silicon, and is apparently carried forward in MSR space to accommodate vintage code as here. It is not yet conclusive whether this MSR implements any of its legacy functionality or is just an ornamental dud for compatibility. While I found no silicon version specific documentation link to this MSR, a general description exists in Intel's developer's reference which agrees with the functional behavior of other bootloader/kernel code I've examined accessing BBL_CR_CTL3. Regrettably winxp appears to be setting bit #19 called out as reserved in the above document.

So to minimally accommodate this MSR, kvm msr get will provide the equivalent mock data and kvm msr write will simply toss the guest passed data without interpretation. While this treatment of BBL_CR_CTL3 addresses the immediate problem, the approach may be modified pending clarification from Intel.
Signed-off-by: john cooper john.coo...@redhat.com
---
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 4d0dfa0..5bfafb6 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -38,6 +38,7 @@
 #define MSR_MTRRcap			0x00fe
 #define MSR_IA32_BBL_CR_CTL		0x0119
+#define MSR_IA32_BBL_CR_CTL3		0x011e
 #define MSR_IA32_SYSENTER_CS		0x0174
 #define MSR_IA32_SYSENTER_ESP		0x0175

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bcc0efc..04d6c55 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1592,6 +1592,12 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 		} else
 			return set_msr_hyperv(vcpu, msr, data);
 		break;
+	case MSR_IA32_BBL_CR_CTL3:
+		/* Drop writes to this legacy MSR -- see rdmsr
+		 * counterpart for further detail.
+		 */
+		pr_unimpl(vcpu, "ignored wrmsr: 0x%x data %llx\n", msr, data);
+		break;
 	default:
 		if (msr && (msr == vcpu->kvm->arch.xen_hvm_config.msr))
 			return xen_hvm_config(vcpu, data);
@@ -1846,6 +1852,19 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 		} else
 			return get_msr_hyperv(vcpu, msr, pdata);
 		break;
+	case MSR_IA32_BBL_CR_CTL3:
+		/* This legacy MSR exists but isn't fully documented in current
+		 * silicon. It is however accessed by winxp in very narrow
+		 * scenarios where it sets bit #19, itself documented as
+		 * a reserved bit. Best effort attempt to source coherent
+		 * read data here should the balance of the register be
+		 * interpreted by the guest:
+		 *
+		 * L2 cache control register 3: 64GB range, 256KB size,
+		 * enabled, latency 0x1, configured
+		 */
+		data = 0xbe702111;
+		break;

Why bits 26-29 and 31 enabled?
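Gleb's question about the upper bits can be checked mechanically: decoding 0xbe702111 shows bits 0, 4, 8, 13, 20-22, 25-29 and 31 set, and bit #19 (the reserved bit winxp writes) clear. A trivial bit-probe helper makes this reproducible (plain bit indexing only; no claim about which named fields of the undocumented register these positions correspond to):

```c
#include <stdint.h>

/* Probe a single bit of an MSR value; returns 0 or 1. */
static unsigned msr_bit(uint64_t val, unsigned bit)
{
    return (unsigned)((val >> bit) & 1);
}
```

Whether bits 26-29 and 31 map to meaningful L2 fields or are just part of the "mock" value is exactly what the reply is questioning; the helper only establishes which positions are set.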
[PATCHv2] vhost: force vhost off for non-MSI guests
When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd, for now disable vhost-net in these configurations. Added a vhostforce flag to force vhost-net back on.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
Untested, I'll send a pull request later after some testing assuming we've resolved the command line comments to everyone's satisfaction.

Changes since v1: added vhostforce to enable vhost for non-MSI guests

 hw/vhost.c      | 16 +++-
 hw/vhost.h      |  3 ++-
 hw/vhost_net.c  |  8 +---
 hw/vhost_net.h  |  2 +-
 hw/virtio-net.c |  6 --
 hw/virtio-pci.c |  6 +-
 hw/virtio.h     |  4 +++-
 net/tap.c       |  6 --
 qemu-options.hx |  4 +++-
 9 files changed, 38 insertions(+), 17 deletions(-)

diff --git a/hw/vhost.c b/hw/vhost.c
index 6082da2..72df75f 100644
--- a/hw/vhost.c
+++ b/hw/vhost.c
@@ -581,7 +581,7 @@ static void vhost_virtqueue_cleanup(struct vhost_dev *dev,
                        0, virtio_queue_get_desc_size(vdev, idx));
 }
-int vhost_dev_init(struct vhost_dev *hdev, int devfd)
+int vhost_dev_init(struct vhost_dev *hdev, int devfd, bool force)
 {
     uint64_t features;
     int r;
@@ -613,6 +613,7 @@ int vhost_dev_init(struct vhost_dev *hdev, int devfd)
     hdev->log_enabled = false;
     hdev->started = false;
     cpu_register_phys_memory_client(&hdev->client);
+    hdev->force = force;
     return 0;
 fail:
     r = -errno;
@@ -636,9 +637,12 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
         goto fail;
     }
-    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true);
+    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true,
+                                           hdev->force);
     if (r < 0) {
-        fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+        if (r != -EVIRTIO_DISABLED) {
+            fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+        }
         goto fail_notifiers;
     }
@@ -686,7 +690,8 @@ fail_vq:
     }
 fail_mem:
 fail_features:
-    vdev->binding->set_guest_notifiers(vdev->binding_opaque, false);
+    vdev->binding->set_guest_notifiers(vdev->binding_opaque, false,
+                                       hdev->force);
 fail_notifiers:
 fail:
     return r;
@@ -704,7 +709,8 @@ void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
     }
     vhost_client_sync_dirty_bitmap(&hdev->client, 0,
                                    (target_phys_addr_t)~0x0ull);
-    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, false);
+    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, false,
+                                           hdev->force);
     if (r < 0) {
         fprintf(stderr, "vhost guest notifier cleanup failed: %d\n", r);
         fflush(stderr);

diff --git a/hw/vhost.h b/hw/vhost.h
index 86dd834..d5d4d86 100644
--- a/hw/vhost.h
+++ b/hw/vhost.h
@@ -38,9 +38,10 @@ struct vhost_dev {
     bool log_enabled;
     vhost_log_chunk_t *log;
     unsigned long long log_size;
+    bool force;
 };
-int vhost_dev_init(struct vhost_dev *hdev, int devfd);
+int vhost_dev_init(struct vhost_dev *hdev, int devfd, bool force);
 void vhost_dev_cleanup(struct vhost_dev *hdev);
 int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev);
 void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev);

diff --git a/hw/vhost_net.c b/hw/vhost_net.c
index c068be1..4f57271 100644
--- a/hw/vhost_net.c
+++ b/hw/vhost_net.c
@@ -81,7 +81,8 @@ static int vhost_net_get_fd(VLANClientState *backend)
     }
 }
-struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
+struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd,
+                                 bool force)
 {
     int r;
     struct vhost_net *net = qemu_malloc(sizeof *net);
@@ -98,7 +99,7 @@ struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
         (1 << VHOST_NET_F_VIRTIO_NET_HDR);
     net->backend = r;
-    r = vhost_dev_init(&net->dev, devfd);
+    r = vhost_dev_init(&net->dev, devfd, force);
     if (r < 0) {
         goto fail;
     }
@@ -188,7 +189,8 @@ void vhost_net_cleanup(struct vhost_net *net)
     qemu_free(net);
 }
 #else
-struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
+struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd,
+                                 bool force)
 {
     return NULL;
 }

diff --git a/hw/vhost_net.h b/hw/vhost_net.h
index 6c18ff7..8435094 100644
--- a/hw/vhost_net.h
+++ b/hw/vhost_net.h
@@ -6,7 +6,7 @@ struct vhost_net;
 typedef struct vhost_net VHostNetState;
-VHostNetState *vhost_net_init(VLANClientState *backend, int devfd);
+VHostNetState *vhost_net_init(VLANClientState *backend, int devfd, bool force);
Re: [PATCHv2] vhost: force vhost off for non-MSI guests
On 01/31/2011 03:19 PM, Michael S. Tsirkin wrote:

When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread.

We'll need to fix this by adding level irq support in kvm irqfd, for now disable vhost-net in these configurations.

Added a vhostforce flag to force vhost-net back on.

Signed-off-by: Michael S. Tsirkin m...@redhat.com

I can't say I like the set_guest_notifiers interface getting overloaded like this, but it looks pretty reasonable.

Regards,

Anthony Liguori

---
Untested, I'll send a pull request later after some testing assuming we've resolved the command line comments to everyone's satisfaction.

Changes since v1:
    added vhostforce to enable vhost for non-MSI guests

 hw/vhost.c      |   16 +++-
 hw/vhost.h      |    3 ++-
 hw/vhost_net.c  |    8 +---
 hw/vhost_net.h  |    2 +-
 hw/virtio-net.c |    6 --
 hw/virtio-pci.c |    6 +-
 hw/virtio.h     |    4 +++-
 net/tap.c       |    6 --
 qemu-options.hx |    4 +++-
 9 files changed, 38 insertions(+), 17 deletions(-)

diff --git a/hw/vhost.c b/hw/vhost.c
index 6082da2..72df75f 100644
--- a/hw/vhost.c
+++ b/hw/vhost.c
@@ -581,7 +581,7 @@ static void vhost_virtqueue_cleanup(struct vhost_dev *dev,
                               0, virtio_queue_get_desc_size(vdev, idx));
 }

-int vhost_dev_init(struct vhost_dev *hdev, int devfd)
+int vhost_dev_init(struct vhost_dev *hdev, int devfd, bool force)
 {
     uint64_t features;
     int r;
@@ -613,6 +613,7 @@ int vhost_dev_init(struct vhost_dev *hdev, int devfd)
     hdev->log_enabled = false;
     hdev->started = false;
     cpu_register_phys_memory_client(&hdev->client);
+    hdev->force = force;
     return 0;
 fail:
     r = -errno;
@@ -636,9 +637,12 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
         goto fail;
     }

-    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true);
+    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true,
+                                           hdev->force);
     if (r < 0) {
-        fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+        if (r != -EVIRTIO_DISABLED) {
+            fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+        }
         goto fail_notifiers;
     }

@@ -686,7 +690,8 @@ fail_vq:
     }
 fail_mem:
 fail_features:
-    vdev->binding->set_guest_notifiers(vdev->binding_opaque, false);
+    vdev->binding->set_guest_notifiers(vdev->binding_opaque, false,
+                                       hdev->force);
 fail_notifiers:
 fail:
     return r;
@@ -704,7 +709,8 @@ void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
     }
     vhost_client_sync_dirty_bitmap(&hdev->client, 0,
                                    (target_phys_addr_t)~0x0ull);
-    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, false);
+    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, false,
+                                           hdev->force);
     if (r < 0) {
         fprintf(stderr, "vhost guest notifier cleanup failed: %d\n", r);
         fflush(stderr);
diff --git a/hw/vhost.h b/hw/vhost.h
index 86dd834..d5d4d86 100644
--- a/hw/vhost.h
+++ b/hw/vhost.h
@@ -38,9 +38,10 @@ struct vhost_dev {
     bool log_enabled;
     vhost_log_chunk_t *log;
     unsigned long long log_size;
+    bool force;
 };

-int vhost_dev_init(struct vhost_dev *hdev, int devfd);
+int vhost_dev_init(struct vhost_dev *hdev, int devfd, bool force);
 void vhost_dev_cleanup(struct vhost_dev *hdev);
 int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev);
 void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev);
diff --git a/hw/vhost_net.c b/hw/vhost_net.c
index c068be1..4f57271 100644
--- a/hw/vhost_net.c
+++ b/hw/vhost_net.c
@@ -81,7 +81,8 @@ static int vhost_net_get_fd(VLANClientState *backend)
     }
 }

-struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
+struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd,
+                                 bool force)
 {
     int r;
     struct vhost_net *net = qemu_malloc(sizeof *net);
@@ -98,7 +99,7 @@ struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
         (1 << VHOST_NET_F_VIRTIO_NET_HDR);
     net->backend = r;

-    r = vhost_dev_init(&net->dev, devfd);
+    r = vhost_dev_init(&net->dev, devfd, force);
     if (r < 0) {
         goto fail;
     }
@@ -188,7 +189,8 @@ void vhost_net_cleanup(struct vhost_net *net)
     qemu_free(net);
 }
 #else
-struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
+struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd,
+                                 bool force)
 {
     return NULL;
 }
diff --git a/hw/vhost_net.h b/hw/vhost_net.h
index 6c18ff7..8435094 100644
---
Re: KVM call agenda for Feb 1
On 01/31/2011 12:10 PM, Jan Kiszka wrote:
On 2011-01-31 11:02, Juan Quintela wrote:
Please send in any agenda items you are interested in covering.

o KVM upstream merge: status, plans, coordination

o QMP support status for 0.14. Luiz and I already chatted about it today, but it would be good to discuss in the call just to see if anyone has opinions. Basically, declare it fully supported with a few minor caveats (like human-monitor-passthrough being no more supported than the actual monitor, and recommendations about how to deal with devices in the device tree).

Reminder: tomorrow is the 0.14 stable fork.

Regards,

Anthony Liguori

Jan

--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv2] vhost: force vhost off for non-MSI guests
On Mon, 2011-01-31 at 23:19 +0200, Michael S. Tsirkin wrote:

When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread.

We'll need to fix this by adding level irq support in kvm irqfd, for now disable vhost-net in these configurations.

Added a vhostforce flag to force vhost-net back on.

Signed-off-by: Michael S. Tsirkin m...@redhat.com

---
Untested, I'll send a pull request later after some testing assuming we've resolved the command line comments to everyone's satisfaction.

Changes since v1:
    added vhostforce to enable vhost for non-MSI guests

I'm still not thrilled with the errno hack. If this is really a generic problem, we should better architect the set_guest_notifiers() callback; the force option feels like a band-aid. I'm still curious whether we can't be more dynamic about how we set up the set_guest_notifiers() function pointer so it only gets set if (force || msix_enabled()), and we can limit some of the impact here to vhost.
Thanks,

Alex
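To make the trade-off in this thread concrete, here is a standalone toy sketch (hypothetical toy_* names, not QEMU code) of the two shapes being debated: the patch's approach of always installing the callback and passing a force flag down, versus Alex's suggestion of wiring the callback up only when it could succeed, which Michael later argues gets messy because MSI-X state changes at runtime.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the discussion above (hypothetical names, not QEMU code). */
struct toy_binding {
    bool msix_enabled;
    int (*set_guest_notifiers)(struct toy_binding *b, bool assign, bool force);
};

enum { TOY_EVIRTIO_DISABLED = 1 };

/* Variant 1 (what the patch does): always install the callback and pass
 * "force" down; the callee reports a distinguishable soft failure. */
static int toy_set_guest_notifiers(struct toy_binding *b, bool assign, bool force)
{
    /* Without MSI-X, guest notifiers cost more than they save unless forced. */
    if (assign && !b->msix_enabled && !force)
        return -TOY_EVIRTIO_DISABLED;
    return 0; /* pretend the notifier setup succeeded */
}

/* Variant 2 (Alex's suggestion): only wire the callback up when it can
 * succeed, i.e. if (force || msix_enabled).  The catch Michael points out
 * is that msix_enabled changes at runtime, so this pointer would need to
 * be rewired on every guest reconfiguration. */
static void toy_wire(struct toy_binding *b, bool force)
{
    b->set_guest_notifiers =
        (force || b->msix_enabled) ? toy_set_guest_notifiers : NULL;
}
```

The toy only captures the control flow, not any of the real ioctl or irqfd work, but it shows why the flag has to travel from the net backend (which knows force) to virtio-pci (which knows the MSI-X state).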
[PATCH -v8 3/7] sched: use a buddy to implement yield_task_fair
Use the buddy mechanism to implement yield_task_fair. This allows us to skip onto the next highest priority se at every level in the CFS tree, unless doing so would introduce gross unfairness in CPU time distribution.

We order the buddy selection in pick_next_entity to check yield first, then last, then next. We need next to be able to override yield, because it is possible for the next and yield task to be different processes in the same sub-tree of the CFS tree. When they are, we need to go into that sub-tree regardless of the yield hint, and pick the correct entity once we get to the right level.

Signed-off-by: Rik van Riel r...@redhat.com

diff --git a/kernel/sched.c b/kernel/sched.c
index dc91a4d..7ff53e2 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -327,7 +327,7 @@ struct cfs_rq {
 	 * 'curr' points to currently running entity on this cfs_rq.
 	 * It is set to NULL otherwise (i.e when none are currently running).
 	 */
-	struct sched_entity *curr, *next, *last;
+	struct sched_entity *curr, *next, *last, *skip;

 	unsigned int nr_spread_over;

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index ad946fd..a56d410 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -384,6 +384,22 @@ static struct sched_entity *__pick_next_entity(struct cfs_rq *cfs_rq)
 	return rb_entry(left, struct sched_entity, run_node);
 }

+static struct sched_entity *__pick_second_entity(struct cfs_rq *cfs_rq)
+{
+	struct rb_node *left = cfs_rq->rb_leftmost;
+	struct rb_node *second;
+
+	if (!left)
+		return NULL;
+
+	second = rb_next(left);
+
+	if (!second)
+		second = left;
+
+	return rb_entry(second, struct sched_entity, run_node);
+}
+
 static struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
 {
 	struct rb_node *last = rb_last(&cfs_rq->tasks_timeline);
@@ -806,6 +822,17 @@ static void __clear_buddies_next(struct sched_entity *se)
 	}
 }

+static void __clear_buddies_skip(struct sched_entity *se)
+{
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+		if (cfs_rq->skip == se)
+			cfs_rq->skip = NULL;
+		else
+			break;
+	}
+}
+
 static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	if (cfs_rq->last == se)
@@ -813,6 +840,9 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)

 	if (cfs_rq->next == se)
 		__clear_buddies_next(se);
+
+	if (cfs_rq->skip == se)
+		__clear_buddies_skip(se);
 }

 static void
@@ -926,13 +956,27 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 static int
 wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);

+/*
+ * Pick the next process, keeping these things in mind, in this order:
+ * 1) keep things fair between processes/task groups
+ * 2) pick the next process, since someone really wants that to run
+ * 3) pick the last process, for cache locality
+ * 4) do not run the skip process, if something else is available
+ */
 static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *se = __pick_next_entity(cfs_rq);
 	struct sched_entity *left = se;

-	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
-		se = cfs_rq->next;
+	/*
+	 * Avoid running the skip buddy, if running something else can
+	 * be done without getting too unfair.
+	 */
+	if (cfs_rq->skip == se) {
+		struct sched_entity *second = __pick_second_entity(cfs_rq);
+		if (wakeup_preempt_entity(second, left) < 1)
+			se = second;
+	}

 	/*
 	 * Prefer last buddy, try to return the CPU to a preempted task.
@@ -940,6 +984,12 @@ static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
 	if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
 		se = cfs_rq->last;

+	/*
+	 * Someone really wants this to run. If it's not unfair, run it.
+	 */
+	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
+		se = cfs_rq->next;
+
 	clear_buddies(cfs_rq, se);

 	return se;
@@ -1096,52 +1146,6 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	hrtick_update(rq);
 }

-/*
- * sched_yield() support is very simple - we dequeue and enqueue.
- *
- * If compat_yield is turned on then we requeue to the end of the tree.
- */
-static void yield_task_fair(struct rq *rq)
-{
-	struct task_struct *curr = rq->curr;
-	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
-	struct sched_entity *rightmost, *se = &curr->se;
-
-	/*
-	 * Are we the only task in the tree?
-	 */
-	if (unlikely(rq->nr_running == 1))
-		return;
-
[PATCH -v8 2/7] sched: limit the scope of clear_buddies
The clear_buddies function does not seem to play well with the concept of hierarchical runqueues. In the following tree, task groups are represented by 'G', tasks by 'T', next by 'n' and last by 'l'.

        (nl)
        /  \
    G(nl)   G
    /  \     \
  T(l) T(n)   T

This situation can arise when a task is woken up T(n), and the previously running task T(l) is marked last.

When clear_buddies is called from either T(l) or T(n), the next and last buddies of the group G(nl) will be cleared. This is not the desired result, since we would like to be able to find the other type of buddy in many cases. This is especially a worry when implementing yield_task_fair through the buddy system.

The fix is simple: only clear the buddy type that the task itself is indicated to be. As an added bonus, we stop walking up the tree when the buddy has already been cleared or pointed elsewhere.

Signed-off-by: Rik van Riel r...@redhat.com
---
 kernel/sched_fair.c |   30 ++++++++++++++++++++++++-------
 1 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index f4ee445..0321473 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -784,19 +784,35 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		__enqueue_entity(cfs_rq, se);
 }

-static void __clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static void __clear_buddies_last(struct sched_entity *se)
 {
-	if (!se || cfs_rq->last == se)
-		cfs_rq->last = NULL;
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+		if (cfs_rq->last == se)
+			cfs_rq->last = NULL;
+		else
+			break;
+	}
+}

-	if (!se || cfs_rq->next == se)
-		cfs_rq->next = NULL;
+static void __clear_buddies_next(struct sched_entity *se)
+{
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+		if (cfs_rq->next == se)
+			cfs_rq->next = NULL;
+		else
+			break;
+	}
 }

 static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	for_each_sched_entity(se)
-		__clear_buddies(cfs_rq_of(se), se);
+	if (cfs_rq->last == se)
+		__clear_buddies_last(se);
+
+	if (cfs_rq->next == se)
+		__clear_buddies_next(se);
 }

 static void
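The behaviour of the fix can be modelled outside the kernel. The sketch below (toy types and names, not kernel code) mirrors __clear_buddies_last: walk up the hierarchy clearing only the one buddy type, stop at the first level where this entity is not that buddy, and leave the other buddy type untouched. __clear_buddies_next would be symmetric.

```c
#include <stddef.h>

/* Toy model of the hierarchy fix above (toy types, not kernel code).
 * Each entity is queued on a runqueue; group entities link levels. */
struct toy_rq;

struct toy_entity {
    struct toy_rq *rq;          /* runqueue this entity is queued on */
    struct toy_entity *parent;  /* group entity one level up, NULL at top */
};

struct toy_rq {
    struct toy_entity *next;    /* "next" buddy on this runqueue */
    struct toy_entity *last;    /* "last" buddy on this runqueue */
};

/* Mirror of __clear_buddies_last: clear only the "last" pointers that
 * reference this path, stopping where a level points elsewhere. */
static void toy_clear_last(struct toy_entity *se)
{
    for (; se; se = se->parent) {
        if (se->rq->last == se)
            se->rq->last = NULL;
        else
            break;
    }
}
```

Building the tree from the commit message (T(l) and T(n) inside group G) and clearing "last" for T(l) removes the last marks up through G, while the "next" marks for T(n) survive, which is exactly what the old combined __clear_buddies destroyed.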
[PATCH -v8 4/7] sched: Add yield_to(task, preempt) functionality.
From: Mike Galbraith efa...@gmx.de

Currently only implemented for fair class tasks.

Add a yield_to_task() method to the fair scheduling class, allowing the caller of yield_to() to accelerate another thread in its thread group / task group.

Implemented via a scheduler hint, using cfs_rq->next to encourage the target being selected. We can rely on pick_next_entity to keep things fair, so no one can accelerate a thread that has already used its fair share of CPU time.

This also means callers should only call yield_to when they really mean it. Calling it too often can result in the scheduler just ignoring the hint.

Signed-off-by: Rik van Riel r...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Mike Galbraith efa...@gmx.de

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2c79e92..6c43fc4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1047,6 +1047,7 @@ struct sched_class {
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*yield_task) (struct rq *rq);
+	bool (*yield_to_task) (struct rq *rq, struct task_struct *p, bool preempt);

 	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);

@@ -1943,6 +1944,7 @@ static inline int rt_mutex_getprio(struct task_struct *p)
 # define rt_mutex_adjust_pi(p)		do { } while (0)
 #endif

+extern bool yield_to(struct task_struct *p, bool preempt);
 extern void set_user_nice(struct task_struct *p, long nice);
 extern int task_prio(const struct task_struct *p);
 extern int task_nice(const struct task_struct *p);
diff --git a/kernel/sched.c b/kernel/sched.c
index 7ff53e2..5e70156 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5270,6 +5270,56 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);

+/**
+ * yield_to - yield the current processor to another thread in
+ * your thread group, or accelerate that thread toward the
+ * processor it's on.
+ *
+ * It's the caller's job to ensure that the target task struct
+ * can't go away on us before we can do any checks.
+ *
+ * Returns true if we indeed boosted the target task.
+ */
+bool __sched yield_to(struct task_struct *p, bool preempt)
+{
+	struct task_struct *curr = current;
+	struct rq *rq, *p_rq;
+	unsigned long flags;
+	bool yielded = 0;
+
+	local_irq_save(flags);
+	rq = this_rq();
+
+again:
+	p_rq = task_rq(p);
+	double_rq_lock(rq, p_rq);
+	while (task_rq(p) != p_rq) {
+		double_rq_unlock(rq, p_rq);
+		goto again;
+	}
+
+	if (!curr->sched_class->yield_to_task)
+		goto out;
+
+	if (curr->sched_class != p->sched_class)
+		goto out;
+
+	if (task_running(p_rq, p) || p->state)
+		goto out;
+
+	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
+
+out:
+	double_rq_unlock(rq, p_rq);
+	local_irq_restore(flags);
+
+	if (yielded)
+		yield();
+
+	return yielded;
+}
+EXPORT_SYMBOL_GPL(yield_to);
+
 /*
  * This task is about to go to sleep on IO. Increment rq->nr_iowait so
  * that process accounting knows that this is a task in IO wait state.
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index a56d410..be729d7 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1800,6 +1800,23 @@ static void yield_task_fair(struct rq *rq)
 	set_skip_buddy(se);
 }

+static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preempt)
+{
+	struct sched_entity *se = &p->se;
+
+	if (!se->on_rq)
+		return false;
+
+	/* Tell the scheduler that we'd really like pse to run next. */
+	set_next_buddy(se);
+
+	/* Make p's CPU reschedule; pick_next_entity takes care of fairness. */
+	if (preempt)
+		resched_task(rq->curr);
+
+	return true;
+}
+
 #ifdef CONFIG_SMP
 /**
  * Fair scheduling class load-balancing methods:
@@ -3993,6 +4010,7 @@ static const struct sched_class fair_sched_class = {
 	.enqueue_task		= enqueue_task_fair,
 	.dequeue_task		= dequeue_task_fair,
 	.yield_task		= yield_task_fair,
+	.yield_to_task		= yield_to_task_fair,

 	.check_preempt_curr	= check_preempt_wakeup,
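The again:/recheck loop in yield_to() is a common kernel pattern: a value read without the protecting lock must be re-validated once the lock is held. A standalone toy (hypothetical toy_* names, simulated migration instead of real runqueue locks) illustrates why the recheck is needed and why the loop terminates once migrations stop.

```c
/* Toy model of the again:/recheck dance in yield_to() above (toy names,
 * no real locks): the target's runqueue is sampled without holding its
 * lock, so after "locking" we must recheck and retry if the task
 * migrated in the meantime. */
struct toy_task { int rq; };

static int toy_migrations_pending; /* simulated concurrent migrations */

/* Reading the task's rq may race with migration; model that by moving
 * the task on each read while migrations are pending. */
static int toy_task_rq(struct toy_task *p)
{
    if (toy_migrations_pending > 0) {
        toy_migrations_pending--;
        p->rq++;
    }
    return p->rq;
}

/* Returns how many lock attempts were needed before the snapshot held. */
static int toy_lock_task_rq(struct toy_task *p)
{
    int tries = 0;
    int locked;
again:
    tries++;
    locked = p->rq;              /* p_rq = task_rq(p), before locking */
    /* double_rq_lock(rq, p_rq) would go here; the task may migrate
     * before the lock is actually taken: */
    if (toy_task_rq(p) != locked)
        goto again;              /* locked the wrong rq: drop it, retry */
    return tries;
}
```

Each retry re-reads the runqueue under the assumption that migrations are finite in practice, which is the same informal termination argument that applies to the kernel loop.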
[PATCH -v8 0/7] directed yield for Pause Loop Exiting
When running SMP virtual machines, it is possible for one VCPU to be spinning on a spinlock, while the VCPU that holds the spinlock is not currently running, because the host scheduler preempted it to run something else.

Both Intel and AMD CPUs have a feature that detects when a virtual CPU is spinning on a lock and will trap to the host.

The current KVM code sleeps for a bit whenever that happens, which results in e.g. a 64 VCPU Windows guest taking forever and a bit to boot up. This is because the VCPU holding the lock is actually running and not sleeping, so the pause is counter-productive.

In other workloads a pause can also be counter-productive, with spinlock detection resulting in one guest giving up its CPU time to the others. Instead of spinning, it ends up simply not running much at all.

This patch series aims to fix that, by having a VCPU that spins give the remainder of its timeslice to another VCPU in the same guest before yielding the CPU - one that is runnable but got preempted, hopefully the lock holder.

v8:
- some more changes and cleanups suggested by Peter

v7:
- move the vcpu to pid mapping to inside the vcpu->mutex
- rename ->yield to ->skip
- merge patch 5 into patch 4

v6:
- implement yield_task_fair in a way that works with task groups, this allows me to actually get a performance improvement!
- fix another race Avi pointed out, the code should be good now

v5:
- fix the race condition Avi pointed out, by tracking vcpu->pid
- also allows us to yield to vcpu tasks that got preempted while in qemu userspace

v4:
- change to newer version of Mike Galbraith's yield_to implementation
- chainsaw out some code from Mike that looked like a great idea, but turned out to give weird interactions in practice

v3:
- more cleanups
- change to Mike Galbraith's yield_to implementation
- yield to spinning VCPUs, this seems to work better in some situations and has little downside potential

v2:
- make lots of cleanups and improvements suggested
- do not implement timeslice scheduling or fairness stuff yet, since it is not entirely clear how to do that right (suggestions welcome)

Benchmark results:

Two 4-CPU KVM guests are pinned to the same 4 physical CPUs. One guest runs the AMQP performance test, the other guest runs 0, 2 or 4 infinite loops, for CPU overcommit factors of 1, 1.5 and 2. The AMQP perftest is run 30 times, with message payloads of 8 and 16 bytes.

size 8    no overcommit  1.5x overcommit  2x overcommit
no PLE    223801         135137           104951
PLE       224135         141105           118744

size 16   no overcommit  1.5x overcommit  2x overcommit
no PLE    222424         126175           105299
PLE       222534         138082           132945

Note: this is with the KVM guests NOT running inside cgroups. There seems to be a CPU load balancing issue with cgroup fair group scheduling, which often results in one guest getting only 80% CPU time and the other guest 320%. That will have to be fixed to get meaningful results with cgroups.

CPU time division between the AMQP guest and the infinite loop guest was not exactly fair, but the guests got close to the same amount of CPU time in each test run. There is a substantial amount of randomness in CPU time division between guests, but the performance improvement is consistent between multiple runs.

--
All rights reversed.
[PATCH -v8 7/7] kvm: use yield_to instead of sleep in kvm_vcpu_on_spin
Instead of sleeping in kvm_vcpu_on_spin, which can cause gigantic slowdowns of certain workloads, we instead use yield_to to get another VCPU in the same KVM guest to run sooner.

This seems to give a 10-15% speedup in certain workloads.

Signed-off-by: Rik van Riel r...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9d56ed5..fab2250 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -187,6 +187,7 @@ struct kvm {
 #endif
 	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
 	atomic_t online_vcpus;
+	int last_boosted_vcpu;
 	struct list_head vm_list;
 	struct mutex lock;
 	struct kvm_io_bus *buses[KVM_NR_BUSES];
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 86c4905..8b761ba 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1292,18 +1292,55 @@ void kvm_resched(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_resched);

-void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu)
+void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 {
-	ktime_t expires;
-	DEFINE_WAIT(wait);
-
-	prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
-
-	/* Sleep for 100 us, and hope lock-holder got scheduled */
-	expires = ktime_add_ns(ktime_get(), 100000UL);
-	schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
+	struct kvm *kvm = me->kvm;
+	struct kvm_vcpu *vcpu;
+	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
+	int yielded = 0;
+	int pass;
+	int i;

-	finish_wait(&vcpu->wq, &wait);
+	/*
+	 * We boost the priority of a VCPU that is runnable but not
+	 * currently running, because it got preempted by something
+	 * else and called schedule in __vcpu_run.  Hopefully that
+	 * VCPU is holding the lock that we need and will release it.
+	 * We approximate round-robin by starting at the last boosted VCPU.
+	 */
+	for (pass = 0; pass < 2 && !yielded; pass++) {
+		kvm_for_each_vcpu(i, vcpu, kvm) {
+			struct task_struct *task = NULL;
+			struct pid *pid;
+			if (!pass && i < last_boosted_vcpu) {
+				i = last_boosted_vcpu;
+				continue;
+			} else if (pass && i > last_boosted_vcpu)
+				break;
+			if (vcpu == me)
+				continue;
+			if (waitqueue_active(&vcpu->wq))
+				continue;
+			rcu_read_lock();
+			pid = rcu_dereference(vcpu->pid);
+			if (pid)
+				task = get_pid_task(vcpu->pid, PIDTYPE_PID);
+			rcu_read_unlock();
+			if (!task)
+				continue;
+			if (task->flags & PF_VCPU) {
+				put_task_struct(task);
+				continue;
+			}
+			if (yield_to(task, 1)) {
+				put_task_struct(task);
+				kvm->last_boosted_vcpu = i;
+				yielded = 1;
+				break;
+			}
+			put_task_struct(task);
+		}
+	}
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
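The two-pass round-robin scan above can be exercised as a userspace toy. The sketch below (toy types and names, no tasks or RCU) keeps only the candidate-selection logic: start after the last boosted VCPU, wrap around on the second pass, and skip the spinning VCPU itself, halted VCPUs, and VCPUs already running in guest mode.

```c
#include <stdbool.h>

/* Userspace toy of the two-pass round-robin scan in kvm_vcpu_on_spin
 * above (toy types, no tasks/RCU).  Returns the index of the vcpu that
 * would be boosted, or -1 if no candidate is runnable. */
#define TOY_NVCPUS 4

struct toy_vcpu {
    bool running;   /* PF_VCPU equivalent: currently in guest mode */
    bool sleeping;  /* waitqueue_active equivalent: halted, skip it */
};

static int toy_pick_boost(struct toy_vcpu vcpu[TOY_NVCPUS],
                          int me, int last_boosted)
{
    int pass, i;

    for (pass = 0; pass < 2; pass++) {
        for (i = 0; i < TOY_NVCPUS; i++) {
            if (!pass && i < last_boosted) {
                i = last_boosted;  /* first pass: start after last boosted */
                continue;
            } else if (pass && i > last_boosted) {
                break;             /* second pass: wrapped around, stop */
            }
            if (i == me || vcpu[i].sleeping || vcpu[i].running)
                continue;
            return i;              /* runnable but preempted: boost it */
        }
    }
    return -1;
}
```

The real function additionally resolves each candidate's task through its tracked pid (patch 6/7) and hands the timeslice over with yield_to (patch 4/7); the toy shows only why the scan approximates round-robin fairness across boosts.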
[PATCH -v8 6/7] kvm: keep track of which task is running a KVM vcpu
Keep track of which task is running a KVM vcpu. This helps us figure out later what task to wake up if we want to boost a vcpu that got preempted.

Unfortunately there are no guarantees that the same task always keeps the same vcpu, so we can only track the task across a single run of the vcpu.

Signed-off-by: Rik van Riel r...@redhat.com

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a055742..9d56ed5 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -81,6 +81,7 @@ struct kvm_vcpu {
 #endif
 	int vcpu_id;
 	struct mutex mutex;
+	struct pid *pid;
 	int cpu;
 	atomic_t guest_mode;
 	struct kvm_run *run;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5225052..0fa9a48 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -117,6 +117,14 @@ void vcpu_load(struct kvm_vcpu *vcpu)
 	int cpu;

 	mutex_lock(&vcpu->mutex);
+	if (unlikely(vcpu->pid != current->pids[PIDTYPE_PID].pid)) {
+		/* The thread running this VCPU changed. */
+		struct pid *oldpid = vcpu->pid;
+		struct pid *newpid = get_task_pid(current, PIDTYPE_PID);
+		rcu_assign_pointer(vcpu->pid, newpid);
+		synchronize_rcu();
+		put_pid(oldpid);
+	}
 	cpu = get_cpu();
 	preempt_notifier_register(&vcpu->preempt_notifier);
 	kvm_arch_vcpu_load(vcpu, cpu);
@@ -185,6 +193,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	vcpu->cpu = -1;
 	vcpu->kvm = kvm;
 	vcpu->vcpu_id = id;
+	vcpu->pid = NULL;
 	init_waitqueue_head(&vcpu->wq);

 	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
@@ -208,6 +217,7 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_init);

 void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
 {
+	put_pid(vcpu->pid);
 	kvm_arch_vcpu_uninit(vcpu);
 	free_page((unsigned long)vcpu->run);
 }
[PATCH -v8 1/7] sched: check the right -nr_running in yield_task_fair
With CONFIG_FAIR_GROUP_SCHED, each task_group has its own cfs_rq. Yielding to a task from another cfs_rq may be worthwhile, since a process calling yield typically cannot use the CPU right now.

Therefore, we want to check the per-cpu nr_running, not the cgroup local one.

Signed-off-by: Rik van Riel r...@redhat.com
---
 kernel/sched_fair.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index c62ebae..7b338ac 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1304,7 +1304,7 @@ static void yield_task_fair(struct rq *rq)
 	/*
 	 * Are we the only task in the tree?
 	 */
-	if (unlikely(cfs_rq->nr_running == 1))
+	if (unlikely(rq->nr_running == 1))
 		return;

 	clear_buddies(cfs_rq, se);
[PATCH -v8 5/7] export pid symbols needed for kvm_vcpu_on_spin
Export the symbols required for a race-free kvm_vcpu_on_spin.

Signed-off-by: Rik van Riel r...@redhat.com

diff --git a/kernel/fork.c b/kernel/fork.c
index 3b159c5..adc8f47 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -191,6 +191,7 @@ void __put_task_struct(struct task_struct *tsk)
 	if (!profile_handoff_task(tsk))
 		free_task(tsk);
 }
+EXPORT_SYMBOL_GPL(__put_task_struct);

 /*
  * macro override instead of weak attribute alias, to workaround
diff --git a/kernel/pid.c b/kernel/pid.c
index 39b65b6..02f2212 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -435,6 +435,7 @@ struct pid *get_task_pid(struct task_struct *task, enum pid_type type)
 	rcu_read_unlock();
 	return pid;
 }
+EXPORT_SYMBOL_GPL(get_task_pid);

 struct task_struct *get_pid_task(struct pid *pid, enum pid_type type)
 {
@@ -446,6 +447,7 @@ struct task_struct *get_pid_task(struct pid *pid, enum pid_type type)
 	rcu_read_unlock();
 	return result;
 }
+EXPORT_SYMBOL_GPL(get_pid_task);

 struct pid *find_get_pid(pid_t nr)
 {
Re: [PATCHv2] vhost: force vhost off for non-MSI guests
On Mon, Jan 31, 2011 at 02:47:34PM -0700, Alex Williamson wrote:

On Mon, 2011-01-31 at 23:19 +0200, Michael S. Tsirkin wrote:

When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread.

We'll need to fix this by adding level irq support in kvm irqfd, for now disable vhost-net in these configurations.

Added a vhostforce flag to force vhost-net back on.

Signed-off-by: Michael S. Tsirkin m...@redhat.com

---
Untested, I'll send a pull request later after some testing assuming we've resolved the command line comments to everyone's satisfaction.

Changes since v1:
    added vhostforce to enable vhost for non-MSI guests

I'm still not thrilled with the errno hack. If this is really a generic problem, we should better architect the set_guest_notifiers() callback; the force option feels like a band-aid. I'm still curious whether we can't be more dynamic about how we set up the set_guest_notifiers() function pointer so it only gets set if (force || msix_enabled()), and we can limit some of the impact here to vhost.

Thanks,

Alex

I'm not entirely happy with the API either, but I'm sure playing with the function pointer is not the solution we are looking for: msix is enabled/disabled dynamically, while force is static, and it is something the net backend knows about; virtio-pci doesn't. So I suspect modifying the callback on the fly will become messy quite quickly.

Well, this API will be easy to tweak after the fact; there's a single user after all.
 hw/vhost.c      | 16 +++-
 hw/vhost.h      |  3 ++-
 hw/vhost_net.c  |  8 +---
 hw/vhost_net.h  |  2 +-
 hw/virtio-net.c |  6 --
 hw/virtio-pci.c |  6 +-
 hw/virtio.h     |  4 +++-
 net/tap.c       |  6 --
 qemu-options.hx |  4 +++-
 9 files changed, 38 insertions(+), 17 deletions(-)

diff --git a/hw/vhost.c b/hw/vhost.c
index 6082da2..72df75f 100644
--- a/hw/vhost.c
+++ b/hw/vhost.c
@@ -581,7 +581,7 @@ static void vhost_virtqueue_cleanup(struct vhost_dev *dev,
                               0, virtio_queue_get_desc_size(vdev, idx));
 }
 
-int vhost_dev_init(struct vhost_dev *hdev, int devfd)
+int vhost_dev_init(struct vhost_dev *hdev, int devfd, bool force)
 {
     uint64_t features;
     int r;
@@ -613,6 +613,7 @@ int vhost_dev_init(struct vhost_dev *hdev, int devfd)
     hdev->log_enabled = false;
     hdev->started = false;
     cpu_register_phys_memory_client(&hdev->client);
+    hdev->force = force;
     return 0;
 fail:
     r = -errno;
@@ -636,9 +637,12 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
         goto fail;
     }
 
-    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true);
+    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true,
+                                           hdev->force);
     if (r < 0) {
-        fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+        if (r != -EVIRTIO_DISABLED) {
+            fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+        }
         goto fail_notifiers;
     }
@@ -686,7 +690,8 @@ fail_vq:
     }
 fail_mem:
 fail_features:
-    vdev->binding->set_guest_notifiers(vdev->binding_opaque, false);
+    vdev->binding->set_guest_notifiers(vdev->binding_opaque, false,
+                                       hdev->force);
 fail_notifiers:
 fail:
     return r;
@@ -704,7 +709,8 @@ void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
     }
     vhost_client_sync_dirty_bitmap(&hdev->client, 0,
                                    (target_phys_addr_t)~0x0ull);
-    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, false);
+    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, false,
+                                           hdev->force);
     if (r < 0) {
         fprintf(stderr, "vhost guest notifier cleanup failed: %d\n", r);
         fflush(stderr);
diff --git a/hw/vhost.h b/hw/vhost.h
index 86dd834..d5d4d86 100644
--- a/hw/vhost.h
+++ b/hw/vhost.h
@@ -38,9 +38,10 @@ struct vhost_dev {
     bool log_enabled;
     vhost_log_chunk_t *log;
     unsigned long long log_size;
+    bool force;
 };
 
-int vhost_dev_init(struct vhost_dev *hdev, int devfd);
+int vhost_dev_init(struct vhost_dev *hdev, int devfd, bool force);
 void vhost_dev_cleanup(struct vhost_dev *hdev);
 int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev);
 void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev);
diff --git a/hw/vhost_net.c b/hw/vhost_net.c
index c068be1..4f57271 100644
--- a/hw/vhost_net.c
+++ b/hw/vhost_net.c
@@ -81,7 +81,8 @@ static int vhost_net_get_fd(VLANClientState *backend)
     }
 }
 
-struct vhost_net
Re: [PATCHv2] vhost: force vhost off for non-MSI guests
On Tue, 2011-02-01 at 00:02 +0200, Michael S. Tsirkin wrote:
On Mon, Jan 31, 2011 at 02:47:34PM -0700, Alex Williamson wrote:

[v1 patch description and the band-aid discussion snipped; quoted in full in the previous message]

I'm not entirely happy with the API either, but I'm sure playing with the function pointer is not the solution we are looking for: msix is enabled/disabled dynamically. force is static, but it is something the net backend knows about; virtio-pci doesn't. So I suspect modifying the callback on the fly will become messy quite quickly.

Isn't there a callback to the device when msix is enabled/disabled?

Well, this API will be easy to tweak after the fact, there's a single user after all.

Gosh, I wish I got to take that attitude when I submit code...
;)

Alex
Re: [PATCHv2] vhost: force vhost off for non-MSI guests
On Mon, Jan 31, 2011 at 03:07:49PM -0700, Alex Williamson wrote:
On Tue, 2011-02-01 at 00:02 +0200, Michael S. Tsirkin wrote:

[earlier discussion of the vhostforce flag snipped; quoted in full in the previous messages]

I'm not entirely happy with the API either, but I'm sure playing with the function pointer is not the solution we are looking for: msix is enabled/disabled dynamically. force is static, but it is something the net backend knows about; virtio-pci doesn't. So I suspect modifying the callback on the fly will become messy quite quickly.

Isn't there a callback to the device when msix is enabled/disabled?

Not at the moment. In qemu we have mask/unmask.

Well, this API will be easy to tweak after the fact, there's a single user after all.

Gosh, I wish I got to take that attitude when I submit code...
;)

Alex

You are probably dealing with more important issues :) If you have an internal API used in a single place, you get to change it at will. If you tweak an API used all over the codebase, you don't :)
Re: Network performance with small packets
Michael S. Tsirkin m...@redhat.com wrote on 01/28/2011 06:16:16 AM:

OK, so thinking about it more, maybe the issue is this: tx becomes full. We process one request and interrupt the guest, then it adds one request and the queue is full again.

Maybe the following will help it stabilize? By itself it does nothing, but if you set all the parameters to a huge value we will only interrupt when we see an empty ring. Which might be too much: pls try other values in the middle: e.g. make bufs half the ring, or bytes some small value, or packets some small value, etc.

Warning: completely untested.

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index aac05bc..6769cdc 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -32,6 +32,13 @@
  * Using this limit prevents one virtqueue from starving others.
  */
 #define VHOST_NET_WEIGHT 0x8
 
+int tx_bytes_coalesce = 0;
+module_param(tx_bytes_coalesce, int, 0644);
+int tx_bufs_coalesce = 0;
+module_param(tx_bufs_coalesce, int, 0644);
+int tx_packets_coalesce = 0;
+module_param(tx_packets_coalesce, int, 0644);
+
 enum {
 	VHOST_NET_VQ_RX = 0,
 	VHOST_NET_VQ_TX = 1,
@@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
 	int err, wmem;
 	size_t hdr_size;
 	struct socket *sock;
+	int bytes_coalesced = 0;
+	int bufs_coalesced = 0;
+	int packets_coalesced = 0;
 
 	/* TODO: check that we are running from vhost_worker? */
 	sock = rcu_dereference_check(vq->private_data, 1);
@@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: len %d != %zd\n",
 				 err, len);
-		vhost_add_used_and_signal(net->dev, vq, head, 0);
 		total_len += len;
+		packets_coalesced += 1;
+		bytes_coalesced += len;
+		bufs_coalesced += in;

Should this instead be:

		bufs_coalesced += out;

Perusing the code I see that earlier there is a check to see if "in" is not zero, and, if so, error out of the loop. After the check, "in" is not touched until it is added to bufs_coalesced, effectively not changing bufs_coalesced, meaning bufs_coalesced will never trigger the conditions below. Or am I missing something?

+		if (unlikely(packets_coalesced > tx_packets_coalesce ||
+			     bytes_coalesced > tx_bytes_coalesce ||
+			     bufs_coalesced > tx_bufs_coalesce))
+			vhost_add_used_and_signal(net->dev, vq, head, 0);
+		else
+			vhost_add_used(vq, head, 0);
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
 			break;
 		}
 	}
+	if (likely(packets_coalesced > tx_packets_coalesce ||
+		   bytes_coalesced > tx_bytes_coalesce ||
+		   bufs_coalesced > tx_bufs_coalesce))
+		vhost_signal(net->dev, vq);
 	mutex_unlock(&vq->mutex);
 }

Steve D.
Re: Network performance with small packets
On Mon, 2011-01-31 at 18:24 -0600, Steve Dobbelstein wrote:
Michael S. Tsirkin m...@redhat.com wrote on 01/28/2011 06:16:16 AM:

[Michael's coalescing patch snipped; quoted in full in the previous message]

+		packets_coalesced += 1;
+		bytes_coalesced += len;
+		bufs_coalesced += in;

Should this instead be:

		bufs_coalesced += out;

Perusing the code I see that earlier there is a check to see if "in" is not zero, and, if so, error out of the loop. After the check, "in" is not touched until it is added to bufs_coalesced, effectively not changing bufs_coalesced, meaning bufs_coalesced will never trigger the conditions below.

Yes. It definitely should be 'out'. 'in' should be 0 in the tx path.

I tried a simpler version of this patch without any tunables by delaying the signaling until we come out of the for loop. It definitely reduced the number of vmexits significantly for small message guest to host stream test and the throughput went up a little.

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9b3ca10..5f9fae9 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -197,7 +197,7 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: len %d != %zd\n",
 				 err, len);
-		vhost_add_used_and_signal(net->dev, vq, head, 0);
+		vhost_add_used(vq, head, 0);
 		total_len += len;
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
@@ -205,6 +205,8 @@ static void handle_tx(struct vhost_net *net)
 		}
 	}
 
+	if (total_len > 0)
+		vhost_signal(net->dev, vq);
 	mutex_unlock(&vq->mutex);
 }

Or am I missing something?

+		if (unlikely(packets_coalesced > tx_packets_coalesce ||
+			     bytes_coalesced > tx_bytes_coalesce ||
+			     bufs_coalesced > tx_bufs_coalesce))
+			vhost_add_used_and_signal(net->dev, vq, head, 0);
+		else
+			vhost_add_used(vq, head, 0);
...
+	if (likely(packets_coalesced > tx_packets_coalesce ||
+		   bytes_coalesced > tx_bytes_coalesce ||
+		   bufs_coalesced > tx_bufs_coalesce))
+		vhost_signal(net->dev, vq);

It is possible that we can miss signaling the guest even after processing a few pkts, if we don't hit any of these conditions.

Steve D.
Re: [PATCH 2/3] KVM: Emulate MSI-X table in kernel
On Mon, Jan 31, 2011 at 03:09:09PM +0200, Avi Kivity wrote:
On 01/30/2011 06:38 AM, Sheng Yang wrote:

(Sorry, missed this mail...)

On Mon, Jan 17, 2011 at 02:29:44PM +0200, Avi Kivity wrote:
On 01/06/2011 12:19 PM, Sheng Yang wrote:

Then we can support mask bit operation of assigned devices now.

+int kvm_assigned_device_update_msix_mask_bit(struct kvm *kvm,
+				int assigned_dev_id, int entry, bool mask)
+{
+	int r = -EFAULT;
+	struct kvm_assigned_dev_kernel *adev;
+	int i;
+
+	if (!irqchip_in_kernel(kvm))
+		return r;
+
+	mutex_lock(&kvm->lock);
+	adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
+				     assigned_dev_id);
+	if (!adev)
+		goto out;
+
+	for (i = 0; i < adev->entries_nr; i++)
+		if (adev->host_msix_entries[i].entry == entry) {
+			if (mask)
+				disable_irq_nosync(
+					adev->host_msix_entries[i].vector);

Is it okay to call disable_irq_nosync() here? IIRC we don't check the mask bit on irq delivery, so we may forward an interrupt to the guest after the mask bit was set. What does PCI say about the mask bit? When does it take effect?

Another question is whether disable_irq_nosync() actually programs the device mask bit, or not. If it does, then it's slow, and it may be better to leave interrupts enabled but have an internal pending bit. If it doesn't program the mask bit, it's fine.

I think Michael and Jan had explained this.

+			else
+				enable_irq(adev->host_msix_entries[i].vector);
+			r = 0;
+			break;
+		}
+out:
+	mutex_unlock(&kvm->lock);
+	return r;
+}
+
+static int msix_table_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
+				void *val)
+{
+	struct kvm_msix_mmio_dev *mmio_dev =
+		container_of(this, struct kvm_msix_mmio_dev, table_dev);
+	struct kvm_msix_mmio *mmio;
+	int idx, ret = 0, entry, offset, r;
+
+	mutex_lock(&mmio_dev->lock);
+	idx = get_mmio_table_index(mmio_dev, addr, len);
+	if (idx < 0) {
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+	if ((addr & 0x3) || (len != 4 && len != 8))
+		goto out;
+
+	offset = addr & 0xf;
+	if (offset == PCI_MSIX_ENTRY_VECTOR_CTRL && len == 8)
+		goto out;
+
+	mmio = &mmio_dev->mmio[idx];
+	entry = (addr - mmio->table_base_addr) / PCI_MSIX_ENTRY_SIZE;
+	r = copy_from_user(val, (void __user *)(mmio->table_base_va +
+			entry * PCI_MSIX_ENTRY_SIZE + offset), len);
+	if (r)
+		goto out;

and return ret == 0?

Yes. This operation should be handled by the in-kernel MSI-X MMIO, so we return 0 in order to omit this action. We can add a warning for it later.

But it failed. We need to return -EFAULT, so it would return to QEmu.

OK, let QEmu print a warning about it.

--
regards
Yang, Sheng

The same as above.

+
+	if ((offset < PCI_MSIX_ENTRY_VECTOR_CTRL && len == 4) ||
+	    (offset < PCI_MSIX_ENTRY_DATA && len == 8))
+		ret = -ENOTSYNC;

goto out?

No. This judgement only checks if the MSI data/address was touched, and the line below would check if we need to operate the mask bit, because in theory the guest can use len=8 to modify MSI-X data and ctrl at the same time.

Ok, makes sense.

--
error compiling committee.c: too many arguments to function
Re: [PATCH 3/3] KVM: Add documents for MSI-X MMIO API
On Mon, Jan 31, 2011 at 03:24:27PM +0200, Avi Kivity wrote:
On 01/26/2011 11:05 AM, Sheng Yang wrote:
On Tuesday 25 January 2011 20:47:38 Avi Kivity wrote:
On 01/19/2011 10:21 AM, Sheng Yang wrote:

We already got a guest MMIO address for that in the exit information. I've created a chain of handlers in qemu to handle it.

But we already decoded the table and entry...

But the handler is still wrapped by vcpu_mmio_write(), as a part of MMIO, so it's not quite handy to get the table and entry out.

The kernel handler can create a new kvm_run exit description.

Also, the updater in userspace can share most of the logic with the ordinary userspace MMIO handler, which takes an address as parameter. So I think we don't need to pass the decoded table_id and entry to userspace.

It's mixing layers, which always leads to trouble. For one, the user handler shouldn't do anything with the write since the kernel already wrote it into the table. For another, if two vcpus write to the same entry simultaneously, you could see different ordering in the kernel and userspace, and get inconsistent results.

The shared logic is not about writing, but about interpreting what's written. The old MMIO handler would write the data, then interpret it; our new MMIO would only share the logic of interpretation. I think that's fair enough?

It doesn't make sense from an API point of view. You registered a table of entries; you expect an exit on that table to point to the table and entry that got changed.

OK, I will update this when I come back to work (about two weeks later...).

--
regards
Yang, Sheng

--
error compiling committee.c: too many arguments to function
Re: Network performance with small packets
On Mon, Jan 31, 2011 at 06:24:34PM -0600, Steve Dobbelstein wrote:
Michael S. Tsirkin m...@redhat.com wrote on 01/28/2011 06:16:16 AM:

[Michael's coalescing patch snipped; quoted in full in the earlier messages]

+		packets_coalesced += 1;
+		bytes_coalesced += len;
+		bufs_coalesced += in;

Should this instead be:

		bufs_coalesced += out;

Correct.

Perusing the code I see that earlier there is a check to see if "in" is not zero, and, if so, error out of the loop. After the check, "in" is not touched until it is added to bufs_coalesced, effectively not changing bufs_coalesced, meaning bufs_coalesced will never trigger the conditions below. Or am I missing something?

[remaining quote of the signaling conditions snipped]

Steve D.
Re: Network performance with small packets
On Mon, Jan 31, 2011 at 05:30:38PM -0800, Sridhar Samudrala wrote:
On Mon, 2011-01-31 at 18:24 -0600, Steve Dobbelstein wrote:
Michael S. Tsirkin m...@redhat.com wrote on 01/28/2011 06:16:16 AM:

[Michael's coalescing patch and the bufs_coalesced += in / += out discussion snipped; quoted in full in the earlier messages]

Yes. It definitely should be 'out'. 'in' should be 0 in the tx path.

I tried a simpler version of this patch without any tunables by delaying the signaling until we come out of the for loop. It definitely reduced the number of vmexits significantly for small message guest to host stream test and the throughput went up a little.

[Sridhar's delayed-signaling patch snipped; quoted in full in the earlier message]

It is possible that we can miss signaling the guest even after processing a few pkts, if we don't hit any of these conditions.

Yes. It really should be

	if (likely(packets_coalesced && bytes_coalesced && bufs_coalesced))
		vhost_signal(net->dev, vq);

Steve D.