Re: [patch -rt 1/2] KVM: use simple waitqueue for vcpu->wq

2015-01-16 Thread Peter Zijlstra
On Fri, Jan 16, 2015 at 11:48:46AM -0500, Steven Rostedt wrote:
> I notice everywhere you have a swait_wake_interruptible() but here. Is
> there a reason why?
> 
> IIRC, Peter wants to make swait wakeup usage homogenous. That is, you
> either sleep in an interruptible state, or you don't. You can't mix and
> match it.
> 
> Peter is that what you plan?

Yes, this is required for the single wakeup case to be bounded.

If you have both INTERRUPTIBLE and UNINTERRUPTIBLE waiters on a list,
and you were allowed to do INTERRUPTIBLE/UNINTERRUPTIBLE wakeups, the
wakeup would have to do O(n) iteration to find a matching waiter.
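
(Editorial illustration: a small stand-alone userspace toy model, not the -rt
swait implementation.  Once waiters in both states can sit on the same list,
"wake one interruptible waiter" degenerates into a linear scan, whereas a
homogeneous list lets the wakeup just take the head.)

#include <stdio.h>

enum wait_state { INTERRUPTIBLE, UNINTERRUPTIBLE };

struct waiter {
	enum wait_state state;
	struct waiter *next;
};

/* Wake (here: just report) the first waiter matching 'state'.  With mixed
 * states on the list this is O(n); if every waiter is known to be
 * INTERRUPTIBLE, waking one is O(1): take the head of the list. */
static struct waiter *wake_one(struct waiter *head, enum wait_state state)
{
	int scanned = 0;
	struct waiter *w;

	for (w = head; w; w = w->next) {
		scanned++;
		if (w->state == state) {
			printf("woke a waiter after scanning %d entries\n", scanned);
			return w;
		}
	}
	return NULL;
}

int main(void)
{
	struct waiter c = { INTERRUPTIBLE, NULL };
	struct waiter b = { UNINTERRUPTIBLE, &c };
	struct waiter a = { UNINTERRUPTIBLE, &b };

	wake_one(&a, INTERRUPTIBLE);	/* has to scan all three entries */
	return 0;
}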

Seeing that the entire purpose of simple wait queues was to retain RT
properties, that's entirely retarded ;-)


[PATCH v2 3/3] x86_64,entry: Remove the syscall exit audit and schedule optimizations

2015-01-16 Thread Andy Lutomirski
We used to optimize rescheduling and audit on syscall exit.  Now
that the full slow path is reasonably fast, remove these
optimizations.  Syscall exit auditing is now handled exclusively by
syscall_trace_leave.

This adds something like 10ns to the previously optimized paths on
my computer, presumably due mostly to SAVE_REST / RESTORE_REST.

I think that we should eventually replace both the syscall and
non-paranoid interrupt exit slow paths with a pair of C functions
along the lines of the syscall entry hooks.

Acked-by: Borislav Petkov 
Signed-off-by: Andy Lutomirski 
---
 arch/x86/kernel/entry_64.S | 52 +-
 1 file changed, 5 insertions(+), 47 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index eeab4cf8b2c9..db13655c3a2a 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -361,15 +361,12 @@ system_call_fastpath:
  * Has incomplete stack frame and undefined top of stack.
  */
 ret_from_sys_call:
-   movl $_TIF_ALLWORK_MASK,%edi
-   /* edi: flagmask */
-sysret_check:
+   testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
+   jnz int_ret_from_sys_call_fixup /* Go to the slow path */
+
LOCKDEP_SYS_EXIT
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
-   movl TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET),%edx
-   andl %edi,%edx
-   jnz  sysret_careful
CFI_REMEMBER_STATE
/*
 * sysretq will re-enable interrupts:
@@ -383,49 +380,10 @@ sysret_check:
USERGS_SYSRET64
 
CFI_RESTORE_STATE
-   /* Handle reschedules */
-   /* edx: work, edi: workmask */
-sysret_careful:
-   bt $TIF_NEED_RESCHED,%edx
-   jnc sysret_signal
-   TRACE_IRQS_ON
-   ENABLE_INTERRUPTS(CLBR_NONE)
-   pushq_cfi %rdi
-   SCHEDULE_USER
-   popq_cfi %rdi
-   jmp sysret_check
 
-   /* Handle a signal */
-sysret_signal:
-   TRACE_IRQS_ON
-   ENABLE_INTERRUPTS(CLBR_NONE)
-#ifdef CONFIG_AUDITSYSCALL
-   bt $TIF_SYSCALL_AUDIT,%edx
-   jc sysret_audit
-#endif
-   /*
-* We have a signal, or exit tracing or single-step.
-* These all wind up with the iret return path anyway,
-* so just join that path right now.
-*/
+int_ret_from_sys_call_fixup:
FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
-   jmp int_check_syscall_exit_work
-
-#ifdef CONFIG_AUDITSYSCALL
-   /*
-* Return fast path for syscall audit.  Call __audit_syscall_exit()
-* directly and then jump back to the fast path with TIF_SYSCALL_AUDIT
-* masked off.
-*/
-sysret_audit:
-   movq RAX-ARGOFFSET(%rsp),%rsi   /* second arg, syscall return value */
-   cmpq $-MAX_ERRNO,%rsi   /* is it < -MAX_ERRNO? */
-   setbe %al   /* 1 if so, 0 if not */
-   movzbl %al,%edi /* zero-extend that into %edi */
-   call __audit_syscall_exit
-   movl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),%edi
-   jmp sysret_check
-#endif /* CONFIG_AUDITSYSCALL */
+   jmp int_ret_from_sys_call
 
/* Do syscall tracing */
 tracesys:
-- 
2.1.0



[PATCH v2 2/3] x86_64,entry: Use sysret to return to userspace when possible

2015-01-16 Thread Andy Lutomirski
The x86_64 entry code currently jumps through complex and
inconsistent hoops to try to minimize the impact of syscall exit
work.  For a true fast-path syscall, almost nothing needs to be
done, so returning is just a check for exit work and sysret.  For a
full slow-path return from a syscall, the C exit hook is invoked if
needed and we join the iret path.

Using iret to return to userspace is very slow, so the entry code
has accumulated various special cases to try to do certain forms of
exit work without invoking iret.  This is error-prone, since it
duplicates assembly code paths, and it's dangerous, since sysret
can malfunction in interesting ways if used carelessly.  It's
also inefficient, since a lot of useful cases aren't optimized
and therefore force an iret out of a combination of paranoia and
the fact that no one has bothered to write even more asm code
to avoid it.

I would argue that this approach is backwards.  Rather than trying
to avoid the iret path, we should instead try to make the iret path
fast.  Under a specific set of conditions, iret is unnecessary.  In
particular, if RIP==RCX, RFLAGS==R11, RIP is canonical, RF is not
set, and both SS and CS are as expected, then
movq 32(%rsp),%rsp;sysret does the same thing as iret.  This set of
conditions is nearly always satisfied on return from syscalls, and
it can even occasionally be satisfied on return from an irq.

Even with the careful checks for sysret applicability, this cuts
nearly 80ns off of the overhead from syscalls with unoptimized exit
work.  This includes tracing and context tracking, and any return
that invokes KVM's user return notifier.  For example, the cost of
getpid with CONFIG_CONTEXT_TRACKING_FORCE=y drops from ~360ns to
~280ns on my computer.

This may allow the removal and even eventual conversion to C
of a respectable amount of exit asm.

This may require further tweaking to give the full benefit on Xen.

It may be worthwhile to adjust signal delivery and exec to try to hit
the sysret path.

This does not optimize returns to 32-bit userspace.  Making the same
optimization for CS == __USER32_CS is conceptually straightforward,
but it will require some tedious code to handle the differences
between sysretl and sysexitl.

Signed-off-by: Andy Lutomirski 
---
 arch/x86/kernel/entry_64.S | 54 ++
 1 file changed, 54 insertions(+)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 501212f14c87..eeab4cf8b2c9 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -794,6 +794,60 @@ retint_swapgs: /* return to user-space */
 */
DISABLE_INTERRUPTS(CLBR_ANY)
TRACE_IRQS_IRETQ
+
+   /*
+* Try to use SYSRET instead of IRET if we're returning to
+* a completely clean 64-bit userspace context.
+*/
+   movq (RCX-R11)(%rsp), %rcx
+   cmpq %rcx,(RIP-R11)(%rsp)   /* RCX == RIP */
+   jne opportunistic_sysret_failed
+
+   /*
+* On Intel CPUs, sysret with non-canonical RCX/RIP will #GP
+* in kernel space.  This essentially lets the user take over
+* the kernel, since userspace controls RSP.  It's not worth
+* testing for canonicalness exactly -- this check detects any
+* of the 17 high bits set, which is true for non-canonical
+* or kernel addresses.  (This will pessimize vsyscall=native.
+* Big deal.)
+*
+* If virtual addresses ever become wider, this will need
+* to be updated to remain correct on both old and new CPUs.
+*/
+   .ifne __VIRTUAL_MASK_SHIFT - 47
+   .error "virtual address width changed -- sysret checks need update"
+   .endif
+   shr $__VIRTUAL_MASK_SHIFT, %rcx
+   jnz opportunistic_sysret_failed
+
+   cmpq $__USER_CS,(CS-R11)(%rsp)  /* CS must match SYSRET */
+   jne opportunistic_sysret_failed
+
+   movq (R11-ARGOFFSET)(%rsp), %r11
+   cmpq %r11,(EFLAGS-ARGOFFSET)(%rsp)  /* R11 == RFLAGS */
+   jne opportunistic_sysret_failed
+
+   testq $X86_EFLAGS_RF,%r11   /* sysret can't restore RF */
+   jnz opportunistic_sysret_failed
+
+   /* nothing to check for RSP */
+
+   cmpq $__USER_DS,(SS-ARGOFFSET)(%rsp)/* SS must match SYSRET */
+   jne opportunistic_sysret_failed
+
+   /*
+* We win!  This label is here just for ease of understanding
+* perf profiles.  Nothing jumps here.
+*/
+irq_return_via_sysret:
+   CFI_REMEMBER_STATE
+   RESTORE_ARGS 1,8,1
+   movq (RSP-RIP)(%rsp),%rsp
+   USERGS_SYSRET64
+   CFI_RESTORE_STATE
+
+opportunistic_sysret_failed:
SWAPGS
jmp restore_args
 
-- 
2.1.0



[PATCH v2 0/3] x86_64,entry: Rearrange the syscall exit optimizations

2015-01-16 Thread Andy Lutomirski
Linus, I suspect you'll either like or hate this series.  Or maybe
you'll think it's crazy but you'll like it anyway.  I'm curious
which of those is the case. :)

The syscall exit asm is a big mess.  There's a really fast path, some
kind of fast path code (with a hard-coded optimization for audit), and
the really slow path.  The result is that it's very hard to work with
this code.  There are some asm paths that are much slower than they
should be (context tracking is a major offender), but no one really
wants to add even more asm to speed them up.

This series takes a different, unorthodox approach.  Rather than trying
to avoid entering the very slow iret path, it adds a way back out of the
iret path.  The result is a dramatic speedup for context tracking, user
return notification, and similar code, at the cost of a few lines of
tricky asm.  Nonetheless, it's barely a net addition of asm code,
because we get to remove the fast path optimizations for audit and
rescheduling.

Thoughts?  If this works, it opens the door for a lot of further
consolidation of the exit code.

Note: patch 1 in this series has been floating around on the list
for quite a while.  It's mandatory for this series to work, because
the buglet that it fixes almost completely defeats the optimization
that I'm introducing.

Note: Some optimization along these lines is almost certainly needed
to get anything resembling acceptable performance out of the FPU changes
that Rik is working on.

Andy Lutomirski (3):
  x86_64,entry: Fix RCX for traced syscalls
  x86_64,entry: Use sysret to return to userspace when possible
  x86_64,entry: Remove the syscall exit audit and schedule optimizations

 arch/x86/kernel/entry_64.S | 109 +
 1 file changed, 61 insertions(+), 48 deletions(-)

-- 
2.1.0



[PATCH v2 1/3] x86_64,entry: Fix RCX for traced syscalls

2015-01-16 Thread Andy Lutomirski
The int_ret_from_sys_call and syscall tracing code disagrees with
the sysret path as to the value of RCX.

The Intel SDM, the AMD APM, and my laptop all agree that sysret
returns with RCX == RIP.  The syscall tracing code does not respect
this property.

For example, this program:

#include <stdio.h>

int main()
{
	extern const char syscall_rip[];
	unsigned long rcx = 1;
	unsigned long orig_rcx = rcx;
	asm ("mov $-1, %%eax\n\t"
	     "syscall\n\t"
	     "syscall_rip:"
	     : "+c" (rcx) : : "r11");
	printf("syscall: RCX = %lX  RIP = %lX  orig RCX = %lx\n",
	       rcx, (unsigned long)syscall_rip, orig_rcx);
	return 0;
}

prints:
syscall: RCX = 400556  RIP = 400556  orig RCX = 1

Running it under strace gives this instead:
syscall: RCX = FFFFFFFFFFFFFFFF  RIP = 400556  orig RCX = 1

This changes FIXUP_TOP_OF_STACK to match sysret, causing the test to
show RCX == RIP even under strace.

It looks like this is a partial revert of:
88e4bc32686e [PATCH] x86-64 architecture specific sync for 2.5.8
from the historic git tree.

Signed-off-by: Andy Lutomirski 
---
 arch/x86/kernel/entry_64.S | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 7d59df23e5bb..501212f14c87 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -143,7 +143,8 @@ ENDPROC(native_usergs_sysret64)
movq \tmp,RSP+\offset(%rsp)
movq $__USER_DS,SS+\offset(%rsp)
movq $__USER_CS,CS+\offset(%rsp)
-   movq $-1,RCX+\offset(%rsp)
+   movq RIP+\offset(%rsp),\tmp  /* get rip */
+   movq \tmp,RCX+\offset(%rsp)  /* copy it to rcx as sysret would do */
movq R11+\offset(%rsp),\tmp  /* get eflags */
movq \tmp,EFLAGS+\offset(%rsp)
.endm
-- 
2.1.0



Re: [PATCH 1/1] arch/x86/kvm/vmx.c: Fix external interrupts inject directly bug with guestos RFLAGS.IF=0

2015-01-16 Thread Radim Krčmář
2015-01-16 16:07+0800, Li Kaihang:
> > GuestOS is running and handling some interrupt with RFLAGS.IF = 0 when an
> > external interrupt arrives; this can lead to a VM exit. In this case, we must
> > avoid injecting this external interrupt, or it will generate a processor
> > hardware exception and crash the virtual machine.
> 
> What is the source of this exception?  (Is there a reproducer?)
> 
> Li Kaihang: the exception is produced by the Intel processor hardware because
> injecting an external interrupt vector is forbidden by the processor when
> GuestOS RFLAGS.IF = 0; the hypervisor software must ensure this, according to
> the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3.

(Found it, happens on VMENTRY ... 26.3.1.4 Checks on Guest RIP and RFLAGS
   The IF flag (RFLAGS[bit 9]) must be 1 if the valid bit (bit 31) in the
   VM-entry interruption-information field is 1 and the interruption type
   (bits 10:8) is external interrupt.)
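
(Editorial illustration: a stand-alone toy model of that consistency check,
with the field layout taken from the SDM passage above (valid bit 31, type in
bits 10:8, type 0 = external interrupt).  The names are made up for the
sketch; this is not KVM code.)

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define RFLAGS_IF		(UINT64_C(1) << 9)
#define INTR_INFO_VALID		(UINT32_C(1) << 31)
#define INTR_INFO_TYPE_MASK	(UINT32_C(7) << 8)
#define INTR_TYPE_EXT_INTR	(UINT32_C(0) << 8)

/* Models the quoted 26.3.1.4 rule: a valid VM-entry interruption-information
 * field of type "external interrupt" requires guest RFLAGS.IF = 1. */
static bool vmentry_injection_ok(uint32_t intr_info, uint64_t guest_rflags)
{
	if (!(intr_info & INTR_INFO_VALID))
		return true;
	if ((intr_info & INTR_INFO_TYPE_MASK) != INTR_TYPE_EXT_INTR)
		return true;
	return (guest_rflags & RFLAGS_IF) != 0;
}

int main(void)
{
	uint32_t ext_intr = INTR_INFO_VALID | INTR_TYPE_EXT_INTR | 0x20;

	printf("IF=0: %s\n", vmentry_injection_ok(ext_intr, 0) ? "ok" : "VM entry fails");
	printf("IF=1: %s\n", vmentry_injection_ok(ext_intr, RFLAGS_IF) ? "ok" : "VM entry fails");
	return 0;
}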

> This bug occurs with a certain probability: if the code between cli and sti in
> a guest OS's interrupt handler is very short, the probability of occurrence is
> very low. The bug is like a moving trap: it is produced when the guest OS is
> running between a cli and an sti instruction while an external interrupt
> arrives. It may be verified by constructing a special guest OS interrupt
> handler. A general OS running on a KVM VM also has some probability of hitting
> this bug.

Is APICv enabled?  (Does toggling it avoid the bug?)

> > Step 1:
> >  a ext intr gen a vm_exit --> vmx_complete_interrupts --> 
> > __vmx_complete_interrupts --> case INTR_TYPE_EXT_INTR: 
> > kvm_queue_interrupt(vcpu, vector, type == INTR_TYPE_SOFT_INTR);
> >
> > Step 2:
> >  kvm_x86_ops->handle_external_intr(vcpu);
> > For the above steps, steps 4 and 5 will cause a processor hardware exception
> > if step 1 happens while guest OS RFLAGS.IF = 0, that is to say, while guest OS
> > interrupts are disabled.
> > So we should add logic in step 1 to judge whether an external interrupt needs
> > to be pended and then injected directly. In this process, we don't need to
> > worry about losing the external interrupt, because the next step 2 will handle
> > it and choose the best chance to inject it via the virtual interrupt
> > controller.
> 
> Can you explain the relation between vectored events (Step 1) and
> external interrupts (Step 2)?
> (The bug happens when external interrupt arrives during event delivery?)
> 
> Li Kaihang: an external interrupt to a running VM can trigger a vm_exit event,
> handled in step 1; then the interrupt vector is processed in step 2 by
> kvm_x86_ops->handle_external_intr(vcpu), which jumps to the host OS IDT to
> complete the external interrupt handling. The external interrupt handler in the
> host OS IDT may inject the external interrupt into the virtual interrupt
> controller if it has been registered as needed by the virtual machine.

Proposed non-accidental relation between first two steps confuses me ...

1) External interrupt in Step 1 should be from incomplete event delivery
   to the guest,
2) External interrupt in Step 2 should be from an interrupt for the host.
3) Interrupts should be unrelated, because if not, then we either
   deliver host's interrupts directly to the guest, or the other way
   around.  (Both cases are buggy, regardless of CLI.)
   - What is stored in "VM-exit interruption information" (Step 2) and
 "IDT-vectoring information" (Step 1)?

=> Interrupts in first two steps are independent and we have interrupted
   external interrupt delivery to the guest.

Can you provide a KVM trace of what is happening before the crash?
(Or just point out where I made a mistake.)

Thank you.

> The bug never happens in steps 1 and 2, but vcpu->arch.interrupt.pending is set
> in step 1; if this pending interrupt should not be injected, it will still be
> passed on to step 4 to complete the dangerous external interrupt injection.
> Please see the above answer for what "pending should not be injected" means.
> Our solution is to clear the invalid external interrupt pending state,
> preventing the erroneous injection, by adding a logical judgement in step 1.

(I agree that we shouldn't inject events that fail VMENTRY,
 I'm just not sure that this solution is correct.)

> Why isn't the delivered event lost?
> (It should be different from the external interrupt.)
> 
> Li Kaihang: please refer to the above answer; an external interrupt in step 1
> can only reach the case INTR_TYPE_EXT_INTR branch in the patch code, so it
> should not affect other

(It is the same type of interrupt, but with a different meaning ...
 You don't save the incomplete event delivery in Step 1, so it has to be
 recorded somewhere, otherwise it is lost -- where is it?)

> types of event delivery. But there is another possibility: that the external
> interrupt needed by the running VM is not registered in the host OS IDT handler
> chain, of
>  

Re: [patch -rt 1/2] KVM: use simple waitqueue for vcpu->wq

2015-01-16 Thread Steven Rostedt
On Fri, 16 Jan 2015 11:40:45 -0500
Marcelo Tosatti  wrote:

> The problem:
> 
> On -RT, an emulated LAPIC timer instance has the following path:
> 
> 1) hard interrupt
> 2) ksoftirqd is scheduled
> 3) ksoftirqd wakes up vcpu thread
> 4) vcpu thread is scheduled
> 
> This extra context switch introduces unnecessary latency in the 
> LAPIC path for a KVM guest.
> 
> The solution:
> 
> Allow waking up vcpu thread from hardirq context,
> thus avoiding the need for ksoftirqd to be scheduled.
> 
> Normal waitqueues make use of spinlocks, which on -RT 
> are sleepable locks. Therefore, waking up a waitqueue 
> waiter involves locking a sleeping lock, which 
> is not allowed from hard interrupt context.
> 

I have no issue in pulling this into stable-rt, but read below.

> cyclictest command line:
> # cyclictest -m -n -q -p99 -l 100 -h60  -D 1m
> 
> This patch reduces the average latency in my tests from 14us to 11us.
> 
> Signed-off-by: Marcelo Tosatti 
> 
> ---
>  arch/arm/kvm/arm.c  |4 ++--
>  arch/arm/kvm/psci.c |4 ++--
>  arch/mips/kvm/kvm_mips.c|8 
>  arch/powerpc/include/asm/kvm_host.h |4 ++--
>  arch/powerpc/kvm/book3s_hv.c|   20 ++--
>  arch/s390/include/asm/kvm_host.h|2 +-
>  arch/s390/kvm/interrupt.c   |   22 ++
>  arch/s390/kvm/sigp.c|   16 
>  arch/x86/kvm/lapic.c|6 +++---
>  include/linux/kvm_host.h|4 ++--
>  virt/kvm/async_pf.c |4 ++--
>  virt/kvm/kvm_main.c |   16 
>  12 files changed, 54 insertions(+), 56 deletions(-)
> 
> Index: linux-stable-rt/arch/arm/kvm/arm.c
> ===
> --- linux-stable-rt.orig/arch/arm/kvm/arm.c   2014-11-25 14:13:39.188899952 
> -0200
> +++ linux-stable-rt/arch/arm/kvm/arm.c2014-11-25 14:14:38.620810092 
> -0200
> @@ -495,9 +495,9 @@
>  
>  static void vcpu_pause(struct kvm_vcpu *vcpu)
>  {
> - wait_queue_head_t *wq = kvm_arch_vcpu_wq(vcpu);
> + struct swait_head *wq = kvm_arch_vcpu_wq(vcpu);
>  
> - wait_event_interruptible(*wq, !vcpu->arch.pause);
> + swait_event_interruptible(*wq, !vcpu->arch.pause);
>  }
>  
>  static int kvm_vcpu_initialized(struct kvm_vcpu *vcpu)
> Index: linux-stable-rt/arch/arm/kvm/psci.c
> ===
> --- linux-stable-rt.orig/arch/arm/kvm/psci.c  2014-11-25 14:13:39.189899951 
> -0200
> +++ linux-stable-rt/arch/arm/kvm/psci.c   2014-11-25 14:14:38.620810092 
> -0200
> @@ -36,7 +36,7 @@
>  {
>   struct kvm *kvm = source_vcpu->kvm;
>   struct kvm_vcpu *vcpu = NULL, *tmp;
> - wait_queue_head_t *wq;
> + struct swait_head *wq;
>   unsigned long cpu_id;
>   unsigned long mpidr;
>   phys_addr_t target_pc;
> @@ -80,7 +80,7 @@
>   smp_mb();   /* Make sure the above is visible */
>  
>   wq = kvm_arch_vcpu_wq(vcpu);
> - wake_up_interruptible(wq);
> + swait_wake_interruptible(wq);
>  
>   return KVM_PSCI_RET_SUCCESS;
>  }
> Index: linux-stable-rt/arch/mips/kvm/kvm_mips.c
> ===
> --- linux-stable-rt.orig/arch/mips/kvm/kvm_mips.c 2014-11-25 
> 14:13:39.191899948 -0200
> +++ linux-stable-rt/arch/mips/kvm/kvm_mips.c  2014-11-25 14:14:38.621810091 
> -0200
> @@ -464,8 +464,8 @@
>  
>   dvcpu->arch.wait = 0;
>  
> - if (waitqueue_active(&dvcpu->wq)) {
> - wake_up_interruptible(&dvcpu->wq);
> + if (swaitqueue_active(&dvcpu->wq)) {
> + swait_wake_interruptible(&dvcpu->wq);
>   }
>  
>   return 0;
> @@ -971,8 +971,8 @@
>   kvm_mips_callbacks->queue_timer_int(vcpu);
>  
>   vcpu->arch.wait = 0;
> - if (waitqueue_active(&vcpu->wq)) {
> - wake_up_interruptible(&vcpu->wq);
> + if (swaitqueue_active(&vcpu->wq)) {
> + swait_wake_nterruptible(&vcpu->wq);

Sure you compiled this patch?

>   }
>  }
>  
> Index: linux-stable-rt/arch/powerpc/include/asm/kvm_host.h
> ===
> --- linux-stable-rt.orig/arch/powerpc/include/asm/kvm_host.h  2014-11-25 
> 14:13:39.193899944 -0200
> +++ linux-stable-rt/arch/powerpc/include/asm/kvm_host.h   2014-11-25 
> 14:14:38.621810091 -0200
> @@ -295,7 +295,7 @@
>   u8 in_guest;
>   struct list_head runnable_threads;
>   spinlock_t lock;
> - wait_queue_head_t wq;
> + struct swait_head wq;
>   u64 stolen_tb;
>   u64 preempt_tb;
>   struct kvm_vcpu *runner;
> @@ -612,7 +612,7 @@
>   u8 prodded;
>   u32 last_inst;
>  
> - wait_queue_head_t *wqp;
> + struct swait_head *wqp;
>   struct kvmppc_vcore *vcore;
>   int ret;
>   int trap;
> Index: linux-stable-rt/arch/powerpc/kvm/book3s_hv.c
> ===

Re: [patch -rt 1/2] KVM: use simple waitqueue for vcpu->wq

2015-01-16 Thread Steven Rostedt
On Fri, 16 Jan 2015 11:48:46 -0500
Steven Rostedt  wrote:

> >  static void kvmppc_vcore_blocked(struct kvmppc_vcore *vc)
> >  {
> > -   DEFINE_WAIT(wait);
> > +   DEFINE_SWAITER(wait);
> >  
> > -   prepare_to_wait(&vc->wq, &wait, TASK_INTERRUPTIBLE);
> > +   swait_prepare(&vc->wq, &wait, TASK_INTERRUPTIBLE);
> > vc->vcore_state = VCORE_SLEEPING;
> > spin_unlock(&vc->lock);
> > schedule();
> > -   finish_wait(&vc->wq, &wait);
> > +   swait_finish(&vc->wq, &wait);
> > spin_lock(&vc->lock);
> > vc->vcore_state = VCORE_INACTIVE;
> >  }
> > @@ -1613,7 +1613,7 @@
> > kvmppc_create_dtl_entry(vcpu, vc);
> > kvmppc_start_thread(vcpu);
> > } else if (vc->vcore_state == VCORE_SLEEPING) {
> > -   wake_up(&vc->wq);
> > +   swait_wake(&vc->wq);
> 
> I notice everywhere you have a swait_wake_interruptible() but here. Is
> there a reason why?
> 

Of course I could have misread this patch, and this specific wait queue
will only be woken up by swait_wake() and not
swait_wake_interruptible(). If that's the case, then it's fine as is
(except for that typo I pointed out before).

-- Steve


Re: [PATCH] kvmtool: virtio-9p: Convert EMFILE error at the server to ENFILE for the guest

2015-01-16 Thread Will Deacon
On Fri, Jan 16, 2015 at 11:30:10AM +, Suzuki K. Poulose wrote:
> From: "Suzuki K. Poulose" 
> 
> If an open at the 9p server (host) fails with EMFILE (too many open files for
> the process), we should return ENFILE (too many open files in the system) to
> the guest to indicate the actual status within the guest.
> 
> This was uncovered during LTP, where getdtablesize01 fails to open the maximum
> number of open files.
> 
> getdtablesize010  TINFO  :  Maximum number of files a process can have 
> opened is 1024
> getdtablesize010  TINFO  :  Checking with the value returned by 
> getrlimit...RLIMIT_NOFILE
> getdtablesize011  TPASS  :  got correct dtablesize, value is 1024
> getdtablesize010  TINFO  :  Checking Max num of files that can be opened 
> by a process.Should be: RLIMIT_NOFILE - 1
> getdtablesize012  TFAIL  :  getdtablesize01.c:102: 974 != 1023
> 
> For a more practical impact:
> 
>  # ./getdtablesize01 &
> [1] 1834
>  getdtablesize010  TINFO  :  Maximum number of files a process can have 
> opened is 1024
>  getdtablesize010  TINFO  :  Checking with the value returned by 
> getrlimit...RLIMIT_NOFILE
>  getdtablesize011  TPASS  :  got correct dtablesize, value is 1024
>  getdtablesize010  TINFO  :  Checking Max num of files that can be opened 
> by a process.Should be: RLIMIT_NOFILE - 1
>  getdtablesize012  TFAIL  :  getdtablesize01.c:102: 974 != 1023
>  [--- Modified to sleep indefinitely, without closing the files --- ]
> 
>  # ls
>  bash: /bin/ls: Too many open files
> 
> That gives a wrong error message for bash when getdtablesize01 has exhausted
> the system-wide limits, giving false indicators.
> 
> With the fix, we get :
> 
>  # ls
>  bash: /bin/ls: Too many open files in system

I doubt this is affecting anybody in practice, but:

  Acked-by: Will Deacon 

Will

> Signed-off-by: Suzuki K. Poulose 
> ---
>  tools/kvm/virtio/9p.c |4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/tools/kvm/virtio/9p.c b/tools/kvm/virtio/9p.c
> index 9073a1e..b24c0f2 100644
> --- a/tools/kvm/virtio/9p.c
> +++ b/tools/kvm/virtio/9p.c
> @@ -152,6 +152,10 @@ static void virtio_p9_error_reply(struct p9_dev *p9dev,
>  {
>   u16 tag;
>  
> + /* EMFILE at server implies ENFILE for the VM */
> + if (err == EMFILE)
> + err = ENFILE;
> +
>   pdu->write_offset = VIRTIO_9P_HDR_LEN;
>   virtio_p9_pdu_writef(pdu, "d", err);
>   *outlen = pdu->write_offset;
> -- 
> 1.7.9.5
> 


Re: use dedicated I/O service domain in KVM

2015-01-16 Thread Yidao Liu
2015-01-16 19:00 GMT+08:00 Stefan Hajnoczi :
> On Fri, Jan 16, 2015 at 12:14:17PM +0800, Yidao Liu wrote:
>> Hi, I want to use a dedicated guest VM to handle I/O request just as
>> I/O service domain used in xen.
>>
>> Specifically, using network I/O as an example, I should directly
>> assign the NIC to one guest VM (using pci-assign option),  after that
>> all other guest VMs should perform network I/O through that VM rather
>> than the host OS.
>>
>> Is there currently any viable approach to do this?
>>
>> If not, I want to implement one. Currently I'm thinking of combining
>> nahanni shared memory and vhost architecture to implement it, is there
>> any other suggestions?
>
> There are security features like SELinux, seccomp, and file descriptor
> passing that allow the QEMU userspace process to run unprivileged.
> Why does it matter if the I/O is happening in an isolated userspace
> process or another VM?
>
Although we use an unprivileged QEMU to emulate the I/O operations, the
current I/O mechanism still needs the host to do the last part of the I/O, so
if there's any bug in that phase, the whole system may crash because the host
OS goes down, right?
So with an I/O service domain there's no such problem: only the service
domain will crash rather than the whole system.

> You can of course have appliance VMs that do network I/O on behalf of
> other guests.  Assign the physical NIC to the appliance VM and then use
> a private bridge on the host so guests can only communicate through the
> appliance VM.
>
I'm curious why we would use a private bridge on the host rather than in the
appliance VM, since the physical NIC is already assigned to that VM.

> My Xen knowledge is very limited but I think part of why this
> architecture is attractive there is because there is already a mechanism
> for granting pages - i.e. secure inter-domain DMA.
>
> In QEMU the architecture is different, so I wonder what you're trying to
> achieve.  It will be slower and more convoluted than how it works today.
>
> What are you trying to achieve?
>
Thanks for your reply. Actually it is a research trial: we want to use a
non-root guest VM to provide the I/O service, rather than QEMU plus the host
OS, to get better fault isolation.
> Stefan


Re: [PATCH 0/8] current ACCESS_ONCE patch queue

2015-01-16 Thread Alexander Graf


On 15.01.15 09:58, Christian Borntraeger wrote:
> Folks,
> 
> fyi, this is my current patch queue for the next merge window. It
> does contain a patch that will disallow ACCESS_ONCE on non-scalar
> types.
> 
> The tree is part of linux-next and can be found at
> git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux.git linux-next

KVM PPC bits are:

 Acked-by: Alexander Graf 



Alex


[PATCH] kvmtool: virtio-9p: Convert EMFILE error at the server to ENFILE for the guest

2015-01-16 Thread Suzuki K. Poulose
From: "Suzuki K. Poulose" 

If an open at the 9p server (host) fails with EMFILE (too many open files for
the process), we should return ENFILE (too many open files in the system) to
the guest to indicate the actual status within the guest.

This was uncovered during LTP, where getdtablesize01 fails to open the maximum
number of open files.

getdtablesize010  TINFO  :  Maximum number of files a process can have 
opened is 1024
getdtablesize010  TINFO  :  Checking with the value returned by 
getrlimit...RLIMIT_NOFILE
getdtablesize011  TPASS  :  got correct dtablesize, value is 1024
getdtablesize010  TINFO  :  Checking Max num of files that can be opened by 
a process.Should be: RLIMIT_NOFILE - 1
getdtablesize012  TFAIL  :  getdtablesize01.c:102: 974 != 1023

For a more practical impact:

 # ./getdtablesize01 &
[1] 1834
 getdtablesize010  TINFO  :  Maximum number of files a process can have 
opened is 1024
 getdtablesize010  TINFO  :  Checking with the value returned by 
getrlimit...RLIMIT_NOFILE
 getdtablesize011  TPASS  :  got correct dtablesize, value is 1024
 getdtablesize010  TINFO  :  Checking Max num of files that can be opened 
by a process.Should be: RLIMIT_NOFILE - 1
 getdtablesize012  TFAIL  :  getdtablesize01.c:102: 974 != 1023
 [--- Modified to sleep indefinitely, without closing the files --- ]

 # ls
 bash: /bin/ls: Too many open files

That gives a wrong error message for bash when getdtablesize01 has exhausted
the system-wide limits, giving false indicators.

With the fix, we get :

 # ls
 bash: /bin/ls: Too many open files in system
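
(Editorial note: the two messages above are exactly what strerror() returns
for the two error codes, assuming glibc's strings, which is why the conversion
changes what the guest prints.  A minimal userspace check:)

#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	/* EMFILE: per-process fd limit reached; ENFILE: system-wide file table full. */
	printf("EMFILE: %s\n", strerror(EMFILE));	/* "Too many open files" */
	printf("ENFILE: %s\n", strerror(ENFILE));	/* "Too many open files in system" */
	return 0;
}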

Signed-off-by: Suzuki K. Poulose 
---
 tools/kvm/virtio/9p.c |4 
 1 file changed, 4 insertions(+)

diff --git a/tools/kvm/virtio/9p.c b/tools/kvm/virtio/9p.c
index 9073a1e..b24c0f2 100644
--- a/tools/kvm/virtio/9p.c
+++ b/tools/kvm/virtio/9p.c
@@ -152,6 +152,10 @@ static void virtio_p9_error_reply(struct p9_dev *p9dev,
 {
u16 tag;
 
+   /* EMFILE at server implies ENFILE for the VM */
+   if (err == EMFILE)
+   err = ENFILE;
+
pdu->write_offset = VIRTIO_9P_HDR_LEN;
virtio_p9_pdu_writef(pdu, "d", err);
*outlen = pdu->write_offset;
-- 
1.7.9.5




Re: use dedicated I/O service domain in KVM

2015-01-16 Thread Stefan Hajnoczi
On Fri, Jan 16, 2015 at 12:14:17PM +0800, Yidao Liu wrote:
> Hi, I want to use a dedicated guest VM to handle I/O request just as
> I/O service domain used in xen.
> 
> Specifically, using network I/O as an example, I should directly
> assign the NIC to one guest VM (using pci-assign option),  after that
> all other guest VMs should perform network I/O through that VM rather
> than the host OS.
> 
> Is there currently any viable approach to do this?
> 
> If not, I want to implement one. Currently I'm thinking of combining
> nahanni shared memory and vhost architecture to implement it, is there
> any other suggestions?

There are security features like SELinux, seccomp, and file descriptor
passing that allow the QEMU userspace process to run unprivileged.
Why does it matter if the I/O is happening in an isolated userspace
process or another VM?

You can of course have appliance VMs that do network I/O on behalf of
other guests.  Assign the physical NIC to the appliance VM and then use
a private bridge on the host so guests can only communicate through the
appliance VM.

My Xen knowledge is very limited but I think part of why this
architecture is attractive there is because there is already a mechanism
for granting pages - i.e. secure inter-domain DMA.

In QEMU the architecture is different, so I wonder what you're trying to
achieve.  It will be slower and more convoluted than how it works today.

What are you trying to achieve?

Stefan




Re: [PATCH v7 4/4] KVM: arm/arm64: add irqfd support

2015-01-16 Thread André Przywara
Hi Eric,

On 01/15/2015 02:47 PM, Eric Auger wrote:
> This patch enables irqfd on arm/arm64.
> 
> Both irqfd and resamplefd are supported. Injection is implemented
> in vgic.c without routing.
> 
> This patch enables CONFIG_HAVE_KVM_EVENTFD and CONFIG_HAVE_KVM_IRQFD.
> 
> KVM_CAP_IRQFD is now advertised. KVM_CAP_IRQFD_RESAMPLE capability
> automatically is advertised as soon as CONFIG_HAVE_KVM_IRQFD is set.
> 
> Irqfd injection is restricted to SPI. The rationale behind not
> supporting PPI irqfd injection is that any device using a PPI would
> be a private-to-the-CPU device (timer for instance), so its state
> would have to be context-switched along with the VCPU and would
> require in-kernel wiring anyhow. It is not a relevant use case for
> irqfds.
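
(Editorial illustration: a minimal userspace sketch of wiring an SPI through
KVM_IRQFD under this scheme.  The helper name is hypothetical, the VM file
descriptor and SPI number are assumed to come from the caller, and error
handling is kept to a bare minimum.)

#include <linux/kvm.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Attach an eventfd to a guest SPI; gsi is the SPI index, i.e. the GIC
 * interrupt ID minus 32, as documented by this patch. */
static int attach_spi_irqfd(int vm_fd, uint32_t spi)
{
	struct kvm_irqfd irqfd = { .gsi = spi };
	int efd = eventfd(0, 0);

	if (efd < 0)
		return -1;
	irqfd.fd = efd;
	if (ioctl(vm_fd, KVM_IRQFD, &irqfd) < 0) {
		close(efd);
		return -1;
	}
	return efd;	/* writing a nonzero 8-byte value to efd injects the SPI */
}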
> 
> Signed-off-by: Eric Auger 
> Reviewed-by: Christoffer Dall 
> 
> ---
> 
> v5 -> v6:
> - KVM_CAP_IRQFD support depends on vgic_present
> - add Christoffer's Reviewed-by
> 
> v4 -> v5:
> - squash [PATCH v4 3/3] KVM: arm64: add irqfd support into this patch
> - some rewording in Documentation/virtual/kvm/api.txt and in vgic
>   vgic_process_maintenance unlock comment.
> - move explanation of why not supporting PPI into commit message
> - in case of injection before gic readiness, -ENODEV is returned. It is
>   up to the user space to avoid this situation.
> 
> v3 -> v4:
> - reword commit message
> - explain why we unlock the distributor before calling kvm_notify_acked_irq
> - rename is_assigned_irq into has_notifier
> - change EOI and injection kvm_debug format string
> - remove error local variable in kvm_set_irq
> - Move HAVE_KVM_IRQCHIP unset in a separate patch
> - handle case were the irqfd injection is attempted before the vgic is ready.
>   in such a case the notifier, if any, is called immediatly
> - use nr_irqs to test spi is within correct range
> 
> v2 -> v3:
> - removal of irq.h from eventfd.c put in a separate patch to increase
>   visibility
> - properly expose KVM_CAP_IRQFD capability in arm.c
> - remove CONFIG_HAVE_KVM_IRQCHIP meaningfull only if irq_comm.c is used
> 
> v1 -> v2:
> - rebase on 3.17rc1
> - move of the dist unlock in process_maintenance
> - remove of dist lock in __kvm_vgic_sync_hwstate
> - rewording of the commit message (add resamplefd reference)
> - remove irq.h
> ---
>  Documentation/virtual/kvm/api.txt |  6 +++-
>  arch/arm/include/uapi/asm/kvm.h   |  3 ++
>  arch/arm/kvm/Kconfig  |  2 ++
>  arch/arm/kvm/Makefile |  2 +-
>  arch/arm/kvm/arm.c|  5 
>  arch/arm64/include/uapi/asm/kvm.h |  3 ++
>  arch/arm64/kvm/Kconfig|  2 ++
>  arch/arm64/kvm/Makefile   |  2 +-
>  virt/kvm/arm/vgic.c   | 63 
> ---
>  9 files changed, 81 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index 0007fef..5ed8088 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2231,7 +2231,7 @@ into the hash PTE second double word).
>  4.75 KVM_IRQFD
>  
>  Capability: KVM_CAP_IRQFD
> -Architectures: x86 s390
> +Architectures: x86 s390 arm arm64
>  Type: vm ioctl
>  Parameters: struct kvm_irqfd (in)
>  Returns: 0 on success, -1 on error
> @@ -2257,6 +2257,10 @@ Note that closing the resamplefd is not sufficient to 
> disable the
>  irqfd.  The KVM_IRQFD_FLAG_RESAMPLE is only necessary on assignment
>  and need not be specified with KVM_IRQFD_FLAG_DEASSIGN.
>  
> +On ARM/ARM64, the gsi field in the kvm_irqfd struct specifies the Shared
> +Peripheral Interrupt (SPI) index, such that the GIC interrupt ID is
> +given by gsi + 32.
> +
>  4.76 KVM_PPC_ALLOCATE_HTAB
>  
>  Capability: KVM_CAP_PPC_ALLOC_HTAB
> diff --git a/arch/arm/include/uapi/asm/kvm.h b/arch/arm/include/uapi/asm/kvm.h
> index 09ee408..77547bb 100644
> --- a/arch/arm/include/uapi/asm/kvm.h
> +++ b/arch/arm/include/uapi/asm/kvm.h
> @@ -196,6 +196,9 @@ struct kvm_arch_memory_slot {
>  /* Highest supported SPI, from VGIC_NR_IRQS */
>  #define KVM_ARM_IRQ_GIC_MAX  127
>  
> +/* One single KVM irqchip, ie. the VGIC */
> +#define KVM_NR_IRQCHIPS  1
> +
>  /* PSCI interface */
>  #define KVM_PSCI_FN_BASE 0x95c1ba5e
>  #define KVM_PSCI_FN(n)   (KVM_PSCI_FN_BASE + (n))
> diff --git a/arch/arm/kvm/Kconfig b/arch/arm/kvm/Kconfig
> index 9f581b1..e519a40 100644
> --- a/arch/arm/kvm/Kconfig
> +++ b/arch/arm/kvm/Kconfig
> @@ -24,6 +24,7 @@ config KVM
>   select KVM_MMIO
>   select KVM_ARM_HOST
>   depends on ARM_VIRT_EXT && ARM_LPAE
> + select HAVE_KVM_EVENTFD
>   ---help---
> Support hosting virtualized guest machines. You will also
> need to select one or more of the processor modules below.
> @@ -55,6 +56,7 @@ config KVM_ARM_MAX_VCPUS
>  config KVM_ARM_VGIC
>   bool "KVM support for Virtual GIC"
>   depends on KVM_ARM_HOST && OF
> + select HAVE_KVM_IRQFD
>   default y
>   ---help

Re: [PATCH 1/8] ppc/kvm: Replace ACCESS_ONCE with READ_ONCE

2015-01-16 Thread Christian Borntraeger
On 16.01.2015 at 00:09, Michael Ellerman wrote:
> On Thu, 2015-01-15 at 09:58 +0100, Christian Borntraeger wrote:
>> ACCESS_ONCE does not work reliably on non-scalar types. For
>> example gcc 4.6 and 4.7 might remove the volatile tag for such
>> accesses during the SRA (scalar replacement of aggregates) step
>> (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145)
>>
>> Change the ppc/kvm code to replace ACCESS_ONCE with READ_ONCE.
>>
>> Signed-off-by: Christian Borntraeger 
>> ---
>>  arch/powerpc/kvm/book3s_hv_rm_xics.c |  8 
>>  arch/powerpc/kvm/book3s_xics.c   | 16 
>>  2 files changed, 12 insertions(+), 12 deletions(-)
>>
>> diff --git a/arch/powerpc/kvm/book3s_hv_rm_xics.c 
>> b/arch/powerpc/kvm/book3s_hv_rm_xics.c
>> index 7b066f6..7c22997 100644
>> --- a/arch/powerpc/kvm/book3s_hv_rm_xics.c
>> +++ b/arch/powerpc/kvm/book3s_hv_rm_xics.c
>> @@ -152,7 +152,7 @@ static void icp_rm_down_cppr(struct kvmppc_xics *xics, 
>> struct kvmppc_icp *icp,
>>   * in virtual mode.
>>   */
>>  do {
>> -old_state = new_state = ACCESS_ONCE(icp->state);
>> +old_state = new_state = READ_ONCE(icp->state);
> 
> These are all icp->state.
> 
> Which is a union, but it's only the size of unsigned long. So in practice 
> there
> shouldn't be a bug here right?

This bug was that gcc lost the volatile tag when propagating aggregates to 
scalar types.
So in theory a union could be affected. See the original problem
 ( http://marc.info/?i=54611D86.4040306%40de.ibm.com ) 
which happened on 

union ipte_control {
unsigned long val;
struct {
unsigned long k  : 1;
unsigned long kh : 31;
unsigned long kg : 32;
};
};
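
(Editorial illustration: a stand-alone userspace sketch of the two access
patterns on such a union.  The macro definitions are simplified approximations
of the kernel's, shown only to make the difference visible; the kernel's
READ_ONCE additionally dispatches on the access size.)

#include <string.h>

union ipte_control {
	unsigned long val;
	struct {
		unsigned long k  : 1;
		unsigned long kh : 31;
		unsigned long kg : 32;
	};
};

/* Simplified approximations of the kernel macros, for illustration only. */
#define ACCESS_ONCE(x)	(*(volatile __typeof__(x) *)&(x))   /* volatile cast on the aggregate */

static inline void __read_once(void *dst, const volatile void *src, size_t size)
{
	memcpy(dst, (const void *)src, size);
}

#define READ_ONCE(x)					\
({							\
	__typeof__(x) __val;				\
	__read_once(&__val, &(x), sizeof(__val));	\
	__val;						\
})

int main(void)
{
	union ipte_control ic = { .val = 0 };
	union ipte_control old;

	old = ACCESS_ONCE(ic);	/* gcc 4.6/4.7 could drop the volatile during SRA */
	old = READ_ONCE(ic);	/* copies via a helper; not subject to that optimization */

	return (int)old.val;
}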

Christian




Re: [PATCH v7 1/4] KVM: arm/arm64: unset CONFIG_HAVE_KVM_IRQCHIP

2015-01-16 Thread André Przywara
On 01/15/2015 02:47 PM, Eric Auger wrote:
> CONFIG_HAVE_KVM_IRQCHIP is needed to support IRQ routing (along
> with irq_comm.c and irqchip.c usage). This is not the case for
> arm/arm64 currently.
> 
> This patch unsets the flag for both arm and arm64.
> 
> Signed-off-by: Eric Auger 
> Acked-by: Christoffer Dall 
> Acked-by: Will Deacon 

Given its current usage in the kernel, HAVE_KVM_IRQCHIP seems to be
grossly misnamed these days, but that's certainly not the duty of this
patch to solve. So:

Reviewed-by: Andre Przywara 

Cheers,
Andre.

> ---
>  arch/arm/kvm/Kconfig   | 2 --
>  arch/arm64/kvm/Kconfig | 1 -
>  2 files changed, 3 deletions(-)
> 
> diff --git a/arch/arm/kvm/Kconfig b/arch/arm/kvm/Kconfig
> index 466bd29..9f581b1 100644
> --- a/arch/arm/kvm/Kconfig
> +++ b/arch/arm/kvm/Kconfig
> @@ -55,7 +55,6 @@ config KVM_ARM_MAX_VCPUS
>  config KVM_ARM_VGIC
>   bool "KVM support for Virtual GIC"
>   depends on KVM_ARM_HOST && OF
> - select HAVE_KVM_IRQCHIP
>   default y
>   ---help---
> Adds support for a hardware assisted, in-kernel GIC emulation.
> @@ -63,7 +62,6 @@ config KVM_ARM_VGIC
>  config KVM_ARM_TIMER
>   bool "KVM support for Architected Timers"
>   depends on KVM_ARM_VGIC && ARM_ARCH_TIMER
> - select HAVE_KVM_IRQCHIP
>   default y
>   ---help---
> Adds support for the Architected Timers in virtual machines
> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> index 8ba85e9..279e1a0 100644
> --- a/arch/arm64/kvm/Kconfig
> +++ b/arch/arm64/kvm/Kconfig
> @@ -50,7 +50,6 @@ config KVM_ARM_MAX_VCPUS
>  config KVM_ARM_VGIC
>   bool
>   depends on KVM_ARM_HOST && OF
> - select HAVE_KVM_IRQCHIP
>   ---help---
> Adds support for a hardware assisted, in-kernel GIC emulation.
>  
> 



Re: [PATCH v7 2/4] KVM: introduce kvm_arch_intc_initialized and use it in irqfd

2015-01-16 Thread André Przywara
Hi Eric,

On 01/15/2015 02:47 PM, Eric Auger wrote:
> Introduce __KVM_HAVE_ARCH_INTC_INITIALIZED define and
> associated kvm_arch_intc_initialized function. This latter
> allows to test whether the virtual interrupt controller is initialized
> and ready to accept virtual IRQ injection. On some architectures,
> the virtual interrupt controller is dynamically instantiated, justifying
> that kind of check.
> 
> The new function can now be used by irqfd to check whether the
> virtual interrupt controller is ready on KVM_IRQFD request. If not,
> KVM_IRQFD returns -EAGAIN.
> 
> Signed-off-by: Eric Auger 
> Acked-by: Christoffer Dall 
> 
> ---
> v6 -> v7:
> - From now on, kvm_irqfd_assign calls kvm_arch_intc_initialized
>   (previously introduced in subsequent patch file).
> - add Christoffer's ack
> 
> v5 -> v6:
> - rename function name and macro
> - add kvm_arch_intc_initialized declaration in case the archi defines
>   the macro
> ---
>  include/linux/kvm_host.h | 14 ++
>  virt/kvm/eventfd.c   |  3 +++
>  2 files changed, 17 insertions(+)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 26f1060..702036a 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -706,6 +706,20 @@ static inline wait_queue_head_t *kvm_arch_vcpu_wq(struct 
> kvm_vcpu *vcpu)
>  #endif
>  }
>  
> +#ifdef __KVM_HAVE_ARCH_INTC_INITIALIZED
> +/*
> + * returns trues if the virtual interrupt controller is initialized and

Nit:  true

Other than that:
Reviewed-by: Andre Przywara 

Cheers,
Andre.


> + * ready to accept virtual IRQ. On some architectures the virtual interrupt
> + * controller is dynamically instantiated and this is not always true.
> + */
> +bool kvm_arch_intc_initialized(struct kvm *kvm);
> +#else
> +static inline bool kvm_arch_intc_initialized(struct kvm *kvm)
> +{
> + return true;
> +}
> +#endif
> +
>  int kvm_arch_init_vm(struct kvm *kvm, unsigned long type);
>  void kvm_arch_destroy_vm(struct kvm *kvm);
>  void kvm_arch_sync_events(struct kvm *kvm);
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index 148b239..fc5f43e 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -311,6 +311,9 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
>   unsigned int events;
>   int idx;
>  
> + if (!kvm_arch_intc_initialized(kvm))
> + return -EAGAIN;
> +
>   irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
>   if (!irqfd)
>   return -ENOMEM;
> 



Re: [PATCH v7 3/4] KVM: arm/arm64: implement kvm_arch_intc_initialized

2015-01-16 Thread André Przywara
On 01/15/2015 02:47 PM, Eric Auger wrote:
> On arm/arm64 the VGIC is dynamically instantiated and it is useful
> to expose its state, especially for irqfd setup.
> 
> This patch defines __KVM_HAVE_ARCH_INTC_INITIALIZED and
> implements kvm_arch_intc_initialized.
> 
> Signed-off-by: Eric Auger 
> Acked-by: Christoffer Dall 

Reviewed-by: Andre Przywara 

Cheers,
Andre.



Re: [PATCH 1/1] arch/x86/kvm/vmx.c: Fix external interrupts inject directly bug with guestos RFLAGS.IF=0

2015-01-16 Thread Li Kaihang
Hello, please see the answers below:



From:   Radim Krčmář 
To: Li Kaihang ,
Cc: g...@kernel.org, pbonz...@redhat.com, t...@linutronix.de, 
mi...@redhat.com, h...@zytor.com, x...@kernel.org, kvm@vger.kernel.org, 
linux-ker...@vger.kernel.org
Date:   2015-01-16 02:09 AM
Subject:Re: [PATCH 1/1] arch/x86/kvm/vmx.c: Fix external interrupts 
inject directly bug with guestos RFLAGS.IF=0



2015-01-15 20:36+0800, Li Kaihang:
> This patch fixes an external interrupt injection bug in Linux 3.19-rc4.

Was the bug introduced in earlier 3.19 release candidate?

Li Kaihang: Yes, we also found this problem in 2.6.

> GuestOS is running and handling some interrupt with RFLAGS.IF = 0 when an
> external interrupt arrives; this can lead to a VM exit. In this case, we must
> avoid injecting this external interrupt, or it will generate a processor
> hardware exception and crash the virtual machine.

What is the source of this exception?  (Is there a reproducer?)

Li Kaihang: the exception is produced by the Intel processor hardware because
injecting an external interrupt vector is forbidden by the processor when
GuestOS RFLAGS.IF = 0; the hypervisor software must ensure this, according to
the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3.
This bug occurs with a certain probability: if the code between cli and sti in
a guest OS's interrupt handler is very short, the probability of occurrence is
very low. The bug is like a moving trap: it is produced when the guest OS is
running between a cli and an sti instruction while an external interrupt
arrives. It may be verified by constructing a special guest OS interrupt
handler. A general OS running on a KVM VM also has some probability of hitting
this bug.

> Now, I show more details about this problem:
>
> A general external interrupt processing for a running virtual machine is 
> shown in the following:
>
> Step 1:
>  a ext intr gen a vm_exit --> vmx_complete_interrupts --> 
> __vmx_complete_interrupts --> case INTR_TYPE_EXT_INTR: 
> kvm_queue_interrupt(vcpu, vector, type == INTR_TYPE_SOFT_INTR);
>
> Step 2:
>  kvm_x86_ops->handle_external_intr(vcpu);
>
> Step 3:
>  get back to vcpu_enter_guest after a while cycle,then run 
> inject_pending_event
>
> Step 4:
>  if (vcpu->arch.interrupt.pending) {
>kvm_x86_ops->set_irq(vcpu);
>return 0;
>}
>
> Step 5:
>  kvm_x86_ops->run(vcpu) --> vm_entry inject vector to guestos IDT
>
> For the above steps, steps 4 and 5 will cause a processor hardware exception
> if step 1 happens while guest OS RFLAGS.IF = 0, that is to say, while guest OS
> interrupts are disabled.
> So we should add logic in step 1 to judge whether an external interrupt needs
> to be pended and then injected directly. In this process, we don't need to
> worry about losing the external interrupt, because the next step 2 will handle
> it and choose the best chance to inject it via the virtual interrupt
> controller.

Can you explain the relation between vectored events (Step 1) and
external interrupts (Step 2)?
(The bug happens when external interrupt arrives during event delivery?)

Li Kaihang: an external interrupt to a running VM can trigger a vm_exit event,
handled in step 1; then the interrupt vector is processed in step 2 by
kvm_x86_ops->handle_external_intr(vcpu), which jumps to the host OS IDT to
complete the external interrupt handling. The external interrupt handler in the
host OS IDT may inject the external interrupt into the virtual interrupt
controller if it has been registered as needed by the virtual machine.
The bug never happens in steps 1 and 2, but vcpu->arch.interrupt.pending is set
in step 1; if this pending interrupt should not be injected, it will still be
passed on to step 4 to complete the dangerous external interrupt injection.
Please see the above answer for what "pending should not be injected" means.
Our solution is to clear the invalid external interrupt pending state,
preventing the erroneous injection, by adding a logical judgement in step 1.

Why isn't the delivered event lost?
(It should be different from the external interrupt.)

Li Kaihang: please refer to the above answer; an external interrupt in step 1
can only reach the case INTR_TYPE_EXT_INTR branch in the patch code, so it
should not affect the delivery of other event types. But there is another
possibility: the external interrupt needed by the running VM is not registered
in the host OS IDT handler chain. Of course, that situation is a separate
problem; even so, it is a dangerous action to inject the external interrupt
directly without judging the current guest OS RFLAGS.IF state.

Thanks.

>
>
> Signed-off-by: Li kaihang 
> ---
>  arch/x86/kvm/vmx.c |   20 ++--
>  1 files changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index d4c58d8..e8311ee 100644
> ---