Re: [PATCH v2 1/2] KVM: nVMX: enhance allocate/free_vpid to handle shadow vpid

2015-09-15 Thread Wanpeng Li

On 9/15/15 8:54 PM, Paolo Bonzini wrote:


On 15/09/2015 12:30, Wanpeng Li wrote:

+   if (!nested) {
+   vpid = find_first_zero_bit(vmx_vpid_bitmap, VMX_NR_VPIDS);
+   if (vpid < VMX_NR_VPIDS) {
vmx->vpid = vpid;
__set_bit(vpid, vmx_vpid_bitmap);
+   }
+   } else {
+   vpid = find_first_zero_bit(vmx_vpid_bitmap, VMX_NR_VPIDS);
+   if (vpid < VMX_NR_VPIDS) {
+   vmx->nested.vpid02 = vpid;
+   __set_bit(vpid, vmx_vpid_bitmap);
+   }

Messy indentation, and a lot of duplicate code.  Can you instead have
(which I think was Jan's suggestion too):

static int allocate_vpid(void);
static void free_vpid(int vpid);


I see, done in v3.



That said, I like the simple solution to the "too many VPIDs for each L1
VCPU" problem.


Thanks. :-)

Regards,
Wanpeng Li

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: nVMX: nested VPID emulation

2015-09-15 Thread Jan Kiszka
On 2015-09-16 04:36, Wanpeng Li wrote:
> On 9/16/15 1:32 AM, Jan Kiszka wrote:
>> On 2015-09-15 12:14, Wanpeng Li wrote:
>>> On 9/14/15 10:54 PM, Jan Kiszka wrote:
>>>> Last but not least: the guest can now easily exhaust the host's pool of
>>>> vpid by simply spawning plenty of VCPUs for L2, no? Is this acceptable
>>>> or should there be some limit?
>>> I reuse the value of vpid02 while vpid12 is changed w/ one invvpid in
>>> v2, so the scenario which you pointed out can be avoided.
>> I cannot yet follow why there is no chance for L1 to consume all vpids
>> that the host manages in that single, global bitmap by simply spawning a
>> lot of nested VCPUs for some L2. What is enforcing L1 to call nested
>> vmclear - apparently the only way, besides destructing nested VCPUs, to
>> release such vpids again?
> 
> In v2, there is no direct mapping between vpid02 and vpid12; vpid02
> is per-vCPU for L0 and reused, while the value of vpid12 is changed w/
> one invvpid during nested vmentry. vpid12 is allocated by L1 for L2,
> so it will not influence the global bitmap (for vpid01 and vpid02
> allocation) even if L1 spawns a lot of nested vCPUs.

Ah, I see, you limit allocation to one additional host-side vpid per
VCPU, for nesting. That looks better. That also means all vpids for L2
will be folded on that single vpid in hardware, right? So the major
benefit comes from having separate vpids when switching between L1 and
L2, in fact.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


Re: [Qemu-devel] [PATCH 1/2] target-i386: disable LINT0 after reset

2015-09-15 Thread Jan Kiszka
On 2015-09-15 23:19, Alex Williamson wrote:
> On Mon, 2015-04-13 at 02:32 +0300, Nadav Amit wrote:
>> Due to an old SeaBIOS bug, QEMU re-enables LINT0 after reset. This bug
>> is long gone, and therefore this hack is no longer needed. Since it
>> violates the specifications, it is removed.
>>
>> Signed-off-by: Nadav Amit 
>> ---
>>  hw/intc/apic_common.c | 9 -
>>  1 file changed, 9 deletions(-)
> 
> Please see bug: https://bugs.launchpad.net/qemu/+bug/1488363
> 
> Is this bug perhaps not as long gone as we thought, or is there
> something else going on here?  Thanks,

I would say, someone needs to check if the SeaBIOS line that is supposed
to enable LINT0 is actually executed on one of the broken systems and,
if not, why not.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


[PATCH v3 1/2] KVM: nVMX: enhance allocate/free_vpid to handle shadow vpid

2015-09-15 Thread Wanpeng Li
Enhance allocate/free_vpid to handle shadow vpid.

Signed-off-by: Wanpeng Li 
---
 arch/x86/kvm/vmx.c | 24 +++-
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 9ff6a3f..4956081 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4155,29 +4155,27 @@ static int alloc_identity_pagetable(struct kvm *kvm)
return r;
 }
 
-static void allocate_vpid(struct vcpu_vmx *vmx)
+static int allocate_vpid(void)
 {
-   int vpid;
+   int vpid = 0;
 
-   vmx->vpid = 0;
if (!enable_vpid)
-   return;
+   return 0;
spin_lock(&vmx_vpid_lock);
vpid = find_first_zero_bit(vmx_vpid_bitmap, VMX_NR_VPIDS);
-   if (vpid < VMX_NR_VPIDS) {
-   vmx->vpid = vpid;
+   if (vpid < VMX_NR_VPIDS)
__set_bit(vpid, vmx_vpid_bitmap);
-   }
spin_unlock(&vmx_vpid_lock);
+   return vpid;
 }
 
-static void free_vpid(struct vcpu_vmx *vmx)
+static void free_vpid(int vpid)
 {
if (!enable_vpid)
return;
spin_lock(&vmx_vpid_lock);
-   if (vmx->vpid != 0)
-   __clear_bit(vmx->vpid, vmx_vpid_bitmap);
+   if (vpid != 0)
+   __clear_bit(vpid, vmx_vpid_bitmap);
spin_unlock(&vmx_vpid_lock);
 }
 
@@ -8482,7 +8480,7 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
 
if (enable_pml)
vmx_disable_pml(vmx);
-   free_vpid(vmx);
+   free_vpid(vmx->vpid);
leave_guest_mode(vcpu);
vmx_load_vmcs01(vcpu);
free_nested(vmx);
@@ -8501,7 +8499,7 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
if (!vmx)
return ERR_PTR(-ENOMEM);
 
-   allocate_vpid(vmx);
+   vmx->vpid = allocate_vpid();
 
	err = kvm_vcpu_init(&vmx->vcpu, kvm, id);
if (err)
@@ -8577,7 +8575,7 @@ free_msrs:
 uninit_vcpu:
	kvm_vcpu_uninit(&vmx->vcpu);
 free_vcpu:
-   free_vpid(vmx);
+   free_vpid(vmx->vpid);
kmem_cache_free(kvm_vcpu_cache, vmx);
return ERR_PTR(err);
 }
-- 
1.9.1



[PATCH v3 0/2] KVM: nested VPID emulation

2015-09-15 Thread Wanpeng Li
v2 -> v3:
 * enhance allocate/free_vpid as Jan's suggestion
 * add more comments to 2/2

v1 -> v2:
 * enhance allocate/free_vpid to handle shadow vpid
 * drop empty space
 * allocate shadow vpid during initialization
 * For each nested vmentry, if vpid12 is changed, reuse shadow vpid w/ an 
   invvpid.

VPID is used to tag the address space and avoid a TLB flush. Currently L0
uses the same VPID to run L1 and all its guests. KVM flushes the VPID when
switching between L1 and L2.

This patch advertises VPID to the L1 hypervisor, so that the address spaces
of L1 and L2 can be treated separately and a TLB flush avoided when
switching between L1 and L2.

Performance: 

run lmbench on L2 w/ 3.5 kernel.

Context switching - times in microseconds - smaller is better
-
Host OS  2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
 ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
- - -- -- -- -- -- --- ---
kernelLinux 3.5.0-1 1.2200 1.3700 1.4500 4.7800 2.3300 5.6 2.88000  
nested VPID 
kernelLinux 3.5.0-1 1.2600 1.4300 1.5600   12.7   12.9 3.49000 7.46000  
vanilla

Wanpeng Li (2):
  KVM: nVMX: enhance allocate/free_vpid to handle shadow vpid
  KVM: nVMX: nested VPID emulation

 arch/x86/kvm/vmx.c | 61 +-
 1 file changed, 42 insertions(+), 19 deletions(-)

-- 
1.9.1



[PATCH v3 2/2] KVM: nVMX: nested VPID emulation

2015-09-15 Thread Wanpeng Li
VPID is used to tag the address space and avoid a TLB flush. Currently L0
uses the same VPID to run L1 and all its guests. KVM flushes the VPID when
switching between L1 and L2.

This patch advertises VPID to the L1 hypervisor, so that the address spaces
of L1 and L2 can be treated separately and a TLB flush avoided when
switching between L1 and L2. For each nested vmentry, if vpid12 is changed,
reuse the shadow vpid w/ one invvpid.

Performance: 

run lmbench on L2 w/ 3.5 kernel.

Context switching - times in microseconds - smaller is better
-
Host OS  2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
 ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
- - -- -- -- -- -- --- ---
kernelLinux 3.5.0-1 1.2200 1.3700 1.4500 4.7800 2.3300 5.6 2.88000  
nested VPID 
kernelLinux 3.5.0-1 1.2600 1.4300 1.5600   12.7   12.9 3.49000 7.46000  
vanilla

Suggested-by: Wincy Van 
Signed-off-by: Wanpeng Li 
---
 arch/x86/kvm/vmx.c | 37 +++--
 1 file changed, 31 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 4956081..2fd5b5e 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -424,6 +424,9 @@ struct nested_vmx {
/* to migrate it to L2 if VM_ENTRY_LOAD_DEBUG_CONTROLS is off */
u64 vmcs01_debugctl;
 
+   u16 vpid02;
+   u16 last_vpid;
+
u32 nested_vmx_procbased_ctls_low;
u32 nested_vmx_procbased_ctls_high;
u32 nested_vmx_true_procbased_ctls_low;
@@ -1155,6 +1158,11 @@ static inline bool nested_cpu_has_virt_x2apic_mode(struct vmcs12 *vmcs12)
return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE);
 }
 
+static inline bool nested_cpu_has_vpid(struct vmcs12 *vmcs12)
+{
+   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_VPID);
+}
+
 static inline bool nested_cpu_has_apic_reg_virt(struct vmcs12 *vmcs12)
 {
return nested_cpu_has2(vmcs12, SECONDARY_EXEC_APIC_REGISTER_VIRT);
@@ -2469,6 +2477,7 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx)
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
SECONDARY_EXEC_RDTSCP |
SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
+   SECONDARY_EXEC_ENABLE_VPID |
SECONDARY_EXEC_APIC_REGISTER_VIRT |
SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
SECONDARY_EXEC_WBINVD_EXITING |
@@ -6662,6 +6671,7 @@ static void free_nested(struct vcpu_vmx *vmx)
return;
 
vmx->nested.vmxon = false;
+   free_vpid(vmx->nested.vpid02);
nested_release_vmcs12(vmx);
if (enable_shadow_vmcs)
free_vmcs(vmx->nested.current_shadow_vmcs);
@@ -8547,8 +8557,10 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
goto free_vmcs;
}
 
-   if (nested)
+   if (nested) {
nested_vmx_setup_ctls_msrs(vmx);
+   vmx->nested.vpid02 = allocate_vpid();
+   }
 
vmx->nested.posted_intr_nv = -1;
vmx->nested.current_vmptr = -1ull;
@@ -8569,6 +8581,7 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
	return &vmx->vcpu;
 
 free_vmcs:
+   free_vpid(vmx->nested.vpid02);
free_loaded_vmcs(vmx->loaded_vmcs);
 free_msrs:
kfree(vmx->guest_msrs);
@@ -9444,12 +9457,24 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 
if (enable_vpid) {
/*
-* Trivially support vpid by letting L2s share their parent
-* L1's vpid. TODO: move to a more elaborate solution, giving
-* each L2 its own vpid and exposing the vpid feature to L1.
+* There is no direct mapping between vpid02 and vpid12; vpid02
+* is per-vCPU for L0 and reused while the value of vpid12 is
+* changed w/ one invvpid during nested vmentry. vpid12 is
+* allocated by L1 for L2, so it will not influence the global
+* bitmap (for vpid01 and vpid02 allocation) even if L1 spawns
+* a lot of nested vCPUs.
 */
-   vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
-   vmx_flush_tlb(vcpu);
+   if (nested_cpu_has_vpid(vmcs12)) {
+   vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->nested.vpid02);
+   if (vmcs12->virtual_processor_id != vmx->nested.last_vpid) {
+   vmx->nested.last_vpid = vmcs12->virtual_processor_id;
+   vmx_flush_tlb(vcpu);
+   }
+   } else {
+   vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
+   vmx_flush_tlb(vcpu);
+

Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-15 Thread Paul E. McKenney
On Tue, Sep 15, 2015 at 09:24:15PM -0400, Tejun Heo wrote:
> Hello, Paul.
> 
> On Tue, Sep 15, 2015 at 04:38:18PM -0700, Paul E. McKenney wrote:
> > Well, the decision as to what is too big for -stable is owned by the
> > -stable maintainers, not by me.
> 
> Is it tho?  Usually the subsystem maintainer knows the best and has
> most say in it.  I was mostly curious whether you'd think that the
> changes would be too risky.  If not, great.

I do hope that they would listen to what I thought about it, but at
the end of the day, it is the -stable maintainers who pull a given
patch, or don't.

> > I am suggesting trying the options and seeing what works best, then
> > working to convince people as needed.
> 
> Yeah, sure thing.  Let's wait for Christian.

Indeed.  Is there enough benefit to risk jamming this thing into 4.3?
I believe that 4.4 should be a no-brainer.

Thanx, Paul



Re: [PATCH] KVM: nVMX: nested VPID emulation

2015-09-15 Thread Wanpeng Li

On 9/16/15 1:32 AM, Jan Kiszka wrote:

On 2015-09-15 12:14, Wanpeng Li wrote:

On 9/14/15 10:54 PM, Jan Kiszka wrote:

Last but not least: the guest can now easily exhaust the host's pool of
vpid by simply spawning plenty of VCPUs for L2, no? Is this acceptable
or should there be some limit?

I reuse the value of vpid02 while vpid12 is changed w/ one invvpid in v2,
so the scenario which you pointed out can be avoided.

I cannot yet follow why there is no chance for L1 to consume all vpids
that the host manages in that single, global bitmap by simply spawning a
lot of nested VCPUs for some L2. What is enforcing L1 to call nested
vmclear - apparently the only way, besides destructing nested VCPUs, to
release such vpids again?


In v2, there is no direct mapping between vpid02 and vpid12; vpid02
is per-vCPU for L0 and reused, while the value of vpid12 is changed w/
one invvpid during nested vmentry. vpid12 is allocated by L1 for L2,
so it will not influence the global bitmap (for vpid01 and vpid02
allocation) even if L1 spawns a lot of nested vCPUs.


Regards,
Wanpeng Li



[Bug 104631] New: Error on walk_shadow_page_get_mmio_spte when starting Qemu

2015-09-15 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=104631

Bug ID: 104631
   Summary: Error on walk_shadow_page_get_mmio_spte when starting
Qemu
   Product: Virtualization
   Version: unspecified
Kernel Version: 4.3.0-rc1
  Hardware: All
OS: Linux
  Tree: Mainline
Status: NEW
  Severity: normal
  Priority: P1
 Component: kvm
  Assignee: virtualization_...@kernel-bugs.osdl.org
  Reporter: ta...@tasossah.com
Regression: Yes

Created attachment 187691
  --> https://bugzilla.kernel.org/attachment.cgi?id=187691&action=edit
Output from Qemu

Ever since I upgraded to 4.3-rc1 I have been unable to run Qemu with KVM.

The launch command for qemu used is:
QEMU_SDL_SAMPLES=3072 SDL_AUDIODRIVER=pulse QEMU_AUDIO_DRV=sdl
qemu-system-x86_64 -enable-kvm -M q35 -m 4096 -cpu host -smp
6,sockets=1,cores=6,threads=1 -bios /usr/share/seabios/bios.bin -device
ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=1,chassis=1,id=root.1 -drive
file=vm_image.img,id=disk,format=raw -soundhw ac97

The version of Qemu is:
QEMU emulator version 2.2.0 (Debian 1:2.2+dfsg-5expubuntu9.4)

This kernel has been custom built from commit
6ff33f3902c3b1c5d0db6b1e2c70b6d76fba357f on Linus' kernel repo, with the
only change being a patch that removes the 40-wire check on PATA controllers.

Hardware has AMD-V as well as IOMMU enabled. Motherboard is an ASRock 990FX
Extreme 4 with an FX8350.

Looking at the commit logs, it seems that commit
47ab8751695f71d85562ff497e21b521c789247c
(
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/arch/x86/kvm/mmu.c?id=47ab8751695f71d85562ff497e21b521c789247c
) might be related to this problem.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.


[Bug 104631] Error on walk_shadow_page_get_mmio_spte when starting Qemu

2015-09-15 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=104631

--- Comment #1 from Tasos Sahanidis  ---
Created attachment 187701
  --> https://bugzilla.kernel.org/attachment.cgi?id=187701&action=edit
Output from dmesg



Re: Questions about KVM TSC trapping

2015-09-15 Thread Wanpeng Li

On 9/16/15 6:00 AM, David Matlack wrote:

On Tue, Sep 15, 2015 at 12:04 AM, Oliver Yang  wrote:

Hi Guys,

I found below patch for KVM TSC trapping / migration support,

https://lkml.org/lkml/2011/1/6/90

It seemed the patch were not merged in Linux mainline.

So I have 3 questions here,

1.  Can KVM support TSC trapping today? If not, what is the plan?

Not without a patch. Did you want to trap RDTSC? RDTSC works without
trapping thanks to hardware support.


2. What is the solution if my SMP Linux guest OS doesn't have reliable
TSC?

If you are seeing an unreliable TSC in your guest, maybe your host
hardware does not support a synchronized TSC across CPUs. I can't
recall how to check for this. There might be a flag in your host's
/proc/cpuinfo.


constant_tsc




Because there is no TSC trapping support, will the kvmclock driver handle
all TSC sync issues?

3. What if my Linux guest doesn't have kvmclock driver?

The guest will use a different clock source (e.g. acpi-pm). Note the
RDTSC[P] instruction will still work just fine.


Does that mean I shouldn't run TSC sensitive application in my guest
OS?

BTW, my application is written with lots of rdtsc instructions, and
which performs well in VMware guest.

Does it not work well in KVM?


Apply for an urgent loan offer at 1.1%

2015-09-15 Thread BancoPosta Online Loans



--
BancoPosta Loans
Viale Europa,
175-00144 Roma,
Italy.

Email: bancopost...@gmail.com

Good day, ladies and gentlemen,

Do you need a loan for a specific purpose?

BancoPosta Bank in Italy has a favorable loan for you. We offer secured 
and unsecured personal loans to customers for any purpose, whether you 
are starting a new business, as a private individual or a company, or 
you have a large project in mind. For a personal loan that you can use 
anywhere for anything, you should have the freedom that the BancoPosta 
Bank loan program can offer. With a simplified loan application, your 
BancoPosta loan is approved on the same day you apply, at the low rate 
of 1.1%, with flexible loan terms (1 year to 20 years), borrowing from 
10,000.00 Euro/Chf up to 100,000,000.00 Euro/Chf for a specific purpose. 
If you need a loan, fill out the application form to get started.


Your full name: ___
Your address: ___
Your loan amount: ___
Loan purpose: ___
Loan duration: ___

Apply now and leave the rest to us!

Thank you for your patronage.

Our goal is to meet your financial needs!

All the best,
Massimo Sarmi.


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-15 Thread Tejun Heo
Hello, Paul.

On Tue, Sep 15, 2015 at 04:38:18PM -0700, Paul E. McKenney wrote:
> Well, the decision as to what is too big for -stable is owned by the
> -stable maintainers, not by me.

Is it tho?  Usually the subsystem maintainer knows the best and has
most say in it.  I was mostly curious whether you'd think that the
changes would be too risky.  If not, great.

> I am suggesting trying the options and seeing what works best, then
> working to convince people as needed.

Yeah, sure thing.  Let's wait for Christian.

Thanks.

-- 
tejun


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-15 Thread Tejun Heo
Hello,

On Tue, Sep 15, 2015 at 02:38:30PM -0700, Paul E. McKenney wrote:
> I did take a shot at adding the rcu_sync stuff during this past merge
> window, but it did not converge quickly enough to make it.  It looks
> quite good for the next merge window.  There have been changes in most
> of the relevant areas, so probably best to just try them and see which
> works best.

Heh, I'm having a bit of trouble following.  Are you saying that the
changes would be too big for -stable?  If so, I'll send out reverts of
the culprit patches and then reapply them for this cycle so that it
can land together with the rcu changes in the next merge window, but
it'd be great to find out whether the rcu changes are enough for the
issue that Christian is seeing to go away.  If not, I'll switch to a
different locking scheme and mark those patches w/ stable tag.

Thanks!

-- 
tejun


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-15 Thread Paul E. McKenney
On Tue, Sep 15, 2015 at 06:28:11PM -0400, Tejun Heo wrote:
> Hello,
> 
> On Tue, Sep 15, 2015 at 02:38:30PM -0700, Paul E. McKenney wrote:
> > I did take a shot at adding the rcu_sync stuff during this past merge
> > window, but it did not converge quickly enough to make it.  It looks
> > quite good for the next merge window.  There have been changes in most
> > of the relevant areas, so probably best to just try them and see which
> > works best.
> 
> Heh, I'm having a bit of trouble following.  Are you saying that the
> changes would be too big for -stable?  If so, I'll send out reverts of
> the culprit patches and then reapply them for this cycle so that it
> can land together with the rcu changes in the next merge window, but
> it'd be great to find out whether the rcu changes are enough for the
> issue that Christian is seeing to go away.  If not, I'll switch to a
> different locking scheme and mark those patches w/ stable tag.

Well, the decision as to what is too big for -stable is owned by the
-stable maintainers, not by me.

I am suggesting trying the options and seeing what works best, then
working to convince people as needed.

Thanx, Paul



Re: Questions about KVM TSC trapping

2015-09-15 Thread David Matlack
On Tue, Sep 15, 2015 at 12:04 AM, Oliver Yang  wrote:
> Hi Guys,
>
> I found below patch for KVM TSC trapping / migration support,
>
> https://lkml.org/lkml/2011/1/6/90
>
> It seemed the patch were not merged in Linux mainline.
>
> So I have 3 questions here,
>
> 1.  Can KVM support TSC trapping today? If not, what is the plan?

Not without a patch. Did you want to trap RDTSC? RDTSC works without
trapping thanks to hardware support.

>
> 2. What is the solution if my SMP Linux guest OS doesn't have reliable
> TSC?

If you are seeing an unreliable TSC in your guest, maybe your host
hardware does not support a synchronized TSC across CPUs. I can't
recall how to check for this. There might be a flag in your host's
/proc/cpuinfo.

>
> Because there is no TSC trapping support, will the kvmclock driver
> handle all TSC sync issues?
>
> 3. What if my Linux guest doesn't have kvmclock driver?

The guest will use a different clock source (e.g. acpi-pm). Note the
RDTSC[P] instruction will still work just fine.

>
> Does that mean I shouldn't run TSC sensitive application in my guest
> OS?
>
> BTW, my application is written with lots of rdtsc instructions, and
> which performs well in VMware guest.

Does it not work well in KVM?


Re: [PATCH V6 1/6] kvm: don't try to register to KVM_FAST_MMIO_BUS for non mmio eventfd

2015-09-15 Thread Cornelia Huck
On Tue, 15 Sep 2015 14:41:54 +0800
Jason Wang  wrote:

> We only want zero-length mmio eventfds to be registered on
> KVM_FAST_MMIO_BUS, so check this explicitly when arg->len is zero to
> make sure of this.
> 
> Cc: sta...@vger.kernel.org
> Cc: Gleb Natapov 
> Cc: Paolo Bonzini 
> Signed-off-by: Jason Wang 
> ---
>  virt/kvm/eventfd.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Reviewed-by: Cornelia Huck 



Re: [PATCH] KVM: fix polling for guest halt continued even if disable it

2015-09-15 Thread Christian Borntraeger
Am 14.09.2015 um 11:38 schrieb Wanpeng Li:
> If there is already some polling ongoing, it's impossible to disable 
> polling, since as soon as somebody sets halt_poll_ns to 0, polling will 
> never stop: grow and shrink are only handled if halt_poll_ns is != 0.
> 
> This patch fixes it by resetting vcpu->halt_poll_ns in order to stop 
> polling when polling is disabled.
> 
> Reported-by: Christian Borntraeger 
> Signed-off-by: Wanpeng Li 
> ---
>  virt/kvm/kvm_main.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 4662a88..f756cac0 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2012,7 +2012,8 @@ out:
>   else if (vcpu->halt_poll_ns < halt_poll_ns &&
>   block_ns < halt_poll_ns)
>   grow_halt_poll_ns(vcpu);
> - }
> + } else
> + vcpu->halt_poll_ns = 0;
> 
>   trace_kvm_vcpu_wakeup(block_ns, waited);
>  }
> 

Tested-by: Christian Borntraeger 



[PATCH V6 4/6] kvm: fix zero length mmio searching

2015-09-15 Thread Jason Wang
Currently, if we have a zero-length mmio eventfd assigned on
KVM_MMIO_BUS, it will never be found by kvm_io_bus_cmp(), since that
always compares the kvm_io_range() with the length that the guest
wrote. This causes, e.g. for vhost, the kick to be trapped by qemu
userspace instead of vhost. Fix this by using zero length if an
iodevice is zero length.

Cc: sta...@vger.kernel.org
Cc: Gleb Natapov 
Cc: Paolo Bonzini 
Signed-off-by: Jason Wang 
---
 virt/kvm/kvm_main.c | 19 +--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index eb4c9d2..9af68db 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3157,10 +3157,25 @@ static void kvm_io_bus_destroy(struct kvm_io_bus *bus)
 static inline int kvm_io_bus_cmp(const struct kvm_io_range *r1,
 const struct kvm_io_range *r2)
 {
-   if (r1->addr < r2->addr)
+   gpa_t addr1 = r1->addr;
+   gpa_t addr2 = r2->addr;
+
+   if (addr1 < addr2)
return -1;
-   if (r1->addr + r1->len > r2->addr + r2->len)
+
+   /* If r2->len == 0, match the exact address.  If r2->len != 0,
+* accept any overlapping write.  Any order is acceptable for
+* overlapping ranges, because kvm_io_bus_get_first_dev ensures
+* we process all of them.
+*/
+   if (r2->len) {
+   addr1 += r1->len;
+   addr2 += r2->len;
+   }
+
+   if (addr1 > addr2)
return 1;
+
return 0;
 }
 
-- 
2.1.4



[PATCH V6 2/6] kvm: factor out core eventfd assign/deassign logic

2015-09-15 Thread Jason Wang
This patch factors out core eventfd assign/deassign logic and leaves
the argument checking and bus index selection to callers.

Cc: sta...@vger.kernel.org
Cc: Gleb Natapov 
Cc: Paolo Bonzini 
Signed-off-by: Jason Wang 
---
 virt/kvm/eventfd.c | 85 --
 1 file changed, 50 insertions(+), 35 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index e404806..0829c7f 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -771,40 +771,14 @@ static enum kvm_bus ioeventfd_bus_from_flags(__u32 flags)
return KVM_MMIO_BUS;
 }
 
-static int
-kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
+static int kvm_assign_ioeventfd_idx(struct kvm *kvm,
+   enum kvm_bus bus_idx,
+   struct kvm_ioeventfd *args)
 {
-   enum kvm_bus  bus_idx;
-   struct _ioeventfd*p;
-   struct eventfd_ctx   *eventfd;
-   int   ret;
-
-   bus_idx = ioeventfd_bus_from_flags(args->flags);
-   /* must be natural-word sized, or 0 to ignore length */
-   switch (args->len) {
-   case 0:
-   case 1:
-   case 2:
-   case 4:
-   case 8:
-   break;
-   default:
-   return -EINVAL;
-   }
-
-   /* check for range overflow */
-   if (args->addr + args->len < args->addr)
-   return -EINVAL;
 
-   /* check for extra flags that we don't understand */
-   if (args->flags & ~KVM_IOEVENTFD_VALID_FLAG_MASK)
-   return -EINVAL;
-
-   /* ioeventfd with no length can't be combined with DATAMATCH */
-   if (!args->len &&
-   args->flags & (KVM_IOEVENTFD_FLAG_PIO |
-  KVM_IOEVENTFD_FLAG_DATAMATCH))
-   return -EINVAL;
+   struct eventfd_ctx *eventfd;
+   struct _ioeventfd *p;
+   int ret;
 
eventfd = eventfd_ctx_fdget(args->fd);
if (IS_ERR(eventfd))
@@ -873,14 +847,13 @@ fail:
 }
 
 static int
-kvm_deassign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
+kvm_deassign_ioeventfd_idx(struct kvm *kvm, enum kvm_bus bus_idx,
+  struct kvm_ioeventfd *args)
 {
-   enum kvm_bus  bus_idx;
struct _ioeventfd*p, *tmp;
struct eventfd_ctx   *eventfd;
int   ret = -ENOENT;
 
-   bus_idx = ioeventfd_bus_from_flags(args->flags);
eventfd = eventfd_ctx_fdget(args->fd);
if (IS_ERR(eventfd))
return PTR_ERR(eventfd);
@@ -918,6 +891,48 @@ kvm_deassign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
return ret;
 }
 
+static int kvm_deassign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
+{
+   enum kvm_bus bus_idx = ioeventfd_bus_from_flags(args->flags);
+
+   return kvm_deassign_ioeventfd_idx(kvm, bus_idx, args);
+}
+
+static int
+kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
+{
+   enum kvm_bus  bus_idx;
+
+   bus_idx = ioeventfd_bus_from_flags(args->flags);
+   /* must be natural-word sized, or 0 to ignore length */
+   switch (args->len) {
+   case 0:
+   case 1:
+   case 2:
+   case 4:
+   case 8:
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   /* check for range overflow */
+   if (args->addr + args->len < args->addr)
+   return -EINVAL;
+
+   /* check for extra flags that we don't understand */
+   if (args->flags & ~KVM_IOEVENTFD_VALID_FLAG_MASK)
+   return -EINVAL;
+
+   /* ioeventfd with no length can't be combined with DATAMATCH */
+   if (!args->len &&
+   args->flags & (KVM_IOEVENTFD_FLAG_PIO |
+  KVM_IOEVENTFD_FLAG_DATAMATCH))
+   return -EINVAL;
+
+   return kvm_assign_ioeventfd_idx(kvm, bus_idx, args);
+}
+
 int
 kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
 {
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH V6 3/6] kvm: fix double free for fast mmio eventfd

2015-09-15 Thread Jason Wang
We register a wildcard mmio eventfd on two buses, once on KVM_MMIO_BUS
and once on KVM_FAST_MMIO_BUS, but with a single iodev
instance. This leads to an issue: kvm_io_bus_destroy() knows
nothing about the devices on the two buses pointing to a single dev,
which will lead to a double free[1] during exit. Fix this by allocating
two instances of iodevs, then registering one on KVM_MMIO_BUS and
another on KVM_FAST_MMIO_BUS.
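
The fix can be modelled outside the kernel with a small toy sketch (all
`toy_*` names are illustrative, not kernel API): a distinct allocation per
bus means a destroy pass that frees every registered device never frees
the same pointer twice.

```c
/* Toy model of the double-free fix.  kvm_io_bus_destroy() frees every
 * device on every bus and knows nothing about cross-bus sharing, so a
 * single instance registered on two buses is freed twice.  Allocating
 * one instance per bus, as this patch does, makes the destroy pass
 * safe.  All toy_* names are illustrative, not kernel API. */
#include <assert.h>
#include <stdlib.h>

#define NR_TOY_BUSES 2
#define MAX_TOY_DEVS 4

struct toy_iodev { int id; };

static struct toy_iodev *toy_buses[NR_TOY_BUSES][MAX_TOY_DEVS];
static int toy_dev_count[NR_TOY_BUSES];

static void toy_register(int bus, struct toy_iodev *dev)
{
	toy_buses[bus][toy_dev_count[bus]++] = dev;
}

/* Fixed scheme: a distinct allocation per bus. */
static void toy_assign(int id)
{
	int bus;

	for (bus = 0; bus < NR_TOY_BUSES; bus++) {
		struct toy_iodev *dev = malloc(sizeof(*dev));

		dev->id = id;
		toy_register(bus, dev);
	}
}

/* Mirrors kvm_io_bus_destroy(): free everything registered, once per
 * registration.  Safe because every slot is a distinct allocation. */
static void toy_destroy(void)
{
	int bus, i;

	for (bus = 0; bus < NR_TOY_BUSES; bus++) {
		for (i = 0; i < toy_dev_count[bus]; i++)
			free(toy_buses[bus][i]);
		toy_dev_count[bus] = 0;
	}
}
```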

CPU: 1 PID: 2894 Comm: qemu-system-x86 Not tainted 3.19.0-26-generic #28-Ubuntu
Hardware name: LENOVO 2356BG6/2356BG6, BIOS G7ET96WW (2.56 ) 09/12/2013
task: 88009ae0c4b0 ti: 88020e7f task.ti: 88020e7f
RIP: 0010:[]  [] 
ioeventfd_release+0x28/0x60 [kvm]
RSP: 0018:88020e7f3bc8  EFLAGS: 00010292
RAX: dead00200200 RBX: 8801ec19c900 RCX: 00018200016d
RDX: 8801ec19cf80 RSI: ea0008bf1d40 RDI: 8801ec19c900
RBP: 88020e7f3bd8 R08: 2fc75a01 R09: 00018200016d
R10: c07df6ae R11: 88022fc75a98 R12: 88021e7cc000
R13: 88021e7cca48 R14: 88021e7cca50 R15: 8801ec19c880
FS:  7fc1ee3e6700() GS:88023e24() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7f8f389d8000 CR3: 00023dc13000 CR4: 001427e0
Stack:
88021e7cc000  88020e7f3be8 c07e2622
88020e7f3c38 c07df69a 880232524160 88020e792d80
  880219b78c00 0008 8802321686a8
Call Trace:
[] ioeventfd_destructor+0x12/0x20 [kvm]
[] kvm_put_kvm+0xca/0x210 [kvm]
[] kvm_vcpu_release+0x18/0x20 [kvm]
[] __fput+0xe7/0x250
[] fput+0xe/0x10
[] task_work_run+0xd4/0xf0
[] do_exit+0x368/0xa50
[] ? recalc_sigpending+0x1f/0x60
[] do_group_exit+0x45/0xb0
[] get_signal+0x291/0x750
[] do_signal+0x28/0xab0
[] ? do_futex+0xdb/0x5d0
[] ? __wake_up_locked_key+0x18/0x20
[] ? SyS_futex+0x76/0x170
[] do_notify_resume+0x69/0xb0
[] int_signal+0x12/0x17
Code: 5d c3 90 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 8b 7f 20 
e8 06 d6 a5 c0 48 8b 43 08 48 8b 13 48 89 df 48 89 42 08 <48> 89 10 48 b8 00 01 
10 00 00
 RIP  [] ioeventfd_release+0x28/0x60 [kvm]
 RSP 

Cc: sta...@vger.kernel.org
Cc: Gleb Natapov 
Cc: Paolo Bonzini 
Signed-off-by: Jason Wang 
---
 virt/kvm/eventfd.c | 43 +--
 1 file changed, 25 insertions(+), 18 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 0829c7f..79db453 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -817,16 +817,6 @@ static int kvm_assign_ioeventfd_idx(struct kvm *kvm,
if (ret < 0)
goto unlock_fail;
 
-   /* When length is ignored, MMIO is also put on a separate bus, for
-* faster lookups.
-*/
-   if (!args->len && bus_idx == KVM_MMIO_BUS) {
-   ret = kvm_io_bus_register_dev(kvm, KVM_FAST_MMIO_BUS,
- p->addr, 0, >dev);
-   if (ret < 0)
-   goto register_fail;
-   }
-
kvm->buses[bus_idx]->ioeventfd_count++;
list_add_tail(>list, >ioeventfds);
 
@@ -834,8 +824,6 @@ static int kvm_assign_ioeventfd_idx(struct kvm *kvm,
 
return 0;
 
-register_fail:
-   kvm_io_bus_unregister_dev(kvm, bus_idx, >dev);
 unlock_fail:
mutex_unlock(>slots_lock);
 
@@ -874,10 +862,6 @@ kvm_deassign_ioeventfd_idx(struct kvm *kvm, enum kvm_bus bus_idx,
continue;
 
kvm_io_bus_unregister_dev(kvm, bus_idx, >dev);
-   if (!p->length && p->bus_idx == KVM_MMIO_BUS) {
-   kvm_io_bus_unregister_dev(kvm, KVM_FAST_MMIO_BUS,
- >dev);
-   }
kvm->buses[bus_idx]->ioeventfd_count--;
ioeventfd_release(p);
ret = 0;
@@ -894,14 +878,19 @@ kvm_deassign_ioeventfd_idx(struct kvm *kvm, enum kvm_bus bus_idx,
 static int kvm_deassign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
 {
enum kvm_bus bus_idx = ioeventfd_bus_from_flags(args->flags);
+   int ret = kvm_deassign_ioeventfd_idx(kvm, bus_idx, args);
+
+   if (!args->len && bus_idx == KVM_MMIO_BUS)
+   kvm_deassign_ioeventfd_idx(kvm, KVM_FAST_MMIO_BUS, args);
 
-   return kvm_deassign_ioeventfd_idx(kvm, bus_idx, args);
+   return ret;
 }
 
 static int
 kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
 {
enum kvm_bus  bus_idx;
+   int ret;
 
bus_idx = ioeventfd_bus_from_flags(args->flags);
/* must be natural-word sized, or 0 to ignore length */
@@ -930,7 +919,25 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
   KVM_IOEVENTFD_FLAG_DATAMATCH))
return -EINVAL;
 
-   return kvm_assign_ioeventfd_idx(kvm, bus_idx, args);
+   ret = kvm_assign_ioeventfd_idx(kvm, bus_idx, 

[PATCH V6 6/6] kvm: add fast mmio capabilitiy

2015-09-15 Thread Jason Wang
Cc: Gleb Natapov 
Cc: Paolo Bonzini 
Signed-off-by: Jason Wang 
---
 Documentation/virtual/kvm/api.txt | 7 ++-
 include/uapi/linux/kvm.h  | 1 +
 virt/kvm/kvm_main.c   | 1 +
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index d9eccee..26661ef 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1598,7 +1598,7 @@ provided event instead of triggering an exit.
 struct kvm_ioeventfd {
__u64 datamatch;
__u64 addr;/* legal pio/mmio address */
-   __u32 len; /* 1, 2, 4, or 8 bytes*/
+   __u32 len; /* 0, 1, 2, 4, or 8 bytes*/
__s32 fd;
__u32 flags;
__u8  pad[36];
@@ -1621,6 +1621,11 @@ to the registered address is equal to datamatch in struct kvm_ioeventfd.
 For virtio-ccw devices, addr contains the subchannel id and datamatch the
 virtqueue index.
 
+With KVM_CAP_FAST_MMIO, a zero length mmio eventfd is allowed: the
+kernel ignores the length of the guest write, which can give a faster
+response. Note the speedup may only work on some specific
+architectures and setups. Otherwise, it's as fast as a wildcard mmio
+eventfd.
 
 4.60 KVM_DIRTY_TLB
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index a9256f0..ad72a61 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -824,6 +824,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_MULTI_ADDRESS_SPACE 118
 #define KVM_CAP_GUEST_DEBUG_HW_BPS 119
 #define KVM_CAP_GUEST_DEBUG_HW_WPS 120
+#define KVM_CAP_FAST_MMIO 121
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9af68db..645f55d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2717,6 +2717,7 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
case KVM_CAP_IRQFD:
case KVM_CAP_IRQFD_RESAMPLE:
 #endif
+   case KVM_CAP_FAST_MMIO:
case KVM_CAP_CHECK_EXTENSION_VM:
return 1;
 #ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
-- 
2.1.4



[PATCH V6 5/6] kvm: add tracepoint for fast mmio

2015-09-15 Thread Jason Wang
Cc: Gleb Natapov 
Cc: Paolo Bonzini 
Signed-off-by: Jason Wang 
---
 arch/x86/kvm/trace.h | 18 ++
 arch/x86/kvm/vmx.c   |  1 +
 arch/x86/kvm/x86.c   |  1 +
 3 files changed, 20 insertions(+)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 4eae7c3..ce4abe3 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -129,6 +129,24 @@ TRACE_EVENT(kvm_pio,
 );
 
 /*
+ * Tracepoint for fast mmio.
+ */
+TRACE_EVENT(kvm_fast_mmio,
+   TP_PROTO(u64 gpa),
+   TP_ARGS(gpa),
+
+   TP_STRUCT__entry(
+   __field(u64,gpa)
+   ),
+
+   TP_fast_assign(
+   __entry->gpa= gpa;
+   ),
+
+   TP_printk("fast mmio at gpa 0x%llx", __entry->gpa)
+);
+
+/*
  * Tracepoint for cpuid.
  */
 TRACE_EVENT(kvm_cpuid,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index d019868..ff1234a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5767,6 +5767,7 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
skip_emulated_instruction(vcpu);
+   trace_kvm_fast_mmio(gpa);
return 1;
}
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a60bdbc..1ec3965 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8015,6 +8015,7 @@ bool kvm_arch_has_noncoherent_dma(struct kvm *kvm)
 EXPORT_SYMBOL_GPL(kvm_arch_has_noncoherent_dma);
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_msr);
-- 
2.1.4



Questions about KVM TSC trapping

2015-09-15 Thread Oliver Yang
Hi Guys,

I found the patch below for KVM TSC trapping / migration support:

https://lkml.org/lkml/2011/1/6/90

It seems the patch was not merged into the Linux mainline.

So I have 3 questions here:

1. Can KVM support TSC trapping today? If not, what is the plan?

2. What is the solution if my SMP Linux guest OS doesn't have a
reliable TSC?

Since there is no TSC trapping support, will the kvmclock driver handle
all TSC sync issues?

3. What if my Linux guest doesn't have the kvmclock driver?

Does that mean I shouldn't run TSC-sensitive applications in my guest
OS?

BTW, my application is written with lots of rdtsc instructions, and it
performs well in a VMware guest.

Re: [PATCH V6 3/6] kvm: fix double free for fast mmio eventfd

2015-09-15 Thread Cornelia Huck
On Tue, 15 Sep 2015 14:41:56 +0800
Jason Wang  wrote:

> We register wildcard mmio eventfd on two buses, once for KVM_MMIO_BUS
> and once on KVM_FAST_MMIO_BUS but with a single iodev
> instance. This will lead to an issue: kvm_io_bus_destroy() knows
> nothing about the devices on two buses pointing to a single dev. Which
> will lead to double free[1] during exit. Fix this by allocating two
> instances of iodevs then registering one on KVM_MMIO_BUS and another
> on KVM_FAST_MMIO_BUS.
> 
> CPU: 1 PID: 2894 Comm: qemu-system-x86 Not tainted 3.19.0-26-generic 
> #28-Ubuntu
> Hardware name: LENOVO 2356BG6/2356BG6, BIOS G7ET96WW (2.56 ) 09/12/2013
> task: 88009ae0c4b0 ti: 88020e7f task.ti: 88020e7f
> RIP: 0010:[]  [] 
> ioeventfd_release+0x28/0x60 [kvm]
> RSP: 0018:88020e7f3bc8  EFLAGS: 00010292
> RAX: dead00200200 RBX: 8801ec19c900 RCX: 00018200016d
> RDX: 8801ec19cf80 RSI: ea0008bf1d40 RDI: 8801ec19c900
> RBP: 88020e7f3bd8 R08: 2fc75a01 R09: 00018200016d
> R10: c07df6ae R11: 88022fc75a98 R12: 88021e7cc000
> R13: 88021e7cca48 R14: 88021e7cca50 R15: 8801ec19c880
> FS:  7fc1ee3e6700() GS:88023e24() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 7f8f389d8000 CR3: 00023dc13000 CR4: 001427e0
> Stack:
> 88021e7cc000  88020e7f3be8 c07e2622
> 88020e7f3c38 c07df69a 880232524160 88020e792d80
>   880219b78c00 0008 8802321686a8
> Call Trace:
> [] ioeventfd_destructor+0x12/0x20 [kvm]
> [] kvm_put_kvm+0xca/0x210 [kvm]
> [] kvm_vcpu_release+0x18/0x20 [kvm]
> [] __fput+0xe7/0x250
> [] fput+0xe/0x10
> [] task_work_run+0xd4/0xf0
> [] do_exit+0x368/0xa50
> [] ? recalc_sigpending+0x1f/0x60
> [] do_group_exit+0x45/0xb0
> [] get_signal+0x291/0x750
> [] do_signal+0x28/0xab0
> [] ? do_futex+0xdb/0x5d0
> [] ? __wake_up_locked_key+0x18/0x20
> [] ? SyS_futex+0x76/0x170
> [] do_notify_resume+0x69/0xb0
> [] int_signal+0x12/0x17
> Code: 5d c3 90 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 8b 7f 20 
> e8 06 d6 a5 c0 48 8b 43 08 48 8b 13 48 89 df 48 89 42 08 <48> 89 10 48 b8 00 
> 01 10 00 00
>  RIP  [] ioeventfd_release+0x28/0x60 [kvm]
>  RSP 
> 
> Cc: sta...@vger.kernel.org
> Cc: Gleb Natapov 
> Cc: Paolo Bonzini 
> Signed-off-by: Jason Wang 
> ---
>  virt/kvm/eventfd.c | 43 +--
>  1 file changed, 25 insertions(+), 18 deletions(-)

Reviewed-by: Cornelia Huck 



[PATCH V6 1/6] kvm: don't try to register to KVM_FAST_MMIO_BUS for non mmio eventfd

2015-09-15 Thread Jason Wang
We only want zero length mmio eventfds to be registered on
KVM_FAST_MMIO_BUS. So when args->len is zero, check explicitly that
the bus is KVM_MMIO_BUS before doing the extra registration.

Cc: sta...@vger.kernel.org
Cc: Gleb Natapov 
Cc: Paolo Bonzini 
Signed-off-by: Jason Wang 
---
 virt/kvm/eventfd.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 9ff4193..e404806 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -846,7 +846,7 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
/* When length is ignored, MMIO is also put on a separate bus, for
 * faster lookups.
 */
-   if (!args->len && !(args->flags & KVM_IOEVENTFD_FLAG_PIO)) {
+   if (!args->len && bus_idx == KVM_MMIO_BUS) {
ret = kvm_io_bus_register_dev(kvm, KVM_FAST_MMIO_BUS,
  p->addr, 0, >dev);
if (ret < 0)
@@ -901,7 +901,7 @@ kvm_deassign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
continue;
 
kvm_io_bus_unregister_dev(kvm, bus_idx, >dev);
-   if (!p->length) {
+   if (!p->length && p->bus_idx == KVM_MMIO_BUS) {
kvm_io_bus_unregister_dev(kvm, KVM_FAST_MMIO_BUS,
  >dev);
}
-- 
2.1.4



[PATCH V6 0/6] Fast mmio eventfd fixes

2015-09-15 Thread Jason Wang
Hi:

This series fixes two issues of fast mmio eventfd:

1) A single iodev instance was registered on two buses: KVM_MMIO_BUS
   and KVM_FAST_MMIO_BUS. This will cause a double free in
   ioeventfd_destructor().
2) A zero length iodev on KVM_MMIO_BUS will never be found by
   kvm_io_bus_cmp(). This will lead to e.g. the eventfd being trapped
   by qemu instead of the host.

1 is fixed by allocating two instances of iodev and introducing a new
capability for userspace. 2 is fixed by ignoring the actual length if
the length of the iodev is zero in kvm_io_bus_cmp().

Please review.
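
As a standalone illustration of fix 2 (not the kernel code itself;
kvm_io_bus_cmp() is really a three-way comparator over sorted ranges),
the lookup predicate change can be sketched as:

```c
/* Sketch of the zero length lookup fix: does a registered iodev at
 * (dev_addr, dev_len) cover a guest access of (addr, len)?  Before
 * the fix a zero length registration never matched a non-zero-width
 * access; ignoring the access width for such devices fixes that.
 * Simplified stand-in, not the kernel's kvm_io_bus_cmp(). */
#include <assert.h>

static int toy_covers(unsigned long dev_addr, unsigned int dev_len,
		      unsigned long addr, unsigned int len)
{
	if (dev_len == 0)	/* wildcard: match on address alone */
		return addr == dev_addr;
	return addr >= dev_addr && addr + len <= dev_addr + dev_len;
}
```

A zero length device at 0x1000 now covers a 4-byte write to 0x1000,
which previously fell through to userspace.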

Changes from V5:
- move the explicit check for KVM_MMIO_BUS to patch 1 and
  remove the unnecessary checks
- even more grammar and typo fixes
- rebase to kvm.git
- document KVM_CAP_FAST_MMIO

Changes from V4:
- move the location of kvm_assign_ioeventfd() in patch 1, which reduces
  the change set.
- commit log typo fixes
- switch to kvm_deassign_ioeventfd_idx() when registering on the
  fast mmio bus fails
- change kvm_io_bus_cmp() per Paolo's suggestions
- introduce a new capability to avoid new userspace crashing old kernels
- add a new patch that only tries to register the mmio eventfd on the
  fast mmio bus

Changes from V3:

- Don't search on two buses when trying to do a write on
  KVM_MMIO_BUS. This fixes a small regression found by vmexit.flat.
- Since we don't search on two buses, change kvm_io_bus_cmp() so
  it can find zero length iodevs.
- Fix the unnecessary lines in the tracepoint patch.

Changes from V2:
- Tweak styles and comment suggested by Cornelia.

Changes from v1:
- change ioeventfd_bus_from_flags() to return KVM_FAST_MMIO_BUS when
  needed to save lots of unnecessary changes.

Jason Wang (6):
  kvm: don't try to register to KVM_FAST_MMIO_BUS for non mmio eventfd
  kvm: factor out core eventfd assign/deassign logic
  kvm: fix double free for fast mmio eventfd
  kvm: fix zero length mmio searching
  kvm: add tracepoint for fast mmio
  kvm: add fast mmio capabilitiy

 Documentation/virtual/kvm/api.txt |   7 ++-
 arch/x86/kvm/trace.h  |  18 ++
 arch/x86/kvm/vmx.c|   1 +
 arch/x86/kvm/x86.c|   1 +
 include/uapi/linux/kvm.h  |   1 +
 virt/kvm/eventfd.c| 124 ++
 virt/kvm/kvm_main.c   |  20 +-
 7 files changed, 118 insertions(+), 54 deletions(-)

-- 
2.1.4



Re: [PATCH V6 2/6] kvm: factor out core eventfd assign/deassign logic

2015-09-15 Thread Cornelia Huck
On Tue, 15 Sep 2015 14:41:55 +0800
Jason Wang  wrote:

> This patch factors out core eventfd assign/deassign logic and leaves
> the argument checking and bus index selection to callers.
> 
> Cc: sta...@vger.kernel.org
> Cc: Gleb Natapov 
> Cc: Paolo Bonzini 
> Signed-off-by: Jason Wang 
> ---
>  virt/kvm/eventfd.c | 85 
> --
>  1 file changed, 50 insertions(+), 35 deletions(-)

Reviewed-by: Cornelia Huck 



Re: [PATCH V6 4/6] kvm: fix zero length mmio searching

2015-09-15 Thread Cornelia Huck
On Tue, 15 Sep 2015 14:41:57 +0800
Jason Wang  wrote:

> Currently, if we had a zero length mmio eventfd assigned on
> KVM_MMIO_BUS. It will never be found by kvm_io_bus_cmp() since it
> always compares the kvm_io_range() with the length that guest
> wrote. This will cause e.g for vhost, kick will be trapped by qemu
> userspace instead of vhost. Fixing this by using zero length if an
> iodevice is zero length.
> 
> Cc: sta...@vger.kernel.org
> Cc: Gleb Natapov 
> Cc: Paolo Bonzini 
> Signed-off-by: Jason Wang 
> ---
>  virt/kvm/kvm_main.c | 19 +--
>  1 file changed, 17 insertions(+), 2 deletions(-)

Reviewed-by: Cornelia Huck 



Re: [PATCH] KVM: arm64: add workaround for Cortex-A57 erratum #852523

2015-09-15 Thread Christoffer Dall
On Mon, Sep 14, 2015 at 04:46:28PM +0100, Marc Zyngier wrote:
> On 14/09/15 16:06, Will Deacon wrote:
> > When restoring the system register state for an AArch32 guest at EL2,
> > writes to DACR32_EL2 may not be correctly synchronised by Cortex-A57,
> > which can lead to the guest effectively running with junk in the DACR
> > and running into unexpected domain faults.
> > 
> > This patch works around the issue by re-ordering our restoration of the
> > AArch32 register aliases so that they happen before the AArch64 system
> > registers. Ensuring that the registers are restored in this order
> > guarantees that they will be correctly synchronised by the core.
> > 
> > Cc: 
> > Cc: Marc Zyngier 
> > Signed-off-by: Will Deacon 
> 
> Reviewed-by: Marc Zyngier 
> 

Looks good to me too, and I just gave this a spin with my loop test on
Juno as well.

Thanks,
-Christoffer


Re: [PATCH v2 kvmtool] Make static libc and guest-init functionality optional.

2015-09-15 Thread Will Deacon
Hi Dmitri,

On Fri, Sep 11, 2015 at 03:40:00PM +0100, Dimitri John Ledkov wrote:
> If one typically only boots full disk images, one wouldn't necessarily
> want to statically link glibc for the guest-init feature of
> kvmtool, as a statically linked glibc triggers heavy security
> maintenance.
> 
> Signed-off-by: Dimitri John Ledkov 
> ---
>  Changes since v1:
>  - rename CONFIG_HAS_LIBC to CONFIG_GUEST_INIT for clarity
>  - use more ifdefs, instead of runtime check of _binary_guest_init_size==0

The idea looks good, but I think we can tidy some of this up at the same
time by moving all the guest_init code in builtin_setup.c.

How about the patch below?

Will

--->8

From cdce942c1a3a04635065a7972ca4e21386664756 Mon Sep 17 00:00:00 2001
From: Dimitri John Ledkov 
Date: Fri, 11 Sep 2015 15:40:00 +0100
Subject: [PATCH] Make static libc and guest-init functionality optional.

If one typically only boots full disk images, one wouldn't necessarily
want to statically link glibc for the guest-init feature of
kvmtool, as a statically linked glibc triggers heavy security
maintenance.

Signed-off-by: Dimitri John Ledkov 
[will: moved all the guest_init handling into builtin_setup.c]
Signed-off-by: Will Deacon 
---
 Makefile| 12 +++-
 builtin-run.c   | 29 +
 builtin-setup.c | 19 ++-
 include/kvm/builtin-setup.h |  1 +
 4 files changed, 23 insertions(+), 38 deletions(-)

diff --git a/Makefile b/Makefile
index 7b17d529d13b..f1701aa7b8ec 100644
--- a/Makefile
+++ b/Makefile
@@ -34,8 +34,6 @@ bindir_SQ = $(subst ','\'',$(bindir))
 PROGRAM:= lkvm
 PROGRAM_ALIAS := vm
 
-GUEST_INIT := guest/init
-
 OBJS   += builtin-balloon.o
 OBJS   += builtin-debug.o
 OBJS   += builtin-help.o
@@ -279,8 +277,13 @@ ifeq ($(LTO),1)
endif
 endif
 
-ifneq ($(call try-build,$(SOURCE_STATIC),,-static),y)
-$(error No static libc found. Please install glibc-static package.)
+ifeq ($(call try-build,$(SOURCE_STATIC),,-static),y)
+   CFLAGS  += -DCONFIG_GUEST_INIT
+   GUEST_INIT  := guest/init
+   GUEST_OBJS  = guest/guest_init.o
+else
+   $(warning No static libc found. Skipping guest init)
+   NOTFOUND+= static-libc
 endif
 
 ifeq (y,$(ARCH_WANT_LIBFDT))
@@ -356,7 +359,6 @@ c_flags = -Wp,-MD,$(depfile) $(CFLAGS)
 # $(OTHEROBJS) are things that do not get substituted like this.
 #
 STATIC_OBJS = $(patsubst %.o,%.static.o,$(OBJS) $(OBJS_STATOPT))
-GUEST_OBJS = guest/guest_init.o
 
 $(PROGRAM)-static:  $(STATIC_OBJS) $(OTHEROBJS) $(GUEST_INIT)
$(E) "  LINK" $@
diff --git a/builtin-run.c b/builtin-run.c
index 1ee75ad3f010..e0c87329e52b 100644
--- a/builtin-run.c
+++ b/builtin-run.c
@@ -59,9 +59,6 @@ static int  kvm_run_wrapper;
 
 bool do_debug_print = false;
 
-extern char _binary_guest_init_start;
-extern char _binary_guest_init_size;
-
 static const char * const run_usage[] = {
"lkvm run [] []",
NULL
@@ -345,30 +342,6 @@ void kvm_run_help(void)
usage_with_options(run_usage, options);
 }
 
-static int kvm_setup_guest_init(struct kvm *kvm)
-{
-   const char *rootfs = kvm->cfg.custom_rootfs_name;
-   char tmp[PATH_MAX];
-   size_t size;
-   int fd, ret;
-   char *data;
-
-   /* Setup /virt/init */
-   size = (size_t)&_binary_guest_init_size;
-   data = (char *)&_binary_guest_init_start;
-   snprintf(tmp, PATH_MAX, "%s%s/virt/init", kvm__get_dir(), rootfs);
-   remove(tmp);
-   fd = open(tmp, O_CREAT | O_WRONLY, 0755);
-   if (fd < 0)
-   die("Fail to setup %s", tmp);
-   ret = xwrite(fd, data, size);
-   if (ret < 0)
-   die("Fail to setup %s", tmp);
-   close(fd);
-
-   return 0;
-}
-
 static int kvm_run_set_sandbox(struct kvm *kvm)
 {
const char *guestfs_name = kvm->cfg.custom_rootfs_name;
@@ -631,7 +604,7 @@ static struct kvm *kvm_cmd_run_init(int argc, const char **argv)
 
if (!kvm->cfg.no_dhcp)
strcat(real_cmdline, "  ip=dhcp");
-   if (kvm_setup_guest_init(kvm))
+   if (kvm_setup_guest_init(kvm->cfg.custom_rootfs_name))
die("Failed to setup init for guest.");
}
} else if (!strstr(real_cmdline, "root=")) {
diff --git a/builtin-setup.c b/builtin-setup.c
index 8b45c5645ad4..40fef15dbbe4 100644
--- a/builtin-setup.c
+++ b/builtin-setup.c
@@ -16,9 +16,6 @@
 #include 
 #include 
 
-extern char _binary_guest_init_start;
-extern char _binary_guest_init_size;
-
 static const char *instance_name;
 
 static const char * const setup_usage[] = {
@@ -124,7 +121,11 @@ static const char *guestfs_symlinks[] = {
"/etc/ld.so.conf",
 };
 
-static int copy_init(const char *guestfs_name)
+#ifdef 

Re: [PATCH] KVM: add halt_attempted_poll to VCPU stats

2015-09-15 Thread David Matlack
On Tue, Sep 15, 2015 at 9:27 AM, Paolo Bonzini  wrote:
> This new statistic can help diagnosing VCPUs that, for any reason,
> trigger bad behavior of halt_poll_ns autotuning.
>
> For example, say halt_poll_ns = 48, and wakeups are spaced exactly
> like 479us, 481us, 479us, 481us. Then KVM always fails polling and wastes
> 10+20+40+80+160+320+480 = 1110 microseconds out of every
> 479+481+479+481+479+481+479 = 3359 microseconds. The VCPU then
> is consuming about 30% more CPU than it would use without
> polling.  This would show as an abnormally high number of
> attempted polling compared to the successful polls.
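
The arithmetic in the example can be reproduced with a short sketch (the
doubling/cap policy below is a simplification of the real autotuning,
just enough to reproduce the quoted numbers):

```c
/* Worst-case polling waste: the window starts at 10us and doubles
 * after each failed poll, capped at halt_poll_ns.  When wakeups always
 * land just outside the window, every poll fails and the whole window
 * is wasted.  Simplified model of the autotuning, for the arithmetic
 * only. */
#include <assert.h>

static unsigned int wasted_poll_us(unsigned int start, unsigned int cap,
				   int wakeups)
{
	unsigned int window = start, total = 0;
	int i;

	for (i = 0; i < wakeups; i++) {
		total += window;	/* failed poll: fully wasted */
		window = (window * 2 > cap) ? cap : window * 2;
	}
	return total;
}
```

With start = 10, cap = 480 and seven wakeups this gives 1110us wasted
out of 3359us of real time, i.e. roughly the 30% overhead quoted above.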

Reviewed-by: David Matlack 

>
> Cc: Christian Borntraeger 
> Cc: David Matlack 
> Signed-off-by: Paolo Bonzini 
> ---
>  arch/arm/include/asm/kvm_host.h | 1 +
>  arch/arm64/include/asm/kvm_host.h   | 1 +
>  arch/mips/include/asm/kvm_host.h| 1 +
>  arch/mips/kvm/mips.c| 1 +
>  arch/powerpc/include/asm/kvm_host.h | 1 +
>  arch/powerpc/kvm/book3s.c   | 1 +
>  arch/powerpc/kvm/booke.c| 1 +
>  arch/s390/include/asm/kvm_host.h| 1 +
>  arch/s390/kvm/kvm-s390.c| 1 +
>  arch/x86/include/asm/kvm_host.h | 1 +
>  arch/x86/kvm/x86.c  | 1 +
>  virt/kvm/kvm_main.c | 1 +
>  12 files changed, 12 insertions(+)
>
> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
> index dcba0fa5176e..687ddeba3611 100644
> --- a/arch/arm/include/asm/kvm_host.h
> +++ b/arch/arm/include/asm/kvm_host.h
> @@ -148,6 +148,7 @@ struct kvm_vm_stat {
>
>  struct kvm_vcpu_stat {
> u32 halt_successful_poll;
> +   u32 halt_attempted_poll;
> u32 halt_wakeup;
>  };
>
> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index 415938dc45cf..486594583cc6 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -195,6 +195,7 @@ struct kvm_vm_stat {
>
>  struct kvm_vcpu_stat {
> u32 halt_successful_poll;
> +   u32 halt_attempted_poll;
> u32 halt_wakeup;
>  };
>
> diff --git a/arch/mips/include/asm/kvm_host.h 
> b/arch/mips/include/asm/kvm_host.h
> index e8c8d9d0c45f..3a54dbca9f7e 100644
> --- a/arch/mips/include/asm/kvm_host.h
> +++ b/arch/mips/include/asm/kvm_host.h
> @@ -128,6 +128,7 @@ struct kvm_vcpu_stat {
> u32 msa_disabled_exits;
> u32 flush_dcache_exits;
> u32 halt_successful_poll;
> +   u32 halt_attempted_poll;
> u32 halt_wakeup;
>  };
>
> diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> index cd4c129ce743..49ff3bfc007e 100644
> --- a/arch/mips/kvm/mips.c
> +++ b/arch/mips/kvm/mips.c
> @@ -55,6 +55,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
> { "msa_disabled", VCPU_STAT(msa_disabled_exits), KVM_STAT_VCPU },
> { "flush_dcache", VCPU_STAT(flush_dcache_exits), KVM_STAT_VCPU },
> { "halt_successful_poll", VCPU_STAT(halt_successful_poll), 
> KVM_STAT_VCPU },
> +   { "halt_attempted_poll", VCPU_STAT(halt_attempted_poll), 
> KVM_STAT_VCPU },
> { "halt_wakeup",  VCPU_STAT(halt_wakeup),KVM_STAT_VCPU },
> {NULL}
>  };
> diff --git a/arch/powerpc/include/asm/kvm_host.h 
> b/arch/powerpc/include/asm/kvm_host.h
> index 98eebbf66340..195886a583ba 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -108,6 +108,7 @@ struct kvm_vcpu_stat {
> u32 dec_exits;
> u32 ext_intr_exits;
> u32 halt_successful_poll;
> +   u32 halt_attempted_poll;
> u32 halt_wakeup;
> u32 dbell_exits;
> u32 gdbell_exits;
> diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
> index d75bf325f54a..cf009167d208 100644
> --- a/arch/powerpc/kvm/book3s.c
> +++ b/arch/powerpc/kvm/book3s.c
> @@ -53,6 +53,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
> { "ext_intr",VCPU_STAT(ext_intr_exits) },
> { "queue_intr",  VCPU_STAT(queue_intr) },
> { "halt_successful_poll", VCPU_STAT(halt_successful_poll), },
> +   { "halt_attempted_poll", VCPU_STAT(halt_attempted_poll), },
> { "halt_wakeup", VCPU_STAT(halt_wakeup) },
> { "pf_storage",  VCPU_STAT(pf_storage) },
> { "sp_storage",  VCPU_STAT(sp_storage) },
> diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
> index ae458f0fd061..fd5875179e5c 100644
> --- a/arch/powerpc/kvm/booke.c
> +++ b/arch/powerpc/kvm/booke.c
> @@ -63,6 +63,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
> { "dec",VCPU_STAT(dec_exits) },
> { "ext_intr",   VCPU_STAT(ext_intr_exits) },
> { "halt_successful_poll", VCPU_STAT(halt_successful_poll) },
> +   { "halt_attempted_poll", VCPU_STAT(halt_attempted_poll) },
> { "halt_wakeup", VCPU_STAT(halt_wakeup) },
> { "doorbell", 

Re: [RFC PATCH] os-android: Add support to android platform, built by ndk-r10

2015-09-15 Thread Houcheng Lin
Hi Paolo,

(Please ignore the previous mail that did not include "qemu-devel")

Thanks for your review and suggestions. I'll fix this patch
accordingly and please see my replies below.

best regards,
Houcheng Lin

2015-09-15 17:41 GMT+08:00 Paolo Bonzini :

> This is okay and can be done unconditionally (introduce a new
> qemu_getdtablesize function that is defined in util/oslib-posix.c).

Will fix it.
>
>
>> - sigtimewait(): call __rt_sigtimewait() instead.
>> - lockf(): not see this feature in android, directly return -1.
>> - shm_open(): not see this feature in android, directly return -1.
>
> This is not okay.  Please fix your libc instead.

I'll modify the bionic C library to support these functions and feed the
changes back to Google's AOSP project. But the Android kernel does not
support shmem, so I will prevent compiling ivshmem.c in the Android
config. The config for ivshmem in pci.mak will be:

CONFIG_IVSHMEM=$(call land, $(call lnot,$(CONFIG_ANDROID)),$(CONFIG_KVM))

> For sys/io.h, we can just remove the inclusion.  It is not necessary.
>
> scsi/sg.h should be exported by the Linux kernel, so that we can use
> scripts/update-linux-headers.sh to copy it from QEMU.  I've sent a Linux
> kernel patch and CCed you.

It's better to put these headers in the kernel's exported user headers. Thanks.

>
> You should instead disable the guest agent on your configure command line.
>

Okay.

>
> If you have CONFIG_ANDROID, you do not need -DANDROID.
>

Okay.

>> +  LIBS="-lglib-2.0 -lgthread-2.0 -lz -lpixman-1 -lintl -liconv -lc $LIBS"
>> +  libs_qga="-lglib-2.0 -lgthread-2.0 -lz -lpixman-1 -lintl -liconv -lc"
>
> This should not be necessary, QEMU uses pkg-config.
>
>> +fi
>>  if [ "$bsd" = "yes" ] ; then
>>if [ "$darwin" != "yes" ] ; then
>>  bsd_user="yes"
>> @@ -1736,7 +1749,14 @@ fi
>>  # pkg-config probe
>>
>>  if ! has "$pkg_config_exe"; then
>> -  error_exit "pkg-config binary '$pkg_config_exe' not found"
>> +  case $cross_prefix in
>> +*android*)
>> + pkg_config_exe=scripts/android-pkg-config
>
> Neither should this.  Your cross-compilation environment is not
> correctly set up if you do not have a pkg-config executable.  If you
> want to use a wrapper, you can specify it with the PKG_CONFIG
> environment variable.  But it need not be maintained in the QEMU
> repository, because QEMU assumes a complete cross-compilation environment.

I'll use a wrapper in the next release and specify it with the
environment variable. Later, I may generate a pkg-config data file while
building the library and install it into the cross-compilation
environment.

>
>> + ;;
>> +*)
>> + error_exit "pkg-config binary '$pkg_config_exe' not found"
>> + ;;
>> +  esac
>>  fi
>>
>>  ##
>> @@ -3764,7 +3784,7 @@ elif compile_prog "" "$pthread_lib -lrt" ; then
>>  fi
>>
>>  if test "$darwin" != "yes" -a "$mingw32" != "yes" -a "$solaris" != yes -a \
>> -"$aix" != "yes" -a "$haiku" != "yes" ; then
>> +"$aix" != "yes" -a "$haiku" != "yes" -a "$android" != "yes" ; then
>>  libs_softmmu="-lutil $libs_softmmu"
>>  fi
>>
>> @@ -4709,6 +4729,10 @@ if test "$linux" = "yes" ; then
>>echo "CONFIG_LINUX=y" >> $config_host_mak
>>  fi
>>
>> +if test "$android" = "yes" ; then
>> +  echo "CONFIG_ANDROID=y" >> $config_host_mak
>> +fi
>> +
>>  if test "$darwin" = "yes" ; then
>>echo "CONFIG_DARWIN=y" >> $config_host_mak
>>  fi
>> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
>> index ab3c876..1ba22be 100644
>> --- a/include/qemu/osdep.h
>> +++ b/include/qemu/osdep.h
>> @@ -59,6 +59,10 @@
>>  #define WEXITSTATUS(x) (x)
>>  #endif
>>
>> +#ifdef ANDROID
>> + #include "sysemu/os-android.h"
>> +#endif
>> +
>>  #ifdef _WIN32
>>  #include "sysemu/os-win32.h"
>>  #endif
>> diff --git a/include/sysemu/os-android.h b/include/sysemu/os-android.h
>> new file mode 100644
>> index 000..7f73777
>> --- /dev/null
>> +++ b/include/sysemu/os-android.h
>> @@ -0,0 +1,35 @@
>> +#ifndef QEMU_OS_ANDROID_H
>> +#define QEMU_OS_ANDROID_H
>> +
>> +#include 
>> +
>> +/*
>> + * For include the basename prototyping in android.
>> + */
>> +#include 
>
> These two includes can be added to sysemu/os-posix.h.
>
>> +extern int __rt_sigtimedwait(const sigset_t *uthese, siginfo_t *uinfo, \
>> + const struct timespec *uts, size_t sigsetsize);
>> +
>> +int getdtablesize(void);
>> +int lockf(int fd, int cmd, off_t len);
>> +int sigtimedwait(const sigset_t *set, siginfo_t *info, \
>> + const struct timespec *ts);
>> +int shm_open(const char *name, int oflag, mode_t mode);
>> +char *a_ptsname(int fd);
>> +
>> +#define F_TLOCK 0
>> +#define IOV_MAX 256
>
> libc should really be fixed for all of these (except getdtablesize and
> a_ptsname).  The way you're working around it is introducing subtle
> bugs, for example the pidfile is _not_ locked.

Okay, I will fix it.

>> +#define I_LOOK  (__SID | 4) /* Retrieve the name of the module just 

Re: [PATCH v2 09/18] nvdimm: build ACPI NFIT table

2015-09-15 Thread Igor Mammedov
On Tue, 15 Sep 2015 18:12:43 +0200
Paolo Bonzini  wrote:

> 
> 
> On 14/08/2015 16:52, Xiao Guangrong wrote:
> > NFIT is defined in ACPI 6.0: 5.2.25 NVDIMM Firmware Interface Table (NFIT)
> > 
> > Currently, we only support PMEM mode. Each device has 3 tables:
> > - SPA table, define the PMEM region info
> > 
> > - MEM DEV table, it has the @handle which is used to associate specified
> >   ACPI NVDIMM  device we will introduce in later patch.
> >   Also we can happily ignore the memory device's interleave; the real
> >   nvdimm hardware access is hidden behind the host
> > 
> > - DCR table, it defines Vendor ID used to associate specified vendor
> >   nvdimm driver. Since we only implement PMEM mode this time, Command
> >   window and Data window are not needed
> > 
> > Signed-off-by: Xiao Guangrong 
> > ---
> >  hw/i386/acpi-build.c   |   3 +
> >  hw/mem/Makefile.objs   |   2 +-
> >  hw/mem/nvdimm/acpi.c   | 285 
> > +
> >  hw/mem/nvdimm/internal.h   |  29 +
> >  hw/mem/nvdimm/pc-nvdimm.c  |  27 -
> >  include/hw/mem/pc-nvdimm.h |   2 +
> >  6 files changed, 346 insertions(+), 2 deletions(-)
> >  create mode 100644 hw/mem/nvdimm/acpi.c
> >  create mode 100644 hw/mem/nvdimm/internal.h
> > 
> > diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> > index 8ead1c1..092ed2f 100644
> > --- a/hw/i386/acpi-build.c
> > +++ b/hw/i386/acpi-build.c
> > @@ -39,6 +39,7 @@
> >  #include "hw/loader.h"
> >  #include "hw/isa/isa.h"
> >  #include "hw/acpi/memory_hotplug.h"
> > +#include "hw/mem/pc-nvdimm.h"
> >  #include "sysemu/tpm.h"
> >  #include "hw/acpi/tpm.h"
> >  #include "sysemu/tpm_backend.h"
> > @@ -1741,6 +1742,8 @@ void acpi_build(PcGuestInfo *guest_info, 
> > AcpiBuildTables *tables)
> >  build_dmar_q35(tables_blob, tables->linker);
> >  }
> >  
> > +pc_nvdimm_build_nfit_table(table_offsets, tables_blob, tables->linker);
> > +
> >  /* Add tables supplied by user (if any) */
> >  for (u = acpi_table_first(); u; u = acpi_table_next(u)) {
> >  unsigned len = acpi_table_len(u);
> > diff --git a/hw/mem/Makefile.objs b/hw/mem/Makefile.objs
> > index 4df7482..7a6948d 100644
> > --- a/hw/mem/Makefile.objs
> > +++ b/hw/mem/Makefile.objs
> > @@ -1,2 +1,2 @@
> >  common-obj-$(CONFIG_MEM_HOTPLUG) += pc-dimm.o
> > -common-obj-$(CONFIG_NVDIMM) += nvdimm/pc-nvdimm.o
> > +common-obj-$(CONFIG_NVDIMM) += nvdimm/pc-nvdimm.o nvdimm/acpi.o
> > diff --git a/hw/mem/nvdimm/acpi.c b/hw/mem/nvdimm/acpi.c
> > new file mode 100644
> > index 000..f28752f
> > --- /dev/null
> > +++ b/hw/mem/nvdimm/acpi.c
> > @@ -0,0 +1,285 @@
> > +/*
> > + * NVDIMM (A Non-Volatile Dual In-line Memory Module) NFIT Implement
> > + *
> > + * Copyright(C) 2015 Intel Corporation.
> > + *
> > + * Author:
> > + *  Xiao Guangrong 
> > + *
> > + * NFIT is defined in ACPI 6.0: 5.2.25 NVDIMM Firmware Interface Table 
> > (NFIT)
> > + * and the DSM specfication can be found at:
> > + *   http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
> > + *
> > + * Currently, it only supports PMEM Virtualization.
> > + *
> > + * This library is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU Lesser General Public
> > + * License as published by the Free Software Foundation; either
> > + * version 2 of the License, or (at your option) any later version.
> > + *
> > + * This library is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > + * Lesser General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU Lesser General Public
> > + * License along with this library; if not, see
> > <http://www.gnu.org/licenses/>
> > + */
> > +
> > +#include "qemu-common.h"
> > +
> > +#include "hw/acpi/aml-build.h"
> > +#include "hw/mem/pc-nvdimm.h"
> > +
> > +#include "internal.h"
> > +
> > +static void nfit_spa_uuid_pm(void *uuid)
> > +{
> > +uuid_le uuid_pm = UUID_LE(0x66f0d379, 0xb4f3, 0x4074, 0xac, 0x43, 0x0d,
> > +  0x33, 0x18, 0xb7, 0x8c, 0xdb);
> > +memcpy(uuid, &uuid_pm, sizeof(uuid_pm));
> > +}
> > +
> > +enum {
> > +NFIT_TABLE_SPA = 0,
> > +NFIT_TABLE_MEM = 1,
> > +NFIT_TABLE_IDT = 2,
> > +NFIT_TABLE_SMBIOS = 3,
> > +NFIT_TABLE_DCR = 4,
> > +NFIT_TABLE_BDW = 5,
> > +NFIT_TABLE_FLUSH = 6,
> > +};
> > +
> > +enum {
> > +EFI_MEMORY_UC = 0x1ULL,
> > +EFI_MEMORY_WC = 0x2ULL,
> > +EFI_MEMORY_WT = 0x4ULL,
> > +EFI_MEMORY_WB = 0x8ULL,
> > +EFI_MEMORY_UCE = 0x10ULL,
> > +EFI_MEMORY_WP = 0x1000ULL,
> > +EFI_MEMORY_RP = 0x2000ULL,
> > +EFI_MEMORY_XP = 0x4000ULL,
> > +EFI_MEMORY_NV = 0x8000ULL,
> > +EFI_MEMORY_MORE_RELIABLE = 0x1ULL,
> > +};
> > +
> > +/*
> > + * struct nfit - 

Re: [PATCH v2 6/8] arm/arm64: KVM: Add forwarded physical interrupts documentation

2015-09-15 Thread Christoffer Dall
On Tue, Sep 15, 2015 at 04:16:07PM +0100, Andre Przywara wrote:
> Hi Christoffer,
> 
> On 14/09/15 12:42, Christoffer Dall wrote:
> 
>  Where is this done? I see that the physical dist state is altered on the
>  actual IRQ forwarding, but not on later exits/entries? Do you mean
>  kvm_vgic_flush_hwstate() with "flush"?
> >>>
> >>> this is a bug and should be fixed in the 'fixes' patches I sent last
> >>> week.  We should set active state on every entry to the guest for IRQs
> >>> with the HW bit set in either pending or active state.
> >>
> >> OK, sorry, I missed that one patch, I was looking at what should become
> >> -rc1 soon (because that's what I want to rebase my ITS emulation patches
> >> on). That patch wasn't in queue at the time I started looking at it.
> >>
> >> So I updated to the latest queue containing those two fixes and also
> >> applied your v2 series. Indeed this series addresses some of the things
> >> I was wondering about the last time, but the main thing still persists:
> >> - Every time the physical dist state is active we have the virtual state
> >> still at pending or active.
> > 
> > For the arch timer, yes.
> > 
> > For a passthrough device, there should be a situation where the physical
> > dist state is active but we didn't see the virtual state updated at the
> > vgic yet (after physical IRQ fires and before the VFIO ISR calls
> > kvm_set_irq).
> 
> But then we wouldn't get into vgic_sync_hwirq(), because we wouldn't
> inject a mapped IRQ before kvm_set_irq() is called, would we?

Ah, you meant, if we are in vgic_sync_hwirq() and the dist state is
active, then we have the virtual state still at pending or active?

That's a slightly different question from what you posed above.

I haven't thought extremely carefully about it, but could you not have
(1) guest deactivates (2) physical interrupt is handled on different CPU
on host for passthrough device (3) VFIO ISR leaves the IRQ active (4)
guest exits and you now hit vgic_sync_hwirq() and the virtual interrupt
is now inactive but the physical interrupt is active?

> 
> >> - If the physical dist state is non-active, the virtual state is
> >> inactive (LR.state==8: HW bit) as well. The associated ELRSR bit is 1
> >> (LR empty).
> >> (I was tracing every HW mapped LR in vgic_sync_hwirq() for this)
> >>
> >> So that contradicts:
> >>
> >> +  - On guest EOI, the *physical distributor* active bit gets cleared,
> >> +but the LR.Active is left untouched (set).
> >>
> >> This is the main point I was actually wondering about: I cannot confirm
> >> this statement. In my tests the LR state and the physical dist state
> >> always correspond, as excepted by reading the spec.
> >>
> >> I reckon that these observations are mostly independent from the actual
> >> KVM code, as I try to observe hardware state (physical distributor and
> >> LRs) before KVM tinkers with them.
> > 
> > ok, I got this paragraph from Marc, so we really need to ask him?  Which
> > hardware are you seeing this behavior on?  Perhaps implementations vary
> > on this point?
> 
> I checked this on Midway and Juno. Both have a GIC-400, but I don't have
> access to any other GIC implementations.
> I added the two BUG_ONs shown below to prove that assumption.
> 
> Eric, I've been told you observed the behaviour with the GIC not syncing
> LR and phys state for a mapped HWIRQ which was not the timer.
> Can you reproduce this? Does it complain with the patch below?
> 
> diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
> index 5942ce9..7fac16e 100644
> --- a/virt/kvm/arm/vgic.c
> +++ b/virt/kvm/arm/vgic.c
> @@ -1459,9 +1459,12 @@ static bool vgic_sync_hwirq(struct kvm_vcpu
>IRQCHIP_STATE_ACTIVE,
>false);
>   WARN_ON(ret);
> + BUG_ON(!(vlr.state & 3));
>   return false;
>   }
> 
> + BUG_ON(vlr.state & 3);
> +
>   return process_queued_irq(vcpu, lr, vlr);
>  }
> 
> > 
> > I have no objections removing this point from the doc though, I'm just
> > relaying information on this one.
> 
> I see, I talked with Marc and I am about to gather more data with the
> above patch to prove that this never happens.
> 
> >>
> >> ...
> >>
> >>>
>  Is this an observation, an implementation bug or is this mentioned in
>  the spec? Needing to spoon-feed the VGIC by doing its job sounds a bit
>  awkward to me.
> >>>
> >>> What do you mean?  How are we spoon-feeding the VGIC?
> >>
> >> By looking at the physical dist state and all LRs and clearing the LR we
> >> do what the GIC is actually supposed to do for us - and what it actually
> >> does according to my observations.
> >>
> >> The point is that patch 1 in my ITS emulation series is reworking the LR
> >> handling and this patch was based on assumptions that seem to be no
> >> longer true (i.e. we don't care about inactive LRs except for our LR
> >> mapping code). So I want to be 

Re: [PATCH] KVM: nVMX: nested VPID emulation

2015-09-15 Thread Wanpeng Li

On 9/14/15 10:54 PM, Jan Kiszka wrote:

On 2015-09-14 14:52, Wanpeng Li wrote:

VPID is used to tag address space and avoid a TLB flush. Currently L0 uses
the same VPID to run L1 and all its guests. KVM flushes VPID when switching
between L1 and L2.

This patch advertises VPID to the L1 hypervisor, then address space of L1 and
L2 can be separately treated and avoid TLB flush when switching between L1 and
L2. This patch gets ~3x performance improvement for lmbench 8p/64k ctxsw.

Signed-off-by: Wanpeng Li 
---
  arch/x86/kvm/vmx.c | 39 ---
  1 file changed, 32 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index da1590e..06bc31e 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1157,6 +1157,11 @@ static inline bool 
nested_cpu_has_virt_x2apic_mode(struct vmcs12 *vmcs12)
return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE);
  }
  
+static inline bool nested_cpu_has_vpid(struct vmcs12 *vmcs12)

+{
+   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_VPID);
+}
+
  static inline bool nested_cpu_has_apic_reg_virt(struct vmcs12 *vmcs12)
  {
return nested_cpu_has2(vmcs12, SECONDARY_EXEC_APIC_REGISTER_VIRT);
@@ -2471,6 +2476,7 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
*vmx)
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
SECONDARY_EXEC_RDTSCP |
SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
+   SECONDARY_EXEC_ENABLE_VPID |
SECONDARY_EXEC_APIC_REGISTER_VIRT |
SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
SECONDARY_EXEC_WBINVD_EXITING |
@@ -4160,7 +4166,7 @@ static void allocate_vpid(struct vcpu_vmx *vmx)
int vpid;
  
  	vmx->vpid = 0;

-   if (!enable_vpid)
+   if (!enable_vpid || is_guest_mode(&vmx->vcpu))
return;
spin_lock(&vmx_vpid_lock);
vpid = find_first_zero_bit(vmx_vpid_bitmap, VMX_NR_VPIDS);
@@ -6738,6 +6744,14 @@ static int handle_vmclear(struct kvm_vcpu *vcpu)
}
vmcs12 = kmap(page);
vmcs12->launch_state = 0;
+   if (enable_vpid) {
+   if (nested_cpu_has_vpid(vmcs12)) {
+   spin_lock(&vmx_vpid_lock);
+   if (vmcs12->virtual_processor_id != 0)
+   __clear_bit(vmcs12->virtual_processor_id, vmx_vpid_bitmap);
+   spin_unlock(&vmx_vpid_lock);

Maybe enhance free_vpid (and also allocate_vpid) to work generically and
let the caller decide where to take the vpid from or where to store it?


Good idea.




+   }
+   }
kunmap(page);
nested_release_page(page);
  
@@ -9189,6 +9203,7 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)

  {
struct vcpu_vmx *vmx = to_vmx(vcpu);
u32 exec_control;
+   int vpid;
  
  	vmcs_write16(GUEST_ES_SELECTOR, vmcs12->guest_es_selector);

vmcs_write16(GUEST_CS_SELECTOR, vmcs12->guest_cs_selector);
@@ -9438,13 +9453,21 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, 
struct vmcs12 *vmcs12)
else
vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
  
+

if (enable_vpid) {
-   /*
-* Trivially support vpid by letting L2s share their parent
-* L1's vpid. TODO: move to a more elaborate solution, giving
-* each L2 its own vpid and exposing the vpid feature to L1.
-*/
-   vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
+   if (nested_cpu_has_vpid(vmcs12)) {
+   if (vmcs12->virtual_processor_id == 0) {
+   spin_lock(&vmx_vpid_lock);
+   vpid = find_first_zero_bit(vmx_vpid_bitmap, VMX_NR_VPIDS);
+   if (vpid < VMX_NR_VPIDS)
+   __set_bit(vpid, vmx_vpid_bitmap);
+   spin_unlock(&vmx_vpid_lock);
+   vmcs_write16(VIRTUAL_PROCESSOR_ID, vpid);

It's a bit non-obvious that vpid == VMX_NR_VPIDS (no free vpids) will
lead to vpid == 0 when writing VIRTUAL_PROCESSOR_ID. You should leave at
least a comment. Or generalize allocate_vpid as that one is already
clearer in this regard.


Ditto.




+   } else
+   vmcs_write16(VIRTUAL_PROCESSOR_ID, vmcs12->virtual_processor_id);
+   } else
+   vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
+
vmx_flush_tlb(vcpu);
}
  
@@ -9973,6 +9996,8 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,

vmcs12_save_pending_event(vcpu, vmcs12);
}
  
+	if (nested_cpu_has_vpid(vmcs12))

+   vmcs12->virtual_processor_id = vmcs_read16(VIRTUAL_PROCESSOR_ID);
/*
 * Drop what we picked up for L2 

Re: [RFC PATCH] os-android: Add support to android platform, built by ndk-r10

2015-09-15 Thread Paolo Bonzini


On 15/09/2015 04:11, Houcheng Lin wrote:
> The OS dependent code for android that implement functions missing in bionic 
> C, including:
> - getdtablesize(): call getrlimit() instead.

This is okay and can be done unconditionally (introduce a new
qemu_getdtablesize function that is defined in util/oslib-posix.c).

> - a_ptsname(): call pstname_r() instead.

This is not necessary, see below.

> - sigtimewait(): call __rt_sigtimewait() instead.
> - lockf(): not see this feature in android, directly return -1.
> - shm_open(): not see this feature in android, directly return -1.

This is not okay.  Please fix your libc instead.

> The kernel headers for android include two kernel header missing in android: 
> scsi/sg.h and
> sys/io.h.

For sys/io.h, we can just remove the inclusion.  It is not necessary.

scsi/sg.h should be exported by the Linux kernel, so that we can use
scripts/update-linux-headers.sh to copy it from QEMU.  I've sent a Linux
kernel patch and CCed you.

More comments follow:

> diff --git a/configure b/configure
> index 5c06663..3ff6ffa 100755
> --- a/configure
> +++ b/configure
> @@ -566,7 +566,6 @@ fi
>  
>  # host *BSD for user mode
>  HOST_VARIANT_DIR=""
> -
>  case $targetos in
>  CYGWIN*)
>mingw32="yes"
> @@ -692,9 +691,23 @@ Haiku)
>vhost_net="yes"
>vhost_scsi="yes"
>QEMU_INCLUDES="-I\$(SRC_PATH)/linux-headers -I$(pwd)/linux-headers 
> $QEMU_INCLUDES"
> +  case $cross_prefix in
> +*android*)
> +  android="yes"
> +  guest_agent="no"

You should instead disable the guest agent on your configure command line.

> +  QEMU_INCLUDES="-I\$(SRC_PATH)/include/android $QEMU_INCLUDES"
> +;;
> +*)
> +;;
> +  esac
>  ;;
>  esac
>  
> +if [ "$android" = "yes" ] ; then
> +  QEMU_CFLAGS="-DANDROID $QEMU_CFLAGS"

If you have CONFIG_ANDROID, you do not need -DANDROID.

> +  LIBS="-lglib-2.0 -lgthread-2.0 -lz -lpixman-1 -lintl -liconv -lc $LIBS"
> +  libs_qga="-lglib-2.0 -lgthread-2.0 -lz -lpixman-1 -lintl -liconv -lc"

This should not be necessary, QEMU uses pkg-config.

> +fi
>  if [ "$bsd" = "yes" ] ; then
>if [ "$darwin" != "yes" ] ; then
>  bsd_user="yes"
> @@ -1736,7 +1749,14 @@ fi
>  # pkg-config probe
>  
>  if ! has "$pkg_config_exe"; then
> -  error_exit "pkg-config binary '$pkg_config_exe' not found"
> +  case $cross_prefix in
> +*android*)
> + pkg_config_exe=scripts/android-pkg-config

Neither should this.  Your cross-compilation environment is not
correctly set up if you do not have a pkg-config executable.  If you
want to use a wrapper, you can specify it with the PKG_CONFIG
environment variable.  But it need not be maintained in the QEMU
repository, because QEMU assumes a complete cross-compilation environment.

> + ;;
> +*)
> + error_exit "pkg-config binary '$pkg_config_exe' not found"
> + ;;
> +  esac
>  fi
>  
>  ##
> @@ -3764,7 +3784,7 @@ elif compile_prog "" "$pthread_lib -lrt" ; then
>  fi
>  
>  if test "$darwin" != "yes" -a "$mingw32" != "yes" -a "$solaris" != yes -a \
> -"$aix" != "yes" -a "$haiku" != "yes" ; then
> +"$aix" != "yes" -a "$haiku" != "yes" -a "$android" != "yes" ; then
>  libs_softmmu="-lutil $libs_softmmu"
>  fi
>  
> @@ -4709,6 +4729,10 @@ if test "$linux" = "yes" ; then
>echo "CONFIG_LINUX=y" >> $config_host_mak
>  fi
>  
> +if test "$android" = "yes" ; then
> +  echo "CONFIG_ANDROID=y" >> $config_host_mak
> +fi
> +
>  if test "$darwin" = "yes" ; then
>echo "CONFIG_DARWIN=y" >> $config_host_mak
>  fi
> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> index ab3c876..1ba22be 100644
> --- a/include/qemu/osdep.h
> +++ b/include/qemu/osdep.h
> @@ -59,6 +59,10 @@
>  #define WEXITSTATUS(x) (x)
>  #endif
>  
> +#ifdef ANDROID
> + #include "sysemu/os-android.h"
> +#endif
> +
>  #ifdef _WIN32
>  #include "sysemu/os-win32.h"
>  #endif
> diff --git a/include/sysemu/os-android.h b/include/sysemu/os-android.h
> new file mode 100644
> index 000..7f73777
> --- /dev/null
> +++ b/include/sysemu/os-android.h
> @@ -0,0 +1,35 @@
> +#ifndef QEMU_OS_ANDROID_H
> +#define QEMU_OS_ANDROID_H
> +
> +#include 
> +
> +/*
> + * For include the basename prototyping in android.
> + */
> +#include 

These two includes can be added to sysemu/os-posix.h.

> +extern int __rt_sigtimedwait(const sigset_t *uthese, siginfo_t *uinfo, \
> + const struct timespec *uts, size_t sigsetsize);
> +
> +int getdtablesize(void);
> +int lockf(int fd, int cmd, off_t len);
> +int sigtimedwait(const sigset_t *set, siginfo_t *info, \
> + const struct timespec *ts);
> +int shm_open(const char *name, int oflag, mode_t mode);
> +char *a_ptsname(int fd);
> +
> +#define F_TLOCK 0
> +#define IOV_MAX 256

libc should really be fixed for all of these (except getdtablesize and
a_ptsname).  The way you're working around it is introducing subtle
bugs, for example the pidfile is _not_ locked.

> 

Re: [PATCH] KVM: nVMX: nested VPID emulation

2015-09-15 Thread Wanpeng Li

On 9/15/15 12:08 AM, Bandan Das wrote:

Wanpeng Li  writes:


VPID is used to tag address space and avoid a TLB flush. Currently L0 uses
the same VPID to run L1 and all its guests. KVM flushes VPID when switching
between L1 and L2.

This patch advertises VPID to the L1 hypervisor, then address space of L1 and
L2 can be separately treated and avoid TLB flush when switching between L1 and
L2. This patch gets ~3x performance improvement for lmbench 8p/64k ctxsw.

TLB flush does context invalidation and while that should result in
some improvement, I never expected a 3x improvement for any workload!
Interesting :)


The result still looks good when test v2.




Signed-off-by: Wanpeng Li 
---
  arch/x86/kvm/vmx.c | 39 ---
  1 file changed, 32 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index da1590e..06bc31e 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1157,6 +1157,11 @@ static inline bool 
nested_cpu_has_virt_x2apic_mode(struct vmcs12 *vmcs12)
return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE);
  }
  
+static inline bool nested_cpu_has_vpid(struct vmcs12 *vmcs12)

+{
+   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_VPID);
+}
+
  static inline bool nested_cpu_has_apic_reg_virt(struct vmcs12 *vmcs12)
  {
return nested_cpu_has2(vmcs12, SECONDARY_EXEC_APIC_REGISTER_VIRT);
@@ -2471,6 +2476,7 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
*vmx)
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
SECONDARY_EXEC_RDTSCP |
SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
+   SECONDARY_EXEC_ENABLE_VPID |
SECONDARY_EXEC_APIC_REGISTER_VIRT |
SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
SECONDARY_EXEC_WBINVD_EXITING |
@@ -4160,7 +4166,7 @@ static void allocate_vpid(struct vcpu_vmx *vmx)
int vpid;
  
  	vmx->vpid = 0;

-   if (!enable_vpid)
+   if (!enable_vpid || is_guest_mode(&vmx->vcpu))
return;
spin_lock(&vmx_vpid_lock);
vpid = find_first_zero_bit(vmx_vpid_bitmap, VMX_NR_VPIDS);
@@ -6738,6 +6744,14 @@ static int handle_vmclear(struct kvm_vcpu *vcpu)
}
vmcs12 = kmap(page);
vmcs12->launch_state = 0;
+   if (enable_vpid) {
+   if (nested_cpu_has_vpid(vmcs12)) {
+   spin_lock(&vmx_vpid_lock);
+   if (vmcs12->virtual_processor_id != 0)
+   __clear_bit(vmcs12->virtual_processor_id, vmx_vpid_bitmap);
+   spin_unlock(&vmx_vpid_lock);
+   }
+   }
kunmap(page);
nested_release_page(page);

I don't think this is enough, we should also check for set "nested" bits
in free_vpid() and clear them. There should be some sort of a mapping between
the nested guest vpid and the actual vpid so that we can just clear those bits.


Agreed.




@@ -9189,6 +9203,7 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct 
vmcs12 *vmcs12)
  {
struct vcpu_vmx *vmx = to_vmx(vcpu);
u32 exec_control;
+   int vpid;
  
  	vmcs_write16(GUEST_ES_SELECTOR, vmcs12->guest_es_selector);

vmcs_write16(GUEST_CS_SELECTOR, vmcs12->guest_cs_selector);
@@ -9438,13 +9453,21 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, 
struct vmcs12 *vmcs12)
else
vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
  
+

Empty space here.


if (enable_vpid) {
-   /*
-* Trivially support vpid by letting L2s share their parent
-* L1's vpid. TODO: move to a more elaborate solution, giving
-* each L2 its own vpid and exposing the vpid feature to L1.
-*/
-   vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
+   if (nested_cpu_has_vpid(vmcs12)) {
+   if (vmcs12->virtual_processor_id == 0) {

Ok, so if we advertise vpid to the nested hypervisor, isn't it going to
attempt writing this field when setting up? At least
that's what Linux does, no?


Agreed, I do the allocation of vpid02 during initialization in v2.




+   spin_lock(&vmx_vpid_lock);
+   vpid = find_first_zero_bit(vmx_vpid_bitmap, VMX_NR_VPIDS);
+   if (vpid < VMX_NR_VPIDS)
+   __set_bit(vpid, vmx_vpid_bitmap);
+   spin_unlock(&vmx_vpid_lock);
+   vmcs_write16(VIRTUAL_PROCESSOR_ID, vpid);
+   } else
+   vmcs_write16(VIRTUAL_PROCESSOR_ID, vmcs12->virtual_processor_id);
+   } else
+   vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
+

I guess L1 shouldn't know what vpid L0 chose to run L2. If L1 vmreads,
it should get what it expects 

[PATCH 05/15] arm64: Handle 4 level page table for swapper

2015-09-15 Thread Suzuki K. Poulose
From: "Suzuki K. Poulose" 

At the moment, we only support a maximum of 3-level page table for
swapper. With 48bit VA, 64K has only 3 levels and 4K uses section
mapping. Add support for a 4-level page table for swapper, needed
by 16K pages.

Cc: Ard Biesheuvel 
Cc: Mark Rutland 
Cc: Catalin Marinas 
Cc: Will Deacon 
Signed-off-by: Suzuki K. Poulose 
Reviewed-by: Ard Biesheuvel 
Tested-by: Ard Biesheuvel 
---
 arch/arm64/kernel/head.S |5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 46670bf..01b8e58 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -271,7 +271,10 @@ ENDPROC(preserve_boot_args)
  */
.macro  create_pgd_entry, tbl, virt, tmp1, tmp2
create_table_entry \tbl, \virt, PGDIR_SHIFT, PTRS_PER_PGD, \tmp1, \tmp2
-#if SWAPPER_PGTABLE_LEVELS == 3
+#if SWAPPER_PGTABLE_LEVELS > 3
+   create_table_entry \tbl, \virt, PUD_SHIFT, PTRS_PER_PUD, \tmp1, \tmp2
+#endif
+#if SWAPPER_PGTABLE_LEVELS > 2
create_table_entry \tbl, \virt, SWAPPER_TABLE_SHIFT, PTRS_PER_PTE, 
\tmp1, \tmp2
 #endif
.endm
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V6 6/6] kvm: add fast mmio capabilitiy

2015-09-15 Thread Paolo Bonzini


On 15/09/2015 08:41, Jason Wang wrote:
> +With KVM_CAP_FAST_MMIO, a zero length mmio eventfd is allowed for
> +kernel to ignore the length of guest write and get a possible faster
> +response. Note the speedup may only work on some specific
> +architectures and setups. Otherwise, it's as fast as wildcard mmio
> +eventfd.

I don't really like tying the capability to MMIO, especially since
zero length ioeventfd is already accepted for virtio-ccw.

What about the following?

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 7a3cb48a644d..247944071cc8 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1627,11 +1627,10 @@ to the registered address is equal to datamatch in 
struct kvm_ioeventfd.
 For virtio-ccw devices, addr contains the subchannel id and datamatch the
 virtqueue index.
 
-With KVM_CAP_FAST_MMIO, a zero length mmio eventfd is allowed for
-kernel to ignore the length of guest write and get a possible faster
-response. Note the speedup may only work on some specific
-architectures and setups. Otherwise, it's as fast as wildcard mmio
-eventfd.
+With KVM_CAP_IOEVENTFD_ANY_LENGTH, a zero length ioeventfd is allowed, and
+the kernel will ignore the length of guest write and get a faster vmexit.
+The speedup may only apply to specific architectures, but the ioeventfd will
+work anyway.
 
 4.60 KVM_DIRTY_TLB
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index b4f6aeaf94a6..03f3618612aa 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -830,7 +830,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_GUEST_DEBUG_HW_BPS 119
 #define KVM_CAP_GUEST_DEBUG_HW_WPS 120
 #define KVM_CAP_SPLIT_IRQCHIP 121
-#define KVM_CAP_FAST_MMIO 122
+#define KVM_CAP_IOEVENTFD_ANY_LENGTH 122
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 79db45336e3a..1dc8c45d2270 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -914,9 +914,7 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd 
*args)
return -EINVAL;
 
/* ioeventfd with no length can't be combined with DATAMATCH */
-   if (!args->len &&
-   args->flags & (KVM_IOEVENTFD_FLAG_PIO |
-  KVM_IOEVENTFD_FLAG_DATAMATCH))
+   if (!args->len && (args->flags & KVM_IOEVENTFD_FLAG_DATAMATCH))
return -EINVAL;
 
ret = kvm_assign_ioeventfd_idx(kvm, bus_idx, args);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0780d970d087..0b48aadedcee 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2717,7 +2717,7 @@ static long kvm_vm_ioctl_check_extension_generic(struct 
kvm *kvm, long arg)
case KVM_CAP_IRQFD:
case KVM_CAP_IRQFD_RESAMPLE:
 #endif
-   case KVM_CAP_FAST_MMIO:
+   case KVM_CAP_IOEVENTFD_ANY_LENGTH:
case KVM_CAP_CHECK_EXTENSION_VM:
return 1;
 #ifdef CONFIG_HAVE_KVM_IRQ_ROUTING


Paolo


Re: [PATCH V6 0/6] Fast mmio eventfd fixes

2015-09-15 Thread Paolo Bonzini


On 15/09/2015 08:41, Jason Wang wrote:
> Hi:
> 
> This series fixes two issues of fast mmio eventfd:
> 
> 1) A single iodev instance were registerd on two buses: KVM_MMIO_BUS
>and KVM_FAST_MMIO_BUS. This will cause a double free in
>ioeventfd_destructor()
> 2) A zero length iodev on KVM_MMIO_BUS will never be found by
>kvm_io_bus_cmp(). This leads to e.g. the eventfd being trapped by
>qemu instead of the host.
> 
> 1 is fixed by allocating two instances of iodev and introduce a new
> capability for userspace. 2 is fixed by ignore the actual length if
> the length of iodev is zero in kvm_io_bus_cmp().
> 
> Please review.

Applied to kvm/queue and will send patches 1-4 for 4.3-rc.  Thanks!

Paolo

> Changes from V5:
> - move patch of explicitly checking for KVM_MMIO_BUS to patch 1 and
>   remove the unnecessary checks
> - even more grammar and typo fixes
> - rabase to kvm.git
> - document KVM_CAP_FAST_MMIO
> 
> Changes from V4:
> - move the location of kvm_assign_ioeventfd() in patch 1 which reduce
>   the change set.
> - commit log typo fixes
> - switch to use kvm_deassign_ioeventfd_id) when fail to register to
>   fast mmio bus
> - change kvm_io_bus_cmp() as Paolo's suggestions
> - introduce a new capability to avoid new userspace crash old kernel
> - add a new patch that only try to register mmio eventfd on fast mmio
>   bus
> 
> Changes from V3:
> 
> - Don't do search on two buses when trying to do write on
>   KVM_MMIO_BUS. This fixes a small regression found by vmexit.flat.
> - Since we don't do search on two buses, change kvm_io_bus_cmp() to
>   let it can find zero length iodevs.
> - Fix the unnecessary lines in tracepoint patch.
> 
> Changes from V2:
> - Tweak styles and comment suggested by Cornelia.
> 
> Changes from v1:
> - change ioeventfd_bus_from_flags() to return KVM_FAST_MMIO_BUS when
>   needed to save lots of unnecessary changes.
> 
> Jason Wang (6):
>   kvm: don't try to register to KVM_FAST_MMIO_BUS for non mmio eventfd
>   kvm: factor out core eventfd assign/deassign logic
>   kvm: fix double free for fast mmio eventfd
>   kvm: fix zero length mmio searching
>   kvm: add tracepoint for fast mmio
>   kvm: add fast mmio capabilitiy
> 
>  Documentation/virtual/kvm/api.txt |   7 ++-
>  arch/x86/kvm/trace.h  |  18 ++
>  arch/x86/kvm/vmx.c|   1 +
>  arch/x86/kvm/x86.c|   1 +
>  include/uapi/linux/kvm.h  |   1 +
>  virt/kvm/eventfd.c| 124 
> ++
>  virt/kvm/kvm_main.c   |  20 +-
>  7 files changed, 118 insertions(+), 54 deletions(-)
> 


Re: [PATCH v2 08/18] nvdimm: init backend memory mapping and config data area

2015-09-15 Thread Paolo Bonzini


On 01/09/2015 11:14, Stefan Hajnoczi wrote:
>> > 
>> > When I was digging into live migration code, I noticed that the same MR
>> > name may cause a duplicate "idstr"; please refer to qemu_ram_set_idstr().
>> > 
>> > Since nvdimm devices do not have parent-bus, it will trigger the abort() 
>> > in that
>> > function.
> I see.  The other devices that use a constant name are on a bus so the
> abort doesn't trigger.

However, the MR name must be the same across the two machines.  Indices
are not friendly to hotplug.  Even though hotplug isn't supported now,
we should prepare and try not to change migration format when we support
hotplug in the future.

Is there any other fixed value that we can use, for example the base
address of the NVDIMM?

Paolo


[PATCH] KVM: arm64: remove all traces of the ThumbEE registers

2015-09-15 Thread Will Deacon
Although the ThumbEE registers and traps were present in earlier
versions of the v8 architecture, it was retrospectively removed and so
we can do the same.

Cc: Marc Zyngier 
Signed-off-by: Will Deacon 
---
 arch/arm64/include/asm/kvm_arm.h |  1 -
 arch/arm64/include/asm/kvm_asm.h |  4 +---
 arch/arm64/kvm/hyp.S | 22 --
 arch/arm64/kvm/sys_regs.c|  7 ---
 4 files changed, 5 insertions(+), 29 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index 7605e095217f..cdc9888134e5 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -168,7 +168,6 @@
 #define VTTBR_VMID_MASK  (UL(0xFF) << VTTBR_VMID_SHIFT)
 
 /* Hyp System Trap Register */
-#define HSTR_EL2_TTEE  (1 << 16)
 #define HSTR_EL2_T(x)  (1 << x)
 
 /* Hyp Coproccessor Trap Register Shifts */
diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 67fa0de3d483..5e377101f919 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -53,9 +53,7 @@
 #defineIFSR32_EL2  25  /* Instruction Fault Status Register */
 #defineFPEXC32_EL2 26  /* Floating-Point Exception Control Register */
 #defineDBGVCR32_EL227  /* Debug Vector Catch Register */
-#defineTEECR32_EL1 28  /* ThumbEE Configuration Register */
-#defineTEEHBR32_EL129  /* ThumbEE Handler Base Register */
-#defineNR_SYS_REGS 30
+#defineNR_SYS_REGS 28
 
 /* 32bit mapping */
 #define c0_MPIDR   (MPIDR_EL1 * 2) /* MultiProcessor ID Register */
diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
index 6addf97aadb3..c4016d411f4a 100644
--- a/arch/arm64/kvm/hyp.S
+++ b/arch/arm64/kvm/hyp.S
@@ -433,20 +433,13 @@
mrs x5, ifsr32_el2
stp x4, x5, [x3]
 
-   skip_fpsimd_state x8, 3f
+   skip_fpsimd_state x8, 2f
mrs x6, fpexc32_el2
str x6, [x3, #16]
-3:
-   skip_debug_state x8, 2f
+2:
+   skip_debug_state x8, 1f
mrs x7, dbgvcr32_el2
str x7, [x3, #24]
-2:
-   skip_tee_state x8, 1f
-
-   add x3, x2, #CPU_SYSREG_OFFSET(TEECR32_EL1)
-   mrs x4, teecr32_el1
-   mrs x5, teehbr32_el1
-   stp x4, x5, [x3]
 1:
 .endm
 
@@ -466,16 +459,9 @@
msr dacr32_el2, x4
msr ifsr32_el2, x5
 
-   skip_debug_state x8, 2f
+   skip_debug_state x8, 1f
ldr x7, [x3, #24]
msr dbgvcr32_el2, x7
-2:
-   skip_tee_state x8, 1f
-
-   add x3, x2, #CPU_SYSREG_OFFSET(TEECR32_EL1)
-   ldp x4, x5, [x3]
-   msr teecr32_el1, x4
-   msr teehbr32_el1, x5
 1:
 .endm
 
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index b41607d270ac..6c35e49757d8 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -539,13 +539,6 @@ static const struct sys_reg_desc sys_reg_descs[] = {
{ Op0(0b10), Op1(0b000), CRn(0b0111), CRm(0b1110), Op2(0b110),
  trap_dbgauthstatus_el1 },
 
-   /* TEECR32_EL1 */
-   { Op0(0b10), Op1(0b010), CRn(0b), CRm(0b), Op2(0b000),
- NULL, reset_val, TEECR32_EL1, 0 },
-   /* TEEHBR32_EL1 */
-   { Op0(0b10), Op1(0b010), CRn(0b0001), CRm(0b), Op2(0b000),
- NULL, reset_val, TEEHBR32_EL1, 0 },
-
/* MDCCSR_EL1 */
{ Op0(0b10), Op1(0b011), CRn(0b), CRm(0b0001), Op2(0b000),
  trap_raz_wi },
-- 
2.1.4



Re: [PATCH] KVM: arm64: remove all traces of the ThumbEE registers

2015-09-15 Thread Peter Maydell
On 15 September 2015 at 17:15, Will Deacon  wrote:
> Although the ThumbEE registers and traps were present in earlier
> versions of the v8 architecture, it was retrospectively removed and so
> we can do the same.
>
> Cc: Marc Zyngier 
> Signed-off-by: Will Deacon 

> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> index b41607d270ac..6c35e49757d8 100644
> --- a/arch/arm64/kvm/sys_regs.c
> +++ b/arch/arm64/kvm/sys_regs.c
> @@ -539,13 +539,6 @@ static const struct sys_reg_desc sys_reg_descs[] = {
> { Op0(0b10), Op1(0b000), CRn(0b0111), CRm(0b1110), Op2(0b110),
>   trap_dbgauthstatus_el1 },
>
> -   /* TEECR32_EL1 */
> -   { Op0(0b10), Op1(0b010), CRn(0b), CRm(0b), Op2(0b000),
> - NULL, reset_val, TEECR32_EL1, 0 },
> -   /* TEEHBR32_EL1 */
> -   { Op0(0b10), Op1(0b010), CRn(0b0001), CRm(0b), Op2(0b000),
> - NULL, reset_val, TEEHBR32_EL1, 0 },
> -

I guess this is a VM migration compatibility break between kernels
without this patch and kernels with it? I think that's OK at this
point, but it would be nice to mention it in the commit message.

thanks
-- PMM


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-15 Thread Paolo Bonzini


On 15/09/2015 15:36, Christian Borntraeger wrote:
> I am wondering why the old code behaved in such fatal ways. Is there
> some interaction between waiting for a reschedule in the
> synchronize_sched writer and some fork code actually waiting for the
> read side to get the lock together with some rescheduling going on
> waiting for a lock that fork holds? lockdep does not give me an hints
> so I have no clue :-(

It may just be consuming too much CPU time.  kernel/rcu/tree.c warns
about it:

 * if you are using synchronize_sched_expedited() in a loop, please
 * restructure your code to batch your updates, and then use a single
 * synchronize_sched() instead.

and you may remember that in KVM we switched from RCU to SRCU exactly to
avoid userspace-controlled synchronize_rcu_expedited().

In fact, I would say that any userspace-controlled call to *_expedited()
is a bug waiting to happen and a bad idea---because userspace can, with
little effort, end up calling it in a loop.

Paolo


Re: [Qemu-devel] [PATCH v2 08/18] nvdimm: init backend memory mapping and config data area

2015-09-15 Thread Paolo Bonzini


On 07/09/2015 16:11, Igor Mammedov wrote:
> 
> here is common concepts that could be reused.
>   - on physical system both DIMM and NVDIMM devices use
> the same slots. We could share QEMU's '-m slots' option between
> both devices. An alternative to not sharing would be to introduce
> '-machine nvdimm_slots' option.
> And yes, we need to know the number of NVDIMMs to describe
> them all in the ACPI table (taking into account possible future
> hotplug of NVDIMM devices).
> I'd go the same way as on real hardware and make them share the same slots.
>   - they share the same physical address space and limits
> on how much memory system can handle. So I'd suggest sharing existing
> '-m maxmem' option and reuse hotplug_memory address space.

I agree, and the slot number also provide a nice way to build a
consistent memory region name across multiple systems.
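A slot-derived name could be built roughly like this. This is a sketch of the idea only — the `nvdimm_mr_name` helper and the "nvdimm.N" pattern are assumptions for illustration, not QEMU's actual naming API — the point being that the slot number is stable across source and destination, unlike a per-boot allocation index.

```c
#include <stdio.h>

/* Derive the memory region name from the stable slot number, so both
 * sides of a migration compute the identical "idstr". */
static void nvdimm_mr_name(char *buf, size_t len, unsigned int slot)
{
    snprintf(buf, len, "nvdimm.%u", slot);
}
```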

Paolo


KVM call for agenda for 2015-09-29

2015-09-15 Thread Juan Quintela

Hi

Please, send any topic that you are interested in covering.

At the end of Monday I will send an email with the agenda or the
cancellation of the call, so hurry up.

After discussions at the QEMU Summit, we are going to keep a KVM call
always open where you can add topics.

 Call details:

By popular demand, a google calendar public entry with it

  
https://www.google.com/calendar/embed?src=dG9iMXRqcXAzN3Y4ZXZwNzRoMHE4a3BqcXNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ

(Let me know if you have any problems with the calendar entry.  I just
gave up about getting right at the same time CEST, CET, EDT and DST).

If you need phone number details,  contact me privately

Thanks, Juan.


Re: [PATCH v2 2/8] KVM: x86: add pcommit support

2015-09-15 Thread Paolo Bonzini


On 09/09/2015 08:05, Xiao Guangrong wrote:
> + if (!guest_cpuid_has_pcommit(vcpu) && nested)
> + vmx->nested.nested_vmx_secondary_ctls_high &=
> + ~SECONDARY_EXEC_PCOMMIT;

It is legal to set CPUID multiple times, so I think we need

if (static_cpu_has(X86_FEATURE_PCOMMIT) && nested) {
if (guest_cpuid_has_pcommit(vcpu))
vmx->nested.nested_vmx_secondary_ctls_high |=
SECONDARY_EXEC_PCOMMIT;
else
vmx->nested.nested_vmx_secondary_ctls_high &=
~SECONDARY_EXEC_PCOMMIT;
}

The INVPCID exit probably should be handled the same way as PCOMMIT;
moving both to a new nested_vmx_update_ctls_msrs probably makes sense.

This can be done on top of this series.
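The general shape of the suggestion — since CPUID can be set multiple times, the control bit must be toggled both ways based on the current predicate, not only cleared — can be sketched with a generic helper (the name `update_ctl` is illustrative, not a kernel function):

```c
#include <stdint.h>
#include <stdbool.h>

/* Set or clear `bit` in `ctls` according to `enabled`, so repeated
 * calls always converge to the state implied by the predicate. */
static inline uint32_t update_ctl(uint32_t ctls, uint32_t bit, bool enabled)
{
    return enabled ? (ctls | bit) : (ctls & ~bit);
}
```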

Paolo


[PATCH 15/15] arm64: 36 bit VA

2015-09-15 Thread Suzuki K. Poulose
From: "Suzuki K. Poulose" 

36bit VA lets us use 2 level page tables while limiting the
available address space to 64GB.
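The 2-level/64GB arithmetic can be checked with a back-of-the-envelope helper mirroring the kernel's level formula (a sketch, not the ARM64_HW_PGTABLE_LEVELS() macro itself): each level resolves page_shift - 3 bits, since a page holds 2^(page_shift - 3) eight-byte descriptors.

```c
/* levels = ceil((va_bits - page_shift) / (page_shift - 3)) */
static inline int pgtable_levels(int va_bits, int page_shift)
{
    int bits_per_level = page_shift - 3;
    return (va_bits - page_shift + bits_per_level - 1) / bits_per_level;
}
```

With 16K pages (page_shift 14) and a 36-bit VA this gives 2 levels, and 2^36 bytes is exactly the 64GB limit mentioned above.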

Cc: Mark Rutland 
Cc: Catalin Marinas 
Cc: Will Deacon 
Signed-off-by: Suzuki K. Poulose 
Reviewed-by: Ard Biesheuvel 
Tested-by: Ard Biesheuvel 
---
 arch/arm64/Kconfig |8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 2253819..3560241 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -168,6 +168,7 @@ config FIX_EARLYCON_MEM
 
 config PGTABLE_LEVELS
int
+   default 2 if ARM64_16K_PAGES && ARM64_VA_BITS_36
default 2 if ARM64_64K_PAGES && ARM64_VA_BITS_42
default 3 if ARM64_64K_PAGES && ARM64_VA_BITS_48
default 3 if ARM64_4K_PAGES && ARM64_VA_BITS_39
@@ -373,6 +374,10 @@ choice
  space sizes. The level of translation table is determined by
  a combination of page size and virtual address space size.
 
+config ARM64_VA_BITS_36
+   bool "36-bit"
+   depends on ARM64_16K_PAGES
+
 config ARM64_VA_BITS_39
bool "39-bit"
depends on ARM64_4K_PAGES
@@ -392,6 +397,7 @@ endchoice
 
 config ARM64_VA_BITS
int
+   default 36 if ARM64_VA_BITS_36
default 39 if ARM64_VA_BITS_39
default 42 if ARM64_VA_BITS_42
default 47 if ARM64_VA_BITS_47
@@ -465,7 +471,7 @@ config ARCH_WANT_GENERAL_HUGETLB
def_bool y
 
 config ARCH_WANT_HUGE_PMD_SHARE
-   def_bool y if ARM64_4K_PAGES || ARM64_16K_PAGES
+   def_bool y if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
 
 config HAVE_ARCH_TRANSPARENT_HUGEPAGE
def_bool y
-- 
1.7.9.5



[PATCH 12/15] arm: kvm: Move fake PGD handling to arch specific files

2015-09-15 Thread Suzuki K. Poulose
From: "Suzuki K. Poulose" 

Rearrange the code for fake pgd handling, which is applicable
only to ARM64. The intention is to keep the common code cleaner,
unaware of the underlying hacks.

Cc: kvm...@lists.cs.columbia.edu
Cc: christoffer.d...@linaro.org
Cc: marc.zyng...@arm.com
Signed-off-by: Suzuki K. Poulose 
---
 arch/arm/include/asm/kvm_mmu.h   |7 ++
 arch/arm/kvm/mmu.c   |   44 +-
 arch/arm64/include/asm/kvm_mmu.h |   43 +
 3 files changed, 55 insertions(+), 39 deletions(-)

diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 405aa18..1c9aa8a 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -173,6 +173,13 @@ static inline unsigned int kvm_get_hwpgd_size(void)
return PTRS_PER_S2_PGD * sizeof(pgd_t);
 }
 
+static inline pgd_t *kvm_setup_fake_pgd(pgd_t *pgd)
+{
+   return pgd;
+}
+
+static inline void kvm_free_fake_pgd(pgd_t *pgd) {}
+
 struct kvm;
 
 #define kvm_flush_dcache_to_poc(a,l)   __cpuc_flush_dcache_area((a), (l))
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 7b42012..b210622 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -677,43 +677,11 @@ int kvm_alloc_stage2_pgd(struct kvm *kvm)
 * guest, we allocate a fake PGD and pre-populate it to point
 * to the next-level page table, which will be the real
 * initial page table pointed to by the VTTBR.
-*
-* When KVM_PREALLOC_LEVEL==2, we allocate a single page for
-* the PMD and the kernel will use folded pud.
-* When KVM_PREALLOC_LEVEL==1, we allocate 2 consecutive PUD
-* pages.
 */
-   if (KVM_PREALLOC_LEVEL > 0) {
-   int i;
-
-   /*
-* Allocate fake pgd for the page table manipulation macros to
-* work.  This is not used by the hardware and we have no
-* alignment requirement for this allocation.
-*/
-   pgd = kmalloc(PTRS_PER_S2_PGD * sizeof(pgd_t),
-   GFP_KERNEL | __GFP_ZERO);
-
-   if (!pgd) {
-   kvm_free_hwpgd(hwpgd);
-   return -ENOMEM;
-   }
-
-   /* Plug the HW PGD into the fake one. */
-   for (i = 0; i < PTRS_PER_S2_PGD; i++) {
-   if (KVM_PREALLOC_LEVEL == 1)
-   pgd_populate(NULL, pgd + i,
-(pud_t *)hwpgd + i * PTRS_PER_PUD);
-   else if (KVM_PREALLOC_LEVEL == 2)
-   pud_populate(NULL, pud_offset(pgd, 0) + i,
-(pmd_t *)hwpgd + i * PTRS_PER_PMD);
-   }
-   } else {
-   /*
-* Allocate actual first-level Stage-2 page table used by the
-* hardware for Stage-2 page table walks.
-*/
-   pgd = (pgd_t *)hwpgd;
+   pgd = kvm_setup_fake_pgd(hwpgd);
+   if (IS_ERR(pgd)) {
+   kvm_free_hwpgd(hwpgd);
+   return PTR_ERR(pgd);
}
 
kvm_clean_pgd(pgd);
@@ -820,9 +788,7 @@ void kvm_free_stage2_pgd(struct kvm *kvm)
 
unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
kvm_free_hwpgd(kvm_get_hwpgd(kvm));
-   if (KVM_PREALLOC_LEVEL > 0)
-   kfree(kvm->arch.pgd);
-
+   kvm_free_fake_pgd(kvm->arch.pgd);
kvm->arch.pgd = NULL;
 }
 
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 6150567..2567fe8 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -198,6 +198,49 @@ static inline unsigned int kvm_get_hwpgd_size(void)
return PTRS_PER_S2_PGD * sizeof(pgd_t);
 }
 
+/*
+ * Allocate fake pgd for the page table manipulation macros to
+ * work.  This is not used by the hardware and we have no
+ * alignment requirement for this allocation.
+ */
+static inline pgd_t* kvm_setup_fake_pgd(pgd_t *hwpgd)
+{
+   int i;
+   pgd_t *pgd;
+
+   if (!KVM_PREALLOC_LEVEL)
+   return hwpgd;
+   /*
+* When KVM_PREALLOC_LEVEL==2, we allocate a single page for
+* the PMD and the kernel will use folded pud.
+* When KVM_PREALLOC_LEVEL==1, we allocate 2 consecutive PUD
+* pages.
+*/
+   pgd = kmalloc(PTRS_PER_S2_PGD * sizeof(pgd_t),
+   GFP_KERNEL | __GFP_ZERO);
+
+   if (!pgd)
+   return ERR_PTR(-ENOMEM);
+
+   /* Plug the HW PGD into the fake one. */
+   for (i = 0; i < PTRS_PER_S2_PGD; i++) {
+   if (KVM_PREALLOC_LEVEL == 1)
+   pgd_populate(NULL, pgd + i,
+(pud_t *)hwpgd + i * PTRS_PER_PUD);
+   else if (KVM_PREALLOC_LEVEL == 2)
+ 

[PATCH 14/15] arm64: Add 16K page size support

2015-09-15 Thread Suzuki K. Poulose
From: "Suzuki K. Poulose" 

This patch turns on the 16K page support in the kernel. We
support 48bit VA (4 level page tables) and 47bit VA (3 level
page tables).

Cc: Mark Rutland 
Cc: Catalin Marinas 
Cc: Will Deacon 
Signed-off-by: Suzuki K. Poulose 
Reviewed-by: Ard Biesheuvel 
Tested-by: Ard Biesheuvel 
---
 arch/arm64/Kconfig   |   25 -
 arch/arm64/include/asm/fixmap.h  |4 +++-
 arch/arm64/include/asm/kvm_arm.h |   12 
 arch/arm64/include/asm/page.h|2 ++
 arch/arm64/include/asm/sysreg.h  |2 ++
 arch/arm64/include/asm/thread_info.h |2 ++
 arch/arm64/kernel/head.S |7 ++-
 arch/arm64/mm/proc.S |4 +++-
 8 files changed, 50 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 83bca48..2253819 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -171,7 +171,8 @@ config PGTABLE_LEVELS
default 2 if ARM64_64K_PAGES && ARM64_VA_BITS_42
default 3 if ARM64_64K_PAGES && ARM64_VA_BITS_48
default 3 if ARM64_4K_PAGES && ARM64_VA_BITS_39
-   default 4 if ARM64_4K_PAGES && ARM64_VA_BITS_48
+   default 3 if ARM64_16K_PAGES && ARM64_VA_BITS_47
+   default 4 if !ARM64_64K_PAGES && ARM64_VA_BITS_48
 
 source "init/Kconfig"
 
@@ -345,6 +346,13 @@ config ARM64_4K_PAGES
help
  This feature enables 4KB pages support.
 
+config ARM64_16K_PAGES
+   bool "16KB"
+   help
+ The system will use 16KB pages support. AArch32 emulation
+ requires applications compiled with 16K(or multiple of 16K)
+ aligned segments.
+
 config ARM64_64K_PAGES
bool "64KB"
help
@@ -358,6 +366,7 @@ endchoice
 choice
prompt "Virtual address space size"
default ARM64_VA_BITS_39 if ARM64_4K_PAGES
+   default ARM64_VA_BITS_47 if ARM64_16K_PAGES
default ARM64_VA_BITS_42 if ARM64_64K_PAGES
help
  Allows choosing one of multiple possible virtual address
@@ -372,6 +381,10 @@ config ARM64_VA_BITS_42
bool "42-bit"
depends on ARM64_64K_PAGES
 
+config ARM64_VA_BITS_47
+   bool "47-bit"
+   depends on ARM64_16K_PAGES
+
 config ARM64_VA_BITS_48
bool "48-bit"
 
@@ -381,6 +394,7 @@ config ARM64_VA_BITS
int
default 39 if ARM64_VA_BITS_39
default 42 if ARM64_VA_BITS_42
+   default 47 if ARM64_VA_BITS_47
default 48 if ARM64_VA_BITS_48
 
 config CPU_BIG_ENDIAN
@@ -451,7 +465,7 @@ config ARCH_WANT_GENERAL_HUGETLB
def_bool y
 
 config ARCH_WANT_HUGE_PMD_SHARE
-   def_bool y if ARM64_4K_PAGES
+   def_bool y if ARM64_4K_PAGES || ARM64_16K_PAGES
 
 config HAVE_ARCH_TRANSPARENT_HUGEPAGE
def_bool y
@@ -488,6 +502,7 @@ config XEN
 config FORCE_MAX_ZONEORDER
int
default "14" if (ARM64_64K_PAGES && TRANSPARENT_HUGEPAGE)
+   default "12" if (ARM64_16K_PAGES && TRANSPARENT_HUGEPAGE)
default "11"
 
 menuconfig ARMV8_DEPRECATED
@@ -674,9 +689,9 @@ config COMPAT
  the user helper functions, VFP support and the ptrace interface are
  handled appropriately by the kernel.
 
- If you also enabled CONFIG_ARM64_64K_PAGES, please be aware that you
- will only be able to execute AArch32 binaries that were compiled with
- 64k aligned segments.
+ If you use a page size other than 4KB(i.e, 16KB or 64KB), please be aware
+ that you will only be able to execute AArch32 binaries that were compiled
+ with page size aligned segments.
 
  If you want to execute 32-bit userspace applications, say Y.
 
diff --git a/arch/arm64/include/asm/fixmap.h b/arch/arm64/include/asm/fixmap.h
index 8b9884c..a294c70 100644
--- a/arch/arm64/include/asm/fixmap.h
+++ b/arch/arm64/include/asm/fixmap.h
@@ -55,8 +55,10 @@ enum fixed_addresses {
 * Temporary boot-time mappings, used by early_ioremap(),
 * before ioremap() is functional.
 */
-#ifdef CONFIG_ARM64_64K_PAGES
+#if   defined(CONFIG_ARM64_64K_PAGES)
 #define NR_FIX_BTMAPS  4
+#elif  defined (CONFIG_ARM64_16K_PAGES)
+#define NR_FIX_BTMAPS  16
 #else
 #define NR_FIX_BTMAPS  64
 #endif
diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index 699554d..b28a06e 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -113,6 +113,7 @@
 #define VTCR_EL2_TG0_MASK  (3 << 14)
 #define VTCR_EL2_TG0_4K(0 << 14)
 #define VTCR_EL2_TG0_64K   (1 << 14)
+#define VTCR_EL2_TG0_16K   (2 << 14)
 #define VTCR_EL2_SH0_MASK  (3 << 12)
 #define VTCR_EL2_SH0_INNER (3 << 12)
 #define VTCR_EL2_ORGN0_MASK(3 << 10)
@@ -134,6 +135,8 @@
  *
  * Note that when using 4K pages, we 

[PATCH 06/15] arm64: Clean config usages for page size

2015-09-15 Thread Suzuki K. Poulose
From: "Suzuki K. Poulose" 

We use !CONFIG_ARM64_64K_PAGES for CONFIG_ARM64_4K_PAGES
(and vice versa) in code. It all worked well, so far since
we only had two options. Now, with the introduction of 16K,
these cases will break. This patch cleans up the code to
use the required CONFIG symbol expression without the assumption
that !64K => 4K (and vice versa)

Cc: Ard Biesheuvel 
Cc: Catalin Marinas 
Cc: Will Deacon 
Acked-by: Mark Rutland 
Signed-off-by: Suzuki K. Poulose 
Reviewed-by: Ard Biesheuvel 
Tested-by: Ard Biesheuvel 
---
 arch/arm64/Kconfig   |4 ++--
 arch/arm64/Kconfig.debug |2 +-
 arch/arm64/include/asm/thread_info.h |2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 7d95663..ab0a36f 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -451,7 +451,7 @@ config ARCH_WANT_GENERAL_HUGETLB
def_bool y
 
 config ARCH_WANT_HUGE_PMD_SHARE
-   def_bool y if !ARM64_64K_PAGES
+   def_bool y if ARM64_4K_PAGES
 
 config HAVE_ARCH_TRANSPARENT_HUGEPAGE
def_bool y
@@ -663,7 +663,7 @@ source "fs/Kconfig.binfmt"
 
 config COMPAT
bool "Kernel support for 32-bit EL0"
-   depends on !ARM64_64K_PAGES || EXPERT
+   depends on ARM64_4K_PAGES || EXPERT
select COMPAT_BINFMT_ELF
select HAVE_UID16
select OLD_SIGSUSPEND3
diff --git a/arch/arm64/Kconfig.debug b/arch/arm64/Kconfig.debug
index d6285ef..c24d6ad 100644
--- a/arch/arm64/Kconfig.debug
+++ b/arch/arm64/Kconfig.debug
@@ -77,7 +77,7 @@ config DEBUG_RODATA
   If in doubt, say Y
 
 config DEBUG_ALIGN_RODATA
-   depends on DEBUG_RODATA && !ARM64_64K_PAGES
+   depends on DEBUG_RODATA && ARM64_4K_PAGES
bool "Align linker sections up to SECTION_SIZE"
help
  If this option is enabled, sections that may potentially be marked as
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index dcd06d1..d9c8c9f 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -23,7 +23,7 @@
 
 #include 
 
-#ifndef CONFIG_ARM64_64K_PAGES
+#ifdef CONFIG_ARM64_4K_PAGES
 #define THREAD_SIZE_ORDER  2
 #endif
 
-- 
1.7.9.5



[PATCH 13/15] arm64: kvm: Rewrite fake pgd handling

2015-09-15 Thread Suzuki K. Poulose
From: "Suzuki K. Poulose" 

The existing fake pgd handling code assumes that the stage-2 entry
level can only be one level down from that of the host, which may not
always be true (e.g., with the introduction of 16k page size).

e.g.
With 16k page size and 48bit VA and 40bit IPA we have the following
split for page table levels:

level:      0         1         2         3
bits :   [47]   [46 - 36] [35 - 25] [24 - 14] [13 - 0]
            ^         ^         ^
            |         |         |
     host entry       |         x---- stage-2 entry
                      |
          IPA --------x

The stage-2 entry level is 2, due to the concatenation of 16 tables
at level 2 (mandated by the hardware). So, we need to fake two levels
to actually reach the hyp page table. This case cannot be handled
with the existing code, as all we know about is KVM_PREALLOC_LEVEL,
which conflates two different pieces of information:

1) Whether we have fake page table entry levels.
2) The entry level of stage-2 translation.

We lose the information about the number of fake levels that
we may have to use. Also, KVM_PREALLOC_LEVEL computation itself
is wrong, as we assume the hw entry level is always 1 level down
from the host.

This patch introduces two separate indicators:
1) Accurate entry level for stage-2 translation - HYP_PGTABLE_ENTRY_LEVEL -
   using the new helpers.
2) Number of levels of fake pagetable entries. (KVM_FAKE_PGTABLE_LEVELS)

The following conditions hold true for all cases (with 40bit IPA):
1) The stage-2 entry level <= 2
2) Number of fake page-table entries is in the inclusive range [0, 2].
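Under these assumptions, the two indicators can be checked numerically. The sketch below uses the standard ceil((bits - page_shift) / (page_shift - 3)) level formula; the stage-2 entry resolves 4 fewer bits because up to 16 tables are concatenated there. Function names are illustrative, not the kernel's.

```c
static inline int levels_for(int bits, int page_shift)
{
    int n = page_shift - 3;          /* bits resolved per level */
    return (bits - page_shift + n - 1) / n;
}

/* Levels actually walked by stage-2 for a given IPA size: the 16-way
 * concatenation at the entry level saves 4 bits. */
static inline int hyp_pgtable_levels(int ipa_bits, int page_shift)
{
    return levels_for(ipa_bits - 4, page_shift);
}

/* Levels we must fake so the host's pgtable macros still work. */
static inline int kvm_fake_levels(int host_levels, int ipa_bits, int page_shift)
{
    return host_levels - hyp_pgtable_levels(ipa_bits, page_shift);
}
```

For 16K pages with a 48-bit host VA (4 host levels) and a 40-bit IPA, this yields 2 stage-2 levels and 2 fake levels, matching the example above.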

Cc: kvm...@lists.cs.columbia.edu
Cc: christoffer.d...@linaro.org
Cc: marc.zyng...@arm.com
Signed-off-by: Suzuki K. Poulose 
---
 arch/arm64/include/asm/kvm_mmu.h |  114 --
 1 file changed, 61 insertions(+), 53 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 2567fe8..72cfd9e 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -41,18 +41,6 @@
  */
 #define TRAMPOLINE_VA  (HYP_PAGE_OFFSET_MASK & PAGE_MASK)
 
-/*
- * KVM_MMU_CACHE_MIN_PAGES is the number of stage2 page table translation
- * levels in addition to the PGD and potentially the PUD which are
- * pre-allocated (we pre-allocate the fake PGD and the PUD when the Stage-2
- * tables use one level of tables less than the kernel.
- */
-#ifdef CONFIG_ARM64_64K_PAGES
-#define KVM_MMU_CACHE_MIN_PAGES1
-#else
-#define KVM_MMU_CACHE_MIN_PAGES2
-#endif
-
 #ifdef __ASSEMBLY__
 
 /*
@@ -80,6 +68,26 @@
 #define KVM_PHYS_SIZE  (1UL << KVM_PHYS_SHIFT)
 #define KVM_PHYS_MASK  (KVM_PHYS_SIZE - 1UL)
 
+/*
+ * At stage-2 entry level, upto 16 tables can be concatenated and
+ * the hardware expects us to use concatenation, whenever possible.
+ * So, number of page table levels for KVM_PHYS_SHIFT is always
+ * the number of normal page table levels for (KVM_PHYS_SHIFT - 4).
+ */
+#define HYP_PGTABLE_LEVELS ARM64_HW_PGTABLE_LEVELS(KVM_PHYS_SHIFT - 4)
+/* Number of bits normally addressed by HYP_PGTABLE_LEVELS */
+#define HYP_PGTABLE_SHIFT  ARM64_HW_PGTABLE_LEVEL_SHIFT(HYP_PGTABLE_LEVELS + 1)
+#define HYP_PGDIR_SHIFT    ARM64_HW_PGTABLE_LEVEL_SHIFT(HYP_PGTABLE_LEVELS)
+#define HYP_PGTABLE_ENTRY_LEVEL(4 - HYP_PGTABLE_LEVELS)
+
+/*
+ * KVM_MMU_CACHE_MIN_PAGES is the number of stage2 page table translation
+ * levels in addition to the PGD and potentially the PUD which are
+ * pre-allocated (we pre-allocate the fake PGD and the PUD when the Stage-2
+ * tables use one level of tables less than the kernel.
+ */
+#define KVM_MMU_CACHE_MIN_PAGES(HYP_PGTABLE_LEVELS - 1)
+
 int create_hyp_mappings(void *from, void *to);
 int create_hyp_io_mappings(void *from, void *to, phys_addr_t);
 void free_boot_hyp_pgd(void);
@@ -145,56 +153,41 @@ static inline bool kvm_s2pmd_readonly(pmd_t *pmd)
 #define kvm_pud_addr_end(addr, end)pud_addr_end(addr, end)
 #define kvm_pmd_addr_end(addr, end)pmd_addr_end(addr, end)
 
-/*
- * In the case where PGDIR_SHIFT is larger than KVM_PHYS_SHIFT, we can address
- * the entire IPA input range with a single pgd entry, and we would only need
- * one pgd entry.  Note that in this case, the pgd is actually not used by
- * the MMU for Stage-2 translations, but is merely a fake pgd used as a data
- * structure for the kernel pgtable macros to work.
- */
-#if PGDIR_SHIFT > KVM_PHYS_SHIFT
-#define PTRS_PER_S2_PGD_SHIFT  0
+/* Number of concatenated tables in stage-2 entry level */
+#if KVM_PHYS_SHIFT > HYP_PGTABLE_SHIFT
+#define S2_ENTRY_TABLES_SHIFT  (KVM_PHYS_SHIFT - HYP_PGTABLE_SHIFT)
 #else
-#define PTRS_PER_S2_PGD_SHIFT  (KVM_PHYS_SHIFT - PGDIR_SHIFT)
+#define S2_ENTRY_TABLES_SHIFT  0
 #endif
+#define S2_ENTRY_TABLES(1 << (S2_ENTRY_TABLES_SHIFT))
+
+/* Number of page table levels we fake to reach the hw pgtable for hyp */
+#define KVM_FAKE_PGTABLE_LEVELS

[PATCH 04/15] arm64: Calculate size for idmap_pg_dir at compile time

2015-09-15 Thread Suzuki K. Poulose
From: "Suzuki K. Poulose" 

Now that we can calculate the number of levels required for
mapping a va width, reserve exact number of pages that would
be required to cover the idmap. The idmap should be able to handle
the maximum physical address size supported.

Cc: Ard Biesheuvel 
Cc: Mark Rutland 
Cc: Catalin Marinas 
Cc: Will Deacon 
Signed-off-by: Suzuki K. Poulose 
Reviewed-by: Ard Biesheuvel 
Tested-by: Ard Biesheuvel 
---
 arch/arm64/include/asm/boot.h   |1 +
 arch/arm64/include/asm/kernel-pgtable.h |7 +--
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/boot.h b/arch/arm64/include/asm/boot.h
index 81151b6..678b63e 100644
--- a/arch/arm64/include/asm/boot.h
+++ b/arch/arm64/include/asm/boot.h
@@ -2,6 +2,7 @@
 #ifndef __ASM_BOOT_H
 #define __ASM_BOOT_H
 
+#include 
 #include 
 
 /*
diff --git a/arch/arm64/include/asm/kernel-pgtable.h b/arch/arm64/include/asm/kernel-pgtable.h
index 5876a36..def7168 100644
--- a/arch/arm64/include/asm/kernel-pgtable.h
+++ b/arch/arm64/include/asm/kernel-pgtable.h
@@ -33,16 +33,19 @@
  * map to pte level. The swapper also maps the FDT (see __create_page_tables
  * for more information). Note that the number of ID map translation levels
  * could be increased on the fly if system RAM is out of reach for the default
- * VA range, so 3 pages are reserved in all cases.
+ * VA range, so pages required to map highest possible PA are reserved in all
+ * cases.
  */
 #if ARM64_SWAPPER_USES_SECTION_MAPS
 #define SWAPPER_PGTABLE_LEVELS (CONFIG_PGTABLE_LEVELS - 1)
+#define IDMAP_PGTABLE_LEVELS   (ARM64_HW_PGTABLE_LEVELS(PHYS_MASK_SHIFT) - 1)
 #else
 #define SWAPPER_PGTABLE_LEVELS (CONFIG_PGTABLE_LEVELS)
+#define IDMAP_PGTABLE_LEVELS   (ARM64_HW_PGTABLE_LEVELS(PHYS_MASK_SHIFT))
 #endif
 
 #define SWAPPER_DIR_SIZE   (SWAPPER_PGTABLE_LEVELS * PAGE_SIZE)
-#define IDMAP_DIR_SIZE (3 * PAGE_SIZE)
+#define IDMAP_DIR_SIZE (IDMAP_PGTABLE_LEVELS * PAGE_SIZE)
 
 /* Initial memory map size */
 #if ARM64_SWAPPER_USES_SECTION_MAPS
-- 
1.7.9.5



Re: [PATCH v2 08/18] nvdimm: init backend memory mapping and config data area

2015-09-15 Thread Paolo Bonzini


On 25/08/2015 18:03, Stefan Hajnoczi wrote:
>> >  
>> > +static uint64_t get_file_size(int fd)
>> > +{
>> > +struct stat stat_buf;
>> > +uint64_t size;
>> > +
>> > +if (fstat(fd, &stat_buf) < 0) {
>> > +return 0;
>> > +}
>> > +
>> > +if (S_ISREG(stat_buf.st_mode)) {
>> > +return stat_buf.st_size;
>> > +}
>> > +
>> > +if (S_ISBLK(stat_buf.st_mode) && !ioctl(fd, BLKGETSIZE64, &size)) {
>> > +return size;
>> > +}
> #ifdef __linux__ for ioctl(fd, BLKGETSIZE64, &size)?
> 
> There is nothing Linux-specific about emulating NVDIMMs so this code
> should compile on all platforms.

The code from block/raw-posix.c and block/raw-win32.c's raw_getlength
should probably be extracted to a new function in utils/, and reused here.
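A minimal portable core for such a helper might look like the sketch below. It is illustrative only — the name `regular_file_size_at` is invented, and a real extraction would also carry over the block-device and win32 branches from the raw drivers.

```c
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

/* Return the size of a regular file, or 0 if the path cannot be
 * opened or is not a regular file (caller treats 0 as "unknown"). */
static uint64_t regular_file_size_at(const char *path)
{
    struct stat st;
    uint64_t size = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return 0;
    if (fstat(fd, &st) == 0 && S_ISREG(st.st_mode))
        size = (uint64_t)st.st_size;
    close(fd);
    return size;
}
```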

Paolo


Re: [PATCH v2 09/18] nvdimm: build ACPI NFIT table

2015-09-15 Thread Paolo Bonzini


On 14/08/2015 16:52, Xiao Guangrong wrote:
> NFIT is defined in ACPI 6.0: 5.2.25 NVDIMM Firmware Interface Table (NFIT)
> 
> Currently, we only support PMEM mode. Each device has 3 tables:
> - SPA table, define the PMEM region info
> 
> - MEM DEV table, it has the @handle which is used to associate specified
>   ACPI NVDIMM  device we will introduce in later patch.
>   Also we can happily ignore the memory device's interleave; the real
>   nvdimm hardware access is hidden behind the host
> 
> - DCR table, it defines Vendor ID used to associate specified vendor
>   nvdimm driver. Since we only implement PMEM mode this time, Command
>   window and Data window are not needed
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  hw/i386/acpi-build.c   |   3 +
>  hw/mem/Makefile.objs   |   2 +-
>  hw/mem/nvdimm/acpi.c   | 285 +
>  hw/mem/nvdimm/internal.h   |  29 +
>  hw/mem/nvdimm/pc-nvdimm.c  |  27 -
>  include/hw/mem/pc-nvdimm.h |   2 +
>  6 files changed, 346 insertions(+), 2 deletions(-)
>  create mode 100644 hw/mem/nvdimm/acpi.c
>  create mode 100644 hw/mem/nvdimm/internal.h
> 
> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> index 8ead1c1..092ed2f 100644
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -39,6 +39,7 @@
>  #include "hw/loader.h"
>  #include "hw/isa/isa.h"
>  #include "hw/acpi/memory_hotplug.h"
> +#include "hw/mem/pc-nvdimm.h"
>  #include "sysemu/tpm.h"
>  #include "hw/acpi/tpm.h"
>  #include "sysemu/tpm_backend.h"
> @@ -1741,6 +1742,8 @@ void acpi_build(PcGuestInfo *guest_info, AcpiBuildTables *tables)
>  build_dmar_q35(tables_blob, tables->linker);
>  }
>  
> +pc_nvdimm_build_nfit_table(table_offsets, tables_blob, tables->linker);
> +
>  /* Add tables supplied by user (if any) */
>  for (u = acpi_table_first(); u; u = acpi_table_next(u)) {
>  unsigned len = acpi_table_len(u);
> diff --git a/hw/mem/Makefile.objs b/hw/mem/Makefile.objs
> index 4df7482..7a6948d 100644
> --- a/hw/mem/Makefile.objs
> +++ b/hw/mem/Makefile.objs
> @@ -1,2 +1,2 @@
>  common-obj-$(CONFIG_MEM_HOTPLUG) += pc-dimm.o
> -common-obj-$(CONFIG_NVDIMM) += nvdimm/pc-nvdimm.o
> +common-obj-$(CONFIG_NVDIMM) += nvdimm/pc-nvdimm.o nvdimm/acpi.o
> diff --git a/hw/mem/nvdimm/acpi.c b/hw/mem/nvdimm/acpi.c
> new file mode 100644
> index 000..f28752f
> --- /dev/null
> +++ b/hw/mem/nvdimm/acpi.c
> @@ -0,0 +1,285 @@
> +/*
> + * NVDIMM (A Non-Volatile Dual In-line Memory Module) NFIT Implementation
> + *
> + * Copyright(C) 2015 Intel Corporation.
> + *
> + * Author:
> + *  Xiao Guangrong 
> + *
> + * NFIT is defined in ACPI 6.0: 5.2.25 NVDIMM Firmware Interface Table (NFIT)
> + * and the DSM specfication can be found at:
> + *   http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
> + *
> + * Currently, it only supports PMEM Virtualization.
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, see
> + * <http://www.gnu.org/licenses/>
> + */
> +
> +#include "qemu-common.h"
> +
> +#include "hw/acpi/aml-build.h"
> +#include "hw/mem/pc-nvdimm.h"
> +
> +#include "internal.h"
> +
> +static void nfit_spa_uuid_pm(void *uuid)
> +{
> +uuid_le uuid_pm = UUID_LE(0x66f0d379, 0xb4f3, 0x4074, 0xac, 0x43, 0x0d,
> +  0x33, 0x18, 0xb7, 0x8c, 0xdb);
> +memcpy(uuid, &uuid_pm, sizeof(uuid_pm));
> +}
> +
> +enum {
> +NFIT_TABLE_SPA = 0,
> +NFIT_TABLE_MEM = 1,
> +NFIT_TABLE_IDT = 2,
> +NFIT_TABLE_SMBIOS = 3,
> +NFIT_TABLE_DCR = 4,
> +NFIT_TABLE_BDW = 5,
> +NFIT_TABLE_FLUSH = 6,
> +};
> +
> +enum {
> +EFI_MEMORY_UC = 0x1ULL,
> +EFI_MEMORY_WC = 0x2ULL,
> +EFI_MEMORY_WT = 0x4ULL,
> +EFI_MEMORY_WB = 0x8ULL,
> +EFI_MEMORY_UCE = 0x10ULL,
> +EFI_MEMORY_WP = 0x1000ULL,
> +EFI_MEMORY_RP = 0x2000ULL,
> +EFI_MEMORY_XP = 0x4000ULL,
> +EFI_MEMORY_NV = 0x8000ULL,
> +EFI_MEMORY_MORE_RELIABLE = 0x10000ULL,
> +};
> +
> +/*
> + * struct nfit - Nvdimm Firmware Interface Table
> + * @signature: "NFIT"
> + */
> +struct nfit {
> +ACPI_TABLE_HEADER_DEF
> +uint32_t reserved;
> +} QEMU_PACKED;
> +
> +/*
> + * struct nfit_spa - System Physical Address Range Structure
> + */
> +struct nfit_spa {
> +uint16_t type;
> +uint16_t length;
> +uint16_t spa_index;
> +uint16_t flags;

Re: [PATCH V6 6/6] kvm: add fast mmio capabilitiy

2015-09-15 Thread Cornelia Huck
On Tue, 15 Sep 2015 17:07:55 +0200
Paolo Bonzini  wrote:

> On 15/09/2015 08:41, Jason Wang wrote:
> > +With KVM_CAP_FAST_MMIO, a zero length mmio eventfd is allowed for
> > +kernel to ignore the length of guest write and get a possible faster
> > +response. Note the speedup may only work on some specific
> > +architectures and setups. Otherwise, it's as fast as wildcard mmio
> > +eventfd.
> 
> I don't really like tying the capability to MMIO, especially since
> zero length ioeventfd is already accepted for virtio-ccw.

Actually, zero length ioeventfd does not make sense for virtio-ccw; we
just don't check it (although we probably should).

> 
> What about the following?
> 
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 7a3cb48a644d..247944071cc8 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -1627,11 +1627,10 @@ to the registered address is equal to datamatch in struct kvm_ioeventfd.
>  For virtio-ccw devices, addr contains the subchannel id and datamatch the
>  virtqueue index.
> 
> -With KVM_CAP_FAST_MMIO, a zero length mmio eventfd is allowed for
> -kernel to ignore the length of guest write and get a possible faster
> -response. Note the speedup may only work on some specific
> -architectures and setups. Otherwise, it's as fast as wildcard mmio
> -eventfd.
> +With KVM_CAP_IOEVENTFD_ANY_LENGTH, a zero length ioeventfd is allowed, and
> +the kernel will ignore the length of guest write and get a faster vmexit.

s/get/may get/ ?

> +The speedup may only apply to specific architectures, but the ioeventfd will
> +work anyway.
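
The validity rule being negotiated above can be modeled in isolation. The sketch below is illustrative only (the helper name and capability flag are assumptions, not KVM code): a zero length, meaning "ignore the length of the guest write", is accepted only when the wildcard-length capability is present, mirroring the proposed KVM_CAP_IOEVENTFD_ANY_LENGTH semantics.

```c
#include <stdbool.h>

/* Illustrative model of the proposed check: a zero ioeventfd length
 * ("match any length") is only valid when the wildcard-length
 * capability is available; 1/2/4/8 byte matches are always valid. */
static bool ioeventfd_len_ok(unsigned int len, bool any_length_cap)
{
    if (len == 0)
        return any_length_cap;
    return len == 1 || len == 2 || len == 4 || len == 8;
}
```

With this model, userspace would probe the capability first and fall back to a fixed-length registration when it is absent.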



Re: [Bug 103321] New: NPT page attribute support causes extreme slowdown

2015-09-15 Thread Paolo Bonzini


On 02/09/2015 21:01, Sebastian Schütte wrote:
> I inserted some printk() lines into init_vmcb() around the call of
> svm_set_guest_pat() to print out the g_pat value as well as
> svm->vcpu.vcpu_id and noticed that something was off:
> 
> Initially, the PATs of all VCPUs are set to 0x0606060606060606.
> However, after attaching some devices (vfio-pci enabling device and
> vfio_ecap_init lines are being printed) init_vmcb() is only called
> again for vcpu_id > 0. Unless g_pat is changed somewhere else, VCPU
> #0 remains set to 0x0606060606060606 (according to comments in
> svm_set_guest_pat() this is bad for assigned devices) while all other
> VCPUs use 0x0007040600070406.
> 
> I'd guess that could explain the slowdown.
> 

Hi, sorry for the delay.  If you change KVM like this

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 27f57fc05bc7..3ce878c5fde8 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1155,10 +1155,7 @@ static void svm_set_guest_pat(struct vcpu_svm *svm, u64 *g_pat)
 * have assigned devices, however, we cannot force WB for RAM
 * pages only, so use the guest PAT directly.
 */
-   if (!kvm_arch_has_assigned_device(vcpu->kvm))
-   *g_pat = 0x0606060606060606;
-   else
-   *g_pat = vcpu->arch.pat;
+   *g_pat = vcpu->arch.pat;
 }
 
 static u64 svm_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)

(obviously not a valid patch for upstream, but good enough for
testing) do you get normal speed?

Paolo
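
For reference, the two g_pat values in the report decode to concrete memory types: 0x0606060606060606 forces write-back (WB) in every PAT entry, while 0x0007040600070406 is the guest's own PAT layout (WB, WT, UC-, UC repeated). A minimal decoding sketch, using the PAT encodings from the Intel SDM (the helper name and enum are illustrative, not kernel code):

```c
#include <stdint.h>

/* PAT memory-type encodings per the Intel SDM. */
enum pat_type {
    PAT_UC = 0,       /* uncacheable */
    PAT_WC = 1,       /* write combining */
    PAT_WT = 4,       /* write through */
    PAT_WP = 5,       /* write protected */
    PAT_WB = 6,       /* write back */
    PAT_UC_MINUS = 7, /* UC-, overridable by MTRRs */
};

/* Extract entry N (0..7) of an IA32_PAT MSR value; each entry is
 * one byte, and valid encodings fit in the low 3 bits. */
static enum pat_type pat_entry_type(uint64_t pat_msr, unsigned int entry)
{
    return (enum pat_type)((pat_msr >> (entry * 8)) & 0xff);
}
```

Decoding both values this way makes the asymmetry between VCPU #0 and the others easy to spot in a trace.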


Re: [PATCH V6 6/6] kvm: add fast mmio capabilitiy

2015-09-15 Thread Paolo Bonzini


On 15/09/2015 18:44, Cornelia Huck wrote:
>> > Can you explain why?  If there is any non-zero valid length, "wildcard
>> > length" (represented by zero) would also make sense.
> What is a wildcard match supposed to mean in this case? The datamatch
> field contains the queue index for the device specified in the address
> field. The hypercall interface associated with the eventfd always has
> device + queue index in its parameters; there is no interface for
> "notify device with all its queues".

Ah, I see.  Because all valid virtio-ccw ioeventfds are datamatch, no
valid virtio-ccw ioeventfd is wildcard-length.

Paolo


Re: [RFC PATCH] os-android: Add support to android platform, built by ndk-r10

2015-09-15 Thread Fam Zheng
On Tue, 09/15 10:11, Houcheng Lin wrote:
> From: Houcheng 

Thanks for sending patches!  Please include qemu-de...@nongnu.org list for QEMU
changes.

Fam

> 
> This patch builds qemu with the android ndk tool-chain, and has been tested
> on both x86_64 and x86 android platforms with hardware virtualization
> enabled. This patch is composed of three parts:
> 
> - configure scripts for android
> - OS dependent code for android
> - kernel headers for android
> 
> The configure scripts add cross-compile options for android, define compile
> flags and link flags, and call an android-specific pkg-config. A pseudo
> pkg-config script is added to report correct compile flags for the android
> system.
> 
> The OS dependent code for android implements functions missing in bionic
> C, including:
> - getdtablesize(): call getrlimit() instead.
> - sigtimedwait(): call __rt_sigtimedwait() instead.
> - ptsname(): call ptsname_r() instead.
> - lockf(): this feature is not available on android; directly return -1.
> - shm_open(): this feature is not available on android; directly return -1.
> 
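
A shim like the getdtablesize() replacement above can be only a few lines. This is a sketch under the stated approach (the qemu_ prefix is illustrative, to avoid clashing with a real libc symbol): the descriptor table size is derived from the RLIMIT_NOFILE soft limit, which is effectively what glibc's getdtablesize() reports.

```c
#include <sys/resource.h>

/* Sketch of a bionic-compatibility shim: derive the descriptor
 * table size from the RLIMIT_NOFILE soft limit. Returns -1 on
 * failure, matching the patch's convention for missing features. */
static int qemu_getdtablesize(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    return (int)rl.rlim_cur;
}
```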
> The kernel headers for android include two kernel headers missing in
> android: scsi/sg.h and sys/io.h.
> 
> How to build android version
> 
> 1. download the ndk toolchain r10, build the following libraries and install 
> in your toolchain sysroot:
> libiconv-1.14
> gettext-0.19
> libffi-3.0.12
> glib-2.34.3
> libpng-1.2.52
> pixman-0.30
> 
> 2. configure the qemu and build:
> 
> % export SYSROOT="/opt/android-toolchain64/sysroot"
> % CFLAGS=" --sysroot=$SYSROOT -I$SYSROOT/usr/include -I$SYSROOT/usr/include/pixman-1/" \
> ./configure --prefix="${SYSROOT}/usr"  \
> --cross-prefix=x86_64-linux-android- \
> --enable-kvm --enable-trace-backend=nop --disable-fdt --target-list=x86_64-softmmu \
> --disable-spice --disable-vhost-net --disable-libiscsi --audio-drv-list="" --disable-gtk \
> --disable-gnutls --disable-libnfs --disable-glusterfs --disable-libssh2 --disable-seccomp \
> --disable-usb-redir --disable-libusb
> % make -j4
> 
> Or, configure qemu to static link version:
> 
> % export SYSROOT="/opt/android-toolchain64/sysroot"
> % CFLAGS=" --sysroot=$SYSROOT -I$SYSROOT/usr/include -I$SYSROOT/usr/include/pixman-1/" \
> ./configure --prefix="${SYSROOT}/usr"  \
> --cross-prefix=x86_64-linux-android- \
> --enable-kvm --enable-trace-backend=nop --disable-fdt --target-list=x86_64-softmmu \
> --disable-spice --disable-vhost-net --disable-libiscsi --audio-drv-list="" --disable-gtk \
> --disable-gnutls --disable-libnfs --disable-glusterfs --disable-libssh2 --disable-seccomp \
> --disable-usb-redir --disable-libusb --static
> % make -j4
> 
> Signed-off-by: Houcheng 
> ---
>  configure   |   30 -
>  include/android/scsi/sg.h   |  307 +++
>  include/qemu/osdep.h|4 +
>  include/sysemu/os-android.h |   35 +
>  kvm-all.c   |3 +
>  scripts/android-pkg-config  |   28 
>  tests/Makefile  |2 +
>  util/osdep.c|   53 
>  util/qemu-openpty.c |   10 +-
>  9 files changed, 467 insertions(+), 5 deletions(-)
>  create mode 100644 include/android/scsi/sg.h
>  create mode 100644 include/android/sys/io.h
>  create mode 100644 include/sysemu/os-android.h
>  create mode 100755 scripts/android-pkg-config
> 
> diff --git a/configure b/configure
> index 5c06663..3ff6ffa 100755
> --- a/configure
> +++ b/configure
> @@ -566,7 +566,6 @@ fi
>  
>  # host *BSD for user mode
>  HOST_VARIANT_DIR=""
> -
>  case $targetos in
>  CYGWIN*)
>mingw32="yes"
> @@ -692,9 +691,23 @@ Haiku)
>vhost_net="yes"
>vhost_scsi="yes"
>QEMU_INCLUDES="-I\$(SRC_PATH)/linux-headers -I$(pwd)/linux-headers 
> $QEMU_INCLUDES"
> +  case $cross_prefix in
> +*android*)
> +  android="yes"
> +  guest_agent="no"
> +  QEMU_INCLUDES="-I\$(SRC_PATH)/include/android $QEMU_INCLUDES"
> +;;
> +*)
> +;;
> +  esac
>  ;;
>  esac
>  
> +if [ "$android" = "yes" ] ; then
> +  QEMU_CFLAGS="-DANDROID $QEMU_CFLAGS"
> +  LIBS="-lglib-2.0 -lgthread-2.0 -lz -lpixman-1 -lintl -liconv -lc $LIBS"
> +  libs_qga="-lglib-2.0 -lgthread-2.0 -lz -lpixman-1 -lintl -liconv -lc"
> +fi
>  if [ "$bsd" = "yes" ] ; then
>if [ "$darwin" != "yes" ] ; then
>  bsd_user="yes"
> @@ -1736,7 +1749,14 @@ fi
>  # pkg-config probe
>  
>  if ! has "$pkg_config_exe"; then
> -  error_exit "pkg-config binary '$pkg_config_exe' not found"
> +  case $cross_prefix in
> +*android*)
> + pkg_config_exe=scripts/android-pkg-config
> + ;;
> +*)
> + error_exit "pkg-config binary '$pkg_config_exe' not found"
> + ;;
> +  esac
>  fi
>  
>  

[PATCH kernel 9/9] KVM: PPC: Add support for multiple-TCE hcalls

2015-09-15 Thread Alexey Kardashevskiy
This adds real and virtual mode handlers for the H_PUT_TCE_INDIRECT and
H_STUFF_TCE hypercalls for user space emulated devices such as IBMVIO
devices or emulated PCI.  These calls allow adding multiple entries
(up to 512) into the TCE table in one call which saves time on
transition between kernel and user space.

This implements the KVM_CAP_PPC_MULTITCE capability. When present,
the kernel will try handling H_PUT_TCE_INDIRECT and H_STUFF_TCE.
If they can not be handled by the kernel, they are passed on to
the user space. The user space still has to have an implementation
for these.

Both HV and PR-style KVM are supported.

Signed-off-by: Alexey Kardashevskiy 
---
 Documentation/virtual/kvm/api.txt   |  25 ++
 arch/powerpc/include/asm/kvm_ppc.h  |  12 +++
 arch/powerpc/kvm/book3s_64_vio.c| 111 +++-
 arch/powerpc/kvm/book3s_64_vio_hv.c | 145 ++--
 arch/powerpc/kvm/book3s_hv.c|  26 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |   6 +-
 arch/powerpc/kvm/book3s_pr_papr.c   |  35 
 arch/powerpc/kvm/powerpc.c  |   3 +
 8 files changed, 350 insertions(+), 13 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index d86d831..593c62a 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -3019,6 +3019,31 @@ Returns: 0 on success, -1 on error
 
 Queues an SMI on the thread's vcpu.
 
+4.97 KVM_CAP_PPC_MULTITCE
+
+Capability: KVM_CAP_PPC_MULTITCE
+Architectures: ppc
+Type: vm
+
+This capability means the kernel is capable of handling hypercalls
+H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
+space. This significantly accelerates DMA operations for PPC KVM guests.
+User space should expect that its handlers for these hypercalls
+are not going to be called if user space previously registered LIOBN
+in KVM (via KVM_CREATE_SPAPR_TCE or similar calls).
+
+In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
+user space might have to advertise it for the guest. For example,
+IBM pSeries (sPAPR) guest starts using them if "hcall-multi-tce" is
+present in the "ibm,hypertas-functions" device-tree property.
+
+The hypercalls mentioned above may or may not be processed successfully
+in the kernel based fast path. If they can not be handled by the kernel,
+they will get passed on to user space. So user space still has to have
+an implementation for these despite the in kernel acceleration.
+
+This capability is always enabled.
+
 5. The kvm_run structure
 
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index fcde896..e5b968e 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -166,12 +166,24 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct kvm_create_spapr_tce *args);
+extern struct kvmppc_spapr_tce_table *kvmppc_find_table(
+   struct kvm_vcpu *vcpu, unsigned long liobn);
 extern long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt,
unsigned long ioba, unsigned long npages);
 extern long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *tt,
unsigned long tce);
+extern long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
+   unsigned long *ua, unsigned long **prmap);
+extern void kvmppc_tce_put(struct kvmppc_spapr_tce_table *tt,
+   unsigned long idx, unsigned long tce);
 extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 unsigned long ioba, unsigned long tce);
+extern long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+   unsigned long liobn, unsigned long ioba,
+   unsigned long tce_list, unsigned long npages);
+extern long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
+   unsigned long liobn, unsigned long ioba,
+   unsigned long tce_value, unsigned long npages);
 extern long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 unsigned long ioba);
 extern struct page *kvm_alloc_hpt(unsigned long nr_pages);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index e347856..d3fc732 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -14,6 +14,7 @@
  *
  * Copyright 2010 Paul Mackerras, IBM Corp. 
  * Copyright 2011 David Gibson, IBM Corporation 
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation 
  */
 
 #include 
@@ -37,8 +38,7 @@
 #include 
 #include 
 #include 
-
-#define TCES_PER_PAGE  (PAGE_SIZE / sizeof(u64))
+#include 
 
 static long kvmppc_stt_npages(unsigned long window_size)
 {
@@ -200,3 +200,110 @@ fail:
}

[PATCH kernel 4/9] KVM: PPC: Use RCU for arch.spapr_tce_tables

2015-09-15 Thread Alexey Kardashevskiy
At the moment spapr_tce_tables is not protected against races. This makes
use of RCU-variants of list helpers. As some bits are executed in real
mode, this makes use of just introduced list_for_each_entry_rcu_notrace().

This converts release_spapr_tce_table() to a RCU scheduled handler.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/kvm_host.h |  1 +
 arch/powerpc/kvm/book3s.c   |  2 +-
 arch/powerpc/kvm/book3s_64_vio.c| 20 +++-
 3 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 98eebbf6..e19d412 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -178,6 +178,7 @@ struct kvmppc_spapr_tce_table {
struct kvm *kvm;
u64 liobn;
u32 window_size;
+   struct rcu_head rcu;
struct page *pages[0];
 };
 
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 53285d5..3418f7c 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -806,7 +806,7 @@ int kvmppc_core_init_vm(struct kvm *kvm)
 {
 
 #ifdef CONFIG_PPC64
-   INIT_LIST_HEAD(&kvm->arch.spapr_tce_tables);
+   INIT_LIST_HEAD_RCU(&kvm->arch.spapr_tce_tables);
INIT_LIST_HEAD(>arch.rtas_tokens);
 #endif
 
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 54cf9bc..9526c34 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -45,19 +45,16 @@ static long kvmppc_stt_npages(unsigned long window_size)
 * sizeof(u64), PAGE_SIZE) / PAGE_SIZE;
 }
 
-static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
+static void release_spapr_tce_table(struct rcu_head *head)
 {
-   struct kvm *kvm = stt->kvm;
+   struct kvmppc_spapr_tce_table *stt = container_of(head,
+   struct kvmppc_spapr_tce_table, rcu);
int i;
 
-   mutex_lock(&kvm->lock);
-   list_del(&stt->list);
for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
__free_page(stt->pages[i]);
+
kfree(stt);
-   mutex_unlock(&kvm->lock);
-
-   kvm_put_kvm(kvm);
 }
 
static int kvm_spapr_tce_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
@@ -88,7 +85,12 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
 {
struct kvmppc_spapr_tce_table *stt = filp->private_data;
 
-   release_spapr_tce_table(stt);
+   list_del_rcu(&stt->list);
+
+   kvm_put_kvm(stt->kvm);
+
+   call_rcu(&stt->rcu, release_spapr_tce_table);
+
return 0;
 }
 
@@ -131,7 +133,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
kvm_get_kvm(kvm);
 
mutex_lock(&kvm->lock);
-   list_add(&stt->list, &kvm->arch.spapr_tce_tables);
+   list_add_rcu(&stt->list, &kvm->arch.spapr_tce_tables);

mutex_unlock(&kvm->lock);
 
-- 
2.4.0.rc3.8.gfb3e7d5



[PATCH kernel 5/9] KVM: PPC: Account TCE-containing pages in locked_vm

2015-09-15 Thread Alexey Kardashevskiy
At the moment pages used for TCE tables (in addition to pages addressed
by TCEs) are not counted in locked_vm counter so a malicious userspace
tool can call ioctl(KVM_CREATE_SPAPR_TCE) as many times as RLIMIT_NOFILE and
lock a lot of memory.

This adds counting for pages used for TCE tables.

This counts the number of pages required for a table plus pages for
the kvmppc_spapr_tce_table struct (TCE table descriptor) itself.

This does not change the amount of (de)allocated memory.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/kvm/book3s_64_vio.c | 51 +++-
 1 file changed, 50 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 9526c34..b70787d 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -45,13 +45,56 @@ static long kvmppc_stt_npages(unsigned long window_size)
 * sizeof(u64), PAGE_SIZE) / PAGE_SIZE;
 }
 
+static long kvmppc_account_memlimit(long npages, bool inc)
+{
+   long ret = 0;
+   const long bytes = sizeof(struct kvmppc_spapr_tce_table) +
+   (abs(npages) * sizeof(struct page *));
+   const long stt_pages = ALIGN(bytes, PAGE_SIZE) / PAGE_SIZE;
+
+   if (!current || !current->mm)
+   return ret; /* process exited */
+
+   npages += stt_pages;
+
+   down_write(&current->mm->mmap_sem);
+
+   if (inc) {
+   long locked, lock_limit;
+
+   locked = current->mm->locked_vm + npages;
+   lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+   if (locked > lock_limit && !capable(CAP_IPC_LOCK))
+   ret = -ENOMEM;
+   else
+   current->mm->locked_vm += npages;
+   } else {
+   if (npages > current->mm->locked_vm)
+   npages = current->mm->locked_vm;
+
+   current->mm->locked_vm -= npages;
+   }
+
+   pr_debug("[%d] RLIMIT_MEMLOCK KVM %c%ld %ld/%ld%s\n", current->pid,
+   inc ? '+' : '-',
+   npages << PAGE_SHIFT,
+   current->mm->locked_vm << PAGE_SHIFT,
+   rlimit(RLIMIT_MEMLOCK),
+   ret ? " - exceeded" : "");
+
+   up_write(&current->mm->mmap_sem);
+
+   return ret;
+}
+
 static void release_spapr_tce_table(struct rcu_head *head)
 {
struct kvmppc_spapr_tce_table *stt = container_of(head,
struct kvmppc_spapr_tce_table, rcu);
int i;
+   long npages = kvmppc_stt_npages(stt->window_size);
 
-   for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
+   for (i = 0; i < npages; i++)
__free_page(stt->pages[i]);
 
kfree(stt);
@@ -89,6 +132,7 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
 
kvm_put_kvm(stt->kvm);
 
+   kvmppc_account_memlimit(kvmppc_stt_npages(stt->window_size), false);
call_rcu(&stt->rcu, release_spapr_tce_table);
 
return 0;
@@ -114,6 +158,11 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
}
 
npages = kvmppc_stt_npages(args->window_size);
+   ret = kvmppc_account_memlimit(npages, true);
+   if (ret) {
+   stt = NULL;
+   goto fail;
+   }
 
stt = kzalloc(sizeof(*stt) + npages * sizeof(struct page *),
  GFP_KERNEL);
-- 
2.4.0.rc3.8.gfb3e7d5

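
The charge added by this patch covers both the TCE table pages themselves and the pages holding the kvmppc_spapr_tce_table descriptor plus its page-pointer array. The arithmetic can be checked standalone; this sketch mirrors kvmppc_account_memlimit() under illustrative assumptions (a 4 KiB page size and the names used here are not kernel code):

```c
/* Model of kvmppc_account_memlimit()'s increment: npages TCE-table
 * pages, plus the pages needed for the descriptor and its array of
 * npages page pointers, rounded up to whole pages. */
#define MODEL_PAGE_SIZE 4096UL

static unsigned long model_account_pages(unsigned long desc_bytes,
                                         unsigned long npages,
                                         unsigned long ptr_bytes)
{
    unsigned long bytes = desc_bytes + npages * ptr_bytes;
    unsigned long stt_pages =
        (bytes + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;

    return npages + stt_pages;
}
```

For a 512-entry pointer array with a 64-byte descriptor and 8-byte pointers, the descriptor side alone spills into a second page, which is exactly the kind of overhead the locked_vm check now accounts for.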


[PATCH kernel 6/9] KVM: PPC: Replace SPAPR_TCE_SHIFT with IOMMU_PAGE_SHIFT_4K

2015-09-15 Thread Alexey Kardashevskiy
SPAPR_TCE_SHIFT is used in few places only and since IOMMU_PAGE_SHIFT_4K
can be easily used instead, remove SPAPR_TCE_SHIFT.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/kvm_book3s_64.h | 2 --
 arch/powerpc/kvm/book3s_64_vio.c | 3 ++-
 arch/powerpc/kvm/book3s_64_vio_hv.c  | 4 ++--
 3 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
index 2aa79c8..7529aab 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -33,8 +33,6 @@ static inline void svcpu_put(struct kvmppc_book3s_shadow_vcpu *svcpu)
 }
 #endif
 
-#define SPAPR_TCE_SHIFT   12
-
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
 #define KVM_DEFAULT_HPT_ORDER  24  /* 16MB HPT by default */
 #endif
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index b70787d..e347856 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -36,12 +36,13 @@
 #include 
 #include 
 #include 
+#include 
 
 #define TCES_PER_PAGE  (PAGE_SIZE / sizeof(u64))
 
 static long kvmppc_stt_npages(unsigned long window_size)
 {
-   return ALIGN((window_size >> SPAPR_TCE_SHIFT)
+   return ALIGN((window_size >> IOMMU_PAGE_SHIFT_4K)
 * sizeof(u64), PAGE_SIZE) / PAGE_SIZE;
 }
 
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 8ae12ac..6cf1ab3 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -99,7 +99,7 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
if (ret)
return ret;
 
-   idx = ioba >> SPAPR_TCE_SHIFT;
+   idx = ioba >> IOMMU_PAGE_SHIFT_4K;
page = stt->pages[idx / TCES_PER_PAGE];
tbl = (u64 *)page_address(page);
 
@@ -121,7 +121,7 @@ long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
if (stt) {
ret = kvmppc_ioba_validate(stt, ioba, 1);
if (!ret) {
-   unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
+   unsigned long idx = ioba >> IOMMU_PAGE_SHIFT_4K;
struct page *page = stt->pages[idx / TCES_PER_PAGE];
u64 *tbl = (u64 *)page_address(page);
 
-- 
2.4.0.rc3.8.gfb3e7d5



[PATCH kernel 8/9] KVM: Fix KVM_SMI chapter number

2015-09-15 Thread Alexey Kardashevskiy
The KVM_SMI capability is following the KVM_S390_SET_IRQ_STATE capability
which is "4.95", this changes the number of the KVM_SMI chapter to 4.96.

Signed-off-by: Alexey Kardashevskiy 
---
 Documentation/virtual/kvm/api.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index d9eccee..d86d831 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -3009,7 +3009,7 @@ len must be a multiple of sizeof(struct kvm_s390_irq). It must be > 0
 and it must not exceed (max_vcpus + 32) * sizeof(struct kvm_s390_irq),
 which is the maximum number of possibly pending cpu-local interrupts.
 
-4.90 KVM_SMI
+4.96 KVM_SMI
 
 Capability: KVM_CAP_X86_SMM
 Architectures: x86
-- 
2.4.0.rc3.8.gfb3e7d5



[PATCH kernel 9/9] KVM: PPC: Add support for multiple-TCE hcalls

2015-09-15 Thread Alexey Kardashevskiy
This adds real and virtual mode handlers for the H_PUT_TCE_INDIRECT and
H_STUFF_TCE hypercalls for user space emulated devices such as IBMVIO
devices or emulated PCI.  These calls allow adding multiple entries
(up to 512) into the TCE table in one call which saves time on
transition between kernel and user space.

This implements the KVM_CAP_PPC_MULTITCE capability. When present,
the kernel will try handling H_PUT_TCE_INDIRECT and H_STUFF_TCE.
If they can not be handled by the kernel, they are passed on to
the user space. The user space still has to have an implementation
for these.

Both HV and PR-syle KVM are supported.

Signed-off-by: Alexey Kardashevskiy 
---
 Documentation/virtual/kvm/api.txt   |  25 ++
 arch/powerpc/include/asm/kvm_ppc.h  |  12 +++
 arch/powerpc/kvm/book3s_64_vio.c| 111 +++-
 arch/powerpc/kvm/book3s_64_vio_hv.c | 145 ++--
 arch/powerpc/kvm/book3s_hv.c|  26 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |   6 +-
 arch/powerpc/kvm/book3s_pr_papr.c   |  35 
 arch/powerpc/kvm/powerpc.c  |   3 +
 8 files changed, 350 insertions(+), 13 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index d86d831..593c62a 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -3019,6 +3019,31 @@ Returns: 0 on success, -1 on error
 
 Queues an SMI on the thread's vcpu.
 
+4.97 KVM_CAP_PPC_MULTITCE
+
+Capability: KVM_CAP_PPC_MULTITCE
+Architectures: ppc
+Type: vm
+
+This capability means the kernel is capable of handling hypercalls
+H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
+space. This significantly accelerates DMA operations for PPC KVM guests.
+User space should expect that its handlers for these hypercalls
+are not going to be called if user space previously registered LIOBN
+in KVM (via KVM_CREATE_SPAPR_TCE or similar calls).
+
+In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
+user space might have to advertise it for the guest. For example,
+IBM pSeries (sPAPR) guest starts using them if "hcall-multi-tce" is
+present in the "ibm,hypertas-functions" device-tree property.
+
+The hypercalls mentioned above may or may not be processed successfully
+in the kernel based fast path. If they can not be handled by the kernel,
+they will get passed on to user space. So user space still has to have
+an implementation for these despite the in kernel acceleration.
+
+This capability is always enabled.
+
 5. The kvm_run structure
 
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index fcde896..e5b968e 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -166,12 +166,24 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct kvm_create_spapr_tce *args);
+extern struct kvmppc_spapr_tce_table *kvmppc_find_table(
+   struct kvm_vcpu *vcpu, unsigned long liobn);
 extern long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt,
unsigned long ioba, unsigned long npages);
 extern long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *tt,
unsigned long tce);
+extern long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
+   unsigned long *ua, unsigned long **prmap);
+extern void kvmppc_tce_put(struct kvmppc_spapr_tce_table *tt,
+   unsigned long idx, unsigned long tce);
 extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 unsigned long ioba, unsigned long tce);
+extern long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+   unsigned long liobn, unsigned long ioba,
+   unsigned long tce_list, unsigned long npages);
+extern long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
+   unsigned long liobn, unsigned long ioba,
+   unsigned long tce_value, unsigned long npages);
 extern long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 unsigned long ioba);
 extern struct page *kvm_alloc_hpt(unsigned long nr_pages);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index e347856..d3fc732 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -14,6 +14,7 @@
  *
  * Copyright 2010 Paul Mackerras, IBM Corp. 
  * Copyright 2011 David Gibson, IBM Corporation 
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation 
  */
 
 #include 
@@ -37,8 +38,7 @@
 #include 
 #include 
 #include 
-
-#define TCES_PER_PAGE  (PAGE_SIZE / sizeof(u64))
+#include 
 
 static long kvmppc_stt_npages(unsigned long window_size)
 {
@@ -200,3 +200,110 @@ fail:
}

[PATCH kernel 8/9] KVM: Fix KVM_SMI chapter number

2015-09-15 Thread Alexey Kardashevskiy
The KVM_SMI capability is following the KVM_S390_SET_IRQ_STATE capability
which is "4.95", this changes the number of the KVM_SMI chapter to 4.96.

Signed-off-by: Alexey Kardashevskiy 
---
 Documentation/virtual/kvm/api.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index d9eccee..d86d831 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -3009,7 +3009,7 @@ len must be a multiple of sizeof(struct kvm_s390_irq). It 
must be > 0
 and it must not exceed (max_vcpus + 32) * sizeof(struct kvm_s390_irq),
 which is the maximum number of possibly pending cpu-local interrupts.
 
-4.90 KVM_SMI
+4.96 KVM_SMI
 
 Capability: KVM_CAP_X86_SMM
 Architectures: x86
-- 
2.4.0.rc3.8.gfb3e7d5

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH kernel 4/9] KVM: PPC: Use RCU for arch.spapr_tce_tables

2015-09-15 Thread Alexey Kardashevskiy
At the moment spapr_tce_tables is not protected against races. This
switches to the RCU variants of the list helpers. As some bits are
executed in real mode, it uses the just-introduced
list_for_each_entry_rcu_notrace().

This converts release_spapr_tce_table() to an RCU-scheduled handler.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/kvm_host.h |  1 +
 arch/powerpc/kvm/book3s.c   |  2 +-
 arch/powerpc/kvm/book3s_64_vio.c| 20 +++-
 3 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 98eebbf6..e19d412 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -178,6 +178,7 @@ struct kvmppc_spapr_tce_table {
struct kvm *kvm;
u64 liobn;
u32 window_size;
+   struct rcu_head rcu;
struct page *pages[0];
 };
 
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 53285d5..3418f7c 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -806,7 +806,7 @@ int kvmppc_core_init_vm(struct kvm *kvm)
 {
 
 #ifdef CONFIG_PPC64
-   INIT_LIST_HEAD(&kvm->arch.spapr_tce_tables);
+   INIT_LIST_HEAD_RCU(&kvm->arch.spapr_tce_tables);
	INIT_LIST_HEAD(&kvm->arch.rtas_tokens);
 #endif
 
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 54cf9bc..9526c34 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -45,19 +45,16 @@ static long kvmppc_stt_npages(unsigned long window_size)
 * sizeof(u64), PAGE_SIZE) / PAGE_SIZE;
 }
 
-static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
+static void release_spapr_tce_table(struct rcu_head *head)
 {
-   struct kvm *kvm = stt->kvm;
+   struct kvmppc_spapr_tce_table *stt = container_of(head,
+   struct kvmppc_spapr_tce_table, rcu);
int i;
 
-   mutex_lock(&kvm->lock);
-   list_del(&stt->list);
for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
__free_page(stt->pages[i]);
+
kfree(stt);
-   mutex_unlock(&kvm->lock);
-
-   kvm_put_kvm(kvm);
 }
 
 static int kvm_spapr_tce_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
@@ -88,7 +85,12 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
 {
struct kvmppc_spapr_tce_table *stt = filp->private_data;
 
-   release_spapr_tce_table(stt);
+   list_del_rcu(&stt->list);
+
+   kvm_put_kvm(stt->kvm);
+
+   call_rcu(&stt->rcu, release_spapr_tce_table);
+
return 0;
 }
 
@@ -131,7 +133,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
kvm_get_kvm(kvm);
 
mutex_lock(>lock);
-   list_add(&stt->list, &kvm->arch.spapr_tce_tables);
+   list_add_rcu(&stt->list, &kvm->arch.spapr_tce_tables);
 
mutex_unlock(>lock);
 
-- 
2.4.0.rc3.8.gfb3e7d5



[PATCH kernel 5/9] KVM: PPC: Account TCE-containing pages in locked_vm

2015-09-15 Thread Alexey Kardashevskiy
At the moment pages used for TCE tables (in addition to the pages addressed
by TCEs) are not counted in the locked_vm counter, so a malicious userspace
tool can call ioctl(KVM_CREATE_SPAPR_TCE) as many times as RLIMIT_NOFILE
allows and lock a lot of memory.

This adds counting for pages used for TCE tables.

This counts the number of pages required for a table plus pages for
the kvmppc_spapr_tce_table struct (TCE table descriptor) itself.

This does not change the amount of (de)allocated memory.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/kvm/book3s_64_vio.c | 51 +++-
 1 file changed, 50 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 9526c34..b70787d 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -45,13 +45,56 @@ static long kvmppc_stt_npages(unsigned long window_size)
 * sizeof(u64), PAGE_SIZE) / PAGE_SIZE;
 }
 
+static long kvmppc_account_memlimit(long npages, bool inc)
+{
+   long ret = 0;
+   const long bytes = sizeof(struct kvmppc_spapr_tce_table) +
+   (abs(npages) * sizeof(struct page *));
+   const long stt_pages = ALIGN(bytes, PAGE_SIZE) / PAGE_SIZE;
+
+   if (!current || !current->mm)
+   return ret; /* process exited */
+
+   npages += stt_pages;
+
+   down_write(&current->mm->mmap_sem);
+
+   if (inc) {
+   long locked, lock_limit;
+
+   locked = current->mm->locked_vm + npages;
+   lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+   if (locked > lock_limit && !capable(CAP_IPC_LOCK))
+   ret = -ENOMEM;
+   else
+   current->mm->locked_vm += npages;
+   } else {
+   if (npages > current->mm->locked_vm)
+   npages = current->mm->locked_vm;
+
+   current->mm->locked_vm -= npages;
+   }
+
+   pr_debug("[%d] RLIMIT_MEMLOCK KVM %c%ld %ld/%ld%s\n", current->pid,
+   inc ? '+' : '-',
+   npages << PAGE_SHIFT,
+   current->mm->locked_vm << PAGE_SHIFT,
+   rlimit(RLIMIT_MEMLOCK),
+   ret ? " - exceeded" : "");
+
+   up_write(&current->mm->mmap_sem);
+
+   return ret;
+}
+
 static void release_spapr_tce_table(struct rcu_head *head)
 {
struct kvmppc_spapr_tce_table *stt = container_of(head,
struct kvmppc_spapr_tce_table, rcu);
int i;
+   long npages = kvmppc_stt_npages(stt->window_size);
 
-   for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
+   for (i = 0; i < npages; i++)
__free_page(stt->pages[i]);
 
kfree(stt);
@@ -89,6 +132,7 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
 
kvm_put_kvm(stt->kvm);
 
+   kvmppc_account_memlimit(kvmppc_stt_npages(stt->window_size), false);
 	call_rcu(&stt->rcu, release_spapr_tce_table);
 
return 0;
@@ -114,6 +158,11 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
}
 
npages = kvmppc_stt_npages(args->window_size);
+   ret = kvmppc_account_memlimit(npages, true);
+   if (ret) {
+   stt = NULL;
+   goto fail;
+   }
 
stt = kzalloc(sizeof(*stt) + npages * sizeof(struct page *),
  GFP_KERNEL);
-- 
2.4.0.rc3.8.gfb3e7d5



[PATCH kernel 6/9] KVM: PPC: Replace SPAPR_TCE_SHIFT with IOMMU_PAGE_SHIFT_4K

2015-09-15 Thread Alexey Kardashevskiy
SPAPR_TCE_SHIFT is used in only a few places, and since IOMMU_PAGE_SHIFT_4K
can easily be used instead, remove SPAPR_TCE_SHIFT.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/kvm_book3s_64.h | 2 --
 arch/powerpc/kvm/book3s_64_vio.c | 3 ++-
 arch/powerpc/kvm/book3s_64_vio_hv.c  | 4 ++--
 3 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
index 2aa79c8..7529aab 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -33,8 +33,6 @@ static inline void svcpu_put(struct kvmppc_book3s_shadow_vcpu *svcpu)
 }
 #endif
 
-#define SPAPR_TCE_SHIFT12
-
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
 #define KVM_DEFAULT_HPT_ORDER  24  /* 16MB HPT by default */
 #endif
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index b70787d..e347856 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -36,12 +36,13 @@
 #include 
 #include 
 #include 
+#include 
 
 #define TCES_PER_PAGE  (PAGE_SIZE / sizeof(u64))
 
 static long kvmppc_stt_npages(unsigned long window_size)
 {
-   return ALIGN((window_size >> SPAPR_TCE_SHIFT)
+   return ALIGN((window_size >> IOMMU_PAGE_SHIFT_4K)
 * sizeof(u64), PAGE_SIZE) / PAGE_SIZE;
 }
 
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 8ae12ac..6cf1ab3 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -99,7 +99,7 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
if (ret)
return ret;
 
-   idx = ioba >> SPAPR_TCE_SHIFT;
+   idx = ioba >> IOMMU_PAGE_SHIFT_4K;
page = stt->pages[idx / TCES_PER_PAGE];
tbl = (u64 *)page_address(page);
 
@@ -121,7 +121,7 @@ long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
if (stt) {
ret = kvmppc_ioba_validate(stt, ioba, 1);
if (!ret) {
-   unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
+   unsigned long idx = ioba >> IOMMU_PAGE_SHIFT_4K;
struct page *page = stt->pages[idx / TCES_PER_PAGE];
u64 *tbl = (u64 *)page_address(page);
 
-- 
2.4.0.rc3.8.gfb3e7d5



[PATCH kernel 7/9] KVM: PPC: Move reusable bits of H_PUT_TCE handler to helpers

2015-09-15 Thread Alexey Kardashevskiy
Upcoming multi-tce support (H_PUT_TCE_INDIRECT/H_STUFF_TCE hypercalls)
will validate TCE (not to have unexpected bits) and IO address
(to be within the DMA window boundaries).

This introduces helpers to validate TCE and IO address.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/kvm_ppc.h  |  4 ++
 arch/powerpc/kvm/book3s_64_vio_hv.c | 89 -
 2 files changed, 83 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index c6ef05b..fcde896 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -166,6 +166,10 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct kvm_create_spapr_tce *args);
+extern long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt,
+   unsigned long ioba, unsigned long npages);
+extern long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *tt,
+   unsigned long tce);
 extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 unsigned long ioba, unsigned long tce);
 extern long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 6cf1ab3..f0fd84c 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define TCES_PER_PAGE  (PAGE_SIZE / sizeof(u64))
 
@@ -64,7 +65,7 @@ static struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm_vcpu *vcpu,
  * WARNING: This will be called in real-mode on HV KVM and virtual
  *  mode on PR KVM
  */
-static long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt,
+long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt,
unsigned long ioba, unsigned long npages)
 {
unsigned long mask = (1ULL << IOMMU_PAGE_SHIFT_4K) - 1;
@@ -76,6 +77,79 @@ static long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt,
 
return H_SUCCESS;
 }
+EXPORT_SYMBOL_GPL(kvmppc_ioba_validate);
+
+/*
+ * Validates TCE address.
+ * At the moment flags and page mask are validated.
+ * As the host kernel does not access those addresses (just puts them
+ * to the table and user space is supposed to process them), we can skip
+ * checking other things (such as TCE is a guest RAM address or the page
+ * was actually allocated).
+ *
+ * WARNING: This will be called in real-mode on HV KVM and virtual
+ *  mode on PR KVM
+ */
+long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *stt, unsigned long tce)
+{
+   unsigned long mask = ((1ULL << IOMMU_PAGE_SHIFT_4K) - 1) &
+   ~(TCE_PCI_WRITE | TCE_PCI_READ);
+
+   if (tce & mask)
+   return H_PARAMETER;
+
+   return H_SUCCESS;
+}
+EXPORT_SYMBOL_GPL(kvmppc_tce_validate);
+
+/* Note on the use of page_address() in real mode,
+ *
+ * It is safe to use page_address() in real mode on ppc64 because
+ * page_address() is always defined as lowmem_page_address()
+ * which returns __va(PFN_PHYS(page_to_pfn(page))) which is arithmetial
+ * operation and does not access page struct.
+ *
+ * Theoretically page_address() could be defined different
+ * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
+ * should be enabled.
+ * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
+ * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
+ * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
+ * is not expected to be enabled on ppc32, page_address()
+ * is safe for ppc32 as well.
+ *
+ * WARNING: This will be called in real-mode on HV KVM and virtual
+ *  mode on PR KVM
+ */
+static u64 *kvmppc_page_address(struct page *page)
+{
+#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
+#error TODO: fix to avoid page_address() here
+#endif
+   return (u64 *) page_address(page);
+}
+
+/*
+ * Handles TCE requests for emulated devices.
+ * Puts guest TCE values to the table and expects user space to convert them.
+ * Called in both real and virtual modes.
+ * Cannot fail so kvmppc_tce_validate must be called before it.
+ *
+ * WARNING: This will be called in real-mode on HV KVM and virtual
+ *  mode on PR KVM
+ */
+void kvmppc_tce_put(struct kvmppc_spapr_tce_table *stt,
+   unsigned long idx, unsigned long tce)
+{
+   struct page *page;
+   u64 *tbl;
+
+   page = stt->pages[idx / TCES_PER_PAGE];
+   tbl = kvmppc_page_address(page);
+
+   tbl[idx % TCES_PER_PAGE] = tce;
+}
+EXPORT_SYMBOL_GPL(kvmppc_tce_put);
 
 /* WARNING: This will be called in real-mode on HV KVM and virtual
  *  mode on PR KVM
@@ -85,9 +159,6 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 {
struct kvmppc_spapr_tce_table *stt 

[PATCH kernel 0/9] KVM: PPC: Add in-kernel multitce handling

2015-09-15 Thread Alexey Kardashevskiy
These patches enable in-kernel acceleration for the H_PUT_TCE_INDIRECT and
H_STUFF_TCE hypercalls, which allow updating multiple (up to 512) TCE entries
in a single call, saving time on context switching. QEMU already
supports these hypercalls, so this is just an optimization.

Both HV and PR KVM modes are supported.

This does not affect VFIO, this support is coming next.

Please comment. Thanks.


Alexey Kardashevskiy (9):
  rcu: Define notrace version of list_for_each_entry_rcu
  KVM: PPC: Make real_vmalloc_addr() public
  KVM: PPC: Rework H_PUT_TCE/H_GET_TCE handlers
  KVM: PPC: Use RCU for arch.spapr_tce_tables
  KVM: PPC: Account TCE-containing pages in locked_vm
  KVM: PPC: Replace SPAPR_TCE_SHIFT with IOMMU_PAGE_SHIFT_4K
  KVM: PPC: Move reusable bits of H_PUT_TCE handler to helpers
  KVM: Fix KVM_SMI chapter number
  KVM: PPC: Add support for multiple-TCE hcalls

 Documentation/virtual/kvm/api.txt|  27 ++-
 arch/powerpc/include/asm/kvm_book3s_64.h |   2 -
 arch/powerpc/include/asm/kvm_host.h  |   1 +
 arch/powerpc/include/asm/kvm_ppc.h   |  16 ++
 arch/powerpc/include/asm/mmu-hash64.h|   3 +
 arch/powerpc/kvm/book3s.c|   2 +-
 arch/powerpc/kvm/book3s_64_vio.c | 185 --
 arch/powerpc/kvm/book3s_64_vio_hv.c  | 310 +++
 arch/powerpc/kvm/book3s_hv.c |  26 ++-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c  |  17 --
 arch/powerpc/kvm/book3s_hv_rmhandlers.S  |   6 +-
 arch/powerpc/kvm/book3s_pr_papr.c|  35 
 arch/powerpc/kvm/powerpc.c   |   3 +
 arch/powerpc/mm/hash_utils_64.c  |  17 ++
 include/linux/rculist.h  |  38 
 15 files changed, 610 insertions(+), 78 deletions(-)

-- 
2.4.0.rc3.8.gfb3e7d5



[PATCH kernel 2/9] KVM: PPC: Make real_vmalloc_addr() public

2015-09-15 Thread Alexey Kardashevskiy
This helper translates vmalloc'd addresses to linear addresses.
It is only used by the KVM MMU code now and resides in the HV KVM code.
We will need it further in the TCE code and the DMA memory preregistration
code called in real mode.

This makes real_vmalloc_addr() public and moves it to the powerpc code as
it does not do anything special for KVM.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/mmu-hash64.h |  3 +++
 arch/powerpc/kvm/book3s_hv_rm_mmu.c   | 17 -
 arch/powerpc/mm/hash_utils_64.c   | 17 +
 3 files changed, 20 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu-hash64.h b/arch/powerpc/include/asm/mmu-hash64.h
index a82f534..fd06b73 100644
--- a/arch/powerpc/include/asm/mmu-hash64.h
+++ b/arch/powerpc/include/asm/mmu-hash64.h
@@ -606,6 +606,9 @@ static inline unsigned long get_kernel_vsid(unsigned long ea, int ssize)
context = (MAX_USER_CONTEXT) + ((ea >> 60) - 0xc) + 1;
return get_vsid(context, ea, ssize);
 }
+
+void *real_vmalloc_addr(void *x);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_MMU_HASH64_H_ */
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index c1df9bb..987b7d1 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -22,23 +22,6 @@
 #include 
 #include 
 
-/* Translate address of a vmalloc'd thing to a linear map address */
-static void *real_vmalloc_addr(void *x)
-{
-   unsigned long addr = (unsigned long) x;
-   pte_t *p;
-   /*
-* assume we don't have huge pages in vmalloc space...
-* So don't worry about THP collapse/split. Called
-* Only in realmode, hence won't need irq_save/restore.
-*/
-   p = __find_linux_pte_or_hugepte(swapper_pg_dir, addr, NULL);
-   if (!p || !pte_present(*p))
-   return NULL;
-   addr = (pte_pfn(*p) << PAGE_SHIFT) | (addr & ~PAGE_MASK);
-   return __va(addr);
-}
-
 /* Return 1 if we need to do a global tlbie, 0 if we can use tlbiel */
 static int global_invalidates(struct kvm *kvm, unsigned long flags)
 {
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 5ec987f..9737d6a 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1556,3 +1556,20 @@ void setup_initial_memory_limit(phys_addr_t first_memblock_base,
/* Finally limit subsequent allocations */
memblock_set_current_limit(ppc64_rma_size);
 }
+
+/* Translate address of a vmalloc'd thing to a linear map address */
+void *real_vmalloc_addr(void *x)
+{
+   unsigned long addr = (unsigned long) x;
+   pte_t *p;
+   /*
+* assume we don't have huge pages in vmalloc space...
+* So don't worry about THP collapse/split. Called
+* Only in realmode, hence won't need irq_save/restore.
+*/
+   p = __find_linux_pte_or_hugepte(swapper_pg_dir, addr, NULL);
+   if (!p || !pte_present(*p))
+   return NULL;
+   addr = (pte_pfn(*p) << PAGE_SHIFT) | (addr & ~PAGE_MASK);
+   return __va(addr);
+}
-- 
2.4.0.rc3.8.gfb3e7d5



[PATCH kernel 1/9] rcu: Define notrace version of list_for_each_entry_rcu

2015-09-15 Thread Alexey Kardashevskiy
This defines list_for_each_entry_rcu_notrace and list_entry_rcu_notrace
which use rcu_dereference_raw_notrace instead of rcu_dereference_raw.
This allows using list_for_each_entry_rcu_notrace in real mode (MMU is off).

Signed-off-by: Alexey Kardashevskiy 
---
 include/linux/rculist.h | 38 ++
 1 file changed, 38 insertions(+)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 17c6b1f..439c4d7 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -253,6 +253,25 @@ static inline void list_splice_init_rcu(struct list_head *list,
 })
 
 /**
+ * list_entry_rcu_notrace - get the struct for this entry
+ * @ptr:the &struct list_head pointer.
+ * @type:   the type of the struct this is embedded in.
+ * @member: the name of the list_struct within the struct.
+ *
+ * This primitive may safely run concurrently with the _rcu list-mutation
+ * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
+ *
+ * This is the same as list_entry_rcu() except that it does
+ * not do any RCU debugging or tracing.
+ */
+#define list_entry_rcu_notrace(ptr, type, member) \
+({ \
+   typeof(*ptr) __rcu *__ptr = (typeof(*ptr) __rcu __force *)ptr; \
+   container_of((typeof(ptr))rcu_dereference_raw_notrace(__ptr), \
+   type, member); \
+})
+
+/**
  * Where are list_empty_rcu() and list_first_entry_rcu()?
  *
  * Implementing those functions following their counterparts list_empty() and
@@ -308,6 +327,25 @@ static inline void list_splice_init_rcu(struct list_head *list,
pos = list_entry_rcu(pos->member.next, typeof(*pos), member))
 
 /**
+ * list_for_each_entry_rcu_notrace - iterate over rcu list of given type
+ * @pos:   the type * to use as a loop cursor.
+ * @head:  the head for your list.
+ * @member:the name of the list_struct within the struct.
+ *
+ * This list-traversal primitive may safely run concurrently with
+ * the _rcu list-mutation primitives such as list_add_rcu()
+ * as long as the traversal is guarded by rcu_read_lock().
+ *
+ * This is the same as list_for_each_entry_rcu() except that it does
+ * not do any RCU debugging or tracing.
+ */
+#define list_for_each_entry_rcu_notrace(pos, head, member) \
+   for (pos = list_entry_rcu_notrace((head)->next, typeof(*pos), member); \
+   &pos->member != (head); \
+   pos = list_entry_rcu_notrace(pos->member.next, typeof(*pos), \
+   member))
+
+/**
  * list_for_each_entry_continue_rcu - continue iteration over list of given type
  * @pos:   the type * to use as a loop cursor.
  * @head:  the head for your list.
-- 
2.4.0.rc3.8.gfb3e7d5



[PATCH kernel 3/9] KVM: PPC: Rework H_PUT_TCE/H_GET_TCE handlers

2015-09-15 Thread Alexey Kardashevskiy
This reworks the existing H_PUT_TCE/H_GET_TCE handlers to have one
exit path. This allows the next patch to add locks nicely.

This moves the ioba boundaries check to a helper and adds a check that
the least significant bits are zero.

The patch is pretty mechanical (only the check for the least significant
ioba bits is added) so no change in behaviour is expected.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/kvm/book3s_64_vio_hv.c | 102 +++-
 1 file changed, 66 insertions(+), 36 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 89e96b3..8ae12ac 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -35,71 +35,101 @@
 #include 
 #include 
 #include 
+#include 
 
 #define TCES_PER_PAGE  (PAGE_SIZE / sizeof(u64))
 
+/*
+ * Finds a TCE table descriptor by LIOBN.
+ *
+ * WARNING: This will be called in real or virtual mode on HV KVM and virtual
+ *  mode on PR KVM
+ */
+static struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm_vcpu *vcpu,
+   unsigned long liobn)
+{
+   struct kvm *kvm = vcpu->kvm;
+   struct kvmppc_spapr_tce_table *stt;
+
+   list_for_each_entry_rcu_notrace(stt, &kvm->arch.spapr_tce_tables, list)
+   if (stt->liobn == liobn)
+   return stt;
+
+   return NULL;
+}
+
+/*
+ * Validates IO address.
+ *
+ * WARNING: This will be called in real-mode on HV KVM and virtual
+ *  mode on PR KVM
+ */
+static long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt,
+   unsigned long ioba, unsigned long npages)
+{
+   unsigned long mask = (1ULL << IOMMU_PAGE_SHIFT_4K) - 1;
+   unsigned long idx = ioba >> IOMMU_PAGE_SHIFT_4K;
+   unsigned long size = stt->window_size >> IOMMU_PAGE_SHIFT_4K;
+
+   if ((ioba & mask) || (size + npages <= idx))
+   return H_PARAMETER;
+
+   return H_SUCCESS;
+}
+
 /* WARNING: This will be called in real-mode on HV KVM and virtual
  *  mode on PR KVM
  */
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
  unsigned long ioba, unsigned long tce)
 {
-   struct kvm *kvm = vcpu->kvm;
-   struct kvmppc_spapr_tce_table *stt;
+   struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+   long ret = H_TOO_HARD;
+   unsigned long idx;
+   struct page *page;
+   u64 *tbl;
 
/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
/*  liobn, ioba, tce); */
 
-   list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
-   if (stt->liobn == liobn) {
-   unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
-   struct page *page;
-   u64 *tbl;
+   if (!stt)
+   return ret;
 
-   /* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p  window_size=0x%x\n", */
-   /*  liobn, stt, stt->window_size); */
-   if (ioba >= stt->window_size)
-   return H_PARAMETER;
+   ret = kvmppc_ioba_validate(stt, ioba, 1);
+   if (ret)
+   return ret;
 
-   page = stt->pages[idx / TCES_PER_PAGE];
-   tbl = (u64 *)page_address(page);
+   idx = ioba >> SPAPR_TCE_SHIFT;
+   page = stt->pages[idx / TCES_PER_PAGE];
+   tbl = (u64 *)page_address(page);
 
-   /* FIXME: Need to validate the TCE itself */
-   /* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
-   tbl[idx % TCES_PER_PAGE] = tce;
-   return H_SUCCESS;
-   }
-   }
+   /* FIXME: Need to validate the TCE itself */
+   /* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
+   tbl[idx % TCES_PER_PAGE] = tce;
 
-   /* Didn't find the liobn, punt it to userspace */
-   return H_TOO_HARD;
+   return ret;
 }
 EXPORT_SYMBOL_GPL(kvmppc_h_put_tce);
 
 long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
  unsigned long ioba)
 {
-   struct kvm *kvm = vcpu->kvm;
-   struct kvmppc_spapr_tce_table *stt;
+   struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+   long ret = H_TOO_HARD;
 
-   list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
-   if (stt->liobn == liobn) {
+
+   if (stt) {
+   ret = kvmppc_ioba_validate(stt, ioba, 1);
+   if (!ret) {
unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
-   struct page *page;
-   u64 *tbl;
-
-   if (ioba >= stt->window_size)
-   return H_PARAMETER;
-
-   page = stt->pages[idx / TCES_PER_PAGE];
-   tbl = (u64 *)page_address(page);

[PATCH kernel 1/9] rcu: Define notrace version of list_for_each_entry_rcu

2015-09-15 Thread Alexey Kardashevskiy
This defines list_for_each_entry_rcu_notrace and list_entry_rcu_notrace
which use rcu_dereference_raw_notrace instead of rcu_dereference_raw.
This allows using list_for_each_entry_rcu_notrace in real mode (MMU is off).

Signed-off-by: Alexey Kardashevskiy 
---
 include/linux/rculist.h | 38 ++
 1 file changed, 38 insertions(+)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 17c6b1f..439c4d7 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -253,6 +253,25 @@ static inline void list_splice_init_rcu(struct list_head 
*list,
 })
 
 /**
+ * list_entry_rcu_notrace - get the struct for this entry
+ * @ptr:the  list_head pointer.
+ * @type:   the type of the struct this is embedded in.
+ * @member: the name of the list_struct within the struct.
+ *
+ * This primitive may safely run concurrently with the _rcu list-mutation
+ * primitives such as list_add_rcu() as long as it's guarded by 
rcu_read_lock().
+ *
+ * This is the same as list_entry_rcu() except that it does
+ * not do any RCU debugging or tracing.
+ */
+#define list_entry_rcu_notrace(ptr, type, member) \
+({ \
+   typeof(*ptr) __rcu *__ptr = (typeof(*ptr) __rcu __force *)ptr; \
+   container_of((typeof(ptr))rcu_dereference_raw_notrace(__ptr), \
+   type, member); \
+})
+
+/**
  * Where are list_empty_rcu() and list_first_entry_rcu()?
  *
  * Implementing those functions following their counterparts list_empty() and
@@ -308,6 +327,25 @@ static inline void list_splice_init_rcu(struct list_head 
*list,
pos = list_entry_rcu(pos->member.next, typeof(*pos), member))
 
 /**
+ * list_for_each_entry_rcu_notrace - iterate over rcu list of given type
+ * @pos:   the type * to use as a loop cursor.
+ * @head:  the head for your list.
+ * @member:the name of the list_struct within the struct.
+ *
+ * This list-traversal primitive may safely run concurrently with
+ * the _rcu list-mutation primitives such as list_add_rcu()
+ * as long as the traversal is guarded by rcu_read_lock().
+ *
+ * This is the same as list_for_each_entry_rcu() except that it does
+ * not do any RCU debugging or tracing.
+ */
+#define list_for_each_entry_rcu_notrace(pos, head, member) \
+   for (pos = list_entry_rcu_notrace((head)->next, typeof(*pos), member); \
+   >member != (head); \
+   pos = list_entry_rcu_notrace(pos->member.next, typeof(*pos), \
+   member))
+
+/**
  * list_for_each_entry_continue_rcu - continue iteration over list of given 
type
  * @pos:   the type * to use as a loop cursor.
  * @head:  the head for your list.
-- 
2.4.0.rc3.8.gfb3e7d5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH kernel 3/9] KVM: PPC: Rework H_PUT_TCE/H_GET_TCE handlers

2015-09-15 Thread Alexey Kardashevskiy
This reworks the existing H_PUT_TCE/H_GET_TCE handlers to have one
exit path. This allows next patch to add locks nicely.

This moves the ioba boundaries check to a helper and adds a check for
least bits which have to be zeros.

The patch is pretty mechanical (only check for least ioba bits is added)
so no change in behaviour is expected.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/kvm/book3s_64_vio_hv.c | 102 +++-
 1 file changed, 66 insertions(+), 36 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c 
b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 89e96b3..8ae12ac 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -35,71 +35,101 @@
 #include 
 #include 
 #include 
+#include 
 
 #define TCES_PER_PAGE  (PAGE_SIZE / sizeof(u64))
 
+/*
+ * Finds a TCE table descriptor by LIOBN.
+ *
+ * WARNING: This will be called in real or virtual mode on HV KVM and virtual
+ *  mode on PR KVM
+ */
+static struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm_vcpu *vcpu,
+   unsigned long liobn)
+{
+   struct kvm *kvm = vcpu->kvm;
+   struct kvmppc_spapr_tce_table *stt;
+
+   list_for_each_entry_rcu_notrace(stt, >arch.spapr_tce_tables, list)
+   if (stt->liobn == liobn)
+   return stt;
+
+   return NULL;
+}
+
+/*
+ * Validates IO address.
+ *
+ * WARNING: This will be called in real-mode on HV KVM and virtual
+ *  mode on PR KVM
+ */
+static long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt,
+   unsigned long ioba, unsigned long npages)
+{
+   unsigned long mask = (1ULL << IOMMU_PAGE_SHIFT_4K) - 1;
+   unsigned long idx = ioba >> IOMMU_PAGE_SHIFT_4K;
+   unsigned long size = stt->window_size >> IOMMU_PAGE_SHIFT_4K;
+
+   if ((ioba & mask) || (size + npages <= idx))
+   return H_PARAMETER;
+
+   return H_SUCCESS;
+}
+
 /* WARNING: This will be called in real-mode on HV KVM and virtual
  *  mode on PR KVM
  */
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
  unsigned long ioba, unsigned long tce)
 {
-   struct kvm *kvm = vcpu->kvm;
-   struct kvmppc_spapr_tce_table *stt;
+   struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+   long ret = H_TOO_HARD;
+   unsigned long idx;
+   struct page *page;
+   u64 *tbl;
 
/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
/*  liobn, ioba, tce); */
 
-   list_for_each_entry(stt, >arch.spapr_tce_tables, list) {
-   if (stt->liobn == liobn) {
-   unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
-   struct page *page;
-   u64 *tbl;
+   if (!stt)
+   return ret;
 
-   /* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p  
window_size=0x%x\n", */
-   /*  liobn, stt, stt->window_size); */
-   if (ioba >= stt->window_size)
-   return H_PARAMETER;
+   ret = kvmppc_ioba_validate(stt, ioba, 1);
+   if (ret)
+   return ret;
 
-   page = stt->pages[idx / TCES_PER_PAGE];
-   tbl = (u64 *)page_address(page);
+   idx = ioba >> SPAPR_TCE_SHIFT;
+   page = stt->pages[idx / TCES_PER_PAGE];
+   tbl = (u64 *)page_address(page);
 
-   /* FIXME: Need to validate the TCE itself */
-   /* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
-   tbl[idx % TCES_PER_PAGE] = tce;
-   return H_SUCCESS;
-   }
-   }
+   /* FIXME: Need to validate the TCE itself */
+   /* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
+   tbl[idx % TCES_PER_PAGE] = tce;
 
-   /* Didn't find the liobn, punt it to userspace */
-   return H_TOO_HARD;
+   return ret;
 }
 EXPORT_SYMBOL_GPL(kvmppc_h_put_tce);
 
 long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
  unsigned long ioba)
 {
-   struct kvm *kvm = vcpu->kvm;
-   struct kvmppc_spapr_tce_table *stt;
+   struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+   long ret = H_TOO_HARD;
 
-   list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
-   if (stt->liobn == liobn) {
+
+   if (stt) {
+   ret = kvmppc_ioba_validate(stt, ioba, 1);
+   if (!ret) {
unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
-   struct page *page;
-   u64 *tbl;
-
-   if (ioba >= stt->window_size)
-   return H_PARAMETER;
-
-   page = stt->pages[idx / TCES_PER_PAGE];
-   tbl = (u64 *)page_address(page);

[PATCH kernel 0/9] KVM: PPC: Add in-kernel multitce handling

2015-09-15 Thread Alexey Kardashevskiy
These patches enable in-kernel acceleration for the H_PUT_TCE_INDIRECT and
H_STUFF_TCE hypercalls, which allow updating multiple (up to 512) TCE entries
in a single call, saving time on context switching. QEMU already supports
these hypercalls, so this is just an optimization.

Both HV and PR KVM modes are supported.

This does not affect VFIO, this support is coming next.

Please comment. Thanks.


Alexey Kardashevskiy (9):
  rcu: Define notrace version of list_for_each_entry_rcu
  KVM: PPC: Make real_vmalloc_addr() public
  KVM: PPC: Rework H_PUT_TCE/H_GET_TCE handlers
  KVM: PPC: Use RCU for arch.spapr_tce_tables
  KVM: PPC: Account TCE-containing pages in locked_vm
  KVM: PPC: Replace SPAPR_TCE_SHIFT with IOMMU_PAGE_SHIFT_4K
  KVM: PPC: Move reusable bits of H_PUT_TCE handler to helpers
  KVM: Fix KVM_SMI chapter number
  KVM: PPC: Add support for multiple-TCE hcalls

 Documentation/virtual/kvm/api.txt|  27 ++-
 arch/powerpc/include/asm/kvm_book3s_64.h |   2 -
 arch/powerpc/include/asm/kvm_host.h  |   1 +
 arch/powerpc/include/asm/kvm_ppc.h   |  16 ++
 arch/powerpc/include/asm/mmu-hash64.h|   3 +
 arch/powerpc/kvm/book3s.c|   2 +-
 arch/powerpc/kvm/book3s_64_vio.c | 185 --
 arch/powerpc/kvm/book3s_64_vio_hv.c  | 310 +++
 arch/powerpc/kvm/book3s_hv.c |  26 ++-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c  |  17 --
 arch/powerpc/kvm/book3s_hv_rmhandlers.S  |   6 +-
 arch/powerpc/kvm/book3s_pr_papr.c|  35 
 arch/powerpc/kvm/powerpc.c   |   3 +
 arch/powerpc/mm/hash_utils_64.c  |  17 ++
 include/linux/rculist.h  |  38 
 15 files changed, 610 insertions(+), 78 deletions(-)

-- 
2.4.0.rc3.8.gfb3e7d5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH kernel 2/9] KVM: PPC: Make real_vmalloc_addr() public

2015-09-15 Thread Alexey Kardashevskiy
This helper translates vmalloc'd addresses to linear addresses.
It is only used by the KVM MMU code now and resides in the HV KVM code.
We will need it further in the TCE code and the DMA memory preregistration
code called in real mode.

This makes real_vmalloc_addr() public and moves it to the powerpc code as
it does not do anything special for KVM.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/mmu-hash64.h |  3 +++
 arch/powerpc/kvm/book3s_hv_rm_mmu.c   | 17 -
 arch/powerpc/mm/hash_utils_64.c   | 17 +
 3 files changed, 20 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu-hash64.h b/arch/powerpc/include/asm/mmu-hash64.h
index a82f534..fd06b73 100644
--- a/arch/powerpc/include/asm/mmu-hash64.h
+++ b/arch/powerpc/include/asm/mmu-hash64.h
@@ -606,6 +606,9 @@ static inline unsigned long get_kernel_vsid(unsigned long ea, int ssize)
context = (MAX_USER_CONTEXT) + ((ea >> 60) - 0xc) + 1;
return get_vsid(context, ea, ssize);
 }
+
+void *real_vmalloc_addr(void *x);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_MMU_HASH64_H_ */
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index c1df9bb..987b7d1 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -22,23 +22,6 @@
 #include 
 #include 
 
-/* Translate address of a vmalloc'd thing to a linear map address */
-static void *real_vmalloc_addr(void *x)
-{
-   unsigned long addr = (unsigned long) x;
-   pte_t *p;
-   /*
-* assume we don't have huge pages in vmalloc space...
-* So don't worry about THP collapse/split. Called
-* Only in realmode, hence won't need irq_save/restore.
-*/
-   p = __find_linux_pte_or_hugepte(swapper_pg_dir, addr, NULL);
-   if (!p || !pte_present(*p))
-   return NULL;
-   addr = (pte_pfn(*p) << PAGE_SHIFT) | (addr & ~PAGE_MASK);
-   return __va(addr);
-}
-
 /* Return 1 if we need to do a global tlbie, 0 if we can use tlbiel */
 static int global_invalidates(struct kvm *kvm, unsigned long flags)
 {
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 5ec987f..9737d6a 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1556,3 +1556,20 @@ void setup_initial_memory_limit(phys_addr_t first_memblock_base,
/* Finally limit subsequent allocations */
memblock_set_current_limit(ppc64_rma_size);
 }
+
+/* Translate address of a vmalloc'd thing to a linear map address */
+void *real_vmalloc_addr(void *x)
+{
+   unsigned long addr = (unsigned long) x;
+   pte_t *p;
+   /*
+* assume we don't have huge pages in vmalloc space...
+* So don't worry about THP collapse/split. Called
+* Only in realmode, hence won't need irq_save/restore.
+*/
+   p = __find_linux_pte_or_hugepte(swapper_pg_dir, addr, NULL);
+   if (!p || !pte_present(*p))
+   return NULL;
+   addr = (pte_pfn(*p) << PAGE_SHIFT) | (addr & ~PAGE_MASK);
+   return __va(addr);
+}
-- 
2.4.0.rc3.8.gfb3e7d5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 0/2] KVM: nested VPID emulation

2015-09-15 Thread Wanpeng Li
v1 -> v2:
 * enhance allocate/free_vpid to handle shadow vpid
 * drop empty space
 * allocate shadow vpid during initialization
 * For each nested vmentry, if vpid12 is changed, reuse shadow vpid w/ an 
   invvpid.

VPID is used to tag address space and avoid a TLB flush. Currently L0 uses
the same VPID to run L1 and all its guests. KVM flushes VPID when switching 
between L1 and L2. 

This patch advertises VPID to the L1 hypervisor, so that the address spaces of
L1 and L2 can be tagged separately, avoiding a TLB flush when switching between
L1 and L2.

Performance: 

run lmbench on L2 w/ 3.5 kernel.

Context switching - times in microseconds - smaller is better
-
Host OS  2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
 ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
- - -- -- -- -- -- --- ---
kernel Linux 3.5.0-1 1.2200 1.3700 1.4500 4.7800 2.3300 5.6 2.88000
nested VPID
kernel Linux 3.5.0-1 1.2600 1.4300 1.5600   12.7   12.9 3.49000 7.46000
vanilla


Wanpeng Li (2):
  KVM: nVMX: enhance allocate/free_vpid to handle shadow vpid
  KVM: nVMX: nested VPID emulation

 arch/x86/kvm/vmx.c | 58 ++
 1 file changed, 45 insertions(+), 13 deletions(-)

-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 2/2] KVM: nVMX: nested VPID emulation

2015-09-15 Thread Wanpeng Li
VPID is used to tag address space and avoid a TLB flush. Currently L0 uses
the same VPID to run L1 and all its guests. KVM flushes VPID when switching 
between L1 and L2. 

This patch advertises VPID to the L1 hypervisor, so that the address spaces of
L1 and L2 can be tagged separately, avoiding a TLB flush when switching between
L1 and L2. For each nested vmentry, if vpid12 is changed, reuse the shadow vpid
w/ an invvpid.

Performance: 

run lmbench on L2 w/ 3.5 kernel.

Context switching - times in microseconds - smaller is better
-
Host OS  2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
 ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
- - -- -- -- -- -- --- ---
kernel Linux 3.5.0-1 1.2200 1.3700 1.4500 4.7800 2.3300 5.6 2.88000
nested VPID
kernel Linux 3.5.0-1 1.2600 1.4300 1.5600   12.7   12.9 3.49000 7.46000
vanilla

Suggested-by: Wincy Van 
Signed-off-by: Wanpeng Li 
---
 arch/x86/kvm/vmx.c | 25 ++---
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index bd07d88..6d8ec00 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -425,6 +425,7 @@ struct nested_vmx {
u64 vmcs01_debugctl;
 
u16 vpid02;
+   u16 last_vpid;
 
u32 nested_vmx_procbased_ctls_low;
u32 nested_vmx_procbased_ctls_high;
@@ -1159,6 +1160,11 @@ static inline bool nested_cpu_has_virt_x2apic_mode(struct vmcs12 *vmcs12)
return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE);
 }
 
+static inline bool nested_cpu_has_vpid(struct vmcs12 *vmcs12)
+{
+   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_VPID);
+}
+
 static inline bool nested_cpu_has_apic_reg_virt(struct vmcs12 *vmcs12)
 {
return nested_cpu_has2(vmcs12, SECONDARY_EXEC_APIC_REGISTER_VIRT);
@@ -2473,6 +2479,7 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx)
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
SECONDARY_EXEC_RDTSCP |
SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
+   SECONDARY_EXEC_ENABLE_VPID |
SECONDARY_EXEC_APIC_REGISTER_VIRT |
SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
SECONDARY_EXEC_WBINVD_EXITING |
@@ -9460,13 +9467,17 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
 
if (enable_vpid) {
-   /*
-* Trivially support vpid by letting L2s share their parent
-* L1's vpid. TODO: move to a more elaborate solution, giving
-* each L2 its own vpid and exposing the vpid feature to L1.
-*/
-   vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
-   vmx_flush_tlb(vcpu);
+   if (nested_cpu_has_vpid(vmcs12)) {
+   vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->nested.vpid02);
+   if (vmcs12->virtual_processor_id != vmx->nested.last_vpid) {
+   vmx->nested.last_vpid = vmcs12->virtual_processor_id;
+   vmx_flush_tlb(vcpu);
+   }
+   } else {
+   vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
+   vmx_flush_tlb(vcpu);
+   }
+
}
 
if (nested_cpu_has_ept(vmcs12)) {
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/2] KVM: nVMX: enhance allocate/free_vpid to handle shadow vpid

2015-09-15 Thread Wanpeng Li
Enhance allocate/free_vpid to handle shadow vpid.

Suggested-by: Wincy Van 
Signed-off-by: Wanpeng Li 
---
 arch/x86/kvm/vmx.c | 33 +++--
 1 file changed, 27 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index da1590e..bd07d88 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -424,6 +424,8 @@ struct nested_vmx {
/* to migrate it to L2 if VM_ENTRY_LOAD_DEBUG_CONTROLS is off */
u64 vmcs01_debugctl;
 
+   u16 vpid02;
+
u32 nested_vmx_procbased_ctls_low;
u32 nested_vmx_procbased_ctls_high;
u32 nested_vmx_true_procbased_ctls_low;
@@ -4155,18 +4157,29 @@ static int alloc_identity_pagetable(struct kvm *kvm)
return r;
 }
 
-static void allocate_vpid(struct vcpu_vmx *vmx)
+static void allocate_vpid(struct vcpu_vmx *vmx, bool nested)
 {
int vpid;
 
-   vmx->vpid = 0;
if (!enable_vpid)
return;
+   if (!nested)
+   vmx->vpid = 0;
+   else
+   vmx->nested.vpid02 = 0;
	spin_lock(&vmx_vpid_lock);
-   vpid = find_first_zero_bit(vmx_vpid_bitmap, VMX_NR_VPIDS);
-   if (vpid < VMX_NR_VPIDS) {
+   if (!nested) {
+   vpid = find_first_zero_bit(vmx_vpid_bitmap, VMX_NR_VPIDS);
+   if (vpid < VMX_NR_VPIDS) {
vmx->vpid = vpid;
__set_bit(vpid, vmx_vpid_bitmap);
+   }
+   } else {
+   vpid = find_first_zero_bit(vmx_vpid_bitmap, VMX_NR_VPIDS);
+   if (vpid < VMX_NR_VPIDS) {
+   vmx->nested.vpid02 = vpid;
+   __set_bit(vpid, vmx_vpid_bitmap);
+   }
}
	spin_unlock(&vmx_vpid_lock);
 }
@@ -4178,6 +4191,12 @@ static void free_vpid(struct vcpu_vmx *vmx)
	spin_lock(&vmx_vpid_lock);
if (vmx->vpid != 0)
__clear_bit(vmx->vpid, vmx_vpid_bitmap);
+   if (!nested) {
+   spin_unlock(&vmx_vpid_lock);
+   return;
+   }
+   if (vmx->nested.vpid02)
+   __clear_bit(vmx->nested.vpid02, vmx_vpid_bitmap);
	spin_unlock(&vmx_vpid_lock);
 }
 
@@ -8509,7 +8528,7 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
if (!vmx)
return ERR_PTR(-ENOMEM);
 
-   allocate_vpid(vmx);
+   allocate_vpid(vmx, false);
 
	err = kvm_vcpu_init(&vmx->vcpu, kvm, id);
if (err)
@@ -8557,8 +8576,10 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
goto free_vmcs;
}
 
-   if (nested)
+   if (nested) {
nested_vmx_setup_ctls_msrs(vmx);
+   allocate_vpid(vmx, true);
+   }
 
vmx->nested.posted_intr_nv = -1;
vmx->nested.current_vmptr = -1ull;
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Migration fail randomly

2015-09-15 Thread Sander Klein

Hi,

I'm running Debian Jessie with KVM 2.1 on 6 nodes with an NFS3 back-end 
and sometimes when I migrate a VM it suddenly 'hangs'. The KVM process 
is running on the new node, eating up all its CPU cycles, but it does 
not respond to anything anymore. I don't see any errors logged and it 
seems like the migration went fine except for the VM hanging.


It doesn't matter if I do a live migrate or a save and restore, it 
happens with both ways.


Destroying and booting the VM fixes it of course and I can migrate it 
afterwards, but after a couple of days the same happens again when I try 
to migrate it.


The VM's range from Debian Squeeze (With a 3.2 kernel), Wheezy, and 
Jessie.


Does anyone have an idea why this happens and if I can fix this?

Regards,

Sander
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-15 Thread Christian Borntraeger
Tejun,


commit d59cfc09c32a2ae31f1c3bc2983a0cd79afb3f14 (sched, cgroup: replace 
signal_struct->group_rwsem with a global percpu_rwsem) causes some noticeable
hiccups when starting several kvm guests (which libvirt will move into cgroups
- each vcpu thread and each i/o thread)
When you now start lots of guests in parallel on a bigger system (32CPUs with
2way smt in my case) the system is so busy that systemd runs into several 
timeouts
like "Did not receive a reply. Possible causes include: the remote application 
did
not send a reply, the message bus security policy blocked the reply, the reply
timeout expired, or the network connection was broken."

The problem seems to be that the newly used percpu_rwsem does a
rcu_synchronize_sched_expedited for all write downs/ups.

Hacking the percpu_rw_semaphore to always go the slow path and avoid the
synchronize_sched seems to fix the issue. For some (yet unknown to me) reason
the synchronize_sched and the fast path seems to block writers for incredibly
long times.

To trigger the problem, the guest must be CPU bound; iow, idle guests seem
not to trigger the pathological case. Any idea how to improve the situation?
This looks like a real regression for larger kvm installations.

Christian

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] os-android: Add support to android platform, built by ndk-r10

2015-09-15 Thread Houcheng Lin
Hi Paolo,

Thanks for your review and suggestions. I'll fix this patch accordingly.
Please also see my replies below.

best regards,
Houcheng Lin

2015-09-15 17:41 GMT+08:00 Paolo Bonzini :
>
> This is okay and can be done unconditionally (introduce a new
> qemu_getdtablesize function that is defined in util/oslib-posix.c).

Will fix it.
>
>
>> - sigtimewait(): call __rt_sigtimewait() instead.
>> - lockf(): not see this feature in android, directly return -1.
>> - shm_open(): not see this feature in android, directly return -1.
>
> This is not okay.  Please fix your libc instead.

I'll modify the bionic C library to support these functions and feed the
changes back to Google's AOSP project. But the android kernel does not support
shmem, so I will prevent compiling ivshmem.c in the android config. The config
for ivshmem in pci.mak will be:

CONFIG_IVSHMEM=$(call land, $(call lnot,$(CONFIG_ANDROID)),$(CONFIG_KVM))

> For sys/io.h, we can just remove the inclusion.  It is not necessary.
>
> scsi/sg.h should be exported by the Linux kernel, so that we can use
> scripts/update-linux-headers.sh to copy it from QEMU.  I've sent a Linux
> kernel patch and CCed you.

It's better to put headers on kernel user headers. Thanks.

>
> You should instead disable the guest agent on your configure command line.
>

Okay.

>
> If you have CONFIG_ANDROID, you do not need -DANDROID.
>

Okay.

>> +  LIBS="-lglib-2.0 -lgthread-2.0 -lz -lpixman-1 -lintl -liconv -lc $LIBS"
>> +  libs_qga="-lglib-2.0 -lgthread-2.0 -lz -lpixman-1 -lintl -liconv -lc"
>
> This should not be necessary, QEMU uses pkg-config.
>
>> +fi
>>  if [ "$bsd" = "yes" ] ; then
>>if [ "$darwin" != "yes" ] ; then
>>  bsd_user="yes"
>> @@ -1736,7 +1749,14 @@ fi
>>  # pkg-config probe
>>
>>  if ! has "$pkg_config_exe"; then
>> -  error_exit "pkg-config binary '$pkg_config_exe' not found"
>> +  case $cross_prefix in
>> +*android*)
>> + pkg_config_exe=scripts/android-pkg-config
>
> Neither should this.  Your cross-compilation environment is not
> correctly set up if you do not have a pkg-config executable.  If you
> want to use a wrapper, you can specify it with the PKG_CONFIG
> environment variable.  But it need not be maintained in the QEMU
> repository, because QEMU assumes a complete cross-compilation environment.

I'll use a wrapper in the next release and specify it with the environment
variable. Later, I may generate a pkg-config data file while building the
library and install it into the cross-compilation environment.

>
>> + ;;
>> +*)
>> + error_exit "pkg-config binary '$pkg_config_exe' not found"
>> + ;;
>> +  esac
>>  fi
>>
>>  ##
>> @@ -3764,7 +3784,7 @@ elif compile_prog "" "$pthread_lib -lrt" ; then
>>  fi
>>
>>  if test "$darwin" != "yes" -a "$mingw32" != "yes" -a "$solaris" != yes -a \
>> -"$aix" != "yes" -a "$haiku" != "yes" ; then
>> +"$aix" != "yes" -a "$haiku" != "yes" -a "$android" != "yes" ; then
>>  libs_softmmu="-lutil $libs_softmmu"
>>  fi
>>
>> @@ -4709,6 +4729,10 @@ if test "$linux" = "yes" ; then
>>echo "CONFIG_LINUX=y" >> $config_host_mak
>>  fi
>>
>> +if test "$android" = "yes" ; then
>> +  echo "CONFIG_ANDROID=y" >> $config_host_mak
>> +fi
>> +
>>  if test "$darwin" = "yes" ; then
>>echo "CONFIG_DARWIN=y" >> $config_host_mak
>>  fi
>> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
>> index ab3c876..1ba22be 100644
>> --- a/include/qemu/osdep.h
>> +++ b/include/qemu/osdep.h
>> @@ -59,6 +59,10 @@
>>  #define WEXITSTATUS(x) (x)
>>  #endif
>>
>> +#ifdef ANDROID
>> + #include "sysemu/os-android.h"
>> +#endif
>> +
>>  #ifdef _WIN32
>>  #include "sysemu/os-win32.h"
>>  #endif
>> diff --git a/include/sysemu/os-android.h b/include/sysemu/os-android.h
>> new file mode 100644
>> index 000..7f73777
>> --- /dev/null
>> +++ b/include/sysemu/os-android.h
>> @@ -0,0 +1,35 @@
>> +#ifndef QEMU_OS_ANDROID_H
>> +#define QEMU_OS_ANDROID_H
>> +
>> +#include 
>> +
>> +/*
>> + * For include the basename prototyping in android.
>> + */
>> +#include 
>
> These two includes can be added to sysemu/os-posix.h.
>
>> +extern int __rt_sigtimedwait(const sigset_t *uthese, siginfo_t *uinfo, \
>> + const struct timespec *uts, size_t sigsetsize);
>> +
>> +int getdtablesize(void);
>> +int lockf(int fd, int cmd, off_t len);
>> +int sigtimedwait(const sigset_t *set, siginfo_t *info, \
>> + const struct timespec *ts);
>> +int shm_open(const char *name, int oflag, mode_t mode);
>> +char *a_ptsname(int fd);
>> +
>> +#define F_TLOCK 0
>> +#define IOV_MAX 256
>
> libc should really be fixed for all of these (except getdtablesize and
> a_ptsname).  The way you're working around it is introducing subtle
> bugs, for example the pidfile is _not_ locked.

Okay, I will fix it.

>> +#define I_LOOK  (__SID | 4) /* Retrieve the name of the module just 
>> below
>> +   the STREAM head 

Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-15 Thread Paul E. McKenney
On Tue, Sep 15, 2015 at 06:42:19PM +0200, Paolo Bonzini wrote:
> 
> 
> On 15/09/2015 15:36, Christian Borntraeger wrote:
> > I am wondering why the old code behaved in such fatal ways. Is there
> > some interaction between waiting for a reschedule in the
> > synchronize_sched writer and some fork code actually waiting for the
> > read side to get the lock together with some rescheduling going on
> > waiting for a lock that fork holds? lockdep does not give me an hints
> > so I have no clue :-(
> 
> It may just be consuming too much CPU usage.  kernel/rcu/tree.c warns
> about it:
> 
>  * if you are using synchronize_sched_expedited() in a loop, please
>  * restructure your code to batch your updates, and then use a single
>  * synchronize_sched() instead.
> 
> and you may remember that in KVM we switched from RCU to SRCU exactly to
> avoid userspace-controlled synchronize_rcu_expedited().
> 
> In fact, I would say that any userspace-controlled call to *_expedited()
> is a bug waiting to happen and a bad idea---because userspace can, with
> little effort, end up calling it in a loop.

Excellent points!

Other options in such situations include the following:

o   Rework so that the code uses call_rcu*() instead of *_expedited().

o   Maintain a per-task or per-CPU counter so that every so many
*_expedited() invocations instead uses the non-expedited
counterpart.  (For example, synchronize_rcu instead of
synchronize_rcu_expedited().)

Note that synchronize_srcu_expedited() is less troublesome than are the
other *_expedited() functions, because synchronize_srcu_expedited() does
not inflict OS jitter on other CPUs.  This situation is being improved,
so that the other *_expedited() functions inflict less OS jitter and
(mostly) avoid inflicting OS jitter on nohz_full CPUs and idle CPUs (the
latter being important for battery-powered systems).  In addition, the
*_expedited() functions avoid hammering CPUs with N-squared OS jitter
in response to concurrent invocation from all CPUs because multiple
concurrent *_expedited() calls will be satisfied by a single expedited
grace-period operation.  Nevertheless, as Paolo points out, it is still
necessary to exercise caution when exposing synchronous grace periods
to userspace control.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 6/8] arm/arm64: KVM: Add forwarded physical interrupts documentation

2015-09-15 Thread Andre Przywara
Hi Christoffer,

On 14/09/15 12:42, Christoffer Dall wrote:

 Where is this done? I see that the physical dist state is altered on the
 actual IRQ forwarding, but not on later exits/entries? Do you mean
 kvm_vgic_flush_hwstate() with "flush"?
>>>
>>> this is a bug and should be fixed in the 'fixes' patches I sent last
>>> week.  We should set active state on every entry to the guest for IRQs
>>> with the HW bit set in either pending or active state.
>>
>> OK, sorry, I missed that one patch, I was looking at what should become
>> -rc1 soon (because that's what I want to rebase my ITS emulation patches
>> on). That patch wasn't in queue at the time I started looking at it.
>>
>> So I updated to the latest queue containing those two fixes and also
>> applied your v2 series. Indeed this series addresses some of the things
>> I was wondering about the last time, but the main thing still persists:
>> - Every time the physical dist state is active we have the virtual state
>> still at pending or active.
> 
> For the arch timer, yes.
> 
> For a passthrough device, there should be a situation where the physical
> dist state is active but we didn't see the virtual state updated at the
> vgic yet (after physical IRQ fires and before the VFIO ISR calls
> kvm_set_irq).

But then we wouldn't get into vgic_sync_hwirq(), because we wouldn't
inject a mapped IRQ before kvm_set_irq() is called, would we?

>> - If the physical dist state is non-active, the virtual state is
>> inactive (LR.state==8: HW bit) as well. The associated ELRSR bit is 1
>> (LR empty).
>> (I was tracing every HW mapped LR in vgic_sync_hwirq() for this)
>>
>> So that contradicts:
>>
>> +  - On guest EOI, the *physical distributor* active bit gets cleared,
>> +but the LR.Active is left untouched (set).
>>
>> This is the main point I was actually wondering about: I cannot confirm
>> this statement. In my tests the LR state and the physical dist state
always correspond, as expected from reading the spec.
>>
>> I reckon that these observations are mostly independent from the actual
>> KVM code, as I try to observe hardware state (physical distributor and
>> LRs) before KVM tinkers with them.
> 
> ok, I got this paragraph from Marc, so we really need to ask him?  Which
> hardware are you seeing this behavior on?  Perhaps implementations vary
> on this point?

I checked this on Midway and Juno. Both have a GIC-400, but I don't have
access to any other GIC implementations.
I added the two BUG_ONs shown below to prove that assumption.

Eric, I've been told you observed the behaviour with the GIC not syncing
LR and phys state for a mapped HWIRQ which was not the timer.
Can you reproduce this? Does it complain with the patch below?

diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
index 5942ce9..7fac16e 100644
--- a/virt/kvm/arm/vgic.c
+++ b/virt/kvm/arm/vgic.c
@@ -1459,9 +1459,12 @@ static bool vgic_sync_hwirq(struct kvm_vcpu
 IRQCHIP_STATE_ACTIVE,
 false);
WARN_ON(ret);
+   BUG_ON(!(vlr.state & 3));
return false;
}

+   BUG_ON(vlr.state & 3);
+
return process_queued_irq(vcpu, lr, vlr);
 }

> 
> I have no objections removing this point from the doc though, I'm just
> relaying information on this one.

I see, I talked with Marc and I am about to gather more data with the
above patch to prove that this never happens.

>>
>> ...
>>
>>>
 Is this an observation, an implementation bug or is this mentioned in
 the spec? Needing to spoon-feed the VGIC by doing it's job sounds a bit
 awkward to me.
>>>
>>> What do you mean?  How are we spoon-feeding the VGIC?
>>
>> By looking at the physical dist state and all LRs and clearing the LR we
>> do what the GIC is actually supposed to do for us - and what it actually
>> does according to my observations.
>>
>> The point is that patch 1 in my ITS emulation series is reworking the LR
>> handling and this patch was based on assumptions that seem to be no
>> longer true (i.e. we don't care about inactive LRs except for our LR
>> mapping code). So I want to be sure that I fully get what is going on
>> here and I struggle at this at the moment due to the above statement.
>>
>> What are the plans regarding your "v2: Rework architected timer..."
>> series? Will this be queued for 4.4? I want to do the
>> rebasing^Wrewriting of my series only once if possible ;-)
>>
> I think we should settle on this series ASAP and base your ITS stuff on
> top of it.  What do you think?

Yeah, that's what I was thinking too. So I will be working against
4.3-rc1 with your timer-rework-v2 branch plus the other fixes from the
kvm-arm queue merged.

Cheers,
Andre.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] KVM: VMX: align vmx->nested.nested_vmx_secondary_ctls_high to vmx->rdtscp_enabled

2015-09-15 Thread Paolo Bonzini
The SECONDARY_EXEC_RDTSCP must be available iff RDTSCP is enabled in the
guest.

Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/vmx.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 4df94e2b7c23..fb7a35ad9b02 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -8678,9 +8678,14 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
exec_control);
}
 
-   if (nested && !vmx->rdtscp_enabled)
-   vmx->nested.nested_vmx_secondary_ctls_high &=
-   ~SECONDARY_EXEC_RDTSCP;
+   if (nested) {
+   if (vmx->rdtscp_enabled)
+   vmx->nested.nested_vmx_secondary_ctls_high |=
+   SECONDARY_EXEC_RDTSCP;
+   else
+   vmx->nested.nested_vmx_secondary_ctls_high &=
+   ~SECONDARY_EXEC_RDTSCP;
+   }
}
 
/* Exposing INVPCID only when PCID is exposed */
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv2 00/15] arm64: 16K translation granule support

2015-09-15 Thread Suzuki K. Poulose
From: "Suzuki K. Poulose" 

This series enables the 16K page size support on Linux for arm64.
Adds support for 48bit VA(4 level), 47bit VA(3 level) and
36bit VA(2 level) with 16K. 16K was a late addition to the architecture
and is not implemented by all CPUs. Added a check to ensure the
selected granule size is supported by the CPU, failing which the CPU
won't proceed with booting. Also the kernel page size is added to the
kernel image header (patch from Ard).

KVM bits have been tested on a fast model with GICv3 using kvmtool [1].

Patches 1-7 cleans up the kernel page size handling code.
Patch 8 Adds a check to ensure the CPU supports the selected granule 
size.
Patch 9 Adds the page size information to image header.
Patches 10-13   Fixes some issues with the KVM bits, mainly the fake PGD
handling code.
Patches 14-15   Adds the 16k page size support bits.

This series applies on top of 4.3-rc1.

The tree is also available here:

git://linux-arm.org/linux-skp.git  16k/v2-4.3-rc1

Changes since V1:
  - Rebase to 4.3-rc1
  - Fix vmemmap_populate for 16K (use !ARM64_SWAPPER_USES_SECTION_MAPS)
  - Better description for patch2 (suggested-by: Ard)
  - Add page size information to the image header flags.
  - Added reviewed-by/tested-by Ard.

[1] git://git.kernel.org/pub/scm/linux/kernel/git/will/kvmtool.git

Ard Biesheuvel (1):
  arm64: Add page size to the kernel image header

Suzuki K. Poulose (14):
  arm64: Move swapper pagetable definitions
  arm64: Handle section maps for swapper/idmap
  arm64: Introduce helpers for page table levels
  arm64: Calculate size for idmap_pg_dir at compile time
  arm64: Handle 4 level page table for swapper
  arm64: Clean config usages for page size
  arm64: Kconfig: Fix help text about AArch32 support with 64K pages
  arm64: Check for selected granule support
  arm64: kvm: Fix {V}TCR_EL2_TG0 mask
  arm64: Cleanup VTCR_EL2 computation
  arm: kvm: Move fake PGD handling to arch specific files
  arm64: kvm: Rewrite fake pgd handling
  arm64: Add 16K page size support
  arm64: 36 bit VA

 Documentation/arm64/booting.txt |7 +-
 arch/arm/include/asm/kvm_mmu.h  |7 ++
 arch/arm/kvm/mmu.c  |   44 ++
 arch/arm64/Kconfig  |   37 +++--
 arch/arm64/Kconfig.debug|2 +-
 arch/arm64/include/asm/boot.h   |1 +
 arch/arm64/include/asm/fixmap.h |4 +-
 arch/arm64/include/asm/kernel-pgtable.h |   77 ++
 arch/arm64/include/asm/kvm_arm.h|   29 +--
 arch/arm64/include/asm/kvm_mmu.h|  135 +--
 arch/arm64/include/asm/page.h   |   20 +
 arch/arm64/include/asm/pgtable-hwdef.h  |   15 +++-
 arch/arm64/include/asm/sysreg.h |8 ++
 arch/arm64/include/asm/thread_info.h|4 +-
 arch/arm64/kernel/head.S|   71 +---
 arch/arm64/kernel/image.h   |5 +-
 arch/arm64/kernel/vmlinux.lds.S |1 +
 arch/arm64/mm/mmu.c |   72 -
 arch/arm64/mm/proc.S|4 +-
 19 files changed, 348 insertions(+), 195 deletions(-)
 create mode 100644 arch/arm64/include/asm/kernel-pgtable.h

-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 10/15] arm64: kvm: Fix {V}TCR_EL2_TG0 mask

2015-09-15 Thread Suzuki K. Poulose
From: "Suzuki K. Poulose" 

{V}TCR_EL2_TG0 is a 2bit wide field, where:

 00 - 4K
 01 - 64K
 10 - 16K

But we use only 1 bit, which has worked well so far since
we never cared about 16K. Fix it for 16K support.

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Marc Zyngier 
Cc: Christoffer Dall 
Cc: kvm...@lists.cs.columbia.edu
Acked-by: Mark Rutland 
Signed-off-by: Suzuki K. Poulose 
---
 arch/arm64/include/asm/kvm_arm.h |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index 7605e09..bdf139e 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -98,7 +98,7 @@
 #define TCR_EL2_TBI(1 << 20)
 #define TCR_EL2_PS (7 << 16)
 #define TCR_EL2_PS_40B (2 << 16)
-#define TCR_EL2_TG0(1 << 14)
+#define TCR_EL2_TG0(3 << 14)
 #define TCR_EL2_SH0(3 << 12)
 #define TCR_EL2_ORGN0  (3 << 10)
 #define TCR_EL2_IRGN0  (3 << 8)
@@ -110,7 +110,7 @@
 
 /* VTCR_EL2 Registers bits */
 #define VTCR_EL2_PS_MASK   (7 << 16)
-#define VTCR_EL2_TG0_MASK  (1 << 14)
+#define VTCR_EL2_TG0_MASK  (3 << 14)
 #define VTCR_EL2_TG0_4K(0 << 14)
 #define VTCR_EL2_TG0_64K   (1 << 14)
 #define VTCR_EL2_SH0_MASK  (3 << 12)
-- 
1.7.9.5



[PATCH 02/15] arm64: Handle section maps for swapper/idmap

2015-09-15 Thread Suzuki K. Poulose
From: "Suzuki K. Poulose" 

We use section maps with 4K page size to create the swapper/idmaps.
So far we have used !64K or 4K checks to handle the case where we
use the section maps.
This patch adds a new symbol, ARM64_SWAPPER_USES_SECTION_MAPS, to
handle cases where we use section maps, instead of using the page size
symbols.

Cc: Ard Biesheuvel 
Cc: Mark Rutland 
Cc: Catalin Marinas 
Cc: Will Deacon 
Signed-off-by: Suzuki K. Poulose 
---
Changes since v1:
  - Use ARM64_SWAPPER_USES_SECTION_MAPS for vmemmap_populate()
  - Fix description
---
 arch/arm64/include/asm/kernel-pgtable.h |   31 -
 arch/arm64/mm/mmu.c |   72 ++-
 2 files changed, 52 insertions(+), 51 deletions(-)

diff --git a/arch/arm64/include/asm/kernel-pgtable.h b/arch/arm64/include/asm/kernel-pgtable.h
index 622929d..5876a36 100644
--- a/arch/arm64/include/asm/kernel-pgtable.h
+++ b/arch/arm64/include/asm/kernel-pgtable.h
@@ -19,6 +19,13 @@
 #ifndef __ASM_KERNEL_PGTABLE_H
 #define __ASM_KERNEL_PGTABLE_H
 
+/* With 4K pages, we use section maps. */
+#ifdef CONFIG_ARM64_4K_PAGES
+#define ARM64_SWAPPER_USES_SECTION_MAPS 1
+#else
+#define ARM64_SWAPPER_USES_SECTION_MAPS 0
+#endif
+
 /*
  * The idmap and swapper page tables need some space reserved in the kernel
  * image. Both require pgd, pud (4 levels only) and pmd tables to (section)
@@ -28,26 +35,28 @@
  * could be increased on the fly if system RAM is out of reach for the default
  * VA range, so 3 pages are reserved in all cases.
  */
-#ifdef CONFIG_ARM64_64K_PAGES
-#define SWAPPER_PGTABLE_LEVELS (CONFIG_PGTABLE_LEVELS)
-#else
+#if ARM64_SWAPPER_USES_SECTION_MAPS
 #define SWAPPER_PGTABLE_LEVELS (CONFIG_PGTABLE_LEVELS - 1)
+#else
+#define SWAPPER_PGTABLE_LEVELS (CONFIG_PGTABLE_LEVELS)
 #endif
 
 #define SWAPPER_DIR_SIZE   (SWAPPER_PGTABLE_LEVELS * PAGE_SIZE)
 #define IDMAP_DIR_SIZE (3 * PAGE_SIZE)
 
 /* Initial memory map size */
-#ifdef CONFIG_ARM64_64K_PAGES
-#define SWAPPER_BLOCK_SHIFTPAGE_SHIFT
-#define SWAPPER_BLOCK_SIZE PAGE_SIZE
-#define SWAPPER_TABLE_SHIFTPMD_SHIFT
-#else
+#if ARM64_SWAPPER_USES_SECTION_MAPS
 #define SWAPPER_BLOCK_SHIFTSECTION_SHIFT
 #define SWAPPER_BLOCK_SIZE SECTION_SIZE
 #define SWAPPER_TABLE_SHIFTPUD_SHIFT
+#else
+#define SWAPPER_BLOCK_SHIFTPAGE_SHIFT
+#define SWAPPER_BLOCK_SIZE PAGE_SIZE
+#define SWAPPER_TABLE_SHIFTPMD_SHIFT
 #endif
 
+/* The size of the initial kernel direct mapping */
+#define SWAPPER_INIT_MAP_SIZE  (_AC(1, UL) << SWAPPER_TABLE_SHIFT)
 
 /*
  * Initial memory map attributes.
@@ -55,10 +64,10 @@
 #define SWAPPER_PTE_FLAGS  PTE_TYPE_PAGE | PTE_AF | PTE_SHARED
 #define SWAPPER_PMD_FLAGS  PMD_TYPE_SECT | PMD_SECT_AF | PMD_SECT_S
 
-#ifdef CONFIG_ARM64_64K_PAGES
-#define SWAPPER_MM_MMUFLAGSPTE_ATTRINDX(MT_NORMAL) | SWAPPER_PTE_FLAGS
-#else
+#if ARM64_SWAPPER_USES_SECTION_MAPS
 #define SWAPPER_MM_MMUFLAGSPMD_ATTRINDX(MT_NORMAL) | SWAPPER_PMD_FLAGS
+#else
+#define SWAPPER_MM_MMUFLAGSPTE_ATTRINDX(MT_NORMAL) | SWAPPER_PTE_FLAGS
 #endif
 
 
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 9211b85..f533312 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -32,6 +32,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -353,14 +354,11 @@ static void __init map_mem(void)
 * memory addressable from the initial direct kernel mapping.
 *
 * The initial direct kernel mapping, located at swapper_pg_dir, gives
-* us PUD_SIZE (4K pages) or PMD_SIZE (64K pages) memory starting from
-* PHYS_OFFSET (which must be aligned to 2MB as per
-* Documentation/arm64/booting.txt).
+* us PUD_SIZE (with SECTION maps, i.e, 4K) or PMD_SIZE (without
+* SECTION maps, i.e, 64K pages) memory starting from PHYS_OFFSET
+* (which must be aligned to 2MB as per Documentation/arm64/booting.txt).
 */
-   if (IS_ENABLED(CONFIG_ARM64_64K_PAGES))
-   limit = PHYS_OFFSET + PMD_SIZE;
-   else
-   limit = PHYS_OFFSET + PUD_SIZE;
+   limit = PHYS_OFFSET + SWAPPER_INIT_MAP_SIZE;
memblock_set_current_limit(limit);
 
/* map all the memory banks */
@@ -371,21 +369,24 @@ static void __init map_mem(void)
if (start >= end)
break;
 
-#ifndef CONFIG_ARM64_64K_PAGES
-   /*
-* For the first memory bank align the start address and
-* current memblock limit to prevent create_mapping() from
-* allocating pte page tables from unmapped memory.
-* When 64K pages are enabled, the pte page table for the
-* first PGDIR_SIZE is already present in swapper_pg_dir.
-*/
-   if (start < limit)
-   

[PATCH 01/15] arm64: Move swapper pagetable definitions

2015-09-15 Thread Suzuki K. Poulose
From: "Suzuki K. Poulose" 

Move the kernel pagetable (both swapper and idmap) definitions
from the generic asm/page.h to a new file, asm/kernel-pgtable.h.

This is mostly a cosmetic change, to clean up the asm/page.h to
get rid of the arch specific details which are not needed by the
generic code.

Also renames the symbols to prevent conflicts. e.g,
BLOCK_SHIFT => SWAPPER_BLOCK_SHIFT

Cc: Ard Biesheuvel 
Cc: Mark Rutland 
Cc: Catalin Marinas 
Cc: Will Deacon 
Signed-off-by: Suzuki K. Poulose 
Reviewed-by: Ard Biesheuvel 
Tested-by: Ard Biesheuvel 
---
 arch/arm64/include/asm/kernel-pgtable.h |   65 +++
 arch/arm64/include/asm/page.h   |   18 -
 arch/arm64/kernel/head.S|   37 --
 arch/arm64/kernel/vmlinux.lds.S |1 +
 4 files changed, 74 insertions(+), 47 deletions(-)
 create mode 100644 arch/arm64/include/asm/kernel-pgtable.h

diff --git a/arch/arm64/include/asm/kernel-pgtable.h b/arch/arm64/include/asm/kernel-pgtable.h
new file mode 100644
index 000..622929d
--- /dev/null
+++ b/arch/arm64/include/asm/kernel-pgtable.h
@@ -0,0 +1,65 @@
+/*
+ * asm/kernel-pgtable.h : Kernel page table mapping
+ *
+ * Copyright (C) 2015 ARM Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see .
+ */
+
+#ifndef __ASM_KERNEL_PGTABLE_H
+#define __ASM_KERNEL_PGTABLE_H
+
+/*
+ * The idmap and swapper page tables need some space reserved in the kernel
+ * image. Both require pgd, pud (4 levels only) and pmd tables to (section)
+ * map the kernel. With the 64K page configuration, swapper and idmap need to
+ * map to pte level. The swapper also maps the FDT (see __create_page_tables
+ * for more information). Note that the number of ID map translation levels
+ * could be increased on the fly if system RAM is out of reach for the default
+ * VA range, so 3 pages are reserved in all cases.
+ */
+#ifdef CONFIG_ARM64_64K_PAGES
+#define SWAPPER_PGTABLE_LEVELS (CONFIG_PGTABLE_LEVELS)
+#else
+#define SWAPPER_PGTABLE_LEVELS (CONFIG_PGTABLE_LEVELS - 1)
+#endif
+
+#define SWAPPER_DIR_SIZE   (SWAPPER_PGTABLE_LEVELS * PAGE_SIZE)
+#define IDMAP_DIR_SIZE (3 * PAGE_SIZE)
+
+/* Initial memory map size */
+#ifdef CONFIG_ARM64_64K_PAGES
+#define SWAPPER_BLOCK_SHIFTPAGE_SHIFT
+#define SWAPPER_BLOCK_SIZE PAGE_SIZE
+#define SWAPPER_TABLE_SHIFTPMD_SHIFT
+#else
+#define SWAPPER_BLOCK_SHIFTSECTION_SHIFT
+#define SWAPPER_BLOCK_SIZE SECTION_SIZE
+#define SWAPPER_TABLE_SHIFTPUD_SHIFT
+#endif
+
+
+/*
+ * Initial memory map attributes.
+ */
+#define SWAPPER_PTE_FLAGS  PTE_TYPE_PAGE | PTE_AF | PTE_SHARED
+#define SWAPPER_PMD_FLAGS  PMD_TYPE_SECT | PMD_SECT_AF | PMD_SECT_S
+
+#ifdef CONFIG_ARM64_64K_PAGES
+#define SWAPPER_MM_MMUFLAGSPTE_ATTRINDX(MT_NORMAL) | SWAPPER_PTE_FLAGS
+#else
+#define SWAPPER_MM_MMUFLAGSPMD_ATTRINDX(MT_NORMAL) | SWAPPER_PMD_FLAGS
+#endif
+
+
+#endif
diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index 7d9c7e4..3c9ce8c 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -28,24 +28,6 @@
 #define PAGE_SIZE  (_AC(1,UL) << PAGE_SHIFT)
 #define PAGE_MASK  (~(PAGE_SIZE-1))
 
-/*
- * The idmap and swapper page tables need some space reserved in the kernel
- * image. Both require pgd, pud (4 levels only) and pmd tables to (section)
- * map the kernel. With the 64K page configuration, swapper and idmap need to
- * map to pte level. The swapper also maps the FDT (see __create_page_tables
- * for more information). Note that the number of ID map translation levels
- * could be increased on the fly if system RAM is out of reach for the default
- * VA range, so 3 pages are reserved in all cases.
- */
-#ifdef CONFIG_ARM64_64K_PAGES
-#define SWAPPER_PGTABLE_LEVELS (CONFIG_PGTABLE_LEVELS)
-#else
-#define SWAPPER_PGTABLE_LEVELS (CONFIG_PGTABLE_LEVELS - 1)
-#endif
-
-#define SWAPPER_DIR_SIZE   (SWAPPER_PGTABLE_LEVELS * PAGE_SIZE)
-#define IDMAP_DIR_SIZE (3 * PAGE_SIZE)
-
 #ifndef __ASSEMBLY__
 
 #include 
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index a055be6..46670bf 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -29,6 +29,7 @@
 #include 

[PATCH 03/15] arm64: Introduce helpers for page table levels

2015-09-15 Thread Suzuki K. Poulose
From: "Suzuki K. Poulose" 

Introduce helpers for finding the number of page table
levels required for a given VA width, shift for a particular
page table level.

Convert the existing users to the new helpers. More users
to follow.

Cc: Ard Biesheuvel 
Cc: Mark Rutland 
Cc: Catalin Marinas 
Cc: Will Deacon 
Signed-off-by: Suzuki K. Poulose 
Reviewed-by: Ard Biesheuvel 
Tested-by: Ard Biesheuvel 
---
 arch/arm64/include/asm/pgtable-hwdef.h |   15 ---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable-hwdef.h b/arch/arm64/include/asm/pgtable-hwdef.h
index 24154b0..ce18389 100644
--- a/arch/arm64/include/asm/pgtable-hwdef.h
+++ b/arch/arm64/include/asm/pgtable-hwdef.h
@@ -16,13 +16,21 @@
 #ifndef __ASM_PGTABLE_HWDEF_H
 #define __ASM_PGTABLE_HWDEF_H
 
+/*
+ * Number of page-table levels required to address 'va_bits' wide
+ * address, without section mapping
+ */
+#define ARM64_HW_PGTABLE_LEVELS(va_bits) (((va_bits) - 4) / (PAGE_SHIFT - 3))
+#define ARM64_HW_PGTABLE_LEVEL_SHIFT(level) \
+   ((PAGE_SHIFT - 3) * (level) + 3)
+
 #define PTRS_PER_PTE   (1 << (PAGE_SHIFT - 3))
 
 /*
  * PMD_SHIFT determines the size a level 2 page table entry can map.
  */
 #if CONFIG_PGTABLE_LEVELS > 2
-#define PMD_SHIFT  ((PAGE_SHIFT - 3) * 2 + 3)
+#define PMD_SHIFT  ARM64_HW_PGTABLE_LEVEL_SHIFT(2)
 #define PMD_SIZE   (_AC(1, UL) << PMD_SHIFT)
 #define PMD_MASK   (~(PMD_SIZE-1))
 #define PTRS_PER_PMD   PTRS_PER_PTE
@@ -32,7 +40,7 @@
  * PUD_SHIFT determines the size a level 1 page table entry can map.
  */
 #if CONFIG_PGTABLE_LEVELS > 3
-#define PUD_SHIFT  ((PAGE_SHIFT - 3) * 3 + 3)
+#define PUD_SHIFT  ARM64_HW_PGTABLE_LEVEL_SHIFT(3)
 #define PUD_SIZE   (_AC(1, UL) << PUD_SHIFT)
 #define PUD_MASK   (~(PUD_SIZE-1))
 #define PTRS_PER_PUD   PTRS_PER_PTE
@@ -42,7 +50,8 @@
  * PGDIR_SHIFT determines the size a top-level page table entry can map
  * (depending on the configuration, this level can be 0, 1 or 2).
  */
-#define PGDIR_SHIFT((PAGE_SHIFT - 3) * CONFIG_PGTABLE_LEVELS + 3)
+#define PGDIR_SHIFT\
+   ARM64_HW_PGTABLE_LEVEL_SHIFT(CONFIG_PGTABLE_LEVELS)
 #define PGDIR_SIZE (_AC(1, UL) << PGDIR_SHIFT)
 #define PGDIR_MASK (~(PGDIR_SIZE-1))
 #define PTRS_PER_PGD   (1 << (VA_BITS - PGDIR_SHIFT))
-- 
1.7.9.5



[PATCH 09/15] arm64: Add page size to the kernel image header

2015-09-15 Thread Suzuki K. Poulose
From: Ard Biesheuvel 

This patch adds the page size to the arm64 kernel image header
so that one can infer the PAGESIZE used by the kernel. This will
be helpful to diagnose failures to boot the kernel with page size
not supported by the CPU.

Signed-off-by: Ard Biesheuvel 
---
 Documentation/arm64/booting.txt |7 ++-
 arch/arm64/kernel/image.h   |5 -
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/Documentation/arm64/booting.txt b/Documentation/arm64/booting.txt
index 7d9d3c2..aaf6d77 100644
--- a/Documentation/arm64/booting.txt
+++ b/Documentation/arm64/booting.txt
@@ -104,7 +104,12 @@ Header notes:
 - The flags field (introduced in v3.17) is a little-endian 64-bit field
   composed as follows:
   Bit 0:   Kernel endianness.  1 if BE, 0 if LE.
-  Bits 1-63:   Reserved.
+  Bit 1-2: Kernel Page size.
+   0 - Unspecified.
+   1 - 4K
+   2 - 16K
+   3 - 64K
+  Bits 3-63:   Reserved.
 
 - When image_size is zero, a bootloader should attempt to keep as much
   memory as possible free for use by the kernel immediately after the
diff --git a/arch/arm64/kernel/image.h b/arch/arm64/kernel/image.h
index 8fae075..73b736c 100644
--- a/arch/arm64/kernel/image.h
+++ b/arch/arm64/kernel/image.h
@@ -47,7 +47,10 @@
 #define __HEAD_FLAG_BE 0
 #endif
 
-#define __HEAD_FLAGS   (__HEAD_FLAG_BE << 0)
+#define __HEAD_FLAG_PAGE_SIZE ((PAGE_SHIFT - 10) / 2)
+
+#define __HEAD_FLAGS   (__HEAD_FLAG_BE << 0) | \
+   (__HEAD_FLAG_PAGE_SIZE << 1)
 
 /*
  * These will output as part of the Image header, which should be little-endian
-- 
1.7.9.5



[PATCH 07/15] arm64: Kconfig: Fix help text about AArch32 support with 64K pages

2015-09-15 Thread Suzuki K. Poulose
From: "Suzuki K. Poulose" 

Update the help text for ARM64_64K_PAGES to reflect the reality
about AArch32 support.

Cc: Mark Rutland 
Cc: Catalin Marinas 
Cc: Will Deacon 
Signed-off-by: Suzuki K. Poulose 
Reviewed-by: Ard Biesheuvel 
Tested-by: Ard Biesheuvel 
---
 arch/arm64/Kconfig |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index ab0a36f..83bca48 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -350,8 +350,8 @@ config ARM64_64K_PAGES
help
  This feature enables 64KB pages support (4KB by default)
  allowing only two levels of page tables and faster TLB
- look-up. AArch32 emulation is not available when this feature
- is enabled.
+ look-up. AArch32 emulation requires applications compiled
+ with 64K aligned segments.
 
 endchoice
 
-- 
1.7.9.5



[PATCH 11/15] arm64: Cleanup VTCR_EL2 computation

2015-09-15 Thread Suzuki K. Poulose
From: "Suzuki K. Poulose" 

No functional changes. Group the common bits for VTCR_EL2
initialisation for better readability. The granule size
and the entry level are controlled by the page size.

Cc: Christoffer Dall 
Cc: Marc Zyngier 
Cc: kvm...@lists.cs.columbia.edu
Signed-off-by: Suzuki K. Poulose 
---
 arch/arm64/include/asm/kvm_arm.h |   13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index bdf139e..699554d 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -138,6 +138,9 @@
  * The magic numbers used for VTTBR_X in this patch can be found in Tables
  * D4-23 and D4-25 in ARM DDI 0487A.b.
  */
+#define VTCR_EL2_COMMON_BITS   (VTCR_EL2_SH0_INNER | VTCR_EL2_ORGN0_WBWA | \
+VTCR_EL2_IRGN0_WBWA | VTCR_EL2_T0SZ_40B)
+
 #ifdef CONFIG_ARM64_64K_PAGES
 /*
  * Stage2 translation configuration:
@@ -145,9 +148,8 @@
  * 64kB pages (TG0 = 1)
  * 2 level page tables (SL = 1)
  */
-#define VTCR_EL2_FLAGS (VTCR_EL2_TG0_64K | VTCR_EL2_SH0_INNER | \
-VTCR_EL2_ORGN0_WBWA | VTCR_EL2_IRGN0_WBWA | \
-VTCR_EL2_SL0_LVL1 | VTCR_EL2_T0SZ_40B)
+#define VTCR_EL2_FLAGS (VTCR_EL2_TG0_64K | VTCR_EL2_SL0_LVL1 | \
+VTCR_EL2_COMMON_BITS)
 #define VTTBR_X(38 - VTCR_EL2_T0SZ_40B)
 #else
 /*
@@ -156,9 +158,8 @@
  * 4kB pages (TG0 = 0)
  * 3 level page tables (SL = 1)
  */
-#define VTCR_EL2_FLAGS (VTCR_EL2_TG0_4K | VTCR_EL2_SH0_INNER | \
-VTCR_EL2_ORGN0_WBWA | VTCR_EL2_IRGN0_WBWA | \
-VTCR_EL2_SL0_LVL1 | VTCR_EL2_T0SZ_40B)
+#define VTCR_EL2_FLAGS (VTCR_EL2_TG0_4K | VTCR_EL2_SL0_LVL1 | \
+VTCR_EL2_COMMON_BITS)
 #define VTTBR_X(37 - VTCR_EL2_T0SZ_40B)
 #endif
 
-- 
1.7.9.5



Re: [PATCH v2 6/8] KVM: VMX: unify SECONDARY_VM_EXEC_CONTROL update

2015-09-15 Thread Paolo Bonzini


On 09/09/2015 08:05, Xiao Guangrong wrote:
> Unify the update in vmx_cpuid_update()
> 
> Signed-off-by: Xiao Guangrong 

What if we instead start fresh from vmx_secondary_exec_control, like this:

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 4504cbc1b287..98c4d3f1653a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -8660,11 +8660,26 @@ static int vmx_get_lpage_level(void)
return PT_PDPE_LEVEL;
 }
 
+static void vmcs_set_secondary_exec_control(u32 new_ctl)
+{
+   /*
+* These bits in the secondary execution controls field
+* are dynamic, the others are mostly based on the hypervisor
+* architecture and the guest's CPUID.  Do not touch the
+* dynamic bits.
+*/
+   u32 mask = SECONDARY_EXEC_ENABLE_PML | SECONDARY_EXEC_SHADOW_VMCS;
+   u32 cur_ctl = vmcs_read32(SECONDARY_EXEC_CONTROL);
+
+   vmcs_write32(SECONDARY_EXEC_CONTROL,
+(new_ctl & ~mask) | (cur_ctl & mask));
+}
+
 static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
 {
struct kvm_cpuid_entry2 *best;
struct vcpu_vmx *vmx = to_vmx(vcpu);
-   u32 clear_exe_ctrl = 0;
+   u32 secondary_exec_ctl = vmx_secondary_exec_control(vmx);
 
vmx->rdtscp_enabled = false;
if (vmx_rdtscp_supported()) {
@@ -8672,7 +8681,7 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
if (best && (best->edx & bit(X86_FEATURE_RDTSCP)))
vmx->rdtscp_enabled = true;
else
-   clear_exe_ctrl |= SECONDARY_EXEC_RDTSCP;
+   secondary_exec_ctl &= ~SECONDARY_EXEC_RDTSCP;
 
if (nested) {
if (vmx->rdtscp_enabled)
@@ -8689,18 +8698,13 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
if (vmx_invpcid_supported() &&
(!best || !(best->ebx & bit(X86_FEATURE_INVPCID)) ||
!guest_cpuid_has_pcid(vcpu))) {
-   clear_exe_ctrl |= SECONDARY_EXEC_ENABLE_INVPCID;
+   secondary_exec_ctl &= ~SECONDARY_EXEC_ENABLE_INVPCID;
 
if (best)
best->ebx &= ~bit(X86_FEATURE_INVPCID);
}
 
-   if (clear_exe_ctrl) {
-   u32 exec_ctl = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
-
-   exec_ctl &= ~clear_exe_ctrl;
-   vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_ctl);
-   }
+   vmcs_set_secondary_exec_control(secondary_exec_ctl);
 
if (static_cpu_has(X86_FEATURE_PCOMMIT) && nested) {
if (guest_cpuid_has_pcommit(vcpu))

> ---
>  arch/x86/kvm/vmx.c | 21 +++--
>  1 file changed, 11 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 97e3340..5a074d0 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -8673,19 +8673,15 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
>  {
>   struct kvm_cpuid_entry2 *best;
>   struct vcpu_vmx *vmx = to_vmx(vcpu);
> - u32 exec_control;
> + u32 clear_exe_ctrl = 0;
>  
>   vmx->rdtscp_enabled = false;
>   if (vmx_rdtscp_supported()) {
> - exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
>   best = kvm_find_cpuid_entry(vcpu, 0x8001, 0);
>   if (best && (best->edx & bit(X86_FEATURE_RDTSCP)))
>   vmx->rdtscp_enabled = true;
> - else {
> - exec_control &= ~SECONDARY_EXEC_RDTSCP;
> - vmcs_write32(SECONDARY_VM_EXEC_CONTROL,
> - exec_control);
> - }
> + else
> + clear_exe_ctrl |= SECONDARY_EXEC_RDTSCP;
>  
>   if (nested && !vmx->rdtscp_enabled)
>   vmx->nested.nested_vmx_secondary_ctls_high &=
> @@ -8697,14 +8693,19 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
>   if (vmx_invpcid_supported() &&
>   (!best || !(best->ebx & bit(X86_FEATURE_INVPCID)) ||
>   !guest_cpuid_has_pcid(vcpu))) {
> - exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
> - exec_control &= ~SECONDARY_EXEC_ENABLE_INVPCID;
> - vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
> + clear_exe_ctrl |= SECONDARY_EXEC_ENABLE_INVPCID;
>  
>   if (best)
>   best->ebx &= ~bit(X86_FEATURE_INVPCID);
>   }
>  
> + if (clear_exe_ctrl) {
> + u32 exec_ctl = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
> +
> + exec_ctl &= ~clear_exe_ctrl;
> + vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_ctl);
> + }
> +
>   if (!guest_cpuid_has_pcommit(vcpu) && nested)
>   vmx->nested.nested_vmx_secondary_ctls_high &=
>   ~SECONDARY_EXEC_PCOMMIT;
> 

[PATCH 08/15] arm64: Check for selected granule support

2015-09-15 Thread Suzuki K. Poulose
From: "Suzuki K. Poulose" 

Ensure that the selected page size is supported by the
CPU(s).

Cc: Mark Rutland 
Cc: Catalin Marinas 
Cc: Will Deacon 
Signed-off-by: Suzuki K. Poulose 
Reviewed-by: Ard Biesheuvel 
Tested-by: Ard Biesheuvel 
---
 arch/arm64/include/asm/sysreg.h |6 ++
 arch/arm64/kernel/head.S|   24 +++-
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index a7f3d4b..e01d323 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -87,4 +87,10 @@ static inline void config_sctlr_el1(u32 clear, u32 set)
 }
 #endif
 
+#define ID_AA64MMFR0_TGran4_SHIFT  28
+#define ID_AA64MMFR0_TGran64_SHIFT 24
+
+#define ID_AA64MMFR0_TGran4_ENABLED0x0
+#define ID_AA64MMFR0_TGran64_ENABLED   0x0
+
 #endif /* __ASM_SYSREG_H */
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 01b8e58..0cb04db 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -31,10 +31,11 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 
 #define __PHYS_OFFSET  (KERNEL_START - TEXT_OFFSET)
@@ -606,9 +607,25 @@ ENDPROC(__secondary_switched)
  *  x27 = *virtual* address to jump to upon completion
  *
  * other registers depend on the function called upon completion
+ * Checks if the selected granule size is supported by the CPU.
  */
+#if defined(CONFIG_ARM64_64K_PAGES)
+
+#define ID_AA64MMFR0_TGran_SHIFT   ID_AA64MMFR0_TGran64_SHIFT
+#define ID_AA64MMFR0_TGran_ENABLED ID_AA64MMFR0_TGran64_ENABLED
+
+#else
+
+#define ID_AA64MMFR0_TGran_SHIFT   ID_AA64MMFR0_TGran4_SHIFT
+#define ID_AA64MMFR0_TGran_ENABLED ID_AA64MMFR0_TGran4_ENABLED
+
+#endif
.section ".idmap.text", "ax"
 __enable_mmu:
+   mrs x1, ID_AA64MMFR0_EL1
+   ubfxx2, x1, #ID_AA64MMFR0_TGran_SHIFT, 4
+   cmp x2, #ID_AA64MMFR0_TGran_ENABLED
+   b.ne__no_granule_support
ldr x5, =vectors
msr vbar_el1, x5
msr ttbr0_el1, x25  // load TTBR0
@@ -626,3 +643,8 @@ __enable_mmu:
isb
br  x27
 ENDPROC(__enable_mmu)
+
+__no_granule_support:
+   wfe
+   b __no_granule_support
+ENDPROC(__no_granule_support)
-- 
1.7.9.5



Re: [PATCH v2 08/18] nvdimm: init backend memory mapping and config data area

2015-09-15 Thread Paolo Bonzini


On 26/08/2015 12:40, Xiao Guangrong wrote:
>>>
>>> +
>>> +size = get_file_size(fd);
>>> +buf = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>>
>> I guess the user will want to choose between MAP_SHARED and MAP_PRIVATE.
>> This can be added in the future.
> 
> Good idea, it will allow guest to write data but discards its content
> after it exits. Will implement O_RDONLY + MAP_PRIVATE in the near future.

FWIW, if Igor's backend/frontend idea is implemented, the choice between
MAP_SHARED and MAP_PRIVATE should belong in the backend.

Paolo


Re: [PATCH] KVM: nVMX: nested VPID emulation

2015-09-15 Thread Jan Kiszka
On 2015-09-15 12:14, Wanpeng Li wrote:
> On 9/14/15 10:54 PM, Jan Kiszka wrote:
>> Last but not least: the guest can now easily exhaust the host's pool of
>> vpid by simply spawning plenty of VCPUs for L2, no? Is this acceptable
>> or should there be some limit?
> 
> I reuse the value of vpid02 while vpid12 changed w/ one invvpid in v2,
> and the scenario which you pointed out can be avoid.

I cannot yet follow why there is no chance for L1 to consume all vpids
that the host manages in that single, global bitmap by simply spawning a
lot of nested VCPUs for some L2. What is enforcing L1 to call nested
vmclear - apparently the only way, besides destructing nested VCPUs, to
release such vpids again?

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

