Re: [PATCH] kvm: fix to update memslots properly

2015-03-10 Thread Chen, Tiejun

On 2015/3/10 4:54, Marcelo Tosatti wrote:

On Sat, Dec 27, 2014 at 09:41:45PM +0100, Paolo Bonzini wrote:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f528343..6e52f3f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -672,6 +672,7 @@ static void update_memslots(struct kvm_memslots *slots,
WARN_ON(mslots[i].id != id);
if (!new-npages) {
new-base_gfn = 0;
+   new-flags = 0;
if (mslots[i].npages)
slots-used_slots--;
} else {


This should not be necessary.  The part of the mslots array that has
base_gfn == npages == 0 is entirely unused, and such a slot can never
be returned by search_memslots because this:

 if (gfn = memslots[slot].base_gfn 
 gfn  memslots[slot].base_gfn + memslots[slot].npages)

can never be true.


@@ -688,7 +689,9 @@ static void update_memslots(struct kvm_memslots *slots,
i++;
}
while (i  0 
-  new-base_gfn  mslots[i - 1].base_gfn) {
+  ((new-base_gfn  mslots[i - 1].base_gfn) ||
+   (!new-base_gfn 
+!mslots[i - 1].base_gfn  !mslots[i - 1].npages))) {
mslots[i] = mslots[i - 1];
slots-id_to_index[mslots[i].id] = i;
i--;



You should have explained _why_ this fixes the bug, and what invariant
is not being respected, something like this:

 kvm: fix sorting of memslots with base_gfn == 0

 Before commit 0e60b0799fed (kvm: change memslot sorting rule from size
 to GFN, 2014-12-01), the memslots' sorting key was npages, meaning
 that a valid memslot couldn't have its sorting key equal to zero.
 On the other hand, a valid memslot can have base_gfn == 0, and invalid
 memslots are identified by base_gfn == npages == 0.

 Because of this, commit 0e60b0799fed broke the invariant that invalid
 memslots are at the end of the mslots array.  When a memslot with
 base_gfn == 0 was created, any invalid memslot before it were left
 in place.

This suggests another fix.  We can change the insertion to use a =
comparison, as in your first patch.  Alone it is not correct, but we
only need to take some care and avoid breaking the case of deleting a
memslot.

It's enough to wrap the second loop (that you patched) with
if (new-npages).  In the new-npages == 0 case the first loop has
already set i to the right value, and moving i back would be wrong:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f5283438ee05..050974c051b5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -687,11 +687,23 @@ static void update_memslots(struct kvm_memslots *slots,
slots-id_to_index[mslots[i].id] = i;
i++;
}
-   while (i  0 
-  new-base_gfn  mslots[i - 1].base_gfn) {
-   mslots[i] = mslots[i - 1];
-   slots-id_to_index[mslots[i].id] = i;
-   i--;
+
+   /*
+* The = is needed when creating a slot with base_gfn == 0,
+* so that it moves before all those with base_gfn == npages == 0.
+*
+* On the other hand, if new-npages is zero, the above loop has
+* already left i pointing to the beginning of the empty part of
+* mslots, and the = would move the hole backwards in this
+* case---which is wrong.  So skip the loop when deleting a slot.
+*/
+   if (new-npages) {
+   while (i  0 
+  new-base_gfn = mslots[i - 1].base_gfn) {
+   mslots[i] = mslots[i - 1];
+   slots-id_to_index[mslots[i].id] = i;
+   i--;
+   }
}

mslots[i] = *new;

Paolo


Paolo,

Can you include a proper changelog for this patch?



But this is already applied long time ago...

Tiejun
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: fix to update memslots properly

2015-03-10 Thread Paolo Bonzini

 This suggests another fix.  We can change the insertion to use a =
 comparison, as in your first patch.  Alone it is not correct, but we
 only need to take some care and avoid breaking the case of deleting a
 memslot.

 It's enough to wrap the second loop (that you patched) with
 if (new-npages).  In the new-npages == 0 case the first loop has
 already set i to the right value, and moving i back would be wrong:

 diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
 index f5283438ee05..050974c051b5 100644
 --- a/virt/kvm/kvm_main.c
 +++ b/virt/kvm/kvm_main.c
 @@ -687,11 +687,23 @@ static void update_memslots(struct kvm_memslots
 *slots,
   slots-id_to_index[mslots[i].id] = i;
   i++;
   }
 -while (i  0 
 -   new-base_gfn  mslots[i - 1].base_gfn) {
 -mslots[i] = mslots[i - 1];
 -slots-id_to_index[mslots[i].id] = i;
 -i--;
 +
 +/*
 + * The = is needed when creating a slot with base_gfn == 0,
 + * so that it moves before all those with base_gfn == npages == 0.
 + *
 + * On the other hand, if new-npages is zero, the above loop has
 + * already left i pointing to the beginning of the empty part of
 + * mslots, and the = would move the hole backwards in this
 + * case---which is wrong.  So skip the loop when deleting a slot.
 + */
 +if (new-npages) {
 +while (i  0 
 +   new-base_gfn = mslots[i - 1].base_gfn) {
 +mslots[i] = mslots[i - 1];
 +slots-id_to_index[mslots[i].id] = i;
 +i--;
 +}
   }

   mslots[i] = *new;

 Paolo

 Paolo,

 Can you include a proper changelog for this patch?

 
 But this is already applied long time ago...

Yes, this is commit efbeec7098eee2b3d2359d0cc24bbba0436e7f21.

Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2 CPU Conformance Issue in KVM/x86

2015-03-10 Thread Paolo Bonzini


On 09/03/2015 20:49, Avi Kivity wrote:

 Yes, and it checked that MAXPHYADDR != 52 before.  If you want to set
 only one bit, making that bit 51 makes sense anyway for simplicity, so
 it is still 99.9% academic.  Once processors appear with MAXPHYADDR =
 52, the remaining 0.1% will become more relevant.

 The current limit is IIRC 46 or 48 (on Haswell Xeons).
 
 It will be interesting to have processors with 52 bits of physical
 address and 48 bits of virtual address. HIGHMEM for x86_64?  Or 5-level
 page tables?

I wonder why Intel chose exactly 52...  HIGHMEM seems more likely than
5-level page tables.  Certainly it wouldn't need hacks like Ingo's 4G-4G.

 50 bits == 1 PiB.  That's quite an amount of RAM.

Not that 64 TiB is not quite an amount of RAM. :)

Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2 CPU Conformance Issue in KVM/x86

2015-03-10 Thread Paolo Bonzini


On 09/03/2015 20:50, Avi Kivity wrote:
 Is the issue emulating a higher MAXPHYADDR on the guest than is
 available on the host? I don't think there's any need to support that.
 No, indeed.  The only problem is that the failure mode is quite horrible
 (you get a triple fault, possibly while the guest is running).
 
 Can't qemu simply check for it?

Yes.  But right now it doesn't even try to do something sensible with
MAXPHYADDR. :/

Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] x86: svm: make wbinvd faster

2015-03-10 Thread Radim Krčmář
2015-03-09 20:28-0300, Marcelo Tosatti:
 On Fri, Feb 27, 2015 at 06:19:18PM -0600, Joel Schopp wrote:
  From: David Kaplan david.kap...@amd.com
  No need to re-decode WBINVD since we know what it is from the intercept.
  
  Signed-off-by: David Kaplan david.kap...@amd.com
  [extracted from larger unlrelated patch, forward ported, tested]
  Signed-off-by: Joel Schopp joel.sch...@amd.com
 
 Can't you disable the intercept if need_emulate_wbinvd(vcpu) == false? 

I don't think we want to:  it should be faster to intercept and ignore
than to invalidate all caches.  The exit doesn't affect other physical
cores and costs just about 10(?) L3 cache misses.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-10 Thread Andrey Korolyov
On Sat, Mar 7, 2015 at 3:00 AM, Andrey Korolyov and...@xdel.ru wrote:
 On Fri, Mar 6, 2015 at 7:57 PM, Bandan Das b...@redhat.com wrote:
 Andrey Korolyov and...@xdel.ru writes:

 On Fri, Mar 6, 2015 at 1:14 AM, Andrey Korolyov and...@xdel.ru wrote:
 Hello,

 recently I`ve got a couple of shiny new Intel 2620v2s for future
 replacement of the E5-2620v1, but I experienced relatively many events
 with emulation errors, all traces looks simular to the one below. I am
 running qemu-2.1 on x86 on top of 3.10 branch for testing purposes but
 can switch to some other versions if necessary. Most of crashes
 happened during reboot cycle or at the end of ACPI-based shutdown
 action, if this can help. I have zero clues of what can introduce such
 a mess inside same processor family using identical software, as
 2620v1 has no simular problem ever. Please let me know if there can be
 some side measures for making entire story more clear.

 Thanks!

 KVM internal error. Suberror: 2
 extra data[0]: 80d1
 extra data[1]: 8b0d
 EAX=0003 EBX= ECX= EDX=
 ESI= EDI= EBP= ESP=6cd4
 EIP=d3f9 EFL=00010202 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
 ES =   9300
 CS =f000 000f  9b00
 SS =   9300
 DS =   9300
 FS =   9300
 GS =   9300
 LDT=   8200
 TR =   8b00
 GDT= 000f6e98 0037
 IDT=  03ff
 CR0=0010 CR2= CR3= CR4=
 DR0= DR1= DR2=
 DR3=
 DR6=0ff0 DR7=0400
 EFER=
 Code=48 18 67 8c 00 8c d1 8e d9 66 5a 66 58 66 5d 66 c3 cd 02 cb cd
 10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd 19 cb cd 1c cb fa fc 66
 b8 00 e0 00 00 8e


 It turns out that those errors are introduced by APICv, which gets
 enabled due to different feature set. If anyone is interested in
 reproducing/fixing this exactly on 3.10, it takes about one hundred of
 migrations/power state changes for an issue to appear, guest OS can be
 Linux or Win.

 Are you able to reproduce this on a more recent upstream kernel as well ?

 Bandan

 I`ll go through test cycle with 3.18 and 2603v2 around tomorrow and
 follow up with any reproduceable results.

Heh.. issue is not triggered on 2603v2 at all, at least I am not able
to hit this. The only difference with 2620v2 except lower frequency is
an Intel Dynamic Acceleration feature. I`d appreciate any testing with
higher CPU models with same or richer feature set. The testing itself
can be done on both generic 3.10 or RH7 kernels, as both of them are
experiencing this issue. I conducted all tests with disabled cstates
so I advise to do the same for a first reproduction step.

Thanks!

model name  : Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
stepping: 4
microcode   : 0x416
cpu MHz : 2100.039
cache size  : 15360 KB
siblings: 12
apicid  : 43
initial apicid  : 43
fpu : yes
fpu_exception   : yes
cpuid level : 13
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts
rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq
dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca
sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c
rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi
flexpriority ept vpid fsgsbase smep erms
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-10 Thread Dr. David Alan Gilbert
* Andrey Korolyov (and...@xdel.ru) wrote:
 On Sat, Mar 7, 2015 at 3:00 AM, Andrey Korolyov and...@xdel.ru wrote:
  On Fri, Mar 6, 2015 at 7:57 PM, Bandan Das b...@redhat.com wrote:
  Andrey Korolyov and...@xdel.ru writes:
 
  On Fri, Mar 6, 2015 at 1:14 AM, Andrey Korolyov and...@xdel.ru wrote:
  Hello,
 
  recently I`ve got a couple of shiny new Intel 2620v2s for future
  replacement of the E5-2620v1, but I experienced relatively many events
  with emulation errors, all traces looks simular to the one below. I am
  running qemu-2.1 on x86 on top of 3.10 branch for testing purposes but
  can switch to some other versions if necessary. Most of crashes
  happened during reboot cycle or at the end of ACPI-based shutdown
  action, if this can help. I have zero clues of what can introduce such
  a mess inside same processor family using identical software, as
  2620v1 has no simular problem ever. Please let me know if there can be
  some side measures for making entire story more clear.
 
  Thanks!
 
  KVM internal error. Suberror: 2
  extra data[0]: 80d1
  extra data[1]: 8b0d
  EAX=0003 EBX= ECX= EDX=
  ESI= EDI= EBP= ESP=6cd4
  EIP=d3f9 EFL=00010202 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
  ES =   9300
  CS =f000 000f  9b00
  SS =   9300
  DS =   9300
  FS =   9300
  GS =   9300
  LDT=   8200
  TR =   8b00
  GDT= 000f6e98 0037
  IDT=  03ff
  CR0=0010 CR2= CR3= CR4=
  DR0= DR1= DR2=
  DR3=
  DR6=0ff0 DR7=0400
  EFER=
  Code=48 18 67 8c 00 8c d1 8e d9 66 5a 66 58 66 5d 66 c3 cd 02 cb cd
  10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd 19 cb cd 1c cb fa fc 66
  b8 00 e0 00 00 8e
 
 
  It turns out that those errors are introduced by APICv, which gets
  enabled due to different feature set. If anyone is interested in
  reproducing/fixing this exactly on 3.10, it takes about one hundred of
  migrations/power state changes for an issue to appear, guest OS can be
  Linux or Win.
 
  Are you able to reproduce this on a more recent upstream kernel as well ?
 
  Bandan
 
  I`ll go through test cycle with 3.18 and 2603v2 around tomorrow and
  follow up with any reproduceable results.
 
 Heh.. issue is not triggered on 2603v2 at all, at least I am not able
 to hit this. The only difference with 2620v2 except lower frequency is
 an Intel Dynamic Acceleration feature. I`d appreciate any testing with
 higher CPU models with same or richer feature set. The testing itself
 can be done on both generic 3.10 or RH7 kernels, as both of them are
 experiencing this issue. I conducted all tests with disabled cstates
 so I advise to do the same for a first reproduction step.
 
 Thanks!
 
 model name  : Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
 stepping: 4
 microcode   : 0x416
 cpu MHz : 2100.039
 cache size  : 15360 KB
 siblings: 12
 apicid  : 43
 initial apicid  : 43
 fpu : yes
 fpu_exception   : yes
 cpuid level : 13
 wp  : yes
 flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
 mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
 syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts
 rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq
 dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca
 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c
 rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi
 flexpriority ept vpid fsgsbase smep erms

I'm seeing something similar; it's very intermittent and generally
happening right at boot of the guest;   I'm running this on qemu
head+my postcopy world (but it's happening right at boot before postcopy
gets a chance), and I'm using a 3.19ish kernel. Xeon E5-2407 in my case
but hey maybe I'm seeing a different bug.

Dave
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-10 Thread Andrey Korolyov
On Tue, Mar 10, 2015 at 7:57 PM, Dr. David Alan Gilbert
dgilb...@redhat.com wrote:
 * Andrey Korolyov (and...@xdel.ru) wrote:
 On Sat, Mar 7, 2015 at 3:00 AM, Andrey Korolyov and...@xdel.ru wrote:
  On Fri, Mar 6, 2015 at 7:57 PM, Bandan Das b...@redhat.com wrote:
  Andrey Korolyov and...@xdel.ru writes:
 
  On Fri, Mar 6, 2015 at 1:14 AM, Andrey Korolyov and...@xdel.ru wrote:
  Hello,
 
  recently I`ve got a couple of shiny new Intel 2620v2s for future
  replacement of the E5-2620v1, but I experienced relatively many events
  with emulation errors, all traces looks simular to the one below. I am
  running qemu-2.1 on x86 on top of 3.10 branch for testing purposes but
  can switch to some other versions if necessary. Most of crashes
  happened during reboot cycle or at the end of ACPI-based shutdown
  action, if this can help. I have zero clues of what can introduce such
  a mess inside same processor family using identical software, as
  2620v1 has no simular problem ever. Please let me know if there can be
  some side measures for making entire story more clear.
 
  Thanks!
 
  KVM internal error. Suberror: 2
  extra data[0]: 80d1
  extra data[1]: 8b0d
  EAX=0003 EBX= ECX= EDX=
  ESI= EDI= EBP= ESP=6cd4
  EIP=d3f9 EFL=00010202 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
  ES =   9300
  CS =f000 000f  9b00
  SS =   9300
  DS =   9300
  FS =   9300
  GS =   9300
  LDT=   8200
  TR =   8b00
  GDT= 000f6e98 0037
  IDT=  03ff
  CR0=0010 CR2= CR3= CR4=
  DR0= DR1= DR2=
  DR3=
  DR6=0ff0 DR7=0400
  EFER=
  Code=48 18 67 8c 00 8c d1 8e d9 66 5a 66 58 66 5d 66 c3 cd 02 cb cd
  10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd 19 cb cd 1c cb fa fc 66
  b8 00 e0 00 00 8e
 
 
  It turns out that those errors are introduced by APICv, which gets
  enabled due to different feature set. If anyone is interested in
  reproducing/fixing this exactly on 3.10, it takes about one hundred of
  migrations/power state changes for an issue to appear, guest OS can be
  Linux or Win.
 
  Are you able to reproduce this on a more recent upstream kernel as well ?
 
  Bandan
 
  I`ll go through test cycle with 3.18 and 2603v2 around tomorrow and
  follow up with any reproduceable results.

 Heh.. issue is not triggered on 2603v2 at all, at least I am not able
 to hit this. The only difference with 2620v2 except lower frequency is
 an Intel Dynamic Acceleration feature. I`d appreciate any testing with
 higher CPU models with same or richer feature set. The testing itself
 can be done on both generic 3.10 or RH7 kernels, as both of them are
 experiencing this issue. I conducted all tests with disabled cstates
 so I advise to do the same for a first reproduction step.

 Thanks!

 model name  : Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
 stepping: 4
 microcode   : 0x416
 cpu MHz : 2100.039
 cache size  : 15360 KB
 siblings: 12
 apicid  : 43
 initial apicid  : 43
 fpu : yes
 fpu_exception   : yes
 cpuid level : 13
 wp  : yes
 flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
 mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
 syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts
 rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq
 dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca
 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c
 rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi
 flexpriority ept vpid fsgsbase smep erms

 I'm seeing something similar; it's very intermittent and generally
 happening right at boot of the guest;   I'm running this on qemu
 head+my postcopy world (but it's happening right at boot before postcopy
 gets a chance), and I'm using a 3.19ish kernel. Xeon E5-2407 in my case
 but hey maybe I'm seeing a different bug.

 Dave

Yep, looks like we are hitting same bug - two thirds of mine failure
events shot during boot/reboot cycle and approx. one third of events
happened in the middle of runtime. What CPU, v0 or v2 are you using
(in other words, is APICv enabled)?
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-10 Thread Bandan Das
Dr. David Alan Gilbert dgilb...@redhat.com writes:

 * Paolo Bonzini (pbonz...@redhat.com) wrote:
 
 
 On 10/03/2015 19:21, Bandan Das wrote:
  Paolo Bonzini pbonz...@redhat.com writes:
  
  On 10/03/2015 17:57, Dr. David Alan Gilbert wrote:
  I'm seeing something similar; it's very intermittent and generally
  happening right at boot of the guest;   I'm running this on qemu
  head+my postcopy world (but it's happening right at boot before postcopy
  gets a chance), and I'm using a 3.19ish kernel. Xeon E5-2407 in my case
  but hey maybe I'm seeing a different bug.
  
  Probably a tangent but is the qemu trace identical to what Andrey is 
  seeing ?
  From a cursory look and my limited understanding, it seems his failure is 
  #GP
  when executing video bios.
  
  Same here on 3.16 + Xeon E5 v3 kernel.
  
  I will try to reproduce this on a v2.
 
 I see several failures, usually mine have suberror 1.  With a 32-VCPU
 guest I can reproduce it roughly half of the time.
 
 Paolo

 while true; do (sleep 5; echo -e 
 '\001cq\n')|/opt/qemu-try-world3/bin/qemu-system-x86_64 -machine 
 pc-i440fx-2.0,accel=kvm -m 1024 -smp 128 -nographic -device sga 21 | tee 
 /tmp/qemu.op; grep internal error /tmp/qemu.op -q  break; done

 (and leave about 2mins of runs before declaring good)

 bad: cd2946607b42636d6c8cf6dbf94bce0273507b17
 bad: 041ccc922ee474693a2869d4e3b59e920c739bc0
 bad: 2559db069628981bfdc90637fac5bf1b4f4e8ef5
 bad: 21f5826a04d38e19488f917e1eef22751490c769
 good:e95d24ff40c77fbfd71396834a2eb99375f8bcc4
 good: 7781a492fa5a2eff53d06b25b93f0186ad3226c9
 good: c3edd62851098e6417786193ed9e9341781fcf57
 good: c5c6d7f81a6950d8e32a3b5a0bafd37bfa5a8e88
 good: 73104fd399c6778112f64fe0d439319f24508d9a
 good: 92013cf8ca10adafec9a92deb5df993e7df22cb9
 good: 4478aa768ccefcc5b234c23d035435fd71b932f6
 good: 2.2.0

 [root@virtlab413 qemu-world3]# git bisect bad
 21f5826a04d38e19488f917e1eef22751490c769 is the first bad commit

I can reproduce this on E5-2620 v2 with  David's while true test.
(The emulation failure I mean, not the suberror 2 that Andrey is seeing)
The commit that seems to have introduced this is -

commit 0673b7870063a3affbad9046fb6d385a4e734c19
Author: Kevin O'Connor ke...@koconnor.net
Date:   Sat May 24 10:49:50 2014 -0400

smp: Replace QEMU SMP init assembler code with C; run only in 32bit mode.

Change the multi-processor init code to trampoline into 32bit mode on
each of the additional processors.  Implement an atomic lock so that
each processor performs its initialization serially.

I am not sure what in that change could cause this though..
Also, in my testing, unrestricted_guest=0 avoids the failure.

 commit 21f5826a04d38e19488f917e1eef22751490c769
 Author: Gerd Hoffmann kra...@redhat.com
 Date:   Thu Feb 19 09:33:03 2015 +0100

 seabios: update to 1.8.0 release
 
 'git shortlog 8936dbb2..4c59f5d8' for seabios repo:
 
 David Woodhouse (4):
   Update EFI_COMPATIBILITY16_TABLE to match 0.98 spec update
   build: use -m16 where available instead of asm(.code16gcc)
   romlayout: Use .code16 not .code16gcc
   vgabios: Use .code16 not .code16gcc
 
 Gerd Hoffmann (2):
   add scripts/tarball.sh
   build: set LC_ALL=C
 
 Hannes Reinecke (1):
   megasas: read addional PCI I/O bar
 
 Ian Campbell (1):
   romlayout: Use rep ; nop not rep nop.
 
 Kevin O'Connor (139):
   vgabios: Return from handle_1011() if handler found.
   edd: Move EDD get drive parameters (int 1348) logic from disk.c to 
 block.c.
   edd: Use sectors==-1 to detect removable media.
   edd: Separate out ATA and virtio specific parts of fill_edd().
   cdemu: store internal cdemu fields in standard el-torito spec 
 format.
   Move cdemu call interface and disk_ret helper code to disk.c.
   smm: Replace SMI assembler code with C code.
   smm: Use a C struct to define the layout of the SMM area.
   smp: Replace QEMU SMP init assembler code with C; run only in 32bit 
 mode.
   Don't enable thread preemption during S3 resume vga option rom 
 execution.
   Remove old Bochs bios fixed address string at 0xfff00.
   Move most of the VAR16FIXED() defs to misc.c.
   build: Avoid absolute paths during whole-program compiling.
   Make sure handle_smi() and handle_smp() are compiled out if not 
 enabled.
   Remove the TODO file.
   Abstract reset call (and possible 16bit mode switch) into reset() 
 function.
   build: Remove unused function getSectionsStart() from layoutrom.py.
   build: Extract section visiting logic in layoutrom.py.
   build: Refactor layoutrom.py gc() function.
   build: Use customized entry point for each type of build.
   build: Refactor findInit() function.
   build: Rework getRelocs() to use a hash instead of categories in 
 

Re: [PATCH v5 29/29] vfio: powerpc/spapr: Support Dynamic DMA windows

2015-03-10 Thread Alex Williamson
On Tue, 2015-03-10 at 01:07 +1100, Alexey Kardashevskiy wrote:
 This adds create/remove window ioctls to create and remove DMA windows.
 sPAPR defines a Dynamic DMA windows capability which allows
 para-virtualized guests to create additional DMA windows on a PCI bus.
 The existing linux kernels use this new window to map the entire guest
 memory and switch to the direct DMA operations saving time on map/unmap
 requests which would normally happen in a big amounts.
 
 This adds 2 ioctl handlers - VFIO_IOMMU_SPAPR_TCE_CREATE and
 VFIO_IOMMU_SPAPR_TCE_REMOVE - to create and remove windows.
 Up to 2 windows are supported now by the hardware and by this driver.
 
 This changes VFIO_IOMMU_SPAPR_TCE_GET_INFO handler to return additional
 information such as a number of supported windows and maximum number
 levels of TCE tables.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 ---
 Changes:
 v4:
 * moved code to tce_iommu_create_window()/tce_iommu_remove_window()
 helpers
 * added docs
 ---
  Documentation/vfio.txt  |  19 +
  arch/powerpc/include/asm/iommu.h|   2 +-
  drivers/vfio/vfio_iommu_spapr_tce.c | 165 
 +++-
  include/uapi/linux/vfio.h   |  24 +-
  4 files changed, 207 insertions(+), 3 deletions(-)
 
 diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
 index 791e85c..61ce393 100644
 --- a/Documentation/vfio.txt
 +++ b/Documentation/vfio.txt
 @@ -446,6 +446,25 @@ the memory block.
  The user space is not expected to call these often and the block descriptors
  are stored in a linked list in the kernel.
  
 +6) sPAPR specification allows guests to have an ddditional DMA window(s) on
 +a PCI bus with a variable page size. Two ioctls have been added to support
 +this: VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE.
 +The platform has to support the functionality or error will be returned to
 +the userspace. The existing hardware supports up to 2 DMA windows, one is
 +2GB long, uses 4K pages and called default 32bit window; the other can
 +be as big as entire RAM, use different page size, it is optional - guests
 +create those in run-time if the guest driver supports 64bit DMA.
 +
 +VFIO_IOMMU_SPAPR_TCE_CREATE receives a page shift, a DMA window size and
 +a number of TCE table levels (if a TCE table is going to be big enough and
 +the kernel may not be able to allocate enough of physicall contiguous 
 memory).
 +It creates a new window in the available slot and returns the bus address 
 where
 +the new window starts. Due to hardware limitation, the user space cannot 
 choose
 +the location of DMA windows.
 +
 +VFIO_IOMMU_SPAPR_TCE_REMOVE receives the bus start address of the window
 +and removes it.
 +
  
 ---
  
  [1] VFIO was originally an acronym for Virtual Function I/O in its
 diff --git a/arch/powerpc/include/asm/iommu.h 
 b/arch/powerpc/include/asm/iommu.h
 index 04f72ac..de82b61 100644
 --- a/arch/powerpc/include/asm/iommu.h
 +++ b/arch/powerpc/include/asm/iommu.h
 @@ -138,7 +138,7 @@ extern void iommu_free_table(struct iommu_table *tbl, 
 const char *node_name);
  extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
   int nid);
  
 -#define IOMMU_TABLE_GROUP_MAX_TABLES 1
 +#define IOMMU_TABLE_GROUP_MAX_TABLES 2
  
  struct iommu_table_group;
  
 diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
 b/drivers/vfio/vfio_iommu_spapr_tce.c
 index 3a0b5fe..7aa4141b 100644
 --- a/drivers/vfio/vfio_iommu_spapr_tce.c
 +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
 @@ -96,6 +96,7 @@ struct tce_container {
   struct list_head mem_list;
   struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
   struct list_head group_list;
 + bool v2;
  };
  
  struct tce_iommu_group {
 @@ -333,6 +334,20 @@ static struct iommu_table *spapr_tce_find_table(
   return ret;
  }
  
 +static int spapr_tce_find_free_table(struct tce_container *container)
 +{
 + int i;
 +
 + for (i = 0; i  IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
 + struct iommu_table *tbl = container-tables[i];
 +
 + if (!tbl-it_size)
 + return i;
 + }
 +
 + return -1;


Why not use a real errno here?

 +}
 +
  static int tce_iommu_enable(struct tce_container *container)
  {
   int ret = 0;
 @@ -432,6 +447,8 @@ static void *tce_iommu_open(unsigned long arg)
   INIT_LIST_HEAD_RCU(container-mem_list);
   INIT_LIST_HEAD_RCU(container-group_list);
  
 + container-v2 = arg == VFIO_SPAPR_TCE_v2_IOMMU;
 +

Ah, here v2 actually provides some enforced differentiation, right?  ...
oh wait, nobody ever uses this.
   return container;
  }
  
 @@ -605,11 +622,90 @@ static long tce_iommu_build(struct tce_container 
 *container,
   return ret;
  }
  
 +static long tce_iommu_create_window(struct tce_container *container,
 + __u32 page_shift, __u64 

CfP 10th Workshop on Virtualization in High-Performance Cloud Computing (VHPC '15)

2015-03-10 Thread VHPC 15
=
CALL FOR PAPERS

10th Workshop on Virtualization in High-Performance Cloud Computing (VHPC '15)

held in conjunction with Euro-Par 2015, August 24-28, Vienna, Austria

(Springer LNCS)
=

Date: August 25, 2015
Workshop URL: http://vhpc.org

Paper Submission Deadline: May 22, 2015


CALL FOR PAPERS

Virtualization technologies constitute a key enabling factor for
flexible resource
management in modern data centers, cloud environments, and increasingly in
HPC as well. Providers need to dynamically manage complex infrastructures in a
seamless fashion for varying workloads and hosted applications, independently of
the customers deploying software or users submitting highly dynamic and
heterogeneous workloads. Thanks to virtualization, we have the ability to manage
vast computing and networking resources dynamically and close to the marginal
cost of providing the services, which is unprecedented in the history
of scientific
and commercial computing.

Various virtualization technologies contribute to the overall picture
in different
ways: machine virtualization, with its capability to enable
consolidation of multiple
under-utilized servers with heterogeneous software and operating systems (OSes),
and its capability to live-migrate a fully operating virtual machine
(VM) with a very
short downtime, enables novel and dynamic ways to manage physical servers;
OS-level virtualization, with its capability to isolate multiple user-space
environments and to allow for their co-existence within the same OS kernel,
promises to provide many of the advantages of machine virtualization with high
levels of responsiveness and performance; I/O Virtualization allows physical
network adapters to take traffic from multiple VMs; network
virtualization, with its
capability to create logical network overlays that are independent of the
underlying physical topology and IP addressing, provides the fundamental
ground on top of which evolved network services can be realized with an
unprecedented level of dynamicity and flexibility; These technologies
have to be inter-mixed and integrated in an intelligent way, to support
workloads that are increasingly demanding in terms of absolute performance,
responsiveness and interactivity, and have to respect well-specified Service-
Level Agreements (SLAs), as needed for industrial-grade provided services.
Indeed, among emerging and increasingly interesting application domains
for virtualization, we can find big-data application workloads in cloud
infrastructures, interactive and real-time multimedia services in the cloud,
including real-time big-data streaming platforms such as used in real-time
analytics supporting nowadays a plethora of application domains. Distributed
cloud infrastructures promise to offer unprecedented responsiveness levels for
hosted applications, but that is only possible if the underlying virtualization
technologies can overcome most of the latency impairments typical of current
virtualized infrastructures (e.g., far worse tail-latency).

The Workshop on Virtualization in High-Performance Cloud Computing (VHPC)
aims to bring together researchers and industrial practitioners facing
the challenges
posed by virtualization in order to foster discussion, collaboration,
mutual exchange
of knowledge and experience, enabling research to ultimately provide novel
solutions for virtualized computing systems of tomorrow.

The workshop will be one day in length, composed of 20 min paper presentations,
each followed by 10 min discussion sections, and lightning talks, limited to 5
minutes. Presentations may be accompanied by interactive demonstrations.

TOPICS

Topics of interest include, but are not limited to:

- Virtualization in supercomputing environments, HPC clusters, cloud
HPC and grids
- Optimizations of virtual machine monitor platforms, hypervisors and
OS-level virtualization
- Hypervisor and network virtualization QoS and SLAs
- Cloud based network and system management for SDN and NFV
- Management, deployment and monitoring of virtualized environments
- Performance measurement, modelling and monitoring of
virtualized/cloud workloads
- Programming models for virtualized environments
- Cloud reliability, fault-tolerance, high-availability and security
- Heterogeneous virtualized environments, virtualized accelerators,
GPUs and co-processors
- Optimized communication libraries/protocols in the cloud and for HPC
in the cloud
- Topology management and optimization for distributed virtualized applications
- Cluster provisioning in the cloud and cloud bursting
- Adaptation of emerging HPC technologies (high performance networks,
RDMA, etc..)
- I/O and storage virtualization, virtualization aware file systems
- Job scheduling/control/policy in virtualized environments
- Checkpointing and migration of VM-based large compute jobs
- Cloud frameworks and APIs
- 

Re: [PATCH] x86: irq_comm: Add check for RH bit in kvm_set_msi_irq

2015-03-10 Thread Radim Krčmář
2015-03-09 23:05-0600, James Sullivan:
 Hi folks,
 
 This is a small patch that implements logic to handle the RH bit
 being set in the msi message address register. Currently the DM bit is
 the only thing used to decide irq-dest_mode (logical when DM set,
 physical when unset). Documentation indicates that the DM bit will be
 ignored when the RH bit is unset, and physical destination mode is used
 in this case.

(I found that https://software.intel.com/en-us/forums/topic/23 agrees
 with this definition of ignored; and it explains the weird design at
 the bottom.)

 Fixed this to set irq-dest_mode to logical just when both bits are set,
 and physical otherwise.
 
 Signed-off-by: James Sullivan sullivan.jame...@gmail.com
 ---
 diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
 @@ -97,18 +97,26 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct 
 kvm_lapic *src,
  static inline void kvm_set_msi_irq(struct kvm_kernel_irq_routing_entry *e,
  struct kvm_lapic_irq *irq)
  {
 + bool phys;
   trace_kvm_msi_set_irq(e-msi.address_lo, e-msi.data);
 
   irq-dest_id = (e-msi.address_lo 
   MSI_ADDR_DEST_ID_MASK)  MSI_ADDR_DEST_ID_SHIFT;
   irq-vector = (e-msi.data 
   MSI_DATA_VECTOR_MASK)  MSI_DATA_VECTOR_SHIFT;
 - irq-dest_mode = (1  MSI_ADDR_DEST_MODE_SHIFT)  e-msi.address_lo;
 + /*
 +  * Set dest_mode to logical just in case both the RH and DM
 +  * bits are set, otherwise default to physical.
 +  */
 + phys = ((e-msi.address_lo  (MSI_ADDR_REDIRECTION_LOWPRI |
 + MSI_ADDR_DEST_MODE_LOGICAL)) !=
 + (MSI_ADDR_REDIRECTION_LOWPRI |
 + MSI_ADDR_DEST_MODE_LOGICAL));
 + irq-dest_mode = phys ? 0 : (MSI_ADDR_DEST_MODE_LOGICAL);

(Should be APIC_DEST_LOGICAL.  All works because it is a boolean and we
 only checked for APIC_DEST_PHYSICAL, which is 0.)

   irq-trig_mode = (1  MSI_DATA_TRIGGER_SHIFT)  e-msi.data;
   irq-delivery_mode = e-msi.data  0x700;
   irq-level = 1;
   irq-shorthand = 0;
 - /* TODO Deal with RH bit of MSI message address */

RH bit still isn't deal with -- we do not use lowest-priority-like
delivery in logical destination mode ...

How does DM=1/RH=1 work on real hardware?
(There seem to be interesting corner cases with irq-delivery_mode like
 APIC_DM_NMI.)

Thanks.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-10 Thread Dr. David Alan Gilbert
* Paolo Bonzini (pbonz...@redhat.com) wrote:
 
 
 On 10/03/2015 19:21, Bandan Das wrote:
  Paolo Bonzini pbonz...@redhat.com writes:
  
  On 10/03/2015 17:57, Dr. David Alan Gilbert wrote:
  I'm seeing something similar; it's very intermittent and generally
  happening right at boot of the guest;   I'm running this on qemu
  head+my postcopy world (but it's happening right at boot before postcopy
  gets a chance), and I'm using a 3.19ish kernel. Xeon E5-2407 in my case
  but hey maybe I'm seeing a different bug.
  
  Probably a tangent but is the qemu trace identical to what Andrey is seeing 
  ?
  From a cursory look and my limited understanding, it seems his failure is 
  #GP
  when executing video bios.
  
  Same here on 3.16 + Xeon E5 v3 kernel.
  
  I will try to reproduce this on a v2.
 
 I see several failures, usually mine have suberror 1.  With a 32-VCPU
 guest I can reproduce it roughly half of the time.
 
 Paolo

while true; do (sleep 5; echo -e 
'\001cq\n')|/opt/qemu-try-world3/bin/qemu-system-x86_64 -machine 
pc-i440fx-2.0,accel=kvm -m 1024 -smp 128 -nographic -device sga 21 | tee 
/tmp/qemu.op; grep internal error /tmp/qemu.op -q  break; done

(and leave about 2mins of runs before declaring good)

bad: cd2946607b42636d6c8cf6dbf94bce0273507b17
bad: 041ccc922ee474693a2869d4e3b59e920c739bc0
bad: 2559db069628981bfdc90637fac5bf1b4f4e8ef5
bad: 21f5826a04d38e19488f917e1eef22751490c769
good:e95d24ff40c77fbfd71396834a2eb99375f8bcc4
good: 7781a492fa5a2eff53d06b25b93f0186ad3226c9
good: c3edd62851098e6417786193ed9e9341781fcf57
good: c5c6d7f81a6950d8e32a3b5a0bafd37bfa5a8e88
good: 73104fd399c6778112f64fe0d439319f24508d9a
good: 92013cf8ca10adafec9a92deb5df993e7df22cb9
good: 4478aa768ccefcc5b234c23d035435fd71b932f6
good: 2.2.0

[root@virtlab413 qemu-world3]# git bisect bad
21f5826a04d38e19488f917e1eef22751490c769 is the first bad commit
commit 21f5826a04d38e19488f917e1eef22751490c769
Author: Gerd Hoffmann kra...@redhat.com
Date:   Thu Feb 19 09:33:03 2015 +0100

seabios: update to 1.8.0 release

'git shortlog 8936dbb2..4c59f5d8' for seabios repo:

David Woodhouse (4):
  Update EFI_COMPATIBILITY16_TABLE to match 0.98 spec update
  build: use -m16 where available instead of asm(.code16gcc)
  romlayout: Use .code16 not .code16gcc
  vgabios: Use .code16 not .code16gcc

Gerd Hoffmann (2):
  add scripts/tarball.sh
  build: set LC_ALL=C

Hannes Reinecke (1):
  megasas: read addional PCI I/O bar

Ian Campbell (1):
  romlayout: Use rep ; nop not rep nop.

Kevin O'Connor (139):
  vgabios: Return from handle_1011() if handler found.
  edd: Move EDD get drive parameters (int 1348) logic from disk.c to 
block.c.
  edd: Use sectors==-1 to detect removable media.
  edd: Separate out ATA and virtio specific parts of fill_edd().
  cdemu: store internal cdemu fields in standard el-torito spec 
format.
  Move cdemu call interface and disk_ret helper code to disk.c.
  smm: Replace SMI assembler code with C code.
  smm: Use a C struct to define the layout of the SMM area.
  smp: Replace QEMU SMP init assembler code with C; run only in 32bit 
mode.
  Don't enable thread preemption during S3 resume vga option rom 
execution.
  Remove old Bochs bios fixed address string at 0xfff00.
  Move most of the VAR16FIXED() defs to misc.c.
  build: Avoid absolute paths during whole-program compiling.
  Make sure handle_smi() and handle_smp() are compiled out if not 
enabled.
  Remove the TODO file.
  Abstract reset call (and possible 16bit mode switch) into reset() 
function.
  build: Remove unused function getSectionsStart() from layoutrom.py.
  build: Extract section visiting logic in layoutrom.py.
  build: Refactor layoutrom.py gc() function.
  build: Use customized entry point for each type of build.
  build: Refactor findInit() function.
  build: Rework getRelocs() to use a hash instead of categories in 
layoutrom.py
  build: Keep segmented sections separate until final link step.
  build: Use fileid instead of category to write sections in 
layoutrom.py.
  build: Only export needed fields in LayoutInfo in layoutrom.py.
  build: Get fixed address variables from 32bit compile pass (not 16bit)
  build: Minor - fix comments referring to old tools/ directory.
  xhci: Update the times for usb command timeouts.
  ehci: Update usb command timeouts to use usb_xfer_time()
  uhci: Update usb command timeouts to use usb_xfer_time()
  ohci: Update usb command timeouts to use usb_xfer_time()
  vgabios: Fix broken build resulting from e5749978.
  boot: Change :rom%d boot order rom instance to :rom%x
  Minor - remove stray tab from src/fw/smm.c.
  build: Update 

Re: 2 CPU Conformance Issue in KVM/x86

2015-03-10 Thread Avi Kivity

On 03/10/2015 12:47 PM, Paolo Bonzini wrote:


On 09/03/2015 20:49, Avi Kivity wrote:

Yes, and it checked that MAXPHYADDR != 52 before.  If you want to set
only one bit, making that bit 51 makes sense anyway for simplicity, so
it is still 99.9% academic.  Once processors appear with MAXPHYADDR =
52, the remaining 0.1% will become more relevant.

The current limit is IIRC 46 or 48 (on Haswell Xeons).

It will be interesting to have processors with 52 bits of physical
address and 48 bits of virtual address. HIGHMEM for x86_64?  Or 5-level
page tables?

I wonder why Intel chose exactly 52...  HIGHMEM seems more likely than
5-level page tables.  Certainly it wouldn't need hacks like Ingo's 4G-4G.


My bitcoins are on 5-level page tables.  HIGHMEM is too much pain.


50 bits == 1 PiB.  That's quite an amount of RAM.

Not that 64 TiB is not quite an amount of RAM. :)




Depends on how many browser tabs you have open, I guess.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-10 Thread Paolo Bonzini


On 10/03/2015 19:21, Bandan Das wrote:
 Paolo Bonzini pbonz...@redhat.com writes:
 
 On 10/03/2015 17:57, Dr. David Alan Gilbert wrote:
 I'm seeing something similar; it's very intermittent and generally
 happening right at boot of the guest;   I'm running this on qemu
 head+my postcopy world (but it's happening right at boot before postcopy
 gets a chance), and I'm using a 3.19ish kernel. Xeon E5-2407 in my case
 but hey maybe I'm seeing a different bug.
 
 Probably a tangent but is the qemu trace identical to what Andrey is seeing ?
 From a cursory look and my limited understanding, it seems his failure is #GP
 when executing video bios.
 
 Same here on 3.16 + Xeon E5 v3 kernel.
 
 I will try to reproduce this on a v2.

I see several failures, usually mine have suberror 1.  With a 32-VCPU
guest I can reproduce it roughly half of the time.

Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 03/29] vfio: powerpc/spapr: Check that TCE page size is equal to it_page_size

2015-03-10 Thread Alex Williamson
On Tue, 2015-03-10 at 01:06 +1100, Alexey Kardashevskiy wrote:
 This checks that the TCE table page size is not bigger that the size of
 a page we just pinned and going to put its physical address to the table.
 
 Otherwise the hardware gets unwanted access to physical memory between
 the end of the actual page and the end of the aligned up TCE page.
 
 Since compound_order() and compound_head() work correctly on non-huge
 pages, there is no need for additional check whether the page is huge.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 ---
 Changes:
 v4:
 * s/tce_check_page_size/tce_page_is_contained/
 ---
  drivers/vfio/vfio_iommu_spapr_tce.c | 22 ++
  1 file changed, 22 insertions(+)
 
 diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
 b/drivers/vfio/vfio_iommu_spapr_tce.c
 index 756831f..91e7599 100644
 --- a/drivers/vfio/vfio_iommu_spapr_tce.c
 +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
 @@ -49,6 +49,22 @@ struct tce_container {
   bool enabled;
  };
  
 +static bool tce_page_is_contained(struct page *page, unsigned page_shift)
 +{
 + unsigned shift;
 +
 + /*
 +  * Check that the TCE table granularity is not bigger than the size of
 +  * a page we just found. Otherwise the hardware can get access to
 +  * a bigger memory chunk that it should.
 +  */
 + shift = PAGE_SHIFT + compound_order(compound_head(page));
 + if (shift = page_shift)
 + return true;
 +
 + return false;

nit, simplified:

return (PAGE_SHIFT + compound_order(compound_head(page) = page_shift);

 +}
 +
  static int tce_iommu_enable(struct tce_container *container)
  {
   int ret = 0;
 @@ -197,6 +213,12 @@ static long tce_iommu_build(struct tce_container 
 *container,
   ret = -EFAULT;
   break;
   }
 +
 + if (!tce_page_is_contained(page, tbl-it_page_shift)) {
 + ret = -EPERM;
 + break;
 + }
 +
   hva = (unsigned long) page_address(page) +
   (tce  IOMMU_PAGE_MASK(tbl)  ~PAGE_MASK);
  



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-10 Thread Andrey Korolyov
On Tue, Mar 10, 2015 at 9:16 PM, Dr. David Alan Gilbert
dgilb...@redhat.com wrote:
 * Andrey Korolyov (and...@xdel.ru) wrote:
 On Tue, Mar 10, 2015 at 7:57 PM, Dr. David Alan Gilbert
 dgilb...@redhat.com wrote:
  * Andrey Korolyov (and...@xdel.ru) wrote:
  On Sat, Mar 7, 2015 at 3:00 AM, Andrey Korolyov and...@xdel.ru wrote:
   On Fri, Mar 6, 2015 at 7:57 PM, Bandan Das b...@redhat.com wrote:
   Andrey Korolyov and...@xdel.ru writes:
  
   On Fri, Mar 6, 2015 at 1:14 AM, Andrey Korolyov and...@xdel.ru 
   wrote:
   Hello,
  
   recently I`ve got a couple of shiny new Intel 2620v2s for future
   replacement of the E5-2620v1, but I experienced relatively many 
   events
   with emulation errors, all traces looks simular to the one below. I 
   am
   running qemu-2.1 on x86 on top of 3.10 branch for testing purposes 
   but
   can switch to some other versions if necessary. Most of crashes
   happened during reboot cycle or at the end of ACPI-based shutdown
   action, if this can help. I have zero clues of what can introduce 
   such
   a mess inside same processor family using identical software, as
   2620v1 has no simular problem ever. Please let me know if there can 
   be
   some side measures for making entire story more clear.
  
   Thanks!
  
   KVM internal error. Suberror: 2
   extra data[0]: 80d1
   extra data[1]: 8b0d
   EAX=0003 EBX= ECX= EDX=
   ESI= EDI= EBP= ESP=6cd4
   EIP=d3f9 EFL=00010202 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
   ES =   9300
   CS =f000 000f  9b00
   SS =   9300
   DS =   9300
   FS =   9300
   GS =   9300
   LDT=   8200
   TR =   8b00
   GDT= 000f6e98 0037
   IDT=  03ff
   CR0=0010 CR2= CR3= CR4=
   DR0= DR1= DR2=
   DR3=
   DR6=0ff0 DR7=0400
   EFER=
   Code=48 18 67 8c 00 8c d1 8e d9 66 5a 66 58 66 5d 66 c3 cd 02 cb cd
   10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd 19 cb cd 1c cb fa fc 66
   b8 00 e0 00 00 8e
  
  
   It turns out that those errors are introduced by APICv, which gets
   enabled due to different feature set. If anyone is interested in
   reproducing/fixing this exactly on 3.10, it takes about one hundred of
   migrations/power state changes for an issue to appear, guest OS can be
   Linux or Win.
  
   Are you able to reproduce this on a more recent upstream kernel as 
   well ?
  
   Bandan
  
   I`ll go through test cycle with 3.18 and 2603v2 around tomorrow and
   follow up with any reproduceable results.
 
  Heh.. issue is not triggered on 2603v2 at all, at least I am not able
  to hit this. The only difference with 2620v2 except lower frequency is
  an Intel Dynamic Acceleration feature. I`d appreciate any testing with
  higher CPU models with same or richer feature set. The testing itself
  can be done on both generic 3.10 or RH7 kernels, as both of them are
  experiencing this issue. I conducted all tests with disabled cstates
  so I advise to do the same for a first reproduction step.
 
  Thanks!
 
  model name  : Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
  stepping: 4
  microcode   : 0x416
  cpu MHz : 2100.039
  cache size  : 15360 KB
  siblings: 12
  apicid  : 43
  initial apicid  : 43
  fpu : yes
  fpu_exception   : yes
  cpuid level : 13
  wp  : yes
  flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
  mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
  syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts
  rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq
  dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca
  sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c
  rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi
  flexpriority ept vpid fsgsbase smep erms
 
  I'm seeing something similar; it's very intermittent and generally
  happening right at boot of the guest;   I'm running this on qemu
  head+my postcopy world (but it's happening right at boot before postcopy
  gets a chance), and I'm using a 3.19ish kernel. Xeon E5-2407 in my case
  but hey maybe I'm seeing a different bug.
 
  Dave

 Yep, looks like we are hitting same bug - two thirds of mine failure
 events shot during boot/reboot cycle and approx. one third of events
 happened in the middle of runtime. What CPU, v0 or v2 are you using
 (in other words, is APICv enabled)?

 processor   : 7
 vendor_id   : GenuineIntel
 cpu family  : 6
 model   : 45
 model name  : Intel(R) Xeon(R) CPU E5-2407 0 @ 2.20GHz
 stepping: 7
 microcode   : 0x70d
 cpu MHz : 

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-10 Thread Bandan Das
Paolo Bonzini pbonz...@redhat.com writes:

 On 10/03/2015 17:57, Dr. David Alan Gilbert wrote:
 I'm seeing something similar; it's very intermittent and generally
 happening right at boot of the guest;   I'm running this on qemu
 head+my postcopy world (but it's happening right at boot before postcopy
 gets a chance), and I'm using a 3.19ish kernel. Xeon E5-2407 in my case
 but hey maybe I'm seeing a different bug.

Probably a tangent but is the qemu trace identical to what Andrey is seeing ?
From a cursory look and my limited understanding, it seems his failure is #GP
when executing video bios.

 Same here on 3.16 + Xeon E5 v3 kernel.

I will try to reproduce this on a v2.

Bandan
 Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-10 Thread Dr. David Alan Gilbert
* Paolo Bonzini (pbonz...@redhat.com) wrote:
 
 
 On 10/03/2015 19:21, Bandan Das wrote:
  Paolo Bonzini pbonz...@redhat.com writes:
  
  On 10/03/2015 17:57, Dr. David Alan Gilbert wrote:
  I'm seeing something similar; it's very intermittent and generally
  happening right at boot of the guest;   I'm running this on qemu
  head+my postcopy world (but it's happening right at boot before postcopy
  gets a chance), and I'm using a 3.19ish kernel. Xeon E5-2407 in my case
  but hey maybe I'm seeing a different bug.
  
  Probably a tangent but is the qemu trace identical to what Andrey is seeing 
  ?
  From a cursory look and my limited understanding, it seems his failure is 
  #GP
  when executing video bios.
  
  Same here on 3.16 + Xeon E5 v3 kernel.
  
  I will try to reproduce this on a v2.
 
 I see several failures, usually mine have suberror 1.  With a 32-VCPU
 guest I can reproduce it roughly half of the time.

Oh yes, that helps:

while true; do (sleep 5; echo -e 
'\001cq\n')|/opt/qemu-try-world3/bin/qemu-system-x86_64 -machine 
pc-i440fx-2.3,accel=kvm -m 8192 -smp 20 -nographic -device sga; done

gets it under a minute on my other box; Intel(R) Xeon(R) CPU E5-2620 v2 @ 
2.10GHz
(Running a 3.18)

Dave

 Paolo
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-10 Thread Paolo Bonzini


On 10/03/2015 19:16, Dr. David Alan Gilbert wrote:
 KVM internal error. Suberror: 1
 emulation failure
 EAX= EBX= ECX= EDX=000fd2bc
 ESI= EDI= EBP= ESP=
 EIP=000fd2c5 EFL=00010007 [-PC] CPL=0 II=0 A20=1 SMM=0 HLT=0
 ES =0010   00c09300 DPL=0 DS   [-WA]
 CS =0008   00c09b00 DPL=0 CS32 [-RA]
 SS =0010   00c09300 DPL=0 DS   [-WA]
 DS =0010   00c09300 DPL=0 DS   [-WA]
 FS =0010   00c09300 DPL=0 DS   [-WA]
 GS =0010   00c09300 DPL=0 DS   [-WA]
 LDT=   8200 DPL=0 LDT
 TR =   8b00 DPL=0 TSS32-busy
 GDT= 000f6a80 0037
 IDT= 000f6abe 
 CR0=6011 CR2= CR3= CR4=
 DR0= DR1= DR2= 
 DR3=
 DR6=0ff0 DR7=0400
 EFER=
 Code=66 ba bc d2 0f 00 e9 a2 fe f3 90 f0 0f ba 2d 04 ff fb bf 00 72 f3 8b 
 25 00 ff fb bf e8 44 66 ff ff c7 05 04 ff
  fb bf 00 00 00 00 f4 eb fd fa fc 66 b8
 KVM internal error. Suberror: 1
 emulation failure

This is exactly the same as mine.  I'm using upstream QEMU right now,
but I can try to reproduce using RHEL7's too.

Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] x86: svm: make wbinvd faster

2015-03-10 Thread Marcelo Tosatti
On Tue, Mar 10, 2015 at 12:01:31PM +0100, Radim Krčmář wrote:
 2015-03-09 20:28-0300, Marcelo Tosatti:
  On Fri, Feb 27, 2015 at 06:19:18PM -0600, Joel Schopp wrote:
   From: David Kaplan david.kap...@amd.com
   No need to re-decode WBINVD since we know what it is from the intercept.
   
   Signed-off-by: David Kaplan david.kap...@amd.com
   [extracted from larger unlrelated patch, forward ported, tested]
   Signed-off-by: Joel Schopp joel.sch...@amd.com
  
  Can't you disable the intercept if need_emulate_wbinvd(vcpu) == false? 
 
 I don't think we want to:  it should be faster to intercept and ignore
 than to invalidate all caches.  The exit doesn't affect other physical
 cores and costs just about 10(?) L3 cache misses.

Yes, right.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] x86: irq_comm: Add check for RH bit in kvm_set_msi_irq

2015-03-10 Thread James Sullivan
On 03/10/2015 08:47 AM, Radim Krčmář wrote:
...
 +/*
 + * Set dest_mode to logical just in case both the RH and DM
 + * bits are set, otherwise default to physical.
 + */
 +phys = ((e-msi.address_lo  (MSI_ADDR_REDIRECTION_LOWPRI |
 +MSI_ADDR_DEST_MODE_LOGICAL)) !=
 +(MSI_ADDR_REDIRECTION_LOWPRI |
 +MSI_ADDR_DEST_MODE_LOGICAL));
 +irq-dest_mode = phys ? 0 : (MSI_ADDR_DEST_MODE_LOGICAL);
 
 (Should be APIC_DEST_LOGICAL.  All works because it is a boolean and we
  only checked for APIC_DEST_PHYSICAL, which is 0.)
 

Thank you, that should be as follows:

irq-dest_mode = phys ? (APIC_DEST_PHYSICAL) : (APIC_DEST_LOGICAL);

?

  irq-trig_mode = (1  MSI_DATA_TRIGGER_SHIFT)  e-msi.data;
  irq-delivery_mode = e-msi.data  0x700;
  irq-level = 1;
  irq-shorthand = 0;
 -/* TODO Deal with RH bit of MSI message address */
 
 RH bit still isn't deal with -- we do not use lowest-priority-like
 delivery in logical destination mode ...

Just want to make sure I understand this comment-
Isn't low-pri delivery mode used in kvm_irq_delivery_to_apic_fast
when irq-dest_mode  APIC_DEST_PHYSICAL (ie, logical)? (See below.)

Do you mean that this patch will interfere with this? As long as
irq-dest_mode is still APIC_DEST_LOGICAL this shouldn't change.


bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
struct kvm_lapic_irq *irq, int *r, unsigned long *dest_map)
{
.
if (irq-dest_mode == APIC_DEST_PHYSICAL) {
if (irq-dest_id = ARRAY_SIZE(map-phys_map))
goto out;

dst = map-phys_map[irq-dest_id];
} else {
u32 mda = irq-dest_id  (32 - map-ldr_bits);
u16 cid = apic_cluster_id(map, mda);

if (cid = ARRAY_SIZE(map-logical_map))
goto out;

dst = map-logical_map[cid];

bitmap = apic_logical_id(map, mda);

if (irq-delivery_mode == APIC_DM_LOWEST) {
int l = -1;
for_each_set_bit(i, bitmap, 16) {
if (!dst[i])
continue;
if (l  0)
l = i;
else if (kvm_apic_compare_prio(dst[i]-vcpu, 
dst[l]-vcpu)  0)
l = i;
}

bitmap = (l = 0) ? 1  l : 0;
}
}
.
}

 How does DM=1/RH=1 work on real hardware?
 (There seem to be interesting corner cases with irq-delivery_mode like
  APIC_DM_NMI.)
 

The IA32 manual says that if DM=1/RH=1, we operate in logical destination mode
similarly to other IPIs. I don't believe this patch introduces any invalid
settings listed in section 10-21, Vol. 3, so this shouldn't create any 
weirdness.

Thanks
-James
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 03/29] vfio: powerpc/spapr: Check that TCE page size is equal to it_page_size

2015-03-10 Thread Alex Williamson
On Wed, 2015-03-11 at 09:57 +1100, Alexey Kardashevskiy wrote:
 On 03/11/2015 06:56 AM, Alex Williamson wrote:
  On Tue, 2015-03-10 at 01:06 +1100, Alexey Kardashevskiy wrote:
  This checks that the TCE table page size is not bigger that the size of
  a page we just pinned and going to put its physical address to the table.
 
  Otherwise the hardware gets unwanted access to physical memory between
  the end of the actual page and the end of the aligned up TCE page.
 
  Since compound_order() and compound_head() work correctly on non-huge
  pages, there is no need for additional check whether the page is huge.
 
  Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
  ---
  Changes:
  v4:
  * s/tce_check_page_size/tce_page_is_contained/
  ---
drivers/vfio/vfio_iommu_spapr_tce.c | 22 ++
1 file changed, 22 insertions(+)
 
  diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
  b/drivers/vfio/vfio_iommu_spapr_tce.c
  index 756831f..91e7599 100644
  --- a/drivers/vfio/vfio_iommu_spapr_tce.c
  +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
  @@ -49,6 +49,22 @@ struct tce_container {
 bool enabled;
};
 
  +static bool tce_page_is_contained(struct page *page, unsigned page_shift)
  +{
  +  unsigned shift;
  +
  +  /*
  +   * Check that the TCE table granularity is not bigger than the size of
  +   * a page we just found. Otherwise the hardware can get access to
  +   * a bigger memory chunk that it should.
  +   */
  +  shift = PAGE_SHIFT + compound_order(compound_head(page));
  +  if (shift = page_shift)
  +  return true;
  +
  +  return false;
 
  nit, simplified:
 
  return (PAGE_SHIFT + compound_order(compound_head(page) = page_shift);
 
 This won't be bool though.

Yes, it will.

  This will (I'll do this)
 
 shift = PAGE_SHIFT + compound_order(compound_head(page));
 return (shift = page_shift);
 
 
 
 
 
  +}
  +
static int tce_iommu_enable(struct tce_container *container)
{
 int ret = 0;
  @@ -197,6 +213,12 @@ static long tce_iommu_build(struct tce_container 
  *container,
 ret = -EFAULT;
 break;
 }
  +
  +  if (!tce_page_is_contained(page, tbl-it_page_shift)) {
  +  ret = -EPERM;
  +  break;
  +  }
  +
 hva = (unsigned long) page_address(page) +
 (tce  IOMMU_PAGE_MASK(tbl)  ~PAGE_MASK);
 
 
 
 
 
 



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 03/29] vfio: powerpc/spapr: Check that TCE page size is equal to it_page_size

2015-03-10 Thread Alex Williamson
On Wed, 2015-03-11 at 10:14 +1100, Benjamin Herrenschmidt wrote:
 On Tue, 2015-03-10 at 17:03 -0600, Alex Williamson wrote:
return (PAGE_SHIFT + compound_order(compound_head(page) = page_shift);
   
   This won't be bool though.
  
  Yes, it will.
 
 Don't you have your parenthesis in the wrong place, Alex ? :-)

Probably, but the compiler will warn about that.

This will (I'll do this)
   
   shift = PAGE_SHIFT + compound_order(compound_head(page));
   return (shift = page_shift);
 
 



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 03/29] vfio: powerpc/spapr: Check that TCE page size is equal to it_page_size

2015-03-10 Thread Alexey Kardashevskiy

On 03/11/2015 06:56 AM, Alex Williamson wrote:

On Tue, 2015-03-10 at 01:06 +1100, Alexey Kardashevskiy wrote:

This checks that the TCE table page size is not bigger that the size of
a page we just pinned and going to put its physical address to the table.

Otherwise the hardware gets unwanted access to physical memory between
the end of the actual page and the end of the aligned up TCE page.

Since compound_order() and compound_head() work correctly on non-huge
pages, there is no need for additional check whether the page is huge.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
Changes:
v4:
* s/tce_check_page_size/tce_page_is_contained/
---
  drivers/vfio/vfio_iommu_spapr_tce.c | 22 ++
  1 file changed, 22 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index 756831f..91e7599 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -49,6 +49,22 @@ struct tce_container {
bool enabled;
  };

+static bool tce_page_is_contained(struct page *page, unsigned page_shift)
+{
+   unsigned shift;
+
+   /*
+* Check that the TCE table granularity is not bigger than the size of
+* a page we just found. Otherwise the hardware can get access to
+* a bigger memory chunk that it should.
+*/
+   shift = PAGE_SHIFT + compound_order(compound_head(page));
+   if (shift = page_shift)
+   return true;
+
+   return false;


nit, simplified:

return (PAGE_SHIFT + compound_order(compound_head(page) = page_shift);


This won't be bool though. This will (I'll do this)

shift = PAGE_SHIFT + compound_order(compound_head(page));
return (shift = page_shift);







+}
+
  static int tce_iommu_enable(struct tce_container *container)
  {
int ret = 0;
@@ -197,6 +213,12 @@ static long tce_iommu_build(struct tce_container 
*container,
ret = -EFAULT;
break;
}
+
+   if (!tce_page_is_contained(page, tbl-it_page_shift)) {
+   ret = -EPERM;
+   break;
+   }
+
hva = (unsigned long) page_address(page) +
(tce  IOMMU_PAGE_MASK(tbl)  ~PAGE_MASK);








--
Alexey
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 03/29] vfio: powerpc/spapr: Check that TCE page size is equal to it_page_size

2015-03-10 Thread Benjamin Herrenschmidt
On Tue, 2015-03-10 at 17:03 -0600, Alex Williamson wrote:
   return (PAGE_SHIFT + compound_order(compound_head(page) = page_shift);
  
  This won't be bool though.
 
 Yes, it will.

Don't you have your parenthesis in the wrong place, Alex ? :-)

   This will (I'll do this)
  
  shift = PAGE_SHIFT + compound_order(compound_head(page));
  return (shift = page_shift);


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 07/29] vfio: powerpc/spapr: Moving pinning/unpinning to helpers

2015-03-10 Thread Alex Williamson
On Tue, 2015-03-10 at 01:07 +1100, Alexey Kardashevskiy wrote:
 This is a pretty mechanical patch to make next patches simpler.
 
 New tce_iommu_unuse_page() helper does put_page() now but it might skip
 that after the memory registering patch applied.
 
 As we are here, this removes unnecessary checks for a value returned
 by pfn_to_page() as it cannot possibly return NULL.
 
 This moves tce_iommu_disable() later to let tce_iommu_clear() know if
 the container has been enabled because if it has not been, then
 put_page() must not be called on TCEs from the TCE table. This situation
 is not yet possible but it will after KVM acceleration patchset is
 applied.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 ---
  drivers/vfio/vfio_iommu_spapr_tce.c | 70 
 -
  1 file changed, 54 insertions(+), 16 deletions(-)
 
 diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
 b/drivers/vfio/vfio_iommu_spapr_tce.c
 index d3ab34f..ca396e5 100644
 --- a/drivers/vfio/vfio_iommu_spapr_tce.c
 +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
 @@ -204,7 +204,6 @@ static void tce_iommu_release(void *iommu_data)
   struct iommu_table *tbl = container-tbl;
  
   WARN_ON(tbl  !tbl-it_group);
 - tce_iommu_disable(container);
  
   if (tbl) {
   tce_iommu_clear(container, tbl, tbl-it_offset, tbl-it_size);
 @@ -212,63 +211,102 @@ static void tce_iommu_release(void *iommu_data)
   if (tbl-it_group)
   tce_iommu_detach_group(iommu_data, tbl-it_group);
   }
 +
 + tce_iommu_disable(container);
 +
   mutex_destroy(container-lock);
  
   kfree(container);
  }
  
 +static void tce_iommu_unuse_page(struct tce_container *container,
 + unsigned long oldtce)
 +{
 + struct page *page;
 +
 + if (!(oldtce  (TCE_PCI_READ | TCE_PCI_WRITE)))
 + return;
 +
 + /*
 +  * VFIO cannot map/unmap when a container is not enabled so
 +  * we would not need this check but KVM could map/unmap and if
 +  * this happened, we must not put pages as KVM does not get them as
 +  * it expects memory pre-registation to do this part.
 +  */
 + if (!container-enabled)
 + return;
 +
 + page = pfn_to_page(__pa(oldtce)  PAGE_SHIFT);
 +
 + if (oldtce  TCE_PCI_WRITE)
 + SetPageDirty(page);
 +
 + put_page(page);
 +}
 +
  static int tce_iommu_clear(struct tce_container *container,
   struct iommu_table *tbl,
   unsigned long entry, unsigned long pages)
  {
   unsigned long oldtce;
 - struct page *page;
  
   for ( ; pages; --pages, ++entry) {
   oldtce = iommu_clear_tce(tbl, entry);
   if (!oldtce)
   continue;
  
 - page = pfn_to_page(oldtce  PAGE_SHIFT);
 - WARN_ON(!page);
 - if (page) {
 - if (oldtce  TCE_PCI_WRITE)
 - SetPageDirty(page);
 - put_page(page);
 - }
 + tce_iommu_unuse_page(container, (unsigned long) __va(oldtce));
   }
  
   return 0;
  }
  
 +static unsigned long tce_get_hva(struct tce_container *container,
 + unsigned page_shift, unsigned long tce)
 +{
 + long ret;
 + struct page *page = NULL;
 + unsigned long hva;
 + enum dma_data_direction direction = iommu_tce_direction(tce);
 +
 + ret = get_user_pages_fast(tce  PAGE_MASK, 1,
 + direction != DMA_TO_DEVICE, page);
 + if (unlikely(ret != 1))
 + return -1;
 +
 + hva = (unsigned long) page_address(page);
 +
 + return hva;
 +}


It's a bit crude to return -1 for an unsigned long function.  You might
want to later think about cleaning this up to return int with a proper
error code and return hva via a pointer.  We don't really need to store
'ret' here either.

 +
  static long tce_iommu_build(struct tce_container *container,
   struct iommu_table *tbl,
   unsigned long entry, unsigned long tce, unsigned long pages)
  {
   long i, ret = 0;
 - struct page *page = NULL;
 + struct page *page;
   unsigned long hva;
   enum dma_data_direction direction = iommu_tce_direction(tce);
  
   for (i = 0; i  pages; ++i) {
 - ret = get_user_pages_fast(tce  PAGE_MASK, 1,
 - direction != DMA_TO_DEVICE, page);
 - if (unlikely(ret != 1)) {
 + hva = tce_get_hva(container, tbl-it_page_shift, tce);
 + if (hva == -1) {
   ret = -EFAULT;
   break;
   }
  
 + page = pfn_to_page(__pa(hva)  PAGE_SHIFT);
   if (!tce_page_is_contained(page, tbl-it_page_shift)) {
   ret = -EPERM;
   break;
   }
  
 - hva = (unsigned long) page_address(page) +
 - (tce  IOMMU_PAGE_MASK(tbl)  

Re: [PATCH trivial] KVM: s390: Spelling s/intance/instance/

2015-03-10 Thread Christian Borntraeger
Am 09.03.2015 um 21:27 schrieb Geert Uytterhoeven:
 Signed-off-by: Geert Uytterhoeven geert+rene...@glider.be
 ---
  arch/s390/kvm/kvm-s390.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h
 index fda3f3146eb636d3..83f32a147d72c0f3 100644
 --- a/arch/s390/kvm/kvm-s390.h
 +++ b/arch/s390/kvm/kvm-s390.h
 @@ -125,7 +125,7 @@ static inline void kvm_s390_set_psw_cc(struct kvm_vcpu 
 *vcpu, unsigned long cc)
   vcpu-arch.sie_block-gpsw.mask |= cc  44;
  }
 
 -/* test availability of facility in a kvm intance */
 +/* test availability of facility in a kvm instance */
  static inline int test_kvm_facility(struct kvm *kvm, unsigned long nr)
  {
   return __test_facility(nr, kvm-arch.model.fac-mask) 
 

Applied, thanks.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 03/29] vfio: powerpc/spapr: Check that TCE page size is equal to it_page_size

2015-03-10 Thread Alexey Kardashevskiy

On 03/11/2015 10:03 AM, Alex Williamson wrote:

On Wed, 2015-03-11 at 09:57 +1100, Alexey Kardashevskiy wrote:

On 03/11/2015 06:56 AM, Alex Williamson wrote:

On Tue, 2015-03-10 at 01:06 +1100, Alexey Kardashevskiy wrote:

This checks that the TCE table page size is not bigger that the size of
a page we just pinned and going to put its physical address to the table.

Otherwise the hardware gets unwanted access to physical memory between
the end of the actual page and the end of the aligned up TCE page.

Since compound_order() and compound_head() work correctly on non-huge
pages, there is no need for additional check whether the page is huge.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
Changes:
v4:
* s/tce_check_page_size/tce_page_is_contained/
---
   drivers/vfio/vfio_iommu_spapr_tce.c | 22 ++
   1 file changed, 22 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index 756831f..91e7599 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -49,6 +49,22 @@ struct tce_container {
bool enabled;
   };

+static bool tce_page_is_contained(struct page *page, unsigned page_shift)
+{
+   unsigned shift;
+
+   /*
+* Check that the TCE table granularity is not bigger than the size of
+* a page we just found. Otherwise the hardware can get access to
+* a bigger memory chunk that it should.
+*/
+   shift = PAGE_SHIFT + compound_order(compound_head(page));
+   if (shift = page_shift)
+   return true;
+
+   return false;


nit, simplified:

return (PAGE_SHIFT + compound_order(compound_head(page) = page_shift);


This won't be bool though.


Yes, it will.


Ah, misread as ... - page_shift. And you missed one bracket :)





  This will (I'll do this)

shift = PAGE_SHIFT + compound_order(compound_head(page));
return (shift = page_shift);







+}
+
   static int tce_iommu_enable(struct tce_container *container)
   {
int ret = 0;
@@ -197,6 +213,12 @@ static long tce_iommu_build(struct tce_container 
*container,
ret = -EFAULT;
break;
}
+
+   if (!tce_page_is_contained(page, tbl-it_page_shift)) {
+   ret = -EPERM;
+   break;
+   }
+
hva = (unsigned long) page_address(page) +
(tce  IOMMU_PAGE_MASK(tbl)  ~PAGE_MASK);















--
Alexey
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 0/2] kvm: x86: kvm_emulate_*

2015-03-10 Thread Marcelo Tosatti
On Mon, Mar 02, 2015 at 01:43:24PM -0600, Joel Schopp wrote:
 Review comments from v1 that used kvm_emulate_wbinvd() pointed out that 
 kvm_emulate_* was inconsistant in using skipping, while kvm_emulate() always
 skips.  The first patch cleans up the existing use while the second patch
 adds use of the updated version of kvm_emulate_wbinvd() in svm
 
 Changes since v2:
   * fixed email subject line on series short description
   * renamed kvm_emulate_halt_noskip() to kvm_vcpu_halt()
   * added header declaration for kvm_vcpu_halt()
   * squashed blank line 
 ---
 
 David Kaplan (1):
   x86: svm: make wbinvd faster
 
 Joel Schopp (1):
   kvm: x86: make kvm_emulate_* consistant
 
 
  arch/x86/include/asm/kvm_host.h |1 +
  arch/x86/kvm/svm.c  |   10 +++---
  arch/x86/kvm/vmx.c  |9 +++--
  arch/x86/kvm/x86.c  |   23 ---
  4 files changed, 31 insertions(+), 12 deletions(-)

Applied, thanks.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 26/29] vfio: powerpc/spapr: Define v2 IOMMU

2015-03-10 Thread Alex Williamson
On Tue, 2015-03-10 at 01:07 +1100, Alexey Kardashevskiy wrote:
 The existing IOMMU code takes/releases ownership over the existing IOMMU
 tables created by the platform code, i.e. the tables remain in memory
 all the time. Also, the existing IOMMU requires VFIO_IOMMU_ENABLE call to
 start working as that's where we check the rlimit for locked pages.
 
 New IOMMU will take over the IOMMU group completely which means the IOMMU
 tables created by the platform code are going to be disposed and VFIO
 will create its own tables. Also, with the DMA memory pre-registration
 feature, the userspace will not need to call VFIO_IOMMU_ENABLE as
 the locked pages accounting will be done by
 VFIO_IOMMU_SPAPR_REGISTER_MEMORY.
 
 In order to inform the userspace that VFIO supports new capabilities,
 this adds a new SPAPR TCE IOMMU v2 type.

Shouldn't we block some of these new features on v1?  Otherwise v1 now
supports them too and by the property of we don't break userspace v1
and v2 are now one-in-the-same.  You'd need to support everything
through both versions.  If that's what you intend to do, maybe this
should be a capability rather than an IOMMU type.

 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 ---
  drivers/vfio/vfio_iommu_spapr_tce.c | 3 ++-
  include/uapi/linux/vfio.h   | 2 ++
  2 files changed, 4 insertions(+), 1 deletion(-)
 
 diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
 b/drivers/vfio/vfio_iommu_spapr_tce.c
 index e572c28..d665ddc 100644
 --- a/drivers/vfio/vfio_iommu_spapr_tce.c
 +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
 @@ -404,7 +404,7 @@ static void *tce_iommu_open(unsigned long arg)
  {
   struct tce_container *container;
  
 - if (arg != VFIO_SPAPR_TCE_IOMMU) {
 + if ((arg != VFIO_SPAPR_TCE_IOMMU)  (arg != VFIO_SPAPR_TCE_v2_IOMMU)) {
   pr_err(tce_vfio: Wrong IOMMU type\n);
   return ERR_PTR(-EINVAL);
   }
 @@ -588,6 +588,7 @@ static long tce_iommu_ioctl(void *iommu_data,
   case VFIO_CHECK_EXTENSION:
   switch (arg) {
   case VFIO_SPAPR_TCE_IOMMU:
 + case VFIO_SPAPR_TCE_v2_IOMMU:
   ret = 1;
   break;
   default:
 diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
 index b17e120..fbc5286 100644
 --- a/include/uapi/linux/vfio.h
 +++ b/include/uapi/linux/vfio.h
 @@ -36,6 +36,8 @@
  /* Two-stage IOMMU */
  #define VFIO_TYPE1_NESTING_IOMMU 6   /* Implies v2 */
  
 +#define VFIO_SPAPR_TCE_v2_IOMMU  7
 +
  /*
   * The IOCTL interface is designed for extensibility by embedding the
   * structure length (argsz) and flags into structures passed between



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 27/29] vfio: powerpc/spapr: powerpc/powernv/ioda2: Rework ownership

2015-03-10 Thread Alex Williamson
On Tue, 2015-03-10 at 01:07 +1100, Alexey Kardashevskiy wrote:
 Before the IOMMU user would take control over the IOMMU table belonging to
 a specific IOMMU group. This approach did not allow sharing tables between
 IOMMU groups attached to the same container.
 
 This introduces a new IOMMU ownership flavour when the user can not
 just control the existing IOMMU table but remove/create tables on demand.
 If an IOMMU supports a set_ownership() callback, that lets the user have
 full control over the IOMMU group. When the ownership is taken,
 the platform code removes all the windows so the caller must create them.
 Before returning the ownership back to the platform code, the user
 has to unprogram and remove all the tables it created.

We have no ability to enforce that requirement on the user.  VFIO needs
to do the cleanup if the user fails to.

 Old-style ownership is still supported allowing VFIO to run on older
 P5IOC2 and IODA IO controllers.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 ---
  arch/powerpc/platforms/powernv/pci-ioda.c | 30 +++---
  drivers/vfio/vfio_iommu_spapr_tce.c   | 51 
 ---
  2 files changed, 66 insertions(+), 15 deletions(-)
 
 diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
 b/arch/powerpc/platforms/powernv/pci-ioda.c
 index 07857c4..afb6906 100644
 --- a/arch/powerpc/platforms/powernv/pci-ioda.c
 +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
 @@ -1620,11 +1620,33 @@ static void pnv_ioda2_set_ownership(struct 
 iommu_table_group *table_group,
  {
   struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
   table_group);
 - if (enable)
 - iommu_take_ownership(table_group);
 - else
 - iommu_release_ownership(table_group);
 + if (enable) {
 + pnv_pci_ioda2_unset_window(pe-table_group, 0);
 + pnv_pci_free_table(pe-table_group.tables[0]);
 + } else {
 + struct iommu_table *tbl = pe-table_group.tables[0];
 + int64_t rc;
  
 + rc = pnv_pci_ioda2_create_table(pe-table_group, 0,
 + IOMMU_PAGE_SHIFT_4K,
 + pe-phb-ioda.m32_pci_base,
 + POWERNV_IOMMU_DEFAULT_LEVELS, tbl);
 + if (rc) {
 + pe_err(pe, Failed to create 32-bit TCE table, err %ld,
 + rc);
 + return;
 + }
 +
 + iommu_init_table(tbl, pe-phb-hose-node);
 +
 + rc = pnv_pci_ioda2_set_window(pe-table_group, 0, tbl);
 + if (rc) {
 + pe_err(pe, Failed to configure 32-bit TCE table, err 
 %ld\n,
 + rc);
 + pnv_pci_free_table(tbl);
 + return;
 + }
 + }
   pnv_pci_ioda2_set_bypass(pe, !enable);
  }
  
 diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
 b/drivers/vfio/vfio_iommu_spapr_tce.c
 index d665ddc..3bc0645 100644
 --- a/drivers/vfio/vfio_iommu_spapr_tce.c
 +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
 @@ -426,18 +426,11 @@ static int tce_iommu_clear(struct tce_container 
 *container,
  static void tce_iommu_release(void *iommu_data)
  {
   struct tce_container *container = iommu_data;
 - struct iommu_table *tbl;
 - struct iommu_table_group *table_group;
  
   WARN_ON(container-grp);
  
 - if (container-grp) {
 - table_group = iommu_group_get_iommudata(container-grp);
 - tbl = table_group-tables[0];
 - tce_iommu_clear(container, tbl, tbl-it_offset, tbl-it_size);
 -
 + if (container-grp)
   tce_iommu_detach_group(iommu_data, container-grp);
 - }
  
   tce_mem_unregister_all(container);
   tce_iommu_disable(container);
 @@ -826,14 +819,24 @@ static int tce_iommu_attach_group(void *iommu_data,
  
   if (!table_group-ops || !table_group-ops-set_ownership) {
   ret = iommu_take_ownership(table_group);
 + } else if (!table_group-ops-create_table ||
 + !table_group-ops-set_window) {
 + WARN_ON_ONCE(1);
 + ret = -EFAULT;
   } else {
   /*
* Disable iommu bypass, otherwise the user can DMA to all of
* our physical memory via the bypass window instead of just
* the pages that has been explicitly mapped into the iommu
*/
 + struct iommu_table tbltmp = { 0 }, *tbl = tbltmp;
 +
   table_group-ops-set_ownership(table_group, true);
 - ret = 0;
 + ret = table_group-ops-create_table(table_group, 0,
 + IOMMU_PAGE_SHIFT_4K,
 + table_group-tce32_size, 1, tbl);
 + if (!ret)
 + ret = table_group-ops-set_window(table_group, 0, tbl);
   }
  

Re: [PATCH] kvm: move advertising of KVM_CAP_IRQFD to common code

2015-03-10 Thread Marcelo Tosatti
On Thu, Mar 05, 2015 at 11:54:46AM +0100, Paolo Bonzini wrote:
 POWER supports irqfds but forgot to advertise them.  Some userspace does
 not check for the capability, but others check it---thus they work on
 x86 and s390 but not POWER.
 
 To avoid that other architectures in the future make the same mistake, let
 common code handle KVM_CAP_IRQFD the same way as KVM_CAP_IRQFD_RESAMPLE.
 
 Reported-by: Greg Kurz gk...@linux.vnet.ibm.com
 Cc: sta...@vger.kernel.org
 Fixes: 297e21053a52f060944e9f0de4c64fad9bcd72fc
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 ---
   Marcelo, please apply this for 4.0.

Applied, thanks.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 27/29] vfio: powerpc/spapr: powerpc/powernv/ioda2: Rework ownership

2015-03-10 Thread Alexey Kardashevskiy

On 03/11/2015 11:09 AM, Alex Williamson wrote:

On Tue, 2015-03-10 at 01:07 +1100, Alexey Kardashevskiy wrote:

Before the IOMMU user would take control over the IOMMU table belonging to
a specific IOMMU group. This approach did not allow sharing tables between
IOMMU groups attached to the same container.

This introduces a new IOMMU ownership flavour when the user can not
just control the existing IOMMU table but remove/create tables on demand.
If an IOMMU supports a set_ownership() callback, that lets the user have
full control over the IOMMU group. When the ownership is taken,
the platform code removes all the windows so the caller must create them.
Before returning the ownership back to the platform code, the user
has to unprogram and remove all the tables it created.


We have no ability to enforce that requirement on the user.  VFIO needs
to do the cleanup if the user fails to.


Ah. Wrong commit log, VFIO always does cleanup (as I end my guests by c-a-x 
most of the time :) ), will fix it.






Old-style ownership is still supported allowing VFIO to run on older
P5IOC2 and IODA IO controllers.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
  arch/powerpc/platforms/powernv/pci-ioda.c | 30 +++---
  drivers/vfio/vfio_iommu_spapr_tce.c   | 51 ---
  2 files changed, 66 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 07857c4..afb6906 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1620,11 +1620,33 @@ static void pnv_ioda2_set_ownership(struct 
iommu_table_group *table_group,
  {
struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
table_group);
-   if (enable)
-   iommu_take_ownership(table_group);
-   else
-   iommu_release_ownership(table_group);
+   if (enable) {
+   pnv_pci_ioda2_unset_window(pe-table_group, 0);
+   pnv_pci_free_table(pe-table_group.tables[0]);
+   } else {
+   struct iommu_table *tbl = pe-table_group.tables[0];
+   int64_t rc;

+   rc = pnv_pci_ioda2_create_table(pe-table_group, 0,
+   IOMMU_PAGE_SHIFT_4K,
+   pe-phb-ioda.m32_pci_base,
+   POWERNV_IOMMU_DEFAULT_LEVELS, tbl);
+   if (rc) {
+   pe_err(pe, Failed to create 32-bit TCE table, err %ld,
+   rc);
+   return;
+   }
+
+   iommu_init_table(tbl, pe-phb-hose-node);
+
+   rc = pnv_pci_ioda2_set_window(pe-table_group, 0, tbl);
+   if (rc) {
+   pe_err(pe, Failed to configure 32-bit TCE table, err 
%ld\n,
+   rc);
+   pnv_pci_free_table(tbl);
+   return;
+   }
+   }
pnv_pci_ioda2_set_bypass(pe, !enable);
  }

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index d665ddc..3bc0645 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -426,18 +426,11 @@ static int tce_iommu_clear(struct tce_container 
*container,
  static void tce_iommu_release(void *iommu_data)
  {
struct tce_container *container = iommu_data;
-   struct iommu_table *tbl;
-   struct iommu_table_group *table_group;

WARN_ON(container-grp);

-   if (container-grp) {
-   table_group = iommu_group_get_iommudata(container-grp);
-   tbl = table_group-tables[0];
-   tce_iommu_clear(container, tbl, tbl-it_offset, tbl-it_size);
-
+   if (container-grp)
tce_iommu_detach_group(iommu_data, container-grp);
-   }

tce_mem_unregister_all(container);
tce_iommu_disable(container);
@@ -826,14 +819,24 @@ static int tce_iommu_attach_group(void *iommu_data,

if (!table_group-ops || !table_group-ops-set_ownership) {
ret = iommu_take_ownership(table_group);
+   } else if (!table_group-ops-create_table ||
+   !table_group-ops-set_window) {
+   WARN_ON_ONCE(1);
+   ret = -EFAULT;
} else {
/*
 * Disable iommu bypass, otherwise the user can DMA to all of
 * our physical memory via the bypass window instead of just
 * the pages that has been explicitly mapped into the iommu
 */
+   struct iommu_table tbltmp = { 0 }, *tbl = tbltmp;
+
table_group-ops-set_ownership(table_group, true);
-   ret = 0;
+   ret = table_group-ops-create_table(table_group, 0,
+   IOMMU_PAGE_SHIFT_4K,
+