[PATCH] book3s_hv_rmhandlers: Pass the correct trap argument to kvmhv_commence_exit

2015-05-21 Thread Gautham R. Shenoy
In guest_exit_cont we call kvmhv_commence_exit which expects the trap
number as the argument. However r3 doesn't contain the trap number at
this point and as a result we would be calling the function with a
spurious trap number.

Fix this by copying r12 into r3 before calling kvmhv_commence_exit as
r12 contains the trap number.

Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 4d70df2..f0d7c54 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1170,6 +1170,7 @@ mc_cont:
bl  kvmhv_accumulate_time
 #endif
 
+   mr  r3, r12
/* Increment exit count, poke other threads to exit */
bl  kvmhv_commence_exit
nop
-- 
1.9.3

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm:Return -ENOMEM directly for the function, kvm_create_lapic

2015-05-21 Thread Ingo Molnar

* Nicholas Krause xerofo...@gmail.com wrote:

 In order to make code paths easier to read in the function,
 kvm_create_lapic we return -ENOMEM when unable to allocate
 memory for a kvm_lapic structure pointer directly. This
 makes the code easier to read and cleaner than jumping
 to a goto label at the end of the function's body for
 returning just the error code, -ENOMEM.
 
 Signed-off-by: Nicholas Krause xerofo...@gmail.com
 ---
  arch/x86/kvm/lapic.c | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)
 
 diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
 index 629af0f..88d0cce 100644
 --- a/arch/x86/kvm/lapic.c
 +++ b/arch/x86/kvm/lapic.c
 @@ -1687,7 +1687,7 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu)
  
   apic = kzalloc(sizeof(*apic), GFP_KERNEL);
   if (!apic)
 - goto nomem;
 + return -ENOMEM;
  
   vcpu->arch.apic = apic;
  
 @@ -1718,7 +1718,6 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu)
   return 0;
  nomem_free_apic:
   kfree(apic);
 -nomem:
   return -ENOMEM;
  }

NAK!

You just half destroyed the nice error handling cascade of labels.

Thanks,

Ingo


[Bug 98741] New: Cannot boot into kvm guest with kernel >= 3.18.x on el6.6 qemu-kvm host with virtio-blk-pci.x-data-plane=on and virtio-blk-pci.ioeventfd=on

2015-05-21 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=98741

Bug ID: 98741
   Summary: Cannot boot into kvm guest with kernel >= 3.18.x on
el6.6 qemu-kvm host with
virtio-blk-pci.x-data-plane=on and
virtio-blk-pci.ioeventfd=on
   Product: Virtualization
   Version: unspecified
Kernel Version: 3.18.x, 3.19.x, 4.0.x
  Hardware: All
OS: Linux
  Tree: Mainline
Status: NEW
  Severity: normal
  Priority: P1
 Component: kvm
  Assignee: virtualization_...@kernel-bugs.osdl.org
  Reporter: jaroslav.pulch...@gooddata.com
Regression: No

Hello,

I'm experiencing a problem with the latest kernels during testing of new features in
virtual guests running at an EL6.6 host in KVM with Virtio PV drivers and the
dataplane (virtio-blk-pci.x-data-plane=on) feature enabled to provide the best IO
performance.

The issue causes the qemu-kvm process to be stopped during guest boot.

Host log:
--
qemu-kvm: /builddir/build/BUILD/qemu-kvm-0.12.1.2/hw/msix.c:645:
msix_set_mask_notifier: Assertion `!dev->msix_mask_notifier' failed.
2015-05-18 08:45:48.102+: shutting down
---
Guest log:
---
...
initcall virtio_pci_driver_init+0x0/0x1000 [virtio_pci] returned 0 after 82556
usecs
calling  ata_init+0x0/0x5d [libata] @ 204
libata version 3.00 loaded.
initcall ata_init+0x0/0x5d [libata] returned 0 after 11641 usecs
[drm:drm_framebuffer_reference] 8800da0c5ea0: FB ID: 21 (2)
[drm:drm_framebuffer_unreference] 8800da0c5ea0: FB ID: 21 (3)
calling  ata_generic_pci_driver_init+0x0/0x1000 [ata_generic] @ 204
initcall ata_generic_pci_driver_init+0x0/0x1000 [ata_generic] returned 0 after
146 usecs
calling  init+0x0/0x1000 [virtio_blk] @ 250
virtio-pci :00:04.0: irq 24 for MSI/MSI-X
virtio-pci :00:04.0: irq 25 for MSI/MSI-X
... END ...
---

* Reproducible with this setup:
  Guest kernel versions: kernel 3.18, 3.19, 4.0
  qemu-kvm: virtio-blk-pci.x-data-plane=on, virtio-blk-pci.ioeventfd=on

* Not reproducible:
  Guest kernel versions: kernel 3.17 or older

* Not reproducible:
  Guest kernel versions: kernel 3.18, 3.19, 4.0
  qemu-kvm: virtio-blk-pci.x-data-plane=on + virtio-blk-pci.ioeventfd=off
virtio-blk-pci.x-data-plane=off + virtio-blk-pci.ioeventfd=on

I found this:
1/ the shutdown is triggered by reentrancy into qemu's
virtio_blk_data_plane_start() function (see RedHat's bug #1222574),
2/ commits which introduced this issue are 
  virtio_blk: enable VQs early on restore
(6d62c37f1991aafc872f8d8be8ac60e57ede8605) 
  virtio_net: enable VQs early on restore
(e53fbd11e983e896adaabef2d2f1695d6e0af829)
  virtio_blk: enable VQs early (7a11370e5e6c26566904bb7f08281093a3002ff2)
  virtio_net: enable VQs early (4baf1e33d0842c9673fef4af207d4b74da8d0126) 
   found by a deep dive into the git history of the virtio drivers and several
rebuilds with commits reverted (between 3.17 and 3.18)

A kernel 3.18.x build with the mentioned commits reverted can successfully boot
and run without issues.

Best regards,
Jaroslav



Re: [Qemu-devel] Announcing qboot, a minimal x86 firmware for QEMU

2015-05-21 Thread Avi Kivity

On 05/21/2015 07:21 PM, Paolo Bonzini wrote:


On 21/05/2015 17:48, Avi Kivity wrote:

Lovely!

Note you have memcpy.o instead of memcpy.c.

Doh, and it's not used anyway.  Check the repository, and let me know if
OSv boots with it (it probably needs ACPI; Linux doesn't boot virtio
without ACPI).



Yes, it requires ACPI.  We don't implement the pre-ACPI bootstrap methods.


Re: [PATCH] kvm:Return -ENOMEM directly for the function, kvm_create_lapic

2015-05-21 Thread Paolo Bonzini


On 21/05/2015 08:09, Ingo Molnar wrote:
 
 * Nicholas Krause xerofo...@gmail.com wrote:
 
 In order to make code paths easier to read in the function,
 kvm_create_lapic we return -ENOMEM when unable to allocate
 memory for a kvm_lapic structure pointer directly. This
 makes the code easier to read and cleaner than jumping
 to a goto label at the end of the function's body for
 returning just the error code, -ENOMEM.

 Signed-off-by: Nicholas Krause xerofo...@gmail.com
 ---
  arch/x86/kvm/lapic.c | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)

 diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
 index 629af0f..88d0cce 100644
 --- a/arch/x86/kvm/lapic.c
 +++ b/arch/x86/kvm/lapic.c
 @@ -1687,7 +1687,7 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu)
  
  apic = kzalloc(sizeof(*apic), GFP_KERNEL);
  if (!apic)
 -goto nomem;
 +return -ENOMEM;
  
  vcpu->arch.apic = apic;
  
 @@ -1718,7 +1718,6 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu)
  return 0;
  nomem_free_apic:
  kfree(apic);
 -nomem:
  return -ENOMEM;
  }
 
 NAK!
 
 You just half destroyed the nice error handling cascade of labels.

Right.

What could be done is always going through kfree(apic), because it is
okay to free NULL.  So the nomem label moves up, and the nomem_free_apic
label is not necessary anymore.
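A minimal user-space sketch of that shape (the allocator, the setup step, and the names here are hypothetical stand-ins, not the real kvm_create_lapic code; only the single-label error cascade mirrors the suggestion):

```c
#include <errno.h>
#include <stdlib.h>

struct kvm_lapic { int dummy; };

static int fail_mode;   /* test knob: 1 = setup fails, 2 = allocation fails */

static struct kvm_lapic *alloc_apic(void)
{
    return fail_mode == 2 ? NULL : calloc(1, sizeof(struct kvm_lapic));
}

static int setup_timer(struct kvm_lapic *apic)
{
    (void)apic;
    return fail_mode == 1 ? -1 : 0;
}

/* One shared error label: free(NULL) is defined to do nothing, so even
 * the allocation-failure path can go through the same exit. */
static int create_lapic(struct kvm_lapic **out)
{
    struct kvm_lapic *apic = alloc_apic();

    if (!apic)
        goto nomem;
    if (setup_timer(apic) < 0)
        goto nomem;
    *out = apic;
    return 0;
nomem:
    free(apic);         /* no-op when apic is NULL */
    return -ENOMEM;
}
```

The design point is exactly Paolo's: because freeing NULL is legal, the two labels collapse into one without losing the cascade structure Ingo wanted to keep.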

Paolo


KVM call agenda for 2015-05-26

2015-05-21 Thread Juan Quintela

Hi

Please, send any topic that you are interested in covering.


 Call details:

By popular demand, a google calendar public entry with it

  
https://www.google.com/calendar/embed?src=dG9iMXRqcXAzN3Y4ZXZwNzRoMHE4a3BqcXNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ

(Let me know if you have any problems with the calendar entry.  I just
gave up on getting the time right simultaneously for CEST, CET, EDT and DST.)

If you need phone number details, contact me privately.

Thanks, Juan.



Re: [PATCH 06/12] KVM: x86: API changes for SMM support

2015-05-21 Thread Paolo Bonzini


On 21/05/2015 18:26, Radim Krčmář wrote:
 2015-05-21 16:59+0200, Paolo Bonzini:
 On 21/05/2015 16:49, Radim Krčmář wrote:
 2015-05-08 13:20+0200, Paolo Bonzini:
 diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
 @@ -202,7 +202,7 @@ struct kvm_run {
__u32 exit_reason;
__u8 ready_for_interrupt_injection;
__u8 if_flag;
 -  __u8 padding2[2];
 +  __u16 flags;

 (It got lost last review and I'd really like to know ...
  what is the advantage of giving both bytes to flags?)

 No advantage.  You just should leave padding2[1] in the middle so that
  the offset of run->padding2[0] doesn't change.
 
 I don't get that.  The position of padding should be decided by
 comparing probabilities of extending 'if_flag' and 'flags'.
 
  Since it's not obvious
 I gave two bytes to flags, but I can do it either way.
 
 if_flag seems to be set in stone as one bit, so I'd vote for
 
   __u8 flags;
   __u8 padding2;
 
 (Or 'padding3', to prevent the same class of errors that removing it
   altogether does;  which we didn't do for other tail padding).

You're right that we didn't do it.  I'll change it to flags + padding2.

Paolo

 For there isn't much space left in struct kvm ...
 


Re: [PATCH 08/12] KVM: x86: save/load state on SMM switch

2015-05-21 Thread Paolo Bonzini


On 21/05/2015 19:00, Radim Krčmář wrote:
   Potentially, an NMI could be latched (while in SMM or upon exit) and
   serviced upon exit [...]
 
 This Potentially could be in the sense that the whole 3rd paragraph is
 only applicable to some ancient SMM design :)

It could also be in the sense that you cannot exclude an NMI coming at
exactly the wrong time.

If you want to go full language lawyer, it does mention it whenever
behavior is specific to a processor family.

 The 1st paragraph has quite clear sentence:
 
   If NMIs were blocked before the SMI occurred, they are blocked after
   execution of RSM.
 
 so I'd just ignore the 3rd paragraph ...
 
 And the APM 2:10.3.3 Exceptions and Interrupts
   NMI—If an NMI occurs while the processor is in SMM, it is latched by
   the processor, but the NMI handler is not invoked until the processor
   leaves SMM with the execution of an RSM instruction.  A pending NMI
   causes the handler to be invoked immediately after the RSM completes
   and before the first instruction in the interrupted program is
   executed.
 
   An SMM handler can unmask NMI interrupts by simply executing an IRET.
   Upon completion of the IRET instruction, the processor recognizes the
   pending NMI, and transfers control to the NMI handler. Once an NMI is
   recognized within SMM using this technique, subsequent NMIs are
   recognized until SMM is exited. Later SMIs cause NMIs to be masked,
   until the SMM handler unmasks them.
 
 makes me think that we should unmask them unconditionally or that SMM
 doesn't do anything with NMI masking.

Actually I hadn't noticed this paragraph.  But I read it the same as the
Intel manual (i.e. what I implemented): it doesn't say anywhere that RSM
may cause the processor to *set* the NMIs masked flag.

It makes no sense; as you said it's 1 bit of state!  But it seems that
it's the architectural behavior. :(

 If we can choose, less NMI nesting seems like a good idea.

It would---I'm just preempting future patches from Nadav. :)  That said,
even if OVMF does do IRETs in SMM (in 64-bit mode it fills in page
tables lazily for memory above 4GB), we do not care about asynchronous
SMIs such as those for power management.  So we should never enter SMM
with NMIs masked, to begin with.

Paolo


Re: [PATCH 08/12] KVM: x86: save/load state on SMM switch

2015-05-21 Thread Paolo Bonzini


On 21/05/2015 18:33, Radim Krčmář wrote:
  Check the AMD architecture manual.
 I must be blind, is there more than Table 10-2?

There's Table 10-1! :DDD

Paolo


Re: [PATCH 06/12] KVM: x86: API changes for SMM support

2015-05-21 Thread Radim Krčmář
2015-05-08 13:20+0200, Paolo Bonzini:
 diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
 @@ -202,7 +202,7 @@ struct kvm_run {
   __u32 exit_reason;
   __u8 ready_for_interrupt_injection;
   __u8 if_flag;
 - __u8 padding2[2];
 + __u16 flags;

(It got lost last review and I'd really like to know ...
 what is the advantage of giving both bytes to flags?)


Re: [PATCH 07/12] KVM: x86: stubs for SMM support

2015-05-21 Thread Radim Krčmář
2015-05-08 13:20+0200, Paolo Bonzini:
 This patch adds the interface between x86.c and the emulator: the
 SMBASE register, a new emulator flag, the RSM instruction.  It also
 adds a new request bit that will be used by the KVM_SMI ioctl.
 
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 --
   RFC-v1: make SMBASE host-readable only
add support for latching an SMI
do not reset SMBASE on INIT
 ---
 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 @@ -367,6 +367,7 @@ struct kvm_vcpu_arch {
   int32_t apic_arb_prio;
   int mp_state;
   u64 ia32_misc_enable_msr;
 + u64 smbase;

smbase is u32 in hardware.

 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 @@ -2504,7 +2504,7 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
 *vmx)
  vmx->nested.nested_vmx_misc_low = VMX_MISC_SAVE_EFER_LMA;
  vmx->nested.nested_vmx_misc_low |=
   VMX_MISC_EMULATED_PREEMPTION_TIMER_RATE |
 - VMX_MISC_ACTIVITY_HLT;
 + VMX_MISC_ACTIVITY_HLT | VMX_MISC_IA32_SMBASE_MSR;

No need to expose this feature when the MSR isn't readable.

 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 @@ -2220,6 +2221,11 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct 
 msr_data *msr_info)
 + case MSR_IA32_SMBASE:
 + if (!msr_info->host_initiated)
 + return 1;
 + vcpu->arch.smbase = data;
 + break;
 @@ -2615,6 +2621,11 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct 
 msr_data *msr_info)
 + case MSR_IA32_SMBASE:
 + if (!msr_info->host_initiated)
 + return 1;
 + msr_info->data = vcpu->arch.smbase;
 + break;


Re: [PATCH 06/12] KVM: x86: API changes for SMM support

2015-05-21 Thread Paolo Bonzini


On 21/05/2015 16:49, Radim Krčmář wrote:
 2015-05-08 13:20+0200, Paolo Bonzini:
 diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
 @@ -202,7 +202,7 @@ struct kvm_run {
  __u32 exit_reason;
  __u8 ready_for_interrupt_injection;
  __u8 if_flag;
 -__u8 padding2[2];
 +__u16 flags;
 
 (It got lost last review and I'd really like to know ...
  what is the advantage of giving both bytes to flags?)

No advantage.  You just should leave padding2[1] in the middle so that
the offset of run->padding2[0] doesn't change.  Since it's not obvious
I gave two bytes to flags, but I can do it either way.

Paolo


Re: [PATCH 00/23] userfaultfd v4

2015-05-21 Thread Kirill Smelkov
Hello up there,

On Thu, May 14, 2015 at 07:30:57PM +0200, Andrea Arcangeli wrote:
 Hello everyone,
 
 This is the latest userfaultfd patchset against mm-v4.1-rc3
 2015-05-14-10:04.
 
 The postcopy live migration feature on the qemu side is mostly ready
 to be merged and it entirely depends on the userfaultfd syscall to be
 merged as well. So it'd be great if this patchset could be reviewed
 for merging in -mm.
 
 Userfaults allow implementing on-demand paging from userland and more
 generally they allow userland to more efficiently take control of the
 behavior of page faults than what was available before
 (PROT_NONE + SIGSEGV trap).
 
 The use cases are:

[...]

 Even though there wasn't a real use case requesting it yet, it also
 allows implementing distributed shared memory in a way that readonly
 shared mappings can exist simultaneously on different hosts and they
 can become exclusive at the first wrprotect fault.

Sorry for maybe speaking up too late, but here is an additional real
potential use-case which in my view overlaps with the above:

Recently we needed to implement persistency for NumPy arrays - that is,
to track changes made to array memory and transactionally either abandon
the changes on transaction abort, or store them back to storage on
transaction commit.

Since arrays can be large, it would be slow and thus impractical to keep
a copy of the original data and compare memory against it to find which
array parts have been changed.

So I've implemented a scheme where array data is initially PROT_READ
protected; then we catch SIGSEGV, and if it is a write and the area belongs
to array data, we mark that page as PROT_WRITE and continue. At commit time
we know which parts were modified.
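The core of that scheme fits in a few lines of plain POSIX C. The sketch below (one anonymous page standing in for array data; run_demo is an illustrative name, not part of wendelin.core) arms a page read-only, catches the SIGSEGV raised by the first write, upgrades the page, and records it as dirty:

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static uint8_t *base;                    /* the tracked "array" page */
static size_t pagesz;
static volatile sig_atomic_t dirty;      /* set when the page is written */

static void on_segv(int sig, siginfo_t *si, void *uc)
{
    (void)sig; (void)uc;
    uint8_t *page = (uint8_t *)((uintptr_t)si->si_addr & ~(pagesz - 1));

    if (page != base)
        _exit(1);                        /* a fault we don't own: bail */
    /* Upgrade the page and remember it; returning from the handler
     * retries the faulting store, which now succeeds. */
    mprotect(page, pagesz, PROT_READ | PROT_WRITE);
    dirty = 1;
}

/* Returns 1 if the write was trapped and the page recorded as dirty. */
static int run_demo(void)
{
    pagesz = (size_t)sysconf(_SC_PAGESIZE);
    base = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;
    base[0] = 1;                         /* populate the page */
    mprotect(base, pagesz, PROT_READ);   /* arm write tracking */

    struct sigaction sa = { .sa_sigaction = on_segv,
                            .sa_flags = SA_SIGINFO };
    sigaction(SIGSEGV, &sa, NULL);

    base[0] = 42;                        /* faults once, handler upgrades */
    return dirty && base[0] == 42;
}
```

This is exactly the mprotect/SIGSEGV round trip whose per-fault cost userfaultfd could remove.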

Also, since arrays can be large - bigger than RAM - and only sparse
parts of them may be needed to get the required information, for reading
it also makes sense to lazily load data in the SIGSEGV handler with
initial PROT_NONE protection.

This is very similar to how memory mapped files work, but adds
transactionality which, as far as I know, is not provided by any
currently in-kernel filesystem on Linux.

The system is implemented as files, and arrays are then built on top of
such memory-mapped files. So from now on we can forget about NumPy
arrays and only talk about files, their mapping, lazy loading and
transactionally storing in-memory changes back to file storage.

To get this working, a custom user-space virtual memory manager is
employed, which manages RAM memory pages and file mappings into the
virtual address space, tracks page protection and handles SIGSEGV
appropriately.


The gist of virtual memory-manager is this:


https://lab.nexedi.cn/kirr/wendelin.core/blob/master/include/wendelin/bigfile/virtmem.h
https://lab.nexedi.cn/kirr/wendelin.core/blob/master/bigfile/virtmem.c  
(vma_on_pagefault)


For operations it currently needs

- establishing virtual memory areas and connecting them to tracking

- changing pages protection

PROT_NONE or absent                              - initially
PROT_NONE      -> PROT_READ                      - after read
PROT_READ      -> PROT_READWRITE                 - after write
PROT_READWRITE -> PROT_READ                      - after commit
PROT_READWRITE -> PROT_NONE or absent (again)    - after abort
PROT_READ      -> PROT_NONE or absent (again)    - on reclaim

- working with aliasable memory (thus taken from tmpfs)

there could be two overlapping-in-file mappings of a file (array)
requested at different times, and changes from one mapping should
propagate to the other - for common parts only 1 page should
be memory-mapped into 2 places in the address space.

so what is currently lacking on the userfaultfd side is:

- ability to remove / make PROT_NONE already mapped pages
  (UFFDIO_REMAP was recently dropped)

- ability to arbitrarily change pages protection (e.g. RW -> R)

- inject aliasable memory from tmpfs (or better hugetlbfs) into
  several places (UFFDIO_REMAP + some mapping copy semantic).


The code is ugly because it is only a prototype. You can clone/read it
all from here:

https://lab.nexedi.cn/kirr/wendelin.core

The virtual memory manager even has tests, and from them it can be seen
how the system is supposed to work (after each access - what pages are
mapped where and how):


https://lab.nexedi.cn/kirr/wendelin.core/blob/master/bigfile/tests/test_virtmem.c

The performance currently is not great, partly because of page clearing
when getting RAM from tmpfs, and partly because of mprotect/SIGSEGV/vma
overhead and other dumb things on my side.

I still wanted to show the case, as userfaultfd here has the potential to
remove the kernel-related overhead.

Thanks beforehand for feedback,

Kirill


P.S. some context

http://www.wendelin.io/NXD-Wendelin.Core.Non.Secret/asEntireHTML

Announcing qboot, a minimal x86 firmware for QEMU

2015-05-21 Thread Paolo Bonzini
Some of you may have heard about the Clear Containers initiative from
Intel, which couples KVM with various kernel tricks to create extremely
lightweight virtual machines.  The experimental Clear Containers setup
requires only 18-20 MB to launch a virtual machine, and needs about 60
ms to boot.

Now, as all of you probably know, QEMU is great for running Windows or
legacy Linux guests, but that flexibility comes at a hefty price. Not
only does all of the emulation consume memory, it also requires some
form of low-level firmware in the guest as well. All of this adds quite
a bit to virtual-machine startup times (500 to 700 milliseconds is not
unusual).

Right?  In fact, it's for this reason that Clear Containers uses kvmtool
instead of QEMU.

No, wrong!  In fact, reporting bad performance is pretty much the same
as throwing down the gauntlet.

Enter qboot, a minimal x86 firmware that runs on QEMU and, together with
a slimmed-down QEMU configuration, boots a virtual machine in 40
milliseconds[2] on an Ivy Bridge Core i7 processor.

qboot is available at git://github.com/bonzini/qboot.git.  In all the
glory of its 8KB of code, it brings together various existing open
source components:

* a minimal (really minimal) 16-bit BIOS runtime based on kvmtool's own BIOS

* a couple hardware initialization routines written mostly from scratch
but with good help from SeaBIOS source code

* a minimal 32-bit libc based on kvm-unit-tests

* the Linux loader from QEMU itself

The repository has more information on how to achieve fast boot times,
and examples of using qboot.  Right now there is a limit of 8 MB for
vmlinuz+initrd+cmdline, which however should be enough for initrd-less
containers.

The first commit to qboot is more or less 24 hours old, so there is
definitely more work to do, in particular to extract ACPI tables from
QEMU and present them to the guest.  This is probably another day of
work or so, and it will enable multiprocessor guests with little or no
impact on the boot times.  SMBIOS information is also available from QEMU.

On the QEMU side, there is no support yet for persistent memory and the
NFIT tables from ACPI 6.0.  Once that (and ACPI support) is added, qboot
will automatically start using it.

Happy hacking!

Paolo


[PATCH] KVM: PPC: check for lookup_linux_ptep() returning NULL

2015-05-21 Thread Laurentiu Tudor
If passed a larger page size lookup_linux_ptep()
may fail, so add a check for that and bail out
if that's the case.
This was found with the help of a static
code analysis tool.

Signed-off-by: Mihai Caraman mihai.cara...@freescale.com
Signed-off-by: Laurentiu Tudor laurentiu.tu...@freescale.com
Cc: Scott Wood scottw...@freescale.com
---
based on https://github.com/agraf/linux-2.6.git kvm-ppc-next

 arch/powerpc/kvm/e500_mmu_host.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index cc536d4..249c816 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -469,7 +469,7 @@ static inline int kvmppc_e500_shadow_map(struct 
kvmppc_vcpu_e500 *vcpu_e500,
 
pgdir = vcpu_e500->vcpu.arch.pgdir;
ptep = lookup_linux_ptep(pgdir, hva, &tsize_pages);
-   if (pte_present(*ptep))
+   if (ptep && pte_present(*ptep))
wimg = (*ptep >> PTE_WIMGE_SHIFT) & MAS2_WIMGE_MASK;
else {
if (printk_ratelimit())
-- 
1.8.3.1


KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR

2015-05-21 Thread Marcelo Tosatti

Initialize kvmclock base, on kvmclock system MSR write time, 
so that the guest sees kvmclock counting from zero.

This matches baremetal behaviour when kvmclock in guest
sets sched clock stable.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cc2c759..ea40d24 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2188,6 +2188,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
vcpu->requests);
 
ka->boot_vcpu_runs_old_kvmclock = tmp;
+
+   ka->kvmclock_offset = get_kernel_ns();
}
 
vcpu->arch.time = data;
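As a rough user-space model of the patch's intent (all names are stand-ins for the kernel's get_kernel_ns() and per-VM state, and the sign convention of the real kvmclock_offset may differ): capturing "now" at the moment vcpu0 writes the system-time MSR makes the clock the guest subsequently reads start near zero, instead of at the host's monotonic clock value.

```c
#include <stdint.h>

/* Fake host monotonic clock, standing in for get_kernel_ns(). */
static uint64_t host_ns;
static uint64_t get_kernel_ns(void) { return host_ns; }

/* Per-VM state, standing in for struct kvm_arch. */
struct kvm_arch { uint64_t kvmclock_offset; };

/* At MSR-write time: remember the current host time as the base. */
static void msr_write_system_time(struct kvm_arch *ka)
{
    ka->kvmclock_offset = get_kernel_ns();
}

/* What the guest then observes: host time relative to that base. */
static uint64_t guest_clock_ns(const struct kvm_arch *ka)
{
    return get_kernel_ns() - ka->kvmclock_offset;
}
```

With this base in place, the 3376355-second jump Sasha reported (the host's uptime leaking into the guest's sched_clock) disappears, because the guest's first reading is zero.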


Re: kvm: odd time values since kvmclock: set scheduler clock stable

2015-05-21 Thread Marcelo Tosatti
On Mon, May 18, 2015 at 10:13:03PM -0400, Sasha Levin wrote:
 On 05/18/2015 10:02 PM, Sasha Levin wrote:
  On 05/18/2015 08:13 PM, Marcelo Tosatti wrote:
  On Mon, May 18, 2015 at 07:45:41PM -0400, Sasha Levin wrote:
  On 05/18/2015 06:39 PM, Marcelo Tosatti wrote:
  On Tue, May 12, 2015 at 07:17:24PM -0400, Sasha Levin wrote:
  Hi all,
 
  I'm seeing odd jump in time values during boot of a KVM guest:
 
  [...]
  [0.00] tsc: Detected 2260.998 MHz processor
  [3376355.247558] Calibrating delay loop (skipped) preset value..
  [...]
 
  I've bisected it to:
 
  Paolo, Sasha,
 
  Although this might seem undesirable, there is no requirement 
  for sched_clock to initialize at 0:
 
  
   *
   * There is no strict promise about the base, although it tends to 
  start
   * at 0 on boot (but people really shouldn't rely on that).
   *
  
 
  Sasha, are you seeing any problem other than the apparent time jump?
 
  Nope, but I've looked at it again and it seems that it jumps to the 
  host's
  clock (that is, in the example above the 3376355 value was the host's 
  clock
  value).
 
 
  Thanks,
  Sasha
  Sasha, thats right. Its the host monotonic clock.
  
  It's worth figuring out if (what) userspace breaks on that. I know it says 
  that
  you shouldn't rely on that, but I'd happily place a bet on at least one 
  userspace
  treating it as seconds since boot or something similar.
 
 Didn't need to go far... In the guest:
 
 # date
 Tue May 19 02:11:46 UTC 2015
 # echo hi > /dev/kmsg
 [3907533.080112] hi
 # dmesg -T
 [Fri Jul  3 07:33:41 2015] hi

Sasha,

Can you give the suggested patch (hypervisor patch...) a try please?
(with a patched guest, obviously).

KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system
MSR




Re: Announcing qboot, a minimal x86 firmware for QEMU

2015-05-21 Thread Yong Wang
On Thu, May 21, 2015 at 03:51:43PM +0200, Paolo Bonzini wrote:
 On the QEMU side, there is no support yet for persistent memory and the
 NFIT tables from ACPI 6.0.  Once that (and ACPI support) is added, qboot
 will automatically start using it.
 

We are working on adding NFIT support into the virtual BIOS.



Re: [PATCH] KVM: PPC: check for lookup_linux_ptep() returning NULL

2015-05-21 Thread Scott Wood
On Thu, 2015-05-21 at 16:26 +0300, Laurentiu Tudor wrote:
 If passed a larger page size lookup_linux_ptep()
 may fail, so add a check for that and bail out
 if that's the case.
 This was found with the help of a static
 code analysis tool.
 
 Signed-off-by: Mihai Caraman mihai.cara...@freescale.com
 Signed-off-by: Laurentiu Tudor laurentiu.tu...@freescale.com
 Cc: Scott Wood scottw...@freescale.com
 ---
 based on https://github.com/agraf/linux-2.6.git kvm-ppc-next
 
  arch/powerpc/kvm/e500_mmu_host.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Scott Wood scottw...@freescale.com

-Scott




Re: [Qemu-devel] Announcing qboot, a minimal x86 firmware for QEMU

2015-05-21 Thread Avi Kivity

On 05/21/2015 04:51 PM, Paolo Bonzini wrote:

Some of you may have heard about the Clear Containers initiative from
Intel, which couples KVM with various kernel tricks to create extremely
lightweight virtual machines.  The experimental Clear Containers setup
requires only 18-20 MB to launch a virtual machine, and needs about 60
ms to boot.

Now, as all of you probably know, QEMU is great for running Windows or
legacy Linux guests, but that flexibility comes at a hefty price. Not
only does all of the emulation consume memory, it also requires some
form of low-level firmware in the guest as well. All of this adds quite
a bit to virtual-machine startup times (500 to 700 milliseconds is not
unusual).

Right?  In fact, it's for this reason that Clear Containers uses kvmtool
instead of QEMU.

No, wrong!  In fact, reporting bad performance is pretty much the same
as throwing down the gauntlet.

Enter qboot, a minimal x86 firmware that runs on QEMU and, together with
a slimmed-down QEMU configuration, boots a virtual machine in 40
milliseconds[2] on an Ivy Bridge Core i7 processor.

qboot is available at git://github.com/bonzini/qboot.git.  In all the
glory of its 8KB of code, it brings together various existing open
source components:

* a minimal (really minimal) 16-bit BIOS runtime based on kvmtool's own BIOS

* a couple hardware initialization routines written mostly from scratch
but with good help from SeaBIOS source code

* a minimal 32-bit libc based on kvm-unit-tests

* the Linux loader from QEMU itself

The repository has more information on how to achieve fast boot times,
and examples of using qboot.  Right now there is a limit of 8 MB for
vmlinuz+initrd+cmdline, which however should be enough for initrd-less
containers.

The first commit to qboot is more or less 24 hours old, so there is
definitely more work to do, in particular to extract ACPI tables from
QEMU and present them to the guest.  This is probably another day of
work or so, and it will enable multiprocessor guests with little or no
impact on the boot times.  SMBIOS information is also available from QEMU.

On the QEMU side, there is no support yet for persistent memory and the
NFIT tables from ACPI 6.0.  Once that (and ACPI support) is added, qboot
will automatically start using it.

Happy hacking!



Lovely!

Note you have memcpy.o instead of memcpy.c.


Re: [PATCH 00/23] userfaultfd v4

2015-05-21 Thread Andrea Arcangeli
Hi Kirill,

On Thu, May 21, 2015 at 04:11:11PM +0300, Kirill Smelkov wrote:
 Sorry for maybe speaking up too late, but here is additional real

Not too late, in fact I don't think there's any change required for
this at this stage, but it'd be great if you could help me to review.

 Since arrays can be large, it would be slow and thus not practical to
[..]
 So I've implemented a scheme where array data is initially PROT_READ
 protected, then we catch SIGSEGV, if it is write and area belongs to array

In the case of postcopy live migration (for qemu and/or containers) and
postcopy live snapshotting, splitting the vmas is not an option
because we may run out of them.

If your PROT_READ areas are limited perhaps this isn't an issue but
with hundreds GB guests (currently plenty in production) that needs to
live migrate fully reliably and fast, the vmas could exceed the limit
if we were to use mprotect. If your arrays are very large and the
PROT_READ areas aren't limited, using userfaultfd isn't only an
optimization for you, it's actually a must to avoid a potential
-ENOMEM.

 Also, since arrays could be large - bigger than RAM, and only sparse
 parts of it could be needed to get needed information, for reading it
 also makes sense to lazily load data in SIGSEGV handler with initial
 PROT_NONE protection.

Similarly I heard somebody wrote a fastresume to load the suspended
(on disk) guest ram using userfaultfd. That is a slightly less
fundamental case than postcopy because you could do it also with
MAP_SHARED, but it's still interesting in allowing to compress or
decompress the suspended ram on the fly with lz4 for example,
something MAP_PRIVATE/MAP_SHARED wouldn't do (plus there's the
additional benefit of not having an orphaned inode left open even if
the file is deleted, that prevents to unmount the filesystem for the
whole lifetime of the guest).

 This is very similar to how memory mapped files work, but adds
 transactionality which, as far as I know, is not provided by any
 currently in-kernel filesystem on Linux.

That's another benefit yes.

 The gist of virtual memory-manager is this:
 
 
 https://lab.nexedi.cn/kirr/wendelin.core/blob/master/include/wendelin/bigfile/virtmem.h
 https://lab.nexedi.cn/kirr/wendelin.core/blob/master/bigfile/virtmem.c  
 (vma_on_pagefault)

I'll check it more in detail ASAP, thanks for the pointers!

 For operations it currently needs
 
 - establishing virtual memory areas and connecting to tracking it

That's the UFFDIO_REGISTER/UNREGISTER.

 - changing pages protection
 
 PROT_NONE or absent - initially

absent is what works with -mm already. The lazy loading already works.

 PROT_NONE -> PROT_READ - after read

Current UFFDIO_COPY will map it using vma->vm_page_prot.

We'll need a new flag for UFFDIO_COPY to map it readonly. This is
already contemplated:

/*
 * There will be a wrprotection flag later that allows to map
 * pages wrprotected on the fly. And such a flag will be
 * available if the wrprotection ioctl are implemented for the
 * range according to the uffdio_register.ioctls.
 */
#define UFFDIO_COPY_MODE_DONTWAKE	((__u64)1<<0)
__u64 mode;

If the memory protection framework exists (either through the
uffdio_register.ioctl out value, or through uffdio_api.features
out-only value) you can pass a new flag (MODE_WP) above to transition
from absent to PROT_READ.

 PROT_READ -> PROT_READWRITE - after write

This will need to add UFFDIO_MPROTECT.

 PROT_READWRITE -> PROT_READ - after commit

UFFDIO_MPROTECT again (but harder if going from rw to ro, because of a
slight mess to solve with regard to FAULT_FLAG_TRIED, in case you want
to run this UFFDIO_MPROTECT without stopping the threads that are
accessing the memory concurrently).

And this should only work if the uffdio_register.mode had MODE_WP set,
so we don't run into the races created by COWs (gup vs fork race).

 PROT_READWRITE -> PROT_NONE or absent (again) - after abort

UFFDIO_MPROTECT again, but you won't be able to read the page contents
inside the memory manager thread (the one working with
userfaultfd).

The manager is at all times forbidden to touch the memory it is
tracking with userfaultfd (if it does, it'll deadlock, but kill -9 will
get rid of it). gdb, ironically, wouldn't hang, because it uses an
underoptimized access_process_vm: FAULT_FLAG_RETRY won't be set
in handle_userfault in the gdb context, and it'll just receive a
sigbus if the user touches the memory by mistake. Even though it
will hang later as get_user_pages_locked|unlocked gets used there too,
kill -9 would solve gdb too.

Back to the problem of accessing the UFFDIO_MPROTECT(PROT_NONE)
memory: to do that a new ioctl should be required. I'd rather not go
back to the route of UFFDIO_REMAP, but it could copy the 

Re: [PATCH 08/12] KVM: x86: save/load state on SMM switch

2015-05-21 Thread Radim Krčmář
2015-05-21 18:23+0200, Paolo Bonzini:
 On 21/05/2015 18:20, Radim Krčmář wrote:
 2. NMI -> SMI -> IRET -> RSM -> NMI
 NMI is injected;  I think it shouldn't be ... have you based this
 behavior on the 3rd paragraph of SDM 34.8 NMI HANDLING WHILE IN SMM
 (A special case [...])?
 
 Yes.

Well, if I were to go lawyer

 [...] saves the SMRAM state save map but does not save the attribute to
 keep NMI interrupts disabled.

NMI masking is a bit, so it'd be really wasteful not to have an
attribute to keep NMI enabled in the same place ...

  Potentially, an NMI could be latched (while in SMM or upon exit) and
  serviced upon exit [...]

This Potentially could be in the sense that the whole 3rd paragraph is
only applicable to some ancient SMM design :)

The 1st paragraph has quite clear sentence:

  If NMIs were blocked before the SMI occurred, they are blocked after
  execution of RSM.

so I'd just ignore the 3rd paragraph ...

And the APM 2:10.3.3 Exceptions and Interrupts
  NMI—If an NMI occurs while the processor is in SMM, it is latched by
  the processor, but the NMI handler is not invoked until the processor
  leaves SMM with the execution of an RSM instruction.  A pending NMI
  causes the handler to be invoked immediately after the RSM completes
  and before the first instruction in the interrupted program is
  executed.

  An SMM handler can unmask NMI interrupts by simply executing an IRET.
  Upon completion of the IRET instruction, the processor recognizes the
  pending NMI, and transfers control to the NMI handler. Once an NMI is
  recognized within SMM using this technique, subsequent NMIs are
  recognized until SMM is exited. Later SMIs cause NMIs to be masked,
  until the SMM handler unmasks them.

makes me think that we should unmask them unconditionally or that SMM
doesn't do anything with NMI masking.

If we can choose, less NMI nesting seems like a good idea.


Re: Announcing qboot, a minimal x86 firmware for QEMU

2015-05-21 Thread Jan Kiszka
On 2015-05-21 15:51, Paolo Bonzini wrote:
 Some of you may have heard about the Clear Containers initiative from
 Intel, which couples KVM with various kernel tricks to create extremely
 lightweight virtual machines.  The experimental Clear Containers setup
 requires only 18-20 MB to launch a virtual machine, and needs about 60
 ms to boot.
 
 Now, as all of you probably know, QEMU is great for running Windows or
 legacy Linux guests, but that flexibility comes at a hefty price. Not
 only does all of the emulation consume memory, it also requires some
 form of low-level firmware in the guest as well. All of this adds quite
 a bit to virtual-machine startup times (500 to 700 milliseconds is not
 unusual).
 
 Right?  In fact, it's for this reason that Clear Containers uses kvmtool
 instead of QEMU.
 
 No, wrong!  In fact, reporting bad performance is pretty much the same
 as throwing down the gauntlet.
 
 Enter qboot, a minimal x86 firmware that runs on QEMU and, together with
 a slimmed-down QEMU configuration, boots a virtual machine in 40
 milliseconds[2] on an Ivy Bridge Core i7 processor.
 
 qboot is available at git://github.com/bonzini/qboot.git.  In all the
 glory of its 8KB of code, it brings together various existing open
 source components:
 
 * a minimal (really minimal) 16-bit BIOS runtime based on kvmtool's own BIOS
 
 * a couple hardware initialization routines written mostly from scratch
 but with good help from SeaBIOS source code
 
 * a minimal 32-bit libc based on kvm-unit-tests
 
 * the Linux loader from QEMU itself
 
 The repository has more information on how to achieve fast boot times,
 and examples of using qboot.  Right now there is a limit of 8 MB for
 vmlinuz+initrd+cmdline, which however should be enough for initrd-less
 containers.
 
 The first commit to qboot is more or less 24 hours old, so there is
 definitely more work to do, in particular to extract ACPI tables from
 QEMU and present them to the guest.  This is probably another day of
 work or so, and it will enable multiprocessor guests with little or no
 impact on the boot times.  SMBIOS information is also available from QEMU.
 
 On the QEMU side, there is no support yet for persistent memory and the
 NFIT tables from ACPI 6.0.  Once that (and ACPI support) is added, qboot
 will automatically start using it.
 
 Happy hacking!

Incidentally, I did something similar these days to get Linux booting in
Jailhouse non-root cells, i.e. without BIOS and almost no hardware except
memory, CPUs and PCI devices. Yes, it requires a bit of PV for Linux, but
really little. Not aiming for speed (yet), just for less hypervisor
work. Maybe there are some milliseconds to save when cutting off more
hardware in an analogous way...

PV pat^Whacks are here:
http://git.kiszka.org/?p=linux.git;a=shortlog;h=refs/heads/queues/jailhouse.
The boot loader is a combination of a python script [1] (result can be
saved and reused - replaces ACPI) and really few lines of code [2][3].

Jan

[1]
https://github.com/siemens/jailhouse/blob/wip/linux-x86-inmate/tools/jailhouse-cell-linux
[2]
https://github.com/siemens/jailhouse/blob/wip/linux-x86-inmate/inmates/lib/x86/header.S
[3]
https://github.com/siemens/jailhouse/blob/wip/linux-x86-inmate/inmates/tools/x86/linux-loader.c

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


Re: [PATCH 08/12] KVM: x86: save/load state on SMM switch

2015-05-21 Thread Radim Krčmář
2015-05-08 13:20+0200, Paolo Bonzini:
 The big ugly one.  This patch adds support for switching in and out of
 system management mode, respectively upon receiving KVM_REQ_SMI and upon
 executing a RSM instruction.  Both 32- and 64-bit formats are supported
 for the SMM state save area.
 
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 ---
   RFC-v1: shift access rights left by 8 for 32-bit format
move tracepoint to kvm_set_hflags
fix NMI handling
 ---
 diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
 @@ -2262,12 +2262,258 @@ static int em_lseg(struct x86_emulate_ctxt *ctxt)
 +static int rsm_load_seg_32(struct x86_emulate_ctxt *ctxt, u64 smbase, int n)
 +{
 + struct desc_struct desc;
 + int offset;
 + u16 selector;
 +
 + selector = get_smstate(u32, smbase, 0x7fa8 + n * 4);

(u16, SDM says that most significant 2 bytes are reserved anyway.)

 + if (n < 3)
 + offset = 0x7f84 + n * 12;
 + else
 + offset = 0x7f2c + (n - 3) * 12;

These numbers made me look where the hell is that defined and the
easiest reference seemed to be http://www.sandpile.org/x86/smm.htm,
which has several layouts ... I hopefully checked the intersection of
various Intels and AMDs.

 + set_desc_base(&desc,  get_smstate(u32, smbase, offset + 8));
 + set_desc_limit(&desc, get_smstate(u32, smbase, offset + 4));
 + rsm_set_desc_flags(&desc, get_smstate(u32, smbase, offset));

(There wasn't a layout where this would be right, so we could save the
 shifting of those flags in 64 bit mode.  Intel P6 was close, and they
 had only 2 bytes for access right, which means they weren't shifted.)

 +static int rsm_load_state_32(struct x86_emulate_ctxt *ctxt, u64 smbase)
 +{
 + cr0 =  get_smstate(u32, smbase, 0x7ffc);

(I wonder why they made 'smbase + 0x8000' the default offset in SDM,
 when 'smbase + 0xfe00' or 'smbase' would work as well.)

 +static int rsm_load_state_64(struct x86_emulate_ctxt *ctxt, u64 smbase)
 +{
 + struct desc_struct desc;
 + u16 selector;
 + selector =  get_smstate(u32, smbase, 0x7e90);
 + rsm_set_desc_flags(&desc,   get_smstate(u32, smbase, 0x7e92) << 8);

(Both reads should be u16.  Luckily, extra data gets ignored.)

  static int em_rsm(struct x86_emulate_ctxt *ctxt)
  {
 + if ((ctxt->emul_flags & X86EMUL_SMM_INSIDE_NMI_MASK) == 0)
 + ctxt->ops->set_nmi_mask(ctxt, false);

NMI is always fun ... let's see two cases:
1. NMI -> SMI -> RSM -> NMI
NMI is not injected;  ok.

2. NMI -> SMI -> IRET -> RSM -> NMI
NMI is injected;  I think it shouldn't be ... have you based this
behavior on the 3rd paragraph of SDM 34.8 NMI HANDLING WHILE IN SMM
(A special case [...])?

Why I think we should restore NMI mask on RSM:
- It's consistent with SMI -> IRET -> NMI -> RSM -> NMI (where we,
  I think correctly, unmask NMIs) and the idea that SMM tries to be
  transparent (but maybe they didn't care about retarded SMI handlers).
- APM 2:15.30.3 SMM_CTL MSR (C001_0116h)
  • ENTER—Bit 1. Enter SMM: map the SMRAM memory areas, record whether
NMI was currently blocked and block further NMI and SMI interrupts.
  • EXIT—Bit 3. Exit SMM: unmap the SMRAM memory areas, restore the
previous masking status of NMI and unconditionally reenable SMI.
  
  The MSR should mimic real SMM signals and does restore the NMI mask.


Re: [Qemu-devel] Announcing qboot, a minimal x86 firmware for QEMU

2015-05-21 Thread Paolo Bonzini


On 21/05/2015 17:48, Avi Kivity wrote:
 Lovely!
 
 Note you have memcpy.o instead of memcpy.c.

Doh, and it's not used anyway.  Check the repository, and let me know if
OSv boots with it (it probably needs ACPI; Linux doesn't boot virtio
without ACPI).

Paolo


Re: [PATCH 08/12] KVM: x86: save/load state on SMM switch

2015-05-21 Thread Paolo Bonzini


On 21/05/2015 18:20, Radim Krčmář wrote:
 2. NMI -> SMI -> IRET -> RSM -> NMI
 NMI is injected;  I think it shouldn't be ... have you based this
 behavior on the 3rd paragraph of SDM 34.8 NMI HANDLING WHILE IN SMM
 (A special case [...])?

Yes.

 Why I think we should restore NMI mask on RSM:
 - It's consistent with SMI -> IRET -> NMI -> RSM -> NMI (where we,
   I think correctly, unmask NMIs)

Yes, we do.

 and the idea that SMM tries to be
   transparent (but maybe they didn't care about retarded SMI handlers).

That's my reading of that paragraph of the manual. :)

 - APM 2:15.30.3 SMM_CTL MSR (C001_0116h)
   • ENTER—Bit 1. Enter SMM: map the SMRAM memory areas, record whether
 NMI was currently blocked and block further NMI and SMI interrupts.
   • EXIT—Bit 3. Exit SMM: unmap the SMRAM memory areas, restore the
 previous masking status of NMI and unconditionally reenable SMI.
   
   The MSR should mimic real SMM signals and does restore the NMI mask.

No idea...  My implementation does restore the previous masking status,
but only if it was unmasked.

Paolo


Re: [PATCH 06/12] KVM: x86: API changes for SMM support

2015-05-21 Thread Radim Krčmář
2015-05-21 16:59+0200, Paolo Bonzini:
 On 21/05/2015 16:49, Radim Krčmář wrote:
 2015-05-08 13:20+0200, Paolo Bonzini:
 diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
 @@ -202,7 +202,7 @@ struct kvm_run {
 __u32 exit_reason;
 __u8 ready_for_interrupt_injection;
 __u8 if_flag;
 -   __u8 padding2[2];
 +   __u16 flags;
 
 (It got lost last review and I'd really like to know ...
  what is the advantage of giving both bytes to flags?)
 
 No advantage.  You just should leave padding2[1] in the middle so that
 the offset of run->padding2[0] doesn't change.

I don't get that.  The position of padding should be decided by
comparing probabilities of extending 'if_flag' and 'flags'.

  Since it's not obvious
 I gave two bytes to flags, but I can do it either way.

if_flag seems to be set in stone as one bit, so I'd vote for

  __u8 flags;
  __u8 padding2;

(Or 'padding3', to prevent the same class of errors that removing it
 altogether does;  which we didn't do for other trailing padding).

For there isn't much space left in struct kvm ...


Re: [PATCH 08/12] KVM: x86: save/load state on SMM switch

2015-05-21 Thread Paolo Bonzini


On 21/05/2015 18:20, Radim Krčmář wrote:
 
  +  set_desc_base(&desc,  get_smstate(u32, smbase, offset + 8));
  +  set_desc_limit(&desc, get_smstate(u32, smbase, offset + 4));
  +  rsm_set_desc_flags(&desc, get_smstate(u32, smbase, offset));
 (There wasn't a layout where this would be right, so we could save the
  shifting of those flags in 64 bit mode.  Intel P6 was close, and they
  had only 2 bytes for access right, which means they weren't shifted.)

Check the AMD architecture manual.

Paolo


Re: [PATCH 08/12] KVM: x86: save/load state on SMM switch

2015-05-21 Thread Radim Krčmář
2015-05-21 18:21+0200, Paolo Bonzini:
 On 21/05/2015 18:20, Radim Krčmář wrote:
  
   +set_desc_base(&desc,  get_smstate(u32, smbase, offset + 8));
   +set_desc_limit(&desc, get_smstate(u32, smbase, offset + 4));
   +rsm_set_desc_flags(&desc, get_smstate(u32, smbase, offset));
  (There wasn't a layout where this would be right, so we could save the
   shifting of those flags in 64 bit mode.  Intel P6 was close, and they
   had only 2 bytes for access right, which means they weren't shifted.)
 
 Check the AMD architecture manual.

I must be blind, is there more than Table 10-2?

(And according to the AMD manual, we are overwriting GDT and IDT base at
 offset 0xff88 and 0xff94 with ES and CS data, so it's not the best
 reference for this case ...)