Re: What's the difference between EPT_MISCONFIG and EPT_VIOLATION?

2014-12-03 Thread Gleb Natapov
On Wed, Dec 03, 2014 at 06:12:10PM +0800, Zhang Haoyu wrote:
> > > Hi,
> > > 
> > > EXIT_REASON_EPT_VIOLATION's corresponding handler is handle_ept_violation(),
> > > and EXIT_REASON_EPT_MISCONFIG's corresponding handler is handle_ept_misconfig(),
> > > what's the difference between them?
> > > 
> > > I read SDM-3C 28.2.3 "EPT-Induced VM Exits" and found the description below:
> > > "An EPT misconfiguration occurs when, in the course of translating 
> > > a guest-physical address, the logical processor encounters an EPT 
> > > paging-structure entry that contains an unsupported value. An EPT 
> > > violation occurs when there is no EPT misconfiguration but the EPT 
> > > paging-structure entries disallow an access using the guest physical
> > > address."
> > > 
> > > According to the above description, EPT_MISCONFIG results from erroneous
> > > settings, but judging from its exit handler handle_ept_misconfig(), it
> > > seems that handle_ept_misconfig() handles MMIO page faults. I'm really
> > > confused; I think I'm missing something. Any advice?
> > > 
> > EXIT_REASON_EPT_VIOLATION is similar to a "page not present" page fault;
> > EXIT_REASON_EPT_MISCONFIG is similar to a "reserved bit set" page fault.
> > handle_ept_misconfig() handles MMIO page faults because KVM has an
> > optimization that uses reserved bits to mark MMIO regions.
> >
> Thanks, Gleb.
> Where does KVM use the reserved bits to mark MMIO regions?
> 
arch/x86/kvm/mmu.c:mark_mmio_spte
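
For illustration, here is a minimal sketch of the idea (not the actual KVM
source; the reserved-bit mask value is an assumption): KVM encodes the gfn
and access bits into an SPTE whose reserved bits are set, so any guest
access to that page faults with EXIT_REASON_EPT_MISCONFIG instead of
EPT_VIOLATION, and KVM can recognize it as MMIO without walking the
memslots.

        /* hypothetical reserved-bit pattern; the real mask is CPU dependent */
        #define MMIO_SPTE_MASK  0x3ULL

        static void mark_mmio_spte_sketch(u64 *sptep, u64 gfn, unsigned access)
        {
                /* store the gfn and access rights inside an invalid SPTE */
                *sptep = MMIO_SPTE_MASK | access | (gfn << PAGE_SHIFT);
        }

        static bool is_mmio_spte_sketch(u64 spte)
        {
                return (spte & MMIO_SPTE_MASK) == MMIO_SPTE_MASK;
        }

On an MMIO access the misconfig handler can then decode the gfn back out of
the faulting SPTE and emulate the access directly, which is cheaper than
taking an EPT violation and searching the memslots.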

--
Gleb.


Re: What's the difference between EPT_MISCONFIG and EPT_VIOLATION?

2014-12-03 Thread Gleb Natapov
On Wed, Dec 03, 2014 at 05:50:33PM +0800, Zhang Haoyu wrote:
> Hi,
> 
> EXIT_REASON_EPT_VIOLATION's corresponding handler is handle_ept_violation(),
> and EXIT_REASON_EPT_MISCONFIG's corresponding handler is handle_ept_misconfig(),
> what's the difference between them?
> 
> I read SDM-3C 28.2.3 "EPT-Induced VM Exits" and found the description below:
> "An EPT misconfiguration occurs when, in the course of translating 
> a guest-physical address, the logical processor encounters an EPT 
> paging-structure entry that contains an unsupported value. An EPT 
> violation occurs when there is no EPT misconfiguration but the EPT 
> paging-structure entries disallow an access using the guest physical
> address."
> 
> According to the above description, EPT_MISCONFIG results from erroneous
> settings, but judging from its exit handler handle_ept_misconfig(), it
> seems that handle_ept_misconfig() handles MMIO page faults. I'm really
> confused; I think I'm missing something. Any advice?
> 
EXIT_REASON_EPT_VIOLATION is similar to a "page not present" page fault;
EXIT_REASON_EPT_MISCONFIG is similar to a "reserved bit set" page fault.
handle_ept_misconfig() handles MMIO page faults because KVM has an
optimization that uses reserved bits to mark MMIO regions.

--
Gleb.


Re: [PATCH] KVM: ia64: remove

2014-11-19 Thread Gleb Natapov
On Wed, Nov 19, 2014 at 10:05:43PM +0100, Paolo Bonzini wrote:
> KVM for ia64 has been marked as broken not just once, but twice even,
> and the last patch from the maintainer is now roughly 5 years old.
> Time for it to rest in peace.
> 
Acked-by: Gleb Natapov 

Next step is to move ioapic bits into arch :)

> Signed-off-by: Paolo Bonzini 
> ---
>   The patch was edited to keep its size decent, by dropping
>   all the removed lines from the deleted files.
> 
>  MAINTAINERS |9 -
>  arch/ia64/Kconfig   |3 -
>  arch/ia64/Makefile  |1 -
>  arch/ia64/include/asm/kvm_host.h|  609 --
>  arch/ia64/include/asm/pvclock-abi.h |   48 -
>  arch/ia64/include/uapi/asm/kvm.h|  268 -
>  arch/ia64/kvm/Kconfig   |   66 --
>  arch/ia64/kvm/Makefile  |   67 --
>  arch/ia64/kvm/asm-offsets.c |  241 
>  arch/ia64/kvm/irq.h |   33 -
>  arch/ia64/kvm/kvm-ia64.c| 1942 --
>  arch/ia64/kvm/kvm_fw.c  |  674 ---
>  arch/ia64/kvm/kvm_lib.c |   21 -
>  arch/ia64/kvm/kvm_minstate.h|  266 -
>  arch/ia64/kvm/lapic.h   |   30 -
>  arch/ia64/kvm/memcpy.S  |1 -
>  arch/ia64/kvm/memset.S  |1 -
>  arch/ia64/kvm/misc.h|   94 --
>  arch/ia64/kvm/mmio.c|  336 --
>  arch/ia64/kvm/optvfault.S   | 1090 -
>  arch/ia64/kvm/process.c | 1024 
>  arch/ia64/kvm/trampoline.S  | 1038 
>  arch/ia64/kvm/vcpu.c| 2209 ---
>  arch/ia64/kvm/vcpu.h|  752 
>  arch/ia64/kvm/vmm.c |   99 --
>  arch/ia64/kvm/vmm_ivt.S | 1392 --
>  arch/ia64/kvm/vti.h |  290 -
>  arch/ia64/kvm/vtlb.c|  640 --
>  virt/kvm/ioapic.c   |  5 -
>  virt/kvm/ioapic.h   |  1 -
>  virt/kvm/irq_comm.c | 22 -
>  31 files changed, 13272 deletions(-)
>  delete mode 100644 arch/ia64/include/asm/kvm_host.h
>  delete mode 100644 arch/ia64/include/asm/pvclock-abi.h
>  delete mode 100644 arch/ia64/include/uapi/asm/kvm.h
>  delete mode 100644 arch/ia64/kvm/Kconfig
>  delete mode 100644 arch/ia64/kvm/Makefile
>  delete mode 100644 arch/ia64/kvm/asm-offsets.c
>  delete mode 100644 arch/ia64/kvm/irq.h
>  delete mode 100644 arch/ia64/kvm/kvm-ia64.c
>  delete mode 100644 arch/ia64/kvm/kvm_fw.c
>  delete mode 100644 arch/ia64/kvm/kvm_lib.c
>  delete mode 100644 arch/ia64/kvm/kvm_minstate.h
>  delete mode 100644 arch/ia64/kvm/lapic.h
>  delete mode 100644 arch/ia64/kvm/memcpy.S
>  delete mode 100644 arch/ia64/kvm/memset.S
>  delete mode 100644 arch/ia64/kvm/misc.h
>  delete mode 100644 arch/ia64/kvm/mmio.c
>  delete mode 100644 arch/ia64/kvm/optvfault.S
>  delete mode 100644 arch/ia64/kvm/process.c
>  delete mode 100644 arch/ia64/kvm/trampoline.S
>  delete mode 100644 arch/ia64/kvm/vcpu.c
>  delete mode 100644 arch/ia64/kvm/vcpu.h
>  delete mode 100644 arch/ia64/kvm/vmm.c
>  delete mode 100644 arch/ia64/kvm/vmm_ivt.S
>  delete mode 100644 arch/ia64/kvm/vti.h
>  delete mode 100644 arch/ia64/kvm/vtlb.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index a12edf2624e5..56705138ca74 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -5244,15 +5244,6 @@ S:   Supported
>  F: arch/powerpc/include/asm/kvm*
>  F: arch/powerpc/kvm/
> 
> -KERNEL VIRTUAL MACHINE For Itanium (KVM/IA64)
> -M: Xiantao Zhang 
> -L: kvm-i...@vger.kernel.org
> -W: http://kvm.qumranet.com
> -S: Supported
> -F: Documentation/ia64/kvm.txt
> -F: arch/ia64/include/asm/kvm*
> -F: arch/ia64/kvm/
> -
>  KERNEL VIRTUAL MACHINE for s390 (KVM/s390)
>  M: Christian Borntraeger 
>  M: Cornelia Huck 
> diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
> index c84c88d7..11afe7ab1981 100644
> --- a/arch/ia64/Kconfig
> +++ b/arch/ia64/Kconfig
> @@ -21,7 +21,6 @@ config IA64
>   select HAVE_DYNAMIC_FTRACE if (!ITANIUM)
>   select HAVE_FUNCTION_TRACER
>   select HAVE_DMA_ATTRS
> - select HAVE_KVM
>   select TTY
>   select HAVE_ARCH_TRACEHOOK
>   select HAVE_DMA_API_DEBUG
> @@ -640,8 +639,6 @@ source "security/Kconfig"
>  
>  source "crypto/Kconfig"
>  
> -source "arch/ia64/kvm/Kconfig"
> -
>  source "lib/Kconfig"
>  
>  config IOMMU_HELPER
> diff --git a/arch/ia64/Makefile b/arch/ia64/Makefile
> index 5441b14994fc

Re: [question] lots of interrupts injected to vm when pressing somekey w/o releasing

2014-11-19 Thread Gleb Natapov
On Thu, Nov 20, 2014 at 02:59:36PM +0800, Zhang Haoyu wrote:
> >On 20/11/2014 03:20, Zhang Haoyu wrote:
> >> Hi all,
> >> 
> >> If I press one of the "Insert/Delete/Home/End/PageUp/PageDown/UpArrow/
> >> DownArrow/LeftArrow/RightArrow" keys without releasing it, lots of
> >> interrupts are injected into the VM (win7/win2008), about 8000/s, and
> >> the system becomes very slow, giving a very bad experience. The other
> >> keys are okay. And Linux guests do not have this problem.
> >
> >Do you have a trace for this?  What version of QEMU and what UI backend?
> >
> Sorry for forgetting to mention the test environment from the start.
> Host: rhel7 with kernel-3.10.0-121
> QEMU: qemu-2.0.2
> Guest: win7 (bad), win2008 (bad), linux-kernel-3.10.0-121 (good)
> 
> No UI backend; the VM is started directly via the qemu command line.
> 
> perf top data when the above problem is happening: 
Trace it like this: http://www.linux-kvm.org/page/Tracing

--
Gleb.


Re: vhost + multiqueue + RSS question.

2014-11-17 Thread Gleb Natapov
On Tue, Nov 18, 2014 at 11:41:11AM +0800, Jason Wang wrote:
> On 11/18/2014 09:37 AM, Zhang Haoyu wrote:
> >> On Mon, Nov 17, 2014 at 01:58:20PM +0200, Michael S. Tsirkin wrote:
> >>> On Mon, Nov 17, 2014 at 01:22:07PM +0200, Gleb Natapov wrote:
> >>>> On Mon, Nov 17, 2014 at 12:38:16PM +0200, Michael S. Tsirkin wrote:
> >>>>> On Mon, Nov 17, 2014 at 09:44:23AM +0200, Gleb Natapov wrote:
> >>>>>> On Sun, Nov 16, 2014 at 08:56:04PM +0200, Michael S. Tsirkin wrote:
> >>>>>>> On Sun, Nov 16, 2014 at 06:18:18PM +0200, Gleb Natapov wrote:
> >>>>>>>> Hi Michael,
> >>>>>>>>
> >>>>>>>>  I am playing with vhost multiqueue capability and have a question 
> >>>>>>>> about
> >>>>>>>> vhost multiqueue and RSS (receive side steering). My setup has 
> >>>>>>>> Mellanox
> >>>>>>>> ConnectX-3 NIC which supports multiqueue and RSS. Network related
> >>>>>>>> parameters for qemu are:
> >>>>>>>>
> >>>>>>>>-netdev tap,id=hn0,script=qemu-ifup.sh,vhost=on,queues=4
> >>>>>>>>-device virtio-net-pci,netdev=hn0,id=nic1,mq=on,vectors=10
> >>>>>>>>
> >>>>>>>> In a guest I ran "ethtool -L eth0 combined 4" to enable multiqueue.
> >>>>>>>>
> >>>>>>>> I am running one tcp stream into the guest using iperf. Since there 
> >>>>>>>> is
> >>>>>>>> only one tcp stream I expect it to be handled by one queue only but
> >>>>>>>> this seems not to be the case. ethtool -S on a host shows that the
> >>>>>>>> stream is handled by one queue in the NIC, just like I would expect,
> >>>>>>>> but in a guest all 4 virtio-input interrupts are incremented. Am I
> >>>>>>>> missing any configuration?
> >>>>>>> I don't see anything obviously wrong with what you describe.
> >>>>>>> Maybe, somehow, same irqfd got bound to multiple MSI vectors?
> >>>>>> It does not look like this is what is happening judging by the way
> >>>>>> interrupts are distributed between queues. They are not distributed
> >>>>>> uniformly and often I see one queue gets most interrupts and others get
> >>>>>> much less and then it changes.
> >>>>> Weird. It would happen if you transmitted from multiple CPUs.
> >>>>> You did pin iperf to a single CPU within guest, did you not?
> >>>>>
> >>>> No, I didn't because I didn't expect it to matter for input interrupts.
> >>>> When I run iperf on a host rx queue that receives all packets depends
> >>>> only on a connection itself, not on a cpu iperf is running on (I tested
> >>>> that).
> >>> This really depends on the type of networking card you have
> >>> on the host, and how it's configured.
> >>>
> >>> I think you will get something more closely resembling this
> >>> behaviour if you enable RFS in host.
> >>>
> >>>> When I pin iperf in a guest I do indeed see that all interrupts
> >>>> are arriving to the same irq vector. Is a number after virtio-input
> >>>> in /proc/interrupts any indication of the queue a packet arrived on (on
> >>>> a host I can use ethtool -S to check what queue receives packets, but
> >>>> unfortunately this does not work for virtio nic in a guest)?
> >>> I think it is.
> >>>
> >>>> Because if
> >>>> it is, the way RSS works in virtio is not how it works on a host, and not
> >>>> what I would expect after reading about RSS. The queue a packet arrives
> >>>> on should be calculated by hashing fields from the packet header only.
> >>> Yes, what virtio has is not RSS - it's an accelerated RFS really.
> >>>
> >> OK, if what virtio has is RFS and not RSS my test results make sense.
> >> Thanks!
> > I think the RSS emulation for virtio-mq NIC is implemented in 
> > tun_select_queue(),
> > am I missing something?
> >
> > Thanks,
> > Zhang Haoyu
> >
> 
> Yes, if RSS is short for Receive Side Steering, which is a generic
> technology. But RSS is usually short for Receive Side Scaling, which is a
> technology commonly used by Windows; it is implemented through an
> indirection table in the card, which is obviously not supported in tun
> currently.
Hmm, I had the impression that "Receive Side Steering" and "Receive Side
Scaling" are interchangeable. The software implementation of RSS is called
"Receive Packet Steering" according to Documentation/networking/scaling.txt,
not "Receive Packet Scaling". Those damn TLAs are confusing.
 
--
Gleb.


Re: vhost + multiqueue + RSS question.

2014-11-17 Thread Gleb Natapov
On Mon, Nov 17, 2014 at 01:58:20PM +0200, Michael S. Tsirkin wrote:
> On Mon, Nov 17, 2014 at 01:22:07PM +0200, Gleb Natapov wrote:
> > On Mon, Nov 17, 2014 at 12:38:16PM +0200, Michael S. Tsirkin wrote:
> > > On Mon, Nov 17, 2014 at 09:44:23AM +0200, Gleb Natapov wrote:
> > > > On Sun, Nov 16, 2014 at 08:56:04PM +0200, Michael S. Tsirkin wrote:
> > > > > On Sun, Nov 16, 2014 at 06:18:18PM +0200, Gleb Natapov wrote:
> > > > > > Hi Michael,
> > > > > > 
> > > > > >  I am playing with vhost multiqueue capability and have a question 
> > > > > > about
> > > > > > vhost multiqueue and RSS (receive side steering). My setup has 
> > > > > > Mellanox
> > > > > > ConnectX-3 NIC which supports multiqueue and RSS. Network related
> > > > > > parameters for qemu are:
> > > > > > 
> > > > > >-netdev tap,id=hn0,script=qemu-ifup.sh,vhost=on,queues=4
> > > > > >-device virtio-net-pci,netdev=hn0,id=nic1,mq=on,vectors=10
> > > > > > 
> > > > > > In a guest I ran "ethtool -L eth0 combined 4" to enable multiqueue.
> > > > > > 
> > > > > > I am running one tcp stream into the guest using iperf. Since there 
> > > > > > is
> > > > > > only one tcp stream I expect it to be handled by one queue only but
> > > > > > this seems not to be the case. ethtool -S on a host shows that the
> > > > > > stream is handled by one queue in the NIC, just like I would expect,
> > > > > > but in a guest all 4 virtio-input interrupts are incremented. Am I
> > > > > > missing any configuration?
> > > > > 
> > > > > I don't see anything obviously wrong with what you describe.
> > > > > Maybe, somehow, same irqfd got bound to multiple MSI vectors?
> > > > It does not look like this is what is happening judging by the way
> > > > interrupts are distributed between queues. They are not distributed
> > > > uniformly and often I see one queue gets most interrupts and others get
> > > > much less and then it changes.
> > > 
> > > Weird. It would happen if you transmitted from multiple CPUs.
> > > You did pin iperf to a single CPU within guest, did you not?
> > > 
> > No, I didn't because I didn't expect it to matter for input interrupts.
> > When I run iperf on a host, the rx queue that receives all packets depends
> > only on the connection itself, not on the cpu iperf is running on (I tested
> > that).
> 
> This really depends on the type of networking card you have
> on the host, and how it's configured.
> 
> I think you will get something more closely resembling this
> behaviour if you enable RFS in host.
> 
> > When I pin iperf in a guest I do indeed see that all interrupts
> > are arriving to the same irq vector. Is a number after virtio-input
> > in /proc/interrupts any indication of the queue a packet arrived on (on
> > a host I can use ethtool -S to check what queue receives packets, but
> > unfortunately this does not work for virtio nic in a guest)?
> 
> I think it is.
> 
> > Because if
> > it is, the way RSS works in virtio is not how it works on a host, and not
> > what I would expect after reading about RSS. The queue a packet arrives
> > on should be calculated by hashing fields from the packet header only.
> 
> Yes, what virtio has is not RSS - it's an accelerated RFS really.
> 
OK, if what virtio has is RFS and not RSS my test results make sense.
Thanks!

--
Gleb.


Re: vhost + multiqueue + RSS question.

2014-11-17 Thread Gleb Natapov
On Mon, Nov 17, 2014 at 12:38:16PM +0200, Michael S. Tsirkin wrote:
> On Mon, Nov 17, 2014 at 09:44:23AM +0200, Gleb Natapov wrote:
> > On Sun, Nov 16, 2014 at 08:56:04PM +0200, Michael S. Tsirkin wrote:
> > > On Sun, Nov 16, 2014 at 06:18:18PM +0200, Gleb Natapov wrote:
> > > > Hi Michael,
> > > > 
> > > >  I am playing with vhost multiqueue capability and have a question about
> > > > vhost multiqueue and RSS (receive side steering). My setup has Mellanox
> > > > ConnectX-3 NIC which supports multiqueue and RSS. Network related
> > > > parameters for qemu are:
> > > > 
> > > >-netdev tap,id=hn0,script=qemu-ifup.sh,vhost=on,queues=4
> > > >-device virtio-net-pci,netdev=hn0,id=nic1,mq=on,vectors=10
> > > > 
> > > > In a guest I ran "ethtool -L eth0 combined 4" to enable multiqueue.
> > > > 
> > > > I am running one tcp stream into the guest using iperf. Since there is
> > > > only one tcp stream I expect it to be handled by one queue only but
> > > > this seems not to be the case. ethtool -S on a host shows that the
> > > > stream is handled by one queue in the NIC, just like I would expect,
> > > > but in a guest all 4 virtio-input interrupts are incremented. Am I
> > > > missing any configuration?
> > > 
> > > I don't see anything obviously wrong with what you describe.
> > > Maybe, somehow, same irqfd got bound to multiple MSI vectors?
> > It does not look like this is what is happening judging by the way
> > interrupts are distributed between queues. They are not distributed
> > uniformly and often I see one queue gets most interrupts and others get
> > much less and then it changes.
> 
> Weird. It would happen if you transmitted from multiple CPUs.
> You did pin iperf to a single CPU within guest, did you not?
> 
No, I didn't because I didn't expect it to matter for input interrupts.
When I run iperf on a host, the rx queue that receives all packets depends
only on the connection itself, not on the cpu iperf is running on (I tested
that). When I pin iperf in a guest I do indeed see that all interrupts
are arriving to the same irq vector. Is a number after virtio-input
in /proc/interrupts any indication of the queue a packet arrived on (on
a host I can use ethtool -S to check what queue receives packets, but
unfortunately this does not work for virtio nic in a guest)? Because if
it is, the way RSS works in virtio is not how it works on a host, and not
what I would expect after reading about RSS. The queue a packet arrives
on should be calculated by hashing fields from the packet header only.

--
Gleb.


Re: nested KVM slower than QEMU with gnumach guest kernel

2014-11-17 Thread Gleb Natapov
On Mon, Nov 17, 2014 at 10:10:25AM +0100, Samuel Thibault wrote:
> Jan Kiszka, le Mon 17 Nov 2014 10:04:37 +0100, a écrit :
> > On 2014-11-17 10:03, Samuel Thibault wrote:
> > > Gleb Natapov, le Mon 17 Nov 2014 10:58:45 +0200, a écrit :
> > >> Do you know how gnumach timekeeping works? Does it have a timer that 
> > >> fires every 1ms?
> > >> Which clock device is it using?
> > > 
> > > It uses the PIT every 10ms, in square mode
> > > (PIT_C0|PIT_SQUAREMODE|PIT_READMODE = 0x36).
> > 
> > Wow... how retro. That feature might be unsupported - does user space
> > irqchip work better?
> 
> I had indeed tried giving -machine kernel_irqchip=off to the L2 kvm,
> with the same bad performance and external_interrupt in the trace.
> 
They will always be in the trace, but do you see them every 1ms or every
10ms with the userspace irqchip?

--
Gleb.


Re: nested KVM slower than QEMU with gnumach guest kernel

2014-11-17 Thread Gleb Natapov
On Sun, Nov 16, 2014 at 11:18:28PM +0100, Samuel Thibault wrote:
> Hello,
> 
> Jan Kiszka, le Wed 12 Nov 2014 00:42:52 +0100, a écrit :
> > On 2014-11-11 19:55, Samuel Thibault wrote:
> > > jenkins.debian.net is running inside a KVM VM, and it runs nested
> > > KVM guests for its installation attempts.  This goes fine with Linux
> > > kernels, but it is extremely slow with gnumach kernels.
> 
> > You can try to catch a trace (ftrace) on the physical host.
> > 
> > I suspect the setup forces a lot of instruction emulation, either on L0
> > or L1. And that is slower than QEMU if KVM does not optimize like QEMU does.
> 
> Here is a sample of trace-cmd output dump: the same kind of pattern
> repeats over and over, with EXTERNAL_INTERRUPT happening mostly
> every other microsecond:
> 
>  qemu-system-x86-9752  [003]  4106.187755: kvm_exit: reason EXTERNAL_INTERRUPT rip 0xa02848b1 info 0 80f6
>  qemu-system-x86-9752  [003]  4106.187756: kvm_entry:vcpu 0
>  qemu-system-x86-9752  [003]  4106.187757: kvm_exit: reason EXTERNAL_INTERRUPT rip 0xa02848b1 info 0 80f6
>  qemu-system-x86-9752  [003]  4106.187758: kvm_entry:vcpu 0
>  qemu-system-x86-9752  [003]  4106.187759: kvm_exit: reason EXTERNAL_INTERRUPT rip 0xa02848b1 info 0 80f6
>  qemu-system-x86-9752  [003]  4106.187760: kvm_entry:vcpu 0
> 
> The various functions being interrupted are vmx_vcpu_run
> (0xa02848b1 and 0xa0284972), handle_io
> (0xa027ee62), vmx_get_cpl (0xa027a7de),
> load_vmcs12_host_state (0xa027ea31), native_read_tscp
> (0x81050a84), native_write_msr_safe (0x81050aa6),
> vmx_decache_cr0_guest_bits (0xa027a384),
> vmx_handle_external_intr (0xa027a54d).
> 
> AIUI, the external interrupt is 0xf6, i.e. Linux' IRQ_WORK_VECTOR.  I
> however don't see any of them, neither in L0's /proc/interrupts, nor in
> L1's /proc/interrupts...
> 
Do you know how gnumach timekeeping works? Does it have a timer that fires
every 1ms? Which clock device is it using?

--
Gleb.


Re: vhost + multiqueue + RSS question.

2014-11-16 Thread Gleb Natapov
On Sun, Nov 16, 2014 at 08:56:04PM +0200, Michael S. Tsirkin wrote:
> On Sun, Nov 16, 2014 at 06:18:18PM +0200, Gleb Natapov wrote:
> > Hi Michael,
> > 
> >  I am playing with vhost multiqueue capability and have a question about
> > vhost multiqueue and RSS (receive side steering). My setup has Mellanox
> > ConnectX-3 NIC which supports multiqueue and RSS. Network related
> > parameters for qemu are:
> > 
> >-netdev tap,id=hn0,script=qemu-ifup.sh,vhost=on,queues=4
> >-device virtio-net-pci,netdev=hn0,id=nic1,mq=on,vectors=10
> > 
> > In a guest I ran "ethtool -L eth0 combined 4" to enable multiqueue.
> > 
> > I am running one tcp stream into the guest using iperf. Since there is
> > only one tcp stream I expect it to be handled by one queue only but
> > this seems not to be the case. ethtool -S on a host shows that the
> > stream is handled by one queue in the NIC, just like I would expect,
> > but in a guest all 4 virtio-input interrupts are incremented. Am I
> > missing any configuration?
> 
> I don't see anything obviously wrong with what you describe.
> Maybe, somehow, same irqfd got bound to multiple MSI vectors?
It does not look like this is what is happening judging by the way
interrupts are distributed between queues. They are not distributed
uniformly and often I see one queue gets most interrupts and others get
much less and then it changes.

--
Gleb.


Re: vhost + multiqueue + RSS question.

2014-11-16 Thread Gleb Natapov
On Mon, Nov 17, 2014 at 01:30:06PM +0800, Jason Wang wrote:
> On 11/17/2014 02:56 AM, Michael S. Tsirkin wrote:
> > On Sun, Nov 16, 2014 at 06:18:18PM +0200, Gleb Natapov wrote:
> >> Hi Michael,
> >>
> >>  I am playing with vhost multiqueue capability and have a question about
> >> vhost multiqueue and RSS (receive side steering). My setup has Mellanox
> >> ConnectX-3 NIC which supports multiqueue and RSS. Network related
> >> parameters for qemu are:
> >>
> >>-netdev tap,id=hn0,script=qemu-ifup.sh,vhost=on,queues=4
> >>-device virtio-net-pci,netdev=hn0,id=nic1,mq=on,vectors=10
> >>
> >> In a guest I ran "ethtool -L eth0 combined 4" to enable multiqueue.
> >>
> >> I am running one tcp stream into the guest using iperf. Since there is
> >> only one tcp stream I expect it to be handled by one queue only but
> >> this seems not to be the case. ethtool -S on a host shows that the
> >> stream is handled by one queue in the NIC, just like I would expect,
> >> but in a guest all 4 virtio-input interrupts are incremented. Am I
> >> missing any configuration?
> > I don't see anything obviously wrong with what you describe.
> > Maybe, somehow, same irqfd got bound to multiple MSI vectors?
> > To see, can you try dumping struct kvm_irqfd that's passed to kvm?
> >
> >
> >> --
> >>Gleb.
> 
> This sounds like a regression; which kernel/qemu versions did you use?
Sorry, should have mentioned it from the start. Host is a fedora 20 with
kernel 3.16.6-200.fc20.x86_64 and qemu-system-x86-1.6.2-9.fc20.x86_64.
Guest is also fedora 20 but with an older kernel 3.11.10-301.

--
Gleb.


vhost + multiqueue + RSS question.

2014-11-16 Thread Gleb Natapov
Hi Michael,

 I am playing with vhost multiqueue capability and have a question about
vhost multiqueue and RSS (receive side steering). My setup has Mellanox
ConnectX-3 NIC which supports multiqueue and RSS. Network related
parameters for qemu are:

   -netdev tap,id=hn0,script=qemu-ifup.sh,vhost=on,queues=4
   -device virtio-net-pci,netdev=hn0,id=nic1,mq=on,vectors=10

In a guest I ran "ethtool -L eth0 combined 4" to enable multiqueue.

I am running one tcp stream into the guest using iperf. Since there is
only one tcp stream I expect it to be handled by one queue only but
this seems not to be the case. ethtool -S on a host shows that the
stream is handled by one queue in the NIC, just like I would expect,
but in a guest all 4 virtio-input interrupts are incremented. Am I
missing any configuration?

--
Gleb.


Re: Seeking a KVM benchmark

2014-11-12 Thread Gleb Natapov
On Wed, Nov 12, 2014 at 04:26:29PM +0100, Paolo Bonzini wrote:
> 
> 
> On 12/11/2014 16:22, Gleb Natapov wrote:
> >> > Nehalem results:
> >> > 
> >> > userspace exit, urn  17560 17726 17628 17572 17417
> >> > lightweight exit, urn 3316  3342  3342  3319  3328
> >> > userspace exit, LOAD_EFER, guest!=host   12200 11772 12130 12164 12327
> >> > lightweight exit, LOAD_EFER, guest!=host  3214  3220  3238  3218  3337
> >> > userspace exit, LOAD_EFER, guest=host11983 11780 11920 11919 12040
> >> > lightweight exit, LOAD_EFER, guest=host   3178  3193  3193  3187  3220
> >> > 
> > Is this with Andy's patch that skips LOAD_EFER when guest=host, or the one
> > that always switches LOAD_EFER?
> 
> Skip LOAD_EFER when guest=host.
> 
So guest=host is a little bit better than guest!=host, so it looks like
skipping LOAD_EFER helps. But why is "lightweight exit, urn" worse than
guest=host? It should be exactly the same, as long as the NX bit is the
same in the urn test, no?

--
Gleb.


Re: Seeking a KVM benchmark

2014-11-12 Thread Gleb Natapov
On Wed, Nov 12, 2014 at 12:33:32PM +0100, Paolo Bonzini wrote:
> 
> 
> On 10/11/2014 18:38, Gleb Natapov wrote:
> > On Mon, Nov 10, 2014 at 06:28:25PM +0100, Paolo Bonzini wrote:
> >> On 10/11/2014 15:23, Avi Kivity wrote:
> >>> It's not surprising [1].  Since the meaning of some PTE bits change [2],
> >>> the TLB has to be flushed.  In VMX we have VPIDs, so we only need to flush
> >>> if EFER changed between two invocations of the same VPID, which isn't the
> >>> case.
> >>>
> >>> [1] after the fact
> >>> [2] although those bits were reserved with NXE=0, so they shouldn't have
> >>> any TLB footprint
> >>
> >> You're right that this is not that surprising after the fact, and that
> >> both Sandy Bridge and Ivy Bridge have VPIDs (even the non-Xeon ones).
> >> This is also why I'm curious about the Nehalem.
> >>
> >> However note that even toggling the SCE bit is flushing the TLB.  The
> >> NXE bit is not being toggled here!  That's the more surprising part.
> >>
> > Just a guess, but maybe because writing EFER is not something that happens
> > often in regular OSes, it is not optimized to handle different bits
> > differently.
> 
> Yes, that's what Intel said too.
> 
> Nehalem results:
> 
> userspace exit, urn  17560 17726 17628 17572 17417
> lightweight exit, urn 3316  3342  3342  3319  3328
> userspace exit, LOAD_EFER, guest!=host   12200 11772 12130 12164 12327
> lightweight exit, LOAD_EFER, guest!=host  3214  3220  3238  3218  3337
> userspace exit, LOAD_EFER, guest=host11983 11780 11920 11919 12040
> lightweight exit, LOAD_EFER, guest=host   3178  3193  3193  3187  3220
> 
Is this with Andy's patch that skips LOAD_EFER when guest=host, or the one
that always switches LOAD_EFER?

--
Gleb.


Re: Seeking a KVM benchmark

2014-11-10 Thread Gleb Natapov
On Mon, Nov 10, 2014 at 06:28:25PM +0100, Paolo Bonzini wrote:
> On 10/11/2014 15:23, Avi Kivity wrote:
> > It's not surprising [1].  Since the meaning of some PTE bits change [2],
> > the TLB has to be flushed.  In VMX we have VPIDs, so we only need to flush
> > if EFER changed between two invocations of the same VPID, which isn't the
> > case.
> > 
> > [1] after the fact
> > [2] although those bits were reserved with NXE=0, so they shouldn't have
> > any TLB footprint
> 
> You're right that this is not that surprising after the fact, and that
> both Sandy Bridge and Ivy Bridge have VPIDs (even the non-Xeon ones).
> This is also why I'm curious about the Nehalem.
> 
> However note that even toggling the SCE bit is flushing the TLB.  The
> NXE bit is not being toggled here!  That's the more surprising part.
> 
Just a guess, but maybe because writing EFER is not something that happens
often in regular OSes, it is not optimized to handle different bits differently.

--
Gleb.


Re: Seeking a KVM benchmark

2014-11-10 Thread Gleb Natapov
On Mon, Nov 10, 2014 at 11:03:35AM +0100, Paolo Bonzini wrote:
> 
> 
> On 09/11/2014 17:36, Andy Lutomirski wrote:
> >> The purpose of the vmexit test is to show us various overheads, so why not
> >> measure the EFER switch overhead by having two tests, one with equal EFER
> >> and another with different EFER, instead of hiding it.
> > 
> > I'll try this.  We might need three tests, though: NX different, NX
> > same but SCE different, and all flags the same.
> 
> The test actually explicitly enables NX in order to put itself in the 
> "common case":
> 
> commit 82d4ccb9daf67885a0316b1d763ce5ace57cff36
> Author: Marcelo Tosatti 
> Date:   Tue Jun 8 15:33:29 2010 -0300
> 
> test: vmexit: enable NX
> 
> Enable NX to disable MSR autoload/save. This is the common case anyway.
> 
> Signed-off-by: Marcelo Tosatti 
> Signed-off-by: Avi Kivity 
> 
> (this commit is in qemu-kvm.git), so I guess forgetting to set SCE is
> just a bug.  The results on my Xeon Sandy Bridge are very interesting:
> 
> NX different~11.5k (load/save EFER path)
> NX same, SCE different  ~19.5k (urn path)
> all flags the same  ~10.2k
> 
> The inl_from_kernel results have absolutely no change, usually at most 5
> cycles difference.  This could be because I've added the SCE=1 variant
> directly to vmexit.c, so I'm running the tests one next to the other.
> 
> I also tried making the other shared MSRs the same between guest and
> host (STAR, LSTAR, CSTAR, SYSCALL_MASK), so that the user return notifier
> has nothing to do.  That saves about 400-500 cycles on inl_from_qemu.  I
> do want to dig out my old Core 2 and see how the new test fares, but it
> really looks like your patch will be in 3.19.
>
Please test on wide variety of HW before final decision. Also it would
be nice to ask Intel what is expected overhead. It is awesome if they
mange to add EFER switching with non measurable overhead, but also hard
to believe :) Also Andy had an idea do disable switching in case host
and guest EFERs are the same but IIRC his patch does not include it yet.
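
For context, a sketch of the shared-MSR / user-return-notifier mechanism
being measured here (illustrative, not the exact arch/x86/kvm/x86.c code;
wrmsrl() is the kernel's 64-bit MSR write helper): guest MSR values are
written lazily, and host values are restored only on return to userspace,
and only when they actually differ.

        struct shared_msr {
                u32 index;
                u64 host_value;
                u64 curr_value;
        };

        static void set_shared_msr_sketch(struct shared_msr *m, u64 guest_value)
        {
                if (guest_value == m->curr_value)
                        return;         /* guest == host: no wrmsr at all */
                m->curr_value = guest_value;
                wrmsrl(m->index, guest_value);
                /* arrange for on_user_return_sketch() before returning to userspace */
        }

        static void on_user_return_sketch(struct shared_msr *m)
        {
                if (m->curr_value != m->host_value) {
                        wrmsrl(m->index, m->host_value);
                        m->curr_value = m->host_value;
                }
        }

This is why making STAR/LSTAR/CSTAR/SYSCALL_MASK equal between guest and
host saves cycles only on userspace exits: lightweight exits never run the
notifier at all.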

--
Gleb.


Re: Seeking a KVM benchmark

2014-11-09 Thread Gleb Natapov
On Sat, Nov 08, 2014 at 08:44:42AM -0800, Andy Lutomirski wrote:
> On Sat, Nov 8, 2014 at 8:00 AM, Andy Lutomirski  wrote:
> > On Nov 8, 2014 4:01 AM, "Gleb Natapov"  wrote:
> >>
> >> On Fri, Nov 07, 2014 at 09:59:55AM -0800, Andy Lutomirski wrote:
> >> > On Thu, Nov 6, 2014 at 11:17 PM, Paolo Bonzini  
> >> > wrote:
> >> > >
> >> > >
> >> > > On 07/11/2014 07:27, Andy Lutomirski wrote:
> >> > >> Is there an easy benchmark that's sensitive to the time it takes to
> >> > >> round-trip from userspace to guest and back to userspace?  I think I
> >> > >> may have a big speedup.
> >> > >
> >> > > The simplest is vmexit.flat from
> >> > > git://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git
> >> > >
> >> > > Run it with "x86/run x86/vmexit.flat" and look at the inl_from_qemu
> >> > > benchmark.
> >> >
> >> > Thanks!
> >> >
> >> > That test case is slower than I expected.  I think my change is likely
> >> > to save somewhat under 100ns, which is only a couple percent.  I'll
> >> > look for more impressive improvements.
> >> >
> >> > On a barely related note, in the process of poking around with this
> >> > test, I noticed:
> >> >
> >> > /* On ept, can't emulate nx, and must switch nx atomically */
> >> > if (enable_ept && ((vmx->vcpu.arch.efer ^ host_efer) & EFER_NX)) {
> >> > guest_efer = vmx->vcpu.arch.efer;
> >> > if (!(guest_efer & EFER_LMA))
> >> > guest_efer &= ~EFER_LME;
> >> > add_atomic_switch_msr(vmx, MSR_EFER, guest_efer, host_efer);
> >> > return false;
> >> > }
> >> >
> >> > return true;
> >> >
> >> > This heuristic seems wrong to me.  wrmsr is serializing and therefore
> >> > extremely slow, whereas I imagine that, on CPUs that support it,
> >> > atomically switching EFER ought to be reasonably fast.
> >> >
> >> > Indeed, changing vmexit.c to disable NX (thereby forcing atomic EFER
> >> > switching, and having no other relevant effect that I've thought of)
> >> > speeds up inl_from_qemu by ~30% on Sandy Bridge.  Would it make sense
> >> > to always use atomic EFER switching, at least when
> >> > cpu_has_load_ia32_efer?
> >> >
> >> The idea behind the current logic is that we want to avoid writing an MSR
> >> at all for lightweight exits (those that do not exit to userspace). So
> >> if the NX bit is the same for host and guest we can avoid writing EFER on
> >> exit and run with the guest's EFER in the kernel. Only if a userspace
> >> exit is required do we write the host's MSR back, and only if guest and
> >> host MSRs are different, of course. What bit should be restored on
> >> userspace exit in the vmexit tests? Is it SCE? What if you set it
> >> instead of unsetting NXE?
> >
> > I don't understand.  AFAICT there are really only two cases: EFER
> > switched atomically using the best available mechanism on the host
> > CPU, or EFER switched on userspace exit.  I think there's a
> > theoretical third possibility: if the guest and host EFER match, then
> > EFER doesn't need to be switched at all, but this doesn't seem to be
> > implemented.
> 
> I got this part wrong.  It looks like the user return notifier is
> smart enough not to set EFER at all if the guest and host values
> match.  Indeed, with stock KVM, if I modify vmexit.c to have exactly
> the same EFER as the host (NX and SCE both set), then it runs quickly.
> But I get almost exactly the same performance if NX is clear, which is
> the case where the built-in entry/exit switching is used.
> 
What's the performance difference?

> Admittedly, most guests probably do match the host, so this effect may
> be rare in practice.  But possibly the code should be changed either
> the way I patched it (always use the built-in switching if available)
> or to only do it if the guest and host EFER values differ.  ISTM that,
> on modern CPUs, switching EFER on return to userspace is always a big
> loss.
We should be careful not to optimise for the wrong case. In the common case
userspace exits are extremely rare. Try tracing common workloads with a
Linux guest. Windows as a guest has its share of userspace exits, but
this is due to the lack of PV timer support (was it fixed already?).
So if switching EFER has measurable overhead, doing it on each exit is a
net loss.

> 
> If neither change is made, then maybe the test should change to set
> SCE so that it isn't so misleadingly slow.
>
The purpose of the vmexit test is to show us various overheads, so why not
measure the EFER switch overhead by having two tests, one with equal EFER
and another with different EFER, instead of hiding it.
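
A sketch of what such a pair of guest-side setups could look like in
x86/vmexit.c (the rdmsr/wrmsr helpers are spelled out so the snippet is
self-contained; the kvm-unit-tests harness hookup is assumed, as is a host
running with both NX and SCE set):

        #include <stdint.h>

        #define MSR_EFER  0xc0000080
        #define EFER_SCE  (1ULL << 0)   /* syscall enable */
        #define EFER_NX   (1ULL << 11)  /* no-execute enable */

        static inline uint64_t rdmsr(uint32_t idx)
        {
                uint32_t lo, hi;

                asm volatile("rdmsr" : "=a"(lo), "=d"(hi) : "c"(idx));
                return lo | ((uint64_t)hi << 32);
        }

        static inline void wrmsr(uint32_t idx, uint64_t val)
        {
                asm volatile("wrmsr" : : "c"(idx), "a"((uint32_t)val),
                             "d"((uint32_t)(val >> 32)));
        }

        /* guest EFER == host EFER: the user return notifier has no work to do */
        static void setup_efer_equal(void)
        {
                wrmsr(MSR_EFER, rdmsr(MSR_EFER) | EFER_NX | EFER_SCE);
        }

        /* guest EFER != host EFER: forces one of the EFER switching paths */
        static void setup_efer_different(void)
        {
                wrmsr(MSR_EFER, rdmsr(MSR_EFER) & ~EFER_SCE);
        }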


--
Gleb.


Re: Seeking a KVM benchmark

2014-11-08 Thread Gleb Natapov
On Fri, Nov 07, 2014 at 09:59:55AM -0800, Andy Lutomirski wrote:
> On Thu, Nov 6, 2014 at 11:17 PM, Paolo Bonzini  wrote:
> >
> >
> > On 07/11/2014 07:27, Andy Lutomirski wrote:
> >> Is there an easy benchmark that's sensitive to the time it takes to
> >> round-trip from userspace to guest and back to userspace?  I think I
> >> may have a big speedup.
> >
> > The simplest is vmexit.flat from
> > git://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git
> >
> > Run it with "x86/run x86/vmexit.flat" and look at the inl_from_qemu
> > benchmark.
> 
> Thanks!
> 
> That test case is slower than I expected.  I think my change is likely
> to save somewhat under 100ns, which is only a couple percent.  I'll
> look for more impressive improvements.
> 
> On a barely related note, in the process of poking around with this
> test, I noticed:
> 
> /* On ept, can't emulate nx, and must switch nx atomically */
> if (enable_ept && ((vmx->vcpu.arch.efer ^ host_efer) & EFER_NX)) {
> guest_efer = vmx->vcpu.arch.efer;
> if (!(guest_efer & EFER_LMA))
> guest_efer &= ~EFER_LME;
> add_atomic_switch_msr(vmx, MSR_EFER, guest_efer, host_efer);
> return false;
> }
> 
> return true;
> 
> This heuristic seems wrong to me.  wrmsr is serializing and therefore
> extremely slow, whereas I imagine that, on CPUs that support it,
> atomically switching EFER ought to be reasonably fast.
> 
> Indeed, changing vmexit.c to disable NX (thereby forcing atomic EFER
> switching, and having no other relevant effect that I've thought of)
> speeds up inl_from_qemu by ~30% on Sandy Bridge.  Would it make sense
> to always use atomic EFER switching, at least when
> cpu_has_load_ia32_efer?
> 
The idea behind the current logic is that we want to avoid writing an MSR
at all for lightweight exits (those that do not exit to userspace). So
if the NX bit is the same for host and guest we can avoid writing EFER on
exit and run with the guest's EFER in the kernel. Only if a userspace exit
is required do we write the host's MSR back, and only if guest and host
MSRs are different, of course. What bit should be restored on userspace
exit in the vmexit tests? Is it SCE? What if you set it instead of
unsetting NXE?

Your change reduced the userspace exit cost by ~30%, but what about exits
to the kernel? We have many more of those.
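
For reference, a sketch of the change under discussion (simplified from
Andy's idea, not taken verbatim from a posted patch; the names follow the
KVM code quoted above): prefer the VMCS EFER load/save controls whenever
the hardware supports them, instead of only when NX differs.

        static bool update_transition_efer_sketch(struct vcpu_vmx *vmx)
        {
                u64 guest_efer = vmx->vcpu.arch.efer;

                /* use atomic VMCS switching when supported, or when NX differs */
                if (cpu_has_load_ia32_efer ||
                    (enable_ept && ((guest_efer ^ host_efer) & EFER_NX))) {
                        if (!(guest_efer & EFER_LMA))
                                guest_efer &= ~EFER_LME;
                        add_atomic_switch_msr(vmx, MSR_EFER, guest_efer, host_efer);
                        return false;   /* no shared-MSR restore needed */
                }
                return true;            /* fall back to the user-return-notifier path */
        }

The trade-off raised above is exactly the one this sketch hides: the atomic
switch adds a small cost to every lightweight exit, while the notifier path
pays only on the (rarer) userspace exits.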
 
--
Gleb.


Re: [patch 3/4] KVM: MMU: reload request from GET_DIRTY_LOG path

2014-10-15 Thread Gleb Natapov
On Mon, Oct 13, 2014 at 05:52:38AM -0300, Marcelo Tosatti wrote:
> On Fri, Oct 10, 2014 at 04:09:29PM +0300, Gleb Natapov wrote:
> > On Wed, Oct 08, 2014 at 04:22:31PM -0300, Marcelo Tosatti wrote:
> > > > > 
> > > > > Argh, let's try again:
> > > > > 
> > > > > skip_pinned = true
> > > > > --
> > > > > 
> > > > > mark page dirty, keep spte intact
> > > > > 
> > > > > called from get dirty log path.
> > > > > 
> > > > > skip_pinned = false
> > > > > ---
> > > > > reload remote mmu
> > > > > destroy pinned spte.
> > > > > 
> > > > > called from: dirty log enablement, rmap write protect (unused for 
> > > > > pinned
> > > > > sptes)
> > > > > 
> > > > > 
> > > > > Note this behaviour is your suggestion:
> > > > 
> > > > Yes, I remember that, and I thought we would not need this skip_pinned
> > > > at all.  For the rmap write protect case there shouldn't be any pinned
> > > > pages, but why does dirty log enablement set skip_pinned to false? Why
> > > > not mark pinned pages as dirty just like you do in the get dirty log
> > > > path?
> > > 
> > > Because if it's a large spte, it must be nuked (or marked read-only,
> > > which for pinned sptes, is not possible).
> > > 
> > If a large page has one small page pinned inside it, its spte will
> > be marked as pinned, correct?  
> 
> Correct.
> 
> > We did nuke large ptes here until very
> > recently: c126d94f2c90ed9d, but we cannot drop a pte here anyway without
> > kicking all vcpus out of guest mode, so do you need the additional
> > skip_pinned parameter?  Why not check if the spte is large instead?
> 
> Nuke only if large spte is found? Can do that, instead.
> 
> > So why not have a per-slot pinned page list (Xiao suggested the same) and do:
> 
> The interface is per-vcpu (that is registration of pinned pages is
> performed on a per-vcpu basis).
> 
PEBS is per-cpu, but that does not mean that pinning should be per-cpu; it
can be done globally with ref counting.

> > spte_write_protect() {
> >   if (is_pinned(spte)) {
> > if (large(spte))
> >// cannot drop while vcpus are running
> >mmu_reload_pinned_vcpus();
> > else
> >return false;
> >   }
> > }
> > 
> > get_dirty_log() {
> >  for_each(pinned page i)
> > mark_dirty(i);
> > }
> 
> That is effectively the same as what this patchset does, except that the
> spte pinned bit is checked at spte_write_protect, instead of looping over
> the pinned page list. I fail to see a huge advantage there.
>
I think spte_write_protect is a strange place to mark pages dirty, but
otherwise yes, the effect is the same, so definitely not a huge difference.
If a global pinned list is a PITA in your opinion, leave it as is.

> I'll drop the skip_pinned parameter and use is_large_pte check instead.
> 
Thanks,

--
Gleb.


Re: [PATCH v2 2/5] KVM: x86: Emulator performs code segment checks on read access

2014-10-13 Thread Gleb Natapov
On Mon, Oct 13, 2014 at 02:15:43AM +0300, Nadav Amit wrote:
> 
> 
> On 10/12/14 3:12 PM, Paolo Bonzini wrote:
> > Il 12/10/2014 08:57, Nadav Amit ha scritto:
> >> Looks good. I’ll give it a try but it is hard to give a definitive
> >> answer, since the emulator is still bug-ridden.
> > 
> > Yes, we need to write unit tests for this, especially the conforming
> > case.  A bit of a pain to get kvm-unit-tests in ring 3 (access.flat
> > does it), but I'll give it a shot.
> > 
> > Paolo
> > 
> 
> I think the problem might be even more fundamental.
> According to the SDM, the privilege level checks (CPL/DPL/RPL) are only 
> performed when the segment is loaded; I see no reference to privilege checks 
> when data is accessed.
> You should be able to load a segment with DPL=0 while you are in CPL=0, then 
> change CPL to 3 and still access the segment (obviously, it is not the best 
> practice).
> 
> In that case, all the privilege checks in __linearize are redundant and to 
> some extent incorrect.
> Obviously, I am afraid to submit a patch that removes them, since if the 
> privilege checks of __linearize are needed in a certain case, this may 
> introduce a security problem.
> 
> Do you agree?
> 
3a78a4f46302bfc83602a53dfa4dcbe76a7a1f5f removed the RPL check from
__linearize already, so you are probably right, but better to verify it on
real HW.
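
For concreteness, the kind of check being questioned, as a sketch (not the
exact __linearize code; struct seg_desc and the helper name are hypothetical):

        /* sketch of a per-access data-segment privilege test */
        struct seg_desc {
                unsigned dpl;   /* descriptor privilege level */
        };

        static bool data_access_allowed_sketch(unsigned cpl, struct seg_desc *desc)
        {
                /*
                 * Per the SDM, DPL is checked when the selector is *loaded*,
                 * not on each access. Re-checking on every access, as below,
                 * rejects the legal case Nadav describes: load the segment at
                 * CPL=0, then drop to CPL=3 and keep using it.
                 */
                return cpl <= desc->dpl;
        }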

--
Gleb.


Re: [patch 3/4] KVM: MMU: reload request from GET_DIRTY_LOG path

2014-10-10 Thread Gleb Natapov
On Wed, Oct 08, 2014 at 04:22:31PM -0300, Marcelo Tosatti wrote:
> > > 
> > > Argh, let's try again:
> > > 
> > > skip_pinned = true
> > > --
> > > 
> > > mark page dirty, keep spte intact
> > > 
> > > called from get dirty log path.
> > > 
> > > skip_pinned = false
> > > ---
> > > reload remote mmu
> > > destroy pinned spte.
> > > 
> > > called from: dirty log enablement, rmap write protect (unused for pinned
> > > sptes)
> > > 
> > > 
> > > Note this behaviour is your suggestion:
> > 
> > Yes, I remember that, and I thought we would not need this skip_pinned
> > at all.  For the rmap write protect case there shouldn't be any pinned
> > pages, but why does dirty log enablement set skip_pinned to false? Why not
> > mark pinned pages as dirty just like you do in the get dirty log path?
> 
> Because if it's a large spte, it must be nuked (or marked read-only,
> which for pinned sptes, is not possible).
> 
If a large page has one small page pinned inside it, its spte will
be marked as pinned, correct?  We did nuke large ptes here until very
recently: c126d94f2c90ed9d, but we cannot drop a pte here anyway without
kicking all vcpus out of guest mode, so do you need the additional
skip_pinned parameter?  Why not check if the spte is large instead?

So why not have a per-slot pinned page list (Xiao suggested the same) and do:

spte_write_protect() {
  if (is_pinned(spte)) {
    if (large(spte))
       // cannot drop while vcpus are running
       mmu_reload_pinned_vcpus();
    else
       return false;
  }
}

get_dirty_log() {
  for_each(pinned page i)
    mark_dirty(i);
}

--
Gleb.


Re: [patch 3/4] KVM: MMU: reload request from GET_DIRTY_LOG path

2014-10-08 Thread Gleb Natapov
On Wed, Oct 08, 2014 at 02:15:34PM -0300, Marcelo Tosatti wrote:
> On Wed, Oct 08, 2014 at 09:56:36AM +0300, Gleb Natapov wrote:
> > On Mon, Oct 06, 2014 at 02:19:32PM -0300, Marcelo Tosatti wrote:
> > > On Sat, Oct 04, 2014 at 10:23:32AM +0300, Gleb Natapov wrote:
> > > > On Tue, Sep 09, 2014 at 12:28:11PM -0300, Marcelo Tosatti wrote:
> > > > > On Mon, Jul 21, 2014 at 04:14:24PM +0300, Gleb Natapov wrote:
> > > > > > On Wed, Jul 09, 2014 at 04:12:53PM -0300, mtosa...@redhat.com wrote:
> > > > > > > Reload remote vcpus MMU from GET_DIRTY_LOG codepath, before
> > > > > > > deleting a pinned spte.
> > > > > > > 
> > > > > > > Signed-off-by: Marcelo Tosatti 
> > > > > > > 
> > > > > > > ---
> > > > > > >  arch/x86/kvm/mmu.c |   29 +++--
> > > > > > >  1 file changed, 23 insertions(+), 6 deletions(-)
> > > > > > > 
> > > > > > > Index: kvm.pinned-sptes/arch/x86/kvm/mmu.c
> > > > > > > ===
> > > > > > > --- kvm.pinned-sptes.orig/arch/x86/kvm/mmu.c  2014-07-09 
> > > > > > > 11:23:59.290744490 -0300
> > > > > > > +++ kvm.pinned-sptes/arch/x86/kvm/mmu.c   2014-07-09 
> > > > > > > 11:24:58.449632435 -0300
> > > > > > > @@ -1208,7 +1208,8 @@
> > > > > > >   *
> > > > > > >   * Return true if tlb need be flushed.
> > > > > > >   */
> > > > > > > -static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool 
> > > > > > > pt_protect)
> > > > > > > +static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool 
> > > > > > > pt_protect,
> > > > > > > +bool skip_pinned)
> > > > > > >  {
> > > > > > >   u64 spte = *sptep;
> > > > > > >  
> > > > > > > @@ -1218,6 +1219,22 @@
> > > > > > >  
> > > > > > >   rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, 
> > > > > > > *sptep);
> > > > > > >  
> > > > > > > + if (is_pinned_spte(spte)) {
> > > > > > > + /* keep pinned spte intact, mark page dirty again */
> > > > > > > + if (skip_pinned) {
> > > > > > > + struct kvm_mmu_page *sp;
> > > > > > > + gfn_t gfn;
> > > > > > > +
> > > > > > > + sp = page_header(__pa(sptep));
> > > > > > > + gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
> > > > > > > +
> > > > > > > + mark_page_dirty(kvm, gfn);
> > > > > > > + return false;
> > > > > > Why not mark all pinned gfns as dirty in 
> > > > > > kvm_vm_ioctl_get_dirty_log() while
> > > > > > populating dirty_bitmap_buffer?
> > > > > 
> > > > > The pinned gfns are per-vcpu. Requires looping all vcpus (not
> > > > > scalable).
> > > > > 
> > > > OK, but do they really have to be per-cpu? What's the advantage?
> > > 
> > > The guest configures pinning on a per-cpu basis (that is, enabling PEBS 
> > > is done on a per-cpu basis).
> > Is it a problem to maintain a global pinned page list for each memslot too?
> > 
> > > 
> > > > > 
> > > > > > > + } else
> > > > > > > + mmu_reload_pinned_vcpus(kvm);
> > > > > > Can you explain why do you need this?
> > > > > 
> > > > > Because if skip_pinned = false, we want vcpus to exit (called
> > > > > from enable dirty logging codepath).
> > > > > 
> > I guess what I wanted to ask is: why do we need skip_pinned? As far as I
> > see, it is set to false in two cases:
> > 1: the page is write protected for shadow MMU needs; this should not
> > happen since the feature
> > > 
> > > Correct.
> > > 
> > > >is not supported with shadow mmu (can happen with nested EPT, but 
> > > > page will be m

Re: [patch 3/4] KVM: MMU: reload request from GET_DIRTY_LOG path

2014-10-07 Thread Gleb Natapov
On Mon, Oct 06, 2014 at 02:19:32PM -0300, Marcelo Tosatti wrote:
> On Sat, Oct 04, 2014 at 10:23:32AM +0300, Gleb Natapov wrote:
> > On Tue, Sep 09, 2014 at 12:28:11PM -0300, Marcelo Tosatti wrote:
> > > On Mon, Jul 21, 2014 at 04:14:24PM +0300, Gleb Natapov wrote:
> > > > On Wed, Jul 09, 2014 at 04:12:53PM -0300, mtosa...@redhat.com wrote:
> > > > > Reload remote vcpus MMU from GET_DIRTY_LOG codepath, before
> > > > > deleting a pinned spte.
> > > > > 
> > > > > Signed-off-by: Marcelo Tosatti 
> > > > > 
> > > > > ---
> > > > >  arch/x86/kvm/mmu.c |   29 +++--
> > > > >  1 file changed, 23 insertions(+), 6 deletions(-)
> > > > > 
> > > > > Index: kvm.pinned-sptes/arch/x86/kvm/mmu.c
> > > > > ===
> > > > > --- kvm.pinned-sptes.orig/arch/x86/kvm/mmu.c  2014-07-09 
> > > > > 11:23:59.290744490 -0300
> > > > > +++ kvm.pinned-sptes/arch/x86/kvm/mmu.c   2014-07-09 
> > > > > 11:24:58.449632435 -0300
> > > > > @@ -1208,7 +1208,8 @@
> > > > >   *
> > > > >   * Return true if tlb need be flushed.
> > > > >   */
> > > > > -static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool 
> > > > > pt_protect)
> > > > > +static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool 
> > > > > pt_protect,
> > > > > +bool skip_pinned)
> > > > >  {
> > > > >   u64 spte = *sptep;
> > > > >  
> > > > > @@ -1218,6 +1219,22 @@
> > > > >  
> > > > >   rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, 
> > > > > *sptep);
> > > > >  
> > > > > + if (is_pinned_spte(spte)) {
> > > > > + /* keep pinned spte intact, mark page dirty again */
> > > > > + if (skip_pinned) {
> > > > > + struct kvm_mmu_page *sp;
> > > > > + gfn_t gfn;
> > > > > +
> > > > > + sp = page_header(__pa(sptep));
> > > > > + gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
> > > > > +
> > > > > + mark_page_dirty(kvm, gfn);
> > > > > + return false;
> > > > Why not mark all pinned gfns as dirty in kvm_vm_ioctl_get_dirty_log() 
> > > > while
> > > > populating dirty_bitmap_buffer?
> > > 
> > > The pinned gfns are per-vcpu. Requires looping all vcpus (not
> > > scalable).
> > > 
> > OK, but do they really have to be per-cpu? What's the advantage?
> 
> The guest configures pinning on a per-cpu basis (that is, enabling PEBS 
> is done on a per-cpu basis).
Is it a problem to maintain a global pinned page list for each memslot too?

> 
> > > 
> > > > > + } else
> > > > > + mmu_reload_pinned_vcpus(kvm);
> > > > Can you explain why you need this?
> > > 
> > > Because if skip_pinned = false, we want vcpus to exit (called
> > > from enable dirty logging codepath).
> > > 
> > I guess what I wanted to ask is: why do we need skip_pinned? As far as I
> > see, it is set to false in two cases:
> > 1: the page is write protected for shadow MMU needs; this should not
> > happen since the feature
> 
> Correct.
> 
> >is not supported with the shadow mmu (it can happen with nested EPT,
> > but the page will be marked as accessed during the next vcpu entry
> > anyway, so how will it work)?
> 
> PEBS is not supported on nested EPT.
> 
OK, so for this case we do not need skip_pinned. Assert if it happens.

> > 2: when a slot is marked as read only: such a slot cannot have PEBS pages,
> > and if it does, the guest will die anyway during the next guest entry, so
> > why not maintain a list of pinned pages per slot and kill the guest if a
> > slot with pinned pages is marked read only.
> 
> 2: when a slot's pages have dirty logging enabled. In that case, the page
> is marked dirty immediately. This is the call from
> kvm_mmu_slot_remove_write_access.
> 
Right, my 2 is incorrect.

> So when enabling dirty logging, for pinned sptes:
> 
> - keep the pinned spte intact.
> - mark the gfn which the pinned spte represents as dirty in the dirty
>   log.
> 
But you set skip_pinned to false in kvm_mmu_slot_remove_write_access(), so
this is not what is happening. Did you mean to set it to true there?

--
Gleb.


Re: [patch 3/4] KVM: MMU: reload request from GET_DIRTY_LOG path

2014-10-04 Thread Gleb Natapov
On Tue, Sep 09, 2014 at 12:28:11PM -0300, Marcelo Tosatti wrote:
> On Mon, Jul 21, 2014 at 04:14:24PM +0300, Gleb Natapov wrote:
> > On Wed, Jul 09, 2014 at 04:12:53PM -0300, mtosa...@redhat.com wrote:
> > > Reload remote vcpus MMU from GET_DIRTY_LOG codepath, before
> > > deleting a pinned spte.
> > > 
> > > Signed-off-by: Marcelo Tosatti 
> > > 
> > > ---
> > >  arch/x86/kvm/mmu.c |   29 +++--
> > >  1 file changed, 23 insertions(+), 6 deletions(-)
> > > 
> > > Index: kvm.pinned-sptes/arch/x86/kvm/mmu.c
> > > ===
> > > --- kvm.pinned-sptes.orig/arch/x86/kvm/mmu.c  2014-07-09 
> > > 11:23:59.290744490 -0300
> > > +++ kvm.pinned-sptes/arch/x86/kvm/mmu.c   2014-07-09 11:24:58.449632435 
> > > -0300
> > > @@ -1208,7 +1208,8 @@
> > >   *
> > >   * Return true if tlb need be flushed.
> > >   */
> > > -static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool 
> > > pt_protect)
> > > +static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool 
> > > pt_protect,
> > > +bool skip_pinned)
> > >  {
> > >   u64 spte = *sptep;
> > >  
> > > @@ -1218,6 +1219,22 @@
> > >  
> > >   rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
> > >  
> > > + if (is_pinned_spte(spte)) {
> > > + /* keep pinned spte intact, mark page dirty again */
> > > + if (skip_pinned) {
> > > + struct kvm_mmu_page *sp;
> > > + gfn_t gfn;
> > > +
> > > + sp = page_header(__pa(sptep));
> > > + gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
> > > +
> > > + mark_page_dirty(kvm, gfn);
> > > + return false;
> > Why not mark all pinned gfns as dirty in kvm_vm_ioctl_get_dirty_log() while
> > populating dirty_bitmap_buffer?
> 
> The pinned gfns are per-vcpu. Requires looping all vcpus (not
> scalable).
> 
OK, but do they really have to be per-cpu? What's the advantage?

> 
> > > + } else
> > > + mmu_reload_pinned_vcpus(kvm);
> > Can you explain why do you need this?
> 
> Because if skip_pinned = false, we want vcpus to exit (called
> from enable dirty logging codepath).
> 
I guess what I wanted to ask is: why do we need skip_pinned? As far as I can
see, it is set to false in two cases:
1: the page is write-protected for shadow MMU needs; this should not happen
   since the feature is not supported with shadow MMU (it can happen with
   nested EPT, but the page will be marked as accessed on the next vcpu entry
   anyway, so how will it work)?
2: when a slot is marked as read-only: such a slot cannot have PEBS pages,
   and if it does, the guest will die anyway during the next guest entry, so
   why not maintain a list of pinned pages per slot and kill the guest if a
   slot with pinned pages is marked read-only.
 
--
Gleb.


Re: [RFC] vPMU support for AMD system

2014-10-04 Thread Gleb Natapov
Hi,

On Tue, Sep 30, 2014 at 09:07:22PM -0500, Wei Huang wrote:
> Hi Paolo and Gleb,
> 
> The attached file is a preliminary version of AMD vPMU support for KVM.
> Currently I am working on a formal patch set; but realized that there are
> some design choice to make (see below). I thought it is better to send it
> out now, asking for your comments before sending out patchset v1.
> 
> If you are OK with current approach, please let me know. Otherwise please
> share your suggestions/comments on the design choice (see suggestions
> below). I will send out split patch set soon.
> 
I am not very familiar with the AMD PMU, but IIRC it lacks an architectural
PMU, so the first question that comes to my mind is:

> +static struct kvm_event_hw_type_mapping amd_event_mapping[] = {
> + [0] = { 0x76, 0x00, PERF_COUNT_HW_CPU_CYCLES },
> + [1] = { 0xc0, 0x00, PERF_COUNT_HW_INSTRUCTIONS },
> + [3] = { 0x80, 0x00, PERF_COUNT_HW_CACHE_REFERENCES },
> + [4] = { 0x81, 0x00, PERF_COUNT_HW_CACHE_MISSES },
> + [5] = { 0xc4, 0x00, PERF_COUNT_HW_BRANCH_INSTRUCTIONS },
> + [6] = { 0xc5, 0x00, PERF_COUNT_HW_BRANCH_MISSES },
> +};
> +
Do those events count exactly the same things on all AMD CPUs?
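
For readers unfamiliar with how such a table is consumed: vPMU code typically
matches the guest-programmed event select and unit mask against it to pick a
generic perf event. A hedged sketch, assuming the usual
eventsel/unit_mask/event_type field names; amd_find_arch_event is an
illustrative name, not code from the posted patch:

    static unsigned amd_find_arch_event(u8 event_select, u8 unit_mask)
    {
            int i;

            for (i = 0; i < ARRAY_SIZE(amd_event_mapping); i++)
                    if (amd_event_mapping[i].eventsel == event_select &&
                        amd_event_mapping[i].unit_mask == unit_mask)
                            return amd_event_mapping[i].event_type;

            /* no generic event: caller falls back to a raw perf config */
            return PERF_COUNT_HW_MAX;
    }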


--
Gleb.


Re: [PATCH 4/6] KVM: x86: Fix determining flat mode in recalculate_apic_map

2014-10-03 Thread Gleb Natapov
On Tue, Sep 30, 2014 at 08:49:17PM +0300, Nadav Amit wrote:
> Determining flat mode according to cid_mask is wrong, since currently KVM
> supports zero clusters in x2apic mode. Use ldr_bits instead.
As the comment above the 'if' you are fixing says, the code assumes all APICs
are in the same mode (if they are not, that's an OS bug). In that case the
code never gets here in x2apic mode.
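
For context, the flat/cluster distinction at issue: in xAPIC logical
destination mode the DFR selects how the 8-bit LDR is decoded. A hedged
sketch of the addressing rule itself, not of KVM's apic map code:

    /* does a logical-mode destination (mda) address this LAPIC? */
    static bool xapic_logical_match(u32 dfr, u8 ldr, u8 mda)
    {
            if (dfr == APIC_DFR_FLAT)
                    /* flat: all 8 LDR bits form a single bitmap */
                    return ldr & mda;

            /* cluster: high nibble = cluster id, low nibble = members */
            return (ldr >> 4) == (mda >> 4) && (ldr & mda & 0xf);
    }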

> 
> Since we recalculate the apic map whenever the DFR is changed, the bug appears
> to have no effect, and perhaps the entire check can be removed.
> 
> Signed-off-by: Nadav Amit 
> ---
>  arch/x86/kvm/lapic.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index b8345dd..cfce429 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -171,7 +171,7 @@ static void recalculate_apic_map(struct kvm *kvm)
>   new->cid_mask = (1 << KVM_X2APIC_CID_BITS) - 1;
>   new->lid_mask = 0xffff;
>   } else if (kvm_apic_sw_enabled(apic) &&
> - !new->cid_mask /* flat mode */ &&
> + new->ldr_bits == 8 /* flat mode */ &&
>   kvm_apic_get_reg(apic, APIC_DFR) == 
> APIC_DFR_CLUSTER) {
>   new->cid_shift = 4;
>   new->cid_mask = 0xf;
> -- 
> 1.9.1
> 

--
Gleb.


Re: [PATCH] kvm: don't take vcpu mutex for obviously invalid vcpu ioctls

2014-09-22 Thread Gleb Natapov
On Mon, Sep 22, 2014 at 09:29:19PM +0200, Paolo Bonzini wrote:
> Il 22/09/2014 21:20, Christian Borntraeger ha scritto:
> > "while using trinity to fuzz KVM, we noticed long stalls on invalid ioctls. 
> > Lets bail out early on invalid ioctls". or similar?
> 
> Okay.  David, can you explain how you found it so that I can make up my
> mind?
> 
> Gleb and Marcelo, a fourth and fifth opinion? :)
> 
I agree with Christian that the simpler fix is better here.
The overhead is minimal. If we ever notice this overhead
we can revert the patch altogether, since the problem it
fixes can only be inflicted on userspace by itself, and there
are myriad other ways userspace can hurt itself.

--
Gleb.


Re: Standardizing an MSR or other hypercall to get an RNG seed?

2014-09-19 Thread Gleb Natapov
On Fri, Sep 19, 2014 at 11:20:49AM -0700, Andy Lutomirski wrote:
> [cc: Alok Kataria at VMware]
> 
> On Fri, Sep 19, 2014 at 11:12 AM, Gleb Natapov  wrote:
> > On Fri, Sep 19, 2014 at 11:02:38AM -0700, Andy Lutomirski wrote:
> >> On Fri, Sep 19, 2014 at 10:49 AM, Gleb Natapov  wrote:
> >> > On Fri, Sep 19, 2014 at 10:18:37AM -0700, H. Peter Anvin wrote:
> >> >> On 09/19/2014 10:15 AM, Gleb Natapov wrote:
> >> >> > On Fri, Sep 19, 2014 at 10:08:20AM -0700, H. Peter Anvin wrote:
> >> >> >> On 09/19/2014 09:53 AM, Gleb Natapov wrote:
> >> >> >>> On Fri, Sep 19, 2014 at 09:40:07AM -0700, H. Peter Anvin wrote:
> >> >> >>>> On 09/19/2014 09:37 AM, Gleb Natapov wrote:
> >> >> >>>>>
> >> >>>>> Linux detects what hypervisor it runs on very early
> >> >> >>>>
> >> >> >>>> Not anywhere close to early enough.  We're talking for uses like 
> >> >> >>>> kASLR.
> >> >> >>>>
> >> >>> Still too early to do:
> >> >>>
> >> >>>    h = cpuid(HYPERVISOR_SIGNATURE)
> >> >>>    if (h == KVMKVMKVM) {
> >> >>>       if (cpuid(kvm_features) & kvm_rnd)
> >> >>>          rdmsr(kvm_rnd)
> >> >>>    } else if (h == HyperV) {
> >> >>>       if (cpuid(hv_features) & hv_rnd)
> >> >>>          rdmsr(hv_rnd)
> >> >>>    } else if (h == XenXenXen) {
> >> >>>       if (cpuid(xen_features) & xen_rnd)
> >> >>>          rdmsr(xen_rnd)
> >> >>>    }
> >> >> >>>
> >> >> >>
> >> >> >> If we need to do chase loops, especially not so...
> >> >> >>
> >> >> > What loops exactly? As a non native English speaker I fail to 
> >> >> > understand
> >> >> > if your answer is "yes" or "no" ;)
> >> >> >
> >> >>
> >> >> The above isn't actually the full algorithm used.
> >> >>
> >> > What part of the actual algorithm cannot be implemented? The loop that
> >> > searches for the KVM leaf in case KVM pretends to be HyperV (is this what
> >> > you called "chase loops"?)? First of all, there is no need to implement
> >> > it: if KVM pretends to be HyperV, use HyperV's way to obtain the seed.
> >> > But what is the problem with the loop?
> >> >
> >>
> >> It can be implemented, and I've done it.  But it's a mess.  Almost the
> >> very first thing we do in boot (even before decompressing the kernel)
> >> will be to scan a bunch of cpuid leaves looking for a hypervisor with
> >> an rng source that we can use for kASLR.  And we'll have to update
> >> that code and make it bigger every time another hypervisor adds
> >> exactly the same feature.
> > IMO implementing this feature is in hypervisor's best interest, so the task
> > of updating the code will scale by virtue of hypervisor's developers each
> > adding it for hypervisor he cares about.
> 
> I assume that you mean guest, not hypervisor.
> 
Yes, I mean guest support for the hypervisor they care about.

> >
> >>
> >> And then we have another copy of almost exactly the same code in the
> >> normal post-boot part of the kernel.
> >>
> >> We can certainly do this, but I'd much rather solve the problem once
> >> and let all of the hypervisors and guests opt in and immediately be
> >> compatible with each other.
> >>
> >> > I "forgot" VMware because I do not see VMware people to be CCed. They may
> >> > be even less excited about them being told _how_ this feature need to be
> >> > implemented (e.g implement HyperV leafs for the feature detection). I
> >> > do not want to and cannot speak for VMware, but my guess is that for
> >> > them it would be much easier to add an else clause for VMware in above
> >> > "if" then to coordinate with all hypervisor developers about MSR/cpuid
> >> > details. And since this is security feature implementing it for Linux
> >> > is in their best interest.
> >>
> >> Do you know any of them who should be cc'd?
> >>
> > No, not anyone in particular. git log arch/x86/kernel/cpu/vmware.c may help.
> >
> > But VMware is an elephant in the room here. There are other hypervisors out 
> > there.
> > VirtualBox, bhyve...
> 
> Exactly.  The amount of effort to get everything to be compatible with
> everything scales quadratically in the number of hypervisors, and the
> probability that some combination is broken also increases.
> 
The effort is distributed equally among hypervisor developers. If they
want Linux to be more secure on their hypervisor, they contribute guest
code. They need to write the hypervisor part anyway. On CPUs with the
RDRAND instruction this MSR is not even needed, and some hypervisors may
decide that support for old CPUs is not worth the effort. A unified
interface does not help if the hypervisor does not implement it.

--
Gleb.


Re: Standardizing an MSR or other hypercall to get an RNG seed?

2014-09-19 Thread Gleb Natapov
On Fri, Sep 19, 2014 at 11:02:38AM -0700, Andy Lutomirski wrote:
> On Fri, Sep 19, 2014 at 10:49 AM, Gleb Natapov  wrote:
> > On Fri, Sep 19, 2014 at 10:18:37AM -0700, H. Peter Anvin wrote:
> >> On 09/19/2014 10:15 AM, Gleb Natapov wrote:
> >> > On Fri, Sep 19, 2014 at 10:08:20AM -0700, H. Peter Anvin wrote:
> >> >> On 09/19/2014 09:53 AM, Gleb Natapov wrote:
> >> >>> On Fri, Sep 19, 2014 at 09:40:07AM -0700, H. Peter Anvin wrote:
> >> >>>> On 09/19/2014 09:37 AM, Gleb Natapov wrote:
> >> >>>>>
> >> >>>>> Linux detects what hypervisor it runs on very early
> >> >>>>
> >> >>>> Not anywhere close to early enough.  We're talking for uses like 
> >> >>>> kASLR.
> >> >>>>
> >> >>> Still too early to do:
> >> >>>
> >> >>>    h = cpuid(HYPERVISOR_SIGNATURE)
> >> >>>    if (h == KVMKVMKVM) {
> >> >>>       if (cpuid(kvm_features) & kvm_rnd)
> >> >>>          rdmsr(kvm_rnd)
> >> >>>    } else if (h == HyperV) {
> >> >>>       if (cpuid(hv_features) & hv_rnd)
> >> >>>          rdmsr(hv_rnd)
> >> >>>    } else if (h == XenXenXen) {
> >> >>>       if (cpuid(xen_features) & xen_rnd)
> >> >>>          rdmsr(xen_rnd)
> >> >>>    }
> >> >>>
> >> >>
> >> >> If we need to do chase loops, especially not so...
> >> >>
> >> > What loops exactly? As a non native English speaker I fail to understand
> >> > if your answer is "yes" or "no" ;)
> >> >
> >>
> >> The above isn't actually the full algorithm used.
> >>
> > What part of the actual algorithm cannot be implemented? The loop that
> > searches for the KVM leaf in case KVM pretends to be HyperV (is this what
> > you called "chase loops"?)? First of all, there is no need to implement
> > it: if KVM pretends to be HyperV, use HyperV's way to obtain the seed.
> > But what is the problem with the loop?
> >
> 
> It can be implemented, and I've done it.  But it's a mess.  Almost the
> very first thing we do in boot (even before decompressing the kernel)
> will be to scan a bunch of cpuid leaves looking for a hypervisor with
> an rng source that we can use for kASLR.  And we'll have to update
> that code and make it bigger every time another hypervisor adds
> exactly the same feature.
IMO implementing this feature is in each hypervisor's best interest, so the
task of updating the code will scale by virtue of hypervisor developers each
adding it for the hypervisor they care about.

> 
> And then we have another copy of almost exactly the same code in the
> normal post-boot part of the kernel.
> 
> We can certainly do this, but I'd much rather solve the problem once
> and let all of the hypervisors and guests opt in and immediately be
> compatible with each other.
> 
> > I "forgot" VMware because I do not see VMware people to be CCed. They may
> > be even less excited about them being told _how_ this feature need to be
> > implemented (e.g implement HyperV leafs for the feature detection). I
> > do not want to and cannot speak for VMware, but my guess is that for
> > them it would be much easier to add an else clause for VMware in above
> > "if" then to coordinate with all hypervisor developers about MSR/cpuid
> > details. And since this is security feature implementing it for Linux
> > is in their best interest.
> 
> Do you know any of them who should be cc'd?
> 
No, not anyone in particular. git log arch/x86/kernel/cpu/vmware.c may help.

But VMware is an elephant in the room here. There are other hypervisors out 
there.
VirtualBox, bhyve...

--
Gleb.


Re: Standardizing an MSR or other hypercall to get an RNG seed?

2014-09-19 Thread Gleb Natapov
On Fri, Sep 19, 2014 at 10:21:27AM -0700, Andy Lutomirski wrote:
> On Sep 19, 2014 9:53 AM, "Gleb Natapov"  wrote:
> >
> > On Fri, Sep 19, 2014 at 09:40:07AM -0700, H. Peter Anvin wrote:
> > > On 09/19/2014 09:37 AM, Gleb Natapov wrote:
> > > >
> > > > Linux detects what hypervisor it runs on very early
> > >
> > > Not anywhere close to early enough.  We're talking for uses like kASLR.
> > >
> > Still too early to do:
> >
> >    h = cpuid(HYPERVISOR_SIGNATURE)
> >    if (h == KVMKVMKVM) {
> >       if (cpuid(kvm_features) & kvm_rnd)
> >          rdmsr(kvm_rnd)
> >    } else if (h == HyperV) {
> >       if (cpuid(hv_features) & hv_rnd)
> >          rdmsr(hv_rnd)
> >    } else if (h == XenXenXen) {
> >       if (cpuid(xen_features) & xen_rnd)
> >          rdmsr(xen_rnd)
> >    }
> >
> 
> I think that there's a lot of value in having each guest
> implementation be automatically compatible with all hypervisors.  For
> example, you forgot VMware, and VMware might be less excited about
> implementing this feature if all the guests won't immediately start
> using it.
> 
I "forgot" VMware because I do not see VMware people to be CCed. They may
be even less excited about them being told _how_ this feature need to be
implemented (e.g implement HyperV leafs for the feature detection). I
do not want to and cannot speak for VMware, but my guess is that for
them it would be much easier to add an else clause for VMware in above
"if" then to coordinate with all hypervisor developers about MSR/cpuid
details. And since this is security feature implementing it for Linux
is in their best interest.

--
Gleb.


Re: Standardizing an MSR or other hypercall to get an RNG seed?

2014-09-19 Thread Gleb Natapov
On Fri, Sep 19, 2014 at 10:18:37AM -0700, H. Peter Anvin wrote:
> On 09/19/2014 10:15 AM, Gleb Natapov wrote:
> > On Fri, Sep 19, 2014 at 10:08:20AM -0700, H. Peter Anvin wrote:
> >> On 09/19/2014 09:53 AM, Gleb Natapov wrote:
> >>> On Fri, Sep 19, 2014 at 09:40:07AM -0700, H. Peter Anvin wrote:
> >>>> On 09/19/2014 09:37 AM, Gleb Natapov wrote:
> >>>>>
> >>>>> Linux detects what hypervisor it runs on very early
> >>>>
> >>>> Not anywhere close to early enough.  We're talking for uses like kASLR.
> >>>>
> >>> Still too early to do:
> >>>
> >>>    h = cpuid(HYPERVISOR_SIGNATURE)
> >>>    if (h == KVMKVMKVM) {
> >>>       if (cpuid(kvm_features) & kvm_rnd)
> >>>          rdmsr(kvm_rnd)
> >>>    } else if (h == HyperV) {
> >>>       if (cpuid(hv_features) & hv_rnd)
> >>>          rdmsr(hv_rnd)
> >>>    } else if (h == XenXenXen) {
> >>>       if (cpuid(xen_features) & xen_rnd)
> >>>          rdmsr(xen_rnd)
> >>>    }
> >>>
> >>
> >> If we need to do chase loops, especially not so...
> >>
> > What loops exactly? As a non native English speaker I fail to understand
> > if your answer is "yes" or "no" ;)
> > 
> 
> The above isn't actually the full algorithm used.
> 
What part of the actual algorithm cannot be implemented? The loop that
searches for the KVM leaf in case KVM pretends to be HyperV (is this what
you called "chase loops"?)? First of all, there is no need to implement it:
if KVM pretends to be HyperV, use HyperV's way to obtain the seed. But what
is the problem with the loop?

--
Gleb.


Re: Standardizing an MSR or other hypercall to get an RNG seed?

2014-09-19 Thread Gleb Natapov
On Fri, Sep 19, 2014 at 10:08:20AM -0700, H. Peter Anvin wrote:
> On 09/19/2014 09:53 AM, Gleb Natapov wrote:
> > On Fri, Sep 19, 2014 at 09:40:07AM -0700, H. Peter Anvin wrote:
> >> On 09/19/2014 09:37 AM, Gleb Natapov wrote:
> >>>
> >>> Linux detects what hypervisor it runs on very early
> >>
> >> Not anywhere close to early enough.  We're talking for uses like kASLR.
> >>
> > Still too early to do:
> >
> >    h = cpuid(HYPERVISOR_SIGNATURE)
> >    if (h == KVMKVMKVM) {
> >       if (cpuid(kvm_features) & kvm_rnd)
> >          rdmsr(kvm_rnd)
> >    } else if (h == HyperV) {
> >       if (cpuid(hv_features) & hv_rnd)
> >          rdmsr(hv_rnd)
> >    } else if (h == XenXenXen) {
> >       if (cpuid(xen_features) & xen_rnd)
> >          rdmsr(xen_rnd)
> >    }
> > 
> 
> If we need to do chase loops, especially not so...
> 
What loops exactly? As a non-native English speaker I fail to understand
whether your answer is "yes" or "no" ;)

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Standardizing an MSR or other hypercall to get an RNG seed?

2014-09-19 Thread Gleb Natapov
On Fri, Sep 19, 2014 at 09:40:07AM -0700, H. Peter Anvin wrote:
> On 09/19/2014 09:37 AM, Gleb Natapov wrote:
> >
> > Linux detects what hypervisor it runs on very early
> 
> Not anywhere close to early enough.  We're talking for uses like kASLR.
> 
Still too early to do:

   h = cpuid(HYPERVISOR_SIGNATURE)
   if (h == KVMKVMKVM) {
      if (cpuid(kvm_features) & kvm_rnd)
         rdmsr(kvm_rnd)
   } else if (h == HyperV) {
      if (cpuid(hv_features) & hv_rnd)
         rdmsr(hv_rnd)
   } else if (h == XenXenXen) {
      if (cpuid(xen_features) & xen_rnd)
         rdmsr(xen_rnd)
   }

?
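
Spelled out as C, that detect-then-read flow might look like the sketch
below. The 0x40000000/0x40000001 leaves are the conventional hypervisor
CPUID range; the feature bit and MSR index here are illustrative
assumptions, not an established ABI:

    #include <stdint.h>
    #include <string.h>

    #define KVM_RNG_FEATURE_BIT  (1u << 13)   /* assumed, not a real ABI bit */
    #define KVM_RNG_SEED_MSR     0x4b564d09   /* assumed, not a real MSR */

    static void cpuid(uint32_t leaf, uint32_t *a, uint32_t *b,
                      uint32_t *c, uint32_t *d)
    {
            asm volatile("cpuid"
                         : "=a"(*a), "=b"(*b), "=c"(*c), "=d"(*d)
                         : "0"(leaf));
    }

    static uint64_t rdmsr(uint32_t msr)
    {
            uint32_t lo, hi;

            asm volatile("rdmsr" : "=a"(lo), "=d"(hi) : "c"(msr));
            return ((uint64_t)hi << 32) | lo;
    }

    /* sketch: return a hypervisor-provided seed, or 0 if none is offered */
    static uint64_t get_hv_rng_seed(void)
    {
            uint32_t eax, sig[3], feat, unused;

            cpuid(0x40000000, &eax, &sig[0], &sig[1], &sig[2]);
            if (memcmp(sig, "KVMKVMKVM\0\0\0", 12) == 0) {
                    cpuid(0x40000001, &feat, &unused, &unused, &unused);
                    if (feat & KVM_RNG_FEATURE_BIT)
                            return rdmsr(KVM_RNG_SEED_MSR);
            }
            /* ...analogous else-if branches for Hyper-V, Xen, etc... */
            return 0;
    }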

--
Gleb.


Re: Standardizing an MSR or other hypercall to get an RNG seed?

2014-09-19 Thread Gleb Natapov
On Thu, Sep 18, 2014 at 03:00:05PM -0700, Andy Lutomirski wrote:
> On Thu, Sep 18, 2014 at 2:46 PM, David Hepkin  wrote:
> > I suggest we come to consensus on a specific CPUID leaf where an OS needs 
> > to look to determine if a hypervisor supports this capability.  We could 
> > define a new CPUID leaf range at a well-defined location, or we could just 
> > use one of the existing CPUID leaf ranges implemented by an existing 
> > hypervisor.  I'm not familiar with the KVM CPUID leaf range, but in the 
> > case of Hyper-V, the Hyper-V CPUID leaf range was architected to allow for 
> > other hypervisors to implement it and just show through specific 
> > capabilities supported by the hypervisor.  So, we could define a bit in the 
> > Hyper-V CPUID leaf range (since Xen and KVM also implement this range), but 
> > that would require Linux to look in that range on boot to discover this 
> > capability.
> 
> I also don't know whether QEMU and KVM would be okay with implementing
> the host side of the Hyper-V mechanism by default.  They would have to
> implement at least leaves 0x40000001 and 0x40000002, plus correctly
> reporting zeros through whatever leaf is used for this new feature.
> Gleb?  Paolo?
> 
KVM and every other hypervisor out there already implement a capability
detection mechanism in the 0x40000000 CPUID range, and of course all of
them do it differently. Linux detects what hypervisor it runs on very early
and switches to the corresponding code to handle each hypervisor. Quite
frankly, I do not see what problem you are trying to fix by standardizing
an MSR to get a random seed plus a detection mechanism for this MSR. The
RNG MSR is in no way unique here. There are other mechanisms that are
virtually identical between hypervisors but have different guest/hypervisor
interfaces and are detected differently on different hypervisors. Examples
are pvclock and pveoi; there may be others.
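
Concretely, the differing detection comes down to scanning the hypervisor
CPUID region for a known signature; Linux steps through it in 0x100
increments when KVM hides behind a Hyper-V identity. A hedged sketch,
reusing the cpuid() helper (and headers) from the earlier sketch:

    static uint32_t find_hv_base(const char sig[12])
    {
            uint32_t base, eax, got[3];

            for (base = 0x40000000; base < 0x40010000; base += 0x100) {
                    cpuid(base, &eax, &got[0], &got[1], &got[2]);
                    if (memcmp(got, sig, 12) == 0)
                            return base;   /* vendor leaves start here */
            }
            return 0;                      /* signature not found */
    }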

--
Gleb.


Re: [PATCH v2] kvm: Faults which trigger IO release the mmap_sem

2014-09-17 Thread Gleb Natapov
On Wed, Sep 17, 2014 at 10:51:48AM -0700, Andres Lagar-Cavilla wrote:
> When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory
> has been swapped out or is behind a filemap, this will trigger async
> readahead and return immediately. The rationale is that KVM will kick
> back the guest with an "async page fault" and allow for some other
> guest process to take over.
> 
> If async PFs are enabled the fault is retried asap from an async
> workqueue. If not, it's retried immediately in the same code path. In
> either case the retry will not relinquish the mmap semaphore and will
> block on the IO. This is a bad thing, as other mmap semaphore users
> now stall as a function of swap or filemap latency.
> 
> This patch ensures both the regular and async PF path re-enter the
> fault allowing for the mmap semaphore to be relinquished in the case
> of IO wait.
> 
Reviewed-by: Gleb Natapov 

> Reviewed-by: Radim Krčmář 
> Signed-off-by: Andres Lagar-Cavilla 
> 
> ---
> v1 -> v2
> 
> * WARN_ON_ONCE -> VM_WARN_ON_ONCE
> * pagep == NULL skips the final retry
> * kvm_gup_retry -> kvm_gup_io
> * Comment updates throughout
> ---
>  include/linux/kvm_host.h | 11 +++
>  include/linux/mm.h   |  1 +
>  mm/gup.c |  4 
>  virt/kvm/async_pf.c  |  4 +---
>  virt/kvm/kvm_main.c  | 49 
> +---
>  5 files changed, 63 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3addcbc..4c1991b 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -198,6 +198,17 @@ int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, 
> unsigned long hva,
>  int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
>  #endif
>  
> +/*
> + * Carry out a gup that requires IO. Allow the mm to relinquish the mmap
> + * semaphore if the filemap/swap has to wait on a page lock. pagep == NULL
> + * controls whether we retry the gup one more time to completion in that 
> case.
> + * Typically this is called after a FAULT_FLAG_RETRY_NOWAIT in the main tdp
> + * handler.
> + */
> +int kvm_get_user_page_io(struct task_struct *tsk, struct mm_struct *mm,
> +  unsigned long addr, bool write_fault,
> +  struct page **pagep);
> +
>  enum {
>   OUTSIDE_GUEST_MODE,
>   IN_GUEST_MODE,
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ebc5f90..13e585f7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2011,6 +2011,7 @@ static inline struct page *follow_page(struct 
> vm_area_struct *vma,
>  #define FOLL_HWPOISON    0x100   /* check page is hwpoisoned */
>  #define FOLL_NUMA        0x200   /* force NUMA hinting page fault */
>  #define FOLL_MIGRATION   0x400   /* wait for page to replace migration 
> entry */
> +#define FOLL_TRIED   0x800   /* a retry, previous pass started an IO */
>  
>  typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
>   void *data);
> diff --git a/mm/gup.c b/mm/gup.c
> index 91d044b..af7ea3e 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -281,6 +281,10 @@ static int faultin_page(struct task_struct *tsk, struct 
> vm_area_struct *vma,
>   fault_flags |= FAULT_FLAG_ALLOW_RETRY;
>   if (*flags & FOLL_NOWAIT)
>   fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
> + if (*flags & FOLL_TRIED) {
> + VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
> + fault_flags |= FAULT_FLAG_TRIED;
> + }
>  
>   ret = handle_mm_fault(mm, vma, address, fault_flags);
>   if (ret & VM_FAULT_ERROR) {
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index d6a3d09..5ff7f7f 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -80,9 +80,7 @@ static void async_pf_execute(struct work_struct *work)
>  
>   might_sleep();
>  
> - down_read(&mm->mmap_sem);
> - get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL);
> - up_read(&mm->mmap_sem);
> + kvm_get_user_page_io(NULL, mm, addr, 1, NULL);
>   kvm_async_page_present_sync(vcpu, apf);
>  
>   spin_lock(&vcpu->async_pf.lock);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 7ef6b48..fa8a565 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1115,6 +1115,43 @@ static int get_user_page_nowait(struct task_struct 
> *tsk, struct mm_struct *mm,
>   return __get_user_pages(tsk, mm, start, 1, flags, page, NULL, NULL);
>  }
>  
> +int kvm_get_user_page_io(struct task_struct *tsk, stru

Re: [PATCH v2] kvm: Faults which trigger IO release the mmap_sem

2014-09-17 Thread Gleb Natapov
On Thu, Sep 18, 2014 at 08:29:17AM +0800, Wanpeng Li wrote:
> Hi Andres,
> On Wed, Sep 17, 2014 at 10:51:48AM -0700, Andres Lagar-Cavilla wrote:
> [...]
> > static inline int check_user_page_hwpoison(unsigned long addr)
> > {
> > int rc, flags = FOLL_TOUCH | FOLL_HWPOISON | FOLL_WRITE;
> >@@ -1177,9 +1214,15 @@ static int hva_to_pfn_slow(unsigned long addr, bool 
> >*async, bool write_fault,
> > npages = get_user_page_nowait(current, current->mm,
> >   addr, write_fault, page);
> > up_read(¤t->mm->mmap_sem);
> >-} else
> >-npages = get_user_pages_fast(addr, 1, write_fault,
> >- page);
> >+} else {
> >+/*
> >+ * By now we have tried gup_fast, and possibly async_pf, and we
> >+ * are certainly not atomic. Time to retry the gup, allowing
> >+ * mmap semaphore to be relinquished in the case of IO.
> >+ */
> >+npages = kvm_get_user_page_io(current, current->mm, addr,
> >+  write_fault, page);
> >+}
> 
> try_async_pf 
>  gfn_to_pfn_async 
>   __gfn_to_pfnasync = false 
*async = false

>__gfn_to_pfn_memslot
> hva_to_pfn 
>  hva_to_pfn_fast
>  hva_to_pfn_slow
hva_to_pfn_slow checks async, not *async.
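
That is, the branch tests the async pointer itself, not the value it points
to; a simplified sketch of the shape in the patch quoted above:

    /* inside hva_to_pfn_slow(), simplified sketch */
    if (async) {            /* caller can tolerate an async page fault */
            down_read(&current->mm->mmap_sem);
            npages = get_user_page_nowait(current, current->mm,
                                          addr, write_fault, page);
            up_read(&current->mm->mmap_sem);
    } else {                /* synchronous caller: may block on the IO */
            npages = kvm_get_user_page_io(current, current->mm,
                                          addr, write_fault, page);
    }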

> kvm_get_user_page_io
> 
> page will always be ready after kvm_get_user_page_io, which means the
> async page fault mechanism no longer has anything to do.
> 
> Regards,
> Wanpeng Li
> 
> > if (npages != 1)
> > return npages;
> > 
> >-- 
> >2.1.0.rc2.206.gedb03e5
> >

--
Gleb.


Re: [PATCH] kvm: Faults which trigger IO release the mmap_sem

2014-09-17 Thread Gleb Natapov
On Wed, Sep 17, 2014 at 10:13:45AM -0700, Andres Lagar-Cavilla wrote:
> On Wed, Sep 17, 2014 at 10:08 AM, Gleb Natapov  wrote:
> > On Wed, Sep 17, 2014 at 10:00:32AM -0700, Andres Lagar-Cavilla wrote:
> >> On Wed, Sep 17, 2014 at 4:42 AM, Gleb Natapov  wrote:
> >> > On Wed, Sep 17, 2014 at 01:27:14PM +0200, Radim Krčmář wrote:
> >> >> 2014-09-17 13:26+0300, Gleb Natapov:
> >> >> > For async_pf_execute() you do not need to even retry. Next guest's 
> >> >> > page fault
> >> >> > will retry it for you.
> >> >>
> >> >> Wouldn't that be a waste of vmentries?
> >> > This is how it will work with or without this second gup. Page is not
> >> > mapped into a shadow page table on this path, it happens on a next fault.
> >>
> >> The point is that the gup in the async pf completion from the work
> >> queue will not relinquish the mmap semaphore. And it most definitely
> >> should, given that we are likely looking at swap/filemap.
> >>
> > I get this point and the patch looks good in general, but my point is
> > that when _retry() is called from async_pf_execute() second gup is not
> > needed. In the original code gup is called to do IO and nothing else.
> > In your patch this is accomplished by the first gup already, so you
> > can skip second gup if pagep == nullptr.
> 
> I see. However, if this function were to be used elsewhere in the
> future, then the "if pagep == NULL don't retry" semantics may not
> match the new caller's intention. Would you prefer an explicit flag?
> 
We can add an explicit flag whenever such a caller is added, if ever.

--
Gleb.


Re: [PATCH] kvm: Faults which trigger IO release the mmap_sem

2014-09-17 Thread Gleb Natapov
On Wed, Sep 17, 2014 at 10:00:32AM -0700, Andres Lagar-Cavilla wrote:
> On Wed, Sep 17, 2014 at 4:42 AM, Gleb Natapov  wrote:
> > On Wed, Sep 17, 2014 at 01:27:14PM +0200, Radim Krčmář wrote:
> >> 2014-09-17 13:26+0300, Gleb Natapov:
> >> > For async_pf_execute() you do not need to even retry. Next guest's page 
> >> > fault
> >> > will retry it for you.
> >>
> >> Wouldn't that be a waste of vmentries?
> > This is how it will work with or without this second gup. The page is not
> > mapped into a shadow page table on this path; that happens on the next fault.
> 
> The point is that the gup in the async pf completion from the work
> queue will not relinquish the mmap semaphore. And it most definitely
> should, given that we are likely looking at swap/filemap.
> 
I get this point and the patch looks good in general, but my point is
that when _retry() is called from async_pf_execute() the second gup is not
needed. In the original code gup is called to do the IO and nothing else.
In your patch this is accomplished by the first gup already, so you
can skip the second gup if pagep == NULL.
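
As a hedged sketch of that suggestion inside kvm_get_user_page_io() (names
follow the patch; the pagep == NULL early return is the proposed addition):

    down_read(&mm->mmap_sem);
    npages = __get_user_pages(tsk, mm, addr, 1, flags, pagep, NULL, &locked);
    if (!locked) {
            /* mmap_sem was dropped and the IO has been kicked off */
            BUG_ON(npages != -EBUSY);

            /* an async_pf caller (pagep == NULL) only wanted the IO
             * started; the next guest fault maps the page, so no retry */
            if (!pagep)
                    return 0;

            down_read(&mm->mmap_sem);
            npages = __get_user_pages(tsk, mm, addr, 1, flags | FOLL_TRIED,
                                      pagep, NULL, NULL);
    }
    up_read(&mm->mmap_sem);
    return npages;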

--
Gleb.


Re: [PATCH] kvm: Faults which trigger IO release the mmap_sem

2014-09-17 Thread Gleb Natapov
On Wed, Sep 17, 2014 at 01:27:14PM +0200, Radim Krčmář wrote:
> 2014-09-17 13:26+0300, Gleb Natapov:
> > For async_pf_execute() you do not need to even retry. Next guest's page 
> > fault
> > will retry it for you.
> 
> Wouldn't that be a waste of vmentries?
This is how it will work with or without this second gup. The page is not
mapped into a shadow page table on this path; that happens on the next fault.

--
Gleb.


Re: [PATCH] kvm: Faults which trigger IO release the mmap_sem

2014-09-17 Thread Gleb Natapov
On Mon, Sep 15, 2014 at 01:11:25PM -0700, Andres Lagar-Cavilla wrote:
> When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory has been
> swapped out or is behind a filemap, this will trigger async readahead and
> return immediately. The rationale is that KVM will kick back the guest with an
> "async page fault" and allow for some other guest process to take over.
> 
> If async PFs are enabled the fault is retried asap from a workqueue, or
> immediately if no async PFs. The retry will not relinquish the mmap semaphore
> and will block on the IO. This is a bad thing, as other mmap semaphore users
> now stall. The fault could take a long time, depending on swap or filemap
> latency.
> 
> This patch ensures both the regular and async PF path re-enter the fault
> allowing for the mmap semaphore to be relinquished in the case of IO wait.
> 
> Signed-off-by: Andres Lagar-Cavilla 
> ---
>  include/linux/kvm_host.h |  9 +
>  include/linux/mm.h   |  1 +
>  mm/gup.c |  4 
>  virt/kvm/async_pf.c  |  4 +---
>  virt/kvm/kvm_main.c  | 45 ++---
>  5 files changed, 57 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3addcbc..704908d 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -198,6 +198,15 @@ int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, 
> unsigned long hva,
>  int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
>  #endif
>  
> +/*
> + * Retry a fault after a gup with FOLL_NOWAIT. This properly relinquishes 
> mmap
> + * semaphore if the filemap/swap has to wait on page lock (and retries the 
> gup
> + * to completion after that).
> + */
> +int kvm_get_user_page_retry(struct task_struct *tsk, struct mm_struct *mm,
> + unsigned long addr, bool write_fault,
> + struct page **pagep);
> +
>  enum {
>   OUTSIDE_GUEST_MODE,
>   IN_GUEST_MODE,
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ebc5f90..13e585f7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2011,6 +2011,7 @@ static inline struct page *follow_page(struct 
> vm_area_struct *vma,
>  #define FOLL_HWPOISON    0x100   /* check page is hwpoisoned */
>  #define FOLL_NUMA        0x200   /* force NUMA hinting page fault */
>  #define FOLL_MIGRATION   0x400   /* wait for page to replace migration 
> entry */
> +#define FOLL_TRIED   0x800   /* a retry, previous pass started an IO */
>  
>  typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
>   void *data);
> diff --git a/mm/gup.c b/mm/gup.c
> index 91d044b..332d1c3 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -281,6 +281,10 @@ static int faultin_page(struct task_struct *tsk, struct 
> vm_area_struct *vma,
>   fault_flags |= FAULT_FLAG_ALLOW_RETRY;
>   if (*flags & FOLL_NOWAIT)
>   fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
> + if (*flags & FOLL_TRIED) {
> + WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
> + fault_flags |= FAULT_FLAG_TRIED;
> + }
>  
>   ret = handle_mm_fault(mm, vma, address, fault_flags);
>   if (ret & VM_FAULT_ERROR) {
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index d6a3d09..17b78b1 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -80,9 +80,7 @@ static void async_pf_execute(struct work_struct *work)
>  
>   might_sleep();
>  
> - down_read(&mm->mmap_sem);
> - get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL);
> - up_read(&mm->mmap_sem);
> + kvm_get_user_page_retry(NULL, mm, addr, 1, NULL);
>   kvm_async_page_present_sync(vcpu, apf);
>  
>   spin_lock(&vcpu->async_pf.lock);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 7ef6b48..43a9ab9 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1115,6 +1115,39 @@ static int get_user_page_nowait(struct task_struct 
> *tsk, struct mm_struct *mm,
>   return __get_user_pages(tsk, mm, start, 1, flags, page, NULL, NULL);
>  }
>  
> +int kvm_get_user_page_retry(struct task_struct *tsk, struct mm_struct *mm,
> + unsigned long addr, bool write_fault,
> + struct page **pagep)
> +{
> + int npages;
> + int locked = 1;
> + int flags = FOLL_TOUCH | FOLL_HWPOISON |
> + (pagep ? FOLL_GET : 0) |
> + (write_fault ? FOLL_WRITE : 0);
> +
> + /*
> +  * Retrying fault, we get here *not* having allowed the filemap to wait
> +  * on the page lock. We should now allow waiting on the IO with the
> +  * mmap semaphore released.
> +  */
> + down_read(&mm->mmap_sem);
> + npages = __get_user_pages(tsk, mm, addr, 1, flags, pagep, NULL,
> +   &locked);
> + if (!locked) {
> + BUG_ON(npages != -EBUSY);

Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().

2014-09-11 Thread Gleb Natapov
On Thu, Sep 11, 2014 at 04:37:39PM +0200, Paolo Bonzini wrote:
> Il 11/09/2014 16:31, Gleb Natapov ha scritto:
> >> > What if the page being swapped out is L1's APIC access page?  We don't
> >> > run prepare_vmcs12 in that case because it's an L2->L0->L2 entry, so we
> >> > need to "do something".
> > We will do something on L2->L1 exit. We will call 
> > kvm_reload_apic_access_page().
> > That is what patch 5 of this series is doing.
> 
> Sorry, I meant "the APIC access page prepared by L1" for L2's execution.
> 
> You wrote:
> 
> > if (!is_guest_mode() || !(vmcs12->secondary_vm_exec_control &
> >     SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
> >         write(APIC_ACCESS_ADDR)
> > 
> > In other words if L2 shares L1 apic access page then reload, otherwise do 
> > nothing.
> 
> but in that case you have to redo nested_get_page, so "do nothing"
> doesn't work.
> 
Ah, 7/7 is new in this submission. Before that this page was still
pinned.  Looking at 7/7 now, I do not see how it can work, since it has no
code for the mmu notifier to detect that it is dealing with such a page and
to call kvm_reload_apic_access_page().  I said to Tang previously that
nested kvm has a bunch of pinned pages that are hard to deal with, and
suggested ironing out the non-nested case first :(

--
Gleb.


Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().

2014-09-11 Thread Gleb Natapov
On Thu, Sep 11, 2014 at 04:24:04PM +0200, Paolo Bonzini wrote:
> Il 11/09/2014 16:21, Gleb Natapov ha scritto:
> > As far as I can tell the if that is needed there is:
> > 
> > if (!is_guest_mode() || !(vmcs12->secondary_vm_exec_control &
> >     SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
> >         write(APIC_ACCESS_ADDR)
> > 
> > In other words if L2 shares L1 apic access page then reload, otherwise do 
> > nothing.
> 
> What if the page being swapped out is L1's APIC access page?  We don't
> run prepare_vmcs12 in that case because it's an L2->L0->L2 entry, so we
> need to "do something".
We will do something on L2->L1 exit. We will call kvm_reload_apic_access_page().
That is what patch 5 of this series is doing.

--
Gleb.


Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().

2014-09-11 Thread Gleb Natapov
On Thu, Sep 11, 2014 at 04:06:58PM +0200, Paolo Bonzini wrote:
> Il 11/09/2014 15:59, Gleb Natapov ha scritto:
> > 
> > Suppose vmcs01->APIC_ACCESS_ADDR = 0xf000. During L2 entry
> > vmcs02->APIC_ACCESS_ADDR is set to 0xf000 too (by prepare_vmcs02). Now
> > 0xf000 is migrated to 0x8000, mmu notifier is called, it forces vmexit,
> > but vcpu is in a guest mode so vmcs02->APIC_ACCESS_ADDR is never updated
> > to 0x8000 because of "if (!is_guest_mode(vcpu))" check. So what am I
> > missing here?
> 
> Right, guest mode isn't left as soon as you execute nested_vmx_vmexit,
> because this isn't an L2->L1 exit.  So we need an "else" for that "if
> (!is_guest_mode(vcpu))", in which case the hpa is ignored and
> vmcs12->apic_access_addr is used instead?
> 
As far as I can tell, the if that is needed there is:

if (!is_guest_mode() || !(vmcs12->secondary_vm_exec_control &
    SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
        write(APIC_ACCESS_ADDR)

In other words, if L2 shares L1's apic access page then reload; otherwise do
nothing.
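
Dropped into the vmx hook, the guard could look like this hedged sketch (the
hook would need the vcpu rather than the kvm pointer used in the patch;
get_vmcs12() and nested_cpu_has2() are existing vmx helpers):

    static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu, hpa_t hpa)
    {
            struct vmcs12 *vmcs12 = get_vmcs12(vcpu);

            /* rewrite vmcs02 only when L2 runs on L1's APIC access page;
             * otherwise vmcs12->apic_access_addr stays in charge */
            if (!is_guest_mode(vcpu) ||
                !nested_cpu_has2(vmcs12,
                                 SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
                    vmcs_write64(APIC_ACCESS_ADDR, hpa);
    }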

--
Gleb.


Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().

2014-09-11 Thread Gleb Natapov
On Thu, Sep 11, 2014 at 03:05:05PM +0200, Paolo Bonzini wrote:
> Il 11/09/2014 13:30, Gleb Natapov ha scritto:
> >> > +vmcs_write64(APIC_ACCESS_ADDR, 
> >> > page_to_phys(page));
> >> > +/*
> >> > + * Do not pin apic access page in memory so 
> >> > that memory
> >> > + * hotplug process is able to migrate it.
> >> > + */
> >> > +put_page(page);
> >> >  }
> > This code is in prepare_vmcs02() and is executed during L1->L2 vmentry.
> > What happens when the apic access page is migrated while L2 is running?
> > It needs to be updated somewhere.
> 
> Before it is migrated, the MMU notifier is called and will force a
> vmexit on all CPUs.  The reload code will call GUP on the page again
> and swap it in.
> 
This is how it will work without "if (!is_guest_mode(vcpu))". But,
unless I am missing something, with this check it will not work while
the vcpu is in L2.

Suppose vmcs01->APIC_ACCESS_ADDR = 0xf000. During L2 entry
vmcs02->APIC_ACCESS_ADDR is set to 0xf000 too (by prepare_vmcs02). Now
0xf000 is migrated to 0x8000, the mmu notifier is called, it forces a
vmexit, but the vcpu is in guest mode, so vmcs02->APIC_ACCESS_ADDR is never
updated to 0x8000 because of the "if (!is_guest_mode(vcpu))" check. So what
am I missing here?

--
Gleb.


Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().

2014-09-11 Thread Gleb Natapov
On Thu, Sep 11, 2014 at 12:47:16PM +0200, Paolo Bonzini wrote:
> Il 11/09/2014 12:12, Gleb Natapov ha scritto:
> > On Thu, Sep 11, 2014 at 11:21:49AM +0200, Paolo Bonzini wrote:
> >> Il 11/09/2014 07:38, Tang Chen ha scritto:
> >>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >>> index 63c4c3e..da6d55d 100644
> >>> --- a/arch/x86/kvm/vmx.c
> >>> +++ b/arch/x86/kvm/vmx.c
> >>> @@ -7093,6 +7093,11 @@ static void vmx_set_virtual_x2apic_mode(struct 
> >>> kvm_vcpu *vcpu, bool set)
> >>>   vmx_set_msr_bitmap(vcpu);
> >>>  }
> >>>  
> >>> +static void vmx_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa)
> >>> +{
> >>> + vmcs_write64(APIC_ACCESS_ADDR, hpa);
> >>
> >> This has to be guarded by "if (!is_guest_mode(vcpu))".
> >>
> > We do need to write it if L1 and L2 share APIC_ACCESS_ADDR and skip
> > it otherwise, no?
> 
> Yes, but this would be handled by patch 6:
> 
>   } else if (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm)) {
> + struct page *page = gfn_to_page(vmx->vcpu.kvm,
> + APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
>   exec_control |=
>   SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> - vmcs_write64(APIC_ACCESS_ADDR,
> - page_to_phys(vcpu->kvm->arch.apic_access_page));
> + vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(page));
> + /*
> +  * Do not pin apic access page in memory so that memory
> +  * hotplug process is able to migrate it.
> +  */
> + put_page(page);
>   }
This code is in prepare_vmcs02() and is executed during L1->L2 vmentry. What
happens when the apic access page is migrated while L2 is running? It needs
to be updated somewhere.

> 
> However, this is also useless code duplication because the above snippet could
> reuse vcpu_reload_apic_access_page too.
> 
> So I think you cannot do the is_guest_mode check in
> kvm_vcpu_reload_apic_access_page and also not in
> vmx_reload_apic_access_page.  But you could do something like
> 
> kvm_vcpu_reload_apic_access_page(...)
> {
>   ...
>   kvm_x86_ops->reload_apic_access_page(...);
> }
> EXPORT_SYMBOL_GPL(kvm_vcpu_reload_apic_access_page);
> 
> /* used in vcpu_enter_guest only */
> vcpu_reload_apic_access_page(...)
> {
>   if (!is_guest_mode(vcpu))
>   kvm_vcpu_reload_apic_access_page(...)
> }
> 
> Paolo

--
Gleb.


Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().

2014-09-11 Thread Gleb Natapov
On Thu, Sep 11, 2014 at 11:21:49AM +0200, Paolo Bonzini wrote:
> Il 11/09/2014 07:38, Tang Chen ha scritto:
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 63c4c3e..da6d55d 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -7093,6 +7093,11 @@ static void vmx_set_virtual_x2apic_mode(struct 
> > kvm_vcpu *vcpu, bool set)
> > vmx_set_msr_bitmap(vcpu);
> >  }
> >  
> > +static void vmx_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa)
> > +{
> > +   vmcs_write64(APIC_ACCESS_ADDR, hpa);
> 
> This has to be guarded by "if (!is_guest_mode(vcpu))".
> 
We do need to write it if L1 and L2 share APIC_ACCESS_ADDR and skip
it otherwise, no?

> > +}
> > +

--
Gleb.


Re: [PATCH v4 4/6] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().

2014-09-10 Thread Gleb Natapov
On Tue, Sep 09, 2014 at 03:13:07PM +0800, tangchen wrote:
> Hi Gleb,
> 
> On 09/03/2014 11:04 PM, Gleb Natapov wrote:
> >On Wed, Sep 03, 2014 at 09:42:30AM +0800, tangchen wrote:
> >>Hi Gleb,
> >>
> >>On 09/03/2014 12:00 AM, Gleb Natapov wrote:
> >>>..
> >>>+static void vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
> >>>+{
> >>>+  /*
> >>>+   * apic access page could be migrated. When the page is being migrated,
> >>>+   * GUP will wait till the migrate entry is replaced with the new pte
> >>>+   * entry pointing to the new page.
> >>>+   */
> >>>+  vcpu->kvm->arch.apic_access_page = gfn_to_page(vcpu->kvm,
> >>>+  APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
> >>>+  kvm_x86_ops->set_apic_access_page_addr(vcpu->kvm,
> >>>+  page_to_phys(vcpu->kvm->arch.apic_access_page));
> >>>I am a little bit worried that here all vcpus write to 
> >>>vcpu->kvm->arch.apic_access_page
> >>>without any locking. It is probably benign since pointer write is atomic 
> >>>on x86. Paolo?
> >>>
> >>>Do we even need apic_access_page? Why not call
> >>>  gfn_to_page(APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT)
> >>>  put_page()
> >>>on rare occasions we need to know its address?
> >>Isn't it a necessary item defined in hardware spec ?
> >>
> >vcpu->kvm->arch.apic_access_page? No. This is internal kvm data structure.
> >
> >>I didn't read intel spec deeply, but according to the code, the page's
> >>address is
> >>written into vmcs. And it made me think that we cannot remove it.
> >>
> >We cannot remove writing of apic page address into vmcs, but this is not 
> >done by
> >assigning to vcpu->kvm->arch.apic_access_page, but by vmwrite in 
> >set_apic_access_page_addr().
> 
> OK, I'll try to remove kvm->arch.apic_access_page and send a patch for it
> soon.
> 
> BTW, if you don't have objection to the first two patches, would you please
> help to
> commit them first ?
> 
I acked them and CCed Paolo on this reply. I hope he will look at the series
too.
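
The suggestion quoted above, i.e. drop the cached pointer and look the page
up only when the address is needed, as a hedged sketch:

    static void vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
    {
            struct page *page;

            /* gfn_to_page() waits out any in-flight migration of the page */
            page = gfn_to_page(vcpu->kvm,
                               APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
            kvm_x86_ops->set_apic_access_page_addr(vcpu->kvm,
                                                   page_to_phys(page));
            /* keep no reference, so the page stays migratable */
            put_page(page);
    }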

--
Gleb.


Re: [PATCH v4 2/6] kvm: Remove ept_identity_pagetable from struct kvm_arch.

2014-09-10 Thread Gleb Natapov
On Wed, Aug 27, 2014 at 06:17:37PM +0800, Tang Chen wrote:
> kvm_arch->ept_identity_pagetable holds the ept identity pagetable page. But
> it is never used to refer to the page at all.
> 
> In vcpu initialization, it indicates two things:
> 1. indicates if ept page is allocated
> 2. indicates if a memory slot for identity page is initialized
> 
> Actually, kvm_arch->ept_identity_pagetable_done is enough to tell if the ept
> identity pagetable is initialized. So we can remove ept_identity_pagetable.
> 
> NOTE: In the original code, the ept identity pagetable page is pinned in memory.
>   As a result, it cannot be migrated/hot-removed. After this patch, since
>   kvm_arch->ept_identity_pagetable is removed, ept identity pagetable page
>   is no longer pinned in memory. And it can be migrated/hot-removed.
Reviewed-by: Gleb Natapov 

> 
> Signed-off-by: Tang Chen 
> ---
>  arch/x86/include/asm/kvm_host.h |  1 -
>  arch/x86/kvm/vmx.c  | 50 
> -
>  arch/x86/kvm/x86.c  |  2 --
>  3 files changed, 25 insertions(+), 28 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 7c492ed..35171c7 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -580,7 +580,6 @@ struct kvm_arch {
>  
>   gpa_t wall_clock;
>  
> - struct page *ept_identity_pagetable;
>   bool ept_identity_pagetable_done;
>   gpa_t ept_identity_map_addr;
>  
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 4b80ead..953d529 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -743,6 +743,7 @@ static u32 vmx_segment_access_rights(struct kvm_segment 
> *var);
>  static void vmx_sync_pir_to_irr_dummy(struct kvm_vcpu *vcpu);
>  static void copy_vmcs12_to_shadow(struct vcpu_vmx *vmx);
>  static void copy_shadow_to_vmcs12(struct vcpu_vmx *vmx);
> +static int alloc_identity_pagetable(struct kvm *kvm);
>  
>  static DEFINE_PER_CPU(struct vmcs *, vmxarea);
>  static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
> @@ -3938,21 +3939,27 @@ out:
>  
>  static int init_rmode_identity_map(struct kvm *kvm)
>  {
> - int i, idx, r, ret;
> + int i, idx, r, ret = 0;
>   pfn_t identity_map_pfn;
>   u32 tmp;
>  
>   if (!enable_ept)
>   return 1;
> - if (unlikely(!kvm->arch.ept_identity_pagetable)) {
> - printk(KERN_ERR "EPT: identity-mapping pagetable "
> - "haven't been allocated!\n");
> - return 0;
> +
> + /* Protect kvm->arch.ept_identity_pagetable_done. */
> + mutex_lock(&kvm->slots_lock);
> +
> + if (likely(kvm->arch.ept_identity_pagetable_done)) {
> + ret = 1;
> + goto out2;
>   }
> - if (likely(kvm->arch.ept_identity_pagetable_done))
> - return 1;
> - ret = 0;
> +
>   identity_map_pfn = kvm->arch.ept_identity_map_addr >> PAGE_SHIFT;
> +
> + r = alloc_identity_pagetable(kvm);
> + if (r)
> + goto out2;
> +
>   idx = srcu_read_lock(&kvm->srcu);
>   r = kvm_clear_guest_page(kvm, identity_map_pfn, 0, PAGE_SIZE);
>   if (r < 0)
> @@ -3970,6 +3977,9 @@ static int init_rmode_identity_map(struct kvm *kvm)
>   ret = 1;
>  out:
>   srcu_read_unlock(&kvm->srcu, idx);
> +
> +out2:
> + mutex_unlock(&kvm->slots_lock);
>   return ret;
>  }
>  
> @@ -4019,31 +4029,23 @@ out:
>  
>  static int alloc_identity_pagetable(struct kvm *kvm)
>  {
> - struct page *page;
> + /*
> +  * In init_rmode_identity_map(), kvm->arch.ept_identity_pagetable_done
> +  * is checked before calling this function and set to true after the
> +  * calling. The access to kvm->arch.ept_identity_pagetable_done should
> +  * be protected by kvm->slots_lock.
> +  */
> +
>   struct kvm_userspace_memory_region kvm_userspace_mem;
>   int r = 0;
>  
> - mutex_lock(&kvm->slots_lock);
> - if (kvm->arch.ept_identity_pagetable)
> - goto out;
>   kvm_userspace_mem.slot = IDENTITY_PAGETABLE_PRIVATE_MEMSLOT;
>   kvm_userspace_mem.flags = 0;
>   kvm_userspace_mem.guest_phys_addr =
>   kvm->arch.ept_identity_map_addr;
>   kvm_userspace_mem.memory_size = PAGE_SIZE;
>   r = __kvm_set_memory_region(kvm, &kvm_userspace_mem);
> - if (r)
> - goto out;
>  
> - page = gfn_to_page(kvm, kvm->arch.ept_identity_map_addr >> PAGE_SHIFT);
> - if 

Re: [PATCH v4 1/6] kvm: Use APIC_DEFAULT_PHYS_BASE macro as the apic access page address.

2014-09-10 Thread Gleb Natapov
On Wed, Aug 27, 2014 at 06:17:36PM +0800, Tang Chen wrote:
> We have APIC_DEFAULT_PHYS_BASE defined as 0xfee00000, which is also the
> address of the apic access page. So use this macro.
Reviewed-by: Gleb Natapov 

> 
> Signed-off-by: Tang Chen 
> ---
>  arch/x86/kvm/svm.c | 3 ++-
>  arch/x86/kvm/vmx.c | 6 +++---
>  2 files changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index ddf7427..1d941ad 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -1257,7 +1257,8 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm 
> *kvm, unsigned int id)
>   svm->asid_generation = 0;
>   init_vmcb(svm);
>  
> - svm->vcpu.arch.apic_base = 0xfee00000 | MSR_IA32_APICBASE_ENABLE;
> + svm->vcpu.arch.apic_base = APIC_DEFAULT_PHYS_BASE |
> +MSR_IA32_APICBASE_ENABLE;
>   if (kvm_vcpu_is_bsp(&svm->vcpu))
>   svm->vcpu.arch.apic_base |= MSR_IA32_APICBASE_BSP;
>  
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index bfe11cf..4b80ead 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -3999,13 +3999,13 @@ static int alloc_apic_access_page(struct kvm *kvm)
>   goto out;
>   kvm_userspace_mem.slot = APIC_ACCESS_PAGE_PRIVATE_MEMSLOT;
>   kvm_userspace_mem.flags = 0;
> - kvm_userspace_mem.guest_phys_addr = 0xfee00000ULL;
> + kvm_userspace_mem.guest_phys_addr = APIC_DEFAULT_PHYS_BASE;
>   kvm_userspace_mem.memory_size = PAGE_SIZE;
>   r = __kvm_set_memory_region(kvm, &kvm_userspace_mem);
>   if (r)
>   goto out;
>  
> - page = gfn_to_page(kvm, 0xfee00);
> + page = gfn_to_page(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
>   if (is_error_page(page)) {
>   r = -EFAULT;
>   goto out;
> @@ -4477,7 +4477,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu)
>  
>   vmx->vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val();
>   kvm_set_cr8(&vmx->vcpu, 0);
> - apic_base_msr.data = 0xfee00000 | MSR_IA32_APICBASE_ENABLE;
> + apic_base_msr.data = APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE;
>   if (kvm_vcpu_is_bsp(&vmx->vcpu))
>   apic_base_msr.data |= MSR_IA32_APICBASE_BSP;
>   apic_base_msr.host_initiated = true;
> -- 
> 1.8.3.1
> 

--
Gleb.


Re: [PATCH 3/4] KVM: x86: inject nested page faults on emulated instructions

2014-09-05 Thread Gleb Natapov
On Thu, Sep 04, 2014 at 07:44:51PM +0200, Paolo Bonzini wrote:
> Il 04/09/2014 17:05, Gleb Natapov ha scritto:
> >> > 
> >> > If you do that, KVM gets down to the "if (writeback)" and writes the 
> >> > ctxt->eip from L2 into the L1 EIP.
> > Heh, that's a bummer. We should not write back if an instruction caused a 
> > vmexit.
> > 
> 
> You're right, that works.
Looks good!

Reviewed-by: Gleb Natapov 

> 
> Paolo
> 
> -- 8< -
> Subject: [PATCH] KVM: x86: skip writeback on injection of nested exception
> 
> If a nested page fault happens during emulation, we will inject a vmexit,
> not a page fault.  However because writeback happens after the injection,
> we will write ctxt->eip from L2 into the L1 EIP.  We do not write back
> if an instruction caused an interception vmexit---do the same for page
> faults.
> 
> Signed-off-by: Paolo Bonzini 
> ---
>  arch/x86/include/asm/kvm_host.h |  1 -
>  arch/x86/kvm/x86.c  | 15 ++-
>  2 files changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 08cc299..c989651 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -893,7 +893,6 @@ void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct 
> x86_exception *fault);
>  int kvm_read_guest_page_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>   gfn_t gfn, void *data, int offset, int len,
>   u32 access);
> -void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault);
>  bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl);
>  
>  static inline int __kvm_irq_line_state(unsigned long *irq_state,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e4ed85e..3541946 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -408,12 +408,14 @@ void kvm_inject_page_fault(struct kvm_vcpu *vcpu, 
> struct x86_exception *fault)
>  }
>  EXPORT_SYMBOL_GPL(kvm_inject_page_fault);
>  
> -void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault)
> +static bool kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception 
> *fault)
>  {
>   if (mmu_is_nested(vcpu) && !fault->nested_page_fault)
>   vcpu->arch.nested_mmu.inject_page_fault(vcpu, fault);
>   else
>   vcpu->arch.mmu.inject_page_fault(vcpu, fault);
> +
> + return fault->nested_page_fault;
>  }
>  
>  void kvm_inject_nmi(struct kvm_vcpu *vcpu)
> @@ -4929,16 +4931,18 @@ static void toggle_interruptibility(struct kvm_vcpu 
> *vcpu, u32 mask)
>   }
>  }
>  
> -static void inject_emulated_exception(struct kvm_vcpu *vcpu)
> +static bool inject_emulated_exception(struct kvm_vcpu *vcpu)
>  {
>   struct x86_emulate_ctxt *ctxt = &vcpu->arch.emulate_ctxt;
>   if (ctxt->exception.vector == PF_VECTOR)
> - kvm_propagate_fault(vcpu, &ctxt->exception);
> - else if (ctxt->exception.error_code_valid)
> + return kvm_propagate_fault(vcpu, &ctxt->exception);
> +
> + if (ctxt->exception.error_code_valid)
>   kvm_queue_exception_e(vcpu, ctxt->exception.vector,
> ctxt->exception.error_code);
>   else
>   kvm_queue_exception(vcpu, ctxt->exception.vector);
> + return false;
>  }
>  
>  static void init_emulate_ctxt(struct kvm_vcpu *vcpu)
> @@ -5300,8 +5304,9 @@ restart:
>   }
>  
>   if (ctxt->have_exception) {
> - inject_emulated_exception(vcpu);
>   r = EMULATE_DONE;
> + if (inject_emulated_exception(vcpu))
> + return r;
>   } else if (vcpu->arch.pio.count) {
>   if (!vcpu->arch.pio.in) {
>   /* FIXME: return into emulator if single-stepping.  */
> -- 
> 1.9.3
> 
> 

--
Gleb.


Re: [PATCH 3/4] KVM: x86: inject nested page faults on emulated instructions

2014-09-04 Thread Gleb Natapov
On Thu, Sep 04, 2014 at 04:12:19PM +0200, Paolo Bonzini wrote:
> On 04/09/2014 09:02, Gleb Natapov wrote:
> > On Tue, Sep 02, 2014 at 05:13:49PM +0200, Paolo Bonzini wrote:
> >> > This is required for the following patch to work correctly.  If a nested 
> >> > page
> >> > fault happens during emulation, we must inject a vmexit, not a page 
> >> > fault.
> >> > Luckily we already have the required machinery: it is enough to return
> >> > X86EMUL_INTERCEPTED instead of X86EMUL_PROPAGATE_FAULT.
> >> > 
> > I wonder why this patch is needed. X86EMUL_PROPAGATE_FAULT causes
> > ctxt->have_exception to be set to true in x86_emulate_insn().
> > x86_emulate_instruction() checks ctxt->have_exception and calls
> > inject_emulated_exception() if it is true. inject_emulated_exception()
> > calls kvm_propagate_fault() where we check if the fault was nested and
> > generate vmexit or a page fault accordingly.
> 
> Good question. :)
> 
> If you do that, KVM gets down to the "if (writeback)" and writes the 
> ctxt->eip from L2 into the L1 EIP.
Heh, that's a bummer. We should not write back if an instruction caused a 
vmexit.

> 
> Possibly this patch can be replaced by just this?
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 022513b..475e979 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5312,7 +5312,7 @@ restart:
>  
> if (ctxt->have_exception) {
> inject_emulated_exception(vcpu);
> -   r = EMULATE_DONE;
> +   return EMULATE_DONE;
If there was no vmexit we still want to write back. Perhaps:
writeback = inject_emulated_exception(vcpu);
and return false if there was a vmexit due to a nested page fault (or any
fault; can't L1 ask for #GP/#UD intercepts that need to be handled here too?)

> } else if (vcpu->arch.pio.count) {
> if (!vcpu->arch.pio.in) {
> /* FIXME: return into emulator if single-stepping.  */
> 
> But I'm not sure how to test it, and I like the idea of treating nested page
> faults like other nested vmexits during emulation (which is what this patch
> does).
IMO exits due to instruction intercepts and exits due to other interceptable
events that may happen during instruction emulation are sufficiently different
to be handled slightly differently. If my assumption about #GP above is
correct, with the current approach it can be easily handled inside
inject_emulated_exception().
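
In code, the suggestion above would look roughly like this (a sketch; it
assumes inject_emulated_exception() is changed to return whether the
exception was turned into a vmexit):

	if (ctxt->have_exception) {
		r = EMULATE_DONE;
		/* skip register writeback if injection became a vmexit */
		writeback = !inject_emulated_exception(vcpu);
	}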

--
Gleb.


Re: [PATCH 3/4] KVM: x86: inject nested page faults on emulated instructions

2014-09-04 Thread Gleb Natapov
On Tue, Sep 02, 2014 at 05:13:49PM +0200, Paolo Bonzini wrote:
> This is required for the following patch to work correctly.  If a nested page
> fault happens during emulation, we must inject a vmexit, not a page fault.
> Luckily we already have the required machinery: it is enough to return
> X86EMUL_INTERCEPTED instead of X86EMUL_PROPAGATE_FAULT.
> 
I wonder why this patch is needed. X86EMUL_PROPAGATE_FAULT causes
ctxt->have_exception to be set to true in x86_emulate_insn().
x86_emulate_instruction() checks ctxt->have_exception and calls
inject_emulated_exception() if it is true. inject_emulated_exception()
calls kvm_propagate_fault() where we check if the fault was nested and
generate vmexit or a page fault accordingly.
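
Condensed, that path in the then-current code is:

	/* x86_emulate_instruction(): */
	if (ctxt->have_exception)
		inject_emulated_exception(vcpu);

	/* inject_emulated_exception(): */
	if (ctxt->exception.vector == PF_VECTOR)
		kvm_propagate_fault(vcpu, &ctxt->exception);

	/* kvm_propagate_fault(): */
	if (mmu_is_nested(vcpu) && !fault->nested_page_fault)
		vcpu->arch.nested_mmu.inject_page_fault(vcpu, fault);
	else
		vcpu->arch.mmu.inject_page_fault(vcpu, fault);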

> Reported-by: Valentine Sinitsyn 
> Signed-off-by: Paolo Bonzini 
> ---
>  arch/x86/kvm/x86.c | 18 ++
>  1 file changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e4ed85e07a01..9e3b74c044ed 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -416,6 +416,16 @@ void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct 
> x86_exception *fault)
>   vcpu->arch.mmu.inject_page_fault(vcpu, fault);
>  }
>  
> +static inline int kvm_propagate_or_intercept(struct kvm_vcpu *vcpu,
> +  struct x86_exception *exception)
> +{
> + if (likely(!exception->nested_page_fault))
> + return X86EMUL_PROPAGATE_FAULT;
> +
> + vcpu->arch.mmu.inject_page_fault(vcpu, exception);
> + return X86EMUL_INTERCEPTED;
> +}
> +
>  void kvm_inject_nmi(struct kvm_vcpu *vcpu)
>  {
>   atomic_inc(&vcpu->arch.nmi_queued);
> @@ -4122,7 +4132,7 @@ static int kvm_read_guest_virt_helper(gva_t addr, void 
> *val, unsigned int bytes,
>   int ret;
>  
>   if (gpa == UNMAPPED_GVA)
> - return X86EMUL_PROPAGATE_FAULT;
> + return kvm_propagate_or_intercept(vcpu, exception);
>   ret = kvm_read_guest_page(vcpu->kvm, gpa >> PAGE_SHIFT, data,
> offset, toread);
>   if (ret < 0) {
> @@ -4152,7 +4162,7 @@ static int kvm_fetch_guest_virt(struct x86_emulate_ctxt 
> *ctxt,
>   gpa_t gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, addr, 
> access|PFERR_FETCH_MASK,
>   exception);
>   if (unlikely(gpa == UNMAPPED_GVA))
> - return X86EMUL_PROPAGATE_FAULT;
> + return kvm_propagate_or_intercept(vcpu, exception);
>  
>   offset = addr & (PAGE_SIZE-1);
>   if (WARN_ON(offset + bytes > PAGE_SIZE))
> @@ -4203,7 +4213,7 @@ int kvm_write_guest_virt_system(struct x86_emulate_ctxt 
> *ctxt,
>   int ret;
>  
>   if (gpa == UNMAPPED_GVA)
> - return X86EMUL_PROPAGATE_FAULT;
> + return kvm_propagate_or_intercept(vcpu, exception);
>   ret = kvm_write_guest(vcpu->kvm, gpa, data, towrite);
>   if (ret < 0) {
>   r = X86EMUL_IO_NEEDED;
> @@ -4350,7 +4360,7 @@ static int emulator_read_write_onepage(unsigned long 
> addr, void *val,
>   ret = vcpu_mmio_gva_to_gpa(vcpu, addr, &gpa, exception, write);
>  
>   if (ret < 0)
> - return X86EMUL_PROPAGATE_FAULT;
> + return kvm_propagate_or_intercept(vcpu, exception);
>  
>   /* For APIC access vmexit */
>   if (ret)
> -- 
> 1.8.3.1
> 
> 

--
Gleb.


Re: [PATCH v4 5/6] kvm, mem-hotplug: Reload L1's apic access page on migration when L2 is running.

2014-09-03 Thread Gleb Natapov
On Wed, Aug 27, 2014 at 06:17:40PM +0800, Tang Chen wrote:
> This patch only handle "L1 and L2 vm share one apic access page" situation.
> 
> When L1 vm is running, if the shared apic access page is migrated, 
> mmu_notifier will
> request all vcpus to exit to L0, and reload apic access page physical address 
> for
> all the vcpus' vmcs (which is done by patch 5/6). And when it enters L2 vm, 
> L2's vmcs
> will be updated in prepare_vmcs02() called by nested_vm_run(). So we need to 
> do
> nothing.
> 
> When L2 vm is running, if the shared apic access page is migrated, 
> mmu_notifier will
> request all vcpus to exit to L0, and reload apic access page physical address 
> for
> all L2 vmcs. And this patch requests apic access page reload in L2->L1 vmexit.
> 
> Signed-off-by: Tang Chen 
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/svm.c  |  6 ++
>  arch/x86/kvm/vmx.c  | 32 
>  arch/x86/kvm/x86.c  |  3 +++
>  virt/kvm/kvm_main.c |  1 +
>  5 files changed, 43 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 514183e..13fbb62 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -740,6 +740,7 @@ struct kvm_x86_ops {
>   void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
>   void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set);
>   void (*set_apic_access_page_addr)(struct kvm *kvm, hpa_t hpa);
> + void (*set_nested_apic_page_migrated)(struct kvm_vcpu *vcpu, bool set);
>   void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector);
>   void (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
>   int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index f2eacc4..da88646 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -3624,6 +3624,11 @@ static void svm_set_apic_access_page_addr(struct kvm 
> *kvm, hpa_t hpa)
>   return;
>  }
>  
> +static void svm_set_nested_apic_page_migrated(struct kvm_vcpu *vcpu, bool 
> set)
> +{
> + return;
> +}
> +
>  static int svm_vm_has_apicv(struct kvm *kvm)
>  {
>   return 0;
> @@ -4379,6 +4384,7 @@ static struct kvm_x86_ops svm_x86_ops = {
>   .update_cr8_intercept = update_cr8_intercept,
>   .set_virtual_x2apic_mode = svm_set_virtual_x2apic_mode,
>   .set_apic_access_page_addr = svm_set_apic_access_page_addr,
> + .set_nested_apic_page_migrated = svm_set_nested_apic_page_migrated,
>   .vm_has_apicv = svm_vm_has_apicv,
>   .load_eoi_exitmap = svm_load_eoi_exitmap,
>   .hwapic_isr_update = svm_hwapic_isr_update,
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index da6d55d..9035fd1 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -379,6 +379,16 @@ struct nested_vmx {
>* we must keep them pinned while L2 runs.
>*/
>   struct page *apic_access_page;
> + /*
> +  * L1's apic access page can be migrated. When L1 and L2 are sharing
> +  * the apic access page, after the page is migrated when L2 is running,
> +  * we have to reload it to L1 vmcs before we enter L1.
> +  *
> +  * When the shared apic access page is migrated in L1 mode, we don't
> +  * need to do anything else because we reload apic access page each
> +  * time when entering L2 in prepare_vmcs02().
> +  */
> + bool apic_access_page_migrated;
>   u64 msr_ia32_feature_control;
>  
>   struct hrtimer preemption_timer;
> @@ -7098,6 +7108,12 @@ static void vmx_set_apic_access_page_addr(struct kvm 
> *kvm, hpa_t hpa)
>   vmcs_write64(APIC_ACCESS_ADDR, hpa);
>  }
>  
> +static void vmx_set_nested_apic_page_migrated(struct kvm_vcpu *vcpu, bool 
> set)
> +{
> + struct vcpu_vmx *vmx = to_vmx(vcpu);
> + vmx->nested.apic_access_page_migrated = set;
> +}
> +
>  static void vmx_hwapic_isr_update(struct kvm *kvm, int isr)
>  {
>   u16 status;
> @@ -8796,6 +8812,21 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, 
> u32 exit_reason,
>   }
>  
>   /*
> +  * When shared (L1 & L2) apic access page is migrated during L2 is
> +  * running, mmu_notifier will force to reload the page's hpa for L2
> +  * vmcs. Need to reload it for L1 before entering L1.
> +  */
> + if (vmx->nested.apic_access_page_migrated) {
> + /*
> +  * Do not call kvm_reload_apic_access_page() because we are now
> +  * in L2. We should not call make_all_cpus_request() to exit to
> +  * L0, otherwise we will reload for L2 vmcs again.
> +  */
> + kvm_reload_apic_access_page(vcpu->kvm);
> + vmx->nested.apic_access_page_migrated = false;
> + }
I would just call kvm_reload_apic_access_page() unconditionally, and only if
it proves to be a performance problem would I optimize it further.

Re: [PATCH v4 4/6] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().

2014-09-03 Thread Gleb Natapov
On Wed, Sep 03, 2014 at 09:42:30AM +0800, tangchen wrote:
> Hi Gleb,
> 
> On 09/03/2014 12:00 AM, Gleb Natapov wrote:
> >..
> >+static void vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
> >+{
> >+/*
> >+ * apic access page could be migrated. When the page is being migrated,
> >+ * GUP will wait till the migrate entry is replaced with the new pte
> >+ * entry pointing to the new page.
> >+ */
> >+vcpu->kvm->arch.apic_access_page = gfn_to_page(vcpu->kvm,
> >+APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
> >+kvm_x86_ops->set_apic_access_page_addr(vcpu->kvm,
> >+page_to_phys(vcpu->kvm->arch.apic_access_page));
> >I am a little bit worried that here all vcpus write to 
> >vcpu->kvm->arch.apic_access_page
> >without any locking. It is probably benign since pointer write is atomic on 
> >x86. Paolo?
> >
> >Do we even need apic_access_page? Why not call
> >  gfn_to_page(APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT)
> >  put_page()
> >on rare occasions we need to know its address?
> 
> Isn't it a necessary item defined in hardware spec ?
> 
vcpu->kvm->arch.apic_access_page? No. This is an internal kvm data structure.

> I didn't read intel spec deeply, but according to the code, the page's
> address is
> written into vmcs. And it made me think that we cannnot remove it.
> 
We cannot remove writing the apic page address into the vmcs, but that is done
not by assigning to vcpu->kvm->arch.apic_access_page but by the vmwrite in
set_apic_access_page_addr().

--
Gleb.


Re: [PATCH v4 4/6] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().

2014-09-02 Thread Gleb Natapov
On Wed, Aug 27, 2014 at 06:17:39PM +0800, Tang Chen wrote:
> apic access page is pinned in memory. As a result, it cannot be 
> migrated/hot-removed.
> Actually, it is not necessary to be pinned.
> 
> The hpa of apic access page is stored in VMCS APIC_ACCESS_ADDR pointer. When
> the page is migrated, kvm_mmu_notifier_invalidate_page() will invalidate the
> corresponding ept entry. This patch introduces a new vcpu request named
> KVM_REQ_APIC_PAGE_RELOAD, and makes this request to all the vcpus at this 
> time,
> and force all the vcpus exit guest, and re-enter guest till they updates the 
> VMCS
> APIC_ACCESS_ADDR pointer to the new apic access page address, and updates
> kvm->arch.apic_access_page to the new page.
> 
> Signed-off-by: Tang Chen 
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/svm.c  |  6 ++
>  arch/x86/kvm/vmx.c  |  6 ++
>  arch/x86/kvm/x86.c  | 15 +++
>  include/linux/kvm_host.h|  2 ++
>  virt/kvm/kvm_main.c | 12 
>  6 files changed, 42 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 35171c7..514183e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -739,6 +739,7 @@ struct kvm_x86_ops {
>   void (*hwapic_isr_update)(struct kvm *kvm, int isr);
>   void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
>   void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set);
> + void (*set_apic_access_page_addr)(struct kvm *kvm, hpa_t hpa);
>   void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector);
>   void (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
>   int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 1d941ad..f2eacc4 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -3619,6 +3619,11 @@ static void svm_set_virtual_x2apic_mode(struct 
> kvm_vcpu *vcpu, bool set)
>   return;
>  }
>  
> +static void svm_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa)
> +{
> + return;
> +}
> +
>  static int svm_vm_has_apicv(struct kvm *kvm)
>  {
>   return 0;
> @@ -4373,6 +4378,7 @@ static struct kvm_x86_ops svm_x86_ops = {
>   .enable_irq_window = enable_irq_window,
>   .update_cr8_intercept = update_cr8_intercept,
>   .set_virtual_x2apic_mode = svm_set_virtual_x2apic_mode,
> + .set_apic_access_page_addr = svm_set_apic_access_page_addr,
>   .vm_has_apicv = svm_vm_has_apicv,
>   .load_eoi_exitmap = svm_load_eoi_exitmap,
>   .hwapic_isr_update = svm_hwapic_isr_update,
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 63c4c3e..da6d55d 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -7093,6 +7093,11 @@ static void vmx_set_virtual_x2apic_mode(struct 
> kvm_vcpu *vcpu, bool set)
>   vmx_set_msr_bitmap(vcpu);
>  }
>  
> +static void vmx_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa)
> +{
> + vmcs_write64(APIC_ACCESS_ADDR, hpa);
> +}
> +
>  static void vmx_hwapic_isr_update(struct kvm *kvm, int isr)
>  {
>   u16 status;
> @@ -8910,6 +8915,7 @@ static struct kvm_x86_ops vmx_x86_ops = {
>   .enable_irq_window = enable_irq_window,
>   .update_cr8_intercept = update_cr8_intercept,
>   .set_virtual_x2apic_mode = vmx_set_virtual_x2apic_mode,
> + .set_apic_access_page_addr = vmx_set_apic_access_page_addr,
>   .vm_has_apicv = vmx_vm_has_apicv,
>   .load_eoi_exitmap = vmx_load_eoi_exitmap,
>   .hwapic_irr_update = vmx_hwapic_irr_update,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e05bd58..96f4188 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5989,6 +5989,19 @@ static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu)
>   kvm_apic_update_tmr(vcpu, tmr);
>  }
>  
> +static void vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
> +{
> + /*
> +  * apic access page could be migrated. When the page is being migrated,
> +  * GUP will wait till the migrate entry is replaced with the new pte
> +  * entry pointing to the new page.
> +  */
> + vcpu->kvm->arch.apic_access_page = gfn_to_page(vcpu->kvm,
> + APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
> + kvm_x86_ops->set_apic_access_page_addr(vcpu->kvm,
> + page_to_phys(vcpu->kvm->arch.apic_access_page));
I am a little bit worried that here all vcpus write to 
vcpu->kvm->arch.apic_access_page
without any locking. It is probably benign since pointer write is atomic on 
x86. Paolo?

Do we even need apic_access_page? Why not call
 gfn_to_page(APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT)
 put_page()
on rare occasions we need to know its address?
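
I.e., something like this at the few places that need the hpa (sketch):

	struct page *page;

	page = gfn_to_page(vcpu->kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
	kvm_x86_ops->set_apic_access_page_addr(vcpu->kvm, page_to_phys(page));
	put_page(page);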

> +}
> +
>  /*
>   * Returns 1 to let __vcpu_run() continue the guest execution loop without
>   * exiting to the userspace.  Otherwise, the value will be returned to the
> @@ -6049

Re: GET_RNG_SEED hypercall ABI? (Re: [PATCH v5 0/5] random,x86,kvm: Rework arch RNG seeds and get some from kvm)

2014-08-28 Thread Gleb Natapov
On Tue, Aug 26, 2014 at 04:58:34PM -0700, Andy Lutomirski wrote:
> hpa pointed out that the ABI that I chose (an MSR from the KVM range
> and a KVM cpuid bit) is unnecessarily KVM-specific.  It would be nice
> to allocate an MSR that everyone involved can agree on and, rather
> than relying on a cpuid bit, just have the guest probe for the MSR.
> 
The CPUID part allows the feature to be disabled for machine compatibility
purposes during migration. Of course the interface can explicitly state that
one successful use of the MSR does not mean that the next use will not result
in a #GP, but that doesn't sound very elegant and is different from any other
MSR out there.
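
Without the CPUID bit a guest would have to probe on every use, e.g. (sketch;
MSR_KVM_GET_RNG_SEED is from the proposed patch, and rdmsrl_safe() catches
the #GP):

	u64 seed;

	if (rdmsrl_safe(MSR_KVM_GET_RNG_SEED, &seed))
		return;	/* #GP: not implemented, or masked out after migration */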

--
Gleb.


Re: [PATCH] KVM-Use value reading from MSR when construct the eptp in VMX mode

2014-08-25 Thread Gleb Natapov
On Mon, Aug 25, 2014 at 11:16:34AM +0800, Dennis Chen wrote:
> On Sun, Aug 24, 2014 at 5:38 PM, Gleb Natapov  wrote:
> > On Sun, Aug 24, 2014 at 11:54:32AM +0800, Dennis Chen wrote:
> >> This patch is used to construct the eptp in vmx mode with values
> >> readed from MSR according to the intel x86 software developer's
> >> manual.
> >>
> >>  static u64 construct_eptp(unsigned long root_hpa)
> >>  {
> >> -u64 eptp;
> >> +u64 eptp, pwl;
> >> +
> >> +if (cpu_has_vmx_ept_4levels())
> >> +pwl = VMX_EPT_DEFAULT_GAW << VMX_EPT_GAW_EPTP_SHIFT;
> >> +else {
> >> +WARN(1, "Unsupported page-walk length of 4.\n");
> > Page-walk length of 4 is the only one that is supported.
> >
> Since there is a bit 6 in IA32_VMX_EPT_VPID_CAP MSR indicating the
> support for the page-walk length, I think sanity check is necessary.
> But I just checked the code, it's already done in the hardware_setup()
> function which will disable ept feature if the page-wake length is not
> 4. Gleb, any comments for the memory type check part?
Looks fine, but are there CPUs out there that do not support WB for eptp?
Since there have been no bug reports about it, I assume not.
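
For reference, the function being patched looked like this at the time
(condensed):

	static u64 construct_eptp(unsigned long root_hpa)
	{
		u64 eptp;

		/* TODO write the value reading from MSR */
		eptp = VMX_EPT_DEFAULT_MT |
			VMX_EPT_DEFAULT_GAW << VMX_EPT_GAW_EPTP_SHIFT;
		if (enable_ept_ad_bits)
			eptp |= VMX_EPT_AD_ENABLE_BIT;
		eptp |= (root_hpa & PAGE_MASK);

		return eptp;
	}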

--
Gleb.


Re: [patch 3/4] KVM: MMU: reload request from GET_DIRTY_LOG path

2014-07-21 Thread Gleb Natapov
On Wed, Jul 09, 2014 at 04:12:53PM -0300, mtosa...@redhat.com wrote:
> Reload remote vcpus MMU from GET_DIRTY_LOG codepath, before
> deleting a pinned spte.
> 
> Signed-off-by: Marcelo Tosatti 
> 
> ---
>  arch/x86/kvm/mmu.c |   29 +++--
>  1 file changed, 23 insertions(+), 6 deletions(-)
> 
> Index: kvm.pinned-sptes/arch/x86/kvm/mmu.c
> ===
> --- kvm.pinned-sptes.orig/arch/x86/kvm/mmu.c  2014-07-09 11:23:59.290744490 
> -0300
> +++ kvm.pinned-sptes/arch/x86/kvm/mmu.c   2014-07-09 11:24:58.449632435 
> -0300
> @@ -1208,7 +1208,8 @@
>   *
>   * Return true if tlb need be flushed.
>   */
> -static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
> +static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect,
> +bool skip_pinned)
>  {
>   u64 spte = *sptep;
>  
> @@ -1218,6 +1219,22 @@
>  
>   rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
>  
> + if (is_pinned_spte(spte)) {
> + /* keep pinned spte intact, mark page dirty again */
> + if (skip_pinned) {
> + struct kvm_mmu_page *sp;
> + gfn_t gfn;
> +
> + sp = page_header(__pa(sptep));
> + gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
> +
> + mark_page_dirty(kvm, gfn);
> + return false;
Why not mark all pinned gfns as dirty in kvm_vm_ioctl_get_dirty_log() while
populating dirty_bitmap_buffer?
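
E.g. (sketch; the iterator over pinned gfns is hypothetical):

	/* in kvm_vm_ioctl_get_dirty_log(), while filling dirty_bitmap_buffer: */
	kvm_for_each_pinned_gfn(memslot, gfn)	/* hypothetical helper */
		set_bit_le(gfn - memslot->base_gfn, dirty_bitmap_buffer);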

> + } else
> + mmu_reload_pinned_vcpus(kvm);
Can you explain why you need this?

> + }
> +
> +
>   if (pt_protect)
>   spte &= ~SPTE_MMU_WRITEABLE;
>   spte = spte & ~PT_WRITABLE_MASK;
> @@ -1226,7 +1243,7 @@
>  }
>  
>  static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
> -  bool pt_protect)
> +  bool pt_protect, bool skip_pinned)
>  {
>   u64 *sptep;
>   struct rmap_iterator iter;
> @@ -1235,7 +1252,7 @@
>   for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
>   BUG_ON(!(*sptep & PT_PRESENT_MASK));
>  
> - flush |= spte_write_protect(kvm, sptep, pt_protect);
> + flush |= spte_write_protect(kvm, sptep, pt_protect, 
> skip_pinned);
>   sptep = rmap_get_next(&iter);
>   }
>  
> @@ -1261,7 +1278,7 @@
>   while (mask) {
>   rmapp = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
> PT_PAGE_TABLE_LEVEL, slot);
> - __rmap_write_protect(kvm, rmapp, false);
> + __rmap_write_protect(kvm, rmapp, false, true);
>  
>   /* clear the first set bit */
>   mask &= mask - 1;
> @@ -1280,7 +1297,7 @@
>   for (i = PT_PAGE_TABLE_LEVEL;
>i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
>   rmapp = __gfn_to_rmap(gfn, i, slot);
> - write_protected |= __rmap_write_protect(kvm, rmapp, true);
> + write_protected |= __rmap_write_protect(kvm, rmapp, true, 
> false);
>   }
>  
>   return write_protected;
> @@ -4565,7 +4582,7 @@
>  
>   for (index = 0; index <= last_index; ++index, ++rmapp) {
>   if (*rmapp)
> - __rmap_write_protect(kvm, rmapp, false);
> + __rmap_write_protect(kvm, rmapp, false, false);
>  
>   if (need_resched() || spin_needbreak(&kvm->mmu_lock))
>   cond_resched_lock(&kvm->mmu_lock);
> 
> 

--
Gleb.


Re: [PATCH v2 5/5] kvm, mem-hotplug: Do not pin apic access page in memory.

2014-07-18 Thread Gleb Natapov
On Fri, Jul 18, 2014 at 05:05:20PM +0800, Tang Chen wrote:
> Hi Gleb,
> 
> On 07/17/2014 09:57 PM, Gleb Natapov wrote:
> >On Thu, Jul 17, 2014 at 09:34:20PM +0800, Tang Chen wrote:
> >>Hi Gleb,
> >>
> >>On 07/15/2014 08:40 PM, Gleb Natapov wrote:
> >>..
> >>>>
> >>>>And yes, we have the problem you said here. We can migrate the page while 
> >>>>L2
> >>>>vm is running.
> >>>>So I think we should enforce L2 vm to exit to L1. Right ?
> >>>>
> >>>We can request APIC_ACCESS_ADDR reload during L2->L1 vmexit emulation, so
> >>>if APIC_ACCESS_ADDR changes while L2 is running it will be reloaded for L1 
> >>>too.
> >>>
> >>
> >>Sorry, I think I don't quite understand the procedure you are talking about
> >>here.
> >>
> >>Referring to the code, I think we have three machines: L0(host), L1 and L2.
> >>And we have two types of vmexit: L2->L1 and L2->L0.  Right ?
> >>
> >>We are now talking about this case: L2 and L1 shares the apic page.
> >>
> >>Using patch 5/5, when apic page is migrated on L0, mmu_notifier will notify
> >>L1,
> >>and update L1's VMCS. At this time, we are in L0, not L2. Why cannot we
> >Using patch 5/5, when apic page is migrated on L0, mmu_notifier will notify
> >L1 or L2 VMCS depending on which one happens to be running right now.
> >If it is L1 then L2's VMCS will be updated during vmentry emulation,
> 
> OK, this is easy to understand.
> 
> >if it is
> >L2 we need to request reload during vmexit emulation to make sure L1's VMCS 
> >is
> >updated.
> >
> 
> I'm a little confused here. In patch 5/5, I called make_all_cpus_request()
> to
> force all vcpus exit to host. If we are in L2, where will the vcpus exit to
> ?
> L1 or L0 ?
> 
L0. The CPU always exits to L0. The L2->L1 exit is emulated by nested_vmx_vmexit().

--
Gleb.


Re: [PATCH v2 5/5] kvm, mem-hotplug: Do not pin apic access page in memory.

2014-07-17 Thread Gleb Natapov
On Thu, Jul 17, 2014 at 09:34:20PM +0800, Tang Chen wrote:
> Hi Gleb,
> 
> On 07/15/2014 08:40 PM, Gleb Natapov wrote:
> ..
> >>
> >>And yes, we have the problem you said here. We can migrate the page while L2
> >>vm is running.
> >>So I think we should enforce L2 vm to exit to L1. Right ?
> >>
> >We can request APIC_ACCESS_ADDR reload during L2->L1 vmexit emulation, so
> >if APIC_ACCESS_ADDR changes while L2 is running it will be reloaded for L1 
> >too.
> >
> 
> Sorry, I think I don't quite understand the procedure you are talking about
> here.
> 
> Referring to the code, I think we have three machines: L0(host), L1 and L2.
> And we have two types of vmexit: L2->L1 and L2->L0.  Right ?
> 
> We are now talking about this case: L2 and L1 shares the apic page.
> 
> Using patch 5/5, when apic page is migrated on L0, mmu_notifier will notify
> L1,
> and update L1's VMCS. At this time, we are in L0, not L2. Why cannot we
Using patch 5/5, when apic page is migrated on L0, mmu_notifier will notify
L1 or L2 VMCS depending on which one happens to be running right now.
If it is L1 then L2's VMCS will be updated during vmentry emulation, if it is
L2 we need to request reload during vmexit emulation to make sure L1's VMCS is
updated.

--
Gleb.


Re: [PATCH 0/4] random,x86,kvm: Add and use MSR_KVM_GET_RNG_SEED

2014-07-16 Thread Gleb Natapov
On Wed, Jul 16, 2014 at 09:13:23AM -0700, H. Peter Anvin wrote:
> On 07/16/2014 09:08 AM, Paolo Bonzini wrote:
> > On 16/07/2014 18:03, H. Peter Anvin wrote:
> >> I suggested emulating RDRAND *but not set the CPUID bit*.  We already
> >> developed a protocol in KVM/Qemu to enumerate emulated features (created
> >> for MOVBE as I recall), specifically to service the semantic "feature X
> >> will work but will be substantially slower than normal."
> > 
> > But those will set the CPUID bit.  There is currently no way for KVM
> > guests to know if a CPUID bit is real or emulated.
> > 
> 
> OK, so there wasn't any protocol implemented in the end.  I sit corrected.
> 
The protocol that was implemented is between qemu and kvm, not between kvm and the guest.

--
Gleb.


Re: [PATCH 0/4] random,x86,kvm: Add and use MSR_KVM_GET_RNG_SEED

2014-07-16 Thread Gleb Natapov
On Wed, Jul 16, 2014 at 04:32:19PM +0200, Paolo Bonzini wrote:
> On 16/07/2014 16:07, Andy Lutomirski wrote:
> >This patch has nothing whatsoever to do with how much I trust the CPU
> >vs the hypervisor.  It's for the enormous installed base of machines
> >without RDRAND.
> 
> Ok.  I think an MSR is fine, though I don't think it's useful for the guest
> to use it if it already has RDRAND and/or RDSEED.
> 
Agree. It is unfortunate that we are adding PV interfaces for HW that will be
extinct in a couple of years though :(

--
Gleb.


Re: [PATCH 0/4] random,x86,kvm: Add and use MSR_KVM_GET_RNG_SEED

2014-07-16 Thread Gleb Natapov
On Wed, Jul 16, 2014 at 09:10:27AM +0200, Daniel Borkmann wrote:
> On 07/16/2014 08:41 AM, Gleb Natapov wrote:
> >On Tue, Jul 15, 2014 at 07:48:06PM -0700, Andy Lutomirski wrote:
> >>virtio-rng is both too complicated and insufficient for initial rng
> >>seeding.  It's far too complicated to use for KASLR or any other
> >>early boot random number needs.  It also provides /dev/random-style
> >>bits, which means that making guest boot wait for virtio-rng is
> >>unacceptably slow, and doing it asynchronously means that
> >>/dev/urandom might be predictable when userspace starts.
> >>
> >>This introduces a very simple synchronous mechanism to get
> >>/dev/urandom-style bits.
> >
> >Why can't you use RDRAND instruction for that?
> 
> You mean using it directly? I think simply for the very same reasons
> as in c2557a303a ...
So you trust your hypervisor vendor more than you trust your CPU vendor? :)

--
Gleb.


Re: [PATCH 0/4] random,x86,kvm: Add and use MSR_KVM_GET_RNG_SEED

2014-07-15 Thread Gleb Natapov
On Tue, Jul 15, 2014 at 07:48:06PM -0700, Andy Lutomirski wrote:
> virtio-rng is both too complicated and insufficient for initial rng
> seeding.  It's far too complicated to use for KASLR or any other
> early boot random number needs.  It also provides /dev/random-style
> bits, which means that making guest boot wait for virtio-rng is
> unacceptably slow, and doing it asynchronously means that
> /dev/urandom might be predictable when userspace starts.
> 
> This introduces a very simple synchronous mechanism to get
> /dev/urandom-style bits.
Why can't you use RDRAND instruction for that?
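
E.g. (sketch; arch_get_random_long() is backed by RDRAND on CPUs that support
it):

	unsigned long seed;

	if (arch_get_random_long(&seed))
		use_seed(seed);	/* use_seed() is a placeholder */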

> 
> This is a KVM change: am I supposed to write a unit test somewhere?
> 
> Andy Lutomirski (4):
>   x86,kvm: Add MSR_KVM_GET_RNG_SEED and a matching feature bit
>   random,x86: Add arch_get_slow_rng_u64
>   random: Seed pools from arch_get_slow_rng_u64 at startup
>   x86,kaslr: Use MSR_KVM_GET_RNG_SEED for KASLR if available
> 
>  Documentation/virtual/kvm/cpuid.txt  |  3 +++
>  arch/x86/Kconfig |  4 
>  arch/x86/boot/compressed/aslr.c  | 27 +++
>  arch/x86/include/asm/archslowrng.h   | 30 ++
>  arch/x86/include/uapi/asm/kvm_para.h |  2 ++
>  arch/x86/kernel/kvm.c| 22 ++
>  arch/x86/kvm/cpuid.c |  3 ++-
>  arch/x86/kvm/x86.c   |  4 
>  drivers/char/random.c| 14 +-
>  include/linux/random.h   |  9 +
>  10 files changed, 116 insertions(+), 2 deletions(-)
>  create mode 100644 arch/x86/include/asm/archslowrng.h
> 
> -- 
> 1.9.3
> 

--
Gleb.


Re: [PATCH v2 5/5] kvm, mem-hotplug: Do not pin apic access page in memory.

2014-07-15 Thread Gleb Natapov
On Tue, Jul 15, 2014 at 08:54:01PM +0800, Tang Chen wrote:
> On 07/15/2014 08:40 PM, Gleb Natapov wrote:
> >On Tue, Jul 15, 2014 at 08:28:22PM +0800, Tang Chen wrote:
> >>On 07/15/2014 08:09 PM, Gleb Natapov wrote:
> >>>On Tue, Jul 15, 2014 at 01:52:40PM +0200, Jan Kiszka wrote:
> >>..
> >>>>
> >>>>I cannot follow your concerns yet. Specifically, how should
> >>>>APIC_ACCESS_ADDR (the VMCS field, right?) change while L2 is running? We
> >>>>currently pin/unpin on L1->L2/L2->L1, respectively. Or what do you mean?
> >>>>
> >>>I am talking about this case:
> >>>  if (cpu_has_secondary_exec_ctrls()) {
> >>>  } else {
> >>>  exec_control |=
> >>> SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> >>> vmcs_write64(APIC_ACCESS_ADDR,
> >>> page_to_phys(vcpu->kvm->arch.apic_access_page));
> >>>  }
> >>>We do not pin here.
> >>>
> >>
> >>Hi Gleb,
> >>
> >>
> >>7905 if (exec_control&
> >>SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES) {
> >>..
> >>7912 if (vmx->nested.apic_access_page) /* shouldn't
> >>happen */
> >>7913 nested_release_page(vmx->nested.apic_access_page);
> >>7914 vmx->nested.apic_access_page =
> >>7915 nested_get_page(vcpu,
> >>vmcs12->apic_access_addr);
> >>
> >>I thought you were talking about the problem here. We pin
> >>vmcs12->apic_access_addr
> >>in memory. And I think we should do the same thing to this page as to L1 vm.
> >>Right ?
> >Nested kvm pins a lot of pages, it will probably be not easy to handle all 
> >of them,
> >so for now I am concerned with non nested case only (but nested should 
> >continue to
> >work obviously, just pin pages like it does now).
> 
> True. I will work on it.
> 
> And also, when using PCI passthrough, kvm_pin_pages() also pins some pages.
> This is
> also in my todo list.
Those pages are (almost) directly accessible by assigned PCI devices;
I am not sure this is even doable.

> 
> But sorry, a little strange. I didn't find where vmcs12->apic_access_addr is
> allocated
> or initialized... Would you please tell me ?
handle_vmwrite() writes it when the guest executes vmwrite(APIC_ACCESS_ADDR).
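
That is, the page is allocated and programmed by the L1 hypervisor itself,
roughly (sketch of guest-side code; l2_apic_access_page is a hypothetical L1
variable):

	vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(l2_apic_access_page));

L0's handle_vmwrite() then stores that value into vmcs12->apic_access_addr.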

> 
> >
> >>
> >>..
> >>7922 if (!vmx->nested.apic_access_page)
> >>7923 exec_control&=
> >>7924 ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> >>7925 else
> >>7926 vmcs_write64(APIC_ACCESS_ADDR,
> >>7927 page_to_phys(vmx->nested.apic_access_page));
> >>7928 } else if
> >>(vm_need_virtualize_apic_accesses(vmx->vcpu.kvm)) {
> >>7929 exec_control |=
> >>7930 SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> >>7931 vmcs_write64(APIC_ACCESS_ADDR,
> >>7932 page_to_phys(vcpu->kvm->arch.apic_access_page));
> >>7933 }
> >>
> >>And yes, we have the problem you said here. We can migrate the page while L2
> >>vm is running.
> >>So I think we should enforce L2 vm to exit to L1. Right ?
> >>
> >We can request APIC_ACCESS_ADDR reload during L2->L1 vmexit emulation, so
> >if APIC_ACCESS_ADDR changes while L2 is running it will be reloaded for L1 
> >too.
> >
> 
> apic pages for L2 and L1 are not the same page, right ?
> 
If the L2 guest enables the apic access page then they are different;
otherwise they are the same.

> I think, just like we are doing in patch 5/5, we cannot wait for the next
> L2->L1 vmexit.
> We should enforce a L2->L1 vmexit in mmu_notifier, just like
> make_all_cpus_request() does.
> 
> Am I right ?
> 
I do not see why forcing APIC_ACCESS_ADDR reload during L2->L1 exit is not 
enough.

--
Gleb.


Re: [PATCH v2 5/5] kvm, mem-hotplug: Do not pin apic access page in memory.

2014-07-15 Thread Gleb Natapov
On Tue, Jul 15, 2014 at 03:10:15PM +0200, Jan Kiszka wrote:
> On 2014-07-15 14:40, Gleb Natapov wrote:
> >>
> >> ..
> >> 7922 if (!vmx->nested.apic_access_page)
> >> 7923 exec_control &=
> >> 7924 ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> >> 7925 else
> >> 7926 vmcs_write64(APIC_ACCESS_ADDR,
> >> 7927 page_to_phys(vmx->nested.apic_access_page));
> >> 7928 } else if
> >> (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm)) {
> >> 7929 exec_control |=
> >> 7930 SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> >> 7931 vmcs_write64(APIC_ACCESS_ADDR,
> >> 7932 page_to_phys(vcpu->kvm->arch.apic_access_page));
> >> 7933 }
> >>
> >> And yes, we have the problem you said here. We can migrate the page while 
> >> L2
> >> vm is running.
> >> So I think we should enforce L2 vm to exit to L1. Right ?
> >>
> > We can request APIC_ACCESS_ADDR reload during L2->L1 vmexit emulation, so
> > if APIC_ACCESS_ADDR changes while L2 is running it will be reloaded for L1 
> > too.
> 
> How should this host-managed VMCS field possibly change while L2 is running?
> 
That what "Do not pin apic access page in memory" patch is doing. It changes 
APIC_ACCESS_ADDR
of a current vmcs. It kicks vcpu out of a guest mode of course, but vcpu may 
still be in 
L2 at this point, so only L2 vmcs will get new APIC_ACCESS_ADDR pointer, L1 
vmcs will still have
an old one.
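
Hence the fix under discussion: request the reload while emulating the L2->L1
exit, e.g. (sketch):

	/* in nested_vmx_vmexit(), after switching back to L1's vmcs: */
	kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);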
   
--
Gleb.


Re: [PATCH v2 5/5] kvm, mem-hotplug: Do not pin apic access page in memory.

2014-07-15 Thread Gleb Natapov
On Tue, Jul 15, 2014 at 08:28:22PM +0800, Tang Chen wrote:
> On 07/15/2014 08:09 PM, Gleb Natapov wrote:
> >On Tue, Jul 15, 2014 at 01:52:40PM +0200, Jan Kiszka wrote:
> ..
> >>
> >>I cannot follow your concerns yet. Specifically, how should
> >>APIC_ACCESS_ADDR (the VMCS field, right?) change while L2 is running? We
> >>currently pin/unpin on L1->L2/L2->L1, respectively. Or what do you mean?
> >>
> >I am talking about this case:
> >  if (cpu_has_secondary_exec_ctrls()) {
> >  } else {
> >  exec_control |=
> > SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> > vmcs_write64(APIC_ACCESS_ADDR,
> > page_to_phys(vcpu->kvm->arch.apic_access_page));
> >  }
> >We do not pin here.
> >
> 
> Hi Gleb,
> 
> 
> 7905 if (exec_control &
> SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES) {
> ..
> 7912 if (vmx->nested.apic_access_page) /* shouldn't
> happen */
> 7913 nested_release_page(vmx->nested.apic_access_page);
> 7914 vmx->nested.apic_access_page =
> 7915 nested_get_page(vcpu,
> vmcs12->apic_access_addr);
> 
> I thought you were talking about the problem here. We pin
> vmcs12->apic_access_addr
> in memory. And I think we should do the same thing to this page as to L1 vm.
> Right ?
Nested kvm pins a lot of pages; it will probably not be easy to handle all of
them, so for now I am concerned with the non-nested case only (but nested
should obviously continue to work, just pinning pages like it does now).

> 
> ..
> 7922 if (!vmx->nested.apic_access_page)
> 7923 exec_control &=
> 7924 ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> 7925 else
> 7926 vmcs_write64(APIC_ACCESS_ADDR,
> 7927 page_to_phys(vmx->nested.apic_access_page));
> 7928 } else if
> (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm)) {
> 7929 exec_control |=
> 7930 SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> 7931 vmcs_write64(APIC_ACCESS_ADDR,
> 7932 page_to_phys(vcpu->kvm->arch.apic_access_page));
> 7933 }
> 
> And yes, we have the problem you said here. We can migrate the page while L2
> vm is running.
> So I think we should enforce L2 vm to exit to L1. Right ?
> 
We can request APIC_ACCESS_ADDR reload during L2->L1 vmexit emulation, so
if APIC_ACCESS_ADDR changes while L2 is running it will be reloaded for L1 too.

--
Gleb.


Re: [PATCH v2 5/5] kvm, mem-hotplug: Do not pin apic access page in memory.

2014-07-15 Thread Gleb Natapov
On Tue, Jul 15, 2014 at 01:52:40PM +0200, Jan Kiszka wrote:
> On 2014-07-14 16:58, Gleb Natapov wrote:
> >>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >>>> index ffbe557..7080eda 100644
> >>>> --- a/arch/x86/kvm/x86.c
> >>>> +++ b/arch/x86/kvm/x86.c
> >>>> @@ -5929,6 +5929,18 @@ static void vcpu_scan_ioapic(struct kvm_vcpu 
> >>>> *vcpu)
> >>>>  kvm_apic_update_tmr(vcpu, tmr);
> >>>>  }
> >>>>
> >>>> +static void vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
> >>>> +{
> >>>> +/*
> >>>> + * When the page is being migrated, GUP will wait till the 
> >>>> migrate
> >>>> + * entry is replaced with the new pte entry pointing to the new 
> >>>> page.
> >>>> + */
> >>>> +struct page *page = gfn_to_page_no_pin(vcpu->kvm,
> >>>> +APIC_DEFAULT_PHYS_BASE>>  
> >>>> PAGE_SHIFT);
> >>> If you do not use kvm->arch.apic_access_page to get current address why 
> >>> not drop it entirely?
> >>>
> >>
> >> I should also update kvm->arch.apic_access_page here. It is used in other
> >> places
> >> in kvm, so I don't think we should drop it. Will update the patch.
> > What other places? The only other place I see is in nested kvm code and you 
> > can call
> > gfn_to_page_no_pin() there instead of using kvm->arch.apic_access_page 
> > directly. But
> > as far as I see nested kvm code cannot handle change of APIC_ACCESS_ADDR 
> > phys address.
> > If APIC_ACCESS_ADDR changes during nested guest run, non nested vmcs will 
> > still have old
> > physical address. One way to fix that is to set KVM_REQ_APIC_PAGE_RELOAD 
> > during nested exit.
> 
> I cannot follow your concerns yet. Specifically, how should
> APIC_ACCESS_ADDR (the VMCS field, right?) change while L2 is running? We
> currently pin/unpin on L1->L2/L2->L1, respectively. Or what do you mean?
> 
I am talking about this case:
 if (cpu_has_secondary_exec_ctrls()) {
 } else {
 exec_control |=
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
vmcs_write64(APIC_ACCESS_ADDR,
page_to_phys(vcpu->kvm->arch.apic_access_page));
 }
We do not pin here.

--
Gleb.


Re: [PATCH v2 5/5] kvm, mem-hotplug: Do not pin apic access page in memory.

2014-07-14 Thread Gleb Natapov
CCing Jan to check my nested kvm findings below.

On Mon, Jul 14, 2014 at 03:57:09PM +0800, Tang Chen wrote:
> Hi Gleb,
> 
> Thanks for the reply. Please see below.
> 
> On 07/12/2014 04:04 PM, Gleb Natapov wrote:
> >On Tue, Jul 08, 2014 at 09:01:32PM +0800, Tang Chen wrote:
> >>apic access page is pinned in memory. As a result, it cannot be 
> >>migrated/hot-removed.
> >>Actually, it is not necessary to be pinned.
> >>
> >>The hpa of apic access page is stored in VMCS APIC_ACCESS_ADDR pointer. When
> >>the page is migrated, kvm_mmu_notifier_invalidate_page() will invalidate the
> >>corresponding ept entry. This patch introduces a new vcpu request named
> >>KVM_REQ_APIC_PAGE_RELOAD, and makes this request to all the vcpus at this 
> >>time,
> >>and force all the vcpus exit guest, and re-enter guest till they updates 
> >>the VMCS
> >>APIC_ACCESS_ADDR pointer to the new apic access page address, and updates
> >>kvm->arch.apic_access_page to the new page.
> >>
> >By default kvm Linux guest uses x2apic, so APIC_ACCESS_ADDR mechanism
> >is not used since no MMIO access to APIC is ever done. Have you tested
> >this with "-cpu modelname,-x2apic" qemu flag?
> 
> I used the following commandline to test the patches.
> 
> # /usr/libexec/qemu-kvm -m 512M -hda /home/tangchen/xxx.img -enable-kvm -smp
> 2
> 
That most likely uses x2apic.

> And I think the guest used APIC_ACCESS_ADDR mechanism because the previous
> patch-set has some problem which will happen when the apic page is accessed.
> And it did happen.
> 
> I'll test this patch-set with "-cpu modelname,-x2apic" flag.
> 
Replace "modelname" with one of supported model names such as nehalem of course 
:)

> >
> >>Signed-off-by: Tang Chen
> >>---
> >>  arch/x86/include/asm/kvm_host.h |  1 +
> >>  arch/x86/kvm/mmu.c  | 11 +++
> >>  arch/x86/kvm/svm.c  |  6 ++
> >>  arch/x86/kvm/vmx.c  |  8 +++-
> >>  arch/x86/kvm/x86.c  | 14 ++
> >>  include/linux/kvm_host.h|  2 ++
> >>  virt/kvm/kvm_main.c | 12 
> >>  7 files changed, 53 insertions(+), 1 deletion(-)
> >>
> >>diff --git a/arch/x86/include/asm/kvm_host.h 
> >>b/arch/x86/include/asm/kvm_host.h
> >>index 62f973e..9ce6bfd 100644
> >>--- a/arch/x86/include/asm/kvm_host.h
> >>+++ b/arch/x86/include/asm/kvm_host.h
> >>@@ -737,6 +737,7 @@ struct kvm_x86_ops {
> >>void (*hwapic_isr_update)(struct kvm *kvm, int isr);
> >>void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
> >>void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set);
> >>+   void (*set_apic_access_page_addr)(struct kvm *kvm, hpa_t hpa);
> >>void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector);
> >>void (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
> >>int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
> >>diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >>index 9314678..551693d 100644
> >>--- a/arch/x86/kvm/mmu.c
> >>+++ b/arch/x86/kvm/mmu.c
> >>@@ -3427,6 +3427,17 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, 
> >>gva_t gpa, u32 error_code,
> >> level, gfn, pfn, prefault);
> >>spin_unlock(&vcpu->kvm->mmu_lock);
> >>
> >>+   /*
> >>+* apic access page could be migrated. When the guest tries to access
> >>+* the apic access page, ept violation will occur, and we can use GUP
> >>+* to find the new page.
> >>+*
> >>+* GUP will wait till the migrate entry be replaced with the new page.
> >>+*/
> >>+   if (gpa == APIC_DEFAULT_PHYS_BASE)
> >>+   vcpu->kvm->arch.apic_access_page = gfn_to_page_no_pin(vcpu->kvm,
> >>+   APIC_DEFAULT_PHYS_BASE>>  PAGE_SHIFT);
> >Shouldn't you make KVM_REQ_APIC_PAGE_RELOAD request here?
> 
> I don't think we need to make KVM_REQ_APIC_PAGE_RELOAD request here.
> 
> In kvm_mmu_notifier_invalidate_page() I made the request. And the handler
> called
> gfn_to_page_no_pin() to get the new page, which will wait till the migration
> finished. And then updated the VMCS APIC_ACCESS_ADDR pointer. So, when the
> vcpus
> were forced to exit the guest mode, they would wait till the VMCS
> APIC_ACCESS_ADDR
> pointer was updated.
> 
> As a result, we don't need to make the

Re: [RESEND PATCH v2 4/5] kvm: Remove ept_identity_pagetable from struct kvm_arch.

2014-07-14 Thread Gleb Natapov
On Mon, Jul 14, 2014 at 05:17:04PM +0800, Tang Chen wrote:
> On 07/12/2014 03:44 PM, Gleb Natapov wrote:
> >On Wed, Jul 09, 2014 at 10:08:03AM +0800, Tang Chen wrote:
> >>kvm_arch->ept_identity_pagetable holds the ept identity pagetable page. But
> >>it is never used to refer to the page at all.
> >>
> >>In vcpu initialization, it indicates two things:
> >>1. indicates if ept page is allocated
> >>2. indicates if a memory slot for identity page is initialized
> >>
> >>Actually, kvm_arch->ept_identity_pagetable_done is enough to tell if the ept
> >>identity pagetable is initialized. So we can remove ept_identity_pagetable.
> >>
> >>Signed-off-by: Tang Chen
> >>---
> >>  arch/x86/include/asm/kvm_host.h |  1 -
> >>  arch/x86/kvm/vmx.c  | 25 +++--
> >>  2 files changed, 11 insertions(+), 15 deletions(-)
> >>
> >>diff --git a/arch/x86/include/asm/kvm_host.h 
> >>b/arch/x86/include/asm/kvm_host.h
> >>index 4931415..62f973e 100644
> >>--- a/arch/x86/include/asm/kvm_host.h
> >>+++ b/arch/x86/include/asm/kvm_host.h
> >>@@ -578,7 +578,6 @@ struct kvm_arch {
> >>
> >>gpa_t wall_clock;
> >>
> >>-   struct page *ept_identity_pagetable;
> >>bool ept_identity_pagetable_done;
> >>gpa_t ept_identity_map_addr;
> >>
> >>diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >>index 0918635e..fe2e5f4 100644
> >>--- a/arch/x86/kvm/vmx.c
> >>+++ b/arch/x86/kvm/vmx.c
> >>@@ -741,6 +741,7 @@ static void vmx_sync_pir_to_irr_dummy(struct kvm_vcpu 
> >>*vcpu);
> >>  static void copy_vmcs12_to_shadow(struct vcpu_vmx *vmx);
> >>  static void copy_shadow_to_vmcs12(struct vcpu_vmx *vmx);
> >>  static bool vmx_mpx_supported(void);
> >>+static int alloc_identity_pagetable(struct kvm *kvm);
> >>
> >>  static DEFINE_PER_CPU(struct vmcs *, vmxarea);
> >>  static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
> >>@@ -3921,21 +3922,21 @@ out:
> >>
> >>  static int init_rmode_identity_map(struct kvm *kvm)
> >>  {
> >>-   int i, idx, r, ret;
> >>+   int i, idx, r, ret = 0;
> >>pfn_t identity_map_pfn;
> >>u32 tmp;
> >>
> >>if (!enable_ept)
> >>return 1;
> >>-   if (unlikely(!kvm->arch.ept_identity_pagetable)) {
> >>-   printk(KERN_ERR "EPT: identity-mapping pagetable "
> >>-   "haven't been allocated!\n");
> >>-   return 0;
> >>-   }
> >>if (likely(kvm->arch.ept_identity_pagetable_done))
> >>return 1;
> >>-   ret = 0;
> >>identity_map_pfn = kvm->arch.ept_identity_map_addr>>  PAGE_SHIFT;
> >>+
> >>+   mutex_lock(&kvm->slots_lock);
> >Why move this out of alloc_identity_pagetable()?
> >
> 
> Referring to the original code, I think mutex_lock(&kvm->slots_lock) is used
> to protect kvm->arch.ept_identity_pagetable. If two or more threads try to
> modify it at the same time, the mutex ensures that the identity table is
> only
> allocated once.
> 
> Now, we dropped kvm->arch.ept_identity_pagetable. And use
> kvm->arch.ept_identity_pagetable_done
> to check if the identity table is allocated and initialized. So we should
> protect
> memory slot operation in alloc_identity_pagetable() and
> kvm->arch.ept_identity_pagetable_done
> with this mutex.
> 
> Of course, I can see that the name "slots_lock" indicates that it may be
> used
> to protect the memory slot operation only. Maybe move it out here is not
> suitable.
> 
> If I'm wrong, please tell me.
> 
No, you are right that besides memory slot creation slots_lock protects the
checking of ept_identity_pagetable here, but after your patch
ept_identity_pagetable_done is tested outside of slots_lock, so the allocation
can happen twice, no?
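
Schematically (both callers can observe ept_identity_pagetable_done == false
before either takes slots_lock):

	vcpu A						vcpu B
	tests ept_identity_pagetable_done: false
							tests ept_identity_pagetable_done: false
	mutex_lock(&kvm->slots_lock)
	alloc_identity_pagetable(kvm)			(blocks on slots_lock)
	mutex_unlock(&kvm->slots_lock)
							alloc_identity_pagetable(kvm)	/* again */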

--
Gleb.


Re: [PATCH v2 5/5] kvm, mem-hotplug: Do not pin apic access page in memory.

2014-07-12 Thread Gleb Natapov
On Tue, Jul 08, 2014 at 09:01:32PM +0800, Tang Chen wrote:
> apic access page is pinned in memory. As a result, it cannot be 
> migrated/hot-removed.
> Actually, it is not necessary to be pinned.
> 
> The hpa of apic access page is stored in VMCS APIC_ACCESS_ADDR pointer. When
> the page is migrated, kvm_mmu_notifier_invalidate_page() will invalidate the
> corresponding ept entry. This patch introduces a new vcpu request named
> KVM_REQ_APIC_PAGE_RELOAD, and makes this request to all the vcpus at this 
> time,
> and force all the vcpus exit guest, and re-enter guest till they updates the 
> VMCS
> APIC_ACCESS_ADDR pointer to the new apic access page address, and updates
> kvm->arch.apic_access_page to the new page.
> 
By default a kvm Linux guest uses x2apic, so the APIC_ACCESS_ADDR mechanism
is not used since no MMIO access to the APIC is ever done. Have you tested
this with the "-cpu modelname,-x2apic" qemu flag?

> Signed-off-by: Tang Chen 
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/mmu.c  | 11 +++
>  arch/x86/kvm/svm.c  |  6 ++
>  arch/x86/kvm/vmx.c  |  8 +++-
>  arch/x86/kvm/x86.c  | 14 ++
>  include/linux/kvm_host.h|  2 ++
>  virt/kvm/kvm_main.c | 12 
>  7 files changed, 53 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 62f973e..9ce6bfd 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -737,6 +737,7 @@ struct kvm_x86_ops {
>   void (*hwapic_isr_update)(struct kvm *kvm, int isr);
>   void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
>   void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set);
> + void (*set_apic_access_page_addr)(struct kvm *kvm, hpa_t hpa);
>   void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector);
>   void (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
>   int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 9314678..551693d 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -3427,6 +3427,17 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t 
> gpa, u32 error_code,
>level, gfn, pfn, prefault);
>   spin_unlock(&vcpu->kvm->mmu_lock);
>  
> + /*
> +  * apic access page could be migrated. When the guest tries to access
> +  * the apic access page, ept violation will occur, and we can use GUP
> +  * to find the new page.
> +  *
> +  * GUP will wait till the migrate entry be replaced with the new page.
> +  */
> + if (gpa == APIC_DEFAULT_PHYS_BASE)
> + vcpu->kvm->arch.apic_access_page = gfn_to_page_no_pin(vcpu->kvm,
> + APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
Shouldn't you make KVM_REQ_APIC_PAGE_RELOAD request here?

> +
>   return r;
>  
>  out_unlock:
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 576b525..dc76f29 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -3612,6 +3612,11 @@ static void svm_set_virtual_x2apic_mode(struct 
> kvm_vcpu *vcpu, bool set)
>   return;
>  }
>  
> +static void svm_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa)
> +{
> + return;
> +}
> +
>  static int svm_vm_has_apicv(struct kvm *kvm)
>  {
>   return 0;
> @@ -4365,6 +4370,7 @@ static struct kvm_x86_ops svm_x86_ops = {
>   .enable_irq_window = enable_irq_window,
>   .update_cr8_intercept = update_cr8_intercept,
>   .set_virtual_x2apic_mode = svm_set_virtual_x2apic_mode,
> + .set_apic_access_page_addr = svm_set_apic_access_page_addr,
>   .vm_has_apicv = svm_vm_has_apicv,
>   .load_eoi_exitmap = svm_load_eoi_exitmap,
>   .hwapic_isr_update = svm_hwapic_isr_update,
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 5532ac8..f7c6313 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -3992,7 +3992,7 @@ static int alloc_apic_access_page(struct kvm *kvm)
>   if (r)
>   goto out;
>  
> - page = gfn_to_page(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
> + page = gfn_to_page_no_pin(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
>   if (is_error_page(page)) {
>   r = -EFAULT;
>   goto out;
> @@ -7073,6 +7073,11 @@ static void vmx_set_virtual_x2apic_mode(struct 
> kvm_vcpu *vcpu, bool set)
>   vmx_set_msr_bitmap(vcpu);
>  }
>  
> +static void vmx_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa)
> +{
> + vmcs_write64(APIC_ACCESS_ADDR, hpa);
> +}
> +
>  static void vmx_hwapic_isr_update(struct kvm *kvm, int isr)
>  {
>   u16 status;
> @@ -8842,6 +8847,7 @@ static struct kvm_x86_ops vmx_x86_ops = {
>   .enable_irq_window = enable_irq_window,
>   .update_cr8_intercept = update_cr8_intercept,
>   .set_virtual_x2apic_mode = vmx_set_virtual_x2apic_mode,
> +

Re: [RESEND PATCH v2 4/5] kvm: Remove ept_identity_pagetable from struct kvm_arch.

2014-07-12 Thread Gleb Natapov
On Wed, Jul 09, 2014 at 10:08:03AM +0800, Tang Chen wrote:
> kvm_arch->ept_identity_pagetable holds the ept identity pagetable page. But
> it is never used to refer to the page at all.
> 
> In vcpu initialization, it indicates two things:
> 1. indicates if ept page is allocated
> 2. indicates if a memory slot for identity page is initialized
> 
> Actually, kvm_arch->ept_identity_pagetable_done is enough to tell if the ept
> identity pagetable is initialized. So we can remove ept_identity_pagetable.
> 
> Signed-off-by: Tang Chen 
> ---
>  arch/x86/include/asm/kvm_host.h |  1 -
>  arch/x86/kvm/vmx.c  | 25 +++--
>  2 files changed, 11 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 4931415..62f973e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -578,7 +578,6 @@ struct kvm_arch {
>  
>   gpa_t wall_clock;
>  
> - struct page *ept_identity_pagetable;
>   bool ept_identity_pagetable_done;
>   gpa_t ept_identity_map_addr;
>  
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 0918635e..fe2e5f4 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -741,6 +741,7 @@ static void vmx_sync_pir_to_irr_dummy(struct kvm_vcpu 
> *vcpu);
>  static void copy_vmcs12_to_shadow(struct vcpu_vmx *vmx);
>  static void copy_shadow_to_vmcs12(struct vcpu_vmx *vmx);
>  static bool vmx_mpx_supported(void);
> +static int alloc_identity_pagetable(struct kvm *kvm);
>  
>  static DEFINE_PER_CPU(struct vmcs *, vmxarea);
>  static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
> @@ -3921,21 +3922,21 @@ out:
>  
>  static int init_rmode_identity_map(struct kvm *kvm)
>  {
> - int i, idx, r, ret;
> + int i, idx, r, ret = 0;
>   pfn_t identity_map_pfn;
>   u32 tmp;
>  
>   if (!enable_ept)
>   return 1;
> - if (unlikely(!kvm->arch.ept_identity_pagetable)) {
> - printk(KERN_ERR "EPT: identity-mapping pagetable "
> - "haven't been allocated!\n");
> - return 0;
> - }
>   if (likely(kvm->arch.ept_identity_pagetable_done))
>   return 1;
> - ret = 0;
>   identity_map_pfn = kvm->arch.ept_identity_map_addr >> PAGE_SHIFT;
> +
> + mutex_lock(&kvm->slots_lock);
Why move this out of alloc_identity_pagetable()?

> + r = alloc_identity_pagetable(kvm);
> + if (r)
> + goto out2;
> +
>   idx = srcu_read_lock(&kvm->srcu);
>   r = kvm_clear_guest_page(kvm, identity_map_pfn, 0, PAGE_SIZE);
>   if (r < 0)
> @@ -3953,6 +3954,9 @@ static int init_rmode_identity_map(struct kvm *kvm)
>   ret = 1;
>  out:
>   srcu_read_unlock(&kvm->srcu, idx);
> +
> +out2:
> + mutex_unlock(&kvm->slots_lock);
>   return ret;
>  }
>  
> @@ -4006,9 +4010,6 @@ static int alloc_identity_pagetable(struct kvm *kvm)
>   struct kvm_userspace_memory_region kvm_userspace_mem;
>   int r = 0;
>  
> - mutex_lock(&kvm->slots_lock);
> - if (kvm->arch.ept_identity_pagetable)
> - goto out;
>   kvm_userspace_mem.slot = IDENTITY_PAGETABLE_PRIVATE_MEMSLOT;
>   kvm_userspace_mem.flags = 0;
>   kvm_userspace_mem.guest_phys_addr =
> @@ -4025,9 +4026,7 @@ static int alloc_identity_pagetable(struct kvm *kvm)
>   goto out;
>   }
>  
> - kvm->arch.ept_identity_pagetable = page;
I think we can drop gfn_to_page() above too now. Why would we need it?

>  out:
> - mutex_unlock(&kvm->slots_lock);
>   return r;
>  }
>  
> @@ -7583,8 +7582,6 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm 
> *kvm, unsigned int id)
>   kvm->arch.ept_identity_map_addr =
>   VMX_EPT_IDENTITY_PAGETABLE_ADDR;
>   err = -ENOMEM;
> - if (alloc_identity_pagetable(kvm) != 0)
> - goto free_vmcs;
>   if (!init_rmode_identity_map(kvm))
>   goto free_vmcs;
>   }
> -- 
> 1.8.3.1
> 

--
Gleb.
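
With both comments above applied, alloc_identity_pagetable() reduces to
setting up the private memslot; a sketch, assuming the caller now holds
kvm->slots_lock as in the patch:

    static int alloc_identity_pagetable(struct kvm *kvm)
    {
        struct kvm_userspace_memory_region kvm_userspace_mem;

        /* called under kvm->slots_lock, from init_rmode_identity_map() */
        kvm_userspace_mem.slot = IDENTITY_PAGETABLE_PRIVATE_MEMSLOT;
        kvm_userspace_mem.flags = 0;
        kvm_userspace_mem.guest_phys_addr =
            kvm->arch.ept_identity_map_addr;
        kvm_userspace_mem.memory_size = PAGE_SIZE;

        return __kvm_set_memory_region(kvm, &kvm_userspace_mem);
    }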


Re: [PATCH 4/4] kvm, mem-hotplug: Update apic access page when it is migrated.

2014-07-07 Thread Gleb Natapov
On Mon, Jul 07, 2014 at 03:10:23PM +0300, Nadav Amit wrote:
> On 7/7/14, 2:54 PM, Gleb Natapov wrote:
> >On Mon, Jul 07, 2014 at 02:42:27PM +0300, Nadav Amit wrote:
> >>Tang,
> >>
> >>Running some (unrelated) tests I see that KVM does not handle APIC base
> >>relocation correctly. When the base is changed, kvm_lapic_set_base just
> >>changes lapic->base_address without taking further action (i.e., modifying
> >>the VMCS apic address in VMX).
> >>
> >>This patch follows KVM bad behavior by using the constant
> >>VMX_APIC_ACCESS_PAGE_ADDR instead of lapic->base_address.
> >There is no OS out there that relocates APIC base (in fact it was not always
> >relocatable on real HW), so there is not point in complicating the code to 
> >support
> >it. In fact current APIC_ACCESS_ADDR handling relies on the fact that all 
> >vcpus
> >has apic mapped at the same address.
> >
> >>
> >>Anyhow, I didn't see anything that would make my life (in fixing the lapic
> >>base issue) too difficult. Yet, feel free in making it more "fix-friendly".
> >>
> >Why would you want to fix it?
> >
> If there is no general need, I will not send a fix. However, I think the
It is much more works that it may look like for no apparent gain.

> very least a warning message should be appear if the guest relocates the
> APIC base.
> 
Warning is fine. Make it _once().

--
Gleb.
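
A sketch of the _once() warning being asked for, placed in
kvm_lapic_set_base(); MSR_IA32_APICBASE_BASE is the existing base-address
mask:

    if ((value & MSR_IA32_APICBASE_BASE) != APIC_DEFAULT_PHYS_BASE)
        pr_warn_once("kvm: APIC base relocation is unsupported\n");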


Re: [PATCH 4/4] kvm, mem-hotplug: Update apic access page when it is migrated.

2014-07-07 Thread Gleb Natapov
On Mon, Jul 07, 2014 at 02:42:27PM +0300, Nadav Amit wrote:
> Tang,
> 
> Running some (unrelated) tests I see that KVM does not handle APIC base
> relocation correctly. When the base is changed, kvm_lapic_set_base just
> changes lapic->base_address without taking further action (i.e., modifying
> the VMCS apic address in VMX).
> 
> This patch follows KVM bad behavior by using the constant
> VMX_APIC_ACCESS_PAGE_ADDR instead of lapic->base_address.
There is no OS out there that relocates APIC base (in fact it was not always
relocatable on real HW), so there is not point in complicating the code to 
support
it. In fact current APIC_ACCESS_ADDR handling relies on the fact that all vcpus
has apic mapped at the same address.

> 
> Anyhow, I didn't see anything that would make my life (in fixing the lapic
> base issue) too difficult. Yet, feel free in making it more "fix-friendly".
> 
Why would you want to fix it?

> Thanks,
> Nadav
> 
> On 7/7/14, 12:52 PM, Tang Chen wrote:
> >Hi Gleb,
> >
> >The guest hang problem has been solved.
> >
> >When mmu_notifier is called, I set VMCS APIC_ACCESS_ADDR to the new value
> >instead of setting it to 0. And only update kvm->arch.apic_access_page in
> >the next ept violation.
> >
> >The guest is running well now.
> >
> >I'll post the new patches tomorrow. ;)
> >
> >Thanks.
> >

--
Gleb.


Re: [PATCH] x86: Make x2APIC support depend on interrupt remapping or guest support

2014-07-06 Thread Gleb Natapov
On Sun, Jul 06, 2014 at 05:49:09PM +0200, Jan Kiszka wrote:
> On 2014-07-06 17:41, Gleb Natapov wrote:
> > On Sun, Jul 06, 2014 at 05:24:27PM +0200, Jan Kiszka wrote:
> >> On 2014-07-06 17:12, Gleb Natapov wrote:
> >>> On Sat, Jul 05, 2014 at 09:47:54AM +0200, Jan Kiszka wrote:
> >>>> From: Jan Kiszka 
> >>>>
> >>>> We are able to use x2APIC mode in the absence of interrupt remapping on
> >>>> certain hypervisors. So it is fine to disable IRQ_REMAP without having
> >>>> to give up x2APIC support.
> >>> FWIW I did similar thing back when I added x2apic to KVM:
> >>> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-06/msg14579.html
> >>> But was advised against it.
> >>
> >> I don't get the point from that thread.
> > It is not architectural to use x2apic without irq remapping, so Suresh
> 
> Slightly off-topic for this patch: Is this dependency documented
> somewhere? I guess there are broken systems where it can fail, but none
> of the (Intel) boxes I tried so far had problems with x2APIC mode on
> (given APIC IDs < 255) and IRQ remapping disabled.
Not sure I saw that written, but I was told many times that this is so.

--
Gleb.


Re: [PATCH] x86: Make x2APIC support depend on interrupt remapping or guest support

2014-07-06 Thread Gleb Natapov
On Sun, Jul 06, 2014 at 05:24:27PM +0200, Jan Kiszka wrote:
> On 2014-07-06 17:12, Gleb Natapov wrote:
> > On Sat, Jul 05, 2014 at 09:47:54AM +0200, Jan Kiszka wrote:
> >> From: Jan Kiszka 
> >>
> >> We are able to use x2APIC mode in the absence of interrupt remapping on
> >> certain hypervisors. So it is fine to disable IRQ_REMAP without having
> >> to give up x2APIC support.
> > FWIW I did similar thing back when I added x2apic to KVM:
> > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-06/msg14579.html
> > But was advised against it.
> 
> I don't get the point from that thread.
It is not architectural to use x2apic without irq remapping, so Suresh
from Intel felt that it should not be allowed to create a kernel that
can run on real HW and has a non-architectural configuration compiled in,
even though x2apic will only ever be enabled without irq remapping when
running on a VM. I didn't argue about it too much back then. If
HYPERVISOR_GUEST guarantees that the resulting kernel can run only on a
VM (does it?) this objection does not hold any more.

> 
> However, this change is just formalizing a configuration that is already
> being used: there is no interrupt remapping available with KVM (yet),
> but we do run our guests in x2APIC mode most of the time. Interrupt
> remapping code in kernels tailored as KVM guest is dead code, right now.
Yes. I am just saying why things are the way they are.

> 
> Jan
> 
> > 
> >>
> >> Signed-off-by: Jan Kiszka 
> >> ---
> >>  arch/x86/Kconfig | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> >> index a8f749e..30a9987 100644
> >> --- a/arch/x86/Kconfig
> >> +++ b/arch/x86/Kconfig
> >> @@ -310,7 +310,7 @@ config SMP
> >>  
> >>  config X86_X2APIC
> >>bool "Support x2apic"
> >> -  depends on X86_LOCAL_APIC && X86_64 && IRQ_REMAP
> >> +  depends on X86_LOCAL_APIC && X86_64 && (IRQ_REMAP || HYPERVISOR_GUEST)
> >>---help---
> >>  This enables x2apic support on CPUs that have this feature.
> >>  
> >>
> >>
> > 
> > 
> > 
> > --
> > Gleb.
> > 
> 
> 



--
Gleb.


Re: [PATCH] x86: Make x2APIC support depend on interrupt remapping or guest support

2014-07-06 Thread Gleb Natapov
On Sat, Jul 05, 2014 at 09:47:54AM +0200, Jan Kiszka wrote:
> From: Jan Kiszka 
> 
> We are able to use x2APIC mode in the absence of interrupt remapping on
> certain hypervisors. So it is fine to disable IRQ_REMAP without having
> to give up x2APIC support.
FWIW I did similar thing back when I added x2apic to KVM:
http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-06/msg14579.html
But was advised against it.

> 
> Signed-off-by: Jan Kiszka 
> ---
>  arch/x86/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index a8f749e..30a9987 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -310,7 +310,7 @@ config SMP
>  
>  config X86_X2APIC
>   bool "Support x2apic"
> - depends on X86_LOCAL_APIC && X86_64 && IRQ_REMAP
> + depends on X86_LOCAL_APIC && X86_64 && (IRQ_REMAP || HYPERVISOR_GUEST)
>   ---help---
> This enables x2apic support on CPUs that have this feature.
>  
> 
> 



--
Gleb.


Re: [PATCH 4/4] kvm, mem-hotplug: Update apic access page when it is migrated.

2014-07-04 Thread Gleb Natapov
On Fri, Jul 04, 2014 at 10:18:25AM +0800, Tang Chen wrote:
> Hi Gleb,
> 
> Thanks for the advices. Please see below.
> 
> On 07/03/2014 09:55 PM, Gleb Natapov wrote:
> ..
> >>@@ -575,6 +575,7 @@ struct kvm_arch {
> >>
> >>unsigned int tss_addr;
> >>struct page *apic_access_page;
> >>+   bool apic_access_page_migrated;
> >Better to have two requests, KVM_REQ_APIC_PAGE_MAP and KVM_REQ_APIC_PAGE_UNMAP, IMO.
> >
> 
> vcpu->requests is an unsigned long, and we can only have 64 requests. Isn't
> adding two requests for the apic page and another similar two for the ept
> page too many ?  Not sure.
> 
Let's not worry about that for now. Maybe it is enough to have only one
KVM_REQ_APIC_PAGE_RELOAD request: set apic_access_page to a new value
before sending the request, and reload whatever is in apic_access_page
during KVM_REQ_APIC_PAGE_RELOAD processing. Or we can even reload
apic_access_page as part of mmu reload and reuse KVM_REQ_MMU_RELOAD.

> >>
> >>gpa_t wall_clock;
> >>
> >>@@ -739,6 +740,7 @@ struct kvm_x86_ops {
> >>void (*hwapic_isr_update)(struct kvm *kvm, int isr);
> >>void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
> >>void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set);
> >>+   void (*set_apic_access_page_addr)(struct kvm *kvm, hpa_t hpa);
> >>void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector);
> >>void (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
> >>int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
> >>diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >>index c0d72f6..a655444 100644
> >>--- a/arch/x86/kvm/mmu.c
> >>+++ b/arch/x86/kvm/mmu.c
> >>@@ -3436,6 +3436,21 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, 
> >>gva_t gpa, u32 error_code,
> >>kvm_make_request(KVM_REQ_MIGRATE_EPT, vcpu);
> >>}
> >>
> >>+   if (gpa == VMX_APIC_ACCESS_PAGE_ADDR &&
> >>+   vcpu->kvm->arch.apic_access_page_migrated) {
> >Why check arch.apic_access_page_migrated here? Isn't it enough that the
> >fault is on the apic address?
> >
> 
> True. It's enough. Followed.
> 
> >>+   int i;
> >>+
> >>+   vcpu->kvm->arch.apic_access_page_migrated = false;
> >>+
> >>+   /*
> >>+* We need update APIC_ACCESS_ADDR pointer in each VMCS of
> >>+* all the online vcpus.
> >>+*/
> >>+   for (i = 0; i < atomic_read(&vcpu->kvm->online_vcpus); i++)
> >>+   kvm_make_request(KVM_REQ_MIGRATE_APIC,
> >>+vcpu->kvm->vcpus[i]);
> >make_all_cpus_request(). You need to kick all vcpus from a guest mode.
> >
> 
> OK, followed. But would you please explain more about this. :)
> Why need to kick all vcpus from guest mode when making request to all vcpus
> ?
Because if you do not force the other vcpus out of guest mode they will not
reload the apic_access_page value till the next vmexit, but since the EPT page
table now has a mapping for 0xfee00000, access to this address will not cause
an EPT violation and will not cause an apic exit either.

> 
> >>+   }
> >>+
> >>spin_unlock(&vcpu->kvm->mmu_lock);
> >>
> >>return r;
> >>diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >>index c336cb3..abc152f 100644
> >>--- a/arch/x86/kvm/vmx.c
> >>+++ b/arch/x86/kvm/vmx.c
> >>@@ -3988,7 +3988,7 @@ static int alloc_apic_access_page(struct kvm *kvm)
> >>if (r)
> >>goto out;
> >>
> >>-   page = gfn_to_page(kvm, VMX_APIC_ACCESS_PAGE_ADDR >> PAGE_SHIFT);
> >>+   page = gfn_to_page_no_pin(kvm, VMX_APIC_ACCESS_PAGE_ADDR >> PAGE_SHIFT);
> >>if (is_error_page(page)) {
> >>r = -EFAULT;
> >>goto out;
> >>@@ -7075,6 +7075,12 @@ static void vmx_set_virtual_x2apic_mode(struct 
> >>kvm_vcpu *vcpu, bool set)
> >>vmx_set_msr_bitmap(vcpu);
> >>  }
> >>
> >>+static void vmx_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa)
> >>+{
> >>+   if (vm_need_virtualize_apic_accesses(kvm))
> >This shouldn't even be called if the apic access page is not supported.
> >Neither the mmu_notifier path nor the tdp_page_fault path should ever see
> >the 0xfee00000 address. BUG() is more appropriate here.
> >
> 
> I don't quite understand. Why calling this function 
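
The kick described above is what make_all_cpus_request() in
virt/kvm/kvm_main.c implements: it sets the request bit on every vcpu and
sends an IPI to those currently in guest mode. A sketch of using it (the
function is static in kvm_main.c at this point, so it would first need to be
exported):

    /* instead of the open-coded loop over online vcpus */
    make_all_cpus_request(vcpu->kvm, KVM_REQ_APIC_PAGE_RELOAD);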

Re: [PATCH 3/4] kvm, memory-hotplug: Update ept identity pagetable when it is migrated.

2014-07-04 Thread Gleb Natapov
On Fri, Jul 04, 2014 at 10:36:06AM +0800, Tang Chen wrote:
> Hi Gleb,
> 
> On 07/03/2014 12:34 AM, Gleb Natapov wrote:
> >On Wed, Jul 02, 2014 at 05:00:36PM +0800, Tang Chen wrote:
> >>ept identity pagetable is pinned in memory, and as a result it cannot be
> >>migrated/hot-removed.
> >>
> >>But actually it doesn't need to be pinned in memory.
> >>
> >>This patch introduces a new vcpu request: KVM_REQ_MIGRATE_EPT to reset ept
> >>indetity pagetable related variable. This request will be made when
> >>kvm_mmu_notifier_invalidate_page() is called when the page is unmapped
> >>from the qemu user space to reset kvm->arch.ept_identity_pagetable to NULL.
> >>And will also be made when ept violation happens to reset
> >>kvm->arch.ept_identity_pagetable to the new page.
> >
> >kvm->arch.ept_identity_pagetable is never used as a page address, just as a
> >boolean null/!null to see if the identity pagetable is initialized. I do
> >not see why we would want to track its address at all. Changing it to bool
> >and assigning true during initialization should be enough.
> 
> We already have kvm->arch.ept_identity_pagetable_done to indicate if the ept
> identity table is initialized. If we make kvm->arch.ept_identity_pagetable
> to
> bool, do you mean we have:
> 
> kvm->arch.ept_identity_pagetable: indicate if ept page is allocated,
> kvm->arch.ept_identity_pagetable_done: indicate if ept page is initialized ?
> 
ept_identity_pagetable also means that a memory slot for identity page is 
initialized.

> I don't think we need this. Shall we remove kvm->arch.ept_identity_pagetable
> ?
> 
Maybe those two can be consolidated somehow, but let's leave it to a separate
patch to make this one simpler.

--
Gleb.


Re: [PATCH 4/4] kvm, mem-hotplug: Update apic access page when it is migrated.

2014-07-03 Thread Gleb Natapov
On Wed, Jul 02, 2014 at 05:00:37PM +0800, Tang Chen wrote:
> apic access page is pinned in memory, and as a result it cannot be
> migrated/hot-removed.
> 
> Actually it doesn't need to be pinned in memory.
> 
> This patch introduces a new vcpu request: KVM_REQ_MIGRATE_EPT. This requet
> will be made when kvm_mmu_notifier_invalidate_page() is called when the page
> is unmapped from the qemu user space to reset APIC_ACCESS_ADDR pointer in
> each online vcpu to 0. And will also be made when ept violation happens to
> reset APIC_ACCESS_ADDR to the new page phys_addr (host phys_addr).
> ---
>  arch/x86/include/asm/kvm_host.h |  2 ++
>  arch/x86/kvm/mmu.c  | 15 +++
>  arch/x86/kvm/vmx.c  |  9 -
>  arch/x86/kvm/x86.c  | 20 
>  include/linux/kvm_host.h|  1 +
>  virt/kvm/kvm_main.c | 15 +++
>  6 files changed, 61 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 8771c0f..f104b87 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -575,6 +575,7 @@ struct kvm_arch {
>  
>   unsigned int tss_addr;
>   struct page *apic_access_page;
> + bool apic_access_page_migrated;
Better to have two requests, KVM_REQ_APIC_PAGE_MAP and KVM_REQ_APIC_PAGE_UNMAP, IMO.

>  
>   gpa_t wall_clock;
>  
> @@ -739,6 +740,7 @@ struct kvm_x86_ops {
>   void (*hwapic_isr_update)(struct kvm *kvm, int isr);
>   void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
>   void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set);
> + void (*set_apic_access_page_addr)(struct kvm *kvm, hpa_t hpa);
>   void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector);
>   void (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
>   int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index c0d72f6..a655444 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -3436,6 +3436,21 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t 
> gpa, u32 error_code,
>   kvm_make_request(KVM_REQ_MIGRATE_EPT, vcpu);
>   }
>  
> + if (gpa == VMX_APIC_ACCESS_PAGE_ADDR &&
> + vcpu->kvm->arch.apic_access_page_migrated) {
Why check arch.apic_access_page_migrated here? Isn't it enough that the fault
is on the apic address?

> + int i;
> +
> + vcpu->kvm->arch.apic_access_page_migrated = false;
> +
> + /*
> +  * We need update APIC_ACCESS_ADDR pointer in each VMCS of
> +  * all the online vcpus.
> +  */
> + for (i = 0; i < atomic_read(&vcpu->kvm->online_vcpus); i++)
> + kvm_make_request(KVM_REQ_MIGRATE_APIC,
> +  vcpu->kvm->vcpus[i]);
make_all_cpus_request(). You need to kick all vcpus from a guest mode.

> + }
> +
>   spin_unlock(&vcpu->kvm->mmu_lock);
>  
>   return r;
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index c336cb3..abc152f 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -3988,7 +3988,7 @@ static int alloc_apic_access_page(struct kvm *kvm)
>   if (r)
>   goto out;
>  
> - page = gfn_to_page(kvm, VMX_APIC_ACCESS_PAGE_ADDR >> PAGE_SHIFT);
> + page = gfn_to_page_no_pin(kvm, VMX_APIC_ACCESS_PAGE_ADDR >> PAGE_SHIFT);
>   if (is_error_page(page)) {
>   r = -EFAULT;
>   goto out;
> @@ -7075,6 +7075,12 @@ static void vmx_set_virtual_x2apic_mode(struct 
> kvm_vcpu *vcpu, bool set)
>   vmx_set_msr_bitmap(vcpu);
>  }
>  
> +static void vmx_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa)
> +{
> + if (vm_need_virtualize_apic_accesses(kvm))
This shouldn't even be called if the apic access page is not supported.
Neither the mmu_notifier path nor the tdp_page_fault path should ever see
the 0xfee00000 address. BUG() is more appropriate here.


> + vmcs_write64(APIC_ACCESS_ADDR, hpa);
> +}
> +
>  static void vmx_hwapic_isr_update(struct kvm *kvm, int isr)
>  {
>   u16 status;
> @@ -8846,6 +8852,7 @@ static struct kvm_x86_ops vmx_x86_ops = {
>   .enable_irq_window = enable_irq_window,
>   .update_cr8_intercept = update_cr8_intercept,
>   .set_virtual_x2apic_mode = vmx_set_virtual_x2apic_mode,
> + .set_apic_access_page_addr = vmx_set_apic_access_page_addr,
svm needs that too.

>   .vm_has_apicv = vmx_vm_has_apicv,
>   .load_eoi_exitmap = vmx_load_eoi_exitmap,
>   .hwapic_irr_update = vmx_hwapic_irr_update,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a26524f..14e7174 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5943,6 +5943,24 @@ static void vcpu_migrated_page_update_ept(struct 
> kvm_vcpu *vcpu)
>   }
>  }
>  
> +static void vcpu_migrated_page_update_apic(struct kvm_vcpu *vcpu)
> +{
> + struct k

Re: [PATCH 0/4] kvm, mem-hotplug: Do not pin ept identity pagetable and apic access page.

2014-07-02 Thread Gleb Natapov
On Thu, Jul 03, 2014 at 09:17:59AM +0800, Tang Chen wrote:
> Hi Gleb,
> 
> On 07/02/2014 05:00 PM, Tang Chen wrote:
> >Hi Gleb, Marcelo,
> >
> >Please help to review this patch-set.
> >
> >NOTE: This patch-set doesn't work properly.
> >
> >
> >ept identity pagetable and apic access page in kvm are pinned in memory.
> >As a result, they cannot be migrated/hot-removed.
> >
> >But actually they don't need to be pinned in memory.
> >
> >This patch-set introduces two new vcpu requests: KVM_REQ_MIGRATE_EPT and 
> >KVM_REQ_MIGRATE_APIC.
> >These two requests are made when the two pages are migrated by the
> >mmu_notifier, to reset the related variables to an unusable value. They will
> >also be made when an ept violation happens, to reset the new pages.
> >
> >
> >[Known problem]
> >After this patch-set applied, the two pages can be migrated/hot-removed.
> >But after migrating apic access page, the guest died.
> >
> >The host physical address of apic access page is stored in VMCS. I reset
> >it to 0 to stop guest from accessing it when it is unmapped by
> >kvm_mmu_notifier_invalidate_page(). And reset it to new page's host physical
> >address in tdp_page_fault(). But it seems that guest will access apic page
> >directly by the host physical address.
> 
> Would you please to give some advice about this problem ?
> 
I haven't reviewed the third patch yet, will do ASAP.

--
Gleb.


Re: [PATCH 3/4] kvm, memory-hotplug: Update ept identity pagetable when it is migrated.

2014-07-02 Thread Gleb Natapov
On Wed, Jul 02, 2014 at 05:00:36PM +0800, Tang Chen wrote:
> ept identity pagetable is pinned in memory, and as a result it cannot be
> migrated/hot-removed.
> 
> But actually it doesn't need to be pinned in memory.
> 
> This patch introduces a new vcpu request: KVM_REQ_MIGRATE_EPT to reset ept
> identity pagetable related variable. This request will be made when
> kvm_mmu_notifier_invalidate_page() is called when the page is unmapped
> from the qemu user space to reset kvm->arch.ept_identity_pagetable to NULL.
> And will also be made when ept violation happens to reset
> kvm->arch.ept_identity_pagetable to the new page.

kvm->arch.ept_identity_pagetable is never used as a page address, just as a
boolean null/!null to see if the identity pagetable is initialized. I do
not see why we would want to track its address at all. Changing it to bool
and assigning true during initialization should be enough.


--
Gleb.


Re: [PATCH 2/4] kvm: Add macro VMX_APIC_ACCESS_PAGE_ADDR

2014-07-02 Thread Gleb Natapov
On Wed, Jul 02, 2014 at 05:00:35PM +0800, Tang Chen wrote:
> Define guest phys_addr of apic access page.
> ---
>  arch/x86/include/asm/vmx.h | 2 +-
>  arch/x86/kvm/svm.c | 3 ++-
>  arch/x86/kvm/vmx.c | 7 ---
>  3 files changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index 7004d21..c4672d1 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -422,7 +422,7 @@ enum vmcs_field {
>  #define VMX_EPT_DIRTY_BIT (1ull << 9)
>  
>  #define VMX_EPT_IDENTITY_PAGETABLE_ADDR  0xfffbc000ul
> -
> +#define VMX_APIC_ACCESS_PAGE_ADDR 0xfee00000ull
>  
It has nothing to do with VMX and there is already a define for that:
APIC_DEFAULT_PHYS_BASE

--
Gleb.
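
For reference, the existing define lives in arch/x86/include/asm/apicdef.h:

    #define APIC_DEFAULT_PHYS_BASE 0xfee00000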


Re: [patch 4/5] KVM: MMU: reload request from GET_DIRTY_LOG path

2014-06-30 Thread Gleb Natapov
On Mon, Jun 30, 2014 at 05:59:02PM -0300, Marcelo Tosatti wrote:
> On Fri, Jun 20, 2014 at 01:46:10PM +0300, Gleb Natapov wrote:
> > On Thu, Jun 19, 2014 at 03:40:31PM -0300, Marcelo Tosatti wrote:
> > > On Thu, Jun 19, 2014 at 11:17:19AM +0300, Gleb Natapov wrote:
> > > > On Wed, Jun 18, 2014 at 08:12:07PM -0300, mtosa...@redhat.com wrote:
> > > > > Reload remote vcpus MMU from GET_DIRTY_LOG codepath, before
> > > > > deleting a pinned spte.
> > > > > 
> > > > > Signed-off-by: Marcelo Tosatti 
> > > > > 
> > > > > ---
> > > > >  arch/x86/kvm/mmu.c |3 +++
> > > > >  1 file changed, 3 insertions(+)
> > > > > 
> > > > > Index: kvm.pinned-sptes/arch/x86/kvm/mmu.c
> > > > > ===
> > > > > --- kvm.pinned-sptes.orig/arch/x86/kvm/mmu.c  2014-06-13 
> > > > > 16:50:50.040140594 -0300
> > > > > +++ kvm.pinned-sptes/arch/x86/kvm/mmu.c   2014-06-13 
> > > > > 16:51:05.620104451 -0300
> > > > > @@ -1247,6 +1247,9 @@
> > > > >   spte &= ~SPTE_MMU_WRITEABLE;
> > > > >   spte = spte & ~PT_WRITABLE_MASK;
> > > > >  
> > > > > + if (is_pinned_spte(spte))
> > > > > + mmu_reload_pinned_vcpus(kvm);
> > > > > +
> > > > Why write protect it at all? mmu_reload_pinned_vcpus() will unprotect it
> > > > anyway on the next vmentry. Isn't it better to just report all pinned
> > > > pages as dirty always?
> > > 
> > > That was the initial plan, however it's awkward to stop vcpus, execute
> > > get_dirty_log twice, and have pages marked as dirty on the second
> > > execution.
> > Indeed, but I think it may happen today with vhost (or even other devices
> > that emulate dma), so userspace should be able to deal with it already.
> > 
> > > 
> > > That is, it is "incorrect" to report pages as dirty when they are
> > > clean.
> > In case of the PEBS area we cannot know. The CPU writes there directly
> > without KVM even knowing about it, so the only sane thing is to assume that
> > after a vmentry the PEBS area is dirty. This is what happens with this patch
> > BTW: mmu_reload_pinned_vcpus() will force ->page_fault(FERR_WRITE), which
> > will mark all pinned pages as dirty even if the pages are never written to.
> > You can achieve the same by having vcpu->pinned_page_dirty, which is set to
> > true on each vmentry and to false on each GET_DIRTY_LOG.
> > 
> > > 
> > > Moreover, if the number of pinned pages is larger than the dirty
> > > threshold to stop VM and migrate, you'll never migrate. If vcpus are
> > > in HLT and don't VM-enter immediately, the pages should not be refaulted
> > > right away.
> > We should not allow that many pinned pages for security reasons. And
> > having a lot of pages to fault in on the next vmentry after each
> > GET_DIRTY_LOG will slow down a guest during migration considerably.
> > 
> > > 
> > > Do you think the optimization is worthwhile ?
> > > 
> > I think it's simpler, but even if we go with your current approach
> > it should be improved: there is no point sending an IPI to all vcpus in
> > spte_write_protect() like you do here, since the IPI will be sent anyway at
> > the end of write protect because of the ptes' writable->nonwritable
> > transition, so all you have to do here is set KVM_REQ_MMU_RELOAD, no need
> > for an IPI.
> 
> No, you have to make sure the vcpu is out of guest mode.
> 
> > In fact this makes me think that write protecting a pinned page here is
> > incorrect, because the old translation may not be in the TLB, and if the CPU
> > tries to write a PEBS entry after the pte is write protected but before the
> > MMU is reloaded it will encounter a non-writable pte and god knows what will
> > happen; the SDM does not tell. So spte_write_protect() should be something
> > like this IMO:
> 
> The vcpu will never see a read-only spte because the VM-exit (due to
> IPI) guarantees vcpu is outside of guest mode _before_ it is write
> protected.
Right. Now I see why you absolutely have to send the IPI in
mmu_reload_pinned_vcpus() before marking the pte as read only. And
kvm->mmu_lock is what will prevent a vcpu from re-entering guest mode
again before the pte is marked read only, right?

> 
> So I ask you: do you still hold the "current approach should be
> improved" position ?
> 
As I said, IMO what I proposed is much simpler and not as tricky as what you
have here. It also has the advantage of not slowing down the next guest entry
after GET_DIRTY_LOG, because it does not require an mmu reload and faulting
pinned pages back in.

--
Gleb.
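
A sketch of that alternative; vcpu->pinned_page_dirty is the flag named
above, and mark_pinned_pages_dirty() is a hypothetical helper that walks the
vcpu's pinned ranges calling mark_page_dirty():

    /* on every vmentry: the CPU may write the PEBS area at any time */
    vcpu->pinned_page_dirty = true;

    /* in the GET_DIRTY_LOG path, for each vcpu */
    if (vcpu->pinned_page_dirty) {
        mark_pinned_pages_dirty(vcpu);  /* hypothetical helper */
        vcpu->pinned_page_dirty = false;
    }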


Re: [PATCH] KVM: SVM: Fix CPL export via SS.DPL

2014-06-30 Thread Gleb Natapov
On Mon, Jun 30, 2014 at 05:15:44PM +0200, Borislav Petkov wrote:
> On Mon, Jun 30, 2014 at 05:03:57PM +0200, Jan Kiszka wrote:
> > 15.5.1:
> > 
> > "When examining segment attributes after a #VMEXIT:
> > [...]
> > • Retrieve the CPL from the CPL field in the VMCB, not from any segment
> > DPL."
> 
> Heey, it is even documented! :-P
> 
Yes, on SVM we should always respect this field. Unfortunately there
is no such field in VMX, so we have to do DPL gymnastics there.

--
Gleb.
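
This is why svm.c can read the CPL straight from the VMCB save area instead
of deriving it from SS; roughly:

    static int svm_get_cpl(struct kvm_vcpu *vcpu)
    {
        struct vmcb_save_area *save = &to_svm(vcpu)->vmcb->save;

        return save->cpl;
    }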


Re: [PATCH] KVM: x86: Pending interrupt may be delivered after INIT

2014-06-30 Thread Gleb Natapov
On Mon, Jun 30, 2014 at 11:35:27AM +0300, Nadav Amit wrote:
> We encountered a scenario in which, after an INIT is delivered, a pending
> interrupt is delivered, although it was sent before the INIT.  As the SDM
> states in section 10.4.7.1, the ISR and the IRR should be cleared after INIT,
> as KVM does.  This also means that pending interrupts should be cleared.  This
> patch clears the pending interrupts upon reset (and INIT); and at the same
> occasion clears the pending exceptions, since they may cause a similar issue.
> 
> Signed-off-by: Nadav Amit 
> ---
>  arch/x86/kvm/x86.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index f32a025..863ac07 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6835,6 +6835,8 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu)
>   atomic_set(&vcpu->arch.nmi_queued, 0);
>   vcpu->arch.nmi_pending = 0;
>   vcpu->arch.nmi_injected = false;
> + vcpu->arch.interrupt.pending = false;
> + vcpu->arch.exception.pending = false;
kvm_clear_interrupt_queue(vcpu);
kvm_clear_exception_queue(vcpu);

>  
>   memset(vcpu->arch.db, 0, sizeof(vcpu->arch.db));
>   vcpu->arch.dr6 = DR6_FIXED_1;
> -- 
> 1.9.1
> 

--
Gleb.
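
The two helpers are static inlines in arch/x86/kvm/x86.h, roughly:

    static inline void kvm_clear_exception_queue(struct kvm_vcpu *vcpu)
    {
        vcpu->arch.exception.pending = false;
    }

    static inline void kvm_clear_interrupt_queue(struct kvm_vcpu *vcpu)
    {
        vcpu->arch.interrupt.pending = false;
    }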


Re: [RFC PATCH 1/1] Move two pinned pages to non-movable node in kvm.

2014-06-29 Thread Gleb Natapov
On Mon, Jun 30, 2014 at 09:45:32AM +0800, Tang Chen wrote:
> On 06/21/2014 04:39 AM, Marcelo Tosatti wrote:
> >On Fri, Jun 20, 2014 at 05:31:46PM -0300, Marcelo Tosatti wrote:
> >>>IIRC your shadow page pinning patch series supports flushing of ptes
> >>>by mmu notifier by forcing MMU reload and, as a result, faulting in of
> >>>pinned pages during next entry.  Your patch series does not pin pages
> >>>by elevating their page count.
> >>
> >>No but PEBS series does and its required to stop swap-out
> >>of the page.
> >
> >Well actually no because of mmu notifiers.
> >
> >Tang, can you implement mmu notifiers for the other breaker of
> >mem hotplug ?
> 
> Hi Marcelo,
> 
> I made a patch to update ept and apic pages when finding them in the
> next ept violation. And I also updated the APIC_ACCESS_ADDR phys_addr.
> The pages can be migrated, but the guest crashed.
How does it crash?

> 
> How do I stop the guest from accessing apic pages in the mmu_notifier when
> the page migration starts ?  Do I need to stop all the vcpus by setting the
> vcpu state to KVM_MP_STATE_HALTED ?  If so, the vcpu will not be able to go
> to the next ept violation.
When the apic access page is unmapped from the ept pages by mmu notifiers you
need to set its value in the VMCS to a physical address that will never be
mapped into guest memory. Zero, for instance. You can do it by introducing a
new KVM_REQ_ bit and setting the VMCS value during the next vcpu vmentry. On
an ept violation you need to update the VMCS pointer to the newly allocated
physical address; you can use the same KVM_REQ_ mechanism again.

> 
> So, may I write any specific value into APIC_ACCESS_ADDR to stop the guest
> from accessing the apic page ?
> 
Any phys address that will never be mapped into guest's memory should work.

--
Gleb.
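
Putting the two phases together, a sketch assuming a new
KVM_REQ_APIC_PAGE_RELOAD bit; the VMCS write has to happen from the request
handler on the next vmentry, not from the notifier itself, since the
notifier does not run with the vcpu's VMCS loaded:

    /* mmu notifier side: make the old page unreachable */
    kvm->arch.apic_access_page = NULL;
    make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);

    /* request handler, with the VMCS loaded: 0 is a never-mapped address */
    vmcs_write64(APIC_ACCESS_ADDR, kvm->arch.apic_access_page ?
                 page_to_phys(kvm->arch.apic_access_page) : 0);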


Re: __schedule #DF splat

2014-06-29 Thread Gleb Natapov
On Sun, Jun 29, 2014 at 04:01:04PM +0200, Borislav Petkov wrote:
> On Sun, Jun 29, 2014 at 04:42:47PM +0300, Gleb Natapov wrote:
> > Please do so and let us know.
> 
> Yep, just did. Reverting ae9fedc793 fixes the issue.
> 
> > reinj:1 means that the previous injection failed due to another #PF that
> > happened during the event injection itself. This may happen if the GDT or
> > the first instruction of a fault handler is not mapped by shadow pages, but
> > here it says that the new page fault is at the same address as the previous
> > one, as if the GDT or the #PF handler is mapped there. Strange. Especially
> > since the #DF is injected successfully, so the GDT should be fine. Maybe a
> > wrong CPL makes SVM crazy?
> 
> Well, I'm not going to even pretend to know kvm to know *when* we're
> saving VMCB state but if we're saving the wrong CPL and then doing the
> pagetable walk, I can very well imagine if the walker gets confused. One
> possible issue could be U/S bit (bit 2) in the PTE bits which allows
> access to supervisor pages only when CPL < 3. I.e., CPL has effect on
> pagetable walk and a wrong CPL level could break it.
> 
> All a conjecture though...
> 
Looks plausible; still strange that the second #PF is at the same address as
the first one though. Anyway, now we have the commit to blame.

--
Gleb.


Re: __schedule #DF splat

2014-06-29 Thread Gleb Natapov
On Sun, Jun 29, 2014 at 03:14:43PM +0200, Borislav Petkov wrote:
> On Sun, Jun 29, 2014 at 02:22:35PM +0200, Jan Kiszka wrote:
> > OK, looks like I won ;):
> 
> I gladly let you win. :-P
> 
> > The issue was apparently introduced with "KVM: x86: get CPL from
> > SS.DPL" (ae9fedc793). Maybe we are not properly saving or restoring
> > this state on SVM since then.
> 
> I wonder if this change in the CPL saving would have anything to do with
> the fact that we're doing a CR3 write right before we fail pagetable
> walk and end up walking a user page table. It could be unrelated though,
> as in the previous dump I had a get_user right before the #DF. Hmmm.
> 
> I better go and revert that one and check whether it fixes things.
Please do so and let us know.

> 
> > Need a break, will look into details later.
> 
> Ok, some more info from my side, see relevant snippet below. We're
> basically not finding the pte at level 3 during the page walk for
> 7fff0b0f8908.
> 
> However, why we're even page walking this userspace address at that
> point I have no idea.
> 
> And the CR3 write right before this happens is there so I'm pretty much
> sure by now that this is related...
> 
>  qemu-system-x86-5007  [007] ...1   346.126204: vcpu_match_mmio: gva 
> 0xff5fd0b0 gpa 0xfee000b0 Write GVA
>  qemu-system-x86-5007  [007] ...1   346.126204: kvm_mmio: mmio write len 4 
> gpa 0xfee000b0 val 0x0
>  qemu-system-x86-5007  [007] ...1   346.126205: kvm_apic: apic_write APIC_EOI 
> = 0x0
>  qemu-system-x86-5007  [007] ...1   346.126205: kvm_eoi: apicid 0 vector 253
>  qemu-system-x86-5007  [007] d..2   346.126206: kvm_entry: vcpu 0
>  qemu-system-x86-5007  [007] d..2   346.126211: kvm_exit: reason write_cr3 
> rip 0x816113a0 info 8000 0
>  qemu-system-x86-5007  [007] ...2   346.126214: kvm_mmu_get_page: sp gen 25 
> gfn 7b2b1 4 pae q0 wux !nxe root 0 sync existing
>  qemu-system-x86-5007  [007] d..2   346.126215: kvm_entry: vcpu 0
>  qemu-system-x86-5007  [007] d..2   346.126216: kvm_exit: reason PF excp rip 
> 0x816113df info 2 7fff0b0f8908
>  qemu-system-x86-5007  [007] ...1   346.126217: kvm_page_fault: address 
> 7fff0b0f8908 error_code 2
VCPU faults on 7fff0b0f8908.

>  qemu-system-x86-5007  [007] ...1   346.126218: kvm_mmu_pagetable_walk: addr 
> 7fff0b0f8908 pferr 2 W
>  qemu-system-x86-5007  [007] ...1   346.126219: kvm_mmu_paging_element: pte 
> 7b2b6067 level 4
>  qemu-system-x86-5007  [007] ...1   346.126220: kvm_mmu_paging_element: pte 0 
> level 3
>  qemu-system-x86-5007  [007] ...1   346.126220: kvm_mmu_walker_error: pferr 2 
> W
Address is not mapped by the page tables.

>  qemu-system-x86-5007  [007] ...1   346.126221: kvm_multiple_exception: nr: 
> 14, prev: 255, has_error: 1, error_code: 0x2, reinj: 0
>  qemu-system-x86-5007  [007] ...1   346.126221: kvm_inj_exception: #PF (0x2)
KVM injects #PF.

>  qemu-system-x86-5007  [007] d..2   346.126222: kvm_entry: vcpu 0
>  qemu-system-x86-5007  [007] d..2   346.126223: kvm_exit: reason PF excp rip 
> 0x816113df info 2 7fff0b0f8908
>  qemu-system-x86-5007  [007] ...1   346.126224: kvm_multiple_exception: nr: 
> 14, prev: 14, has_error: 1, error_code: 0x2, reinj: 1
reinj:1 means that the previous injection failed due to another #PF that
happened during the event injection itself. This may happen if the GDT or
the first instruction of a fault handler is not mapped by shadow pages, but
here it says that the new page fault is at the same address as the previous
one, as if the GDT or the #PF handler is mapped there. Strange. Especially
since the #DF is injected successfully, so the GDT should be fine. Maybe a
wrong CPL makes SVM crazy?

 
>  qemu-system-x86-5007  [007] ...1   346.126225: kvm_page_fault: address 
> 7fff0b0f8908 error_code 2
>  qemu-system-x86-5007  [007] ...1   346.126225: kvm_mmu_pagetable_walk: addr 
> 7fff0b0f8908 pferr 0 
>  qemu-system-x86-5007  [007] ...1   346.126226: kvm_mmu_paging_element: pte 
> 7b2b6067 level 4
>  qemu-system-x86-5007  [007] ...1   346.126227: kvm_mmu_paging_element: pte 0 
> level 3
>  qemu-system-x86-5007  [007] ...1   346.126227: kvm_mmu_walker_error: pferr 0 
>  qemu-system-x86-5007  [007] ...1   346.126228: kvm_mmu_pagetable_walk: addr 
> 7fff0b0f8908 pferr 2 W
>  qemu-system-x86-5007  [007] ...1   346.126229: kvm_mmu_paging_element: pte 
> 7b2b6067 level 4
>  qemu-system-x86-5007  [007] ...1   346.126230: kvm_mmu_paging_element: pte 0 
> level 3
>  qemu-system-x86-5007  [007] ...1   346.126230: kvm_mmu_walker_error: pferr 2 
> W
>  qemu-system-x86-5007  [007] ...1   346.126231: kvm_multiple_exception: nr: 
> 14, prev: 14, has_error: 1, error_code: 0x2, reinj: 0
Here we are getting a #PF while delivering another #PF, which is, rightfully,
transformed into a #DF.

>  qemu-system-x86-5007  [007] ...1   346.126231: kvm_inj_exception: #DF (0x0)
>  qemu-system-x86-5007  [007] d..2   346.126232: kvm_entry: vcpu 0
>  qemu-system-x86-5007  [007] d..2   346.126371: kvm_exit: reason io rip 
> 0x8131e623 info 3d40220 fff
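
For reference, the #PF -> #DF promotion seen above happens in
kvm_multiple_exception(); the logic in arch/x86/kvm/x86.c is roughly the
SDM's benign/contributory/page-fault classification:

    int class1 = exception_class(prev_nr); /* exception being delivered */
    int class2 = exception_class(nr);      /* newly raised exception */

    if ((class1 == EXCPT_CONTRIBUTORY && class2 == EXCPT_CONTRIBUTORY)
        || (class1 == EXCPT_PF && class2 != EXCPT_BENIGN)) {
        /* promote to a double fault */
        vcpu->arch.exception.pending = true;
        vcpu->arch.exception.nr = DF_VECTOR;
        vcpu->arch.exception.has_error_code = true;
        vcpu->arch.exception.error_code = 0;
    }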

Re: __schedule #DF splat

2014-06-29 Thread Gleb Natapov
On Sun, Jun 29, 2014 at 12:31:50PM +0200, Jan Kiszka wrote:
> On 2014-06-29 12:24, Gleb Natapov wrote:
> > On Sun, Jun 29, 2014 at 11:56:03AM +0200, Jan Kiszka wrote:
> >> On 2014-06-29 08:46, Gleb Natapov wrote:
> >>> On Sat, Jun 28, 2014 at 01:44:31PM +0200, Borislav Petkov wrote:
> >>>>  qemu-system-x86-20240 [006] ...1  9406.484134: kvm_page_fault: address 
> >>>> 7fffb62ba318 error_code 2
> >>>>  qemu-system-x86-20240 [006] ...1  9406.484136: kvm_inj_exception: #PF 
> >>>> (0x2)a
> >>>>
> >>>> kvm injects the #PF into the guest.
> >>>>
> >>>>  qemu-system-x86-20240 [006] d..2  9406.484136: kvm_entry: vcpu 1
> >>>>  qemu-system-x86-20240 [006] d..2  9406.484137: kvm_exit: reason PF excp 
> >>>> rip 0x8161130f info 2 7fffb62ba318
> >>>>  qemu-system-x86-20240 [006] ...1  9406.484138: kvm_page_fault: address 
> >>>> 7fffb62ba318 error_code 2
> >>>>  qemu-system-x86-20240 [006] ...1  9406.484141: kvm_inj_exception: #DF 
> >>>> (0x0)
> >>>>
> >>>> Second #PF at the same address and kvm injects the #DF.
> >>>>
> >>>> BUT(!), why?
> >>>>
> >>>> I probably am missing something but WTH are we pagefaulting at a
> >>>> user address in context_switch() while doing a lockdep call, i.e.
> >>>> spin_release? We're not touching any userspace gunk there AFAICT.
> >>>>
> >>>> Is this an async pagefault or so which kvm is doing so that the guest
> >>>> rip is actually pointing at the wrong place?
> >>>>
> >>> There is nothing in the trace that points to an async pagefault as far
> >>> as I see.
> >>>
> >>>> Or something else I'm missing, most probably...
> >>>>
> >>> Strange indeed. Can you also enable kvmmmu tracing? You can also
> >>> instrument kvm_multiple_exception() to see which two exceptions are
> >>> combined into #DF.
> >>>
> >>
> >> FWIW, I'm seeing the same issue here (likely) on an E-450 APU. It
> >> disappears with older KVM (didn't bisect yet, some 3.11 is fine) and
> >> when patch-disabling the vmport in QEMU.
> >>
> >> Let me know if I can help with the analysis.
> >>
> > Bisection would be great of course. One thing that is special about
> > vmport that comes to mind is that it reads vcpu registers to userspace and
> > writes them back. IIRC "info registers" does the same. Can you see if the
> > problem is reproducible with vmport disabled, but doing "info registers"
> > in the qemu console? Although the trace does not show any exits to userspace
> > near the failure...
> 
> Yes, info registers crashes the guest after a while as well (with
> different backtrace due to different context).
> 
Oh crap. Bisection would be most helpful. Just to be absolutely sure
that this is not a QEMU problem: does exactly the same QEMU version work with
older kernels?

--
Gleb.


Re: __schedule #DF splat

2014-06-29 Thread Gleb Natapov
On Sun, Jun 29, 2014 at 11:56:03AM +0200, Jan Kiszka wrote:
> On 2014-06-29 08:46, Gleb Natapov wrote:
> > On Sat, Jun 28, 2014 at 01:44:31PM +0200, Borislav Petkov wrote:
> >>  qemu-system-x86-20240 [006] ...1  9406.484134: kvm_page_fault: address 
> >> 7fffb62ba318 error_code 2
> >>  qemu-system-x86-20240 [006] ...1  9406.484136: kvm_inj_exception: #PF 
> >> (0x2)a
> >>
> >> kvm injects the #PF into the guest.
> >>
> >>  qemu-system-x86-20240 [006] d..2  9406.484136: kvm_entry: vcpu 1
> >>  qemu-system-x86-20240 [006] d..2  9406.484137: kvm_exit: reason PF excp 
> >> rip 0x8161130f info 2 7fffb62ba318
> >>  qemu-system-x86-20240 [006] ...1  9406.484138: kvm_page_fault: address 
> >> 7fffb62ba318 error_code 2
> >>  qemu-system-x86-20240 [006] ...1  9406.484141: kvm_inj_exception: #DF 
> >> (0x0)
> >>
> >> Second #PF at the same address and kvm injects the #DF.
> >>
> >> BUT(!), why?
> >>
> >> I probably am missing something but WTH are we pagefaulting at a
> >> user address in context_switch() while doing a lockdep call, i.e.
> >> spin_release? We're not touching any userspace gunk there AFAICT.
> >>
> >> Is this an async pagefault or so which kvm is doing so that the guest
> >> rip is actually pointing at the wrong place?
> >>
> > There is nothing in the trace that points to an async pagefault as far as I see.
> > 
> >> Or something else I'm missing, most probably...
> >>
> > Strange indeed. Can you also enable kvmmmu tracing? You can also instrument
> > kvm_multiple_exception() to see which two exceptions are combined into #DF.
> > 
> 
> FWIW, I'm seeing the same issue here (likely) on an E-450 APU. It
> disappears with older KVM (didn't bisect yet, some 3.11 is fine) and
> when patch-disabling the vmport in QEMU.
> 
> Let me know if I can help with the analysis.
>
Bisection would be great of course. One thing that is special about
vmport that comes to mind is that it reads vcpu registers to userspace and
writes them back. IIRC "info registers" does the same. Can you see if the
problem is reproducible with vmport disabled, but doing "info registers"
in the qemu console? Although the trace does not show any exits to userspace
near the failure...

--
Gleb.


Re: __schedule #DF splat

2014-06-28 Thread Gleb Natapov
On Sat, Jun 28, 2014 at 01:44:31PM +0200, Borislav Petkov wrote:
>  qemu-system-x86-20240 [006] ...1  9406.484134: kvm_page_fault: address 
> 7fffb62ba318 error_code 2
>  qemu-system-x86-20240 [006] ...1  9406.484136: kvm_inj_exception: #PF (0x2)a
> 
> kvm injects the #PF into the guest.
> 
>  qemu-system-x86-20240 [006] d..2  9406.484136: kvm_entry: vcpu 1
>  qemu-system-x86-20240 [006] d..2  9406.484137: kvm_exit: reason PF excp rip 
> 0x8161130f info 2 7fffb62ba318
>  qemu-system-x86-20240 [006] ...1  9406.484138: kvm_page_fault: address 
> 7fffb62ba318 error_code 2
>  qemu-system-x86-20240 [006] ...1  9406.484141: kvm_inj_exception: #DF (0x0)
> 
> Second #PF at the same address and kvm injects the #DF.
> 
> BUT(!), why?
> 
> I probably am missing something but WTH are we pagefaulting at a
> user address in context_switch() while doing a lockdep call, i.e.
> spin_release? We're not touching any userspace gunk there AFAICT.
> 
> Is this an async pagefault or so which kvm is doing so that the guest
> rip is actually pointing at the wrong place?
> 
There is nothing in the trace that points to an async pagefault as far as I see.

> Or something else I'm missing, most probably...
> 
Strange indeed. Can you also enable kvmmmu tracing? You can also instrument
kvm_multiple_exception() to see which two exceptions are combined into #DF.

--
Gleb.


Re: [RFC PATCH 1/1] Move two pinned pages to non-movable node in kvm.

2014-06-22 Thread Gleb Natapov
On Fri, Jun 20, 2014 at 05:31:46PM -0300, Marcelo Tosatti wrote:
> > > Same with the APIC access page.
> > APIC page is always mapped into the guest's APIC base address 0xfee00000.
> > The way it works is that when a vCPU accesses the page at 0xfee00000 the
> > access is translated to the APIC access page physical address. The CPU sees
> > that the access is for the APIC page and generates an APIC access exit
> > instead of a memory access. If address 0xfee00000 is not mapped by EPT then
> > an EPT violation exit will be generated instead, the EPT mapping will be
> > instantiated, the access retried by the guest, and this time it will
> > generate an APIC access exit.
> 
> Right, confused with the other APIC page which the CPU writes to (the vAPIC
> page).
> 
That one is allocated with kmalloc.

--
Gleb.
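
Strictly speaking it comes from the page allocator: kvm_create_lapic() does,
roughly,

    apic->regs = (void *)get_zeroed_page(GFP_KERNEL);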


Re: [RFC PATCH 1/1] Move two pinned pages to non-movable node in kvm.

2014-06-20 Thread Gleb Natapov
On Fri, Jun 20, 2014 at 09:53:26AM -0300, Marcelo Tosatti wrote:
> On Fri, Jun 20, 2014 at 02:15:10PM +0300, Gleb Natapov wrote:
> > On Thu, Jun 19, 2014 at 04:00:24PM -0300, Marcelo Tosatti wrote:
> > > On Thu, Jun 19, 2014 at 12:20:32PM +0300, Gleb Natapov wrote:
> > > > CCing Marcelo,
> > > > 
> > > > On Wed, Jun 18, 2014 at 02:50:44PM +0800, Tang Chen wrote:
> > > > > Hi Gleb,
> > > > > 
> > > > > Thanks for the quick reply. Please see below.
> > > > > 
> > > > > On 06/18/2014 02:12 PM, Gleb Natapov wrote:
> > > > > >On Wed, Jun 18, 2014 at 01:50:00PM +0800, Tang Chen wrote:
> > > > > >>[Questions]
> > > > > >>And by the way, would you guys please answer the following 
> > > > > >>questions for me ?
> > > > > >>
> > > > > >>1. What's the ept identity pagetable for ?  Only one page is enough 
> > > > > >>?
> > > > > >>
> > > > > >>2. Is the ept identity pagetable only used in realmode ?
> > > > > >>Can we free it once the guest is up (vcpu in protect mode)?
> > > > > >>
> > > > > >>3. Now, ept identity pagetable is allocated in qemu userspace.
> > > > > >>Can we allocate it in kernel space ?
> > > > > >What would be the benefit?
> > > > > 
> > > > > I think the benefit is we can hot-remove the host memory a kvm guest
> > > > > is using.
> > > > > 
> > > > > For now, only memory in ZONE_MOVABLE can be migrated/hot-removed. And 
> > > > > the
> > > > > kernel
> > > > > will never use ZONE_MOVABLE memory. So if we can allocate these two 
> > > > > pages in
> > > > > kernel space, we can pin them without any trouble. When doing memory
> > > > > hot-remove,
> > > > > the kernel will not try to migrate these two pages.
> > > > But we can do that by other means, no? The patch you've sent for 
> > > > instance.
> > > > 
> > > > > 
> > > > > >
> > > > > >>
> > > > > >>4. If I want to migrate these two pages, what do you think is the 
> > > > > >>best way ?
> > > > > >>
> > > > > >I answered most of those here: 
> > > > > >http://www.mail-archive.com/kvm@vger.kernel.org/msg103718.html
> > > > > 
> > > > > I'm sorry I must missed this email.
> > > > > 
> > > > > Seeing your advice, we can unpin these two pages and repin them in 
> > > > > the next
> > > > > EPT violation.
> > > > > So about this problem, which solution would you prefer, allocate 
> > > > > these two
> > > > > pages in kernel
> > > > > space, or migrate them before memory hot-remove ?
> > > > > 
> > > > > I think the first solution is simpler. But I'm not quite sure if 
> > > > > there is
> > > > > any other pages
> > > > > pinned in memory. If we have the same problem with other kvm pages, I 
> > > > > think
> > > > > it is better to
> > > > > solve it in the second way.
> > > > > 
> > > > > What do you think ?
> > > > Removing pinning is preferable. In fact it looks like for the identity
> > > > pagetable it should be trivial, just don't pin. The APIC access page is
> > > > a little bit more complicated since its physical address needs to be
> > > > tracked to be updated in the VMCS.
> > > 
> > > Yes, and there are new users of page pinning as well soon (see PEBS
> > > threads on kvm-devel).
> > > 
> > > Was thinking of notifiers scheme. Perhaps:
> > > 
> > > ->begin_page_unpin(struct page *page)
> > >   - Remove any possible access to page.
> > > 
> > > ->end_page_unpin(struct page *page)
> > >   - Reinstantiate any possible access to page.
> > > 
> > > For KVM:
> > > 
> > > ->begin_page_unpin()
> > >   - Remove APIC-access page address from VMCS.
> > > or
> > >   - Remove spte translation to pinned page.
> > >   
> > >   - Put vcpu in state where no VM-entries are allowed.
> > > 
> > > ->end_page_unpin()

Re: [RFC PATCH 1/1] Move two pinned pages to non-movable node in kvm.

2014-06-20 Thread Gleb Natapov
On Thu, Jun 19, 2014 at 04:00:24PM -0300, Marcelo Tosatti wrote:
> On Thu, Jun 19, 2014 at 12:20:32PM +0300, Gleb Natapov wrote:
> > CCing Marcelo,
> > 
> > On Wed, Jun 18, 2014 at 02:50:44PM +0800, Tang Chen wrote:
> > > Hi Gleb,
> > > 
> > > Thanks for the quick reply. Please see below.
> > > 
> > > On 06/18/2014 02:12 PM, Gleb Natapov wrote:
> > > >On Wed, Jun 18, 2014 at 01:50:00PM +0800, Tang Chen wrote:
> > > >>[Questions]
> > > >>And by the way, would you guys please answer the following questions 
> > > >>for me ?
> > > >>
> > > >>1. What's the ept identity pagetable for ?  Only one page is enough ?
> > > >>
> > > >>2. Is the ept identity pagetable only used in realmode ?
> > > >>Can we free it once the guest is up (vcpu in protect mode)?
> > > >>
> > > >>3. Now, ept identity pagetable is allocated in qemu userspace.
> > > >>Can we allocate it in kernel space ?
> > > >What would be the benefit?
> > > 
> > > I think the benefit is we can hot-remove the host memory a kvm guest
> > > is using.
> > > 
> > > For now, only memory in ZONE_MOVABLE can be migrated/hot-removed. And the
> > > kernel
> > > will never use ZONE_MOVABLE memory. So if we can allocate these two pages 
> > > in
> > > kernel space, we can pin them without any trouble. When doing memory
> > > hot-remove,
> > > the kernel will not try to migrate these two pages.
> > But we can do that by other means, no? The patch you've sent for instance.
> > 
> > > 
> > > >
> > > >>
> > > >>4. If I want to migrate these two pages, what do you think is the best 
> > > >>way ?
> > > >>
> > > >I answered most of those here: 
> > > >http://www.mail-archive.com/kvm@vger.kernel.org/msg103718.html
> > > 
> > > I'm sorry I must missed this email.
> > > 
> > > Seeing your advice, we can unpin these two pages and repin them in the 
> > > next
> > > EPT violation.
> > > So about this problem, which solution would you prefer, allocate these two
> > > pages in kernel
> > > space, or migrate them before memory hot-remove ?
> > > 
> > > I think the first solution is simpler. But I'm not quite sure if there is
> > > any other pages
> > > pinned in memory. If we have the same problem with other kvm pages, I 
> > > think
> > > it is better to
> > > solve it in the second way.
> > > 
> > > What do you think ?
> > Removing pinning is preferable. In fact it looks like for the identity
> > pagetable it should be trivial, just don't pin. The APIC access page is a
> > little bit more complicated since its physical address needs to be tracked
> > to be updated in the VMCS.
> 
> Yes, and there are new users of page pinning as well soon (see PEBS
> threads on kvm-devel).
> 
> Was thinking of notifiers scheme. Perhaps:
> 
> ->begin_page_unpin(struct page *page)
>   - Remove any possible access to page.
> 
> ->end_page_unpin(struct page *page)
>   - Reinstantiate any possible access to page.
> 
> For KVM:
> 
> ->begin_page_unpin()
>   - Remove APIC-access page address from VMCS.
> or
>   - Remove spte translation to pinned page.
>   
>   - Put vcpu in state where no VM-entries are allowed.
> 
> ->end_page_unpin()
>   - Setup APIC-access page, ...
>   - Allow vcpu to VM-entry.
> 
I believe that to handle the identity page and the APIC access page we do not
need any of those. We can use mmu notifiers to track when a page begins
to be moved, and we can find the new page location on EPT violation.

> 
> Because allocating the APIC access page from a distant NUMA node can
> be a performance problem, I believe.
I do not think this is the case. The APIC access page is never written to,
and in fact the SDM advises sharing it between all vcpus.
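
A minimal sketch of the mmu-notifier approach described above (the
KVM_REQ_APIC_PAGE_RELOAD request bit and the VMCS-refresh step are
hypothetical; the notifier signature, kvm_unmap_hva_range(),
kvm_flush_remote_tlbs() and kvm_make_request() are existing kernel
interfaces):

static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
						    struct mm_struct *mm,
						    unsigned long start,
						    unsigned long end)
{
	struct kvm *kvm = container_of(mn, struct kvm, mmu_notifier);
	struct kvm_vcpu *vcpu;
	int i;

	spin_lock(&kvm->mmu_lock);
	/* Zap sptes covering [start, end); if the identity pagetable or
	   the APIC access page lies in the range, its translation goes
	   away with them and is refaulted on the next EPT violation. */
	if (kvm_unmap_hva_range(kvm, start, end))
		kvm_flush_remote_tlbs(kvm);
	spin_unlock(&kvm->mmu_lock);

	/* Hypothetical request: before the next vmentry each vcpu
	   refreshes the APIC-access-page address in its VMCS, e.g. via
	   vmcs_write64(APIC_ACCESS_ADDR, ...). */
	kvm_for_each_vcpu(i, vcpu, kvm)
		kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
}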

--
Gleb.


Re: [patch 4/5] KVM: MMU: reload request from GET_DIRTY_LOG path

2014-06-20 Thread Gleb Natapov
On Thu, Jun 19, 2014 at 03:40:31PM -0300, Marcelo Tosatti wrote:
> On Thu, Jun 19, 2014 at 11:17:19AM +0300, Gleb Natapov wrote:
> > On Wed, Jun 18, 2014 at 08:12:07PM -0300, mtosa...@redhat.com wrote:
> > > Reload remote vcpus MMU from GET_DIRTY_LOG codepath, before
> > > deleting a pinned spte.
> > > 
> > > Signed-off-by: Marcelo Tosatti 
> > > 
> > > ---
> > >  arch/x86/kvm/mmu.c |3 +++
> > >  1 file changed, 3 insertions(+)
> > > 
> > > Index: kvm.pinned-sptes/arch/x86/kvm/mmu.c
> > > ===
> > > --- kvm.pinned-sptes.orig/arch/x86/kvm/mmu.c  2014-06-13 16:50:50.040140594 -0300
> > > +++ kvm.pinned-sptes/arch/x86/kvm/mmu.c   2014-06-13 16:51:05.620104451 -0300
> > > @@ -1247,6 +1247,9 @@
> > >   spte &= ~SPTE_MMU_WRITEABLE;
> > >   spte = spte & ~PT_WRITABLE_MASK;
> > >  
> > > + if (is_pinned_spte(spte))
> > > + mmu_reload_pinned_vcpus(kvm);
> > > +
> > Why write protect it at all? mmu_reload_pinned_vcpus() will unprotect
> > it anyway on the next vmentry. Isn't it better to just report all
> > pinned pages as dirty always?
> 
> That was the initial plan; however, it's awkward to stop vcpus, execute
> get_dirty_log twice, and have pages marked as dirty on the second
> execution.
Indeed, but I think it may happen today with vhost (or even other devices
that emulate dma), so userspace should be able to deal with it already.

> 
> That is, it is "incorrect" to report pages as dirty when they are
> clean.
In the case of the PEBS area we cannot know. The CPU writes there directly,
without KVM even knowing about it, so the only sane thing is to assume that
after a vmentry the PEBS area is dirty. This is what happens with this patch,
BTW: mmu_reload_pinned_vcpus() will force ->page_fault(FERR_WRITE), which will
mark all pinned pages as dirty even if the pages are never written to. You
can achieve the same by having a vcpu->pinned_page_dirty flag which is set to
true on each vmentry and to false on each GET_DIRTY_LOG, as sketched below.
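
A sketch of that alternative (pinned_page_dirty and struct
pinned_page_range are hypothetical names; mark_page_dirty() is an
existing helper, and pinned_mmu_pages is from the patch under
discussion):

	/* On every vmentry: assume the CPU may write the PEBS area. */
	vcpu->arch.pinned_page_dirty = true;

	/* In the GET_DIRTY_LOG path: report all pinned pages as dirty
	   once per vmentry, then clear the flag. */
	kvm_for_each_vcpu(i, vcpu, kvm) {
		struct pinned_page_range *p;	/* hypothetical */
		gfn_t gfn;

		if (!vcpu->arch.pinned_page_dirty)
			continue;
		list_for_each_entry(p, &vcpu->arch.pinned_mmu_pages, link)
			for (gfn = p->base_gfn; gfn < p->base_gfn + p->npages; gfn++)
				mark_page_dirty(kvm, gfn);
		vcpu->arch.pinned_page_dirty = false;
	}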

> 
> Moreover, if the number of pinned pages is larger than the dirty
> threshold to stop VM and migrate, you'll never migrate. If vcpus are
> in HLT and don't VM-enter immediately, the pages should not be refaulted
> right away.
We should not allow that many pinned pages, for security reasons. And
having a lot of pages to fault in on the next vmentry after each
GET_DIRTY_LOG will slow down a guest considerably during migration.

> 
> Do you think the optimization is worthwhile?
> 
I think it's simpler, but even if we go with your current approach it
should be improved: there is no point sending an IPI to all vcpus in
spte_write_protect() like you do here, since the IPI will be sent anyway at
the end of write protection because of the ptes' writable->nonwritable
transition; all you have to do here is set KVM_REQ_MMU_RELOAD, no need for
an IPI.

In fact this makes me think that write protecting a pinned page here is
incorrect: the old translation may not be in the TLB, and if the CPU tries
to write a PEBS entry after the pte is write protected but before the MMU
is reloaded, it will encounter a non-writable pte and god knows what will
happen; the SDM does not say. So spte_write_protect() should be something
like this, IMO:

static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
{
	u64 spte = *sptep;
	struct kvm_vcpu *vcpu;
	int i;

	if (!is_writable_pte(spte) &&
	    !(pt_protect && spte_is_locklessly_modifiable(spte)))
		return false;

	/* Do not write protect a pinned spte: the CPU may write a PEBS
	   entry through it at any time. Instead make every vcpu reload
	   its MMU, which refaults the pinned ranges on the next vmentry. */
	if (!pt_protect && is_pinned_spte(spte)) {
		kvm_for_each_vcpu(i, vcpu, kvm)
			kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
		return true;
	}

	rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);

	if (pt_protect)
		spte &= ~SPTE_MMU_WRITEABLE;

	spte = spte & ~PT_WRITABLE_MASK;

	return mmu_spte_update(sptep, spte);
}



--
Gleb.


Re: [patch 3/5] KVM: MMU: notifiers support for pinned sptes

2014-06-20 Thread Gleb Natapov
On Thu, Jun 19, 2014 at 03:28:25PM -0300, Marcelo Tosatti wrote:
> On Thu, Jun 19, 2014 at 09:48:50AM +0300, Gleb Natapov wrote:
> > On Wed, Jun 18, 2014 at 08:12:06PM -0300, mtosa...@redhat.com wrote:
> > > Request KVM_REQ_MMU_RELOAD when deleting sptes from MMU notifiers.
> > > 
> > > Keep pinned sptes intact if page aging.
> > > 
> > > Signed-off-by: Marcelo Tosatti 
> > > 
> > > ---
> > >  arch/x86/kvm/mmu.c |   71 ++---
> > >  1 file changed, 62 insertions(+), 9 deletions(-)
> > > 
> > > Index: kvm.pinned-sptes/arch/x86/kvm/mmu.c
> > > ===
> > > --- kvm.pinned-sptes.orig/arch/x86/kvm/mmu.c  2014-06-18 17:28:24.339435654 -0300
> > > +++ kvm.pinned-sptes/arch/x86/kvm/mmu.c   2014-06-18 17:29:32.510225755 -0300
> > > @@ -1184,6 +1184,42 @@
> > >   kvm_flush_remote_tlbs(vcpu->kvm);
> > >  }
> > >  
> > > +static void ack_flush(void *_completed)
> > > +{
> > > +}
> > > +
> > > +static void mmu_reload_pinned_vcpus(struct kvm *kvm)
> > > +{
> > > + int i, cpu, me;
> > > + cpumask_var_t cpus;
> > > + struct kvm_vcpu *vcpu;
> > > + unsigned int req = KVM_REQ_MMU_RELOAD;
> > > +
> > > + zalloc_cpumask_var(&cpus, GFP_ATOMIC);
> > > +
> > > + me = get_cpu();
> > > + kvm_for_each_vcpu(i, vcpu, kvm) {
> > > + if (list_empty(&vcpu->arch.pinned_mmu_pages))
> > > + continue;
> > > + kvm_make_request(req, vcpu);
> > > + cpu = vcpu->cpu;
> > > +
> > > + /* Set ->requests bit before we read ->mode */
> > > + smp_mb();
> > > +
> > > + if (cpus != NULL && cpu != -1 && cpu != me &&
> > > +   kvm_vcpu_exiting_guest_mode(vcpu) != OUTSIDE_GUEST_MODE)
> > > + cpumask_set_cpu(cpu, cpus);
> > > + }
> > > + if (unlikely(cpus == NULL))
> > > + smp_call_function_many(cpu_online_mask, ack_flush, NULL, 1);
> > > + else if (!cpumask_empty(cpus))
> > > + smp_call_function_many(cpus, ack_flush, NULL, 1);
> > > + put_cpu();
> > > + free_cpumask_var(cpus);
> > > + return;
> > > +}
> > This is a c&p of make_all_cpus_request(); the only difference is the
> > check of vcpu->arch.pinned_mmu_pages. You can add
> > make_some_cpus_request(..., bool (*predicate)(struct kvm_vcpu *)) to
> > kvm_main.c and rewrite make_all_cpus_request() to use it instead.
> 
> Half-way through it I decided it was better to c&p.
> 
> Can change make_all_cpus_request() though if it makes more sense to you.
> 
If I haven't missed anything and checking of pinned_mmu_pages is indeed the
only difference, then yes, reusing make_all_cpus_request() makes more sense.
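
Concretely, such a refactor might look like this (a sketch condensed from
the quoted code; only the predicate parameter is new):

static bool make_some_cpus_request(struct kvm *kvm, unsigned int req,
				   bool (*predicate)(struct kvm_vcpu *vcpu))
{
	int i, cpu, me;
	cpumask_var_t cpus;
	struct kvm_vcpu *vcpu;
	bool called = true;

	zalloc_cpumask_var(&cpus, GFP_ATOMIC);

	me = get_cpu();
	kvm_for_each_vcpu(i, vcpu, kvm) {
		/* The only difference from make_all_cpus_request():
		   skip vcpus the caller is not interested in. */
		if (predicate && !predicate(vcpu))
			continue;
		kvm_make_request(req, vcpu);
		cpu = vcpu->cpu;

		/* Set ->requests bit before we read ->mode */
		smp_mb();

		if (cpus != NULL && cpu != -1 && cpu != me &&
		    kvm_vcpu_exiting_guest_mode(vcpu) != OUTSIDE_GUEST_MODE)
			cpumask_set_cpu(cpu, cpus);
	}
	if (unlikely(cpus == NULL))
		smp_call_function_many(cpu_online_mask, ack_flush, NULL, 1);
	else if (!cpumask_empty(cpus))
		smp_call_function_many(cpus, ack_flush, NULL, 1);
	else
		called = false;
	put_cpu();
	free_cpumask_var(cpus);
	return called;
}

static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
{
	return make_some_cpus_request(kvm, req, NULL);
}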

--
Gleb.


Re: [patch 2/5] KVM: MMU: allow pinning spte translations (TDP-only)

2014-06-20 Thread Gleb Natapov
On Thu, Jun 19, 2014 at 04:22:57PM -0300, Marcelo Tosatti wrote:
> On Thu, Jun 19, 2014 at 10:21:16AM +0300, Gleb Natapov wrote:
> > On Wed, Jun 18, 2014 at 08:12:05PM -0300, mtosa...@redhat.com wrote:
> > > Allow vcpus to pin spte translations by:
> > > 
> > > 1) Creating a per-vcpu list of pinned ranges.
> > What if the memory slot containing a pinned range is going away?
> 
> ->page_fault() should fail and guest abort. Will double check.
> 
> > > 2) On mmu reload request:
> > >   - Fault ranges.
> > >   - Mark sptes with a pinned bit.
> > Should also be marked "dirty" as per SDM:
> >  The three DS save area sections should be allocated from a non-paged pool, 
> > and marked accessed and dirty
> 
> This (SDM text) is about guest pagetable AFAICS.
> 
It's hard to say. The SDM does not mention virtualization or two-dimensional
paging in that section at all. My reading is that this section talks about
all the translations the CPU must perform to get to the physical address;
otherwise, why are we trying so hard to make sure that EPT translations are
always present? Because the same paragraph says in the next sentence:

 It is the responsibility of the operating system to keep the pages that
 contain the buffer present and to mark them accessed and dirty

So if we take from it that the translation should be present, the same goes
for accessed and dirty. If Andi can clarify this within Intel it would be great.

--
Gleb.


Re: [PATCH 3/3] KVM: x86: correct mwait and monitor emulation

2014-06-19 Thread Gleb Natapov
On Thu, Jun 19, 2014 at 03:10:21PM +0300, Nadav Amit wrote:
> On 6/19/14, 3:07 PM, Gleb Natapov wrote:
> >On Thu, Jun 19, 2014 at 02:52:20PM +0300, Nadav Amit wrote:
> >>On 6/19/14, 2:23 PM, Gleb Natapov wrote:
> >>>On Thu, Jun 19, 2014 at 01:53:36PM +0300, Nadav Amit wrote:
> >>>>
> >>>>On Jun 19, 2014, at 1:18 PM, Michael S. Tsirkin  wrote:
> >>>>
> >>>>>On Wed, Jun 18, 2014 at 02:46:01PM -0400, Gabriel L. Somlo wrote:
> >>>>>>On Wed, Jun 18, 2014 at 10:59:14AM -0700, Eric Northup wrote:
> >>>>>>>On Wed, Jun 18, 2014 at 7:19 AM, Nadav Amit  
> >>>>>>>wrote:
> >>>>>>>>mwait and monitor are currently handled as nop. Considering this 
> >>>>>>>>behavior, they
> >>>>>>>>should still be handled correctly, i.e., check execution conditions 
> >>>>>>>>and generate
> >>>>>>>>exceptions when required. mwait and monitor may also be executed in 
> >>>>>>>>real-mode
> >>>>>>>>and are not handled in that case.  This patch performs the emulation 
> >>>>>>>>of
> >>>>>>>>monitor-mwait according to Intel SDM (other than checking whether 
> >>>>>>>>interrupt can
> >>>>>>>>be used as a break event).
> >>>>>>>>
> >>>>>>>>Signed-off-by: Nadav Amit 
> >>>>>>
> >>>>>>How about this instead (details in the commit log below)? Please let
> >>>>>>me know what you think, and if you'd prefer me to send it out as a
> >>>>>>separate patch rather than a reply to this thread.
> >>>>>>
> >>>>>>Thanks,
> >>>>>>--Gabriel
> >>>>>
> >>>>>If there's an easy workaround, I'm inclined to agree.
> >>>>>We can always go back to Gabriel's patch (and then we'll need
> >>>>>Nadav's one too) but if we release a kernel with this
> >>>>>support it becomes an ABI and we can't go back.
> >>>>>
> >>>>>So let's be careful here, and revert the hack for 3.16.
> >>>>>
> >>>>>
> >>>>>Acked-by: Michael S. Tsirkin 
> >>>>>
> >>>>Personally, I have a custom guest which requires mwait to execute
> >>>>correctly.
> >>>Can you elaborate on this guest a little bit? With a nop implementation
> >>>of mwait the guest will hog a host cpu. Do you consider this to be
> >>>"executing correctly"?
> >>>
> >>>--
> >>
> >>mwait is not as "clean" as it may appear. It encounters false wake-ups
> >>for a variety of reasons, and any code needs to recheck the wake-up
> >>condition afterwards. Actually, some CPUs had bugs that caused excessive
> >>wake-ups that degraded performance considerably (Nehalem, if I am not
> >>mistaken). Therefore, handling mwait as nop is logically correct
> >>(although it may degrade performance).
> >>
> >>For the reference, if you look at the SDM 8.10.4, you'll see:
> >>"Multiple events other than a write to the triggering address range can
> >>cause a processor that executed MWAIT to wake up. These include events that
> >>would lead to voluntary or involuntary context switches, such as..."
> >>
> >>Note the word "include" in the sentence "These include events". Software
> >>has no way of controlling whether it gets false wake-ups and cannot rely
> >>on the wake-up as an indication of anything.
> >>
> >That's all well and good, and I didn't say that nop is not a valid
> >mwait implementation (it is), though there is a big difference between
> >"encounters false wake-ups" and never sleeping. What I asked is: do you
> >consider your guest hogging a host cpu to be "executing correctly"? What
> >is this guest doing that such behaviour is tolerated, and wouldn't it be
> >better to just poll for the condition you are waiting for instead of
> >executing expensive vmexits? That will also hog 100% of a host cpu, but
> >will actually be faster.
> >
> You are correct, but unfortunately I have no control over the guest
> workload. In this specific workload I do not care about performance but only
> about correctness.
> 
Fair enough. But can you at least hint at what this mysterious guest is?

--
Gleb.


Re: [PATCH 3/3] KVM: x86: correct mwait and monitor emulation

2014-06-19 Thread Gleb Natapov
On Thu, Jun 19, 2014 at 02:52:20PM +0300, Nadav Amit wrote:
> On 6/19/14, 2:23 PM, Gleb Natapov wrote:
> >On Thu, Jun 19, 2014 at 01:53:36PM +0300, Nadav Amit wrote:
> >>
> >>On Jun 19, 2014, at 1:18 PM, Michael S. Tsirkin  wrote:
> >>
> >>>On Wed, Jun 18, 2014 at 02:46:01PM -0400, Gabriel L. Somlo wrote:
> >>>>On Wed, Jun 18, 2014 at 10:59:14AM -0700, Eric Northup wrote:
> >>>>>On Wed, Jun 18, 2014 at 7:19 AM, Nadav Amit  
> >>>>>wrote:
> >>>>>>mwait and monitor are currently handled as nop. Considering this 
> >>>>>>behavior, they
> >>>>>>should still be handled correctly, i.e., check execution conditions and 
> >>>>>>generate
> >>>>>>exceptions when required. mwait and monitor may also be executed in 
> >>>>>>real-mode
> >>>>>>and are not handled in that case.  This patch performs the emulation of
> >>>>>>monitor-mwait according to Intel SDM (other than checking whether 
> >>>>>>interrupt can
> >>>>>>be used as a break event).
> >>>>>>
> >>>>>>Signed-off-by: Nadav Amit 
> >>>>
> >>>>How about this instead (details in the commit log below)? Please let
> >>>>me know what you think, and if you'd prefer me to send it out as a
> >>>>separate patch rather than a reply to this thread.
> >>>>
> >>>>Thanks,
> >>>>--Gabriel
> >>>
> >>>If there's an easy workaround, I'm inclined to agree.
> >>>We can always go back to Gabriel's patch (and then we'll need
> >>>Nadav's one too) but if we release a kernel with this
> >>>support it becomes an ABI and we can't go back.
> >>>
> >>>So let's be careful here, and revert the hack for 3.16.
> >>>
> >>>
> >>>Acked-by: Michael S. Tsirkin 
> >>>
> >>Personally, I have a custom guest which requires mwait to execute
> >>correctly.
> >Can you elaborate on this guest a little bit? With a nop implementation
> >of mwait the guest will hog a host cpu. Do you consider this to be
> >"executing correctly"?
> >
> >--
> 
> mwait is not as "clean" as it may appear. It encounters false wake-ups
> for a variety of reasons, and any code needs to recheck the wake-up
> condition afterwards. Actually, some CPUs had bugs that caused excessive
> wake-ups that degraded performance considerably (Nehalem, if I am not
> mistaken). Therefore, handling mwait as nop is logically correct
> (although it may degrade performance).
> 
> For the reference, if you look at the SDM 8.10.4, you'll see:
> "Multiple events other than a write to the triggering address range can
> cause a processor that executed MWAIT to wake up. These include events that
> would lead to voluntary or involuntary context switches, such as..."
> 
> Note the word "include" in the sentence "These include events". Software
> has no way of controlling whether it gets false wake-ups and cannot rely
> on the wake-up as an indication of anything.
> 
That's all well and good, and I didn't say that nop is not a valid
mwait implementation (it is), though there is a big difference between
"encounters false wake-ups" and never sleeping. What I asked is: do you
consider your guest hogging a host cpu to be "executing correctly"? What
is this guest doing that such behaviour is tolerated, and wouldn't it be
better to just poll for the condition you are waiting for instead of
executing expensive vmexits? That will also hog 100% of a host cpu, but
will actually be faster.
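
For reference, the recheck idiom described above looks roughly like this
(a sketch using the SSE3 intrinsics from pmmintrin.h; MONITOR/MWAIT
execute at CPL 0, so the pattern lives in kernel idle loops, and the flag
here is illustrative):

#include <pmmintrin.h>

static volatile int wakeup_flag;

static void wait_for_wakeup(void)
{
	while (!wakeup_flag) {
		/* Arm the monitor on the line holding the flag. */
		_mm_monitor((const void *)&wakeup_flag, 0, 0);
		/* Recheck to close the race between the test and MONITOR,
		   then wait; MWAIT may return spuriously. */
		if (!wakeup_flag)
			_mm_mwait(0, 0);
	}
}

Under a nop emulation _mm_mwait() returns immediately and the loop
degenerates into a spin: still correct, but it never sleeps, which is
exactly why such a guest hogs a host cpu.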

--
Gleb.

