Re: kernel BUG at arch/x86/kvm/mmu.c:LINE! (2)

2019-11-08 Thread Paolo Bonzini
#syz dup:  KASAN: slab-out-of-bounds Read in handle_vmptrld

On 08/11/19 20:14, syzbot wrote:
> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:    847120f8 Merge branch 'for-linus' of
> git://git.kernel.org/..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=12d60164e0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=8c5e2eca3f31f9bf
> dashboard link:
> https://syzkaller.appspot.com/bug?extid=824609cfabee9c6e153c
> compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=16e2a12ce0
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=13314994e0
> 
> The bug was bisected to:
> 
> commit 1ffe8bdc09f8bfcaad76d71ae68b623c7e03f20f
> Author: Spencer E. Olson 
> Date:   Mon Oct 10 14:14:19 2016 +
> 
>     staging: comedi: ni_mio_common: split out ao arming from ni_ao_inttrig
> 
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=122977fae0
> console output: https://syzkaller.appspot.com/x/log.txt?x=162977fae0
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+824609cfabee9c6e1...@syzkaller.appspotmail.com
> Fixes: 1ffe8bdc09f8 ("staging: comedi: ni_mio_common: split out ao
> arming from ni_ao_inttrig")
> 
> L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646
> and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html
> for details.
> [ cut here ]
> kernel BUG at arch/x86/kvm/mmu.c:3324!
> invalid opcode:  [#1] PREEMPT SMP KASAN
> CPU: 0 PID: 8688 Comm: syz-executor906 Not tainted 5.4.0-rc6+ #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> RIP: 0010:transparent_hugepage_adjust+0x490/0x530 arch/x86/kvm/mmu.c:3324
> Code: 63 00 48 8b 45 b8 48 83 e8 01 e9 19 fd ff ff e8 36 3c 63 00 48 8b
> 45 b8 48 83 e8 01 48 89 45 c8 e9 a1 fd ff ff e8 20 3c 63 00 <0f> 0b 48
> 89 df e8 66 9e 9e 00 e9 9f fb ff ff 4c 89 ff e8 59 9e 9e
> RSP: 0018:88809753f690 EFLAGS: 00010293
> RAX: 88809549e6c0 RBX: 88809753f778 RCX: 810fe787
> RDX:  RSI: 810fe8c0 RDI: 0007
> RBP: 88809753f6d8 R08: 88809549e6c0 R09: ed10131ed682
> R10: ed10131ed681 R11: 888098f6b40b R12: 88809753f768
> R13: 0083 R14: 0008fe81 R15: 
> FS:  0158e880() GS:8880ae80()
> knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2:  CR3: 9f2a4000 CR4: 001426f0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  tdp_page_fault+0x56e/0x650 arch/x86/kvm/mmu.c:4216
>  kvm_mmu_page_fault+0x1dd/0x1800 arch/x86/kvm/mmu.c:5439
>  handle_ept_violation+0x259/0x560 arch/x86/kvm/vmx/vmx.c:5185
>  vmx_handle_exit+0x29f/0x1730 arch/x86/kvm/vmx/vmx.c:5929
>  vcpu_enter_guest arch/x86/kvm/x86.c:8227 [inline]
>  vcpu_run arch/x86/kvm/x86.c:8291 [inline]
>  kvm_arch_vcpu_ioctl_run+0x1cb8/0x70d0 arch/x86/kvm/x86.c:8498
>  kvm_vcpu_ioctl+0x4dc/0xfc0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2772
>  vfs_ioctl fs/ioctl.c:46 [inline]
>  file_ioctl fs/ioctl.c:509 [inline]
>  do_vfs_ioctl+0xdb6/0x13e0 fs/ioctl.c:696
>  ksys_ioctl+0xab/0xd0 fs/ioctl.c:713
>  __do_sys_ioctl fs/ioctl.c:720 [inline]
>  __se_sys_ioctl fs/ioctl.c:718 [inline]
>  __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
>  do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x443f49
> Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89
> f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01
> f0 ff ff 0f 83 7b d8 fb ff c3 66 2e 0f 1f 84 00 00 00 00
> RSP: 002b:7ffd991d67d8 EFLAGS: 0246 ORIG_RAX: 0010
> RAX: ffda RBX: 004002e0 RCX: 00443f49
> RDX:  RSI: ae80 RDI: 0006
> RBP: 006ce018 R08: 004002e0 R09: 004002e0
> R10: 004002e0 R11: 0246 R12: 00401c50
> R13: 00401ce0 R14:  R15: 
> Modules linked in:
> ---[ end trace 911095bae56804bc ]---
> RIP: 0010:transparent_hugepage_adjust+0x490/0x530 arch/x86/kvm/mmu.c:3324
> Code: 63 00 48 8b 45 b8 48 83 e8 01 e9 19 fd ff ff e8 36 3c 63 00 48 8b
> 45 b8 48 83 e8 01 48 89 45 c8 e9 a1 fd ff ff e8 20 3c 63 00 <0f> 0b 48
> 89 df e8 66 9e 9e 00 e9 9f fb ff ff 4c 89 ff e8 59 9e 9e
> RSP: 0018:88809753f690 EFLAGS: 00010293
> RAX: 88809549e6c0 RBX: 88809753f778 RCX: 810fe787
> RDX:  RSI: 810fe8c0 RDI: 0007
> RBP: 88809753f6d8 R08: 88809549e6c0 R09: ed10131ed682
> R10: ed10131ed681 R11: 888098f6b40b R12: 88809753f768
> R13: 0083 R14: 

Re: [PATCH] x86/Hyper-V: Fix definition HV_MAX_FLUSH_REP_COUNT

2019-03-15 Thread Paolo Bonzini
On 22/02/19 11:48, lantianyu1...@gmail.com wrote:
> From: Lan Tianyu 
> 
> The max flush rep count of HvFlushGuestPhysicalAddressList hypercall
> is equal with how many entries of union hv_gpa_page_range can be populated
> into the input parameter page. The origin code lacks parenthesis around
> PAGE_SIZE - 2 * sizeof(u64). This patch is to fix it.
> 
> Cc: 
> Fixs: cc4edae4b9(x86/hyper-v: Add HvFlushGuestAddressList hypercall support)
> Signed-off-by: Lan Tianyu 
> ---
>  arch/x86/include/asm/hyperv-tlfs.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/hyperv-tlfs.h 
> b/arch/x86/include/asm/hyperv-tlfs.h
> index 705dafc2d11a..2bdbbbcfa393 100644
> --- a/arch/x86/include/asm/hyperv-tlfs.h
> +++ b/arch/x86/include/asm/hyperv-tlfs.h
> @@ -841,7 +841,7 @@ union hv_gpa_page_range {
>   * count is equal with how many entries of union hv_gpa_page_range can
>   * be populated into the input parameter page.
>   */
> -#define HV_MAX_FLUSH_REP_COUNT (PAGE_SIZE - 2 * sizeof(u64) /\
> +#define HV_MAX_FLUSH_REP_COUNT ((PAGE_SIZE - 2 * sizeof(u64)) /  \
>   sizeof(union hv_gpa_page_range))
>  
>  struct hv_guest_mapping_flush_list {
> 

Queued for after he merge window, with the fixed "Fixes" tag.

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH V3 00/10] X86/KVM/Hyper-V: Add HV ept tlb range list flush support in KVM

2019-02-22 Thread Paolo Bonzini
On 22/02/19 16:06, lantianyu1...@gmail.com wrote:
> From: Lan Tianyu 
> 
> This patchset is to introduce hv ept tlb range list flush function
> support in the KVM MMU component. Flushing ept tlbs of several address
> range can be done via single hypercall and new list flush function is
> used in the kvm_mmu_commit_zap_page() and FNAME(sync_page). This patchset
> also adds more hv ept tlb range flush support in more KVM MMU function.
> 
> This patchset is based on the fix patch "x86/Hyper-V: Fix definition 
> HV_MAX_FLUSH_REP_COUNT".
> (https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1939455.html)

Note that this won't make it in 5.1 unless Linus releases an -rc8.
Otherwise, I'll get to it next week.

Thanks,

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH V2 00/10] X86/KVM/Hyper-V: Add HV ept tlb range list flush support in KVM

2019-02-14 Thread Paolo Bonzini
On 02/02/19 02:38, lantianyu1...@gmail.com wrote:
> From: Lan Tianyu 
> 
> This patchset is to introduce hv ept tlb range list flush function
> support in the KVM MMU component. Flushing ept tlbs of several address
> range can be done via single hypercall and new list flush function is
> used in the kvm_mmu_commit_zap_page() and FNAME(sync_page). This patchset
> also adds more hv ept tlb range flush support in more KVM MMU function.
> 
> Change since v1:
>1) Make flush list as a hlist instead of list in order to 
>keep struct kvm_mmu_page size.
>2) Add last_level flag in the struct kvm_mmu_page instead
>of spte pointer
>3) Move tlb flush from kvm_mmu_notifier_clear_flush_young() to 
> kvm_age_hva()
>4) Use range flush in the kvm_vm_ioctl_get/clear_dirty_log()

Looks good except for the confusion on sp->last_level (which was mostly
mine---sorry about that).  I think it can still make 5.1.

However, note that I've never received "KVM/MMU: Use tlb range flush  in
the kvm_age_hva()".

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [Resend PATCH V5 2/10] x86/hyper-v: Add HvFlushGuestAddressList hypercall support

2018-12-21 Thread Paolo Bonzini
On 06/12/18 14:21, lantianyu1...@gmail.com wrote:
>  static inline int hyperv_flush_guest_mapping(u64 as) { return -1; }
> +static inline int hyperv_flush_guest_mapping_range(u64 as,
> + hyperv_fill_flush_list_func fill_func, void *data);
> +{
> + return -1;

This part for !IS_ENABLED(CONFIG_HYPERV) does not compile.

No big deal, but please add that to your testing procedures.

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [Resend PATCH V5 0/10] x86/KVM/Hyper-v: Add HV ept tlb range flush hypercall support in KVM

2018-12-21 Thread Paolo Bonzini
On 06/12/18 14:21, lantianyu1...@gmail.com wrote:
> From: Lan Tianyu 
> 
> For nested memory virtualization, Hyper-v doesn't set write-protect
> L1 hypervisor EPT page directory and page table node to track changes 
> while it relies on guest to tell it changes via HvFlushGuestAddressLlist
> hypercall. HvFlushGuestAddressLlist hypercall provides a way to flush
> EPT page table with ranges which are specified by L1 hypervisor.
> 
> If L1 hypervisor uses INVEPT or HvFlushGuestAddressSpace hypercall to
> flush EPT tlb, Hyper-V will invalidate associated EPT shadow page table
> and sync L1's EPT table when next EPT page fault is triggered.
> HvFlushGuestAddressLlist hypercall helps to avoid such redundant EPT
> page fault and synchronization of shadow page table.
> 
> This patchset is based on the Patch "KVM/VMX: Check ept_pointer before
> flushing ept tlb"(https://marc.info/?l=kvm=154408169705686=2).
> 
> Change since v4:
>1) Split flush address and flush list patches. This patchset only 
> contains
>flush address patches. Will post flush list patches later.
>2) Expose function hyperv_fill_flush_guest_mapping_list()
>out of hyperv file
>3) Adjust parameter of hyperv_flush_guest_mapping_range()
>4) Reorder patchset and move Hyper-V and VMX changes ahead.
> 
> Change since v3:
> 1) Remove code of updating "tlbs_dirty" in 
> kvm_flush_remote_tlbs_with_range()
> 2) Remove directly tlb flush in the kvm_handle_hva_range()
> 3) Move tlb flush in kvm_set_pte_rmapp() to 
> kvm_mmu_notifier_change_pte()
> 4) Combine Vitaly's "don't pass EPT configuration info to
> vmx_hv_remote_flush_tlb()" fix
> 
> Change since v2:
>1) Fix comment in the kvm_flush_remote_tlbs_with_range()
>2) Move HV_MAX_FLUSH_PAGES and HV_MAX_FLUSH_REP_COUNT to
> hyperv-tlfs.h.
>3) Calculate HV_MAX_FLUSH_REP_COUNT in the macro definition
>4) Use HV_MAX_FLUSH_REP_COUNT to define length of gpa_list in
> struct hv_guest_mapping_flush_list.
> 
> Change since v1:
>1) Convert "end_gfn" of struct kvm_tlb_range to "pages" in order
>   to avoid confusion as to whether "end_gfn" is inclusive or exlusive.
>2) Add hyperv tlb range struct and replace kvm tlb range struct
>   with new struct in order to avoid using kvm struct in the hyperv
>   code directly.
> 
> 
> 
> Lan Tianyu (10):
>   KVM: Add tlb_remote_flush_with_range callback in kvm_x86_ops
>   x86/hyper-v: Add HvFlushGuestAddressList hypercall support
>   x86/Hyper-v: Add trace in the
> hyperv_nested_flush_guest_mapping_range()
>   KVM/VMX: Add hv tlb range flush support
>   KVM/MMU: Add tlb flush with range helper function
>   KVM: Replace old tlb flush function with new one to flush a specified
> range.
>   KVM: Make kvm_set_spte_hva() return int
>   KVM/MMU: Move tlb flush in kvm_set_pte_rmapp() to
> kvm_mmu_notifier_change_pte()
>   KVM/MMU: Flush tlb directly in the kvm_set_pte_rmapp()
>   KVM/MMU: Flush tlb directly in the kvm_zap_gfn_range()
> 
>  arch/arm/include/asm/kvm_host.h |  2 +-
>  arch/arm64/include/asm/kvm_host.h   |  2 +-
>  arch/mips/include/asm/kvm_host.h|  2 +-
>  arch/mips/kvm/mmu.c |  3 +-
>  arch/powerpc/include/asm/kvm_host.h |  2 +-
>  arch/powerpc/kvm/book3s.c   |  3 +-
>  arch/powerpc/kvm/e500_mmu_host.c|  3 +-
>  arch/x86/hyperv/nested.c| 80 +++
>  arch/x86/include/asm/hyperv-tlfs.h  | 32 +
>  arch/x86/include/asm/kvm_host.h |  9 +++-
>  arch/x86/include/asm/mshyperv.h | 15 ++
>  arch/x86/include/asm/trace/hyperv.h | 14 ++
>  arch/x86/kvm/mmu.c  | 96 
> +
>  arch/x86/kvm/paging_tmpl.h  |  3 +-
>  arch/x86/kvm/vmx.c  | 63 +---
>  virt/kvm/arm/mmu.c  |  6 ++-
>  virt/kvm/kvm_main.c |  5 +-
>  17 files changed, 292 insertions(+), 48 deletions(-)
> 

Queued, thanks.

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH V4 00/15] x86/KVM/Hyper-v: Add HV ept tlb range flush hypercall support in KVM

2018-10-15 Thread Paolo Bonzini
On 13/10/2018 16:53, lantianyu1...@gmail.com wrote:
> From: Lan Tianyu 
> 
> For nested memory virtualization, Hyper-v doesn't set write-protect
> L1 hypervisor EPT page directory and page table node to track changes 
> while it relies on guest to tell it changes via HvFlushGuestAddressLlist
> hypercall. HvFlushGuestAddressLlist hypercall provides a way to flush
> EPT page table with ranges which are specified by L1 hypervisor.
> 
> If L1 hypervisor uses INVEPT or HvFlushGuestAddressSpace hypercall to
> flush EPT tlb, Hyper-V will invalidate associated EPT shadow page table
> and sync L1's EPT table when next EPT page fault is triggered.
> HvFlushGuestAddressLlist hypercall helps to avoid such redundant EPT
> page fault and synchronization of shadow page table.

So I just told you that the first part is well understood but I must
retract that; after carefully reviewing the whole series, I think one of
us is actually very confused.

I am not afraid to say it can be me, but my understanding is that you're
passing L1 GPAs to the hypercall and instead the spec says it expects L2
GPAs.  (Consider that, because KVM's shadow paging code is shared
between nested EPT and !EPT cases, every time you see gpa/gfn in the
code it is for L1, while nested EPT GPAs are really what the code calls
gva.)

What's going on?

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH V4 2/15] KVM/MMU: Add tlb flush with range helper function

2018-10-15 Thread Paolo Bonzini
On 14/10/2018 10:16, Thomas Gleixner wrote:
>>> +static inline bool kvm_available_flush_tlb_with_range(void)
>>> +{
>>> +   return kvm_x86_ops->tlb_remote_flush_with_range;
>>> +}
>> Seems that kvm_available_flush_tlb_with_range() is not used in this patch…
> What's wrong with that? 
> 
> It provides the implementation and later patches make use of it. It's a
> sensible way to split patches into small, self contained entities.

That's true, on the other hand I have indeed a concerns with this patch:
this series is not bisectable at all, because all the new code is dead
until the very last patch.  Uses of the new feature should come _after_
the implementation.

I don't have any big problem with what Liran pointed out (and I can live
with the unused static functions that would warn with -Wunused, too),
but the above should be fixed in v5, basically by moving patches 12-15
at the beginning of the series.

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH V4 6/15] KVM/MMU: Flush tlb directly in the kvm_set_pte_rmapp()

2018-10-15 Thread Paolo Bonzini
On 13/10/2018 16:53, lantianyu1...@gmail.com wrote:
> @@ -1781,6 +1781,11 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, struct 
> kvm_rmap_head *rmap_head,
>   }
>   }
>  
> + if (need_flush && kvm_available_flush_tlb_with_range()) {
> + kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
> + return 0;
> + }
> +

Here you're passing an L1 GPA, not an L2 GPA.  Is it correct?

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH V4 11/15] KVM/MMU: Replace tlb flush function with range list flush function

2018-10-15 Thread Paolo Bonzini
On 13/10/2018 16:54, lantianyu1...@gmail.com wrote:
>   while (mmu_unsync_walk(parent, )) {
>   bool protected = false;
> + LIST_HEAD(flush_list);
>  
> - for_each_sp(pages, sp, parents, i)
> + for_each_sp(pages, sp, parents, i) {
>   protected |= rmap_write_protect(vcpu, sp->gfn);
> + kvm_mmu_queue_flush_request(sp, _list);
> + }

Here you already know that the page has to be flushed, because you are
dealing with shadow page tables and those always use 4K pages.  So the
check on is_last_page is unnecessary.

> 
>pte_access, PT_PAGE_TABLE_LEVEL,
>gfn, spte_to_pfn(sp->spt[i]),
>true, false, host_writable);
> + if (set_spte_ret && kvm_available_flush_tlb_with_range())
> + kvm_mmu_queue_flush_request(sp, _list);
>   }

This is wrong, I think.  sp is always the same throughout the loop, so
you are adding it multiple times to flush_list.

Instead, you need to add a separate range for each virtual address (in
this case L2 GPA) that is synced; but for each PTE that you call
set_spte here for, you could be syncing multiple L2 GPAs if a single
page is reused multiple times by the guest's EPT page tables.

And actually I may be missing something, but doesn't this apply to all
call sites?  For mmu_sync_children you can do the flush in
__rmap_write_protect and return false, similar to the first part of the
series, but not for kvm_mmu_commit_zap_page and sync_page.

Can you simplify this series to only have hv_remote_flush_tlb_with_range
and remove all the flush_list stuff?  That first part is safe and well
understood, because it uses the rmap and so it's clear that you have L2
GPAs at hand.  Most of the remarks I made on the Hyper-V API will still
apply.

Paolo

>   if (set_spte_ret & SET_SPTE_NEED_REMOTE_TLB_FLUSH)
> - kvm_flush_remote_tlbs(vcpu->kvm);
> + kvm_flush_remote_tlbs_with_list(vcpu->kvm, _list);
>  
>   return nr_present;

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH V4 12/15] x86/hyper-v: Add HvFlushGuestAddressList hypercall support

2018-10-15 Thread Paolo Bonzini
On 13/10/2018 16:54, lantianyu1...@gmail.com wrote:
> 
> +static int fill_flush_list(union hv_gpa_page_range gpa_list[],
> + int offset, u64 start_gfn, u64 pages)

Pass the entire struct hv_guest_mapping_flush_list to this function,
it's simpler and it hides the gpa_list argument from
range->parse_flush_list_func.

> + if (!range->flush_list)
> + gpa_n = fill_flush_list(flush->gpa_list, gpa_n,
> + range->start_gfn, range->pages);
> + else if (range->parse_flush_list_func)
> + gpa_n = range->parse_flush_list_func(flush->gpa_list, gpa_n,
> + range->flush_list, fill_flush_list);
> + else

You are making the code more complicated in order to avoid making
fill_flush_list public.  Just make it public and always go through the
parse_flush_list_func case.  In fact:

- make parse_flush_list_func an argument of hyperv_flush_guest_mapping_range

- drop struct hyperv_tlb_range completely, instead just pass a void* to
hyperv_flush_guest_mapping_range and pass it back to the callback.  The
void * can point to the start_gfn/pages pair, it can be the flush_list,
or anything else.

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH V4 9/15] KVM: Add flush_link and parent_pte in the struct kvm_mmu_page

2018-10-15 Thread Paolo Bonzini
On 13/10/2018 16:54, lantianyu1...@gmail.com wrote:
> From: Lan Tianyu 
> 
> PV EPT tlb flush function will accept a list of flush ranges and
> use struct kvm_mmu_page as the list entry.
> 
> Signed-off-by: Lan Tianyu 
> ---
>  arch/x86/include/asm/kvm_host.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 19985c602ed6..8279235285f8 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -316,6 +316,7 @@ struct kvm_rmap_head {
>  
>  struct kvm_mmu_page {
>   struct list_head link;
> + struct list_head flush_link;

This can be an hlist.  However, you are not documenting what's the
locking here.  There are many places in which KVM does a
"cond_resched_lock(>kvm->mmu_lock);" and you need to explain how
flush_link is not live across that.

I would start from a simpler patch that just uses the list-based flush
in kvm_mmu_commit_zap_page, where you already have the list of things to
flush as invalid_list.

>   struct hlist_node hash_link;
>   bool unsync;
>  
> 

Also this is not adding parent_pte, so the subject is incorrect.

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH V4 7/15] KVM/MMU: Flush tlb directly in the kvm_zap_gfn_range()

2018-10-15 Thread Paolo Bonzini
On 13/10/2018 16:53, lantianyu1...@gmail.com wrote:
> + bool flush = false;
>   int i;
>  
>   spin_lock(>mmu_lock);
> @@ -5654,18 +5655,27 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t 
> gfn_start, gfn_t gfn_end)
>   slots = __kvm_memslots(kvm, i);
>   kvm_for_each_memslot(memslot, slots) {
>   gfn_t start, end;
> + bool flush_tlb = true;
>  
>   start = max(gfn_start, memslot->base_gfn);
>   end = min(gfn_end, memslot->base_gfn + memslot->npages);
>   if (start >= end)
>   continue;
>  
> - slot_handle_level_range(kvm, memslot, kvm_zap_rmapp,
> - PT_PAGE_TABLE_LEVEL, 
> PT_MAX_HUGEPAGE_LEVEL,
> - start, end - 1, true);
> + if (kvm_available_flush_tlb_with_range())
> + flush_tlb = false;

This should be moved outside the for, because it's invariant.

> + flush = slot_handle_level_range(kvm, memslot,
> + kvm_zap_rmapp, PT_PAGE_TABLE_LEVEL,
> + PT_MAX_HUGEPAGE_LEVEL, start,
> + end - 1, flush_tlb);

... and this should be "flush |= ".
>   }
>   }
>  
> + if (flush && kvm_available_flush_tlb_with_range())
> + kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
> + gfn_end - gfn_start + 1);
> +

... and this can be just if (flush), because if flush_tlb is true then
slot_handle_level_range always returns false.

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support

2018-10-04 Thread Paolo Bonzini
On 04/10/2018 09:54, Vitaly Kuznetsov wrote:
> - Check if pure TSC can be used on SkyLake+ systems (where TSC scaling
> is supported).

Not if you want to migrate to pre-Skylake systems.

> - Check if non-masterclock mode is still needed. E.g. HyperV's TSC page
> clocksource is a single page for the whole VM, not a per-cpu thing. Can
> we think that all the buggy hardware is already gone?

No. :(  We still get reports whenever we break 2007-2008 hardware.

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH V3 2/13] KVM/MMU: Add tlb flush with range helper function

2018-10-01 Thread Paolo Bonzini
On 27/09/2018 05:48, Tianyu Lan wrote:
> +
> + if (range && kvm_x86_ops->tlb_remote_flush_with_range) {
> + /*
> +  * Read tlbs_dirty before flushing tlbs in order
> +  * to track dirty tlbs during flushing.
> +  */
> + long dirty_count = smp_load_acquire(>tlbs_dirty);
> +
> + ret = kvm_x86_ops->tlb_remote_flush_with_range(kvm, range);
> + cmpxchg(>tlbs_dirty, dirty_count, 0);

This is wrong, because it's not the entire TLB that is flushed.  So you
cannot do the cmpxchg here.

Paolo

> +
> + if (ret)
> + kvm_flush_remote_tlbs(kvm);
> +}
> +

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH V3 4/13] KVM/MMU: Flush tlb directly in the kvm_handle_hva_range()

2018-10-01 Thread Paolo Bonzini
On 27/09/2018 05:49, Tianyu Lan wrote:
> This patch is to flush tlb directly in the kvm_handle_hva_range()
> when range flush is available.
> 
> Signed-off-by: Lan Tianyu 
> ---
>  arch/x86/kvm/mmu.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index d10d8423e8d6..877edae0401f 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1888,6 +1888,13 @@ static int kvm_handle_hva_range(struct kvm *kvm,
>)
>   ret |= handler(kvm, iterator.rmap, memslot,
>  iterator.gfn, iterator.level, 
> data);
> +
> + if (ret && kvm_available_flush_tlb_with_range()) {
> + kvm_flush_remote_tlbs_with_address(kvm,
> + gfn_start,
> + gfn_end - gfn_start);
> + ret = 0;
> + }
>   }
>   }
>  
> 


Not all callers need a TLB flush (in particular kvm_test_age_hva)
require a flush.  My suggestion is to rewrite kvm_test_age_hva like this:

index 4705a7f4169e..f72364a0ef9c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1898,12 +1898,13 @@ static int kvm_test_age_rmapp(struct kvm *kvm,
  struct kvm_memory_slot *slot, gfn_t gfn,
  int level, unsigned long data)
 {
+   bool *result = (bool *)data;
u64 *sptep;
struct rmap_iterator iter;

for_each_rmap_spte(rmap_head, , sptep)
if (is_accessed_spte(*sptep))
-   return 1;
+   *result = true;
return 0;
 }

@@ -1929,7 +1930,10 @@ int kvm_age_hva(struct kvm *kvm,

 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
 {
-   return kvm_handle_hva(kvm, hva, 0, kvm_test_age_rmapp);
+   bool result = false;
+   kvm_handle_hva(kvm, hva, (unsigned long) ,
+   kvm_test_age_rmapp);
+   return result;
 }

 #ifdef MMU_DEBUG

and move the flush from kvm_set_pte_rmapp to
kvm_mmu_notifier_change_pte, making kvm_set_spte_hva return an int;
otherwise, it will flush twice.  For non-x86 architectures just grep for
"set_spte_hva" and make the various kvm_set_spte_hva implementations
return false.

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [ 1/1] KVM/MMU: Fix comment in walk_shadow_page_lockless_end()

2018-09-14 Thread Paolo Bonzini
On 07/09/2018 07:45, Tianyu Lan wrote:
> kvm_commit_zap_page() has been renamed to kvm_mmu_commit_zap_page()
> This patch is to fix the commit.
> 
> Signed-off-by: Lan Tianyu 
> ---
>  arch/x86/kvm/mmu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 7ccd29b95746..648b839a349d 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -947,7 +947,7 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu 
> *vcpu)
>  {
>   /*
>* Make sure the write to vcpu->mode is not reordered in front of
> -  * reads to sptes.  If it does, kvm_commit_zap_page() can see us
> +  * reads to sptes.  If it does, kvm_mmu_commit_zap_page() can see us
>* OUTSIDE_GUEST_MODE and proceed to free the shadow page table.
>*/
>   smp_store_release(>mode, OUTSIDE_GUEST_MODE);
> 

Queued, thanks.

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH V3 0/4] KVM/x86/hyper-V: Introduce PV guest address space mapping flush support

2018-07-20 Thread Paolo Bonzini
On 20/07/2018 05:58, KY Srinivasan wrote:
> 
> 
>> -Original Message-
>> From: Tianyu Lan
>> Sent: Thursday, July 19, 2018 1:40 AM
>> Cc: Tianyu Lan ; de...@linuxdriverproject.org;
>> Haiyang Zhang ; h...@zytor.com;
>> k...@vger.kernel.org; KY Srinivasan ; linux-
>> ker...@vger.kernel.org; mi...@redhat.com; pbonz...@redhat.com;
>> rkrc...@redhat.com; Stephen Hemminger ;
>> t...@linutronix.de; x...@kernel.org; Michael Kelley (EOSG)
>> ; vkuzn...@redhat.com
>> Subject: [PATCH V3 0/4] KVM/x86/hyper-V: Introduce PV guest address
>> space mapping flush support
>>
>> Hyper-V provides a para-virtualization hypercall
>> HvFlushGuestPhysicalAddressSpace
>> to flush nested VM address space mapping in l1 hypervisor and it's to reduce
>> overhead
>> of flushing ept tlb among vcpus. The tradition way is to send IPIs to all
>> affected
>> vcpus and executes INVEPT on each vcpus. It will trigger several vmexits for
>> IPI and
>> INVEPT emulation. The pv hypercall can help to flush specified ept table on 
>> all
>> vcpus
>> via one single hypercall.
>>
>> Change since v2:
>>- Make ept_pointers_match as tristate "check", "match" and "mismatch".
>>Set "check" in vmx_set_cr3(), check all ept table pointers in
>> hv_remote_flush_tlb()
>>and call hypercall when all ept pointers are same.
>>- Rename kvm_arch_hv_flush_remote_tlb with
>> kvm_arch_flush_remote_tlb and
>>Rename kvm_x86_ops->hv_tlb_remote_flush with kvm_x86_ops-
>>> tlb_remote_flush
>>- Fix issue that ignore updating tlbs_dirty during calling
>> kvm_arch_flush_remote_tlbs()
>>- Merge patch "KVM/VMX: Add identical ept table pointer check" and
>>patch "KVM/x86: Add tlb_remote_flush callback support for vmx"
>>
>> Change since v1:
>>- Fix compilation error for non-x86 platform.
>>- Use ept_pointers_match to check condition of identical ept
>> table pointer and get ept pointer from struct 
>> vcpu_vmx->ept_pointer.
>>- Add hyperv_nested_flush_guest_mapping ftrace support
>>
>>
>>
>> Lan Tianyu (4):
>>   X86/Hyper-V: Add flush HvFlushGuestPhysicalAddressSpace hypercall
>> support
>>   X86/Hyper-V: Add hyperv_nested_flush_guest_mapping ftrace support
>>   KVM: Add tlb remote flush callback in kvm_x86_ops.
>>   KVM/x86: Add tlb_remote_flush callback support for vmx
>>
>>  arch/x86/hyperv/Makefile|  2 +-
>>  arch/x86/hyperv/nested.c| 67
>> ++
>>  arch/x86/include/asm/hyperv-tlfs.h  |  8 +
>>  arch/x86/include/asm/kvm_host.h | 11 ++
>>  arch/x86/include/asm/mshyperv.h |  2 ++
>>  arch/x86/include/asm/trace/hyperv.h | 14 
>>  arch/x86/kvm/vmx.c  | 72
>> -
>>  include/linux/kvm_host.h|  7 
>>  virt/kvm/kvm_main.c |  3 +-
>>  9 files changed, 183 insertions(+), 3 deletions(-)
>>  create mode 100644 arch/x86/hyperv/nested.c
> 
> Acked-by: K. Y. Srinivasan 

Queued, thanks!

Paolo

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH V3 0/4] KVM/x86/hyper-V: Introduce PV guest address space mapping flush support

2018-07-19 Thread Paolo Bonzini
On 19/07/2018 10:39, Tianyu Lan wrote:
> Hyper-V provides a para-virtualization hypercall 
> HvFlushGuestPhysicalAddressSpace
> to flush nested VM address space mapping in l1 hypervisor and it's to reduce 
> overhead
> of flushing ept tlb among vcpus. The tradition way is to send IPIs to all 
> affected
> vcpus and executes INVEPT on each vcpus. It will trigger several vmexits for 
> IPI and
> INVEPT emulation. The pv hypercall can help to flush specified ept table on 
> all vcpus
> via one single hypercall.

Thanks, this looks good apart from a global replace of EFAULT with
ENOTSUP (which can be done when applying).  Can I have an explicit ack
for patches 1 and 2 from the Hyper-V people?

Thanks,

Paolo

> Change since v2:
>- Make ept_pointers_match as tristate "check", "match" and "mismatch".
>Set "check" in vmx_set_cr3(), check all ept table pointers in 
> hv_remote_flush_tlb()
>and call hypercall when all ept pointers are same.
>- Rename kvm_arch_hv_flush_remote_tlb with kvm_arch_flush_remote_tlb 
> and
>Rename kvm_x86_ops->hv_tlb_remote_flush with 
> kvm_x86_ops->tlb_remote_flush
>- Fix issue that ignore updating tlbs_dirty during calling 
> kvm_arch_flush_remote_tlbs()
>- Merge patch "KVM/VMX: Add identical ept table pointer check" and
>patch "KVM/x86: Add tlb_remote_flush callback support for vmx"
> 
> Change since v1:
>- Fix compilation error for non-x86 platform.
>- Use ept_pointers_match to check condition of identical ept
> table pointer and get ept pointer from struct 
> vcpu_vmx->ept_pointer.
>- Add hyperv_nested_flush_guest_mapping ftrace support
>
> 
> 
> Lan Tianyu (4):
>   X86/Hyper-V: Add flush HvFlushGuestPhysicalAddressSpace hypercall
> support
>   X86/Hyper-V: Add hyperv_nested_flush_guest_mapping ftrace support
>   KVM: Add tlb remote flush callback in kvm_x86_ops.
>   KVM/x86: Add tlb_remote_flush callback support for vmx
> 
>  arch/x86/hyperv/Makefile|  2 +-
>  arch/x86/hyperv/nested.c| 67 ++
>  arch/x86/include/asm/hyperv-tlfs.h  |  8 +
>  arch/x86/include/asm/kvm_host.h | 11 ++
>  arch/x86/include/asm/mshyperv.h |  2 ++
>  arch/x86/include/asm/trace/hyperv.h | 14 
>  arch/x86/kvm/vmx.c  | 72 
> -
>  include/linux/kvm_host.h|  7 
>  virt/kvm/kvm_main.c |  3 +-
>  9 files changed, 183 insertions(+), 3 deletions(-)
>  create mode 100644 arch/x86/hyperv/nested.c
> 

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH V2 2/5] KVM: Add tlb remote flush callback in kvm_x86_ops.

2018-07-18 Thread Paolo Bonzini
On 09/07/2018 11:02, Tianyu Lan wrote:
> + /*
> +  * Call kvm_arch_hv_tlb_remote first and go back old way when
> +  * return failure.
> +  */
> + if (!kvm_arch_hv_flush_remote_tlb(kvm))
> + return;
> +
>   /*
>* Read tlbs_dirty before setting KVM_REQ_TLB_FLUSH in
>* kvm_make_all_cpus_request.
>*/
> - long dirty_count = smp_load_acquire(>tlbs_dirty);
> + dirty_count = smp_load_acquire(>tlbs_dirty);

Also, the call to kvm_arch_flush_remote_tlbs should not replace the 
whole function.  It should be something like:

long dirty_count = smp_load_acquire(>tlbs_dirty);

/*
 * We want to publish modifications to the page tables before reading
 * mode. Pairs with a memory barrier in arch-specific code.
 * - x86: smp_mb__after_srcu_read_unlock in vcpu_enter_guest
 * and smp_mb in walk_shadow_page_lockless_begin/end.
 * - powerpc: smp_mb in kvmppc_prepare_to_enter.
 *
 * There is already an smp_mb__after_atomic() before
 * kvm_make_all_cpus_request() reads vcpu->mode. We reuse that
 * barrier here.
 */
if (!kvm_arch_hv_flush_remote_tlb(kvm) ||
kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
++kvm->stat.remote_tlb_flush;
cmpxchg(>tlbs_dirty, dirty_count, 0);

Otherwise, kvm_mmu_notifier_invalidate_range_start will incorrectly skip
a TLB flush.

Thanks,

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH V2 3/5] KVM/VMX: Add identical ept table pointer check

2018-07-18 Thread Paolo Bonzini
On 09/07/2018 11:02, Tianyu Lan wrote:
> +static void check_ept_pointer(struct kvm_vcpu *vcpu, u64 eptp)
> +{
> + struct kvm *kvm = vcpu->kvm;
> + u64 tmp_eptp = INVALID_PAGE;
> + int i;
> +
> + if (!kvm_x86_ops->tlb_remote_flush)
> + return;
> +
> + spin_lock(_kvm_vmx(kvm)->ept_pointer_lock);
> + to_vmx(vcpu)->ept_pointer = eptp;
> +
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + if (!VALID_PAGE(tmp_eptp)) {
> + tmp_eptp = to_vmx(vcpu)->ept_pointer;
> + } else if (tmp_eptp != to_vmx(vcpu)->ept_pointer) {
> + to_kvm_vmx(kvm)->ept_pointers_match = false;
> + spin_unlock(_kvm_vmx(kvm)->ept_pointer_lock);
> + return;
> + }
> + }
> +
> + to_kvm_vmx(kvm)->ept_pointers_match = true;
> + spin_unlock(_kvm_vmx(kvm)->ept_pointer_lock);
> +}
> +

Is there any reason to do the check here, rather than the first time the
TLB flush is invoked?  You could:

- have a tristate (true, false, check) value for ept_pointers_match

- reset it to "check" in vmx_set_cr3

- set it to either true or false in tlb_remote_flush if it is check, and
do the hypercall if it is true.

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH V2 2/5] KVM: Add tlb remote flush callback in kvm_x86_ops.

2018-07-18 Thread Paolo Bonzini
On 09/07/2018 11:02, Tianyu Lan wrote:
> +
> + /*
> +  * Call kvm_arch_hv_tlb_remote first and go back old way when
> +  * return failure.
> +  */

Please call it "kvm_arch_flush_remote_tlbs", since Hyper-V should not be
mentioned in the generic code.

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH RFC 0/7] KVM: nVMX: enlightened VMCS initial implementation

2017-12-21 Thread Paolo Bonzini
On 21/12/2017 13:50, Vitaly Kuznetsov wrote:
> I'm back with (somewhat frustrating) results (E5-2603):

v4 (that would be Broadwell)?

> 1) Windows on Hyper-V (no nesting): 1350 cycles
> 
> 2) Windows on Hyper-V on Hyper-V: 8600
> 
> 3) Windows on KVM (no nesting): 1150  cycles
> 
> 4) Windows on Hyper-V on KVM (no enlightened VMCS): 18200
> 
> 5) Windows on Hyper-V on KVM (enlightened VMCS): 17100

What version were you using for KVM?  There are quite a few nested virt
optimizations in kvm/queue (which may make enlightened VMCS both more or
less efficient).

In particular, with latest kvm/queue you could try tracing vmread and
vmwrite vmexits, and see if you get any.  If you do, that might be an
easy few hundred cycles savings.

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH RFC 2/7] KVM: nVMX: modify vmcs12 fields to match Hyper-V enlightened VMCS

2017-12-19 Thread Paolo Bonzini
On 19/12/2017 18:40, Jim Mattson wrote:
> I'm not sure that's really the right way to go, since any guest that
> has already read the IA32_VMX_BASIC MSR has a right to expect the VMCS
> revision to remain unchanged.

Hmm, not just that, "the VMCS revision identifier is never written by
the processor" according to the SDM.  Maybe the code that accesses the
vmcs12 can be placed in a .h file and included more than once in vmx.c.

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH RFC 0/7] KVM: nVMX: enlightened VMCS initial implementation

2017-12-19 Thread Paolo Bonzini
On 18/12/2017 18:17, Vitaly Kuznetsov wrote:
> The original author of these patches does no longer work at Red Hat, I
> agreed to take this over and send upstream. Here is his original
> description:
> 
> "Makes KVM implement the enlightened VMCS feature per Hyper-V TLFS 5.0b.
> I've measured about %5 improvement in cost of a nested VM exit (Hyper-V
> enabled Windows Server 2016 nested in KVM)."

Can you try reproducing this and see how much a simple CPUID loop costs in:

* Hyper-V on Hyper-V (with enlightened VMCS, as a proxy for a full
implementation including the clean fields mask)

* Hyper-V on KVM, with and without enlightened VMCS

The latest kvm/queue branch already cut a lot of the cost of a nested VM
exit (from ~22000 to ~14000 clock cycles for KVM on KVM), so we could
also see if Hyper-V needs shadowing of more fields.

> This is just an initial implementation. By leveraging clean fields mask
> we can further improve performance. I'm also interested in implementing
> the other part of the feature: consuming enlightened VMCS when KVM is
> running on top of Hyper-V.

I'm also interested in consuming enlightened VMCS on Hyper-V if that can
provide better performance for KVM on Azure.

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH RFC 2/7] KVM: nVMX: modify vmcs12 fields to match Hyper-V enlightened VMCS

2017-12-19 Thread Paolo Bonzini
On 19/12/2017 13:25, Vitaly Kuznetsov wrote:
> 
>> At this point in time, I don't think you can just blithely change the
>> virtual VMCS layout and revision number. Existing VMs using the old
>> layout and revision number must continue to work on versions of kvm
>> past this point. You could tie the layout and revision number changes
>> to KVM_CAP_HYPERV_ENLIGHTENED_VMCS if you like, but kvm must be able
>> to continue to service VMs using the previous layout and revision
>> number in perpetuity.
>>
> I see what you mean. In case we need to keep migration of nested
> workloads working between KVMs of different versions we can't (ever)
> touch vmcs12.

Actually we can, for two reasons.

First, the active VMCS is stored in host RAM (not in guest RAM).  This
means there are clear points where to do the translation, namely vmptrld
and the (not yet upstream) ioctl to set VMX state.

Therefore you only need to keep an (offset, type) map from old to new
layout map; at those two points if you detect an old VMCS12_REVISION you
copy the fields one by one instead of doing a memcpy.  The next vmclear
or vmptrld or get-VMX-state ioctl will automatically update to the new
VMCS12_REVISION.  Of course, this is a one-way street unless you also
add support for writing old VMCS12_REVISIONs.

But, second, VMX state migration is not upstream yet, so nested
hypervisors are currently not migratable: the active VMCS12 state will
not be migrated at all!  So in upstream KVM we wouldn't even need to
upgrade the VMCS12_REVISION to make changes to vmcs12.

That said...

> The way to go in this case, I think, is to create a completely separate
> enlightened_vmcs12 struct and use it when appropriate. We can't possibly
> support migrating workloads which use enlightened VMCS to an old KVM
> which doesn't support it.

... this is probably a good idea as well.

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH v2 0/7] x86/kvm/hyperv: stable clocksorce for L2 guests when running nested KVM on Hyper-V

2017-12-15 Thread Paolo Bonzini
On 13/12/2017 16:09, Vitaly Kuznetsov wrote:
> Currently, KVM passes PVCLOCK_TSC_STABLE_BIT to its guests when running in
> so called 'masterclock' mode and this is only possible when the clocksource
> on the host is TSC. When running nested on Hyper-V we're using a different
> clocksource in L1 (Hyper-V TSC Page) which can actually be used for
> masterclock. This series brings the required support.
> 
> Making KVM work with TSC page clocksource is relatively easy, it is done in
> PATCH 5 of the series. All the rest is required to support L1 migration
  ^^^

Patch 6. :)

> when TSC frequency changes, we use a special feature from Hyper-V to do
> the job.

Patches 5-7 are

Acked-by: Paolo Bonzini <pbonz...@redhat.com>

I would appreciate if the Hyper-V folks can provide a topic branch to be
merged in both HV and KVM trees.

Thanks,

Paolo

> Vitaly Kuznetsov (7):
>   x86/hyper-v: check for required priviliges in hyperv_init()
>   x86/hyper-v: add a function to read both TSC and TSC page value
> simulateneously
>   x86/hyper-v: reenlightenment notifications support
>   x86/hyper-v: redirect reenlightment notifications on CPU offlining
>   x86/irq: Count Hyper-V reenlightenment interrupts
>   x86/kvm: pass stable clocksource to guests when running nested on
> Hyper-V
>   x86/kvm: support Hyper-V reenlightenment
> 
>  arch/x86/entry/entry_32.S  |   3 +
>  arch/x86/entry/entry_64.S  |   3 +
>  arch/x86/hyperv/hv_init.c  | 133 ++-
>  arch/x86/include/asm/hardirq.h |   3 +
>  arch/x86/include/asm/irq_vectors.h |   7 +-
>  arch/x86/include/asm/mshyperv.h|  32 +++--
>  arch/x86/include/uapi/asm/hyperv.h |  27 
>  arch/x86/kernel/cpu/mshyperv.c |   6 ++
>  arch/x86/kernel/irq.c  |   9 +++
>  arch/x86/kvm/x86.c | 138 
> ++---
>  10 files changed, 329 insertions(+), 32 deletions(-)
> 

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH RFC 2/6] x86/hyper-v: add a function to read both TSC and TSC page value simulateneously

2017-12-01 Thread Paolo Bonzini
On 01/12/2017 18:29, Stephen Hemminger wrote:
>> +static inline u64 hv_read_tsc_page_tsc(const struct ms_hyperv_tsc_page 
>> *tsc_pg,
>> +   u64 *cur_tsc)
>> +{
>> +*cur_tsc = rdtsc();
>> +
>> +return cur_tsc;
> Why do return and setting by reference. Looks like an ugly API.

This is the fallback implementation for !CONFIG_HYPERV_TSCPAGE, which
explains why it's ugly, but why is it needed at all (or it could just
BUG())?

Thanks,

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH RFC 0/6] x86/kvm/hyperv: stable clocksorce for L2 guests when running nested KVM on Hyper-V

2017-12-01 Thread Paolo Bonzini
On 01/12/2017 14:13, Vitaly Kuznetsov wrote:
> Currently, KVM passes PVCLOCK_TSC_STABLE_BIT to its guests when running in
> so called 'masterclock' mode and this is only possible when the clocksource
> on the host is TSC. When running nested on Hyper-V we're using a different
> clocksource in L1 (Hyper-V TSC Page) which can actually be used for
> masterclock. This series brings the required support.
> 
> Making KVM work with TSC page clocksource is relatively easy, it is done in
> PATCH 5 of the series. All the rest is required to support L1 migration
> when TSC frequency changes, we use a special feature from Hyper-V to do
> the job.

This looks great.  If the Hyper-V folks can provide a topic branch with
patches 1-4 in it, we can merge it.

Thanks,

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC PATCH v2 16/32] x86: kvm: Provide support to create Guest and HV shared per-CPU variables

2017-03-29 Thread Paolo Bonzini


On 28/03/2017 20:39, Borislav Petkov wrote:
>> 2) Since the encryption attributes works on PAGE_SIZE hence add some extra
>> padding to 'struct kvm-steal-time' to make it PAGE_SIZE and then at runtime
>> clear the encryption attribute of the full PAGE. The downside of this was
>> now we need to modify structure which may break the compatibility.
> From SEV-ES whitepaper:
> 
> "To facilitate this communication, the SEV-ES architecture defines
> a Guest Hypervisor Communication Block (GHCB). The GHCB resides in
> page of shared memory so it is accessible to both the guest VM and the
> hypervisor."
> 
> So this is kinda begging to be implemented with a shared page between
> guest and host. And then put steal-time, ... etc in there too. Provided
> there's enough room in the single page for the GHCB *and* our stuff.

The GHCB would have to be allocated much earlier, possibly even by
firmware depending on how things will be designed.  I think it's
premature to consider SEV-ES requirements.

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC PATCH v2 14/32] x86: mm: Provide support to use memblock when spliting large pages

2017-03-17 Thread Paolo Bonzini


On 17/03/2017 12:33, Borislav Petkov wrote:
> On Fri, Mar 17, 2017 at 12:03:31PM +0100, Paolo Bonzini wrote:
> 
>> If it is possible to do it in a fairly hypervisor-independent manner,
>> I'm all for it.  That is, only by looking at AMD-specified CPUID leaves
>> and at kernel ELF sections.
> 
> Not even that.
> 
> What that needs to be able to do is:
> 
>   kvm_map_percpu_hv_shared(st, sizeof(*st)))
> 
> where st is the percpu steal time ptr:
> 
>   struct kvm_steal_time *st = _cpu(steal_time, cpu);
> 
> Underneath, what it does basically is it clears the encryption mask from
> the pte, see patch 16/32.

Yes, and I'd like that to be done with a new data section rather than a
special KVM hook.

> And I keep talking about SEV-ES because this is going to expand on the
> need of having a shared memory region which the hypervisor and the guest
> needs to access, thus unencrypted. See
> 
> http://support.amd.com/TechDocs/Protecting%20VM%20Register%20State%20with%20SEV-ES.pdf
> 
> This is where you come in and say what would be the best approach there...

I have no idea.  SEV-ES seems to be very hard to set up at the beginning
of the kernel bootstrap.  There's all sorts of chicken and egg problems,
as well as complicated handshakes between the firmware and the guest,
and the way to do it also depends on the trust and threat models.

A much simpler way is to just boot under a trusted hypervisor, do
"modprobe sev-es" and save a snapshot of the guest.  Then you sign the
snapshot and pass it to your cloud provider.

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC PATCH v2 29/32] kvm: svm: Add support for SEV DEBUG_DECRYPT command

2017-03-17 Thread Paolo Bonzini


On 16/03/2017 19:41, Brijesh Singh wrote:
>>
>> Please do add it, it doesn't seem very different from what you're doing
>> in LAUNCH_UPDATE_DATA.  There's no need for a separate
>> __sev_dbg_decrypt_page function, you can just pin/unpin here and do a
>> per-page loop as in LAUNCH_UPDATE_DATA.
> 
> I can certainly add support to handle crossing the page boundary cases.
> Should we limit the size to prevent user passing arbitrary long length
> and we end up looping inside the kernel? I was thinking to limit to a
> PAGE_SIZE.

I guess it depends on how it's used.  PAGE_SIZE makes sense since you
only know if a physical address is encrypted when you reach it from a
visit of the page tables.

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC PATCH v2 14/32] x86: mm: Provide support to use memblock when spliting large pages

2017-03-17 Thread Paolo Bonzini


On 17/03/2017 11:56, Borislav Petkov wrote:
>> Theoretically or practically?
> In the sense, it needs to be tried first to see how ugly it can get.
> 
>> It only looks at the E820 map, doesn't it?  Why does it have to do
>> anything with percpu memory areas?
> That's irrelevant. What we want to do is take what's in init_mm.pgd and
> modify it. And use the facilities in arch/x86/mm/init_{32,64}.c because
> they already know about early/late pagetable pages allocation and they
> deal with the kernel pagetable anyway.

If it is possible to do it in a fairly hypervisor-independent manner,
I'm all for it.  That is, only by looking at AMD-specified CPUID leaves
and at kernel ELF sections.

Paolo

> And *not* teach pageattr.c about memblock because that can be misused,
> as tglx pointed out on IRC.

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC PATCH v2 14/32] x86: mm: Provide support to use memblock when spliting large pages

2017-03-17 Thread Paolo Bonzini


On 17/03/2017 11:17, Borislav Petkov wrote:
> 
>> I also don't really like the patch as is (plus it fails modpost), but
>> IMO reusing __change_page_attr and __split_large_page is the right thing
>> to do.
> 
> Right, so teaching pageattr.c about memblock could theoretically come
> around and bite us later when a page allocated with memblock gets freed
> with free_page().

Theoretically or practically?

> And looking at this more, we have all this kernel pagetable preparation
> code down the init_mem_mapping() call and the pagetable setup in
> arch/x86/mm/init_{32,64}.c

It only looks at the E820 map, doesn't it?  Why does it have to do
anything with percpu memory areas?

Paolo

> And that code even does some basic page splitting. Oh and it uses
> alloc_low_pages() which knows whether to do memblock reservation or the
> common __get_free_pages() when slabs are up.
> 
> So what would be much cleaner, IMHO, is if one would reuse that code to
> change init_mm.pgd mappings early without copying pageattr.c.
> 
> init_mem_mapping() gets called before kvm_guest_init() in setup_arch()
> so the guest would simply fixup its pagetable right there.
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC PATCH v2 14/32] x86: mm: Provide support to use memblock when spliting large pages

2017-03-16 Thread Paolo Bonzini


On 16/03/2017 19:28, Borislav Petkov wrote:
> So how hard would it be if the hypervisor allocated that memory for the
> guest instead? It would allocate it decrypted and guest would need to
> access it decrypted too. All in preparation for SEV-ES which will need a
> block of unencrypted memory for the guest anyway...

The kvmclock memory is initially zero so there is no need for the
hypervisor to allocate anything; the point of these patches is just to
access the data in a natural way from Linux source code.

I also don't really like the patch as is (plus it fails modpost), but
IMO reusing __change_page_attr and __split_large_page is the right thing
to do.

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC PATCH v2 14/32] x86: mm: Provide support to use memblock when spliting large pages

2017-03-16 Thread Paolo Bonzini


On 10/03/2017 23:41, Brijesh Singh wrote:
>> Maybe there's a reason this fires:
>>
>> WARNING: modpost: Found 2 section mismatch(es).
>> To see full details build your kernel with:
>> 'make CONFIG_DEBUG_SECTION_MISMATCH=y'
>>
>> WARNING: vmlinux.o(.text+0x48edc): Section mismatch in reference from
>> the function __change_page_attr() to the function
>> .init.text:memblock_alloc()
>> The function __change_page_attr() references
>> the function __init memblock_alloc().
>> This is often because __change_page_attr lacks a __init
>> annotation or the annotation of memblock_alloc is wrong.
>>
>> WARNING: vmlinux.o(.text+0x491d1): Section mismatch in reference from
>> the function __change_page_attr() to the function
>> .meminit.text:memblock_free()
>> The function __change_page_attr() references
>> the function __meminit memblock_free().
>> This is often because __change_page_attr lacks a __meminit
>> annotation or the annotation of memblock_free is wrong.
>> 
>> But maybe Paolo might have an even better idea...
> 
> I am sure he will have better idea :)

Not sure if it's better or worse, but an alternative idea is to turn
__change_page_attr and __change_page_attr_set_clr inside out, so that:
1) the alloc_pages/__free_page happens in __change_page_attr_set_clr;
2) __change_page_attr_set_clr overall does not beocome more complex.

Then you can introduce __early_change_page_attr_set_clr and/or
early_kernel_map_pages_in_pgd, for use in your next patches.  They use
the memblock allocator instead of alloc/free_page

The attached patch is compile-tested only and almost certainly has some
thinko in it.  But it even skims a few lines from the code so the idea
might have some merit.

Paolo
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 28d42130243c..953c8e697562 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -490,11 +490,12 @@ static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
 }
 
 static int
-try_preserve_large_page(pte_t *kpte, unsigned long address,
+try_preserve_large_page(pte_t **p_kpte, unsigned long address,
 			struct cpa_data *cpa)
 {
 	unsigned long nextpage_addr, numpages, pmask, psize, addr, pfn, old_pfn;
-	pte_t new_pte, old_pte, *tmp;
+	pte_t *kpte = *p_kpte;
+	pte_t new_pte, old_pte;
 	pgprot_t old_prot, new_prot, req_prot;
 	int i, do_split = 1;
 	enum pg_level level;
@@ -507,8 +508,8 @@ try_preserve_large_page(pte_t *kpte, unsigned long address,
 	 * Check for races, another CPU might have split this page
 	 * up already:
 	 */
-	tmp = _lookup_address_cpa(cpa, address, );
-	if (tmp != kpte)
+	*p_kpte = _lookup_address_cpa(cpa, address, );
+	if (*p_kpte != kpte)
 		goto out_unlock;
 
 	switch (level) {
@@ -634,17 +635,18 @@ __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
 	unsigned int i, level;
 	pte_t *tmp;
 	pgprot_t ref_prot;
+	int retry = 1;
 
+	if (!debug_pagealloc_enabled())
+		spin_lock(_lock);
 	spin_lock(_lock);
 	/*
 	 * Check for races, another CPU might have split this page
 	 * up for us already:
 	 */
 	tmp = _lookup_address_cpa(cpa, address, );
-	if (tmp != kpte) {
-		spin_unlock(_lock);
-		return 1;
-	}
+	if (tmp != kpte)
+		goto out;
 
 	paravirt_alloc_pte(_mm, page_to_pfn(base));
 
@@ -671,10 +673,11 @@ __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
 		break;
 
 	default:
-		spin_unlock(_lock);
-		return 1;
+		goto out;
 	}
 
+	retry = 0;
+
 	/*
 	 * Set the GLOBAL flags only if the PRESENT flag is set
 	 * otherwise pmd/pte_present will return true even on a non
@@ -718,28 +721,34 @@ __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
 	 * going on.
 	 */
 	__flush_tlb_all();
-	spin_unlock(_lock);
 
-	return 0;
-}
-
-static int split_large_page(struct cpa_data *cpa, pte_t *kpte,
-			unsigned long address)
-{
-	struct page *base;
+out:
+	spin_unlock(_lock);
 
+	/*
+	 * Do a global flush tlb after splitting the large page
+ 	 * and before we do the actual change page attribute in the PTE.
+ 	 *
+ 	 * With out this, we violate the TLB application note, that says
+ 	 * "The TLBs may contain both ordinary and large-page
+	 *  translations for a 4-KByte range of linear addresses. This
+	 *  may occur if software modifies the paging structures so that
+	 *  the page size used for the address range changes. If the two
+	 *  translations differ with respect to page frame or attributes
+	 *  (e.g., permissions), processor behavior is undefined and may
+	 *  be implementation-specific."
+ 	 *
+ 	 * We do this global tlb flush inside the cpa_lock, so that we
+	 * don't allow any other cpu, with stale tlb entries change the
+	 * page attribute in parallel, that also falls into the
+	 * just split large page entry.
+ 	 */
+	if (!retry)
+		flush_tlb_all();
 	if (!debug_pagealloc_enabled())
 		spin_unlock(_lock);
-	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
-	if (!debug_pagealloc_enabled())
-		spin_lock(_lock);
-	if (!base)
-		return 

Re: [RFC PATCH v2 14/32] x86: mm: Provide support to use memblock when spliting large pages

2017-03-16 Thread Paolo Bonzini


On 02/03/2017 16:15, Brijesh Singh wrote:
> 
>  __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
> -struct page *base)
> +   pte_t *pbase, unsigned long new_pfn)
>  {
> - pte_t *pbase = (pte_t *)page_address(base);

Just one comment and I'll reply to Boris, I think you can compute pbase 
with pfn_to_kaddr, and avoid adding a new argument.

>*/
> - __set_pmd_pte(kpte, address, mk_pte(base, __pgprot(_KERNPG_TABLE)));
> + __set_pmd_pte(kpte, address,
> + native_make_pte((new_pfn << PAGE_SHIFT) + _KERNPG_TABLE));

And this probably is better written as:

__set_pmd_pte(kpte, address, pfn_pte(new_pfn, __pgprot(_KERNPG_TABLE));

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC PATCH v2 16/32] x86: kvm: Provide support to create Guest and HV shared per-CPU variables

2017-03-16 Thread Paolo Bonzini


On 02/03/2017 16:15, Brijesh Singh wrote:
> Some KVM specific MSR's (steal-time, asyncpf, avic_eio) allocates per-CPU
> variable at compile time and share its physical address with hypervisor.
> It presents a challege when SEV is active in guest OS. When SEV is active,
> guest memory is encrypted with guest key and hypervisor will no longer able
> to modify the guest memory. When SEV is active, we need to clear the
> encryption attribute of shared physical addresses so that both guest and
> hypervisor can access the data.
> 
> To solve this problem, I have tried these three options:
> 
> 1) Convert the static per-CPU to dynamic per-CPU allocation. When SEV is
> detected then clear the encryption attribute. But while doing so I found
> that per-CPU dynamic allocator was not ready when kvm_guest_cpu_init was
> called.
> 
> 2) Since the encryption attributes works on PAGE_SIZE hence add some extra
> padding to 'struct kvm-steal-time' to make it PAGE_SIZE and then at runtime
> clear the encryption attribute of the full PAGE. The downside of this was
> now we need to modify structure which may break the compatibility.
> 
> 3) Define a new per-CPU section (.data..percpu.hv_shared) which will be
> used to hold the compile time shared per-CPU variables. When SEV is
> detected we map this section with encryption attribute cleared.
> 
> This patch implements #3. It introduces a new DEFINE_PER_CPU_HV_SHAHRED
> macro to create a compile time per-CPU variable. When SEV is detected we
> map the per-CPU variable as decrypted (i.e with encryption attribute cleared).
> 
> Signed-off-by: Brijesh Singh 

Looks good to me.

Paolo

> ---
>  arch/x86/kernel/kvm.c |   43 
> +++--
>  include/asm-generic/vmlinux.lds.h |3 +++
>  include/linux/percpu-defs.h   |9 
>  3 files changed, 48 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 099fcba..706a08e 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -75,8 +75,8 @@ static int parse_no_kvmclock_vsyscall(char *arg)
>  
>  early_param("no-kvmclock-vsyscall", parse_no_kvmclock_vsyscall);
>  
> -static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
> -static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
> +static DEFINE_PER_CPU_HV_SHARED(struct kvm_vcpu_pv_apf_data, apf_reason) 
> __aligned(64);
> +static DEFINE_PER_CPU_HV_SHARED(struct kvm_steal_time, steal_time) 
> __aligned(64);
>  static int has_steal_clock = 0;
>  
>  /*
> @@ -290,6 +290,22 @@ static void __init paravirt_ops_setup(void)
>  #endif
>  }
>  
> +static int kvm_map_percpu_hv_shared(void *addr, unsigned long size)
> +{
> + /* When SEV is active, the percpu static variables initialized
> +  * in data section will contain the encrypted data so we first
> +  * need to decrypt it and then map it as decrypted.
> +  */
> + if (sev_active()) {
> + unsigned long pa = slow_virt_to_phys(addr);
> +
> + sme_early_decrypt(pa, size);
> + return early_set_memory_decrypted(addr, size);
> + }
> +
> + return 0;
> +}
> +
>  static void kvm_register_steal_time(void)
>  {
>   int cpu = smp_processor_id();
> @@ -298,12 +314,17 @@ static void kvm_register_steal_time(void)
>   if (!has_steal_clock)
>   return;
>  
> + if (kvm_map_percpu_hv_shared(st, sizeof(*st))) {
> + pr_err("kvm-stealtime: failed to map hv_shared percpu\n");
> + return;
> + }
> +
>   wrmsrl(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | KVM_MSR_ENABLED));
>   pr_info("kvm-stealtime: cpu %d, msr %llx\n",
>   cpu, (unsigned long long) slow_virt_to_phys(st));
>  }
>  
> -static DEFINE_PER_CPU(unsigned long, kvm_apic_eoi) = KVM_PV_EOI_DISABLED;
> +static DEFINE_PER_CPU_HV_SHARED(unsigned long, kvm_apic_eoi) = 
> KVM_PV_EOI_DISABLED;
>  
>  static notrace void kvm_guest_apic_eoi_write(u32 reg, u32 val)
>  {
> @@ -327,25 +348,33 @@ static void kvm_guest_cpu_init(void)
>   if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) {
>   u64 pa = slow_virt_to_phys(this_cpu_ptr(_reason));
>  
> + if (kvm_map_percpu_hv_shared(this_cpu_ptr(_reason),
> + sizeof(struct kvm_vcpu_pv_apf_data)))
> + goto skip_asyncpf;
>  #ifdef CONFIG_PREEMPT
>   pa |= KVM_ASYNC_PF_SEND_ALWAYS;
>  #endif
>   wrmsrl(MSR_KVM_ASYNC_PF_EN, pa | KVM_ASYNC_PF_ENABLED);
>   __this_cpu_write(apf_reason.enabled, 1);
> - printk(KERN_INFO"KVM setup async PF for cpu %d\n",
> -smp_processor_id());
> + printk(KERN_INFO"KVM setup async PF for cpu %d msr %llx\n",
> +smp_processor_id(), pa);
>   }
> -
> +skip_asyncpf:
>   if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) {
>   unsigned long pa;

Re: [RFC PATCH v2 30/32] kvm: svm: Add support for SEV DEBUG_ENCRYPT command

2017-03-16 Thread Paolo Bonzini


On 02/03/2017 16:18, Brijesh Singh wrote:
> + data = (void *) get_zeroed_page(GFP_KERNEL);

The page does not need to be zeroed, does it?

> +
> + if ((len & 15) || (dst_addr & 15)) {
> + /* if destination address and length are not 16-byte
> +  * aligned then:
> +  * a) decrypt destination page into temporary buffer
> +  * b) copy source data into temporary buffer at correct offset
> +  * c) encrypt temporary buffer
> +  */
> + ret = __sev_dbg_decrypt_page(kvm, dst_addr, data, >error);

Ah, I see now you're using this function here for read-modify-write.
data is already pinned here, so even if you keep the function it makes
sense to push pinning out of __sev_dbg_decrypt_page and into
sev_dbg_decrypt.

> + if (ret)
> + goto err_3;
> + d_off = dst_addr & (PAGE_SIZE - 1);
> +
> + if (copy_from_user(data + d_off,
> + (uint8_t *)debug.src_addr, len)) {
> + ret = -EFAULT;
> + goto err_3;
> + }
> +
> + encrypt->length = PAGE_SIZE;

Why decrypt/re-encrypt all the page instead of just the 16 byte area
around the [dst_addr, dst_addr+len) range?

> + encrypt->src_addr = __psp_pa(data);
> + encrypt->dst_addr =  __sev_page_pa(inpages[0]);
> + } else {
> + if (copy_from_user(data, (uint8_t *)debug.src_addr, len)) {
> + ret = -EFAULT;
> + goto err_3;
> + }

Do you need copy_from_user, or can you just pin/unpin memory as for
DEBUG_DECRYPT?

Paolo

> + d_off = dst_addr & (PAGE_SIZE - 1);
> + encrypt->length = len;
> + encrypt->src_addr = __psp_pa(data);
> + encrypt->dst_addr = __sev_page_pa(inpages[0]);
> + encrypt->dst_addr += d_off;
> + }
> +
> + encrypt->handle = sev_get_handle(kvm);
> + ret = sev_issue_cmd(kvm, SEV_CMD_DBG_ENCRYPT, encrypt, >error);
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC PATCH v2 29/32] kvm: svm: Add support for SEV DEBUG_DECRYPT command

2017-03-16 Thread Paolo Bonzini


On 02/03/2017 16:18, Brijesh Singh wrote:
> +static int __sev_dbg_decrypt_page(struct kvm *kvm, unsigned long src,
> + void *dst, int *error)
> +{
> + inpages = sev_pin_memory(src, PAGE_SIZE, );
> + if (!inpages) {
> + ret = -ENOMEM;
> + goto err_1;
> + }
> +
> + data->handle = sev_get_handle(kvm);
> + data->dst_addr = __psp_pa(dst);
> + data->src_addr = __sev_page_pa(inpages[0]);
> + data->length = PAGE_SIZE;
> +
> + ret = sev_issue_cmd(kvm, SEV_CMD_DBG_DECRYPT, data, error);
> + if (ret)
> + printk(KERN_ERR "SEV: DEBUG_DECRYPT %d (%#010x)\n",
> + ret, *error);
> + sev_unpin_memory(inpages, npages);
> +err_1:
> + kfree(data);
> + return ret;
> +}
> +
> +static int sev_dbg_decrypt(struct kvm *kvm, struct kvm_sev_cmd *argp)
> +{
> + void *data;
> + int ret, offset, len;
> + struct kvm_sev_dbg debug;
> +
> + if (!sev_guest(kvm))
> + return -ENOTTY;
> +
> + if (copy_from_user(, (void *)argp->data,
> + sizeof(struct kvm_sev_dbg)))
> + return -EFAULT;
> + /*
> +  * TODO: add support for decrypting length which crosses the
> +  * page boundary.
> +  */
> + offset = debug.src_addr & (PAGE_SIZE - 1);
> + if (offset + debug.length > PAGE_SIZE)
> + return -EINVAL;
> +

Please do add it, it doesn't seem very different from what you're doing
in LAUNCH_UPDATE_DATA.  There's no need for a separate
__sev_dbg_decrypt_page function, you can just pin/unpin here and do a
per-page loop as in LAUNCH_UPDATE_DATA.

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC PATCH v2 26/32] kvm: svm: Add support for SEV LAUNCH_UPDATE_DATA command

2017-03-16 Thread Paolo Bonzini


On 02/03/2017 16:17, Brijesh Singh wrote:
> +static struct page **sev_pin_memory(unsigned long uaddr, unsigned long ulen,
> + unsigned long *n)
> +{
> + struct page **pages;
> + int first, last;
> + unsigned long npages, pinned;
> +
> + /* Get number of pages */
> + first = (uaddr & PAGE_MASK) >> PAGE_SHIFT;
> + last = ((uaddr + ulen - 1) & PAGE_MASK) >> PAGE_SHIFT;
> + npages = (last - first + 1);
> +
> + pages = kzalloc(npages * sizeof(struct page *), GFP_KERNEL);
> + if (!pages)
> + return NULL;
> +
> + /* pin the user virtual address */
> + down_read(>mm->mmap_sem);
> + pinned = get_user_pages_fast(uaddr, npages, 1, pages);
> + up_read(>mm->mmap_sem);

get_user_pages_fast, like get_user_pages_unlocked, must be called
without mmap_sem held.

> + if (pinned != npages) {
> + printk(KERN_ERR "SEV: failed to pin  %ld pages (got %ld)\n",
> + npages, pinned);
> + goto err;
> + }
> +
> + *n = npages;
> + return pages;
> +err:
> + if (pinned > 0)
> + release_pages(pages, pinned, 0);
> + kfree(pages);
> +
> + return NULL;
> +}
>
> + /* the array of pages returned by get_user_pages() is a page-aligned
> +  * memory. Since the user buffer is probably not page-aligned, we need
> +  * to calculate the offset within a page for first update entry.
> +  */
> + offset = uaddr & (PAGE_SIZE - 1);
> + len = min_t(size_t, (PAGE_SIZE - offset), ulen);
> + ulen -= len;
> +
> + /* update first page -
> +  * special care need to be taken for the first page because we might
> +  * be dealing with offset within the page
> +  */

No need to special case the first page; just set "offset = 0" inside the
loop after the first iteration.

Paolo

> + data->handle = sev_get_handle(kvm);
> + data->length = len;
> + data->address = __sev_page_pa(inpages[0]) + offset;
> + ret = sev_issue_cmd(kvm, SEV_CMD_LAUNCH_UPDATE_DATA,
> + data, >error);
> + if (ret)
> + goto err_3;
> +
> + /* update remaining pages */
> + for (i = 1; i < nr_pages; i++) {
> +
> + len = min_t(size_t, PAGE_SIZE, ulen);
> + ulen -= len;
> + data->length = len;
> + data->address = __sev_page_pa(inpages[i]);
> + ret = sev_issue_cmd(kvm, SEV_CMD_LAUNCH_UPDATE_DATA,
> + data, >error);
> + if (ret)
> + goto err_3;
> + }

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC PATCH v2 32/32] x86: kvm: Pin the guest memory when SEV is active

2017-03-16 Thread Paolo Bonzini


On 02/03/2017 16:18, Brijesh Singh wrote:
> The SEV memory encryption engine uses a tweak such that two identical
> plaintexts at different location will have a different ciphertexts.
> So swapping or moving ciphertexts of two pages will not result in
> plaintexts being swapped. Relocating (or migrating) a physical backing pages
> for SEV guest will require some additional steps. The current SEV key
> management spec [1] does not provide commands to swap or migrate (move)
> ciphertexts. For now we pin the memory allocated for the SEV guest. In
> future when SEV key management spec provides the commands to support the
> page migration we can update the KVM code to remove the pinning logical
> without making any changes into userspace (qemu).
> 
> The patch pins userspace memory when a new slot is created and unpin the
> memory when slot is removed.
> 
> [1] http://support.amd.com/TechDocs/55766_SEV-KM%20API_Spec.pdf

This is not enough, because memory can be hidden temporarily from the
guest and remapped later.  Think of a PCI BAR that is backed by RAM, or
also SMRAM.  The pinning must be kept even in that case.

You need to add a pair of KVM_MEMORY_ENCRYPT_OPs (one that doesn't map
to a PSP operation), such as KVM_REGISTER/UNREGISTER_ENCRYPTED_RAM.  In
QEMU you can use a RAMBlockNotifier to invoke the ioctls.

Paolo

> Signed-off-by: Brijesh Singh 
> ---
>  arch/x86/include/asm/kvm_host.h |6 +++
>  arch/x86/kvm/svm.c  |   93 
> +++
>  arch/x86/kvm/x86.c  |3 +
>  3 files changed, 102 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index fcc4710..9dc59f0 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -723,6 +723,7 @@ struct kvm_sev_info {
>   unsigned int handle;/* firmware handle */
>   unsigned int asid;  /* asid for this guest */
>   int sev_fd; /* SEV device fd */
> + struct list_head pinned_memory_slot;
>  };
>  
>  struct kvm_arch {
> @@ -1043,6 +1044,11 @@ struct kvm_x86_ops {
>   void (*setup_mce)(struct kvm_vcpu *vcpu);
>  
>   int (*memory_encryption_op)(struct kvm *kvm, void __user *argp);
> +
> + void (*prepare_memory_region)(struct kvm *kvm,
> + struct kvm_memory_slot *memslot,
> + const struct kvm_userspace_memory_region *mem,
> + enum kvm_mr_change change);
>  };
>  
>  struct kvm_arch_async_pf {
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 13996d6..ab973f9 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -498,12 +498,21 @@ static inline bool gif_set(struct vcpu_svm *svm)
>  }
>  
>  /* Secure Encrypted Virtualization */
> +struct kvm_sev_pinned_memory_slot {
> + struct list_head list;
> + unsigned long npages;
> + struct page **pages;
> + unsigned long userspace_addr;
> + short id;
> +};
> +
>  static unsigned int max_sev_asid;
>  static unsigned long *sev_asid_bitmap;
>  static void sev_deactivate_handle(struct kvm *kvm);
>  static void sev_decommission_handle(struct kvm *kvm);
>  static int sev_asid_new(void);
>  static void sev_asid_free(int asid);
> +static void sev_unpin_memory(struct page **pages, unsigned long npages);
>  #define __sev_page_pa(x) ((page_to_pfn(x) << PAGE_SHIFT) | sme_me_mask)
>  
>  static bool kvm_sev_enabled(void)
> @@ -1544,9 +1553,25 @@ static inline int avic_free_vm_id(int id)
>  
>  static void sev_vm_destroy(struct kvm *kvm)
>  {
> + struct list_head *pos, *q;
> + struct kvm_sev_pinned_memory_slot *pinned_slot;
> + struct list_head *head = >arch.sev_info.pinned_memory_slot;
> +
>   if (!sev_guest(kvm))
>   return;
>  
> + /* if guest memory is pinned then unpin it now */
> + if (!list_empty(head)) {
> + list_for_each_safe(pos, q, head) {
> + pinned_slot = list_entry(pos,
> + struct kvm_sev_pinned_memory_slot, list);
> + sev_unpin_memory(pinned_slot->pages,
> + pinned_slot->npages);
> + list_del(pos);
> + kfree(pinned_slot);
> + }
> + }
> +
>   /* release the firmware resources */
>   sev_deactivate_handle(kvm);
>   sev_decommission_handle(kvm);
> @@ -5663,6 +5688,8 @@ static int sev_pre_start(struct kvm *kvm, int *asid)
>   }
>   *asid = ret;
>   ret = 0;
> +
> + INIT_LIST_HEAD(>arch.sev_info.pinned_memory_slot);
>   }
>  
>   return ret;
> @@ -6189,6 +6216,71 @@ static int sev_launch_measure(struct kvm *kvm, struct 
> kvm_sev_cmd *argp)
>   return ret;
>  }
>  
> +static struct kvm_sev_pinned_memory_slot *sev_find_pinned_memory_slot(
> + struct kvm *kvm, struct kvm_memory_slot *slot)
> +{
> + struct 

Re: [RFC PATCH v2 24/32] kvm: x86: prepare for SEV guest management API support

2017-03-16 Thread Paolo Bonzini


On 02/03/2017 16:17, Brijesh Singh wrote:
> ASID management:
>  - Reserve asid range for SEV guest, SEV asid range is obtained through
>CPUID Fn8000_001f[ECX]. A non-SEV guest can use any asid outside the SEV
>asid range.

How is backwards compatibility handled?

>  - SEV guest must have asid value within asid range obtained through CPUID.
>  - SEV guest must have the same asid for all vcpu's. A TLB flush is required
>if different vcpu for the same ASID is to be run on the same host CPU.

[...]

> +
> + /* which host cpu was used for running this vcpu */
> + bool last_cpuid;

Should be unsigned int.

> 
> + /* Assign the asid allocated for this SEV guest */
> + svm->vmcb->control.asid = asid;
> +
> + /* Flush guest TLB:
> +  * - when different VMCB for the same ASID is to be run on the
> +  *   same host CPU
> +  *   or
> +  * - this VMCB was executed on different host cpu in previous VMRUNs.
> +  */
> + if (sd->sev_vmcbs[asid] != (void *)svm->vmcb ||

Why the cast?

> + svm->last_cpuid != cpu)
> + svm->vmcb->control.tlb_ctl = TLB_CONTROL_FLUSH_ALL_ASID;

If there is a match, you don't need to do anything else (neither reset
the asid, nor mark it as dirty, nor update the fields), so:

if (sd->sev_vmcbs[asid] == svm->vmcb &&
svm->last_cpuid == cpu)
return;

svm->last_cpuid = cpu;
sd->sev_vmcbs[asid] = svm->vmcb;
svm->vmcb->control.tlb_ctl = TLB_CONTROL_FLUSH_ALL_ASID;
svm->vmcb->control.asid = asid;
mark_dirty(svm->vmcb, VMCB_ASID);

(plus comments ;)).

Also, why not TLB_CONTROL_FLUSH_ASID if possible?

> + svm->last_cpuid = cpu;
> + sd->sev_vmcbs[asid] = (void *)svm->vmcb;
> +
> + mark_dirty(svm->vmcb, VMCB_ASID);

[...]

> 
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index fef7d83..9df37a2 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1284,6 +1284,104 @@ struct kvm_s390_ucas_mapping {
>  /* Memory Encryption Commands */
>  #define KVM_MEMORY_ENCRYPT_OP  _IOWR(KVMIO, 0xb8, unsigned long)
>  
> +/* Secure Encrypted Virtualization mode */
> +enum sev_cmd_id {

Please add documentation in Documentation/virtual/kvm/memory_encrypt.txt.

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC PATCH v2 23/32] kvm: introduce KVM_MEMORY_ENCRYPT_OP ioctl

2017-03-16 Thread Paolo Bonzini


On 02/03/2017 16:17, Brijesh Singh wrote:
> If hardware supports encrypting then KVM_MEMORY_ENCRYPT_OP ioctl can
> be used by qemu to issue platform specific memory encryption commands.
> 
> Signed-off-by: Brijesh Singh <brijesh.si...@amd.com>
> ---
>  arch/x86/include/asm/kvm_host.h |2 ++
>  arch/x86/kvm/x86.c  |   12 
>  include/uapi/linux/kvm.h|2 ++
>  3 files changed, 16 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index bff1f15..62651ad 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1033,6 +1033,8 @@ struct kvm_x86_ops {
>   void (*cancel_hv_timer)(struct kvm_vcpu *vcpu);
>  
>   void (*setup_mce)(struct kvm_vcpu *vcpu);
> +
> + int (*memory_encryption_op)(struct kvm *kvm, void __user *argp);
>  };
>  
>  struct kvm_arch_async_pf {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 2099df8..6a737e9 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3926,6 +3926,14 @@ static int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>   return r;
>  }
>  
> +static int kvm_vm_ioctl_memory_encryption_op(struct kvm *kvm, void __user 
> *argp)
> +{
> + if (kvm_x86_ops->memory_encryption_op)
> + return kvm_x86_ops->memory_encryption_op(kvm, argp);
> +
> + return -ENOTTY;
> +}
> +
>  long kvm_arch_vm_ioctl(struct file *filp,
>  unsigned int ioctl, unsigned long arg)
>  {
> @@ -4189,6 +4197,10 @@ long kvm_arch_vm_ioctl(struct file *filp,
>   r = kvm_vm_ioctl_enable_cap(kvm, );
>   break;
>   }
> + case KVM_MEMORY_ENCRYPT_OP: {
> + r = kvm_vm_ioctl_memory_encryption_op(kvm, argp);
> + break;
> + }
>   default:
>   r = kvm_vm_ioctl_assigned_device(kvm, ioctl, arg);
>   }
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index cac48ed..fef7d83 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1281,6 +1281,8 @@ struct kvm_s390_ucas_mapping {
>  #define KVM_S390_GET_IRQ_STATE _IOW(KVMIO, 0xb6, struct 
> kvm_s390_irq_state)
>  /* Available with KVM_CAP_X86_SMM */
>  #define KVM_SMI   _IO(KVMIO,   0xb7)
> +/* Memory Encryption Commands */
> +#define KVM_MEMORY_ENCRYPT_OP  _IOWR(KVMIO, 0xb8, unsigned long)
>  
>  #define KVM_DEV_ASSIGN_ENABLE_IOMMU  (1 << 0)
>  #define KVM_DEV_ASSIGN_PCI_2_3   (1 << 1)
> 

Reviewed-by: Paolo Bonzini <pbonz...@redhat.com>
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC PATCH v2 12/32] x86: Add early boot support when running with SEV active

2017-03-09 Thread Paolo Bonzini


On 09/03/2017 15:07, Borislav Petkov wrote:
> + /* Check if running under a hypervisor */
> + eax = 0x4000;
> + ecx = 0;
> + native_cpuid(, , , );

This is not how you check if running under a hypervisor; you should
check the HYPERVISOR bit, i.e. bit 31 of cpuid(1).ecx.  This in turn
tells you if leaf 0x4000 is valid.

That said, the main issue with this function is that it hardcodes the
behavior for KVM.  It is possible that another hypervisor defines its
0x4001 leaf in such a way that KVM_FEATURE_SEV has a different meaning.

Instead, AMD should define a "well-known" bit in its own space (i.e.
0x80xx) that is only used by hypervisors that support SEV.  This is
similar to how Intel defined one bit in leaf 1 to say "is leaf
0x4000 valid".

Thanks,

Paolo

> + if (eax > 0x4000) {
> + eax = 0x4001;
> + ecx = 0;
> + native_cpuid(, , , );
> + if (!(eax & BIT(KVM_FEATURE_SEV)))
> + goto out;
> +
> + eax = 0x801f;
> + ecx = 0;
> + native_cpuid(, , , );
> + if (!(eax & 1))
> + goto out;
> +
> + sme_me_mask = 1UL << (ebx & 0x3f);
> + sev_enabled = 1;
> +
> + goto out;
> + }
> +
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH 12/15] hyperv: move VMBus connection ids to uapi

2017-01-09 Thread Paolo Bonzini


On 09/01/2017 09:40, h...@zytor.com wrote:
> On January 9, 2017 12:32:23 AM PST, Roman Kagan <rka...@virtuozzo.com> wrote:
>> On Mon, Jan 02, 2017 at 09:19:57AM +0100, Paolo Bonzini wrote:
>>> On 28/12/2016 18:09, Roman Kagan wrote:
>>>> Am I correct assuming that QEMU is currently the only user of
>>>> arch/x86/include/uapi/asm/hyperv.h?
>>>>
>>>> Then I think we're fine withdrawing it from uapi as a whole and
>> letting
>>>> QEMU pull it in through its header-harvesting scripts (as does now
>>>> anyway).  This would lift all licensing and longterm API stability
>>>> expectations.
>>>
>>> Actually, QEMU's header-harvesting scripts use uapi/ headers
>>> exclusively, since they are built on "make headers_install".
>>>
>>> The extra cleanups that QEMU does on top are to allow compilation of
>> the
>>> headers on non-Linux machines.  They don't really do much more than
>>> changing Linux (linux/types.h) integer types to the C99 (stdint.h)
>>> equivalents.
>>
>> Ouch, I stand corrected.
>>
>> So what should we do with it then?  I'm sorta lost...
>>
>> We certainly can give it up and live with a private copy of the
>> definitions in the QEMU tree but that doesn't sound optimal in any
>> sense.
> 
> Why do that through header mangling rather than typedef?

Because you are not suppose to typedef identifiers that start with "__",
and because it does do a few other ad-hoc changes:

-e 's/<linux\/\([^>]*\)>/"standard-headers\/linux\/\1"/' \
-e 's/__bitwise__//' \
-e 's/__attribute__((packed))/QEMU_PACKED/' \
-e 's/__inline__/inline/' \
-e '/sys\/ioctl.h/d' \
-e 's/SW_MAX/SW_MAX_/' \

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH 12/15] hyperv: move VMBus connection ids to uapi

2017-01-03 Thread Paolo Bonzini


On 02/01/2017 20:39, Stephen Hemminger wrote:
>>>
>>> I would like to minimize what we include in the uapi header; especially 
>>> when MSFT has made no guarantees
>>> with regards how  they may be evolved. I will also work on getting some 
>>> clarity on both stability and
>>> under what license we would expose the uapi header.  
>> Am I correct assuming that QEMU is currently the only user of
>> arch/x86/include/uapi/asm/hyperv.h?
>>
>> Then I think we're fine withdrawing it from uapi as a whole and letting
>> QEMU pull it in through its header-harvesting scripts (as does now
>> anyway).  This would lift all licensing and longterm API stability
>> expectations.
>
> Thanks, that prevents lots of problems.
> That is how I handle iproute2 as well.

Except it wouldn't work.  But no big deal, I guess we'll just
synchronize hyperv.h manually. :((

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH 12/15] hyperv: move VMBus connection ids to uapi

2017-01-02 Thread Paolo Bonzini


On 28/12/2016 18:09, Roman Kagan wrote:
> Am I correct assuming that QEMU is currently the only user of
> arch/x86/include/uapi/asm/hyperv.h?
> 
> Then I think we're fine withdrawing it from uapi as a whole and letting
> QEMU pull it in through its header-harvesting scripts (as does now
> anyway).  This would lift all licensing and longterm API stability
> expectations.

Actually, QEMU's header-harvesting scripts use uapi/ headers
exclusively, since they are built on "make headers_install".

The extra cleanups that QEMU does on top are to allow compilation of the
headers on non-Linux machines.  They don't really do much more than
changing Linux (linux/types.h) integer types to the C99 (stdint.h)
equivalents.

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH 03/15] hyperv: use standard bitops

2016-12-22 Thread Paolo Bonzini


On 21/12/2016 14:23, Roman Kagan wrote:
> On Wed, Dec 21, 2016 at 01:00:44PM +0100, Olaf Hering wrote:
>> On Tue, Dec 20, Roman Kagan wrote:
>>
>> Reverting commit 22356585712d ("staging: hv: use sync_bitops when
>> interacting with the hypervisor") is save because ...
>>
>>> -   sync_set_bit(channel->monitor_bit,
>>> +   set_bit(channel->monitor_bit,
> 
> It isn't indeed.  I didn't realize there was a UP case where it made a
> difference, and failed to locate the commit where it changed.
> 
> I'll drop this part, thanks.

Perhaps the sync_bitops should be renamed to virt_bitops.  This would
match virt_* memory barriers and would make their usage much more obvious.

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH 12/15] hyperv: move VMBus connection ids to uapi

2016-12-21 Thread Paolo Bonzini


On 21/12/2016 18:50, Stephen Hemminger wrote:
> The other problem with the hyperv headers is they were initially done with
> only the Linux driver usage in mind. This made perfect sense at the time,
> the problem is that they mix internal state with protocol definitions.

Yes, and this was partly fixed when KVM started to use some of those
definitions too (the implementation of Hyper-V protocols is split
between kernel and userspace).

> Lastly, there is licensing issues on headers. It would be good to have any
> userspace ABI headers licensed with a more liberal license so that BSD and 
> DPDK drivers
> could use them directly. Right now each one reinvents.

Other projects are using sanitized userspace ABI headers just fine, so
this is something that lawyers should sort out before kernel hackers do.

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH 12/15] hyperv: move VMBus connection ids to uapi

2016-12-21 Thread Paolo Bonzini


On 21/12/2016 16:43, Christoph Hellwig wrote:
> On Wed, Dec 21, 2016 at 04:39:18PM +0100, Paolo Bonzini wrote:
>> That said, there are precedents in using UAPI this way for PV
>> interfaces.  See for example include/uapi/linux/virtio*.h and
>> arch/x86/include/uapi/asm/kvm_para.h.
> 
> We have all kinds of historical examples, but most of them turned
> into a major pain sooner or later - my favourite example are the
> SCSI protocol headers.

Mine too, and because of how uapi/ was created there are quite a few
such historical headers (my favourite is cuda.h, just because of the name).

> Protocols needs to stay compatible on the (virtual) wire, but not
> on the language level.  Locking us into the strict UABI policies
> for them just make someone life horrible further down the road.

The ABI for this kind of thing is not changing much anyway, because it's
the hardware or processor or (as in this case) hypervisor ABI.

The more interesting question is about the API, and here in the end it
seems to be up to the maintainer.

Some have explicitly asked to move stuff *out* of UAPI, for example the
x86 guys have removed msr-index.h from UAPI recently.  Others are okay
with it and they simply aren't strict on cleanups that might break the
*programming* interface, as in patch 15 of this series.  See for example
pci_regs.h commit 846fc70986a6, "PCI/AER: Rename PCI_ERR_UNC_TRAIN to
PCI_ERR_UNC_UND", everybody just moved on and QEMU adjusted its use of
PCI_ERR_UNC_TRAIN.

Having this in UAPI has been convenient for QEMU, but of course the
kernel couldn't care less.  So if KY prefers to have the header outside
UAPI, we will just follow suit...

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH 12/15] hyperv: move VMBus connection ids to uapi

2016-12-21 Thread Paolo Bonzini


On 21/12/2016 15:26, Christoph Hellwig wrote:
> On Wed, Dec 21, 2016 at 03:59:20PM +0300, Roman Kagan wrote:
>> That's fine by me.
>>
>> I guess the series should then start with a complete move
>> arch/x86/include/uapi/asm/hyperv.h ->
>> arch/x86/include/asm/hyperv_proto.h, and the remaining patches have to
>> change the latter instead of the former?
> 
> That would be my preference, but we'll need to figure out why
> hyperv has ever been a UABI header, and if anyone is using it.

QEMU uses it, but we already bundle the header files and update them
periodically from the files that Linux installs.  So any change in Linux
would not break the QEMU build; having the header in UAPI is convenient
but I guess our update scripts could do whatever Linux's
scripts/headers_install.sh does.

That said, there are precedents in using UAPI this way for PV
interfaces.  See for example include/uapi/linux/virtio*.h and
arch/x86/include/uapi/asm/kvm_para.h.

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH 2/2] mm: remove get_user_pages_locked()

2016-10-31 Thread Paolo Bonzini


On 31/10/2016 14:48, Lorenzo Stoakes wrote:
> On Mon, Oct 31, 2016 at 12:45:36PM +0100, Paolo Bonzini wrote:
>>
>>
>> On 31/10/2016 11:02, Lorenzo Stoakes wrote:
>>> - *
>>> - * get_user_pages should be phased out in favor of
>>> - * get_user_pages_locked|unlocked or get_user_pages_fast. Nothing
>>> - * should use get_user_pages because it cannot pass
>>> - * FAULT_FLAG_ALLOW_RETRY to handle_mm_fault.
>>
>> This comment should be preserved in some way.  In addition, removing
> 
> Could you clarify what you think should be retained?
> 
> The comment seems to me to be largely rendered redundant by this change -
> get_user_pages() now offers identical behaviour, and of course the latter part
> of the comment ('because it cannot pass FAULT_FLAG_ALLOW_RETRY') is rendered
> incorrect by this change.
> 
> It could be replaced with a recommendation to make use of VM_FAULT_RETRY logic
> if possible.

Yes, exactly.  locked == NULL should be avoided whenever mmap_sem can 
be dropped, and the comment indirectly said so.  Now most of those cases
actually are those where you can just call get_user_pages_unlocked.

>> get_user_pages_locked() makes it harder (compared to a simple "git grep
>> -w") to identify callers that lack allow-retry functionality).  So I'm
>> not sure about the benefits of these patches.
> 
> That's a fair point, and certainly this cleanup series is less obviously 
> helpful
> than previous ones.
> 
> However, there are a few points in its favour:
> 
> 1. get_user_pages_remote() was the mirror of get_user_pages() prior to adding 
> an
>int *locked parameter to the former (necessary to allow for the unexport of
>__get_user_pages_unlocked()), differing only in task/mm being specified and
>FOLL_REMOTE being set. This patch series keeps these functions 
> 'synchronised'
>in this respect.
> 
> 2. There is currently only one caller of get_user_pages_locked() in
>mm/frame_vector.c which seems to suggest this function isn't widely
>used/known.

Or not widely necessary. :)

> 3. This change results in all slow-path get_user_pages*() functions having the
>ability to use VM_FAULT_RETRY logic rather than 'defaulting' to using
>get_user_pages() that doesn't let you do this even if you wanted to.

This is only true if someone does the work though.  From a quick look 
at your series, the following files can be trivially changed to use 
get_user_pages_unlocked:

- drivers/gpu/drm/via/via_dmablit.c
- drivers/platform/goldfish/goldfish_pipe.c
- drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c
- drivers/rapidio/devices/rio_mport_cdev.c
- drivers/virt/fsl_hypervisor.c

Also, videobuf_dma_init_user could be changed to use retry by adding a 
*locked argument to videobuf_dma_init_user_locked, prototype patch 
after my signature.

Everything else is probably best kept using get_user_pages.

> 4. It's somewhat confusing/redundant having both get_user_pages_locked() and
>get_user_pages() functions which both require mmap_sem to be held (i.e. 
> both
>are 'locked' versions.)
> 
>> If all callers were changed, then sure removing the _locked suffix would
>> be a good idea.
> 
> It seems many callers cannot release mmap_sem so VM_FAULT_RETRY logic couldn't
> happen anyway in this cases and in these cases we couldn't change the caller.

But then get_user_pages_locked remains a less-common case, so perhaps 
it's a good thing to give it a longer name!

> Overall, an alternative here might be to remove get_user_pages() and update
> get_user_pages_locked() to add a 'vmas' parameter, however doing this renders
> get_user_pages_unlocked() asymmetric as it would lack an vmas parameter 
> (adding
> one there would make no sense as VM_FAULT_RETRY logic invalidates VMAs) though
> perhaps not such a big issue as it makes sense as to why this is the case.

Adding the 'vmas' parameter to get_user_pages_locked would make little 
sense.  Since VM_FAULT_RETRY invalidates it and g_u_p_locked can and 
does retry, it would generally not be useful.

So I think we have the right API now:

- do not have lock?  Use get_user_pages_unlocked, get retry for free,
no need to handle  mmap_sem and the locked argument; cannot get back vmas.

- have and cannot drop lock?  User get_user_pages, no need to pass NULL 
for the locked argument; can get back vmas.

- have but can drop lock (rare case)?  Use get_user_pages_locked, 
cannot get back vams.

Paolo

> get_user_pages_unlocked() definitely seems to be a useful helper and therefore
> makes sense to retain.

> Of course another alternative is to leave things be :)
> 

diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c 
b/drivers/media/v4l2-core/videobuf-dma-sg.c
index 1db0af6c7f94..dae4

Re: [PATCH 2/2] mm: remove get_user_pages_locked()

2016-10-31 Thread Paolo Bonzini


On 31/10/2016 11:02, Lorenzo Stoakes wrote:
> - *
> - * get_user_pages should be phased out in favor of
> - * get_user_pages_locked|unlocked or get_user_pages_fast. Nothing
> - * should use get_user_pages because it cannot pass
> - * FAULT_FLAG_ALLOW_RETRY to handle_mm_fault.

This comment should be preserved in some way.  In addition, removing
get_user_pages_locked() makes it harder (compared to a simple "git grep
-w") to identify callers that lack allow-retry functionality).  So I'm
not sure about the benefits of these patches.

If all callers were changed, then sure removing the _locked suffix would
be a good idea.

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC PATCH v1 21/28] KVM: introduce KVM_SEV_ISSUE_CMD ioctl

2016-10-18 Thread Paolo Bonzini

> > If I understanding correctly then you are recommending that instead of
> > exporting various functions from PSP drv we should expose one function
> > for the all the guest command handling (something like this).
> >
> > My understanding is that a user could exhaust the ASIDs for encrypted
> > VMs if it was allowed to start an arbitrary number of KVM guests.  So
> > we would need some kind of control.  Is this correct?
> 
> Yes, there is limited number of ASIDs for encrypted VMs. Do we need to
> pass the psp_fd into SEV_ISSUE_CMD ioctl or can we handle it from Qemu
> itself ? e.g when user asks to transition a guest into SEV-enabled mode
> then before calling LAUNCH_START Qemu tries to open /dev/psp device. If
> open() returns success then we know user has permission to communicate
> with PSP firmware.

No, this is a stateful mechanism and it's hard to implement.  Passing a
/dev/psp file descriptor is the simplest way to "prove" that you have
access to the device.

Thanks,

Paolo

> > If so, does /dev/psp provide any functionality that you believe is
> > dangerous for the KVM userspace (which runs in a very confined
> > environment)?  Is this functionality blocked through capability
> > checks?
> 
> I do not see /dev/psp providing anything which would be dangerous to KVM
> userspace. It should be safe to access /dev/psp into KVM userspace.
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC PATCH v1 21/28] KVM: introduce KVM_SEV_ISSUE_CMD ioctl

2016-10-17 Thread Paolo Bonzini
> I am not sure if I fully understand this feedback. Let me summaries what
> we have right now.
> 
> At highest level SEV key management commands are divided into two sections:
> 
> - platform  management : commands used during platform provisioning. PSP
> drv provides ioctl's for these commands. Qemu will not use these
> ioctl's, i believe these ioctl will be used by other tools.
> 
> - guest management: command used during guest life cycle. PSP drv
> exports various function and KVM drv calls these function when it
> receives the SEV_ISSUE_CMD ioctl from qemu.
> 
> If I understanding correctly then you are recommending that instead of
> exporting various functions from PSP drv we should expose one function
> for the all the guest command handling (something like this).

My understanding is that a user could exhaust the ASIDs for encrypted
VMs if it was allowed to start an arbitrary number of KVM guests.  So
we would need some kind of control.  Is this correct?

If so, does /dev/psp provide any functionality that you believe is
dangerous for the KVM userspace (which runs in a very confined
environment)?  Is this functionality blocked through capability
checks?

Thanks,

Paolo


> int psp_issue_cmd_external_user(struct file *filep,
>   int cmd, unsigned long addr,
>   int *psp_ret)
> {
>   /* here we check to ensure that file->f_ops is a valid
>* psp instance.
>   */
>   if (filep->f_ops != _fops)
>   return -EINVAL;
> 
>   /* handle the command */
>   return psp_issue_cmd (cmd, addr, timeout, psp_ret);
> }
> 
> In KVM driver use something like this to invoke the PSP command handler.
> 
> int kvm_sev_psp_cmd (struct kvm_sev_issue_cmd *input,
>unsigned long data)
> {
>   int ret;
>   struct fd f;
> 
>   f = fdget(input->psp_fd);
>   if (!f.file)
>   return -EBADF;
>   
> 
>   psp_issue_cmd_external_user(f.file, input->cmd,
>   data, >psp_ret);
>   
> }
> 
> Please let me know if I understood this correctly.
> 
> >> Signed-off-by: Brijesh Singh 
> >> ---
> >>  arch/x86/include/asm/kvm_host.h |3 +
> >>  arch/x86/kvm/x86.c  |   13 
> >>  include/uapi/linux/kvm.h|  125
> >>  +++
> >>  3 files changed, 141 insertions(+)
> >>
> 
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC PATCH v1 00/28] x86: Secure Encrypted Virtualization (AMD)

2016-10-13 Thread Paolo Bonzini


On 23/08/2016 01:23, Brijesh Singh wrote:
> TODO:
> - send qemu/seabios RFC's on respective mailing list
> - integrate the psp driver with CCP driver (they share the PCI id's)
> - add SEV guest migration command support
> - add SEV snapshotting command support
> - determine how to do ioremap of physical memory with mem encryption enabled
>   (e.g acpi tables)

The would be encrypted, right?  Similar to the EFI data in patch 9.

> - determine how to share the guest memory with hypervisor for to support
>   pvclock driver

Is it enough if the guest makes that page unencrypted?

I reviewed the KVM host-side patches and they are pretty
straightforward, so the comments on each patch suffice.

Thanks,

Paolo

> Brijesh Singh (11):
>   crypto: add AMD Platform Security Processor driver
>   KVM: SVM: prepare to reserve asid for SEV guest
>   KVM: SVM: prepare for SEV guest management API support
>   KVM: introduce KVM_SEV_ISSUE_CMD ioctl
>   KVM: SVM: add SEV launch start command
>   KVM: SVM: add SEV launch update command
>   KVM: SVM: add SEV_LAUNCH_FINISH command
>   KVM: SVM: add KVM_SEV_GUEST_STATUS command
>   KVM: SVM: add KVM_SEV_DEBUG_DECRYPT command
>   KVM: SVM: add KVM_SEV_DEBUG_ENCRYPT command
>   KVM: SVM: add command to query SEV API version
> 
> Tom Lendacky (17):
>   kvm: svm: Add support for additional SVM NPF error codes
>   kvm: svm: Add kvm_fast_pio_in support
>   kvm: svm: Use the hardware provided GPA instead of page walk
>   x86: Secure Encrypted Virtualization (SEV) support
>   KVM: SVM: prepare for new bit definition in nested_ctl
>   KVM: SVM: Add SEV feature definitions to KVM
>   x86: Do not encrypt memory areas if SEV is enabled
>   Access BOOT related data encrypted with SEV active
>   x86/efi: Access EFI data as encrypted when SEV is active
>   x86: Change early_ioremap to early_memremap for BOOT data
>   x86: Don't decrypt trampoline area if SEV is active
>   x86: DMA support for SEV memory encryption
>   iommu/amd: AMD IOMMU support for SEV
>   x86: Don't set the SME MSR bit when SEV is active
>   x86: Unroll string I/O when SEV is active
>   x86: Add support to determine if running with SEV enabled
>   KVM: SVM: Enable SEV by setting the SEV_ENABLE cpu feature
> 
> 
>  arch/x86/boot/compressed/Makefile  |2 
>  arch/x86/boot/compressed/head_64.S |   19 +
>  arch/x86/boot/compressed/mem_encrypt.S |  123 
>  arch/x86/include/asm/io.h  |   26 +
>  arch/x86/include/asm/kvm_emulate.h |3 
>  arch/x86/include/asm/kvm_host.h|   27 +
>  arch/x86/include/asm/mem_encrypt.h |3 
>  arch/x86/include/asm/svm.h |3 
>  arch/x86/include/uapi/asm/hyperv.h |4 
>  arch/x86/include/uapi/asm/kvm_para.h   |4 
>  arch/x86/kernel/acpi/boot.c|4 
>  arch/x86/kernel/head64.c   |4 
>  arch/x86/kernel/mem_encrypt.S  |   44 ++
>  arch/x86/kernel/mpparse.c  |   10 
>  arch/x86/kernel/setup.c|7 
>  arch/x86/kernel/x8664_ksyms_64.c   |1 
>  arch/x86/kvm/cpuid.c   |4 
>  arch/x86/kvm/mmu.c |   20 +
>  arch/x86/kvm/svm.c |  906 
> 
>  arch/x86/kvm/x86.c |   73 +++
>  arch/x86/mm/ioremap.c  |7 
>  arch/x86/mm/mem_encrypt.c  |   50 ++
>  arch/x86/platform/efi/efi_64.c |   14 
>  arch/x86/realmode/init.c   |   11 
>  drivers/crypto/Kconfig |   11 
>  drivers/crypto/Makefile|1 
>  drivers/crypto/psp/Kconfig |8 
>  drivers/crypto/psp/Makefile|3 
>  drivers/crypto/psp/psp-dev.c   |  220 
>  drivers/crypto/psp/psp-dev.h   |   95 +++
>  drivers/crypto/psp/psp-ops.c   |  454 
>  drivers/crypto/psp/psp-pci.c   |  376 +
>  drivers/sfi/sfi_core.c |6 
>  include/linux/ccp-psp.h|  833 +
>  include/uapi/linux/Kbuild  |1 
>  include/uapi/linux/ccp-psp.h   |  182 ++
>  include/uapi/linux/kvm.h   |  125 
>  37 files changed, 3643 insertions(+), 41 deletions(-)
>  create mode 100644 arch/x86/boot/compressed/mem_encrypt.S
>  create mode 100644 drivers/crypto/psp/Kconfig
>  create mode 100644 drivers/crypto/psp/Makefile
>  create mode 100644 drivers/crypto/psp/psp-dev.c
>  create mode 100644 drivers/crypto/psp/psp-dev.h
>  create mode 100644 drivers/crypto/psp/psp-ops.c
>  create mode 100644 drivers/crypto/psp/psp-pci.c
>  create mode 100644 include/linux/ccp-psp.h
>  create mode 100644 include/uapi/linux/ccp-psp.h
> 
___
devel mailing list
de...@linuxdriverproject.org

Re: [RFC PATCH v1 24/28] KVM: SVM: add SEV_LAUNCH_FINISH command

2016-10-13 Thread Paolo Bonzini


On 23/08/2016 01:28, Brijesh Singh wrote:
> +
> + /* Iterate through each vcpus and set SEV KVM_SEV_FEATURE bit in
> +  * KVM_CPUID_FEATURE to indicate that SEV is enabled on this vcpu
> +  */
> + kvm_for_each_vcpu(i, vcpu, kvm)
> + svm_cpuid_update(vcpu);
> +

Do you need another call to sev_init_vmcb here?

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC PATCH v1 22/28] KVM: SVM: add SEV launch start command

2016-10-13 Thread Paolo Bonzini


On 23/08/2016 01:28, Brijesh Singh wrote:
> +static int sev_launch_start(struct kvm *kvm,
> + struct kvm_sev_launch_start __user *arg,
> + int *psp_ret)
> +{
> + int ret, asid;
> + struct kvm_sev_launch_start params;
> + struct psp_data_launch_start *start;
> +
> + /* Get parameter from the user */
> + if (copy_from_user(, arg, sizeof(*arg)))
> + return -EFAULT;
> +
> + start = kzalloc(sizeof(*start), GFP_KERNEL);
> + if (!start)
> + return -ENOMEM;
> +
> + ret = sev_pre_start(kvm, );

You need some locking in sev_asid_{new,free}.  Probably _lock.  The
SEV_ISSUE_CMD ioctl instead should take >lock.

Paolo

> + if (ret)
> + goto err_1;
> +
> + start->hdr.buffer_len = sizeof(*start);
> + start->flags  = params.flags;
> + start->policy = params.policy;
> + start->handle = params.handle;
> + memcpy(start->nonce, , sizeof(start->nonce));
> + memcpy(start->dh_pub_qx, _pub_qx, sizeof(start->dh_pub_qx));
> + memcpy(start->dh_pub_qy, _pub_qy, sizeof(start->dh_pub_qy));
> +
> + /* launch start */
> + ret = psp_guest_launch_start(start, psp_ret);
> + if (ret) {
> + printk(KERN_ERR "SEV: LAUNCH_START ret=%d (%#010x)\n",
> + ret, *psp_ret);
> + goto err_2;
> + }
> +
> + ret = sev_post_start(kvm, asid, start->handle, psp_ret);
> + if (ret)
> + goto err_2;

Paolo

> + kfree(start);
> + return 0;
> +
> +err_2:
> + sev_asid_free(asid);
> +err_1:
> + kfree(start);
> + return ret;
> +}
> +
> +static int amd_sev_issue_cmd(struct kvm *kvm,
> +  struct kvm_sev_issue_cmd __user *user_data)
> +{
> + int r = -ENOTTY;
> + struct kvm_sev_issue_cmd arg;
> +
> + if (copy_from_user(, user_data, sizeof(struct kvm_sev_issue_cmd)))
> + return -EFAULT;
> +
> + switch (arg.cmd) {
> + case KVM_SEV_LAUNCH_START: {
> + r = sev_launch_start(kvm, (void *)arg.opaque,
> + _code);
> + break;
> + }
> + default:
> + break;
> + }
> +
> + if (copy_to_user(user_data, , sizeof(struct kvm_sev_issue_cmd)))
> + r = -EFAULT;
> + return r;
> +}
> +
>  static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
>   .cpu_has_kvm_support = has_svm,
>   .disabled_by_bios = is_disabled,
> @@ -5313,6 +5517,8 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = 
> {
>  
>   .pmu_ops = _pmu_ops,
>   .deliver_posted_interrupt = svm_deliver_avic_intr,
> +
> + .sev_issue_cmd = amd_sev_issue_cmd,
>  };
>  
>  static int __init svm_init(void)
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: mailto:"d...@kvack.org;> em...@kvack.org 
> 
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC PATCH v1 21/28] KVM: introduce KVM_SEV_ISSUE_CMD ioctl

2016-10-13 Thread Paolo Bonzini


On 23/08/2016 01:28, Brijesh Singh wrote:
> The ioctl will be used by qemu to issue the Secure Encrypted
> Virtualization (SEV) guest commands to transition a guest into
> into SEV-enabled mode.
> 
> a typical usage:
> 
> struct kvm_sev_launch_start start;
> struct kvm_sev_issue_cmd data;
> 
> data.cmd = KVM_SEV_LAUNCH_START;
> data.opaque = 
> 
> ret = ioctl(fd, KVM_SEV_ISSUE_CMD, );
> 
> On SEV command failure, data.ret_code will contain the firmware error code.

Please modify the ioctl to require the file descriptor for the PSP.  A
program without access to /dev/psp should not be able to use SEV.

> Signed-off-by: Brijesh Singh 
> ---
>  arch/x86/include/asm/kvm_host.h |3 +
>  arch/x86/kvm/x86.c  |   13 
>  include/uapi/linux/kvm.h|  125 
> +++
>  3 files changed, 141 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 9b885fc..a94e37d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1040,6 +1040,9 @@ struct kvm_x86_ops {
>   void (*cancel_hv_timer)(struct kvm_vcpu *vcpu);
>  
>   void (*setup_mce)(struct kvm_vcpu *vcpu);
> +
> + int (*sev_issue_cmd)(struct kvm *kvm,
> +  struct kvm_sev_issue_cmd __user *argp);
>  };
>  
>  struct kvm_arch_async_pf {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d6f2f4b..0c0adad 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3820,6 +3820,15 @@ split_irqchip_unlock:
>   return r;
>  }
>  
> +static int kvm_vm_ioctl_sev_issue_cmd(struct kvm *kvm,
> +   struct kvm_sev_issue_cmd __user *argp)
> +{
> + if (kvm_x86_ops->sev_issue_cmd)
> + return kvm_x86_ops->sev_issue_cmd(kvm, argp);
> +
> + return -ENOTTY;
> +}

Please make a more generic vm_ioctl callback.

>  long kvm_arch_vm_ioctl(struct file *filp,
>  unsigned int ioctl, unsigned long arg)
>  {
> @@ -4085,6 +4094,10 @@ long kvm_arch_vm_ioctl(struct file *filp,
>   r = kvm_vm_ioctl_enable_cap(kvm, );
>   break;
>   }
> + case KVM_SEV_ISSUE_CMD: {
> + r = kvm_vm_ioctl_sev_issue_cmd(kvm, argp);
> + break;
> + }
>   default:
>   r = kvm_vm_ioctl_assigned_device(kvm, ioctl, arg);
>   }
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 300ef25..72c18c3 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1274,6 +1274,131 @@ struct kvm_s390_ucas_mapping {
>  /* Available with KVM_CAP_X86_SMM */
>  #define KVM_SMI   _IO(KVMIO,   0xb7)
>  
> +/* Secure Encrypted Virtualization mode */
> +enum sev_cmd {
> + KVM_SEV_LAUNCH_START = 0,
> + KVM_SEV_LAUNCH_UPDATE,
> + KVM_SEV_LAUNCH_FINISH,
> + KVM_SEV_GUEST_STATUS,
> + KVM_SEV_DBG_DECRYPT,
> + KVM_SEV_DBG_ENCRYPT,
> + KVM_SEV_RECEIVE_START,
> + KVM_SEV_RECEIVE_UPDATE,
> + KVM_SEV_RECEIVE_FINISH,
> + KVM_SEV_SEND_START,
> + KVM_SEV_SEND_UPDATE,
> + KVM_SEV_SEND_FINISH,
> + KVM_SEV_API_VERSION,
> + KVM_SEV_NR_MAX,
> +};
> +
> +struct kvm_sev_issue_cmd {
> + __u32 cmd;
> + __u64 opaque;
> + __u32 ret_code;
> +};
> +
> +struct kvm_sev_launch_start {
> + __u32 handle;
> + __u32 flags;
> + __u32 policy;
> + __u8 nonce[16];
> + __u8 dh_pub_qx[32];
> + __u8 dh_pub_qy[32];
> +};
> +
> +struct kvm_sev_launch_update {
> + __u64   address;
> + __u32   length;
> +};
> +
> +struct kvm_sev_launch_finish {
> + __u32 vcpu_count;
> + __u32 vcpu_length;
> + __u64 vcpu_mask_addr;
> + __u32 vcpu_mask_length;
> + __u8  measurement[32];
> +};
> +
> +struct kvm_sev_guest_status {
> + __u32 policy;
> + __u32 state;
> +};
> +
> +struct kvm_sev_dbg_decrypt {
> + __u64 src_addr;
> + __u64 dst_addr;
> + __u32 length;
> +};
> +
> +struct kvm_sev_dbg_encrypt {
> + __u64 src_addr;
> + __u64 dst_addr;
> + __u32 length;
> +};
> +
> +struct kvm_sev_receive_start {
> + __u32 handle;
> + __u32 flags;
> + __u32 policy;
> + __u8 policy_meas[32];
> + __u8 wrapped_tek[24];
> + __u8 wrapped_tik[24];
> + __u8 ten[16];
> + __u8 dh_pub_qx[32];
> + __u8 dh_pub_qy[32];
> + __u8 nonce[16];
> +};
> +
> +struct kvm_sev_receive_update {
> + __u8 iv[16];
> + __u64 address;
> + __u32 length;
> +};
> +
> +struct kvm_sev_receive_finish {
> + __u8 measurement[32];
> +};
> +
> +struct kvm_sev_send_start {
> + __u8 nonce[16];
> + __u32 policy;
> + __u8 policy_meas[32];
> + __u8 wrapped_tek[24];
> + __u8 wrapped_tik[24];
> + __u8 ten[16];
> + __u8 iv[16];
> + __u32 flags;
> + __u8 api_major;
> + __u8 api_minor;
> + __u32 serial;
> + __u8 dh_pub_qx[32];
> + __u8 dh_pub_qy[32];
> + __u8 pek_sig_r[32];
> + 

Re: [RFC PATCH v1 20/28] KVM: SVM: prepare for SEV guest management API support

2016-10-13 Thread Paolo Bonzini


On 23/08/2016 01:28, Brijesh Singh wrote:
> The patch adds initial support required for Secure Encrypted
> Virtualization (SEV) guest management API's.
> 
> ASID management:
>  - Reserve asid range for SEV guest, SEV asid range is obtained
>through CPUID Fn8000_001f[ECX]. A non-SEV guest can use any
>asid outside the SEV asid range.
>  - SEV guest must have asid value within asid range obtained
>through CPUID.
>  - SEV guest must have the same asid for all vcpu's. A TLB flush
>is required if different vcpu for the same ASID is to be run
>on the same host CPU.
> 
> - save SEV private structure in kvm_arch.
> 
> - If SEV is available then initialize PSP firmware during hardware probe
> 
> Signed-off-by: Brijesh Singh 
> ---
>  arch/x86/include/asm/kvm_host.h |9 ++
>  arch/x86/kvm/svm.c  |  213 
> +++
>  2 files changed, 221 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index b1dd673..9b885fc 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -715,6 +715,12 @@ struct kvm_hv {
>   u64 hv_crash_ctl;
>  };
>  
> +struct kvm_sev_info {
> + unsigned int asid;  /* asid for this guest */
> + unsigned int handle;/* firmware handle */
> + unsigned int ref_count; /* number of active vcpus */
> +};
> +
>  struct kvm_arch {
>   unsigned int n_used_mmu_pages;
>   unsigned int n_requested_mmu_pages;
> @@ -799,6 +805,9 @@ struct kvm_arch {
>  
>   bool x2apic_format;
>   bool x2apic_broadcast_quirk_disabled;
> +
> + /* struct for SEV guest */
> + struct kvm_sev_info sev_info;
>  };
>  
>  struct kvm_vm_stat {
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index f010b23..dcee635 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -34,6 +34,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -186,6 +187,9 @@ struct vcpu_svm {
>   struct page *avic_backing_page;
>   u64 *avic_physical_id_cache;
>   bool avic_is_running;
> +
> + /* which host cpu was used for running this vcpu */
> + bool last_cpuid;
>  };
>  
>  #define AVIC_LOGICAL_ID_ENTRY_GUEST_PHYSICAL_ID_MASK (0xFF)
> @@ -243,6 +247,25 @@ static int avic;
>  module_param(avic, int, S_IRUGO);
>  #endif
>  
> +/* Secure Encrypted Virtualization */
> +static bool sev_enabled;

You can check max_sev_asid != 0 instead (wrapped in a sev_enabled()
function).

> +static unsigned long max_sev_asid;

Need not be 64-bit.

> +static unsigned long *sev_asid_bitmap;

Please note what lock protects this, and modify it with __set_bit and
__clear_bit.

> +#define kvm_sev_guest()  (kvm->arch.sev_info.handle)
> +#define kvm_sev_handle() (kvm->arch.sev_info.handle)
> +#define kvm_sev_ref()(kvm->arch.sev_info.ref_count++)
> +#define kvm_sev_unref()  (kvm->arch.sev_info.ref_count--)
> +#define svm_sev_handle() (svm->vcpu.kvm->arch.sev_info.handle)
> +#define svm_sev_asid()   (svm->vcpu.kvm->arch.sev_info.asid)
> +#define svm_sev_ref()
> (svm->vcpu.kvm->arch.sev_info.ref_count++)
> +#define svm_sev_unref()  
> (svm->vcpu.kvm->arch.sev_info.ref_count--)
> +#define svm_sev_guest()  (svm->vcpu.kvm->arch.sev_info.handle)
> +#define svm_sev_ref_count()  (svm->vcpu.kvm->arch.sev_info.ref_count)

Why is the reference count necessary?  Could you use the kvm refcount
instead and free the ASID in kvm_x86_ops->vm_destroy?  Also, what lock
protects the reference count?

Also please remove the macros in general.  If there is only a struct
vcpu_svm*, use

struct kvm_arch *vm_data = >vcpu.kvm->arch;

as done for example in avic_init_vmcb.

> +
> +static int sev_asid_new(void);
> +static void sev_asid_free(int asid);
> +
>  static void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0);
>  static void svm_flush_tlb(struct kvm_vcpu *vcpu);
>  static void svm_complete_interrupts(struct vcpu_svm *svm);
> @@ -474,6 +497,8 @@ struct svm_cpu_data {
>   struct kvm_ldttss_desc *tss_desc;
>  
>   struct page *save_area;
> +
> + void **sev_vmcb;  /* index = sev_asid, value = vmcb pointer */

It's not a void**, it's a struct vmcb**.  Please rename it to sev_vmcbs,
too, so that it's clear that it's an array.

>  };
>  
>  static DEFINE_PER_CPU(struct svm_cpu_data *, svm_data);
> @@ -727,7 +752,10 @@ static int svm_hardware_enable(void)
>   sd->asid_generation = 1;
>   sd->max_asid = cpuid_ebx(SVM_CPUID_FUNC) - 1;
>   sd->next_asid = sd->max_asid + 1;
> - sd->min_asid = 1;
> + sd->min_asid = max_sev_asid + 1;
> +
> + if (sev_enabled)
> + memset(sd->sev_vmcb, 0, (max_sev_asid + 1) * sizeof(void *));

This seems strange.  You should clear the field, for each possible CPU,
in sev_asid_free, not in sev_uninit_vcpu.  Then when you reuse the 

Re: [RFC PATCH v1 19/28] KVM: SVM: prepare to reserve asid for SEV guest

2016-10-13 Thread Paolo Bonzini


On 23/08/2016 01:27, Brijesh Singh wrote:
> In current implementation, asid allocation starts from 1, this patch
> adds a min_asid variable in svm_vcpu structure to allow starting asid
> from something other than 1.
> 
> Signed-off-by: Brijesh Singh <brijesh.si...@amd.com>
> ---
>  arch/x86/kvm/svm.c |4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 211be94..f010b23 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -470,6 +470,7 @@ struct svm_cpu_data {
>   u64 asid_generation;
>   u32 max_asid;
>   u32 next_asid;
> + u32 min_asid;
>   struct kvm_ldttss_desc *tss_desc;
>  
>   struct page *save_area;
> @@ -726,6 +727,7 @@ static int svm_hardware_enable(void)
>   sd->asid_generation = 1;
>   sd->max_asid = cpuid_ebx(SVM_CPUID_FUNC) - 1;
>   sd->next_asid = sd->max_asid + 1;
> + sd->min_asid = 1;
>  
>   native_store_gdt(_descr);
>   gdt = (struct desc_struct *)gdt_descr.address;
> @@ -1887,7 +1889,7 @@ static void new_asid(struct vcpu_svm *svm, struct 
> svm_cpu_data *sd)
>  {
>   if (sd->next_asid > sd->max_asid) {
>   ++sd->asid_generation;
> - sd->next_asid = 1;
> + sd->next_asid = sd->min_asid;
>   svm->vmcb->control.tlb_ctl = TLB_CONTROL_FLUSH_ALL_ASID;
>   }
>  
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: mailto:"d...@kvack.org;> em...@kvack.org 
> 

Reviewed-by: Paolo Bonzini <pbonz...@redhat.com>
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC PATCH v1 09/28] x86/efi: Access EFI data as encrypted when SEV is active

2016-09-22 Thread Paolo Bonzini


On 22/09/2016 20:47, Tom Lendacky wrote:
> > Because the firmware volume is written to high memory in encrypted form,
> > and because the PEI phase runs in 32-bit mode, the firmware code will be
> > encrypted; on the other hand, data that is placed in low memory for the
> > kernel can be unencrypted, thus limiting differences between SME and SEV.
> 
> I like the idea of limiting the differences but it would leave the EFI
> data and ACPI tables exposed and able to be manipulated.

Hmm, that makes sense.  So I guess this has to stay, and Borislav's
proposal doesn't fly either.

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC PATCH v1 09/28] x86/efi: Access EFI data as encrypted when SEV is active

2016-09-22 Thread Paolo Bonzini


On 22/09/2016 20:37, Borislav Petkov wrote:
>> > Unless this is part of some spec, it's easier if things are the same in
>> > SME and SEV.
> Yeah, I was pondering over how sprinkling sev_active checks might not be
> so clean.
> 
> I'm wondering if we could make the EFI regions presented to the guest
> unencrypted too, as part of some SEV-specific init routine so that the
> guest kernel doesn't need to do anything different.

That too, but why not fix it in the firmware?...  (Again, if there's any
MSFT guy looking at this offlist, let's involve him in the discussion).

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC PATCH v1 09/28] x86/efi: Access EFI data as encrypted when SEV is active

2016-09-22 Thread Paolo Bonzini


On 22/09/2016 19:46, Tom Lendacky wrote:
>> > Do you mean, it is encrypted here because we're in the guest kernel?
> Yes, the idea is that the SEV guest will be running encrypted from the
> start, including the BIOS/UEFI, and so all of the EFI related data will
> be encrypted.

Unless this is part of some spec, it's easier if things are the same in
SME and SEV.

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC PATCH v1 09/28] x86/efi: Access EFI data as encrypted when SEV is active

2016-09-22 Thread Paolo Bonzini


On 22/09/2016 19:07, Borislav Petkov wrote:
>> Which paragraph?
> "Linux relies on BIOS to set this bit if BIOS has determined that the
> reduction in the physical address space as a result of enabling memory
> encryption..."
> 
> Basically, you can enable SME in the BIOS and you're all set.

That's not how I read it.  I just figured that the BIOS has some magic
things high in the physical address space and if you reduce the physical
address space the BIOS (which is called from e.g. EFI runtime services)
would have problems with that.

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC PATCH v1 09/28] x86/efi: Access EFI data as encrypted when SEV is active

2016-09-22 Thread Paolo Bonzini


On 22/09/2016 16:35, Borislav Petkov wrote:
>> > @@ -230,6 +230,10 @@ int __init efi_setup_page_tables(unsigned long 
>> > pa_memmap, unsigned num_pages)
>> >efi_scratch.efi_pgt = (pgd_t *)__sme_pa(efi_pgd);
>> >pgd = efi_pgd;
>> >  
>> > +  flags = _PAGE_NX | _PAGE_RW;
>> > +  if (sev_active)
>> > +  flags |= _PAGE_ENC;
> So this is confusing me. There's this patch which says EFI data is
> accessed in the clear:
> 
> https://lkml.kernel.org/r/2016083738.29880.6909.st...@tlendack-t1.amdoffice.net
> 
> but now here it is encrypted when SEV is enabled.
> 
> Do you mean, it is encrypted here because we're in the guest kernel?

I suspect this patch is untested, and also wrong. :)

The main difference between the SME and SEV encryption, from the point
of view of the kernel, is that real-mode always writes unencrypted in
SME and always writes encrypted in SEV.  But UEFI can run in 64-bit mode
and learn about the C bit, so EFI boot data should be unprotected in SEV
guests.

Because the firmware volume is written to high memory in encrypted form,
and because the PEI phase runs in 32-bit mode, the firmware code will be
encrypted; on the other hand, data that is placed in low memory for the
kernel can be unencrypted, thus limiting differences between SME and SEV.

Important: I don't know what you guys are doing for SEV and
Windows guests, but if you are doing something I would really
appreciate doing things in the open.  If Linux and Windows end
up doing different things with EFI boot data, ACPI tables, etc.
it will be a huge pain.  On the other hand, if we can enjoy
being first, that's great.

In fact, I have suggested in the QEMU list that SEV guests should always
use UEFI; because BIOS runs in real-mode or 32-bit non-paging protected
mode, BIOS must always write encrypted data, which becomes painful in
the kernel.

And regarding the above "important" point, all I know is that Microsoft
for sure will be happy to restrict SEV to UEFI guests. :)

There are still some differences, mostly around the real mode trampoline
executed by the kernel, but they should be much smaller.

Paolo
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH] x86: guest: rely on leaf 0x40000001 to detect Hyper-V

2015-10-02 Thread Paolo Bonzini
The specification says that "Microsoft Hv" is actually a vendor ID field
that is only used for reporting and diagnostic purposes.  The actual
field that you need to check is the interface ID that you get in eax
when querying the HYPERV_CPUID_INTERFACE.

Change ms_hyperv_platform to actually do what the specification suggests.
This roughy matches what Windows looks for, though Windows actually
ignores HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS completely.

Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
---
 arch/x86/kernel/cpu/mshyperv.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 381c8b9b3a33..7910e7fd705b 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -116,18 +116,16 @@ static void hv_machine_crash_shutdown(struct pt_regs 
*regs)
 static uint32_t  __init ms_hyperv_platform(void)
 {
u32 eax;
-   u32 hyp_signature[3];
 
if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
return 0;
 
-   cpuid(HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS,
- , _signature[0], _signature[1], _signature[2]);
-
-   if (eax >= HYPERV_CPUID_MIN &&
-   eax <= HYPERV_CPUID_MAX &&
-   !memcmp("Microsoft Hv", hyp_signature, 12))
-   return HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS;
+   eax = cpuid_eax(HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS);
+   if (eax >= HYPERV_CPUID_MIN && eax <= HYPERV_CPUID_MAX) {
+   eax = cpuid_eax(HYPERV_CPUID_INTERFACE);
+   if (!memcmp(, "Hv#1", sizeof(eax)))
+   return HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS;
+   }
 
return 0;
 }
-- 
2.5.0

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel