RE: [PATCH v3 2/2] locking/pvqspinlock, hv: Enable PV qspinlock for Hyper-V

2018-09-28 Thread Michael Kelley (EOSG)
From: Yi Sun  Sent: Wednesday, September 26, 2018 
11:02 PM

> Follow PV spinlock mechanism to implement the callback functions
> to allow the CPU idling and kicking operations on Hyper-V.
> 
> Cc: "K. Y. Srinivasan" 
> Cc: Haiyang Zhang 
> Cc: Stephen Hemminger 
> Cc: Thomas Gleixner 
> Cc: Michael Kelley (EOSG) 
> Signed-off-by: Yi Sun 
> ---

Reviewed-by: Michael Kelley 


RE: [PATCH v3 2/2] locking/pvqspinlock, hv: Enable PV qspinlock for Hyper-V

2018-09-28 Thread Michael Kelley (EOSG)
From: Yi Sun  Sent: Wednesday, September 26, 2018 
11:02 PM

> Follow PV spinlock mechanism to implement the callback functions
> to allow the CPU idling and kicking operations on Hyper-V.
> 
> Cc: "K. Y. Srinivasan" 
> Cc: Haiyang Zhang 
> Cc: Stephen Hemminger 
> Cc: Thomas Gleixner 
> Cc: Michael Kelley (EOSG) 
> Signed-off-by: Yi Sun 
> ---

Reviewed-by: Michael Kelley 


RE: [PATCH V3 10/13] x86/hyper-v: Add HvFlushGuestAddressList hypercall support

2018-09-26 Thread Michael Kelley (EOSG)
From: Tianyu Lan  Sent: Wednesday, September 26, 2018 8:50 PM
> 
> Hyper-V provides HvFlushGuestAddressList() hypercall to flush EPT tlb
> with specified ranges. This patch is to add the hypercall support.
> 
> Signed-off-by: Lan Tianyu 
>

Looks good!

Reviewed-by:  Michael Kelley 


RE: [PATCH V3 10/13] x86/hyper-v: Add HvFlushGuestAddressList hypercall support

2018-09-26 Thread Michael Kelley (EOSG)
From: Tianyu Lan  Sent: Wednesday, September 26, 2018 8:50 PM
> 
> Hyper-V provides HvFlushGuestAddressList() hypercall to flush EPT tlb
> with specified ranges. This patch is to add the hypercall support.
> 
> Signed-off-by: Lan Tianyu 
>

Looks good!

Reviewed-by:  Michael Kelley 


RE: [PATCH -next] x86/hyper-v: Remove unused including

2018-09-23 Thread Michael Kelley (EOSG)
From: YueHaibing   Sent: Sunday, September 23, 2018 1:20 
AM

> Remove including  that don't need it.
> 
> Signed-off-by: YueHaibing 
> ---
>  arch/x86/hyperv/hv_apic.c | 1 -
>  1 file changed, 1 deletion(-)
> 
Reviewed-by:  Michael Kelley 


RE: [PATCH -next] x86/hyper-v: Remove unused including

2018-09-23 Thread Michael Kelley (EOSG)
From: YueHaibing   Sent: Sunday, September 23, 2018 1:20 
AM

> Remove including  that don't need it.
> 
> Signed-off-by: YueHaibing 
> ---
>  arch/x86/hyperv/hv_apic.c | 1 -
>  1 file changed, 1 deletion(-)
> 
Reviewed-by:  Michael Kelley 


RE: [PATCH v2 2/2] locking/pvqspinlock, hv: Enable PV qspinlock for Hyper-V

2018-09-21 Thread Michael Kelley (EOSG)
From: Yi Sun  Sent: Friday, September 21, 2018 12:25 
AM
> +
> +#define pr_fmt(fmt) "hv: " fmt

Other Hyper-V messages use "Hyper-V: " as the prefix, not "hv: ".  Take
a quick look at 'dmesg' output for reference.

> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 

Some of these #includes look like they might be leftovers from
some other code.  Please check and see whether kernel_stat.h,
debugsfs.h, log2.h, and gfp.h are actually needed.

> +static void hv_qlock_wait(u8 *byte, u8 val)
> +{
> + unsigned long msr_val;
> +
> + if (READ_ONCE(*byte) != val)
> + return;
> +
> + /*
> +  * Read HV_X64_MSR_GUEST_IDLE MSR can trigger the guest's
> +  * transition to the idle power state which can be exited
> +  * by an IPI even if IF flag is disabled.
> +  */
> + if (ms_hyperv.features & HV_X64_MSR_GUEST_IDLE_AVAILABLE)

I can't see a case where this test is actually needed.  hv_qlock_wait()
can only get called if the flag is set when hv_init_spinlocks() is run, and
the flag value doesn't change after it is set.

> + rdmsrl(HV_X64_MSR_GUEST_IDLE, msr_val);
> +}

Michael


RE: [PATCH v2 2/2] locking/pvqspinlock, hv: Enable PV qspinlock for Hyper-V

2018-09-21 Thread Michael Kelley (EOSG)
From: Yi Sun  Sent: Friday, September 21, 2018 12:25 
AM
> +
> +#define pr_fmt(fmt) "hv: " fmt

Other Hyper-V messages use "Hyper-V: " as the prefix, not "hv: ".  Take
a quick look at 'dmesg' output for reference.

> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 

Some of these #includes look like they might be leftovers from
some other code.  Please check and see whether kernel_stat.h,
debugsfs.h, log2.h, and gfp.h are actually needed.

> +static void hv_qlock_wait(u8 *byte, u8 val)
> +{
> + unsigned long msr_val;
> +
> + if (READ_ONCE(*byte) != val)
> + return;
> +
> + /*
> +  * Read HV_X64_MSR_GUEST_IDLE MSR can trigger the guest's
> +  * transition to the idle power state which can be exited
> +  * by an IPI even if IF flag is disabled.
> +  */
> + if (ms_hyperv.features & HV_X64_MSR_GUEST_IDLE_AVAILABLE)

I can't see a case where this test is actually needed.  hv_qlock_wait()
can only get called if the flag is set when hv_init_spinlocks() is run, and
the flag value doesn't change after it is set.

> + rdmsrl(HV_X64_MSR_GUEST_IDLE, msr_val);
> +}

Michael


RE: [PATCH v2 1/2] X86/Hyper-V: Add Guest IDLE MSR support

2018-09-21 Thread Michael Kelley (EOSG)
From: Yi Sun   Sent: Friday, September 21, 2018 12:25 
AM
> 
> Hyper-V may expose a HV_X64_MSR_GUEST_IDLE MSR. We can read
> HYPERV_CPUID_FEATURES eax to check it. Read this MSR can
> trigger the guest's transition to the idle power state which
> can be exited by an IPI even if IF flag is disabled.
> 
> Cc: "K. Y. Srinivasan" 
> Cc: Haiyang Zhang 
> Cc: Stephen Hemminger 
> Signed-off-by: Yi Sun 
> ---
>  arch/x86/include/asm/hyperv-tlfs.h | 5 +
>  1 file changed, 5 insertions(+)
> 

Reviewed-by: Michael Kelley 


RE: [PATCH v2 1/2] X86/Hyper-V: Add Guest IDLE MSR support

2018-09-21 Thread Michael Kelley (EOSG)
From: Yi Sun   Sent: Friday, September 21, 2018 12:25 
AM
> 
> Hyper-V may expose a HV_X64_MSR_GUEST_IDLE MSR. We can read
> HYPERV_CPUID_FEATURES eax to check it. Read this MSR can
> trigger the guest's transition to the idle power state which
> can be exited by an IPI even if IF flag is disabled.
> 
> Cc: "K. Y. Srinivasan" 
> Cc: Haiyang Zhang 
> Cc: Stephen Hemminger 
> Signed-off-by: Yi Sun 
> ---
>  arch/x86/include/asm/hyperv-tlfs.h | 5 +
>  1 file changed, 5 insertions(+)
> 

Reviewed-by: Michael Kelley 


RE: [PATCH V2 4/13] KVM/MMU: Flush tlb directly in the kvm_handle_hva_range()

2018-09-20 Thread Michael Kelley (EOSG)
From: Tianyu Lan  Sent: Thursday, September 20, 2018 7:30 AM
> On 9/20/2018 12:08 AM, Michael Kelley (EOSG) wrote:
> > From: Tianyu Lan  Sent: Monday, September 17, 2018 8:19 PM
> >> +
> >> +  if (ret && kvm_available_flush_tlb_with_range()) {
> >> +  kvm_flush_remote_tlbs_with_address(kvm,
> >> +  gfn_start,
> >> +  gfn_end - gfn_start);
> >
> > Does the above need to be gfn_end - gfn_start + 1?
> 
> The flush range depends on the input parameter frame start and frame end
> of for_each_slot_rmap_range().
> 
>  for_each_slot_rmap_range(memslot, PT_PAGE_TABLE_LEVEL,
>   PT_MAX_HUGEPAGE_LEVEL,
>   gfn_start, gfn_end - 1,
>   )
>  ret |= handler(kvm, iterator.rmap, memslot,
> iterator.gfn, iterator.level, data);
> 
> 
> The start is "gfn_start" and the end is "gfn_end - 1". The flush size is
> (gfn_end - 1) - gfn_start + 1 = gfn_end - gfn_start.
> 

Got it.  I agree.

Michael


RE: [PATCH V2 4/13] KVM/MMU: Flush tlb directly in the kvm_handle_hva_range()

2018-09-20 Thread Michael Kelley (EOSG)
From: Tianyu Lan  Sent: Thursday, September 20, 2018 7:30 AM
> On 9/20/2018 12:08 AM, Michael Kelley (EOSG) wrote:
> > From: Tianyu Lan  Sent: Monday, September 17, 2018 8:19 PM
> >> +
> >> +  if (ret && kvm_available_flush_tlb_with_range()) {
> >> +  kvm_flush_remote_tlbs_with_address(kvm,
> >> +  gfn_start,
> >> +  gfn_end - gfn_start);
> >
> > Does the above need to be gfn_end - gfn_start + 1?
> 
> The flush range depends on the input parameter frame start and frame end
> of for_each_slot_rmap_range().
> 
>  for_each_slot_rmap_range(memslot, PT_PAGE_TABLE_LEVEL,
>   PT_MAX_HUGEPAGE_LEVEL,
>   gfn_start, gfn_end - 1,
>   )
>  ret |= handler(kvm, iterator.rmap, memslot,
> iterator.gfn, iterator.level, data);
> 
> 
> The start is "gfn_start" and the end is "gfn_end - 1". The flush size is
> (gfn_end - 1) - gfn_start + 1 = gfn_end - gfn_start.
> 

Got it.  I agree.

Michael


RE: [PATCH V2 10/13] x86/hyper-v: Add HvFlushGuestAddressList hypercall support

2018-09-19 Thread Michael Kelley (EOSG)
From: Tianyu Lan  Sent: Monday, September 17, 2018 8:19 PM
>
>  #include 
>  #include 
>  #include 
>  #include 
> +#include 

Hopefully asm/kvm_host.h does not need to be #included, given
the new code structure.

> 
>  #include 
> 
> +/*
> + *  MAX_FLUSH_PAGES = "additional_pages" + 1. It's limited
> + *  by the bitwidth of "additional_pages" in union hv_gpa_page_range.
> + */
> +#define MAX_FLUSH_PAGES (2048)
> +
> +/*
> + * All input flush parameters are in single page. The max flush count
> + * is equal with how many entries of union hv_gpa_page_range can be
> + * populated in the input parameter page. MAX_FLUSH_REP_COUNT
> + * = (4096 - 16) / 8. (“Page Size” - "Address Space" - "Flags") /
> + * "GPA Range".
> + */
> +#define MAX_FLUSH_REP_COUNT (510)
> +

I would recommend putting the above two definitions in
hyperv-tlfs.h.  They are directly tied to the data structures defined
by Hyper-V in the TLFS.  Put MAX_FLUSH_PAGES immediately after
the definition for hv_gpa_page_range so that the dependency is
obvious.

For MAX_FLUSH_REP_COUNT, can you do the calculation in
the #define rather than just in the comment?  Alternatively, define
the gpa_list[] array to be of MAX_FLUSH_REP_COUNT size, and then
add a compile time assert that the size of struct
hv_guest_mapping_flush_list is exactly one page in size.   It's just
a good way to use the compiler to help check for mistakes.

Also prefix them both with HV_ since they will be more
globally visible as part of hyperv-tlfs.h.

>  int hyperv_flush_guest_mapping(u64 as)
>  {
>   struct hv_guest_mapping_flush **flush_pcpu;
> @@ -54,3 +71,89 @@ int hyperv_flush_guest_mapping(u64 as)
>   return ret;
>  }
>  EXPORT_SYMBOL_GPL(hyperv_flush_guest_mapping);
> +
> +static int fill_flush_list(union hv_gpa_page_range gpa_list[],
> + int offset, u64 start_gfn, u64 pages)
> +{
> + int gpa_n = offset;
> + u64 cur = start_gfn;
> + u64 additional_pages;
> +
> + do {
> + if (gpa_n >= MAX_FLUSH_REP_COUNT) {
> + pr_warn("Request exceeds HvFlushGuestList max flush 
> count.");
> + return -ENOSPC;

I wonder if the warning is really needed.  When the error is returned up
through the higher levels of code, won't the higher levels just fallback to
the non-enlightened flush code?  So nothing is actually goes wrong; it's just
that a slower code path gets taken.  A comment about such expectations
might be helpful.

> + }
> +
> + if (pages > MAX_FLUSH_PAGES) {
> + additional_pages = MAX_FLUSH_PAGES - 1;
> + pages -= MAX_FLUSH_PAGES;
> + } else {
> + additional_pages = pages - 1;
> + pages = 0;
> + }

The above code is really doing:

additional_pages = min(pages, MAX_FLUSH_PAGES) - 1;
pages -= additional_pages + 1;

And you might want to move the decrement of 'pages' down to the
bottom of the loop where you update the other loop variables.

> +
> + gpa_list[gpa_n].page.additional_pages = additional_pages;
> + gpa_list[gpa_n].page.largepage = false;
> + gpa_list[gpa_n].page.basepfn = cur;
> +
> + cur += additional_pages + 1;
> + gpa_n++;
> + } while (pages > 0);
> +
> + return gpa_n;
> +}
> +
> +int hyperv_flush_guest_mapping_range(u64 as, struct hyperv_tlb_range *range)
> +{
> + struct hv_guest_mapping_flush_list **flush_pcpu;
> + struct hv_guest_mapping_flush_list *flush;
> + u64 status = 0;
> + unsigned long flags;
> + int ret = -ENOTSUPP;
> + int gpa_n = 0;
> +
> + if (!hv_hypercall_pg)
> + goto fault;
> +
> + local_irq_save(flags);
> +
> + flush_pcpu = (struct hv_guest_mapping_flush_list **)
> + this_cpu_ptr(hyperv_pcpu_input_arg);
> +
> + flush = *flush_pcpu;
> + if (unlikely(!flush)) {
> + local_irq_restore(flags);
> + goto fault;
> + }
> +
> + flush->address_space = as;
> + flush->flags = 0;
> +
> + if (!range->flush_list)
> + gpa_n = fill_flush_list(flush->gpa_list, gpa_n,
> + range->start_gfn, range->pages);
> + else if (range->parse_flush_list_func)
> + gpa_n = range->parse_flush_list_func(flush->gpa_list, gpa_n,
> + range->flush_list, fill_flush_list);
> + else
> + gpa_n = -1;
> +
> + if (gpa_n < 0) {
> + local_irq_restore(flags);
> + goto fault;
> + }
> +
> + status = hv_do_rep_hypercall(HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST,
> +  gpa_n, 0, flush, NULL);
> +
> + local_irq_restore(flags);
> +
> + if (!(status & HV_HYPERCALL_RESULT_MASK))
> + ret = 0;
> + else
> + ret = status;
> +fault:
> + return ret;
> +}
> 

RE: [PATCH V2 10/13] x86/hyper-v: Add HvFlushGuestAddressList hypercall support

2018-09-19 Thread Michael Kelley (EOSG)
From: Tianyu Lan  Sent: Monday, September 17, 2018 8:19 PM
>
>  #include 
>  #include 
>  #include 
>  #include 
> +#include 

Hopefully asm/kvm_host.h does not need to be #included, given
the new code structure.

> 
>  #include 
> 
> +/*
> + *  MAX_FLUSH_PAGES = "additional_pages" + 1. It's limited
> + *  by the bitwidth of "additional_pages" in union hv_gpa_page_range.
> + */
> +#define MAX_FLUSH_PAGES (2048)
> +
> +/*
> + * All input flush parameters are in single page. The max flush count
> + * is equal with how many entries of union hv_gpa_page_range can be
> + * populated in the input parameter page. MAX_FLUSH_REP_COUNT
> + * = (4096 - 16) / 8. (“Page Size” - "Address Space" - "Flags") /
> + * "GPA Range".
> + */
> +#define MAX_FLUSH_REP_COUNT (510)
> +

I would recommend putting the above two definitions in
hyperv-tlfs.h.  They are directly tied to the data structures defined
by Hyper-V in the TLFS.  Put MAX_FLUSH_PAGES immediately after
the definition for hv_gpa_page_range so that the dependency is
obvious.

For MAX_FLUSH_REP_COUNT, can you do the calculation in
the #define rather than just in the comment?  Alternatively, define
the gpa_list[] array to be of MAX_FLUSH_REP_COUNT size, and then
add a compile time assert that the size of struct
hv_guest_mapping_flush_list is exactly one page in size.   It's just
a good way to use the compiler to help check for mistakes.

Also prefix them both with HV_ since they will be more
globally visible as part of hyperv-tlfs.h.

>  int hyperv_flush_guest_mapping(u64 as)
>  {
>   struct hv_guest_mapping_flush **flush_pcpu;
> @@ -54,3 +71,89 @@ int hyperv_flush_guest_mapping(u64 as)
>   return ret;
>  }
>  EXPORT_SYMBOL_GPL(hyperv_flush_guest_mapping);
> +
> +static int fill_flush_list(union hv_gpa_page_range gpa_list[],
> + int offset, u64 start_gfn, u64 pages)
> +{
> + int gpa_n = offset;
> + u64 cur = start_gfn;
> + u64 additional_pages;
> +
> + do {
> + if (gpa_n >= MAX_FLUSH_REP_COUNT) {
> + pr_warn("Request exceeds HvFlushGuestList max flush 
> count.");
> + return -ENOSPC;

I wonder if the warning is really needed.  When the error is returned up
through the higher levels of code, won't the higher levels just fallback to
the non-enlightened flush code?  So nothing is actually goes wrong; it's just
that a slower code path gets taken.  A comment about such expectations
might be helpful.

> + }
> +
> + if (pages > MAX_FLUSH_PAGES) {
> + additional_pages = MAX_FLUSH_PAGES - 1;
> + pages -= MAX_FLUSH_PAGES;
> + } else {
> + additional_pages = pages - 1;
> + pages = 0;
> + }

The above code is really doing:

additional_pages = min(pages, MAX_FLUSH_PAGES) - 1;
pages -= additional_pages + 1;

And you might want to move the decrement of 'pages' down to the
bottom of the loop where you update the other loop variables.

> +
> + gpa_list[gpa_n].page.additional_pages = additional_pages;
> + gpa_list[gpa_n].page.largepage = false;
> + gpa_list[gpa_n].page.basepfn = cur;
> +
> + cur += additional_pages + 1;
> + gpa_n++;
> + } while (pages > 0);
> +
> + return gpa_n;
> +}
> +
> +int hyperv_flush_guest_mapping_range(u64 as, struct hyperv_tlb_range *range)
> +{
> + struct hv_guest_mapping_flush_list **flush_pcpu;
> + struct hv_guest_mapping_flush_list *flush;
> + u64 status = 0;
> + unsigned long flags;
> + int ret = -ENOTSUPP;
> + int gpa_n = 0;
> +
> + if (!hv_hypercall_pg)
> + goto fault;
> +
> + local_irq_save(flags);
> +
> + flush_pcpu = (struct hv_guest_mapping_flush_list **)
> + this_cpu_ptr(hyperv_pcpu_input_arg);
> +
> + flush = *flush_pcpu;
> + if (unlikely(!flush)) {
> + local_irq_restore(flags);
> + goto fault;
> + }
> +
> + flush->address_space = as;
> + flush->flags = 0;
> +
> + if (!range->flush_list)
> + gpa_n = fill_flush_list(flush->gpa_list, gpa_n,
> + range->start_gfn, range->pages);
> + else if (range->parse_flush_list_func)
> + gpa_n = range->parse_flush_list_func(flush->gpa_list, gpa_n,
> + range->flush_list, fill_flush_list);
> + else
> + gpa_n = -1;
> +
> + if (gpa_n < 0) {
> + local_irq_restore(flags);
> + goto fault;
> + }
> +
> + status = hv_do_rep_hypercall(HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST,
> +  gpa_n, 0, flush, NULL);
> +
> + local_irq_restore(flags);
> +
> + if (!(status & HV_HYPERCALL_RESULT_MASK))
> + ret = 0;
> + else
> + ret = status;
> +fault:
> + return ret;
> +}
> 

RE: [PATCH V2 4/13] KVM/MMU: Flush tlb directly in the kvm_handle_hva_range()

2018-09-19 Thread Michael Kelley (EOSG)
From: Tianyu Lan  Sent: Monday, September 17, 2018 8:19 PM
> +
> + if (ret && kvm_available_flush_tlb_with_range()) {
> + kvm_flush_remote_tlbs_with_address(kvm,
> + gfn_start,
> + gfn_end - gfn_start);

Does the above need to be gfn_end - gfn_start + 1?

> + ret = 0;
> + }

Michael


RE: [PATCH V2 4/13] KVM/MMU: Flush tlb directly in the kvm_handle_hva_range()

2018-09-19 Thread Michael Kelley (EOSG)
From: Tianyu Lan  Sent: Monday, September 17, 2018 8:19 PM
> +
> + if (ret && kvm_available_flush_tlb_with_range()) {
> + kvm_flush_remote_tlbs_with_address(kvm,
> + gfn_start,
> + gfn_end - gfn_start);

Does the above need to be gfn_end - gfn_start + 1?

> + ret = 0;
> + }

Michael


RE: [PATCH V2 2/13] KVM/MMU: Add tlb flush with range helper function

2018-09-19 Thread Michael Kelley (EOSG)
From: Tianyu Lan  Sent: Monday, September 17, 2018 8:18 PM
>
> +static void kvm_flush_remote_tlbs_with_range(struct kvm *kvm,
> + struct kvm_tlb_range *range)
> +{
> + int ret = -ENOTSUPP;
> +
> + if (range && kvm_x86_ops->tlb_remote_flush_with_range) {
> + /*
> +  * Read tlbs_dirty before setting KVM_REQ_TLB_FLUSH in
> +  * kvm_make_all_cpus_request.
> +  */
> + long dirty_count = smp_load_acquire(>tlbs_dirty);
> +
> + ret = kvm_x86_ops->tlb_remote_flush_with_range(kvm, range);
> + cmpxchg(>tlbs_dirty, dirty_count, 0);
> + }

The comment and the code that manipulates kvm->tlbs_dirty appears
to have been copied from kvm_flush_remote_tlbs().  But the above
code doesn't call kvm_make_all_cpus_request().   I haven't traced
all the details, but it seems like the comment should be updated,
or the code isn't needed.

Michael


RE: [PATCH V2 2/13] KVM/MMU: Add tlb flush with range helper function

2018-09-19 Thread Michael Kelley (EOSG)
From: Tianyu Lan  Sent: Monday, September 17, 2018 8:18 PM
>
> +static void kvm_flush_remote_tlbs_with_range(struct kvm *kvm,
> + struct kvm_tlb_range *range)
> +{
> + int ret = -ENOTSUPP;
> +
> + if (range && kvm_x86_ops->tlb_remote_flush_with_range) {
> + /*
> +  * Read tlbs_dirty before setting KVM_REQ_TLB_FLUSH in
> +  * kvm_make_all_cpus_request.
> +  */
> + long dirty_count = smp_load_acquire(>tlbs_dirty);
> +
> + ret = kvm_x86_ops->tlb_remote_flush_with_range(kvm, range);
> + cmpxchg(>tlbs_dirty, dirty_count, 0);
> + }

The comment and the code that manipulates kvm->tlbs_dirty appears
to have been copied from kvm_flush_remote_tlbs().  But the above
code doesn't call kvm_make_all_cpus_request().   I haven't traced
all the details, but it seems like the comment should be updated,
or the code isn't needed.

Michael


RE: [PATCH] x86/hyperv: suppress "PCI: Fatal: No config space access function found"

2018-09-19 Thread Michael Kelley (EOSG)
From: Dexuan Cui   Sent: Tuesday, September 18, 2018 3:30 PM
> 
> A Generatin-2 Linux VM on Hyper-V doesn't have the legacy PCI bus, and
> users always see the scary warning, which is actually harmless. The patch
> is made to suppress the warning.
> 
> Signed-off-by: Dexuan Cui 
> Cc: K. Y. Srinivasan 
> Cc: Haiyang Zhang 
> Cc: Stephen Hemminger 
> ---
>  arch/x86/hyperv/hv_init.c | 19 +++
>  1 file changed, 19 insertions(+)
> 

Reviewed-by:  Michael Kelley 


RE: [PATCH] x86/hyperv: suppress "PCI: Fatal: No config space access function found"

2018-09-19 Thread Michael Kelley (EOSG)
From: Dexuan Cui   Sent: Tuesday, September 18, 2018 3:30 PM
> 
> A Generatin-2 Linux VM on Hyper-V doesn't have the legacy PCI bus, and
> users always see the scary warning, which is actually harmless. The patch
> is made to suppress the warning.
> 
> Signed-off-by: Dexuan Cui 
> Cc: K. Y. Srinivasan 
> Cc: Haiyang Zhang 
> Cc: Stephen Hemminger 
> ---
>  arch/x86/hyperv/hv_init.c | 19 +++
>  1 file changed, 19 insertions(+)
> 

Reviewed-by:  Michael Kelley 


RE: [PATCH v1 0/3] Enable PV qspinlock for Hyper-V

2018-09-15 Thread Michael Kelley (EOSG)
>From Yi Sun   Sent: Thursday, September 13, 2018 2:13 AM

> This patch adds the necessary Hyper-V specific code to allow
> PV qspinlock work on Hyper-V.
> 

Have you done any performance measurements with this
new code, so that we know whether there is any improvement,
or even potentially any degradation in some circumstances?

Michael
 



RE: [PATCH v1 0/3] Enable PV qspinlock for Hyper-V

2018-09-15 Thread Michael Kelley (EOSG)
>From Yi Sun   Sent: Thursday, September 13, 2018 2:13 AM

> This patch adds the necessary Hyper-V specific code to allow
> PV qspinlock work on Hyper-V.
> 

Have you done any performance measurements with this
new code, so that we know whether there is any improvement,
or even potentially any degradation in some circumstances?

Michael
 



RE: [PATCH] Drivers: hv: vmbus: include header for get_irq_regs()

2018-09-15 Thread Michael Kelley (EOSG)
From Sebastian Andrzej Siewior   Sent: Thursday, August 30, 2018 12:55 AM
> 
> On !RT the header file get_irq_regs() gets pulled in via other header files. 
> On
> RT it does not and the build fails:
> 
> drivers/hv/vmbus_drv.c:975 implicit declaration of function 
> ‘get_irq_regs’ [-
> Werror=implicit-function-declaration]
> drivers/hv/hv.c:115 implicit declaration of function ‘get_irq_regs’ 
> [-Werror=implicit-
> function-declaration]
> 
> Add the header file for get_irq_regs() in a common header so it used by
> vmbus_drv.c by hv.c for their get_irq_regs() usage.
> 

get_irq_regs() is not used explicitly in either vmbus_drv.c or in hv.c.  And I
couldn't make the line numbers in the errors above line up with anything
in the source code that might be implicitly using get_irq_regs().  Is it the
calls to add_interrupt_randomness()?   Did you figure out
exactly what line of code is causing the compile error?

I'm wondering whether adding the #include of irq.h into hyperv_vmbus.h
is really the right solution.  More correct might be to have the file where
get_irq_regs() is actually used to #include irq_regs.h.

Michael


RE: [PATCH] Drivers: hv: vmbus: include header for get_irq_regs()

2018-09-15 Thread Michael Kelley (EOSG)
From Sebastian Andrzej Siewior   Sent: Thursday, August 30, 2018 12:55 AM
> 
> On !RT the header file get_irq_regs() gets pulled in via other header files. 
> On
> RT it does not and the build fails:
> 
> drivers/hv/vmbus_drv.c:975 implicit declaration of function 
> ‘get_irq_regs’ [-
> Werror=implicit-function-declaration]
> drivers/hv/hv.c:115 implicit declaration of function ‘get_irq_regs’ 
> [-Werror=implicit-
> function-declaration]
> 
> Add the header file for get_irq_regs() in a common header so it used by
> vmbus_drv.c by hv.c for their get_irq_regs() usage.
> 

get_irq_regs() is not used explicitly in either vmbus_drv.c or in hv.c.  And I
couldn't make the line numbers in the errors above line up with anything
in the source code that might be implicitly using get_irq_regs().  Is it the
calls to add_interrupt_randomness()?   Did you figure out
exactly what line of code is causing the compile error?

I'm wondering whether adding the #include of irq.h into hyperv_vmbus.h
is really the right solution.  More correct might be to have the file where
get_irq_regs() is actually used to #include irq_regs.h.

Michael


RE: [PATCH 10/13] x86/hyper-v: Add HvFlushGuestAddressList hypercall support

2018-09-11 Thread Michael Kelley (EOSG)
From: Tianyu Lan  Sent: Monday, September 10, 2018 1:39 AM
> +
> +int hyperv_flush_guest_mapping_range(u64 as, struct kvm_tlb_range *range)

I'm really concerned about defining the Hyper-V function to flush
guest mappings in terms of a KVM struct definition.  Your patch puts
this function in arch/x86/hyperv/nested.c.  I haven't investigated all
the details, but on its face this approach seems like it would cause
trouble in the long run, and it doesn't support the case of a
hypervisor other than KVM running at L1.

I know that KVM code has taken a dependency on Hyper-V types and
code, but that's because KVM is emulating a lot of Hyper-V functionality
and it's taking advantage of Hyper-V enlightenments.  Is there a top
level reason I haven't thought of for Hyper-V code to take a
dependency on KVM definitions?  I would think we want Hyper-V
code to be generic, using Hyper-V data structure definitions.  Then in
keeping with what's already been done, KVM code would use those
definitions where it needs to make calls to Hyper-V code.

> +{
> + struct kvm_mmu_page *sp;
> + struct hv_guest_mapping_flush_list **flush_pcpu;
> + struct hv_guest_mapping_flush_list *flush;
> + u64 status = 0;
> + unsigned long flags;
> + int ret = -ENOTSUPP;
> + int gpa_n = 0;
> +
> + if (!hv_hypercall_pg)
> + goto fault;
> +
> + local_irq_save(flags);
> +
> + flush_pcpu = (struct hv_guest_mapping_flush_list **)
> + this_cpu_ptr(hyperv_pcpu_input_arg);
> +
> + flush = *flush_pcpu;
> + if (unlikely(!flush)) {
> + local_irq_restore(flags);
> + goto fault;
> + }
> +
> + flush->address_space = as;
> + flush->flags = 0;
> +
> + if (!range->flush_list) {
> + gpa_n = fill_flush_list(flush->gpa_list, gpa_n,
> + range->start_gfn, range->end_gfn);
> + } else {
> + list_for_each_entry(sp, range->flush_list,
> + flush_link) {
> + u64 end_gfn = sp->gfn +
> + KVM_PAGES_PER_HPAGE(sp->role.level) - 1;
> + gpa_n = fill_flush_list(flush->gpa_list, gpa_n,
> + sp->gfn, end_gfn);
> + }

Per the previous comment, if this loop really needs to walk a KVM
data structure, look for a different way to organize things so that
the handling of KVM-specific data structures is in code that’s part
of KVM, rather than in Hyper-V code.

Michael



RE: [PATCH 10/13] x86/hyper-v: Add HvFlushGuestAddressList hypercall support

2018-09-11 Thread Michael Kelley (EOSG)
From: Tianyu Lan  Sent: Monday, September 10, 2018 1:39 AM
> +
> +int hyperv_flush_guest_mapping_range(u64 as, struct kvm_tlb_range *range)

I'm really concerned about defining the Hyper-V function to flush
guest mappings in terms of a KVM struct definition.  Your patch puts
this function in arch/x86/hyperv/nested.c.  I haven't investigated all
the details, but on its face this approach seems like it would cause
trouble in the long run, and it doesn't support the case of a
hypervisor other than KVM running at L1.

I know that KVM code has taken a dependency on Hyper-V types and
code, but that's because KVM is emulating a lot of Hyper-V functionality
and it's taking advantage of Hyper-V enlightenments.  Is there a top
level reason I haven't thought of for Hyper-V code to take a
dependency on KVM definitions?  I would think we want Hyper-V
code to be generic, using Hyper-V data structure definitions.  Then in
keeping with what's already been done, KVM code would use those
definitions where it needs to make calls to Hyper-V code.

> +{
> + struct kvm_mmu_page *sp;
> + struct hv_guest_mapping_flush_list **flush_pcpu;
> + struct hv_guest_mapping_flush_list *flush;
> + u64 status = 0;
> + unsigned long flags;
> + int ret = -ENOTSUPP;
> + int gpa_n = 0;
> +
> + if (!hv_hypercall_pg)
> + goto fault;
> +
> + local_irq_save(flags);
> +
> + flush_pcpu = (struct hv_guest_mapping_flush_list **)
> + this_cpu_ptr(hyperv_pcpu_input_arg);
> +
> + flush = *flush_pcpu;
> + if (unlikely(!flush)) {
> + local_irq_restore(flags);
> + goto fault;
> + }
> +
> + flush->address_space = as;
> + flush->flags = 0;
> +
> + if (!range->flush_list) {
> + gpa_n = fill_flush_list(flush->gpa_list, gpa_n,
> + range->start_gfn, range->end_gfn);
> + } else {
> + list_for_each_entry(sp, range->flush_list,
> + flush_link) {
> + u64 end_gfn = sp->gfn +
> + KVM_PAGES_PER_HPAGE(sp->role.level) - 1;
> + gpa_n = fill_flush_list(flush->gpa_list, gpa_n,
> + sp->gfn, end_gfn);
> + }

Per the previous comment, if this loop really needs to walk a KVM
data structure, look for a different way to organize things so that
the handling of KVM-specific data structures is in code that’s part
of KVM, rather than in Hyper-V code.

Michael



RE: [PATCH v2 2/4] arm64: hyperv: Add support for Hyper-V as a hypervisor

2018-08-31 Thread Michael Kelley (EOSG)
From: KY Srinivasan  Sent: Thursday, August 30, 2018 11:51 AM

> > +   /* Allocate percpu VP index */
> > +   hv_vp_index = kmalloc_array(num_possible_cpus(),
> > sizeof(*hv_vp_index),
> > +   GFP_KERNEL);
> > +   if (!hv_vp_index)
> > +   return 1;
> > +
> We should perhaps set the array so the contents are invalid so we can 
> correctly
> handle enlightenments for TL shootdown and IPI.

Agreed.  Will add the initialization in v3 of the patch.

Michael


RE: [PATCH v2 2/4] arm64: hyperv: Add support for Hyper-V as a hypervisor

2018-08-31 Thread Michael Kelley (EOSG)
From: KY Srinivasan  Sent: Thursday, August 30, 2018 11:51 AM

> > +   /* Allocate percpu VP index */
> > +   hv_vp_index = kmalloc_array(num_possible_cpus(),
> > sizeof(*hv_vp_index),
> > +   GFP_KERNEL);
> > +   if (!hv_vp_index)
> > +   return 1;
> > +
> We should perhaps set the array so the contents are invalid so we can 
> correctly
> handle enlightenments for TL shootdown and IPI.

Agreed.  Will add the initialization in v3 of the patch.

Michael


RE: [PATCH v2 1/4] arm64: hyperv: Add core Hyper-V include files

2018-08-31 Thread Michael Kelley (EOSG)
From: KY Srinivasan Sent: Thursday, August 30, 2018 11:23 AM

> > +/*
> > + * This file contains definitions from the Hyper-V Hypervisor Top-Level
> > + * Functional Specification (TLFS):
> > + * 
> > https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/reference/tlfs
> > +
> A lot of TLFS definitions are ISA independent and we are duplicating these
> definitions both for X86_64 and ARM_64.  Perhaps we should look at splitting
> this file into a common and ISA specific header file.

I agree that we want to end up with x86_64 and ARM64 ISA dependent files
that include an ISA independent file.  My thinking was to not make that
separation now, for a couple of reasons:

1) We don't have a Hyper-V TLFS that is explicit about what should be
considered ISA independent and ISA dependent.  I can make some
reasonable guesses, but it will be subject to change as the Hyper-V team
firms up the interface and decides what they want to commit to.

2) Some of the things defined in the TLFS have names that are
x86-specific (TSC, MSR, etc.).  For the ISA independent parts, those names
should be more generic, which is another dependency on the Hyper-V
team defining the ISA independent parts of the TLFS.

My judgment was that we'll end up with less perturbation overall to go
with this cloned version of hyperv-tlfs.h for now, and then come back
and do the separation once we have a definitive TLFS to base it on.  But
it's a judgment call, and if the sense is that we should do the separation
now, I can give it a try.

> > +#define HvRegisterHypervisorVersion0x0100 /*CPUID
> > 0x4002 */
> > +#defineHvRegisterPrivilegesAndFeaturesInfo0x0200 /*CPUID
> > 0x4003 */
> > +#defineHvRegisterFeaturesInfo0x0201
> > /*CPUID 0x4004 */
> > +#defineHvRegisterImplementationLimitsInfo0x0202 /*CPUID
> > 0x4005 */
> > +#define HvARM64RegisterInterfaceVersion0x00090006 /*CPUID
> > 0x4001 */
> 
> Can we avoid the mixed case names.

Agreed. I'll fix this throughout to use all uppercase, with underscore
as the word separator.

> > + * Linux-specific definitions for managing interactions with Microsoft's
> > + * Hyper-V hypervisor. Definitions that are specified in the Hyper-V
> > + * Top Level Functional Spec (TLFS) should not go in this file, but
> > + * should instead go in hyperv-tlfs.h.
> 
> Would it make sense to breakup this header file into ISA independent and 
> dependent files?

Yes, as above I agree the separation make sense.  And since this file is tied
To Linux and not to the Hyper-V TLFS, the separation isn't affected by the
TLFS issues mentioned above.  I'll give it a try and see if any issues arise.

> > +/*
> > + * Define the IRQ numbers/vectors used by Hyper-V VMbus interrupts
> > + * and by STIMER0 Direct Mode interrupts. Hyper-V should be supplying
> > + * these values through ACPI, but there are no other interrupting
> > + * devices in a Hyper-V VM on ARM64, so it's OK to hard code for now.
> > + * The "CALLBACK_VECTOR" terminology is a left-over from the x86/x64
> > + * world that is used in architecture independent Hyper-V code.
> > + */
> When we have direct device assignment for ARM-64 guests, can we still 
> hardcode.

Yes, we can still hardcode.  These values are in the Per-Processor Interrupt
(PPI) range of 16 to 31.  Any IRQ numbers assigned to a Discrete Device
Assignment (DDA) device will be in the Shared Peripheral Interrupt (SPI)
range of 32-1019 or the Locality-specific Peripheral Interrupt (LPI) range
of greater than 8192.   The handling of DDA interrupts is still under
discussion with the Hyper-V team, but there won't be any conflicts with
the PPI values that are hardcoded here.

> > +/*
> > + * The guest OS needs to register the guest ID with the hypervisor.
> > + * The guest ID is a 64 bit entity and the structure of this ID is
> > + * specified in the Hyper-V specification:
> > + *
> > + * msdn.microsoft.com/en-
> > us/library/windows/hardware/ff542653%28v=vs.85%29.aspx
> > + *
> > + * While the current guideline does not specify how Linux guest ID(s)
> > + * need to be generated, our plan is to publish the guidelines for
> > + * Linux and other guest operating systems that currently are hosted
> > + * on Hyper-V. The implementation here conforms to this yet
> > + * unpublished guidelines.
> > + *
> > + *
> > + * Bit(s)
> > + * 63 - Indicates if the OS is Open Source or not; 1 is Open Source
> > + * 62:56 - Os Type; Linux is 0x100
> > + * 55:48 - Distro specific identification
> > + * 47:16 - Linux kernel version number
> > + * 15:0  - Distro specific identification
> > + *
> > + * Generate the guest ID based on the guideline described above.
> > + */
> 
> No need to repeat the above block comment (already included in the TLFS 
> header).

Agreed.  Will make the change in v3 of the patch.

> > +/* Free the message slot and signal end-of-message if required */
> > +static inline void vmbus_signal_eom(struct hv_message *msg, u32
> > old_msg_type)
> > +{
> > +/*
> > + 

RE: [PATCH v2 1/4] arm64: hyperv: Add core Hyper-V include files

2018-08-31 Thread Michael Kelley (EOSG)
From: KY Srinivasan Sent: Thursday, August 30, 2018 11:23 AM

> > +/*
> > + * This file contains definitions from the Hyper-V Hypervisor Top-Level
> > + * Functional Specification (TLFS):
> > + * 
> > https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/reference/tlfs
> > +
> A lot of TLFS definitions are ISA independent and we are duplicating these
> definitions both for X86_64 and ARM_64.  Perhaps we should look at splitting
> this file into a common and ISA specific header file.

I agree that we want to end up with x86_64 and ARM64 ISA dependent files
that include an ISA independent file.  My thinking was to not make that
separation now, for a couple of reasons:

1) We don't have a Hyper-V TLFS that is explicit about what should be
considered ISA independent and ISA dependent.  I can make some
reasonable guesses, but it will be subject to change as the Hyper-V team
firms up the interface and decides what they want to commit to.

2) Some of the things defined in the TLFS have names that are
x86-specific (TSC, MSR, etc.).  For the ISA independent parts, those names
should be more generic, which is another dependency on the Hyper-V
team defining the ISA independent parts of the TLFS.

My judgment was that we'll end up with less perturbation overall to go
with this cloned version of hyperv-tlfs.h for now, and then come back
and do the separation once we have a definitive TLFS to base it on.  But
it's a judgment call, and if the sense is that we should do the separation
now, I can give it a try.

> > +#define HvRegisterHypervisorVersion0x0100 /*CPUID
> > 0x4002 */
> > +#defineHvRegisterPrivilegesAndFeaturesInfo0x0200 /*CPUID
> > 0x4003 */
> > +#defineHvRegisterFeaturesInfo0x0201
> > /*CPUID 0x4004 */
> > +#defineHvRegisterImplementationLimitsInfo0x0202 /*CPUID
> > 0x4005 */
> > +#define HvARM64RegisterInterfaceVersion0x00090006 /*CPUID
> > 0x4001 */
> 
> Can we avoid the mixed case names.

Agreed. I'll fix this throughout to use all uppercase, with underscore
as the word separator.

> > + * Linux-specific definitions for managing interactions with Microsoft's
> > + * Hyper-V hypervisor. Definitions that are specified in the Hyper-V
> > + * Top Level Functional Spec (TLFS) should not go in this file, but
> > + * should instead go in hyperv-tlfs.h.
> 
> Would it make sense to breakup this header file into ISA independent and 
> dependent files?

Yes, as above I agree the separation make sense.  And since this file is tied
To Linux and not to the Hyper-V TLFS, the separation isn't affected by the
TLFS issues mentioned above.  I'll give it a try and see if any issues arise.

> > +/*
> > + * Define the IRQ numbers/vectors used by Hyper-V VMbus interrupts
> > + * and by STIMER0 Direct Mode interrupts. Hyper-V should be supplying
> > + * these values through ACPI, but there are no other interrupting
> > + * devices in a Hyper-V VM on ARM64, so it's OK to hard code for now.
> > + * The "CALLBACK_VECTOR" terminology is a left-over from the x86/x64
> > + * world that is used in architecture independent Hyper-V code.
> > + */
> When we have direct device assignment for ARM-64 guests, can we still 
> hardcode.

Yes, we can still hardcode.  These values are in the Per-Processor Interrupt
(PPI) range of 16 to 31.  Any IRQ numbers assigned to a Discrete Device
Assignment (DDA) device will be in the Shared Peripheral Interrupt (SPI)
range of 32-1019 or the Locality-specific Peripheral Interrupt (LPI) range
of greater than 8192.   The handling of DDA interrupts is still under
discussion with the Hyper-V team, but there won't be any conflicts with
the PPI values that are hardcoded here.

> > +/*
> > + * The guest OS needs to register the guest ID with the hypervisor.
> > + * The guest ID is a 64 bit entity and the structure of this ID is
> > + * specified in the Hyper-V specification:
> > + *
> > + * msdn.microsoft.com/en-
> > us/library/windows/hardware/ff542653%28v=vs.85%29.aspx
> > + *
> > + * While the current guideline does not specify how Linux guest ID(s)
> > + * need to be generated, our plan is to publish the guidelines for
> > + * Linux and other guest operating systems that currently are hosted
> > + * on Hyper-V. The implementation here conforms to this yet
> > + * unpublished guidelines.
> > + *
> > + *
> > + * Bit(s)
> > + * 63 - Indicates if the OS is Open Source or not; 1 is Open Source
> > + * 62:56 - Os Type; Linux is 0x100
> > + * 55:48 - Distro specific identification
> > + * 47:16 - Linux kernel version number
> > + * 15:0  - Distro specific identification
> > + *
> > + * Generate the guest ID based on the guideline described above.
> > + */
> 
> No need to repeat the above block comment (already included in the TLFS 
> header).

Agreed.  Will make the change in v3 of the patch.

> > +/* Free the message slot and signal end-of-message if required */
> > +static inline void vmbus_signal_eom(struct hv_message *msg, u32
> > old_msg_type)
> > +{
> > +/*
> > + 

RE: [PATCH char-misc 1/1] Drivers: hv: vmbus: Make synic_initialized flag per-cpu

2018-08-28 Thread Michael Kelley (EOSG)
From: Vitaly Kuznetsov   Sent: Wednesday, August 1, 2018 
2:26 AM

> > I was trying to decide if there are any arguments in favor of one
> > approach vs. the other:  a per-cpu flag in memory or checking
> > the synic_control "enable" bit.   Seems like a wash to me, in which
> > case I have a slight preference for the per-cpu flag in memory vs.
> > creating another function to return sctrl.enable.  But I'm completely
> > open to reasons why checking sctrl.enable is better.
> 
> Just a few thoughts: reading MSR is definitely slower but we avoid
> 'shadowing' the state, the reading is always correct. In case there's a
> chance the SynIC will get disabled from host side we can only find this
> out by doing MSR read. This is a purely theoretical possibility, I
> believe, we can go ahead with this patch.

Vitaly -- just to confirm:  you are OK with the patch as is?  (I'll
check, but I may need to rebase on the latest code.)

Michael


RE: [PATCH char-misc 1/1] Drivers: hv: vmbus: Make synic_initialized flag per-cpu

2018-08-28 Thread Michael Kelley (EOSG)
From: Vitaly Kuznetsov   Sent: Wednesday, August 1, 2018 
2:26 AM

> > I was trying to decide if there are any arguments in favor of one
> > approach vs. the other:  a per-cpu flag in memory or checking
> > the synic_control "enable" bit.   Seems like a wash to me, in which
> > case I have a slight preference for the per-cpu flag in memory vs.
> > creating another function to return sctrl.enable.  But I'm completely
> > open to reasons why checking sctrl.enable is better.
> 
> Just a few thoughts: reading MSR is definitely slower but we avoid
> 'shadowing' the state, the reading is always correct. In case there's a
> chance the SynIC will get disabled from host side we can only find this
> out by doing MSR read. This is a purely theoretical possibility, I
> believe, we can go ahead with this patch.

Vitaly -- just to confirm:  you are OK with the patch as is?  (I'll
check, but I may need to rebase on the latest code.)

Michael


RE: [PATCH 2/5] vmbus: add driver_override support

2018-08-14 Thread Michael Kelley (EOSG)
From: Stephen Hemminger   Sent: Tuesday, August 14, 
2018 9:35 AM

> On Mon, 13 Aug 2018 19:30:50 +
> "Michael Kelley (EOSG)"  wrote:
> 
> > > +/*
> > > + * Return a matching hv_vmbus_device_id pointer.
> > > + * If there is no match, return NULL.
> > > + */
> > > +static const struct hv_vmbus_device_id *hv_vmbus_get_id(struct hv_driver 
> > > *drv,
> > > + struct hv_device *dev)
> > > +{
> > > + const uuid_le *guid = >dev_type;
> > > + const struct hv_vmbus_device_id *id;
> > >
> > > - return NULL;
> > > + /* When driver_override is set, only bind to the matching driver */
> > > + if (dev->driver_override && strcmp(dev->driver_override, drv->name))
> > > + return NULL;
> >
> > This function needs to be covered by the device lock, so that
> > dev->driver_override can't be set to NULL and the memory freed
> > during the above 'if' statement.  When called from vmbus_probe(),
> > the device lock is held, so it's good. But when called from
> > vmbus_match(), the device lock may not be held: consider the path
> > __driver_attach() -> driver_match_device() -> vmbus_match().
> 
> The function hv_vmbus_get_id is called from that path.
> i.e. __device_attach -> driver-match_device -> vmbus_match.
> and __device_attach always does:
>   device_lock(dev);

Agreed.  The __device_attach() path holds the device lock and all is good.
But the __driver_attach() path does not hold the device lock when the
match function is called, leaving the code open to a potential race.  Same
problem could happen in the pci subsystem, so the issue is more generic
and probably should be evaluated and dealt with separately.

Michael

> 
> The code in driver _override_store uses the same device_lock
> when storing the new value.
> 
> This is same locking as is done in pci-sysfs.c



RE: [PATCH 2/5] vmbus: add driver_override support

2018-08-14 Thread Michael Kelley (EOSG)
From: Stephen Hemminger   Sent: Tuesday, August 14, 
2018 9:35 AM

> On Mon, 13 Aug 2018 19:30:50 +
> "Michael Kelley (EOSG)"  wrote:
> 
> > > +/*
> > > + * Return a matching hv_vmbus_device_id pointer.
> > > + * If there is no match, return NULL.
> > > + */
> > > +static const struct hv_vmbus_device_id *hv_vmbus_get_id(struct hv_driver 
> > > *drv,
> > > + struct hv_device *dev)
> > > +{
> > > + const uuid_le *guid = >dev_type;
> > > + const struct hv_vmbus_device_id *id;
> > >
> > > - return NULL;
> > > + /* When driver_override is set, only bind to the matching driver */
> > > + if (dev->driver_override && strcmp(dev->driver_override, drv->name))
> > > + return NULL;
> >
> > This function needs to be covered by the device lock, so that
> > dev->driver_override can't be set to NULL and the memory freed
> > during the above 'if' statement.  When called from vmbus_probe(),
> > the device lock is held, so it's good. But when called from
> > vmbus_match(), the device lock may not be held: consider the path
> > __driver_attach() -> driver_match_device() -> vmbus_match().
> 
> The function hv_vmbus_get_id is called from that path.
> i.e. __device_attach -> driver-match_device -> vmbus_match.
> and __device_attach always does:
>   device_lock(dev);

Agreed.  The __device_attach() path holds the device lock and all is good.
But the __driver_attach() path does not hold the device lock when the
match function is called, leaving the code open to a potential race.  Same
problem could happen in the pci subsystem, so the issue is more generic
and probably should be evaluated and dealt with separately.

Michael

> 
> The code in driver _override_store uses the same device_lock
> when storing the new value.
> 
> This is same locking as is done in pci-sysfs.c



RE: [PATCH 2/5] vmbus: add driver_override support

2018-08-13 Thread Michael Kelley (EOSG)
From: k...@linuxonhyperv.com   Sent: Friday, August 10, 
2018 4:06 PM

> From: Stephen Hemminger 
> 
> Add support for overriding the default driver for a VMBus device
> in the same way that it can be done for PCI devices. This patch
> adds the /sys/bus/vmbus/devices/.../driver_override file
> and the logic for matching.
> 
> This is used by driverctl tool to do driver override.
> https://gitlab.com/driverctl/driverctl
> 
> Signed-off-by: Stephen Hemminger 
> Signed-off-by: K. Y. Srinivasan 
> ---
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index b1b548a21f91..e6d8fdac6d8b 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -498,6 +498,54 @@ static ssize_t device_show(struct device *dev,
>  }
>  static DEVICE_ATTR_RO(device);
> 
> +static ssize_t driver_override_store(struct device *dev,
> +  struct device_attribute *attr,
> +  const char *buf, size_t count)
> +{
> + struct hv_device *hv_dev = device_to_hv_device(dev);
> + char *driver_override, *old, *cp;
> +
> + /* We need to keep extra room for a newline */
> + if (count >= (PAGE_SIZE - 1))
> + return -EINVAL;

Does 'count' actually have a relationship to PAGE_SIZE, or
is PAGE_SIZE just used as an arbitrary size limit?  I'm
wondering what happens on ARM64 with a 64K page size,
for example.  If it's just arbitrary, coding such a constant
would be better.

> +/*
> + * Return a matching hv_vmbus_device_id pointer.
> + * If there is no match, return NULL.
> + */
> +static const struct hv_vmbus_device_id *hv_vmbus_get_id(struct hv_driver 
> *drv,
> + struct hv_device *dev)
> +{
> + const uuid_le *guid = >dev_type;
> + const struct hv_vmbus_device_id *id;
> 
> - return NULL;
> + /* When driver_override is set, only bind to the matching driver */
> + if (dev->driver_override && strcmp(dev->driver_override, drv->name))
> + return NULL;

This function needs to be covered by the device lock, so that
dev->driver_override can't be set to NULL and the memory freed
during the above 'if' statement.  When called from vmbus_probe(),
the device lock is held, so it's good. But when called from
vmbus_match(), the device lock may not be held: consider the path
__driver_attach() -> driver_match_device() -> vmbus_match().

Michael


RE: [PATCH 2/5] vmbus: add driver_override support

2018-08-13 Thread Michael Kelley (EOSG)
From: k...@linuxonhyperv.com   Sent: Friday, August 10, 
2018 4:06 PM

> From: Stephen Hemminger 
> 
> Add support for overriding the default driver for a VMBus device
> in the same way that it can be done for PCI devices. This patch
> adds the /sys/bus/vmbus/devices/.../driver_override file
> and the logic for matching.
> 
> This is used by driverctl tool to do driver override.
> https://gitlab.com/driverctl/driverctl
> 
> Signed-off-by: Stephen Hemminger 
> Signed-off-by: K. Y. Srinivasan 
> ---
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index b1b548a21f91..e6d8fdac6d8b 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -498,6 +498,54 @@ static ssize_t device_show(struct device *dev,
>  }
>  static DEVICE_ATTR_RO(device);
> 
> +static ssize_t driver_override_store(struct device *dev,
> +  struct device_attribute *attr,
> +  const char *buf, size_t count)
> +{
> + struct hv_device *hv_dev = device_to_hv_device(dev);
> + char *driver_override, *old, *cp;
> +
> + /* We need to keep extra room for a newline */
> + if (count >= (PAGE_SIZE - 1))
> + return -EINVAL;

Does 'count' actually have a relationship to PAGE_SIZE, or
is PAGE_SIZE just used as an arbitrary size limit?  I'm
wondering what happens on ARM64 with a 64K page size,
for example.  If it's just arbitrary, coding such a constant
would be better.

> +/*
> + * Return a matching hv_vmbus_device_id pointer.
> + * If there is no match, return NULL.
> + */
> +static const struct hv_vmbus_device_id *hv_vmbus_get_id(struct hv_driver 
> *drv,
> + struct hv_device *dev)
> +{
> + const uuid_le *guid = >dev_type;
> + const struct hv_vmbus_device_id *id;
> 
> - return NULL;
> + /* When driver_override is set, only bind to the matching driver */
> + if (dev->driver_override && strcmp(dev->driver_override, drv->name))
> + return NULL;

This function needs to be covered by the device lock, so that
dev->driver_override can't be set to NULL and the memory freed
during the above 'if' statement.  When called from vmbus_probe(),
the device lock is held, so it's good. But when called from
vmbus_match(), the device lock may not be held: consider the path
__driver_attach() -> driver_match_device() -> vmbus_match().

Michael


RE: [PATCH 1/5] Tools: hv: Fix a bug in the key delete code

2018-08-13 Thread Michael Kelley (EOSG)
From: k...@linuxonhyperv.com  Sent: Friday, August 10, 
2018 4:06 PM
> 
> Fix a bug in the key delete code - the num_records range
> from 0 to num_records-1.
> 
> Signed-off-by: K. Y. Srinivasan 
> Reported-by: David Binderman 
> Cc: 
> ---

Reviewed-by: Michael Kelley 


RE: [PATCH 1/5] Tools: hv: Fix a bug in the key delete code

2018-08-13 Thread Michael Kelley (EOSG)
From: k...@linuxonhyperv.com  Sent: Friday, August 10, 
2018 4:06 PM
> 
> Fix a bug in the key delete code - the num_records range
> from 0 to num_records-1.
> 
> Signed-off-by: K. Y. Srinivasan 
> Reported-by: David Binderman 
> Cc: 
> ---

Reviewed-by: Michael Kelley 


RE: [PATCH char-misc 1/1] Drivers: hv: vmbus: Make synic_initialized flag per-cpu

2018-07-31 Thread Michael Kelley (EOSG)
From: Vitaly Kuznetsov  Sent: Tuesday, July 31, 2018 4:20 
AM
> 
> Reviewed-by: Vitaly Kuznetsov 

Thanks for the review

> 
> Alternatively, we can get rid of synic_initialized flag altogether:
> hv_synic_init() never fails in the first place but we can always
> implement something like:
> 
> int hv_synic_is_initialized(void) {
>   union hv_synic_scontrol sctrl;
> 
>   hv_get_synic_state(sctrl.as_uint64);
> 
>   return sctrl.enable;
> }
> 
> as it doesn't seem that we need to check synic state on _other_ CPUs.
> 
> --
>   Vitaly

I was trying to decide if there are any arguments in favor of one
approach vs. the other:  a per-cpu flag in memory or checking
the synic_control "enable" bit.   Seems like a wash to me, in which
case I have a slight preference for the per-cpu flag in memory vs.
creating another function to return sctrl.enable.  But I'm completely
open to reasons why checking sctrl.enable is better.

Michael



RE: [PATCH char-misc 1/1] Drivers: hv: vmbus: Make synic_initialized flag per-cpu

2018-07-31 Thread Michael Kelley (EOSG)
From: Vitaly Kuznetsov  Sent: Tuesday, July 31, 2018 4:20 
AM
> 
> Reviewed-by: Vitaly Kuznetsov 

Thanks for the review

> 
> Alternatively, we can get rid of synic_initialized flag altogether:
> hv_synic_init() never fails in the first place but we can always
> implement something like:
> 
> int hv_synic_is_initialized(void) {
>   union hv_synic_scontrol sctrl;
> 
>   hv_get_synic_state(sctrl.as_uint64);
> 
>   return sctrl.enable;
> }
> 
> as it doesn't seem that we need to check synic state on _other_ CPUs.
> 
> --
>   Vitaly

I was trying to decide if there are any arguments in favor of one
approach vs. the other:  a per-cpu flag in memory or checking
the synic_control "enable" bit.   Seems like a wash to me, in which
case I have a slight preference for the per-cpu flag in memory vs.
creating another function to return sctrl.enable.  But I'm completely
open to reasons why checking sctrl.enable is better.

Michael



RE: [PATCH 1/1] Drivers: HV: Send one page worth of kmsg dump over Hyper-V during panic

2018-07-11 Thread Michael Kelley (EOSG)
From: Sunil Muthuswamy Sent: Wednesday, July 11, 2018 9:59 AM
> Thanks, Michael. In which branch should I fix these now that the changes have 
> been
> merged with the char-misc-next branch?

If the original code is already in char-misc-next, you should probably submit a
completely new patch for char-misc-next that just makes the fixes to the 
original code.

Michael


RE: [PATCH 1/1] Drivers: HV: Send one page worth of kmsg dump over Hyper-V during panic

2018-07-11 Thread Michael Kelley (EOSG)
From: Sunil Muthuswamy Sent: Wednesday, July 11, 2018 9:59 AM
> Thanks, Michael. In which branch should I fix these now that the changes have 
> been
> merged with the char-misc-next branch?

If the original code is already in char-misc-next, you should probably submit a
completely new patch for char-misc-next that just makes the fixes to the 
original code.

Michael


RE: [PATCH 1/1] Drivers: HV: Send one page worth of kmsg dump over Hyper-V during panic

2018-07-10 Thread Michael Kelley (EOSG)
>From k...@linuxonhyperv.com   Sent: Saturday, July 7, 
>2018 7:57 PM
> 
> From: Sunil Muthuswamy 
> 
> In the VM mode on Hyper-V, currently, when the kernel panics, an error
> code and few register values are populated in an MSR and the Hypervisor
> notified. This information is collected on the host. The amount of
> information currently collected is found to be limited and not very
> actionable. To gather more actionable data, such as stack trace, the
> proposal is to write one page worth of kmsg data on an allocated page
> and the Hypervisor notified of the page address through the MSR.
> 
> - Sysctl option to control the behavior, with ON by default.
> 
> Cc: K. Y. Srinivasan 
> Cc: Stephen Hemminger 
> Signed-off-by: Sunil Muthuswamy 
> Signed-off-by: K. Y. Srinivasan 
> ---

> + /*
> +  * Write dump contents to the page. No need to synchronize; panic should
> +  * be single-threaded.
> +  */
> + if (!kmsg_dump_get_buffer(dumper, true, hv_panic_page,
> +   PAGE_SIZE, _written)) {
> + pr_err("Hyper-V: Unable to get kmsg data for panic\n");
> + return;

>From what I can see, the return value from kmsg_dump_get_buffer()
is not an indication of success or failure -- it's an indication of whether
there is more data available.   There's no reason to output an error
message.

> @@ -1065,6 +1136,32 @@ static int vmbus_bus_init(void)
>* Only register if the crash MSRs are available
>*/
>   if (ms_hyperv.misc_features & HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE) {
> + u64 hyperv_crash_ctl;
> + /*
> +  * Sysctl registration is not fatal, since by default
> +  * reporting is enabled.
> +  */
> + hv_ctl_table_hdr = register_sysctl_table(hv_root_table);
> + if (!hv_ctl_table_hdr)
> + pr_err("Hyper-V: sysctl table register error");
> +
> + /*
> +  * Register for panic kmsg callback only if the right
> +  * capability is supported by the hypervisor.
> +  */
> + rdmsrl(HV_X64_MSR_CRASH_CTL, hyperv_crash_ctl);
> + if (hyperv_crash_ctl & HV_CRASH_CTL_CRASH_NOTIFY_MSG) {

vmbus_drv.c is architecture independent code, and should not be referencing
x86/x64 MSRs.   Reading the MSR (and maybe the test as well?) should go
in a separate function in an x86-specific source file.

And just to confirm, is this the right way to test for the feature?  Usually,
feature determination is based on one of the feature registers.  The
NOTIFY_MSG flag seems to have a dual meaning -- on read it indicates
the feature is present.  On write in hyperv_report_panic_msg(), it evidently
means that the guest is sending a full page of data to Hyper-V.

> @@ -1081,6 +1178,11 @@ static int vmbus_bus_init(void)
>   bus_unregister(_bus);
> + free_page((unsigned long)hv_panic_page);
> + if (!hv_ctl_table_hdr) {

The above test is backwards.  Remove the bang.

> @@ -1785,10 +1887,18 @@ static void __exit vmbus_exit(void)
> + free_page((unsigned long)hv_panic_page);
> + if (!hv_ctl_table_hdr) {

Same here.  Test is backwards.

Michael



RE: [PATCH 1/1] Drivers: HV: Send one page worth of kmsg dump over Hyper-V during panic

2018-07-10 Thread Michael Kelley (EOSG)
>From k...@linuxonhyperv.com   Sent: Saturday, July 7, 
>2018 7:57 PM
> 
> From: Sunil Muthuswamy 
> 
> In the VM mode on Hyper-V, currently, when the kernel panics, an error
> code and few register values are populated in an MSR and the Hypervisor
> notified. This information is collected on the host. The amount of
> information currently collected is found to be limited and not very
> actionable. To gather more actionable data, such as stack trace, the
> proposal is to write one page worth of kmsg data on an allocated page
> and the Hypervisor notified of the page address through the MSR.
> 
> - Sysctl option to control the behavior, with ON by default.
> 
> Cc: K. Y. Srinivasan 
> Cc: Stephen Hemminger 
> Signed-off-by: Sunil Muthuswamy 
> Signed-off-by: K. Y. Srinivasan 
> ---

> + /*
> +  * Write dump contents to the page. No need to synchronize; panic should
> +  * be single-threaded.
> +  */
> + if (!kmsg_dump_get_buffer(dumper, true, hv_panic_page,
> +   PAGE_SIZE, _written)) {
> + pr_err("Hyper-V: Unable to get kmsg data for panic\n");
> + return;

>From what I can see, the return value from kmsg_dump_get_buffer()
is not an indication of success or failure -- it's an indication of whether
there is more data available.   There's no reason to output an error
message.

> @@ -1065,6 +1136,32 @@ static int vmbus_bus_init(void)
>* Only register if the crash MSRs are available
>*/
>   if (ms_hyperv.misc_features & HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE) {
> + u64 hyperv_crash_ctl;
> + /*
> +  * Sysctl registration is not fatal, since by default
> +  * reporting is enabled.
> +  */
> + hv_ctl_table_hdr = register_sysctl_table(hv_root_table);
> + if (!hv_ctl_table_hdr)
> + pr_err("Hyper-V: sysctl table register error");
> +
> + /*
> +  * Register for panic kmsg callback only if the right
> +  * capability is supported by the hypervisor.
> +  */
> + rdmsrl(HV_X64_MSR_CRASH_CTL, hyperv_crash_ctl);
> + if (hyperv_crash_ctl & HV_CRASH_CTL_CRASH_NOTIFY_MSG) {

vmbus_drv.c is architecture independent code, and should not be referencing
x86/x64 MSRs.   Reading the MSR (and maybe the test as well?) should go
in a separate function in an x86-specific source file.

And just to confirm, is this the right way to test for the feature?  Usually,
feature determination is based on one of the feature registers.  The
NOTIFY_MSG flag seems to have a dual meaning -- on read it indicates
the feature is present.  On write in hyperv_report_panic_msg(), it evidently
means that the guest is sending a full page of data to Hyper-V.

> @@ -1081,6 +1178,11 @@ static int vmbus_bus_init(void)
>   bus_unregister(_bus);
> + free_page((unsigned long)hv_panic_page);
> + if (!hv_ctl_table_hdr) {

The above test is backwards.  Remove the bang.

> @@ -1785,10 +1887,18 @@ static void __exit vmbus_exit(void)
> + free_page((unsigned long)hv_panic_page);
> + if (!hv_ctl_table_hdr) {

Same here.  Test is backwards.

Michael



RE: [PATCH V2 1/5] X86/Hyper-V: Add flush HvFlushGuestPhysicalAddressSpace hypercall support

2018-07-10 Thread Michael Kelley (EOSG)
From: Tianyu Lan  Monday, July 9, 2018 2:03 AM
> Hyper-V supports a pv hypercall HvFlushGuestPhysicalAddressSpace to
> flush nested VM address space mapping in l1 hypervisor and it's to
> reduce overhead of flushing ept tlb among vcpus. This patch is to
> implement it.
> 
> Signed-off-by: Lan Tianyu 
> ---
>  arch/x86/hyperv/Makefile   |  2 +-
>  arch/x86/hyperv/nested.c   | 64 
> ++
>  arch/x86/include/asm/hyperv-tlfs.h |  8 +
>  arch/x86/include/asm/mshyperv.h|  2 ++
>  4 files changed, 75 insertions(+), 1 deletion(-)
>  create mode 100644 arch/x86/hyperv/nested.c
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +int hyperv_flush_guest_mapping(u64 as)
> +{
> + struct hv_guest_mapping_flush **flush_pcpu;
> + struct hv_guest_mapping_flush *flush;
> + u64 status;
> + unsigned long flags;
> + int ret = -EFAULT;
> +
> + if (!hv_hypercall_pg)
> + goto fault;
> +
> + local_irq_save(flags);
> +
> + flush_pcpu = (struct hv_guest_mapping_flush **)
> + this_cpu_ptr(hyperv_pcpu_input_arg);
> +
> + flush = *flush_pcpu;
> +
> + if (unlikely(!flush)) {
> + local_irq_restore(flags);
> + goto fault;
> + }
> +
> + flush->address_space = as;
> + flush->flags = 0;
> +
> + status = hv_do_hypercall(HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE,
> +  flush, NULL);

Did you consider using a "fast" hypercall?  Unless there's some reason I'm
not aware of, a "fast" hypercall would be perfect here as there are 16 bytes
of input and no output. Vitaly recently added hv_do_fast_hypercall16()
in the linux-next tree. See __send_ipi_mask() in hv_apic.c in linux-next
for an example of usage.  With a fast hypercall, you don't need the code for
getting the per-cpu input arg or the code for local irq save/restore, so the
code that is left is a lot faster and simpler.

Michael

> + local_irq_restore(flags);
> +
> + if (!(status & HV_HYPERCALL_RESULT_MASK))
> + ret = 0;
> +
> +fault:
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(hyperv_flush_guest_mapping);


RE: [PATCH V2 1/5] X86/Hyper-V: Add flush HvFlushGuestPhysicalAddressSpace hypercall support

2018-07-10 Thread Michael Kelley (EOSG)
From: Tianyu Lan  Monday, July 9, 2018 2:03 AM
> Hyper-V supports a pv hypercall HvFlushGuestPhysicalAddressSpace to
> flush nested VM address space mapping in l1 hypervisor and it's to
> reduce overhead of flushing ept tlb among vcpus. This patch is to
> implement it.
> 
> Signed-off-by: Lan Tianyu 
> ---
>  arch/x86/hyperv/Makefile   |  2 +-
>  arch/x86/hyperv/nested.c   | 64 
> ++
>  arch/x86/include/asm/hyperv-tlfs.h |  8 +
>  arch/x86/include/asm/mshyperv.h|  2 ++
>  4 files changed, 75 insertions(+), 1 deletion(-)
>  create mode 100644 arch/x86/hyperv/nested.c
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +int hyperv_flush_guest_mapping(u64 as)
> +{
> + struct hv_guest_mapping_flush **flush_pcpu;
> + struct hv_guest_mapping_flush *flush;
> + u64 status;
> + unsigned long flags;
> + int ret = -EFAULT;
> +
> + if (!hv_hypercall_pg)
> + goto fault;
> +
> + local_irq_save(flags);
> +
> + flush_pcpu = (struct hv_guest_mapping_flush **)
> + this_cpu_ptr(hyperv_pcpu_input_arg);
> +
> + flush = *flush_pcpu;
> +
> + if (unlikely(!flush)) {
> + local_irq_restore(flags);
> + goto fault;
> + }
> +
> + flush->address_space = as;
> + flush->flags = 0;
> +
> + status = hv_do_hypercall(HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE,
> +  flush, NULL);

Did you consider using a "fast" hypercall?  Unless there's some reason I'm
not aware of, a "fast" hypercall would be perfect here as there are 16 bytes
of input and no output. Vitaly recently added hv_do_fast_hypercall16()
in the linux-next tree. See __send_ipi_mask() in hv_apic.c in linux-next
for an example of usage.  With a fast hypercall, you don't need the code for
getting the per-cpu input arg or the code for local irq save/restore, so the
code that is left is a lot faster and simpler.

Michael

> + local_irq_restore(flags);
> +
> + if (!(status & HV_HYPERCALL_RESULT_MASK))
> + ret = 0;
> +
> +fault:
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(hyperv_flush_guest_mapping);


RE: [PATCH 2/2] x86/hyper-v: check for VP_INVAL in hyperv_flush_tlb_others()

2018-07-10 Thread Michael Kelley (EOSG)
From: Vitaly Kuznetsov   Monday, July 9, 2018 10:40 AM
> Commit 1268ed0c474a ("x86/hyper-v: Fix the circular dependency in IPI
>  enlightenment") pre-filled hv_vp_index with VP_INVAL so it is now
> (theoretically) possible to observe hv_cpu_number_to_vp_number()
> returning VP_INVAL. We need to check for that in hyperv_flush_tlb_others().
> 
> Not checking for VP_INVAL on the first call site where we do
> 
>  if (hv_cpu_number_to_vp_number(cpumask_last(cpus)) >= 64)
> goto do_ex_hypercall;
> 
> is OK, in case we're eligible for non-ex hypercall we'll catch the
> issue later in for_each_cpu() cycle and in case we'll be doing ex-
> hypercall cpumask_to_vpset() will fail.
> 
> It would be nice to change hv_cpu_number_to_vp_number() return
> value's type to 'u32' but this will likely be a bigger change as
> all call sites need to be checked first.
> 
> Fixes: 1268ed0c474a ("x86/hyper-v: Fix the circular dependency in IPI 
> enlightenment")
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  arch/x86/hyperv/mmu.c | 5 +
>  1 file changed, 5 insertions(+)

Reviewed-by: Michael Kelley 


RE: [PATCH 2/2] x86/hyper-v: check for VP_INVAL in hyperv_flush_tlb_others()

2018-07-10 Thread Michael Kelley (EOSG)
From: Vitaly Kuznetsov   Monday, July 9, 2018 10:40 AM
> Commit 1268ed0c474a ("x86/hyper-v: Fix the circular dependency in IPI
>  enlightenment") pre-filled hv_vp_index with VP_INVAL so it is now
> (theoretically) possible to observe hv_cpu_number_to_vp_number()
> returning VP_INVAL. We need to check for that in hyperv_flush_tlb_others().
> 
> Not checking for VP_INVAL on the first call site where we do
> 
>  if (hv_cpu_number_to_vp_number(cpumask_last(cpus)) >= 64)
> goto do_ex_hypercall;
> 
> is OK, in case we're eligible for non-ex hypercall we'll catch the
> issue later in for_each_cpu() cycle and in case we'll be doing ex-
> hypercall cpumask_to_vpset() will fail.
> 
> It would be nice to change hv_cpu_number_to_vp_number() return
> value's type to 'u32' but this will likely be a bigger change as
> all call sites need to be checked first.
> 
> Fixes: 1268ed0c474a ("x86/hyper-v: Fix the circular dependency in IPI 
> enlightenment")
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  arch/x86/hyperv/mmu.c | 5 +
>  1 file changed, 5 insertions(+)

Reviewed-by: Michael Kelley 


RE: [PATCH 1/2] x86/hyper-v: check cpumask_to_vpset() return value in hyperv_flush_tlb_others_ex()

2018-07-10 Thread Michael Kelley (EOSG)
From: Vitaly Kuznetsov  Monday, July 9, 2018 10:40 AM
> Commit 1268ed0c474a ("x86/hyper-v: Fix the circular dependency in IPI
> enlightenment") made cpumask_to_vpset() return '-1' when there is a CPU
> with unknown VP index in the supplied set. This needs to be checked before
> we pass 'nr_bank' to hypercall.
> 
> Fixes: 1268ed0c474a ("x86/hyper-v: Fix the circular dependency in IPI 
> enlightenment")
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  arch/x86/hyperv/mmu.c | 2 ++
>  1 file changed, 2 insertions(+)

Reviewed-by: Michael Kelley 


RE: [PATCH 1/2] x86/hyper-v: check cpumask_to_vpset() return value in hyperv_flush_tlb_others_ex()

2018-07-10 Thread Michael Kelley (EOSG)
From: Vitaly Kuznetsov  Monday, July 9, 2018 10:40 AM
> Commit 1268ed0c474a ("x86/hyper-v: Fix the circular dependency in IPI
> enlightenment") made cpumask_to_vpset() return '-1' when there is a CPU
> with unknown VP index in the supplied set. This needs to be checked before
> we pass 'nr_bank' to hypercall.
> 
> Fixes: 1268ed0c474a ("x86/hyper-v: Fix the circular dependency in IPI 
> enlightenment")
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  arch/x86/hyperv/mmu.c | 2 ++
>  1 file changed, 2 insertions(+)

Reviewed-by: Michael Kelley 


RE: [PATCH 1/1] X86/Hyper-V:: Fix the circular dependency in IPI enlightenment.

2018-07-06 Thread Michael Kelley (EOSG)
> -Original Message-
> From: Thomas Gleixner 
> Sent: Friday, July 6, 2018 3:42 AM
> To: KY Srinivasan 
> Cc: Ingo Molnar ; x...@kernel.org; 
> gre...@linuxfoundation.org; linux-
> ker...@vger.kernel.org; de...@linuxdriverproject.org; o...@aepfle.de; 
> a...@canonical.com;
> jasow...@redhat.com; h...@zytor.com; Stephen Hemminger 
> ;
> Michael Kelley (EOSG) ; vkuzn...@redhat.com
> Subject: RE: [PATCH 1/1] X86/Hyper-V:: Fix the circular dependency in IPI 
> enlightenment.
> 
> On Fri, 6 Jul 2018, Thomas Gleixner wrote:
> > On Fri, 6 Jul 2018, KY Srinivasan wrote:
> > > >
> > > > The problem is that the wreckage is in Linus tree and needs to be fixed
> > > > there, i.e. via x86/urgent.
> > > >
> > > > Now we have the new bits queued in x86/hyperv already which collide. So
> > > > we
> > > > need to merge x86/urgent into x86/hyperv after applying the fix and mop 
> > > > up
> > > > the merge wreckage in x86/hyperv.
> > > >
> > > > I'll have a look tomorrow morning unless you beat me to it.
> > >
> > > I can rebase this patch against the latest tip and resend (tomorrow).
> >
> > That doesn't help as we need to push the original fix to Linus ...
> 
> Applied it to x86/urgent and fixed up the '-1' sloppy hack as pointed out
> by Vitaly and merged x86/urgent into x86/hyperv.
> 
> Please check both branches for correctness.

Changes look good to me.  Some pre-existing signed/unsigned type sloppiness is
still there in that hv_cpu_number_to_vp_number() should be 'u32' instead
of 'int', along with related local variables, but that's probably best to fix in
linux-next on top of Vitaly's other changes.  This code will work.

Michael

> 
> Thanks,
> 
>   tglx



RE: [PATCH 1/1] X86/Hyper-V:: Fix the circular dependency in IPI enlightenment.

2018-07-06 Thread Michael Kelley (EOSG)
> -Original Message-
> From: Thomas Gleixner 
> Sent: Friday, July 6, 2018 3:42 AM
> To: KY Srinivasan 
> Cc: Ingo Molnar ; x...@kernel.org; 
> gre...@linuxfoundation.org; linux-
> ker...@vger.kernel.org; de...@linuxdriverproject.org; o...@aepfle.de; 
> a...@canonical.com;
> jasow...@redhat.com; h...@zytor.com; Stephen Hemminger 
> ;
> Michael Kelley (EOSG) ; vkuzn...@redhat.com
> Subject: RE: [PATCH 1/1] X86/Hyper-V:: Fix the circular dependency in IPI 
> enlightenment.
> 
> On Fri, 6 Jul 2018, Thomas Gleixner wrote:
> > On Fri, 6 Jul 2018, KY Srinivasan wrote:
> > > >
> > > > The problem is that the wreckage is in Linus tree and needs to be fixed
> > > > there, i.e. via x86/urgent.
> > > >
> > > > Now we have the new bits queued in x86/hyperv already which collide. So
> > > > we
> > > > need to merge x86/urgent into x86/hyperv after applying the fix and mop 
> > > > up
> > > > the merge wreckage in x86/hyperv.
> > > >
> > > > I'll have a look tomorrow morning unless you beat me to it.
> > >
> > > I can rebase this patch against the latest tip and resend (tomorrow).
> >
> > That doesn't help as we need to push the original fix to Linus ...
> 
> Applied it to x86/urgent and fixed up the '-1' sloppy hack as pointed out
> by Vitaly and merged x86/urgent into x86/hyperv.
> 
> Please check both branches for correctness.

Changes look good to me.  Some pre-existing signed/unsigned type sloppiness is
still there in that hv_cpu_number_to_vp_number() should be 'u32' instead
of 'int', along with related local variables, but that's probably best to fix in
linux-next on top of Vitaly's other changes.  This code will work.

Michael

> 
> Thanks,
> 
>   tglx



RE: [PATCH 3/4] x86/hyper-v: use cheaper HVCALL_SEND_IPI hypercall when possible

2018-06-25 Thread Michael Kelley (EOSG)
> -Original Message-
> From: Vitaly Kuznetsov 
> Sent: Friday, June 22, 2018 10:06 AM
> To: x...@kernel.org
> Cc: de...@linuxdriverproject.org; linux-kernel@vger.kernel.org; KY Srinivasan
> ; Haiyang Zhang ; Stephen 
> Hemminger
> ; Thomas Gleixner ; Ingo Molnar
> ; H. Peter Anvin ; Tianyu Lan
> ; Michael Kelley (EOSG) 
> 
> Subject: [PATCH 3/4] x86/hyper-v: use cheaper HVCALL_SEND_IPI hypercall when 
> possible
> 
> When there is no need to send an IPI to a CPU with VP number > 64
> we can do the job with fast HVCALL_SEND_IPI hypercall.
> 
> Signed-off-by: Vitaly Kuznetsov 

Reviewed-by: Michael Kelley 


RE: [PATCH 3/4] x86/hyper-v: use cheaper HVCALL_SEND_IPI hypercall when possible

2018-06-25 Thread Michael Kelley (EOSG)
> -Original Message-
> From: Vitaly Kuznetsov 
> Sent: Friday, June 22, 2018 10:06 AM
> To: x...@kernel.org
> Cc: de...@linuxdriverproject.org; linux-kernel@vger.kernel.org; KY Srinivasan
> ; Haiyang Zhang ; Stephen 
> Hemminger
> ; Thomas Gleixner ; Ingo Molnar
> ; H. Peter Anvin ; Tianyu Lan
> ; Michael Kelley (EOSG) 
> 
> Subject: [PATCH 3/4] x86/hyper-v: use cheaper HVCALL_SEND_IPI hypercall when 
> possible
> 
> When there is no need to send an IPI to a CPU with VP number > 64
> we can do the job with fast HVCALL_SEND_IPI hypercall.
> 
> Signed-off-by: Vitaly Kuznetsov 

Reviewed-by: Michael Kelley 


RE: [PATCH 2/4] x86/hyper-v: use 'fast' hypercall for HVCALL_SEND_IPI

2018-06-25 Thread Michael Kelley (EOSG)
> -Original Message-
> From: Vitaly Kuznetsov 
> Sent: Friday, June 22, 2018 10:06 AM
> To: x...@kernel.org
> Cc: de...@linuxdriverproject.org; linux-kernel@vger.kernel.org; KY Srinivasan
> ; Haiyang Zhang ; Stephen 
> Hemminger
> ; Thomas Gleixner ; Ingo Molnar
> ; H. Peter Anvin ; Tianyu Lan
> ; Michael Kelley (EOSG) 
> 
> Subject: [PATCH 2/4] x86/hyper-v: use 'fast' hypercall for HVCALL_SEND_IPI
> 
> Current Hyper-V TLFS (v5.0b) claims that HvCallSendSyntheticClusterIpi
> hypercall can't be 'fast' (passing parameters through registers) but
> apparently this is not true, Windows always uses 'fast' version. We can
> do the same in Linux too.
> 
> Signed-off-by: Vitaly Kuznetsov 

Reviewed-by: Michael Kelley 


RE: [PATCH 2/4] x86/hyper-v: use 'fast' hypercall for HVCALL_SEND_IPI

2018-06-25 Thread Michael Kelley (EOSG)
> -Original Message-
> From: Vitaly Kuznetsov 
> Sent: Friday, June 22, 2018 10:06 AM
> To: x...@kernel.org
> Cc: de...@linuxdriverproject.org; linux-kernel@vger.kernel.org; KY Srinivasan
> ; Haiyang Zhang ; Stephen 
> Hemminger
> ; Thomas Gleixner ; Ingo Molnar
> ; H. Peter Anvin ; Tianyu Lan
> ; Michael Kelley (EOSG) 
> 
> Subject: [PATCH 2/4] x86/hyper-v: use 'fast' hypercall for HVCALL_SEND_IPI
> 
> Current Hyper-V TLFS (v5.0b) claims that HvCallSendSyntheticClusterIpi
> hypercall can't be 'fast' (passing parameters through registers) but
> apparently this is not true, Windows always uses 'fast' version. We can
> do the same in Linux too.
> 
> Signed-off-by: Vitaly Kuznetsov 

Reviewed-by: Michael Kelley 


RE: [PATCH] x86/hyper-v: use cheaper HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE} hypercalls when possible

2018-06-19 Thread Michael Kelley (EOSG)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org  
> On Behalf
> Of Vitaly Kuznetsov
> Sent: Friday, June 15, 2018 9:30 AM
> To: x...@kernel.org
> Cc: de...@linuxdriverproject.org; linux-kernel@vger.kernel.org; KY Srinivasan
> ; Haiyang Zhang ; Stephen 
> Hemminger
> ; Thomas Gleixner ; Ingo Molnar
> ; H. Peter Anvin ; Tianyu Lan
> 
> Subject: [PATCH] x86/hyper-v: use cheaper 
> HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE}
> hypercalls when possible
> 
> While working on Hyper-V style PV TLB flush support in KVM I noticed that
> real Windows guests use TLB flush hypercall in a somewhat smarter way: when
> the flush needs to be performed on a subset of first 64 vCPUs or on all
> present vCPUs Windows avoids more expensive hypercalls which support
> sparse CPU sets and uses their 'cheap' counterparts. This means that
> HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED name is actually a misnomer: EX
> hypercalls (which support sparse CPU sets) are "available", not
> "recommended". This makes sense as they are actually harder to parse.
> 
> Nothing stops us from being equally 'smart' in Linux too. Switch to
> doing cheaper hypercalls whenever possible.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---

This is a good idea.  We should probably do the same with the hypercalls for 
sending
IPIs -- try the simpler version first and move to the more complex _EX version 
only
if necessary.

A complication:  We've recently found a problem with the code for doing IPI
hypercalls, and the bug affects the TLB flush code as well.  As secondary CPUs
are started, there's a window of time where the hv_vp_index entry for a
secondary CPU is uninitialized.  We are seeing IPIs happening in that window, 
and
the IPI hypercall code uses the uninitialized hv_vp_index entry.   Same thing 
could
happen with the TLB flush hypercall code.  I didn't actually see any 
occurrences of
the TLB case in my tracing, but we should fix it anyway in case a TLB flush gets
added at some point in the future.

KY has a patch coming.  In the patch, hv_cpu_number_to_vp_number()
and cpumask_to_vpset() can both return U32_MAX if they encounter an
uninitialized hv_vp_index entry, and the code needs to be able to bail out to
the native functions for that particular IPI or TLB flush operation.  Once the
initialization of secondary CPUs is complete, the uninitialized situation won't
happen again, and the hypercall path will always be used.

We'll need to coordinate on these patches.  Be aware that the IPI flavor of the
bug is currently causing random failures when booting 4.18 RC1 on Hyper-V VMs
with large vCPU counts.

Reviewed-by:  Michael Kelley 


RE: [PATCH] x86/hyper-v: use cheaper HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE} hypercalls when possible

2018-06-19 Thread Michael Kelley (EOSG)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org  
> On Behalf
> Of Vitaly Kuznetsov
> Sent: Friday, June 15, 2018 9:30 AM
> To: x...@kernel.org
> Cc: de...@linuxdriverproject.org; linux-kernel@vger.kernel.org; KY Srinivasan
> ; Haiyang Zhang ; Stephen 
> Hemminger
> ; Thomas Gleixner ; Ingo Molnar
> ; H. Peter Anvin ; Tianyu Lan
> 
> Subject: [PATCH] x86/hyper-v: use cheaper 
> HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE}
> hypercalls when possible
> 
> While working on Hyper-V style PV TLB flush support in KVM I noticed that
> real Windows guests use TLB flush hypercall in a somewhat smarter way: when
> the flush needs to be performed on a subset of first 64 vCPUs or on all
> present vCPUs Windows avoids more expensive hypercalls which support
> sparse CPU sets and uses their 'cheap' counterparts. This means that
> HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED name is actually a misnomer: EX
> hypercalls (which support sparse CPU sets) are "available", not
> "recommended". This makes sense as they are actually harder to parse.
> 
> Nothing stops us from being equally 'smart' in Linux too. Switch to
> doing cheaper hypercalls whenever possible.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---

This is a good idea.  We should probably do the same with the hypercalls for 
sending
IPIs -- try the simpler version first and move to the more complex _EX version 
only
if necessary.

A complication:  We've recently found a problem with the code for doing IPI
hypercalls, and the bug affects the TLB flush code as well.  As secondary CPUs
are started, there's a window of time where the hv_vp_index entry for a
secondary CPU is uninitialized.  We are seeing IPIs happening in that window, 
and
the IPI hypercall code uses the uninitialized hv_vp_index entry.   Same thing 
could
happen with the TLB flush hypercall code.  I didn't actually see any 
occurrences of
the TLB case in my tracing, but we should fix it anyway in case a TLB flush gets
added at some point in the future.

KY has a patch coming.  In the patch, hv_cpu_number_to_vp_number()
and cpumask_to_vpset() can both return U32_MAX if they encounter an
uninitialized hv_vp_index entry, and the code needs to be able to bail out to
the native functions for that particular IPI or TLB flush operation.  Once the
initialization of secondary CPUs is complete, the uninitialized situation won't
happen again, and the hypercall path will always be used.

We'll need to coordinate on these patches.  Be aware that the IPI flavor of the
bug is currently causing random failures when booting 4.18 RC1 on Hyper-V VMs
with large vCPU counts.

Reviewed-by:  Michael Kelley 


RE: [RFC Patch 1/3] X86/Hyper-V: Add flush HvFlushGuestPhysicalAddressSpace hypercall support

2018-06-05 Thread Michael Kelley (EOSG)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org  
> On Behalf
> Of Tianyu Lan
> Sent: Monday, June 4, 2018 2:08 AM
> Cc: Tianyu Lan ; KY Srinivasan 
> ; Haiyang
> Zhang ; Stephen Hemminger ;
> t...@linutronix.de; mi...@redhat.com; h...@zytor.com; x...@kernel.org;
> pbonz...@redhat.com; rkrc...@redhat.com; de...@linuxdriverproject.org; linux-
> ker...@vger.kernel.org; k...@vger.kernel.org; vkuzn...@redhat.com
> Subject: [RFC Patch 1/3] X86/Hyper-V: Add flush 
> HvFlushGuestPhysicalAddressSpace hypercall
> support
> 
> Hyper-V provides a pv hypercall HvFlushGuestPhysicalAddressSpace to flush
> nested VM address space mapping in l1 hypervisor and it's to reduce overhead
> of flushing ept tlb among vcpus. This patch is to implement it.
> 
> Signed-off-by: Lan Tianyu 
> ---
> diff --git a/arch/x86/hyperv/nested.c b/arch/x86/hyperv/nested.c
> new file mode 100644
> index ..17f7c288eccc
> --- /dev/null
> +++ b/arch/x86/hyperv/nested.c
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +int hyperv_flush_guest_mapping(u64 as)
> +{
> + struct hv_guest_mapping_flush **flush_pcpu;
> + struct hv_guest_mapping_flush *flush;
> + u64 status = U64_MAX;

Initializing status to U64_MAX doesn't seem necessary.

> + unsigned long flags;
> + int ret = -EFAULT;
> +
> + if (!hv_hypercall_pg)
> + goto fault;
> +
> + local_irq_save(flags);
> +
> + flush_pcpu = (struct hv_guest_mapping_flush **)
> + this_cpu_ptr(hyperv_pcpu_input_arg);
> +
> + flush = *flush_pcpu;
> +
> + if (unlikely(!flush)) {
> + local_irq_restore(flags);
> + goto fault;
> + }
> +
> + flush->address_space = as;
> + flush->flags = 0;
> +
> + status = hv_do_hypercall(HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE,
> +  flush, NULL);
> + local_irq_restore(flags);
> +
> + if (!(status & HV_HYPERCALL_RESULT_MASK))
> + ret = 0;
> +
> +fault:
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(hyperv_flush_guest_mapping);
> diff --git a/arch/x86/include/asm/hyperv-tlfs.h 
> b/arch/x86/include/asm/hyperv-tlfs.h
> index b8c89265baf0..53bbeb08faea 100644
> --- a/arch/x86/include/asm/hyperv-tlfs.h
> +++ b/arch/x86/include/asm/hyperv-tlfs.h
> @@ -309,6 +309,7 @@ struct ms_hyperv_tsc_page {
>  #define HV_X64_MSR_REENLIGHTENMENT_CONTROL   0x4106
> 
>  /* Nested features (CPUID 0x400A) EAX */
> +#define HV_X64_NESTED_GUSET_MAPPING_FLUSHBIT(18)

The #define name is misspelled.  "_GUSET_" should be "_GUEST_".
And the matching usage in patch 3/3 will need to be updated as well.

Michael


RE: [RFC Patch 1/3] X86/Hyper-V: Add flush HvFlushGuestPhysicalAddressSpace hypercall support

2018-06-05 Thread Michael Kelley (EOSG)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org  
> On Behalf
> Of Tianyu Lan
> Sent: Monday, June 4, 2018 2:08 AM
> Cc: Tianyu Lan ; KY Srinivasan 
> ; Haiyang
> Zhang ; Stephen Hemminger ;
> t...@linutronix.de; mi...@redhat.com; h...@zytor.com; x...@kernel.org;
> pbonz...@redhat.com; rkrc...@redhat.com; de...@linuxdriverproject.org; linux-
> ker...@vger.kernel.org; k...@vger.kernel.org; vkuzn...@redhat.com
> Subject: [RFC Patch 1/3] X86/Hyper-V: Add flush 
> HvFlushGuestPhysicalAddressSpace hypercall
> support
> 
> Hyper-V provides a pv hypercall HvFlushGuestPhysicalAddressSpace to flush
> nested VM address space mapping in l1 hypervisor and it's to reduce overhead
> of flushing ept tlb among vcpus. This patch is to implement it.
> 
> Signed-off-by: Lan Tianyu 
> ---
> diff --git a/arch/x86/hyperv/nested.c b/arch/x86/hyperv/nested.c
> new file mode 100644
> index ..17f7c288eccc
> --- /dev/null
> +++ b/arch/x86/hyperv/nested.c
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +int hyperv_flush_guest_mapping(u64 as)
> +{
> + struct hv_guest_mapping_flush **flush_pcpu;
> + struct hv_guest_mapping_flush *flush;
> + u64 status = U64_MAX;

Initializing status to U64_MAX doesn't seem necessary.

> + unsigned long flags;
> + int ret = -EFAULT;
> +
> + if (!hv_hypercall_pg)
> + goto fault;
> +
> + local_irq_save(flags);
> +
> + flush_pcpu = (struct hv_guest_mapping_flush **)
> + this_cpu_ptr(hyperv_pcpu_input_arg);
> +
> + flush = *flush_pcpu;
> +
> + if (unlikely(!flush)) {
> + local_irq_restore(flags);
> + goto fault;
> + }
> +
> + flush->address_space = as;
> + flush->flags = 0;
> +
> + status = hv_do_hypercall(HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE,
> +  flush, NULL);
> + local_irq_restore(flags);
> +
> + if (!(status & HV_HYPERCALL_RESULT_MASK))
> + ret = 0;
> +
> +fault:
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(hyperv_flush_guest_mapping);
> diff --git a/arch/x86/include/asm/hyperv-tlfs.h 
> b/arch/x86/include/asm/hyperv-tlfs.h
> index b8c89265baf0..53bbeb08faea 100644
> --- a/arch/x86/include/asm/hyperv-tlfs.h
> +++ b/arch/x86/include/asm/hyperv-tlfs.h
> @@ -309,6 +309,7 @@ struct ms_hyperv_tsc_page {
>  #define HV_X64_MSR_REENLIGHTENMENT_CONTROL   0x4106
> 
>  /* Nested features (CPUID 0x400A) EAX */
> +#define HV_X64_NESTED_GUSET_MAPPING_FLUSHBIT(18)

The #define name is misspelled.  "_GUSET_" should be "_GUEST_".
And the matching usage in patch 3/3 will need to be updated as well.

Michael


RE: [PATCH] PCI: hv: Do not wait forever on a device that has disappeared

2018-05-28 Thread Michael Kelley (EOSG)
> 
> Before the guest finishes the device initialization, the device can be
> removed anytime by the host, and after that the host won't respond to
> the guest's request, so the guest should be prepared to handle this
> case.
> 
> Signed-off-by: Dexuan Cui 
> Cc: Stephen Hemminger 
> Cc: K. Y. Srinivasan 
> ---
>  drivers/pci/host/pci-hyperv.c | 46 
> ---
>  1 file changed, 34 insertions(+), 12 deletions(-)
> 

While this patch solves the immediate problem of getting hung waiting
for a response from Hyper-V that will never come, there's another scenario
to look at that I think introduces a race.  Suppose the guest VM issues a
vmbus_sendpacket() request in one of the cases covered by this patch,
and suppose that Hyper-V queues a response to the request, and then
immediately follows with a rescind request.   Processing the response will
get queued to a tasklet associated with the channel, while processing the
rescind will get queued to a tasklet associated with the top-level vmbus
connection.   From what I can see, the code doesn't impose any ordering
on processing the two.  If the rescind is processed first, the new
wait_for_response() function may wake up, notice the rescind flag, and
return an error.  Its caller will return an error, and in doing so pop the
completion packet off the stack.   When the response is processed later,
it will try to signal completion via a completion packet that no longer
exists, and memory corruption will likely occur.

Am I missing anything that would prevent this scenario from happening?
It is admittedly low probability, and a solution seems non-trivial.  I haven't
looked specifically, but a similar scenario is probably possible with the
drivers for other VMbus devices.  We should work on a generic solution.

Michael



RE: [PATCH] PCI: hv: Do not wait forever on a device that has disappeared

2018-05-28 Thread Michael Kelley (EOSG)
> 
> Before the guest finishes the device initialization, the device can be
> removed anytime by the host, and after that the host won't respond to
> the guest's request, so the guest should be prepared to handle this
> case.
> 
> Signed-off-by: Dexuan Cui 
> Cc: Stephen Hemminger 
> Cc: K. Y. Srinivasan 
> ---
>  drivers/pci/host/pci-hyperv.c | 46 
> ---
>  1 file changed, 34 insertions(+), 12 deletions(-)
> 

While this patch solves the immediate problem of getting hung waiting
for a response from Hyper-V that will never come, there's another scenario
to look at that I think introduces a race.  Suppose the guest VM issues a
vmbus_sendpacket() request in one of the cases covered by this patch,
and suppose that Hyper-V queues a response to the request, and then
immediately follows with a rescind request.   Processing the response will
get queued to a tasklet associated with the channel, while processing the
rescind will get queued to a tasklet associated with the top-level vmbus
connection.   From what I can see, the code doesn't impose any ordering
on processing the two.  If the rescind is processed first, the new
wait_for_response() function may wake up, notice the rescind flag, and
return an error.  Its caller will return an error, and in doing so pop the
completion packet off the stack.   When the response is processed later,
it will try to signal completion via a completion packet that no longer
exists, and memory corruption will likely occur.

Am I missing anything that would prevent this scenario from happening?
It is admittedly low probability, and a solution seems non-trivial.  I haven't
looked specifically, but a similar scenario is probably possible with the
drivers for other VMbus devices.  We should work on a generic solution.

Michael



RE: [PATCH char-misc 1/2] Drivers: hv: vmbus: Remove x86 MSR refs in arch independent code

2018-05-12 Thread Michael Kelley (EOSG)
> > -#define hv_init_timer(timer, tick) wrmsrl(timer, tick)
> > -#define hv_init_timer_config(config, val) wrmsrl(config, val)
> > +#define hv_init_timer(timer, tick) \
> > +   wrmsrl(HV_X64_MSR_STIMER0_COUNT + (2*timer), tick)
> > +#define hv_init_timer_config(timer, val) \
> > +   wrmsrl(HV_X64_MSR_STIMER0_CONFIG + (2*timer), val)
>
> Why are we stepping in units of 2?
> 
> K. Y

The register addresses for  the *_CONFIG and *_COUNT registers
are interleaved.  From hyperv-tlfs.h:

#define HV_X64_MSR_STIMER0_CONFIG   0x40B0
#define HV_X64_MSR_STIMER0_COUNT0x40B1
#define HV_X64_MSR_STIMER1_CONFIG   0x40B2
#define HV_X64_MSR_STIMER1_COUNT0x40B3
#define HV_X64_MSR_STIMER2_CONFIG   0x40B4
#define HV_X64_MSR_STIMER2_COUNT0x40B5
#define HV_X64_MSR_STIMER3_CONFIG   0x40B6
#define HV_X64_MSR_STIMER3_COUNT0x40B7

Michael


RE: [PATCH char-misc 1/2] Drivers: hv: vmbus: Remove x86 MSR refs in arch independent code

2018-05-12 Thread Michael Kelley (EOSG)
> > -#define hv_init_timer(timer, tick) wrmsrl(timer, tick)
> > -#define hv_init_timer_config(config, val) wrmsrl(config, val)
> > +#define hv_init_timer(timer, tick) \
> > +   wrmsrl(HV_X64_MSR_STIMER0_COUNT + (2*timer), tick)
> > +#define hv_init_timer_config(timer, val) \
> > +   wrmsrl(HV_X64_MSR_STIMER0_CONFIG + (2*timer), val)
>
> Why are we stepping in units of 2?
> 
> K. Y

The register addresses for  the *_CONFIG and *_COUNT registers
are interleaved.  From hyperv-tlfs.h:

#define HV_X64_MSR_STIMER0_CONFIG   0x40B0
#define HV_X64_MSR_STIMER0_COUNT0x40B1
#define HV_X64_MSR_STIMER1_CONFIG   0x40B2
#define HV_X64_MSR_STIMER1_COUNT0x40B3
#define HV_X64_MSR_STIMER2_CONFIG   0x40B4
#define HV_X64_MSR_STIMER2_COUNT0x40B5
#define HV_X64_MSR_STIMER3_CONFIG   0x40B6
#define HV_X64_MSR_STIMER3_COUNT0x40B7

Michael


RE: [PATCH V2 5/5] X86: Hyper-V: Consolidate the allocation of the hypercall input page

2018-05-09 Thread Michael Kelley (EOSG)
> -Original Message-
> From: k...@linuxonhyperv.com <k...@linuxonhyperv.com>
> Sent: Thursday, May 3, 2018 11:08 PM
> To: x...@kernel.org; gre...@linuxfoundation.org; linux-kernel@vger.kernel.org;
> de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; 
> jasow...@redhat.com;
> t...@linutronix.de; h...@zytor.com; Stephen Hemminger 
> <sthem...@microsoft.com>;
> Michael Kelley (EOSG) <michael.h.kel...@microsoft.com>; vkuzn...@redhat.com
> Cc: KY Srinivasan <k...@microsoft.com>
> Subject: [PATCH V2 5/5] X86: Hyper-V: Consolidate the allocation of the 
> hypercall input page
> 
> From: "K. Y. Srinivasan" <k...@microsoft.com>
> 
> Consolidate the allocation of the hypercall input page.
> 
> Signed-off-by: K. Y. Srinivasan <k...@microsoft.com>
> ---

Reviewed-by: Michael Kelley <mikel...@microsoft.com>


RE: [PATCH V2 5/5] X86: Hyper-V: Consolidate the allocation of the hypercall input page

2018-05-09 Thread Michael Kelley (EOSG)
> -Original Message-
> From: k...@linuxonhyperv.com 
> Sent: Thursday, May 3, 2018 11:08 PM
> To: x...@kernel.org; gre...@linuxfoundation.org; linux-kernel@vger.kernel.org;
> de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; 
> jasow...@redhat.com;
> t...@linutronix.de; h...@zytor.com; Stephen Hemminger 
> ;
> Michael Kelley (EOSG) ; vkuzn...@redhat.com
> Cc: KY Srinivasan 
> Subject: [PATCH V2 5/5] X86: Hyper-V: Consolidate the allocation of the 
> hypercall input page
> 
> From: "K. Y. Srinivasan" 
> 
> Consolidate the allocation of the hypercall input page.
> 
> Signed-off-by: K. Y. Srinivasan 
> ---

Reviewed-by: Michael Kelley 


RE: [PATCH V2 4/5] X86: Hyper-V: Consolidate code for converting cpumask to vpset

2018-05-09 Thread Michael Kelley (EOSG)
> -Original Message-
> From: k...@linuxonhyperv.com <k...@linuxonhyperv.com>
> Sent: Thursday, May 3, 2018 11:08 PM
> To: x...@kernel.org; gre...@linuxfoundation.org; linux-kernel@vger.kernel.org;
> de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; 
> jasow...@redhat.com;
> t...@linutronix.de; h...@zytor.com; Stephen Hemminger 
> <sthem...@microsoft.com>;
> Michael Kelley (EOSG) <michael.h.kel...@microsoft.com>; vkuzn...@redhat.com
> Cc: KY Srinivasan <k...@microsoft.com>
> Subject: [PATCH V2 4/5] X86: Hyper-V: Consolidate code for converting cpumask 
> to vpset
> 
> From: "K. Y. Srinivasan" <k...@microsoft.com>
> 
> Consolidate code for converting cpumask to vpset.
> 
> Signed-off-by: K. Y. Srinivasan <k...@microsoft.com>
> ---

Reviewed-by: Michael Kelley <mikel...@microsoft.com>


RE: [PATCH V2 4/5] X86: Hyper-V: Consolidate code for converting cpumask to vpset

2018-05-09 Thread Michael Kelley (EOSG)
> -Original Message-
> From: k...@linuxonhyperv.com 
> Sent: Thursday, May 3, 2018 11:08 PM
> To: x...@kernel.org; gre...@linuxfoundation.org; linux-kernel@vger.kernel.org;
> de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; 
> jasow...@redhat.com;
> t...@linutronix.de; h...@zytor.com; Stephen Hemminger 
> ;
> Michael Kelley (EOSG) ; vkuzn...@redhat.com
> Cc: KY Srinivasan 
> Subject: [PATCH V2 4/5] X86: Hyper-V: Consolidate code for converting cpumask 
> to vpset
> 
> From: "K. Y. Srinivasan" 
> 
> Consolidate code for converting cpumask to vpset.
> 
> Signed-off-by: K. Y. Srinivasan 
> ---

Reviewed-by: Michael Kelley 


RE: [PATCH V2 3/5] X86: Hyper-V: Enhanced IPI enlightenment

2018-05-09 Thread Michael Kelley (EOSG)
> -Original Message-
> From: k...@linuxonhyperv.com <k...@linuxonhyperv.com>
> Sent: Thursday, May 3, 2018 11:08 PM
> To: x...@kernel.org; gre...@linuxfoundation.org; linux-kernel@vger.kernel.org;
> de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; 
> jasow...@redhat.com;
> t...@linutronix.de; h...@zytor.com; Stephen Hemminger 
> <sthem...@microsoft.com>;
> Michael Kelley (EOSG) <michael.h.kel...@microsoft.com>; vkuzn...@redhat.com
> Cc: KY Srinivasan <k...@microsoft.com>
> Subject: [PATCH V2 3/5] X86: Hyper-V: Enhanced IPI enlightenment
> 
> From: "K. Y. Srinivasan" <k...@microsoft.com>
> 
> Support enhanced IPI enlightenments (to target more than 64 CPUs).
> 
> Signed-off-by: K. Y. Srinivasan <k...@microsoft.com>
> ---

Reviewed-by: Michael Kelley <mikel...@microsoft.com>


RE: [PATCH V2 3/5] X86: Hyper-V: Enhanced IPI enlightenment

2018-05-09 Thread Michael Kelley (EOSG)
> -Original Message-
> From: k...@linuxonhyperv.com 
> Sent: Thursday, May 3, 2018 11:08 PM
> To: x...@kernel.org; gre...@linuxfoundation.org; linux-kernel@vger.kernel.org;
> de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; 
> jasow...@redhat.com;
> t...@linutronix.de; h...@zytor.com; Stephen Hemminger 
> ;
> Michael Kelley (EOSG) ; vkuzn...@redhat.com
> Cc: KY Srinivasan 
> Subject: [PATCH V2 3/5] X86: Hyper-V: Enhanced IPI enlightenment
> 
> From: "K. Y. Srinivasan" 
> 
> Support enhanced IPI enlightenments (to target more than 64 CPUs).
> 
> Signed-off-by: K. Y. Srinivasan 
> ---

Reviewed-by: Michael Kelley 


RE: [PATCH V2 2/5] X86: Hyper-V: Enable IPI enlightenments

2018-05-09 Thread Michael Kelley (EOSG)
> -Original Message-
> From: k...@linuxonhyperv.com <k...@linuxonhyperv.com>
> Sent: Thursday, May 3, 2018 11:08 PM
> To: x...@kernel.org; gre...@linuxfoundation.org; linux-kernel@vger.kernel.org;
> de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; 
> jasow...@redhat.com;
> t...@linutronix.de; h...@zytor.com; Stephen Hemminger 
> <sthem...@microsoft.com>;
> Michael Kelley (EOSG) <michael.h.kel...@microsoft.com>; vkuzn...@redhat.com
> Cc: KY Srinivasan <k...@microsoft.com>
> Subject: [PATCH V2 2/5] X86: Hyper-V: Enable IPI enlightenments
> 
> From: "K. Y. Srinivasan" <k...@microsoft.com>
> 
> Hyper-V supports hypercalls to implement IPI; use them.
> 
> Signed-off-by: K. Y. Srinivasan <k...@microsoft.com>
> ---

Reviewed-by: Michael Kelley <mikel...@microsoft.com>


RE: [PATCH V2 2/5] X86: Hyper-V: Enable IPI enlightenments

2018-05-09 Thread Michael Kelley (EOSG)
> -Original Message-
> From: k...@linuxonhyperv.com 
> Sent: Thursday, May 3, 2018 11:08 PM
> To: x...@kernel.org; gre...@linuxfoundation.org; linux-kernel@vger.kernel.org;
> de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; 
> jasow...@redhat.com;
> t...@linutronix.de; h...@zytor.com; Stephen Hemminger 
> ;
> Michael Kelley (EOSG) ; vkuzn...@redhat.com
> Cc: KY Srinivasan 
> Subject: [PATCH V2 2/5] X86: Hyper-V: Enable IPI enlightenments
> 
> From: "K. Y. Srinivasan" 
> 
> Hyper-V supports hypercalls to implement IPI; use them.
> 
> Signed-off-by: K. Y. Srinivasan 
> ---

Reviewed-by: Michael Kelley 


RE: [PATCH V2 1/5] X86: Hyper-V: Enlighten APIC access

2018-05-09 Thread Michael Kelley (EOSG)
> -Original Message-
> From: k...@linuxonhyperv.com <k...@linuxonhyperv.com>
> Sent: Thursday, May 3, 2018 11:08 PM
> To: x...@kernel.org; gre...@linuxfoundation.org; linux-kernel@vger.kernel.org;
> de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; 
> jasow...@redhat.com;
> t...@linutronix.de; h...@zytor.com; Stephen Hemminger 
> <sthem...@microsoft.com>;
> Michael Kelley (EOSG) <michael.h.kel...@microsoft.com>; vkuzn...@redhat.com
> Cc: KY Srinivasan <k...@microsoft.com>
> Subject: [PATCH V2 1/5] X86: Hyper-V: Enlighten APIC access
> 
> From: "K. Y. Srinivasan" <k...@microsoft.com>
> 
> Hyper-V supports MSR based APIC access; implement
> the enlightenment.
> 
> Signed-off-by: K. Y. Srinivasan <k...@microsoft.com>
> ---

Reviewed-by: Michael Kelley <mikel...@microsoft.com>


RE: [PATCH V2 1/5] X86: Hyper-V: Enlighten APIC access

2018-05-09 Thread Michael Kelley (EOSG)
> -Original Message-
> From: k...@linuxonhyperv.com 
> Sent: Thursday, May 3, 2018 11:08 PM
> To: x...@kernel.org; gre...@linuxfoundation.org; linux-kernel@vger.kernel.org;
> de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; 
> jasow...@redhat.com;
> t...@linutronix.de; h...@zytor.com; Stephen Hemminger 
> ;
> Michael Kelley (EOSG) ; vkuzn...@redhat.com
> Cc: KY Srinivasan 
> Subject: [PATCH V2 1/5] X86: Hyper-V: Enlighten APIC access
> 
> From: "K. Y. Srinivasan" 
> 
> Hyper-V supports MSR based APIC access; implement
> the enlightenment.
> 
> Signed-off-by: K. Y. Srinivasan 
> ---

Reviewed-by: Michael Kelley 


RE: [PATCH 3/5] X86: Hyper-V: Enhanced IPI enlightenment

2018-04-26 Thread Michael Kelley (EOSG)
On Wed, 25 Apr 2018, KY Srinivasan  wrote:
> 
> +struct ipi_arg_ex {
> + u32 vector;
> + u32  reserved;
> + struct hv_vpset vp_set;
> +};

Again, suggest moving to hyperv-tlfs.h.  And the 5.0b version
of the TLFS has:

u32 vector;
u8  targetvtl;
u8  reserved[3];
struct hv_vpset vp_set;


 
> 
> +struct hv_vpset {
> + u64 format;
> + u64 valid_bank_mask;
> + u64 bank_contents[];
> +};

And this as well.

Michael


RE: [PATCH 3/5] X86: Hyper-V: Enhanced IPI enlightenment

2018-04-26 Thread Michael Kelley (EOSG)
On Wed, 25 Apr 2018, KY Srinivasan  wrote:
> 
> +struct ipi_arg_ex {
> + u32 vector;
> + u32  reserved;
> + struct hv_vpset vp_set;
> +};

Again, suggest moving to hyperv-tlfs.h.  And the 5.0b version
of the TLFS has:

u32 vector;
u8  targetvtl;
u8  reserved[3];
struct hv_vpset vp_set;


 
> 
> +struct hv_vpset {
> + u64 format;
> + u64 valid_bank_mask;
> + u64 bank_contents[];
> +};

And this as well.

Michael


RE: [PATCH 2/5] X86: Hyper-V: Enable IPI enlightenments

2018-04-26 Thread Michael Kelley (EOSG)
> -Original Message-
> From: k...@linuxonhyperv.com <k...@linuxonhyperv.com>
> Sent: Wednesday, April 25, 2018 11:13 AM
> To: x...@kernel.org; gre...@linuxfoundation.org; linux-kernel@vger.kernel.org;
> de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; 
> jasow...@redhat.com;
> t...@linutronix.de; h...@zytor.com; Stephen Hemminger 
> <sthem...@microsoft.com>;
> Michael Kelley (EOSG) <michael.h.kel...@microsoft.com>; vkuzn...@redhat.com
> Cc: KY Srinivasan <k...@microsoft.com>
> Subject: [PATCH 2/5] X86: Hyper-V: Enable IPI enlightenments
> 
> From: "K. Y. Srinivasan" <k...@microsoft.com>
> 
> Hyper-V supports hypercalls to implement IPI; use them.
> 
> Signed-off-by: K. Y. Srinivasan <k...@microsoft.com>
> ---
>  arch/x86/hyperv/hv_apic.c  | 125 
> +
>  arch/x86/hyperv/hv_init.c  |  17 +
>  arch/x86/include/asm/hyperv-tlfs.h |   9 +++
>  arch/x86/include/asm/mshyperv.h|   1 +
>  4 files changed, 152 insertions(+)
> 
> diff --git a/arch/x86/hyperv/hv_apic.c b/arch/x86/hyperv/hv_apic.c
> index e0a5b36208fc..7f3322ecfb01 100644
> --- a/arch/x86/hyperv/hv_apic.c
> +++ b/arch/x86/hyperv/hv_apic.c
> @@ -30,6 +30,14 @@
>  #include 
>  #include 
> 
> +struct ipi_arg_non_ex {
> + u32 vector;
> + u32 reserved;
> + u64 cpu_mask;
> +};

I think we'd like to put structures like this, which are defined in the
Hyper-V Top Level Functional Spec, in hyperv-tlfs.h.  Also, the 5.0b
version of the TLFS, which is latest, shows this structure on page 100:

u32 vector;
u8  targetvtl;
u8  reserved[3];
u64 cpu_mask;

> +
> +static struct apic orig_apic;
> +
>  static u64 hv_apic_icr_read(void)
>  {
>   u64 reg_val;
> @@ -85,8 +93,125 @@ static void hv_apic_eoi_write(u32 reg, u32 val)
>   wrmsr(HV_X64_MSR_EOI, val, 0);
>  }
> 
> +/*
> + * IPI implementation on Hyper-V.
> + */
> +
> +static int __send_ipi_mask(const struct cpumask *mask, int vector)
> +{
> + int cur_cpu, vcpu;
> + struct ipi_arg_non_ex **arg;
> + struct ipi_arg_non_ex *ipi_arg;
> + int ret = 1;
> + unsigned long flags;
> +
> + if (cpumask_empty(mask))
> + return 0;
> +
> + if (!hv_hypercall_pg)
> + return ret;
> +
> + if ((vector < HV_IPI_LOW_VECTOR) || (vector > HV_IPI_HIGH_VECTOR))
> + return ret;
> +
> + local_irq_save(flags);
> + arg = (struct ipi_arg_non_ex **)this_cpu_ptr(hyperv_pcpu_input_arg);
> +
> + ipi_arg = *arg;
> + if (unlikely(!ipi_arg))
> + goto ipi_mask_done;
> +
> +
> + ipi_arg->vector = vector;
> + ipi_arg->reserved = 0;
> + ipi_arg->cpu_mask = 0;
> +
> + for_each_cpu(cur_cpu, mask) {
> + vcpu = hv_cpu_number_to_vp_number(cur_cpu);
> + if (vcpu >= 64)
> + goto ipi_mask_done;
> +
> + __set_bit(vcpu, (unsigned long *)_arg->cpu_mask);
> + }
> +
> + ret = hv_do_hypercall(HVCALL_SEND_IPI, ipi_arg, NULL);
> +
> +ipi_mask_done:
> + local_irq_restore(flags);
> + return ret;
> +}
> +
> +static int __send_ipi_one(int cpu, int vector)
> +{
> + struct cpumask mask = CPU_MASK_NONE;
> +
> + cpumask_set_cpu(cpu, );
> + return __send_ipi_mask(, vector);
> +}
> +
> +static void hv_send_ipi(int cpu, int vector)
> +{
> + if (__send_ipi_one(cpu, vector))
> + orig_apic.send_IPI(cpu, vector);
> +}
> +
> +static void hv_send_ipi_mask(const struct cpumask *mask, int vector)
> +{
> + if (__send_ipi_mask(mask, vector))
> + orig_apic.send_IPI_mask(mask, vector);
> +}
> +
> +static void hv_send_ipi_mask_allbutself(const struct cpumask *mask, int 
> vector)
> +{
> + unsigned int this_cpu = smp_processor_id();
> + struct cpumask new_mask;
> + const struct cpumask *local_mask;
> +
> + cpumask_copy(_mask, mask);
> + cpumask_clear_cpu(this_cpu, _mask);
> + local_mask = _mask;
> + if (__send_ipi_mask(local_mask, vector))
> + orig_apic.send_IPI_mask_allbutself(mask, vector);
> +}
> +
> +static void hv_send_ipi_allbutself(int vector)
> +{
> + hv_send_ipi_mask_allbutself(cpu_online_mask, vector);
> +}
> +
> +static void hv_send_ipi_all(int vector)
> +{
> + if (__send_ipi_mask(cpu_online_mask, vector))
> + orig_apic.send_IPI_all(vector);
> +}
> +
> +static void hv_send_ipi_self(int vector)
> +{
> + if (__send_ipi_one(smp_processor_id(), vector))
> + orig_apic

RE: [PATCH 2/5] X86: Hyper-V: Enable IPI enlightenments

2018-04-26 Thread Michael Kelley (EOSG)
> -Original Message-
> From: k...@linuxonhyperv.com 
> Sent: Wednesday, April 25, 2018 11:13 AM
> To: x...@kernel.org; gre...@linuxfoundation.org; linux-kernel@vger.kernel.org;
> de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; 
> jasow...@redhat.com;
> t...@linutronix.de; h...@zytor.com; Stephen Hemminger 
> ;
> Michael Kelley (EOSG) ; vkuzn...@redhat.com
> Cc: KY Srinivasan 
> Subject: [PATCH 2/5] X86: Hyper-V: Enable IPI enlightenments
> 
> From: "K. Y. Srinivasan" 
> 
> Hyper-V supports hypercalls to implement IPI; use them.
> 
> Signed-off-by: K. Y. Srinivasan 
> ---
>  arch/x86/hyperv/hv_apic.c  | 125 
> +
>  arch/x86/hyperv/hv_init.c  |  17 +
>  arch/x86/include/asm/hyperv-tlfs.h |   9 +++
>  arch/x86/include/asm/mshyperv.h|   1 +
>  4 files changed, 152 insertions(+)
> 
> diff --git a/arch/x86/hyperv/hv_apic.c b/arch/x86/hyperv/hv_apic.c
> index e0a5b36208fc..7f3322ecfb01 100644
> --- a/arch/x86/hyperv/hv_apic.c
> +++ b/arch/x86/hyperv/hv_apic.c
> @@ -30,6 +30,14 @@
>  #include 
>  #include 
> 
> +struct ipi_arg_non_ex {
> + u32 vector;
> + u32 reserved;
> + u64 cpu_mask;
> +};

I think we'd like to put structures like this, which are defined in the
Hyper-V Top Level Functional Spec, in hyperv-tlfs.h.  Also, the 5.0b
version of the TLFS, which is latest, shows this structure on page 100:

u32 vector;
u8  targetvtl;
u8  reserved[3];
u64 cpu_mask;

> +
> +static struct apic orig_apic;
> +
>  static u64 hv_apic_icr_read(void)
>  {
>   u64 reg_val;
> @@ -85,8 +93,125 @@ static void hv_apic_eoi_write(u32 reg, u32 val)
>   wrmsr(HV_X64_MSR_EOI, val, 0);
>  }
> 
> +/*
> + * IPI implementation on Hyper-V.
> + */
> +
> +static int __send_ipi_mask(const struct cpumask *mask, int vector)
> +{
> + int cur_cpu, vcpu;
> + struct ipi_arg_non_ex **arg;
> + struct ipi_arg_non_ex *ipi_arg;
> + int ret = 1;
> + unsigned long flags;
> +
> + if (cpumask_empty(mask))
> + return 0;
> +
> + if (!hv_hypercall_pg)
> + return ret;
> +
> + if ((vector < HV_IPI_LOW_VECTOR) || (vector > HV_IPI_HIGH_VECTOR))
> + return ret;
> +
> + local_irq_save(flags);
> + arg = (struct ipi_arg_non_ex **)this_cpu_ptr(hyperv_pcpu_input_arg);
> +
> + ipi_arg = *arg;
> + if (unlikely(!ipi_arg))
> + goto ipi_mask_done;
> +
> +
> + ipi_arg->vector = vector;
> + ipi_arg->reserved = 0;
> + ipi_arg->cpu_mask = 0;
> +
> + for_each_cpu(cur_cpu, mask) {
> + vcpu = hv_cpu_number_to_vp_number(cur_cpu);
> + if (vcpu >= 64)
> + goto ipi_mask_done;
> +
> + __set_bit(vcpu, (unsigned long *)_arg->cpu_mask);
> + }
> +
> + ret = hv_do_hypercall(HVCALL_SEND_IPI, ipi_arg, NULL);
> +
> +ipi_mask_done:
> + local_irq_restore(flags);
> + return ret;
> +}
> +
> +static int __send_ipi_one(int cpu, int vector)
> +{
> + struct cpumask mask = CPU_MASK_NONE;
> +
> + cpumask_set_cpu(cpu, );
> + return __send_ipi_mask(, vector);
> +}
> +
> +static void hv_send_ipi(int cpu, int vector)
> +{
> + if (__send_ipi_one(cpu, vector))
> + orig_apic.send_IPI(cpu, vector);
> +}
> +
> +static void hv_send_ipi_mask(const struct cpumask *mask, int vector)
> +{
> + if (__send_ipi_mask(mask, vector))
> + orig_apic.send_IPI_mask(mask, vector);
> +}
> +
> +static void hv_send_ipi_mask_allbutself(const struct cpumask *mask, int 
> vector)
> +{
> + unsigned int this_cpu = smp_processor_id();
> + struct cpumask new_mask;
> + const struct cpumask *local_mask;
> +
> + cpumask_copy(_mask, mask);
> + cpumask_clear_cpu(this_cpu, _mask);
> + local_mask = _mask;
> + if (__send_ipi_mask(local_mask, vector))
> + orig_apic.send_IPI_mask_allbutself(mask, vector);
> +}
> +
> +static void hv_send_ipi_allbutself(int vector)
> +{
> + hv_send_ipi_mask_allbutself(cpu_online_mask, vector);
> +}
> +
> +static void hv_send_ipi_all(int vector)
> +{
> + if (__send_ipi_mask(cpu_online_mask, vector))
> + orig_apic.send_IPI_all(vector);
> +}
> +
> +static void hv_send_ipi_self(int vector)
> +{
> + if (__send_ipi_one(smp_processor_id(), vector))
> + orig_apic.send_IPI_self(vector);
> +}
> +
>  void __init hv_apic_init(void)
>  {
> + if (ms_hyperv.hints & HV_X64_CLUSTER_IPI_RECOMMENDED) {
> + if

RE: [PATCH 1/5] X86: Hyper-V: Enlighten APIC access

2018-04-26 Thread Michael Kelley (EOSG)
> -Original Message-
> From: k...@linuxonhyperv.com <k...@linuxonhyperv.com>
> Sent: Wednesday, April 25, 2018 11:13 AM
> To: x...@kernel.org; gre...@linuxfoundation.org; linux-kernel@vger.kernel.org;
> de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; 
> jasow...@redhat.com;
> t...@linutronix.de; h...@zytor.com; Stephen Hemminger 
> <sthem...@microsoft.com>;
> Michael Kelley (EOSG) <michael.h.kel...@microsoft.com>; vkuzn...@redhat.com
> Cc: KY Srinivasan <k...@microsoft.com>
> Subject: [PATCH 1/5] X86: Hyper-V: Enlighten APIC access
> 
> From: "K. Y. Srinivasan" <k...@microsoft.com>
> 
> Hyper-V supports MSR based APIC access; implement
> the enlightenment.
> 
> Signed-off-by: K. Y. Srinivasan <k...@microsoft.com>
> ---
>  arch/x86/hyperv/Makefile|  2 +-
>  arch/x86/hyperv/hv_apic.c   | 98 
> +
>  arch/x86/hyperv/hv_init.c   |  5 ++-
>  arch/x86/include/asm/mshyperv.h |  6 ++-
>  4 files changed, 107 insertions(+), 4 deletions(-)
>  create mode 100644 arch/x86/hyperv/hv_apic.c
> 
> diff --git a/arch/x86/hyperv/Makefile b/arch/x86/hyperv/Makefile
> index 367a8203cfcf..00ce4df01a09 100644
> --- a/arch/x86/hyperv/Makefile
> +++ b/arch/x86/hyperv/Makefile
> @@ -1 +1 @@
> -obj-y:= hv_init.o mmu.o
> +obj-y:= hv_init.o mmu.o hv_apic.o
> diff --git a/arch/x86/hyperv/hv_apic.c b/arch/x86/hyperv/hv_apic.c
> new file mode 100644
> index ..e0a5b36208fc
> --- /dev/null
> +++ b/arch/x86/hyperv/hv_apic.c
> @@ -0,0 +1,98 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Hyper-V specific APIC code.
> + *
> + * Copyright (C) 2018, Microsoft, Inc.
> + *
> + * Author : K. Y. Srinivasan <k...@microsoft.com>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published
> + * by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
> + * NON INFRINGEMENT.  See the GNU General Public License for more
> + * details.
> + *
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +static u64 hv_apic_icr_read(void)
> +{
> + u64 reg_val;
> +
> + rdmsrl(HV_X64_MSR_ICR, reg_val);
> + return reg_val;
> +}
> +
> +static void hv_apic_icr_write(u32 low, u32 id)
> +{
> + u64 reg_val;
> +
> + reg_val = SET_APIC_DEST_FIELD(id);
> + reg_val = reg_val << 32;
> + reg_val |= low;
> +
> + wrmsrl(HV_X64_MSR_ICR, reg_val);
> +}
> +
> +static u32 hv_apic_read(u32 reg)
> +{
> + u32 reg_val, hi;
> +
> + switch (reg) {
> + case APIC_EOI:
> + rdmsr(HV_X64_MSR_EOI, reg_val, hi);
> + return reg_val;
> + case APIC_TASKPRI:
> + rdmsr(HV_X64_MSR_TPR, reg_val, hi);
> + return reg_val;
> +
> + default:
> + return native_apic_mem_read(reg);
> + }
> +}
> +
> +static void hv_apic_write(u32 reg, u32 val)
> +{
> + switch (reg) {
> + case APIC_EOI:
> + wrmsr(HV_X64_MSR_EOI, val, 0);
> + break;
> + case APIC_TASKPRI:
> + wrmsr(HV_X64_MSR_TPR, val, 0);
> + break;
> + default:
> + native_apic_mem_write(reg, val);
> + }
> +}
> +
> +static void hv_apic_eoi_write(u32 reg, u32 val)
> +{
> + wrmsr(HV_X64_MSR_EOI, val, 0);
> +}
> +
> +void __init hv_apic_init(void)
> +{
> + if (ms_hyperv.hints & HV_X64_APIC_ACCESS_RECOMMENDED) {
> + pr_info("Hyper-V: Using MSR ased APIC access\n");

Typo here.  "ased" should be "based".

> + apic_set_eoi_write(hv_apic_eoi_write);
> + apic->read  = hv_apic_read;
> + apic->write = hv_apic_write;
> + apic->icr_write = hv_apic_icr_write;
> + apic->icr_read  = hv_apic_icr_read;
> + }
> +}
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index cfecc2272f2d..71e50fc2b7ef 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -242,8 +242,9 @@ static int hv_cpu_die(unsigned int cpu)
>   *
>   * 1. Setup the hypercall page.
>   * 2. Register Hyper-V specific clocksource.
> +

RE: [PATCH 1/5] X86: Hyper-V: Enlighten APIC access

2018-04-26 Thread Michael Kelley (EOSG)
> -Original Message-
> From: k...@linuxonhyperv.com 
> Sent: Wednesday, April 25, 2018 11:13 AM
> To: x...@kernel.org; gre...@linuxfoundation.org; linux-kernel@vger.kernel.org;
> de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; 
> jasow...@redhat.com;
> t...@linutronix.de; h...@zytor.com; Stephen Hemminger 
> ;
> Michael Kelley (EOSG) ; vkuzn...@redhat.com
> Cc: KY Srinivasan 
> Subject: [PATCH 1/5] X86: Hyper-V: Enlighten APIC access
> 
> From: "K. Y. Srinivasan" 
> 
> Hyper-V supports MSR based APIC access; implement
> the enlightenment.
> 
> Signed-off-by: K. Y. Srinivasan 
> ---
>  arch/x86/hyperv/Makefile|  2 +-
>  arch/x86/hyperv/hv_apic.c   | 98 
> +
>  arch/x86/hyperv/hv_init.c   |  5 ++-
>  arch/x86/include/asm/mshyperv.h |  6 ++-
>  4 files changed, 107 insertions(+), 4 deletions(-)
>  create mode 100644 arch/x86/hyperv/hv_apic.c
> 
> diff --git a/arch/x86/hyperv/Makefile b/arch/x86/hyperv/Makefile
> index 367a8203cfcf..00ce4df01a09 100644
> --- a/arch/x86/hyperv/Makefile
> +++ b/arch/x86/hyperv/Makefile
> @@ -1 +1 @@
> -obj-y:= hv_init.o mmu.o
> +obj-y:= hv_init.o mmu.o hv_apic.o
> diff --git a/arch/x86/hyperv/hv_apic.c b/arch/x86/hyperv/hv_apic.c
> new file mode 100644
> index ..e0a5b36208fc
> --- /dev/null
> +++ b/arch/x86/hyperv/hv_apic.c
> @@ -0,0 +1,98 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Hyper-V specific APIC code.
> + *
> + * Copyright (C) 2018, Microsoft, Inc.
> + *
> + * Author : K. Y. Srinivasan 
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published
> + * by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
> + * NON INFRINGEMENT.  See the GNU General Public License for more
> + * details.
> + *
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +static u64 hv_apic_icr_read(void)
> +{
> + u64 reg_val;
> +
> + rdmsrl(HV_X64_MSR_ICR, reg_val);
> + return reg_val;
> +}
> +
> +static void hv_apic_icr_write(u32 low, u32 id)
> +{
> + u64 reg_val;
> +
> + reg_val = SET_APIC_DEST_FIELD(id);
> + reg_val = reg_val << 32;
> + reg_val |= low;
> +
> + wrmsrl(HV_X64_MSR_ICR, reg_val);
> +}
> +
> +static u32 hv_apic_read(u32 reg)
> +{
> + u32 reg_val, hi;
> +
> + switch (reg) {
> + case APIC_EOI:
> + rdmsr(HV_X64_MSR_EOI, reg_val, hi);
> + return reg_val;
> + case APIC_TASKPRI:
> + rdmsr(HV_X64_MSR_TPR, reg_val, hi);
> + return reg_val;
> +
> + default:
> + return native_apic_mem_read(reg);
> + }
> +}
> +
> +static void hv_apic_write(u32 reg, u32 val)
> +{
> + switch (reg) {
> + case APIC_EOI:
> + wrmsr(HV_X64_MSR_EOI, val, 0);
> + break;
> + case APIC_TASKPRI:
> + wrmsr(HV_X64_MSR_TPR, val, 0);
> + break;
> + default:
> + native_apic_mem_write(reg, val);
> + }
> +}
> +
> +static void hv_apic_eoi_write(u32 reg, u32 val)
> +{
> + wrmsr(HV_X64_MSR_EOI, val, 0);
> +}
> +
> +void __init hv_apic_init(void)
> +{
> + if (ms_hyperv.hints & HV_X64_APIC_ACCESS_RECOMMENDED) {
> + pr_info("Hyper-V: Using MSR ased APIC access\n");

Typo here.  "ased" should be "based".

> + apic_set_eoi_write(hv_apic_eoi_write);
> + apic->read  = hv_apic_read;
> + apic->write = hv_apic_write;
> + apic->icr_write = hv_apic_icr_write;
> + apic->icr_read  = hv_apic_icr_read;
> + }
> +}
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index cfecc2272f2d..71e50fc2b7ef 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -242,8 +242,9 @@ static int hv_cpu_die(unsigned int cpu)
>   *
>   * 1. Setup the hypercall page.
>   * 2. Register Hyper-V specific clocksource.
> + * 3. Setup Hyper-V specific APIC entry points.
>   */
> -void hyperv_init(void)
> +void __init hyperv_init(void)
>  {
>   u64 guest_id, required_msrs;
>   union hv_x64_msr_hyperca

RE: [Patch v2 1/6] cifs: smbd: Check for iov length on sending the last iov

2018-04-22 Thread Michael Kelley (EOSG)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org  
> On Behalf
> Of Long Li
> Sent: Tuesday, April 17, 2018 12:17 PM
> To: Steve French ; linux-c...@vger.kernel.org; samba-
> techni...@lists.samba.org; linux-kernel@vger.kernel.org; 
> linux-r...@vger.kernel.org
> Cc: Long Li ; sta...@vger.kernel.org
> Subject: [Patch v2 1/6] cifs: smbd: Check for iov length on sending the last 
> iov
> 
> From: Long Li 
> 
> When sending the last iov that breaks into smaller buffers to fit the
> transfer size, it's necessary to check if this is the last iov.
> 
> If this is the latest iov, stop and proceed to send pages.
> 
> Signed-off-by: Long Li 
> Cc: sta...@vger.kernel.org

Reviewed-by: Michael Kelley 

But a question unrelated to this change arose during my review:  At the
beginning and end of smbd_send(), the field smbd_send_pending is
incremented and decremented, respectively.   The increment/decrement
are not done as atomic operations.  Is this code guaranteed to be single
threaded?  If not, the count could become corrupted, and
smbd_destroy_rdma_work(), which waits for the count to become zero,
could hang.  A similar question applies to smbd_recv_pending in smbd_recv().

> ---
>  fs/cifs/smbdirect.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
> index 90e673c..b5c6c0d 100644
> --- a/fs/cifs/smbdirect.c
> +++ b/fs/cifs/smbdirect.c
> @@ -2197,6 +2197,8 @@ int smbd_send(struct smbd_connection *info, struct 
> smb_rqst *rqst)
>   goto done;
>   }
>   i++;
> + if (i == rqst->rq_nvec)
> + break;
>   }
>   start = i;
>   buflen = 0;
> --
> 2.7.4



RE: [Patch v2 1/6] cifs: smbd: Check for iov length on sending the last iov

2018-04-22 Thread Michael Kelley (EOSG)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org  
> On Behalf
> Of Long Li
> Sent: Tuesday, April 17, 2018 12:17 PM
> To: Steve French ; linux-c...@vger.kernel.org; samba-
> techni...@lists.samba.org; linux-kernel@vger.kernel.org; 
> linux-r...@vger.kernel.org
> Cc: Long Li ; sta...@vger.kernel.org
> Subject: [Patch v2 1/6] cifs: smbd: Check for iov length on sending the last 
> iov
> 
> From: Long Li 
> 
> When sending the last iov that breaks into smaller buffers to fit the
> transfer size, it's necessary to check if this is the last iov.
> 
> If this is the latest iov, stop and proceed to send pages.
> 
> Signed-off-by: Long Li 
> Cc: sta...@vger.kernel.org

Reviewed-by: Michael Kelley 

But a question unrelated to this change arose during my review:  At the
beginning and end of smbd_send(), the field smbd_send_pending is
incremented and decremented, respectively.   The increment/decrement
are not done as atomic operations.  Is this code guaranteed to be single
threaded?  If not, the count could become corrupted, and
smbd_destroy_rdma_work(), which waits for the count to become zero,
could hang.  A similar question applies to smbd_recv_pending in smbd_recv().

> ---
>  fs/cifs/smbdirect.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
> index 90e673c..b5c6c0d 100644
> --- a/fs/cifs/smbdirect.c
> +++ b/fs/cifs/smbdirect.c
> @@ -2197,6 +2197,8 @@ int smbd_send(struct smbd_connection *info, struct 
> smb_rqst *rqst)
>   goto done;
>   }
>   i++;
> + if (i == rqst->rq_nvec)
> + break;
>   }
>   start = i;
>   buflen = 0;
> --
> 2.7.4



RE: [Patch v2] Storvsc: Select channel based on available percentage of ring buffer to write

2018-04-22 Thread Michael Kelley (EOSG)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org  
> On Behalf
> Of Long Li
> Sent: Thursday, April 19, 2018 2:54 PM
> To: KY Srinivasan ; Haiyang Zhang 
> ; Stephen
> Hemminger ; James E . J . Bottomley 
> ;
> Martin K . Petersen ; 
> de...@linuxdriverproject.org; linux-
> s...@vger.kernel.org; linux-kernel@vger.kernel.org
> Cc: Long Li 
> Subject: [Patch v2] Storvsc: Select channel based on available percentage of 
> ring buffer to
> write
> 
> From: Long Li 
> 
> This is a best effort for estimating on how busy the ring buffer is for
> that channel, based on available buffer to write in percentage. It is still
> possible that at the time of actual ring buffer write, the space may not be
> available due to other processes may be writing at the time.
> 
> Selecting a channel based on how full it is can reduce the possibility that
> a ring buffer write will fail, and avoid the situation a channel is over
> busy.
> 
> Now it's possible that storvsc can use a smaller ring buffer size
> (e.g. 40k bytes) to take advantage of cache locality.
> 
> Changes.
> v2: Pre-allocate struct cpumask on the heap.
> Struct cpumask is a big structure (1k bytes) when CONFIG_NR_CPUS=8192 (default
> value when CONFIG_MAXSMP=y). Don't use kernel stack for it by pre-allocating
> them using kmalloc when channels are first initialized.
> 
> Signed-off-by: Long Li 
> ---
>  drivers/scsi/storvsc_drv.c | 90 
> --
>  1 file changed, 72 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> index a2ec0bc9e9fa..2a9fff94dd1a 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -395,6 +395,12 @@ MODULE_PARM_DESC(storvsc_ringbuffer_size, "Ring buffer 
> size
> (bytes)");
> 
>  module_param(storvsc_vcpus_per_sub_channel, int, S_IRUGO);
>  MODULE_PARM_DESC(storvsc_vcpus_per_sub_channel, "Ratio of VCPUs to 
> subchannels");
> +
> +static int ring_avail_percent_lowater = 10;
> +module_param(ring_avail_percent_lowater, int, S_IRUGO);
> +MODULE_PARM_DESC(ring_avail_percent_lowater,
> + "Select a channel if available ring size > this in percent");
> +
>  /*
>   * Timeout in seconds for all devices managed by this driver.
>   */
> @@ -468,6 +474,13 @@ struct storvsc_device {
>* Mask of CPUs bound to subchannels.
>*/
>   struct cpumask alloced_cpus;
> + /*
> +  * Pre-allocated struct cpumask for each hardware queue.
> +  * struct cpumask is used by selecting out-going channels. It is a
> +  * big structure, default to 1024k bytes when CONFIG_MAXSMP=y.

I think you mean "1024 bytes" or "1k bytes" in the above comment.

> +  * Pre-allocate it to avoid allocation on the kernel stack.
> +  */
> + struct cpumask *cpumask_chns;
>   /* Used for vsc/vsp channel reset process */
>   struct storvsc_cmd_request init_request;
>   struct storvsc_cmd_request reset_request;
> @@ -872,6 +885,13 @@ static int storvsc_channel_init(struct hv_device 
> *device, bool is_fc)
>   if (stor_device->stor_chns == NULL)
>   return -ENOMEM;
> 
> + stor_device->cpumask_chns = kcalloc(num_possible_cpus(),
> + sizeof(struct cpumask), GFP_KERNEL);

Note that num_possible_cpus() is 240 for a Hyper-V 2016 guest unless 
overridden on the kernel boot line, so this is going to allocate 240 Kbytes for 
each
synthetic SCSI controller.  On an Azure VM, which has two IDE and two SCSI
controllers, this is nearly 1 Mbyte.  It's unfortunate to have to allocate this 
much
memory for a what is essentially a temporary variable.   Further down in these
comments, I've proposed an alternate implementation of the code that avoids
the need for the temporary variable, and hence avoids the need for this
allocation.

> + if (stor_device->cpumask_chns == NULL) {
> + kfree(stor_device->stor_chns);
> + return -ENOMEM;
> + }
> +
>   stor_device->stor_chns[device->channel->target_cpu] = device->channel;
>   cpumask_set_cpu(device->channel->target_cpu,
>   _device->alloced_cpus);
> @@ -1232,6 +1252,7 @@ static int storvsc_dev_remove(struct hv_device *device)
>   vmbus_close(device->channel);
> 
>   kfree(stor_device->stor_chns);
> + kfree(stor_device->cpumask_chns);
>   kfree(stor_device);
>   return 0;
>  }
> @@ -1241,7 +1262,7 @@ static struct vmbus_channel *get_og_chn(struct 
> storvsc_device
> *stor_device,
>  {1G/
>   u16 slot = 0;
>   u16 hash_qnum;
> - struct cpumask alloced_mask;
> + struct cpumask *alloced_mask = _device->cpumask_chns[q_num];
>   int num_channels, tgt_cpu;
> 
>   if (stor_device->num_sc == 0)
> @@ -1257,10 

RE: [Patch v2] Storvsc: Select channel based on available percentage of ring buffer to write

2018-04-22 Thread Michael Kelley (EOSG)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org  
> On Behalf
> Of Long Li
> Sent: Thursday, April 19, 2018 2:54 PM
> To: KY Srinivasan ; Haiyang Zhang 
> ; Stephen
> Hemminger ; James E . J . Bottomley 
> ;
> Martin K . Petersen ; 
> de...@linuxdriverproject.org; linux-
> s...@vger.kernel.org; linux-kernel@vger.kernel.org
> Cc: Long Li 
> Subject: [Patch v2] Storvsc: Select channel based on available percentage of 
> ring buffer to
> write
> 
> From: Long Li 
> 
> This is a best effort for estimating on how busy the ring buffer is for
> that channel, based on available buffer to write in percentage. It is still
> possible that at the time of actual ring buffer write, the space may not be
> available due to other processes may be writing at the time.
> 
> Selecting a channel based on how full it is can reduce the possibility that
> a ring buffer write will fail, and avoid the situation a channel is over
> busy.
> 
> Now it's possible that storvsc can use a smaller ring buffer size
> (e.g. 40k bytes) to take advantage of cache locality.
> 
> Changes.
> v2: Pre-allocate struct cpumask on the heap.
> Struct cpumask is a big structure (1k bytes) when CONFIG_NR_CPUS=8192 (default
> value when CONFIG_MAXSMP=y). Don't use kernel stack for it by pre-allocating
> them using kmalloc when channels are first initialized.
> 
> Signed-off-by: Long Li 
> ---
>  drivers/scsi/storvsc_drv.c | 90 
> --
>  1 file changed, 72 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> index a2ec0bc9e9fa..2a9fff94dd1a 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -395,6 +395,12 @@ MODULE_PARM_DESC(storvsc_ringbuffer_size, "Ring buffer 
> size
> (bytes)");
> 
>  module_param(storvsc_vcpus_per_sub_channel, int, S_IRUGO);
>  MODULE_PARM_DESC(storvsc_vcpus_per_sub_channel, "Ratio of VCPUs to 
> subchannels");
> +
> +static int ring_avail_percent_lowater = 10;
> +module_param(ring_avail_percent_lowater, int, S_IRUGO);
> +MODULE_PARM_DESC(ring_avail_percent_lowater,
> + "Select a channel if available ring size > this in percent");
> +
>  /*
>   * Timeout in seconds for all devices managed by this driver.
>   */
> @@ -468,6 +474,13 @@ struct storvsc_device {
>* Mask of CPUs bound to subchannels.
>*/
>   struct cpumask alloced_cpus;
> + /*
> +  * Pre-allocated struct cpumask for each hardware queue.
> +  * struct cpumask is used by selecting out-going channels. It is a
> +  * big structure, default to 1024k bytes when CONFIG_MAXSMP=y.

I think you mean "1024 bytes" or "1k bytes" in the above comment.

> +  * Pre-allocate it to avoid allocation on the kernel stack.
> +  */
> + struct cpumask *cpumask_chns;
>   /* Used for vsc/vsp channel reset process */
>   struct storvsc_cmd_request init_request;
>   struct storvsc_cmd_request reset_request;
> @@ -872,6 +885,13 @@ static int storvsc_channel_init(struct hv_device 
> *device, bool is_fc)
>   if (stor_device->stor_chns == NULL)
>   return -ENOMEM;
> 
> + stor_device->cpumask_chns = kcalloc(num_possible_cpus(),
> + sizeof(struct cpumask), GFP_KERNEL);

Note that num_possible_cpus() is 240 for a Hyper-V 2016 guest unless 
overridden on the kernel boot line, so this is going to allocate 240 Kbytes for 
each
synthetic SCSI controller.  On an Azure VM, which has two IDE and two SCSI
controllers, this is nearly 1 Mbyte.  It's unfortunate to have to allocate this 
much
memory for a what is essentially a temporary variable.   Further down in these
comments, I've proposed an alternate implementation of the code that avoids
the need for the temporary variable, and hence avoids the need for this
allocation.

> + if (stor_device->cpumask_chns == NULL) {
> + kfree(stor_device->stor_chns);
> + return -ENOMEM;
> + }
> +
>   stor_device->stor_chns[device->channel->target_cpu] = device->channel;
>   cpumask_set_cpu(device->channel->target_cpu,
>   _device->alloced_cpus);
> @@ -1232,6 +1252,7 @@ static int storvsc_dev_remove(struct hv_device *device)
>   vmbus_close(device->channel);
> 
>   kfree(stor_device->stor_chns);
> + kfree(stor_device->cpumask_chns);
>   kfree(stor_device);
>   return 0;
>  }
> @@ -1241,7 +1262,7 @@ static struct vmbus_channel *get_og_chn(struct 
> storvsc_device
> *stor_device,
>  {1G/
>   u16 slot = 0;
>   u16 hash_qnum;
> - struct cpumask alloced_mask;
> + struct cpumask *alloced_mask = _device->cpumask_chns[q_num];
>   int num_channels, tgt_cpu;
> 
>   if (stor_device->num_sc == 0)
> @@ -1257,10 +1278,10 @@ static struct vmbus_channel *get_og_chn(struct 
> storvsc_device
> *stor_device,
>* III. Mapping is persistent.
>*/
> 
> - cpumask_and(_mask, _device->alloced_cpus,
> + 

RE: [Patch v2 2/6] cifs: Allocate validate negotiation request through kmalloc

2018-04-18 Thread Michael Kelley (EOSG)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org  
> On Behalf
> Of Long Li
> Sent: Tuesday, April 17, 2018 12:17 PM
> To: Steve French ; linux-c...@vger.kernel.org; samba-
> techni...@lists.samba.org; linux-kernel@vger.kernel.org; 
> linux-r...@vger.kernel.org
> Cc: Long Li ; sta...@vger.kernel.org
> Subject: [Patch v2 2/6] cifs: Allocate validate negotiation request through 
> kmalloc
> 
> From: Long Li 
> 
> The data buffer allocated on the stack can't be DMA'ed, and hence can't send
> through RDMA via SMB Direct.
> 
> Fix this by allocating the request on the heap in smb3_validate_negotiate.
> 
> Fixes: ff1c038addc4f205d5f1ede449426c7d316c0eed "Check SMB3 dialects against
> downgrade attacks"
> 
> Changes in v2:
> Removed duplicated code on freeing buffers on function exit.
> (Thanks to Parav Pandit )
> 
> Fixed typo in the patch title.
> 
> Signed-off-by: Long Li 
> Cc: sta...@vger.kernel.org
> ---
>  fs/cifs/smb2pdu.c | 57 
> ++-
>  1 file changed, 31 insertions(+), 26 deletions(-)
> 
> diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
> index 0f044c4..41625e4 100644
> --- a/fs/cifs/smb2pdu.c
> +++ b/fs/cifs/smb2pdu.c
> @@ -729,8 +729,8 @@ SMB2_negotiate(const unsigned int xid, struct cifs_ses 
> *ses)
> 
>  int smb3_validate_negotiate(const unsigned int xid, struct cifs_tcon *tcon)
>  {
> - int rc = 0;
> - struct validate_negotiate_info_req vneg_inbuf;
> + int ret, rc = -EIO;
> + struct validate_negotiate_info_req *pneg_inbuf;
>   struct validate_negotiate_info_rsp *pneg_rsp = NULL;
>   u32 rsplen;
>   u32 inbuflen; /* max of 4 dialects */
> @@ -741,6 +741,9 @@ int smb3_validate_negotiate(const unsigned int xid, 
> struct cifs_tcon
> *tcon)
>   if (tcon->ses->server->rdma)
>   return 0;
>  #endif
> + pneg_inbuf = kmalloc(sizeof(*pneg_inbuf), GFP_KERNEL);
> + if (!pneg_inbuf)
> + return -ENOMEM;

Immediately after the above new code, there are three if statements that can
'return 0' and never free the pneg_inbuf memory.  They should instead set 'rc'
appropriately and 'goto' the out_free_inbuf label.

Michael

> 
>   /* In SMB3.11 preauth integrity supersedes validate negotiate */
>   if (tcon->ses->server->dialect == SMB311_PROT_ID)
> @@ -764,53 +767,53 @@ int smb3_validate_negotiate(const unsigned int xid, 
> struct cifs_tcon
> *tcon)
>   if (tcon->ses->session_flags & SMB2_SESSION_FLAG_IS_NULL)
>   cifs_dbg(VFS, "Unexpected null user (anonymous) auth flag sent 
> by server\n");
> 
> - vneg_inbuf.Capabilities =
> + pneg_inbuf->Capabilities =
>   cpu_to_le32(tcon->ses->server->vals->req_capabilities);
> - memcpy(vneg_inbuf.Guid, tcon->ses->server->client_guid,
> + memcpy(pneg_inbuf->Guid, tcon->ses->server->client_guid,
>   SMB2_CLIENT_GUID_SIZE);
> 
>   if (tcon->ses->sign)
> - vneg_inbuf.SecurityMode =
> + pneg_inbuf->SecurityMode =
>   cpu_to_le16(SMB2_NEGOTIATE_SIGNING_REQUIRED);
>   else if (global_secflags & CIFSSEC_MAY_SIGN)
> - vneg_inbuf.SecurityMode =
> + pneg_inbuf->SecurityMode =
>   cpu_to_le16(SMB2_NEGOTIATE_SIGNING_ENABLED);
>   else
> - vneg_inbuf.SecurityMode = 0;
> + pneg_inbuf->SecurityMode = 0;
> 
> 
>   if (strcmp(tcon->ses->server->vals->version_string,
>   SMB3ANY_VERSION_STRING) == 0) {
> - vneg_inbuf.Dialects[0] = cpu_to_le16(SMB30_PROT_ID);
> - vneg_inbuf.Dialects[1] = cpu_to_le16(SMB302_PROT_ID);
> - vneg_inbuf.DialectCount = cpu_to_le16(2);
> + pneg_inbuf->Dialects[0] = cpu_to_le16(SMB30_PROT_ID);
> + pneg_inbuf->Dialects[1] = cpu_to_le16(SMB302_PROT_ID);
> + pneg_inbuf->DialectCount = cpu_to_le16(2);
>   /* structure is big enough for 3 dialects, sending only 2 */
>   inbuflen = sizeof(struct validate_negotiate_info_req) - 2;
>   } else if (strcmp(tcon->ses->server->vals->version_string,
>   SMBDEFAULT_VERSION_STRING) == 0) {
> - vneg_inbuf.Dialects[0] = cpu_to_le16(SMB21_PROT_ID);
> - vneg_inbuf.Dialects[1] = cpu_to_le16(SMB30_PROT_ID);
> - vneg_inbuf.Dialects[2] = cpu_to_le16(SMB302_PROT_ID);
> - vneg_inbuf.DialectCount = cpu_to_le16(3);
> + pneg_inbuf->Dialects[0] = cpu_to_le16(SMB21_PROT_ID);
> + pneg_inbuf->Dialects[1] = cpu_to_le16(SMB30_PROT_ID);
> + pneg_inbuf->Dialects[2] = cpu_to_le16(SMB302_PROT_ID);
> + pneg_inbuf->DialectCount = cpu_to_le16(3);
>   /* structure is big enough for 3 dialects */
>   inbuflen = sizeof(struct 

RE: [Patch v2 2/6] cifs: Allocate validate negotiation request through kmalloc

2018-04-18 Thread Michael Kelley (EOSG)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org  
> On Behalf
> Of Long Li
> Sent: Tuesday, April 17, 2018 12:17 PM
> To: Steve French ; linux-c...@vger.kernel.org; samba-
> techni...@lists.samba.org; linux-kernel@vger.kernel.org; 
> linux-r...@vger.kernel.org
> Cc: Long Li ; sta...@vger.kernel.org
> Subject: [Patch v2 2/6] cifs: Allocate validate negotiation request through 
> kmalloc
> 
> From: Long Li 
> 
> The data buffer allocated on the stack can't be DMA'ed, and hence can't send
> through RDMA via SMB Direct.
> 
> Fix this by allocating the request on the heap in smb3_validate_negotiate.
> 
> Fixes: ff1c038addc4f205d5f1ede449426c7d316c0eed "Check SMB3 dialects against
> downgrade attacks"
> 
> Changes in v2:
> Removed duplicated code on freeing buffers on function exit.
> (Thanks to Parav Pandit )
> 
> Fixed typo in the patch title.
> 
> Signed-off-by: Long Li 
> Cc: sta...@vger.kernel.org
> ---
>  fs/cifs/smb2pdu.c | 57 
> ++-
>  1 file changed, 31 insertions(+), 26 deletions(-)
> 
> diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
> index 0f044c4..41625e4 100644
> --- a/fs/cifs/smb2pdu.c
> +++ b/fs/cifs/smb2pdu.c
> @@ -729,8 +729,8 @@ SMB2_negotiate(const unsigned int xid, struct cifs_ses 
> *ses)
> 
>  int smb3_validate_negotiate(const unsigned int xid, struct cifs_tcon *tcon)
>  {
> - int rc = 0;
> - struct validate_negotiate_info_req vneg_inbuf;
> + int ret, rc = -EIO;
> + struct validate_negotiate_info_req *pneg_inbuf;
>   struct validate_negotiate_info_rsp *pneg_rsp = NULL;
>   u32 rsplen;
>   u32 inbuflen; /* max of 4 dialects */
> @@ -741,6 +741,9 @@ int smb3_validate_negotiate(const unsigned int xid, 
> struct cifs_tcon
> *tcon)
>   if (tcon->ses->server->rdma)
>   return 0;
>  #endif
> + pneg_inbuf = kmalloc(sizeof(*pneg_inbuf), GFP_KERNEL);
> + if (!pneg_inbuf)
> + return -ENOMEM;

Immediately after the above new code, there are three if statements that can
'return 0' and never free the pneg_inbuf memory.  They should instead set 'rc'
appropriately and 'goto' the out_free_inbuf label.

Michael

> 
>   /* In SMB3.11 preauth integrity supersedes validate negotiate */
>   if (tcon->ses->server->dialect == SMB311_PROT_ID)
> @@ -764,53 +767,53 @@ int smb3_validate_negotiate(const unsigned int xid, 
> struct cifs_tcon
> *tcon)
>   if (tcon->ses->session_flags & SMB2_SESSION_FLAG_IS_NULL)
>   cifs_dbg(VFS, "Unexpected null user (anonymous) auth flag sent 
> by server\n");
> 
> - vneg_inbuf.Capabilities =
> + pneg_inbuf->Capabilities =
>   cpu_to_le32(tcon->ses->server->vals->req_capabilities);
> - memcpy(vneg_inbuf.Guid, tcon->ses->server->client_guid,
> + memcpy(pneg_inbuf->Guid, tcon->ses->server->client_guid,
>   SMB2_CLIENT_GUID_SIZE);
> 
>   if (tcon->ses->sign)
> - vneg_inbuf.SecurityMode =
> + pneg_inbuf->SecurityMode =
>   cpu_to_le16(SMB2_NEGOTIATE_SIGNING_REQUIRED);
>   else if (global_secflags & CIFSSEC_MAY_SIGN)
> - vneg_inbuf.SecurityMode =
> + pneg_inbuf->SecurityMode =
>   cpu_to_le16(SMB2_NEGOTIATE_SIGNING_ENABLED);
>   else
> - vneg_inbuf.SecurityMode = 0;
> + pneg_inbuf->SecurityMode = 0;
> 
> 
>   if (strcmp(tcon->ses->server->vals->version_string,
>   SMB3ANY_VERSION_STRING) == 0) {
> - vneg_inbuf.Dialects[0] = cpu_to_le16(SMB30_PROT_ID);
> - vneg_inbuf.Dialects[1] = cpu_to_le16(SMB302_PROT_ID);
> - vneg_inbuf.DialectCount = cpu_to_le16(2);
> + pneg_inbuf->Dialects[0] = cpu_to_le16(SMB30_PROT_ID);
> + pneg_inbuf->Dialects[1] = cpu_to_le16(SMB302_PROT_ID);
> + pneg_inbuf->DialectCount = cpu_to_le16(2);
>   /* structure is big enough for 3 dialects, sending only 2 */
>   inbuflen = sizeof(struct validate_negotiate_info_req) - 2;
>   } else if (strcmp(tcon->ses->server->vals->version_string,
>   SMBDEFAULT_VERSION_STRING) == 0) {
> - vneg_inbuf.Dialects[0] = cpu_to_le16(SMB21_PROT_ID);
> - vneg_inbuf.Dialects[1] = cpu_to_le16(SMB30_PROT_ID);
> - vneg_inbuf.Dialects[2] = cpu_to_le16(SMB302_PROT_ID);
> - vneg_inbuf.DialectCount = cpu_to_le16(3);
> + pneg_inbuf->Dialects[0] = cpu_to_le16(SMB21_PROT_ID);
> + pneg_inbuf->Dialects[1] = cpu_to_le16(SMB30_PROT_ID);
> + pneg_inbuf->Dialects[2] = cpu_to_le16(SMB302_PROT_ID);
> + pneg_inbuf->DialectCount = cpu_to_le16(3);
>   /* structure is big enough for 3 dialects */
>   inbuflen = sizeof(struct validate_negotiate_info_req);
>   } else {
>   /* otherwise specific dialect was requested */
> - 

RE: [Resend Patch 2/3] Netvsc: Use the vmbus functiton to calculate ring buffer percentage

2018-04-13 Thread Michael Kelley (EOSG)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org  
> On Behalf
> Of Long Li
> Sent: Tuesday, March 27, 2018 5:49 PM
> To: KY Srinivasan ; Haiyang Zhang 
> ; Stephen
> Hemminger ; James E . J . Bottomley 
> ;
> Martin K . Petersen ; 
> de...@linuxdriverproject.org; linux-
> s...@vger.kernel.org; linux-kernel@vger.kernel.org; net...@vger.kernel.org
> Cc: Long Li 
> Subject: [Resend Patch 2/3] Netvsc: Use the vmbus functiton to calculate ring 
> buffer
> percentage
> 
> From: Long Li 
> 
> In Vmbus, we have defined a function to calculate available ring buffer
> percentage to write.
> 
> Use that function and remove netvsc's private version.
> 
> Signed-off-by: Long Li 

Reviewed-by:  Michael Kelley 

> ---
>  drivers/net/hyperv/hyperv_net.h |  1 -
>  drivers/net/hyperv/netvsc.c | 17 +++--
>  drivers/net/hyperv/netvsc_drv.c |  3 ---
>  3 files changed, 3 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> index cd538d5a7986..a0199ab13d67 100644
> --- a/drivers/net/hyperv/hyperv_net.h
> +++ b/drivers/net/hyperv/hyperv_net.h
> @@ -189,7 +189,6 @@ struct netvsc_device;
>  struct net_device_context;
> 
>  extern u32 netvsc_ring_bytes;
> -extern struct reciprocal_value netvsc_ring_reciprocal;
> 
>  struct netvsc_device *netvsc_device_add(struct hv_device *device,
>   const struct netvsc_device_info *info);
> diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
> index 0265d703eb03..8af0069e4d8c 100644
> --- a/drivers/net/hyperv/netvsc.c
> +++ b/drivers/net/hyperv/netvsc.c
> @@ -31,7 +31,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
> 
>  #include 
> 
> @@ -590,17 +589,6 @@ void netvsc_device_remove(struct hv_device *device)
>  #define RING_AVAIL_PERCENT_HIWATER 20
>  #define RING_AVAIL_PERCENT_LOWATER 10
> 
> -/*
> - * Get the percentage of available bytes to write in the ring.
> - * The return value is in range from 0 to 100.
> - */
> -static u32 hv_ringbuf_avail_percent(const struct hv_ring_buffer_info 
> *ring_info)
> -{
> - u32 avail_write = hv_get_bytes_to_write(ring_info);
> -
> - return reciprocal_divide(avail_write  * 100, netvsc_ring_reciprocal);
> -}
> -
>  static inline void netvsc_free_send_slot(struct netvsc_device *net_device,
>u32 index)
>  {
> @@ -649,7 +637,8 @@ static void netvsc_send_tx_complete(struct netvsc_device
> *net_device,
>   wake_up(_device->wait_drain);
> 
>   if (netif_tx_queue_stopped(netdev_get_tx_queue(ndev, q_idx)) &&
> - (hv_ringbuf_avail_percent(>outbound) >
> RING_AVAIL_PERCENT_HIWATER ||
> + (hv_get_avail_to_write_percent(>outbound) >
> +  RING_AVAIL_PERCENT_HIWATER ||
>queue_sends < 1)) {
>   netif_tx_wake_queue(netdev_get_tx_queue(ndev, q_idx));
>   ndev_ctx->eth_stats.wake_queue++;
> @@ -757,7 +746,7 @@ static inline int netvsc_send_pkt(
>   struct netdev_queue *txq = netdev_get_tx_queue(ndev, packet->q_idx);
>   u64 req_id;
>   int ret;
> - u32 ring_avail = hv_ringbuf_avail_percent(_channel->outbound);
> + u32 ring_avail = hv_get_avail_to_write_percent(_channel->outbound);
> 
>   nvmsg.hdr.msg_type = NVSP_MSG1_TYPE_SEND_RNDIS_PKT;
>   if (skb)
> diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
> index faea0be18924..b0b1c2fd2b7b 100644
> --- a/drivers/net/hyperv/netvsc_drv.c
> +++ b/drivers/net/hyperv/netvsc_drv.c
> @@ -35,7 +35,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
> 
>  #include 
>  #include 
> @@ -55,7 +54,6 @@ static unsigned int ring_size __ro_after_init = 128;
>  module_param(ring_size, uint, S_IRUGO);
>  MODULE_PARM_DESC(ring_size, "Ring buffer size (# of pages)");
>  unsigned int netvsc_ring_bytes __ro_after_init;
> -struct reciprocal_value netvsc_ring_reciprocal __ro_after_init;
> 
>  static const u32 default_msg = NETIF_MSG_DRV | NETIF_MSG_PROBE |
>   NETIF_MSG_LINK | NETIF_MSG_IFUP |
> @@ -2186,7 +2184,6 @@ static int __init netvsc_drv_init(void)
>   ring_size);
>   }
>   netvsc_ring_bytes = ring_size * PAGE_SIZE;
> - netvsc_ring_reciprocal = reciprocal_value(netvsc_ring_bytes);
> 
>   ret = vmbus_driver_register(_drv);
>   if (ret)
> --
> 2.14.1



RE: [Resend Patch 2/3] Netvsc: Use the vmbus functiton to calculate ring buffer percentage

2018-04-13 Thread Michael Kelley (EOSG)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org  
> On Behalf
> Of Long Li
> Sent: Tuesday, March 27, 2018 5:49 PM
> To: KY Srinivasan ; Haiyang Zhang 
> ; Stephen
> Hemminger ; James E . J . Bottomley 
> ;
> Martin K . Petersen ; 
> de...@linuxdriverproject.org; linux-
> s...@vger.kernel.org; linux-kernel@vger.kernel.org; net...@vger.kernel.org
> Cc: Long Li 
> Subject: [Resend Patch 2/3] Netvsc: Use the vmbus functiton to calculate ring 
> buffer
> percentage
> 
> From: Long Li 
> 
> In Vmbus, we have defined a function to calculate available ring buffer
> percentage to write.
> 
> Use that function and remove netvsc's private version.
> 
> Signed-off-by: Long Li 

Reviewed-by:  Michael Kelley 

> ---
>  drivers/net/hyperv/hyperv_net.h |  1 -
>  drivers/net/hyperv/netvsc.c | 17 +++--
>  drivers/net/hyperv/netvsc_drv.c |  3 ---
>  3 files changed, 3 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> index cd538d5a7986..a0199ab13d67 100644
> --- a/drivers/net/hyperv/hyperv_net.h
> +++ b/drivers/net/hyperv/hyperv_net.h
> @@ -189,7 +189,6 @@ struct netvsc_device;
>  struct net_device_context;
> 
>  extern u32 netvsc_ring_bytes;
> -extern struct reciprocal_value netvsc_ring_reciprocal;
> 
>  struct netvsc_device *netvsc_device_add(struct hv_device *device,
>   const struct netvsc_device_info *info);
> diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
> index 0265d703eb03..8af0069e4d8c 100644
> --- a/drivers/net/hyperv/netvsc.c
> +++ b/drivers/net/hyperv/netvsc.c
> @@ -31,7 +31,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
> 
>  #include 
> 
> @@ -590,17 +589,6 @@ void netvsc_device_remove(struct hv_device *device)
>  #define RING_AVAIL_PERCENT_HIWATER 20
>  #define RING_AVAIL_PERCENT_LOWATER 10
> 
> -/*
> - * Get the percentage of available bytes to write in the ring.
> - * The return value is in range from 0 to 100.
> - */
> -static u32 hv_ringbuf_avail_percent(const struct hv_ring_buffer_info 
> *ring_info)
> -{
> - u32 avail_write = hv_get_bytes_to_write(ring_info);
> -
> - return reciprocal_divide(avail_write  * 100, netvsc_ring_reciprocal);
> -}
> -
>  static inline void netvsc_free_send_slot(struct netvsc_device *net_device,
>u32 index)
>  {
> @@ -649,7 +637,8 @@ static void netvsc_send_tx_complete(struct netvsc_device
> *net_device,
>   wake_up(_device->wait_drain);
> 
>   if (netif_tx_queue_stopped(netdev_get_tx_queue(ndev, q_idx)) &&
> - (hv_ringbuf_avail_percent(>outbound) >
> RING_AVAIL_PERCENT_HIWATER ||
> + (hv_get_avail_to_write_percent(>outbound) >
> +  RING_AVAIL_PERCENT_HIWATER ||
>queue_sends < 1)) {
>   netif_tx_wake_queue(netdev_get_tx_queue(ndev, q_idx));
>   ndev_ctx->eth_stats.wake_queue++;
> @@ -757,7 +746,7 @@ static inline int netvsc_send_pkt(
>   struct netdev_queue *txq = netdev_get_tx_queue(ndev, packet->q_idx);
>   u64 req_id;
>   int ret;
> - u32 ring_avail = hv_ringbuf_avail_percent(_channel->outbound);
> + u32 ring_avail = hv_get_avail_to_write_percent(_channel->outbound);
> 
>   nvmsg.hdr.msg_type = NVSP_MSG1_TYPE_SEND_RNDIS_PKT;
>   if (skb)
> diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
> index faea0be18924..b0b1c2fd2b7b 100644
> --- a/drivers/net/hyperv/netvsc_drv.c
> +++ b/drivers/net/hyperv/netvsc_drv.c
> @@ -35,7 +35,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
> 
>  #include 
>  #include 
> @@ -55,7 +54,6 @@ static unsigned int ring_size __ro_after_init = 128;
>  module_param(ring_size, uint, S_IRUGO);
>  MODULE_PARM_DESC(ring_size, "Ring buffer size (# of pages)");
>  unsigned int netvsc_ring_bytes __ro_after_init;
> -struct reciprocal_value netvsc_ring_reciprocal __ro_after_init;
> 
>  static const u32 default_msg = NETIF_MSG_DRV | NETIF_MSG_PROBE |
>   NETIF_MSG_LINK | NETIF_MSG_IFUP |
> @@ -2186,7 +2184,6 @@ static int __init netvsc_drv_init(void)
>   ring_size);
>   }
>   netvsc_ring_bytes = ring_size * PAGE_SIZE;
> - netvsc_ring_reciprocal = reciprocal_value(netvsc_ring_bytes);
> 
>   ret = vmbus_driver_register(_drv);
>   if (ret)
> --
> 2.14.1



RE: [Resend Patch 3/3] Storvsc: Select channel based on available percentage of ring buffer to write

2018-04-13 Thread Michael Kelley (EOSG)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org  
> On Behalf
> Of Long Li
> Sent: Tuesday, March 27, 2018 5:49 PM
> To: KY Srinivasan ; Haiyang Zhang 
> ; Stephen
> Hemminger ; James E . J . Bottomley 
> ;
> Martin K . Petersen ; 
> de...@linuxdriverproject.org; linux-
> s...@vger.kernel.org; linux-kernel@vger.kernel.org; net...@vger.kernel.org
> Cc: Long Li 
> Subject: [Resend Patch 3/3] Storvsc: Select channel based on available 
> percentage of ring
> buffer to write
> 
> From: Long Li 
> 
> This is a best effort for estimating on how busy the ring buffer is for
> that channel, based on available buffer to write in percentage. It is still
> possible that at the time of actual ring buffer write, the space may not be
> available due to other processes may be writing at the time.
> 
> Selecting a channel based on how full it is can reduce the possibility that
> a ring buffer write will fail, and avoid the situation a channel is over
> busy.
> 
> Now it's possible that storvsc can use a smaller ring buffer size
> (e.g. 40k bytes) to take advantage of cache locality.
> 
> Signed-off-by: Long Li 
> ---
>  drivers/scsi/storvsc_drv.c | 62 
> +-
>  1 file changed, 50 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> index a2ec0bc9e9fa..b1a87072b3ab 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -395,6 +395,12 @@ MODULE_PARM_DESC(storvsc_ringbuffer_size, "Ring buffer 
> size
> (bytes)");
> 
>  module_param(storvsc_vcpus_per_sub_channel, int, S_IRUGO);
>  MODULE_PARM_DESC(storvsc_vcpus_per_sub_channel, "Ratio of VCPUs to 
> subchannels");
> +
> +static int ring_avail_percent_lowater = 10;

Reserving 10% of each ring buffer by default seems like more than is needed
in the storvsc driver.  That would be about 4Kbytes for the 40K ring buffer
you suggest, and even more for a ring buffer of 128K.  Each outgoing record
is only about 344 bytes (I'd have to check exactly).  With the new channel
selection algorithm below, the only time we use a channel that is already
below the low water mark is when no channel could be found that is above
the low water mark.   There could be a case of two or more threads deciding
that a channel is above the low water mark at the same time and both
choosing it, but that's likely to be rare.  So it seems like we could set the
default low water mark to 5 percent or even 3 percent, which will let more
of the ring buffer be used, and let a channel be assigned according to the
algorithm, rather than falling through to the default because all channels
appear to be "full".

> +module_param(ring_avail_percent_lowater, int, S_IRUGO);
> +MODULE_PARM_DESC(ring_avail_percent_lowater,
> + "Select a channel if available ring size > this in percent");
> +
>  /*
>   * Timeout in seconds for all devices managed by this driver.
>   */
> @@ -1285,9 +1291,9 @@ static int storvsc_do_io(struct hv_device *device,
>  {
>   struct storvsc_device *stor_device;
>   struct vstor_packet *vstor_packet;
> - struct vmbus_channel *outgoing_channel;
> + struct vmbus_channel *outgoing_channel, *channel;
>   int ret = 0;
> - struct cpumask alloced_mask;
> + struct cpumask alloced_mask, other_numa_mask;
>   int tgt_cpu;
> 
>   vstor_packet = >vstor_packet;
> @@ -1301,22 +1307,53 @@ static int storvsc_do_io(struct hv_device *device,
>   /*
>* Select an an appropriate channel to send the request out.
>*/
> -
>   if (stor_device->stor_chns[q_num] != NULL) {
>   outgoing_channel = stor_device->stor_chns[q_num];
> - if (outgoing_channel->target_cpu == smp_processor_id()) {
> + if (outgoing_channel->target_cpu == q_num) {
>   /*
>* Ideally, we want to pick a different channel if
>* available on the same NUMA node.
>*/
>   cpumask_and(_mask, _device->alloced_cpus,
>   cpumask_of_node(cpu_to_node(q_num)));
> - for_each_cpu_wrap(tgt_cpu, _mask,
> - outgoing_channel->target_cpu + 1) {
> - if (tgt_cpu != outgoing_channel->target_cpu) {
> - outgoing_channel =
> - stor_device->stor_chns[tgt_cpu];
> - break;
> +
> + for_each_cpu_wrap(tgt_cpu, _mask, q_num + 1) {
> + if (tgt_cpu == q_num)
> + continue;
> + channel = 

RE: [Resend Patch 3/3] Storvsc: Select channel based on available percentage of ring buffer to write

2018-04-13 Thread Michael Kelley (EOSG)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org  
> On Behalf
> Of Long Li
> Sent: Tuesday, March 27, 2018 5:49 PM
> To: KY Srinivasan ; Haiyang Zhang 
> ; Stephen
> Hemminger ; James E . J . Bottomley 
> ;
> Martin K . Petersen ; 
> de...@linuxdriverproject.org; linux-
> s...@vger.kernel.org; linux-kernel@vger.kernel.org; net...@vger.kernel.org
> Cc: Long Li 
> Subject: [Resend Patch 3/3] Storvsc: Select channel based on available 
> percentage of ring
> buffer to write
> 
> From: Long Li 
> 
> This is a best effort for estimating on how busy the ring buffer is for
> that channel, based on available buffer to write in percentage. It is still
> possible that at the time of actual ring buffer write, the space may not be
> available due to other processes may be writing at the time.
> 
> Selecting a channel based on how full it is can reduce the possibility that
> a ring buffer write will fail, and avoid the situation a channel is over
> busy.
> 
> Now it's possible that storvsc can use a smaller ring buffer size
> (e.g. 40k bytes) to take advantage of cache locality.
> 
> Signed-off-by: Long Li 
> ---
>  drivers/scsi/storvsc_drv.c | 62 
> +-
>  1 file changed, 50 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> index a2ec0bc9e9fa..b1a87072b3ab 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -395,6 +395,12 @@ MODULE_PARM_DESC(storvsc_ringbuffer_size, "Ring buffer 
> size
> (bytes)");
> 
>  module_param(storvsc_vcpus_per_sub_channel, int, S_IRUGO);
>  MODULE_PARM_DESC(storvsc_vcpus_per_sub_channel, "Ratio of VCPUs to 
> subchannels");
> +
> +static int ring_avail_percent_lowater = 10;

Reserving 10% of each ring buffer by default seems like more than is needed
in the storvsc driver.  That would be about 4Kbytes for the 40K ring buffer
you suggest, and even more for a ring buffer of 128K.  Each outgoing record
is only about 344 bytes (I'd have to check exactly).  With the new channel
selection algorithm below, the only time we use a channel that is already
below the low water mark is when no channel could be found that is above
the low water mark.   There could be a case of two or more threads deciding
that a channel is above the low water mark at the same time and both
choosing it, but that's likely to be rare.  So it seems like we could set the
default low water mark to 5 percent or even 3 percent, which will let more
of the ring buffer be used, and let a channel be assigned according to the
algorithm, rather than falling through to the default because all channels
appear to be "full".

> +module_param(ring_avail_percent_lowater, int, S_IRUGO);
> +MODULE_PARM_DESC(ring_avail_percent_lowater,
> + "Select a channel if available ring size > this in percent");
> +
>  /*
>   * Timeout in seconds for all devices managed by this driver.
>   */
> @@ -1285,9 +1291,9 @@ static int storvsc_do_io(struct hv_device *device,
>  {
>   struct storvsc_device *stor_device;
>   struct vstor_packet *vstor_packet;
> - struct vmbus_channel *outgoing_channel;
> + struct vmbus_channel *outgoing_channel, *channel;
>   int ret = 0;
> - struct cpumask alloced_mask;
> + struct cpumask alloced_mask, other_numa_mask;
>   int tgt_cpu;
> 
>   vstor_packet = >vstor_packet;
> @@ -1301,22 +1307,53 @@ static int storvsc_do_io(struct hv_device *device,
>   /*
>* Select an an appropriate channel to send the request out.
>*/
> -
>   if (stor_device->stor_chns[q_num] != NULL) {
>   outgoing_channel = stor_device->stor_chns[q_num];
> - if (outgoing_channel->target_cpu == smp_processor_id()) {
> + if (outgoing_channel->target_cpu == q_num) {
>   /*
>* Ideally, we want to pick a different channel if
>* available on the same NUMA node.
>*/
>   cpumask_and(_mask, _device->alloced_cpus,
>   cpumask_of_node(cpu_to_node(q_num)));
> - for_each_cpu_wrap(tgt_cpu, _mask,
> - outgoing_channel->target_cpu + 1) {
> - if (tgt_cpu != outgoing_channel->target_cpu) {
> - outgoing_channel =
> - stor_device->stor_chns[tgt_cpu];
> - break;
> +
> + for_each_cpu_wrap(tgt_cpu, _mask, q_num + 1) {
> + if (tgt_cpu == q_num)
> + continue;
> + channel = stor_device->stor_chns[tgt_cpu];
> + if (hv_get_avail_to_write_percent(
> + >outbound)
> + > 

RE: [PATCH v2 char-misc 1/1] x86/hyperv: Add interrupt handler annotations

2018-04-04 Thread Michael Kelley (EOSG)
> -Original Message-
> From: Greg KH <gre...@linuxfoundation.org>
> Sent: Wednesday, April 4, 2018 1:16 AM
> To: Michael Kelley (EOSG) <michael.h.kel...@microsoft.com>
> Cc: linux-kernel@vger.kernel.org; de...@linuxdriverproject.org; 
> o...@aepfle.de;
> a...@canonical.com; vkuzn...@redhat.com; jasow...@redhat.com;
> leann.ogasaw...@canonical.com; marcelo.ce...@canonical.com; Stephen Hemminger
> <sthem...@microsoft.com>; KY Srinivasan <k...@microsoft.com>
> Subject: Re: [PATCH v2 char-misc 1/1] x86/hyperv: Add interrupt handler 
> annotations
> 
> On Tue, Apr 03, 2018 at 01:59:08PM -0700, mhkelle...@gmail.com wrote:
> > From: Michael Kelley <mikel...@microsoft.com>
> >
> > Add standard interrupt handler annotations to
> > hyperv_vector_handler().
> >
> > Signed-off-by: Michael Kelley <mikel...@microsoft.com>
> > ---
> > Changes in v2:
> > * Fixed From: line
> > ---
> >  arch/x86/kernel/cpu/mshyperv.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> > index 4488cf0..20f6849 100644
> > --- a/arch/x86/kernel/cpu/mshyperv.c
> > +++ b/arch/x86/kernel/cpu/mshyperv.c
> > @@ -41,7 +41,7 @@ static void (*hv_stimer0_handler)(void);
> >  static void (*hv_kexec_handler)(void);
> >  static void (*hv_crash_handler)(struct pt_regs *regs);
> >
> > -void hyperv_vector_handler(struct pt_regs *regs)
> > +__visible void __irq_entry hyperv_vector_handler(struct pt_regs *regs)
> 
> What bug does this solve?  What is wrong with the existing markings?
> What does __visible and __irq_entry give us that we don't already have
> and we need?
> 
> Are you really using LTO that requires this marking to prevent the code
> from being removed?

Thomas Gleixner commented on Vitaly Kuznetsov's Hyper-V reenlightenment patch
that the interrupt handler should have these annotations: see
https://lkml.org/lkml/2018/1/14/145

I put the same annotations on the interrupt handler for stimer0 Direct Mode,
So this change makes the hyperv_vector_handler() consistent with
hv_stimer0_vector_handler() in the same source file.  It does not fix any
immediate bug -- it's for consistency and alignment with what is apparently
standard practice.

Not sure what LTO is ...

Michael

> 
> thanks,
> 
> greg k-h


RE: [PATCH v2 char-misc 1/1] x86/hyperv: Add interrupt handler annotations

2018-04-04 Thread Michael Kelley (EOSG)
> -Original Message-
> From: Greg KH 
> Sent: Wednesday, April 4, 2018 1:16 AM
> To: Michael Kelley (EOSG) 
> Cc: linux-kernel@vger.kernel.org; de...@linuxdriverproject.org; 
> o...@aepfle.de;
> a...@canonical.com; vkuzn...@redhat.com; jasow...@redhat.com;
> leann.ogasaw...@canonical.com; marcelo.ce...@canonical.com; Stephen Hemminger
> ; KY Srinivasan 
> Subject: Re: [PATCH v2 char-misc 1/1] x86/hyperv: Add interrupt handler 
> annotations
> 
> On Tue, Apr 03, 2018 at 01:59:08PM -0700, mhkelle...@gmail.com wrote:
> > From: Michael Kelley 
> >
> > Add standard interrupt handler annotations to
> > hyperv_vector_handler().
> >
> > Signed-off-by: Michael Kelley 
> > ---
> > Changes in v2:
> > * Fixed From: line
> > ---
> >  arch/x86/kernel/cpu/mshyperv.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> > index 4488cf0..20f6849 100644
> > --- a/arch/x86/kernel/cpu/mshyperv.c
> > +++ b/arch/x86/kernel/cpu/mshyperv.c
> > @@ -41,7 +41,7 @@ static void (*hv_stimer0_handler)(void);
> >  static void (*hv_kexec_handler)(void);
> >  static void (*hv_crash_handler)(struct pt_regs *regs);
> >
> > -void hyperv_vector_handler(struct pt_regs *regs)
> > +__visible void __irq_entry hyperv_vector_handler(struct pt_regs *regs)
> 
> What bug does this solve?  What is wrong with the existing markings?
> What does __visible and __irq_entry give us that we don't already have
> and we need?
> 
> Are you really using LTO that requires this marking to prevent the code
> from being removed?

Thomas Gleixner commented on Vitaly Kuznetsov's Hyper-V reenlightenment patch
that the interrupt handler should have these annotations: see
https://lkml.org/lkml/2018/1/14/145

I put the same annotations on the interrupt handler for stimer0 Direct Mode,
So this change makes the hyperv_vector_handler() consistent with
hv_stimer0_vector_handler() in the same source file.  It does not fix any
immediate bug -- it's for consistency and alignment with what is apparently
standard practice.

Not sure what LTO is ...

Michael

> 
> thanks,
> 
> greg k-h


RE: [PATCH v2] storvsc: Set up correct queue depth values for IDE devices

2018-03-24 Thread Michael Kelley (EOSG)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org  
> On Behalf
> Of Long Li
> Sent: Thursday, March 22, 2018 2:47 PM
> To: KY Srinivasan ; Haiyang Zhang 
> ; Stephen
> Hemminger ; James E . J . Bottomley 
> ;
> Martin K . Petersen ; 
> de...@linuxdriverproject.org; linux-
> s...@vger.kernel.org; linux-kernel@vger.kernel.org
> Cc: Long Li 
> Subject: [PATCH v2] storvsc: Set up correct queue depth values for IDE devices
> 
> From: Long Li 
> 
> Unlike SCSI and FC, we don't use multiple channels for IDE.
> Also fix the calculation for sub-channels.
> 
> Change log:
> v2: Addressed comment on incorrect number of sub-channels.
> (Michael Kelley )
> 
> Signed-off-by: Long Li 

Reviewed-by: Michael Kelley 

> ---
>  drivers/scsi/storvsc_drv.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> index 8c51d628b52e..a2ec0bc9e9fa 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -1722,11 +1722,14 @@ static int storvsc_probe(struct hv_device *device,
>   max_targets = STORVSC_MAX_TARGETS;
>   max_channels = STORVSC_MAX_CHANNELS;
>   /*
> -  * On Windows8 and above, we support sub-channels for storage.
> +  * On Windows8 and above, we support sub-channels for storage
> +  * on SCSI and FC controllers.
>* The number of sub-channels offerred is based on the number of
>* VCPUs in the guest.
>*/
> - max_sub_channels = (num_cpus / storvsc_vcpus_per_sub_channel);
> + if (!dev_is_ide)
> + max_sub_channels =
> + (num_cpus - 1) / storvsc_vcpus_per_sub_channel;
>   }
> 
>   scsi_driver.can_queue = (max_outstanding_req_per_channel *
> --
> 2.14.1



RE: [PATCH v2] storvsc: Set up correct queue depth values for IDE devices

2018-03-24 Thread Michael Kelley (EOSG)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org  
> On Behalf
> Of Long Li
> Sent: Thursday, March 22, 2018 2:47 PM
> To: KY Srinivasan ; Haiyang Zhang 
> ; Stephen
> Hemminger ; James E . J . Bottomley 
> ;
> Martin K . Petersen ; 
> de...@linuxdriverproject.org; linux-
> s...@vger.kernel.org; linux-kernel@vger.kernel.org
> Cc: Long Li 
> Subject: [PATCH v2] storvsc: Set up correct queue depth values for IDE devices
> 
> From: Long Li 
> 
> Unlike SCSI and FC, we don't use multiple channels for IDE.
> Also fix the calculation for sub-channels.
> 
> Change log:
> v2: Addressed comment on incorrect number of sub-channels.
> (Michael Kelley )
> 
> Signed-off-by: Long Li 

Reviewed-by: Michael Kelley 

> ---
>  drivers/scsi/storvsc_drv.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> index 8c51d628b52e..a2ec0bc9e9fa 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -1722,11 +1722,14 @@ static int storvsc_probe(struct hv_device *device,
>   max_targets = STORVSC_MAX_TARGETS;
>   max_channels = STORVSC_MAX_CHANNELS;
>   /*
> -  * On Windows8 and above, we support sub-channels for storage.
> +  * On Windows8 and above, we support sub-channels for storage
> +  * on SCSI and FC controllers.
>* The number of sub-channels offerred is based on the number of
>* VCPUs in the guest.
>*/
> - max_sub_channels = (num_cpus / storvsc_vcpus_per_sub_channel);
> + if (!dev_is_ide)
> + max_sub_channels =
> + (num_cpus - 1) / storvsc_vcpus_per_sub_channel;
>   }
> 
>   scsi_driver.can_queue = (max_outstanding_req_per_channel *
> --
> 2.14.1



RE: [PATCH net-next,1/2] hv_netvsc: Fix the return status in RX path

2018-03-24 Thread Michael Kelley (EOSG)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org  
> On Behalf
> Of Haiyang Zhang
> Sent: Thursday, March 22, 2018 12:01 PM
> To: da...@davemloft.net; net...@vger.kernel.org
> Cc: Haiyang Zhang ; KY Srinivasan 
> ; Stephen
> Hemminger ; o...@aepfle.de; vkuzn...@redhat.com;
> de...@linuxdriverproject.org; linux-kernel@vger.kernel.org
> Subject: [PATCH net-next,1/2] hv_netvsc: Fix the return status in RX path
> 
> From: Haiyang Zhang 
> 
> As defined in hyperv_net.h, the NVSP_STAT_SUCCESS is one not zero.
> Some functions returns 0 when it actually means NVSP_STAT_SUCCESS.
> This patch fixes them.
> 
> In netvsc_receive(), it puts the last RNDIS packet's receive status
> for all packets in a vmxferpage which may contain multiple RNDIS
> packets.
> This patch puts NVSP_STAT_FAIL in the receive completion if one of
> the packets in a vmxferpage fails.

This patch changes the status field that is being reported back to
the Hyper-V host in the receive completion message in
enq_receive_complete().   The current code reports 0 on success,
and with the patch, it will report 1 on success.  So does this change
affect anything on the Hyper-V side?  Or is Hyper-V just ignoring
the value?   If this change doesn't have any impact on the
interactions with Hyper-V, perhaps it would be good to explain
why in the commit message.

Michael

> 
> Signed-off-by: Haiyang Zhang 
> ---
>  drivers/net/hyperv/netvsc.c   | 8 ++--
>  drivers/net/hyperv/netvsc_drv.c   | 2 +-
>  drivers/net/hyperv/rndis_filter.c | 4 ++--
>  3 files changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
> index aa95e81af6e5..1ddb2c39b6e4 100644
> --- a/drivers/net/hyperv/netvsc.c
> +++ b/drivers/net/hyperv/netvsc.c
> @@ -1098,12 +1098,16 @@ static int netvsc_receive(struct net_device *ndev,
>   void *data = recv_buf
>   + vmxferpage_packet->ranges[i].byte_offset;
>   u32 buflen = vmxferpage_packet->ranges[i].byte_count;
> + int ret;
> 
>   trace_rndis_recv(ndev, q_idx, data);
> 
>   /* Pass it to the upper layer */
> - status = rndis_filter_receive(ndev, net_device,
> -   channel, data, buflen);
> + ret = rndis_filter_receive(ndev, net_device,
> +channel, data, buflen);
> +
> + if (unlikely(ret != NVSP_STAT_SUCCESS))
> + status = NVSP_STAT_FAIL;
>   }
> 
>   enq_receive_complete(ndev, net_device, q_idx,
> diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
> index cdb78eefab67..33607995be62 100644
> --- a/drivers/net/hyperv/netvsc_drv.c
> +++ b/drivers/net/hyperv/netvsc_drv.c
> @@ -818,7 +818,7 @@ int netvsc_recv_callback(struct net_device *net,
>   u64_stats_update_end(_stats->syncp);
> 
>   napi_gro_receive(>napi, skb);
> - return 0;
> + return NVSP_STAT_SUCCESS;
>  }
> 
>  static void netvsc_get_drvinfo(struct net_device *net,
> diff --git a/drivers/net/hyperv/rndis_filter.c 
> b/drivers/net/hyperv/rndis_filter.c
> index 2dc00f714482..591fb8080f11 100644
> --- a/drivers/net/hyperv/rndis_filter.c
> +++ b/drivers/net/hyperv/rndis_filter.c
> @@ -443,10 +443,10 @@ int rndis_filter_receive(struct net_device *ndev,
>   "unhandled rndis message (type %u len %u)\n",
>  rndis_msg->ndis_msg_type,
>  rndis_msg->msg_len);
> - break;
> + return NVSP_STAT_FAIL;
>   }
> 
> - return 0;
> + return NVSP_STAT_SUCCESS;
>  }
> 
>  static int rndis_filter_query_device(struct rndis_device *dev,
> --
> 2.15.1



RE: [PATCH net-next,1/2] hv_netvsc: Fix the return status in RX path

2018-03-24 Thread Michael Kelley (EOSG)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org  
> On Behalf
> Of Haiyang Zhang
> Sent: Thursday, March 22, 2018 12:01 PM
> To: da...@davemloft.net; net...@vger.kernel.org
> Cc: Haiyang Zhang ; KY Srinivasan 
> ; Stephen
> Hemminger ; o...@aepfle.de; vkuzn...@redhat.com;
> de...@linuxdriverproject.org; linux-kernel@vger.kernel.org
> Subject: [PATCH net-next,1/2] hv_netvsc: Fix the return status in RX path
> 
> From: Haiyang Zhang 
> 
> As defined in hyperv_net.h, the NVSP_STAT_SUCCESS is one not zero.
> Some functions returns 0 when it actually means NVSP_STAT_SUCCESS.
> This patch fixes them.
> 
> In netvsc_receive(), it puts the last RNDIS packet's receive status
> for all packets in a vmxferpage which may contain multiple RNDIS
> packets.
> This patch puts NVSP_STAT_FAIL in the receive completion if one of
> the packets in a vmxferpage fails.

This patch changes the status field that is being reported back to
the Hyper-V host in the receive completion message in
enq_receive_complete().   The current code reports 0 on success,
and with the patch, it will report 1 on success.  So does this change
affect anything on the Hyper-V side?  Or is Hyper-V just ignoring
the value?   If this change doesn't have any impact on the
interactions with Hyper-V, perhaps it would be good to explain
why in the commit message.

Michael

> 
> Signed-off-by: Haiyang Zhang 
> ---
>  drivers/net/hyperv/netvsc.c   | 8 ++--
>  drivers/net/hyperv/netvsc_drv.c   | 2 +-
>  drivers/net/hyperv/rndis_filter.c | 4 ++--
>  3 files changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
> index aa95e81af6e5..1ddb2c39b6e4 100644
> --- a/drivers/net/hyperv/netvsc.c
> +++ b/drivers/net/hyperv/netvsc.c
> @@ -1098,12 +1098,16 @@ static int netvsc_receive(struct net_device *ndev,
>   void *data = recv_buf
>   + vmxferpage_packet->ranges[i].byte_offset;
>   u32 buflen = vmxferpage_packet->ranges[i].byte_count;
> + int ret;
> 
>   trace_rndis_recv(ndev, q_idx, data);
> 
>   /* Pass it to the upper layer */
> - status = rndis_filter_receive(ndev, net_device,
> -   channel, data, buflen);
> + ret = rndis_filter_receive(ndev, net_device,
> +channel, data, buflen);
> +
> + if (unlikely(ret != NVSP_STAT_SUCCESS))
> + status = NVSP_STAT_FAIL;
>   }
> 
>   enq_receive_complete(ndev, net_device, q_idx,
> diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
> index cdb78eefab67..33607995be62 100644
> --- a/drivers/net/hyperv/netvsc_drv.c
> +++ b/drivers/net/hyperv/netvsc_drv.c
> @@ -818,7 +818,7 @@ int netvsc_recv_callback(struct net_device *net,
>   u64_stats_update_end(_stats->syncp);
> 
>   napi_gro_receive(>napi, skb);
> - return 0;
> + return NVSP_STAT_SUCCESS;
>  }
> 
>  static void netvsc_get_drvinfo(struct net_device *net,
> diff --git a/drivers/net/hyperv/rndis_filter.c 
> b/drivers/net/hyperv/rndis_filter.c
> index 2dc00f714482..591fb8080f11 100644
> --- a/drivers/net/hyperv/rndis_filter.c
> +++ b/drivers/net/hyperv/rndis_filter.c
> @@ -443,10 +443,10 @@ int rndis_filter_receive(struct net_device *ndev,
>   "unhandled rndis message (type %u len %u)\n",
>  rndis_msg->ndis_msg_type,
>  rndis_msg->msg_len);
> - break;
> + return NVSP_STAT_FAIL;
>   }
> 
> - return 0;
> + return NVSP_STAT_SUCCESS;
>  }
> 
>  static int rndis_filter_query_device(struct rndis_device *dev,
> --
> 2.15.1



RE: [PATCH] vmbus: use put_device() if device_register fail

2018-03-18 Thread Michael Kelley (EOSG)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org  
> On Behalf
> Of KY Srinivasan
> Sent: Sunday, March 18, 2018 8:02 PM
> To: Arvind Yadav ; Stephen Hemminger
> ; Haiyang Zhang 
> Cc: de...@linuxdriverproject.org; linux-kernel@vger.kernel.org
> Subject: RE: [PATCH] vmbus: use put_device() if device_register fail
> 
> > -Original Message-
> > From: devel  On Behalf
> > Of Arvind Yadav
> > Sent: Saturday, March 17, 2018 11:48 PM
> > To: Stephen Hemminger ; Haiyang Zhang
> > 
> > Cc: de...@linuxdriverproject.org; linux-kernel@vger.kernel.org
> > Subject: [PATCH] vmbus: use put_device() if device_register fail
> >
> > if device_register() returned an error. Always use put_device()
> > to give up the reference initialized.
> >
> > Signed-off-by: Arvind Yadav 
> > ---
> >  drivers/hv/vmbus_drv.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> > index bc65c4d..25da2f3 100644
> > --- a/drivers/hv/vmbus_drv.c
> > +++ b/drivers/hv/vmbus_drv.c
> > @@ -1358,6 +1358,7 @@ int vmbus_device_register(struct hv_device
> > *child_device_obj)
> > ret = device_register(_device_obj->device);
> > if (ret) {
> > pr_err("Unable to register child device\n");
> > +   put_device(_device_obj->device);
> 
> If the registration failed; we would not have acquired the reference on the 
> device and so
> we should not be dropping the reference in the failure path.

Looking at the code for device_register() and its constituent parts
device_initialize() and device_add(), Arvind is correct.  device_initialize()
creates the object with a reference count of 1.  device_add() does not
decrement that reference count or free the object, even if it fails.  The
comments for device_register() call this out as well.

Michael  

> 
> K. Y
> > return ret;
> > }
> >
> > --
> > 2.7.4


RE: [PATCH] vmbus: use put_device() if device_register fail

2018-03-18 Thread Michael Kelley (EOSG)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org  
> On Behalf
> Of KY Srinivasan
> Sent: Sunday, March 18, 2018 8:02 PM
> To: Arvind Yadav ; Stephen Hemminger
> ; Haiyang Zhang 
> Cc: de...@linuxdriverproject.org; linux-kernel@vger.kernel.org
> Subject: RE: [PATCH] vmbus: use put_device() if device_register fail
> 
> > -Original Message-
> > From: devel  On Behalf
> > Of Arvind Yadav
> > Sent: Saturday, March 17, 2018 11:48 PM
> > To: Stephen Hemminger ; Haiyang Zhang
> > 
> > Cc: de...@linuxdriverproject.org; linux-kernel@vger.kernel.org
> > Subject: [PATCH] vmbus: use put_device() if device_register fail
> >
> > if device_register() returned an error. Always use put_device()
> > to give up the reference initialized.
> >
> > Signed-off-by: Arvind Yadav 
> > ---
> >  drivers/hv/vmbus_drv.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> > index bc65c4d..25da2f3 100644
> > --- a/drivers/hv/vmbus_drv.c
> > +++ b/drivers/hv/vmbus_drv.c
> > @@ -1358,6 +1358,7 @@ int vmbus_device_register(struct hv_device
> > *child_device_obj)
> > ret = device_register(_device_obj->device);
> > if (ret) {
> > pr_err("Unable to register child device\n");
> > +   put_device(_device_obj->device);
> 
> If the registration failed; we would not have acquired the reference on the 
> device and so
> we should not be dropping the reference in the failure path.

Looking at the code for device_register() and its constituent parts
device_initialize() and device_add(), Arvind is correct.  device_initialize()
creates the object with a reference count of 1.  device_add() does not
decrement that reference count or free the object, even if it fails.  The
comments for device_register() call this out as well.

Michael  

> 
> K. Y
> > return ret;
> > }
> >
> > --
> > 2.7.4


RE: [PATCH 2/2] pci: host: pci-hyperv: Replace GFP_ATOMIC with GFP_KERNEL in new_pcichild_device

2018-03-18 Thread Michael Kelley (EOSG)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org  
> On Behalf
> Of Jia-Ju Bai
> Sent: Sunday, March 18, 2018 7:53 AM
> To: KY Srinivasan ; Haiyang Zhang 
> ; Stephen
> Hemminger ; bhelg...@google.com
> Cc: de...@linuxdriverproject.org; linux-...@vger.kernel.org; 
> linux-kernel@vger.kernel.org;
> Jia-Ju Bai 
> Subject: [PATCH 2/2] pci: host: pci-hyperv: Replace GFP_ATOMIC with 
> GFP_KERNEL in
> new_pcichild_device
> 
> new_pcichild_device() is not called in atomic context.
> 
> The call chain ending up at new_pcichild_device() is:
> [1] new_pcichild_device() <- pci_devices_present_work()
> pci_devices_present_work() is only set in INIT_WORK().
> 
> Despite never getting called from atomic context,
> new_pcichild_device() calls kzalloc with GFP_ATOMIC,
> which waits busily for allocation.
> GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL
> to avoid busy waiting.
> 
> This is found by a static analysis tool named DCNS written by myself.
> 
> Signed-off-by: Jia-Ju Bai 

Reviewed-by: Michael Kelley 

> ---
>  drivers/pci/host/pci-hyperv.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/host/pci-hyperv.c b/drivers/pci/host/pci-hyperv.c
> index 0fe3ea1..289e31d 100644
> --- a/drivers/pci/host/pci-hyperv.c
> +++ b/drivers/pci/host/pci-hyperv.c
> @@ -1500,7 +1500,7 @@ static struct hv_pci_dev *new_pcichild_device(struct
> hv_pcibus_device *hbus,
>   unsigned long flags;
>   int ret;
> 
> - hpdev = kzalloc(sizeof(*hpdev), GFP_ATOMIC);
> + hpdev = kzalloc(sizeof(*hpdev), GFP_KERNEL);
>   if (!hpdev)
>   return NULL;
> 
> --
> 1.9.1



RE: [PATCH 2/2] pci: host: pci-hyperv: Replace GFP_ATOMIC with GFP_KERNEL in new_pcichild_device

2018-03-18 Thread Michael Kelley (EOSG)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org  
> On Behalf
> Of Jia-Ju Bai
> Sent: Sunday, March 18, 2018 7:53 AM
> To: KY Srinivasan ; Haiyang Zhang 
> ; Stephen
> Hemminger ; bhelg...@google.com
> Cc: de...@linuxdriverproject.org; linux-...@vger.kernel.org; 
> linux-kernel@vger.kernel.org;
> Jia-Ju Bai 
> Subject: [PATCH 2/2] pci: host: pci-hyperv: Replace GFP_ATOMIC with 
> GFP_KERNEL in
> new_pcichild_device
> 
> new_pcichild_device() is not called in atomic context.
> 
> The call chain ending up at new_pcichild_device() is:
> [1] new_pcichild_device() <- pci_devices_present_work()
> pci_devices_present_work() is only set in INIT_WORK().
> 
> Despite never getting called from atomic context,
> new_pcichild_device() calls kzalloc with GFP_ATOMIC,
> which waits busily for allocation.
> GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL
> to avoid busy waiting.
> 
> This is found by a static analysis tool named DCNS written by myself.
> 
> Signed-off-by: Jia-Ju Bai 

Reviewed-by: Michael Kelley 

> ---
>  drivers/pci/host/pci-hyperv.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/host/pci-hyperv.c b/drivers/pci/host/pci-hyperv.c
> index 0fe3ea1..289e31d 100644
> --- a/drivers/pci/host/pci-hyperv.c
> +++ b/drivers/pci/host/pci-hyperv.c
> @@ -1500,7 +1500,7 @@ static struct hv_pci_dev *new_pcichild_device(struct
> hv_pcibus_device *hbus,
>   unsigned long flags;
>   int ret;
> 
> - hpdev = kzalloc(sizeof(*hpdev), GFP_ATOMIC);
> + hpdev = kzalloc(sizeof(*hpdev), GFP_KERNEL);
>   if (!hpdev)
>   return NULL;
> 
> --
> 1.9.1



  1   2   >