RE: [PATCH v3 2/2] locking/pvqspinlock, hv: Enable PV qspinlock for Hyper-V
From: Yi Sun  Sent: Wednesday, September 26, 2018 11:02 PM

> Follow PV spinlock mechanism to implement the callback functions
> to allow the CPU idling and kicking operations on Hyper-V.
>
> Cc: "K. Y. Srinivasan"
> Cc: Haiyang Zhang
> Cc: Stephen Hemminger
> Cc: Thomas Gleixner
> Cc: Michael Kelley (EOSG)
> Signed-off-by: Yi Sun
> ---

Reviewed-by: Michael Kelley
RE: [PATCH V3 10/13] x86/hyper-v: Add HvFlushGuestAddressList hypercall support
From: Tianyu Lan  Sent: Wednesday, September 26, 2018 8:50 PM

> Hyper-V provides HvFlushGuestAddressList() hypercall to flush EPT tlb
> with specified ranges. This patch is to add the hypercall support.
>
> Signed-off-by: Lan Tianyu

Looks good!

Reviewed-by: Michael Kelley
RE: [PATCH -next] x86/hyper-v: Remove unused including
From: YueHaibing  Sent: Sunday, September 23, 2018 1:20 AM

> Remove including that don't need it.
>
> Signed-off-by: YueHaibing
> ---
>  arch/x86/hyperv/hv_apic.c | 1 -
>  1 file changed, 1 deletion(-)

Reviewed-by: Michael Kelley
RE: [PATCH v2 2/2] locking/pvqspinlock, hv: Enable PV qspinlock for Hyper-V
From: Yi Sun  Sent: Friday, September 21, 2018 12:25 AM

> +
> +#define pr_fmt(fmt) "hv: " fmt

Other Hyper-V messages use "Hyper-V: " as the prefix, not "hv: ". Take
a quick look at 'dmesg' output for reference.

> +
> +#include
> +#include
> +#include
> +#include
> +#include

Some of these #includes look like they might be leftovers from some
other code. Please check whether kernel_stat.h, debugfs.h, log2.h, and
gfp.h are actually needed.

> +static void hv_qlock_wait(u8 *byte, u8 val)
> +{
> +	unsigned long msr_val;
> +
> +	if (READ_ONCE(*byte) != val)
> +		return;
> +
> +	/*
> +	 * Read HV_X64_MSR_GUEST_IDLE MSR can trigger the guest's
> +	 * transition to the idle power state which can be exited
> +	 * by an IPI even if IF flag is disabled.
> +	 */
> +	if (ms_hyperv.features & HV_X64_MSR_GUEST_IDLE_AVAILABLE)

I can't see a case where this test is actually needed. hv_qlock_wait()
can only get called if the flag is set when hv_init_spinlocks() is run,
and the flag value doesn't change after it is set.

> +		rdmsrl(HV_X64_MSR_GUEST_IDLE, msr_val);
> +}

Michael
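[Editor's sketch] To illustrate the point about the redundant check, here is a hedged userspace sketch of hv_qlock_wait() with the feature re-test dropped. The MSR read is stubbed with a counter so the control flow can be verified; rdmsrl_stub, guest_idle_reads, and demo are illustrative stand-ins, not kernel API.

```c
#include <stdint.h>

/* Stand-in for rdmsrl(HV_X64_MSR_GUEST_IDLE, msr_val): count the reads
 * so the control flow can be checked in userspace. */
static int guest_idle_reads;

static void rdmsrl_stub(void)
{
	guest_idle_reads++;
}

/* hv_qlock_wait() without the HV_X64_MSR_GUEST_IDLE_AVAILABLE re-check:
 * the feature bit was already tested once when the pv ops were wired up
 * in hv_init_spinlocks(), and it never changes afterward. */
static void hv_qlock_wait(uint8_t *byte, uint8_t val)
{
	if (*byte != val)	/* READ_ONCE() in the kernel */
		return;

	/*
	 * Reading the guest-idle MSR transitions the vCPU to the idle
	 * power state; an IPI exits that state even with IF clear.
	 */
	rdmsrl_stub();
}

/* Self-check: only the still-locked case should touch the MSR. */
static int demo(void)
{
	uint8_t byte = 1;

	guest_idle_reads = 0;
	hv_qlock_wait(&byte, 0);	/* lock byte already changed: no MSR read */
	hv_qlock_wait(&byte, 1);	/* still waiting: enter guest idle once */
	return guest_idle_reads;
}
```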
RE: [PATCH v2 1/2] X86/Hyper-V: Add Guest IDLE MSR support
From: Yi Sun  Sent: Friday, September 21, 2018 12:25 AM

> Hyper-V may expose a HV_X64_MSR_GUEST_IDLE MSR. We can read
> HYPERV_CPUID_FEATURES eax to check it. Read this MSR can
> trigger the guest's transition to the idle power state which
> can be exited by an IPI even if IF flag is disabled.
>
> Cc: "K. Y. Srinivasan"
> Cc: Haiyang Zhang
> Cc: Stephen Hemminger
> Signed-off-by: Yi Sun
> ---
>  arch/x86/include/asm/hyperv-tlfs.h | 5 +
>  1 file changed, 5 insertions(+)

Reviewed-by: Michael Kelley
RE: [PATCH V2 4/13] KVM/MMU: Flush tlb directly in the kvm_handle_hva_range()
From: Tianyu Lan  Sent: Thursday, September 20, 2018 7:30 AM

> On 9/20/2018 12:08 AM, Michael Kelley (EOSG) wrote:
> > From: Tianyu Lan  Sent: Monday, September 17, 2018 8:19 PM
> >> +
> >> +	if (ret && kvm_available_flush_tlb_with_range()) {
> >> +		kvm_flush_remote_tlbs_with_address(kvm,
> >> +				gfn_start,
> >> +				gfn_end - gfn_start);
> >
> > Does the above need to be gfn_end - gfn_start + 1?
>
> The flush range depends on the input parameter frame start and frame end
> of for_each_slot_rmap_range().
>
>	for_each_slot_rmap_range(memslot, PT_PAGE_TABLE_LEVEL,
>			PT_MAX_HUGEPAGE_LEVEL,
>			gfn_start, gfn_end - 1, )
>		ret |= handler(kvm, iterator.rmap, memslot,
>			       iterator.gfn, iterator.level, data);
>
> The start is "gfn_start" and the end is "gfn_end - 1". The flush size is
> (gfn_end - 1) - gfn_start + 1 = gfn_end - gfn_start.

Got it. I agree.

Michael
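[Editor's sketch] The inclusive-range arithmetic above can be double-checked with a trivial standalone helper; the function name is illustrative, not from the patch.

```c
#include <stdint.h>

/* for_each_slot_rmap_range() is given the inclusive range
 * [gfn_start, gfn_end - 1], so the number of frames flushed is
 * (gfn_end - 1) - gfn_start + 1, which simplifies to
 * gfn_end - gfn_start -- no "+ 1" needed. */
static uint64_t flush_size(uint64_t gfn_start, uint64_t gfn_end)
{
	uint64_t last = gfn_end - 1;	/* inclusive upper bound */

	return last - gfn_start + 1;
}
```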
RE: [PATCH V2 10/13] x86/hyper-v: Add HvFlushGuestAddressList hypercall support
From: Tianyu Lan  Sent: Monday, September 17, 2018 8:19 PM

>  #include
>  #include
>  #include
>  #include
> +#include

Hopefully asm/kvm_host.h does not need to be #included, given the new
code structure.

>
>  #include
>
> +/*
> + * MAX_FLUSH_PAGES = "additional_pages" + 1. It's limited
> + * by the bitwidth of "additional_pages" in union hv_gpa_page_range.
> + */
> +#define MAX_FLUSH_PAGES (2048)
> +
> +/*
> + * All input flush parameters are in single page. The max flush count
> + * is equal with how many entries of union hv_gpa_page_range can be
> + * populated in the input parameter page. MAX_FLUSH_REP_COUNT
> + * = (4096 - 16) / 8. ("Page Size" - "Address Space" - "Flags") /
> + * "GPA Range".
> + */
> +#define MAX_FLUSH_REP_COUNT (510)
> +

I would recommend putting the above two definitions in hyperv-tlfs.h.
They are directly tied to the data structures defined by Hyper-V in the
TLFS. Put MAX_FLUSH_PAGES immediately after the definition for
hv_gpa_page_range so that the dependency is obvious.

For MAX_FLUSH_REP_COUNT, can you do the calculation in the #define
rather than just in the comment? Alternatively, define the gpa_list[]
array to be of MAX_FLUSH_REP_COUNT size, and then add a compile-time
assert that the size of struct hv_guest_mapping_flush_list is exactly
one page. It's just a good way to use the compiler to help check for
mistakes.

Also prefix them both with HV_ since they will be more globally visible
as part of hyperv-tlfs.h.

>  int hyperv_flush_guest_mapping(u64 as)
>  {
>  	struct hv_guest_mapping_flush **flush_pcpu;
> @@ -54,3 +71,89 @@ int hyperv_flush_guest_mapping(u64 as)
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(hyperv_flush_guest_mapping);
> +
> +static int fill_flush_list(union hv_gpa_page_range gpa_list[],
> +		int offset, u64 start_gfn, u64 pages)
> +{
> +	int gpa_n = offset;
> +	u64 cur = start_gfn;
> +	u64 additional_pages;
> +
> +	do {
> +		if (gpa_n >= MAX_FLUSH_REP_COUNT) {
> +			pr_warn("Request exceeds HvFlushGuestList max flush count.");
> +			return -ENOSPC;

I wonder if the warning is really needed. When the error is returned up
through the higher levels of code, won't the higher levels just fall
back to the non-enlightened flush code? So nothing actually goes wrong;
it's just that a slower code path gets taken. A comment about such
expectations might be helpful.

> +		}
> +
> +		if (pages > MAX_FLUSH_PAGES) {
> +			additional_pages = MAX_FLUSH_PAGES - 1;
> +			pages -= MAX_FLUSH_PAGES;
> +		} else {
> +			additional_pages = pages - 1;
> +			pages = 0;
> +		}

The above code is really doing:

	additional_pages = min(pages, MAX_FLUSH_PAGES) - 1;
	pages -= additional_pages + 1;

And you might want to move the decrement of 'pages' down to the bottom
of the loop where you update the other loop variables.

> +
> +		gpa_list[gpa_n].page.additional_pages = additional_pages;
> +		gpa_list[gpa_n].page.largepage = false;
> +		gpa_list[gpa_n].page.basepfn = cur;
> +
> +		cur += additional_pages + 1;
> +		gpa_n++;
> +	} while (pages > 0);
> +
> +	return gpa_n;
> +}
> +
> +int hyperv_flush_guest_mapping_range(u64 as, struct hyperv_tlb_range *range)
> +{
> +	struct hv_guest_mapping_flush_list **flush_pcpu;
> +	struct hv_guest_mapping_flush_list *flush;
> +	u64 status = 0;
> +	unsigned long flags;
> +	int ret = -ENOTSUPP;
> +	int gpa_n = 0;
> +
> +	if (!hv_hypercall_pg)
> +		goto fault;
> +
> +	local_irq_save(flags);
> +
> +	flush_pcpu = (struct hv_guest_mapping_flush_list **)
> +		this_cpu_ptr(hyperv_pcpu_input_arg);
> +
> +	flush = *flush_pcpu;
> +	if (unlikely(!flush)) {
> +		local_irq_restore(flags);
> +		goto fault;
> +	}
> +
> +	flush->address_space = as;
> +	flush->flags = 0;
> +
> +	if (!range->flush_list)
> +		gpa_n = fill_flush_list(flush->gpa_list, gpa_n,
> +				range->start_gfn, range->pages);
> +	else if (range->parse_flush_list_func)
> +		gpa_n = range->parse_flush_list_func(flush->gpa_list, gpa_n,
> +				range->flush_list, fill_flush_list);
> +	else
> +		gpa_n = -1;
> +
> +	if (gpa_n < 0) {
> +		local_irq_restore(flags);
> +		goto fault;
> +	}
> +
> +	status = hv_do_rep_hypercall(HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST,
> +				     gpa_n, 0, flush, NULL);
> +
> +	local_irq_restore(flags);
> +
> +	if (!(status & HV_HYPERCALL_RESULT_MASK))
> +		ret = 0;
> +	else
> +		ret = status;
> +fault:
> +	return ret;
> +}
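[Editor's sketch] Pulling the review's suggestions together: a hedged userspace sketch in which the HV_-prefixed names and the derived #define are the review's proposal, not the posted patch, and struct gpa_entry is a simplified stand-in for union hv_gpa_page_range.

```c
#include <stdint.h>

/* One entry covers at most additional_pages + 1 = 2048 frames, limited
 * by the bitwidth of "additional_pages" in the TLFS entry layout. */
#define HV_MAX_FLUSH_PAGES	2048

/* Derive the rep count from the page layout instead of hard-coding 510:
 * (page size - address_space - flags) / sizeof(one GPA range entry),
 * i.e. (4096 - 16) / 8 = 510. */
#define HV_MAX_FLUSH_REP_COUNT \
	((4096 - 2 * sizeof(uint64_t)) / sizeof(uint64_t))

/* Simplified stand-in for union hv_gpa_page_range. */
struct gpa_entry {
	uint64_t basepfn;
	uint64_t additional_pages;
};

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

static int fill_flush_list(struct gpa_entry gpa_list[], int gpa_n,
			   uint64_t start_gfn, uint64_t pages)
{
	uint64_t cur = start_gfn;

	do {
		/* Caller is expected to fall back to a full flush. */
		if (gpa_n >= (int)HV_MAX_FLUSH_REP_COUNT)
			return -1;

		/* Equivalent to the patch's if/else. */
		uint64_t additional_pages =
			min_u64(pages, HV_MAX_FLUSH_PAGES) - 1;

		gpa_list[gpa_n].basepfn = cur;
		gpa_list[gpa_n].additional_pages = additional_pages;

		/* All loop variables updated together at the bottom. */
		cur += additional_pages + 1;
		pages -= additional_pages + 1;
		gpa_n++;
	} while (pages > 0);

	return gpa_n;
}

/* Self-check: entry count for a flush starting at gfn 0. */
static int demo_entries(uint64_t pages)
{
	struct gpa_entry list[HV_MAX_FLUSH_REP_COUNT];

	return fill_flush_list(list, 0, 0, pages);
}
```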
RE: [PATCH V2 4/13] KVM/MMU: Flush tlb directly in the kvm_handle_hva_range()
From: Tianyu Lan  Sent: Monday, September 17, 2018 8:19 PM

> +
> +	if (ret && kvm_available_flush_tlb_with_range()) {
> +		kvm_flush_remote_tlbs_with_address(kvm,
> +				gfn_start,
> +				gfn_end - gfn_start);

Does the above need to be gfn_end - gfn_start + 1?

> +		ret = 0;
> +	}

Michael
RE: [PATCH V2 2/13] KVM/MMU: Add tlb flush with range helper function
From: Tianyu Lan  Sent: Monday, September 17, 2018 8:18 PM

> +static void kvm_flush_remote_tlbs_with_range(struct kvm *kvm,
> +		struct kvm_tlb_range *range)
> +{
> +	int ret = -ENOTSUPP;
> +
> +	if (range && kvm_x86_ops->tlb_remote_flush_with_range) {
> +		/*
> +		 * Read tlbs_dirty before setting KVM_REQ_TLB_FLUSH in
> +		 * kvm_make_all_cpus_request.
> +		 */
> +		long dirty_count = smp_load_acquire(&kvm->tlbs_dirty);
> +
> +		ret = kvm_x86_ops->tlb_remote_flush_with_range(kvm, range);
> +		cmpxchg(&kvm->tlbs_dirty, dirty_count, 0);
> +	}

The comment and the code that manipulate kvm->tlbs_dirty appear to have
been copied from kvm_flush_remote_tlbs(). But the above code doesn't
call kvm_make_all_cpus_request(). I haven't traced all the details, but
it seems like the comment should be updated, or the code isn't needed.

Michael
RE: [PATCH] x86/hyperv: suppress "PCI: Fatal: No config space access function found"
From: Dexuan Cui  Sent: Tuesday, September 18, 2018 3:30 PM

> A Generatin-2 Linux VM on Hyper-V doesn't have the legacy PCI bus, and
> users always see the scary warning, which is actually harmless. The patch
> is made to suppress the warning.
>
> Signed-off-by: Dexuan Cui
> Cc: K. Y. Srinivasan
> Cc: Haiyang Zhang
> Cc: Stephen Hemminger
> ---
>  arch/x86/hyperv/hv_init.c | 19 +++
>  1 file changed, 19 insertions(+)

Reviewed-by: Michael Kelley
RE: [PATCH v1 0/3] Enable PV qspinlock for Hyper-V
From: Yi Sun  Sent: Thursday, September 13, 2018 2:13 AM

> This patch adds the necessary Hyper-V specific code to allow
> PV qspinlock work on Hyper-V.

Have you done any performance measurements with this new code, so that
we know whether there is any improvement, or even potentially any
degradation in some circumstances?

Michael
RE: [PATCH] Drivers: hv: vmbus: include header for get_irq_regs()
From: Sebastian Andrzej Siewior  Sent: Thursday, August 30, 2018 12:55 AM

> On !RT the header file get_irq_regs() gets pulled in via other header
> files. On RT it does not and the build fails:
>
>   drivers/hv/vmbus_drv.c:975 implicit declaration of function ‘get_irq_regs’ [-Werror=implicit-function-declaration]
>   drivers/hv/hv.c:115 implicit declaration of function ‘get_irq_regs’ [-Werror=implicit-function-declaration]
>
> Add the header file for get_irq_regs() in a common header so it used by
> vmbus_drv.c by hv.c for their get_irq_regs() usage.

get_irq_regs() is not used explicitly in either vmbus_drv.c or in hv.c.
And I couldn't make the line numbers in the errors above line up with
anything in the source code that might be implicitly using
get_irq_regs(). Is it the calls to add_interrupt_randomness()? Did you
figure out exactly what line of code is causing the compile error?

I'm wondering whether adding the #include of irq.h into hyperv_vmbus.h
is really the right solution. More correct might be to have the file
where get_irq_regs() is actually used #include irq_regs.h.

Michael
RE: [PATCH 10/13] x86/hyper-v: Add HvFlushGuestAddressList hypercall support
From: Tianyu Lan  Sent: Monday, September 10, 2018 1:39 AM

> +
> +int hyperv_flush_guest_mapping_range(u64 as, struct kvm_tlb_range *range)

I'm really concerned about defining the Hyper-V function to flush guest
mappings in terms of a KVM struct definition. Your patch puts this
function in arch/x86/hyperv/nested.c. I haven't investigated all the
details, but on its face this approach seems like it would cause
trouble in the long run, and it doesn't support the case of a
hypervisor other than KVM running at L1.

I know that KVM code has taken a dependency on Hyper-V types and code,
but that's because KVM is emulating a lot of Hyper-V functionality and
it's taking advantage of Hyper-V enlightenments. Is there a top level
reason I haven't thought of for Hyper-V code to take a dependency on
KVM definitions? I would think we want Hyper-V code to be generic,
using Hyper-V data structure definitions. Then in keeping with what's
already been done, KVM code would use those definitions where it needs
to make calls to Hyper-V code.

> +{
> +	struct kvm_mmu_page *sp;
> +	struct hv_guest_mapping_flush_list **flush_pcpu;
> +	struct hv_guest_mapping_flush_list *flush;
> +	u64 status = 0;
> +	unsigned long flags;
> +	int ret = -ENOTSUPP;
> +	int gpa_n = 0;
> +
> +	if (!hv_hypercall_pg)
> +		goto fault;
> +
> +	local_irq_save(flags);
> +
> +	flush_pcpu = (struct hv_guest_mapping_flush_list **)
> +		this_cpu_ptr(hyperv_pcpu_input_arg);
> +
> +	flush = *flush_pcpu;
> +	if (unlikely(!flush)) {
> +		local_irq_restore(flags);
> +		goto fault;
> +	}
> +
> +	flush->address_space = as;
> +	flush->flags = 0;
> +
> +	if (!range->flush_list) {
> +		gpa_n = fill_flush_list(flush->gpa_list, gpa_n,
> +				range->start_gfn, range->end_gfn);
> +	} else {
> +		list_for_each_entry(sp, range->flush_list,
> +				flush_link) {
> +			u64 end_gfn = sp->gfn +
> +				KVM_PAGES_PER_HPAGE(sp->role.level) - 1;
> +			gpa_n = fill_flush_list(flush->gpa_list, gpa_n,
> +					sp->gfn, end_gfn);
> +		}

Per the previous comment, if this loop really needs to walk a KVM data
structure, look for a different way to organize things so that the
handling of KVM-specific data structures is in code that's part of KVM,
rather than in Hyper-V code.

Michael
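[Editor's sketch] One way to break the dependency: the Hyper-V side takes an opaque list plus a parser callback, and the KVM side supplies the walker. This mirrors the parse_flush_list_func approach the V2 patch later adopted; all names here are illustrative, and the fake list replaces the real kvm_mmu_page walk.

```c
#include <stdint.h>

/* --- Hyper-V side: knows nothing about KVM ------------------------- */

typedef int (*fill_fn)(uint64_t gpa_list[], int offset,
		       uint64_t start_gfn, uint64_t pages);

struct tlb_range {
	void *flush_list;		/* opaque to Hyper-V code */
	int (*parse_flush_list_func)(uint64_t gpa_list[], int offset,
				     void *flush_list, fill_fn fill);
	uint64_t start_gfn;
	uint64_t pages;
};

/* Toy fill: record one start gfn per entry (a real version would build
 * union hv_gpa_page_range entries). */
static int fill_one(uint64_t gpa_list[], int offset,
		    uint64_t start_gfn, uint64_t pages)
{
	(void)pages;
	gpa_list[offset] = start_gfn;
	return offset + 1;
}

static int flush_range(struct tlb_range *range, uint64_t gpa_list[])
{
	if (!range->flush_list)
		return fill_one(gpa_list, 0, range->start_gfn, range->pages);
	if (range->parse_flush_list_func)
		return range->parse_flush_list_func(gpa_list, 0,
						    range->flush_list,
						    fill_one);
	return -1;
}

/* --- KVM side: owns the walk over its own data structure ----------- */

struct fake_sp_list {
	uint64_t gfns[4];
	int n;
};

static int kvm_parse_flush_list(uint64_t gpa_list[], int offset,
				void *flush_list, fill_fn fill)
{
	struct fake_sp_list *l = flush_list;

	for (int i = 0; i < l->n; i++)
		offset = fill(gpa_list, offset, l->gfns[i], 1);
	return offset;
}

/* Self-check: three list entries produce three GPA ranges. */
static int demo(void)
{
	uint64_t out[8];
	struct fake_sp_list l = { { 5, 7, 9 }, 3 };
	struct tlb_range r = { &l, kvm_parse_flush_list, 0, 0 };

	int n = flush_range(&r, out);
	return n == 3 && out[0] == 5 && out[2] == 9;
}
```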
RE: [PATCH v2 2/4] arm64: hyperv: Add support for Hyper-V as a hypervisor
From: KY Srinivasan  Sent: Thursday, August 30, 2018 11:51 AM

> > +	/* Allocate percpu VP index */
> > +	hv_vp_index = kmalloc_array(num_possible_cpus(), sizeof(*hv_vp_index),
> > +				    GFP_KERNEL);
> > +	if (!hv_vp_index)
> > +		return 1;
> > +
>
> We should perhaps set the array so the contents are invalid so we can
> correctly handle enlightenments for TLB shootdown and IPI.

Agreed. Will add the initialization in v3 of the patch.

Michael
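[Editor's sketch] A minimal userspace sketch of the agreed change: VP_INVAL, the helper name, and the malloc analog are mine (the kernel uses kmalloc_array with GFP_KERNEL).

```c
#include <stdint.h>
#include <stdlib.h>

#define VP_INVAL	UINT32_MAX	/* illustrative "no VP index yet" sentinel */

static uint32_t *alloc_vp_index(int ncpus)
{
	/* kmalloc_array(num_possible_cpus(), sizeof(*hv_vp_index), GFP_KERNEL) analog */
	uint32_t *vp = malloc(ncpus * sizeof(*vp));

	if (!vp)
		return NULL;

	/* Mark every slot invalid until that CPU registers its VP index,
	 * so TLB-shootdown/IPI enlightenments can skip unregistered CPUs. */
	for (int i = 0; i < ncpus; i++)
		vp[i] = VP_INVAL;
	return vp;
}

/* Self-check: every entry starts out as the sentinel. */
static int demo(void)
{
	uint32_t *vp = alloc_vp_index(4);
	int ok = vp && vp[0] == VP_INVAL && vp[3] == VP_INVAL;

	free(vp);
	return ok;
}
```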
RE: [PATCH v2 1/4] arm64: hyperv: Add core Hyper-V include files
From: KY Srinivasan  Sent: Thursday, August 30, 2018 11:23 AM

> > +/*
> > + * This file contains definitions from the Hyper-V Hypervisor Top-Level
> > + * Functional Specification (TLFS):
> > + * https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/reference/tlfs
> > + *
>
> A lot of TLFS definitions are ISA independent and we are duplicating
> these definitions both for X86_64 and ARM_64. Perhaps we should look
> at splitting this file into a common and ISA specific header file.

I agree that we want to end up with x86_64 and ARM64 ISA-dependent
files that include an ISA-independent file. My thinking was to not make
that separation now, for a couple of reasons:

1) We don't have a Hyper-V TLFS that is explicit about what should be
considered ISA independent and ISA dependent. I can make some
reasonable guesses, but it will be subject to change as the Hyper-V
team firms up the interface and decides what they want to commit to.

2) Some of the things defined in the TLFS have names that are
x86-specific (TSC, MSR, etc.). For the ISA-independent parts, those
names should be more generic, which is another dependency on the
Hyper-V team defining the ISA-independent parts of the TLFS.

My judgment was that we'll end up with less perturbation overall to go
with this cloned version of hyperv-tlfs.h for now, and then come back
and do the separation once we have a definitive TLFS to base it on. But
it's a judgment call, and if the sense is that we should do the
separation now, I can give it a try.

> > +#define HvRegisterHypervisorVersion		0x0100	/* CPUID 0x40000002 */
> > +#define HvRegisterPrivilegesAndFeaturesInfo	0x0200	/* CPUID 0x40000003 */
> > +#define HvRegisterFeaturesInfo			0x0201	/* CPUID 0x40000004 */
> > +#define HvRegisterImplementationLimitsInfo	0x0202	/* CPUID 0x40000005 */
> > +#define HvARM64RegisterInterfaceVersion		0x00090006	/* CPUID 0x40000001 */
>
> Can we avoid the mixed case names.

Agreed. I'll fix this throughout to use all uppercase, with underscore
as the word separator.

> > + * Linux-specific definitions for managing interactions with Microsoft's
> > + * Hyper-V hypervisor. Definitions that are specified in the Hyper-V
> > + * Top Level Functional Spec (TLFS) should not go in this file, but
> > + * should instead go in hyperv-tlfs.h.
>
> Would it make sense to breakup this header file into ISA independent
> and dependent files?

Yes, as above I agree the separation makes sense. And since this file
is tied to Linux and not to the Hyper-V TLFS, the separation isn't
affected by the TLFS issues mentioned above. I'll give it a try and see
if any issues arise.

> > +/*
> > + * Define the IRQ numbers/vectors used by Hyper-V VMbus interrupts
> > + * and by STIMER0 Direct Mode interrupts. Hyper-V should be supplying
> > + * these values through ACPI, but there are no other interrupting
> > + * devices in a Hyper-V VM on ARM64, so it's OK to hard code for now.
> > + * The "CALLBACK_VECTOR" terminology is a left-over from the x86/x64
> > + * world that is used in architecture independent Hyper-V code.
> > + */
>
> When we have direct device assignment for ARM-64 guests, can we still
> hardcode.

Yes, we can still hardcode. These values are in the Per-Processor
Interrupt (PPI) range of 16 to 31. Any IRQ numbers assigned to a
Discrete Device Assignment (DDA) device will be in the Shared
Peripheral Interrupt (SPI) range of 32-1019 or the Locality-specific
Peripheral Interrupt (LPI) range of greater than 8192. The handling of
DDA interrupts is still under discussion with the Hyper-V team, but
there won't be any conflicts with the PPI values that are hardcoded
here.

> > +/*
> > + * The guest OS needs to register the guest ID with the hypervisor.
> > + * The guest ID is a 64 bit entity and the structure of this ID is
> > + * specified in the Hyper-V specification:
> > + *
> > + * msdn.microsoft.com/en-us/library/windows/hardware/ff542653%28v=vs.85%29.aspx
> > + *
> > + * While the current guideline does not specify how Linux guest ID(s)
> > + * need to be generated, our plan is to publish the guidelines for
> > + * Linux and other guest operating systems that currently are hosted
> > + * on Hyper-V. The implementation here conforms to this yet
> > + * unpublished guidelines.
> > + *
> > + * Bit(s)
> > + * 63 - Indicates if the OS is Open Source or not; 1 is Open Source
> > + * 62:56 - Os Type; Linux is 0x100
> > + * 55:48 - Distro specific identification
> > + * 47:16 - Linux kernel version number
> > + * 15:0 - Distro specific identification
> > + *
> > + * Generate the guest ID based on the guideline described above.
> > + */
>
> No need to repeat the above block comment (already included in the
> TLFS header).

Agreed. Will make the change in v3 of the patch.

> > +/* Free the message slot and signal end-of-message if required */
> > +static inline void vmbus_signal_eom(struct hv_message *msg, u32 old_msg_type)
> > +{
> > +/*
> > +
RE: [PATCH v2 1/4] arm64: hyperv: Add core Hyper-V include files
From: KY Srinivasan Sent: Thursday, August 30, 2018 11:23 AM > > +/* > > + * This file contains definitions from the Hyper-V Hypervisor Top-Level > > + * Functional Specification (TLFS): > > + * > > https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/reference/tlfs > > + > A lot of TLFS definitions are ISA independent and we are duplicating these > definitions both for X86_64 and ARM_64. Perhaps we should look at splitting > this file into a common and ISA specific header file. I agree that we want to end up with x86_64 and ARM64 ISA dependent files that include an ISA independent file. My thinking was to not make that separation now, for a couple of reasons: 1) We don't have a Hyper-V TLFS that is explicit about what should be considered ISA independent and ISA dependent. I can make some reasonable guesses, but it will be subject to change as the Hyper-V team firms up the interface and decides what they want to commit to. 2) Some of the things defined in the TLFS have names that are x86-specific (TSC, MSR, etc.). For the ISA independent parts, those names should be more generic, which is another dependency on the Hyper-V team defining the ISA independent parts of the TLFS. My judgment was that we'll end up with less perturbation overall to go with this cloned version of hyperv-tlfs.h for now, and then come back and do the separation once we have a definitive TLFS to base it on. But it's a judgment call, and if the sense is that we should do the separation now, I can give it a try. > > +#define HvRegisterHypervisorVersion0x0100 /*CPUID > > 0x4002 */ > > +#defineHvRegisterPrivilegesAndFeaturesInfo0x0200 /*CPUID > > 0x4003 */ > > +#defineHvRegisterFeaturesInfo0x0201 > > /*CPUID 0x4004 */ > > +#defineHvRegisterImplementationLimitsInfo0x0202 /*CPUID > > 0x4005 */ > > +#define HvARM64RegisterInterfaceVersion0x00090006 /*CPUID > > 0x4001 */ > > Can we avoid the mixed case names. Agreed. 
I'll fix this throughout to use all uppercase, with underscore as the word separator. > > + * Linux-specific definitions for managing interactions with Microsoft's > > + * Hyper-V hypervisor. Definitions that are specified in the Hyper-V > > + * Top Level Functional Spec (TLFS) should not go in this file, but > > + * should instead go in hyperv-tlfs.h. > > Would it make sense to breakup this header file into ISA independent and > dependent files? Yes, as above I agree the separation makes sense. And since this file is tied to Linux and not to the Hyper-V TLFS, the separation isn't affected by the TLFS issues mentioned above. I'll give it a try and see if any issues arise. > > +/* > > + * Define the IRQ numbers/vectors used by Hyper-V VMbus interrupts > > + * and by STIMER0 Direct Mode interrupts. Hyper-V should be supplying > > + * these values through ACPI, but there are no other interrupting > > + * devices in a Hyper-V VM on ARM64, so it's OK to hard code for now. > > + * The "CALLBACK_VECTOR" terminology is a left-over from the x86/x64 > > + * world that is used in architecture independent Hyper-V code. > > + */ > When we have direct device assignment for ARM-64 guests, can we still > hardcode. Yes, we can still hardcode. These values are in the Per-Processor Interrupt (PPI) range of 16 to 31. Any IRQ numbers assigned to a Discrete Device Assignment (DDA) device will be in the Shared Peripheral Interrupt (SPI) range of 32-1019 or the Locality-specific Peripheral Interrupt (LPI) range of greater than 8192. The handling of DDA interrupts is still under discussion with the Hyper-V team, but there won't be any conflicts with the PPI values that are hardcoded here. > > +/* > > + * The guest OS needs to register the guest ID with the hypervisor. 
> > + * The guest ID is a 64 bit entity and the structure of this ID is > > + * specified in the Hyper-V specification: > > + * > > + * msdn.microsoft.com/en- > > us/library/windows/hardware/ff542653%28v=vs.85%29.aspx > > + * > > + * While the current guideline does not specify how Linux guest ID(s) > > + * need to be generated, our plan is to publish the guidelines for > > + * Linux and other guest operating systems that currently are hosted > > + * on Hyper-V. The implementation here conforms to this yet > > + * unpublished guidelines. > > + * > > + * > > + * Bit(s) > > + * 63 - Indicates if the OS is Open Source or not; 1 is Open Source > > + * 62:56 - Os Type; Linux is 0x100 > > + * 55:48 - Distro specific identification > > + * 47:16 - Linux kernel version number > > + * 15:0 - Distro specific identification > > + * > > + * Generate the guest ID based on the guideline described above. > > + */ > > No need to repeat the above block comment (already included in the TLFS > header). Agreed. Will make the change in v3 of the patch. > > +/* Free the message slot and signal end-of-message if required */ > > +static inline void vmbus_signal_eom(struct hv_message *msg, u32 > > old_msg_type) > > +{ > > +/* > > +
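The guest ID layout quoted above can be made concrete with a small packing helper. This is a hedged sketch only: the field positions follow the quoted comment, and the function name and masking details are illustrative rather than the kernel's actual macro.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch of the guest ID layout quoted above:
 *   63    - open source flag (1 = open source)
 *   62:56 - OS type
 *   55:48 - distro-specific info (part 1)
 *   47:16 - kernel version number
 *   15:0  - distro-specific info (part 2)
 * Names and masks are illustrative, not the kernel's actual macro.
 */
static uint64_t generate_guest_id(uint64_t os_type, uint64_t d_info1,
				  uint64_t kernel_version, uint64_t d_info2)
{
	return (1ULL << 63) |
	       ((os_type & 0x7f) << 56) |
	       ((d_info1 & 0xff) << 48) |
	       ((kernel_version & 0xffffffff) << 16) |
	       (d_info2 & 0xffff);
}
```

How a kernel version or distro ID is actually encoded into these fields is exactly what the yet-unpublished guidelines mentioned above would pin down.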
RE: [PATCH char-misc 1/1] Drivers: hv: vmbus: Make synic_initialized flag per-cpu
From: Vitaly Kuznetsov Sent: Wednesday, August 1, 2018 2:26 AM > > I was trying to decide if there are any arguments in favor of one > > approach vs. the other: a per-cpu flag in memory or checking > > the synic_control "enable" bit. Seems like a wash to me, in which > > case I have a slight preference for the per-cpu flag in memory vs. > > creating another function to return sctrl.enable. But I'm completely > > open to reasons why checking sctrl.enable is better. > > Just a few thoughts: reading MSR is definitely slower but we avoid > 'shadowing' the state, the reading is always correct. In case there's a > chance the SynIC will get disabled from host side we can only find this > out by doing MSR read. This is a purely theoretical possibility, I > believe, we can go ahead with this patch. Vitaly -- just to confirm: you are OK with the patch as is? (I'll check, but I may need to rebase on the latest code.) Michael
RE: [PATCH 2/5] vmbus: add driver_override support
From: Stephen Hemminger Sent: Tuesday, August 14, 2018 9:35 AM > On Mon, 13 Aug 2018 19:30:50 +0000 > "Michael Kelley (EOSG)" wrote: > > > > +/* > > > + * Return a matching hv_vmbus_device_id pointer. > > > + * If there is no match, return NULL. > > > + */ > > > +static const struct hv_vmbus_device_id *hv_vmbus_get_id(struct hv_driver > > > *drv, > > > + struct hv_device *dev) > > > +{ > > > + const uuid_le *guid = &dev->dev_type; > > > + const struct hv_vmbus_device_id *id; > > > > > > - return NULL; > > > + /* When driver_override is set, only bind to the matching driver */ > > > + if (dev->driver_override && strcmp(dev->driver_override, drv->name)) > > > + return NULL; > > > > This function needs to be covered by the device lock, so that > > dev->driver_override can't be set to NULL and the memory freed > > during the above 'if' statement. When called from vmbus_probe(), > > the device lock is held, so it's good. But when called from > > vmbus_match(), the device lock may not be held: consider the path > > __driver_attach() -> driver_match_device() -> vmbus_match(). > > The function hv_vmbus_get_id is called from that path. > i.e. __device_attach -> driver-match_device -> vmbus_match. > and __device_attach always does: > device_lock(dev); Agreed. The __device_attach() path holds the device lock and all is good. But the __driver_attach() path does not hold the device lock when the match function is called, leaving the code open to a potential race. Same problem could happen in the pci subsystem, so the issue is more generic and probably should be evaluated and dealt with separately. Michael > > The code in driver _override_store uses the same device_lock > when storing the new value. > > This is same locking as is done in pci-sysfs.c
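To illustrate the point about the race, the lock-protected match looks roughly like the following user-space sketch. The struct, names, and lock helpers are stand-ins, not the actual vmbus or driver-core code:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative stand-ins; fake_device_lock() plays the role of device_lock(). */
struct fake_dev {
	int locked;
	const char *driver_override;	/* may be freed/NULLed under the lock */
};

static void fake_device_lock(struct fake_dev *dev)   { dev->locked = 1; }
static void fake_device_unlock(struct fake_dev *dev) { dev->locked = 0; }

/*
 * Sketch of the safe pattern: hold the device lock across the override
 * check so driver_override cannot be freed mid-comparison, mirroring
 * what the __device_attach() path already guarantees.
 */
static int fake_match(struct fake_dev *dev, const char *drv_name)
{
	int match = 1;

	fake_device_lock(dev);
	if (dev->driver_override && strcmp(dev->driver_override, drv_name))
		match = 0;	/* override set, and it isn't this driver */
	fake_device_unlock(dev);
	return match;
}
```

The __driver_attach() path would need the same protection around its call into the match function, which is the generic issue noted above.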
RE: [PATCH 2/5] vmbus: add driver_override support
From: k...@linuxonhyperv.com Sent: Friday, August 10, 2018 4:06 PM > From: Stephen Hemminger > > Add support for overriding the default driver for a VMBus device > in the same way that it can be done for PCI devices. This patch > adds the /sys/bus/vmbus/devices/.../driver_override file > and the logic for matching. > > This is used by driverctl tool to do driver override. > https://gitlab.com/driverctl/driverctl > > Signed-off-by: Stephen Hemminger > Signed-off-by: K. Y. Srinivasan > --- > diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c > index b1b548a21f91..e6d8fdac6d8b 100644 > --- a/drivers/hv/vmbus_drv.c > +++ b/drivers/hv/vmbus_drv.c > @@ -498,6 +498,54 @@ static ssize_t device_show(struct device *dev, > } > static DEVICE_ATTR_RO(device); > > +static ssize_t driver_override_store(struct device *dev, > + struct device_attribute *attr, > + const char *buf, size_t count) > +{ > + struct hv_device *hv_dev = device_to_hv_device(dev); > + char *driver_override, *old, *cp; > + > + /* We need to keep extra room for a newline */ > + if (count >= (PAGE_SIZE - 1)) > + return -EINVAL; Does 'count' actually have a relationship to PAGE_SIZE, or is PAGE_SIZE just used as an arbitrary size limit? I'm wondering what happens on ARM64 with a 64K page size, for example. If it's just arbitrary, explicitly coding such a constant would be better. > +/* > + * Return a matching hv_vmbus_device_id pointer. > + * If there is no match, return NULL. > + */ > +static const struct hv_vmbus_device_id *hv_vmbus_get_id(struct hv_driver > *drv, > + struct hv_device *dev) > +{ > + const uuid_le *guid = &dev->dev_type; > + const struct hv_vmbus_device_id *id; > > - return NULL; > + /* When driver_override is set, only bind to the matching driver */ > + if (dev->driver_override && strcmp(dev->driver_override, drv->name)) > + return NULL; This function needs to be covered by the device lock, so that dev->driver_override can't be set to NULL and the memory freed during the above 'if' statement. 
When called from vmbus_probe(), the device lock is held, so it's good. But when called from vmbus_match(), the device lock may not be held: consider the path __driver_attach() -> driver_match_device() -> vmbus_match(). Michael
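If PAGE_SIZE really is just an arbitrary bound here, the store path could cap on a named constant instead. A user-space sketch with hypothetical names, and with the device locking omitted:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define OVERRIDE_MAX 4096	/* arbitrary fixed cap, independent of page size */

/*
 * Sketch of the store path with an explicit cap: copy the user buffer,
 * strip a trailing newline, install the new override string. Names are
 * hypothetical; the real function also takes the device lock.
 */
static char *store_override(char **slot, const char *buf, size_t count)
{
	char *s, *nl;

	if (count >= OVERRIDE_MAX - 1)	/* leave room for the NUL */
		return NULL;
	s = malloc(count + 1);
	if (!s)
		return NULL;
	memcpy(s, buf, count);
	s[count] = '\0';
	nl = strchr(s, '\n');
	if (nl)
		*nl = '\0';
	free(*slot);			/* drop any previous override */
	*slot = s;
	return s;
}
```

The behavior is unchanged on x86 (4K pages); the point is only that the limit no longer silently grows to 64K on an ARM64 kernel built with 64K pages.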
RE: [PATCH 1/5] Tools: hv: Fix a bug in the key delete code
From: k...@linuxonhyperv.com Sent: Friday, August 10, 2018 4:06 PM > > Fix a bug in the key delete code - the num_records range > from 0 to num_records-1. > > Signed-off-by: K. Y. Srinivasan > Reported-by: David Binderman > Cc: > --- Reviewed-by: Michael Kelley
RE: [PATCH char-misc 1/1] Drivers: hv: vmbus: Make synic_initialized flag per-cpu
From: Vitaly Kuznetsov Sent: Tuesday, July 31, 2018 4:20 AM > > Reviewed-by: Vitaly Kuznetsov Thanks for the review > > Alternatively, we can get rid of synic_initialized flag altogether: > hv_synic_init() never fails in the first place but we can always > implement something like: > > int hv_synic_is_initialized(void) { > union hv_synic_scontrol sctrl; > > hv_get_synic_state(sctrl.as_uint64); > > return sctrl.enable; > } > > as it doesn't seem that we need to check synic state on _other_ CPUs. > > -- > Vitaly I was trying to decide if there are any arguments in favor of one approach vs. the other: a per-cpu flag in memory or checking the synic_control "enable" bit. Seems like a wash to me, in which case I have a slight preference for the per-cpu flag in memory vs. creating another function to return sctrl.enable. But I'm completely open to reasons why checking sctrl.enable is better. Michael
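The tradeoff can be pictured with a toy model: the per-cpu flag is a shadow of the authoritative state, while re-reading the (here simulated) sctrl register always reflects it. Everything below is a stand-in, not actual hv code:

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-ins: the "register" is authoritative, the flag is a shadow. */
static uint64_t fake_scontrol;	/* bit 0 = SynIC enable */
static int synic_flag;		/* per-cpu flag approach (shadowed state) */

static void fake_synic_init(void)
{
	fake_scontrol |= 1;	/* enable in the "hardware" */
	synic_flag = 1;		/* and remember it in memory */
}

/* Register-read approach: always reflects the current state. */
static int synic_is_initialized(void)
{
	return (int)(fake_scontrol & 1);
}
```

If the enable bit were ever cleared from the host side, the in-memory flag would go stale while the register read would not, which is the (purely theoretical, per Vitaly) difference between the two approaches.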
RE: [PATCH 1/1] Drivers: HV: Send one page worth of kmsg dump over Hyper-V during panic
From: Sunil Muthuswamy Sent: Wednesday, July 11, 2018 9:59 AM > Thanks, Michael. In which branch should I fix these now that the changes have > been > merged with the char-misc-next branch? If the original code is already in char-misc-next, you should probably submit a completely new patch for char-misc-next that just makes the fixes to the original code. Michael
RE: [PATCH 1/1] Drivers: HV: Send one page worth of kmsg dump over Hyper-V during panic
From k...@linuxonhyperv.com Sent: Saturday, July 7, 2018 7:57 PM > > From: Sunil Muthuswamy > > In the VM mode on Hyper-V, currently, when the kernel panics, an error > code and few register values are populated in an MSR and the Hypervisor > notified. This information is collected on the host. The amount of > information currently collected is found to be limited and not very > actionable. To gather more actionable data, such as stack trace, the > proposal is to write one page worth of kmsg data on an allocated page > and the Hypervisor notified of the page address through the MSR. > > - Sysctl option to control the behavior, with ON by default. > > Cc: K. Y. Srinivasan > Cc: Stephen Hemminger > Signed-off-by: Sunil Muthuswamy > Signed-off-by: K. Y. Srinivasan > --- > + /* > + * Write dump contents to the page. No need to synchronize; panic should > + * be single-threaded. > + */ > + if (!kmsg_dump_get_buffer(dumper, true, hv_panic_page, > + PAGE_SIZE, &bytes_written)) { > + pr_err("Hyper-V: Unable to get kmsg data for panic\n"); > + return; From what I can see, the return value from kmsg_dump_get_buffer() is not an indication of success or failure -- it's an indication of whether there is more data available. There's no reason to output an error message. > @@ -1065,6 +1136,32 @@ static int vmbus_bus_init(void) >* Only register if the crash MSRs are available >*/ > if (ms_hyperv.misc_features & HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE) { > + u64 hyperv_crash_ctl; > + /* > + * Sysctl registration is not fatal, since by default > + * reporting is enabled. > + */ > + hv_ctl_table_hdr = register_sysctl_table(hv_root_table); > + if (!hv_ctl_table_hdr) > + pr_err("Hyper-V: sysctl table register error"); > + > + /* > + * Register for panic kmsg callback only if the right > + * capability is supported by the hypervisor. 
> + */ > + rdmsrl(HV_X64_MSR_CRASH_CTL, hyperv_crash_ctl); > + if (hyperv_crash_ctl & HV_CRASH_CTL_CRASH_NOTIFY_MSG) { vmbus_drv.c is architecture independent code, and should not be referencing x86/x64 MSRs. Reading the MSR (and maybe the test as well?) should go in a separate function in an x86-specific source file. And just to confirm, is this the right way to test for the feature? Usually, feature determination is based on one of the feature registers. The NOTIFY_MSG flag seems to have a dual meaning -- on read it indicates the feature is present. On write in hyperv_report_panic_msg(), it evidently means that the guest is sending a full page of data to Hyper-V. > @@ -1081,6 +1178,11 @@ static int vmbus_bus_init(void) > bus_unregister(&hv_bus); > + free_page((unsigned long)hv_panic_page); > + if (!hv_ctl_table_hdr) { The above test is backwards. Remove the bang. > @@ -1785,10 +1887,18 @@ static void __exit vmbus_exit(void) > + free_page((unsigned long)hv_panic_page); > + if (!hv_ctl_table_hdr) { Same here. Test is backwards. Michael
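Once the bang is removed, the cleanup logic reads roughly as below (user-space sketch with stand-in types; the real code would call unregister_sysctl_table()):

```c
#include <assert.h>
#include <stddef.h>

struct ctl_table_header;		/* opaque, as in the kernel */
static int unregister_calls;

/* Stub standing in for unregister_sysctl_table() so this is runnable. */
static void fake_unregister_sysctl_table(struct ctl_table_header *hdr)
{
	(void)hdr;
	unregister_calls++;
}

/* Corrected cleanup: only unregister when registration succeeded. */
static void cleanup(struct ctl_table_header *hv_ctl_table_hdr)
{
	if (hv_ctl_table_hdr)		/* was: if (!hv_ctl_table_hdr) */
		fake_unregister_sysctl_table(hv_ctl_table_hdr);
}
```

With the original inverted test, the unregister call would run exactly when registration had failed, and be skipped when it had succeeded.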
RE: [PATCH V2 1/5] X86/Hyper-V: Add flush HvFlushGuestPhysicalAddressSpace hypercall support
From: Tianyu Lan Monday, July 9, 2018 2:03 AM > Hyper-V supports a pv hypercall HvFlushGuestPhysicalAddressSpace to > flush nested VM address space mapping in l1 hypervisor and it's to > reduce overhead of flushing ept tlb among vcpus. This patch is to > implement it. > > Signed-off-by: Lan Tianyu > --- > arch/x86/hyperv/Makefile | 2 +- > arch/x86/hyperv/nested.c | 64 > ++ > arch/x86/include/asm/hyperv-tlfs.h | 8 + > arch/x86/include/asm/mshyperv.h| 2 ++ > 4 files changed, 75 insertions(+), 1 deletion(-) > create mode 100644 arch/x86/hyperv/nested.c > +#include > +#include > +#include > +#include > + > +int hyperv_flush_guest_mapping(u64 as) > +{ > + struct hv_guest_mapping_flush **flush_pcpu; > + struct hv_guest_mapping_flush *flush; > + u64 status; > + unsigned long flags; > + int ret = -EFAULT; > + > + if (!hv_hypercall_pg) > + goto fault; > + > + local_irq_save(flags); > + > + flush_pcpu = (struct hv_guest_mapping_flush **) > + this_cpu_ptr(hyperv_pcpu_input_arg); > + > + flush = *flush_pcpu; > + > + if (unlikely(!flush)) { > + local_irq_restore(flags); > + goto fault; > + } > + > + flush->address_space = as; > + flush->flags = 0; > + > + status = hv_do_hypercall(HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE, > + flush, NULL); Did you consider using a "fast" hypercall? Unless there's some reason I'm not aware of, a "fast" hypercall would be perfect here as there are 16 bytes of input and no output. Vitaly recently added hv_do_fast_hypercall16() in the linux-next tree. See __send_ipi_mask() in hv_apic.c in linux-next for an example of usage. With a fast hypercall, you don't need the code for getting the per-cpu input arg or the code for local irq save/restore, so the code that is left is a lot faster and simpler. Michael > + local_irq_restore(flags); > + > + if (!(status & HV_HYPERCALL_RESULT_MASK)) > + ret = 0; > + > +fault: > + return ret; > +} > +EXPORT_SYMBOL_GPL(hyperv_flush_guest_mapping);
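For illustration, the fast-hypercall version of the flush collapses to something like the sketch below. The hypercall itself is stubbed so the sketch is self-contained (the call code and result mask follow hyperv-tlfs.h as I understand it, but treat them as assumptions); the point is that hv_do_fast_hypercall16() passes the 16 bytes of input in registers, so the per-cpu input page and the irq save/restore disappear:

```c
#include <assert.h>
#include <stdint.h>

#define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE 0x00af
#define HV_HYPERCALL_RESULT_MASK 0xffffULL

/* Stub standing in for the real fast hypercall so this is runnable. */
static uint64_t hv_do_fast_hypercall16(uint16_t code, uint64_t in1,
				       uint64_t in2)
{
	(void)in1; (void)in2;
	/* Pretend the hypervisor returns HV_STATUS_SUCCESS (0). */
	return (code == HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE) ? 0 : 2;
}

/*
 * Sketch of the simplified flush: address_space and flags go straight
 * into the two 8-byte fast-hypercall inputs, so no per-cpu input page
 * and no local_irq_save/restore are needed.
 */
static int hyperv_flush_guest_mapping(uint64_t as)
{
	uint64_t status;

	status = hv_do_fast_hypercall16(HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE,
					as, 0 /* flags */);
	return (status & HV_HYPERCALL_RESULT_MASK) ? -14 /* -EFAULT */ : 0;
}
```

Compare with __send_ipi_mask() in hv_apic.c in linux-next for the real usage pattern.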
RE: [PATCH 2/2] x86/hyper-v: check for VP_INVAL in hyperv_flush_tlb_others()
From: Vitaly Kuznetsov Monday, July 9, 2018 10:40 AM > Commit 1268ed0c474a ("x86/hyper-v: Fix the circular dependency in IPI > enlightenment") pre-filled hv_vp_index with VP_INVAL so it is now > (theoretically) possible to observe hv_cpu_number_to_vp_number() > returning VP_INVAL. We need to check for that in hyperv_flush_tlb_others(). > > Not checking for VP_INVAL on the first call site where we do > > if (hv_cpu_number_to_vp_number(cpumask_last(cpus)) >= 64) > goto do_ex_hypercall; > > is OK, in case we're eligible for non-ex hypercall we'll catch the > issue later in for_each_cpu() cycle and in case we'll be doing ex- > hypercall cpumask_to_vpset() will fail. > > It would be nice to change hv_cpu_number_to_vp_number() return > value's type to 'u32' but this will likely be a bigger change as > all call sites need to be checked first. > > Fixes: 1268ed0c474a ("x86/hyper-v: Fix the circular dependency in IPI > enlightenment") > Signed-off-by: Vitaly Kuznetsov > --- > arch/x86/hyperv/mmu.c | 5 + > 1 file changed, 5 insertions(+) Reviewed-by: Michael Kelley
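The added check can be pictured with a small stand-in for the VP-index table; the helper names and the fallback convention here are illustrative, not the actual mmu.c code:

```c
#include <assert.h>
#include <stdint.h>

#define VP_INVAL UINT32_MAX

/* Toy VP-index table; the real code reads the per-cpu hv_vp_index array. */
static uint32_t fake_vp_index[4] = { 0, 1, VP_INVAL, 3 };

static uint32_t hv_cpu_number_to_vp_number(int cpu)
{
	return fake_vp_index[cpu];
}

/*
 * Sketch of the fixed loop: bail out of the non-ex path as soon as an
 * uninitialized VP index is seen (the caller then takes the ex path,
 * where cpumask_to_vpset() performs its own check).
 */
static int build_processor_mask(const int *cpus, int ncpus, uint64_t *mask)
{
	int i;

	*mask = 0;
	for (i = 0; i < ncpus; i++) {
		uint32_t vp = hv_cpu_number_to_vp_number(cpus[i]);

		if (vp == VP_INVAL || vp >= 64)
			return -1;	/* fall back to the ex hypercall */
		*mask |= 1ULL << vp;
	}
	return 0;
}
```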
RE: [PATCH 1/2] x86/hyper-v: check cpumask_to_vpset() return value in hyperv_flush_tlb_others_ex()
From: Vitaly Kuznetsov Monday, July 9, 2018 10:40 AM > Commit 1268ed0c474a ("x86/hyper-v: Fix the circular dependency in IPI > enlightenment") made cpumask_to_vpset() return '-1' when there is a CPU > with unknown VP index in the supplied set. This needs to be checked before > we pass 'nr_bank' to hypercall. > > Fixes: 1268ed0c474a ("x86/hyper-v: Fix the circular dependency in IPI > enlightenment") > Signed-off-by: Vitaly Kuznetsov > --- > arch/x86/hyperv/mmu.c | 2 ++ > 1 file changed, 2 insertions(+) Reviewed-by: Michael Kelley
RE: [PATCH 1/1] X86/Hyper-V:: Fix the circular dependency in IPI enlightenment.
> -Original Message- > From: Thomas Gleixner > Sent: Friday, July 6, 2018 3:42 AM > To: KY Srinivasan > Cc: Ingo Molnar ; x...@kernel.org; > gre...@linuxfoundation.org; linux- > ker...@vger.kernel.org; de...@linuxdriverproject.org; o...@aepfle.de; > a...@canonical.com; > jasow...@redhat.com; h...@zytor.com; Stephen Hemminger > ; > Michael Kelley (EOSG) ; vkuzn...@redhat.com > Subject: RE: [PATCH 1/1] X86/Hyper-V:: Fix the circular dependency in IPI > enlightenment. > > On Fri, 6 Jul 2018, Thomas Gleixner wrote: > > On Fri, 6 Jul 2018, KY Srinivasan wrote: > > > > > > > > The problem is that the wreckage is in Linus tree and needs to be fixed > > > > there, i.e. via x86/urgent. > > > > > > > > Now we have the new bits queued in x86/hyperv already which collide. So > > > > we > > > > need to merge x86/urgent into x86/hyperv after applying the fix and mop > > > > up > > > > the merge wreckage in x86/hyperv. > > > > > > > > I'll have a look tomorrow morning unless you beat me to it. > > > > > > I can rebase this patch against the latest tip and resend (tomorrow). > > > > That doesn't help as we need to push the original fix to Linus ... > > Applied it to x86/urgent and fixed up the '-1' sloppy hack as pointed out > by Vitaly and merged x86/urgent into x86/hyperv. > > Please check both branches for correctness. Changes look good to me. Some pre-existing signed/unsigned type sloppiness is still there in that hv_cpu_number_to_vp_number() should be 'u32' instead of 'int', along with related local variables, but that's probably best to fix in linux-next on top of Vitaly's other changes. This code will work. Michael > > Thanks, > > tglx
RE: [PATCH 3/4] x86/hyper-v: use cheaper HVCALL_SEND_IPI hypercall when possible
> -Original Message- > From: Vitaly Kuznetsov > Sent: Friday, June 22, 2018 10:06 AM > To: x...@kernel.org > Cc: de...@linuxdriverproject.org; linux-kernel@vger.kernel.org; KY Srinivasan > ; Haiyang Zhang ; Stephen > Hemminger > ; Thomas Gleixner ; Ingo Molnar > ; H. Peter Anvin ; Tianyu Lan > ; Michael Kelley (EOSG) > > Subject: [PATCH 3/4] x86/hyper-v: use cheaper HVCALL_SEND_IPI hypercall when > possible > > When there is no need to send an IPI to a CPU with VP number > 64 > we can do the job with fast HVCALL_SEND_IPI hypercall. > > Signed-off-by: Vitaly Kuznetsov Reviewed-by: Michael Kelley
RE: [PATCH 2/4] x86/hyper-v: use 'fast' hypercall for HVCALL_SEND_IPI
> -Original Message- > From: Vitaly Kuznetsov > Sent: Friday, June 22, 2018 10:06 AM > To: x...@kernel.org > Cc: de...@linuxdriverproject.org; linux-kernel@vger.kernel.org; KY Srinivasan > ; Haiyang Zhang ; Stephen > Hemminger > ; Thomas Gleixner ; Ingo Molnar > ; H. Peter Anvin ; Tianyu Lan > ; Michael Kelley (EOSG) > > Subject: [PATCH 2/4] x86/hyper-v: use 'fast' hypercall for HVCALL_SEND_IPI > > Current Hyper-V TLFS (v5.0b) claims that HvCallSendSyntheticClusterIpi > hypercall can't be 'fast' (passing parameters through registers) but > apparently this is not true, Windows always uses 'fast' version. We can > do the same in Linux too. > > Signed-off-by: Vitaly Kuznetsov Reviewed-by: Michael Kelley
RE: [PATCH] x86/hyper-v: use cheaper HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE} hypercalls when possible
> -Original Message- > From: linux-kernel-ow...@vger.kernel.org > On Behalf > Of Vitaly Kuznetsov > Sent: Friday, June 15, 2018 9:30 AM > To: x...@kernel.org > Cc: de...@linuxdriverproject.org; linux-kernel@vger.kernel.org; KY Srinivasan > ; Haiyang Zhang ; Stephen > Hemminger > ; Thomas Gleixner ; Ingo Molnar > ; H. Peter Anvin ; Tianyu Lan > > Subject: [PATCH] x86/hyper-v: use cheaper > HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE} > hypercalls when possible > > While working on Hyper-V style PV TLB flush support in KVM I noticed that > real Windows guests use TLB flush hypercall in a somewhat smarter way: when > the flush needs to be performed on a subset of first 64 vCPUs or on all > present vCPUs Windows avoids more expensive hypercalls which support > sparse CPU sets and uses their 'cheap' counterparts. This means that > HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED name is actually a misnomer: EX > hypercalls (which support sparse CPU sets) are "available", not > "recommended". This makes sense as they are actually harder to parse. > > Nothing stops us from being equally 'smart' in Linux too. Switch to > doing cheaper hypercalls whenever possible. > > Signed-off-by: Vitaly Kuznetsov > --- This is a good idea. We should probably do the same with the hypercalls for sending IPIs -- try the simpler version first and move to the more complex _EX version only if necessary. A complication: We've recently found a problem with the code for doing IPI hypercalls, and the bug affects the TLB flush code as well. As secondary CPUs are started, there's a window of time where the hv_vp_index entry for a secondary CPU is uninitialized. We are seeing IPIs happening in that window, and the IPI hypercall code uses the uninitialized hv_vp_index entry. Same thing could happen with the TLB flush hypercall code. I didn't actually see any occurrences of the TLB case in my tracing, but we should fix it anyway in case a TLB flush gets added at some point in the future. KY has a patch coming. 
In the patch, hv_cpu_number_to_vp_number() and cpumask_to_vpset() can both return U32_MAX if they encounter an uninitialized hv_vp_index entry, and the code needs to be able to bail out to the native functions for that particular IPI or TLB flush operation. Once the initialization of secondary CPUs is complete, the uninitialized situation won't happen again, and the hypercall path will always be used. We'll need to coordinate on these patches. Be aware that the IPI flavor of the bug is currently causing random failures when booting 4.18 RC1 on Hyper-V VMs with large vCPU counts. Reviewed-by: Michael Kelley
RE: [RFC Patch 1/3] X86/Hyper-V: Add flush HvFlushGuestPhysicalAddressSpace hypercall support
> -Original Message- > From: linux-kernel-ow...@vger.kernel.org > On Behalf > Of Tianyu Lan > Sent: Monday, June 4, 2018 2:08 AM > Cc: Tianyu Lan ; KY Srinivasan > ; Haiyang > Zhang ; Stephen Hemminger ; > t...@linutronix.de; mi...@redhat.com; h...@zytor.com; x...@kernel.org; > pbonz...@redhat.com; rkrc...@redhat.com; de...@linuxdriverproject.org; linux- > ker...@vger.kernel.org; k...@vger.kernel.org; vkuzn...@redhat.com > Subject: [RFC Patch 1/3] X86/Hyper-V: Add flush > HvFlushGuestPhysicalAddressSpace hypercall > support > > Hyper-V provides a pv hypercall HvFlushGuestPhysicalAddressSpace to flush > nested VM address space mapping in l1 hypervisor and it's to reduce overhead > of flushing ept tlb among vcpus. This patch is to implement it. > > Signed-off-by: Lan Tianyu > --- > diff --git a/arch/x86/hyperv/nested.c b/arch/x86/hyperv/nested.c > new file mode 100644 > index ..17f7c288eccc > --- /dev/null > +++ b/arch/x86/hyperv/nested.c > +#include > +#include > +#include > +#include > + > +int hyperv_flush_guest_mapping(u64 as) > +{ > + struct hv_guest_mapping_flush **flush_pcpu; > + struct hv_guest_mapping_flush *flush; > + u64 status = U64_MAX; Initializing status to U64_MAX doesn't seem necessary. 
> + unsigned long flags; > + int ret = -EFAULT; > + > + if (!hv_hypercall_pg) > + goto fault; > + > + local_irq_save(flags); > + > + flush_pcpu = (struct hv_guest_mapping_flush **) > + this_cpu_ptr(hyperv_pcpu_input_arg); > + > + flush = *flush_pcpu; > + > + if (unlikely(!flush)) { > + local_irq_restore(flags); > + goto fault; > + } > + > + flush->address_space = as; > + flush->flags = 0; > + > + status = hv_do_hypercall(HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE, > + flush, NULL); > + local_irq_restore(flags); > + > + if (!(status & HV_HYPERCALL_RESULT_MASK)) > + ret = 0; > + > +fault: > + return ret; > +} > +EXPORT_SYMBOL_GPL(hyperv_flush_guest_mapping); > diff --git a/arch/x86/include/asm/hyperv-tlfs.h > b/arch/x86/include/asm/hyperv-tlfs.h > index b8c89265baf0..53bbeb08faea 100644 > --- a/arch/x86/include/asm/hyperv-tlfs.h > +++ b/arch/x86/include/asm/hyperv-tlfs.h > @@ -309,6 +309,7 @@ struct ms_hyperv_tsc_page { > #define HV_X64_MSR_REENLIGHTENMENT_CONTROL 0x40000106 > > /* Nested features (CPUID 0x4000000A) EAX */ > +#define HV_X64_NESTED_GUSET_MAPPING_FLUSH BIT(18) The #define name is misspelled. "_GUSET_" should be "_GUEST_". And the matching usage in patch 3/3 will need to be updated as well. Michael
RE: [PATCH] PCI: hv: Do not wait forever on a device that has disappeared
> > Before the guest finishes the device initialization, the device can be > removed anytime by the host, and after that the host won't respond to > the guest's request, so the guest should be prepared to handle this > case. > > Signed-off-by: Dexuan Cui > Cc: Stephen Hemminger > Cc: K. Y. Srinivasan > --- > drivers/pci/host/pci-hyperv.c | 46 > --- > 1 file changed, 34 insertions(+), 12 deletions(-) > While this patch solves the immediate problem of getting hung waiting for a response from Hyper-V that will never come, there's another scenario to look at that I think introduces a race. Suppose the guest VM issues a vmbus_sendpacket() request in one of the cases covered by this patch, and suppose that Hyper-V queues a response to the request, and then immediately follows with a rescind request. Processing the response will get queued to a tasklet associated with the channel, while processing the rescind will get queued to a tasklet associated with the top-level vmbus connection. From what I can see, the code doesn't impose any ordering on processing the two. If the rescind is processed first, the new wait_for_response() function may wake up, notice the rescind flag, and return an error. Its caller will return an error, and in doing so pop the completion packet off the stack. When the response is processed later, it will try to signal completion via a completion packet that no longer exists, and memory corruption will likely occur. Am I missing anything that would prevent this scenario from happening? It is admittedly low probability, and a solution seems non-trivial. I haven't looked specifically, but a similar scenario is probably possible with the drivers for other VMbus devices. We should work on a generic solution. Michael
RE: [PATCH char-misc 1/2] Drivers: hv: vmbus: Remove x86 MSR refs in arch independent code
> > -#define hv_init_timer(timer, tick) wrmsrl(timer, tick) > > -#define hv_init_timer_config(config, val) wrmsrl(config, val) > > +#define hv_init_timer(timer, tick) \ > > + wrmsrl(HV_X64_MSR_STIMER0_COUNT + (2*timer), tick) > > +#define hv_init_timer_config(timer, val) \ > > + wrmsrl(HV_X64_MSR_STIMER0_CONFIG + (2*timer), val) > > Why are we stepping in units of 2? > > K. Y The register addresses for the *_CONFIG and *_COUNT registers are interleaved. From hyperv-tlfs.h: #define HV_X64_MSR_STIMER0_CONFIG 0x400000B0 #define HV_X64_MSR_STIMER0_COUNT 0x400000B1 #define HV_X64_MSR_STIMER1_CONFIG 0x400000B2 #define HV_X64_MSR_STIMER1_COUNT 0x400000B3 #define HV_X64_MSR_STIMER2_CONFIG 0x400000B4 #define HV_X64_MSR_STIMER2_COUNT 0x400000B5 #define HV_X64_MSR_STIMER3_CONFIG 0x400000B6 #define HV_X64_MSR_STIMER3_COUNT 0x400000B7 Michael
RE: [PATCH V2 5/5] X86: Hyper-V: Consolidate the allocation of the hypercall input page
> -Original Message- > From: k...@linuxonhyperv.com <k...@linuxonhyperv.com> > Sent: Thursday, May 3, 2018 11:08 PM > To: x...@kernel.org; gre...@linuxfoundation.org; linux-kernel@vger.kernel.org; > de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; > jasow...@redhat.com; > t...@linutronix.de; h...@zytor.com; Stephen Hemminger > <sthem...@microsoft.com>; > Michael Kelley (EOSG) <michael.h.kel...@microsoft.com>; vkuzn...@redhat.com > Cc: KY Srinivasan <k...@microsoft.com> > Subject: [PATCH V2 5/5] X86: Hyper-V: Consolidate the allocation of the > hypercall input page > > From: "K. Y. Srinivasan" <k...@microsoft.com> > > Consolidate the allocation of the hypercall input page. > > Signed-off-by: K. Y. Srinivasan <k...@microsoft.com> > --- Reviewed-by: Michael Kelley <mikel...@microsoft.com>
RE: [PATCH V2 4/5] X86: Hyper-V: Consolidate code for converting cpumask to vpset
> -Original Message- > From: k...@linuxonhyperv.com <k...@linuxonhyperv.com> > Sent: Thursday, May 3, 2018 11:08 PM > To: x...@kernel.org; gre...@linuxfoundation.org; linux-kernel@vger.kernel.org; > de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; > jasow...@redhat.com; > t...@linutronix.de; h...@zytor.com; Stephen Hemminger > <sthem...@microsoft.com>; > Michael Kelley (EOSG) <michael.h.kel...@microsoft.com>; vkuzn...@redhat.com > Cc: KY Srinivasan <k...@microsoft.com> > Subject: [PATCH V2 4/5] X86: Hyper-V: Consolidate code for converting cpumask > to vpset > > From: "K. Y. Srinivasan" <k...@microsoft.com> > > Consolidate code for converting cpumask to vpset. > > Signed-off-by: K. Y. Srinivasan <k...@microsoft.com> > --- Reviewed-by: Michael Kelley <mikel...@microsoft.com>
RE: [PATCH V2 3/5] X86: Hyper-V: Enhanced IPI enlightenment
> -Original Message- > From: k...@linuxonhyperv.com <k...@linuxonhyperv.com> > Sent: Thursday, May 3, 2018 11:08 PM > To: x...@kernel.org; gre...@linuxfoundation.org; linux-kernel@vger.kernel.org; > de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; > jasow...@redhat.com; > t...@linutronix.de; h...@zytor.com; Stephen Hemminger > <sthem...@microsoft.com>; > Michael Kelley (EOSG) <michael.h.kel...@microsoft.com>; vkuzn...@redhat.com > Cc: KY Srinivasan <k...@microsoft.com> > Subject: [PATCH V2 3/5] X86: Hyper-V: Enhanced IPI enlightenment > > From: "K. Y. Srinivasan" <k...@microsoft.com> > > Support enhanced IPI enlightenments (to target more than 64 CPUs). > > Signed-off-by: K. Y. Srinivasan <k...@microsoft.com> > --- Reviewed-by: Michael Kelley <mikel...@microsoft.com>
RE: [PATCH V2 2/5] X86: Hyper-V: Enable IPI enlightenments
> -Original Message- > From: k...@linuxonhyperv.com <k...@linuxonhyperv.com> > Sent: Thursday, May 3, 2018 11:08 PM > To: x...@kernel.org; gre...@linuxfoundation.org; linux-kernel@vger.kernel.org; > de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; > jasow...@redhat.com; > t...@linutronix.de; h...@zytor.com; Stephen Hemminger > <sthem...@microsoft.com>; > Michael Kelley (EOSG) <michael.h.kel...@microsoft.com>; vkuzn...@redhat.com > Cc: KY Srinivasan <k...@microsoft.com> > Subject: [PATCH V2 2/5] X86: Hyper-V: Enable IPI enlightenments > > From: "K. Y. Srinivasan" <k...@microsoft.com> > > Hyper-V supports hypercalls to implement IPI; use them. > > Signed-off-by: K. Y. Srinivasan <k...@microsoft.com> > --- Reviewed-by: Michael Kelley <mikel...@microsoft.com>
RE: [PATCH V2 1/5] X86: Hyper-V: Enlighten APIC access
> -Original Message- > From: k...@linuxonhyperv.com <k...@linuxonhyperv.com> > Sent: Thursday, May 3, 2018 11:08 PM > To: x...@kernel.org; gre...@linuxfoundation.org; linux-kernel@vger.kernel.org; > de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; > jasow...@redhat.com; > t...@linutronix.de; h...@zytor.com; Stephen Hemminger > <sthem...@microsoft.com>; > Michael Kelley (EOSG) <michael.h.kel...@microsoft.com>; vkuzn...@redhat.com > Cc: KY Srinivasan <k...@microsoft.com> > Subject: [PATCH V2 1/5] X86: Hyper-V: Enlighten APIC access > > From: "K. Y. Srinivasan" <k...@microsoft.com> > > Hyper-V supports MSR based APIC access; implement > the enlightenment. > > Signed-off-by: K. Y. Srinivasan <k...@microsoft.com> > --- Reviewed-by: Michael Kelley <mikel...@microsoft.com>
RE: [PATCH 3/5] X86: Hyper-V: Enhanced IPI enlightenment
On Wed, 25 Apr 2018, KY Srinivasan wrote: > > +struct ipi_arg_ex { > + u32 vector; > + u32 reserved; > + struct hv_vpset vp_set; > +}; Again, suggest moving to hyperv-tlfs.h. And the 5.0b version of the TLFS has: u32 vector; u8 targetvtl; u8 reserved[3]; struct hv_vpset vp_set; > > +struct hv_vpset { > + u64 format; > + u64 valid_bank_mask; > + u64 bank_contents[]; > +}; And this as well. Michael
RE: [PATCH 2/5] X86: Hyper-V: Enable IPI enlightenments
> -Original Message- > From: k...@linuxonhyperv.com > Sent: Wednesday, April 25, 2018 11:13 AM > To: x...@kernel.org; gre...@linuxfoundation.org; linux-kernel@vger.kernel.org; > de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; > jasow...@redhat.com; > t...@linutronix.de; h...@zytor.com; Stephen Hemminger > ; > Michael Kelley (EOSG) ; vkuzn...@redhat.com > Cc: KY Srinivasan > Subject: [PATCH 2/5] X86: Hyper-V: Enable IPI enlightenments > > From: "K. Y. Srinivasan" > > Hyper-V supports hypercalls to implement IPI; use them. > > Signed-off-by: K. Y. Srinivasan > --- > arch/x86/hyperv/hv_apic.c | 125 > + > arch/x86/hyperv/hv_init.c | 17 + > arch/x86/include/asm/hyperv-tlfs.h | 9 +++ > arch/x86/include/asm/mshyperv.h| 1 + > 4 files changed, 152 insertions(+) > > diff --git a/arch/x86/hyperv/hv_apic.c b/arch/x86/hyperv/hv_apic.c > index e0a5b36208fc..7f3322ecfb01 100644 > --- a/arch/x86/hyperv/hv_apic.c > +++ b/arch/x86/hyperv/hv_apic.c > @@ -30,6 +30,14 @@ > #include > #include > > +struct ipi_arg_non_ex { > + u32 vector; > + u32 reserved; > + u64 cpu_mask; > +}; I think we'd like to put structures like this, which are defined in the Hyper-V Top Level Functional Spec, in hyperv-tlfs.h. Also, the 5.0b version of the TLFS, which is latest, shows this structure on page 100: u32 vector; u8 targetvtl; u8 reserved[3]; u64 cpu_mask; > + > +static struct apic orig_apic; > + > static u64 hv_apic_icr_read(void) > { > u64 reg_val; > @@ -85,8 +93,125 @@ static void hv_apic_eoi_write(u32 reg, u32 val) > wrmsr(HV_X64_MSR_EOI, val, 0); > } > > +/* > + * IPI implementation on Hyper-V. 
> + */ > + > +static int __send_ipi_mask(const struct cpumask *mask, int vector) > +{ > + int cur_cpu, vcpu; > + struct ipi_arg_non_ex **arg; > + struct ipi_arg_non_ex *ipi_arg; > + int ret = 1; > + unsigned long flags; > + > + if (cpumask_empty(mask)) > + return 0; > + > + if (!hv_hypercall_pg) > + return ret; > + > + if ((vector < HV_IPI_LOW_VECTOR) || (vector > HV_IPI_HIGH_VECTOR)) > + return ret; > + > + local_irq_save(flags); > + arg = (struct ipi_arg_non_ex **)this_cpu_ptr(hyperv_pcpu_input_arg); > + > + ipi_arg = *arg; > + if (unlikely(!ipi_arg)) > + goto ipi_mask_done; > + > + > + ipi_arg->vector = vector; > + ipi_arg->reserved = 0; > + ipi_arg->cpu_mask = 0; > + > + for_each_cpu(cur_cpu, mask) { > + vcpu = hv_cpu_number_to_vp_number(cur_cpu); > + if (vcpu >= 64) > + goto ipi_mask_done; > + > + __set_bit(vcpu, (unsigned long *)&ipi_arg->cpu_mask); > + } > + > + ret = hv_do_hypercall(HVCALL_SEND_IPI, ipi_arg, NULL); > + > +ipi_mask_done: > + local_irq_restore(flags); > + return ret; > +} > + > +static int __send_ipi_one(int cpu, int vector) > +{ > + struct cpumask mask = CPU_MASK_NONE; > + > + cpumask_set_cpu(cpu, &mask); > + return __send_ipi_mask(&mask, vector); > +} > + > +static void hv_send_ipi(int cpu, int vector) > +{ > + if (__send_ipi_one(cpu, vector)) > + orig_apic.send_IPI(cpu, vector); > +} > + > +static void hv_send_ipi_mask(const struct cpumask *mask, int vector) > +{ > + if (__send_ipi_mask(mask, vector)) > + orig_apic.send_IPI_mask(mask, vector); > +} > + > +static void hv_send_ipi_mask_allbutself(const struct cpumask *mask, int > vector) > +{ > + unsigned int this_cpu = smp_processor_id(); > + struct cpumask new_mask; > + const struct cpumask *local_mask; > + > + cpumask_copy(&new_mask, mask); > + cpumask_clear_cpu(this_cpu, &new_mask); > + local_mask = &new_mask; > + if (__send_ipi_mask(local_mask, vector)) > + orig_apic.send_IPI_mask_allbutself(mask, vector); > +} > + > +static void hv_send_ipi_allbutself(int vector) > +{ > + 
hv_send_ipi_mask_allbutself(cpu_online_mask, vector); > +} > + > +static void hv_send_ipi_all(int vector) > +{ > + if (__send_ipi_mask(cpu_online_mask, vector)) > + orig_apic.send_IPI_all(vector); > +} > + > +static void hv_send_ipi_self(int vector) > +{ > + if (__send_ipi_one(smp_processor_id(), vector)) > + orig_apic.send_IPI_self(vector); > +} > + > void __init hv_apic_init(void) > { > + if (ms_hyperv.hints & HV_X64_CLUSTER_IPI_RECOMMENDED) { > + if
RE: [PATCH 1/5] X86: Hyper-V: Enlighten APIC access
> -Original Message- > From: k...@linuxonhyperv.com > Sent: Wednesday, April 25, 2018 11:13 AM > To: x...@kernel.org; gre...@linuxfoundation.org; linux-kernel@vger.kernel.org; > de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; > jasow...@redhat.com; > t...@linutronix.de; h...@zytor.com; Stephen Hemminger > ; > Michael Kelley (EOSG) ; vkuzn...@redhat.com > Cc: KY Srinivasan > Subject: [PATCH 1/5] X86: Hyper-V: Enlighten APIC access > > From: "K. Y. Srinivasan" > > Hyper-V supports MSR based APIC access; implement > the enlightenment. > > Signed-off-by: K. Y. Srinivasan > --- > arch/x86/hyperv/Makefile| 2 +- > arch/x86/hyperv/hv_apic.c | 98 > + > arch/x86/hyperv/hv_init.c | 5 ++- > arch/x86/include/asm/mshyperv.h | 6 ++- > 4 files changed, 107 insertions(+), 4 deletions(-) > create mode 100644 arch/x86/hyperv/hv_apic.c > > diff --git a/arch/x86/hyperv/Makefile b/arch/x86/hyperv/Makefile > index 367a8203cfcf..00ce4df01a09 100644 > --- a/arch/x86/hyperv/Makefile > +++ b/arch/x86/hyperv/Makefile > @@ -1 +1 @@ > -obj-y:= hv_init.o mmu.o > +obj-y:= hv_init.o mmu.o hv_apic.o > diff --git a/arch/x86/hyperv/hv_apic.c b/arch/x86/hyperv/hv_apic.c > new file mode 100644 > index ..e0a5b36208fc > --- /dev/null > +++ b/arch/x86/hyperv/hv_apic.c > @@ -0,0 +1,98 @@ > +// SPDX-License-Identifier: GPL-2.0 > + > +/* > + * Hyper-V specific APIC code. > + * > + * Copyright (C) 2018, Microsoft, Inc. > + * > + * Author : K. Y. Srinivasan > + * > + * This program is free software; you can redistribute it and/or modify it > + * under the terms of the GNU General Public License version 2 as published > + * by the Free Software Foundation. > + * > + * This program is distributed in the hope that it will be useful, but > + * WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or > + * NON INFRINGEMENT. See the GNU General Public License for more > + * details. 
> + * > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +static u64 hv_apic_icr_read(void) > +{ > + u64 reg_val; > + > + rdmsrl(HV_X64_MSR_ICR, reg_val); > + return reg_val; > +} > + > +static void hv_apic_icr_write(u32 low, u32 id) > +{ > + u64 reg_val; > + > + reg_val = SET_APIC_DEST_FIELD(id); > + reg_val = reg_val << 32; > + reg_val |= low; > + > + wrmsrl(HV_X64_MSR_ICR, reg_val); > +} > + > +static u32 hv_apic_read(u32 reg) > +{ > + u32 reg_val, hi; > + > + switch (reg) { > + case APIC_EOI: > + rdmsr(HV_X64_MSR_EOI, reg_val, hi); > + return reg_val; > + case APIC_TASKPRI: > + rdmsr(HV_X64_MSR_TPR, reg_val, hi); > + return reg_val; > + > + default: > + return native_apic_mem_read(reg); > + } > +} > + > +static void hv_apic_write(u32 reg, u32 val) > +{ > + switch (reg) { > + case APIC_EOI: > + wrmsr(HV_X64_MSR_EOI, val, 0); > + break; > + case APIC_TASKPRI: > + wrmsr(HV_X64_MSR_TPR, val, 0); > + break; > + default: > + native_apic_mem_write(reg, val); > + } > +} > + > +static void hv_apic_eoi_write(u32 reg, u32 val) > +{ > + wrmsr(HV_X64_MSR_EOI, val, 0); > +} > + > +void __init hv_apic_init(void) > +{ > + if (ms_hyperv.hints & HV_X64_APIC_ACCESS_RECOMMENDED) { > + pr_info("Hyper-V: Using MSR ased APIC access\n"); Typo here. "ased" should be "based". > + apic_set_eoi_write(hv_apic_eoi_write); > + apic->read = hv_apic_read; > + apic->write = hv_apic_write; > + apic->icr_write = hv_apic_icr_write; > + apic->icr_read = hv_apic_icr_read; > + } > +} > diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c > index cfecc2272f2d..71e50fc2b7ef 100644 > --- a/arch/x86/hyperv/hv_init.c > +++ b/arch/x86/hyperv/hv_init.c > @@ -242,8 +242,9 @@ static int hv_cpu_die(unsigned int cpu) > * > * 1. Setup the hypercall page. > * 2. Register Hyper-V specific clocksource. > + * 3. Setup Hyper-V specific APIC entry points. 
> */ > -void hyperv_init(void) > +void __init hyperv_init(void) > { > u64 guest_id, required_msrs; > union hv_x64_msr_hyperca
RE: [Patch v2 1/6] cifs: smbd: Check for iov length on sending the last iov
> -Original Message- > From: linux-kernel-ow...@vger.kernel.org> On Behalf > Of Long Li > Sent: Tuesday, April 17, 2018 12:17 PM > To: Steve French ; linux-c...@vger.kernel.org; samba- > techni...@lists.samba.org; linux-kernel@vger.kernel.org; > linux-r...@vger.kernel.org > Cc: Long Li ; sta...@vger.kernel.org > Subject: [Patch v2 1/6] cifs: smbd: Check for iov length on sending the last > iov > > From: Long Li > > When sending the last iov that breaks into smaller buffers to fit the > transfer size, it's necessary to check if this is the last iov. > > If this is the latest iov, stop and proceed to send pages. > > Signed-off-by: Long Li > Cc: sta...@vger.kernel.org Reviewed-by: Michael Kelley But a question unrelated to this change arose during my review: At the beginning and end of smbd_send(), the field smbd_send_pending is incremented and decremented, respectively. The increment/decrement are not done as atomic operations. Is this code guaranteed to be single threaded? If not, the count could become corrupted, and smbd_destroy_rdma_work(), which waits for the count to become zero, could hang. A similar question applies to smbd_recv_pending in smbd_recv(). > --- > fs/cifs/smbdirect.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c > index 90e673c..b5c6c0d 100644 > --- a/fs/cifs/smbdirect.c > +++ b/fs/cifs/smbdirect.c > @@ -2197,6 +2197,8 @@ int smbd_send(struct smbd_connection *info, struct > smb_rqst *rqst) > goto done; > } > i++; > + if (i == rqst->rq_nvec) > + break; > } > start = i; > buflen = 0; > -- > 2.7.4
RE: [Patch v2] Storvsc: Select channel based on available percentage of ring buffer to write
> -Original Message- > From: linux-kernel-ow...@vger.kernel.org > On Behalf > Of Long Li > Sent: Thursday, April 19, 2018 2:54 PM > To: KY Srinivasan ; Haiyang Zhang > ; Stephen > Hemminger ; James E . J . Bottomley > ; > Martin K . Petersen ; > de...@linuxdriverproject.org; linux- > s...@vger.kernel.org; linux-kernel@vger.kernel.org > Cc: Long Li > Subject: [Patch v2] Storvsc: Select channel based on available percentage of > ring buffer to > write > > From: Long Li > > This is a best effort for estimating on how busy the ring buffer is for > that channel, based on available buffer to write in percentage. It is still > possible that at the time of actual ring buffer write, the space may not be > available due to other processes may be writing at the time. > > Selecting a channel based on how full it is can reduce the possibility that > a ring buffer write will fail, and avoid the situation a channel is over > busy. > > Now it's possible that storvsc can use a smaller ring buffer size > (e.g. 40k bytes) to take advantage of cache locality. > > Changes. > v2: Pre-allocate struct cpumask on the heap. > Struct cpumask is a big structure (1k bytes) when CONFIG_NR_CPUS=8192 (default > value when CONFIG_MAXSMP=y). Don't use kernel stack for it by pre-allocating > them using kmalloc when channels are first initialized. 
> > Signed-off-by: Long Li > --- > drivers/scsi/storvsc_drv.c | 90 > -- > 1 file changed, 72 insertions(+), 18 deletions(-) > > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c > index a2ec0bc9e9fa..2a9fff94dd1a 100644 > --- a/drivers/scsi/storvsc_drv.c > +++ b/drivers/scsi/storvsc_drv.c > @@ -395,6 +395,12 @@ MODULE_PARM_DESC(storvsc_ringbuffer_size, "Ring buffer > size > (bytes)"); > > module_param(storvsc_vcpus_per_sub_channel, int, S_IRUGO); > MODULE_PARM_DESC(storvsc_vcpus_per_sub_channel, "Ratio of VCPUs to > subchannels"); > + > +static int ring_avail_percent_lowater = 10; > +module_param(ring_avail_percent_lowater, int, S_IRUGO); > +MODULE_PARM_DESC(ring_avail_percent_lowater, > + "Select a channel if available ring size > this in percent"); > + > /* > * Timeout in seconds for all devices managed by this driver. > */ > @@ -468,6 +474,13 @@ struct storvsc_device { >* Mask of CPUs bound to subchannels. >*/ > struct cpumask alloced_cpus; > + /* > + * Pre-allocated struct cpumask for each hardware queue. > + * struct cpumask is used by selecting out-going channels. It is a > + * big structure, default to 1024k bytes when CONFIG_MAXSMP=y. I think you mean "1024 bytes" or "1k bytes" in the above comment. > + * Pre-allocate it to avoid allocation on the kernel stack. > + */ > + struct cpumask *cpumask_chns; > /* Used for vsc/vsp channel reset process */ > struct storvsc_cmd_request init_request; > struct storvsc_cmd_request reset_request; > @@ -872,6 +885,13 @@ static int storvsc_channel_init(struct hv_device > *device, bool is_fc) > if (stor_device->stor_chns == NULL) > return -ENOMEM; > > + stor_device->cpumask_chns = kcalloc(num_possible_cpus(), > + sizeof(struct cpumask), GFP_KERNEL); Note that num_possible_cpus() is 240 for a Hyper-V 2016 guest unless overridden on the kernel boot line, so this is going to allocate 240 Kbytes for each synthetic SCSI controller. 
On an Azure VM, which has two IDE and two SCSI controllers, this is nearly 1 Mbyte. It's unfortunate to have to allocate this much memory for what is essentially a temporary variable. Further down in these comments, I've proposed an alternate implementation of the code that avoids the need for the temporary variable, and hence avoids the need for this allocation. > + if (stor_device->cpumask_chns == NULL) { > + kfree(stor_device->stor_chns); > + return -ENOMEM; > + } > + > stor_device->stor_chns[device->channel->target_cpu] = device->channel; > cpumask_set_cpu(device->channel->target_cpu, > &stor_device->alloced_cpus); > @@ -1232,6 +1252,7 @@ static int storvsc_dev_remove(struct hv_device *device) > vmbus_close(device->channel); > > kfree(stor_device->stor_chns); > + kfree(stor_device->cpumask_chns); > kfree(stor_device); > return 0; > } > @@ -1241,7 +1262,7 @@ static struct vmbus_channel *get_og_chn(struct > storvsc_device > *stor_device, u16 q_num) > { > u16 slot = 0; > u16 hash_qnum; > - struct cpumask alloced_mask; > + struct cpumask *alloced_mask = &stor_device->cpumask_chns[q_num]; > int num_channels, tgt_cpu; > > if (stor_device->num_sc == 0) > @@ -1257,10 +1278,10 @@ static struct vmbus_channel *get_og_chn(struct > storvsc_device > *stor_device, >* III. Mapping is persistent. >*/ > > - cpumask_and(&alloced_mask, &stor_device->alloced_cpus, > +
RE: [Patch v2 2/6] cifs: Allocate validate negotiation request through kmalloc
RE: [Patch v2 2/6] cifs: Allocate validate negotiation request through kmalloc
> -Original Message- > From: linux-kernel-ow...@vger.kernel.org > On Behalf > Of Long Li > Sent: Tuesday, April 17, 2018 12:17 PM > To: Steve French ; linux-c...@vger.kernel.org; samba- > techni...@lists.samba.org; linux-kernel@vger.kernel.org; > linux-r...@vger.kernel.org > Cc: Long Li ; sta...@vger.kernel.org > Subject: [Patch v2 2/6] cifs: Allocate validate negotiation request through > kmalloc > > From: Long Li > > The data buffer allocated on the stack can't be DMA'ed, and hence can't send > through RDMA via SMB Direct. > > Fix this by allocating the request on the heap in smb3_validate_negotiate. > > Fixes: ff1c038addc4f205d5f1ede449426c7d316c0eed "Check SMB3 dialects against > downgrade attacks" > > Changes in v2: > Removed duplicated code on freeing buffers on function exit. > (Thanks to Parav Pandit ) > > Fixed typo in the patch title. > > Signed-off-by: Long Li > Cc: sta...@vger.kernel.org > --- > fs/cifs/smb2pdu.c | 57 > ++- > 1 file changed, 31 insertions(+), 26 deletions(-) > > diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c > index 0f044c4..41625e4 100644 > --- a/fs/cifs/smb2pdu.c > +++ b/fs/cifs/smb2pdu.c > @@ -729,8 +729,8 @@ SMB2_negotiate(const unsigned int xid, struct cifs_ses > *ses) > > int smb3_validate_negotiate(const unsigned int xid, struct cifs_tcon *tcon) > { > - int rc = 0; > - struct validate_negotiate_info_req vneg_inbuf; > + int ret, rc = -EIO; > + struct validate_negotiate_info_req *pneg_inbuf; > struct validate_negotiate_info_rsp *pneg_rsp = NULL; > u32 rsplen; > u32 inbuflen; /* max of 4 dialects */ > @@ -741,6 +741,9 @@ int smb3_validate_negotiate(const unsigned int xid, > struct cifs_tcon > *tcon) > if (tcon->ses->server->rdma) > return 0; > #endif > + pneg_inbuf = kmalloc(sizeof(*pneg_inbuf), GFP_KERNEL); > + if (!pneg_inbuf) > + return -ENOMEM; Immediately after the above new code, there are three if statements that can 'return 0' and never free the pneg_inbuf memory. 
They should instead set 'rc' appropriately and 'goto' the out_free_inbuf label. Michael > > /* In SMB3.11 preauth integrity supersedes validate negotiate */ > if (tcon->ses->server->dialect == SMB311_PROT_ID) > @@ -764,53 +767,53 @@ int smb3_validate_negotiate(const unsigned int xid, > struct cifs_tcon > *tcon) > if (tcon->ses->session_flags & SMB2_SESSION_FLAG_IS_NULL) > cifs_dbg(VFS, "Unexpected null user (anonymous) auth flag sent > by server\n"); > > - vneg_inbuf.Capabilities = > + pneg_inbuf->Capabilities = > cpu_to_le32(tcon->ses->server->vals->req_capabilities); > - memcpy(vneg_inbuf.Guid, tcon->ses->server->client_guid, > + memcpy(pneg_inbuf->Guid, tcon->ses->server->client_guid, > SMB2_CLIENT_GUID_SIZE); > > if (tcon->ses->sign) > - vneg_inbuf.SecurityMode = > + pneg_inbuf->SecurityMode = > cpu_to_le16(SMB2_NEGOTIATE_SIGNING_REQUIRED); > else if (global_secflags & CIFSSEC_MAY_SIGN) > - vneg_inbuf.SecurityMode = > + pneg_inbuf->SecurityMode = > cpu_to_le16(SMB2_NEGOTIATE_SIGNING_ENABLED); > else > - vneg_inbuf.SecurityMode = 0; > + pneg_inbuf->SecurityMode = 0; > > > if (strcmp(tcon->ses->server->vals->version_string, > SMB3ANY_VERSION_STRING) == 0) { > - vneg_inbuf.Dialects[0] = cpu_to_le16(SMB30_PROT_ID); > - vneg_inbuf.Dialects[1] = cpu_to_le16(SMB302_PROT_ID); > - vneg_inbuf.DialectCount = cpu_to_le16(2); > + pneg_inbuf->Dialects[0] = cpu_to_le16(SMB30_PROT_ID); > + pneg_inbuf->Dialects[1] = cpu_to_le16(SMB302_PROT_ID); > + pneg_inbuf->DialectCount = cpu_to_le16(2); > /* structure is big enough for 3 dialects, sending only 2 */ > inbuflen = sizeof(struct validate_negotiate_info_req) - 2; > } else if (strcmp(tcon->ses->server->vals->version_string, > SMBDEFAULT_VERSION_STRING) == 0) { > - vneg_inbuf.Dialects[0] = cpu_to_le16(SMB21_PROT_ID); > - vneg_inbuf.Dialects[1] = cpu_to_le16(SMB30_PROT_ID); > - vneg_inbuf.Dialects[2] = cpu_to_le16(SMB302_PROT_ID); > - vneg_inbuf.DialectCount = cpu_to_le16(3); > + pneg_inbuf->Dialects[0] = 
cpu_to_le16(SMB21_PROT_ID); > + pneg_inbuf->Dialects[1] = cpu_to_le16(SMB30_PROT_ID); > + pneg_inbuf->Dialects[2] = cpu_to_le16(SMB302_PROT_ID); > + pneg_inbuf->DialectCount = cpu_to_le16(3); > /* structure is big enough for 3 dialects */ > inbuflen = sizeof(struct validate_negotiate_info_req); > } else { > /* otherwise specific dialect was requested */ > -
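A minimal C sketch of the cleanup pattern Michael's comment asks for: once pneg_inbuf lives on the heap, the early "return 0" checks must set rc and jump to a label that frees it. The struct and control flow here are simplified stand-ins for illustration, not the actual fs/cifs code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Simplified stand-in for the real request structure */
struct validate_negotiate_info_req { int dummy; };

static int validate_negotiate_sketch(int is_smb311)
{
	int rc = -EIO;
	struct validate_negotiate_info_req *pneg_inbuf;

	pneg_inbuf = malloc(sizeof(*pneg_inbuf));
	if (!pneg_inbuf)
		return -ENOMEM;	/* nothing allocated yet, plain return is safe */

	/* Early-exit condition: must free the buffer, not return directly */
	if (is_smb311) {
		rc = 0;
		goto out_free_inbuf;
	}

	/* ... build the request, issue the ioctl, set rc accordingly ... */
	rc = 0;

out_free_inbuf:
	free(pneg_inbuf);
	return rc;
}
```

Every exit path after the allocation funnels through out_free_inbuf, so the buffer cannot leak regardless of which early check fires.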
RE: [Resend Patch 2/3] Netvsc: Use the vmbus functiton to calculate ring buffer percentage
> -Original Message- > From: linux-kernel-ow...@vger.kernel.org> On Behalf > Of Long Li > Sent: Tuesday, March 27, 2018 5:49 PM > To: KY Srinivasan ; Haiyang Zhang > ; Stephen > Hemminger ; James E . J . Bottomley > ; > Martin K . Petersen ; > de...@linuxdriverproject.org; linux- > s...@vger.kernel.org; linux-kernel@vger.kernel.org; net...@vger.kernel.org > Cc: Long Li > Subject: [Resend Patch 2/3] Netvsc: Use the vmbus functiton to calculate ring > buffer > percentage > > From: Long Li > > In Vmbus, we have defined a function to calculate available ring buffer > percentage to write. > > Use that function and remove netvsc's private version. > > Signed-off-by: Long Li Reviewed-by: Michael Kelley > --- > drivers/net/hyperv/hyperv_net.h | 1 - > drivers/net/hyperv/netvsc.c | 17 +++-- > drivers/net/hyperv/netvsc_drv.c | 3 --- > 3 files changed, 3 insertions(+), 18 deletions(-) > > diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h > index cd538d5a7986..a0199ab13d67 100644 > --- a/drivers/net/hyperv/hyperv_net.h > +++ b/drivers/net/hyperv/hyperv_net.h > @@ -189,7 +189,6 @@ struct netvsc_device; > struct net_device_context; > > extern u32 netvsc_ring_bytes; > -extern struct reciprocal_value netvsc_ring_reciprocal; > > struct netvsc_device *netvsc_device_add(struct hv_device *device, > const struct netvsc_device_info *info); > diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c > index 0265d703eb03..8af0069e4d8c 100644 > --- a/drivers/net/hyperv/netvsc.c > +++ b/drivers/net/hyperv/netvsc.c > @@ -31,7 +31,6 @@ > #include > #include > #include > -#include > > #include > > @@ -590,17 +589,6 @@ void netvsc_device_remove(struct hv_device *device) > #define RING_AVAIL_PERCENT_HIWATER 20 > #define RING_AVAIL_PERCENT_LOWATER 10 > > -/* > - * Get the percentage of available bytes to write in the ring. > - * The return value is in range from 0 to 100. 
> - */ > -static u32 hv_ringbuf_avail_percent(const struct hv_ring_buffer_info > *ring_info) > -{ > - u32 avail_write = hv_get_bytes_to_write(ring_info); > - > - return reciprocal_divide(avail_write * 100, netvsc_ring_reciprocal); > -} > - > static inline void netvsc_free_send_slot(struct netvsc_device *net_device, >u32 index) > { > @@ -649,7 +637,8 @@ static void netvsc_send_tx_complete(struct netvsc_device > *net_device, > wake_up(_device->wait_drain); > > if (netif_tx_queue_stopped(netdev_get_tx_queue(ndev, q_idx)) && > - (hv_ringbuf_avail_percent(>outbound) > > RING_AVAIL_PERCENT_HIWATER || > + (hv_get_avail_to_write_percent(>outbound) > > + RING_AVAIL_PERCENT_HIWATER || >queue_sends < 1)) { > netif_tx_wake_queue(netdev_get_tx_queue(ndev, q_idx)); > ndev_ctx->eth_stats.wake_queue++; > @@ -757,7 +746,7 @@ static inline int netvsc_send_pkt( > struct netdev_queue *txq = netdev_get_tx_queue(ndev, packet->q_idx); > u64 req_id; > int ret; > - u32 ring_avail = hv_ringbuf_avail_percent(_channel->outbound); > + u32 ring_avail = hv_get_avail_to_write_percent(_channel->outbound); > > nvmsg.hdr.msg_type = NVSP_MSG1_TYPE_SEND_RNDIS_PKT; > if (skb) > diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c > index faea0be18924..b0b1c2fd2b7b 100644 > --- a/drivers/net/hyperv/netvsc_drv.c > +++ b/drivers/net/hyperv/netvsc_drv.c > @@ -35,7 +35,6 @@ > #include > #include > #include > -#include > > #include > #include > @@ -55,7 +54,6 @@ static unsigned int ring_size __ro_after_init = 128; > module_param(ring_size, uint, S_IRUGO); > MODULE_PARM_DESC(ring_size, "Ring buffer size (# of pages)"); > unsigned int netvsc_ring_bytes __ro_after_init; > -struct reciprocal_value netvsc_ring_reciprocal __ro_after_init; > > static const u32 default_msg = NETIF_MSG_DRV | NETIF_MSG_PROBE | > NETIF_MSG_LINK | NETIF_MSG_IFUP | > @@ -2186,7 +2184,6 @@ static int __init netvsc_drv_init(void) > ring_size); > } > netvsc_ring_bytes = ring_size * PAGE_SIZE; > - 
netvsc_ring_reciprocal = reciprocal_value(netvsc_ring_bytes); > > ret = vmbus_driver_register(_drv); > if (ret) > -- > 2.14.1
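A hedged sketch of what a shared VMBus helper such as hv_get_avail_to_write_percent() computes: the writable ring space as a 0..100 percentage. The real helper takes a struct hv_ring_buffer_info; plain sizes are used here for illustration.

```c
#include <assert.h>

/* Percentage of the ring buffer still available to write, 0..100. */
static unsigned int avail_to_write_percent(unsigned int ring_size,
					   unsigned int avail_to_write)
{
	/* Equivalent to the removed reciprocal_divide() version, which
	 * existed only to avoid a runtime division per packet. */
	return avail_to_write * 100u / ring_size;
}
```

Centralizing this in VMBus lets netvsc drop its private copy along with the netvsc_ring_reciprocal state the patch removes.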
RE: [Resend Patch 3/3] Storvsc: Select channel based on available percentage of ring buffer to write
> -Original Message- > From: linux-kernel-ow...@vger.kernel.org > On Behalf > Of Long Li > Sent: Tuesday, March 27, 2018 5:49 PM > To: KY Srinivasan ; Haiyang Zhang > ; Stephen > Hemminger ; James E . J . Bottomley > ; > Martin K . Petersen ; > de...@linuxdriverproject.org; linux- > s...@vger.kernel.org; linux-kernel@vger.kernel.org; net...@vger.kernel.org > Cc: Long Li > Subject: [Resend Patch 3/3] Storvsc: Select channel based on available > percentage of ring > buffer to write > > From: Long Li > > This is a best effort for estimating on how busy the ring buffer is for > that channel, based on available buffer to write in percentage. It is still > possible that at the time of actual ring buffer write, the space may not be > available due to other processes may be writing at the time. > > Selecting a channel based on how full it is can reduce the possibility that > a ring buffer write will fail, and avoid the situation a channel is over > busy. > > Now it's possible that storvsc can use a smaller ring buffer size > (e.g. 40k bytes) to take advantage of cache locality. > > Signed-off-by: Long Li > --- > drivers/scsi/storvsc_drv.c | 62 > +- > 1 file changed, 50 insertions(+), 12 deletions(-) > > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c > index a2ec0bc9e9fa..b1a87072b3ab 100644 > --- a/drivers/scsi/storvsc_drv.c > +++ b/drivers/scsi/storvsc_drv.c > @@ -395,6 +395,12 @@ MODULE_PARM_DESC(storvsc_ringbuffer_size, "Ring buffer > size > (bytes)"); > > module_param(storvsc_vcpus_per_sub_channel, int, S_IRUGO); > MODULE_PARM_DESC(storvsc_vcpus_per_sub_channel, "Ratio of VCPUs to > subchannels"); > + > +static int ring_avail_percent_lowater = 10; Reserving 10% of each ring buffer by default seems like more than is needed in the storvsc driver. That would be about 4Kbytes for the 40K ring buffer you suggest, and even more for a ring buffer of 128K. Each outgoing record is only about 344 bytes (I'd have to check exactly). 
With the new channel selection algorithm below, the only time we use a channel that is already below the low water mark is when no channel could be found that is above the low water mark. There could be a case of two or more threads deciding that a channel is above the low water mark at the same time and both choosing it, but that's likely to be rare. So it seems like we could set the default low water mark to 5 percent or even 3 percent, which will let more of the ring buffer be used, and let a channel be assigned according to the algorithm, rather than falling through to the default because all channels appear to be "full". > +module_param(ring_avail_percent_lowater, int, S_IRUGO); > +MODULE_PARM_DESC(ring_avail_percent_lowater, > + "Select a channel if available ring size > this in percent"); > + > /* > * Timeout in seconds for all devices managed by this driver. > */ > @@ -1285,9 +1291,9 @@ static int storvsc_do_io(struct hv_device *device, > { > struct storvsc_device *stor_device; > struct vstor_packet *vstor_packet; > - struct vmbus_channel *outgoing_channel; > + struct vmbus_channel *outgoing_channel, *channel; > int ret = 0; > - struct cpumask alloced_mask; > + struct cpumask alloced_mask, other_numa_mask; > int tgt_cpu; > > vstor_packet = >vstor_packet; > @@ -1301,22 +1307,53 @@ static int storvsc_do_io(struct hv_device *device, > /* >* Select an an appropriate channel to send the request out. >*/ > - > if (stor_device->stor_chns[q_num] != NULL) { > outgoing_channel = stor_device->stor_chns[q_num]; > - if (outgoing_channel->target_cpu == smp_processor_id()) { > + if (outgoing_channel->target_cpu == q_num) { > /* >* Ideally, we want to pick a different channel if >* available on the same NUMA node. 
>*/ > cpumask_and(_mask, _device->alloced_cpus, > cpumask_of_node(cpu_to_node(q_num))); > - for_each_cpu_wrap(tgt_cpu, _mask, > - outgoing_channel->target_cpu + 1) { > - if (tgt_cpu != outgoing_channel->target_cpu) { > - outgoing_channel = > - stor_device->stor_chns[tgt_cpu]; > - break; > + > + for_each_cpu_wrap(tgt_cpu, _mask, q_num + 1) { > + if (tgt_cpu == q_num) > + continue; > + channel = stor_device->stor_chns[tgt_cpu]; > + if (hv_get_avail_to_write_percent( > + >outbound) > + >
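A hedged C sketch of the selection policy under discussion: prefer another channel whose available ring space is above the low-water percentage, and fall back to the original channel only when no candidate clears the threshold. The data layout is a simplified stand-in; the real code walks per-CPU channels on the same NUMA node.

```c
#include <assert.h>

#define NR_CHANNELS 4

/* Return the index of a channel above the low-water mark, starting the
 * scan just after the current channel; keep the current channel when
 * every candidate appears "full". */
static int pick_channel(const unsigned int avail_percent[NR_CHANNELS],
			int cur, int lowater)
{
	int i, tgt;

	for (i = 1; i <= NR_CHANNELS; i++) {
		tgt = (cur + i) % NR_CHANNELS;
		if (tgt == cur)
			continue;
		if (avail_percent[tgt] > (unsigned int)lowater)
			return tgt;	/* found a channel above the mark */
	}
	return cur;	/* all channels below lowater: fall back */
}
```

Michael's point follows directly from this shape: the fallback branch only runs when every channel is below lowater, so a smaller default (5% or 3% rather than 10%) lets more of each ring be used before the algorithm gives up and reuses the busy channel.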
RE: [PATCH v2 char-misc 1/1] x86/hyperv: Add interrupt handler annotations
> -Original Message- > From: Greg KH <gre...@linuxfoundation.org> > Sent: Wednesday, April 4, 2018 1:16 AM > To: Michael Kelley (EOSG) <michael.h.kel...@microsoft.com> > Cc: linux-kernel@vger.kernel.org; de...@linuxdriverproject.org; > o...@aepfle.de; > a...@canonical.com; vkuzn...@redhat.com; jasow...@redhat.com; > leann.ogasaw...@canonical.com; marcelo.ce...@canonical.com; Stephen Hemminger > <sthem...@microsoft.com>; KY Srinivasan <k...@microsoft.com> > Subject: Re: [PATCH v2 char-misc 1/1] x86/hyperv: Add interrupt handler > annotations > > On Tue, Apr 03, 2018 at 01:59:08PM -0700, mhkelle...@gmail.com wrote: > > From: Michael Kelley <mikel...@microsoft.com> > > > > Add standard interrupt handler annotations to > > hyperv_vector_handler(). > > > > Signed-off-by: Michael Kelley <mikel...@microsoft.com> > > --- > > Changes in v2: > > * Fixed From: line > > --- > > arch/x86/kernel/cpu/mshyperv.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c > > index 4488cf0..20f6849 100644 > > --- a/arch/x86/kernel/cpu/mshyperv.c > > +++ b/arch/x86/kernel/cpu/mshyperv.c > > @@ -41,7 +41,7 @@ static void (*hv_stimer0_handler)(void); > > static void (*hv_kexec_handler)(void); > > static void (*hv_crash_handler)(struct pt_regs *regs); > > > > -void hyperv_vector_handler(struct pt_regs *regs) > > +__visible void __irq_entry hyperv_vector_handler(struct pt_regs *regs) > > What bug does this solve? What is wrong with the existing markings? > What does __visible and __irq_entry give us that we don't already have > and we need? > > Are you really using LTO that requires this marking to prevent the code > from being removed? 
Thomas Gleixner commented on Vitaly Kuznetsov's Hyper-V reenlightenment patch that the interrupt handler should have these annotations: see https://lkml.org/lkml/2018/1/14/145 I put the same annotations on the interrupt handler for stimer0 Direct Mode, so this change makes hyperv_vector_handler() consistent with hv_stimer0_vector_handler() in the same source file. It does not fix any immediate bug -- it's for consistency and alignment with what is apparently standard practice. Not sure what LTO is ... Michael > > thanks, > > greg k-h
RE: [PATCH v2] storvsc: Set up correct queue depth values for IDE devices
> -Original Message- > From: linux-kernel-ow...@vger.kernel.org> On Behalf > Of Long Li > Sent: Thursday, March 22, 2018 2:47 PM > To: KY Srinivasan ; Haiyang Zhang > ; Stephen > Hemminger ; James E . J . Bottomley > ; > Martin K . Petersen ; > de...@linuxdriverproject.org; linux- > s...@vger.kernel.org; linux-kernel@vger.kernel.org > Cc: Long Li > Subject: [PATCH v2] storvsc: Set up correct queue depth values for IDE devices > > From: Long Li > > Unlike SCSI and FC, we don't use multiple channels for IDE. > Also fix the calculation for sub-channels. > > Change log: > v2: Addressed comment on incorrect number of sub-channels. > (Michael Kelley ) > > Signed-off-by: Long Li Reviewed-by: Michael Kelley > --- > drivers/scsi/storvsc_drv.c | 7 +-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c > index 8c51d628b52e..a2ec0bc9e9fa 100644 > --- a/drivers/scsi/storvsc_drv.c > +++ b/drivers/scsi/storvsc_drv.c > @@ -1722,11 +1722,14 @@ static int storvsc_probe(struct hv_device *device, > max_targets = STORVSC_MAX_TARGETS; > max_channels = STORVSC_MAX_CHANNELS; > /* > - * On Windows8 and above, we support sub-channels for storage. > + * On Windows8 and above, we support sub-channels for storage > + * on SCSI and FC controllers. >* The number of sub-channels offerred is based on the number of >* VCPUs in the guest. >*/ > - max_sub_channels = (num_cpus / storvsc_vcpus_per_sub_channel); > + if (!dev_is_ide) > + max_sub_channels = > + (num_cpus - 1) / storvsc_vcpus_per_sub_channel; > } > > scsi_driver.can_queue = (max_outstanding_req_per_channel * > -- > 2.14.1
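The corrected sub-channel arithmetic can be sketched as below. IDE devices get no sub-channels, and for SCSI/FC the offer is based on VCPUs beyond the first; the 4:1 VCPU-to-subchannel ratio in the tests is an assumed default for storvsc_vcpus_per_sub_channel, used only for illustration.

```c
#include <assert.h>

/* Sub-channels offered for a storvsc device. */
static int max_sub_channels(int num_cpus, int vcpus_per_sub_channel,
			    int dev_is_ide)
{
	if (dev_is_ide)
		return 0;	/* IDE never uses multiple channels */
	/* v2 change: (num_cpus - 1) instead of num_cpus, so e.g. 4 VCPUs
	 * at a 4:1 ratio yield 0 sub-channels rather than 1. */
	return (num_cpus - 1) / vcpus_per_sub_channel;
}
```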
RE: [PATCH net-next,1/2] hv_netvsc: Fix the return status in RX path
> -Original Message- > From: linux-kernel-ow...@vger.kernel.org> On Behalf > Of Haiyang Zhang > Sent: Thursday, March 22, 2018 12:01 PM > To: da...@davemloft.net; net...@vger.kernel.org > Cc: Haiyang Zhang ; KY Srinivasan > ; Stephen > Hemminger ; o...@aepfle.de; vkuzn...@redhat.com; > de...@linuxdriverproject.org; linux-kernel@vger.kernel.org > Subject: [PATCH net-next,1/2] hv_netvsc: Fix the return status in RX path > > From: Haiyang Zhang > > As defined in hyperv_net.h, the NVSP_STAT_SUCCESS is one not zero. > Some functions returns 0 when it actually means NVSP_STAT_SUCCESS. > This patch fixes them. > > In netvsc_receive(), it puts the last RNDIS packet's receive status > for all packets in a vmxferpage which may contain multiple RNDIS > packets. > This patch puts NVSP_STAT_FAIL in the receive completion if one of > the packets in a vmxferpage fails. This patch changes the status field that is being reported back to the Hyper-V host in the receive completion message in enq_receive_complete(). The current code reports 0 on success, and with the patch, it will report 1 on success. So does this change affect anything on the Hyper-V side? Or is Hyper-V just ignoring the value? If this change doesn't have any impact on the interactions with Hyper-V, perhaps it would be good to explain why in the commit message. 
RE: [PATCH net-next,1/2] hv_netvsc: Fix the return status in RX path
> -----Original Message-----
> From: linux-kernel-ow...@vger.kernel.org On Behalf Of Haiyang Zhang
> Sent: Thursday, March 22, 2018 12:01 PM
> To: da...@davemloft.net; net...@vger.kernel.org
> Cc: Haiyang Zhang; KY Srinivasan; Stephen Hemminger; o...@aepfle.de;
> vkuzn...@redhat.com; de...@linuxdriverproject.org; linux-kernel@vger.kernel.org
> Subject: [PATCH net-next,1/2] hv_netvsc: Fix the return status in RX path
>
> From: Haiyang Zhang
>
> As defined in hyperv_net.h, NVSP_STAT_SUCCESS is one, not zero.
> Some functions return 0 when they actually mean NVSP_STAT_SUCCESS.
> This patch fixes them.
>
> In netvsc_receive(), the last RNDIS packet's receive status is used
> for all packets in a vmxferpage, which may contain multiple RNDIS
> packets. This patch puts NVSP_STAT_FAIL in the receive completion
> if one of the packets in a vmxferpage fails.

This patch changes the status field that is being reported back to the
Hyper-V host in the receive completion message in enq_receive_complete().
The current code reports 0 on success, and with the patch, it will
report 1 on success. So does this change affect anything on the Hyper-V
side? Or is Hyper-V just ignoring the value? If this change doesn't
have any impact on the interactions with Hyper-V, perhaps it would be
good to explain why in the commit message.
Michael

> Signed-off-by: Haiyang Zhang
> ---
>  drivers/net/hyperv/netvsc.c       | 8 ++++++--
>  drivers/net/hyperv/netvsc_drv.c   | 2 +-
>  drivers/net/hyperv/rndis_filter.c | 4 ++--
>  3 files changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
> index aa95e81af6e5..1ddb2c39b6e4 100644
> --- a/drivers/net/hyperv/netvsc.c
> +++ b/drivers/net/hyperv/netvsc.c
> @@ -1098,12 +1098,16 @@ static int netvsc_receive(struct net_device *ndev,
>  		void *data = recv_buf
>  			+ vmxferpage_packet->ranges[i].byte_offset;
>  		u32 buflen = vmxferpage_packet->ranges[i].byte_count;
> +		int ret;
>
>  		trace_rndis_recv(ndev, q_idx, data);
>
>  		/* Pass it to the upper layer */
> -		status = rndis_filter_receive(ndev, net_device,
> -					      channel, data, buflen);
> +		ret = rndis_filter_receive(ndev, net_device,
> +					   channel, data, buflen);
> +
> +		if (unlikely(ret != NVSP_STAT_SUCCESS))
> +			status = NVSP_STAT_FAIL;
>  	}
>
>  	enq_receive_complete(ndev, net_device, q_idx,
>
> diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
> index cdb78eefab67..33607995be62 100644
> --- a/drivers/net/hyperv/netvsc_drv.c
> +++ b/drivers/net/hyperv/netvsc_drv.c
> @@ -818,7 +818,7 @@ int netvsc_recv_callback(struct net_device *net,
>  	u64_stats_update_end(&rx_stats->syncp);
>
>  	napi_gro_receive(&nvchan->napi, skb);
> -	return 0;
> +	return NVSP_STAT_SUCCESS;
>  }
>
>  static void netvsc_get_drvinfo(struct net_device *net,
>
> diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c
> index 2dc00f714482..591fb8080f11 100644
> --- a/drivers/net/hyperv/rndis_filter.c
> +++ b/drivers/net/hyperv/rndis_filter.c
> @@ -443,10 +443,10 @@ int rndis_filter_receive(struct net_device *ndev,
>  			   "unhandled rndis message (type %u len %u)\n",
>  			   rndis_msg->ndis_msg_type,
>  			   rndis_msg->msg_len);
> -		break;
> +		return NVSP_STAT_FAIL;
>  	}
>
> -	return 0;
> +	return NVSP_STAT_SUCCESS;
>  }
>
>  static int rndis_filter_query_device(struct rndis_device *dev,
> --
> 2.15.1
RE: [PATCH] vmbus: use put_device() if device_register fail
> -----Original Message-----
> From: linux-kernel-ow...@vger.kernel.org On Behalf Of KY Srinivasan
> Sent: Sunday, March 18, 2018 8:02 PM
> To: Arvind Yadav; Stephen Hemminger; Haiyang Zhang
> Cc: de...@linuxdriverproject.org; linux-kernel@vger.kernel.org
> Subject: RE: [PATCH] vmbus: use put_device() if device_register fail
>
> > -----Original Message-----
> > From: devel On Behalf Of Arvind Yadav
> > Sent: Saturday, March 17, 2018 11:48 PM
> > To: Stephen Hemminger; Haiyang Zhang
> > Cc: de...@linuxdriverproject.org; linux-kernel@vger.kernel.org
> > Subject: [PATCH] vmbus: use put_device() if device_register fail
> >
> > If device_register() returns an error, always use put_device()
> > to give up the reference that was initialized.
> >
> > Signed-off-by: Arvind Yadav
> > ---
> >  drivers/hv/vmbus_drv.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> > index bc65c4d..25da2f3 100644
> > --- a/drivers/hv/vmbus_drv.c
> > +++ b/drivers/hv/vmbus_drv.c
> > @@ -1358,6 +1358,7 @@ int vmbus_device_register(struct hv_device *child_device_obj)
> >  	ret = device_register(&child_device_obj->device);
> >  	if (ret) {
> >  		pr_err("Unable to register child device\n");
> > +		put_device(&child_device_obj->device);
>
> If the registration failed, we would not have acquired the reference
> on the device, and so we should not be dropping the reference in the
> failure path.

Looking at the code for device_register() and its constituent parts
device_initialize() and device_add(), Arvind is correct.
device_initialize() creates the object with a reference count of 1.
device_add() does not decrement that reference count or free the
object, even if it fails. The comments for device_register() call
this out as well.

Michael

> K. Y
>
> >  		return ret;
> >  	}
> >
> > --
> > 2.7.4
RE: [PATCH 2/2] pci: host: pci-hyperv: Replace GFP_ATOMIC with GFP_KERNEL in new_pcichild_device
> -----Original Message-----
> From: linux-kernel-ow...@vger.kernel.org On Behalf Of Jia-Ju Bai
> Sent: Sunday, March 18, 2018 7:53 AM
> To: KY Srinivasan; Haiyang Zhang; Stephen Hemminger; bhelg...@google.com
> Cc: de...@linuxdriverproject.org; linux-...@vger.kernel.org;
> linux-kernel@vger.kernel.org; Jia-Ju Bai
> Subject: [PATCH 2/2] pci: host: pci-hyperv: Replace GFP_ATOMIC with
> GFP_KERNEL in new_pcichild_device
>
> new_pcichild_device() is not called in atomic context.
>
> The call chain ending up at new_pcichild_device() is:
> [1] new_pcichild_device() <- pci_devices_present_work()
> pci_devices_present_work() is only set in INIT_WORK().
>
> Despite never getting called from atomic context,
> new_pcichild_device() calls kzalloc() with GFP_ATOMIC.
> GFP_ATOMIC is not necessary here and can be replaced with
> GFP_KERNEL, which allows the allocation to sleep.
>
> This was found by a static analysis tool named DCNS, written by myself.
>
> Signed-off-by: Jia-Ju Bai

Reviewed-by: Michael Kelley

> ---
>  drivers/pci/host/pci-hyperv.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/pci/host/pci-hyperv.c b/drivers/pci/host/pci-hyperv.c
> index 0fe3ea1..289e31d 100644
> --- a/drivers/pci/host/pci-hyperv.c
> +++ b/drivers/pci/host/pci-hyperv.c
> @@ -1500,7 +1500,7 @@ static struct hv_pci_dev *new_pcichild_device(struct hv_pcibus_device *hbus,
>  	unsigned long flags;
>  	int ret;
>
> -	hpdev = kzalloc(sizeof(*hpdev), GFP_ATOMIC);
> +	hpdev = kzalloc(sizeof(*hpdev), GFP_KERNEL);
>  	if (!hpdev)
>  		return NULL;
>
> --
> 1.9.1