[tip:x86/pti] sched/smt: Make sched_smt_present track topology
Commit-ID:  c5511d03ec090980732e929c318a7a6374b5550e
Gitweb:     https://git.kernel.org/tip/c5511d03ec090980732e929c318a7a6374b5550e
Author:     Peter Zijlstra (Intel)
AuthorDate: Sun, 25 Nov 2018 19:33:36 +0100
Committer:  Thomas Gleixner
CommitDate: Wed, 28 Nov 2018 11:57:06 +0100

sched/smt: Make sched_smt_present track topology

Currently the 'sched_smt_present' static key is enabled when SMT
topology is observed at CPU bringup, but it is never disabled. However,
there is demand to also disable the key when the topology changes such
that there is no SMT present anymore.

Implement this by making the key count the number of cores that have
SMT enabled. In particular, the SMT topology bits are set before
interrupts are enabled and, similarly, are cleared after interrupts are
disabled for the last time and the CPU dies.

Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Thomas Gleixner
Reviewed-by: Ingo Molnar
Cc: Andy Lutomirski
Cc: Linus Torvalds
Cc: Jiri Kosina
Cc: Tom Lendacky
Cc: Josh Poimboeuf
Cc: Andrea Arcangeli
Cc: David Woodhouse
Cc: Tim Chen
Cc: Andi Kleen
Cc: Dave Hansen
Cc: Casey Schaufler
Cc: Asit Mallick
Cc: Arjan van de Ven
Cc: Jon Masters
Cc: Waiman Long
Cc: Greg KH
Cc: Dave Stewart
Cc: Kees Cook
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/20181125185004.246110...@linutronix.de

---
 kernel/sched/core.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 091e089063be..6fedf3a98581 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5738,15 +5738,10 @@ int sched_cpu_activate(unsigned int cpu)

 #ifdef CONFIG_SCHED_SMT
 	/*
-	 * The sched_smt_present static key needs to be evaluated on every
-	 * hotplug event because at boot time SMT might be disabled when
-	 * the number of booted CPUs is limited.
-	 *
-	 * If then later a sibling gets hotplugged, then the key would stay
-	 * off and SMT scheduling would never be functional.
+	 * When going up, increment the number of cores with SMT present.
 	 */
-	if (cpumask_weight(cpu_smt_mask(cpu)) > 1)
-		static_branch_enable_cpuslocked(&sched_smt_present);
+	if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
+		static_branch_inc_cpuslocked(&sched_smt_present);
 #endif
 	set_cpu_active(cpu, true);

@@ -5790,6 +5785,14 @@ int sched_cpu_deactivate(unsigned int cpu)
 	 */
 	synchronize_rcu_mult(call_rcu, call_rcu_sched);

+#ifdef CONFIG_SCHED_SMT
+	/*
+	 * When going down, decrement the number of cores with SMT present.
+	 */
+	if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
+		static_branch_dec_cpuslocked(&sched_smt_present);
+#endif
+
 	if (!sched_smp_initialized)
 		return 0;
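The counting trick is easier to see outside the kernel: a core's SMT-mask
weight passes through exactly 2 once per core in each direction, so the
increments and decrements pair up per core. A minimal user-space sketch
(the 4-core/2-sibling topology and all names are hypothetical;
smt_weight() stands in for cpumask_weight(cpu_smt_mask(cpu)) and a plain
counter stands in for the static key):

/* Build: cc -o smt_count smt_count.c */
#include <stdio.h>

#define NR_CPUS 8

static unsigned int online[NR_CPUS];			/* 1 if CPU is online */
static int core_of[NR_CPUS] = {0, 0, 1, 1, 2, 2, 3, 3};/* 2 siblings per core */
static int smt_present;					/* analogue of the key count */

/* Online siblings in @cpu's core, i.e. cpumask_weight(cpu_smt_mask(cpu)). */
static int smt_weight(int cpu)
{
	int i, w = 0;

	for (i = 0; i < NR_CPUS; i++)
		if (online[i] && core_of[i] == core_of[cpu])
			w++;
	return w;
}

static void cpu_up(int cpu)
{
	online[cpu] = 1;
	/* Weight hits exactly 2 once per core, so this counts cores. */
	if (smt_weight(cpu) == 2)
		smt_present++;		/* static_branch_inc_cpuslocked() */
}

static void cpu_down(int cpu)
{
	/* Checked while the CPU is still in the mask, as in the kernel. */
	if (smt_weight(cpu) == 2)
		smt_present--;		/* static_branch_dec_cpuslocked() */
	online[cpu] = 0;
}

int main(void)
{
	cpu_up(0); cpu_up(1); cpu_up(2);	/* core 0 full, core 1 half up */
	printf("smt_present=%d\n", smt_present);	/* 1 */
	cpu_up(3);
	printf("smt_present=%d\n", smt_present);	/* 2 */
	cpu_down(1);				/* core 0 loses its sibling */
	printf("smt_present=%d\n", smt_present);	/* 1 */
	return 0;
}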
[tip:x86/boot] x86/kaslr, ACPI/NUMA: Fix KASLR build error
Commit-ID:  9d94e8b1d4f94a3c4cee5ad11a1be460cd070839
Gitweb:     https://git.kernel.org/tip/9d94e8b1d4f94a3c4cee5ad11a1be460cd070839
Author:     Peter Zijlstra (Intel)
AuthorDate: Wed, 3 Oct 2018 14:41:27 +0200
Committer:  Borislav Petkov
CommitDate: Tue, 9 Oct 2018 12:30:25 +0200

x86/kaslr, ACPI/NUMA: Fix KASLR build error

There is no point in trying to compile KASLR-specific code when there
is no KASLR.

 [ bp: Move the whole crap into kaslr.c and make
   rand_mem_physical_padding static. Make kaslr_check_padding() weak to
   avoid build breakage on other architectures. ]

Reported-by: Naresh Kamboju
Reported-by: Mark Brown
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Borislav Petkov
Cc:
Cc:
Cc:
Cc:
Cc:
Cc:
Link: http://lkml.kernel.org/r/20181003123402.ga15...@hirez.programming.kicks-ass.net

---
 arch/x86/include/asm/setup.h |  2 --
 arch/x86/mm/kaslr.c          | 19 ++++++++++++++++++-
 drivers/acpi/numa.c          | 17 +++++------------
 3 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index 65a5bf8f6aba..ae13bc974416 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -80,8 +80,6 @@ static inline unsigned long kaslr_offset(void)
 	return (unsigned long)&_text - __START_KERNEL;
 }

-extern int rand_mem_physical_padding;
-
 /*
  * Do NOT EVER look at the BIOS memory size location.
  * It does not work on many machines.
diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index 00cf4cae38f5..b3471388288d 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -23,6 +23,7 @@
 #include
 #include
 #include
+#include

 #include
 #include
@@ -40,7 +41,7 @@
  */
 static const unsigned long vaddr_end = CPU_ENTRY_AREA_BASE;

-int __initdata rand_mem_physical_padding = CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING;
+static int __initdata rand_mem_physical_padding = CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING;
 /*
  * Memory regions randomized by KASLR (except modules that use a separate logic
  * earlier during boot). The list is ordered based on virtual addresses. This
@@ -70,6 +71,22 @@ static inline bool kaslr_memory_enabled(void)
 	return kaslr_enabled() && !IS_ENABLED(CONFIG_KASAN);
 }

+/*
+ * Check that the padding size for KASLR is enough.
+ */
+void __init kaslr_check_padding(void)
+{
+	u64 max_possible_phys, max_actual_phys, threshold;
+
+	max_actual_phys = roundup(PFN_PHYS(max_pfn), 1ULL << 40);
+	max_possible_phys = roundup(PFN_PHYS(max_possible_pfn), 1ULL << 40);
+	threshold = max_actual_phys + ((u64)rand_mem_physical_padding << 40);
+
+	if (max_possible_phys > threshold)
+		pr_warn("Set 'rand_mem_physical_padding=%llu' to avoid memory hotadd failure.\n",
+			(max_possible_phys - max_actual_phys) >> 40);
+}
+
 static int __init rand_mem_physical_padding_setup(char *str)
 {
 	int max_padding = (1 << (MAX_PHYSMEM_BITS - TB_SHIFT)) - 1;
diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index 3d69834c692f..ba62004f4d86 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -32,7 +32,6 @@
 #include
 #include
 #include
-#include

 static nodemask_t nodes_found_map = NODE_MASK_NONE;

@@ -433,10 +432,12 @@ acpi_table_parse_srat(enum acpi_srat_type id,
 			    handler, max_entries);
 }

+/* To be overridden by architectures */
+void __init __weak kaslr_check_padding(void) { }
+
 int __init acpi_numa_init(void)
 {
 	int cnt = 0;
-	u64 max_possible_phys, max_actual_phys, threshold;

 	if (acpi_disabled)
 		return -EINVAL;
@@ -466,17 +467,9 @@ int __init acpi_numa_init(void)
 	cnt = acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
 				    acpi_parse_memory_affinity, 0);

-	/* check the padding size for KASLR is enough. */
-	if (parsed_numa_memblks && kaslr_enabled()) {
-		max_actual_phys = roundup(PFN_PHYS(max_pfn), 1ULL << 40);
-		max_possible_phys = roundup(PFN_PHYS(max_possible_pfn), 1ULL << 40);
-		threshold = max_actual_phys + ((u64)rand_mem_physical_padding << 40);
+	if (parsed_numa_memblks)
+		kaslr_check_padding();

-		if (max_possible_phys > threshold) {
-			pr_warn("Set 'rand_mem_physical_padding=%llu' to avoid memory hotadd failure.\n",
-				(max_possible_phys - max_actual_phys) >> 40);
-		}
-	}
 	}

 	/* SLIT: System Locality Information Table */
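The check itself is pure arithmetic on terabyte-rounded bounds, which is
easy to verify in isolation. A minimal user-space sketch (roundup_tb()
stands in for roundup(x, 1ULL << 40); the memory sizes and padding value
are made up):

/* Build: cc -o kaslr_pad kaslr_pad.c */
#include <inttypes.h>
#include <stdio.h>

#define TB_SHIFT 40

/* Round up to a whole terabyte, like roundup(PFN_PHYS(pfn), 1ULL << 40). */
static uint64_t roundup_tb(uint64_t x)
{
	uint64_t tb = 1ULL << TB_SHIFT;

	return (x + tb - 1) / tb * tb;
}

int main(void)
{
	/* Hypothetical machine: 1.5 TB populated, hot-add region up to 4 TB. */
	uint64_t max_actual_phys   = roundup_tb(3ULL << (TB_SHIFT - 1)); /* -> 2 TB */
	uint64_t max_possible_phys = roundup_tb(4ULL << TB_SHIFT);       /* -> 4 TB */
	uint64_t padding = 1;	/* CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING, in TB */
	uint64_t threshold = max_actual_phys + (padding << TB_SHIFT);

	/* 4 TB > 3 TB: warn that 2 TB of padding is needed. */
	if (max_possible_phys > threshold)
		printf("Set 'rand_mem_physical_padding=%" PRIu64 "' to avoid memory hotadd failure.\n",
		       (max_possible_phys - max_actual_phys) >> TB_SHIFT);
	return 0;
}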
[tip:x86/boot] x86/kaslr, ACPI/NUMA: Fix KASLR build error
Commit-ID:  3a387c6d96e69f1710a3804eb68e1253263298f2
Gitweb:     https://git.kernel.org/tip/3a387c6d96e69f1710a3804eb68e1253263298f2
Author:     Peter Zijlstra (Intel)
AuthorDate: Wed, 3 Oct 2018 14:41:27 +0200
Committer:  Borislav Petkov
CommitDate: Wed, 3 Oct 2018 16:15:49 +0200

x86/kaslr, ACPI/NUMA: Fix KASLR build error

There is no point in trying to compile KASLR-specific code when there
is no KASLR.

 [ bp: Move the whole crap into kaslr.c and make
   rand_mem_physical_padding static. ]

Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Borislav Petkov
Cc:
Cc:
Cc:
Cc:
Cc:
Cc:
Link: http://lkml.kernel.org/r/20181003123402.ga15...@hirez.programming.kicks-ass.net

---
 arch/x86/include/asm/kaslr.h |  2 ++
 arch/x86/include/asm/setup.h |  2 --
 arch/x86/mm/kaslr.c          | 19 ++++++++++++++++++-
 drivers/acpi/numa.c          | 15 +++------------
 4 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/kaslr.h b/arch/x86/include/asm/kaslr.h
index db7ba2feb947..95ef3fc01d12 100644
--- a/arch/x86/include/asm/kaslr.h
+++ b/arch/x86/include/asm/kaslr.h
@@ -6,8 +6,10 @@ unsigned long kaslr_get_random_long(const char *purpose);

 #ifdef CONFIG_RANDOMIZE_MEMORY
 void kernel_randomize_memory(void);
+void kaslr_check_padding(void);
 #else
 static inline void kernel_randomize_memory(void) { }
+static inline void kaslr_check_padding(void) { }
 #endif /* CONFIG_RANDOMIZE_MEMORY */

 #endif
diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index 65a5bf8f6aba..ae13bc974416 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -80,8 +80,6 @@ static inline unsigned long kaslr_offset(void)
 	return (unsigned long)&_text - __START_KERNEL;
 }

-extern int rand_mem_physical_padding;
-
 /*
  * Do NOT EVER look at the BIOS memory size location.
  * It does not work on many machines.
diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index 00cf4cae38f5..b3471388288d 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -23,6 +23,7 @@
 #include
 #include
 #include
+#include

 #include
 #include
@@ -40,7 +41,7 @@
  */
 static const unsigned long vaddr_end = CPU_ENTRY_AREA_BASE;

-int __initdata rand_mem_physical_padding = CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING;
+static int __initdata rand_mem_physical_padding = CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING;
 /*
  * Memory regions randomized by KASLR (except modules that use a separate logic
  * earlier during boot). The list is ordered based on virtual addresses. This
@@ -70,6 +71,22 @@ static inline bool kaslr_memory_enabled(void)
 	return kaslr_enabled() && !IS_ENABLED(CONFIG_KASAN);
 }

+/*
+ * Check that the padding size for KASLR is enough.
+ */
+void __init kaslr_check_padding(void)
+{
+	u64 max_possible_phys, max_actual_phys, threshold;
+
+	max_actual_phys = roundup(PFN_PHYS(max_pfn), 1ULL << 40);
+	max_possible_phys = roundup(PFN_PHYS(max_possible_pfn), 1ULL << 40);
+	threshold = max_actual_phys + ((u64)rand_mem_physical_padding << 40);
+
+	if (max_possible_phys > threshold)
+		pr_warn("Set 'rand_mem_physical_padding=%llu' to avoid memory hotadd failure.\n",
+			(max_possible_phys - max_actual_phys) >> 40);
+}
+
 static int __init rand_mem_physical_padding_setup(char *str)
 {
 	int max_padding = (1 << (MAX_PHYSMEM_BITS - TB_SHIFT)) - 1;
diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index 3d69834c692f..4408e37600ef 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -32,7 +32,7 @@
 #include
 #include
 #include
-#include
+#include

 static nodemask_t nodes_found_map = NODE_MASK_NONE;

@@ -436,7 +436,6 @@ acpi_table_parse_srat(enum acpi_srat_type id,
 int __init acpi_numa_init(void)
 {
 	int cnt = 0;
-	u64 max_possible_phys, max_actual_phys, threshold;

 	if (acpi_disabled)
 		return -EINVAL;
@@ -466,17 +465,9 @@ int __init acpi_numa_init(void)
 	cnt = acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
 				    acpi_parse_memory_affinity, 0);

-	/* check the padding size for KASLR is enough. */
-	if (parsed_numa_memblks && kaslr_enabled()) {
-		max_actual_phys = roundup(PFN_PHYS(max_pfn), 1ULL << 40);
-		max_possible_phys = roundup(PFN_PHYS(max_possible_pfn), 1ULL << 40);
-		threshold = max_actual_phys + ((u64)rand_mem_physical_padding << 40);
+	if (parsed_numa_memblks)
+		kaslr_check_padding();

-		if (max_possible_phys > threshold) {
-			pr_warn("Set 'rand_mem_physical_padding=%llu' to avoid memory hotadd failure.\n",
-				(max_possible_phys - max_actual_phys) >> 40);
-		}
-	}
 	}

 	/* SLIT: System Locality Information Table */
[tip:smp/hotplug] perf: Avoid cpu_hotplug_lock r-r recursion
Commit-ID:  641693094ee1568502280f95900f374b2226b51d
Gitweb:     http://git.kernel.org/tip/641693094ee1568502280f95900f374b2226b51d
Author:     Peter Zijlstra (Intel)
AuthorDate: Tue, 18 Apr 2017 19:05:05 +0200
Committer:  Thomas Gleixner
CommitDate: Thu, 20 Apr 2017 13:08:57 +0200

perf: Avoid cpu_hotplug_lock r-r recursion

There are two call-sites where using static_key results in recursing
on the cpu_hotplug_lock.

Use the hotplug locked version of static_key_slow_inc().

Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Thomas Gleixner
Cc: Sebastian Siewior
Cc: Steven Rostedt
Cc: jba...@akamai.com
Link: http://lkml.kernel.org/r/20170418103422.687248...@infradead.org

---
 kernel/events/core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 634dd95..8aa3063 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7653,7 +7653,7 @@ static int perf_swevent_init(struct perf_event *event)
 		if (err)
 			return err;

-		static_key_slow_inc(&perf_swevent_enabled[event_id]);
+		static_key_slow_inc_cpuslocked(&perf_swevent_enabled[event_id]);
 		event->destroy = sw_perf_event_destroy;
 	}

@@ -9160,7 +9160,7 @@ static void account_event(struct perf_event *event)
 	mutex_lock(&perf_sched_mutex);
 	if (!atomic_read(&perf_sched_count)) {
-		static_branch_enable(&perf_sched_events);
+		static_key_slow_inc_cpuslocked(&perf_sched_events.key);
 		/*
 		 * Guarantee that all CPUs observe the key change and
 		 * call the perf scheduling hooks before proceeding to
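The "r-r recursion" being avoided is a reader re-acquiring a
reader/writer lock it already read-holds, which can deadlock once a
writer queues between the two acquisitions. A minimal user-space sketch
of the resulting calling convention (a pthread rwlock stands in for
cpu_hotplug_lock; the function names mirror the kernel's but this is
not kernel code):

/* Build: cc -o rr_recursion rr_recursion.c -lpthread */
#include <pthread.h>

static pthread_rwlock_t hotplug_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Core update; assumes the caller already holds hotplug_lock. */
static void __static_key_slow_inc(void)
{
	/* code patching would happen here */
}

/* Unlocked variant: takes the lock itself. */
static void static_key_slow_inc(void)
{
	pthread_rwlock_rdlock(&hotplug_lock);
	__static_key_slow_inc();
	pthread_rwlock_unlock(&hotplug_lock);
}

/* _cpuslocked variant: for callers that already read-hold the lock. */
static void static_key_slow_inc_cpuslocked(void)
{
	__static_key_slow_inc();
}

int main(void)
{
	/*
	 * perf's paths run with the hotplug lock already read-held;
	 * re-acquiring it via static_key_slow_inc() is r-r recursion and
	 * can deadlock against a queued writer, hence _cpuslocked.
	 */
	pthread_rwlock_rdlock(&hotplug_lock);
	static_key_slow_inc_cpuslocked();
	pthread_rwlock_unlock(&hotplug_lock);

	static_key_slow_inc();	/* fine when the lock is not held */
	return 0;
}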
[tip:smp/hotplug] jump_label: Provide static_key_slow_inc_cpuslocked()
Commit-ID:  f5efc6fad63f5533a6083e95286920d5753e52bf
Gitweb:     http://git.kernel.org/tip/f5efc6fad63f5533a6083e95286920d5753e52bf
Author:     Peter Zijlstra (Intel)
AuthorDate: Tue, 18 Apr 2017 19:05:04 +0200
Committer:  Thomas Gleixner
CommitDate: Thu, 20 Apr 2017 13:08:57 +0200

jump_label: Provide static_key_slow_inc_cpuslocked()

Provide static_key_slow_inc_cpuslocked(), a variant that doesn't take
cpu_hotplug_lock().

Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Thomas Gleixner
Cc: Sebastian Siewior
Cc: Steven Rostedt
Cc: jba...@akamai.com
Link: http://lkml.kernel.org/r/20170418103422.636958...@infradead.org

---
 include/linux/jump_label.h |  3 +++
 kernel/jump_label.c        | 21 +++++++++++++++++----
 2 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index 2afd74b..7d07f0b 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -158,6 +158,7 @@ extern void arch_jump_label_transform_static(struct jump_entry *entry,
 					     enum jump_label_type type);
 extern int jump_label_text_reserved(void *start, void *end);
 extern void static_key_slow_inc(struct static_key *key);
+extern void static_key_slow_inc_cpuslocked(struct static_key *key);
 extern void static_key_slow_dec(struct static_key *key);
 extern void jump_label_apply_nops(struct module *mod);
 extern int static_key_count(struct static_key *key);
@@ -213,6 +214,8 @@ static inline void static_key_slow_inc(struct static_key *key)
 	atomic_inc(&key->enabled);
 }

+#define static_key_slow_inc_cpuslocked static_key_slow_inc
+
 static inline void static_key_slow_dec(struct static_key *key)
 {
 	STATIC_KEY_CHECK_USE();
diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index f3afe07..308b12e 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -101,7 +101,7 @@ void static_key_disable(struct static_key *key)
 }
 EXPORT_SYMBOL_GPL(static_key_disable);

-void static_key_slow_inc(struct static_key *key)
+void __static_key_slow_inc(struct static_key *key)
 {
 	int v, v1;

@@ -130,7 +130,6 @@ void static_key_slow_inc(struct static_key *key)
 	 * the all CPUs, for that to be serialized against CPU hot-plug
 	 * we need to avoid CPUs coming online.
 	 */
-	get_online_cpus();
 	jump_label_lock();
 	if (atomic_read(&key->enabled) == 0) {
 		atomic_set(&key->enabled, -1);
@@ -140,10 +139,22 @@ void static_key_slow_inc(struct static_key *key)
 		atomic_inc(&key->enabled);
 	}
 	jump_label_unlock();
+}
+
+void static_key_slow_inc(struct static_key *key)
+{
+	get_online_cpus();
+	__static_key_slow_inc(key);
 	put_online_cpus();
 }
 EXPORT_SYMBOL_GPL(static_key_slow_inc);

+void static_key_slow_inc_cpuslocked(struct static_key *key)
+{
+	__static_key_slow_inc(key);
+}
+EXPORT_SYMBOL_GPL(static_key_slow_inc_cpuslocked);
+
 static void __static_key_slow_dec(struct static_key *key,
 		unsigned long rate_limit, struct delayed_work *work)
 {
@@ -154,7 +165,6 @@ static void __static_key_slow_dec(struct static_key *key,
 	 * returns is unbalanced, because all other static_key_slow_inc()
 	 * instances block while the update is in progress.
 	 */
-	get_online_cpus();
 	if (!atomic_dec_and_mutex_lock(&key->enabled, &jump_label_mutex)) {
 		WARN(atomic_read(&key->enabled) < 0,
 		     "jump label: negative count!\n");
@@ -168,20 +178,23 @@ static void __static_key_slow_dec(struct static_key *key,
 		jump_label_update(key);
 	}
 	jump_label_unlock();
-	put_online_cpus();
 }

 static void jump_label_update_timeout(struct work_struct *work)
 {
 	struct static_key_deferred *key =
 		container_of(work, struct static_key_deferred, work.work);

+	get_online_cpus();
 	__static_key_slow_dec(&key->key, 0, NULL);
+	put_online_cpus();
 }

 void static_key_slow_dec(struct static_key *key)
 {
 	STATIC_KEY_CHECK_USE();
+	get_online_cpus();
 	__static_key_slow_dec(key, 0, NULL);
+	put_online_cpus();
 }
 EXPORT_SYMBOL_GPL(static_key_slow_dec);
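Note the header-side fallback: when jump labels are unavailable, the
inline static_key_slow_inc() is just an atomic increment and takes no
locks, so aliasing the _cpuslocked name to it with a #define is safe. A
simplified sketch of that shape (C11 atomics standing in for the
kernel's atomic_t; names are illustrative, not the kernel headers):

/* Build: cc -std=c11 -o key_fallback key_fallback.c */
#include <stdatomic.h>
#include <stdio.h>

struct static_key { atomic_int enabled; };

/* Fallback inc: no code patching, hence no hotplug lock needed. */
static inline void static_key_slow_inc(struct static_key *key)
{
	atomic_fetch_add(&key->enabled, 1);
}

/* Same operation whether or not the caller holds the hotplug lock. */
#define static_key_slow_inc_cpuslocked static_key_slow_inc

int main(void)
{
	struct static_key key = { 0 };

	static_key_slow_inc(&key);
	static_key_slow_inc_cpuslocked(&key);
	printf("enabled=%d\n", atomic_load(&key.enabled));	/* 2 */
	return 0;
}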
[tip:smp/hotplug] jump_label: Pull get_online_cpus() into generic code
Commit-ID:  82947f31231157d8ab70fa8961f23fd3887a3327
Gitweb:     http://git.kernel.org/tip/82947f31231157d8ab70fa8961f23fd3887a3327
Author:     Peter Zijlstra (Intel)
AuthorDate: Tue, 18 Apr 2017 19:05:03 +0200
Committer:  Thomas Gleixner
CommitDate: Thu, 20 Apr 2017 13:08:57 +0200

jump_label: Pull get_online_cpus() into generic code

This change does two things:

 - it moves the get_online_cpus() call into generic code, with the aim
   of later providing some static_key ops that avoid it.

 - as a side effect it inverts the lock order between cpu_hotplug_lock
   and jump_label_mutex.

Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Thomas Gleixner
Cc: Sebastian Siewior
Cc: Steven Rostedt
Cc: jba...@akamai.com
Link: http://lkml.kernel.org/r/20170418103422.590118...@infradead.org

---
 arch/mips/kernel/jump_label.c  |  2 --
 arch/sparc/kernel/jump_label.c |  2 --
 arch/tile/kernel/jump_label.c  |  2 --
 arch/x86/kernel/jump_label.c   |  2 --
 kernel/jump_label.c            | 14 ++++++++++++++
 5 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/arch/mips/kernel/jump_label.c b/arch/mips/kernel/jump_label.c
index 3e586da..32e3168 100644
--- a/arch/mips/kernel/jump_label.c
+++ b/arch/mips/kernel/jump_label.c
@@ -58,7 +58,6 @@ void arch_jump_label_transform(struct jump_entry *e,
 		insn.word = 0; /* nop */
 	}

-	get_online_cpus();
 	mutex_lock(&text_mutex);
 	if (IS_ENABLED(CONFIG_CPU_MICROMIPS)) {
 		insn_p->halfword[0] = insn.word >> 16;
@@ -70,7 +69,6 @@ void arch_jump_label_transform(struct jump_entry *e,
 			   (unsigned long)insn_p + sizeof(*insn_p));

 	mutex_unlock(&text_mutex);
-	put_online_cpus();
 }

 #endif /* HAVE_JUMP_LABEL */
diff --git a/arch/sparc/kernel/jump_label.c b/arch/sparc/kernel/jump_label.c
index 07933b9..93adde1 100644
--- a/arch/sparc/kernel/jump_label.c
+++ b/arch/sparc/kernel/jump_label.c
@@ -41,12 +41,10 @@ void arch_jump_label_transform(struct jump_entry *entry,
 		val = 0x01000000;
 	}

-	get_online_cpus();
 	mutex_lock(&text_mutex);
 	*insn = val;
 	flushi(insn);
 	mutex_unlock(&text_mutex);
-	put_online_cpus();
 }
 #endif
diff --git a/arch/tile/kernel/jump_label.c b/arch/tile/kernel/jump_label.c
index 07802d5..93931a4 100644
--- a/arch/tile/kernel/jump_label.c
+++ b/arch/tile/kernel/jump_label.c
@@ -45,14 +45,12 @@ static void __jump_label_transform(struct jump_entry *e,
 void arch_jump_label_transform(struct jump_entry *e,
 				enum jump_label_type type)
 {
-	get_online_cpus();
 	mutex_lock(&text_mutex);

 	__jump_label_transform(e, type);
 	flush_icache_range(e->code, e->code + sizeof(tilegx_bundle_bits));

 	mutex_unlock(&text_mutex);
-	put_online_cpus();
 }

 __init_or_module void arch_jump_label_transform_static(struct jump_entry *e,
diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index c37bd0f..ab4f491 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -105,11 +105,9 @@ static void __jump_label_transform(struct jump_entry *entry,
 void arch_jump_label_transform(struct jump_entry *entry,
 			       enum jump_label_type type)
 {
-	get_online_cpus();
 	mutex_lock(&text_mutex);
 	__jump_label_transform(entry, type, NULL, 0);
 	mutex_unlock(&text_mutex);
-	put_online_cpus();
 }

 static enum {
diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index 6c9cb20..f3afe07 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -15,6 +15,7 @@
 #include
 #include
 #include
+#include

 #ifdef HAVE_JUMP_LABEL

@@ -124,6 +125,12 @@ void static_key_slow_inc(struct static_key *key)
 			return;
 	}

+	/*
+	 * A number of architectures need to synchronize I$ across
+	 * all CPUs, for that to be serialized against CPU hot-plug
+	 * we need to avoid CPUs coming online.
+	 */
+	get_online_cpus();
 	jump_label_lock();
 	if (atomic_read(&key->enabled) == 0) {
 		atomic_set(&key->enabled, -1);
@@ -133,6 +140,7 @@ void static_key_slow_inc(struct static_key *key)
 		atomic_inc(&key->enabled);
 	}
 	jump_label_unlock();
+	put_online_cpus();
 }
 EXPORT_SYMBOL_GPL(static_key_slow_inc);

@@ -146,6 +154,7 @@ static void __static_key_slow_dec(struct static_key *key,
 	 * returns is unbalanced, because all other static_key_slow_inc()
 	 * instances block while the update is in progress.
 	 */
+	get_online_cpus();
 	if (!atomic_dec_and_mutex_lock(&key->enabled, &jump_label_mutex)) {
 		WARN(atomic_read(&key->enabled) < 0,
 		     "jump label: negative count!\n");
@@ -159,6 +168,7 @@ static void __static_key_slow_dec(struct static_key *key,
 		jump_label_update(key);
 	}
 	jump_label_unlock();
+	put_online_cpus();
 }
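The refactoring pattern is hoisting a lock out of every backend into the
single generic caller, which fixes the order against the subsystem mutex
in one place. Reduced to plain C with pthread mutexes standing in for
the kernel locks (all names are illustrative):

/* Build: cc -o lock_order lock_order.c -lpthread */
#include <pthread.h>

static pthread_mutex_t hotplug_lock = PTHREAD_MUTEX_INITIALIZER; /* cpu_hotplug_lock */
static pthread_mutex_t label_lock = PTHREAD_MUTEX_INITIALIZER;   /* jump_label_mutex */

/* Arch backends no longer pin hotplug themselves; they may still take
 * their own text lock internally. */
static void arch_jump_label_transform(void)
{
	/* per-arch code patching */
}

/* Generic slow path: the order hotplug_lock -> label_lock is now fixed
 * here. Taking them the other way anywhere else would be a lock-order
 * inversion. */
static void static_key_update(void)
{
	pthread_mutex_lock(&hotplug_lock);	/* get_online_cpus() */
	pthread_mutex_lock(&label_lock);	/* jump_label_lock() */
	arch_jump_label_transform();
	pthread_mutex_unlock(&label_lock);	/* jump_label_unlock() */
	pthread_mutex_unlock(&hotplug_lock);	/* put_online_cpus() */
}

int main(void)
{
	static_key_update();
	return 0;
}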
[tip:perf/core] perf annotate: Add number of samples to the header
Commit-ID:  135cce1bf12bd30d7d66360022f9dac6ea3a07cd
Gitweb:     http://git.kernel.org/tip/135cce1bf12bd30d7d66360022f9dac6ea3a07cd
Author:     Peter Zijlstra (Intel)
AuthorDate: Thu, 30 Jun 2016 10:29:55 +0200
Committer:  Arnaldo Carvalho de Melo
CommitDate: Thu, 30 Jun 2016 18:27:42 -0300

perf annotate: Add number of samples to the header

Staring at annotations of large functions is useless if there are only
a few samples in them. Report the number of samples in the header to
make this easier to determine.

Committer note:

The change amounts to:

- Percent | Source code & Disassembly of perf-vdso.so for cycles:u
--
+ Percent | Source code & Disassembly of perf-vdso.so for cycles:u (3278 samples)
+

Signed-off-by: Peter Zijlstra (Intel)
Cc: Jiri Olsa
Link: http://lkml.kernel.org/r/20160630082955.ga30...@twins.programming.kicks-ass.net
[ split from a larger patch ]
Signed-off-by: Arnaldo Carvalho de Melo

---
 tools/perf/util/annotate.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index 78e5d6f..e9825fe 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -1522,6 +1522,7 @@ int symbol__annotate_printf(struct symbol *sym, struct map *map,
 	const char *d_filename;
 	const char *evsel_name = perf_evsel__name(evsel);
 	struct annotation *notes = symbol__annotation(sym);
+	struct sym_hist *h = annotation__histogram(notes, evsel->idx);
 	struct disasm_line *pos, *queue = NULL;
 	u64 start = map__rip_2objdump(map, sym->start);
 	int printed = 2, queue_len = 0;
@@ -1544,8 +1545,8 @@ int symbol__annotate_printf(struct symbol *sym, struct map *map,
 	if (perf_evsel__is_group_event(evsel))
 		width *= evsel->nr_members;

-	graph_dotted_len = printf(" %-*.*s| Source code & Disassembly of %s for %s\n",
-				  width, width, "Percent", d_filename, evsel_name);
+	graph_dotted_len = printf(" %-*.*s| Source code & Disassembly of %s for %s (%" PRIu64 " samples)\n",
+				  width, width, "Percent", d_filename, evsel_name, h->sum);

 	printf("%-*.*s\n", graph_dotted_len, graph_dotted_len, graph_dotted_line);
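The only formatting subtlety is printing a u64 portably, which is what
the PRIu64 macro in the new format string does. A standalone sketch of
the header line (the filename, event name, and sample count are made
up):

/* Build: cc -o samples_hdr samples_hdr.c */
#include <inttypes.h>
#include <stdio.h>

int main(void)
{
	uint64_t sum = 3278;	/* h->sum: total samples in the histogram */
	int width = 8;

	/* PRIu64 expands to the right conversion for uint64_t. */
	printf(" %-*.*s| Source code & Disassembly of %s for %s (%" PRIu64 " samples)\n",
	       width, width, "Percent", "perf-vdso.so", "cycles:u", sum);
	return 0;
}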
[tip:perf/core] perf annotate: Simplify header dotted line sizing
Commit-ID:  53dd9b5f95dda95bcadda1b4680be42dfe1f9e5e
Gitweb:     http://git.kernel.org/tip/53dd9b5f95dda95bcadda1b4680be42dfe1f9e5e
Author:     Peter Zijlstra (Intel)
AuthorDate: Thu, 30 Jun 2016 09:17:26 -0300
Committer:  Arnaldo Carvalho de Melo
CommitDate: Thu, 30 Jun 2016 09:21:03 -0300

perf annotate: Simplify header dotted line sizing

No need to use strlen() etc. to figure that out; just use the return
value from printf(), which tells how wide the following line needs to
be.

Signed-off-by: Peter Zijlstra (Intel)
Cc: Jiri Olsa
Link: http://lkml.kernel.org/r/20160630082955.ga30...@twins.programming.kicks-ass.net
[ split from a larger patch ]
Signed-off-by: Arnaldo Carvalho de Melo

---
 tools/perf/util/annotate.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index c385fec..78e5d6f 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -1528,7 +1528,7 @@ int symbol__annotate_printf(struct symbol *sym, struct map *map,
 	int more = 0;
 	u64 len;
 	int width = 8;
-	int namelen, evsel_name_len, graph_dotted_len;
+	int graph_dotted_len;

 	filename = strdup(dso->long_name);
 	if (!filename)
@@ -1540,17 +1540,14 @@ int symbol__annotate_printf(struct symbol *sym, struct map *map,
 	d_filename = basename(filename);

 	len = symbol__size(sym);
-	namelen = strlen(d_filename);
-	evsel_name_len = strlen(evsel_name);

 	if (perf_evsel__is_group_event(evsel))
 		width *= evsel->nr_members;

-	printf(" %-*.*s| Source code & Disassembly of %s for %s\n",
+	graph_dotted_len = printf(" %-*.*s| Source code & Disassembly of %s for %s\n",
 	       width, width, "Percent", d_filename, evsel_name);

-	graph_dotted_len = width + namelen + evsel_name_len;
-	printf("-%-*.*s-\n",
+	printf("%-*.*s\n",
 	       graph_dotted_len, graph_dotted_len, graph_dotted_line);

 	if (verbose)
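The trick the patch relies on is that printf() returns the number of
characters written, which is exactly the width the dotted underline
needs. A standalone demonstration (header contents are made up):

/* Build: cc -o dotted dotted.c */
#include <stdio.h>

int main(void)
{
	static const char dots[] =
		"--------------------------------------------------------------------";
	int len;

	len = printf(" %-8.8s| Source code & Disassembly of %s for %s\n",
		     "Percent", "vmlinux", "cycles");
	/* %-*.*s truncates dots to the measured header width; -1 drops
	 * the trailing '\n' from the count. */
	printf("%-*.*s\n", len - 1, len - 1, dots);
	return 0;
}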
[tip:smp/hotplug] sched: Allow per-cpu kernel threads to run on online && !active
Commit-ID:  e9d867a67fd03ccc07248ca4e9c2f74fed494d5b
Gitweb:     http://git.kernel.org/tip/e9d867a67fd03ccc07248ca4e9c2f74fed494d5b
Author:     Peter Zijlstra (Intel)
AuthorDate: Thu, 10 Mar 2016 12:54:08 +0100
Committer:  Thomas Gleixner
CommitDate: Fri, 6 May 2016 14:58:22 +0200

sched: Allow per-cpu kernel threads to run on online && !active

In order to enable symmetric hotplug, we must mirror the online &&
!active state of cpu-down on the cpu-up side.

However, to retain sanity, limit this state to per-cpu kthreads.

Aside from the change to set_cpus_allowed_ptr(), which allows moving
per-cpu kthreads onto such CPUs, the other critical piece is the cpu
selection for pinned tasks in select_task_rq(). This avoids dropping
into select_fallback_rq().

select_fallback_rq() cannot be allowed to select !active cpus because
it's used to migrate user tasks away. And we do not want to move user
tasks onto cpus that are in transition.

Requested-by: Thomas Gleixner
Signed-off-by: Peter Zijlstra (Intel)
Tested-by: Thomas Gleixner
Cc: Lai Jiangshan
Cc: Jan H. Schönherr
Cc: Oleg Nesterov
Cc: r...@linutronix.de
Link: http://lkml.kernel.org/r/20160301152303.gv6...@twins.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner

---
 arch/powerpc/kernel/smp.c |  2 +-
 arch/s390/kernel/smp.c    |  2 +-
 include/linux/cpumask.h   |  6 ++----
 kernel/sched/core.c       | 49 ++++++++++++++++++++++++++++++++++++++++++++++-------
 4 files changed, 46 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 8cac1eb..55c924b 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -565,7 +565,7 @@ int __cpu_up(unsigned int cpu, struct task_struct *tidle)
 		smp_ops->give_timebase();

 	/* Wait until cpu puts itself in the online & active maps */
-	while (!cpu_online(cpu) || !cpu_active(cpu))
+	while (!cpu_online(cpu))
 		cpu_relax();

 	return 0;
diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c
index 40a6b4f..7b89a75 100644
--- a/arch/s390/kernel/smp.c
+++ b/arch/s390/kernel/smp.c
@@ -832,7 +832,7 @@ int __cpu_up(unsigned int cpu, struct task_struct *tidle)
 	pcpu_attach_task(pcpu, tidle);
 	pcpu_start_fn(pcpu, smp_start_secondary, NULL);
 	/* Wait until cpu puts itself in the online & active maps */
-	while (!cpu_online(cpu) || !cpu_active(cpu))
+	while (!cpu_online(cpu))
 		cpu_relax();
 	return 0;
 }
diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 40cee6b..e828cf6 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -743,12 +743,10 @@ set_cpu_present(unsigned int cpu, bool present)
 static inline void
 set_cpu_online(unsigned int cpu, bool online)
 {
-	if (online) {
+	if (online)
 		cpumask_set_cpu(cpu, &__cpu_online_mask);
-		cpumask_set_cpu(cpu, &__cpu_active_mask);
-	} else {
+	else
 		cpumask_clear_cpu(cpu, &__cpu_online_mask);
-	}
 }

 static inline void
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8b489fc..8bfd7d4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1082,13 +1082,21 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
 static int __set_cpus_allowed_ptr(struct task_struct *p,
 				  const struct cpumask *new_mask, bool check)
 {
+	const struct cpumask *cpu_valid_mask = cpu_active_mask;
+	unsigned int dest_cpu;
 	unsigned long flags;
 	struct rq *rq;
-	unsigned int dest_cpu;
 	int ret = 0;

 	rq = task_rq_lock(p, &flags);

+	if (p->flags & PF_KTHREAD) {
+		/*
+		 * Kernel threads are allowed on online && !active CPUs
+		 */
+		cpu_valid_mask = cpu_online_mask;
+	}
+
 	/*
 	 * Must re-check here, to close a race against __kthread_bind(),
 	 * sched_setaffinity() is not guaranteed to observe the flag.
@@ -1101,18 +1109,28 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
 	if (cpumask_equal(&p->cpus_allowed, new_mask))
 		goto out;

-	if (!cpumask_intersects(new_mask, cpu_active_mask)) {
+	if (!cpumask_intersects(new_mask, cpu_valid_mask)) {
 		ret = -EINVAL;
 		goto out;
 	}

 	do_set_cpus_allowed(p, new_mask);

+	if (p->flags & PF_KTHREAD) {
+		/*
+		 * For kernel threads that do indeed end up on online &&
+		 * !active we want to ensure they are strict per-cpu threads.
+		 */
+		WARN_ON(cpumask_intersects(new_mask, cpu_online_mask) &&
+			!cpumask_intersects(new_mask, cpu_active_mask) &&
+			p->nr_cpus_allowed != 1);
+	}
+
 	/* Can the task run on the task's current CPU? If so, we're done */
 	if
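The core affinity rule is small enough to model in isolation: kernel
threads validate the requested mask against the online mask, everything
else against the active mask. A minimal user-space sketch (plain
bitmasks are hypothetical stand-ins for cpumasks; only the validity
check is modeled, not the migration):

/* Build: cc -o valid_mask valid_mask.c */
#include <stdio.h>

#define PF_KTHREAD 0x1

struct task { unsigned int flags; };

static unsigned int online_mask = 0x0f;	/* CPUs 0-3 online */
static unsigned int active_mask = 0x07;	/* CPU 3 online but not yet active */

static int set_cpus_allowed(struct task *p, unsigned int new_mask)
{
	/* cpu_valid_mask selection from __set_cpus_allowed_ptr(). */
	unsigned int valid = (p->flags & PF_KTHREAD) ? online_mask : active_mask;

	if (!(new_mask & valid))
		return -1;	/* -EINVAL: no usable CPU in the mask */
	/* ... migrate the task within new_mask & valid ... */
	return 0;
}

int main(void)
{
	struct task user = { 0 }, kthread = { PF_KTHREAD };

	/* CPU 3 is online && !active: only the kthread may be bound there. */
	printf("user    -> cpu3: %d\n", set_cpus_allowed(&user, 1u << 3));    /* -1 */
	printf("kthread -> cpu3: %d\n", set_cpus_allowed(&kthread, 1u << 3)); /*  0 */
	return 0;
}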
[tip:smp/hotplug] sched: Allow per-cpu kernel threads to run on online && !active
Commit-ID: e9d867a67fd03ccc07248ca4e9c2f74fed494d5b Gitweb: http://git.kernel.org/tip/e9d867a67fd03ccc07248ca4e9c2f74fed494d5b Author: Peter Zijlstra (Intel) AuthorDate: Thu, 10 Mar 2016 12:54:08 +0100 Committer: Thomas Gleixner CommitDate: Fri, 6 May 2016 14:58:22 +0200 sched: Allow per-cpu kernel threads to run on online && !active In order to enable symmetric hotplug, we must mirror the online && !active state of cpu-down on the cpu-up side. However, to retain sanity, limit this state to per-cpu kthreads. Aside from the change to set_cpus_allowed_ptr(), which allow moving the per-cpu kthreads on, the other critical piece is the cpu selection for pinned tasks in select_task_rq(). This avoids dropping into select_fallback_rq(). select_fallback_rq() cannot be allowed to select !active cpus because its used to migrate user tasks away. And we do not want to move user tasks onto cpus that are in transition. Requested-by: Thomas Gleixner Signed-off-by: Peter Zijlstra (Intel) Tested-by: Thomas Gleixner Cc: Lai Jiangshan Cc: Jan H. Schönherr Cc: Oleg Nesterov Cc: r...@linutronix.de Link: http://lkml.kernel.org/r/20160301152303.gv6...@twins.programming.kicks-ass.net Signed-off-by: Thomas Gleixner --- arch/powerpc/kernel/smp.c | 2 +- arch/s390/kernel/smp.c| 2 +- include/linux/cpumask.h | 6 ++ kernel/sched/core.c | 49 --- 4 files changed, 46 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 8cac1eb..55c924b 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -565,7 +565,7 @@ int __cpu_up(unsigned int cpu, struct task_struct *tidle) smp_ops->give_timebase(); /* Wait until cpu puts itself in the online & active maps */ - while (!cpu_online(cpu) || !cpu_active(cpu)) + while (!cpu_online(cpu)) cpu_relax(); return 0; diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c index 40a6b4f..7b89a75 100644 --- a/arch/s390/kernel/smp.c +++ b/arch/s390/kernel/smp.c @@ -832,7 +832,7 @@ int __cpu_up(unsigned int cpu, struct task_struct *tidle) pcpu_attach_task(pcpu, tidle); pcpu_start_fn(pcpu, smp_start_secondary, NULL); /* Wait until cpu puts itself in the online & active maps */ - while (!cpu_online(cpu) || !cpu_active(cpu)) + while (!cpu_online(cpu)) cpu_relax(); return 0; } diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h index 40cee6b..e828cf6 100644 --- a/include/linux/cpumask.h +++ b/include/linux/cpumask.h @@ -743,12 +743,10 @@ set_cpu_present(unsigned int cpu, bool present) static inline void set_cpu_online(unsigned int cpu, bool online) { - if (online) { + if (online) cpumask_set_cpu(cpu, &__cpu_online_mask); - cpumask_set_cpu(cpu, &__cpu_active_mask); - } else { + else cpumask_clear_cpu(cpu, &__cpu_online_mask); - } } static inline void diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 8b489fc..8bfd7d4 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1082,13 +1082,21 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask) static int __set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask, bool check) { + const struct cpumask *cpu_valid_mask = cpu_active_mask; + unsigned int dest_cpu; unsigned long flags; struct rq *rq; - unsigned int dest_cpu; int ret = 0; rq = task_rq_lock(p, ); + if (p->flags & PF_KTHREAD) { + /* +* Kernel threads are allowed on online && !active CPUs +*/ + cpu_valid_mask = cpu_online_mask; + } + /* * Must re-check here, to close a race against __kthread_bind(), * sched_setaffinity() is not guaranteed to observe the 
flag. @@ -1101,18 +1109,28 @@ static int __set_cpus_allowed_ptr(struct task_struct *p, if (cpumask_equal(&p->cpus_allowed, new_mask)) goto out; - if (!cpumask_intersects(new_mask, cpu_active_mask)) { + if (!cpumask_intersects(new_mask, cpu_valid_mask)) { ret = -EINVAL; goto out; } do_set_cpus_allowed(p, new_mask); + if (p->flags & PF_KTHREAD) { + /* +* For kernel threads that do indeed end up on online && +* !active we want to ensure they are strict per-cpu threads. +*/ + WARN_ON(cpumask_intersects(new_mask, cpu_online_mask) && + !cpumask_intersects(new_mask, cpu_active_mask) && + p->nr_cpus_allowed != 1); + } + /* Can the task run on the task's current CPU? If so, we're done */ if
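The affinity check above reduces to: choose a validity mask by task type, then reject any new affinity mask that misses it. A minimal user-space sketch of that decision, with hypothetical names (plain 64-bit masks stand in for struct cpumask, a bool for PF_KTHREAD; this is an illustration, not the kernel code):

#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for the kernel's cpumasks. */
static const uint64_t cpu_online_mask = 0x0f; /* CPUs 0-3 online            */
static const uint64_t cpu_active_mask = 0x07; /* CPU 3 is online && !active */

/* Sketch of the mask selection done in __set_cpus_allowed_ptr(). */
static int set_cpus_allowed_sketch(bool is_kthread, uint64_t new_mask)
{
	/* Per-cpu kernel threads may run on online && !active CPUs. */
	uint64_t cpu_valid_mask = is_kthread ? cpu_online_mask : cpu_active_mask;

	if (!(new_mask & cpu_valid_mask))
		return -EINVAL;
	return 0;
}

int main(void)
{
	/* A user task bound to the transitioning CPU 3 is rejected ... */
	printf("user task -> CPU3: %d\n", set_cpus_allowed_sketch(false, 1ULL << 3));
	/* ... while a per-cpu kthread is allowed onto it. */
	printf("kthread   -> CPU3: %d\n", set_cpus_allowed_sketch(true, 1ULL << 3));
	return 0;
}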
[tip:sched/core] wait.[ch]: Introduce the simple waitqueue (swait) implementation
Commit-ID: 13b35686e8b934ff78f59cef0c65fa3a43f8eeaf Gitweb: http://git.kernel.org/tip/13b35686e8b934ff78f59cef0c65fa3a43f8eeaf Author: Peter Zijlstra (Intel) AuthorDate: Fri, 19 Feb 2016 09:46:37 +0100 Committer: Thomas Gleixner CommitDate: Thu, 25 Feb 2016 11:27:16 +0100 wait.[ch]: Introduce the simple waitqueue (swait) implementation The existing wait queue support has support for custom wake up call backs, wake flags, wake key (passed to call back) and exclusive flags that allow wakers to be tagged as exclusive, for limiting the number of wakers. In a lot of cases, none of these features are used, and hence we can benefit from a slimmed down version that lowers memory overhead and reduces runtime overhead. The concept originated from -rt, where waitqueues are a constant source of trouble, as we can't convert the head lock to a raw spinlock due to fancy and long lasting callbacks. With the removal of custom callbacks, we can use a raw lock for queue list manipulations, hence allowing the simple wait support to be used in -rt. [Patch is from PeterZ which is based on Thomas version. Commit message is written by Paul G. Daniel: - Fixed some compile issues - Added non-lazy implementation of swake_up_locked as suggested by Boqun Feng.] Originally-by: Thomas Gleixner Signed-off-by: Daniel Wagner Acked-by: Peter Zijlstra (Intel) Cc: linux-rt-us...@vger.kernel.org Cc: Boqun Feng Cc: Marcelo Tosatti Cc: Steven Rostedt Cc: Paul Gortmaker Cc: Paolo Bonzini Cc: "Paul E. McKenney" Link: http://lkml.kernel.org/r/1455871601-27484-2-git-send-email-w...@monom.org Signed-off-by: Thomas Gleixner --- include/linux/swait.h | 172 ++ kernel/sched/Makefile | 2 +- kernel/sched/swait.c | 123 3 files changed, 296 insertions(+), 1 deletion(-) diff --git a/include/linux/swait.h b/include/linux/swait.h new file mode 100644 index 000..c1f9c62 --- /dev/null +++ b/include/linux/swait.h @@ -0,0 +1,172 @@ +#ifndef _LINUX_SWAIT_H +#define _LINUX_SWAIT_H + +#include +#include +#include +#include + +/* + * Simple wait queues + * + * While these are very similar to the other/complex wait queues (wait.h) the + * most important difference is that the simple waitqueue allows for + * deterministic behaviour -- IOW it has strictly bounded IRQ and lock hold + * times. + * + * In order to make this so, we had to drop a fair number of features of the + * other waitqueue code; notably: + * + * - mixing INTERRUPTIBLE and UNINTERRUPTIBLE sleeps on the same waitqueue; + *all wakeups are TASK_NORMAL in order to avoid O(n) lookups for the right + *sleeper state. + * + * - the exclusive mode; because this requires preserving the list order + *and this is hard. + * + * - custom wake functions; because you cannot give any guarantees about + *random code. + * + * As a side effect of this; the data structures are slimmer. + * + * One would recommend using this wait queue where possible. 
+ */ + +struct task_struct; + +struct swait_queue_head { + raw_spinlock_t lock; + struct list_head task_list; +}; + +struct swait_queue { + struct task_struct *task; + struct list_head task_list; +}; + +#define __SWAITQUEUE_INITIALIZER(name) { \ + .task = current, \ + .task_list = LIST_HEAD_INIT((name).task_list), \ +} + +#define DECLARE_SWAITQUEUE(name) \ + struct swait_queue name = __SWAITQUEUE_INITIALIZER(name) + +#define __SWAIT_QUEUE_HEAD_INITIALIZER(name) { \ + .lock = __RAW_SPIN_LOCK_UNLOCKED(name.lock), \ + .task_list = LIST_HEAD_INIT((name).task_list), \ +} + +#define DECLARE_SWAIT_QUEUE_HEAD(name) \ + struct swait_queue_head name = __SWAIT_QUEUE_HEAD_INITIALIZER(name) + +extern void __init_swait_queue_head(struct swait_queue_head *q, const char *name, + struct lock_class_key *key); + +#define init_swait_queue_head(q) \ + do {\ + static struct lock_class_key __key; \ + __init_swait_queue_head((q), #q, &__key); \ + } while (0) + +#ifdef CONFIG_LOCKDEP +# define __SWAIT_QUEUE_HEAD_INIT_ONSTACK(name) \ + ({ init_swait_queue_head(&name); name; }) +# define DECLARE_SWAIT_QUEUE_HEAD_ONSTACK(name)\ + struct swait_queue_head name = __SWAIT_QUEUE_HEAD_INIT_ONSTACK(name) +#else +# define DECLARE_SWAIT_QUEUE_HEAD_ONSTACK(name)\ + DECLARE_SWAIT_QUEUE_HEAD(name) +#endif + +static inline int swait_active(struct swait_queue_head *q) +{ + return
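Assuming the interface as introduced here, the usual pairing is swait_event_interruptible() on the consumer side and swake_up() on the producer side. A hypothetical usage fragment (my_wq, my_cond, my_wait and my_signal are illustrative names, not part of the patch):

#include <linux/swait.h>
#include <linux/types.h>

static DECLARE_SWAIT_QUEUE_HEAD(my_wq);
static bool my_cond;

/* Consumer, process context: sleep until my_cond becomes true. */
static int my_wait(void)
{
	return swait_event_interruptible(my_wq, my_cond);
}

/* Producer, usable from IRQ context: the head lock is a raw spinlock
 * and swake_up() wakes at most one TASK_NORMAL waiter, so the time
 * spent with the lock held stays bounded. */
static void my_signal(void)
{
	my_cond = true;
	swake_up(&my_wq);
}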
[tip:perf/core] perf/core: Rename perf_event_read_{one,group}, perf_read_hw
Commit-ID: b15f495b4e9295cf21065d8569835a2f18cfe41b Gitweb: http://git.kernel.org/tip/b15f495b4e9295cf21065d8569835a2f18cfe41b Author: Peter Zijlstra (Intel) AuthorDate: Thu, 3 Sep 2015 20:07:47 -0700 Committer: Ingo Molnar CommitDate: Sun, 13 Sep 2015 11:27:26 +0200 perf/core: Rename perf_event_read_{one,group}, perf_read_hw In order to free up the perf_event_read_group() name: s/perf_event_read_\(one\|group\)/perf_read_\1/g s/perf_read_hw/__perf_read/g Signed-off-by: Peter Zijlstra (Intel) Cc: Arnaldo Carvalho de Melo Cc: Arnaldo Carvalho de Melo Cc: Jiri Olsa Cc: Linus Torvalds Cc: Michael Ellerman Cc: Peter Zijlstra Cc: Stephane Eranian Cc: Thomas Gleixner Cc: Vince Weaver Link: http://lkml.kernel.org/r/1441336073-22750-5-git-send-email-suka...@linux.vnet.ibm.com Signed-off-by: Ingo Molnar --- kernel/events/core.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/kernel/events/core.c b/kernel/events/core.c index 260bf8c..67b7dba 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -3742,7 +3742,7 @@ static void put_event(struct perf_event *event) * see the comment there. * * 2) there is a lock-inversion with mmap_sem through -* perf_event_read_group(), which takes faults while +* perf_read_group(), which takes faults while * holding ctx->mutex, however this is called after * the last filedesc died, so there is no possibility * to trigger the AB-BA case. @@ -3837,7 +3837,7 @@ u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running) } EXPORT_SYMBOL_GPL(perf_event_read_value); -static int perf_event_read_group(struct perf_event *event, +static int perf_read_group(struct perf_event *event, u64 read_format, char __user *buf) { struct perf_event *leader = event->group_leader, *sub; @@ -3885,7 +3885,7 @@ static int perf_event_read_group(struct perf_event *event, return ret; } -static int perf_event_read_one(struct perf_event *event, +static int perf_read_one(struct perf_event *event, u64 read_format, char __user *buf) { u64 enabled, running; @@ -3923,7 +3923,7 @@ static bool is_event_hup(struct perf_event *event) * Read the performance event - simple non blocking version for now */ static ssize_t -perf_read_hw(struct perf_event *event, char __user *buf, size_t count) +__perf_read(struct perf_event *event, char __user *buf, size_t count) { u64 read_format = event->attr.read_format; int ret; @@ -3941,9 +3941,9 @@ perf_read_hw(struct perf_event *event, char __user *buf, size_t count) WARN_ON_ONCE(event->ctx->parent_ctx); if (read_format & PERF_FORMAT_GROUP) - ret = perf_event_read_group(event, read_format, buf); + ret = perf_read_group(event, read_format, buf); else - ret = perf_event_read_one(event, read_format, buf); + ret = perf_read_one(event, read_format, buf); return ret; } @@ -3956,7 +3956,7 @@ perf_read(struct file *file, char __user *buf, size_t count, loff_t *ppos) int ret; ctx = perf_event_ctx_lock(event); - ret = perf_read_hw(event, buf, count); + ret = __perf_read(event, buf, count); perf_event_ctx_unlock(event, ctx); return ret; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
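The renamed helpers back read(2) on a perf event file descriptor: perf_read_group() serves events opened with PERF_FORMAT_GROUP in their read_format, perf_read_one() serves the rest. A small user-space sketch of the group-format path (a group of one event; error handling kept minimal, event choice arbitrary):

#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	struct perf_event_attr attr;
	uint64_t buf[4]; /* u64 nr, then one u64 value per group member */
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_INSTRUCTIONS;
	attr.read_format = PERF_FORMAT_GROUP; /* read() -> perf_read_group() */

	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	if (read(fd, buf, sizeof(buf)) > 0) /* group of one: nr == 1 */
		printf("nr=%llu instructions=%llu\n",
		       (unsigned long long)buf[0],
		       (unsigned long long)buf[1]);

	close(fd);
	return 0;
}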
[tip:locking/core] locking/pvqspinlock, x86: Implement the paravirt qspinlock call patching
Commit-ID: f233f7f1581e78fd9b4023f2e7d8c1ed89020cc9 Gitweb: http://git.kernel.org/tip/f233f7f1581e78fd9b4023f2e7d8c1ed89020cc9 Author: Peter Zijlstra (Intel) AuthorDate: Fri, 24 Apr 2015 14:56:38 -0400 Committer: Ingo Molnar CommitDate: Fri, 8 May 2015 12:37:09 +0200 locking/pvqspinlock, x86: Implement the paravirt qspinlock call patching We use the regular paravirt call patching to switch between: native_queued_spin_lock_slowpath()__pv_queued_spin_lock_slowpath() native_queued_spin_unlock() __pv_queued_spin_unlock() We use a callee saved call for the unlock function which reduces the i-cache footprint and allows 'inlining' of SPIN_UNLOCK functions again. We further optimize the unlock path by patching the direct call with a "movb $0,%arg1" if we are indeed using the native unlock code. This makes the unlock code almost as fast as the !PARAVIRT case. This significantly lowers the overhead of having CONFIG_PARAVIRT_SPINLOCKS enabled, even for native code. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Waiman Long Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Boris Ostrovsky Cc: Borislav Petkov Cc: Daniel J Blueman Cc: David Vrabel Cc: Douglas Hatch Cc: H. Peter Anvin Cc: Konrad Rzeszutek Wilk Cc: Linus Torvalds Cc: Oleg Nesterov Cc: Paolo Bonzini Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Raghavendra K T Cc: Rik van Riel Cc: Scott J Norton Cc: Thomas Gleixner Cc: virtualizat...@lists.linux-foundation.org Cc: xen-de...@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-10-git-send-email-waiman.l...@hp.com Signed-off-by: Ingo Molnar --- arch/x86/Kconfig | 2 +- arch/x86/include/asm/paravirt.h | 29 - arch/x86/include/asm/paravirt_types.h | 10 ++ arch/x86/include/asm/qspinlock.h | 25 - arch/x86/include/asm/qspinlock_paravirt.h | 6 ++ arch/x86/kernel/paravirt-spinlocks.c | 24 +++- arch/x86/kernel/paravirt_patch_32.c | 22 ++ arch/x86/kernel/paravirt_patch_64.c | 22 ++ 8 files changed, 128 insertions(+), 12 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 90b1b54..50ec043 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -667,7 +667,7 @@ config PARAVIRT_DEBUG config PARAVIRT_SPINLOCKS bool "Paravirtualization layer for spinlocks" depends on PARAVIRT && SMP - select UNINLINE_SPIN_UNLOCK + select UNINLINE_SPIN_UNLOCK if !QUEUED_SPINLOCK ---help--- Paravirtualized spinlocks allow a pvops backend to replace the spinlock implementation with something virtualization-friendly diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 8957810..266c353 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -712,6 +712,31 @@ static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx, #if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT_SPINLOCKS) +#ifdef CONFIG_QUEUED_SPINLOCK + +static __always_inline void pv_queued_spin_lock_slowpath(struct qspinlock *lock, + u32 val) +{ + PVOP_VCALL2(pv_lock_ops.queued_spin_lock_slowpath, lock, val); +} + +static __always_inline void pv_queued_spin_unlock(struct qspinlock *lock) +{ + PVOP_VCALLEE1(pv_lock_ops.queued_spin_unlock, lock); +} + +static __always_inline void pv_wait(u8 *ptr, u8 val) +{ + PVOP_VCALL2(pv_lock_ops.wait, ptr, val); +} + +static __always_inline void pv_kick(int cpu) +{ + PVOP_VCALL1(pv_lock_ops.kick, cpu); +} + +#else /* !CONFIG_QUEUED_SPINLOCK */ + static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock, __ticket_t ticket) { @@ -724,7 +749,9 @@ static __always_inline void 
__ticket_unlock_kick(struct arch_spinlock *lock, PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket); } -#endif +#endif /* CONFIG_QUEUED_SPINLOCK */ + +#endif /* SMP && PARAVIRT_SPINLOCKS */ #ifdef CONFIG_X86_32 #define PV_SAVE_REGS "pushl %ecx; pushl %edx;" diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index f7b0b5c..76cd684 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -333,9 +333,19 @@ struct arch_spinlock; typedef u16 __ticket_t; #endif +struct qspinlock; + struct pv_lock_ops { +#ifdef CONFIG_QUEUED_SPINLOCK + void (*queued_spin_lock_slowpath)(struct qspinlock *lock, u32 val); + struct paravirt_callee_save queued_spin_unlock; + + void (*wait)(u8 *ptr, u8 val); + void (*kick)(int cpu); +#else /* !CONFIG_QUEUED_SPINLOCK */ struct paravirt_callee_save lock_spinning; void (*unlock_kick)(struct arch_spinlock *lock, __ticket_t ticket);
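The mechanism being wired up is an ops table whose slots default to the native functions and are repointed when running paravirtualized; on top of that, x86 patches the call sites themselves. A user-space analogy of the indirection, with hypothetical names (plain C cannot express the instruction patching, so only the table switch is shown):

#include <stdio.h>

struct lock { unsigned char locked; };

/* Native unlock: the store the patcher inlines as "movb $0, (%rdi)". */
static void native_unlock(struct lock *l)
{
	l->locked = 0;
}

/* Paravirt unlock: also kicks a possibly-sleeping vCPU waiter. */
static void pv_unlock(struct lock *l)
{
	l->locked = 0;
	printf("pv: kick waiter\n");
}

static struct {
	void (*queued_spin_unlock)(struct lock *);
} lock_ops = { .queued_spin_unlock = native_unlock };

static void spin_unlock(struct lock *l)
{
	lock_ops.queued_spin_unlock(l); /* the kernel patches this call site */
}

int main(void)
{
	struct lock l = { .locked = 1 };

	spin_unlock(&l);                         /* native: plain byte store */
	lock_ops.queued_spin_unlock = pv_unlock; /* "hypervisor detected"    */
	l.locked = 1;
	spin_unlock(&l);                         /* paravirt: store + kick   */
	return 0;
}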
[tip:locking/core] locking/qspinlock: Optimize for smaller NR_CPUS
Commit-ID: 69f9cae90907e09af95fb991ed384670cef8dd32 Gitweb: http://git.kernel.org/tip/69f9cae90907e09af95fb991ed384670cef8dd32 Author: Peter Zijlstra (Intel) AuthorDate: Fri, 24 Apr 2015 14:56:34 -0400 Committer: Ingo Molnar CommitDate: Fri, 8 May 2015 12:36:48 +0200 locking/qspinlock: Optimize for smaller NR_CPUS When we allow for a max NR_CPUS < 2^14 we can optimize the pending wait-acquire and the xchg_tail() operations. By growing the pending bit to a byte, we reduce the tail to 16bit. This means we can use xchg16 for the tail part and do away with all the repeated compxchg() operations. This in turn allows us to unconditionally acquire; the locked state as observed by the wait loops cannot change. And because both locked and pending are now a full byte we can use simple stores for the state transition, obviating one atomic operation entirely. This optimization is needed to make the qspinlock achieve performance parity with ticket spinlock at light load. All this is horribly broken on Alpha pre EV56 (and any other arch that cannot do single-copy atomic byte stores). Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Waiman Long Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Boris Ostrovsky Cc: Borislav Petkov Cc: Daniel J Blueman Cc: David Vrabel Cc: Douglas Hatch Cc: H. Peter Anvin Cc: Konrad Rzeszutek Wilk Cc: Linus Torvalds Cc: Oleg Nesterov Cc: Paolo Bonzini Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Raghavendra K T Cc: Rik van Riel Cc: Scott J Norton Cc: Thomas Gleixner Cc: virtualizat...@lists.linux-foundation.org Cc: xen-de...@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-6-git-send-email-waiman.l...@hp.com Signed-off-by: Ingo Molnar --- include/asm-generic/qspinlock_types.h | 13 +++ kernel/locking/qspinlock.c| 69 ++- 2 files changed, 81 insertions(+), 1 deletion(-) diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h index 3a7f671..85f888e 100644 --- a/include/asm-generic/qspinlock_types.h +++ b/include/asm-generic/qspinlock_types.h @@ -35,6 +35,14 @@ typedef struct qspinlock { /* * Bitfields in the atomic value: * + * When NR_CPUS < 16K + * 0- 7: locked byte + * 8: pending + * 9-15: not used + * 16-17: tail index + * 18-31: tail cpu (+1) + * + * When NR_CPUS >= 16K * 0- 7: locked byte * 8: pending * 9-10: tail index @@ -47,7 +55,11 @@ typedef struct qspinlock { #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED) #define _Q_PENDING_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS) +#if CONFIG_NR_CPUS < (1U << 14) +#define _Q_PENDING_BITS8 +#else #define _Q_PENDING_BITS1 +#endif #define _Q_PENDING_MASK_Q_SET_MASK(PENDING) #define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS) @@ -58,6 +70,7 @@ typedef struct qspinlock { #define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET) #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU) +#define _Q_TAIL_OFFSET _Q_TAIL_IDX_OFFSET #define _Q_TAIL_MASK (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK) #define _Q_LOCKED_VAL (1U << _Q_LOCKED_OFFSET) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 82bb4a9..e17efe7 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -24,6 +24,7 @@ #include #include #include +#include #include /* @@ -56,6 +57,10 @@ * node; whereby avoiding the need to carry a node from lock to unlock, and * preserving existing lock API. This also makes the unlock code simpler and * faster. + * + * N.B. The current implementation only supports architectures that allow + * atomic operations on smaller 8-bit and 16-bit data types. 
+ * */ #include "mcs_spinlock.h" @@ -96,6 +101,62 @@ static inline struct mcs_spinlock *decode_tail(u32 tail) #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) +/* + * By using the whole 2nd least significant byte for the pending bit, we + * can allow better optimization of the lock acquisition for the pending + * bit holder. + */ +#if _Q_PENDING_BITS == 8 + +struct __qspinlock { + union { + atomic_t val; + struct { +#ifdef __LITTLE_ENDIAN + u16 locked_pending; + u16 tail; +#else + u16 tail; + u16 locked_pending; +#endif + }; + }; +}; + +/** + * clear_pending_set_locked - take ownership and clear the pending bit. + * @lock: Pointer to queued spinlock structure + * + * *,1,0 -> *,0,1 + * + * Lock stealing is not allowed if this function is used. + */ +static __always_inline void clear_pending_set_locked(struct qspinlock *lock) +{ + struct __qspinlock *l = (void *)lock; + + WRITE_ONCE(l->locked_pending, _Q_LOCKED_VAL); +} + +/* + * xchg_tail - Put in the new queue tail code word
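With pending widened to a byte and the tail packed into the upper 16 bits, the two hot transitions become plain halfword operations. The encoding can be checked in user space; the sketch below assumes a little-endian host and mirrors the patch's struct __qspinlock (no atomics, so it demonstrates the layout only, not the locking):

#include <stdint.h>
#include <stdio.h>

/* NR_CPUS < 16K layout from the patch:
 * 0-7 locked byte, 8-15 pending byte, 16-17 tail idx, 18-31 tail cpu (+1). */
#define _Q_LOCKED_OFFSET	0
#define _Q_PENDING_OFFSET	8
#define _Q_TAIL_IDX_OFFSET	16
#define _Q_TAIL_CPU_OFFSET	18
#define _Q_LOCKED_VAL		(1U << _Q_LOCKED_OFFSET)
#define _Q_PENDING_VAL		(1U << _Q_PENDING_OFFSET)

union qword {			/* little-endian mirror of struct __qspinlock */
	uint32_t val;
	struct {
		uint16_t locked_pending;	/* bits  0-15 */
		uint16_t tail;			/* bits 16-31 */
	};
};

int main(void)
{
	union qword q = { .val = _Q_PENDING_VAL };	/* state *,1,0 */

	/* clear_pending_set_locked(): *,1,0 -> *,0,1 is now one plain
	 * 16-bit store instead of a cmpxchg loop. */
	q.locked_pending = _Q_LOCKED_VAL;

	/* xchg_tail(): the tail lives in its own halfword, so a 16-bit
	 * xchg suffices to publish cpu 3 (stored +1), idx 0. */
	q.tail = (uint16_t)((3 + 1) << (_Q_TAIL_CPU_OFFSET - 16));

	printf("val=0x%08x\n", q.val);	/* prints 0x00100001 */
	return 0;
}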
[tip:locking/core] locking/qspinlock: Revert to test-and-set on hypervisors
Commit-ID: 2aa79af64263190eec610422b07f60e99a7d230a Gitweb: http://git.kernel.org/tip/2aa79af64263190eec610422b07f60e99a7d230a Author: Peter Zijlstra (Intel) AuthorDate: Fri, 24 Apr 2015 14:56:36 -0400 Committer: Ingo Molnar CommitDate: Fri, 8 May 2015 12:36:58 +0200 locking/qspinlock: Revert to test-and-set on hypervisors When we detect a hypervisor (!paravirt, see qspinlock paravirt support patches), revert to a simple test-and-set lock to avoid the horrors of queue preemption. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Waiman Long Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Boris Ostrovsky Cc: Borislav Petkov Cc: Daniel J Blueman Cc: David Vrabel Cc: Douglas Hatch Cc: H. Peter Anvin Cc: Konrad Rzeszutek Wilk Cc: Linus Torvalds Cc: Oleg Nesterov Cc: Paolo Bonzini Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Raghavendra K T Cc: Rik van Riel Cc: Scott J Norton Cc: Thomas Gleixner Cc: virtualizat...@lists.linux-foundation.org Cc: xen-de...@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-8-git-send-email-waiman.l...@hp.com Signed-off-by: Ingo Molnar --- arch/x86/include/asm/qspinlock.h | 14 ++ include/asm-generic/qspinlock.h | 7 +++ kernel/locking/qspinlock.c | 3 +++ 3 files changed, 24 insertions(+) diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h index e2aee82..f079b70 100644 --- a/arch/x86/include/asm/qspinlock.h +++ b/arch/x86/include/asm/qspinlock.h @@ -1,6 +1,7 @@ #ifndef _ASM_X86_QSPINLOCK_H #define _ASM_X86_QSPINLOCK_H +#include #include #definequeued_spin_unlock queued_spin_unlock @@ -15,6 +16,19 @@ static inline void queued_spin_unlock(struct qspinlock *lock) smp_store_release((u8 *)lock, 0); } +#define virt_queued_spin_lock virt_queued_spin_lock + +static inline bool virt_queued_spin_lock(struct qspinlock *lock) +{ + if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) + return false; + + while (atomic_cmpxchg(>val, 0, _Q_LOCKED_VAL) != 0) + cpu_relax(); + + return true; +} + #include #endif /* _ASM_X86_QSPINLOCK_H */ diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h index 569abcd..83bfb87 100644 --- a/include/asm-generic/qspinlock.h +++ b/include/asm-generic/qspinlock.h @@ -111,6 +111,13 @@ static inline void queued_spin_unlock_wait(struct qspinlock *lock) cpu_relax(); } +#ifndef virt_queued_spin_lock +static __always_inline bool virt_queued_spin_lock(struct qspinlock *lock) +{ + return false; +} +#endif + /* * Initializier */ diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 0338721..fd31a47 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -249,6 +249,9 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val) BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS)); + if (virt_queued_spin_lock(lock)) + return; + /* * wait for in-progress pending->locked hand-overs * -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
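On a hypervisor the lock thus degrades to an unqueued test-and-set: cmpxchg the whole word from 0 to _Q_LOCKED_VAL until it succeeds, so a preempted vCPU can only delay the lock itself, never a chain of queued waiters behind it. A C11 user-space equivalent of that loop (a sketch of the fallback behaviour, not the kernel code; the hypervisor check itself is omitted):

#include <stdatomic.h>
#include <stdio.h>

#define _Q_LOCKED_VAL 1

static atomic_uint lock_val; /* stand-in for the qspinlock word */

static void tas_lock(void)
{
	unsigned int old = 0;

	/* atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0 -> retry */
	while (!atomic_compare_exchange_weak(&lock_val, &old, _Q_LOCKED_VAL))
		old = 0; /* cpu_relax() would sit here in the kernel */
}

static void tas_unlock(void)
{
	atomic_store_explicit(&lock_val, 0, memory_order_release);
}

int main(void)
{
	tas_lock();
	puts("critical section");
	tas_unlock();
	return 0;
}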
[tip:locking/core] locking/qspinlock: Add pending bit
Commit-ID: c1fb159db9f2e50e0f4025bed92a67a6a7bfa7b7 Gitweb: http://git.kernel.org/tip/c1fb159db9f2e50e0f4025bed92a67a6a7bfa7b7 Author: Peter Zijlstra (Intel) AuthorDate: Fri, 24 Apr 2015 14:56:32 -0400 Committer: Ingo Molnar CommitDate: Fri, 8 May 2015 12:36:32 +0200 locking/qspinlock: Add pending bit Because the qspinlock needs to touch a second cacheline (the per-cpu mcs_nodes[]); add a pending bit and allow a single in-word spinner before we punt to the second cacheline. It is possible so observe the pending bit without the locked bit when the last owner has just released but the pending owner has not yet taken ownership. In this case we would normally queue -- because the pending bit is already taken. However, in this case the pending bit is guaranteed to be released 'soon', therefore wait for it and avoid queueing. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Waiman Long Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Boris Ostrovsky Cc: Borislav Petkov Cc: Daniel J Blueman Cc: David Vrabel Cc: Douglas Hatch Cc: H. Peter Anvin Cc: Konrad Rzeszutek Wilk Cc: Linus Torvalds Cc: Oleg Nesterov Cc: Paolo Bonzini Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Raghavendra K T Cc: Rik van Riel Cc: Scott J Norton Cc: Thomas Gleixner Cc: virtualizat...@lists.linux-foundation.org Cc: xen-de...@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-4-git-send-email-waiman.l...@hp.com Signed-off-by: Ingo Molnar --- include/asm-generic/qspinlock_types.h | 12 +++- kernel/locking/qspinlock.c| 119 -- 2 files changed, 107 insertions(+), 24 deletions(-) diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h index aec05c7..7ee6632 100644 --- a/include/asm-generic/qspinlock_types.h +++ b/include/asm-generic/qspinlock_types.h @@ -36,8 +36,9 @@ typedef struct qspinlock { * Bitfields in the atomic value: * * 0- 7: locked byte - * 8- 9: tail index - * 10-31: tail cpu (+1) + * 8: pending + * 9-10: tail index + * 11-31: tail cpu (+1) */ #define_Q_SET_MASK(type) (((1U << _Q_ ## type ## _BITS) - 1)\ << _Q_ ## type ## _OFFSET) @@ -45,7 +46,11 @@ typedef struct qspinlock { #define _Q_LOCKED_BITS 8 #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED) -#define _Q_TAIL_IDX_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS) +#define _Q_PENDING_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS) +#define _Q_PENDING_BITS1 +#define _Q_PENDING_MASK_Q_SET_MASK(PENDING) + +#define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS) #define _Q_TAIL_IDX_BITS 2 #define _Q_TAIL_IDX_MASK _Q_SET_MASK(TAIL_IDX) @@ -54,5 +59,6 @@ typedef struct qspinlock { #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU) #define _Q_LOCKED_VAL (1U << _Q_LOCKED_OFFSET) +#define _Q_PENDING_VAL (1U << _Q_PENDING_OFFSET) #endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */ diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 029b51c..af9c2ef 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -94,24 +94,28 @@ static inline struct mcs_spinlock *decode_tail(u32 tail) return per_cpu_ptr(_nodes[idx], cpu); } +#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) + /** * queued_spin_lock_slowpath - acquire the queued spinlock * @lock: Pointer to queued spinlock structure * @val: Current value of the queued spinlock 32-bit word * - * (queue tail, lock value) - * - * fast :slow : unlock - *: : - * uncontended (0,0) --:--> (0,1) :--> (*,0) - *: | ^./ : - *: v \ | : - * uncontended:(n,x) --+--> (n,0) | : - * queue: | ^--' | : - *: v | : - * contended :(*,x) --+--> (*,0) -> 
(*,1) ---' : - * queue: ^--' : + * (queue tail, pending bit, lock value) * + * fast :slow :unlock + * : : + * uncontended (0,0,0) -:--> (0,0,1) --:--> (*,*,0) + * : | ^.--. / : + * : v \ \| : + * pending :(0,1,1) +--> (0,1,0) \ | : + * : | ^--' | | : + * : v |
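The lock word now encodes (tail, pending, locked), and the diagram above (partly truncated in this copy) walks its transitions. A single-threaded user-space walk over the same states, using the bit positions from the patch, makes them concrete (no atomics, so this shows the encoding rather than the race handling; _Q_TAIL_OFFSET is shorthand for the tail starting at bit 9):

#include <stdint.h>
#include <stdio.h>

/* Layout from the patch: 0-7 locked, 8 pending, 9-10 tail idx, 11-31 cpu. */
#define _Q_LOCKED_MASK	0xffU
#define _Q_LOCKED_VAL	(1U << 0)
#define _Q_PENDING_VAL	(1U << 8)
#define _Q_TAIL_OFFSET	9

static void show(const char *what, uint32_t val)
{
	printf("%-18s (tail=%u, pending=%u, locked=%u)\n", what,
	       val >> _Q_TAIL_OFFSET, (val >> 8) & 1u, val & _Q_LOCKED_MASK);
}

int main(void)
{
	uint32_t val = 0;

	show("uncontended", val);
	val |= _Q_LOCKED_VAL;		/* (0,0,0) -> (0,0,1): fast path      */
	show("owner holds lock", val);
	val |= _Q_PENDING_VAL;		/* (0,0,1) -> (0,1,1): 1st contender  */
	show("in-word spinner", val);
	val &= ~_Q_LOCKED_MASK;		/* owner unlocks: (0,1,1) -> (0,1,0)  */
	show("handover window", val);	/* pending set, lock free: spin, don't queue */
	val = _Q_LOCKED_VAL;		/* spinner takes it: (0,1,0) -> (0,0,1) */
	show("new owner", val);
	return 0;
}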
[tip:locking/core] locking/pvqspinlock, x86: Implement the paravirt qspinlock call patching
Commit-ID: f233f7f1581e78fd9b4023f2e7d8c1ed89020cc9 Gitweb: http://git.kernel.org/tip/f233f7f1581e78fd9b4023f2e7d8c1ed89020cc9 Author: Peter Zijlstra (Intel) pet...@infradead.org AuthorDate: Fri, 24 Apr 2015 14:56:38 -0400 Committer: Ingo Molnar mi...@kernel.org CommitDate: Fri, 8 May 2015 12:37:09 +0200 locking/pvqspinlock, x86: Implement the paravirt qspinlock call patching We use the regular paravirt call patching to switch between: native_queued_spin_lock_slowpath()__pv_queued_spin_lock_slowpath() native_queued_spin_unlock() __pv_queued_spin_unlock() We use a callee saved call for the unlock function which reduces the i-cache footprint and allows 'inlining' of SPIN_UNLOCK functions again. We further optimize the unlock path by patching the direct call with a movb $0,%arg1 if we are indeed using the native unlock code. This makes the unlock code almost as fast as the !PARAVIRT case. This significantly lowers the overhead of having CONFIG_PARAVIRT_SPINLOCKS enabled, even for native code. Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org Cc: Andrew Morton a...@linux-foundation.org Cc: Boris Ostrovsky boris.ostrov...@oracle.com Cc: Borislav Petkov b...@alien8.de Cc: Daniel J Blueman dan...@numascale.com Cc: David Vrabel david.vra...@citrix.com Cc: Douglas Hatch doug.ha...@hp.com Cc: H. Peter Anvin h...@zytor.com Cc: Konrad Rzeszutek Wilk konrad.w...@oracle.com Cc: Linus Torvalds torva...@linux-foundation.org Cc: Oleg Nesterov o...@redhat.com Cc: Paolo Bonzini paolo.bonz...@gmail.com Cc: Paul E. McKenney paul...@linux.vnet.ibm.com Cc: Peter Zijlstra pet...@infradead.org Cc: Raghavendra K T raghavendra...@linux.vnet.ibm.com Cc: Rik van Riel r...@redhat.com Cc: Scott J Norton scott.nor...@hp.com Cc: Thomas Gleixner t...@linutronix.de Cc: virtualizat...@lists.linux-foundation.org Cc: xen-de...@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-10-git-send-email-waiman.l...@hp.com Signed-off-by: Ingo Molnar mi...@kernel.org --- arch/x86/Kconfig | 2 +- arch/x86/include/asm/paravirt.h | 29 - arch/x86/include/asm/paravirt_types.h | 10 ++ arch/x86/include/asm/qspinlock.h | 25 - arch/x86/include/asm/qspinlock_paravirt.h | 6 ++ arch/x86/kernel/paravirt-spinlocks.c | 24 +++- arch/x86/kernel/paravirt_patch_32.c | 22 ++ arch/x86/kernel/paravirt_patch_64.c | 22 ++ 8 files changed, 128 insertions(+), 12 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 90b1b54..50ec043 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -667,7 +667,7 @@ config PARAVIRT_DEBUG config PARAVIRT_SPINLOCKS bool Paravirtualization layer for spinlocks depends on PARAVIRT SMP - select UNINLINE_SPIN_UNLOCK + select UNINLINE_SPIN_UNLOCK if !QUEUED_SPINLOCK ---help--- Paravirtualized spinlocks allow a pvops backend to replace the spinlock implementation with something virtualization-friendly diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 8957810..266c353 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -712,6 +712,31 @@ static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx, #if defined(CONFIG_SMP) defined(CONFIG_PARAVIRT_SPINLOCKS) +#ifdef CONFIG_QUEUED_SPINLOCK + +static __always_inline void pv_queued_spin_lock_slowpath(struct qspinlock *lock, + u32 val) +{ + PVOP_VCALL2(pv_lock_ops.queued_spin_lock_slowpath, lock, val); +} + +static __always_inline void pv_queued_spin_unlock(struct 
qspinlock *lock) +{ + PVOP_VCALLEE1(pv_lock_ops.queued_spin_unlock, lock); +} + +static __always_inline void pv_wait(u8 *ptr, u8 val) +{ + PVOP_VCALL2(pv_lock_ops.wait, ptr, val); +} + +static __always_inline void pv_kick(int cpu) +{ + PVOP_VCALL1(pv_lock_ops.kick, cpu); +} + +#else /* !CONFIG_QUEUED_SPINLOCK */ + static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock, __ticket_t ticket) { @@ -724,7 +749,9 @@ static __always_inline void __ticket_unlock_kick(struct arch_spinlock *lock, PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket); } -#endif +#endif /* CONFIG_QUEUED_SPINLOCK */ + +#endif /* SMP PARAVIRT_SPINLOCKS */ #ifdef CONFIG_X86_32 #define PV_SAVE_REGS pushl %ecx; pushl %edx; diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index f7b0b5c..76cd684 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -333,9 +333,19 @@ struct arch_spinlock; typedef u16 __ticket_t;
[tip:locking/core] locking/qspinlock: Optimize for smaller NR_CPUS
Commit-ID: 69f9cae90907e09af95fb991ed384670cef8dd32 Gitweb: http://git.kernel.org/tip/69f9cae90907e09af95fb991ed384670cef8dd32 Author: Peter Zijlstra (Intel) pet...@infradead.org AuthorDate: Fri, 24 Apr 2015 14:56:34 -0400 Committer: Ingo Molnar mi...@kernel.org CommitDate: Fri, 8 May 2015 12:36:48 +0200 locking/qspinlock: Optimize for smaller NR_CPUS When we allow for a max NR_CPUS 2^14 we can optimize the pending wait-acquire and the xchg_tail() operations. By growing the pending bit to a byte, we reduce the tail to 16bit. This means we can use xchg16 for the tail part and do away with all the repeated compxchg() operations. This in turn allows us to unconditionally acquire; the locked state as observed by the wait loops cannot change. And because both locked and pending are now a full byte we can use simple stores for the state transition, obviating one atomic operation entirely. This optimization is needed to make the qspinlock achieve performance parity with ticket spinlock at light load. All this is horribly broken on Alpha pre EV56 (and any other arch that cannot do single-copy atomic byte stores). Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org Cc: Andrew Morton a...@linux-foundation.org Cc: Boris Ostrovsky boris.ostrov...@oracle.com Cc: Borislav Petkov b...@alien8.de Cc: Daniel J Blueman dan...@numascale.com Cc: David Vrabel david.vra...@citrix.com Cc: Douglas Hatch doug.ha...@hp.com Cc: H. Peter Anvin h...@zytor.com Cc: Konrad Rzeszutek Wilk konrad.w...@oracle.com Cc: Linus Torvalds torva...@linux-foundation.org Cc: Oleg Nesterov o...@redhat.com Cc: Paolo Bonzini paolo.bonz...@gmail.com Cc: Paul E. McKenney paul...@linux.vnet.ibm.com Cc: Peter Zijlstra pet...@infradead.org Cc: Raghavendra K T raghavendra...@linux.vnet.ibm.com Cc: Rik van Riel r...@redhat.com Cc: Scott J Norton scott.nor...@hp.com Cc: Thomas Gleixner t...@linutronix.de Cc: virtualizat...@lists.linux-foundation.org Cc: xen-de...@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-6-git-send-email-waiman.l...@hp.com Signed-off-by: Ingo Molnar mi...@kernel.org --- include/asm-generic/qspinlock_types.h | 13 +++ kernel/locking/qspinlock.c| 69 ++- 2 files changed, 81 insertions(+), 1 deletion(-) diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h index 3a7f671..85f888e 100644 --- a/include/asm-generic/qspinlock_types.h +++ b/include/asm-generic/qspinlock_types.h @@ -35,6 +35,14 @@ typedef struct qspinlock { /* * Bitfields in the atomic value: * + * When NR_CPUS 16K + * 0- 7: locked byte + * 8: pending + * 9-15: not used + * 16-17: tail index + * 18-31: tail cpu (+1) + * + * When NR_CPUS = 16K * 0- 7: locked byte * 8: pending * 9-10: tail index @@ -47,7 +55,11 @@ typedef struct qspinlock { #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED) #define _Q_PENDING_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS) +#if CONFIG_NR_CPUS (1U 14) +#define _Q_PENDING_BITS8 +#else #define _Q_PENDING_BITS1 +#endif #define _Q_PENDING_MASK_Q_SET_MASK(PENDING) #define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS) @@ -58,6 +70,7 @@ typedef struct qspinlock { #define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET) #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU) +#define _Q_TAIL_OFFSET _Q_TAIL_IDX_OFFSET #define _Q_TAIL_MASK (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK) #define _Q_LOCKED_VAL (1U _Q_LOCKED_OFFSET) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 
82bb4a9..e17efe7 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -24,6 +24,7 @@ #include linux/percpu.h #include linux/hardirq.h #include linux/mutex.h +#include asm/byteorder.h #include asm/qspinlock.h /* @@ -56,6 +57,10 @@ * node; whereby avoiding the need to carry a node from lock to unlock, and * preserving existing lock API. This also makes the unlock code simpler and * faster. + * + * N.B. The current implementation only supports architectures that allow + * atomic operations on smaller 8-bit and 16-bit data types. + * */ #include mcs_spinlock.h @@ -96,6 +101,62 @@ static inline struct mcs_spinlock *decode_tail(u32 tail) #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) +/* + * By using the whole 2nd least significant byte for the pending bit, we + * can allow better optimization of the lock acquisition for the pending + * bit holder. + */ +#if _Q_PENDING_BITS == 8 + +struct __qspinlock { + union { + atomic_t val; + struct { +#ifdef __LITTLE_ENDIAN + u16 locked_pending; + u16 tail; +#else + u16 tail; +
[tip:locking/core] locking/qspinlock: Revert to test-and-set on hypervisors
Commit-ID: 2aa79af64263190eec610422b07f60e99a7d230a Gitweb: http://git.kernel.org/tip/2aa79af64263190eec610422b07f60e99a7d230a Author: Peter Zijlstra (Intel) pet...@infradead.org AuthorDate: Fri, 24 Apr 2015 14:56:36 -0400 Committer: Ingo Molnar mi...@kernel.org CommitDate: Fri, 8 May 2015 12:36:58 +0200 locking/qspinlock: Revert to test-and-set on hypervisors When we detect a hypervisor (!paravirt, see qspinlock paravirt support patches), revert to a simple test-and-set lock to avoid the horrors of queue preemption. Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org Cc: Andrew Morton a...@linux-foundation.org Cc: Boris Ostrovsky boris.ostrov...@oracle.com Cc: Borislav Petkov b...@alien8.de Cc: Daniel J Blueman dan...@numascale.com Cc: David Vrabel david.vra...@citrix.com Cc: Douglas Hatch doug.ha...@hp.com Cc: H. Peter Anvin h...@zytor.com Cc: Konrad Rzeszutek Wilk konrad.w...@oracle.com Cc: Linus Torvalds torva...@linux-foundation.org Cc: Oleg Nesterov o...@redhat.com Cc: Paolo Bonzini paolo.bonz...@gmail.com Cc: Paul E. McKenney paul...@linux.vnet.ibm.com Cc: Peter Zijlstra pet...@infradead.org Cc: Raghavendra K T raghavendra...@linux.vnet.ibm.com Cc: Rik van Riel r...@redhat.com Cc: Scott J Norton scott.nor...@hp.com Cc: Thomas Gleixner t...@linutronix.de Cc: virtualizat...@lists.linux-foundation.org Cc: xen-de...@lists.xenproject.org Link: http://lkml.kernel.org/r/1429901803-29771-8-git-send-email-waiman.l...@hp.com Signed-off-by: Ingo Molnar mi...@kernel.org --- arch/x86/include/asm/qspinlock.h | 14 ++ include/asm-generic/qspinlock.h | 7 +++ kernel/locking/qspinlock.c | 3 +++ 3 files changed, 24 insertions(+) diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h index e2aee82..f079b70 100644 --- a/arch/x86/include/asm/qspinlock.h +++ b/arch/x86/include/asm/qspinlock.h @@ -1,6 +1,7 @@ #ifndef _ASM_X86_QSPINLOCK_H #define _ASM_X86_QSPINLOCK_H +#include asm/cpufeature.h #include asm-generic/qspinlock_types.h #definequeued_spin_unlock queued_spin_unlock @@ -15,6 +16,19 @@ static inline void queued_spin_unlock(struct qspinlock *lock) smp_store_release((u8 *)lock, 0); } +#define virt_queued_spin_lock virt_queued_spin_lock + +static inline bool virt_queued_spin_lock(struct qspinlock *lock) +{ + if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) + return false; + + while (atomic_cmpxchg(lock-val, 0, _Q_LOCKED_VAL) != 0) + cpu_relax(); + + return true; +} + #include asm-generic/qspinlock.h #endif /* _ASM_X86_QSPINLOCK_H */ diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h index 569abcd..83bfb87 100644 --- a/include/asm-generic/qspinlock.h +++ b/include/asm-generic/qspinlock.h @@ -111,6 +111,13 @@ static inline void queued_spin_unlock_wait(struct qspinlock *lock) cpu_relax(); } +#ifndef virt_queued_spin_lock +static __always_inline bool virt_queued_spin_lock(struct qspinlock *lock) +{ + return false; +} +#endif + /* * Initializier */ diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 0338721..fd31a47 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -249,6 +249,9 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val) BUILD_BUG_ON(CONFIG_NR_CPUS = (1U _Q_TAIL_CPU_BITS)); + if (virt_queued_spin_lock(lock)) + return; + /* * wait for in-progress pending-locked hand-overs * -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to 
majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:locking/core] locking/qspinlock: Add pending bit
Commit-ID:  c1fb159db9f2e50e0f4025bed92a67a6a7bfa7b7
Gitweb:     http://git.kernel.org/tip/c1fb159db9f2e50e0f4025bed92a67a6a7bfa7b7
Author:     Peter Zijlstra (Intel) pet...@infradead.org
AuthorDate: Fri, 24 Apr 2015 14:56:32 -0400
Committer:  Ingo Molnar mi...@kernel.org
CommitDate: Fri, 8 May 2015 12:36:32 +0200

locking/qspinlock: Add pending bit

Because the qspinlock needs to touch a second cacheline (the per-cpu
mcs_nodes[]), add a pending bit and allow a single in-word spinner
before we punt to the second cacheline.

It is possible to observe the pending bit without the locked bit when
the last owner has just released but the pending owner has not yet taken
ownership. In this case we would normally queue -- because the pending
bit is already taken. However, in this case the pending bit is
guaranteed to be released 'soon', therefore wait for it and avoid
queueing.

Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Boris Ostrovsky boris.ostrov...@oracle.com
Cc: Borislav Petkov b...@alien8.de
Cc: Daniel J Blueman dan...@numascale.com
Cc: David Vrabel david.vra...@citrix.com
Cc: Douglas Hatch doug.ha...@hp.com
Cc: H. Peter Anvin h...@zytor.com
Cc: Konrad Rzeszutek Wilk konrad.w...@oracle.com
Cc: Linus Torvalds torva...@linux-foundation.org
Cc: Oleg Nesterov o...@redhat.com
Cc: Paolo Bonzini paolo.bonz...@gmail.com
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Raghavendra K T raghavendra...@linux.vnet.ibm.com
Cc: Rik van Riel r...@redhat.com
Cc: Scott J Norton scott.nor...@hp.com
Cc: Thomas Gleixner t...@linutronix.de
Cc: virtualizat...@lists.linux-foundation.org
Cc: xen-de...@lists.xenproject.org
Link: http://lkml.kernel.org/r/1429901803-29771-4-git-send-email-waiman.l...@hp.com
Signed-off-by: Ingo Molnar mi...@kernel.org
---
 include/asm-generic/qspinlock_types.h |  12 +++-
 kernel/locking/qspinlock.c            | 119 --
 2 files changed, 107 insertions(+), 24 deletions(-)

diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
index aec05c7..7ee6632 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -36,8 +36,9 @@ typedef struct qspinlock {
  * Bitfields in the atomic value:
  *
  *  0- 7: locked byte
- *  8- 9: tail index
- * 10-31: tail cpu (+1)
+ *     8: pending
+ *  9-10: tail index
+ * 11-31: tail cpu (+1)
  */
 #define _Q_SET_MASK(type)      (((1U << _Q_ ## type ## _BITS) - 1)\
                                       << _Q_ ## type ## _OFFSET)
@@ -45,7 +46,11 @@ typedef struct qspinlock {
 #define _Q_LOCKED_BITS         8
 #define _Q_LOCKED_MASK         _Q_SET_MASK(LOCKED)
 
-#define _Q_TAIL_IDX_OFFSET     (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#define _Q_PENDING_OFFSET      (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#define _Q_PENDING_BITS        1
+#define _Q_PENDING_MASK        _Q_SET_MASK(PENDING)
+
+#define _Q_TAIL_IDX_OFFSET     (_Q_PENDING_OFFSET + _Q_PENDING_BITS)
 #define _Q_TAIL_IDX_BITS       2
 #define _Q_TAIL_IDX_MASK       _Q_SET_MASK(TAIL_IDX)
 
@@ -54,5 +59,6 @@ typedef struct qspinlock {
 #define _Q_TAIL_CPU_MASK       _Q_SET_MASK(TAIL_CPU)
 
 #define _Q_LOCKED_VAL          (1U << _Q_LOCKED_OFFSET)
+#define _Q_PENDING_VAL         (1U << _Q_PENDING_OFFSET)
 
 #endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 029b51c..af9c2ef 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -94,24 +94,28 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
 	return per_cpu_ptr(&mcs_nodes[idx], cpu);
 }
 
+#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
+
 /**
  * queued_spin_lock_slowpath - acquire the queued spinlock
  * @lock: Pointer to queued spinlock structure
  * @val: Current value of the queued spinlock 32-bit word
  *
- * (queue tail, lock value)
- *
- *              fast      :    slow                                  :    unlock
- *                        :                                          :
- * uncontended  (0,0)   --:--> (0,1) --------------------------------:--> (*,0)
- *                        :       | ^--------.                    /  :
- *                        :       v           \                   |  :
- * uncontended            :    (n,x) --+--> (n,0)                 |  :
- *   queue                :       | ^--'                          |  :
- *                        :       v                               |  :
- * contended              :    (*,x) --+--> (*,0) -----> (*,1) ---'  :
- *   queue                :         ^--'                             :
+ * (queue tail, pending bit, lock value)
  *
+ *              fast     :    slow                                  :    unlock
+ *                       :                                          :
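To make the new layout concrete, here is a small standalone C sketch. It is userspace-only illustration, not kernel code: the _Q_SET_MASK helper below is simplified to take explicit offset/bits arguments instead of the kernel's token-pasting macro. It decodes the transient "pending set, locked clear" state the changelog describes.

#include <stdio.h>
#include <stdint.h>

/* Simplified mask builder; the kernel derives these via token pasting. */
#define Q_SET_MASK(off, bits)   (((1U << (bits)) - 1) << (off))

#define Q_LOCKED_OFFSET         0
#define Q_LOCKED_BITS           8
#define Q_PENDING_OFFSET        (Q_LOCKED_OFFSET + Q_LOCKED_BITS)   /*  8 */
#define Q_PENDING_BITS          1
#define Q_TAIL_IDX_OFFSET       (Q_PENDING_OFFSET + Q_PENDING_BITS) /*  9 */
#define Q_TAIL_IDX_BITS         2
#define Q_TAIL_CPU_OFFSET       (Q_TAIL_IDX_OFFSET + Q_TAIL_IDX_BITS) /* 11 */

#define Q_LOCKED_MASK           Q_SET_MASK(Q_LOCKED_OFFSET, Q_LOCKED_BITS)
#define Q_PENDING_MASK          Q_SET_MASK(Q_PENDING_OFFSET, Q_PENDING_BITS)

int main(void)
{
	/* Pending bit set, locked byte clear: the previous owner released
	 * the lock but the pending CPU has not yet taken it over. */
	uint32_t val = Q_PENDING_MASK;

	printf("locked=%u pending=%u tail_cpu=%u\n",
	       (unsigned)(val & Q_LOCKED_MASK),
	       (unsigned)((val & Q_PENDING_MASK) >> Q_PENDING_OFFSET),
	       (unsigned)(val >> Q_TAIL_CPU_OFFSET));

	/* A spinner that observes this state waits for the pending owner
	 * to take the locked byte instead of queueing, since the handover
	 * is guaranteed to happen 'soon'. */
	return 0;
}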
[tip:perf/core] perf: Fix move_group() order
Commit-ID:  8f95b435b62522aed3381aaea920de8d09ccabf3
Gitweb:     http://git.kernel.org/tip/8f95b435b62522aed3381aaea920de8d09ccabf3
Author:     Peter Zijlstra (Intel)
AuthorDate: Tue, 27 Jan 2015 11:53:12 +0100
Committer:  Ingo Molnar
CommitDate: Wed, 4 Feb 2015 08:07:11 +0100

perf: Fix move_group() order

Jiri reported triggering the new WARN_ON_ONCE in event_sched_out over
the weekend:

  event_sched_out.isra.79+0x2b9/0x2d0
  group_sched_out+0x69/0xc0
  ctx_sched_out+0x106/0x130
  task_ctx_sched_out+0x37/0x70
  __perf_install_in_context+0x70/0x1a0
  remote_function+0x48/0x60
  generic_exec_single+0x15b/0x1d0
  smp_call_function_single+0x67/0xa0
  task_function_call+0x53/0x80
  perf_install_in_context+0x8b/0x110

I think the below should cure this; if we install a group leader it
will iterate the (still intact) group list and find its siblings and
try and install those too -- even though those still have the old
event->ctx -- in the new ctx. Upon installing the first group sibling
we'd try and schedule out the group and trigger the above warn.

Fix this by installing the group leader last; installing siblings first
would have no effect, they're not reachable through the group lists and
therefore we don't schedule them.

Also delay resetting the state until we're absolutely sure the events
are quiescent. The two-pass idea is easiest to see in isolation; see
the sketch after this patch.

Reported-by: Jiri Olsa
Reported-by: vincent.wea...@maine.edu
Signed-off-by: Peter Zijlstra (Intel)
Cc: Arnaldo Carvalho de Melo
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/20150126162639.ga21...@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar
---
 kernel/events/core.c | 56 +++-
 1 file changed, 47 insertions(+), 9 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 417a96b..142dbabc 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7645,16 +7645,9 @@ SYSCALL_DEFINE5(perf_event_open,
 
 		perf_remove_from_context(group_leader, false);
 
-		/*
-		 * Removing from the context ends up with disabled
-		 * event. What we want here is event in the initial
-		 * startup state, ready to be add into new context.
-		 */
-		perf_event__state_init(group_leader);
 		list_for_each_entry(sibling, &group_leader->sibling_list, group_entry) {
 			perf_remove_from_context(sibling, false);
-			perf_event__state_init(sibling);
 			put_ctx(gctx);
 		}
 	} else {
@@ -7670,13 +7663,31 @@ SYSCALL_DEFINE5(perf_event_open,
 		 */
 		synchronize_rcu();
 
-		perf_install_in_context(ctx, group_leader, group_leader->cpu);
-		get_ctx(ctx);
+		/*
+		 * Install the group siblings before the group leader.
+		 *
+		 * Because a group leader will try and install the entire group
+		 * (through the sibling list, which is still in-tact), we can
+		 * end up with siblings installed in the wrong context.
+		 *
+		 * By installing siblings first we NO-OP because they're not
+		 * reachable through the group lists.
+		 */
 		list_for_each_entry(sibling, &group_leader->sibling_list, group_entry) {
+			perf_event__state_init(sibling);
 			perf_install_in_context(ctx, sibling, sibling->cpu);
 			get_ctx(ctx);
 		}
+
+		/*
+		 * Removing from the context ends up with disabled
+		 * event. What we want here is event in the initial
+		 * startup state, ready to be add into new context.
+		 */
+		perf_event__state_init(group_leader);
+		perf_install_in_context(ctx, group_leader, group_leader->cpu);
+		get_ctx(ctx);
 	}
 
 	perf_install_in_context(ctx, event, event->cpu);
@@ -7806,8 +7817,35 @@ void perf_pmu_migrate_context(struct pmu *pmu, int src_cpu, int dst_cpu)
 		list_add(&event->migrate_entry, &events);
 	}
 
+	/*
+	 * Wait for the events to quiesce before re-instating them.
+	 */
 	synchronize_rcu();
 
+	/*
+	 * Re-instate events in 2 passes.
+	 *
+	 * Skip over group leaders and only install siblings on this first
+	 * pass, siblings will not get enabled without a leader, however a
+	 * leader will enable its siblings, even if those are still on the old
+	 * context.
+	 */
+	list_for_each_entry_safe(event, tmp, &events, migrate_entry) {
+		if (event->group_leader == event)
+			continue;
+
+		list_del(&event->migrate_entry);
+		if (event->state >= PERF_EVENT_STATE_OFF)
+
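The two-pass pattern above is easiest to see stripped of the perf internals. Below is a hypothetical, self-contained C sketch with invented names: dependents are re-added first, and only then the leaders, so a leader walking its still-intact sibling list finds every sibling already in place rather than dragging stale entries from the old context.

#include <stdio.h>

struct event {
	const char *name;
	struct event *group_leader;	/* points to itself for a leader */
};

static void install(struct event *e)
{
	printf("install %s%s\n", e->name,
	       e->group_leader == e ? " (leader)" : "");
}

int main(void)
{
	struct event leader = { "cycles", &leader };
	struct event sib1   = { "instructions", &leader };
	struct event sib2   = { "branches", &leader };
	struct event *all[] = { &leader, &sib1, &sib2 };
	const int n = (int)(sizeof(all) / sizeof(all[0]));
	int i;

	/* Pass 1: siblings only. Nothing gets scheduled yet, because a
	 * sibling without its leader is a NO-OP. */
	for (i = 0; i < n; i++)
		if (all[i]->group_leader != all[i])
			install(all[i]);

	/* Pass 2: leaders. Installing a leader enables the whole group;
	 * by now every sibling is already in the new container. */
	for (i = 0; i < n; i++)
		if (all[i]->group_leader == all[i])
			install(all[i]);

	return 0;
}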
[tip:perf/core] perf: Avoid horrible stack usage
Commit-ID:  86038c5ea81b519a8a1fcfcd5e4599aab0cdd119
Gitweb:     http://git.kernel.org/tip/86038c5ea81b519a8a1fcfcd5e4599aab0cdd119
Author:     Peter Zijlstra (Intel)
AuthorDate: Tue, 16 Dec 2014 12:47:34 +0100
Committer:  Ingo Molnar
CommitDate: Wed, 14 Jan 2015 15:11:45 +0100

perf: Avoid horrible stack usage

Both Linus (most recent) and Steve (a while ago) reported that
perf-related callbacks have massive stack bloat.

The problem is that software events need a pt_regs in order to properly
report the event location and unwind stack. And because we could not
assume one was present we allocated one on stack and filled it with
minimal bits required for operation.

Now, pt_regs is quite large, so this is undesirable. Furthermore it
turns out that most sites actually have a pt_regs pointer available,
making this even more onerous, as the stack space is pointless waste.

This patch addresses the problem by observing that software events have
well-defined nesting semantics, therefore we can use static per-cpu
storage instead of on-stack.

Linus made the further observation that all but the scheduler callers
of perf_sw_event() have a pt_regs available, so we change the regular
perf_sw_event() to require a valid pt_regs (where it used to be
optional) and add perf_sw_event_sched() for the scheduler.

We have a scheduler-specific call instead of a more generic _noregs()
like construct because we can assume non-recursion from the scheduler
and thereby simplify the code further (_noregs would have to put the
recursion context call inline in order to ascertain which __perf_regs
element to use).

One last note on the implementation of perf_trace_buf_prepare(); we
allow .regs = NULL for those cases where we already have a pt_regs
pointer available and do not need another.

Reported-by: Linus Torvalds
Reported-by: Steven Rostedt
Signed-off-by: Peter Zijlstra (Intel)
Cc: Arnaldo Carvalho de Melo
Cc: Javi Merino
Cc: Linus Torvalds
Cc: Mathieu Desnoyers
Cc: Oleg Nesterov
Cc: Paul Mackerras
Cc: Petr Mladek
Cc: Steven Rostedt
Cc: Tom Zanussi
Cc: Vaibhav Nagarnaik
Link: http://lkml.kernel.org/r/20141216115041.gw3...@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar
---
 include/linux/ftrace_event.h    |  2 +-
 include/linux/perf_event.h      | 28 +---
 include/trace/ftrace.h          |  7 ---
 kernel/events/core.c            | 23 +--
 kernel/sched/core.c             |  2 +-
 kernel/trace/trace_event_perf.c |  4 +++-
 kernel/trace/trace_kprobe.c     |  4 ++--
 kernel/trace/trace_syscalls.c   |  4 ++--
 kernel/trace/trace_uprobe.c     |  2 +-
 9 files changed, 52 insertions(+), 24 deletions(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index 0bebb5c..d36f68b 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -595,7 +595,7 @@ extern int ftrace_profile_set_filter(struct perf_event *event, int event_id,
 				     char *filter_str);
 extern void ftrace_profile_free_filter(struct perf_event *event);
 extern void *perf_trace_buf_prepare(int size, unsigned short type,
-				    struct pt_regs *regs, int *rctxp);
+				    struct pt_regs **regs, int *rctxp);
 
 static inline void
 perf_trace_buf_submit(void *raw_data, int size, int rctx, u64 addr,
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 4f7a61c..3a7bd80 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -665,6 +665,7 @@ static inline int is_software_event(struct perf_event *event)
 
 extern struct static_key perf_swevent_enabled[PERF_COUNT_SW_MAX];
 
+extern void ___perf_sw_event(u32, u64, struct pt_regs *, u64);
 extern void __perf_sw_event(u32, u64, struct pt_regs *, u64);
 
 #ifndef perf_arch_fetch_caller_regs
@@ -689,14 +690,25 @@ static inline void perf_fetch_caller_regs(struct pt_regs *regs)
 static __always_inline void
 perf_sw_event(u32 event_id, u64 nr, struct pt_regs *regs, u64 addr)
 {
-	struct pt_regs hot_regs;
+	if (static_key_false(&perf_swevent_enabled[event_id]))
+		__perf_sw_event(event_id, nr, regs, addr);
+}
+
+DECLARE_PER_CPU(struct pt_regs, __perf_regs[4]);
+/*
+ * 'Special' version for the scheduler, it hard assumes no recursion,
+ * which is guaranteed by us not actually scheduling inside other swevents
+ * because those disable preemption.
+ */
+static __always_inline void
+perf_sw_event_sched(u32 event_id, u64 nr, u64 addr)
+{
 	if (static_key_false(&perf_swevent_enabled[event_id])) {
-		if (!regs) {
-			perf_fetch_caller_regs(&hot_regs);
-			regs = &hot_regs;
-		}
-		__perf_sw_event(event_id, nr, regs, addr);
+		struct pt_regs *regs = this_cpu_ptr(&__perf_regs[0]);
+
+		perf_fetch_caller_regs(regs);
+		___perf_sw_event(event_id, nr, regs, addr);
 	}
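The underlying trick generalizes: when nesting depth is bounded, a large scratch object can live in static per-context storage instead of on the stack. The following is a userspace sketch under that assumption; the 4 slots stand in for the task/softirq/hardirq/NMI recursion levels perf assumes, and 'big_regs' is an invented stand-in for struct pt_regs, not the kernel's implementation.

#include <stdio.h>

struct big_regs { unsigned long r[32]; };	/* too big to put on stack */

static struct big_regs slot[4];			/* one per nesting level */

/* Return the scratch area for the given recursion context; no stack
 * allocation happens, each nesting level owns a distinct static slot. */
static struct big_regs *get_scratch(int rctx)
{
	return &slot[rctx];
}

static void report_event(int rctx)
{
	struct big_regs *regs = get_scratch(rctx);

	regs->r[0] = 42;	/* fill only the bits the consumer needs */
	printf("ctx %d uses regs at %p\n", rctx, (void *)regs);
}

int main(void)
{
	report_event(0);	/* task context */
	report_event(2);	/* nested, e.g. hardirq: a distinct slot,
				 * so it cannot clobber the outer user */
	return 0;
}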
[tip:perf/urgent] perf/x86: Fix embarrasing typo
Commit-ID:  ce5686d4ed12158599d2042a6c8659254ed263ce
Gitweb:     http://git.kernel.org/tip/ce5686d4ed12158599d2042a6c8659254ed263ce
Author:     Peter Zijlstra (Intel)
AuthorDate: Wed, 29 Oct 2014 11:17:04 +0100
Committer:  Ingo Molnar
CommitDate: Tue, 4 Nov 2014 07:06:58 +0100

perf/x86: Fix embarrasing typo

Because we're all human and typing sucks..

Fixes: 7fb0f1de49fc ("perf/x86: Fix compile warnings for intel_uncore")
Reported-by: Andi Kleen
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: x...@kernel.org
Link: http://lkml.kernel.org/n/tip-be0bftjh8yfm4uvmvtf3y...@git.kernel.org
Signed-off-by: Ingo Molnar
---
 arch/x86/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ded8a67..41a503c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -144,7 +144,7 @@ config INSTRUCTION_DECODER
 
 config PERF_EVENTS_INTEL_UNCORE
 	def_bool y
-	depends on PERF_EVENTS && SUP_SUP_INTEL && PCI
+	depends on PERF_EVENTS && CPU_SUP_INTEL && PCI
 
 config OUTPUT_FORMAT
 	string
[tip:perf/urgent] perf: Fix bogus kernel printk
Commit-ID:  65d71fe1375b973083733294795bf2b09d45b3c2
Gitweb:     http://git.kernel.org/tip/65d71fe1375b973083733294795bf2b09d45b3c2
Author:     Peter Zijlstra (Intel)
AuthorDate: Tue, 7 Oct 2014 19:07:33 +0200
Committer:  Ingo Molnar
CommitDate: Tue, 28 Oct 2014 10:51:01 +0100

perf: Fix bogus kernel printk

Andy spotted the fail in what was intended as a conditional printk level.

Reported-by: Andy Lutomirski
Fixes: cc6cd47e7395 ("perf/x86: Tone down kernel messages when the PMU check fails in a virtual environment")
Signed-off-by: Peter Zijlstra (Intel)
Cc: Arnaldo Carvalho de Melo
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/20141007124757.gh19...@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar
---
 arch/x86/kernel/cpu/perf_event.c | 5 +++-
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 1b8299d..66451a6 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -243,8 +243,9 @@ static bool check_hw_exists(void)
 
 msr_fail:
 	printk(KERN_CONT "Broken PMU hardware detected, using software events only.\n");
-	printk(boot_cpu_has(X86_FEATURE_HYPERVISOR) ? KERN_INFO : KERN_ERR
-	       "Failed to access perfctr msr (MSR %x is %Lx)\n", reg, val_new);
+	printk("%sFailed to access perfctr msr (MSR %x is %Lx)\n",
+	       boot_cpu_has(X86_FEATURE_HYPERVISOR) ? KERN_INFO : KERN_ERR,
+	       reg, val_new);
 
 	return false;
 }
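The root cause here is adjacent string-literal concatenation: KERN_INFO and KERN_ERR are string literals, so in the old code the level prefix glued itself onto the format string in only one branch of the ternary. A standalone C demonstration of the pitfall, with plain "<6>"/"<3>" prefixes standing in for the kernel's level markers:

#include <stdio.h>

#define LEVEL_INFO "<6>"
#define LEVEL_ERR  "<3>"

int main(void)
{
	int hypervisor = 0;

	/* Buggy: literal concatenation binds LEVEL_ERR to the message, so
	 * this parses as (hypervisor ? "<6>" : "<3>" "message") -- when
	 * hypervisor is true, only the bare level prefix is printed and
	 * the message is lost entirely. */
	puts(hypervisor ? LEVEL_INFO : LEVEL_ERR "message");

	/* Fixed: pass the level through a %s argument, as the patch does,
	 * so the message survives in both branches. */
	printf("%s%s\n", hypervisor ? LEVEL_INFO : LEVEL_ERR, "message");
	return 0;
}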