Re: [PATCH v4 RESEND 3/5] perf/x86/lbr: Move cpuc->lbr_xsave allocation out of sleeping region
On 2021/3/24 12:04, Namhyung Kim wrote:
> On Wed, Mar 24, 2021 at 12:47 PM Like Xu wrote:
>> Hi Namhyung,
>>
>> On 2021/3/24 9:32, Namhyung Kim wrote:
>>> Hello,
>>>
>>> On Mon, Mar 22, 2021 at 3:14 PM Like Xu wrote:
>>>> +void reserve_lbr_buffers(struct perf_event *event)
>>>> +{
>>>> +	struct kmem_cache *kmem_cache = x86_get_pmu()->task_ctx_cache;
>>>> +	struct cpu_hw_events *cpuc;
>>>> +	int cpu;
>>>> +
>>>> +	if (!static_cpu_has(X86_FEATURE_ARCH_LBR))
>>>> +		return;
>>>> +
>>>> +	for_each_possible_cpu(cpu) {
>>>> +		cpuc = per_cpu_ptr(&cpu_hw_events, cpu);
>>>> +		if (kmem_cache && !cpuc->lbr_xsave && !event->attr.precise_ip)
>>>> +			cpuc->lbr_xsave = kmem_cache_alloc(kmem_cache, GFP_KERNEL);
>>>> +	}
>>>> +}
>>>
>>> I think we should use kmem_cache_alloc_node().
>>
>> "kmem_cache_alloc_node - Allocate an object on the specified node"
>>
>> The reserve_lbr_buffers() is called in __x86_pmu_event_init().
>> When the LBR perf_event is scheduled to another node, it seems
>> that we will not call init() and allocate again.
>>
>> Do you mean use kmem_cache_alloc_node() for each numa_nodes_parsed ?
>
> I assume cpuc->lbr_xsave will be accessed for that cpu only.  Then
> it needs to allocate it in the node that cpu belongs to.  Something
> like below..
>
>   cpuc->lbr_xsave = kmem_cache_alloc_node(kmem_cache, GFP_KERNEL,
>                                           cpu_to_node(cpu));

Thanks, it helps and I will apply it in the next version.

> Thanks,
> Namhyung
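[Sketch: folding Namhyung's kmem_cache_alloc_node() suggestion into the loop
quoted above would give roughly the following shape. This is an illustration
of the anticipated next version, assembled from the two snippets in this
thread, not the code as merged.

void reserve_lbr_buffers(struct perf_event *event)
{
	struct kmem_cache *kmem_cache = x86_get_pmu()->task_ctx_cache;
	struct cpu_hw_events *cpuc;
	int cpu;

	if (!static_cpu_has(X86_FEATURE_ARCH_LBR))
		return;

	for_each_possible_cpu(cpu) {
		cpuc = per_cpu_ptr(&cpu_hw_events, cpu);
		if (kmem_cache && !cpuc->lbr_xsave && !event->attr.precise_ip)
			/* Allocate on the node the target CPU belongs to. */
			cpuc->lbr_xsave = kmem_cache_alloc_node(kmem_cache,
							GFP_KERNEL,
							cpu_to_node(cpu));
	}
}
]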
Re: [PATCH v4 RESEND 3/5] perf/x86/lbr: Move cpuc->lbr_xsave allocation out of sleeping region
On Wed, Mar 24, 2021 at 12:47 PM Like Xu wrote:
>
> Hi Namhyung,
>
> On 2021/3/24 9:32, Namhyung Kim wrote:
> > Hello,
> >
> > On Mon, Mar 22, 2021 at 3:14 PM Like Xu wrote:
> >> +void reserve_lbr_buffers(struct perf_event *event)
> >> +{
> >> +	struct kmem_cache *kmem_cache = x86_get_pmu()->task_ctx_cache;
> >> +	struct cpu_hw_events *cpuc;
> >> +	int cpu;
> >> +
> >> +	if (!static_cpu_has(X86_FEATURE_ARCH_LBR))
> >> +		return;
> >> +
> >> +	for_each_possible_cpu(cpu) {
> >> +		cpuc = per_cpu_ptr(&cpu_hw_events, cpu);
> >> +		if (kmem_cache && !cpuc->lbr_xsave &&
> >> !event->attr.precise_ip)
> >> +			cpuc->lbr_xsave = kmem_cache_alloc(kmem_cache,
> >> GFP_KERNEL);
> >> +	}
> >> +}
> >
> > I think we should use kmem_cache_alloc_node().
>
> "kmem_cache_alloc_node - Allocate an object on the specified node"
>
> The reserve_lbr_buffers() is called in __x86_pmu_event_init().
> When the LBR perf_event is scheduled to another node, it seems
> that we will not call init() and allocate again.
>
> Do you mean use kmem_cache_alloc_node() for each numa_nodes_parsed ?

I assume cpuc->lbr_xsave will be accessed for that cpu only.  Then
it needs to allocate it in the node that cpu belongs to.  Something
like below..

  cpuc->lbr_xsave = kmem_cache_alloc_node(kmem_cache, GFP_KERNEL,
                                          cpu_to_node(cpu));

Thanks,
Namhyung
Re: [PATCH v4 RESEND 3/5] perf/x86/lbr: Move cpuc->lbr_xsave allocation out of sleeping region
Hi Namhyung,

On 2021/3/24 9:32, Namhyung Kim wrote:
> Hello,
>
> On Mon, Mar 22, 2021 at 3:14 PM Like Xu wrote:
>> +void reserve_lbr_buffers(struct perf_event *event)
>> +{
>> +	struct kmem_cache *kmem_cache = x86_get_pmu()->task_ctx_cache;
>> +	struct cpu_hw_events *cpuc;
>> +	int cpu;
>> +
>> +	if (!static_cpu_has(X86_FEATURE_ARCH_LBR))
>> +		return;
>> +
>> +	for_each_possible_cpu(cpu) {
>> +		cpuc = per_cpu_ptr(&cpu_hw_events, cpu);
>> +		if (kmem_cache && !cpuc->lbr_xsave && !event->attr.precise_ip)
>> +			cpuc->lbr_xsave = kmem_cache_alloc(kmem_cache, GFP_KERNEL);
>> +	}
>> +}
>
> I think we should use kmem_cache_alloc_node().

"kmem_cache_alloc_node - Allocate an object on the specified node"

The reserve_lbr_buffers() is called in __x86_pmu_event_init().
When the LBR perf_event is scheduled to another node, it seems
that we will not call init() and allocate again.

Do you mean use kmem_cache_alloc_node() for each numa_nodes_parsed ?

> Thanks,
> Namhyung
Re: [PATCH v4 RESEND 3/5] perf/x86/lbr: Move cpuc->lbr_xsave allocation out of sleeping region
Hello,

On Mon, Mar 22, 2021 at 3:14 PM Like Xu wrote:
> +void reserve_lbr_buffers(struct perf_event *event)
> +{
> +	struct kmem_cache *kmem_cache = x86_get_pmu()->task_ctx_cache;
> +	struct cpu_hw_events *cpuc;
> +	int cpu;
> +
> +	if (!static_cpu_has(X86_FEATURE_ARCH_LBR))
> +		return;
> +
> +	for_each_possible_cpu(cpu) {
> +		cpuc = per_cpu_ptr(&cpu_hw_events, cpu);
> +		if (kmem_cache && !cpuc->lbr_xsave && !event->attr.precise_ip)
> +			cpuc->lbr_xsave = kmem_cache_alloc(kmem_cache,
> GFP_KERNEL);
> +	}
> +}

I think we should use kmem_cache_alloc_node().

Thanks,
Namhyung
Re: [PATCH v4 RESEND 3/5] perf/x86/lbr: Move cpuc->lbr_xsave allocation out of sleeping region
On 3/23/2021 5:41 PM, Peter Zijlstra wrote:
> On Mon, Mar 22, 2021 at 02:06:33PM +0800, Like Xu wrote:
>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>> index 18df17129695..a4ce669cc78d 100644
>> --- a/arch/x86/events/core.c
>> +++ b/arch/x86/events/core.c
>> @@ -373,7 +373,7 @@ set_ext_hw_attr(struct hw_perf_event *hwc, struct perf_event *event)
>>  	return x86_pmu_extra_regs(val, event);
>>  }
>>
>> -int x86_reserve_hardware(void)
>> +int x86_reserve_hardware(struct perf_event *event)
>>  {
>>  	int err = 0;
>>
>> @@ -382,8 +382,10 @@ int x86_reserve_hardware(void)
>>  	if (atomic_read(&pmc_refcount) == 0) {
>>  		if (!reserve_pmc_hardware())
>>  			err = -EBUSY;
>> -		else
>> +		else {
>>  			reserve_ds_buffers();
>> +			reserve_lbr_buffers(event);
>> +		}
>>  	}
>>  	if (!err)
>>  		atomic_inc(&pmc_refcount);
>
> This makes absolutely no sense what so ever. This is only executed for
> the first event that gets here.

The LBR xsave buffer is a per-CPU buffer, not a per-event buffer. I
think we only need to allocate the buffer once when initializing the
first event.

Thanks,
Kan
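[Sketch: to make the "first event" gating concrete, below is an annotated
reconstruction of the patched x86_reserve_hardware(). The lines outside the
quoted hunk (the atomic_inc_not_zero() fast path and pmc_reserve_mutex) are
reconstructed from arch/x86/events/core.c as the editor understands it and
may not match the v4 patch exactly.

int x86_reserve_hardware(struct perf_event *event)
{
	int err = 0;

	/* Fast path: a later event only bumps the refcount. */
	if (!atomic_inc_not_zero(&pmc_refcount)) {
		mutex_lock(&pmc_reserve_mutex);
		/* Only the first event observes refcount == 0 ... */
		if (atomic_read(&pmc_refcount) == 0) {
			if (!reserve_pmc_hardware())
				err = -EBUSY;
			else {
				reserve_ds_buffers();
				/*
				 * ... so the per-CPU lbr_xsave buffers are
				 * allocated exactly once, in a context that
				 * is allowed to sleep.
				 */
				reserve_lbr_buffers(event);
			}
		}
		if (!err)
			atomic_inc(&pmc_refcount);
		mutex_unlock(&pmc_reserve_mutex);
	}

	return err;
}
]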
Re: [PATCH v4 RESEND 3/5] perf/x86/lbr: Move cpuc->lbr_xsave allocation out of sleeping region
On Mon, Mar 22, 2021 at 02:06:33PM +0800, Like Xu wrote:
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 18df17129695..a4ce669cc78d 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -373,7 +373,7 @@ set_ext_hw_attr(struct hw_perf_event *hwc, struct perf_event *event)
>  	return x86_pmu_extra_regs(val, event);
>  }
>
> -int x86_reserve_hardware(void)
> +int x86_reserve_hardware(struct perf_event *event)
>  {
>  	int err = 0;
>
> @@ -382,8 +382,10 @@ int x86_reserve_hardware(void)
>  	if (atomic_read(&pmc_refcount) == 0) {
>  		if (!reserve_pmc_hardware())
>  			err = -EBUSY;
> -		else
> +		else {
>  			reserve_ds_buffers();
> +			reserve_lbr_buffers(event);
> +		}
>  	}
>  	if (!err)
>  		atomic_inc(&pmc_refcount);

This makes absolutely no sense what so ever. This is only executed for
the first event that gets here.
Re: [PATCH v4 RESEND 3/5] perf/x86/lbr: Move cpuc->lbr_xsave allocation out of sleeping region
On 3/22/2021 2:06 AM, Like Xu wrote:
> If the kernel is compiled with the CONFIG_LOCKDEP option, the conditional
> might_sleep_if() deep in kmem_cache_alloc() will generate the following
> trace, and potentially cause a deadlock when another LBR event is added:
>
> [  243.115549] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:196
> [  243.117576] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 839, name: perf
> [  243.119326] INFO: lockdep is turned off.
> [  243.120249] irq event stamp: 0
> [  243.120967] hardirqs last  enabled at (0): [<>] 0x0
> [  243.122415] hardirqs last disabled at (0): [] copy_process+0xa45/0x1dc0
> [  243.124302] softirqs last  enabled at (0): [] copy_process+0xa45/0x1dc0
> [  243.126255] softirqs last disabled at (0): [<>] 0x0
> [  243.128119] CPU: 0 PID: 839 Comm: perf Not tainted 5.11.0-rc4-guest+ #8
> [  243.129654] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
> [  243.131520] Call Trace:
> [  243.132112]  dump_stack+0x8d/0xb5
> [  243.132896]  ___might_sleep.cold.106+0xb3/0xc3
> [  243.133984]  slab_pre_alloc_hook.constprop.85+0x96/0xd0
> [  243.135208]  ? intel_pmu_lbr_add+0x152/0x170
> [  243.136207]  kmem_cache_alloc+0x36/0x250
> [  243.137126]  intel_pmu_lbr_add+0x152/0x170
> [  243.138088]  x86_pmu_add+0x83/0xd0
> [  243.138889]  ? lock_acquire+0x158/0x350
> [  243.139791]  ? lock_acquire+0x158/0x350
> [  243.140694]  ? lock_acquire+0x158/0x350
> [  243.141625]  ? lock_acquired+0x1e3/0x360
> [  243.142544]  ? lock_release+0x1bf/0x340
> [  243.143726]  ? trace_hardirqs_on+0x1a/0xd0
> [  243.144823]  ? lock_acquired+0x1e3/0x360
> [  243.145742]  ? lock_release+0x1bf/0x340
> [  243.147107]  ? __slab_free+0x49/0x540
> [  243.147966]  ? trace_hardirqs_on+0x1a/0xd0
> [  243.148924]  event_sched_in.isra.129+0xf8/0x2a0
> [  243.149989]  merge_sched_in+0x261/0x3e0
> [  243.150889]  ? trace_hardirqs_on+0x1a/0xd0
> [  243.151869]  visit_groups_merge.constprop.135+0x130/0x4a0
> [  243.153122]  ? sched_clock_cpu+0xc/0xb0
> [  243.154023]  ctx_sched_in+0x101/0x210
> [  243.154884]  ctx_resched+0x6f/0xc0
> [  243.155686]  perf_event_exec+0x21e/0x2e0
> [  243.156641]  begin_new_exec+0x5e5/0xbd0
> [  243.157540]  load_elf_binary+0x6af/0x1770
> [  243.158478]  ? __kernel_read+0x19d/0x2b0
> [  243.159977]  ? lock_acquire+0x158/0x350
> [  243.160876]  ? __kernel_read+0x19d/0x2b0
> [  243.161796]  bprm_execve+0x3c8/0x840
> [  243.162638]  do_execveat_common.isra.38+0x1a5/0x1c0
> [  243.163776]  __x64_sys_execve+0x32/0x40
> [  243.164676]  do_syscall_64+0x33/0x40
> [  243.165514]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  243.166746] RIP: 0033:0x7f6180a26feb
> [  243.167590] Code: Unable to access opcode bytes at RIP 0x7f6180a26fc1.
> [  243.169097] RSP: 002b:7ffc6558ce18 EFLAGS: 0202 ORIG_RAX: 003b
> [  243.170844] RAX: ffda RBX: 7ffc65592d30 RCX: 7f6180a26feb
> [  243.172514] RDX: 55657f408dc0 RSI: 7ffc65592410 RDI: 7ffc65592d30
> [  243.174162] RBP: 7ffc6558ce80 R08: 7ffc6558cde0 R09:
> [  243.176042] R10: 0008 R11: 0202 R12: 7ffc65592410
> [  243.177696] R13: 55657f408dc0 R14: 0001 R15: 7ffc65592410
>
> One of the solutions is to use GFP_ATOMIC, but it will make the code
> less reliable under memory pressure. Let's move the memory allocation
> out of the sleeping region and put it into x86_reserve_hardware().
>
> The disadvantage of this fix is that the cpuc->lbr_xsave memory will be
> allocated for each cpu, like the legacy ds_buffer.
>
> Fixes: c085fb8774 ("perf/x86/intel/lbr: Support XSAVES for arch LBR read")
> Suggested-by: Kan Liang
> Signed-off-by: Like Xu

I observed the same issue when I did LBR test on an ADL machine. This
patch fixes the issue.
Tested-by: Kan Liang

Thanks,
Kan

> ---
>  arch/x86/events/core.c       |  8 +---
>  arch/x86/events/intel/bts.c  |  2 +-
>  arch/x86/events/intel/lbr.c  | 22 --
>  arch/x86/events/perf_event.h |  8 +++-
>  4 files changed, 29 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 18df17129695..a4ce669cc78d 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -373,7 +373,7 @@ set_ext_hw_attr(struct hw_perf_event *hwc, struct perf_event *event)
>  	return x86_pmu_extra_regs(val, event);
>  }
>
> -int x86_reserve_hardware(void)
> +int x86_reserve_hardware(struct perf_event *event)
>  {
>  	int err = 0;
>
> @@ -382,8 +382,10 @@ int x86_reserve_hardware(void)
>  	if (atomic_read(&pmc_refcount) == 0) {
>  		if (!reserve_pmc_hardware())
>  			err = -EBUSY;
> -		else
> +		else {
>  			reserve_ds_buffers();
> +			reserve_lbr_buffers(event);
> +		}
>  	}
>  	if (!err)
>  		atomic_inc(&pmc_refcount);
> @@ -634,7 +636,7 @@ static int
[PATCH v4 RESEND 3/5] perf/x86/lbr: Move cpuc->lbr_xsave allocation out of sleeping region
If the kernel is compiled with the CONFIG_LOCKDEP option, the conditional
might_sleep_if() deep in kmem_cache_alloc() will generate the following
trace, and potentially cause a deadlock when another LBR event is added:

[  243.115549] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:196
[  243.117576] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 839, name: perf
[  243.119326] INFO: lockdep is turned off.
[  243.120249] irq event stamp: 0
[  243.120967] hardirqs last  enabled at (0): [<>] 0x0
[  243.122415] hardirqs last disabled at (0): [] copy_process+0xa45/0x1dc0
[  243.124302] softirqs last  enabled at (0): [] copy_process+0xa45/0x1dc0
[  243.126255] softirqs last disabled at (0): [<>] 0x0
[  243.128119] CPU: 0 PID: 839 Comm: perf Not tainted 5.11.0-rc4-guest+ #8
[  243.129654] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
[  243.131520] Call Trace:
[  243.132112]  dump_stack+0x8d/0xb5
[  243.132896]  ___might_sleep.cold.106+0xb3/0xc3
[  243.133984]  slab_pre_alloc_hook.constprop.85+0x96/0xd0
[  243.135208]  ? intel_pmu_lbr_add+0x152/0x170
[  243.136207]  kmem_cache_alloc+0x36/0x250
[  243.137126]  intel_pmu_lbr_add+0x152/0x170
[  243.138088]  x86_pmu_add+0x83/0xd0
[  243.138889]  ? lock_acquire+0x158/0x350
[  243.139791]  ? lock_acquire+0x158/0x350
[  243.140694]  ? lock_acquire+0x158/0x350
[  243.141625]  ? lock_acquired+0x1e3/0x360
[  243.142544]  ? lock_release+0x1bf/0x340
[  243.143726]  ? trace_hardirqs_on+0x1a/0xd0
[  243.144823]  ? lock_acquired+0x1e3/0x360
[  243.145742]  ? lock_release+0x1bf/0x340
[  243.147107]  ? __slab_free+0x49/0x540
[  243.147966]  ? trace_hardirqs_on+0x1a/0xd0
[  243.148924]  event_sched_in.isra.129+0xf8/0x2a0
[  243.149989]  merge_sched_in+0x261/0x3e0
[  243.150889]  ? trace_hardirqs_on+0x1a/0xd0
[  243.151869]  visit_groups_merge.constprop.135+0x130/0x4a0
[  243.153122]  ? sched_clock_cpu+0xc/0xb0
[  243.154023]  ctx_sched_in+0x101/0x210
[  243.154884]  ctx_resched+0x6f/0xc0
[  243.155686]  perf_event_exec+0x21e/0x2e0
[  243.156641]  begin_new_exec+0x5e5/0xbd0
[  243.157540]  load_elf_binary+0x6af/0x1770
[  243.158478]  ? __kernel_read+0x19d/0x2b0
[  243.159977]  ? lock_acquire+0x158/0x350
[  243.160876]  ? __kernel_read+0x19d/0x2b0
[  243.161796]  bprm_execve+0x3c8/0x840
[  243.162638]  do_execveat_common.isra.38+0x1a5/0x1c0
[  243.163776]  __x64_sys_execve+0x32/0x40
[  243.164676]  do_syscall_64+0x33/0x40
[  243.165514]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  243.166746] RIP: 0033:0x7f6180a26feb
[  243.167590] Code: Unable to access opcode bytes at RIP 0x7f6180a26fc1.
[  243.169097] RSP: 002b:7ffc6558ce18 EFLAGS: 0202 ORIG_RAX: 003b
[  243.170844] RAX: ffda RBX: 7ffc65592d30 RCX: 7f6180a26feb
[  243.172514] RDX: 55657f408dc0 RSI: 7ffc65592410 RDI: 7ffc65592d30
[  243.174162] RBP: 7ffc6558ce80 R08: 7ffc6558cde0 R09:
[  243.176042] R10: 0008 R11: 0202 R12: 7ffc65592410
[  243.177696] R13: 55657f408dc0 R14: 0001 R15: 7ffc65592410

One of the solutions is to use GFP_ATOMIC, but it will make the code
less reliable under memory pressure. Let's move the memory allocation
out of the sleeping region and put it into x86_reserve_hardware().

The disadvantage of this fix is that the cpuc->lbr_xsave memory will be
allocated for each cpu, like the legacy ds_buffer.
Fixes: c085fb8774 ("perf/x86/intel/lbr: Support XSAVES for arch LBR read")
Suggested-by: Kan Liang
Signed-off-by: Like Xu
---
 arch/x86/events/core.c       |  8 +---
 arch/x86/events/intel/bts.c  |  2 +-
 arch/x86/events/intel/lbr.c  | 22 --
 arch/x86/events/perf_event.h |  8 +++-
 4 files changed, 29 insertions(+), 11 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 18df17129695..a4ce669cc78d 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -373,7 +373,7 @@ set_ext_hw_attr(struct hw_perf_event *hwc, struct perf_event *event)
 	return x86_pmu_extra_regs(val, event);
 }
 
-int x86_reserve_hardware(void)
+int x86_reserve_hardware(struct perf_event *event)
 {
 	int err = 0;
 
@@ -382,8 +382,10 @@ int x86_reserve_hardware(void)
 	if (atomic_read(&pmc_refcount) == 0) {
 		if (!reserve_pmc_hardware())
 			err = -EBUSY;
-		else
+		else {
 			reserve_ds_buffers();
+			reserve_lbr_buffers(event);
+		}
 	}
 	if (!err)
 		atomic_inc(&pmc_refcount);
@@ -634,7 +636,7 @@ static int __x86_pmu_event_init(struct perf_event *event)
 	if (!x86_pmu_initialized())
 		return -ENODEV;
 
-	err = x86_reserve_hardware();
+	err = x86_reserve_hardware(event);
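[Note: the posted diff is truncated at this point. For completeness, the new
arch/x86/events/intel/lbr.c helper it introduces is the one quoted verbatim
in Namhyung Kim's review earlier in this thread; the remaining bts.c and
perf_event.h hunks are not recoverable from this archive.

void reserve_lbr_buffers(struct perf_event *event)
{
	struct kmem_cache *kmem_cache = x86_get_pmu()->task_ctx_cache;
	struct cpu_hw_events *cpuc;
	int cpu;

	if (!static_cpu_has(X86_FEATURE_ARCH_LBR))
		return;

	for_each_possible_cpu(cpu) {
		cpuc = per_cpu_ptr(&cpu_hw_events, cpu);
		if (kmem_cache && !cpuc->lbr_xsave && !event->attr.precise_ip)
			cpuc->lbr_xsave = kmem_cache_alloc(kmem_cache, GFP_KERNEL);
	}
}
]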