Re: [PATCH v4 RESEND 3/5] perf/x86/lbr: Move cpuc->lbr_xsave allocation out of sleeping region

2021-03-23 Thread Like Xu

On 2021/3/24 12:04, Namhyung Kim wrote:

On Wed, Mar 24, 2021 at 12:47 PM Like Xu  wrote:


Hi Namhyung,

On 2021/3/24 9:32, Namhyung Kim wrote:

Hello,

On Mon, Mar 22, 2021 at 3:14 PM Like Xu  wrote:

+void reserve_lbr_buffers(struct perf_event *event)
+{
+   struct kmem_cache *kmem_cache = x86_get_pmu()->task_ctx_cache;
+   struct cpu_hw_events *cpuc;
+   int cpu;
+
+   if (!static_cpu_has(X86_FEATURE_ARCH_LBR))
+   return;
+
+   for_each_possible_cpu(cpu) {
+   cpuc = per_cpu_ptr(&cpu_hw_events, cpu);
+   if (kmem_cache && !cpuc->lbr_xsave && !event->attr.precise_ip)
+   cpuc->lbr_xsave = kmem_cache_alloc(kmem_cache, GFP_KERNEL);
+   }
+}


I think we should use kmem_cache_alloc_node().


"kmem_cache_alloc_node - Allocate an object on the specified node"

reserve_lbr_buffers() is called in __x86_pmu_event_init().
When the LBR perf_event is later scheduled onto another node, it seems
that we will not call init() again to allocate on the new node.

Do you mean to use kmem_cache_alloc_node() for each node in numa_nodes_parsed?


I assume cpuc->lbr_xsave will be accessed for that cpu only.
Then it needs to be allocated on the node that cpu belongs to.
Something like below:

 cpuc->lbr_xsave = kmem_cache_alloc_node(kmem_cache, GFP_KERNEL,
cpu_to_node(cpu));


Thanks, it helps and I will apply it in the next version.
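
For the record, the helper would then look roughly as below (a sketch only,
based on the hunk quoted above; the actual respin may differ in details):

void reserve_lbr_buffers(struct perf_event *event)
{
        struct kmem_cache *kmem_cache = x86_get_pmu()->task_ctx_cache;
        struct cpu_hw_events *cpuc;
        int cpu;

        if (!static_cpu_has(X86_FEATURE_ARCH_LBR))
                return;

        for_each_possible_cpu(cpu) {
                cpuc = per_cpu_ptr(&cpu_hw_events, cpu);
                if (kmem_cache && !cpuc->lbr_xsave && !event->attr.precise_ip)
                        /* allocate on the node this CPU belongs to */
                        cpuc->lbr_xsave = kmem_cache_alloc_node(kmem_cache,
                                                                GFP_KERNEL,
                                                                cpu_to_node(cpu));
        }
}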



Thanks,
Namhyung





Re: [PATCH v4 RESEND 3/5] perf/x86/lbr: Move cpuc->lbr_xsave allocation out of sleeping region

2021-03-23 Thread Namhyung Kim
On Wed, Mar 24, 2021 at 12:47 PM Like Xu  wrote:
>
> Hi Namhyung,
>
> On 2021/3/24 9:32, Namhyung Kim wrote:
> > Hello,
> >
> > On Mon, Mar 22, 2021 at 3:14 PM Like Xu  wrote:
> >> +void reserve_lbr_buffers(struct perf_event *event)
> >> +{
> >> +   struct kmem_cache *kmem_cache = x86_get_pmu()->task_ctx_cache;
> >> +   struct cpu_hw_events *cpuc;
> >> +   int cpu;
> >> +
> >> +   if (!static_cpu_has(X86_FEATURE_ARCH_LBR))
> >> +   return;
> >> +
> >> +   for_each_possible_cpu(cpu) {
> >> +   cpuc = per_cpu_ptr(&cpu_hw_events, cpu);
> >> +   if (kmem_cache && !cpuc->lbr_xsave && !event->attr.precise_ip)
> >> +   cpuc->lbr_xsave = kmem_cache_alloc(kmem_cache, GFP_KERNEL);
> >> +   }
> >> +}
> >
> > I think we should use kmem_cache_alloc_node().
>
> "kmem_cache_alloc_node - Allocate an object on the specified node"
>
> reserve_lbr_buffers() is called in __x86_pmu_event_init().
> When the LBR perf_event is later scheduled onto another node, it seems
> that we will not call init() again to allocate on the new node.
>
> Do you mean to use kmem_cache_alloc_node() for each node in numa_nodes_parsed?

I assume cpuc->lbr_xsave will be accessed for that cpu only.
Then it needs to be allocated on the node that cpu belongs to.
Something like below:

cpuc->lbr_xsave = kmem_cache_alloc_node(kmem_cache, GFP_KERNEL,
   cpu_to_node(cpu));

Thanks,
Namhyung


Re: [PATCH v4 RESEND 3/5] perf/x86/lbr: Move cpuc->lbr_xsave allocation out of sleeping region

2021-03-23 Thread Like Xu

Hi Namhyung,

On 2021/3/24 9:32, Namhyung Kim wrote:

Hello,

On Mon, Mar 22, 2021 at 3:14 PM Like Xu  wrote:

+void reserve_lbr_buffers(struct perf_event *event)
+{
+   struct kmem_cache *kmem_cache = x86_get_pmu()->task_ctx_cache;
+   struct cpu_hw_events *cpuc;
+   int cpu;
+
+   if (!static_cpu_has(X86_FEATURE_ARCH_LBR))
+   return;
+
+   for_each_possible_cpu(cpu) {
+   cpuc = per_cpu_ptr(&cpu_hw_events, cpu);
+   if (kmem_cache && !cpuc->lbr_xsave && !event->attr.precise_ip)
+   cpuc->lbr_xsave = kmem_cache_alloc(kmem_cache, GFP_KERNEL);
+   }
+}


I think we should use kmem_cache_alloc_node().


"kmem_cache_alloc_node - Allocate an object on the specified node"

reserve_lbr_buffers() is called in __x86_pmu_event_init().
When the LBR perf_event is later scheduled onto another node, it seems
that we will not call init() again to allocate on the new node.

Do you mean to use kmem_cache_alloc_node() for each node in numa_nodes_parsed?



Thanks,
Namhyung





Re: [PATCH v4 RESEND 3/5] perf/x86/lbr: Move cpuc->lbr_xsave allocation out of sleeping region

2021-03-23 Thread Namhyung Kim
Hello,

On Mon, Mar 22, 2021 at 3:14 PM Like Xu  wrote:
> +void reserve_lbr_buffers(struct perf_event *event)
> +{
> +   struct kmem_cache *kmem_cache = x86_get_pmu()->task_ctx_cache;
> +   struct cpu_hw_events *cpuc;
> +   int cpu;
> +
> +   if (!static_cpu_has(X86_FEATURE_ARCH_LBR))
> +   return;
> +
> +   for_each_possible_cpu(cpu) {
> +   cpuc = per_cpu_ptr(&cpu_hw_events, cpu);
> +   if (kmem_cache && !cpuc->lbr_xsave && !event->attr.precise_ip)
> +   cpuc->lbr_xsave = kmem_cache_alloc(kmem_cache, GFP_KERNEL);
> +   }
> +}

I think we should use kmem_cache_alloc_node().

Thanks,
Namhyung


Re: [PATCH v4 RESEND 3/5] perf/x86/lbr: Move cpuc->lbr_xsave allocation out of sleeping region

2021-03-23 Thread Liang, Kan




On 3/23/2021 5:41 PM, Peter Zijlstra wrote:

On Mon, Mar 22, 2021 at 02:06:33PM +0800, Like Xu wrote:

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 18df17129695..a4ce669cc78d 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -373,7 +373,7 @@ set_ext_hw_attr(struct hw_perf_event *hwc, struct perf_event *event)
return x86_pmu_extra_regs(val, event);
  }
  
-int x86_reserve_hardware(void)
+int x86_reserve_hardware(struct perf_event *event)
  {
int err = 0;
  
@@ -382,8 +382,10 @@ int x86_reserve_hardware(void)

if (atomic_read(&pmc_refcount) == 0) {
if (!reserve_pmc_hardware())
err = -EBUSY;
-   else
+   else {
reserve_ds_buffers();
+   reserve_lbr_buffers(event);
+   }
}
if (!err)
atomic_inc(&pmc_refcount);


This makes absolutely no sense whatsoever. This is only executed for
the first event that gets here.



The LBR xsave buffer is a per-CPU buffer, not a per-event buffer. I 
think we only need to allocate the buffer once when initializing the 
first event.
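
To restate the x86_reserve_hardware() hunk in those terms (a simplified
sketch of what the patch already does, not additional code):

        /*
         * The block below runs only for the first event on the system,
         * i.e. on the 0 -> 1 transition of pmc_refcount, and
         * reserve_lbr_buffers() itself skips CPUs whose lbr_xsave
         * buffer is already allocated.
         */
        if (atomic_read(&pmc_refcount) == 0) {
                if (!reserve_pmc_hardware())
                        err = -EBUSY;
                else {
                        reserve_ds_buffers();
                        reserve_lbr_buffers(event);     /* per-CPU lbr_xsave */
                }
        }
        if (!err)
                atomic_inc(&pmc_refcount);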


Thanks,
Kan


Re: [PATCH v4 RESEND 3/5] perf/x86/lbr: Move cpuc->lbr_xsave allocation out of sleeping region

2021-03-23 Thread Peter Zijlstra
On Mon, Mar 22, 2021 at 02:06:33PM +0800, Like Xu wrote:
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 18df17129695..a4ce669cc78d 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -373,7 +373,7 @@ set_ext_hw_attr(struct hw_perf_event *hwc, struct perf_event *event)
>   return x86_pmu_extra_regs(val, event);
>  }
>  
> -int x86_reserve_hardware(void)
> +int x86_reserve_hardware(struct perf_event *event)
>  {
>   int err = 0;
>  
> @@ -382,8 +382,10 @@ int x86_reserve_hardware(void)
>   if (atomic_read(&pmc_refcount) == 0) {
>   if (!reserve_pmc_hardware())
>   err = -EBUSY;
> - else
> + else {
>   reserve_ds_buffers();
> + reserve_lbr_buffers(event);
> + }
>   }
>   if (!err)
>   atomic_inc(&pmc_refcount);

This makes absolutely no sense whatsoever. This is only executed for
the first event that gets here.


Re: [PATCH v4 RESEND 3/5] perf/x86/lbr: Move cpuc->lbr_xsave allocation out of sleeping region

2021-03-23 Thread Liang, Kan




On 3/22/2021 2:06 AM, Like Xu wrote:

If the kernel is compiled with the CONFIG_LOCKDEP option, the conditional
might_sleep_if() deep in kmem_cache_alloc() will generate the following
trace, and potentially cause a deadlock when another LBR event is added:

[  243.115549] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:196
[  243.117576] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 839, name: perf
[  243.119326] INFO: lockdep is turned off.
[  243.120249] irq event stamp: 0
[  243.120967] hardirqs last  enabled at (0): [<>] 0x0
[  243.122415] hardirqs last disabled at (0): [] copy_process+0xa45/0x1dc0
[  243.124302] softirqs last  enabled at (0): [] copy_process+0xa45/0x1dc0
[  243.126255] softirqs last disabled at (0): [<>] 0x0
[  243.128119] CPU: 0 PID: 839 Comm: perf Not tainted 5.11.0-rc4-guest+ #8
[  243.129654] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
[  243.131520] Call Trace:
[  243.132112]  dump_stack+0x8d/0xb5
[  243.132896]  ___might_sleep.cold.106+0xb3/0xc3
[  243.133984]  slab_pre_alloc_hook.constprop.85+0x96/0xd0
[  243.135208]  ? intel_pmu_lbr_add+0x152/0x170
[  243.136207]  kmem_cache_alloc+0x36/0x250
[  243.137126]  intel_pmu_lbr_add+0x152/0x170
[  243.138088]  x86_pmu_add+0x83/0xd0
[  243.138889]  ? lock_acquire+0x158/0x350
[  243.139791]  ? lock_acquire+0x158/0x350
[  243.140694]  ? lock_acquire+0x158/0x350
[  243.141625]  ? lock_acquired+0x1e3/0x360
[  243.142544]  ? lock_release+0x1bf/0x340
[  243.143726]  ? trace_hardirqs_on+0x1a/0xd0
[  243.144823]  ? lock_acquired+0x1e3/0x360
[  243.145742]  ? lock_release+0x1bf/0x340
[  243.147107]  ? __slab_free+0x49/0x540
[  243.147966]  ? trace_hardirqs_on+0x1a/0xd0
[  243.148924]  event_sched_in.isra.129+0xf8/0x2a0
[  243.149989]  merge_sched_in+0x261/0x3e0
[  243.150889]  ? trace_hardirqs_on+0x1a/0xd0
[  243.151869]  visit_groups_merge.constprop.135+0x130/0x4a0
[  243.153122]  ? sched_clock_cpu+0xc/0xb0
[  243.154023]  ctx_sched_in+0x101/0x210
[  243.154884]  ctx_resched+0x6f/0xc0
[  243.155686]  perf_event_exec+0x21e/0x2e0
[  243.156641]  begin_new_exec+0x5e5/0xbd0
[  243.157540]  load_elf_binary+0x6af/0x1770
[  243.158478]  ? __kernel_read+0x19d/0x2b0
[  243.159977]  ? lock_acquire+0x158/0x350
[  243.160876]  ? __kernel_read+0x19d/0x2b0
[  243.161796]  bprm_execve+0x3c8/0x840
[  243.162638]  do_execveat_common.isra.38+0x1a5/0x1c0
[  243.163776]  __x64_sys_execve+0x32/0x40
[  243.164676]  do_syscall_64+0x33/0x40
[  243.165514]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  243.166746] RIP: 0033:0x7f6180a26feb
[  243.167590] Code: Unable to access opcode bytes at RIP 0x7f6180a26fc1.
[  243.169097] RSP: 002b:7ffc6558ce18 EFLAGS: 0202 ORIG_RAX: 003b
[  243.170844] RAX: ffda RBX: 7ffc65592d30 RCX: 7f6180a26feb
[  243.172514] RDX: 55657f408dc0 RSI: 7ffc65592410 RDI: 7ffc65592d30
[  243.174162] RBP: 7ffc6558ce80 R08: 7ffc6558cde0 R09: 
[  243.176042] R10: 0008 R11: 0202 R12: 7ffc65592410
[  243.177696] R13: 55657f408dc0 R14: 0001 R15: 7ffc65592410

One of the solutions is to use GFP_ATOMIC, but it would make the code less
reliable under memory pressure. Let's move the memory allocation out of
the sleeping region and put it into x86_reserve_hardware().

The disadvantage of this fix is that the cpuc->lbr_xsave memory
will be allocated for each cpu, like the legacy ds_buffer.

Fixes: c085fb8774 ("perf/x86/intel/lbr: Support XSAVES for arch LBR read")
Suggested-by: Kan Liang 
Signed-off-by: Like Xu 


I observed the same issue when I did an LBR test on an ADL machine. This
patch fixes the issue.


Tested-by: Kan Liang 

Thanks,
Kan

---
  arch/x86/events/core.c   |  8 +---
  arch/x86/events/intel/bts.c  |  2 +-
  arch/x86/events/intel/lbr.c  | 22 --
  arch/x86/events/perf_event.h |  8 +++-
  4 files changed, 29 insertions(+), 11 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 18df17129695..a4ce669cc78d 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -373,7 +373,7 @@ set_ext_hw_attr(struct hw_perf_event *hwc, struct perf_event *event)
return x86_pmu_extra_regs(val, event);
  }
  
-int x86_reserve_hardware(void)
+int x86_reserve_hardware(struct perf_event *event)
  {
int err = 0;
  
@@ -382,8 +382,10 @@ int x86_reserve_hardware(void)

if (atomic_read(&pmc_refcount) == 0) {
if (!reserve_pmc_hardware())
err = -EBUSY;
-   else
+   else {
reserve_ds_buffers();
+   reserve_lbr_buffers(event);
+   }
}
if (!err)
atomic_inc(&pmc_refcount);
@@ -634,7 +636,7 @@ static int 

[PATCH v4 RESEND 3/5] perf/x86/lbr: Move cpuc->lbr_xsave allocation out of sleeping region

2021-03-22 Thread Like Xu
If the kernel is compiled with the CONFIG_LOCKDEP option, the conditional
might_sleep_if() deep in kmem_cache_alloc() will generate the following
trace, and potentially cause a deadlock when another LBR event is added:

[  243.115549] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:196
[  243.117576] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 839, name: perf
[  243.119326] INFO: lockdep is turned off.
[  243.120249] irq event stamp: 0
[  243.120967] hardirqs last  enabled at (0): [<>] 0x0
[  243.122415] hardirqs last disabled at (0): [] copy_process+0xa45/0x1dc0
[  243.124302] softirqs last  enabled at (0): [] copy_process+0xa45/0x1dc0
[  243.126255] softirqs last disabled at (0): [<>] 0x0
[  243.128119] CPU: 0 PID: 839 Comm: perf Not tainted 5.11.0-rc4-guest+ #8
[  243.129654] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
[  243.131520] Call Trace:
[  243.132112]  dump_stack+0x8d/0xb5
[  243.132896]  ___might_sleep.cold.106+0xb3/0xc3
[  243.133984]  slab_pre_alloc_hook.constprop.85+0x96/0xd0
[  243.135208]  ? intel_pmu_lbr_add+0x152/0x170
[  243.136207]  kmem_cache_alloc+0x36/0x250
[  243.137126]  intel_pmu_lbr_add+0x152/0x170
[  243.138088]  x86_pmu_add+0x83/0xd0
[  243.138889]  ? lock_acquire+0x158/0x350
[  243.139791]  ? lock_acquire+0x158/0x350
[  243.140694]  ? lock_acquire+0x158/0x350
[  243.141625]  ? lock_acquired+0x1e3/0x360
[  243.142544]  ? lock_release+0x1bf/0x340
[  243.143726]  ? trace_hardirqs_on+0x1a/0xd0
[  243.144823]  ? lock_acquired+0x1e3/0x360
[  243.145742]  ? lock_release+0x1bf/0x340
[  243.147107]  ? __slab_free+0x49/0x540
[  243.147966]  ? trace_hardirqs_on+0x1a/0xd0
[  243.148924]  event_sched_in.isra.129+0xf8/0x2a0
[  243.149989]  merge_sched_in+0x261/0x3e0
[  243.150889]  ? trace_hardirqs_on+0x1a/0xd0
[  243.151869]  visit_groups_merge.constprop.135+0x130/0x4a0
[  243.153122]  ? sched_clock_cpu+0xc/0xb0
[  243.154023]  ctx_sched_in+0x101/0x210
[  243.154884]  ctx_resched+0x6f/0xc0
[  243.155686]  perf_event_exec+0x21e/0x2e0
[  243.156641]  begin_new_exec+0x5e5/0xbd0
[  243.157540]  load_elf_binary+0x6af/0x1770
[  243.158478]  ? __kernel_read+0x19d/0x2b0
[  243.159977]  ? lock_acquire+0x158/0x350
[  243.160876]  ? __kernel_read+0x19d/0x2b0
[  243.161796]  bprm_execve+0x3c8/0x840
[  243.162638]  do_execveat_common.isra.38+0x1a5/0x1c0
[  243.163776]  __x64_sys_execve+0x32/0x40
[  243.164676]  do_syscall_64+0x33/0x40
[  243.165514]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  243.166746] RIP: 0033:0x7f6180a26feb
[  243.167590] Code: Unable to access opcode bytes at RIP 0x7f6180a26fc1.
[  243.169097] RSP: 002b:7ffc6558ce18 EFLAGS: 0202 ORIG_RAX: 003b
[  243.170844] RAX: ffda RBX: 7ffc65592d30 RCX: 7f6180a26feb
[  243.172514] RDX: 55657f408dc0 RSI: 7ffc65592410 RDI: 7ffc65592d30
[  243.174162] RBP: 7ffc6558ce80 R08: 7ffc6558cde0 R09: 
[  243.176042] R10: 0008 R11: 0202 R12: 7ffc65592410
[  243.177696] R13: 55657f408dc0 R14: 0001 R15: 7ffc65592410

One of the solutions is to use GFP_ATOMIC, but it would make the code less
reliable under memory pressure. Let's move the memory allocation out of
the sleeping region and put it into x86_reserve_hardware().

The disadvantage of this fix is that the cpuc->lbr_xsave memory
will be allocated for each cpu, like the legacy ds_buffer.
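
For context, a simplified sketch of where the allocation currently lives
(intel_pmu_lbr_add() runs from the pmu::add() callback with IRQs disabled,
which is why the GFP_KERNEL allocation triggers the splat above; details
abbreviated, see the diff below for the actual change):

void intel_pmu_lbr_add(struct perf_event *event)
{
        struct kmem_cache *kmem_cache = event->pmu->task_ctx_cache;
        struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);

        /* ... lbr_users bookkeeping elided ... */

        /* reached in atomic context, but GFP_KERNEL may sleep */
        if (static_cpu_has(X86_FEATURE_ARCH_LBR) && kmem_cache && !cpuc->lbr_xsave)
                cpuc->lbr_xsave = kmem_cache_alloc(kmem_cache, GFP_KERNEL);
}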

Fixes: c085fb8774 ("perf/x86/intel/lbr: Support XSAVES for arch LBR read")
Suggested-by: Kan Liang 
Signed-off-by: Like Xu 
---
 arch/x86/events/core.c   |  8 +---
 arch/x86/events/intel/bts.c  |  2 +-
 arch/x86/events/intel/lbr.c  | 22 --
 arch/x86/events/perf_event.h |  8 +++-
 4 files changed, 29 insertions(+), 11 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 18df17129695..a4ce669cc78d 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -373,7 +373,7 @@ set_ext_hw_attr(struct hw_perf_event *hwc, struct perf_event *event)
return x86_pmu_extra_regs(val, event);
 }
 
-int x86_reserve_hardware(void)
+int x86_reserve_hardware(struct perf_event *event)
 {
int err = 0;
 
@@ -382,8 +382,10 @@ int x86_reserve_hardware(void)
if (atomic_read(&pmc_refcount) == 0) {
if (!reserve_pmc_hardware())
err = -EBUSY;
-   else
+   else {
reserve_ds_buffers();
+   reserve_lbr_buffers(event);
+   }
}
if (!err)
atomic_inc(&pmc_refcount);
@@ -634,7 +636,7 @@ static int __x86_pmu_event_init(struct perf_event *event)
if (!x86_pmu_initialized())
return -ENODEV;
 
-   err = x86_reserve_hardware();
+   err = x86_reserve_hardware(event);