RE: "struct perf_sample_data" alignment
... > What I remember is that since perf_sample_data is fairly large, > unconditionally initializing the whole thing is *slow* (and > -fauto-var-init=zero will hurt here). That will hurt everywhere. I can also imagine it hiding bugs and making people shrink on-stack arrays to the point where they either overrun or cause an unexpected error or string truncation. Initialising to zero is a bad choice if the aim is to avoid leaking stack. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
Re: "struct perf_sample_data" alignment
* Alexei Starovoitov wrote: > > This seems to be it... (completely untested) > > > > --- > > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h > > index 3f7f89ea5e51..918a296d2ca2 100644 > > --- a/include/linux/perf_event.h > > +++ b/include/linux/perf_event.h > > @@ -1032,7 +1032,9 @@ struct perf_sample_data { > > u64 cgroup; > > u64 data_page_size; > > u64 code_page_size; > > -} cacheline_aligned; > > +}; > > + > > +typedef struct perf_sample_data perf_sample_data_t cacheline_aligned; > > > > /* default value for data source */ > > #define PERF_MEM_NA (PERF_MEM_S(OP, NA) |\ > > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c > > index b0c45d923f0f..f32c623abef6 100644 > > --- a/kernel/trace/bpf_trace.c > > +++ b/kernel/trace/bpf_trace.c > > @@ -923,7 +923,7 @@ __bpf_perf_event_output(struct pt_regs *regs, struct > > bpf_map *map, > > * bpf_perf_event_output > > */ > > struct bpf_trace_sample_data { > > - struct perf_sample_data sds[3]; > > + perf_sample_data_t sds[3]; > > bpf side doesn't care about about cacheline aligned. > No need to add new typedef just for that. So this structure is not supposed to be exposed to any ABI anywhere. I did a (non-exhaustive) search of tooling, and there doesn't appear to be any accidental exposure. The in-kernel ABI interaction appears to be the following: - In __perf_event_header_size() we only use fields within perf_sample_data to size the header. Alignment won't change any of the output. - Ditto in perf_event__id_header_size(). I.e. I think we should just zap it per the patch below (untested). Thanks, Ingo > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 3f7f89ea5e51..d75e03ff31ea 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -1032,7 +1032,7 @@ struct perf_sample_data { u64 cgroup; u64 data_page_size; u64 code_page_size; -} cacheline_aligned; +}; /* default value for data source */ #define PERF_MEM_NA (PERF_MEM_S(OP, NA) |\
Re: "struct perf_sample_data" alignment
On Fri, Mar 5, 2021 at 7:01 AM Peter Zijlstra wrote: > > On Fri, Mar 05, 2021 at 09:36:30AM +0100, Peter Zijlstra wrote: > > On Thu, Mar 04, 2021 at 07:45:44PM -0800, Linus Torvalds wrote: > > > That cacheline_aligned goes back many years, this is not new, it > > > seems to come from back in 2014: commit 2565711fb7d7 ("perf: Improve > > > the perf_sample_data struct layout"). > > > > long time ago... > > > > > But it really seems entirely and utterly bogus. That cacheline > > > alignment makes things *worse*, when the variables are on the local > > > stack. The local stack is already going to be dirty and in the cache, > > > and aligning those things isn't going to - and I quote from the code > > > in that commend in that commit - "minimize the cachelines touched". > > > > > > Quite the reverse. It's just going to make the stack frame use *more* > > > memory, and make any cacheline usage _worse_. > > > > IIRC there is more history here, but I can't seem to find references > > just now. > > > > What I remember is that since perf_sample_data is fairly large, > > unconditionally initializing the whole thing is *slow* (and > > -fauto-var-init=zero will hurt here). > > > > So at some point I removed that full initialization and made sure we > > only unconditionally touched the first few variables, which gave a > > measurable speedup. > > > > Then things got messy again and the commit 2565711fb7d7 referenced above > > was cleanup, to get back to that initial state. > > > > Now, you're right that __cacheline_aligned on on-stack (and this is > > indeed mostly on-stack) is fairly tedious (there were a few patches > > recently to reduce the amount of on-stack instances). > > > > I'll put it on the todo list, along with that hotplug stuff (which I > > tried to fix but ended up with an even bigger mess). I suppose we can > > try and not have the alignment for the on-stack instances while > > preserving it for the few off-stack ones. > > > > Also; we're running on the NMI stack, and that's not typically hot. > > This seems to be it... (completely untested) > > --- > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h > index 3f7f89ea5e51..918a296d2ca2 100644 > --- a/include/linux/perf_event.h > +++ b/include/linux/perf_event.h > @@ -1032,7 +1032,9 @@ struct perf_sample_data { > u64 cgroup; > u64 data_page_size; > u64 code_page_size; > -} cacheline_aligned; > +}; > + > +typedef struct perf_sample_data perf_sample_data_t cacheline_aligned; > > /* default value for data source */ > #define PERF_MEM_NA (PERF_MEM_S(OP, NA) |\ > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c > index b0c45d923f0f..f32c623abef6 100644 > --- a/kernel/trace/bpf_trace.c > +++ b/kernel/trace/bpf_trace.c > @@ -923,7 +923,7 @@ __bpf_perf_event_output(struct pt_regs *regs, struct > bpf_map *map, > * bpf_perf_event_output > */ > struct bpf_trace_sample_data { > - struct perf_sample_data sds[3]; > + perf_sample_data_t sds[3]; bpf side doesn't care about about cacheline aligned. No need to add new typedef just for that.
Re: "struct perf_sample_data" alignment
On Fri, Mar 05, 2021 at 09:36:30AM +0100, Peter Zijlstra wrote: > On Thu, Mar 04, 2021 at 07:45:44PM -0800, Linus Torvalds wrote: > > That cacheline_aligned goes back many years, this is not new, it > > seems to come from back in 2014: commit 2565711fb7d7 ("perf: Improve > > the perf_sample_data struct layout"). > > long time ago... > > > But it really seems entirely and utterly bogus. That cacheline > > alignment makes things *worse*, when the variables are on the local > > stack. The local stack is already going to be dirty and in the cache, > > and aligning those things isn't going to - and I quote from the code > > in that commend in that commit - "minimize the cachelines touched". > > > > Quite the reverse. It's just going to make the stack frame use *more* > > memory, and make any cacheline usage _worse_. > > IIRC there is more history here, but I can't seem to find references > just now. > > What I remember is that since perf_sample_data is fairly large, > unconditionally initializing the whole thing is *slow* (and > -fauto-var-init=zero will hurt here). > > So at some point I removed that full initialization and made sure we > only unconditionally touched the first few variables, which gave a > measurable speedup. > > Then things got messy again and the commit 2565711fb7d7 referenced above > was cleanup, to get back to that initial state. > > Now, you're right that __cacheline_aligned on on-stack (and this is > indeed mostly on-stack) is fairly tedious (there were a few patches > recently to reduce the amount of on-stack instances). > > I'll put it on the todo list, along with that hotplug stuff (which I > tried to fix but ended up with an even bigger mess). I suppose we can > try and not have the alignment for the on-stack instances while > preserving it for the few off-stack ones. > > Also; we're running on the NMI stack, and that's not typically hot. This seems to be it... (completely untested) --- diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 3f7f89ea5e51..918a296d2ca2 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -1032,7 +1032,9 @@ struct perf_sample_data { u64 cgroup; u64 data_page_size; u64 code_page_size; -} cacheline_aligned; +}; + +typedef struct perf_sample_data perf_sample_data_t cacheline_aligned; /* default value for data source */ #define PERF_MEM_NA (PERF_MEM_S(OP, NA) |\ diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index b0c45d923f0f..f32c623abef6 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -923,7 +923,7 @@ __bpf_perf_event_output(struct pt_regs *regs, struct bpf_map *map, * bpf_perf_event_output */ struct bpf_trace_sample_data { - struct perf_sample_data sds[3]; + perf_sample_data_t sds[3]; }; static DEFINE_PER_CPU(struct bpf_trace_sample_data, bpf_trace_sds);
Re: "struct perf_sample_data" alignment
On Thu, Mar 04, 2021 at 07:45:44PM -0800, Linus Torvalds wrote: > That cacheline_aligned goes back many years, this is not new, it > seems to come from back in 2014: commit 2565711fb7d7 ("perf: Improve > the perf_sample_data struct layout"). long time ago... > But it really seems entirely and utterly bogus. That cacheline > alignment makes things *worse*, when the variables are on the local > stack. The local stack is already going to be dirty and in the cache, > and aligning those things isn't going to - and I quote from the code > in that commend in that commit - "minimize the cachelines touched". > > Quite the reverse. It's just going to make the stack frame use *more* > memory, and make any cacheline usage _worse_. IIRC there is more history here, but I can't seem to find references just now. What I remember is that since perf_sample_data is fairly large, unconditionally initializing the whole thing is *slow* (and -fauto-var-init=zero will hurt here). So at some point I removed that full initialization and made sure we only unconditionally touched the first few variables, which gave a measurable speedup. Then things got messy again and the commit 2565711fb7d7 referenced above was cleanup, to get back to that initial state. Now, you're right that __cacheline_aligned on on-stack (and this is indeed mostly on-stack) is fairly tedious (there were a few patches recently to reduce the amount of on-stack instances). I'll put it on the todo list, along with that hotplug stuff (which I tried to fix but ended up with an even bigger mess). I suppose we can try and not have the alignment for the on-stack instances while preserving it for the few off-stack ones. Also; we're running on the NMI stack, and that's not typically hot.