Re: [PATCH V8 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE
On Thu, Oct 01, 2020 at 06:49:36AM -0700, Dave Hansen wrote: > On 9/30/20 10:30 AM, Peter Zijlstra wrote: > > In general though; I think using ->active_mm is a mistake though. That > > code should be doing something like: > > > > > > mm = current->mm; > > if (!mm) > > mm = _mm; > > > > I was hoping that using ->active_mm would give us the *actual* copy of > the page tables that's loaded in CR3 for kernel thraeds. But, there are > few if any practical advantages of doing that at the moment. Some of us hate active_mm with a passion and want to remove it entirely (/me waves at amluto). Also, if !current->mm, it had better not be accessing anything outside of the kernel map, so _mm should cover that just fine. And given the kernel maps are shared between all CR3s, they'll see the exact same pagetables.
Re: [PATCH V8 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE
On 9/30/20 10:30 AM, Peter Zijlstra wrote: > In general though; I think using ->active_mm is a mistake though. That > code should be doing something like: > > > mm = current->mm; > if (!mm) > mm = _mm; > I was hoping that using ->active_mm would give us the *actual* copy of the page tables that's loaded in CR3 for kernel thraeds. But, there are few if any practical advantages of doing that at the moment. Using that ^ is fine with me.
Re: [PATCH V8 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE
On 9/30/2020 6:45 PM, Stephane Eranian wrote: On Wed, Sep 30, 2020 at 10:30 AM Peter Zijlstra wrote: On Wed, Sep 30, 2020 at 07:48:48AM -0700, Dave Hansen wrote: On 9/30/20 7:42 AM, Liang, Kan wrote: When I tested on my kernel, it panicked because I suspect current->active_mm could be NULL. Adding a check for NULL avoided the problem. But I suspect this is not the correct solution. I guess the NULL active_mm should be a rare case. If so, I think it's not bad to add a check and return 0 page size. I think it would be best to understand why ->active_mm is NULL instead of just papering over the problem. If it is papered over, and this is common, you might end up effectively turning off your shiny new feature inadvertently. context_switch() can set prev->active_mm to NULL when it transfers it to @next. It does this before @current is updated. So an NMI that comes in between this active_mm swizzling and updating @current will see !active_mm. I think Peter is right. This code is called in the context of NMI, so if active_mm is set to NULL inside a locked section, this is not enough to protect the perf_events code from seeing it. In general though; I think using ->active_mm is a mistake though. That code should be doing something like: mm = current->mm; if (!mm) mm = _mm; I will send a V9 with the changes Peter suggests. Thanks, Kan
Re: [PATCH V8 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE
On Wed, Sep 30, 2020 at 10:30 AM Peter Zijlstra wrote: > > On Wed, Sep 30, 2020 at 07:48:48AM -0700, Dave Hansen wrote: > > On 9/30/20 7:42 AM, Liang, Kan wrote: > > >> When I tested on my kernel, it panicked because I suspect > > >> current->active_mm could be NULL. Adding a check for NULL avoided the > > >> problem. But I suspect this is not the correct solution. > > > > > > I guess the NULL active_mm should be a rare case. If so, I think it's > > > not bad to add a check and return 0 page size. > > > > I think it would be best to understand why ->active_mm is NULL instead > > of just papering over the problem. If it is papered over, and this is > > common, you might end up effectively turning off your shiny new feature > > inadvertently. > > context_switch() can set prev->active_mm to NULL when it transfers it to > @next. It does this before @current is updated. So an NMI that comes in > between this active_mm swizzling and updating @current will see > !active_mm. > I think Peter is right. This code is called in the context of NMI, so if active_mm is set to NULL inside a locked section, this is not enough to protect the perf_events code from seeing it. > In general though; I think using ->active_mm is a mistake though. That > code should be doing something like: > > > mm = current->mm; > if (!mm) > mm = _mm; > >
Re: [PATCH V8 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE
On Wed, Sep 30, 2020 at 07:48:48AM -0700, Dave Hansen wrote: > On 9/30/20 7:42 AM, Liang, Kan wrote: > >> When I tested on my kernel, it panicked because I suspect > >> current->active_mm could be NULL. Adding a check for NULL avoided the > >> problem. But I suspect this is not the correct solution. > > > > I guess the NULL active_mm should be a rare case. If so, I think it's > > not bad to add a check and return 0 page size. > > I think it would be best to understand why ->active_mm is NULL instead > of just papering over the problem. If it is papered over, and this is > common, you might end up effectively turning off your shiny new feature > inadvertently. context_switch() can set prev->active_mm to NULL when it transfers it to @next. It does this before @current is updated. So an NMI that comes in between this active_mm swizzling and updating @current will see !active_mm. In general though; I think using ->active_mm is a mistake though. That code should be doing something like: mm = current->mm; if (!mm) mm = _mm;
Re: [PATCH V8 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE
On Wed, Sep 30, 2020 at 7:48 AM Dave Hansen wrote: > > On 9/30/20 7:42 AM, Liang, Kan wrote: > >> When I tested on my kernel, it panicked because I suspect > >> current->active_mm could be NULL. Adding a check for NULL avoided the > >> problem. But I suspect this is not the correct solution. > > > > I guess the NULL active_mm should be a rare case. If so, I think it's > > not bad to add a check and return 0 page size. > > I think it would be best to understand why ->active_mm is NULL instead > of just papering over the problem. If it is papered over, and this is > common, you might end up effectively turning off your shiny new feature > inadvertently. I tried that on a backport of the patch to an older kernel. Maybe the behavior of active_mm has change compared to tip.git. I will try again with tip.git.
Re: [PATCH V8 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE
On 9/30/20 7:42 AM, Liang, Kan wrote: >> When I tested on my kernel, it panicked because I suspect >> current->active_mm could be NULL. Adding a check for NULL avoided the >> problem. But I suspect this is not the correct solution. > > I guess the NULL active_mm should be a rare case. If so, I think it's > not bad to add a check and return 0 page size. I think it would be best to understand why ->active_mm is NULL instead of just papering over the problem. If it is papered over, and this is common, you might end up effectively turning off your shiny new feature inadvertently.
Re: [PATCH V8 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE
On 9/30/2020 3:15 AM, Stephane Eranian wrote: +static u64 perf_get_page_size(unsigned long addr) +{ + unsigned long flags; + u64 size; + + if (!addr) + return 0; + + /* +* Software page-table walkers must disable IRQs, +* which prevents any tear down of the page tables. +*/ + local_irq_save(flags); + + size = __perf_get_page_size(current->active_mm, addr); + When I tested on my kernel, it panicked because I suspect current->active_mm could be NULL. Adding a check for NULL avoided the problem. But I suspect this is not the correct solution. I guess the NULL active_mm should be a rare case. If so, I think it's not bad to add a check and return 0 page size. Thanks, Kan
Re: [PATCH V8 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE
On 9/30/20 12:15 AM, Stephane Eranian wrote: >> + /* >> +* Software page-table walkers must disable IRQs, >> +* which prevents any tear down of the page tables. >> +*/ >> + local_irq_save(flags); >> + >> + size = __perf_get_page_size(current->active_mm, addr); >> + > When I tested on my kernel, it panicked because I suspect > current->active_mm could be NULL. Adding a check for NULL avoided the > problem. But I suspect this is not the correct solution. Did you happen to capture the oops? I can _imagine_ scenarios where current->active_mm could be NULL, I just can't find any obvious ones in the code.
Re: [PATCH V8 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE
On Mon, Sep 21, 2020 at 8:29 AM wrote: > > From: Kan Liang > > Current perf can report both virtual addresses and physical addresses, > but not the MMU page size. Without the MMU page size information of the > utilized page, users cannot decide whether to promote/demote large pages > to optimize memory usage. > > Add a new sample type for the data MMU page size. > > Current perf already has a facility to collect data virtual addresses. > A page walker is required to walk the pages tables and calculate the > MMU page size from a given virtual address. > > On some platforms, e.g., X86, the page walker is invoked in an NMI > handler. So the page walker must be NMI-safe and low overhead. Besides, > the page walker should work for both user and kernel virtual address. > The existing generic page walker, e.g., walk_page_range_novma(), is a > little bit complex and doesn't guarantee the NMI-safe. The follow_page() > is only for user-virtual address. > > Add a new function perf_get_page_size() to walk the page tables and > calculate the MMU page size. In the function: > - Interrupts have to be disabled to prevent any teardown of the page > tables. > - The active_mm is used for the page walker. Compared with mm, the > active_mm is a better choice. It's always non-NULL. For the user > thread, it always points to the real address space. For the kernel > thread, it "take over" the mm of the threads that switched to it, > so it's not using all of the page tables from the init_mm all the > time. > - The MMU page size is calculated from the page table level. > > The method should work for all architectures, but it has only been > verified on X86. Should there be some architectures, which support perf, > where the method doesn't work, it can be fixed later separately. > Reporting the wrong page size would not be fatal for the architecture. > > Some under discussion features may impact the method in the future. > Quote from Dave Hansen, > "There are lots of weird things folks are trying to do with the page >tables, like Address Space Isolation. For instance, if you get a >perf NMI when running userspace, current->mm->pgd is *different* than >the PGD that was in use when userspace was running. It's close enough >today, but it might not stay that way." > If the case happens later, lots of consecutive page walk errors will > happen. The worst case is that lots of page-size '0' are returned, which > would not be fatal. > In the perf tool, a check is implemented to detect this case. Once it > happens, a kernel patch could be implemented accordingly then. > > Suggested-by: Peter Zijlstra > Signed-off-by: Kan Liang > --- > include/linux/perf_event.h | 1 + > include/uapi/linux/perf_event.h | 4 +- > kernel/events/core.c| 93 + > 3 files changed, 97 insertions(+), 1 deletion(-) > > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h > index 0c19d279b97f..7e3785dd27d9 100644 > --- a/include/linux/perf_event.h > +++ b/include/linux/perf_event.h > @@ -1034,6 +1034,7 @@ struct perf_sample_data { > > u64 phys_addr; > u64 cgroup; > + u64 data_page_size; > } cacheline_aligned; > > /* default value for data source */ > diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h > index 077e7ee69e3d..cc6ea346e9f9 100644 > --- a/include/uapi/linux/perf_event.h > +++ b/include/uapi/linux/perf_event.h > @@ -143,8 +143,9 @@ enum perf_event_sample_format { > PERF_SAMPLE_PHYS_ADDR = 1U << 19, > PERF_SAMPLE_AUX = 1U << 20, > PERF_SAMPLE_CGROUP = 1U << 21, > + PERF_SAMPLE_DATA_PAGE_SIZE = 1U << 22, > > - PERF_SAMPLE_MAX = 1U << 22, /* non-ABI */ > + PERF_SAMPLE_MAX = 1U << 23, /* non-ABI */ > > __PERF_SAMPLE_CALLCHAIN_EARLY = 1ULL << 63, /* non-ABI; > internal use */ > }; > @@ -896,6 +897,7 @@ enum perf_event_type { > * { u64 phys_addr;} && PERF_SAMPLE_PHYS_ADDR > * { u64 size; > *char data[size]; } && PERF_SAMPLE_AUX > +* { u64 data_page_size;} && > PERF_SAMPLE_DATA_PAGE_SIZE > * }; > */ > PERF_RECORD_SAMPLE = 9, > diff --git a/kernel/events/core.c b/kernel/events/core.c > index 45edb85344a1..dd329a8f99f7 100644 > --- a/kernel/events/core.c > +++ b/kernel/events/core.c > @@ -51,6 +51,7 @@ > #include > #include > #include > +#include > > #include "internal.h" > > @@ -1894,6 +1895,9 @@ static void __perf_event_header_size(struct perf_event > *event, u64 sample_type) > if (sample_type & PERF_SAMPLE_CGROUP) > size += sizeof(data->cgroup); > > + if
[PATCH V8 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE
From: Kan Liang Current perf can report both virtual addresses and physical addresses, but not the MMU page size. Without the MMU page size information of the utilized page, users cannot decide whether to promote/demote large pages to optimize memory usage. Add a new sample type for the data MMU page size. Current perf already has a facility to collect data virtual addresses. A page walker is required to walk the pages tables and calculate the MMU page size from a given virtual address. On some platforms, e.g., X86, the page walker is invoked in an NMI handler. So the page walker must be NMI-safe and low overhead. Besides, the page walker should work for both user and kernel virtual address. The existing generic page walker, e.g., walk_page_range_novma(), is a little bit complex and doesn't guarantee the NMI-safe. The follow_page() is only for user-virtual address. Add a new function perf_get_page_size() to walk the page tables and calculate the MMU page size. In the function: - Interrupts have to be disabled to prevent any teardown of the page tables. - The active_mm is used for the page walker. Compared with mm, the active_mm is a better choice. It's always non-NULL. For the user thread, it always points to the real address space. For the kernel thread, it "take over" the mm of the threads that switched to it, so it's not using all of the page tables from the init_mm all the time. - The MMU page size is calculated from the page table level. The method should work for all architectures, but it has only been verified on X86. Should there be some architectures, which support perf, where the method doesn't work, it can be fixed later separately. Reporting the wrong page size would not be fatal for the architecture. Some under discussion features may impact the method in the future. Quote from Dave Hansen, "There are lots of weird things folks are trying to do with the page tables, like Address Space Isolation. For instance, if you get a perf NMI when running userspace, current->mm->pgd is *different* than the PGD that was in use when userspace was running. It's close enough today, but it might not stay that way." If the case happens later, lots of consecutive page walk errors will happen. The worst case is that lots of page-size '0' are returned, which would not be fatal. In the perf tool, a check is implemented to detect this case. Once it happens, a kernel patch could be implemented accordingly then. Suggested-by: Peter Zijlstra Signed-off-by: Kan Liang --- include/linux/perf_event.h | 1 + include/uapi/linux/perf_event.h | 4 +- kernel/events/core.c| 93 + 3 files changed, 97 insertions(+), 1 deletion(-) diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 0c19d279b97f..7e3785dd27d9 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -1034,6 +1034,7 @@ struct perf_sample_data { u64 phys_addr; u64 cgroup; + u64 data_page_size; } cacheline_aligned; /* default value for data source */ diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h index 077e7ee69e3d..cc6ea346e9f9 100644 --- a/include/uapi/linux/perf_event.h +++ b/include/uapi/linux/perf_event.h @@ -143,8 +143,9 @@ enum perf_event_sample_format { PERF_SAMPLE_PHYS_ADDR = 1U << 19, PERF_SAMPLE_AUX = 1U << 20, PERF_SAMPLE_CGROUP = 1U << 21, + PERF_SAMPLE_DATA_PAGE_SIZE = 1U << 22, - PERF_SAMPLE_MAX = 1U << 22, /* non-ABI */ + PERF_SAMPLE_MAX = 1U << 23, /* non-ABI */ __PERF_SAMPLE_CALLCHAIN_EARLY = 1ULL << 63, /* non-ABI; internal use */ }; @@ -896,6 +897,7 @@ enum perf_event_type { * { u64 phys_addr;} && PERF_SAMPLE_PHYS_ADDR * { u64 size; *char data[size]; } && PERF_SAMPLE_AUX +* { u64 data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE * }; */ PERF_RECORD_SAMPLE = 9, diff --git a/kernel/events/core.c b/kernel/events/core.c index 45edb85344a1..dd329a8f99f7 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -51,6 +51,7 @@ #include #include #include +#include #include "internal.h" @@ -1894,6 +1895,9 @@ static void __perf_event_header_size(struct perf_event *event, u64 sample_type) if (sample_type & PERF_SAMPLE_CGROUP) size += sizeof(data->cgroup); + if (sample_type & PERF_SAMPLE_DATA_PAGE_SIZE) + size += sizeof(data->data_page_size); + event->header_size = size; } @@ -6937,6 +6941,9 @@ void perf_output_sample(struct perf_output_handle *handle, if (sample_type &