Re: [PATCH 4/4] x86: Use the page tables to look up kernel addresses in backtrace

2014-10-10 Thread Eric Dumazet
On Fri, 2014-10-10 at 16:25 -0700, Andi Kleen wrote:
> From: Andi Kleen 
> 
> On my workstation which has a lot of modules loaded:
> 
> $ lsmod | wc -l
> 80

> This patch uses the NX bits in the page tables to check for
> valid kernel addresses instead. This can be done in any context
> because kernel page tables are not removed (if they were it could
> be handled by RCU like the user page tables)
> 
> The lookup here is 2-4 memory accesses bounded.
> 
> Anything with no NX bit set and is in kernel space is a valid
> kernel executable. Unlike the previous scheme this will also
> handle cases like the profiler hitting BIOS code or similar
> (e.g. the PCI BIOS on 32bit)
> 
> On systems without NX we fall back to the previous scheme.

For such systems, we probably could reorder fields in struct module so
that a lookup touches only one cache line per module instead of three...

offsetof(struct module, module_core)=0x138
offsetof(struct module, module_init)=0x130
offsetof(struct module, core_size)=0x144
offsetof(struct module, init_size)=0x140
offsetof(struct module, list)=0x8
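
A minimal sketch of that idea (illustrative only: the field names are
real, but the surrounding layout is simplified and is not the actual
struct module definition):

struct module {
	enum module_state state;

	/* Walked on every __module_address() lookup. */
	struct list_head list;

	/*
	 * Hypothetical reordering: pull the bounds-check fields up
	 * next to the list pointers, so a lookup reads one cache
	 * line per module instead of also touching 0x130-0x144.
	 */
	void *module_init;
	void *module_core;
	unsigned int init_size;
	unsigned int core_size;

	char name[MODULE_NAME_LEN];
	/* ... remaining fields unchanged ... */
};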





Re: [PATCH 4/4] x86: Use the page tables to look up kernel addresses in backtrace

2014-10-10 Thread Chuck Ebbert
On Fri, 10 Oct 2014 16:25:17 -0700
Andi Kleen  wrote:

> From: Andi Kleen 
> 
> On my workstation which has a lot of modules loaded:
> 
> $ lsmod | wc -l
> 80
> 
> A backtrace from the NMI for perf record -g can take quite a long time.
> 
> This leads to frequent messages like:
> perf interrupt took too long (7852 > 7812), lowering kernel.perf_event_max_sample_rate to 16000
> 
> One large part of the PMI cost is the text address check for each
> frame during the backtrace, which can take up to 3us, like this:
> 
>   1)   |  print_context_stack_bp() {
>   1)   |__kernel_text_address() {
>   1)   |  is_module_text_address() {
>   1)   |__module_text_address() {
>   1)   1.611 us|  __module_address();
>   1)   1.939 us|}
>   1)   2.296 us|  }
>   1)   2.659 us|}
>   1)   |__kernel_text_address() {
>   1)   |  is_module_text_address() {
>   1)   |__module_text_address() {
>   1)   0.724 us|  __module_address();
>   1)   1.064 us|}
>   1)   1.430 us|  }
>   1)   1.798 us|}
>   1)   |__kernel_text_address() {
>   1)   |  is_module_text_address() {
>   1)   |__module_text_address() {
>   1)   0.656 us|  __module_address();
>   1)   1.012 us|}
>   1)   1.356 us|  }
>   1)   1.761 us|}
> 
> So with a reasonably sized backtrace, 10-20us can easily be spent
> just checking the frame pointer IPs.
> 
> So essentially the current module lookup cost is O(N * M): N loaded
> modules times M backtrace entries.
> 
> This patch uses the NX bits in the page tables to check for
> valid kernel addresses instead. This can be done in any context
> because kernel page tables are never removed (if they were, it could
> be handled by RCU like the user page tables).
> 
> The lookup here is bounded at 2-4 memory accesses (on 64-bit: a 1GB
> mapping terminates at the pud after 2 accesses, a 2MB mapping at the
> pmd after 3, and a 4K mapping needs the full four-level walk).
> 
> Anything in kernel space that does not have the NX bit set is valid
> kernel executable code. Unlike the previous scheme, this also handles
> cases like the profiler hitting BIOS code or similar
> (e.g. the PCI BIOS on 32-bit).
> 
> On systems without NX we fall back to the previous scheme.
> 
> Signed-off-by: Andi Kleen 
> ---
>  arch/x86/kernel/dumpstack.c | 38 +-
>  1 file changed, 37 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
> index b74ebc7..9279549 100644
> --- a/arch/x86/kernel/dumpstack.c
> +++ b/arch/x86/kernel/dumpstack.c
> @@ -90,6 +90,42 @@ static inline int valid_stack_ptr(struct thread_info *tinfo,
>   return p > t && p < t + THREAD_SIZE - size;
>  }
>  
> +/*
> + * Check if the address is in an executable page.
> + * This can be much faster than looking it up in the module
> + * list when many modules are loaded.
> + *
> + * This is safe in any context because kernel page tables
> + * are never removed.
> + */
> +static bool addr_is_executable(unsigned long addr)

That name is confusing. Maybe call it x86_kernel_text_address()?

> +{
> + pgd_t *pgd;
> + pud_t *pud;
> + pmd_t *pmd;
> + pte_t *pte;
> +
> + if (!(__supported_pte_mask & _PAGE_NX))
> + return __kernel_text_address(addr);
> + if (addr < __PAGE_OFFSET)
> + return false;

Can't you check __PAGE_OFFSET first? That would speed up the non-NX
case...
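
Something like this, presumably (a sketch of the suggested swap; it
assumes kernel text is always mapped above __PAGE_OFFSET, so rejecting
low addresses before the feature test is safe for the fallback path
too):

	/* Cheap range check first, so non-NX systems also skip
	 * user addresses before the more expensive fallback. */
	if (addr < __PAGE_OFFSET)
		return false;
	if (!(__supported_pte_mask & _PAGE_NX))
		return __kernel_text_address(addr);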

> + pgd = pgd_offset_k(addr);
> + if (!pgd_present(*pgd))
> + return false;
> + pud = pud_offset(pgd, addr);
> + if (!pud_present(*pud))
> + return false;
> + if (pud_large(*pud))
> + return pte_exec(*(pte_t *)pud);
> + pmd = pmd_offset(pud, addr);
> + if (!pmd_present(*pmd))
> + return false;
> + if (pmd_large(*pmd))
> + return pte_exec(*(pte_t *)pmd);
> + pte = pte_offset_kernel(pmd, addr);
> + return pte_present(*pte) && pte_exec(*pte);
> +}
> +
>  unsigned long
>  print_context_stack(struct thread_info *tinfo,
>   unsigned long *stack, unsigned long bp,
> @@ -102,7 +138,7 @@ print_context_stack(struct thread_info *tinfo,
>   unsigned long addr;
>  
>   addr = *stack;
> - if (__kernel_text_address(addr)) {
> + if (addr_is_executable(addr)) {
>   if ((unsigned long) stack == bp + sizeof(long)) {
>   ops->address(data, addr, 1);
>   frame = frame->next_frame;



[PATCH 4/4] x86: Use the page tables to look up kernel addresses in backtrace

2014-10-10 Thread Andi Kleen
From: Andi Kleen 

On my workstation which has a lot of modules loaded:

$ lsmod | wc -l
80

A backtrace from the NMI for perf record -g can take quite a long time.

This leads to frequent messages like:
perf interrupt took too long (7852 > 7812), lowering kernel.perf_event_max_sample_rate to 16000

One large part of the PMI cost is the text address check for each
frame during the backtrace, which can take up to 3us, like this:

  1)   |  print_context_stack_bp() {
  1)   |__kernel_text_address() {
  1)   |  is_module_text_address() {
  1)   |__module_text_address() {
  1)   1.611 us|  __module_address();
  1)   1.939 us|}
  1)   2.296 us|  }
  1)   2.659 us|}
  1)   |__kernel_text_address() {
  1)   |  is_module_text_address() {
  1)   |__module_text_address() {
  1)   0.724 us|  __module_address();
  1)   1.064 us|}
  1)   1.430 us|  }
  1)   1.798 us|}
  1)   |__kernel_text_address() {
  1)   |  is_module_text_address() {
  1)   |__module_text_address() {
  1)   0.656 us|  __module_address();
  1)   1.012 us|}
  1)   1.356 us|  }
  1)   1.761 us|}

So with a reasonably sized backtrace, 10-20us can easily be spent
just checking the frame pointer IPs.

So essentially the current module lookup cost is O(N * M): N loaded
modules times M backtrace entries.
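
For reference, the slow path above boils down to a linear scan over
the module list for every frame, roughly like this (a simplified
sketch of what __module_address() does, not the exact kernel code):

static bool module_text_lookup(unsigned long addr)
{
	struct module *mod;

	/* One pass over all N loaded modules, repeated for each
	 * of the M addresses found in the backtrace. */
	list_for_each_entry_rcu(mod, &modules, list) {
		if (addr >= (unsigned long)mod->module_core &&
		    addr <  (unsigned long)mod->module_core + mod->core_size)
			return true;
	}
	return false;
}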

This patch uses the NX bits in the page tables to check for
valid kernel addresses instead. This can be done in any context
because kernel page tables are never removed (if they were, it could
be handled by RCU like the user page tables).

The lookup here is bounded at 2-4 memory accesses (on 64-bit: a 1GB
mapping terminates at the pud after 2 accesses, a 2MB mapping at the
pmd after 3, and a 4K mapping needs the full four-level walk).

Anything in kernel space that does not have the NX bit set is valid
kernel executable code. Unlike the previous scheme, this also handles
cases like the profiler hitting BIOS code or similar
(e.g. the PCI BIOS on 32-bit).

On systems without NX we fall back to the previous scheme.

Signed-off-by: Andi Kleen 
---
 arch/x86/kernel/dumpstack.c | 38 +-
 1 file changed, 37 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index b74ebc7..9279549 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -90,6 +90,42 @@ static inline int valid_stack_ptr(struct thread_info *tinfo,
return p > t && p < t + THREAD_SIZE - size;
 }
 
+/*
+ * Check if the address is in an executable page.
+ * This can be much faster than looking it up in the module
+ * list when many modules are loaded.
+ *
+ * This is safe in any context because kernel page tables
+ * are never removed.
+ */
+static bool addr_is_executable(unsigned long addr)
+{
+   pgd_t *pgd;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *pte;
+
+   if (!(__supported_pte_mask & _PAGE_NX))
+   return __kernel_text_address(addr);
+   if (addr < __PAGE_OFFSET)
+   return false;
+   pgd = pgd_offset_k(addr);
+   if (!pgd_present(*pgd))
+   return false;
+   pud = pud_offset(pgd, addr);
+   if (!pud_present(*pud))
+   return false;
+   if (pud_large(*pud))
+   return pte_exec(*(pte_t *)pud);
+   pmd = pmd_offset(pud, addr);
+   if (!pmd_present(*pmd))
+   return false;
+   if (pmd_large(*pmd))
+   return pte_exec(*(pte_t *)pmd);
+   pte = pte_offset_kernel(pmd, addr);
+   return pte_present(*pte) && pte_exec(*pte);
+}
+
 unsigned long
 print_context_stack(struct thread_info *tinfo,
unsigned long *stack, unsigned long bp,
@@ -102,7 +138,7 @@ print_context_stack(struct thread_info *tinfo,
unsigned long addr;
 
addr = *stack;
-   if (__kernel_text_address(addr)) {
+   if (addr_is_executable(addr)) {
if ((unsigned long) stack == bp + sizeof(long)) {
ops->address(data, addr, 1);
frame = frame->next_frame;
-- 
1.9.3
