Re: [PATCH] x86/speculation/l1tf: Exempt zeroed PTEs from XOR conversion

2018-08-17 Thread Andi Kleen
On Thu, Aug 16, 2018 at 01:46:38PM -0700, Sean Christopherson wrote:
> clear_page() does not undergo the XOR logic to invert the address
> bits, i.e. PTE, PMD and PUD entries that have not been individually
> written will have val=0 and so will trigger __pte_needs_invert().
> As a result, {pte,pmd,pud}_pfn() will return the wrong PFN value,
> i.e. all ones (adjusted by the max PFN mask) instead of zero.
> A zeroed entry is ok because the page at physical address 0 is
> reserved early in boot specifically to mitigate L1TF, so explicitly
> exempt them from the inversion when reading the PFN.
> 
> Manifested as an unexpected mprotect(..., PROT_NONE) failure when
> called on a VMA that has VM_PFNMAP and was mmap'd as something
> other than PROT_NONE but never used.  mprotect() sends the PROT_NONE
> request down prot_none_walk(), which walks the PTEs to check the PFNs.
> prot_none_pte_entry() gets the bogus PFN from pte_pfn() and returns
> -EACCES because it thinks mprotect() is trying to adjust a high MMIO
> address.

Looks good to me. You're right that case was missed.
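
For reference, the fix amounts to exempting val == 0 in the inversion
check -- roughly (sketch):

	static inline bool __pte_needs_invert(u64 val)
	{
		/* zero means never written; PFN 0 is reserved at boot for L1TF */
		return val && !(val & _PAGE_PRESENT);
	}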

Reviewed-by: Andi Kleen 

I think Thomas is still on vacation, copying Linus.

-Andi


[PATCH] x86/speculation/l1tf: Add assembly guard for xtensa/ia64

2018-08-15 Thread Andi Kleen
From: Andi Kleen 

The stable backport of the

x86/speculation/l1tf: Disallow non privileged high MMIO PROT_NONE mappings

patch for 4.4 and 4.9 put the new C code for the !__HAVE_ARCH_PFN_MODIFY_ALLOWED
case outside the assembler ifdef. This breaks the xtensa and ia64 builds,
as reported by 0day, because those architectures somehow include this file
from assembler.

Just add an #ifndef __ASSEMBLY__ around the new code to fix this.

This patch is only needed for 4.9 and 4.4 stable, the newer stables
don't have this problem.

Fixes: 7c5b42f82c13 ("x86/speculation/l1tf: Disallow non privileged high MMIO 
PROT_NONE mappings")
Signed-off-by: Andi Kleen 
---
 include/asm-generic/pgtable.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index a88ea9e37a25..abc2a1b15dd8 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -825,6 +825,7 @@ static inline int pmd_free_pte_page(pmd_t *pmd)
 #endif
 #endif
 
+#ifndef __ASSEMBLY__
 struct file;
 int phys_mem_access_prot_allowed(struct file *file, unsigned long pfn,
unsigned long size, pgprot_t *vma_prot);
@@ -839,6 +840,9 @@ static inline bool arch_has_pfn_modify_check(void)
 {
return false;
 }
+
+#endif
+
 #endif /* !_HAVE_ARCH_PFN_MODIFY_ALLOWED */
 
 #endif /* !__ASSEMBLY__ */
-- 
2.17.1



Re: [RFC] perf/x86/intel: Export mem events only if there's PEBs support

2018-08-13 Thread Andi Kleen
On Mon, Aug 13, 2018 at 05:48:20PM +0200, Jiri Olsa wrote:
> hi,
> we had some reports that we show mem* events on KVM servers
> where they are not available, this patch hides them if there's
> no PEBs available
> 
> currently on those servers we fail to open mem* events,
> because we can't access LDLAT msr
> 
> however that's not so easy/straight forward to check,
> so I thought we could use check on PEBs instead, which
> needs to be there for mem events as well and seems to
> be enough to check
> 
> thoughts? thanks,
> jirka
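
The gating Jiri describes could be done with a sysfs visibility callback --
a minimal sketch (the hookup is elided; treat the names as illustrative):

	static umode_t
	pebs_is_visible(struct kobject *kobj, struct attribute *attr, int i)
	{
		/* hide the mem-* event aliases when PEBS (and thus LDLAT) is absent */
		return x86_pmu.pebs ? attr->mode : 0;
	}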

Acked-by: Andi Kleen 

-Andi


Re: [PATCH] x86/cpu: Rename Denverton and Gemini Lake

2018-08-07 Thread Andi Kleen
> Which simply does not work. Look at Goldmont Fam 6 Model 5C. The SoCs
> with that Fam/Model combination are:
> 
>  - Apollo Lake
>  - Broxton (has two platforms: Morganfield and Willowtrail)

Right, pick one. The others are the same for software purposes
and can be handled the same way.

> 
> It's even worse with Silvermont.
> 
> So no, the interesting information is the UARCH and the variant of that,

By uarch do you mean the core uarch?  That doesn't really work for
something like Silvermont or Goldmont.

> e.g. UARCH_CLIENT, UARCH_SERVER, UARCH_WHATEVER. All the magic Code Names

Right, your scheme totally doesn't work on Silvermont because there
are multiple client variants.

> and their platform variants are not interesting at all for the Fam/Model
> information.

You're right that platform is misleading. I think the right level
is SoCs, because that matches how the model numbers are allocated.

On big core, the *Lakes are all unique SoCs.

-Andi


Re: [PATCH] x86/cpu: Rename Denverton and Gemini Lake

2018-08-07 Thread Andi Kleen
On Tue, Aug 07, 2018 at 07:48:51PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 07, 2018 at 10:35:42AM -0700, Dave Hansen wrote:
> > On 08/07/2018 10:17 AM, kan.li...@linux.intel.com wrote:
> > > Denverton and Gemini Lake are platform names and should not be used for
> > > Processor Family stuff. The microarchitecture codename should be used.
> > 
> > Why?
> > 
> > Denverton is the platform.  "Goldmont" is literally the
> > microarchitecture, and you are suggesting moving *to* the
> > microarchitecture name, which contradicts the description.
> 
> All the other (big core) are uarch names. Atom is weird in that it mixes
> uarch with platform names.

On most big cores the platform/SoC just happens to have the same name as the
uarch. But the identifiers really have to be per SoC because that
is how Intel model numbers work.

It should always be the SoC.

-Andi


Re: [PATCH] x86/cpu: Rename Denverton and Gemini Lake

2018-08-07 Thread Andi Kleen
On Tue, Aug 07, 2018 at 10:17:27AM -0700, kan.li...@linux.intel.com wrote:
> From: Kan Liang 
> 
> Denverton and Gemini Lake are platform names and should not be used for
> Processor Family stuff. The microarchitecture codename should be used.
> 
> DENVERTON is Goldmont server SoC. Rename DENVERTON to GOLDMONT2, similar
> to SILVERMONT2 being the silvermont server SoCs.
> 
> Rename GEMINI_LAKE to GOLDMONT_PLUS.

That doesn't make sense. Goldmont Plus can be in other SoCs
with different model numbers too. 0x7a really means the Gemini Lake SoC,
not every system containing a Goldmont Plus.

Also, all the other identifiers are for SoCs too.

-Andi


Re: [PATCH 2/3] x86, perf: Add a separate Arch Perfmon v4 PMI handler

2018-08-06 Thread Andi Kleen
On Mon, Aug 06, 2018 at 08:35:15PM +0200, Peter Zijlstra wrote:
> > +static bool disable_counter_freezing;
> > +module_param(disable_counter_freezing, bool, 0444);
> > +MODULE_PARM_DESC(disable_counter_freezing, "Disable counter freezing 
> > feature."
> > +   "The PMI handler will fall back to generic handler."
> > +   "Default is false (enable counter freezing feature).");
> 
> Why?

See the description. Counter freezing took some time to stabilize,
so it seemed better to have a knob that users can be asked to flip
in case more problems turn up.
> 
> > +   /*
> > +* Ack the PMU late after the APIC.  This avoids bogus
> 
> > +* freezing on Skylake CPUs.  The acking unfreezes the PMU
> > +*/
> That doesn't make sense. PMU and APIC do not have order.

It makes a difference for the hardware.
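
With counter freezing, clearing GLOBAL_STATUS is also what unfreezes the
counters, so where that ack sits relative to the APIC ack is visible to the
hardware -- roughly (sketch; the overflow handling is elided):

	u64 status;

	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, status);
	/* handle the overflowed counters while everything is still frozen */
	apic_write(APIC_LVTPC, APIC_DM_NMI);		/* ack the APIC first */
	wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, status);	/* ack PMU last: unfreezes */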

> > +   /*
> > +* For arch perfmon 4 use counter freezing to avoid
> > +* several MSR accesses in the PMI.
> > +*/
> > +   if (x86_pmu.counter_freezing) {
> > +   x86_pmu.handle_irq = intel_pmu_handle_irq_v4;
> > +   pr_cont("counter freezing, ");
> > +   }
> 
> Lets not print the counter freezing, we already print v4, right?

I find it useful to see that the kernel has the support; otherwise
you would need to look at the version number, which gets difficult
with backports. This is another paranoia bit, in case there
are problems.

> > @@ -561,6 +566,7 @@ struct x86_pmu {
> > struct x86_pmu_quirk *quirks;
> > int perfctr_second_write;
> > bool late_ack;
> > +   bool counter_freezing;
> 
> Please make the both of them int or something.

That would make them bigger for no reason?

-Andi



Re: [PATCH 2/4] trace: avoid calling cc-option -mrecord-mcount for every Makefile

2018-08-06 Thread Andi Kleen
On Mon, Aug 06, 2018 at 03:17:44PM +0200, Vasily Gorbik wrote:
> Currently if CONFIG_FTRACE_MCOUNT_RECORD is enabled -mrecord-mcount
> compiler flag support is tested for every Makefile.

Good catch.  Does it make a measurable compile time difference?

> 
> Top 4 cc-option usages:
> 511 -mrecord-mcount
>  11  -fno-stack-protector
>   9 -Wno-override-init
>   2 -fsched-pressure
> 
> To address that move cc-option from scripts/Makefile.build to top Makefile
> and export CC_USING_RECORD_MCOUNT to be used in original place.
> 
> While doing that also add -mrecord-mcount to CC_FLAGS_FTRACE (if gcc
> actually supports it).
> 
> Signed-off-by: Vasily Gorbik 

Acked-by: Andi Kleen 

-Andi


Re: [PATCH 3/7] x86/mm/init: pass unconverted symbol addresses to free_init_pages()

2018-08-05 Thread Andi Kleen
> [ Goes around and rummages ]
> 
> Oh, never mind, looking around reminded me why: we want to map the
> kernel text in the top 31 bits, so that we can use the faster
> -mcmodel=kernel because all symbols fit in sign-extended 32 bits.
> 
> Maybe there was some other reason too, but I think that's it.

No, that was the only reason.

The large code model would be extremely expensive, and a PIC-linked
kernel also had some issues. So we ended up with this setup.

-Andi


Re: Perf JSON events mismatch with Intel SDM

2018-07-16 Thread Andi Kleen
>The semantic difference being, as far as I can tell, that the "GT"
>counters do not sample randomly, but will rather offer a precise count of
>loads meeting the stated condition. 

It's all the same underlying hardware event. The only difference is what
threshold is used for filtering the loads.

-Andi


Re: Perf JSON events mismatch with Intel SDM

2018-07-15 Thread Andi Kleen
>  now need to dive in deep into the actual implementation. Simple question: is 
> this a bug?

These are derived events from MEM_TRANS_RETIRED.LOAD_LATENCY, which is in
the SDM.  It's not a bug. In general the event lists used by perf are 
newer versions than what is in the SDM.

-Andi


Re: [bug] kpti, perf_event, bts: sporadic truncated trace

2018-07-12 Thread Andi Kleen
"Metzger, Markus T"  writes:

Adding Dave Hansen

> Hello,
>
> Starting with 4.15 I noticed that BTS is sporadically missing the tail
> of the trace in the perf_event data buffer.  It shows as
>
> [decode error (1): instruction overflow]
>
> in GDB.  Chances to see this are higher the longer the debuggee is
> running.  With this [1] tiny patch to one of GDB's tests, I am able to
> reproduce it reliably on my box.  To run the test, use:
>
> $ make -s check RUNTESTFLAGS="gdb.btrace/exception.exp"
>
> from the gdb/ sub-directory in the GDB build directory.
>
> The issue remains when I use 'nopti' on the kernel command-line.
>
>
> Bisecting yielded commit
>
> c1961a4 x86/events/intel/ds: Map debug buffers in cpu_entry_area
>
> I reverted the commit on top of v4.17 [2] and the issue disappears
> when I use 'nopti' on the kernel command-line.
>
> regards,
> markus.
>
>
> [1]
> diff --git a/gdb/testsuite/gdb.btrace/exception.exp 
> b/gdb/testsuite/gdb.btrace/exception.exp
> index 9408d61..a24ddd3 100755
> --- a/gdb/testsuite/gdb.btrace/exception.exp
> +++ b/gdb/testsuite/gdb.btrace/exception.exp
> @@ -36,16 +36,12 @@ if ![runto_main] {
>  gdb_test_no_output "set record function-call-history-size 0"
>  
>  # set bp
> -set bp_1 [gdb_get_line_number "bp.1" $srcfile]
>  set bp_2 [gdb_get_line_number "bp.2" $srcfile]
> -gdb_breakpoint $bp_1
>  gdb_breakpoint $bp_2
>  
> -# trace the code between the two breakpoints
> -gdb_continue_to_breakpoint "cont to bp.1" ".*$srcfile:$bp_1\r\n.*"
>  # increase the BTS buffer size - the trace can be quite big
> -gdb_test_no_output "set record btrace bts buffer-size 128000"
> -gdb_test_no_output "record btrace"
> +gdb_test_no_output "set record btrace bts buffer-size 1024000"
> +gdb_test_no_output "record btrace bts"
>  gdb_continue_to_breakpoint "cont to bp.2" ".*$srcfile:$bp_2\r\n.*"
>  
>  # show the flat branch trace
>
>
> [2]
> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index 8a10a045b57b..46381f73c595 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -3,7 +3,6 @@
>  #include 
>  #include 
>  
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -282,67 +281,17 @@ void fini_debug_store_on_cpu(int cpu)
>  
>  static DEFINE_PER_CPU(void *, insn_buffer);
>  
> -static void ds_update_cea(void *cea, void *addr, size_t size, pgprot_t prot)
> -{
> - unsigned long start = (unsigned long)cea;
> - phys_addr_t pa;
> - size_t msz = 0;
> -
> - pa = virt_to_phys(addr);
> -
> - preempt_disable();
> - for (; msz < size; msz += PAGE_SIZE, pa += PAGE_SIZE, cea += PAGE_SIZE)
> - cea_set_pte(cea, pa, prot);
> -
> - /*
> -  * This is a cross-CPU update of the cpu_entry_area, we must shoot down
> -  * all TLB entries for it.
> -  */
> - flush_tlb_kernel_range(start, start + size);
> - preempt_enable();
> -}
> -
> -static void ds_clear_cea(void *cea, size_t size)
> -{
> - unsigned long start = (unsigned long)cea;
> - size_t msz = 0;
> -
> - preempt_disable();
> - for (; msz < size; msz += PAGE_SIZE, cea += PAGE_SIZE)
> - cea_set_pte(cea, 0, PAGE_NONE);
> -
> - flush_tlb_kernel_range(start, start + size);
> - preempt_enable();
> -}
> -
> -static void *dsalloc_pages(size_t size, gfp_t flags, int cpu)
> -{
> - unsigned int order = get_order(size);
> - int node = cpu_to_node(cpu);
> - struct page *page;
> -
> - page = __alloc_pages_node(node, flags | __GFP_ZERO, order);
> - return page ? page_address(page) : NULL;
> -}
> -
> -static void dsfree_pages(const void *buffer, size_t size)
> -{
> - if (buffer)
> - free_pages((unsigned long)buffer, get_order(size));
> -}
> -
>  static int alloc_pebs_buffer(int cpu)
>  {
> - struct cpu_hw_events *hwev = per_cpu_ptr(_hw_events, cpu);
> - struct debug_store *ds = hwev->ds;
> - size_t bsiz = x86_pmu.pebs_buffer_size;
> - int max, node = cpu_to_node(cpu);
> - void *buffer, *ibuffer, *cea;
> + struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
> + int node = cpu_to_node(cpu);
> + int max;
> + void *buffer, *ibuffer;
>  
>   if (!x86_pmu.pebs)
>   return 0;
>  
> - buffer = dsalloc_pages(bsiz, GFP_KERNEL, cpu);
> + buffer = kzalloc_node(x86_pmu.pebs_buffer_size, GFP_KERNEL, node);
>   if (unlikely(!buffer))
>   return -ENOMEM;
>  
> @@ -353,94 +302,99 @@ static int alloc_pebs_buffer(int cpu)
>   if (x86_pmu.intel_cap.pebs_format < 2) {
>   ibuffer = kzalloc_node(PEBS_FIXUP_SIZE, GFP_KERNEL, node);
>   if (!ibuffer) {
> - dsfree_pages(buffer, bsiz);
> + kfree(buffer);
>   return -ENOMEM;
>   }
>   per_cpu(insn_buffer, cpu) = ibuffer;
>   }
> - hwev->ds_pebs_vaddr = buffer;
> - /* Update the cpu entry area mapping */
> - cea = 


Re: 4.17.x won't boot due to "x86/boot/compressed/64: Handle 5-level paging boot if kernel is above 4G"

2018-07-06 Thread Andi Kleen
> At least we need to make user aware about risk of setting custom flags.

There are valid use cases to override the flags. I use it sometimes too,
and know some other people do too.

But you need to know what you're doing. 

Perhaps a warning during the build would be reasonable, so if you ask
for a build log you would see it.

-Andi


Re: 4.17.x won't boot due to "x86/boot/compressed/64: Handle 5-level paging boot if kernel is above 4G"

2018-07-03 Thread Andi Kleen
On Tue, Jul 03, 2018 at 11:26:09PM +0300, Kirill A. Shutemov wrote:
> On Tue, Jul 03, 2018 at 11:03:07AM -0700, Andi Kleen wrote:
> > On Tue, Jul 03, 2018 at 05:21:50PM +0300, Kirill A. Shutemov wrote:
> > > On Tue, Jul 03, 2018 at 03:44:03PM +0300, Kirill A. Shutemov wrote:
> > > > On Tue, Jul 03, 2018 at 01:24:49PM +0200, Gabriel C wrote:
> > > > > 2018-07-01 23:32 GMT+02:00 Benjamin Gilbert :
> > > > > > On Sun, Jul 01, 2018 at 05:15:59PM -0400, Benjamin Gilbert wrote:
> > > > > >> 4.17 kernels built with the CoreOS Container Linux toolchain and 
> > > > > >> kconfig,
> > > > > >> up to and including 4.17.3, fail to boot on AMD64 running in (at 
> > > > > >> least)
> > > > > >> QEMU/KVM.  No messages are shown post-GRUB; the VM instantly 
> > > > > >> reboots.
> > > > > >> Reverting commit 194a9749c73d ("x86/boot/compressed/64: Handle 
> > > > > >> 5-level
> > > > > >> paging boot if kernel is above 4G") fixes it.  I've attached our 
> > > > > >> kernel
> > > > > >> config for reference, and am happy to test patches, provide sample 
> > > > > >> QCOW
> > > > > >> images, etc.
> > > > > >
> > > > > 
> > > > > Also see https://bugzilla.kernel.org/show_bug.cgi?id=200385 ,
> > > > > 
> > > > > 0a1756bd2897951c03c1cb671bdfd40729ac2177 is acting up
> > > > > too with the same symptoms
> > > > 
> > > > I tracked it down to -flto in LDFLAGS. I'll look more into this.
> > > 
> > > -flto in LDFLAGS screws up this part of paging_prepare():
> > 
> > Where is that coming from? The LTO patches are not upstream.
> > 
> > And I don't see any LTO usage in the main line.
> 
> Apparently some distros try to hack it around:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=200385
> 
> I'm amazed that it kinda worked for them.

I think it only works on older gccs that don't default to
thin LTO, but always generate a fallback non-LTO object.
The kernel directly uses ld in the link step (without my patches), so LTO
should never be able to generate code.

The early boot code may be an exception to this, and it's likely
the only code that actually uses LTO in such a setup.

The -fPIC is actually scarier than the -flto. The generated code
must create quite a mess, and I'm not sure why you would even want that
because the kernel can be relocatable without it.

BTW I hope to eventually resend the full LTO patches.
They seem to get more and more users recently, mainly for smaller
code size.

> 
> 
> > >   /* Copy trampoline code in place */
> > >   memcpy(trampoline_32bit + TRAMPOLINE_32BIT_CODE_OFFSET / 
> > > sizeof(unsigned long),
> > >   _32bit_src, TRAMPOLINE_32BIT_CODE_SIZE);
> > 
> > 
> > > In particular, relocation for trampoline_32bit_src is resolved in the wrong
> > > way. Without -flto, we have a rip-relative address load:
> > > 
> > >   982d30: 48 8d 35 09 cc ff fflea-0x33f7(%rip),%rsi# 
> > > 97f940 
> > > 
> > > With -flto we have immediate load:
> > > 
> > >   982cf0: 48 c7 c6 f0 f8 97 00mov$0x97f8f0,%rsi
> > 
> > Strange.
> > 
> > Can you add some RELOC_HIDE()s and see if that helps?
> 
> Nope. No difference in generated code.

OK, we will need to put together a self-contained test case for the compiler
people. I'll try to take a look.

-Andi


Re: 4.17.x won't boot due to "x86/boot/compressed/64: Handle 5-level paging boot if kernel is above 4G"

2018-07-03 Thread Andi Kleen
On Tue, Jul 03, 2018 at 05:21:50PM +0300, Kirill A. Shutemov wrote:
> On Tue, Jul 03, 2018 at 03:44:03PM +0300, Kirill A. Shutemov wrote:
> > On Tue, Jul 03, 2018 at 01:24:49PM +0200, Gabriel C wrote:
> > > 2018-07-01 23:32 GMT+02:00 Benjamin Gilbert :
> > > > On Sun, Jul 01, 2018 at 05:15:59PM -0400, Benjamin Gilbert wrote:
> > > >> 4.17 kernels built with the CoreOS Container Linux toolchain and 
> > > >> kconfig,
> > > >> up to and including 4.17.3, fail to boot on AMD64 running in (at least)
> > > >> QEMU/KVM.  No messages are shown post-GRUB; the VM instantly reboots.
> > > >> Reverting commit 194a9749c73d ("x86/boot/compressed/64: Handle 5-level
> > > >> paging boot if kernel is above 4G") fixes it.  I've attached our kernel
> > > >> config for reference, and am happy to test patches, provide sample QCOW
> > > >> images, etc.
> > > >
> > > 
> > > Also see https://bugzilla.kernel.org/show_bug.cgi?id=200385 ,
> > > 
> > > 0a1756bd2897951c03c1cb671bdfd40729ac2177 is acting up
> > > too with the same symptoms
> > 
> > I tracked it down to -flto in LDFLAGS. I'll look more into this.
> 
> -flto in LDFLAGS screws up this part of paging_prepare():

Where is that coming from? The LTO patches are not upstream.

And I don't see any LTO usage in the main line.

> 
>   /* Copy trampoline code in place */
>   memcpy(trampoline_32bit + TRAMPOLINE_32BIT_CODE_OFFSET / 
> sizeof(unsigned long),
>   _32bit_src, TRAMPOLINE_32BIT_CODE_SIZE);


> In particular, relocation for trampoline_32bit_src is resolved in the wrong
> way. Without -flto, we have a rip-relative address load:
> 
>   982d30: 48 8d 35 09 cc ff fflea-0x33f7(%rip),%rsi# 
> 97f940 
> 
> With -flto we have immediate load:
> 
>   982cf0: 48 c7 c6 f0 f8 97 00mov$0x97f8f0,%rsi

Strange.

Can you add some RELOC_HIDE()s and see if that helps?
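
E.g. something like this (a sketch), to keep the compiler from folding the
symbol address into an absolute immediate:

	void *src = RELOC_HIDE(&trampoline_32bit_src, 0);

	memcpy(trampoline_32bit + TRAMPOLINE_32BIT_CODE_OFFSET / sizeof(unsigned long),
	       src, TRAMPOLINE_32BIT_CODE_SIZE);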

> It only would be okay if bootloader loads kernel at the address we compile
> it for. But it's not usually the case.
> 
> As result we copy garbage into trampoline and crash when trying to execute
> it.
> 
> I don't know how to solve it. As far as I know we don't support compiling
> kernel with LTO in mainline.

Right.


-Andi


Re: [RFC PATCH for 4.18] rseq: use __u64 for rseq_cs fields, validate user inputs

2018-07-03 Thread Andi Kleen
> 
> So I think you're good... But yes, you raise an interresting point.

So it sounds like architectures that don't have instruction-atomic u64
*_user accesses need to disable interrupts during the access, and somehow handle the
case when a page fault happens?
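
Per access, something like this (a sketch of the concern; ptr and val are
assumed locals, and the fault path is exactly the hard part):

	u32 lo, hi;

	local_irq_disable();
	/* a fault here must re-enable interrupts before it can be handled */
	if (__get_user(lo, (u32 __user *)ptr) ||
	    __get_user(hi, (u32 __user *)ptr + 1)) {
		local_irq_enable();
		return -EFAULT;
	}
	local_irq_enable();
	*val = ((u64)hi << 32) | lo;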

-Andi


Re: [PATCH 4/4 v3] perf stat: Add transaction flag (-T) support for s390

2018-06-25 Thread Andi Kleen
> > +bool metricgroup__has_metric(const char *metric)
> > +{
> > +   struct pmu_events_map *map = perf_pmu__find_map(NULL);
> > +   struct pmu_event *pe;
> > +   int i;
> > +
> > +   if (!map)
> > +   return false;
> > +
> > +   for (i = 0; ; i++) {
> > +   pe = >table[i];
> > +
> > +   if (!pe->name && !pe->metric_group && !pe->metric_name)
> > +   break;
> > +   if (!pe->metric_expr)
> > +   continue;
> > +   if (match_metric(pe->metric_group, metric) ||
> > +   match_metric(pe->metric_name, metric))

Why both the group and the metric?

I would just match the metric_name.

Otherwise it's impossible to have a group called "transaction".

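I.e. just (sketch):

	if (match_metric(pe->metric_name, metric))
		return true;
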
With that fixed it's ok for me.

Acked-by: Andi Kleen 

-Andi


Re: [PATCH 4/4] perf stat: Add transaction flag (-T) support for s390

2018-06-22 Thread Andi Kleen
> > 
> > You may need to add support for wildcard pmus to it.
> > 
> > -Andi
> > 
> 
> Currently only one PMU on s390 does support transactions, its the PMU with 
> counters
> named cpum_cf.

Right, but you don't want to hard code that pmu into builtin-stat.

-Andi


Re: [PATCH 4/4] perf stat: Add transaction flag (-T) support for s390

2018-06-21 Thread Andi Kleen
Thomas Richter  writes:
>  
> + /* Handle -T as -M transaction. Once platform specific metrics
> +  * support has been added to the json files, all architectures
> +  * will use this approach.
> +  */
> + if (!strcmp(perf_env__arch(NULL), "s390")) {

Use pmu_have_event() instead.

You may need to add support for wildcard pmus to it.
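
Roughly (sketch; the event name and the metric-setup helper here are
illustrative):

	/* probe for the PMU/event instead of testing the architecture string */
	if (pmu_have_event("cpum_cf", "TX_C_TABORT_NO_SPECIAL"))
		return setup_transaction_metrics();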

-Andi


Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

2018-06-19 Thread Andi Kleen
> In particular 1) means that any extra instructions executed/not executed
> will cause a replay divergence (in practice rr uses retired conditional
> branches rather than instructions, because the instruction counter is
> not accurate, while the branch one is). This alone causes a problem
> for the present case, because glibc branches on the xcr0 value before
> it branches on the cpuid value for AVX512. Glibc does check for the
> correct cpuid before calling xgetbv, so one possible thing to do is to
> completely disable xsave during recording by disabling it in CPUID, but
> that would make rr quite a bit less useful, since it wouldn't be able to

Ah, I see it now. This problem was introduced with the change to
glibc to save AVX registers using XSAVE instead of manually.

It still seems this has a straightforward fix in glibc though.

It could always allocate the worst-case buffer, and also
verify XGETBV against CPUID first. I'm sure this can be
done in a way where the executed branches don't differ.

AFAIK manual use of XSAVE is not that common, so hopefully
these problems are not widespread in other programs.

Of course longer term you'll just need to have matching
ISAs in record and replay. Trying to patch around this
is likely always difficult.

-Andi



Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

2018-06-18 Thread Andi Kleen
> Unfortunately, that is insufficient. Almost any difference in CPU behavior
> between the replayer
> and the replayee. In particular, the problems for rr here are
> 1) xgetbv, which can introduce differing register contents (and if
> code branches on this,
> potentially differences in control flow).

So you're saying that software checks XGETBV first before checking
CPUID?

That seems broken. Without AVX XGETBV may not exist.

You always have to check CPUID first. And then it should just branch
around the XGETBV.
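
I.e. roughly (user-space sketch):

	#include <cpuid.h>

	static int avx_usable(void)
	{
		unsigned int a, b, c, d;

		if (!__get_cpuid(1, &a, &b, &c, &d))
			return 0;
		if (!(c & bit_OSXSAVE) || !(c & bit_AVX))
			return 0;	/* branch around XGETBV entirely */
		/* XGETBV is guaranteed to exist only past this point */
		__asm__ volatile ("xgetbv" : "=a"(a), "=d"(d) : "c"(0));
		return (a & 0x6) == 0x6;	/* XMM and YMM state OS-enabled */
	}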

So we're talking about a workaround for broken software. The question
is how widespread it is.


> 2) xsave writing more memory than the trace expects, causing
> divergence in the memory
> contents of the process.

Do memory contents which are never read by the application matter?


-Andi



Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

2018-06-17 Thread Andi Kleen


The patch seems pointless if you can already control CPUID, which rr
supports. Just disable AVX512 in CPUID. All code using AVX should check
CPUID (or will fail anyway).

-Andi


Re: [PATCH] trace: fix SKIP_STACK_VALIDATION=1 build

2018-06-08 Thread Andi Kleen
On Fri, Jun 08, 2018 at 02:47:46PM -0700, Greg Thelen wrote:
> Non gcc-5 builds with CONFIG_STACK_VALIDATION=y and
> SKIP_STACK_VALIDATION=1 fail.
> Example output:
>   /bin/sh: init/.tmp_main.o: Permission denied
> 
> commit 96f60dfa5819 ("trace: Use -mcount-record for dynamic ftrace")
> added a mismatched endif.  This causes cmd_objtool to get mistakenly
> set.
> 
> Relocate endif to balance the newly added -record-mcount check.
> 
> Fixes: 96f60dfa5819 ("trace: Use -mcount-record for dynamic ftrace")
> Signed-off-by: Greg Thelen 

Looks good thanks.

Acked-by: Andi Kleen 

-Andi


Re: [PATCH 09/10] perf/cputime: Don't stop idle tick if there's live cputime event

2018-06-07 Thread Andi Kleen
On Thu, Jun 07, 2018 at 12:15:12AM +0200, Jiri Olsa wrote:
> Disable stopping of the idle tick when having live cputime
> event. When the tick is disabled, the idle counts are out
> of date until next tick/update and perf cputime PMU provides
> misleading counts.

I really don't like this. This can totally change performance
(e.g. less Turbo due to less idle) and performance tools shouldn't
change the performance profile drastically.

-Andi


Re: [RFC 00/10] perf: Add cputime events/metrics

2018-06-06 Thread Andi Kleen
> I had some issues with IDLE counter being miscounted due to stopping
> of the idle tick. I tried to solve it in this patch (it's part of the
> patchset):
>   perf/cputime: Don't stop idle tick if there's live cputime event
> 
> but I'm pretty sure it's wrong and there's better solution.

At least on Intel we already have hardware counters for different idle
states. You would just need to add them and convert to the same
unit.
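
E.g. the C-state residency MSRs (sketch; the unit conversion is the part
that needs care):

	u64 c6;

	/* counts at TSC rate on most parts; convert to ns via tsc_khz */
	rdmsrl(MSR_CORE_C6_RESIDENCY, c6);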

But of course it's still useful when this is not available.

> My current plan is now to read those counters in perf top/record/report
> to show (at least) the idle percentage for the current profile.

It's useful. Thanks for working on it. I was thinking about doing
something similar for some time.

-Andi


Re: [PATCH 01/10] perf tools: Uniquify the event name if there's no other matched event

2018-06-06 Thread Andi Kleen
On Thu, Jun 07, 2018 at 12:15:04AM +0200, Jiri Olsa wrote:
> Currently by default we try to match the user specified PMU
> name to all PMU units available and use them to aggregate
> all matched PMUs event counts into one 'pattern' event.
> 
> While this is useful for uncore events, it screws up names
> for other events, where this is not desirable, like:
> 
> Before:
>   # perf stat -e cp/cpu-cycles/ kill

I assume you mean cpU/cpu-cycles/
> 
>Performance counter stats for 'kill':
> 
>1,573,757  cp/cpu-cycles/
> 
> Keeping the pattern matching logic, but making the name unique
> in case there's no other match found. That fixes the example
> above and hopefully does not screw up anything else.
> 
> After:
>   # perf stat -e cp/cpu-cycles/ kill
> 
>Performance counter stats for 'kill':
> 
>1,573,757  cpu/cpu-cycles/


The output is 100% identical?

-Andi


Re: [PATCH V3 02/17] kallsyms, x86: Export addresses of syscall trampolines

2018-06-05 Thread Andi Kleen
> +#ifdef CONFIG_X86_64
> +int arch_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
> +  char *name)
> +{
> + unsigned int cpu, ncpu;
> +
> + if (symnum >= num_possible_cpus())
> + return -EINVAL;
> +
> + for (cpu = cpumask_first(cpu_possible_mask), ncpu = 0;
> +  cpu < num_possible_cpus() && ncpu < symnum;
> +  cpu = cpumask_next(cpu, cpu_possible_mask), ncpu++)
> + ;

That is max_t(unsigned, cpumask_last(cpu_possible_mask), symnum)

The rest and the other kernel patches look good to me.

Acked-by: Andi Kleen 

-Andi


Re: [PATCH v2]: perf record: enable arbitrary event names thru name= modifier

2018-06-01 Thread Andi Kleen
On Thu, May 31, 2018 at 05:08:12PM +0300, Alexey Budankov wrote:
> 
> Enable complex event names containing [.:=,] symbols to be encoded into Perf 
> trace using name= modifier e.g. like this:
> 
> perf record -e 
> cpu/name=\'OFFCORE_RESPONSE:request=DEMAND_RFO:response=L3_HIT.SNOOP_HITM\',\
>   period=0x3567e0,event=0x3c,cmask=0x1/Duk ./futex
> 
> Below is how it looks in the report output. Please note the explicit escaped
> quoting of the cmdline string in the header so that the string can be directly
> reused
> for another collection in shell:

Patch looks good, but the new syntax also needs to be documented in some manpage
(e.g. perf list)

-Andi


Re: [PATCH v2 0/3] perf script python: Add more PMU fields

2018-05-31 Thread Andi Kleen
On Fri, Jun 01, 2018 at 05:01:00PM +0800, Jin Yao wrote:
> When doing pmu sampling and then running a script with
> perf script -s script.py, the process_event function gets a
> dictionary with some fields from the perf ring buffer
> (like ip, sym, callchain etc).
> 
> But we miss quite a few fields that we report now, for example:
> LBRs, data source, weight, transaction, iregs, uregs, etc.
> 
> This patch adds these fields for perf script python.

Patches look good to me.

Reviewed-by: Andi Kleen 

-Andi


Re: [PATCH] perf util: Add more PMU fields for perf script python

2018-05-30 Thread Andi Kleen
On Wed, May 30, 2018 at 10:20:45PM +0800, Jin Yao wrote:
> When doing pmu sampling and then running a script with
> perf script -s script.py, the process_event function gets a
> dictionary with some fields from the perf ring buffer
> (like ip, sym, callchain etc).
> 
> But we miss quite a few fields that we report now, for example:
> LBRs, data source, weight, transaction, iregs, uregs, etc.
> 
> This patch reports these fields for perf script python
> processing.

We need documentation and an example script using it.

> + PyObject *pyelem;
> +
> + pyelem = PyDict_New();
> + if (!pyelem)
> + Py_FatalError("couldn't create Python dictionary");
> +

I think we need a field for the dso here.

> + pydict_set_item_string_decref(pyelem, "from",
> + PyLong_FromUnsignedLongLong(br->entries[i].from));
> + pydict_set_item_string_decref(pyelem, "to",
> + PyLong_FromUnsignedLongLong(br->entries[i].to));
> + pydict_set_item_string_decref(pyelem, "mispred",
> + PyLong_FromUnsignedLongLong(br->entries[i].flags.mispred));
> + pydict_set_item_string_decref(pyelem, "predicted",
> + 
> PyLong_FromUnsignedLongLong(br->entries[i].flags.predicted));
> + pydict_set_item_string_decref(pyelem, "in_tx",
> + PyLong_FromUnsignedLongLong(br->entries[i].flags.in_tx));
> + pydict_set_item_string_decref(pyelem, "abort",


These could be booleans.
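
For example, the single-bit flags could be exported with PyBool_FromLong
instead (a sketch against the quoted code; PyBool_FromLong returns a new
reference, so the same decref helper applies):

	/* export branch flags as Python booleans instead of integers */
	pydict_set_item_string_decref(pyelem, "mispred",
			PyBool_FromLong(br->entries[i].flags.mispred));
	pydict_set_item_string_decref(pyelem, "predicted",
			PyBool_FromLong(br->entries[i].flags.predicted));
	pydict_set_item_string_decref(pyelem, "in_tx",
			PyBool_FromLong(br->entries[i].flags.in_tx));
	pydict_set_item_string_decref(pyelem, "abort",
			PyBool_FromLong(br->entries[i].flags.abort));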

> + PyLong_FromUnsignedLongLong(br->entries[i].flags.abort));
> + pydict_set_item_string_decref(pyelem, "cycles",
> + PyLong_FromUnsignedLongLong(br->entries[i].flags.cycles));

It would be nice to get access to the binary code too (see how perf
script brstackinsn does it). That could be a follow-on patch.

Thanks,

-Andi


Re: [RFC] perf: Allow fine-grained PMU access control

2018-05-22 Thread Andi Kleen
> IMHO, it is unsafe for CBOX pmu but could IMC, UPI pmus be an exception here?
> Because currently perf stat -I from IMC, UPI counters is only allowed when 
> system wide monitoring is permitted and this prevents joint perf record and 
> perf stat -I in cluster environments where users usually lack ability to 
> modify paranoid. Adding Andi who may have more ideas regarding all that.

PMU isolation is about not making side channels worse. There are normally
already side channels from timing, but those have a degree of noise.

PMU isolation is just to prevent opening side channels with less noise.
But reducing noise is always a trade-off; it can never be perfect,
and at some point there are diminishing returns.

In general, the farther you are from the origin of the signal, the more
noise there already is. The PMU can reduce that noise, but if it's far
enough away it may not make much difference.

So there are always trade-offs in shades of grey, not a black-and-white
situation. Depending on your security requirements it may be totally
reasonable e.g. to allow the PMU on the memory controller (which is
already very noisy in any case), but not on the caches.

Or to allow it only on the graphics, which is already fairly isolated.

So per-PMU paranoid settings are a useful concept.
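
A rough sketch of the concept (the per-PMU field here is hypothetical;
today only the global sysctl exists):

/* Hypothetical: each struct pmu carries its own paranoid level and
 * falls back to the global sysctl when it is unset (-1).
 */
static int perf_pmu_paranoid(struct pmu *pmu)
{
	return pmu->paranoid >= 0 ? pmu->paranoid
				  : sysctl_perf_event_paranoid;
}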

-Andi


Re: x86: Fix AMD K6 indirect call check v2

2018-05-21 Thread Andi Kleen
On Mon, May 21, 2018 at 07:13:37AM -0400, tedheadster wrote:
> Andi,
>   I have a K6 regression testing system. I think my K6 is a revision
> C, but I probably can get an earlier cpu (with the bug) to test on.
> 
> Do you have any specific tests you would want me to do on the affected cpu?

You replied to an ancient email? I can barely remember what it was about.
But I think if it still boots it's likely fine.

-Andi


Re: [PATCH v2] perf tools: allow map files to specify DSO

2018-05-07 Thread Andi Kleen
On Mon, May 07, 2018 at 02:30:32PM -0700, Josh Hunt wrote:
> On 05/07/2018 11:40 AM, Andi Kleen wrote:
> > On Mon, May 07, 2018 at 02:24:16PM -0400, Josh Hunt wrote:
> > > Add the ability to specify a DSO in the /tmp/perf-.map file.
> > > The DSO should be the first line in the file and readable by the
> > > running user. If a valid DSO is found all other contents of the
> > > file will be ignored. This allows things like callchain unwinding
> > > with DWARF to work.
> > 
> > FWIW it's ok, but also obsolete with Kirill's large-pages-in-tmpfs
> > work in newer kernels. With that you can just copy the executable into
> > a 2MB tmpfs and disable the manual huge page copying and everything
> > should work as usual.
> > 
> > So essentially it's only a hack for old kernels and old binaries.
> > 
> > But doesn't hurt I guess.
> 
> Ah, very interesting. I wasn't aware of this. Can you point me to some more
> details on this process?

See commit 5a6e75f8110c97e2a5488894d4e922187e6cb343

-Andi


Re: [PATCH v2] perf tools: allow map files to specify DSO

2018-05-07 Thread Andi Kleen
On Mon, May 07, 2018 at 02:24:16PM -0400, Josh Hunt wrote:
> Add the ability to specify a DSO in the /tmp/perf-.map file.
> The DSO should be the first line in the file and readable by the
> running user. If a valid DSO is found all other contents of the
> file will be ignored. This allows things like callchain unwinding
> with DWARF to work.

FWIW it's ok, but also obsolete with Kirill's large-pages-in-tmpfs
work in newer kernels. With that you can just copy the executable into
a 2MB tmpfs and disable the manual huge page copying and everything
should work as usual.
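
A minimal sketch of that setup (assuming the tmpfs huge= mount option
from that work is available in the running kernel):

#include <err.h>
#include <sys/mount.h>

int main(void)
{
	/* needs root; /mnt/hugetmp must already exist. Files on this
	 * tmpfs are backed by 2MB pages, so copy the executable there
	 * and run it from that path.
	 */
	if (mount("tmpfs", "/mnt/hugetmp", "tmpfs", 0,
		  "huge=always,size=64m"))
		err(1, "mount tmpfs");
	return 0;
}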

So essentially it's only a hack for old kernels and old binaries.

But doesn't hurt I guess.

-Andi


Re: [PATCH 05/12] perf pmu: Fix pmu events parsing rule

2018-05-05 Thread Andi Kleen
Jiri Olsa  writes:

Please fix this quickly, PT is currently totally non functional in Linus
mainline.

-Andi


Re: [PATCH] perf vendor events: fix spelling mistake: "alloacated" -> "allocated"

2018-04-30 Thread Andi Kleen
On Mon, Apr 30, 2018 at 10:52:56AM -0300, Arnaldo Carvalho de Melo wrote:
> Em Fri, Apr 27, 2018 at 07:52:06PM +0100, Colin King escreveu:
> > From: Colin Ian King 
> > 
> > Trivial fix to spelling mistakes in json text strings
> 
> These files are generated from other files, so any future update to them
> will bring back the spelling mistakes, so, as Andi pointed out
> previously, we better fix the spellings in the original files,
> maintained by Intel.

Thanks. I forwarded to the right people.
-Andi


Re: [RFC PATCH] perf tools: allow map files to specify DSO

2018-04-24 Thread Andi Kleen
> Pekka, Andi, Stephane, do you guys see any problems with this?

Looks ok to me.

-Andi

> 
> If this flies, tools/perf/Documentation/jit-interface.txt needs
> updating.


Re: [PATCH] perf vendor events intel: Fix double words "are are" in cache.json

2018-04-24 Thread Andi Kleen
On Wed, Apr 25, 2018 at 12:02:49AM +0900, Masanari Iida wrote:
> This patch fixes double words "are are".

These files are auto-generated from another source. I don't think
it makes much sense to do a lot of editing on the Linux copies.

I will pass it on to the owners of the original files.

-Andi


[tip:perf/urgent] perf hists browser: Clarify top/report browser help

2018-04-21 Thread tip-bot for Andi Kleen
Commit-ID:  6a02f06edea5a5910c787fd6c49b0552e8080e5d
Gitweb: https://git.kernel.org/tip/6a02f06edea5a5910c787fd6c49b0552e8080e5d
Author: Andi Kleen <a...@linux.intel.com>
AuthorDate: Fri, 6 Apr 2018 13:38:10 -0700
Committer:  Arnaldo Carvalho de Melo <a...@redhat.com>
CommitDate: Wed, 18 Apr 2018 15:35:49 -0300

perf hists browser: Clarify top/report browser help

Clarify in the browser help that ESC in tui mode may go back to the
previous screen instead of just exiting (was not clear to me)

Signed-off-by: Andi Kleen <a...@linux.intel.com>
Acked-by: Jiri Olsa <jo...@kernel.org>
Link: http://lkml.kernel.org/r/20180406203812.3087-3-a...@firstfloor.org
Signed-off-by: Arnaldo Carvalho de Melo <a...@redhat.com>
---
 tools/perf/ui/browsers/hists.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/perf/ui/browsers/hists.c b/tools/perf/ui/browsers/hists.c
index 0eec06c105c6..e5f247247daa 100644
--- a/tools/perf/ui/browsers/hists.c
+++ b/tools/perf/ui/browsers/hists.c
@@ -2714,7 +2714,7 @@ static int perf_evsel__hists_browse(struct perf_evsel 
*evsel, int nr_events,
"h/?/F1Show this window\n"  \
"UP/DOWN/PGUP\n"\
"PGDN/SPACENavigate\n"  \
-   "q/ESC/CTRL+C  Exit browser\n\n"\
+   "q/ESC/CTRL+C  Exit browser or go back to previous screen\n\n"  \
"For multiple event sessions:\n\n"  \
"TAB/UNTAB Switch events\n\n"   \
"For symbolic views (--sort has sym):\n\n"  \


[tip:perf/urgent] perf record: Remove suggestion to enable APIC

2018-04-21 Thread tip-bot for Andi Kleen
Commit-ID:  ccbb6afe0890b09cc828373a9a5fffab40ec85df
Gitweb: https://git.kernel.org/tip/ccbb6afe0890b09cc828373a9a5fffab40ec85df
Author: Andi Kleen <a...@linux.intel.com>
AuthorDate: Fri, 6 Apr 2018 13:38:12 -0700
Committer:  Arnaldo Carvalho de Melo <a...@redhat.com>
CommitDate: Wed, 18 Apr 2018 15:35:50 -0300

perf record: Remove suggestion to enable APIC

'perf record' suggests enabling the APIC on errors.

APIC is practically always used today and the problem is usually
somewhere else.

Just remove the outdated suggestion.

Signed-off-by: Andi Kleen <a...@linux.intel.com>
Acked-by: Jiri Olsa <jo...@kernel.org>
Link: http://lkml.kernel.org/r/20180406203812.3087-5-a...@firstfloor.org
Signed-off-by: Arnaldo Carvalho de Melo <a...@redhat.com>
---
 tools/perf/util/evsel.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 66b62570c855..3e87486c28fe 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2870,8 +2870,7 @@ int perf_evsel__open_strerror(struct perf_evsel *evsel, 
struct target *target,
 #if defined(__i386__) || defined(__x86_64__)
if (evsel->attr.type == PERF_TYPE_HARDWARE)
return scnprintf(msg, size, "%s",
-   "No hardware sampling interrupt available.\n"
-   "No APIC? If so then you can boot the kernel with the \"lapic\" boot 
parameter to force-enable it.");
+   "No hardware sampling interrupt available.\n");
 #endif
break;
case EBUSY:


[tip:perf/urgent] perf record: Remove misleading error suggestion

2018-04-21 Thread tip-bot for Andi Kleen
Commit-ID:  ec3948451e0ba317e66873b48d6cc51d701d4eb0
Gitweb: https://git.kernel.org/tip/ec3948451e0ba317e66873b48d6cc51d701d4eb0
Author: Andi Kleen <a...@linux.intel.com>
AuthorDate: Fri, 6 Apr 2018 13:38:11 -0700
Committer:  Arnaldo Carvalho de Melo <a...@redhat.com>
CommitDate: Wed, 18 Apr 2018 15:35:49 -0300

perf record: Remove misleading error suggestion

When perf record encounters an error setting up an event, it suggests
enabling CONFIG_PERF_EVENTS. This is misleading because:

- Usually it is enabled (it is really hard to disable on x86)

- The problem is usually somewhere else, e.g. the CPU is not supported
or an invalid configuration has been used.

Remove the misleading suggestion.

Signed-off-by: Andi Kleen <a...@linux.intel.com>
Acked-by: Jiri Olsa <jo...@kernel.org>
Link: http://lkml.kernel.org/r/20180406203812.3087-4-a...@firstfloor.org
Signed-off-by: Arnaldo Carvalho de Melo <a...@redhat.com>
---
 tools/perf/util/evsel.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 1ac8d9236efd..66b62570c855 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2894,8 +2894,7 @@ int perf_evsel__open_strerror(struct perf_evsel *evsel, 
struct target *target,
 
return scnprintf(msg, size,
"The sys_perf_event_open() syscall returned with %d (%s) for event 
(%s).\n"
-   "/bin/dmesg may provide additional information.\n"
-   "No CONFIG_PERF_EVENTS=y kernel support configured?",
+   "/bin/dmesg | grep -i perf may provide additional information.\n",
 err, str_error_r(err, sbuf, sizeof(sbuf)),
 perf_evsel__name(evsel));
 }


[tip:perf/urgent] perf mem: Allow all record/report options

2018-04-21 Thread tip-bot for Andi Kleen
Commit-ID:  a7e9eab3dbd35268c16244557a4155a2d9a641c3
Gitweb: https://git.kernel.org/tip/a7e9eab3dbd35268c16244557a4155a2d9a641c3
Author: Andi Kleen <a...@linux.intel.com>
AuthorDate: Fri, 6 Apr 2018 13:38:09 -0700
Committer:  Arnaldo Carvalho de Melo <a...@redhat.com>
CommitDate: Wed, 18 Apr 2018 15:35:48 -0300

perf mem: Allow all record/report options

For perf mem report / perf mem record, pass all unknown options
through to the underlying report/record commands. This makes things
like

perf mem record -a sleep 1

work. Matches how c2c and other tools work.

Signed-off-by: Andi Kleen <a...@linux.intel.com>
Acked-by: Jiri Olsa <jo...@kernel.org>
Link: http://lkml.kernel.org/r/20180406203812.3087-2-a...@firstfloor.org
Signed-off-by: Arnaldo Carvalho de Melo <a...@redhat.com>
---
 tools/perf/Documentation/perf-mem.txt | 3 +++
 tools/perf/builtin-mem.c  | 4 ++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/tools/perf/Documentation/perf-mem.txt 
b/tools/perf/Documentation/perf-mem.txt
index b0211410969b..8806ed5f3802 100644
--- a/tools/perf/Documentation/perf-mem.txt
+++ b/tools/perf/Documentation/perf-mem.txt
@@ -67,6 +67,9 @@ OPTIONS
 --phys-data::
Record/Report sample physical addresses
 
+In addition, for report all perf report options are valid, and for record
+all perf record options.
+
 SEE ALSO
 
 linkperf:perf-record[1], linkperf:perf-report[1]
diff --git a/tools/perf/builtin-mem.c b/tools/perf/builtin-mem.c
index 506564651cda..57393e94d156 100644
--- a/tools/perf/builtin-mem.c
+++ b/tools/perf/builtin-mem.c
@@ -83,7 +83,7 @@ static int __cmd_record(int argc, const char **argv, struct 
perf_mem *mem)
};
 
argc = parse_options(argc, argv, options, record_mem_usage,
-PARSE_OPT_STOP_AT_NON_OPTION);
+PARSE_OPT_KEEP_UNKNOWN);
 
rec_argc = argc + 9; /* max number of arguments */
rec_argv = calloc(rec_argc + 1, sizeof(char *));
@@ -436,7 +436,7 @@ int cmd_mem(int argc, const char **argv)
}
 
argc = parse_options_subcommand(argc, argv, mem_options, 
mem_subcommands,
-   mem_usage, 
PARSE_OPT_STOP_AT_NON_OPTION);
+   mem_usage, PARSE_OPT_KEEP_UNKNOWN);
 
if (!argc || !(strncmp(argv[0], "rec", 3) || mem.operation))
usage_with_options(mem_usage, mem_options);


Re: [PATCH 03/35] x86/entry/32: Load task stack from x86_tss.sp1 in SYSENTER handler

2018-04-18 Thread Andi Kleen
On Wed, Apr 18, 2018 at 05:02:02PM -0700, Linus Torvalds wrote:
> On Wed, Apr 18, 2018 at 4:26 PM, Andi Kleen <a...@linux.intel.com> wrote:
> >
> > Seems like a hack. Why can't that be stored in a per cpu variable?
> 
> It *is* a percpu variable - the whole x86_tss structure is percpu.
> 
> I guess it could be a different (separate) percpu variable, but might
> as well use the space we already have allocated.

It would be better/cleaner to use a separate variable instead of reusing
x86 structures like this. Who knows what subtle side effects that
may have eventually.

It will also be easier to understand in the code.
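
E.g. something along these lines (names hypothetical, just to
illustrate the separate-variable idea):

/* Hypothetical dedicated slot instead of borrowing tss.sp1 */
DEFINE_PER_CPU(unsigned long, cpu_current_task_stack);

static inline void update_task_stack(struct task_struct *tsk)
{
	this_cpu_write(cpu_current_task_stack, task_top_of_stack(tsk));
}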

-Andi


Re: [PATCH 03/35] x86/entry/32: Load task stack from x86_tss.sp1 in SYSENTER handler

2018-04-18 Thread Andi Kleen
Joerg Roedel  writes:

> From: Joerg Roedel 
>
> We want x86_tss.sp0 to point to the entry stack later, to use
> it as a trampoline stack for other kernel entry points
> besides SYSENTER.
>
> So store the task stack pointer in x86_tss.sp1, which is
> otherwise unused by the hardware, as Linux doesn't make use
> of Ring 1.

Seems like a hack. Why can't that be stored in a per cpu variable?

-Andi


Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)

2018-04-16 Thread Andi Kleen
> Single-stepping is only a subset of the rseq limitations addressed
> by cpu_opv. Another major limitation is algorithms requiring data
> migration between per-cpu data structures safely against CPU hotplug,
> and without having to change the cpu affinity mask. This is the case

And how many people are going to implement such a complex separate
path just for CPU hotplug? And even if they implement it, how long
before it bitrots? It seems more like a checkbox item than a realistic
approach.

> for memory allocators and userspace task schedulers which require
> cpu_opv for migration between per-cpu memory pools and scheduler
> runqueues.

Not sure about that. Is that common?

> 
> About the vgettimeofday and general handling of vDSO by gdb, gdb's
> approach only takes care of line-by-line single-stepping by hiding
> Linux' vdso mapping so users cannot target source code lines within
> that shared object. However, it breaks instruction-level single-stepping.
> I reported this issue to you back in Nov. 2017:
> https://lkml.org/lkml/2017/11/20/803

It was known from day 1, but afaik never a problem.

-Andi


Re: [PATCH 17/17] perf annotate: Handle variables in 'sub', 'or' and many other instructions

2018-04-13 Thread Andi Kleen
> What do I miss? Or where is it that I'm misinterpreting the calculations
> that objdump did in its output?

The calculations are right, but these are still two different address modes.
You cannot just turn one silently into the other.

I think it would be ok to use the syntax in the assembler

symbol(%rip)  with no # ...

> About something mildly related: what do you think about this:
> http://ref.x86asm.net/, there is an xml file there[1] I'm thinking about
> using, if available on the developer's HOME or some other standard place,
> to provide help about the instructions :-)

I don't know how well it's going to be maintained. x86 changes a lot
and I've seen a lot of disassembler libraries etc. go stale as the
owner cannot keep up.

The only semi-official maintained descriptions are the XED tables (but those
don't have descriptions) or the PDFs from Intel/AMD.
I suppose one could have some hack that talks to a PDF reader and automatically
downloads/searches the PDF.

If unofficial is ok I would rather port some functionality
from https://github.com/HJLebbink/asm-dude
which has a lot of cool stuff.

-Andi


Re: [PATCH 17/17] perf annotate: Handle variables in 'sub', 'or' and many other instructions

2018-04-13 Thread Andi Kleen
On Fri, Apr 13, 2018 at 11:01:11AM -0300, Arnaldo Carvalho de Melo wrote:
> From: Arnaldo Carvalho de Melo 
> 
> Just like is done for 'mov' and others that can have as source or
> targets variables resolved by objdump, to make them more compact:
> 
> -   orb    $0x4,0x224d71(%rip)        # 226ca4 <_rtld_global+0xca4>
> +   orb    $0x4,_rtld_global+0xca4

That's not equivalent. It could be non-RIP-relative too. You would need
to keep at least the (%rip).

-Andi


Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)

2018-04-12 Thread Andi Kleen
> Can we plan on merging just the plain rseq parts *without* this all
> first, and then see the cpu_opv thing as a "maybe future expansion"
> part.

That would be the right way to go. I doubt anybody really needs cpu_opv.
We already have other code (e.g. vgettimeofday) which cannot 
be single-stepped, and so far it has never been a problem.

-Andi


Some minor fixes for perf user tools

2018-04-06 Thread Andi Kleen
This patchkit fixes some random minor issues in the perf user tools.

-Andi



[PATCH 1/4] perf, tools, mem: Allow all record/report options

2018-04-06 Thread Andi Kleen
From: Andi Kleen <a...@linux.intel.com>

For perf mem report / perf mem record, pass all unknown options
through to the underlying report/record commands. This makes things
like

perf mem record -a sleep 1

work. Matches how c2c and other tools work.

Signed-off-by: Andi Kleen <a...@linux.intel.com>
---
 tools/perf/Documentation/perf-mem.txt | 3 +++
 tools/perf/builtin-mem.c  | 4 ++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/tools/perf/Documentation/perf-mem.txt 
b/tools/perf/Documentation/perf-mem.txt
index b0211410969b..8806ed5f3802 100644
--- a/tools/perf/Documentation/perf-mem.txt
+++ b/tools/perf/Documentation/perf-mem.txt
@@ -67,6 +67,9 @@ OPTIONS
 --phys-data::
Record/Report sample physical addresses
 
+In addition, for report all perf report options are valid, and for record
+all perf record options.
+
 SEE ALSO
 
 linkperf:perf-record[1], linkperf:perf-report[1]
diff --git a/tools/perf/builtin-mem.c b/tools/perf/builtin-mem.c
index 506564651cda..57393e94d156 100644
--- a/tools/perf/builtin-mem.c
+++ b/tools/perf/builtin-mem.c
@@ -83,7 +83,7 @@ static int __cmd_record(int argc, const char **argv, struct 
perf_mem *mem)
};
 
argc = parse_options(argc, argv, options, record_mem_usage,
-PARSE_OPT_STOP_AT_NON_OPTION);
+PARSE_OPT_KEEP_UNKNOWN);
 
rec_argc = argc + 9; /* max number of arguments */
rec_argv = calloc(rec_argc + 1, sizeof(char *));
@@ -436,7 +436,7 @@ int cmd_mem(int argc, const char **argv)
}
 
argc = parse_options_subcommand(argc, argv, mem_options, 
mem_subcommands,
-   mem_usage, 
PARSE_OPT_STOP_AT_NON_OPTION);
+   mem_usage, PARSE_OPT_KEEP_UNKNOWN);
 
if (!argc || !(strncmp(argv[0], "rec", 3) || mem.operation))
usage_with_options(mem_usage, mem_options);
-- 
2.14.3



[PATCH 3/4] perf, tools, record: Remove misleading error suggestion

2018-04-06 Thread Andi Kleen
From: Andi Kleen <a...@linux.intel.com>

When perf record encounters an error setting up an event, it suggests
enabling CONFIG_PERF_EVENTS. This is misleading because:
- Usually it is enabled (it is really hard to disable on x86)
- The problem is usually somewhere else, e.g. the CPU is not supported
or an invalid configuration has been used.

Remove the misleading suggestion.

Signed-off-by: Andi Kleen <a...@linux.intel.com>
---
 tools/perf/util/evsel.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 1ac8d9236efd..66b62570c855 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2894,8 +2894,7 @@ int perf_evsel__open_strerror(struct perf_evsel *evsel, 
struct target *target,
 
return scnprintf(msg, size,
"The sys_perf_event_open() syscall returned with %d (%s) for event 
(%s).\n"
-   "/bin/dmesg may provide additional information.\n"
-   "No CONFIG_PERF_EVENTS=y kernel support configured?",
+   "/bin/dmesg | grep -i perf may provide additional information.\n",
 err, str_error_r(err, sbuf, sizeof(sbuf)),
 perf_evsel__name(evsel));
 }
-- 
2.14.3


