Re: [PATCH v2 3/3] perf report: Implement visual marker for macro fusion in annotate

2017-06-19 Thread Jin, Yao



The reference for macro fusion is section 2.3.2.1 of the optimization guide,
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html

- In Intel microarchitecture code name Nehalem: CMP, TEST.
- In Intel microarchitecture code name Sandy Bridge: CMP, TEST, ADD, SUB,
  AND, INC, DEC.
- These instructions can fuse if:
  - the first source / destination operand is a register;
  - the second source operand (if it exists) is one of: immediate,
    register, or non RIP-relative memory.
- The second instruction of the macro-fusable pair is a conditional branch.

We probably don't need the full rules, just a simple test for
CMP/TEST/ADD/SUB/AND/INC/DEC followed by a Jcc conditional branch as the
second instruction. Also, I don't think we need to distinguish
Nehalem/Sandy Bridge and other core platforms. A simple test may be
acceptable.
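
A minimal sketch of such a simple test, assuming AT&T-style mnemonics as
printed by objdump (for example "cmpl", "testb"). The helper names
matches_base(), is_fusable_first(), is_cond_jump() and is_fused_pair() are
made up for illustration and are not taken from the patch. The operand rules
from the guide are deliberately ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not taken from the patch: true if `mnemonic` is
 * `base`, optionally followed by one AT&T size suffix (b, w, l or q).
 * This accepts "cmpl" but rejects "cmpxchg", "andn" or "addps".
 */
static bool matches_base(const char *mnemonic, const char *base)
{
	size_t n = strlen(base);

	if (strncmp(mnemonic, base, n) != 0)
		return false;
	if (mnemonic[n] == '\0')
		return true;
	return strchr("bwlq", mnemonic[n]) != NULL && mnemonic[n + 1] == '\0';
}

/* True for the first instruction of a candidate pair. */
static bool is_fusable_first(const char *mnemonic)
{
	static const char *const bases[] = {
		"cmp", "test", "add", "sub", "and", "inc", "dec",
	};
	size_t i;

	for (i = 0; i < sizeof(bases) / sizeof(bases[0]); i++) {
		if (matches_base(mnemonic, bases[i]))
			return true;
	}
	return false;
}

/*
 * True for a conditional jump. Simplification: anything starting with 'j'
 * that is not a jmp variant counts, which also admits jcxz/jecxz/jrcxz.
 */
static bool is_cond_jump(const char *mnemonic)
{
	return mnemonic[0] == 'j' && strncmp(mnemonic, "jmp", 3) != 0;
}

/* The simple test: fusable first instruction followed by a Jcc. */
static bool is_fused_pair(const char *first, const char *second)
{
	assert(first != NULL && second != NULL);
	return is_fusable_first(first) && is_cond_jump(second);
}
```

With this, is_fused_pair("cmp", "jne") is true, while a mov followed by jne
or a cmp followed by jmp is not reported as fused.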

Humm, then we need to make sure somehow that this may or may not be
happening, with the above rules and optimization guide URL and pages
mentioned in the documentation.

I think that as we improve the disassembler, the more precise we can go
the better. If we know that the machine is x86 _and_ Nehalem, then we
should do this fusing visual cue only for CMP and TEST, etc.

- Arnaldo
  


I will add a check for Nehalem (CMP, TEST). For other, newer Intel CPUs
we just check the full set by default (CMP, TEST, ADD, SUB, AND, INC, DEC).
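
A rough sketch of that selection, under stated assumptions: the family 6
model numbers listed for Nehalem (0x1a, 0x1e, 0x1f, 0x2e) are my
recollection and should be verified, the vendor check is omitted, and the
names is_nehalem() and fusable_set() are hypothetical, not the patch's API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static const char *const nehalem_set[] = { "cmp", "test" };
static const char *const default_set[] = {
	"cmp", "test", "add", "sub", "and", "inc", "dec",
};

/*
 * Assumption: family 6 models 0x1a, 0x1e, 0x1f and 0x2e are the Nehalem
 * parts. Check the list against the CPU documentation before relying on it.
 */
static bool is_nehalem(unsigned int family, unsigned int model)
{
	if (family != 6)
		return false;
	switch (model) {
	case 0x1a:
	case 0x1e:
	case 0x1f:
	case 0x2e:
		return true;
	default:
		return false;
	}
}

/* Hypothetical selector: returns the mnemonic set and stores its size. */
static const char *const *fusable_set(unsigned int family, unsigned int model,
				      size_t *count)
{
	assert(count != NULL);
	if (is_nehalem(family, model)) {
		*count = sizeof(nehalem_set) / sizeof(nehalem_set[0]);
		return nehalem_set;
	}
	/* Everything else falls back to the full set, as proposed above. */
	*count = sizeof(default_set) / sizeof(default_set[0]);
	return default_set;
}
```

For the Broadwell machine mentioned in this thread (family 6, model 61) the
sketch returns the seven-entry default set.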


Thanks
Jin Yao



Re: [PATCH v2 3/3] perf report: Implement visual marker for macro fusion in annotate

2017-06-19 Thread Arnaldo Carvalho de Melo
Em Tue, Jun 20, 2017 at 09:25:35AM +0800, Jin, Yao escreveu:
> 
> > Ok, thanks for making this per-arch! Some comments:
> > 
> > I think we should have this marked permanently, i.e. not just when we go
> > to the jump line, something like this (testing here in a t450s
> > broadwell, function hc_find_func, /usr/lib64/liblzma.so.5.2.2):
> > 
> > It is like this now, when we are not on the jne jump line:
> > 
> >    0.71 │       mov    %r14d,%r10d
> >         │       movzbl (%rdx,%r10,1),%ebp
> >    1.06 │ 70:   mov    (%r9,%rcx,4),%ecx
> >   77.98 │ 74:   cmp    %bpl,(%rbx,%r10,1)
> >         │     ↑ jne    70
> >    0.85 │       movzbl (%rdx),%r10d
> >    0.99 │       cmp    %r10b,(%rbx)
> > 
> > I think it should be augmented to:
> > 
> >    0.71 │       mov    %r14d,%r10d
> >         │       movzbl (%rdx,%r10,1),%ebp
> >    1.06 │ 70: ┌─mov    (%r9,%rcx,4),%ecx
> >   77.98 │ 74: └─cmp    %bpl,(%rbx,%r10,1)
> >         │     ↑ jne    70
> >    0.85 │       movzbl (%rdx),%r10d
> >    0.99 │       cmp    %r10b,(%rbx)
> > 
> > I.e. no arrow, the two instructions that end up as one micro-op being
> > connected.
> 
> The fused instruction pairs are:
> cmp + jcc
> test + jcc
> add + jcc
> sub + jcc
> and + jcc
> inc + jcc
> dec + jcc
> 
> Mov and cmp are not a fused instruction pair, so we don't need to connect

Right, my bad, what I was trying to say was to have a marker for fused
instructions, not just when we go with the cursor over it like with this
patchset.

> mov and cmp. I guess what Arnaldo wants is to connect two fused instructions
> even when we don't go to the jcc line. For example, a line would connect
> cmp and jne in the above case.

Right
 
> I have thought about that. However, the visualization may not be very good
> because the original arrow before jne would be overwritten. So for now I just
> implement a way that joins the jump arrow when we go to the jcc line.
> Another consideration is that the fused instruction pairs are very common
> instructions in code; if we mark them all, there may be too many markers.

perhaps

> > And then this:
> > 
> >         │   ┌──cmpl   $0x0,argp_program_version_hook
> >   81.93 │   │──je     20
> >         │   │  lock   cmpxchg %esi,0x38a9a4(%rip)
> >         │   │↓ jne    29
> >         │   │↓ jmp    43
> >   11.47 │20:└─→cmpxch %esi,0x38a999(%rip)
> > 
> > Would look better as:
> > 
> >         │   ┌──cmpl   $0x0,argp_program_version_hook
> >   81.93 │   ├──je     20
> >         │   │  lock   cmpxchg %esi,0x38a9a4(%rip)
> >         │   │↓ jne    29
> >         │   │↓ jmp    43
> >   11.47 │20:└─→cmpxch %esi,0x38a999(%rip)
> > 
> > Patch below, please test/ack :-)
> 
> I have tested. It's better! There is no space in the line. Thanks!
> 
> > This was the low hanging fruit, having the:
> > 
> >    1.06 │ 70: ┌─mov    (%r9,%rcx,4),%ecx
> >   77.98 │ 74: └─cmp    %bpl,(%rbx,%r10,1)
> > 
> 

Re: [PATCH v2 3/3] perf report: Implement visual marker for macro fusion in annotate

2017-06-19 Thread Jin, Yao



Ok, thanks for making this per-arch! Some comments:

I think we should have this marked permanently, i.e. not just when we go
to the jump line, something like this (testing here in a t450s
broadwell, function hc_find_func, /usr/lib64/liblzma.so.5.2.2):

It is like this now, when we are not on the jne jump line:

   0.71 │       mov    %r14d,%r10d
        │       movzbl (%rdx,%r10,1),%ebp
   1.06 │ 70:   mov    (%r9,%rcx,4),%ecx
  77.98 │ 74:   cmp    %bpl,(%rbx,%r10,1)
        │     ↑ jne    70
   0.85 │       movzbl (%rdx),%r10d
   0.99 │       cmp    %r10b,(%rbx)

I think it should be augmented to:

   0.71 │       mov    %r14d,%r10d
        │       movzbl (%rdx,%r10,1),%ebp
   1.06 │ 70: ┌─mov    (%r9,%rcx,4),%ecx
  77.98 │ 74: └─cmp    %bpl,(%rbx,%r10,1)
        │     ↑ jne    70
   0.85 │       movzbl (%rdx),%r10d
   0.99 │       cmp    %r10b,(%rbx)

I.e. no arrow, the two instructions that end up as one micro-op being
connected.


The fused instruction pairs are:
cmp + jcc
test + jcc
add + jcc
sub + jcc
and + jcc
inc + jcc
dec + jcc

Mov and cmp are not a fused instruction pair, so we don't need to
connect mov and cmp. I guess what Arnaldo wants is to connect two fused
instructions even when we don't go to the jcc line. For example, a line
would connect cmp and jne in the above case.


I have thought about that. However, the visualization may not be very good
because the original arrow before jne would be overwritten. So for now I
just implement a way that joins the jump arrow when we go to the jcc
line. Another consideration is that the fused instruction pairs are very
common instructions in code; if we mark them all, there may be too many
markers.
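
To get a feel for how many markers a permanent display would produce, a
listing can be scanned once and every fused pair flagged. The following is
only a sketch: pair_is_fused() is a simplified stand-in that matches exact,
unsuffixed mnemonics, and mark_fused_pairs() is a hypothetical name, not
something from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified predicate for this sketch: exact, unsuffixed mnemonics only. */
static bool pair_is_fused(const char *first, const char *second)
{
	static const char *const firsts[] = {
		"cmp", "test", "add", "sub", "and", "inc", "dec",
	};
	size_t i;

	if (second[0] != 'j' || strcmp(second, "jmp") == 0)
		return false;
	for (i = 0; i < sizeof(firsts) / sizeof(firsts[0]); i++) {
		if (strcmp(first, firsts[i]) == 0)
			return true;
	}
	return false;
}

/*
 * Sets marked[i] for every line that belongs to a fused pair and returns
 * the number of pairs found. `marked` must have room for `n` entries.
 */
static size_t mark_fused_pairs(const char *const *mnemonics, size_t n,
			       bool *marked)
{
	size_t i, pairs = 0;

	assert(mnemonics != NULL && marked != NULL);
	for (i = 0; i < n; i++)
		marked[i] = false;
	for (i = 0; i + 1 < n; i++) {
		if (pair_is_fused(mnemonics[i], mnemonics[i + 1])) {
			marked[i] = marked[i + 1] = true;
			pairs++;
		}
	}
	return pairs;
}
```

Run over the seven instructions of the hc_find_func excerpt above (mov,
movzbl, mov, cmp, jne, movzbl, cmp), this flags only the cmp and jne lines.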



And then this:

        │   ┌──cmpl   $0x0,argp_program_version_hook
  81.93 │   │──je     20
        │   │  lock   cmpxchg %esi,0x38a9a4(%rip)
        │   │↓ jne    29
        │   │↓ jmp    43
  11.47 │20:└─→cmpxch %esi,0x38a999(%rip)

Would look better as:

        │   ┌──cmpl   $0x0,argp_program_version_hook
  81.93 │   ├──je     20
        │   │  lock   cmpxchg %esi,0x38a9a4(%rip)
        │   │↓ jne    29
        │   │↓ jmp    43
  11.47 │20:└─→cmpxch %esi,0x38a999(%rip)

Patch below, please test/ack :-)


I have tested. It's better! There is no space in the line. Thanks!


This was the low hanging fruit, having the:

   1.06 │ 70: ┌─mov    (%r9,%rcx,4),%ecx
  77.98 │ 74: └─cmp    %bpl,(%rbx,%r10,1)

Marker always there, not just when we have the cursor on top of one of
those lines remains to be coded.


My comment is as above.


But you state:

  
 Macro fusion merges two instructions to a single micro-op. Intel core
 platform performs this hardware optimization under limited
 circumstances.
  

"Intel core", what about older arches, etc, don't you have to look at:

# cpudesc : Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
# cpuid : GenuineIntel,6,61,4

present in the perf.data header (or in the running system, for things
like 'perf top') to make sure that this is a machine 

Re: [PATCH v2 3/3] perf report: Implement visual marker for macro fusion in annotate

2017-06-19 Thread Arnaldo Carvalho de Melo
Em Mon, Jun 19, 2017 at 02:35:29PM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Mon, Jun 19, 2017 at 10:55:58AM +0800, Jin Yao escreveu:
> 
> Marker always there, not just when we have the cursor on top of one of
> those lines remains to be coded.
> 
> But you state:
> 
>  
> Macro fusion merges two instructions to a single micro-op. Intel core
> platform performs this hardware optimization under limited
> circumstances.
>  
> 
> "Intel core", what about older arches, etc, don't you have to look at:
> 
> # cpudesc : Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
> # cpuid : GenuineIntel,6,61,4
> 
> present in the perf.data header (or in the running system, for things
> like 'perf top') to make sure that this is a machine where such "macro
> fusion" takes place?

Ok, I have the patches that need this discussion to get to a conclusion
on a separate patch, tmp.perf/annotate, the first patch, the one that
returns the 'struct arch' for the browser to use arch specific stuff is
in perf/core and can go to Ingo now.

- Arnaldo


Re: [PATCH v2 3/3] perf report: Implement visual marker for macro fusion in annotate

2017-06-19 Thread Arnaldo Carvalho de Melo
Em Mon, Jun 19, 2017 at 10:55:58AM +0800, Jin Yao escreveu:
> To mark the fused instructions clearly, this patch adds a
> line before the first instruction of the pair and joins it with the
> arrow of the jump.
> 
> For example, when je is selected in annotate view, the line
> before cmpl is displayed and joins the arrow of je.
> 
>        │   ┌──cmpl   $0x0,argp_program_version_hook
>  81.93 │   │──je     20
>        │   │  lock   cmpxchg %esi,0x38a9a4(%rip)
>        │   │↓ jne    29
>        │   │↓ jmp    43
>  11.47 │20:└─→cmpxch %esi,0x38a999(%rip)

Ok, thanks for making this per-arch! Some comments:

I think we should have this marked permanently, i.e. not just when we go
to the jump line, something like this (testing here in a t450s
broadwell, function hc_find_func, /usr/lib64/liblzma.so.5.2.2):

It is like this now, when we are not on the jne jump line:

  0.71 │       mov    %r14d,%r10d
       │       movzbl (%rdx,%r10,1),%ebp
  1.06 │ 70:   mov    (%r9,%rcx,4),%ecx
 77.98 │ 74:   cmp    %bpl,(%rbx,%r10,1)
       │     ↑ jne    70
  0.85 │       movzbl (%rdx),%r10d
  0.99 │       cmp    %r10b,(%rbx)

I think it should be augmented to:

  0.71 │       mov    %r14d,%r10d
       │       movzbl (%rdx,%r10,1),%ebp
  1.06 │ 70: ┌─mov    (%r9,%rcx,4),%ecx
 77.98 │ 74: └─cmp    %bpl,(%rbx,%r10,1)
       │     ↑ jne    70
  0.85 │       movzbl (%rdx),%r10d
  0.99 │       cmp    %r10b,(%rbx)

I.e. no arrow, the two instructions that end up as one micro-op being
connected.

And then this:

        │   ┌──cmpl   $0x0,argp_program_version_hook
  81.93 │   │──je     20
        │   │  lock   cmpxchg %esi,0x38a9a4(%rip)
        │   │↓ jne    29
        │   │↓ jmp    43
  11.47 │20:└─→cmpxch %esi,0x38a999(%rip)

Would look better as:

        │   ┌──cmpl   $0x0,argp_program_version_hook
  81.93 │   ├──je     20
        │   │  lock   cmpxchg %esi,0x38a9a4(%rip)
        │   │↓ jne    29
        │   │↓ jmp    43
  11.47 │20:└─→cmpxch %esi,0x38a999(%rip)
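
The only difference between the two renderings is the glyph on the jump
row: a tee that joins the vertical line instead of a bar followed by a gap.
As an illustration only (the enum and both function names are hypothetical,
and this is not how ui/browser.c draws it), the choice of glyph per row for
a forward jump could be sketched as:

```c
#include <assert.h>
#include <stddef.h>

enum connector { CONN_NONE, CONN_TOP, CONN_TEE, CONN_VERT, CONN_ARROW };

/*
 * Hypothetical helper: which connector a row gets, given the row of the
 * first fused instruction, the row of the jump and the row of the jump
 * target. Assumes first < jump < target, i.e. a forward jump.
 */
static enum connector connector_kind(size_t row, size_t first, size_t jump,
				     size_t target)
{
	assert(first < jump && jump < target);
	if (row == first)
		return CONN_TOP;	/* line starts at the fused instruction */
	if (row == jump)
		return CONN_TEE;	/* joins the line on the jump row */
	if (row == target)
		return CONN_ARROW;	/* arrow head at the destination */
	if (row > first && row < target)
		return CONN_VERT;	/* vertical run down to the target */
	return CONN_NONE;
}

/* Maps a connector kind to the UTF-8 glyphs used in the listing above. */
static const char *connector_glyph(enum connector kind)
{
	switch (kind) {
	case CONN_TOP:
		return "┌──";
	case CONN_TEE:
		return "├──";
	case CONN_VERT:
		return "│";
	case CONN_ARROW:
		return "└─→";
	default:
		return " ";
	}
}
```

For the example above, rows 0 to 5 with first = 0, jump = 1 and target = 5
yield top, tee, three vertical bars and the arrow.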

Patch below, please test/ack :-)

This was the low hanging fruit, having the:

  1.06 │ 70: ┌─mov    (%r9,%rcx,4),%ecx
 77.98 │ 74: └─cmp    %bpl,(%rbx,%r10,1)

Marker always there, not just when we have the cursor on top of one of
those lines remains to be coded.

But you state:

 
Macro fusion merges two instructions to a single micro-op. Intel core
platform performs this hardware optimization under limited
circumstances.
 

"Intel core", what about older arches, etc, don't you have to look at:

# cpudesc : Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
# cpuid : GenuineIntel,6,61,4

present in the perf.data header (or in the running system, for things
like 'perf top') to make sure that this is a machine where such "macro
fusion" takes place?
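
For illustration, the cpuid string shown above has the shape
"vendor,family,model,stepping", so it could be split with a single sscanf()
call. This is only a sketch: struct cpu_id and parse_cpuid() are
hypothetical names, and reading the string out of the perf.data header is
not shown.

```c
#include <assert.h>
#include <stdio.h>

struct cpu_id {
	unsigned int family;
	unsigned int model;
	unsigned int stepping;
};

/*
 * Parses "vendor,family,model,stepping", e.g. "GenuineIntel,6,61,4".
 * Returns 0 on success, -1 if the string does not have that shape.
 */
static int parse_cpuid(const char *cpuid, struct cpu_id *out)
{
	assert(cpuid != NULL && out != NULL);
	/* %*[^,] skips the vendor field without storing it. */
	if (sscanf(cpuid, "%*[^,],%u,%u,%u",
		   &out->family, &out->model, &out->stepping) != 3)
		return -1;
	return 0;
}
```

For "GenuineIntel,6,61,4" this gives family 6, model 61, stepping 4; the
family and model can then be used to decide whether, and for which
instructions, to show the marker.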

- Arnaldo

diff --git a/tools/perf/ui/browser.c b/tools/perf/ui/browser.c
index acba636bd165..9ef7677ae14f 100644
--- a/tools/perf/ui/browser.c
+++ b/tools/perf/ui/browser.c
@@ -756,8 +756,10 @@ void ui_browser__mark_fused(struct ui_browser *browser, 
