Re: [PATCH 3/3] powerpc: bpf: implement in-register swap for 64-bit endian operations

2017-01-24 Thread 'Naveen N. Rao'
On 2017/01/24 04:13PM, David Laight wrote:
> From: 'Naveen N. Rao'
> > Sent: 23 January 2017 19:22
> > On 2017/01/15 09:00AM, Benjamin Herrenschmidt wrote:
> > > On Fri, 2017-01-13 at 23:22 +0530, 'Naveen N. Rao' wrote:
> > > > > That rather depends on whether the processor has a store to load 
> > > > > forwarder
> > > > > that will satisfy the read from the store buffer.
> > > > > I don't know about ppc, but at least some x86 will do that.
> > > >
> > > > Interesting - good to know that.
> > > >
> > > > However, I don't think powerpc does that and in-register swap is likely
> > > > faster regardless. Note also that gcc prefers this form at higher
> > > > optimization levels.
> > >
> > > Of course powerpc has a load-store forwarder these days, however, I
> > > wouldn't be surprised if the in-register form was still faster on some
> > > implementations, but this needs to be tested.
> > 
> > Thanks for clarifying! To test this, I wrote a simple (perhaps naive)
> > test that just issues a whole lot of endian swaps and in _that_ test, it
> > does look like the load-store forwarder is doing pretty well.
> ...
> > This is all in a POWER8 vm. On POWER7, the in-register variant is around
> > 4 times faster than the ldbrx variant.
> ...
> 
> I wonder which is faster on the little 1GHz embedded ppc we use here.

Worth a test, for sure.
FWIW, this patch won't matter since eBPF JIT is for ppc64.

Thanks,
Naveen



RE: [PATCH 3/3] powerpc: bpf: implement in-register swap for 64-bit endian operations

2017-01-24 Thread David Laight
From: 'Naveen N. Rao'
> Sent: 23 January 2017 19:22
> On 2017/01/15 09:00AM, Benjamin Herrenschmidt wrote:
> > On Fri, 2017-01-13 at 23:22 +0530, 'Naveen N. Rao' wrote:
> > > > That rather depends on whether the processor has a store to load 
> > > > forwarder
> > > > that will satisfy the read from the store buffer.
> > > > I don't know about ppc, but at least some x86 will do that.
> > >
> > > Interesting - good to know that.
> > >
> > > However, I don't think powerpc does that and in-register swap is likely
> > > faster regardless. Note also that gcc prefers this form at higher
> > > optimization levels.
> >
> > Of course powerpc has a load-store forwarder these days, however, I
> > wouldn't be surprised if the in-register form was still faster on some
> > implementations, but this needs to be tested.
> 
> Thanks for clarifying! To test this, I wrote a simple (perhaps naive)
> test that just issues a whole lot of endian swaps and in _that_ test, it
> does look like the load-store forwarder is doing pretty well.
...
> This is all in a POWER8 vm. On POWER7, the in-register variant is around
> 4 times faster than the ldbrx variant.
...

I wonder which is faster on the little 1GHz embedded ppc we use here.

David




Re: [PATCH 3/3] powerpc: bpf: implement in-register swap for 64-bit endian operations

2017-01-23 Thread 'Naveen N. Rao'
On 2017/01/15 09:00AM, Benjamin Herrenschmidt wrote:
> On Fri, 2017-01-13 at 23:22 +0530, 'Naveen N. Rao' wrote:
> > > That rather depends on whether the processor has a store to load forwarder
> > > that will satisfy the read from the store buffer.
> > > I don't know about ppc, but at least some x86 will do that.
> > 
> > Interesting - good to know that.
> > 
> > However, I don't think powerpc does that and in-register swap is likely 
> > faster regardless. Note also that gcc prefers this form at higher 
> > optimization levels.
> 
> Of course powerpc has a load-store forwarder these days, however, I
> wouldn't be surprised if the in-register form was still faster on some
> implementations, but this needs to be tested.

Thanks for clarifying! To test this, I wrote a simple (perhaps naive) 
test that just issues a whole lot of endian swaps and in _that_ test, it 
does look like the load-store forwarder is doing pretty well.

The tests:

bpf-bswap.S:
---
.file   "bpf-bswap.S"
.abiversion 2
.section".text"
.align 2
.globl main
.type   main, @function
main:
mflr0
std 0,16(1)
stdu1,-32760(1)
addi3,1,32
li  4,0
li  5,32720
li  11,32720
mulli   11,11,8
li  10,0
li  7,16
1:  ldx 6,3,4
stdx6,1,7
ldbrx   6,1,7
stdx6,3,4
addi4,4,8
cmpd4,5
beq 2f
b   1b
2:  addi10,10,1
li  4,0
cmpd10,11
beq 3f
b   1b
3:  li  3,0
addi1,1,32760
ld  0,16(1)
mtlr0
blr

bpf-bswap-reg.S:
---
.file   "bpf-bswap-reg.S"
.abiversion 2
.section".text"
.align 2
.globl main
.type   main, @function
main:
mflr0
std 0,16(1)
stdu1,-32760(1)
addi3,1,32
li  4,0
li  5,32720
li  11,32720
mulli   11,11,8
li  10,0
1:  ldx 6,3,4
rldicl  7,6,32,32
rlwinm  8,6,24,0,31
rlwimi  8,6,8,8,15
rlwinm  9,7,24,0,31
rlwimi  8,6,8,24,31
rlwimi  9,7,8,8,15
rlwimi  9,7,8,24,31
rldicr  8,8,32,31
or  6,8,9
stdx6,3,4
addi4,4,8
cmpd4,5
beq 2f
b   1b
2:  addi10,10,1
li  4,0
cmpd10,11
beq 3f
b   1b
3:  li  3,0
addi1,1,32760
ld  0,16(1)
mtlr0
blr

Profiling the two variants:

# perf stat ./bpf-bswap

 Performance counter stats for './bpf-bswap':

   1395.979224  task-clock (msec) #0.999 CPUs utilized  

 0  context-switches  #0.000 K/sec  

 0  cpu-migrations#0.000 K/sec  

45  page-faults   #0.032 K/sec  

 4,651,874,673  cycles#3.332 GHz
  (66.87%)
 3,141,186  stalled-cycles-frontend   #0.07% frontend cycles 
idle (50.57%)
 1,117,289,485  stalled-cycles-backend#   24.02% backend cycles 
idle  (50.57%)
 8,565,963,861  instructions  #1.84  insn per cycle 

  #0.13  stalled cycles per 
insn  (67.05%)
 2,174,029,771  branches  # 1557.351 M/sec  
  (49.69%)
   262,656  branch-misses #0.01% of all branches
  (50.05%)

   1.396893189 seconds time elapsed

# perf stat ./bpf-bswap-reg

 Performance counter stats for './bpf-bswap-reg':

   1819.758102  task-clock (msec) #0.999 CPUs utilized  

 3  context-switches  #0.002 K/sec  

 0  cpu-migrations#0.000 K/sec  

44  page-faults   #0.024 K/sec  

 6,034,777,602  cycles#3.316 GHz
  (66.83%)
 2,010,983  stalled-cycles-frontend   #0.03% frontend cycles 
idle (50.47%)
 1,024,975,759  stalled-cycles-backend#   16.98% backend cycles 
idle  (50.52%)
16,043,732,849  instructions  #2.66  insn per cycle 

  #0.06  stalled cycles per 
insn  (67.01%)
 2,148,710,750  branches  # 1180.767 M/sec  
  (49.57%)
   268,046  branch-misses #0.01% of all branches
  (49.52%)

   1.821501345 seconds time elapsed


This is all 

Re: [PATCH 3/3] powerpc: bpf: implement in-register swap for 64-bit endian operations

2017-01-15 Thread Benjamin Herrenschmidt
On Fri, 2017-01-13 at 23:22 +0530, 'Naveen N. Rao' wrote:
> > That rather depends on whether the processor has a store to load forwarder
> > that will satisfy the read from the store buffer.
> > I don't know about ppc, but at least some x86 will do that.
> 
> Interesting - good to know that.
> 
> However, I don't think powerpc does that and in-register swap is likely 
> faster regardless. Note also that gcc prefers this form at higher 
> optimization levels.

Of course powerpc has a load-store forwarder these days, however, I
wouldn't be surprised if the in-register form was still faster on some
implementations, but this needs to be tested.

Ideally, you'd want to try to "optimize" load+swap or swap+store
though.

Cheers,
Ben.



Re: [PATCH 3/3] powerpc: bpf: implement in-register swap for 64-bit endian operations

2017-01-13 Thread 'Naveen N. Rao'
On 2017/01/13 05:17PM, David Laight wrote:
> From: Naveen N. Rao
> > Sent: 13 January 2017 17:10
> > Generate instructions to perform the endian conversion using registers,
> > rather than generating two memory accesses.
> > 
> > The "way easier and faster" comment was obviously for the author, not
> > the processor.
> 
> That rather depends on whether the processor has a store to load forwarder
> that will satisfy the read from the store buffer.
> I don't know about ppc, but at least some x86 will do that.

Interesting - good to know that.

However, I don't think powerpc does that and in-register swap is likely 
faster regardless. Note also that gcc prefers this form at higher 
optimization levels.

Thanks,
Naveen



RE: [PATCH 3/3] powerpc: bpf: implement in-register swap for 64-bit endian operations

2017-01-13 Thread David Laight
From: Naveen N. Rao
> Sent: 13 January 2017 17:10
> Generate instructions to perform the endian conversion using registers,
> rather than generating two memory accesses.
> 
> The "way easier and faster" comment was obviously for the author, not
> the processor.

That rather depends on whether the processor has a store to load forwarder
that will satisfy the read from the store buffer.
I don't know about ppc, but at least some x86 will do that.

David