Re: [PATCH 1/2] powerpc: string: implement optimized memset variants

2017-04-18 Thread Michael Ellerman
Michael Ellerman  writes:

> "Naveen N. Rao"  writes:
>> (generic) is with Matt's arch-independent patches applied. Profiling 
>> indicates that most of the overhead is actually with the lzo 
>> decompression...
>>
>> Also, with a simple module to memset64() a 1GB vmalloc'ed buffer, here 
>> are the results:
>> generic:   0.245315533 seconds time elapsed ( +-  1.83% )
>> optimized: 0.169282701 seconds time elapsed ( +-  1.96% )
>
> Great, that's pretty conclusive.
>
> I'm pretty sure I can take these 2 patches independently of Matt's
> series, they just won't be used by much until his series goes in, so
> I'll do that unless someone yells.

Hmm, just went to merge these, but I don't see Matt's series in
linux-next, so I'll hold off for now.

cheers


Re: [PATCH 1/2] powerpc: string: implement optimized memset variants

2017-04-12 Thread Naveen N. Rao

Excerpts from PrasannaKumar Muralidharan's message of April 5, 2017 11:21:

On 30 March 2017 at 12:46, Naveen N. Rao wrote:

Also, with a simple module to memset64() a 1GB vmalloc'ed buffer, here
are the results:
generic:   0.245315533 seconds time elapsed ( +-  1.83% )
optimized: 0.169282701 seconds time elapsed ( +-  1.96% )


Wondering what makes gcc not produce efficient assembly code. Can you
please post the disassembly of the C implementation of memset64? Just
for informational purposes.


It's largely the same as what Christophe posted for powerpc32.

Others will have better insights, but AFAICS gcc only seems to unroll
the loop with -funroll-loops (which we don't use).
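For reference, the generic memset64() in Matt's series is essentially a
plain store loop. A sketch of it (from memory, not a verbatim copy of
his patch, which adds it to lib/string.c):

#include <stddef.h>
#include <stdint.h>

/* Sketch of the generic implementation: one 64-bit store per loop
 * iteration, which gcc keeps as-is at -O2 unless told to unroll. */
void *memset64(uint64_t *s, uint64_t v, size_t count)
{
	uint64_t *xs = s;

	while (count--)
		*xs++ = v;
	return s;
}

Compiled at -O2 without -funroll-loops, this stays a simple
store-per-iteration loop, which is what the hand-written variant is
competing against.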


As an aside, it looks like gcc recently picked up an optimization in v7 
that can also help (from https://gcc.gnu.org/gcc-7/changes.html):
"A new store merging pass has been added. It merges constant stores to 
adjacent memory locations into fewer, wider, stores. It is enabled by 
the -fstore-merging option and at the -O2 optimization level or higher 
(and -Os)."



- Naveen




Re: [PATCH 1/2] powerpc: string: implement optimized memset variants

2017-04-04 Thread PrasannaKumar Muralidharan
On 30 March 2017 at 12:46, Naveen N. Rao wrote:
> Also, with a simple module to memset64() a 1GB vmalloc'ed buffer, here
> are the results:
> generic:   0.245315533 seconds time elapsed ( +-  1.83% )
> optimized: 0.169282701 seconds time elapsed ( +-  1.96% )

Wondering what makes gcc not produce efficient assembly code. Can you
please post the disassembly of the C implementation of memset64? Just
for informational purposes.

Thanks,
Prasanna


Re: [PATCH 1/2] powerpc: string: implement optimized memset variants

2017-04-04 Thread Michael Ellerman
"Naveen N. Rao"  writes:
> (generic) is with Matt's arch-independent patches applied. Profiling 
> indicates that most of the overhead is actually with the lzo 
> decompression...
>
> Also, with a simple module to memset64() a 1GB vmalloc'ed buffer, here 
> are the results:
> generic:   0.245315533 seconds time elapsed ( +-  1.83% )
> optimized: 0.169282701 seconds time elapsed ( +-  1.96% )

Great, that's pretty conclusive.

I'm pretty sure I can take these 2 patches independently of Matt's
series, they just won't be used by much until his series goes in, so
I'll do that unless someone yells.

cheers


Re: [PATCH 1/2] powerpc: string: implement optimized memset variants

2017-03-30 Thread Naveen N. Rao
On 2017/03/29 10:36PM, Michael Ellerman wrote:
> "Naveen N. Rao"  writes:
> > I also tested zram today with the command shared by Wilcox:
> >
> > without patch: 1.493782568 seconds time elapsed ( +-  0.08% )
> > with patch:    1.408457577 seconds time elapsed ( +-  0.15% )
> >
> > ... which also shows an improvement along the same lines as x86, as 
> > reported by Minchan Kim.
> 
> I got:
> 
>   1.344847397 seconds time elapsed  ( +-  0.13% )
> 
> Using the C versions. Can you also benchmark those on your setup so we
> can compare? So basically apply Matt's series but not your 2.

Ok, with a more comprehensive test:
$ sudo modprobe zram
$ sudo zramctl -f -s 1G
# ~/tmp/1g has repeated 8 byte patterns
$ sudo bash -c "cat ~/tmp/1g > /dev/zram0"
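
(The thread doesn't show how ~/tmp/1g was generated. One plausible way
to produce a file of repeated 8-byte patterns is a small generator along
these lines; the file name and pattern values are assumptions, not what
was actually used:)

#include <stdio.h>
#include <stdint.h>

/* Hypothetical generator: fill each 4KB page with a single repeating
 * 8-byte value, varying the value from page to page. */
int main(void)
{
	uint64_t block[512];			/* one 4KB page */
	size_t page, i;
	FILE *f = fopen("1g", "wb");

	if (!f)
		return 1;

	for (page = 0; page < (1UL << 30) / sizeof(block); page++) {
		for (i = 0; i < 512; i++)
			block[i] = 0x0101010101010101ULL * (page % 255 + 1);
		if (fwrite(block, sizeof(block), 1, f) != 1)
			return 1;
	}

	fclose(f);
	return 0;
}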

Here are the results I got on a P8 vm with:
$ sudo ./perf stat -r 10 taskset -c 16-23 dd if=/dev/zram0 of=/dev/null

vanilla:   1.770592578 seconds time elapsed ( +-  0.07% )
generic:   1.728865141 seconds time elapsed ( +-  0.06% )
optimized: 1.695363255 seconds time elapsed ( +-  0.10% )

(generic) is with Matt's arch-independent patches applied. Profiling 
indicates that most of the overhead is actually with the lzo 
decompression...

Also, with a simple module to memset64() a 1GB vmalloc'ed buffer, here 
are the results:
generic:   0.245315533 seconds time elapsed ( +-  1.83% )
optimized: 0.169282701 seconds time elapsed ( +-  1.96% )
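
For anyone wanting to reproduce this, here is a minimal sketch of the
kind of test module described above (the structure and timing method
are my assumptions, not the actual module used for these numbers):

#include <linux/module.h>
#include <linux/vmalloc.h>
#include <linux/ktime.h>
#include <linux/string.h>

/* Sketch: vmalloc a 1GB buffer and time one memset64() pass over it.
 * Returning an error from init lets the module be insmod'ed repeatedly,
 * e.g. under perf stat -r. */
static int __init memset64_bench_init(void)
{
	const size_t size = 1UL << 30;		/* 1GB */
	u64 *buf = vmalloc(size);
	ktime_t t0, t1;

	if (!buf)
		return -ENOMEM;

	t0 = ktime_get();
	memset64(buf, 0x0102030405060708ULL, size / sizeof(u64));
	t1 = ktime_get();

	pr_info("memset64 of 1GB took %lld ns\n",
		ktime_to_ns(ktime_sub(t1, t0)));

	vfree(buf);
	return -EAGAIN;
}

module_init(memset64_bench_init);
MODULE_LICENSE("GPL");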


- Naveen



Re: [PATCH 1/2] powerpc: string: implement optimized memset variants

2017-03-29 Thread Michael Ellerman
"Naveen N. Rao"  writes:
> I also tested zram today with the command shared by Wilcox:
>
> without patch: 1.493782568 seconds time elapsed ( +-  0.08% )
> with patch:    1.408457577 seconds time elapsed ( +-  0.15% )
>
> ... which also shows an improvement along the same lines as x86, as 
> reported by Minchan Kim.

I got:

  1.344847397 seconds time elapsed  ( +-  0.13% )

Using the C versions. Can you also benchmark those on your setup so we
can compare? So basically apply Matt's series but not your 2.

cheers


Re: [PATCH 1/2] powerpc: string: implement optimized memset variants

2017-03-28 Thread Naveen N. Rao
On 2017/03/28 11:44AM, Michael Ellerman wrote:
> "Naveen N. Rao"  writes:
> 
> > diff --git a/arch/powerpc/lib/mem_64.S b/arch/powerpc/lib/mem_64.S
> > index 85fa9869aec5..ec531de6 100644
> > --- a/arch/powerpc/lib/mem_64.S
> > +++ b/arch/powerpc/lib/mem_64.S
> > @@ -13,6 +13,23 @@
> >  #include 
> >  #include 
> >  
> > +_GLOBAL(__memset16)
> > +   rlwimi  r4,r4,16,0,15
> > +   /* fall through */
> > +
> > +_GLOBAL(__memset32)
> > +   rldimi  r4,r4,32,0
> > +   /* fall through */
> > +
> > +_GLOBAL(__memset64)
> > +   neg r0,r3
> > +   andi.   r0,r0,7
> > +   cmplw   cr1,r5,r0
> > +   b   .Lms
> > +EXPORT_SYMBOL(__memset16)
> > +EXPORT_SYMBOL(__memset32)
> > +EXPORT_SYMBOL(__memset64)
> 
> You'll have to convince me that's better than what GCC produces.

Sure :) I got lazy last night and didn't post the test results...

I hadn't tested zram yesterday; I had only run tests with a naive test
module that memsets a large 1GB buffer with integers. With that test, I
saw:

without patch: 0.389253910 seconds time elapsed ( +-  1.49% )
with patch:    0.173269267 seconds time elapsed ( +-  1.55% )

... which is better than a 2x improvement.

I also tested zram today with the command shared by Wilcox:

without patch: 1.493782568 seconds time elapsed ( +-  0.08% )
with patch:    1.408457577 seconds time elapsed ( +-  0.15% )

... which also shows an improvement along the same lines as x86, as 
reported by Minchan Kim.


- Naveen



Re: [PATCH 1/2] powerpc: string: implement optimized memset variants

2017-03-27 Thread Michael Ellerman
"Naveen N. Rao"  writes:

> diff --git a/arch/powerpc/lib/mem_64.S b/arch/powerpc/lib/mem_64.S
> index 85fa9869aec5..ec531de6 100644
> --- a/arch/powerpc/lib/mem_64.S
> +++ b/arch/powerpc/lib/mem_64.S
> @@ -13,6 +13,23 @@
>  #include 
>  #include 
>  
> +_GLOBAL(__memset16)
> + rlwimi  r4,r4,16,0,15
> + /* fall through */
> +
> +_GLOBAL(__memset32)
> + rldimi  r4,r4,32,0
> + /* fall through */
> +
> +_GLOBAL(__memset64)
> + neg r0,r3
> + andi.   r0,r0,7
> + cmplw   cr1,r5,r0
> + b   .Lms
> +EXPORT_SYMBOL(__memset16)
> +EXPORT_SYMBOL(__memset32)
> +EXPORT_SYMBOL(__memset64)

You'll have to convince me that's better than what GCC produces.

cheers
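
For readers unfamiliar with the rlwimi/rldimi idiom in the patch quoted
above: the fall-through chain replicates the fill pattern up to 64 bits
before branching to the shared memset body at .Lms. A rough C
illustration of what the two instructions compute (a sketch only, not
the actual implementation):

#include <stdint.h>

/* rldimi r4,r4,32,0: copy the low 32 bits of the fill value into the
 * high 32 bits, replicating a 32-bit pattern across the doubleword. */
static inline uint64_t replicate32_to_64(uint64_t v)
{
	return (v & 0xffffffffULL) | (v << 32);
}

/* rlwimi r4,r4,16,0,15: first replicate the low 16 bits across the low
 * 32-bit word, then fall through to the 32->64 step above. */
static inline uint64_t replicate16_to_64(uint64_t v)
{
	v = (v & 0xffffULL) | ((v & 0xffffULL) << 16);
	return replicate32_to_64(v);
}

__memset64 itself needs no replication; its neg/andi./cmplw sequence
appears to compute how many bytes are needed to 8-byte-align the
destination and compare that with the length before branching into the
existing memset body.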