Re: [Caml-list] Linear Scan Register Allocator for ocamlopt/ocamlnat

2011-08-30 Thread Fermin Reig

Pierre-Alexandre Voye wrote:

Note that the ocaml compiler has a flag -cmm which outputs C-- ast code.
F. Reig made a c-- ocaml backend during his thesis.  Including a GC.
Unhappily, sources code haven't been released.
But it proves it works.


If anyone is interested, the dissertation is available at 
http://theses.gla.ac.uk/686/


Fermin



Le 26 août 2011 14:30, Erik de Castro Lopo mle+oc...@mega-nerd.com 
mailto:mle%2boc...@mega-nerd.com a écrit :


Pierre-Alexandre Voye wrote:

 I have a stupid question : I wonder if it would not be a bad idea th...

I have some experience in thie area. I work on the DDC compiler [0]
a compiler for a strict by default (optionally lazy) evaluation
dialect of Haskell.

When I joined the project the compiler had a working compile via C
backend, to which I added an LLVM backend [1].

Executables compiled via the LLVM backend (even without exploring
any of the LLVM optimisation passes) were faster than the same
executables compiled via C (gcc -O2). I suspect this is because
the generated C code was nothing like the C code people write and
the GCC is only good at optimising idiomatic C code.

I highly recommend LLVM as a compiler backend.

HTH,
Erik

[0] http://disciple.ouroborus.net/
[1] http://www.mega-nerd.com/erikd/Blog/CodeHacking/DDC/index.html

--
--
Erik de Castro Lopo
http://www.mega-nerd.com/


--
Caml-list mailing list. Subscription management and archives:
https://sympa-roc.inria.fr/wws/i...




--
Caml-list mailing list.  Subscription management and archives:
https://sympa-roc.inria.fr/wws/info/caml-list
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs



Re: [Caml-list] Linear Scan Register Allocator for ocamlopt/ocamlnat

2011-08-26 Thread Guillaume Yziquel
Le Thursday 25 Aug 2011 à 12:25:49 (+0200), Pierre-Alexandre Voye a écrit :
I have a stupid question : I wonder if it would not be a bad idea that
Ocaml output C code and let gcc do its work, so compile code with good
performances in a lot of architecture ? Gcc is able to do
autovectorization (SSE, MMX, Larabee in the futur, etc...), very
specific processor optimization, etc...
But maybe it's a stupid idea ?

I do not think it's a stupid idea. I've been quite stunned by how the
Mercury compiler does it (using cpp macros and so-called virtual
registers in C code). It's likely a pain however, and most compilers
I've seen using the compile to C approach usually use Boehm's GC, so
it may be quite a pain to adapt the OCaml runtime to it. However, gcc
isn't the golden bullet for everything. It definitely sucks with the
SIMD instruction set.

-- 
 Guillaume Yziquel


-- 
Caml-list mailing list.  Subscription management and archives:
https://sympa-roc.inria.fr/wws/info/caml-list
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs



Re: [Caml-list] Linear Scan Register Allocator for ocamlopt/ocamlnat

2011-08-26 Thread Erik de Castro Lopo
Pierre-Alexandre Voye wrote:

 I have a stupid question : I wonder if it would not be a bad idea that Ocaml
 output C code and let gcc do its work, so compile code with good
 performances in a lot of architecture ? Gcc is able to do autovectorization
 (SSE, MMX, Larabee in the futur, etc...), very specific processor
 optimization, etc...
 But maybe it's a stupid idea ?

I have some experience in thie area. I work on the DDC compiler [0]
a compiler for a strict by default (optionally lazy) evaluation
dialect of Haskell.

When I joined the project the compiler had a working compile via C
backend, to which I added an LLVM backend [1].

Executables compiled via the LLVM backend (even without exploring
any of the LLVM optimisation passes) were faster than the same
executables compiled via C (gcc -O2). I suspect this is because
the generated C code was nothing like the C code people write and
the GCC is only good at optimising idiomatic C code.

I highly recommend LLVM as a compiler backend.

HTH,
Erik

[0] http://disciple.ouroborus.net/
[1] http://www.mega-nerd.com/erikd/Blog/CodeHacking/DDC/index.html

-- 
--
Erik de Castro Lopo
http://www.mega-nerd.com/

-- 
Caml-list mailing list.  Subscription management and archives:
https://sympa-roc.inria.fr/wws/info/caml-list
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs



Re: [Caml-list] Linear Scan Register Allocator for ocamlopt/ocamlnat

2011-08-26 Thread Pierre-Alexandre Voye
Note that the ocaml compiler has a flag -cmm which outputs C-- ast code.
F. Reig made a c-- ocaml backend during his thesis.  Including a GC.
Unhappily, sources code haven't been released.
But it proves it works.

Le 26 août 2011 14:30, Erik de Castro Lopo mle+oc...@mega-nerd.com a
écrit :

Pierre-Alexandre Voye wrote:

 I have a stupid question : I wonder if it would not be a bad idea th...
I have some experience in thie area. I work on the DDC compiler [0]
a compiler for a strict by default (optionally lazy) evaluation
dialect of Haskell.

When I joined the project the compiler had a working compile via C
backend, to which I added an LLVM backend [1].

Executables compiled via the LLVM backend (even without exploring
any of the LLVM optimisation passes) were faster than the same
executables compiled via C (gcc -O2). I suspect this is because
the generated C code was nothing like the C code people write and
the GCC is only good at optimising idiomatic C code.

I highly recommend LLVM as a compiler backend.

HTH,
Erik

[0] http://disciple.ouroborus.net/
[1] http://www.mega-nerd.com/erikd/Blog/CodeHacking/DDC/index.html

--
--
Erik de Castro Lopo
http://www.mega-nerd.com/


-- 
Caml-list mailing list. Subscription management and archives:
https://sympa-roc.inria.fr/wws/i...

-- 
Caml-list mailing list.  Subscription management and archives:
https://sympa-roc.inria.fr/wws/info/caml-list
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs



Re: [Caml-list] Linear Scan Register Allocator for ocamlopt/ocamlnat

2011-08-25 Thread Benedikt Meurer

On Aug 24, 2011, at 22:40 , Gerd Stolpmann wrote:

 - the performance cost of this new allocator in the generated code? I
 suppose the results may vary between different architectures (eg. x86
 is probably more sensitive to good allocation decisions than x86_64).
 
 - http://ps.informatik.uni-siegen.de/~meurer/tmp/compiletime_timings.pdf 
 contains a comparison of the ocamlopt invocations.
 - http://ps.informatik.uni-siegen.de/~meurer/tmp/runtime_timings.pdf 
 contains comparison of the generated code.
 
 As can be seen from the results, amd64 is more sensitive to register 
 allocator changes than i386. 
 
 Well, this particular i386 CPU model is a strange guy - Northwoods have
 this extremely long pipeline, which is very sensitive to unforeseen
 jumps. It would be more interesting to see this test on a modern CPU in
 i386 mode. My guess is that it behaves then more like amd64.

Modern CPUs most probably don't run ocaml in 32bit mode, but more likely in 
long mode. That's why we choose to run the i386 on real 32bit hardware, where 
the ocaml i386 port is actually used.

I'll rerun the benchmark in 32bit mode on a modern cpu to see how things change.

 Gerd

Benedikt

-- 
Caml-list mailing list.  Subscription management and archives:
https://sympa-roc.inria.fr/wws/info/caml-list
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs



Re: [Caml-list] Linear Scan Register Allocator for ocamlopt/ocamlnat

2011-08-25 Thread Gerd Stolpmann
Am Donnerstag, den 25.08.2011, 11:34 +0200 schrieb Benedikt Meurer:
 On Aug 25, 2011, at 10:02 , Benedikt Meurer wrote:
 
  - http://ps.informatik.uni-siegen.de/~meurer/tmp/compiletime_timings.pdf 
  contains a comparison of the ocamlopt invocations.
  - http://ps.informatik.uni-siegen.de/~meurer/tmp/runtime_timings.pdf 
  contains comparison of the generated code.
  
  As can be seen from the results, amd64 is more sensitive to register 
  allocator changes than i386. 
  
  Well, this particular i386 CPU model is a strange guy - Northwoods have
  this extremely long pipeline, which is very sensitive to unforeseen
  jumps. It would be more interesting to see this test on a modern CPU in
  i386 mode. My guess is that it behaves then more like amd64.
  
  Modern CPUs most probably don't run ocaml in 32bit mode, but more likely in 
  long mode. That's why we choose to run the i386 on real 32bit hardware, 
  where the ocaml i386 port is actually used.

Right. However, if you look at new 32-bit-only CPUs like older Atoms
these base on modern cores where only some of the circuits were omitted
that are needed for 64-bit execution.
  
  I'll rerun the benchmark in 32bit mode on a modern cpu to see how things 
  change.
 
 Reran the benchmark with 32bit ocaml on the MBP (Early 2011), results are 
 available at:
 
 http://ps.informatik.uni-siegen.de/~meurer/tmp/linscan-i7-i386-timings.pdf

Hard to interpret. I have the impression that the compile times for
graph coloring on i386 are lower than on amd64 - has maybe to do with
the number of registers - but those for linear scanning are about the
same. (I'm comparing here with the amd64 numbers taken on the same
machine.)

The graphs for the runtimes are on a logarithmic scale. Do you also have
raw numbers? 

Gerd

 
 Benedikt
 
 
 

-- 

Gerd Stolpmann, Darmstadt, Germanyg...@gerd-stolpmann.de
Creator of GODI and camlcity.org.
Contact details:http://www.camlcity.org/contact.html
Company homepage:   http://www.gerd-stolpmann.de
*** Searching for new projects! Need consulting for system
*** programming in Ocaml? Gerd Stolpmann can help you.



-- 
Caml-list mailing list.  Subscription management and archives:
https://sympa-roc.inria.fr/wws/info/caml-list
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs



Re: [Caml-list] Linear Scan Register Allocator for ocamlopt/ocamlnat

2011-08-25 Thread Benedikt Meurer

On Aug 25, 2011, at 12:21 , Gerd Stolpmann wrote:

 Reran the benchmark with 32bit ocaml on the MBP (Early 2011), results are 
 available at:
 
 http://ps.informatik.uni-siegen.de/~meurer/tmp/linscan-i7-i386-timings.pdf
 
 Hard to interpret. I have the impression that the compile times for
 graph coloring on i386 are lower than on amd64 - has maybe to do with
 the number of registers - but those for linear scanning are about the
 same. (I'm comparing here with the amd64 numbers taken on the same
 machine.)
 
 The graphs for the runtimes are on a logarithmic scale. Do you also have
 raw numbers? 

Of course.

These are the raw numbers for the i7 (i386):

command  opt_gc opt_ls   opt_gc/ls
--
binarytree 16 1.096  1.102   0.995
fasta 100 0.008  0.008   1.000
fasta 25  0.044  0.044   1.000
fasta 25002.767  2.737   1.011
mandelbrot 1000   0.224  0.224   1.000
mandelbrot 4000   3.583  3.570   1.004
meteor 2098   0.578  0.586   0.986
nbody 300 1.142  1.168   0.978
spectral 3000 4.963  5.243   0.947
almabench 4.665  4.687   0.995
almabench.unsafe  4.629  4.702   0.984
bdd   0.279  0.285   0.979
boyer 0.566  0.572   0.990
fft   0.421  0.422   0.998
nucleic   0.499  0.513   0.973
quicksort 0.150  0.149   1.007
quicksort.unsafe  0.121  0.121   1.000
sorts 1.889  1.947   0.970

And the raw numbers for the P4:

command  opt_gc opt_ls   opt_gc/ls
--
binarytree 16 2.796  2.780   1.006
fasta 100 0.024  0.024   1.000
fasta 25  0.404  0.408   0.990
fasta 2500   33.350 33.402   0.998
mandelbrot 1000   0.892  0.884   1.009
mandelbrot 4000  14.220 14.040   1.013
meteor 2098   1.300  1.312   0.991
nbody 300 4.524  4.840   0.935
spectral 300051.979 53.919   0.964
almabench17.261 17.721   0.974
almabench.unsafe 17.317 18.441   0.939
bdd   1.256  1.248   1.006
boyer 1.668  1.672   0.998
fft   3.064  3.080   0.995
nucleic   2.188  2.268   0.965
quicksort 0.340  0.340   1.000
quicksort.unsafe  0.272  0.272   1.000
sorts 6.164  6.180   0.997

 Gerd

Benedikt

-- 
Caml-list mailing list.  Subscription management and archives:
https://sympa-roc.inria.fr/wws/info/caml-list
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs



Re: [Caml-list] Linear Scan Register Allocator for ocamlopt/ocamlnat

2011-08-24 Thread Benedikt Meurer

On Aug 1, 2011, at 17:04 , Gabriel Scherer wrote:

 Do you have more precise measurements on

Also posting Marcell's timing results here for reference (taken from bug 5324).

 - the performance cost of this new allocator in the generated code? I
 suppose the results may vary between different architectures (eg. x86
 is probably more sensitive to good allocation decisions than x86_64).

- http://ps.informatik.uni-siegen.de/~meurer/tmp/compiletime_timings.pdf 
contains a comparison of the ocamlopt invocations.
- http://ps.informatik.uni-siegen.de/~meurer/tmp/runtime_timings.pdf contains 
comparison of the generated code.

As can be seen from the results, amd64 is more sensitive to register allocator 
changes than i386. Not really surprising to me. Would be interesting to see how 
this affects PPC/Sparc/Mips, but we don't have appropriate hardware available 
right now. Anyone with appropriate hardware and some spare time? :-)

Benedikt


-- 
Caml-list mailing list.  Subscription management and archives:
https://sympa-roc.inria.fr/wws/info/caml-list
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs



Re: [Caml-list] Linear Scan Register Allocator for ocamlopt/ocamlnat

2011-08-01 Thread Gabriel Scherer
This work is meant to make a compromise between generated code quality
and compilation speed to have good performances in rapid
prototyping/development scenario.

Do you have more precise measurements on
- the relative costs of the successive transformations during native
compilation (including external linking etc.)? Which proportion of
time is currently used for register allocation?
- the performance cost of this new allocator in the generated code? I
suppose the results may vary between different architectures (eg. x86
is probably more sensitive to good allocation decisions than x86_64).

On Mon, Aug 1, 2011 at 4:53 PM, Benedikt Meurer
benedikt.meu...@googlemail.com wrote:
 Hello,

 As mentioned earlier we have a student working on an implementation of the 
 Linear Scan Register Allocator [1] for ocamlopt (and thereby ocamlnat). It 
 took some time, but now there's a first working patch which looks promising. 
 This work is done by Marcell Fischbach as part of his diploma thesis. The 
 idea is to use the linear scan algorithm to drive the register allocation in 
 the native top-level ocamlnat at some point, as suggested by Fabrice Le 
 Fessant [2].

 Marcell is now working to implement a proof-of-concept of an inline assembler 
 for ocamlnat on i386 based on code from Alain Frisch an Fabrice Le Fessant. 
 The result will also be contributed once ready, and will be used to 
 effectively compare ocamlnat and the byte-code ocaml top-level.

 The linear scan implementation reuses as much of the existing ocamlopt 
 functionality as possible, so additional maintenance overhead should be 
 manageable. Comments and suggestions are welcome of course. Please keep 
 Marcell CC'ed with any replies as he's not subscribed to the list.

 greets,
 Benedikt


 [1] http://portal.acm.org/citation.cfm?id=330250
 [2] 
 http://caml.inria.fr/pub/ml-archives/caml-list/2010/11/a1b0ed0934ff51df4ac07c5e9da6e9d6.en.html


 --
 Caml-list mailing list.  Subscription management and archives:
 https://sympa-roc.inria.fr/wws/info/caml-list
 Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
 Bug reports: http://caml.inria.fr/bin/caml-bugs




-- 
Caml-list mailing list.  Subscription management and archives:
https://sympa-roc.inria.fr/wws/info/caml-list
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs



Re: [Caml-list] Linear Scan Register Allocator for ocamlopt/ocamlnat

2011-08-01 Thread Benedikt Meurer

On Aug 1, 2011, at 17:04 , Gabriel Scherer wrote:

 This work is meant to make a compromise between generated code quality
 and compilation speed to have good performances in rapid
 prototyping/development scenario.
 
 Do you have more precise measurements on
 - the relative costs of the successive transformations during native
 compilation (including external linking etc.)? Which proportion of
 time is currently used for register allocation?

Right now most of the time in ocamlnat is spent in waiting for the assembler 
and linker to finish and the runtime linker to load the generated library file. 
However there are no precise timing results yet.

Marcell did a rough test with ocamlopt last week, building the test suite with 
both graph coloring and linear scan on a 2009 iMac (Core 2 Duo). The overall 
time spent in the register allocator dropped from 28s to 8s.

 - the performance cost of this new allocator in the generated code? I
 suppose the results may vary between different architectures (eg. x86
 is probably more sensitive to good allocation decisions than x86_64).

Same here... not yet. From what I've seen, the generated amd64 code is really 
close to the graph coloring code (isomorphic up to register renaming in most 
cases). Dunno for i386 yet.

Benedikt

-- 
Caml-list mailing list.  Subscription management and archives:
https://sympa-roc.inria.fr/wws/info/caml-list
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs