Re: [Caml-list] Linear Scan Register Allocator for ocamlopt/ocamlnat
Pierre-Alexandre Voye wrote: Note that the ocaml compiler has a flag -cmm which outputs C-- ast code. F. Reig made a c-- ocaml backend during his thesis. Including a GC. Unhappily, sources code haven't been released. But it proves it works. If anyone is interested, the dissertation is available at http://theses.gla.ac.uk/686/ Fermin Le 26 août 2011 14:30, Erik de Castro Lopo mle+oc...@mega-nerd.com mailto:mle%2boc...@mega-nerd.com a écrit : Pierre-Alexandre Voye wrote: I have a stupid question : I wonder if it would not be a bad idea th... I have some experience in thie area. I work on the DDC compiler [0] a compiler for a strict by default (optionally lazy) evaluation dialect of Haskell. When I joined the project the compiler had a working compile via C backend, to which I added an LLVM backend [1]. Executables compiled via the LLVM backend (even without exploring any of the LLVM optimisation passes) were faster than the same executables compiled via C (gcc -O2). I suspect this is because the generated C code was nothing like the C code people write and the GCC is only good at optimising idiomatic C code. I highly recommend LLVM as a compiler backend. HTH, Erik [0] http://disciple.ouroborus.net/ [1] http://www.mega-nerd.com/erikd/Blog/CodeHacking/DDC/index.html -- -- Erik de Castro Lopo http://www.mega-nerd.com/ -- Caml-list mailing list. Subscription management and archives: https://sympa-roc.inria.fr/wws/i... -- Caml-list mailing list. Subscription management and archives: https://sympa-roc.inria.fr/wws/info/caml-list Beginner's list: http://groups.yahoo.com/group/ocaml_beginners Bug reports: http://caml.inria.fr/bin/caml-bugs
Re: [Caml-list] Linear Scan Register Allocator for ocamlopt/ocamlnat
Le Thursday 25 Aug 2011 à 12:25:49 (+0200), Pierre-Alexandre Voye a écrit : I have a stupid question : I wonder if it would not be a bad idea that Ocaml output C code and let gcc do its work, so compile code with good performances in a lot of architecture ? Gcc is able to do autovectorization (SSE, MMX, Larabee in the futur, etc...), very specific processor optimization, etc... But maybe it's a stupid idea ? I do not think it's a stupid idea. I've been quite stunned by how the Mercury compiler does it (using cpp macros and so-called virtual registers in C code). It's likely a pain however, and most compilers I've seen using the compile to C approach usually use Boehm's GC, so it may be quite a pain to adapt the OCaml runtime to it. However, gcc isn't the golden bullet for everything. It definitely sucks with the SIMD instruction set. -- Guillaume Yziquel -- Caml-list mailing list. Subscription management and archives: https://sympa-roc.inria.fr/wws/info/caml-list Beginner's list: http://groups.yahoo.com/group/ocaml_beginners Bug reports: http://caml.inria.fr/bin/caml-bugs
Re: [Caml-list] Linear Scan Register Allocator for ocamlopt/ocamlnat
Pierre-Alexandre Voye wrote: I have a stupid question : I wonder if it would not be a bad idea that Ocaml output C code and let gcc do its work, so compile code with good performances in a lot of architecture ? Gcc is able to do autovectorization (SSE, MMX, Larabee in the futur, etc...), very specific processor optimization, etc... But maybe it's a stupid idea ? I have some experience in thie area. I work on the DDC compiler [0] a compiler for a strict by default (optionally lazy) evaluation dialect of Haskell. When I joined the project the compiler had a working compile via C backend, to which I added an LLVM backend [1]. Executables compiled via the LLVM backend (even without exploring any of the LLVM optimisation passes) were faster than the same executables compiled via C (gcc -O2). I suspect this is because the generated C code was nothing like the C code people write and the GCC is only good at optimising idiomatic C code. I highly recommend LLVM as a compiler backend. HTH, Erik [0] http://disciple.ouroborus.net/ [1] http://www.mega-nerd.com/erikd/Blog/CodeHacking/DDC/index.html -- -- Erik de Castro Lopo http://www.mega-nerd.com/ -- Caml-list mailing list. Subscription management and archives: https://sympa-roc.inria.fr/wws/info/caml-list Beginner's list: http://groups.yahoo.com/group/ocaml_beginners Bug reports: http://caml.inria.fr/bin/caml-bugs
Re: [Caml-list] Linear Scan Register Allocator for ocamlopt/ocamlnat
Note that the ocaml compiler has a flag -cmm which outputs C-- ast code. F. Reig made a c-- ocaml backend during his thesis. Including a GC. Unhappily, sources code haven't been released. But it proves it works. Le 26 août 2011 14:30, Erik de Castro Lopo mle+oc...@mega-nerd.com a écrit : Pierre-Alexandre Voye wrote: I have a stupid question : I wonder if it would not be a bad idea th... I have some experience in thie area. I work on the DDC compiler [0] a compiler for a strict by default (optionally lazy) evaluation dialect of Haskell. When I joined the project the compiler had a working compile via C backend, to which I added an LLVM backend [1]. Executables compiled via the LLVM backend (even without exploring any of the LLVM optimisation passes) were faster than the same executables compiled via C (gcc -O2). I suspect this is because the generated C code was nothing like the C code people write and the GCC is only good at optimising idiomatic C code. I highly recommend LLVM as a compiler backend. HTH, Erik [0] http://disciple.ouroborus.net/ [1] http://www.mega-nerd.com/erikd/Blog/CodeHacking/DDC/index.html -- -- Erik de Castro Lopo http://www.mega-nerd.com/ -- Caml-list mailing list. Subscription management and archives: https://sympa-roc.inria.fr/wws/i... -- Caml-list mailing list. Subscription management and archives: https://sympa-roc.inria.fr/wws/info/caml-list Beginner's list: http://groups.yahoo.com/group/ocaml_beginners Bug reports: http://caml.inria.fr/bin/caml-bugs
Re: [Caml-list] Linear Scan Register Allocator for ocamlopt/ocamlnat
On Aug 24, 2011, at 22:40 , Gerd Stolpmann wrote: - the performance cost of this new allocator in the generated code? I suppose the results may vary between different architectures (eg. x86 is probably more sensitive to good allocation decisions than x86_64). - http://ps.informatik.uni-siegen.de/~meurer/tmp/compiletime_timings.pdf contains a comparison of the ocamlopt invocations. - http://ps.informatik.uni-siegen.de/~meurer/tmp/runtime_timings.pdf contains comparison of the generated code. As can be seen from the results, amd64 is more sensitive to register allocator changes than i386. Well, this particular i386 CPU model is a strange guy - Northwoods have this extremely long pipeline, which is very sensitive to unforeseen jumps. It would be more interesting to see this test on a modern CPU in i386 mode. My guess is that it behaves then more like amd64. Modern CPUs most probably don't run ocaml in 32bit mode, but more likely in long mode. That's why we choose to run the i386 on real 32bit hardware, where the ocaml i386 port is actually used. I'll rerun the benchmark in 32bit mode on a modern cpu to see how things change. Gerd Benedikt -- Caml-list mailing list. Subscription management and archives: https://sympa-roc.inria.fr/wws/info/caml-list Beginner's list: http://groups.yahoo.com/group/ocaml_beginners Bug reports: http://caml.inria.fr/bin/caml-bugs
Re: [Caml-list] Linear Scan Register Allocator for ocamlopt/ocamlnat
Am Donnerstag, den 25.08.2011, 11:34 +0200 schrieb Benedikt Meurer: On Aug 25, 2011, at 10:02 , Benedikt Meurer wrote: - http://ps.informatik.uni-siegen.de/~meurer/tmp/compiletime_timings.pdf contains a comparison of the ocamlopt invocations. - http://ps.informatik.uni-siegen.de/~meurer/tmp/runtime_timings.pdf contains comparison of the generated code. As can be seen from the results, amd64 is more sensitive to register allocator changes than i386. Well, this particular i386 CPU model is a strange guy - Northwoods have this extremely long pipeline, which is very sensitive to unforeseen jumps. It would be more interesting to see this test on a modern CPU in i386 mode. My guess is that it behaves then more like amd64. Modern CPUs most probably don't run ocaml in 32bit mode, but more likely in long mode. That's why we choose to run the i386 on real 32bit hardware, where the ocaml i386 port is actually used. Right. However, if you look at new 32-bit-only CPUs like older Atoms these base on modern cores where only some of the circuits were omitted that are needed for 64-bit execution. I'll rerun the benchmark in 32bit mode on a modern cpu to see how things change. Reran the benchmark with 32bit ocaml on the MBP (Early 2011), results are available at: http://ps.informatik.uni-siegen.de/~meurer/tmp/linscan-i7-i386-timings.pdf Hard to interpret. I have the impression that the compile times for graph coloring on i386 are lower than on amd64 - has maybe to do with the number of registers - but those for linear scanning are about the same. (I'm comparing here with the amd64 numbers taken on the same machine.) The graphs for the runtimes are on a logarithmic scale. Do you also have raw numbers? Gerd Benedikt -- Gerd Stolpmann, Darmstadt, Germanyg...@gerd-stolpmann.de Creator of GODI and camlcity.org. Contact details:http://www.camlcity.org/contact.html Company homepage: http://www.gerd-stolpmann.de *** Searching for new projects! Need consulting for system *** programming in Ocaml? Gerd Stolpmann can help you. -- Caml-list mailing list. Subscription management and archives: https://sympa-roc.inria.fr/wws/info/caml-list Beginner's list: http://groups.yahoo.com/group/ocaml_beginners Bug reports: http://caml.inria.fr/bin/caml-bugs
Re: [Caml-list] Linear Scan Register Allocator for ocamlopt/ocamlnat
On Aug 25, 2011, at 12:21 , Gerd Stolpmann wrote: Reran the benchmark with 32bit ocaml on the MBP (Early 2011), results are available at: http://ps.informatik.uni-siegen.de/~meurer/tmp/linscan-i7-i386-timings.pdf Hard to interpret. I have the impression that the compile times for graph coloring on i386 are lower than on amd64 - has maybe to do with the number of registers - but those for linear scanning are about the same. (I'm comparing here with the amd64 numbers taken on the same machine.) The graphs for the runtimes are on a logarithmic scale. Do you also have raw numbers? Of course. These are the raw numbers for the i7 (i386): command opt_gc opt_ls opt_gc/ls -- binarytree 16 1.096 1.102 0.995 fasta 100 0.008 0.008 1.000 fasta 25 0.044 0.044 1.000 fasta 25002.767 2.737 1.011 mandelbrot 1000 0.224 0.224 1.000 mandelbrot 4000 3.583 3.570 1.004 meteor 2098 0.578 0.586 0.986 nbody 300 1.142 1.168 0.978 spectral 3000 4.963 5.243 0.947 almabench 4.665 4.687 0.995 almabench.unsafe 4.629 4.702 0.984 bdd 0.279 0.285 0.979 boyer 0.566 0.572 0.990 fft 0.421 0.422 0.998 nucleic 0.499 0.513 0.973 quicksort 0.150 0.149 1.007 quicksort.unsafe 0.121 0.121 1.000 sorts 1.889 1.947 0.970 And the raw numbers for the P4: command opt_gc opt_ls opt_gc/ls -- binarytree 16 2.796 2.780 1.006 fasta 100 0.024 0.024 1.000 fasta 25 0.404 0.408 0.990 fasta 2500 33.350 33.402 0.998 mandelbrot 1000 0.892 0.884 1.009 mandelbrot 4000 14.220 14.040 1.013 meteor 2098 1.300 1.312 0.991 nbody 300 4.524 4.840 0.935 spectral 300051.979 53.919 0.964 almabench17.261 17.721 0.974 almabench.unsafe 17.317 18.441 0.939 bdd 1.256 1.248 1.006 boyer 1.668 1.672 0.998 fft 3.064 3.080 0.995 nucleic 2.188 2.268 0.965 quicksort 0.340 0.340 1.000 quicksort.unsafe 0.272 0.272 1.000 sorts 6.164 6.180 0.997 Gerd Benedikt -- Caml-list mailing list. Subscription management and archives: https://sympa-roc.inria.fr/wws/info/caml-list Beginner's list: http://groups.yahoo.com/group/ocaml_beginners Bug reports: http://caml.inria.fr/bin/caml-bugs
Re: [Caml-list] Linear Scan Register Allocator for ocamlopt/ocamlnat
On Aug 1, 2011, at 17:04 , Gabriel Scherer wrote: Do you have more precise measurements on Also posting Marcell's timing results here for reference (taken from bug 5324). - the performance cost of this new allocator in the generated code? I suppose the results may vary between different architectures (eg. x86 is probably more sensitive to good allocation decisions than x86_64). - http://ps.informatik.uni-siegen.de/~meurer/tmp/compiletime_timings.pdf contains a comparison of the ocamlopt invocations. - http://ps.informatik.uni-siegen.de/~meurer/tmp/runtime_timings.pdf contains comparison of the generated code. As can be seen from the results, amd64 is more sensitive to register allocator changes than i386. Not really surprising to me. Would be interesting to see how this affects PPC/Sparc/Mips, but we don't have appropriate hardware available right now. Anyone with appropriate hardware and some spare time? :-) Benedikt -- Caml-list mailing list. Subscription management and archives: https://sympa-roc.inria.fr/wws/info/caml-list Beginner's list: http://groups.yahoo.com/group/ocaml_beginners Bug reports: http://caml.inria.fr/bin/caml-bugs
Re: [Caml-list] Linear Scan Register Allocator for ocamlopt/ocamlnat
This work is meant to make a compromise between generated code quality and compilation speed to have good performances in rapid prototyping/development scenario. Do you have more precise measurements on - the relative costs of the successive transformations during native compilation (including external linking etc.)? Which proportion of time is currently used for register allocation? - the performance cost of this new allocator in the generated code? I suppose the results may vary between different architectures (eg. x86 is probably more sensitive to good allocation decisions than x86_64). On Mon, Aug 1, 2011 at 4:53 PM, Benedikt Meurer benedikt.meu...@googlemail.com wrote: Hello, As mentioned earlier we have a student working on an implementation of the Linear Scan Register Allocator [1] for ocamlopt (and thereby ocamlnat). It took some time, but now there's a first working patch which looks promising. This work is done by Marcell Fischbach as part of his diploma thesis. The idea is to use the linear scan algorithm to drive the register allocation in the native top-level ocamlnat at some point, as suggested by Fabrice Le Fessant [2]. Marcell is now working to implement a proof-of-concept of an inline assembler for ocamlnat on i386 based on code from Alain Frisch an Fabrice Le Fessant. The result will also be contributed once ready, and will be used to effectively compare ocamlnat and the byte-code ocaml top-level. The linear scan implementation reuses as much of the existing ocamlopt functionality as possible, so additional maintenance overhead should be manageable. Comments and suggestions are welcome of course. Please keep Marcell CC'ed with any replies as he's not subscribed to the list. greets, Benedikt [1] http://portal.acm.org/citation.cfm?id=330250 [2] http://caml.inria.fr/pub/ml-archives/caml-list/2010/11/a1b0ed0934ff51df4ac07c5e9da6e9d6.en.html -- Caml-list mailing list. Subscription management and archives: https://sympa-roc.inria.fr/wws/info/caml-list Beginner's list: http://groups.yahoo.com/group/ocaml_beginners Bug reports: http://caml.inria.fr/bin/caml-bugs -- Caml-list mailing list. Subscription management and archives: https://sympa-roc.inria.fr/wws/info/caml-list Beginner's list: http://groups.yahoo.com/group/ocaml_beginners Bug reports: http://caml.inria.fr/bin/caml-bugs
Re: [Caml-list] Linear Scan Register Allocator for ocamlopt/ocamlnat
On Aug 1, 2011, at 17:04 , Gabriel Scherer wrote: This work is meant to make a compromise between generated code quality and compilation speed to have good performances in rapid prototyping/development scenario. Do you have more precise measurements on - the relative costs of the successive transformations during native compilation (including external linking etc.)? Which proportion of time is currently used for register allocation? Right now most of the time in ocamlnat is spent in waiting for the assembler and linker to finish and the runtime linker to load the generated library file. However there are no precise timing results yet. Marcell did a rough test with ocamlopt last week, building the test suite with both graph coloring and linear scan on a 2009 iMac (Core 2 Duo). The overall time spent in the register allocator dropped from 28s to 8s. - the performance cost of this new allocator in the generated code? I suppose the results may vary between different architectures (eg. x86 is probably more sensitive to good allocation decisions than x86_64). Same here... not yet. From what I've seen, the generated amd64 code is really close to the graph coloring code (isomorphic up to register renaming in most cases). Dunno for i386 yet. Benedikt -- Caml-list mailing list. Subscription management and archives: https://sympa-roc.inria.fr/wws/info/caml-list Beginner's list: http://groups.yahoo.com/group/ocaml_beginners Bug reports: http://caml.inria.fr/bin/caml-bugs