Re: inside the GHC code generator
On 2/24/06, Simon Peyton-Jones <[EMAIL PROTECTED]> wrote:
> | these last days i studied the GHC code generator, reading -ddump output
> | of all sorts, GHC sources, various papers and John's appeals :)
> |
> | what i carried away from this investigation:
> |
> | GHC has great high-level optimization. moreover, GHC now allows one to
> | program STG almost directly. when i look at STG code i don't see
> | anything that could be done better - at least for the simple loops i
> | tried to compile. i see just unboxed variables, primitive operations
> | and simple loops represented as tail recursion. fine.
> |
> | STG is then compiled to simplified C (called abstract C earlier and
> | quasi C-- now), which in turn can be translated:
> |
> | * to real C and then compiled by gcc
> | * to assembler by the built-in simple C-- compiler
> | * to assembler by some external C-- compiler (at least this is
> |   theoretically possible)
>
> Good stuff Bulat. There's plenty of interesting stuff to be done here.
>
> However, let me strongly urge you *not* to focus attention primarily on
> the gcc route. Compiling via C has received a lot of attention over the
> years, and there are many papers describing cool hacks for doing so.
> GHC does not do as well as it could.
>
> But there are serious obstacles. That's not gcc's fault -- it wasn't
> designed for this. Accurate GC is one of them, tail calls is another,
> and there are plenty more smaller things that bite you only after you've
> invested a lot of time. This way lies madness.
>
> C-- was *designed* for this purpose. GHC uses C-- as its intermediate
> language (just before emitting C). So a good route is this:
>
> * Write C-- to C-- optimisations
>
> * Then, if you like, translate that code to C. Already you will be
>   doing better than GHC does today, because the C-- to C-- optimiser
>   will let you generate better C
>
> * But you can also emit C--, or native code; both of these paths will
>   directly benefit from your C-- optimisations.
> The point is that none of this relies on Quick C--; you can always use
> an alternative back end.
>
> You can almost do this today. GHC uses C-- as an intermediate language.
> But alas, GHC's code generator does not take advantage of C--'s native
> calls or parameter passing. Instead, it pushes things on an auxiliary
> stack etc. (Those Sp memory operations you see.) This isn't necessary.
> We are planning to split GHC's code generator into two parts:
>   A) Generate C-- with native calls, with an implicit C-- stack
>   B) Perform CPS conversion, to eliminate all calls in favour of jumps
>      using an explicit stack
> The current code gen is A followed by B. But A is a much more suitable
> optimisation platform, and gives more flexibility.
>
> Chris Thompson, an undergrad at Cambridge, is doing (B) as his
> undergrad project, although it remains to be seen whether he'll have
> enough complete to be usable.
>
> Another shortcoming is that the native code generator in GHC isn't
> capable of dealing with backward jumps to labels (because GHC hasn't
> needed that so far). But if you did C-- optimisation, you'd probably
> generate such jumps. It'd be great to beef up the native code gen to
> handle that.
>
> Many of the optimisations you describe (perhaps all) are readily
> expressible in the C-- intermediate language, and by working at that
> level you will be independent of whether the back end is gcc, a native
> code generator, Quick C--, or some other C-- compiler. Much better.

What optimizations are we talking about here? The loop optimizations
that Bulat implicitly proposed would only affect recursion over unboxed
arguments, and, since that's fairly rare, wouldn't give Joe Hacker any
noticeable speedup. Are we at the end of what we can get without
whole-program optimization, or are there other optimizations that apply
to C-- representing a lazy PL?
-- 
Friendly,
Lemmih
___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Re: inside the GHC code generator
Bulat Ziganshin wrote:
> * the C++ calling stack should be used as much as possible
> * parameters are passed in the C++ way: "factorial(int n, int r)"

SM> This is unworkable, I'm afraid. Go and read Simon's paper on the STG:
SM> http://citeseer.ist.psu.edu/peytonjones92implementing.html
SM> there are two main reasons we don't use the C stack: garbage collection
SM> and tail calls. What do you plan to do about GC?

> the minimalistic solution i see - just use two stacks. unboxed values
> are passed using the C conventions and the C compiler will be able to
> optimize use of this stack. boxed values are passed using the current
> Sp[] machinery and therefore GC will not be changed

How do you propose to do thread switching? Incidentally, GHC had two
stacks a long time ago (before 4.00). I rewrote the runtime and code
generator to replace them with one stack. It was about 6 months work, if
I remember correctly. I think you misused the words "just" and
"minimalistic" :-)

I know gcc does (some) tail-call optimisation these days. You don't get
any *guarantee* that a call will be implemented as a tail-call, though,
and the conditions are quite complex. The gcc folks treat it as an
optimisation, rather than a fundamental part of the operational
semantics, which is what you need for Haskell.

SM> This is called semi-tagging, it was tried a long time ago. Certainly
SM> worth trying again, but I very much doubt the gains would be anything
SM> like you claim, and it increases code size.

> we will replace 2 unpredictable jumps with one that will not even be
> executed if the value is in WHNF. at least, i'm very displeased by the
> slowness of lazy value evaluation in GHC. for example, my binary
> serialization package writes 60 mb/s working with unboxed arrays and
> only 20 mb/s working with lists

You need to *try* these things and measure the difference. Quite often
I'm surprised by the actual results I get from an optimisation that
looks on paper to be worthwhile. Today's processors with 3-level caches
are complicated beasts. Semi-tagging increases code size, so unless you
can make some significant gains to make up for it, you've lost already.

Let me give you another example: GHC puts info tables next to the entry
code for a closure. On paper this looks like it might be a bad idea: it
pollutes the instruction cache with data that is rarely accessed during
execution. It's hard to arrange (this is one thing the mangler does), so
we'd like to avoid it if possible. However, doing this eliminates an
indirection when entering a closure, reduces the size of info tables by
one word, and reduces code size. These effects together are more
beneficial than the code-cache-pollution effect - last time I measured
it the difference was ~10%. (Turn off TABLES_NEXT_TO_CODE and try it
yourself.)

SM> There are other schemes, including this one that we particularly like:
SM> use a spare bit in a pointer to indicate "points to an evaluated value".
SM> Since there are two spare bits, you can go further and use the 4
SM> values to mean:
SM>   0: possibly unevaluated
SM>   1: evaluated, tag 0
SM>   2: evaluated, tag 1
SM>   3: evaluated, tag 2
SM> I believe Robert Ennals tried this as part of his spec eval
SM> implementation, but IIRC he didn't get separate measurements for it (or
SM> it didn't improve things much). Note that this increases code size too.

> to be exact, we'd have only 2^30-1 valid pointers, all other values are
> ready to use! :)

True, but I'm assuming you often want to store a pointer in addition to
the tag (eg. for a cons cell). For enumerations, you're quite right that
the whole word can be used for the tag as long as it is distinguished
from a pointer.

Cheers,
Simon
Re[2]: inside the GHC code generator
Hello Simon,

Friday, February 24, 2006, 12:52:36 PM, you wrote:

SM> Wow! You make a lot of good suggestions. I think some of them are
SM> unfortunately unworkable, some others have been tried before, but there
SM> are definitely some to try. I'll suggest a few more in this email.

can you comment on the unworkable/hard-to-implement ones? maybe ghc
users can propose solutions that you have overlooked

SM> First of all I should say that we don't *want* to use gcc as a code
SM> generator. Everything we've been doing in the back end over the last
SM> few years has been aiming towards dropping the dependency on gcc and
SM> removing the Evil Mangler. Now is not the time to be spending a lot of
SM> effort in beefing up the C back end :-) Our preference is to work on
SM> both the native and C-- back ends; the C-- back end has the potential to
SM> produce the best code and be the most portable. We also plan to keep
SM> the "unregisterised" C back end for the purposes of porting to a new
SM> platform, it's only the registerised path we want to lose.

i know that this is the hardest part of the discussion. but sometimes we
need to change our decisions. i think it's better to compare the
effort/results of going the gcc or C-- way starting from the current
state of ghc, without counting the previous efforts :(

on the C-- side:
1) time/effort required to implement many low-level optimizations
2) more or less efficient code

on the gcc side:
1) time required to implement generation of native C code instead of the
   current "portable asm"
2) one of the best optimizers available

the gcc way already exists and works, while the C-- way is still
unfinished even in its basic, unoptimized state. moreover, the
unregisterised C way provides the maximum possible level of portability

SM> Having said all that, if we can get better code out of the C back end
SM> without too much effort, then that's certainly worthwhile.
moreover, many changes proposed in this thread are perfectly
implementable even on the C-- way

>> translated to something like this:
>>
>> factorial:
>>   _s1BD = *(Sp + 0);
>>   if (_s1BD != 1) goto c1C4;     // n!=1 ?
>>   R1 = *(Sp + 4);
>>   Sp = Sp + 8;
>>   return (*(Sp + 0))();          // return from factorial
>> c1C4:
>>   _s1BI = _s1BD * (*(Sp + 4));   // n*r
>>   _s1BF = _s1BD - 1;             // n-1
>>   *(Sp + 4) = _s1BI;
>>   *(Sp + 0) = _s1BF;
>>   goto factorial;

SM> To be fair, this is only the translation on an architecture that has no
SM> argument registers (x86, and currently x86_64 but hopefully not for long).

this just emphasizes the main problem with the "portable asm" code
generated by ghc - you can't optimize it as well as a C compiler
optimizes idiomatic C code. x86 has 7 registers, after all, and gcc
generates great code for this procedure - but only if it is written in
an "idiomatic" manner. you would need to develop your own optimization
techniques, and that is a long, hard road compared to just generating
more high-level C code.

just to compare - i've rewritten my binary serialization library two
times. the first version just used string concatenation (!); the second
was carefully optimized using my C experience. nevertheless, it was
still slow. the final, 3rd version was optimized according to ghc's
features, and the result was astonishing.

"portable asm" looks great in theory, and C-- as a portable low-level
language is a great theoretical idea. but reality has its own laws. want
to know why my second version of the serialization library, highly
optimized according to C experience, was so slow? it just returned
tuples, and constructing/analyzing these lazy tuples used 80% of the
time.

ghc generates not-so-fast code, and that has been known for many years.
you have developed the proper theoretical way to solve this problem, but
there is also a much easier backdoor :) at least, it seems so

SM> The simple optimisation of changing the final tail call to factorial
SM> into a goto to a label at the beginning of the function (as suggested by
SM> John Meacham I think) should improve the code generated by gcc for
SM> functions like this, and is easy to do.

the main detail is to work with local variables instead of Sp[xx]. gcc
can't optimize access to Sp[xx]

>> 1) because it just isn't asked to. you can enable gcc optimization by
>> adding "-optc-O6" to the command line, but this leads to compilation
>> errors in some modules. it seems that gcc optimization is not
>> compatible with the "evil mangler", ghc's own optimization trick.

SM> -O3 works, I haven't tried -O6.

i've attached "optc-O2-bug.hs" which doesn't work with either -optc-O2
or -optc-O3

>> * the C++ calling stack should be used as much as possible
>> * parameters are passed in the C++ way: "factorial(int n, int r)"

SM> This is unworkable, I'm afraid. Go and read Simon's paper on the STG:
SM> http://citeseer.ist.psu.edu/peytonjones92implementing.html
SM> there are two main reasons we don't use the C stack
Re: inside the GHC code generator
Hello,

> > SPJ> tail calls is another,
> >
> > nowadays gcc optimizes tail calls
>
> I found this thesis on the subject by Andreas Bauer:
>
> "Compilation of Functional Programming Languages using GCC -- Tail
> Calls"
>
> Do you know if it is his work that was incorporated into GCC?

no. Also, it is based on a now obsolete version of GCC (most of the tail
call optimization is currently handled in a completely different way).

If anyone has concrete proposals on improving the handling of tail calls
in GCC that could help GHC, please use the gcc bugzilla
(http://gcc.gnu.org/bugzilla/) to file an improvement request, and add
me to CC; I will see what I can do.

Zdenek
Re: inside the GHC code generator
Hello,

SPJ> tail calls is another,

> nowadays gcc optimizes tail calls

I found this thesis on the subject by Andreas Bauer:

"Compilation of Functional Programming Languages using GCC -- Tail
Calls"

Do you know if it is his work that was incorporated into GCC? I browsed
through some parts of the thesis, and he mentions GHC as one of the
motivations for his work (the thesis also seems to be a nice tutorial on
the internals of GCC). From this thread it seems that GHC cannot
currently take advantage of his work. Is that correct?

/Niklas
Re[2]: inside the GHC code generator
Hello Simon,

Friday, February 24, 2006, 12:30:22 PM, you wrote:

SPJ> | GHC has great high-level optimization.

SPJ> However, let me strongly urge you *not* to focus attention primarily on
SPJ> the gcc route. Compiling via C has received a lot of attention over the
SPJ> years, and there are many papers describing cool hacks for doing so.
SPJ> GHC does not do as well as it could.

btw, it's well known that Clean sometimes makes better code than GHC and
sometimes vice versa. now i'm almost sure that GHC outperforms Clean in
high-level optimizations and loses on low-level code generation, so the
overall result depends on which matters more for a particular algorithm.
on hand-coded programs Clean should almost always outperform GHC

SPJ> But there are serious obstacles. That's not gcc's fault -- it wasn't
SPJ> designed for this. Accurate GC is one of them,

we can use two stacks - one for boxed and one for unboxed values

SPJ> tail calls is another,

nowadays gcc optimizes tail calls

SPJ> and there are plenty more smaller things that bite you only after you've
SPJ> invested a lot of time. This way lies madness.

but nevertheless ghc already compiles via gcc, and it's the fastest
backend, with the help of the Evil Mangler. afaik, code generated by ghc
will work even without the Evil Mangler, so it's just an optimization
hack. can you please say what optimizations are done via the EM? i think
some of them can be implemented by changing the C generation, so maybe
we can even drop the EM at long last

SPJ> C-- was *designed* for this purpose. GHC uses C-- as its intermediate
SPJ> language (just before emitting C). So a good route is this:
SPJ> * Write C-- to C-- optimisations
SPJ> * Then, if you like, translate that code to C. Already you will be
SPJ> doing better than GHC does today, because the C-- to C-- optimiser will
SPJ> let you generate better C
SPJ> * But you can also emit C--, or native code; both of these paths will
SPJ> directly benefit from your C-- optimisations.
SPJ> The point is that none of this relies on Quick C--; you can always use
SPJ> an alternative back end.

yes, C-- was designed to solve all our problems - provide world-class
optimization, code generation for different cpus and so on. but does it
fulfil its promises? no. and now you propose to write these
optimizations as part of the Haskell project. great idea! i started from
the same idea and even wrote a sequence of optimizations that would
transform the initial C-- code for "fac" into efficient code. but then i
realized that the core problem is that ghc just generates low-level code
in "portable asm" style - and optimizing low-level code that describes
how to IMPLEMENT an algorithm, instead of just describing the ALGORITHM
itself, is a hard task. no one writes optimizers for ASM, but you
propose to do exactly that.

instead of coping with the current "portable asm" code generated from
STG, i propose to generate idiomatic C or C-- code and leave its
optimization to the appropriate compiler. the SEPARATE problem is that
gcc/icc can generate much better code than qc, so i propose to direct
the effort of generating idiomatic C[--] toward the C compilers.

SPJ> You can almost do this today. GHC uses C-- as an intermediate language.
SPJ> But alas, GHC's code generator does not take advantage of C--'s native
SPJ> calls or parameter passing. Instead, it pushes things on an auxiliary
SPJ> stack etc. (Those Sp memory operations you see.) This isn't necessary.
SPJ> We are planning to split GHC's code generator into two parts:
SPJ>   A) Generate C-- with native calls, with an implicit C-- stack
SPJ>   B) Perform CPS conversion, to eliminate all calls in favour of jumps
SPJ>      using an explicit stack
SPJ> The current code gen is A followed by B. But A is a much more suitable
SPJ> optimisation platform, and gives more flexibility.

i propose to implement only A and leave B to the C[--] compiler itself.
that will require a return to the 2-stack model, but in exchange we will
get 1st-class optimization instead of implementing a rather restricted
subset of it ourselves. instead of permanently trying to catch up with
the C compilers' optimizations, we can just jump on the common bandwagon
:)

SPJ> Chris Thompson, an undergrad at Cambridge, is doing (B) as his
SPJ> undergrad project, although it remains to be seen whether he'll have
SPJ> enough complete to be usable.

SPJ> Another shortcoming is that the native code generator in GHC isn't
SPJ> capable of dealing with backward jumps to labels (because GHC hasn't
SPJ> needed that so far). But if you did C-- optimisation, you'd probably
SPJ> generate such jumps. It'd be great to beef up the native code gen to
SPJ> handle that.

i propose to add if/for/while statements to the CmmStmt datatype to
allow generation of higher-level code. as John Meacham wrote several
months ago, gcc can't unroll loops written with gotos, so it's better to
generate explicit loops anyway. it should not be too hard to generate asm c
Re: inside the GHC code generator
[EMAIL PROTECTED] wrote:
> ... as you can see here, gcc hasn't unrolled the loop and has retained
> all the stack<->register moves. why? the short answer is that gcc just
> doesn't know that the data pointed to by the Sp pointer is non-volatile
> - it can't be changed in the middle of the program by some async
> computation or exception, and doesn't need to be updated immediately.
> maybe there is some way to tell gcc this? it would immediately
> SUBSTANTIALLY improve all ghc code generation

Would the C99 restrict keyword help in this situation?
Re: inside the GHC code generator
> But there are serious obstacles. That's not gcc's fault -- it wasn't
> designed for this. Accurate GC is one of them, tail calls is another,
> and there are plenty more smaller things that bite you only after you've
> invested a lot of time. This way lies madness.

What about llvm [1]? It was primarily designed to support GCC-based C
and C++ front-ends, but currently it also supports:
- accurate GC
- tail calls
- output for x86, PowerPC, IA-64, Alpha, SPARC V9, and portable C
- a lot of (program-wide and/or profile-guided) analyses and
  optimizations

Are there arguments against llvm?

[1] llvm website: http://llvm.cs.uiuc.edu/

Regards,
Christof
Re: inside the GHC code generator
Simon Peyton-Jones wrote:
> | [...] put a static table in the executable with
> | information about register and stack usage indexed by procedure return
> | point, and use that to unwind the stack and find roots.
>
> Every accurate garbage collector (including GHC's) uses a technique
> like this, but the solution is invariably tightly-coupled to the
> garbage collector, compiler and run-time system.

Okay, I don't know what I was thinking when I wrote that no languages
that compile via C use compacting collectors, since obviously lots of
them do. But they do it by using various hacks to force local heap
pointers into locations where the GC can find them. Most of this effort
is wasted, because the local variables disappear before the GC runs.
What I'm talking about is idiomatic C code which manipulates heap
pointers like any other object, and which can be optimized as usual
(e.g. holding heap pointers in callee-saved registers across function
calls) without causing GC problems. There's no reason in principle that
this couldn't be done.

-- Ben
Re: inside the GHC code generator
> Another shortcoming is that the native code generator in GHC isn't
> capable of dealing with backward jumps to labels (because GHC hasn't
> needed that so far). But if you did C-- optimisation, you'd probably
> generate such jumps. It'd be great to beef up the native code gen to
> handle that.

I'm already working on that. It's basically done; I think I only need to
get around to one more session with the code for final cleanup. (Just to
avoid any duplication of effort.)

Cheers,
Wolfgang
Re[6]: inside the GHC code generator
Hello Lemmih,

Friday, February 24, 2006, 8:55:37 PM, you wrote:

>> x :: ![Int]
>>
>> translates to
>>
>> data StrictList a = Nil | Cons a !(StrictList a)
>> x :: !(StrictList a)

L> Let's try this:
L>
L> x :: ![Int] -> Int
L>
L> It would translate to something like this:
L>
L> mkStrictList :: [a] -> StrictList a
L> x = xStrict . mkStrictList
L> xStrict = ...
L>
L> Wouldn't it be very expensive to strictify the list?

yes, it would. but evaluating the list lazily on EACH repetition of some
loop costs much more. just for example - i found that the loop

  repeat 666 (putStr (replicate 15 ' '))

works several times slower than

  repeat (10^8) (putChar ' ')

if the argument of putStr were evaluated only once, much of the
difference would be gone. of course, that is just a benchmark. but there
are cases where i prefer to work with strict data structures and
specialize my functions accordingly - typically cases where time/space
are really critical.

-- 
Best regards,
Bulat                          mailto:[EMAIL PROTECTED]
Re: Re[4]: inside the GHC code generator
On 2/24/06, Bulat Ziganshin <[EMAIL PROTECTED]> wrote:
> Hello Lemmih,
>
> Friday, February 24, 2006, 1:15:51 PM, you wrote:
>
> >> clean differs from Haskell in its support of unique types and
> >> strictness annotations. the latter is slowly migrating into GHC in
> >> the form of bang patterns, but i think that is a half-solution. i
> >> mentioned in my original letter my proposals to add strictness
> >> annotations to function type declarations and to declare strict data
> >> structures, such as "![Int]"
>
> L> As I've understood it, Clean's strictness annotations are a bit of a
> L> hack which only works on certain built-in types. Am I mistaken here?
>
> i don't know Clean very well, although i've seen claims that it
> supports strict data structures. after all, this doesn't need changes
> in the language itself, it's just syntax sugar:
>
> data [a] = [] | a:[a]
> x :: ![Int]
>
> translates to
>
> data StrictList a = Nil | Cons a !(StrictList a)
> x :: !(StrictList a)

Let's try this:

x :: ![Int] -> Int

It would translate to something like this:

mkStrictList :: [a] -> StrictList a
x = xStrict . mkStrictList
xStrict = ...

Wouldn't it be very expensive to strictify the list?

-- 
Friendly,
Lemmih
Re[4]: inside the GHC code generator
Hello Lemmih,

Friday, February 24, 2006, 1:15:51 PM, you wrote:

>> clean differs from Haskell in its support of unique types and
>> strictness annotations. the latter is slowly migrating into GHC in the
>> form of bang patterns, but i think that is a half-solution. i
>> mentioned in my original letter my proposals to add strictness
>> annotations to function type declarations and to declare strict data
>> structures, such as "![Int]"

L> As I've understood it, Clean's strictness annotations are a bit of a
L> hack which only works on certain built-in types. Am I mistaken here?

i don't know Clean very well, although i've seen claims that it supports
strict data structures. after all, this doesn't need changes in the
language itself, it's just syntax sugar:

data [a] = [] | a:[a]
x :: ![Int]

translates to

data StrictList a = Nil | Cons a !(StrictList a)
x :: !(StrictList a)

... i found the following in the 6 feb letter in the cafe by Brian
Hulley:

Bulat Ziganshin wrote:
> yes, i remember this SPJ's question :) "[!a]" means that list
> elements are strict, it's the same as defining a new list type with
> strict elements and using it here. "![a]" means "strict list", it is
> the same as defining a list with the "next" field strict:
>
> data List1 a = Nil1 | List1 !a (List1 a)
> data List2 a = Nil2 | List2 a !(List2 a)
> data List3 a = Nil3 | List3 !a !(List3 a)

Clean allows (AFAIK) several distinctions to be made:
1) ![a] means that the list of a's is a strict argument, just like
   writing !b
2) [!a] means that the list is head strict (List1 a)
3) [a!] means that the list is tail strict (List2 a)
4) [!a!] means that the list is head and tail strict (List3 a)
5) ![!a!] means that the head-and-tail-strict list argument is
   strict!!!

I think also (though I'm not entirely sure) that these distinctions are
generalized for other data types by talking about element strictness and
spine strictness.
One motivation seems to be that in the absence of whole-program
optimization, the strictness annotations on a function's type can allow
the compiler to avoid creating thunks at the call site for cross-module
calls, whereas using seq in the function body itself means that the
thunk still has to be created at the call site because the compiler
can't possibly know that it's going to be immediately evaluated by seq.

and my own letter earlier on the 6 feb:

>> foo :: !Int -> !Int

KM> (Is the second ! actually meaningful?)

yes! it means that the function is strict in its result - i.e. it can't
return an undefined value when strict arguments are given. this sort of
knowledge should help a compiler to "propagate" strictness and figure
out the parts of the program that can be compiled as strict code.
really, i think ghc is already able to figure out functions with strict
results, just as it is able to figure out strict function arguments.

KM> Personally, I think this is much nicer than sprinkling seq's around,
KM> and generally sufficient. However, there could perhaps be ambiguities?

btw, it's just been implemented in the GHC HEAD

KM> Last time this came up, I think examples resembling these were brought
KM> up:

KM> foo :: [!a] -> ![a] -> a

yes, i remember this SPJ's question :) "[!a]" means that list elements
are strict; it's the same as defining a new list type with strict
elements and using it here. "![a]" means "strict list"; it is the same
as defining a list with the "next" field strict:

data List1 a = Nil1 | List1 !a (List1 a)
data List2 a = Nil2 | List2 a !(List2 a)
data List3 a = Nil3 | List3 !a !(List3 a)

the type List3 is a simple strict list, like in any strict programming
language.

foo :: [!a] -> ![a] -> ![!a] -> a

translates to

foo :: List1 a -> List2 a -> List3 a -> a

KM> foo' :: Map !Int String -> Int -> String

that means that the keys in this map are saved as strict values.
for example, the following definition

type Map a b = [(a,b)]

will be instantiated for Map !Int String as [(!Int, String)]

KM> Anyway, if a reasonable semantics can be formulated, I think
KM> strictness type annotations would be a great, useful, and
KM> relatively non-intrusive (AFAICT, entirely backwards compatible)
KM> addition to Haskell'.

such a proposal already exists, and is supported by the implementation
of this in GHC HEAD

-- 
Best regards,
Bulat                          mailto:[EMAIL PROTECTED]
Re: inside the GHC code generator
Rene de Visser wrote:
> Integer is about 30 times slower than it needs to be on GHC if you have
> over half the values between -2^31 and 2^31. On i386 you can basically
> test the low bit for free (it runs in parallel to the integer add
> instruction). This would allow only values outside this range to
> require calling the long integer code. Such an optimization is not
> easily done in C.

Currently GHC defines

data Integer = S# Int# | J# Int# ByteArray#

So it already avoids using GMP for small integers. There's a "will this
multiplication overflow" primitive, but I'm not sure how it's
implemented.

I think that changing this to use magic pointers would be pretty easy.
You'd need to introduce a new primitive type, say Int31#, and then:

1. anytime you previously constructed a WHNF S# on the heap, make a
   magic pointer instead
2. anytime you dereference a pointer that might be an S#, check for a
   magic pointer first.

Even if a lot of code needs to be changed, it's straightforward because
the changes are local. You're just changing the meaning of a pointer
such that there's a statically allocated S# n at address 2n+1.

It would also be worth trying this for Char#, which is already a 31-bit
type, to see if it would speed up text-processing code.

> If only simple loops are optimized, it will encourage people to always
> code loops in their haskell rather than perhaps using more appropriate
> constructs.

You could say that about any language. When your code needs to be fast
it needs to be fast. I'd rather write ugly Haskell code than ugly C
code, if it comes to that.

> Also take the Maybe data type with Nothing and Just ..., or any other
> datatype with nullary and unary constructors. These could be
> represented by special values for the nullary case and bit marking on
> the unary constructor values.

I'm not sure this would help much. Nullary constructors are already
allocated statically, so they're already represented by special values.
You could check for those values instead of dereferencing the pointer,
but in time-critical code the static object will presumably be in the L1
cache anyway. I could be wrong of course, and it would be easy enough to
extend the Int31# idea to nullary constructors (Int0#).

-- Ben
RE: inside the GHC code generator
| The closest thing I've seen to a solution is the technique used in Urban
| Boquist's thesis, which is to put a static table in the executable with
| information about register and stack usage indexed by procedure return
| point, and use that to unwind the stack and find roots.

Every accurate garbage collector (including GHC's) uses a technique like this, but the solution is invariably tightly-coupled to the garbage collector, compiler and run-time system. A major feature of C-- is that it squarely addresses this problem in a *reusable* way.

Simon
Re: Re[2]: inside the GHC code generator
On 2/24/06, Bulat Ziganshin <[EMAIL PROTECTED]> wrote:
> Hello kyra,
>
> Friday, February 24, 2006, 12:37:02 AM, you wrote:
>
> >> i prefer to see the asm code. this may be because of better high-level
> >> optimization strategies (reusing fib values). the scheme i talk about
> >> will combine advantages of both worlds
> k> no strategies, plain exponential algorithm,
>
> yes, the ocaml compiler works better with the stack. but i'm sure that
> in most cases gcc will outperform ocaml because it has a large number of
> optimizations which are not easy to implement (unrolling, instruction
> scheduling and so on)
>
> k> also, Clean is *EXACTLY* in line with ocaml. This is interesting,
> k> because Clean is so much similar to Haskell.
>
> clean differs from Haskell in its support for unique types and
> strictness annotations. the latter is slowly migrating into GHC in the
> form of bang patterns, but i think that it is a half-solution. i
> mentioned in the original letter my proposals to add strictness
> annotations to function type declarations and to declare strict data
> structures, such as "![Int]"

As I've understood it, Clean's strictness annotations are a bit of a hack which only works on certain built-in types. Am I mistaken here?

-- Friendly, Lemmih
Re: inside the GHC code generator
Hi Bulat, Wow! You make a lot of good suggestions. I think some of them are unfortunately unworkable, some others have been tried before, but there are definitely some to try. I'll suggest a few more in this email.

First of all I should say that we don't *want* to use gcc as a code generator. Everything we've been doing in the back end over the last few years has been aiming towards dropping the dependency on gcc and removing the Evil Mangler. Now is not the time to be spending a lot of effort in beefing up the C back end :-) Our preference is to work on both the native and C-- back ends; the C-- back end has the potential to produce the best code and be the most portable. We also plan to keep the "unregisterised" C back end for the purposes of porting to a new platform; it's only the registerised path we want to lose.

Having said all that, if we can get better code out of the C back end without too much effort, then that's certainly worthwhile.

> translated to something like this:

  factorial:
      _s1BD = *(Sp + 0);
      if (_s1BD != 1) goto c1C4;    // n != 1 ?
      R1 = *(Sp + 4);
      Sp = Sp + 8;
      return (*(Sp + 0))();         // return from factorial
  c1C4:
      _s1BI = _s1BD * (*(Sp + 4));  // n * r
      _s1BF = _s1BD - 1;            // n - 1
      *(Sp + 4) = _s1BI;
      *(Sp + 0) = _s1BF;
      goto factorial;

To be fair, this is only the translation on an architecture that has no argument registers (x86, and currently x86_64, but hopefully not for long). The simple optimisation of changing the final tail call to factorial into a goto to a label at the beginning of the function (as suggested by John Meacham I think) should improve the code generated by gcc for functions like this, and is easy to do.

> 1) because it just was not asked. you can enable gcc optimization by
> adding "-optc-O6" to the cmdline, but this leads to compilation errors
> on part of the modules. it seems that gcc optimization is not compatible
> with the "evil mangler" that is ghc's own optimization trick.

-O3 works; I haven't tried -O6.
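For concreteness, here is the tail-call-to-goto transformation on a hand-written C worker (hypothetical code, not actual GHC output): the first version relies on gcc performing sibling-call optimisation, while the second has the final self call rewritten as a goto to a label at the top of the function, making it an ordinary loop that gcc can unroll and schedule:

```c
/* direct translation: the tail call is a real call unless gcc turns it
   into a jump itself */
int factorial_rec(int n, int r)
{
    if (n == 1) return r;
    return factorial_rec(n - 1, n * r);
}

/* after the tail-call-to-goto transformation: a plain loop that gcc
   optimises like any other C loop */
int factorial_loop(int n, int r)
{
entry:
    if (n == 1) return r;
    r = n * r;
    n = n - 1;
    goto entry;
}
```

Both versions keep the STG code's `n != 1` termination test, so as in the dump above, the n = 0 case is the caller's problem.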
> * C++ calling stack should be used as much as possible
> * parameters are passed in C++ way: "factorial(int n, int r)"

This is unworkable, I'm afraid. Go and read Simon's paper on the STG:

  http://citeseer.ist.psu.edu/peytonjones92implementing.html

There are two main reasons we don't use the C stack: garbage collection and tail calls. What do you plan to do about GC?

> * recursive definitions translated into the for/while loops if possible

Certainly a good idea.

> * groups of mutually recursive definitions should be also "naturally"
> translated as much as possible

> * functions marked INLINE in Haskell code should be marked the same in
> C++ code

An inline function will have been inlined, so I'm not sure what you mean here!

> * maximally speed up using of already evaluated values. instead of
> unconditional jumping to the closure-computation function just test
> this field and continue computation if it contains NULL (indicating
> that the value is already in WHNF). This change will make the time
> penalty of boxing data structures significantly, at least 10 times,
> smaller

This is called semi-tagging; it was tried a long time ago. Certainly worth trying again, but I very much doubt the gains would be anything like you claim, and it increases code size. There are other schemes, including this one that we particularly like: use a spare bit in a pointer to indicate "points to an evaluated value". Since there are two spare bits, you can go further and use the 4 values to mean:

  0: possibly unevaluated
  1: evaluated, tag 0
  2: evaluated, tag 1
  3: evaluated, tag 2

I believe Robert Ennals tried this as part of his spec eval implementation, but IIRC he didn't get separate measurements for it (or it didn't improve things much). Note that this increases code size too.

> * in addition to the ability to define strict fields, allow to define
> strict data structures (say, lists) and strict types/structures of
> function parameters and results.
> that is a long-standing goal, but the "![!Int] -> !Int" function can be
> used inside higher-order code a lot faster than the "[Int] -> Int" one.
> the ultimate goal is to allow Haskell to be as fast as ML dialects in
> every situation, and this means that laziness should be optional in
> every place. this will also allow to make efficient emulation of C++
> virtual methods

Strict types are interesting, and I know several people (including me) who have put some thought into how they would work. It turns out to be surprisingly difficult, though. You might look into doing something simple: suppose you could declare a type to be unlifted:

  data !T = ..

The type T doesn't have a bottom element. Its kind is !, not *, and it can't be used to instantiate a polymorphic type variable, just like GHC's primitive types. This is quite restrictive, but it's simple and pretty easy to implement in GHC as it stands. It gives yo
RE: inside the GHC code generator
| last days i studied GHC code generator, reading -ddumps of all sorts, | GHC sources, various papers and John's appeals :) | | what i carried from this investigation: | | GHC has great high-level optimization. moreover, GHC now allows to | program STG almost directly. when i look at STG code i don't see | anything that can be done better - at least for these simple loops i | tried to compile. i see just unboxed variables, primitive operations | and simple loops represented as tail recursion. fine. | | then STG compiled to the simplified C (called abstract C earlier and | quasi C-- now), what is next can be translated: | | * to real C and then compiled by gcc | * to assembler by build-in simple C-- compiler | * to assembler by some external C-- compiler (at least it is | theoretically possible) Good stuff Bulat. There's plenty of interesting stuff to be done here. However, let me strongly urge you *not* to focus attention primarily on the gcc route. Compiling via C has received a lot of attention over the years, and there are many papers describing cool hacks for doing so. GHC does not do as well as it could. But there are serious obstacles. That's not gcc's fault -- it wasn't designed for this. Accurate GC is one of them, tail calls is another, and there are plenty more smaller things that bite you only after you've invested a lot of time. This way lies madness. C-- was *designed* for this purpose. GHC uses C-- as its intermediate language (just before emitting C). So a good route is this: * Write C-- to C-- optimisations * Then, if you like, translate that code to C. Already you will be doing better than GHC does today, because the C-- to C-- optimiser will let you generate better C * But you can also emit C--, or native code; both of these paths will directly benefit from your C-- optimisations. The point is that none of this relies on Quick C--; you can always use an alternative back end. You can almost do this today. GHC uses C-- as an intermediate language. 
But alas, GHC's code generator does not take advantage of C--'s native calls or parameter passing. Instead, it pushes things on an auxiliary stack etc. (Those Sp memory operations you see.) This isn't necessary. We are planning to split GHC's code generator into two parts:

A) Generate C-- with native calls, with an implicit C-- stack

B) Perform CPS conversion, to eliminate all calls in favour of jumps using an explicit stack

The current code gen is A followed by B. But A is a much more suitable optimisation platform, and gives more flexibility. Chris Thompson, an undergrad at Cambridge, is doing (B) as his undergrad project, although it remains to be seen whether he'll have enough complete to be usable.

Another shortcoming is that the native code generator in GHC isn't capable of dealing with backward jumps to labels (because GHC hasn't needed that so far). But if you did C-- optimisation, you'd probably generate such jumps. It'd be great to beef up the native code gen to handle that.

Many of the optimisations you describe (perhaps all) are readily expressible in the C-- intermediate language, and by working at that level you will be independent of whether the back end is gcc, a native code generator, or Quick C--, or some other C-- compiler. Much better.

Simon
Re[4]: inside the GHC code generator
Hello Jean-Philippe, Friday, February 24, 2006, 2:04:57 AM, you wrote:

JPB> From my (limited) knowledge of GHC backend, the difficult part of your
JPB> plan is that STG is not suited to compilation to native C at all. You
JPB> might need to do quite advanced translation from STG to another
JPB> intermediate language, (as GRIN for example), and then some more
JPB> advanced analysis/optimization before C can be generated.

your knowledge is very limited, because STG has at all times been compiled to C and then compiled by gcc :) moreover, it's very easy to implement efficient compilation of the fac() definition from STG to "natural C". it's a one-hour task, maximum. the real problem is changing the whole environment, including the AST representing C code inside GHC, the RTS and so on

JPB> iirc, the tricky part is the handling of laziness. At any point you
JPB> may end up with a thunk (closure), which ghc handles easily by
JPB> "evaluating" it: it's always a function pointer. (That's the tagless
JPB> part) When in WHNF, it just calls a very simple function that fills
JPB> registers with the evaluated data. Otherwise, the function call
JPB> performs the reduction to WHNF.

STG attaches explicit typing information to every var. i propose: 1) compile code that works with unboxed values to equivalent "natural C" code 2) for boxed values, MAXIMALLY simplify the test that they are already in WHNF. the current solution leads to two jumps, which is very expensive on modern cpus. i propose to make instead one check and one conditional jump, which will be much cheaper. then, if the value is in WHNF, we just load all the needed data ourselves. if it is not in WHNF, we perform just the same operations as in current GHC.
for example, the wrapper that calls the "int fac(int n, int r)" worker can be something like this:

  if ((p = n->closure)) goto eval_n;
  n_evaluated:
  if ((p = r->closure)) goto eval_r;
  r_evaluated:
  return fac(n->value, r->value);

  eval_n:
  (*p)();   // sets n->closure = NULL and n->value to n's value
  goto n_evaluated;
  eval_r:
  (*p)();
  goto r_evaluated;

JPB> If you want to use the same trick, you'll end up with the same
JPB> problems (bad C). So, you'll want to eliminate thunks statically,
JPB> finally requiring a 'high level' analysis as I suggested above.

really? :)

JPB> Also, allocation + garbage collection is extremely efficient in
JPB> current GHC... C/C++ alloc doesn't even come close. It's entirely
JPB> possible that with even a very fast backend, the better ghc allocator
JPB> will be enough to swing the balance. (see jhc)
JPB> It might be possible to re-use most of it however.

i don't propose to replace everything in the ghc backend, just the way "unboxed" code is compiled and something around it. i just don't know enough about other parts of the ghc backend/rts :)))

-- Best regards, Bulat  mailto:[EMAIL PROTECTED]
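Another reply in this thread suggests keeping evaluatedness in the two spare pointer bits rather than in a closure field. A hypothetical C sketch of that encoding (all names invented; real GHC closures and entry code look nothing like this), with a toy stand-in for entering a thunk:

```c
#include <stdint.h>

typedef uintptr_t Ptr;  /* low two bits: 0 = possibly unevaluated,
                           1..3 = evaluated, constructor tag 0..2 */

static unsigned ptr_bits(Ptr p) { return (unsigned)(p & 3); }
static Ptr      untagged(Ptr p) { return p & ~(uintptr_t)3; }
static Ptr      mark_evaluated(Ptr p, unsigned tag) { return untagged(p) | (tag + 1); }

/* toy stand-in for entering a thunk: pretend evaluation yields tag 2 */
static unsigned demo_enter(Ptr p) { (void)p; return 2; }

/* case-expression scrutinee: branch on the pointer bits and enter the
   closure only when it might still be a thunk */
static unsigned scrutinee_tag(Ptr p, unsigned (*enter)(Ptr))
{
    unsigned b = ptr_bits(p);
    if (b != 0)
        return b - 1;           /* fast path: tag read straight from the pointer */
    return enter(untagged(p));  /* slow path: evaluate, then read the tag */
}
```

The fast path is one mask and one conditional branch; the closure itself is only touched when the bits read 0.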
Re[2]: inside the GHC code generator
Hello Ben, Friday, February 24, 2006, 2:04:26 AM, you wrote:

>> * multiple results can be returned via C++ pair template, if this is
>> efficiently implemented in gcc

BRG> There's a -freg-struct-return option in 2.x, 3.x and 4.x. I think
BRG> it's off by default on many architectures.

thank you! -1 problem. moreover, we can use just plain C for our translation instead of C++

>> * recursive definitions translated into the for/while loops if possible

BRG> I think recent versions of GCC do a good job of this. See
BRG> http://home.in.tum.de/~baueran/thesis/

there is no problem with proper handling of tail calls. the problem is that these loops are not unrolled, generating significantly worse code than an explicit loop. you can see this in the files i attached to the original letter

BRG> All of this efficiency stuff aside, there's a big problem you're
BRG> neglecting: GARBAGE COLLECTION. For a language that allocates as much
BRG> as Haskell I think a conservative collector is absolutely out of the
BRG> question, because they can't compact the heap and so allocation
BRG> becomes very expensive. A copying collector has to be able to walk
BRG> the stack and find all the roots, and that interacts very badly with
BRG> optimization. All the languages I've seen that compile via C either
BRG> aren't garbage collected or use a conservative or reference-count
BRG> collector.

as i said, we can just use 2 stacks - one, pointed to by the EBP register, contains all boxed values; the second is the hardware stack, pointed to of course by ESP, which contains unboxed values and is managed by gcc as for any other C program. so, boxed parameters go through the EBP-pointed stack and unboxed values are passed via the usual C conventions: int fac(int n, int r). currently, EBP is used for all data and ESP is just not used. moreover, until 1999 the same two-stack scheme was used in GHC.
comparing to the current one-stack scheme, we will need more space for the stacks and may lose something because memory usage patterns will be slightly less localized

-- Best regards, Bulat  mailto:[EMAIL PROTECTED]
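Bulat's two-stack layout can be imitated in portable C with an explicit pointer stack: boxed (GC-traced) values go on a separate array that the collector can scan as its root set, while unboxed locals and arguments are left entirely to the C compiler. A sketch under those assumptions (all names hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* separate stack for boxed values; a copying collector can scan
   ptr_stack[0 .. ptr_sp) to find and update every root */
#define PTR_STACK_SIZE 4096
static void  *ptr_stack[PTR_STACK_SIZE];
static size_t ptr_sp = 0;

static void push_boxed(void *p)
{
    assert(ptr_sp < PTR_STACK_SIZE);
    ptr_stack[ptr_sp++] = p;
}

static void *pop_boxed(void)
{
    assert(ptr_sp > 0);
    return ptr_stack[--ptr_sp];
}

/* unboxed worker: its int arguments live in registers or on the
   hardware stack and need no tracing at all */
static int fac(int n, int r) { return n == 1 ? r : fac(n - 1, n * r); }
```

The cost Bulat mentions shows up here too: every boxed value crossing a call boundary needs an explicit push/pop pair on the software stack.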
Re[2]: inside the GHC code generator
Hello kyra, Friday, February 24, 2006, 12:37:02 AM, you wrote:

>> i prefer to see the asm code. this may be because of better high-level
>> optimization strategies (reusing fib values). the scheme i talk about
>> will combine advantages of both worlds

k> no strategies, plain exponential algorithm,

yes, the ocaml compiler works better with the stack. but i'm sure that in most cases gcc will outperform ocaml because it has a large number of optimizations which are not easy to implement (unrolling, instruction scheduling and so on)

k> also, Clean is *EXACTLY* in line with ocaml. This is interesting,
k> because Clean is so much similar to Haskell.

clean differs from Haskell in its support for unique types and strictness annotations. the latter is slowly migrating into GHC in the form of bang patterns, but i think that it is a half-solution. i mentioned in the original letter my proposals to add strictness annotations to function type declarations and to declare strict data structures, such as "![Int]"

-- Best regards, Bulat  mailto:[EMAIL PROTECTED]
Re: Re[2]: inside the GHC code generator
Hello Bulat,

From my (limited) knowledge of the GHC backend, the difficult part of your plan is that STG is not suited to compilation to native C at all. You might need to do quite an advanced translation from STG to another intermediate language (as GRIN, for example), and then some more advanced analysis/optimization before C can be generated.

iirc, the tricky part is the handling of laziness. At any point you may end up with a thunk (closure), which ghc handles easily by "evaluating" it: it's always a function pointer. (That's the tagless part.) When in WHNF, it just calls a very simple function that fills registers with the evaluated data. Otherwise, the function call performs the reduction to WHNF.

If you want to use the same trick, you'll end up with the same problems (bad C). So, you'll want to eliminate thunks statically, finally requiring a 'high level' analysis as I suggested above.

Also, allocation + garbage collection is extremely efficient in current GHC... C/C++ alloc doesn't even come close. It's entirely possible that even with a very fast backend, the better ghc allocator will be enough to swing the balance. (see jhc) It might be possible to re-use most of it however.

I certainly don't want to discourage you (we all want faster code :), but there is no easy path to climb the mountain ;)

Cheers, JP.
Re: inside the GHC code generator
[EMAIL PROTECTED] wrote:

> * multiple results can be returned via C++ pair template, if this is
> efficiently implemented in gcc

There's a -freg-struct-return option in 2.x, 3.x and 4.x. I think it's off by default on many architectures.

> * recursive definitions translated into the for/while loops if possible

I think recent versions of GCC do a good job of this. See http://home.in.tum.de/~baueran/thesis/

All of this efficiency stuff aside, there's a big problem you're neglecting: GARBAGE COLLECTION. For a language that allocates as much as Haskell I think a conservative collector is absolutely out of the question, because they can't compact the heap and so allocation becomes very expensive. A copying collector has to be able to walk the stack and find all the roots, and that interacts very badly with optimization. All the languages I've seen that compile via C either aren't garbage collected or use a conservative or reference-count collector.

The closest thing I've seen to a solution is the technique used in Urban Boquist's thesis, which is to put a static table in the executable with information about register and stack usage indexed by procedure return point, and use that to unwind the stack and find roots. Some C compilers (including gcc) support debugging of optimized code, so it might be possible to parse the debugging tables and extract this information. As far as I know, no one has ever looked into the feasibility of doing this, which is kind of surprising since root-finding is probably the single biggest obstacle to compiling functional languages via C.

-- Ben
Re: inside the GHC code generator
Bulat Ziganshin wrote:

> i prefer to see the asm code. this may be because of better high-level
> optimization strategies (reusing fib values). the scheme i talk about
> will combine advantages of both worlds

no strategies, plain exponential algorithm,

ocaml:

_camlFibo__fib_57:
        sub     esp, 8
L101:
        cmp     eax, 5
        jge     L100
        mov     eax, 3
        add     esp, 8
        ret
L100:
        mov     DWORD PTR 0[esp], eax
        add     eax, -4
        call    _camlFibo__fib_57
L102:
        mov     DWORD PTR 4[esp], eax
        mov     eax, DWORD PTR 0[esp]
        add     eax, -2
        call    _camlFibo__fib_57
L103:
        mov     ebx, DWORD PTR 4[esp]
        add     eax, ebx
        dec     eax
        add     esp, 8
        ret

visual C++ 7.1 (next fastest):

_fib    PROC NEAR
        push    esi
        mov     esi, DWORD PTR 8[esp]
        cmp     esi, 2
        jge     SHORT $L945
        mov     eax, 1
        pop     esi
        ret     0
$L945:
        lea     eax, DWORD PTR [esi-2]
        push    edi
        push    eax
        call    _fib
        dec     esi
        push    esi
        mov     edi, eax
        call    _fib
        add     esp, 8
        add     eax, edi
        pop     edi
        pop     esi
        ret     0
_fib    ENDP

also, Clean is *EXACTLY* in line with ocaml. This is interesting, because Clean is so much similar to Haskell.
RE: Re[2]: inside the GHC code generator
From: Bulat Ziganshin <[EMAIL PROTECTED]>
> i answered in the original letter (search for "Cray" :)

Re-reading this, I see that you have well-defined goals that cover most of my points.

> it seems that you haven't seen the attached files. tail calls are
> optimized by gcc

No, I don't see any attached files (on any of your emails). Best of luck with your optimisations. I will look forward to using them. (Although I don't like the style of the shootout code, I find it very useful in helping me speed up my own code, and yes, I find it better to write Haskell in such a style than to use the FFI and call C.)

Rene.
Re[2]: inside the GHC code generator
Hello Rene, Thursday, February 23, 2006, 10:17:40 PM, you wrote:

RdV> Maybe GHC should generate better C. I am just not sure whether this
RdV> will bring the best global (as opposed to micro-optimization)
RdV> performance.

i answered in the original letter (search for "Cray" :)

RdV> Generating better C might also be much more difficult for idioms
RdV> that don't exist in C,

i will be glad to see concrete examples

RdV> and might encourage people to write Haskell that looks like C
RdV> (in order to get C-like performance).

yes, they do just that now (see the shootout entries or my Streams library) but still don't get C performance. so i think that we will end up in a somewhat better position :)

RdV> I prefer idiomatic Haskell. It would be nice if this could be fast.
RdV> But can idiomatic Haskell be translated to efficient C? It might not
RdV> have many direct loops for example.

sorry, i don't understand what you mean. in general, idiomatic Haskell has a big efficiency problem, and this problem is laziness. strict Haskell code, like ML code, can be compiled efficiently

RdV> -- Other notes

RdV> Integer is about 30 times slower than it needs to be on GHC if you
RdV> have over half the values between -2^31 and 2^31. On i386 you can
RdV> basically test the low bit for free (it runs in parallel to the
RdV> integer add instruction). This would allow only values outside this
RdV> range to require calling the long integer code. Such an optimization
RdV> is not easily done in C.
RdV> This encourages Haskell programmers to always use Int, even if the
RdV> value might get too big, because Integer is too slow.

this optimization has not been implemented in ghc so far, and i think that the amount of work required to implement it is bigger than what faster Integer processing demands. moreover, i hope that a good optimizing C compiler can avoid the additional testing

RdV> Also, tail recursion is more general than looping. A general tail
RdV> call optimization will bring better returns than a loop optimization
RdV> in GHC (though I could be wrong here). This requires special stack
RdV> handling. Also not so easy in C.

it seems that you haven't seen the attached files. tail calls are optimized by gcc

RdV> If only simple loops are optimized it will encourage people to
RdV> always code loops in their haskell rather than perhaps using more
RdV> appropriate constructs.

would you prefer that people have no option to make fast code? :) also i want to emphasize that the C loop is the Haskell recursion. we always said that these two constructs are equivalent; now it's time to generate code with the same speed

RdV> Also take the Maybe data type with Nothing and Just ... or any other
RdV> datatypes with 0 and 1 variable constructors. Here these could be
RdV> represented by special values for the 0-variable case and bit marking
RdV> on the single-constructor values. This could lead to good
RdV> optimizations on case expressions. Also not so easy in C.

1) it is far from current ghc optimization facilities 2) i don't see problems with using if()

RdV> The lack of this feature encourages people to encode their datatypes
RdV> as Int's to get speed. Also not good.

RdV> Whether we can communicate the non-aliasing and aliasing properties
RdV> of GHC to the C compiler, I am also not so sure.

concrete examples?

RdV> Also in GHC, I am not sure whether stack-based locals are the best
RdV> move. It might be best to have a fixed page as register spill in
RdV> some cases.

that's an optimization gcc can handle much better :) just look at the code generated for the fac() function

RdV> If you wish to pass and return unboxed tuples without reboxing you
RdV> will probably require a special function interface with double entry
RdV> points (I think this can be done in C, but it is a bit tricky).

as i said, probably a pair can be used, but i haven't tested this yet

-- Best regards, Bulat  mailto:[EMAIL PROTECTED]
Re: inside the GHC code generator
From: Bulat Ziganshin <[EMAIL PROTECTED]>
> Hello Rene, i'm done reading. my question - did YOU read this? the lisp
> problems have almost nothing in common with haskell

It is a long time since I read it, but some things come to mind, listed below.

Maybe GHC should generate better C. I am just not sure whether this will bring the best global (as opposed to micro-optimization) performance. However doing anything else might be difficult.

Generating better C might also be much more difficult for idioms that don't exist in C, and might encourage people to write Haskell that looks like C (in order to get C-like performance). I prefer idiomatic Haskell. It would be nice if this could be fast. But can idiomatic Haskell be translated to efficient C? It might not have many direct loops, for example.

-- Other notes

Integer is about 30 times slower than it needs to be on GHC if you have over half the values between -2^31 and 2^31. On i386 you can basically test the low bit for free (it runs in parallel to the integer add instruction). This would allow only values outside this range to require calling the long integer code. Such an optimization is not easily done in C. This encourages Haskell programmers to always use Int, even if the value might get too big, because Integer is too slow.

Also, tail recursion is more general than looping. A general tail call optimization will bring better returns than a loop optimization in GHC (though I could be wrong here). This requires special stack handling. Also not so easy in C.

If only simple loops are optimized it will encourage people to always code loops in their haskell rather than perhaps using more appropriate constructs.

Also take the Maybe data type with Nothing and Just ... or any other datatypes with 0 and 1 variable constructors. Here these could be represented by special values for the 0-variable case and bit marking on the single-constructor values. This could lead to good optimizations on case expressions.
Also not so easy in C.

The lack of this feature encourages people to encode their datatypes as Int's to get speed. Also not good.

Whether we can communicate the non-aliasing and aliasing properties of GHC to the C compiler, I am also not so sure.

Also in GHC, I am not sure whether stack-based locals are the best move. It might be best to have a fixed page as register spill in some cases.

If you wish to pass and return unboxed tuples without reboxing you will probably require a special function interface with double entry points (I think this can be done in C, but it is a bit tricky).

Summary: If only C-like Haskell is fast, we end up with Haskell that looks like C. Yuck.

Rene.
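The special-value representation Rene describes for Maybe-like types might look as follows in C (a hypothetical encoding, not what GHC does): Nothing becomes a reserved non-pointer value, and Just x becomes the word-aligned payload pointer with a mark bit set:

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t MaybePtr;    /* NOTHING, or a marked pointer to the payload */

#define NOTHING ((MaybePtr)0)  /* reserved value for the nullary constructor */

static MaybePtr just(void *x)       { return (uintptr_t)x | 1; }  /* x must be word-aligned */
static int      is_just(MaybePtr m) { return m != NOTHING; }

static void *from_just(MaybePtr m)
{
    assert(is_just(m));
    return (void *)(m & ~(uintptr_t)1);  /* strip the mark bit */
}
```

A case expression on such a value becomes a compare against NOTHING plus a mask, with no heap dereference needed to identify the constructor.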
Re[2]: inside the GHC code generator
Bulat Ziganshin writes:
> Hello Kevin,
>
> KG> Also, ghc used to be faster than gcc for a naive, recursive factorial
> KG> function (once the cpr analysis and optimisation was added). From
> KG> what Bulat wrote it seems that gcc got better ...
>
> i don't say that we must compile recursive Haskell/STG functions
> naively to recursive C ones (as jhc does). no - i say that we should
> translate recursive Haskell definitions to explicit C loops, which is
> NATURAL C PROGRAMMING STYLE and therefore optimized much better, as you
> can see in the files i attached

Ahhh, OK. I misunderstood. I don't claim ghc beat a C loop. ghc was just more optimised at making function calls :-).

k
Re[2]: inside the GHC code generator
Hello Kevin, Thursday, February 23, 2006, 9:06:25 PM, you wrote:

KG> On a related point, Mercury has two C backends: a low-level one at the
KG> level of GHC's and a high-level one. Bulat might want to read this for
KG> a description of the high-level C implementation:
KG>
KG> http://www.cs.mu.oz.au/research/mercury/information/papers.html#hlc_cc

quoting from this paper's abstract: "Many logic programming implementations compile to C, but they compile to very low-level C, and thus discard many of the advantages of compiling to a high-level language". that's exactly what i think

KG> Also, ghc used to be faster than gcc for a naive, recursive factorial
KG> function (once the cpr analysis and optimisation was added). From
KG> what Bulat wrote it seems that gcc got better ...

i don't say that we must compile recursive Haskell/STG functions naively to recursive C ones (as jhc does). no - i say that we should translate recursive Haskell definitions to explicit C loops, which is NATURAL C PROGRAMMING STYLE and therefore optimized much better, as you can see in the files i attached

-- Best regards, Bulat  mailto:[EMAIL PROTECTED]
Re[2]: inside the GHC code generator
Hello Claus, Thursday, February 23, 2006, 8:56:57 PM, you wrote:

>> the long answer is: have you ever heard promises that gcc is the best
>> cpu-independent assembler? no? and you know why? because gcc is not a
>> cpu-independent assembler. gcc was strongly optimized to make
>> efficient asm from the code usually written by C programmers. but the
>> code generated by ghc has nothing in common with it. so we are stuck
>> with all these register-memory moves, non-unrolled loops and all the
>> other results of naive compilation. gcc just doesn't know how to
>> efficiently handle such code!

CR> would there be any advantage to targetting gcc's backend directly?

CR> I notice that Mercury seems to support this:
CR> http://www.cs.mu.oz.au/research/mercury/download/gcc-backend.html
CR> http://gcc.gnu.org/frontends.html

CR> that is, does using C as assembler disable possible optimizations,
CR> or is going through the C frontend enabling more optimizations than
CR> going to the backend directly?

i will read this. but one disadvantage is obvious - we can't use other C compilers, including the more efficient intel c/c++ (for my code it was consistently 10% faster)

-- Best regards, Bulat  mailto:[EMAIL PROTECTED]
Re: inside the GHC code generator
Claus Reinke writes:

> > the long answer is: have you ever heard promises that gcc is the best
> > cpu-independent assembler? no? and do you know why? because gcc is not
> > a cpu-independent assembler. gcc was heavily tuned to make efficient
> > asm from the code usually written by C programmers, but code generated
> > by ghc has nothing in common with it. so we are stuck with all these
> > register-memory moves, non-unrolled loops and all the other results of
> > naive compilation. gcc just doesn't know how to handle such code
> > efficiently!
>
> would there be any advantage to targetting gcc's backend directly?
>
> I notice that Mercury seems to support this:
> http://www.cs.mu.oz.au/research/mercury/download/gcc-backend.html
> http://gcc.gnu.org/frontends.html
>
> that is, does using C as assembler disable possible optimizations,
> or is going through the C frontend enabling more optimizations than
> going to the backend directly?

On a related point, Mercury has two C backends: a low-level one at the
level of GHC's, and a high-level one. Bulat might want to read this for
a description of the high-level C implementation:

http://www.cs.mu.oz.au/research/mercury/information/papers.html#hlc_cc

Also, ghc used to be faster than gcc for a naive, recursive factorial
function (once the CPR analysis and optimisation was added). From what
Bulat wrote it seems that gcc got better ...

k
Re: inside the GHC code generator
Hello Rene,

Thursday, February 23, 2006, 4:19:15 PM, you wrote:

RdV> The following link gives reasons for not generating via C
RdV> http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&selm=4zp8kn7xe.fsf_-_%40beta.franz.com

I have finished reading it. my question is - have YOU read it? the Lisp
problems have almost nothing in common with Haskell

-- 
Best regards,
 Bulat                          mailto:[EMAIL PROTECTED]
Re[2]: inside the GHC code generator
Hello Rene,

Thursday, February 23, 2006, 5:32:21 PM, you wrote:

>> it seems that you don't understand the situation. ghc compiles Haskell
>> to a language called "core", does almost all optimizations at the
>> level of this language, then translates the final result into the
>> "STG" language, from which the C-- code is generated. changing the
>> translation of STG can't prevent ANY ghc optimization, although it is
>> not so easy because ghc code generation and the RTS are closely tied
>> together

RdV> I should have been a bit clearer here. I meant that optimizations that are
RdV> available from STG -> Assembler are better than STG -> C -> Assembler.

theoretically - yes. in practice, it is hard to compete with gcc. ghc
has waited 15 years for such code generation (waiting for the mountain
to come to us :) and I think that the REAL way to reach gcc's level of
optimization is to compile to idiomatic C instead of continuing to wait
for this theoretically possible optimization to arrive

RdV> GHC currently doesn't do most of the optimizations I am thinking of:
RdV> -- Bit tagging to reduce pointer chasing and speed up pattern matching.
RdV>    Due to memory latency and speed it is quicker to do bit masking
RdV>    rather than memory reads.
RdV> -- Parameter passing and register usage optimizations that rely on the
RdV>    structure of the RTS.
RdV> -- Multiple stacks with custom frame layout.
RdV> -- Dynamic code optimization etc.
RdV> -- Taking advantage of special assembler instructions and flags.

I think that all these things can be done the "idiomatic C" way

RdV> Though I have also seen comments that you can do a lot of these with GCC
RdV> if you do your own stack and parameter management, i.e. don't use the C
RdV> stack at all.

as I see in the current code generation, attempts to do our own stack
management lead to gcc simply not understanding such code and not
optimizing it as well as it could.
please compare the code generated for the fac() function with all the
other variants

RdV> Though your suggestions are probably better than nothing, which is
RdV> probably what the alternative is (for instance I have not sufficient
RdV> time to work on these things).

moreover, I am sure that you can't compete with gcc :)

RdV> Note that I didn't say that the assembly generation of OCAML was better
RdV> than GCC, just that it was comparable.

what does "comparable" mean? and what should we do to reuse this code
generation in ghc? at least I know what to do to reuse gcc's code
generation. can you propose a similar plan to reuse the ocaml code
generation?

-- 
Best regards,
 Bulat                          mailto:[EMAIL PROTECTED]
Re[2]: inside the GHC code generator
Hello kyra,

Thursday, February 23, 2006, 5:38:54 PM, you wrote:

k> Bulat Ziganshin wrote:
>> I think that ocaml can't generate code better than gcc and especially
>> icc (the Intel C/C++ compiler), but maybe I'm wrong? ;)
k> didn't try factorial, but exponential fib in ocaml is *FASTER* than both
k> gcc and intel c/c++ with highest optimization levels

I would prefer to see the asm code. this may be due to better high-level
optimization strategies (reusing fib values). the scheme I am talking
about will combine the advantages of both worlds

-- 
Best regards,
 Bulat                          mailto:[EMAIL PROTECTED]
Re: inside the GHC code generator
> the long answer is: have you ever heard promises that gcc is the best
> cpu-independent assembler? no? and do you know why? because gcc is not
> a cpu-independent assembler. gcc was heavily tuned to make efficient
> asm from the code usually written by C programmers, but code generated
> by ghc has nothing in common with it. so we are stuck with all these
> register-memory moves, non-unrolled loops and all the other results of
> naive compilation. gcc just doesn't know how to handle such code
> efficiently!

would there be any advantage to targetting gcc's backend directly?

I notice that Mercury seems to support this:
http://www.cs.mu.oz.au/research/mercury/download/gcc-backend.html
http://gcc.gnu.org/frontends.html

that is, does using C as assembler disable possible optimizations,
or is going through the C frontend enabling more optimizations than
going to the backend directly?

just curious,
claus
Re: inside the GHC code generator
Bulat Ziganshin wrote:

> I think that ocaml can't generate code better than gcc and especially
> icc (the Intel C/C++ compiler), but maybe I'm wrong? ;)

didn't try factorial, but exponential fib in ocaml is *FASTER* than both
gcc and intel c/c++ with highest optimization levels
Re: inside the GHC code generator
> it seems that you don't understand the situation. ghc compiles Haskell
> to a language called "core", does almost all optimizations at the level
> of this language, then translates the final result into the "STG"
> language, from which the C-- code is generated. changing the
> translation of STG can't prevent ANY ghc optimization, although it is
> not so easy because ghc code generation and the RTS are closely tied
> together

I should have been a bit clearer here. I meant that optimizations that
are available from STG -> Assembler are better than STG -> C ->
Assembler.

GHC currently doesn't do most of the optimizations I am thinking of:

-- Bit tagging to reduce pointer chasing and speed up pattern matching.
   Due to memory latency and speed it is quicker to do bit masking
   rather than memory reads.
-- Parameter passing and register usage optimizations that rely on the
   structure of the RTS.
-- Multiple stacks with custom frame layout.
-- Dynamic code optimization etc.
-- Taking advantage of special assembler instructions and flags.

Though I have also seen comments that you can do a lot of these with GCC
if you do your own stack and parameter management, i.e. don't use the C
stack at all.

Though your suggestions are probably better than nothing, which is
probably what the alternative is (for instance I have not sufficient
time to work on these things).

Note that I didn't say that the assembly generation of OCAML was better
than GCC, just that it was comparable.

Rene.
Re: inside the GHC code generator
Hello Rene,

Thursday, February 23, 2006, 4:19:15 PM, you wrote:

RdV> The following link gives reasons for not generating via C
RdV> http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&selm=4zp8kn7xe.fsf_-_%40beta.franz.com

I am reading it now

RdV> Naturally a number of these are Common Lisp specific, however I think
RdV> that Haskell and GCC are quite semantically different, so using GCC
RdV> might prevent a lot of optimizations that aren't possible in C but
RdV> would be in GHC.

it seems that you don't understand the situation. ghc compiles Haskell
to a language called "core", does almost all optimizations at the level
of this language, then translates the final result into the "STG"
language, from which the C-- code is generated. changing the translation
of STG can't prevent ANY ghc optimization, although it is not so easy
because ghc code generation and the RTS are closely tied together

RdV> The OCAML-OPT backend is supposed to produce i386 assembly that is
RdV> competitive with GCC. Maybe this could be ported to GHC?

can you please show the code for the factorial accumulating function:

    factorial n r = if n == 1 then r else factorial (n - 1) (n * r)

I think that ocaml can't generate code better than gcc and especially
icc (the Intel C/C++ compiler), but maybe I'm wrong? ;)

-- 
Best regards,
 Bulat                          mailto:[EMAIL PROTECTED]
inside the GHC code generator
The following link gives reasons for not generating via C:

http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&selm=4zp8kn7xe.fsf_-_%40beta.franz.com

Naturally a number of these are Common Lisp specific; however, I think
that Haskell and GCC are quite semantically different, so using GCC
might prevent a lot of optimizations that aren't possible in C but would
be in GHC.

The OCAML-OPT backend is supposed to produce i386 assembly that is
competitive with GCC. Maybe this could be ported to GHC?

Rene.
inside the GHC code generator
Hello

over the last days I studied the GHC code generator, reading -ddumps of
all sorts, GHC sources, various papers and John's appeals :)

what I carried away from this investigation:

GHC has great high-level optimization. moreover, GHC now allows one to
program STG almost directly. when I look at STG code I don't see
anything that could be done better - at least for these simple loops I
tried to compile. I see just unboxed variables, primitive operations and
simple loops represented as tail recursion. fine.

then STG is compiled to the simplified C (called abstract C earlier and
quasi C-- now), which can next be translated:

* to real C and then compiled by gcc
* to assembler by the built-in simple C-- compiler
* to assembler by some external C-- compiler (at least it is
  theoretically possible)

what is this "simplified C"? it is really a generalized abstract
assembler for modern cpus. its most important feature for us is that all
parameters of STG functions are passed here via the stack. so, for
example, the following STG code:

    factorial n r = if n == 1 then r else factorial (n - 1) (n * r)

is translated to something like this:

    factorial:
        _s1BD = *(Sp + 0);
        if (_s1BD != 1) goto c1C4;    // n != 1 ?
        R1 = *(Sp + 4);
        Sp = Sp + 8;
        return (*(Sp + 0))();         // return from factorial
    c1C4:
        _s1BI = _s1BD * (*(Sp + 4));  // n*r
        _s1BF = _s1BD - 1;            // n-1
        *(Sp + 4) = _s1BI;
        *(Sp + 0) = _s1BF;
        goto factorial;

(for clarity, I rewrote the STG code as its Haskell equivalent and the
C-- as C, plus made a few comments)

is this code good? it depends :) if we view it as assembler code, it is
really bad - for just 2 instructions that perform real work, there are 4
memory accesses, two jumps and a comparison. unfortunately, this
translation is what GHC really does when it is called in "-fasm" mode.
now you should realize why even strict Haskell code compiles to such
slow executables. well, there is also the gcc path ("-fvia-C",
auto-enabled by "-O"), but it is only several percent better (as the GHC
docs say).
why can't this shining C compiler optimize such trivial code?

1) because it is simply not asked to. you can enable gcc optimization by
adding "-optc-O6" to the cmdline, but this leads to compilation errors
on some modules. it seems that gcc optimization is not compatible with
the "evil mangler", which is ghc's own optimization trick.

2) moreover, enabling gcc optimization seems to make noticeable
improvements only on tight loops like this factorial calculation. and
even in this case the asm code produced is far worse than for a normal C
program that uses an explicit loop (see the enclosed fac.c and fac-c.s,
which contain factorial functions manually written in C using tail
recursion and an explicit loop):

    factorial:
        movl    (%ebp), %edx
        cmpl    $1, %edx
        je      L6
        movl    4(%ebp), %eax
        imull   %edx, %eax
        decl    %edx
        movl    %edx, (%ebp)
        movl    %eax, 4(%ebp)
        jmp     factorial

as you can see here, gcc has not unrolled the loop and has retained all
the stack<->register moves. why?

the short answer is that gcc simply doesn't know that the data pointed
to by the Sp pointer is non-volatile - it can't be changed in the middle
of the program by some async computation or exception, and it isn't
required to be updated immediately. maybe there is some way to tell gcc
this? that would immediately SUBSTANTIALLY improve all ghc code
generation

the long answer is: have you ever heard promises that gcc is the best
cpu-independent assembler? no? and do you know why? because gcc is not a
cpu-independent assembler. gcc was heavily tuned to make efficient asm
from the code usually written by C programmers, but code generated by
ghc has nothing in common with it. so we are stuck with all these
register-memory moves, non-unrolled loops and all the other results of
naive compilation. gcc just doesn't know how to handle such code
efficiently!

so now we have a rather strange situation: ghc generates not-so-bad
"abstract asm" code, but there is just no tool to compile this code to
efficient "concrete asm"!
although that is perfectly possible, and I can imagine a translation
that uses registers to store these two internal variables, unrolls the
loop and schedules instructions to fully load a modern superscalar cpu.

at this moment I recalled the existence of C-- and now, I think, I
understand the whole history:

- long ago, in 1992, Simon PJ developed this abstract asm and the rules
for translating STG to it (you can find details in
spineless-tagless-gmachine.ps.gz). this "portable asm" was represented
in practice by low-level C code.

- then he stood waiting for efficient compilation of ghc-generated C
code by gcc or some other C compiler

- in 1999 he realized that this mountain would not walk to him and
started the C-- project. its goal was precisely to implement a portable
assembler and, in particular, to at last achieve efficient compilation
of ghc