Hi Matt,

That's very interesting. Thanks for the explanation.

I also sent an email on this issue to the ghc-bugs mailing list:
http://www.haskell.org//pipermail/glasgow-haskell-bugs/2006-March/006352.html
so I hope you don't mind if I cc this email there too.

On Sun, 2006-03-26 at 21:46 +1100, Matthew Chapman wrote:
> Hi Duncan,
> 
> The relevant regular expression in the mangler looks like this:
> 
> $p =~ s/^\t\.save ar\.pfs, r\d+\n\talloc r\d+ = ar\.pfs, 0, 3[12], \d+, 0\n//;

I found that eventually.

> The part that's failing is where it expects one of the numbers, the
> number of local variables allocated for the function in the register
> stack window, to be 31 or 32.  In this case it's as big as 77!

Right, because the unfolding that ghc's done has given us this one
function in the assembler output with one massive section of
straight-line code.

> The reason this regex is so specific is that it acts as a sanity check.
> The way GHC works on IA64 is that every STG function runs in the same
> register stack window (this allows tailcalls to work).  This register
> stack window has 32 locals allocated, so a function must not use any
> more than this.

So that explains why it didn't work when I merely relaxed the regexp in
the mangler.

> Of these 32 locals, 16-28 are used by GHC for the STG machine's
> registers, 29-31 are used by gcc internally, and 0-15 are used by gcc
> for function locals.  (There are some evil assumptions made about gcc
> internals here :))
> 
> But architecturally IA64 allows up to 96 locals, so as it stands, there
> is nothing stopping gcc from allocating a frame larger than 32, and
> using positions at 32 and above for function locals as well, if it's
> compiling a complex function with a lot of register pressure... which
> is obviously what is happening here.

Indeed. If you look briefly at the source:
http://abridgegame.org/cgi-bin/darcs.cgi/darcs/SHA1.lhs?c=annotate

and then note that darcs is compiling this with the flags:
-O -funfolding-use-threshold20

then one can believe that it compiles to one massive unrolled function
with huge register pressure.

> There are a few possibilities:
> 
> - Make ghc pass an option like -mfixed-range=loc32-loc79 to gcc when
>   compiling, to stop gcc using locals from 32 upwards (N.B. in gcc the
>   locals are only numbered up to 79 and not 95.. probably because it
>   also reserves 8 inputs and 8 outputs).
>   
>   This would be the easiest and most promising option, but I can't seem
>   to get it to work on a test example, so there might be some
>   strangeness/bug in gcc that makes it ineffective.

Yes, it doesn't quite work for me either (gcc (GCC) 3.3.2 20040119
(Gentoo Linux 3.3.2-r7, propolice-3.3-7)):

Prologue junk?:         .proc s64t_ret#
s64t_ret:
        .save ar.pfs, r73
        alloc r73 = ar.pfs, 8, 35, 8, 0
        adds r16 = -24, r12
        adds r17 = -16, r12
        mov r18 = ar.unat
        .savesp ar.unat, 24
        st8 [r16] = r18, 16
        .save.g 0x1
        .mem.offset 16, 0
        st8.spill [r17] = r4, 16
        .save.g 0x2
        .mem.offset 8, 0
        st8.spill [r16] = r5, 16
        .save.g 0x4
        .mem.offset 0, 0
        st8.spill [r17] = r6
        .save.g 0x8
        .mem.offset -8, 0
        st8.spill [r16] = r7
        .body

So we're down to allocating only 35 locals and spilling several more,
but it's not within the 31-32 limit. It also has 8 in the second
argument of alloc, where the mangler regexp expects 0.
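
To make the mismatch concrete, here is a rough sketch in Python (the
real mangler is Perl, and the names PROLOGUE_RE and strips_prologue are
mine, purely for illustration). It applies the same pattern to the two
prologues quoted in this thread, plus a made-up "good" one with 32
locals for comparison:

```python
import re

# Python transcription of the mangler's IA64 prologue pattern quoted above.
# The real mangler is Perl; this is only an illustration.
PROLOGUE_RE = re.compile(
    r"^\t\.save ar\.pfs, r\d+\n\talloc r\d+ = ar\.pfs, 0, 3[12], \d+, 0\n"
)


def strips_prologue(asm: str) -> bool:
    """Return True if the pattern would match (and so strip) the prologue."""
    return PROLOGUE_RE.search(asm) is not None


# Hypothetical prologue with 32 locals (register number invented).
ok = "\t.save ar.pfs, r63\n\talloc r63 = ar.pfs, 0, 32, 8, 0\n"
# The two prologues seen in this thread, with whitespace normalised to tabs.
unrolled = "\t.save ar.pfs, r107\n\talloc r107 = ar.pfs, 0, 77, 8, 0\n"
fixed_range = "\t.save ar.pfs, r73\n\talloc r73 = ar.pfs, 8, 35, 8, 0\n"

for name, asm in [("ok", ok), ("unrolled", unrolled), ("fixed_range", fixed_range)]:
    print(name, strips_prologue(asm))
```

Only the made-up 32-local prologue matches. The 77-local one fails on
the locals count, and the -mfixed-range one fails on both the locals
count and the leading 8.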

I'm going to try again with a more recent version of gcc.

> - Increase the size of the register stack frame used in STG code, and
>   change the allocation of locals so that gcc can allocate more function
>   locals.  This involves a change to the runtime and impacts performance
>   in all cases, not just in the rare case that it's actually needed, so
>   I don't think it's a particularly good solution.

Indeed.

> - Find some other workaround that might just work for now, like the one
>   you've found, and brush it under the carpet :D

Heh, yes.

Perhaps the right solution if we pick this option is to modify the
mangler to give us a better error message if the locals count is not
31-32. For example, it could say that register pressure is too high and
suggest re-compiling with less aggressive optimisations/inlining.
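
Something along these lines, sketched in Python rather than the
mangler's Perl. The names ALLOC_RE and diagnose_prologue and the exact
wording of the messages are invented for illustration, and I'm assuming
the four numeric alloc operands are inputs, locals, outputs and
rotating, in that order:

```python
import re
from typing import Optional

# Looser pattern than the mangler's: capture the four numeric alloc operands
# (assumed to be inputs, locals, outputs, rotating) instead of hard-coding them.
ALLOC_RE = re.compile(r"\talloc r\d+ = ar\.pfs, (\d+), (\d+), (\d+), (\d+)\n")


def diagnose_prologue(asm: str) -> Optional[str]:
    """Return None if the frame looks acceptable, else a hint for the user."""
    m = ALLOC_RE.search(asm)
    if m is None:
        return "no alloc instruction found in prologue"
    inputs, n_locals, _outputs, rotating = (int(g) for g in m.groups())
    if inputs != 0 or rotating != 0:
        return (f"unexpected register frame shape "
                f"(inputs={inputs}, rotating={rotating})")
    if n_locals > 32:
        return (f"function allocates {n_locals} locals but at most 32 are "
                "available; register pressure is too high, try re-compiling "
                "with less aggressive optimisation/inlining")
    if n_locals not in (31, 32):
        return f"unexpected locals count {n_locals} (expected 31 or 32)"
    return None


# The prologue from the original darcs build failure in this thread.
unrolled = "\t.save ar.pfs, r107\n\talloc r107 = ar.pfs, 0, 77, 8, 0\n"
print(diagnose_prologue(unrolled))
```

The point is only that the mangler already has the numbers in hand when
the match fails, so it could report them instead of the generic
"Prologue junk?" message.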

> Feel free to ask if you want clarification on any of this.

Thanks very much.

Duncan

> On Sat, Mar 25, 2006 at 07:15:04PM +0000, Duncan Coutts wrote:
> > On Sat, 2006-03-25 at 17:17 +0000, Duncan Coutts wrote:
> > 
> > > I'm now running into a problem when building darcs (version 1.0.6). I
> > > think this must be a problem with the mangler:
> > > 
> > > ghc  -cpp  -package QuickCheck -package util -package parsec -O
> > > -funbox-strict-fields  -Wall -Werror -I. -DHAVE_CURSES -DHAVE_CURL
> > > -no-auto-all -funfolding-use-threshold20 -c SHA1.lhs
> > > Prologue junk?:         .proc s64t_ret#
> > > s64t_ret:
> > >          .save ar.pfs, r107
> > >          alloc r107 = ar.pfs, 0, 77, 8, 0
> > >          .body
> > 
> > Turns out this doesn't happen when we remove the flag
> > -funfolding-use-threshold20.
> > 
> > So I suspect the problem here is that the use of that flag generates
> > much more unrolling and register pressure. Perhaps the mangler isn't
> > dealing properly with register spilling, or excessive register spilling.
> > 
> > Duncan
