Re: Register Allocation issues with microblaze-elf

2013-02-14 Thread Michael Veksler

On 02/14/2013 03:28 AM, Vladimir Makarov wrote:

On 13-02-13 6:36 PM, Michael Eager wrote:


[snip]


I thought about register pressure causing this, but I think that should
cause spilling of one of the registers which were not used in this long
sequence, rather than causing a large number of additional loads.

Longer-living pseudos have a higher probability of being spilled, as they
usually conflict with more pseudos during their lives, and spilling them
frees a hard reg for many conflicting pseudos.  That is how RA heuristics
work (the old RA used log (live range span): the bigger the value, the
higher the probability of spilling).
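
Roughly, the kind of spill-choice priority described above can be sketched
like this (illustrative C only, with made-up types and field names, not
GCC's actual heuristic):

#include <math.h>

/* Illustrative sketch only -- made-up struct and fields, not GCC internals.
   A higher value means "prefer to spill this pseudo first".  */
struct pseudo_info
{
  double spill_cost;      /* frequency-weighted cost of its loads/stores  */
  int num_conflicts;      /* how many other pseudos it conflicts with     */
  int live_range_span;    /* how many insns (program points) it spans     */
};

static double
spill_priority (const struct pseudo_info *p)
{
  /* Spilling a long-living pseudo that conflicts with many others frees a
     hard reg for many pseudos, so weigh that relief against its own cost.
     The old RA used log (live range span) as the relief estimate.  */
  double relief = p->num_conflicts * log ((double) p->live_range_span + 1.0);
  return relief / (p->spill_cost + 1.0);
}

A pseudo that spans many insns and conflicts with many others, relative to
its own use cost, is the cheapest one to push to memory.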

Perhaps the cost analysis has a problem.


I've checked it and it looks ok to me.
RA is focused on generating faster code.  Looking at the fragment you
provided, it is hard to say much about it.  I tried -Os with gcc4.8 and it
generates the desirable code for the fragment in question (by the way, the
peak register pressure decreased to 66 in this case).


It's both larger and slower, since the additional loads take much longer.
I'll take a look at -Os.

It looks like the values of p++ are being pre-calculated and stored on the
stack.  This results in a load, rather than an increment of a register.

If so, it might be another optimization which moves the p++ calculations
higher.  IRA does not do that (more correctly, a new IRA feature implemented
by Bernd Schmidt in gcc4.7 can move insns downwards in the CFG to decrease
reg pressure).


I checked all rtl passes; these calculations are not created by any RTL
pass.  So it is probably some tree-ssa optimization.

Any industrial RA uses heuristic algorithms; in some cases better heuristics
can work worse than worse heuristics.  So you should probably check whether
there is any progress moving from gcc4.1 to gcc4.6 from a performance point
of view on a variety of benchmarks.  Introducing IRA improved code for x86
by 4% on SPEC2000.  Subsequent work (like using dynamic register classes)
brought further performance improvements.


My impression is that the performance is worse.  Reports I've seen are that
the code is substantially larger, which means more instructions.

I'm skeptical about comparisons between x86 and RISC processors.  What works
well for one may not work well for the other.

IRA improved code for many RISC processors, although better RA has a
smaller effect on these processors, as they have more registers.

Looking at the test code, I can make some conclusions for myself:

o We need a common pass for decreasing reg pressure (I already expressed
this in the past) as optimizations become more aggressive.  Some progress
was made to make a few optimizations aware of RA (reg-pressure scheduling,
loop-invariant motion, and code hoisting), but there are too many passes and
it is wrong and impossible to make them all aware of RA.  Some
register-pressure-decreasing heuristics are difficult to implement in RA
(like insn rearrangement or complex rematerialization) and this pass could
focus on them.


That might be useful.

o Implement RA live range splitting in regions other than loops or a BB
(now IRA does splitting only on loop bounds and LRA within a BB; the old RA
had no live range splitting at all).


Each of the blocks of code is in its own BB.  I haven't checked, but I'd
guess that most of the registers are in use on entry and still live on exit,
so the block has no registers to allocate.


Splitting at BB scope is not profitable in this case.
I'd also recommend trying the following RA-related options:
-fira-loop-pressure, -fsched-pressure, -fira-algorithm=CB|priority,
-fira-region=one|all|mixed.  Actually, -fira-algorithm=priority +
-fira-region=one is an analog of what the old RA did.





I am reading this thread and getting more and more puzzled. The RA stuff is
very complicated, having many constraints and many dependencies with other
passes. Taking this into account, it seems that no heuristic algorithm can
even get close to an optimal register allocation. A heuristic algorithm
can't take all effects and side-effects into account simultaneously.

Considering all that, why does GCC not use generic optimization algorithms
for RA? A generic optimizer can take all issues into account simultaneously.
I am talking about ILP/MIP (Integer Linear Programming/Mixed IP), SAT, and
CSP [1] (Constraint Satisfaction Problem). There has been a lot of progress
in these areas, and the solvers are much faster (orders of magnitude) than
they were 10 years ago.

So why aren't ILP/SAT/CSP used in RA? I don't think they would work much
slower than what RA does today (they would be somewhat slower, but not
much). I do believe that there is a good chance to get much better results
from RA with these technologies.


I'd like to invest some time into a feasibility check, if someone is 
willing to work
with me on modeling the RA problem (or at least several examples, such 
as the above).


Michael

[1] Even though CSP 

Re: Register Allocation issues with microblaze-elf

2013-02-14 Thread Vladimir Makarov

On 02/14/2013 03:46 AM, Michael Veksler wrote:

[snip]

Considering all that, why does GCC not use generic optimization algorithms
for RA? A generic optimizer can take all issues into account simultaneously.
I am talking about ILP/MIP (Integer Linear Programming/Mixed IP), SAT, and
CSP [1] (Constraint Satisfaction Problem). There has been a lot of progress
in these areas, and the solvers are much faster (orders of magnitude) than
they were 10 years ago.


Actually, I tried an ILP solver for the first time in 2003.  I return to
this about every 3 years, and every time I fail to produce something useful
(at least a solution acceptable in some cases which makes the compiler only
ten times slower).  I've tried a lot of models, from more sophisticated ones
to simpler ones, pure solvers or hybrids of heuristics and ILP solutions.
All solutions were too slow or 

Re: Register Allocation issues with microblaze-elf

2013-02-14 Thread Michael Veksler

On 02/14/2013 06:31 PM, Vladimir Makarov wrote:

On 02/14/2013 03:46 AM, Michael Veksler wrote:

[snip]
I am reading this thread and getting more and more puzzled. The RA stuff is
very complicated, having many constraints and many dependencies with other
passes. Taking this into account, it seems that no heuristic algorithm can
even get close to an optimal register allocation. A heuristic algorithm
can't take all effects and side-effects into account simultaneously.

Considering all that, why does GCC not use generic optimization algorithms
for RA? A generic optimizer can take all issues into account simultaneously.
I am talking about ILP/MIP (Integer Linear Programming/Mixed IP), SAT, and
CSP [1] (Constraint Satisfaction Problem). There has been a lot of progress
in these areas, and the solvers are much faster (orders of magnitude) than
they were 10 years ago.


Actually, I tried an ILP solver for the first time in 2003.  I return to
this about every 3 years, and every time I fail to produce something useful
(at least a solution acceptable in some cases which makes the compiler only
ten times slower).  I've tried a lot of models, from more sophisticated ones
to simpler ones, pure solvers or hybrids of heuristics and ILP solutions.
All solutions were too slow or generated no better code.
In many cases ILP is faster than CP (Constraint Programming), but there are
quite a few cases where CP and SAT are much faster. Speed issues with
modeling RA as ILP do not rule out either SAT or CP.


Last time, I had hoped to use the massive parallelism of GPUs for this.
After investing much of my time in learning simplex algorithm theory (the
basis of ILP solvers), I found that the simple tableau-based algorithm
parallelizes very well, but for RA tasks even using 1000 GPU elements it is
still behind the revised simplex algorithm, which is not parallelized.  GPUs
are not good for very sparse matrix problems, which is the case for RA
problems.

More and more this sounds like a case for CP or SAT.


Still, some important RA problems (like rematerialization) are very hard to
describe with ILP.
It sounds like your model requires many 0/1 variables. If so, then maybe
SAT is a better choice than ILP.


I don't see any progress; ILP solvers may be ten times faster, but they
still have exponential complexity.  I don't know of any industrial compiler
which uses an exact solver for RA (RA in GCC is even more complicated, as it
requires code selection too).

But the solver need not be exact. ILP uses branch and bound, which can be
interrupted after some time, giving the best solution so far. The best
solution so far could be good enough for RA. I have seen huge SAT instances
solved in a fraction of a second. These problems have a very clear structure
that the solver takes advantage of, but why should the RA problem be
different?
So why aren't ILP/SAT/CSP used in RA? I don't think they would work much
slower than what RA does today (they would be somewhat slower, but not
much). I do believe that there is a good chance to get much better results
from RA with these technologies.


You know, some LLVM guys had the same thoughts and, as I remember, they had
a PBQP-based RA (I tried this too; they had too many different RAs), and now
they have ended up with just a good heuristic RA, which works best.
I don't know how PBQP works for RA, but if it uses branch & bound over
integer variables then it may have performance issues there.
I'd like to invest some time into a feasibility check, if someone is 
willing to work
with me on modeling the RA problem (or at least several examples, 
such as the above).



Good luck with that.  I am serious.

I need some help to get a basic model, even a hand-written model for a
given input to RA, just as an example (before diving deeper into it). I'd be
happy if you or somebody else could guide me through it. At least, where
should I look in the source of GCC to figure some of these things out, or
where can I find some documentation/papers/presentations on RA and its input
data structures (and output requirements)?

Michael


Re: Register Allocation issues with microblaze-elf

2013-02-14 Thread Vladimir Makarov

On 02/14/2013 02:41 PM, Michael Veksler wrote:

...

It sounds like your model requires many 0/1 variables. If so, then maybe
SAT is a better choice than ILP.


Yes, it is binary ILP.



I don't see any progress; ILP solvers may be ten times faster, but they
still have exponential complexity.  I don't know of any industrial compiler
which uses an exact solver for RA (RA in GCC is even more complicated, as it
requires code selection too).

But the solver need not be exact. ILP uses branch and bound, which can be
interrupted after some time, giving the best solution so far. The best
solution so far could be good enough for RA. I have seen huge SAT instances
solved in a fraction of a second. These problems have a very clear structure
that the solver takes advantage of, but why should the RA problem be
different?
As I wrote, I used this approach too.  By the way, I used GLPK as it is
pretty good at letting you control the solving process (when to stop, what
to consider first, in what order to consider sub-problems, and so on),
although it might not be the fastest open source MILP solver.
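
For what it's worth, the kind of control meant here looks roughly like the
sketch below: a toy binary problem with the branch-and-bound knobs set,
assuming GLPK's C API (glp_iocp, glp_intopt) and linking with -lglpk.  It is
only an illustration, not GCC code.

#include <stdio.h>
#include <glpk.h>

int main (void)
{
  glp_prob *mip = glp_create_prob ();
  glp_set_obj_dir (mip, GLP_MIN);

  /* Two binary variables x1, x2 with costs 3 and 5.  */
  glp_add_cols (mip, 2);
  glp_set_col_kind (mip, 1, GLP_BV);  glp_set_obj_coef (mip, 1, 3.0);
  glp_set_col_kind (mip, 2, GLP_BV);  glp_set_obj_coef (mip, 2, 5.0);

  /* One constraint: x1 + x2 >= 1.  */
  glp_add_rows (mip, 1);
  glp_set_row_bnds (mip, 1, GLP_LO, 1.0, 0.0);
  int ind[] = {0, 1, 2};          /* 1-based index/value arrays */
  double val[] = {0.0, 1.0, 1.0};
  glp_set_mat_row (mip, 1, 2, ind, val);

  /* The knobs: when to stop and how to steer the search.  */
  glp_iocp parm;
  glp_init_iocp (&parm);
  parm.presolve = GLP_ON;      /* solve the LP relaxation internally     */
  parm.tm_lim   = 1000;        /* stop after 1 second, keep best so far  */
  parm.mip_gap  = 0.05;        /* or stop within 5% of the optimum       */
  parm.br_tech  = GLP_BR_PCH;  /* pseudo-cost branching                  */

  if (glp_intopt (mip, &parm) == 0)
    printf ("best objective found: %g\n", glp_mip_obj_val (mip));

  glp_delete_prob (mip);
  return 0;
}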


So why aren't ILP/SAT/CSP used in RA? I don't think they would work much
slower than what RA does today (they would be somewhat slower, but not
much). I do believe that there is a good chance to get much better results
from RA with these technologies.


You know, some LLVM guys had the same thoughts and, as I remember, they had
a PBQP-based RA (I tried this too; they had too many different RAs), and now
they have ended up with just a good heuristic RA, which works best.
I don't know how PBQP works for RA, but if it uses branch & bound over
integer variables then it may have performance issues there.
If I remember correctly, the person who invented this tried to promote it
in LLVM.
I'd like to invest some time into a feasibility check, if someone is 
willing to work
with me on modeling the RA problem (or at least several examples, 
such as the above).



Good luck with that.  I am serious.
I need some help to get a basic model, even a hand-written model for a
given input to RA, just as an example (before diving deeper into it). I'd be
happy if you or somebody else could guide me through it. At least, where
should I look in the source of GCC to figure some of these things out, or
where can I find some documentation/papers/presentations on RA and its input
data structures (and output requirements)?

There are tons of articles about optimal RA (fewer about optimal insn
scheduling).


You could start with the Goodwin/Wilken or Appel/George ILP approach.  Add
choosing of insn alternatives (a 0/1 var for each alternative of each insn,
and constraints so that exactly one alternative is used for each insn).
That is specific to GCC, as GCC partially does code selection in RA.  That
will define the possible sets of registers or memory for the pseudos which
are operands of the insn.  You can add a cost for each alternative (as RTL
insn alternatives can represent different insns with different execution
times) to the cost function.  If you use the Appel/George approach, add
constraints and cost-function terms for coalescing and rematerialization.
If you want to go further, you could add the scheduling problem too, because
RA and insn scheduling have conflicting goals.
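
Concretely, the core of such a base model might look roughly like the
sketch below (my own guess at a formulation, not taken verbatim from any of
those papers): with binary variables x_{p,r} (pseudo p is assigned hard reg
r), s_p (p is spilled to memory) and y_{i,a} (insn i uses alternative a),

\begin{align*}
\min\quad & \sum_p \mathrm{cost}_p\, s_p \;+\; \sum_{i,a} \mathrm{cost}_{i,a}\, y_{i,a} \\
\text{s.t.}\quad & \sum_r x_{p,r} + s_p = 1 \quad \text{for every pseudo } p \\
& x_{p,r} + x_{q,r} \le 1 \quad \text{for every conflicting pair } (p,q) \text{ and hard reg } r \\
& \sum_a y_{i,a} = 1 \quad \text{for every insn } i \\
& x_{p,r} \le \sum_{a\,:\; r \in \mathrm{class}(i,a,p)} y_{i,a} \quad \text{for every operand } p \text{ of insn } i \text{ and hard reg } r \\
& x_{p,r},\ s_p,\ y_{i,a} \in \{0,1\}
\end{align*}

Coalescing, rematerialization and the scheduling part would add further
variables and cost terms on top of this.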


That will probably be the most general base problem.  Now you can simplify
this model in different ways.  But don't simplify it too much: for example,
just solving register assignment with ILP makes no sense, as graph coloring
heuristics based on Kempe's criterion of trivial colorability solve this
problem optimally in probably 99% of cases.


Please don't forget that even an optimal solution can work worse than a
heuristic one, as insn execution frequency evaluation is approximate.  Some
heuristics work very well with uncertain or approximate execution
frequencies.  Profile-based compilation is in some cases not possible or
very inconvenient.  I saw other cases when the optimal solution, even with
profile-based execution frequencies, generated worse code, as the model
probably does not take the generated memory footprint into account.


The following articles (presentations) could help you to understand GCC RA:

ftp://gcc.gnu.org/pub/gcc/summit/2004/Fighting%20Register%20Pressure.pdf
http://ols.fedoraproject.org/GCC/Reprints-2007/makarov-reprint.pdf
http://vmakarov.fedorapeople.org/LRA.html




Re: Register Allocation issues with microblaze-elf

2013-02-13 Thread Vladimir Makarov

On 13-02-13 1:36 AM, Michael Eager wrote:

Hi --

I'm seeing register allocation problems and code size increases
with gcc-4.6.2 (and gcc-head) compared with older (gcc-4.1.2).
Both are compiled using -O3.

One test case that I have has a long series of nested if's
each with the same comparison and similar computation.

if (n<max_no){
  n+=*(cp-*p++);
  if (n<max_no){
    n+=*(cp-*p);
    if (n<max_no){
      . . .  ~20 levels of nesting
         more computations with 'cp' and 'p'
      . . . }}}

Gcc-4.6.2 generates many blocks like the following:
lwi     r28,r1,68    -- load into dead reg
lwi     r31,r1,140   -- load p from stack
lbui    r28,r31,0
rsubk   r31,r28,r19
lbui    r31,r31,0
addk    r29,r29,r31
swi     r31,r1,308
lwi     r31,r1,428   -- load of max_no from stack
cmp     r28,r31,r29  -- n in r29
bgeid   r28,$L46

gcc-4.1.2 generates the following:
lbui    r3,r26,3
rsubk   r3,r3,r19
lbui    r3,r3,0
addk    r30,r30,r3
swi     r3,r1,80
cmp     r18,r9,r30   -- max_no in r9, n in r30
bgei    r18,$L6

gcc-4.6.2 (and gcc-head) load max_no from the stack in each block.
There also are extra loads into r28 (which is not used) and r31 at
the start of each block.  Only r28, r29, and r31 are used.

I'm having a hard time telling what is happening or why.  The
IRA dump has this line:
   Ignoring reg 772, has equiv memory
where pseudo 772 is loaded with max_no early in the function.

The reload dump has
Reloads for insn # 254
Reload 0: reload_in (SI) = (reg/v:SI 722 [ max_no ])
GR_REGS, RELOAD_FOR_INPUT (opnum = 1)
reload_in_reg: (reg/v:SI 722 [ max_no ])
reload_reg_rtx: (reg:SI 31 r31)
and similar for each of the other insns using 722.

This is followed by
  Spilling for insn 254.
  Using reg 31 for reload 0
for each insn using pseudo 722.

Any idea what is going on?

So many changes have happened since then (7 years ago) that it is very hard
for me to say anything definite.  I also have no gcc-4.1 microblaze (as I
see, microblaze was added to public gcc in version 4.6), which makes it even
more difficult for me to say something useful.


First of all, the new RA was introduced in gcc4.4 (IRA) which uses 
different heuristics (Chaitin-Briggs graph coloring vs Chow's priority RA).


We could blame IRA if the RA had the same starting conditions in gcc4.1 and
gcc4.6-gcc4.8.  But I am sure it does not: more aggressive optimizations
create higher register pressure.  I compared the peak reg pressure in the
test for gcc4.6 and gcc4.8; it became higher (from 102 to 106).  I guess the
increase was even bigger since gcc4.1.


RA is focused on generating faster code.  Looking at the fragment you
provided, it is hard to say much about it.  I tried -Os with gcc4.8 and it
generates the desirable code for the fragment in question (by the way, the
peak register pressure decreased to 66 in this case).


Any industrial RA uses heuristic algorithms; in some cases better heuristics
can work worse than worse heuristics.  So you should probably check whether
there is any progress moving from gcc4.1 to gcc4.6 from a performance point
of view on a variety of benchmarks.  Introducing IRA improved code for x86
by 4% on SPEC2000.  Subsequent work (like using dynamic register classes)
brought further performance improvements.


Looking at the test code, I can make some conclusions for myself:

o We need a common pass for decreasing reg pressure (I already expressed
this in the past) as optimizations become more aggressive.  Some progress
was made to make a few optimizations aware of RA (reg-pressure scheduling,
loop-invariant motion, and code hoisting), but there are too many passes and
it is wrong and impossible to make them all aware of RA.  Some
register-pressure-decreasing heuristics are difficult to implement in RA
(like insn rearrangement or complex rematerialization) and this pass could
focus on them.


o Implement RA live range splitting in regions other than loops or a BB
(now IRA does splitting only on loop bounds and LRA within a BB; the old RA
had no live range splitting at all).


I'd also recommend trying the following RA-related options:
-fira-loop-pressure, -fsched-pressure, -fira-algorithm=CB|priority,
-fira-region=one|all|mixed.  Actually, -fira-algorithm=priority +
-fira-region=one is an analog of what the old RA did.
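
For example, on the attached test case that could look like the untested
command lines below (assuming the same mb-gcc cross compiler and s.c source
used elsewhere in this thread):

mb-gcc -O3 -mhard-float -fira-loop-pressure -fsched-pressure -c s.c
mb-gcc -O3 -mhard-float -fira-algorithm=priority -fira-region=one -c s.c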


I hope I answered your question.



Re: Register Allocation issues with microblaze-elf

2013-02-13 Thread Michael Eager

On 02/13/2013 02:38 PM, Vladimir Makarov wrote:

On 13-02-13 1:36 AM, Michael Eager wrote:

Hi --

I'm seeing register allocation problems and code size increases
with gcc-4.6.2 (and gcc-head) compared with older (gcc-4.1.2).
Both are compiled using -O3.

One test case that I have has a long series of nested if's
each with the same comparison and similar computation.

if (n<max_no){
  n+=*(cp-*p++);
  if (n<max_no){
    n+=*(cp-*p);
    if (n<max_no){
      . . .  ~20 levels of nesting
         more computations with 'cp' and 'p'
      . . . }}}

Gcc-4.6.2 generates many blocks like the following:
lwi     r28,r1,68    -- load into dead reg
lwi     r31,r1,140   -- load p from stack
lbui    r28,r31,0
rsubk   r31,r28,r19
lbui    r31,r31,0
addk    r29,r29,r31
swi     r31,r1,308
lwi     r31,r1,428   -- load of max_no from stack
cmp     r28,r31,r29  -- n in r29
bgeid   r28,$L46

gcc-4.1.2 generates the following:
lbui    r3,r26,3
rsubk   r3,r3,r19
lbui    r3,r3,0
addk    r30,r30,r3
swi     r3,r1,80
cmp     r18,r9,r30   -- max_no in r9, n in r30
bgei    r18,$L6

gcc-4.6.2 (and gcc-head) load max_no from the stack in each block.
There also are extra loads into r28 (which is not used) and r31 at
the start of each block.  Only r28, r29, and r31 are used.

I'm having a hard time telling what is happening or why.  The
IRA dump has this line:
   Ignoring reg 772, has equiv memory
where pseudo 772 is loaded with max_no early in the function.

The reload dump has
Reloads for insn # 254
Reload 0: reload_in (SI) = (reg/v:SI 722 [ max_no ])
GR_REGS, RELOAD_FOR_INPUT (opnum = 1)
reload_in_reg: (reg/v:SI 722 [ max_no ])
reload_reg_rtx: (reg:SI 31 r31)
and similar for each of the other insns using 722.

This is followed by
  Spilling for insn 254.
  Using reg 31 for reload 0
for each insn using pseudo 722.

Any idea what is going on?


So many changes have happened since then (7 years ago) that it is very hard
for me to say anything definite.  I also have no gcc-4.1 microblaze (as I
see, microblaze was added to public gcc in version 4.6), which makes it even
more difficult for me to say something useful.

First of all, the new RA was introduced in gcc4.4 (IRA) which uses different 
heuristics
(Chaitin-Briggs graph coloring vs Chow's priority RA).

We could blame IRA if the RA had the same starting conditions in gcc4.1 and
gcc4.6-gcc4.8.  But I am sure it does not: more aggressive optimizations
create higher register pressure.  I compared the peak reg pressure in the
test for gcc4.6 and gcc4.8; it became higher (from 102 to 106).  I guess the
increase was even bigger since gcc4.1.


I thought about register pressure causing this, but I think that should cause
spilling of one of the registers which were not used in this long sequence,
rather than causing a large number of additional loads.

Perhaps the cost analysis has a problem.


RA is focused on generating faster code.  Looking at the fragment you
provided, it is hard to say much about it.  I tried -Os with gcc4.8 and it
generates the desirable code for the fragment in question (by the way, the
peak register pressure decreased to 66 in this case).


It's both larger and slower, since the additional loads take much longer.  I'll 
take a
look at -Os.

It looks like the values of p++ are being pre-calculated and stored on the 
stack.  This results in
a load, rather than an increment of a register.


Any industrial RA uses heuristic algorithms; in some cases better heuristics
can work worse than worse heuristics.  So you should probably check whether
there is any progress moving from gcc4.1 to gcc4.6 from a performance point
of view on a variety of benchmarks.  Introducing IRA improved code for x86
by 4% on SPEC2000.  Subsequent work (like using dynamic register classes)
brought further performance improvements.


My impression is that the performance is worse.  Reports I've seen are that the 
code is
substantially larger, which means more instructions.

I'm skeptical about comparisons between x86 and RISC processors.  What works 
well
for one may not work well for the other.


Looking at the test code, I can make some conclusions for myself:

o We need a common pass for decreasing reg pressure (I already expressed
this in the past) as optimizations become more aggressive.  Some progress
was made to make a few optimizations aware of RA (reg-pressure scheduling,
loop-invariant motion, and code hoisting), but there are too many passes and
it is wrong and impossible to make them all aware of RA.  Some
register-pressure-decreasing heuristics are difficult to implement in RA
(like insn rearrangement or complex rematerialization) and this pass could
focus on them.


That might be useful.


o Implement RA live range splitting in regions other than loops or a BB
(now IRA does splitting only on loop bounds and LRA within a BB; the old
RA had 

Re: Register Allocation issues with microblaze-elf

2013-02-13 Thread Vladimir Makarov

On 13-02-13 6:36 PM, Michael Eager wrote:

On 02/13/2013 02:38 PM, Vladimir Makarov wrote:

On 13-02-13 1:36 AM, Michael Eager wrote:

Hi --

I'm seeing register allocation problems and code size increases
with gcc-4.6.2 (and gcc-head) compared with older (gcc-4.1.2).
Both are compiled using -O3.

One test case that I have has a long series of nested if's
each with the same comparison and similar computation.

if (n<max_no){
  n+=*(cp-*p++);
  if (n<max_no){
    n+=*(cp-*p);
    if (n<max_no){
      . . .  ~20 levels of nesting
         more computations with 'cp' and 'p'
      . . . }}}

Gcc-4.6.2 generates many blocks like the following:
lwi     r28,r1,68    -- load into dead reg
lwi     r31,r1,140   -- load p from stack
lbui    r28,r31,0
rsubk   r31,r28,r19
lbui    r31,r31,0
addk    r29,r29,r31
swi     r31,r1,308
lwi     r31,r1,428   -- load of max_no from stack
cmp     r28,r31,r29  -- n in r29
bgeid   r28,$L46

gcc-4.1.2 generates the following:
lbui    r3,r26,3
rsubk   r3,r3,r19
lbui    r3,r3,0
addk    r30,r30,r3
swi     r3,r1,80
cmp     r18,r9,r30   -- max_no in r9, n in r30
bgei    r18,$L6

gcc-4.6.2 (and gcc-head) load max_no from the stack in each block.
There also are extra loads into r28 (which is not used) and r31 at
the start of each block.  Only r28, r29, and r31 are used.

I'm having a hard time telling what is happening or why.  The
IRA dump has this line:
   Ignoring reg 772, has equiv memory
where pseudo 772 is loaded with max_no early in the function.

The reload dump has
Reloads for insn # 254
Reload 0: reload_in (SI) = (reg/v:SI 722 [ max_no ])
GR_REGS, RELOAD_FOR_INPUT (opnum = 1)
reload_in_reg: (reg/v:SI 722 [ max_no ])
reload_reg_rtx: (reg:SI 31 r31)
and similar for each of the other insns using 722.

This is followed by
  Spilling for insn 254.
  Using reg 31 for reload 0
for each insn using pseudo 722.

Any idea what is going on?

So many changes have happened since then (7 years ago) that it is very hard
for me to say anything definite.  I also have no gcc-4.1 microblaze (as I
see, microblaze was added to public gcc in version 4.6), which makes it even
more difficult for me to say something useful.

First of all, the new RA was introduced in gcc4.4 (IRA) which uses 
different heuristics

(Chaitin-Briggs graph coloring vs Chow's priority RA).

We could blame IRA if the RA had the same starting conditions in gcc4.1 and
gcc4.6-gcc4.8.  But I am sure it does not: more aggressive optimizations
create higher register pressure.  I compared the peak reg pressure in the
test for gcc4.6 and gcc4.8; it became higher (from 102 to 106).  I guess the
increase was even bigger since gcc4.1.


I thought about register pressure causing this, but I think that should
cause spilling of one of the registers which were not used in this long
sequence, rather than causing a large number of additional loads.

Longer-living pseudos have a higher probability of being spilled, as they
usually conflict with more pseudos during their lives, and spilling them
frees a hard reg for many conflicting pseudos.  That is how RA heuristics
work (the old RA used log (live range span): the bigger the value, the
higher the probability of spilling).

Perhaps the cost analysis has a problem.


I've checked it and it looks ok to me.
RA is focused on generating faster code.  Looking at the fragment you
provided, it is hard to say much about it.  I tried -Os with gcc4.8 and it
generates the desirable code for the fragment in question (by the way, the
peak register pressure decreased to 66 in this case).


It's both larger and slower, since the additional loads take much longer.
I'll take a look at -Os.

It looks like the values of p++ are being pre-calculated and stored on the
stack.  This results in a load, rather than an increment of a register.

If so, it might be another optimization which moves the p++ calculations
higher.  IRA does not do that (more correctly, a new IRA feature implemented
by Bernd Schmidt in gcc4.7 can move insns downwards in the CFG to decrease
reg pressure).


I checked all rtl passes; these calculations are not created by any RTL
pass.  So it is probably some tree-ssa optimization.

Any industrial RA uses heuristic algorithms; in some cases better heuristics
can work worse than worse heuristics.  So you should probably check whether
there is any progress moving from gcc4.1 to gcc4.6 from a performance point
of view on a variety of benchmarks.  Introducing IRA improved code for x86
by 4% on SPEC2000.  Subsequent work (like using dynamic register classes)
brought further performance improvements.


My impression is that the performance is worse.  Reports I've seen are that
the code is substantially larger, which means more instructions.

I'm skeptical about comparisons between x86 and RISC processors.  What
works well for one may not work well for 

Re: Register Allocation issues with microblaze-elf

2013-02-13 Thread Edgar E. Iglesias
On Thu, Feb 14, 2013 at 12:36:46AM +0100, Michael Eager wrote:
 [snip]
 It's both larger and slower, since the additional loads take much longer.  
 I'll take a
 look at -Os.
 
 It looks like the values of p++ are being pre-calculated and stored on the 
 stack.  This results in
 a load, rather than an increment of a register.

Hi,

I remember having a similar issue about a year ago. IIRC, I found that
the ivopts pass was transforming things badly for microblaze. Disabling
it helped a lot.

I can't tell if you are seeing the same thing, but it might be worth
trying -fno-ivopts in case you haven't already.
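
For example, just adding the flag to the compile line from the attached
test case (untested):

mb-gcc -O3 -mhard-float -fno-ivopts -c s.c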

Cheers,
Edgar


Re: Register Allocation issues with microblaze-elf

2013-02-13 Thread Michael Eager

On 02/13/2013 11:24 PM, Edgar E. Iglesias wrote:

[snip]

Hi,

I remember having a similar issue about a year ago. IIRC, I found that
the ivopts pass was transforming things badly for microblaze. Disabling
it helped a lot.

I can't tell if you are seeing the same thing, but it might be worth
trying -fno-ivopts in case you haven't already.


Thanks.  I'll see if that helps.


--
Michael Eager    ea...@eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077


Register Allocation issues with microblaze-elf

2013-02-12 Thread Michael Eager

Hi --

I'm seeing register allocation problems and code size increases
with gcc-4.6.2 (and gcc-head) compared with older (gcc-4.1.2).
Both are compiled using -O3.

One test case that I have has a long series of nested if's
each with the same comparison and similar computation.

if (n<max_no){
  n+=*(cp-*p++);
  if (n<max_no){
    n+=*(cp-*p);
    if (n<max_no){
      . . .  ~20 levels of nesting
         more computations with 'cp' and 'p'
      . . . }}}

Gcc-4.6.2 generates many blocks like the following:
lwi r28,r1,68   -- load into dead reg
lwi r31,r1,140  -- load p from stack
lbui    r28,r31,0
rsubk   r31,r28,r19
lbui    r31,r31,0
addk    r29,r29,r31
swi r31,r1,308
lwi r31,r1,428  -- load of max_no from stack
cmp r28,r31,r29 -- n in r29
bgeid   r28,$L46

gcc-4.1.2 generates the following:
lbui    r3,r26,3
rsubk   r3,r3,r19
lbui    r3,r3,0
addk    r30,r30,r3
swi r3,r1,80
cmp r18,r9,r30  -- max_no in r9, n in r30
bgeir18,$L6

gcc-4.6.2 (and gcc-head) load max_no from the stack in each block.
There also are extra loads into r28 (which is not used) and r31 at
the start of each block.  Only r28, r29, and r31 are used.

I'm having a hard time telling what is happening or why.  The
IRA dump has this line:
   Ignoring reg 772, has equiv memory
where pseudo 772 is loaded with max_no early in the function.

The reload dump has
Reloads for insn # 254
Reload 0: reload_in (SI) = (reg/v:SI 722 [ max_no ])
GR_REGS, RELOAD_FOR_INPUT (opnum = 1)
reload_in_reg: (reg/v:SI 722 [ max_no ])
reload_reg_rtx: (reg:SI 31 r31)
and similar for each of the other insns using 722.

This is followed by
  Spilling for insn 254.
  Using reg 31 for reload 0
for each insn using pseudo 722.

Any idea what is going on?

--
Michael Eager    ea...@eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077
#if 0
mb-gcc -O3 -mhard-float -fdump-rtl-all -c s.c -save-temps
#endif
typedef unsigned char uchar;
typedef struct {int x,y,info, dx, dy, I;} CORNER_LIST[15000];
susan_corners(in,r,bp,max_no,corner_list,x_size,y_size)
  uchar *in, *bp;
  int *r, max_no, x_size, y_size;
  CORNER_LIST corner_list;
{
int n,x,y,sq,xx,yy,
  i,j,*cgx,*cgy;
float divide;
uchar c,*p,*cp;

  for (i=5;i<y_size-5;i++)
    for (j=5;j<x_size-5;j++)
  {
n=100;
p=in + (i-3)*x_size + j - 1;
cp=bp + in[i*x_size+j];

n+=*(cp-*p++);
n+=*(cp-*p++);
n+=*(cp-*p);
p+=x_size-3;

n+=*(cp-*p++);
n+=*(cp-*p++);
n+=*(cp-*p++);
n+=*(cp-*p++);
n+=*(cp-*p);
p+=x_size-5;

n+=*(cp-*p++);
n+=*(cp-*p++);
n+=*(cp-*p++);
n+=*(cp-*p++);
n+=*(cp-*p++);
n+=*(cp-*p++);
n+=*(cp-*p);
p+=x_size-6;

n+=*(cp-*p++);
n+=*(cp-*p++);
n+=*(cp-*p);
  if (n<max_no){
    p+=2;
    n+=*(cp-*p++);
#if 1
    if (n<max_no){
      n+=*(cp-*p++);
      if (n<max_no){
        n+=*(cp-*p);
        if (n<max_no){
          p+=x_size-6;
          n+=*(cp-*p++);
          if (n<max_no){
            n+=*(cp-*p++);
            if (n<max_no){
              n+=*(cp-*p++);
              if (n<max_no){
                n+=*(cp-*p++);
                if (n<max_no){
                  n+=*(cp-*p++);
                  if (n<max_no){
                    n+=*(cp-*p++);
                    if (n<max_no){
                      n+=*(cp-*p);
                      if (n<max_no){
                        p+=x_size-5;
                        n+=*(cp-*p++);
                        if (n<max_no){
                          n+=*(cp-*p++);
                          if (n<max_no){
                            n+=*(cp-*p++);
                            if (n<max_no){
                              n+=*(cp-*p++);
                              if (n<max_no){
                                n+=*(cp-*p);
                                if (n<max_no){
                                  p+=x_size-3;
                                  n+=*(cp-*p++);
#endif
                                  if (n<max_no){
                                    n+=*(cp-*p++);
                                    if (n<max_no){
                                      n+=*(cp-*p);

                                      if (n<max_no)
                                        {
                                          x=0;y=0;
                                          p=in + (i-3)*x_size + j - 1;

                                          c=*(cp-*p++);x-=c;y-=3*c;
                                          c=*(cp-*p++);y-=3*c;
                                          c=*(cp-*p);x+=c;y-=3*c;
                                          p+=x_size-3;

                                          c=*(cp-*p++);x-=2*c;y-=2*c;
                                          c=*(cp-*p++);x-=c;y-=2*c;
                                          c=*(cp-*p++);y-=2*c;
                                          c=*(cp-*p++);x+=c;y-=2*c;
                                          c=*(cp-*p);x+=2*c;y-=2*c;
                                          p+=x_size-5;

                                          c=*(cp-*p++);x-=3*c;y-=c;
                                          c=*(cp-*p++);x-=2*c;y-=c;
                                          c=*(cp-*p++);x-=c;y-=c;
                                          c=*(cp-*p++);y-=c;

Re: Register allocation issues

2007-09-06 Thread David Edelsohn
 Matt Lee writes:

Matt The problem is that, though the loads could be optimized by pipelining
Matt them, the register allocator has created a dependency by using only r3
Matt and r4, instead of using the other volatiles.

GCC's register allocator currently is designed to minimize the
number of hard registers.  As Ian mentioned, -frename-registers tries to
perform register renaming with available registers after register
allocation.  As they say, Your Mileage May Vary.
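
For example (with a placeholder target prefix and source file name;
untested):

<target>-gcc -O3 -frename-registers -c func.c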

David



RE: Register allocation issues

2007-09-06 Thread Dave Korn
On 05 September 2007 23:47, Matt Lee wrote:

 Registers r3 to r12 are volatiles. However, for the C code below,
 
 struct foo {
 int a[4];
 } ;
 
 struct foo p, q;
 
 void func ()
 {
  memcpy (&p, &q, sizeof (struct foo));
 }
 
  I am getting an instruction sequence for func() such as,
 
 load r3, q + 0
 load r4, q + 4
 store r3, p + 0
 store r4, p + 4
 load r3, q + 4
 load r4, q + 8
 store r3, p + 4
 store r4, p + 8
 
  The problem is that, though the loads could be optimized by pipelining
  them, the register allocator has created a dependency by using only r3
  and r4, instead of using the other volatiles.

  Does your backend define a movdi pattern?


cheers,
  DaveK
-- 
Can't think of a witty .sigline today



Re: Register allocation issues

2007-09-06 Thread Matt Lee
On 9/6/07, David Edelsohn [EMAIL PROTECTED] wrote:
  Matt Lee writes:

 Matt The problem is that, though the loads could be optimized by pipelining
 Matt them, the register allocator has created a dependency by using only r3
 Matt and r4, instead of using the other volatiles.

 GCC's register allocator currently is designed to minimize the
 number of hard registers.  As Ian mentioned, -frename-registers tries to
 perform register renaming with available registers after register
 allocation.  As they say, Your Mileage May Vary.


There is no point trying to minimize usage of volatile hard registers,
is there? They are precisely there to be used up as much as needed.
The function is a leaf procedure as well, so there are no other
considerations. Lastly, architectures like PPC do make use of more
registers (without -frename-registers), so there has to be something
in the PPC back-end that allows for the liberal use or in mine that
prevents such.

-- 
thanks,
Matt


Re: Register allocation issues

2007-09-06 Thread Matt Lee
On 9/6/07, Dave Korn [EMAIL PROTECTED] wrote:
 On 05 September 2007 23:47, Matt Lee wrote:

  Registers r3 to r12 are volatiles. However, for the C code below,
 
  struct foo {
  int a[4];
  } ;
 
  struct foo p, q;
 
  void func ()
  {
  memcpy (&p, &q, sizeof (struct foo));
  }
 
  I am getting an instruction sequence for func() such as,
 
  load r3, q + 0
  load r4, q + 4
  store r3, p + 0
  store r4, p + 4
  load r3, q + 4
  load r4, q + 8
  store r3, p + 4
  store r4, p + 8
 
  The problem is that, though the loads could be optimized by pipelining
  them, the register allocator has created a dependency by using only r3
  and r4, instead of using the other volatiles.

   Does your backend define a movdi pattern?



Yes it does. But in this case, these are word-by-word moves from
memory to memory and make use of only movsi.
-- 
thanks,
Matt


Re: Register allocation issues

2007-09-06 Thread dje
 Matt Lee writes: 
Date: Thu, 06 Sep 2007 15:02:52 -0400
From: David Edelsohn [EMAIL PROTECTED]

Matt There is no point trying to minimize usage of volatile hard registers,
Matt is there? They are precisely there to be used up as much as needed.
Matt The function is a leaf procedure as well, so there are no other
Matt considerations. Lastly, architectures like PPC do make use of more
Matt registers (without -frename-registers), so there has to be something
Matt in the PPC back-end that allows for the liberal use or in mine that
Matt prevents such.

GCC's RA is mostly tuned for IA-32, which has very few registers.

The rs6000 port defines the movmemsi pattern calling
expand_block_move() which generates many intermediate pseudos.
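
Very roughly, the idea is something like the sketch below (not the actual
rs6000 code; it assumes GCC's internal expander helpers gen_reg_rtx,
adjust_address and emit_move_insn, with the usual backend include chain
omitted):

/* Sketch: expand a word-by-word block copy, giving every load/store pair
   its own fresh pseudo so the allocator and scheduler are not forced to
   reuse the same hard registers.  */
static void
expand_block_move_sketch (rtx dst_mem, rtx src_mem, HOST_WIDE_INT bytes)
{
  HOST_WIDE_INT offset;

  for (offset = 0; offset + 4 <= bytes; offset += 4)
    {
      rtx tmp = gen_reg_rtx (SImode);   /* fresh pseudo for each word */
      emit_move_insn (tmp, adjust_address (src_mem, SImode, offset));
      emit_move_insn (adjust_address (dst_mem, SImode, offset), tmp);
    }
}

Because each word goes through its own pseudo, the allocator is free to
pick different volatile hard regs instead of reusing r3/r4, and the loads
can then be interleaved.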

David



Re: Register allocation issues

2007-09-06 Thread Segher Boessenkool

load r3, q + 0
load r4, q + 4
store r3, p + 0
store r4, p + 4
load r3, q + 4
load r4, q + 8
store r3, p + 4
store r4, p + 8


These last four lines should be

load r3, q + 8
load r4, q + 12
store r3, p + 8
store r4, p + 12


Did you just typo it or do you have a bigger problem?  The
problems might even be connected, who knows :-)


Segher



Re: Register allocation issues

2007-09-06 Thread Matt Lee
On 9/6/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
  Matt Lee writes:
 Date: Thu, 06 Sep 2007 15:02:52 -0400
 From: David Edelsohn [EMAIL PROTECTED]

 Matt There is no point trying to minimize usage of volatile hard registers,
 Matt is there? They are precisely there to be used up as much as needed.
 Matt The function is a leaf procedure as well, so there are no other
 Matt considerations. Lastly, architectures like PPC do make use of more
 Matt registers (without -frename-registers), so there has to be something
 Matt in the PPC back-end that allows for the liberal use or in mine that
 Matt prevents such.

 GCC RA mostly is tuned for IA-32 with very few registers.

 The rs6000 port defines the movmemsi pattern calling
 expand_block_move() which generates many intermediate pseudos.


So does mine. The generated RTL looks right, with a unique pseudo for each
load/store pair. If you look at the .lreg dump attached to the original
email, the problem is that these pseudos get bound to only 2 out of 10
candidate volatile registers.

-- 
thanks,
Matt


Re: Register allocation issues

2007-09-06 Thread Matt Lee
On 9/6/07, Segher Boessenkool [EMAIL PROTECTED] wrote:
  load r3, q + 0
  load r4, q + 4
  store r3, p + 0
  store r4, p + 4
  load r3, q + 4
  load r4, q + 8
  store r3, p + 4
  store r4, p + 8

 These last four lines should be

 load r3, q + 8
 load r4, q + 12
 store r3, p + 8
 store r4, p + 12


 Did you just typo it or do you have a bigger problem?  The
 problems might even be connected, who knows :-)


Sorry, that was a typo.


Register allocation issues

2007-09-05 Thread Matt Lee
Hello,

On my simple RISC architecture I am seeing suboptimal instruction
scheduling with GCC-4.1.1 caused by the way registers are getting
allocated. I am looking for suggestions on what could be wrong in my
description to cause the poor allocation.

More details --

Registers r3 to r12 are volatiles. However, for the C code below,

struct foo {
int a[4];
} ;

struct foo p, q;

void func ()
{
memcpy (&p, &q, sizeof (struct foo));
}

I am getting an instruction sequence for func() such as,

load r3, q + 0
load r4, q + 4
store r3, p + 0
store r4, p + 4
load r3, q + 4
load r4, q + 8
store r3, p + 4
store r4, p + 8

The problem is that, though the loads could be optimized by pipelining
them, the register allocator has created a dependency by using only r3
and r4, instead of using the other volatiles.

Strangely, modifying func to the following helps,

void func (struct foo *p, struct foo *q)
{
memcpy (p, q, sizeof (struct foo));
}

Now the register allocator uses more volatile registers and there are
no false dependencies.

I have attached the header information from the .lreg file for the
poor register allocation case. Pseudos 87 and 88 get assigned again to
r3, r4 instead of new volatiles.

Is this a problem with some poor cost model in my backend? How do I
get more information  about why the register allocator didn't pick the
other volatiles?

-- 
thanks,
Matt


func.lreg
Description: Binary data


Re: Register allocation issues

2007-09-05 Thread Ian Lance Taylor
Matt Lee [EMAIL PROTECTED] writes:

 The problem is that, though the loads could be optimized by pipelining
 them, the register allocator has created a dependency by using only r3
 and r4, instead of using the other volatiles.

Try using -frename-registers.

Ian