Re: Register Allocation issues with microblaze-elf
On 02/14/2013 03:28 AM, Vladimir Makarov wrote: On 13-02-13 6:36 PM, Michael Eager wrote: [snip]

I thought about register pressure causing this, but I think that should cause spilling of one of the registers which were not used in this long sequence, rather than causing a large number of additional loads.

Longer-living pseudos have a higher probability of being spilled, as they usually conflict with more pseudos during their lifetimes, and spilling them frees a hard reg for many conflicting pseudos. That is how RA heuristics work (in the old RA, log(live range span) was used: the bigger the value, the higher the probability of spilling).

Perhaps the cost analysis has a problem.

I've checked it and it looks OK to me. RA is focused on generation of faster code. Looking at the fragment you provided, it is hard to say anything about it. I tried -Os for gcc4.8 and it generates the desirable code for the fragment in question (by the way, the peak register pressure decreased to 66 in this case).

It's both larger and slower, since the additional loads take much longer. I'll take a look at -Os. It looks like the values of p++ are being pre-calculated and stored on the stack. This results in a load, rather than an increment of a register.

If it is so, it might be another optimization which moves the p++ calculations higher. IRA does not do it (more correctly, a new IRA feature implemented by Bernd Schmidt in gcc4.7 can move insns downward in the CFG to decrease reg pressure). I checked all RTL passes; these calculations are not created by any RTL pass. So it is probably some tree-ssa optimization.

Any industrial RA uses heuristic algorithms; in some cases better heuristics can work worse than inferior ones. So you should probably check whether there is any progress moving from gcc4.1 to gcc4.6 from a performance point of view on a variety of benchmarks. Introducing IRA improved code for x86 by 4% on SPEC2000. Subsequent improvements (like using dynamic register classes) made further performance gains.
My impression is that the performance is worse. Reports I've seen are that the code is substantially larger, which means more instructions. I'm skeptical about comparisons between x86 and RISC processors. What works well for one may not work well for the other.

IRA improved code for many RISC processors, although better RA has a smaller effect on these processors, as they have more registers.

Looking at the test code, I can make some conclusions for myself:

o We need a common pass decreasing reg pressure (I have already expressed this in the past), as optimizations become more aggressive. Some progress was made in making a few optimizations aware of RA (reg-pressure scheduling, loop-invariant motion, and code hoisting), but there are too many passes, and it is wrong and impossible to make them all aware of RA. Some register-pressure-decreasing heuristics are difficult to implement in RA (like insn rearrangements or complex rematerialization), and this pass could focus on them.

That might be useful.

o Implement RA live-range splitting in regions other than loops or BBs (now IRA does splitting only on loop bounds and LRA within a BB; the old RA had no live-range splitting at all).

Each of the blocks of code is in its own BB. I haven't checked, but I'd guess that most of the registers are in use on entry and still live on exit, so the block has no registers to allocate.

Splitting at BB scope in this case is not profitable.

I'd also recommend trying the following RA-related options: -fira-loop-pressure, -fsched-pressure, -fira-algorithm=CB|priority, -fira-region=one,all,mixed. Actually -fira-algorithm=priority + -fira-region=one is an analog of what the old RA did.

I am reading this thread and getting more and more puzzled. The RA stuff is very complicated, having many constraints and many dependencies with other passes. Taking this into account, it seems that no heuristic algorithm can even get close to an optimal register allocation.
A heuristic algorithm can't take all effects and side effects into account simultaneously. Considering all that, why does GCC not use generic optimization algorithms for RA? A generic optimizer can take all issues into account simultaneously. I am talking about ILP/MIP (Integer Linear Programming / Mixed Integer Programming), SAT, and CSP [1] (Constraint Satisfaction Problem). There has been a lot of progress in these areas, and the solvers are orders of magnitude faster than they were 10 years ago.

So why aren't ILP/SAT/CSP used in RA? I don't think they will work much slower than what RA does today (they will be somewhat slower, but not much). I do believe that there is a good chance of getting much better results from RA with these technologies. I'd like to invest some time into a feasibility check, if someone is willing to work with me on modeling the RA problem (or at least several examples, such as the above).

Michael

[1] Even though CSP
Re: Register Allocation issues with microblaze-elf
On 02/14/2013 03:46 AM, Michael Veksler wrote: On 02/14/2013 03:28 AM, Vladimir Makarov wrote: On 13-02-13 6:36 PM, Michael Eager wrote: [snip]

I am reading this thread and getting more and more puzzled. The RA stuff is very complicated, having many constraints and many dependencies with other passes.
Taking this into account, it seems that no heuristic algorithm can even get close to an optimal register allocation. A heuristic algorithm can't take all effects and side effects into account simultaneously. Considering all that, why does GCC not use generic optimization algorithms for RA? A generic optimizer can take all issues into account simultaneously. I am talking about ILP/MIP (Integer Linear Programming / Mixed Integer Programming), SAT, and CSP [1] (Constraint Satisfaction Problem). There has been a lot of progress in these areas, and the solvers are orders of magnitude faster than they were 10 years ago.

Actually, I tried an ILP solver for the first time in 2003. I return to this about every 3 years, and every time I fail to produce something useful (at least a solution, acceptable in some cases, which makes the compiler only ten times slower). I've tried a lot of models, from more sophisticated ones to simpler ones, pure solvers or hybrids of heuristics and ILP solutions. All solutions were too slow or generated no better code.
Re: Register Allocation issues with microblaze-elf
On 02/14/2013 06:31 PM, Vladimir Makarov wrote: On 02/14/2013 03:46 AM, Michael Veksler wrote: [snip]

Taking this into account, it seems that no heuristic algorithm can even get close to an optimal register allocation. A heuristic algorithm can't take all effects and side effects into account simultaneously. Considering all that, why does GCC not use generic optimization algorithms for RA? A generic optimizer can take all issues into account simultaneously. I am talking about ILP/MIP (Integer Linear Programming / Mixed Integer Programming), SAT, and CSP [1] (Constraint Satisfaction Problem). There has been a lot of progress in these areas, and the solvers are orders of magnitude faster than they were 10 years ago.

Actually, I tried an ILP solver for the first time in 2003. I return to this about every 3 years, and every time I fail to produce something useful (at least a solution, acceptable in some cases, which makes the compiler only ten times slower). I've tried a lot of models, from more sophisticated ones to simpler ones, pure solvers or hybrids of heuristics and ILP solutions. All solutions were too slow or generated no better code.

In many cases ILP is faster than CP (Constraint Programming), but there are quite a few cases where CP and SAT are much faster. Speed issues with modeling RA as ILP rule out neither SAT nor CP.

Last time, I had hopes of using the massive parallelism of GPUs for this. After investing much of my time in learning simplex algorithm theory (the basis of ILP solvers), I found that a simple tableau-based algorithm parallelizes very well, but for RA tasks even using 1000 GPU elements is still behind the revised simplex algorithm, which is not parallelized. GPUs are not good for very sparse matrix problems, which is the case for RA problems.

More and more it sounds like CP or SAT.
Still, some important RA problems (like rematerialization) are very hard to describe in ILP.

It sounds like your model requires many 0/1 variables. If so, then maybe SAT is a better choice than ILP.

I don't see any progress; ILP solvers may be ten times faster, but they still have exponential complexity. I don't know any industrial compiler which uses an exact solver for RA (RA in GCC is even more complicated, as it requires code selection too).

But the solver need not be exact. ILP uses branch and bound, which can be interrupted after some time, giving the best solution so far. The best solution so far could be good enough for RA. I have seen huge SAT instances solved in a fraction of a second. These problems have a very clear structure that the solver takes advantage of, but why should the RA problem be different?

So why aren't ILP/SAT/CSP used in RA? I don't think they will work much slower than what RA does today (they will be somewhat slower, but not much). I do believe that there is a good chance of getting much better results from RA with these technologies.

You know, some LLVM guys had the same thoughts, and as I remember they had a PBQP-based RA (I tried this too; they had too many different RAs), and now they have finished with just a good heuristic RA which works best. I don't know how PBQP works for RA, but if it uses branch and bound over integer variables then it may have performance issues there.

I'd like to invest some time into a feasibility check, if someone is willing to work with me on modeling the RA problem (or at least several examples, such as the above).

Good luck with that. I am serious.

I need some help to get a basic model, even a hand-written model for a given input to RA, just as an example (before diving deeper into it). I'd be happy if you or somebody else can guide me through.
At least, where should I look in the source of GCC to figure some of these things out, or where can I find some documentation/papers/presentations on RA and its input data structures (and output requirements)?

Michael
Re: Register Allocation issues with microblaze-elf
On 02/14/2013 02:41 PM, Michael Veksler wrote: ...

It sounds like your model requires many 0/1 variables. If so, then maybe SAT is a better choice than ILP.

Yes, it is binary ILP.

I don't see any progress; ILP solvers may be ten times faster, but they still have exponential complexity. I don't know any industrial compiler which uses an exact solver for RA (RA in GCC is even more complicated, as it requires code selection too).

But the solver need not be exact. ILP uses branch and bound, which can be interrupted after some time, giving the best solution so far. The best solution so far could be good enough for RA. I have seen huge SAT instances solved in a fraction of a second. These problems have a very clear structure that the solver takes advantage of, but why should the RA problem be different?

As I wrote, I used this approach too. By the way, I used GLPK, as it is pretty good at letting you affect the solving process (when to stop, what to consider first, in what order to consider sub-problems, and so on), although it might not be the fastest open-source MILP solver.

So why aren't ILP/SAT/CSP used in RA? I don't think they will work much slower than what RA does today (they will be somewhat slower, but not much). I do believe that there is a good chance of getting much better results from RA with these technologies.

You know, some LLVM guys had the same thoughts, and as I remember they had a PBQP-based RA (I tried this too; they had too many different RAs), and now they have finished with just a good heuristic RA which works best. I don't know how PBQP works for RA, but if it uses branch and bound over integer variables then it may have performance issues there. As I remember correctly, the person who invented this tried to promote it in LLVM.

I'd like to invest some time into a feasibility check, if someone is willing to work with me on modeling the RA problem (or at least several examples, such as the above).

Good luck with that. I am serious.
I need some help to get a basic model, even a hand-written model for a given input to RA, just as an example (before diving deeper into it). I'd be happy if you or somebody else can guide me through. At least, where should I look in the source of GCC to figure some of these things out, or where can I find some documentation/papers/presentations on RA and its input data structures (and output requirements)?

There are tons of articles about optimal RA (fewer about optimal insn scheduling). You could start with the Goodwin/Wilken or Appel/George ILP approach. Add choosing insn alternatives (a 0/1 variable for each alternative of each insn, and constraints for using only one alternative per insn). That is specific to GCC, as GCC partially does code selection in RA. That will define the possible sets of registers or memory for pseudos which are operands of the insn. You can add a cost for each alternative (as RTL insn alternatives can represent different insns with different execution times) to the cost function. If you use the Appel/George approach, add constraints and cost-function terms for coalescing and rematerialization. If you want to go further, you could add the scheduling problem too, because RA and insn scheduling have conflicting goals. That will probably be the most general base problem.

Now you can simplify this model in different ways. But don't simplify it too much: for example, just solving register allocation (assignment) with ILP makes no sense, as graph coloring heuristics based on Kempe's criterion of trivial colorability solve this problem optimally in probably 99% of cases.

Please don't forget that even an optimal solution can work worse than a heuristic one, as insn execution frequency evaluation is approximate. Some heuristics work very well with uncertain or approximate execution frequencies. Profile-based compilation in some cases is not possible or very inconvenient.
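As a starting point, the base assignment model with alternative selection could be hand-written very roughly as below. This is my own toy notation, not taken from the cited papers; coalescing, rematerialization, and scheduling terms are omitted. Let x_{p,r} = 1 iff pseudo p gets hard register r, s_p = 1 iff p is spilled (cost c_p), and y_{i,a} = 1 iff insn i uses alternative a (cost d_{i,a}):

```latex
\begin{align}
\min\; & \sum_p c_p\, s_p \;+\; \sum_i \sum_a d_{i,a}\, y_{i,a}
  && \text{spill costs + alternative costs} \\
\text{s.t.}\;
  & \sum_r x_{p,r} + s_p = 1 \quad \forall p
  && \text{each pseudo gets a register or memory} \\
  & x_{p,r} + x_{q,r} \le 1 \quad \forall r,\ \forall\, p,q \text{ interfering}
  && \text{no sharing among conflicting pseudos} \\
  & \sum_a y_{i,a} = 1 \quad \forall i
  && \text{exactly one alternative per insn} \\
  & x_{p,r} + y_{i,a} \le 1
  && \text{if alternative } a \text{ of insn } i \text{ forbids } r \text{ for operand } p \\
  & x_{p,r},\, s_p,\, y_{i,a} \in \{0,1\}
\end{align}
```

The last family of constraints is what couples assignment to code selection: choosing an alternative restricts its operands to that alternative's register classes (or memory).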
I have seen other cases where an optimal solution, even with profile-based execution frequencies, generates worse code, as the model probably does not take the generated memory footprint into account.

The following articles (presentations) could help you to understand GCC RA:

ftp://gcc.gnu.org/pub/gcc/summit/2004/Fighting%20Register%20Pressure.pdf
http://ols.fedoraproject.org/GCC/Reprints-2007/makarov-reprint.pdf
http://vmakarov.fedorapeople.org/LRA.html
Re: Register Allocation issues with microblaze-elf
On 13-02-13 1:36 AM, Michael Eager wrote:

Hi -- I'm seeing register allocation problems and code size increases with gcc-4.6.2 (and gcc-head) compared with older (gcc-4.1.2). Both are compiled using -O3. One test case that I have has a long series of nested if's, each with the same comparison and similar computation:

    if (n < max_no) {
      n += *(cp - *p++);
      if (n < max_no) {
        n += *(cp - *p);
        if (n < max_no) {
          . . . ~20 levels of nesting
          more computations with 'cp' and 'p'
          . . .
    }}}

Gcc-4.6.2 generates many blocks like the following:

    lwi   r28,r1,68    -- load into dead reg
    lwi   r31,r1,140   -- load p from stack
    lbui  r28,r31,0
    rsubk r31,r28,r19
    lbui  r31,r31,0
    addk  r29,r29,r31
    swi   r31,r1,308
    lwi   r31,r1,428   -- load of max_no from stack
    cmp   r28,r31,r29  -- n in r29
    bgeid r28,$L46

gcc-4.1.2 generates the following:

    lbui  r3,r26,3
    rsubk r3,r3,r19
    lbui  r3,r3,0
    addk  r30,r30,r3
    swi   r3,r1,80
    cmp   r18,r9,r30   -- max_no in r9, n in r30
    bgei  r18,$L6

gcc-4.6.2 (and gcc-head) load max_no from the stack in each block. There are also extra loads into r28 (which is not used) and r31 at the start of each block. Only r28, r29, and r31 are used. I'm having a hard time telling what is happening or why.

The IRA dump has this line:

    Ignoring reg 772, has equiv memory

where pseudo 772 is loaded with max_no early in the function. The reload dump has

    Reloads for insn # 254
    Reload 0: reload_in (SI) = (reg/v:SI 722 [ max_no ])
        GR_REGS, RELOAD_FOR_INPUT (opnum = 1)
        reload_in_reg: (reg/v:SI 722 [ max_no ])
        reload_reg_rtx: (reg:SI 31 r31)

and similar for each of the other insns using 722. This is followed by

    Spilling for insn 254.
    Using reg 31 for reload 0

for each insn using pseudo 722. Any idea what is going on?

So many changes have happened since then (7 years ago) that it is very hard for me to say anything definite. I also have no gcc-4.1 microblaze (as I see, microblaze was added to public gcc in the 4.6 version), which makes it even more difficult for me to say something useful.
First of all, a new RA was introduced in gcc4.4 (IRA), which uses different heuristics (Chaitin-Briggs graph coloring vs. Chow's priority RA). We could blame IRA if the RA in gcc4.1 and gcc4.6-gcc4.8 started from the same conditions, but I am sure it is not the same: more aggressive optimizations create higher register pressure. I compared peak reg pressure in the test for gcc4.6 and gcc4.8. It became higher (from 102 to 106). I guess the increase was even bigger since gcc4.1.

RA is focused on generation of faster code. Looking at the fragment you provided, it is hard to say anything about it. I tried -Os for gcc4.8 and it generates the desirable code for the fragment in question (by the way, the peak register pressure decreased to 66 in this case).

Any industrial RA uses heuristic algorithms; in some cases better heuristics can work worse than inferior ones. So you should probably check whether there is any progress moving from gcc4.1 to gcc4.6 from a performance point of view on a variety of benchmarks. Introducing IRA improved code for x86 by 4% on SPEC2000. Subsequent improvements (like using dynamic register classes) made further performance gains.

Looking at the test code, I can make some conclusions for myself:

o We need a common pass decreasing reg pressure (I have already expressed this in the past), as optimizations become more aggressive. Some progress was made in making a few optimizations aware of RA (reg-pressure scheduling, loop-invariant motion, and code hoisting), but there are too many passes, and it is wrong and impossible to make them all aware of RA. Some register-pressure-decreasing heuristics are difficult to implement in RA (like insn rearrangements or complex rematerialization), and this pass could focus on them.

o Implement RA live-range splitting in regions other than loops or BBs (now IRA does splitting only on loop bounds and LRA within a BB; the old RA had no live-range splitting at all).
I'd also recommend trying the following RA-related options: -fira-loop-pressure, -fsched-pressure, -fira-algorithm=CB|priority, -fira-region=one,all,mixed. Actually -fira-algorithm=priority + -fira-region=one is an analog of what the old RA did.

I hope I answered your question.
Re: Register Allocation issues with microblaze-elf
On 02/13/2013 02:38 PM, Vladimir Makarov wrote: On 13-02-13 1:36 AM, Michael Eager wrote: [snip test case and generated code]

So many changes have happened since then (7 years ago) that it is very hard for me to say anything definite. I also have no gcc-4.1 microblaze (as I see, microblaze was added to public gcc in the 4.6 version), which makes it even more difficult for me to say something useful.
First of all, a new RA was introduced in gcc4.4 (IRA), which uses different heuristics (Chaitin-Briggs graph coloring vs. Chow's priority RA). We could blame IRA if the RA in gcc4.1 and gcc4.6-gcc4.8 started from the same conditions, but I am sure it is not the same: more aggressive optimizations create higher register pressure. I compared peak reg pressure in the test for gcc4.6 and gcc4.8. It became higher (from 102 to 106). I guess the increase was even bigger since gcc4.1.

I thought about register pressure causing this, but I think that should cause spilling of one of the registers which were not used in this long sequence, rather than causing a large number of additional loads. Perhaps the cost analysis has a problem.

RA is focused on generation of faster code. Looking at the fragment you provided, it is hard to say anything about it. I tried -Os for gcc4.8 and it generates the desirable code for the fragment in question (by the way, the peak register pressure decreased to 66 in this case).

It's both larger and slower, since the additional loads take much longer. I'll take a look at -Os. It looks like the values of p++ are being pre-calculated and stored on the stack. This results in a load, rather than an increment of a register.

Any industrial RA uses heuristic algorithms; in some cases better heuristics can work worse than inferior ones. So you should probably check whether there is any progress moving from gcc4.1 to gcc4.6 from a performance point of view on a variety of benchmarks. Introducing IRA improved code for x86 by 4% on SPEC2000. Subsequent improvements (like using dynamic register classes) made further performance gains.

My impression is that the performance is worse. Reports I've seen are that the code is substantially larger, which means more instructions. I'm skeptical about comparisons between x86 and RISC processors. What works well for one may not work well for the other.
Looking at the test code, I can make some conclusions for myself:

o We need a common pass decreasing reg pressure (I have already expressed this in the past), as optimizations become more aggressive. Some progress was made in making a few optimizations aware of RA (reg-pressure scheduling, loop-invariant motion, and code hoisting), but there are too many passes, and it is wrong and impossible to make them all aware of RA. Some register-pressure-decreasing heuristics are difficult to implement in RA (like insn rearrangements or complex rematerialization), and this pass could focus on them.

That might be useful.

o Implement RA live-range splitting in regions other than loops or BBs (now IRA does splitting only on loop bounds and LRA within a BB; the old RA had no live-range splitting at all).
Re: Register Allocation issues with microblaze-elf
On 13-02-13 6:36 PM, Michael Eager wrote: On 02/13/2013 02:38 PM, Vladimir Makarov wrote: On 13-02-13 1:36 AM, Michael Eager wrote: [snip test case and generated code]

So many changes have happened since then (7 years ago) that it is very hard for me to say anything definite.
I also have no gcc-4.1 microblaze (as I see, microblaze was added to public gcc in the 4.6 version), which makes it even more difficult for me to say something useful.

First of all, a new RA was introduced in gcc4.4 (IRA), which uses different heuristics (Chaitin-Briggs graph coloring vs. Chow's priority RA). We could blame IRA if the RA in gcc4.1 and gcc4.6-gcc4.8 started from the same conditions, but I am sure it is not the same: more aggressive optimizations create higher register pressure. I compared peak reg pressure in the test for gcc4.6 and gcc4.8. It became higher (from 102 to 106). I guess the increase was even bigger since gcc4.1.

I thought about register pressure causing this, but I think that should cause spilling of one of the registers which were not used in this long sequence, rather than causing a large number of additional loads.

Longer-living pseudos have a higher probability of being spilled, as they usually conflict with more pseudos during their lifetimes, and spilling them frees a hard reg for many conflicting pseudos. That is how RA heuristics work (in the old RA, log(live range span) was used: the bigger the value, the higher the probability of spilling).

Perhaps the cost analysis has a problem.

I've checked it and it looks OK to me. RA is focused on generation of faster code. Looking at the fragment you provided, it is hard to say anything about it. I tried -Os for gcc4.8 and it generates the desirable code for the fragment in question (by the way, the peak register pressure decreased to 66 in this case).

It's both larger and slower, since the additional loads take much longer. I'll take a look at -Os. It looks like the values of p++ are being pre-calculated and stored on the stack. This results in a load, rather than an increment of a register.

If it is so, it might be another optimization which moves the p++ calculations higher.
IRA does not do it (more correctly, a new IRA feature implemented by Bernd Schmidt in gcc4.7 can move insns downward in the CFG to decrease reg pressure). I checked all RTL passes; these calculations are not created by any RTL pass. So it is probably some tree-ssa optimization.

Any industrial RA uses heuristic algorithms; in some cases better heuristics can work worse than inferior ones. So you should probably check whether there is any progress moving from gcc4.1 to gcc4.6 from a performance point of view on a variety of benchmarks. Introducing IRA improved code for x86 by 4% on SPEC2000. Subsequent improvements (like using dynamic register classes) made further performance gains.

My impression is that the performance is worse. Reports I've seen are that the code is substantially larger, which means more instructions. I'm skeptical about comparisons between x86 and RISC processors. What works well for one may not work well for the other.
Re: Register Allocation issues with microblaze-elf
On Thu, Feb 14, 2013 at 12:36:46AM +0100, Michael Eager wrote: On 02/13/2013 02:38 PM, Vladimir Makarov wrote: On 13-02-13 1:36 AM, Michael Eager wrote: [snip test case and generated code]

So many changes have happened since then (7 years ago) that it is very hard for me to say anything definite.
I also have no gcc-4.1 microblaze (as I see, microblaze was added to public gcc in version 4.6), which makes it even more difficult to say something useful. First of all, the new RA (IRA) was introduced in gcc-4.4 and uses different heuristics (Chaitin-Briggs graph coloring vs Chow's priority RA). We could blame IRA if the RA's starting conditions were the same in gcc-4.1 and gcc-4.6 through gcc-4.8, but I am sure they are not: more aggressive optimizations create higher register pressure. I compared peak reg pressure in the test for gcc-4.6 and gcc-4.8; it became higher (from 102 to 106). I guess the increase was even bigger since gcc-4.1. I thought about register pressure causing this, but I think that should cause spilling of one of the registers which were not used in this long sequence, rather than causing a large number of additional loads. Perhaps the cost analysis has a problem. RA is focused on generating faster code. Looking at the fragment you provided, it is hard to say anything about it. I tried -Os with gcc-4.8 and it generates the desired code for the fragment in question (by the way, the peak register pressure decreased to 66 in this case). It's both larger and slower, since the additional loads take much longer. I'll take a look at -Os. It looks like the values of p++ are being pre-calculated and stored on the stack. This results in a load, rather than an increment of a register. Hi, I remember having a similar issue about a year ago. IIRC, I found that the ivopts pass was transforming things badly for microblaze. Disabling it helped a lot. I can't tell if you are seeing the same thing, but it might be worth trying -fno-ivopts in case you haven't already. Cheers, Edgar
Re: Register Allocation issues with microblaze-elf
On 02/13/2013 11:24 PM, Edgar E. Iglesias wrote: [snip] I remember having a similar issue about a year ago. IIRC, I found that the ivopts pass was transforming things badly for microblaze. Disabling it helped a lot. I can't tell if you are seeing the same thing, but it might be worth trying -fno-ivopts in case you haven't already. Thanks. I'll see if that helps. -- Michael Eager  ea...@eagercon.com  1960 Park Blvd., Palo Alto, CA 94306  650-325-8077
Register Allocation issues with microblaze-elf
Hi -- I'm seeing register allocation problems and code size increases with gcc-4.6.2 (and gcc-head) compared with older (gcc-4.1.2). Both are compiled using -O3. One test case that I have has a long series of nested if's, each with the same comparison and similar computation:

if (n<max_no){ n+=*(cp-*p++);
if (n<max_no){ n+=*(cp-*p);
if (n<max_no){
. . . ~20 levels of nesting, more computations with 'cp' and 'p' . . .
}}}

gcc-4.6.2 generates many blocks like the following:

lwi   r28,r1,68    -- load into dead reg
lwi   r31,r1,140   -- load p from stack
lbui  r28,r31,0
rsubk r31,r28,r19
lbui  r31,r31,0
addk  r29,r29,r31
swi   r31,r1,308
lwi   r31,r1,428   -- load of max_no from stack
cmp   r28,r31,r29  -- n in r29
bgeid r28,$L46

gcc-4.1.2 generates the following:

lbui  r3,r26,3
rsubk r3,r3,r19
lbui  r3,r3,0
addk  r30,r30,r3
swi   r3,r1,80
cmp   r18,r9,r30   -- max_no in r9, n in r30
bgei  r18,$L6

gcc-4.6.2 (and gcc-head) load max_no from the stack in each block. There are also extra loads into r28 (which is not used) and r31 at the start of each block. Only r28, r29, and r31 are used. I'm having a hard time telling what is happening or why. The IRA dump has this line:

Ignoring reg 772, has equiv memory

where pseudo 772 is loaded with max_no early in the function. The reload dump has:

Reloads for insn # 254
Reload 0: reload_in (SI) = (reg/v:SI 722 [ max_no ])
  GR_REGS, RELOAD_FOR_INPUT (opnum = 1)
  reload_in_reg: (reg/v:SI 722 [ max_no ])
  reload_reg_rtx: (reg:SI 31 r31)

and similar for each of the other insns using 722. This is followed by "Spilling for insn 254. Using reg 31 for reload 0" for each insn using pseudo 722. Any idea what is going on?
-- Michael Eager  ea...@eagercon.com  1960 Park Blvd., Palo Alto, CA 94306  650-325-8077

#if 0
mb-gcc -O3 -mhard-float -fdump-rtl-all -c s.c -save-temps
#endif
typedef unsigned char uchar;
typedef struct {int x,y,info, dx, dy, I;} CORNER_LIST[15000];

susan_corners(in,r,bp,max_no,corner_list,x_size,y_size)
uchar *in, *bp;
int *r, max_no, x_size, y_size;
CORNER_LIST corner_list;
{
int n,x,y,sq,xx,yy, i,j,*cgx,*cgy;
float divide;
uchar c,*p,*cp;

for (i=5;i<y_size-5;i++)
for (j=5;j<x_size-5;j++) {
n=100;
p=in + (i-3)*x_size + j - 1;
cp=bp + in[i*x_size+j];

n+=*(cp-*p++); n+=*(cp-*p++); n+=*(cp-*p);
p+=x_size-3;
n+=*(cp-*p++); n+=*(cp-*p++); n+=*(cp-*p++); n+=*(cp-*p++); n+=*(cp-*p);
p+=x_size-5;
n+=*(cp-*p++); n+=*(cp-*p++); n+=*(cp-*p++); n+=*(cp-*p++); n+=*(cp-*p++); n+=*(cp-*p++); n+=*(cp-*p);
p+=x_size-6;
n+=*(cp-*p++); n+=*(cp-*p++); n+=*(cp-*p);

if (n<max_no){
p+=2;
n+=*(cp-*p++);
#if 1
if (n<max_no){ n+=*(cp-*p++);
if (n<max_no){ n+=*(cp-*p);
if (n<max_no){ p+=x_size-6; n+=*(cp-*p++);
if (n<max_no){ n+=*(cp-*p++);
if (n<max_no){ n+=*(cp-*p++);
if (n<max_no){ n+=*(cp-*p++);
if (n<max_no){ n+=*(cp-*p++);
if (n<max_no){ n+=*(cp-*p++);
if (n<max_no){ n+=*(cp-*p);
if (n<max_no){ p+=x_size-5; n+=*(cp-*p++);
if (n<max_no){ n+=*(cp-*p++);
if (n<max_no){ n+=*(cp-*p++);
if (n<max_no){ n+=*(cp-*p++);
if (n<max_no){ n+=*(cp-*p);
if (n<max_no){ p+=x_size-3; n+=*(cp-*p++);
#endif
if (n<max_no){ n+=*(cp-*p++);
if (n<max_no){ n+=*(cp-*p);
if (n<max_no) {
x=0;y=0;
p=in + (i-3)*x_size + j - 1;
c=*(cp-*p++);x-=c;y-=3*c;
c=*(cp-*p++);y-=3*c;
c=*(cp-*p);x+=c;y-=3*c;
p+=x_size-3;
c=*(cp-*p++);x-=2*c;y-=2*c;
c=*(cp-*p++);x-=c;y-=2*c;
c=*(cp-*p++);y-=2*c;
c=*(cp-*p++);x+=c;y-=2*c;
c=*(cp-*p);x+=2*c;y-=2*c;
p+=x_size-5;
c=*(cp-*p++);x-=3*c;y-=c;
c=*(cp-*p++);x-=2*c;y-=c;
c=*(cp-*p++);x-=c;y-=c;
c=*(cp-*p++);y-=c;
Re: Register allocation issues
Matt Lee writes: Matt The problem is that, though the loads could be overlapped by pipelining, Matt the register allocator has created a dependency by using only r3 Matt and r4, instead of using the other volatiles. GCC's register allocator is currently designed to minimize the number of hard registers used. As Ian mentioned, -frename-registers tries to perform register renaming with the available registers after register allocation. As they say, Your Mileage May Vary. David
RE: Register allocation issues
On 05 September 2007 23:47, Matt Lee wrote: Registers r3 to r12 are volatiles. However, for the C code below,

struct foo { int a[4]; };
struct foo p, q;
void func () { memcpy (p, q, sizeof (struct foo)); }

I am getting an instruction sequence for func() such as:

load  r3, q + 0
load  r4, q + 4
store r3, p + 0
store r4, p + 4
load  r3, q + 4
load  r4, q + 8
store r3, p + 4
store r4, p + 8

The problem is that, though the loads could be overlapped by pipelining, the register allocator has created a dependency by using only r3 and r4, instead of using the other volatiles. Does your backend define a movdi pattern? cheers, DaveK -- Can't think of a witty .sigline today
Re: Register allocation issues
On 9/6/07, David Edelsohn [EMAIL PROTECTED] wrote: Matt Lee writes: Matt The problem is that, though the loads could be overlapped by pipelining, Matt the register allocator has created a dependency by using only r3 Matt and r4, instead of using the other volatiles. GCC's register allocator is currently designed to minimize the number of hard registers used. As Ian mentioned, -frename-registers tries to perform register renaming with the available registers after register allocation. As they say, Your Mileage May Vary. There is no point in trying to minimize usage of volatile hard registers, is there? They are precisely there to be used up as much as needed. The function is a leaf procedure as well, so there are no other considerations. Lastly, architectures like PPC do make use of more registers (without -frename-registers), so there has to be something in the PPC back-end that allows for the liberal use, or something in mine that prevents it. -- thanks, Matt
Re: Register allocation issues
On 9/6/07, Dave Korn [EMAIL PROTECTED] wrote: On 05 September 2007 23:47, Matt Lee wrote: Registers r3 to r12 are volatiles. However, for the C code below,

struct foo { int a[4]; };
struct foo p, q;
void func () { memcpy (p, q, sizeof (struct foo)); }

I am getting an instruction sequence for func() such as:

load  r3, q + 0
load  r4, q + 4
store r3, p + 0
store r4, p + 4
load  r3, q + 4
load  r4, q + 8
store r3, p + 4
store r4, p + 8

The problem is that, though the loads could be overlapped by pipelining, the register allocator has created a dependency by using only r3 and r4, instead of using the other volatiles. Does your backend define a movdi pattern? Yes it does. But in this case, these are word-by-word moves from memory to memory and make use of only movsi. -- thanks, Matt
Re: Register allocation issues
Matt Lee writes: Date: Thu, 06 Sep 2007 15:02:52 -0400 From: David Edelsohn [EMAIL PROTECTED] Matt There is no point in trying to minimize usage of volatile hard registers, Matt is there? They are precisely there to be used up as much as needed. Matt The function is a leaf procedure as well, so there are no other Matt considerations. Lastly, architectures like PPC do make use of more Matt registers (without -frename-registers), so there has to be something Matt in the PPC back-end that allows for the liberal use or in mine that Matt prevents such. GCC's RA is mostly tuned for IA-32, which has very few registers. The rs6000 port defines the movmemsi pattern calling expand_block_move(), which generates many intermediate pseudos. David
Re: Register allocation issues
load  r3, q + 0
load  r4, q + 4
store r3, p + 0
store r4, p + 4
load  r3, q + 4
load  r4, q + 8
store r3, p + 4
store r4, p + 8

These last four lines should be:

load  r3, q + 8
load  r4, q + 12
store r3, p + 8
store r4, p + 12

Did you just typo it, or do you have a bigger problem? The problems might even be connected, who knows :-) Segher
Re: Register allocation issues
On 9/6/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Matt Lee writes: Date: Thu, 06 Sep 2007 15:02:52 -0400 From: David Edelsohn [EMAIL PROTECTED] Matt There is no point in trying to minimize usage of volatile hard registers, Matt is there? They are precisely there to be used up as much as needed. Matt The function is a leaf procedure as well, so there are no other Matt considerations. Lastly, architectures like PPC do make use of more Matt registers (without -frename-registers), so there has to be something Matt in the PPC back-end that allows for the liberal use or in mine that Matt prevents such. GCC's RA is mostly tuned for IA-32, which has very few registers. The rs6000 port defines the movmemsi pattern calling expand_block_move(), which generates many intermediate pseudos. So does mine. My generated RTL looks right, with a unique pseudo for each load/store pair. If you look at the dump from the .lreg file in the original email, the problem is that these pseudos get bound to only 2 of the 10 candidate volatile registers. -- thanks, Matt
Re: Register allocation issues
On 9/6/07, Segher Boessenkool [EMAIL PROTECTED] wrote:

load  r3, q + 0
load  r4, q + 4
store r3, p + 0
store r4, p + 4
load  r3, q + 4
load  r4, q + 8
store r3, p + 4
store r4, p + 8

These last four lines should be:

load  r3, q + 8
load  r4, q + 12
store r3, p + 8
store r4, p + 12

Did you just typo it, or do you have a bigger problem? The problems might even be connected, who knows :-)

Sorry, that was a typo.
Register allocation issues
Hello, On my simple RISC architecture I am seeing suboptimal instruction scheduling with GCC-4.1.1, caused by the way registers are getting allocated. I am looking for suggestions on what could be wrong in my description to cause the poor allocation. More details -- Registers r3 to r12 are volatiles. However, for the C code below,

struct foo { int a[4]; };
struct foo p, q;
void func () { memcpy (p, q, sizeof (struct foo)); }

I am getting an instruction sequence for func() such as:

load  r3, q + 0
load  r4, q + 4
store r3, p + 0
store r4, p + 4
load  r3, q + 4
load  r4, q + 8
store r3, p + 4
store r4, p + 8

The problem is that, though the loads could be overlapped by pipelining, the register allocator has created a dependency by using only r3 and r4, instead of using the other volatiles. Strangely, modifying func to the following helps:

void func (struct foo *p, struct foo *q) { memcpy (p, q, sizeof (struct foo)); }

Now the register allocator uses more volatile registers and there are no false dependencies. I have attached the header information from the .lreg file for the poor register allocation case. Pseudos 87 and 88 get assigned again to r3, r4 instead of new volatiles. Is this a problem with some poor cost model in my backend? How do I get more information about why the register allocator didn't pick the other volatiles? -- thanks, Matt

func.lreg Description: Binary data
Re: Register allocation issues
Matt Lee [EMAIL PROTECTED] writes: The problem is that, though the loads could be overlapped by pipelining, the register allocator has created a dependency by using only r3 and r4, instead of using the other volatiles. Try using -frename-registers. Ian