Re: [SMS] Support new loop pattern
Hello Ayal, First of all, thanks for your feedback. Now to your questions: On 31.03.2012 3:20, Ayal Zaks wrote: Roman, Andrey, Sorry for the delayed response. It would indeed be good to have SMS apply to more loop patterns, still within the realm of *countable* loops. SMS was originally designed to handle doloops, with a specific pattern controlling the loop, easily identified and separable from the loop's body. The newly proposed change to support new loop patterns is pretty invasive and sizable, taking place entirely within modulo-sched.c. The main issue I've been considering, is whether it would be possible instead to transform the new loop patterns we want SMS to handle, into doloops (potentially introducing additional induction variables to feed other uses), and then feed the resulting loop into SMS as is? In other words, could you fold it into doloop.c? And if so, will doing so introduce significant overheads? Let me perhaps explain better. The patch itself is one core patch (this thread) adding the new functionality on detecting more complex loop patterns and the three fixes to SMS found while working on the main patch (the fixes are in the mails pinged at the very end of this message). The three fixes are worthwhile to commit separately anyways, they are splitted up from the main patch for this purpose, so I would suggest to consider them in any case. For the main patch, its core is as small as we could get. It stays with the countable loops as for the cases where we could get overflow behavior or infinite loops we bail out early. We handle only a case of simple same-step affine counters. The main reason why we add support to SMS and not to the doloop pass are is when we do not pipeline a loop newly transformed to the doloop form, this loop actually slows down on the platforms not having a true doloop pattern. One has to undo the doloop form and to get back to the original loop form to avoid this, which seems rather strange. Also, the separate decrement insn that changes the induction variable is better be scheduled to get more precise schedule. And yes, I believe that making an extra induction variable just to have the control part without uses in the loop core will be unnecessary overhead. Thus, I believe that if we do want SMS to handle more complex loop, then it is inevitable that SMS itself would be somewhat more complex. I would welcome your suggestions to make the patch more clear. One way I see is that the function for getting the condition of the new loop form can be moved to the generic RTL loop code given the agreement of other RTL maintainers. Also, some new helpers can be introduced for handling this specific loop forms. But it seems that the distinction between doloop/non-doloop loops has to stay in the code. Yours, Andrey 2012/3/29 Andrey Belevantsev: Hello, I'd like to ping again those SMS patches once we're back to Stage 1. Ayal, maybe it would remove some burden for you if you'd review the general SMS functionality of those patches, and we'd ask RTL folks to look at the pieces related to RTL pattern matching and generation? It definitely would ... especially in light of the above issue. Thanks (for your patches, patience, pings..), Ayal. Yours, Andrey On 10.02.2012 16:15, Roman Zhuykov wrote: Ping. Ayal, please review this patch and these three patches too: http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00505.html http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00506.html http://gcc.gnu.org/ml/gcc-patches/2011-12/msg01800.html -- Roman Zhuykov zhr...@ispras.ru
Re: [SMS] Support new loop pattern
Roman, Andrey, Sorry for the delayed response. It would indeed be good to have SMS apply to more loop patterns, still within the realm of *countable* loops. SMS was originally designed to handle doloops, with a specific pattern controlling the loop, easily identified and separable from the loop's body. The newly proposed change to support new loop patterns is pretty invasive and sizable, taking place entirely within modulo-sched.c. The main issue I've been considering, is whether it would be possible instead to transform the new loop patterns we want SMS to handle, into doloops (potentially introducing additional induction variables to feed other uses), and then feed the resulting loop into SMS as is? In other words, could you fold it into doloop.c? And if so, will doing so introduce significant overheads? 2012/3/29 Andrey Belevantsev : > Hello, > > I'd like to ping again those SMS patches once we're back to Stage 1. > > Ayal, maybe it would remove some burden for you if you'd review the general > SMS functionality of those patches, and we'd ask RTL folks to look at the > pieces related to RTL pattern matching and generation? > It definitely would ... especially in light of the above issue. Thanks (for your patches, patience, pings..), Ayal. > Yours, > Andrey > > > On 10.02.2012 16:15, Roman Zhuykov wrote: >> >> Ping. >> Ayal, please review this patch and these three patches too: >> http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00505.html >> http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00506.html >> http://gcc.gnu.org/ml/gcc-patches/2011-12/msg01800.html >> >> -- >> Roman Zhuykov >> zhr...@ispras.ru
Re: [SMS] Support new loop pattern
Hello, I'd like to ping again those SMS patches once we're back to Stage 1. Ayal, maybe it would remove some burden for you if you'd review the general SMS functionality of those patches, and we'd ask RTL folks to look at the pieces related to RTL pattern matching and generation? Yours, Andrey On 10.02.2012 16:15, Roman Zhuykov wrote: Ping. Ayal, please review this patch and these three patches too: http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00505.html http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00506.html http://gcc.gnu.org/ml/gcc-patches/2011-12/msg01800.html -- Roman Zhuykov zhr...@ispras.ru
Re: [SMS] Support new loop pattern
Ping. Ayal, please review this patch and these three patches too: http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00505.html http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00506.html http://gcc.gnu.org/ml/gcc-patches/2011-12/msg01800.html -- Roman Zhuykov zhr...@ispras.ru
Re: [SMS] Support new loop pattern
Ping. Ayal, could you review this patch and these two patches too. http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00505.html http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00506.html Happy holidays. 2011/12/7 Roman Zhuykov : > Apologies for the messed up previous e-mail. > > 2011/10/12 Ayal Zaks : - the last jump instruction should look like: pc=(regF!=0)?label:pc, regF is >> >> you'd probably want to bump to next instruction if falling through, >> e.g., pc=(regF!=0)?label:pc+4 >> > > It is considered that program counter is increased automatically on > hardware level. > Otherwise we should add something like "pc=pc+4" in parallel to each > instruction in RTL. > flag register; - the last instruction which sets regF should be: regF=COMPARE(regC,X), where X is a constant, or maybe a register, which is not changed inside a loop; - only one instruction modifies regC inside a loop (other can use regC, but not write), and it should simply adjust it by a constant: regC=regC+step, where step is a constant. >>> When doloop is succesfully scheduled by SMS, its number of iterations of loop kernel should be decreased by the number of stages in a schedule minus one, while other iterations expand to prologue and epilogue. In new supported loops such approach can't be used, because some instructions can use count register (regC). Instead of this, the final register value X in compare instruction regF=COMPARE(regC,X) is changed to another value Y respective to the stage this instruction is scheduled (Y = X - stage * step). >> >> making sure this does not underflow; i.e., that the number of >> iterations is no less than stage (you've addressed this towards the >> end below). >> > > Yes, this situation is processed correctly. > >>> >>> The main difference from doloop case is that regC can be used by some >>> instructions in loop body. >>> That's why we are unable to simply adjust regC initial value, but have >>> to keep it's value correct on each particular iteration. >>> So, we change comparison instruction accordingly. >>> >>> An example: >>> int a[100]; >>> int main() >>> { >>> int i; >>> for (i = 85; i > 12; i -= 5) >>> a[i] = i * i; >>> return a[15]-225; >>> } >>> ARM assembler with "-O2 -fno-auto-inc-dec": >>> ldr r0, .L5 >>> mov r3, #85 >>> mov r2, r0 >>> .L2: >>> mul r1, r3, r3 >>> sub r3, r3, #5 >>> cmp r3, #10 >>> str r1, [r2, #340] >>> sub r2, r2, #20 >>> bne .L2 >>> ldr r0, [r0, #60] >>> sub r0, r0, #225 >>> bx lr >>> .L5: >>> .word a >>> >>> Loop body is executed 15 times. >>> When compiling with SMS, it finds a schedule with ii=7, stage_count=3 >>> and following times: >>> Stage Time Insn >>> 0 5 mul r1, r3, r3 >>> 1 10 sub r3, r3, #5 >>> 1 11 cmp r3, #10 >>> 1 11 str r1, [r2, #340] >>> 1 13 bne .L2 >>> 2 16 sub r2, r2, #20 >>> >> >> branch is not scheduled last? >> > > Yes, branch schedule time is smaller then sub's one. > This mean that "sub r2, r2, $20" instruction from original iteration > number K will be executed after > "bne .L2" from original iteration number K. > But certainly bne remains to be the last instuction in new loop body. > Below you can see how it looks after SMS. > >>> To make new schedule correct the loop body >>> should be executed 14 times and we change compare instruction: >> >> the loop itself should execute 13 times. > > with i = > 85, 80, 75, 70, 65 > 60, 55, 50, 45, 40 > 35, 30, 25, 20, 15 > this gives total 15 iterations (15 stores to memory). > And new loop body will be executed 13 times (one store goes to > epilogue and one - to prologue). > >>> regF=COMPARE(regC,X) to regF=COMPARE(regC,Y) where Y = X - stage * step. >>> In our example regC is r3, X is 10, step = -5, compare instruction >>> is scheduled on stage 1, so it should be Y = 10 - 1 * (-5) = 15. >>> >> >> right. In general, if the compare is on stage s (starting from 0), it >> will be executed s times in the epilog, so it should exit the loop >> upon reaching Y = X - s * step. >> >>> So, after SMS it looks like: >>> ldr r0, .L5 >>> mov r3, #85 >>> mov r2, r0 >>> ;;prologue >>> mul r1, r3, r3 ;;from stage 0 first iteration >>> sub r3, r3, #5 ;;3 insns from stage 1 first iteration >>> cmp r3, #10 >>> str r1, [r2, #340] >>> mul r1, r3, r3 ;;from stage 0 second iteration >>> ;;body >>> .L2: >>> sub r3, r3, #5 >>> sub r2, r2, #20 >>> cmp r3, #15 ;; new value to compare with is Y=15 >>> str r1, [r2, #340] >>> mul r1, r3, r3 >>> bne .L2 >>> ;;epilogue >>> sub r2, r2, #20 ;;from stage 2 p
Re: [SMS] Support new loop pattern
Apologies for the messed up previous e-mail. 2011/10/12 Ayal Zaks : >>> - the last jump instruction should look like: pc=(regF!=0)?label:pc, regF >>> is > > you'd probably want to bump to next instruction if falling through, > e.g., pc=(regF!=0)?label:pc+4 > It is considered that program counter is increased automatically on hardware level. Otherwise we should add something like "pc=pc+4" in parallel to each instruction in RTL. >>> flag register; >>> - the last instruction which sets regF should be: regF=COMPARE(regC,X), >>> where X >>> is a constant, or maybe a register, which is not changed inside a loop; >>> - only one instruction modifies regC inside a loop (other can use regC, but >>> not >>> write), and it should simply adjust it by a constant: regC=regC+step, where >>> step is a constant. >> >>> When doloop is succesfully scheduled by SMS, its number of >>> iterations of loop kernel should be decreased by the number of stages in a >>> schedule minus one, while other iterations expand to prologue and epilogue. >>> In new supported loops such approach can't be used, because some >>> instructions can use count register (regC). Instead of this, >>> the final register value X in compare instruction regF=COMPARE(regC,X) >>> is changed to another value Y respective to the stage this instruction >>> is scheduled (Y = X - stage * step). > > making sure this does not underflow; i.e., that the number of > iterations is no less than stage (you've addressed this towards the > end below). > Yes, this situation is processed correctly. >> >> The main difference from doloop case is that regC can be used by some >> instructions in loop body. >> That's why we are unable to simply adjust regC initial value, but have >> to keep it's value correct on each particular iteration. >> So, we change comparison instruction accordingly. >> >> An example: >> int a[100]; >> int main() >> { >> int i; >> for (i = 85; i > 12; i -= 5) >> a[i] = i * i; >> return a[15]-225; >> } >> ARM assembler with "-O2 -fno-auto-inc-dec": >> ldr r0, .L5 >> mov r3, #85 >> mov r2, r0 >> .L2: >> mul r1, r3, r3 >> sub r3, r3, #5 >> cmp r3, #10 >> str r1, [r2, #340] >> sub r2, r2, #20 >> bne .L2 >> ldr r0, [r0, #60] >> sub r0, r0, #225 >> bx lr >> .L5: >> .word a >> >> Loop body is executed 15 times. >> When compiling with SMS, it finds a schedule with ii=7, stage_count=3 >> and following times: >> Stage Time Insn >> 0 5 mul r1, r3, r3 >> 1 10 sub r3, r3, #5 >> 1 11 cmp r3, #10 >> 1 11 str r1, [r2, #340] >> 1 13 bne .L2 >> 2 16 sub r2, r2, #20 >> > > branch is not scheduled last? > Yes, branch schedule time is smaller then sub's one. This mean that "sub r2, r2, $20" instruction from original iteration number K will be executed after "bne .L2" from original iteration number K. But certainly bne remains to be the last instuction in new loop body. Below you can see how it looks after SMS. >> To make new schedule correct the loop body >> should be executed 14 times and we change compare instruction: > > the loop itself should execute 13 times. with i = 85, 80, 75, 70, 65 60, 55, 50, 45, 40 35, 30, 25, 20, 15 this gives total 15 iterations (15 stores to memory). And new loop body will be executed 13 times (one store goes to epilogue and one - to prologue). >> regF=COMPARE(regC,X) to regF=COMPARE(regC,Y) where Y = X - stage * step. >> In our example regC is r3, X is 10, step = -5, compare instruction >> is scheduled on stage 1, so it should be Y = 10 - 1 * (-5) = 15. >> > > right. In general, if the compare is on stage s (starting from 0), it > will be executed s times in the epilog, so it should exit the loop > upon reaching Y = X - s * step. > >> So, after SMS it looks like: >> ldr r0, .L5 >> mov r3, #85 >> mov r2, r0 >> ;;prologue >> mul r1, r3, r3 ;;from stage 0 first iteration >> sub r3, r3, #5 ;;3 insns from stage 1 first iteration >> cmp r3, #10 >> str r1, [r2, #340] >> mul r1, r3, r3 ;;from stage 0 second iteration >> ;;body >> .L2: >> sub r3, r3, #5 >> sub r2, r2, #20 >> cmp r3, #15 ;; new value to compare with is Y=15 >> str r1, [r2, #340] >> mul r1, r3, r3 >> bne .L2 >> ;;epilogue >> sub r2, r2, #20 ;;from stage 2 pre-last iteration >> sub r3, r3, #5 ;;3 insns from stage 1 last iteration >> cmp r3, #10 >> str r1, [r2, #340] >> sub r2, r2, #20 ;;from stage 2 last iteration >> >> ldr r0, [r0, #60] >> sub r0, r0, #225 >> bx lr >> .L5: >> .word a >> Here in comments I mention why insn was copied to prol
Re: [SMS] Support new loop pattern
2011/10/12 Ayal Zaks :>>> - the last jump instruction should look like: pc=(regF!=0)?label:pc, regF is>> you'd probably want to bump to next instruction if falling through,> e.g., pc=(regF!=0)?label:pc+4> It is considered that program counter is increased automatically onhardware level.Otherwise we should add something like "pc=pc+4" in parallel to eachinstruction in RTL. >>> flag register;>>> - the last instruction which sets regF should be: >>> regF=COMPARE(regC,X), where X>>> is a constant, or maybe a register, which >>> is not changed inside a loop;>>> - only one instruction modifies regC >>> inside a loop (other can use regC, but not>>> write), and it should simply >>> adjust it by a constant: regC=regC+step, where>>> step is a constant.> >>> When doloop is succesfully scheduled by SMS, its number of>>> iterations of >>> loop kernel should be decreased by the number of stages in a>>> schedule >>> minus one, while other iterations expand to prologue and epilogue.>>> In >>> new supported loops such approach can't be used, because some>>> >>> instructions can use count register (regC). Instead of this,>>> the final >>> register value X in compare instruction regF=COMPARE(regC,X)>>> is changed >>> to another value Y respective to the stage this instruction>>> is scheduled >>> (Y = X - stage * step).>> making sure this does not underflow; i.e., that >>> the number of> iterations is no less than stage (you've addressed this >>> towards the> end below).> Yes, this situation is processed correctly. The main difference from doloop case is that regC can be used by some>> instructions in loop body.>> That's why we are unable to simply adjust regC initial value, but have>> to keep it's value correct on each particular iteration.>> So, we change comparison instruction accordingly. An example:>> int a[100];>> int main()>> {>> int i;>> for (i = 85; i > 12; i -= 5)>> a[i] = i * i;>> return a[15]-225;>> }>> ARM assembler with "-O2 -fno-auto-inc-dec":>> ldr r0, .L5>> mov r3, #85>> mov r2, r0>> .L2:>> mul r1, r3, r3>> sub r3, r3, #5>> cmp r3, #10>> str r1, [r2, #340]>> sub r2, r2, #20>> bne .L2>> ldr r0, [r0, #60]>> sub r0, r0, #225>> bx lr>> .L5:>> .word a Loop body is executed 15 times.>> When compiling with SMS, it finds a schedule with ii=7, stage_count=3>> and following times:>> Stage Time Insn>> 0 5 mul r1, r3, r3>> 1 10 sub r3, r3, #5>> 1 11 cmp r3, #10>> 1 11 str r1, [r2, #340]>> 1 13 bne .L2>> 2 16 sub r2, r2, #20 branch is not scheduled last?> Yes, branch schedule time is smaller then sub's one.This mean that "sub r2, r2, $20" instruction from original iterationnumber K will be executed after"bne .L2" from original iteration number K.But certainly bne remains to be the last instuction in new loop body.Below you can see how it looks after SMS. >> To make new schedule correct the loop body>> should be executed 14 times and >> we change compare instruction:>> the loop itself should execute 13 times. with i =85, 80, 75, 70, 6560, 55, 50, 45, 4035, 30, 25, 20, 15this gives total 15 iterations (15 stores to memory).And new loop body will be executed 13 times (one store goes toepilogue and one - to prologue). >> regF=COMPARE(regC,X) to regF=COMPARE(regC,Y) where Y = X - stage * step.>> >> In our example regC is r3, X is 10, step = -5, compare instruction>> is >> scheduled on stage 1, so it should be Y = 10 - 1 * (-5) = 15. right. In >> general, if the compare is on stage s (starting from 0), it> will be >> executed s times in the epilog, so it should exit the loop> upon reaching Y >> = X - s * step.>>> So, after SMS it looks like:>> ldr r0, .L5>> >> mov r3, #85>> mov r2, r0>> ;;prologue>> mul >> r1, r3, r3 ;;from stage 0 first iteration>> sub r3, r3, #5 >> ;;3 insns from stage 1 first iteration>> cmp r3, #10>> >> str r1, [r2, #340]>> mul r1, r3, r3 ;;from stage 0 >> second iteration>> ;;body>> .L2:>> sub r3, r3, #5>> sub >> r2, r2, #20>> cmp r3, #15 ;; new value to compare with >> is Y=15>> str r1, [r2, #340]>> mul r1, r3, r3>> >> bne .L2>> ;;epilogue>> sub r2, r2, #20 ;;from stage 2 >> pre-last iteration>> sub r3, r3, #5 ;;3 insns from stage 1 >> last iteration>> cmp r3, #10>> str r1, [r2, #340]>> >> sub r2, r2, #20 ;;from stage 2 last iteration ldr >> r0, [r0, #60]>> sub r0, r0, #225>> bx lr>> .L5:>> >> .word a>> He