Re: [SMS] Support new loop pattern

2012-04-10 Thread Andrey Belevantsev

Hello Ayal,

First of all, thanks for your feedback.  Now to your questions:

On 31.03.2012 3:20, Ayal Zaks wrote:

Roman, Andrey,

Sorry for the delayed response.

It would indeed be good to have SMS apply to more loop patterns, still
within the realm of *countable* loops. SMS was originally designed to
handle doloops, with a specific pattern controlling the loop, easily
identified and separable from the loop's body. The newly proposed
change to support new loop patterns is pretty invasive and sizable,
taking place entirely within modulo-sched.c. The main issue I've been
considering, is whether it would be possible instead to transform the
new loop patterns we want SMS to handle, into doloops (potentially
introducing additional induction variables to feed other uses), and
then feed the resulting loop into SMS as is? In other words, could you
fold it into doloop.c? And if so, will doing so introduce significant
overheads?


Let me perhaps explain better.  The patch itself is one core patch (this 
thread) adding the new functionality on detecting more complex loop 
patterns and the three fixes to SMS found while working on the main patch 
(the fixes are in the mails pinged at the very end of this message).  The 
three fixes are worthwhile to commit separately anyways, they are splitted 
up from the main patch for this purpose, so I would suggest to consider 
them in any case.


For the main patch, its core is as small as we could get.  It stays with 
the countable loops as for the cases where we could get overflow behavior 
or infinite loops we bail out early.  We handle only a case of simple 
same-step affine counters.  The main reason why we add support to SMS and 
not to the doloop pass are is when we do not pipeline a loop newly 
transformed to the doloop form, this loop actually slows down on the 
platforms not having a true doloop pattern.  One has to undo the doloop 
form and to get back to the original loop form to avoid this, which seems 
rather strange.  Also, the separate decrement insn that changes the 
induction variable is better be scheduled to get more precise schedule. 
And yes, I believe that making an extra induction variable just to have the 
control part without uses in the loop core will be unnecessary overhead.


Thus, I believe that if we do want SMS to handle more complex loop, then it 
is inevitable that SMS itself would be somewhat more complex.  I would 
welcome your suggestions to make the patch more clear.  One way I see is 
that the function for getting the condition of the new loop form can be 
moved to the generic RTL loop code given the agreement of other RTL 
maintainers.  Also, some new helpers can be introduced for handling this 
specific loop forms.  But it seems that the distinction between 
doloop/non-doloop loops has to stay in the code.


Yours,
Andrey


2012/3/29 Andrey Belevantsev:

Hello,

I'd like to ping again those SMS patches once we're back to Stage 1.

Ayal, maybe it would remove some burden for you if you'd review the general
SMS functionality of those patches, and we'd ask RTL folks to look at the
pieces related to RTL pattern matching and generation?



It definitely would ... especially in light of the above issue.
Thanks (for your patches, patience, pings..),
Ayal.




Yours,
Andrey


On 10.02.2012 16:15, Roman Zhuykov wrote:


Ping.
Ayal, please review this patch and these three patches too:
http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00505.html
http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00506.html
http://gcc.gnu.org/ml/gcc-patches/2011-12/msg01800.html

--
Roman Zhuykov
zhr...@ispras.ru




Re: [SMS] Support new loop pattern

2012-03-30 Thread Ayal Zaks
Roman, Andrey,

Sorry for the delayed response.

It would indeed be good to have SMS apply to more loop patterns, still
within the realm of *countable* loops. SMS was originally designed to
handle doloops, with a specific pattern controlling the loop, easily
identified and separable from the loop's body. The newly proposed
change to support new loop patterns is pretty invasive and sizable,
taking place entirely within modulo-sched.c. The main issue I've been
considering, is whether it would be possible instead to transform the
new loop patterns we want SMS to handle, into doloops (potentially
introducing additional induction variables to feed other uses), and
then feed the resulting loop into SMS as is? In other words, could you
fold it into doloop.c? And if so, will doing so introduce significant
overheads?

2012/3/29 Andrey Belevantsev :
> Hello,
>
> I'd like to ping again those SMS patches once we're back to Stage 1.
>
> Ayal, maybe it would remove some burden for you if you'd review the general
> SMS functionality of those patches, and we'd ask RTL folks to look at the
> pieces related to RTL pattern matching and generation?
>

It definitely would ... especially in light of the above issue.
Thanks (for your patches, patience, pings..),
Ayal.



> Yours,
> Andrey
>
>
> On 10.02.2012 16:15, Roman Zhuykov wrote:
>>
>> Ping.
>> Ayal, please review this patch and these three patches too:
>> http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00505.html
>> http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00506.html
>> http://gcc.gnu.org/ml/gcc-patches/2011-12/msg01800.html
>>
>> --
>> Roman Zhuykov
>> zhr...@ispras.ru


Re: [SMS] Support new loop pattern

2012-03-29 Thread Andrey Belevantsev

Hello,

I'd like to ping again those SMS patches once we're back to Stage 1.

Ayal, maybe it would remove some burden for you if you'd review the general 
SMS functionality of those patches, and we'd ask RTL folks to look at the 
pieces related to RTL pattern matching and generation?


Yours,
Andrey

On 10.02.2012 16:15, Roman Zhuykov wrote:

Ping.
Ayal, please review this patch and these three patches too:
http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00505.html
http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00506.html
http://gcc.gnu.org/ml/gcc-patches/2011-12/msg01800.html

--
Roman Zhuykov
zhr...@ispras.ru




Re: [SMS] Support new loop pattern

2012-02-10 Thread Roman Zhuykov
Ping.
Ayal, please review this patch and these three patches too:
http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00505.html
http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00506.html
http://gcc.gnu.org/ml/gcc-patches/2011-12/msg01800.html

--
Roman Zhuykov
zhr...@ispras.ru


Re: [SMS] Support new loop pattern

2011-12-29 Thread Roman Zhuykov
Ping.
Ayal, could you review this patch and these two patches too.
http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00505.html
http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00506.html

Happy holidays.

2011/12/7 Roman Zhuykov :
> Apologies for the messed up previous e-mail.
>
> 2011/10/12 Ayal Zaks :
 - the last jump instruction should look like:  pc=(regF!=0)?label:pc, regF 
 is
>>
>> you'd probably want to bump to next instruction if falling through,
>> e.g., pc=(regF!=0)?label:pc+4
>>
>
> It is considered that program counter is increased automatically on
> hardware level.
> Otherwise we should add something like "pc=pc+4" in parallel to each
> instruction in RTL.
>
  flag register;
 - the last instruction which sets regF should be: regF=COMPARE(regC,X), 
 where X
  is a constant, or maybe a register, which is not changed inside a loop;
 - only one instruction modifies regC inside a loop (other can use regC, 
 but not
  write), and it should simply adjust it by a constant: regC=regC+step, 
 where
  step is a constant.
>>>
 When doloop is succesfully scheduled by SMS, its number of
 iterations of loop kernel should be decreased by the number of stages in a
 schedule minus one, while other iterations expand to prologue and epilogue.
 In new supported loops such approach can't be used, because some
 instructions can use count register (regC).  Instead of this,
 the final register value X in compare instruction regF=COMPARE(regC,X)
 is changed to another value Y respective to the stage this instruction
 is scheduled (Y = X - stage * step).
>>
>> making sure this does not underflow; i.e., that the number of
>> iterations is no less than stage (you've addressed this towards the
>> end below).
>>
>
> Yes, this situation is processed correctly.
>
>>>
>>> The main difference from doloop case is that regC can be used by some
>>> instructions in loop body.
>>> That's why we are unable to simply adjust regC initial value, but have
>>> to keep it's value correct on each particular iteration.
>>> So, we change comparison instruction accordingly.
>>>
>>> An example:
>>> int a[100];
>>> int main()
>>> {
>>>  int i;
>>>  for (i = 85; i > 12; i -= 5)
>>>      a[i] = i * i;
>>>  return a[15]-225;
>>> }
>>> ARM assembler with "-O2 -fno-auto-inc-dec":
>>>        ldr     r0, .L5
>>>        mov     r3, #85
>>>        mov     r2, r0
>>> .L2:
>>>        mul     r1, r3, r3
>>>        sub     r3, r3, #5
>>>        cmp     r3, #10
>>>        str     r1, [r2, #340]
>>>        sub     r2, r2, #20
>>>        bne     .L2
>>>        ldr     r0, [r0, #60]
>>>        sub     r0, r0, #225
>>>        bx      lr
>>> .L5:
>>>        .word   a
>>>
>>> Loop body is executed 15 times.
>>> When compiling with SMS, it finds a schedule with ii=7, stage_count=3
>>> and following times:
>>> Stage  Time       Insn
>>> 0          5      mul     r1, r3, r3
>>> 1         10     sub     r3, r3, #5
>>> 1         11     cmp     r3, #10
>>> 1         11     str     r1, [r2, #340]
>>> 1         13     bne     .L2
>>> 2         16     sub     r2, r2, #20
>>>
>>
>> branch is not scheduled last?
>>
>
> Yes, branch schedule time is smaller then sub's one.
> This mean that "sub r2, r2, $20" instruction from original iteration
> number K will be executed after
> "bne .L2" from original iteration number K.
> But certainly bne remains to be the last instuction in new loop body.
> Below you can see how it looks after SMS.
>
>>> To make new schedule correct the loop body
>>> should be executed 14 times and we change compare instruction:
>>
>> the loop itself should execute 13 times.
>
> with i =
> 85, 80, 75, 70, 65
> 60, 55, 50, 45, 40
> 35, 30, 25, 20, 15
> this gives total 15 iterations (15 stores to memory).
> And new loop body will be executed 13 times (one store goes to
> epilogue and one - to prologue).
>
>>> regF=COMPARE(regC,X) to regF=COMPARE(regC,Y) where Y = X - stage * step.
>>> In our example regC is r3, X is 10, step = -5, compare instruction
>>> is scheduled on stage 1, so it should be Y = 10 - 1 * (-5) = 15.
>>>
>>
>> right. In general, if the compare is on stage s (starting from 0), it
>> will be executed s times in the epilog, so it should exit the loop
>> upon reaching Y = X - s * step.
>>
>>> So, after SMS it looks like:
>>>        ldr     r0, .L5
>>>        mov     r3, #85
>>>        mov     r2, r0
>>> ;;prologue
>>>        mul     r1, r3, r3      ;;from stage 0 first iteration
>>>        sub     r3, r3, #5      ;;3 insns from stage 1 first iteration
>>>        cmp     r3, #10
>>>        str     r1, [r2, #340]
>>>        mul     r1, r3, r3      ;;from stage 0 second iteration
>>> ;;body
>>> .L2:
>>>        sub     r3, r3, #5
>>>        sub     r2, r2, #20
>>>        cmp     r3, #15         ;; new value to compare with is Y=15
>>>        str     r1, [r2, #340]
>>>        mul     r1, r3, r3
>>>        bne     .L2
>>> ;;epilogue
>>>        sub     r2, r2, #20     ;;from stage 2 p

Re: [SMS] Support new loop pattern

2011-12-07 Thread Roman Zhuykov
Apologies for the messed up previous e-mail.

2011/10/12 Ayal Zaks :
>>> - the last jump instruction should look like:  pc=(regF!=0)?label:pc, regF 
>>> is
>
> you'd probably want to bump to next instruction if falling through,
> e.g., pc=(regF!=0)?label:pc+4
>

It is considered that program counter is increased automatically on
hardware level.
Otherwise we should add something like "pc=pc+4" in parallel to each
instruction in RTL.

>>>  flag register;
>>> - the last instruction which sets regF should be: regF=COMPARE(regC,X), 
>>> where X
>>>  is a constant, or maybe a register, which is not changed inside a loop;
>>> - only one instruction modifies regC inside a loop (other can use regC, but 
>>> not
>>>  write), and it should simply adjust it by a constant: regC=regC+step, where
>>>  step is a constant.
>>
>>> When doloop is succesfully scheduled by SMS, its number of
>>> iterations of loop kernel should be decreased by the number of stages in a
>>> schedule minus one, while other iterations expand to prologue and epilogue.
>>> In new supported loops such approach can't be used, because some
>>> instructions can use count register (regC).  Instead of this,
>>> the final register value X in compare instruction regF=COMPARE(regC,X)
>>> is changed to another value Y respective to the stage this instruction
>>> is scheduled (Y = X - stage * step).
>
> making sure this does not underflow; i.e., that the number of
> iterations is no less than stage (you've addressed this towards the
> end below).
>

Yes, this situation is processed correctly.

>>
>> The main difference from doloop case is that regC can be used by some
>> instructions in loop body.
>> That's why we are unable to simply adjust regC initial value, but have
>> to keep it's value correct on each particular iteration.
>> So, we change comparison instruction accordingly.
>>
>> An example:
>> int a[100];
>> int main()
>> {
>>  int i;
>>  for (i = 85; i > 12; i -= 5)
>>      a[i] = i * i;
>>  return a[15]-225;
>> }
>> ARM assembler with "-O2 -fno-auto-inc-dec":
>>        ldr     r0, .L5
>>        mov     r3, #85
>>        mov     r2, r0
>> .L2:
>>        mul     r1, r3, r3
>>        sub     r3, r3, #5
>>        cmp     r3, #10
>>        str     r1, [r2, #340]
>>        sub     r2, r2, #20
>>        bne     .L2
>>        ldr     r0, [r0, #60]
>>        sub     r0, r0, #225
>>        bx      lr
>> .L5:
>>        .word   a
>>
>> Loop body is executed 15 times.
>> When compiling with SMS, it finds a schedule with ii=7, stage_count=3
>> and following times:
>> Stage  Time       Insn
>> 0          5      mul     r1, r3, r3
>> 1         10     sub     r3, r3, #5
>> 1         11     cmp     r3, #10
>> 1         11     str     r1, [r2, #340]
>> 1         13     bne     .L2
>> 2         16     sub     r2, r2, #20
>>
>
> branch is not scheduled last?
>

Yes, branch schedule time is smaller then sub's one.
This mean that "sub r2, r2, $20" instruction from original iteration
number K will be executed after
"bne .L2" from original iteration number K.
But certainly bne remains to be the last instuction in new loop body.
Below you can see how it looks after SMS.

>> To make new schedule correct the loop body
>> should be executed 14 times and we change compare instruction:
>
> the loop itself should execute 13 times.

with i =
85, 80, 75, 70, 65
60, 55, 50, 45, 40
35, 30, 25, 20, 15
this gives total 15 iterations (15 stores to memory).
And new loop body will be executed 13 times (one store goes to
epilogue and one - to prologue).

>> regF=COMPARE(regC,X) to regF=COMPARE(regC,Y) where Y = X - stage * step.
>> In our example regC is r3, X is 10, step = -5, compare instruction
>> is scheduled on stage 1, so it should be Y = 10 - 1 * (-5) = 15.
>>
>
> right. In general, if the compare is on stage s (starting from 0), it
> will be executed s times in the epilog, so it should exit the loop
> upon reaching Y = X - s * step.
>
>> So, after SMS it looks like:
>>        ldr     r0, .L5
>>        mov     r3, #85
>>        mov     r2, r0
>> ;;prologue
>>        mul     r1, r3, r3      ;;from stage 0 first iteration
>>        sub     r3, r3, #5      ;;3 insns from stage 1 first iteration
>>        cmp     r3, #10
>>        str     r1, [r2, #340]
>>        mul     r1, r3, r3      ;;from stage 0 second iteration
>> ;;body
>> .L2:
>>        sub     r3, r3, #5
>>        sub     r2, r2, #20
>>        cmp     r3, #15         ;; new value to compare with is Y=15
>>        str     r1, [r2, #340]
>>        mul     r1, r3, r3
>>        bne     .L2
>> ;;epilogue
>>        sub     r2, r2, #20     ;;from stage 2 pre-last iteration
>>        sub     r3, r3, #5      ;;3 insns from stage 1 last iteration
>>        cmp     r3, #10
>>        str     r1, [r2, #340]
>>        sub     r2, r2, #20     ;;from stage 2 last iteration
>>
>>        ldr     r0, [r0, #60]
>>        sub     r0, r0, #225
>>        bx      lr
>> .L5:
>>        .word   a
>>

Here in comments I mention why insn was copied to prol

Re: [SMS] Support new loop pattern

2011-12-07 Thread Roman Zhuykov
2011/10/12 Ayal Zaks :>>> - the last jump
instruction should look like:  pc=(regF!=0)?label:pc, regF is>> you'd
probably want to bump to next instruction if falling through,> e.g.,
pc=(regF!=0)?label:pc+4>
It is considered that program counter is increased automatically
onhardware level.Otherwise we should add something like "pc=pc+4" in
parallel to eachinstruction in RTL.
>>>  flag register;>>> - the last instruction which sets regF should be: 
>>> regF=COMPARE(regC,X), where X>>>  is a constant, or maybe a register, which 
>>> is not changed inside a loop;>>> - only one instruction modifies regC 
>>> inside a loop (other can use regC, but not>>>  write), and it should simply 
>>> adjust it by a constant: regC=regC+step, where>>>  step is a constant.> 
>>> When doloop is succesfully scheduled by SMS, its number of>>> iterations of 
>>> loop kernel should be decreased by the number of stages in a>>> schedule 
>>> minus one, while other iterations expand to prologue and epilogue.>>> In 
>>> new supported loops such approach can't be used, because some>>> 
>>> instructions can use count register (regC).  Instead of this,>>> the final 
>>> register value X in compare instruction regF=COMPARE(regC,X)>>> is changed 
>>> to another value Y respective to the stage this instruction>>> is scheduled 
>>> (Y = X - stage * step).>> making sure this does not underflow; i.e., that 
>>> the number of> iterations is no less than stage (you've addressed this 
>>> towards the> end below).>
Yes, this situation is processed correctly.
 The main difference from doloop case is that regC can be used by some>> 
 instructions in loop body.>> That's why we are unable to simply adjust 
 regC initial value, but have>> to keep it's value correct on each 
 particular iteration.>> So, we change comparison instruction 
 accordingly. An example:>> int a[100];>> int main()>> {>>  int i;>>  
 for (i = 85; i > 12; i -= 5)>>      a[i] = i * i;>>  return a[15]-225;>> 
 }>> ARM assembler with "-O2 -fno-auto-inc-dec":>>        ldr     r0, .L5>> 
        mov     r3, #85>>        mov     r2, r0>> .L2:>>        mul     r1, 
 r3, r3>>        sub     r3, r3, #5>>        cmp     r3, #10>>        str   
   r1, [r2, #340]>>        sub     r2, r2, #20>>        bne     .L2>>       
  ldr     r0, [r0, #60]>>        sub     r0, r0, #225>>        bx      lr>> 
 .L5:>>        .word   a Loop body is executed 15 times.>> When 
 compiling with SMS, it finds a schedule with ii=7, stage_count=3>> and 
 following times:>> Stage  Time       Insn>> 0          5      mul     r1, 
 r3, r3>> 1         10     sub     r3, r3, #5>> 1         11     cmp     
 r3, #10>> 1         11     str     r1, [r2, #340]>> 1         13     bne   
   .L2>> 2         16     sub     r2, r2, #20 branch is not scheduled 
 last?>
Yes, branch schedule time is smaller then sub's one.This mean that
"sub r2, r2, $20" instruction from original iterationnumber K will be
executed after"bne .L2" from original iteration number K.But certainly
bne remains to be the last instuction in new loop body.Below you can
see how it looks after SMS.
>> To make new schedule correct the loop body>> should be executed 14 times and 
>> we change compare instruction:>> the loop itself should execute 13 times.
with i =85, 80, 75, 70, 6560, 55, 50, 45, 4035, 30, 25, 20, 15this
gives total 15 iterations (15 stores to memory).And new loop body will
be executed 13 times (one store goes toepilogue and one - to
prologue).
>> regF=COMPARE(regC,X) to regF=COMPARE(regC,Y) where Y = X - stage * step.>> 
>> In our example regC is r3, X is 10, step = -5, compare instruction>> is 
>> scheduled on stage 1, so it should be Y = 10 - 1 * (-5) = 15. right. In 
>> general, if the compare is on stage s (starting from 0), it> will be 
>> executed s times in the epilog, so it should exit the loop> upon reaching Y 
>> = X - s * step.>>> So, after SMS it looks like:>>        ldr     r0, .L5>>   
>>      mov     r3, #85>>        mov     r2, r0>> ;;prologue>>        mul     
>> r1, r3, r3      ;;from stage 0 first iteration>>        sub     r3, r3, #5   
>>    ;;3 insns from stage 1 first iteration>>        cmp     r3, #10>>        
>> str     r1, [r2, #340]>>        mul     r1, r3, r3      ;;from stage 0 
>> second iteration>> ;;body>> .L2:>>        sub     r3, r3, #5>>        sub    
>>  r2, r2, #20>>        cmp     r3, #15         ;; new value to compare with 
>> is Y=15>>        str     r1, [r2, #340]>>        mul     r1, r3, r3>>        
>> bne     .L2>> ;;epilogue>>        sub     r2, r2, #20     ;;from stage 2 
>> pre-last iteration>>        sub     r3, r3, #5      ;;3 insns from stage 1 
>> last iteration>>        cmp     r3, #10>>        str     r1, [r2, #340]>>    
>>     sub     r2, r2, #20     ;;from stage 2 last iteration        ldr     
>> r0, [r0, #60]>>        sub     r0, r0, #225>>        bx      lr>> .L5:>>     
>>    .word   a>>
He