Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-06-01 Thread H.J. Lu
On Tue, May 29, 2012 at 12:01 PM, Igor Zamyatin izamya...@gmail.com wrote:
 Hi, Uros!

 Sorry, I didn't realize that the patch was missed. I have attached a new version.

 Changelog:

 2012-05-29  Yuri Rumyantsev  yuri.s.rumyant...@intel.com

       * config/i386/i386.c (x86_sched_reorder): New function.
       Added new function x86_sched_reorder.

 As for multiply modes, we currently handle the most frequent case for
 Atom; this could be enhanced in the future.


This may have caused:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53555

-- 
H.J.


Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-06-01 Thread Eric Botcazou
 Sorry, I didn't realize that the patch was missed. I have attached a new version.

 Changelog:

 2012-05-29  Yuri Rumyantsev  yuri.s.rumyant...@intel.com

* config/i386/i386.c (x86_sched_reorder): New function.
Added new function x86_sched_reorder.

Reading it, you get the impression that the new function is unused.  Replace 
the second line with the description of the second hunk.
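For illustration, the entry could take a shape along these lines, assuming the
second hunk is the one that defines the TARGET_SCHED_REORDER hook (adjust the
second line to whatever the second hunk actually does):

       * config/i386/i386.c (x86_sched_reorder): New function.
       (TARGET_SCHED_REORDER): Define to use it.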

-- 
Eric Botcazou


Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-05-29 Thread Igor Zamyatin
Hi, Uros!

Sorry, I didn't realize that the patch was missed. I have attached a new version.

Changelog:

2012-05-29  Yuri Rumyantsev  yuri.s.rumyant...@intel.com

   * config/i386/i386.c (x86_sched_reorder): New function.
   Added new function x86_sched_reorder.

As for multiply modes, we currently handle the most frequent case for
Atom; this could be enhanced in the future.

Thanks,
Igor


 Hello!

 Ping?

 Please at least add an URL to the patch; it took me some time to find the
 latest version [1], and I'm not even sure it is the latest one...

 I assume that you cleared all issues with the middle-end and scheduler
 maintainers; it is not clear from the message.

 +   (1) IMUL instrction is on the top of list;

 Typo above.

 +  static int issue_rate = -1;
 +  int n_ready = *pn_ready;
 +  rtx insn;
 +  rtx insn1;
 +  rtx insn2;

 Please put three definitions above on the same line.

 +  int i;
 +  sd_iterator_def sd_it;
 +  dep_t dep;
 +  int index = -1;
 +
 +  /* set up issue rate */
 +  if (issue_rate < 0)
 +    issue_rate = ix86_issue_rate();

 Please set issue_rate unconditionally here.  Also, please follow the GNU 
 style of comments (Full sentence with two spaces after the dot) everywhere, 
 e.g:

 /* Set up issue rate.  */

 +  if (!(GET_CODE (SET_SRC (insn)) == MULT
 +        && GET_MODE (SET_SRC (insn)) == SImode))
 +    return issue_rate;

 Is it correct that only SImode multiplies are checked against SImode 
 multiplies? Can't we use DImode or HImode multiply (or other long-latency 
 insns) to put them into the shadow of the first multiply insn?

 As proposed in [2], there are several other fine-tuning approaches suggested by
 the scheduler maintainer. OTOH, even the big-hammer
 approach in the proposed patch makes things better, so it is a step in the
 right direction - and it is existing practice anyway.

 Under this rationale, I think that the patch should be committed to mainline. 
 But please also consider proposed fine-tunings to refine the scheduling 
 accuracy.

 So, OK for mainline, if there are no objections from other maintainers in 
 the next two days.

 [1] http://gcc.gnu.org/ml/gcc-patches/2012-04/msg00964.html
 [2] http://gcc.gnu.org/ml/gcc-patches/2012-04/msg00806.html

 Thanks,
 Uros.


imul_reordering.patch
Description: Binary data


Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-05-28 Thread Igor Zamyatin
Ping?

On Sun, May 6, 2012 at 11:27 AM, Igor Zamyatin izamya...@gmail.com wrote:
 Ping. Could x86 maintainer(s) look at these changes?

 Thanks,
 Igor

 On Fri, Apr 20, 2012 at 4:04 PM, Igor Zamyatin izamya...@gmail.com wrote:
 On Tue, Apr 17, 2012 at 12:27 AM, Igor Zamyatin izamya...@gmail.com wrote:
 On Fri, Apr 13, 2012 at 4:20 PM, Andrey Belevantsev a...@ispras.ru wrote:
 On 13.04.2012 14:18, Igor Zamyatin wrote:

 On Thu, Apr 12, 2012 at 5:01 PM, Andrey Belevantseva...@ispras.ru
  wrote:

 On 12.04.2012 16:38, Richard Guenther wrote:


 On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatinizamya...@gmail.com
  wrote:


 On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther
 richard.guent...@gmail.com    wrote:


 On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakovamona...@ispras.ru
  wrote:



 Can atom execute two IMUL in parallel?  Or what exactly is the
 pipeline
 behavior?



 As I understand from Intel's optimization reference manual, the
 behavior is as
 follows: if the instruction immediately following IMUL has shorter
 latency,
 execution is stalled for 4 cycles (which is IMUL's latency);
 otherwise,
 a
 4-or-more cycles latency instruction can be issued after IMUL without
 a
 stall.
 In other words, IMUL is pipelined with respect to other long-latency
 instructions, but not to short-latency instructions.



 It seems to be modeled in the pipeline description though:

 ;;; imul insn has 5 cycles latency
 (define_reservation atom-imul-32
                    atom-imul-1, atom-imul-2, atom-imul-3,
 atom-imul-4,
                     atom-port-0)

 ;;; imul instruction excludes other non-FP instructions.
 (exclusion_set atom-eu-0, atom-eu-1
               atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4)


 The main idea is quite simple:

 If we are going to schedule IMUL instruction (it is on the top of
 ready list) we try to find out producer of other (independent) IMUL
 instruction that is in ready list too. The goal is try to schedule
 such a producer to get another IMUL in ready list and get scheduling
 of 2 successive IMUL instructions.



 Why does that not happen without your patch?  Does it never happen
 without
 your patch or does it merely not happen for one EEMBC benchmark (can
 you provide a testcase?)?



 It does not happen because the scheduler by itself does not do such
 specific
 reordering.  That said, it is easy to imagine the cases where this patch
 will make things worse rather than better.

 Igor, why not try different subtler mechanisms like adjust_priority,
 which
 is get called when an insn is added to the ready list?  E.g. increase the
 producer's priority.

 The patch as is misses checks for NONDEBUG_INSN_P.  Also, why bail out
 when
 you have more than one imul in the ready list?  Don't you want to bump
 the
 priority of the other imul found?


 Could you provide some examples when this patch would harm the
 performance?


 I thought of the cases when the other ready insns can fill up the hole and
 that would be more beneficial because e.g. they would be on more critical
 paths than the producer of your second imul.  I don't know enough of Atom 
 to
 give an example -- maybe some long divisions?



 Sched_reorder was chosen since it is used in other ports and looks
 most suitable for such case, e.g. it provides access to the whole
 ready list.
 BTW, just increasing producer's priority seems to be more risky in
 performance sense - we can incorrectly start delaying some
 instructions.


 Yes, but exactly because of the above example you can start incorrectly
 delaying other insns, too, as you force the insn to be the first in the
 list.  While bumping priority still leaves the scheduler sorting heuristics
 in place and actually lowers that risk.


 Thought ready list doesn't contain DEBUG_INSN... Is it so? If it
 contains them - this could be added easily


 It does, but I'm not sure the sched_reorder hook gets them or they are
 immediately removed -- I saw similar checks in one of the targets' hooks.

 Done with DEBUG_INSN, also 1-imul limit was removed. Patch attached


 Anyways, my main thought was that it is better to test on more benchmarks 
 to
 alleviate the above concerns, so as long as the i386 maintainers are happy,
 I don't see major problems here.  A good idea could be to generalize the
 patch to handle other long latency insns as second consumers, not only
 imuls, if this is relevant for Atom.

 Yes, generalization of this approach is in plans. According to Atom
 Software optimization guide there are several headrooms left here.
 As for trying on more benchmarks - the particular case is indeed quite
 rare. I attached the example
 where patch helps to group imuls in pairs which is profitable for
 Atom. Such and similar codes are not very common.
 But hopefully this approach could help avoid this and other glassjaws.

 BTW, this patch also helps some EEMBC tests when funroll-loops specified.
 So, any feedback from i386 maintainers about this? :)

 Changelog slightly 

Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-05-28 Thread Uros Bizjak
Hello!

 Ping?

Please at least add an URL to the patch; it took me some time to
find the latest version [1], and I'm not even sure it is the latest
one...

I assume that you cleared all issues with the middle-end and scheduler
maintainers; it is not clear from the message.

+   (1) IMUL instrction is on the top of list;

Typo above.

+  static int issue_rate = -1;
+  int n_ready = *pn_ready;
+  rtx insn;
+  rtx insn1;
+  rtx insn2;

Please put three definitions above on the same line.

+  int i;
+  sd_iterator_def sd_it;
+  dep_t dep;
+  int index = -1;
+
+  /* set up issue rate */
+  if (issue_rate < 0)
+    issue_rate = ix86_issue_rate();

Please set issue_rate unconditionally here.  Also, please follow the
GNU style of comments (Full sentence with two spaces after the dot)
everywhere, e.g:

/* Set up issue rate.  */

+  if (!(GET_CODE (SET_SRC (insn)) == MULT
+        && GET_MODE (SET_SRC (insn)) == SImode))
+return issue_rate;

Is it correct that only SImode multiplies are checked against SImode
multiplies? Can't we use DImode or HImode multiply (or other
long-latency insns) to put them into the shadow of the first multiply
insn?
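For reference, a minimal sketch of a more general check, assuming the usual
i386.c context; this is not part of the submitted patch, and the HImode
cutoff is only an assumption about which multiplies are long-latency on Atom:

static bool
ix86_atom_pipelined_mult_p (rtx set_src)
{
  /* Accept any scalar integer MULT at least HImode wide, not just
     SImode, so that other long-latency multiplies can also be put
     into the shadow of the first one.  */
  return (GET_CODE (set_src) == MULT
          && SCALAR_INT_MODE_P (GET_MODE (set_src))
          && GET_MODE_SIZE (GET_MODE (set_src)) >= GET_MODE_SIZE (HImode));
}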

As proposed in [2], there are several other fine-tuning approaches
suggested by the scheduler maintainer. OTOH, even the big-hammer
approach in the proposed patch makes things better, so it is a step
in the right direction - and it is existing practice anyway.

Under this rationale, I think that the patch should be committed to
mainline. But please also consider proposed fine-tunings to refine the
scheduling accuracy.

So, OK for mainline, if there are no objections from other maintainers
in the next two days.

[1] http://gcc.gnu.org/ml/gcc-patches/2012-04/msg00964.html
[2] http://gcc.gnu.org/ml/gcc-patches/2012-04/msg00806.html

Thanks,
Uros.


Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-05-06 Thread Igor Zamyatin
Ping. Could x86 maintainer(s) look at these changes?

Thanks,
Igor

On Fri, Apr 20, 2012 at 4:04 PM, Igor Zamyatin izamya...@gmail.com wrote:
 On Tue, Apr 17, 2012 at 12:27 AM, Igor Zamyatin izamya...@gmail.com wrote:
 On Fri, Apr 13, 2012 at 4:20 PM, Andrey Belevantsev a...@ispras.ru wrote:
 On 13.04.2012 14:18, Igor Zamyatin wrote:

 On Thu, Apr 12, 2012 at 5:01 PM, Andrey Belevantseva...@ispras.ru
  wrote:

 On 12.04.2012 16:38, Richard Guenther wrote:


 On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatinizamya...@gmail.com
  wrote:


 On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther
 richard.guent...@gmail.com    wrote:


 On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakovamona...@ispras.ru
  wrote:



 Can atom execute two IMUL in parallel?  Or what exactly is the
 pipeline
 behavior?



 As I understand from Intel's optimization reference manual, the
 behavior is as
 follows: if the instruction immediately following IMUL has shorter
 latency,
 execution is stalled for 4 cycles (which is IMUL's latency);
 otherwise,
 a
 4-or-more cycles latency instruction can be issued after IMUL without
 a
 stall.
 In other words, IMUL is pipelined with respect to other long-latency
 instructions, but not to short-latency instructions.



 It seems to be modeled in the pipeline description though:

 ;;; imul insn has 5 cycles latency
 (define_reservation atom-imul-32
                    atom-imul-1, atom-imul-2, atom-imul-3,
 atom-imul-4,
                     atom-port-0)

 ;;; imul instruction excludes other non-FP instructions.
 (exclusion_set atom-eu-0, atom-eu-1
               atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4)


 The main idea is quite simple:

 If we are going to schedule IMUL instruction (it is on the top of
 ready list) we try to find out producer of other (independent) IMUL
 instruction that is in ready list too. The goal is try to schedule
 such a producer to get another IMUL in ready list and get scheduling
 of 2 successive IMUL instructions.



 Why does that not happen without your patch?  Does it never happen
 without
 your patch or does it merely not happen for one EEMBC benchmark (can
 you provide a testcase?)?



 It does not happen because the scheduler by itself does not do such
 specific
 reordering.  That said, it is easy to imagine the cases where this patch
 will make things worse rather than better.

 Igor, why not try different subtler mechanisms like adjust_priority,
 which
 is get called when an insn is added to the ready list?  E.g. increase the
 producer's priority.

 The patch as is misses checks for NONDEBUG_INSN_P.  Also, why bail out
 when
 you have more than one imul in the ready list?  Don't you want to bump
 the
 priority of the other imul found?


 Could you provide some examples when this patch would harm the
 performance?


 I thought of the cases when the other ready insns can fill up the hole and
 that would be more beneficial because e.g. they would be on more critical
 paths than the producer of your second imul.  I don't know enough of Atom to
 give an example -- maybe some long divisions?



 Sched_reorder was chosen since it is used in other ports and looks
 most suitable for such case, e.g. it provides access to the whole
 ready list.
 BTW, just increasing producer's priority seems to be more risky in
 performance sense - we can incorrectly start delaying some
 instructions.


 Yes, but exactly because of the above example you can start incorrectly
 delaying other insns, too, as you force the insn to be the first in the
 list.  While bumping priority still leaves the scheduler sorting heuristics
 in place and actually lowers that risk.


 Thought ready list doesn't contain DEBUG_INSN... Is it so? If it
 contains them - this could be added easily


 It does, but I'm not sure the sched_reorder hook gets them or they are
 immediately removed -- I saw similar checks in one of the targets' hooks.

 Done with DEBUG_INSN, also 1-imul limit was removed. Patch attached


 Anyways, my main thought was that it is better to test on more benchmarks to
 alleviate the above concerns, so as long as the i386 maintainers are happy,
 I don't see major problems here.  A good idea could be to generalize the
 patch to handle other long latency insns as second consumers, not only
 imuls, if this is relevant for Atom.

 Yes, generalization of this approach is in plans. According to Atom
 Software optimization guide there are several headrooms left here.
 As for trying on more benchmarks - the particular case is indeed quite
 rare. I attached the example
 where patch helps to group imuls in pairs which is profitable for
 Atom. Such and similar codes are not very common.
 But hopefully this approach could help avoid this and other glassjaws.

 BTW, this patch also helps some EEMBC tests when funroll-loops specified.
 So, any feedback from i386 maintainers about this? :)

 Changelog slightly changed


 2012-04-10  Yuri Rumyantsev  yuri.s.rumyant...@intel.com

        * 

Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-04-20 Thread Igor Zamyatin
On Tue, Apr 17, 2012 at 12:27 AM, Igor Zamyatin izamya...@gmail.com wrote:
 On Fri, Apr 13, 2012 at 4:20 PM, Andrey Belevantsev a...@ispras.ru wrote:
 On 13.04.2012 14:18, Igor Zamyatin wrote:

 On Thu, Apr 12, 2012 at 5:01 PM, Andrey Belevantseva...@ispras.ru
  wrote:

 On 12.04.2012 16:38, Richard Guenther wrote:


 On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatinizamya...@gmail.com
  wrote:


 On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther
 richard.guent...@gmail.com    wrote:


 On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakovamona...@ispras.ru
  wrote:



 Can atom execute two IMUL in parallel?  Or what exactly is the
 pipeline
 behavior?



 As I understand from Intel's optimization reference manual, the
 behavior is as
 follows: if the instruction immediately following IMUL has shorter
 latency,
 execution is stalled for 4 cycles (which is IMUL's latency);
 otherwise,
 a
 4-or-more cycles latency instruction can be issued after IMUL without
 a
 stall.
 In other words, IMUL is pipelined with respect to other long-latency
 instructions, but not to short-latency instructions.



 It seems to be modeled in the pipeline description though:

 ;;; imul insn has 5 cycles latency
 (define_reservation atom-imul-32
                    atom-imul-1, atom-imul-2, atom-imul-3,
 atom-imul-4,
                     atom-port-0)

 ;;; imul instruction excludes other non-FP instructions.
 (exclusion_set atom-eu-0, atom-eu-1
               atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4)


 The main idea is quite simple:

 If we are going to schedule IMUL instruction (it is on the top of
 ready list) we try to find out producer of other (independent) IMUL
 instruction that is in ready list too. The goal is try to schedule
 such a producer to get another IMUL in ready list and get scheduling
 of 2 successive IMUL instructions.



 Why does that not happen without your patch?  Does it never happen
 without
 your patch or does it merely not happen for one EEMBC benchmark (can
 you provide a testcase?)?



 It does not happen because the scheduler by itself does not do such
 specific
 reordering.  That said, it is easy to imagine the cases where this patch
 will make things worse rather than better.

 Igor, why not try different subtler mechanisms like adjust_priority,
 which
 is get called when an insn is added to the ready list?  E.g. increase the
 producer's priority.

 The patch as is misses checks for NONDEBUG_INSN_P.  Also, why bail out
 when
 you have more than one imul in the ready list?  Don't you want to bump
 the
 priority of the other imul found?


 Could you provide some examples when this patch would harm the
 performance?


 I thought of the cases when the other ready insns can fill up the hole and
 that would be more beneficial because e.g. they would be on more critical
 paths than the producer of your second imul.  I don't know enough of Atom to
 give an example -- maybe some long divisions?



 Sched_reorder was chosen since it is used in other ports and looks
 most suitable for such case, e.g. it provides access to the whole
 ready list.
 BTW, just increasing producer's priority seems to be more risky in
 performance sense - we can incorrectly start delaying some
 instructions.


 Yes, but exactly because of the above example you can start incorrectly
 delaying other insns, too, as you force the insn to be the first in the
 list.  While bumping priority still leaves the scheduler sorting heuristics
 in place and actually lowers that risk.


 Thought ready list doesn't contain DEBUG_INSN... Is it so? If it
 contains them - this could be added easily


 It does, but I'm not sure the sched_reorder hook gets them or they are
 immediately removed -- I saw similar checks in one of the targets' hooks.

 Done with DEBUG_INSN, also 1-imul limit was removed. Patch attached


 Anyways, my main thought was that it is better to test on more benchmarks to
 alleviate the above concerns, so as long as the i386 maintainers are happy,
 I don't see major problems here.  A good idea could be to generalize the
 patch to handle other long latency insns as second consumers, not only
 imuls, if this is relevant for Atom.

 Yes, generalization of this approach is in plans. According to Atom
 Software optimization guide there are several headrooms left here.
 As for trying on more benchmarks - the particular case is indeed quite
 rare. I attached the example
 where patch helps to group imuls in pairs which is profitable for
 Atom. Such and similar codes are not very common.
 But hopefully this approach could help avoid this and other glassjaws.

BTW, this patch also helps some EEMBC tests when -funroll-loops is specified.
So, any feedback from i386 maintainers about this? :)

Changelog slightly changed


2012-04-10  Yuri Rumyantsev  yuri.s.rumyant...@intel.com

* config/i386/i386.c (x86_sched_reorder): New function.
Added new function x86_sched_reorder
to take advantage of Atom pipelined IMUL execution.


Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-04-16 Thread Igor Zamyatin
On Fri, Apr 13, 2012 at 4:20 PM, Andrey Belevantsev a...@ispras.ru wrote:
 On 13.04.2012 14:18, Igor Zamyatin wrote:

 On Thu, Apr 12, 2012 at 5:01 PM, Andrey Belevantseva...@ispras.ru
  wrote:

 On 12.04.2012 16:38, Richard Guenther wrote:


 On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatinizamya...@gmail.com
  wrote:


 On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther
 richard.guent...@gmail.com    wrote:


 On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakovamona...@ispras.ru
  wrote:



 Can atom execute two IMUL in parallel?  Or what exactly is the
 pipeline
 behavior?



 As I understand from Intel's optimization reference manual, the
 behavior is as
 follows: if the instruction immediately following IMUL has shorter
 latency,
 execution is stalled for 4 cycles (which is IMUL's latency);
 otherwise,
 a
 4-or-more cycles latency instruction can be issued after IMUL without
 a
 stall.
 In other words, IMUL is pipelined with respect to other long-latency
 instructions, but not to short-latency instructions.



 It seems to be modeled in the pipeline description though:

 ;;; imul insn has 5 cycles latency
 (define_reservation atom-imul-32
                    atom-imul-1, atom-imul-2, atom-imul-3,
 atom-imul-4,
                     atom-port-0)

 ;;; imul instruction excludes other non-FP instructions.
 (exclusion_set atom-eu-0, atom-eu-1
               atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4)


 The main idea is quite simple:

 If we are going to schedule IMUL instruction (it is on the top of
 ready list) we try to find out producer of other (independent) IMUL
 instruction that is in ready list too. The goal is try to schedule
 such a producer to get another IMUL in ready list and get scheduling
 of 2 successive IMUL instructions.



 Why does that not happen without your patch?  Does it never happen
 without
 your patch or does it merely not happen for one EEMBC benchmark (can
 you provide a testcase?)?



 It does not happen because the scheduler by itself does not do such
 specific
 reordering.  That said, it is easy to imagine the cases where this patch
 will make things worse rather than better.

 Igor, why not try different subtler mechanisms like adjust_priority,
 which
 is get called when an insn is added to the ready list?  E.g. increase the
 producer's priority.

 The patch as is misses checks for NONDEBUG_INSN_P.  Also, why bail out
 when
 you have more than one imul in the ready list?  Don't you want to bump
 the
 priority of the other imul found?


 Could you provide some examples when this patch would harm the
 performance?


 I thought of the cases when the other ready insns can fill up the hole and
 that would be more beneficial because e.g. they would be on more critical
 paths than the producer of your second imul.  I don't know enough of Atom to
 give an example -- maybe some long divisions?



 Sched_reorder was chosen since it is used in other ports and looks
 most suitable for such case, e.g. it provides access to the whole
 ready list.
 BTW, just increasing producer's priority seems to be more risky in
 performance sense - we can incorrectly start delaying some
 instructions.


 Yes, but exactly because of the above example you can start incorrectly
 delaying other insns, too, as you force the insn to be the first in the
 list.  While bumping priority still leaves the scheduler sorting heuristics
 in place and actually lowers that risk.


 Thought ready list doesn't contain DEBUG_INSN... Is it so? If it
 contains them - this could be added easily


 It does, but I'm not sure the sched_reorder hook gets them or they are
 immediately removed -- I saw similar checks in one of the targets' hooks.

Done with DEBUG_INSN; also, the 1-imul limit was removed. Patch attached.


 Anyways, my main thought was that it is better to test on more benchmarks to
 alleviate the above concerns, so as long as the i386 maintainers are happy,
 I don't see major problems here.  A good idea could be to generalize the
 patch to handle other long latency insns as second consumers, not only
 imuls, if this is relevant for Atom.

Yes, generalizing this approach is in the plans. According to the Atom
software optimization guide there are several areas of headroom left here.
As for trying more benchmarks - the particular case is indeed quite
rare. I attached an example
where the patch helps to group imuls in pairs, which is profitable for
Atom. Such and similar code is not very common,
but hopefully this approach could help avoid this and other glass jaws.


 Andrey



 The case with more than one imul in the ready list wasn't considered
 just because the goal was to catch the particular case when there is a
 risk to get the following picture: imul-producer-imul which is less
 effective than producer-imul-imul for Atom


 Andrey



 And MD allows us only prefer scheduling of successive IMUL
 instructions,
 i.e.
 If IMUL was just scheduled and ready list contains another IMUL
 instruction then it will be chosen as the 

Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-04-13 Thread Andrey Belevantsev

On 12.04.2012 18:22, Richard Guenther wrote:

2012/4/12 Andrey Belevantseva...@ispras.ru:

On 12.04.2012 17:54, Richard Guenther wrote:


2012/4/12 Andrey Belevantseva...@ispras.ru:


On 12.04.2012 16:38, Richard Guenther wrote:



On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatinizamya...@gmail.com
  wrote:



On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther
richard.guent...@gmail.com  wrote:



On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakovamona...@ispras.ru
  wrote:





Can atom execute two IMUL in parallel?  Or what exactly is the
pipeline
behavior?




As I understand from Intel's optimization reference manual, the
behavior is as
follows: if the instruction immediately following IMUL has shorter
latency,
execution is stalled for 4 cycles (which is IMUL's latency);
otherwise,
a
4-or-more cycles latency instruction can be issued after IMUL without
a
stall.
In other words, IMUL is pipelined with respect to other long-latency
instructions, but not to short-latency instructions.




It seems to be modeled in the pipeline description though:

;;; imul insn has 5 cycles latency
(define_reservation atom-imul-32
atom-imul-1, atom-imul-2, atom-imul-3,
atom-imul-4,
 atom-port-0)

;;; imul instruction excludes other non-FP instructions.
(exclusion_set atom-eu-0, atom-eu-1
   atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4)



The main idea is quite simple:

If we are going to schedule IMUL instruction (it is on the top of
ready list) we try to find out producer of other (independent) IMUL
instruction that is in ready list too. The goal is try to schedule
such a producer to get another IMUL in ready list and get scheduling
of 2 successive IMUL instructions.




Why does that not happen without your patch?  Does it never happen
without
your patch or does it merely not happen for one EEMBC benchmark (can
you provide a testcase?)?




It does not happen because the scheduler by itself does not do such
specific
reordering.  That said, it is easy to imagine the cases where this patch
will make things worse rather than better.



That surprises me.  What is so specific about this reordering?



I mean that the scheduler does things like sort the ready list according to
a number of heuristics and to the insn priority, then choose the insn that
would allow the maximum number of ready insns to be issued on the current
cycle.  The heuristics are rather general.  The scheduler does not do things
like "if some random insn is ready, then choose some other random insn from
the ready list and schedule it" (which is what the patch does). This is what
scheduler hooks are for, to account for some target-specific heuristic.

The problem is that this particular implementation looks somewhat like
overkill and is also motivated by a single benchmark.  Testing on a wider set
of benchmarks and checking the compile-time hit would make the motivation more
convincing.


Yeah, and especially looking _why_ the generic heuristics are not working
and if they could be improved.  After all the issue seems to be properly
modeled in the DFA.


It is, but the DFA only models the actual stalls that you get when an imul 
is scheduled.  What you need here is to increase priority for ready insns 
that have imuls in their critical paths, and those imuls should be close 
enough to fill in the gap (actual consumers in the case of the patch).


The lookahead done in max_issue can help with this dynamic priority
change in general, but it considers just the ready insns, not their
immediate consumers.  You would need a deeper lookahead if you wanted to
do this in a target-independent way -- a no-no from a compile-time POV.  A
number of already existing hooks can help:


- sched_reorder used in this patch just sorts the ready list in any way the 
target wishes;


- adjust_priority/adjust_cost  helps to boost or to lower priority in such 
subtle cases;


- is_costly_dependence (now rs6000 only) can guide the early queue insn 
removal and its addition to the ready list;


- dfa_lookahead_guard (now ia64 only) can ban some insns from being issued 
on the current try.


Using sched_reorder is existing practice, like in s390, spu, mips, etc.
Usually, though, an implementation again only considers the actual ready
insns and not their successors.  I'd say that trying more tests would help,
or other hooks can be tried, too, but after all this is Atom-specific, so
it's the decision of the i386 maintainers.
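For reference, a rough sketch of the adjust_priority direction mentioned
above, assuming the usual i386.c context; insn_feeds_ready_imul_p is a
hypothetical helper (it would have to walk the forward dependencies), not an
existing GCC function:

static int
ix86_adjust_priority_sketch (rtx insn, int priority)
{
  if (!NONDEBUG_INSN_P (insn))
    return priority;

  /* Hypothetical helper: true if INSN is the producer of an independent
     IMUL that is not ready yet, while another IMUL is already in the
     ready list.  */
  if (insn_feeds_ready_imul_p (insn))
    priority++;

  return priority;
}

This keeps the scheduler's own sorting heuristics in charge and only nudges
the producer upwards, which is the lower-risk option discussed above.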


Andrey



Richard.


Andrey





Igor, why not try different subtler mechanisms like adjust_priority,
which
is get called when an insn is added to the ready list?  E.g. increase the
producer's priority.

The patch as is misses checks for NONDEBUG_INSN_P.  Also, why bail out
when
you have more than one imul in the ready list?  Don't you want to bump
the
priority of the other imul found?

Andrey





And MD allows us only prefer scheduling of successive IMUL
instructions,
i.e.
If IMUL was just scheduled and ready list 

Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-04-13 Thread Igor Zamyatin
On Thu, Apr 12, 2012 at 5:01 PM, Andrey Belevantsev a...@ispras.ru wrote:
 On 12.04.2012 16:38, Richard Guenther wrote:

 On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatinizamya...@gmail.com
  wrote:

 On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther
 richard.guent...@gmail.com  wrote:

 On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakovamona...@ispras.ru
  wrote:


 Can atom execute two IMUL in parallel?  Or what exactly is the
 pipeline
 behavior?


 As I understand from Intel's optimization reference manual, the
 behavior is as
 follows: if the instruction immediately following IMUL has shorter
 latency,
 execution is stalled for 4 cycles (which is IMUL's latency); otherwise,
 a
 4-or-more cycles latency instruction can be issued after IMUL without a
 stall.
 In other words, IMUL is pipelined with respect to other long-latency
 instructions, but not to short-latency instructions.


 It seems to be modeled in the pipeline description though:

 ;;; imul insn has 5 cycles latency
 (define_reservation atom-imul-32
                    atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4,
                     atom-port-0)

 ;;; imul instruction excludes other non-FP instructions.
 (exclusion_set atom-eu-0, atom-eu-1
               atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4)


 The main idea is quite simple:

 If we are going to schedule IMUL instruction (it is on the top of
 ready list) we try to find out producer of other (independent) IMUL
 instruction that is in ready list too. The goal is try to schedule
 such a producer to get another IMUL in ready list and get scheduling
 of 2 successive IMUL instructions.


 Why does that not happen without your patch?  Does it never happen without
 your patch or does it merely not happen for one EEMBC benchmark (can
 you provide a testcase?)?


 It does not happen because the scheduler by itself does not do such specific
 reordering.  That said, it is easy to imagine the cases where this patch
 will make things worse rather than better.

 Igor, why not try different subtler mechanisms like adjust_priority, which
 is get called when an insn is added to the ready list?  E.g. increase the
 producer's priority.

 The patch as is misses checks for NONDEBUG_INSN_P.  Also, why bail out when
 you have more than one imul in the ready list?  Don't you want to bump the
 priority of the other imul found?

Could you provide some examples where this patch would harm performance?

Sched_reorder was chosen since it is used in other ports and looks
most suitable for such a case, e.g. it provides access to the whole
ready list.
BTW, just increasing the producer's priority seems to be riskier
performance-wise - we can incorrectly start delaying some
instructions.

I thought the ready list doesn't contain DEBUG_INSNs... Is that so? If it
does contain them, this could be added easily.

The case with more than one imul in the ready list wasn't considered
simply because the goal was to catch the particular case where there is a
risk of getting the following picture: imul-producer-imul, which is less
effective than producer-imul-imul for Atom.


 Andrey



 And MD allows us only prefer scheduling of successive IMUL instructions,
 i.e.
 If IMUL was just scheduled and ready list contains another IMUL
 instruction then it will be chosen as the best candidate for
 scheduling.


 at least from my very limited guessing of what the above does.  So, did
 you
 analyze why the scheduler does not properly schedule things for you?

 Richard.


  From reading the patch, I could not understand the link between
 pipeline
 behavior and what the patch appears to do.

 Hope that helps.

 Alexander



Thanks,
Igor


Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-04-13 Thread Igor Zamyatin
On Fri, Apr 13, 2012 at 2:18 PM, Igor Zamyatin izamya...@gmail.com wrote:
 On Thu, Apr 12, 2012 at 5:01 PM, Andrey Belevantsev a...@ispras.ru wrote:
 On 12.04.2012 16:38, Richard Guenther wrote:

 On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatinizamya...@gmail.com
  wrote:

 On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther
 richard.guent...@gmail.com  wrote:

 On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakovamona...@ispras.ru
  wrote:


 Can atom execute two IMUL in parallel?  Or what exactly is the
 pipeline
 behavior?


 As I understand from Intel's optimization reference manual, the
 behavior is as
 follows: if the instruction immediately following IMUL has shorter
 latency,
 execution is stalled for 4 cycles (which is IMUL's latency); otherwise,
 a
 4-or-more cycles latency instruction can be issued after IMUL without a
 stall.
 In other words, IMUL is pipelined with respect to other long-latency
 instructions, but not to short-latency instructions.


 It seems to be modeled in the pipeline description though:

 ;;; imul insn has 5 cycles latency
 (define_reservation atom-imul-32
                    atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4,
                     atom-port-0)

 ;;; imul instruction excludes other non-FP instructions.
 (exclusion_set atom-eu-0, atom-eu-1
               atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4)


 The main idea is quite simple:

 If we are going to schedule IMUL instruction (it is on the top of
 ready list) we try to find out producer of other (independent) IMUL
 instruction that is in ready list too. The goal is try to schedule
 such a producer to get another IMUL in ready list and get scheduling
 of 2 successive IMUL instructions.


 Why does that not happen without your patch?  Does it never happen without
 your patch or does it merely not happen for one EEMBC benchmark (can
 you provide a testcase?)?


 It does not happen because the scheduler by itself does not do such specific
 reordering.  That said, it is easy to imagine the cases where this patch
 will make things worse rather than better.

 Igor, why not try different subtler mechanisms like adjust_priority, which
 is get called when an insn is added to the ready list?  E.g. increase the
 producer's priority.

 The patch as is misses checks for NONDEBUG_INSN_P.  Also, why bail out when
 you have more than one imul in the ready list?  Don't you want to bump the
 priority of the other imul found?

 Could you provide some examples when this patch would harm the performance?

 Sched_reorder was chosen since it is used in other ports and looks
 most suitable for such case, e.g. it provides access to the whole
 ready list.
 BTW, just increasing producer's priority seems to be more risky in
 performance sense - we can incorrectly start delaying some
 instructions.

 Thought ready list doesn't contain DEBUG_INSN... Is it so? If it
 contains them - this could be added easily

 The case with more than one imul in the ready list wasn't considered
 just because the goal was to catch the particular case when there is a
 risk to get the following picture: imul-producer-imul which is less
 effective than producer-imul-imul for Atom


Actually, this constraint of one imul could be omitted. I'm going to
try this change.



 Andrey



 And MD allows us only prefer scheduling of successive IMUL instructions,
 i.e.
 If IMUL was just scheduled and ready list contains another IMUL
 instruction then it will be chosen as the best candidate for
 scheduling.


 at least from my very limited guessing of what the above does.  So, did
 you
 analyze why the scheduler does not properly schedule things for you?

 Richard.


  From reading the patch, I could not understand the link between
 pipeline
 behavior and what the patch appears to do.

 Hope that helps.

 Alexander



 Thanks,
 Igor


Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-04-13 Thread Andrey Belevantsev

On 13.04.2012 14:18, Igor Zamyatin wrote:

On Thu, Apr 12, 2012 at 5:01 PM, Andrey Belevantseva...@ispras.ru  wrote:

On 12.04.2012 16:38, Richard Guenther wrote:


On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatinizamya...@gmail.com
  wrote:


On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther
richard.guent...@gmail.comwrote:


On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakovamona...@ispras.ru
  wrote:




Can atom execute two IMUL in parallel?  Or what exactly is the
pipeline
behavior?



As I understand from Intel's optimization reference manual, the
behavior is as
follows: if the instruction immediately following IMUL has shorter
latency,
execution is stalled for 4 cycles (which is IMUL's latency); otherwise,
a
4-or-more cycles latency instruction can be issued after IMUL without a
stall.
In other words, IMUL is pipelined with respect to other long-latency
instructions, but not to short-latency instructions.



It seems to be modeled in the pipeline description though:

;;; imul insn has 5 cycles latency
(define_reservation atom-imul-32
atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4,
 atom-port-0)

;;; imul instruction excludes other non-FP instructions.
(exclusion_set atom-eu-0, atom-eu-1
   atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4)



The main idea is quite simple:

If we are going to schedule IMUL instruction (it is on the top of
ready list) we try to find out producer of other (independent) IMUL
instruction that is in ready list too. The goal is try to schedule
such a producer to get another IMUL in ready list and get scheduling
of 2 successive IMUL instructions.



Why does that not happen without your patch?  Does it never happen without
your patch or does it merely not happen for one EEMBC benchmark (can
you provide a testcase?)?



It does not happen because the scheduler by itself does not do such specific
reordering.  That said, it is easy to imagine the cases where this patch
will make things worse rather than better.

Igor, why not try different subtler mechanisms like adjust_priority, which
is get called when an insn is added to the ready list?  E.g. increase the
producer's priority.

The patch as is misses checks for NONDEBUG_INSN_P.  Also, why bail out when
you have more than one imul in the ready list?  Don't you want to bump the
priority of the other imul found?


Could you provide some examples when this patch would harm the performance?


I thought of the cases when the other ready insns can fill up the hole and 
that would be more beneficial because e.g. they would be on more critical 
paths than the producer of your second imul.  I don't know enough of Atom 
to give an example -- maybe some long divisions?




Sched_reorder was chosen since it is used in other ports and looks
most suitable for such case, e.g. it provides access to the whole
ready list.
BTW, just increasing producer's priority seems to be more risky in
performance sense - we can incorrectly start delaying some
instructions.


Yes, but exactly because of the above example you can start incorrectly
delaying other insns, too, as you force the insn to be the first in the
list, while bumping the priority still leaves the scheduler's sorting
heuristics in place and actually lowers that risk.




Thought ready list doesn't contain DEBUG_INSN... Is it so? If it
contains them - this could be added easily


It does, but I'm not sure the sched_reorder hook gets them or they are 
immediately removed -- I saw similar checks in one of the targets' hooks.


Anyway, my main thought was that it is better to test on more benchmarks
to alleviate the above concerns, so as long as the i386 maintainers are
happy, I don't see major problems here.  A good idea could be to generalize
the patch to handle other long-latency insns as second consumers, not only
imuls, if this is relevant for Atom.


Andrey



The case with more than one imul in the ready list wasn't considered
just because the goal was to catch the particular case when there is a
risk to get the following picture: imul-producer-imul which is less
effective than producer-imul-imul for Atom



Andrey





And MD allows us only prefer scheduling of successive IMUL instructions,
i.e.
If IMUL was just scheduled and ready list contains another IMUL
instruction then it will be chosen as the best candidate for
scheduling.



at least from my very limited guessing of what the above does.  So, did
you
analyze why the scheduler does not properly schedule things for you?

Richard.



  From reading the patch, I could not understand the link between
pipeline
behavior and what the patch appears to do.

Hope that helps.

Alexander





Thanks,
Igor




Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-04-12 Thread Igor Zamyatin
On Wed, Apr 11, 2012 at 5:38 PM, Andi Kleen a...@firstfloor.org wrote:
 Igor Zamyatin izamya...@gmail.com writes:

 Hi All!

 It is known that imul placement is rather critical for Atom processors
 and the changes try to improve imul scheduling for Atom.

 This gives +5% performance on several tests from new OA 2.0 testsuite
 from EEMBC.

 Tested for i386 and x86-64, ok for trunk?

 Did you measure how much this slows down the compiler when compiling
 for Atom? The new pass looks rather slow.

No, since this pass is applied rarely. The conditions are mentioned in
the comments to ix86_sched_reorder.


 -Andi

 --
 a...@linux.intel.com -- Speaking for myself only


Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-04-12 Thread Igor Zamyatin
On Wed, Apr 11, 2012 at 6:10 PM, Richard Guenther
richard.guent...@gmail.com wrote:
 On Wed, Apr 11, 2012 at 3:38 PM, Andi Kleen a...@firstfloor.org wrote:
 Igor Zamyatin izamya...@gmail.com writes:

 Hi All!

 It is known that imul placement is rather critical for Atom processors
 and the changes try to improve imul scheduling for Atom.

 This gives +5% performance on several tests from new OA 2.0 testsuite
 from EEMBC.

 Tested for i386 and x86-64, ok for trunk?

 Did you measure how much this slows down the compiler when compiling
 for Atom? The new pass looks rather slow.

 Also please explain why adjusting the automaton for Atom is not a way to
 attack this issue.

If I understand the question correctly - it's a dynamic check, and it's
not clear how to describe this adjustment statically in the machine
description.


 Richard.

 -Andi

 --
 a...@linux.intel.com -- Speaking for myself only


Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-04-12 Thread Richard Guenther
On Thu, Apr 12, 2012 at 12:20 PM, Igor Zamyatin izamya...@gmail.com wrote:
 On Wed, Apr 11, 2012 at 6:10 PM, Richard Guenther
 richard.guent...@gmail.com wrote:
 On Wed, Apr 11, 2012 at 3:38 PM, Andi Kleen a...@firstfloor.org wrote:
 Igor Zamyatin izamya...@gmail.com writes:

 Hi All!

 It is known that imul placement is rather critical for Atom processors
 and the changes try to improve imul scheduling for Atom.

 This gives +5% performance on several tests from new OA 2.0 testsuite
 from EEMBC.

 Tested for i386 and x86-64, ok for trunk?

 Did you measure how much this slows down the compiler when compiling
 for Atom? The new pass looks rather slow.

 Also please explain why adjusting the automaton for Atom is not a way to
 attack this issue.

 If I understand the question correctly - it's a dynamic check and it's
 not clear how to describe this adjusting statically in machine
 description

From reading the code (the comments are not clear to me) you seem to
produce two adjacent IMUL instructions from within the ready list - but
first check that there is only a single one.  So I fail to see how it can work.
Can atom execute two IMUL in parallel?  Or what exactly is the pipeline
behavior?

You are missing a testcase that would show the effect of your patch.

Richard.


 Richard.

 -Andi

 --
 a...@linux.intel.com -- Speaking for myself only


Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-04-12 Thread Alexander Monakov

 Can atom execute two IMUL in parallel?  Or what exactly is the pipeline
 behavior?

As I understand from Intel's optimization reference manual, the behavior is as
follows: if the instruction immediately following IMUL has shorter latency,
execution is stalled for 4 cycles (which is IMUL's latency); otherwise, a
4-or-more cycles latency instruction can be issued after IMUL without a stall.
In other words, IMUL is pipelined with respect to other long-latency
instructions, but not to short-latency instructions.
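To illustrate with a hypothetical fragment (not the testcase attached later
in the thread):

int
f (int a, int b, int c, int d)
{
  int x = a * b;   /* imul #1 */
  int y = c + d;   /* short-latency producer of the second imul */
  int z = y * a;   /* imul #2, depends on y */

  /* Scheduling imul #1; add; imul #2 stalls right after imul #1, because
     the add has shorter latency.  Scheduling add; imul #1; imul #2 lets
     imul #2 issue in the shadow of imul #1.  */
  return x + z;
}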

From reading the patch, I could not understand the link between pipeline
behavior and what the patch appears to do.

Hope that helps.

Alexander


Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-04-12 Thread Richard Guenther
On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakov amona...@ispras.ru wrote:

 Can atom execute two IMUL in parallel?  Or what exactly is the pipeline
 behavior?

 As I understand from Intel's optimization reference manual, the behavior is as
 follows: if the instruction immediately following IMUL has shorter latency,
 execution is stalled for 4 cycles (which is IMUL's latency); otherwise, a
 4-or-more cycles latency instruction can be issued after IMUL without a stall.
 In other words, IMUL is pipelined with respect to other long-latency
 instructions, but not to short-latency instructions.

It seems to be modeled in the pipeline description though:

;;; imul insn has 5 cycles latency
(define_reservation "atom-imul-32"
                    "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4,
                     atom-port-0")

;;; imul instruction excludes other non-FP instructions.
(exclusion_set "atom-eu-0, atom-eu-1"
               "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4")

at least from my very limited guessing of what the above does.  So, did you
analyze why the scheduler does not properly schedule things for you?

Richard.


 From reading the patch, I could not understand the link between pipeline
 behavior and what the patch appears to do.

 Hope that helps.

 Alexander


Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-04-12 Thread Igor Zamyatin
On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther
richard.guent...@gmail.com wrote:
 On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakov amona...@ispras.ru wrote:

 Can atom execute two IMUL in parallel?  Or what exactly is the pipeline
 behavior?

 As I understand from Intel's optimization reference manual, the behavior is 
 as
 follows: if the instruction immediately following IMUL has shorter latency,
 execution is stalled for 4 cycles (which is IMUL's latency); otherwise, a
 4-or-more cycles latency instruction can be issued after IMUL without a 
 stall.
 In other words, IMUL is pipelined with respect to other long-latency
 instructions, but not to short-latency instructions.

 It seems to be modeled in the pipeline description though:

 ;;; imul insn has 5 cycles latency
 (define_reservation atom-imul-32
                    atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4,
                     atom-port-0)

 ;;; imul instruction excludes other non-FP instructions.
 (exclusion_set atom-eu-0, atom-eu-1
               atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4)


The main idea is quite simple:

If we are going to schedule an IMUL instruction (it is on the top of
the ready list), we try to find, also in the ready list, the producer of
another (independent) IMUL instruction. The goal is to schedule such a
producer so that the other IMUL gets into the ready list and we end up
scheduling 2 successive IMUL instructions.

And the MD only allows us to prefer scheduling of successive IMUL
instructions, i.e. if an IMUL was just scheduled and the ready list
contains another IMUL instruction, then it will be chosen as the best
candidate for scheduling.
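A minimal sketch of that idea as a sched_reorder hook is below; this is not
the submitted patch, ready_imul_p and imul_producer_p are hypothetical
helpers, and the usual i386.c/sched-int.h context is assumed:

static int
ix86_sched_reorder_sketch (FILE *dump ATTRIBUTE_UNUSED,
                           int sched_verbose ATTRIBUTE_UNUSED,
                           rtx *ready, int *pn_ready,
                           int clock_var ATTRIBUTE_UNUSED)
{
  int n_ready = *pn_ready;
  int i;

  /* In the haifa scheduler the next insn to issue is the last element
     of the ready array.  Do nothing unless an IMUL is on top.  */
  if (n_ready < 2 || !ready_imul_p (ready[n_ready - 1]))
    return ix86_issue_rate ();

  /* Look for a ready insn that is the producer of another, not yet
     ready IMUL and move it to the top, so that IMUL becomes ready and
     the two multiplies end up back to back.  */
  for (i = n_ready - 2; i >= 0; i--)
    if (NONDEBUG_INSN_P (ready[i]) && imul_producer_p (ready[i]))
      {
        rtx producer = ready[i];
        memmove (&ready[i], &ready[i + 1],
                 (n_ready - 1 - i) * sizeof (rtx));
        ready[n_ready - 1] = producer;
        break;
      }

  return ix86_issue_rate ();
}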


 at least from my very limited guessing of what the above does.  So, did you
 analyze why the scheduler does not properly schedule things for you?

 Richard.


 From reading the patch, I could not understand the link between pipeline
 behavior and what the patch appears to do.

 Hope that helps.

 Alexander


Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-04-12 Thread Richard Guenther
On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatin izamya...@gmail.com wrote:
 On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther
 richard.guent...@gmail.com wrote:
 On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakov amona...@ispras.ru 
 wrote:

 Can atom execute two IMUL in parallel?  Or what exactly is the pipeline
 behavior?

 As I understand from Intel's optimization reference manual, the behavior is 
 as
 follows: if the instruction immediately following IMUL has shorter latency,
 execution is stalled for 4 cycles (which is IMUL's latency); otherwise, a
 4-or-more cycles latency instruction can be issued after IMUL without a 
 stall.
 In other words, IMUL is pipelined with respect to other long-latency
 instructions, but not to short-latency instructions.

 It seems to be modeled in the pipeline description though:

 ;;; imul insn has 5 cycles latency
 (define_reservation atom-imul-32
                    atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4,
                     atom-port-0)

 ;;; imul instruction excludes other non-FP instructions.
 (exclusion_set atom-eu-0, atom-eu-1
               atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4)


 The main idea is quite simple:

 If we are going to schedule IMUL instruction (it is on the top of
 ready list) we try to find out producer of other (independent) IMUL
 instruction that is in ready list too. The goal is try to schedule
 such a producer to get another IMUL in ready list and get scheduling
 of 2 successive IMUL instructions.

Why does that not happen without your patch?  Does it never happen without
your patch or does it merely not happen for one EEMBC benchmark (can
you provide a testcase?)?

 And MD allows us only prefer scheduling of successive IMUL instructions, i.e.
 If IMUL was just scheduled and ready list contains another IMUL
 instruction then it will be chosen as the best candidate for
 scheduling.


 at least from my very limited guessing of what the above does.  So, did you
 analyze why the scheduler does not properly schedule things for you?

 Richard.


 From reading the patch, I could not understand the link between pipeline
 behavior and what the patch appears to do.

 Hope that helps.

 Alexander


Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-04-12 Thread Andrey Belevantsev

On 12.04.2012 16:38, Richard Guenther wrote:

On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatinizamya...@gmail.com  wrote:

On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther
richard.guent...@gmail.com  wrote:

On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakovamona...@ispras.ru  wrote:



Can atom execute two IMUL in parallel?  Or what exactly is the pipeline
behavior?


As I understand from Intel's optimization reference manual, the behavior is as
follows: if the instruction immediately following IMUL has shorter latency,
execution is stalled for 4 cycles (which is IMUL's latency); otherwise, a
4-or-more cycles latency instruction can be issued after IMUL without a stall.
In other words, IMUL is pipelined with respect to other long-latency
instructions, but not to short-latency instructions.


It seems to be modeled in the pipeline description though:

;;; imul insn has 5 cycles latency
(define_reservation atom-imul-32
atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4,
 atom-port-0)

;;; imul instruction excludes other non-FP instructions.
(exclusion_set atom-eu-0, atom-eu-1
   atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4)



The main idea is quite simple:

If we are going to schedule IMUL instruction (it is on the top of
ready list) we try to find out producer of other (independent) IMUL
instruction that is in ready list too. The goal is try to schedule
such a producer to get another IMUL in ready list and get scheduling
of 2 successive IMUL instructions.


Why does that not happen without your patch?  Does it never happen without
your patch or does it merely not happen for one EEMBC benchmark (can
you provide a testcase?)?


It does not happen because the scheduler by itself does not do such 
specific reordering.  That said, it is easy to imagine the cases where this 
patch will make things worse rather than better.


Igor, why not try a different, subtler mechanism like adjust_priority, which
gets called when an insn is added to the ready list?  E.g. increase the
producer's priority.


The patch as is misses checks for NONDEBUG_INSN_P.  Also, why bail out when 
you have more than one imul in the ready list?  Don't you want to bump the 
priority of the other imul found?


Andrey




And MD allows us only prefer scheduling of successive IMUL instructions, i.e.
If IMUL was just scheduled and ready list contains another IMUL
instruction then it will be chosen as the best candidate for
scheduling.



at least from my very limited guessing of what the above does.  So, did you
analyze why the scheduler does not properly schedule things for you?

Richard.



 From reading the patch, I could not understand the link between pipeline
behavior and what the patch appears to do.

Hope that helps.

Alexander




Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-04-12 Thread Richard Guenther
2012/4/12 Andrey Belevantsev a...@ispras.ru:
 On 12.04.2012 16:38, Richard Guenther wrote:

 On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatinizamya...@gmail.com
  wrote:

 On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther
 richard.guent...@gmail.com  wrote:

 On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakovamona...@ispras.ru
  wrote:


 Can atom execute two IMUL in parallel?  Or what exactly is the
 pipeline
 behavior?


 As I understand from Intel's optimization reference manual, the
 behavior is as
 follows: if the instruction immediately following IMUL has shorter
 latency,
 execution is stalled for 4 cycles (which is IMUL's latency); otherwise,
 a
 4-or-more cycles latency instruction can be issued after IMUL without a
 stall.
 In other words, IMUL is pipelined with respect to other long-latency
 instructions, but not to short-latency instructions.


 It seems to be modeled in the pipeline description though:

 ;;; imul insn has 5 cycles latency
 (define_reservation "atom-imul-32"
                     "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4,
                      atom-port-0")

 ;;; imul instruction excludes other non-FP instructions.
 (exclusion_set "atom-eu-0, atom-eu-1"
                "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4")


 The main idea is quite simple:

 If we are going to schedule an IMUL instruction (it is at the top of the
 ready list), we try to find, in the ready list, the producer of another
 (independent) IMUL instruction.  The goal is to schedule such a producer so
 that another IMUL gets into the ready list and two successive IMUL
 instructions get scheduled.


 Why does that not happen without your patch?  Does it never happen without
 your patch or does it merely not happen for one EEMBC benchmark (can
 you provide a testcase?)?


 It does not happen because the scheduler by itself does not do such specific
 reordering.  That said, it is easy to imagine the cases where this patch
 will make things worse rather than better.

That surprises me.  What is so specific about this reordering?

 Igor, why not try different, subtler mechanisms like adjust_priority, which
 gets called when an insn is added to the ready list?  E.g. increase the
 producer's priority.

 The patch as is misses checks for NONDEBUG_INSN_P.  Also, why bail out when
 you have more than one imul in the ready list?  Don't you want to bump the
 priority of the other imul found?

 Andrey



 And the MD allows us only to prefer scheduling of successive IMUL
 instructions, i.e. if an IMUL was just scheduled and the ready list contains
 another IMUL instruction, then it will be chosen as the best candidate for
 scheduling.


 at least from my very limited guessing of what the above does.  So, did
 you
 analyze why the scheduler does not properly schedule things for you?

 Richard.


  From reading the patch, I could not understand the link between
 pipeline
 behavior and what the patch appears to do.

 Hope that helps.

 Alexander




Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-04-12 Thread Andrey Belevantsev

On 12.04.2012 17:54, Richard Guenther wrote:

2012/4/12 Andrey Belevantsev a...@ispras.ru:

On 12.04.2012 16:38, Richard Guenther wrote:


On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatin izamya...@gmail.com
  wrote:


On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther
richard.guent...@gmail.com wrote:


On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakov amona...@ispras.ru
  wrote:




Can Atom execute two IMULs in parallel?  Or what exactly is the pipeline
behavior?



As I understand from Intel's optimization reference manual, the
behavior is as
follows: if the instruction immediately following IMUL has shorter
latency,
execution is stalled for 4 cycles (which is IMUL's latency); otherwise,
a
4-or-more cycles latency instruction can be issued after IMUL without a
stall.
In other words, IMUL is pipelined with respect to other long-latency
instructions, but not to short-latency instructions.



It seems to be modeled in the pipeline description though:

;;; imul insn has 5 cycles latency
(define_reservation "atom-imul-32"
                    "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4,
                     atom-port-0")

;;; imul instruction excludes other non-FP instructions.
(exclusion_set "atom-eu-0, atom-eu-1"
               "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4")



The main idea is quite simple:

If we are going to schedule an IMUL instruction (it is at the top of the
ready list), we try to find, in the ready list, the producer of another
(independent) IMUL instruction.  The goal is to schedule such a producer so
that another IMUL gets into the ready list and two successive IMUL
instructions get scheduled.



Why does that not happen without your patch?  Does it never happen without
your patch or does it merely not happen for one EEMBC benchmark (can
you provide a testcase?)?



It does not happen because the scheduler by itself does not do such specific
reordering.  That said, it is easy to imagine the cases where this patch
will make things worse rather than better.


That surprises me.  What is so specific about this reordering?


I mean that the scheduler does things like sorting the ready list according 
to a number of heuristics and to the insn priority, then choosing the insn 
that would allow the maximum number of ready insns to be issued on the 
current cycle.  The heuristics are rather general.  The scheduler does not 
do things like "if some random insn is ready, then choose some other random 
insn from the ready list and schedule it" (which is what the patch does). 
This is what scheduler hooks are for: to account for some target-specific 
heuristic.
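
For concreteness, the general shape of such a hook (TARGET_SCHED_REORDER) -- 
again a rough, untested sketch of the approach under discussion rather than 
the submitted patch; the function names are made up and the usual i386.c and 
sched-int.h context is assumed:

/* Return true if INSN sets the result of an SImode multiply.  */
static bool
insn_is_simode_mult (rtx insn)
{
  rtx set;

  if (!NONDEBUG_INSN_P (insn))
    return false;
  set = single_set (insn);
  return (set != NULL_RTX
          && GET_CODE (SET_SRC (set)) == MULT
          && GET_MODE (SET_SRC (set)) == SImode);
}

/* If the insn about to be issued (the last ready[] element) is an
   SImode IMUL, move a ready producer of another IMUL to the top of
   the list, so that a second IMUL becomes ready right after the
   first one is scheduled.  */
static int
atom_sched_reorder_sketch (FILE *dump ATTRIBUTE_UNUSED,
                           int verbose ATTRIBUTE_UNUSED,
                           rtx *ready, int *pn_ready,
                           int clock ATTRIBUTE_UNUSED)
{
  int n_ready = *pn_ready;
  int i;

  if (n_ready < 2 || !insn_is_simode_mult (ready[n_ready - 1]))
    return ix86_issue_rate ();

  for (i = n_ready - 2; i >= 0; i--)
    {
      sd_iterator_def sd_it;
      dep_t dep;

      if (!NONDEBUG_INSN_P (ready[i]))
        continue;

      /* Does ready[i] feed another SImode multiply?  */
      FOR_EACH_DEP (ready[i], SD_LIST_FORW, sd_it, dep)
        if (insn_is_simode_mult (DEP_CON (dep)))
          {
            /* Promote the producer to the top of the ready list.  */
            rtx tmp = ready[n_ready - 1];
            ready[n_ready - 1] = ready[i];
            ready[i] = tmp;
            return ix86_issue_rate ();
          }
    }

  return ix86_issue_rate ();
}

Even in this form the target-specific decision lives in the hook, not in the 
generic ready-list sorting.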


The problem is that this particular implementation looks somewhat like 
overkill and is also motivated by a single benchmark.  Testing on a wider 
set of benchmarks and checking the compile-time hit would make the 
motivation more convincing.


Andrey




Igor, why not try different, subtler mechanisms like adjust_priority, which
gets called when an insn is added to the ready list?  E.g. increase the
producer's priority.

The patch as is misses checks for NONDEBUG_INSN_P.  Also, why bail out when
you have more than one imul in the ready list?  Don't you want to bump the
priority of the other imul found?

Andrey





And the MD allows us only to prefer scheduling of successive IMUL
instructions, i.e. if an IMUL was just scheduled and the ready list contains
another IMUL instruction, then it will be chosen as the best candidate for
scheduling.



at least from my very limited guessing of what the above does.  So, did
you
analyze why the scheduler does not properly schedule things for you?

Richard.



  From reading the patch, I could not understand the link between
pipeline
behavior and what the patch appears to do.

Hope that helps.

Alexander







Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-04-12 Thread Richard Guenther
2012/4/12 Andrey Belevantsev a...@ispras.ru:
 On 12.04.2012 17:54, Richard Guenther wrote:

 2012/4/12 Andrey Belevantsev a...@ispras.ru:

 On 12.04.2012 16:38, Richard Guenther wrote:


 On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatin izamya...@gmail.com
  wrote:


 On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther
 richard.guent...@gmail.com    wrote:


 On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakov amona...@ispras.ru
  wrote:



 Can Atom execute two IMULs in parallel?  Or what exactly is the pipeline
 behavior?



 As I understand from Intel's optimization reference manual, the
 behavior is as
 follows: if the instruction immediately following IMUL has shorter
 latency,
 execution is stalled for 4 cycles (which is IMUL's latency);
 otherwise,
 a
 4-or-more cycles latency instruction can be issued after IMUL without
 a
 stall.
 In other words, IMUL is pipelined with respect to other long-latency
 instructions, but not to short-latency instructions.



 It seems to be modeled in the pipeline description though:

 ;;; imul insn has 5 cycles latency
 (define_reservation "atom-imul-32"
                     "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4,
                      atom-port-0")

 ;;; imul instruction excludes other non-FP instructions.
 (exclusion_set "atom-eu-0, atom-eu-1"
                "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4")


 The main idea is quite simple:

 If we are going to schedule an IMUL instruction (it is at the top of the
 ready list), we try to find, in the ready list, the producer of another
 (independent) IMUL instruction.  The goal is to schedule such a producer so
 that another IMUL gets into the ready list and two successive IMUL
 instructions get scheduled.



 Why does that not happen without your patch?  Does it never happen
 without
 your patch or does it merely not happen for one EEMBC benchmark (can
 you provide a testcase?)?



 It does not happen because the scheduler by itself does not do such
 specific
 reordering.  That said, it is easy to imagine the cases where this patch
 will make things worse rather than better.


 That surprises me.  What is so specific about this reordering?


 I mean that the scheduler does things like sorting the ready list according
 to a number of heuristics and to the insn priority, then choosing the insn
 that would allow the maximum number of ready insns to be issued on the
 current cycle.  The heuristics are rather general.  The scheduler does not
 do things like "if some random insn is ready, then choose some other random
 insn from the ready list and schedule it" (which is what the patch does).
 This is what scheduler hooks are for: to account for some target-specific
 heuristic.

 The problem is that this particular implementation looks somewhat like
 overkill and is also motivated by a single benchmark.  Testing on a wider
 set of benchmarks and checking the compile-time hit would make the
 motivation more convincing.

Yeah, and especially looking at _why_ the generic heuristics are not working
and whether they could be improved.  After all, the issue seems to be
properly modeled in the DFA.

Richard.

 Andrey



 Igor, why not try different, subtler mechanisms like adjust_priority,
 which gets called when an insn is added to the ready list?  E.g. increase
 the producer's priority.

 The patch as is misses checks for NONDEBUG_INSN_P.  Also, why bail out
 when
 you have more than one imul in the ready list?  Don't you want to bump
 the
 priority of the other imul found?

 Andrey



 And the MD allows us only to prefer scheduling of successive IMUL
 instructions, i.e. if an IMUL was just scheduled and the ready list contains
 another IMUL instruction, then it will be chosen as the best candidate for
 scheduling.


 at least from my very limited guessing of what the above does.  So,
 did
 you
 analyze why the scheduler does not properly schedule things for you?

 Richard.


  From reading the patch, I could not understand the link between
 pipeline
 behavior and what the patch appears to do.

 Hope that helps.

 Alexander






Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-04-11 Thread Andi Kleen
Igor Zamyatin izamya...@gmail.com writes:

 Hi All!

 It is known that imul placement is rather critical for Atom processors,
 and these changes try to improve imul scheduling for Atom.

 This gives +5% performance on several tests from the new OA 2.0 testsuite
 from EEMBC.

 Tested for i386 and x86-64, ok for trunk?

Did you measure how much this slows down the compiler when compiling
for Atom? The new pass looks rather slow.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-04-11 Thread Richard Guenther
On Wed, Apr 11, 2012 at 3:38 PM, Andi Kleen a...@firstfloor.org wrote:
 Igor Zamyatin izamya...@gmail.com writes:

 Hi All!

 It is known that imul placement is rather critical for Atom processors,
 and these changes try to improve imul scheduling for Atom.

 This gives +5% performance on several tests from the new OA 2.0 testsuite
 from EEMBC.

 Tested for i386 and x86-64, ok for trunk?

 Did you measure how much this slows down the compiler when compiling
 for Atom? The new pass looks rather slow.

Also please explain why adjusting the automaton for Atom is not a way to
attack this issue.

Richard.

 -Andi

 --
 a...@linux.intel.com -- Speaking for myself only