Re: RFC reminder: an alternative -fsched-pressure implementation
Ulrich Weigand uweig...@de.ibm.com writes: Richard Sandiford wrote: Vladimir Makarov vmaka...@redhat.com writes: Taking your results for S390 and ARM with Neon into account, I guess it should be included and probably made by default for these 2 targets (for sure for s390). OK, thanks to both of you. Ulrich and Andreas: would you be happy for s390 to use this by default? I'll update the patch and install if so. I've talked to Andreas, and we agree that s390 should use this as default. If you install the base patch, we'll do the back-end change accordingly. (I'll also work with Ramana to enable it on ARM where it makes sense, probably when targeting Cortex-A cores.) OK, installed after bootstrapping regression-testing on x86_64-linux-gnu, both as normal and with: (a) the i386 -fschedule-insns override disabled (b) -fsched-pressure on by default and (c) -fsched-pressure-algorithm=model by default The results from the first run were the same as the baseline ones. The second run had an extra spill failure in a guality test for -m32, but I think that sort of thing is expected. It also had a debug quality regression in pr43479.c. We had a chain of debug_insns that were logically independent but which (regardless of the patch) were nevertheless constrained to be in sequence. The patch moved some later non-debug instructions ahead of where the first debug instruction could go. It seemed that the RA was such that the values used by the later debug instructions were no longer valid. Richard
Re: RFC reminder: an alternative -fsched-pressure implementation
Richard Sandiford wrote: Vladimir Makarov vmaka...@redhat.com writes: Taking your results for S390 and ARM with Neon into account, I guess it should be included and probably made by default for these 2 targets (for sure for s390). OK, thanks to both of you. Ulrich and Andreas: would you be happy for s390 to use this by default? I'll update the patch and install if so. I've talked to Andreas, and we agree that s390 should use this as default. If you install the base patch, we'll do the back-end change accordingly. (I'll also work with Ramana to enable it on ARM where it makes sense, probably when targeting Cortex-A cores.) Thanks again to you and Vlad for looking into this long-standing problem! Bye, Ulrich -- Dr. Ulrich Weigand GNU Toolchain for Linux on System z and Cell BE ulrich.weig...@de.ibm.com
Re: RFC reminder: an alternative -fsched-pressure implementation
Vladimir Makarov vmaka...@redhat.com writes: On 04/23/2012 11:42 AM, Ulrich Weigand wrote: Vladimir Makarov wrote: I have a mixed feeling with the patch. I've tried it on SPEC2000 on x86/x86-64 and ARM. Model algorithm generates bigger code up to 3.5% (SPECFP on x86), 2% (SPECFP on 86-64), and 0.23% (SPECFP on ARM) in comparison with the current algorithm. It is slower too. Although the difference is quite insignificant on Corei7, compiler speed slowdown achieves 0.4% on SPECFP2000 on arm. The algorithm also generates slower code on x86 (1.5% on SPECINT and 5% on SPECFP200) and practically the same average code on x86-64 and ARM (I've tried only SPECINT on ARM). Was this using NEON on ARM? The biggest benefits we saw involved vector code ... Sorry, No. On 04/17/2012 04:29 AM, Richard Sandiford wrote: Anyway, given those results and your mixed feelings, I think it would be best to drop the patch. It's a lot of code to carry around, and its ad-hoc nature would make it hard to change in future. There must be better ways of achieving the same thing. It is up to you, Richard. I don't mind if you commit it into the trunk. There is something in your approach too. If it would be possible to get the best of the two methods, we could see a big improvement. On s390, the patch is pretty much a straight improvement across the line: 0.5% on SPECint, 1.7% on SPECfp overall, with the best single improvement in GemsFDTD (9.3%), and no regression beyond usual noise levels. That is a really big improvement. Given that, and the impressive improvements we saw on some ARM benchmarks involving vector code (up to 70%), it would really be a pity to see the patch going nowhere ... Is there anything we can do to help make the patch more acceptable? Any suggestions welcome ... I did not object the patch. As I wrote Richard, it is up to him to decide to commit the patch or not (it still is). I had mixed results on x86-64 -- some better and some worse (but with average the same) with pretty big variations. Those big variations mean that people could use this for fine-tuning their programs. Taking your results for S390 and ARM with Neon into account, I guess it should be included and probably made by default for these 2 targets (for sure for s390). OK, thanks to both of you. Ulrich and Andreas: would you be happy for s390 to use this by default? I'll update the patch and install if so. Richard
Re: RFC reminder: an alternative -fsched-pressure implementation
Vladimir Makarov wrote: I have a mixed feeling with the patch. I've tried it on SPEC2000 on x86/x86-64 and ARM. Model algorithm generates bigger code up to 3.5% (SPECFP on x86), 2% (SPECFP on 86-64), and 0.23% (SPECFP on ARM) in comparison with the current algorithm. It is slower too. Although the difference is quite insignificant on Corei7, compiler speed slowdown achieves 0.4% on SPECFP2000 on arm. The algorithm also generates slower code on x86 (1.5% on SPECINT and 5% on SPECFP200) and practically the same average code on x86-64 and ARM (I've tried only SPECINT on ARM). Was this using NEON on ARM? The biggest benefits we saw involved vector code ... On 04/17/2012 04:29 AM, Richard Sandiford wrote: Anyway, given those results and your mixed feelings, I think it would be best to drop the patch. It's a lot of code to carry around, and its ad-hoc nature would make it hard to change in future. There must be better ways of achieving the same thing. It is up to you, Richard. I don't mind if you commit it into the trunk. There is something in your approach too. If it would be possible to get the best of the two methods, we could see a big improvement. On s390, the patch is pretty much a straight improvement across the line: 0.5% on SPECint, 1.7% on SPECfp overall, with the best single improvement in GemsFDTD (9.3%), and no regression beyond usual noise levels. Given that, and the impressive improvements we saw on some ARM benchmarks involving vector code (up to 70%), it would really be a pity to see the patch going nowhere ... Is there anything we can do to help make the patch more acceptable? Any suggestions welcome ... Thanks, Ulrich -- Dr. Ulrich Weigand GNU Toolchain for Linux on System z and Cell BE ulrich.weig...@de.ibm.com
Re: RFC reminder: an alternative -fsched-pressure implementation
On 04/23/2012 11:42 AM, Ulrich Weigand wrote: Vladimir Makarov wrote: I have a mixed feeling with the patch. I've tried it on SPEC2000 on x86/x86-64 and ARM. Model algorithm generates bigger code up to 3.5% (SPECFP on x86), 2% (SPECFP on 86-64), and 0.23% (SPECFP on ARM) in comparison with the current algorithm. It is slower too. Although the difference is quite insignificant on Corei7, compiler speed slowdown achieves 0.4% on SPECFP2000 on arm. The algorithm also generates slower code on x86 (1.5% on SPECINT and 5% on SPECFP200) and practically the same average code on x86-64 and ARM (I've tried only SPECINT on ARM). Was this using NEON on ARM? The biggest benefits we saw involved vector code ... Sorry, No. On 04/17/2012 04:29 AM, Richard Sandiford wrote: Anyway, given those results and your mixed feelings, I think it would be best to drop the patch. It's a lot of code to carry around, and its ad-hoc nature would make it hard to change in future. There must be better ways of achieving the same thing. It is up to you, Richard. I don't mind if you commit it into the trunk. There is something in your approach too. If it would be possible to get the best of the two methods, we could see a big improvement. On s390, the patch is pretty much a straight improvement across the line: 0.5% on SPECint, 1.7% on SPECfp overall, with the best single improvement in GemsFDTD (9.3%), and no regression beyond usual noise levels. That is a really big improvement. Given that, and the impressive improvements we saw on some ARM benchmarks involving vector code (up to 70%), it would really be a pity to see the patch going nowhere ... Is there anything we can do to help make the patch more acceptable? Any suggestions welcome ... I did not object the patch. As I wrote Richard, it is up to him to decide to commit the patch or not (it still is). I had mixed results on x86-64 -- some better and some worse (but with average the same) with pretty big variations. Those big variations mean that people could use this for fine-tuning their programs. Taking your results for S390 and ARM with Neon into account, I guess it should be included and probably made by default for these 2 targets (for sure for s390).
Re: RFC reminder: an alternative -fsched-pressure implementation
Vladimir Makarov vmaka...@redhat.com writes: On 04/10/2012 09:35 AM, Richard Sandiford wrote: Hi Vlad, Back in Decemember, when we were still very much in stage 3, I sent an RFC about an alternative implementation of -fsched-pressure. Just wanted to send a reminder now that we're in the proper stage: http://gcc.gnu.org/ml/gcc-patches/2011-12/msg01684.html Ulrich has benchmarked it on ARM, S/390 and Power7 (thanks), and got reasonable results. (I mentioned bad Power 7 results in that message, because of the way the VSX_REGS class is handled. Ulrich's results are without -mvsx though.) The condition I orignally set myself was that this patch should only go in if it becomes the default on at least one architecture, specifically ARM. Ulrich tells me that Linaro have now made it the default for ARM in their GCC 4.7 release, so hopefully Ramana would be OK with doing the same in upstream 4.8. I realise the whole thing is probably more complicated and ad-hoc than you'd like. Saying it can't go in is a perfectly acceptable answer IMO. I have a mixed feeling with the patch. I've tried it on SPEC2000 on x86/x86-64 and ARM. Model algorithm generates bigger code up to 3.5% (SPECFP on x86), 2% (SPECFP on 86-64), and 0.23% (SPECFP on ARM) in comparison with the current algorithm. Yeah. That's expected really, since one of the points of the patch is to allow more spilling than the current algorithm does. One of the main problems I was seeing with the current algorithm was that it was too conservative, and prevented spilling even when it would be benificial. This included spilling outside loops. E.g. if you have 14 available GPRs, as on ARM, and have 10 pseudos live across a loop but not used within it, the current algorithm uses 10 registers as the starting pressure when scheduling the loop. So the current algorithm tries hard to avoid spilling those 10 registers, even if that restricts the amount of reordering within the loop. The new algorithm was supposed to allow such registers to be spilled, but those extra spills would increase code size. It is slower too. Although the difference is quite insignificant on Corei7, compiler speed slowdown achieves 0.4% on SPECFP2000 on arm. Hmm, that's not good. The algorithm also generates slower code on x86 (1.5% on SPECINT and 5% on SPECFP200) and practically the same average code on x86-64 and ARM (I've tried only SPECINT on ARM). Yeah, that's underwhelming too. It looks like there's a danger that this would become an A8- and A9-specific pass, so we would really need better ARM results than that. On the other hand, I don't think that 1st insn scheduling will be ever used for x86. And although the SPECFP2000 rate is the same on x86-64 I saw that some SPECFP2000 tests benefit from your algorithm on x86-64 (one amazing difference is 70% improvement on swim on x86-64 although it might be because of different reasons like alignment or cache behaviour). So I think the algorithm might work better on processors with more registers. Notwithstanding that this is a goemean, I assume there were some bad results to cancel out the gain? As for the patch itself, I think you should document the option in doc/invoke.texi. It is missed. I forgot to say, but that was deliberate. I see this as a developer option rather than a user option. The important thing was to allow backends to default to the new algorithm if they want to. Providing command-line control gives developers an easier way of seeing which default makes sense, but I don't think the option should be advertised to users. Another minor mistake I found is one line garbage (I guess from -fira-algorithm) in description of -fsched-pressure-algorithm in common.opt. Oops, thanks :-) Anyway, given those results and your mixed feelings, I think it would be best to drop the patch. It's a lot of code to carry around, and its ad-hoc nature would make it hard to change in future. There must be better ways of achieving the same thing. Richard
Re: RFC reminder: an alternative -fsched-pressure implementation
On 04/17/2012 04:29 AM, Richard Sandiford wrote: Vladimir Makarovvmaka...@redhat.com writes: On the other hand, I don't think that 1st insn scheduling will be ever used for x86. And although the SPECFP2000 rate is the same on x86-64 I saw that some SPECFP2000 tests benefit from your algorithm on x86-64 (one amazing difference is 70% improvement on swim on x86-64 although it might be because of different reasons like alignment or cache behaviour). So I think the algorithm might work better on processors with more registers. Notwithstanding that this is a goemean, I assume there were some bad results to cancel out the gain? Yes, mgrid had 4% degradation, mesa had 2%, facerec and ammp had 2.5%, lucas had 15%. On the other hand, galgel had 2% improvement and equake had 1%. All other differences are not considerable. Oops, thanks :-) Anyway, given those results and your mixed feelings, I think it would be best to drop the patch. It's a lot of code to carry around, and its ad-hoc nature would make it hard to change in future. There must be better ways of achieving the same thing. It is up to you, Richard. I don't mind if you commit it into the trunk. There is something in your approach too. If it would be possible to get the best of the two methods, we could see a big improvement.
Re: RFC reminder: an alternative -fsched-pressure implementation
On 04/10/2012 09:35 AM, Richard Sandiford wrote: Hi Vlad, Back in Decemember, when we were still very much in stage 3, I sent an RFC about an alternative implementation of -fsched-pressure. Just wanted to send a reminder now that we're in the proper stage: http://gcc.gnu.org/ml/gcc-patches/2011-12/msg01684.html Ulrich has benchmarked it on ARM, S/390 and Power7 (thanks), and got reasonable results. (I mentioned bad Power 7 results in that message, because of the way the VSX_REGS class is handled. Ulrich's results are without -mvsx though.) The condition I orignally set myself was that this patch should only go in if it becomes the default on at least one architecture, specifically ARM. Ulrich tells me that Linaro have now made it the default for ARM in their GCC 4.7 release, so hopefully Ramana would be OK with doing the same in upstream 4.8. I realise the whole thing is probably more complicated and ad-hoc than you'd like. Saying it can't go in is a perfectly acceptable answer IMO. I have a mixed feeling with the patch. I've tried it on SPEC2000 on x86/x86-64 and ARM. Model algorithm generates bigger code up to 3.5% (SPECFP on x86), 2% (SPECFP on 86-64), and 0.23% (SPECFP on ARM) in comparison with the current algorithm. It is slower too. Although the difference is quite insignificant on Corei7, compiler speed slowdown achieves 0.4% on SPECFP2000 on arm. The algorithm also generates slower code on x86 (1.5% on SPECINT and 5% on SPECFP200) and practically the same average code on x86-64 and ARM (I've tried only SPECINT on ARM). On the other hand, I don't think that 1st insn scheduling will be ever used for x86. And although the SPECFP2000 rate is the same on x86-64 I saw that some SPECFP2000 tests benefit from your algorithm on x86-64 (one amazing difference is 70% improvement on swim on x86-64 although it might be because of different reasons like alignment or cache behaviour). So I think the algorithm might work better on processors with more registers. I gues it is ok to sumbit this work to the trunk. It may be useful for achieving better performance for specific tests or to make it a default for some targets if it is proven to be better. As for the patch itself, I think you should document the option in doc/invoke.texi. It is missed. Another minor mistake I found is one line garbage (I guess from -fira-algorithm) in description of -fsched-pressure-algorithm in common.opt. Thanks, Richard. By the way, the code in scheduler has been changed since you made a patch and you need to do some merging first.
Re: RFC reminder: an alternative -fsched-pressure implementation
The condition I orignally set myself was that this patch should only go in if it becomes the default on at least one architecture, specifically ARM. Ulrich tells me that Linaro have now made it the default for ARM in their GCC 4.7 release, so hopefully Ramana would be OK with doing the same in upstream 4.8. Before I approve that I'd like to hopefully see some numbers across the M-class cores as well so that we make sure that side of the architecture isn't missed out (I don't expect it to make a huge difference there but it's still worth checking) - I'm quite happy for it to be default on A8 and A9 based on the evidence I've seen so far. regards, Ramana
Re: RFC reminder: an alternative -fsched-pressure implementation
On 04/10/2012 09:35 AM, Richard Sandiford wrote: Hi Vlad, Back in Decemember, when we were still very much in stage 3, I sent an RFC about an alternative implementation of -fsched-pressure. Just wanted to send a reminder now that we're in the proper stage: http://gcc.gnu.org/ml/gcc-patches/2011-12/msg01684.html Ulrich has benchmarked it on ARM, S/390 and Power7 (thanks), and got reasonable results. (I mentioned bad Power 7 results in that message, because of the way the VSX_REGS class is handled. Ulrich's results are without -mvsx though.) The condition I orignally set myself was that this patch should only go in if it becomes the default on at least one architecture, specifically ARM. Ulrich tells me that Linaro have now made it the default for ARM in their GCC 4.7 release, so hopefully Ramana would be OK with doing the same in upstream 4.8. Sorry for forgetting about this and thanks for the remainder. I'll start looking at this patch thoroughly and reviewing it on this week (may be I'll do some benchmarks too). I realise the whole thing is probably more complicated and ad-hoc than you'd like. Saying it can't go in is a perfectly acceptable answer IMO. The first impression I got for the patch when you submitted it, that it probably should go in.