Re: [PATCH][v3] GIMPLE store merging pass
On 08/09/16 16:25, Bernhard Reutner-Fischer wrote:
> On 8 September 2016 at 10:31, Kyrill Tkachov wrote:
>> On 07/09/16 20:03, Bernhard Reutner-Fischer wrote:
>>> On September 6, 2016 5:14:47 PM GMT+02:00, Kyrill Tkachov wrote:
>>
>> Thanks, fixed all the above in my tree (will be retesting).
>>
>>> What about debug statements? ISTM you should skip those.
>>> (Isn't visited reset before entry of a pass?)
>>
>> Yes, I'll skip debug statements. Comments in gimple.h say that the visited
>> property is undefined at pass boundaries, so I'd rather initialize it here.
>
> Right.
>
>>> Maybe I missed the bikeshedding about the name but I'd have used
>>> -fmerge-stores instead.
>>
>> Wouldn't be hard to change. I can change it any point if there's a
>> consensus.
>
> Did you consider any relation to any of
> https://gcc.gnu.org/PR22141
> https://gcc.gnu.org/PR23684
> https://gcc.gnu.org/PR47059
> https://gcc.gnu.org/PR54422
> and their dups or
> https://www.mail-archive.com/gcc-patches@gcc.gnu.org/msg77311.html
> (the -fmerge-bitfields suggestion from imgtec; maybe the testcases are of
> interest)
>
> thanks,

Thanks for the pointers. I was not aware of PR23684 and have extended the
patch to handle that case as well.

The current version I'm hoping to get in only handles constant stores.
There are tricks we can do for non-constant contiguous stores as well and I
hope to implement some of them in the future, but for now I'm concentrating
on getting the initial version up to scratch.

I am now re-testing and re-benchmarking the patch after your feedback and
some other improvements I've added since the last version and hope to send
out an updated version soon.

Thanks,
Kyrill
Re: [PATCH][v3] GIMPLE store merging pass
On 8 September 2016 at 10:31, Kyrill Tkachov wrote:
>
> On 07/09/16 20:03, Bernhard Reutner-Fischer wrote:
>>
>> On September 6, 2016 5:14:47 PM GMT+02:00, Kyrill Tkachov
>> wrote:
>
> Thanks, fixed all the above in my tree (will be retesting).
>
>> What about debug statements? ISTM you should skip those.
>> (Isn't visited reset before entry of a pass?)
>
> Yes, I'll skip debug statements. Comments in gimple.h say that the visited
> property is undefined at pass boundaries, so I'd rather initialize it here.

Right.

>> Maybe I missed the bikeshedding about the name but I'd have used
>> -fmerge-stores instead.
>
> Wouldn't be hard to change. I can change it any point if there's a
> consensus.

Did you consider any relation to any of
https://gcc.gnu.org/PR22141
https://gcc.gnu.org/PR23684
https://gcc.gnu.org/PR47059
https://gcc.gnu.org/PR54422
and their dups or
https://www.mail-archive.com/gcc-patches@gcc.gnu.org/msg77311.html
(the -fmerge-bitfields suggestion from imgtec; maybe the testcases are of interest)

thanks,
Re: [PATCH][v3] GIMPLE store merging pass
On 07/09/16 20:03, Bernhard Reutner-Fischer wrote:
> On September 6, 2016 5:14:47 PM GMT+02:00, Kyrill Tkachov wrote:
>> Hi all,
>
> s/contigous/contiguous/
> s/ where where/ where/
>
> +struct merged_store_group
> +{
> +  HOST_WIDE_INT start;
> +  HOST_WIDE_INT width;
> +  unsigned char *val;
> +  unsigned int align;
> +  auto_vec stores;
> +  /* We record the first and last original statements in the sequence because
> +     because we'll need their vuse/vdef and replacement position. */
> +  gimple *last_stmt;
>
> s/ because because/ because/
>
> Why aren't these two HWIs unsigned, likewise in store_immediate_info and in
> most other spots in the patch?
>
> +  fprintf (dump_file, "Afer writing ");
>
> s/Afer /After/
>
> /access if prohibitively slow/s/ if /is /
>
> I'd get rid of successful_p in imm_store_chain_info::output_merged_stores.

Thanks, fixed all the above in my tree (will be retesting).

> +unsigned int
> +pass_store_merging::execute (function *fun)
> +{
> +  basic_block bb;
> +  hash_set orig_stmts;
> +
> +  FOR_EACH_BB_FN (bb, fun)
> +    {
> +      gimple_stmt_iterator gsi;
> +      HOST_WIDE_INT num_statements = 0;
> +      /* Record the original statements so that we can keep track of
> +	 statements emitted in this pass and not re-process new
> +	 statements.  */
> +      for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> +	{
> +	  gimple_set_visited (gsi_stmt (gsi), false);
> +	  num_statements++;
> +	}
> +
> +      if (num_statements < 2)
> +	continue;
>
> What about debug statements? ISTM you should skip those.
> (Isn't visited reset before entry of a pass?)

Yes, I'll skip debug statements. Comments in gimple.h say that the visited
property is undefined at pass boundaries, so I'd rather initialize it here.

> Maybe I missed the bikeshedding about the name but I'd have used
> -fmerge-stores instead.

Wouldn't be hard to change. I can change it at any point if there's a
consensus.

Thanks for the feedback.
Kyrill

Thanks,

The v3 of this patch addresses feedback I received on the version posted at [1].
The merged store buffer is now represented as a char array that we splat
values onto with native_encode_expr and native_interpret_expr. This allows
us to merge anything that native_encode_expr accepts, including floating
point values and short vectors. So this version extends the functionality
of the previous one in that it handles floating point values as well.

The first phase of the algorithm that detects the contiguous stores is also
slightly refactored according to feedback to read more fluently.

Richi, I experimented with merging up to MOVE_MAX bytes rather than word
size but I got worse results on aarch64. MOVE_MAX there is 16 (because it
has load/store register pair instructions) but the 128-bit immediates that
we ended up synthesising were too complex. Perhaps the TImode immediate
store RTL expansions could be improved, but for now I've left the maximum
merge size to be BITS_PER_WORD.

I've disabled the pass for PDP-endian targets as the merging code proved to
be quite fiddly to get right for different endiannesses and I didn't feel
comfortable writing logic for BYTES_BIG_ENDIAN != WORDS_BIG_ENDIAN targets
without serious testing capabilities. I hope that's ok (I note the bswap
pass also doesn't try to do anything on such targets).

Tested on arm, aarch64, x86_64 and on big-endian arm and aarch64.

How does this version look?
Thanks,
Kyrill

[1] https://gcc.gnu.org/ml/gcc-patches/2016-08/msg01512.html

2016-09-06  Kyrylo Tkachov

	PR middle-end/22141
	* Makefile.in (OBJS): Add gimple-ssa-store-merging.o.
	* common.opt (fstore-merging): New Optimization option.
	* opts.c (default_options_table): Add entry for
	OPT_ftree_store_merging.
	* params.def (PARAM_STORE_MERGING_ALLOW_UNALIGNED): Define.
	* passes.def: Insert pass_tree_store_merging.
	* tree-pass.h (make_pass_store_merging): Declare extern prototype.
	* gimple-ssa-store-merging.c: New file.
	* doc/invoke.texi (Optimization Options): Document -fstore-merging.
2016-09-06  Kyrylo Tkachov
	    Jakub Jelinek

	PR middle-end/22141
	* gcc.c-torture/execute/pr22141-1.c: New test.
	* gcc.c-torture/execute/pr22141-2.c: Likewise.
	* gcc.target/aarch64/ldp_stp_1.c: Adjust for -fstore-merging.
	* gcc.target/aarch64/ldp_stp_4.c: Likewise.
	* gcc.dg/store_merging_1.c: New test.
	* gcc.dg/store_merging_2.c: Likewise.
	* gcc.dg/store_merging_3.c: Likewise.
	* gcc.dg/store_merging_4.c: Likewise.
	* gcc.dg/store_merging_5.c: Likewise.
	* gcc.dg/store_merging_6.c: Likewise.
	* gcc.target/i386/pr22141.c: Likewise.
	* gcc.target/i386/pr34012.c: Add -fno-store-merging to dg-options.
Re: [PATCH][v3] GIMPLE store merging pass
On Wed, Sep 07, 2016 at 10:19:11AM +0200, Richard Biener wrote:
> > If you want a 64-bit store, you'd need to merge the two, and that would be
> > even more expensive. It is a matter of say:
> > movl $0x12345678, (%rsp)
> > movl $0x09abcdef, 4(%rsp)
> > vs.
> > movabsq $0x09abcdef12345678, %rax
> > movq %rax, (%rsp)
> > vs.
> > movl $0x09abcdef, %eax
> > salq $32, %rax
> > orq $0x12345678, %rax
> > movq %rax, (%rsp)
> > vs.
> > movq $LC0, (%rsp)

You don't want to store the address, so you'd use
	movq .LC0, %rax
	movq %rax, (%rsp)

> I think the important part to notice is that it should be straight forward
> for a target / the expander to split a large store from an immediate
> into any of the above but very hard to do the opposite.  Thus from a
> GIMPLE side "canonicalizing" to large stores (that are eventually
> supported and well-aligned) seems best to me.

I bet many programs assume that, say, a 64-bit aligned store in the source
is atomic in 64-bit apps, without using
__atomic_store (..., __ATOMIC_RELAXED); so such a change would break that.

	Jakub
Re: [PATCH][v3] GIMPLE store merging pass
On September 6, 2016 5:14:47 PM GMT+02:00, Kyrill Tkachov wrote:
>Hi all,

s/contigous/contiguous/
s/ where where/ where/

+struct merged_store_group
+{
+  HOST_WIDE_INT start;
+  HOST_WIDE_INT width;
+  unsigned char *val;
+  unsigned int align;
+  auto_vec stores;
+  /* We record the first and last original statements in the sequence because
+     because we'll need their vuse/vdef and replacement position. */
+  gimple *last_stmt;

s/ because because/ because/

Why aren't these two HWIs unsigned, likewise in store_immediate_info and in
most other spots in the patch?

+  fprintf (dump_file, "Afer writing ");

s/Afer /After/

/access if prohibitively slow/s/ if /is /

I'd get rid of successful_p in imm_store_chain_info::output_merged_stores.

+unsigned int
+pass_store_merging::execute (function *fun)
+{
+  basic_block bb;
+  hash_set orig_stmts;
+
+  FOR_EACH_BB_FN (bb, fun)
+    {
+      gimple_stmt_iterator gsi;
+      HOST_WIDE_INT num_statements = 0;
+      /* Record the original statements so that we can keep track of
+	 statements emitted in this pass and not re-process new
+	 statements.  */
+      for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+	{
+	  gimple_set_visited (gsi_stmt (gsi), false);
+	  num_statements++;
+	}
+
+      if (num_statements < 2)
+	continue;

What about debug statements? ISTM you should skip those.
(Isn't visited reset before entry of a pass?)

Maybe I missed the bikeshedding about the name but I'd have used
-fmerge-stores instead.

Thanks,

>
>The v3 of this patch addresses feedback I received on the version
>posted at [1].
>The merged store buffer is now represented as a char array that we
>splat values onto with
>native_encode_expr and native_interpret_expr. This allows us to merge
>anything that native_encode_expr
>accepts, including floating point values and short vectors. So this
>version extends the functionality
>of the previous one in that it handles floating point values as well.
>
>The first phase of the algorithm that detects the contiguous stores is
>also slightly refactored according
>to feedback to read more fluently.
>
>Richi, I experimented with merging up to MOVE_MAX bytes rather than
>word size but I got worse results on aarch64.
>MOVE_MAX there is 16 (because it has load/store register pair
>instructions) but the 128-bit immediates that we ended
>synthesising were too complex. Perhaps the TImode immediate store RTL
>expansions could be improved, but for now
>I've left the maximum merge size to be BITS_PER_WORD.
>
>I've disabled the pass for PDP-endian targets as the merging code
>proved to be quite fiddly to get right for different
>endiannesses and I didn't feel comfortable writing logic for
>BYTES_BIG_ENDIAN != WORDS_BIG_ENDIAN targets without serious
>testing capabilities. I hope that's ok (I note the bswap pass also
>doesn't try to do anything on such targets).
>
>Tested on arm, aarch64, x86_64 and on big-endian arm and aarch64.
>
>How does this version look?
>Thanks,
>Kyrill
>
>[1] https://gcc.gnu.org/ml/gcc-patches/2016-08/msg01512.html
>
>2016-09-06  Kyrylo Tkachov
>
>	PR middle-end/22141
>	* Makefile.in (OBJS): Add gimple-ssa-store-merging.o.
>	* common.opt (fstore-merging): New Optimization option.
>	* opts.c (default_options_table): Add entry for
>	OPT_ftree_store_merging.
>	* params.def (PARAM_STORE_MERGING_ALLOW_UNALIGNED): Define.
>	* passes.def: Insert pass_tree_store_merging.
>	* tree-pass.h (make_pass_store_merging): Declare extern
>	prototype.
>	* gimple-ssa-store-merging.c: New file.
>	* doc/invoke.texi (Optimization Options): Document
>	-fstore-merging.
>
>2016-09-06  Kyrylo Tkachov
>	    Jakub Jelinek
>
>	PR middle-end/22141
>	* gcc.c-torture/execute/pr22141-1.c: New test.
>	* gcc.c-torture/execute/pr22141-2.c: Likewise.
>	* gcc.target/aarch64/ldp_stp_1.c: Adjust for -fstore-merging.
>	* gcc.target/aarch64/ldp_stp_4.c: Likewise.
>	* gcc.dg/store_merging_1.c: New test.
>	* gcc.dg/store_merging_2.c: Likewise.
>	* gcc.dg/store_merging_3.c: Likewise.
>	* gcc.dg/store_merging_4.c: Likewise.
>	* gcc.dg/store_merging_5.c: Likewise.
>	* gcc.dg/store_merging_6.c: Likewise.
>	* gcc.target/i386/pr22141.c: Likewise.
>	* gcc.target/i386/pr34012.c: Add -fno-store-merging to dg-options.
Re: [PATCH][v3] GIMPLE store merging pass
On 09/07/2016 10:19 AM, Richard Biener wrote:
> On Tue, 6 Sep 2016, Jakub Jelinek wrote:
>> If you want a 64-bit store, you'd need to merge the two, and that would be
>> even more expensive. It is a matter of say:
>> movl $0x12345678, (%rsp)
>> movl $0x09abcdef, 4(%rsp)
>> vs.
>> movabsq $0x09abcdef12345678, %rax
>> movq %rax, (%rsp)
>> vs.
>> movl $0x09abcdef, %eax
>> salq $32, %rax
>> orq $0x12345678, %rax
>> movq %rax, (%rsp)
> vs.
> movq $LC0, (%rsp)
> ?

Not the same. That moves the address of $LC0.

Bernd
Re: [PATCH][v3] GIMPLE store merging pass
On 09/07/2016 02:19 AM, Richard Biener wrote:

On Tue, 6 Sep 2016, Jakub Jelinek wrote:

On Tue, Sep 06, 2016 at 04:59:23PM +0100, Kyrill Tkachov wrote:

On 06/09/16 16:32, Jakub Jelinek wrote:

On Tue, Sep 06, 2016 at 04:14:47PM +0100, Kyrill Tkachov wrote:

The v3 of this patch addresses feedback I received on the version posted at [1]. The merged store buffer is now represented as a char array that we splat values onto with native_encode_expr and native_interpret_expr. This allows us to merge anything that native_encode_expr accepts, including floating point values and short vectors. So this version extends the functionality of the previous one in that it handles floating point values as well.

The first phase of the algorithm that detects the contiguous stores is also slightly refactored according to feedback to read more fluently.

Richi, I experimented with merging up to MOVE_MAX bytes rather than word size but I got worse results on aarch64. MOVE_MAX there is 16 (because it has load/store register pair instructions) but the 128-bit immediates that we ended synthesising were too complex. Perhaps the TImode immediate store RTL expansions could be improved, but for now I've left the maximum merge size to be BITS_PER_WORD.

At least from playing with this kind of things in the RTL PR22141 patch, I remember storing 64-bit constants on x86_64 compared to storing 2 32-bit constants usually isn't a win (not just for speed optimized blocks but also for -Os). For 64-bit store if the constant isn't signed 32-bit or unsigned 32-bit you need movabsq into some temporary register which has like 3 times worse latency than normal store if I remember well, and then store it.

We could restrict the maximum width of the stores generated to 32 bits on x86_64. I think this would need another parameter or target macro for the target to set. Alternatively, is it a possibility for x86 to be a bit smarter in its DImode mov-immediate expansion? For example break up the 64-bit movabsq immediate into two SImode immediates?

If you want a 64-bit store, you'd need to merge the two, and that would be even more expensive. It is a matter of say:
movl $0x12345678, (%rsp)
movl $0x09abcdef, 4(%rsp)
vs.
movabsq $0x09abcdef12345678, %rax
movq %rax, (%rsp)
vs.
movl $0x09abcdef, %eax
salq $32, %rax
orq $0x12345678, %rax
movq %rax, (%rsp)
vs.
movq $LC0, (%rsp)
?
etc. Guess it needs to be benchmarked on contemporary CPUs, I'm pretty sure the last sequence is the worst one.

I think the important part to notice is that it should be straight forward for a target / the expander to split a large store from an immediate into any of the above but very hard to do the opposite. Thus from a GIMPLE side "canonicalizing" to large stores (that are eventually supported and well-aligned) seems best to me.

Agreed.

I'm aware of that. The patch already has logic to avoid emitting unaligned accesses for SLOW_UNALIGNED_ACCESS targets. Beyond that the patch introduces the parameter PARAM_STORE_MERGING_ALLOW_UNALIGNED that can be used by the user or target to forbid generation of unaligned stores by the pass altogether. Beyond that I'm not sure how to behave more intelligently here. Any ideas?

Dunno, the heuristics was the main problem with my patch. Generally, I'd say there is a difference between cold and hot blocks, in cold ones perhaps unaligned stores are more appropriate (if supported at all and not way too slow), while in hot ones less desirable.

Note that I repeatedly argue that if we can canonicalize sth to "larger" then even if unaligned, the expander should be able to produce optimal code again (it might not do, of course).

And agreed. Furthermore, it's in line with our guiding principles WRT separation of the tree/SSA optimizers from target dependencies. So let's push those decisions into the expanders/backend/target and canonicalize to the larger stores.

jeff
Re: [PATCH][v3] GIMPLE store merging pass
On Tue, 6 Sep 2016, Jakub Jelinek wrote:

> On Tue, Sep 06, 2016 at 04:59:23PM +0100, Kyrill Tkachov wrote:
> > On 06/09/16 16:32, Jakub Jelinek wrote:
> > >On Tue, Sep 06, 2016 at 04:14:47PM +0100, Kyrill Tkachov wrote:
> > >>The v3 of this patch addresses feedback I received on the version posted
> > >>at [1].
> > >>The merged store buffer is now represented as a char array that we splat
> > >>values onto with
> > >>native_encode_expr and native_interpret_expr. This allows us to merge
> > >>anything that native_encode_expr
> > >>accepts, including floating point values and short vectors. So this
> > >>version extends the functionality
> > >>of the previous one in that it handles floating point values as well.
> > >>
> > >>The first phase of the algorithm that detects the contiguous stores is
> > >>also slightly refactored according
> > >>to feedback to read more fluently.
> > >>
> > >>Richi, I experimented with merging up to MOVE_MAX bytes rather than word
> > >>size but I got worse results on aarch64.
> > >>MOVE_MAX there is 16 (because it has load/store register pair
> > >>instructions) but the 128-bit immediates that we ended
> > >>synthesising were too complex. Perhaps the TImode immediate store RTL
> > >>expansions could be improved, but for now
> > >>I've left the maximum merge size to be BITS_PER_WORD.
> > >At least from playing with this kind of things in the RTL PR22141 patch,
> > >I remember storing 64-bit constants on x86_64 compared to storing 2 32-bit
> > >constants usually isn't a win (not just for speed optimized blocks but
> > >also for
> > >-Os). For 64-bit store if the constant isn't signed 32-bit or unsigned
> > >32-bit you need movabsq into some temporary register which has like 3
> > >times worse
> > >latency than normal store if I remember well, and then store it.
> >
> > We could restrict the maximum width of the stores generated to 32 bits on
> > x86_64.
> > I think this would need another parameter or target macro for the target to
> > set.
> > Alternatively, is it a possibility for x86 to be a bit smarter in its
> > DImode mov-immediate
> > expansion? For example break up the 64-bit movabsq immediate into two
> > SImode immediates?
>
> If you want a 64-bit store, you'd need to merge the two, and that would be
> even more expensive. It is a matter of say:
> movl $0x12345678, (%rsp)
> movl $0x09abcdef, 4(%rsp)
> vs.
> movabsq $0x09abcdef12345678, %rax
> movq %rax, (%rsp)
> vs.
> movl $0x09abcdef, %eax
> salq $32, %rax
> orq $0x12345678, %rax
> movq %rax, (%rsp)

vs.

movq $LC0, (%rsp)

?

> etc. Guess it needs to be benchmarked on contemporary CPUs, I'm pretty sure
> the last sequence is the worst one.

I think the important part to notice is that it should be straight forward
for a target / the expander to split a large store from an immediate
into any of the above but very hard to do the opposite.  Thus from a
GIMPLE side "canonicalizing" to large stores (that are eventually
supported and well-aligned) seems best to me.

> > >What alias set is used for the accesses if there are different alias sets
> > >involved in between the merged stores?
> >
> > As per https://gcc.gnu.org/ml/gcc/2016-06/msg00162.html the type used in
> > those cases
> > would be ptr_type_node. See the get_type_for_merged_store function in the
> > patch.
>
> Richi knows this best. I just wonder if e.g. all the stores go into fields
> of the same structure it wouldn't be better if you need to punt use that
> structure as the type rather than alias set 0.

Well, yes - if the IL always accesses a common handled component you can
use the alias set of that component. But it's some work to do this
correctly as you can have

 MEM[ + 4] = 1;

which stores an 'int' to a struct with two floats. So you do have to be
careful, also when merging overlapping stores (in which case you'll
certainly have an irregular situation).
Which is why I suggested the "easy" approach above which should handle a
lot of cases well already.

> > I'm aware of that. The patch already has logic to avoid emitting unaligned
> > accesses
> > for SLOW_UNALIGNED_ACCESS targets. Beyond that the patch introduces the
> > parameter
> > PARAM_STORE_MERGING_ALLOW_UNALIGNED that can be used by the user or target
> > to
> > forbid generation of unaligned stores by the pass altogether. Beyond that
> > I'm not sure
> > how to behave more intelligently here. Any ideas?
>
> Dunno, the heuristics was the main problem with my patch. Generally, I'd
> say there is a difference between cold and hot blocks, in cold ones perhaps
> unaligned stores are more appropriate (if supported at all and not way too
> slow), while in hot ones less desirable.

Note that I repeatedly argue that if we can canonicalize sth to "larger"
then even if unaligned, the expander should be able to produce optimal
code again (it might not do, of course).

Hope to look at the patch in detail soon.
Re: [PATCH][v3] GIMPLE store merging pass
On 06/09/16 17:21, Jakub Jelinek wrote:

On Tue, Sep 06, 2016 at 04:59:23PM +0100, Kyrill Tkachov wrote:

On 06/09/16 16:32, Jakub Jelinek wrote:

On Tue, Sep 06, 2016 at 04:14:47PM +0100, Kyrill Tkachov wrote:

The v3 of this patch addresses feedback I received on the version posted at [1]. The merged store buffer is now represented as a char array that we splat values onto with native_encode_expr and native_interpret_expr. This allows us to merge anything that native_encode_expr accepts, including floating point values and short vectors. So this version extends the functionality of the previous one in that it handles floating point values as well.

The first phase of the algorithm that detects the contiguous stores is also slightly refactored according to feedback to read more fluently.

Richi, I experimented with merging up to MOVE_MAX bytes rather than word size but I got worse results on aarch64. MOVE_MAX there is 16 (because it has load/store register pair instructions) but the 128-bit immediates that we ended synthesising were too complex. Perhaps the TImode immediate store RTL expansions could be improved, but for now I've left the maximum merge size to be BITS_PER_WORD.

At least from playing with this kind of things in the RTL PR22141 patch, I remember storing 64-bit constants on x86_64 compared to storing 2 32-bit constants usually isn't a win (not just for speed optimized blocks but also for -Os). For 64-bit store if the constant isn't signed 32-bit or unsigned 32-bit you need movabsq into some temporary register which has like 3 times worse latency than normal store if I remember well, and then store it.

We could restrict the maximum width of the stores generated to 32 bits on x86_64. I think this would need another parameter or target macro for the target to set. Alternatively, is it a possibility for x86 to be a bit smarter in its DImode mov-immediate expansion? For example break up the 64-bit movabsq immediate into two SImode immediates?

If you want a 64-bit store, you'd need to merge the two, and that would be even more expensive. It is a matter of say:
movl $0x12345678, (%rsp)
movl $0x09abcdef, 4(%rsp)
vs.
movabsq $0x09abcdef12345678, %rax
movq %rax, (%rsp)
vs.
movl $0x09abcdef, %eax
salq $32, %rax
orq $0x12345678, %rax
movq %rax, (%rsp)
etc. Guess it needs to be benchmarked on contemporary CPUs, I'm pretty sure the last sequence is the worst one.

I'd think the backend could expand to the 1st form given a DImode move of 0x09abcdef12345678 to a memory destination? Or restrict the maximum merged store size to 32 bits. Or use rtx costs to iteratively reduce the size of the stores until we reach the cheap narrower constants. I suspect the problem there would be to weigh the benefit of using a narrower cheap constant against the cost of adding more stores. At the GIMPLE level we don't really want to be calling RTX costs on anything other than constants.

What alias set is used for the accesses if there are different alias sets involved in between the merged stores?

As per https://gcc.gnu.org/ml/gcc/2016-06/msg00162.html the type used in those cases would be ptr_type_node. See the get_type_for_merged_store function in the patch.

Richi knows this best. I just wonder if e.g. all the stores go into fields of the same structure it wouldn't be better if you need to punt use that structure as the type rather than alias set 0.

I'm aware of that. The patch already has logic to avoid emitting unaligned accesses for SLOW_UNALIGNED_ACCESS targets. Beyond that the patch introduces the parameter PARAM_STORE_MERGING_ALLOW_UNALIGNED that can be used by the user or target to forbid generation of unaligned stores by the pass altogether. Beyond that I'm not sure how to behave more intelligently here. Any ideas?

Dunno, the heuristics was the main problem with my patch. Generally, I'd say there is a difference between cold and hot blocks, in cold ones perhaps unaligned stores are more appropriate (if supported at all and not way too slow), while in hot ones less desirable.

Could do. I'll look into it, thanks.

And, do you have some SPEC2k and/or SPEC2k6 numbers, for e.g. x86_64/i686/arm/aarch64/powerpc64le?

I did some benchmarking on aarch64 in the initial submission at https://gcc.gnu.org/ml/gcc-patches/2016-07/msg00942.html aarch64 showed some improvement and no regressions on x86_64. I'll be rerunning the numbers on aarch64/x86_64/arm as the patch has expanded in scope since then (handling more bitfields, floating point constants). I just wanted to get this version out before the Cauldron for comments.

The RTL PR22141 changes weren't added mainly because it slowed down SPEC2k* on powerpc.

Unfortunately I don't have access to SPEC on powerpc. Any help with testing/benchmarking there would be very much appreciated.

I don't have SPEC set up on any arch, perhaps you can ask some of the IBM folks to
Re: [PATCH][v3] GIMPLE store merging pass
On Tue, Sep 06, 2016 at 04:59:23PM +0100, Kyrill Tkachov wrote:
> On 06/09/16 16:32, Jakub Jelinek wrote:
> >On Tue, Sep 06, 2016 at 04:14:47PM +0100, Kyrill Tkachov wrote:
> >>The v3 of this patch addresses feedback I received on the version posted at
> >>[1].
> >>The merged store buffer is now represented as a char array that we splat
> >>values onto with
> >>native_encode_expr and native_interpret_expr. This allows us to merge
> >>anything that native_encode_expr
> >>accepts, including floating point values and short vectors. So this version
> >>extends the functionality
> >>of the previous one in that it handles floating point values as well.
> >>
> >>The first phase of the algorithm that detects the contiguous stores is also
> >>slightly refactored according
> >>to feedback to read more fluently.
> >>
> >>Richi, I experimented with merging up to MOVE_MAX bytes rather than word
> >>size but I got worse results on aarch64.
> >>MOVE_MAX there is 16 (because it has load/store register pair instructions)
> >>but the 128-bit immediates that we ended
> >>synthesising were too complex. Perhaps the TImode immediate store RTL
> >>expansions could be improved, but for now
> >>I've left the maximum merge size to be BITS_PER_WORD.
> >At least from playing with this kind of things in the RTL PR22141 patch,
> >I remember storing 64-bit constants on x86_64 compared to storing 2 32-bit
> >constants usually isn't a win (not just for speed optimized blocks but also
> >for
> >-Os). For 64-bit store if the constant isn't signed 32-bit or unsigned
> >32-bit you need movabsq into some temporary register which has like 3 times
> >worse
> >latency than normal store if I remember well, and then store it.
>
> We could restrict the maximum width of the stores generated to 32 bits on
> x86_64.
> I think this would need another parameter or target macro for the target to
> set.
> Alternatively, is it a possibility for x86 to be a bit smarter in its DImode
> mov-immediate
> expansion?
> For example break up the 64-bit movabsq immediate into two SImode
> immediates?

If you want a 64-bit store, you'd need to merge the two, and that would be
even more expensive. It is a matter of say:
movl $0x12345678, (%rsp)
movl $0x09abcdef, 4(%rsp)
vs.
movabsq $0x09abcdef12345678, %rax
movq %rax, (%rsp)
vs.
movl $0x09abcdef, %eax
salq $32, %rax
orq $0x12345678, %rax
movq %rax, (%rsp)
etc. Guess it needs to be benchmarked on contemporary CPUs, I'm pretty sure
the last sequence is the worst one.

> >What alias set is used for the accesses if there are different alias sets
> >involved in between the merged stores?
>
> As per https://gcc.gnu.org/ml/gcc/2016-06/msg00162.html the type used in
> those cases
> would be ptr_type_node. See the get_type_for_merged_store function in the
> patch.

Richi knows this best. I just wonder if e.g. all the stores go into fields
of the same structure it wouldn't be better if you need to punt use that
structure as the type rather than alias set 0.

> I'm aware of that. The patch already has logic to avoid emitting unaligned
> accesses
> for SLOW_UNALIGNED_ACCESS targets. Beyond that the patch introduces the
> parameter
> PARAM_STORE_MERGING_ALLOW_UNALIGNED that can be used by the user or target to
> forbid generation of unaligned stores by the pass altogether. Beyond that I'm
> not sure
> how to behave more intelligently here. Any ideas?

Dunno, the heuristics was the main problem with my patch. Generally, I'd
say there is a difference between cold and hot blocks, in cold ones perhaps
unaligned stores are more appropriate (if supported at all and not way too
slow), while in hot ones less desirable.

> >And, do you have some SPEC2k and/or SPEC2k6 numbers, for
> > e.g. x86_64/i686/arm/aarch64/powerpc64le?
>
> I did some benchmarking on aarch64 in the initial submission at
> https://gcc.gnu.org/ml/gcc-patches/2016-07/msg00942.html
> aarch64 showed some improvement and no regressions on x86_64.
> I'll be rerunning the numbers on aarch64/x86_64/arm as the patch has expanded
> in scope since then (handling more bitfields, floating point constants).
> I just wanted to get this version out before the Cauldron for comments.
>
> >The RTL PR22141 changes weren't added mainly because it slowed down SPEC2k*
> >on powerpc.
>
> Unfortunately I don't have access to SPEC on powerpc. Any help with
> testing/benchmarking
> there would be very much appreciated.

I don't have SPEC set up on any arch, perhaps you can ask some of the IBM
folks to benchmark it for you (that is what I've done back when writing the
patch)?

BTW, do you handle bitfield stores too? Once you don't support just
constant stores, for bitfields but perhaps also for other adjacent
operations not just copying, but e.g. also bitwise operations (&,|,^,~) on
adjacent memory locations might be useful, or for bitfields even other
operations (e.g. addition if there are enough bits
Re: [PATCH][v3] GIMPLE store merging pass
Hi Jakub,

On 06/09/16 16:32, Jakub Jelinek wrote:
> On Tue, Sep 06, 2016 at 04:14:47PM +0100, Kyrill Tkachov wrote:
>> The v3 of this patch addresses feedback I received on the version posted
>> at [1]. The merged store buffer is now represented as a char array that
>> we splat values onto with native_encode_expr and native_interpret_expr.
>> This allows us to merge anything that native_encode_expr accepts,
>> including floating point values and short vectors. So this version
>> extends the functionality of the previous one in that it handles
>> floating point values as well.
>>
>> The first phase of the algorithm that detects the contiguous stores is
>> also slightly refactored according to feedback to read more fluently.
>>
>> Richi, I experimented with merging up to MOVE_MAX bytes rather than word
>> size but I got worse results on aarch64. MOVE_MAX there is 16 (because
>> it has load/store register pair instructions) but the 128-bit immediates
>> that we ended up synthesising were too complex. Perhaps the TImode
>> immediate store RTL expansions could be improved, but for now I've left
>> the maximum merge size at BITS_PER_WORD.
>
> At least from playing with this kind of thing in the RTL PR22141 patch,
> I remember that storing a 64-bit constant on x86_64 usually isn't a win
> compared to storing two 32-bit constants (not just for speed-optimized
> blocks but also for -Os). For a 64-bit store, if the constant isn't
> signed 32-bit or unsigned 32-bit you need a movabsq into some temporary
> register, which if I remember correctly has something like 3 times the
> latency of a normal store, followed by the store itself.

We could restrict the maximum width of the stores generated to 32 bits
on x86_64. I think this would need another parameter or target macro for
the target to set. Alternatively, is it a possibility for x86 to be a
bit smarter in its DImode mov-immediate expansion? For example, break up
the 64-bit movabsq immediate into two SImode immediates?

> If it can be CSEd and the same constant used multiple times in adjacent
> code, perhaps.

Perhaps. From glancing at SPEC2006 generated code for aarch64 I didn't
spot too many opportunities for that, though.

> Various other targets have different costs for different constants, so
> it would be nice if the pass considered that (computed RTX costs of
> those constants and used them in some heuristics).

Could do. That could avoid creating too expensive immediates.

> What alias set is used for the accesses if there are different alias
> sets involved in between the merged stores?

As per https://gcc.gnu.org/ml/gcc/2016-06/msg00162.html the type used in
those cases would be ptr_type_node. See the get_type_for_merged_store
function in the patch.

> Also alignment can matter, even on non-strict-alignment targets (speed
> vs. -Os for that).

I'm aware of that. The patch already has logic to avoid emitting
unaligned accesses on SLOW_UNALIGNED_ACCESS targets. Beyond that, the
patch introduces the parameter PARAM_STORE_MERGING_ALLOW_UNALIGNED that
can be used by the user or target to forbid generation of unaligned
stores by the pass altogether. Beyond that I'm not sure how to behave
more intelligently here. Any ideas?

> And, do you have some SPEC2k and/or SPEC2k6 numbers, for e.g.
> x86_64/i686/arm/aarch64/powerpc64le?

I did some benchmarking on aarch64 in the initial submission at
https://gcc.gnu.org/ml/gcc-patches/2016-07/msg00942.html
aarch64 showed some improvement and there were no regressions on x86_64.
I'll be rerunning the numbers on aarch64/x86_64/arm as the patch has
expanded in scope since then (handling more bitfields and floating
point constants). I just wanted to get this version out before the
Cauldron for comments.

> The RTL PR22141 changes weren't added mainly because they slowed down
> SPEC2k* on powerpc.

Unfortunately I don't have access to SPEC on powerpc. Any help with
testing/benchmarking there would be very much appreciated.

> Also, do you only handle constants, or also the case where there is
> partial or complete copying from some other memory, where it could be
> turned into larger chunk loads + stores or __builtin_memcpy?

At the moment just constants. I hope in the future to extend it to
perform more tricks involving contiguous stores.

>> I've disabled the pass for PDP-endian targets as the merging code
>> proved to be quite fiddly to get right for different endiannesses and
>> I didn't feel comfortable writing logic for BYTES_BIG_ENDIAN !=
>> WORDS_BIG_ENDIAN targets without serious testing capabilities. I hope
>> that's ok (I note the bswap pass also doesn't try to do anything on
>> such targets).
>
> I think that is fine, it isn't the only pass that punts in this case.

Thanks for the comments and ideas,
Kyrill
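To make the transformation under discussion concrete, here is a hypothetical C-level model of what the pass does (the real pass works on GIMPLE, and struct `s`, `store_separate`, and `store_merged` are invented for illustration): four adjacent single-byte constant stores become one 32-bit store of a merged constant.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct s { uint8_t a, b, c, d; };

/* Four narrow constant stores, as the pass would find them.  */
void store_separate (struct s *p)
{
  p->a = 1;
  p->b = 2;
  p->c = 3;
  p->d = 4;
}

/* The merged form: the constants are byte-encoded into a buffer
   (analogous to what native_encode_expr does) and written back with
   a single 32-bit access.  */
void store_merged (struct s *p)
{
  unsigned char buf[4] = { 1, 2, 3, 4 };
  uint32_t v;
  memcpy (&v, buf, sizeof v);  /* endian-neutral: bytes keep their offsets */
  memcpy (p, &v, sizeof v);
}
```

Both functions write the same four bytes; the merged form simply does it in one access.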
Re: [PATCH][v3] GIMPLE store merging pass
On Tue, Sep 06, 2016 at 04:14:47PM +0100, Kyrill Tkachov wrote:
> The v3 of this patch addresses feedback I received on the version posted at
> [1].
> The merged store buffer is now represented as a char array that we splat
> values onto with native_encode_expr and native_interpret_expr. This allows
> us to merge anything that native_encode_expr accepts, including floating
> point values and short vectors. So this version extends the functionality
> of the previous one in that it handles floating point values as well.
>
> The first phase of the algorithm that detects the contiguous stores is also
> slightly refactored according to feedback to read more fluently.
>
> Richi, I experimented with merging up to MOVE_MAX bytes rather than word
> size but I got worse results on aarch64. MOVE_MAX there is 16 (because it
> has load/store register pair instructions) but the 128-bit immediates that
> we ended up synthesising were too complex. Perhaps the TImode immediate
> store RTL expansions could be improved, but for now I've left the maximum
> merge size to be BITS_PER_WORD.

At least from playing with this kind of thing in the RTL PR22141 patch,
I remember that storing a 64-bit constant on x86_64 usually isn't a win
compared to storing two 32-bit constants (not just for speed-optimized
blocks but also for -Os). For a 64-bit store, if the constant isn't signed
32-bit or unsigned 32-bit you need a movabsq into some temporary register,
which if I remember correctly has something like 3 times the latency of a
normal store, followed by the store itself. If it can be CSEd and the same
constant used multiple times in adjacent code, perhaps.

Various other targets have different costs for different constants, so it
would be nice if the pass considered that (computed RTX costs of those
constants and used them in some heuristics).

What alias set is used for the accesses if there are different alias sets
involved in between the merged stores?

Also alignment can matter, even on non-strict-alignment targets (speed vs.
-Os for that).

And, do you have some SPEC2k and/or SPEC2k6 numbers, for e.g.
x86_64/i686/arm/aarch64/powerpc64le? The RTL PR22141 changes weren't added
mainly because they slowed down SPEC2k* on powerpc.

Also, do you only handle constants, or also the case where there is partial
or complete copying from some other memory, where it could be turned into
larger chunk loads + stores or __builtin_memcpy?

> I've disabled the pass for PDP-endian targets as the merging code proved
> to be quite fiddly to get right for different endiannesses and I didn't
> feel comfortable writing logic for BYTES_BIG_ENDIAN != WORDS_BIG_ENDIAN
> targets without serious testing capabilities. I hope that's ok (I note
> the bswap pass also doesn't try to do anything on such targets).

I think that is fine, it isn't the only pass that punts in this case.

Jakub
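Jakub's movabsq observation can be sketched at the C level (hypothetical functions for illustration; on x86_64 the first form needs a movabsq into a scratch register before the store, while the second needs only two store-immediate instructions):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A constant that fits neither the signed nor unsigned 32-bit form.  */
#define BIG_CONST 0x1122334455667788ULL

/* One 64-bit constant store: movabsq $imm64, %reg; movq %reg, (%rdi).  */
void store64 (uint64_t *p)
{
  *p = BIG_CONST;
}

/* Two 32-bit constant stores: movl $imm32, (%rdi); movl $imm32, 4(%rdi).
   The low half goes to the lower address, matching x86_64's little-endian
   layout; memcpy is used to avoid strict-aliasing violations.  */
void store2x32 (uint64_t *p)
{
  uint32_t lo = (uint32_t) BIG_CONST;
  uint32_t hi = (uint32_t) (BIG_CONST >> 32);
  memcpy ((unsigned char *) p, &lo, sizeof lo);
  memcpy ((unsigned char *) p + sizeof lo, &hi, sizeof hi);
}
```

On a little-endian host such as x86_64 both functions leave the same value in memory; the split form trades one expensive immediate for two cheap ones.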
[PATCH][v3] GIMPLE store merging pass
Hi all,

The v3 of this patch addresses feedback I received on the version posted at
[1]. The merged store buffer is now represented as a char array that we
splat values onto with native_encode_expr and native_interpret_expr. This
allows us to merge anything that native_encode_expr accepts, including
floating point values and short vectors. So this version extends the
functionality of the previous one in that it handles floating point values
as well.

The first phase of the algorithm that detects the contiguous stores is also
slightly refactored according to feedback to read more fluently.

Richi, I experimented with merging up to MOVE_MAX bytes rather than word
size but I got worse results on aarch64. MOVE_MAX there is 16 (because it
has load/store register pair instructions) but the 128-bit immediates that
we ended up synthesising were too complex. Perhaps the TImode immediate
store RTL expansions could be improved, but for now I've left the maximum
merge size at BITS_PER_WORD.

I've disabled the pass for PDP-endian targets as the merging code proved to
be quite fiddly to get right for different endiannesses and I didn't feel
comfortable writing logic for BYTES_BIG_ENDIAN != WORDS_BIG_ENDIAN targets
without serious testing capabilities. I hope that's ok (I note the bswap
pass also doesn't try to do anything on such targets).

Tested on arm, aarch64, x86_64 and on big-endian arm and aarch64.

How does this version look?

Thanks,
Kyrill

[1] https://gcc.gnu.org/ml/gcc-patches/2016-08/msg01512.html

2016-09-06  Kyrylo Tkachov

	PR middle-end/22141
	* Makefile.in (OBJS): Add gimple-ssa-store-merging.o.
	* common.opt (fstore-merging): New Optimization option.
	* opts.c (default_options_table): Add entry for
	OPT_ftree_store_merging.
	* params.def (PARAM_STORE_MERGING_ALLOW_UNALIGNED): Define.
	* passes.def: Insert pass_tree_store_merging.
	* tree-pass.h (make_pass_store_merging): Declare extern prototype.
	* gimple-ssa-store-merging.c: New file.
	* doc/invoke.texi (Optimization Options): Document -fstore-merging.

2016-09-06  Kyrylo Tkachov
	    Jakub Jelinek

	PR middle-end/22141
	* gcc.c-torture/execute/pr22141-1.c: New test.
	* gcc.c-torture/execute/pr22141-2.c: Likewise.
	* gcc.target/aarch64/ldp_stp_1.c: Adjust for -fstore-merging.
	* gcc.target/aarch64/ldp_stp_4.c: Likewise.
	* gcc.dg/store_merging_1.c: New test.
	* gcc.dg/store_merging_2.c: Likewise.
	* gcc.dg/store_merging_3.c: Likewise.
	* gcc.dg/store_merging_4.c: Likewise.
	* gcc.dg/store_merging_5.c: Likewise.
	* gcc.dg/store_merging_6.c: Likewise.
	* gcc.target/i386/pr22141.c: Likewise.
	* gcc.target/i386/pr34012.c: Add -fno-store-merging to dg-options.

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 5e3cbe5d736e2cf9429df989d9d95708b1243011..ca2415b79de60c9be44871777d57bcc8d5c07487 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1295,6 +1295,7 @@ OBJS = \
 	gimple-ssa-isolate-paths.o \
 	gimple-ssa-nonnull-compare.o \
 	gimple-ssa-split-paths.o \
+	gimple-ssa-store-merging.o \
 	gimple-ssa-strength-reduction.o \
 	gimple-streamer-in.o \
 	gimple-streamer-out.o \
diff --git a/gcc/common.opt b/gcc/common.opt
index 2004fbc7d0d7b1ca65d06cee0f7398f2c9e3fc64..d951ae7cd6fe8fdae17e17b8f5be7c636ea1986d 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1452,6 +1452,10 @@ fstrict-volatile-bitfields
 Common Report Var(flag_strict_volatile_bitfields) Init(-1) Optimization
 Force bitfield accesses to match their type width.
 
+fstore-merging
+Common Var(flag_store_merging) Optimization
+Use the tree store merging pass.
+
 fguess-branch-probability
 Common Report Var(flag_guess_branch_prob) Optimization
 Enable guessing of branch probabilities.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 87da1f1c12b718fa63c9b89fdd8f85fbc6b54cb0..2e2a9fdd92733beb6948efc7a5c4b3fc79da73a9 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -401,7 +401,7 @@ Objective-C and Objective-C++ Dialects}.
 -fsingle-precision-constant -fsplit-ivs-in-unroller @gol
 -fsplit-paths @gol
 -fsplit-wide-types -fssa-backprop -fssa-phiopt @gol
--fstdarg-opt -fstrict-aliasing @gol
+-fstdarg-opt -fstore-merging -fstrict-aliasing @gol
 -fstrict-overflow -fthread-jumps -ftracer -ftree-bit-ccp @gol
 -ftree-builtin-call-dce -ftree-ccp -ftree-ch @gol
 -ftree-coalesce-vars -ftree-copy-prop -ftree-dce -ftree-dominator-opts @gol
@@ -412,8 +412,8 @@ Objective-C and Objective-C++ Dialects}.
 -ftree-loop-vectorize @gol
 -ftree-parallelize-loops=@var{n} -ftree-pre -ftree-partial-pre -ftree-pta @gol
 -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol
--ftree-switch-conversion -ftree-tail-merge -ftree-ter @gol
--ftree-vectorize -ftree-vrp -funconstrained-commons @gol
+-ftree-switch-conversion -ftree-tail-merge @gol
+-ftree-ter -ftree-vectorize -ftree-vrp -funconstrained-commons @gol
 -funit-at-a-time
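The char-array merge buffer described in the cover letter can be modelled in plain C. This is only a sketch of the idea behind native_encode_expr/native_interpret_expr (the real functions operate on trees and handle target endianness and bitfields); `encode_into` and `merge_example` are invented names for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of the merged-store buffer: constants of assorted types are
   byte-encoded at their byte offsets, then the filled buffer is
   re-interpreted as one wide value so a single store can be emitted.  */
enum { MERGE_BUF_BYTES = 8 };  /* BITS_PER_WORD / 8 on a 64-bit target */

static void encode_into (unsigned char *buf, size_t offset,
                         const void *val, size_t size)
{
  assert (offset + size <= MERGE_BUF_BYTES);
  memcpy (buf + offset, val, size);  /* splat the value onto the buffer */
}

uint64_t merge_example (void)
{
  unsigned char buf[MERGE_BUF_BYTES] = { 0 };
  uint16_t a = 0x1234;
  uint16_t b = 0xbeef;
  float f = 1.0f;  /* floating point constants merge the same way */

  encode_into (buf, 0, &a, sizeof a);
  encode_into (buf, 2, &b, sizeof b);
  encode_into (buf, 4, &f, sizeof f);

  uint64_t wide;  /* the single merged constant */
  memcpy (&wide, buf, sizeof wide);
  return wide;
}
```

The point of the buffer representation is exactly this uniformity: anything that can be byte-encoded, including floats and short vectors, participates in the merge without type-specific logic.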