[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9
--- Comment #27 from law at gcc dot gnu dot org 2009-10-29 16:49 --- Subject: Bug 40852 Author: law Date: Thu Oct 29 16:48:00 2009 New Revision: 153715 URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=153715 Log: Recorded merge of revisions 153580-153581,153584,153586-153600,153604,153606,153610,153613,153615-153618,153621,153643,153646-153648,153650-153652,153654-153667,153669-153671 via svnmerge from svn+ssh://l...@gcc.gnu.org/svn/gcc/trunk r153580 | gccadmin | 2009-10-26 18:17:26 -0600 (Mon, 26 Oct 2009) | 1 line Daily bump. r153581 | paolo | 2009-10-26 19:18:10 -0600 (Mon, 26 Oct 2009) | 6 lines 2009-10-26 Paolo Carlini paolo.carl...@oracle.com * include/std/chrono (duration::duration(const duration)): Fix per the straightforward resolution of DR 974. * testsuite/20_util/duration/cons/dr974.cc: Add. r153584 | carrot | 2009-10-27 03:06:36 -0600 (Tue, 27 Oct 2009) | 16 lines * target.h (have_conditional_execution): Add a new target hook function. * target-def.h (TARGET_HAVE_CONDITIONAL_EXECUTION): Likewise. * targhooks.h (default_have_conditional_execution): Likewise. * targhooks.c (default_have_conditional_execution): Likewise. * doc/tm.texi (TARGET_HAVE_CONDITIONAL_EXECUTION): Document it. * config/arm/arm.c (TARGET_HAVE_CONDITIONAL_EXECUTION): Define it. (arm_have_conditional_execution): New function. * ifcvt.c (noce_process_if_block, find_if_header, cond_exec_find_if_block, dead_or_predicable): Change the usage of macro HAVE_conditional_execution to a target hook call. * recog.c (peephole2_optimize): Likewise. * sched-rgn.c (add_branch_dependences): Likewise. * final.c (asm_insn_count, final_scan_insn): Likewise. * bb-reorder.c (HAVE_conditional_execution): Remove it. r153586 | ebotcazou | 2009-10-27 04:09:04 -0600 (Tue, 27 Oct 2009) | 1 line Fix nits r153587 | jakub | 2009-10-27 04:28:48 -0600 (Tue, 27 Oct 2009) | 3 lines PR c++/41020 * g++.dg/lookup/extern-c-redecl5.C: Fix up regexp. r153588 | aldyh | 2009-10-27 05:18:12 -0600 (Tue, 27 Oct 2009) | 5 lines PR bootstrap/41451 * fold-const.c (fold_binary_loc): Do not call protected_set_expr_location. r153589 | rguenth | 2009-10-27 05:30:59 -0600 (Tue, 27 Oct 2009) | 5 lines 2009-10-27 Richard Guenther rguent...@suse.de PR lto/41821 * gimple.c (gimple_types_compatible_p): Handle OFFSET_TYPE. r153590 | revitale | 2009-10-27 05:46:07 -0600 (Tue, 27 Oct 2009) | 1 line Fix PR40648 -- Fix misaligned store vectorizer patch r153591 | charlet | 2009-10-27 07:06:06 -0600 (Tue, 27 Oct 2009) | 16 lines 2009-10-27 Arnaud Charlet char...@adacore.com * exp_aggr.adb: Fix comment. 2009-10-27 Emmanuel Briot br...@adacore.com * prj-err.adb (Error_Msg): take into account continuation lines when computing whether we have a warning. 2009-10-27 Vasiliy Fofanov fofa...@adacore.com * make.adb, s-os_lib.adb, s-os_lib.ads (Create_Temp_Output_File): New routine that is designed to create temp file descriptor specifically for redirecting an output stream. r153592 | charlet | 2009-10-27 07:16:48 -0600 (Tue, 27 Oct 2009) | 45 lines 2009-10-27 Vincent Celier cel...@adacore.com * makeutl.adb (Check_Source_Info_In_ALI): Do not recompile if a subunit from the runtime is found, except if gnatmake switch -a is used and this subunit cannot be found. 2009-10-27 Ed Schonberg schonb...@adacore.com * gnatbind.adb (gnatbind): When the -R option is selected, list subunits as well, for tools that need the complete closure of the main program. 2009-10-27 Sergey Rybin ry...@adacore.com * gnat_ugn.texi: Minor updates. 2009-10-27 Emmanuel Briot br...@adacore.com * prj-tree.adb (Free): Fix memory leak. 2009-10-27 Vasiliy Fofanov fofa...@adacore.com * adaint.c, s-os_lib.adb (__gnat_create_output_file_new): New function that ensures the file that is created is new. Use this function to make sure there is no race condition if several processes are creating temp files concurrently. * s-os_lib.ads: Update comment. 2009-10-27 Thomas Quinot qui...@adacore.com * sem_ch12.adb: Minor reformatting 2009-10-27 Javier Miranda mira...@adacore.com * exp_ch4.ads (Integer_Promotion_Possible): New subprogram. * exp_ch4.adb (Integer_Promotion_Possible): New subprogram. (Expand_N_Type_Conversion): Replace code that checks if the integer promotion of the operands is possible by a call to the new function Integer_Promotion_Possible. Minor reformating because an enclosing block is now not needed. * checks.adb
[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9
--- Comment #23 from singler at gcc dot gnu dot org 2009-10-28 10:04 --- Subject: Bug 40852 Author: singler Date: Wed Oct 28 10:04:03 2009 New Revision: 153648 URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=153648 Log: 2009-10-28 Johannes Singler sing...@kit.edu PR libstdc++/40852 * include/parallel/multiseq_selection.h (multiseq_partition, multiseq_selection): Avoid intermediate values exceeding the integer type range for very large inputs. Modified: trunk/libstdc++-v3/ChangeLog trunk/libstdc++-v3/include/parallel/multiseq_selection.h -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852
[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9
--- Comment #24 from singler at gcc dot gnu dot org 2009-10-28 10:04 --- Subject: Bug 40852 Author: singler Date: Wed Oct 28 10:04:35 2009 New Revision: 153649 URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=153649 Log: 2009-10-28 Johannes Singler sing...@kit.edu PR libstdc++/40852 * include/parallel/multiseq_selection.h (multiseq_partition, multiseq_selection): Avoid intermediate values exceeding the integer type range for very large inputs. Modified: branches/gcc-4_4-branch/libstdc++-v3/ChangeLog branches/gcc-4_4-branch/libstdc++-v3/include/parallel/multiseq_selection.h -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852
[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9
--- Comment #25 from singler at gcc dot gnu dot org 2009-10-28 10:11 --- Closing this bug. -- singler at gcc dot gnu dot org changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution||FIXED http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852
[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9
--- Comment #26 from paolo dot carlini at oracle dot com 2009-10-28 10:44 --- Fixed for 4.4.3 and mainline. -- paolo dot carlini at oracle dot com changed: What|Removed |Added Target Milestone|--- |4.4.3 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852
[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9
--- Comment #21 from jaffe at broadinstitute dot org 2009-10-27 09:45 --- Subject: Re: [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9 I tested the patch from comment #19, sorting X billion integers on a machine having 32 processors and 256 GB memory, X = 4, 6, ..., 26. The overall behavior is very close to linear. For example, X = 4 took 1.02 minutes, whereas X = 20 took 5.22 minutes. Very nice! -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852
[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9
--- Comment #22 from paolo dot carlini at oracle dot com 2009-10-27 15:53 --- Patch regtests fine on x86_64-linux. Johannes, can you prepare a ChangeLog entry, post and commit both? Thanks! -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852
[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9
--- Comment #18 from singler at gcc dot gnu dot org 2009-10-23 10:00 --- (In reply to comment #17) Is something known about the actual size of a, b, and c? They can be as large as the input size. Also, I don't know which is the required precision for the result: must be exact if representable? In the last iteration, __n == 0 = __total == __N, and then, the result must absolutely be __rank, according to the specification. Anyway, I think I have found a solution that is easier, faster, and avoids the large intermediate altogether (see attached patch). It also fixes similar problems in two other locations. However, this patch needs further thorough testing. Also, __n == 2 ^ __r - 1, so __n + 1 == 2 ^ __r, and the divisions could be replaced by shifts. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852
[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9
--- Comment #19 from singler at gcc dot gnu dot org 2009-10-23 10:01 --- Created an attachment (id=18878) -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18878action=view) Patch avoid large intermediates to avoid overflow, for trunk. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852
[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9
--- Comment #20 from paolo dot carlini at oracle dot com 2009-10-23 16:00 --- Excellent. Let's wait a bit for feedback from people experiencing this issue and then commit the patch, first mainline and then probably 4_4-branch too. Make sure to also regression test the fix on a normal ;) machine... -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852
[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9
--- Comment #9 from singler at gcc dot gnu dot org 2009-10-22 06:57 --- I can reproduce the bug on my machine (2 Quadcore Nehalems, 48GB RAM) 4 x 10^9 ints: 65 seconds used in sort 5 x 10^9 ints: 193 seconds used in sort -- singler at gcc dot gnu dot org changed: What|Removed |Added AssignedTo|unassigned at gcc dot gnu |singler at gcc dot gnu dot |dot org |org Status|UNCONFIRMED |ASSIGNED Ever Confirmed|0 |1 Last reconfirmed|-00-00 00:00:00 |2009-10-22 06:57:14 date|| http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852
[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9
--- Comment #10 from singler at gcc dot gnu dot org 2009-10-22 07:15 --- The problem is in multiseq_selection.h, where this line has an overflow (static_castuint64_t(__total) * __rank / __N - __leftsize) if (__total * __rank) exceeds 64 bits. The quick fix is to use a temporary double, which solves the original test case: 4 x 10^9 ints: 64 seconds used in sort 5 x 10^9 ints: 80 seconds used in sort Find patches for branch (4.4) and trunk (4.5) attached. However, I do not fully trust the double arithmetics yet, although some test cases work. Does anybody else know a better way to avoid an overflow in ((a * b) / c) with only integer arithmetics and normal rounding? Maybe I can find a way to avoid this calculation altogether. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852
[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9
--- Comment #11 from singler at gcc dot gnu dot org 2009-10-22 07:16 --- Created an attachment (id=18862) -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18862action=view) Patch replacing uint64_t by double to avoid overflow, for trunk. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852
[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9
--- Comment #12 from singler at gcc dot gnu dot org 2009-10-22 07:17 --- Created an attachment (id=18863) -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18863action=view) Patch replacing uint64_t by double to avoid overflow, for branch 4.4. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852
[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9
--- Comment #13 from singler at gcc dot gnu dot org 2009-10-22 07:42 --- (In reply to comment #10) However, I do not fully trust the double arithmetics yet, although some test cases work. Er, this sounded a bit pessimistic, all sort tests I have tried so far work with the patch. And some more explanation: The overflow resulted in erratic and thus very load balancing in the merge step, causing the huge running times. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852
[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9
--- Comment #14 from pluto at agmk dot net 2009-10-22 09:01 --- (In reply to comment #10) However, I do not fully trust the double arithmetics yet, although some test cases work. Does anybody else know a better way to avoid an overflow in ((a * b) / c) with only integer arithmetics and normal rounding? you can use a 128-bit integer type on x86-64. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852
[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9
--- Comment #15 from jaffe at broadinstitute dot org 2009-10-22 10:22 --- Subject: Re: [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9 Wonderful! Thank you very much for fixing this problem. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852
[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9
--- Comment #16 from singler at gcc dot gnu dot org 2009-10-22 16:41 --- (In reply to comment #14) (In reply to comment #10) However, I do not fully trust the double arithmetics yet, although some test cases work. Does anybody else know a better way to avoid an overflow in ((a * b) / c) with only integer arithmetics and normal rounding? you can use a 128-bit integer type on x86-64. Very good idea. Do you know a good #ifdef clause to check its availability. Is it really just x64-64? Also, I probably want to use it only when really needed, because I assume it to be implemented in software, in particular the division. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852
[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9
--- Comment #17 from paolo dot carlini at oracle dot com 2009-10-22 17:46 --- Is something known about the actual size of a, b, and c? Also, I don't know which is the required precision for the result: must be exact if representable? I suppose not, otherwise the suggestiong of using double would not make sense. Depending on the answer to the above, there are various options, maybe checking for a * b overflowing (if the quantities are all positive, then checking for wraparound is easy) and then taking the appropriate actions. Anyway, barring more sophisticated solutions, using long double seems a better idea to me, because on most widespread targets a long double is at least 80 bits, with a mantissa of at least 64 bits, thus able to exactly represent any long long integer. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852
[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9
--- Comment #7 from singler at gcc dot gnu dot org 2009-10-20 07:46 --- Sorry, the CC has never reached me. So concerning comment #4: Was the parallel mode actually activated? The multiway mergesort takes another time the space of the input as temporarily. Sure that the program was not swapping? -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852
[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9
--- Comment #8 from jaffe at broadinstitute dot org 2009-10-20 10:55 --- Subject: Re: [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9 Regarding comment #7, I just ran this now on a machine with 32 processors and 512 GB memory. (a) Sorting 4 x 10^9 ints took 0.9 minutes. (b) Sorting 5 x 10^9 ints took 16 minutes. The second test used about 40 GB, which is a small fraction of the available memory. (c) Sorting 2.5 x 10^9 structures having 2 ints each took 1.1 minutes. Regarding comment #6, repeating (a) and (b) with __gnu_parallel::balanced_quicksort_tag( ): (a') 6.3 minutes (b') 8.1 minutes, so the algorithm is slower on these data but does not exhibit the same jump in runtime. I also tried __gnu_parallel::quicksort_tag( ) which was about the same for (b) [(a) not tested]. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852
[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9
--- Comment #6 from jason at gcc dot gnu dot org 2009-10-19 18:07 --- Have you tried selecting a different sort algorithm? The default seems to be the multi-way mergesort, but there are two quicksort options as well. -- jason at gcc dot gnu dot org changed: What|Removed |Added CC||jason at gcc dot gnu dot org http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852
[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9
--- Comment #5 from paolo dot carlini at oracle dot com 2009-07-24 21:23 --- So this is issue is just that you are not completely happy with the behavior of parallel-mode. Ok... Let's see what Johannes thinks. -- paolo dot carlini at oracle dot com changed: What|Removed |Added Summary|parallel sort run time |[parallel-mode] parallel |increases ~10 fold when |sort run time increases ~10 |vector size gets over |fold when vector size gets |~4*10^9 |over ~4*10^9 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852