[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9

2009-10-29 Thread law at gcc dot gnu dot org


--- Comment #27 from law at gcc dot gnu dot org  2009-10-29 16:49 ---
Subject: Bug 40852

Author: law
Date: Thu Oct 29 16:48:00 2009
New Revision: 153715

URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=153715
Log:
Recorded merge of revisions
153580-153581,153584,153586-153600,153604,153606,153610,153613,153615-153618,153621,153643,153646-153648,153650-153652,153654-153667,153669-153671
via svnmerge from 
svn+ssh://l...@gcc.gnu.org/svn/gcc/trunk


  r153580 | gccadmin | 2009-10-26 18:17:26 -0600 (Mon, 26 Oct 2009) | 1 line

  Daily bump.

  r153581 | paolo | 2009-10-26 19:18:10 -0600 (Mon, 26 Oct 2009) | 6 lines

  2009-10-26  Paolo Carlini  paolo.carl...@oracle.com

* include/std/chrono (duration::duration(const duration)): Fix
per the straightforward resolution of DR 974.
* testsuite/20_util/duration/cons/dr974.cc: Add.

  r153584 | carrot | 2009-10-27 03:06:36 -0600 (Tue, 27 Oct 2009) | 16 lines

* target.h (have_conditional_execution): Add a new target hook
function.
* target-def.h (TARGET_HAVE_CONDITIONAL_EXECUTION): Likewise.
* targhooks.h (default_have_conditional_execution): Likewise.
* targhooks.c (default_have_conditional_execution): Likewise.
* doc/tm.texi (TARGET_HAVE_CONDITIONAL_EXECUTION): Document it.
* config/arm/arm.c (TARGET_HAVE_CONDITIONAL_EXECUTION): Define it.
(arm_have_conditional_execution): New function.
* ifcvt.c (noce_process_if_block, find_if_header,
cond_exec_find_if_block, dead_or_predicable): Change the usage of macro
HAVE_conditional_execution to a target hook call.
* recog.c (peephole2_optimize): Likewise.
* sched-rgn.c (add_branch_dependences): Likewise.
* final.c (asm_insn_count, final_scan_insn): Likewise.
* bb-reorder.c (HAVE_conditional_execution): Remove it.

  r153586 | ebotcazou | 2009-10-27 04:09:04 -0600 (Tue, 27 Oct 2009) | 1 line

  Fix nits

  r153587 | jakub | 2009-10-27 04:28:48 -0600 (Tue, 27 Oct 2009) | 3 lines

PR c++/41020
* g++.dg/lookup/extern-c-redecl5.C: Fix up regexp.

  r153588 | aldyh | 2009-10-27 05:18:12 -0600 (Tue, 27 Oct 2009) | 5 lines

PR bootstrap/41451
* fold-const.c (fold_binary_loc): Do not call
protected_set_expr_location.

  r153589 | rguenth | 2009-10-27 05:30:59 -0600 (Tue, 27 Oct 2009) | 5 lines

  2009-10-27  Richard Guenther  rguent...@suse.de

PR lto/41821
* gimple.c (gimple_types_compatible_p): Handle OFFSET_TYPE.

  r153590 | revitale | 2009-10-27 05:46:07 -0600 (Tue, 27 Oct 2009) | 1 line

  Fix PR40648 -- Fix misaligned store vectorizer patch

  r153591 | charlet | 2009-10-27 07:06:06 -0600 (Tue, 27 Oct 2009) | 16 lines

  2009-10-27  Arnaud Charlet  char...@adacore.com

* exp_aggr.adb: Fix comment.

  2009-10-27  Emmanuel Briot  br...@adacore.com

* prj-err.adb (Error_Msg): take into account continuation lines when
computing whether we have a warning.

  2009-10-27  Vasiliy Fofanov  fofa...@adacore.com

* make.adb, s-os_lib.adb, s-os_lib.ads (Create_Temp_Output_File): New
routine that is designed to create temp file descriptor specifically
for redirecting an output stream.

  r153592 | charlet | 2009-10-27 07:16:48 -0600 (Tue, 27 Oct 2009) | 45 lines

  2009-10-27  Vincent Celier  cel...@adacore.com

* makeutl.adb (Check_Source_Info_In_ALI): Do not recompile if a subunit
from the runtime is found, except if gnatmake switch -a is used and
this
subunit cannot be found.

  2009-10-27  Ed Schonberg  schonb...@adacore.com

* gnatbind.adb (gnatbind): When the -R option is selected, list
subunits
as well, for tools that need the complete closure of the main program.

  2009-10-27  Sergey Rybin  ry...@adacore.com

* gnat_ugn.texi: Minor updates.

  2009-10-27  Emmanuel Briot  br...@adacore.com

* prj-tree.adb (Free): Fix memory leak.

  2009-10-27  Vasiliy Fofanov  fofa...@adacore.com

* adaint.c, s-os_lib.adb (__gnat_create_output_file_new): New function
that ensures the file that is created is new. Use this function to make
sure there is no race condition if several processes are creating temp
files concurrently.

* s-os_lib.ads: Update comment.

  2009-10-27  Thomas Quinot  qui...@adacore.com

* sem_ch12.adb: Minor reformatting

  2009-10-27  Javier Miranda  mira...@adacore.com

* exp_ch4.ads (Integer_Promotion_Possible): New subprogram.
* exp_ch4.adb (Integer_Promotion_Possible): New subprogram.
(Expand_N_Type_Conversion): Replace code that checks if the integer
promotion of the operands is possible by a call to the new function
Integer_Promotion_Possible. Minor reformating because an enclosing
block is now not needed.
* checks.adb 

[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9

2009-10-28 Thread singler at gcc dot gnu dot org


--- Comment #23 from singler at gcc dot gnu dot org  2009-10-28 10:04 
---
Subject: Bug 40852

Author: singler
Date: Wed Oct 28 10:04:03 2009
New Revision: 153648

URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=153648
Log:
2009-10-28  Johannes Singler  sing...@kit.edu

PR libstdc++/40852
* include/parallel/multiseq_selection.h
(multiseq_partition, multiseq_selection):  Avoid intermediate
values exceeding the integer type range for very large inputs.


Modified:
trunk/libstdc++-v3/ChangeLog
trunk/libstdc++-v3/include/parallel/multiseq_selection.h


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852



[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9

2009-10-28 Thread singler at gcc dot gnu dot org


--- Comment #24 from singler at gcc dot gnu dot org  2009-10-28 10:04 
---
Subject: Bug 40852

Author: singler
Date: Wed Oct 28 10:04:35 2009
New Revision: 153649

URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=153649
Log:
2009-10-28  Johannes Singler  sing...@kit.edu

PR libstdc++/40852
* include/parallel/multiseq_selection.h
(multiseq_partition, multiseq_selection):  Avoid intermediate
values exceeding the integer type range for very large inputs.


Modified:
branches/gcc-4_4-branch/libstdc++-v3/ChangeLog
branches/gcc-4_4-branch/libstdc++-v3/include/parallel/multiseq_selection.h


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852



[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9

2009-10-28 Thread singler at gcc dot gnu dot org


--- Comment #25 from singler at gcc dot gnu dot org  2009-10-28 10:11 
---
Closing this bug.


-- 

singler at gcc dot gnu dot org changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution||FIXED


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852



[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9

2009-10-28 Thread paolo dot carlini at oracle dot com


--- Comment #26 from paolo dot carlini at oracle dot com  2009-10-28 10:44 
---
Fixed for 4.4.3 and mainline.


-- 

paolo dot carlini at oracle dot com changed:

   What|Removed |Added

   Target Milestone|--- |4.4.3


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852



[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9

2009-10-27 Thread jaffe at broadinstitute dot org


--- Comment #21 from jaffe at broadinstitute dot org  2009-10-27 09:45 
---
Subject: Re:  [parallel-mode] parallel sort run time
 increases ~10 fold when vector size gets over ~4*10^9

I tested the patch from comment #19, sorting X billion integers on a machine
having
32 processors and 256 GB memory, X = 4, 6, ..., 26.  The overall behavior is
very
close to linear.  For example, X = 4 took 1.02 minutes, whereas X = 20 took
5.22
minutes.  Very nice!


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852



[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9

2009-10-27 Thread paolo dot carlini at oracle dot com


--- Comment #22 from paolo dot carlini at oracle dot com  2009-10-27 15:53 
---
Patch regtests fine on x86_64-linux. Johannes, can you prepare a ChangeLog
entry, post and commit both? Thanks!


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852



[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9

2009-10-23 Thread singler at gcc dot gnu dot org


--- Comment #18 from singler at gcc dot gnu dot org  2009-10-23 10:00 
---
(In reply to comment #17)
 Is something known about the actual size of a, b, and c? 

They can be as large as the input size.

 Also, I don't know which is the required precision for the result: must be 
 exact if representable?

In the last iteration, __n == 0 = __total == __N, and then, the result must
absolutely be __rank, according to the specification.

Anyway, I think I have found a solution that is easier, faster, and avoids the
large intermediate altogether (see attached patch).  It also fixes similar
problems in two other locations.  However, this patch needs further thorough
testing.

Also, __n == 2 ^ __r - 1, so __n + 1 == 2 ^ __r, and the divisions could be
replaced by shifts.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852



[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9

2009-10-23 Thread singler at gcc dot gnu dot org


--- Comment #19 from singler at gcc dot gnu dot org  2009-10-23 10:01 
---
Created an attachment (id=18878)
 -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18878action=view)
Patch avoid large intermediates to avoid overflow, for trunk. 


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852



[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9

2009-10-23 Thread paolo dot carlini at oracle dot com


--- Comment #20 from paolo dot carlini at oracle dot com  2009-10-23 16:00 
---
Excellent. Let's wait a bit for feedback from people experiencing this issue
and then commit the patch, first mainline and then probably 4_4-branch too.
Make sure to also regression test the fix on a normal ;) machine...


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852



[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9

2009-10-22 Thread singler at gcc dot gnu dot org


--- Comment #9 from singler at gcc dot gnu dot org  2009-10-22 06:57 ---
I can reproduce the bug on my machine (2 Quadcore Nehalems, 48GB RAM)

4 x 10^9 ints: 65 seconds used in sort
5 x 10^9 ints: 193 seconds used in sort


-- 

singler at gcc dot gnu dot org changed:

   What|Removed |Added

 AssignedTo|unassigned at gcc dot gnu   |singler at gcc dot gnu dot
   |dot org |org
 Status|UNCONFIRMED |ASSIGNED
 Ever Confirmed|0   |1
   Last reconfirmed|-00-00 00:00:00 |2009-10-22 06:57:14
   date||


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852



[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9

2009-10-22 Thread singler at gcc dot gnu dot org


--- Comment #10 from singler at gcc dot gnu dot org  2009-10-22 07:15 
---
The problem is in multiseq_selection.h, where this line has an overflow

(static_castuint64_t(__total) * __rank / __N - __leftsize)

if (__total * __rank) exceeds 64 bits.  The quick fix is to use a temporary
double, which solves the original test case:

4 x 10^9 ints: 64 seconds used in sort
5 x 10^9 ints: 80 seconds used in sort

Find patches for branch (4.4) and trunk (4.5) attached.

However, I do not fully trust the double arithmetics yet, although some test
cases work.  Does anybody else know a better way to avoid an overflow in ((a *
b) / c) with only integer arithmetics and normal rounding?

Maybe I can find a way to avoid this calculation altogether.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852



[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9

2009-10-22 Thread singler at gcc dot gnu dot org


--- Comment #11 from singler at gcc dot gnu dot org  2009-10-22 07:16 
---
Created an attachment (id=18862)
 -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18862action=view)
Patch replacing uint64_t by double to avoid overflow, for trunk.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852



[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9

2009-10-22 Thread singler at gcc dot gnu dot org


--- Comment #12 from singler at gcc dot gnu dot org  2009-10-22 07:17 
---
Created an attachment (id=18863)
 -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18863action=view)
Patch replacing uint64_t by double to avoid overflow, for branch 4.4.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852



[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9

2009-10-22 Thread singler at gcc dot gnu dot org


--- Comment #13 from singler at gcc dot gnu dot org  2009-10-22 07:42 
---
(In reply to comment #10)

 However, I do not fully trust the double arithmetics yet, although some test
 cases work.

Er, this sounded a bit pessimistic, all sort tests I have tried so far work
with the patch.

And some more explanation:
The overflow resulted in erratic and thus very load balancing in the merge
step, causing the huge running times.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852



[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9

2009-10-22 Thread pluto at agmk dot net


--- Comment #14 from pluto at agmk dot net  2009-10-22 09:01 ---
(In reply to comment #10)

 However, I do not fully trust the double arithmetics yet, although some test
 cases work.  Does anybody else know a better way to avoid an overflow in ((a *
 b) / c) with only integer arithmetics and normal rounding?

you can use a 128-bit integer type on x86-64.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852



[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9

2009-10-22 Thread jaffe at broadinstitute dot org


--- Comment #15 from jaffe at broadinstitute dot org  2009-10-22 10:22 
---
Subject: Re:  [parallel-mode] parallel sort run time
 increases ~10 fold when vector size gets over ~4*10^9

Wonderful!  Thank you very much for fixing this problem.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852



[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9

2009-10-22 Thread singler at gcc dot gnu dot org


--- Comment #16 from singler at gcc dot gnu dot org  2009-10-22 16:41 
---
(In reply to comment #14)
 (In reply to comment #10)
 
  However, I do not fully trust the double arithmetics yet, although some test
  cases work.  Does anybody else know a better way to avoid an overflow in 
  ((a *
  b) / c) with only integer arithmetics and normal rounding?
 
 you can use a 128-bit integer type on x86-64.

Very good idea.
Do you know a good #ifdef clause to check its availability.  Is it really just
x64-64?
Also, I probably want to use it only when really needed, because I assume it to
be implemented in software, in particular the division.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852



[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9

2009-10-22 Thread paolo dot carlini at oracle dot com


--- Comment #17 from paolo dot carlini at oracle dot com  2009-10-22 17:46 
---
Is something known about the actual size of a, b, and c? Also, I don't know
which is the required precision for the result: must be exact if representable?
I suppose not, otherwise the suggestiong of using double would not make sense.
Depending on the answer to the above, there are various options, maybe checking
for a * b overflowing (if the quantities are all positive, then checking for
wraparound is easy) and then taking the appropriate actions.

Anyway, barring more sophisticated solutions, using long double seems a better
idea to me, because on most widespread targets a long double is at least 80
bits, with a mantissa of at least 64 bits, thus able to exactly represent any
long long integer.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852



[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9

2009-10-20 Thread singler at gcc dot gnu dot org


--- Comment #7 from singler at gcc dot gnu dot org  2009-10-20 07:46 ---
Sorry, the CC has never reached me.  
So concerning comment #4:  Was the parallel mode actually activated?
The multiway mergesort takes another time the space of the input as
temporarily.  Sure that the program was not swapping?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852



[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9

2009-10-20 Thread jaffe at broadinstitute dot org


--- Comment #8 from jaffe at broadinstitute dot org  2009-10-20 10:55 
---
Subject: Re:  [parallel-mode] parallel sort run time
 increases ~10 fold when vector size gets over ~4*10^9

Regarding comment #7, I just ran this now on a machine with 32 processors and
512 GB memory.

(a) Sorting 4 x 10^9 ints took 0.9 minutes.
(b) Sorting 5 x 10^9 ints took 16 minutes.

The second test used about 40 GB, which is a small fraction of the available
memory.

(c) Sorting 2.5 x 10^9 structures having 2 ints each took 1.1 minutes.

Regarding comment #6, repeating (a) and (b) with
__gnu_parallel::balanced_quicksort_tag( ):

(a') 6.3 minutes
(b') 8.1 minutes,

so the algorithm is slower on these data but does not exhibit the same jump in
runtime.
I also tried __gnu_parallel::quicksort_tag( ) which was about the same for (b)
[(a) not tested].


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852



[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9

2009-10-19 Thread jason at gcc dot gnu dot org


--- Comment #6 from jason at gcc dot gnu dot org  2009-10-19 18:07 ---
Have you tried selecting a different sort algorithm?  The default seems to be
the multi-way mergesort, but there are two quicksort options as well.


-- 

jason at gcc dot gnu dot org changed:

   What|Removed |Added

 CC||jason at gcc dot gnu dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852



[Bug libstdc++/40852] [parallel-mode] parallel sort run time increases ~10 fold when vector size gets over ~4*10^9

2009-07-24 Thread paolo dot carlini at oracle dot com


--- Comment #5 from paolo dot carlini at oracle dot com  2009-07-24 21:23 
---
So this is issue is just that you are not completely happy with the behavior of
parallel-mode. Ok... Let's see what Johannes thinks.


-- 

paolo dot carlini at oracle dot com changed:

   What|Removed |Added

Summary|parallel sort run time  |[parallel-mode] parallel
   |increases ~10 fold when |sort run time increases ~10
   |vector size gets over   |fold when vector size gets
   |~4*10^9 |over ~4*10^9


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40852