[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2012-01-12 Thread pinskia at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706

Andrew Pinski pinskia at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution||FIXED
   Target Milestone|4.5.3   |4.6.0

--- Comment #29 from Andrew Pinski pinskia at gcc dot gnu.org 2012-01-12 
20:32:38 UTC ---
Fixed.


[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-12-16 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706

Richard Guenther rguenth at gcc dot gnu.org changed:

   What|Removed |Added

   Target Milestone|4.5.2   |4.5.3

--- Comment #28 from Richard Guenther rguenth at gcc dot gnu.org 2010-12-16 
13:02:46 UTC ---
GCC 4.5.2 is being released, adjusting target milestone.


[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-12-02 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706

--- Comment #27 from Jakub Jelinek jakub at gcc dot gnu.org 2010-12-02 
14:31:31 UTC ---
Author: jakub
Date: Thu Dec  2 14:31:27 2010
New Revision: 167371

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=167371
Log:
PR libgomp/43706
* env.c (initialize_env): Default to spin count 300000
instead of 20000000 if neither OMP_WAIT_POLICY nor GOMP_SPINCOUNT
is specified.

Modified:
trunk/libgomp/ChangeLog
trunk/libgomp/env.c
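
For orientation, a hedged paraphrase of what this change amounts to (not the
literal env.c code; the helper and its parameters are made up, only the two
constants come from the log above):

/* Hedged paraphrase of the r167371 change: when the user sets neither
   OMP_WAIT_POLICY nor GOMP_SPINCOUNT, the spin budget drops from 20000000
   to 300000.  Other cases are intentionally left out of this sketch.  */
static unsigned long
default_spin_count (int wait_policy_was_set, int spincount_was_set,
                    unsigned long user_value, unsigned long other_default)
{
  if (spincount_was_set)
    return user_value;            /* an explicit GOMP_SPINCOUNT wins */
  if (!wait_policy_was_set)
    return 300000;                /* new default; previously 20000000 */
  return other_default;           /* OMP_WAIT_POLICY given: unchanged here */
}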


[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-11-15 Thread singler at kit dot edu
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706

--- Comment #26 from Johannes Singler singler at kit dot edu 2010-11-15 
08:53:12 UTC ---
(In reply to comment #25)
> You might have misread what I wrote.  I did not mention 35 tests; I mentioned
> that a test became slower by 35%.  The total number of different tests was 4
> (and each was invoked multiple times per spincount setting, indeed).  One out
> of four stayed 35% slower until I increased GOMP_SPINCOUNT to 200000.

Sorry, I got that wrong.  

> This makes some sense, but the job of an optimizing compiler and runtime
> libraries is to deliver the best performance they can even with somewhat
> non-optimal source code.

I agree with that in principle.  But please keep in mind that, as things stand,
the very simple test case posted here takes a serious performance hit.  And
repeated parallel loops like the one in the test program certainly appear very
often in real applications.
BTW:  How does the test case react to this change on your machine?

> There are plenty of real-world cases where spending time on application
> redesign for speed is unreasonable or can only be completed at a later time -
> yet it is desirable to squeeze a little bit of extra performance out of the
> existing code.  There are also cases where more efficient parallelization -
> implemented at a higher level to avoid frequent switches between parallel and
> sequential execution - makes the application harder to use.  To me, one of
> the very reasons to use OpenMP was to avoid/postpone that redesign and the
> user-visible complication for now.  If I went for a more efficient
> higher-level solution, I would not need OpenMP in the first place.

OpenMP should not be regarded as only good for loop parallelization.  With
the new task construct, it is a fully-fledged parallelization substrate.

> > So I would suggest a threshold of 100000 for now.
>
> My suggestion is 250000.

Well, that's already much better than staying with 20,000,000, so I agree.

> > IMHO, something should really be done about this problem before the 4.6
> > release.
>
> Agreed.  It'd be best to have a code fix, though.

IMHO, there is no obvious way to fix this in principle.  There will always be a
compromise between busy waiting and giving back control to the OS.

Jakub, what do you plan to do about this problem?


[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-11-12 Thread singler at kit dot edu
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706

--- Comment #24 from Johannes Singler singler at kit dot edu 2010-11-12 
08:15:56 UTC ---
If only one out of 35 tests becomes slower, I would rather blame it on this one
(probably badly parallelized) application than on the OpenMP runtime system.
So I would suggest a threshold of 100000 for now.  IMHO, something should
really be done about this problem before the 4.6 release.


[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-11-12 Thread solar-gcc at openwall dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706

--- Comment #25 from Alexander Peslyak solar-gcc at openwall dot com 
2010-11-12 11:19:13 UTC ---
(In reply to comment #24)
> If only one out of 35 tests becomes slower,

You might have misread what I wrote.  I did not mention 35 tests; I mentioned
that a test became slower by 35%.  The total number of different tests was 4
(and each was invoked multiple times per spincount setting, indeed).  One out
of four stayed 35% slower until I increased GOMP_SPINCOUNT to 200000.

> I would rather blame it on this one (probably badly parallelized)
> application than on the OpenMP runtime system.

This makes some sense, but the job of an optimizing compiler and runtime
libraries is to deliver the best performance they can even with somewhat
non-optimal source code.  There are plenty of real-world cases where spending
time on application redesign for speed is unreasonable or can only be completed
at a later time - yet it is desirable to squeeze a little bit of extra
performance out of the existing code.  There are also cases where more
efficient parallelization - implemented at a higher level to avoid frequent
switches between parallel and sequential execution - makes the application
harder to use.  To me, one of the very reasons to use OpenMP was to
avoid/postpone that redesign and the user-visible complication for now.  If I
went for a more efficient higher-level solution, I would not need OpenMP in the
first place.

> So I would suggest a threshold of 100000 for now.

My suggestion is 250000.

> IMHO, something should really be done about this problem before the 4.6
> release.

Agreed.  It'd be best to have a code fix, though.


[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-11-09 Thread solar-gcc at openwall dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706

--- Comment #23 from Alexander Peslyak solar-gcc at openwall dot com 
2010-11-09 16:32:53 UTC ---
(In reply to comment #20)
> Maybe we could agree on a compromise for a start.  Alexander, what are the
> corresponding results for GOMP_SPINCOUNT=100000?

I reproduced slowdowns of 5% to 35% (on different pieces of code) on an
otherwise-idle dual-E5520 system (16 logical CPUs) when going from gcc 4.5.0's
defaults to GOMP_SPINCOUNT=10000.  On all but one test, the original full speed
is restored with GOMP_SPINCOUNT=100000.  On the remaining test, the threshold
appears to be between 100000 (still 35% slower than full speed) and 200000
(original full speed).  So if we're not going to have a code fix soon enough,
maybe the new default should be slightly higher than 200000.  It won't help as
much as 10000 would for cases where this is needed, but it would be of some
help.


[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-09-05 Thread solar-gcc at openwall dot com


--- Comment #22 from solar-gcc at openwall dot com  2010-09-05 11:37 ---
(In reply to comment #20)
> Maybe we could agree on a compromise for a start.  Alexander, what are the
> corresponding results for GOMP_SPINCOUNT=100000?

Unfortunately, I no longer have access to the dual-X5550 system, and I did not
try other values for this parameter when I was benchmarking that system.  On
systems that I do currently have access to, the slowdown from
GOMP_SPINCOUNT=10000 was typically no more than 10% (and most of the time there
was either no effect or a substantial speedup).  I can try 100000 on those,
although it'd be difficult to tell the difference from 10000 because of the
changing load.  I'll plan on doing this the next time I run this sort of
benchmark.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706



[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-09-01 Thread jakub at gcc dot gnu dot org


--- Comment #21 from jakub at gcc dot gnu dot org  2010-09-01 16:38 ---
*** Bug 45485 has been marked as a duplicate of this bug. ***


-- 

jakub at gcc dot gnu dot org changed:

   What|Removed |Added

 CC||h dot vogt at gom dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706



[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-08-30 Thread singler at kit dot edu


--- Comment #20 from singler at kit dot edu  2010-08-30 08:41 ---
Maybe we could agree on a compromise for a start.  Alexander, what are the
corresponding results for GOMP_SPINCOUNT=100000?


-- 

singler at kit dot edu changed:

   What|Removed |Added

 CC||singler at kit dot edu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706



[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-08-24 Thread solar-gcc at openwall dot com


--- Comment #17 from solar-gcc at openwall dot com  2010-08-24 11:07 ---
(In reply to comment #16)
> I would really like to see this bug tackled.

I second that.

> Fixing it is easily done by lowering the spin count as proposed.  Otherwise,
> please show cases where a low spin count hurts performance.

Unfortunately, yes, I've since identified real-world test cases where
GOMP_SPINCOUNT=10000 hurts performance significantly (compared to gcc 4.5.0's
default).  Specifically, this was the case when I experimented with my John the
Ripper patches on a dual-X5550 system (16 logical CPUs).  On a few
real-world'ish runs, GOMP_SPINCOUNT=10000 would halve the speed.  On most other
tests I ran, it would slow things down by about 10%.  That's on an otherwise
idle system.  I was surprised, as I had previously only seen
GOMP_SPINCOUNT=10000 hurt performance on systems with server-like unrelated
load (and it would help tremendously with certain other kinds of load).

> In general, for a tuning parameter, a rather good-natured value should be
> preferred over a value that gives the best results in one case, but very bad
> ones in another case.

In general, I agree.  Even the 50% worst-case slowdown I observed with
GOMP_SPINCOUNT=10000 is not as bad as the 400x worst-case slowdown observed
without that option.  On the other hand, a 50% slowdown would be fatal when
comparing libgomp against competing implementations.  Also, HPC cluster nodes
may well be allocated such that there's no other load on each individual node.
So having the defaults tuned for a system with no other load makes some sense
to me, and I am really unsure whether simply changing the defaults is the
proper fix here.

I'd be happy to see this problem fixed differently, such that the unacceptable
slowdowns are avoided in both cases.  Maybe the new default could be to
auto-tune the setting while the program is running?

Meanwhile, if it's going to take a long time until we have a code fix, perhaps
the problem and the workaround need to be documented prominently.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706



[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-08-24 Thread jakub at gcc dot gnu dot org


--- Comment #18 from jakub at gcc dot gnu dot org  2010-08-24 11:40 ---
For the auto-tuning, ideally the kernel would tell the thread when it lost the
CPU; I doubt there is any API for that currently.  E.g. a thread could register
with the kernel an address where the kernel would store some value (e.g. zero)
or do an atomic increment or decrement upon taking the CPU away from the
thread.  Then, at the start of the spinning, libgomp could initialize that flag
and check it from time to time (say every few hundred or thousand iterations)
to see whether it has lost the CPU.  If it has, it would continue with just a
couple of spins and then go to sleep, and ensure the spin count gets
dynamically adjusted next time, or something similar.

Currently I'm afraid the only way to adjust dynamically is to read/parse
/proc/loadavg from time to time and, if it goes above the number of available
CPUs or something similar, start throttling down.
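
For illustration, a minimal sketch of the second idea, using getloadavg() (the
glibc wrapper around the same data as /proc/loadavg); the function name,
threshold and fallback are assumptions, not anything libgomp does today:

#include <stdlib.h>                    /* getloadavg (glibc/BSD) */
#include <unistd.h>                    /* sysconf */

/* Return nonzero if the 1-minute load average exceeds the number of online
   CPUs, i.e. the machine looks oversubscribed and spinning should be cut
   short.  */
static int
system_looks_oversubscribed (void)
{
  double load[1];
  long ncpus = sysconf (_SC_NPROCESSORS_ONLN);
  if (ncpus > 0 && getloadavg (load, 1) == 1)
    return load[0] > (double) ncpus;
  return 0;                            /* cannot tell: keep current behaviour */
}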


-- 

jakub at gcc dot gnu dot org changed:

   What|Removed |Added

 CC||drepper at redhat dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706



[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-08-24 Thread solar-gcc at openwall dot com


--- Comment #19 from solar-gcc at openwall dot com  2010-08-24 12:18 ---
(In reply to comment #18)
> Then, at the start of the spinning, libgomp could initialize that flag and
> check it from time to time (say every few hundred or thousand iterations) to
> see whether it has lost the CPU.

Without a kernel API like that, you can achieve a similar effect by issuing the
rdtsc instruction (or its equivalents for non-x86 archs) and seeing whether the
cycle counter changes unexpectedly (say, by 1000 or more for a single loop
iteration), which would indicate that there was a context switch.  For an
arch-independent implementation, you could also use a syscall such as times(2)
or gettimeofday(2), but then you'd need to do it very infrequently (e.g., maybe
just to check for a context switch every 10k to 20k spins).
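
A minimal x86-only sketch of this idea (the function name, spin budget, and
the reuse of the 1000-cycle figure from above are illustrative assumptions):

#include <stdint.h>
#include <x86intrin.h>                 /* __rdtsc(); x86 only */

/* Spin waiting for *addr to become nonzero, but give up early if the TSC
   jumps by an unexpectedly large amount between two iterations, which
   suggests we were context-switched away.  */
static int
spin_unless_preempted (volatile int *addr, unsigned long max_spins)
{
  uint64_t prev = __rdtsc ();
  for (unsigned long i = 0; i < max_spins; i++)
    {
      if (*addr)
        return 1;                      /* condition became true while spinning */
      uint64_t now = __rdtsc ();
      if (now - prev > 1000)           /* ~1000+ cycles for one iteration:
                                          probably lost the CPU */
        return 0;                      /* caller should block in the OS instead */
      prev = now;
      __builtin_ia32_pause ();         /* cpu_relax-style hint */
    }
  return *addr;
}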


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706



[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-08-13 Thread singler at kit dot edu


--- Comment #16 from singler at kit dot edu  2010-08-13 15:48 ---
I would really like to see this bug tackled.  It has been confirmed two more
times. 

Fixing it is easily done by lowering the spin count as proposed.  Otherwise,
please show cases where a low spin count hurts performance.

In general, for a tuning parameter, a rather good-natured value should be
preferred over a value that gives the best results in one case, but very bad
ones in another case.


-- 

singler at kit dot edu changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
 Ever Confirmed|0   |1
   Last reconfirmed|0000-00-00 00:00:00 |2010-08-13 15:48:18
   date||
   Target Milestone|4.4.5   |4.5.2


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706



[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-07-30 Thread johnfb at mail dot utexas dot edu


--- Comment #15 from johnfb at mail dot utexas dot edu  2010-07-30 14:00 
---
We have also had some trouble with this issue.  We found that, in general, if
we were running on a machine with hardware threads (i.e., Intel's
Hyper-Threading), then performance was poor.  Most of our runs were on a
machine with an Intel Xeon L5530 running RHEL 5.  Profiling showed that our
program was spending 50% of its time inside libgomp.  Setting either
GOMP_SPINCOUNT or OMP_WAIT_POLICY as discussed in this thread increased
performance greatly.  Experiments with disabling and enabling cores under the
default OMP settings showed that, on some runs, once the Hyper-Threading cores
came online, performance dipped below what we got with only one core enabled.

A little thought about how hardware threads are implemented makes it obvious
why spinning for more than a few cycles will cause performance problems.  If
one hardware thread spins, then all other threads on that core may be starved,
as resources are shared between the hardware threads of a core.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706



[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-07-01 Thread solar-gcc at openwall dot com


--- Comment #14 from solar-gcc at openwall dot com  2010-07-02 01:39 ---
We're also seeing this problem on OpenMP-using code built with gcc 4.5.0
release on linux-x86_64.  Here's a user's report (400x slowdown on an 8-core
system when there's a single other process running on a CPU):

http://www.openwall.com/lists/john-users/2010/06/30/3

Here's my confirmation of the problem report (I easily reproduced similar
slowdowns), and workarounds:

http://www.openwall.com/lists/john-users/2010/06/30/6

GOMP_SPINCOUNT=10000 (this specific value) turned out to be nearly optimal in
cases affected by this problem, as well as on idle systems, although I was also
able to identify cases (with server-like unrelated load: short requests to many
processes, which quickly go back to sleep) where this setting lowered the
measured best-case speed by 15% (over multiple benchmark invocations), even
though it might have improved the average speed even in those cases.

All of this is reproducible with John the Ripper 1.7.6 release on Blowfish
hashes (john --test --format=bf) and with the -omp-des patch (current
revision is 1.7.6-omp-des-4) on DES-based crypt(3) hashes (john --test
--format=des).  The use of OpenMP needs to be enabled by uncommenting the
OMPFLAGS line in the Makefile.  JtR and the patch can be downloaded from:

http://www.openwall.com/john/
http://openwall.info/wiki/john/patches

To reproduce the problem, it is sufficient to have one other CPU-using process
running when invoking the John benchmark.  I was using a non-OpenMP build of
John itself as that other process.

Overall, besides this specific bug, OpenMP-using programs are very sensitive
to other system load - e.g., unrelated server-like load of 10% often slows an
OpenMP program down by 50%.  Any improvements in this area would be very
welcome.  However, this specific bug is extreme, with its 400x slowdowns, so
perhaps it should be treated with priority.

Jakub - thank you for your work on gcc's OpenMP support.  The ease of use is
great!


-- 

solar-gcc at openwall dot com changed:

   What|Removed |Added

 CC||solar-gcc at openwall dot
   ||com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706



[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-04-30 Thread jakub at gcc dot gnu dot org


-- 

jakub at gcc dot gnu dot org changed:

   What|Removed |Added

   Target Milestone|4.4.4   |4.4.5


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706



[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-04-23 Thread singler at kit dot edu


--- Comment #13 from singler at kit dot edu  2010-04-23 14:17 ---
The default spin count is not 2,000,000 cycles, but 20,000,000.  As commented
in libgomp/env.c, this is supposed to correspond to 200 ms.  The timings we see
here are even larger, but the number of cycles is just a rough estimate.

Throttling the spin count when it detects too many threads is a good idea, but
it is just a heuristic.  If there are other processes, the cores might be
loaded anyway, and libgomp has little chance of figuring that out.  This gets
even more difficult when multiple programs use libgomp at the same time.  So I
would like the non-throttled value to be chosen more conservatively, better
balancing worst-case behavior in difficult situations against best-case
behavior on an unloaded machine.

There are algorithms in the libstdc++ parallel mode that show speedups for less
than 1 ms of sequential running time (when taking threads from the pool), so
users will accept a parallelization overhead for such small computing times.
However, if they are then hit by a 200 ms penalty, this results in catastrophic
slowdowns.  Calling such short-lived parallel regions several times makes this
very noticeable, although it need not be.  So IMHO, by default, the spinning
should take about as long as rescheduling a thread that was already migrated to
another core, thereby making things at most twice as bad as in the best case.
From my experience, this is a matter of a few milliseconds, so I propose
lowering the default spin count to something like 10,000, at most 100,000.  I
think that spinning for even longer than a usual time slice, as happens now, is
questionable anyway.
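
As a rough sanity check of these numbers (assuming on the order of 10 ns per
spin iteration, i.e. a few tens of cycles on a recent CPU; purely illustrative):

  20,000,000 spins * ~10 ns ~= 200 ms   (the current default, matching the
                                         200 ms estimate quoted from env.c)
     100,000 spins * ~10 ns ~=   1 ms
      10,000 spins * ~10 ns ~= 0.1 ms   (the range proposed above)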

Are nested threads taken into account when deciding on whether to throttle or
not?


-- 

singler at kit dot edu changed:

   What|Removed |Added

 Status|RESOLVED|UNCONFIRMED
 Resolution|FIXED   |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706



[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-04-21 Thread jakub at gcc dot gnu dot org


--- Comment #9 from jakub at gcc dot gnu dot org  2010-04-21 14:00 ---
Subject: Bug 43706

Author: jakub
Date: Wed Apr 21 13:59:39 2010
New Revision: 158600

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=158600
Log:
PR libgomp/43706
* config/linux/affinity.c (gomp_init_affinity): Decrease
gomp_available_cpus if affinity mask confines the process to fewer
CPUs.
* config/linux/proc.c (get_num_procs): If gomp_cpu_affinity is
non-NULL, just return gomp_available_cpus.

Modified:
branches/gcc-4_5-branch/libgomp/ChangeLog
branches/gcc-4_5-branch/libgomp/config/linux/affinity.c
branches/gcc-4_5-branch/libgomp/config/linux/proc.c


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706



[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-04-21 Thread jakub at gcc dot gnu dot org


--- Comment #10 from jakub at gcc dot gnu dot org  2010-04-21 14:01 ---
Subject: Bug 43706

Author: jakub
Date: Wed Apr 21 14:00:10 2010
New Revision: 158601

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=158601
Log:
PR libgomp/43706
* config/linux/affinity.c (gomp_init_affinity): Decrease
gomp_available_cpus if affinity mask confines the process to fewer
CPUs.
* config/linux/proc.c (get_num_procs): If gomp_cpu_affinity is
non-NULL, just return gomp_available_cpus.

Modified:
branches/gcc-4_4-branch/libgomp/ChangeLog
branches/gcc-4_4-branch/libgomp/config/linux/affinity.c
branches/gcc-4_4-branch/libgomp/config/linux/proc.c


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706



[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-04-21 Thread jakub at gcc dot gnu dot org


--- Comment #11 from jakub at gcc dot gnu dot org  2010-04-21 14:05 ---
GOMP_CPU_AFFINITY vs. throttling fixed for 4.4.4/4.5.1/4.6.0.


-- 

jakub at gcc dot gnu dot org changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution||FIXED


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706



[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-04-21 Thread jakub at gcc dot gnu dot org


-- 

jakub at gcc dot gnu dot org changed:

   What|Removed |Added

   Target Milestone|--- |4.4.4


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706



[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-04-21 Thread mika dot fischer at kit dot edu


--- Comment #12 from mika dot fischer at kit dot edu  2010-04-21 14:23 
---
Just to make it clear, this patch most probably does not fix the issue we're
having, since it can be triggered without using GOMP_CPU_AFFINITY.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706



[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-04-20 Thread jakub at gcc dot gnu dot org


--- Comment #5 from jakub at gcc dot gnu dot org  2010-04-20 10:23 ---
For performance reasons libgomp uses some busy waiting, which of course works
well when there are available CPUs and cycles to burn (decreases latency a
lot), but if you have more threads than CPUs it can make things worse.
You can tweak this through OMP_WAIT_POLICY and GOMP_SPINCOUNT env vars.
Although the implementation recognizes two kinds of spin counts (normal and
throttled, the latter in use when number of threads is bigger than number of
available CPUs), in some cases even that default might be too large (the
default
for throttled spin count is 1000 spins for OMP_WAIT_POLICY=active and 100 spins
for no OMP_WAIT_POLICY in environment).
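
For illustration, a minimal sketch of the general busy-wait-then-block scheme
described above (this is not libgomp's actual implementation; the name and the
Linux-only futex call are a simplified approximation):

#include <unistd.h>                    /* syscall */
#include <sys/syscall.h>               /* SYS_futex */
#include <linux/futex.h>               /* FUTEX_WAIT */

/* Spin up to spin_count times waiting for *addr to change from val, then
   fall back to blocking in the kernel until another thread wakes us.  */
static void
spin_then_block (int *addr, int val, unsigned long spin_count)
{
  for (unsigned long i = 0; i < spin_count; i++)
    {
      if (__atomic_load_n (addr, __ATOMIC_RELAXED) != val)
        return;                        /* woken while still spinning */
      __builtin_ia32_pause ();         /* x86 relax hint; other targets differ */
    }
  /* Spin budget exhausted: FUTEX_WAIT rechecks *addr == val atomically
     before putting us to sleep, so there is no lost-wakeup window here.  */
  syscall (SYS_futex, addr, FUTEX_WAIT, val, NULL, NULL, 0);
}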


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706



[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-04-20 Thread jakub at gcc dot gnu dot org


--- Comment #6 from jakub at gcc dot gnu dot org  2010-04-20 10:49 ---
Created an attachment (id=20441)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=20441&action=view)
gcc46-pr43706.patch

For GOMP_CPU_AFFINITY there was an issue that the number of available CPUs,
used to decide whether the number of managed threads is bigger than the number
of available CPUs, didn't take the GOMP_CPU_AFFINITY restriction into account.
The attached patch fixes that.  That said, for cases where some CPU is
available to the GOMP program, yet is constantly busy doing other things (say
at a higher priority), this can't help, and OMP_WAIT_POLICY=passive or
GOMP_SPINCOUNT=1000 or some similarly small number is your only option.
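
For illustration only (this is not the attached patch), one way to obtain the
kind of count the fix is about on Linux; the function name and fallback are
made up:

#define _GNU_SOURCE                    /* CPU_* macros, sched_getaffinity */
#include <sched.h>

/* Number of CPUs this process may actually run on (its affinity mask, as set
   e.g. via GOMP_CPU_AFFINITY or taskset), rather than all CPUs in the box.  */
static int
usable_cpu_count (void)
{
  cpu_set_t set;
  if (sched_getaffinity (0, sizeof set, &set) == 0)
    return CPU_COUNT (&set);
  return 1;                            /* conservative fallback */
}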


-- 

jakub at gcc dot gnu dot org changed:

   What|Removed |Added

 AssignedTo|unassigned at gcc dot gnu   |jakub at gcc dot gnu dot org
   |dot org |
 Status|UNCONFIRMED |ASSIGNED


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706



[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-04-20 Thread mika dot fischer at kit dot edu


--- Comment #7 from mika dot fischer at kit dot edu  2010-04-20 12:23 
---
> For performance reasons libgomp uses some busy waiting, which of course works
> well when there are available CPUs and cycles to burn (it decreases latency a
> lot), but if you have more threads than CPUs it can make things worse.
> You can tweak this through the OMP_WAIT_POLICY and GOMP_SPINCOUNT env vars.

This is definitely the reason for the behavior we're seeing.  When we set
OMP_WAIT_POLICY=passive, the test program runs through normally.  Without it,
it takes very, very long.

Here are some measurements with while (true) replaced by
for (int j=0; j<1000; ++j):


All cores idle:
===
$ /usr/bin/time ./openmp-bug
3.21user 0.00system 0:00.81elapsed 391%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+331minor)pagefaults 0swaps

$ OMP_WAIT_POLICY=passive /usr/bin/time ./openmp-bug
2.75user 0.05system 0:01.42elapsed 196%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+335minor)pagefaults 0swaps


1 (out of 4) cores occupied:

$ /usr/bin/time ./openmp-bug
133.65user 0.02system 0:45.30elapsed 295%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+330minor)pagefaults 0swaps

$ OMP_WAIT_POLICY=passive /usr/bin/time ./openmp-bug
2.67user 0.00system 0:02.35elapsed 113%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+335minor)pagefaults 0swaps

$ GOMP_SPINCOUNT=10 /usr/bin/time ./openmp-bug
2.91user 0.03system 0:01.73elapsed 169%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+336minor)pagefaults 0swaps

$ GOMP_SPINCOUNT=100 /usr/bin/time ./openmp-bug
2.77user 0.03system 0:01.90elapsed 147%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+336minor)pagefaults 0swaps

$ GOMP_SPINCOUNT=1000 /usr/bin/time ./openmp-bug
2.87user 0.00system 0:01.70elapsed 168%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+336minor)pagefaults 0swaps

$ GOMP_SPINCOUNT=10000 /usr/bin/time ./openmp-bug
3.05user 0.06system 0:01.85elapsed 167%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+337minor)pagefaults 0swaps

$ GOMP_SPINCOUNT=100000 /usr/bin/time ./openmp-bug
5.25user 0.03system 0:03.10elapsed 170%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+335minor)pagefaults 0swaps

$ GOMP_SPINCOUNT=1000000 /usr/bin/time ./openmp-bug
28.84user 0.00system 0:14.13elapsed 203%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+336minor)pagefaults 0swaps

[I ran all of these several times and took the runtime around the average]

> Although the implementation recognizes two kinds of spin counts (normal and
> throttled, the latter used when the number of threads is bigger than the
> number of available CPUs), in some cases even that default might be too large
> (the default for the throttled spin count is 1000 spins for
> OMP_WAIT_POLICY=active and 100 spins when OMP_WAIT_POLICY is not set in the
> environment).

As the numbers show, a default spin count of 1000 would be fine.  The problem,
however, is that OpenMP assumes it has all the cores of the CPU to itself.  The
throttled spin count is only used if the number of OpenMP threads is larger
than the number of cores in the system (AFAICT).  This will almost never happen
(AFAICT only if you set OMP_NUM_THREADS to something larger than the number of
cores).

Since it seems clear that the spin count should be smaller when the CPU cores
are busy, the throttled spin count should be used when the cores are actually
in use at the moment the thread starts waiting; see the sketch below.  That the
number of OpenMP threads running at that time is smaller than the number of
cores is not a sufficient condition.  If it's not possible to determine this,
or if it's too time-consuming, then maybe the non-throttled default spin count
could be reduced to 1000 or so.
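
A hedged sketch of that decision; cores_busy_now is a hypothetical placeholder
for whatever cheap load check turns out to be feasible, and the two budgets
stand for the normal and throttled spin counts libgomp already distinguishes:

/* Hypothetical: pick the spin budget at the moment a thread is about to
   wait, based on whether the cores look busy right now, instead of only on
   the thread/CPU ratio determined at team startup.  */
static unsigned long
spin_budget_for_this_wait (int cores_busy_now,
                           unsigned long normal_budget,
                           unsigned long throttled_budget)
{
  return cores_busy_now ? throttled_budget : normal_budget;
}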

So thanks for the workaround! But I still think the default behavior can easily
cause very significant slowdowns and thus should be reconsidered.

Finally, I still don't get why the spinning has these effects on the runtime.
I would expect even 2,000,000 spin lock cycles to be over very quickly, not to
cause a 20-fold increase in the total runtime of the program.  Just out of
curiosity, maybe you can explain why this happens.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706



[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-04-20 Thread jakub at gcc dot gnu dot org


--- Comment #8 from jakub at gcc dot gnu dot org  2010-04-20 15:38 ---
Subject: Bug 43706

Author: jakub
Date: Tue Apr 20 15:37:51 2010
New Revision: 158565

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=158565
Log:
PR libgomp/43706
* config/linux/affinity.c (gomp_init_affinity): Decrease
gomp_available_cpus if affinity mask confines the process to fewer
CPUs.
* config/linux/proc.c (get_num_procs): If gomp_cpu_affinity is
non-NULL, just return gomp_available_cpus.

Modified:
trunk/libgomp/ChangeLog
trunk/libgomp/config/linux/affinity.c
trunk/libgomp/config/linux/proc.c


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706



[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-04-09 Thread baeuml at kit dot edu


--- Comment #1 from baeuml at kit dot edu  2010-04-09 16:22 ---
Created an attachment (id=20348)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=20348&action=view)
output of -save-temps


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706



[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-04-09 Thread pinskia at gcc dot gnu dot org


--- Comment #2 from pinskia at gcc dot gnu dot org  2010-04-09 18:33 ---
Have you done a profile (using oprofile) to see why this happens?  Really, I
think GOMP_CPU_AFFINITY should not be used that much, as it will cause
starvation no matter what.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706



[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-04-09 Thread baeuml at kit dot edu


--- Comment #3 from baeuml at kit dot edu  2010-04-09 20:55 ---
> Have you done a profile (using oprofile) to see why this happens?

I've oprofile'd the original program from which this is a stripped-down minimal
example.  I did not see anything unusual, but I'm certainly no expert with
oprofile.

GDB, however, shows that all 4 threads are waiting in do_wait()
(config/linux/wait.h).

> Really, I think GOMP_CPU_AFFINITY should not be used that much, as it will
> cause starvation no matter what.

As I said, this happens also without GOMP_CPU_AFFINITY when there is enough
load on the cores.  By using GOMP_CPU_AFFINITY I'm just able to reproduce the
issue reliably with the given minimal example without needing to put load on
the cores.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706



[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-04-09 Thread mika dot fischer at kit dot edu


--- Comment #4 from mika dot fischer at kit dot edu  2010-04-09 22:10 
---
I'm Martin's coworker and want to add some additional points.

Just to be clear, this is not an exotic toy example; it is causing us real
problems with production code.  Martin just stripped it down so it can be
easily reproduced.

The same goes for GOMP_CPU_AFFINITY.  Of course we don't use it; it just makes
it very easy to reproduce the bug.  The real problem we're having is caused by
threads other than the OpenMP threads putting load on some cores.  You can also
easily test this by occupying one of the cores with some computation and
running the test program without GOMP_CPU_AFFINITY.
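
For readers without the attachment, a hypothetical reconstruction of the
pattern being described (repeated short parallel regions inside an endless
loop); this is not the actual attached test case:

/* Hypothetical reproducer sketch; build with: gcc -fopenmp openmp-bug.c */
int
main (void)
{
  double sum = 0.0;
  while (1)                            /* comment #7 replaces this with
                                          for (int j = 0; j < 1000; ++j)   */
    {
#pragma omp parallel for reduction(+:sum)
      for (int i = 0; i < 100000; ++i)
        sum += i * 1e-9;               /* only a little work per region */
    }
  return (int) sum;                    /* not reached in the while (1) form */
}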

I also don't quite get why GOMP_CPU_AFFINITY should cause starvation. Maybe you
could elaborate. It's clear that it will cause additional scheduling overhead
and one thread might have to wait for the other to finish, but I would not call
this starvation.

Finally, and most importantly, the workaround we're currently employing is to
use libgomp.so.1 (via LD_LIBRARY_PATH) from OpenSuSE 11.0. With this version of
libgomp, the problem does not occur at all!

Here's the gcc -v output from the GCC release of OpenSuSE 11.0 from which we
took the working libgomp:
gcc -v
Using built-in specs.
Target: x86_64-suse-linux
Configured with: ../configure --prefix=/usr --with-local-prefix=/usr/local
--infodir=/usr/share/info --mandir=/usr/share/man --libdir=/usr/lib64
--libexecdir=/usr/lib64 --enable-languages=c,c++,objc,fortran,obj-c++,java,ada
--enable-checking=release --with-gxx-include-dir=/usr/include/c++/4.3
--enable-ssp --disable-libssp --with-bugurl=http://bugs.opensuse.org/
--with-pkgversion='SUSE Linux' --disable-libgcj --with-slibdir=/lib64
--with-system-zlib --enable-__cxa_atexit --enable-libstdcxx-allocator=new
--disable-libstdcxx-pch --program-suffix=-4.3
--enable-version-specific-runtime-libs --enable-linux-futex
--without-system-libunwind --with-cpu=generic --build=x86_64-suse-linux
Thread model: posix
gcc version 4.3.1 20080507 (prerelease) [gcc-4_3-branch revision 135036] (SUSE
Linux)


-- 

mika dot fischer at kit dot edu changed:

   What|Removed |Added

 CC||mika dot fischer at kit dot
   ||edu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706