[Bug libgomp/43706] scheduling two threads on one core leads to starvation
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What            |Removed |Added
------------------------------------------------
           Status          |NEW     |RESOLVED
           Resolution      |        |FIXED
           Target Milestone|4.5.3   |4.6.0

--- Comment #29 from Andrew Pinski <pinskia at gcc dot gnu.org> 2012-01-12 20:32:38 UTC ---
Fixed.
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What            |Removed |Added
------------------------------------------------
           Target Milestone|4.5.2   |4.5.3

--- Comment #28 from Richard Guenther <rguenth at gcc dot gnu.org> 2010-12-16 13:02:46 UTC ---
GCC 4.5.2 is being released, adjusting target milestone.
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706

--- Comment #27 from Jakub Jelinek <jakub at gcc dot gnu.org> 2010-12-02 14:31:31 UTC ---
Author: jakub
Date: Thu Dec  2 14:31:27 2010
New Revision: 167371

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=167371
Log:
        PR libgomp/43706
        * env.c (initialize_env): Default to spin count 300000
        instead of 20000000 if neither OMP_WAIT_POLICY nor
        GOMP_SPINCOUNT is specified.

Modified:
        trunk/libgomp/ChangeLog
        trunk/libgomp/env.c
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706

--- Comment #26 from Johannes Singler <singler at kit dot edu> 2010-11-15 08:53:12 UTC ---
(In reply to comment #25)
> You might have misread what I wrote. I did not mention 35 tests; I mentioned
> that a test became slower by 35%. The total number of different tests was 4
> (and each was invoked multiple times per spincount setting, indeed). One out
> of four stayed 35% slower until I increased GOMP_SPINCOUNT to 200000.

Sorry, I got that wrong.

> This makes some sense, but the job of an optimizing compiler and runtime
> libraries is to deliver the best performance they can even with somewhat
> non-optimal source code.

I agree with that in principle. But please be reminded that, as is, there is
the very simple testcase posted here, which takes a serious performance hit.
And repeated parallel loops like the one in the test program certainly appear
very often in real applications.
BTW: How does the testcase react to this change on your machine?

> There are plenty of real-world cases where spending time on application
> redesign for speed is unreasonable or can only be completed at a later time -
> yet it is desirable to squeeze a little bit of extra performance out of the
> existing code. There are also cases where more efficient parallelization -
> implemented at a higher level to avoid frequent switches between parallel and
> sequential execution - makes the application harder to use. To me, one of the
> very reasons to use OpenMP was to avoid/postpone that redesign and the
> user-visible complication for now. If I went for a more efficient
> higher-level solution, I would not need OpenMP in the first place.

OpenMP should not be regarded as good only for loop parallelization. With the
new task construct, it is a fully-fledged parallelization substrate.

> > So I would suggest a threshold of 100000 for now.
> My suggestion is 250000.

Well, that is already much better than staying with 20,000,000, so I agree.

> > IMHO, something should really happen to this problem before the 4.6 release.
> Agreed. It'd be best to have a code fix, though.

IMHO, there is no obvious way to fix this in principle. There will always be a
compromise between busy waiting and giving back control to the OS.

Jakub, what do you plan to do about this problem?
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706

--- Comment #24 from Johannes Singler <singler at kit dot edu> 2010-11-12 08:15:56 UTC ---
If only one out of 35 tests becomes slower, I would rather blame it on this
one (probably badly parallelized) application, not the OpenMP runtime system.

So I would suggest a threshold of 100000 for now.

IMHO, something should really happen to this problem before the 4.6 release.
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706

--- Comment #25 from Alexander Peslyak <solar-gcc at openwall dot com> 2010-11-12 11:19:13 UTC ---
(In reply to comment #24)
> If only one out of 35 tests becomes slower,

You might have misread what I wrote. I did not mention 35 tests; I mentioned
that a test became slower by 35%. The total number of different tests was 4
(and each was invoked multiple times per spincount setting, indeed). One out
of four stayed 35% slower until I increased GOMP_SPINCOUNT to 200000.

> I would rather blame it on this one (probably badly parallelized)
> application, not the OpenMP runtime system.

This makes some sense, but the job of an optimizing compiler and runtime
libraries is to deliver the best performance they can even with somewhat
non-optimal source code. There are plenty of real-world cases where spending
time on application redesign for speed is unreasonable or can only be
completed at a later time - yet it is desirable to squeeze a little bit of
extra performance out of the existing code. There are also cases where more
efficient parallelization - implemented at a higher level to avoid frequent
switches between parallel and sequential execution - makes the application
harder to use. To me, one of the very reasons to use OpenMP was to
avoid/postpone that redesign and the user-visible complication for now. If I
went for a more efficient higher-level solution, I would not need OpenMP in
the first place.

> So I would suggest a threshold of 100000 for now.

My suggestion is 250000.

> IMHO, something should really happen to this problem before the 4.6 release.

Agreed. It'd be best to have a code fix, though.
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706

--- Comment #23 from Alexander Peslyak <solar-gcc at openwall dot com> 2010-11-09 16:32:53 UTC ---
(In reply to comment #20)
> Maybe we could agree on a compromise for a start. Alexander, what are the
> corresponding results for GOMP_SPINCOUNT=100000?

I reproduced slowdowns of 5% to 35% (on different pieces of code) on an
otherwise-idle dual-E5520 system (16 logical CPUs) when going from gcc 4.5.0's
defaults to GOMP_SPINCOUNT=10000. On all but one test, the original full speed
is restored with GOMP_SPINCOUNT=100000. On the remaining test, the threshold
appears to be between 100000 (still 35% slower than full speed) and 200000
(original full speed). So if we're not going to have a code fix soon enough,
maybe the new default should be slightly higher than 200000. It won't help as
much as 10000 would for cases where this is needed, but it would be of some
help.
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
--- Comment #22 from solar-gcc at openwall dot com 2010-09-05 11:37 ---
(In reply to comment #20)
> Maybe we could agree on a compromise for a start. Alexander, what are the
> corresponding results for GOMP_SPINCOUNT=100000?

Unfortunately, I no longer have access to the dual-X5550 system, and I did not
try other values for this parameter when I was benchmarking that system. On
systems that I do currently have access to, the slowdown from
GOMP_SPINCOUNT=10000 was typically no more than 10% (and most of the time
there was either no effect or a substantial speedup). I can try 100000 on
those, although it'd be difficult to tell the difference from 10000 because of
the changing load. I'll plan on doing this the next time I run this sort of
benchmarks.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
--- Comment #21 from jakub at gcc dot gnu dot org 2010-09-01 16:38 ---
*** Bug 45485 has been marked as a duplicate of this bug. ***

--
jakub at gcc dot gnu dot org changed:

           What            |Removed |Added
------------------------------------------------
           CC              |        |h dot vogt at gom dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
--- Comment #20 from singler at kit dot edu 2010-08-30 08:41 ---
Maybe we could agree on a compromise for a start. Alexander, what are the
corresponding results for GOMP_SPINCOUNT=100000?

--
singler at kit dot edu changed:

           What            |Removed |Added
------------------------------------------------
           CC              |        |singler at kit dot edu

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
--- Comment #17 from solar-gcc at openwall dot com 2010-08-24 11:07 ---
(In reply to comment #16)
> I would really like to see this bug tackled.

I second that.

> Fixing it is easily done by lowering the spin count as proposed. Otherwise,
> please show cases where a low spin count hurts performance.

Unfortunately, yes, I've since identified real-world test cases where
GOMP_SPINCOUNT=10000 hurts performance significantly (compared to gcc 4.5.0's
default). Specifically, this was the case when I experimented with my John the
Ripper patches on a dual-X5550 system (16 logical CPUs). On a few
real-world'ish runs, GOMP_SPINCOUNT=10000 would halve the speed. On most other
tests I ran, it would slow things down by about 10%. That's on an otherwise
idle system. I was surprised, as I had previously only seen
GOMP_SPINCOUNT=10000 hurt performance on systems with server-like unrelated
load (and it would help tremendously with certain other kinds of load).

> In general, for a tuning parameter, a good-natured value should rather be
> preferred over a value that gives best results in one case, but very bad
> ones in another case.

In general, I agree. Even the 50% worst-case slowdown I observed with
GOMP_SPINCOUNT=10000 is not as bad as the 400x worst-case slowdown observed
without that option. On the other hand, a 50% slowdown would be fatal when it
comes to comparisons of libgomp vs. competing implementations. Also, HPC
cluster nodes may well be allocated such that there's no other load on each
individual node. So having the defaults tuned for a system with no other load
makes some sense to me, and I am really unsure whether simply changing the
defaults is the proper fix here. I'd be happy to see this problem fixed
differently, such that the unacceptable slowdowns are avoided in both cases.
Maybe the new default could be to auto-tune the setting while the program is
running? Meanwhile, if it's going to take a long time until we have a code
fix, perhaps the problem and the workaround need to be documented prominently.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
--- Comment #18 from jakub at gcc dot gnu dot org 2010-08-24 11:40 ---
For the auto-tuning, ideally the kernel would tell the thread when it lost the
CPU; I doubt there is any API for that currently. E.g. a thread could register
with the kernel an address where the kernel would store some value (e.g. zero)
or do an atomic increment or decrement upon taking the CPU away from the
thread. Then, at the start of the spinning, libgomp could initialize that flag
and check it from time to time (say every few hundred or thousand iterations)
to see whether it has lost the CPU. In the lost-CPU case it would continue
with just a couple of spins and then go to sleep, and ensure the spincount
gets dynamically adjusted next time, or something similar. Currently, I'm
afraid, the only way to dynamically adjust is to read/parse /proc/loadavg from
time to time and, if it goes above the number of available CPUs or something
similar, start throttling down.

--
jakub at gcc dot gnu dot org changed:

           What            |Removed |Added
------------------------------------------------
           CC              |        |drepper at redhat dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706
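To make the /proc/loadavg fallback concrete, here is a rough sketch of what
comment #18 describes. It is illustrative, not actual libgomp code: the
gomp_spin_count variable and maybe_throttle function are invented names, the
300000 default is the value from the eventual fix in comment #27, and the
100-spin throttled value is the default quoted in comment #5.

#include <stdio.h>
#include <unistd.h>

static unsigned long gomp_spin_count = 300000;  /* illustrative default */

/* Occasionally re-read the 1-minute load average and fall back to the
   throttled spin count once the machine looks oversubscribed.  */
static void
maybe_throttle (void)
{
  FILE *f = fopen ("/proc/loadavg", "r");
  double load1;

  if (f == NULL)
    return;
  if (fscanf (f, "%lf", &load1) == 1
      && load1 > (double) sysconf (_SC_NPROCESSORS_ONLN))
    gomp_spin_count = 100;  /* throttled value from comment #5 */
  fclose (f);
}

Reading /proc/loadavg is cheap but coarse: the 1-minute average lags bursty
load, which is why the comment treats this only as a last resort.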
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
--- Comment #19 from solar-gcc at openwall dot com 2010-08-24 12:18 ---
(In reply to comment #18)
> Then, at the start of the spinning, libgomp could initialize that flag and
> check it from time to time (say every few hundred or thousand iterations)
> to see whether it has lost the CPU.

Without a kernel API like that, you can achieve a similar effect by issuing
the rdtsc instruction (or its equivalents for non-x86 archs) and seeing if the
cycle counter changes unexpectedly (say, by 1000 or more for a single loop
iteration), which would indicate that there was a context switch. For an
arch-independent implementation, you could also use a syscall such as times(2)
or gettimeofday(2), but then you'd need to do it very infrequently (e.g.,
maybe just to see if there's a context switch between 10k and 20k spins).

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706
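A minimal sketch of the rdtsc heuristic proposed in comment #19, assuming an
x86 target and GCC builtins. The 1000-cycle threshold comes from the comment;
the function name and spin budget are illustrative, not libgomp code.

#include <stdint.h>

/* Spin until *flag becomes nonzero, but give up early if the time-stamp
   counter jumps by 1000+ cycles within a single iteration, which almost
   certainly means the thread was context-switched away.  */
static int
spin_wait (volatile int *flag, unsigned long max_spins)
{
  uint64_t prev = __builtin_ia32_rdtsc ();
  unsigned long i;

  for (i = 0; i < max_spins; i++)
    {
      if (*flag)
        return 1;                  /* wakeup arrived while spinning */
      uint64_t now = __builtin_ia32_rdtsc ();
      if (now - prev > 1000)
        return 0;                  /* likely preempted: stop burning CPU */
      prev = now;
      __builtin_ia32_pause ();     /* be polite to an SMT sibling */
    }
  return 0;                        /* spin budget exhausted: go to sleep */
}

One loop iteration takes far fewer than 1000 cycles, so a jump of that size
nearly always spans a descheduling; a rare false positive merely causes one
early sleep.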
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
--- Comment #16 from singler at kit dot edu 2010-08-13 15:48 ---
I would really like to see this bug tackled. It has been confirmed two more
times. Fixing it is easily done by lowering the spin count as proposed.
Otherwise, please show cases where a low spin count hurts performance.

In general, for a tuning parameter, a good-natured value should rather be
preferred over a value that gives best results in one case, but very bad ones
in another case.

--
singler at kit dot edu changed:

           What            |Removed             |Added
------------------------------------------------------------------
           Status          |UNCONFIRMED         |NEW
           Ever Confirmed  |0                   |1
           Last reconfirmed|0000-00-00 00:00:00 |2010-08-13 15:48:18
           Target Milestone|4.4.5               |4.5.2

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
--- Comment #15 from johnfb at mail dot utexas dot edu 2010-07-30 14:00 ---
We have also had some trouble with this issue. We found that, in general, if
we were running on a machine with hardware threads (i.e., Intel's
Hyper-Threading), performance was poor. Most of our runs were on a machine
with an Intel Xeon L5530 running RHEL 5. Profiling showed that our program was
spending 50% of its time inside libgomp. Setting either GOMP_SPINCOUNT or
OMP_WAIT_POLICY as discussed in this thread increased performance greatly.

Experiments with disabling and enabling cores under default OMP settings
showed that when the Hyper-Threading cores came online, performance on some
runs dipped below what we got with only one core enabled. A little thought as
to how hardware threads are implemented makes it obvious why spinning for more
than a few cycles causes performance problems: if one hardware thread spins,
all other threads on that core may be starved, since execution resources are
shared between the hardware threads of a core.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706
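The starvation mechanism described in comment #15 is why x86 spin loops
conventionally execute the pause instruction, which yields shared pipeline
resources to the sibling hardware thread. A minimal illustration, not libgomp
source; the function is made up:

/* Busy-wait on *done.  Without the pause, the spinning hardware thread
   monopolizes the core's shared execution resources and starves the SMT
   sibling that is doing real work.  Pausing mitigates, but does not
   eliminate, the problem described above.  */
static void
busy_wait (volatile int *done)
{
  while (!*done)
    __builtin_ia32_pause ();  /* x86 pause; GCC builtin, no header needed */
}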
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
--- Comment #14 from solar-gcc at openwall dot com 2010-07-02 01:39 ---
We're also seeing this problem with OpenMP-using code built with the gcc 4.5.0
release on linux-x86_64. Here's a user's report (400x slowdown on an 8-core
system when there's a single other process running on a CPU):

http://www.openwall.com/lists/john-users/2010/06/30/3

Here's my confirmation of the problem report (I easily reproduced similar
slowdowns), and workarounds:

http://www.openwall.com/lists/john-users/2010/06/30/6

GOMP_SPINCOUNT=10000 (this specific value) turned out to be nearly optimal in
cases affected by this problem, as well as on idle systems, although I was
also able to identify cases (with server-like unrelated load: short requests
to many processes, which quickly go back to sleep) where this setting lowered
the measured best-case speed by 15% (over multiple benchmark invocations),
even though it might have improved the average speed even in those cases.

All of this is reproducible with the John the Ripper 1.7.6 release on Blowfish
hashes (john --test --format=bf) and with the -omp-des patch (current revision
is 1.7.6-omp-des-4) on DES-based crypt(3) hashes (john --test --format=des).
The use of OpenMP needs to be enabled by uncommenting the OMPFLAGS line in the
Makefile. JtR and the patch can be downloaded from:

http://www.openwall.com/john/
http://openwall.info/wiki/john/patches

To reproduce the problem, it is sufficient to have one other CPU-using process
running when invoking the John benchmark. I was using a non-OpenMP build of
John itself as that other process.

Overall, besides this specific bug, OpenMP-using programs are very sensitive
to other system load - e.g., an unrelated server-like load of 10% often slows
an OpenMP program down by 50%. Any improvements in this area would be very
welcome. However, this specific bug is extreme, with its 400x slowdowns, so
perhaps it should be treated with priority.

Jakub - thank you for your work on gcc's OpenMP support. The ease of use is
great!

--
solar-gcc at openwall dot com changed:

           What            |Removed |Added
------------------------------------------------
           CC              |        |solar-gcc at openwall dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
--
jakub at gcc dot gnu dot org changed:

           What            |Removed |Added
------------------------------------------------
           Target Milestone|4.4.4   |4.4.5

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
--- Comment #13 from singler at kit dot edu 2010-04-23 14:17 ---
The default spin count is not 2,000,000 cycles, but even 20,000,000. As
commented in libgomp/env.c, this is supposed to correspond to 200ms. The
timings we see here are even larger, but the number of cycles is just a rough
estimation.

Throttling the spin count in awareness of too many threads is a good idea, but
it is just a heuristic. If there are other processes, the cores might be
loaded anyway, and libgomp has little chance of figuring that out. This gets
even more difficult when multiple programs use libgomp at the same time. So I
would like the non-throttled value to be chosen more conservatively, better
balancing worst-case behavior in difficult situations against best-case
behavior on an unloaded machine.

There are algorithms in the libstdc++ parallel mode that show speedups for
sequential running times of less than 1ms (when taking threads from the pool),
so users will accept a parallelization overhead for such small computing
times. However, if they are then hit by a 200ms penalty, the result is a
catastrophic slowdown. Calling such short-lived parallel regions several times
makes this very noticeable, although it need not be.

So IMHO, by default, the spinning should take about as long as rescheduling a
thread takes (one that was already migrated to another core), thereby making
things at most twice as bad as in the best case. From my experience, this is a
matter of a few milliseconds, so I propose lowering the default spin count to
something like 10,000, at most 100,000. Spinning for even longer than a usual
time slice, as happens now, seems questionable anyway.

Are nested threads taken into account when deciding whether to throttle or
not?

--
singler at kit dot edu changed:

           What            |Removed  |Added
------------------------------------------------
           Status          |RESOLVED |UNCONFIRMED
           Resolution      |FIXED    |

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
--- Comment #9 from jakub at gcc dot gnu dot org 2010-04-21 14:00 ---
Subject: Bug 43706

Author: jakub
Date: Wed Apr 21 13:59:39 2010
New Revision: 158600

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=158600
Log:
        PR libgomp/43706
        * config/linux/affinity.c (gomp_init_affinity): Decrease
        gomp_available_cpus if affinity mask confines the process
        to fewer CPUs.
        * config/linux/proc.c (get_num_procs): If gomp_cpu_affinity is
        non-NULL, just return gomp_available_cpus.

Modified:
        branches/gcc-4_5-branch/libgomp/ChangeLog
        branches/gcc-4_5-branch/libgomp/config/linux/affinity.c
        branches/gcc-4_5-branch/libgomp/config/linux/proc.c

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
--- Comment #10 from jakub at gcc dot gnu dot org 2010-04-21 14:01 ---
Subject: Bug 43706

Author: jakub
Date: Wed Apr 21 14:00:10 2010
New Revision: 158601

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=158601
Log:
        PR libgomp/43706
        * config/linux/affinity.c (gomp_init_affinity): Decrease
        gomp_available_cpus if affinity mask confines the process
        to fewer CPUs.
        * config/linux/proc.c (get_num_procs): If gomp_cpu_affinity is
        non-NULL, just return gomp_available_cpus.

Modified:
        branches/gcc-4_4-branch/libgomp/ChangeLog
        branches/gcc-4_4-branch/libgomp/config/linux/affinity.c
        branches/gcc-4_4-branch/libgomp/config/linux/proc.c

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
--- Comment #11 from jakub at gcc dot gnu dot org 2010-04-21 14:05 ---
GOMP_CPU_AFFINITY vs. throttling fixed for 4.4.4/4.5.1/4.6.0.

--
jakub at gcc dot gnu dot org changed:

           What            |Removed  |Added
------------------------------------------------
           Status          |ASSIGNED |RESOLVED
           Resolution      |         |FIXED

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
--
jakub at gcc dot gnu dot org changed:

           What            |Removed |Added
------------------------------------------------
           Target Milestone|---     |4.4.4

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
--- Comment #12 from mika dot fischer at kit dot edu 2010-04-21 14:23 ---
Just to make it clear: this patch most probably does not fix the issue we're
having, since the issue can be triggered without using GOMP_CPU_AFFINITY.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
--- Comment #5 from jakub at gcc dot gnu dot org 2010-04-20 10:23 ---
For performance reasons libgomp uses some busy waiting, which of course works
well when there are available CPUs and cycles to burn (it decreases latency a
lot), but if you have more threads than CPUs it can make things worse. You can
tweak this through the OMP_WAIT_POLICY and GOMP_SPINCOUNT env vars. Although
the implementation recognizes two kinds of spin counts (normal and throttled,
the latter in use when the number of threads is bigger than the number of
available CPUs), in some cases even that default might be too large (the
default for the throttled spin count is 1000 spins for OMP_WAIT_POLICY=active
and 100 spins with no OMP_WAIT_POLICY in the environment).

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706
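To make the two-phase waiting concrete, here is a simplified sketch of a
spin-then-sleep wait of the kind comment #5 describes: spin up to a
configurable count in the hope of a quick wakeup, then block on a futex. This
is an illustration under assumptions (Linux, a plain int flag), not the actual
libgomp do_wait code.

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Wait until *addr changes away from val: first spin `count` times
   (cheap wakeup if the event is imminent), then sleep in the kernel.  */
static void
do_wait (int *addr, int val, unsigned long count)
{
  unsigned long i;

  for (i = 0; i < count; i++)
    {
      if (__atomic_load_n (addr, __ATOMIC_RELAXED) != val)
        return;                    /* woken up during the spin phase */
      __builtin_ia32_pause ();
    }
  /* Spin budget exhausted: give the CPU back until FUTEX_WAKE.  */
  while (__atomic_load_n (addr, __ATOMIC_RELAXED) == val)
    syscall (SYS_futex, addr, FUTEX_WAIT, val, NULL, NULL, 0);
}

With OMP_WAIT_POLICY=passive the spin phase is essentially skipped, which is
why the passive timings in comment #7 stay flat even with a core occupied.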
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
--- Comment #6 from jakub at gcc dot gnu dot org 2010-04-20 10:49 ---
Created an attachment (id=20441)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=20441&action=view)
gcc46-pr43706.patch

For GOMP_CPU_AFFINITY there was an issue that the number of available CPUs -
used to decide whether the number of managed threads is bigger than the number
of available CPUs - didn't take the GOMP_CPU_AFFINITY restriction into
account. The attached patch does that. That said, for cases where some CPU is
available to the GOMP program, yet is constantly busy doing other things (say,
at higher priority), this can't help, and OMP_WAIT_POLICY=passive or
GOMP_SPINCOUNT=1000 or some similar small number is your only option.

--
jakub at gcc dot gnu dot org changed:

           What            |Removed                 |Added
------------------------------------------------------------------------
           AssignedTo      |unassigned at gcc dot   |jakub at gcc dot gnu
                           |gnu dot org             |dot org
           Status          |UNCONFIRMED             |ASSIGNED

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
--- Comment #7 from mika dot fischer at kit dot edu 2010-04-20 12:23 ---
(In reply to comment #5)
> For performance reasons libgomp uses some busy waiting, which of course
> works well when there are available CPUs and cycles to burn (it decreases
> latency a lot), but if you have more threads than CPUs it can make things
> worse. You can tweak this through the OMP_WAIT_POLICY and GOMP_SPINCOUNT
> env vars.

This is definitely the reason for the behavior we're seeing. When we set
OMP_WAIT_POLICY=passive, the test program runs through normally. Without it,
it takes very, very long.

Here are some measurements with while (true) replaced by
for (int j=0; j<1000; ++j):

All cores idle:
===============
$ /usr/bin/time ./openmp-bug
3.21user 0.00system 0:00.81elapsed 391%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+331minor)pagefaults 0swaps

$ OMP_WAIT_POLICY=passive /usr/bin/time ./openmp-bug
2.75user 0.05system 0:01.42elapsed 196%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+335minor)pagefaults 0swaps

1 (out of 4) cores occupied:
============================
$ /usr/bin/time ./openmp-bug
133.65user 0.02system 0:45.30elapsed 295%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+330minor)pagefaults 0swaps

$ OMP_WAIT_POLICY=passive /usr/bin/time ./openmp-bug
2.67user 0.00system 0:02.35elapsed 113%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+335minor)pagefaults 0swaps

$ GOMP_SPINCOUNT=10 /usr/bin/time ./openmp-bug
2.91user 0.03system 0:01.73elapsed 169%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+336minor)pagefaults 0swaps

$ GOMP_SPINCOUNT=100 /usr/bin/time ./openmp-bug
2.77user 0.03system 0:01.90elapsed 147%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+336minor)pagefaults 0swaps

$ GOMP_SPINCOUNT=1000 /usr/bin/time ./openmp-bug
2.87user 0.00system 0:01.70elapsed 168%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+336minor)pagefaults 0swaps

$ GOMP_SPINCOUNT=10000 /usr/bin/time ./openmp-bug
3.05user 0.06system 0:01.85elapsed 167%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+337minor)pagefaults 0swaps

$ GOMP_SPINCOUNT=100000 /usr/bin/time ./openmp-bug
5.25user 0.03system 0:03.10elapsed 170%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+335minor)pagefaults 0swaps

$ GOMP_SPINCOUNT=1000000 /usr/bin/time ./openmp-bug
28.84user 0.00system 0:14.13elapsed 203%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+336minor)pagefaults 0swaps

[I ran all of these several times and took a runtime near the average.]

> Although the implementation recognizes two kinds of spin counts (normal and
> throttled, the latter in use when the number of threads is bigger than the
> number of available CPUs), in some cases even that default might be too
> large (the default for the throttled spin count is 1000 spins for
> OMP_WAIT_POLICY=active and 100 spins with no OMP_WAIT_POLICY in the
> environment).

As the numbers show, a default spin count of 1000 would be fine. The problem
is, however, that OpenMP assumes it has all the cores of the CPU for itself.
The throttled spin count is only used if the number of OpenMP threads is
larger than the number of cores in the system (AFAICT). This will almost never
happen (AFAICT only if you set OMP_NUM_THREADS to something larger than the
number of cores).

Since it seems clear that the spin count should be smaller when the CPU cores
are busy, the throttled spin count must be used when the cores are actually in
use at the moment the thread starts waiting. That the number of OpenMP threads
running at that time is smaller than the number of cores is not a sufficient
condition. If it's not possible to determine this, or if it's too
time-consuming, then maybe the non-throttled default spin count could be
reduced to 1000 or so.

So thanks for the workaround! But I still think the default behavior can
easily cause very significant slowdowns and should thus be reconsidered.

Finally, I still don't get why the spinlocking has these effects on the
runtime. I would expect even 2,000,000 spin-lock cycles to be over very
quickly, and not cause a 20-fold increase in the total runtime of the program.
Just out of curiosity, maybe you can explain why this happens.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706
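The stripped-down test program itself is not reproduced in this thread, so the
following is only a rough reconstruction of its apparent shape from the
discussion: many consecutive, very short parallel regions, with the original
while (true) outer loop bounded as in the measurements above. The loop bounds
and the per-iteration work are invented for illustration.

#include <stdio.h>

int
main (void)
{
  double sum = 0.0;
  int i, j;

  /* Bounded stand-in for the original `while (true)` outer loop.  */
  for (j = 0; j < 1000; ++j)
    {
      /* A very short parallel region; at its end the worker threads enter
         the spin-wait whose cost dominates once a core is otherwise busy.  */
      #pragma omp parallel for reduction(+:sum)
      for (i = 0; i < 100000; ++i)
        sum += i * 0.5;
    }

  printf ("%g\n", sum);  /* keep the computation observable */
  return 0;
}

Built with gcc -fopenmp; running it while one core is occupied by an unrelated
process should reproduce the elapsed-time blowup shown in the measurements
above.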
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
--- Comment #8 from jakub at gcc dot gnu dot org 2010-04-20 15:38 ---
Subject: Bug 43706

Author: jakub
Date: Tue Apr 20 15:37:51 2010
New Revision: 158565

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=158565
Log:
        PR libgomp/43706
        * config/linux/affinity.c (gomp_init_affinity): Decrease
        gomp_available_cpus if affinity mask confines the process
        to fewer CPUs.
        * config/linux/proc.c (get_num_procs): If gomp_cpu_affinity is
        non-NULL, just return gomp_available_cpus.

Modified:
        trunk/libgomp/ChangeLog
        trunk/libgomp/config/linux/affinity.c
        trunk/libgomp/config/linux/proc.c

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
--- Comment #1 from baeuml at kit dot edu 2010-04-09 16:22 ---
Created an attachment (id=20348)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=20348&action=view)
output of -save-temps

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
--- Comment #2 from pinskia at gcc dot gnu dot org 2010-04-09 18:33 ---
Have you done a profile (using oprofile) to see why this happens? Really, I
think GOMP_CPU_AFFINITY should not be used that much, as it will cause
starvation no matter what.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
--- Comment #3 from baeuml at kit dot edu 2010-04-09 20:55 ---
(In reply to comment #2)
> Have you done a profile (using oprofile) to see why this happens?

I've oprofile'd the original program from which this is a stripped-down
minimal example. I did not see anything unusual, but I'm certainly no expert
with oprofile. gdb, however, shows that all 4 threads are waiting in do_wait()
(config/linux/wait.h).

> Really, I think GOMP_CPU_AFFINITY should not be used that much, as it will
> cause starvation no matter what.

As I said, this happens also without GOMP_CPU_AFFINITY when there is enough
load on the cores. By using GOMP_CPU_AFFINITY I'm just able to reproduce the
issue reliably with the given minimal example, without needing to put load on
the cores.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
--- Comment #4 from mika dot fischer at kit dot edu 2010-04-09 22:10 ---
I'm Martin's coworker and want to add some additional points.

Just to be clear: this is not an exotic toy example; it is causing us real
problems with production code. Martin just stripped it down so it can be
easily reproduced.

The same goes for GOMP_CPU_AFFINITY. Of course we don't use it; it just makes
it very easy to reproduce the bug. The real problem we're having is caused by
threads other than the OpenMP threads putting load on some cores. You can also
easily test this by occupying one of the cores with some computation and
running the test program without GOMP_CPU_AFFINITY.

I also don't quite get why GOMP_CPU_AFFINITY should cause starvation. Maybe
you could elaborate. It's clear that it will cause additional scheduling
overhead, and one thread might have to wait for the other to finish, but I
would not call this starvation.

Finally, and most importantly, the workaround we're currently employing is to
use the libgomp.so.1 (via LD_LIBRARY_PATH) from OpenSuSE 11.0. With this
version of libgomp, the problem does not occur at all! Here's the gcc -v
output from the GCC release of OpenSuSE 11.0 from which we took the working
libgomp:

gcc -v
Using built-in specs.
Target: x86_64-suse-linux
Configured with: ../configure --prefix=/usr --with-local-prefix=/usr/local
--infodir=/usr/share/info --mandir=/usr/share/man --libdir=/usr/lib64
--libexecdir=/usr/lib64 --enable-languages=c,c++,objc,fortran,obj-c++,java,ada
--enable-checking=release --with-gxx-include-dir=/usr/include/c++/4.3
--enable-ssp --disable-libssp --with-bugurl=http://bugs.opensuse.org/
--with-pkgversion='SUSE Linux' --disable-libgcj --with-slibdir=/lib64
--with-system-zlib --enable-__cxa_atexit --enable-libstdcxx-allocator=new
--disable-libstdcxx-pch --program-suffix=-4.3
--enable-version-specific-runtime-libs --enable-linux-futex
--without-system-libunwind --with-cpu=generic --build=x86_64-suse-linux
Thread model: posix
gcc version 4.3.1 20080507 (prerelease) [gcc-4_3-branch revision 135036]
(SUSE Linux)

--
mika dot fischer at kit dot edu changed:

           What            |Removed |Added
------------------------------------------------
           CC              |        |mika dot fischer at kit dot edu

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706