[Bug fortran/87622] coarray does not run in parallel
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87622

Thomas Koenig changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |INVALID

--- Comment #6 from Thomas Koenig ---
(In reply to Andreas Klein from comment #5)
> Adding initialization removes the effect.
> There is still a 20% decrease in performance, but that is plausibly due to
> cache effects.

OK, I'll close this bug then.

> Sorry, my minimal example was too minimal. I derived the mini example from a
> big parallel linear algebra package. Now I must go through all the
> minimization steps, but it is possible that the original error has nothing
> to do with coarrays.

If you find anything that we could fix, please open a new PR.

And thanks for the PR anyway - it is better to open something that later
turns out to be invalid (I've submitted a few PRs like that) than to miss
something.
[Bug fortran/87622] coarray does not run in parallel
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87622

--- Comment #5 from Andreas Klein ---
On Wed, 17 Oct 2018, tkoenig at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87622
>
> --- Comment #2 from Thomas Koenig ---
> Some more remarks: In a benchmark, it is best to actually set all
> variables that are read to something defined, for example with a call to
> random_number. Also, if you generate values which you do not use later, the
> compiler may decide to remove the calculation altogether. What works well for
> this is something like
>
>   read (*,*) i,j
>   print *, a(i,j)
>

Adding initialization removes the effect.
There is still a 20% decrease in performance, but that is plausibly due to
cache effects.

Sorry, my minimal example was too minimal. I derived the mini example from a
big parallel linear algebra package. Now I must go through all the
minimization steps, but it is possible that the original error has nothing
to do with coarrays.
[Bug fortran/87622] coarray does not run in parallel
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87622

--- Comment #4 from Andreas Klein ---
On Wed, 17 Oct 2018, burnus at gcc dot gnu.org wrote:

>
> If "b" (and "a") are allocated in some slower memory part, it matters how one
> sums over the variables in the matmul loop.

I know that there are differences in speed. But a factor of 2 is really large
and should not happen.

I created the example as a minimal example. I have observed similar problems
in almost every coarray program I have tried. The result is always that
using coarrays does not bring the desired speedup and is just a waste of
resources.
[Bug fortran/87622] coarray does not run in parallel
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87622

Tobias Burnus changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |burnus at gcc dot gnu.org

--- Comment #3 from Tobias Burnus ---
On the coarray library interaction, we have:
* On startup:
  - the coarray library is initialized
  - both coarrays are registered with the library,
    i.e. the library allocates the memory for the program
* At the end: the coarray library is told to finish

Otherwise, the program itself just calls _gfortran_matmul_r4.

The memory for variable "c" is in static memory while "a" and "b" are
allocated by the coarray run-time library.

If "b" (and "a") are allocated in some slower memory part, it matters how one
sums over the variables in the matmul loop.

For instance, for A = BC, one can calculate it as:
  a(i,j) = sum_k b(i,k)c(k,j)
In that case, since Fortran stores arrays in column-major order, c(:,j) is
contiguous in memory (which is faster) and b(i,:) is not (which is slower).
[I have not checked in which order _gfortran_matmul_r4 runs that loop.]

In the simplest case, the memory of "B" is slower because part of the page is
accessed by two processors and it keeps getting kicked out of the level 1 or
level 2 cache - and, hence, needs to be loaded from the L3/L4 cache or
the normal RAM.

Besides those reasons, there is also an overhead of generating multiple jobs
and synchronizing them - i.e. waiting until all processes have called "init"
and all processes have called "finished" in the library. - With many
processes, this overall barrier (all waiting for all) might take a very
sizable amount of the total run time. For 1 vs. 2 jobs, it should be
negligible, but some libraries still have an explicit shortcut for 1 job.
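
To illustrate the point in comment #3 about summation order, here is a minimal sketch of two hand-written loop nests for a(i,j) = sum_k b(i,k)c(k,j). It is not the implementation of _gfortran_matmul_r4; the program name, the size n and the choice of loop orders are assumptions made only for illustration. Fortran stores arrays in column-major order, so an inner loop over the first index walks memory with unit stride.

```fortran
! Illustrative sketch only: not libgfortran's matmul.
! The size n and both loop orders are assumptions for demonstration.
program loop_order
  implicit none
  integer, parameter :: n = 512
  real :: a(n,n), b(n,n), c(n,n)
  integer :: i, j, k

  ! Define all inputs before using them.
  call random_number(b)
  call random_number(c)

  ! Variant 1: inner loop over k.
  ! c(k,j) is read with unit stride, b(i,k) with stride n.
  a = 0.0
  do j = 1, n
    do i = 1, n
      do k = 1, n
        a(i,j) = a(i,j) + b(i,k) * c(k,j)
      end do
    end do
  end do
  print *, 'inner loop over k: a(1,1) =', a(1,1)

  ! Variant 2: inner loop over i.
  ! a(i,j) and b(i,k) are both accessed with unit stride,
  ! and c(k,j) is constant inside the inner loop.
  a = 0.0
  do j = 1, n
    do k = 1, n
      do i = 1, n
        a(i,j) = a(i,j) + b(i,k) * c(k,j)
      end do
    end do
  end do
  print *, 'inner loop over i: a(1,1) =', a(1,1)
end program loop_order
```

Both variants compute the same sums; only the memory access pattern differs. Timing each nest separately (for example with system_clock) would show how much the access order matters on a given machine.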
[Bug fortran/87622] coarray does not run in parallel
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87622

--- Comment #2 from Thomas Koenig ---
Some more remarks: In a benchmark, it is best to actually set all
variables that are read to something defined, for example with a call to
random_number. Also, if you generate values which you do not use later, the
compiler may decide to remove the calculation altogether. What works well for
this is something like

  read (*,*) i,j
  print *, a(i,j)

at the end, and then invoke the program with

  echo 1 1 | ./a.out

And, of course, please supply the timings.
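
The advice in comment #2 could be put together roughly as follows. This is a hedged sketch, not the reporter's original program: the program name, the array size n, the use of "c" as the result array and the system_clock timing are all assumptions. It initializes the inputs with random_number, times one matmul call, and prints an element chosen at run time so the compiler cannot drop the calculation.

```fortran
! Benchmark sketch under stated assumptions: names, the size n and the
! timing method are illustrative, not taken from the bug report.
program matmul_bench
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer, parameter :: n = 1000        ! assumed problem size
  real :: a(n,n)[*], b(n,n)[*]          ! coarrays, local part used only
  real :: c(n,n)                        ! ordinary array for the result
  integer :: i, j
  integer(int64) :: t0, t1, rate

  ! Give every input a defined value before timing anything.
  call random_number(a)
  call random_number(b)

  call system_clock(t0, rate)
  c = matmul(a, b)
  call system_clock(t1)

  print '(a,i0,a,f10.3,a)', 'image ', this_image(), ': ', &
        real(t1 - t0) / real(rate), ' s'

  ! Use the result with indices known only at run time, so the
  ! multiplication cannot be optimized away. Standard input is
  ! read on image 1 only.
  if (this_image() == 1) then
    read (*,*) i, j
    print *, c(i,j)
  end if
end program matmul_bench
```

As in the comment, the program would be invoked with something like `echo 1 1 | ./a.out`; the indices must lie between 1 and n. Building it with gfortran requires one of the `-fcoarray=` options, for example `-fcoarray=single` for a single-image run.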
[Bug fortran/87622] coarray does not run in parallel
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87622

Thomas Koenig changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tkoenig at gcc dot gnu.org

--- Comment #1 from Thomas Koenig ---
This could be a cache effect, or it could come from some interaction
between coarrays and the matmul library functions.

Can you supply some more details such as compiler options, compiler version,
CPU type, MPI version etc.?

And do you have the possibility of monitoring cache misses?