[Bug fortran/87622] coarray does not run in parallel

2018-10-17 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87622

Thomas Koenig  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |INVALID

--- Comment #6 from Thomas Koenig  ---
(In reply to Andreas Klein from comment #5)


> Adding initialization removes the effect.
> There is still a 20% decrease in performance, but that is plausibly due to
> cache effects.

OK, I'll close this bug then.

> Sorry, my minimal example was too minimal. I derived it from a
> big parallel linear algebra package. Now I must go through all the
> minimization steps again, but it's possible that the original error has
> nothing to do with coarrays.

If you find anything that we could fix, please open a new PR.

And thanks for the PR anyway - it is better to open something that
later turns out to be invalid (I've submitted a few PRs like that) than
to miss something.

[Bug fortran/87622] coarray does not run in parallel

2018-10-17 Thread klein at cage dot ugent.be
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87622

--- Comment #5 from Andreas Klein  ---
On Wed, 17 Oct 2018, tkoenig at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87622
>
> --- Comment #2 from Thomas Koenig  ---
> Some more remarks: In a benchmark, it is best to actually fill the values of
> all read variables to something defined, for example with a call to
> random_number. Also, if you generate values which you do not use later, the
> compiler may decide to remove the calculation altogether. What works well for
> this is something like
>
> read (*,*) i,j
> print *, a(i,j)
>
Adding initialization removes the effect.
There is still a 20% decrease in performance, but that is plausibly due to
cache effects.

Sorry, my minimal example was too minimal. I derived it from a
big parallel linear algebra package. Now I must go through all the
minimization steps again, but it's possible that the original error has
nothing to do with coarrays.

[Bug fortran/87622] coarray does not run in parallel

2018-10-17 Thread klein at cage dot ugent.be
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87622

--- Comment #4 from Andreas Klein  ---
On Wed, 17 Oct 2018, burnus at gcc dot gnu.org wrote:
>
> If "b" (and "a") are allocated in some slower memory part, it matters how one
> sums over the variables in the matmul loop.

I know that there are differences in speed. But a factor of 2 is really large
and should not happen. I created the example as a minimal
example. I observed similar problems in almost every coarray program I
have tried. The result is always that the use of coarrays does not bring the
desired speedup and is just a waste of resources.

[Bug fortran/87622] coarray does not run in parallel

2018-10-17 Thread burnus at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87622

Tobias Burnus  changed:

   What|Removed |Added

 CC||burnus at gcc dot gnu.org

--- Comment #3 from Tobias Burnus  ---
On the coarray library interaction, we have:
* On startup:
  - the coarray library is initialized
  - both coarrays are registered with the library,
    i.e. the library allocates the memory for the program
* At the end: the coarray library is told to finish


Otherwise, the program itself just calls _gfortran_matmul_r4.

The memory for variable "c" is in static memory while "a" and "b" are allocated
by the coarray run-time library.

If "b" (and "a") are allocated in some slower memory part, it matters how one
sums over the variables in the matmul loop.

For instance, for A = BC, one can calculate it as:
   a(i,j) = sum_k b(i,k)c(k,j)
In that case, c(:,j) is contiguous in memory (which is faster) and b(i,:) is
not (which is slower), because Fortran stores arrays in column-major order.
[I have not checked in which order _gfortran_matmul_r4 calculates the loop.]
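
To illustrate why the summation order matters, here is a minimal sketch of
one possible loop ordering. Fortran stores arrays in column-major order, so
an innermost loop that walks down a column touches adjacent memory. The
names loop_order_sketch and matmul_jki are made up for this illustration;
this is not a description of how _gfortran_matmul_r4 is implemented.

```fortran
module loop_order_sketch
  implicit none
  private
  public :: matmul_jki
contains
  ! Hypothetical helper: computes a = b*c with the loops ordered j, k, i.
  ! The innermost loop runs over i, so a(:,j) and b(:,k) are both read
  ! along a column, i.e. with stride 1 in memory.
  subroutine matmul_jki(b, c, a)
    real, intent(in)  :: b(:,:), c(:,:)
    real, intent(out) :: a(size(b,1), size(c,2))
    integer :: i, j, k

    a = 0.0
    do j = 1, size(c,2)
      do k = 1, size(b,2)
        do i = 1, size(b,1)
          a(i,j) = a(i,j) + b(i,k) * c(k,j)
        end do
      end do
    end do
  end subroutine matmul_jki
end module loop_order_sketch
```

Moving the i loop outermost would give the same result but would access b
and a with a stride equal to the number of rows, which is usually less
cache-friendly.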

In the simplest case, the memory of "B" is slower because part of the page is
accessed by two processors and it keeps getting kicked out of the level 1 or
level 2 cache - and, hence, needs to be loaded from the L3/L4 cache or the
normal RAM.

Besides those reasons, there is also an overhead of generating multiple jobs
and synchronizing them - i.e. waiting until all processes have called "init"
and all processes have called "finished" in the library. - With many processes,
this overall barrier (all waiting for all) might take a very sizable amount of
the total run time.

For 1 vs. 2 jobs, it should be negligible but still some libraries have an
explicit short cut for 1 job.

[Bug fortran/87622] coarray does not run in parallel

2018-10-17 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87622

--- Comment #2 from Thomas Koenig  ---
Some more remarks: In a benchmark, it is best to actually fill the values of
all read variables to something defined, for example with a call to
random_number. Also, if you generate values which you do not use later, the
compiler may decide to remove the calculation altogether. What works well for
this is something like

read (*,*) i,j
print *, a(i,j)

at the end and then invoke the program with

echo 1 1 | ./a.out

And, of course, please supply the timings.
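
A minimal sketch of the timing part of such a benchmark follows. It makes
several assumptions: plain (non-coarray) arrays, a square problem of size n,
and the made-up names bench_sketch and timed_matmul. The inputs are filled
with random_number so that every value read is defined, and the result is
returned to the caller so it can still be used afterwards.

```fortran
module bench_sketch
  use iso_fortran_env, only: int64
  implicit none
  private
  public :: timed_matmul
contains
  ! Hypothetical helper: fills b and c with defined values, then times
  ! a = matmul(b, c) with system_clock. Returns a and the elapsed seconds.
  subroutine timed_matmul(n, a, seconds)
    integer, intent(in)            :: n
    real, allocatable, intent(out) :: a(:,:)
    real, intent(out)              :: seconds
    real, allocatable :: b(:,:), c(:,:)
    integer(int64) :: t0, t1, rate

    allocate (a(n,n), b(n,n), c(n,n))
    call random_number(b)          ! every input element is defined
    call random_number(c)

    call system_clock(t0, rate)
    a = matmul(b, c)
    call system_clock(t1)

    seconds = 0.0
    if (rate > 0) seconds = real(t1 - t0) / real(rate)
  end subroutine timed_matmul
end module bench_sketch
```

A driver program would call timed_matmul, print the elapsed time, and then
read i and j from standard input and print a(i,j) as shown above, so the
compiler cannot tell at compile time which element of the result is needed.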

[Bug fortran/87622] coarray does not run in parallel

2018-10-17 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87622

Thomas Koenig  changed:

   What|Removed |Added

 CC||tkoenig at gcc dot gnu.org

--- Comment #1 from Thomas Koenig  ---
This could be caused by cache effects, or by some interaction between
coarrays and the matmul library functions.

Can you supply some more details such as compiler options, compiler version,
CPU type, MPI version etc? And do you have the possibility of monitoring cache
misses?