I tried the settings suggested by Peter and they indeed improve things
further. Running on 64 cores with the line (in dyn_rules)
8192 2 0 0 # 8k+, pairwise 2, no topo or segmentation
I get the following
bw for 100 x 10 B : 1.9 Mbytes/s time was: 65.4 ms
bw for 100 x 20 B
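For reference, the "8192 2 0 0" rule above is one line of a coll_tuned dynamic
rules file. A complete dyn_rules file along the lines discussed in this thread
would look roughly like the sketch below; the overall layout and the algorithm
IDs (3 = bruck, 2 = pairwise) are assumed from the comments quoted in the
thread rather than checked against the coll_tuned sources:

    1           # number of collectives described in this file
    3           # collective ID for Alltoall (assumed; see coll_tuned.h)
    1           # number of communicator sizes
    64          # the rules below apply to 64-rank communicators
    2           # number of message-size rules
    0    3 0 0  # from 0 bytes: bruck (3), no topo, no segmentation
    8192 2 0 0  # 8k+, pairwise 2, no topo or segmentation

The intent of such a file is that messages below the 8192 threshold use bruck
and larger ones use pairwise, so basic_linear is never selected (see Peter's
note elsewhere in the thread about how that threshold is actually interpreted).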
> I tried to run with the first dynamic rules file that Pavel proposed
> and it works, the time per one MD step on 48 cores decreased from 2.8
> s to 1.8 s as expected.
Good news :-)
Pasha.
Thanks
Roman
On Wed, May 20, 2009 at 7:18 PM, Pavel Shamis (Pasha) wrote:
> Tomorrow I will add some
On Wednesday 20 May 2009, Roman Martonak wrote:
> I tried to run with the first dynamic rules file that Pavel proposed
> and it works, the time per one MD step on 48 cores decreased from 2.8
> s to 1.8 s as expected. It was clearly the basic linear algorithm that
> was causing the problem. I will check the performance of bruck and
> pairwise on my HW.
> I tried to run with the first dynamic rules file that Pavel proposed
> and it works, the time per one MD step on 48 cores decreased from 2.8
> s to 1.8 s as expected. It was clearly the basic linear algorithm that
> was causing the problem. I will check the performance of bruck and
> pairwise on my HW. It
Tomorrow I will add some printfs to the collective code and check what really
happens there...
Pasha
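For anyone who wants to make the same check, a temporary debug print in the
Alltoall decision path is enough. The fragment below is only a sketch: the
variable names comm_size and block_dsize are assumptions about the 1.3-era
code in coll_tuned_decision_fixed.c, not copied from it.

    /* temporary debug output near the top of the fixed Alltoall decision
     * routine in coll_tuned_decision_fixed.c (variable names assumed) */
    fprintf(stderr, "alltoall dec_fixed: comm_size=%d block_dsize=%lu\n",
            comm_size, (unsigned long) block_dsize);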
Peter Kjellstrom wrote:
> On Wednesday 20 May 2009, Pavel Shamis (Pasha) wrote:
> Disabling basic_linear seems like a good idea but your config file sets
> the cut-off at 128 Bytes for 64 ranks (the field you set to 8192 seems to
> result in a message size of that value divided by the number of ranks).
On Wednesday 20 May 2009, Pavel Shamis (Pasha) wrote:
> > Disabling basic_linear seems like a good idea but your config file sets
> > the cut-off at 128 Bytes for 64-ranks (the field you set to 8192 seems to
> > result in a message size of that value divided by the number of ranks).
> >
> > In my t
Disabling basic_linear seems like a good idea but your config file sets the
cut-off at 128 Bytes for 64 ranks (the field you set to 8192 seems to result
in a message size of that value divided by the number of ranks).
In my testing bruck seems to win clearly (at least for 64 ranks on my IB) u
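Spelling out the arithmetic behind that observation, and taking the
divide-by-ranks reading at face value:

    8192 / 64 ranks = 128 bytes per rank   (the cut-off actually in effect)
    8192 * 64       = 524288               (field value needed for an 8 KB per-rank cut-off)

The 524288 figure is only an inference from Peter's description, not something
verified against the coll_tuned code.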
On Wednesday 20 May 2009, Pavel Shamis (Pasha) wrote:
> > With the file Pavel has provided things have changed to the following.
> > (maybe someone can confirm)
> >
> > If message size < 8192
> > bruck
> > else
> > pairwise
> > end
>
> You are right here. The target of my conf file is to disable basic_linear
On Wednesday 20 May 2009, Rolf Vandevaart wrote:
...
> If I am understanding what is happening, it looks like the original
> MPI_Alltoall made use of three algorithms. (You can look in
> coll_tuned_decision_fixed.c)
>
> If message size < 200 or communicator size > 12
>    bruck
> else if message s
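For readers without the source at hand, the fixed decision described above has
roughly the shape sketched below. This is a self-contained paraphrase, not the
Open MPI code itself: the 200-byte value comes from the message above, the
3000-byte intermediate threshold is an assumption about the 1.3-era defaults,
and the small-message branch is written as an AND of the two conditions, since
with an OR a 64-rank job could never have landed in basic_linear, which is
exactly what was observed in this thread.

    /* Schematic of the fixed Alltoall decision discussed above.
     * Thresholds and structure are assumptions, not quoted code. */
    #include <stddef.h>
    #include <stdio.h>

    enum alg { BASIC_LINEAR = 1, PAIRWISE = 2, BRUCK = 3 };

    static enum alg alltoall_fixed_choice(size_t block_dsize, int comm_size)
    {
        if (block_dsize < 200 && comm_size > 12)
            return BRUCK;           /* small messages on larger communicators */
        if (block_dsize < 3000)     /* intermediate threshold: assumed value */
            return BASIC_LINEAR;    /* the slow case seen in this thread */
        return PAIRWISE;            /* large messages */
    }

    int main(void)
    {
        /* a few per-destination block sizes on a 64-rank communicator */
        printf("100 B  -> alg %d\n", alltoall_fixed_choice(100, 64));
        printf("1.5 KB -> alg %d\n", alltoall_fixed_choice(1500, 64));
        printf("16 KB  -> alg %d\n", alltoall_fixed_choice(16384, 64));
        return 0;
    }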
> The correct MCA parameters are the following:
> -mca coll_tuned_use_dynamic_rules 1
> -mca coll_tuned_dynamic_rules_filename ./dyn_rules
Ohh.. it was my mistake.
> You can also run the following command:
> ompi_info -mca coll_tuned_use_dynamic_rules 1 -param coll tuned
> This will give some insight
The correct MCA parameters are the following:
-mca coll_tuned_use_dynamic_rules 1
-mca coll_tuned_dynamic_rules_filename ./dyn_rules
You can also run the following command:
ompi_info -mca coll_tuned_use_dynamic_rules 1 -param coll tuned
This will give some insight into all the various algorithms
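Putting those two parameters together on an actual launch line gives something
like the sketch below; the rank count, hostfile and executable name are
placeholders, not taken from this thread:

    mpirun -np 64 --hostfile ./hosts \
        -mca coll_tuned_use_dynamic_rules 1 \
        -mca coll_tuned_dynamic_rules_filename ./dyn_rules \
        ./cpmd.x wat32.inp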
Many thanks for the highly helpful analysis. Indeed, what Peter says
seems to be precisely the case here. I tried to run the 32 waters test
on 48 cores now, with the original cutoff of 100 Ry and with a slightly
increased one of 110 Ry. Normally, with a larger cutoff, it should
obviously take more time
Default algorithm thresholds in MVAPICH are different from those in Open MPI.
Using tuned collectives in Open MPI, you can configure the Open MPI
Alltoall thresholds to match the MVAPICH defaults.
The following MCA parameters configure Open MPI to use custom rules that
are defined in a configuration (text) file.
"--mca use_dynam
On Tuesday 19 May 2009, Roman Martonak wrote:
> On Tue, May 19, 2009 at 3:29 PM, Peter Kjellstrom wrote:
> > On Tuesday 19 May 2009, Roman Martonak wrote:
> > ...
> >> openmpi-1.3.2 time per one MD step is 3.66 s
> >> ELAPSED TIME : 0 HOURS 1 MINUTES 25.90 SECONDS
On Tue, May 19, 2009 at 3:29 PM, Peter Kjellstrom wrote:
> On Tuesday 19 May 2009, Roman Martonak wrote:
> ...
>> openmpi-1.3.2 time per one MD step is 3.66 s
>> ELAPSED TIME : 0 HOURS 1 MINUTES 25.90 SECONDS
>> = ALL TO ALL COMM 102033. BYTES
On Tuesday 19 May 2009, Roman Martonak wrote:
...
> openmpi-1.3.2 time per one MD step is 3.66 s
> ELAPSED TIME : 0 HOURS 1 MINUTES 25.90 SECONDS
> = ALL TO ALL COMM 102033. BYTES 4221. =
> = ALL TO ALL COMM 7.802 MB/S
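As a quick sanity check on those counters (assuming the ALL TO ALL COMM figure
is the payload of a single Alltoall call):

    102033 bytes / 64 ranks = ~1.6 KB per destination

which is too large for the small-message (bruck) range and well below any
large-message cut-off, i.e. exactly the region where the default fixed rules
pick the slow basic_linear algorithm, as discussed elsewhere in the thread.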
I am using CPMD 3.11.1, not cp2k. Below are the timings for 20 steps
of MD for 32 water molecules (one of the standard CPMD benchmarks) with
Open MPI, MVAPICH and Intel MPI, running on 64 cores (8 blades, each
with two quad-core 2.2 GHz AMD Barcelona CPUs).
openmpi-1.3.2 time per
Hi Pavel
This is not my league, but here are some
helpful CPMD links (code, benchmarks):
http://www.cpmd.org/
http://www.cpmd.org/cpmd_thecode.html
http://www.theochem.ruhr-uni-bochum.de/~axel.kohlmeyer/cpmd-bench.html
IHIH
Gus Correa
Noam Bernstein wrote:
On May 18, 2009, at 12:50 PM, Pavel
On May 18, 2009, at 12:50 PM, Pavel Shamis (Pasha) wrote:
> Roman,
> Can you please share the MVAPICH numbers that you get? Also,
> which MVAPICH version are you using?
> The default MVAPICH and Open MPI IB tuning is very similar, so it is
> strange to see such a big difference. Do you know what kind of
Roman,
Can you please share the MVAPICH numbers that you get? Also, which
MVAPICH version are you using?
The default MVAPICH and Open MPI IB tuning is very similar, so it is strange
to see such a big difference. Do you know what kind of collective operations
are used in this specific application?
Hi Roman
Note that in 1.3.0 and 1.3.1 the default ("-mca mpi_leave_pinned 1")
had a glitch. In my case it appeared as a memory leak.
See this:
http://www.open-mpi.org/community/lists/users/2009/05/9173.php
http://www.open-mpi.org/community/lists/announce/2009/03/0029.php
One workaround is to
I've been using --mca mpi_paffinity_alone 1 in all simulations. Concerning "-mca
mpi_leave_pinned 1", I tried it with openmpi 1.2.X versions and it
makes no difference.
Best regards
Roman
On Mon, May 18, 2009 at 4:57 PM, Pavel Shamis (Pasha) wrote:
>
>>
>> 1) I was told to add "-mca mpi_leave_
> 1) I was told to add "-mca mpi_leave_pinned 0" to avoid problems with
> InfiniBand. This was with OpenMPI 1.3.1. Not sure if the problems were
> fixed in 1.3.2, but I am hanging on to that setting j
Actually, for the 1.2.X versions I would recommend enabling leave pinned:
"-mca mpi_leave_pinned 1"
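The affinity and pinning knobs discussed in these messages are ordinary MCA
parameters as well, so they can go on the same command line; the sketch below
merely restates the flags quoted above, with the rank count and executable as
placeholders:

    mpirun -np 64 \
        -mca mpi_paffinity_alone 1 \
        -mca mpi_leave_pinned 1 \
        ./cpmd.x wat32.inp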
Hi Roman, list
Sorry, now I see I totally missed your well-taken point.
Your comparison of OpenMPI vs. IntelMPI scaling undercuts
my argument that problem size and halo overhead
are the likely cause of the bad scaling.
Or at least it makes my argument an inadvertent red herring.
All I can think of now i
Hi Gus,
What I am reporting is definitely an openmpi scaling problem. The 32
waters problem I am talking about does scale to 64 cores, as clearly
shown by the numbers I posted, if I use IntelMPI (or mvapich) instead
of openmpi, on the same hardware, same code, same compiler, same Intel
mkl librar
Hi Roman
I googled around and found that CPMD is a molecular dynamics program.
(What would become of civilization without Google?)
Unfortunately, I have pretty much wiped from my mind
Schrodinger's equation, Quantum Mechanics,
and the Born approximation,
which I learned probably before you were born.
I could
Hi Roman
Just a guess.
Is this a domain decomposition code?
(I had never heard of "cpmd 32 waters" before, sorry.)
Is it based on finite differences, finite volumes, or finite elements?
If it is, once the size of the subdomains becomes too small compared to
the size of the halo around them, the overhe
Hello,
I observe very poor scaling with openmpi on an HP blade system consisting
of 8 blades (each with two quad-core 2.2 GHz AMD Barcelona CPUs),
interconnected with an InfiniBand fabric. When running the standard cpmd
32 waters test, I observe the following scaling (the numbers are
elapsed time):
op