On Nov 4, 2009, at 12:46 , George Markomanolis wrote:
I have some questions. I am using some programs for profiling: when you
say that the cost of allreduce rises, do you mean only the time, or also
the flops of this operation? Is there some additional work added to the
allreduce, or is it only about time? During profiling I want to count
the flops, so a small difference in timing because of debug mode and the
choice of allreduce algorithm is not a big deal, but if it also changes
the flops then that is bad for me.
Using a linear algorithm for reduce will clearly increase the number of
floating-point operations on the root (assuming, of course, that the
reduction operates on floating-point data), and will decrease them on
the other nodes. Imagine that instead of having the computations nicely
spread over all the nodes, you put them all on the root. This is what
happens with the linear reduction.
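To make the imbalance concrete, here is a small sketch (plain Python, not Open MPI code; the binomial-tree schedule is just one illustrative way of spreading the work) counting where the additions land for a sum-reduce of one value per rank:

```python
def linear_reduce_flops(n):
    # Linear reduce: the root receives n-1 contributions and performs
    # all n-1 additions itself; every other rank does none.
    flops = [0] * n
    flops[0] = n - 1
    return flops

def tree_reduce_flops(n):
    # Binomial-tree reduce: in each round, each surviving rank combines
    # its partial result with one peer's, so the additions are spread
    # across the non-leaf ranks instead of piling up on the root.
    flops = [0] * n
    step = 1
    while step < n:
        for r in range(0, n, 2 * step):
            if r + step < n:
                flops[r] += 1  # rank r absorbs rank r+step's partial
        step *= 2
    return flops

print(linear_reduce_flops(8))  # root does all 7 additions
print(tree_reduce_flops(8))    # rank 0 does only 3 of the 7 additions
```

Both schedules perform the same total number of additions; only their placement differs, which is exactly what a flop-counting profiler will see.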
When I executed a program in debug mode I saw that Open MPI uses some
algorithms, and when I looked at your code I saw that rank 0 is not
always the root cpu (if I understood correctly). Finally, do you have
any opinion on the best way to know which algorithm is used in a
collective communication, and which cpu is the root of the
communication?
For the linear implementation of allreduce, Open MPI always uses rank 0
in the communicator as the root. The code is in the $
(OMPI_SRCDIR)/ompi/mca/coll/tuned/coll_tuned_allreduce.c file at line
895.
george.
Best regards,
George
Today's Topics:
1. Re: using specific algorithm for collective communication,
and knowing the root cpu? (George Bosilca)
----------------------------------------------------------------------
Message: 1
Date: Tue, 3 Nov 2009 12:09:18 -0500
From: George Bosilca <bosi...@eecs.utk.edu>
Subject: Re: [OMPI users] using specific algorithm for collective
communication, and knowing the root cpu?
To: Open MPI Users <us...@open-mpi.org>
Message-ID: <e59919b2-42c1-49af-803a-ab4450609...@eecs.utk.edu>
Content-Type: text/plain; charset=us-ascii; format=flowed; delsp=yes
You can add the following MCA parameters either on the command line
or in the $(HOME)/.openmpi/mca-params.conf file.
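As a sketch, the two forms could look like this (the parameter values, process count, and binary name ./a.out are illustrative placeholders):

```
# On the command line:
mpirun --mca coll_tuned_use_dynamic_rules 1 \
       --mca coll_tuned_allreduce_algorithm 1 -np 4 ./a.out

# Or persistently, as lines in $(HOME)/.openmpi/mca-params.conf:
coll_tuned_use_dynamic_rules = 1
coll_tuned_allreduce_algorithm = 1
```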
On Nov 2, 2009, at 08:52 , George Markomanolis wrote:
Dear all,
I would like to ask about collective communication. With debug
mode enabled, I can see a lot of information during execution about
which algorithm is used, etc. But my question is that I would like to
use a specific algorithm (the simplest one, I suppose). I am profiling
some applications and I want to simulate them with another
program, so I must be able to know, for example, what
mpi_allreduce is doing. I saw many algorithms that depend on the
message size and the number of processors, so I would like to ask:
1) What is the way to tell Open MPI to use a simple algorithm
for allreduce (is there any way to tell it to use the simplest
algorithm for all collective communication?)? Basically I
would like to know the root cpu for every collective
communication. What are the disadvantages of demanding the
simplest algorithm?
coll_tuned_use_dynamic_rules=1 to allow you to manually set the
algorithms to be used.
coll_tuned_allreduce_algorithm=*something between 0 and 5* to
describe the algorithm to be used. For the simplest algorithm I
guess you will want to use 1 (star-based fan-in/fan-out).
The main disadvantage is that the cost of the allreduce will rise,
which will negatively impact the overall performance of the
application.
2) Is there any overhead because I installed Open MPI with debug
mode enabled, even if I just run a program without any --mca flags?
There is a lot of overhead because you compile in debug mode. We do a
lot of extra tracking of internally allocated memory, checks on
most/all internal objects, and so on. Based on previous results I
would say your latency increases by about 2-3 microseconds, but the
impact on bandwidth is minimal.
3) How would you describe allreduce in words? Can we say that the
root cpu does a reduce and then a broadcast? I mean, is that right for
your implementation? I saw that which cpu is the root depends on the
algorithm, so is it possible to use an algorithm where I will know
every time that the cpu with rank 0 is the root?
Exactly: allreduce = reduce + bcast (and, by the way, this is what the
basic algorithm will do). However, there is no root in an allreduce,
as all processors execute symmetric work. Of course, if one sees
the allreduce as a reduce followed by a broadcast, then one has to
select a root (I guess we pick rank 0 in our implementation).
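That decomposition can be sketched without MPI at all (plain Python; the per-rank values and the sum operation are illustrative, and rank 0 stands in for the root that the reduce+bcast view forces you to pick):

```python
def allreduce_as_reduce_bcast(per_rank_values):
    # Reduce phase: rank 0 (the chosen root) combines every rank's
    # contribution into a single result.
    root_result = sum(per_rank_values)
    # Bcast phase: the root's result is delivered back to every rank,
    # so afterwards all ranks hold the same reduced value.
    return [root_result] * len(per_rank_values)

values = [1.0, 2.0, 3.0, 4.0]          # one contribution per rank
print(allreduce_as_reduce_bcast(values))  # every rank ends with 10.0
```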
george.
Thanks a lot,
George
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users