On Nov 4, 2009, at 12:46 , George Markomanolis wrote:

I have some questions, because I am using some programs for profiling. When you say that the cost of the allreduce rises, do you mean the time only, or also the flops of this command? Is there some additional work added to the allreduce, or is it only about time? During profiling I want to count the flops, so a small difference in timing caused by debug mode and the choice of allreduce algorithm is not a big deal, but if it also changes the flops then that is bad for me.

Using a linear algorithm for reduce will clearly increase the number of fp operations on the root (assuming, of course, that the reduction operates on fp data), and will decrease the fp operations on the other nodes. Imagine that instead of having the computations nicely spread over all the nodes, you put them all on the root. This is what happens with the linear reduction.
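To make the imbalance concrete, here is a minimal sketch of a linear sum-reduction on doubles (the function is made up for illustration; this is not Open MPI's actual code): every non-root rank sends its buffer to the root, and the root alone performs all (P-1)*n additions.

    #include <stdlib.h>
    #include <string.h>
    #include <mpi.h>

    /* Linear reduce with MPI_SUM semantics: all additions land on the root. */
    static void linear_reduce_sum(double *sendbuf, double *recvbuf, int n,
                                  int root, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        if (rank != root) {
            /* Non-root ranks only send; they perform zero additions. */
            MPI_Send(sendbuf, n, MPI_DOUBLE, root, 0, comm);
            return;
        }

        double *tmp = malloc(n * sizeof(double));
        memcpy(recvbuf, sendbuf, n * sizeof(double));
        for (int peer = 0; peer < size; peer++) {
            if (peer == root) continue;
            MPI_Recv(tmp, n, MPI_DOUBLE, peer, 0, comm, MPI_STATUS_IGNORE);
            for (int i = 0; i < n; i++)
                recvbuf[i] += tmp[i];  /* n additions per peer, all on the root */
        }
        free(tmp);
    }

A tree-based reduce distributes the same (P-1)*n additions so that each rank does at most on the order of log2(P)*n of them.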

When I executed a program in debug mode I saw that Open MPI uses some algorithms, and when I looked at your code I saw that rank 0 is not always the root CPU (if I understood correctly). Finally, do you have any opinion about the best way to know which algorithm is used in a collective communication, and which CPU is its root?

For the linear implementation of allreduce, Open MPI always uses rank 0 in the communicator as the root. The code is in the $(OMPI_SRCDIR)/ompi/mca/coll/tuned/coll_tuned_allreduce.c file, at line 895.

  george.


Best regards,
George




Date: Tue, 3 Nov 2009 12:09:18 -0500
From: George Bosilca <bosi...@eecs.utk.edu>
Subject: Re: [OMPI users] using specific algorithm for collective communication, and knowing the root cpu?
To: Open MPI Users <us...@open-mpi.org>

You can add the following MCA parameters either on the command line or in the $(HOME)/.openmpi/mca-params.conf file.

On Nov 2, 2009, at 08:52 , George Markomanolis wrote:


Dear all,

I would like to ask about collective communication. With debug mode enabled, I can see a lot of information during execution about which algorithm is used, etc. My question is that I would like to use a specific algorithm (the simplest, I suppose). I am profiling some applications and I want to simulate them with another program, so I must be able to know, for example, what MPI_Allreduce is doing. I saw many algorithms that depend on the message size and the number of processors, so I would like to ask:

1) What is the way to tell Open MPI to use a simple algorithm for allreduce (and is there any way to tell it to use the simplest algorithm for all collective communications)? Basically, I would like to know the root CPU for every collective communication. What are the disadvantages of demanding the simplest algorithm?


coll_tuned_use_dynamic_rules=1 allows you to manually set the algorithms to be used, and coll_tuned_allreduce_algorithm=*something between 0 and 5* describes the algorithm to be used. For the simplest algorithm I guess you will want to use 1 (star-based fan-in/fan-out).
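For instance, with the parameter values suggested above (the application name and process count are placeholders, and the algorithm numbering may vary between Open MPI versions), either of these should work:

    # On the command line:
    mpirun --mca coll_tuned_use_dynamic_rules 1 \
           --mca coll_tuned_allreduce_algorithm 1 \
           -np 16 ./my_app

    # Or persistently, in $(HOME)/.openmpi/mca-params.conf:
    coll_tuned_use_dynamic_rules = 1
    coll_tuned_allreduce_algorithm = 1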

The main disadvantage is that the cost of the allreduce will rise, which will negatively impact the overall performance of the application.


2) Is there any overhead because I installed Open MPI in debug mode, even if I just run a program without any --mca flag?


There is a lot of overhead because you compiled in debug mode. We do a lot of extra tracking of internally allocated memory, checks on most/all internal objects, and so on. Based on previous results I would say your latency increases by about 2-3 microseconds, but the impact on the bandwidth is minimal.


3) How would you describe allreduce in words? Can we say that the root CPU does a reduce and then a broadcast? I mean, is that right for your implementation? I saw that which CPU is the root depends on the algorithm, so is it possible to use an algorithm such that I will know every time that the CPU with rank 0 is the root?


Exactly: allreduce = reduce + bcast (and by the way, this is what the basic algorithm will do). However, there is no root in an allreduce, as all processors execute symmetric work. Of course, if one sees the allreduce as a reduce followed by a broadcast, then one has to select a root (I guess we pick rank 0 in our implementation).
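In other words, here is a sketch of that decomposition using the standard MPI calls (the function name is made up for illustration; this is not Open MPI's internal code), with rank 0 picked as the root:

    #include <mpi.h>

    /* Allreduce expressed as a reduce followed by a broadcast,
     * both rooted at rank 0. */
    int allreduce_as_reduce_bcast(void *sendbuf, void *recvbuf, int count,
                                  MPI_Datatype dtype, MPI_Op op, MPI_Comm comm)
    {
        int rc = MPI_Reduce(sendbuf, recvbuf, count, dtype, op, 0, comm);
        if (rc != MPI_SUCCESS) return rc;
        return MPI_Bcast(recvbuf, count, dtype, 0, comm);
    }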

  george.


Thanks a lot,
George
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



