Re: [OMPI users] Distribute app over open mpi

2009-11-06 Thread Ralph Castain
I'm afraid we don't position binaries for you, though we have talked
about providing that capability. I have some code in the system that
*will* do it for certain special circumstances, but not in the general
case, though much of the required infrastructure exists in the code
today. I'm not sure how well that will do what you want.


There are a number of reasons why we don't do it (locating required  
libraries, etc) - I think they have been reported on the user list  
several times over the years.



On Nov 6, 2009, at 9:10 AM, Arnaud Westenberg wrote:


Hi all,

Sorry for the newbie question, but I'm having a hard time finding  
the answer, as I'm not even familiar with the terminology...


I've set up a small cluster on Ubuntu (hardy) and everything is
working great, including SLURM etc. If I run the well-known 'Pi'
program I get the proper results returned from all the nodes.


However, I'm looking for a way such that I wouldn't need to install
the application on each node, nor on the shared NFS. Currently I get
the obvious error that the app is not found on the nodes on which it
isn't installed.


The idea is that the master node would thus distribute the required  
(parts of the) program to the slave nodes so they can perform the  
assigned work.


The reason is that I want to run an FEA package on a much larger
(Red Hat) cluster we currently use for CFD calculations. I really
don't want to mess up the cluster, as we bought it already configured,
and compiling new versions of the FEA package on it turns out to be
a missing-library nightmare.


Thanks for your help.

Regards,


Arnaud







Re: [OMPI users] an environment variable with same meaning than the -x option of mpiexec

2009-11-06 Thread Ralph Castain
Not at the moment - though I imagine we could create one. It is a tad  
tricky in that we allow multiple -x options on the cmd line, but we  
obviously can't do that with an envar.


The most likely solution would be to specify multiple "-x" equivalents  
by separating them with a comma in the envar. It would take some  
parsing to make it all work, but not impossible.
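
For reference, this is what the current command-line form looks like (the
application name and variables here are just placeholders):

   mpirun -x LD_LIBRARY_PATH -x OMP_NUM_THREADS=4 -np 8 ./my_app

The hypothetical envar would simply carry that same list, comma separated.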


I can add it to the "to-do" list for a rainy day :-)


On Nov 6, 2009, at 7:59 AM, Paul Kapinos wrote:


Dear OpenMPI developer,

with the -x option of mpiexec there is a way to distribute environment
variables:


-x   Export the specified environment variables to the remote
     nodes before executing the program.


Is there an environment variable (OMPI_...) with the same meaning?
Writing environment variables on the command line is ugly and
tedious...


I've searched for this info on the Open MPI web pages for about an hour
and didn't find the answer :-/



Thanking you in anticipation,

Paul




--
Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23,  D 52074  Aachen (Germany)
Tel: +49 241/80-24915




Re: [OMPI users] Programming Help needed

2009-11-06 Thread amjad ali
Thanks, dear all,

Jonathan's reply seems almost perfect; I perceive the same.

On Fri, Nov 6, 2009 at 6:17 PM, Tom Rosmond  wrote:

> AMJAD
>
> On your first question, the answer is probably yes, if everything else is
> done correctly.  The first test is not to try to overlap the
> communication and computation, but to do them sequentially and make sure
> the answers are correct. Have you done this test?  Debugging your
> original approach will be challenging, and having a control solution
> will be a big help.
>

I followed the path of sequential, then parallel blocking, and then
parallel non-blocking.
My serial solution is the control solution.


>
> On your second question, if I understand it correctly, the answer is that
> it is always better to minimize the number of messages.  In problems like
> this, communication costs are dominated by latency, so bundling the data
> into the fewest possible messages will ALWAYS be better.
>

That's good.
But the point Jonathan made:

"If you really do hide most of the communications cost with your non-blocking
communications, then it may not matter too much."

is the one I want to be sure about.


> T. Rosmond
>
>
>
> On Fri, 2009-11-06 at 17:44 -0500, amjad ali wrote:
> > Hi all,
> >
> > I need/request some help from those who have some experience in
> > debugging/profiling/tuning parallel scientific codes, especially for
> > PDEs/CFD.
> >
> > I have parallelized a Fortran CFD code to run on an Ethernet-based
> > Linux cluster. Regarding MPI communication, what I do is this:
> >
> > Suppose that the grid/mesh is decomposed across n processors, such
> > that each processor has a number of elements that share a side/face
> > with elements on other processors. I start non-blocking MPI
> > communication on the partition boundary faces (faces shared between
> > any two processors), and then start computing values on the
> > internal/non-shared faces. When I complete this computation, I call
> > WAITALL to ensure the MPI communication has completed. Then I do the
> > computation on the partition boundary (shared) faces. This way I try
> > to hide the communication behind computation. Is this correct?
> >
> > IMPORTANT: Secondly, if processor A shares 50 faces (on 50 or fewer
> > elements) with another processor B, then it sends/recvs 50 different
> > messages. So in general, if a processor has X faces shared with any
> > number of other processors, it sends/recvs that many messages. Does
> > this approach have "very much reduced" performance compared with the
> > alternative in which processor A sends/recvs a single bundled message
> > (containing all the data for the 50 faces) to processor B? That is,
> > in general a processor would only send/recv as many messages as it
> > has neighbouring processors, sending a single bundle/pack of data to
> > each neighbouring processor.
> > Is there "quite a big difference" between these two approaches?
> >
> > THANK YOU VERY MUCH.
> > AMJAD.
> >
> >


Re: [OMPI users] Programming Help needed

2009-11-06 Thread Tom Rosmond
AMJAD

On your first question, the answer is probably yes, if everything else is
done correctly.  The first test is not to try to overlap the
communication and computation, but to do them sequentially and make sure
the answers are correct. Have you done this test?  Debugging your
original approach will be challenging, and having a control solution
will be a big help.

On your second question, if I understand it correctly, the answer is that
it is always better to minimize the number of messages.  In problems like
this, communication costs are dominated by latency, so bundling the data
into the fewest possible messages will ALWAYS be better.

T. Rosmond



On Fri, 2009-11-06 at 17:44 -0500, amjad ali wrote:
> Hi all,
> 
> I need/request some help from those who have some experience in
> debugging/profiling/tuning parallel scientific codes, especially for
> PDEs/CFD.
> 
> I have parallelized a Fortran CFD code to run on an Ethernet-based
> Linux cluster. Regarding MPI communication, what I do is this:
> 
> Suppose that the grid/mesh is decomposed across n processors, such that
> each processor has a number of elements that share a side/face with
> elements on other processors. I start non-blocking MPI communication on
> the partition boundary faces (faces shared between any two processors),
> and then start computing values on the internal/non-shared faces. When
> I complete this computation, I call WAITALL to ensure the MPI
> communication has completed. Then I do the computation on the partition
> boundary (shared) faces. This way I try to hide the communication
> behind computation. Is this correct?
> 
> IMPORTANT: Secondly, if processor A shares 50 faces (on 50 or fewer
> elements) with another processor B, then it sends/recvs 50 different
> messages. So in general, if a processor has X faces shared with any
> number of other processors, it sends/recvs that many messages. Does
> this approach have "very much reduced" performance compared with the
> alternative in which processor A sends/recvs a single bundled message
> (containing all the data for the 50 faces) to processor B? That is, in
> general a processor would only send/recv as many messages as it has
> neighbouring processors, sending a single bundle/pack of data to each
> neighbouring processor.
> Is there "quite a big difference" between these two approaches?
> 
> THANK YOU VERY MUCH.
> AMJAD.
> 
> 



Re: [OMPI users] Programming Help needed

2009-11-06 Thread Jonathan Dursi

Hi, Amjad:


[...]
What I do is that I start non-blocking MPI communication on the partition
boundary faces (faces shared between any two processors), and then start
computing values on the internal/non-shared faces. When I complete this
computation, I call WAITALL to ensure the MPI communication has completed.
Then I do the computation on the partition boundary (shared) faces. This
way I try to hide the communication behind computation. Is this correct?


As long as your numerical method allows you to do this (that is, you 
definitely don't need those boundary values to compute the internal 
values), then yes, this approach can hide some of the communication 
costs very effectively.  The way I'd program this if I were doing it 
from scratch would be to do the usual blocking approach (no one computes 
anything until all the faces are exchanged) first and get that working, 
then break up the computation step into internal and boundary 
computations and make sure it still works, and then change the messaging 
to isends/irecvs/waitalls, and make sure it still works, and only then 
interleave the two.
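
A bare-bones C sketch of that final, interleaved stage (the buffers,
neighbour arrays and compute_* helpers are placeholders of mine, just for
illustration) might look like:

   MPI_Request reqs[2 * MAX_NEIGH];   /* assumes MAX_NEIGH >= nneigh */
   int n = 0;
   for (int i = 0; i < nneigh; i++) {
       MPI_Irecv(recvbuf[i], count[i], MPI_DOUBLE, neigh[i], TAG,
                 MPI_COMM_WORLD, &reqs[n++]);
       MPI_Isend(sendbuf[i], count[i], MPI_DOUBLE, neigh[i], TAG,
                 MPI_COMM_WORLD, &reqs[n++]);
   }
   compute_internal_faces();                  /* needs no remote data        */
   MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE); /* halo data has now arrived   */
   compute_boundary_faces();                  /* uses the received face data */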


IMPORTANT: Secondly, if processor A shares 50 faces (on 50 or fewer
elements) with another processor B, then it sends/recvs 50 different
messages. So in general, if a processor has X faces shared with any number
of other processors, it sends/recvs that many messages. Does this approach
have "very much reduced" performance compared with the alternative in
which processor A sends/recvs a single bundled message (containing all the
data for the 50 faces) to processor B? That is, in general a processor
would only send/recv as many messages as it has neighbouring processors,
sending a single bundle/pack of data to each neighbouring processor.

Is there "quite a big difference" between these two approaches?


Your individual element faces that are being communicated are likely 
quite small.   It is quite generally the case that bundling many small 
messages into large messages can significantly improve performance, as 
you avoid incurring the repeated latency costs of sending many messages.


As always, though, the answer is `it depends', and the only way to know 
is to try it both ways.   If you really do hide most of the 
communications cost with your non-blocking communications, then it may 
not matter too much.  In addition, if you don't know beforehand how much 
data you need to send/receive, then you'll need a handshaking step which 
introduces more synchronization and may actually hurt performance, or 
you'll have to use MPI2 one-sided communications.   On the other hand, 
if this shared boundary doesn't change through the simulation, you could 
just figure out at start-up time how big the messages will be between 
neighbours and use that as the basis for the usual two-sided messages.


My experience is that there's an excellent chance you'll improve the 
performance by packing the little messages into fewer larger messages.
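
A rough C sketch of the packing idea (again, all the names are illustrative;
one contiguous buffer per neighbour instead of one message per face):

   for (int i = 0; i < nneigh; i++) {
       /* gather this neighbour's face values into one contiguous buffer... */
       for (int f = 0; f < nfaces[i]; f++)
           packbuf[i][f] = face_value[face_id[i][f]];
       /* ...and send it as a single message */
       MPI_Isend(packbuf[i], nfaces[i], MPI_DOUBLE, neigh[i], TAG,
                 MPI_COMM_WORLD, &reqs[i]);
   }
   /* post a matching MPI_Irecv of nfaces[i] doubles per neighbour, and
      unpack after MPI_Waitall */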


   Jonathan
--
Jonathan Dursi 


[OMPI users] Programming Help needed

2009-11-06 Thread amjad ali
Hi all,

I need/request some help from those who have some experience in
debugging/profiling/tuning parallel scientific codes, especially for
PDEs/CFD.

I have parallelized a Fortran CFD code to run on an Ethernet-based Linux
cluster. Regarding MPI communication, what I do is this:

Suppose that the grid/mesh is decomposed across n processors, such that
each processor has a number of elements that share a side/face with
elements on other processors. I start non-blocking MPI communication on
the partition boundary faces (faces shared between any two processors),
and then start computing values on the internal/non-shared faces. When I
complete this computation, I call WAITALL to ensure the MPI communication
has completed. Then I do the computation on the partition boundary
(shared) faces. This way I try to hide the communication behind
computation. Is this correct?

IMPORTANT: Secondly, if processor A shares 50 faces (on 50 or fewer
elements) with another processor B, then it sends/recvs 50 different
messages. So in general, if a processor has X faces shared with any number
of other processors, it sends/recvs that many messages. Does this approach
have "very much reduced" performance compared with the alternative in which
processor A sends/recvs a single bundled message (containing all the data
for the 50 faces) to processor B? That is, in general a processor would
only send/recv as many messages as it has neighbouring processors, sending
a single bundle/pack of data to each neighbouring processor.
Is there "quite a big difference" between these two approaches?

THANK YOU VERY MUCH.
AMJAD.


[OMPI users] Distribute app over open mpi

2009-11-06 Thread Arnaud Westenberg
Hi all,

Sorry for the newbie question, but I'm having a hard time finding the answer, 
as I'm not even familiar with the terminology...

I've set up a small cluster on Ubuntu (hardy) and everything is working great,
including SLURM etc. If I run the well-known 'Pi' program I get the proper
results returned from all the nodes.

However, I'm looking for a way such that I wouldn't need to install the
application on each node, nor on the shared NFS. Currently I get the obvious
error that the app is not found on the nodes on which it isn't installed.

The idea is that the master node would thus distribute the required (parts of 
the) program to the slave nodes so they can perform the assigned work.

The reason is that I want to run an FEA package on a much larger (Red Hat)
cluster we currently use for CFD calculations. I really don't want to mess up
the cluster, as we bought it already configured, and compiling new versions of
the FEA package on it turns out to be a missing-library nightmare.

Thanks for your help.

Regards,


Arnaud





[OMPI users] mpirun noticed that process rank 1 ... exited on signal 13 (Broken pipe).

2009-11-06 Thread Kritiraj Sajadah
Hi Everyone,
  I have installed Open MPI 1.3 and BLCR 0.8.1 on my laptop (single
processor).

I am trying to checkpoint a small test application:

###

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* simple MPI test: each rank prints its rank/size, sleeps, and repeats */
int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 10");
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 10");
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 10");
    printf("mpisleep bye \n");
    MPI_Finalize();
    return 0;
}
###

I compile it as follows:

mpicc mpisleep.c -o mpisleep

and I run it as follows:

mpirun -am ft-enable-cr -np 2 mpisleep.

When I try checkpointing it (ompi-checkpoint -v 8118), it checkpoints fine, but
when I restart it, I get the following:

I am processor no 0 of a total of 2 procs 
I am processor no 1 of a total of 2 procs 
mpisleep bye 
--
mpirun noticed that process rank 1 with PID 8118 on node raj-laptop exited on 
signal 13 (Broken pipe).
--

Any suggestions are very much appreciated

Raj





[OMPI users] an environment variable with same meaning than the -x option of mpiexec

2009-11-06 Thread Paul Kapinos

Dear OpenMPI developer,

with the -x option of mpiexec there is a way to distribute environment
variables:


 -x   Export the specified environment variables to the remote
      nodes before executing the program.


Is there an environment variable (OMPI_...) with the same meaning?
Writing environment variables on the command line is ugly and tedious...


I've searched for this info on the Open MPI web pages for about an hour and
didn't find the answer :-/



Thanking you in anticipation,

Paul




--
Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23,  D 52074  Aachen (Germany)
Tel: +49 241/80-24915




Re: [OMPI users] Segmentation fault whilst running RaXML-MPI

2009-11-06 Thread Nick Holway
Hi,

Thank you for the information. I'm going to try the new Intel
compilers, which I'm downloading now, but as they're taking so long to
download I don't think I'll be able to look into this again until
after the weekend. BTW, using their Java-based downloader is a bit
less painful than their normal download.

In the meantime, if anyone else has some suggestions then please let me know.

Thanks

Nick

2009/11/5 Jeff Squyres :
> FWIW, I think Intel released 11.1.059 earlier today (I've been trying to
> download it all morning).  I doubt it's an issue in this case, but I thought
> I'd mention it as a public service announcement.  ;-)
>
> Seg faults are *usually* an application issue (never say "never", but they
> *usually* are).  You might want to first contact the RaXML team to see if
> there are any known issues with their software and Open MPI 1.3.3...?
>  (Sorry, I'm totally unfamiliar with RaXML)
>
> On Nov 5, 2009, at 12:30 PM, Nick Holway wrote:
>
>> Dear all,
>>
>> I'm trying to run RaXML 7.0.4 on my 64bit Rocks 5.1 cluster (ie Centos
>> 5.2). I compiled Open MPI 1.3.3 using the Intel compilers v 11.1.056
>> using ./configure CC=icc CXX=icpc F77=ifort FC=ifort --with-sge
>> --prefix=/usr/prog/mpi/openmpi/1.3.3/x86_64-no-mem-man
>> --with-memory-manager=none.
>>
>> When I run RaXML in a qlogin session using
>> /usr/prog/mpi/openmpi/1.3.3/x86_64-no-mem-man/bin/mpirun -np 8
>> /usr/prog/bioinformatics/RAxML/7.0.4/x86_64/RAxML-7.0.4/raxmlHPC-MPI
>> -f a -x 12345 -p12345 -# 10 -m GTRGAMMA -s
>> /users/holwani1/jay/ornodko-1582 -n mpitest39
>>
>> I get the following output:
>>
>> This is the RAxML MPI Worker Process Number: 1
>> This is the RAxML MPI Worker Process Number: 3
>>
>> This is the RAxML MPI Master process
>>
>> This is the RAxML MPI Worker Process Number: 7
>>
>> This is the RAxML MPI Worker Process Number: 4
>>
>> This is the RAxML MPI Worker Process Number: 5
>>
>> This is the RAxML MPI Worker Process Number: 2
>>
>> This is the RAxML MPI Worker Process Number: 6
>> IMPORTANT WARNING: Alignment column 1695 contains only undetermined
>> values which will be treated as missing data
>>
>>
>> IMPORTANT WARNING: Sequences A4_H10 and A3ii_E11 are exactly identical
>>
>>
>> IMPORTANT WARNING: Sequences A2_A08 and A9_C10 are exactly identical
>>
>>
>> IMPORTANT WARNING: Sequences A3ii_B03 and A3ii_C06 are exactly identical
>>
>>
>> IMPORTANT WARNING: Sequences A9_D08 and A9_F10 are exactly identical
>>
>>
>> IMPORTANT WARNING: Sequences A3ii_F07 and A9_C08 are exactly identical
>>
>>
>> IMPORTANT WARNING: Sequences A6_F05 and A6_F11 are exactly identical
>>
>> IMPORTANT WARNING
>> Found 6 sequences that are exactly identical to other sequences in the
>> alignment.
>> Normally they should be excluded from the analysis.
>>
>>
>> IMPORTANT WARNING
>> Found 1 column that contains only undetermined values which will be
>> treated as missing data.
>> Normally these columns should be excluded from the analysis.
>>
>> An alignment file with undetermined columns and sequence duplicates
>> removed has already
>> been printed to file /users/holwani1/jay/ornodko-1582.reduced
>>
>>
>> You are using RAxML version 7.0.4 released by Alexandros Stamatakis in
>> April 2008
>>
>> Alignment has 1280 distinct alignment patterns
>>
>> Proportion of gaps and completely undetermined characters in this
>> alignment: 0.124198
>>
>> RAxML rapid bootstrapping and subsequent ML search
>>
>>
>> Executing 10 rapid bootstrap inferences and thereafter a thorough ML
>> search
>>
>> All free model parameters will be estimated by RAxML
>> GAMMA model of rate heteorgeneity, ML estimate of alpha-parameter
>> GAMMA Model parameters will be estimated up to an accuracy of
>> 0.10 Log Likelihood units
>>
>> Partition: 0
>> Name: No Name Provided
>> DataType: DNA
>> Substitution Matrix: GTR
>> Empirical Base Frequencies:
>> pi(A): 0.261129 pi(C): 0.228570 pi(G): 0.315946 pi(T): 0.194354
>>
>>
>> Switching from GAMMA to CAT for rapid Bootstrap, final ML search will
>> be conducted under the GAMMA model you specified
>> Bootstrap[10]: Time 44.442728 bootstrap likelihood -inf, best
>> rearrangement setting 5
>> Bootstrap[0]: Time 44.814948 bootstrap likelihood -inf, best
>> rearrangement setting 5
>> Bootstrap[6]: Time 46.470371 bootstrap likelihood -inf, best
>> rearrangement setting 6
>> [compute-0-11:08698] *** Process received signal ***
>> [compute-0-11:08698] Signal: Segmentation fault (11)
>> [compute-0-11:08698] Signal code: Address not mapped (1)
>> [compute-0-11:08698] Failing at address: 0x408
>> [compute-0-11:08698] [ 0] /lib64/libpthread.so.0 [0x3fb580de80]
>> [compute-0-11:08698] [ 1]
>>
>> /usr/prog/bioinformatics/RAxML/7.0.4/x86_64/RAxML-7.0.4/raxmlHPC-MPI(hookup+0)
>> [0x413ca0]
>> [compute-0-11:08698] [ 2]
>>
>> /usr/prog/bioinformatics/RAxML/7.0.4/x86_64/RAxML-7.0.4/raxmlHPC-MPI(restoreTL+0xd9)
>> [0x442c09]
>> [compute-0-11:08698] [ 3]
>> 

Re: [OMPI users] Question about checkpoint/restart protocol

2009-11-06 Thread Josh Hursey


On Nov 5, 2009, at 4:46 AM, Mohamed Adel wrote:


Dear Sergio,

Thank you for your reply. I've inserted the modules into the kernel
and it all worked fine. But there is still a weird issue. I use the
command "mpirun -n 2 -am ft-enable-cr -H comp001 checkpoint-restart-test"
to start an MPI job. I then use "ompi-checkpoint PID" to checkpoint the
job, but ompi-checkpoint doesn't respond, and mpirun produces the
following.


--
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

 Local host:  comp001.local (PID 23514)
 MPI_COMM_WORLD rank: 0

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--
[login01.local:21425] 1 more process has sent help message help-mpi- 
runtime.txt / mpi_init:warn-fork
[login01.local:21425] Set MCA parameter "orte_base_help_aggregate"  
to 0 to see all help / error messages


Note: if the -n option has a value greater than 1, then this error
occurs; but if -n is 1, then ompi-checkpoint succeeds, mpirun
produces the same message, and ompi-restart fails with the message:

[login01:21417] *** Process received signal ***
[login01:21417] Signal: Segmentation fault (11)
[login01:21417] Signal code: Address not mapped (1)
[login01:21417] Failing at address: (nil)
[login01:21417] [ 0] /lib64/libpthread.so.0 [0x32df20de70]
[login01:21417] [ 1] /home/mab/openmpi-1.3.3/lib/openmpi/ 
mca_crs_blcr.so [0x2b093509dfee]
[login01:21417] [ 2] /home/mab/openmpi-1.3.3/lib/openmpi/ 
mca_crs_blcr.so(opal_crs_blcr_restart+0xd9) [0x2b093509d251]

[login01:21417] [ 3] opal-restart [0x401c3e]
[login01:21417] [ 4] /lib64/libc.so.6(__libc_start_main+0xf4)  
[0x32dea1d8b4]

[login01:21417] [ 5] opal-restart [0x401399]
[login01:21417] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 21417 on node  
login01.local exited on signal 11 (Segmentation fault).

--

Any help with this will be appreciated.


I have not seen this behavior before. The first error is Open MPI
warning you that one of your MPI processes is trying to use fork(), so
you may want to make sure that your application is not using any
system() or fork() function calls. Open MPI internally should not be
using any of these functions from within the MPI library linked to the
application.


When you reloaded the BLCR module, did you rebuild Open MPI and  
install it in a clean directory (not over the top of the old directory)?


Have you tried to checkpoint/restart an non-MPI process with BLCR on  
your system? This will help to rule out installation problems with BLCR.


I suspect that Open MPI is not building correctly, or something in  
your build environment is confusing/corrupting the build. Can you send  
me your config.log, it may help me pinpoint the problem if it is build  
related.


-- Josh



Thanks in advance,
Mohamed Adel


From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] On  
Behalf Of Sergio Díaz [sd...@cesga.es]

Sent: Thursday, November 05, 2009 11:38 AM
To: Open MPI Users
Subject: Re: [OMPI users] Question about checkpoint/restart protocol

Hi,

Did you load the BLCR modules before compiling OpenMPI?

Regards,
Sergio

Mohamed Adel wrote:

Dear OMPI users,

I'm a new Open MPI user. I've configured openmpi-1.3.3 with these
options: "./configure --prefix=/home/mab/openmpi-1.3.3 --with-sge
--enable-ft-thread --with-ft=cr --enable-mpi-threads --enable-static
--disable-shared --with-blcr=/home/mab/blcr-0.8.2/", then compiled
and installed it successfully.
Now I'm trying to use the checkpoint/restart protocol. I run a
program with the options "mpirun -n 2 -am ft-enable-cr -H localhost
prime/checkpoint-restart-test" but I receive the following error:


*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[madel:28896] Abort before MPI_INIT completed successfully; not  
able to guarantee that all other processes were killed!

--
It looks like opal_init failed for some reason; your parallel  
process is

likely to abort.  There are many reasons that a parallel 

Re: [OMPI users] checkpoint opempi-1.3.3+sge62

2009-11-06 Thread Josh Hursey


On Oct 28, 2009, at 7:41 AM, Sergio Díaz wrote:


Hello,

I have managed to checkpoint a simple program without SGE. Now
I'm trying to do the Open MPI + SGE integration, but I have some
problems... When I try to checkpoint the mpirun PID, I get an
error similar to the one you get when the PID doesn't exist. See the
example below.


I do not have any experience with the SGE environment, so I suspect
that there may be something 'special' about the environment that is
tripping up the ompi-checkpoint tool.


First of all, what version of Open MPI are you using?

Some things to check:
 - Does 'ompi-ps' work when your application is running?
 - Is there a /tmp/openmpi-sessions-* directory on the node where
mpirun is currently running? This directory contains information on
how to connect to the mpirun process from an external tool; if it's
missing, then this could be the cause of the problem.




Any ideas?
Does somebody have a script to do it automatically with SGE? For
example, I have one that checkpoints every X seconds with BLCR for
non-MPI jobs. It is launched by SGE if you have configured the queue
and the ckpt environment.


I do not know of any integration of the Open MPI checkpointing work  
with SGE at the moment.


As far as time-triggered checkpointing goes, I have a feature ticket
open about this:

  https://svn.open-mpi.org/trac/ompi/ticket/1961

It is not available yet, but in the works.




Is it possible to choose the name of the ckpt folder when you do the
ompi-checkpoint? I can't find an option to do it.


Not at this time. Though I could see it being a useful feature, and it
shouldn't be too hard to implement. I filed a ticket if you want to
follow the progress:

  https://svn.open-mpi.org/trac/ompi/ticket/2098

-- Josh




Regards,
Sergio




[sdiaz@compute-3-17 ~]$ ps auxf

root 20044  0.0  0.0  4468 1224 ?S13:28   0:00  \_  
sge_shepherd-2645150 -bg
sdiaz20072  0.0  0.0 53172 1212 ?Ss   13:28   0:00   
\_ -bash /opt/cesga/sge62/default/spool/compute-3-17/job_scripts/ 
2645150
sdiaz20112  0.2  0.0 41028 2480 ?S13:28
0:00  \_ mpirun -np 2 -am ft-enable-cr pi3
sdiaz20113  0.0  0.0 36484 1824 ?Sl   13:28
0:00  \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -inherit - 
nostdin -V compute-3-18..
sdiaz20116  1.2  0.0 99464 4616 ?Sl   13:28
0:00  \_ pi3



[sdiaz@compute-3-17 ~]$ ompi-checkpoint 20112
[compute-3-17.local:20124] HNP with PID 20112 Not found!

[sdiaz@compute-3-17 ~]$ ompi-checkpoint -s 20112
[compute-3-17.local:20135] HNP with PID 20112 Not found!

[sdiaz@compute-3-17 ~]$ ompi-checkpoint -s --term 20112
[compute-3-17.local:20136] HNP with PID 20112 Not found!

[sdiaz@compute-3-17 ~]$ ompi-checkpoint --hnp-pid 20112
--
ompi-checkpoint PID_OF_MPIRUN
  Open MPI Checkpoint Tool

   -am Aggregate MCA parameter set file list
   -gmca|--gmca  
 Pass global MCA parameters that are  
applicable to
 all contexts (arg0 is the parameter name;  
arg1 is

 the parameter value)
-h|--helpThis help message
   --hnp-jobid This should be the jobid of the HNP whose
 applications you wish to checkpoint.
   --hnp-pid   This should be the pid of the mpirun whose
 applications you wish to checkpoint.
   -mca|--mca  
 Pass context-specific MCA parameters; they  
are
 considered global if --gmca is not used and  
only
 one context is specified (arg0 is the  
parameter

 name; arg1 is the parameter value)
-s|--status  Display status messages describing the  
progression

 of the checkpoint
   --termTerminate the application after checkpoint
-v|--verbose Be Verbose
-w|--nowait  Do not wait for the application to finish
 checkpointing before returning

--
[sdiaz@compute-3-17 ~]$ exit
logout
Connection to c3-17 closed.
[sdiaz@svgd mpi_test]$ ssh c3-18
Last login: Wed Oct 28 13:24:12 2009 from svgd.local
-bash-3.00$ ps auxf |grep sdiaz

sdiaz14412  0.0  0.0  1888  560 ?Ss   13:28   0:00   
\_ /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter /opt/cesga/sge62/ 
default/spool/compute-3-18/active_jobs/2645150.1/1.compute-3-18
sdiaz14419  0.0  0.0 35728 2260 ?S13:28
0:00  \_ orted -mca ess env -mca orte_ess_jobid 2295267328 - 
mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri  
2295267328.0;tcp://192.168.4.144:36596 -mca  
mca_base_param_file_prefix ft-enable-cr -mca  
mca_base_param_file_path 

Re: [OMPI users] problems with checkpointing an mpi job

2009-11-06 Thread Josh Hursey


On Oct 30, 2009, at 1:35 PM, Hui Jin wrote:


Hi All,
I got a problem when trying to checkpoint an MPI job.
I would really appreciate it if you can help me fix the problem.
The BLCR package was installed successfully on the cluster.
I configured Open MPI with the flags:
./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads
--with-blcr=/usr/local --with-blcr-libdir=/usr/local/lib/

The installation looks correct. The Open MPI version is 1.3.3.

I got the following output when issuing ompi_info:

root@hec:/export/home/hjin/test# ompi_info | grep ft
   MCA rml: ftrm (MCA v2.0, API v2.0, Component v1.3.3)
root@hec:/export/home/hjin/test# ompi_info | grep crs
   MCA crs: none (MCA v2.0, API v2.0, Component v1.3.3)
It seems the MCA crs component is missing, but I have no idea how to get it.


This is an artifact of the way ompi_info searches for components. This  
came up before on the users list:

  http://www.open-mpi.org/community/lists/users/2009/09/10667.php

I filed a bug about this, if you want to track its progress:
  https://svn.open-mpi.org/trac/ompi/ticket/2097



To run a checkpointable application, I run:
mpirun -np 2 --host hec -am ft-enable-cr test_mpi

However, when trying to checkpoint from another terminal on the same
host, I get the following:

root@hec:~# ompi-checkpoint -v 29234
[hec:29243] orte_checkpoint: Checkpointing...
[hec:29243]  PID 29234
[hec:29243]  Connected to Mpirun [[46621,0],0]
[hec:29243] orte_checkpoint: notify_hnp: Contact Head Node Process  
PID 29234
[hec:29243] orte_checkpoint: notify_hnp: Requested a checkpoint of  
jobid [INVALID]

[hec:29243] orte_checkpoint: hnp_receiver: Receive a command message.
[hec:29243] orte_checkpoint: hnp_receiver: Status Update.
[hec:29243] Requested - Global Snapshot Reference:  
(null)

[hec:29243] orte_checkpoint: hnp_receiver: Receive a command message.
[hec:29243] orte_checkpoint: hnp_receiver: Status Update.
[hec:29243]   Pending - Global Snapshot Reference:  
(null)

[hec:29243] orte_checkpoint: hnp_receiver: Receive a command message.
[hec:29243] orte_checkpoint: hnp_receiver: Status Update.
[hec:29243]   Running - Global Snapshot Reference:  
(null)


There is an error message at the terminal of the running application,
such as:

--
Error: The process with PID 29236 is not checkpointable.
 This could be due to one of the following:
  - An application with this PID doesn't currently exist
  - The application with this PID isn't checkpointable
  - The application with this PID isn't an OPAL application.
 We were looking for the named files:
   /tmp/opal_cr_prog_write.29236
   /tmp/opal_cr_prog_read.29236
--
[hec:29234] local) Error: Unable to initiate the handshake with peer  
[[46621,1],1]. -1
[hec:29234] [[46621,0],0] ORTE_ERROR_LOG: Error in file  
snapc_full_global.c at line 567
[hec:29234] [[46621,0],0] ORTE_ERROR_LOG: Error in file  
snapc_full_global.c at line 1054


This means that either the MPI application did not respond to the  
checkpoint request in time, or that the application was not  
checkpointable for some other reason.


Some options to try:
 - Set the 'snapc_full_max_wait_time' MCA parameter to say 60, the  
default is 20 seconds before giving up. You can also set it to 0,  
which indicates to the runtime to wait indefinitely.

   shell$ mpirun -mca snapc_full_max_wait_time 60
 - Try cleaning out the /tmp directory on all of the nodes, maybe  
this has something to do with disks being full (though usually we  
would see other symptoms).


If those options do not help, can you send me the config.log from your
build of Open MPI? I would suspect that something in the configure of
Open MPI might have gone wrong.


-- Josh







Does anyone have a hint to fix this problem?

Thanks,
Hui Jin





Re: [OMPI users] problem using openmpi with DMTCP

2009-11-06 Thread Josh Hursey

(Sorry for the excessive delay in replying)

I do not have any experience with the DMTCP project, so I can only
speculate on what might be going on here. If you are using DMTCP to
transparently checkpoint Open MPI, you will need to make sure that you
are not using any interconnect other than TCP.


If you are building an OPAL CRS component for DMTCP (actually you  
probably want their MTCP project which is just the local checkpoint/ 
restart service), then what you might be seeing are the TCP sockets  
that are left open across a checkpoint operation. As an optimization  
for checkpoint->continue we leave sockets open when we checkpoint.  
Since most checkpoint/restart services will skip over the socket fd  
(since they are not supported) and take the checkpoint we leave them  
open, and close them only on restart. I suspect that DMTCP is erroring  
out since it is trying to do something else with those fds.


You may want to try just using the MTCP project, or ask for a way to  
shut off the socket negotiation and just ignore the socket fds.


Let me know how it goes.

-- Josh

On Sep 28, 2009, at 9:55 AM, Kritiraj Sajadah wrote:


Dear All,
  I am trying to integrate DMTCP with Open MPI. If I run a C
application, it works fine. But when I execute the program using
mpirun, it checkpoints the application but gives an error when
restarting the application.


#
[31007] WARNING at connection.cpp:303 in restore; REASON='JWARNING 
((_sockDomain == AF_INET || _sockDomain == AF_UNIX ) && _sockType ==  
SOCK_STREAM) failed'

id() = 2ab3f248-30933-4ac0d75a(99007)
_sockDomain = 10
_sockType = 1
_sockProtocol = 0
Message: socket type not yet [fully] supported
[31007] WARNING at connection.cpp:303 in restore; REASON='JWARNING 
((_sockDomain == AF_INET || _sockDomain == AF_UNIX ) && _sockType ==  
SOCK_STREAM) failed'

id() = 2ab3f248-30943-4ac0d75c(99007)
_sockDomain = 10
_sockType = 1
_sockProtocol = 0
Message: socket type not yet [fully] supported
[31013] WARNING at connection.cpp:87 in restartDup2; REASON='JWARNING 
(_real_dup2 ( oldFd, fd ) == fd) failed'

oldFd = 537
fd = 1
(strerror((*__errno_location ( = Bad file descriptor
[31013] WARNING at connectionmanager.cpp:627 in closeAll;  
REASON='JWARNING(_real_close ( i->second ) ==0) failed'

i->second = 537
(strerror((*__errno_location ( = Bad file descriptor
[31015] WARNING at connectionmanager.cpp:627 in closeAll;  
REASON='JWARNING(_real_close ( i->second ) ==0) failed'

i->second = 537
(strerror((*__errno_location ( = Bad file descriptor
[31017] WARNING at connectionmanager.cpp:627 in closeAll;  
REASON='JWARNING(_real_close ( i->second ) ==0) failed'

i->second = 537
(strerror((*__errno_location ( = Bad file descriptor
[31007] WARNING at connectionmanager.cpp:627 in closeAll;  
REASON='JWARNING(_real_close ( i->second ) ==0) failed'

i->second = 537
(strerror((*__errno_location ( = Bad file descriptor
MTCP: mtcp_restart_nolibc: mapping current version of /usr/lib/gconv/ 
gconv-modules.cache into memory;

 _not_ file as it existed at time of checkpoint.
 Change mtcp_restart_nolibc.c:634 and re-compile, if you want  
different behavior.
[31015] ERROR at connection.cpp:372 in restoreOptions;  
REASON='JASSERT(ret == 0) failed'

(strerror((*__errno_location ( = Invalid argument
fds[0] = 6
opt->first = 26
opt->second.size() = 4
Message: restoring setsockopt failed
Terminating...
#

Any suggestions are very welcome.

regards,

Raj







Re: [OMPI users] Changing location where checkpoints are saved

2009-11-06 Thread Josh Hursey

(Sorry for the excessive delay in replying)

On Sep 30, 2009, at 11:02 AM, Constantinos Makassikis wrote:


Thanks for the reply!

Concerning the mca options for checkpointing:
- are verbosity options (e.g., crs_base_verbose) limited to the values
0 and 1?
- in priority options (e.g., crs_blcr_priority), do lower numbers
indicate higher priority?


By searching in the archives of the mailing list I found two  
interesting/useful posts:
- [1] http://www.open-mpi.org/community/lists/users/2008/09/6534.php  
(for different checkpointing schemes)
- [2] http://www.open-mpi.org/community/lists/users/2009/05/9385.php  
(for restarting)


Following the indications given in [1], I tried to make each process
checkpoint itself in its local /tmp and centralize the resulting
checkpoints in /tmp or $HOME:

Excerpt from mca-params.conf:
-
snapc_base_store_in_place=0
snapc_base_global_snapshot_dir=/tmp or $HOME
crs_base_snapshot_dir=/tmp

COMMANDS used:
--
mpirun -n 2 -machinefile machines -am ft-enable-cr a.out
ompi-checkpoint mpirun_pid



OUTPUT of ompi-checkpoint -v 16753
--
[ic85:17044] orte_checkpoint: Checkpointing...
[ic85:17044] PID 17036
[ic85:17044] Connected to Mpirun [[42098,0],0]
[ic85:17044] orte_checkpoint: notify_hnp: Contact Head Node Process  
PID 17036
[ic85:17044] orte_checkpoint: notify_hnp: Requested a checkpoint of  
jobid [INVALID]

[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Requested - Global Snapshot Reference:  
(null)

[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044]   Pending - Global Snapshot Reference:  
(null)

[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044]   Running - Global Snapshot Reference:  
(null)

[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] File Transfer - Global Snapshot Reference:  
(null)

[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Error - Global Snapshot Reference:  
ompi_global_snapshot_17036.ckpt




OUTPUT of MPIRUN


[ic85:17038] crs:blcr: blcr_checkpoint_peer: Thread finished with  
status 3
[ic86:20567] crs:blcr: blcr_checkpoint_peer: Thread finished with  
status 3

--
WARNING: Could not preload specified file: File already exists.

Fileset: /tmp/ompi_global_snapshot_17036.ckpt/0
Host: ic85

Will continue attempting to launch the process.

--
[ic85:17036] filem:rsh: wait_all(): Wait failed (-1)
[ic85:17036] [[42098,0],0] ORTE_ERROR_LOG: Error in  
file ../../../../../orte/mca/snapc/full/snapc_full_global.c at line  
1054


This is a warning about creating the global snapshot directory  
(ompi_global_snapshot_17036.ckpt) for the first checkpoint (seq 0). It  
seems to indicate that the directory existed when the file gather  
started.


A couple of things to check:
 - Did you clean out of /tmp on all of the nodes any files starting
with "opal" or "ompi"?
 - Does the error go away when you set
snapc_base_global_snapshot_dir=$HOME?
 - Could you try running against a v1.3 release? (I wonder if this
feature has been broken on the trunk.)


Let me know what you find. In the next couple days, I'll try to test  
the trunk again with this feature to make sure that it is still  
working on my test machines.


-- Josh






Does anyone have an idea about what is wrong?


Best regards,

--
Constantinos



Josh Hursey wrote:
This is described in the C/R User's Guide attached to the webpage  
below:

 https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR

Additionally this has been addressed on the users mailing list in  
the past, so searching around will likely turn up some examples.


-- Josh

On Sep 18, 2009, at 11:58 AM, Constantinos Makassikis wrote:


Dear all,

I have installed blcr 0.8.2 and Open MPI (r21973) on my NFS  
account. By default,
it seems that checkpoints are saved in $HOME. However, I would  
prefer them

to be saved on a local disk (e.g.: /tmp).

Does anyone know how I can change the location where Open MPI  
saves checkpoints?



Best regards,

--
Constantinos



Re: [OMPI users] mpirun example program fail on multiple nodes- unable to launch specified application on client node

2009-11-06 Thread Josh Hursey
As an alternative technique for distributing the binary, you could ask
Open MPI's runtime to do it for you (made available in the v1.3
series). You still need to make sure that the same version of Open MPI
is installed on all nodes, but if you pass the --preload-binary option
to mpirun, the runtime environment will distribute the binary across
the machines (staging it to a temporary directory) before launching it.


You can do the same with any arbitrary set of files or directories  
(comma separated) using the --preload-files option as well.
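
For example, invocations might look like this (the hostfile and file
names here are just placeholders):

   mpirun -np 4 --hostfile myhosts --preload-binary ./my_app
   mpirun -np 4 --hostfile myhosts --preload-files input.dat,params.cfg ./my_app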


If you type 'mpirun --help', the options that you are looking for are
--preload-files, --preload-files-dest-dir (the destination directory to
be used in conjunction with --preload-files; by default the absolute and
relative paths provided by --preload-files are used), and
-s|--preload-binary (preload the binary on the remote machine before
launching it).


-- Josh

On Nov 5, 2009, at 6:56 PM, Terry Frankcombe wrote:

For small ad hoc COWs I'd vote for sshfs too.  It may well be as  
slow as
a dog, but it actually has some security, unlike NFS, and is a  
doddle to

make work with no superuser access on the server, unlike NFS.


On Thu, 2009-11-05 at 17:53 -0500, Jeff Squyres wrote:

On Nov 5, 2009, at 5:34 PM, Douglas Guptill wrote:

I am currently using sshfs to mount both OpenMPI and my  
application on

the "other" computers/nodes.  The advantage to this is that I have
only one copy of OpenMPI and my application.  There may be a
performance penalty, but I haven't seen it yet.




For a small number of nodes (where small <=32 or sometimes even  
<=64),

I find that simple NFS works just fine.  If your apps aren't IO
intensive, that can greatly simplify installation and deployment of
both Open MPI and your MPI applications IMNSHO.

But -- every app is different.  :-)  YMMV.







Re: [OMPI users] Help: Firewall problems

2009-11-06 Thread Jeff Squyres

On Nov 6, 2009, at 5:49 AM, Lee Amy wrote:


Thanks. Actually, I don't know whether I need to disable iptables to run
MPI programs properly. Obviously, from your words, Open MPI will use
random ports, so how do I set up iptables to let trusted machines
open their random ports?




I'm afraid I'm not enough of an iptables expert to know (I don't know  
if anyone else on this list will be, either) -- you'll need to check  
the iptables docs to see.


Sorry!

--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] Help: Firewall problems

2009-11-06 Thread Lee Amy
On Fri, Nov 6, 2009 at 2:39 AM, Jeff Squyres  wrote:
> On Nov 5, 2009, at 11:28 AM, Lee Amy wrote:
>
>> I remember MPI does not rely on TCP/IP, but why does the default iptables
>> configuration prevent the MPI programs from running? After I stop iptables,
>> the programs run well. I use Ethernet as the connection.
>>
>
>
> Note that Open MPI *can* use TCP as an interface for MPI messaging.  It
> definitely uses TCP for administrative control of MPI jobs, even if TCP is
> not used for MPI messaging.  Open MPI therefore basically requires the
> ability to open sockets between all nodes in the job on random TCP ports.
>
> You could probably configure iptables to "trust" all the machines in your
> cluster (i.e., allow TCP sockets to/from random ports) but disallow most
> (all?) TCP connections from outside your cluster, if you wanted to...?
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>

Thanks. Actually, I don't know whether I need to disable iptables to run
MPI programs properly. Obviously, from your words, Open MPI will use
random ports, so how do I set up iptables to let trusted machines
open their random ports?

Regards,

Amy



Re: [OMPI users] using specific algorithm for collective communication and knowing the root cpu?

2009-11-06 Thread George Bosilca

On Nov 4, 2009, at 12:46 , George Markomanolis wrote:

I have some questions, because I am using some programs for
profiling. When you say that the cost of allreduce rises, do you mean
the time only, or also the flops of this operation? Is there some
additional work added to the allreduce, or is it only about time?
During profiling I want to count the flops, so a small difference in
timing because of debug mode and the choice of allreduce algorithm is
not a big deal, but if it also changes the flops then that is bad for
me.


Using a linear algorithm for reduce will clearly increase the number
of fp operations on the root (assuming, of course, that the reduction
is working on fp data), and will decrease the fp operations on the
other nodes. Imagine that instead of having the computations nicely
spread over all the nodes, you put them all on the root. This is what
happens with the linear reduction.
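
(A rough worked example, with illustrative numbers of my own: reducing a
vector of N doubles across P ranks takes about (P-1)*N additions in total
regardless of the algorithm. A linear reduce puts all (P-1)*N of them on
the root and essentially none on the other ranks, whereas a binomial-tree
reduce spreads them out so that no rank does more than roughly
log2(P)*N of them.)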


When I executed a program in debug mode I saw that Open MPI uses
various algorithms, and when I looked at your code I saw that rank 0 is
not always the root CPU (if I understood correctly). Finally, in your
opinion, what is the best way to know which algorithm is used in a
collective communication, and which CPU is the root of the
communication?


For the linear implementation of allreduce, Open MPI always uses
rank 0 in the communicator as the root. The code is in the
$(OMPI_SRCDIR)/ompi/mca/coll/tuned/coll_tuned_allreduce.c file at line
895.


  george.



Best regards,
George




On Nov 3, 2009, at 12:09 PM, George Bosilca wrote:

You can add the following MCA parameters either on the command line  
or  in the $(HOME)/.openmpi/mca-params.conf file.


On Nov 2, 2009, at 08:52 , George Markomanolis wrote:



Dear all,

I would like to ask about collective communication. With debug
mode enabled, I can see a lot of info during execution about which
algorithm is used, etc. But my question is that I would like to
use a specific algorithm (the simplest, I suppose). I am profiling
some applications and I want to simulate them with another
program, so I must be able to know, for example, what
mpi_allreduce is doing. I saw many algorithms that depend on the
message size and the number of processors, so I would like to ask:


1) What is the way to tell Open MPI to use a simple algorithm
for allreduce (is there any way to ask for the simplest
algorithm for all the collective communication?). Basically I
would like to know the root CPU for every collective
communication. What are the disadvantages of demanding the
simplest algorithm?




coll_tuned_use_dynamic_rules=1 to allow you to manually set the
algorithms to be used.
coll_tuned_allreduce_algorithm=*something between 0 and 5* to describe
the algorithm to be used. For the simplest algorithm I guess you will
want to use 1 (star-based fan-in fan-out).
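
For example (the process count and application name below are just
placeholders), either on the command line:

   mpirun --mca coll_tuned_use_dynamic_rules 1 \
          --mca coll_tuned_allreduce_algorithm 1 -np 16 ./my_app

or in $(HOME)/.openmpi/mca-params.conf:

   coll_tuned_use_dynamic_rules = 1
   coll_tuned_allreduce_algorithm = 1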


The main disadvantage is that the cost of the allreduce will rise,
which will negatively impact the overall performance of the
application.



2) Is there any overhead because I installed Open MPI in debug
mode, even if I just run a program without any --mca flags?




There is a lot of overhead because you compiled in debug mode. We do a
lot of extra tracking of internally allocated memory, checks on
most/all internal objects, and so on. Based on previous results I
would say your latency increases by about 2-3 microseconds, but the
impact on the bandwidth is minimal.



3) How would you describe allreduce in words? Can we say that the
root CPU does a reduce and then a broadcast? I mean, is that right for
your implementation? I saw that which CPU is the root depends on the
algorithm, so is it possible to use an algorithm where I will
know every time that the CPU with rank 0 is the root?




Exactly, allreduce = reduce + bcast (and btw this is what the
basic algorithm will do). However, there is no root in an allreduce,
as all processors execute symmetric work. Of course, if one sees the
allreduce as a reduce followed by a broadcast, then one has to
select a root (I guess we pick rank 0 in our implementation).
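
In C, that reduce-then-broadcast view of an allreduce (with placeholder
buffer and count names) is simply:

   /* allreduce expressed as a reduce to rank 0 followed by a broadcast */
   MPI_Reduce(sendbuf, recvbuf, count, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
   MPI_Bcast(recvbuf, count, MPI_DOUBLE, 0, MPI_COMM_WORLD);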


  george.



Thanks a lot,
George





