Re: [OMPI users] Openmpi problem

2012-05-03 Thread Ralph Castain
You apparently are running on a cluster that uses Torque, yes? If so, it won't 
use ssh to do the launch - it uses Torque to do it, so the passwordless ssh 
setup is irrelevant.

Did you ensure that your LD_LIBRARY_PATH includes the OMPI install lib location?


On May 3, 2012, at 9:59 AM, Acero Fernandez Alicia wrote:

> 
> 
> Hello,
> 
> I have a problem running an MPI program with the Open MPI library. I did the
> following:
> 
> 
> 1. I installed OFED 1.5.4 from RHEL. The hardware is QLogic 7340 IB cards.
> 
> 2. I am using Open MPI 1.4.3, the one that comes with OFED 1.5.4.
> 
> 3. I have checked the Open MPI website, and I meet all the requirements it
> asks for:
> 
>    passwordless ssh
>    the same OFED/Open MPI version on all the cluster nodes
>    InfiniBand connectivity between the nodes, etc.
> 
> 4. When I run an MPI program it runs properly on one node, but it does not
> run on more than one node. The error I see during execution is the following:
> 
> [dirac13.ciemat.es:06415] plm:tm: failed to poll for a spawned daemon, return
> status = 17002
> 
> 
> --
> 
> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to 
> launch so we are aborting.
> 
> 
> 
> There may be more information reported by the environment (see above).
> 
> 
> 
> This may be because the daemon was unable to find all the needed shared 
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the 
> location of the shared libraries on the remote nodes and this will 
> automatically be forwarded to the remote nodes.
> 
> 
> 
> --
> 
> 
> 
> --
> 
> mpiexec noticed that the job aborted, but has no info as to the process that 
> caused that situation.
> 
> 
> 
> --
> 
> 
> 
> --
> 
> mpiexec was unable to cleanly terminate the daemons on the nodes shown below. 
> Additional manual cleanup may be required - please refer to the "orte-clean" 
> tool for assistance.
> 
> 
> 
> --
> 
>dirac12.ciemat.es - daemon did not report back when launched
> 
> 
> 
> The command I use to run the MPI program is the following:
> 
> 
>mpiexec -H dirac12,dirac13 ./cpi
> 
> I have also tried
> 
>mpiexec -np 24 -H dirac12,dirac13 ./cpi
> 
> And submitting to the batch system:
> 
>mpiexec -np 24 -hostfile $PBS_NODEFILE ./cpi
> 
> All of them with the same result.
> 
> 
> All the MPI libraries are the same on all the nodes in the cluster.
> 
> Please, could anyone help me?
> 
> Thanks,
> Alicia
> 
> 




[OMPI users] Openmpi problem

2012-05-03 Thread Acero Fernandez Alicia


Hello,

I have a problem running an MPI program with the Open MPI library. I did the
following:


1. I installed OFED 1.5.4 from RHEL. The hardware is QLogic 7340 IB cards.

2. I am using Open MPI 1.4.3, the one that comes with OFED 1.5.4.

3. I have checked the Open MPI website, and I meet all the requirements it
asks for:

passwordless ssh
the same OFED/Open MPI version on all the cluster nodes
InfiniBand connectivity between the nodes, etc.

4. When I run an MPI program it runs properly on one node, but it does not run
on more than one node. The error I see during execution is the following:

[dirac13.ciemat.es:06415] plm:tm: failed to poll for a spawned daemon, return
status = 17002


--

A daemon (pid unknown) died unexpectedly on signal 1  while attempting to 
launch so we are aborting.



There may be more information reported by the environment (see above).



This may be because the daemon was unable to find all the needed shared 
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the 
location of the shared libraries on the remote nodes and this will 
automatically be forwarded to the remote nodes.



--



--

mpiexec noticed that the job aborted, but has no info as to the process that 
caused that situation.



--



--

mpiexec was unable to cleanly terminate the daemons on the nodes shown below. 
Additional manual cleanup may be required - please refer to the "orte-clean" 
tool for assistance.



--

dirac12.ciemat.es - daemon did not report back when launched



The command I use to run the MPI program is the following:


mpiexec -H dirac12,dirac13 ./cpi

I have also tried

mpiexec -np 24 -H dirac12,dirac13 ./cpi

And submitting to the batch system:

mpiexec -np 24 -hostfile $PBS_NODEFILE ./cpi

All of them with the same result.


All the MPI libraries are the same on all the nodes in the cluster.

Please, could anyone help me?

Thanks,
Alicia






Re: [OMPI users] OpenMPI problem on Fedora Core 12

2010-01-12 Thread Eugene Loh

Jeff Squyres wrote:


It would be very strange for nanosleep to cause a problem for Open MPI -- it 
shouldn't interfere with any of Open MPI's mechanisms.  Double check that your 
my_barrier() function is actually working properly -- removing the nanosleep() 
shouldn't affect the correctness of your barrier.
 

I read Gijsbert's e-mail differently.  Apparently, the issue is not 
MPI/OMPI at all, but a hang inside nanosleep.



On Dec 31, 2009, at 1:15 PM, Gijsbert Wiesenekker wrote:
 


I only recently learned about the OMPI_MCA_mpi_yield_when_idle variable, I 
still have to test if that is an alternative to my workaround.
   

mpi_yield_when_idle does not free up the CPU very much.  It still polls 
fairly aggressively, and the yield() call doesn't really give the CPU back.  
It's a weak and probably ungratifying solution for your problem.



Meanwhile I seem to have found the cause of the problem ...
... rather than OpenMPI being the problem, nanosleep is the culprit because the 
call to it seems to hang.
   

So, "we" (OMPI community) are off the hook?  Problem is in nanosleep?  
"We" are relieved (or confused about what you're reporting)!


Re: [OMPI users] OpenMPI problem on Fedora Core 12

2010-01-12 Thread Jeff Squyres
It would be very strange for nanosleep to cause a problem for Open MPI -- it 
shouldn't interfere with any of Open MPI's mechanisms.  Double check that your 
my_barrier() function is actually working properly -- removing the nanosleep() 
shouldn't affect the correctness of your barrier.  

If you've implemented your own barrier function, here are a few points:

1. If you want to re-implement the back-end to MPI_Barrier itself, it would 
likely be possible to wrap up your routine in an Open MPI plugin (remember that 
the back-ends of MPI_Barrier -- and others -- are driven by plugins; hence, you 
can actually replace the algorithms and whatnot that are used by MPI_Barrier 
without altering Open MPI's source code).  Let me know if you're interested in 
that.

2. MPI_Wait, as you surmised, is pretty much the same -- it aggressively polls, 
waiting for progress.  You *could* replace its behavior with a plugin, similar 
to MPI_Barrier, but it's a little harder (I can describe why, if you care).  

3. Your best bet might actually be to write a small profiling library that 
intercepts calls to MPI_Barrier and/or MPI_Wait and replaces them with 
non-aggressive versions.  E.g., your version of MPI_Wait can call MPI_Test, and 
if the request is not finished, call sleep() (or whatever).  Rinse, repeat.  
(A minimal sketch of this idea appears after this list.)

4. The mpi_yield_when_idle MCA parameter will simply call sched_yield() in 
OMPI's inner loops.  It'll still poll aggressively, but it'll call yield in the 
very core of those loops, thereby allowing other processes to pre-empt the MPI 
processes.  So it'll likely help your situation by allowing other processes to 
run, but the CPUs will still be pegged at 100%.
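
A minimal sketch of the point-3 approach (an illustration, not code from this
thread): it intercepts MPI_Wait through the standard PMPI profiling interface
and polls with MPI_Test plus a short nanosleep instead of spinning.  The 1 ms
interval and the suggested file/library names are arbitrary choices.

/* wait_relaxed.c -- sketch of a "non-aggressive" MPI_Wait.
 * Build (for example): mpicc -shared -fPIC wait_relaxed.c -o libwait_relaxed.so
 * and link it (or LD_PRELOAD it) ahead of the application so this definition
 * overrides the library's MPI_Wait.
 */
#include <mpi.h>
#include <time.h>

int MPI_Wait(MPI_Request *request, MPI_Status *status)
{
    int flag = 0;
    struct timespec ts = {0, 1000000};      /* 1 ms between polls */

    for (;;) {
        /* PMPI_Test is the non-intercepted entry point for MPI_Test */
        int rc = PMPI_Test(request, &flag, status);
        if (rc != MPI_SUCCESS || flag)
            return rc;
        nanosleep(&ts, NULL);               /* give the CPU back while waiting */
    }
}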


On Dec 31, 2009, at 1:15 PM, Gijsbert Wiesenekker wrote:

> First of all, the reason that I have created a CPU-friendly version of 
> MPI_Barrier is that my program is asymmetric (so some of the nodes can easily 
> have to wait for several hours) and that it is I/O bound. My program uses MPI 
> mainly to synchronize I/O and to share some counters between the nodes, 
> followed by a gather/scatter of the files. MPI_Barrier (or any of the other 
> MPI calls) caused the four CPUs of my Quad Core to continuously run at 100% 
> because of the aggressive polling, making the server almost unusable and also 
> slowing my program down because there was less CPU time available for I/O and 
> file synchronization. With this version of MPI_Barrier CPU usage averages out 
> at about 25%. I only recently learned about the OMPI_MCA_mpi_yield_when_idle 
> variable, I still have to test if that is an alternative to my workaround.
> Meanwhile I seem to have found the cause of the problem, thanks to Ashley's 
> excellent padb tool. Following Eugene's recommendation, I have added the 
> MPI_Wait call: the same problem. Next I created a separate program that just 
> calls my_barrier repeatedly with randomized 1-2 second intervals. Again the 
> same problem (with 4 nodes), sometimes after a couple of iterations, 
> sometimes after 500, 1000 or 2000 iterations. Next I followed Ashley's 
> suggestion to use padb. I ran padb --all --mpi-queue and padb --all 
> --message-queue while the program was running fine and after the problem 
> occurred. When the problem occurred padb said:
> 
> Warning, remote process state differs across ranks
> state : ranks
> R : [2-3]
> S : [0-1]
> 
> and
> 
> $ padb --all --stack-trace --tree
> Warning, remote process state differs across ranks
> state : ranks
> R : [2-3]
> S : [0-1]
> -
> [0-1] (2 processes)
> -
> main() at ?:?
>   barrier_util() at ?:?
> my_sleep() at ?:?
>   __nanosleep_nocancel() at ?:?
> -
> [2-3] (2 processes)
> -
> ??() at ?:?
>   ??() at ?:?
> ??() at ?:?
>   ??() at ?:?
> ??() at ?:?
>   ompi_mpi_signed_char() at ?:?
> ompi_request_default_wait_all() at ?:?
>   opal_progress() at ?:?
> -
> 2 (1 processes)
> -
> mca_pml_ob1_progress() at ?:?
> 
> which suggests that rather than OpenMPI being the problem, nanosleep is the 
> culprit, because the call to it seems to hang.
> 
> Thanks for all the help.
> 
> Gijsbert
> 
> On Mon, Dec 14, 2009 at 8:22 PM, Ashley Pittman  wrote:
> On Sun, 2009-12-13 at 19:04 +0100, Gijsbert Wiesenekker wrote:
> > The following routine gives a problem after some (not reproducible)
> > time on Fedora Core 12. The routine is a CPU usage friendly version of
> > MPI_Barrier.
> 
> There are some proposals for Non-blocking collectives before the MPI
> forum currently and I believe a working implementation which can be used
> as a plug-in for OpenMPI, I would urge you to look at these rather than
> try and implement your own.
> 
> > My question is: is there a problem with this routine that I overlooked
> > that somehow did not show up until now
> 
> Your code both does all-to-all communication and also uses probe, both
> of these can easily be avoided when implementing Barrier.

Re: [OMPI users] OpenMPI problem on Fedora Core 12

2009-12-14 Thread Ashley Pittman
On Sun, 2009-12-13 at 19:04 +0100, Gijsbert Wiesenekker wrote:
> The following routine gives a problem after some (not reproducible)
> time on Fedora Core 12. The routine is a CPU usage friendly version of
> MPI_Barrier.

There are currently some proposals for non-blocking collectives before the MPI
Forum, and I believe there is a working implementation that can be used as a
plug-in for OpenMPI. I would urge you to look at these rather than try to
implement your own.
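
(For context, not part of this reply: these proposals were later standardized
in MPI-3 as non-blocking collectives such as MPI_Ibarrier. With an MPI-3
library, a CPU-friendly barrier reduces to roughly the sketch below; the 1 ms
poll interval is an arbitrary choice.)

#include <mpi.h>
#include <time.h>

static void sleepy_barrier(MPI_Comm comm)
{
    MPI_Request req;
    int flag = 0;
    struct timespec ts = {0, 1000000};      /* 1 ms between polls */

    MPI_Ibarrier(comm, &req);               /* start the barrier without blocking */
    while (!flag) {
        MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
        if (!flag) nanosleep(&ts, NULL);    /* sleep instead of spinning */
    }
}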

> My question is: is there a problem with this routine that I overlooked
> that somehow did not show up until now

Your code both does all-to-all communication and also uses probe, both
of these can easily be avoided when implementing Barrier.

> Is there a way to see which messages have been sent/received/are
> pending?

Yes, there is a message queue interface allowing tools to peek inside
the MPI library and see these queues.  As far as I know there are three
tools which use this: TotalView, DDT, and my own tool, padb.
TotalView and DDT are both full-featured graphical debuggers and
commercial products; padb is an open-source text-based tool.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] OpenMPI problem on Fedora Core 12

2009-12-14 Thread Eugene Loh
Let's start with this:  You generate non-blocking sends (MPI_Isend).  
Those sends are not completed anywhere.  So, strictly speaking, they 
don't need to be executed.  In practice, even if they are executed, they 
should be "completed" from the user program's point of view (MPI_Test, 
MPI_Wait, MPI_Waitall, etc.) to reclaim resources associated with the 
requests.


So, you should start by fixing that.  The question arises where you 
should complete those send calls.  I think there are several steps you 
could take here to get what you're looking for:


1) Implement a version that works without worrying about "sleep" 
behavior.  In your case, you're sending messages in an all-to-all 
pattern.  So, for example, you could issue an MPI_Irecv for each 
non-self process.  Then, issue an MPI_Isend for each non-self process.  
Then, issue MPI_Wait commands to complete all those requests (see the 
sketch after this list).


2) (Optional step):  consider alternative message patterns, like trees, 
to cut down on all the message traffic.


3) Insert the "sleep" calls.
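
The sketch below (an illustration, not code from this thread) combines steps 1
and 3: every request, receives and sends alike, is completed, and the
completion loop polls with MPI_Testall plus a short nanosleep so the CPU is
not pegged.  MAX_RANKS and the 1 ms interval are illustrative values only.

#include <mpi.h>
#include <time.h>

#define MAX_RANKS 256

static void friendly_barrier(MPI_Comm comm)
{
    int np, me, i, n = 0, flag = 0;
    int sbuf[MAX_RANKS], rbuf[MAX_RANKS];
    MPI_Request req[2 * MAX_RANKS];
    struct timespec ts = {0, 1000000};      /* 1 ms between polls */

    MPI_Comm_size(comm, &np);
    MPI_Comm_rank(comm, &me);
    if (np > MAX_RANKS) MPI_Abort(comm, 1); /* sketch-sized static buffers */

    for (i = 0; i < np; i++) {              /* post all receives first */
        if (i == me) continue;
        MPI_Irecv(&rbuf[i], 1, MPI_INT, i, 0, comm, &req[n++]);
    }
    for (i = 0; i < np; i++) {              /* then all sends */
        if (i == me) continue;
        sbuf[i] = me;
        MPI_Isend(&sbuf[i], 1, MPI_INT, i, 0, comm, &req[n++]);
    }
    while (!flag) {                         /* complete every request */
        MPI_Testall(n, req, &flag, MPI_STATUSES_IGNORE);
        if (!flag) nanosleep(&ts, NULL);    /* step 3: sleep while waiting */
    }
}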

If you fix that and still have problems, let us know, along with what 
"interconnect" (possibly on-node shared memory) you're using and what 
GCC version.


Gijsbert Wiesenekker wrote:


The following routine gives a problem after some (not reproducible) time on 
Fedora Core 12. The routine is a CPU-usage-friendly version of MPI_Barrier.
The verbose output shows that when the problem occurs, one of the (not 
reproducible) nodes does not receive a message from one of the other (not 
reproducible) nodes, so it looks like the message is lost or never received. 
This routine worked fine on Fedora Core 10 with OpenMPI 1.3.x and works fine on 
CentOS 5.3 with OpenMPI 1.3.x. The problem occurs with OpenMPI 1.3.x, OpenMPI 
1.4, gcc and icc.
My question is: is there a problem with this routine that I overlooked that 
somehow did not show up until now, and if not, how can I debug what causes this 
problem? Is there a way to see which messages have been sent/received/are 
pending?

Regards,
Gijsbert

local void my_barrier(char * info, MPI_Comm comm, int verbose)
{
    int ncomm;
    int comm_id;
    int send[MPI_NPROCS_MAX];
    MPI_Request request[MPI_NPROCS_MAX];
    int icomm;
    int done[MPI_NPROCS_MAX];
    time_t t0, t1;
    double wall[MPI_NPROCS_MAX];
    double wall_max;

    BUG(mpi_nprocs == 1)

    MPI_Comm_size(comm, &ncomm);
    BUG(ncomm < 1)
    MPI_Comm_rank(comm, &comm_id);

    my_printf("entering barrier %s %d %d\n", info, ncomm, comm_id);
    for (icomm = 0; icomm < ncomm; icomm++) send[icomm] = comm_id;
    for (icomm = 0; icomm < ncomm; icomm++)
    {
        if (icomm != comm_id)
        {
            if (verbose) my_printf("sending from %d to %d\n",
                comm_id, icomm);
            MPI_Isend(send + icomm, 1, MPI_INT, icomm, MPI_BARRIER_TAG,
                comm, request + icomm);
            done[icomm] = FALSE;
        }
        else
        {
            done[icomm] = TRUE;
        }
        wall[icomm] = 0.0;
    }
    t0 = time(NULL);
    while(TRUE)
    {
        int receive;
        int flag;
        MPI_Status status;

        MPI_Iprobe(MPI_ANY_SOURCE, MPI_BARRIER_TAG,
            comm, &flag, &status);
        if (!flag)
        {
            my_sleep(0, BARRIER_POLL);
            continue;
        }
        BUG(status.MPI_SOURCE < 0)
        BUG(status.MPI_SOURCE >= ncomm)
        MPI_Recv(&receive, 1, MPI_INT, status.MPI_SOURCE, MPI_BARRIER_TAG,
            comm, &status);
        BUG(receive != status.MPI_SOURCE)
        BUG(done[status.MPI_SOURCE])
        if (verbose) my_printf("receiving from %d\n",
            status.MPI_SOURCE);

        t1 = time(NULL);
        done[status.MPI_SOURCE] = TRUE;
        wall[status.MPI_SOURCE] = difftime(t1, t0);

        for (icomm = 0; icomm < ncomm; icomm++)
            if (!done[icomm]) break;
        if (icomm == ncomm) break;
    }
    my_printf("leaving barrier %s\n", info);

    wall_max = 0;
    for (icomm = 0; icomm < ncomm; icomm++)
    {
        if (verbose)
            my_printf("icomm=%d time=%.0f%s\n",
                icomm, wall[icomm], icomm == comm_id ? " *" : "");
        if (wall[icomm] > wall_max) wall_max = wall[icomm];
    }
    //to be sure
    MPI_Barrier(comm);
    MPI_Allreduce(MPI_IN_PLACE, &wall_max, 1,
        MPI_DOUBLE, MPI_MAX, comm);
    my_printf("mpi wall_max=%.0f\n", wall_max);
}
 



[OMPI users] OpenMPI problem on Fedora Core 12

2009-12-13 Thread Gijsbert Wiesenekker
The following routine gives a problem after some (not reproducible) time on 
Fedora Core 12. The routine is a CPU-usage-friendly version of MPI_Barrier.
The verbose output shows that when the problem occurs, one of the (not 
reproducible) nodes does not receive a message from one of the other (not 
reproducible) nodes, so it looks like the message is lost or never received. 
This routine worked fine on Fedora Core 10 with OpenMPI 1.3.x and works fine on 
CentOS 5.3 with OpenMPI 1.3.x. The problem occurs with OpenMPI 1.3.x, OpenMPI 
1.4, gcc and icc.
My question is: is there a problem with this routine that I overlooked that 
somehow did not show up until now, and if not, how can I debug what causes this 
problem? Is there a way to see which messages have been sent/received/are 
pending?

Regards,
Gijsbert

local void my_barrier(char * info, MPI_Comm comm, int verbose)
{
    int ncomm;
    int comm_id;
    int send[MPI_NPROCS_MAX];
    MPI_Request request[MPI_NPROCS_MAX];
    int icomm;
    int done[MPI_NPROCS_MAX];
    time_t t0, t1;
    double wall[MPI_NPROCS_MAX];
    double wall_max;

    BUG(mpi_nprocs == 1)

    MPI_Comm_size(comm, &ncomm);
    BUG(ncomm < 1)
    MPI_Comm_rank(comm, &comm_id);

    my_printf("entering barrier %s %d %d\n", info, ncomm, comm_id);
    for (icomm = 0; icomm < ncomm; icomm++) send[icomm] = comm_id;
    for (icomm = 0; icomm < ncomm; icomm++)
    {
        if (icomm != comm_id)
        {
            if (verbose) my_printf("sending from %d to %d\n",
                comm_id, icomm);
            MPI_Isend(send + icomm, 1, MPI_INT, icomm, MPI_BARRIER_TAG,
                comm, request + icomm);
            done[icomm] = FALSE;
        }
        else
        {
            done[icomm] = TRUE;
        }
        wall[icomm] = 0.0;
    }
    t0 = time(NULL);
    while(TRUE)
    {
        int receive;
        int flag;
        MPI_Status status;

        MPI_Iprobe(MPI_ANY_SOURCE, MPI_BARRIER_TAG,
            comm, &flag, &status);
        if (!flag)
        {
            my_sleep(0, BARRIER_POLL);
            continue;
        }
        BUG(status.MPI_SOURCE < 0)
        BUG(status.MPI_SOURCE >= ncomm)
        MPI_Recv(&receive, 1, MPI_INT, status.MPI_SOURCE, MPI_BARRIER_TAG,
            comm, &status);
        BUG(receive != status.MPI_SOURCE)
        BUG(done[status.MPI_SOURCE])
        if (verbose) my_printf("receiving from %d\n",
            status.MPI_SOURCE);

        t1 = time(NULL);
        done[status.MPI_SOURCE] = TRUE;
        wall[status.MPI_SOURCE] = difftime(t1, t0);

        for (icomm = 0; icomm < ncomm; icomm++)
            if (!done[icomm]) break;
        if (icomm == ncomm) break;
    }
    my_printf("leaving barrier %s\n", info);

    wall_max = 0;
    for (icomm = 0; icomm < ncomm; icomm++)
    {
        if (verbose)
            my_printf("icomm=%d time=%.0f%s\n",
                icomm, wall[icomm], icomm == comm_id ? " *" : "");
        if (wall[icomm] > wall_max) wall_max = wall[icomm];
    }
    //to be sure
    MPI_Barrier(comm);
    MPI_Allreduce(MPI_IN_PLACE, &wall_max, 1,
        MPI_DOUBLE, MPI_MAX, comm);
    my_printf("mpi wall_max=%.0f\n", wall_max);
}




Re: [OMPI users] openmpi problem

2006-11-03 Thread Durga Choudhury

Calin

Your questions don't belong in this forum. You either need to be computer
literate (your questions are basic OS related) or delegate this task to
someone more experienced.

Good luck
Durga


On 11/3/06, calin pal  wrote:


/* please read the mail and answer my query */
Sir,

On four machines at our college I have installed Open MPI in the following
way, which I am sending you.
I started the four machines as root, then installed openmpi-1.1.1.tar.gz
using these commands:
>>tar -xvzf openmpi-1.1.1.tar.gz
>>cd openmpi-1.1.1
>>./configure --prefix=/usr/local
>>make
>>make all install
>>ompi_info
I did that as root.

Then, following your suggestion, I switched to the user account (where my
program jacobi.c is), gave the password, and ran:
>>cd .bashrc
>>export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
>>source .bashrc
>>mpicc mpihello.c -o mpihello
>>mpirun -np 4 mpihello

After doing all this I get a problem with the libmpi.so file: "mpihello" is
not working.

What am I supposed to do?

Do I have to install again?

Is anything wrong in the installation? Sir, I cannot understand from the
FAQ what you suggested I look at, which is why I am asking again. Please
tell me whether what I have done on our computers is okay, or whether
anything needs to change in the commands written above; please check them
and tell me what is wrong. Please also look at the commands I used, as root
and as the user, for installing and running openmpi-1.1.1.tar.gz.

calin pal
MSc Tech (Maths and CompSc)
Pune, India






--
Devil wanted omnipresence;
He therefore created communists.


[OMPI users] openmpi problem

2006-11-03 Thread calin pal

/* please read the mail and answer my query */
Sir,

On four machines at our college I have installed Open MPI in the following
way, which I am sending you.
I started the four machines as root, then installed openmpi-1.1.1.tar.gz
using these commands:

tar -xvzf openmpi-1.1.1.tar.gz
cd openmpi-1.1.1
./configure --prefix=/usr/local
make
make all install
ompi_info

I did that as root.

Then, following your suggestion, I switched to the user account (where my
program jacobi.c is), gave the password, and ran:

cd .bashrc
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
source .bashrc
mpicc mpihello.c -o mpihello
mpirun -np 4 mpihello


After doing all this I get a problem with the libmpi.so file: "mpihello" is
not working.

What am I supposed to do?

Do I have to install again?

Is anything wrong in the installation? Sir, I cannot understand from the
FAQ what you suggested I look at, which is why I am asking again. Please
tell me whether what I have done on our computers is okay, or whether
anything needs to change in the commands written above; please check them
and tell me what is wrong. Please also look at the commands I used, as root
and as the user, for installing and running openmpi-1.1.1.tar.gz.

calin pal
MSc Tech (Maths and CompSc)
Pune, India