Re: [OMPI users] OpenMPI and Torque

2011-03-21 Thread Randall Svancara
Yeah, the system admin is me, lol... and this is a new system on which I
am frantically trying to work out all the bugs.  Torque and MPI are my
last hurdles to overcome, but I have already been through some faulty
InfiniBand equipment, bad memory, and bad drives... which is to be
expected on a cluster.


I wish there were some kind of TM test tool; that would be really nice
to have for debugging this.

I will ping the Torque list again.  Originally they forwarded me to
the openmpi list.

On Mon, Mar 21, 2011 at 12:29 PM, Ralph Castain  wrote:
> mpiexec doesn't use pbsdsh (we use a TM API), but the effect is the same. 
> Been so long since I ran on a Torque machine, though, that I honestly don't 
> remember how to set the LD_LIBRARY_PATH on the backend.
>
> Do you have a sys admin there whom you could ask? Or you could ping the 
> Torque list about it - pretty standard issue.
>
>
> On Mar 21, 2011, at 1:19 PM, Randall Svancara wrote:
>
>> Hi.  The pbsdsh tool is great.  I ran an interactive qsub session
>> (qsub -I -lnodes=2:ppn=12) and then ran the pbsdsh tool like this:
>>
>> [rsvancara@node164 ~]$ /usr/local/bin/pbsdsh  -h node164 printenv
>> PATH=/bin:/usr/bin
>> LANG=C
>> PBS_O_HOME=/home/admins/rsvancara
>> PBS_O_LANG=en_US.UTF-8
>> PBS_O_LOGNAME=rsvancara
>> PBS_O_PATH=/home/software/mpi/intel/openmpi-1.4.3/bin:/home/software/intel/Compiler/11.1/075/bin/intel64:/home/software/Modules/3.2.8/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin
>> PBS_O_MAIL=/var/spool/mail/rsvancara
>> PBS_O_SHELL=/bin/bash
>> PBS_SERVER=mgt1.wsuhpc.edu
>> PBS_O_WORKDIR=/home/admins/rsvancara/TEST
>> PBS_O_QUEUE=batch
>> PBS_O_HOST=login1
>> HOME=/home/admins/rsvancara
>> PBS_JOBNAME=STDIN
>> PBS_JOBID=1672.mgt1.wsuhpc.edu
>> PBS_QUEUE=batch
>> PBS_JOBCOOKIE=50E4985E63684BA781EE9294F21EE25E
>> PBS_NODENUM=0
>> PBS_TASKNUM=146
>> PBS_MOMPORT=15003
>> PBS_NODEFILE=/var/spool/torque/aux//1672.mgt1.wsuhpc.edu
>> PBS_VERSION=TORQUE-2.4.7
>> PBS_VNODENUM=0
>> PBS_ENVIRONMENT=PBS_BATCH
>> ENVIRONMENT=BATCH
>> [rsvancara@node164 ~]$ /usr/local/bin/pbsdsh  -h node163 printenv
>> PATH=/bin:/usr/bin
>> LANG=C
>> PBS_O_HOME=/home/admins/rsvancara
>> PBS_O_LANG=en_US.UTF-8
>> PBS_O_LOGNAME=rsvancara
>> PBS_O_PATH=/home/software/mpi/intel/openmpi-1.4.3/bin:/home/software/intel/Compiler/11.1/075/bin/intel64:/home/software/Modules/3.2.8/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin
>> PBS_O_MAIL=/var/spool/mail/rsvancara
>> PBS_O_SHELL=/bin/bash
>> PBS_SERVER=mgt1.wsuhpc.edu
>> PBS_O_WORKDIR=/home/admins/rsvancara/TEST
>> PBS_O_QUEUE=batch
>> PBS_O_HOST=login1
>> HOME=/home/admins/rsvancara
>> PBS_JOBNAME=STDIN
>> PBS_JOBID=1672.mgt1.wsuhpc.edu
>> PBS_QUEUE=batch
>> PBS_JOBCOOKIE=50E4985E63684BA781EE9294F21EE25E
>> PBS_NODENUM=1
>> PBS_TASKNUM=147
>> PBS_MOMPORT=15003
>> PBS_VERSION=TORQUE-2.4.7
>> PBS_VNODENUM=12
>> PBS_ENVIRONMENT=PBS_BATCH
>> ENVIRONMENT=BATCH
>>
>> So one thing that strikes me as bad is that LD_LIBRARY_PATH does not
>> appear to be available.  I attempted to run mpiexec like this and it fails:
>>
>> [rsvancara@node164 ~]$ /usr/local/bin/pbsdsh  -h node163
>> /home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec hostname
>> /home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec: error while
>> loading shared libraries: libimf.so: cannot open shared object file:
>> No such file or directory
>> pbsdsh: task 12 exit status 127
>> [rsvancara@node164 ~]$ /usr/local/bin/pbsdsh  -h node164
>> /home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec hostname
>> /home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec: error while
>> loading shared libraries: libimf.so: cannot open shared object file:
>> No such file or directory
>> pbsdsh: task 0 exit status 127
>>
>> If this is how the Open MPI processes are being launched, then it is no
>> wonder they are failing, and the LD_LIBRARY_PATH error message is
>> indeed somewhat accurate.
>>
>> So the next question is how do I ensure that this information is
>> available to pbsdsh?
>>
>> Thanks,
>>
>> Randall
>>
>>
>> On Mon, Mar 21, 2011 at 11:24 AM, Randall Svancara  
>> wrote:
>>> Ok, these are good things to check.  I am going to follow through with
>>> this in the next hour after our GPFS upgrade.  Thanks!!!
>>>
>>> On Mon, Mar 21, 2011 at 11:14 AM, Brock Palen  wrote:
 On Mar 21, 2011, at 1:59 PM, Jeff Squyres wrote:

> I no longer run Torque on my cluster, so my Torqueology is pretty rusty 
> -- but I think there's a Torque command to launch on remote nodes.  tmrsh 
> or pbsrsh or something like that...?

 pbsdsh
 If TM is working pbsdsh should work fine.

 Torque+OpenMPI has been working just fine for us.
 Do you have libtorque on all your compute hosts?  You should see it open 
 on all hosts if it works.

>
> Try that and make sure it works.  Open MPI should be using the same API 
> as that command under the covers.
>
> I also have a dim recollection that the TM API support library(ies?) may 
> not be installed by default.  You may have to ensure that they're 
> available on all nodes...?

Re: [OMPI users] OpenMPI and Torque

2011-03-21 Thread Ralph Castain
mpiexec doesn't use pbsdsh (we use a TM API), but the effect is the same. Been 
so long since I ran on a Torque machine, though, that I honestly don't remember 
how to set the LD_LIBRARY_PATH on the backend.

Do you have a sys admin there whom you could ask? Or you could ping the Torque 
list about it - pretty standard issue.
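
For reference, mpirun itself has a couple of options that can push a library path out to the remote side; whether either one helps depends on the local setup, so treat this as a sketch rather than a confirmed fix (the executable name a.out and the process count are placeholders):

# Forward the caller's LD_LIBRARY_PATH to the remotely launched processes:
mpirun -x LD_LIBRARY_PATH -np 24 ./a.out

# Or have mpirun prepend its own install tree to PATH/LD_LIBRARY_PATH on the
# remote nodes (same effect as configuring with --enable-mpirun-prefix-by-default):
mpirun --prefix /home/software/mpi/intel/openmpi-1.4.3 -np 24 ./a.out

Note that --prefix only covers Open MPI's own libraries; the libimf.so mentioned later in this thread belongs to the Intel compiler runtime, so its directory still has to reach the remote nodes some other way (for example via -x LD_LIBRARY_PATH).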


On Mar 21, 2011, at 1:19 PM, Randall Svancara wrote:

> Hi.  The pbsdsh tool is great.  I ran an interactive qsub session
> (qsub -I -lnodes=2:ppn=12) and then ran the pbsdsh tool like this:
> 
> [rsvancara@node164 ~]$ /usr/local/bin/pbsdsh  -h node164 printenv
> PATH=/bin:/usr/bin
> LANG=C
> PBS_O_HOME=/home/admins/rsvancara
> PBS_O_LANG=en_US.UTF-8
> PBS_O_LOGNAME=rsvancara
> PBS_O_PATH=/home/software/mpi/intel/openmpi-1.4.3/bin:/home/software/intel/Compiler/11.1/075/bin/intel64:/home/software/Modules/3.2.8/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin
> PBS_O_MAIL=/var/spool/mail/rsvancara
> PBS_O_SHELL=/bin/bash
> PBS_SERVER=mgt1.wsuhpc.edu
> PBS_O_WORKDIR=/home/admins/rsvancara/TEST
> PBS_O_QUEUE=batch
> PBS_O_HOST=login1
> HOME=/home/admins/rsvancara
> PBS_JOBNAME=STDIN
> PBS_JOBID=1672.mgt1.wsuhpc.edu
> PBS_QUEUE=batch
> PBS_JOBCOOKIE=50E4985E63684BA781EE9294F21EE25E
> PBS_NODENUM=0
> PBS_TASKNUM=146
> PBS_MOMPORT=15003
> PBS_NODEFILE=/var/spool/torque/aux//1672.mgt1.wsuhpc.edu
> PBS_VERSION=TORQUE-2.4.7
> PBS_VNODENUM=0
> PBS_ENVIRONMENT=PBS_BATCH
> ENVIRONMENT=BATCH
> [rsvancara@node164 ~]$ /usr/local/bin/pbsdsh  -h node163 printenv
> PATH=/bin:/usr/bin
> LANG=C
> PBS_O_HOME=/home/admins/rsvancara
> PBS_O_LANG=en_US.UTF-8
> PBS_O_LOGNAME=rsvancara
> PBS_O_PATH=/home/software/mpi/intel/openmpi-1.4.3/bin:/home/software/intel/Compiler/11.1/075/bin/intel64:/home/software/Modules/3.2.8/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin
> PBS_O_MAIL=/var/spool/mail/rsvancara
> PBS_O_SHELL=/bin/bash
> PBS_SERVER=mgt1.wsuhpc.edu
> PBS_O_WORKDIR=/home/admins/rsvancara/TEST
> PBS_O_QUEUE=batch
> PBS_O_HOST=login1
> HOME=/home/admins/rsvancara
> PBS_JOBNAME=STDIN
> PBS_JOBID=1672.mgt1.wsuhpc.edu
> PBS_QUEUE=batch
> PBS_JOBCOOKIE=50E4985E63684BA781EE9294F21EE25E
> PBS_NODENUM=1
> PBS_TASKNUM=147
> PBS_MOMPORT=15003
> PBS_VERSION=TORQUE-2.4.7
> PBS_VNODENUM=12
> PBS_ENVIRONMENT=PBS_BATCH
> ENVIRONMENT=BATCH
> 
> So one thing that strikes me as bad is that LD_LIBRARY_PATH does not
> appear to be available.  I attempted to run mpiexec like this and it fails:
> 
> [rsvancara@node164 ~]$ /usr/local/bin/pbsdsh  -h node163
> /home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec hostname
> /home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec: error while
> loading shared libraries: libimf.so: cannot open shared object file:
> No such file or directory
> pbsdsh: task 12 exit status 127
> [rsvancara@node164 ~]$ /usr/local/bin/pbsdsh  -h node164
> /home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec hostname
> /home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec: error while
> loading shared libraries: libimf.so: cannot open shared object file:
> No such file or directory
> pbsdsh: task 0 exit status 127
> 
> If this is how the Open MPI processes are being launched, then it is no
> wonder they are failing, and the LD_LIBRARY_PATH error message is
> indeed somewhat accurate.
> 
> So the next question is how do I ensure that this information is
> available to pbsdsh?
> 
> Thanks,
> 
> Randall
> 
> 
> On Mon, Mar 21, 2011 at 11:24 AM, Randall Svancara  
> wrote:
>> Ok, these are good things to check.  I am going to follow through with
>> this in the next hour after our GPFS upgrade.  Thanks!!!
>> 
>> On Mon, Mar 21, 2011 at 11:14 AM, Brock Palen  wrote:
>>> On Mar 21, 2011, at 1:59 PM, Jeff Squyres wrote:
>>> 
 I no longer run Torque on my cluster, so my Torqueology is pretty rusty -- 
 but I think there's a Torque command to launch on remote nodes.  tmrsh or 
 pbsrsh or something like that...?
>>> 
>>> pbsdsh
>>> If TM is working pbsdsh should work fine.
>>> 
>>> Torque+OpenMPI has been working just fine for us.
>>> Do you have libtorque on all your compute hosts?  You should see it open on 
>>> all hosts if it works.
>>> 
 
 Try that and make sure it works.  Open MPI should be using the same API as 
 that command under the covers.
 
 I also have a dim recollection that the TM API support library(ies?) may 
 not be installed by default.  You may have to ensure that they're 
 available on all nodes...?
 
 
 On Mar 21, 2011, at 1:53 PM, Randall Svancara wrote:
 
> I am not sure if there is any extra configuration necessary for torque
> to forward the environment.  I have included the output of printenv
> for an interactive qsub session.  I am really at a loss here because I
> never had this much difficulty making torque run with openmpi.  It has
> been mostly a good experience.
> 
> Permissions of /tmp
> 
> drwxrwxrwt   4 root root   140 Mar 20 

Re: [OMPI users] OpenMPI and Torque

2011-03-21 Thread Randall Svancara
Hi.  The pbsdsh tool is great.  I ran an interactive qsub session
(qsub -I -lnodes=2:ppn=12) and then ran the pbsdsh tool like this:

[rsvancara@node164 ~]$ /usr/local/bin/pbsdsh  -h node164 printenv
PATH=/bin:/usr/bin
LANG=C
PBS_O_HOME=/home/admins/rsvancara
PBS_O_LANG=en_US.UTF-8
PBS_O_LOGNAME=rsvancara
PBS_O_PATH=/home/software/mpi/intel/openmpi-1.4.3/bin:/home/software/intel/Compiler/11.1/075/bin/intel64:/home/software/Modules/3.2.8/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin
PBS_O_MAIL=/var/spool/mail/rsvancara
PBS_O_SHELL=/bin/bash
PBS_SERVER=mgt1.wsuhpc.edu
PBS_O_WORKDIR=/home/admins/rsvancara/TEST
PBS_O_QUEUE=batch
PBS_O_HOST=login1
HOME=/home/admins/rsvancara
PBS_JOBNAME=STDIN
PBS_JOBID=1672.mgt1.wsuhpc.edu
PBS_QUEUE=batch
PBS_JOBCOOKIE=50E4985E63684BA781EE9294F21EE25E
PBS_NODENUM=0
PBS_TASKNUM=146
PBS_MOMPORT=15003
PBS_NODEFILE=/var/spool/torque/aux//1672.mgt1.wsuhpc.edu
PBS_VERSION=TORQUE-2.4.7
PBS_VNODENUM=0
PBS_ENVIRONMENT=PBS_BATCH
ENVIRONMENT=BATCH
[rsvancara@node164 ~]$ /usr/local/bin/pbsdsh  -h node163 printenv
PATH=/bin:/usr/bin
LANG=C
PBS_O_HOME=/home/admins/rsvancara
PBS_O_LANG=en_US.UTF-8
PBS_O_LOGNAME=rsvancara
PBS_O_PATH=/home/software/mpi/intel/openmpi-1.4.3/bin:/home/software/intel/Compiler/11.1/075/bin/intel64:/home/software/Modules/3.2.8/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin
PBS_O_MAIL=/var/spool/mail/rsvancara
PBS_O_SHELL=/bin/bash
PBS_SERVER=mgt1.wsuhpc.edu
PBS_O_WORKDIR=/home/admins/rsvancara/TEST
PBS_O_QUEUE=batch
PBS_O_HOST=login1
HOME=/home/admins/rsvancara
PBS_JOBNAME=STDIN
PBS_JOBID=1672.mgt1.wsuhpc.edu
PBS_QUEUE=batch
PBS_JOBCOOKIE=50E4985E63684BA781EE9294F21EE25E
PBS_NODENUM=1
PBS_TASKNUM=147
PBS_MOMPORT=15003
PBS_VERSION=TORQUE-2.4.7
PBS_VNODENUM=12
PBS_ENVIRONMENT=PBS_BATCH
ENVIRONMENT=BATCH

So one thing that strikes me as bad is that LD_LIBRARY_PATH does not
appear to be available.  I attempted to run mpiexec like this and it fails:

[rsvancara@node164 ~]$ /usr/local/bin/pbsdsh  -h node163
/home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec hostname
/home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec: error while
loading shared libraries: libimf.so: cannot open shared object file:
No such file or directory
pbsdsh: task 12 exit status 127
[rsvancara@node164 ~]$ /usr/local/bin/pbsdsh  -h node164
/home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec hostname
/home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec: error while
loading shared libraries: libimf.so: cannot open shared object file:
No such file or directory
pbsdsh: task 0 exit status 127

If this is how the Open MPI processes are being launched, then it is no
wonder they are failing, and the LD_LIBRARY_PATH error message is
indeed somewhat accurate.

So the next question is how do I ensure that this information is
available to pbsdsh?

Thanks,

Randall
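
One way to test that theory (a sketch only -- the wrapper name is made up, and the paths are copied from the LD_LIBRARY_PATH shown elsewhere in this thread): have pbsdsh launch a tiny wrapper that sets the library path itself before exec'ing the real command, so the TM-spawned task no longer depends on whatever environment the MOM hands it.

#!/bin/bash
# ~/tmwrap.sh -- hypothetical wrapper for TM-spawned tasks (home dirs are shared here)
# Prepend the Open MPI and Intel runtime directories, then run the requested command.
export LD_LIBRARY_PATH=/home/software/mpi/intel/openmpi-1.4.3/lib:/home/software/intel/Compiler/11.1/075/lib/intel64:$LD_LIBRARY_PATH
exec "$@"

Invoked roughly as: pbsdsh -h node163 $HOME/tmwrap.sh /home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec hostname. If that succeeds where the bare mpiexec fails with the libimf.so error, the missing environment is confirmed as the culprit.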


On Mon, Mar 21, 2011 at 11:24 AM, Randall Svancara  wrote:
> Ok, these are good things to check.  I am going to follow through with
> this in the next hour after our GPFS upgrade.  Thanks!!!
>
> On Mon, Mar 21, 2011 at 11:14 AM, Brock Palen  wrote:
>> On Mar 21, 2011, at 1:59 PM, Jeff Squyres wrote:
>>
>>> I no longer run Torque on my cluster, so my Torqueology is pretty rusty -- 
>>> but I think there's a Torque command to launch on remote nodes.  tmrsh or 
>>> pbsrsh or something like that...?
>>
>> pbsdsh
>> If TM is working pbsdsh should work fine.
>>
>> Torque+OpenMPI has been working just fine for us.
>> Do you have libtorque on all your compute hosts?  You should see it open on 
>> all hosts if it works.
>>
>>>
>>> Try that and make sure it works.  Open MPI should be using the same API as 
>>> that command under the covers.
>>>
>>> I also have a dim recollection that the TM API support library(ies?) may 
>>> not be installed by default.  You may have to ensure that they're available 
>>> on all nodes...?
>>>
>>>
>>> On Mar 21, 2011, at 1:53 PM, Randall Svancara wrote:
>>>
 I am not sure if there is any extra configuration necessary for torque
 to forward the environment.  I have included the output of printenv
 for an interactive qsub session.  I am really at a loss here because I
 never had this much difficulty making torque run with openmpi.  It has
 been mostly a good experience.

 Permissions of /tmp

 drwxrwxrwt   4 root root   140 Mar 20 08:57 tmp

 mpiexec hostname single node:

 [rsvancara@login1 ~]$ qsub -I -lnodes=1:ppn=12
 qsub: waiting for job 1667.mgt1.wsuhpc.edu to start
 qsub: job 1667.mgt1.wsuhpc.edu ready

 [rsvancara@node100 ~]$ mpiexec hostname
 node100
 node100
 node100
 node100
 node100
 node100
 node100
 node100
 node100
 node100
 node100
 node100

 mpiexec hostname two nodes:

 [rsvancara@node100 ~]$ mpiexec hostname
 [node100:09342] plm:tm: failed to poll for a spawned daemon, return
 status = 17002
 

Re: [OMPI users] Displaying MAIN in Totalview

2011-03-21 Thread Dominik Goeddeke

Hi,

for what it's worth: the same thing happens with DDT. OpenMPI 1.2.x runs 
fine; later versions (at least 1.4.x and newer) make DDT bail out with 
"Could not break at function MPIR_Breakpoint".


DDT has something like "OpenMPI (compatibility mode)" in its session 
launch dialog; with this setting (instead of the default "OpenMPI") it 
works flawlessly.


Dominik



On 03/21/2011 06:22 PM, Ralph Castain wrote:

Ick - appears that got dropped a long time ago. I'll add it back in and post a 
CMR for 1.4 and 1.5 series.

Thanks!
Ralph


On Mar 21, 2011, at 11:08 AM, David Turner wrote:


Hi,

About a month ago, this topic was discussed with no real resolution:

http://www.open-mpi.org/community/lists/users/2011/02/15538.php

We noticed the same problem (TV does not display the user's MAIN
routine upon initial startup), and contacted the TV developers.
They suggested a simple OMPI code modification, which we implemented
and tested; it seems to work fine.  Hopefully, this capability
can be restored in future releases.

Here is the body of our communication with the TV developers:

--

Interestingly enough, someone else asked this very same question recently and I 
finally dug into it last week and figured out what was going on. TotalView 
publishes a public interface which allows any MPI implementor to set things up 
so that it should work fairly seamlessly with TotalView. I found that one of the 
defines in the interface is

MPIR_force_to_main

and when we find this symbol defined in mpirun (or orterun in Open MPI's case) 
then we spend a bit more effort to focus the source pane on the main routine. 
As you may guess, this is NOT being defined in OpenMPI 1.4.2. It was being 
defined in the 1.2.x builds though, in a routine called totalview.c. OpenMPI 
has been re-worked significantly since then, and totalview.c has been replaced 
by debuggers.c in orte/tools/orterun. About line 130 to 140 (depending on any 
changes since my look at the 1.4.1 sources) you should find a number of MPIR_ 
symbols being defined.

struct MPIR_PROCDESC *MPIR_proctable = NULL;
int MPIR_proctable_size = 0;
int MPIR_being_debugged = 0;
volatile int MPIR_debug_state = 0;
volatile int MPIR_i_am_starter = 0;
volatile int MPIR_partial_attach_ok = 1;


I believe you should be able to insert the line:

int MPIR_force_to_main = 0;

into this section, and then the behavior you are looking for should work after 
you rebuild OpenMPI. I haven't yet had the time to do that myself, but that was 
all that existed in the 1.2.x sources, and I know those achieved the desired 
effect. It's quite possible that someone realized the symbol was initialized, 
but wasn't being used anyplace, so they just removed it, without realizing we were 
looking for it in the debugger. When I pointed this out to the other user, he 
said he would try it out and pass it on to the Open MPI group. I just checked 
on that thread, and didn't see any update, so I passed on the info myself.

--

--
Best regards,

David Turner
User Services Group    email: dptur...@lbl.gov
NERSC Division         phone: (510) 486-4027
Lawrence Berkeley Lab  fax: (510) 486-4316





--
Dr. Dominik Göddeke
Institut für Angewandte Mathematik
Technische Universität Dortmund
http://www.mathematik.tu-dortmund.de/~goeddeke
Tel. +49-(0)231-755-7218  Fax +49-(0)231-755-5933







Re: [OMPI users] OpenMPI and Torque

2011-03-21 Thread Randall Svancara
Ok, these are good things to check.  I am going to follow through with
this in the next hour after our GPFS upgrade.  Thanks!!!

On Mon, Mar 21, 2011 at 11:14 AM, Brock Palen  wrote:
> On Mar 21, 2011, at 1:59 PM, Jeff Squyres wrote:
>
>> I no longer run Torque on my cluster, so my Torqueology is pretty rusty -- 
>> but I think there's a Torque command to launch on remote nodes.  tmrsh or 
>> pbsrsh or something like that...?
>
> pbsdsh
> If TM is working pbsdsh should work fine.
>
> Torque+OpenMPI has been working just fine for us.
> Do you have libtorque on all your compute hosts?  You should see it open on 
> all hosts if it works.
>
>>
>> Try that and make sure it works.  Open MPI should be using the same API as 
>> that command under the covers.
>>
>> I also have a dim recollection that the TM API support library(ies?) may not 
>> be installed by default.  You may have to ensure that they're available on 
>> all nodes...?
>>
>>
>> On Mar 21, 2011, at 1:53 PM, Randall Svancara wrote:
>>
>>> I am not sure if there is any extra configuration necessary for torque
>>> to forward the environment.  I have included the output of printenv
>>> for an interactive qsub session.  I am really at a loss here because I
>>> never had this much difficulty making torque run with openmpi.  It has
>>> been mostly a good experience.
>>>
>>> Permissions of /tmp
>>>
>>> drwxrwxrwt   4 root root   140 Mar 20 08:57 tmp
>>>
>>> mpiexec hostname single node:
>>>
>>> [rsvancara@login1 ~]$ qsub -I -lnodes=1:ppn=12
>>> qsub: waiting for job 1667.mgt1.wsuhpc.edu to start
>>> qsub: job 1667.mgt1.wsuhpc.edu ready
>>>
>>> [rsvancara@node100 ~]$ mpiexec hostname
>>> node100
>>> node100
>>> node100
>>> node100
>>> node100
>>> node100
>>> node100
>>> node100
>>> node100
>>> node100
>>> node100
>>> node100
>>>
>>> mpiexec hostname two nodes:
>>>
>>> [rsvancara@node100 ~]$ mpiexec hostname
>>> [node100:09342] plm:tm: failed to poll for a spawned daemon, return
>>> status = 17002
>>> --
>>> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
>>> launch so we are aborting.
>>>
>>> There may be more information reported by the environment (see above).
>>>
>>> This may be because the daemon was unable to find all the needed shared
>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>> location of the shared libraries on the remote nodes and this will
>>> automatically be forwarded to the remote nodes.
>>> --
>>> --
>>> mpiexec noticed that the job aborted, but has no info as to the process
>>> that caused that situation.
>>> --
>>> --
>>> mpiexec was unable to cleanly terminate the daemons on the nodes shown
>>> below. Additional manual cleanup may be required - please refer to
>>> the "orte-clean" tool for assistance.
>>> --
>>>      node99 - daemon did not report back when launched
>>>
>>>
>>> MPIexec on one node with one cpu:
>>>
>>> [rsvancara@node164 ~]$ mpiexec printenv
>>> OMPI_MCA_orte_precondition_transports=5fbd0d3c8e4195f1-80f964226d1575ea
>>> MODULE_VERSION_STACK=3.2.8
>>> MANPATH=/home/software/mpi/intel/openmpi-1.4.3/share/man:/home/software/intel/Compiler/11.1/075/man/en_US:/home/software/intel/Compiler/11.1/075/mkl/man/en_US:/home/software/intel/Compiler/11.1/075/mkl/../man/en_US:/home/software/Modules/3.2.8/share/man:/usr/share/man
>>> HOSTNAME=node164
>>> PBS_VERSION=TORQUE-2.4.7
>>> TERM=xterm
>>> SHELL=/bin/bash
>>> HISTSIZE=1000
>>> PBS_JOBNAME=STDIN
>>> PBS_ENVIRONMENT=PBS_INTERACTIVE
>>> PBS_O_WORKDIR=/home/admins/rsvancara
>>> PBS_TASKNUM=1
>>> USER=rsvancara
>>> LD_LIBRARY_PATH=/home/software/mpi/intel/openmpi-1.4.3/lib:/home/software/intel/Compiler/11.1/075/lib/intel64:/home/software/intel/Compiler/11.1/075/ipp/em64t/sharedlib:/home/software/intel/Compiler/11.1/075/mkl/lib/em64t:/home/software/intel/Compiler/11.1/075/tbb/intel64/cc4.1.0_libc2.4_kernel2.6.16.21/lib:/home/software/intel/Compiler/11.1/075/lib
>>> LS_COLORS=no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.exe=00;32:*.com=00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00;31:*.tgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip=00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*.cpio=00;31:*.jpg=00;35:*.gif=00;35:*.bmp=00;35:*.xbm=00;35:*.xpm=00;35:*.png=00;35:*.tif=00;35:
>>> PBS_O_HOME=/home/admins/rsvancara
>>> 

Re: [OMPI users] OpenMPI and Torque

2011-03-21 Thread Brock Palen
On Mar 21, 2011, at 1:59 PM, Jeff Squyres wrote:

> I no longer run Torque on my cluster, so my Torqueology is pretty rusty -- 
> but I think there's a Torque command to launch on remote nodes.  tmrsh or 
> pbsrsh or something like that...?

pbsdsh
If TM is working pbsdsh should work fine.

Torque+OpenMPI has been working just fine for us.
Do you have libtorque on all your compute hosts?  You should see it open on all 
hosts if it works.
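
A couple of quick checks along those lines (the paths below are guesses and will differ per install; Open MPI's TM support normally lives in a dlopen'ed component rather than in the mpirun binary itself):

# Is the Torque library present on the compute node at all?
ls -l /usr/local/lib/libtorque.so* /usr/lib64/libtorque.so* 2>/dev/null

# If the TM launcher was built as a dynamic component, it should link against libtorque:
ldd /home/software/mpi/intel/openmpi-1.4.3/lib/openmpi/mca_plm_tm.so | grep -i torque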

> 
> Try that and make sure it works.  Open MPI should be using the same API as 
> that command under the covers.
> 
> I also have a dim recollection that the TM API support library(ies?) may not 
> be installed by default.  You may have to ensure that they're available on 
> all nodes...?
> 
> 
> On Mar 21, 2011, at 1:53 PM, Randall Svancara wrote:
> 
>> I am not sure if there is any extra configuration necessary for torque
>> to forward the environment.  I have included the output of printenv
>> for an interactive qsub session.  I am really at a loss here because I
>> never had this much difficulty making torque run with openmpi.  It has
>> been mostly a good experience.
>> 
>> Permissions of /tmp
>> 
>> drwxrwxrwt   4 root root   140 Mar 20 08:57 tmp
>> 
>> mpiexec hostname single node:
>> 
>> [rsvancara@login1 ~]$ qsub -I -lnodes=1:ppn=12
>> qsub: waiting for job 1667.mgt1.wsuhpc.edu to start
>> qsub: job 1667.mgt1.wsuhpc.edu ready
>> 
>> [rsvancara@node100 ~]$ mpiexec hostname
>> node100
>> node100
>> node100
>> node100
>> node100
>> node100
>> node100
>> node100
>> node100
>> node100
>> node100
>> node100
>> 
>> mpiexec hostname two nodes:
>> 
>> [rsvancara@node100 ~]$ mpiexec hostname
>> [node100:09342] plm:tm: failed to poll for a spawned daemon, return
>> status = 17002
>> --
>> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
>> launch so we are aborting.
>> 
>> There may be more information reported by the environment (see above).
>> 
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --
>> --
>> mpiexec noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --
>> --
>> mpiexec was unable to cleanly terminate the daemons on the nodes shown
>> below. Additional manual cleanup may be required - please refer to
>> the "orte-clean" tool for assistance.
>> --
>>  node99 - daemon did not report back when launched
>> 
>> 
>> MPIexec on one node with one cpu:
>> 
>> [rsvancara@node164 ~]$ mpiexec printenv
>> OMPI_MCA_orte_precondition_transports=5fbd0d3c8e4195f1-80f964226d1575ea
>> MODULE_VERSION_STACK=3.2.8
>> MANPATH=/home/software/mpi/intel/openmpi-1.4.3/share/man:/home/software/intel/Compiler/11.1/075/man/en_US:/home/software/intel/Compiler/11.1/075/mkl/man/en_US:/home/software/intel/Compiler/11.1/075/mkl/../man/en_US:/home/software/Modules/3.2.8/share/man:/usr/share/man
>> HOSTNAME=node164
>> PBS_VERSION=TORQUE-2.4.7
>> TERM=xterm
>> SHELL=/bin/bash
>> HISTSIZE=1000
>> PBS_JOBNAME=STDIN
>> PBS_ENVIRONMENT=PBS_INTERACTIVE
>> PBS_O_WORKDIR=/home/admins/rsvancara
>> PBS_TASKNUM=1
>> USER=rsvancara
>> LD_LIBRARY_PATH=/home/software/mpi/intel/openmpi-1.4.3/lib:/home/software/intel/Compiler/11.1/075/lib/intel64:/home/software/intel/Compiler/11.1/075/ipp/em64t/sharedlib:/home/software/intel/Compiler/11.1/075/mkl/lib/em64t:/home/software/intel/Compiler/11.1/075/tbb/intel64/cc4.1.0_libc2.4_kernel2.6.16.21/lib:/home/software/intel/Compiler/11.1/075/lib
>> LS_COLORS=no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.exe=00;32:*.com=00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00;31:*.tgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip=00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*.cpio=00;31:*.jpg=00;35:*.gif=00;35:*.bmp=00;35:*.xbm=00;35:*.xpm=00;35:*.png=00;35:*.tif=00;35:
>> PBS_O_HOME=/home/admins/rsvancara
>> CPATH=/home/software/intel/Compiler/11.1/075/ipp/em64t/include:/home/software/intel/Compiler/11.1/075/mkl/include:/home/software/intel/Compiler/11.1/075/tbb/include
>> PBS_MOMPORT=15003
>> PBS_O_QUEUE=batch
>> 

Re: [OMPI users] OpenMPI and Torque

2011-03-21 Thread Randall Svancara
Ok, let me give this a try.  Thanks for all your helpful suggestions.

On Mon, Mar 21, 2011 at 11:10 AM, Ralph Castain  wrote:
>
> On Mar 21, 2011, at 11:59 AM, Jeff Squyres wrote:
>
>> I no longer run Torque on my cluster, so my Torqueology is pretty rusty -- 
>> but I think there's a Torque command to launch on remote nodes.  tmrsh or 
>> pbsrsh or something like that...?
>
> pbsrsh, IIRC
>
> So run pbsrsh <node> printenv to see the environment on a remote node. 
> Etc.
>
>>
>> Try that and make sure it works.  Open MPI should be using the same API as 
>> that command under the covers.
>>
>> I also have a dim recollection that the TM API support library(ies?) may not 
>> be installed by default.  You may have to ensure that they're available on 
>> all nodes...?
>
> This is true - it's usually not installed by default, and it needs to be available 
> on all nodes since Torque starts mpiexec on a backend node.
>
>>
>>
>> On Mar 21, 2011, at 1:53 PM, Randall Svancara wrote:
>>
>>> I am not sure if there is any extra configuration necessary for torque
>>> to forward the environment.  I have included the output of printenv
>>> for an interactive qsub session.  I am really at a loss here because I
>>> never had this much difficulty making torque run with openmpi.  It has
>>> been mostly a good experience.
>>>
>>> Permissions of /tmp
>>>
>>> drwxrwxrwt   4 root root   140 Mar 20 08:57 tmp
>>>
>>> mpiexec hostname single node:
>>>
>>> [rsvancara@login1 ~]$ qsub -I -lnodes=1:ppn=12
>>> qsub: waiting for job 1667.mgt1.wsuhpc.edu to start
>>> qsub: job 1667.mgt1.wsuhpc.edu ready
>>>
>>> [rsvancara@node100 ~]$ mpiexec hostname
>>> node100
>>> node100
>>> node100
>>> node100
>>> node100
>>> node100
>>> node100
>>> node100
>>> node100
>>> node100
>>> node100
>>> node100
>>>
>>> mpiexec hostname two nodes:
>>>
>>> [rsvancara@node100 ~]$ mpiexec hostname
>>> [node100:09342] plm:tm: failed to poll for a spawned daemon, return
>>> status = 17002
>>> --
>>> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
>>> launch so we are aborting.
>>>
>>> There may be more information reported by the environment (see above).
>>>
>>> This may be because the daemon was unable to find all the needed shared
>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>> location of the shared libraries on the remote nodes and this will
>>> automatically be forwarded to the remote nodes.
>>> --
>>> --
>>> mpiexec noticed that the job aborted, but has no info as to the process
>>> that caused that situation.
>>> --
>>> --
>>> mpiexec was unable to cleanly terminate the daemons on the nodes shown
>>> below. Additional manual cleanup may be required - please refer to
>>> the "orte-clean" tool for assistance.
>>> --
>>>      node99 - daemon did not report back when launched
>>>
>>>
>>> MPIexec on one node with one cpu:
>>>
>>> [rsvancara@node164 ~]$ mpiexec printenv
>>> OMPI_MCA_orte_precondition_transports=5fbd0d3c8e4195f1-80f964226d1575ea
>>> MODULE_VERSION_STACK=3.2.8
>>> MANPATH=/home/software/mpi/intel/openmpi-1.4.3/share/man:/home/software/intel/Compiler/11.1/075/man/en_US:/home/software/intel/Compiler/11.1/075/mkl/man/en_US:/home/software/intel/Compiler/11.1/075/mkl/../man/en_US:/home/software/Modules/3.2.8/share/man:/usr/share/man
>>> HOSTNAME=node164
>>> PBS_VERSION=TORQUE-2.4.7
>>> TERM=xterm
>>> SHELL=/bin/bash
>>> HISTSIZE=1000
>>> PBS_JOBNAME=STDIN
>>> PBS_ENVIRONMENT=PBS_INTERACTIVE
>>> PBS_O_WORKDIR=/home/admins/rsvancara
>>> PBS_TASKNUM=1
>>> USER=rsvancara
>>> LD_LIBRARY_PATH=/home/software/mpi/intel/openmpi-1.4.3/lib:/home/software/intel/Compiler/11.1/075/lib/intel64:/home/software/intel/Compiler/11.1/075/ipp/em64t/sharedlib:/home/software/intel/Compiler/11.1/075/mkl/lib/em64t:/home/software/intel/Compiler/11.1/075/tbb/intel64/cc4.1.0_libc2.4_kernel2.6.16.21/lib:/home/software/intel/Compiler/11.1/075/lib
>>> LS_COLORS=no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.exe=00;32:*.com=00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00;31:*.tgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip=00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*.cpio=00;31:*.jpg=00;35:*.gif=00;35:*.bmp=00;35:*.xbm=00;35:*.xpm=00;35:*.png=00;35:*.tif=00;35:
>>> PBS_O_HOME=/home/admins/rsvancara
>>> CPATH=/home/software/intel/Compiler/11.1/075/ipp/em64t/include:/home/software/intel/Compiler/11.1/075/mkl/include:/home/software/intel/Compiler/11.1/075/tbb/include

Re: [OMPI users] OpenMPI and Torque

2011-03-21 Thread Ralph Castain

On Mar 21, 2011, at 11:59 AM, Jeff Squyres wrote:

> I no longer run Torque on my cluster, so my Torqueology is pretty rusty -- 
> but I think there's a Torque command to launch on remote nodes.  tmrsh or 
> pbsrsh or something like that...?

pbsrsh, IIRC

So run pbsrsh <node> printenv to see the environment on a remote node. Etc.

> 
> Try that and make sure it works.  Open MPI should be using the same API as 
> that command under the covers.
> 
> I also have a dim recollection that the TM API support library(ies?) may not 
> be installed by default.  You may have to ensure that they're available on 
> all nodes...?

This is true - it's usually not installed by default, and it needs to be available 
on all nodes since Torque starts mpiexec on a backend node.

> 
> 
> On Mar 21, 2011, at 1:53 PM, Randall Svancara wrote:
> 
>> I am not sure if there is any extra configuration necessary for torque
>> to forward the environment.  I have included the output of printenv
>> for an interactive qsub session.  I am really at a loss here because I
>> never had this much difficulty making torque run with openmpi.  It has
>> been mostly a good experience.
>> 
>> Permissions of /tmp
>> 
>> drwxrwxrwt   4 root root   140 Mar 20 08:57 tmp
>> 
>> mpiexec hostname single node:
>> 
>> [rsvancara@login1 ~]$ qsub -I -lnodes=1:ppn=12
>> qsub: waiting for job 1667.mgt1.wsuhpc.edu to start
>> qsub: job 1667.mgt1.wsuhpc.edu ready
>> 
>> [rsvancara@node100 ~]$ mpiexec hostname
>> node100
>> node100
>> node100
>> node100
>> node100
>> node100
>> node100
>> node100
>> node100
>> node100
>> node100
>> node100
>> 
>> mpiexec hostname two nodes:
>> 
>> [rsvancara@node100 ~]$ mpiexec hostname
>> [node100:09342] plm:tm: failed to poll for a spawned daemon, return
>> status = 17002
>> --
>> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
>> launch so we are aborting.
>> 
>> There may be more information reported by the environment (see above).
>> 
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --
>> --
>> mpiexec noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --
>> --
>> mpiexec was unable to cleanly terminate the daemons on the nodes shown
>> below. Additional manual cleanup may be required - please refer to
>> the "orte-clean" tool for assistance.
>> --
>>  node99 - daemon did not report back when launched
>> 
>> 
>> MPIexec on one node with one cpu:
>> 
>> [rsvancara@node164 ~]$ mpiexec printenv
>> OMPI_MCA_orte_precondition_transports=5fbd0d3c8e4195f1-80f964226d1575ea
>> MODULE_VERSION_STACK=3.2.8
>> MANPATH=/home/software/mpi/intel/openmpi-1.4.3/share/man:/home/software/intel/Compiler/11.1/075/man/en_US:/home/software/intel/Compiler/11.1/075/mkl/man/en_US:/home/software/intel/Compiler/11.1/075/mkl/../man/en_US:/home/software/Modules/3.2.8/share/man:/usr/share/man
>> HOSTNAME=node164
>> PBS_VERSION=TORQUE-2.4.7
>> TERM=xterm
>> SHELL=/bin/bash
>> HISTSIZE=1000
>> PBS_JOBNAME=STDIN
>> PBS_ENVIRONMENT=PBS_INTERACTIVE
>> PBS_O_WORKDIR=/home/admins/rsvancara
>> PBS_TASKNUM=1
>> USER=rsvancara
>> LD_LIBRARY_PATH=/home/software/mpi/intel/openmpi-1.4.3/lib:/home/software/intel/Compiler/11.1/075/lib/intel64:/home/software/intel/Compiler/11.1/075/ipp/em64t/sharedlib:/home/software/intel/Compiler/11.1/075/mkl/lib/em64t:/home/software/intel/Compiler/11.1/075/tbb/intel64/cc4.1.0_libc2.4_kernel2.6.16.21/lib:/home/software/intel/Compiler/11.1/075/lib
>> LS_COLORS=no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.exe=00;32:*.com=00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00;31:*.tgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip=00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*.cpio=00;31:*.jpg=00;35:*.gif=00;35:*.bmp=00;35:*.xbm=00;35:*.xpm=00;35:*.png=00;35:*.tif=00;35:
>> PBS_O_HOME=/home/admins/rsvancara
>> CPATH=/home/software/intel/Compiler/11.1/075/ipp/em64t/include:/home/software/intel/Compiler/11.1/075/mkl/include:/home/software/intel/Compiler/11.1/075/tbb/include
>> PBS_MOMPORT=15003
>> PBS_O_QUEUE=batch
>> 

Re: [OMPI users] Displaying MAIN in Totalview

2011-03-21 Thread Jeff Squyres
Welcome back, Peter.  :-)

On Mar 21, 2011, at 2:02 PM, Peter Thompson wrote:

> Gee,  I had tried posting that info earlier today, but my post was rejected 
> because my email address has changed.  This is as much a test of that address 
> change request as it is a confirmation of the info Dave reports.  (Of course 
> I'm the one who sent them the info, so it's only a little self-serving ;-)
> 
> Cheers,
> Peter Thompson
> 
> 
> Ralph Castain wrote:
>> Ick - appears that got dropped a long time ago. I'll add it back in and post 
>> a CMR for 1.4 and 1.5 series.
>> 
>> Thanks!
>> Ralph
>> 
>> 
>> On Mar 21, 2011, at 11:08 AM, David Turner wrote:
>> 
>>  
>>> Hi,
>>> 
>>> About a month ago, this topic was discussed with no real resolution:
>>> 
>>> http://www.open-mpi.org/community/lists/users/2011/02/15538.php
>>> 
>>> We noticed the same problem (TV does not display the user's MAIN
>>> routine upon initial startup), and contacted the TV developers.
>>> They suggested a simple OMPI code modification, which we implemented
>>> and tested; it seems to work fine.  Hopefully, this capability
>>> can be restored in future releases.
>>> 
>>> Here is the body of our communication with the TV developers:
>>> 
>>> --
>>> 
>>> Interestingly enough, someone else asked this very same question recently 
>>> and I finally dug into it last week and figured out what was going on. 
>>> TotalView publishes a public interface which allows any MPI implementor to 
>>> set things up so that it should work fairly seamlessly with TotalView. I 
>>> found that one of the defines in the interface is
>>> 
>>> MPIR_force_to_main
>>> 
>>> and when we find this symbol defined in mpirun (or orterun in Open MPI's 
>>> case) then we spend a bit more effort to focus the source pane on the main 
>>> routine. As you may guess, this is NOT being defined in OpenMPI 1.4.2. It 
>>> was being defined in the 1.2.x builds though, in a routine called 
>>> totalview.c. OpenMPI has been re-worked significantly since then, and 
>>> totalview.c has been replaced by debuggers.c in orte/tools/orterun. About 
>>> line 130 to 140 (depending on any changes since my look at the 1.4.1 
>>> sources) you should find a number of MPIR_ symbols being defined.
>>> 
>>> struct MPIR_PROCDESC *MPIR_proctable = NULL;
>>> int MPIR_proctable_size = 0;
>>> int MPIR_being_debugged = 0;
>>> volatile int MPIR_debug_state = 0;
>>> volatile int MPIR_i_am_starter = 0;
>>> volatile int MPIR_partial_attach_ok = 1;
>>> 
>>> 
>>> I believe you should be able to insert the line:
>>> 
>>> int MPIR_force_to_main = 0;
>>> 
>>> into this section, and then the behavior you are looking for should work 
>>> after you rebuild OpenMPI. I haven't yet had the time to do that myself, 
>>> but that was all that existed in the 1.2.x sources, and I know those 
>>> achieved the desired effect. It's quite possible that someone realized the 
>>> symbol was initialized, but wasn't be used anyplace, so they just removed 
>>> it. Without realizing we were looking for it in the debugger. When I 
>>> pointed this out to the other user, he said he would try it out and pass it 
>>> on to the Open MPI group. I just checked on that thread, and didn't see any 
>>> update, so I passed on the info myself.
>>> 
>>> --
>>> 
>>> -- 
>>> Best regards,
>>> 
>>> David Turner
>>> User Services Group    email: dptur...@lbl.gov
>>> NERSC Division         phone: (510) 486-4027
>>> Lawrence Berkeley Lab  fax: (510) 486-4316
>>>
>> 
>> 
>>  


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] OpenMPI and Torque

2011-03-21 Thread Ralph Castain

On Mar 21, 2011, at 11:53 AM, Randall Svancara wrote:

> I am not sure if there is any extra configuration necessary for torque
> to forward the environment.  I have included the output of printenv
> for an interactive qsub session.  I am really at a loss here because I
> never had this much difficulty making torque run with openmpi.  It has
> been mostly a good experience.

We're not seeing this problem from other Torque users, so it appears to be something 
in your local setup.

Note that running mpiexec on a single node doesn't invoke Torque at all - 
mpiexec just fork/execs the app processes directly. Torque is only invoked when 
running on multiple nodes.

One thing stands out immediately. When you used rsh, you specified the tmp dir:

> -mca orte_tmpdir_base /fastscratch/admins/tmp

Yet you didn't do so when running under Torque. Was there a reason?
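
For what it's worth, the same override can be carried into the Torque run either way -- on the command line or via the equivalent environment variable (a sketch; hostname stands in for the real application):

# Per command:
mpiexec -mca orte_tmpdir_base /fastscratch/admins/tmp hostname

# Or once per job script, picked up by every mpiexec in the job:
export OMPI_MCA_orte_tmpdir_base=/fastscratch/admins/tmp
mpiexec hostname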


> 
> Permissions of /tmp
> 
> drwxrwxrwt   4 root root   140 Mar 20 08:57 tmp
> 
> mpiexec hostname single node:
> 
> [rsvancara@login1 ~]$ qsub -I -lnodes=1:ppn=12
> qsub: waiting for job 1667.mgt1.wsuhpc.edu to start
> qsub: job 1667.mgt1.wsuhpc.edu ready
> 
> [rsvancara@node100 ~]$ mpiexec hostname
> node100
> node100
> node100
> node100
> node100
> node100
> node100
> node100
> node100
> node100
> node100
> node100
> 
> mpiexec hostname two nodes:
> 
> [rsvancara@node100 ~]$ mpiexec hostname
> [node100:09342] plm:tm: failed to poll for a spawned daemon, return
> status = 17002
> --
> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
> launch so we are aborting.
> 
> There may be more information reported by the environment (see above).
> 
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --
> --
> mpiexec noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
> --
> mpiexec was unable to cleanly terminate the daemons on the nodes shown
> below. Additional manual cleanup may be required - please refer to
> the "orte-clean" tool for assistance.
> --
>   node99 - daemon did not report back when launched
> 
> 
> MPIexec on one node with one cpu:
> 
> [rsvancara@node164 ~]$ mpiexec printenv
> OMPI_MCA_orte_precondition_transports=5fbd0d3c8e4195f1-80f964226d1575ea
> MODULE_VERSION_STACK=3.2.8
> MANPATH=/home/software/mpi/intel/openmpi-1.4.3/share/man:/home/software/intel/Compiler/11.1/075/man/en_US:/home/software/intel/Compiler/11.1/075/mkl/man/en_US:/home/software/intel/Compiler/11.1/075/mkl/../man/en_US:/home/software/Modules/3.2.8/share/man:/usr/share/man
> HOSTNAME=node164
> PBS_VERSION=TORQUE-2.4.7
> TERM=xterm
> SHELL=/bin/bash
> HISTSIZE=1000
> PBS_JOBNAME=STDIN
> PBS_ENVIRONMENT=PBS_INTERACTIVE
> PBS_O_WORKDIR=/home/admins/rsvancara
> PBS_TASKNUM=1
> USER=rsvancara
> LD_LIBRARY_PATH=/home/software/mpi/intel/openmpi-1.4.3/lib:/home/software/intel/Compiler/11.1/075/lib/intel64:/home/software/intel/Compiler/11.1/075/ipp/em64t/sharedlib:/home/software/intel/Compiler/11.1/075/mkl/lib/em64t:/home/software/intel/Compiler/11.1/075/tbb/intel64/cc4.1.0_libc2.4_kernel2.6.16.21/lib:/home/software/intel/Compiler/11.1/075/lib
> LS_COLORS=no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.exe=00;32:*.com=00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00;31:*.tgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip=00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*.cpio=00;31:*.jpg=00;35:*.gif=00;35:*.bmp=00;35:*.xbm=00;35:*.xpm=00;35:*.png=00;35:*.tif=00;35:
> PBS_O_HOME=/home/admins/rsvancara
> CPATH=/home/software/intel/Compiler/11.1/075/ipp/em64t/include:/home/software/intel/Compiler/11.1/075/mkl/include:/home/software/intel/Compiler/11.1/075/tbb/include
> PBS_MOMPORT=15003
> PBS_O_QUEUE=batch
> NLSPATH=/home/software/intel/Compiler/11.1/075/lib/intel64/locale/%l_%t/%N:/home/software/intel/Compiler/11.1/075/ipp/em64t/lib/locale/%l_%t/%N:/home/software/intel/Compiler/11.1/075/mkl/lib/em64t/locale/%l_%t/%N:/home/software/intel/Compiler/11.1/075/idb/intel64/locale/%l_%t/%N
> MODULE_VERSION=3.2.8
> MAIL=/var/spool/mail/rsvancara
> PBS_O_LOGNAME=rsvancara
> PATH=/home/software/mpi/intel/openmpi-1.4.3/bin:/home/software/intel/Compiler/11.1/075/bin/intel64:/home/software/Modules/3.2.8/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin
> PBS_O_LANG=en_US.UTF-8
> 

Re: [OMPI users] Displaying MAIN in Totalview

2011-03-21 Thread Peter Thompson
Gee,  I had tried posting that info earlier today, but my post was 
rejected because my email address has changed.  This is as much a test 
of that address change request as it is a confirmation of the info Dave 
reports.  (Of course I'm the one who sent them the info, so it's only a 
little self-serving ;-)


Cheers,
Peter Thompson


Ralph Castain wrote:

Ick - appears that got dropped a long time ago. I'll add it back in and post a 
CMR for 1.4 and 1.5 series.

Thanks!
Ralph


On Mar 21, 2011, at 11:08 AM, David Turner wrote:

  

Hi,

About a month ago, this topic was discussed with no real resolution:

http://www.open-mpi.org/community/lists/users/2011/02/15538.php

We noticed the same problem (TV does not display the user's MAIN
routine upon initial startup), and contacted the TV developers.
They suggested a simple OMPI code modification, which we implemented
and tested; it seems to work fine.  Hopefully, this capability
can be restored in future releases.

Here is the body of our communication with the TV developers:

--

Interestingly enough, someone else asked this very same question recently and I 
finally dug into it last week and figured out what was going on. TotalView 
publishes a public interface which allows any MPI implementor to set things up 
so that it should work fairly seamlessly with TotalView. I found that one of the 
defines in the interface is

MPIR_force_to_main

and when we find this symbol defined in mpirun (or orterun in Open MPI's case) 
then we spend a bit more effort to focus the source pane on the main routine. 
As you may guess, this is NOT being defined in OpenMPI 1.4.2. It was being 
defined in the 1.2.x builds though, in a routine called totalview.c. OpenMPI 
has been re-worked significantly since then, and totalview.c has been replaced 
by debuggers.c in orte/tools/orterun. About line 130 to 140 (depending on any 
changes since my look at the 1.4.1 sources) you should find a number of MPIR_ 
symbols being defined.

struct MPIR_PROCDESC *MPIR_proctable = NULL;
int MPIR_proctable_size = 0;
int MPIR_being_debugged = 0;
volatile int MPIR_debug_state = 0;
volatile int MPIR_i_am_starter = 0;
volatile int MPIR_partial_attach_ok = 1;


I believe you should be able to insert the line:

int MPIR_force_to_main = 0;

into this section, and then the behavior you are looking for should work after 
you rebuild OpenMPI. I haven't yet had the time to do that myself, but that was 
all that existed in the 1.2.x sources, and I know those achieved the desired 
effect. It's quite possible that someone realized the symbol was initialized, 
but wasn't being used anyplace, so they just removed it, without realizing we were 
looking for it in the debugger. When I pointed this out to the other user, he 
said he would try it out and pass it on to the Open MPI group. I just checked 
on that thread, and didn't see any update, so I passed on the info myself.

--

--
Best regards,

David Turner
User Services Group    email: dptur...@lbl.gov
NERSC Division         phone: (510) 486-4027
Lawrence Berkeley Lab  fax: (510) 486-4316




  


Re: [OMPI users] OpenMPI and Torque

2011-03-21 Thread Jeff Squyres
I no longer run Torque on my cluster, so my Torqueology is pretty rusty -- but 
I think there's a Torque command to launch on remote nodes.  tmrsh or pbsrsh or 
something like that...?

Try that and make sure it works.  Open MPI should be using the same API as that 
command under the covers.

I also have a dim recollection that the TM API support library(ies?) may not be 
installed by default.  You may have to ensure that they're available on all 
nodes...?
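
One quick way to tell whether a given build picked up TM support at all is to ask ompi_info for its component list (exact wording varies a bit by version):

# If Open MPI was configured against Torque's TM library, the tm components show up here:
ompi_info | grep -i ' tm'
# Expect lines roughly like "MCA ras: tm (...)" and "MCA plm: tm (...)".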


On Mar 21, 2011, at 1:53 PM, Randall Svancara wrote:

> I am not sure if there is any extra configuration necessary for torque
> to forward the environment.  I have included the output of printenv
> for an interactive qsub session.  I am really at a loss here because I
> never had this much difficulty making torque run with openmpi.  It has
> been mostly a good experience.
> 
> Permissions of /tmp
> 
> drwxrwxrwt   4 root root   140 Mar 20 08:57 tmp
> 
> mpiexec hostname single node:
> 
> [rsvancara@login1 ~]$ qsub -I -lnodes=1:ppn=12
> qsub: waiting for job 1667.mgt1.wsuhpc.edu to start
> qsub: job 1667.mgt1.wsuhpc.edu ready
> 
> [rsvancara@node100 ~]$ mpiexec hostname
> node100
> node100
> node100
> node100
> node100
> node100
> node100
> node100
> node100
> node100
> node100
> node100
> 
> mpiexec hostname two nodes:
> 
> [rsvancara@node100 ~]$ mpiexec hostname
> [node100:09342] plm:tm: failed to poll for a spawned daemon, return
> status = 17002
> --
> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
> launch so we are aborting.
> 
> There may be more information reported by the environment (see above).
> 
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --
> --
> mpiexec noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
> --
> mpiexec was unable to cleanly terminate the daemons on the nodes shown
> below. Additional manual cleanup may be required - please refer to
> the "orte-clean" tool for assistance.
> --
>   node99 - daemon did not report back when launched
> 
> 
> MPIexec on one node with one cpu:
> 
> [rsvancara@node164 ~]$ mpiexec printenv
> OMPI_MCA_orte_precondition_transports=5fbd0d3c8e4195f1-80f964226d1575ea
> MODULE_VERSION_STACK=3.2.8
> MANPATH=/home/software/mpi/intel/openmpi-1.4.3/share/man:/home/software/intel/Compiler/11.1/075/man/en_US:/home/software/intel/Compiler/11.1/075/mkl/man/en_US:/home/software/intel/Compiler/11.1/075/mkl/../man/en_US:/home/software/Modules/3.2.8/share/man:/usr/share/man
> HOSTNAME=node164
> PBS_VERSION=TORQUE-2.4.7
> TERM=xterm
> SHELL=/bin/bash
> HISTSIZE=1000
> PBS_JOBNAME=STDIN
> PBS_ENVIRONMENT=PBS_INTERACTIVE
> PBS_O_WORKDIR=/home/admins/rsvancara
> PBS_TASKNUM=1
> USER=rsvancara
> LD_LIBRARY_PATH=/home/software/mpi/intel/openmpi-1.4.3/lib:/home/software/intel/Compiler/11.1/075/lib/intel64:/home/software/intel/Compiler/11.1/075/ipp/em64t/sharedlib:/home/software/intel/Compiler/11.1/075/mkl/lib/em64t:/home/software/intel/Compiler/11.1/075/tbb/intel64/cc4.1.0_libc2.4_kernel2.6.16.21/lib:/home/software/intel/Compiler/11.1/075/lib
> LS_COLORS=no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.exe=00;32:*.com=00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00;31:*.tgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip=00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*.cpio=00;31:*.jpg=00;35:*.gif=00;35:*.bmp=00;35:*.xbm=00;35:*.xpm=00;35:*.png=00;35:*.tif=00;35:
> PBS_O_HOME=/home/admins/rsvancara
> CPATH=/home/software/intel/Compiler/11.1/075/ipp/em64t/include:/home/software/intel/Compiler/11.1/075/mkl/include:/home/software/intel/Compiler/11.1/075/tbb/include
> PBS_MOMPORT=15003
> PBS_O_QUEUE=batch
> NLSPATH=/home/software/intel/Compiler/11.1/075/lib/intel64/locale/%l_%t/%N:/home/software/intel/Compiler/11.1/075/ipp/em64t/lib/locale/%l_%t/%N:/home/software/intel/Compiler/11.1/075/mkl/lib/em64t/locale/%l_%t/%N:/home/software/intel/Compiler/11.1/075/idb/intel64/locale/%l_%t/%N
> MODULE_VERSION=3.2.8
> MAIL=/var/spool/mail/rsvancara
> PBS_O_LOGNAME=rsvancara
> PATH=/home/software/mpi/intel/openmpi-1.4.3/bin:/home/software/intel/Compiler/11.1/075/bin/intel64:/home/software/Modules/3.2.8/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin
> PBS_O_LANG=en_US.UTF-8
> 

Re: [OMPI users] OpenMPI and Torque

2011-03-21 Thread Randall Svancara
I am not sure if there is any extra configuration necessary for torque
to forward the environment.  I have included the output of printenv
for an interactive qsub session.  I am really at a loss here because I
never had this much difficulty making torque run with openmpi.  It has
been mostly a good experience.

Permissions of /tmp

drwxrwxrwt   4 root root   140 Mar 20 08:57 tmp

mpiexec hostname single node:

[rsvancara@login1 ~]$ qsub -I -lnodes=1:ppn=12
qsub: waiting for job 1667.mgt1.wsuhpc.edu to start
qsub: job 1667.mgt1.wsuhpc.edu ready

[rsvancara@node100 ~]$ mpiexec hostname
node100
node100
node100
node100
node100
node100
node100
node100
node100
node100
node100
node100

mpiexec hostname two nodes:

[rsvancara@node100 ~]$ mpiexec hostname
[node100:09342] plm:tm: failed to poll for a spawned daemon, return
status = 17002
--
A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--
--
mpiexec noticed that the job aborted, but has no info as to the process
that caused that situation.
--
--
mpiexec was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--
node99 - daemon did not report back when launched
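
A hypothetical next step for narrowing down an error like the 17002 above: raise the plm framework's verbosity so the tm component reports what it is doing while it spawns the daemons, e.g.

# Rerun the two-node case with the launcher's debug output turned up:
mpiexec -mca plm_base_verbose 10 hostname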


MPIexec on one node with one cpu:

[rsvancara@node164 ~]$ mpiexec printenv
OMPI_MCA_orte_precondition_transports=5fbd0d3c8e4195f1-80f964226d1575ea
MODULE_VERSION_STACK=3.2.8
MANPATH=/home/software/mpi/intel/openmpi-1.4.3/share/man:/home/software/intel/Compiler/11.1/075/man/en_US:/home/software/intel/Compiler/11.1/075/mkl/man/en_US:/home/software/intel/Compiler/11.1/075/mkl/../man/en_US:/home/software/Modules/3.2.8/share/man:/usr/share/man
HOSTNAME=node164
PBS_VERSION=TORQUE-2.4.7
TERM=xterm
SHELL=/bin/bash
HISTSIZE=1000
PBS_JOBNAME=STDIN
PBS_ENVIRONMENT=PBS_INTERACTIVE
PBS_O_WORKDIR=/home/admins/rsvancara
PBS_TASKNUM=1
USER=rsvancara
LD_LIBRARY_PATH=/home/software/mpi/intel/openmpi-1.4.3/lib:/home/software/intel/Compiler/11.1/075/lib/intel64:/home/software/intel/Compiler/11.1/075/ipp/em64t/sharedlib:/home/software/intel/Compiler/11.1/075/mkl/lib/em64t:/home/software/intel/Compiler/11.1/075/tbb/intel64/cc4.1.0_libc2.4_kernel2.6.16.21/lib:/home/software/intel/Compiler/11.1/075/lib
LS_COLORS=no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.exe=00;32:*.com=00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00;31:*.tgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip=00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*.cpio=00;31:*.jpg=00;35:*.gif=00;35:*.bmp=00;35:*.xbm=00;35:*.xpm=00;35:*.png=00;35:*.tif=00;35:
PBS_O_HOME=/home/admins/rsvancara
CPATH=/home/software/intel/Compiler/11.1/075/ipp/em64t/include:/home/software/intel/Compiler/11.1/075/mkl/include:/home/software/intel/Compiler/11.1/075/tbb/include
PBS_MOMPORT=15003
PBS_O_QUEUE=batch
NLSPATH=/home/software/intel/Compiler/11.1/075/lib/intel64/locale/%l_%t/%N:/home/software/intel/Compiler/11.1/075/ipp/em64t/lib/locale/%l_%t/%N:/home/software/intel/Compiler/11.1/075/mkl/lib/em64t/locale/%l_%t/%N:/home/software/intel/Compiler/11.1/075/idb/intel64/locale/%l_%t/%N
MODULE_VERSION=3.2.8
MAIL=/var/spool/mail/rsvancara
PBS_O_LOGNAME=rsvancara
PATH=/home/software/mpi/intel/openmpi-1.4.3/bin:/home/software/intel/Compiler/11.1/075/bin/intel64:/home/software/Modules/3.2.8/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin
PBS_O_LANG=en_US.UTF-8
PBS_JOBCOOKIE=D52DE562B685A462849C1136D6B581F9
INPUTRC=/etc/inputrc
PWD=/home/admins/rsvancara
_LMFILES_=/home/software/Modules/3.2.8/modulefiles/modules:/home/software/Modules/3.2.8/modulefiles/null:/home/software/modulefiles/intel/11.1.075:/home/software/modulefiles/openmpi/1.4.3_intel
PBS_NODENUM=0
LANG=C
MODULEPATH=/home/software/Modules/versions:/home/software/Modules/$MODULE_VERSION/modulefiles:/home/software/modulefiles
LOADEDMODULES=modules:null:intel/11.1.075:openmpi/1.4.3_intel
PBS_O_SHELL=/bin/bash
PBS_SERVER=mgt1.wsuhpc.edu
PBS_JOBID=1670.mgt1.wsuhpc.edu
SHLVL=1
HOME=/home/admins/rsvancara
INTEL_LICENSES=/home/software/intel/Compiler/11.1/075/licenses:/opt/intel/licenses
PBS_O_HOST=login1

Re: [OMPI users] Displaying MAIN in Totalview

2011-03-21 Thread Ralph Castain
Ick - appears that got dropped a long time ago. I'll add it back in and post a 
CMR for 1.4 and 1.5 series.

Thanks!
Ralph


On Mar 21, 2011, at 11:08 AM, David Turner wrote:

> Hi,
> 
> About a month ago, this topic was discussed with no real resolution:
> 
> http://www.open-mpi.org/community/lists/users/2011/02/15538.php
> 
> We noticed the same problem (TV does not display the user's MAIN
> routine upon initial startup), and contacted the TV developers.
> They suggested a simple OMPI code modification, which we implemented
> and tested; it seems to work fine.  Hopefully, this capability
> can be restored in future releases.
> 
> Here is the body of our communication with the TV developers:
> 
> --
> 
> Interestingly enough, someone else asked this very same question recently and 
> I finally dug into it last week and figured out what was going on. TotalView 
> publishes a public interface which allows any MPI implementor to set things 
> up so that it should work fairly seamlessly with TotalView. I found that one of 
> the defines in the interface is
> 
> MPIR_force_to_main
> 
> and when we find this symbol defined in mpirun (or orterun in Open MPI's 
> case) then we spend a bit more effort to focus the source pane on the main 
> routine. As you may guess, this is NOT being defined in OpenMPI 1.4.2. It was 
> being defined in the 1.2.x builds though, in a routine called totalview.c. 
> OpenMPI has been re-worked significantly since then, and totalview.c has been 
> replaced by debuggers.c in orte/tools/orterun. About line 130 to 140 
> (depending on any changes since my look at the 1.4.1 sources) you should find 
> a number of MPIR_ symbols being defined.
> 
> struct MPIR_PROCDESC *MPIR_proctable = NULL;
> int MPIR_proctable_size = 0;
> int MPIR_being_debugged = 0;
> volatile int MPIR_debug_state = 0;
> volatile int MPIR_i_am_starter = 0;
> volatile int MPIR_partial_attach_ok = 1;
> 
> 
> I believe you should be able to insert the line:
> 
> int MPIR_force_to_main = 0;
> 
> into this section, and then the behavior you are looking for should work 
> after you rebuild OpenMPI. I haven't yet had the time to do that myself, but 
> that was all that existed in the 1.2.x sources, and I know those achieved the 
> desired effect. It's quite possible that someone realized the symbol was 
> initialized but wasn't being used anyplace, so they just removed it, without 
> realizing we were looking for it in the debugger. When I pointed this out to 
> the other user, he said he would try it out and pass it on to the Open MPI 
> group. I just checked on that thread, and didn't see any update, so I passed 
> on the info myself.
> 
> --
> 
> -- 
> Best regards,
> 
> David Turner
> User Services Group    email: dptur...@lbl.gov
> NERSC Division         phone: (510) 486-4027
> Lawrence Berkeley Lab  fax:   (510) 486-4316
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] 1.5.3 and SGE integration?

2011-03-21 Thread Ralph Castain

On Mar 21, 2011, at 11:12 AM, Dave Love wrote:

> Ralph Castain  writes:
> 
>> Just looking at this for another question. Yes, SGE integration is broken in 
>> 1.5. Looking at how to fix now.
>> 
>> Meantime, you can get it to work by adding "-mca plm ^rshd" to your mpirun cmd 
>> line.
> 
> Thanks.  I'd forgotten about plm when checking, though I guess that
> wouldn't have helped me.
> 
> Should rshd be mentioned in the release notes?

Just starting the discussion on the best solution going forward. I'd rather not 
have to tell SGE users to add this to their cmd line. :-(

> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] 1.5.3 and SGE integration?

2011-03-21 Thread Dave Love
Ralph Castain  writes:

> Just looking at this for another question. Yes, SGE integration is broken in 
> 1.5. Looking at how to fix now.
>
> Meantime, you can get it to work by adding "-mca plm ^rshd" to your mpirun cmd 
> line.

Thanks.  I'd forgotten about plm when checking, though I guess that
wouldn't have helped me.

Should rshd be mentioned in the release notes?



[OMPI users] Displaying MAIN in Totalview

2011-03-21 Thread David Turner

Hi,

About a month ago, this topic was discussed with no real resolution:

http://www.open-mpi.org/community/lists/users/2011/02/15538.php

We noticed the same problem (TV does not display the user's MAIN
routine upon initial startup), and contacted the TV developers.
They suggested a simple OMPI code modification, which we implemented
and tested; it seems to work fine.  Hopefully, this capability
can be restored in future releases.

Here is the body of our communication with the TV developers:

--

Interestingly enough, someone else asked this very same question 
recently and I finally dug into it last week and figured out what was 
going on. TotalView publishes a public interface which allows any MPI 
implementor to set things up so that it should work fairly seamlessly with 
TotalView. I found that one of the defines in the interface is


MPIR_force_to_main

and when we find this symbol defined in mpirun (or orterun in Open MPI's 
case) then we spend a bit more effort to focus the source pane on the 
main routine. As you may guess, this is NOT being defined in OpenMPI 
1.4.2. It was being defined in the 1.2.x builds though, in a routine 
called totalview.c. OpenMPI has been re-worked significantly since then, 
and totalview.c has been replaced by debuggers.c in orte/tools/orterun. 
About line 130 to 140 (depending on any changes since my look at the 
1.4.1 sources) you should find a number of MPIR_ symbols being defined.


struct MPIR_PROCDESC *MPIR_proctable = NULL;
int MPIR_proctable_size = 0;
int MPIR_being_debugged = 0;
volatile int MPIR_debug_state = 0;
volatile int MPIR_i_am_starter = 0;
volatile int MPIR_partial_attach_ok = 1;


I believe you should be able to insert the line:

int MPIR_force_to_main = 0;

into this section, and then the behavior you are looking for should work 
after you rebuild OpenMPI. I haven't yet had the time to do that myself, 
but that was all that existed in the 1.2.x sources, and I know those 
achieved the desired effect. It's quite possible that someone realized 
the symbol was initialized but wasn't being used anyplace, so they just 
removed it, without realizing we were looking for it in the debugger. 
When I pointed this out to the other user, he said he would try it out 
and pass it on to the Open MPI group. I just checked on that thread, and 
didn't see any update, so I passed on the info myself.


--

--
Best regards,

David Turner
User Services Group    email: dptur...@lbl.gov
NERSC Division         phone: (510) 486-4027
Lawrence Berkeley Lab  fax:   (510) 486-4316


Re: [OMPI users] OpenMPI and Torque

2011-03-21 Thread Ralph Castain
Can you run anything under TM? Try running "hostname" directly from Torque to 
see if anything works at all.

The error message is telling you that the Torque daemon on the remote node 
reported a failure when trying to launch the OMPI daemon. Could be that Torque 
isn't set up to forward environments, so the OMPI daemon isn't finding the required 
libs. You could directly run "printenv" to see how your remote environment is being 
set up.

Could be that the tmp dir lacks correct permissions for a user to create the 
required directories. The OMPI daemon tries to create a session directory in 
the tmp dir, so failure to do so would indeed cause the launch to fail. You can 
specify the tmp dir with a cmd line option to mpirun. See "mpirun -h" for info.
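A hedged sketch of those checks from inside an interactive Torque job, reusing the paths from earlier in this thread (orte_tmpdir_base is assumed to be the MCA parameter for relocating the session directory; the scratch path is only an example):

  /usr/local/bin/pbsdsh hostname          # can TM launch anything on the remote nodes?
  /usr/local/bin/pbsdsh printenv | sort   # what environment do TM-spawned tasks see?

  # point the OMPI daemons at a tmp dir that is known to be writable
  mpirun -mca orte_tmpdir_base /scratch/$USER -np 24 hostname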


On Mar 21, 2011, at 12:21 AM, Randall Svancara wrote:

> I have a question about using OpenMPI and Torque on stateless nodes.
> I have compiled openmpi 1.4.3 with --with-tm=/usr/local
> --without-slurm using intel compiler version 11.1.075.
> 
> When I run a simple "hello world" mpi program, I am receiving the
> following error.
> 
> [node164:11193] plm:tm: failed to poll for a spawned daemon, return
> status = 17002
> --
> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
> launch so we are aborting.
> 
> There may be more information reported by the environment (see above).
> 
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --
> --
> mpiexec noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
> --
> mpiexec was unable to cleanly terminate the daemons on the nodes shown
> below. Additional manual cleanup may be required - please refer to
> the "orte-clean" tool for assistance.
> --
> node163 - daemon did not report back when launched
> node159 - daemon did not report back when launched
> node158 - daemon did not report back when launched
> node157 - daemon did not report back when launched
> node156 - daemon did not report back when launched
> node155 - daemon did not report back when launched
> node154 - daemon did not report back when launched
> node152 - daemon did not report back when launched
> node151 - daemon did not report back when launched
> node150 - daemon did not report back when launched
> node149 - daemon did not report back when launched
> 
> 
> But if I include:
> 
> -mca plm rsh
> 
> The job runs just fine.
> 
> I am not sure what the problem is with torque or openmpi that prevents
> the process from launching on remote nodes.  I have posted to the
> torque list, and someone suggested that limited temporary directory
> space may be causing issues.  I have 100MB allocated to /tmp.
> 
> Any ideas as to why I am having this problem would be appreciated.
> 
> 
> -- 
> Randall Svancara
> http://knowyourlinux.com/
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] 1.5.3 and SGE integration?

2011-03-21 Thread Ralph Castain
Just looking at this for another question. Yes, SGE integration is broken in 
1.5. Looking at how to fix now.

Meantime, you can get it to work by adding "-mca plm ^rshd" to your mpirun cmd 
line.
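For example (a sketch; the executable name and process count are placeholders):

  # exclude the rshd launcher, per the workaround described above
  mpirun -mca plm ^rshd -np 16 ./my_mpi_app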


On Mar 21, 2011, at 9:47 AM, Dave Love wrote:

> Terry Dontje  writes:
> 
>> Dave what version of Grid Engine are you using?
> 
> 6.2u5, plus irrelevant patches.  It's fine with ompi 1.4.  (All I did to
> switch was to load the 1.5.3 modules environment.)
> 
>> The plm checks for the following env vars to determine if you are
>> running Grid Engine.
>> SGE_ROOT
>> ARC
>> PE_HOSTFILE
>> JOB_ID
>> 
>> If these are not there during the session that mpirun is executed then
>> it will resort to ssh.
> 
> Sure.  What ras_gridengine_debug reported looked correct.  I'll try to
> debug it.  At least I stand a reasonable chance with grid engine issues.
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] 1.5.3 and SGE integration?

2011-03-21 Thread Dave Love
Terry Dontje  writes:

> Dave what version of Grid Engine are you using?

6.2u5, plus irrelevant patches.  It's fine with ompi 1.4.  (All I did to
switch was to load the 1.5.3 modules environment.)

> The plm checks for the following env vars to determine if you are
> running Grid Engine.
> SGE_ROOT
> ARC
> PE_HOSTFILE
> JOB_ID
>
> If these are not there during the session that mpirun is executed then
> it will resort to ssh.

Sure.  What ras_gridengine_debug reported looked correct.  I'll try to
debug it.  At least I stand a reasonable chance with grid engine issues.



Re: [OMPI users] bizarre failure with IMB/openib

2011-03-21 Thread Dave Love
Peter Kjellström  writes:

> Are you sure you launched it correctly and that you have (re)built OpenMPI 
> against your Redhat-5 ib stack?

Yes.  I had to rebuild because I'd omitted openib when we only needed
psm.  As I said, I did exactly the same thing successfully with PMB
(initially because I wanted to try an old binary, and PMB was lying
around).

>>   Your MPI job is now going to abort; sorry.
> ...
>>   [lvgig116:07931] 19 more processes have sent help message
>> help-mca-bml-r2.txt / unreachable proc [lvgig116:07931] Set MCA parameter
>
> Seems to me that OpenMPI gave up because it didn't succeed in initializing 
> any 
> inter-node btl/mtl.

Sure, but why won't it load the btl under IMB when it will under PMB
(and other codes like XHPL), and how do I get any diagnostics?

My boss has just stumbled upon a reference while looking for something
else.  It looks as if it's an OFED bug entry, but I can't find an
operational version of an OFED tracker or any other reference to the bug
other than (the equivalent of)
http://lists.openfabrics.org/pipermail/ewg/2010-March/014983.html :

  1976  maj  jsquyres at cisco.com   errors running IMB over openmpi-1.4.1

I guess Jeff will enlighten me if/when he spots this.  (Thanks in
advance, obviously.)



Re: [OMPI users] 1.5.3 and SGE integration?

2011-03-21 Thread Terry Dontje

Dave what version of Grid Engine are you using?
The plm checks for the following env vars to determine if you are 
running Grid Engine.

SGE_ROOT
ARC
PE_HOSTFILE
JOB_ID

If these are not there during the session in which mpirun is executed, then 
it will resort to ssh.
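A quick way to check that from inside a job script might be (a bash sketch):

  for v in SGE_ROOT ARC PE_HOSTFILE JOB_ID; do
      printf '%s=%s\n' "$v" "${!v:-<unset>}"
  done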


--td


On 03/21/2011 08:24 AM, Dave Love wrote:

I've just tried 1.5.3 under SGE with tight integration, which seems to
be broken.  I built and ran in the same way as for 1.4.{1,3}, which
works, and ompi_info reports the same gridengine parameters for 1.5 as
for 1.4.

The symptoms are that it reports a failure to communicate using ssh,
whereas it should be using the SGE builtin method via qrsh.

There doesn't seem to be a relevant bug report, but before I
investigate, has anyone else succeeded/failed with it, or have any
hints?

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com





Re: [OMPI users] bizarre failure with IMB/openib

2011-03-21 Thread Peter Kjellström
On Monday, March 21, 2011 12:25:37 pm Dave Love wrote:
> I'm trying to test some new nodes with ConnectX adaptors, and failing to
> get (so far just) IMB to run on them.
...
> I'm using gcc-compiled OMPI 1.4.3 and the current RedHat 5 OFED with IMB
> 3.2.2, specifying `btl openib,sm,self' (or `mtl psm' on the Qlogic
> nodes).  I'm not sure what else might be relevant.  The output from
> trying to run IMB follows, for what it's worth.
> 
>  
> --
> At least one pair of MPI processes are unable to reach each other for MPI
> communications.  This means that no Open MPI device has indicated that it
> can be used to communicate between these processes.  This is an error;
> Open MPI requires that all MPI processes be able to reach each other. 
> This error can sometimes be the result of forgetting to specify the "self"
> BTL.
> 
> Process 1 ([[25307,1],2]) is on host: lvgig116
> Process 2 ([[25307,1],12]) is on host: lvgig117
> BTLs attempted: self sm

Are you sure you launched it correctly and that you have (re)built OpenMPI 
against your Redhat-5 ib stack?
 
>   Your MPI job is now going to abort; sorry.
...
>   [lvgig116:07931] 19 more processes have sent help message
> help-mca-bml-r2.txt / unreachable proc [lvgig116:07931] Set MCA parameter

Seems to me that OpenMPI gave up because it didn't succeed in initializing any 
inter-node btl/mtl.

I'd suggest you try (roughly in order):

 1) ibstat on all nodes to verify that your ib interfaces are up
 2) try a verbs level test (like ib_write_bw) to verify data can flow
 3) make sure your OpenMPI was built with the redhat libibverbs-devel present
(=> a suitable openib btl is built).
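A sketch of those checks (tool names assume a standard OFED install; <server-node> is a placeholder):

  ibstat                      # 1) port State should be Active on every node
  ib_write_bw                 # 2) start the server side on one node ...
  ib_write_bw <server-node>   #    ... and run the client side on another
  ompi_info | grep openib     # 3) confirm the openib BTL was actually built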

/Peter

> "orte_base_help_aggregate" to 0 to see all help / error messages
> [lvgig116:07931] 19 more processes have sent help message help-mpi-runtime
> / mpi_init:startup:internal-failure




Re: [OMPI users] OpenMPI 1.2.x segfault as regular user

2011-03-21 Thread Prentice Bisbal
On 03/20/2011 06:22 PM, kevin.buck...@ecs.vuw.ac.nz wrote:
> 
>> It's not hard to test whether or not SELinux is the problem. You can
>> turn SELinux off on the command-line with this command:
>>
>> setenforce 0
>>
>> Of course, you need to be root in order to do this.
>>
>> After turning SELinux off, you can try reproducing the error. If it
>> still occurs, it's SELinux, if it doesn't the problem is elswhere. When
>> your done, you can reenable SELinux with
>>
>> setenforce 1
>>
>> If you're running your job across multiple nodes, you should disable
>> SELinux on all of them for testing.
> 
> You are not actually disabling SELinux with setenforce 0, just
> putting it into "permissive" mode: SELinux is still active.
> 

That's correct. Thanks for catching my inaccurate choice of words.

> Running SELinux in its permissive mode, as opposed to disabling it
> at boot time, means SELinux logs the things that would have caused
> it to dive in, were it running in "enforcing" mode.

I forgot about that. Checking those logs will make debugging even easier
for the original poster.

> 
> There's then a tool you can run over that log that will suggest
> the ACL changes you need to make to fix the issue from an SELinux
> perspective.
> 
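For what it's worth, the usual workflow on a RHEL-type system looks roughly like this (audit2allow is an assumption about which tool is meant, and the audit log path may differ):

  setenforce 0                                          # permissive: log denials, don't block
  mpirun -np 4 ./a.out                                  # reproduce the failure
  grep denied /var/log/audit/audit.log | audit2allow    # show suggested policy changes
  setenforce 1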

-- 
Prentice


Re: [OMPI users] intel compiler linking issue and issue of environment variable on remote node, with open mpi 1.4.3

2011-03-21 Thread Tim Prince

On 3/21/2011 5:21 AM, ya...@adina.com wrote:


I am trying to compile our codes with open mpi 1.4.3, by intel
compilers 8.1.

(1) For open mpi 1.4.3 installation on linux beowulf cluster, I use:

./configure --prefix=/home/yiguang/dmp-setup/openmpi-1.4.3 \
    CC=icc CXX=icpc F77=ifort FC=ifort --enable-static \
    LDFLAGS="-i-static -static-libcxa" \
    --with-wrapper-ldflags="-i-static -static-libcxa" 2>&1 | tee config.log

and

make all install 2>&1 | tee install.log

The issue is that I am trying to build Open MPI 1.4.3 with the Intel
compiler libraries statically linked into it, so that when we run
mpirun/orterun it does not need to dynamically load any Intel
libraries. But mpirun always asks for some Intel library (e.g.
libsvml.so) if I do not put the Intel library path on the library
search path ($LD_LIBRARY_PATH). I checked the Open MPI user archive;
it seems one kind user mentioned using "-i-static" (in my case) or
"-static-intel" in LDFLAGS, which is what I did, but it does not seem
to work, and I did not find any confirmation of whether this works for
anyone else. Could anyone help me with this? Thanks!



If you are to use such an ancient compiler (apparently a 32-bit one), 
you must read the docs which come with it, rather than relying on 
comments about a more recent version.  libsvml isn't included 
automatically at link time by that 32-bit compiler, unless you specify 
an SSE option, such as -xW.
It's likely that no one has verified OpenMPI with a compiler of that 
vintage.  We never used the 32-bit compiler for MPI, and we encountered 
run-time library bugs for the ifort x86_64 which weren't fixed until 
later versions.
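One hedged way to see what is still being pulled in dynamically after such a build, using the install prefix from the configure line above:

  ldd /home/yiguang/dmp-setup/openmpi-1.4.3/bin/mpirun | grep -i intel
  ldd /home/yiguang/dmp-setup/openmpi-1.4.3/bin/orted  | grep -i intel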



--
Tim Prince


[OMPI users] 1.5.3 and SGE integration?

2011-03-21 Thread Dave Love
I've just tried 1.5.3 under SGE with tight integration, which seems to
be broken.  I built and ran in the same way as for 1.4.{1,3}, which
works, and ompi_info reports the same gridengine parameters for 1.5 as
for 1.4.

The symptoms are that it reports a failure to communicate using ssh,
whereas it should be using the SGE builtin method via qrsh.

There doesn't seem to be a relevant bug report, but before I
investigate, has anyone else succeeded/failed with it, or have any
hints?



[OMPI users] intel compiler linking issue and issue of environment variable on remote node, with open mpi 1.4.3

2011-03-21 Thread yanyg
Hi,

I am trying to compile our codes with Open MPI 1.4.3, using Intel 
compilers 8.1. 

(1) For open mpi 1.4.3 installation on linux beowulf cluster, I use:

./configure --prefix=/home/yiguang/dmp-setup/openmpi-1.4.3 \
    CC=icc CXX=icpc F77=ifort FC=ifort --enable-static \
    LDFLAGS="-i-static -static-libcxa" \
    --with-wrapper-ldflags="-i-static -static-libcxa" 2>&1 | tee config.log

and 

make all install 2>&1 | tee install.log

The issue is that I am trying to build Open MPI 1.4.3 with the Intel 
compiler libraries statically linked into it, so that when we run 
mpirun/orterun it does not need to dynamically load any Intel 
libraries. But mpirun always asks for some Intel library (e.g. 
libsvml.so) if I do not put the Intel library path on the library 
search path ($LD_LIBRARY_PATH). I checked the Open MPI user archive; 
it seems one kind user mentioned using "-i-static" (in my case) or 
"-static-intel" in LDFLAGS, which is what I did, but it does not seem 
to work, and I did not find any confirmation of whether this works for 
anyone else. Could anyone help me with this? Thanks!

(2) After compiling and linking our in-house codes with Open MPI 
1.4.3, we want to put together a minimal set of executables for our 
codes, plus a few from the Open MPI 1.4.3 installation, without any 
dependence on external settings such as environment variables.

I organize my directory as follows:

parent/
  |-- package/
  |-- bin/
  |-- lib/
  `-- tools/

The package/ directory contains the executables from our codes. bin/ has 
mpirun and orted, copied from the Open MPI installation. lib/ includes 
the Open MPI libraries and the Intel libraries. tools/ includes some 
C-shell scripts that launch MPI jobs using the mpirun in bin/.

The parent/ directory is on an NFS share visible to all nodes of the 
cluster. In ~/.bashrc (also shared by all nodes), I clear PATH and 
LD_LIBRARY_PATH so that they do not point to any directory of the 
Open MPI 1.4.3 installation.

First, if I add the bin/ directory above to PATH and lib/ to 
LD_LIBRARY_PATH in ~/.bashrc, our parallel codes (started by the 
C-shell script in tools/) run AS EXPECTED without any problem, so 
everything else must be set up correctly.

Then, to avoid modifying ~/.bashrc or ~/.profile, I instead set bin/ on 
PATH and lib/ on LD_LIBRARY_PATH in the C-shell script under the 
tools/ directory, as:

setenv PATH /path/to/bin:$PATH
setenv LD_LIBRARY_PATH /path/to/lib:$LD_LIBRARY_PATH

When I then start our codes from the C-shell script in tools/, I get the 
message "orted: command not found" from the slave nodes, even though 
orted is in /path/to/bin. So I guess the $PATH variable, or more 
generally the environment variables set in the script, are not passed 
to the slave nodes by mpirun (I use the absolute path to mpirun in the 
script). After checking the Open MPI FAQ, I tried adding 
"--prefix /path/to/parent" to the mpirun command in the C-shell 
script; it still does not work. Does anyone have any hints? Thanks!

I have tried my best to describe the issues; if anything is not clear, 
please let me know. Thanks a lot for your help!

Sincerely,
Yiguang
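For reference, a hedged sketch of the kind of invocation discussed above (the executable name and process count are placeholders; --prefix tells mpirun where bin/ and lib/ live on the remote nodes, and -x re-exports the listed variables to the launched processes):

  /path/to/parent/bin/mpirun --prefix /path/to/parent \
      -x PATH -x LD_LIBRARY_PATH \
      -np 16 /path/to/parent/package/my_solver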



[OMPI users] OpenMPI and Torque

2011-03-21 Thread Randall Svancara
I have a question about using OpenMPI and Torque on stateless nodes.
I have compiled openmpi 1.4.3 with --with-tm=/usr/local
--without-slurm using intel compiler version 11.1.075.

When I run a simple "hello world" mpi program, I am receiving the
following error.

[node164:11193] plm:tm: failed to poll for a spawned daemon, return
status = 17002
 --
 A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
 launch so we are aborting.

 There may be more information reported by the environment (see above).

 This may be because the daemon was unable to find all the needed shared
 libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
 location of the shared libraries on the remote nodes and this will
 automatically be forwarded to the remote nodes.
 --
 --
 mpiexec noticed that the job aborted, but has no info as to the process
 that caused that situation.
 --
 --
 mpiexec was unable to cleanly terminate the daemons on the nodes shown
 below. Additional manual cleanup may be required - please refer to
 the "orte-clean" tool for assistance.
 --
 node163 - daemon did not report back when launched
 node159 - daemon did not report back when launched
 node158 - daemon did not report back when launched
 node157 - daemon did not report back when launched
 node156 - daemon did not report back when launched
 node155 - daemon did not report back when launched
 node154 - daemon did not report back when launched
 node152 - daemon did not report back when launched
 node151 - daemon did not report back when launched
 node150 - daemon did not report back when launched
 node149 - daemon did not report back when launched


But if I include:

-mca plm rsh

The job runs just fine.

I am not sure what the problem is with torque or openmpi that prevents
the process from launching on remote nodes.  I have posted to the
torque list, and someone suggested that limited temporary directory
space may be causing issues.  I have 100MB allocated to /tmp.

Any ideas as to why I am having this problem would be appreciated.


-- 
Randall Svancara
http://knowyourlinux.com/


Re: [OMPI users] Problems with openmpi-1.4.3

2011-03-21 Thread Gustavo Correa
Hi Amos

This form perhaps?
 'export PATH=/opt/openmpi/bin:$PATH' 
You don't want to wipe off the existing path, just add openmpi to it.

Intel also has its own shared libraries, which may be causing trouble.
My guess is that you need to set the Intel environment first by
placing a line more or less like this in your .bashrc/.cshrc file:

source /path/to/intel/bin/ifortvars.sh  (or ifortvars.csh depending on the 
shell you use)

The Intel script will add the Intel bin and lib directories to your environment.

Then on your .bashrc/.cshrc you pre-pend the OpenMPI bin and lib directories 
to your PATH and LD_LIBRARY_PATH:

export PATH=/opt/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH

for bash, or 'setenv PATH /opt/openmpi/bin:$PATH', etc for csh.

I hope this helps,
Gus Correa

On Mar 20, 2011, at 10:43 PM, Amos Leffler wrote:

> Hi,
>  I have been having problems getting openmpi-1.4.3 to work with Linux
> under SUSE 11.3.  I have put the following entries in .bashrc:
>   PATH: /opt/openmpi/bin
>   LD_LIBRARY_PATH /opt/openmpi/lib
>   alias   ifort='opt/intel/bin/ifort'
>   alias   libopen-pal.so.0:=/usr/lib/libopen-pal.so.0
> The file appears to run properly under the configure command:
>  ./configure   CC=gcc   CXX=g++   F77=ifort
> F90=ifort  --prefix=/opt/openmpi
> and using make all install.  However, running an example such as:
>   mpicc hello_c.c -o hello_c, it gives the result:
>   mpicc: error while loading shared libraries:
> libopen-pal.so.0: cannot open shared object file: No such file or
> directory
>   At this point I am stumped and any thoughts would be much
> appreciated.
> 
>   Amos Leffler
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Problems with openmpi-1.4.3

2011-03-21 Thread David Zhang
I don't know if your alias got mapped when mpicc is called.  Try adding
/usr/lib to LD_LIBRARY_PATH?

On Sun, Mar 20, 2011 at 7:43 PM, Amos Leffler  wrote:

> Hi,
>  I have been having problems getting openmpi-1.4.3 to work with Linux
> under SUSE 11.3.  I have put the following entries in .bashrc:
>   PATH: /opt/openmpi/bin
>   LD_LIBRARY_PATH /opt/openmpi/lib
>   alias   ifort='opt/intel/bin/ifort'
>   alias   libopen-pal.so.0:=/usr/lib/libopen-pal.so.0
> The file appears to run properly under the configure command:
>  ./configure   CC=gcc   CXX=g++   F77=ifort
> F90=ifort  --prefix=/opt/openmpi
> and using make all install.  However, running an example such as:
>   mpicc hello_c.c -o hello_c, it gives the result:
>   mpicc: error while loading shared libraries:
> libopen-pal.so.0: cannot open shared object file: No such file or
> directory
>   At this point I am stumped and any thoughts would be much
> appreciated.
>
>   Amos Leffler
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
David Zhang
University of California, San Diego