Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes

2012-03-02 Thread Ralph Castain

On Mar 2, 2012, at 3:23 PM, Yiguang Yan wrote:

> It turns out that the "-x" option should be put on each line of the app file 
> if an app file is used.
> 
> 

> So from tests (a), (b), (c): if I am using an app file, the PATH and 
> LD_LIBRARY_PATH are only passed to the slave node when "-x" is set on each 
> line of the app file, similar to the "--prefix" option.
> 
> Any conclusion? If a bug fix is admitted for the "--prefix" option, I would 
> think this is another bug for the "-x" option.

I don't think so, in this case. I can see places where one might want to pass 
an envar to one app_context, but not all. I fixed the --prefix option on our 
trunk and filed the patch for the 1.5 series - let's hold there for now.
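
For example, a hypothetical app file (paths borrowed from the tests earlier in this thread, content purely illustrative) that forwards the envars to only the first app_context might look like:

-
--prefix /home/yiguang/testdmp881 -x PATH -x LD_LIBRARY_PATH -n 1 -host gulftown foo
--prefix /home/yiguang/testdmp881 -n 1 -host ibnode001 foo
-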

Thanks
Ralph

> 
> Thanks,
> Yiguang
> 
> 
> 
> 
> 




Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes

2012-03-02 Thread Yiguang Yan
It turns out that the "-x" option should be put on each line of the app file if 
an app file is used.
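
For reference, a minimal sketch of an app file with "-x" on each line (paths illustrative, matching the tests below):
>>>
--prefix /home/yiguang/testdmp881 -x PATH -x LD_LIBRARY_PATH -n 1 -host gulftown foo
--prefix /home/yiguang/testdmp881 -x PATH -x LD_LIBRARY_PATH -n 1 -host ibnode001 foo
<<<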

OK, here are the test results on our cluster, in case they are useful to other 
Open MPI users (Open MPI 1.4.3 on my system):

(1) If I run the mpirun command from the command line, as in Jeff's foo test, 
everything works fine.

(2) Now let me start mpirun from a shell script.

First, the foo script contains:
>>>
#!/bin/sh -f

echo $HOSTNAME: PATH : $PATH
echo $HOSTNAME: LD_LIBRARY_PATH : $LD_LIBRARY_PATH
<<<

The testenvars.bash script contains:
>>>
#!/bin/sh -f
#nohup
#
# >---<
adinahome=/home/yiguang/testdmp881
mpirunfile=$adinahome/bin/mpirun
#
# Set envars for mpirun and orted
#
export PATH=/this/is/a/fake/path:$adinahome/bin:$adinahome/tools:$PATH
export LD_LIBRARY_PATH=/this/is/a/fake/libdir:$adinahome/lib:$LD_LIBRARY_PATH
#
#
# run DMP problem
#
mcaprefix="--prefix $adinahome"
mcaenvars="-x PATH -x LD_LIBRARY_PATH"
mcabtlconn="--mca btl openib,sm,self"
#mcaplmbase="--mca plm_base_verbose 100"

# mpirun is under $adinahome/bin

$mpirunfile --host gulftown,ibnode001 foo
<<<

Now if I run testenvars.bash from the command line:
>>>
[yiguang@gulftown testdmp]$ ./testenvars.bash
gulftown: PATH : /home/yiguang/testdmp881/bin:/home/yiguang/testdmp881/bin:/this/is/a/fake/path:/home/yiguang/testdmp881/bin:/home/yiguang/testdmp881/tools:/usr/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/adina/system8.8/tools:/usr/adina/system8.7/tools:/usr/adina/system8.6/tools:/usr/adina/system8.5/tools:/home/yiguang/bin
gulftown: LD_LIBRARY_PATH : /home/yiguang/testdmp881/lib:/home/yiguang/testdmp881/lib:/this/is/a/fake/libdir:/home/yiguang/testdmp881/lib:
ibnode001: PATH : /home/yiguang/testdmp881/bin:/home/yiguang/testdmp881/bin:/usr/bin:/usr/lib64/qt-3.3/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin
ibnode001: LD_LIBRARY_PATH : /home/yiguang/testdmp881/lib:/home/yiguang/testdmp881/lib:
<<<

If, in the testenvars.bash script, I change the line
$mpirunfile --host gulftown,ibnode001 foo
-->
mpirun --prefix $adinahome --host gulftown,ibnode001 foo

then I get the same output as above; as expected, the full path of mpirun and 
--prefix give the same behavior. The unexpected part is that 
/home/yiguang/testdmp881/bin and /home/yiguang/testdmp881/lib are included 
twice here. Why?

Now if I change, in the above testenvars.bash script, the line

$mpirunfile --host gulftown,ibnode001 foo
-->
mpirun --prefix $adinahome $mcaenvars --host gulftown,ibnode001 foo

Then run the script:
>>>
[yiguang@gulftown testdmp]$ ./testenvars.bash
gulftown: PATH : /home/yiguang/testdmp881/bin:/this/is/a/fake/path:/home/yiguang/testdmp881/bin:/home/yiguang/testdmp881/tools:/usr/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/adina/system8.8/tools:/usr/adina/system8.7/tools:/usr/adina/system8.6/tools:/usr/adina/system8.5/tools:/home/yiguang/bin
gulftown: LD_LIBRARY_PATH : /home/yiguang/testdmp881/lib:/this/is/a/fake/libdir:/home/yiguang/testdmp881/lib:
ibnode001: PATH : /home/yiguang/testdmp881/bin:/this/is/a/fake/path:/home/yiguang/testdmp881/bin:/home/yiguang/testdmp881/tools:/usr/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/adina/system8.8/tools:/usr/adina/system8.7/tools:/usr/adina/system8.6/tools:/usr/adina/system8.5/tools:/home/yiguang/bin
ibnode001: LD_LIBRARY_PATH : /home/yiguang/testdmp881/lib:/this/is/a/fake/libdir:/home/yiguang/testdmp881/lib:
<<<
This time, the PATH and LD_LIBRARY_PATH are passed to the slave node, and 
/home/yiguang/testdmp881/bin and /home/yiguang/testdmp881/lib are included 
only once, different from the last test.

So far so good, except for those minor things.

(3) Now I changed to using an app file.

First the scripts: the foo script is as above, and the testenvars-app.bash script 
contains:
>>>
[yiguang@gulftown testdmp]$ cat testenvars-app.bash
#!/bin/sh -f
#nohup
#
# >---<
adinahome=/home/yiguang/testdmp881
mpirunfile=$adinahome/bin/mpirun
#
# Set envars for mpirun and orted
#
export PATH=/this/is/a/fake/path:$adinahome/bin:$adinahome/tools:$PATH
export LD_LIBRARY_PATH=/this/is/a/fake/libdir:$adinahome/lib:$LD_LIBRARY_PATH
#
#
# run DMP problem
#
#mcaprefix="--prefix $adinahome"
mcaenvars="-x PATH -x LD_LIBRARY_PATH"
mcabtlconn="--mca btl openib,sm,self"
#mcaplmbase="--mca plm_base_verbose 100"

$mpirunfile $mcabtlconn --app addmpw-foo-nox
#$mpirunfile $mcaenvars $mcabtlconn --app addmpw-foo-nox
#$mpirunfile $mcabtlconn --app addmpw-foo
<<<

The addmpw-foo-nox app file is:
>>>
[yiguang@gulftown testdmp]$ cat addmpw-foo-nox
--prefix /home/yiguang/testdmp881 -n 1 -host gulftown foo
--prefix /home/yiguang/testdmp881 -n 1 -host ibnode001 foo
<<<
The addmpw-foo app file is:
>>>
[yiguang@gulftown testdmp]$ cat addmpw-foo
--prefix /home/yiguang/testdmp881 -x PATH -x LD_LIBRARY_PATH -n 1 -host gulftown foo
--prefix 

Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes

2012-03-02 Thread Jeffrey Squyres
On Mar 2, 2012, at 2:50 PM, Ralph Castain wrote:

>> Ralph and I just had a phone conversation about this.  We consider it a bug 
>> -- you shouldn't need to put --prefix in the app file.  Meaning: --prefix is 
>> currently being ignored if you use an app file (and therefore you have to 
>> put --prefix in the app file).  We're going to fix that.
> 
> Updated in the developer's trunk. I don't think we'll bring this to the 1.5 
> branch, though I leave that up to Jeff.


Actually, I think we should.  This way, the unexpected behavior of --prefix / 
absolute mpirun path name being dropped won't be in the entire 1.6 series.

Ralph -- can you CMR this?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes

2012-03-02 Thread Ralph Castain

On Mar 2, 2012, at 10:50 AM, Jeffrey Squyres wrote:

> On Mar 2, 2012, at 9:48 AM, Yiguang Yan wrote:
> 
>> (All with the same test script test.bash I post in my previous emails, so 
>> run with app file fed to mpirun command.)
>> 
>> (1) If I put the --prefix in the app file, on each line of it, it works fine 
>> as Jeff said.
>> 
>> (2) Since in the manual, it is said that the full path of mpirun is the same 
>> as setting "--prefix". However, with app file, 
>> this is not the case. Without "--prefix" on each line of the app file, the 
>> full path of mpirun does not work.
> 
> Ralph and I just had a phone conversation about this.  We consider it a bug 
> -- you shouldn't need to put --prefix in the app file.  Meaning: --prefix is 
> currently being ignored if you use an app file (and therefore you have to put 
> --prefix in the app file).  We're going to fix that.

Updated in the developer's trunk. I don't think we'll bring this to the 1.5 
branch, though I leave that up to Jeff.




Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes

2012-03-02 Thread Jeffrey Squyres
On Mar 2, 2012, at 9:48 AM, Yiguang Yan wrote:

> (All with the same test script test.bash I post in my previous emails, so run 
> with app file fed to mpirun command.)
> 
> (1) If I put the --prefix in the app file, on each line of it, it works fine 
> as Jeff said.
> 
> (2) Since in the manual, it is said that the full path of mpirun is the same 
> as setting "--prefix". However, with app file, 
> this is not the case. Without "--prefix" on each line of the app file, the 
> full path of mpirun does not work.

Ralph and I just had a phone conversation about this.  We consider it a bug -- 
you shouldn't need to put --prefix in the app file.  Meaning: --prefix is 
currently being ignored if you use an app file (and therefore you have to put 
--prefix in the app file).  We're going to fix that.

> (3) With "--prefix $adinahome" set on each line of the app file, it is 
> exclusively put, on each node, the 
> $adinahome/bin into the PATH, and $adinahome/lib into the LD_LIBRARY_PATH(not 
> the $adinahome/lib64 as said 
> in mpirun manual(v1.4.x)).

Correct.

> The envars $PATH and $LD_LIBRARY_PATH set in the test.bash script only affect the 
> envars on the submitting node (gulftown in my case). No $PATH or $LD_LIBRARY_PATH 
> is passed to the slave nodes even if I use "-x PATH -x LD_LIBRARY_PATH", either fed 
> to mpirun or put on each line of the app file. I am not sure if this is intended, 
> since "--prefix" overwrites the effect of the "-x" option; this is different from 
> what I see in the mpirun man page.

Hmm.  Let's do a simple test here...

-
[9:38] svbu-mpi:~ % cat foo
#!/bin/bash

echo test_env_var: $test_env_var
[9:38] svbu-mpi:~ % ./foo
test_env_var:
[9:38] svbu-mpi:~ % mpirun --host svbu-mpi001,svbu-mpi002 ~/foo
test_env_var:
test_env_var:
[9:38] svbu-mpi:~ % setenv test_env_var THIS-IS-TEST-ENV-VAR
[9:39] svbu-mpi:~ % ./foo
test_env_var: THIS-IS-TEST-ENV-VAR
[9:39] svbu-mpi:~ % mpirun --host svbu-mpi001,svbu-mpi002 ~/foo
test_env_var:
test_env_var:
[9:39] svbu-mpi:~ % mpirun --host svbu-mpi001,svbu-mpi002 -x test_env_var ~/foo
test_env_var: THIS-IS-TEST-ENV-VAR
test_env_var: THIS-IS-TEST-ENV-VAR
[9:39] svbu-mpi:~ % 
-

So that appears to work.  Let's try with PATH.

-
[9:41] svbu-mpi:~ % cat foo
#!/bin/bash -f

echo PATH: $PATH
[9:41] svbu-mpi:~ % ./foo
PATH: 
/home/jsquyres/bogus/bin:/users/jsquyres/local/bin:/home/jsquyres/bogus/bin:/users/jsquyres/local/bin:/var/opt/intel/composerxe-2011.1.107/bin:/opt/autotools/ac268-am1113-lt242/bin:/cm/shared/apps/valgrind/3.7.0/bin:/cm/shared/apps/mercurial/2.0.2/bin:/cm/shared/apps/gcc/4.4.6/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/sbin:/usr/sbin:/cm/shared/apps/slurm/2.2.4/bin:/cm/shared/apps/slurm/2.2.4/sbin:/cm/shared/apps/proxy/bin:/cm/shared/apps/subversion/1.7.2/bin:/sbin:/usr/sbin

# That's ok. Now let's try with mpirun.

[9:41] svbu-mpi:~ % mpirun --host svbu-mpi001,svbu-mpi002 ~/foo
PATH: 
/home/jsquyres/bogus/bin:/home/jsquyres/bogus/bin:/home/jsquyres/bogus/bin:/users/jsquyres/local/bin:/var/opt/intel/composerxe-2011.1.107/bin:/opt/autotools/ac268-am1113-lt242/bin:/cm/shared/apps/valgrind/3.7.0/bin:/cm/shared/apps/mercurial/2.0.2/bin:/cm/shared/apps/gcc/4.4.6/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/sbin:/usr/sbin:/cm/shared/apps/slurm/2.2.4/bin:/cm/shared/apps/slurm/2.2.4/sbin:/cm/shared/apps/proxy/bin:/cm/shared/apps/subversion/1.7.2/bin
PATH: 
/home/jsquyres/bogus/bin:/home/jsquyres/bogus/bin:/home/jsquyres/bogus/bin:/users/jsquyres/local/bin:/var/opt/intel/composerxe-2011.1.107/bin:/opt/autotools/ac268-am1113-lt242/bin:/cm/shared/apps/valgrind/3.7.0/bin:/cm/shared/apps/mercurial/2.0.2/bin:/cm/shared/apps/gcc/4.4.6/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/sbin:/usr/sbin:/cm/shared/apps/slurm/2.2.4/bin:/cm/shared/apps/slurm/2.2.4/sbin:/cm/shared/apps/proxy/bin:/cm/shared/apps/subversion/1.7.2/bin

# These look ok (my remote path is a bit longer than my local path)
# Now let's add a bogus entry to the local path

[9:41] svbu-mpi:~ % set path = ($path /this/is/a/fake/path)
[9:41] svbu-mpi:~ % ./foo
PATH: 
/home/jsquyres/bogus/bin:/users/jsquyres/local/bin:/home/jsquyres/bogus/bin:/users/jsquyres/local/bin:/var/opt/intel/composerxe-2011.1.107/bin:/opt/autotools/ac268-am1113-lt242/bin:/cm/shared/apps/valgrind/3.7.0/bin:/cm/shared/apps/mercurial/2.0.2/bin:/cm/shared/apps/gcc/4.4.6/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/sbin:/usr/sbin:/cm/shared/apps/slurm/2.2.4/bin:/cm/shared/apps/slurm/2.2.4/sbin:/cm/shared/apps/proxy/bin:/cm/shared/apps/subversion/1.7.2/bin:/sbin:/usr/sbin:/this/is/a/fake/path

# Good; the bogus entry is there.  Now try mpirun

[9:41] svbu-mpi:~ % mpirun --host svbu-mpi001,svbu-mpi002 ~/foo
PATH: 

Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes

2012-03-02 Thread Ralph Castain
We'll take a look at the prefix behavior. As to the btl, you can always just 
force it: for example, -mca btl sm,self,openib would restrict it to shared 
memory and IB.
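
As a sketch (hosts and program illustrative, and assuming your build exposes the btl_base_verbose MCA parameter), you can also raise the BTL verbosity to see which BTLs are actually selected:

-
mpirun --mca btl openib,sm,self --mca btl_base_verbose 50 --host gulftown,ibnode001 ./ring_c
-

With tcp left out of the btl list, the job should fail rather than silently fall back to TCP if InfiniBand cannot be used between a pair of processes.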


On Mar 2, 2012, at 7:48 AM, Yiguang Yan wrote:

> Hi Jeff, Ralph--
> 
> Please let me follow the thread, here are what I observed:
> 
> (All with the same test script test.bash I post in my previous emails, so run 
> with app file fed to mpirun command.)
> 
> (1) If I put the --prefix in the app file, on each line of it, it works fine 
> as Jeff said.
> 
> (2) Since in the manual, it is said that the full path of mpirun is the same 
> as setting "--prefix". However, with app file, 
> this is not the case. Without "--prefix" on each line of the app file, the 
> full path of mpirun does not work.
> 
> (3) With "--prefix $adinahome" set on each line of the app file, it is 
> exclusively put, on each node, the 
> $adinahome/bin into the PATH, and $adinahome/lib into the LD_LIBRARY_PATH(not 
> the $adinahome/lib64 as said 
> in mpirun manual(v1.4.x)). The envars $PATH and $LD_LIBARARY_PATH set in 
> test.bash script only affect the 
> envars on the submitting node(gulftown in my case). No $PATH or 
> $LD_LIBRARY_PATH is passed to slave nodes 
> even if I use "-x PATH -x LD_LIBRARY_PATH", either fed to mpirun or put on 
> each line of the app file. I am not sure 
> if this is intended, since "--prefix" overwrite the effect of "-x" option, 
> this is different from what I see from the mpirun 
> man page.
> 
> I have another question about the btl used for communication. I noticed that 
> rsh is using tcp for the connection. I understand that tcp may be used for the 
> initial connection, but how can I know that the openib (InfiniBand) btl is used 
> for my data communication? Is there an explicit way?
> 
> Thanks,
> Yiguang
> 




Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes

2012-03-02 Thread Yiguang Yan
Hi Jeff, Ralph--

Please let me follow the thread, here are what I observed:

(All with the same test script test.bash I post in my previous emails, so run 
with app file fed to mpirun command.)

(1) If I put the --prefix in the app file, on each line of it, it works fine as 
Jeff said.

(2) Since in the manual, it is said that the full path of mpirun is the same as 
setting "--prefix". However, with app file, 
this is not the case. Without "--prefix" on each line of the app file, the full 
path of mpirun does not work.

(3) With "--prefix $adinahome" set on each line of the app file, it is 
exclusively put, on each node, the 
$adinahome/bin into the PATH, and $adinahome/lib into the LD_LIBRARY_PATH(not 
the $adinahome/lib64 as said 
in mpirun manual(v1.4.x)). The envars $PATH and $LD_LIBARARY_PATH set in 
test.bash script only affect the 
envars on the submitting node(gulftown in my case). No $PATH or 
$LD_LIBRARY_PATH is passed to slave nodes 
even if I use "-x PATH -x LD_LIBRARY_PATH", either fed to mpirun or put on each 
line of the app file. I am not sure 
if this is intended, since "--prefix" overwrite the effect of "-x" option, this 
is different from what I see from the mpirun 
man page.

I have another question about the btl used for communication. I noticed that 
rsh is using tcp for the connection. I understand that tcp may be used for the 
initial connection, but how can I know that the openib (InfiniBand) btl is used 
for my data communication? Is there an explicit way?

Thanks,
Yiguang



Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes

2012-03-01 Thread Ralph Castain
I don't know - I didn't write the app file code, and I've never seen anything 
defining its behavior. So I guess you could say it is intended - or not! :-/


On Mar 1, 2012, at 2:53 PM, Jeffrey Squyres wrote:

> Actually, I should say that I discovered that if you put --prefix on each 
> line of the app context file, then the first case (running the app context 
> file) works fine; it adheres to the --prefix behavior.
> 
> Ralph: is this intended behavior?  (I don't know if I have an opinion either 
> way)
> 
> 
> On Mar 1, 2012, at 4:51 PM, Jeffrey Squyres wrote:
> 
>> I see the problem.
>> 
>> It looks like the use of the app context file is triggering different 
>> behavior, and that behavior is erasing the use of --prefix.  If I replace 
>> the app context file with a complete command line, it works and the --prefix 
>> behavior is observed.
>> 
>> Specifically:
>> 
>> $mpirunfile $mcaparams --app addmpw-hostname
>> 
>> ^^ This one seems to ignore --prefix behavior.
>> 
>> $mpirunfile $mcaparams --host svbu-mpi,svbu-mpi001 -np 2 hostname
>> $mpirunfile $mcaparams --host svbu-mpi -np 1 hostname : --host svbu-mpi001 
>> -np 1 hostname
>> 
>> ^^ These two seem to adhere to the proper --prefix behavior.
>> 
>> Ralph -- can you have a look?
>> 
>> 
>> 
>> 
>> On Mar 1, 2012, at 2:59 PM, Yiguang Yan wrote:
>> 
>>> Hi Ralph,
>>> 
>>> Thanks, here is what I did as suggested by Jeff:
>>> 
 What did this command line look like? Can you provide the configure line 
 as well? 
>>> 
>>> As in my previous post, the script as following:
>>> 
>>> (1) debug messages:
>> 
>>> yiguang@gulftown testdmp]$ ./test.bash
>>> [gulftown:28340] mca: base: components_open: Looking for plm components
>>> [gulftown:28340] mca: base: components_open: opening plm components
>>> [gulftown:28340] mca: base: components_open: found loaded component rsh
>>> [gulftown:28340] mca: base: components_open: component rsh has no register 
>>> function
>>> [gulftown:28340] mca: base: components_open: component rsh open function 
>>> successful
>>> [gulftown:28340] mca: base: components_open: found loaded component slurm
>>> [gulftown:28340] mca: base: components_open: component slurm has no 
>>> register function
>>> [gulftown:28340] mca: base: components_open: component slurm open function 
>>> successful
>>> [gulftown:28340] mca: base: components_open: found loaded component tm
>>> [gulftown:28340] mca: base: components_open: component tm has no register 
>>> function
>>> [gulftown:28340] mca: base: components_open: component tm open function 
>>> successful
>>> [gulftown:28340] mca:base:select: Auto-selecting plm components
>>> [gulftown:28340] mca:base:select:(  plm) Querying component [rsh]
>>> [gulftown:28340] mca:base:select:(  plm) Query of component [rsh] set 
>>> priority to 10
>>> [gulftown:28340] mca:base:select:(  plm) Querying component [slurm]
>>> [gulftown:28340] mca:base:select:(  plm) Skipping component [slurm]. Query 
>>> failed to return a module
>>> [gulftown:28340] mca:base:select:(  plm) Querying component [tm]
>>> [gulftown:28340] mca:base:select:(  plm) Skipping component [tm]. Query 
>>> failed to return a module
>>> [gulftown:28340] mca:base:select:(  plm) Selected component [rsh]
>>> [gulftown:28340] mca: base: close: component slurm closed
>>> [gulftown:28340] mca: base: close: unloading component slurm
>>> [gulftown:28340] mca: base: close: component tm closed
>>> [gulftown:28340] mca: base: close: unloading component tm
>>> [gulftown:28340] plm:base:set_hnp_name: initial bias 28340 nodename hash 
>>> 3546479048
>>> [gulftown:28340] plm:base:set_hnp_name: final jobfam 17438
>>> [gulftown:28340] [[17438,0],0] plm:base:receive start comm
>>> [gulftown:28340] [[17438,0],0] plm:rsh: setting up job [17438,1]
>>> [gulftown:28340] [[17438,0],0] plm:base:setup_job for job [17438,1]
>>> [gulftown:28340] [[17438,0],0] plm:rsh: local shell: 0 (bash)
>>> [gulftown:28340] [[17438,0],0] plm:rsh: assuming same remote shell as local 
>>> shell
>>> [gulftown:28340] [[17438,0],0] plm:rsh: remote shell: 0 (bash)
>>> [gulftown:28340] [[17438,0],0] plm:rsh: final template argv:
>>>  /usr/bin/rsh   orted --daemonize -mca ess env -mca 
>>> orte_ess_jobid 1142816768 -mca 
>>> orte_ess_vpid  -mca orte_ess_num_procs 4 --hnp-uri 
>>> "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159"
>>>  -
>>> -mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca 
>>> btl openib,sm,self --mca 
>>> orte_tmpdir_base /tmp --mca plm_base_verbose 100
>>> [gulftown:28340] [[17438,0],0] plm:rsh:launch daemon already exists on node 
>>> gulftown
>>> [gulftown:28340] [[17438,0],0] plm:rsh: launching on node ibnode001
>>> [gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon 
>>> [[17438,0],1]
>>> [gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) 
>>> [/usr/bin/rsh ibnode001  orted --daemonize -mca 
>>> ess env -mca orte_ess_jobid 1142816768 

Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes

2012-03-01 Thread Yiguang Yan

> Actually, I should say that I discovered that if you put --prefix on each 
> line of the app context file, then the first
> case (running the app context file) works fine; it adheres to the --prefix 
> behavior. 

Yes, I confirmed this on our cluster. It works with --prefix on each line of 
the app file.


Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes

2012-03-01 Thread Jeffrey Squyres
Actually, I should say that I discovered that if you put --prefix on each line 
of the app context file, then the first case (running the app context file) 
works fine; it adheres to the --prefix behavior.

Ralph: is this intended behavior?  (I don't know if I have an opinion either 
way)


On Mar 1, 2012, at 4:51 PM, Jeffrey Squyres wrote:

> I see the problem.
> 
> It looks like the use of the app context file is triggering different 
> behavior, and that behavior is erasing the use of --prefix.  If I replace the 
> app context file with a complete command line, it works and the --prefix 
> behavior is observed.
> 
> Specifically:
> 
> $mpirunfile $mcaparams --app addmpw-hostname
> 
> ^^ This one seems to ignore --prefix behavior.
> 
> $mpirunfile $mcaparams --host svbu-mpi,svbu-mpi001 -np 2 hostname
> $mpirunfile $mcaparams --host svbu-mpi -np 1 hostname : --host svbu-mpi001 
> -np 1 hostname
> 
> ^^ These two seem to adhere to the proper --prefix behavior.
> 
> Ralph -- can you have a look?
> 
> 
> 
> 
> On Mar 1, 2012, at 2:59 PM, Yiguang Yan wrote:
> 
>> Hi Ralph,
>> 
>> Thanks, here is what I did as suggested by Jeff:
>> 
>>> What did this command line look like? Can you provide the configure line as 
>>> well? 
>> 
>> As in my previous post, the script as following:
>> 
>> (1) debug messages:
> 
>> yiguang@gulftown testdmp]$ ./test.bash
>> [gulftown:28340] mca: base: components_open: Looking for plm components
>> [gulftown:28340] mca: base: components_open: opening plm components
>> [gulftown:28340] mca: base: components_open: found loaded component rsh
>> [gulftown:28340] mca: base: components_open: component rsh has no register 
>> function
>> [gulftown:28340] mca: base: components_open: component rsh open function 
>> successful
>> [gulftown:28340] mca: base: components_open: found loaded component slurm
>> [gulftown:28340] mca: base: components_open: component slurm has no register 
>> function
>> [gulftown:28340] mca: base: components_open: component slurm open function 
>> successful
>> [gulftown:28340] mca: base: components_open: found loaded component tm
>> [gulftown:28340] mca: base: components_open: component tm has no register 
>> function
>> [gulftown:28340] mca: base: components_open: component tm open function 
>> successful
>> [gulftown:28340] mca:base:select: Auto-selecting plm components
>> [gulftown:28340] mca:base:select:(  plm) Querying component [rsh]
>> [gulftown:28340] mca:base:select:(  plm) Query of component [rsh] set 
>> priority to 10
>> [gulftown:28340] mca:base:select:(  plm) Querying component [slurm]
>> [gulftown:28340] mca:base:select:(  plm) Skipping component [slurm]. Query 
>> failed to return a module
>> [gulftown:28340] mca:base:select:(  plm) Querying component [tm]
>> [gulftown:28340] mca:base:select:(  plm) Skipping component [tm]. Query 
>> failed to return a module
>> [gulftown:28340] mca:base:select:(  plm) Selected component [rsh]
>> [gulftown:28340] mca: base: close: component slurm closed
>> [gulftown:28340] mca: base: close: unloading component slurm
>> [gulftown:28340] mca: base: close: component tm closed
>> [gulftown:28340] mca: base: close: unloading component tm
>> [gulftown:28340] plm:base:set_hnp_name: initial bias 28340 nodename hash 
>> 3546479048
>> [gulftown:28340] plm:base:set_hnp_name: final jobfam 17438
>> [gulftown:28340] [[17438,0],0] plm:base:receive start comm
>> [gulftown:28340] [[17438,0],0] plm:rsh: setting up job [17438,1]
>> [gulftown:28340] [[17438,0],0] plm:base:setup_job for job [17438,1]
>> [gulftown:28340] [[17438,0],0] plm:rsh: local shell: 0 (bash)
>> [gulftown:28340] [[17438,0],0] plm:rsh: assuming same remote shell as local 
>> shell
>> [gulftown:28340] [[17438,0],0] plm:rsh: remote shell: 0 (bash)
>> [gulftown:28340] [[17438,0],0] plm:rsh: final template argv:
>>   /usr/bin/rsh   orted --daemonize -mca ess env -mca 
>> orte_ess_jobid 1142816768 -mca 
>> orte_ess_vpid  -mca orte_ess_num_procs 4 --hnp-uri 
>> "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159"
>>  -
>> -mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca 
>> btl openib,sm,self --mca 
>> orte_tmpdir_base /tmp --mca plm_base_verbose 100
>> [gulftown:28340] [[17438,0],0] plm:rsh:launch daemon already exists on node 
>> gulftown
>> [gulftown:28340] [[17438,0],0] plm:rsh: launching on node ibnode001
>> [gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon 
>> [[17438,0],1]
>> [gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) 
>> [/usr/bin/rsh ibnode001  orted --daemonize -mca 
>> ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid 1 -mca 
>> orte_ess_num_procs 4 --hnp-uri 
>> "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159"
>>  -
>> -mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca 
>> btl openib,sm,self --mca 
>> orte_tmpdir_base /tmp --mca 

Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes

2012-03-01 Thread Yiguang Yan
Hi Ralph,

Thanks, here is what I did as suggested by Jeff:

> What did this command line look like? Can you provide the configure line as 
> well? 

As in my previous post, the script is as follows:

(1) debug messages:
>>>
yiguang@gulftown testdmp]$ ./test.bash
[gulftown:28340] mca: base: components_open: Looking for plm components
[gulftown:28340] mca: base: components_open: opening plm components
[gulftown:28340] mca: base: components_open: found loaded component rsh
[gulftown:28340] mca: base: components_open: component rsh has no register 
function
[gulftown:28340] mca: base: components_open: component rsh open function 
successful
[gulftown:28340] mca: base: components_open: found loaded component slurm
[gulftown:28340] mca: base: components_open: component slurm has no register 
function
[gulftown:28340] mca: base: components_open: component slurm open function 
successful
[gulftown:28340] mca: base: components_open: found loaded component tm
[gulftown:28340] mca: base: components_open: component tm has no register 
function
[gulftown:28340] mca: base: components_open: component tm open function 
successful
[gulftown:28340] mca:base:select: Auto-selecting plm components
[gulftown:28340] mca:base:select:(  plm) Querying component [rsh]
[gulftown:28340] mca:base:select:(  plm) Query of component [rsh] set priority 
to 10
[gulftown:28340] mca:base:select:(  plm) Querying component [slurm]
[gulftown:28340] mca:base:select:(  plm) Skipping component [slurm]. Query 
failed to return a module
[gulftown:28340] mca:base:select:(  plm) Querying component [tm]
[gulftown:28340] mca:base:select:(  plm) Skipping component [tm]. Query failed 
to return a module
[gulftown:28340] mca:base:select:(  plm) Selected component [rsh]
[gulftown:28340] mca: base: close: component slurm closed
[gulftown:28340] mca: base: close: unloading component slurm
[gulftown:28340] mca: base: close: component tm closed
[gulftown:28340] mca: base: close: unloading component tm
[gulftown:28340] plm:base:set_hnp_name: initial bias 28340 nodename hash 
3546479048
[gulftown:28340] plm:base:set_hnp_name: final jobfam 17438
[gulftown:28340] [[17438,0],0] plm:base:receive start comm
[gulftown:28340] [[17438,0],0] plm:rsh: setting up job [17438,1]
[gulftown:28340] [[17438,0],0] plm:base:setup_job for job [17438,1]
[gulftown:28340] [[17438,0],0] plm:rsh: local shell: 0 (bash)
[gulftown:28340] [[17438,0],0] plm:rsh: assuming same remote shell as local 
shell
[gulftown:28340] [[17438,0],0] plm:rsh: remote shell: 0 (bash)
[gulftown:28340] [[17438,0],0] plm:rsh: final template argv:
/usr/bin/rsh   orted --daemonize -mca ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid  -mca orte_ess_num_procs 4 --hnp-uri "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca plm_base_verbose 100
[gulftown:28340] [[17438,0],0] plm:rsh:launch daemon already exists on node gulftown
[gulftown:28340] [[17438,0],0] plm:rsh: launching on node ibnode001
[gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon [[17438,0],1]
[gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) [/usr/bin/rsh ibnode001  orted --daemonize -mca ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 4 --hnp-uri "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca plm_base_verbose 100]
bash: orted: command not found
[gulftown:28340] [[17438,0],0] plm:rsh: launching on node ibnode002
[gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon [[17438,0],2]
[gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) [/usr/bin/rsh ibnode002  orted --daemonize -mca ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 4 --hnp-uri "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca plm_base_verbose 100]
bash: orted: command not found
[gulftown:28340] [[17438,0],0] plm:rsh: launching on node ibnode003
[gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) [/usr/bin/rsh ibnode003  orted --daemonize -mca ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid 3 -mca orte_ess_num_procs 4 --hnp-uri "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca plm_base_verbose 100]
[gulftown:28340] [[17438,0],0] 

Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes

2012-03-01 Thread Ralph Castain
What did this command line look like? Can you provide the configure line as 
well?

On Mar 1, 2012, at 12:46 PM, Yiguang Yan wrote:

> Hi Jeff,
> 
> Here I made a developer build, and then got the following message 
> with plm_base_verbose:
> 
 
> [gulftown:28340] mca: base: components_open: Looking for plm 
> components
> [gulftown:28340] mca: base: components_open: opening plm 
> components
> [gulftown:28340] mca: base: components_open: found loaded 
> component rsh
> [gulftown:28340] mca: base: components_open: component rsh 
> has no register function
> [gulftown:28340] mca: base: components_open: component rsh 
> open function successful
> [gulftown:28340] mca: base: components_open: found loaded 
> component slurm
> [gulftown:28340] mca: base: components_open: component slurm 
> has no register function
> [gulftown:28340] mca: base: components_open: component slurm 
> open function successful
> [gulftown:28340] mca: base: components_open: found loaded 
> component tm
> [gulftown:28340] mca: base: components_open: component tm 
> has no register function
> [gulftown:28340] mca: base: components_open: component tm 
> open function successful
> [gulftown:28340] mca:base:select: Auto-selecting plm components
> [gulftown:28340] mca:base:select:(  plm) Querying component [rsh]
> [gulftown:28340] mca:base:select:(  plm) Query of component [rsh] 
> set priority to 10
> [gulftown:28340] mca:base:select:(  plm) Querying component 
> [slurm]
> [gulftown:28340] mca:base:select:(  plm) Skipping component 
> [slurm]. Query failed to return a module
> [gulftown:28340] mca:base:select:(  plm) Querying component [tm]
> [gulftown:28340] mca:base:select:(  plm) Skipping component [tm]. 
> Query failed to return a module
> [gulftown:28340] mca:base:select:(  plm) Selected component [rsh]
> [gulftown:28340] mca: base: close: component slurm closed
> [gulftown:28340] mca: base: close: unloading component slurm
> [gulftown:28340] mca: base: close: component tm closed
> [gulftown:28340] mca: base: close: unloading component tm
> [gulftown:28340] plm:base:set_hnp_name: initial bias 28340 
> nodename hash 3546479048
> [gulftown:28340] plm:base:set_hnp_name: final jobfam 17438
> [gulftown:28340] [[17438,0],0] plm:base:receive start comm
> [gulftown:28340] [[17438,0],0] plm:rsh: setting up job [17438,1]
> [gulftown:28340] [[17438,0],0] plm:base:setup_job for job [17438,1]
> [gulftown:28340] [[17438,0],0] plm:rsh: local shell: 0 (bash)
> [gulftown:28340] [[17438,0],0] plm:rsh: assuming same remote 
> shell as local shell
> [gulftown:28340] [[17438,0],0] plm:rsh: remote shell: 0 (bash)
> [gulftown:28340] [[17438,0],0] plm:rsh: final template argv:
>/usr/bin/rsh   orted --daemonize -mca ess env -
> mca orte_ess_jobid 1142816768 -mca orte_ess_vpid  -
> mca orte_ess_num_procs 4 --hnp-uri 
> "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;t
> cp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca 
> plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 
> 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca 
> plm_base_verbose 100
> [gulftown:28340] [[17438,0],0] plm:rsh:launch daemon already 
> exists on node gulftown
> [gulftown:28340] [[17438,0],0] plm:rsh: launching on node 
> ibnode001
> [gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon 
> [[17438,0],1]
> [gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) 
> [/usr/bin/rsh ibnode001  orted --daemonize -mca ess env -mca 
> orte_ess_jobid 1142816768 -mca orte_ess_vpid 1 -mca 
> orte_ess_num_procs 4 --hnp-uri 
> "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;t
> cp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca 
> plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 
> 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca 
> plm_base_verbose 100]
> bash: orted: command not found
> [gulftown:28340] [[17438,0],0] plm:rsh: launching on node 
> ibnode002
> [gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon 
> [[17438,0],2]
> [gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) 
> [/usr/bin/rsh ibnode002  orted --daemonize -mca ess env -mca 
> orte_ess_jobid 1142816768 -mca orte_ess_vpid 2 -mca 
> orte_ess_num_procs 4 --hnp-uri 
> "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;t
> cp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca 
> plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 
> 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca 
> plm_base_verbose 100]
> bash: orted: command not found
> [gulftown:28340] [[17438,0],0] plm:rsh: launching on node 
> ibnode003
> [gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) 
> [/usr/bin/rsh ibnode003  orted --daemonize -mca ess env -mca 
> orte_ess_jobid 1142816768 -mca orte_ess_vpid 3 -mca 
> orte_ess_num_procs 4 --hnp-uri 
> "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;t
> cp://172.23.10.1:43159;tcp://172.33.10.1:43159" 

Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes

2012-03-01 Thread Yiguang Yan
Hi Jeff,

Here I made a developer build, and then got the following message 
with plm_base_verbose:

>>>
[gulftown:28340] mca: base: components_open: Looking for plm 
components
[gulftown:28340] mca: base: components_open: opening plm 
components
[gulftown:28340] mca: base: components_open: found loaded 
component rsh
[gulftown:28340] mca: base: components_open: component rsh 
has no register function
[gulftown:28340] mca: base: components_open: component rsh 
open function successful
[gulftown:28340] mca: base: components_open: found loaded 
component slurm
[gulftown:28340] mca: base: components_open: component slurm 
has no register function
[gulftown:28340] mca: base: components_open: component slurm 
open function successful
[gulftown:28340] mca: base: components_open: found loaded 
component tm
[gulftown:28340] mca: base: components_open: component tm 
has no register function
[gulftown:28340] mca: base: components_open: component tm 
open function successful
[gulftown:28340] mca:base:select: Auto-selecting plm components
[gulftown:28340] mca:base:select:(  plm) Querying component [rsh]
[gulftown:28340] mca:base:select:(  plm) Query of component [rsh] 
set priority to 10
[gulftown:28340] mca:base:select:(  plm) Querying component 
[slurm]
[gulftown:28340] mca:base:select:(  plm) Skipping component 
[slurm]. Query failed to return a module
[gulftown:28340] mca:base:select:(  plm) Querying component [tm]
[gulftown:28340] mca:base:select:(  plm) Skipping component [tm]. 
Query failed to return a module
[gulftown:28340] mca:base:select:(  plm) Selected component [rsh]
[gulftown:28340] mca: base: close: component slurm closed
[gulftown:28340] mca: base: close: unloading component slurm
[gulftown:28340] mca: base: close: component tm closed
[gulftown:28340] mca: base: close: unloading component tm
[gulftown:28340] plm:base:set_hnp_name: initial bias 28340 
nodename hash 3546479048
[gulftown:28340] plm:base:set_hnp_name: final jobfam 17438
[gulftown:28340] [[17438,0],0] plm:base:receive start comm
[gulftown:28340] [[17438,0],0] plm:rsh: setting up job [17438,1]
[gulftown:28340] [[17438,0],0] plm:base:setup_job for job [17438,1]
[gulftown:28340] [[17438,0],0] plm:rsh: local shell: 0 (bash)
[gulftown:28340] [[17438,0],0] plm:rsh: assuming same remote 
shell as local shell
[gulftown:28340] [[17438,0],0] plm:rsh: remote shell: 0 (bash)
[gulftown:28340] [[17438,0],0] plm:rsh: final template argv:
/usr/bin/rsh   orted --daemonize -mca ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid  -mca orte_ess_num_procs 4 --hnp-uri "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca plm_base_verbose 100
[gulftown:28340] [[17438,0],0] plm:rsh:launch daemon already exists on node gulftown
[gulftown:28340] [[17438,0],0] plm:rsh: launching on node ibnode001
[gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon [[17438,0],1]
[gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) [/usr/bin/rsh ibnode001  orted --daemonize -mca ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 4 --hnp-uri "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca plm_base_verbose 100]
bash: orted: command not found
[gulftown:28340] [[17438,0],0] plm:rsh: launching on node ibnode002
[gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon [[17438,0],2]
[gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) [/usr/bin/rsh ibnode002  orted --daemonize -mca ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 4 --hnp-uri "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca plm_base_verbose 100]
bash: orted: command not found
[gulftown:28340] [[17438,0],0] plm:rsh: launching on node ibnode003
[gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) [/usr/bin/rsh ibnode003  orted --daemonize -mca ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid 3 -mca orte_ess_num_procs 4 --hnp-uri "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca plm_base_verbose 100]
[gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon [[17438,0],3]
bash: orted: command not found
[gulftown:28340] [[17438,0],0] plm:base:daemon_callback
<<<


It 

Re: [OMPI users] orted daemon no found! --- environment not passed to slave nodes

2012-02-29 Thread Jeffrey Squyres
Gah.  I didn't realize that my 1.4.x build was a *developer* build.  
*Developer* builds give a *lot* more detail with plm_base_verbose=100 
(including the specific rsh command being used).  You obviously didn't get that 
output because you don't have a developer build.  :-\

Just for reference, here's what plm_base_verbose=100 tells me for running an 
orted on a remote node, when I use the --prefix option to mpirun (I'm a tcsh 
user, so the syntax below will be a little different than what is running in 
your environment):

-
[svbu-mpi:28527] [[20181,0],0] plm:rsh: executing: (//usr/bin/ssh) 
[/usr/bin/ssh svbu-mpi001  set path = ( /home/jsquyres/bogus/bin $path ) ; if ( 
$?LD_LIBRARY_PATH == 1 ) set OMPI_have_llp ; if ( $?LD_LIBRARY_PATH == 0 ) 
setenv LD_LIBRARY_PATH /home/jsquyres/bogus/lib ; if ( $?OMPI_have_llp == 1 ) 
setenv LD_LIBRARY_PATH /home/jsquyres/bogus/lib:$LD_LIBRARY_PATH ;  
/home/jsquyres/bogus/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 
1322582016 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri 
"1322582016.0;tcp://172.29.218.140:34815;tcp://10.148.255.1:34815" --mca 
plm_base_verbose 100]
-

Ok, a few options here:

1. You can get a developer build if you use the --enable-debug option to 
configure.  Then plm_base_verbose=100 will give a lot more info.  Remember, the 
goal here is to see what's going wrong -- not to depend on having a developer 
build around.
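
A sketch of such a build (install prefix illustrative):

-
./configure --prefix=$HOME/openmpi-1.4-debug --enable-debug
make -j 4 all
make install
-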

2. If that isn't workable, make an "orted" in your default path somewhere 
that's a short script:

-
:
echo ===environment===
env | sort
echo ===environment end===
sleep 1000
-
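
As one hypothetical way to put that in place (assuming ~/bin is on the default, non-interactive remote PATH):

-
cat > ~/bin/orted <<'EOF'
:
echo ===environment===
env | sort
echo ===environment end===
sleep 1000
EOF
chmod +x ~/bin/orted
-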

Then when you "mpirun", do a "ps" to see exactly what was executed on the node 
where mpirun was invoked and the node where orted is supposed to be running.  
It's not quite as descriptive as seeing the plm_base_verbose output because we 
run multiple shell commands, but it's something.  You'll also see the stdout 
from the local node.  You'll need to use the --leave-session-attached option to 
mpirun to see the output from the remote nodes.
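
For example (hosts illustrative):

-
mpirun --leave-session-attached --host gulftown,ibnode001 ./ring_c
# in another terminal, on the local and remote nodes:
ps -ef | grep orted
-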


On Feb 29, 2012, at 9:43 AM, Yiguang Yan wrote:

> Hi Jeff,
> 
> Thanks.
> 
> I tried what you suggested. Here is the output:
> 
 
> yiguang@gulftown testdmp]$ ./test.bash
> [gulftown:25052] mca: base: components_open: Looking for plm 
> components
> [gulftown:25052] mca: base: components_open: opening plm 
> components
> [gulftown:25052] mca: base: components_open: found loaded 
> component rsh
> [gulftown:25052] mca: base: components_open: component rsh 
> has no register function
> [gulftown:25052] mca: base: components_open: component rsh 
> open function successful
> [gulftown:25052] mca: base: components_open: found loaded 
> component slurm
> [gulftown:25052] mca: base: components_open: component slurm 
> has no register function
> [gulftown:25052] mca: base: components_open: component slurm 
> open function successful
> [gulftown:25052] mca: base: components_open: found loaded 
> component tm
> [gulftown:25052] mca: base: components_open: component tm 
> has no register function
> [gulftown:25052] mca: base: components_open: component tm 
> open function successful
> [gulftown:25052] mca:base:select: Auto-selecting plm components
> [gulftown:25052] mca:base:select:(  plm) Querying component [rsh]
> [gulftown:25052] mca:base:select:(  plm) Query of component [rsh] 
> set priority to 10
> [gulftown:25052] mca:base:select:(  plm) Querying component 
> [slurm]
> [gulftown:25052] mca:base:select:(  plm) Skipping component 
> [slurm]. Query failed to return a module
> [gulftown:25052] mca:base:select:(  plm) Querying component [tm]
> [gulftown:25052] mca:base:select:(  plm) Skipping component [tm]. 
> Query failed to return a module
> [gulftown:25052] mca:base:select:(  plm) Selected component [rsh]
> [gulftown:25052] mca: base: close: component slurm closed
> [gulftown:25052] mca: base: close: unloading component slurm
> [gulftown:25052] mca: base: close: component tm closed
> [gulftown:25052] mca: base: close: unloading component tm
> bash: orted: command not found
> bash: orted: command not found
> bash: orted: command not found
> <<<
> 
> 
> The following is the content of test.bash:
 
> yiguang@gulftown testdmp]$ ./test.bash
> #!/bin/sh -f
> #nohup
> #
> # >---<
> adinahome=/usr/adina/system8.8dmp
> mpirunfile=$adinahome/bin/mpirun
> #
> # Set envars for mpirun and orted
> #
> export PATH=$adinahome/bin:$adinahome/tools:$PATH
> export LD_LIBRARY_PATH=$adinahome/lib:$LD_LIBRARY_PATH
> #
> #
> # run DMP problem
> #
> mcaprefix="--prefix $adinahome"
> mcarshagent="--mca plm_rsh_agent rsh:ssh"
> mcatmpdir="--mca orte_tmpdir_base /tmp"
> mcaopenibmsg="--mca btl_openib_warn_default_gid_prefix 0"
> mcaenvars="-x PATH -x LD_LIBRARY_PATH"
> mcabtlconn="--mca btl openib,sm,self"
> mcaplmbase="--mca plm_base_verbose 100"
> 
> 

Re: [OMPI users] orted daemon no found! --- environment not passed to slave nodes

2012-02-29 Thread Yiguang Yan
Hi Jeff,

Thanks.

I tried what you suggested. Here is the output:

>>>
yiguang@gulftown testdmp]$ ./test.bash
[gulftown:25052] mca: base: components_open: Looking for plm 
components
[gulftown:25052] mca: base: components_open: opening plm 
components
[gulftown:25052] mca: base: components_open: found loaded 
component rsh
[gulftown:25052] mca: base: components_open: component rsh 
has no register function
[gulftown:25052] mca: base: components_open: component rsh 
open function successful
[gulftown:25052] mca: base: components_open: found loaded 
component slurm
[gulftown:25052] mca: base: components_open: component slurm 
has no register function
[gulftown:25052] mca: base: components_open: component slurm 
open function successful
[gulftown:25052] mca: base: components_open: found loaded 
component tm
[gulftown:25052] mca: base: components_open: component tm 
has no register function
[gulftown:25052] mca: base: components_open: component tm 
open function successful
[gulftown:25052] mca:base:select: Auto-selecting plm components
[gulftown:25052] mca:base:select:(  plm) Querying component [rsh]
[gulftown:25052] mca:base:select:(  plm) Query of component [rsh] 
set priority to 10
[gulftown:25052] mca:base:select:(  plm) Querying component 
[slurm]
[gulftown:25052] mca:base:select:(  plm) Skipping component 
[slurm]. Query failed to return a module
[gulftown:25052] mca:base:select:(  plm) Querying component [tm]
[gulftown:25052] mca:base:select:(  plm) Skipping component [tm]. 
Query failed to return a module
[gulftown:25052] mca:base:select:(  plm) Selected component [rsh]
[gulftown:25052] mca: base: close: component slurm closed
[gulftown:25052] mca: base: close: unloading component slurm
[gulftown:25052] mca: base: close: component tm closed
[gulftown:25052] mca: base: close: unloading component tm
bash: orted: command not found
bash: orted: command not found
bash: orted: command not found
<<<


The following is the content of test.bash:
>>>
yiguang@gulftown testdmp]$ ./test.bash
#!/bin/sh -f
#nohup
#
# >---<
adinahome=/usr/adina/system8.8dmp
mpirunfile=$adinahome/bin/mpirun
#
# Set envars for mpirun and orted
#
export PATH=$adinahome/bin:$adinahome/tools:$PATH
export LD_LIBRARY_PATH=$adinahome/lib:$LD_LIBRARY_PATH
#
#
# run DMP problem
#
mcaprefix="--prefix $adinahome"
mcarshagent="--mca plm_rsh_agent rsh:ssh"
mcatmpdir="--mca orte_tmpdir_base /tmp"
mcaopenibmsg="--mca btl_openib_warn_default_gid_prefix 0"
mcaenvars="-x PATH -x LD_LIBRARY_PATH"
mcabtlconn="--mca btl openib,sm,self"
mcaplmbase="--mca plm_base_verbose 100"

mcaparams="$mcaprefix $mcaenvars $mcarshagent 
$mcaopenibmsg $mcabtlconn $mcatmpdir $mcaplmbase"

$mpirunfile $mcaparams --app addmpw-hostname
<<<

While the content of addmpw-hostname is:
>>>
-n 1 -host gulftown hostname
-n 1 -host ibnode001 hostname
-n 1 -host ibnode002 hostname
-n 1 -host ibnode003 hostname
<<<

After this, I also tried to specify the orted through:

--mca orte_launch_agent $adinahome/bin/orted

then, orted could be found on slave nodes, but now the shared libs 
in $adinahome/lib are not on the LD_LIBRARY_PATH.

Any comments?

Thanks,
Yiguang





Re: [OMPI users] orted daemon no found! --- environment not passed to slave nodes?

2012-02-28 Thread Jeffrey Squyres
The intent of the --prefix option (or using the full path name to mpirun) was 
exactly for the purpose of not requiring changes to the .bashrc.
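
In other words (install path, hosts, and program illustrative), these two invocations are intended to behave the same way with respect to finding orted on the remote nodes:

-
/usr/adina/system8.8dmp/bin/mpirun --host gulftown,ibnode001 ring_c
mpirun --prefix /usr/adina/system8.8dmp --host gulftown,ibnode001 ring_c
-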

Can you run with "--mca plm_base_verbose 100" on your command line?  This will 
show us the exact rsh/ssh command line that is being executed -- it might shed 
some light on what is going on here.  For example:

mpirun --mca plm_base_verbose 100 --host A,B hostname



On Feb 27, 2012, at 10:41 AM, ya...@adina.com wrote:

> Greetings!
> 
> I have tried to run the ring_c example test from a bash script. In this bash 
> script, I set up PATH and LD_LIBRARY_PATH (I do not want to disturb ~/.bashrc, 
> etc.), then use the full path of mpirun to invoke the MPI processes; mpirun and 
> orted are both on the PATH. However, from the Open MPI message, orted was not 
> found; as far as I can tell, it was not found only on the slave nodes. Then I 
> tried to set --prefix or "-x PATH -x LD_LIBRARY_PATH", hoping these envars would 
> be passed to the slave nodes, but it turned out they are not forwarded to the 
> slave nodes.
> 
> On the other hand, if I set the same PATH and LD_LIBRARY_PATH in ~/.bashrc, 
> which is shared by all nodes, mpirun from the bash script runs fine and orted is 
> found. This is easy to understand, but I really do not want to change ~/.bashrc.
> 
> It seems the non-interactive bash shell does not pass envars to the slave nodes.
> 
> Any comments and solutions?
> 
> Thanks,
> Yiguang
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI users] orted daemon no found! --- environment not passed to slave nodes?

2012-02-27 Thread yanyg
Greetings!

I have tried to run the ring_c example test from a bash script. In this bash 
script, I set up PATH and LD_LIBRARY_PATH (I do not want to disturb ~/.bashrc, 
etc.), then use the full path of mpirun to invoke the MPI processes; mpirun and 
orted are both on the PATH. However, from the Open MPI message, orted was not 
found; as far as I can tell, it was not found only on the slave nodes. Then I 
tried to set --prefix or "-x PATH -x LD_LIBRARY_PATH", hoping these envars would 
be passed to the slave nodes, but it turned out they are not forwarded to the 
slave nodes.

On the other hand, if I set the same PATH and LD_LIBRARY_PATH in ~/.bashrc, 
which is shared by all nodes, mpirun from the bash script runs fine and orted is 
found. This is easy to understand, but I really do not want to change ~/.bashrc.

It seems the non-interactive bash shell does not pass envars to the slave nodes.

Any comments and solutions?

Thanks,
Yiguang