Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes
On Mar 2, 2012, at 3:23 PM, Yiguang Yan wrote:

> It turns out that the "-x" option should be put on each line of the app file
> if an app file is used.
>
> So from tests (a), (b), (c), if I am using an app file, the PATH and
> LD_LIBRARY_PATH are only passed to the slave node when "-x" is set on each
> line of the app file, similar to the "--prefix" option.
>
> Any conclusion? If a bug fix is admitted for the "--prefix" option, I would
> think this is another bug for the "-x" option.

I don't think so, in this case. I can see places where one might want to pass an envar to one app_context, but not all. I fixed the --prefix option on our trunk and filed the patch for the 1.5 series - let's hold there for now.

Thanks
Ralph

> Thanks,
> Yiguang
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
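For quick reference, the working configuration that the tests in this thread converge on is an app file that repeats both --prefix and -x on every app_context line (paths and hostnames as used throughout the thread). A sketch of such a file (named addmpw-foo in the posts below):

```text
--prefix /home/yiguang/testdmp881 -x PATH -x LD_LIBRARY_PATH -n 1 -host gulftown foo
--prefix /home/yiguang/testdmp881 -x PATH -x LD_LIBRARY_PATH -n 1 -host ibnode001 foo
```

With this file, `mpirun --app addmpw-foo` should behave like the explicit command line that passes --prefix and -x directly to mpirun.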
Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes
It turns out that the "-x" option should be put on each line of the app file if an app file is used.

OK, now the test results on our cluster, in case this may be useful to some Open MPI users (Open MPI 1.4.3 used on my system):

(1) If I run the mpirun command from the command line as in Jeff's foo test, everything works fine, the same as in Jeff's foo test.

(2) Now let me start mpirun from a shell script. First, the foo script includes:
>>>
#!/bin/sh -f
echo $HOSTNAME: PATH : $PATH
echo $HOSTNAME: LD_LIBRARY_PATH : $LD_LIBRARY_PATH
<<<

The testenvars.bash script includes:
>>>
#!/bin/sh -f
#nohup
#
# >---<
adinahome=/home/yiguang/testdmp881
mpirunfile=$adinahome/bin/mpirun
#
# Set envars for mpirun and orted
#
export PATH=/this/is/a/fake/path:$adinahome/bin:$adinahome/tools:$PATH
export LD_LIBRARY_PATH=/this/is/a/fake/libdir:$adinahome/lib:$LD_LIBRARY_PATH
#
#
# run DMP problem
#
mcaprefix="--prefix $adinahome"
mcaenvars="-x PATH -x LD_LIBRARY_PATH"
mcabtlconn="--mca btl openib,sm,self"
#mcaplmbase="--mca plm_base_verbose 100"
# mpirun is under $adinahome/bin
$mpirunfile --host gulftown,ibnode001 foo
<<<

Now if I run testenvars.bash from the command line:
>>>
[yiguang@gulftown testdmp]$ ./testenvars.bash
gulftown: PATH : /home/yiguang/testdmp881/bin:/home/yiguang/testdmp881/bin:/this/is/a/fake/path:/home/yiguang/testdmp881/bin:/home/yiguang/testdmp881/tools:/usr/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/adina/system8.8/tools:/usr/adina/system8.7/tools:/usr/adina/system8.6/tools:/usr/adina/system8.5/tools:/home/yiguang/bin
gulftown: LD_LIBRARY_PATH : /home/yiguang/testdmp881/lib:/home/yiguang/testdmp881/lib:/this/is/a/fake/libdir:/home/yiguang/testdmp881/lib:
ibnode001: PATH : /home/yiguang/testdmp881/bin:/home/yiguang/testdmp881/bin:/usr/bin:/usr/lib64/qt-3.3/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin
ibnode001: LD_LIBRARY_PATH : /home/yiguang/testdmp881/lib:/home/yiguang/testdmp881/lib:
<<<

If, in the testenvars.bash script, I change the line

$mpirunfile --host gulftown,ibnode001 foo
-->
mpirun --prefix $adinahome --host gulftown,ibnode001 foo

then I get the same output as above; as expected, the full path of mpirun and --prefix give us the same action. The unexpected part is that /home/yiguang/testdmp881/bin and /home/yiguang/testdmp881/lib are included twice here - why?

Now if I change, in the above testenvars.bash script, the line

$mpirunfile --host gulftown,ibnode001 foo
-->
mpirun --prefix $adinahome $mcaenvars --host gulftown,ibnode001 foo

then run the script:
>>>
[yiguang@gulftown testdmp]$ ./testenvars.bash
gulftown: PATH : /home/yiguang/testdmp881/bin:/this/is/a/fake/path:/home/yiguang/testdmp881/bin:/home/yiguang/testdmp881/tools:/usr/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/adina/system8.8/tools:/usr/adina/system8.7/tools:/usr/adina/system8.6/tools:/usr/adina/system8.5/tools:/home/yiguang/bin
gulftown: LD_LIBRARY_PATH : /home/yiguang/testdmp881/lib:/this/is/a/fake/libdir:/home/yiguang/testdmp881/lib:
ibnode001: PATH : /home/yiguang/testdmp881/bin:/this/is/a/fake/path:/home/yiguang/testdmp881/bin:/home/yiguang/testdmp881/tools:/usr/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/adina/system8.8/tools:/usr/adina/system8.7/tools:/usr/adina/system8.6/tools:/usr/adina/system8.5/tools:/home/yiguang/bin
ibnode001: LD_LIBRARY_PATH : /home/yiguang/testdmp881/lib:/this/is/a/fake/libdir:/home/yiguang/testdmp881/lib:
<<<

This time, the PATH and LD_LIBRARY_PATH are passed to the slave node, and /home/yiguang/testdmp881/bin and /home/yiguang/testdmp881/lib are included only once, different from the last test. So far so good, except the minor things.
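The invocation that forwarded the environment correctly can be sketched as follows. This is a minimal illustration (it only assembles and prints the command rather than launching anything; paths and hostnames are taken from the testenvars.bash script above):

```shell
#!/bin/sh
# Assemble the mpirun command that worked in the test above.
# --prefix tells the remote nodes where the Open MPI install (orted) lives;
# -x forwards the named environment variables to the launched processes.
adinahome=/home/yiguang/testdmp881
mcaenvars="-x PATH -x LD_LIBRARY_PATH"
cmd="$adinahome/bin/mpirun --prefix $adinahome $mcaenvars --host gulftown,ibnode001 foo"
echo "$cmd"
```

Printing the command first is just a convenient way to check the option string before running it for real.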
(3) Now I changed to use an app file.

First the scripts: the foo script is as above; the testenvars-app.bash script includes:
>>>
[yiguang@gulftown testdmp]$ cat testenvars-app.bash
#!/bin/sh -f
#nohup
#
# >---<
adinahome=/home/yiguang/testdmp881
mpirunfile=$adinahome/bin/mpirun
#
# Set envars for mpirun and orted
#
export PATH=/this/is/a/fake/path:$adinahome/bin:$adinahome/tools:$PATH
export LD_LIBRARY_PATH=/this/is/a/fake/libdir:$adinahome/lib:$LD_LIBRARY_PATH
#
#
# run DMP problem
#
#mcaprefix="--prefix $adinahome"
mcaenvars="-x PATH -x LD_LIBRARY_PATH"
mcabtlconn="--mca btl openib,sm,self"
#mcaplmbase="--mca plm_base_verbose 100"
$mpirunfile $mcabtlconn --app addmpw-foo-nox
#$mpirunfile $mcaenvars $mcabtlconn --app addmpw-foo-nox
#$mpirunfile $mcabtlconn --app addmpw-foo
<<<

The addmpw-foo-nox app file is:
>>>
[yiguang@gulftown testdmp]$ cat addmpw-foo-nox
--prefix /home/yiguang/testdmp881 -n 1 -host gulftown foo
--prefix /home/yiguang/testdmp881 -n 1 -host ibnode001 foo
<<<

The addmpw-foo app file is:
>>>
[yiguang@gulftown testdmp]$ cat addmpw-foo
--prefix /home/yiguang/testdmp881 -x PATH -x LD_LIBRARY_PATH -n 1 -host gulftown foo
--prefix
Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes
On Mar 2, 2012, at 2:50 PM, Ralph Castain wrote:

>> Ralph and I just had a phone conversation about this. We consider it a bug
>> -- you shouldn't need to put --prefix in the app file. Meaning: --prefix is
>> currently being ignored if you use an app file (and therefore you have to
>> put --prefix in the app file). We're going to fix that.
>
> Updated in the developer's trunk. I don't think we'll bring this to the 1.5
> branch, though I leave that up to Jeff.

Actually, I think we should. This way, the unexpected behavior of --prefix / absolute mpirun path name being dropped won't be in the entire 1.6 series.

Ralph -- can you CMR this?

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes
On Mar 2, 2012, at 10:50 AM, Jeffrey Squyres wrote:

> On Mar 2, 2012, at 9:48 AM, Yiguang Yan wrote:
>
>> (All with the same test script test.bash I post in my previous emails, so
>> run with app file fed to mpirun command.)
>>
>> (1) If I put the --prefix in the app file, on each line of it, it works fine
>> as Jeff said.
>>
>> (2) Since in the manual, it is said that the full path of mpirun is the same
>> as setting "--prefix". However, with an app file, this is not the case.
>> Without "--prefix" on each line of the app file, the full path of mpirun
>> does not work.
>
> Ralph and I just had a phone conversation about this. We consider it a bug
> -- you shouldn't need to put --prefix in the app file. Meaning: --prefix is
> currently being ignored if you use an app file (and therefore you have to
> put --prefix in the app file). We're going to fix that.

Updated in the developer's trunk. I don't think we'll bring this to the 1.5 branch, though I leave that up to Jeff.
Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes
On Mar 2, 2012, at 9:48 AM, Yiguang Yan wrote:

> (All with the same test script test.bash I post in my previous emails, so run
> with app file fed to mpirun command.)
>
> (1) If I put the --prefix in the app file, on each line of it, it works fine
> as Jeff said.
>
> (2) Since in the manual, it is said that the full path of mpirun is the same
> as setting "--prefix". However, with an app file, this is not the case.
> Without "--prefix" on each line of the app file, the full path of mpirun
> does not work.

Ralph and I just had a phone conversation about this. We consider it a bug -- you shouldn't need to put --prefix in the app file. Meaning: --prefix is currently being ignored if you use an app file (and therefore you have to put --prefix in the app file). We're going to fix that.

> (3) With "--prefix $adinahome" set on each line of the app file, it
> exclusively puts, on each node, $adinahome/bin into the PATH, and
> $adinahome/lib into the LD_LIBRARY_PATH (not $adinahome/lib64 as said
> in the mpirun manual (v1.4.x)).

Correct.

> The envars $PATH and $LD_LIBRARY_PATH set in the test.bash script only affect
> the envars on the submitting node (gulftown in my case). No $PATH or
> $LD_LIBRARY_PATH is passed to slave nodes even if I use "-x PATH -x
> LD_LIBRARY_PATH", either fed to mpirun or put on each line of the app file.
> I am not sure if this is intended; since "--prefix" overwrites the effect of
> the "-x" option, this is different from what I see in the mpirun man page.

Hmm. Let's do a simple test here...
-----
[9:38] svbu-mpi:~ % cat foo
#!/bin/bash
echo test_env_var: $test_env_var
[9:38] svbu-mpi:~ % ./foo
test_env_var:
[9:38] svbu-mpi:~ % mpirun --host svbu-mpi001,svbu-mpi002 ~/foo
test_env_var:
test_env_var:
[9:38] svbu-mpi:~ % setenv test_env_var THIS-IS-TEST-ENV-VAR
[9:39] svbu-mpi:~ % ./foo
test_env_var: THIS-IS-TEST-ENV-VAR
[9:39] svbu-mpi:~ % mpirun --host svbu-mpi001,svbu-mpi002 ~/foo
test_env_var:
test_env_var:
[9:39] svbu-mpi:~ % mpirun --host svbu-mpi001,svbu-mpi002 -x test_env_var ~/foo
test_env_var: THIS-IS-TEST-ENV-VAR
test_env_var: THIS-IS-TEST-ENV-VAR
[9:39] svbu-mpi:~ %
-----

So that appears to work. Let's try with PATH.

-----
[9:41] svbu-mpi:~ % cat foo
#!/bin/bash -f
echo PATH: $PATH
[9:41] svbu-mpi:~ % ./foo
PATH: /home/jsquyres/bogus/bin:/users/jsquyres/local/bin:/home/jsquyres/bogus/bin:/users/jsquyres/local/bin:/var/opt/intel/composerxe-2011.1.107/bin:/opt/autotools/ac268-am1113-lt242/bin:/cm/shared/apps/valgrind/3.7.0/bin:/cm/shared/apps/mercurial/2.0.2/bin:/cm/shared/apps/gcc/4.4.6/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/sbin:/usr/sbin:/cm/shared/apps/slurm/2.2.4/bin:/cm/shared/apps/slurm/2.2.4/sbin:/cm/shared/apps/proxy/bin:/cm/shared/apps/subversion/1.7.2/bin:/sbin:/usr/sbin
# That's ok. Now let's try with mpirun.
[9:41] svbu-mpi:~ % mpirun --host svbu-mpi001,svbu-mpi002 ~/foo
PATH: /home/jsquyres/bogus/bin:/home/jsquyres/bogus/bin:/home/jsquyres/bogus/bin:/users/jsquyres/local/bin:/var/opt/intel/composerxe-2011.1.107/bin:/opt/autotools/ac268-am1113-lt242/bin:/cm/shared/apps/valgrind/3.7.0/bin:/cm/shared/apps/mercurial/2.0.2/bin:/cm/shared/apps/gcc/4.4.6/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/sbin:/usr/sbin:/cm/shared/apps/slurm/2.2.4/bin:/cm/shared/apps/slurm/2.2.4/sbin:/cm/shared/apps/proxy/bin:/cm/shared/apps/subversion/1.7.2/bin
PATH: /home/jsquyres/bogus/bin:/home/jsquyres/bogus/bin:/home/jsquyres/bogus/bin:/users/jsquyres/local/bin:/var/opt/intel/composerxe-2011.1.107/bin:/opt/autotools/ac268-am1113-lt242/bin:/cm/shared/apps/valgrind/3.7.0/bin:/cm/shared/apps/mercurial/2.0.2/bin:/cm/shared/apps/gcc/4.4.6/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/sbin:/usr/sbin:/cm/shared/apps/slurm/2.2.4/bin:/cm/shared/apps/slurm/2.2.4/sbin:/cm/shared/apps/proxy/bin:/cm/shared/apps/subversion/1.7.2/bin
# These look ok (my remote path is a bit longer than my local path)
# Now let's add a bogus entry to the local path
[9:41] svbu-mpi:~ % set path = ($path /this/is/a/fake/path)
[9:41] svbu-mpi:~ % ./foo
PATH: /home/jsquyres/bogus/bin:/users/jsquyres/local/bin:/home/jsquyres/bogus/bin:/users/jsquyres/local/bin:/var/opt/intel/composerxe-2011.1.107/bin:/opt/autotools/ac268-am1113-lt242/bin:/cm/shared/apps/valgrind/3.7.0/bin:/cm/shared/apps/mercurial/2.0.2/bin:/cm/shared/apps/gcc/4.4.6/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/sbin:/usr/sbin:/cm/shared/apps/slurm/2.2.4/bin:/cm/shared/apps/slurm/2.2.4/sbin:/cm/shared/apps/proxy/bin:/cm/shared/apps/subversion/1.7.2/bin:/sbin:/usr/sbin:/this/is/a/fake/path
# Good; the bogus entry is there. Now try mpirun
[9:41] svbu-mpi:~ % mpirun --host svbu-mpi001,svbu-mpi002 ~/foo
PATH:
Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes
We'll take a look at the prefix behavior. As to the btl, you can always just force it: for example, -mca btl sm,self,openib would restrict it to shared memory and IB.

On Mar 2, 2012, at 7:48 AM, Yiguang Yan wrote:

> Hi Jeff, Ralph--
>
> Please let me follow the thread; here is what I observed:
>
> (All with the same test script test.bash I post in my previous emails, so run
> with app file fed to mpirun command.)
>
> (1) If I put the --prefix in the app file, on each line of it, it works fine
> as Jeff said.
>
> (2) Since in the manual, it is said that the full path of mpirun is the same
> as setting "--prefix". However, with an app file, this is not the case.
> Without "--prefix" on each line of the app file, the full path of mpirun
> does not work.
>
> (3) With "--prefix $adinahome" set on each line of the app file, it
> exclusively puts, on each node, $adinahome/bin into the PATH, and
> $adinahome/lib into the LD_LIBRARY_PATH (not $adinahome/lib64 as said
> in the mpirun manual (v1.4.x)). The envars $PATH and $LD_LIBRARY_PATH set in
> the test.bash script only affect the envars on the submitting node (gulftown
> in my case). No $PATH or $LD_LIBRARY_PATH is passed to slave nodes even if I
> use "-x PATH -x LD_LIBRARY_PATH", either fed to mpirun or put on each line of
> the app file. I am not sure if this is intended; since "--prefix" overwrites
> the effect of the "-x" option, this is different from what I see in the
> mpirun man page.
>
> I have another question about the btl used for communication. I noticed that
> rsh is using tcp for the connection. I understand that tcp may be used for
> the initial connection, but how can I know that the openib (InfiniBand) btl
> is used for my data communication? Any explicit way?
>
> Thanks,
> Yiguang
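On the verification question, one common approach (an assumption on my part, based on the standard MCA verbosity parameters rather than anything stated in this thread) is to combine the forced BTL list with a raised btl_base_verbose level, so the selected transports are logged at startup. A sketch that only assembles and prints the suggested command:

```shell
#!/bin/sh
# Restrict the allowed BTLs and raise BTL verbosity so component selection
# (including openib initialization) is logged when the job starts.
# Hostnames are the ones used in this thread; this script only prints the
# command rather than launching it.
btlargs="--mca btl openib,sm,self --mca btl_base_verbose 30"
echo "mpirun $btlargs --host gulftown,ibnode001 foo"
```

If openib fails to initialize with this forced list, the job should abort rather than silently fall back to tcp, which also answers the "how do I know" question indirectly.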
Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes
Hi Jeff, Ralph--

Please let me follow the thread; here is what I observed:

(All with the same test script test.bash I post in my previous emails, so run with app file fed to mpirun command.)

(1) If I put the --prefix in the app file, on each line of it, it works fine as Jeff said.

(2) Since in the manual, it is said that the full path of mpirun is the same as setting "--prefix". However, with an app file, this is not the case. Without "--prefix" on each line of the app file, the full path of mpirun does not work.

(3) With "--prefix $adinahome" set on each line of the app file, it exclusively puts, on each node, $adinahome/bin into the PATH, and $adinahome/lib into the LD_LIBRARY_PATH (not $adinahome/lib64 as said in the mpirun manual (v1.4.x)). The envars $PATH and $LD_LIBRARY_PATH set in the test.bash script only affect the envars on the submitting node (gulftown in my case). No $PATH or $LD_LIBRARY_PATH is passed to slave nodes even if I use "-x PATH -x LD_LIBRARY_PATH", either fed to mpirun or put on each line of the app file. I am not sure if this is intended; since "--prefix" overwrites the effect of the "-x" option, this is different from what I see in the mpirun man page.

I have another question about the btl used for communication. I noticed that rsh is using tcp for the connection. I understand that tcp may be used for the initial connection, but how can I know that the openib (InfiniBand) btl is used for my data communication? Any explicit way?

Thanks,
Yiguang
Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes
I don't know - I didn't write the app file code, and I've never seen anything defining its behavior. So I guess you could say it is intended - or not! :-/

On Mar 1, 2012, at 2:53 PM, Jeffrey Squyres wrote:

> Actually, I should say that I discovered that if you put --prefix on each
> line of the app context file, then the first case (running the app context
> file) works fine; it adheres to the --prefix behavior.
>
> Ralph: is this intended behavior? (I don't know if I have an opinion either
> way)
>
> On Mar 1, 2012, at 4:51 PM, Jeffrey Squyres wrote:
>
>> I see the problem.
>>
>> It looks like the use of the app context file is triggering different
>> behavior, and that behavior is erasing the use of --prefix. If I replace
>> the app context file with a complete command line, it works and the
>> --prefix behavior is observed.
>>
>> Specifically:
>>
>> $mpirunfile $mcaparams --app addmpw-hostname
>>
>> ^^ This one seems to ignore --prefix behavior.
>>
>> $mpirunfile $mcaparams --host svbu-mpi,svbu-mpi001 -np 2 hostname
>> $mpirunfile $mcaparams --host svbu-mpi -np 1 hostname : --host svbu-mpi001 -np 1 hostname
>>
>> ^^ These two seem to adhere to the proper --prefix behavior.
>>
>> Ralph -- can you have a look?
>>
>> On Mar 1, 2012, at 2:59 PM, Yiguang Yan wrote:
>>
>>> Hi Ralph,
>>>
>>> Thanks, here is what I did as suggested by Jeff:
>>>
>>>> What did this command line look like? Can you provide the configure line as well?
Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes
> Actually, I should say that I discovered that if you put --prefix on each
> line of the app context file, then the first case (running the app context
> file) works fine; it adheres to the --prefix behavior.

Yes, I confirmed this on our cluster. It works with --prefix on each line of the app file.
Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes
Actually, I should say that I discovered that if you put --prefix on each line of the app context file, then the first case (running the app context file) works fine; it adheres to the --prefix behavior.

Ralph: is this intended behavior? (I don't know if I have an opinion either way)

On Mar 1, 2012, at 4:51 PM, Jeffrey Squyres wrote:

> I see the problem.
>
> It looks like the use of the app context file is triggering different
> behavior, and that behavior is erasing the use of --prefix. If I replace the
> app context file with a complete command line, it works and the --prefix
> behavior is observed.
>
> Specifically:
>
> $mpirunfile $mcaparams --app addmpw-hostname
>
> ^^ This one seems to ignore --prefix behavior.
>
> $mpirunfile $mcaparams --host svbu-mpi,svbu-mpi001 -np 2 hostname
> $mpirunfile $mcaparams --host svbu-mpi -np 1 hostname : --host svbu-mpi001 -np 1 hostname
>
> ^^ These two seem to adhere to the proper --prefix behavior.
>
> Ralph -- can you have a look?
>
> On Mar 1, 2012, at 2:59 PM, Yiguang Yan wrote:
>
>> Hi Ralph,
>>
>> Thanks, here is what I did as suggested by Jeff:
>>
>>> What did this command line look like? Can you provide the configure line as well?
Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes
Hi Ralph,

Thanks, here is what I did as suggested by Jeff:

> What did this command line look like? Can you provide the configure line as well?

As in my previous post, the script as following:

(1) debug messages:
>>>
yiguang@gulftown testdmp]$ ./test.bash
[gulftown:28340] mca: base: components_open: Looking for plm components
[gulftown:28340] mca: base: components_open: opening plm components
[gulftown:28340] mca: base: components_open: found loaded component rsh
[gulftown:28340] mca: base: components_open: component rsh has no register function
[gulftown:28340] mca: base: components_open: component rsh open function successful
[gulftown:28340] mca: base: components_open: found loaded component slurm
[gulftown:28340] mca: base: components_open: component slurm has no register function
[gulftown:28340] mca: base: components_open: component slurm open function successful
[gulftown:28340] mca: base: components_open: found loaded component tm
[gulftown:28340] mca: base: components_open: component tm has no register function
[gulftown:28340] mca: base: components_open: component tm open function successful
[gulftown:28340] mca:base:select: Auto-selecting plm components
[gulftown:28340] mca:base:select:( plm) Querying component [rsh]
[gulftown:28340] mca:base:select:( plm) Query of component [rsh] set priority to 10
[gulftown:28340] mca:base:select:( plm) Querying component [slurm]
[gulftown:28340] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[gulftown:28340] mca:base:select:( plm) Querying component [tm]
[gulftown:28340] mca:base:select:( plm) Skipping component [tm]. Query failed to return a module
[gulftown:28340] mca:base:select:( plm) Selected component [rsh]
[gulftown:28340] mca: base: close: component slurm closed
[gulftown:28340] mca: base: close: unloading component slurm
[gulftown:28340] mca: base: close: component tm closed
[gulftown:28340] mca: base: close: unloading component tm
[gulftown:28340] plm:base:set_hnp_name: initial bias 28340 nodename hash 3546479048
[gulftown:28340] plm:base:set_hnp_name: final jobfam 17438
[gulftown:28340] [[17438,0],0] plm:base:receive start comm
[gulftown:28340] [[17438,0],0] plm:rsh: setting up job [17438,1]
[gulftown:28340] [[17438,0],0] plm:base:setup_job for job [17438,1]
[gulftown:28340] [[17438,0],0] plm:rsh: local shell: 0 (bash)
[gulftown:28340] [[17438,0],0] plm:rsh: assuming same remote shell as local shell
[gulftown:28340] [[17438,0],0] plm:rsh: remote shell: 0 (bash)
[gulftown:28340] [[17438,0],0] plm:rsh: final template argv:
    /usr/bin/rsh  orted --daemonize -mca ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid  -mca orte_ess_num_procs 4 --hnp-uri "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca plm_base_verbose 100
[gulftown:28340] [[17438,0],0] plm:rsh:launch daemon already exists on node gulftown
[gulftown:28340] [[17438,0],0] plm:rsh: launching on node ibnode001
[gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon [[17438,0],1]
[gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) [/usr/bin/rsh ibnode001 orted --daemonize -mca ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 4 --hnp-uri "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca plm_base_verbose 100]
bash: orted: command not found
[gulftown:28340] [[17438,0],0] plm:rsh: launching on node ibnode002
[gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon [[17438,0],2]
[gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) [/usr/bin/rsh ibnode002 orted --daemonize -mca ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 4 --hnp-uri "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca plm_base_verbose 100]
bash: orted: command not found
[gulftown:28340] [[17438,0],0] plm:rsh: launching on node ibnode003
[gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) [/usr/bin/rsh ibnode003 orted --daemonize -mca ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid 3 -mca orte_ess_num_procs 4 --hnp-uri "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca plm_base_verbose 100]
[gulftown:28340] [[17438,0],0]
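The "bash: orted: command not found" lines above show that the remote non-interactive shell cannot find the Open MPI daemon. Independent of the --prefix and -x fixes discussed in this thread, a common fallback (an assumption on my part, not something this thread's configuration uses; it presumes the install prefix shown above is visible on every node) is to export the install paths in each node's shell startup file:

```shell
# Sketch: lines that could be appended to ~/.bashrc on each compute node so
# that orted and its libraries are found by non-interactive rsh/ssh shells.
# The prefix below is the hypothetical install location used in this thread.
export PATH=/home/yiguang/testdmp881/bin:$PATH
export LD_LIBRARY_PATH=/home/yiguang/testdmp881/lib:$LD_LIBRARY_PATH
```

With this in place, even a bare `rsh ibnode001 orted ...` launch should resolve the daemon without --prefix, though --prefix remains the more self-contained solution.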
Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes
What did this command line look like? Can you provide the configure line as well?

On Mar 1, 2012, at 12:46 PM, Yiguang Yan wrote:

> Hi Jeff,
>
> Here I made a developer build, and then got the following message
> with plm_base_verbose:
Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes
Hi Jeff,

Here I made a developer build, and then got the following message with plm_base_verbose:

>>>
[gulftown:28340] mca: base: components_open: Looking for plm components
[gulftown:28340] mca: base: components_open: opening plm components
[gulftown:28340] mca: base: components_open: found loaded component rsh
[gulftown:28340] mca: base: components_open: component rsh has no register function
[gulftown:28340] mca: base: components_open: component rsh open function successful
[gulftown:28340] mca: base: components_open: found loaded component slurm
[gulftown:28340] mca: base: components_open: component slurm has no register function
[gulftown:28340] mca: base: components_open: component slurm open function successful
[gulftown:28340] mca: base: components_open: found loaded component tm
[gulftown:28340] mca: base: components_open: component tm has no register function
[gulftown:28340] mca: base: components_open: component tm open function successful
[gulftown:28340] mca:base:select: Auto-selecting plm components
[gulftown:28340] mca:base:select:( plm) Querying component [rsh]
[gulftown:28340] mca:base:select:( plm) Query of component [rsh] set priority to 10
[gulftown:28340] mca:base:select:( plm) Querying component [slurm]
[gulftown:28340] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[gulftown:28340] mca:base:select:( plm) Querying component [tm]
[gulftown:28340] mca:base:select:( plm) Skipping component [tm]. Query failed to return a module
[gulftown:28340] mca:base:select:( plm) Selected component [rsh]
[gulftown:28340] mca: base: close: component slurm closed
[gulftown:28340] mca: base: close: unloading component slurm
[gulftown:28340] mca: base: close: component tm closed
[gulftown:28340] mca: base: close: unloading component tm
[gulftown:28340] plm:base:set_hnp_name: initial bias 28340 nodename hash 3546479048
[gulftown:28340] plm:base:set_hnp_name: final jobfam 17438
[gulftown:28340] [[17438,0],0] plm:base:receive start comm
[gulftown:28340] [[17438,0],0] plm:rsh: setting up job [17438,1]
[gulftown:28340] [[17438,0],0] plm:base:setup_job for job [17438,1]
[gulftown:28340] [[17438,0],0] plm:rsh: local shell: 0 (bash)
[gulftown:28340] [[17438,0],0] plm:rsh: assuming same remote shell as local shell
[gulftown:28340] [[17438,0],0] plm:rsh: remote shell: 0 (bash)
[gulftown:28340] [[17438,0],0] plm:rsh: final template argv:
/usr/bin/rsh orted --daemonize -mca ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid -mca orte_ess_num_procs 4 --hnp-uri "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca plm_base_verbose 100
[gulftown:28340] [[17438,0],0] plm:rsh:launch daemon already exists on node gulftown
[gulftown:28340] [[17438,0],0] plm:rsh: launching on node ibnode001
[gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon [[17438,0],1]
[gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) [/usr/bin/rsh ibnode001 orted --daemonize -mca ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 4 --hnp-uri "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca plm_base_verbose 100]
bash: orted: command not found
[gulftown:28340] [[17438,0],0] plm:rsh: launching on node ibnode002
[gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon [[17438,0],2]
[gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) [/usr/bin/rsh ibnode002 orted --daemonize -mca ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 4 --hnp-uri "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca plm_base_verbose 100]
bash: orted: command not found
[gulftown:28340] [[17438,0],0] plm:rsh: launching on node ibnode003
[gulftown:28340] [[17438,0],0] plm:rsh: executing: (//usr/bin/rsh) [/usr/bin/rsh ibnode003 orted --daemonize -mca ess env -mca orte_ess_jobid 1142816768 -mca orte_ess_vpid 3 -mca orte_ess_num_procs 4 --hnp-uri "1142816768.0;tcp://198.177.146.70:43159;tcp://10.10.10.4:43159;tcp://172.23.10.1:43159;tcp://172.33.10.1:43159" --mca plm_rsh_agent rsh:ssh --mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,sm,self --mca orte_tmpdir_base /tmp --mca plm_base_verbose 100]
[gulftown:28340] [[17438,0],0] plm:rsh: recording launch of daemon [[17438,0],3]
bash: orted: command not found
[gulftown:28340] [[17438,0],0] plm:base:daemon_callback
<<<

It
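The repeated "bash: orted: command not found" lines mean exactly what they say: the non-interactive shell that rsh starts on each slave node cannot resolve "orted" through the PATH it was given. A minimal local stand-in for that lookup (the command name below is deliberately nonexistent, purely for illustration):

```shell
#!/bin/sh
# "command -v" performs the same PATH search the remote shell does
# before printing "command not found". The name is a fake stand-in
# for orted, so the else branch is taken here.
if command -v no-such-orted-demo >/dev/null 2>&1; then
    echo "found: $(command -v no-such-orted-demo)"
else
    echo "no-such-orted-demo: command not found on this PATH"
fi
```

Running the same check by hand on a slave node (rsh node 'command -v orted') shows whether the remote default PATH reaches the Open MPI install.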
Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes
Gah. I didn't realize that my 1.4.x build was a *developer* build. *Developer* builds give a *lot* more detail with plm_base_verbose=100 (including the specific rsh command being used). You obviously didn't get that output because you don't have a developer build. :-\

Just for reference, here's what plm_base_verbose=100 tells me for running an orted on a remote node, when I use the --prefix option to mpirun (I'm a tcsh user, so the syntax below will be a little different from what is running in your environment):

-
[svbu-mpi:28527] [[20181,0],0] plm:rsh: executing: (//usr/bin/ssh) [/usr/bin/ssh svbu-mpi001 set path = ( /home/jsquyres/bogus/bin $path ) ; if ( $?LD_LIBRARY_PATH == 1 ) set OMPI_have_llp ; if ( $?LD_LIBRARY_PATH == 0 ) setenv LD_LIBRARY_PATH /home/jsquyres/bogus/lib ; if ( $?OMPI_have_llp == 1 ) setenv LD_LIBRARY_PATH /home/jsquyres/bogus/lib:$LD_LIBRARY_PATH ; /home/jsquyres/bogus/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 1322582016 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri "1322582016.0;tcp://172.29.218.140:34815;tcp://10.148.255.1:34815" --mca plm_base_verbose 100]
-

Ok, a few options here:

1. You can get a developer build if you use the --enable-debug option to configure. Then plm_base_verbose=100 will give a lot more info. Remember, the goal here is to see what's going wrong -- not to depend on having a developer build around.

2. If that isn't workable, make an "orted" in your default path somewhere that's a short script:

-
:
echo ===environment===
env | sort
echo ===environment end===
sleep 1000
-

Then when you "mpirun", do a "ps" to see exactly what was executed on the node where mpirun was invoked and the node where orted is supposed to be running. It's not quite as descriptive as seeing the plm_base_verbose output because we run multiple shell commands, but it's something. You'll also see the stdout from the local node.
You'll need to use the --leave-session-attached option to mpirun to see the output from the remote nodes. On Feb 29, 2012, at 9:43 AM, Yiguang Yan wrote: > Hi Jeff, > > Thanks. > > I tried as what you suggested. Here are the output: > > yiguang@gulftown testdmp]$ ./test.bash > [gulftown:25052] mca: base: components_open: Looking for plm > components > [gulftown:25052] mca: base: components_open: opening plm > components > [gulftown:25052] mca: base: components_open: found loaded > component rsh > [gulftown:25052] mca: base: components_open: component rsh > has no register function > [gulftown:25052] mca: base: components_open: component rsh > open function successful > [gulftown:25052] mca: base: components_open: found loaded > component slurm > [gulftown:25052] mca: base: components_open: component slurm > has no register function > [gulftown:25052] mca: base: components_open: component slurm > open function successful > [gulftown:25052] mca: base: components_open: found loaded > component tm > [gulftown:25052] mca: base: components_open: component tm > has no register function > [gulftown:25052] mca: base: components_open: component tm > open function successful > [gulftown:25052] mca:base:select: Auto-selecting plm components > [gulftown:25052] mca:base:select:( plm) Querying component [rsh] > [gulftown:25052] mca:base:select:( plm) Query of component [rsh] > set priority to 10 > [gulftown:25052] mca:base:select:( plm) Querying component > [slurm] > [gulftown:25052] mca:base:select:( plm) Skipping component > [slurm]. Query failed to return a module > [gulftown:25052] mca:base:select:( plm) Querying component [tm] > [gulftown:25052] mca:base:select:( plm) Skipping component [tm]. 
> Query failed to return a module > [gulftown:25052] mca:base:select:( plm) Selected component [rsh] > [gulftown:25052] mca: base: close: component slurm closed > [gulftown:25052] mca: base: close: unloading component slurm > [gulftown:25052] mca: base: close: component tm closed > [gulftown:25052] mca: base: close: unloading component tm > bash: orted: command not found > bash: orted: command not found > bash: orted: command not found > <<< > > > The following is the content of test.bash: > yiguang@gulftown testdmp]$ ./test.bash > #!/bin/sh -f > #nohup > # > # > >--- > < > adinahome=/usr/adina/system8.8dmp > mpirunfile=$adinahome/bin/mpirun > # > # Set envars for mpirun and orted > # > export PATH=$adinahome/bin:$adinahome/tools:$PATH > export LD_LIBRARY_PATH=$adinahome/lib:$LD_LIBRARY_PATH > # > # > # run DMP problem > # > mcaprefix="--prefix $adinahome" > mcarshagent="--mca plm_rsh_agent rsh:ssh" > mcatmpdir="--mca orte_tmpdir_base /tmp" > mcaopenibmsg="--mca btl_openib_warn_default_gid_prefix 0" > mcaenvars="-x PATH -x LD_LIBRARY_PATH" > mcabtlconn="--mca btl openib,sm,self" > mcaplmbase="--mca plm_base_verbose 100" > >
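Jeff's debug "orted" stand-in above can be put in place with a few lines of shell. This is a sketch: a temp directory stands in for the real install location, which must be a directory on the *remote* default PATH (e.g. a shared $HOME/bin) and is site-specific.

```shell
#!/bin/sh
# Install the debug "orted" script into a directory (here a temp dir as
# a stand-in) and make it executable. The script's first line ":" is a
# shell no-op, as in Jeff's original.
DEBUG_BIN=$(mktemp -d)
cat > "$DEBUG_BIN/orted" <<'EOF'
:
echo ===environment===
env | sort
echo ===environment end===
sleep 1000
EOF
chmod +x "$DEBUG_BIN/orted"
echo "debug orted installed in $DEBUG_BIN"
```

With this in place on the slave nodes, "ps" during the mpirun shows the full rsh command line, and (with --leave-session-attached) the script's stdout shows the environment the daemon actually received.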
Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes
Hi Jeff,

Thanks. I tried what you suggested. Here is the output:

>>>
[yiguang@gulftown testdmp]$ ./test.bash
[gulftown:25052] mca: base: components_open: Looking for plm components
[gulftown:25052] mca: base: components_open: opening plm components
[gulftown:25052] mca: base: components_open: found loaded component rsh
[gulftown:25052] mca: base: components_open: component rsh has no register function
[gulftown:25052] mca: base: components_open: component rsh open function successful
[gulftown:25052] mca: base: components_open: found loaded component slurm
[gulftown:25052] mca: base: components_open: component slurm has no register function
[gulftown:25052] mca: base: components_open: component slurm open function successful
[gulftown:25052] mca: base: components_open: found loaded component tm
[gulftown:25052] mca: base: components_open: component tm has no register function
[gulftown:25052] mca: base: components_open: component tm open function successful
[gulftown:25052] mca:base:select: Auto-selecting plm components
[gulftown:25052] mca:base:select:( plm) Querying component [rsh]
[gulftown:25052] mca:base:select:( plm) Query of component [rsh] set priority to 10
[gulftown:25052] mca:base:select:( plm) Querying component [slurm]
[gulftown:25052] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[gulftown:25052] mca:base:select:( plm) Querying component [tm]
[gulftown:25052] mca:base:select:( plm) Skipping component [tm]. Query failed to return a module
[gulftown:25052] mca:base:select:( plm) Selected component [rsh]
[gulftown:25052] mca: base: close: component slurm closed
[gulftown:25052] mca: base: close: unloading component slurm
[gulftown:25052] mca: base: close: component tm closed
[gulftown:25052] mca: base: close: unloading component tm
bash: orted: command not found
bash: orted: command not found
bash: orted: command not found
<<<

The following is the content of test.bash:

>>>
#!/bin/sh -f
#nohup
#
#
>---<
adinahome=/usr/adina/system8.8dmp
mpirunfile=$adinahome/bin/mpirun
#
# Set envars for mpirun and orted
#
export PATH=$adinahome/bin:$adinahome/tools:$PATH
export LD_LIBRARY_PATH=$adinahome/lib:$LD_LIBRARY_PATH
#
#
# run DMP problem
#
mcaprefix="--prefix $adinahome"
mcarshagent="--mca plm_rsh_agent rsh:ssh"
mcatmpdir="--mca orte_tmpdir_base /tmp"
mcaopenibmsg="--mca btl_openib_warn_default_gid_prefix 0"
mcaenvars="-x PATH -x LD_LIBRARY_PATH"
mcabtlconn="--mca btl openib,sm,self"
mcaplmbase="--mca plm_base_verbose 100"

mcaparams="$mcaprefix $mcaenvars $mcarshagent $mcaopenibmsg $mcabtlconn $mcatmpdir $mcaplmbase"

$mpirunfile $mcaparams --app addmpw-hostname
<<<

While the content of addmpw-hostname is:

>>>
-n 1 -host gulftown hostname
-n 1 -host ibnode001 hostname
-n 1 -host ibnode002 hostname
-n 1 -host ibnode003 thostname
<<<

After this, I also tried to specify the orted through:

--mca orte_launch_agent $adinahome/bin/orted

then orted could be found on the slave nodes, but now the shared libs in $adinahome/lib are not on the LD_LIBRARY_PATH.

Any comments?

Thanks,
Yiguang
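One workaround sketch for the "orted is found but the shared libs are not" situation described above: point --mca orte_launch_agent at a small wrapper that restores LD_LIBRARY_PATH and then execs the real orted. This is not a documented Open MPI recipe, just a generic shell technique; the wrapper path /tmp/orted-wrapper is an assumption, and $adinahome matches the script above.

```shell
#!/bin/sh
# Generate a wrapper launch agent that sets LD_LIBRARY_PATH before
# exec'ing the real orted. The heredoc is unquoted, so $ADINAHOME is
# expanded now, while the escaped \$LD_LIBRARY_PATH and \$@ stay
# literal and are evaluated on the slave node at launch time.
ADINAHOME=/usr/adina/system8.8dmp
WRAPPER=/tmp/orted-wrapper
cat > "$WRAPPER" <<EOF
#!/bin/sh
export LD_LIBRARY_PATH="$ADINAHOME/lib:\$LD_LIBRARY_PATH"
exec "$ADINAHOME/bin/orted" "\$@"
EOF
chmod +x "$WRAPPER"
echo "wrapper ready: $WRAPPER"
```

The wrapper must exist at the same path on every slave node; it would then be passed as --mca orte_launch_agent /tmp/orted-wrapper.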
Re: [OMPI users] orted daemon not found! --- environment not passed to slave nodes?
The intent of the --prefix option (or using the full path name to mpirun) was exactly for the purpose of not requiring changes to the .bashrc.

Can you run with "--mca plm_base_verbose 100" on your command line? This will show us the exact rsh/ssh command line that is being executed -- it might shed some light on what is going on here. For example:

mpirun --mca plm_base_verbose 100 --host A,B hostname

On Feb 27, 2012, at 10:41 AM, ya...@adina.com wrote:

> Greetings!
>
> I have tried to run the ring_c example test from a bash script. In this
> bash script, I set up PATH and LD_LIBRARY_PATH (I do not want to
> disturb ~/.bashrc, etc.), then use the full path of mpirun to invoke the
> MPI processes; mpirun and orted are both on the PATH. However,
> from the Open MPI message, orted was not found -- to me, it was
> not found only on the slave nodes. Then I tried to set --prefix or -x
> PATH -x LD_LIBRARY_PATH, hoping these envars would be passed to
> the slave nodes, but it turned out they are not forwarded.
>
> On the other hand, if I set the same PATH and
> LD_LIBRARY_PATH in ~/.bashrc, which is shared by all nodes,
> mpirun from the bash script runs fine and orted can be found. This is
> easy to understand, but I really do not want to change ~/.bashrc.
>
> It seems the non-interactive bash shell does not pass envars to
> the slave nodes.
>
> Any comments and solutions?
>
> Thanks,
> Yiguang

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
[OMPI users] orted daemon not found! --- environment not passed to slave nodes?
Greetings!

I have tried to run the ring_c example test from a bash script. In this bash script, I set up PATH and LD_LIBRARY_PATH (I do not want to disturb ~/.bashrc, etc.), then use the full path of mpirun to invoke the MPI processes; mpirun and orted are both on the PATH. However, from the Open MPI message, orted was not found -- to me, it was not found only on the slave nodes. Then I tried to set --prefix or -x PATH -x LD_LIBRARY_PATH, hoping these envars would be passed to the slave nodes, but it turned out they are not forwarded.

On the other hand, if I set the same PATH and LD_LIBRARY_PATH in ~/.bashrc, which is shared by all nodes, mpirun from the bash script runs fine and orted can be found. This is easy to understand, but I really do not want to change ~/.bashrc.

It seems the non-interactive bash shell does not pass envars to the slave nodes.

Any comments and solutions?

Thanks,
Yiguang
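The behavior described in this thread comes down to one shell fact: variables exported in the launching script reach local child processes, but the fresh non-interactive shell that rsh/ssh starts on a slave node gets a clean environment (plus whatever the remote startup files set). A minimal local illustration, using env -i as a stand-in for the remote shell and a hypothetical variable MYAPP_HOME in place of PATH/LD_LIBRARY_PATH:

```shell
#!/bin/sh
# MYAPP_HOME is a made-up variable for illustration only.
export MYAPP_HOME=/opt/myapp

# A local child shell inherits the exported variable...
sh -c 'echo "local child sees: [$MYAPP_HOME]"'
# prints: local child sees: [/opt/myapp]

# ...but a shell started with a scrubbed environment (mimicking the
# fresh non-interactive shell on a slave node) does not:
env -i /bin/sh -c 'echo "remote-like shell sees: [$MYAPP_HOME]"'
# prints: remote-like shell sees: []
```

This is why the variables must travel some other way: per-node startup files like ~/.bashrc, mpirun's -x forwarding, or --prefix, which makes mpirun construct the remote PATH/LD_LIBRARY_PATH itself.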