Re: [OMPI devel] LD_LIBRARY_PATH and process launch on a head node
On Thu, Jul 19, 2007 at 01:04:27PM -0600, Ralph H Castain wrote:
> I fixed the specific problem of setting the LD_LIBRARY_PATH (and PATH,
> though that wasn't mentioned) for the case of procs spawned locally by
> mpirun - see r15516. Please confirm that the problem is gone and/or let me
> know if it persists for you.

My test cases now work. Thanks.

> The issue of name resolution is a more general problem that will take some
> discussion - to occur separately from this chain. So some of the behavior
> you cited continues for the moment.
>
> Thanks
> Ralph

[snip]
Re: [OMPI devel] LD_LIBRARY_PATH and process launch on a head node
I fixed the specific problem of setting the LD_LIBRARY_PATH (and PATH,
though that wasn't mentioned) for the case of procs spawned locally by
mpirun - see r15516. Please confirm that the problem is gone and/or let me
know if it persists for you.

The issue of name resolution is a more general problem that will take some
discussion - to occur separately from this chain. So some of the behavior
you cited continues for the moment.

Thanks
Ralph

On 7/19/07 9:39 AM, "Ralph H Castain" wrote:
> Talked with Brian and we have identified the problem and a fix - will come
> in later today.

[snip]
Re: [OMPI devel] LD_LIBRARY_PATH and process launch on a head node
Talked with Brian and we have identified the problem and a fix - will come
in later today.

Thanks
Ralph

On 7/19/07 9:24 AM, "Ralph H Castain" wrote:
> You are correct - I misread the note. My bad.
>
> I'll look at how we might ensure the LD_LIBRARY_PATH shows up correctly -
> shouldn't be a big deal.

[snip]
Re: [OMPI devel] LD_LIBRARY_PATH and process launch on a head node
You are correct - I misread the note. My bad.

I'll look at how we might ensure the LD_LIBRARY_PATH shows up correctly -
shouldn't be a big deal.

On 7/19/07 9:12 AM, "George Bosilca" wrote:
> The second execution (the one that you make reference to) is the one
> that works fine. The failing one is the first one, where
> LD_LIBRARY_PATH is not provided. As Gleb indicated, using localhost
> makes the problem vanish.
>
> george.

[snip]
Re: [OMPI devel] LD_LIBRARY_PATH and process launch on a head node
The second execution (the one that you make reference to) is the one
that works fine. The failing one is the first one, where
LD_LIBRARY_PATH is not provided. As Gleb indicated, using localhost
makes the problem vanish.

george.

On Jul 19, 2007, at 10:57 AM, Ralph H Castain wrote:
> But it *does* provide an LD_LIBRARY_PATH that is pointing to your openmpi
> installation - it says it did it right here in your debug output:
>
>     [elfit1:14752] pls:rsh: reset LD_LIBRARY_PATH: /home/glebn/openmpi/lib
>
> I suspect that the problem isn't in the launcher, but rather in the iof
> again. Why don't we wait until those fixes come into the trunk before
> chasing our tails any further?
>
> On 7/19/07 8:18 AM, "Gleb Natapov" wrote:
>> On Thu, Jul 19, 2007 at 08:07:51AM -0600, Ralph H Castain wrote:
>>> Interesting. Apparently, it is getting a NULL back when it tries to
>>> access the LD_LIBRARY_PATH in your environment. Here is the code involved:
>>>
>>>     newenv = opal_os_path( false, prefix_dir, lib_base, NULL );
>>>     oldenv = getenv("LD_LIBRARY_PATH");
>>>     if (NULL != oldenv) {
>>>         char* temp;
>>>         asprintf(&temp, "%s:%s", newenv, oldenv);
>>>         free(newenv);
>>>         newenv = temp;
>>>     }
>>>     opal_setenv("LD_LIBRARY_PATH", newenv, true, &env);
>>>     if (mca_pls_rsh_component.debug) {
>>>         opal_output(0, "pls:rsh: reset LD_LIBRARY_PATH: %s", newenv);
>>>     }
>>>     free(newenv);
>>>
>>> So you can see that the only way we can get your debugging output is for
>>> the LD_LIBRARY_PATH in your starting environment to be NULL. Note that
>>> this comes after we fork, so we are talking about the child process - not
>>> sure that matters, but may as well point it out.
>>>
>>> So the question is: why do you not have LD_LIBRARY_PATH set in your
>>> environment when you provide a different hostname?
>> Right, I don't have LD_LIBRARY_PATH set in my environment, but I expect
>> that mpirun will provide a working environment for all ranks, not just
>> remote ones. This is how it worked before. Perhaps that was a bug, but
>> this was a useful bug :)
>>
>>> On 7/19/07 7:45 AM, "Gleb Natapov" wrote:
>>>> On Wed, Jul 18, 2007 at 09:08:38PM +0300, Gleb Natapov wrote:
>>>>> On Wed, Jul 18, 2007 at 09:08:47AM -0600, Ralph H Castain wrote:
>>>>>> But this will lockup:
>>>>>>
>>>>>> pn1180961:~/openmpi/trunk rhc$ mpirun -n 1 -host pn1180961 printenv | grep LD
>>>>>>
>>>>>> The reason is that the hostname in this last command doesn't match the
>>>>>> hostname I get when I query my interfaces, so mpirun thinks it must be
>>>>>> a remote host - and so we stick in ssh until that times out. Which
>>>>>> could be quick on your machine, but takes awhile for me.
>>>>> This is not my case. mpirun resolves the hostname and runs env, but
>>>>> LD_LIBRARY_PATH is not there. If I use the full name like this
>>>>>
>>>>> # /home/glebn/openmpi/bin/mpirun -np 1 -H elfit1.voltaire.com env | grep LD_LIBRARY_PATH
>>>>> LD_LIBRARY_PATH=/home/glebn/openmpi/lib
>>>>>
>>>>> everything is OK.
>>>> More info. If I provide the hostname to mpirun as returned by the
>>>> command "hostname", the LD_LIBRARY_PATH is not set:
>>>>
>>>> # /home/glebn/openmpi/bin/mpirun -np 1 -H `hostname` env | grep LD
>>>> OLDPWD=/home/glebn/OpenMPI/ompi-tests/intel_tests
>>>>
>>>> If I provide any other name that resolves to the same IP, then
>>>> LD_LIBRARY_PATH is set:
>>>>
>>>> # /home/glebn/openmpi/bin/mpirun -np 1 -H localhost env | grep LD
>>>> OLDPWD=/home/glebn/OpenMPI/ompi-tests/intel_tests
>>>> LD_LIBRARY_PATH=/home/glebn/openmpi/lib
>>>>
>>>> Here is the debug output of a "bad" run:
>>>>
>>>> /home/glebn/openmpi/bin/mpirun -np 1 -H `hostname` -mca pls_rsh_debug 1 echo
>>>> [elfit1:14730] pls:rsh: launching job 1
>>>> [elfit1:14730] pls:rsh: no new daemons to launch
>>>>
>>>> Here is a good one:
>>>>
>>>> /home/glebn/openmpi/bin/mpirun -np 1 -H localhost -mca pls_rsh_debug 1 echo
>>>> [elfit1:14752] pls:rsh: launching job 1
>>>> [elfit1:14752] pls:rsh: local csh: 0, local sh: 1
>>>> [elfit1:14752] pls:rsh: assuming same remote shell as local shell
>>>> [elfit1:14752] pls:rsh: remote csh: 0, remote sh: 1
>>>> [elfit1:14752] pls:rsh: final template argv:
>>>> [elfit1:14752] pls:rsh:     /usr/bin/ssh orted --name
>>>>     --num_procs 1 --vpid_start 0 --nodename --universe
>>>>     root@elfit1:default-universe-14752 --nsreplica
>>>>     "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" --gprreplica
>>>>     "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" -mca
>>>>     mca_base_param_file_path
>>>>     /home/glebn/openmpi//share/openmpi/amca-param-sets:/home/USERS/glebn/openmpiwd
>>>>     -mca mca_base_param_file_path_force /home/USERS/glebn/openmpiwd
>>>> [elfit1:14752] pls:rsh: launching on node localhost
>>>> [elfit1:14752] pls:rsh: localhost is a LOCAL node
>>>> [elfit1:14752] pls:rsh: reset PATH:
>>>>     /home/glebn/openmpi/bin:/home/USERS/lenny/MPI/mpi/bin:/opt/vltmpi/OPENIB/mpi/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin
>>>> [elfit1:14752] pls:rsh: reset LD_LIBRARY_PATH: /home/glebn/openmpi/lib
>>>> [elfit1:14752] pls:rsh: changing to directory /root
>>>> [elfit1:14752] pls:rsh: executing: (/home/glebn/openmpi/bin/orted) [orted
>>>>     --name 0.0.1 --num_procs 1 --vpid_start 0 --nodename localhost
>>>>     --universe root@elfit1:def
Re: [OMPI devel] LD_LIBRARY_PATH and process launch on a head node
The problem occurs in the following situation. In the rsh PLS the number of
daemons that have to be spawned is set to zero (as mpirun now acts as a
daemon). Therefore the rsh PLS doesn't do anything except send the launch
order to the daemons; the remainder of the work is done in the ODLS.
However, as we will not spawn new daemons, the new application inherits
exactly the same environment as the mpirun application. The PATH and
LD_LIBRARY_PATH are not set except in the case where -x LD_LIBRARY_PATH is
used by the user.

voyager$ mpirun --prefix /Users/bosilca/opt/mpi -np 1 -host voyager printenv | grep LD_
<**there is no output here**>

voyager$ mpirun -x LD_LIBRARY_PATH=/toto --prefix /Users/bosilca/opt/mpi -np 1 -host voyager printenv | grep LD_
LD_LIBRARY_PATH=/Users/bosilca/opt/mpi/lib:/toto

However, using localhost seems to make the problem vanish, as the rsh PLS
will always be used.

voyager$ mpirun -x LD_LIBRARY_PATH=/toto --prefix /Users/bosilca/opt/mpi -np 1 -host localhost printenv | grep LD_
LD_LIBRARY_PATH=/Users/bosilca/opt/mpi/lib:/toto

voyager$ mpirun --prefix /Users/bosilca/opt/mpi -np 1 -host localhost printenv | grep LD_
LD_LIBRARY_PATH=/Users/bosilca/opt/mpi/lib

Digging a little deeper shows that the problem comes from the fact that
rmaps_node->nodename != orte_system_info.nodename when anything other than
localhost is provided.

george.

On Jul 19, 2007, at 10:35 AM, George Bosilca wrote:
> It wasn't a bug. There is a bunch of code there just to make sure PATH
> and LD_LIBRARY_PATH are set correctly. Yesterday we discovered that even
> if you force the --prefix in a similar execution environment the
> LD_LIBRARY_PATH doesn't get set. However, using localhost always solves
> the problem.
>
> george.

[snip]
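George's diagnosis - `rmaps_node->nodename != orte_system_info.nodename` - is a name comparison, which is exactly why two names for the same machine behave differently. A hedged sketch of the alternative, deciding local vs. remote by comparing resolved addresses against the local interfaces (`is_local_host` is an illustrative helper, not ORTE code):

```c
#include <stdio.h>
#include <string.h>
#include <netdb.h>
#include <ifaddrs.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Return 1 if any IPv4 address that `name` resolves to belongs to a
 * local interface; 0 otherwise. With this test, "elfit1", its FQDN,
 * and "localhost" would all be classified the same way. */
static int is_local_host(const char *name)
{
    struct addrinfo hints, *res = NULL, *r;
    struct ifaddrs *ifs = NULL, *ifa;
    int local = 0;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET;

    if (getaddrinfo(name, NULL, &hints, &res) != 0)
        return 0;
    if (getifaddrs(&ifs) != 0) {
        freeaddrinfo(res);
        return 0;
    }
    for (r = res; r != NULL && !local; r = r->ai_next) {
        struct in_addr want = ((struct sockaddr_in *)r->ai_addr)->sin_addr;
        for (ifa = ifs; ifa != NULL; ifa = ifa->ifa_next) {
            if (ifa->ifa_addr == NULL ||
                ifa->ifa_addr->sa_family != AF_INET)
                continue;
            struct in_addr have =
                ((struct sockaddr_in *)ifa->ifa_addr)->sin_addr;
            if (want.s_addr == have.s_addr) { local = 1; break; }
        }
    }
    freeifaddrs(ifs);
    freeaddrinfo(res);
    return local;
}

int main(void)
{
    printf("localhost local: %d\n", is_local_host("localhost"));
    return 0;
}
```

Address comparison is not free of pitfalls either (multi-homed hosts, stale DNS), which is presumably why the name-resolution question was deferred to a separate discussion.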
Re: [OMPI devel] LD_LIBRARY_PATH and process launch on a head node
But it *does* provide an LD_LIBRARY_PATH that is pointing to your openmpi
installation - it says it did it right here in your debug output:

    [elfit1:14752] pls:rsh: reset LD_LIBRARY_PATH: /home/glebn/openmpi/lib

I suspect that the problem isn't in the launcher, but rather in the iof
again. Why don't we wait until those fixes come into the trunk before
chasing our tails any further?

On 7/19/07 8:18 AM, "Gleb Natapov" wrote:
> On Thu, Jul 19, 2007 at 08:07:51AM -0600, Ralph H Castain wrote:
>> Interesting. Apparently, it is getting a NULL back when it tries to access
>> the LD_LIBRARY_PATH in your environment. Here is the code involved:
>> [snip]
>> So the question is: why do you not have LD_LIBRARY_PATH set in your
>> environment when you provide a different hostname?
> Right, I don't have LD_LIBRARY_PATH set in my environment, but I expect
> that mpirun will provide a working environment for all ranks, not just
> remote ones. This is how it worked before. Perhaps that was a bug, but
> this was a useful bug :)

[snip]
Re: [OMPI devel] LD_LIBRARY_PATH and process launch on a head node
It wasn't a bug. There is a bunch of code there just to make sure PATH and LD_LIBRARY_PATH are set correctly. Yesterday we discovered that even if you force the --prefix in a similar execution environment the LD_LIBRARY_PATH doesn't get set. However, using localhost always solves the problem.

george.

On Jul 19, 2007, at 10:18 AM, Gleb Natapov wrote:

On Thu, Jul 19, 2007 at 08:07:51AM -0600, Ralph H Castain wrote:

Interesting. Apparently, it is getting a NULL back when it tries to access the LD_LIBRARY_PATH in your environment. Here is the code involved:

    newenv = opal_os_path( false, prefix_dir, lib_base, NULL );
    oldenv = getenv("LD_LIBRARY_PATH");
    if (NULL != oldenv) {
        char* temp;
        asprintf(&temp, "%s:%s", newenv, oldenv);
        free(newenv);
        newenv = temp;
    }
    opal_setenv("LD_LIBRARY_PATH", newenv, true, &env);
    if (mca_pls_rsh_component.debug) {
        opal_output(0, "pls:rsh: reset LD_LIBRARY_PATH: %s", newenv);
    }
    free(newenv);

So you can see that the only way we can get your debugging output is for the LD_LIBRARY_PATH in your starting environment to be NULL. Note that this comes after we fork, so we are talking about the child process - not sure that matters, but may as well point it out.

So the question is: why do you not have LD_LIBRARY_PATH set in your environment when you provide a different hostname?

Right, I don't have LD_LIBRARY_PATH set in my environment, but I expect that mpirun will provide a working environment for all ranks, not just remote ones. This is how it worked before.
Re: [OMPI devel] LD_LIBRARY_PATH and process launch on a head node
On Thu, Jul 19, 2007 at 08:07:51AM -0600, Ralph H Castain wrote:
> Interesting. Apparently, it is getting a NULL back when it tries to access
> the LD_LIBRARY_PATH in your environment. Here is the code involved:
>
>     newenv = opal_os_path( false, prefix_dir, lib_base, NULL );
>     oldenv = getenv("LD_LIBRARY_PATH");
>     if (NULL != oldenv) {
>         char* temp;
>         asprintf(&temp, "%s:%s", newenv, oldenv);
>         free(newenv);
>         newenv = temp;
>     }
>     opal_setenv("LD_LIBRARY_PATH", newenv, true, &env);
>     if (mca_pls_rsh_component.debug) {
>         opal_output(0, "pls:rsh: reset LD_LIBRARY_PATH: %s", newenv);
>     }
>     free(newenv);
>
> So you can see that the only way we can get your debugging output is for the
> LD_LIBRARY_PATH in your starting environment to be NULL. Note that this
> comes after we fork, so we are talking about the child process - not sure
> that matters, but may as well point it out.
>
> So the question is: why do you not have LD_LIBRARY_PATH set in your
> environment when you provide a different hostname?

Right, I don't have LD_LIBRARY_PATH set in my environment, but I expect that mpirun will provide a working environment for all ranks, not just remote ones. This is how it worked before. Perhaps that was a bug, but it was a useful bug :)

> On 7/19/07 7:45 AM, "Gleb Natapov" wrote:
>
>> On Wed, Jul 18, 2007 at 09:08:38PM +0300, Gleb Natapov wrote:
>>> On Wed, Jul 18, 2007 at 09:08:47AM -0600, Ralph H Castain wrote:
>>>> But this will lockup:
>>>>
>>>> pn1180961:~/openmpi/trunk rhc$ mpirun -n 1 -host pn1180961 printenv | grep LD
>>>>
>>>> The reason is that the hostname in this last command doesn't match the
>>>> hostname I get when I query my interfaces, so mpirun thinks it must be a
>>>> remote host - and so we stick in ssh until that times out. Which could be
>>>> quick on your machine, but takes awhile for me.
>>>>
>>> This is not my case. mpirun resolves hostname and runs env but LD_LIBRARY_PATH is not there.
If I use full name like this > >> # /home/glebn/openmpi/bin/mpirun -np 1 -H elfit1.voltaire.com env | grep > >> LD_LIBRARY_PATH > >> LD_LIBRARY_PATH=/home/glebn/openmpi/lib > >> > >> everything is OK. > >> > > More info. If I provide hostname to mpirun as returned by command > > "hostname" the LD_LIBRARY_PATH is not set: > > # /home/glebn/openmpi/bin/mpirun -np 1 -H `hostname` env | grep LD > > OLDPWD=/home/glebn/OpenMPI/ompi-tests/intel_tests > > > > if I provide any other name that resolves to the same IP then > > LD_LIBRARY_PATH is set. > > # /home/glebn/openmpi/bin/mpirun -np 1 -H localhost env | grep LD > > OLDPWD=/home/glebn/OpenMPI/ompi-tests/intel_tests > > LD_LIBRARY_PATH=/home/glebn/openmpi/lib > > > > Here is debug output of "bad" run: > > /home/glebn/openmpi/bin/mpirun -np 1 -H `hostname` -mca pls_rsh_debug 1 echo > > [elfit1:14730] pls:rsh: launching job 1 > > [elfit1:14730] pls:rsh: no new daemons to launch > > > > Here is good one: > > /home/glebn/openmpi/bin/mpirun -np 1 -H localhost -mca pls_rsh_debug 1 echo > > [elfit1:14752] pls:rsh: launching job 1 > > [elfit1:14752] pls:rsh: local csh: 0, local sh: 1 > > [elfit1:14752] pls:rsh: assuming same remote shell as local shell > > [elfit1:14752] pls:rsh: remote csh: 0, remote sh: 1 > > [elfit1:14752] pls:rsh: final template argv: > > [elfit1:14752] pls:rsh: /usr/bin/ssh orted --name > > --num_procs 1 --vpid_start 0 --nodename --universe > > root@elfit1:default-universe-14752 --nsreplica > > "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" --gprreplica > > "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" -mca > > mca_base_param_file_path > > /home/glebn/openmpi//share/openmpi/amca-param-sets:/home/USERS/glebn/openmpiwd > > -mca mca_base_param_file_path_force /home/USERS/glebn/openmpiwd > > [elfit1:14752] pls:rsh: launching on node localhost > > [elfit1:14752] pls:rsh: localhost is a LOCAL node > > [elfit1:14752] pls:rsh: reset PATH: > > 
/home/glebn/openmpi/bin:/home/USERS/lenny/MPI/mpi/bin:/opt/vltmpi/OPENIB/mpi/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin
> > [elfit1:14752] pls:rsh: reset LD_LIBRARY_PATH: /home/glebn/openmpi/lib
> > [elfit1:14752] pls:rsh: changing to directory /root
> > [elfit1:14752] pls:rsh: executing: (/home/glebn/openmpi/bin/orted) [orted --name 0.0.1 --num_procs 1 --vpid_start 0 --nodename localhost --universe root@elfit1:default-universe-14752 --nsreplica "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" --gprreplica "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" -mca mca_base_param_file_path /home/glebn/openmpi//share/openmpi/amca-param-sets:/home/USERS/glebn/openmpiwd -mca mca_base_param_file_path_force /home/USERS/glebn/openmpiwd --set-sid]
> >
> > --
> > Gleb.
Re: [OMPI devel] LD_LIBRARY_PATH and process launch on a head node
Interesting. Apparently, it is getting a NULL back when it tries to access the LD_LIBRARY_PATH in your environment. Here is the code involved:

    newenv = opal_os_path( false, prefix_dir, lib_base, NULL );
    oldenv = getenv("LD_LIBRARY_PATH");
    if (NULL != oldenv) {
        char* temp;
        asprintf(&temp, "%s:%s", newenv, oldenv);
        free(newenv);
        newenv = temp;
    }
    opal_setenv("LD_LIBRARY_PATH", newenv, true, &env);
    if (mca_pls_rsh_component.debug) {
        opal_output(0, "pls:rsh: reset LD_LIBRARY_PATH: %s", newenv);
    }
    free(newenv);

So you can see that the only way we can get your debugging output is for the LD_LIBRARY_PATH in your starting environment to be NULL. Note that this comes after we fork, so we are talking about the child process - not sure that matters, but may as well point it out.

So the question is: why do you not have LD_LIBRARY_PATH set in your environment when you provide a different hostname?

On 7/19/07 7:45 AM, "Gleb Natapov" wrote:

> On Wed, Jul 18, 2007 at 09:08:38PM +0300, Gleb Natapov wrote:
>> On Wed, Jul 18, 2007 at 09:08:47AM -0600, Ralph H Castain wrote:
>>> But this will lockup:
>>>
>>> pn1180961:~/openmpi/trunk rhc$ mpirun -n 1 -host pn1180961 printenv | grep LD
>>>
>>> The reason is that the hostname in this last command doesn't match the
>>> hostname I get when I query my interfaces, so mpirun thinks it must be a
>>> remote host - and so we stick in ssh until that times out. Which could be
>>> quick on your machine, but takes awhile for me.
>>>
>> This is not my case. mpirun resolves hostname and runs env but
>> LD_LIBRARY_PATH is not there. If I use full name like this
>> # /home/glebn/openmpi/bin/mpirun -np 1 -H elfit1.voltaire.com env | grep LD_LIBRARY_PATH
>> LD_LIBRARY_PATH=/home/glebn/openmpi/lib
>>
>> everything is OK.
>>
> More info.
If I provide hostname to mpirun as returned by command > "hostname" the LD_LIBRARY_PATH is not set: > # /home/glebn/openmpi/bin/mpirun -np 1 -H `hostname` env | grep LD > OLDPWD=/home/glebn/OpenMPI/ompi-tests/intel_tests > > if I provide any other name that resolves to the same IP then > LD_LIBRARY_PATH is set. > # /home/glebn/openmpi/bin/mpirun -np 1 -H localhost env | grep LD > OLDPWD=/home/glebn/OpenMPI/ompi-tests/intel_tests > LD_LIBRARY_PATH=/home/glebn/openmpi/lib > > Here is debug output of "bad" run: > /home/glebn/openmpi/bin/mpirun -np 1 -H `hostname` -mca pls_rsh_debug 1 echo > [elfit1:14730] pls:rsh: launching job 1 > [elfit1:14730] pls:rsh: no new daemons to launch > > Here is good one: > /home/glebn/openmpi/bin/mpirun -np 1 -H localhost -mca pls_rsh_debug 1 echo > [elfit1:14752] pls:rsh: launching job 1 > [elfit1:14752] pls:rsh: local csh: 0, local sh: 1 > [elfit1:14752] pls:rsh: assuming same remote shell as local shell > [elfit1:14752] pls:rsh: remote csh: 0, remote sh: 1 > [elfit1:14752] pls:rsh: final template argv: > [elfit1:14752] pls:rsh: /usr/bin/ssh orted --name > --num_procs 1 --vpid_start 0 --nodename --universe > root@elfit1:default-universe-14752 --nsreplica > "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" --gprreplica > "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" -mca > mca_base_param_file_path > /home/glebn/openmpi//share/openmpi/amca-param-sets:/home/USERS/glebn/openmpiwd > -mca mca_base_param_file_path_force /home/USERS/glebn/openmpiwd > [elfit1:14752] pls:rsh: launching on node localhost > [elfit1:14752] pls:rsh: localhost is a LOCAL node > [elfit1:14752] pls:rsh: reset PATH: > /home/glebn/openmpi/bin:/home/USERS/lenny/MPI/mpi/bin:/opt/vltmpi/OPENIB/mpi/b > in:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/ > bin:/usr/sbin:/usr/bin:/root/bin > [elfit1:14752] pls:rsh: reset LD_LIBRARY_PATH: /home/glebn/openmpi/lib > [elfit1:14752] pls:rsh: changing to directory /root > 
[elfit1:14752] pls:rsh: executing: (/home/glebn/openmpi/bin/orted) [orted > --name 0.0.1 --num_procs 1 --vpid_start 0 --nodename localhost --universe > root@elfit1:default-universe-14752 --nsreplica > "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" --gprreplica > "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" -mca > mca_base_param_file_path > /home/glebn/openmpi//share/openmpi/amca-param-sets:/home/USERS/glebn/openmpiwd > -mca mca_base_param_file_path_force /home/USERS/glebn/openmpiwd --set-sid] > > -- > Gleb. > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel
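[Editor's note] The prepend logic in the snippet Ralph quotes can be exercised in isolation. Below is a minimal, self-contained sketch, not the actual Open MPI code: `opal_os_path()` and `opal_setenv()` are replaced with plain libc calls, and `build_ld_library_path` is an invented name. It shows that when the variable is unset the result is just the prefix's lib directory, which is exactly the string in the `pls:rsh: reset LD_LIBRARY_PATH` debug line.

```c
#define _GNU_SOURCE   /* for asprintf(), as in the original snippet */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Build the value the launcher would assign to LD_LIBRARY_PATH:
 * "<prefix>/lib", with the previous value appended after a ':' only
 * when one was set. Caller frees the result. */
char *build_ld_library_path(const char *prefix, const char *oldenv)
{
    char *newenv;
    if (asprintf(&newenv, "%s/lib", prefix) < 0)
        return NULL;
    if (oldenv != NULL) {
        char *temp;
        if (asprintf(&temp, "%s:%s", newenv, oldenv) < 0)
            temp = NULL;
        free(newenv);
        newenv = temp;
    }
    return newenv;
}
```

With the variable unset this yields "/home/glebn/openmpi/lib", matching Gleb's good run; the failing `-H \`hostname\`` run never reaches this code at all, since the rsh launcher decides there are "no new daemons to launch".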
Re: [OMPI devel] LD_LIBRARY_PATH and process launch on a head node
On Wed, Jul 18, 2007 at 09:08:38PM +0300, Gleb Natapov wrote: > On Wed, Jul 18, 2007 at 09:08:47AM -0600, Ralph H Castain wrote: > > But this will lockup: > > > > pn1180961:~/openmpi/trunk rhc$ mpirun -n 1 -host pn1180961 printenv | grep > > LD > > > > The reason is that the hostname in this last command doesn't match the > > hostname I get when I query my interfaces, so mpirun thinks it must be a > > remote host - and so we stick in ssh until that times out. Which could be > > quick on your machine, but takes awhile for me. > > > This is not my case. mpirun resolves hostname and runs env but > LD_LIBRARY_PATH is not there. If I use full name like this > # /home/glebn/openmpi/bin/mpirun -np 1 -H elfit1.voltaire.com env | grep > LD_LIBRARY_PATH > LD_LIBRARY_PATH=/home/glebn/openmpi/lib > > everything is OK. > More info. If I provide hostname to mpirun as returned by command "hostname" the LD_LIBRARY_PATH is not set: # /home/glebn/openmpi/bin/mpirun -np 1 -H `hostname` env | grep LD OLDPWD=/home/glebn/OpenMPI/ompi-tests/intel_tests if I provide any other name that resolves to the same IP then LD_LIBRARY_PATH is set. 
# /home/glebn/openmpi/bin/mpirun -np 1 -H localhost env | grep LD OLDPWD=/home/glebn/OpenMPI/ompi-tests/intel_tests LD_LIBRARY_PATH=/home/glebn/openmpi/lib Here is debug output of "bad" run: /home/glebn/openmpi/bin/mpirun -np 1 -H `hostname` -mca pls_rsh_debug 1 echo [elfit1:14730] pls:rsh: launching job 1 [elfit1:14730] pls:rsh: no new daemons to launch Here is good one: /home/glebn/openmpi/bin/mpirun -np 1 -H localhost -mca pls_rsh_debug 1 echo [elfit1:14752] pls:rsh: launching job 1 [elfit1:14752] pls:rsh: local csh: 0, local sh: 1 [elfit1:14752] pls:rsh: assuming same remote shell as local shell [elfit1:14752] pls:rsh: remote csh: 0, remote sh: 1 [elfit1:14752] pls:rsh: final template argv: [elfit1:14752] pls:rsh: /usr/bin/ssh orted --name --num_procs 1 --vpid_start 0 --nodename --universe root@elfit1:default-universe-14752 --nsreplica "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" --gprreplica "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" -mca mca_base_param_file_path /home/glebn/openmpi//share/openmpi/amca-param-sets:/home/USERS/glebn/openmpiwd -mca mca_base_param_file_path_force /home/USERS/glebn/openmpiwd [elfit1:14752] pls:rsh: launching on node localhost [elfit1:14752] pls:rsh: localhost is a LOCAL node [elfit1:14752] pls:rsh: reset PATH: /home/glebn/openmpi/bin:/home/USERS/lenny/MPI/mpi/bin:/opt/vltmpi/OPENIB/mpi/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin [elfit1:14752] pls:rsh: reset LD_LIBRARY_PATH: /home/glebn/openmpi/lib [elfit1:14752] pls:rsh: changing to directory /root [elfit1:14752] pls:rsh: executing: (/home/glebn/openmpi/bin/orted) [orted --name 0.0.1 --num_procs 1 --vpid_start 0 --nodename localhost --universe root@elfit1:default-universe-14752 --nsreplica "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" --gprreplica "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" -mca mca_base_param_file_path 
/home/glebn/openmpi//share/openmpi/amca-param-sets:/home/USERS/glebn/openmpiwd -mca mca_base_param_file_path_force /home/USERS/glebn/openmpiwd --set-sid] -- Gleb.
Re: [OMPI devel] LD_LIBRARY_PATH and process launch on a head node
On Wed, Jul 18, 2007 at 09:08:47AM -0600, Ralph H Castain wrote: > But this will lockup: > > pn1180961:~/openmpi/trunk rhc$ mpirun -n 1 -host pn1180961 printenv | grep > LD > > The reason is that the hostname in this last command doesn't match the > hostname I get when I query my interfaces, so mpirun thinks it must be a > remote host - and so we stick in ssh until that times out. Which could be > quick on your machine, but takes awhile for me. > This is not my case. mpirun resolves hostname and runs env but LD_LIBRARY_PATH is not there. If I use full name like this # /home/glebn/openmpi/bin/mpirun -np 1 -H elfit1.voltaire.com env | grep LD_LIBRARY_PATH LD_LIBRARY_PATH=/home/glebn/openmpi/lib everything is OK. -- Gleb.
Re: [OMPI devel] LD_LIBRARY_PATH and process launch on a head node
It works for me in both cases, provided I give the fully qualified host name for your first example. In other words, these work:

pn1180961:~/openmpi/trunk rhc$ mpirun -n 1 -host localhost printenv | grep LD
[pn1180961.lanl.gov:22021] [0.0] test of print_name
OLDPWD=/Users/rhc/openmpi
LD_LIBRARY_PATH=/Users/rhc/openmpi/lib:/Users/rhc/lib:/opt/local/lib:/usr/local/lib:

pn1180961:~/openmpi/trunk rhc$ mpirun -n 1 -host pn1180961.lanl.gov printenv | grep LD
[pn1180961.lanl.gov:22012] [0.0] test of print_name
OLDPWD=/Users/rhc/openmpi
LD_LIBRARY_PATH=/Users/rhc/openmpi/lib:/Users/rhc/lib:/opt/local/lib:/usr/local/lib:

But this will lockup:

pn1180961:~/openmpi/trunk rhc$ mpirun -n 1 -host pn1180961 printenv | grep LD

The reason is that the hostname in this last command doesn't match the hostname I get when I query my interfaces, so mpirun thinks it must be a remote host - and so we stick in ssh until that times out. Which could be quick on your machine, but takes awhile for me.

Hope that helps
Ralph

On 7/18/07 7:45 AM, "Gleb Natapov" wrote:

> On Wed, Jul 18, 2007 at 04:27:15PM +0300, Gleb Natapov wrote:
>> Hi,
>>
>> With current trunk LD_LIBRARY_PATH is not set for ranks that are
>> launched on the head node. This worked previously.
>>
> Some more info. I use rsh pls.
> elfit1# /home/glebn/openmpi/bin/mpirun -np 1 -H elfit1 env | grep LD_LIBRARY_PATH
> gives nothing.
>
> The strange thing that I just found is that this one works
> elfit1# /home/glebn/openmpi/bin/mpirun -np 1 -H localhost env | grep LD_LIBRARY_PATH
> LD_LIBRARY_PATH=/home/glebn/openmpi/lib
>
> --
> Gleb.
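[Editor's note] The lockup Ralph describes boils down to a textual comparison: if the name given to -host does not exactly match any name mpirun collected for the local node, the node is treated as remote and ssh is attempted. A hypothetical sketch of that decision follows; `local_names` is an invented stand-in for the interface/hostname list, not the real ORTE data structure:

```c
#include <assert.h>
#include <string.h>

/* Return 1 if `host` textually matches one of the names this node is
 * known by, 0 otherwise. The match is on strings, not on resolved
 * addresses, so "pn1180961" does not match "pn1180961.lanl.gov" even
 * though both name the same machine - which is why the short hostname
 * sends mpirun down the ssh path. */
int host_is_local(const char *host, const char **local_names, int n)
{
    for (int i = 0; i < n; i++) {
        if (strcmp(host, local_names[i]) == 0)
            return 1;
    }
    return 0;
}
```

Resolving both names to addresses before comparing would make the two spellings equivalent; that is the "name resolution" discussion Ralph defers in his final reply.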
Re: [OMPI devel] LD_LIBRARY_PATH and process launch on a head node
On Wed, Jul 18, 2007 at 07:48:17AM -0600, Ralph H Castain wrote:
> I believe that was fixed in r15405 - are you at that rev level?
I am on the latest revision.

> On 7/18/07 7:27 AM, "Gleb Natapov" wrote:
>> Hi,
>>
>> With current trunk LD_LIBRARY_PATH is not set for ranks that are
>> launched on the head node. This worked previously.
>>
>> --
>> Gleb.

--
Gleb.
Re: [OMPI devel] LD_LIBRARY_PATH and process launch on a head node
I believe that was fixed in r15405 - are you at that rev level?

On 7/18/07 7:27 AM, "Gleb Natapov" wrote:
> Hi,
>
> With current trunk LD_LIBRARY_PATH is not set for ranks that are
> launched on the head node. This worked previously.
>
> --
> Gleb.
Re: [OMPI devel] LD_LIBRARY_PATH and process launch on a head node
On Wed, Jul 18, 2007 at 04:27:15PM +0300, Gleb Natapov wrote: > Hi, > > With current trunk LD_LIBRARY_PATH is not set for ranks that are > launched on the head node. This worked previously. > Same more info. I use rsh pls. elfit1# /home/glebn/openmpi/bin/mpirun -np 1 -H elfit1 env | grep LD_LIBRARY_PATH gives nothing. The strange thing that I just found is that this one works elfit1# /home/glebn/openmpi/bin/mpirun -np 1 -H localhost env | grep LD_LIBRARY_PATH LD_LIBRARY_PATH=/home/glebn/openmpi/lib -- Gleb.
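[Editor's note] The fix Ralph later commits (r15516) has to act on the child side of the fork: for locally spawned procs the launcher can only hand the rank a usable LD_LIBRARY_PATH by setting it between fork() and exec(), since the parent's environment is never touched. An illustrative sketch of that pattern, not the actual pls code:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork, set LD_LIBRARY_PATH only in the child (where the launcher
 * would then exec the rank's binary), and report whether the child
 * observed the value. Returns 1 on success. */
int child_sees_env(const char *value)
{
    pid_t pid = fork();
    if (pid < 0)
        return 0;
    if (pid == 0) {
        /* child: fix up the environment after fork, before exec */
        setenv("LD_LIBRARY_PATH", value, 1);
        const char *v = getenv("LD_LIBRARY_PATH");
        _exit((v != NULL && strcmp(v, value) == 0) ? 0 : 1);
    }
    int status;
    waitpid(pid, &status, 0);
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Even starting from a shell where LD_LIBRARY_PATH is unset, as in Gleb's report, the child sees the value while the parent's environment stays untouched.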