On Thu, Jul 19, 2007 at 01:04:27PM -0600, Ralph H Castain wrote:
> I fixed the specific problem of setting the LD_LIBRARY_PATH (and PATH,
> though that wasn't mentioned) for the case of procs spawned locally by
> mpirun - see r15516. Please confirm that the problem is gone and/or let me
> know if it persists for you.

My test cases now work. Thanks.
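(For anyone who wants to repeat the check without pulling in the whole test
suite, a trivial program along these lines is enough - just a sketch, and
env_check.c is a made-up name, not something from the tests. Run it as
"mpirun -np 1 -H `hostname` ./env_check" and see whether the launcher's
variables show up.)

    /* env_check.c: print the variables mpirun is expected to set up for
     * locally spawned procs, so the r15516 behavior can be verified. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const char *ld   = getenv("LD_LIBRARY_PATH");
        const char *path = getenv("PATH");

        printf("LD_LIBRARY_PATH=%s\n", ld   ? ld   : "(unset)");
        printf("PATH=%s\n",            path ? path : "(unset)");
        return 0;
    }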
>
> The issue of name resolution is a more general problem that will take some
> discussion - to occur separately from this chain. So some of the behavior
> you cited continues for the moment.
>
> Thanks
> Ralph
>
>
> On 7/19/07 9:39 AM, "Ralph H Castain" <r...@lanl.gov> wrote:
>
> > Talked with Brian and we have identified the problem and a fix - will come
> > in later today.
> >
> > Thanks
> > Ralph
> >
> >
> > On 7/19/07 9:24 AM, "Ralph H Castain" <r...@lanl.gov> wrote:
> >
> >> You are correct - I misread the note. My bad.
> >>
> >> I'll look at how we might ensure the LD_LIBRARY_PATH shows up correctly -
> >> shouldn't be a big deal.
> >>
> >>
> >> On 7/19/07 9:12 AM, "George Bosilca" <bosi...@cs.utk.edu> wrote:
> >>
> >>> The second execution (the one that you make reference to) is the one
> >>> that works fine. The failing one is the first one, where
> >>> LD_LIBRARY_PATH is not provided. As Gleb indicated, using localhost
> >>> makes the problem vanish.
> >>>
> >>> george.
> >>>
> >>> On Jul 19, 2007, at 10:57 AM, Ralph H Castain wrote:
> >>>
> >>>> But it *does* provide an LD_LIBRARY_PATH that is pointing to your
> >>>> openmpi installation - it says it did it right here in your debug output:
> >>>>
> >>>>>>> [elfit1:14752] pls:rsh: reset LD_LIBRARY_PATH: /home/glebn/openmpi/lib
> >>>>
> >>>> I suspect that the problem isn't in the launcher, but rather in the iof
> >>>> again. Why don't we wait until those fixes come into the trunk before
> >>>> chasing our tails any further?
> >>>>
> >>>>
> >>>> On 7/19/07 8:18 AM, "Gleb Natapov" <gl...@voltaire.com> wrote:
> >>>>
> >>>>> On Thu, Jul 19, 2007 at 08:07:51AM -0600, Ralph H Castain wrote:
> >>>>>> Interesting. Apparently, it is getting a NULL back when it tries to
> >>>>>> access the LD_LIBRARY_PATH in your environment. Here is the code
> >>>>>> involved:
> >>>>>>
> >>>>>> newenv = opal_os_path( false, prefix_dir, lib_base, NULL );
> >>>>>> oldenv = getenv("LD_LIBRARY_PATH");
> >>>>>> if (NULL != oldenv) {
> >>>>>>     char* temp;
> >>>>>>     asprintf(&temp, "%s:%s", newenv, oldenv);
> >>>>>>     free(newenv);
> >>>>>>     newenv = temp;
> >>>>>> }
> >>>>>> opal_setenv("LD_LIBRARY_PATH", newenv, true, &env);
> >>>>>> if (mca_pls_rsh_component.debug) {
> >>>>>>     opal_output(0, "pls:rsh: reset LD_LIBRARY_PATH: %s", newenv);
> >>>>>> }
> >>>>>> free(newenv);
> >>>>>>
> >>>>>> So you can see that the only way we can get your debugging output is
> >>>>>> for the LD_LIBRARY_PATH in your starting environment to be NULL. Note
> >>>>>> that this comes after we fork, so we are talking about the child
> >>>>>> process - not sure that matters, but may as well point it out.
> >>>>>>
> >>>>>> So the question is: why do you not have LD_LIBRARY_PATH set in your
> >>>>>> environment when you provide a different hostname?
> >>>>> Right, I don't have LD_LIBRARY_PATH set in my environment, but I expect
> >>>>> that mpirun will provide a working environment for all ranks, not just
> >>>>> remote ones. This is how it worked before.
> >>>>> Perhaps that was a bug, but it was a useful bug :)
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 7/19/07 7:45 AM, "Gleb Natapov" <gl...@voltaire.com> wrote:
> >>>>>>
> >>>>>>> On Wed, Jul 18, 2007 at 09:08:38PM +0300, Gleb Natapov wrote:
> >>>>>>>> On Wed, Jul 18, 2007 at 09:08:47AM -0600, Ralph H Castain wrote:
> >>>>>>>>> But this will lockup:
> >>>>>>>>>
> >>>>>>>>> pn1180961:~/openmpi/trunk rhc$ mpirun -n 1 -host pn1180961 printenv | grep LD
> >>>>>>>>>
> >>>>>>>>> The reason is that the hostname in this last command doesn't match the
> >>>>>>>>> hostname I get when I query my interfaces, so mpirun thinks it must be a
> >>>>>>>>> remote host - and so we stick in ssh until that times out. Which could be
> >>>>>>>>> quick on your machine, but takes awhile for me.
> >>>>>>>>>
> >>>>>>>> This is not my case. mpirun resolves the hostname and runs env, but
> >>>>>>>> LD_LIBRARY_PATH is not there. If I use the full name like this:
> >>>>>>>> # /home/glebn/openmpi/bin/mpirun -np 1 -H elfit1.voltaire.com env | grep LD_LIBRARY_PATH
> >>>>>>>> LD_LIBRARY_PATH=/home/glebn/openmpi/lib
> >>>>>>>>
> >>>>>>>> everything is OK.
> >>>>>>>>
> >>>>>>> More info. If I provide the hostname to mpirun as returned by the
> >>>>>>> command "hostname", LD_LIBRARY_PATH is not set:
> >>>>>>> # /home/glebn/openmpi/bin/mpirun -np 1 -H `hostname` env | grep LD
> >>>>>>> OLDPWD=/home/glebn/OpenMPI/ompi-tests/intel_tests
> >>>>>>>
> >>>>>>> If I provide any other name that resolves to the same IP, then
> >>>>>>> LD_LIBRARY_PATH is set:
> >>>>>>> # /home/glebn/openmpi/bin/mpirun -np 1 -H localhost env | grep LD
> >>>>>>> OLDPWD=/home/glebn/OpenMPI/ompi-tests/intel_tests
> >>>>>>> LD_LIBRARY_PATH=/home/glebn/openmpi/lib
> >>>>>>>
> >>>>>>> Here is the debug output of the "bad" run:
> >>>>>>> /home/glebn/openmpi/bin/mpirun -np 1 -H `hostname` -mca pls_rsh_debug 1 echo
> >>>>>>> [elfit1:14730] pls:rsh: launching job 1
> >>>>>>> [elfit1:14730] pls:rsh: no new daemons to launch
> >>>>>>>
> >>>>>>> Here is the good one:
> >>>>>>> /home/glebn/openmpi/bin/mpirun -np 1 -H localhost -mca pls_rsh_debug 1 echo
> >>>>>>> [elfit1:14752] pls:rsh: launching job 1
> >>>>>>> [elfit1:14752] pls:rsh: local csh: 0, local sh: 1
> >>>>>>> [elfit1:14752] pls:rsh: assuming same remote shell as local shell
> >>>>>>> [elfit1:14752] pls:rsh: remote csh: 0, remote sh: 1
> >>>>>>> [elfit1:14752] pls:rsh: final template argv:
> >>>>>>> [elfit1:14752] pls:rsh: /usr/bin/ssh <template> orted --name <template>
> >>>>>>> --num_procs 1 --vpid_start 0 --nodename <template> --universe
> >>>>>>> root@elfit1:default-universe-14752 --nsreplica
> >>>>>>> "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" --gprreplica
> >>>>>>> "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" -mca
> >>>>>>> mca_base_param_file_path
> >>>>>>> /home/glebn/openmpi//share/openmpi/amca-param-sets:/home/USERS/glebn/openmpiwd
> >>>>>>> -mca mca_base_param_file_path_force /home/USERS/glebn/openmpiwd
> >>>>>>> [elfit1:14752] pls:rsh: launching on node localhost
> >>>>>>> [elfit1:14752] pls:rsh: localhost is a LOCAL node
> >>>>>>> [elfit1:14752] pls:rsh: reset PATH: /home/glebn/openmpi/bin:/home/USERS/lenny/MPI/mpi/bin:/opt/vltmpi/OPENIB/mpi/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin
> >>>>>>> [elfit1:14752] pls:rsh: reset LD_LIBRARY_PATH: /home/glebn/openmpi/lib
> >>>>>>> [elfit1:14752] pls:rsh: changing to directory /root
> >>>>>>> [elfit1:14752] pls:rsh: executing: (/home/glebn/openmpi/bin/orted) [orted
> >>>>>>> --name 0.0.1 --num_procs 1 --vpid_start 0 --nodename localhost --universe
> >>>>>>> root@elfit1:default-universe-14752 --nsreplica
> >>>>>>> "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" --gprreplica
> >>>>>>> "0.0.0;tcp://172.30.7.187:43017;tcp://192.168.7.187:43017" -mca
> >>>>>>> mca_base_param_file_path
> >>>>>>> /home/glebn/openmpi//share/openmpi/amca-param-sets:/home/USERS/glebn/openmpiwd
> >>>>>>> -mca mca_base_param_file_path_force /home/USERS/glebn/openmpiwd
> >>>>>>> --set-sid]
> >>>>>>>
> >>>>>>> --
> >>>>>>> Gleb.

--
Gleb.
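P.S. On the name-resolution question: the difference between `hostname`,
localhost and the FQDN above seems to come down to how the name passed with
-H is matched against mpirun's view of its own node. Just as an illustration
of that kind of comparison - a minimal sketch, not the code Open MPI actually
uses, and is_local.c is a made-up name - something like this shows how a name
can be checked against the local interfaces:

    /* is_local.c: report whether a hostname resolves to an address that is
     * configured on one of this machine's interfaces (IPv4 only, to keep
     * the sketch short). */
    #include <stdio.h>
    #include <string.h>
    #include <netdb.h>
    #include <ifaddrs.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    static int hostname_is_local(const char *name)
    {
        struct addrinfo hints, *res = NULL, *ai;
        struct ifaddrs *ifap = NULL, *ifa;
        int local = 0;

        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_INET;
        hints.ai_socktype = SOCK_STREAM;

        if (getaddrinfo(name, NULL, &hints, &res) != 0) {
            return 0;                      /* name does not resolve at all */
        }
        if (getifaddrs(&ifap) != 0) {
            freeaddrinfo(res);
            return 0;
        }

        /* compare every resolved address against every interface address */
        for (ai = res; ai != NULL && !local; ai = ai->ai_next) {
            struct in_addr want = ((struct sockaddr_in *)ai->ai_addr)->sin_addr;
            for (ifa = ifap; ifa != NULL; ifa = ifa->ifa_next) {
                if (ifa->ifa_addr == NULL || ifa->ifa_addr->sa_family != AF_INET)
                    continue;
                struct in_addr have = ((struct sockaddr_in *)ifa->ifa_addr)->sin_addr;
                if (want.s_addr == have.s_addr) {
                    local = 1;
                    break;
                }
            }
        }

        freeifaddrs(ifap);
        freeaddrinfo(res);
        return local;
    }

    int main(int argc, char **argv)
    {
        const char *name = (argc > 1) ? argv[1] : "localhost";
        printf("%s is %s\n", name, hostname_is_local(name) ? "local" : "not local");
        return 0;
    }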