The version of OpenMPI I am running was built without any SGE-related options in the configure command, and it was built on a system that did not have SGE, so I would presume SGE support is absent.
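One way to check this rather than presume it is to look at what ompi_info reports. As far as I know, a build with SGE support lists a "gridengine" component there. The sketch below assumes the ompi_info belonging to this same install is on the PATH (otherwise call it by its full path under the install's bin directory); the helper name has_gridengine is made up for illustration.

```shell
#!/bin/sh
# Sketch only. has_gridengine is a hypothetical helper name.
# Succeeds if ompi_info-style text on stdin mentions a gridengine component.
has_gridengine() {
    grep -q 'gridengine'
}

# Only runs if an ompi_info binary can be found on PATH (an assumption).
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info | has_gridengine; then
        echo "gridengine component found: SGE support appears to be compiled in"
    else
        echo "no gridengine component found"
    fi
fi
```

If the component is listed, mpirun may try to launch through SGE even though I pass my own machinefile.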
My hope is that OpenMPI will not attempt to use SGE in any way. But perhaps it is trying to.

Yes, I did supply a machinefile on my own. It is formed on the fly within the submitted script by parsing the PE_HOSTFILE, and I leave the resulting file lying around. The result appears to be correct, i.e. it includes those nodes (and only those nodes) allocated to the job.

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Reuti
Sent: Tuesday, September 13, 2011 4:27 PM
To: Open MPI Users
Subject: EXTERNAL: Re: [OMPI users] Problem running under SGE

On 13.09.2011 at 23:18, Blosch, Edwin L wrote:

> I'm able to run this command below from an interactive shell window:
>
> <path>/bin/mpirun --machinefile mpihosts.dat -np 16 -mca plm_rsh_agent
> /usr/bin/rsh -x MPI_ENVIRONMENT=1 ./test_setup
>
> but it does not work if I put it into a shell script and 'qsub' that script
> to SGE. I get the message shown at the bottom of this post.
>
> I've tried everything I can think of. I would welcome any hints on how to
> proceed.
>
> For what it's worth, this OpenMPI is 1.4.3 and I built it on another system.
> I am setting and exporting OPAL_PREFIX and as I said, all works fine
> interactively just not in batch. It was built with -disable-shared and I
> don't see any shared libs under openmpi/lib, and I've done 'ldd' from within
> the script, on both the application executable and on the orterun command; no
> unresolved shared libraries. So I don't think the error message hinting at
> LD_LIBRARY_PATH issues is pointing me in the right direction.
>
> Thanks for any guidance,
>
> Ed
>

Oh, I missed this:

> error: executing task of job 139362 failed: execution daemon on host "f8312"
> didn't accept task

did you supply a machinefile on your own? In a proper SGE integration it's running in a parallel environment. You defined and requested one? The error looks like it was started in a PE, but tried to access a node not granted for the actual job

-- Reuti

> --------------------------------------------------------------------------
> A daemon (pid 2818) died unexpectedly with status 1 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
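For reference, the machinefile step mentioned at the top of this message can be sketched roughly as follows. This is not the actual script from the job, only a minimal illustration. It assumes each line of the SGE PE_HOSTFILE has the form "host slots queue processor-range" and that the machinefile uses Open MPI's "host slots=N" form; the function name make_machinefile is made up for the example.

```shell
#!/bin/sh
# Sketch only: not the script actually used in this thread.
# Assumes each $PE_HOSTFILE line reads: <host> <slots> <queue> <processor range>
make_machinefile() {
    # $1 = SGE PE hostfile to read, $2 = machinefile to write
    # Emits one "host slots=N" line per allocated host.
    awk 'NF >= 2 { print $1 " slots=" $2 }' "$1" > "$2"
}

# Inside a job script; skipped when PE_HOSTFILE is not set (e.g. outside SGE).
if [ -n "${PE_HOSTFILE:-}" ]; then
    make_machinefile "$PE_HOSTFILE" mpihosts.dat
fi
```

Comparing the hosts in the resulting mpihosts.dat against the contents of $PE_HOSTFILE from within the same job is a quick way to confirm that no host outside the allocation has slipped in.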