On 16 November 2011 03:29, Vang Le <[email protected]> wrote:
> Hello GridUsers,
> My grid is running, it can deliver jobs, but they only run on one node at a
> time.
> When I tried running with mpirun in a batch script, I get errors like
> "execution daemon on host <hostname> didn't accept task" as shown at the
> bottom of this email.
>
> I can run mpirun outside of sge without any problems.
> I am suspecting that when mpirun is put inside the sge batch script, it
> cannot communicate with exec nodes successfully.
>
>
> My system information:
> 3 servers running Ubuntu Lucid Lynx with recompiled openmpi to support
> gridengine. SGE was installed from the Ubuntu repository, with the correct
> environment variables set up.
> I also set up passwordless ssh access for the openmpi user account, which is
> the same account that I use to submit sge batch jobs.

Since you have Grid Engine support, how are your rsh_command and rsh_daemon
configured? If you have them configured to use ssh/sshd, do you need to, or
could you get away with using the builtin support? If you are using a
standard sshd you probably need the sge_qrsh PAM module installed and
enabled. The first thing I would do is check that you can run qrsh by hand
within a job.
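
Something along these lines might do as a test job. It is only a rough sketch: it assumes the PE is called test_pe as in your script, and the helper names pe_hosts and startup_settings are made up for illustration. It prints the remote-startup settings from the global configuration, then tries qrsh -inherit against each host listed in $PE_HOSTFILE.

```shell
#!/bin/bash
#$ -cwd
#$ -S /bin/bash
#$ -j y
#$ -pe test_pe 5

# Print the unique host names from a PE hostfile
# (one line per host: "hostname slots queue processor-range").
pe_hosts() {
    awk '{ print $1 }' "$1" | sort -u
}

# Filter `qconf -sconf`-style text on stdin down to the remote-startup
# settings (rsh/rlogin/qlogin command and daemon).
startup_settings() {
    grep -E '^(rsh|rlogin|qlogin)_(command|daemon)[[:space:]]'
}

# Only talk to Grid Engine when actually running inside a parallel job.
if [ -n "${PE_HOSTFILE:-}" ] && command -v qrsh >/dev/null 2>&1; then
    qconf -sconf | startup_settings
    for host in $(pe_hosts "$PE_HOSTFILE"); do
        echo "== $host =="
        qrsh -inherit "$host" hostname || echo "qrsh to $host failed"
    done
fi
```

Submit it with qsub like any other job. If qrsh -inherit fails here with the same "didn't accept task" message, the problem is most likely in the Grid Engine remote startup setup rather than in OpenMPI itself.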

Not sure why, but OpenMPI seems to be good at triggering issues with sshd. On
Scientific Linux 5 I found that when the system sshd was invoked via
rsh_daemon, OpenMPI would trigger some sort of bug that showed up as
corruption of the ssh connection. It has been suggested that this sort of
issue can be triggered by noisy PAM modules. Other MPIs, or using qrsh by
hand, don't trigger this bug. I worked around it by using a locally compiled
sshd.

William

>
>
> Any help is very much appreciated.
>
> Vang.
>
>
> ============ERROR================
> error: executing task of job 63 failed: execution daemon on host "node1"
> didn't accept task
> error: executing task of job 63 failed: execution daemon on host "submithost"
> didn't accept task
> --------------------------------------------------------------------------
> A daemon (pid 13317) died unexpectedly with status 1 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
>
>
> ============CONTENT OF SGE BATCH SUBMIT==============
>
> #!/bin/bash
>
> # run at current working directory
> #$ -cwd
> #$ -V
> # Specify the shell for this job
> #$ -S /bin/bash
> #$ -pe test_pe 5
> #$ -P test1
>
> # Merge the standard output and standard error
> #$ -j y
>
> # Specify the location of the output messages
> #$ -o messages.txt
>
> #---------Customization part starts below -------------
> # Customization
> # Which email should the start running and ending of this job be emailed to
> #
> #$ -M <my_gmail_id>@gmail.com
> #$ -m be
>
> export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH
>
> mpirun -np $NSLOTS hostname
> mpirun -np $NSLOTS ~/hello
>
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
