Hi,

On 11.04.2013 at 16:45, Bernard Massot wrote:
> On Thu, Apr 11, 2013 at 11:31:12AM +0200, Reuti wrote:
>> On 11.04.2013 at 10:41, Bernard Massot wrote:
>>> I'm new to parallel computing and have a problem with Open MPI jobs in
>> 0) Which version of Open MPI are you using?
> 1.4.2
>
>> This is a fine setup. I assume the setting in SGE's configuration is:
>>
>> $ qconf -sconf
>> ...
>> qlogin_command               builtin
>> qlogin_daemon                builtin
>> rlogin_command               builtin
>> rlogin_daemon                builtin
>> rsh_command                  builtin
>> rsh_daemon                   builtin
> No. I have the default Debian configuration, which is:
> rlogin_daemon                /usr/sbin/sshd -i
> rlogin_command               /usr/bin/ssh
> qlogin_daemon                /usr/sbin/sshd -i
> qlogin_command               /usr/share/gridengine/qlogin-wrapper
> rsh_daemon                   /usr/sbin/sshd -i
> rsh_command                  /usr/bin/ssh
> But I think it has never been a problem.

Well, if you were to compute between nodes you would need passphrase-less SSH keys. But you don't want to compute between nodes anyway, so this shouldn't be a problem.

>> So Open MPI should detect that it's running under SGE and issue `qrsh
>> -inherit ...` in the end, without the need to have ssh available
>> anywhere inside the cluster in case it starts something on additional
>> nodes. But this is not intended by your setup anyway, due to the PE
>> definition.
> Even when a job fails, strace shows that mpirun used "qrsh -inherit ...".

I would say: *only* when a job fails does strace show that mpirun used "qrsh -inherit ...". If the job is local, there should only be forks in the process listing for a running job. If it's making a local "qrsh -inherit ...", something is wrong with Open MPI's detection of the hostname.

>> 1) Is the `mpiexec` the users use the one you supplied, or can it
>> happen that by accident they used a different `mpiexec` or even
>> compiled their application with a different MPI library?
> It's not another mpirun (strace confirmed that).
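On the hostname-detection point above, a minimal sketch of the suspected failure mode: a purely local launch should fork, and a name mismatch between the granted host and what the node calls itself would push mpirun onto the "qrsh -inherit" path instead. All names here are made up for illustration:

```shell
# Hypothetical names: the PE hostfile carries the short host name,
# while hostname(1) on the node returns the fully qualified one.
pe_host="node01"
local_host="node01.example.com"

if [ "$pe_host" = "$local_host" ]; then
  # Expected behaviour for a single-node job:
  echo "local match: fork MPI ranks directly"
else
  # Wrong branch for a single-node job -- this is what shows up
  # in strace as "qrsh -inherit ...":
  echo "no match: qrsh -inherit $pe_host ..."
fi
```

With the mismatching names above this prints the "qrsh -inherit" branch, which would match the strace observation.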
Does the script by accident specify a dedicated host list, or refer to a self-assembled hostfile (which would contradict the granted machine)? No "-nolocal" specified?

-- Reuti

>> 2) As the used jobscript is available on the node in
>> $SGE_JOB_SPOOL_DIR/job_scripts (the directory specified during
>> installation to be used by the exechosts): do the scripts show
>> anything unusual regarding the PATH settings?
> No.
>
>> 3) Are all slots for a job coming from one queue only, or are the
>> slots collected from several queues on one and the same exechost?
>> I.e.: is the same PE "orte" attached to more than one queue?
> I only have one queue.
>
> Any other idea?
> --
> Bernard Massot
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
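Regarding the granted machine: the allocation mpirun should honour is what SGE writes into the file pointed to by $PE_HOSTFILE, not any self-assembled hostfile. A minimal sketch of its layout and how the host and slot count are read from it (node01, the slot count, and the queue name are hypothetical):

```shell
# Hypothetical $PE_HOSTFILE contents for a 4-slot job granted
# entirely on one exechost:
pe_hostfile=$(mktemp)
cat > "$pe_hostfile" <<'EOF'
node01 4 [email protected] <NULL>
EOF

# The first two columns are the host name and the slot count;
# for a single-host allocation everything should then be forked
# locally on that host:
while read -r host slots queue rest; do
  echo "host=$host slots=$slots"
done < "$pe_hostfile"
rm -f "$pe_hostfile"
```

Echoing $PE_HOSTFILE and its contents at the top of the jobscript is a quick way to confirm what mpirun is actually being handed.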
