Hi,

On 11.04.2013 at 16:45, Bernard Massot wrote:

> On Thu, Apr 11, 2013 at 11:31:12AM +0200, Reuti wrote:
>> On 11.04.2013 at 10:41, Bernard Massot wrote:
>>> I'm new to parallel computing and have a problem with Open MPI jobs in
>> 0) Which version of Open MPI are you using?
> 1.4.2 
> 
>> This is a fine setup. I assume the setting in SGE's configuration is:
>> 
>> $ qconf -sconf
>> ...
>> qlogin_command               builtin
>> qlogin_daemon                builtin
>> rlogin_command               builtin
>> rlogin_daemon                builtin
>> rsh_command                  builtin
>> rsh_daemon                   builtin
> No. I have the default Debian configuration, which is:
> rlogin_daemon                /usr/sbin/sshd -i
> rlogin_command               /usr/bin/ssh
> qlogin_daemon                /usr/sbin/sshd -i
> qlogin_command               /usr/share/gridengine/qlogin-wrapper
> rsh_daemon                   /usr/sbin/sshd -i
> rsh_command                  /usr/bin/ssh
> But I think it has never been a problem.

Well, if you were to compute between nodes you would need passphrase-less 
ssh keys. But you don't want to compute between nodes anyway, so this shouldn't 
be a problem.
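As an aside, if you ever do need to compute across nodes with the ssh-based setup above, a quick way to verify that key-based login works without a prompt is the check below (a sketch; "node02" is a placeholder for one of your exec hosts):

```shell
# Verify non-interactive ssh works from one exec host to another.
# BatchMode=yes makes ssh fail immediately instead of prompting for
# a passphrase or password, so this is safe to run in a script.
ssh -o BatchMode=yes node02 true \
    && echo "key-based login OK" \
    || echo "ssh would prompt: fix the keys before multi-node runs"
```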


>> So Open MPI should detect it's running under SGE and issue `qrsh
>> -inherit ...` in the end without the necessity to have somewhere ssh
>> inside the cluster around in case it would start something on
>> additional nodes. But this is not intended by your setup anyway due to
>> the PE definition.
> Even when a job fails, strace shows that mpirun used "qrsh -inherit ...".

I would say: "*Only* when a job fails does strace show that mpirun used `qrsh 
-inherit ...`." If the job is local, there should only be forks in the process 
listing for a running job. If it's making a local `qrsh -inherit ...`, 
something is wrong with Open MPI's detection of the hostname.
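To see what a running job actually did, you can inspect its process tree on the exec host; with tight integration and a purely local job you would expect mpirun and its forked ranks under sge_shepherd, and no qrsh at all. A rough check (assuming a Linux exec host with procps):

```shell
# Show the process tree under the SGE shepherd on the exec host.
# For a purely local Open MPI job, mpirun's children should be
# plain forked ranks -- no "qrsh -inherit" anywhere.
ps -e f -o pid,ppid,command | grep -A5 "[s]ge_shepherd"

# Or check directly whether any qrsh was spawned for the job:
pgrep -af "qrsh -inherit" \
    || echo "no qrsh -inherit running (job stayed local)"
```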


>> 1) Is the `mpiexec` the users use the one you supplied or can it
>> happen, that by accident they used a different `mpiexec` or even
>> compiled their application with a different MPI library?
> It's not another mpirun (strace confirmed that).

Does the script by accident specify a dedicated hostlist or refer to a 
self-assembled hostfile (which would contradict the granted machinefile)? Is 
"-nolocal" specified anywhere?
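Since the job scripts are kept under $SGE_JOB_SPOOL_DIR/job_scripts on the exec host, you could grep them in bulk for such options rather than reading each one; a sketch (the directory must exist on the host where you run this):

```shell
# Search all spooled job scripts for mpirun options that would
# override SGE's granted machinefile or skip the local host.
# -e protects the pattern's leading dash from option parsing.
grep -nE -e '-(hostfile|machinefile|host|nolocal)' \
    "$SGE_JOB_SPOOL_DIR"/job_scripts/* 2>/dev/null
```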

-- Reuti


>> 2) As the used jobscript is available on the node in
>> $SGE_JOB_SPOOL_DIR/job_scripts (the directory specified during
>> installation to be used by the exechosts): do they show anything
>> unusual regarding the PATH settings?
> No.
> 
>> 3) Are all slots for a job coming from one queue only, or are the
>> slots collected from several queues on one and the same exechost?
>> I.e.: is the same PE "orte" attached to more than one queue?
> I only have one queue.
> 
> Any other idea?
> -- 
> Bernard Massot
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
> 

