Hi,

On 13.02.2013 at 13:43, Britto, Rajesh wrote:

> When I tried to execute a distributed job on a cluster, the job started 
> successfully.
>  
> However, after some time, the job hung on the following 
> process. Can anyone please let me know what the issue could be?
>  
> /opt/sge/utilbin/lx24-amd64/rsh -n -p 36425 <NodeName> exec 
> '/opt/sge/utilbin/lx24-amd64/qrsh_starter' 
> '/opt/spool/node/active_jobs/41270.1/1.node'

It looks like you are using the old `rsh`-based startup method. Which version
of SGE is this? With the following settings:

$ qconf -sconf
...
qlogin_command               builtin
qlogin_daemon                builtin
rlogin_command               builtin
rlogin_daemon                builtin
rsh_command                  builtin
rsh_daemon                   builtin

no `rsh` process should appear in the process tree. How do you start your 
application in the job script, and how does the application start its slave 
tasks: via Open MPI, MPICH2, ...?
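
To check quickly whether any of these entries still points to an external
program, something like the following could be used. It is only a sketch: the
function name `non_builtin_settings` is made up for illustration, and it
assumes the two-column layout of the `qconf -sconf` output shown above.

```shell
# Sketch: print every remote-startup entry that is NOT set to "builtin".
# Reads `qconf -sconf` output on stdin; the function name is hypothetical.
non_builtin_settings() {
    awk '$1 ~ /^(qlogin|rlogin|rsh)_(command|daemon)$/ && $2 != "builtin" {
        $1 = $1   # rebuild the line with single spaces between fields
        print
    }'
}

# On a host where qconf is available (assumption):
#   qconf -sconf | non_builtin_settings
```

If this prints nothing, all six entries are set to `builtin`. Any line it does
print names a setting that still uses an external command or daemon.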


> FYI, the cluster has both passwordless ssh and rsh communication between 
> the nodes.

In a Tight Integration setup, even parallel jobs don't need passwordless ssh
or rsh between the nodes.
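
One prerequisite for a Tight Integration is that the parallel environment has
`control_slaves` set to TRUE, so that slave tasks can be started under SGE's
control. A rough way to check this is sketched below. The function name
`pe_controls_slaves` and the PE name `mpi` are placeholders, not taken from
your setup.

```shell
# Sketch: succeed only if the PE definition has control_slaves set to TRUE.
# Reads `qconf -sp <pe_name>` output on stdin; the function name is hypothetical.
pe_controls_slaves() {
    awk '$1 == "control_slaves" { found = (toupper($2) == "TRUE") }
         END { exit !found }'
}

# Example (the PE name "mpi" is a placeholder):
#   qconf -sp mpi | pe_controls_slaves && echo "control_slaves is TRUE"
```

This checks only the one setting. The application still has to start its
slave tasks through SGE for the integration to be tight.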

-- Reuti