Hi,

On 13.02.2013 at 13:43, Britto, Rajesh wrote:
> When I tried to execute a distributed job on a cluster, the job started
> successfully.
>
> However, after some time, the job hung at the following process. Can
> anyone please let me know what could be the issue?
>
> /opt/sge/utilbin/lx24-amd64/rsh -n -p 36425 <NodeName> exec '/opt/sge/utilbin/lx24-amd64/qrsh_starter' '/opt/spool/node/active_jobs/41270.1/1.node'

It looks like you are using the old startup method via `rsh`. Which version of SGE is this? With the following settings:

$ qconf -sconf
...
qlogin_command               builtin
qlogin_daemon                builtin
rlogin_command               builtin
rlogin_daemon                builtin
rsh_command                  builtin
rsh_daemon                   builtin

`rsh` shouldn't appear in the process tree.

How did you start your application in the job script? How does the application start its slave tasks: by Open MPI, MPICH2, ...?

> FYI, the cluster has both passwordless ssh and rsh communication between
> the nodes.

In a Tight Integration setup, even parallel jobs don't need this.

-- Reuti
