Hi Reuti,

Thanks for the information. I am using SGE 6.1u2.

Output of qconf -sconf:

qlogin_command               telnet
qlogin_daemon                /usr/sbin/in.telnetd
rlogin_daemon                /usr/sbin/in.rlogind

The rsh command doesn't appear in the qconf -sconf output.

We are using Open MPI to run parallel and distributed jobs. The application uses the mpirun command to launch the distributed jobs.

Please let me know if you need more clarification.

Thanks,
Britto.

-----Original Message-----
From: Reuti [mailto:[email protected]]
Sent: Wednesday, February 13, 2013 7:00 PM
To: Britto, Rajesh
Cc: [email protected]
Subject: Re: [gridengine users] Issue in Distributed jobs

Hi,

On 13.02.2013 at 13:43, Britto, Rajesh wrote:

> When I tried to execute a distributed job on a cluster, the job started
> successfully.
>
> However, after some time, the job hung in the following
> process. Can anyone please let me know what could be the issue?
>
> /opt/sge/utilbin/lx24-amd64/rsh -n -p 36425 <NodeName> exec
> '/opt/sge/utilbin/lx24-amd64/qrsh_starter'
> '/opt/spool/node/active_jobs/41270.1/1.node'

It looks like you used the old startup method by `rsh` - which version of SGE is it? When setting:

$ qconf -sconf
...
qlogin_command               builtin
qlogin_daemon                builtin
rlogin_command               builtin
rlogin_daemon                builtin
rsh_command                  builtin
rsh_daemon                   builtin

the `rsh` shouldn't appear in the process tree.

How did you start your application in the jobscript? How does the application start slave tasks: by Open MPI, MPICH2 ...?

> FYI, the cluster has both passwordless ssh and rsh communication between
> the nodes.

In a Tight Integration setup even parallel jobs don't need this.

-- Reuti

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
