Hi Reuti,

Thanks for the information. I am using SGE 6.1u2.

qconf -sconf:

qlogin_command               telnet
qlogin_daemon                /usr/sbin/in.telnetd
rlogin_daemon                /usr/sbin/in.rlogind

The rsh command doesn't appear in the qconf -sconf output. We are using Open MPI 
for running parallel and distributed jobs.

The application uses the mpirun command to invoke the distributed jobs. Please 
let me know if you need any further details.

Thanks,
Britto.


-----Original Message-----
From: Reuti [mailto:[email protected]] 
Sent: Wednesday, February 13, 2013 7:00 PM
To: Britto, Rajesh
Cc: [email protected]
Subject: Re: [gridengine users] Issue in Distributed jobs

Hi,

On 13.02.2013 at 13:43, Britto, Rajesh wrote:

> When I tried to execute a distributed job on a cluster, the job started 
> successfully.
>  
> However, after some time, the job hung at the following process. Can anyone 
> please let me know what the issue could be?
>  
> /opt/sge/utilbin/lx24-amd64/rsh -n -p 36425 <NodeName> exec 
> '/opt/sge/utilbin/lx24-amd64/qrsh_starter' 
> '/opt/spool/node/active_jobs/41270.1/1.node'

It looks like you used the old startup method via `rsh` - which version of SGE 
is it? When setting:

$ qconf -sconf
...
qlogin_command               builtin
qlogin_daemon                builtin
rlogin_command               builtin
rlogin_daemon                builtin
rsh_command                  builtin
rsh_daemon                   builtin

the `rsh` shouldn't appear in the process tree. How did you start your 
application in the jobscript? How does the application start slave tasks: by 
Open MPI, MPICH2 ...?
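
To compare a cluster against the settings above, the relevant entries can be filtered out of the `qconf -sconf` output. This is only a minimal sketch: `show_remote_startup` is a hypothetical helper name (not part of SGE), and it assumes the usual "key, whitespace, value" layout of that output.

```shell
# Sketch: keep only the remote-startup entries of `qconf -sconf` style output.
# show_remote_startup is a hypothetical helper name, not an SGE command.
show_remote_startup() {
    # Reads the configuration on stdin and prints only the
    # qlogin/rlogin/rsh command and daemon lines.
    grep -E '^(qlogin|rlogin|rsh)_(command|daemon)[[:space:]]'
}

# Usage on a host with the SGE client tools installed:
#   qconf -sconf | show_remote_startup
```

If an entry such as `rsh_command` is missing from the result, it is simply not set explicitly in the global configuration.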


> FYI, the cluster has both passwordless ssh and rsh communication between 
> the nodes.

In a Tight Integration setup even parallel jobs don't need this.
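
For Open MPI, a Tight Integration requires that it was built with SGE support. One way to check is to look for the `gridengine` component in the `ompi_info` output. A rough sketch, where `has_gridengine_support` is a hypothetical helper name and the match is a plain substring test:

```shell
# Sketch: test whether `ompi_info` style output lists the gridengine component.
# has_gridengine_support is a hypothetical helper name, not part of Open MPI.
has_gridengine_support() {
    # Reads the component list on stdin; succeeds if any line mentions gridengine.
    grep -q 'gridengine'
}

# Usage on a node where Open MPI is installed:
#   ompi_info | has_gridengine_support && echo "SGE support compiled in"
```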

-- Reuti
