Re: [gridengine users] Issue in Distributed jobs

Reuti Mon, 18 Feb 2013 00:25:35 -0800

Hi,

On 18.02.2013 at 04:53, Britto, Rajesh wrote:


> Thanks for the information.
> 
> It's not a ROCKS cluster; it's a normal SGE cluster with RHEL 5.2 installed.
> 
> There is no firewall or SELinux enabled on these machines.

Is it a fresh installation? I wonder about using 6.1u2 as there were versions 
after it which were still freely available.

To investigate: the cause might be outside of SGE. Can you please submit such a 
hanging job, log in to the node and issue:

strace -p 1234

with the PID of your hanging application. If it's just the `qrsh` hanging 
around, its return code might be retrieved later.
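
In case it helps, here is a minimal sketch of how this could look on the node. The helper names `pids_for` and `strace_cmd` are made up for illustration, and it assumes `pgrep` and `strace` are installed there:

```shell
#!/bin/sh
# Hypothetical helpers, not part of SGE.

# List the PIDs of your own processes whose name matches exactly.
pids_for() {
    pgrep -u "$(id -un)" -x "$1"
}

# Print the strace call for one PID.
# -f follows child processes, -tt adds timestamps, -p attaches to a running PID.
strace_cmd() {
    printf 'strace -f -tt -p %s\n' "$1"
}
```

For example, `pids_for mpirun` lists candidate PIDs, and `strace_cmd 1234` prints `strace -f -tt -p 1234`, which you can then run as the job owner or root.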

One other possibility: one version of PVM failed to close stdout, which had 
a similar effect IIRC. What type of parallel application is it (e.g. MPI)?
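
To check the stdout theory, one option on Linux is to look at where a file descriptor of the hanging process points. This is only a sketch: `fd_target` is a made-up helper name, and it assumes the `/proc` filesystem is mounted on the node:

```shell
#!/bin/sh
# Hypothetical helper; relies on the Linux /proc filesystem.

# Print the target (file, pipe or socket) of file descriptor $2 in process $1.
# readlink fails with a non-zero exit status if that descriptor is not open.
fd_target() {
    readlink "/proc/$1/fd/$2"
}
```

`fd_target 1234 1` would then show what stdout of PID 1234 is still connected to, if anything.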

-- Reuti


> Thanks,
> Britto.
> 
> -----Original Message-----
> From: Reuti [mailto:[email protected]] 
> Sent: Friday, February 15, 2013 10:15 PM
> To: Britto, Rajesh
> Cc: [email protected]
> Subject: Re: [gridengine users] Issue in Distributed jobs
> 
> On 15.02.2013 at 08:22, Britto, Rajesh wrote:
> 
>> Hi Reuti,
>> 
>> Thanks for the information. I am using SGE 6.1u2.
> 
> Ok, IIRC the builtin startup mechanism appeared only in 6.2.
> 
> 
>> qconf -sconf:
>> 
>> qlogin_command               telnet
>> qlogin_daemon                /usr/sbin/in.telnetd
>> rlogin_daemon                /usr/sbin/in.rlogind
> 
> ROCKS? I remember that they added some lines at the end which override 
> settings that appear earlier in the file.
> 
> Do you have any firewall installed on the system, which could block the MPI 
> communication?
> 
> -- Reuti
> 
> 
>> The rsh command doesn't appear in the qconf -sconf output. We are using 
>> Open MPI for running parallel and distributed jobs.
>> 
>> The application uses the mpirun command to invoke the distributed jobs. 
>> Please let me know if you need more clarification.
>> 
>> Thanks,
>> Britto.
>> 
>> 
>> -----Original Message-----
>> From: Reuti [mailto:[email protected]] 
>> Sent: Wednesday, February 13, 2013 7:00 PM
>> To: Britto, Rajesh
>> Cc: [email protected]
>> Subject: Re: [gridengine users] Issue in Distributed jobs
>> 
>> Hi,
>> 
>> On 13.02.2013 at 13:43, Britto, Rajesh wrote:
>> 
>>> When I tried to execute a distributed job on a cluster, the job started 
>>> successfully.
>>> 
>>> However, after some time, the job hung at the following 
>>> process. Can anyone please let me know what could be the issue?
>>> 
>>> /opt/sge/utilbin/lx24-amd64/rsh -n -p 36425 <NodeName> exec 
>>> '/opt/sge/utilbin/lx24-amd64/qrsh_starter' 
>>> '/opt/spool/node/active_jobs/41270.1/1.node'
>> 
>> It looks like you used the old startup method by `rsh` - which version of 
>> SGE is it? When setting:
>> 
>> $ qconf -sconf
>> ...
>> qlogin_command               builtin
>> qlogin_daemon                builtin
>> rlogin_command               builtin
>> rlogin_daemon                builtin
>> rsh_command                  builtin
>> rsh_daemon                   builtin
>> 
>> the `rsh` shouldn't appear in the process tree. How did you start your 
>> application in the jobscript? How does the application start slave tasks: by 
>> Open MPI, MPICH2 ...?
>> 
>> 
>>> FYI, the cluster has both passwordless ssh and rsh communication 
>>> between the nodes.
>> 
>> In a Tight Integration setup even parallel jobs don't need this.
>> 
>> -- Reuti
>> 
> 
> 


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
