Hi Reuti,

Thanks for the information.

It's not a ROCKS cluster; it's a normal SGE cluster with RHEL 5.2 installed. There is no firewall or SELinux enabled on these machines.

Thanks,
Britto.

-----Original Message-----
From: Reuti [mailto:[email protected]]
Sent: Friday, February 15, 2013 10:15 PM
To: Britto, Rajesh
Cc: [email protected]
Subject: Re: [gridengine users] Issue in Distributed jobs

On 15.02.2013 at 08:22, Britto, Rajesh wrote:

> Hi Reuti,
>
> Thanks for the information. I am using SGE 6.1u2.

Ok, IIRC the builtin startup mechanism appeared only in 6.2.

> qconf -sconf:
>
> qlogin_command               telnet
> qlogin_daemon                /usr/sbin/in.telnetd
> rlogin_daemon                /usr/sbin/in.rlogind

ROCKS? I remember that they add some lines at the end which override settings that appear earlier in the file.

Do you have any firewall installed on the system which could block the MPI communication?

-- Reuti

> The rsh command doesn't appear in the qconf -sconf output. We are using
> Open MPI for running parallel and distributed jobs.
>
> The application uses the mpirun command to invoke the distributed jobs.
> Please let me know if you need more clarification.
>
> Thanks,
> Britto.
>
>
> -----Original Message-----
> From: Reuti [mailto:[email protected]]
> Sent: Wednesday, February 13, 2013 7:00 PM
> To: Britto, Rajesh
> Cc: [email protected]
> Subject: Re: [gridengine users] Issue in Distributed jobs
>
> Hi,
>
> On 13.02.2013 at 13:43, Britto, Rajesh wrote:
>
>> When I tried to execute a distributed job on a cluster, the job started
>> successfully.
>>
>> However, after some time, the job hung at the following
>> process. Can anyone please let me know what could be the issue?
>>
>> /opt/sge/utilbin/lx24-amd64/rsh -n -p 36425 <NodeName> exec
>> '/opt/sge/utilbin/lx24-amd64/qrsh_starter'
>> '/opt/spool/node/active_jobs/41270.1/1.node'
>
> It looks like you used the old startup method by `rsh` - which version of SGE
> is it? When setting:
>
> $ qconf -sconf
> ...
> qlogin_command               builtin
> qlogin_daemon                builtin
> rlogin_command               builtin
> rlogin_daemon                builtin
> rsh_command                  builtin
> rsh_daemon                   builtin
>
> the `rsh` shouldn't appear in the process tree. How did you start your
> application in the jobscript? How does the application start slave tasks: by
> Open MPI, MPICH2 ...?
>
>
>> FYI, the cluster has both passwordless ssh and rsh communication between
>> the nodes.
>
> In a Tight Integration setup even parallel jobs don't need this.
>
> -- Reuti
> _______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
