Hi,

On 18.02.2013 at 04:53, Britto, Rajesh wrote:

> Thanks for the information.
>
> It's not a ROCKS cluster, it's a normal SGE cluster with RHEL5.2 installed.
>
> There is no firewall or SELinux enabled on these machines.

Is it a fresh installation? I wonder about using 6.1u2, as there were versions after it which were still freely available.

To investigate: it might be outside of SGE. Can you please submit such a hanging job, log in to the node and issue:

strace -p 1234

with the PID of your hanging application. If it's just the `qrsh` hanging around, its return code might be retrieved later.

One other possibility: one version of PVM failed to close stdout and it had a similar effect, IIRC. What type of parallel application is it (e.g. MPI)?

-- Reuti


> Thanks,
> Britto.
>
> -----Original Message-----
> From: Reuti [mailto:[email protected]]
> Sent: Friday, February 15, 2013 10:15 PM
> To: Britto, Rajesh
> Cc: [email protected]
> Subject: Re: [gridengine users] Issue in Distributed jobs
>
> On 15.02.2013 at 08:22, Britto, Rajesh wrote:
>
>> Hi Reuti,
>>
>> Thanks for the information. I am using SGE 6.1u2.
>
> Ok, IIRC the builtin startup mechanism appeared only in 6.2.
>
>
>> qconf -sconf:
>>
>> qlogin_command telnet
>> qlogin_daemon /usr/sbin/in.telnetd
>> rlogin_daemon /usr/sbin/in.rlogind
>
> ROCKS? I remember that they added some lines at the end and override settings
> which appear earlier in the file.
>
> Do you have any firewall installed on the system, which could block the MPI
> communication?
>
> -- Reuti
>
>
>> The rsh command doesn't appear in the qconf -sconf output. We are using
>> openmpi for running parallel and distributed jobs.
>>
>> The application uses the mpirun command to invoke the distributed jobs.
>> Please let me know if you need more clarification.
>>
>> Thanks,
>> Britto.
>>
>>
>> -----Original Message-----
>> From: Reuti [mailto:[email protected]]
>> Sent: Wednesday, February 13, 2013 7:00 PM
>> To: Britto, Rajesh
>> Cc: [email protected]
>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>
>> Hi,
>>
>> On 13.02.2013 at 13:43, Britto, Rajesh wrote:
>>
>>> When I tried to execute a distributed job on a cluster, the job started
>>> successfully.
>>>
>>> However, after some time, the job hung on the following
>>> process. Can anyone please let me know what could be the issue?
>>>
>>> /opt/sge/utilbin/lx24-amd64/rsh -n -p 36425 <NodeName> exec
>>> '/opt/sge/utilbin/lx24-amd64/qrsh_starter'
>>> '/opt/spool/node/active_jobs/41270.1/1.node'
>>
>> It looks like you used the old startup method by `rsh` - which version of
>> SGE is it? When setting:
>>
>> $ qconf -sconf
>> ...
>> qlogin_command builtin
>> qlogin_daemon builtin
>> rlogin_command builtin
>> rlogin_daemon builtin
>> rsh_command builtin
>> rsh_daemon builtin
>>
>> the `rsh` shouldn't appear in the process tree. How did you start your
>> application in the jobscript? How does the application start slave tasks: by
>> Open MPI, MPICH2 ...?
>>
>>
>>> FYI, the cluster has both passwordless ssh and rsh communication
>>> between the nodes.
>>
>> In a Tight Integration setup even parallel jobs don't need this.
>>
>> -- Reuti
>>
>
>
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
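
For comparing a cluster's configuration against the six `builtin` settings quoted in the thread, a small filter along the following lines might help. It is only a sketch: `check_builtin_startup` is a hypothetical helper name, not part of SGE, and it assumes the `qconf -sconf` output has one `name value` pair per line and that the last occurrence of a parameter is the one that counts.

```shell
#!/bin/sh
# Hypothetical helper (not an SGE tool): reads `qconf -sconf` output on stdin
# and prints each remote-startup parameter whose value is not "builtin".
# Parameters missing from the input are reported as "(not set)".
check_builtin_startup() {
    awk '
        BEGIN {
            # The six parameters named in the thread, in a fixed report order.
            n = split("qlogin_command qlogin_daemon rlogin_command rlogin_daemon rsh_command rsh_daemon", names, " ")
            for (i = 1; i <= n; i++) value[names[i]] = "(not set)"
        }
        # Record the first word of the value; a later line overrides an earlier one.
        ($1 in value) { value[$1] = $2 }
        END {
            for (i = 1; i <= n; i++)
                if (value[names[i]] != "builtin")
                    printf "%s %s\n", names[i], value[names[i]]
        }
    '
}
```

Typical use would be `qconf -sconf | check_builtin_startup`; empty output means all six parameters are set to `builtin`. Going by the thread, the builtin mechanism appeared only in 6.2, so on 6.1u2 all six would be expected to be listed.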
