Hi,

Thanks for the information. It's not a fresh installation: 6.1 is already installed and in production, and we are not upgrading it.
After running strace on the process ID of the hanging process, I found the following:

    /opt/sge/utilbin/lx24-amd64/rsh -n -p 51693 node09 exec '/opt/sge/utilbin/lx24-amd64/qrsh_starter' '/opt/spool/node09/active_jobs/41406.1/1.node09'

The above command hangs. It is trying to find the file '/opt/spool/node09/active_jobs/41406.1/1.node09', which does not exist, whereas /opt/spool/node05/active_jobs/41406.1/ does.

I submitted a distributed job, and it was running on node09 and node05 in the grid. The active_jobs folder exists for node05 (since the parent process was invoked from this node) but not for node09.

I am using the following PE for my distributed job:

    pe_name            Distributed
    slots              94
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    /bin/true
    stop_proc_args     /bin/true
    allocation_rule    $fill_up
    control_slaves     TRUE
    job_is_first_task  FALSE
    urgency_slots      min

Can you please help me to resolve the issue?

Thanks,
Britto.

-----Original Message-----
From: Reuti [mailto:[email protected]]
Sent: Monday, February 18, 2013 1:54 PM
To: Britto, Rajesh
Cc: [email protected]
Subject: Re: [gridengine users] Issue in Distributed jobs

Hi,

On 18.02.2013 at 04:53, Britto, Rajesh wrote:

> Thanks for the information.
>
> It's not a ROCKS cluster; it's a normal SGE cluster with RHEL5.2 installed.
>
> There is no firewall or SELinux enabled on these machines.

Is it a fresh installation? I wonder about using 6.1u2, as there were versions after it which were still freely available.

To investigate: it might be outside of SGE. Can you please submit such a hanging job, log in to the node and issue:

    strace -p 1234

with the PID of your hanging application. If it's just the `qrsh` hanging around, its return code might be retrieved later.

One other possibility: one version of PVM failed to close stdout, and it had a similar effect IIRC. What type of parallel application is it (e.g. MPI)?

-- Reuti

> Thanks,
> Britto.
>
> -----Original Message-----
> From: Reuti [mailto:[email protected]]
> Sent: Friday, February 15, 2013 10:15 PM
> To: Britto, Rajesh
> Cc: [email protected]
> Subject: Re: [gridengine users] Issue in Distributed jobs
>
> On 15.02.2013 at 08:22, Britto, Rajesh wrote:
>
>> Hi Reuti,
>>
>> Thanks for the information. I am using SGE 6.1u2.
>
> Ok, IIRC the builtin startup mechanism appeared only in 6.2.
>
>
>> qconf -sconf:
>>
>> qlogin_command telnet
>> qlogin_daemon /usr/sbin/in.telnetd
>> rlogin_daemon /usr/sbin/in.rlogind
>
> ROCKS? I remember that they added some lines at the end and override settings
> which appear earlier in the file.
>
> Do you have any firewall installed on the system, which could block the MPI
> communication?
>
> -- Reuti
>
>
>> The rsh command doesn't appear in the qconf -sconf output. We are using
>> Open MPI for running parallel and distributed jobs.
>>
>> The application uses the mpirun command to invoke the distributed jobs.
>> Please let me know if you need more clarification.
>>
>> Thanks,
>> Britto.
>>
>>
>> -----Original Message-----
>> From: Reuti [mailto:[email protected]]
>> Sent: Wednesday, February 13, 2013 7:00 PM
>> To: Britto, Rajesh
>> Cc: [email protected]
>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>
>> Hi,
>>
>> On 13.02.2013 at 13:43, Britto, Rajesh wrote:
>>
>>> When I tried to execute a distributed job on a cluster, the job started
>>> successfully.
>>>
>>> However, after some time, the job hung on the following
>>> process. Can anyone please let me know what could be the issue?
>>>
>>> /opt/sge/utilbin/lx24-amd64/rsh -n -p 36425 <NodeName> exec
>>> '/opt/sge/utilbin/lx24-amd64/qrsh_starter'
>>> '/opt/spool/node/active_jobs/41270.1/1.node'
>>
>> It looks like you used the old startup method by `rsh` - which version of
>> SGE is it? When setting:
>>
>> $ qconf -sconf
>> ...
>> qlogin_command builtin
>> qlogin_daemon builtin
>> rlogin_command builtin
>> rlogin_daemon builtin
>> rsh_command builtin
>> rsh_daemon builtin
>>
>> the `rsh` shouldn't appear in the process tree. How did you start your
>> application in the jobscript? How does the application start slave tasks: by
>> Open MPI, MPICH2 ...?
>>
>>
>>> FYI, the cluster has both passwordless ssh and rsh communication
>>> between the nodes.
>>
>> In a Tight Integration setup even parallel jobs don't need this.
>>
>> -- Reuti
>>
>
>
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
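
For anyone repeating the check from the top of this thread (which nodes have an active_jobs entry for a given job), here is a minimal sketch. It assumes the <spool_root>/<node>/active_jobs/<job_id>.<task_id> layout seen in the paths above, and that every node's spool directory is readable from the host where it runs, which may not hold if spooling is local to each node. `check_active_jobs` is a hypothetical helper name, not part of SGE.

```shell
#!/bin/sh
# Hypothetical helper (not an SGE tool): for each node given, report whether
# <spool_root>/<node>/active_jobs/<job> exists as a directory.
# Assumption: all node spool directories are visible from this host.
check_active_jobs() {
    spool_root=$1   # e.g. /opt/spool
    job=$2          # <job_id>.<task_id>, e.g. 41406.1
    shift 2
    for node in "$@"; do
        if [ -d "$spool_root/$node/active_jobs/$job" ]; then
            echo "$node: present"
        else
            echo "$node: missing"
        fi
    done
}
```

With the values from this thread it would be called as `check_active_jobs /opt/spool 41406.1 node05 node09`. A node reported as missing only shows where the directory is absent; it does not by itself explain why the `rsh` call hangs.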
