Hi, I could see the following error message in the qmaster messages file:
Qmaster |mgr|E| tightly integrated parallel task 41406.1 task 1.node09 failed - killing job

Can you please help me in this regard?

Thanks,
Britto.

-----Original Message-----
From: Reuti [mailto:[email protected]]
Sent: Friday, February 22, 2013 6:56 PM
To: Britto, Rajesh
Cc: [email protected]
Subject: Re: [gridengine users] Issue in Distributed jobs

On 22.02.2013, at 08:15, Britto, Rajesh wrote:

> Thanks for the information. It's not a fresh installation; we already
> installed 6.1, which is in production, and we are not updating it.
>
> After running strace on the process ID of the hanging process, I found
> the following:
>
> /opt/sge/utilbin/lx24-amd64/rsh -n -p 51693 node09 exec
> '/opt/sge/utilbin/lx24-amd64/qrsh_starter'
> '/opt/spool/node09/active_jobs/41406.1/1.node09'

To clarify this: the directory /opt/spool/node09/active_jobs/41406.1/1.node09
should be on the slave node. It's not created? Anything in the messages file
of the node about this failure?

-- Reuti

> The above command hangs; it's trying to find the directory
> '/opt/spool/node09/active_jobs/41406.1/1.node09', which is not available,
> whereas /opt/spool/node05/active_jobs/41406.1/ is available.
>
> I submitted a distributed job, and it was running on node09 and node05
> in the grid; the active_jobs folder exists for node05 (since the parent
> process was invoked from this node) but not for node09.
>
> I am using the following PE for my distributed job:
>
> pe_name            Distributed
> slots              94
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $fill_up
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
>
> Can you please help me to resolve the issue?
>
> Thanks,
> Britto.
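The failure above comes down to the per-task directory never being created on the slave node. A minimal sketch of that check, using a temporary directory in place of the real spool (the paths and node names mirror the ones quoted in this thread; the `check` helper is illustrative, not an SGE tool):

```shell
# Mimic the reported state in a temp dir: the active_jobs directory for the
# job exists under node05 (the master node) but was never created for node09.
spool=$(mktemp -d)
mkdir -p "$spool/node05/active_jobs/41406.1"   # present, as reported
# node09's directory is deliberately not created, matching the failure

check() {
  # Prints "present" or "MISSING" for the given node's active_jobs dir.
  if [ -d "$spool/$1/active_jobs/41406.1" ]; then
    echo present
  else
    echo MISSING
  fi
}

node05_state=$(check node05)
node09_state=$(check node09)
echo "node05: $node05_state"
echo "node09: $node09_state"
rm -rf "$spool"
```

On a real cluster the same `[ -d ... ]` test would be run on each slave node against the spool directory quoted in the strace output.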
>
>
> -----Original Message-----
> From: Reuti [mailto:[email protected]]
> Sent: Monday, February 18, 2013 1:54 PM
> To: Britto, Rajesh
> Cc: [email protected]
> Subject: Re: [gridengine users] Issue in Distributed jobs
>
> Hi,
>
> On 18.02.2013, at 04:53, Britto, Rajesh wrote:
>
>> Thanks for the information.
>>
>> It's not a ROCKS cluster; it's a normal SGE cluster with RHEL 5.2 installed.
>>
>> There is no firewall or SELinux enabled on these machines.
>
> Is it a fresh installation? I wonder about using 6.1u2, as there were
> versions after it which were still freely available.
>
> To investigate: it might be outside of SGE. Can you please submit such a
> hanging job, log in to the node, and issue:
>
> strace -p 1234
>
> with the PID of your hanging application. If it's just the `qrsh` hanging
> around, its return code might be retrieved later.
>
> One other possibility: one version of PVM failed to close stdout, and that
> had a similar effect IIRC. What type of parallel application is it (e.g. MPI)?
>
> -- Reuti
>
>
>> Thanks,
>> Britto.
>>
>> -----Original Message-----
>> From: Reuti [mailto:[email protected]]
>> Sent: Friday, February 15, 2013 10:15 PM
>> To: Britto, Rajesh
>> Cc: [email protected]
>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>
>> On 15.02.2013, at 08:22, Britto, Rajesh wrote:
>>
>>> Hi Reuti,
>>>
>>> Thanks for the information. I am using SGE 6.1u2.
>>
>> Ok, IIRC the builtin startup mechanism appeared only in 6.2.
>>
>>
>>> qconf -sconf:
>>>
>>> qlogin_command    telnet
>>> qlogin_daemon     /usr/sbin/in.telnetd
>>> rlogin_daemon     /usr/sbin/in.rlogind
>>
>> ROCKS? I remember that they add some lines at the end which override
>> settings appearing earlier in the file.
>>
>> Do you have any firewall installed on the system which could block the MPI
>> communication?
>>
>> -- Reuti
>>
>>
>>> The rsh command doesn't appear in the qconf -sconf output. We are using
>>> Open MPI for running parallel and distributed jobs.
>>>
>>> The application uses the mpirun command to invoke the distributed jobs.
>>> Please let me know if you need more clarification.
>>>
>>> Thanks,
>>> Britto.
>>>
>>>
>>> -----Original Message-----
>>> From: Reuti [mailto:[email protected]]
>>> Sent: Wednesday, February 13, 2013 7:00 PM
>>> To: Britto, Rajesh
>>> Cc: [email protected]
>>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>>
>>> Hi,
>>>
>>> On 13.02.2013, at 13:43, Britto, Rajesh wrote:
>>>
>>>> When I tried to execute a distributed job on a cluster, the job started
>>>> successfully.
>>>>
>>>> However, after some time, the job hung in the following process. Can
>>>> anyone please let me know what the issue could be?
>>>>
>>>> /opt/sge/utilbin/lx24-amd64/rsh -n -p 36425 <NodeName> exec
>>>> '/opt/sge/utilbin/lx24-amd64/qrsh_starter'
>>>> '/opt/spool/node/active_jobs/41270.1/1.node'
>>>
>>> It looks like you used the old startup method via `rsh` - which version
>>> of SGE is it? When setting:
>>>
>>> $ qconf -sconf
>>> ...
>>> qlogin_command    builtin
>>> qlogin_daemon     builtin
>>> rlogin_command    builtin
>>> rlogin_daemon     builtin
>>> rsh_command       builtin
>>> rsh_daemon        builtin
>>>
>>> the `rsh` shouldn't appear in the process tree. How did you start your
>>> application in the job script? How does the application start slave
>>> tasks: via Open MPI, MPICH2, ...?
>>>
>>>
>>>> FYI, the cluster has both passwordless ssh and rsh communication
>>>> between the nodes.
>>>
>>> In a Tight Integration setup, even parallel jobs don't need this.
>>>
>>> -- Reuti
>>>
>>
>>
>
>
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
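As the earliest reply in this thread explains, once the qlogin/rlogin/rsh entries in `qconf -sconf` all read builtin (available from 6.2 on), the `rsh` wrapper no longer appears in the process tree. A hedged sketch of testing saved `qconf -sconf` output for this, using the 6.1u2 settings quoted above as sample data (the parsing logic is illustrative only, not an SGE tool):

```shell
# Decide from saved `qconf -sconf` output whether the old rsh wrapper or the
# builtin mechanism (6.2+) would be used for qrsh startup. The sample config
# mirrors the 6.1u2 settings quoted in the thread (no rsh_command at all).
conf='qlogin_command               telnet
qlogin_daemon                /usr/sbin/in.telnetd
rlogin_daemon                /usr/sbin/in.rlogind'

if printf '%s\n' "$conf" | grep -q '^rsh_command[[:space:]]\{1,\}builtin'; then
  startup=builtin
else
  # rsh_command absent or not builtin: SGE falls back to its rsh wrapper,
  # which is exactly the process seen hanging in this thread.
  startup=rsh-wrapper
fi
echo "qrsh startup mechanism: $startup"
```

On a live cluster the heredoc-style sample would be replaced by `qconf -sconf` itself.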
