Guys, thanks for the information.

I can't find any relevant information in the node09 messages file, and I
suspect it wasn't able to create the spool directory.

>>>>> That error means that the process launched by qrsh on node09 exited
>>>>> before the rest of the slots so qmaster killed everything for you.
>>>>> I see these occasionally even when the parallel run finishes normally
>>>>> and exits because the first process to exit may be noticed by qmaster
>>>>> before the others.

Is there any solution for this scenario? I feel my case resembles the one
described above.

Thanks,
Britto.

-----Original Message-----
From: Jim Phillips [mailto:[email protected]]
Sent: Monday, February 25, 2013 9:52 PM
To: Reuti
Cc: Britto, Rajesh; [email protected]
Subject: Re: [gridengine users] Issue in Distributed jobs

That error means that the process launched by qrsh on node09 exited before
the rest of the slots so qmaster killed everything for you.

I see these occasionally even when the parallel run finishes normally and
exits because the first process to exit may be noticed by qmaster before
the others.

-Jim

On Mon, 25 Feb 2013, Reuti wrote:

> On 25.02.2013 at 08:03, Britto, Rajesh wrote:
>
>> I could see the following error message in the messages file:
>>
>> qmaster|mgr|E| tightly integrated parallel task 41406.1 task 1.node09
>> failed - killing job
>
> This is on the qmaster AFAICS. What is in the messages file of node09?
> Maybe the job-specific spool directory couldn't be created.
>
> -- Reuti
>
>
>> Can you please help me in this regard?
>>
>> Thanks,
>> Britto.
>>
>> -----Original Message-----
>> From: Reuti [mailto:[email protected]]
>> Sent: Friday, February 22, 2013 6:56 PM
>> To: Britto, Rajesh
>> Cc: [email protected]
>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>
>> On 22.02.2013 at 08:15, Britto, Rajesh wrote:
>>
>>> Thanks for the information. It's not a fresh installation; we already
>>> have 6.1 installed, which is in production, and we are not updating it.
>>>
>>> After running strace on the process ID that hangs, I found the
>>> following:
>>>
>>> /opt/sge/utilbin/lx24-amd64/rsh -n -p 51693 node09 exec
>>> '/opt/sge/utilbin/lx24-amd64/qrsh_starter'
>>> '/opt/spool/node09/active_jobs/41406.1/1.node09'
>>
>> To clarify this:
>>
>> the directory /opt/spool/node09/active_jobs/41406.1/1.node09 should be on
>> the slave node. It's not created? Anything in the messages file of the
>> node about this failure?
>>
>> -- Reuti
>>
>>
>>> The above command hangs: it's trying to find the directory
>>> '/opt/spool/node09/active_jobs/41406.1/1.node09', which is not
>>> available, whereas /opt/spool/node05/active_jobs/41406.1/ is available.
>>>
>>> I submitted a distributed job and it was running on node09 and node05
>>> in the grid; the active_jobs folder exists for node05 (since the parent
>>> process was invoked from this node) but not for node09.
>>>
>>> I am using the following PE for my distributed job:
>>>
>>> pe_name            Distributed
>>> slots              94
>>> user_lists         NONE
>>> xuser_lists        NONE
>>> start_proc_args    /bin/true
>>> stop_proc_args     /bin/true
>>> allocation_rule    $fill_up
>>> control_slaves     TRUE
>>> job_is_first_task  FALSE
>>> urgency_slots      min
>>>
>>> Can you please help me to resolve the issue?
>>>
>>> Thanks,
>>> Britto.
>>>
>>>
>>> -----Original Message-----
>>> From: Reuti [mailto:[email protected]]
>>> Sent: Monday, February 18, 2013 1:54 PM
>>> To: Britto, Rajesh
>>> Cc: [email protected]
>>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>>
>>> Hi,
>>>
>>> On 18.02.2013 at 04:53, Britto, Rajesh wrote:
>>>
>>>> Thanks for the information.
>>>>
>>>> It's not a ROCKS cluster; it's a normal SGE cluster with RHEL 5.2
>>>> installed.
>>>>
>>>> There is no firewall or SELinux enabled on these machines.
>>>
>>> Is it a fresh installation? I wonder about using 6.1u2, as there were
>>> versions after it which were still freely available.
>>>
>>> To investigate: it might be outside of SGE.
>>> Can you please submit such a hanging job, log in to the node and issue:
>>>
>>> strace -p 1234
>>>
>>> with the PID of your hanging application? If it's just the `qrsh`
>>> hanging around, its return code might be retrieved later.
>>>
>>> One other possibility: one version of PVM missed closing stdout, and it
>>> had a similar effect IIRC. What type of parallel application is it
>>> (e.g. MPI)?
>>>
>>> -- Reuti
>>>
>>>
>>>> Thanks,
>>>> Britto.
>>>>
>>>> -----Original Message-----
>>>> From: Reuti [mailto:[email protected]]
>>>> Sent: Friday, February 15, 2013 10:15 PM
>>>> To: Britto, Rajesh
>>>> Cc: [email protected]
>>>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>>>
>>>> On 15.02.2013 at 08:22, Britto, Rajesh wrote:
>>>>
>>>>> Hi Reuti,
>>>>>
>>>>> Thanks for the information. I am using SGE 6.1u2.
>>>>
>>>> OK, IIRC the builtin startup mechanism appeared only in 6.2.
>>>>
>>>>
>>>>> qconf -sconf:
>>>>>
>>>>> qlogin_command    telnet
>>>>> qlogin_daemon     /usr/sbin/in.telnetd
>>>>> rlogin_daemon     /usr/sbin/in.rlogind
>>>>
>>>> ROCKS? I remember that they added some lines at the end which override
>>>> settings appearing earlier in the file.
>>>>
>>>> Do you have any firewall installed on the system which could block the
>>>> MPI communication?
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>> The rsh command doesn't appear in the qconf -sconf output. We are
>>>>> using Open MPI for running parallel and distributed jobs.
>>>>>
>>>>> The application uses the mpirun command to invoke the distributed
>>>>> jobs. Please let me know if you need more clarification.
>>>>>
>>>>> Thanks,
>>>>> Britto.
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Reuti [mailto:[email protected]]
>>>>> Sent: Wednesday, February 13, 2013 7:00 PM
>>>>> To: Britto, Rajesh
>>>>> Cc: [email protected]
>>>>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>>>>
>>>>> Hi,
>>>>>
>>>>> On 13.02.2013 at 13:43, Britto, Rajesh wrote:
>>>>>
>>>>>> When I tried to execute a distributed job on a cluster, the job
>>>>>> started successfully.
>>>>>>
>>>>>> However, after some time, the job hung on the following process. Can
>>>>>> anyone please let me know what could be the issue?
>>>>>>
>>>>>> /opt/sge/utilbin/lx24-amd64/rsh -n -p 36425 <NodeName> exec
>>>>>> '/opt/sge/utilbin/lx24-amd64/qrsh_starter'
>>>>>> '/opt/spool/node/active_jobs/41270.1/1.node'
>>>>>
>>>>> It looks like you are using the old startup method via `rsh` - which
>>>>> version of SGE is it? When setting:
>>>>>
>>>>> $ qconf -sconf
>>>>> ...
>>>>> qlogin_command    builtin
>>>>> qlogin_daemon     builtin
>>>>> rlogin_command    builtin
>>>>> rlogin_daemon     builtin
>>>>> rsh_command       builtin
>>>>> rsh_daemon        builtin
>>>>>
>>>>> the `rsh` shouldn't appear in the process tree. How did you start your
>>>>> application in the job script? How does the application start slave
>>>>> tasks: via Open MPI, MPICH2 ...?
>>>>>
>>>>>
>>>>>> FYI, the cluster has both password-less ssh and rsh communication
>>>>>> between the nodes.
>>>>>
>>>>> In a Tight Integration setup, even parallel jobs don't need this.
>>>>>
>>>>> -- Reuti

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
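[Editor's addendum] The per-node check discussed above - whether the job-specific spool directory that `qrsh_starter` expects actually exists on the slave node - can be sketched as a small shell helper. The function name `check_active_job_dir` and the scratch paths in the usage comments are illustrative, not part of SGE; only the `active_jobs` layout mirrors the paths quoted in the thread. Substitute your own cell's local spool directory and job/task id.

```shell
#!/bin/sh
# Sketch (hypothetical helper, not an SGE tool): report whether the
# per-job spool directory for a tightly integrated slave task exists.
check_active_job_dir() {
    spool_base=$1   # e.g. /opt/spool
    node=$2         # e.g. node09
    job_task=$3     # e.g. 41406.1

    dir="$spool_base/$node/active_jobs/$job_task"
    if [ -d "$dir" ]; then
        echo "OK: $dir exists"
        return 0
    else
        # In the thread, a missing directory coincided with the
        # hanging rsh/qrsh_starter on that node.
        echo "MISSING: $dir"
        return 1
    fi
}

# Example usage against a scratch tree:
#   mkdir -p /tmp/spool/node05/active_jobs/41406.1
#   check_active_job_dir /tmp/spool node05 41406.1
#   check_active_job_dir /tmp/spool node09 41406.1
```

Run it on the slave node itself (here, node09), since the question in the thread is whether the directory was created locally there; also check that node's execd messages file for creation errors.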
