On 26.02.2013, at 09:51, Britto, Rajesh wrote:

> Guys,
>
> Thanks for the information.
>
> I can't find any information in the node09 messages file, and I suspect
> it wasn't able to create the spool directory.

Is the file system full? What are the permissions of /var/spool/sge, or
wherever you put the spooling directory? The account under which SGE runs
as the effective user ID must be able to write there. You can `su` to this
account and try to create some files/directories by hand.

-- Reuti
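A concrete way to run the check Reuti describes, as a sketch: "sgeadmin"
is a placeholder for the admin user of your installation, and the spool
path should be the node's execd spooling directory (in this thread it
appears to live under /opt/spool/<node>):

    # as root on the problem node (node09)
    $ su - sgeadmin
    $ cd /opt/spool/node09
    $ ls -ld .                         # owner and permissions of the spool dir
    $ touch testfile && mkdir testdir  # both must succeed for job spooling
    $ rm -rf testfile testdir
    $ df -h .                          # rule out a full file system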
>>>>>> That error means that the process launched by qrsh on node09 exited
>>>>>> before the rest of the slots so qmaster killed everything for you.
>
>>>>>> I see these occasionally even when the parallel run finishes normally
>>>>>> and exits because the first process to exit may be noticed by qmaster
>>>>>> before the others.
>
> Is there any solution for the above scenario, since I feel my case
> resembles the one described above?
>
> Thanks,
> Britto.
>
>
> -----Original Message-----
> From: Jim Phillips [mailto:[email protected]]
> Sent: Monday, February 25, 2013 9:52 PM
> To: Reuti
> Cc: Britto, Rajesh; [email protected]
> Subject: Re: [gridengine users] Issue in Distributed jobs
>
>
> That error means that the process launched by qrsh on node09 exited before
> the rest of the slots so qmaster killed everything for you.
>
> I see these occasionally even when the parallel run finishes normally and
> exits because the first process to exit may be noticed by qmaster before
> the others.
>
> -Jim
>
> On Mon, 25 Feb 2013, Reuti wrote:
>
>> On 25.02.2013, at 08:03, Britto, Rajesh wrote:
>>
>>> I could see the following error message in the messages file:
>>>
>>> qmaster |mgr|E| tightly integrated parallel task 41406.1 task 1.node09
>>> failed - killing job
>>
>> This is on the qmaster, AFAICS. What is in the messages file of node09?
>> Maybe the job-specific spool directory couldn't be created.
>>
>> -- Reuti
>>
>>
>>> Can you please help me in this regard?
>>>
>>> Thanks,
>>> Britto.
>>>
>>> -----Original Message-----
>>> From: Reuti [mailto:[email protected]]
>>> Sent: Friday, February 22, 2013 6:56 PM
>>> To: Britto, Rajesh
>>> Cc: [email protected]
>>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>>
>>> On 22.02.2013, at 08:15, Britto, Rajesh wrote:
>>>
>>>> Thanks for the information. It's not a fresh installation; we already
>>>> installed 6.1, which is in production, and we are not updating it.
>>>>
>>>> After running strace on the process that hangs, I found the following:
>>>>
>>>> /opt/sge/utilbin/lx24-amd64/rsh -n -p 51693 node09 exec
>>>> '/opt/sge/utilbin/lx24-amd64/qrsh_starter'
>>>> '/opt/spool/node09/active_jobs/41406.1/1.node09'
>>>
>>> To clarify this:
>>>
>>> the directory /opt/spool/node09/active_jobs/41406.1/1.node09 should be
>>> on the slave node. It's not created? Anything in the messages file of
>>> the node about this failure?
>>>
>>> -- Reuti
>>>
>>>
>>>> The above command hangs; it is trying to find the file
>>>> '/opt/spool/node09/active_jobs/41406.1/1.node09', which does not exist,
>>>> whereas /opt/spool/node05/active_jobs/41406.1/ does exist.
>>>>
>>>> I submitted a distributed job which ran on node09 and node05 in the
>>>> grid, and the active_jobs folder exists for node05 (since the parent
>>>> process was invoked from this node) but not for node09.
>>>>
>>>> I am using the following PE for my distributed job:
>>>>
>>>> pe_name            Distributed
>>>> slots              94
>>>> user_lists         NONE
>>>> xuser_lists        NONE
>>>> start_proc_args    /bin/true
>>>> stop_proc_args     /bin/true
>>>> allocation_rule    $fill_up
>>>> control_slaves     TRUE
>>>> job_is_first_task  FALSE
>>>> urgency_slots      min
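For reference, a PE like the one above is typically used from a jobscript
along these lines. This is only a sketch (the job name and binary are made
up), and it assumes Open MPI was built with SGE support, so that mpirun
starts the slave tasks via qrsh for the tight integration:

    #!/bin/sh
    #$ -N distjob          # hypothetical job name
    #$ -pe Distributed 8   # request slots from the PE defined above
    #$ -cwd
    # With SGE support compiled in, Open MPI reads $PE_HOSTFILE itself
    # and launches the remote ranks via qrsh.
    mpirun -np $NSLOTS ./my_parallel_app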
>>>> Can you please help me to resolve the issue?
>>>>
>>>> Thanks,
>>>> Britto.
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Reuti [mailto:[email protected]]
>>>> Sent: Monday, February 18, 2013 1:54 PM
>>>> To: Britto, Rajesh
>>>> Cc: [email protected]
>>>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>>>
>>>> Hi,
>>>>
>>>> On 18.02.2013, at 04:53, Britto, Rajesh wrote:
>>>>
>>>>> Thanks for the information.
>>>>>
>>>>> It's not a ROCKS cluster; it's a normal SGE cluster with RHEL 5.2
>>>>> installed.
>>>>>
>>>>> There is no firewall or SELinux enabled on these machines.
>>>>
>>>> Is it a fresh installation? I wonder about using 6.1u2, as there were
>>>> versions after it which were still freely available.
>>>>
>>>> To investigate: it might be outside of SGE. Can you please submit such
>>>> a hanging job, log in to the node and issue:
>>>>
>>>> strace -p 1234
>>>>
>>>> with the PID of your hanging application. If it's just the `qrsh`
>>>> hanging around, its return code might be retrieved later.
>>>>
>>>> One other possibility: one version of PVM failed to close stdout, and
>>>> it had a similar effect IIRC. What type of parallel application is it
>>>> (e.g. MPI)?
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>> Thanks,
>>>>> Britto.
>>>>>
>>>>> -----Original Message-----
>>>>> From: Reuti [mailto:[email protected]]
>>>>> Sent: Friday, February 15, 2013 10:15 PM
>>>>> To: Britto, Rajesh
>>>>> Cc: [email protected]
>>>>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>>>>
>>>>> On 15.02.2013, at 08:22, Britto, Rajesh wrote:
>>>>>
>>>>>> Hi Reuti,
>>>>>>
>>>>>> Thanks for the information. I am using SGE 6.1u2.
>>>>>
>>>>> Ok, IIRC the builtin startup mechanism appeared only in 6.2.
>>>>>
>>>>>
>>>>>> qconf -sconf:
>>>>>>
>>>>>> qlogin_command               telnet
>>>>>> qlogin_daemon                /usr/sbin/in.telnetd
>>>>>> rlogin_daemon                /usr/sbin/in.rlogind
>>>>>
>>>>> ROCKS? I remember that they added some lines at the end which override
>>>>> settings appearing earlier in the file.
>>>>>
>>>>> Do you have any firewall installed on the system which could block the
>>>>> MPI communication?
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>
>>>>>> The rsh command doesn't appear in the qconf -sconf output. We are
>>>>>> using Open MPI for running parallel and distributed jobs.
>>>>>>
>>>>>> The application uses the mpirun command to invoke the distributed
>>>>>> jobs. Please let me know if you need more clarification.
>>>>>>
>>>>>> Thanks,
>>>>>> Britto.
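Since tight integration depends on Open MPI having been built with SGE
support, that is worth verifying on this cluster; a sketch of the standard
check:

    # Lists Open MPI's compiled-in components; gridengine entries
    # (e.g. for the "ras" framework) appear when SGE support is built in.
    $ ompi_info | grep gridengine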
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Reuti [mailto:[email protected]]
>>>>>> Sent: Wednesday, February 13, 2013 7:00 PM
>>>>>> To: Britto, Rajesh
>>>>>> Cc: [email protected]
>>>>>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 13.02.2013, at 13:43, Britto, Rajesh wrote:
>>>>>>
>>>>>>> When I tried to execute a distributed job on a cluster, the job
>>>>>>> started successfully.
>>>>>>>
>>>>>>> However, after some time, the job hung at the following process.
>>>>>>> Can anyone please let me know what the issue could be?
>>>>>>>
>>>>>>> /opt/sge/utilbin/lx24-amd64/rsh -n -p 36425 <NodeName> exec
>>>>>>> '/opt/sge/utilbin/lx24-amd64/qrsh_starter'
>>>>>>> '/opt/spool/node/active_jobs/41270.1/1.node'
>>>>>>
>>>>>> It looks like you used the old startup method by `rsh` - which
>>>>>> version of SGE is it? When setting:
>>>>>>
>>>>>> $ qconf -sconf
>>>>>> ...
>>>>>> qlogin_command               builtin
>>>>>> qlogin_daemon                builtin
>>>>>> rlogin_command               builtin
>>>>>> rlogin_daemon                builtin
>>>>>> rsh_command                  builtin
>>>>>> rsh_daemon                   builtin
>>>>>>
>>>>>> the `rsh` shouldn't appear in the process tree. How did you start
>>>>>> your application in the jobscript? How does the application start
>>>>>> slave tasks: via Open MPI, MPICH2, ...?
>>>>>>
>>>>>>
>>>>>>> FYI, the cluster has both passwordless ssh and rsh communication
>>>>>>> between the nodes.
>>>>>>
>>>>>> In a Tight Integration setup even parallel jobs don't need this.
>>>>>>
>>>>>> -- Reuti

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
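A closing note on the builtin method shown above: on SGE 6.2 or later (per
Reuti, 6.1 does not have it), it is enabled in the global configuration.
A sketch, requiring manager privileges:

    # opens the global configuration in an editor
    $ qconf -mconf
    # then set the interactive/rsh entries to the builtin method:
    #   qlogin_command   builtin
    #   qlogin_daemon    builtin
    #   rlogin_command   builtin
    #   rlogin_daemon    builtin
    #   rsh_command      builtin
    #   rsh_daemon       builtin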
