Hi,

On 27.02.2013, at 13:18, Britto, Rajesh wrote:
> Space is available on the partition, and the spool directory has 777
> permissions.
>
> I can create folders using the root account (as well as with user
> accounts), and SGE is running as the root user.

Is SGE running under root in your case? Usually it switches to the admin user:

$ ps -e f -o user,ruser,command
USER     RUSER COMMAND
...
sgeadmin root  /usr/sge/bin/lx24-amd64/sge_qmaster

As you are using the classic rsh startup: does the file
/opt/sge/utilbin/lx24-amd64/rsh have the suid bit set, and is it also
honored on the nodes?

-- Reuti
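Verifying both points on a node might look like the following (a sketch only; the sgeadmin account name is taken from the ps listing above, the /opt/spool path from later in this thread, and write_test is just a scratch file name):

$ ls -l /opt/sge/utilbin/lx24-amd64/rsh
  # an 's' in the owner execute bit (e.g. -rwsr-xr-x, owner root) means the suid bit is set
$ mount | grep nosuid
  # a filesystem mounted with 'nosuid' silently ignores the suid bit
$ su - sgeadmin -c 'touch /opt/spool/node09/write_test && rm /opt/spool/node09/write_test'
  # the admin user must be able to write to the execd spool directory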
> Thanks,
> Britto.
>
> -----Original Message-----
> From: Reuti [mailto:[email protected]]
> Sent: Tuesday, February 26, 2013 4:24 PM
> To: Britto, Rajesh
> Cc: Jim Phillips; [email protected]
> Subject: Re: [gridengine users] Issue in Distributed jobs
>
> On 26.02.2013, at 09:51, Britto, Rajesh wrote:
>
>> Guys,
>>
>> Thanks for the information.
>>
>> I don't find any information in the node09 messages file, and I suspect it
>> couldn't create any spool directory.
>
> The filesystem isn't full? What are the permissions of /var/spool/sge, or
> wherever you put the spooling directory? The account under which SGE runs
> as effective user ID must be able to write there. You can `su` to this
> account and try to create some files/directories by hand.
>
> -- Reuti
>
>
>>>>>>> That error means that the process launched by qrsh on node09 exited
>>>>>>> before the rest of the slots, so qmaster killed everything for you.
>>
>>>>>>> I see these occasionally even when the parallel run finishes normally
>>>>>>> and exits, because the first process to exit may be noticed by qmaster
>>>>>>> before the others.
>>
>> Is there any solution for the above scenario, since I feel my case
>> resembles the one above?
>>
>> Thanks,
>> Britto.
>>
>>
>> -----Original Message-----
>> From: Jim Phillips [mailto:[email protected]]
>> Sent: Monday, February 25, 2013 9:52 PM
>> To: Reuti
>> Cc: Britto, Rajesh; [email protected]
>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>
>>
>> That error means that the process launched by qrsh on node09 exited before
>> the rest of the slots, so qmaster killed everything for you.
>>
>> I see these occasionally even when the parallel run finishes normally and
>> exits, because the first process to exit may be noticed by qmaster before
>> the others.
>>
>> -Jim
>>
>> On Mon, 25 Feb 2013, Reuti wrote:
>>
>>> On 25.02.2013, at 08:03, Britto, Rajesh wrote:
>>>
>>>> I could see the following error message in the messages file:
>>>>
>>>> Qmaster |mgr|E| tightly integrated parallel task 41406.1 task 1.node09
>>>> failed - killing job
>>>
>>> This is on the qmaster AFAICS. What is in the messages file of node09?
>>> Maybe the job-specific spool directory couldn't be created.
>>>
>>> -- Reuti
>>>
>>>
>>>> Can you please help me in this regard?
>>>>
>>>> Thanks,
>>>> Britto.
>>>>
>>>> -----Original Message-----
>>>> From: Reuti [mailto:[email protected]]
>>>> Sent: Friday, February 22, 2013 6:56 PM
>>>> To: Britto, Rajesh
>>>> Cc: [email protected]
>>>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>>>
>>>> On 22.02.2013, at 08:15, Britto, Rajesh wrote:
>>>>
>>>>> Thanks for the information. It's not a fresh installation; we already
>>>>> installed 6.1, which is in production, and we are not updating it.
>>>>>
>>>>> After running strace on the PID of the process where it hangs, I found
>>>>> the following:
>>>>>
>>>>> /opt/sge/utilbin/lx24-amd64/rsh -n -p 51693 node09 exec
>>>>> '/opt/sge/utilbin/lx24-amd64/qrsh_starter'
>>>>> '/opt/spool/node09/active_jobs/41406.1/1.node09'
>>>>
>>>> To clarify this:
>>>>
>>>> the directory /opt/spool/node09/active_jobs/41406.1/1.node09 should be on
>>>> the slave node. It's not created? Anything in the messages file of the
>>>> node about this failure?
>>>>
>>>> -- Reuti
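Checked on node09 itself, that might look like the following (a sketch; the directory comes from the qrsh_starter argument quoted above, and the execd messages file is assumed to sit at the top of that node's spool directory):

$ ls -ld /opt/spool/node09/active_jobs/41406.1/1.node09
  # is the task-specific spool directory created at all?
$ grep 41406 /opt/spool/node09/messages
  # any execd log entries mentioning this job?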
>>>>> The above command hangs: it's trying to find the file
>>>>> '/opt/spool/node09/active_jobs/41406.1/1.node09', which is not available,
>>>>> whereas /opt/spool/node05/active_jobs/41406.1/ is available.
>>>>>
>>>>> I have submitted a distributed job that was running on node09 and node05
>>>>> in the grid, and the active_jobs folder exists for node05 (since the
>>>>> parent process was invoked from this node) but not for node09.
>>>>>
>>>>> I am using the following PE for my distributed job:
>>>>>
>>>>> pe_name            Distributed
>>>>> slots              94
>>>>> user_lists         NONE
>>>>> xuser_lists        NONE
>>>>> start_proc_args    /bin/true
>>>>> stop_proc_args     /bin/true
>>>>> allocation_rule    $fill_up
>>>>> control_slaves     TRUE
>>>>> job_is_first_task  FALSE
>>>>> urgency_slots      min
>>>>>
>>>>> Can you please help me to resolve the issue?
>>>>>
>>>>> Thanks,
>>>>> Britto.
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Reuti [mailto:[email protected]]
>>>>> Sent: Monday, February 18, 2013 1:54 PM
>>>>> To: Britto, Rajesh
>>>>> Cc: [email protected]
>>>>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>>>>
>>>>> Hi,
>>>>>
>>>>> On 18.02.2013, at 04:53, Britto, Rajesh wrote:
>>>>>
>>>>>> Thanks for the information.
>>>>>>
>>>>>> It's not a ROCKS cluster, it's a normal SGE cluster with RHEL 5.2
>>>>>> installed.
>>>>>>
>>>>>> There is no firewall or SELinux enabled on these machines.
>>>>>
>>>>> Is it a fresh installation? I wonder about using 6.1u2, as there were
>>>>> versions after it which were still freely available.
>>>>>
>>>>> To investigate: it might be outside of SGE. Can you please submit such a
>>>>> hanging job, log in to the node and issue:
>>>>>
>>>>> strace -p 1234
>>>>>
>>>>> with the PID of your hanging application. If it's just the `qrsh`
>>>>> hanging around, its return code might be retrieved later.
>>>>>
>>>>> One other possibility: one version of PVM failed to close stdout, which
>>>>> had a similar effect IIRC. What type of parallel application is it
>>>>> (e.g. MPI)?
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>
>>>>>> Thanks,
>>>>>> Britto.
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Reuti [mailto:[email protected]]
>>>>>> Sent: Friday, February 15, 2013 10:15 PM
>>>>>> To: Britto, Rajesh
>>>>>> Cc: [email protected]
>>>>>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>>>>>
>>>>>> On 15.02.2013, at 08:22, Britto, Rajesh wrote:
>>>>>>
>>>>>>> Hi Reuti,
>>>>>>>
>>>>>>> Thanks for the information. I am using SGE 6.1u2.
>>>>>>
>>>>>> Ok, IIRC the builtin startup mechanism appeared only in 6.2.
>>>>>>
>>>>>>
>>>>>>> qconf -sconf:
>>>>>>>
>>>>>>> qlogin_command               telnet
>>>>>>> qlogin_daemon                /usr/sbin/in.telnetd
>>>>>>> rlogin_daemon                /usr/sbin/in.rlogind
>>>>>>
>>>>>> ROCKS? I remember that they added some lines at the end which override
>>>>>> settings appearing earlier in the file.
>>>>>>
>>>>>> Do you have any firewall installed on the system which could block the
>>>>>> MPI communication?
>>>>>>
>>>>>> -- Reuti
>>>>>>
>>>>>>
>>>>>>> The rsh command doesn't appear in the qconf -sconf output. We are
>>>>>>> using Open MPI for running parallel and distributed jobs.
>>>>>>>
>>>>>>> The application uses the mpirun command to invoke the distributed
>>>>>>> jobs. Please let me know for more clarification.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Britto.
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Reuti [mailto:[email protected]]
>>>>>>> Sent: Wednesday, February 13, 2013 7:00 PM
>>>>>>> To: Britto, Rajesh
>>>>>>> Cc: [email protected]
>>>>>>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 13.02.2013, at 13:43, Britto, Rajesh wrote:
>>>>>>>
>>>>>>>> When I tried to execute a distributed job on the cluster, the job
>>>>>>>> started successfully.
>>>>>>>>
>>>>>>>> However, after some time, the job hung in the following process. Can
>>>>>>>> anyone please let me know what could be the issue?
>>>>>>>>
>>>>>>>> /opt/sge/utilbin/lx24-amd64/rsh -n -p 36425 <NodeName> exec
>>>>>>>> '/opt/sge/utilbin/lx24-amd64/qrsh_starter'
>>>>>>>> '/opt/spool/node/active_jobs/41270.1/1.node'
>>>>>>>
>>>>>>> It looks like you used the old startup method via `rsh` - which
>>>>>>> version of SGE is it? When setting:
>>>>>>>
>>>>>>> $ qconf -sconf
>>>>>>> ...
>>>>>>> qlogin_command               builtin
>>>>>>> qlogin_daemon                builtin
>>>>>>> rlogin_command               builtin
>>>>>>> rlogin_daemon                builtin
>>>>>>> rsh_command                  builtin
>>>>>>> rsh_daemon                   builtin
>>>>>>>
>>>>>>> the `rsh` shouldn't appear in the process tree. How did you start your
>>>>>>> application in the jobscript? How does the application start slave
>>>>>>> tasks: by Open MPI, MPICH2 ...?
>>>>>>>
>>>>>>>
>>>>>>>> FYI, the cluster has both passwordless ssh and rsh communication
>>>>>>>> between the nodes.
>>>>>>>
>>>>>>> In a Tight Integration setup even parallel jobs don't need this.
>>>>>>>
>>>>>>> -- Reuti
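For reference, a tightly integrated Open MPI job script can be as small as the following (a sketch assuming an Open MPI build configured with --with-sge; the slot count and application name are made up):

#!/bin/sh
#$ -pe Distributed 8
#$ -cwd
# An SGE-aware mpirun reads the PE allocation from the environment and
# starts the slave tasks via qrsh -inherit, so no -np or hostfile is needed.
mpirun ./my_parallel_app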
