On 25.02.2013 at 08:03, Britto, Rajesh wrote:

> I could see the following error message in the messages files.
>
> Qmaster |mgr|E| tightly integrated parallel task 41406.1 task 1.node09 failed - killing job
This is on the qmaster AFAICS. What is in the messages file of node09? Maybe the job-specific spool directory couldn't be created.

-- Reuti

> Can you please help me in this regard?
>
> Thanks,
> Britto.
>
> -----Original Message-----
> From: Reuti [mailto:[email protected]]
> Sent: Friday, February 22, 2013 6:56 PM
> To: Britto, Rajesh
> Cc: [email protected]
> Subject: Re: [gridengine users] Issue in Distributed jobs
>
> On 22.02.2013 at 08:15, Britto, Rajesh wrote:
>
>> Thanks for the information. It's not a fresh installation; we already installed 6.1, which is in production, and we are not updating it.
>>
>> After running strace on the process ID where it hangs, I found the following:
>>
>> /opt/sge/utilbin/lx24-amd64/rsh -n -p 51693 node09 exec '/opt/sge/utilbin/lx24-amd64/qrsh_starter' '/opt/spool/node09/active_jobs/41406.1/1.node09'
>
> To clarify this:
>
> the directory /opt/spool/node09/active_jobs/41406.1/1.node09 should be on the slave node. It's not created? Anything in the messages file of the node about this failure?
>
> -- Reuti
>
>
>> The above command hangs; it is trying to find the file '/opt/spool/node09/active_jobs/41406.1/1.node09', which is not available, whereas /opt/spool/node05/active_jobs/41406.1/ is available.
>>
>> I submitted a distributed job and it was running on node09 and node05 in the grid; the active_jobs folder contains an entry for node05 (since the parent process was invoked from this node) but not for node09.
>>
>> I am using the following PE for my distributed job:
>>
>> pe_name            Distributed
>> slots              94
>> user_lists         NONE
>> xuser_lists        NONE
>> start_proc_args    /bin/true
>> stop_proc_args     /bin/true
>> allocation_rule    $fill_up
>> control_slaves     TRUE
>> job_is_first_task  FALSE
>> urgency_slots      min
>>
>> Can you please help me resolve the issue?
>>
>> Thanks,
>> Britto.
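Reuti's suggestion above can be checked directly on the slave node. A minimal sketch, assuming the execd spool layout shown in this thread (/opt/spool/&lt;host&gt;) and the job id 41406.1 from the error message; both are assumptions taken from the quoted messages, not a general SGE default:

```shell
# Run on the slave node (node09). Paths are assumptions from this thread:
# the local execd spool is taken to live under /opt/spool/<hostname>.
SPOOL=/opt/spool/$(hostname -s)
JOB=41406.1   # job.task id from the qmaster error message

# Was the job-specific directory ever created for this task?
ls -ld "$SPOOL"/active_jobs/"$JOB"* 2>/dev/null \
  || echo "no active_jobs entry for $JOB"

# Any execd log lines mentioning this job?
grep "$JOB" "$SPOOL/messages" 2>/dev/null | tail -n 20
```

If the directory is missing and the execd's messages file shows nothing, the creation request from the master task likely never reached (or never succeeded on) the slave execd.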
>>
>>
>> -----Original Message-----
>> From: Reuti [mailto:[email protected]]
>> Sent: Monday, February 18, 2013 1:54 PM
>> To: Britto, Rajesh
>> Cc: [email protected]
>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>
>> Hi,
>>
>> On 18.02.2013 at 04:53, Britto, Rajesh wrote:
>>
>>> Thanks for the information.
>>>
>>> It's not a ROCKS cluster; it's a normal SGE cluster with RHEL 5.2 installed.
>>>
>>> There is no firewall or SELinux enabled on these machines.
>>
>> Is it a fresh installation? I wonder about using 6.1u2, as there were later versions that were still freely available.
>>
>> To investigate: it might be outside of SGE. Can you please submit such a hanging job, log in to the node, and issue:
>>
>> strace -p 1234
>>
>> with the PID of your hanging application? If it's just the `qrsh` hanging around, its return code might be retrieved later.
>>
>> One other possibility: one version of PVM failed to close stdout, which had a similar effect IIRC. What type of parallel application is it (e.g. MPI)?
>>
>> -- Reuti
>>
>>
>>> Thanks,
>>> Britto.
>>>
>>> -----Original Message-----
>>> From: Reuti [mailto:[email protected]]
>>> Sent: Friday, February 15, 2013 10:15 PM
>>> To: Britto, Rajesh
>>> Cc: [email protected]
>>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>>
>>> On 15.02.2013 at 08:22, Britto, Rajesh wrote:
>>>
>>>> Hi Reuti,
>>>>
>>>> Thanks for the information. I am using SGE 6.1u2.
>>>
>>> Ok, IIRC the builtin startup mechanism appeared only in 6.2.
>>>
>>>
>>>> qconf -sconf:
>>>>
>>>> qlogin_command    telnet
>>>> qlogin_daemon     /usr/sbin/in.telnetd
>>>> rlogin_daemon     /usr/sbin/in.rlogind
>>>
>>> ROCKS? I remember that they added some lines at the end that override settings appearing earlier in the file.
>>>
>>> Do you have any firewall installed on the system that could block the MPI communication?
>>>
>>> -- Reuti
>>>
>>>
>>>> The rsh command doesn't appear in the qconf -sconf output.
>>>> We are using Open MPI for running parallel and distributed jobs.
>>>>
>>>> The application uses the mpirun command to invoke the distributed jobs. Please let me know if you need more clarification.
>>>>
>>>> Thanks,
>>>> Britto.
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Reuti [mailto:[email protected]]
>>>> Sent: Wednesday, February 13, 2013 7:00 PM
>>>> To: Britto, Rajesh
>>>> Cc: [email protected]
>>>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>>>
>>>> Hi,
>>>>
>>>> On 13.02.2013 at 13:43, Britto, Rajesh wrote:
>>>>
>>>>> When I tried to execute a distributed job on a cluster, the job started successfully.
>>>>>
>>>>> However, after some time, the job hung at the following process. Can anyone please let me know what the issue could be?
>>>>>
>>>>> /opt/sge/utilbin/lx24-amd64/rsh -n -p 36425 <NodeName> exec '/opt/sge/utilbin/lx24-amd64/qrsh_starter' '/opt/spool/node/active_jobs/41270.1/1.node'
>>>>
>>>> It looks like you used the old startup method via `rsh` - which version of SGE is it? When setting:
>>>>
>>>> $ qconf -sconf
>>>> ...
>>>> qlogin_command    builtin
>>>> qlogin_daemon     builtin
>>>> rlogin_command    builtin
>>>> rlogin_daemon     builtin
>>>> rsh_command       builtin
>>>> rsh_daemon        builtin
>>>>
>>>> the `rsh` shouldn't appear in the process tree. How did you start your application in the jobscript? How does the application start slave tasks: via Open MPI, MPICH2, ...?
>>>>
>>>>
>>>>> FYI, the cluster has both password-less ssh and rsh communication between the nodes.
>>>>
>>>> In a Tight Integration setup, even parallel jobs don't need this.
>>>>
>>>> -- Reuti

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
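The hanging rsh call discussed throughout this thread is SGE's internal task-startup path for tight integration. The same path can be exercised by hand from inside a running parallel job via `qrsh -inherit`; a minimal sketch (the guard makes it safe to run outside a job, where PE_HOSTFILE is unset):

```shell
# Minimal check of tightly integrated task startup, meant to be run from
# inside a parallel job's jobscript. SGE sets PE_HOSTFILE for parallel
# jobs; outside a job the guard below just reports that fact.
if [ -n "${PE_HOSTFILE:-}" ]; then
    # One line per granted host: "<host> <slots> <queue> <processors>"
    while read -r host slots rest; do
        echo "starting a test task on $host"
        # Uses the same startup path as the hung rsh/qrsh_starter pair.
        qrsh -inherit "$host" hostname
    done < "$PE_HOSTFILE"
else
    echo "not inside an SGE parallel job (PE_HOSTFILE unset)"
fi
```

If `qrsh -inherit` to a particular host hangs the way the thread describes, that isolates the problem to task startup on that host rather than to the MPI layer above it.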
