On 25.02.2013 at 08:03, Britto, Rajesh wrote:

> I could see the following error message in the messages files.
>
> Qmaster |mgr|E| tightly integrated parallel task 41406.1 task 1.node09 failed - killing job
This is on the qmaster AFAICS. What is in the messages file of node09? Maybe the job-specific spool directory couldn't be created.

-- Reuti

> Can you please help me in this regard?
>
> Thanks,
> Britto.
>
> -----Original Message-----
> From: Reuti [mailto:[email protected]]
> Sent: Friday, February 22, 2013 6:56 PM
> To: Britto, Rajesh
> Cc: [email protected]
> Subject: Re: [gridengine users] Issue in Distributed jobs
>
> On 22.02.2013 at 08:15, Britto, Rajesh wrote:
>
>> Thanks for the information. It's not a fresh installation; we already installed 6.1, which is in production, and we are not updating it.
>>
>> After running strace on the process ID where it hangs, I found the following:
>>
>> /opt/sge/utilbin/lx24-amd64/rsh -n -p 51693 node09 exec '/opt/sge/utilbin/lx24-amd64/qrsh_starter' '/opt/spool/node09/active_jobs/41406.1/1.node09'
>
> To clarify this:
>
> the directory /opt/spool/node09/active_jobs/41406.1/1.node09 should be on the slave node. It's not created? Anything in the messages file of the node about this failure?
>
> -- Reuti
>
>
>> The above command hangs; it is trying to find the file '/opt/spool/node09/active_jobs/41406.1/1.node09', which is not available, whereas /opt/spool/node05/active_jobs/41406.1/ is available.
>>
>> I submitted a distributed job and it was running on node09 and node05 in the grid; the active_jobs folder contains an entry for node05 (since the parent process was invoked from this node) but not for node09.
>>
>> I am using the following PE for my distributed job:
>>
>> pe_name            Distributed
>> slots              94
>> user_lists         NONE
>> xuser_lists        NONE
>> start_proc_args    /bin/true
>> stop_proc_args     /bin/true
>> allocation_rule    $fill_up
>> control_slaves     TRUE
>> job_is_first_task  FALSE
>> urgency_slots      min
>>
>> Can you please help me resolve the issue?
>>
>> Thanks,
>> Britto.
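Reuti's suggestion above can be checked directly on the slave node. A minimal sketch, assuming the execd spool layout shown in this thread (/opt/spool/&lt;host&gt;) and the job id 41406.1 from the error message; both are assumptions taken from the quoted messages, not a general SGE default:

```shell
# Run on the slave node (node09). Paths are assumptions from this thread:
# the local execd spool is taken to live under /opt/spool/<hostname>.
SPOOL=/opt/spool/$(hostname -s)
JOB=41406.1   # job.task id from the qmaster error message

# Was the job-specific directory ever created for this task?
ls -ld "$SPOOL"/active_jobs/"$JOB"* 2>/dev/null \
  || echo "no active_jobs entry for $JOB"

# Any execd log lines mentioning this job?
grep "$JOB" "$SPOOL/messages" 2>/dev/null | tail -n 20
```

If the directory is missing and the execd's messages file shows nothing, the creation request from the master task likely never reached (or never succeeded on) the slave execd.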
>>
>>
>> -----Original Message-----
>> From: Reuti [mailto:[email protected]]
>> Sent: Monday, February 18, 2013 1:54 PM
>> To: Britto, Rajesh
>> Cc: [email protected]
>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>
>> Hi,
>>
>> On 18.02.2013 at 04:53, Britto, Rajesh wrote:
>>
>>> Thanks for the information.
>>>
>>> It's not a ROCKS cluster; it's a normal SGE cluster with RHEL 5.2 installed.
>>>
>>> There is no firewall or SELinux enabled on these machines.
>>
>> Is it a fresh installation? I wonder about using 6.1u2, as there were later versions that were still freely available.
>>
>> To investigate: it might be outside of SGE. Can you please submit such a hanging job, log in to the node, and issue:
>>
>> strace -p 1234
>>
>> with the PID of your hanging application? If it's just the `qrsh` hanging around, its return code might be retrieved later.
>>
>> One other possibility: one version of PVM failed to close stdout, which had a similar effect IIRC. What type of parallel application is it (e.g. MPI)?
>>
>> -- Reuti
>>
>>
>>> Thanks,
>>> Britto.
>>>
>>> -----Original Message-----
>>> From: Reuti [mailto:[email protected]]
>>> Sent: Friday, February 15, 2013 10:15 PM
>>> To: Britto, Rajesh
>>> Cc: [email protected]
>>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>>
>>> On 15.02.2013 at 08:22, Britto, Rajesh wrote:
>>>
>>>> Hi Reuti,
>>>>
>>>> Thanks for the information. I am using SGE 6.1u2.
>>>
>>> Ok, IIRC the builtin startup mechanism appeared only in 6.2.
>>>
>>>
>>>> qconf -sconf:
>>>>
>>>> qlogin_command    telnet
>>>> qlogin_daemon     /usr/sbin/in.telnetd
>>>> rlogin_daemon     /usr/sbin/in.rlogind
>>>
>>> ROCKS? I remember that they added some lines at the end that override settings appearing earlier in the file.
>>>
>>> Do you have any firewall installed on the system that could block the MPI communication?
>>>
>>> -- Reuti
>>>
>>>
>>>> The rsh command doesn't appear in the qconf -sconf output.
>>>> We are using Open MPI for running parallel and distributed jobs.
>>>>
>>>> The application uses the mpirun command to invoke the distributed jobs. Please let me know if you need more clarification.
>>>>
>>>> Thanks,
>>>> Britto.
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Reuti [mailto:[email protected]]
>>>> Sent: Wednesday, February 13, 2013 7:00 PM
>>>> To: Britto, Rajesh
>>>> Cc: [email protected]
>>>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>>>
>>>> Hi,
>>>>
>>>> On 13.02.2013 at 13:43, Britto, Rajesh wrote:
>>>>
>>>>> When I tried to execute a distributed job on a cluster, the job started successfully.
>>>>>
>>>>> However, after some time, the job hung at the following process. Can anyone please let me know what the issue could be?
>>>>>
>>>>> /opt/sge/utilbin/lx24-amd64/rsh -n -p 36425 <NodeName> exec '/opt/sge/utilbin/lx24-amd64/qrsh_starter' '/opt/spool/node/active_jobs/41270.1/1.node'
>>>>
>>>> It looks like you used the old startup method via `rsh` - which version of SGE is it? When setting:
>>>>
>>>> $ qconf -sconf
>>>> ...
>>>> qlogin_command    builtin
>>>> qlogin_daemon     builtin
>>>> rlogin_command    builtin
>>>> rlogin_daemon     builtin
>>>> rsh_command       builtin
>>>> rsh_daemon        builtin
>>>>
>>>> the `rsh` shouldn't appear in the process tree. How did you start your application in the jobscript? How does the application start slave tasks: via Open MPI, MPICH2, ...?
>>>>
>>>>
>>>>> FYI, the cluster has both password-less ssh and rsh communication between the nodes.
>>>>
>>>> In a Tight Integration setup, even parallel jobs don't need this.
>>>>
>>>> -- Reuti

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
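The hanging rsh call discussed throughout this thread is SGE's internal task-startup path for tight integration. The same path can be exercised by hand from inside a running parallel job via `qrsh -inherit`; a minimal sketch (the guard makes it safe to run outside a job, where PE_HOSTFILE is unset):

```shell
# Minimal check of tightly integrated task startup, meant to be run from
# inside a parallel job's jobscript. SGE sets PE_HOSTFILE for parallel
# jobs; outside a job the guard below just reports that fact.
if [ -n "${PE_HOSTFILE:-}" ]; then
    # One line per granted host: "<host> <slots> <queue> <processors>"
    while read -r host slots rest; do
        echo "starting a test task on $host"
        # Uses the same startup path as the hung rsh/qrsh_starter pair.
        qrsh -inherit "$host" hostname
    done < "$PE_HOSTFILE"
else
    echo "not inside an SGE parallel job (PE_HOSTFILE unset)"
fi
```

If `qrsh -inherit` to a particular host hangs the way the thread describes, that isolates the problem to task startup on that host rather than to the MPI layer above it.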
