Hi, I could see the following error message in the qmaster messages file:
Qmaster |mgr|E| tightly integrated parallel task 41406.1 task 1.node09 failed - killing job

Can you please help me in this regard?

Thanks,
Britto.

-----Original Message-----
From: Reuti [mailto:[email protected]]
Sent: Friday, February 22, 2013 6:56 PM
To: Britto, Rajesh
Cc: [email protected]
Subject: Re: [gridengine users] Issue in Distributed jobs

On 22.02.2013, at 08:15, Britto, Rajesh wrote:

> Thanks for the information. It's not a fresh installation; we already
> installed 6.1, which is in production, and we are not updating it.
>
> After running strace on the process ID of the hanging process, I found
> the following:
>
> /opt/sge/utilbin/lx24-amd64/rsh -n -p 51693 node09 exec
> '/opt/sge/utilbin/lx24-amd64/qrsh_starter'
> '/opt/spool/node09/active_jobs/41406.1/1.node09'

To clarify this: the directory /opt/spool/node09/active_jobs/41406.1/1.node09
should be on the slave node. It's not created? Anything in the messages file
of the node about this failure?

-- Reuti

> The above command hangs; it's trying to find the directory
> '/opt/spool/node09/active_jobs/41406.1/1.node09', which is not available,
> whereas /opt/spool/node05/active_jobs/41406.1/ is available.
>
> I submitted a distributed job, and it was running on node09 and node05
> in the grid; the active_jobs folder exists for node05 (since the parent
> process was invoked from this node) but not for node09.
>
> I am using the following PE for my distributed job:
>
> pe_name            Distributed
> slots              94
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $fill_up
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
>
> Can you please help me to resolve the issue?
>
> Thanks,
> Britto.
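The failure above comes down to the per-task directory never being created on the slave node. A minimal sketch of that check, using a temporary directory in place of the real spool (the paths and node names mirror the ones quoted in this thread; the `check` helper is illustrative, not an SGE tool):

```shell
# Mimic the reported state in a temp dir: the active_jobs directory for the
# job exists under node05 (the master node) but was never created for node09.
spool=$(mktemp -d)
mkdir -p "$spool/node05/active_jobs/41406.1"   # present, as reported
# node09's directory is deliberately not created, matching the failure

check() {
  # Prints "present" or "MISSING" for the given node's active_jobs dir.
  if [ -d "$spool/$1/active_jobs/41406.1" ]; then
    echo present
  else
    echo MISSING
  fi
}

node05_state=$(check node05)
node09_state=$(check node09)
echo "node05: $node05_state"
echo "node09: $node09_state"
rm -rf "$spool"
```

On a real cluster the same `[ -d ... ]` test would be run on each slave node against the spool directory quoted in the strace output.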
>
>
> -----Original Message-----
> From: Reuti [mailto:[email protected]]
> Sent: Monday, February 18, 2013 1:54 PM
> To: Britto, Rajesh
> Cc: [email protected]
> Subject: Re: [gridengine users] Issue in Distributed jobs
>
> Hi,
>
> On 18.02.2013, at 04:53, Britto, Rajesh wrote:
>
>> Thanks for the information.
>>
>> It's not a ROCKS cluster; it's a normal SGE cluster with RHEL 5.2 installed.
>>
>> There is no firewall or SELinux enabled on these machines.
>
> Is it a fresh installation? I wonder about using 6.1u2, as there were
> versions after it which were still freely available.
>
> To investigate: it might be outside of SGE. Can you please submit such a
> hanging job, log in to the node, and issue:
>
> strace -p 1234
>
> with the PID of your hanging application. If it's just the `qrsh` hanging
> around, its return code might be retrieved later.
>
> One other possibility: one version of PVM failed to close stdout, and that
> had a similar effect IIRC. What type of parallel application is it (e.g. MPI)?
>
> -- Reuti
>
>
>> Thanks,
>> Britto.
>>
>> -----Original Message-----
>> From: Reuti [mailto:[email protected]]
>> Sent: Friday, February 15, 2013 10:15 PM
>> To: Britto, Rajesh
>> Cc: [email protected]
>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>
>> On 15.02.2013, at 08:22, Britto, Rajesh wrote:
>>
>>> Hi Reuti,
>>>
>>> Thanks for the information. I am using SGE 6.1u2.
>>
>> Ok, IIRC the builtin startup mechanism appeared only in 6.2.
>>
>>
>>> qconf -sconf:
>>>
>>> qlogin_command    telnet
>>> qlogin_daemon     /usr/sbin/in.telnetd
>>> rlogin_daemon     /usr/sbin/in.rlogind
>>
>> ROCKS? I remember that they add some lines at the end which override
>> settings appearing earlier in the file.
>>
>> Do you have any firewall installed on the system which could block the MPI
>> communication?
>>
>> -- Reuti
>>
>>
>>> The rsh command doesn't appear in the qconf -sconf output. We are using
>>> Open MPI for running parallel and distributed jobs.
>>>
>>> The application uses the mpirun command to invoke the distributed jobs.
>>> Please let me know if you need more clarification.
>>>
>>> Thanks,
>>> Britto.
>>>
>>>
>>> -----Original Message-----
>>> From: Reuti [mailto:[email protected]]
>>> Sent: Wednesday, February 13, 2013 7:00 PM
>>> To: Britto, Rajesh
>>> Cc: [email protected]
>>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>>
>>> Hi,
>>>
>>> On 13.02.2013, at 13:43, Britto, Rajesh wrote:
>>>
>>>> When I tried to execute a distributed job on a cluster, the job started
>>>> successfully.
>>>>
>>>> However, after some time, the job hung in the following process. Can
>>>> anyone please let me know what the issue could be?
>>>>
>>>> /opt/sge/utilbin/lx24-amd64/rsh -n -p 36425 <NodeName> exec
>>>> '/opt/sge/utilbin/lx24-amd64/qrsh_starter'
>>>> '/opt/spool/node/active_jobs/41270.1/1.node'
>>>
>>> It looks like you used the old startup method via `rsh` - which version
>>> of SGE is it? When setting:
>>>
>>> $ qconf -sconf
>>> ...
>>> qlogin_command    builtin
>>> qlogin_daemon     builtin
>>> rlogin_command    builtin
>>> rlogin_daemon     builtin
>>> rsh_command       builtin
>>> rsh_daemon        builtin
>>>
>>> the `rsh` shouldn't appear in the process tree. How did you start your
>>> application in the job script? How does the application start slave
>>> tasks: via Open MPI, MPICH2, ...?
>>>
>>>
>>>> FYI, the cluster has both passwordless ssh and rsh communication
>>>> between the nodes.
>>>
>>> In a Tight Integration setup, even parallel jobs don't need this.
>>>
>>> -- Reuti
>>>
>>
>>
>
>
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
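As the earliest reply in this thread explains, once the qlogin/rlogin/rsh entries in `qconf -sconf` all read builtin (available from 6.2 on), the `rsh` wrapper no longer appears in the process tree. A hedged sketch of testing saved `qconf -sconf` output for this, using the 6.1u2 settings quoted above as sample data (the parsing logic is illustrative only, not an SGE tool):

```shell
# Decide from saved `qconf -sconf` output whether the old rsh wrapper or the
# builtin mechanism (6.2+) would be used for qrsh startup. The sample config
# mirrors the 6.1u2 settings quoted in the thread (no rsh_command at all).
conf='qlogin_command               telnet
qlogin_daemon                /usr/sbin/in.telnetd
rlogin_daemon                /usr/sbin/in.rlogind'

if printf '%s\n' "$conf" | grep -q '^rsh_command[[:space:]]\{1,\}builtin'; then
  startup=builtin
else
  # rsh_command absent or not builtin: SGE falls back to its rsh wrapper,
  # which is exactly the process seen hanging in this thread.
  startup=rsh-wrapper
fi
echo "qrsh startup mechanism: $startup"
```

On a live cluster the heredoc-style sample would be replaced by `qconf -sconf` itself.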
