Re: [gridengine users] Issue in Distributed jobs

Britto, Rajesh Thu, 21 Feb 2013 23:17:38 -0800

Hi,

Thanks for the information. It's not a fresh installation: we already have 6.1 
installed and in production, and we are not upgrading it.


After running strace on the process ID of the hanging process, I found the 
following information.

/opt/sge/utilbin/lx24-amd64/rsh -n -p 51693 node09 exec 
'/opt/sge/utilbin/lx24-amd64/qrsh_starter' 
'/opt/spool/node09/active_jobs/41406.1/1.node09'

The above command hangs. It is trying to find the file 
'/opt/spool/node09/active_jobs/41406.1/1.node09', which does not exist, whereas 
/opt/spool/node05/active_jobs/41406.1/ does exist.

I submitted a distributed job that was running on node09 and node05 in the 
grid. An active_jobs directory exists for node05 (since the parent process was 
invoked from that node) but not for node09.

I am using the following PE for my distributed job.

pe_name           Distributed
slots             94
user_lists        NONE
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /bin/true
allocation_rule   $fill_up
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min
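
A single setting can be pulled out of a PE definition like the one above with a small filter, which is handy for confirming values such as control_slaves on a live system. A sketch only: pe_value is a hypothetical helper name, and it assumes the two-column "key value" layout shown above.

```shell
# Hypothetical helper: read a PE definition on stdin (two columns,
# "key value") and print the value belonging to the requested key.
pe_value() {
    awk -v key="$1" '$1 == key { print $2 }'
}
```

With a PE named Distributed, `qconf -sp Distributed | pe_value control_slaves` would print the current value of that setting.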

Can you please help me to resolve the issue?

Thanks,
Britto.


-----Original Message-----
From: Reuti [mailto:[email protected]] 
Sent: Monday, February 18, 2013 1:54 PM
To: Britto, Rajesh
Cc: [email protected]
Subject: Re: [gridengine users] Issue in Distributed jobs

Hi,

On 18.02.2013 at 04:53, Britto, Rajesh wrote:

> Thanks for the information.
> 
> It's not a ROCKS cluster; it's a normal SGE cluster with RHEL 5.2 installed.
> 
> There is no firewall or SELinux enabled on these machines.

Is it a fresh installation? I wonder about using 6.1u2 as there were versions 
after it which were still freely available.

To investigate: it might be outside of SGE. Can you please submit such a 
hanging job, login to the node and issue:

strace -p 1234

with the PID of your hanging application. If it's just the `qrsh` hanging 
around, its return code might be retrieved later.
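
To find the PID to attach to, one option is to filter the process list for the `rsh` that launches `qrsh_starter`. A sketch, assuming input in the form produced by `ps -eo pid,args` (PID first, then the full command line); find_rsh_pids is a hypothetical helper name, not an SGE tool.

```shell
# Hypothetical helper: read "PID COMMAND ARGS..." lines on stdin and print
# the PIDs of rsh processes whose command line mentions qrsh_starter.
find_rsh_pids() {
    awk '$2 ~ /\/rsh$/ && /qrsh_starter/ { print $1 }'
}
```

On the node, `ps -eo pid,args | find_rsh_pids` would list candidate PIDs, and each one can then be passed to `strace -p`.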

One other possibility: one version of PVM failed to close stdout, which had 
a similar effect IIRC. What type of parallel application is it (e.g. MPI)?

-- Reuti


> Thanks,
> Britto.
> 
> -----Original Message-----
> From: Reuti [mailto:[email protected]] 
> Sent: Friday, February 15, 2013 10:15 PM
> To: Britto, Rajesh
> Cc: [email protected]
> Subject: Re: [gridengine users] Issue in Distributed jobs
> 
> On 15.02.2013 at 08:22, Britto, Rajesh wrote:
> 
>> Hi Reuti,
>> 
>> Thanks for the information. I am using SGE 6.1u2.
> 
> Ok, IIRC the builtin startup mechanism appeared only in 6.2.
> 
> 
>> qconf -sconf:
>> 
>> qlogin_command               telnet
>> qlogin_daemon                /usr/sbin/in.telnetd
>> rlogin_daemon                /usr/sbin/in.rlogind
> 
> ROCKS? I remember that they added some lines at the end and override settings 
> which appear earlier in the file.
> 
> Do you have any firewall installed on the system, which could block the MPI 
> communication?
> 
> -- Reuti
> 
> 
>> The rsh command doesn't appear in the qconf -sconf output. We are using 
>> Open MPI for running parallel and distributed jobs.
>> 
>> The application uses the mpirun command to invoke the distributed jobs. 
>> Please let me know if you need more clarification.
>> 
>> Thanks,
>> Britto.
>> 
>> 
>> -----Original Message-----
>> From: Reuti [mailto:[email protected]] 
>> Sent: Wednesday, February 13, 2013 7:00 PM
>> To: Britto, Rajesh
>> Cc: [email protected]
>> Subject: Re: [gridengine users] Issue in Distributed jobs
>> 
>> Hi,
>> 
>> On 13.02.2013 at 13:43, Britto, Rajesh wrote:
>> 
>>> When I tried to execute a distributed job on a cluster, the job started 
>>> successfully.
>>> 
>>> However, after some time, the job hung on the following process. Can 
>>> anyone please let me know what could be the issue?
>>> 
>>> /opt/sge/utilbin/lx24-amd64/rsh -n -p 36425 <NodeName> exec 
>>> '/opt/sge/utilbin/lx24-amd64/qrsh_starter' 
>>> '/opt/spool/node/active_jobs/41270.1/1.node'
>> 
>> It looks like you used the old startup method by `rsh` - which version of 
>> SGE is it? When setting:
>> 
>> $ qconf -sconf
>> ...
>> qlogin_command               builtin
>> qlogin_daemon                builtin
>> rlogin_command               builtin
>> rlogin_daemon                builtin
>> rsh_command                  builtin
>> rsh_daemon                   builtin
>> 
>> the `rsh` shouldn't appear in the process tree. How did you start your 
>> application in the jobscript? How does the application start slave tasks: by 
>> Open MPI, MPICH2 ...?
>> 
>> 
>>> FYI, the cluster has both passwordless ssh and rsh communication 
>>> between the nodes.
>> 
>> In a Tight Integration setup even parallel jobs don't need this.
>> 
>> -- Reuti
>> 
> 
> 


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
