That error means that the process launched by qrsh on node09 exited before the processes in the rest of the slots did, so qmaster killed everything for you.

I see these occasionally even when the parallel run finishes normally, because the first process to exit may be noticed by qmaster before the others.

-Jim

On Mon, 25 Feb 2013, Reuti wrote:

On 25.02.2013 at 08:03, Britto, Rajesh wrote:

I could see the following error message in the messages file.

Qmaster |mgr|E| tightly integrated parallel task 41406.1 task 1.node09 failed - killing job

This is on the qmaster AFAICS. What is in the messages file of node09? Maybe the job-specific spool directory couldn't be created.
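A quick way to check, assuming the execd spool directory on node09 is /opt/spool/node09 as the paths later in this thread suggest, is to grep that node's messages file for the job number:

$ grep 41406 /opt/spool/node09/messages

A failure to create the job-specific directory should leave an entry there.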

-- Reuti


Can you please help me in this regard?

Thanks,
Britto.

-----Original Message-----
From: Reuti [mailto:[email protected]]
Sent: Friday, February 22, 2013 6:56 PM
To: Britto, Rajesh
Cc: [email protected]
Subject: Re: [gridengine users] Issue in Distributed jobs

On 22.02.2013 at 08:15, Britto, Rajesh wrote:

Thanks for the information. It's not a fresh installation; we already installed 6.1, which is in production, and we are not updating it.

After running strace on the process ID where it hangs, I found the following:

/opt/sge/utilbin/lx24-amd64/rsh -n -p 51693 node09 exec '/opt/sge/utilbin/lx24-amd64/qrsh_starter' '/opt/spool/node09/active_jobs/41406.1/1.node09'

To clarify this:

the directory /opt/spool/node09/active_jobs/41406.1/1.node09 should be on the slave node. It's not created? Anything in the messages file of the node about this failure?
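Since passwordless ssh between the nodes is mentioned later in this thread, one way to verify while the job is still listed as running would be to check the path from the strace output directly:

$ ssh node09 ls -ld /opt/spool/node09/active_jobs/41406.1/1.node09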

-- Reuti


The above command hangs; it is trying to find the directory '/opt/spool/node09/active_jobs/41406.1/1.node09', which is not available, whereas /opt/spool/node05/active_jobs/41406.1/ is available.

I submitted a distributed job that was running on node09 and node05 in the grid; the active_jobs folder was created on node05 (since the parent process was invoked from that node) but not on node09.

I am using the following PE for my distributed job.

pe_name           Distributed
slots             94
user_lists        NONE
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /bin/true
allocation_rule   $fill_up
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min
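For reference, a job would request this PE with something like the following; the slot count and script name are purely illustrative:

$ qsub -pe Distributed 4 myjob.sh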

Can you please help me to resolve the issue?

Thanks,
Britto.


-----Original Message-----
From: Reuti [mailto:[email protected]]
Sent: Monday, February 18, 2013 1:54 PM
To: Britto, Rajesh
Cc: [email protected]
Subject: Re: [gridengine users] Issue in Distributed jobs

Hi,

On 18.02.2013 at 04:53, Britto, Rajesh wrote:

Thanks for the information.

It's not a ROCKS cluster; it's a normal SGE cluster with RHEL 5.2 installed.

There is no firewall or SELinux enabled on these machines.

Is it a fresh installation? I wonder about using 6.1u2, as there were later versions that were still freely available.

To investigate: it might be outside of SGE. Can you please submit such a hanging job, log in to the node and issue:

strace -p 1234

with the PID of your hanging application. If it's just the `qrsh` hanging around, its return code might be retrieved later.
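To find that PID in the first place, something like this on the node should do; the grep pattern is just one possibility:

$ ps -ef | grep -E 'rsh|qrsh'
$ strace -p <PID from the output above>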

One other possibility: one version of PVM failed to close stdout, which had a similar effect IIRC. What type of parallel application is it (e.g. MPI)?

-- Reuti


Thanks,
Britto.

-----Original Message-----
From: Reuti [mailto:[email protected]]
Sent: Friday, February 15, 2013 10:15 PM
To: Britto, Rajesh
Cc: [email protected]
Subject: Re: [gridengine users] Issue in Distributed jobs

On 15.02.2013 at 08:22, Britto, Rajesh wrote:

Hi Reuti,

Thanks for the information. I am using SGE 6.1u2.

OK, IIRC the builtin startup mechanism appeared only in 6.2.


qconf -sconf:

qlogin_command               telnet
qlogin_daemon                /usr/sbin/in.telnetd
rlogin_daemon                /usr/sbin/in.rlogind

ROCKS? I remember that they added some lines at the end which override settings that appear earlier in the file.

Do you have a firewall installed on the system which could block the MPI communication?
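Two quick, read-only checks on RHEL (assuming root access) would be:

$ /sbin/iptables -L -n
$ /usr/sbin/sestatus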

-- Reuti


The rsh command doesn't appear in the qconf -sconf output. We are using Open MPI for running parallel and distributed jobs.

The application uses the mpirun command to invoke the distributed jobs. Please let me know if you need more clarification.

Thanks,
Britto.


-----Original Message-----
From: Reuti [mailto:[email protected]]
Sent: Wednesday, February 13, 2013 7:00 PM
To: Britto, Rajesh
Cc: [email protected]
Subject: Re: [gridengine users] Issue in Distributed jobs

Hi,

On 13.02.2013 at 13:43, Britto, Rajesh wrote:

When I tried to execute a distributed job on the cluster, the job started successfully.

However, after some time, the job hung at the following process. Can anyone please let me know what could be the issue?

/opt/sge/utilbin/lx24-amd64/rsh -n -p 36425 <NodeName> exec '/opt/sge/utilbin/lx24-amd64/qrsh_starter' '/opt/spool/node/active_jobs/41270.1/1.node'

It looks like you used the old startup method via `rsh` - which version of SGE is it? When setting:

$ qconf -sconf
...
qlogin_command               builtin
qlogin_daemon                builtin
rlogin_command               builtin
rlogin_daemon                builtin
rsh_command                  builtin
rsh_daemon                   builtin

the `rsh` shouldn't appear in the process tree. How did you start your 
application in the jobscript? How does the application start slave tasks: by 
Open MPI, MPICH2 ...?
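On 6.2 or later, these entries can be switched to builtin by editing the global configuration; `qconf -mconf` opens it in an editor:

$ qconf -mconf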


FYI, the cluster has both passwordless ssh and rsh communication between the nodes.

In a Tight Integration setup even parallel jobs don't need this.
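A minimal jobscript sketch for such a tightly integrated Open MPI run, assuming Open MPI was built with SGE support; the application name is hypothetical:

#!/bin/sh
#$ -pe Distributed 4
#$ -cwd
# Open MPI detects the SGE environment and starts its slave tasks
# via qrsh, so no hostfile and no passwordless rsh/ssh is needed:
mpirun -np $NSLOTS ./my_mpi_app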

-- Reuti