That error means that the process launched by qrsh on node09 exited before
the tasks in the other slots, so qmaster killed the whole job for you.
I see these occasionally even when the parallel run finishes normally and
exits, because the first process to exit may be noticed by qmaster before
the others.
-Jim
On Mon, 25 Feb 2013, Reuti wrote:
Am 25.02.2013 um 08:03 schrieb Britto, Rajesh:
I could see the following error message on the message files.
Qmaster |mgr|E| tightly integrated parallel task 41406.1 task 1.node09 failed -
killing job
This is on the qmaster AFAICS. What is in the messages file of node09? Maybe
the job-specific spool directory couldn't be created.
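A quick way to check, assuming the execd spool for node09 lives under
/opt/spool/node09 (as the paths quoted below suggest):
$ grep 41406 /opt/spool/node09/messages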
-- Reuti
Can you please help me in this regard?
Thanks,
Britto.
-----Original Message-----
From: Reuti [mailto:[email protected]]
Sent: Friday, February 22, 2013 6:56 PM
To: Britto, Rajesh
Cc: [email protected]
Subject: Re: [gridengine users] Issue in Distributed jobs
Am 22.02.2013 um 08:15 schrieb Britto, Rajesh:
Thanks for the information. It's not a fresh installation; we already have
6.1 installed and in production, and we are not updating it.
After running strace on the PID of the hanging process, I found the following:
/opt/sge/utilbin/lx24-amd64/rsh -n -p 51693 node09 exec
'/opt/sge/utilbin/lx24-amd64/qrsh_starter'
'/opt/spool/node09/active_jobs/41406.1/1.node09'
To clarify this:
the directory /opt/spool/node09/active_jobs/41406.1/1.node09 should be on the
slave node. It's not created? Anything in the messages file of the node about
this failure?
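A quick check from the master (hostname and path taken from the strace output
above; adjust if your spool layout differs):
$ ssh node09 ls -ld /opt/spool/node09/active_jobs/41406.1/1.node09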
-- Reuti
The above command hangs; it's trying to find the directory
'/opt/spool/node09/active_jobs/41406.1/1.node09', which is not available, whereas
/opt/spool/node05/active_jobs/41406.1/ is available.
I submitted a distributed job and it was running on node09 and node05 in
the grid; the active_jobs folder contains an entry for node05 (since the parent
process was invoked from this node) but not for node09.
I am using the following PE for my distributed job:
pe_name            Distributed
slots              94
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
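For reference, this listing matches what `qconf -sp Distributed` prints; the PE
can be modified with:
$ qconf -mp Distributed
which opens it in $EDITOR.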
Can you please help me to resolve the issue?
Thanks,
Britto.
-----Original Message-----
From: Reuti [mailto:[email protected]]
Sent: Monday, February 18, 2013 1:54 PM
To: Britto, Rajesh
Cc: [email protected]
Subject: Re: [gridengine users] Issue in Distributed jobs
Hi,
Am 18.02.2013 um 04:53 schrieb Britto, Rajesh:
Thanks for the information.
It's not a ROCKS cluster, it's a normal SGE cluster with RHEL 5.2 installed.
There is no firewall or SELinux enabled on these machines.
Is it a fresh installation? I wonder about using 6.1u2, as there were later
versions which were still freely available.
To investigate: it might be outside of SGE. Can you please submit such a
hanging job, log in to the node and issue:
strace -p 1234
with the PID of your hanging application. If it's just the `qrsh` hanging
around, its return code might be retrieved later.
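To find the PID on the node, something like:
$ ps -ef | grep [q]rsh
should list the hanging `rsh`/`qrsh_starter` processes (1234 above is just a
placeholder).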
One other possibility: one version of PVM failed to close stdout, and it had
a similar effect IIRC. What type of parallel application is it (e.g. MPI)?
-- Reuti
Thanks,
Britto.
-----Original Message-----
From: Reuti [mailto:[email protected]]
Sent: Friday, February 15, 2013 10:15 PM
To: Britto, Rajesh
Cc: [email protected]
Subject: Re: [gridengine users] Issue in Distributed jobs
Am 15.02.2013 um 08:22 schrieb Britto, Rajesh:
Hi Reuti,
Thanks for the information. I am using SGE 6.1u2.
Ok, IIRC the builtin startup mechanism appeared only in 6.2.
qconf -sconf:
qlogin_command telnet
qlogin_daemon /usr/sbin/in.telnetd
rlogin_daemon /usr/sbin/in.rlogind
ROCKS? I remember that they add some lines at the end which override settings
that appear earlier in the file.
Do you have any firewall installed on the system, which could block the MPI
communication?
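A quick check on RHEL (commands and paths per a stock RHEL 5 install):
$ /sbin/iptables -L -n          # list active firewall rules
$ /usr/sbin/getenforce          # SELinux enforcement mode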
-- Reuti
The rsh command doesn't appear in the qconf -sconf output. We are using Open MPI
for running parallel and distributed jobs.
The application uses the mpirun command to invoke the distributed jobs. Please
let me know if you need more details.
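For illustration, a minimal tightly integrated Open MPI jobscript (assuming
Open MPI was built with SGE support, i.e. configured --with-sge; ./my_app is a
placeholder for the real binary):
#!/bin/sh
#$ -pe Distributed 8
#$ -cwd
# With SGE support compiled in, Open MPI reads $PE_HOSTFILE itself and
# starts its slave tasks via qrsh -inherit, so no machinefile is needed.
mpirun -np $NSLOTS ./my_app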
Thanks,
Britto.
-----Original Message-----
From: Reuti [mailto:[email protected]]
Sent: Wednesday, February 13, 2013 7:00 PM
To: Britto, Rajesh
Cc: [email protected]
Subject: Re: [gridengine users] Issue in Distributed jobs
Hi,
Am 13.02.2013 um 13:43 schrieb Britto, Rajesh:
When I tried to execute a distributed job on a cluster, the job started
successfully.
However, after some time, the job hung on the following process.
Can anyone please let me know what the issue could be?
/opt/sge/utilbin/lx24-amd64/rsh -n -p 36425 <NodeName> exec
'/opt/sge/utilbin/lx24-amd64/qrsh_starter'
'/opt/spool/node/active_jobs/41270.1/1.node'
It looks like you used the old startup method via `rsh` - which version of SGE
is it? When setting:
$ qconf -sconf
...
qlogin_command builtin
qlogin_daemon builtin
rlogin_command builtin
rlogin_daemon builtin
rsh_command builtin
rsh_daemon builtin
the `rsh` shouldn't appear in the process tree. How did you start your
application in the jobscript? How does the application start slave tasks: by
Open MPI, MPICH2 ...?
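These entries are changed with:
$ qconf -mconf global
which opens the global configuration in $EDITOR.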
FYI, the cluster has both passwordless ssh and rsh communication between
the nodes.
In a Tight Integration setup even parallel jobs don't need this: the slave
tasks are started via SGE's own `qrsh -inherit` mechanism.
-- Reuti
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users