Guys, thanks for the information.

I can't find any relevant information in the node09 messages file, and I
suspect it wasn't able to create the spool directory.

>>>>> That error means that the process launched by qrsh on node09 exited
>>>>> before the rest of the slots so qmaster killed everything for you.
>>>>> I see these occasionally even when the parallel run finishes normally
>>>>> and exits because the first process to exit may be noticed by qmaster
>>>>> before the others.

Is there any solution for this scenario? I feel my case resembles the one
described above.

Thanks,
Britto.

-----Original Message-----
From: Jim Phillips [mailto:[email protected]]
Sent: Monday, February 25, 2013 9:52 PM
To: Reuti
Cc: Britto, Rajesh; [email protected]
Subject: Re: [gridengine users] Issue in Distributed jobs

That error means that the process launched by qrsh on node09 exited before
the rest of the slots so qmaster killed everything for you.

I see these occasionally even when the parallel run finishes normally and
exits because the first process to exit may be noticed by qmaster before
the others.

-Jim

On Mon, 25 Feb 2013, Reuti wrote:

> On 25.02.2013 at 08:03, Britto, Rajesh wrote:
>
>> I could see the following error message in the messages file:
>>
>> qmaster|mgr|E| tightly integrated parallel task 41406.1 task 1.node09
>> failed - killing job
>
> This is on the qmaster AFAICS. What is in the messages file of node09?
> Maybe the job-specific spool directory couldn't be created.
>
> -- Reuti
>
>
>> Can you please help me in this regard?
>>
>> Thanks,
>> Britto.
>>
>> -----Original Message-----
>> From: Reuti [mailto:[email protected]]
>> Sent: Friday, February 22, 2013 6:56 PM
>> To: Britto, Rajesh
>> Cc: [email protected]
>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>
>> On 22.02.2013 at 08:15, Britto, Rajesh wrote:
>>
>>> Thanks for the information. It's not a fresh installation; we already
>>> have 6.1 installed, which is in production, and we are not updating it.
>>>
>>> After running strace on the process ID that hangs, I found the
>>> following:
>>>
>>> /opt/sge/utilbin/lx24-amd64/rsh -n -p 51693 node09 exec
>>> '/opt/sge/utilbin/lx24-amd64/qrsh_starter'
>>> '/opt/spool/node09/active_jobs/41406.1/1.node09'
>>
>> To clarify this:
>>
>> the directory /opt/spool/node09/active_jobs/41406.1/1.node09 should be on
>> the slave node. It's not created? Anything in the messages file of the
>> node about this failure?
>>
>> -- Reuti
>>
>>
>>> The above command hangs: it's trying to find the directory
>>> '/opt/spool/node09/active_jobs/41406.1/1.node09', which is not
>>> available, whereas /opt/spool/node05/active_jobs/41406.1/ is available.
>>>
>>> I submitted a distributed job and it was running on node09 and node05
>>> in the grid; the active_jobs folder exists for node05 (since the parent
>>> process was invoked from this node) but not for node09.
>>>
>>> I am using the following PE for my distributed job:
>>>
>>> pe_name            Distributed
>>> slots              94
>>> user_lists         NONE
>>> xuser_lists        NONE
>>> start_proc_args    /bin/true
>>> stop_proc_args     /bin/true
>>> allocation_rule    $fill_up
>>> control_slaves     TRUE
>>> job_is_first_task  FALSE
>>> urgency_slots      min
>>>
>>> Can you please help me to resolve the issue?
>>>
>>> Thanks,
>>> Britto.
>>>
>>>
>>> -----Original Message-----
>>> From: Reuti [mailto:[email protected]]
>>> Sent: Monday, February 18, 2013 1:54 PM
>>> To: Britto, Rajesh
>>> Cc: [email protected]
>>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>>
>>> Hi,
>>>
>>> On 18.02.2013 at 04:53, Britto, Rajesh wrote:
>>>
>>>> Thanks for the information.
>>>>
>>>> It's not a ROCKS cluster; it's a normal SGE cluster with RHEL 5.2
>>>> installed.
>>>>
>>>> There is no firewall or SELinux enabled on these machines.
>>>
>>> Is it a fresh installation? I wonder about using 6.1u2, as there were
>>> versions after it which were still freely available.
>>>
>>> To investigate: it might be outside of SGE.
>>> Can you please submit such a hanging job, log in to the node and issue:
>>>
>>> strace -p 1234
>>>
>>> with the PID of your hanging application? If it's just the `qrsh`
>>> hanging around, its return code might be retrieved later.
>>>
>>> One other possibility: one version of PVM missed closing stdout, and it
>>> had a similar effect IIRC. What type of parallel application is it
>>> (e.g. MPI)?
>>>
>>> -- Reuti
>>>
>>>
>>>> Thanks,
>>>> Britto.
>>>>
>>>> -----Original Message-----
>>>> From: Reuti [mailto:[email protected]]
>>>> Sent: Friday, February 15, 2013 10:15 PM
>>>> To: Britto, Rajesh
>>>> Cc: [email protected]
>>>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>>>
>>>> On 15.02.2013 at 08:22, Britto, Rajesh wrote:
>>>>
>>>>> Hi Reuti,
>>>>>
>>>>> Thanks for the information. I am using SGE 6.1u2.
>>>>
>>>> OK, IIRC the builtin startup mechanism appeared only in 6.2.
>>>>
>>>>
>>>>> qconf -sconf:
>>>>>
>>>>> qlogin_command    telnet
>>>>> qlogin_daemon     /usr/sbin/in.telnetd
>>>>> rlogin_daemon     /usr/sbin/in.rlogind
>>>>
>>>> ROCKS? I remember that they added some lines at the end which override
>>>> settings appearing earlier in the file.
>>>>
>>>> Do you have any firewall installed on the system which could block the
>>>> MPI communication?
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>> The rsh command doesn't appear in the qconf -sconf output. We are
>>>>> using Open MPI for running parallel and distributed jobs.
>>>>>
>>>>> The application uses the mpirun command to invoke the distributed
>>>>> jobs. Please let me know if you need more clarification.
>>>>>
>>>>> Thanks,
>>>>> Britto.
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Reuti [mailto:[email protected]]
>>>>> Sent: Wednesday, February 13, 2013 7:00 PM
>>>>> To: Britto, Rajesh
>>>>> Cc: [email protected]
>>>>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>>>>
>>>>> Hi,
>>>>>
>>>>> On 13.02.2013 at 13:43, Britto, Rajesh wrote:
>>>>>
>>>>>> When I tried to execute a distributed job on a cluster, the job
>>>>>> started successfully.
>>>>>>
>>>>>> However, after some time, the job hung on the following process. Can
>>>>>> anyone please let me know what could be the issue?
>>>>>>
>>>>>> /opt/sge/utilbin/lx24-amd64/rsh -n -p 36425 <NodeName> exec
>>>>>> '/opt/sge/utilbin/lx24-amd64/qrsh_starter'
>>>>>> '/opt/spool/node/active_jobs/41270.1/1.node'
>>>>>
>>>>> It looks like you are using the old startup method via `rsh` - which
>>>>> version of SGE is it? When setting:
>>>>>
>>>>> $ qconf -sconf
>>>>> ...
>>>>> qlogin_command    builtin
>>>>> qlogin_daemon     builtin
>>>>> rlogin_command    builtin
>>>>> rlogin_daemon     builtin
>>>>> rsh_command       builtin
>>>>> rsh_daemon        builtin
>>>>>
>>>>> the `rsh` shouldn't appear in the process tree. How did you start your
>>>>> application in the job script? How does the application start slave
>>>>> tasks: via Open MPI, MPICH2 ...?
>>>>>
>>>>>
>>>>>> FYI, the cluster has both password-less ssh and rsh communication
>>>>>> between the nodes.
>>>>>
>>>>> In a Tight Integration setup, even parallel jobs don't need this.
>>>>>
>>>>> -- Reuti

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
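[Editor's addendum] The per-node check discussed above - whether the job-specific spool directory that `qrsh_starter` expects actually exists on the slave node - can be sketched as a small shell helper. The function name `check_active_job_dir` and the scratch paths in the usage comments are illustrative, not part of SGE; only the `active_jobs` layout mirrors the paths quoted in the thread. Substitute your own cell's local spool directory and job/task id.

```shell
#!/bin/sh
# Sketch (hypothetical helper, not an SGE tool): report whether the
# per-job spool directory for a tightly integrated slave task exists.
check_active_job_dir() {
    spool_base=$1   # e.g. /opt/spool
    node=$2         # e.g. node09
    job_task=$3     # e.g. 41406.1

    dir="$spool_base/$node/active_jobs/$job_task"
    if [ -d "$dir" ]; then
        echo "OK: $dir exists"
        return 0
    else
        # In the thread, a missing directory coincided with the
        # hanging rsh/qrsh_starter on that node.
        echo "MISSING: $dir"
        return 1
    fi
}

# Example usage against a scratch tree:
#   mkdir -p /tmp/spool/node05/active_jobs/41406.1
#   check_active_job_dir /tmp/spool node05 41406.1
#   check_active_job_dir /tmp/spool node09 41406.1
```

Run it on the slave node itself (here, node09), since the question in the thread is whether the directory was created locally there; also check that node's execd messages file for creation errors.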
