On 26.02.2013, at 09:51, Britto, Rajesh wrote:

> Guys,
> 
> Thanks for the information.
> 
> I don't find any relevant information in the node09 messages file, and I suspect 
> it wasn't able to create the spool directory.

Is the file system full? What are the permissions of /var/spool/sge, or wherever 
you put the spooling directory? The account under which SGE is running as 
effective user ID must be able to write there. You can `su` to this account and 
try to create some files/directories by hand.
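A quick write probe along these lines can confirm it; the helper name, the probe-file name, and the /var/spool/sge path are illustrative only, not part of SGE:

```shell
# Sketch: probe whether a given spool directory is writable by the current user.
check_spool_writable() {
  dir="$1"
  probe="$dir/.sge_write_probe.$$"
  if touch "$probe" 2>/dev/null; then
    # clean up the probe file again
    rm -f "$probe"
    echo "writable"
  else
    echo "not writable"
  fi
}
# After `su - <sge_admin_user>`, run e.g.:
#   check_spool_writable /var/spool/sge
```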

-- Reuti


>>>>>> That error means that the process launched by qrsh on node09 exited 
>>>>>> before the rest of the slots so qmaster killed everything for you.
> 
>>>>>> I see these occasionally even when the parallel run finishes normally 
>>>>>> and exits because the first process to exit may be noticed by qmaster 
>>>>>> before the others.
> 
> Is there any solution for the above scenario? I feel my case resembles 
> the one described above.
> 
> Thanks,
> Britto.
> 
> 
> -----Original Message-----
> From: Jim Phillips [mailto:[email protected]] 
> Sent: Monday, February 25, 2013 9:52 PM
> To: Reuti
> Cc: Britto, Rajesh; [email protected]
> Subject: Re: [gridengine users] Issue in Distributed jobs
> 
> 
> That error means that the process launched by qrsh on node09 exited before 
> the rest of the slots so qmaster killed everything for you.
> 
> I see these occasionally even when the parallel run finishes normally and 
> exits because the first process to exit may be noticed by qmaster before 
> the others.
> 
> -Jim
> 
> On Mon, 25 Feb 2013, Reuti wrote:
> 
>> On 25.02.2013, at 08:03, Britto, Rajesh wrote:
>> 
>>> I could see the following error message in the messages file.
>>> 
>>> Qmaster |mgr|E| tightly integrated parallel task 41406.1 task 1.node09 
>>> failed - killing job
>> 
>> This is on the qmaster AFAICS. What is in the message file of the node09? 
>> Maybe the job specific spool directory couldn't be created.
>> 
>> -- Reuti
>> 
>> 
>>> Can you please help me in this regard?
>>> 
>>> Thanks,
>>> Britto.
>>> 
>>> -----Original Message-----
>>> From: Reuti [mailto:[email protected]]
>>> Sent: Friday, February 22, 2013 6:56 PM
>>> To: Britto, Rajesh
>>> Cc: [email protected]
>>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>> 
>>> On 22.02.2013, at 08:15, Britto, Rajesh wrote:
>>> 
>>>> Thanks for the information. It's not a fresh installation; we already 
>>>> installed 6.1, which is in production, and we are not updating it.
>>>> 
>>>> After running strace on the process ID of the hanging process, I found the 
>>>> following:
>>>> 
>>>> /opt/sge/utilbin/lx24-amd64/rsh -n -p 51693 node09 exec 
>>>> '/opt/sge/utilbin/lx24-amd64/qrsh_starter' 
>>>> '/opt/spool/node09/active_jobs/41406.1/1.node09'
>>> 
>>> To clarify this:
>>> 
>>> the directory /opt/spool/node09/active_jobs/41406.1/1.node09 should be on 
>>> the slave node. It's not created? Anything in the messages file of the node 
>>> about this failure?
>>> 
>>> -- Reuti
>>> 
>>> 
>>>> The above command hangs; it's trying to find the file 
>>>> '/opt/spool/node09/active_jobs/41406.1/1.node09', which is not available, 
>>>> whereas /opt/spool/node05/active_jobs/41406.1/ is available.
>>>> 
>>>> I submitted a distributed job, and it ran on node09 and 
>>>> node05 in the grid; the active_jobs folder exists for node05 (since the 
>>>> parent process was invoked from that node) but not for node09.
>>>> 
>>>> I am using the following PE for my distributed job:
>>>> 
>>>> pe_name           Distributed
>>>> slots             94
>>>> user_lists        NONE
>>>> xuser_lists       NONE
>>>> start_proc_args   /bin/true
>>>> stop_proc_args    /bin/true
>>>> allocation_rule   $fill_up
>>>> control_slaves    TRUE
>>>> job_is_first_task FALSE
>>>> urgency_slots     min
>>>> 
>>>> Can you please help me to resolve the issue?
>>>> 
>>>> Thanks,
>>>> Britto.
>>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: Reuti [mailto:[email protected]]
>>>> Sent: Monday, February 18, 2013 1:54 PM
>>>> To: Britto, Rajesh
>>>> Cc: [email protected]
>>>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>>> 
>>>> Hi,
>>>> 
>>>> On 18.02.2013, at 04:53, Britto, Rajesh wrote:
>>>> 
>>>>> Thanks for the information.
>>>>> 
>>>>> It's not a ROCKS cluster; it's a normal SGE cluster with RHEL 5.2 
>>>>> installed.
>>>>> 
>>>>> There is no firewall or SELinux enabled on these machines.
>>>> 
>>>> Is it a fresh installation? I wonder about using 6.1u2, as there were 
>>>> later versions which were still freely available.
>>>> 
>>>> To investigate: it might be outside of SGE. Can you please submit such a 
>>>> hanging job, log in to the node, and issue:
>>>> 
>>>> strace -p 1234
>>>> 
>>>> with the PID of your hanging application. If it's just the `qrsh` hanging 
>>>> around, its return code might be retrieved later.
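As a sketch of that step (the helper name below is made up; `qrsh_starter` is the process name from this thread), one way to grab the PID of the newest matching process before attaching strace:

```shell
# Sketch: locate the newest process matching a pattern, e.g. to attach strace.
find_newest_pid() {
  # -n: newest matching process, -f: match against the full command line
  pgrep -n -f "$1"
}
# On the stuck node, as root:
#   strace -p "$(find_newest_pid qrsh_starter)"
```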
>>>> 
>>>> One other possibility: one version of PVM failed to close stdout, and 
>>>> it had a similar effect IIRC. What type of parallel application is it 
>>>> (e.g. MPI)?
>>>> 
>>>> -- Reuti
>>>> 
>>>> 
>>>>> Thanks,
>>>>> Britto.
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Reuti [mailto:[email protected]]
>>>>> Sent: Friday, February 15, 2013 10:15 PM
>>>>> To: Britto, Rajesh
>>>>> Cc: [email protected]
>>>>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>>>> 
>>>>> On 15.02.2013, at 08:22, Britto, Rajesh wrote:
>>>>> 
>>>>>> Hi Reuti,
>>>>>> 
>>>>>> Thanks for the information. I am using SGE 6.1u2.
>>>>> 
>>>>> Ok, IIRC the builtin startup mechanism appeared only in 6.2.
>>>>> 
>>>>> 
>>>>>> Qconf -sconf:
>>>>>> 
>>>>>> qlogin_command               telnet
>>>>>> qlogin_daemon                /usr/sbin/in.telnetd
>>>>>> rlogin_daemon                /usr/sbin/in.rlogind
>>>>> 
>>>>> ROCKS? I remember that they added some lines at the end which override 
>>>>> settings appearing earlier in the file.
>>>>> 
>>>>> Do you have any firewall installed on the system, which could block the 
>>>>> MPI communication?
>>>>> 
>>>>> -- Reuti
>>>>> 
>>>>> 
>>>>>> The rsh command doesn't appear in the qconf -sconf output. We are using 
>>>>>> Open MPI for running parallel and distributed jobs.
>>>>>> 
>>>>>> The application uses the mpirun command to invoke the distributed jobs. 
>>>>>> Please let me know for more clarification.
>>>>>> 
>>>>>> Thanks,
>>>>>> Britto.
>>>>>> 
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Reuti [mailto:[email protected]]
>>>>>> Sent: Wednesday, February 13, 2013 7:00 PM
>>>>>> To: Britto, Rajesh
>>>>>> Cc: [email protected]
>>>>>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> On 13.02.2013, at 13:43, Britto, Rajesh wrote:
>>>>>> 
>>>>>>> When I tried to execute a distributed job on the cluster, the job 
>>>>>>> started successfully.
>>>>>>> 
>>>>>>> However, after some time, the job hung at the following 
>>>>>>> process. Can anyone please let me know what the issue could be?
>>>>>>> 
>>>>>>> /opt/sge/utilbin/lx24-amd64/rsh -n -p 36425 <NodeName> exec 
>>>>>>> '/opt/sge/utilbin/lx24-amd64/qrsh_starter' 
>>>>>>> '/opt/spool/node/active_jobs/41270.1/1.node'
>>>>>> 
>>>>>> It looks like you are using the old startup method via `rsh` - which version 
>>>>>> of SGE is it? With the following settings:
>>>>>> 
>>>>>> $ qconf -sconf
>>>>>> ...
>>>>>> qlogin_command               builtin
>>>>>> qlogin_daemon                builtin
>>>>>> rlogin_command               builtin
>>>>>> rlogin_daemon                builtin
>>>>>> rsh_command                  builtin
>>>>>> rsh_daemon                   builtin
>>>>>> 
>>>>>> the `rsh` shouldn't appear in the process tree. How did you start your 
>>>>>> application in the jobscript? How does the application start slave 
>>>>>> tasks: by Open MPI, MPICH2 ...?
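For reference, a minimal sketch of how a tightly integrated Open MPI job script could assemble its launch command; the function name and the binary name `./my_mpi_app` are assumptions, while `NSLOTS` is the variable SGE exports inside a job:

```shell
# Sketch: build the mpirun command line a tightly integrated job script would run.
build_mpirun_cmd() {
  # SGE exports NSLOTS inside the job; default to 1 when run outside the scheduler
  np="${NSLOTS:-1}"
  echo "mpirun -np $np $1"
}
# In a job script (submitted e.g. with: qsub -pe Distributed 8 job.sh) you would run:
#   $(build_mpirun_cmd ./my_mpi_app)
```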
>>>>>> 
>>>>>> 
>>>>>>> FYI, cluster is having both password less ssh and rsh communications 
>>>>>>> between the nodes.
>>>>>> 
>>>>>> In a Tight Integration setup even parallel jobs don't need this.
>>>>>> 
>>>>>> -- Reuti
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
>> 
> 

