Hi,

On 27.02.2013, at 13:18, Britto, Rajesh wrote:
> Space is available on the partition, and the spool directory has 777
> permissions.
>
> I can create folders using the root account (as well as with user
> accounts), and SGE is running as the root user.

Is SGE running under root in your case? Usually it switches to the admin user:

$ ps -e f -o user,ruser,command
USER     RUSER COMMAND
...
sgeadmin root  /usr/sge/bin/lx24-amd64/sge_qmaster

As you are using the classic rsh startup: does the file
/opt/sge/utilbin/lx24-amd64/rsh have the suid bit set, and is it also
honored on the nodes?

-- Reuti
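Verifying both points on a node might look like the following (a sketch only; the sgeadmin account name is taken from the ps listing above, the /opt/spool path from later in this thread, and write_test is just a scratch file name):

$ ls -l /opt/sge/utilbin/lx24-amd64/rsh
  # an 's' in the owner execute bit (e.g. -rwsr-xr-x, owner root) means the suid bit is set
$ mount | grep nosuid
  # a filesystem mounted with 'nosuid' silently ignores the suid bit
$ su - sgeadmin -c 'touch /opt/spool/node09/write_test && rm /opt/spool/node09/write_test'
  # the admin user must be able to write to the execd spool directory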
> Thanks,
> Britto.
>
> -----Original Message-----
> From: Reuti [mailto:[email protected]]
> Sent: Tuesday, February 26, 2013 4:24 PM
> To: Britto, Rajesh
> Cc: Jim Phillips; [email protected]
> Subject: Re: [gridengine users] Issue in Distributed jobs
>
> On 26.02.2013, at 09:51, Britto, Rajesh wrote:
>
>> Guys,
>>
>> Thanks for the information.
>>
>> I don't find any information in the node09 messages file, and I suspect it
>> couldn't create any spool directory.
>
> The filesystem isn't full? What are the permissions of /var/spool/sge, or
> wherever you put the spooling directory? The account under which SGE runs
> as effective user ID must be able to write there. You can `su` to this
> account and try to create some files/directories by hand.
>
> -- Reuti
>
>
>>>>>>> That error means that the process launched by qrsh on node09 exited
>>>>>>> before the rest of the slots, so qmaster killed everything for you.
>>
>>>>>>> I see these occasionally even when the parallel run finishes normally
>>>>>>> and exits, because the first process to exit may be noticed by qmaster
>>>>>>> before the others.
>>
>> Is there any solution for the above scenario, since I feel my case
>> resembles the one above?
>>
>> Thanks,
>> Britto.
>>
>>
>> -----Original Message-----
>> From: Jim Phillips [mailto:[email protected]]
>> Sent: Monday, February 25, 2013 9:52 PM
>> To: Reuti
>> Cc: Britto, Rajesh; [email protected]
>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>
>>
>> That error means that the process launched by qrsh on node09 exited before
>> the rest of the slots, so qmaster killed everything for you.
>>
>> I see these occasionally even when the parallel run finishes normally and
>> exits, because the first process to exit may be noticed by qmaster before
>> the others.
>>
>> -Jim
>>
>> On Mon, 25 Feb 2013, Reuti wrote:
>>
>>> On 25.02.2013, at 08:03, Britto, Rajesh wrote:
>>>
>>>> I could see the following error message in the messages file:
>>>>
>>>> Qmaster |mgr|E| tightly integrated parallel task 41406.1 task 1.node09
>>>> failed - killing job
>>>
>>> This is on the qmaster AFAICS. What is in the messages file of node09?
>>> Maybe the job-specific spool directory couldn't be created.
>>>
>>> -- Reuti
>>>
>>>
>>>> Can you please help me in this regard?
>>>>
>>>> Thanks,
>>>> Britto.
>>>>
>>>> -----Original Message-----
>>>> From: Reuti [mailto:[email protected]]
>>>> Sent: Friday, February 22, 2013 6:56 PM
>>>> To: Britto, Rajesh
>>>> Cc: [email protected]
>>>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>>>
>>>> On 22.02.2013, at 08:15, Britto, Rajesh wrote:
>>>>
>>>>> Thanks for the information. It's not a fresh installation; we already
>>>>> installed 6.1, which is in production, and we are not updating it.
>>>>>
>>>>> After running strace on the PID of the process where it hangs, I found
>>>>> the following:
>>>>>
>>>>> /opt/sge/utilbin/lx24-amd64/rsh -n -p 51693 node09 exec
>>>>> '/opt/sge/utilbin/lx24-amd64/qrsh_starter'
>>>>> '/opt/spool/node09/active_jobs/41406.1/1.node09'
>>>>
>>>> To clarify this:
>>>>
>>>> the directory /opt/spool/node09/active_jobs/41406.1/1.node09 should be on
>>>> the slave node. It's not created? Anything in the messages file of the
>>>> node about this failure?
>>>>
>>>> -- Reuti
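Checked on node09 itself, that might look like the following (a sketch; the directory comes from the qrsh_starter argument quoted above, and the execd messages file is assumed to sit at the top of that node's spool directory):

$ ls -ld /opt/spool/node09/active_jobs/41406.1/1.node09
  # is the task-specific spool directory created at all?
$ grep 41406 /opt/spool/node09/messages
  # any execd log entries mentioning this job?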
>>>>> The above command hangs: it's trying to find the file
>>>>> '/opt/spool/node09/active_jobs/41406.1/1.node09', which is not available,
>>>>> whereas /opt/spool/node05/active_jobs/41406.1/ is available.
>>>>>
>>>>> I have submitted a distributed job that was running on node09 and node05
>>>>> in the grid, and the active_jobs folder exists for node05 (since the
>>>>> parent process was invoked from this node) but not for node09.
>>>>>
>>>>> I am using the following PE for my distributed job:
>>>>>
>>>>> pe_name            Distributed
>>>>> slots              94
>>>>> user_lists         NONE
>>>>> xuser_lists        NONE
>>>>> start_proc_args    /bin/true
>>>>> stop_proc_args     /bin/true
>>>>> allocation_rule    $fill_up
>>>>> control_slaves     TRUE
>>>>> job_is_first_task  FALSE
>>>>> urgency_slots      min
>>>>>
>>>>> Can you please help me to resolve the issue?
>>>>>
>>>>> Thanks,
>>>>> Britto.
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Reuti [mailto:[email protected]]
>>>>> Sent: Monday, February 18, 2013 1:54 PM
>>>>> To: Britto, Rajesh
>>>>> Cc: [email protected]
>>>>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>>>>
>>>>> Hi,
>>>>>
>>>>> On 18.02.2013, at 04:53, Britto, Rajesh wrote:
>>>>>
>>>>>> Thanks for the information.
>>>>>>
>>>>>> It's not a ROCKS cluster, it's a normal SGE cluster with RHEL 5.2
>>>>>> installed.
>>>>>>
>>>>>> There is no firewall or SELinux enabled on these machines.
>>>>>
>>>>> Is it a fresh installation? I wonder about using 6.1u2, as there were
>>>>> versions after it which were still freely available.
>>>>>
>>>>> To investigate: it might be outside of SGE. Can you please submit such a
>>>>> hanging job, log in to the node and issue:
>>>>>
>>>>> strace -p 1234
>>>>>
>>>>> with the PID of your hanging application. If it's just the `qrsh`
>>>>> hanging around, its return code might be retrieved later.
>>>>>
>>>>> One other possibility: one version of PVM failed to close stdout, which
>>>>> had a similar effect IIRC. What type of parallel application is it
>>>>> (e.g. MPI)?
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>
>>>>>> Thanks,
>>>>>> Britto.
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Reuti [mailto:[email protected]]
>>>>>> Sent: Friday, February 15, 2013 10:15 PM
>>>>>> To: Britto, Rajesh
>>>>>> Cc: [email protected]
>>>>>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>>>>>
>>>>>> On 15.02.2013, at 08:22, Britto, Rajesh wrote:
>>>>>>
>>>>>>> Hi Reuti,
>>>>>>>
>>>>>>> Thanks for the information. I am using SGE 6.1u2.
>>>>>>
>>>>>> Ok, IIRC the builtin startup mechanism appeared only in 6.2.
>>>>>>
>>>>>>
>>>>>>> qconf -sconf:
>>>>>>>
>>>>>>> qlogin_command               telnet
>>>>>>> qlogin_daemon                /usr/sbin/in.telnetd
>>>>>>> rlogin_daemon                /usr/sbin/in.rlogind
>>>>>>
>>>>>> ROCKS? I remember that they added some lines at the end which override
>>>>>> settings appearing earlier in the file.
>>>>>>
>>>>>> Do you have any firewall installed on the system which could block the
>>>>>> MPI communication?
>>>>>>
>>>>>> -- Reuti
>>>>>>
>>>>>>
>>>>>>> The rsh command doesn't appear in the qconf -sconf output. We are
>>>>>>> using Open MPI for running parallel and distributed jobs.
>>>>>>>
>>>>>>> The application uses the mpirun command to invoke the distributed
>>>>>>> jobs. Please let me know for more clarification.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Britto.
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Reuti [mailto:[email protected]]
>>>>>>> Sent: Wednesday, February 13, 2013 7:00 PM
>>>>>>> To: Britto, Rajesh
>>>>>>> Cc: [email protected]
>>>>>>> Subject: Re: [gridengine users] Issue in Distributed jobs
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 13.02.2013, at 13:43, Britto, Rajesh wrote:
>>>>>>>
>>>>>>>> When I tried to execute a distributed job on the cluster, the job
>>>>>>>> started successfully.
>>>>>>>>
>>>>>>>> However, after some time, the job hung in the following process. Can
>>>>>>>> anyone please let me know what could be the issue?
>>>>>>>>
>>>>>>>> /opt/sge/utilbin/lx24-amd64/rsh -n -p 36425 <NodeName> exec
>>>>>>>> '/opt/sge/utilbin/lx24-amd64/qrsh_starter'
>>>>>>>> '/opt/spool/node/active_jobs/41270.1/1.node'
>>>>>>>
>>>>>>> It looks like you used the old startup method via `rsh` - which
>>>>>>> version of SGE is it? When setting:
>>>>>>>
>>>>>>> $ qconf -sconf
>>>>>>> ...
>>>>>>> qlogin_command               builtin
>>>>>>> qlogin_daemon                builtin
>>>>>>> rlogin_command               builtin
>>>>>>> rlogin_daemon                builtin
>>>>>>> rsh_command                  builtin
>>>>>>> rsh_daemon                   builtin
>>>>>>>
>>>>>>> the `rsh` shouldn't appear in the process tree. How did you start your
>>>>>>> application in the jobscript? How does the application start slave
>>>>>>> tasks: by Open MPI, MPICH2 ...?
>>>>>>>
>>>>>>>
>>>>>>>> FYI, the cluster has both passwordless ssh and rsh communication
>>>>>>>> between the nodes.
>>>>>>>
>>>>>>> In a Tight Integration setup even parallel jobs don't need this.
>>>>>>>
>>>>>>> -- Reuti
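For reference, a tightly integrated Open MPI job script can be as small as the following (a sketch assuming an Open MPI build configured with --with-sge; the slot count and application name are made up):

#!/bin/sh
#$ -pe Distributed 8
#$ -cwd
# An SGE-aware mpirun reads the PE allocation from the environment and
# starts the slave tasks via qrsh -inherit, so no -np or hostfile is needed.
mpirun ./my_parallel_app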
