From: Reuti <[email protected]>
Subject: Re: [gridengine users] Configure gridengine on CentOS 6.3
Date: Wed, 7 Nov 2012 16:37:22 +0100

> On 07.11.2012 at 15:46, Petter Gustad wrote:
> 
>>> From: Reuti <[email protected]>
>>> Subject: Re: [gridengine users] Configure gridengine on CentOS 6.3
>>> Date: Tue, 30 Oct 2012 11:27:49 +0100
>>> 
>>>> Just use the version you have already in the shared /usr/sge or your
>>>> particular mountpoint.
>>> 
>>> I should probably try this first, at least to verify that it's
>>> working. But later I would like to migrate to CentOS on all my
>>> exechosts and leave the installation to somebody else.
>> 
>> I did this and it worked out fine on the first machine I migrated.
>> However, on the next set of machines I ran into a problem where the
>> submitted job will cause the queue to go into the error state.
>> 
>> I observe that:
>> 
>> 1) It will not be submitted
>> 2) The queue will be marked with the 'E' state
>> 3) I get an e-mail saying
>>    Shepherd pe_hostfile:
>>    node 1 queue@node UNDEFINED
>> 4) The node will log the following in the spool/node/messages file:
>>    11/07/2012 15:33:07|  main|node|E|shepherd of job 48548.1 exited with exit status = 11
>> 5) qstat -j jobnumber returns
>> 
>>    error reason    1:          11/07/2012 15:33:06 [555:29681]: unable to find job file "/work/gridengine/spool/node/job_scr

Is this output always truncated, or could this be the source of the problem?

> This looks like an unusual path for the spool directory. The name of the node 
> should be included.

I've substituted the string "node" for the actual node name. It appears
to be the same for all the nodes, hence I just used "node".

> $ qconf -sconf
> 
> (at the top something like: execd_spool_dir              /var/spool/sge, the 
> directory for the particular node will be created automatically when the 
> execd starts up)

This will show the spool directory on the qmaster, which is different
from the above. But for all the nodes this is /work/gridengine/spool.

> $ qconf -sconfl
> 
> (get all exechost definitions [if any are present at all]), then for the 
> particular node:
> 
> $ qconf -sconf node42
> 
> and check the path to the execd_spool_dir.

They are all identical. If I do something like:

qconf -sconf good-node > /tmp/good-node
qconf -sconf bad-node > /tmp/bad-node

and diff the two, the only diff will be the hostname part.
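
The comparison above could be sketched roughly as follows, with the
hostname masked so that only real differences show up. The helper name
normalized_conf_diff and the HOST placeholder are made up for this
illustration and are not part of Grid Engine; the two dump files are
assumed to have been written by the qconf commands above.

```shell
# Sketch only: normalized_conf_diff is a hypothetical helper, not an SGE tool.
# It compares two saved `qconf -sconf <host>` dumps while masking the host name.
normalized_conf_diff() {
    # $1, $2: host names; $3, $4: files holding the corresponding dumps
    host_a=$1; host_b=$2; file_a=$3; file_b=$4
    tmp_a=$(mktemp) || return 2
    tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 2; }
    # Replace each host name with the same placeholder. The name is used as a
    # sed pattern, so a "." in it matches any character (good enough here).
    sed "s/$host_a/HOST/g" "$file_a" > "$tmp_a"
    sed "s/$host_b/HOST/g" "$file_b" > "$tmp_b"
    diff "$tmp_a" "$tmp_b"
    status=$?
    rm -f "$tmp_a" "$tmp_b"
    # 0 = identical apart from the host name, 1 = real differences
    return $status
}

# Usage, with the files created above:
#   normalized_conf_diff good-node bad-node /tmp/good-node /tmp/bad-node
```

No output and exit status 0 would mean the two configurations differ
only in the hostname.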

All the nodes are using spool on a local filesystem located at
/work/gridengine/spool


The only difference I see on the bad nodes is that there is a "." at
the end of the permissions in the spool directory so I think this
might be related to SELinux. I'll have to investigate this further.
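
A minimal sketch of how that check might be scripted. GNU ls appends "."
to the mode string when a file carries an SELinux security context and
no other alternate access method. The helper has_selinux_mark is a
hypothetical name chosen for this example; ls -Z and getenforce are the
standard tools, and getenforce is only called if it is installed.

```shell
# Sketch only: has_selinux_mark is a hypothetical helper.
# It tests whether a mode string from `ls -ld` ends in ".", which GNU ls
# uses to flag an SELinux security context ("+" would indicate ACLs etc.).
has_selinux_mark() {
    # $1: mode string as printed in the first column of `ls -ld`
    case $1 in
        *.) return 0 ;;
        *)  return 1 ;;
    esac
}

spool_dir=/work/gridengine/spool   # spool path mentioned in this thread
if [ -d "$spool_dir" ]; then
    mode=$(ls -ld "$spool_dir" | awk '{print $1}')
    if has_selinux_mark "$mode"; then
        echo "$spool_dir carries an SELinux context:"
        ls -ldZ "$spool_dir"        # print the security context itself
        # Report the enforcement mode if the SELinux tools are present.
        if command -v getenforce >/dev/null 2>&1; then
            getenforce
        fi
    fi
fi
```

Running this on a good node and a bad node would show whether the
contexts or the enforcement mode differ between them.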

Thank you for your help.

Best regards
//Petter

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
