From: Reuti <[email protected]>
Subject: Re: [gridengine users] Configure gridengine on CentOS 6.3
Date: Wed, 7 Nov 2012 16:37:22 +0100

> On 07.11.2012 at 15:46, Petter Gustad wrote:
>
>>> From: Reuti <[email protected]>
>>> Subject: Re: [gridengine users] Configure gridengine on CentOS 6.3
>>> Date: Tue, 30 Oct 2012 11:27:49 +0100
>>>
>>>> Just use the version you have already in the shared /usr/sge or your
>>>> particular mountpoint.
>>>
>>> I should probably try this first, at least to verify that it's
>>> working. But later I would like to migrate to CentOS on all my
>>> exechosts and leave the installation to somebody else.
>>
>> I did this and it worked out fine on the first machine I migrated.
>> However, on the next set of machines I run into the problem where the
>> submitted job will cause the queue to go into the error state.
>>
>> I observe that:
>>
>> 1) It will not be submitted
>> 2) The queue will be marked with the 'E' state
>> 3) I get an e-mail saying
>>    Shepherd pe_hostfile:
>>    node 1 queue@node UNDEFINED
>> 4) The node will log the following in the spool/node/messages file:
>>    11/07/2012 15:33:07| main|node|E|shepherd of job 48548.1 exited with
>>    exit status = 11
>> 5) qstat -j jobnumber returns
>>
>>    error reason 1: 11/07/2012 15:33:06 [555:29681]: unable to
>>    find job file "/work/gridengine/spool/node/job_scr

Is this output always truncated, or could this be the source of the
problem?

> This looks like an unusual path for the spool directory. The name of the node
> should be included.

I've substituted the string "node" for the actual node name. It appears
to be the same for all the nodes, hence I just used "node".

> $ qconf -sconf
>
> (at the top something like: execd_spool_dir /var/spool/sge, the
> directory for the particular node will be created automatically when the
> execd starts up)

This will show the spool directory on the qmaster, which is different
from the above. But for all the nodes this is /work/gridengine/spool.

> $ qconf -sconfl
>
> (get all exechost definitions [if any are present at all]), then for the
> particular node:
>
> $ qconf -sconf node42
>
> and check the path to the execd_spool_dir.

They are all identical. If I do something like:

qconf -sconf good-node > /tmp/good-node
qconf -sconf bad-node > /tmp/bad-node

and diff the two, the only difference is the hostname part. All the
nodes use a spool directory on a local filesystem, located at
/work/gridengine/spool.

The only difference I see on the bad nodes is that there is a "." at the
end of the permissions on the spool directory, so I think this might be
related to SELinux. I'll have to investigate this further.

Thank you for your help.

Best regards
//Petter
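
The trailing "." mentioned above can be checked from a script. GNU ls appends a "." to the mode string when a file carries an SELinux security context and no other alternate access method. The following is only a minimal sketch: the helper names spool_mode and has_selinux_dot are hypothetical and are not part of Grid Engine.

```shell
#!/bin/sh
# Sketch only: helper names are made up for illustration.

# spool_mode DIR
# Print the mode string (first column of `ls -ld`) for DIR.
spool_mode() {
    ls -ld -- "$1" | awk '{print $1}'
}

# has_selinux_dot MODE
# Return 0 if MODE ends in ".", the marker GNU ls prints for a file
# that has an SELinux security context; return 1 otherwise.
has_selinux_dot() {
    case "$1" in
        *.) return 0 ;;
        *)  return 1 ;;
    esac
}
```

On an exec host this could be used as `has_selinux_dot "$(spool_mode /work/gridengine/spool)" && echo "SELinux context present"`. Where the SELinux tools are installed, `ls -ldZ /work/gridengine/spool` shows the context itself and `getenforce` reports whether SELinux is enforcing. Whether SELinux is really what causes exit status 11 here remains an assumption to be verified.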
