In the message dated: Thu, 01 Aug 2013 03:09:56 -0000,
The pithy ruminations from "Jewell, Chris" on 
<[gridengine users] Random queue errors, and suspect pe_hostfiles> were:
=> Hello all,
=> 
=> A while since I posted here, so good to be back!
=> 
=> My installation of GE 8.1.3 from the Scientific Linux 6.3 RPM repo
=> has started misbehaving of late, since I introduced a share tree policy
=> the other day.

I'm using SGE 6.2u5 under RHEL 6.3.

=> 
=> My setup is contained entirely on my 32 cpu, 2 NVIDIA Tesla card
=> machine (both qmaster and execd), and the spool directory is mounted in
=> /opt which is on the root partition.  Having had a very stable vanilla

I'm trying to do performance testing on 2 machines:
        32 CPU Intel vs 64 CPU AMD

        each machine is isolated as a single 'cluster' with
        both qmaster & execd on each server

        the spool directory for each 'cluster' is mounted in /opt which
        is on the root partition

The SGE configuration is very closely based on the stable config we've been
using on our production cluster for several years.


        [SNIP!]

=> 
=> This seems associated with another new (though less frequent) error message:
=> 
=> 07/29/2013 09:55:54|  main | it060123 | E | can't start job "821": can't open file /opt/sge/default/spool/it060123/active_jobs/821.36000/pe_hostfile: Permission denied
=> 

I'm seeing the same error on both clusters.

The error shows up reliably, but at unpredictable points. For example, I've
got a sample set of jobs intended to load each server in order to test
throughput. The test submits 2048 jobs. The error occurs after roughly 75 to
95 jobs have run on each server, at which point the queue is in an error
state with the remaining jobs waiting.

The error doesn't occur after a fixed number of jobs, and seems to be
independent of the job content--I've tried 4 different types of workloads
during the process of evaluating the servers, and the error is the same.
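
To quantify how often the failure hits, the qmaster messages file can be
scanned for this specific error. The following is only a minimal sketch. It
assumes each log entry sits on a single line in the pipe-separated layout of
the quoted message (timestamp, thread, host, level, text). The function names
are illustrative and not part of SGE.

```python
import re
from collections import Counter

# Matches the text of the error quoted in this thread.
# Assumption: the wording is identical across occurrences.
MESSAGE = re.compile(
    r"""can't start job "(\d+)": can't open file (\S+/pe_hostfile): Permission denied"""
)


def parse_line(line):
    """Return a dict describing a pe_hostfile error, or None for other lines."""
    # Assumed layout: timestamp | thread | host | level | message text
    parts = [p.strip() for p in line.split("|", 4)]
    if len(parts) != 5 or parts[3] != "E":
        return None
    m = MESSAGE.search(parts[4])
    if m is None:
        return None
    return {
        "time": parts[0],
        "host": parts[2],
        "job": int(m.group(1)),
        "path": m.group(2),
    }


def count_by_host(lines):
    """Count pe_hostfile errors per execution host."""
    hits = (parse_line(line) for line in lines)
    return Counter(hit["host"] for hit in hits if hit is not None)
```

Feeding it the lines of a messages file, for example via open(path) with
whatever path the local spool uses, gives a per-host tally that can be compared
against the number of jobs that completed before the queue went into error.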

=> which puts the queue into an error state.  This appears to happen to a
=> minority of jobs at random, but of course stalls the queue.  I'm fairly
=> sure the filesystem is okay (at least, an fsck tells me it is), so I'm
=> assuming it's something related to GE.

Same here...

=> 
=> Any ideas on where to start?
=> 

I started with a search of the SGE mailing list archive, and found your
post. :)

Have you found a solution?

Thanks,

Mark

=> Cheers,
=> 
=> Chris
=> 
=> 
=> 
=> --
=> Dr Chris Jewell
=> Lecturer in Biostatistics
=> Institute of Fundamental Sciences
=> Massey University
=> Private Bag 11222
=> Palmerston North 4442
=> New Zealand
=> Tel: +64 (0) 6 350 5701 Extn: 3586
=> 
