In the message dated: Fri, 23 Aug 2013 01:28:34 -0000,
The pithy ruminations from "Jewell, Chris" on 
<Re: [gridengine users] Random queue errors, and suspect pe_hostfiles> were:
=> > 
=> > I started with a search of the SGE mailing list archive, and found your
=> > post. :)
=> > 
=> > Have you found a solution?
=> 
=> 
=> Hello all,
=> 
=> Sorry for the long leave of absence.  I've been thoroughly testing
=> my system for this issue.  I checked my RAID1 for consistency, and
=> performed an xfs_repair to make doubly sure my filesystem was okay.
=> It was.  I also disabled SELinux in case that was the problem.
=> 
=> In reply to Reuti:
=> 
=> > The directories (/opt/sge/default/spool/it060123/active_jobs/...) are
=> > normally created by the admin user - is this root or any other one with
=> > normal rights (which would be fine)?
=> > 
=> > Nevertheless also "other users" must be allowed to read this
=> > directory and the files inside. Is there any special `umask` in place
=> > and/or does it only happen to parallel jobs and/or only certain users?
=> 
=> No special umask or parallel jobs being used.  The problem seems more
=> apparent when lots of very short jobs are sent to the system.

I observe the same thing: many short jobs seem to trigger the
problem more quickly. The "Permission denied" error when creating
the pe_hostfile appears whether the jobs are all single-threaded, in a
parallel (multithreaded) queue, or mixed.
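
For what it's worth, the reproduction on our side is nothing more
sophisticated than a throwaway loop like the one below (a rough sketch;
the job count and sleep length are arbitrary, and -b y just submits
/bin/sleep as a binary rather than a script):

        # Flood the queue with short jobs; adjust the count/sleep as needed.
        for i in $(seq 1 500); do
                qsub -b y -o /dev/null -e /dev/null /bin/sleep 5
        done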

The issue doesn't seem to be with the number of jobs in the queue
(we've got 17K+ jobs in the queue in our other 6.2u5 cluster right now),
but with the number of jobs running on a single exec host.

I'll try to do some more systematic monitoring, but manually running
'qstat' before the problem shows up suggests that the error appears once
more than roughly 25-50 jobs are running on a single exec host.
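
If it helps, this is roughly what I intend to log alongside the test
runs (a rough sketch; it assumes the default qstat column layout, where
the queue@host instance is the 8th field for a running job):

        # Count running jobs per exec host from the queue@host column.
        qstat -s r -u '*' | awk 'NR > 2 && $8 ~ /@/ {
                split($8, q, "@"); n[q[2]]++
        } END { for (h in n) print h, n[h] }'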

I don't think that the issue is related to SELinux, disk corruption,
a umask, or permissions. The permissions and ownership of the
/opt/sge/.../active_jobs directory are correct. In our case, as we're
benchmarking different machines, there's only a single user submitting
jobs and no other activity on each server. Starting our test runs (i.e.,
from a point with no jobs in the queue and with the queue in a non-error
state) always succeeds in running some jobs (20~100s of them) before the
error appears. Clearing the queue's error state allows jobs to resume
running.

Is there any locking in SGE that could prevent the creation of the
pe_hostfile (for example, is the /opt/sge/.../active_jobs directory locked
while a job is being deleted, causing a new job that attempts to create
its pe_hostfile to incorrectly report a 'Permission denied' error)?
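
As part of the monitoring I'll try to catch the failing system call on
the execd side with something along these lines (just a sketch; it
assumes strace is installed and that sge_execd is the only execd process
on the box):

        # Watch the running execd for pe_hostfile open/mkdir failures.
        strace -f -p $(pidof sge_execd) -e trace=open,openat,mkdir 2>&1 \
                | grep -E 'active_jobs|EACCES'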

My testing is deliberately 'overloading' the execd (in terms of h_vmem
requested and/or number of CPU-slots used and/or load). Qstat reports:

        queue instance "[email protected]" dropped because it is
        temporarily not available
        All queues dropped because of overload or full
=> 
=> The one thing that Mark and I have in common is high-CPU count
=> machines.  My box is currently configured to provide 28 slots out of 32
=> logical cores.  I wonder if this might be causing a race-condition to

Another thing we have in common--which is different than most SGE
installations--is that the sgemaster and sge_execd are on the same server.

Is it possible that the sequence of events is:

        SGE runs jobs until the execd becomes overloaded

        job N is scheduled to run

        the qmaster determines that the queue is full and refuses to run
        job N

        the execd incorrectly reports the failure to run the job as a
        'Permission denied' error in creating the pe_hostfile

        the qmaster puts the queue into an error state

The difference would be that on a traditional cluster with multiple
machines and a separation between the qmaster and sge_execd, the execd
receives notice that the queue is overloaded before trying to run the
job, and therefore doesn't report a failure in creating the pe_hostfile.

=> become apparent in the creation of the pe_hostfile?

Following up on William Hay's suggestion about checking the number of
Unix groups, in our case, I don't think the gid range is an issue:

        gid_range                    40000-40500

Unless SGE is very slow about wrapping or re-using gids within that range, 500
gids should be sufficient for a 32 or 64-slot server.

I've increased the range to 40000-50000 just to rule out the possibility, but
that doesn't prevent the "Permission denied" error when opening a new
pe_hostfile.
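
(For anyone who wants to check their own setup: gid_range lives in the
global cluster configuration, so something like the following shows and
edits it -- the grep is just a convenience:)

        # Show the current gid_range from the global configuration.
        qconf -sconf | grep gid_range

        # Edit the global configuration (opens $EDITOR) to widen gid_range.
        qconf -mconf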

However, this line of investigation made me think about other limits
which may come into play on high-CPU-count machines.  I changed:

        number of open files (1024 default, limit of 4096), increased to 4096
        number of processes (1024 default), increased to 4096

and restarted sgeexecd and the sgemaster, then resubmitted my test
jobs... unfortunately, the same error happens:

        can't open file /opt/sge/6.2u5/default/spool/r820-1/active_jobs/93629.1/pe_hostfile: Permission denied
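
In case the details matter, the limits were raised roughly as below
(a sketch of what I did; it assumes a RHEL-style /etc/security/limits.conf
and that the daemons were restarted from a fresh login shell so the new
limits actually applied), and I checked the running execd afterwards:

        # /etc/security/limits.conf additions
        *    soft    nofile    4096
        *    hard    nofile    4096
        *    soft    nproc     4096
        *    hard    nproc     4096

        # Confirm the limits the running sge_execd actually has.
        cat /proc/$(pidof sge_execd)/limits | grep -E 'open files|processes'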


Any suggestions?
        
Thanks,

Mark

=> 
=> Cheers,
=> 
=> Chris
=> 
=> 
=> --
=> Dr Chris Jewell
=> Lecturer in Biostatistics
=> Institute of Fundamental Sciences
=> Massey University
=> Private Bag 11222
=> Palmerston North 4442
=> New Zealand
=> Tel: +64 (0) 6 350 5701 Extn: 3586
=> 
=> 

-- 
Mark Bergman
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
