Re: [gridengine users] Random queue errors, and suspect pe_hostfiles

Jewell, Chris Sun, 25 Aug 2013 16:14:15 -0700

> The message is from a failure of setuid(2) or similar.  I don't know if
> it's a libc bug that errno seems not to be set ("Success") as it should
> be.
> 
> The two possible cases are:
> 
>    EAGAIN The uid does not match the current uid and  uid  brings process
>           over its RLIMIT_NPROC resource limit.
> 
> i.e. check the limit on processes/user (ulimit -u), and


I checked ulimit -u, which shows 1024 processes per user.  Monitoring
processes per user with:

$ while true; do ps hax -o user | sort | uniq -c; sleep 0.5; done

shows that sgeadmin and my username never go above 120 processes.  I've
increased the process limit in /etc/security/limits.d/90-nproc.conf to
4096 to see if this improves things, but...

>    EPERM  The  user is not privileged (Linux: does not have the CAP_SETUID
>           capability) and uid does not match the real UID  or  saved set-
>           user-ID of the calling process.
> 
> possibly because there was a previous failure to switch back to root
> somehow -- many cases still don't have a check for errors,
> unfortunately.  (At least in some cases failure to drop privileges
> should probably be fatal.).
> 
> In the absence of errno info, EAGAIN seems more likely.

Well, not sure, given the above.  Anything else I can do to try to gather more 
info?  Is it possible to get GE to not delete the directories in 
$SGE_ROOT/default/spool/hostname/active_jobs so I can get a trace?
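
One more check I can try, as a rough sketch only: according to getrlimit(2),
RLIMIT_NPROC on Linux counts threads as well as processes, so the ps hax
loop above may undercount.  Also, ulimit -u reports the limit of my login
shell, which is not necessarily the limit the daemon is running with.  The
function names below are my own (hypothetical), and I'm assuming procps ps
and a mounted /proc:

```shell
#!/bin/sh
# Tally identical lines from stdin, most frequent first.
count_per_user() {
    sort | uniq -c | sort -rn
}

# Print the "Max processes" limit that a running process actually has.
# Argument: a PID (read from /proc/<pid>/limits, Linux only).
show_nproc_limit() {
    grep 'Max processes' "/proc/$1/limits"
}

# One line per thread (-L) for all processes (-e); "user=" drops the header.
ps -eLo user= | count_per_user

# Limit of the current shell, for comparison with ulimit -u.
show_nproc_limit "$$"

# Assumption: the execution daemon is named sge_execd and pgrep is available.
# show_nproc_limit "$(pgrep -o sge_execd)"
```

If the per-thread counts get close to the "Max processes" soft limit shown
for the daemon, that would point towards EAGAIN after all.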

Cheers,

Chris

--
Dr Chris Jewell
Lecturer in Biostatistics
Institute of Fundamental Sciences
Massey University
Private Bag 11222
Palmerston North 4442
New Zealand
Tel: +64 (0) 6 350 5701 Extn: 3586


