On Mon, Apr 16, 2018 at 12:16:26PM +0100, Mark Dixon wrote:
> Hi William,
>
> I've seen this before back in the SGE 6.2u5 days when it used to write out
> core binding options it couldn't subsequently read back in.
>
> IIRC, users are read from disk at startup in turn and then the files are
> only written to from then on - so this sort of thing only tends to be
> noticed when the qmaster is restarted. If it finds a user file that it
> cannot read properly, SGE gives up reading any more user files and you'll
> appear to lose a big chunk of your user base even if those other user files
> are ok.

I don't think that can be right, given that the qmaster complains about
multiple user files on startup. If it gave up after the first, then
presumably it wouldn't complain about the others.
> Your instinct is right: stop the qmaster, delete or preferably modify the
> file for the reported problem user so that the bit it's complaining about is
> removed, and start the qmaster again. Repeat if it complains about another
> user.
>
> Feel free to post the main bit of the user file if you want an opinion about
> the edit.

The user who first drew our attention to this has a user file that looks
like this:

name ucbptba
oticket 0
fshare 1
delete_time 0
usage NONE
usage_time_stamp 1522981780
long_term_usage NONE
project AllUsers cpu=10018380758.728712,mem=566859015.524199,io=485.211850,binding_inuse!SccccccccScccccccc=0.000000,iow=0.000000,vmem=2649215676489.408691,maxvmem=0.000000,submission_time=185478188707.216309,priority=0.000000,exit_status=0.000000,signal=12732.427329,start_time=185484299059.331726,end_time=185486650486.414856,ru_wallclock=24949894.118398,ru_utime=352365452.860190,ru_stime=227668.336061,ru_maxrss=2312860731.515988,ru_ixrss=0.000000,ru_ismrss=0.000000,ru_idrss=0.000000,ru_isrss=0.000000,ru_minflt=139382336529.001709,ru_majflt=13666248.060146,ru_nswap=0.000000,ru_inblock=885794734.354178,ru_oublock=19889227.200863,ru_msgsnd=0.000000,ru_msgrcv=0.000000,ru_nsignals=0.000000,ru_nvcsw=2535937951.970052,ru_nivcsw=3926968166.964020,acct_cpu=376560487.827089,acct_mem=654584364.211819,acct_io=207.611235,acct_iow=0.000000,acct_maxvmem=30119711845145.703125,finished_jobs=0.000000
cpu=10021139983.020000,mem=567054438.077435,io=485.452673,binding_inuse!SccccccccScccccccc=0.000000,iow=0.000000,vmem=2650784657408.000000,maxvmem=0.000000,submission_time=185581632895.000000,priority=0.000000,exit_status=0.000000,signal=12740.000000,start_time=185587744917.000000,end_time=185590097173.000000,ru_wallclock=24960037.000000,ru_utime=352504013.044632,ru_stime=227747.889479,ru_maxrss=2313909568.000000,ru_ixrss=0.000000,ru_ismrss=0.000000,ru_idrss=0.000000,ru_isrss=0.000000,ru_minflt=139433350974.000000,ru_majflt=13670114.000000,ru_nswap=0.000000,ru_inblock=886048536.000000,ru_oublock=19900264.000000,ru_msgsnd=0.000000,ru_msgrcv=0.000000,ru_nsignals=0.000000,ru_nvcsw=2536948317.000000,ru_nivcsw=3928530274.000000,acct_cpu=376713375.452962,acct_mem=654806517.378607,acct_io=207.726701,acct_iow=0.000000,acct_maxvmem=30134842138624.000000,finished_jobs=125.000000;
default_project NONE
debited_job_usage 251393 binding_inuse!SccccccccScccccccc=0.000000,cpu=11215611.000000,mem=0.000000,io=0.648810,iow=0.000000;

It is possible that this file has fixed itself, as two tasks from the
problem array job have started and the file has changed since I first
looked at it. However, qconf -suser still doesn't show the user in
question, and the array job is apparently stuck at the back of the queue
because it isn't getting any functional tickets.

The messages file complains thusly:

04/16/2018 11:06:53| main|util01|E|line 12 should begin with an attribute name
04/16/2018 11:06:53| main|util01|E|error reading file: "/var/opt/sge/shared/qmaster/users/ucbptba"
04/16/2018 11:06:53| main|util01|E|unrecognized characters after the attribute values in line 12: "mem"

Line 12 being the line starting with "project".

> If you delete the user file, you'll lose all usage for that user - including
> that user's contribution to projects in any share tree you might have.
> You'll also probably lose any jobs queued up by them.

Oh fun.
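For anyone who'd rather repair a file in place than delete it: the parser appears to choke on the binding_inuse!… tokens, and they can be stripped with something like the following. This is only a sketch, assuming every such token has the form binding_inuse!<topology>=<number>; back up the spool file first and edit with the qmaster stopped.

```shell
# strip_binding_inuse: remove binding_inuse!...=... tokens (and the
# comma that separated them from the next entry) from each input line.
strip_binding_inuse() {
    sed -E 's/binding_inuse![^=,]+=[0-9.]+,?//g; s/,;/;/g; s/,$//'
}

# Example: clean a spooled user file (keep a backup first), e.g.
#   cp users/ucbptba users/ucbptba.bak
#   strip_binding_inuse < users/ucbptba.bak > users/ucbptba
# prints: usage cpu=10018380758.728712,iow=0.000000
strip_binding_inuse <<'EOF'
usage cpu=10018380758.728712,binding_inuse!SccccccccScccccccc=0.000000,iow=0.000000
EOF
```

The second and third sed expressions just tidy up a dangling comma if the removed token happened to be the last entry in a list.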
Fortunately we're mostly per-user functional share and use the share tree
only as a tie-breaker. But deleting jobs would be bad. Is the "probably
lose any jobs queued" bit something you know from experience? It seems odd
that we can have jobs queued and running with the running qmaster knowing
nothing of the user, but that deleting the file would kill them on restart.

> Mark
>
> On Mon, 16 Apr 2018, William Hay wrote:
>
> > We had a user report that one of their array jobs wasn't scheduling. A
> > bit of poking around showed that qconf -suser knew nothing of the user,
> > despite them having a queued job. However, there was a file in the spool
> > that should have defined the user. Several other users appear to be
> > affected as well.
> >
> > I bounced the qmaster in the hope of getting it to reread the users'
> > details from disk, and got several messages like this:
> >
> > 04/16/2018 11:06:53| main|util01|E|error reading file:
> > "/var/opt/sge/shared/qmaster/users/zccag81"
> > 04/16/2018 11:06:53| main|util01|E|unrecognized characters after the
> > attribute values in line 12: "mem"
> > 04/16/2018 11:06:53| main|util01|E|line 12 should begin with an attribute
> > name
> >
> > I suspect that my next step should be to stop the qmaster, delete the
> > problem files and then restart the qmaster. Hopefully grid engine will
> > then recreate the user, or I can create them manually.
> >
> > However if anyone has a better idea or has seen this before I'd be glad
> > to hear of it.
> >
> > Creation of the user object on our cluster is done by means of
> > enforce_user auto:
> >
> > # qconf -sconf | grep auto
> > enforce_user                 auto
> > auto_user_oticket            0
> > auto_user_fshare             1
> > auto_user_default_project    none
> > auto_user_delete_time        0
> >
> > William
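If a user file does have to be deleted, the user object can also be recreated by hand with qconf -Auser rather than waiting for enforce_user auto to do it. A sketch, with attribute values mirroring the auto_user_* settings quoted above and "ucbptba" as the example user from this thread:

```shell
# Write a minimal user-object template matching the auto_user_* defaults.
cat > /tmp/ucbptba.user <<'EOF'
name ucbptba
oticket 0
fshare 1
delete_time 0
default_project NONE
EOF

# Then load it into the qmaster (not run here):
#   qconf -Auser /tmp/ucbptba.user
# and verify with:
#   qconf -suser ucbptba
```

Note this recreates only the user object, not its accumulated usage, so any share-tree contribution is still lost.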
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users