On Mon, Apr 16, 2018 at 12:16:26PM +0100, Mark Dixon wrote:
> Hi William,
>
> I've seen this before back in the SGE 6.2u5 days when it used to write out
> core binding options it couldn't subsequently read back in.
>
> IIRC, users are read from disk at startup in turn and then the files are
> only written to from then on - so this sort of thing only tends to be
> noticed when the qmaster is restarted. If it finds a user file that it
> cannot read properly, SGE gives up reading any more user files and you'll
> appear to lose a big chunk of your user base even if those other user files
> are ok.

I don't think that can be right, given that the qmaster complains about
multiple user files on startup. If it gave up after the first, then
presumably it wouldn't complain about the others.
> Your instinct is right: stop the qmaster, delete or preferably modify the
> file for the reported problem user so that the bit it's complaining about is
> removed, and start the qmaster again. Repeat if it complains about another
> user.
>
> Feel free to post the main bit of the user file if you want an opinion about
> the edit.

The user who first drew our attention to this has a user file that looks
like this:

name ucbptba
oticket 0
fshare 1
delete_time 0
usage NONE
usage_time_stamp 1522981780
long_term_usage NONE
project AllUsers cpu=10018380758.728712,mem=566859015.524199,io=485.211850,binding_inuse!SccccccccScccccccc=0.000000,iow=0.000000,vmem=2649215676489.408691,maxvmem=0.000000,submission_time=185478188707.216309,priority=0.000000,exit_status=0.000000,signal=12732.427329,start_time=185484299059.331726,end_time=185486650486.414856,ru_wallclock=24949894.118398,ru_utime=352365452.860190,ru_stime=227668.336061,ru_maxrss=2312860731.515988,ru_ixrss=0.000000,ru_ismrss=0.000000,ru_idrss=0.000000,ru_isrss=0.000000,ru_minflt=139382336529.001709,ru_majflt=13666248.060146,ru_nswap=0.000000,ru_inblock=885794734.354178,ru_oublock=19889227.200863,ru_msgsnd=0.000000,ru_msgrcv=0.000000,ru_nsignals=0.000000,ru_nvcsw=2535937951.970052,ru_nivcsw=3926968166.964020,acct_cpu=376560487.827089,acct_mem=654584364.211819,acct_io=207.611235,acct_iow=0.000000,acct_maxvmem=30119711845145.703125,finished_jobs=0.000000
cpu=10021139983.020000,mem=567054438.077435,io=485.452673,binding_inuse!SccccccccScccccccc=0.000000,iow=0.000000,vmem=2650784657408.000000,maxvmem=0.000000,submission_time=185581632895.000000,priority=0.000000,exit_status=0.000000,signal=12740.000000,start_time=185587744917.000000,end_time=185590097173.000000,ru_wallclock=24960037.000000,ru_utime=352504013.044632,ru_stime=227747.889479,ru_maxrss=2313909568.000000,ru_ixrss=0.000000,ru_ismrss=0.000000,ru_idrss=0.000000,ru_isrss=0.000000,ru_minflt=139433350974.000000,ru_majflt=13670114.000000,ru_nswap=0.000000,ru_inblock=886048536.000000,ru_oublock=19900264.000000,ru_msgsnd=0.000000,ru_msgrcv=0.000000,ru_nsignals=0.000000,ru_nvcsw=2536948317.000000,ru_nivcsw=3928530274.000000,acct_cpu=376713375.452962,acct_mem=654806517.378607,acct_io=207.726701,acct_iow=0.000000,acct_maxvmem=30134842138624.000000,finished_jobs=125.000000;
default_project NONE
debited_job_usage 251393 binding_inuse!SccccccccScccccccc=0.000000,cpu=11215611.000000,mem=0.000000,io=0.648810,iow=0.000000;

It is possible that this file has fixed itself, as two tasks from the
problem array job have started and the file has changed since I first
looked at it. However, qconf -suser still doesn't show the user in
question, and the array job is apparently stuck at the back of the queue
because it isn't getting any functional tickets.

The messages file complains thusly:

04/16/2018 11:06:53| main|util01|E|line 12 should begin with an attribute name
04/16/2018 11:06:53| main|util01|E|error reading file: "/var/opt/sge/shared/qmaster/users/ucbptba"
04/16/2018 11:06:53| main|util01|E|unrecognized characters after the attribute values in line 12: "mem"

Line 12 being the line starting with "project".

> If you delete the user file, you'll lose all usage for that user - including
> that user's contribution to projects in any share tree you might have.
> You'll also probably lose any jobs queued up by them.

Oh fun.
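For anyone who'd rather repair a file in place than delete it: the parser appears to choke on the binding_inuse!… tokens, and they can be stripped with something like the following. This is only a sketch, assuming every such token has the form binding_inuse!<topology>=<number>; back up the spool file first and edit with the qmaster stopped.

```shell
# strip_binding_inuse: remove binding_inuse!...=... tokens (and the
# comma that separated them from the next entry) from each input line.
strip_binding_inuse() {
    sed -E 's/binding_inuse![^=,]+=[0-9.]+,?//g; s/,;/;/g; s/,$//'
}

# Example: clean a spooled user file (keep a backup first), e.g.
#   cp users/ucbptba users/ucbptba.bak
#   strip_binding_inuse < users/ucbptba.bak > users/ucbptba
# prints: usage cpu=10018380758.728712,iow=0.000000
strip_binding_inuse <<'EOF'
usage cpu=10018380758.728712,binding_inuse!SccccccccScccccccc=0.000000,iow=0.000000
EOF
```

The second and third sed expressions just tidy up a dangling comma if the removed token happened to be the last entry in a list.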
Fortunately we're mostly per-user functional share and use the share tree
only as a tie-breaker. But deleting jobs would be bad. Is the "probably
lose any jobs queued" bit something you know from experience? It seems odd
that we can have jobs queued and running with the running qmaster knowing
nothing of the user, but that deleting the file would kill them on restart.

> Mark
>
> On Mon, 16 Apr 2018, William Hay wrote:
>
> > We had a user report that one of their array jobs wasn't scheduling. A
> > bit of poking around showed that qconf -suser knew nothing of the user,
> > despite them having a queued job. However, there was a file in the spool
> > that should have defined the user. Several other users appear to be
> > affected as well.
> >
> > I bounced the qmaster in the hope of getting it to reread the users'
> > details from disk, and got several messages like this:
> >
> > 04/16/2018 11:06:53| main|util01|E|error reading file:
> > "/var/opt/sge/shared/qmaster/users/zccag81"
> > 04/16/2018 11:06:53| main|util01|E|unrecognized characters after the
> > attribute values in line 12: "mem"
> > 04/16/2018 11:06:53| main|util01|E|line 12 should begin with an attribute
> > name
> >
> > I suspect that my next step should be to stop the qmaster, delete the
> > problem files and then restart the qmaster. Hopefully grid engine will
> > then recreate the user, or I can create them manually.
> >
> > However if anyone has a better idea or has seen this before I'd be glad
> > to hear of it.
> >
> > Creation of the user object on our cluster is done by means of
> > enforce_user auto:
> >
> > # qconf -sconf | grep auto
> > enforce_user                 auto
> > auto_user_oticket            0
> > auto_user_fshare             1
> > auto_user_default_project    none
> > auto_user_delete_time        0
> >
> > William
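If a user file does have to be deleted, the user object can also be recreated by hand with qconf -Auser rather than waiting for enforce_user auto to do it. A sketch, with attribute values mirroring the auto_user_* settings quoted above and "ucbptba" as the example user from this thread:

```shell
# Write a minimal user-object template matching the auto_user_* defaults.
cat > /tmp/ucbptba.user <<'EOF'
name ucbptba
oticket 0
fshare 1
delete_time 0
default_project NONE
EOF

# Then load it into the qmaster (not run here):
#   qconf -Auser /tmp/ucbptba.user
# and verify with:
#   qconf -suser ucbptba
```

Note this recreates only the user object, not its accumulated usage, so any share-tree contribution is still lost.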
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users