Hi all!

We recently updated:

- 16.05.6 to 17.02.0-0rc1 (10 days ago)
- 17.02.0-0rc1 to 17.02.1 (8 days ago)
- 17.02.0 to 17.02.1-2 (today)

We have a serious problem with user jobs: some of them are stuck in a zombie state. Whether a job completes successfully, fails, or is cancelled, it appears in the database with no EndTime.

For example, we need to delete a user:

    sacctmgr delete user username
     Error with request: Job(s) active, cancel job(s) before remove
      JobID = 6680924 C = leftraru A = nlhpc U = username P = slims

If we try to show the job:

    scontrol show job 6680924
    slurm_load_jobs error: Invalid job id specified

sacct shows us this:

    sacct --jobs=6680924
           JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------
    6680924        hostname      slims      nlhpc          2    RUNNING      0:0

Other jobs stay in the PENDING state:

    sacct -j 6651709
           JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------
    6651709      Ar500-120+      slims   fis_unab          1    PENDING      0:0

But this job has already finished:

    scontrol show job 6651709
    JobId=6651709 JobName=Ar500-12000
    ...
    JobState=COMPLETED Reason=None Dependency=(null)
    ....

In slurmdbd.log we receive the following messages over and over again:

    ....
    [2017-03-03T18:43:07.077] error: CONN:7 Failed to unpack DBD_JOB_COMPLETE message
    [2017-03-03T18:43:18.001] error: unpackmem_xmalloc: Buffer to be unpacked is too large (4294967295 > 1073741824)
    [2017-03-03T18:43:18.001] error: CONN:7 Failed to unpack DBD_JOB_COMPLETE message
    [2017-03-03T18:43:37.000] error: unpackmem_xmalloc: Buffer to be unpacked is too large (4294967295 > 1073741824)
    [2017-03-03T18:43:37.000] error: CONN:7 Failed to unpack DBD_JOB_COMPLETE message
    ....
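As a side check on the slurmdbd.log messages above, here is a minimal Python sketch that pulls the two numbers out of the unpackmem_xmalloc lines. The helper name parse_unpack_errors and the regular expression are our own illustration, not part of Slurm. It only confirms the arithmetic: the reported size, 4294967295, is 2^32 - 1 (the largest unsigned 32-bit value) and the limit, 1073741824, is 2^30. That might point to a bogus length field rather than a real 4 GiB buffer, but that is only a guess on our side.

```python
import re

# Matches the unpackmem_xmalloc error in the format seen in our slurmdbd.log.
# The pattern is an assumption based on those lines only.
PATTERN = re.compile(
    r"^\[(?P<ts>[^\]]+)\] error: unpackmem_xmalloc: "
    r"Buffer to be unpacked is too large \((?P<size>\d+) > (?P<limit>\d+)\)"
)


def parse_unpack_errors(lines):
    """Return (timestamp, size, limit) tuples for each matching log line."""
    found = []
    for line in lines:
        m = PATTERN.match(line.strip())
        if m:
            found.append((m.group("ts"), int(m.group("size")), int(m.group("limit"))))
    return found


# Lines copied from the log excerpt above.
sample = [
    "[2017-03-03T18:43:07.077] error: CONN:7 Failed to unpack DBD_JOB_COMPLETE message",
    "[2017-03-03T18:43:18.001] error: unpackmem_xmalloc: Buffer to be unpacked is too large (4294967295 > 1073741824)",
    "[2017-03-03T18:43:18.001] error: CONN:7 Failed to unpack DBD_JOB_COMPLETE message",
]

for ts, size, limit in parse_unpack_errors(sample):
    # Show the size in hex and compare both numbers against powers of two.
    print(ts, hex(size), size == 2**32 - 1, limit == 2**30)
```

With the three sample lines, only the unpackmem_xmalloc line matches, and both comparisons print True.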
We read about a similar problem in https://bugs.schedmd.com/show_bug.cgi?id=3388, but "sacctmgr show runawayjobs" ends with:

    sacctmgr: error: slurmdbd: Sending message type 1488: 11: No error
    sacctmgr: error: Failed to fix runaway job: Resource temporarily unavailable

Any ideas?

Thanks in advance,

Support Team NLHPC <[email protected]>
National Lab for High Performance Computing (NLHPC) <http://www.nlhpc.cl>
Center for Mathematical Modeling (CMM)
School of Engineering and Sciences. University of Chile
Beauchef, 851, 7th Floor. Santiago, Chile
Office: +56-2-9784603
