Hi all!

We recently updated from:

- 16.05.6 to 17.02.0-0rc1 (10 days ago)
- 17.02.0-0rc1 to 17.02.1 (8 days ago)
- 17.02.0 to 17.02.1-2 (Today)

We have a serious problem with user jobs: some of them stay in a zombie
state. Whether a job completes successfully, fails, or is cancelled, it
appears in the database with no EndTime. For example, we need to delete a
user:



sacctmgr delete user username
Error with request: Job(s) active, cancel job(s) before remove
JobID = 6680924    C = leftraru   A = nlhpc      U = username P = slims

If we try to show the job:


scontrol show job 6680924
slurm_load_jobs error: Invalid job id specified

sacct shows this:


sacct --jobs=6680924
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
6680924        hostname      slims      nlhpc          2    RUNNING      0:0


Other jobs stay in the PENDING state:

sacct -j 6651709
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
6651709      Ar500-120+      slims   fis_unab          1    PENDING      0:0

But it has already finished:

scontrol show job 6651709

JobId=6651709 JobName=Ar500-12000

   ...

   JobState=COMPLETED Reason=None Dependency=(null)
   ....
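
To look for other jobs in the same situation, the comparison above (sacct state versus scontrol state) could be scripted. The following is only a sketch: it assumes the accounting data is fed in as `sacct -n -P -o JobID,State` output (pipe-separated, no header) and the controller data as plain `scontrol show job <id>` output. The function names are made up for illustration.

```python
import re

def sacct_states(parsable_text):
    """Map JobID -> State from pipe-separated `JobID|State` lines (assumed input)."""
    states = {}
    for line in parsable_text.splitlines():
        line = line.strip()
        if not line:
            continue
        job_id, _, state = line.partition("|")
        # Skip job steps such as "6680924.batch"; keep only the allocation row.
        if "." in job_id:
            continue
        # Keep the first word only, so "CANCELLED by 1000" becomes "CANCELLED".
        states[job_id] = (state.split() or [""])[0]
    return states

def scontrol_state(show_job_text):
    """Return the JobState value from `scontrol show job` text, or None if absent."""
    match = re.search(r"\bJobState=(\S+)", show_job_text)
    return match.group(1) if match else None

def is_stale(db_state, ctl_state):
    """True when the database shows the job as live but the controller does not.

    ctl_state is None when scontrol no longer knows the job
    (for example "Invalid job id specified").
    """
    live = {"PENDING", "RUNNING"}
    return db_state in live and ctl_state not in live

if __name__ == "__main__":
    db = sacct_states("6651709|PENDING\n6680924|RUNNING\n")
    ctl = {
        "6651709": scontrol_state("JobId=6651709 JobName=Ar500-12000\n"
                                  "   JobState=COMPLETED Reason=None Dependency=(null)\n"),
        "6680924": None,  # scontrol reported an invalid job id
    }
    for job_id, state in db.items():
        if is_stale(state, ctl[job_id]):
            print(job_id, "database:", state, "controller:", ctl[job_id])
```

With the two jobs shown in this message, both would be reported as stale under these assumptions.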


In slurmdbd.log we see the following messages over and over again:

....
[2017-03-03T18:43:07.077] error: CONN:7 Failed to unpack DBD_JOB_COMPLETE message
[2017-03-03T18:43:18.001] error: unpackmem_xmalloc: Buffer to be unpacked is too large (4294967295 > 1073741824)
[2017-03-03T18:43:18.001] error: CONN:7 Failed to unpack DBD_JOB_COMPLETE message
[2017-03-03T18:43:37.000] error: unpackmem_xmalloc: Buffer to be unpacked is too large (4294967295 > 1073741824)
[2017-03-03T18:43:37.000] error: CONN:7 Failed to unpack DBD_JOB_COMPLETE message
....
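
The two numbers in the unpackmem_xmalloc error are worth a quick look. The small check below only does arithmetic on the logged values; the helper name `parse_sizes` is made up for illustration, and reading the first value as something other than a real buffer length is our guess, not a diagnosis.

```python
import re

# One of the log lines quoted above, joined onto a single line.
LINE = ("[2017-03-03T18:43:18.001] error: unpackmem_xmalloc: Buffer to be unpacked "
        "is too large (4294967295 > 1073741824)")

def parse_sizes(line):
    """Return (requested, limit) from a '(<requested> > <limit>)' log fragment."""
    match = re.search(r"\((\d+) > (\d+)\)", line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

requested, limit = parse_sizes(LINE)
# 4294967295 is the largest unsigned 32-bit value (all bits set).
print(hex(requested), requested == 2**32 - 1)
# 1073741824 is exactly 2**30 bytes, i.e. 1 GiB.
print(limit == 2**30)
```

A length with every bit set looks more like a placeholder or a misread field than a real 4 GiB buffer, but we cannot confirm that from the log alone.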


We read about a similar problem in
https://bugs.schedmd.com/show_bug.cgi?id=3388, but sacctmgr show runawayjobs
ends with:

sacctmgr: error: slurmdbd: Sending message type 1488: 11: No error
sacctmgr: error: Failed to fix runaway job: Resource temporarily unavailable


Any ideas?

Thanks in advance,


Support Team NLHPC <[email protected]>
National Lab for High Performance Computing (NLHPC) <http://www.nlhpc.cl>
Center for Mathematical Modeling (CMM)
School of Engineering and Sciences. University of Chile
Beauchef, 851, 7th Floor. Santiago, Chile
Office: +56-2-9784603
