Hi,

A couple of weeks ago we had some "issues" with Slurm limits, and as a result we 
have one job in state JobHeldAdmin which we would like to get rid of. Its JobId is 
larger than the maximum allowed, and I have not managed to release or cancel it. Any 
hints?


squeue | grep JobHeldAdmin
4294967294    serial   (null)     root PD       0:00      0 (JobHeldAdmin)


[2014-05-16T15:40:54.496] cons_res: select_p_reconfigure
[2014-05-16T15:40:54.496] cons_res: select_p_node_init
[2014-05-16T15:40:54.496] cons_res: preparing for 7 partitions
[2014-05-16T15:40:54.988] Job 1666752 completion process took 257 seconds
[2014-05-16T15:40:54.990] completing job 1666002
[2014-05-16T15:40:55.054] _slurm_rpc_reconfigure_controller: completed usec=11656130
[2014-05-16T15:40:55.058] sched: job_complete for JobId=1666002 successful, exit code=0
[2014-05-16T15:40:55.058] Job 1666743 completion process took 287 seconds
[2014-05-16T15:40:55.060] Job 1667238 completion process took 287 seconds
[2014-05-16T15:40:55.060] Job 1666698 completion process took 287 seconds
[2014-05-16T15:40:55.061] Job 1667375 completion process took 287 seconds
[2014-05-16T15:40:55.061] Job 1667345 completion process took 135 seconds
[2014-05-16T15:40:55.062] error: We have exhausted our supply of valid job id values. FirstJobId=1 MaxJobId=4294901760
[2014-05-16T15:40:55.062] Job 1666661 completion process took 287 seconds
[2014-05-16T15:40:55.062] _slurm_rpc_submit_batch_job: Resource temporarily unavailable
[2014-05-16T15:40:55.062] Job 1667579 completion process took 287 seconds
[2014-05-16T15:40:55.063] Job 1666778 completion process took 258 seconds
[2014-05-16T15:40:55.063] Job 1666659 completion process took 258 seconds
[2014-05-16T15:40:55.064] _slurm_rpc_job_step_create for job 1661977: Required node not available (down or drained)
[2014-05-16T15:40:55.065] error: We have exhausted our supply of valid job id values. FirstJobId=1 MaxJobId=4294901760
...
[2014-05-16T15:47:54.546] Recovered job 1699395 2766
[2014-05-16T15:47:54.627] Recovered job 1699396 2766
[2014-05-16T15:47:54.709] Recovered job 4294967294 2
[2014-05-16T15:47:54.789] Recovered job 4294967294 2
[2014-05-16T15:47:54.870] Recovered job 4294967294 2
[2014-05-16T15:47:54.951] Recovered job 4294967294 2
[2014-05-16T15:47:55.032] Recovered job 4294967294 2
[2014-05-16T15:47:55.113] Recovered job 1699397 2766
[2014-05-16T15:47:55.194] Recovered job 1699398 2766
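
A quick sketch of the arithmetic behind the two numbers above: the stuck JobId shown by squeue and the MaxJobId from the slurmctld log. The hex conversion is plain arithmetic. The comparison with NO_VAL is an assumption (that Slurm's sentinel is 0xfffffffe, as in slurm.h) and should be checked against the headers of the installed version.

```python
# Values copied from the squeue output and the slurmctld log in this message.
STUCK_JOB_ID = 4294967294
MAX_JOB_ID = 4294901760

# Assumption: Slurm's NO_VAL sentinel is 0xfffffffe (as defined in slurm.h);
# verify this against the installed version before relying on it.
ASSUMED_NO_VAL = 0xFFFFFFFE


def describe(job_id: int) -> str:
    """Return the decimal and 32-bit hexadecimal form of a job id."""
    return f"{job_id} = 0x{job_id:08X}"


print(describe(STUCK_JOB_ID))
print(describe(MAX_JOB_ID))
print("exceeds MaxJobId by", STUCK_JOB_ID - MAX_JOB_ID)
print("equals assumed NO_VAL:", STUCK_JOB_ID == ASSUMED_NO_VAL)
```

If the assumption holds, the recovered record may carry an unset job id rather than a real one, which could be related to why the job cannot be released or cancelled by id.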


./scontrol release 4294967294
==9891== Memcheck, a memory error detector
==9891== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
==9891== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==9891== Command: ./scontrol release 4294967294
==9891==
==9891== Invalid read of size 8
==9891==    at 0x42CED8: scontrol_hold (update_job.c:305)
==9891==    by 0x429E68: _process_command (scontrol.c:928)
==9891==    by 0x428355: main (scontrol.c:218)
==9891==  Address 0x10 is not stack'd, malloc'd or (recently) free'd

Regards,
Tommi
