Hi,

A couple of weeks ago we had some "issues" with Slurm limits, and as a result we have one job in state JobHeldAdmin which we would like to get rid of. Its JobId is larger than the maximum allowed, and I have not managed to release or cancel it. Any hints?
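To make the size comparison concrete, here is a minimal Python sketch of the arithmetic, using only the two numbers that appear in the squeue output and the slurmctld log in this post. The helper name `describe_job_id` is hypothetical and not part of Slurm. That the stuck ID is a sentinel value rather than a real job ID is only a guess based on its bit pattern; the logs do not confirm it.

```python
# Values copied from the squeue output and slurmctld log in this post.
MAX_JOB_ID = 4294901760     # "MaxJobId=4294901760" from the error message
STUCK_JOB_ID = 4294967294   # job shown as PD (JobHeldAdmin) in squeue
UINT32_MAX = 0xFFFFFFFF     # largest value of an unsigned 32-bit integer


def describe_job_id(job_id, max_job_id=MAX_JOB_ID):
    """Hypothetical helper: summarize how job_id relates to the limit."""
    return {
        "hex": hex(job_id),
        "exceeds_max": job_id > max_job_id,
        "over_by": job_id - max_job_id,
        "below_uint32_max_by": UINT32_MAX - job_id,
    }


if __name__ == "__main__":
    print(hex(MAX_JOB_ID))                # 0xffff0000
    print(describe_job_id(STUCK_JOB_ID))  # hex is 0xfffffffe
```

The stuck ID is one less than the largest unsigned 32-bit value and 65534 above the reported MaxJobId, which would be consistent with it never having been allocated through the normal job ID counter.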
squeue | grep JobHeldAdmin
4294967294 serial (null) root PD 0:00 0 (JobHeldAdmin)

[2014-05-16T15:40:54.496] cons_res: select_p_reconfigure
[2014-05-16T15:40:54.496] cons_res: select_p_node_init
[2014-05-16T15:40:54.496] cons_res: preparing for 7 partitions
[2014-05-16T15:40:54.988] Job 1666752 completion process took 257 seconds
[2014-05-16T15:40:54.990] completing job 1666002
[2014-05-16T15:40:55.054] _slurm_rpc_reconfigure_controller: completed usec=11656130
[2014-05-16T15:40:55.058] sched: job_complete for JobId=1666002 successful, exit code=0
[2014-05-16T15:40:55.058] Job 1666743 completion process took 287 seconds
[2014-05-16T15:40:55.060] Job 1667238 completion process took 287 seconds
[2014-05-16T15:40:55.060] Job 1666698 completion process took 287 seconds
[2014-05-16T15:40:55.061] Job 1667375 completion process took 287 seconds
[2014-05-16T15:40:55.061] Job 1667345 completion process took 135 seconds
[2014-05-16T15:40:55.062] error: We have exhausted our supply of valid job id values. FirstJobId=1 MaxJobId=4294901760
[2014-05-16T15:40:55.062] Job 1666661 completion process took 287 seconds
[2014-05-16T15:40:55.062] _slurm_rpc_submit_batch_job: Resource temporarily unavailable
[2014-05-16T15:40:55.062] Job 1667579 completion process took 287 seconds
[2014-05-16T15:40:55.063] Job 1666778 completion process took 258 seconds
[2014-05-16T15:40:55.063] Job 1666659 completion process took 258 seconds
[2014-05-16T15:40:55.064] _slurm_rpc_job_step_create for job 1661977: Required node not available (down or drained)
[2014-05-16T15:40:55.065] error: We have exhausted our supply of valid job id values. FirstJobId=1 MaxJobId=4294901760
...
[2014-05-16T15:47:54.546] Recovered job 1699395 2766
[2014-05-16T15:47:54.627] Recovered job 1699396 2766
[2014-05-16T15:47:54.709] Recovered job 4294967294 2
[2014-05-16T15:47:54.789] Recovered job 4294967294 2
[2014-05-16T15:47:54.870] Recovered job 4294967294 2
[2014-05-16T15:47:54.951] Recovered job 4294967294 2
[2014-05-16T15:47:55.032] Recovered job 4294967294 2
[2014-05-16T15:47:55.113] Recovered job 1699397 2766
[2014-05-16T15:47:55.194] Recovered job 1699398 2766

./scontrol release 4294967294
==9891== Memcheck, a memory error detector
==9891== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
==9891== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==9891== Command: ./scontrol release 4294967294
==9891==
==9891== Invalid read of size 8
==9891==    at 0x42CED8: scontrol_hold (update_job.c:305)
==9891==    by 0x429E68: _process_command (scontrol.c:928)
==9891==    by 0x428355: main (scontrol.c:218)
==9891==  Address 0x10 is not stack'd, malloc'd or (recently) free'd

Regards,
Tommi
