Greetings -- I'm seeking advice on how to learn more about a job that was submitted nearly two months ago and appears today as "Job Requeued to Held State", having survived multiple slurm restarts and an explicit maintenance downtime.
This job became apparent based on the Priority=0 and the reason code in squeue output. I used a variety of squeue output flags to learn more about the current status of the job, including that it was originally submitted on 10 Feb 2016 (today is 29 Mar 2016). I have also used sinfo -j <jobid> -l, which again shows current state information.

The information gleaned from the above led me back to the user script, and I think I've found a fundamental problem with the path to the command meant to be executed. Could this actually have caused the job to rattle around the queue for weeks? It seems as though a simple "command not found" would have triggered an error code and job termination.

My main question is how to determine the history of this jobid. It seems to be a potentially interesting history, and I'd like to know about the previous encounters the job had with the queue. My interest is in learning:

-- how the job persisted so long in the queue
-- how it survived restarting slurm on March 2nd
-- why the job priority is set to 0
-- why sinfo and squeue indicate the scheduler can't determine what resources to allocate

My instinct has been to grep and awk the log files, but that has failed to reveal any information about this jobid. I'm only aware of the slurmcontrol.log files, and we seem to be keeping only the current week, after which presumably all job information resides in the database.

Many thanks -- I look forward to learning both from the active list and from investigating the archives. I'm an admirer of the SLURM documentation (slurm.schedmd.com), although an admitted newbie with cluster schedulers (there are wiser heads in house, so our actual cluster admin is in good hands).

Cheers,
~ Emily

----------------------------------
E.M. Dragowsky, Ph.D.
ITS -- Research Computing
Case Western Reserve University
(216) 368-0082
