Greetings --

I'm seeking advice on how to learn more about a job that was submitted
nearly two months ago and appears today as "Job Requeued to Held State". It
has survived multiple slurm restarts and an explicit maintenance period
downtime.

I noticed this job because of its Priority=0 and its reason code in squeue
output. I used a variety of squeue output flags to learn more about the
current status of the job, including that it was originally submitted on
10 Feb 2016 (today is 29 Mar 2016).

I have also used sinfo -j <jobid> -l, which again shows current state
information.

The information gleaned from the above led me back to the user script, and
I think I've found a fundamental problem with the path to the command meant
to be executed. Could this actually have caused the job to rattle around
the queue for weeks? I would have expected a simple "command not found" to
trigger an error code and job termination.

My main question is how to determine the history of this jobid. It seems to
be a potentially interesting one, and I'd like to know about the previous
encounters the job had with the queue. In particular, I'd like to learn:
-- how the job persisted so long in the queue
-- how it survived restarting slurm on March 2nd
-- why the job priority is set to 0
-- why sinfo and squeue indicate the scheduler can't determine what
resources to allocate

My instinct has been to grep and awk the log files, but that has failed to
reveal any information about this jobid. I'm only aware of the
slurmcontrol.log files -- and we seem to be keeping only the current week,
after which presumably all job information resides in the database.
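
To make the log search concrete, here is a minimal sketch of the kind of
filter I have in mind. It assumes (and this is only an assumption) that the
controller log mentions jobs as "JobId=<id>", "job_id=<id>", or "job <id>";
the function name find_job_lines and the sample lines are hypothetical, not
taken from our logs.

```python
import re


def find_job_lines(lines, jobid):
    """Return the log lines that appear to mention the given job ID."""
    # Assumed spellings: "JobId=123", "job_id=123", "job 123" (any case).
    # The trailing (?!\d) keeps job 123 from also matching job 1234.
    pattern = re.compile(
        r"\bjob(?:[_ ]?id)?[= ]%s(?!\d)" % re.escape(str(jobid)),
        re.IGNORECASE,
    )
    return [line for line in lines if pattern.search(line)]


# Hypothetical sample lines, for illustration only.
sample = [
    "[2016-03-02T10:00:00.000] requeue job 4242 due to failure of node n01",
    "[2016-03-02T10:00:01.000] sched: Allocate JobId=42420 NodeList=n02",
    "[2016-03-02T10:00:02.000] job_complete: JobID=4242 State=0x1",
]
matches = find_job_lines(sample, 4242)
for line in matches:
    print(line)
```

The same function accepts an open file object in place of the sample list,
so it could be pointed at whichever log files are still on disk.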

Many thanks -- I look forward to learning both from the active list, and
from investigating the archives. I'm an admirer of the SLURM documentation
(slurm.schedmd.com), although an admitted newbie with cluster schedulers
(there are wiser heads in house, so our actual cluster admin is in good
hands).

Cheers,
~ Emily
----------------------------------
E.M. Dragowsky, Ph.D.
ITS -- Research Computing
Case Western Reserve University
(216) 368-0082
