Hi Emily,

On Thu, Mar 31, 2016 at 10:46 PM, E.M. Dragowsky <[email protected]> wrote:
> The information gleaned from the above led me back to the user script, and I
> think I've found a fundamental problem with the path to the command meant to
> be executed. Could this actually have caused the job to rattle around the
> queue for weeks? Seems as though a simple "command not found" would have
> triggered an error code and job termination.


I'm not sure whether this is relevant, and I certainly don't want to
mislead you, but several weeks ago I had a problem with one of my
jobs on a SLURM cluster where I am not the system administrator.

Anyway, my job ended up being re-queued.  I didn't have a problem with
the pathnames to the command, but I didn't quite understand why a
"perfectly" running job would get re-queued.

According to the system administrators, the cause was that my job was
using more memory than was available on the node.  This caused
the node to thrash and become less responsive.  The head node failed
to communicate with the execution node and thought the execution node
was down; as a result, it re-queued my job.  As the queue was empty at
the time, the re-queued job immediately re-ran with a new job ID.

I don't know how the administrators determined this, so I'm afraid I
can't help you work out whether the same thing happened to you.  But
until this case, I had thought such a job would simply terminate with
an error message.  Now I submit jobs with an explicit memory request.
Since then, the problem hasn't recurred (if memory usage exceeds what
I specify, the job does indeed terminate).
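
In case it helps, a batch script with an explicit memory request looks
roughly like this.  It is only a sketch: the job name, the 4G figure
and the time limit are placeholders rather than my actual settings,
and whether a job is killed for exceeding its request depends on how
the cluster is configured.

```shell
#!/bin/bash
# Placeholder job name.
#SBATCH --job-name=mem_example
# Memory requested per node (4 GB here, a placeholder value).
#SBATCH --mem=4G
# Placeholder wall-clock limit (1 hour).
#SBATCH --time=01:00:00

# Inside a job, SLURM exports the --mem request as SLURM_MEM_PER_NODE;
# outside SLURM this falls back to "unset".
mem_report="memory request: ${SLURM_MEM_PER_NODE:-unset}"
echo "$mem_report"

# The real workload would go here (site-specific, so not shown).
```

The script is submitted with sbatch as usual; the same request can
also be given on the command line as "sbatch --mem=4G".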

Ray
