Hi Emily,
On Thu, Mar 31, 2016 at 10:46 PM, E.M. Dragowsky <[email protected]> wrote:
> The information gleaned from the above led me back to the user script, and I
> think I've found a fundamental problem with the path to the command meant to
> be executed. Could this actually have caused the job to rattle around the
> queue for weeks? Seems as though a simple "command not found" would have
> triggered an error code and job termination.

I'm not sure if this is relevant and I certainly don't intend to mislead you. But several weeks ago, I had a problem with one of my jobs on a SLURM instance where I am not the system administrator.

Anyway, my job ended up being re-queued. I didn't have a problem with the pathnames to the command, but I didn't quite understand why a "perfectly" running job would get re-queued.

According to the system administrators, the cause was that my job was using more memory than what was available on the node. This caused the node to thrash and become less responsive. The head node failed to communicate with the execution node and thought the execution node was down; as a result, it re-queued my job. As the queue was empty at the time, the re-queued job immediately re-ran with a new job ID.

I don't know how the administrators determined this. So, I'm afraid I can't help you in determining if this happened to you. But until this case, I had thought such a job would terminate with an error message.

Now, I queue jobs by explicitly specifying the amount of memory to use. So far, this problem hasn't happened to me since (if memory usage exceeds what I specify, the job does indeed terminate).

Ray
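
P.S. In case it helps, here is a minimal sketch of a batch script that requests memory explicitly. The job name, the 4G figure, the time limit, and the command path are placeholders made up for illustration, not values from my actual jobs. --mem and --no-requeue are standard sbatch options; whether a job is actually killed when it exceeds --mem depends on how the site has configured SLURM, so please treat this as a starting point rather than a recipe.

```shell
#!/bin/bash
#SBATCH --job-name=example_job   # hypothetical job name
#SBATCH --mem=4G                 # memory requested per node (placeholder value)
#SBATCH --time=01:00:00          # wall-clock limit (placeholder value)
#SBATCH --no-requeue             # ask SLURM not to requeue this job

# Hypothetical path to the program the job should run.
CMD="$HOME/bin/my_command"

# Fail early with a clear message if the command is missing or not
# executable, instead of relying on the shell's "command not found".
if [ ! -x "$CMD" ]; then
    echo "error: $CMD not found or not executable" >&2
    exit 127
fi

"$CMD"
```

I would submit something like this with "sbatch job.sh" (the file name is also just an example). The path check is there only because you mentioned a problem with the path to the command; it makes that particular failure show up in the job's error output right away.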
