We've had some trouble with curious job failures - the jobs aren't even assigned nodes:
       JobID        NodeList      State ExitCode
------------ --------------- ---------- --------
     7229124   None assigned     FAILED      0:1

We finally got some better log data (I'd had the log level turned way too low), which suggests that restarting and/or reconfiguring the controller is at the root of it. After some preliminaries (purging job records, recovering active jobs) there will be messages of this sort:

[2014-06-09T23:10:15.920] No nodes satisfy job 7228909 requirements in partition full
[2014-06-09T23:10:15.920] sched: schedule: JobId=7228909 non-runnable: Requested node configuration is not available

The indicated job specified --mem and --tmp, but the values are within the capacities of every node in that "full" partition. Normally, if a user requests resources exceeding what the nodes in a partition offer, the submission is rejected outright. This failure appears to occur only for jobs with memory and/or disk constraints. Worse yet, it isn't consistent - it only seems to happen sometimes. I also cannot reproduce it in our test environment.

A typical node configuration line looks like this:

NodeName=gizmod[51-60] Sockets=2 CoresPerSocket=6 RealMemory=48000 Weight=10 Feature=full,restart,rx200,ssd

though I've got FastSchedule=0. Honestly, it *feels* like there's a window where the node data isn't fully loaded from the slurmds yet, and thus the scheduler doesn't see any nodes that satisfy the requirements (a quick check along those lines is sketched below).
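For what it's worth, here is the sort of comparison I have in mind - only a sketch, and gizmod51 is just one example node from the range above. My understanding is that with FastSchedule=0 the controller bases scheduling on whatever the slurmds report at registration rather than on the slurm.conf values, so if RealMemory/TmpDisk look low or unset in the controller's view right after a restart or reconfigure, a --mem/--tmp job would be judged non-runnable until the nodes re-register:

# The controller's current view of one node in the "full" partition
scontrol show node gizmod51 | grep -E 'RealMemory|TmpDisk|State'

# The same for the whole partition at a glance
# (%N = node, %m = memory in MB, %d = TmpDisk in MB, %T = state)
sinfo -N -p full -o "%N %m %d %T"

# What the node itself would report to the controller at registration
ssh gizmod51 slurmd -C

If the two views disagree for a short while after a controller restart, that would line up with what we're seeing.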
Thanks all...

Michael