I initially thought that this didn't apply. I didn't see where those ports (12000-12015) were blocked on the hosts.

However, it turns out that my MPI configuration is incomplete in slurm.conf. The parameter "MpiParams" was commented out and was defaulting to "(null)" in the output of "scontrol show config". The tell was in the debug messages, where I was getting messages indicating something along the lines of "step needs 2 reserved ports, but only 0 exist". Without adding some ports parameter, like "ports=12000-12099", to MpiParams, slurmctld wasn't reserving *any* ports.

Kind of puzzling how MPI has worked till now. Anyway, I think we've found the root of our errors.

Thanks

Michael

On Sun, Feb 3, 2013 at 11:30 AM, Moe Jette <[email protected]> wrote:
>
> See "NOTE" in
> http://www.schedmd.com/slurmdocs/mpi_guide.html#open_mpi
>
> Quoting Michael Gutteridge <[email protected]>:
>
>>
>> Hi all
>>
>> We had an upgrade go sour on us. We went from 2.3.5 to 2.5.1 and
>> we're finding that srun isn't running properly. The messages we get
>> output from srun are:
>>
>> srun: Job is in held state, pending scheduler release
>> srun: job 1416755 queued and waiting for resources
>> srun: job 1416755 has been allocated resources
>> srun: Job step creation temporarily disabled, retrying
>> srun: error: Unable to create job step: Requires more ports than can
>> be reserved
>> srun: Force Terminated job 1416755
>>
>>
>> Logs don't seem to indicate a cause. On the slurmd side we see:
>>
>> [2013-02-02T22:04:16-08:00] sched: Allocate JobId=1416757
>> NodeList=gizmod13 #CPUs=6
>> [2013-02-02T22:04:19-08:00] _slurm_rpc_job_step_create for job
>> 1416757: Requested nodes are busy
>> [2013-02-02T22:06:43-08:00] _slurm_rpc_job_step_create for job
>> 1416757: Requested nodes are busy
>> [2013-02-02T22:07:12-08:00] _slurm_rpc_job_step_create for job
>> 1416757: Requested nodes are busy
>> [2013-02-02T22:07:41-08:00] _slurm_rpc_job_step_create for job
>> 1416757: Requested nodes are busy
>>
>> even though the node is not indicated as busy:
>>
>> NodeName=gizmod13 Arch=x86_64 CoresPerSocket=6
>> CPUAlloc=0 CPUErr=0 CPUTot=12 CPULoad=0.04 Features=rx200,campus
>> Gres=(null)
>> NodeAddr=gizmod13 NodeHostName=gizmod13
>> OS=Linux RealMemory=48168 Sockets=2 Boards=1
>> State=IDLE ThreadsPerCore=1 TmpDisk=938900 Weight=1
>> BootTime=2013-02-02T14:57:13 SlurmdStartTime=2013-02-02T22:39:06
>> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>>
>> eventually the job dies:
>>
>> [2013-02-02T22:08:01-08:00] Node gizmod13 now responding
>> [2013-02-02T22:08:10-08:00] step 1416757.0 needs 7 reserved ports, but
>> only 0 exist
>> [2013-02-02T22:08:10-08:00] _slurm_rpc_job_step_create for job
>> 1416757: Requires more ports than can be reserved
>> [2013-02-02T22:08:10-08:00] completing job 1416757
>> [2013-02-02T22:08:10-08:00] sched: job_complete for JobId=1416757 successful
>>
>> during this time on the controller:
>>
>> slurmd: debug2: got this type of message 1008
>> slurmd: debug2: got this type of message 6011
>> slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB
>> slurmd: debug: _rpc_terminate_job, uid = 6281
>> slurmd: debug: task_slurmd_release_resources: 1416757
>> slurmd: debug: credential for job 1416757 revoked
>> slurmd: debug2: No steps in jobid 1416757 to send signal 18
>> slurmd: debug2: No steps in jobid 1416757 to send signal 15
>> slurmd: debug2: set revoke expiration for jobid 1416757 to 1359872890 UTS
>> slurmd: debug: Waiting for job 1416757's prolog to complete
>> slurmd: debug: Finished wait for job 1416757's prolog to complete
>> slurmd: debug: Calling /usr/sbin/slurmstepd spank epilog
>> spank-epilog: Reading slurm.conf file: /etc/slurm-llnl/slurm.conf
>> spank-epilog: Running spank/epilog for jobid [1416757] uid [34152]
>> spank-epilog: spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
>> spank-epilog: spank: /usr/lib64/slurm-llnl/use-env.so: no callbacks in
>> this context
>> slurmd: debug: [job 1416757] attempting to run epilog
>> [/etc/slurm-llnl/slurmd.epilog]
>> slurmd: debug: completed epilog for jobid 1416757
>> slurmd: debug: Job 1416757: sent epilog complete msg: rc = 0
>>
>> There weren't any changes to the config, though concurrently we'd also
>> updated the OS (Ubuntu 12.04 LTS) with the latest patches and such, so
>> I'm not ruling out OS interactions. We backed out to 2.3.5 and things
>> are running OK again.
>>
>> Any insights?
>>
>> Thanks
>>
>> Michael
>

-- 
Hey! Somebody punched the foley guy!
   - Crow, MST3K ep. 508
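
P.S. For anyone who finds this thread with the same "Requires more ports than can be reserved" error: the change described above amounts to a slurm.conf line along the lines of MpiParams=ports=12000-12099. Below is a rough Python sketch of how one might check how many ports a given configuration makes available for reservation. The function names are made up for illustration, and the two sample lines are my assumption about how "scontrol show config" prints the parameter, so treat it as a starting point rather than a tested tool.

```python
import re


def mpi_port_range(scontrol_config_text):
    """Return (first, last) from the MpiParams ports=... entry, or None.

    Hypothetical helper: expects text shaped like the output of
    "scontrol show config", i.e. one "Key = value" pair per line.
    """
    for line in scontrol_config_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "MpiParams":
            match = re.search(r"ports=(\d+)-(\d+)", value)
            if match:
                return int(match.group(1)), int(match.group(2))
            return None  # MpiParams is present but has no ports range
    return None  # MpiParams is not listed at all


def reservable_ports(scontrol_config_text):
    """Count the ports in the configured range; 0 when none is set."""
    port_range = mpi_port_range(scontrol_config_text)
    if port_range is None:
        return 0
    first, last = port_range
    return last - first + 1


# Assumed sample lines, not captured from a real cluster.
before_fix = "MpiParams               = (null)"
after_fix = "MpiParams               = ports=12000-12099"

print(reservable_ports(before_fix))
print(reservable_ports(after_fix))

# On a live system the text could come from something like:
#   subprocess.run(["scontrol", "show", "config"],
#                  capture_output=True, text=True).stdout
```

A result of 0 here would match the "needs 7 reserved ports, but only 0 exist" message in the log above.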
