See "NOTE" in http://www.schedmd.com/slurmdocs/mpi_guide.html#open_mpi

Quoting Michael Gutteridge <[email protected]>:
>
> Hi all
>
> We had an upgrade go sour on us. We went from 2.3.5 to 2.5.1 and
> we're finding that srun isn't running properly. The messages we get
> output from srun are:
>
> srun: Job is in held state, pending scheduler release
> srun: job 1416755 queued and waiting for resources
> srun: job 1416755 has been allocated resources
> srun: Job step creation temporarily disabled, retrying
> srun: error: Unable to create job step: Requires more ports than can
> be reserved
> srun: Force Terminated job 1416755
>
>
> Logs don't seem to indicate a cause. On the slurmd side we see:
>
> [2013-02-02T22:04:16-08:00] sched: Allocate JobId=1416757
> NodeList=gizmod13 #CPUs=6
> [2013-02-02T22:04:19-08:00] _slurm_rpc_job_step_create for job
> 1416757: Requested nodes are busy
> [2013-02-02T22:06:43-08:00] _slurm_rpc_job_step_create for job
> 1416757: Requested nodes are busy
> [2013-02-02T22:07:12-08:00] _slurm_rpc_job_step_create for job
> 1416757: Requested nodes are busy
> [2013-02-02T22:07:41-08:00] _slurm_rpc_job_step_create for job
> 1416757: Requested nodes are busy
>
> even though the node is not indicated as busy:
>
> NodeName=gizmod13 Arch=x86_64 CoresPerSocket=6
> CPUAlloc=0 CPUErr=0 CPUTot=12 CPULoad=0.04 Features=rx200,campus
> Gres=(null)
> NodeAddr=gizmod13 NodeHostName=gizmod13
> OS=Linux RealMemory=48168 Sockets=2 Boards=1
> State=IDLE ThreadsPerCore=1 TmpDisk=938900 Weight=1
> BootTime=2013-02-02T14:57:13 SlurmdStartTime=2013-02-02T22:39:06
> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>
> eventually the job dies:
>
> [2013-02-02T22:08:01-08:00] Node gizmod13 now responding
> [2013-02-02T22:08:10-08:00] step 1416757.0 needs 7 reserved ports, but
> only 0 exist
> [2013-02-02T22:08:10-08:00] _slurm_rpc_job_step_create for job
> 1416757: Requires more ports than can be reserved
> [2013-02-02T22:08:10-08:00] completing job 1416757
> [2013-02-02T22:08:10-08:00] sched: job_complete for JobId=1416757 successful
>
> during this time on the controller:
>
> slurmd: debug2: got this type of message 1008
> slurmd: debug2: got this type of message 6011
> slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB
> slurmd: debug: _rpc_terminate_job, uid = 6281
> slurmd: debug: task_slurmd_release_resources: 1416757
> slurmd: debug: credential for job 1416757 revoked
> slurmd: debug2: No steps in jobid 1416757 to send signal 18
> slurmd: debug2: No steps in jobid 1416757 to send signal 15
> slurmd: debug2: set revoke expiration for jobid 1416757 to 1359872890 UTS
> slurmd: debug: Waiting for job 1416757's prolog to complete
> slurmd: debug: Finished wait for job 1416757's prolog to complete
> slurmd: debug: Calling /usr/sbin/slurmstepd spank epilog
> spank-epilog: Reading slurm.conf file: /etc/slurm-llnl/slurm.conf
> spank-epilog: Running spank/epilog for jobid [1416757] uid [34152]
> spank-epilog: spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
> spank-epilog: spank: /usr/lib64/slurm-llnl/use-env.so: no callbacks in
> this context
> slurmd: debug: [job 1416757] attempting to run epilog
> [/etc/slurm-llnl/slurmd.epilog]
> slurmd: debug: completed epilog for jobid 1416757
> slurmd: debug: Job 1416757: sent epilog complete msg: rc = 0
>
> There weren't any changes to the config, though concurrently we'd also
> updated the OS (Ubuntu 12.04 LTS) with the latest patches and such, so
> I'm not ruling out OS interactions. We backed out to 2.3.5 and things
> are running OK again.
>
> Any insights?
>
> Thanks
>
> Michael
