See the "NOTE" in the Open MPI section of the Slurm MPI guide:
http://www.schedmd.com/slurmdocs/mpi_guide.html#open_mpi
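
For context, the slurmctld message quoted below ("needs 7 reserved
ports, but only 0 exist") is consistent with no port range being
reserved for Open MPI job steps. A minimal sketch of the relevant
slurm.conf setting follows. The range shown is only an illustrative
assumption, not a value taken from this site's configuration; choose
one that suits your nodes and firewall rules.

```
# slurm.conf (sketch; the port range below is an assumed example)
# Reserve a pool of ports that srun-launched Open MPI steps can draw from.
MpiParams=ports=12000-12999
```

After a change like this, the Slurm daemons would need to re-read the
configuration before new job steps can reserve ports.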

Quoting Michael Gutteridge <[email protected]>:

>
> Hi all
>
> We had an upgrade go sour on us.  We went from 2.3.5 to 2.5.1 and
> we're finding that srun isn't running properly.  The messages we get
> output from srun are:
>
> srun: Job is in held state, pending scheduler release
> srun: job 1416755 queued and waiting for resources
> srun: job 1416755 has been allocated resources
> srun: Job step creation temporarily disabled, retrying
> srun: error: Unable to create job step: Requires more ports than can be reserved
> srun: Force Terminated job 1416755
>
>
> Logs don't seem to indicate a cause.  On the controller (slurmctld)
> side we see:
>
> [2013-02-02T22:04:16-08:00] sched: Allocate JobId=1416757
> NodeList=gizmod13 #CPUs=6
> [2013-02-02T22:04:19-08:00] _slurm_rpc_job_step_create for job
> 1416757: Requested nodes are busy
> [2013-02-02T22:06:43-08:00] _slurm_rpc_job_step_create for job
> 1416757: Requested nodes are busy
> [2013-02-02T22:07:12-08:00] _slurm_rpc_job_step_create for job
> 1416757: Requested nodes are busy
> [2013-02-02T22:07:41-08:00] _slurm_rpc_job_step_create for job
> 1416757: Requested nodes are busy
>
> even though the node is not indicated as busy:
>
> NodeName=gizmod13 Arch=x86_64 CoresPerSocket=6
>    CPUAlloc=0 CPUErr=0 CPUTot=12 CPULoad=0.04 Features=rx200,campus
>    Gres=(null)
>    NodeAddr=gizmod13 NodeHostName=gizmod13
>    OS=Linux RealMemory=48168 Sockets=2 Boards=1
>    State=IDLE ThreadsPerCore=1 TmpDisk=938900 Weight=1
>    BootTime=2013-02-02T14:57:13 SlurmdStartTime=2013-02-02T22:39:06
>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>
> eventually the job dies:
>
> [2013-02-02T22:08:01-08:00] Node gizmod13 now responding
> [2013-02-02T22:08:10-08:00] step 1416757.0 needs 7 reserved ports, but
> only 0 exist
> [2013-02-02T22:08:10-08:00] _slurm_rpc_job_step_create for job
> 1416757: Requires more ports than can be reserved
> [2013-02-02T22:08:10-08:00] completing job 1416757
> [2013-02-02T22:08:10-08:00] sched: job_complete for JobId=1416757 successful
>
> during this time on the compute node (slurmd):
>
> slurmd: debug2: got this type of message 1008
> slurmd: debug2: got this type of message 6011
> slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB
> slurmd: debug:  _rpc_terminate_job, uid = 6281
> slurmd: debug:  task_slurmd_release_resources: 1416757
> slurmd: debug:  credential for job 1416757 revoked
> slurmd: debug2: No steps in jobid 1416757 to send signal 18
> slurmd: debug2: No steps in jobid 1416757 to send signal 15
> slurmd: debug2: set revoke expiration for jobid 1416757 to 1359872890 UTS
> slurmd: debug:  Waiting for job 1416757's prolog to complete
> slurmd: debug:  Finished wait for job 1416757's prolog to complete
> slurmd: debug:  Calling /usr/sbin/slurmstepd spank epilog
> spank-epilog: Reading slurm.conf file: /etc/slurm-llnl/slurm.conf
> spank-epilog: Running spank/epilog for jobid [1416757] uid [34152]
> spank-epilog: spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
> spank-epilog: spank: /usr/lib64/slurm-llnl/use-env.so: no callbacks in
> this context
> slurmd: debug:  [job 1416757] attempting to run epilog
> [/etc/slurm-llnl/slurmd.epilog]
> slurmd: debug:  completed epilog for jobid 1416757
> slurmd: debug:  Job 1416757: sent epilog complete msg: rc = 0
>
> There weren't any changes to the config, though concurrently we'd also
> updated the OS (Ubuntu 12.04 LTS) with the latest patches and such, so
> I'm not ruling out OS interactions. We backed out to 2.3.5 and things
> are running OK again.
>
> Any insights?
>
> Thanks
>
> Michael
