Any news on this?

/Magnus

On 2013-02-06 02:24, Michael Gutteridge wrote:

We have a prolog that the slurm controller runs (pretty
straightforward, just sets up some temporary directories).  However,
since upgrading from 2.3.5 to 2.5.1 we've got a situation where having
any slurmctld prolog configured causes long delays (60-120s) between
when slurmctld allocates resources and when the job starts. It seems to
occur in both srun and sbatch submitted jobs, though with different
symptoms.

I've distilled this down to a very generic config, using the FIFO
scheduler to rule out scheduler effects.  I've also reduced the prolog
to a two-line
script:

#!/bin/bash
exit 0
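
To confirm that the prolog itself returns quickly, a diagnostic variant
along these lines could record when the controller starts and finishes
running it. This is only a sketch: the PROLOG_LOG variable, its default
path, and the log_event helper are hypothetical names introduced here,
and it assumes SLURM_JOB_ID is exported into the PrologSlurmctld
environment (it falls back to "unknown" otherwise).

```shell
#!/bin/bash
# Diagnostic variant of the two-line prolog: timestamp its start and end.
# PROLOG_LOG and its default path are assumptions made for this sketch,
# not Slurm settings.
PROLOG_LOG="${PROLOG_LOG:-/tmp/prolog_slurmctld_timing.log}"

log_event() {
    # $1 is a free-form label such as "start" or "end".
    # A logging failure must never fail the prolog, hence "|| true".
    printf '%s job=%s %s\n' "$(date '+%Y-%m-%dT%H:%M:%S%z')" \
        "${SLURM_JOB_ID:-unknown}" "$1" >> "$PROLOG_LOG" || true
}

log_event start
# The real prolog work (creating temporary directories) would go here.
log_event end
```

If the two timestamps for a job are close together while slurmctld.log
still shows the launch stalling, the delay lies after the prolog has
completed rather than inside it.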

The slurmctld.log has this:

[2013-02-05T15:26:27-08:00] debug2: Processing RPC:
REQUEST_SUBMIT_BATCH_JOB from uid=55555
[2013-02-05T15:26:27-08:00] debug3: JobDesc: user_id=55555 job_id=-1
partition=(null) name=sleeper.sh
[2013-02-05T15:26:27-08:00] debug3:    cpus=1-4294967294 pn_min_cpus=-1

.... snip....

[2013-02-05T15:26:27-08:00] debug2: found 5 usable nodes from config
containing puck[2-6]
[2013-02-05T15:26:27-08:00] debug3: _pick_best_nodes: job 29
idle_nodes 4 share_nodes 5
[2013-02-05T15:26:27-08:00] debug2: select_p_job_test for job 29
[2013-02-05T15:26:27-08:00] debug2: sched: JobId=29 allocated
resources: NodeList=(null)
[2013-02-05T15:26:27-08:00] _slurm_rpc_submit_batch_job JobId=29 usec=1359
[2013-02-05T15:26:27-08:00] debug:  sched: Running job scheduler
[2013-02-05T15:26:27-08:00] debug2: found 5 usable nodes from config
containing puck[2-6]
[2013-02-05T15:26:27-08:00] debug3: _pick_best_nodes: job 29
idle_nodes 4 share_nodes 5
[2013-02-05T15:26:27-08:00] debug2: select_p_job_test for job 29
[2013-02-05T15:26:27-08:00] debug3: cons_res: best_fit: node[0]:
required cpus: 1, min req boards: 1,
[2013-02-05T15:26:27-08:00] debug3: cons_res: best_fit: node[0]: min
req sockets: 1, min avail cores: 7
[2013-02-05T15:26:27-08:00] debug3: cons_res: best_fit: using node[0]:
board[0]: socket[1]: 3 cores available
[2013-02-05T15:26:27-08:00] debug3: cons_res: _add_job_to_res: job 29 act 0
[2013-02-05T15:26:27-08:00] debug3: cons_res: adding job 29 to part campus row 0
[2013-02-05T15:26:27-08:00] debug3: sched: JobId=29 initiated
[2013-02-05T15:26:27-08:00] sched: Allocate JobId=29 NodeList=puck2 #CPUs=1
[2013-02-05T15:26:27-08:00] debug3: Writing job id 29 to header record
of job_state file
[2013-02-05T15:26:27-08:00] debug2: prolog_slurmctld job 29 prolog completed

The job shows as running, but there are no processes running on the
allocated node (puck2 in this case).  In the allocated node's
slurmd.log there's nothing (despite running with 3 "v" flags).  A
little while later:

[2013-02-05T15:27:27-08:00] error: agent waited too long for nodes to
respond, sending batch request anyway...
[2013-02-05T15:27:27-08:00] Job 29 launch delayed by 60 secs, updating end_time
[2013-02-05T15:27:27-08:00] debug2: Spawning RPC agent for msg_type 4005
[2013-02-05T15:27:27-08:00] debug2: got 1 threads to send out
[2013-02-05T15:27:27-08:00] debug2: Tree head got back 0 looking for 1
[2013-02-05T15:27:27-08:00] debug3: Tree sending to puck2
[2013-02-05T15:27:27-08:00] debug2: Tree head got back 1
[2013-02-05T15:27:27-08:00] debug2: Tree head got them all
[2013-02-05T15:27:27-08:00] Node puck2 now responding
[2013-02-05T15:27:27-08:00] debug2: node_did_resp puck2

and on the allocated node, slurmd.log comes to life:

[2013-02-05T15:27:27-08:00] debug2: got this type of message 4005
[2013-02-05T15:27:27-08:00] debug2: Processing RPC: REQUEST_BATCH_JOB_LAUNCH
[2013-02-05T15:27:27-08:00] debug:  task_slurmd_batch_request: 29
[2013-02-05T15:27:27-08:00] debug:  Calling /usr/sbin/slurmstepd spank prolog
[2013-02-05T15:27:27-08:00] Reading slurm.conf file: /etc/slurm-llnl/slurm.conf
[2013-02-05T15:27:27-08:00] Running spank/prolog for jobid [29] uid [34152]
[2013-02-05T15:27:27-08:00] spank: opening plugin stack
/etc/slurm-llnl/plugstack.conf
[2013-02-05T15:27:27-08:00] spank: /usr/lib64/slurm-llnl/use-env.so:
no callbacks in this context
[2013-02-05T15:27:27-08:00] Launching batch job 29 for UID 34152
[2013-02-05T15:27:27-08:00] debug level is 6.

and the task starts running.  Removing "PrologSlurmctld" eliminates
this delay, and the job starts immediately.  The fact that the delay
is exactly 60 seconds is suspicious and makes me suspect a misconfiguration.
However, outside of the prolog configuration directive, the config is
straight out of the config generator.

Any pointers would be greatly appreciated; I'm out of ideas...

Thanks

Michael


--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
