-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi folks,

One of our Xeon Phi early adopters has found a behaviour I
cannot explain that prevents jobs using GRES from spanning
compute nodes.

A job submitted with:

sbatch --nodes=2 --gres=mic:2 phi.sh

will hang at srun/mpirun, and slurmctld will log:

[2013-08-23T16:06:33.223] _slurm_rpc_submit_batch_job JobId=95464 usec=1411
[2013-08-23T16:06:33.224] sched: Allocate JobId=95464 NodeList=barcoo[062-063] #CPUs=32
[2013-08-23T16:06:33.384] _pick_step_nodes: some requested nodes barcoo063 still have memory used by other steps
[2013-08-23T16:06:33.384] _slurm_rpc_job_step_create for job 95464: Requested nodes are busy
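
For anyone wanting to check their own slurmctld log for the same
symptom, here is a rough sketch that pulls out the IDs of jobs whose
step creation was rejected. The function name busy_step_jobs is made
up for illustration, and it assumes the message format is exactly as
in the log lines above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): list the IDs of jobs whose
# step creation was rejected by slurmctld.
# Assumes the "_slurm_rpc_job_step_create for job N: Requested ..."
# wording seen in the log excerpt above.
busy_step_jobs() {
    # Reads slurmctld log text on stdin and prints each matching
    # job ID once, one per line.
    sed -n 's/.*_slurm_rpc_job_step_create for job \([0-9][0-9]*\): Requested.*/\1/p' |
        sort -u
}
```

It would be run as "busy_step_jobs < slurmctld.log", where the log
file location depends on the site's SlurmctldLogFile setting.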

But if I just drop the GRES request, it works:

sbatch --nodes=2 phi.sh

[2013-08-23T16:14:03.699] _slurm_rpc_submit_batch_job JobId=95466 usec=1326
[2013-08-23T16:14:14.001] backfill: Started JobId=95466 on barcoo[062-063]
[2013-08-23T16:14:14.159] sched: _slurm_rpc_job_step_create: StepId=95466.0 barcoo063 usec=381
[2013-08-23T16:14:14.757] sched: _slurm_rpc_step_complete StepId=95466.0 usec=135
[2013-08-23T16:14:15.766] completing job 95466
[2013-08-23T16:14:15.768] sched: job_complete for JobId=95466 successful, exit code=256

Ignore the failed exit code; the code is just complaining that it
cannot access the Phis, as OFFLOAD_DEVICES gets set to -1 when
there is no GRES request. :-)
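
As an aside, a batch script could guard against that case before
launching anything that offloads. This is only a sketch: the function
name phi_available is invented here, and treating an unset or empty
OFFLOAD_DEVICES the same as -1 is an assumption, not something Slurm
is known to promise.

```shell
#!/bin/sh
# Hypothetical guard (not from phi.sh): succeed only if the job
# appears to have been given at least one Phi.
phi_available() {
    case "${OFFLOAD_DEVICES:-}" in
        # -1 is what we see with no GRES request; unset or empty is
        # treated the same way here as an assumption.
        ""|-1) return 1 ;;
        *)     return 0 ;;
    esac
}
```

A script could then test "if phi_available; then ... fi" and print a
clearer message instead of failing inside the offload code.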

Any ideas?

We're on Slurm 2.6.0 but I don't see any relevant changes in the
NEWS file for 2.6.1.

All the best,
Chris
- -- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iEYEARECAAYFAlIXAB4ACgkQO2KABBYQAh8kLACfV46v3smSDagdUSbOc+zSlxYJ
utAAoIIhh1AlBSnIfW7If8cl4PC5HzM3
=XhHw
-----END PGP SIGNATURE-----