Hi folks,

One of our Xeon Phi early adopters has found a behaviour I cannot explain: jobs that request GRES cannot span compute nodes.

A job submitted with:

    sbatch --nodes=2 --gres=mic:2 phi.sh

will hang at srun/mpirun, and slurmctld will log:

    [2013-08-23T16:06:33.223] _slurm_rpc_submit_batch_job JobId=95464 usec=1411
    [2013-08-23T16:06:33.224] sched: Allocate JobId=95464 NodeList=barcoo[062-063] #CPUs=32
    [2013-08-23T16:06:33.384] _pick_step_nodes: some requested nodes barcoo063 still have memory used by other steps
    [2013-08-23T16:06:33.384] _slurm_rpc_job_step_create for job 95464: Requested nodes are busy

But if I just drop the GRES request it works:

    sbatch --nodes=2 phi.sh

    [2013-08-23T16:14:03.699] _slurm_rpc_submit_batch_job JobId=95466 usec=1326
    [2013-08-23T16:14:14.001] backfill: Started JobId=95466 on barcoo[062-063]
    [2013-08-23T16:14:14.159] sched: _slurm_rpc_job_step_create: StepId=95466.0 barcoo063 usec=381
    [2013-08-23T16:14:14.757] sched: _slurm_rpc_step_complete StepId=95466.0 usec=135
    [2013-08-23T16:14:15.766] completing job 95466
    [2013-08-23T16:14:15.768] sched: job_complete for JobId=95466 successful, exit code=256

Ignore the failed exit code; the code is just complaining that it cannot access the Phis, as OFFLOAD_DEVICES gets set to -1 with no GRES request. :-)

Any ideas? We're on Slurm 2.6.0, but I don't see any relevant changes in the NEWS file for 2.6.1.

All the best,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci