On Thu, 22 Aug 2013 11:25:53 PM Christopher Samuel wrote:

> A job submitted with:
> 
> sbatch --nodes=2 --gres=mic:2 phi.sh
> 
> will hang at srun/mpirun and slurmctld will log:
> 
> [2013-08-23T16:06:33.223] _slurm_rpc_submit_batch_job JobId=95464 usec=1411
> [2013-08-23T16:06:33.224] sched: Allocate JobId=95464 NodeList=barcoo[062-063] #CPUs=32
> [2013-08-23T16:06:33.384] _pick_step_nodes: some requested nodes barcoo063 still have memory used by other steps
> [2013-08-23T16:06:33.384] _slurm_rpc_job_step_create for job 95464: Requested nodes are busy

Interestingly if I use srun (without salloc first) it works fine:

[samuel@barcoo ~]$ srun -p debug --nodes=2 --exclusive --gres=mic:2 hostname
barcoo063
barcoo062

But if I do an salloc for that and then srun it hangs:

[samuel@barcoo ~]$ salloc -p debug --nodes=2 --exclusive --gres=mic:2 
salloc: Granted job allocation 95474
[samuel@barcoo ~]$ srun hostname
^Csrun: Cancelled pending job step
srun: error: Unable to create job step: Job/step already completing or completed

...and logs the same issue as before.

*But* I can get it to work from inside that same salloc if I do:

[samuel@barcoo ~]$ srun --exclusive --ntasks 2 --gres=mic:2 hostname
barcoo062
barcoo063

and I can confirm that the same srun invocation also works inside an
sbatch job.  It doesn't appear to work with mpirun from Open MPI though,
as I have yet to find a way to get it to pass options to the srun it
uses to launch the orteds. :-(
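
For the sbatch case, a minimal sketch of what that workaround looks like
in a batch script. The real contents of phi.sh are not shown in this
thread, so the script body and the STEP_OPTS variable name are
illustrative assumptions; the srun options are the ones from the command
above.

```shell
#!/bin/sh
#SBATCH --nodes=2
#SBATCH --gres=mic:2

# Hypothetical stand-in for phi.sh (the real script is not shown here).
# Workaround: repeat the resource options on the srun line itself,
# because a bare "srun" inside the allocation hangs.
STEP_OPTS="--exclusive --ntasks 2 --gres=mic:2"

if [ -n "${SLURM_JOB_ID:-}" ] && command -v srun >/dev/null 2>&1; then
    # Inside a Slurm job: launch the step.
    # STEP_OPTS is left unquoted on purpose so it splits into separate options.
    srun $STEP_OPTS hostname
else
    # Not inside a Slurm job: just show the command that would be run.
    echo "srun $STEP_OPTS hostname"
fi
```

Submitted with "sbatch phi.sh", the #SBATCH lines replace the command-line
options; the guard only exists so the script can be dry-run outside Slurm.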

Any ideas as to why I need to repeat those options on the srun line, please?

All the best,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: [email protected] Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
