On Thu, 22 Aug 2013 11:25:53 PM Christopher Samuel wrote:

> A job submitted with:
>
> sbatch --nodes=2 --gres=mic:2 phi.sh
>
> will hang at srun/mpirun and slurmctld will log:
>
> [2013-08-23T16:06:33.223] _slurm_rpc_submit_batch_job JobId=95464 usec=1411
> [2013-08-23T16:06:33.224] sched: Allocate JobId=95464 NodeList=barcoo[062-063] #CPUs=32
> [2013-08-23T16:06:33.384] _pick_step_nodes: some requested nodes barcoo063 still have memory used by other steps
> [2013-08-23T16:06:33.384] _slurm_rpc_job_step_create for job 95464: Requested nodes are busy

Interestingly, if I use srun (without salloc first) it works fine:

[samuel@barcoo ~]$ srun -p debug --nodes=2 --exclusive --gres=mic:2 hostname
barcoo063
barcoo062

But if I do an salloc with the same options and then run srun, it hangs:

[samuel@barcoo ~]$ salloc -p debug --nodes=2 --exclusive --gres=mic:2
salloc: Granted job allocation 95474
[samuel@barcoo ~]$ srun hostname
^Csrun: Cancelled pending job step
srun: error: Unable to create job step: Job/step already completing or completed

...and logs the same issue as before.

*But* I can get it to work from inside that same salloc if I do:

[samuel@barcoo ~]$ srun --exclusive --ntasks 2 --gres=mic:2 hostname
barcoo062
barcoo063

and I can confirm that this also works for srun inside an sbatch job.

It doesn't appear to work with mpirun from Open-MPI though, as I have yet to find a way to get it to pass options to srun to launch the orteds. :-(

Any ideas as to why I need to recapitulate those options, please?

All the best,
Chris
-- 
Christopher Samuel    Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: [email protected] Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
