Hi folks,

Having sorted out the previous issue with PMI2 and Open-MPI, I now find
that trying to use PMI2 to run a simple hello world MPI app gives:
[samuel@snowy-m Test]$ srun --ntasks=64 ./hello
srun: job 959 queued and waiting for resources
srun: job 959 has been allocated resources
srun: error: mpi/pmi2: failed to send temp kvs to compute nodes
slurmstepd: *** STEP 959.0 CANCELLED AT 2015-09-02T16:16:41 *** on snowy001
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: snowy002: tasks 32-63: Killed
srun: error: Timed out waiting for job step to complete

The code itself is from:

  http://mpitutorial.com/tutorials/mpi-hello-world/

(I've included a sketch of it at the end of this mail.)

I don't see any errors in the slurmd on the nodes in question, except
when cleaning up:

[2015-09-02T16:16:23.136] launch task 959.0 request from 500.506@10.14.0.33 (port 40414)
[2015-09-02T16:16:23.169] _run_prolog: run job script took usec=32494
[2015-09-02T16:16:23.169] _run_prolog: prolog with lock for job 959 ran for 0 seconds
[2015-09-02T16:16:23.199] [959.0] task/cgroup: /slurm/uid_500/job_959: alloc=65536MB mem.limit=65536MB memsw.limit=65536MB
[2015-09-02T16:16:23.199] [959.0] task/cgroup: /slurm/uid_500/job_959/step_0: alloc=65536MB mem.limit=65536MB memsw.limit=65536MB
[2015-09-02T16:16:41.010] [959.0] *** STEP 959.0 CANCELLED AT 2015-09-02T16:16:41 *** on snowy001
[2015-09-02T16:17:13.505] [959.0] Failed to send MESSAGE_TASK_EXIT: Connection refused
[2015-09-02T16:17:13.508] [959.0] done with job

The second node says:

[2015-09-02T16:16:23.137] launch task 959.0 request from 500.506@10.14.0.33 (port 52706)
[2015-09-02T16:16:23.170] _run_prolog: run job script took usec=32506
[2015-09-02T16:16:23.170] _run_prolog: prolog with lock for job 959 ran for 0 seconds
[2015-09-02T16:16:23.207] [959.0] task/cgroup: /slurm/uid_500/job_959: alloc=65536MB mem.limit=65536MB memsw.limit=65536MB
[2015-09-02T16:16:23.207] [959.0] task/cgroup: /slurm/uid_500/job_959/step_0: alloc=65536MB mem.limit=65536MB memsw.limit=65536MB
[2015-09-02T16:17:13.498] [959.0] done with job

The slurmctld just logs:

[2015-09-02T16:16:21.698] sched: _slurm_rpc_allocate_resources JobId=959 NodeList=(null) usec=322
[2015-09-02T16:16:23.136] backfill: Started JobId=959 on snowy[001-002]
[2015-09-02T16:17:13.002] job_complete: JobID=959 State=0x1 NodeCnt=2 WIFEXITED 0 WEXITSTATUS 0
[2015-09-02T16:17:13.003] job_complete: JobID=959 State=0x8003 NodeCnt=2 done

Any ideas?

All the best,
Chris
--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au    Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/       http://twitter.com/vlsci
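P.S. For reference, ./hello is essentially the hello world program from
the tutorial page linked above; my local copy may differ in detail, but
it is along these lines:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* Initialise the MPI environment */
    MPI_Init(NULL, NULL);

    /* Get the number of processes in the job */
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Get the rank of this process */
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Get the name of the node we are running on */
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    /* Print a hello world message from each rank */
    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    /* Tear down the MPI environment */
    MPI_Finalize();
    return 0;
}

It was built with the Open-MPI compiler wrapper, i.e. something like
"mpicc -o hello hello.c", nothing unusual.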