Hi folks,

Having sorted out the previous issue with PMI2 and Open MPI, I now find that
trying to use PMI2 to run a simple hello-world MPI app gives:

[samuel@snowy-m Test]$ srun --ntasks=64 ./hello
srun: job 959 queued and waiting for resources
srun: job 959 has been allocated resources
srun: error: mpi/pmi2: failed to send temp kvs to compute nodes
slurmstepd: *** STEP 959.0 CANCELLED AT 2015-09-02T16:16:41 *** on snowy001
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: snowy002: tasks 32-63: Killed
srun: error: Timed out waiting for job step to complete

The code itself is from:

http://mpitutorial.com/tutorials/mpi-hello-world/
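
For reference, the program is essentially the standard MPI hello world; a
sketch from memory (not a verbatim copy of the tutorial's source) looks like:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    /* Initialise the MPI runtime (this is where PMI2 wire-up happens
       when launched via srun --mpi=pmi2). */
    MPI_Init(&argc, &argv);

    int world_size, world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    MPI_Finalize();
    return 0;
}
```

Nothing exotic there — just init, rank/size queries, a printf and finalize —
so the failure looks like it is in the PMI2 key-value exchange rather than
the application itself.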

I don't see any errors in the slurmd logs on the nodes in question,
except when cleaning up:

[2015-09-02T16:16:23.136] launch task 959.0 request from 500.506@10.14.0.33 (port 40414)
[2015-09-02T16:16:23.169] _run_prolog: run job script took usec=32494
[2015-09-02T16:16:23.169] _run_prolog: prolog with lock for job 959 ran for 0 seconds
[2015-09-02T16:16:23.199] [959.0] task/cgroup: /slurm/uid_500/job_959: alloc=65536MB mem.limit=65536MB memsw.limit=65536MB
[2015-09-02T16:16:23.199] [959.0] task/cgroup: /slurm/uid_500/job_959/step_0: alloc=65536MB mem.limit=65536MB memsw.limit=65536MB
[2015-09-02T16:16:41.010] [959.0] *** STEP 959.0 CANCELLED AT 2015-09-02T16:16:41 *** on snowy001
[2015-09-02T16:17:13.505] [959.0] Failed to send MESSAGE_TASK_EXIT: Connection refused
[2015-09-02T16:17:13.508] [959.0] done with job

The second node says:

[2015-09-02T16:16:23.137] launch task 959.0 request from 500.506@10.14.0.33 (port 52706)
[2015-09-02T16:16:23.170] _run_prolog: run job script took usec=32506
[2015-09-02T16:16:23.170] _run_prolog: prolog with lock for job 959 ran for 0 seconds
[2015-09-02T16:16:23.207] [959.0] task/cgroup: /slurm/uid_500/job_959: alloc=65536MB mem.limit=65536MB memsw.limit=65536MB
[2015-09-02T16:16:23.207] [959.0] task/cgroup: /slurm/uid_500/job_959/step_0: alloc=65536MB mem.limit=65536MB memsw.limit=65536MB
[2015-09-02T16:17:13.498] [959.0] done with job

The slurmctld just logs:

[2015-09-02T16:16:21.698] sched: _slurm_rpc_allocate_resources JobId=959 NodeList=(null) usec=322
[2015-09-02T16:16:23.136] backfill: Started JobId=959 on snowy[001-002]
[2015-09-02T16:17:13.002] job_complete: JobID=959 State=0x1 NodeCnt=2 WIFEXITED 0 WEXITSTATUS 0
[2015-09-02T16:17:13.003] job_complete: JobID=959 State=0x8003 NodeCnt=2 done


Any ideas?

All the best,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci