It looks like you're right. The exact reason for the KVS send failure is the following line:
*srun: failed to send temp kvs, rc=107, retrying*

rc = 107 is "#define ENOTCONN 107 /* Transport endpoint is not connected */"

Judging by this part of the log:

slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: spank: x11.so: user_init = 0

my guess is that the spank plugin is initialized after the pmi2 plugin. So the PMI2 plugin creates its unix socket in the original /tmp, which is later shadowed by the spank plugin's mount.

I am 99% sure the error is triggered here:
https://github.com/SchedMD/slurm/blob/master/src/slurmd/slurmd/req.c#L5955
(I was fighting a VERY similar problem yesterday: http://bugs.schedmd.com/show_bug.cgi?id=1907)

We have the following syscalls there:
1. socket
2. connect
3. write
4. close

My man pages for those syscalls say that none of them return ENOTCONN; however, googling shows that read/write/close may return this error. What OS are you using? Can you check your man pages for those syscalls to see whether any of them may return ENOTCONN? You can also check slurmd's logs on the affected nodes (if debug was on for them in slurm.conf) and see what slurmd was reporting while you were getting this error. That should clarify things completely. (A minimal sketch of the suspected failure mode is appended below the quoted message.)

In general, PMI2 has a hardcoded tmpdir path:
https://github.com/SchedMD/slurm/blob/master/src/plugins/mpi/pmi2/setup.c#L71
It would be nice to change that. In the upcoming pmix plugin I'd prefer to have this flexibility too.

slurm.conf has an option:

*TmpFS*
*Fully qualified pathname of the file system available to user jobs for temporary storage. This parameter is used in establishing a node's TmpDisk space. The default value is "/tmp".*

which is not well suited to this need. It would be nice to have the flexibility to provide a system-purpose tmpdir too.

Moe, David, what do you think?

2015-09-04 9:09 GMT+03:00 Christopher Samuel <sam...@unimelb.edu.au>:
>
> On 04/09/15 16:02, Christopher Samuel wrote:
>
> > I've attached the output file for the version with debugging on, and I
> > have a suspicion it's related to:
> >
> > srun: debug: slurm_forward_data: nodelist=snowy010,
> > address=/tmp/sock.pmi2.1006.0, len=243
> >
> > We are using the tmpdir spank plugin to map /tmp for a job to a
> > temporary directory created on the scratch filesystem (rather than
> > having it hit the tiny ramdisk on our diskless nodes - we've been
> > unable to get a number of codes to honour $TMPDIR, etc).
> >
> > I can disable that plugin to check that theory.
>
> ...and it works..
>
> [samuel@snowy-m PMI2]$ srun -p debug --mpi=pmi2 ./testpmi2
> srun: job 1007 queued and waiting for resources
> srun: job 1007 has been allocated resources
> rank: 0 key:PMI_netinfo_of_task
> val:(snowy010,(eth0,IP_V4,10.14.103.10),(ib0,IP_V4,10.7.103.10),(eth0,IP_V6,fe80::42f2:e9ff:fec5:6906%eth0),(ib0,IP_V6,fe80::e61d:2
> rank: 0 key:david@0 val:rbxqrwfoeloksanm
> rank: 0 key:mpi_reserved_ports val:
> 11.639000
>
> So time to see if we can hack PMI2 to use /dev/shm instead of /tmp.
>
> --
> Christopher Samuel        Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/      http://twitter.com/vlsci

--
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov
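
P.S. To make the hypothesis above concrete, here is a minimal, self-contained
sketch of the failure mode I suspect. This is NOT the code from req.c; the
socket path is copied from the "address=/tmp/sock.pmi2.1006.0" log line, and
the "write after a failed connect" sequence is my assumption about why we see
errno 107 rather than the connect error itself.

/* Sketch only: a unix stream socket whose path is no longer reachable
 * (e.g. because a new filesystem was mounted over /tmp after the socket
 * file was created). connect() fails, and a subsequent write() on the
 * still-unconnected socket returns ENOTCONN (107). */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/tmp/sock.pmi2.1006.0"; /* assume no listener here */
	struct sockaddr_un sa;
	char buf[] = "kvs payload";
	int fd;

	fd = socket(AF_UNIX, SOCK_STREAM, 0);
	if (fd < 0) {
		perror("socket");
		return 1;
	}

	memset(&sa, 0, sizeof(sa));
	sa.sun_family = AF_UNIX;
	strncpy(sa.sun_path, path, sizeof(sa.sun_path) - 1);

	/* Fails (ENOENT/ECONNREFUSED): the socket file created before the
	 * tmpdir remount is not visible on the new /tmp. */
	if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0)
		fprintf(stderr, "connect: %s\n", strerror(errno));

	/* Writing on the unconnected stream socket yields errno 107
	 * (ENOTCONN), matching "rc=107" in the srun output. */
	if (write(fd, buf, sizeof(buf)) < 0)
		fprintf(stderr, "write: errno=%d (%s)\n", errno, strerror(errno));

	close(fd);
	return 0;
}

On Linux, running this anywhere without a listener on that path prints
errno 107 from write(), which matches the rc in the srun message.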