It looks like you're right.
The exact reason for the KVS send failure is the following line:

*srun: failed to send temp kvs, rc=107, retrying*

rc = 107 is "#define ENOTCONN 107 /* Transport endpoint is not connected */"

One guess, based on this part of the log:
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: spank: x11.so: user_init = 0

is that the spank plugin is initialized after the pmi2 plugin, so the PMI2
plugin creates its Unix socket in the original /tmp directory, which is later
shadowed by the one the spank plugin mounts over it.

I am 99% sure that the error is triggered here:
https://github.com/SchedMD/slurm/blob/master/src/slurmd/slurmd/req.c#L5955
(I was fighting a VERY similar problem yesterday:
http://bugs.schedmd.com/show_bug.cgi?id=1907)

We have the following syscalls there:
1. socket
2. connect
3. write
4. close

My man pages for those syscalls say that none of them returns ENOTCONN;
however, googling shows that read/write/close may return this error. What OS
are you using? Can you check your man pages for these syscalls to see whether
any of them may return ENOTCONN?
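
For context, here is a minimal, self-contained sketch of that
socket/connect/write/close sequence against the PMI2 socket path from Chris's
log. This is NOT the actual code from req.c (the function and variable names
are mine, for illustration only); it just shows where in the sequence an error
from a stale socket file would surface:

    /* Simplified sketch, not the actual Slurm code. Only the socket
     * path is taken from the srun log quoted below. */
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <sys/un.h>

    static int forward_data(const char *sock_path, const char *buf, size_t len)
    {
        struct sockaddr_un addr;
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);          /* 1. socket  */
        if (fd < 0)
            return -1;

        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, sock_path, sizeof(addr.sun_path) - 1);

        /* 2. connect: if /tmp was replaced (e.g. bind-mounted by a spank
         * plugin) after the PMI2 agent created its listening socket, the
         * file at sock_path no longer points to a live listener. */
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("connect");
            close(fd);
            return -1;
        }

        if (write(fd, buf, len) < 0) {                      /* 3. write   */
            perror("write"); /* read/write/close are where ENOTCONN can appear */
            close(fd);
            return -1;
        }

        close(fd);                                          /* 4. close   */
        return 0;
    }

    int main(void)
    {
        const char msg[] = "kvs payload";
        if (forward_data("/tmp/sock.pmi2.1006.0", msg, sizeof(msg)) < 0)
            fprintf(stderr, "send failed: %s\n", strerror(errno));
        return 0;
    }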

You can also check slurmd's logs on the affected nodes (if debugging was
enabled for them in slurm.conf) and see what slurmd was reporting while you
were getting this error. This will help clarify things completely.

In general, PMI2 has a hardcoded path to the tmpdir:
https://github.com/SchedMD/slurm/blob/master/src/plugins/mpi/pmi2/setup.c#L71

It would be nice to change that. In the upcoming pmix plugin I'd prefer to
have this flexibility too.
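
For reference, the socket address in that srun log line
(/tmp/sock.pmi2.1006.0) looks like it is built roughly as in the sketch below;
the format string and variable names are my assumptions, only the resulting
path comes from the log:

    #include <stdio.h>

    /* "/tmp" is the part that is effectively hardcoded today and that
     * I would like to see made configurable. */
    #define PMI2_TMPDIR "/tmp"

    int main(void)
    {
        unsigned jobid = 1006, stepid = 0;   /* values from the log above */
        char sock_path[108];                 /* sun_path size on Linux */

        snprintf(sock_path, sizeof(sock_path), "%s/sock.pmi2.%u.%u",
                 PMI2_TMPDIR, jobid, stepid);
        printf("%s\n", sock_path);           /* -> /tmp/sock.pmi2.1006.0 */
        return 0;
    }
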
slurm.conf has an option:
*TmpFS*
*Fully qualified pathname of the file system available to user jobs for
temporary storage. This parameter is used in establishing a node's TmpDisk
space. The default value is "/tmp".*

which is not well suited to this need. It would be nice to have the
flexibility to provide a system-purpose tmpdir as well (see the rough sketch
below). Moe, David, what do you think?
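
To make the distinction concrete, a rough slurm.conf sketch: TmpFS is the
existing parameter, while the commented-out entry is purely hypothetical (the
name is made up) and only illustrates the kind of system-purpose knob I have
in mind:

    # slurm.conf
    # Existing parameter: user-facing temporary storage, also used to
    # size a node's TmpDisk.
    TmpFS=/scratch/tmp

    # Hypothetical parameter (does not exist today; name invented for
    # illustration): a system-purpose tmpdir that plugins such as mpi/pmi2
    # could use for their own sockets, independent of whatever /tmp the
    # user's job sees.
    #SystemTmpDir=/var/spool/slurmd/tmp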

2015-09-04 9:09 GMT+03:00 Christopher Samuel <sam...@unimelb.edu.au>:

>
> On 04/09/15 16:02, Christopher Samuel wrote:
>
> > I've attached the output file for the version with debugging on, and I
> > have a suspicion it's related to:
> >
> > srun: debug:  slurm_forward_data: nodelist=snowy010,
> address=/tmp/sock.pmi2.1006.0, len=243
> >
> > We are using the tmpdir spank plugin to map /tmp for a job to a
> > temporary directory created on the scratch filesystem (rather than
> > having it hit the tiny ramdisk on our diskless nodes - we've been
> > unable to get a number of codes to honour $TMPDIR, etc).
> >
> > I can disable that plugin to check that theory.
>
> ...and it works..
>
> [samuel@snowy-m PMI2]$ srun -p debug --mpi=pmi2 ./testpmi2
> srun: job 1007 queued and waiting for resources
> srun: job 1007 has been allocated resources
> rank: 0 key:PMI_netinfo_of_task
>
> val:(snowy010,(eth0,IP_V4,10.14.103.10),(ib0,IP_V4,10.7.103.10),(eth0,IP_V6,fe80::42f2:e9ff:fec5:6906%eth0),(ib0,IP_V6,fe80::e61d:2
> rank: 0 key:david@0 val:rbxqrwfoeloksanm
> rank: 0 key:mpi_reserved_ports val:
> 11.639000
>
> So time to see if we can hack PMI2 to use /dev/shm instead of /tmp.
>
> --
>  Christopher Samuel        Senior Systems Administrator
>  VLSCI - Victorian Life Sciences Computation Initiative
>  Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
>  http://www.vlsci.org.au/      http://twitter.com/vlsci
>



-- 
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov
