Correction: the slurmstepd core file is in the log directory, not in the
directory where you ran srun.

On 04/11/2014 12:08 PM, David Bigagli wrote:

Hi,
This Slurm bug has been fixed. The fix will be in 14.03.1, which will be
released soon, and is already available in HEAD. You should find a
slurmstepd core file in the directory where you ran the srun command.

On 04/11/2014 11:57 AM, Anthony Alba wrote:
I am not sure whether this is a Slurm or an Open MPI issue, so I am starting here.

The OpenMPI FAQ mentions an issue with slurm 2.6.3/pmi2.
https://www.open-mpi.org/faq/?category=slurm#slurm-2.6.3-issue

I have built both Open MPI 1.7.5 and 1.8 against Slurm 14.03 with PMI2.

When I launch openmpi/examples/hello_c on a single-node allocation with
srun and --mpi=pmi2, the job step aborts:

srun -N 1 --mpi=pmi2 hello_c
srun: error: _server_read: fd 18 got error or unexpected eof reading header
srun: error: step_launch_notify_io_failure: aborting, io error with slurmstepd on node 0
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete


With --slurmd-debug=9 I get the output below. I am not sure what
"ip 111.110.61.48 sd 14" means. Is that "ip" as in IP address? It is not
the IP address of any node in my partition.

slurmstepd: mpi/pmi2: client_resp_send: 26    cmd=kvs-put-response;rc=0;
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: got client request: 14     cmd=kvs-fence;
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_read
slurmstepd: _tree_listen_read: accepted tree connection: ip 111.110.61.48 sd 14
slurmstepd: _handle_accept_rank: going to read() client rank
slurmstepd: _handle_accept_rank: got client rank 1478164480 on fd 14
srun: error: _server_read: fd 18 got error or unexpected eof reading header
srun: error: step_launch_notify_io_failure: aborting, io error with slurmstepd on node 0
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
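
A side note on the "ip 111.110.61.48" question, offered as an observation
rather than a confirmed diagnosis: all four octets are printable ASCII,
which hints that slurmstepd may be printing bytes that were never a real
address. The Python sketch below only decodes the two values copied from
the log above. The helper names octets_to_ascii and rank_bytes_hex are made
up for this illustration and are not part of Slurm.

```python
import struct


def octets_to_ascii(dotted_quad: str) -> str:
    """Interpret the four octets of a dotted-quad string as ASCII characters."""
    return bytes(int(part) for part in dotted_quad.split(".")).decode("ascii")


def rank_bytes_hex(rank: int) -> str:
    """Hex dump of a rank as a 32-bit unsigned integer, most significant byte first."""
    return struct.pack(">I", rank).hex()


if __name__ == "__main__":
    # Values copied verbatim from the slurmstepd debug output.
    print(octets_to_ascii("111.110.61.48"))  # on=0
    print(rank_bytes_hex(1478164480))        # 581b0000
```

The octets decode to the text "on=0", and a rank of 1478164480 (0x581b0000)
is implausible for a single-node step, so both values look like stray data
rather than a real peer address and rank. Whether this is connected to the
bug fixed in 14.03.1 is an assumption, not something stated in this thread.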

Launching with salloc/sbatch works.

- Anthony



--

Thanks,
      /David/Bigagli

www.schedmd.com
