srun fails on all hosts.  Everything else works; MPI jobs run fine as
long as we bypass srun.

Error:

srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
srun: error: Task launch for 5.0 failed on node node002: Socket timed out
on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv
operation
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.


This happens on both single-node and multi-node jobs.  All hosts use DNS,
so there shouldn't be any weird name mismatches.
SELinux is disabled.
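For what it's worth, this is the kind of check I ran to convince myself name
resolution is consistent (a sketch using localhost as a runnable stand-in for
a real node name like node002):

```python
import socket

# Forward lookup through the system resolver/NSS, then a reverse lookup of
# the returned address; on a sane setup the two agree on every host.
name = "localhost"   # substitute a real node name, e.g. node002
ip = socket.gethostbyname(name)
rev = socket.gethostbyaddr(ip)[0]
print(ip, rev)
```

Running the same pair of lookups on the submit host and each compute node
should return matching results everywhere.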

Log (/var/spool/slurm/slurmd.log on node002):
[2017-06-12T14:08:40.654] debug2: Processing RPC: REQUEST_LAUNCH_TASKS
[2017-06-12T14:08:40.654] launch task 5.0 request from [email protected] (port
7638)
[2017-06-12T14:08:40.654] debug3: state for jobid 2850: ctime:1497018336
revoked:0 expires:0
[2017-06-12T14:08:40.654] debug3: state for jobid 2882: ctime:1497023548
revoked:1497023548 expires:1497023548
[2017-06-12T14:08:40.654] debug3: destroying job 2882 state
[2017-06-12T14:08:40.654] debug3: state for jobid 2886: ctime:1497023588
revoked:1497023588 expires:1497023588
[2017-06-12T14:08:40.654] debug3: destroying job 2886 state
[2017-06-12T14:08:40.654] debug3: state for jobid 2892: ctime:1497023659
revoked:1497023659 expires:1497023659
[2017-06-12T14:08:40.654] debug3: destroying job 2892 state
[2017-06-12T14:08:40.654] debug:  Checking credential with 276 bytes of sig
data
[2017-06-12T14:08:40.654] debug:  task_p_slurmd_launch_request: 5.0 0
[2017-06-12T14:08:40.655] debug:  [job 5] attempting to run prolog
[/var/spool/slurm/prolog_and_epilog/bin/prolog]
[2017-06-12T14:09:06.467] _run_prolog: run job script took usec=25811904
[2017-06-12T14:09:06.467] _run_prolog: prolog with lock for job 5 ran for
26 seconds
[2017-06-12T14:09:06.467] prolog for job 5 ran for 26 seconds
[2017-06-12T14:09:06.467] debug3: _rpc_launch_tasks: call to
_forkexec_slurmstepd
[2017-06-12T14:09:06.467] debug3: slurmstepd rank 0 (dh0002), parent rank
-1 (NONE), children 0, depth 0, max_depth 0
[2017-06-12T14:09:06.467] debug3: _send_slurmstepd_init: call to getpwuid_r
[2017-06-12T14:09:06.467] debug3: _send_slurmstepd_init: return from
getpwuid_r
[2017-06-12T14:09:06.467] debug2: Cached group access list for root/0
[2017-06-12T14:09:06.580] debug2: debug level is 8.
[2017-06-12T14:09:06.580] debug3: Trying to load plugin
/usr/lib64/slurm/switch_none.so
[2017-06-12T14:09:06.580] debug:  switch NONE plugin loaded
[2017-06-12T14:09:06.580] debug3: Success.
[2017-06-12T14:09:06.580] debug3: slurmstepd rank 0, parent address =
0.0.0.0, port = 0
[2017-06-12T14:09:06.591] debug2: Received CPU frequency information for 24
CPUs
[2017-06-12T14:09:06.597] debug2: got gid 0
[2017-06-12T14:09:06.597] debug2: got gid 1
[2017-06-12T14:09:06.597] debug2: got gid 2
[2017-06-12T14:09:06.597] debug2: got gid 3
[2017-06-12T14:09:06.597] debug2: got gid 4
[2017-06-12T14:09:06.597] debug2: got gid 6
[2017-06-12T14:09:06.597] debug2: got gid 10
[2017-06-12T14:09:06.597] debug2: setup for a launch_task
[2017-06-12T14:09:06.597] debug3: entering stepd_step_rec_create
[2017-06-12T14:09:06.602] debug3: Trying to load plugin
/usr/lib64/slurm/acct_gather_profile_none.so
[2017-06-12T14:09:06.602] debug:  AcctGatherProfile NONE plugin loaded
[2017-06-12T14:09:06.602] debug3: Success.
[2017-06-12T14:09:06.602] debug3: Trying to load plugin
/usr/lib64/slurm/acct_gather_energy_none.so
[2017-06-12T14:09:06.606] debug:  AcctGatherEnergy NONE plugin loaded
[2017-06-12T14:09:06.606] debug3: Success.
[2017-06-12T14:09:06.606] debug3: Trying to load plugin
/usr/lib64/slurm/acct_gather_infiniband_none.so
[2017-06-12T14:09:06.609] debug:  AcctGatherInfiniband NONE plugin loaded
[2017-06-12T14:09:06.609] debug3: Success.
[2017-06-12T14:09:06.609] debug3: Trying to load plugin
/usr/lib64/slurm/acct_gather_filesystem_none.so
[2017-06-12T14:09:06.609] debug:  AcctGatherFilesystem NONE plugin loaded
[2017-06-12T14:09:06.609] debug3: Success.
[2017-06-12T14:09:06.609] debug2: No acct_gather.conf file
(/etc/slurm/acct_gather.conf)
[2017-06-12T14:09:06.609] debug3: Trying to load plugin
/usr/lib64/slurm/jobacct_gather_none.so
[2017-06-12T14:09:06.609] debug:  Job accounting gather NOT_INVOKED plugin
loaded
[2017-06-12T14:09:06.609] debug3: Success.
[2017-06-12T14:09:06.609] debug3: jobacct_gather dynamic logging enabled
[2017-06-12T14:09:06.609] debug3: acct_gather_profile_startpoll dynamic
logging enabled
[2017-06-12T14:09:06.609] debug2: profile signalling type Task
[2017-06-12T14:09:06.609] debug2: scaling CPU count by factor of 2
(24/(12-0)
[2017-06-12T14:09:06.609] debug3: _rpc_launch_tasks: return from
_forkexec_slurmstepd
[2017-06-12T14:09:06.609] [5.0] debug:  Message thread started pid = 9825
[2017-06-12T14:09:06.609] [5.0] debug4: eio: handling events for 1 objects
[2017-06-12T14:09:06.609] [5.0] debug3: Called _msg_socket_readable
[2017-06-12T14:09:06.609] [5.0] debug3: Entered job_manager for 5.0 pid=9825
[2017-06-12T14:09:06.609] [5.0] debug3: Trying to load plugin
/usr/lib64/slurm/core_spec_none.so
[2017-06-12T14:09:06.609] debug2: slurm_send_timeout: Socket no longer there
[2017-06-12T14:09:06.609] debug3: slurm_msg_sendto: peer has disappeared
for msg_type=8001
[2017-06-12T14:09:06.609] error: _rpc_launch_tasks: unable to send return
code to address:port=10.30.0.42:54813 msg_type=6001: Transport endpoint is
not connected
[2017-06-12T14:09:06.617] [5.0] debug3: Success.
[2017-06-12T14:09:06.617] [5.0] debug3: Trying to load plugin
/usr/lib64/slurm/task_none.so
[2017-06-12T14:09:06.617] [5.0] debug:  task NONE plugin loaded
[2017-06-12T14:09:06.617] [5.0] debug3: Success.
[2017-06-12T14:09:06.617] [5.0] debug3: Trying to load plugin
/usr/lib64/slurm/proctrack_pgid.so
[2017-06-12T14:09:06.617] [5.0] debug3: Success.
[2017-06-12T14:09:06.617] [5.0] debug3: Trying to load plugin
/usr/lib64/slurm/checkpoint_none.so
[2017-06-12T14:09:06.622] [5.0] debug3: Success.
[2017-06-12T14:09:06.622] [5.0] debug:  Checkpoint plugin loaded:
checkpoint/none
[2017-06-12T14:09:06.622] [5.0] debug3: Trying to load plugin
/usr/lib64/slurm/crypto_munge.so
[2017-06-12T14:09:06.627] [5.0] Munge cryptographic signature plugin loaded
[2017-06-12T14:09:06.627] [5.0] debug3: Success.
[2017-06-12T14:09:06.627] [5.0] debug3: Trying to load plugin
/usr/lib64/slurm/job_container_none.so
[2017-06-12T14:09:06.627] [5.0] debug:  job_container none plugin loaded
[2017-06-12T14:09:06.627] [5.0] debug3: Success.
[2017-06-12T14:09:06.627] [5.0] debug:  mpi type = none
[2017-06-12T14:09:06.627] [5.0] debug3: Trying to load plugin
/usr/lib64/slurm/mpi_none.so
[2017-06-12T14:09:06.627] [5.0] debug3: Success.
[2017-06-12T14:09:06.627] [5.0] debug2: Before call to spank_init()
[2017-06-12T14:09:06.627] [5.0] debug:  spank: opening plugin stack
/etc/slurm/plugstack.conf
[2017-06-12T14:09:06.627] [5.0] debug2: After call to spank_init()
[2017-06-12T14:09:06.627] [5.0] debug:  mpi type = (null)
[2017-06-12T14:09:06.627] [5.0] debug:  mpi/none: slurmstepd prefork
[2017-06-12T14:09:06.627] [5.0] debug4: eio: handling events for 1 objects
[2017-06-12T14:09:06.628] [5.0] debug3: Called _msg_socket_readable
[2017-06-12T14:09:06.628] [5.0] debug2: Using gid list sent by slurmd
[2017-06-12T14:09:06.628] [5.0] debug4: eio: handling events for 1 objects
[2017-06-12T14:09:06.628] [5.0] debug3: Called _msg_socket_readable
[2017-06-12T14:09:06.682] [5.0] debug2: Entering _setup_normal_io
[2017-06-12T14:09:06.682] [5.0] debug4: eio: handling events for 1 objects
[2017-06-12T14:09:06.682] [5.0] debug3: Called _msg_socket_readable
[2017-06-12T14:09:06.682] [5.0] debug2: Using gid list sent by slurmd
[2017-06-12T14:09:06.682] [5.0] debug4: eio: handling events for 1 objects
[2017-06-12T14:09:06.682] [5.0] debug3: Called _msg_socket_readable
[2017-06-12T14:09:06.682] [5.0] debug:    stdin uses a pty object
[2017-06-12T14:09:06.682] [5.0] debug4: eio: handling events for 1 objects
[2017-06-12T14:09:06.682] [5.0] debug3: Called _msg_socket_readable
[2017-06-12T14:09:06.683] [5.0] debug4: eio: handling events for 1 objects
[2017-06-12T14:09:06.683] [5.0] debug3: Called _msg_socket_readable
[2017-06-12T14:09:06.683] [5.0] debug:  init pty size 53:161
[2017-06-12T14:09:06.683] [5.0] in _window_manager
[2017-06-12T14:09:06.683] [5.0] debug4: adding IO connection (logical node
rank 0)
[2017-06-12T14:09:06.683] [5.0] debug4: connecting IO back to
10.30.0.42:57166
[2017-06-12T14:09:06.684] [5.0] debug2: Entering io_init_msg_write_to_fd
[2017-06-12T14:09:06.684] [5.0] debug2:   msg->nodeid = 0
[2017-06-12T14:09:06.684] [5.0] debug2: Leaving  io_init_msg_write_to_fd
[2017-06-12T14:09:06.684] [5.0] debug2: Leaving  _setup_normal_io
[2017-06-12T14:09:06.684] [5.0] debug level = 2
[2017-06-12T14:09:06.684] [5.0] debug:  IO handler started pid=9825
[2017-06-12T14:09:06.684] [5.0] debug4: eio: handling events for 3 objects
[2017-06-12T14:09:06.684] [5.0] debug2: Using gid list sent by slurmd
[2017-06-12T14:09:06.684] [5.0] debug4: eio: handling events for 1 objects
[2017-06-12T14:09:06.684] [5.0] debug3: Called _msg_socket_readable
[2017-06-12T14:09:06.693] [5.0] starting 1 tasks
[2017-06-12T14:09:06.693] [5.0] task 0 (9831) started 2017-06-12T14:09:06
[2017-06-12T14:09:06.693] [5.0] debug3: Unblocking 5.0 task 0, writefd = 17
[2017-06-12T14:09:06.693] [5.0] debug2: read_slurm_cgroup_conf: No
cgroup.conf file (/etc/slurm/cgroup.conf)
[2017-06-12T14:09:06.693] [5.0] debug2: xcgroup_load: unable to get cgroup
'(null)/cpuset' entry '(null)/cpuset/system' properties: No such file or
directory
[2017-06-12T14:09:06.693] [5.0] debug:  task_p_pre_launch_priv: 5.0
[2017-06-12T14:09:06.693] [5.0] debug2: xcgroup_load: unable to get cgroup
'(null)/memory' entry '(null)/memory/system' properties: No such file or
directory
[2017-06-12T14:09:06.693] [5.0] debug:  Sending launch resp rc=0
[2017-06-12T14:09:06.693] [5.0] debug2: Using gid list sent by slurmd
[2017-06-12T14:09:06.693] [5.0] debug3: login_tty good
[2017-06-12T14:09:06.694] [5.0] debug3: slurm_send_only_node_msg: sent 208
[2017-06-12T14:09:06.694] [5.0] debug:  mpi type = (null)
[2017-06-12T14:09:06.694] [5.0] debug:  Using mpi/none
[2017-06-12T14:09:06.694] [5.0] debug:  task_p_pre_launch: 5.0, task 0
[2017-06-12T14:09:06.694] [5.0] debug2: _set_limit: conf setrlimit
RLIMIT_CPU no change in value: 18446744073709551615
[2017-06-12T14:09:06.694] [5.0] debug2: _set_limit: conf setrlimit
RLIMIT_FSIZE no change in value: 18446744073709551615
[2017-06-12T14:09:06.694] [5.0] debug2: _set_limit: conf setrlimit
RLIMIT_DATA no change in value: 18446744073709551615
[2017-06-12T14:09:06.694] [5.0] debug2: _set_limit: conf setrlimit
RLIMIT_STACK no change in value: 18446744073709551615
[2017-06-12T14:09:06.694] [5.0] debug2: _set_limit: RLIMIT_CORE   : max:inf
cur:inf req:0
[2017-06-12T14:09:06.694] [5.0] debug2: _set_limit: conf setrlimit
RLIMIT_CORE succeeded
[2017-06-12T14:09:06.694] [5.0] debug2: _set_limit: conf setrlimit
RLIMIT_RSS no change in value: 18446744073709551615
[2017-06-12T14:09:06.694] [5.0] debug2: _set_limit: conf setrlimit
RLIMIT_NPROC no change in value: 18446744073709551615
[2017-06-12T14:09:06.694] [5.0] debug2: _set_limit: conf setrlimit
RLIMIT_NOFILE no change in value: 65536
[2017-06-12T14:09:06.694] [5.0] debug2: _set_limit: conf setrlimit
RLIMIT_MEMLOCK no change in value: 18446744073709551615
[2017-06-12T14:09:06.694] [5.0] debug2: _set_limit: conf setrlimit
RLIMIT_AS no change in value: 18446744073709551615
[2017-06-12T14:09:06.694] [5.0] debug2: Set task rss(128000 MB)
[2017-06-12T14:09:06.945] [5.0] debug4: Entering _task_read for obj 2139cc0
[2017-06-12T14:09:06.945] [5.0] debug4: _task_build_message: Entering...
[2017-06-12T14:09:06.945] [5.0] debug4: Leaving  _task_build_message
[2017-06-12T14:09:06.945] [5.0] debug4: eio: handling events for 3 objects
[2017-06-12T14:09:06.945] [5.0] debug4: Entering _task_read for obj 2139cc0
[2017-06-12T14:09:06.945] [5.0] debug4: _task_build_message: Entering...
[2017-06-12T14:09:06.945] [5.0] debug4: Leaving  _task_build_message
[2017-06-12T14:09:06.945] [5.0] debug4: eio: handling events for 3 objects
[2017-06-12T14:09:06.945] [5.0] debug4: Entering _task_read for obj 2139cc0
[2017-06-12T14:09:06.945] [5.0] debug4: _task_build_message: Entering...
[2017-06-12T14:09:06.945] [5.0] debug4: Leaving  _task_build_message
[2017-06-12T14:09:06.945] [5.0] debug4: eio: handling events for 3 objects
[2017-06-12T14:09:32.008] [5.0] debug4: Entering _client_read
[2017-06-12T14:09:32.008] [5.0] debug3: Entering io_hdr_read_fd
[2017-06-12T14:09:32.008] [5.0] debug3:   _full_read (_client_read) got eof
[2017-06-12T14:09:32.008] [5.0] debug3: Leaving  io_hdr_read_fd
[2017-06-12T14:09:32.008] [5.0] debug4: eio: handling events for 3 objects
[2017-06-12T14:09:32.010] debug3: in the service_connection
[2017-06-12T14:09:32.011] [5.0] debug:  _slurm_recv_timeout at 0 of 4, recv
zero bytes
[2017-06-12T14:09:32.011] debug2: got this type of message 6011
[2017-06-12T14:09:32.011] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2017-06-12T14:09:32.011] debug:  _rpc_terminate_job, uid = 493
[2017-06-12T14:09:32.011] debug:  task_p_slurmd_release_resources: 5
[2017-06-12T14:09:32.011] debug3: state for jobid 2850: ctime:1497018336
revoked:0 expires:0
[2017-06-12T14:09:32.011] debug3: state for jobid 5: ctime:1497294520
revoked:0 expires:0
[2017-06-12T14:09:32.011] debug3: state for jobid 5: ctime:1497294520
revoked:0 expires:0
[2017-06-12T14:09:32.011] debug:  credential for job 5 revoked
[2017-06-12T14:09:32.011] debug4: found jobid = 5, stepid = 0
[2017-06-12T14:09:32.011] [5.0] debug3: Called _msg_socket_accept
[2017-06-12T14:09:32.011] [5.0] debug3: Leaving _msg_socket_accept
[2017-06-12T14:09:32.011] [5.0] debug4: eio: handling events for 1 objects
[2017-06-12T14:09:32.011] [5.0] debug3: Called _msg_socket_readable
[2017-06-12T14:09:32.011] [5.0] debug3: Entering _handle_accept (new thread)
[2017-06-12T14:09:32.011] [5.0] debug3:   Identity: uid=0, gid=0
[2017-06-12T14:09:32.011] [5.0] debug3: Entering _handle_request
[2017-06-12T14:09:32.011] debug2: container signal 18 to job 5.0
[2017-06-12T14:09:32.011] [5.0] debug3: Got request 4
[2017-06-12T14:09:32.011] [5.0] debug:  Handling REQUEST_SIGNAL_CONTAINER
[2017-06-12T14:09:32.011] [5.0] debug:  _handle_signal_container for
step=5.0 uid=0 signal=18
[2017-06-12T14:09:32.012] [5.0] Sent signal 18 to 5.0
[2017-06-12T14:09:32.012] [5.0] debug3: Leaving  _handle_request:
SLURM_SUCCESS
[2017-06-12T14:09:32.012] [5.0] debug3: Entering _handle_request
[2017-06-12T14:09:32.012] [5.0] debug3: Leaving  _handle_accept
[2017-06-12T14:09:32.012] debug4: found jobid = 5, stepid = 0
[2017-06-12T14:09:32.012] [5.0] debug3: Called _msg_socket_accept
[2017-06-12T14:09:32.012] [5.0] debug3: Leaving _msg_socket_accept
[2017-06-12T14:09:32.012] [5.0] debug4: eio: handling events for 1 objects
[2017-06-12T14:09:32.012] [5.0] debug3: Called _msg_socket_readable
[2017-06-12T14:09:32.012] [5.0] debug3: Entering _handle_accept (new thread)
[2017-06-12T14:09:32.012] [5.0] debug3:   Identity: uid=0, gid=0
[2017-06-12T14:09:32.012] [5.0] debug3: Entering _handle_request
[2017-06-12T14:09:32.012] debug2: container signal 15 to job 5.0
[2017-06-12T14:09:32.012] [5.0] debug3: Got request 4
[2017-06-12T14:09:32.012] [5.0] debug:  Handling REQUEST_SIGNAL_CONTAINER
[2017-06-12T14:09:32.012] [5.0] debug:  _handle_signal_container for
step=5.0 uid=0 signal=15
[2017-06-12T14:09:32.012] [5.0] error: *** STEP 5.0 ON dh0002 CANCELLED AT
2017-06-12T14:09:32 ***
[2017-06-12T14:09:32.012] [5.0] Sent signal 15 to 5.0
[2017-06-12T14:09:32.012] [5.0] debug3: Leaving  _handle_request:
SLURM_SUCCESS
[2017-06-12T14:09:32.012] [5.0] debug3: Entering _handle_request
[2017-06-12T14:09:32.012] debug4: sent SUCCESS

We have also seen:

Error connecting slurm stream socket at 10.30.0.42:45967: Connection
refused (where the IP is that of the submit host)
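A "Connection refused" on a callback to the submit host generally means
nothing is listening on that ephemeral port any more, which would fit srun
having already timed out and exited before the node connected back.  A
trivial reproduction of the errno (port 1 on localhost is assumed closed):

```python
import errno
import socket

# Connecting to a closed port yields ECONNREFUSED, the same error the
# stream-socket callback to 10.30.0.42:45967 hit.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
err = s.connect_ex(("127.0.0.1", 1))  # returns errno instead of raising
s.close()
print(err == errno.ECONNREFUSED)
```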


When running srun under strace, I do see it trying to bind to privileged
ports and failing:


10636 bind(4, {sa_family=AF_INET, sin_port=htons(636),
sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EACCES (Permission denied)
10636 bind(8, {sa_family=AF_INET, sin_port=htons(637),
sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EACCES (Permission denied)
10636 bind(8, {sa_family=AF_INET, sin_port=htons(638),
sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EACCES (Permission denied)
10636 bind(9, {sa_family=AF_INET, sin_port=htons(639),
sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EACCES (Permission denied)
10636 bind(9, {sa_family=AF_INET, sin_port=htons(640),
sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EACCES (Permission denied)
10636 bind(15, {sa_family=AF_INET, sin_port=htons(641),
sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EACCES (Permission denied)
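The EACCES results above are expected for an unprivileged process: on Linux,
binding a port below 1024 requires root or CAP_NET_BIND_SERVICE.  srun
normally falls back to an unprivileged ephemeral port, so I'm not sure
whether these failed binds are the root cause or just noise.  A minimal
sketch of the same failure (using port 636 from the strace):

```python
import errno
import socket

# Binding a port < 1024 requires root or CAP_NET_BIND_SERVICE on Linux,
# so as an unprivileged user this reproduces the EACCES in the strace.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s.bind(("0.0.0.0", 636))
    result = "ok"        # we must be privileged (or ports are unrestricted)
except PermissionError:
    result = "eacces"    # what srun hit here as an unprivileged user
finally:
    s.close()
print(result)
```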

Here is the slurm.conf:
ClusterName=cluster
ControlMachine=sched01
ControlAddr=10.30.4.41
SlurmUser=slurm
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/etc/slurm/state
SlurmdSpoolDir=/var/spool/slurm/jobs
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
PluginDir=/usr/lib64/slurm
CacheGroups=0
ReturnToService=0
Prolog=/var/spool/slurm/prolog_and_epilog/bin/prolog
Epilog=/var/spool/slurm/prolog_and_epilog/bin/epilog
UsePAM=1
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
SchedulerType=sched/backfill
SchedulerParameters=bf_max_job_start=1,sched_max_job_start=1,preempt_youngest_first
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
FastSchedule=1
PreemptType=preempt/qos
PreemptMode=requeue
SlurmctldDebug=8
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=8
SlurmdLogFile=/var/spool/slurm/slurmd.log
JobCompType=jobcomp/none
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=sched01
AccountingStorageLoc=slurm_acct_db
AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStoragePort=6819
AccountingStorageUser=slurm
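One thing that stands out against the slurmd log: the prolog took about 26
seconds (run job script took usec=25811904), while MessageTimeout is not set
above and defaults to 10 seconds.  If srun's launch RPC gives up while the
prolog is still running, that would produce exactly these send/recv
timeouts.  A hedged fragment to test that theory (values are illustrative,
not recommendations):

```
# Assumption: the ~26 s prolog is outliving the default 10 s RPC timeout.
MessageTimeout=60
# Optionally pin srun's client ports to a known unprivileged range
# (SrunPortRange requires Slurm >= 15.08):
SrunPortRange=60001-63000
```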
