srun fails on all hosts. Everything else works; MPI jobs run fine as long as we bypass srun.
Error:

srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
srun: error: Task launch for 5.0 failed on node node002: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

This happens on single- and multi-node jobs. All hosts use DNS, so there shouldn't be any weird name mismatches. SELinux is DISABLED.

Log (/var/spool/slurm/slurmd.log on node002):

[2017-06-12T14:08:40.654] debug2: Processing RPC: REQUEST_LAUNCH_TASKS
[2017-06-12T14:08:40.654] launch task 5.0 request from [email protected] (port 7638)
[2017-06-12T14:08:40.654] debug3: state for jobid 2850: ctime:1497018336 revoked:0 expires:0
[2017-06-12T14:08:40.654] debug3: state for jobid 2882: ctime:1497023548 revoked:1497023548 expires:1497023548
[2017-06-12T14:08:40.654] debug3: destroying job 2882 state
[2017-06-12T14:08:40.654] debug3: state for jobid 2886: ctime:1497023588 revoked:1497023588 expires:1497023588
[2017-06-12T14:08:40.654] debug3: destroying job 2886 state
[2017-06-12T14:08:40.654] debug3: state for jobid 2892: ctime:1497023659 revoked:1497023659 expires:1497023659
[2017-06-12T14:08:40.654] debug3: destroying job 2892 state
[2017-06-12T14:08:40.654] debug: Checking credential with 276 bytes of sig data
[2017-06-12T14:08:40.654] debug: task_p_slurmd_launch_request: 5.0 0
[2017-06-12T14:08:40.655] debug: [job 5] attempting to run prolog [/var/spool/slurm/prolog_and_epilog/bin/prolog]
[2017-06-12T14:09:06.467] _run_prolog: run job script took usec=25811904
[2017-06-12T14:09:06.467] _run_prolog: prolog with lock for job 5 ran for 26 seconds
[2017-06-12T14:09:06.467] prolog for job 5 ran for 26 seconds
[2017-06-12T14:09:06.467] debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
[2017-06-12T14:09:06.467] debug3: slurmstepd rank 0 (dh0002), parent rank -1 (NONE), children 0, depth 0, max_depth 0
[2017-06-12T14:09:06.467] debug3: _send_slurmstepd_init: call to getpwuid_r
[2017-06-12T14:09:06.467] debug3: _send_slurmstepd_init: return from getpwuid_r
[2017-06-12T14:09:06.467] debug2: Cached group access list for root/0
[2017-06-12T14:09:06.580] debug2: debug level is 8.
[2017-06-12T14:09:06.580] debug3: Trying to load plugin /usr/lib64/slurm/switch_none.so
[2017-06-12T14:09:06.580] debug: switch NONE plugin loaded
[2017-06-12T14:09:06.580] debug3: Success.
[2017-06-12T14:09:06.580] debug3: slurmstepd rank 0, parent address = 0.0.0.0, port = 0
[2017-06-12T14:09:06.591] debug2: Received CPU frequency information for 24 CPUs
[2017-06-12T14:09:06.597] debug2: got gid 0
[2017-06-12T14:09:06.597] debug2: got gid 1
[2017-06-12T14:09:06.597] debug2: got gid 2
[2017-06-12T14:09:06.597] debug2: got gid 3
[2017-06-12T14:09:06.597] debug2: got gid 4
[2017-06-12T14:09:06.597] debug2: got gid 6
[2017-06-12T14:09:06.597] debug2: got gid 10
[2017-06-12T14:09:06.597] debug2: setup for a launch_task
[2017-06-12T14:09:06.597] debug3: entering stepd_step_rec_create
[2017-06-12T14:09:06.602] debug3: Trying to load plugin /usr/lib64/slurm/acct_gather_profile_none.so
[2017-06-12T14:09:06.602] debug: AcctGatherProfile NONE plugin loaded
[2017-06-12T14:09:06.602] debug3: Success.
[2017-06-12T14:09:06.602] debug3: Trying to load plugin /usr/lib64/slurm/acct_gather_energy_none.so
[2017-06-12T14:09:06.606] debug: AcctGatherEnergy NONE plugin loaded
[2017-06-12T14:09:06.606] debug3: Success.
[2017-06-12T14:09:06.606] debug3: Trying to load plugin /usr/lib64/slurm/acct_gather_infiniband_none.so
[2017-06-12T14:09:06.609] debug: AcctGatherInfiniband NONE plugin loaded
[2017-06-12T14:09:06.609] debug3: Success.
[2017-06-12T14:09:06.609] debug3: Trying to load plugin /usr/lib64/slurm/acct_gather_filesystem_none.so
[2017-06-12T14:09:06.609] debug: AcctGatherFilesystem NONE plugin loaded
[2017-06-12T14:09:06.609] debug3: Success.
[2017-06-12T14:09:06.609] debug2: No acct_gather.conf file (/etc/slurm/acct_gather.conf)
[2017-06-12T14:09:06.609] debug3: Trying to load plugin /usr/lib64/slurm/jobacct_gather_none.so
[2017-06-12T14:09:06.609] debug: Job accounting gather NOT_INVOKED plugin loaded
[2017-06-12T14:09:06.609] debug3: Success.
[2017-06-12T14:09:06.609] debug3: jobacct_gather dynamic logging enabled
[2017-06-12T14:09:06.609] debug3: acct_gather_profile_startpoll dynamic logging enabled
[2017-06-12T14:09:06.609] debug2: profile signalling type Task
[2017-06-12T14:09:06.609] debug2: scaling CPU count by factor of 2 (24/(12-0)
[2017-06-12T14:09:06.609] debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
[2017-06-12T14:09:06.609] [5.0] debug: Message thread started pid = 9825
[2017-06-12T14:09:06.609] [5.0] debug4: eio: handling events for 1 objects
[2017-06-12T14:09:06.609] [5.0] debug3: Called _msg_socket_readable
[2017-06-12T14:09:06.609] [5.0] debug3: Entered job_manager for 5.0 pid=9825
[2017-06-12T14:09:06.609] [5.0] debug3: Trying to load plugin /usr/lib64/slurm/core_spec_none.so
[2017-06-12T14:09:06.609] debug2: slurm_send_timeout: Socket no longer there
[2017-06-12T14:09:06.609] debug3: slurm_msg_sendto: peer has disappeared for msg_type=8001
[2017-06-12T14:09:06.609] error: _rpc_launch_tasks: unable to send return code to address:port=10.30.0.42:54813 msg_type=6001: Transport endpoint is not connected
[2017-06-12T14:09:06.617] [5.0] debug3: Success.
[2017-06-12T14:09:06.617] [5.0] debug3: Trying to load plugin /usr/lib64/slurm/task_none.so
[2017-06-12T14:09:06.617] [5.0] debug: task NONE plugin loaded
[2017-06-12T14:09:06.617] [5.0] debug3: Success.
[2017-06-12T14:09:06.617] [5.0] debug3: Trying to load plugin /usr/lib64/slurm/proctrack_pgid.so
[2017-06-12T14:09:06.617] [5.0] debug3: Success.
[2017-06-12T14:09:06.617] [5.0] debug3: Trying to load plugin /usr/lib64/slurm/checkpoint_none.so
[2017-06-12T14:09:06.622] [5.0] debug3: Success.
[2017-06-12T14:09:06.622] [5.0] debug: Checkpoint plugin loaded: checkpoint/none
[2017-06-12T14:09:06.622] [5.0] debug3: Trying to load plugin /usr/lib64/slurm/crypto_munge.so
[2017-06-12T14:09:06.627] [5.0] Munge cryptographic signature plugin loaded
[2017-06-12T14:09:06.627] [5.0] debug3: Success.
[2017-06-12T14:09:06.627] [5.0] debug3: Trying to load plugin /usr/lib64/slurm/job_container_none.so
[2017-06-12T14:09:06.627] [5.0] debug: job_container none plugin loaded
[2017-06-12T14:09:06.627] [5.0] debug3: Success.
[2017-06-12T14:09:06.627] [5.0] debug: mpi type = none
[2017-06-12T14:09:06.627] [5.0] debug3: Trying to load plugin /usr/lib64/slurm/mpi_none.so
[2017-06-12T14:09:06.627] [5.0] debug3: Success.
[2017-06-12T14:09:06.627] [5.0] debug2: Before call to spank_init()
[2017-06-12T14:09:06.627] [5.0] debug: spank: opening plugin stack /etc/slurm/plugstack.conf
[2017-06-12T14:09:06.627] [5.0] debug2: After call to spank_init()
[2017-06-12T14:09:06.627] [5.0] debug: mpi type = (null)
[2017-06-12T14:09:06.627] [5.0] debug: mpi/none: slurmstepd prefork
[2017-06-12T14:09:06.627] [5.0] debug4: eio: handling events for 1 objects
[2017-06-12T14:09:06.628] [5.0] debug3: Called _msg_socket_readable
[2017-06-12T14:09:06.628] [5.0] debug2: Using gid list sent by slurmd
[2017-06-12T14:09:06.628] [5.0] debug4: eio: handling events for 1 objects
[2017-06-12T14:09:06.628] [5.0] debug3: Called _msg_socket_readable
[2017-06-12T14:09:06.682] [5.0] debug2: Entering _setup_normal_io
[2017-06-12T14:09:06.682] [5.0] debug4: eio: handling events for 1 objects
[2017-06-12T14:09:06.682] [5.0] debug3: Called _msg_socket_readable
[2017-06-12T14:09:06.682] [5.0] debug2: Using gid list sent by slurmd
[2017-06-12T14:09:06.682] [5.0] debug4: eio: handling events for 1 objects
[2017-06-12T14:09:06.682] [5.0] debug3: Called _msg_socket_readable
[2017-06-12T14:09:06.682] [5.0] debug: stdin uses a pty object
[2017-06-12T14:09:06.682] [5.0] debug4: eio: handling events for 1 objects
[2017-06-12T14:09:06.682] [5.0] debug3: Called _msg_socket_readable
[2017-06-12T14:09:06.683] [5.0] debug4: eio: handling events for 1 objects
[2017-06-12T14:09:06.683] [5.0] debug3: Called _msg_socket_readable
[2017-06-12T14:09:06.683] [5.0] debug: init pty size 53:161
[2017-06-12T14:09:06.683] [5.0] in _window_manager
[2017-06-12T14:09:06.683] [5.0] debug4: adding IO connection (logical node rank 0)
[2017-06-12T14:09:06.683] [5.0] debug4: connecting IO back to 10.30.0.42:57166
[2017-06-12T14:09:06.684] [5.0] debug2: Entering io_init_msg_write_to_fd
[2017-06-12T14:09:06.684] [5.0] debug2: msg->nodeid = 0
[2017-06-12T14:09:06.684] [5.0] debug2: Leaving io_init_msg_write_to_fd
[2017-06-12T14:09:06.684] [5.0] debug2: Leaving _setup_normal_io
[2017-06-12T14:09:06.684] [5.0] debug level = 2
[2017-06-12T14:09:06.684] [5.0] debug: IO handler started pid=9825
[2017-06-12T14:09:06.684] [5.0] debug4: eio: handling events for 3 objects
[2017-06-12T14:09:06.684] [5.0] debug2: Using gid list sent by slurmd
[2017-06-12T14:09:06.684] [5.0] debug4: eio: handling events for 1 objects
[2017-06-12T14:09:06.684] [5.0] debug3: Called _msg_socket_readable
[2017-06-12T14:09:06.693] [5.0] starting 1 tasks
[2017-06-12T14:09:06.693] [5.0] task 0 (9831) started 2017-06-12T14:09:06
[2017-06-12T14:09:06.693] [5.0] debug3: Unblocking 5.0 task 0, writefd = 17
[2017-06-12T14:09:06.693] [5.0] debug2: read_slurm_cgroup_conf: No cgroup.conf file (/etc/slurm/cgroup.conf)
[2017-06-12T14:09:06.693] [5.0] debug2: xcgroup_load: unable to get cgroup '(null)/cpuset' entry '(null)/cpuset/system' properties: No such file or directory
[2017-06-12T14:09:06.693] [5.0] debug: task_p_pre_launch_priv: 5.0
[2017-06-12T14:09:06.693] [5.0] debug2: xcgroup_load: unable to get cgroup '(null)/memory' entry '(null)/memory/system' properties: No such file or directory
[2017-06-12T14:09:06.693] [5.0] debug: Sending launch resp rc=0
[2017-06-12T14:09:06.693] [5.0] debug2: Using gid list sent by slurmd
[2017-06-12T14:09:06.693] [5.0] debug3: login_tty good
[2017-06-12T14:09:06.694] [5.0] debug3: slurm_send_only_node_msg: sent 208
[2017-06-12T14:09:06.694] [5.0] debug: mpi type = (null)
[2017-06-12T14:09:06.694] [5.0] debug: Using mpi/none
[2017-06-12T14:09:06.694] [5.0] debug: task_p_pre_launch: 5.0, task 0
[2017-06-12T14:09:06.694] [5.0] debug2: _set_limit: conf setrlimit RLIMIT_CPU no change in value: 18446744073709551615
[2017-06-12T14:09:06.694] [5.0] debug2: _set_limit: conf setrlimit RLIMIT_FSIZE no change in value: 18446744073709551615
[2017-06-12T14:09:06.694] [5.0] debug2: _set_limit: conf setrlimit RLIMIT_DATA no change in value: 18446744073709551615
[2017-06-12T14:09:06.694] [5.0] debug2: _set_limit: conf setrlimit RLIMIT_STACK no change in value: 18446744073709551615
[2017-06-12T14:09:06.694] [5.0] debug2: _set_limit: RLIMIT_CORE : max:inf cur:inf req:0
[2017-06-12T14:09:06.694] [5.0] debug2: _set_limit: conf setrlimit RLIMIT_CORE succeeded
[2017-06-12T14:09:06.694] [5.0] debug2: _set_limit: conf setrlimit RLIMIT_RSS no change in value: 18446744073709551615
[2017-06-12T14:09:06.694] [5.0] debug2: _set_limit: conf setrlimit RLIMIT_NPROC no change in value: 18446744073709551615
[2017-06-12T14:09:06.694] [5.0] debug2: _set_limit: conf setrlimit RLIMIT_NOFILE no change in value: 65536
[2017-06-12T14:09:06.694] [5.0] debug2: _set_limit: conf setrlimit RLIMIT_MEMLOCK no change in value: 18446744073709551615
[2017-06-12T14:09:06.694] [5.0] debug2: _set_limit: conf setrlimit RLIMIT_AS no change in value: 18446744073709551615
[2017-06-12T14:09:06.694] [5.0] debug2: Set task rss(128000 MB)
[2017-06-12T14:09:06.945] [5.0] debug4: Entering _task_read for obj 2139cc0
[2017-06-12T14:09:06.945] [5.0] debug4: _task_build_message: Entering...
[2017-06-12T14:09:06.945] [5.0] debug4: Leaving _task_build_message
[2017-06-12T14:09:06.945] [5.0] debug4: eio: handling events for 3 objects
[2017-06-12T14:09:06.945] [5.0] debug4: Entering _task_read for obj 2139cc0
[2017-06-12T14:09:06.945] [5.0] debug4: _task_build_message: Entering...
[2017-06-12T14:09:06.945] [5.0] debug4: Leaving _task_build_message
[2017-06-12T14:09:06.945] [5.0] debug4: eio: handling events for 3 objects
[2017-06-12T14:09:06.945] [5.0] debug4: Entering _task_read for obj 2139cc0
[2017-06-12T14:09:06.945] [5.0] debug4: _task_build_message: Entering...
[2017-06-12T14:09:06.945] [5.0] debug4: Leaving _task_build_message
[2017-06-12T14:09:06.945] [5.0] debug4: eio: handling events for 3 objects
[2017-06-12T14:09:32.008] [5.0] debug4: Entering _client_read
[2017-06-12T14:09:32.008] [5.0] debug3: Entering io_hdr_read_fd
[2017-06-12T14:09:32.008] [5.0] debug3: _full_read (_client_read) got eof
[2017-06-12T14:09:32.008] [5.0] debug3: Leaving io_hdr_read_fd
[2017-06-12T14:09:32.008] [5.0] debug4: eio: handling events for 3 objects
[2017-06-12T14:09:32.010] debug3: in the service_connection
[2017-06-12T14:09:32.011] [5.0] debug: _slurm_recv_timeout at 0 of 4, recv zero bytes
[2017-06-12T14:09:32.011] debug2: got this type of message 6011
[2017-06-12T14:09:32.011] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2017-06-12T14:09:32.011] debug: _rpc_terminate_job, uid = 493
[2017-06-12T14:09:32.011] debug: task_p_slurmd_release_resources: 5
[2017-06-12T14:09:32.011] debug3: state for jobid 2850: ctime:1497018336 revoked:0 expires:0
[2017-06-12T14:09:32.011] debug3: state for jobid 5: ctime:1497294520 revoked:0 expires:0
[2017-06-12T14:09:32.011] debug3: state for jobid 5: ctime:1497294520 revoked:0 expires:0
[2017-06-12T14:09:32.011] debug: credential for job 5 revoked
[2017-06-12T14:09:32.011] debug4: found jobid = 5, stepid = 0
[2017-06-12T14:09:32.011] [5.0] debug3: Called _msg_socket_accept
[2017-06-12T14:09:32.011] [5.0] debug3: Leaving _msg_socket_accept
[2017-06-12T14:09:32.011] [5.0] debug4: eio: handling events for 1 objects
[2017-06-12T14:09:32.011] [5.0] debug3: Called _msg_socket_readable
[2017-06-12T14:09:32.011] [5.0] debug3: Entering _handle_accept (new thread)
[2017-06-12T14:09:32.011] [5.0] debug3: Identity: uid=0, gid=0
[2017-06-12T14:09:32.011] [5.0] debug3: Entering _handle_request
[2017-06-12T14:09:32.011] debug2: container signal 18 to job 5.0
[2017-06-12T14:09:32.011] [5.0] debug3: Got request 4
[2017-06-12T14:09:32.011] [5.0] debug: Handling REQUEST_SIGNAL_CONTAINER
[2017-06-12T14:09:32.011] [5.0] debug: _handle_signal_container for step=5.0 uid=0 signal=18
[2017-06-12T14:09:32.012] [5.0] Sent signal 18 to 5.0
[2017-06-12T14:09:32.012] [5.0] debug3: Leaving _handle_request: SLURM_SUCCESS
[2017-06-12T14:09:32.012] [5.0] debug3: Entering _handle_request
[2017-06-12T14:09:32.012] [5.0] debug3: Leaving _handle_accept
[2017-06-12T14:09:32.012] debug4: found jobid = 5, stepid = 0
[2017-06-12T14:09:32.012] [5.0] debug3: Called _msg_socket_accept
[2017-06-12T14:09:32.012] [5.0] debug3: Leaving _msg_socket_accept
[2017-06-12T14:09:32.012] [5.0] debug4: eio: handling events for 1 objects
[2017-06-12T14:09:32.012] [5.0] debug3: Called _msg_socket_readable
[2017-06-12T14:09:32.012] [5.0] debug3: Entering _handle_accept (new thread)
[2017-06-12T14:09:32.012] [5.0] debug3: Identity: uid=0, gid=0
[2017-06-12T14:09:32.012] [5.0] debug3: Entering _handle_request
[2017-06-12T14:09:32.012] debug2: container signal 15 to job 5.0
[2017-06-12T14:09:32.012] [5.0] debug3: Got request 4
[2017-06-12T14:09:32.012] [5.0] debug: Handling REQUEST_SIGNAL_CONTAINER
[2017-06-12T14:09:32.012] [5.0] debug: _handle_signal_container for step=5.0 uid=0 signal=15
[2017-06-12T14:09:32.012] [5.0] error: *** STEP 5.0 ON dh0002 CANCELLED AT 2017-06-12T14:09:32 ***
[2017-06-12T14:09:32.012] [5.0] Sent signal 15 to 5.0
[2017-06-12T14:09:32.012] [5.0] debug3: Leaving _handle_request: SLURM_SUCCESS
[2017-06-12T14:09:32.012] [5.0] debug3: Entering _handle_request
[2017-06-12T14:09:32.012] debug4: sent SUCCESS

We have also seen:

Error connecting slurm stream socket at 10.30.0.42:45967: Connection refused

(where the IP is that of the submit host)

When running srun under strace, I do see it trying to bind to privileged ports and failing:

10636 bind(4, {sa_family=AF_INET, sin_port=htons(636), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EACCES (Permission denied)
10636 bind(8, {sa_family=AF_INET, sin_port=htons(637), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EACCES (Permission denied)
10636 bind(8, {sa_family=AF_INET, sin_port=htons(638), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EACCES (Permission denied)
10636 bind(9, {sa_family=AF_INET, sin_port=htons(639), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EACCES (Permission denied)
10636 bind(9, {sa_family=AF_INET, sin_port=htons(640), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EACCES (Permission denied)
10636 bind(15, {sa_family=AF_INET, sin_port=htons(641), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EACCES (Permission denied)

Here is the slurm.conf:

ClusterName=cluster
ControlMachine=sched01
ControlAddr=10.30.4.41
SlurmUser=slurm
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/etc/slurm/state
SlurmdSpoolDir=/var/spool/slurm/jobs
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
PluginDir=/usr/lib64/slurm
CacheGroups=0
ReturnToService=0
Prolog=/var/spool/slurm/prolog_and_epilog/bin/prolog
Epilog=/var/spool/slurm/prolog_and_epilog/bin/epilog
UsePAM=1
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
SchedulerType=sched/backfill
SchedulerParameters=bf_max_job_start=1,sched_max_job_start=1,preempt_youngest_first
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
FastSchedule=1
PreemptType=preempt/qos
PreemptMode=requeue
SlurmctldDebug=8
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=8
SlurmdLogFile=/var/spool/slurm/slurmd.log
JobCompType=jobcomp/none
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=sched01
AccountingStorageLoc=slurm_acct_db
AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStoragePort=6819
AccountingStorageUser=slurm
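For what it's worth, the "Socket timed out on send/recv operation" pattern above is the generic client-side symptom of a request whose reply never arrives: the send succeeds, then the receive hits its timeout. A minimal sketch of that mechanism (plain Python sockets, nothing slurm-specific; the short timeout here just stands in for slurm's message timeout):

```python
import socket

def timed_request(timeout=0.5):
    """Send a request to a listener that never answers, then wait for a reply."""
    # A listening socket that accepts into its backlog but never responds.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))          # port 0 = any free port
    server.listen(1)
    port = server.getsockname()[1]

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(timeout)
    try:
        client.connect(("127.0.0.1", port))
        client.sendall(b"request")          # the send itself succeeds...
        client.recv(1024)                   # ...the reply never comes
        return "got reply"
    except socket.timeout:
        return "Socket timed out on send/recv operation"
    finally:
        client.close()
        server.close()

print(timed_request())
```

This is only an illustration of the failure mode, not slurm code, but it matches what the slurmd log shows: the launch RPC arrives and the step starts, while srun has already given up waiting for the response.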
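And on the strace output: EACCES from bind() on ports 636-641 is the normal Linux behavior for an unprivileged process, since ports below 1024 require root or CAP_NET_BIND_SERVICE. The odd part is that srun is asking for reserved ports at all. A quick generic check (again plain Python, not slurm code) that reproduces the same errno split:

```python
import errno
import socket

def can_bind(port):
    """Return True if this process can bind a TCP socket to the given port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("0.0.0.0", port))
        return True
    except OSError as e:
        if e.errno == errno.EACCES:   # same EACCES as in the strace above
            return False
        raise                          # e.g. EADDRINUSE is a different problem
    finally:
        s.close()

print(can_bind(0))      # port 0 = any free ephemeral port; should work
print(can_bind(636))    # False unless run as root / with CAP_NET_BIND_SERVICE
```

So the bind failures themselves are expected for a non-root user; the question is why srun walks down into the privileged range instead of using ephemeral ports.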
