'Fixed it', ish. One of our prolog scripts does a 'sleep 10' after dropping caches, before checking free memory again. Somehow, just that sleep line kills all srun jobs.
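For reference, the relevant part of the prolog looks roughly like this (a minimal sketch, not our exact script; the memory threshold and the drop_caches value here are illustrative):

#!/bin/bash
# Flush dirty pages, then drop the page cache, dentries, and inodes
# before the job lands on the node.
sync
echo 3 > /proc/sys/vm/drop_caches
# Give the kernel a moment to settle before re-reading memory stats.
# This is the sleep that kills srun: the prolog holds its lock the
# whole time (note "prolog with lock for job 5 ran for 26 seconds"
# in the slurmd log below).
sleep 10
# Re-check available memory; a non-zero exit from a Prolog script
# drains the node.
avail_kb=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
[ "$avail_kb" -ge 1048576 ] || exit 1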
I did add PrologEpilogTimeout=120, but that didn't help at all, so I just removed the sleep. It looks like it may be related to this bug: https://bugs.schedmd.com/show_bug.cgi?id=2214

On Mon, Jun 12, 2017 at 2:27 PM, Michael Senizaiz <[email protected]> wrote:
> SRUN fails on all hosts. All other features work, and MPI jobs run fine as long as we bypass SRUN.
>
> Error:
>
> srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
> srun: error: Task launch for 5.0 failed on node node002: Socket timed out on send/recv operation
> srun: error: Application launch failed: Socket timed out on send/recv operation
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>
> This happens on single- and multi-node jobs. All hosts use DNS, so there shouldn't be any weird name mismatches.
> SELinux is DISABLED.
>
> Log (/var/spool/slurm/slurmd.log on node002):
> [2017-06-12T14:08:40.654] debug2: Processing RPC: REQUEST_LAUNCH_TASKS
> [2017-06-12T14:08:40.654] launch task 5.0 request from [email protected] (port 7638)
> [2017-06-12T14:08:40.654] debug3: state for jobid 2850: ctime:1497018336 revoked:0 expires:0
> [2017-06-12T14:08:40.654] debug3: state for jobid 2882: ctime:1497023548 revoked:1497023548 expires:1497023548
> [2017-06-12T14:08:40.654] debug3: destroying job 2882 state
> [2017-06-12T14:08:40.654] debug3: state for jobid 2886: ctime:1497023588 revoked:1497023588 expires:1497023588
> [2017-06-12T14:08:40.654] debug3: destroying job 2886 state
> [2017-06-12T14:08:40.654] debug3: state for jobid 2892: ctime:1497023659 revoked:1497023659 expires:1497023659
> [2017-06-12T14:08:40.654] debug3: destroying job 2892 state
> [2017-06-12T14:08:40.654] debug: Checking credential with 276 bytes of sig data
> [2017-06-12T14:08:40.654] debug: task_p_slurmd_launch_request: 5.0 0
> [2017-06-12T14:08:40.655] debug: [job 5] attempting to run prolog [/var/spool/slurm/prolog_and_epilog/bin/prolog]
> [2017-06-12T14:09:06.467] _run_prolog: run job script took usec=25811904
> [2017-06-12T14:09:06.467] _run_prolog: prolog with lock for job 5 ran for 26 seconds
> [2017-06-12T14:09:06.467] prolog for job 5 ran for 26 seconds
> [2017-06-12T14:09:06.467] debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
> [2017-06-12T14:09:06.467] debug3: slurmstepd rank 0 (dh0002), parent rank -1 (NONE), children 0, depth 0, max_depth 0
> [2017-06-12T14:09:06.467] debug3: _send_slurmstepd_init: call to getpwuid_r
> [2017-06-12T14:09:06.467] debug3: _send_slurmstepd_init: return from getpwuid_r
> [2017-06-12T14:09:06.467] debug2: Cached group access list for root/0
> [2017-06-12T14:09:06.580] debug2: debug level is 8.
> [2017-06-12T14:09:06.580] debug3: Trying to load plugin /usr/lib64/slurm/switch_none.so
> [2017-06-12T14:09:06.580] debug: switch NONE plugin loaded
> [2017-06-12T14:09:06.580] debug3: Success.
> [2017-06-12T14:09:06.580] debug3: slurmstepd rank 0, parent address = 0.0.0.0, port = 0
> [2017-06-12T14:09:06.591] debug2: Received CPU frequency information for 24 CPUs
> [2017-06-12T14:09:06.597] debug2: got gid 0
> [2017-06-12T14:09:06.597] debug2: got gid 1
> [2017-06-12T14:09:06.597] debug2: got gid 2
> [2017-06-12T14:09:06.597] debug2: got gid 3
> [2017-06-12T14:09:06.597] debug2: got gid 4
> [2017-06-12T14:09:06.597] debug2: got gid 6
> [2017-06-12T14:09:06.597] debug2: got gid 10
> [2017-06-12T14:09:06.597] debug2: setup for a launch_task
> [2017-06-12T14:09:06.597] debug3: entering stepd_step_rec_create
> [2017-06-12T14:09:06.602] debug3: Trying to load plugin /usr/lib64/slurm/acct_gather_profile_none.so
> [2017-06-12T14:09:06.602] debug: AcctGatherProfile NONE plugin loaded
> [2017-06-12T14:09:06.602] debug3: Success.
> [2017-06-12T14:09:06.602] debug3: Trying to load plugin /usr/lib64/slurm/acct_gather_energy_none.so
> [2017-06-12T14:09:06.606] debug: AcctGatherEnergy NONE plugin loaded
> [2017-06-12T14:09:06.606] debug3: Success.
> [2017-06-12T14:09:06.606] debug3: Trying to load plugin /usr/lib64/slurm/acct_gather_infiniband_none.so
> [2017-06-12T14:09:06.609] debug: AcctGatherInfiniband NONE plugin loaded
> [2017-06-12T14:09:06.609] debug3: Success.
> [2017-06-12T14:09:06.609] debug3: Trying to load plugin /usr/lib64/slurm/acct_gather_filesystem_none.so
> [2017-06-12T14:09:06.609] debug: AcctGatherFilesystem NONE plugin loaded
> [2017-06-12T14:09:06.609] debug3: Success.
> [2017-06-12T14:09:06.609] debug2: No acct_gather.conf file (/etc/slurm/acct_gather.conf)
> [2017-06-12T14:09:06.609] debug3: Trying to load plugin /usr/lib64/slurm/jobacct_gather_none.so
> [2017-06-12T14:09:06.609] debug: Job accounting gather NOT_INVOKED plugin loaded
> [2017-06-12T14:09:06.609] debug3: Success.
> [2017-06-12T14:09:06.609] debug3: jobacct_gather dynamic logging enabled
> [2017-06-12T14:09:06.609] debug3: acct_gather_profile_startpoll dynamic logging enabled
> [2017-06-12T14:09:06.609] debug2: profile signalling type Task
> [2017-06-12T14:09:06.609] debug2: scaling CPU count by factor of 2 (24/(12-0)
> [2017-06-12T14:09:06.609] debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
> [2017-06-12T14:09:06.609] [5.0] debug: Message thread started pid = 9825
> [2017-06-12T14:09:06.609] [5.0] debug4: eio: handling events for 1 objects
> [2017-06-12T14:09:06.609] [5.0] debug3: Called _msg_socket_readable
> [2017-06-12T14:09:06.609] [5.0] debug3: Entered job_manager for 5.0 pid=9825
> [2017-06-12T14:09:06.609] [5.0] debug3: Trying to load plugin /usr/lib64/slurm/core_spec_none.so
> [2017-06-12T14:09:06.609] debug2: slurm_send_timeout: Socket no longer there
> [2017-06-12T14:09:06.609] debug3: slurm_msg_sendto: peer has disappeared for msg_type=8001
> [2017-06-12T14:09:06.609] error: _rpc_launch_tasks: unable to send return code to address:port=10.30.0.42:54813 msg_type=6001: Transport endpoint is not connected
> [2017-06-12T14:09:06.617] [5.0] debug3: Success.
> [2017-06-12T14:09:06.617] [5.0] debug3: Trying to load plugin /usr/lib64/slurm/task_none.so
> [2017-06-12T14:09:06.617] [5.0] debug: task NONE plugin loaded
> [2017-06-12T14:09:06.617] [5.0] debug3: Success.
> [2017-06-12T14:09:06.617] [5.0] debug3: Trying to load plugin /usr/lib64/slurm/proctrack_pgid.so
> [2017-06-12T14:09:06.617] [5.0] debug3: Success.
> [2017-06-12T14:09:06.617] [5.0] debug3: Trying to load plugin /usr/lib64/slurm/checkpoint_none.so
> [2017-06-12T14:09:06.622] [5.0] debug3: Success.
> [2017-06-12T14:09:06.622] [5.0] debug: Checkpoint plugin loaded: checkpoint/none
> [2017-06-12T14:09:06.622] [5.0] debug3: Trying to load plugin /usr/lib64/slurm/crypto_munge.so
> [2017-06-12T14:09:06.627] [5.0] Munge cryptographic signature plugin loaded
> [2017-06-12T14:09:06.627] [5.0] debug3: Success.
> [2017-06-12T14:09:06.627] [5.0] debug3: Trying to load plugin /usr/lib64/slurm/job_container_none.so
> [2017-06-12T14:09:06.627] [5.0] debug: job_container none plugin loaded
> [2017-06-12T14:09:06.627] [5.0] debug3: Success.
> [2017-06-12T14:09:06.627] [5.0] debug: mpi type = none
> [2017-06-12T14:09:06.627] [5.0] debug3: Trying to load plugin /usr/lib64/slurm/mpi_none.so
> [2017-06-12T14:09:06.627] [5.0] debug3: Success.
> [2017-06-12T14:09:06.627] [5.0] debug2: Before call to spank_init()
> [2017-06-12T14:09:06.627] [5.0] debug: spank: opening plugin stack /etc/slurm/plugstack.conf
> [2017-06-12T14:09:06.627] [5.0] debug2: After call to spank_init()
> [2017-06-12T14:09:06.627] [5.0] debug: mpi type = (null)
> [2017-06-12T14:09:06.627] [5.0] debug: mpi/none: slurmstepd prefork
> [2017-06-12T14:09:06.627] [5.0] debug4: eio: handling events for 1 objects
> [2017-06-12T14:09:06.628] [5.0] debug3: Called _msg_socket_readable
> [2017-06-12T14:09:06.628] [5.0] debug2: Using gid list sent by slurmd
> [2017-06-12T14:09:06.628] [5.0] debug4: eio: handling events for 1 objects
> [2017-06-12T14:09:06.628] [5.0] debug3: Called _msg_socket_readable
> [2017-06-12T14:09:06.682] [5.0] debug2: Entering _setup_normal_io
> [2017-06-12T14:09:06.682] [5.0] debug4: eio: handling events for 1 objects
> [2017-06-12T14:09:06.682] [5.0] debug3: Called _msg_socket_readable
> [2017-06-12T14:09:06.682] [5.0] debug2: Using gid list sent by slurmd
> [2017-06-12T14:09:06.682] [5.0] debug4: eio: handling events for 1 objects
> [2017-06-12T14:09:06.682] [5.0] debug3: Called _msg_socket_readable
> [2017-06-12T14:09:06.682] [5.0] debug: stdin uses a pty object
> [2017-06-12T14:09:06.682] [5.0] debug4: eio: handling events for 1 objects
> [2017-06-12T14:09:06.682] [5.0] debug3: Called _msg_socket_readable
> [2017-06-12T14:09:06.683] [5.0] debug4: eio: handling events for 1 objects
> [2017-06-12T14:09:06.683] [5.0] debug3: Called _msg_socket_readable
> [2017-06-12T14:09:06.683] [5.0] debug: init pty size 53:161
> [2017-06-12T14:09:06.683] [5.0] in _window_manager
> [2017-06-12T14:09:06.683] [5.0] debug4: adding IO connection (logical node rank 0)
> [2017-06-12T14:09:06.683] [5.0] debug4: connecting IO back to 10.30.0.42:57166
> [2017-06-12T14:09:06.684] [5.0] debug2: Entering io_init_msg_write_to_fd
> [2017-06-12T14:09:06.684] [5.0] debug2: msg->nodeid = 0
> [2017-06-12T14:09:06.684] [5.0] debug2: Leaving io_init_msg_write_to_fd
> [2017-06-12T14:09:06.684] [5.0] debug2: Leaving _setup_normal_io
> [2017-06-12T14:09:06.684] [5.0] debug level = 2
> [2017-06-12T14:09:06.684] [5.0] debug: IO handler started pid=9825
> [2017-06-12T14:09:06.684] [5.0] debug4: eio: handling events for 3 objects
> [2017-06-12T14:09:06.684] [5.0] debug2: Using gid list sent by slurmd
> [2017-06-12T14:09:06.684] [5.0] debug4: eio: handling events for 1 objects
> [2017-06-12T14:09:06.684] [5.0] debug3: Called _msg_socket_readable
> [2017-06-12T14:09:06.693] [5.0] starting 1 tasks
> [2017-06-12T14:09:06.693] [5.0] task 0 (9831) started 2017-06-12T14:09:06
> [2017-06-12T14:09:06.693] [5.0] debug3: Unblocking 5.0 task 0, writefd = 17
> [2017-06-12T14:09:06.693] [5.0] debug2: read_slurm_cgroup_conf: No cgroup.conf file (/etc/slurm/cgroup.conf)
> [2017-06-12T14:09:06.693] [5.0] debug2: xcgroup_load: unable to get cgroup '(null)/cpuset' entry '(null)/cpuset/system' properties: No such file or directory
> [2017-06-12T14:09:06.693] [5.0] debug: task_p_pre_launch_priv: 5.0
> [2017-06-12T14:09:06.693] [5.0] debug2: xcgroup_load: unable to get cgroup '(null)/memory' entry '(null)/memory/system' properties: No such file or directory
> [2017-06-12T14:09:06.693] [5.0] debug: Sending launch resp rc=0
> [2017-06-12T14:09:06.693] [5.0] debug2: Using gid list sent by slurmd
> [2017-06-12T14:09:06.693] [5.0] debug3: login_tty good
> [2017-06-12T14:09:06.694] [5.0] debug3: slurm_send_only_node_msg: sent 208
> [2017-06-12T14:09:06.694] [5.0] debug: mpi type = (null)
> [2017-06-12T14:09:06.694] [5.0] debug: Using mpi/none
> [2017-06-12T14:09:06.694] [5.0] debug: task_p_pre_launch: 5.0, task 0
> [2017-06-12T14:09:06.694] [5.0] debug2: _set_limit: conf setrlimit RLIMIT_CPU no change in value: 18446744073709551615
> [2017-06-12T14:09:06.694] [5.0] debug2: _set_limit: conf setrlimit RLIMIT_FSIZE no change in value: 18446744073709551615
> [2017-06-12T14:09:06.694] [5.0] debug2: _set_limit: conf setrlimit RLIMIT_DATA no change in value: 18446744073709551615
> [2017-06-12T14:09:06.694] [5.0] debug2: _set_limit: conf setrlimit RLIMIT_STACK no change in value: 18446744073709551615
> [2017-06-12T14:09:06.694] [5.0] debug2: _set_limit: RLIMIT_CORE : max:inf cur:inf req:0
> [2017-06-12T14:09:06.694] [5.0] debug2: _set_limit: conf setrlimit RLIMIT_CORE succeeded
> [2017-06-12T14:09:06.694] [5.0] debug2: _set_limit: conf setrlimit RLIMIT_RSS no change in value: 18446744073709551615
> [2017-06-12T14:09:06.694] [5.0] debug2: _set_limit: conf setrlimit RLIMIT_NPROC no change in value: 18446744073709551615
> [2017-06-12T14:09:06.694] [5.0] debug2: _set_limit: conf setrlimit RLIMIT_NOFILE no change in value: 65536
> [2017-06-12T14:09:06.694] [5.0] debug2: _set_limit: conf setrlimit RLIMIT_MEMLOCK no change in value: 18446744073709551615
> [2017-06-12T14:09:06.694] [5.0] debug2: _set_limit: conf setrlimit RLIMIT_AS no change in value: 18446744073709551615
> [2017-06-12T14:09:06.694] [5.0] debug2: Set task rss(128000 MB)
> [2017-06-12T14:09:06.945] [5.0] debug4: Entering _task_read for obj 2139cc0
> [2017-06-12T14:09:06.945] [5.0] debug4: _task_build_message: Entering...
> [2017-06-12T14:09:06.945] [5.0] debug4: Leaving _task_build_message
> [2017-06-12T14:09:06.945] [5.0] debug4: eio: handling events for 3 objects
> [2017-06-12T14:09:06.945] [5.0] debug4: Entering _task_read for obj 2139cc0
> [2017-06-12T14:09:06.945] [5.0] debug4: _task_build_message: Entering...
> [2017-06-12T14:09:06.945] [5.0] debug4: Leaving _task_build_message
> [2017-06-12T14:09:06.945] [5.0] debug4: eio: handling events for 3 objects
> [2017-06-12T14:09:06.945] [5.0] debug4: Entering _task_read for obj 2139cc0
> [2017-06-12T14:09:06.945] [5.0] debug4: _task_build_message: Entering...
> [2017-06-12T14:09:06.945] [5.0] debug4: Leaving _task_build_message
> [2017-06-12T14:09:06.945] [5.0] debug4: eio: handling events for 3 objects
> [2017-06-12T14:09:32.008] [5.0] debug4: Entering _client_read
> [2017-06-12T14:09:32.008] [5.0] debug3: Entering io_hdr_read_fd
> [2017-06-12T14:09:32.008] [5.0] debug3: _full_read (_client_read) got eof
> [2017-06-12T14:09:32.008] [5.0] debug3: Leaving io_hdr_read_fd
> [2017-06-12T14:09:32.008] [5.0] debug4: eio: handling events for 3 objects
> [2017-06-12T14:09:32.010] debug3: in the service_connection
> [2017-06-12T14:09:32.011] [5.0] debug: _slurm_recv_timeout at 0 of 4, recv zero bytes
> [2017-06-12T14:09:32.011] debug2: got this type of message 6011
> [2017-06-12T14:09:32.011] debug2: Processing RPC: REQUEST_TERMINATE_JOB
> [2017-06-12T14:09:32.011] debug: _rpc_terminate_job, uid = 493
> [2017-06-12T14:09:32.011] debug: task_p_slurmd_release_resources: 5
> [2017-06-12T14:09:32.011] debug3: state for jobid 2850: ctime:1497018336 revoked:0 expires:0
> [2017-06-12T14:09:32.011] debug3: state for jobid 5: ctime:1497294520 revoked:0 expires:0
> [2017-06-12T14:09:32.011] debug3: state for jobid 5: ctime:1497294520 revoked:0 expires:0
> [2017-06-12T14:09:32.011] debug: credential for job 5 revoked
> [2017-06-12T14:09:32.011] debug4: found jobid = 5, stepid = 0
> [2017-06-12T14:09:32.011] [5.0] debug3: Called _msg_socket_accept
> [2017-06-12T14:09:32.011] [5.0] debug3: Leaving _msg_socket_accept
> [2017-06-12T14:09:32.011] [5.0] debug4: eio: handling events for 1 objects
> [2017-06-12T14:09:32.011] [5.0] debug3: Called _msg_socket_readable
> [2017-06-12T14:09:32.011] [5.0] debug3: Entering _handle_accept (new thread)
> [2017-06-12T14:09:32.011] [5.0] debug3: Identity: uid=0, gid=0
> [2017-06-12T14:09:32.011] [5.0] debug3: Entering _handle_request
> [2017-06-12T14:09:32.011] debug2: container signal 18 to job 5.0
> [2017-06-12T14:09:32.011] [5.0] debug3: Got request 4
> [2017-06-12T14:09:32.011] [5.0] debug: Handling REQUEST_SIGNAL_CONTAINER
> [2017-06-12T14:09:32.011] [5.0] debug: _handle_signal_container for step=5.0 uid=0 signal=18
> [2017-06-12T14:09:32.012] [5.0] Sent signal 18 to 5.0
> [2017-06-12T14:09:32.012] [5.0] debug3: Leaving _handle_request: SLURM_SUCCESS
> [2017-06-12T14:09:32.012] [5.0] debug3: Entering _handle_request
> [2017-06-12T14:09:32.012] [5.0] debug3: Leaving _handle_accept
> [2017-06-12T14:09:32.012] debug4: found jobid = 5, stepid = 0
> [2017-06-12T14:09:32.012] [5.0] debug3: Called _msg_socket_accept
> [2017-06-12T14:09:32.012] [5.0] debug3: Leaving _msg_socket_accept
> [2017-06-12T14:09:32.012] [5.0] debug4: eio: handling events for 1 objects
> [2017-06-12T14:09:32.012] [5.0] debug3: Called _msg_socket_readable
> [2017-06-12T14:09:32.012] [5.0] debug3: Entering _handle_accept (new thread)
> [2017-06-12T14:09:32.012] [5.0] debug3: Identity: uid=0, gid=0
> [2017-06-12T14:09:32.012] [5.0] debug3: Entering _handle_request
> [2017-06-12T14:09:32.012] debug2: container signal 15 to job 5.0
> [2017-06-12T14:09:32.012] [5.0] debug3: Got request 4
> [2017-06-12T14:09:32.012] [5.0] debug: Handling REQUEST_SIGNAL_CONTAINER
> [2017-06-12T14:09:32.012] [5.0] debug: _handle_signal_container for step=5.0 uid=0 signal=15
> [2017-06-12T14:09:32.012] [5.0] error: *** STEP 5.0 ON dh0002 CANCELLED AT 2017-06-12T14:09:32 ***
> [2017-06-12T14:09:32.012] [5.0] Sent signal 15 to 5.0
> [2017-06-12T14:09:32.012] [5.0] debug3: Leaving _handle_request: SLURM_SUCCESS
> [2017-06-12T14:09:32.012] [5.0] debug3: Entering _handle_request
> [2017-06-12T14:09:32.012] debug4: sent SUCCESS
>
> We have also seen:
> *Error* connecting slurm stream socket at 10.30.0.42:45967: Connection refused (where the IP is that of the submit host)
>
> When doing an strace I do see srun trying to bind to privileged ports and failing:
>
> 10636 bind(4, {sa_family=AF_INET, sin_port=htons(636), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EACCES (Permission denied)
> 10636 bind(8, {sa_family=AF_INET, sin_port=htons(637), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EACCES (Permission denied)
> 10636 bind(8, {sa_family=AF_INET, sin_port=htons(638), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EACCES (Permission denied)
> 10636 bind(9, {sa_family=AF_INET, sin_port=htons(639), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EACCES (Permission denied)
> 10636 bind(9, {sa_family=AF_INET, sin_port=htons(640), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EACCES (Permission denied)
> 10636 bind(15, {sa_family=AF_INET, sin_port=htons(641), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EACCES (Permission denied)
>
> Here is the slurm.conf:
> ClusterName=cluster
> ControlMachine=sched01
> ControlAddr=10.30.4.41
> SlurmUser=slurm
> SlurmctldPort=6817
> SlurmdPort=6818
> AuthType=auth/munge
> StateSaveLocation=/etc/slurm/state
> SlurmdSpoolDir=/var/spool/slurm/jobs
> SwitchType=switch/none
> MpiDefault=none
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmdPidFile=/var/run/slurmd.pid
> ProctrackType=proctrack/pgid
> PluginDir=/usr/lib64/slurm
> CacheGroups=0
> ReturnToService=0
> Prolog=/var/spool/slurm/prolog_and_epilog/bin/prolog
> Epilog=/var/spool/slurm/prolog_and_epilog/bin/epilog
> UsePAM=1
> SlurmctldTimeout=300
> SlurmdTimeout=300
> InactiveLimit=0
> MinJobAge=300
> KillWait=30
> Waittime=0
> SchedulerType=sched/backfill
> SchedulerParameters=bf_max_job_start=1,sched_max_job_start=1,preempt_youngest_first
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core_Memory
> FastSchedule=1
> PreemptType=preempt/qos
> PreemptMode=requeue
> SlurmctldDebug=8
> SlurmctldLogFile=/var/log/slurm/slurmctld.log
> SlurmdDebug=8
> SlurmdLogFile=/var/spool/slurm/slurmd.log
> JobCompType=jobcomp/none
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStorageHost=sched01
> AccountingStorageLoc=slurm_acct_db
> AccountingStoragePass=/var/run/munge/munge.socket.2
> AccountingStoragePort=6819
> AccountingStorageUser=slurm
>
