Ten days ago we started using slurm controller prologs and epilogs (PrologSlurmctld and EpilogSlurmctld) to talk to the Gold accounting database. Since then, slurmctld has failed to restart three times at the nightly log rotation, with the following error messages in the slurmctld log:

[2013-09-20T04:50:16+02:00] error: Error binding slurm stream socket: Address already in use
[2013-09-20T04:50:16+02:00] fatal: slurm_init_msg_engine_addrname_port error Address already in use

(See below for a longer excerpt of the log file.) Slurmdbd restarts without problems. I cannot recall seeing this error before, at least not for a long time.

We are running slurm 2.5.6 on a Rocks (CentOS) 6.2 cluster.

What causes this error? Is it prolog or epilog processes that are still running when slurmctld starts, and are preventing it from binding to a socket? (Talking to Gold is slow, so the prologs/epilogs can take several seconds to finish.) Or is it something completely different?

If there is any more information I should provide, please let me know.

Here is the top of the slurmctld log file:

[2013-09-20T04:50:13+02:00] pidfile not locked, assuming no running daemon
[2013-09-20T04:50:13+02:00] debug: sched: slurmctld starting
[2013-09-20T04:50:13+02:00] Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
[2013-09-20T04:50:13+02:00] auth plugin for Munge (http://code.google.com/p/munge/) loaded
[2013-09-20T04:50:13+02:00] debug: slurmdbd: Sent DbdInit msg
[2013-09-20T04:50:13+02:00] slurmdbd: recovered 11 pending RPCs
[2013-09-20T04:50:15+02:00] slurmctld version 2.5.6 started on cluster abel
[2013-09-20T04:50:15+02:00] Munge cryptographic signature plugin loaded
...
[2013-09-20T04:50:16+02:00] Recovered state of 3 reservations
[2013-09-20T04:50:16+02:00] State of 2 triggers recovered
[2013-09-20T04:50:16+02:00] read_slurm_conf: backup_controller not specified.
[2013-09-20T04:50:16+02:00] cons_res: select_p_reconfigure
[2013-09-20T04:50:16+02:00] cons_res: select_p_node_init
[2013-09-20T04:50:16+02:00] cons_res: preparing for 5 partitions
[2013-09-20T04:50:16+02:00] Warning: Note very large processing time from read_slurm_conf: usec=1128370
[2013-09-20T04:50:16+02:00] Running as primary controller
[2013-09-20T04:50:16+02:00] Registering slurmctld at port 6817 with slurmdbd.
[2013-09-20T04:50:16+02:00] debug: Priority MULTIFACTOR plugin loaded
[2013-09-20T04:50:16+02:00] debug: power_save module disabled, SuspendTime < 0
[2013-09-20T04:50:16+02:00] error: Error binding slurm stream socket: Address already in use
[2013-09-20T04:50:16+02:00] fatal: slurm_init_msg_engine_addrname_port error Address already in use

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
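
One rough way to test whether something is still holding the slurmctld port at restart time is to try binding to it just before slurmctld is started. The following is only a sketch under assumptions: the function name port_in_use is made up for illustration, Python is assumed to be available on the controller host, and 6817 is the port shown in the log above. It reports only whether a bind would fail with "Address already in use", not which process holds the port.

```python
import errno
import socket


def port_in_use(port, host="0.0.0.0"):
    """Return True if binding a TCP socket to (host, port) fails with EADDRINUSE.

    Hypothetical helper for illustration; it is not part of slurm.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # SO_REUSEADDR keeps old connections in TIME_WAIT from counting as
        # "in use"; a socket that is still listening on the port will still
        # make the bind fail.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind((host, port))
    except OSError as exc:
        if exc.errno == errno.EADDRINUSE:
            return True
        raise  # any other error (e.g. permissions) is not what we test for
    finally:
        sock.close()
    return False


# Example (assumed port from the log excerpt): port_in_use(6817)
```

If this returns True after the old slurmctld has exited, some other process still has the port open, and a tool such as lsof or netstat could then be used to find out which one.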
