Moe, I have additional data. If I turn the faulty server into a backup controller with SlurmctldPort=6817, it throws the following errors:

error: Invalid RPC received 65534 while in standby mode
debug2: _slurm_send_timeout: Socket no longer there
error: slurm_msg_sendto: address:port=0.0.0.0:0 msg_type=8001: Transport endpoint is not connected
debug: _slurm_recv_timeout at 0 of 4, recv zero bytes
error: slurm_receive_msg: Zero Bytes were transmitted or received
error: slurm_receive_msg: Zero Bytes were transmitted or received

With SlurmctldPort=6816, on the other hand, it works smoothly. I cannot find any other process using port 6817 at the same time. So why does this happen? Any ideas? Please give me something to work with.

Regards

On Tue, 2011-10-04 at 09:14 -0600, Moe Jette wrote:
> Judging from the log, the slurmd daemon is able to send a message to
> the slurmctld daemon:
> [2011-10-04T14:19:27] debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
> [2011-10-04T14:19:27] debug: validate_node_specs: node jff has registered
> [2011-10-04T14:19:27] debug2: _slurm_rpc_node_registration complete for jff usec=51
>
> But messages going the other way are being blocked (to port 6817). I
> would suggest that you try another value for SlurmctldPort in
> slurm.conf and try again. Also, you are running slurmd as user root
> and slurmctld as user "slurm", correct? If not, you need to configure
> SlurmdUser and/or SlurmUser. If SlurmdUser is not root, it will not be
> able to set the user id to run jobs as other users.
>
> Moe Jette
>
>
> Quoting Ramiro Alba <r...@cttc.upc.edu>:
>
> > Sten,
> >
> > There you are (as attachments)
> >
> > Cheers
> >
> > On Tue, 2011-10-04 at 14:13 +0200, Sten Wolf wrote:
> >> SlurmctldDebug=3
> >> SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
> >> SlurmdDebug=3
> >> SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
> >>
> >> Can you change the debug level to 7 for both ctld and node, and provide
> >> the contents of both log files?
> >>
> >> On 04/10/2011 14:09, Ramiro Alba wrote:
> >> > Andy,
> >> >
> >> > On Tue, 2011-10-04 at 08:00 -0400, Andy Riebs wrote:
> >> > > Hi Ramiro,
> >> > >
> >> > > You might check to ensure that all of your clocks are in sync (varying
> >> > > by no more than a minute or two).
> >> > No, they are not:
> >> >
> >> > root@jff:~# date; rsh jffmds date
> >> > Tue Oct 4 14:03:52 CEST 2011
> >> > Tue Oct 4 14:03:42 CEST 2011
> >> >
> >> > I am using Lustre and this is a MUST.
> >> >
> >> > Furthermore, the issue is also present in the simplest case (one node
> >> > acting as a controller and compute)
> >> >
> >> > > Andy
> >> > >
> >> > > On 10/04/2011 07:56 AM, Ramiro Alba wrote:
> >> > > > Sten,
> >> > > >
> >> > > > On Tue, 2011-10-04 at 13:45 +0200, Sten Wolf wrote:
> >> > > > > Did you create a munge key?
> >> > > > >
> >> > > > Yes, I did. See local test:
> >> > > >
> >> > > > # munge -n | unmunge
> >> > > >
> >> > > > STATUS: Success (0)
> >> > > > ENCODE_HOST: jff.cttc-jffeth.org (10.2.254.1)
> >> > > > ENCODE_TIME: 2011-10-04 13:54:19 (1317729259)
> >> > > > DECODE_TIME: 2011-10-04 13:54:19 (1317729259)
> >> > > > TTL: 300
> >> > > > CIPHER: aes128 (4)
> >> > > > MAC: sha1 (3)
> >> > > > ZIP: none (0)
> >> > > > UID: root (0)
> >> > > > GID: root (0)
> >> > > > LENGTH: 0
> >> > > >
> >> > > > > On 04/10/2011 13:28, Ramiro Alba wrote:
> >> > > > > > Hi all,
> >> > > > > >
> >> > > > > > I am trying to set up a slurm controller (2.2.7) on Ubuntu 10.04 on
> >> > > > > > a cluster server and even with a simple slurm.conf (see attached file) the
> >> > > > > > 'slurmctld' daemon continuously writes to the log file:
> >> > > > > >
> >> > > > > > debug: _slurm_recv_timeout at 0 of 4, recv zero bytes
> >> > > > > > error: slurm_receive_msg: Zero Bytes were transmitted or received
> >> > > > > > error: slurm_receive_msg: Zero Bytes were transmitted or received
> >> > > > > >
> >> > > > > > You can see in 'slurm.conf' that the same node acts as a controller and
> >> > > > > > as a compute node. Jobs can be submitted.
> >> > > > > >
> >> > > > > > Any other cluster node/server (apparently having the same/similar
> >> > > > > > hardware and the same operating system) works smoothly without any error
> >> > > > > > acting either as a controller or as a backup controller.
> >> > > > > >
> >> > > > > > Can anyone give me some idea what to look at, so as to suppress those
> >> > > > > > error messages?
> >> > > > > > I've looked at the mailing list for similar messages but none was of
> >> > > > > > help.
> >> > > > > >
> >> > > > > --
> >> > > > > This message has been scanned by MailScanner
> >> > > > > for viruses and other dangerous content,
> >> > > > > and is believed to be clean.
> >> > > --
> >> > > Andy Riebs
> >> > > Hewlett-Packard Company
> >> > > High Performance Computing
> >> > > +1-786-263-9743
> >> > > My opinions are not necessarily those of HP
> >> > >
> >>
> >
> > --
> > Ramiro Alba
> >
> > Centre Tecnològic de Tranferència de Calor
> > http://www.cttc.upc.edu
> >
> > Escola Tècnica Superior d'Enginyeries
> > Industrial i Aeronàutica de Terrassa
> > Colom 11, E-08222, Terrassa, Barcelona, Spain
> > Tel: (+34) 93 739 86 46
> >

--
Ramiro Alba

Centre Tecnològic de Tranferència de Calor
http://www.cttc.upc.edu

Escola Tècnica Superior d'Enginyeries
Industrial i Aeronàutica de Terrassa
Colom 11, E-08222, Terrassa, Barcelona, Spain
Tel: (+34) 93 739 86 46
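
A note on checking port 6817: the two questions in this thread (does another process already hold the port, and are connections to it being blocked?) can be tested separately from Slurm. The following is only a minimal sketch using Python's standard socket module. The helper names `can_bind` and `can_connect` are hypothetical and not part of Slurm, and the sketch only tests plain TCP, so it cannot say anything about what slurmctld itself does with the port.

```python
import socket


def can_bind(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to (host, port).

    A False result usually means another socket already holds the port
    (or, for ports below 1024, that the caller lacks privileges).
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
        return True


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds in time.

    A False result means the connection was refused, timed out, or the
    host could not be reached (for example because of a firewall rule).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

As a rough guide, and assuming the host names from this thread: with slurmctld stopped on the controller, `can_bind(6817)` returning False would suggest that something else holds the port; with slurmctld running, `can_connect("jff", 6817)` returning False from another node would point toward a firewall or routing problem rather than Slurm.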