Sten,

On Tue, 2011-10-04 at 15:04 +0200, Sten Wolf wrote:
> Ramiro,
>
> Another option might be iptables blocking port 6817. Just for the sake
> of testing - change auth/munge to auth/none (if you have ruled out
> iptables) and see if anything changes in the controller log.
>
debug2: slurmctld listening on 0.0.0.0:6817
debug3: _slurmctld_background pid = 19913
debug3: Trying to load plugin /usr/lib/slurm/auth_none.so
Null authentication plugin loaded
debug3: Success.
debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
debug: validate_node_specs: node jff has registered
debug2: _slurm_rpc_node_registration complete for jff usec=48
debug: _slurm_recv_timeout at 0 of 4, recv zero bytes
error: slurm_receive_msg: Zero Bytes were transmitted or received
error: slurm_receive_msg: Zero Bytes were transmitted or received

Things stay the same.

> BTW, according to the docs debug level 7 is highest.
>
> On 04/10/2011 14:44, Ramiro Alba wrote:
> > Sten,
> >
> > Yes. I did. Look at:
> >
> > root@jff:/var/log/slurm-llnl# scontrol show conf | grep -i debug
> > DebugFlags = (null)
> > SlurmctldDebug = 7
> > SlurmdDebug = 7
> >
> > I suppose you say that because you can only see debug3 tags in the logs, but
> > it is the same as using debug level 9. Does that make sense?
> >
> > Cheers
> >
> > On Tue, 2011-10-04 at 14:28 +0200, Sten Wolf wrote:
> > > Hi,
> > >
> > > did you remember to restart the slurm service? you might need to
> > > "service slurm stop ; service slurm start", as "service slurm
> > > restart" doesn't actually re-read the slurm.conf file. The debug level
> > > seems to be 3 currently.
> > >
> > > On 04/10/2011 14:24, Ramiro Alba wrote:
> > > > Sten,
> > > >
> > > > There you are (as attachments)
> > > >
> > > > Cheers
> > > >
> > > > On Tue, 2011-10-04 at 14:13 +0200, Sten Wolf wrote:
> > > > > SlurmctldDebug=3
> > > > > SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
> > > > > SlurmdDebug=3
> > > > > SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
> > > > >
> > > > > can you change the debug level to 7 for both ctld and node, and provide
> > > > > the contents of both log files?
> > > > >
> > > > > On 04/10/2011 14:09, Ramiro Alba wrote:
> > > > > > Andy,
> > > > > >
> > > > > > On Tue, 2011-10-04 at 08:00 -0400, Andy Riebs wrote:
> > > > > > > Hi Ramiro,
> > > > > > >
> > > > > > > You might check to ensure that all of your clocks are in sync
> > > > > > > (varying by no more than a minute or two).
> > > > > > No, they are not:
> > > > > >
> > > > > > root@jff:~# date; rsh jffmds date
> > > > > > Tue Oct 4 14:03:52 CEST 2011
> > > > > > Tue Oct 4 14:03:42 CEST 2011
> > > > > >
> > > > > > I am using Lustre and this is a MUST.
> > > > > >
> > > > > > Furthermore, the issue is also present in the simplest case (one
> > > > > > node acting as a controller and compute)
> > > > > >
> > > > > > > Andy
> > > > > > >
> > > > > > > On 10/04/2011 07:56 AM, Ramiro Alba wrote:
> > > > > > > > Sten,
> > > > > > > >
> > > > > > > > On Tue, 2011-10-04 at 13:45 +0200, Sten Wolf wrote:
> > > > > > > > > did you create munge key?
> > > > > > > > >
> > > > > > > > Yes, I did.
> > > > > > > > See local test:
> > > > > > > >
> > > > > > > > # munge -n | unmunge
> > > > > > > >
> > > > > > > > STATUS: Success (0)
> > > > > > > > ENCODE_HOST: jff.cttc-jffeth.org (10.2.254.1)
> > > > > > > > ENCODE_TIME: 2011-10-04 13:54:19 (1317729259)
> > > > > > > > DECODE_TIME: 2011-10-04 13:54:19 (1317729259)
> > > > > > > > TTL: 300
> > > > > > > > CIPHER: aes128 (4)
> > > > > > > > MAC: sha1 (3)
> > > > > > > > ZIP: none (0)
> > > > > > > > UID: root (0)
> > > > > > > > GID: root (0)
> > > > > > > > LENGTH: 0
> > > > > > > >
> > > > > > > > >
> > > > > > > > > On 04/10/2011 13:28, Ramiro Alba wrote:
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > I am trying to set up a slurm controller (2.2.7) on Ubuntu 10.04 on
> > > > > > > > > > a cluster server, and even with a simple slurm.conf (see attached file) the
> > > > > > > > > > 'slurmctld' daemon continuously writes to the log file:
> > > > > > > > > >
> > > > > > > > > > debug: _slurm_recv_timeout at 0 of 4, recv zero bytes
> > > > > > > > > > error: slurm_receive_msg: Zero Bytes were transmitted or received
> > > > > > > > > > error: slurm_receive_msg: Zero Bytes were transmitted or received
> > > > > > > > > >
> > > > > > > > > > You can see in 'slurm.conf' that the same node acts as a controller and
> > > > > > > > > > as a compute node. Jobs can be submitted.
> > > > > > > > > >
> > > > > > > > > > Any other cluster node/server (apparently having the same/similar
> > > > > > > > > > hardware and the same operating system) works smoothly without any error,
> > > > > > > > > > acting either as a controller or as a backup controller.
> > > > > > > > > >
> > > > > > > > > > Can anyone give me some idea what to look at, so as to suppress those
> > > > > > > > > > error messages?
> > > > > > > > > > I've looked at the mailing list for similar messages but none was of
> > > > > > > > > > help.
> > > > > > > > > >
> > > > > > > > > --
> > > > > > > > > This message has been scanned by MailScanner
> > > > > > > > > for viruses and other dangerous content,
> > > > > > > > > and is believed to be clean.
> > > > > > > --
> > > > > > > Andy Riebs
> > > > > > > Hewlett-Packard Company
> > > > > > > High Performance Computing
> > > > > > > +1-786-263-9743
> > > > > > > My opinions are not necessarily those of HP

--
Ramiro Alba

Centre Tecnològic de Tranferència de Calor
http://www.cttc.upc.edu

Escola Tècnica Superior d'Enginyeries Industrial i Aeronàutica de Terrassa
Colom 11, E-08222, Terrassa, Barcelona, Spain
Tel: (+34) 93 739 86 46
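
The iptables suggestion in this thread can be checked from another node with a plain TCP connect to the slurmctld port. What follows is only a minimal sketch in Python and was not posted by anyone in the thread: the function name port_is_open and the 127.0.0.1 default are assumptions made for illustration, while 6817 is the port shown in the slurmctld log. A successful connect shows only that the port is reachable; it says nothing about munge or about Slurm itself.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical target: replace 127.0.0.1 with the controller's address.
    # 6817 is the port slurmctld reports listening on in the log.
    host, port = "127.0.0.1", 6817
    state = "reachable" if port_is_open(host, port) else "not reachable"
    print(f"{host}:{port} is {state}")
```

Run from a compute node against the controller's address, a "not reachable" result would point towards a firewall or routing problem rather than authentication. Note that a probe like this connects and closes without sending any data, so it may itself show up in the slurmctld log as a zero-bytes receive.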