Sten,

On Tue, 2011-10-04 at 15:04 +0200, Sten Wolf wrote:
> Ramiro,
> 
> Another option might be iptables blocking port 6817. Just for the sake
> of testing - change auth/munge to auth/none (if you have ruled out
> iptables) and see if anything changes in the controller log.
> 

debug2: slurmctld listening on 0.0.0.0:6817
debug3: _slurmctld_background pid = 19913
debug3: Trying to load plugin /usr/lib/slurm/auth_none.so
Null authentication plugin loaded
debug3: Success.
debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
debug:  validate_node_specs: node jff has registered
debug2: _slurm_rpc_node_registration complete for jff usec=48
debug:  _slurm_recv_timeout at 0 of 4, recv zero bytes
error: slurm_receive_msg: Zero Bytes were transmitted or received
error: slurm_receive_msg: Zero Bytes were transmitted or received
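
When an excerpt like the one above repeats endlessly, it can help to tally identical messages per severity tag to see which ones dominate. The following is only a rough sketch: the helper name `tally_log` and the tag list in the regular expression are assumptions made for illustration, not part of Slurm.

```python
import re
from collections import Counter

# Severity tags as seen in the excerpt; the other tags listed are assumptions.
TAGGED = re.compile(r"^(debug\d?|error|info|verbose|fatal):\s*(.*)$")

def tally_log(lines):
    """Count identical (tag, message) pairs; untagged lines get the tag None."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        match = TAGGED.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
        else:
            counts[(None, line)] += 1
    return counts

excerpt = """\
debug:  _slurm_recv_timeout at 0 of 4, recv zero bytes
error: slurm_receive_msg: Zero Bytes were transmitted or received
error: slurm_receive_msg: Zero Bytes were transmitted or received
""".splitlines()

# Print the most frequent messages first.
for (tag, message), n in tally_log(excerpt).most_common():
    print(n, tag, message)
```

Running the same function over the whole slurmctld.log would show whether the "Zero Bytes" error arrives at a steady rate or in bursts.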


Nothing changes: the same errors keep appearing.
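
To rule out a firewall on port 6817 independently of the authentication plugin, one could probe the port from another host. This is a minimal sketch; the function name `port_is_open` is hypothetical, and it only tests whether a TCP connection can be established, not whether slurmctld answers RPCs.

```python
import socket

def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False

# Example call (6817 is the port slurmctld reports listening on,
# "jff" is the controller node from the log above):
#   port_is_open("jff", 6817)
```

Note that a probe like this connects and then closes without sending any message, so it may itself show up in the controller log.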



> BTW, according to the docs debug level 7 is highest.
> 
> 
> 
> On 04/10/2011 14:44, Ramiro Alba wrote:
> > Sten,
> > 
> > Yes. I did. Look at:
> > 
> > root@jff:/var/log/slurm-llnl# scontrol show conf | grep -i debug
> > DebugFlags              = (null)
> > SlurmctldDebug          = 7
> > SlurmdDebug             = 7
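
A check like the one above can be scripted so that the running configuration is compared against the level that was requested. This is a sketch under assumptions: the helper `parse_scontrol_conf` is hypothetical, and it assumes the values are printed as plain numbers, as in the output shown here.

```python
def parse_scontrol_conf(text):
    """Parse 'Key = Value' lines, as printed by `scontrol show conf`, into a dict."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings

# Sample taken from the command output quoted in this thread.
sample = """\
DebugFlags              = (null)
SlurmctldDebug          = 7
SlurmdDebug             = 7
"""

conf = parse_scontrol_conf(sample)
requested = 7  # the level asked for earlier in the thread
for daemon in ("SlurmctldDebug", "SlurmdDebug"):
    status = "ok" if int(conf[daemon]) == requested else "mismatch"
    print(daemon, conf[daemon], status)
```

On a live system the text would come from running `scontrol show conf` rather than from a hard-coded sample.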
> > 
> > I suppose you say that because the logs only show the debug3 tag, but the
> > output is the same as when using debug level 9. Does that make sense?
> > 
> > Cheers
> > 
> > On Tue, 2011-10-04 at 14:28 +0200, Sten Wolf wrote:
> > > Hi,
> > > 
> > > Did you remember to restart the slurm service? You might need to run
> > > "service slurm stop ; service slurm start", as "service slurm
> > > restart" doesn't actually re-read the slurm.conf file. The debug level
> > > currently seems to be 3.
> > > 
> > > 
> > > 
> > > On 04/10/2011 14:24, Ramiro Alba wrote:
> > > > Sten,
> > > > 
> > > > Here they are (as attachments).
> > > > 
> > > > Cheers
> > > > 
> > > > On Tue, 2011-10-04 at 14:13 +0200, Sten Wolf wrote:
> > > > > SlurmctldDebug=3
> > > > > SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
> > > > > SlurmdDebug=3
> > > > > SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
> > > > 
> > > > > Can you change the debug level to 7 for both ctld and node, and provide
> > > > > the contents of both log files?
> > > > > 
> > > > > 
> > > > > 
> > > > > On 04/10/2011 14:09, Ramiro Alba wrote:
> > > > > > Andy,
> > > > > > 
> > > > > > On Tue, 2011-10-04 at 08:00 -0400, Andy Riebs wrote:
> > > > > > > Hi Ramiro,
> > > > > > > 
> > > > > > > > You might check to ensure that all of your clocks are in sync
> > > > > > > > (varying by no more than a minute or two).
> > > > > > > No, they are not out of sync (the two differ by only 10 seconds):
> > > > > > 
> > > > > > root@jff:~# date; rsh jffmds date
> > > > > > Tue Oct  4 14:03:52 CEST 2011
> > > > > > Tue Oct  4 14:03:42 CEST 2011
> > > > > > 
> > > > > > > I am using Lustre, so synchronized clocks are a MUST.
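
The difference between the two `date` outputs above can be computed rather than read off by eye. This is a small sketch: the function names are hypothetical, the timezone token is simply dropped (so both hosts are assumed to be in the same zone), and the delay of the remote call itself is ignored.

```python
from datetime import datetime

def parse_date_output(line):
    """Parse default `date` output such as 'Tue Oct  4 14:03:52 CEST 2011'.

    The timezone token is discarded, so both lines must be in the same zone.
    """
    weekday, month, day, clock, _tz, year = line.split()
    return datetime.strptime(f"{weekday} {month} {day} {clock} {year}",
                             "%a %b %d %H:%M:%S %Y")

def skew_seconds(local_line, remote_line):
    """Absolute difference between two `date` outputs, in seconds."""
    delta = parse_date_output(local_line) - parse_date_output(remote_line)
    return abs(delta.total_seconds())

skew = skew_seconds("Tue Oct  4 14:03:52 CEST 2011",
                    "Tue Oct  4 14:03:42 CEST 2011")
print(skew)  # 10.0 seconds, well under the "minute or two" suggested above
```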
> > > > > > 
> > > > > > > Furthermore, the issue is also present in the simplest case (one
> > > > > > > node acting as both controller and compute node).
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > > Andy
> > > > > > > 
> > > > > > > On 10/04/2011 07:56 AM, Ramiro Alba wrote:
> > > > > > > > Sten,
> > > > > > > > 
> > > > > > > > 
> > > > > > > > On Tue, 2011-10-04 at 13:45 +0200, Sten Wolf wrote:
> > > > > > > > > > Did you create a munge key?
> > > > > > > > > 
> > > > > > > > Yes, I did. See local test:
> > > > > > > > 
> > > > > > > > # munge -n | unmunge
> > > > > > > > 
> > > > > > > > STATUS:           Success (0)
> > > > > > > > ENCODE_HOST:      jff.cttc-jffeth.org (10.2.254.1)
> > > > > > > > ENCODE_TIME:      2011-10-04 13:54:19 (1317729259)
> > > > > > > > DECODE_TIME:      2011-10-04 13:54:19 (1317729259)
> > > > > > > > TTL:              300
> > > > > > > > CIPHER:           aes128 (4)
> > > > > > > > MAC:              sha1 (3)
> > > > > > > > ZIP:              none (0)
> > > > > > > > UID:              root (0)
> > > > > > > > GID:              root (0)
> > > > > > > > LENGTH:           0
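
The fields printed by `unmunge` above can also be checked programmatically, for example to confirm that a credential was decoded within its TTL. This is a simplified sketch: the helper names are hypothetical, and munge performs its own validity checks, which this does not replace.

```python
import re

def parse_unmunge(text):
    """Collect the 'KEY: value' metadata lines of `unmunge` output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().isupper():
            fields[key.strip()] = value.strip()
    return fields

def epoch(value):
    """Extract the epoch seconds from e.g. '2011-10-04 13:54:19 (1317729259)'."""
    return int(re.search(r"\((\d+)\)", value).group(1))

def credential_age_ok(fields):
    """True if the credential was decoded within TTL seconds of being encoded."""
    age = epoch(fields["DECODE_TIME"]) - epoch(fields["ENCODE_TIME"])
    return 0 <= age <= int(fields["TTL"])

# Sample taken from the output quoted in this thread.
sample = """\
STATUS:           Success (0)
ENCODE_TIME:      2011-10-04 13:54:19 (1317729259)
DECODE_TIME:      2011-10-04 13:54:19 (1317729259)
TTL:              300
"""

fields = parse_unmunge(sample)
print(fields["STATUS"], credential_age_ok(fields))
```

A local `munge -n | unmunge` only exercises one host. Decoding on a second node a credential that was encoded on the first would be needed to test the key and clocks across machines.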
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > > On 04/10/2011 13:28, Ramiro Alba wrote:
> > > > > > > > > > Hi all,
> > > > > > > > > > 
> > > > > > > > > > > I am trying to set up a slurm controller (2.2.7) on Ubuntu 10.04
> > > > > > > > > > > on a cluster server, and even with a simple slurm.conf (see
> > > > > > > > > > > attached file) the 'slurmctld' daemon continuously writes the
> > > > > > > > > > > following to the log file:
> > > > > > > > > > 
> > > > > > > > > > debug:  _slurm_recv_timeout at 0 of 4, recv zero bytes
> > > > > > > > > > > error: slurm_receive_msg: Zero Bytes were transmitted or received
> > > > > > > > > > > error: slurm_receive_msg: Zero Bytes were transmitted or received
> > > > > > > > > > 
> > > > > > > > > > > You can see in 'slurm.conf' that the same node acts as both
> > > > > > > > > > > controller and compute node. Jobs can be submitted.
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > Any other cluster node/server (apparently having the same/similar
> > > > > > > > > > > hardware and the same operating system) works smoothly without
> > > > > > > > > > > any error, acting either as a controller or as a backup controller.
> > > > > > > > > > 
> > > > > > > > > > > Can anyone give me some idea of what to look at in order to
> > > > > > > > > > > suppress those error messages?
> > > > > > > > > > > I've searched the mailing list for similar messages, but none
> > > > > > > > > > > was of help.
> > > > > > > > > > 
> > > > > > > > > -- 
> > > > > > > > > > This message has been scanned by MailScanner
> > > > > > > > > > for viruses and other dangerous content,
> > > > > > > > > > and is believed to be clean.
> > > > > > > -- 
> > > > > > > Andy Riebs
> > > > > > > Hewlett-Packard Company
> > > > > > > High Performance Computing
> > > > > > > +1-786-263-9743
> > > > > > > My opinions are not necessarily those of HP
> > > > > > > 
> > > > > > > 
> > > > > -- 
> > > > > Aquest missatge ha estat analitzat per MailScanner 
> > > > > a la cerca de virus i d'altres continguts perillosos, 
> > > > > i es considera que está net.
> 

-- 
Ramiro Alba

Centre Tecnològic de Transferència de Calor
http://www.cttc.upc.edu


Escola Tècnica Superior d'Enginyeries
Industrial i Aeronàutica de Terrassa
Colom 11, E-08222, Terrassa, Barcelona, Spain
Tel: (+34) 93 739 86 46


