Moe,

I have more data. If I configure the faulty server as a backup
controller with SlurmctldPort=6817, it logs the following
errors:

error: Invalid RPC received 65534 while in standby mode
debug2: _slurm_send_timeout: Socket no longer there
error: slurm_msg_sendto: address:port=0.0.0.0:0 msg_type=8001: Transport endpoint is not connected
debug:  _slurm_recv_timeout at 0 of 4, recv zero bytes
error: slurm_receive_msg: Zero Bytes were transmitted or received
error: slurm_receive_msg: Zero Bytes were transmitted or received

With SlurmctldPort=6816, however, it works smoothly. I cannot find any
other process using port 6817 at the same time. So why does it fail?

Any ideas? Please give me something to work with.
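
To double-check that nothing else holds port 6817, a minimal sketch along
these lines could be run on the server. It assumes Python 3 is available;
the helper names port_is_free and port_accepts_connections are hypothetical
and not part of SLURM. It only tests plain TCP bind and connect, so a
firewall rule that drops traffic in one direction could still go unnoticed.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Most daemons set SO_REUSEADDR; without it, connections lingering
        # in TIME_WAIT could make an otherwise free port look busy.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind((host, port))
        except OSError:
            # Typically EADDRINUSE (another listener) or EACCES.
            return False
        return True


def port_accepts_connections(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


if __name__ == "__main__":
    # Ports taken from this thread; run with slurmctld stopped.
    for port in (6816, 6817):
        print(port, "free to bind:", port_is_free(port))
```

With slurmctld stopped, both ports should report as free to bind if no
other process holds them. With slurmctld running, calling
port_accepts_connections from another node shows whether the port can be
reached over the network.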

Regards
 

On Tue, 2011-10-04 at 09:14 -0600, Moe Jette wrote:
> Judging from the log, the slurmd daemon is able to send a message to  
> the slurmctld daemon:
> [2011-10-04T14:19:27] debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
> 
> [2011-10-04T14:19:27] debug:  validate_node_specs: node jff has registered
> 
> [2011-10-04T14:19:27] debug2: _slurm_rpc_node_registration complete for jff usec=51
> 
> But messages going the other way are being blocked (to port 6817). I  
> would suggest that you try another value for SlurmctldPort in  
> slurm.conf and try again. Also, you are running slurmd as user root  
> and slurmctld as user "slurm", correct? If not, you need to configure  
> SlurmdUser and/or SlurmUser. If SlurmdUser is not root, it will not be  
> able to set the user id to run jobs as other users.
> 
> Moe Jette
> 
> 
> Quoting Ramiro Alba <r...@cttc.upc.edu>:
> 
> > Sten,
> >
> > There you are (as attachments)
> >
> > Cheers
> >
> > On Tue, 2011-10-04 at 14:13 +0200, Sten Wolf wrote:
> >> SlurmctldDebug=3
> >> SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
> >> SlurmdDebug=3
> >> SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
> >
> >
> >
> >>
> >>
> >> Can you change the debug level to 7 for both slurmctld and slurmd, and
> >> provide the contents of both log files?
> >>
> >>
> >>
> >> On 04/10/2011 14:09, Ramiro Alba wrote:
> >> > Andy,
> >> >
> >> > On Tue, 2011-10-04 at 08:00 -0400, Andy Riebs wrote:
> >> > > Hi Ramiro,
> >> > >
> >> > > You might check to ensure that all of your clocks are in sync (varying
> >> > > by no more than a minute or two).
> >> > No, they are not out of sync:
> >> >
> >> > root@jff:~# date; rsh jffmds date
> >> > Tue Oct  4 14:03:52 CEST 2011
> >> > Tue Oct  4 14:03:42 CEST 2011
> >> >
> >> > I am using Lustre, so synchronized clocks are a MUST.
> >> >
> >> > Furthermore, the issue is also present in the simplest case (one node
> >> > acting as both controller and compute node).
> >> >
> >> >
> >> >
> >> >
> >> > > Andy
> >> > >
> >> > > On 10/04/2011 07:56 AM, Ramiro Alba wrote:
> >> > > > Sten,
> >> > > >
> >> > > >
> >> > > > On Tue, 2011-10-04 at 13:45 +0200, Sten Wolf wrote:
> >> > > > > Did you create a munge key?
> >> > > > >
> >> > > > Yes, I did. See local test:
> >> > > >
> >> > > > # munge -n | unmunge
> >> > > >
> >> > > > STATUS:           Success (0)
> >> > > > ENCODE_HOST:      jff.cttc-jffeth.org (10.2.254.1)
> >> > > > ENCODE_TIME:      2011-10-04 13:54:19 (1317729259)
> >> > > > DECODE_TIME:      2011-10-04 13:54:19 (1317729259)
> >> > > > TTL:              300
> >> > > > CIPHER:           aes128 (4)
> >> > > > MAC:              sha1 (3)
> >> > > > ZIP:              none (0)
> >> > > > UID:              root (0)
> >> > > > GID:              root (0)
> >> > > > LENGTH:           0
> >> > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > > > On 04/10/2011 13:28, Ramiro Alba wrote:
> >> > > > > > Hi all,
> >> > > > > >
> >> > > > > > I am trying to set up a SLURM controller (2.2.7) on a cluster
> >> > > > > > server running Ubuntu 10.04, and even with a simple slurm.conf
> >> > > > > > (see attached file) the 'slurmctld' daemon continuously writes
> >> > > > > > the following to the log file:
> >> > > > > >
> >> > > > > > debug:  _slurm_recv_timeout at 0 of 4, recv zero bytes
> >> > > > > > error: slurm_receive_msg: Zero Bytes were transmitted or received
> >> > > > > > error: slurm_receive_msg: Zero Bytes were transmitted or received
> >> > > > > >
> >> > > > > > You can see in 'slurm.conf' that the same node acts as both a
> >> > > > > > controller and a compute node. Jobs can be submitted.
> >> > > > > >
> >> > > > > >
> >> > > > > > Every other cluster node/server (apparently with the same or similar
> >> > > > > > hardware and the same operating system) works smoothly, without any
> >> > > > > > errors, as either a controller or a backup controller.
> >> > > > > >
> >> > > > > > Can anyone give me an idea of what to look at in order to suppress
> >> > > > > > those error messages?
> >> > > > > > I've searched the mailing list for similar messages, but none of
> >> > > > > > the results helped.
> >> > > > > >
> >> > > > > --
> >> > > > > This message has been scanned by MailScanner
> >> > > > > for viruses and other dangerous content,
> >> > > > > and is believed to be clean.
> >> > > --
> >> > > Andy Riebs
> >> > > Hewlett-Packard Company
> >> > > High Performance Computing
> >> > > +1-786-263-9743
> >> > > My opinions are not necessarily those of HP
> >> > >
> >> > >
> >>
> >
> > --
> > Ramiro Alba
> >
> > Centre Tecnològic de Tranferència de Calor
> > http://www.cttc.upc.edu
> >
> >
> > Escola Tècnica Superior d'Enginyeries
> > Industrial i Aeronàutica de Terrassa
> > Colom 11, E-08222, Terrassa, Barcelona, Spain
> > Tel: (+34) 93 739 86 46
> >
> >
> >
> >
> 
> 
> 

-- 
Ramiro Alba

Centre Tecnològic de Tranferència de Calor
http://www.cttc.upc.edu


Escola Tècnica Superior d'Enginyeries
Industrial i Aeronàutica de Terrassa
Colom 11, E-08222, Terrassa, Barcelona, Spain
Tel: (+34) 93 739 86 46


