Yes, if I do scontrol takeover, it successfully goes to the backup.

Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238

-----Original Message-----
From: TO_Webmaster [mailto:luftha...@gmail.com]
Sent: Monday, January 30, 2017 11:02 AM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: Backup controller not responding to requests

Does it work if you use "scontrol takeover" to shut down the primary
controller and switch immediately to the backup controller?

2017-01-30 19:41 GMT+01:00 Andrus, Brian Contractor <bdand...@nps.edu>:
> Paddy,
>
> I will enable those and try. The backup controller does have access to the
> directory and it is the same version as the master.
>
> Not seeing much more in the logs..
> The backup controller ends with a ping of the master and then just
> sits. I restart the master and the backup starts saying "Invalid RPC". When
> the master comes back up, it says it is ignoring the RPC: REQUEST_CONTROL
> So, for some reason, it seems the backup will not promote itself...
>
> ----------------------
> [2017-01-30T10:30:21.321] debug3: Success.
> [2017-01-30T10:30:21.322] trigger pulled for SLURMCTLD event 16384 successful
> [2017-01-30T10:30:27.323] debug3: pinging slurmctld at 10.1.1.127
> [2017-01-30T10:31:55.814] error: Invalid RPC received 2009 while in standby mode
> [2017-01-30T10:32:04.839] debug3: Ignoring RPC: REQUEST_CONTROL
> [2017-01-30T10:32:06.133] error: Invalid RPC received 2009 while in standby mode
> [2017-01-30T10:32:07.338] debug3: pinging slurmctld at 10.1.1.127
> [2017-01-30T10:32:07.339] debug2: slurm_connect failed: Connection refused
> [2017-01-30T10:32:07.339] debug2: Error connecting slurm stream socket at 10.1.1.127:6817: Connection refused
> [2017-01-30T10:32:07.339] error: _ping_controller/slurm_send_node_msg error: Connection refused
> [2017-01-30T10:33:47.351] debug3: pinging slurmctld at 10.1.1.127
> [2017-01-30T10:35:27.366] debug3: pinging slurmctld at 10.1.1.127
> [2017-01-30T10:35:33.758] debug3: Ignoring RPC: REQUEST_CONTROL
> ---------------------
>
>
> Brian Andrus
> ITACS/Research Computing
> Naval Postgraduate School
> Monterey, California
> voice: 831-656-6238
>
>
>
> -----Original Message-----
> From: Paddy Doyle [mailto:pa...@tchpc.tcd.ie]
> Sent: Monday, January 30, 2017 9:48 AM
> To: slurm-dev <slurm-dev@schedmd.com>
> Subject: [slurm-dev] Re: Backup controller not responding to requests
>
>
> Hi Brian,
>
> You could turn up the SlurmctldDebug and SlurmdDebug values in slurm.conf to
> get it to be more verbose.
>
> As a wild guess, perhaps your backup control doesn't have access to the
> StateSaveLocation directory?
>
> Or another possibility could be it's running a different version of slurm.
>
> Paddy
>
> On Mon, Jan 30, 2017 at 08:21:59AM -0800, Andrus, Brian Contractor wrote:
>
>> All,
>>
>> I have configured a backup slurmctld system and it appears to work at first,
>> but not in practice.
>> In particular, when I start it, it says it is running in background mode:
>> [2017-01-25T14:23:37.648] slurmctld version 16.05.6 started on cluster hamming
>> [2017-01-25T14:23:37.650] slurmctld running in background mode
>>
>> But if I stop the primary daemon, it does not take over. I keep getting
>> Invalid RPC errors (random snippets):
>> [2017-01-25T15:50:37.664] error: Invalid RPC received 2007 while in standby mode
>> [2017-01-25T15:53:50.495] error: Invalid RPC received 5018 while in standby mode
>> [2017-01-25T15:59:36.847] error: Invalid RPC received 2007 while in standby mode
>> [2017-01-25T15:59:37.499] error: Invalid RPC received 2007 while in standby mode
>> [2017-01-25T15:59:38.923] error: Invalid RPC received 2007 while in standby mode
>> [2017-01-25T15:59:38.985] error: Invalid RPC received 2007 while in standby mode
>> [2017-01-25T15:59:39.246] error: Invalid RPC received 2007 while in standby mode
>> [2017-01-25T15:59:39.293] error: Invalid RPC received 2009 while in standby mode
>> [2017-01-25T15:59:39.522] error: Invalid RPC received 5018 while in standby mode
>> [2017-01-25T15:59:43.839] error: Invalid RPC received 2009 while in standby mode
>> [2017-01-25T15:59:43.930] error: Invalid RPC received 2009 while in standby mode
>> [2017-01-25T16:19:47.215] error: Invalid RPC received 6012 while in standby mode
>> [2017-01-25T16:19:48.238] error: Invalid RPC received 6012 while in standby mode
>>
>> And on any client running 'sinfo' for instance, it merely hangs.
>> The interfaces for both slurmctld controllers are in the 'trusted' firewall
>> group and there is no filtering between them.
>> Is there something I am missing to make the backup controller 'kick in' and
>> start responding to requests?
>>
>>
>> Brian Andrus
>> ITACS/Research Computing
>> Naval Postgraduate School
>> Monterey, California
>> voice: 831-656-6238
>>
>
> --
> Paddy Doyle
> Trinity Centre for High Performance Computing, Lloyd Building, Trinity
> College Dublin, Dublin 2, Ireland.
> Phone: +353-1-896-3725
> http://www.tchpc.tcd.ie/
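
For anyone following this thread later: the "Invalid RPC received N while in standby mode" entries quoted in the thread are easier to reason about once they are counted per RPC number. Below is a rough Python sketch of that tally. It is not part of Slurm; the function name, the regular expression, and the SAMPLE list are my own illustration, and the pattern only assumes the log line wording shown in the excerpts in this thread.

```python
import re
from collections import Counter

# Pattern based only on the log wording quoted in this thread, e.g.
# [2017-01-25T15:50:37.664] error: Invalid RPC received 2007 while in standby mode
STANDBY_RPC = re.compile(r"error: Invalid RPC received (\d+) while in standby mode")


def tally_standby_rpcs(lines):
    """Count rejected RPC numbers in an iterable of slurmctld log lines.

    Hypothetical helper for illustration; lines that do not match the
    pattern (pings, debug output, and so on) are skipped.
    """
    counts = Counter()
    for line in lines:
        match = STANDBY_RPC.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# A few lines copied from the excerpts in this thread.
SAMPLE = [
    "[2017-01-25T15:50:37.664] error: Invalid RPC received 2007 while in standby mode",
    "[2017-01-25T15:53:50.495] error: Invalid RPC received 5018 while in standby mode",
    "[2017-01-25T15:59:36.847] error: Invalid RPC received 2007 while in standby mode",
    "[2017-01-25T15:59:39.293] error: Invalid RPC received 2009 while in standby mode",
    "[2017-01-30T10:30:27.323] debug3: pinging slurmctld at 10.1.1.127",
]

if __name__ == "__main__":
    # Print one "rpc_number count" pair per line, lowest RPC number first.
    for rpc, count in sorted(tally_standby_rpcs(SAMPLE).items()):
        print(rpc, count)
```

To run it against a real log, pass an open file object instead of SAMPLE, for example tally_standby_rpcs(open(path)) where path is wherever your SlurmctldLogFile points. The counts only show which requests the standby controller is refusing; they do not explain why it stays in standby.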