What is the output of

scontrol show config | grep SlurmctldTimeout

?

2017-01-31 6:57 GMT+01:00 Andrus, Brian Contractor <bdand...@nps.edu>:
> Yes, if I do scontrol takeover, it successfully goes to the backup.
>
> Brian Andrus
> ITACS/Research Computing
> Naval Postgraduate School
> Monterey, California
> voice: 831-656-6238
>
> -----Original Message-----
> From: TO_Webmaster [mailto:luftha...@gmail.com]
> Sent: Monday, January 30, 2017 11:02 AM
> To: slurm-dev <slurm-dev@schedmd.com>
> Subject: [slurm-dev] Re: Backup controller not responding to requests
>
> Does it work if you use "scontrol takeover" to shut down the primary
> controller and switch immediately to the backup controller?
>
> 2017-01-30 19:41 GMT+01:00 Andrus, Brian Contractor <bdand...@nps.edu>:
>> Paddy,
>>
>> I will enable those and try. The backup controller does have access to the
>> directory and it is the same version as the master.
>>
>> Not seeing much more in the logs..
>> The backup controller ends with a ping of the master and then just
>> sits. I restart the master and the backup starts saying "Invalid RPC". When
>> the master comes back up, it says it is ignoring the RPC: REQUEST_CONTROL
>> So, for some reason, it seems the backup will not promote itself...
>>
>> ----------------------
>> [2017-01-30T10:30:21.321] debug3: Success.
>> [2017-01-30T10:30:21.322] trigger pulled for SLURMCTLD event 16384 successful
>> [2017-01-30T10:30:27.323] debug3: pinging slurmctld at 10.1.1.127
>> [2017-01-30T10:31:55.814] error: Invalid RPC received 2009 while in standby mode
>> [2017-01-30T10:32:04.839] debug3: Ignoring RPC: REQUEST_CONTROL
>> [2017-01-30T10:32:06.133] error: Invalid RPC received 2009 while in standby mode
>> [2017-01-30T10:32:07.338] debug3: pinging slurmctld at 10.1.1.127
>> [2017-01-30T10:32:07.339] debug2: slurm_connect failed: Connection refused
>> [2017-01-30T10:32:07.339] debug2: Error connecting slurm stream socket at 10.1.1.127:6817: Connection refused
>> [2017-01-30T10:32:07.339] error: _ping_controller/slurm_send_node_msg error: Connection refused
>> [2017-01-30T10:33:47.351] debug3: pinging slurmctld at 10.1.1.127
>> [2017-01-30T10:35:27.366] debug3: pinging slurmctld at 10.1.1.127
>> [2017-01-30T10:35:33.758] debug3: Ignoring RPC: REQUEST_CONTROL
>> ---------------------
>>
>> Brian Andrus
>> ITACS/Research Computing
>> Naval Postgraduate School
>> Monterey, California
>> voice: 831-656-6238
>>
>> -----Original Message-----
>> From: Paddy Doyle [mailto:pa...@tchpc.tcd.ie]
>> Sent: Monday, January 30, 2017 9:48 AM
>> To: slurm-dev <slurm-dev@schedmd.com>
>> Subject: [slurm-dev] Re: Backup controller not responding to requests
>>
>> Hi Brian,
>>
>> You could turn up the SlurmctldDebug and SlurmdDebug values in slurm.conf to
>> get it to be more verbose.
>>
>> As a wild guess, perhaps your backup control doesn't have access to the
>> StateSaveLocation directory?
>>
>> Or another possibility could be it's running a different version of slurm.
>>
>> Paddy
>>
>> On Mon, Jan 30, 2017 at 08:21:59AM -0800, Andrus, Brian Contractor wrote:
>>
>>> All,
>>>
>>> I have configured a backup slurmctld system and it appears to work at
>>> first, but not in practice.
>>> In particular, when I start it, it says it is running in background mode:
>>> [2017-01-25T14:23:37.648] slurmctld version 16.05.6 started on cluster hamming
>>> [2017-01-25T14:23:37.650] slurmctld running in background mode
>>>
>>> But if I stop the primary daemon, it does not take over. I keep getting
>>> Invalid RPC errors (random snippets):
>>> [2017-01-25T15:50:37.664] error: Invalid RPC received 2007 while in standby mode
>>> [2017-01-25T15:53:50.495] error: Invalid RPC received 5018 while in standby mode
>>> [2017-01-25T15:59:36.847] error: Invalid RPC received 2007 while in standby mode
>>> [2017-01-25T15:59:37.499] error: Invalid RPC received 2007 while in standby mode
>>> [2017-01-25T15:59:38.923] error: Invalid RPC received 2007 while in standby mode
>>> [2017-01-25T15:59:38.985] error: Invalid RPC received 2007 while in standby mode
>>> [2017-01-25T15:59:39.246] error: Invalid RPC received 2007 while in standby mode
>>> [2017-01-25T15:59:39.293] error: Invalid RPC received 2009 while in standby mode
>>> [2017-01-25T15:59:39.522] error: Invalid RPC received 5018 while in standby mode
>>> [2017-01-25T15:59:43.839] error: Invalid RPC received 2009 while in standby mode
>>> [2017-01-25T15:59:43.930] error: Invalid RPC received 2009 while in standby mode
>>> [2017-01-25T16:19:47.215] error: Invalid RPC received 6012 while in standby mode
>>> [2017-01-25T16:19:48.238] error: Invalid RPC received 6012 while in standby mode
>>>
>>> And on any client running 'sinfo' for instance, it merely hangs.
>>> The interfaces for both slurmctld controllers are in the 'trusted' firewall
>>> group and there is no filtering between them.
>>> Is there something I am missing to make the backup controller 'kick in' and
>>> start responding to requests?
>>>
>>> Brian Andrus
>>> ITACS/Research Computing
>>> Naval Postgraduate School
>>> Monterey, California
>>> voice: 831-656-6238
>>>
>>
>> --
>> Paddy Doyle
>> Trinity Centre for High Performance Computing, Lloyd Building, Trinity
>> College Dublin, Dublin 2, Ireland.
>> Phone: +353-1-896-3725
>> http://www.tchpc.tcd.ie/
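
The checks suggested in this thread (SlurmctldTimeout, the debug level, StateSaveLocation, and which hosts are configured as controllers) can be pulled out of the running configuration in one pass. What follows is only a rough sketch, not something posted by anyone in the thread: the helper name failover_settings is hypothetical, and the list of keys is an assumption drawn from the suggestions above (ControlMachine and BackupController are the key names I would expect on a 16.05 installation).

```shell
#!/bin/sh
# Hypothetical helper: reads `scontrol show config` output on stdin and
# keeps only the settings discussed in this thread. The key list is an
# assumption based on the thread, not an official failover checklist.
failover_settings() {
    grep -E '^[[:space:]]*(ControlMachine|BackupController|SlurmctldTimeout|SlurmctldDebug|StateSaveLocation)[[:space:]]*='
}

# Typical use on either controller (needs a working Slurm installation):
#   scontrol show config | failover_settings
#   scontrol ping    # reports whether each configured controller responds
```

Running it on both the primary and the backup and comparing the two outputs would show whether both daemons see the same timeout and the same StateSaveLocation path.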