Yes, if I do scontrol takeover, it successfully goes to the backup.

Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238



-----Original Message-----
From: TO_Webmaster [mailto:luftha...@gmail.com] 
Sent: Monday, January 30, 2017 11:02 AM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: Backup controller not responding to requests


Does it work if you use "scontrol takeover" to shut down the primary controller 
and switch immediately to the backup controller?
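
For readers following along, the settings this thread keeps coming back to (the two controllers, the shared state directory, the debug levels) sit in slurm.conf roughly as in the excerpt below. This is only a sketch for the 16.05-era option names: the hostnames and the path are hypothetical placeholders, 10.1.1.127 is the address the backup is seen pinging in the quoted logs, and SlurmctldTimeout is not mentioned anywhere in the thread. It is included because it is the interval the backup waits for the primary to respond before assuming control, and 120 is the documented default.

```
# slurm.conf (excerpt); hostnames and the path are hypothetical placeholders
ControlMachine=ctl-primary
# Address the backup pings in the logs quoted in this thread
ControlAddr=10.1.1.127
BackupController=ctl-backup
# Must be readable and writable from both controllers
StateSaveLocation=/shared/slurm/state
# More verbose logging, as suggested further down the thread
SlurmctldDebug=debug3
SlurmdDebug=debug3
# Seconds the backup waits for the primary before taking over (assumption: default value)
SlurmctldTimeout=120
```

Both controllers would need to read the same file, and the daemons have to be restarted or reconfigured before a change to the debug levels takes effect.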

2017-01-30 19:41 GMT+01:00 Andrus, Brian Contractor <bdand...@nps.edu>:
> Paddy,
>
> I will enable those and try. The backup controller does have access to the 
> directory and it is the same version as the master.
>
> I am not seeing much more in the logs.
> The backup controller ends with a ping of the master and then just 
> sits. I restart the master and the backup starts saying "Invalid RPC". When 
> the master comes back up, it says it is ignoring the RPC: REQUEST_CONTROL. So, 
> for some reason, it seems the backup will not promote itself...
>
> ----------------------
> [2017-01-30T10:30:21.321] debug3: Success.
> [2017-01-30T10:30:21.322] trigger pulled for SLURMCTLD event 16384 successful
> [2017-01-30T10:30:27.323] debug3: pinging slurmctld at 10.1.1.127
> [2017-01-30T10:31:55.814] error: Invalid RPC received 2009 while in standby mode
> [2017-01-30T10:32:04.839] debug3: Ignoring RPC: REQUEST_CONTROL
> [2017-01-30T10:32:06.133] error: Invalid RPC received 2009 while in standby mode
> [2017-01-30T10:32:07.338] debug3: pinging slurmctld at 10.1.1.127
> [2017-01-30T10:32:07.339] debug2: slurm_connect failed: Connection refused
> [2017-01-30T10:32:07.339] debug2: Error connecting slurm stream socket at 10.1.1.127:6817: Connection refused
> [2017-01-30T10:32:07.339] error: _ping_controller/slurm_send_node_msg error: Connection refused
> [2017-01-30T10:33:47.351] debug3: pinging slurmctld at 10.1.1.127
> [2017-01-30T10:35:27.366] debug3: pinging slurmctld at 10.1.1.127
> [2017-01-30T10:35:33.758] debug3: Ignoring RPC: REQUEST_CONTROL
> ---------------------
>
>
> -----Original Message-----
> From: Paddy Doyle [mailto:pa...@tchpc.tcd.ie]
> Sent: Monday, January 30, 2017 9:48 AM
> To: slurm-dev <slurm-dev@schedmd.com>
> Subject: [slurm-dev] Re: Backup controller not responding to requests
>
>
> Hi Brian,
>
> You could turn up the SlurmctldDebug and SlurmdDebug values in slurm.conf to 
> get it to be more verbose.
>
> As a wild guess, perhaps your backup controller doesn't have access to the 
> StateSaveLocation directory?
>
> Or another possibility could be that it's running a different version of Slurm.
>
> Paddy
>
> On Mon, Jan 30, 2017 at 08:21:59AM -0800, Andrus, Brian Contractor wrote:
>
>> All,
>>
>> I have configured a backup slurmctld system and it appears to work at first, 
>> but not in practice.
>> In particular, when I start it, it says it is running in background mode:
>> [2017-01-25T14:23:37.648] slurmctld version 16.05.6 started on cluster hamming
>> [2017-01-25T14:23:37.650] slurmctld running in background mode
>>
>> But if I stop the primary daemon, it does not take over. I keep getting 
>> Invalid RPC errors (random snippets):
>> [2017-01-25T15:50:37.664] error: Invalid RPC received 2007 while in standby mode
>> [2017-01-25T15:53:50.495] error: Invalid RPC received 5018 while in standby mode
>> [2017-01-25T15:59:36.847] error: Invalid RPC received 2007 while in standby mode
>> [2017-01-25T15:59:37.499] error: Invalid RPC received 2007 while in standby mode
>> [2017-01-25T15:59:38.923] error: Invalid RPC received 2007 while in standby mode
>> [2017-01-25T15:59:38.985] error: Invalid RPC received 2007 while in standby mode
>> [2017-01-25T15:59:39.246] error: Invalid RPC received 2007 while in standby mode
>> [2017-01-25T15:59:39.293] error: Invalid RPC received 2009 while in standby mode
>> [2017-01-25T15:59:39.522] error: Invalid RPC received 5018 while in standby mode
>> [2017-01-25T15:59:43.839] error: Invalid RPC received 2009 while in standby mode
>> [2017-01-25T15:59:43.930] error: Invalid RPC received 2009 while in standby mode
>> [2017-01-25T16:19:47.215] error: Invalid RPC received 6012 while in standby mode
>> [2017-01-25T16:19:48.238] error: Invalid RPC received 6012 while in standby mode
>>
>> And on any client, running 'sinfo' (for instance) merely hangs.
>> The interfaces for both slurmctld controllers are in the 'trusted' firewall 
>> group and there is no filtering between them.
>> Is there something I am missing to make the backup controller 'kick in' and 
>> start responding to requests?
>>
>
> --
> Paddy Doyle
> Trinity Centre for High Performance Computing, Lloyd Building, Trinity 
> College Dublin, Dublin 2, Ireland.
> Phone: +353-1-896-3725
> http://www.tchpc.tcd.ie/
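
The "Invalid RPC received N while in standby mode" errors quoted in this thread are easier to reason about once they are tallied by number. The sketch below is one way to do that in Python. It assumes one log entry per line in the format shown above; the names `STANDBY_RPC`, `SAMPLE` and `tally_standby_rpcs` are made up for illustration and are not part of Slurm. The numbers appear to be RPC message-type codes, and mapping them to names would need the Slurm source for the installed version, which is not attempted here.

```python
import re
from collections import Counter

# Matches entries such as:
# [2017-01-25T15:50:37.664] error: Invalid RPC received 2007 while in standby mode
STANDBY_RPC = re.compile(
    r"\[(?P<ts>[^\]]+)\] error: Invalid RPC received (?P<code>\d+) "
    r"while in standby mode"
)

# A few entries copied from the backup controller log quoted in the thread.
SAMPLE = """\
[2017-01-25T15:50:37.664] error: Invalid RPC received 2007 while in standby mode
[2017-01-25T15:53:50.495] error: Invalid RPC received 5018 while in standby mode
[2017-01-25T15:59:36.847] error: Invalid RPC received 2007 while in standby mode
[2017-01-25T15:59:39.293] error: Invalid RPC received 2009 while in standby mode
[2017-01-25T16:19:47.215] error: Invalid RPC received 6012 while in standby mode
"""


def tally_standby_rpcs(log_text):
    """Count how often each RPC number was rejected in standby mode."""
    return Counter(int(m.group("code")) for m in STANDBY_RPC.finditer(log_text))


if __name__ == "__main__":
    # Print the rejected RPC numbers, most frequent first.
    for code, count in tally_standby_rpcs(SAMPLE).most_common():
        print(code, count)
```

To run it against a real log, read the file and pass its contents to `tally_standby_rpcs` in place of `SAMPLE`. A tally dominated by one or two numbers suggests ordinary client traffic reaching a controller that still considers itself the standby.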
