What is the output of

scontrol show config | grep SlurmctldTimeout

?
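
Some context for that question, as a minimal sketch rather than a diagnosis: per the slurm.conf documentation, SlurmctldTimeout is the number of seconds the backup controller waits for the primary to respond before it assumes control. If that value is longer than the time the primary was actually down, the backup would stay in standby, which would fit the "while in standby mode" errors quoted below. The helper name slurmctld_timeout is hypothetical, and the assumed line layout ("SlurmctldTimeout        = 120 sec") should be checked against your own scontrol output.

```shell
# Hypothetical helper (not part of Slurm): read `scontrol show config`
# output on stdin and print only the numeric SlurmctldTimeout value.
# Assumption: the relevant line looks like "SlurmctldTimeout        = 120 sec".
slurmctld_timeout() {
    awk -F= '$1 ~ /^SlurmctldTimeout[ \t]*$/ { split($2, v, " "); print v[1] }'
}

# On a node with the Slurm client tools installed:
#   scontrol show config | slurmctld_timeout
```

Comparing the printed number of seconds with how long the primary stayed down during the test shows whether the backup had a chance to take over at all.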

2017-01-31 6:57 GMT+01:00 Andrus, Brian Contractor <bdand...@nps.edu>:
> Yes, if I do scontrol takeover, it successfully goes to the backup.
>
>
> Brian Andrus
> ITACS/Research Computing
> Naval Postgraduate School
> Monterey, California
> voice: 831-656-6238
>
>
>
> -----Original Message-----
> From: TO_Webmaster [mailto:luftha...@gmail.com]
> Sent: Monday, January 30, 2017 11:02 AM
> To: slurm-dev <slurm-dev@schedmd.com>
> Subject: [slurm-dev] Re: Backup controller not responding to requests
>
>
> Does it work if you use "scontrol takeover" to shut down the primary 
> controller and switch immediately to the backup controller?
>
> 2017-01-30 19:41 GMT+01:00 Andrus, Brian Contractor <bdand...@nps.edu>:
>> Paddy,
>>
>> I will enable those and try. The backup controller does have access to the 
>> directory and it is the same version as the master.
>>
>> Not seeing much more in the logs.
>> The backup controller ends with a ping of the master and then just
>> sits. I restart the master and the backup starts saying "Invalid RPC". When
>> the master comes back up, it says it is ignoring the RPC: REQUEST_CONTROL.
>> So, for some reason, it seems the backup will not promote itself...
>>
>> ----------------------
>> [2017-01-30T10:30:21.321] debug3: Success.
>> [2017-01-30T10:30:21.322] trigger pulled for SLURMCTLD event 16384 successful
>> [2017-01-30T10:30:27.323] debug3: pinging slurmctld at 10.1.1.127
>> [2017-01-30T10:31:55.814] error: Invalid RPC received 2009 while in standby mode
>> [2017-01-30T10:32:04.839] debug3: Ignoring RPC: REQUEST_CONTROL
>> [2017-01-30T10:32:06.133] error: Invalid RPC received 2009 while in standby mode
>> [2017-01-30T10:32:07.338] debug3: pinging slurmctld at 10.1.1.127
>> [2017-01-30T10:32:07.339] debug2: slurm_connect failed: Connection refused
>> [2017-01-30T10:32:07.339] debug2: Error connecting slurm stream socket at 10.1.1.127:6817: Connection refused
>> [2017-01-30T10:32:07.339] error: _ping_controller/slurm_send_node_msg error: Connection refused
>> [2017-01-30T10:33:47.351] debug3: pinging slurmctld at 10.1.1.127
>> [2017-01-30T10:35:27.366] debug3: pinging slurmctld at 10.1.1.127
>> [2017-01-30T10:35:33.758] debug3: Ignoring RPC: REQUEST_CONTROL
>> ---------------------
>>
>>
>> Brian Andrus
>> ITACS/Research Computing
>> Naval Postgraduate School
>> Monterey, California
>> voice: 831-656-6238
>>
>>
>>
>> -----Original Message-----
>> From: Paddy Doyle [mailto:pa...@tchpc.tcd.ie]
>> Sent: Monday, January 30, 2017 9:48 AM
>> To: slurm-dev <slurm-dev@schedmd.com>
>> Subject: [slurm-dev] Re: Backup controller not responding to requests
>>
>>
>> Hi Brian,
>>
>> You could turn up the SlurmctldDebug and SlurmdDebug values in slurm.conf to 
>> get it to be more verbose.
>>
>> As a wild guess, perhaps your backup controller doesn't have access to the
>> StateSaveLocation directory?
>>
>> Or another possibility could be it's running a different version of slurm.
>>
>> Paddy
>>
>> On Mon, Jan 30, 2017 at 08:21:59AM -0800, Andrus, Brian Contractor wrote:
>>
>>> All,
>>>
>>> I have configured a backup slurmctld system and it appears to work at 
>>> first, but not in practice.
>>> In particular, when I start it, it says it is running in background mode:
>>> [2017-01-25T14:23:37.648] slurmctld version 16.05.6 started on cluster hamming
>>> [2017-01-25T14:23:37.650] slurmctld running in background mode
>>>
>>> But if I stop the primary daemon, it does not take over. I keep getting 
>>> Invalid RPC errors (random snippets):
>>> [2017-01-25T15:50:37.664] error: Invalid RPC received 2007 while in standby mode
>>> [2017-01-25T15:53:50.495] error: Invalid RPC received 5018 while in standby mode
>>> [2017-01-25T15:59:36.847] error: Invalid RPC received 2007 while in standby mode
>>> [2017-01-25T15:59:37.499] error: Invalid RPC received 2007 while in standby mode
>>> [2017-01-25T15:59:38.923] error: Invalid RPC received 2007 while in standby mode
>>> [2017-01-25T15:59:38.985] error: Invalid RPC received 2007 while in standby mode
>>> [2017-01-25T15:59:39.246] error: Invalid RPC received 2007 while in standby mode
>>> [2017-01-25T15:59:39.293] error: Invalid RPC received 2009 while in standby mode
>>> [2017-01-25T15:59:39.522] error: Invalid RPC received 5018 while in standby mode
>>> [2017-01-25T15:59:43.839] error: Invalid RPC received 2009 while in standby mode
>>> [2017-01-25T15:59:43.930] error: Invalid RPC received 2009 while in standby mode
>>> [2017-01-25T16:19:47.215] error: Invalid RPC received 6012 while in standby mode
>>> [2017-01-25T16:19:48.238] error: Invalid RPC received 6012 while in standby mode
>>>
>>> And on any client, running 'sinfo', for instance, merely hangs.
>>> The interfaces for both slurmctld controllers are in the 'trusted' firewall 
>>> group and there is no filtering between them.
>>> Is there something I am missing to make the backup controller 'kick in' and 
>>> start responding to requests?
>>>
>>>
>>> Brian Andrus
>>> ITACS/Research Computing
>>> Naval Postgraduate School
>>> Monterey, California
>>> voice: 831-656-6238
>>>
>>
>> --
>> Paddy Doyle
>> Trinity Centre for High Performance Computing, Lloyd Building, Trinity 
>> College Dublin, Dublin 2, Ireland.
>> Phone: +353-1-896-3725
>> http://www.tchpc.tcd.ie/
