Thanks Clayton. The base system’s been tested with two independently authored base images, but I’ll try to make time to follow your suggestion next week and will report back if I find anything that repeats.
Knowing that this is not a common problem narrows it down. Thanks.

Alan

> On 14 Apr 2018, at 16:33, Clayton Coleman <[email protected]> wrote:
>
> I don’t think we’ve seen it elsewhere (certainly not repeatedly), which
> probably indicates something specific to your environment, inventory, or base
> system.
>
> I suggested restarting because this is all the same debugging info we’d ask
> for in a bug - knowing whether it’s transient and clears on a restart narrows
> the issue down (likely to be a bug in the core code).
>
> On Apr 14, 2018, at 4:30 AM, Alan Christie <[email protected]> wrote:
>
>> Thanks Clayton,
>>
>> I’ll take a closer look next week because the solution seems to be fixing
>> the symptoms, not the cause, and I’d like to get to a stage where we don’t
>> need to patch the installation and restart it.
>>
>> This happens pretty much *every time* I install 3.7 or 3.9 on AWS, and a
>> significant number of times on OpenStack.
>>
>> Has this been reported by others? It’s so common that we can’t be the only
>> ones seeing it.
>>
>> Alan
>>
>>> On 13 Apr 2018, at 21:35, Clayton Coleman <[email protected]> wrote:
>>>
>>> “Can not find allocated subnet” usually means the master didn’t hand out a
>>> chunk of SDN IPs to that node. Check the master’s origin-master-controllers
>>> logs and look for anything that relates to the node name mentioned in your
>>> error. If you see a problem, try restarting the origin-master-controllers
>>> processes on all nodes.
>>>
>>> On Apr 13, 2018, at 2:26 PM, Alan Christie <[email protected]> wrote:
>>>
>>>> What’s wrong with the post-3.6 OpenShift/Origin releases?
>>>>
>>>> I build my cluster with Terraform, and OpenShift 3.6 (on AWS) is
>>>> wonderfully stable: I have no problem creating clusters.
>>>> But, with both 3.7 and 3.9, I just cannot start a cluster without
>>>> encountering at least one node with an empty /etc/cni/net.d.
>>>>
>>>> This applies to 3.7 and 3.9 on AWS and on two OpenStack providers. In all
>>>> cases the Ansible installer enters the "RUNNING HANDLER [openshift_node :
>>>> restart node]" task but this, for the vast majority of installations on
>>>> OpenStack and every single attempt on AWS, always fails. I’m worried that
>>>> I’ve got something clearly very wrong and have had to return to 3.6 to get
>>>> anything done.
>>>>
>>>> RUNNING HANDLER [openshift_node : restart openvswitch] ********************************************************************************
>>>> Friday 13 April 2018 13:19:09 +0100 (0:00:00.062) 0:09:28.744 **********
>>>> changed: [18.195.236.210]
>>>> changed: [18.195.126.190]
>>>> changed: [18.184.65.88]
>>>>
>>>> RUNNING HANDLER [openshift_node : restart openvswitch pause] **************************************************************************
>>>> Friday 13 April 2018 13:19:09 +0100 (0:00:00.720) 0:09:29.464 **********
>>>> skipping: [18.195.236.210]
>>>>
>>>> RUNNING HANDLER [openshift_node : restart node] ***************************************************************************************
>>>> Friday 13 April 2018 13:19:09 +0100 (0:00:00.036) 0:09:29.501 **********
>>>> FAILED - RETRYING: restart node (3 retries left).
>>>> FAILED - RETRYING: restart node (3 retries left).
>>>> FAILED - RETRYING: restart node (3 retries left).
>>>> FAILED - RETRYING: restart node (2 retries left).
>>>> FAILED - RETRYING: restart node (2 retries left).
>>>> FAILED - RETRYING: restart node (2 retries left).
>>>> FAILED - RETRYING: restart node (1 retries left).
>>>> FAILED - RETRYING: restart node (1 retries left).
>>>> FAILED - RETRYING: restart node (1 retries left).
>>>> fatal: [18.195.236.210]: FAILED! => {"attempts": 3, "changed": false, "msg": "Unable to restart service origin-node: Job for origin-node.service failed because the control process exited with error code. See \"systemctl status origin-node.service\" and \"journalctl -xe\" for details.\n"}
>>>> fatal: [18.195.126.190]: FAILED! => {"attempts": 3, "changed": false, "msg": "Unable to restart service origin-node: Job for origin-node.service failed because the control process exited with error code. See \"systemctl status origin-node.service\" and \"journalctl -xe\" for details.\n"}
>>>> fatal: [18.184.65.88]: FAILED! => {"attempts": 3, "changed": false, "msg": "Unable to restart service origin-node: Job for origin-node.service failed because the control process exited with error code. See \"systemctl status origin-node.service\" and \"journalctl -xe\" for details.\n"}
>>>>
>>>> When I jump onto a suspect node after the failure I find /etc/cni/net.d
>>>> is empty and the journal contains the message "No networks found in
>>>> /etc/cni/net.d"...
>>>>
>>>> -- The start-up result is done.
>>>> Apr 13 12:23:44 ip-10-0-0-61.eu-central-1.compute.internal origin-master-controllers[26728]: I0413 12:23:44.850154 26728 leaderelection.go:179] attempting to acquire leader lease...
>>>> Apr 13 12:23:44 ip-10-0-0-61.eu-central-1.compute.internal origin-node[26683]: W0413 12:23:44.933963 26683 cni.go:189] Unable to update cni config: No networks found in /etc/cni/net.d
>>>> Apr 13 12:23:44 ip-10-0-0-61.eu-central-1.compute.internal origin-node[26683]: E0413 12:23:44.934447 26683 kubelet.go:2112] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
>>>> Apr 13 12:23:47 ip-10-0-0-61.eu-central-1.compute.internal origin-node[26683]: W0413 12:23:47.947200 26683 sdn_controller.go:48] Could not find an allocated subnet for node: ip-10-0-0-61.eu-central-1.compute.internal, Waiting...
>>>>
>>>> Is anyone else seeing this and, more importantly, is there a clear cause
>>>> and solution?
>>>>
>>>> I cannot start 3.7 on AWS at all, despite tinkering with it for days, and
>>>> on OpenStack 3 out of 4 attempts fail. I just tried 3.9, found the same
>>>> failure on AWS, and have given up and returned to the wonderfully stable
>>>> 3.6.
>>>>
>>>> Alan Christie
>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> [email protected]
>>>> http://lists.openshift.redhat.com/openshiftmm/listinfo/users
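For anyone following the advice earlier in the thread, the log check amounts to filtering the master controllers' journal for the affected node's name. A minimal sketch, assuming the standard origin-master-controllers unit name; the `filter_node_events` helper and sample lines are illustrative, with the node name taken from the logs above:

```shell
# On a real master the raw command would be something like:
#   journalctl -u origin-master-controllers --no-pager | grep "$NODE"
# Here the grep step is factored into a helper so it can be tried on
# sample lines without a live cluster.
NODE="ip-10-0-0-61.eu-central-1.compute.internal"

filter_node_events() {
  # Keep only journal lines that mention the given node name (fixed-string match).
  grep -F "$1"
}

# Example run against two sample lines: only the subnet warning survives.
printf '%s\n' \
  "W0413 12:23:47.947200 sdn_controller.go:48] Could not find an allocated subnet for node: $NODE, Waiting..." \
  "I0413 12:23:44.850154 leaderelection.go:179] attempting to acquire leader lease..." \
  | filter_node_events "$NODE"
```

If nothing relevant shows up on the first master, repeat on the others; only the active controller instance handles subnet allocation.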
_______________________________________________
users mailing list
[email protected]
http://lists.openshift.redhat.com/openshiftmm/listinfo/users
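For readers hitting the same symptom, a quick way to spot a bad node before the installer exhausts its retries is to check whether /etc/cni/net.d is populated. A minimal sketch; the `check_cni` helper is illustrative (not part of OpenShift), and the demo uses a temporary directory standing in for /etc/cni/net.d, where a healthy openshift-sdn node normally has a file such as 80-openshift-sdn.conf:

```shell
# Report whether a CNI config directory contains any files.
# check_cni is an illustrative helper, not an OpenShift command.
check_cni() {
  local dir="$1"
  if [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
    echo "cni config present"
  else
    echo "cni config missing"   # corresponds to the node's "No networks found" warning
  fi
}

# Demo against a temp directory standing in for /etc/cni/net.d:
tmp=$(mktemp -d)
check_cni "$tmp"                        # -> cni config missing
touch "$tmp/80-openshift-sdn.conf"
check_cni "$tmp"                        # -> cni config present
```

On a node that reports "missing" after the install, the subnet-allocation check on the masters (described earlier in the thread) is the next step.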
