I don’t think we’ve seen it elsewhere (certainly not repeatedly), which probably indicates something specific to your environment, inventory, or base system.
I suggested restarting because this is all the same debugging info we’d ask in a bug - knowing whether it’s transient and clears on a restart narrows the issue down (likely to be a bug in the core code). On Apr 14, 2018, at 4:30 AM, Alan Christie <[email protected]> wrote: Thanks Clayton, I’ll take a look a closer look next week because the solution seems to be fixing the symptoms, not the cause and I’d like to get to a stage where we don’t need to patch the installation and restart-it. This happens pretty much *every time* that I install 3.7 or 3.9 on AWS and a significant number of times on OpenStack. Has this been reported by others because it’s so common that we can't be the only ones seeing this? Alan On 13 Apr 2018, at 21:35, Clayton Coleman <[email protected]> wrote: Can not find allocated subnet usually means the master didn’t hand out a chunk of SDN IPs to that node. Check the master’s origin-master-controller logs and look for anything that relates to the node name mentioned in your error. If you see a problem, try restarting the origin-master-controllers processes on all nodes. On Apr 13, 2018, at 2:26 PM, Alan Christie <[email protected]> wrote: What’s wrong with the post-3.6 OpenShift/Origin release? I build my cluster with Terraform and OpenShift 3.6 (on AWS) is wonderfully stable and I have no problem creating clusters. But, with both 3.7 and 3.9, I just cannot start a cluster without encountering at least one node with an empty */etc/cni/net.d*. This applies to 3.7 and 3.9 on AWS and two OpenStack providers. In all cases the Ansible installer enters the "*RUNNING HANDLER [openshift_node : restart node]"* task but this, for the vast majority of installations on OpenStack and every single attempt in AWS, always fails. I’m worried that I’ve got something clearly very wrong and have had to return to 3.6 to get anything done. RUNNING HANDLER [openshift_node : restart openvswitch] ******************************************************************************** Friday 13 April 2018 13:19:09 +0100 (0:00:00.062) 0:09:28.744 ********** changed: [18.195.236.210] changed: [18.195.126.190] changed: [18.184.65.88] RUNNING HANDLER [openshift_node : restart openvswitch pause] ************************************************************************** Friday 13 April 2018 13:19:09 +0100 (0:00:00.720) 0:09:29.464 ********** skipping: [18.195.236.210] RUNNING HANDLER [openshift_node : restart node] *************************************************************************************** Friday 13 April 2018 13:19:09 +0100 (0:00:00.036) 0:09:29.501 ********** FAILED - RETRYING: restart node (3 retries left). FAILED - RETRYING: restart node (3 retries left). FAILED - RETRYING: restart node (3 retries left). FAILED - RETRYING: restart node (2 retries left). FAILED - RETRYING: restart node (2 retries left). FAILED - RETRYING: restart node (2 retries left). FAILED - RETRYING: restart node (1 retries left). FAILED - RETRYING: restart node (1 retries left). FAILED - RETRYING: restart node (1 retries left). fatal: [18.195.236.210]: FAILED! => {"attempts": 3, "changed": false, "msg": "Unable to restart service origin-node: Job for origin-node.service failed because the control process exited with error code. See \"systemctl status origin-node.service\" and \"journalctl -xe\" for details.\n"} fatal: [18.195.126.190]: FAILED! => {"attempts": 3, "changed": false, "msg": "Unable to restart service origin-node: Job for origin-node.service failed because the control process exited with error code. See \"systemctl status origin-node.service\" and \"journalctl -xe\" for details.\n"} fatal: [18.184.65.88]: FAILED! => {"attempts": 3, "changed": false, "msg": "Unable to restart service origin-node: Job for origin-node.service failed because the control process exited with error code. See \"systemctl status origin-node.service\" and \"journalctl -xe\" for details.\n"} When I jump onto a suspect node after the failure I find*/etc/cni/net.d* is empty and the journal contains the message "*No networks found in /etc/cni/net.d*”... -- The start-up result is done. Apr 13 12:23:44 ip-10-0-0-61.eu-central-1.compute.internal origin-master-controllers[26728]: I0413 12:23:44.850154 26728 leaderelection.go:179] attempting to acquire leader lease... Apr 13 12:23:44 ip-10-0-0-61.eu-central-1.compute.internal origin-node[26683]: W0413 12:23:44.933963 26683 cni.go:189] Unable to update cni config: No networks found in /etc/cni/net.d Apr 13 12:23:44 ip-10-0-0-61.eu-central-1.compute.internal origin-node[26683]: E0413 12:23:44.934447 26683 kubelet.go:2112] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized Apr 13 12:23:47 ip-10-0-0-61.eu-central-1.compute.internal origin-node[26683]: W0413 12:23:47.947200 26683 sdn_controller.go:48] Could not find an allocated subnet for node: ip-10-0-0-61.eu-central-1.compute.internal, Waiting... Is anyone else seeing this and, more importantly, is there a clear cause and solution? I cannot start 3.7 and have been tinkering with it for days on AWS at all and on OpenStack 3 out of 4 attempts fail. I just tried 3.9 to find the same failure on AWS and have just given up and returned to the wonderfully stable 3.6. Alan Christie _______________________________________________ users mailing list [email protected] http://lists.openshift.redhat.com/openshiftmm/listinfo/users
_______________________________________________ users mailing list [email protected] http://lists.openshift.redhat.com/openshiftmm/listinfo/users
