Thanks Clayton. The base system’s been tested with two independently authored base images, but I’ll try to make time to follow your suggestion next week and will report back if I find anything that repeats.
Knowing that this is not a common problem narrows it down. Thanks.

Alan

> On 14 Apr 2018, at 16:33, Clayton Coleman <[email protected]> wrote:
>
> I don’t think we’ve seen it elsewhere (certainly not repeatedly), which
> probably indicates something specific to your environment, inventory, or base
> system.
>
> I suggested restarting because this is all the same debugging info we’d ask
> for in a bug - knowing whether it’s transient and clears on a restart narrows
> the issue down (likely to be a bug in the core code).
>
> On Apr 14, 2018, at 4:30 AM, Alan Christie <[email protected]> wrote:
>
>> Thanks Clayton,
>>
>> I’ll take a closer look next week because the solution seems to be fixing
>> the symptoms, not the cause, and I’d like to get to a stage where we don’t
>> need to patch the installation and restart it.
>>
>> This happens pretty much *every time* I install 3.7 or 3.9 on AWS, and a
>> significant number of times on OpenStack.
>>
>> Has this been reported by others? It’s so common that we can’t be the only
>> ones seeing it.
>>
>> Alan
>>
>>> On 13 Apr 2018, at 21:35, Clayton Coleman <[email protected]> wrote:
>>>
>>> “Can not find allocated subnet” usually means the master didn’t hand out a
>>> chunk of SDN IPs to that node. Check the master’s origin-master-controllers
>>> logs and look for anything that relates to the node name mentioned in your
>>> error. If you see a problem, try restarting the origin-master-controllers
>>> processes on all nodes.
>>>
>>> On Apr 13, 2018, at 2:26 PM, Alan Christie <[email protected]> wrote:
>>>
>>>> What’s wrong with the post-3.6 OpenShift/Origin releases?
>>>>
>>>> I build my cluster with Terraform, and OpenShift 3.6 (on AWS) is
>>>> wonderfully stable: I have no problem creating clusters.
>>>> But, with both 3.7 and 3.9, I just cannot start a cluster without
>>>> encountering at least one node with an empty /etc/cni/net.d.
>>>>
>>>> This applies to 3.7 and 3.9 on AWS and on two OpenStack providers. In all
>>>> cases the Ansible installer enters the "RUNNING HANDLER [openshift_node :
>>>> restart node]" task but this, for the vast majority of installations on
>>>> OpenStack and every single attempt on AWS, always fails. I’m worried that
>>>> I’ve got something clearly very wrong and have had to return to 3.6 to get
>>>> anything done.
>>>>
>>>> RUNNING HANDLER [openshift_node : restart openvswitch] ********************************************************************************
>>>> Friday 13 April 2018 13:19:09 +0100 (0:00:00.062) 0:09:28.744 **********
>>>> changed: [18.195.236.210]
>>>> changed: [18.195.126.190]
>>>> changed: [18.184.65.88]
>>>>
>>>> RUNNING HANDLER [openshift_node : restart openvswitch pause] **************************************************************************
>>>> Friday 13 April 2018 13:19:09 +0100 (0:00:00.720) 0:09:29.464 **********
>>>> skipping: [18.195.236.210]
>>>>
>>>> RUNNING HANDLER [openshift_node : restart node] ***************************************************************************************
>>>> Friday 13 April 2018 13:19:09 +0100 (0:00:00.036) 0:09:29.501 **********
>>>> FAILED - RETRYING: restart node (3 retries left).
>>>> FAILED - RETRYING: restart node (3 retries left).
>>>> FAILED - RETRYING: restart node (3 retries left).
>>>> FAILED - RETRYING: restart node (2 retries left).
>>>> FAILED - RETRYING: restart node (2 retries left).
>>>> FAILED - RETRYING: restart node (2 retries left).
>>>> FAILED - RETRYING: restart node (1 retries left).
>>>> FAILED - RETRYING: restart node (1 retries left).
>>>> FAILED - RETRYING: restart node (1 retries left).
>>>> fatal: [18.195.236.210]: FAILED! => {"attempts": 3, "changed": false, "msg": "Unable to restart service origin-node: Job for origin-node.service failed because the control process exited with error code. See \"systemctl status origin-node.service\" and \"journalctl -xe\" for details.\n"}
>>>> fatal: [18.195.126.190]: FAILED! => {"attempts": 3, "changed": false, "msg": "Unable to restart service origin-node: Job for origin-node.service failed because the control process exited with error code. See \"systemctl status origin-node.service\" and \"journalctl -xe\" for details.\n"}
>>>> fatal: [18.184.65.88]: FAILED! => {"attempts": 3, "changed": false, "msg": "Unable to restart service origin-node: Job for origin-node.service failed because the control process exited with error code. See \"systemctl status origin-node.service\" and \"journalctl -xe\" for details.\n"}
>>>>
>>>> When I jump onto a suspect node after the failure I find /etc/cni/net.d
>>>> is empty and the journal contains the message "No networks found in
>>>> /etc/cni/net.d"...
>>>>
>>>> -- The start-up result is done.
>>>> Apr 13 12:23:44 ip-10-0-0-61.eu-central-1.compute.internal origin-master-controllers[26728]: I0413 12:23:44.850154 26728 leaderelection.go:179] attempting to acquire leader lease...
>>>> Apr 13 12:23:44 ip-10-0-0-61.eu-central-1.compute.internal origin-node[26683]: W0413 12:23:44.933963 26683 cni.go:189] Unable to update cni config: No networks found in /etc/cni/net.d
>>>> Apr 13 12:23:44 ip-10-0-0-61.eu-central-1.compute.internal origin-node[26683]: E0413 12:23:44.934447 26683 kubelet.go:2112] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
>>>> Apr 13 12:23:47 ip-10-0-0-61.eu-central-1.compute.internal origin-node[26683]: W0413 12:23:47.947200 26683 sdn_controller.go:48] Could not find an allocated subnet for node: ip-10-0-0-61.eu-central-1.compute.internal, Waiting...
>>>>
>>>> Is anyone else seeing this and, more importantly, is there a clear cause
>>>> and solution?
>>>>
>>>> I cannot start 3.7 on AWS at all, despite tinkering with it for days, and
>>>> on OpenStack 3 out of 4 attempts fail. I just tried 3.9, found the same
>>>> failure on AWS, and have given up and returned to the wonderfully stable
>>>> 3.6.
>>>>
>>>> Alan Christie
>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> [email protected]
>>>> http://lists.openshift.redhat.com/openshiftmm/listinfo/users
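For anyone following the advice earlier in the thread, the log check amounts to filtering the master controllers' journal for the affected node's name. A minimal sketch, assuming the standard origin-master-controllers unit name; the `filter_node_events` helper and sample lines are illustrative, with the node name taken from the logs above:

```shell
# On a real master the raw command would be something like:
#   journalctl -u origin-master-controllers --no-pager | grep "$NODE"
# Here the grep step is factored into a helper so it can be tried on
# sample lines without a live cluster.
NODE="ip-10-0-0-61.eu-central-1.compute.internal"

filter_node_events() {
  # Keep only journal lines that mention the given node name (fixed-string match).
  grep -F "$1"
}

# Example run against two sample lines: only the subnet warning survives.
printf '%s\n' \
  "W0413 12:23:47.947200 sdn_controller.go:48] Could not find an allocated subnet for node: $NODE, Waiting..." \
  "I0413 12:23:44.850154 leaderelection.go:179] attempting to acquire leader lease..." \
  | filter_node_events "$NODE"
```

If nothing relevant shows up on the first master, repeat on the others; only the active controller instance handles subnet allocation.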
_______________________________________________
users mailing list
[email protected]
http://lists.openshift.redhat.com/openshiftmm/listinfo/users
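For readers hitting the same symptom, a quick way to spot a bad node before the installer exhausts its retries is to check whether /etc/cni/net.d is populated. A minimal sketch; the `check_cni` helper is illustrative (not part of OpenShift), and the demo uses a temporary directory standing in for /etc/cni/net.d, where a healthy openshift-sdn node normally has a file such as 80-openshift-sdn.conf:

```shell
# Report whether a CNI config directory contains any files.
# check_cni is an illustrative helper, not an OpenShift command.
check_cni() {
  local dir="$1"
  if [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
    echo "cni config present"
  else
    echo "cni config missing"   # corresponds to the node's "No networks found" warning
  fi
}

# Demo against a temp directory standing in for /etc/cni/net.d:
tmp=$(mktemp -d)
check_cni "$tmp"                        # -> cni config missing
touch "$tmp/80-openshift-sdn.conf"
check_cni "$tmp"                        # -> cni config present
```

On a node that reports "missing" after the install, the subnet-allocation check on the masters (described earlier in the thread) is the next step.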
