Discussion ongoing in a new GitHub issue:
https://github.com/openshift/openshift-ansible/issues/5088

On Tue, Aug 15, 2017 at 8:44 AM, Tim Bielawa <[email protected]> wrote:

> Tim,
>
> Can you please provide more information? Your full inventory would be very
> useful right now for debugging. Feel free to mask your hostnames if you
> wish. What I need to see to debug this further are all the parameters
> you're setting in the [OSEv3] section and applying to each host in
> [masters] and [nodes].
>
> You will find my GPG public key fingerprint in my signature if you wish to
> encrypt the inventory file instead.
>
> As for those two stalls you mentioned:
>
> "Ensure OpenShift <THING> correctly rolls out (best-effort today)"
>
>
> The delays you experienced are normal and expected. Those delays are
> typically because the pod images were being downloaded to your hosts.
> However, you showed your 'oc get nodes' output and I noticed your master
> said "Ready,SchedulingDisabled". Because your master is labeled as
> 'SchedulingDisabled' then your master should *NOT* be running any pods. In
> which case that means it wasn't downloading pod images.
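>
> As an aside: you can watch a stalled router rollout yourself from the
> master while the installer waits. Assuming a default Origin install
> (the router runs in the 'default' project) something like this works:
>
> oc rollout status dc/router -n default
>
> And if you *do* want your master to run pods, it can be made
> schedulable again with, e.g.:
>
> oadm manage-node <master-hostname> --schedulable=true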
>
> Can you please provide the following information:
>
> * The output from `oc get all` on your master
> * The output of `docker images` on your node *AND* your master
> * Your complete inventory file. As I said before, feel free to mask your
> hostnames or IPs if you prefer.
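>
> For example (the file paths are just suggestions):
>
> # on the master
> oc get all | tee /tmp/oc-get-all.txt
> docker images | tee /tmp/docker-images-master.txt
>
> # on the node
> docker images | tee /tmp/docker-images-node.txt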
>
> Your logs would also be helpful. Ensure you run ansible-playbook with the
> -vv option for extra verbosity. You can do this in two ways:
>
> 1) If you run the install again you can set:
>
> log_path = /tmp/ansible.log
>
>
> in the [defaults] section of your ansible.cfg file.
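>
> For example, a minimal ansible.cfg with logging enabled might look
> like this (the inventory path is just a placeholder):
>
> [defaults]
> inventory = /path/to/your/inventory
> log_path = /tmp/ansible.log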
>
> 2) Alternatively you can capture the output of ansible using the `tee`
> command like so:
>
> ansible-playbook -vv -i <INVENTORY> ./playbooks/byo/config.yml | tee /tmp/ansible.log
>
>
> Again, if you wish to keep this information private, my GPG key is in my
> signature. Short ID is 0333AE37.
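>
> If you want to encrypt, something like this should work (assuming gpg
> is installed and a public keyserver is reachable):
>
> gpg --recv-keys 0333AE37
> gpg --encrypt --armor --recipient 0333AE37 <INVENTORY>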
>
>
> Thanks!
>
>
>
> On Tue, Aug 15, 2017 at 6:15 AM, Tim Dudgeon <[email protected]>
> wrote:
>
>> Thanks for the response, and sorry for the delay on my end - I've been
>> away for a week.
>>
>> I ran through the process again and got the same result. On the node it
>> looks like the openshift services are running OK:
>>
>> systemctl list-units --all | grep -i origin
>>   origin-node.service loaded active running OpenShift Node
>>
>> But from the master the node has not joined the cluster:
>>
>> oc get nodes
>> NAME                                                           STATUS                     AGE       VERSION
>> 2c0e37ab-f41e-40f1-a466-a575c85823b6.priv.cloud.scaleway.com   Ready,SchedulingDisabled   26m       v1.6.1+5115d708d7
>>
>> The install process seems to have gone OK. There were no obvious errors,
>> though it did twice stall at a point like this:
>>
>> ### TASK [openshift_hosted : Ensure OpenShift router correctly rolls out (best-effort today)] ******************
>>
>> But after waiting for about 5-10 mins it continued.
>>
>> There were a lot of 'skipping' messages during the install, but no obvious
>> errors. The output was huge and not captured to a file, so I'd have to run
>> it again to try to get a full log.
>>
>> Any thoughts as to what is wrong?
>>
>> Tim
>>
>> On 04/08/2017 16:07, Tim Bielawa wrote:
>>
>> (reposting: forgot to reply-all the first time)
>>
>>
>> Just based on the number of tasks your summary says completed, I am not
>> sure your installation actually completed in full. I expect to see upwards
>> of one to two thousand tasks.
>>
>>
>> A while back we changed node integration behavior such that if a node
>> fails to provision it does not stop your entire installation. This is to
>> ease the pain felt when provisioning large (hundred+) node clusters.
>>
>> <private node1 dns name> : ok=235  changed=56 unreachable=0    failed=0
>>
>>
>> That node did not fully install. Open a shell on that node and check the
>> openshift services. I'm willing to bet that
>>
>> systemctl list-units --all | grep -i origin
>>
>>
>> would show the node service is not running. Find the name of the node
>> service and then examine the journal logs for that node:
>>
>> journalctl -x -u <node-service-name>
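>>
>> (If this is an Origin install the unit is likely origin-node.service,
>> so for example:)
>>
>> journalctl -x -u origin-node.service --no-pager | tail -n 100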
>>
>>
>>
>> I think we (the openshift-ansible team) will want to add detection of
>> failed node integrations into our error summary report in the future. Would
>> you mind opening an issue for this on our GitHub page with this
>> information?
>>
>>
>> Thanks!
>>
>>
>>
>> On Sun, Jul 30, 2017 at 10:57 AM, Tim Dudgeon <[email protected]>
>> wrote:
>>
>>> I'm trying to get to grips with the advanced (Ansible) installer.
>>> Initially I'm trying to do something very simple, fire up a cluster with
>>> one master and one node.
>>> My inventory file looks like this:
>>>
>>> [OSEv3:children]
>>> masters
>>> nodes
>>>
>>>
>>> [OSEv3:vars]
>>> ansible_ssh_user=root
>>> openshift_hostname=<private master dns name>
>>> openshift_master_cluster_hostname=<private master dns name>
>>> openshift_master_cluster_public_hostname=<public master dns name>
>>> openshift_disable_check=docker_storage,memory_availability
>>> openshift_deployment_type=origin
>>>
>>> [masters]
>>> <private master dns name>
>>>
>>> [etcd]
>>> <private master dns name>
>>>
>>>
>>> [nodes]
>>> <private master dns name>
>>> <private node1 dns name>
>>>
>>>
>>> I run:
>>> ansible-playbook ~/openshift-ansible/playbooks/byo/config.yml
>>> and (after a long time) it completes, without any noticeable errors:
>>>
>>> ...
>>> PLAY RECAP *********************************************************************
>>> <private node1 dns name>   : ok=235  changed=56   unreachable=0  failed=0
>>> <private master dns name>  : ok=623  changed=166  unreachable=0  failed=0
>>> localhost                  : ok=12   changed=0    unreachable=0  failed=0
>>>
>>> Both nodes seem to have been set up OK.
>>> But when I look on the master node there is only the master in the
>>> cluster, no second node:
>>>
>>> oc get nodes
>>> NAME                        STATUS                     AGE
>>> <private master dns name>   Ready,SchedulingDisabled   32m
>>>
>>> and of course, like this, nothing can get scheduled.
>>>
>>> Presumably the node should be added to the cluster, so any ideas what is
>>> going wrong here?
>>>
>>> Thanks
>>> Tim
>>>
>>>
>>
>>
>>
>> --
>> Tim Bielawa, Sr. Software Engineer [ED-C137]
>> IRC: tbielawa (#openshift)
>> 1BA0 4FAB 4C13 FBA0 A036  4958 AD05 E75E 0333 AE37
>>
>>
>>
>
>
> --
> Tim Bielawa, Sr. Software Engineer [ED-C137]
> IRC: tbielawa (#openshift)
> 1BA0 4FAB 4C13 FBA0 A036  4958 AD05 E75E 0333 AE37
>



-- 
Tim Bielawa, Sr. Software Engineer [ED-C137]
Cell: 919.332.6411  | IRC: tbielawa (#openshift)
1BA0 4FAB 4C13 FBA0 A036  4958 AD05 E75E 0333 AE37
