Re: cluster stopped working - certificate problems

Brian Jarvis Tue, 31 Mar 2020 10:22:38 -0700

The certificate expiration check playbook was recently updated to include
this check for the nodes [0].


[0] https://github.com/openshift/openshift-ansible/pull/11967
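
For a quick manual check on a single node, the remaining lifetime of a
certificate can be estimated from the notAfter date that openssl prints.
This is only a sketch under stated assumptions: it relies on GNU date
(for "date -d"), the helper name days_until_expiry is hypothetical, and
the certificate path in the comment is an assumption about a typical
OKD 3.11 node, not something taken from the playbook.

```shell
# days_until_expiry NOT_AFTER [NOW_EPOCH]
# Prints the whole number of days between NOW_EPOCH (default: current time)
# and NOT_AFTER, a date string such as "Mar 31 12:00:00 2021 GMT".
# The division truncates toward zero, so 0 or a negative number means the
# certificate has expired or expires within a day.
# Assumes GNU date; runs in a subshell so no variables leak to the caller.
days_until_expiry() (
    not_after=$1
    now=${2:-$(date +%s)}
    expiry=$(date -d "$not_after" +%s) || exit 1
    echo $(( (expiry - now) / 86400 ))
)

# Hypothetical usage on a node (the path is an assumption, adjust as needed):
#   cert=/etc/origin/node/certificates/kubelet-client-current.pem
#   days_until_expiry "$(openssl x509 -enddate -noout -in "$cert" | cut -d= -f2)"
```

A result close to zero on the node client certificate would point to the
one-year expiry discussed in the quoted messages.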

On Tue, Mar 31, 2020 at 1:12 PM Tim Dudgeon <tdudgeon...@gmail.com> wrote:

> So thanks for the help in fixing this problem. Much appreciated.
>
> Having looked at it now after the event I have 2 concerns.
>
> 1. Whilst this is documented [1], its significance is not mentioned.
> Unless you do as described (either manually or automatically),
> your cluster will stop working 1 year after being deployed!
>
> 2. The playbooks that check certificate expiry [2] do not catch this
> problem.
>
> Thanks
> Tim
>
> [1]
> https://docs.okd.io/3.11/install_config/redeploying_certificates.html#cert-expiry-managing-csrs
>
> [2]
> https://docs.okd.io/3.11/install_config/redeploying_certificates.html#install-config-cert-expiry
>
>
> On 31/03/2020 17:05, Brian Jarvis wrote:
>
> Hello Tim,
>
> Each node has a client certificate that expires after one year.
> Run "oc get csr"; you should see many pending requests, possibly
> thousands.
>
> To clear those run "oc get csr -o name | xargs oc adm certificate approve"
>
> One way to prevent this in the future is to deploy/enable the auto
> approver statefulset with the following command.
> ansible-playbook -vvv -i [inventory_file]
> /usr/share/ansible/openshift-ansible/playbooks/openshift-master/enable_bootstrap.yml
> -e openshift_master_bootstrap_auto_approve=true
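
The approve command above is applied to every CSR that "oc get csr"
returns, including ones already approved. A minimal sketch that selects
only the pending requests is shown here. It assumes the default table
output of "oc get csr", where CONDITION is the last column, and the
helper name pending_csr_names is hypothetical.

```shell
# pending_csr_names
# Reads the default table output of `oc get csr` on stdin and prints the
# NAME column of every row whose last column (CONDITION) is exactly "Pending".
# The header row is skipped naturally because its last field is "CONDITION".
pending_csr_names() {
    awk '$NF == "Pending" { print $1 }'
}

# Hypothetical usage (-r is the GNU xargs flag that skips the command when
# there is no input):
#   oc get csr | pending_csr_names | xargs -r oc adm certificate approve
```

Reviewing the list of names before piping it to the approve command is a
reasonable precaution on a cluster where unexpected CSRs might appear.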
>
> On Tue, Mar 31, 2020 at 11:53 AM Tim Dudgeon <tdudgeon...@gmail.com>
> wrote:
>
>> Maybe an uncanny coincidence, but we think the cluster was created
>> almost EXACTLY 1 year before it failed.
>> On 31/03/2020 16:17, Ben Holmes wrote:
>>
>> Hi Tim,
>>
>> Can you verify that the hosts' clocks are being synced correctly as per
>> Simon's other suggestion?
>>
>> Ben
>>
>> On Tue, 31 Mar 2020 at 16:05, Tim Dudgeon <tdudgeon...@gmail.com> wrote:
>>
>>> Hi Simon,
>>>
>>> we've run those playbooks and all certs are reported as still being
>>> valid.
>>>
>>> Tim
>>>
>>> On 31/03/2020 15:59, Simon Krenger wrote:
>>> > Hi Tim,
>>> >
>>> > Note that there are multiple sets of certificates, both external and
>>> > internal. So it would be worth checking the certificates again using
>>> > the Certificate Expiration Playbooks (see link below). The
>>> > documentation also has an overview of what can be done to renew
>>> > certain certificates:
>>> >
>>> > - [ Redeploying Certificates ]
>>> >   https://docs.okd.io/3.11/install_config/redeploying_certificates.html
>>> >
>>> > Apart from checking all certificates, I'd certainly review the time
>>> > synchronisation for the whole cluster, as we see the message "x509:
>>> > certificate has expired or is not yet valid".
>>> >
>>> > I hope this helps.
>>> >
>>> > Kind regards
>>> > Simon
>>> >
>>> > On Tue, Mar 31, 2020 at 4:33 PM Tim Dudgeon <tdudgeon...@gmail.com> wrote:
>>> >> One of our OKD 3.11 clusters has suddenly stopped working without any
>>> >> obvious reason.
>>> >>
>>> >> The origin-node service on the nodes does not start (times out).
>>> >> The master-api pod is running on the master.
>>> >> The nodes can access the master-api endpoints.
>>> >>
>>> >> The logs of the master-api pod look mostly OK other than a huge number
>>> >> of warnings about certificates that don't really make sense, as the
>>> >> certificates are valid (we use named certificates from Let's Encrypt;
>>> >> they were renewed about 2 weeks ago and all appear to be correct).
>>> >>
>>> >> Examples of errors from the master-api pod are:
>>> >>
>>> >> I0331 12:46:57.065147       1 establishing_controller.go:73] Starting EstablishingController
>>> >> I0331 12:46:57.065561       1 logs.go:49] http: TLS handshake error from 192.168.160.17:58024: EOF
>>> >> I0331 12:46:57.071932       1 logs.go:49] http: TLS handshake error from 192.168.160.19:48102: EOF
>>> >> I0331 12:46:57.072036       1 logs.go:49] http: TLS handshake error from 192.168.160.19:37178: EOF
>>> >> I0331 12:46:57.072141       1 logs.go:49] http: TLS handshake error from 192.168.160.17:58022: EOF
>>> >>
>>> >> E0331 12:47:37.855023       1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
>>> >> E0331 12:47:37.856569       1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
>>> >> E0331 12:47:44.115290       1 authentication.go:62] Unable to authenticate the request due to an error: [x509: certificate has expired or is not yet valid, x509: certificate has expired or is not yet valid]
>>> >> E0331 12:47:44.118976       1 authentication.go:62] Unable to authenticate the request due to an error: [x509: certificate has expired or is not yet valid, x509: certificate has expired or is not yet valid]
>>> >> E0331 12:47:44.122276       1 authentication.go:62] Unable to authenticate the request due to an error: [x509: certificate has expired or is not yet valid, x509: certificate has expired or is not yet valid]
>>> >>
>>> >> Huge number of this second sort.
>>> >>
>>> >> Any ideas what is wrong?
>>> >>
>>> >>
>>> >>
>>> >> _______________________________________________
>>> >> users mailing list
>>> >> users@lists.openshift.redhat.com
>>> >> http://lists.openshift.redhat.com/openshiftmm/listinfo/users
>>> >
>>> >
>>>
>>>
>>>
>>
>> --
>>
>> BENJAMIN HOLMES
>>
>> SENIOR Solution ARCHITECT
>>
>> Red Hat UKI Presales <https://www.redhat.com/>
>>
>> bhol...@redhat.com    M: 07876-885388
>> <http://redhatemailsignature-marketing.itos.redhat.com/>
>> <https://red.ht/sig>
>>
>>
>
>
> --
>
>
> Brian Jarvis, RHCE
>
> Technical Account Manager
>
> Red Hat North America <https://www.redhat.com/>
> Partnering with you to help achieve your business goals
>
> bjar...@redhat.com
>
> T: 631-685-7519   M: 610-587-1736
>
>


-- 


Brian Jarvis, RHCE

Technical Account Manager

Red Hat North America <https://www.redhat.com/>
Partnering with you to help achieve your business goals

bjar...@redhat.com

T: 631-685-7519   M: 610-587-1736
