Hello Tim,

Each node has a client certificate that expires after one year.
Run "oc get csr" and you should see many pending requests, possibly thousands.

To clear those, run "oc get csr -o name | xargs oc adm certificate approve".
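
The command above passes every CSR to the approve step, including ones that are already approved. If you would rather touch only the pending ones, a small filter along these lines might help. It is only a sketch: the helper name pending_csrs is made up for illustration, and it assumes the default "oc get csr" table layout where CONDITION is the last column.

```shell
# pending_csrs: read the table printed by "oc get csr" on stdin and
# print the NAME of every request whose last column is exactly "Pending".
# Assumption: default table output with CONDITION as the last column.
pending_csrs() {
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Possible usage on a master (not run here):
#   oc get csr | pending_csrs | xargs oc adm certificate approve
```

Review the list of names printed by "oc get csr | pending_csrs" before piping it into the approve command, since approving a CSR you do not recognise grants that requester a certificate.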

One way to prevent this in the future is to deploy/enable the auto-approver
statefulset with the following command:

ansible-playbook -vvv -i [inventory_file] \
  /usr/share/ansible/openshift-ansible/playbooks/openshift-master/enable_bootstrap.yml \
  -e openshift_master_bootstrap_auto_approve=true
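
To confirm on a node that the client certificate really is what expired, a check like the following could be used. This is a sketch under assumptions: cert_status is a hypothetical helper name, and the certificate path in the usage comment is where I would expect the node client certificate on a 3.11 node, so adjust it if your layout differs.

```shell
# cert_status FILE: print "valid", "expired" or "unreadable" for a PEM certificate.
cert_status() {
  if [ ! -r "$1" ]; then
    echo unreadable
    return 0
  fi
  # "openssl x509 -checkend 0" exits 0 only if the certificate has not expired yet.
  if openssl x509 -checkend 0 -noout -in "$1" >/dev/null 2>&1; then
    echo valid
  else
    echo expired
  fi
}

# Possible usage on a node (path is an assumption, not run here):
#   cert_status /etc/origin/node/certificates/kubelet-client-current.pem
#   openssl x509 -noout -enddate -in /etc/origin/node/certificates/kubelet-client-current.pem
```

Note that a file that is readable but not a certificate is also reported as "expired", so check the -enddate output if the result looks surprising.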

On Tue, Mar 31, 2020 at 11:53 AM Tim Dudgeon <tdudgeon...@gmail.com> wrote:

> Maybe an uncanny coincidence but we think the cluster was created almost
> EXACTLY 1 year before it failed.
> On 31/03/2020 16:17, Ben Holmes wrote:
>
> Hi Tim,
>
> Can you verify that the host's clocks are being synced correctly as per
> Simon's other suggestion?
>
> Ben
>
> On Tue, 31 Mar 2020 at 16:05, Tim Dudgeon <tdudgeon...@gmail.com> wrote:
>
>> Hi Simon,
>>
>> we've run those playbooks and all certs are reported as still being valid.
>>
>> Tim
>>
>> On 31/03/2020 15:59, Simon Krenger wrote:
>> > Hi Tim,
>> >
>> > Note that there are multiple sets of certificates, both external and
>> > internal. So it would be worth checking the certificates again using
>> > the Certificate Expiration Playbooks (see link below). The
>> > documentation also has an overview of what can be done to renew
>> > certain certificates:
>> >
>> > - [ Redeploying Certificates ]
>> >   https://docs.okd.io/3.11/install_config/redeploying_certificates.html
>> >
>> > Apart from checking all certificates, I'd certainly review the time
>> > synchronisation for the whole cluster, as we see the message "x509:
>> > certificate has expired or is not yet valid".
>> >
>> > I hope this helps.
>> >
>> > Kind regards
>> > Simon
>> >
>> > On Tue, Mar 31, 2020 at 4:33 PM Tim Dudgeon <tdudgeon...@gmail.com> wrote:
>> >> One of our OKD 3.11 clusters has suddenly stopped working without any
>> >> obvious reason.
>> >>
>> >> The origin-node service on the nodes does not start (times out).
>> >> The master-api pod is running on the master.
>> >> The nodes can access the master-api endpoints.
>> >>
>> >> The logs of the master-api pod look mostly OK other than a huge number
>> >> of warnings about certificates that don't really make sense as the
>> >> certificates are valid (we use named certificates from Let's Encrypt;
>> >> they were renewed about 2 weeks ago and all appear to be correct).
>> >>
>> >> Examples of errors from the master-api pod are:
>> >>
>> >> I0331 12:46:57.065147       1 establishing_controller.go:73] Starting EstablishingController
>> >> I0331 12:46:57.065561       1 logs.go:49] http: TLS handshake error from 192.168.160.17:58024: EOF
>> >> I0331 12:46:57.071932       1 logs.go:49] http: TLS handshake error from 192.168.160.19:48102: EOF
>> >> I0331 12:46:57.072036       1 logs.go:49] http: TLS handshake error from 192.168.160.19:37178: EOF
>> >> I0331 12:46:57.072141       1 logs.go:49] http: TLS handshake error from 192.168.160.17:58022: EOF
>> >>
>> >> E0331 12:47:37.855023       1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
>> >> E0331 12:47:37.856569       1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
>> >> E0331 12:47:44.115290       1 authentication.go:62] Unable to authenticate the request due to an error: [x509: certificate has expired or is not yet valid, x509: certificate has expired or is not yet valid]
>> >> E0331 12:47:44.118976       1 authentication.go:62] Unable to authenticate the request due to an error: [x509: certificate has expired or is not yet valid, x509: certificate has expired or is not yet valid]
>> >> E0331 12:47:44.122276       1 authentication.go:62] Unable to authenticate the request due to an error: [x509: certificate has expired or is not yet valid, x509: certificate has expired or is not yet valid]
>> >>
>> >> Huge number of this second sort.
>> >>
>> >> Any ideas what is wrong?
>> >>
>> >>
>> >>
>> >> _______________________________________________
>> >> users mailing list
>> >> users@lists.openshift.redhat.com
>> >> http://lists.openshift.redhat.com/openshiftmm/listinfo/users
>> >
>> >
>>
>>
>>
>
> --
>
> BENJAMIN HOLMES
>
> SENIOR Solution ARCHITECT
>
> Red Hat UKI Presales <https://www.redhat.com/>
>
> bhol...@redhat.com    M: 07876-885388
>
>


-- 


Brian Jarvis, RHCE

Technical Account Manager

Red Hat North America <https://www.redhat.com/>
Partnering with you to help achieve your business goals

bjar...@redhat.com

T: 631-685-7519   M: 610-587-1736
