The certificate expiration check playbook was recently updated to include this check for the nodes [0].
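
For reference, the manual fix quoted further down ("oc get csr -o name | xargs oc adm certificate approve") approves every CSR the API server lists. A more cautious variant is sketched below. It assumes the tabular output of "oc get csr --no-headers" has the CSR name in the first column and the condition in the last column; the helper name pending_csr_names is made up for illustration and is not part of oc or openshift-ansible.

```shell
#!/bin/sh
# Hypothetical helper (not part of oc): print the names of CSRs whose last
# column (assumed to be CONDITION) is exactly "Pending".
# Reads the tabular output of `oc get csr --no-headers` on stdin.
pending_csr_names() {
  awk '$NF == "Pending" { print $1 }'
}

# Usage against a live cluster (needs cluster-admin; not executed here).
# `xargs -r` is the GNU option that skips the command when there is no input:
#   oc get csr --no-headers | pending_csr_names | xargs -r oc adm certificate approve
```

Filtering on the condition only narrows the list to requests that are still waiting; it does not verify that each request comes from a legitimate node, so review the list before approving if that matters in your environment.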
[0] https://github.com/openshift/openshift-ansible/pull/11967

On Tue, Mar 31, 2020 at 1:12 PM Tim Dudgeon <tdudgeon...@gmail.com> wrote:

> So thanks for the help in fixing this problem. Much appreciated.
>
> Having looked at it now after the event I have 2 concerns.
>
> 1. Whilst this is documented [1] the significance of this is not
> mentioned. Unless you do as described (either manually or automatically)
> your cluster will stop working 1 year after being deployed!
>
> 2. The playbooks that check certificate expiry [2] do not catch this
> problem.
>
> Thanks
> Tim
>
> [1]
> https://docs.okd.io/3.11/install_config/redeploying_certificates.html#cert-expiry-managing-csrs
>
> [2]
> https://docs.okd.io/3.11/install_config/redeploying_certificates.html#install-config-cert-expiry
>
>
> On 31/03/2020 17:05, Brian Jarvis wrote:
>
> Hello Tim,
>
> Each node has a client certificate that expires after one year.
> Run "oc get csr"; you should see many pending requests, possibly
> thousands.
>
> To clear those run "oc get csr -o name | xargs oc adm certificate approve"
>
> One way to prevent this in the future is to deploy/enable the auto
> approver statefulset with the following command.
> ansible-playbook -vvv -i [inventory_file]
> /usr/share/ansible/openshift-ansible/playbooks/openshift-master/enable_bootstrap.yml
> -e openshift_master_bootstrap_auto_approve=true
>
> On Tue, Mar 31, 2020 at 11:53 AM Tim Dudgeon <tdudgeon...@gmail.com>
> wrote:
>
>> Maybe an uncanny coincidence but we think the cluster was created
>> almost EXACTLY 1 year before it failed.
>>
>> On 31/03/2020 16:17, Ben Holmes wrote:
>>
>> Hi Tim,
>>
>> Can you verify that the hosts' clocks are being synced correctly as per
>> Simon's other suggestion?
>>
>> Ben
>>
>> On Tue, 31 Mar 2020 at 16:05, Tim Dudgeon <tdudgeon...@gmail.com> wrote:
>>
>>> Hi Simon,
>>>
>>> we've run those playbooks and all certs are reported as still being
>>> valid.
>>>
>>> Tim
>>>
>>> On 31/03/2020 15:59, Simon Krenger wrote:
>>> > Hi Tim,
>>> >
>>> > Note that there are multiple sets of certificates, both external and
>>> > internal. So it would be worth checking the certificates again using
>>> > the Certificate Expiration Playbooks (see link below). The
>>> > documentation also has an overview of what can be done to renew
>>> > certain certificates:
>>> >
>>> > - [ Redeploying Certificates ]
>>> > https://docs.okd.io/3.11/install_config/redeploying_certificates.html
>>> >
>>> > Apart from checking all certificates, I'd certainly review the time
>>> > synchronisation for the whole cluster, as we see the message "x509:
>>> > certificate has expired or is not yet valid".
>>> >
>>> > I hope this helps.
>>> >
>>> > Kind regards
>>> > Simon
>>> >
>>> > On Tue, Mar 31, 2020 at 4:33 PM Tim Dudgeon <tdudgeon...@gmail.com> wrote:
>>> >> One of our OKD 3.11 clusters has suddenly stopped working without any
>>> >> obvious reason.
>>> >>
>>> >> The origin-node service on the nodes does not start (times out).
>>> >> The master-api pod is running on the master.
>>> >> The nodes can access the master-api endpoints.
>>> >>
>>> >> The logs of the master-api pod look mostly OK other than a huge number
>>> >> of warnings about certificates that don't really make sense as the
>>> >> certificates are valid (we use named certificates from Let's Encrypt
>>> >> and they were renewed about 2 weeks ago and all appear to be correct).
>>> >>
>>> >> Examples of errors from the master-api pod are:
>>> >>
>>> >> I0331 12:46:57.065147 1 establishing_controller.go:73] Starting EstablishingController
>>> >> I0331 12:46:57.065561 1 logs.go:49] http: TLS handshake error from 192.168.160.17:58024: EOF
>>> >> I0331 12:46:57.071932 1 logs.go:49] http: TLS handshake error from 192.168.160.19:48102: EOF
>>> >> I0331 12:46:57.072036 1 logs.go:49] http: TLS handshake error from 192.168.160.19:37178: EOF
>>> >> I0331 12:46:57.072141 1 logs.go:49] http: TLS handshake error from 192.168.160.17:58022: EOF
>>> >>
>>> >> E0331 12:47:37.855023 1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
>>> >> E0331 12:47:37.856569 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
>>> >> E0331 12:47:44.115290 1 authentication.go:62] Unable to authenticate the request due to an error: [x509: certificate has expired or is not yet valid, x509: certificate has expired or is not yet valid]
>>> >> E0331 12:47:44.118976 1 authentication.go:62] Unable to authenticate the request due to an error: [x509: certificate has expired or is not yet valid, x509: certificate has expired or is not yet valid]
>>> >> E0331 12:47:44.122276 1 authentication.go:62] Unable to authenticate the request due to an error: [x509: certificate has expired or is not yet valid, x509: certificate has expired or is not yet valid]
>>> >>
>>> >> Huge number of this second sort.
>>> >>
>>> >> Any ideas what is wrong?
>>> >>
>>> >> _______________________________________________
>>> >> users mailing list
>>> >> users@lists.openshift.redhat.com
>>> >> http://lists.openshift.redhat.com/openshiftmm/listinfo/users
>>
>> --
>>
>> BENJAMIN HOLMES
>>
>> SENIOR Solution ARCHITECT
>>
>> Red Hat UKI Presales <https://www.redhat.com/>
>>
>> bhol...@redhat.com M: 07876-885388

--
Brian Jarvis, RHCE
Technical Account Manager
Red Hat North America <https://www.redhat.com/>
Partnering with you to help achieve your business goals
bjar...@redhat.com
T: 631-685-7519 M: 610-587-1736
_______________________________________________ users mailing list users@lists.openshift.redhat.com http://lists.openshift.redhat.com/openshiftmm/listinfo/users