Hello Tim,

Each node has a client certificate that expires after one year. Run "oc get csr"; you should see many pending requests, possibly thousands.
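A minimal sketch for confirming this before approving anything. The helper names below are hypothetical (they are not part of oc), the default certificate path is an assumption for an OKD 3.11 node and may differ on your hosts, and the count assumes that CONDITION is the last column of the "oc get csr" output.

```shell
#!/bin/sh
# Hypothetical helpers (not part of oc) for confirming the diagnosis.

# Print the expiry date (notAfter) of a PEM certificate.
# The default path is an assumption for an OKD 3.11 node; adjust as needed.
cert_end_date() {
  openssl x509 -noout -enddate \
    -in "${1:-/etc/origin/node/certificates/kubelet-client-current.pem}"
}

# Count CSRs whose last column (CONDITION) is exactly "Pending".
# Reads "oc get csr" output on stdin; the header row never matches.
count_pending_csrs() {
  awk '$NF == "Pending" { n++ } END { print n + 0 }'
}

# Usage, on a node and on a master respectively:
#   cert_end_date
#   oc get csr | count_pending_csrs
```

If the expiry date is in the past and the pending count is large, that is consistent with expired node client certificates waiting on approval.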
To clear the pending requests, run:

    oc get csr -o name | xargs oc adm certificate approve

One way to prevent this in the future is to deploy/enable the auto-approver statefulset with the following command:

    ansible-playbook -vvv -i [inventory_file] /usr/share/ansible/openshift-ansible/playbooks/openshift-master/enable_bootstrap.yml -e openshift_master_bootstrap_auto_approve=true

On Tue, Mar 31, 2020 at 11:53 AM Tim Dudgeon <tdudgeon...@gmail.com> wrote:

> Maybe an uncanny coincidence, but we think the cluster was created almost
> EXACTLY 1 year before it failed.
>
> On 31/03/2020 16:17, Ben Holmes wrote:
>
> Hi Tim,
>
> Can you verify that the hosts' clocks are being synced correctly, as per
> Simon's other suggestion?
>
> Ben
>
> On Tue, 31 Mar 2020 at 16:05, Tim Dudgeon <tdudgeon...@gmail.com> wrote:
>
>> Hi Simon,
>>
>> we've run those playbooks and all certs are reported as still being valid.
>>
>> Tim
>>
>> On 31/03/2020 15:59, Simon Krenger wrote:
>> > Hi Tim,
>> >
>> > Note that there are multiple sets of certificates, both external and
>> > internal. So it would be worth checking the certificates again using
>> > the Certificate Expiration Playbooks (see link below). The
>> > documentation also has an overview of what can be done to renew
>> > certain certificates:
>> >
>> > - [ Redeploying Certificates ]
>> >   https://docs.okd.io/3.11/install_config/redeploying_certificates.html
>> >
>> > Apart from checking all certificates, I'd certainly review the time
>> > synchronisation for the whole cluster, as we see the message "x509:
>> > certificate has expired or is not yet valid".
>> >
>> > I hope this helps.
>> >
>> > Kind regards
>> > Simon
>> >
>> > On Tue, Mar 31, 2020 at 4:33 PM Tim Dudgeon <tdudgeon...@gmail.com> wrote:
>> >> One of our OKD 3.11 clusters has suddenly stopped working without any
>> >> obvious reason.
>> >>
>> >> The origin-node service on the nodes does not start (times out).
>> >> The master-api pod is running on the master.
>> >> The nodes can access the master-api endpoints.
>> >>
>> >> The logs of the master-api pod look mostly OK other than a huge number
>> >> of warnings about certificates that don't really make sense, as the
>> >> certificates are valid (we use named certificates from Let's Encrypt;
>> >> they were renewed about 2 weeks ago and all appear to be correct).
>> >>
>> >> Examples of errors from the master-api pod are:
>> >>
>> >> I0331 12:46:57.065147 1 establishing_controller.go:73] Starting EstablishingController
>> >> I0331 12:46:57.065561 1 logs.go:49] http: TLS handshake error from 192.168.160.17:58024: EOF
>> >> I0331 12:46:57.071932 1 logs.go:49] http: TLS handshake error from 192.168.160.19:48102: EOF
>> >> I0331 12:46:57.072036 1 logs.go:49] http: TLS handshake error from 192.168.160.19:37178: EOF
>> >> I0331 12:46:57.072141 1 logs.go:49] http: TLS handshake error from 192.168.160.17:58022: EOF
>> >>
>> >> E0331 12:47:37.855023 1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
>> >> E0331 12:47:37.856569 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
>> >> E0331 12:47:44.115290 1 authentication.go:62] Unable to authenticate the request due to an error: [x509: certificate has expired or is not yet valid, x509: certificate has expired or is not yet valid]
>> >> E0331 12:47:44.118976 1 authentication.go:62] Unable to authenticate the request due to an error: [x509: certificate has expired or is not yet valid, x509: certificate has expired or is not yet valid]
>> >> E0331 12:47:44.122276 1 authentication.go:62] Unable to authenticate the request due to an error: [x509: certificate has expired or is not yet valid, x509: certificate has expired or is not yet valid]
>> >>
>> >> Huge number of this second sort.
>> >>
>> >> Any ideas what is wrong?
>
> --
> Benjamin Holmes
> Senior Solution Architect
> Red Hat UKI Presales
> bhol...@redhat.com

--
Brian Jarvis, RHCE
Technical Account Manager
Red Hat North America
bjar...@redhat.com
_______________________________________________ users mailing list users@lists.openshift.redhat.com http://lists.openshift.redhat.com/openshiftmm/listinfo/users