Thanks for the help in fixing this problem. Much appreciated.
Having looked at it now after the event, I have two concerns.
1. Whilst this is documented [1], its significance is not mentioned. Unless
you do as described (either manually or automatically), your cluster will
stop working one year after being deployed!
2. The playbooks that check certificate expiry [2] do not catch this problem.
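For what it's worth, the node client certificates can also be checked by hand with openssl. The sketch below is illustrative rather than from the docs: it generates a throwaway self-signed certificate so it can run anywhere, and the node certificate path mentioned in the comment (/etc/origin/node/certificates/kubelet-client-current.pem) is my assumption for a 3.11 node, so verify it locally before relying on it:

```shell
# Generate a throwaway self-signed cert valid for 365 days to stand in for
# a node client certificate. On a real 3.11 node the file to inspect is
# (as far as I can tell) /etc/origin/node/certificates/kubelet-client-current.pem.
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/demo.key \
    -out /tmp/demo.crt -days 365 -subj "/CN=system:node:demo" 2>/dev/null

# Print the expiry date of the certificate.
openssl x509 -in /tmp/demo.crt -noout -enddate

# Exit non-zero if the cert expires within the next 30 days (2592000 s),
# which makes this easy to wire into a cron job or monitoring check.
openssl x509 -in /tmp/demo.crt -noout -checkend 2592000 \
    && echo "OK: more than 30 days left" \
    || echo "WARNING: expires within 30 days"
```

Running the same two `openssl x509` commands against each node's client certificate would have flagged this expiry even though the playbooks missed it.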
Thanks
Tim
[1] https://docs.okd.io/3.11/install_config/redeploying_certificates.html#cert-expiry-managing-csrs
[2] https://docs.okd.io/3.11/install_config/redeploying_certificates.html#install-config-cert-expiry
On 31/03/2020 17:05, Brian Jarvis wrote:
Hello Tim,
Each node has a client certificate that expires after one year.
Run "oc get csr" and you should see many pending requests, possibly thousands.
To clear those, run "oc get csr -o name | xargs oc adm certificate approve"
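A slightly safer variant of that one-liner approves only CSRs still in the Pending state, leaving already-processed ones alone. Since I can't run against a live cluster here, the filter is demonstrated on a captured sample of `oc get csr` output; on a real cluster you would pipe the names into `oc adm certificate approve` instead of printing them:

```shell
# Sample of what `oc get csr` prints (NAME AGE REQUESTOR CONDITION);
# the fourth column is the condition we filter on.
cat > /tmp/csr.txt <<'EOF'
NAME        AGE   REQUESTOR                         CONDITION
csr-2s6xk   10d   system:node:node1.example.com     Pending
csr-7wb4z   10d   system:node:node2.example.com     Pending
csr-old01   90d   system:node:node1.example.com     Approved,Issued
EOF

# Select only the Pending CSR names (skipping the header row). On a live
# cluster: oc get csr | awk '$4 == "Pending" { print $1 }' | xargs oc adm certificate approve
awk 'NR > 1 && $4 == "Pending" { print $1 }' /tmp/csr.txt
# prints:
# csr-2s6xk
# csr-7wb4z
```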
One way to prevent this in the future is to deploy/enable the auto-approver
statefulset with the following command:

ansible-playbook -vvv -i [inventory_file] \
  /usr/share/ansible/openshift-ansible/playbooks/openshift-master/enable_bootstrap.yml \
  -e openshift_master_bootstrap_auto_approve=true
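After running the playbook it is worth confirming that the auto-approver actually came up. The pod name and namespace below (bootstrap-autoapprover in openshift-infra) are my assumption about what the playbook deploys, so check them against your cluster; since no live cluster is available here, the check is demonstrated on sample output:

```shell
# Sample of what `oc get pods -n openshift-infra` might print after the
# playbook runs (pod and namespace names are assumptions, verify locally).
cat > /tmp/pods.txt <<'EOF'
NAME                       READY   STATUS    RESTARTS   AGE
bootstrap-autoapprover-0   1/1     Running   0          5m
EOF

# On a live cluster: oc get pods -n openshift-infra | grep bootstrap-autoapprover
grep -c 'bootstrap-autoapprover.*Running' /tmp/pods.txt
```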
On Tue, Mar 31, 2020 at 11:53 AM Tim Dudgeon <tdudgeon...@gmail.com> wrote:
Maybe an uncanny coincidence, but we think the cluster was created almost
EXACTLY 1 year before it failed.
On 31/03/2020 16:17, Ben Holmes wrote:
Hi Tim,
Can you verify that the hosts' clocks are being synced correctly, as per
Simon's other suggestion?
Ben
On Tue, 31 Mar 2020 at 16:05, Tim Dudgeon <tdudgeon...@gmail.com> wrote:
Hi Simon,
we've run those playbooks and all certs are reported as still being valid.
Tim
On 31/03/2020 15:59, Simon Krenger wrote:
> Hi Tim,
>
> Note that there are multiple sets of certificates, both external and
> internal. So it would be worth checking the certificates again using
> the Certificate Expiration Playbooks (see link below). The
> documentation also has an overview of what can be done to renew
> certain certificates:
>
> - [ Redeploying Certificates ]
>
> https://docs.okd.io/3.11/install_config/redeploying_certificates.html
>
> Apart from checking all certificates, I'd certainly review the time
> synchronisation for the whole cluster, as we see the message "x509:
> certificate has expired or is not yet valid".
>
> I hope this helps.
>
> Kind regards
> Simon
>
> On Tue, Mar 31, 2020 at 4:33 PM Tim Dudgeon <tdudgeon...@gmail.com> wrote:
>> One of our OKD 3.11 clusters has suddenly stopped working without any
>> obvious reason.
>>
>> The origin-node service on the nodes does not start (times out).
>> The master-api pod is running on the master.
>> The nodes can access the master-api endpoints.
>>
>> The logs of the master-api pod look mostly OK other than a huge number
>> of warnings about certificates that don't really make sense, as the
>> certificates are valid (we use named certificates from Let's Encrypt;
>> they were renewed about 2 weeks ago and all appear to be correct).
>>
>> Examples of errors from the master-api pod are:
>>
>> I0331 12:46:57.065147 1 establishing_controller.go:73] Starting EstablishingController
>> I0331 12:46:57.065561 1 logs.go:49] http: TLS handshake error from 192.168.160.17:58024: EOF
>> I0331 12:46:57.071932 1 logs.go:49] http: TLS handshake error from 192.168.160.19:48102: EOF
>> I0331 12:46:57.072036 1 logs.go:49] http: TLS handshake error from 192.168.160.19:37178: EOF
>> I0331 12:46:57.072141 1 logs.go:49] http: TLS handshake error from 192.168.160.17:58022: EOF
>>
>> E0331 12:47:37.855023 1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
>> E0331 12:47:37.856569 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
>> E0331 12:47:44.115290 1 authentication.go:62] Unable to authenticate the request due to an error: [x509: certificate has expired or is not yet valid, x509: certificate has expired or is not yet valid]
>> E0331 12:47:44.118976 1 authentication.go:62] Unable to authenticate the request due to an error: [x509: certificate has expired or is not yet valid, x509: certificate has expired or is not yet valid]
>> E0331 12:47:44.122276 1 authentication.go:62] Unable to authenticate the request due to an error: [x509: certificate has expired or is not yet valid, x509: certificate has expired or is not yet valid]
>>
>> There are a huge number of errors of this second sort.
>>
>> Any ideas what is wrong?
>>
>>
>>
>> _______________________________________________
>> users mailing list
>> users@lists.openshift.redhat.com
>> http://lists.openshift.redhat.com/openshiftmm/listinfo/users
>
>
--
Ben Holmes
Senior Solution Architect
Red Hat UKI Presales
bhol...@redhat.com  M: 07876-885388
--
Brian Jarvis, RHCE
Technical Account Manager
Red Hat North America
Partnering with you to help achieve your business goals
bjar...@redhat.com
T: 631-685-7519  M: 610-587-1736