Brian,

That's fixed it. THANK YOU.

On 31/03/2020 17:05, Brian Jarvis wrote:
Hello Tim,

Each node has a client certificate that expires after one year.
Run "oc get csr"; you should see many pending requests, possibly thousands.

To clear those, run "oc get csr -o name | xargs oc adm certificate approve"
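
If approving every request in one go feels too broad, a more cautious variant is to approve only the requests that are still pending. The following is a minimal sketch, not a tested procedure: pending_csr_names is a hypothetical helper name, and it assumes the CONDITION value is the last column of the "oc get csr" text output, as it is on 3.11.

```shell
# Sketch only: pending_csr_names is a hypothetical helper, not part of oc.
# It reads the text output of "oc get csr --no-headers" on stdin and prints
# the name (first column) of every request whose last column is "Pending".
pending_csr_names() {
    awk '$NF == "Pending" { print $1 }'
}

# Intended use on a master (commented out because it changes cluster state;
# "xargs -r" is the GNU option that skips the command when there is no input):
#   oc get csr --no-headers | pending_csr_names | xargs -r -n 1 oc adm certificate approve
```

Running the pipeline up to pending_csr_names on its own first gives a list to review before anything is approved.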

One way to prevent this in the future is to deploy/enable the auto approver statefulset with the following command:

ansible-playbook -vvv -i [inventory_file] /usr/share/ansible/openshift-ansible/playbooks/openshift-master/enable_bootstrap.yml -e openshift_master_bootstrap_auto_approve=true

On Tue, Mar 31, 2020 at 11:53 AM Tim Dudgeon <tdudgeon...@gmail.com> wrote:

    Maybe an uncanny coincidence, but we think the cluster was
    created almost EXACTLY 1 year before it failed.

    On 31/03/2020 16:17, Ben Holmes wrote:
    Hi Tim,

    Can you verify that the hosts' clocks are being synced correctly
    as per Simon's other suggestion?

    Ben

    On Tue, 31 Mar 2020 at 16:05, Tim Dudgeon <tdudgeon...@gmail.com> wrote:

        Hi Simon,

        We've run those playbooks and all certs are reported as still
        being valid.

        Tim

        On 31/03/2020 15:59, Simon Krenger wrote:
        > Hi Tim,
        >
        > Note that there are multiple sets of certificates, both external and
        > internal. So it would be worth checking the certificates again using
        > the Certificate Expiration Playbooks (see link below). The
        > documentation also has an overview of what can be done to renew
        > certain certificates:
        >
        > - [ Redeploying Certificates ]
        > https://docs.okd.io/3.11/install_config/redeploying_certificates.html
        >
        > Apart from checking all certificates, I'd certainly review the time
        > synchronisation for the whole cluster, as we see the message "x509:
        > certificate has expired or is not yet valid".
        >
        > I hope this helps.
        >
        > Kind regards
        > Simon
        >
        > On Tue, Mar 31, 2020 at 4:33 PM Tim Dudgeon <tdudgeon...@gmail.com> wrote:
        >> One of our OKD 3.11 clusters has suddenly stopped working without any
        >> obvious reason.
        >>
        >> The origin-node service on the nodes does not start (times out).
        >> The master-api pod is running on the master.
        >> The nodes can access the master-api endpoints.
        >>
        >> The logs of the master-api pod look mostly OK other than a huge number
        >> of warnings about certificates that don't really make sense, as the
        >> certificates are valid (we use named certificates from Let's Encrypt;
        >> they were renewed about 2 weeks ago and all appear to be correct).
        >>
        >> Examples of errors from the master-api pod are:
        >>
        >> I0331 12:46:57.065147       1 establishing_controller.go:73] Starting EstablishingController
        >> I0331 12:46:57.065561       1 logs.go:49] http: TLS handshake error from 192.168.160.17:58024: EOF
        >> I0331 12:46:57.071932       1 logs.go:49] http: TLS handshake error from 192.168.160.19:48102: EOF
        >> I0331 12:46:57.072036       1 logs.go:49] http: TLS handshake error from 192.168.160.19:37178: EOF
        >> I0331 12:46:57.072141       1 logs.go:49] http: TLS handshake error from 192.168.160.17:58022: EOF
        >>
        >> E0331 12:47:37.855023       1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
        >> E0331 12:47:37.856569       1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
        >> E0331 12:47:44.115290       1 authentication.go:62] Unable to authenticate the request due to an error: [x509: certificate has expired or is not yet valid, x509: certificate has expired or is not yet valid]
        >> E0331 12:47:44.118976       1 authentication.go:62] Unable to authenticate the request due to an error: [x509: certificate has expired or is not yet valid, x509: certificate has expired or is not yet valid]
        >> E0331 12:47:44.122276       1 authentication.go:62] Unable to authenticate the request due to an error: [x509: certificate has expired or is not yet valid, x509: certificate has expired or is not yet valid]
        >>
        >> Huge number of this second sort.
        >>
        >> Any ideas what is wrong?
        >>
        >>
        >>
        >> _______________________________________________
        >> users mailing list
        >> users@lists.openshift.redhat.com
        >> http://lists.openshift.redhat.com/openshiftmm/listinfo/users
        >
        >




--
    BENJAMIN HOLMES

    SENIOR Solution ARCHITECT

    Red Hat UKI Presales <https://www.redhat.com/>

    bhol...@redhat.com M: 07876-885388




--


Brian Jarvis, RHCE

Technical Account Manager

Red Hat North America <https://www.redhat.com/>
Partnering with you to help achieve your business goals

bjar...@redhat.com

T: 631-685-7519 M: 610-587-1736


