On Mon, Sep 19, 2016 at 6:40 PM, Srinivas Naga Kotaru (skotaru)
<[email protected]> wrote:

> Trying to understand: in which scenarios would all the instances of
> an application running in the cluster become unavailable?
>
> OS upgrade failure??
>
> OpenShift upgrade bugs/failures/downtime?
The best way to mitigate the risk from the first two is to upgrade
independent sets of Nodes in batches, so an unforeseen problem can't
take every instance down at once. These events should be rare if there
is sufficient monitoring in the environment. In the Origin 1.4 / OCP
3.4 timeframe it will be much easier to upgrade batches of Nodes. It's
possible today, but it takes a little more involvement with the
Ansible inventory (see the first sketch at the end of this mail). In
large environments with strict maintenance windows it's common to
update only one set of Nodes during each window.

> Router failures ??

This is likely the most common source of user-facing downtime; the
second sketch at the end of this mail shows one mitigation.

> Keepalive containers failed??

Unless that event triggered a failover to a pod that was actually in
outage, I don't think a failed keepalived pod would cause a
user-facing outage. The platform would spawn another (see the third
sketch below).

> Floating IP shared by keepalive container had issues??

If the floating IP were somehow in use by another interface on the
network, I'm certain bad things would happen.

> VXLAN bug or upgrade caused entire cluster network failure?

Catastrophic network failures could indeed cause a major outage.

> Human config error ( what those???)

Always. Best avoided by using a tool like Ansible and testing changes
in other environments before production.

> Is above list accurate? Can we think of any other possible scenarios
> where the whole application will be down in the cluster due to
> platform issues?

I would mention downtime caused by load. Anecdotally, this is probably
the second most common cause of downtime. It often comes back to human
error and a lack of monitoring: the more densely the platform
operators pack the environment, the more rigor is needed in
monitoring. This could simply be an error by the pod owner as well:
e.g., the JVM inside the pod might be online, yet the application
running in it might be throwing out-of-memory errors due to
incorrectly assigned limits (see the fourth sketch below).
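(1) Batching Node upgrades with today's tooling. A rough sketch of the
inventory-driven approach, assuming you split [nodes] into your own
groups; the group names and the upgrade playbook path here are
placeholders (the playbook location varies by openshift-ansible
release):

    # nodes_batch_a / nodes_batch_b are hypothetical groups you define
    # yourself in the inventory by subdividing [nodes].
    # Upgrade the first batch only; the rest keep serving traffic.
    ansible-playbook -i /etc/ansible/hosts \
        playbooks/byo/openshift-cluster/upgrades/upgrade.yml \
        --limit "nodes_batch_a"

    # Once the first batch checks out, repeat for the next one.
    ansible-playbook -i /etc/ansible/hosts \
        playbooks/byo/openshift-cluster/upgrades/upgrade.yml \
        --limit "nodes_batch_b"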
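(2) Router redundancy. Running more than one router replica keeps a
single router pod or host failure from being user-facing. A minimal
sketch, assuming the default router deploymentconfig created by the
installer:

    # Scale the default router to two replicas.
    oc scale dc/router --replicas=2

    # Verify the replicas actually landed on different nodes.
    oc get pods -o wide -l router=router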
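(3) keepalived / ipfailover redundancy. With two or more ipfailover
replicas, a failed keepalived pod just triggers a VIP failover to the
survivor while the platform respawns it. The name, VIP, and watched
port below are assumptions about your environment:

    # Two keepalived replicas watching the router port; the VIP moves
    # to the surviving pod if one dies.
    oadm ipfailover ipf-ha-router \
        --replicas=2 \
        --watch-port=80 \
        --virtual-ips="10.1.2.100" \
        --create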
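(4) Limits vs. JVM heap. For the out-of-memory case, set an explicit
container memory limit and size the heap to fit inside it. Sketch
only: dc/myapp is hypothetical, JAVA_OPTS depends on which variable
your image actually honors, and oc set resources needs a reasonably
recent client:

    # Give the (hypothetical) app an explicit memory limit...
    oc set resources dc/myapp --limits=memory=512Mi

    # ...and keep the JVM heap comfortably below it; an -Xmx at or
    # above the container limit is a classic source of OOM kills.
    oc set env dc/myapp JAVA_OPTS="-Xmx384m"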
