On Mon, Sep 19, 2016 at 6:40 PM, Srinivas Naga Kotaru (skotaru)
<skot...@cisco.com> wrote:
> Trying to understand on which scenarios all the instances of an application
> running from cluster unavailable?
> OS upgrade failure??
> Openshift upgrade bugs/failures/downtime?

The best way to mitigate the risks from the first two is to upgrade
independent sets of Nodes in batches, which prevents downtime in the
event of unforeseen problems.  Such problems should be rare if there
is sufficient monitoring in the environment.

In the Origin 1.4, OCP 3.4 timeframe it will be much easier to upgrade
batches of Nodes.  It's possible today, but it takes a little more
involvement with the Ansible inventory.  In large environments with
strict maintenance windows it's common to only update a set of Nodes
during each window.
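
As a rough illustration of the batching idea, here is a minimal Python
sketch that splits a list of Nodes into fixed-size groups, one group
per maintenance window.  The node names and the batch size of 2 are
made up for the example, and the sketch only plans the batches; it
does not drive Ansible or touch the inventory.

```python
def batches(nodes, size):
    """Yield successive groups of at most `size` nodes."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for i in range(0, len(nodes), size):
        yield nodes[i:i + size]


# Hypothetical node names; in practice these would come from the
# Ansible inventory for the cluster.
nodes = ["node1", "node2", "node3", "node4", "node5"]

# Assumed batch size of 2 Nodes per maintenance window.
plan = list(batches(nodes, 2))

for window, group in enumerate(plan, start=1):
    print(f"window {window}: {', '.join(group)}")
```

Each group can then be upgraded and verified before the next one is
touched, so a bad upgrade only affects one batch.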

> Router failures ??

This is likely the most common source of user-facing downtime.

> Keepalive containers failed??

Unless the failure triggered a failover to a pod that was itself
down, I don't think a failed Keepalive pod would cause a user-facing
outage.  The platform would spawn another.

> Floating IP shared by keepalive container had issues??

If the floating IP were somehow in use by another interface on the
network, I'm certain bad things would happen.

> VXLAN bug or upgrade caused entire cluster network failure?

Catastrophic network failures could indeed cause a major outage.

> Human config error ( what those???)

Always.  Best avoided by using a tool like Ansible and testing changes
in other environments before production.

> Is above list accurate? Can we think of any other possible scenarios where
> whole application will be down in cluster due to platform issues?

I would mention downtime caused by load.  Anecdotally, this is
probably the second most common cause of downtime.  It often relates
to human error and a lack of monitoring.  The more densely the
platform operators wish to pack the environment, the more rigor is
needed in monitoring.

This could simply be an error on the part of the pod owner as well.
E.g., the JVM inside the pod might be online while the application
running in the JVM is throwing out-of-memory errors because its
limits were assigned incorrectly.
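
One way a pod owner might sanity-check this is to compare the JVM's
maximum heap against the pod's memory limit.  The Python sketch below
is only an illustration: the function name and the 25% headroom
reserved for non-heap JVM memory (metaspace, thread stacks and so on)
are my own assumptions, not a recommendation from the platform.

```python
def heap_fits(limit_mib, heap_mib, headroom=0.25):
    """Return True if the max heap leaves `headroom` of the pod's
    memory limit free for non-heap JVM memory.

    `headroom` is an assumed fraction chosen for illustration only.
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    return heap_mib <= limit_mib * (1 - headroom)


# Hypothetical values: pod memory limit and JVM max heap, in MiB.
print(heap_fits(1024, 512))  # heap well under the limit
print(heap_fits(512, 512))   # heap equal to the limit, no headroom
```

A check like this catches only one kind of mismatch (heap too large
for the limit); a heap set too small for the workload would need
application-level monitoring to spot.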

> --
> Srinivas Kotaru
> _______________________________________________
> users mailing list
> users@lists.openshift.redhat.com
> http://lists.openshift.redhat.com/openshiftmm/listinfo/users
