Re: How to recover from failed update in OpenShift 4.2.x?

2019-11-26 Thread Joel Pearson
On Thu, 21 Nov 2019 at 10:58, Clayton Coleman wrote:

>
>
> On Nov 17, 2019, at 9:34 PM, Joel Pearson wrote:
>
> So, I'm running OpenShift 4.2 on Azure UPI following this blog article:
> https://blog.openshift.com/openshift-4-1-upi-environment-deployment-on-microsoft-azure-cloud/
> with a few customisations on the Terraform side.
>
> One of the main differences, it seems, is how the router/ingress is
> handled. Normal Azure uses load balancers, but UPI Azure uses a regular
> router (the kind I'm used to seeing in 3.x), which is configured by
> setting the endpoint publishing strategy to "HostNetwork".
>
>
> This sounds like a bug in Azure UPI.  IPI is the reference architecture,
> it shouldn’t have a default divergent from the ref arch.
>

In the blog, the author mentions that he changed the architecture because
the default creates a public-facing load balancer.  In my case I'm not
allowed to create a public load balancer at all.  Additionally, I can't use
Azure's Public or Private DNS either, so I had to customise the Terraform
templates even more.

Maybe the supported Azure UPI will allow internal-facing load balancers?


>
>
> It was all working fine in OpenShift 4.2.0 and 4.2.2, but when I upgraded
> to OpenShift 4.2.4, the router stopped listening on ports 80 and 443. I
> could see the pod running with "crictl ps", but "netstat -tpln" didn't
> show anything listening.
>
> I tried updating the version back from 4.2.4 to 4.2.2, but I
> accidentally used the 4.1.22 image digest value, so I quickly reverted to
> 4.2.4 once I saw the apiservers coming up as 4.1.22.  I then noticed that
> there was a 4.2.7 release on the candidate-4.2 channel, so I switched to
> that, and ingress started working properly again.
>
> So my question is, what is the strategy for recovering from a failed
> update? Do I need to have etcd backups and then restore the cluster by
> restoring etcd? I.e.
> https://docs.openshift.com/container-platform/4.2/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html
>
> The upgrade page specifically says "Reverting your cluster to a previous
> version, or a rollback, is not supported. Only upgrading to a newer
> version is supported." So is it an expectation for a production cluster
> that you would restore from backup if the cluster isn't usable?
>
>
> Backup, yes.  If you could open a bug for the documentation that would be
> great.
>

Thanks, raised it here: https://bugzilla.redhat.com/show_bug.cgi?id=1777155


>
>
> Maybe the upgrade page should mention taking backups? Especially if there
> is no rollback option.
>
> ___
> users mailing list
> users@lists.openshift.redhat.com
> http://lists.openshift.redhat.com/openshiftmm/listinfo/users
>
>


Re: How to recover from failed update in OpenShift 4.2.x?

2019-11-20 Thread Clayton Coleman
On Nov 17, 2019, at 9:34 PM, Joel Pearson wrote:

So, I'm running OpenShift 4.2 on Azure UPI following this blog article:
https://blog.openshift.com/openshift-4-1-upi-environment-deployment-on-microsoft-azure-cloud/
with a few customisations on the Terraform side.

One of the main differences, it seems, is how the router/ingress is handled.
Normal Azure uses load balancers, but UPI Azure uses a regular router (the
kind I'm used to seeing in 3.x), which is configured by setting the endpoint
publishing strategy to "HostNetwork".



This sounds like a bug in Azure UPI.  IPI is the reference architecture, it
shouldn’t have a default divergent from the ref arch.


It was all working fine in OpenShift 4.2.0 and 4.2.2, but when I upgraded
to OpenShift 4.2.4, the router stopped listening on ports 80 and 443. I
could see the pod running with "crictl ps", but "netstat -tpln" didn't
show anything listening.

I tried updating the version back from 4.2.4 to 4.2.2, but I
accidentally used the 4.1.22 image digest value, so I quickly reverted to
4.2.4 once I saw the apiservers coming up as 4.1.22.  I then noticed that
there was a 4.2.7 release on the candidate-4.2 channel, so I switched to
that, and ingress started working properly again.

So my question is, what is the strategy for recovering from a failed
update? Do I need to have etcd backups and then restore the cluster by
restoring etcd? I.e.
https://docs.openshift.com/container-platform/4.2/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html

The upgrade page specifically says "Reverting your cluster to a previous
version, or a rollback, is not supported. Only upgrading to a newer version
is supported." So is it an expectation for a production cluster that you
would restore from backup if the cluster isn't usable?


Backup, yes.  If you could open a bug for the documentation that would be
great.


Maybe the upgrade page should mention taking backups? Especially if there
is no rollback option.



How to recover from failed update in OpenShift 4.2.x?

2019-11-17 Thread Joel Pearson
So, I'm running OpenShift 4.2 on Azure UPI following this blog article:
https://blog.openshift.com/openshift-4-1-upi-environment-deployment-on-microsoft-azure-cloud/
with a few customisations on the Terraform side.

One of the main differences, it seems, is how the router/ingress is handled.
Normal Azure uses load balancers, but UPI Azure uses a regular router (the
kind I'm used to seeing in 3.x), which is configured by setting the endpoint
publishing strategy to "HostNetwork".


It was all working fine in OpenShift 4.2.0 and 4.2.2, but when I upgraded
to OpenShift 4.2.4, the router stopped listening on ports 80 and 443. I
could see the pod running with "crictl ps", but "netstat -tpln" didn't
show anything listening.
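
The listening check described above can also be made from another machine by
attempting a TCP connection to the router node. The following is a minimal
sketch in Python, not part of any OpenShift tooling: the function names are
illustrative, and the host address is a placeholder to be replaced with the
node running the router pod.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


def check_router_ports(host: str, ports=(80, 443)) -> dict:
    """Map each port to whether something accepts connections on it."""
    return {port: is_listening(host, port) for port in ports}


if __name__ == "__main__":
    # Placeholder address (assumption): substitute the router node's IP.
    for port, ok in check_router_ports("127.0.0.1").items():
        print(f"port {port}: {'listening' if ok else 'NOT listening'}")
```

A successful connect only shows that something accepts connections on the
port; it does not confirm that the router is serving routes correctly.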

I tried updating the version back from 4.2.4 to 4.2.2, but I
accidentally used the 4.1.22 image digest value, so I quickly reverted to
4.2.4 once I saw the apiservers coming up as 4.1.22.  I then noticed that
there was a 4.2.7 release on the candidate-4.2 channel, so I switched to
that, and ingress started working properly again.
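
One way to guard against picking an older release by mistake (such as 4.1.22
while running 4.2.4) is to compare the version strings before applying an
update. The following is a minimal sketch in Python, assuming plain dotted
numeric versions like the ones in this thread. The helper names are
hypothetical, and the code does not talk to the cluster.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '4.2.4' into a tuple like (4, 2, 4)."""
    return tuple(int(part) for part in version.split("."))


def is_downgrade(current: str, target: str) -> bool:
    """Return True if target is an older release than current."""
    return parse_version(target) < parse_version(current)


def crosses_minor(current: str, target: str) -> bool:
    """Return True if target is in a different major.minor stream."""
    return parse_version(current)[:2] != parse_version(target)[:2]


if __name__ == "__main__":
    current = "4.2.4"
    for target in ("4.1.22", "4.2.2", "4.2.7"):
        if is_downgrade(current, target):
            print(f"{current} -> {target}: refuse, this is a downgrade")
        else:
            print(f"{current} -> {target}: ok, newer release")
```

Comparing integer tuples rather than raw strings matters here: as strings,
"4.2.10" would sort before "4.2.4".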

So my question is, what is the strategy for recovering from a failed
update? Do I need to have etcd backups and then restore the cluster by
restoring etcd? I.e.
https://docs.openshift.com/container-platform/4.2/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html

The upgrade page specifically says "Reverting your cluster to a previous
version, or a rollback, is not supported. Only upgrading to a newer version
is supported." So is it an expectation for a production cluster that you
would restore from backup if the cluster isn't usable?

Maybe the upgrade page should mention taking backups? Especially if there
is no rollback option.