Re: OpenShift Origin 3.7 Template Broker seems super flaky

Ben Parees Mon, 08 Jan 2018 08:21:06 -0800

On Mon, Jan 8, 2018 at 12:21 AM, Joel Pearson <[email protected]> wrote:


> > The TemplateInstance object should have an ownerReference to a
> > BrokerTemplateInstance and that reference not being handled properly is the
> > bug.  If you remove that ownerRef from the TemplateInstance, you should be
> > safe from undesired deletion of the TemplateInstance (and the cascading
> > delete of everything else) (at least w/ respect to the bug we are aware of).
>
> Nice, that did the trick.
>
> I did an oc patch, and that fixed it:
>
> $ oc get templateinstance
> NAME                                   TEMPLATE
> b180d814-2917-4c7e-875f-b91e5d4743e8   jenkins-ephemeral
>
> $ oc patch templateinstance b180d814-2917-4c7e-875f-b91e5d4743e8 \
>     --type json -p='[{"op": "remove", "path": "/metadata/ownerReferences"}]'
> templateinstance "b180d814-2917-4c7e-875f-b91e5d4743e8" patched
>
>
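
The patch quoted above removes the whole ownerReferences field. If a
TemplateInstance ever carried more than one owner, a narrower patch could
drop only the BrokerTemplateInstance reference. The following is a minimal
sketch of how such a JSON Patch might be built, not a tested procedure for
a live cluster: the function name is hypothetical, and the sample name and
uid values are made up for illustration.

```python
import json


def owner_ref_removal_patch(obj, owner_kind="BrokerTemplateInstance"):
    """Build JSON Patch ops that remove ownerReferences of one kind.

    `obj` is a resource parsed from `oc get <resource> -o json`.
    """
    refs = obj.get("metadata", {}).get("ownerReferences") or []
    indexes = [i for i, ref in enumerate(refs) if ref.get("kind") == owner_kind]
    if indexes and len(indexes) == len(refs):
        # Every reference matches: drop the whole field, like the quoted patch.
        return [{"op": "remove", "path": "/metadata/ownerReferences"}]
    # JSON Patch ops apply in order, so remove the highest index first;
    # otherwise earlier removals would shift the later indexes.
    return [
        {"op": "remove", "path": f"/metadata/ownerReferences/{i}"}
        for i in sorted(indexes, reverse=True)
    ]


if __name__ == "__main__":
    # Hypothetical TemplateInstance; the name and uid are placeholders.
    template_instance = {
        "kind": "TemplateInstance",
        "metadata": {
            "name": "example-instance",
            "ownerReferences": [
                {
                    "apiVersion": "template.openshift.io/v1",
                    "kind": "BrokerTemplateInstance",
                    "name": "example-owner",
                    "uid": "00000000-0000-0000-0000-000000000000",
                }
            ],
        },
    }
    print(json.dumps(owner_ref_removal_patch(template_instance)))
```

The printed JSON is meant to be passed as the -p argument of the same
"oc patch ... --type json" call shown above.
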
> Also, I've got another stale serviceinstance after a few rounds of
> testing. I cannot for the life of me make it die, which means I can't
> delete the project that it is a part of. I've tried a force delete, but it
> doesn't work.
>
> $ oc delete serviceinstance jenkins-ephemeral-8dmk9 --force --grace-period=0
> warning: Immediate deletion does not wait for confirmation that the
> running resource has been terminated. The resource may continue to run on
> the cluster indefinitely.
> serviceinstance "jenkins-ephemeral-8dmk9" deleted
>
> $ oc get serviceinstance
> NAME                      AGE
> jenkins-ephemeral-8dmk9   7m
>
> What's the magic sauce to make it so that I can delete the serviceinstance?
>

That's going to be a question for our Service Catalog lead, Paul Morie (on
CC).



>
> On 8 January 2018 at 15:29, Ben Parees <[email protected]> wrote:
>
>>
>>
>> On Sun, Jan 7, 2018 at 9:35 PM, Joel Pearson <
>> [email protected]> wrote:
>>
>>> Ahh, I looked into all the objects that were getting deleted and they
>>> all have an ownerReference, e.g.:
>>>
>>> "ownerReferences": [
>>>                     {
>>>                         "apiVersion": "template.openshift.io/v1",
>>>                         "kind": "TemplateInstance",
>>>                         "name": "75c0ccd3-642e-4035-a5cf-3c27e54cae40",
>>>                         "uid": "a7301596-f41a-11e7-88e5-fa163eb8ca3a",
>>>                         "blockOwnerDeletion": true
>>>                     }
>>>                 ]
>>>
>>> That looks like what the patch is about. I also found that if I tried to
>>> edit an object and remove the ownerReference, it also triggered a
>>> garbage collect on the spot and all the resources evaporated.
>>>
>>>
>> Sounds worse than the behavior we were aware of, but fundamentally what's
>> causing the cascade deletion is this:
>>
>> Jan 08 00:26:49 master-0.openshift.staging.local dockerd-current[23329]:
>> I0108 00:26:49.904249       1 garbagecollector.go:394] delete object [
>> template.openshift.io/v1/TemplateInstance, namespace: jenkins-test,
>> name: e3639aec-bbbc-4170-b0e4-3b63735af348,
>> uid: 915d585d-f408-11e7-88e5-fa163eb8ca3a] with propagation policy Background
>>
>> The TemplateInstance object should have an ownerReference to a
>> BrokerTemplateInstance and that reference not being handled properly is the
>> bug.  If you remove that ownerRef from the TemplateInstance, you should be
>> safe from undesired deletion of the TemplateInstance (and the cascading
>> delete of everything else) (at least w/ respect to the bug we are aware of).
>>
>> That should be the only ownerRef you need to delete unless there are
>> other (to date unknown) bugs in the GC behavior, or in how the TSB is
>> creating the ownerRef chain.
>>
>>
>>
>>> So I guess my workaround can be: run the template, wait for everything
>>> to deploy, export all templated resources to JSON, strip out
>>> ownerReferences, and create all the resources again.
>>>
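
The "strip out ownerReferences" step of that workaround could be sketched
roughly as follows. This is an illustration under assumptions, not a
verified recipe: it assumes the export is the JSON that "oc get ... -o json"
prints (either a single object or a "kind: List" wrapper), the function
name is hypothetical, and other server-set metadata (uid, resourceVersion
and so on) is deliberately left untouched.

```python
import copy


def strip_owner_references(exported):
    """Return a deep copy of an exported object or List without ownerReferences."""
    cleaned = copy.deepcopy(exported)
    if cleaned.get("kind") == "List":
        items = cleaned.get("items") or []
    else:
        items = [cleaned]
    for item in items:
        # pop() with a default is a no-op when the field is absent.
        item.get("metadata", {}).pop("ownerReferences", None)
    return cleaned
```

The cleaned structure can then be dumped back to a file with json.dump and
used to recreate the resources. The input is not modified, so the original
export stays available for comparison.
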
>>> On Mon, Jan 8, 2018 at 12:30 PM Joel Pearson <
>>> [email protected]> wrote:
>>>
>>>> Hmm, in my case I don't need to restart to cause the problem to
>>>> happen. Is there some way to run nightlies of openshift:release-3.7 using
>>>> openshift-ansible, so that I can verify it's fixed for me?
>>>>
>>>> On Mon, Jan 8, 2018 at 12:23 PM Jordan Liggitt <[email protected]>
>>>> wrote:
>>>>
>>>>> Garbage collection in particular could be related to
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1525699 (fixed in
>>>>> https://github.com/openshift/origin/pull/17818 but not included in a
>>>>> point release yet)
>>>>>
>>>>>
>>>>> On Jan 7, 2018, at 8:17 PM, Joel Pearson <
>>>>> [email protected]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Has anyone else noticed that the new OpenShift Origin 3.7 Template
>>>>> Broker seems super flaky?
>>>>>
>>>>> For example, if I deploy a Jenkins (Persistent or Ephemeral), and then
>>>>> I modify the route, by adding an annotation for example:
>>>>>
>>>>> kubernetes.io/tls-acme: 'true'
>>>>>
>>>>> I have https://github.com/tnozicka/openshift-acme installed in the
>>>>> cluster, which then grabs an SSL cert for me and adds it to the route.
>>>>> Moments later, all resources from the template are garbage collected for
>>>>> no apparent reason.
>>>>>
>>>>> I also got the same behaviour when I modified the service account the
>>>>> Jenkins template uses. I had added an additional route, so I added a new
>>>>> "serviceaccounts.openshift.io/oauth-redirectreference.jenkins:" entry.
>>>>> It took a bit longer (like 12 hours), but it all disappeared again.  I
>>>>> have a suspicion that if you modify any object that a template created,
>>>>> then eventually the template broker will remove all objects it created.
>>>>>
>>>>> Is there any way to disable the new template broker and use the old
>>>>> template system?
>>>>>
>>>>> In Origin 3.6 it was flawless and worked with openshift-acme without
>>>>> any problems at all.
>>>>>
>>>>> I should mention that if I create things manually then it works fine:
>>>>> I can use openshift-acme, and none of my resources vanish on a whim.
>>>>>
>>>>> Here is a snippet of the logs, you can see the acme points are removed
>>>>> after successfully getting a cert, and then moments later, the deleting
>>>>> starts:
>>>>>
>>>>> Jan 08 00:26:47 master-0.openshift.staging.local
>>>>> dockerd-current[23329]: I0108 00:26:47.648255       1
>>>>> leaderelection.go:199] successfully renewed lease
>>>>> kube-service-catalog/service-catalog-controller-manager
>>>>> Jan 08 00:26:47 master-0.openshift.staging.local origin-node[26684]:
>>>>> I0108 00:26:47.744777   26749 roundrobin.go:338] LoadBalancerRR: Removing
>>>>> endpoints for jenkins-test/acme-9cv97q5dn8:
>>>>> Jan 08 00:26:47 master-0.openshift.staging.local
>>>>> dockerd-current[23329]: I0108 00:26:47.744777   26749 roundrobin.go:338]
>>>>> LoadBalancerRR: Removing endpoints for jenkins-test/acme-9cv97q5dn8:
>>>>> Jan 08 00:26:47 master-0.openshift.staging.local origin-node[26684]:
>>>>> I0108 00:26:47.762005   26749 ovs.go:143] Error executing ovs-ofctl:
>>>>> ovs-ofctl: None: invalid IP address
>>>>> Jan 08 00:26:47 master-0.openshift.staging.local
>>>>> dockerd-current[23329]: I0108 00:26:47.762005   26749 ovs.go:143] Error
>>>>> executing ovs-ofctl: ovs-ofctl: None: invalid IP address
>>>>> Jan 08 00:26:47 master-0.openshift.staging.local
>>>>> dockerd-current[23329]: E0108 00:26:47.765091   26749
>>>>> sdn_controller.go:284] Error deleting OVS flows for service &{{ }
>>>>> {acme-9cv97q5dn8  jenkins-test 
>>>>> /api/v1/namespaces/jenkins-test/services/acme-9cv97q5dn8
>>>>> 94c6b3b3-f40a-11e7-88e5-fa163eb8ca3a 622382 0 2018-01-08 00:26:34
>>>>> +0000 UTC <nil> <nil> map[] map[] [] nil [] } {ClusterIP [{http TCP 80 {0
>>>>> 80 } 0}] map[] None  []  None []  0} {{[]}}}: exit status 1
>>>>> Jan 08 00:26:47 master-0.openshift.staging.local origin-node[26684]:
>>>>> E0108 00:26:47.765091   26749 sdn_controller.go:284] Error deleting OVS
>>>>> flows for service &{{ } {acme-9cv97q5dn8  jenkins-test
>>>>> /api/v1/namespaces/jenkins-test/services/acme-9cv97q5dn8
>>>>> 94c6b3b3-f40a-11e7-88e5-fa163eb8ca3a 622382 0 2018-01-08 00:26:34
>>>>> +0000 UTC <nil> <nil> map[] map[] [] nil [] } {ClusterIP [{http TCP 80 {0
>>>>> 80 } 0}] map[] None  []  None []  0} {{[]}}}: exit status 1
>>>>> Jan 08 00:26:48 master-0.openshift.staging.local
>>>>> dockerd-current[23329]: I0108 00:26:48.139090       1
>>>>> rest.go:362] Starting watch for /api/v1/namespaces, rv=622418 labels=
>>>>> fields= timeout=8m38s
>>>>> Jan 08 00:26:48 master-0.openshift.staging.local
>>>>> origin-master-api[23448]: I0108 00:26:48.139090       1
>>>>> rest.go:362] Starting watch for /api/v1/namespaces, rv=622418 labels=
>>>>> fields= timeout=8m38s
>>>>> Jan 08 00:26:49 master-0.openshift.staging.local
>>>>> dockerd-current[23329]: I0108 00:26:49.668205       1
>>>>> leaderelection.go:199] successfully renewed lease
>>>>> kube-service-catalog/service-catalog-controller-manager
>>>>> Jan 08 00:26:49 master-0.openshift.staging.local
>>>>> dockerd-current[23329]: I0108 00:26:49.885207       1
>>>>> garbagecollector.go:291] processing item [
>>>>> template.openshift.io/v1/TemplateInstance, namespace: jenkins-test,
>>>>> name: e3639aec-bbbc-4170-b0e4-3b63735af348, uid:
>>>>> 915d585d-f408-11e7-88e5-fa163eb8ca3a]
>>>>> Jan 08 00:26:49 master-0.openshift.staging.local
>>>>> origin-master-controllers[73353]: I0108 00:26:49.885207       1
>>>>> garbagecollector.go:291] processing item [
>>>>> template.openshift.io/v1/TemplateInstance, namespace: jenkins-test,
>>>>> name: e3639aec-bbbc-4170-b0e4-3b63735af348, uid:
>>>>> 915d585d-f408-11e7-88e5-fa163eb8ca3a]
>>>>> Jan 08 00:26:49 master-0.openshift.staging.local
>>>>> dockerd-current[23329]: I0108 00:26:49.904249       1
>>>>> garbagecollector.go:394] delete object [
>>>>> template.openshift.io/v1/TemplateInstance, namespace: jenkins-test,
>>>>> name: e3639aec-bbbc-4170-b0e4-3b63735af348, uid:
>>>>> 915d585d-f408-11e7-88e5-fa163eb8ca3a] with propagation policy
>>>>> Background
>>>>> Jan 08 00:26:49 master-0.openshift.staging.local
>>>>> origin-master-controllers[73353]: I0108 00:26:49.904249       1
>>>>> garbagecollector.go:394] delete object [
>>>>> template.openshift.io/v1/TemplateInstance, namespace: jenkins-test,
>>>>> name: e3639aec-bbbc-4170-b0e4-3b63735af348, uid:
>>>>> 915d585d-f408-11e7-88e5-fa163eb8ca3a] with propagation policy
>>>>> Background
>>>>> Jan 08 00:26:49 master-0.openshift.staging.local
>>>>> dockerd-current[23329]: I0108 00:26:49.910964       1
>>>>> garbagecollector.go:291] processing item [
>>>>> apps.openshift.io/v1/DeploymentConfig, namespace: jenkins-test, name:
>>>>> jenkins, uid: 91759f72-f408-11e7-88e5-fa163eb8ca3a]
>>>>>
>>>>> Any ideas? Has anyone else seen this?  Considering
>>>>> "openshift-ansible-service-broker" is deployed in a broken state by
>>>>> openshift-ansible on the release-3.7 branch (for origin; I think
>>>>> enterprise would work as the tags exist), it makes me think that not many
>>>>> people are using the new service brokers that are talked about here:
>>>>> https://blog.openshift.com/whats-new-in-openshift-3-7-service-catalog-and-brokers/
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Joel
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> [email protected]
>>>>> http://lists.openshift.redhat.com/openshiftmm/listinfo/users
>>>>>
>>>>>
>>>
>>>
>>
>>
>> --
>> Ben Parees | OpenShift
>>
>>
>
>
> --
> Kind Regards,
>
> Joel Pearson
> Agile Digital | Senior Software Consultant
>
> Love Your Software™ | ABN 98 106 361 273
> p: 1300 858 277 | m: 0405 417 843 | w: agiledigital.com.au
>



-- 
Ben Parees | OpenShift
