Hey, sorry to insist... anything I can do from here?

> On 07 Sep 2016, at 19:05, Candide Kemmler <[email protected]> wrote:
>
>> On 06 Sep 2016, at 18:00, Clayton Coleman <[email protected]> wrote:
>>
>> What auth mechanism backs your "admin" user?
>>
>
> .htpasswd
>
> Thanks for the follow-up.
>
> Candide
>
>> On Sep 6, 2016, at 10:19 AM, Candide Kemmler <[email protected]> wrote:
>>
>>> Yes, that seems to be OK..., although I'm not sure I know exactly what
>>> the "root cluster cert" is, so I checked all of the following:
>>>
>>> [root@paas master]# openssl x509 -enddate -noout -in cloudapps.router.pem
>>> notAfter=Apr 21 16:38:31 2018 GMT
>>> [root@paas master]# openssl x509 -enddate -noout -in ca.crt
>>> notAfter=Apr 20 16:31:56 2021 GMT
>>> [root@paas master]# openssl x509 -enddate -noout -in master.server.crt
>>> notAfter=Apr 21 16:32:00 2018 GMT
>>> [root@paas master]# openssl x509 -enddate -noout -in etcd.server.crt
>>> notAfter=Apr 21 16:32:01 2018 GMT
>>> [root@paas master]# openssl x509 -enddate -noout -in admin.crt
>>> notAfter=Apr 21 16:31:58 2018 GMT
>>> [root@paas master]# openssl x509 -enddate -noout -in ca-bundle.crt
>>> notAfter=Apr 20 16:31:56 2021 GMT
>>> [root@paas master]# openssl x509 -enddate -noout -in openshift-master.crt
>>> notAfter=Apr 21 16:31:57 2018 GMT
>>> [root@paas master]# openssl x509 -enddate -noout -in openshift-registry.crt
>>> notAfter=Apr 21 16:32:00 2018 GMT
>>>
>>>> On 06 Sep 2016, at 15:04, Clayton Coleman <[email protected]> wrote:
>>>>
>>>> Were you able to check the expiration date on your admin root cluster
>>>> cert and verify it has not expired?
>>>>
>>>> On Sep 6, 2016, at 5:19 AM, Candide Kemmler <[email protected]> wrote:
>>>>
>>>>> Hi Clayton,
>>>>>
>>>>> Thanks! Here's the result of running `sudo oadm diagnostics`. I'm
>>>>> particularly bothered by the "the server has asked for the client to
>>>>> provide credentials" message, as I see the same one when I try to run
>>>>> the ansible scripts. Do you know how to solve it?
>>>>>
>>>>> Any other ideas on things I should focus on?
>>>>>
>>>>> Regards,
>>>>>
>>>>> Candide
>>>>>
>>>>> [Note] Determining if client configuration exists for client/cluster diagnostics
>>>>> Info: Successfully read a client config file at '/root/.kube/config'
>>>>> [Note] Could not configure a client, so client diagnostics are limited to testing configuration and connection
>>>>> Info: Using context for cluster-admin access: 'default/paas-intrinsic-world:8443/system:admin'
>>>>> [Note] Performing systemd discovery
>>>>>
>>>>> [Note] Running diagnostic: ConfigContexts[logging/paas-intrinsic-world:8443/admin]
>>>>>        Description: Validate client config context is complete and has connectivity
>>>>>
>>>>> ERROR: [DCli0014 from diagnostic ConfigContexts@openshift/origin/pkg/diagnostics/client/config_contexts.go:285]
>>>>>        For client config context 'logging/paas-intrinsic-world:8443/admin':
>>>>>        The server URL is 'https://paas.intrinsic.world:8443'
>>>>>        The user authentication is 'admin/paas-intrinsic-world:8443'
>>>>>        The current project is 'logging'
>>>>>        (*errors.StatusError) the server has asked for the client to provide credentials
>>>>>
>>>>>        This means that when we tried to make a request to the master API
>>>>>        server, the request required credentials that were not presented.
>>>>>        This can happen with an expired or invalid authentication token.
>>>>>        Try logging in with this user again.
>>>>>
>>>>> [Note] Running diagnostic: ConfigContexts[logging/paas-intrinsic-world:8443/system:admin]
>>>>>        Description: Validate client config context is complete and has connectivity
>>>>>
>>>>> Info: For client config context 'logging/paas-intrinsic-world:8443/system:admin':
>>>>>        The server URL is 'https://paas.intrinsic.world:8443'
>>>>>        The user authentication is 'system:admin/paas-intrinsic-world:8443'
>>>>>        The current project is 'logging'
>>>>>        Successfully requested project list; has access to project(s):
>>>>>        [openshift-infra dev ieml-demo logging management-infra misc openshift p2p default ieml-dev ...]
>>>>>
>>>>> [Note] Running diagnostic: ClusterRegistry
>>>>>        Description: Check that there is a working Docker registry
>>>>>
>>>>> WARN: [DClu1009 from diagnostic ClusterRegistry@openshift/origin/pkg/diagnostics/cluster/registry.go:217]
>>>>>        The "docker-registry-1-8w93s" pod for the "docker-registry" service is not running.
>>>>>        This may be transient, a scheduling error, or something else.
>>>>>
>>>>> ERROR: [DClu1001 from diagnostic ClusterRegistry@openshift/origin/pkg/diagnostics/cluster/registry.go:173]
>>>>>        The "docker-registry" service exists but no pods currently running, so it
>>>>>        is not available. Builds and deployments that use the registry will fail.
>>>>>
>>>>> [Note] Running diagnostic: ClusterRoleBindings
>>>>>        Description: Check that the default ClusterRoleBindings are present and contain the expected subjects
>>>>>
>>>>> Info: clusterrolebinding/cluster-admins has more subjects than expected.
>>>>>        Use the `oadm policy reconcile-cluster-role-bindings` command to
>>>>>        update the role binding to remove extra subjects.
>>>>> Info: clusterrolebinding/cluster-admins has extra subject {User admin }.
>>>>>
>>>>> Info: clusterrolebinding/cluster-readers has more subjects than expected.
>>>>>        Use the `oadm policy reconcile-cluster-role-bindings` command to
>>>>>        update the role binding to remove extra subjects.
>>>>> Info: clusterrolebinding/cluster-readers has extra subject {ServiceAccount management-infra management-admin }.
>>>>> Info: clusterrolebinding/cluster-readers has extra subject {ServiceAccount logging aggregated-logging-fluentd }.
>>>>>
>>>>> [Note] Running diagnostic: ClusterRoles
>>>>>        Description: Check that the default ClusterRoles are present and contain the expected permissions
>>>>>
>>>>> [Note] Running diagnostic: ClusterRouterName
>>>>>        Description: Check there is a working router
>>>>>
>>>>> ERROR: [DClu2007 from diagnostic ClusterRouter@openshift/origin/pkg/diagnostics/cluster/router.go:156]
>>>>>        The "router" DeploymentConfig exists but has no running pods, so it
>>>>>        is not available. Apps will not be externally accessible via the router.
>>>>>
>>>>> [Note] Running diagnostic: MasterNode
>>>>>        Description: Check if master is also running node (for Open vSwitch)
>>>>>
>>>>> Info: Found a node with same IP as master: paas.intrinsic.world
>>>>>
>>>>> [Note] Running diagnostic: NodeDefinitions
>>>>>        Description: Check node records on master
>>>>>
>>>>> WARN: [DClu0003 from diagnostic NodeDefinition@openshift/origin/pkg/diagnostics/cluster/node_definitions.go:112]
>>>>>        Node paas.intrinsic.world is ready but is marked Unschedulable.
>>>>>        This is usually set manually for administrative reasons.
>>>>>        An administrator can mark the node schedulable with:
>>>>>          oadm manage-node paas.intrinsic.world --schedulable=true
>>>>>
>>>>>        While in this state, pods should not be scheduled to deploy on the node.
>>>>>        Existing pods will continue to run until completed or evacuated (see
>>>>>        other options for 'oadm manage-node').
>>>>>
>>>>> [Note] Running diagnostic: AnalyzeLogs
>>>>>        Description: Check for recent problems in systemd service logs
>>>>>
>>>>> Info: Checking journalctl logs for 'origin-master' service
>>>>> Info: Checking journalctl logs for 'origin-node' service
>>>>> Info: Checking journalctl logs for 'docker' service
>>>>>
>>>>> [Note] Running diagnostic: MasterConfigCheck
>>>>>        Description: Check the master config file
>>>>>
>>>>> Info: Found a master config file: /etc/origin/master/master-config.yaml
>>>>>
>>>>> WARN: [DH0005 from diagnostic MasterConfigCheck@openshift/origin/pkg/diagnostics/host/check_master_config.go:58]
>>>>>        Validation of master config file '/etc/origin/master/master-config.yaml' warned:
>>>>>        assetConfig.loggingPublicURL: Invalid value: "": required to view aggregated container logs in the console
>>>>>        assetConfig.metricsPublicURL: Invalid value: "": required to view cluster metrics in the console
>>>>>
>>>>> [Note] Running diagnostic: NodeConfigCheck
>>>>>        Description: Check the node config file
>>>>>
>>>>> Info: Found a node config file: /etc/origin/node/node-config.yaml
>>>>>
>>>>> [Note] Running diagnostic: UnitStatus
>>>>>        Description: Check status for related systemd units
>>>>>
>>>>> [Note] Summary of diagnostics execution (version v1.1.6):
>>>>> [Note] Warnings seen: 3
>>>>> [Note] Errors seen: 4
>>>>>
>>>>>> On 05 Sep 2016, at 18:46, Clayton Coleman <[email protected]> wrote:
>>>>>>
>>>>>> Did you change the IP of your master, or otherwise delete / alter the
>>>>>> openshift-infra namespace? Or have your client certificates expired
>>>>>> (is this cluster 1 year old)?
>>>>>>
>>>>>> Before deleting, try two things:
>>>>>>
>>>>>> oadm diagnostics
>>>>>>
>>>>>> from the master (to see if it identifies anything).
>>>>>>
>>>>>> Also check your certificate expiration.
>>>>>>
>>>>>>> On Sep 5, 2016, at 5:00 AM, Candide Kemmler <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have a development server setup made up of two nodes (1 master, 1
>>>>>>> slave) running a bunch of different projects and environments, which
>>>>>>> just crashed badly on me.
>>>>>>>
>>>>>>> Symptoms: all containers in all projects are in a Pending state
>>>>>>> (orange circle). When I try to `delete all`, things get removed, but
>>>>>>> pods hang in a 'Terminating' state. `oc describe` gives me
>>>>>>> uninteresting information that I already know (basically that the
>>>>>>> pods are Pending), and `oc logs` tells me that it "could not find the
>>>>>>> requested resource".
>>>>>>>
>>>>>>> I tried `sudo systemctl restart origin-master`, as it seems to have
>>>>>>> produced good results in the past, but that didn't help this time. I
>>>>>>> also tried that in combination with a full system reboot.
>>>>>>>
>>>>>>> Finally, I tried running the ansible scripts in hopes of updating
>>>>>>> origin to the latest version (it's still running 1.1.6), but I got
>>>>>>> the following error log:
>>>>>>>
>>>>>>> failed: [paas.intrinsic.world] => {"changed": false, "cmd": ["oc",
>>>>>>> "create", "-n", "openshift", "-f",
>>>>>>> "/usr/share/openshift/examples/image-streams/image-streams-centos7.json"],
>>>>>>> "delta": "0:00:00.180874", "end": "2016-09-05 07:20:12.050123",
>>>>>>> "failed": true, "failed_when_result": true, "rc": 1, "start":
>>>>>>> "2016-09-05 07:20:11.869249", "stdout_lines": [], "warnings": []}
>>>>>>> stderr: unable to connect to a server to handle "imagestreamlists":
>>>>>>> the server has asked for the client to provide credentials
>>>>>>>
>>>>>>> FATAL: all hosts have already failed -- aborting
>>>>>>>
>>>>>>> PLAY RECAP ********************************************************************
>>>>>>> to retry, use: --limit @/Users/candide/config.retry
>>>>>>>
>>>>>>> apps.intrinsic.world : ok=48   changed=0   unreachable=0   failed=0
>>>>>>> localhost            : ok=15   changed=0   unreachable=0   failed=0
>>>>>>> paas.intrinsic.world : ok=207  changed=0   unreachable=0   failed=1
>>>>>>>
>>>>>>> My last option is to reinstall everything from scratch, but before I
>>>>>>> do that I wanted to know if you guys had other ideas on how to get on
>>>>>>> top of things again.
>>>>>>>
>>>>>>> Candide
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> [email protected]
>>>>>>> http://lists.openshift.redhat.com/openshiftmm/listinfo/users
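In the meantime, here is a rough sketch of what I plan to try in order to rule
out an expired OAuth token for the htpasswd-backed "admin" user (the htpasswd
file path below is a guess on my part and may not match this install):

    # check that the admin entry still exists in the htpasswd file the master uses
    grep '^admin:' /etc/origin/master/htpasswd

    # log in again as the htpasswd-backed user to obtain a fresh OAuth token
    oc login https://paas.intrinsic.world:8443 -u admin

    # verify the new token is accepted by the API server
    oc whoami
    oc whoami -t

    # the system:admin context authenticates with client certificates rather
    # than a token, so it should keep working even if the admin token expired
    # (admin.kubeconfig is the usual location for an openshift-ansible install)
    oc --config=/etc/origin/master/admin.kubeconfig get nodes

If `oc login` succeeds but the diagnostics error persists, that would point at
something other than a stale token.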
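Separately, if I read the diagnostics output right, the only node
(paas.intrinsic.world) is marked Unschedulable, which by itself would keep the
registry, router and application pods stuck in Pending. A small sketch of what
I intend to check next (the registry pod name is copied from the diagnostics
above and may have changed since then):

    # mark the node schedulable again, as the diagnostics output itself suggests
    oadm manage-node paas.intrinsic.world --schedulable=true

    # the node should no longer report SchedulingDisabled
    oc get nodes

    # watch whether the docker-registry and router pods leave the Pending state
    oc get pods -n default -o wide

    # if a pod stays Pending, the events listed by describe usually say why
    oc describe pod docker-registry-1-8w93s -n default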

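Finally, since the ansible run fails on the same "provide credentials" error,
I want to reproduce the failing step by hand on the master before re-running
the whole playbook, using the certificate-based admin kubeconfig so token
problems are taken out of the picture. This is only a sketch: the command and
file path are copied from the failure output above, and it may simply report
that the image streams already exist, which would be fine for this test.

    # the exact command the playbook ran, but with the cert-based admin kubeconfig
    oc --config=/etc/origin/master/admin.kubeconfig create -n openshift \
        -f /usr/share/openshift/examples/image-streams/image-streams-centos7.json

    # if that works, re-run the playbook and resume from the failed hosts
    # (playbook path omitted here; the retry file is the one ansible wrote out)
    ansible-playbook <playbook>.yml --limit @/Users/candide/config.retry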