Thanks for bringing this up. This tool... needs some attention. Comments
below:

On Fri, Oct 27, 2017 at 7:48 AM, Tim Dudgeon <tdudgeon...@gmail.com> wrote:

> I've been looking at using the diagnostics (oc adm diagnostics) to test
> the status of a cluster installed with the ansible installer and
> consistently see things that seem to be false alarms. The cluster appears
> to be functioning (builds run, we can push to the registry, routes are
> working, etc.). This is with origin 3.6.0.
>
> 1. This is consistently seen, and a restart of the master does not fix
> it. The name docker-registry.default.svc resolves to the IP address
> 172.30.200.62.
>
>> ERROR: [DClu1019 from diagnostic ClusterRegistry@openshift/origin/pkg/diagnostics/cluster/registry.go:343]
>>        Diagnostics created a test ImageStream and compared the registry IP
>>        it received to the registry IP available via the docker-registry service.
>>
>>        docker-registry      : 172.30.200.62:5000
>>        ImageStream registry : docker-registry.default.svc:5000
>>
>>        They do not match, which probably means that an administrator re-created
>>        the docker-registry service but the master has cached the old service
>>        IP address. Builds or deployments that use ImageStreams with the wrong
>>        docker-registry IP will fail under this condition.
>>
>>        To resolve this issue, restarting the master (to clear the cache) should
>>        be sufficient. Existing ImageStreams may need to be re-created.
>>
>
This is a bug -- the registry deployment changed without updating the
relevant diagnostic. It has been fixed by
https://github.com/openshift/origin/pull/16188, which I believe was not
backported to Origin 3.6, so expect it to be fixed in 3.7.
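
In the meantime, if you want to confirm that this is only the diagnostic
being out of date rather than a genuinely stale cache, you can compare the
two values it is checking yourself. A rough sketch (the "myproject/myis"
ImageStream name is just a placeholder for one of your own):

    # Cluster IP of the docker-registry service in the default project
    oc get svc docker-registry -n default -o jsonpath='{.spec.clusterIP}{"\n"}'

    # Registry reference recorded on one of your ImageStreams
    oc get is myis -n myproject -o jsonpath='{.status.dockerImageRepository}{"\n"}'

If the ImageStream reports the docker-registry.default.svc hostname rather
than a stale service IP, you are seeing exactly the cosmetic mismatch that
PR fixes, and the ERROR can be ignored.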



> 2. This warning is seen
>
>> WARN:  [DClu0003 from diagnostic NodeDefinition@openshift/origin/pkg/diagnostics/cluster/node_definitions.go:113]
>>        Node ip-10-0-247-194.eu-west-1.compute.internal is ready but is marked Unschedulable.
>>        This is usually set manually for administrative reasons.
>>        An administrator can mark the node schedulable with:
>>            oadm manage-node ip-10-0-247-194.eu-west-1.compute.internal --schedulable=true
>>
>>        While in this state, pods should not be scheduled to deploy on the node.
>>        Existing pods will continue to run until completed or evacuated (see
>>        other options for 'oadm manage-node').
>>
> This is for the master node, which by default is non-schedulable.
>

It's a warning, not an error, because this could be a legitimate
configuration. The diagnostic generally has no way to know that a node is a
master, or that it is supposed to be unschedulable (there is nothing in the
API to indicate this).

That diagnostic is intended to alert you to the possibility that a node is
not getting pods scheduled because of this setting. It's not saying there's
anything wrong with the cluster. It's certainly a bit confusing; would you
rather get a redundant warning about masters, or not hear about
unschedulable nodes at all in the diagnostics?
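
For what it's worth, a quick way to see which nodes are marked
unschedulable outside of diagnostics is something like this (a sketch; the
jsonpath just reads each node's spec.unschedulable field):

    # Unschedulable nodes show SchedulingDisabled in the STATUS column
    oc get nodes

    # Or print just the flag for each node
    oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.unschedulable}{"\n"}{end}'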



>
> 3. If metrics and logging are not deployed you see this warning:
>
>> WARN:  [DH0005 from diagnostic MasterConfigCheck@openshift/origin/pkg/diagnostics/host/check_master_config.go:52]
>>        Validation of master config file '/etc/origin/master/master-config.yaml' warned:
>>        assetConfig.loggingPublicURL: Invalid value: "": required to view aggregated container logs in the console
>>        assetConfig.metricsPublicURL: Invalid value: "": required to view cluster metrics in the console
>>        auditConfig.auditFilePath: Required value: audit can not be logged to a separate file
>>
>
> Whilst 2 and 3 could be considered minor irritations, 1 might scare people
> into thinking that something is actually wrong.
>


Once again... it's a warning, not an error. And again, that's because there
is no way to determine from the API whether logging and metrics are
supposed to be deployed.
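
For reference, the first two warnings go away once the corresponding URLs
are set in master-config.yaml after deploying logging and metrics. Roughly
something like this, with the hostnames being placeholders for whatever
routes your logging and metrics deployments expose:

    assetConfig:
      # other assetConfig settings unchanged
      loggingPublicURL: "https://kibana.apps.example.com"
      metricsPublicURL: "https://hawkular-metrics.apps.example.com/hawkular/metrics"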



>
> Also, the 'oc adm diagnostics' command needs to be run as root or with
> sudo, otherwise you get some file-permission-related errors. I don't think
> this is mentioned in the docs.
>


Could you be more specific about what errors you get? Errors accessing the
node/master config files perhaps?
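
If it is the config files, the usual cause is that they are readable only
by root, so the host checks fail for a regular user. Something like this
should show it (the node config path is a guess at the default location,
adjust for your hosts):

    # Typically owned by root with restrictive permissions, which makes the
    # host diagnostics fail when run without sudo
    ls -l /etc/origin/master/master-config.yaml /etc/origin/node/node-config.yaml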

Thanks for the feedback, and sorry for the delay in responding.
_______________________________________________
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users
