Re: cluster stopped working - certificate problems

2020-03-31 Thread Tim Dudgeon

Brian,

That's fixed it. THANK YOU.

On 31/03/2020 17:05, Brian Jarvis wrote:

Hello Tim,

Each node has a client certificate that expires after one year.
Run "oc get csr" and you should see many pending requests, possibly thousands.

To clear those run "oc get csr -o name | xargs oc adm certificate approve"

One way to prevent this in the future is to deploy/enable the auto 
approver statefulset with the following command.
ansible-playbook -vvv -i [inventory_file] 
/usr/share/ansible/openshift-ansible/playbooks/openshift-master/enable_bootstrap.yml 
-e openshift_master_bootstrap_auto_approve=true
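
For anyone who only wants to approve the requests that are actually pending, a
hedged variant (the go-template filter assumes a pending CSR has an empty status):

oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' \
  | xargs --no-run-if-empty oc adm certificate approve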


On Tue, Mar 31, 2020 at 11:53 AM Tim Dudgeon <mailto:tdudgeon...@gmail.com>> wrote:


Maybe an uncanny coincidence, but we think the cluster was
created almost EXACTLY 1 year before it failed.

On 31/03/2020 16:17, Ben Holmes wrote:

Hi Tim,

Can you verify that the hosts' clocks are being synced correctly
as per Simon's other suggestion?

Ben

On Tue, 31 Mar 2020 at 16:05, Tim Dudgeon mailto:tdudgeon...@gmail.com>> wrote:

Hi Simon,

We've run those playbooks and all certs are reported as still
being valid.

Tim

On 31/03/2020 15:59, Simon Krenger wrote:
> Hi Tim,
>
> Note that there are multiple sets of certificates, both external and
> internal. So it would be worth checking the certificates again using
> the Certificate Expiration Playbooks (see link below). The
> documentation also has an overview of what can be done to renew
> certain certificates:
>
> - [ Redeploying Certificates ]
>   https://docs.okd.io/3.11/install_config/redeploying_certificates.html
>
> Apart from checking all certificates, I'd certainly review the time
> synchronisation for the whole cluster, as we see the message "x509:
> certificate has expired or is not yet valid".
>
> I hope this helps.
>
> Kind regards
> Simon
>
> On Tue, Mar 31, 2020 at 4:33 PM Tim Dudgeon <tdudgeon...@gmail.com> wrote:
>> One of our OKD 3.11 clusters has suddenly stopped working without any
>> obvious reason.
>>
>> The origin-node service on the nodes does not start (times out).
>> The master-api pod is running on the master.
>> The nodes can access the master-api endpoints.
>>
>> The logs of the master-api pod look mostly OK other than a huge number
>> of warnings about certificates that don't really make sense, as the
>> certificates are valid (we use named certificates from Let's Encrypt and
>> they were renewed about 2 weeks ago and all appear to be correct).
>>
>> Examples of errors from the master-api pod are:
>>
>> I0331 12:46:57.065147       1 establishing_controller.go:73] Starting
>> EstablishingController
>> I0331 12:46:57.065561       1 logs.go:49] http: TLS handshake error from
>> 192.168.160.17:58024: EOF
>> I0331 12:46:57.071932       1 logs.go:49] http: TLS handshake error from
>> 192.168.160.19:48102: EOF
>> I0331 12:46:57.072036       1 logs.go:49] http: TLS handshake error from
>> 192.168.160.19:37178: EOF
>> I0331 12:46:57.072141       1 logs.go:49] http: TLS handshake error from
>> 192.168.160.17:58022: EOF
>>
>> E0331 12:47:37.855023       1 memcache.go:147] couldn't get resource
>> list for metrics.k8s.io/v1beta1: the server is currently unable to
>> handle the request
>> E0331 12:47:37.856569       1 memcache.go:147] couldn't get resource
>> list for servicecatalog.k8s.io/v1beta1: the server is currently unable
>> to handle the request
>> E0331 12:47:44.115290       1 authentication.go:62] Unable to
>> authenticate the request due to an error: [x509: certificate has expired
>> or is not yet valid, x509: certificate has expired or is not yet valid]
>> E0331 12:47:44.118976       1 authentication.go:62] Unable to
>> authenticate the request due to an error: [x50

Re: cluster stopped working - certificate problems

2020-03-31 Thread Tim Dudgeon
Maybe an uncanny coincidence, but we think the cluster was created
almost EXACTLY 1 year before it failed.


On 31/03/2020 16:17, Ben Holmes wrote:

Hi Tim,

Can you verify that the hosts' clocks are being synced correctly as
per Simon's other suggestion?


Ben

On Tue, 31 Mar 2020 at 16:05, Tim Dudgeon <mailto:tdudgeon...@gmail.com>> wrote:


Hi Simon,

We've run those playbooks and all certs are reported as still
being valid.

Tim

On 31/03/2020 15:59, Simon Krenger wrote:
> Hi Tim,
>
> Note that there are multiple sets of certificates, both external and
> internal. So it would be worth checking the certificates again using
> the Certificate Expiration Playbooks (see link below). The
> documentation also has an overview of what can be done to renew
> certain certificates:
>
> - [ Redeploying Certificates ]
>
https://docs.okd.io/3.11/install_config/redeploying_certificates.html
>
> Apart from checking all certificates, I'd certainly review the time
> synchronisation for the whole cluster, as we see the message "x509:
> certificate has expired or is not yet valid".
>
> I hope this helps.
>
> Kind regards
> Simon
>
> On Tue, Mar 31, 2020 at 4:33 PM Tim Dudgeon <tdudgeon...@gmail.com> wrote:
>> One of our OKD 3.11 clusters has suddenly stopped working without any
>> obvious reason.
>>
>> The origin-node service on the nodes does not start (times out).
>> The master-api pod is running on the master.
>> The nodes can access the master-api endpoints.
>>
>> The logs of the master-api pod look mostly OK other than a huge number
>> of warnings about certificates that don't really make sense, as the
>> certificates are valid (we use named certificates from Let's Encrypt and
>> they were renewed about 2 weeks ago and all appear to be correct).
>>
>> Examples of errors from the master-api pod are:
>>
>> I0331 12:46:57.065147       1 establishing_controller.go:73] Starting
>> EstablishingController
>> I0331 12:46:57.065561       1 logs.go:49] http: TLS handshake error from
>> 192.168.160.17:58024: EOF
>> I0331 12:46:57.071932       1 logs.go:49] http: TLS handshake error from
>> 192.168.160.19:48102: EOF
>> I0331 12:46:57.072036       1 logs.go:49] http: TLS handshake error from
>> 192.168.160.19:37178: EOF
>> I0331 12:46:57.072141       1 logs.go:49] http: TLS handshake error from
>> 192.168.160.17:58022: EOF
>>
>> E0331 12:47:37.855023       1 memcache.go:147] couldn't get resource
>> list for metrics.k8s.io/v1beta1: the server is currently unable to
>> handle the request
>> E0331 12:47:37.856569       1 memcache.go:147] couldn't get resource
>> list for servicecatalog.k8s.io/v1beta1: the server is currently unable
>> to handle the request
>> E0331 12:47:44.115290       1 authentication.go:62] Unable to
>> authenticate the request due to an error: [x509: certificate has expired
>> or is not yet valid, x509: certificate has expired or is not yet valid]
>> E0331 12:47:44.118976       1 authentication.go:62] Unable to
>> authenticate the request due to an error: [x509: certificate has expired
>> or is not yet valid, x509: certificate has expired or is not yet valid]
>> E0331 12:47:44.122276       1 authentication.go:62] Unable to
>> authenticate the request due to an error: [x509: certificate has expired
>> or is not yet valid, x509: certificate has expired or is not yet valid]
>>
>> Huge number of this second sort.
>>
>> Any ideas what is wrong?
>>
>>
>>
>> ___
>> users mailing list
>> users@lists.openshift.redhat.com
<mailto:users@lists.openshift.redhat.com>
>> http://lists.openshift.redhat.com/openshiftmm/listinfo/users
>
>

___
users mailing list
users@lists.openshift.redhat.com
<mailto:users@lists.openshift.redhat.com>
http://lists.openshift.redhat.com/openshiftmm/listinfo/users



--

BENJAMIN HOLMES

SENIOR Solution ARCHITECT

Red Hat UKI Presales <https://www.redhat.com/>

bhol...@redhat.com <mailto:bhol...@redhat.com> M: 07876-885388 



___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: cluster stopped working - certificate problems

2020-03-31 Thread Tim Dudgeon

And yes, the clocks of all the nodes are correct and in sync.

On 31/03/2020 16:17, Ben Holmes wrote:

Hi Tim,

Can you verify that the hosts' clocks are being synced correctly as
per Simon's other suggestion?


Ben

On Tue, 31 Mar 2020 at 16:05, Tim Dudgeon <mailto:tdudgeon...@gmail.com>> wrote:


Hi Simon,

We've run those playbooks and all certs are reported as still
being valid.

Tim

On 31/03/2020 15:59, Simon Krenger wrote:
> Hi Tim,
>
> Note that there are multiple sets of certificates, both external and
> internal. So it would be worth checking the certificates again using
> the Certificate Expiration Playbooks (see link below). The
> documentation also has an overview of what can be done to renew
> certain certificates:
>
> - [ Redeploying Certificates ]
>
https://docs.okd.io/3.11/install_config/redeploying_certificates.html
>
> Apart from checking all certificates, I'd certainly review the time
> synchronisation for the whole cluster, as we see the message "x509:
> certificate has expired or is not yet valid".
>
> I hope this helps.
>
> Kind regards
> Simon
>
> On Tue, Mar 31, 2020 at 4:33 PM Tim Dudgeon <tdudgeon...@gmail.com> wrote:
>> One of our OKD 3.11 clusters has suddenly stopped working without any
>> obvious reason.
>>
>> The origin-node service on the nodes does not start (times out).
>> The master-api pod is running on the master.
>> The nodes can access the master-api endpoints.
>>
>> The logs of the master-api pod look mostly OK other than a huge number
>> of warnings about certificates that don't really make sense, as the
>> certificates are valid (we use named certificates from Let's Encrypt and
>> they were renewed about 2 weeks ago and all appear to be correct).
>>
>> Examples of errors from the master-api pod are:
>>
>> I0331 12:46:57.065147       1 establishing_controller.go:73] Starting
>> EstablishingController
>> I0331 12:46:57.065561       1 logs.go:49] http: TLS handshake error from
>> 192.168.160.17:58024: EOF
>> I0331 12:46:57.071932       1 logs.go:49] http: TLS handshake error from
>> 192.168.160.19:48102: EOF
>> I0331 12:46:57.072036       1 logs.go:49] http: TLS handshake error from
>> 192.168.160.19:37178: EOF
>> I0331 12:46:57.072141       1 logs.go:49] http: TLS handshake error from
>> 192.168.160.17:58022: EOF
>>
>> E0331 12:47:37.855023       1 memcache.go:147] couldn't get resource
>> list for metrics.k8s.io/v1beta1: the server is currently unable to
>> handle the request
>> E0331 12:47:37.856569       1 memcache.go:147] couldn't get resource
>> list for servicecatalog.k8s.io/v1beta1: the server is currently unable
>> to handle the request
>> E0331 12:47:44.115290       1 authentication.go:62] Unable to
>> authenticate the request due to an error: [x509: certificate has expired
>> or is not yet valid, x509: certificate has expired or is not yet valid]
>> E0331 12:47:44.118976       1 authentication.go:62] Unable to
>> authenticate the request due to an error: [x509: certificate has expired
>> or is not yet valid, x509: certificate has expired or is not yet valid]
>> E0331 12:47:44.122276       1 authentication.go:62] Unable to
>> authenticate the request due to an error: [x509: certificate has expired
>> or is not yet valid, x509: certificate has expired or is not yet valid]
>>
>> Huge number of this second sort.
>>
>> Any ideas what is wrong?
>>
>>
>>
>> ___
>> users mailing list
>> users@lists.openshift.redhat.com
<mailto:users@lists.openshift.redhat.com>
>> http://lists.openshift.redhat.com/openshiftmm/listinfo/users
>
>

___
users mailing list
users@lists.openshift.redhat.com
<mailto:users@lists.openshift.redhat.com>
http://lists.openshift.redhat.com/openshiftmm/listinfo/users



--

BENJAMIN HOLMES

SENIOR Solution ARCHITECT

Red Hat UKI Presales <https://www.redhat.com/>

bhol...@redhat.com <mailto:bhol...@redhat.com> M: 07876-885388 



___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: cluster stopped working - certificate problems

2020-03-31 Thread Tim Dudgeon

Hi Simon,

We've run those playbooks and all certs are reported as still being valid.

Tim
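
(For reference, the expiry check referred to above is typically invoked like
this; the path assumes an RPM install of openshift-ansible:)

ansible-playbook -v -i <inventory_file> \
  /usr/share/ansible/openshift-ansible/playbooks/openshift-checks/certificate_expiry/easy-mode.yaml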

On 31/03/2020 15:59, Simon Krenger wrote:

Hi Tim,

Note that there are multiple sets of certificates, both external and
internal. So it would be worth checking the certificates again using
the Certificate Expiration Playbooks (see link below). The
documentation also has an overview of what can be done to renew
certain certificates:

- [ Redeploying Certificates ]
   https://docs.okd.io/3.11/install_config/redeploying_certificates.html

Apart from checking all certificates, I'd certainly review the time
synchronisation for the whole cluster, as we see the message "x509:
certificate has expired or is not yet valid".

I hope this helps.

Kind regards
Simon

On Tue, Mar 31, 2020 at 4:33 PM Tim Dudgeon  wrote:

One of our OKD 3.11 clusters has suddenly stopped working without any
obvious reason.

The origin-node service on the nodes does not start (times out).
The master-api pod is running on the master.
The nodes can access the master-api endpoints.

The logs of the master-api pod look mostly OK other than a huge number
of warnings about certificates that don't really make sense, as the
certificates are valid (we use named certificates from Let's Encrypt and
they were renewed about 2 weeks ago and all appear to be correct).

Examples of errors from the master-api pod are:

I0331 12:46:57.065147   1 establishing_controller.go:73] Starting
EstablishingController
I0331 12:46:57.065561   1 logs.go:49] http: TLS handshake error from
192.168.160.17:58024: EOF
I0331 12:46:57.071932   1 logs.go:49] http: TLS handshake error from
192.168.160.19:48102: EOF
I0331 12:46:57.072036   1 logs.go:49] http: TLS handshake error from
192.168.160.19:37178: EOF
I0331 12:46:57.072141   1 logs.go:49] http: TLS handshake error from
192.168.160.17:58022: EOF

E0331 12:47:37.855023   1 memcache.go:147] couldn't get resource
list for metrics.k8s.io/v1beta1: the server is currently unable to
handle the request
E0331 12:47:37.856569   1 memcache.go:147] couldn't get resource
list for servicecatalog.k8s.io/v1beta1: the server is currently unable
to handle the request
E0331 12:47:44.115290   1 authentication.go:62] Unable to
authenticate the request due to an error: [x509: certificate has expired
or is not yet valid, x509: certificate
   has expired or is not yet valid]
E0331 12:47:44.118976   1 authentication.go:62] Unable to
authenticate the request due to an error: [x509: certificate has expired
or is not yet valid, x509: certificate
   has expired or is not yet valid]
E0331 12:47:44.122276   1 authentication.go:62] Unable to
authenticate the request due to an error: [x509: certificate has expired
or is not yet valid, x509: certificate
   has expired or is not yet valid]

Huge number of this second sort.

Any ideas what is wrong?



___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users





___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users



cluster stopped working - certificate problems

2020-03-31 Thread Tim Dudgeon
One of our OKD 3.11 clusters has suddenly stopped working without any 
obvious reason.


The origin-node service on the nodes does not start (times out).
The master-api pod is running on the master.
The nodes can access the master-api endpoints.

The logs of the master-api pod look mostly OK other than a huge number 
of warnings about certificates that don't really make sense, as the 
certificates are valid (we use named certificates from Let's Encrypt and 
they were renewed about 2 weeks ago and all appear to be correct).


Examples of errors from the master-api pod are:

I0331 12:46:57.065147   1 establishing_controller.go:73] Starting 
EstablishingController
I0331 12:46:57.065561   1 logs.go:49] http: TLS handshake error from 
192.168.160.17:58024: EOF
I0331 12:46:57.071932   1 logs.go:49] http: TLS handshake error from 
192.168.160.19:48102: EOF
I0331 12:46:57.072036   1 logs.go:49] http: TLS handshake error from 
192.168.160.19:37178: EOF
I0331 12:46:57.072141   1 logs.go:49] http: TLS handshake error from 
192.168.160.17:58022: EOF


E0331 12:47:37.855023   1 memcache.go:147] couldn't get resource 
list for metrics.k8s.io/v1beta1: the server is currently unable to 
handle the request
E0331 12:47:37.856569   1 memcache.go:147] couldn't get resource 
list for servicecatalog.k8s.io/v1beta1: the server is currently unable 
to handle the request
E0331 12:47:44.115290   1 authentication.go:62] Unable to 
authenticate the request due to an error: [x509: certificate has expired 
or is not yet valid, x509: certificate

 has expired or is not yet valid]
E0331 12:47:44.118976   1 authentication.go:62] Unable to 
authenticate the request due to an error: [x509: certificate has expired 
or is not yet valid, x509: certificate

 has expired or is not yet valid]
E0331 12:47:44.122276   1 authentication.go:62] Unable to 
authenticate the request due to an error: [x509: certificate has expired 
or is not yet valid, x509: certificate

 has expired or is not yet valid]

Huge number of this second sort.

Any ideas what is wrong?
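
For anyone hitting the same symptoms: a quick, hedged check of the node client
certificates (which are separate from the named master certificates) - the
path assumes OKD 3.10/3.11 defaults:

# on a node, show when the kubelet client certificate expires
openssl x509 -noout -subject -enddate \
  -in /etc/origin/node/certificates/kubelet-client-current.pem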



___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


cockpit/kubernetes images in OKD 3.11 not being pulled

2020-03-22 Thread Tim Dudgeon
We're running OKD 3.11 clusters and they have started having problems 
with the registry console.


This uses the docker.io/cockpit/kubernetes container image which can no 
longer be pulled from the node on which the registry is running:


$ docker pull cockpit/kubernetes:latest
Trying to pull repository docker.io/cockpit/kubernetes ...
manifest for docker.io/cockpit/kubernetes:latest not found

However I can pull that image from my laptop without problems.

I notice that on DockerHub this image is described as 'obsoleted in 
2018'. Is there anything in the OKD Docker configuration that blocks 
this image from being pulled?


Thanks
Tim
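
(A hedged way to see which tags still exist upstream, assuming skopeo is
available on the node - the RepoTags list in the output shows what can
actually be pulled:)

skopeo inspect docker://docker.io/cockpit/kubernetes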


___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users



Re: Changing Prometheus rules

2019-11-19 Thread Tim Dudgeon
No joy with that approach. I tried editing the ConfigMap and the CRD but 
both got reset when the cluster-monitoring-operator was restarted.


Looks like I'll have to live with silencing the alert.

On 19/11/2019 07:56, Vladimir REMENAR wrote:

Hi Tim,

You need to stop the cluster-monitoring-operator and then edit the 
configmap. If the cluster-monitoring-operator is running while editing the 
configmap it will always revert it to default.
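
(A minimal sketch of that first step, assuming the default deployment name and
namespace used by OKD 3.11 cluster monitoring:)

# stop the operator so it cannot revert manual edits; scale back to 1 when finished
oc -n openshift-monitoring scale deployment cluster-monitoring-operator --replicas=0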



Best regards,
*Vladimir Remenar*



From: Tim Dudgeon 
To: Simon Pasquier 
Cc: users 
Date: 18.11.2019 17:46
Subject: Re: Changing Prometheus rules
Sent by: users-boun...@lists.openshift.redhat.com




The KubeAPILatencyHigh alert fires several times a day for us (on 2
different OKD clusters).

On 18/11/2019 15:17, Simon Pasquier wrote:
> The Prometheus instances deployed by the cluster monitoring operator
> are read-only and can't be customized.
> 
https://docs.openshift.com/container-platform/3.11/install_config/prometheus_cluster_monitoring.html#alerting-rules_prometheus-cluster-monitoring

>
> Can you provide more details about which alerts are noisy?
>
> On Mon, Nov 18, 2019 at 2:43 PM Tim Dudgeon  
wrote:

>> What is the "right" way to edit Prometheus rules that are deployed by
>> default on OKD 3.11?
>> I have alerts that are annoyingly noisy, and want to silence them 
forever!

>>
>> I tried editing the definition of the PrometheusRule CRD and/or the
>> prometheus-k8s-rulefiles-0 ConfigMap in the openshift-monitoring 
project

>> but my changes keep getting reverted back to the original.
>>
>> ___
>> users mailing list
>> users@lists.openshift.redhat.com
>> http://lists.openshift.redhat.com/openshiftmm/listinfo/users
>>

___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users



___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: Changing Prometheus rules

2019-11-18 Thread Tim Dudgeon
The KubeAPILatencyHigh alert fires several times a day for us (on 2 
different OKD clusters).


On 18/11/2019 15:17, Simon Pasquier wrote:

The Prometheus instances deployed by the cluster monitoring operator
are read-only and can't be customized.
https://docs.openshift.com/container-platform/3.11/install_config/prometheus_cluster_monitoring.html#alerting-rules_prometheus-cluster-monitoring

Can you provide more details about which alerts are noisy?

On Mon, Nov 18, 2019 at 2:43 PM Tim Dudgeon  wrote:

What is the "right" way to edit Prometheus rules that are deployed by
default on OKD 3.11?
I have alerts that are annoyingly noisy, and want to silence them forever!

I tried editing the definition of the PrometheusRule CRD and/or the
prometheus-k8s-rulefiles-0 ConfigMap in the openshift-monitoring project
but my changes keep getting reverted back to the original.

___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users



___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users



Changing Prometheus rules

2019-11-18 Thread Tim Dudgeon
What is the "right" way to edit Prometheus rules that are deployed by 
default on OKD 3.11?

I have alerts that are annoyingly noisy, and want to silence them forever!

I tried editing the definition of the PrometheusRule CRD and/or the 
prometheus-k8s-rulefiles-0 ConfigMap in the openshift-monitoring project 
but my changes keep getting reverted back to the original.


___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users



Re: router stats

2019-10-17 Thread Tim Dudgeon
OK, that looks more promising (but the question on whether the docs in 
the original link are correct still stands).


However, I'm having problems accessing the stats. Using the username and 
password found in the service definition e.g.:



curl admin:@172.30.67.67:1936/metrics


I get a:


HTTP/1.1 401 Unauthorized


This is with OKD 3.11

Tim
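
(A hedged way to double-check the credentials and port the router is actually
configured with - they are set as STATS_* environment variables on the router
deployment config, assumed here to live in the default project:)

oc -n default set env dc/router --list | grep STATS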


On 16/10/2019 16:24, Brian Jarvis wrote:

Information on accessing the router metrics can be found [0].

[0] 
https://docs.okd.io/3.11/install_config/router/default_haproxy_router.html#exposing-the-router-metrics



On Tue, Oct 15, 2019 at 6:09 AM Tim Dudgeon <mailto:tdudgeon...@gmail.com>> wrote:


So how do I access these?

And are the docs here [1] wrong?

[1] https://docs.okd.io/3.11/admin_guide/router.html

On 14/10/2019 19:26, Clayton Coleman wrote:

Metrics are exposed via the controller process in the pod (pid1),
not the HAProxy process.

On Mon, Oct 14, 2019 at 1:27 PM Tim Dudgeon
mailto:tdudgeon...@gmail.com>> wrote:

I'm trying to see the router stats as described here:
https://docs.okd.io/3.11/admin_guide/router.html

I can see this from within the container using the command:

echo 'show stat' | socat -
UNIX-CONNECT:/var/lib/haproxy/run/haproxy.sock

But they do not seem to be being exposed through the web
listener as
described in that doc. In fact I can't see anything in the
haproxy.config file that suggests that haproxy is exposing
stats on port
1936 or any other port.

The installation was a fairly standard openshift-ansible
install so I'm
sure the defaults have not been changed.

Are there any instructions for how to get this working?

Thanks
Tim

___
users mailing list
users@lists.openshift.redhat.com
<mailto:users@lists.openshift.redhat.com>
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


___
users mailing list
users@lists.openshift.redhat.com
<mailto:users@lists.openshift.redhat.com>
http://lists.openshift.redhat.com/openshiftmm/listinfo/users





___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: router stats

2019-10-15 Thread Tim Dudgeon

So how do I access these?

And are the docs here [1] wrong?

[1] https://docs.okd.io/3.11/admin_guide/router.html

On 14/10/2019 19:26, Clayton Coleman wrote:
Metrics are exposed via the controller process in the pod (pid1), not 
the HAProxy process.


On Mon, Oct 14, 2019 at 1:27 PM Tim Dudgeon <mailto:tdudgeon...@gmail.com>> wrote:


I'm trying to see the router stats as described here:
https://docs.okd.io/3.11/admin_guide/router.html

I can see this from within the container using the command:

echo 'show stat' | socat -
UNIX-CONNECT:/var/lib/haproxy/run/haproxy.sock

But they do not seem to be being exposed through the web listener as
described in that doc. In fact I can't see anything in the
haproxy.config file that suggests that haproxy is exposing stats
on port
1936 or any other port.

The installation was a fairly standard openshift-ansible install
so I'm
sure the defaults have not been changed.

Are there any instructions for how to get this working?

Thanks
Tim

___
users mailing list
users@lists.openshift.redhat.com
<mailto:users@lists.openshift.redhat.com>
http://lists.openshift.redhat.com/openshiftmm/listinfo/users

___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


router stats

2019-10-14 Thread Tim Dudgeon
I'm trying to see the router stats as described here: 
https://docs.okd.io/3.11/admin_guide/router.html


I can see this from within the container using the command:

echo 'show stat' | socat - UNIX-CONNECT:/var/lib/haproxy/run/haproxy.sock

But they do not seem to be being exposed through the web listener as 
described in that doc. In fact I can't see anything in the 
haproxy.config file that suggests that haproxy is exposing stats on port 
1936 or any other port.


The installation was a fairly standard openshift-ansible install so I'm 
sure the defaults have not been changed.


Are there any instructions for how to get this working?

Thanks
Tim

___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: Changing network MTU

2019-08-27 Thread Tim Dudgeon
OK, thanks. So what would be needed to change this on a running system 
rather than when you first install openshift?
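
(A rough sketch of what that involves on 3.11, assuming the default node-group
ConfigMaps; the SDN MTU must stay below the NIC MTU, typically NIC MTU minus
50 bytes of VXLAN overhead:)

oc -n openshift-node edit configmap node-config-compute   # repeat for node-config-master / node-config-infra
#   networkConfig:
#     mtu: 1450        # e.g. 1500 - 50 for the VXLAN overlay
# then, on each host, restart the node service so the new value is picked up:
systemctl restart origin-node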


On 27/08/2019 17:14, Brian Jarvis wrote:

Tim,

You need to set the MTU of the OpenShift SDN to be lower than the MTU 
of the NIC.


This is described in 
https://docs.openshift.com/container-platform/3.11/scaling_performance/network_optimization.html#scaling-performance-optimizing-mtu.





On Tue, Aug 27, 2019 at 12:02 PM Tim Dudgeon <mailto:tdudgeon...@gmail.com>> wrote:


In one of our OKD 3.11 environments the hosting provider wanted to
change the network MTU from 9000 to 1500 and did that for all the
physical network interfaces of all the nodes.

This caused the Openshift networking to break completely.
Resetting back
to 9000 restored things.

Is there a way to allow for this to be done on a running Openshift
system?

Thanks
Tim

___
users mailing list
users@lists.openshift.redhat.com
<mailto:users@lists.openshift.redhat.com>
http://lists.openshift.redhat.com/openshiftmm/listinfo/users



___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Changing network MTU

2019-08-27 Thread Tim Dudgeon
In one of our OKD 3.11 environments the hosting provider wanted to 
change the network MTU from 9000 to 1500 and did that for all the 
physical network interfaces of all the nodes.


This caused the Openshift networking to break completely. Resetting back 
to 9000 restored things.


Is there a way to allow for this to be done on a running Openshift system?

Thanks
Tim

___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Adding prometheus rules

2019-05-13 Thread Tim Dudgeon

In the docs for prometheus [1] it states:

"OKD Cluster Monitoring ships with the following alerting rules 
configured by default. Currently you cannot add custom alerting rules."


Is that really the case? Is it really impossible to add new rules so that you can 
set up custom alerts?

Seems a bit crazy?

Tim


[1] 
https://docs.okd.io/latest/install_config/prometheus_cluster_monitoring.html#alerting-rules


___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


missing brick directory with glusterfs

2019-05-03 Thread Tim Dudgeon
I'm hitting a problem where some gluster volumes are failing to start 
and it appears that the 'brick' dir is missing.

Paths (inside the gluster pod) typically look something like this:

/var/lib/heketi/mounts/vg_cd463bbe47d2fd219fadaf1a089f9816/brick_286495c898ccb1920998915757ed49ad/brick

but in the failing ones the final /brick directory is not present.

Any ideas as to how this can happen, and how to fix it?

This is with OKD 3.11 and Gluster 4.1.7.

Tim
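
(A hedged first diagnostic from inside one of the gluster pods: 'status' shows
which bricks are offline, and a forced start attempts to bring failed bricks
back up; the volume name is a placeholder:)

gluster volume status <volume-name>
gluster volume start <volume-name> force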

___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: gluster volume service and endpoint lost

2019-04-08 Thread Tim Dudgeon

I'm not really sure if that was the case.

I did manage to fix this by manually re-creating the service and 
endpoints using YAML like this:


apiVersion: v1
kind: Service
metadata:
  labels:
    gluster.kubernetes.io/provisioned-for-pvc: pvc-name
  name: glusterfs-dynamic-pvc-name
spec:
  ports:
  - port: 1
    protocol: TCP
    targetPort: 1
  sessionAffinity: None
  type: ClusterIP
---
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    gluster.kubernetes.io/provisioned-for-pvc: pvc-name
  name: glusterfs-dynamic-pvc-name
subsets:
- addresses:
  - ip: 10.0.0.18
  - ip: 10.0.0.24
  - ip: 10.0.0.8
  ports:
  - port: 1
    protocol: TCP

I have no idea whether this is the "correct" solution, but it seems to work.
Nor do I have any real idea as to what caused this. Googling suggests 
that this can happen if you delete and re-create the PVC in quick 
succession, but I'm pretty sure that was not the case here.


Tim

On 07/04/2019 23:51, Nikolas Philips wrote:

Hi Tim,
Is it possible that these failing PVC/PVs are mounted on the same 
compute node(s)? It might be, that a node has somehow issues with the 
gluster mount points / docker storage.
If it's only one node or so, you could either try to reset the storage 
(https://docs.openshift.com/container-platform/3.11/admin_guide/manage_nodes.html#managing-nodes-docker-reset) 
or simply reinstall the node (safer).


Nikolas


Am So., 7. Apr. 2019 um 10:02 Uhr schrieb Tim Dudgeon 
mailto:tdudgeon...@gmail.com>>:


I have a series of GlusterFS PVC that were working fine.

For some of these the corresponding service and endpoint has been
lost.
e.g. when I do a `oc get svc` or `oc get endpoints` some of the
PVC are
not listed and those PVCs cannot then be mounted to a pod with an
error
like this:

> MountVolume.NewMounter initialization failed for volume
> "pvc-3aafc4fa-3e5e-11e9-8522-fa163eca01d7" : endpoints
> "glusterfs-dynamic-xxx" not found

Any thoughts on:

1. what might have caused this?

2. How to re-create the service and endpoints?

Tim

___
users mailing list
users@lists.openshift.redhat.com
<mailto:users@lists.openshift.redhat.com>
http://lists.openshift.redhat.com/openshiftmm/listinfo/users

___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


gluster volume service and endpoint lost

2019-04-07 Thread Tim Dudgeon

I have a series of GlusterFS PVC that were working fine.

For some of these the corresponding service and endpoint has been lost. 
e.g. when I do a `oc get svc` or `oc get endpoints` some of the PVC are 
not listed and those PVCs cannot then be mounted to a pod with an error 
like this:


MountVolume.NewMounter initialization failed for volume 
"pvc-3aafc4fa-3e5e-11e9-8522-fa163eca01d7" : endpoints 
"glusterfs-dynamic-xxx" not found


Any thoughts on:

1. what might have caused this?

2. How to re-create the service and endpoints?

Tim

___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


trusted certs for registry

2019-02-07 Thread Tim Dudgeon
I'm wanting to allow external access to the openshift registry, but am 
finding that the SSL certificates used are self-signed and so not 
trusted. And they do not include the public hostname of the registry so 
seem to be only suitable for access within the cluster.


Is there a mechanism for creating a public route for the registry and 
providing trusted certs in the ansible installer along the lines of the 
'openshift_master_named_certificates' property in the inventory file 
that handles this for the master API and console.


I know there are manual steps described [1] for doing this but these 
seem quite involved and not that easy to automate.


Note: this would need to handle the routes for both the registry and the 
registry console.
Note: we are currently stuck on version 3.7, but imagine this applies to 
more recent versions too.


[1] 
https://docs.okd.io/3.7/install_config/registry/securing_and_exposing_registry.html#exposing-the-registry
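
(For what it's worth, a sketch of the inventory variables openshift-ansible
understands for this; the hostname and file paths are examples, and the
registry console route would still need separate handling:)

openshift_hosted_registry_routehost=registry.example.com
openshift_hosted_registry_routetermination=reencrypt
openshift_hosted_registry_routecertificates={"certfile": "/path/to/fullchain.pem", "keyfile": "/path/to/privkey.pem", "cafile": "/path/to/chain.pem"}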


___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: pod won't terminate

2018-11-26 Thread Tim Dudgeon

Thanks.

Actually I finally managed to delete it. I used the web console to 
delete the pod (choosing the delete immediately option).


Yes, I know it doesn't make much sense!

On 26/11/2018 17:35, lmi wrote:


Hi

One trick that has worked for us is to patch the pod - setting its 
list of finalizers to nothing:

oc patch pod <pod-name> -n <namespace> -p '{"metadata":{"finalizers":null}}'


Best regards

Lars Milland

Tim Dudgeon wrote on 2018-11-26 16:14:


I've got a pod that just fails to terminate.

`oc delete --force` does not work.

Killing the container on the node where it was running doesn't work (and the 
container is not running on that node after this).

Restarting the origin-node service on the node, and the origin-master-api and 
origin-master-controllers service on the master does not work.

Nothing seems to work. That damn pod seem to be immortal! It just hangs around 
saying it is in `Terminating` status forever.

How can I kill it!

___
users mailing list
users@lists.openshift.redhat.com  <mailto:users@lists.openshift.redhat.com>
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


--
Mvh
Lars Milland
___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


How best to determine volume usage

2018-11-26 Thread Tim Dudgeon
When using dynamic volumes from cloud providers (e.g. cinder with 
OpenStack) or GlusterFS volumes for creating PVCs these volumes do not 
get displayed when using the 'df' command on the the node so it not 
straight forward to determine how full the volume is.


Similarly 'oc describe pv/pv-name' lists the capacity of the volume, but 
not how much of that capacity has been used.


What is the best way to achieve this?
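
(One hedged approach is to check usage from a pod that actually mounts the
PVC, since the filesystem is only visible inside the pod's mount namespace;
pod name and mount path are placeholders:)

oc exec <pod-name> -- df -h <mount-path-of-the-pvc>
# or interactively:
oc rsh <pod-name>
df -h <mount-path-of-the-pvc>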


___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


pod won't terminate

2018-11-26 Thread Tim Dudgeon

I've got a pod that just fails to terminate.

`oc delete --force` does not work.

Killing the container on the node where it was running doesn't work (and 
the container is not running on that node after this).


Restarting the origin-node service on the node, and the 
origin-master-api and origin-master-controllers service on the master 
does not work.


Nothing seems to work. That damn pod seem to be immortal! It just hangs 
around saying it is in `Terminating` status forever.


How can I kill it!

___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Contents of origin-upstream-dns.conf changed after reboot

2018-10-08 Thread Tim Dudgeon
We've got a situation where if a node is rebooted the contents of the 
/etc/dnsmasq.d/origin-upstream-dns.conf file get changed to the wrong 
settings, preventing the node service from starting.


The correct value is to point to the IP address of the nameserver on the 
network that is resolving the names of all servers in the cluster.
This was originally the value that was in /etc/resolv.conf before the 
the ansible installer changed this to point to the local machine, and 
place that value in the origin-upstream-dns.conf file.


But after a reboot the contents of this file are changed to different 
nameservers, which I believe are the ones retrieved from DHCP.
If the contents of origin-upstream-dns.conf are manually corrected and 
the dnsmasq and origin-node services restarted, all is good again.


Can anyone explain why this happens and how to prevent it?
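
(A hedged way to stop DHCP from rewriting the DNS settings on RHEL/CentOS
hosts is to pin them in the interface config; the interface name and
nameserver IP below are examples:)

# /etc/sysconfig/network-scripts/ifcfg-eth0
PEERDNS=no                            # stop DHCP/NetworkManager overwriting DNS
DNS1=<ip-of-the-internal-nameserver>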

___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: https route stopped working

2018-10-08 Thread Tim Dudgeon

Yes, I had tried re-creating the route and that didn't work.

Eventually I did manage to solve it. The 'Destination CA Cert' property 
for the route was (automatically) filled with some placeholder 
'backwards compatibility' text. When I replaced this with the CA cert 
used by the service (found in the secrets) things started working again.


I have no idea why this stopped working and why this fix became necessary.
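
(For anyone hitting the same thing, a hedged sketch of recreating the route
with an explicit destination CA - the hostname and certificate file are
examples; the CA is the one that signed the service's certificate:)

oc create route reencrypt secure-sso --service=secure-sso --port=8443 \
  --hostname=sso.example.com --dest-ca-cert=service-ca.crt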


On 07/10/18 21:14, Joel Pearson wrote:
Have you tried looking at the generated haproxy file inside the 
router? It might give some hints as to what went wrong. I presume 
you’ve already tried recreating the route?
On Wed, 3 Oct 2018 at 2:30 am, Tim Dudgeon <mailto:tdudgeon...@gmail.com>> wrote:


We've hit a problem where an HTTPS route that used to work fine has now
stopped working.
Instead of the application we are seeing the 'Application is not
available' page from the router.

The route is using 'reencrypt' termination type to hit the service on
port 8443.
The service itself and its pod is running OK as indicated by being
able
to curl it from inside the router pod using:

curl -kL https://secure-sso.openrisknet-infra.svc:8443/auth

(the -k is needed).

An equivalent HTTP route that hits the HTTP service on port 8080 is
working fine.

The only thing I can think of that might have caused this is
redeploying
the master certificates using the 'redeploy-certificates.yml'
playbook,
but I can't see how that would cause this.
This is all with Origin 3.7.

Any thoughts on what might be wrong here?

___
users mailing list
users@lists.openshift.redhat.com
<mailto:users@lists.openshift.redhat.com>
http://lists.openshift.redhat.com/openshiftmm/listinfo/users



___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


https route stopped working

2018-10-02 Thread Tim Dudgeon
We've hit a problem where an HTTPS route that used to work fine has now 
stopped working.
Instead of the application we are seeing the 'Application is not 
available' page from the router.


The route is using 'reencrypt' termination type to hit the service on 
port 8443.
The service itself and its pod is running OK as indicated by being able 
to curl it from inside the router pod using:


curl -kL https://secure-sso.openrisknet-infra.svc:8443/auth

(the -k is needed).

An equivalent HTTP route that hits the HTTP service on port 8080 is 
working fine.


The only thing I can think of that might have caused this is redeploying 
the master certificates using the 'redeploy-certificates.yml' playbook, 
but I can't see how that would cause this.

This is all with Origin 3.7.

Any thoughts on what might be wrong here?

___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


deleting docker images on nodes

2018-09-25 Thread Tim Dudgeon
As time progresses more and more docker images will be present on the 
nodes in a cluster as different pods get deployed.

This could use up significant disk space.

Does openshift provide a mechanism for pruning these, or is doing this 
up to the cluster administrator?
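
(As far as I understand it, two separate mechanisms apply: the kubelet
garbage-collects unused images on each node once disk usage crosses its
thresholds, while images in the integrated registry are pruned explicitly.
A hedged example of the latter, with flags per the 3.x docs:)

# run as a cluster admin; --confirm actually performs the deletion
oc adm prune images --keep-tag-revisions=3 --keep-younger-than=60m --confirm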


___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


bad certificates with ansible service broker

2018-09-11 Thread Tim Dudgeon
We're having problems with the Ansible Service Broker, with etcd 
rejecting the certificate of the broker.

In the logs of the asb-etcd pod I see this:

2018-09-11 09:13:26.779392 I | embed: rejected connection from 
"127.0.0.1:50656" (error "tls: failed to verify client's certificate: 
x509: certificate signed by unknown authority", ServerName "")
WARNING: 2018/09/11 09:13:26 Failed to dial 0.0.0.0:2379: connection 
error: desc = "transport: authentication handshake failed: remote error: 
tls: bad certificate"; please retry.


This results in the asb pod failing to start.

I believe this may have happened after the cluster certificates were 
updated using the redeploy-certificates.yml playbook.

This is using Origin 3.7.2.

Any thoughts on how to correct this?
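
(A hedged way to confirm the mismatch is to verify the broker's client
certificate against the CA that etcd trusts; the file paths below are
placeholders:)

openssl verify -CAfile <etcd-ca.crt> <asb-client.crt>
openssl x509 -noout -issuer -enddate -in <asb-client.crt>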

___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: node ip addresses change

2018-08-28 Thread Tim Dudgeon

Jose,

Thanks.

So my setup just has a single master and single infra node (plus some 
'worker' nodes).
So that presumably makes it not possible to do this as single master 
setups cannot be scaled up?

Seems like I have to start again from scratch?


On 28/08/18 12:02, Jose Manuel wrote:


Hi Tim,

In master certificates the Subject Alternative Name includes some IP 
addresses like internal balancer.


In etcd certificates the Subject Alternative Name also includes its 
own addresses.


Masters cannot change their IP addresses (not easily).


Nodes also have certificates where their own address is and there is a 
virtual network software that all nodes (masters are also nodes) use 
to allow pods communication. I think that connections are also done 
using the ip address instead dns name. I am not sure about this point.



The most secure and easy way to change the node address is to remove 
it from the cluster and add it using the procedures described here: 
https://docs.okd.io/3.9/admin_guide/manage_nodes.html#adding-nodes



Jose Manuel


--

Jose Manuel Ferrer Mosteiro

Devops / Sysdev @ Paradigma Digital

   __    _ _
  / /  _ __   __ _ _ __ __ _  __| (_) __ _ _ __ ___   __ _
 | |  | '_ \ / _` | '__/ _` |/ _` | |/ _` | '_ ` _ \ / _` |
< <   | |_) | (_| | | | (_| | (_| | | (_| | | | | | | (_| |
 | |  | .__/ \__,_|_|  \__,_|\__,_|_|\__, |_| |_| |_|\__,_|
  \_\ |_|    |___/


http://www.paradigmadigital.com/
Vía de las dos Castillas, 33, Ática 4, 2ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: 91 352 59 42 // @paradigmate


El 2018-08-28 12:36, Tim Dudgeon escribió:

I've got a situation where the IP addresses of the nodes in an 
openshift origin 3.9 cluster are going to change and am trying to 
work out what impact this will have. Of course the DNS will be 
updated to reflect the changes, and the ansible inventory file only 
uses hostnames, not IP addresses.


However, looking that the /etc/origin/master/master-config.yaml I see 
an entry like this:

masterIP: 172.20.0.16

And on the nodes in the /etc/origin/node/node-config.yaml I see this:
dnsIP: 172.20.0.16

So this suggests that the IP addresses are significant in some aspects.
Are there other places where the IP addresses will need to be changed?
Should it work to just update those IP addresses and restart the services?

Thanks
Tim

___
users mailing list
users@lists.openshift.redhat.com 
<mailto:users@lists.openshift.redhat.com>

http://lists.openshift.redhat.com/openshiftmm/listinfo/users



___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


node ip addresses change

2018-08-28 Thread Tim Dudgeon
I've got a situation where the IP addresses of the nodes in an openshift 
origin 3.9 cluster are going to change and am trying to work out what 
impact this will have. Of course the DNS will be updated to reflect the 
changes, and the ansible inventory file only uses hostnames, not IP 
addresses.


However, looking that the /etc/origin/master/master-config.yaml I see an 
entry like this:

masterIP: 172.20.0.16

And on the nodes in the /etc/origin/node/node-config.yaml I see this:
dnsIP: 172.20.0.16

So this suggests that the IP addresses are significant in some aspects.
Are there other places where the IP addresses will need to be changed?
Should it work to just update those IP addresses and restart the services?
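
(A hedged way to find every place the old address is baked in before deciding;
the IP is the one from the example above:)

grep -rl "172.20.0.16" /etc/origin /etc/etcd 2>/dev/null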

Thanks
Tim

___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: replace gluster node

2018-08-26 Thread Tim Dudgeon

So I dug a bit deeper.
After the procedure I described in the previous post I looked at what 
the status was.


On one of the two gluster nodes/pods that was still working I see this:


# gluster pool list
UUID                    Hostname     State
6075f5a2-ce2f-4d4d-92a2-6850620c636e    10.0.0.24    Connected
ee33c338-c057-416d-81a4-c8d103570f18    10.0.0.39  Disconnected
48571acb-9a4b-4f4d-81bf-30291b14513e    localhost    Connected
The first is the other good node, the second is the failed node that no 
longer exists.

From the other good node the situation is similar.

From the new node I see this:


# gluster pool list
UUID                    Hostname     State
95e93b7d-9e96-4d76-925f-3f8aaf289ba2    localhost    Connected 
So clearly the new node is alive and gluster is running, but it has not 
joined the storage pool.
On the node itself the volumes are present as devices (/dev/vdb, 
/dev/vdc, /dev/vdd) but they are not mounted.


So how best to rectify this situation?
Should this be done with gluster or with heketi?

Tim


On 25/08/18 15:11, Tim Dudgeon wrote:

Not having any joy with replacing the broken glusterfs node.

What we did was:

1. Delete the broken gluster node from the cluster, and remove it from 
the inventory file


2. Create a new node to replace it. Add it to the [new_nodes] section 
of the inventory and run the playbooks/byo/openshift-node/scaleup.yml 
playbook. At this stage it is not added to the [glusterfs] section of 
the inventory. The node is now part of the cluster. Move it from the 
[new_nodes] section of the inventory to the [nodes] section.


3. Add the new node to the  [glusterfs] section of the inventory. At 
this stage we have the 2 functioning gluster nodes with volumes 
containing data, and one new node with unformatted volumes.


4. Edit the [OSEv3:vars] section and add these 3 properties:
openshift_storage_glusterfs_wipe = False
openshift_storage_glusterfs_is_missing = False
openshift_storage_glusterfs_heketi_is_missing = False

5. Run the playbooks/byo/openshift-glusterfs/config.yml playbook. This 
fails with the following error:


TASK [openshift_storage_glusterfs : Load heketi topology] 

Saturday 25 August 2018  12:04:37 + (0:00:02.073) 0:26:39.480 
***
fatal: [orn-master.openstacklocal]: FAILED! => {"changed": true, 
"cmd": ["oc", 
"--config=/tmp/openshift-glusterfs-ansible-2Zl8Vv/admin.kubeconfig", 
"rsh", "--namespace=glusterfs", "heketi-storage-2-jvfhn", 
"heketi-cli", "-s", "http://localhost:8080;, "--user", "admin", 
"--secret", "sjGJ1Gix0Nf9GEaXynTSngMwi6D/fHtEEWxyZCSlVY8=", 
"topology", "load", 
"--json=/tmp/openshift-glusterfs-ansible-2Zl8Vv/topology.json", 
"2>&1"], "delta": "0:00:05.354851", "end": "2018-08-25 
12:10:10.128102", "failed_when_result": true, "rc": 0, "start": 
"2018-08-25 12:10:04.773251", "stderr": "", "stderr_lines": [], 
"stdout": "\tFound node orn-gluster-storage-001.openstacklocal on 
cluster de03021c7b9d5f6a99d403a7a369d3e1\n\t\tFound device 
/dev/vdb\n\t\tFound device /dev/vdc\n\t\tFound device 
/dev/vdd\n\tFound node orn-gluster-storage-002.openstacklocal on 
cluster de03021c7b9d5f6a99d403a7a369d3e1\n\t\tFound device 
/dev/vdb\n\t\tFound device /dev/vdc\n\t\tFound device 
/dev/vdd\n\tCreating node orn-gluster-storage-003.openstacklocal ... 
Unable to create node: Unable to execute command on 
glusterfs-storage-k7lp4: peer probe: failed: Probe returned with 
Transport endpoint is not connected", "stdout_lines": ["\tFound node 
orn-gluster-storage-001.openstacklocal on cluster 
de03021c7b9d5f6a99d403a7a369d3e1", "\t\tFound device /dev/vdb", 
"\t\tFound device /dev/vdc", "\t\tFound device /dev/vdd", "\tFound 
node orn-gluster-storage-002.openstacklocal on cluster 
de03021c7b9d5f6a99d403a7a369d3e1", "\t\tFound device /dev/vdb", 
"\t\tFound device /dev/vdc", "\t\tFound device /dev/vdd", "\tCreating 
node orn-gluster-storage-003.openstacklocal ... Unable to create 
node: Unable to execute command on glusterfs-storage-k7lp4: peer 
probe: failed: Probe returned with Transport endpoint is not 
connected"]}


Similarly, if you oc rsh to the heketi pod and run the heketi-cli you 
get a similar error:


heketi-cli node add --zone=1 --cluster=$CLUSTER_ID 
--management-host-name=orn-gluster-storage-003.openstacklocal 
--storage-host-name=10.0.0.26
Error: Unable to execute command on glusterfs-storage-k7lp4: peer 
probe: failed: Probe returned with Transport endpoint is not connected


Any thoug

Re: replace gluster node

2018-08-25 Thread Tim Dudgeon

Not having any joy with replacing the broken glusterfs node.

What we did was:

1. Delete the broken gluster node from the cluster, and remove it from 
the inventory file


2. Create a new node to replace it. Add it to the [new_nodes] section of 
the inventory and run the playbooks/byo/openshift-node/scaleup.yml 
playbook. At this stage it is not added to the [glusterfs] section of 
the inventory. The node is now part of the cluster. Move it from the 
[new_nodes] section of the inventory to the [nodes] section.


3. Add the new node to the  [glusterfs] section of the inventory. At 
this stage we have the 2 functioning gluster nodes with volumes 
containing data, and one new node with unformatted volumes.


4. Edit the [OSEv3:vars] section and add these 3 properties:
openshift_storage_glusterfs_wipe = False
openshift_storage_glusterfs_is_missing = False
openshift_storage_glusterfs_heketi_is_missing = False

5. Run the playbooks/byo/openshift-glusterfs/config.yml playbook. This 
fails with the following error:


TASK [openshift_storage_glusterfs : Load heketi topology] 


Saturday 25 August 2018  12:04:37 + (0:00:02.073) 0:26:39.480 ***
fatal: [orn-master.openstacklocal]: FAILED! => {"changed": true, 
"cmd": ["oc", 
"--config=/tmp/openshift-glusterfs-ansible-2Zl8Vv/admin.kubeconfig", 
"rsh", "--namespace=glusterfs", "heketi-storage-2-jvfhn", 
"heketi-cli", "-s", "http://localhost:8080;, "--user", "admin", 
"--secret", "sjGJ1Gix0Nf9GEaXynTSngMwi6D/fHtEEWxyZCSlVY8=", 
"topology", "load", 
"--json=/tmp/openshift-glusterfs-ansible-2Zl8Vv/topology.json", 
"2>&1"], "delta": "0:00:05.354851", "end": "2018-08-25 
12:10:10.128102", "failed_when_result": true, "rc": 0, "start": 
"2018-08-25 12:10:04.773251", "stderr": "", "stderr_lines": [], 
"stdout": "\tFound node orn-gluster-storage-001.openstacklocal on 
cluster de03021c7b9d5f6a99d403a7a369d3e1\n\t\tFound device 
/dev/vdb\n\t\tFound device /dev/vdc\n\t\tFound device 
/dev/vdd\n\tFound node orn-gluster-storage-002.openstacklocal on 
cluster de03021c7b9d5f6a99d403a7a369d3e1\n\t\tFound device 
/dev/vdb\n\t\tFound device /dev/vdc\n\t\tFound device 
/dev/vdd\n\tCreating node orn-gluster-storage-003.openstacklocal ... 
Unable to create node: Unable to execute command on 
glusterfs-storage-k7lp4: peer probe: failed: Probe returned with 
Transport endpoint is not connected", "stdout_lines": ["\tFound node 
orn-gluster-storage-001.openstacklocal on cluster 
de03021c7b9d5f6a99d403a7a369d3e1", "\t\tFound device /dev/vdb", 
"\t\tFound device /dev/vdc", "\t\tFound device /dev/vdd", "\tFound 
node orn-gluster-storage-002.openstacklocal on cluster 
de03021c7b9d5f6a99d403a7a369d3e1", "\t\tFound device /dev/vdb", 
"\t\tFound device /dev/vdc", "\t\tFound device /dev/vdd", "\tCreating 
node orn-gluster-storage-003.openstacklocal ... Unable to create node: 
Unable to execute command on glusterfs-storage-k7lp4: peer probe: 
failed: Probe returned with Transport endpoint is not connected"]}


Similarly, if you oc rsh to the heketi pod and run the heketi-cli you 
get a similar error:


heketi-cli node add --zone=1 --cluster=$CLUSTER_ID 
--management-host-name=orn-gluster-storage-003.openstacklocal 
--storage-host-name=10.0.0.26
Error: Unable to execute command on glusterfs-storage-k7lp4: peer 
probe: failed: Probe returned with Transport endpoint is not connected


Any thoughts how to repair this?


On 24/08/18 13:39, Walters, Todd wrote:

Tim,

Try deleting all the pods, the glusterfs pods and the heketi pod. Do it one at 
a time. I’ve had this work for me where the pods came back up and heketi was ok.

Also can try restarting glusterfs glusterd in the pod term on each pod. That’s 
worked for me to get out of heketi db issues.

Other than that I don’t have any other ideas. I’ve not found good information 
on how to resolve or troubleshoot issues like this.

Thanks,
Todd

On 8/24/18, 4:37 AM, "Tim Dudgeon"  wrote:

 Todd,

 Thanks for that. Seems on the lines that I need.

 The problem though is that I have an additional problem of the heketi
 pod not starting because of a messed up database configuration.
 These two problems happened independently, but on the same OpenShift
 environment.
 This means I'm unable to run the heketi-cli until that is fixed.
 I'm not sure if I can modify the heketi database configuration as
 described in the troubleshooting guide [1] so that it only knows about
   

Re: replace gluster node

2018-08-24 Thread Tim Dudgeon

Todd,

Thanks for that. Seems on the lines that I need.

The problem though is that I have an additional problem of the heketi 
pod not starting because of a messed up database configuration.
These two problems happened independently, but on the same OpenShift 
environment.

This means I'm unable to run the heketi-cli until that is fixed.
I'm not sure if I can modify the heketi database configuration as 
described in the troubleshooting guide [1] so that it only knows about 
the two good gluster nodes, and then add back the third one?


Any thoughts?

Tim

[1] https://github.com/heketi/heketi/blob/master/docs/troubleshooting.md


On 23/08/18 17:14, Walters, Todd wrote:

Tim,

I have had this issue with 3 node cluster. I created a new node with new 
devices, ran scaleup and ran gluster playbook with some changes, then ran 
heketi-cli commands to add new node and remove old node.

For your other question, I’ve restarted all glusterfs pods and hekeit pod and 
resolved that issue before.  I guess you can restart glusterd in each pod too?

Here’s doc I wrote on node replacement. I’m not sure if this is proper 
procedure, but it works, and I wasn’t able to find any decent solution in the 
docs.

# - Replacing a Failed Node  #

Disable Node to simulate failure
Get node id with heketi-cli node list or topology info

heketi-cli node disable fb344a2ea889c7e25a772e747c2a -s http://localhost:8080 --user 
admin --secret "$HEKETI_CLI_KEY"
Node fb344a2ea889c7e25a772e747c2a is now offline

Stop Node in AWS Console
Scale up another node (4) for Gluster via Terraform
Run scaleup_node.yml playbook

Add New Node and Device

heketi-cli node add --zone=1 --cluster=441248c1b2f032a93aca4a4e03648b28 
--management-host-name=ip-new-node.ec2.internal --storage-host-name=newnodeIP  -s 
http://localhost:8080 --user admin --secret "$HEKETI_CLI_KEY"
heketi-cli device add --name /dev/xvdc --node 8973b41d8a4e437bd8b36d7df1a93f06 -s 
http://localhost:8080 --user admin --secret "$HEKETI_CLI_KEY"


Run deploy_gluster playbook, with the following changes in OSEv3

-   openshift_storage_glusterfs_wipe: False
-   openshift_storage_glusterfs_is_missing: False
-   openshift_storage_glusterfs_heketi_is_missing: False

Verify topology
rsh into heketi pod
run heketi-exports (a file I created with export commands)
get old and new node info (id)

Remove Node

sh-4.4# heketi-cli node remove fb344a2ea889c7e25a772e747c2a -s http://localhost:8080 
--user admin --secret "$HEKETI_CLI_KEY"
Node fb344a2ea889c7e25a772e747c2a is now removed


Remove All Devices (check the topology)

sh-4.4# heketi-cli device delete ea85942eaec73cb666c4e3dcec8b3702 -s 
http://localhost:8080 --user admin --secret "$HEKETI_CLI_KEY"
Device ea85942eaec73cb666c4e3dcec8b3702 deleted


Delete the Node

sh-4.4# heketi-cli node delete fb344a2ea889c7e25a772e747c2a -s http://localhost:8080 
--user admin --secret "$HEKETI_CLI_KEY"
Node fb344a2ea889c7e25a772e747c2a deleted


Verify New Topology

$ heketi-cli topology info
make sure the new node and device are listed.


Thanks,

Todd

# ---

Check any existing pvc is still accessible.

Replacing failed gluster node

2018-08-23 Thread Tim Dudgeon
I have a 3 node containerised glusterfs setup, and one of the nodes has 
just died.

I believe I can recover the disks that were used for the gluster storage.
What is the best approach to replacing that node with a new one?
Can I just create a new node with empty disks mounted and use the 
scaleup.yml playbook and [new_nodes] section, or should I be creating a 
node that re-uses the existing drives?


Tim

___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


heketi database inconsistent after a crash

2018-08-23 Thread Tim Dudgeon

We had a system crash with all nodes being forced to shut down.
When restarted we're having problems with the GlusterFS storage (this is 
with OpenShift Origin 3.7.2).
The gluster nodes appear to have restarted fine, and AFAICT the volumes 
and bricks are all OK.
But the heketi pod is failing to restart as its db is in an inconsistent 
state.
Following the instructions here [1] we tried to remove the pending 
operations, but heketi is still stuck with errors like this:



[heketi] INFO 2018/08/22 13:24:50 Loaded kubernetes executor
[heketi] ERROR 2018/08/22 13:24:50 /src/github.com/heketi/heketi/apps/glusterfs/dbcommon.go:109: Failed to upgrade db for brick entries: Id not found
[heketi] ERROR 2018/08/22 13:24:50 /src/github.com/heketi/heketi/apps/glusterfs/app.go:125: Unable to Upgrade Changes
[heketi] ERROR 2018/08/22 13:24:50 /src/github.com/heketi/heketi/apps/glusterfs/app.go:133: Id not found

ERROR: Unable to start application


Is it possible to work around this somehow by restarting the heketi pod 
and get it to pick up its information afresh from the gluster nodes?



[1] 
https://github.com/heketi/heketi/blob/263fbb72055d71b3763a77c051e7a00cf0c4e436/docs/troubleshooting.md
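
For what it's worth, the flow that guide describes boils down to roughly this, 
run inside the heketi pod (a sketch; it assumes a heketi build that has the 
offline `heketi db export`/`heketi db import` commands, the db at the usual 
/var/lib/heketi/heketi.db path, and the server not running against it):

heketi db export --dbfile /var/lib/heketi/heketi.db --jsonfile /tmp/heketi-db.json
# edit /tmp/heketi-db.json to drop the broken brick / pending-operation entries
heketi db import --jsonfile /tmp/heketi-db.json --dbfile /var/lib/heketi/heketi.db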


___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


OAB failing to start due to certificate error

2018-07-20 Thread Tim Dudgeon
My pod for openshift-ansible-service-broker is failing to start, with 
this error found in the logs:


Using config file mounted to /etc/ansible-service-broker/config.yaml

==   Starting Ansible Service Broker...   ==

[2018-07-20T13:01:06.151Z] [NOTICE] Initializing clients...
[2018-07-20T13:01:06.152Z] [INFO] == ETCD CX ==
[2018-07-20T13:01:06.152Z] [INFO] EtcdHost: 
asb-etcd.openshift-ansible-service-broker.svc

[2018-07-20T13:01:06.152Z] [INFO] EtcdPort: 2379
[2018-07-20T13:01:06.152Z] [INFO] Endpoints: 
[https://asb-etcd.openshift-ansible-service-broker.svc:2379 ]
[2018-07-20T13:01:06.178Z] [ERROR] client: etcd cluster is unavailable 
or misconfigured; error #0: x509: certificate signed by unknown authority


The pod for asb-etcd is running fine, but in the logs I see these errors:

2018-07-20 12:27:45.723585 I | embed: rejected connection from 
"10.129.4.1:32916" (error "remote error: tls: bad certificate", 
ServerName "asb-etcd.openshift-ansible-service-broker.svc")


This pod used to be running fine, but now won't start. Possibly this is 
related to pushing out updated certificates using the 
redeploy-certificates.yml playbook.
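
One quick check (a sketch; adjust names to your deployment) is to look at the 
certificate asb-etcd is actually serving and compare it with what the broker 
trusts, e.g. from inside any pod in that project:

echo | openssl s_client -connect asb-etcd.openshift-ansible-service-broker.svc:2379 2>/dev/null | openssl x509 -noout -issuer -subject -dates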


Any ideas on how to resolve this?
Or maybe to undeploy and redeploy the OAB?

This is using Origin 3.7.2

Thanks
Tim


___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: scheduler policy to spread pods

2018-07-09 Thread Tim Dudgeon
Hi, thanks for that suggestion. I took a look, but it seems it isn't 
quite what's needed.
It looks like pod (anti)affinity is a binary thing. It works for the 
first pod on the node with/without the specified label, but it doesn't 
ensure an even spread when you schedule multiple pods.


In my case I scheduled pods using an antiaffinity 
preferredDuringSchedulingIgnoredDuringExecution rule applying across 3 
nodes and that made sure that the first 3 pods went to separate nodes as 
expected, but after that the rule seemed to not be applied (there were 
no nodes that satisfied the rule, but as the rule was 'preferred' not 
'required' the pod was scheduled without any further preference). So 
by the time I had 6 pods running, 3 of them were on one node, 2 
on another and only 1 on the third.


So I suppose the anti-affinity rule is working as designed, but that it's 
not designed to ensure an even spread when you have multiple pods on the 
nodes.
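
For reference, the sort of rule I mean looks roughly like this in the pod 
spec (the label is a placeholder):

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            job: my-parallel-job
        topologyKey: kubernetes.io/hostname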



On 04/07/18 12:16, Joel Pearson wrote:

Here’s an OpenShift reference for the same thing.

https://docs.openshift.com/container-platform/3.6/admin_guide/scheduling/pod_affinity.html
On Wed, 4 Jul 2018 at 9:14 pm, Joel Pearson 
mailto:japear...@agiledigital.com.au>> 
wrote:


You’re probably after pod anti-affinity?

https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity

That lets you tell the scheduler that the pods aren’t allowed to
be on the same node for example.
On Wed, 4 Jul 2018 at 8:51 pm, Tim Dudgeon mailto:tdudgeon...@gmail.com>> wrote:

I've got a process that fires up a number of pods (bare pods,
not backed
by replication controller) to execute a computationally
demanding job in
parallel.
What I find is that the pods do not spread effectively across the
available nodes. In my case I have a node selector that restricts
execution to 3 nodes, and the pods run mostly on the first
node, a few
run on the second node, and none run on the third node.

I know that I could specify cpu resource requests and limits
to help
with this, but for other reasons I'm currently unable to do this.

It looks like this is controllable through the scheduler, but the
options for controlling this look pretty complex.
Could someone advise on how best to allow pods to spread
evenly across
nodes rather than execute preferentially on one node?

___
users mailing list
users@lists.openshift.redhat.com
<mailto:users@lists.openshift.redhat.com>
http://lists.openshift.redhat.com/openshiftmm/listinfo/users



___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: file permissions changed in docker registry

2018-07-06 Thread Tim Dudgeon

Here is the issue I raised for this topic:
https://github.com/openshift/origin/issues/20234


On 05/07/18 15:57, Ben Parees wrote:
I forwarded your problem on to our storage team lead, he had the 
following suggestions:


"I believe they will want to fiddle with the fsGroup or 
supplementalGroup so that it matches the GID of the cassandra user and 
make sure those GIDs are in the SCC ranges for the pod."


He also recommended you consider opening a bugzilla as it's easier to 
track these issues that way.





On Thu, Jul 5, 2018 at 7:42 AM, Tim Dudgeon <mailto:tdudgeon...@gmail.com>> wrote:


I hit this problem again, this time with the cassandra pod for
Hawkular metrics.

This has been running without problem for some months, but now I
found that the cassandra pod could not start because of file
permissions writing to the /cassandra_data/data directory.

Looking at that directory the ownership was set to
14.65534, but cassandra was running as user 313 so could
not write to that directory. Manually changing permissions to
313.65534 (the 65534 group is nfsnobody, and the cassandra user is
a member of that group) fixed the problem and allowed the
cassandra pod to start.

Clearly the 14 user is an openshift assigned user, but as
the container is running as the cassandra user (313) I have no
idea how this could have happened.

Can anyone explain what is going on here?

Tim



On 02/07/18 16:27, Tim Dudgeon wrote:

I've hit a strange problem with directory ownership for the
docker registry a couple of times, and don't understand what
is causing this.

The registry was working fine for some time. I'm using a
Cinder volume for the registry storage, but don't know if
that's relevant.
Then something happened that stopped pods pushing to the
registry, with the problem being that the registry pod was
getting "Permission denied" errors when it was trying to
create directories under
/registry/docker/registry/v2/repositories.

Looking at the file system the directories were all owned by
10.10 which explains why the registry process
(running as user 1001) could not write to these directories. e.g.

sh-4.2$ cd /registry/docker/registry/v2/
sh-4.2$ ls -al
total 0
drwxrwsr-x.  4 10 10  39 Apr 20 15:51 .
drwxrwsr-x.  3 10 10  16 Apr 20 15:51 ..
drwxrwsr-x.  3 10 10  20 Apr 20 15:51 blobs
drwxrwsr-x. 15 10 10 215 May 29 14:14 repositories

Doing a `docker exec -u 0 ...` into the registry container on the infra node
and then a `chown -R 1001.0 /registry/docker/registry` to
reset the permissions fixed the problem.

Anyone any idea what's going on here?

Tim


___
users mailing list
users@lists.openshift.redhat.com
<mailto:users@lists.openshift.redhat.com>
http://lists.openshift.redhat.com/openshiftmm/listinfo/users
<http://lists.openshift.redhat.com/openshiftmm/listinfo/users>




--
Ben Parees | OpenShift



___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: file permissions changed in docker registry

2018-07-05 Thread Tim Dudgeon

OK, I'll create an issue for this.

Though my comment is that both of the systems involved (docker registry 
and Hawkular metrics) are core parts of openshift so I would hope that 
no "fiddling" would be needed.



On 05/07/18 15:57, Ben Parees wrote:
I forwarded your problem on to our storage team lead, he had the 
following suggestions:


"I believe they will want to fiddle with the fsGroup or 
supplementalGroup so that it matches the GID of the cassandra user and 
make sure those GIDs are in the SCC ranges for the pod."


He also recommended you consider opening a bugzilla as it's easier to 
track these issues that way.





On Thu, Jul 5, 2018 at 7:42 AM, Tim Dudgeon <mailto:tdudgeon...@gmail.com>> wrote:


I hit this problem again, this time with the cassandra pod for
Hawkular metrics.

This has been running without problem for some months, but now I
found that the cassandra pod could not start because of file
permissions writing to the /cassandra_data/data directory.

Looking at that directory the ownership was set to
14.65534, but cassandra was running as user 313 so could
not write to that directory. Manually changing permissions to
313.65534 (the 65534 group is nfsnobody, and the cassandra user is
a member of that group) fixed the problem and allowed the
cassandra pod to start.

Clearly the 14 user is an openshift assigned user, but as
the container is running as the cassandra user (313) I have no
idea how this could have happened.

Can anyone explain what is going on here?

Tim



On 02/07/18 16:27, Tim Dudgeon wrote:

I've hit a strange problem with directory ownership for the
docker registry a couple of times, and don't understand what
is causing this.

The registry was working fine for some time. I'm using a
Cinder volume for the registry storage, but don't know if
that's relevant.
Then something happened that stopped pods pushing to the
registry, with the problem being that the registry pod was
getting "Permission denied" errors when it was trying to
create directories under
/registry/docker/registry/v2/repositories.

Looking at the file system the directories were all owned by
10.10 which explains why the registry process
(running as user 1001) could not write to these directories. e.g.

sh-4.2$ cd /registry/docker/registry/v2/
sh-4.2$ ls -al
total 0
drwxrwsr-x.  4 10 10  39 Apr 20 15:51 .
drwxrwsr-x.  3 10 10  16 Apr 20 15:51 ..
drwxrwsr-x.  3 10 10  20 Apr 20 15:51 blobs
drwxrwsr-x. 15 10 10 215 May 29 14:14 repositories

Doing a `docker exec -u 0 ...` into the registry container on the infra node
and then a `chown -R 1001.0 /registry/docker/registry` to
reset the permissions fixed the problem.

Anyone any idea what's going on here?

Tim


___
users mailing list
users@lists.openshift.redhat.com
<mailto:users@lists.openshift.redhat.com>
http://lists.openshift.redhat.com/openshiftmm/listinfo/users
<http://lists.openshift.redhat.com/openshiftmm/listinfo/users>




--
Ben Parees | OpenShift



___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: file permissions changed in docker registry

2018-07-05 Thread Tim Dudgeon
I hit this problem again, this time with the cassandra pod for Hawkular 
metrics.


This has been running without problem for some months, but now I found 
that the cassandra pod could not start because of file permissions 
writing to the /cassandra_data/data directory.


Looking at that directory the ownership was set to 14.65534, but 
cassandra was running as user 313 so could not write to that directory. 
Manually changing permissions to 313.65534 (the 65534 group is 
nfsnobody, and the cassandra user is a member of that group) fixed the 
problem and allowed the cassandra pod to start.


Clearly the 14 user is an openshift assigned user, but as the 
container is running as the cassandra user (313) I have no idea how this 
could have happened.


Can anyone explain what is going on here?

Tim


On 02/07/18 16:27, Tim Dudgeon wrote:
I've hit a strange problem with directory ownership for the docker 
registry a couple of times, and don't understand what is causing this.


The registry was working fine for some time. I'm using a Cinder volume 
for the registry storage, but don't know if that's relevant.
Then something happened that stopped pods pushing to the registry, 
with the problem being that the registry pod was getting "Permission 
denied" errors when it was trying to create directories under 
/registry/docker/registry/v2/repositories.


Looking at the file system the directories were all owned by 
10.10 which explains why the registry process (running 
as user 1001) could not write to these directories. e.g.


sh-4.2$ cd /registry/docker/registry/v2/
sh-4.2$ ls -al
total 0
drwxrwsr-x.  4 10 10  39 Apr 20 15:51 .
drwxrwsr-x.  3 10 10  16 Apr 20 15:51 ..
drwxrwsr-x.  3 10 10  20 Apr 20 15:51 blobs
drwxrwsr-x. 15 10 10 215 May 29 14:14 repositories

Doing a `docker exec -u 0 ...` into the registry container on the infra node and then a 
`chown -R 1001.0 /registry/docker/registry` to reset the permissions 
fixed the problem.


Anyone any idea what's going on here?

Tim



___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


scheduler policy to spread pods

2018-07-04 Thread Tim Dudgeon
I've got a process that fires up a number of pods (bare pods, not backed 
by replication controller) to execute a computationally demanding job in 
parallel.
What I find is that the pods do not spread effectively across the 
available nodes. In my case I have a node selector that restricts 
execution to 3 nodes, and the pods run mostly on the first node, a few 
run on the second node, and none run on the third node.


I know that I could specify cpu resource requests and limits to help 
with this, but for other reasons I'm currently unable to do this.


It looks like this is controllable through the scheduler, but the 
options for controlling this look pretty complex.
Could someone advise on how best to allow pods to spread evenly across 
nodes rather than execute preferentially on one node?


___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


file permissions changed in docker registry

2018-07-02 Thread Tim Dudgeon
I've hit a strange problem with directory ownership for the docker 
registry a couple of times, and don't understand what is causing this.


The registry was working fine for some time. I'm using a Cinder volume 
for the registry storage, but don't know if that's relevant.
Then something happened that stopped pods pushing to the registry, with 
the problem being that the registry pod was getting "Permission denied" 
errors when it was trying to create directories under 
/registry/docker/registry/v2/repositories.


Looking at the file system the directories were all owned by 
10.10 which explains why the registry process (running 
as user 1001) could not write to these directories. e.g.


sh-4.2$ cd /registry/docker/registry/v2/
sh-4.2$ ls -al
total 0
drwxrwsr-x.  4 10 10  39 Apr 20 15:51 .
drwxrwsr-x.  3 10 10  16 Apr 20 15:51 ..
drwxrwsr-x.  3 10 10  20 Apr 20 15:51 blobs
drwxrwsr-x. 15 10 10 215 May 29 14:14 repositories

Doing a `docker exec -u 0 ...` into the registry container on the infra node and then a 
`chown -R 1001.0 /registry/docker/registry` to reset the permissions 
fixed the problem.


Anyone any idea what's going on here?

Tim

___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: load balancing for infra node in HA setup

2018-06-11 Thread Tim Dudgeon

Joel,

Thanks for those answers. Yes I am running on OpenStack, but I think 
this is really completely general. I still have these questions:


Is there just one load balancer handling the master and the infra 
functions (registry and router)?

The ansible playbooks will set this up if using an internal load balancer?
If using multiple registries do they all share the same storage (in 
which case can't use Cinder, EBS etc).


Thanks
Tim


On 09/06/18 00:52, Joel Pearson wrote:

Hi Tim,

Answers inline.

On 8 June 2018 at 23:00, Tim Dudgeon <mailto:tdudgeon...@gmail.com>> wrote:


The docs for installing a high availability openshift cluster e.g.
[1] are fairly clear when it comes to the master node. If you set
up a 3 masters then you need a load balancer that sits in front of
these. OpenShift can provide this or you can provide your own
external one.

What not so clear is how to handle the nodes where the
infrastructure components (registry and router) get deployed. In a
typical example you would have 2 of these nodes, but what would
happen in this case?

I presume you are still openstack? Here is the OpenStack reference 
architecture for Openshift: 
https://access.redhat.com/documentation/en-us/reference_architectures/2018/html/deploying_and_managing_openshift_3.9_on_red_hat_openstack_platform_10/reference_architecture_summary


Normally you have 3 infra nodes with 3 router replicas with 1 load 
balancer in front.


Does a single registry and router get deployed to one of those
nodes (in which case it would be difficult to set up DNS for the
router to point to the right one).

You simply point the DNS at the load balancer in front of the infra 
nodes.  In the AWS reference architecture I run 3 registries, but 
they're backed by S3, so it depends on the backing store for the 
registry I guess.
But it doesn't matter if you run 1 registry or 3, as long as the 
traffic comes in via the load balancer, the OpenShift Routers will 
figure out where the registries are running.


Or does the router get deployed to both so a load balancer is
needed in front of these?

Yes, routers should be deployed on all infra nodes with a load 
balancer in front.



And similarly for the registry. Is there one or two of these
deployed? How does this work?

As mentioned above, it doesn't matter how many registries, but for ha 
you could have as many as the number of infra nodes, provided the 
backend for your registry allows multiple replicas.
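
In inventory terms that usually ends up looking something like this (the 
hostnames are examples, and each needs to resolve to the relevant load 
balancer):

openshift_master_cluster_hostname=master-internal.example.com
openshift_master_cluster_public_hostname=master.example.com
openshift_master_default_subdomain=apps.example.com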



I hope someone can clarify this.
Tim

[1]

https://docs.openshift.org/latest/install_config/install/advanced_install.html#multiple-masters

<https://docs.openshift.org/latest/install_config/install/advanced_install.html#multiple-masters>

___
users mailing list
users@lists.openshift.redhat.com
<mailto:users@lists.openshift.redhat.com>
http://lists.openshift.redhat.com/openshiftmm/listinfo/users
<http://lists.openshift.redhat.com/openshiftmm/listinfo/users>




--
Kind Regards,

Joel Pearson
Agile Digital | Senior Software Consultant

Love Your Software™ | ABN 98 106 361 273
p: 1300 858 277  |  w: agiledigital.com.au 
<http://agiledigital.com.au/>


___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


load balancing for infra node in HA setup

2018-06-08 Thread Tim Dudgeon
The docs for installing a high availability openshift cluster e.g. [1] 
are fairly clear when it comes to the master node. If you set up a 3 
masters then you need a load balancer that sits in front of these. 
OpenShift can provide this or you can provide your own external one.


What not so clear is how to handle the nodes where the infrastructure 
components (registry and router) get deployed. In a typical example you 
would have 2 of these nodes, but what would happen in this case?


Does a single registry and router get deployed to one of those nodes (in 
which case it would be difficult to set up DNS for the router to point 
to the right one).


Or does the router get deployed to both so a load balancer is needed in 
front of these?


And similarly for the registry. Is there one or two of these deployed? 
How does this work?


I hope someone can clarify this.
Tim

[1] 
https://docs.openshift.org/latest/install_config/install/advanced_install.html#multiple-masters


___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: OpenShift console doesn't come up on EC2 after its IP changed

2018-05-29 Thread Tim Dudgeon
Use an elastic IP address for your server. That way the address won't 
change after a restart.

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/elastic-ip-addresses-eip.html
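
With the AWS CLI that is roughly (the IDs are placeholders):

aws ec2 allocate-address --domain vpc
aws ec2 associate-address --instance-id i-0123456789abcdef0 --allocation-id eipalloc-0123456789abcdef0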

On 29/05/18 20:25, Iftikharuddin Khan wrote:


We have openshift running on an EC2 in AWS. We have the following 
commands to bring the cluster up


metadata_endpoint="http://169.254.169.254/latest/meta-data"

public_hostname="$( curl "${metadata_endpoint}/public-hostname" )"

public_ip="$( curl "${metadata_endpoint}/public-ipv4" )"

oc cluster up --public-hostname="${public_hostname}" --routing-suffix="${public_ip}.nip.io" --host-data-dir="/home/centos/oc_dir"


It works fine as long as the EC2 is up. Once we stop the EC2 instance 
and start it, the IP address of EC2 instance changes and because of 
that somehow openshift is getting impacted. We can see the docker 
containers running, but we cannot access the console. What should we 
do to resolve it?




___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: CLI login when using social authentication provider

2018-05-29 Thread Tim Dudgeon
OK, it didn't quite work that way for me, as using `oc login ... -u 
username` just gave me a password prompt.
Instead, if I logged in to the console using GitHub I could then go 
to the menu in the top right corner and choose the 'Copy login command' 
option, and this copied the entire `oc login https://your.server 
--token=**` command to the clipboard.
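
i.e. you end up running something like:

oc login https://your.server --token=<token copied from the console>
oc whoami    # should now report your github-backed username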



On 29/05/18 14:59, Jordan Liggitt wrote:
If you attempt to log in from the command line (`oc login 
https://your.server`), you get prompted to obtain a token via a web 
login, and are given a command to take back to the CLI to log in.


On Tue, May 29, 2018 at 9:53 AM, Tim Dudgeon <mailto:tdudgeon...@gmail.com>> wrote:


We've set up our OpenShift environment to use GitHub as an
authentication provider as described in [1].
Logging in through the web console works perfectly.
What's not clear is how to log in using the CLI. Using:

oc login https://your.server -u githubusername

is clearly not going to work. Presumably you need to get some form
of token from github and then specify that.
How does one go about doing this?

[1]

https://docs.openshift.org/latest/install_config/configuring_authentication.html#GitHub

<https://docs.openshift.org/latest/install_config/configuring_authentication.html#GitHub>


___
users mailing list
users@lists.openshift.redhat.com
<mailto:users@lists.openshift.redhat.com>
http://lists.openshift.redhat.com/openshiftmm/listinfo/users
<http://lists.openshift.redhat.com/openshiftmm/listinfo/users>




___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


CLI login when using social authentication provider

2018-05-29 Thread Tim Dudgeon
We've set up our OpenShift environment to use GitHub as an 
authentication provider as described in [1].

Logging in through the web console works perfectly.
What's not clear is how to log in using the CLI. Using:

oc login https://your.server -u githubusername

is clearly not going to work. Presumably you need to get some form of 
token from github and then specify that.

How does one go about doing this?

[1] 
https://docs.openshift.org/latest/install_config/configuring_authentication.html#GitHub



___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: hawkular-cassandra failed to startup on openshift origin 3.9

2018-05-25 Thread Tim Dudgeon

I don't see why that shouldn't work as it's using an ephemeral volume.
When using NFS I did find that if I tried to redeploy metrics using a 
volume that had already been deployed to, I hit permission problems 
that were solved by wiping the data from the NFS mount.
But I can't see how that could apply to an ephemeral volume. That's 
always worked fine for me.



On 25/05/18 11:29, Yu Wei wrote:

configuration as below,

openshift_metrics_install_metrics=true
openshift_metrics_image_version=v3.9
openshift_master_default_subdomain=paas-dev.dataos.io
#openshift_hosted_logging_deploy=true
openshift_logging_install_logging=true
openshift_logging_image_version=v3.9
openshift_disable_check=disk_availability,docker_image_availability,docker_storage
osm_etcd_image=registry.access.redhat.com/rhel7/etcd

openshift_enable_service_catalog=true
openshift_service_catalog_image_prefix=openshift/origin-
openshift_service_catalog_image_version=v3.9.0

*From:* users-boun...@lists.openshift.redhat.com 
<users-boun...@lists.openshift.redhat.com> on behalf of Tim Dudgeon 
<tdudgeon...@gmail.com>

*Sent:* Friday, May 25, 2018 6:21 PM
*To:* users@lists.openshift.redhat.com
*Subject:* Re: hawkular-cassandra failed to startup on openshift 
origin 3.9


So what was the configuration for metrics in the inventory file?


On 25/05/18 11:04, Yu Wei wrote:

Yes, I deployed that via ansible-playbooks.

*From:* users-boun...@lists.openshift.redhat.com 
<mailto:users-boun...@lists.openshift.redhat.com> 
<users-boun...@lists.openshift.redhat.com> 
<mailto:users-boun...@lists.openshift.redhat.com> on behalf of Tim 
Dudgeon <tdudgeon...@gmail.com> <mailto:tdudgeon...@gmail.com>

*Sent:* Friday, May 25, 2018 5:51 PM
*To:* users@lists.openshift.redhat.com 
<mailto:users@lists.openshift.redhat.com>
*Subject:* Re: hawkular-cassandra failed to startup on openshift 
origin 3.9


How are you deploying this? Using the ansible playbooks?


On 25/05/18 10:25, Yu Wei wrote:

Hi,
I tried to deploy hawkular-cassandra on openshift origin 3.9 cluster.
However, pod failed to start up with error as below,
WARN [main] 2018-05-25 09:17:43,277 StartupChecks.java:267 - 
Directory /cassandra_data/data doesn't exist

ERROR [main] 2018-05-25 09:17:43,279 CassandraDaemon.java:710 - Has 
no permission to create directory /cassandra_data/data


I tried emptyDir and persistent volume as cassandra-data, both failed.

Any advice for this issue?

Thanks,

Jared, (韦煜)
Software developer
Interested in open source software, big data, Linux



___
users mailing list
users@lists.openshift.redhat.com 
<mailto:users@lists.openshift.redhat.com>

http://lists.openshift.redhat.com/openshiftmm/listinfo/users






___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: hawkular-cassandra failed to startup on openshift origin 3.9

2018-05-25 Thread Tim Dudgeon

So what was the configuration for metrics in the inventory file?


On 25/05/18 11:04, Yu Wei wrote:

Yes, I deployed that via ansible-playbooks.

*From:* users-boun...@lists.openshift.redhat.com 
<users-boun...@lists.openshift.redhat.com> on behalf of Tim Dudgeon 
<tdudgeon...@gmail.com>

*Sent:* Friday, May 25, 2018 5:51 PM
*To:* users@lists.openshift.redhat.com
*Subject:* Re: hawkular-cassandra failed to startup on openshift 
origin 3.9


How are you deploying this? Using the ansible playbooks?


On 25/05/18 10:25, Yu Wei wrote:

Hi,
I tried to deploy hawkular-cassandra on openshift origin 3.9 cluster.
However, pod failed to start up with error as below,
WARN [main] 2018-05-25 09:17:43,277 StartupChecks.java:267 - 
Directory /cassandra_data/data doesn't exist

ERROR [main] 2018-05-25 09:17:43,279 CassandraDaemon.java:710 - Has 
no permission to create directory /cassandra_data/data


I tried emptyDir and persistent volume as cassandra-data, both failed.

Any advice for this issue?

Thanks,

Jared, (韦煜)
Software developer
Interested in open source software, big data, Linux



___
users mailing list
users@lists.openshift.redhat.com 
<mailto:users@lists.openshift.redhat.com>

http://lists.openshift.redhat.com/openshiftmm/listinfo/users




___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: hawkular-cassandra failed to startup on openshift origin 3.9

2018-05-25 Thread Tim Dudgeon

How are you deploying this? Using the ansible playbooks?


On 25/05/18 10:25, Yu Wei wrote:

Hi,
I tried to deploy hawkular-cassandra on openshift origin 3.9 cluster.
However, pod failed to start up with error as below,
WARN [main] 2018-05-25 09:17:43,277 StartupChecks.java:267 - 
Directory /cassandra_data/data doesn't exist

ERROR [main] 2018-05-25 09:17:43,279 CassandraDaemon.java:710 - Has 
no permission to create directory /cassandra_data/data


I tried emptyDir and persistent volume as cassandra-data, both failed.

Any advice for this issue?

Thanks,

Jared, (韦煜)
Software developer
Interested in open source software, big data, Linux



___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: running a docker image in openshift

2018-05-24 Thread Tim Dudgeon
Tomcat images usually run as a specified user, whilst OpenShift by 
default assigns an arbitrary user ID to the container, and that arbitrary 
user probably does not have permissions to read the server.xml file.


You need to 'relax' the settings on the SCC to allow the container to 
run as the user specified in the dockerfile.
e.g. 
https://docs.openshift.org/latest/admin_guide/manage_scc.html#enable-images-to-run-with-user-in-the-dockerfile

(though that's only one way to do this).
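
e.g. the blunt version of that is something like the following (service 
account and project are placeholders, and anyuid is quite permissive, so use 
with care):

oc adm policy add-scc-to-user anyuid -z default -n your-project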

Tim


On 24/05/18 16:28, Brian Keyes wrote:

I am attempting to run a docker created image in openshift

I have created a docker image for apache tomcat, and can launch a 
container in docker with it and it will run fine and continue to run



but when i push that up to docker hub and try to pull it down to the 
openshift console by going to "add to project" and then "deploy image" 
it seems to build it and run it fine but I get this error and the 
container crashes


May 24, 2018 3:12:18 PM org.apache.catalina.startup.Catalina load
WARNING: Unable to load server configuration from 
[/usr/local/tomcat/conf/server.xml]

May 24, 2018 3:12:18 PM org.apache.catalina.startup.Catalina start
SEVERE: Cannot start server. Server instance is not configured.



any ideas 

thanks




___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: suggestion for a long running python container for a demo

2018-05-22 Thread Tim Dudgeon

Why python? Why not bash (or even sh)?
Once the pod/container is running you can rsh to it and do whatever you 
want.

Even execute python.

So, more basically: how best to start a minimal centos|debian|whatever 
container so that it stays running, so that you can rsh to it and then be 
inside the pod to debug things.
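
e.g. something along these lines (image and name are just examples):

oc run debug --image=centos:7 --restart=Never --command -- sleep infinity
oc rsh debug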



On 22/05/18 19:01, Brian Keyes wrote:
I am looking for a long running non exiting example for python , maybe 
to ping a pubilc ip or something , just some thing to keep the 
container/POD alive


thanks 

--
Brian Keyes
Systems Engineer, Vizuri
703-855-9074(Mobile)
703-464-7030 x8239 (Office)

FOR OFFICIAL USE ONLY: This email and any attachments may contain 
information that is privacy and business sensitive. Inappropriate or 
unauthorized disclosure of business and privacy sensitive information 
may result in civil and/or criminal penalties as detailed in as 
amended Privacy Act of 1974 and DoD 5400.11-R.




___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: RPMs for 3.9 on Centos

2018-05-21 Thread Tim Dudgeon
OK, so doing this on the nodes before running the ansible installer seems 
to do the trick:


yum -y install centos-release-openshift-origin
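
and then a quick sanity check that the PaaS SIG repo is active and the 3.9 
packages are visible, roughly:

yum repolist enabled | grep -i paas
yum --showduplicates list origin | grep 3.9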


On 21/05/18 11:46, Joel Pearson wrote:
You shouldn’t need testing. It looks like they’ve been in the repo for 
about a month.


Not sure about the ansible side I haven’t actually tried to install 
3.9 yet. And when I do I plan on using system containers.


But you could grep through the ansible scripts looking for what 
installs to repo so you can figure out why it isn’t using it.
On Mon, 21 May 2018 at 8:38 pm, Tim Dudgeon <tdudgeon...@gmail.com 
<mailto:tdudgeon...@gmail.com>> wrote:


Seems like Ansible isn't doing so for me.
Are there any special params needed for this?

I did try setting these two, but to no effect:

openshift_enable_origin_repo=true
openshift_repos_enable_testing=true


On 21/05/18 11:32, Joel Pearson wrote:

They’re in the paas repo. You don’t have that repo installed for
some reason.

Ansible is supposed to lay that down

http://mirror.centos.org/centos/7/paas/x86_64/openshift-origin/

Why don’t you use the system container version instead? Or you
prefer rpms?
On Mon, 21 May 2018 at 8:30 pm, Tim Dudgeon
<tdudgeon...@gmail.com <mailto:tdudgeon...@gmail.com>> wrote:

It looks like RPMs for Origin 3.9 are still not available from
the Centos
repos:

> $ yum search origin
> Loaded plugins: fastestmirror
> Loading mirror speeds from cached hostfile
>  * base: ftp.lysator.liu.se <http://ftp.lysator.liu.se>
>  * extras: ftp.lysator.liu.se <http://ftp.lysator.liu.se>
>  * updates: ftp.lysator.liu.se <http://ftp.lysator.liu.se>
>



> N/S matched: origin
>

=
> centos-release-openshift-origin13.noarch : Yum
configuration for
> OpenShift Origin 1.3 packages
> centos-release-openshift-origin14.noarch : Yum
configuration for
> OpenShift Origin 1.4 packages
> centos-release-openshift-origin15.noarch : Yum
configuration for
> OpenShift Origin 1.5 packages
> centos-release-openshift-origin36.noarch : Yum
configuration for
> OpenShift Origin 3.6 packages
> centos-release-openshift-origin37.noarch : Yum
configuration for
> OpenShift Origin 3.7 packages
> google-noto-sans-canadian-aboriginal-fonts.noarch : Sans
Canadian
> Aboriginal font
> centos-release-openshift-origin.noarch : Common release
file to
> establish shared metadata for CentOS PaaS SIG
> ksh.x86_64 : The Original ATT Korn Shell
> texlive-tetex.noarch : scripts and files originally written
for or
> included in teTeX
>
>   Name and summary matches only, use "search all" for
everything.
Any idea when these will be available, or instructions for
finding them
somewhere else?





___
users mailing list
users@lists.openshift.redhat.com
<mailto:users@lists.openshift.redhat.com>
http://lists.openshift.redhat.com/openshiftmm/listinfo/users





___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: Logging fails when using cinder volume for elasticsearch

2018-05-21 Thread Tim Dudgeon

On 21/05/18 13:30, Jeff Cantrill wrote:
Consider logging and issue so that it is properly addressed by the 
development team.



https://github.com/openshift/openshift-ansible/issues/8456
___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Logging fails when using cinder volume for elasticsearch

2018-05-21 Thread Tim Dudgeon
I'm seeing a strange problem with trying to use a Cinder volume for the 
elasticsearch PVC when installing logging with Origin 3.7. If I use NFS 
or GlusterFS volumes it all works fine. If I try a Cinder volume elasticsearch 
fails to start because of permissions problems:



[2018-05-21 11:03:48,483][INFO ][container.run    ] Begin 
Elasticsearch startup script
[2018-05-21 11:03:48,500][INFO ][container.run    ] Comparing 
the specified RAM to the maximum recommended for Elasticsearch...
[2018-05-21 11:03:48,503][INFO ][container.run    ] Inspecting 
the maximum RAM available...
[2018-05-21 11:03:48,513][INFO ][container.run    ] 
ES_HEAP_SIZE: '4096m'
[2018-05-21 11:03:48,527][INFO ][container.run    ] Setting heap 
dump location /elasticsearch/persistent/heapdump.hprof
[2018-05-21 11:03:48,531][INFO ][container.run    ] Checking if 
Elasticsearch is ready on https://localhost:9200
Exception in thread "main" java.lang.IllegalStateException: Failed to 
created node environment
Likely root cause: java.nio.file.AccessDeniedException: 
/elasticsearch/persistent/logging-es
    at 
sun.nio.fs.UnixException.translateToIOException(UnixException.java:84)
    at 
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    at 
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
    at 
sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384)

    at java.nio.file.Files.createDirectory(Files.java:674)
    at java.nio.file.Files.createAndCheckIsDirectory(Files.java:781)
    at java.nio.file.Files.createDirectories(Files.java:767)
    at 
org.elasticsearch.env.NodeEnvironment.(NodeEnvironment.java:169)

    at org.elasticsearch.node.Node.(Node.java:165)
    at org.elasticsearch.node.Node.(Node.java:140)
    at org.elasticsearch.node.NodeBuilder.build(NodeBuilder.java:143)
    at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:194)
    at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:286)
    at 
org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:45)

Refer to the log for complete error details.

The directory ownerships do look very strange. Using Gluster (where it 
works) you see this (/elasticsearch/persistent is where the volume is 
mounted):


sh-4.2$ cd /elasticsearch/persistent
sh-4.2$ ls -al
total 8
drwxrwsr-x. 4 root 2009 4096 May 21 07:17 .
drwxrwxrwx. 4 root root   42 May 21 07:17 ..
drwxr-sr-x. 3 1000 2009 4096 May 21 07:17 logging-es

User 1000 and group 2009 do not exist in /etc/passwd or /etc/group



___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: RPMs for 3.9 on Centos

2018-05-21 Thread Tim Dudgeon

Seems like Ansible isn't doing so for me.
Are there any special params needed for this?

I did try setting these two, but to no effect:

openshift_enable_origin_repo=true
openshift_repos_enable_testing=true


On 21/05/18 11:32, Joel Pearson wrote:
They’re in the paas repo. You don’t have that repo installed for some 
reason.


Ansible is supposed to lay that down

http://mirror.centos.org/centos/7/paas/x86_64/openshift-origin/

Why don’t you use the system container version instead? Or you prefer 
rpms?
On Mon, 21 May 2018 at 8:30 pm, Tim Dudgeon <tdudgeon...@gmail.com 
<mailto:tdudgeon...@gmail.com>> wrote:


It looks like RPMs for Origin 3.9 are still not available from the
Centos
repos:

> $ yum search origin
> Loaded plugins: fastestmirror
> Loading mirror speeds from cached hostfile
>  * base: ftp.lysator.liu.se <http://ftp.lysator.liu.se>
>  * extras: ftp.lysator.liu.se <http://ftp.lysator.liu.se>
>  * updates: ftp.lysator.liu.se <http://ftp.lysator.liu.se>
>



> N/S matched: origin
>

=
> centos-release-openshift-origin13.noarch : Yum configuration for
> OpenShift Origin 1.3 packages
> centos-release-openshift-origin14.noarch : Yum configuration for
> OpenShift Origin 1.4 packages
> centos-release-openshift-origin15.noarch : Yum configuration for
> OpenShift Origin 1.5 packages
> centos-release-openshift-origin36.noarch : Yum configuration for
> OpenShift Origin 3.6 packages
> centos-release-openshift-origin37.noarch : Yum configuration for
> OpenShift Origin 3.7 packages
> google-noto-sans-canadian-aboriginal-fonts.noarch : Sans Canadian
> Aboriginal font
> centos-release-openshift-origin.noarch : Common release file to
> establish shared metadata for CentOS PaaS SIG
> ksh.x86_64 : The Original ATT Korn Shell
> texlive-tetex.noarch : scripts and files originally written for or
> included in teTeX
>
>   Name and summary matches only, use "search all" for everything.
Any idea when these will be available, or instructions for finding
them
somewhere else?





___
users mailing list
users@lists.openshift.redhat.com
<mailto:users@lists.openshift.redhat.com>
http://lists.openshift.redhat.com/openshiftmm/listinfo/users



___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


RPMs for 3.9 on Centos

2018-05-21 Thread Tim Dudgeon
It looks like RPMs for Origin 3.9 are still not available from the Centos 
repos:



$ yum search origin
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
 * base: ftp.lysator.liu.se
 * extras: ftp.lysator.liu.se
 * updates: ftp.lysator.liu.se
 
N/S matched: origin 
=
centos-release-openshift-origin13.noarch : Yum configuration for 
OpenShift Origin 1.3 packages
centos-release-openshift-origin14.noarch : Yum configuration for 
OpenShift Origin 1.4 packages
centos-release-openshift-origin15.noarch : Yum configuration for 
OpenShift Origin 1.5 packages
centos-release-openshift-origin36.noarch : Yum configuration for 
OpenShift Origin 3.6 packages
centos-release-openshift-origin37.noarch : Yum configuration for 
OpenShift Origin 3.7 packages
google-noto-sans-canadian-aboriginal-fonts.noarch : Sans Canadian 
Aboriginal font
centos-release-openshift-origin.noarch : Common release file to 
establish shared metadata for CentOS PaaS SIG

ksh.x86_64 : The Original ATT Korn Shell
texlive-tetex.noarch : scripts and files originally written for or 
included in teTeX


  Name and summary matches only, use "search all" for everything.
Any idea when these will be available, or instructions for finding them 
somewhere else?






___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: Prometheus node exporter on v3.7

2018-05-06 Thread Tim Dudgeon

No, really not!
Not even been able to install 3.9 yet.
Need to stick with tried and trusted.


On 03/05/18 22:14, Joel Pearson wrote:

Upgrade your cluster to 3.9 just to be safe? You know you want too ... ;)
On Fri, 4 May 2018 at 6:00 am, Tim Dudgeon <tdudgeon...@gmail.com 
<mailto:tdudgeon...@gmail.com>> wrote:


Any Prometheus experts out there that can comment on this?


On 30/04/18 15:19, Tim Dudgeon wrote:
> I'm running Prometheus on an Origin cluster using v3.7.2 installed
from
> the playbooks on the release-3.7 branch of
openshift/openshift-ansible.
>
> It looks like the node exporter was not included in this version
[1]
> but was added for the 3.9 version [2].
> As it's metrics on the nodes that I'm wanting most I wonder what
the
> best approach is here.
>
> Is it safe to run the `playbooks/openshift-prometheus/config.yml`
> playbook from the release-3.9 branch on a cluster running
v3.7.2, or
> is there a better approach?
>
> [1] (v3.7)
>

https://github.com/openshift/openshift-ansible/tree/release-3.7/roles/openshift_prometheus/tasks
> [2] (v3.9)
>

https://github.com/openshift/openshift-ansible/tree/release-3.9/roles/openshift_prometheus/tasks
>

___
users mailing list
users@lists.openshift.redhat.com
<mailto:users@lists.openshift.redhat.com>
http://lists.openshift.redhat.com/openshiftmm/listinfo/users



___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: Version settings for installing 3.9

2018-05-05 Thread Tim Dudgeon

Larry,

that's one of the combinations I tried.
It seems that the problem is that the repos don't contain v3.9?
This is what I see on a node that I'm trying to install to:


$ yum search origin
Loaded plugins: fastestmirror
Determining fastest mirrors
 * base: centos.mirror.far.fi
 * extras: centos.mirror.far.fi
 * updates: centos.mirror.far.fi
=== 
N/S matched: origin 
===
centos-release-openshift-origin13.noarch : Yum configuration for 
OpenShift Origin 1.3 packages
centos-release-openshift-origin14.noarch : Yum configuration for 
OpenShift Origin 1.4 packages
centos-release-openshift-origin15.noarch : Yum configuration for 
OpenShift Origin 1.5 packages
centos-release-openshift-origin36.noarch : Yum configuration for 
OpenShift Origin 3.6 packages
centos-release-openshift-origin37.noarch : Yum configuration for 
OpenShift Origin 3.7 packages
google-noto-sans-canadian-aboriginal-fonts.noarch : Sans Canadian 
Aboriginal font
centos-release-openshift-origin.noarch : Common release file to 
establish shared metadata for CentOS PaaS SIG

ksh.x86_64 : The Original ATT Korn Shell
texlive-tetex.noarch : scripts and files originally written for or 
included in teTeX


  Name and summary matches only, use "search all" for everything.

3.9 is not there.
I thought 3.9 was released several weeks ago?

I even tried adding these to the inventory file, but to no affect:

openshift_enable_origin_repo=true
openshift_repos_enable_testing=true

Tim


On 04/05/18 15:43, Brigman, Larry wrote:

That top variable (openshift_release=v3.9) should be enough if you have the 
repos enabled.  The others aren't required and cause the installer to not find 
things.
If you are running the openshift-ansible installer make sure you are on branch 
'release-3.9'
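
i.e. roughly:

git -C openshift-ansible checkout release-3.9
# and in the inventory just:
openshift_deployment_type=origin
openshift_release=v3.9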



From: users-boun...@lists.openshift.redhat.com 
[users-boun...@lists.openshift.redhat.com] on behalf of Tim Dudgeon 
[tdudgeon...@gmail.com]
Sent: Friday, May 04, 2018 3:26 AM
To: users@lists.openshift.redhat.com
Subject: Version settings for installing 3.9

What are the magical set of properties needed to run an ansible install
of Origin 3.9 on centos nodes?

I've tried various combinations around these but can't get anything to work:

openshift_deployment_type=origin
openshift_release=v3.9
openshift_image_tag=v3.9.0
openshift_pkg_version=-3.9.0

I'm continually getting:

1. Hosts:test39-master.openstacklocal
   Play: Determine openshift_version to configure on first master
   Task: openshift_version : fail
   Message:  Package 'origin-3.9*' not found

Surely if you are working from the release-3.9 branch of
openshift-ansible then you should not need to set any of these versions
- you should get the latest version of 3.9 images and rpms?

Tim

___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Version settings for installing 3.9

2018-05-04 Thread Tim Dudgeon
What are the magical set of properties needed to run an ansible install 
of Origin 3.9 on centos nodes?


I've tried various combinations around these but can't get anything to work:

openshift_deployment_type=origin
openshift_release=v3.9
openshift_image_tag=v3.9.0
openshift_pkg_version=-3.9.0

I'm continually getting:

  1. Hosts:    test39-master.openstacklocal
 Play: Determine openshift_version to configure on first master
 Task: openshift_version : fail
 Message:  Package 'origin-3.9*' not found

Surely if you are working from the release-3.9 branch of 
openshift-ansible then you should not need to set any of these versions 
- you should get the latest version of 3.9 images and rpms?


Tim

___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: Prometheus node exporter on v3.7

2018-05-03 Thread Tim Dudgeon

Any Prometheus experts out there that can comment on this?


On 30/04/18 15:19, Tim Dudgeon wrote:
I'm running Prometheus on an Origin cluster using v3.7.2 installed from 
the playbooks on the release-3.7 branch of openshift/openshift-ansible.


It looks like the node exporter was not included in this version [1] 
but was added for the 3.9 version [2].
As it's metrics on the nodes that I'm wanting most I wonder what the 
best approach is here.


Is it safe to run the `playbooks/openshift-prometheus/config.yml` 
playbook from the release-3.9 branch on a cluster running v3.7.2, or 
is there a better approach?


[1] (v3.7) 
https://github.com/openshift/openshift-ansible/tree/release-3.7/roles/openshift_prometheus/tasks
[2] (v3.9) 
https://github.com/openshift/openshift-ansible/tree/release-3.9/roles/openshift_prometheus/tasks




___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Prometheus node exporter on v3.7

2018-04-30 Thread Tim Dudgeon
I'm running Prometheus on an Origin cluster using v3.7.2 installed from the 
playbooks on the release-3.7 branch of openshift/openshift-ansible.


It looks like the node exporter was not included in this version [1] but 
was added for the 3.9 version [2].
As it's metrics on the nodes that I'm wanting most I wonder what the 
best approach is here.


Is it safe to run the `playbooks/openshift-prometheus/config.yml` 
playbook from the release-3.9 branch on a cluster running v3.7.2, or is 
there a better approach?


[1] (v3.7) 
https://github.com/openshift/openshift-ansible/tree/release-3.7/roles/openshift_prometheus/tasks
[2] (v3.9) 
https://github.com/openshift/openshift-ansible/tree/release-3.9/roles/openshift_prometheus/tasks


___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: Cleaning up not correct when using GlusterFS?

2018-04-22 Thread Tim Dudgeon

Thanks.
Upgrading from origin 3.7.1 to 3.7.2 fixes the problem.

Tim


On 20/04/18 22:46, Seth Jennings wrote:

Associated bz https://bugzilla.redhat.com/show_bug.cgi?id=1546156

On Fri, Apr 20, 2018 at 4:45 PM, Seth Jennings <sjenn...@redhat.com 
<mailto:sjenn...@redhat.com>> wrote:


Pretty sure this was fixed in this PR that went into 3.9.

https://github.com/openshift/origin/commit/0727d1d31fad4b4f66eff46fe750f966fab8c28b

<https://github.com/openshift/origin/commit/0727d1d31fad4b4f66eff46fe750f966fab8c28b>



On Fri, Apr 20, 2018 at 12:49 PM, Tim Dudgeon
<tdudgeon...@gmail.com <mailto:tdudgeon...@gmail.com>> wrote:

I believe I'm seeing a problem with using GlusterFS volumes
when you terminate a pod that is using a gluster backed PVC.
This is with Origin 3.7.1. I did this:

1. create new project
2. deployed a pod
3. added a volume to the pod using  a gluster backed PVC.
4. rsh to the pod and check the volume can be written to
5. delete the project

After stage 3 the volume was working OK in the pod and the
volume was reported by heketi.

After stage 5 the PVC was no longer present, glusterfs volume
was no longer seen by heketi (so far so good) but the pod was
stuck in the 'Terminating' state and the project did not get
deleted. It looks like the container that was running in the
pod had been deleted. Even after one hour it was still stuck
in the terminating state.

Looking deeper it looks like the mount on the host on which
the pod was running was still present. e.g. this was still
found in /etc/mtab:

10.0.0.15:vol_a8866bf3769c987aee5c919305b89529
/var/lib/origin/openshift.local.volumes/pods/51a4ef9e-44b4-11e8-b523-fa163ea80da9/volumes/kubernetes.io~glusterfs/pvc-28d4eb2e-44b4-11e8-b523-fa163ea80da9
fuse.glusterfs
rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072
0 0

Manually unmounting this mount resulted in the pod finally
terminating and (after a short delay) the project being deleted.

Looks like the cleanup processes are not quite correct?

Tim



___
users mailing list
users@lists.openshift.redhat.com
<mailto:users@lists.openshift.redhat.com>
http://lists.openshift.redhat.com/openshiftmm/listinfo/users
<http://lists.openshift.redhat.com/openshiftmm/listinfo/users>





___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Cleaning up not correct when using GlusterFS?

2018-04-20 Thread Tim Dudgeon
I believe I'm seeing a problem with using GlusterFS volumes when you 
terminate a pod that is using a gluster backed PVC. This is with Origin 
3.7.1. I did this:


1. create new project
2. deployed a pod
3. added a volume to the pod using  a gluster backed PVC.
4. rsh to the pod and check the volume can be written to
5. delete the project

After stage 3 the volume was working OK in the pod and the volume was 
reported by heketi.


After stage 5 the PVC was no longer present, glusterfs volume was no 
longer seen by heketi (so far so good) but the pod was stuck in the 
'Terminating' state and the project did not get deleted. It looks like 
the container that was running in the pod had been deleted. Even after 
one hour it was still stuck in the terminating state.


Looking deeper it looks like the mount on the host on which the pod was 
running was still present. e.g. this was still found in /etc/mtab:


10.0.0.15:vol_a8866bf3769c987aee5c919305b89529 
/var/lib/origin/openshift.local.volumes/pods/51a4ef9e-44b4-11e8-b523-fa163ea80da9/volumes/kubernetes.io~glusterfs/pvc-28d4eb2e-44b4-11e8-b523-fa163ea80da9 
fuse.glusterfs 
rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072 
0 0


Manually unmounting this mount resulted in the pod finally terminating 
and (after a short delay) the project being deleted.
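
For anyone hitting the same thing, a minimal sketch of the manual cleanup on the 
affected host (the pod UID and PVC name below are just the ones from this example; 
find your own with the grep):

grep glusterfs /etc/mtab
sudo umount /var/lib/origin/openshift.local.volumes/pods/51a4ef9e-44b4-11e8-b523-fa163ea80da9/volumes/kubernetes.io~glusterfs/pvc-28d4eb2e-44b4-11e8-b523-fa163ea80da9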


Looks like the cleanup processes are not quite correct?

Tim



___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: default StorageClass

2018-04-17 Thread Tim Dudgeon
Sorry, which StorageClass do those variables apply to? There could be 
multiple ones deployed.
For instance, this property obviously applies to the StorageClass 
created for GlusterFS:


openshift_storage_glusterfs_storageclass_default=True


On 17/04/18 17:11, Hemant Kumar wrote:

For making the storageclass not default

openshift_storageclass_default=False

You can also change default class name by

openshift_storageclass_name=something_else
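
For context, a sketch of how these could sit together in one inventory so that only 
the GlusterFS class ends up as the default (the class name 'standard' is just a 
placeholder; check the variable names against your openshift-ansible release):

# stop the cloud-provider (e.g. Cinder) class being the default
openshift_storageclass_default=False
openshift_storageclass_name=standard
# make the GlusterFS class the default instead
openshift_storage_glusterfs_storageclass_default=True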



___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


default StorageClass

2018-04-17 Thread Tim Dudgeon
When deploying glusterfs you can specify that this is to be the default 
StorageClass for dynamic provisioning using this variable


openshift_storage_glusterfs_storageclass_default=True

However if you also have another dynamic provisioner (e.g. OpenStack 
Cinder) then that is also declared as the default StorageClass and you 
end up with two defaults, which inevitably leads to trouble.


How can you specify that Cinder is not to be the default StorageClass?
Also, is it possible to specify the names for these StorageClasses?
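
If you do end up with two defaults, one way to demote one after the fact is the 
standard default-class annotation rather than anything installer-specific; a sketch, 
assuming the Cinder class is called 'standard' (on 3.7 the annotation may still be 
the beta form, storageclass.beta.kubernetes.io/is-default-class):

oc patch storageclass standard -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "false"}}}'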



___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: specifying storage class for metrics and logging

2018-04-17 Thread Tim Dudgeon
So if you are using dynamic provisioning the only option for logging is 
for the default StorageClass to be set to what is needed?



On 17/04/18 11:12, Per Carlson wrote:

This holds at least for 3.7:

For metrics you can use 
"openshift_metrics_cassanda_pvc_storage_class_name" 
(https://github.com/openshift/openshift-ansible/blob/release-3.7/roles/openshift_metrics/tasks/generate_cassandra_pvcs.yaml#L44).


Using a StorageClass for logging (ElasticSearch) is more confusing. 
The variable is 
"openshift_logging_elasticsearch_pvc_storage_class_name" 
(https://github.com/openshift/openshift-ansible/blob/release-3.7/roles/openshift_logging_elasticsearch/defaults/main.yml#L34). 
But, it is only used for non-dynamic PVCs 
(https://github.com/openshift/openshift-ansible/blob/release-3.7/roles/openshift_logging_elasticsearch/tasks/main.yaml#L368-L370).
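
Pulling those together, an inventory sketch using the variables referenced above (the 
class name 'glusterfs-storage' is only an example, and per the last link the logging 
variable only takes effect for non-dynamic PVCs):

openshift_metrics_cassandra_pvc_storage_class_name=glusterfs-storage
openshift_logging_elasticsearch_pvc_storage_class_name=glusterfs-storage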



--
Pelle

Research is what I'm doing when I don't know what I'm doing.
- Wernher von Braun


___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


specifying storage class for metrics and logging

2018-04-17 Thread Tim Dudgeon
If using dynamic provisioning for metrics and logging e.g. your 
inventory file contains:


openshift_metrics_cassandra_storage_type=dynamic

How does one go about specifying the StorageClass to use?
Without this the default storage class would be used which is not what 
you might want.


Tim



___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: GlusterFS failing to deploy

2018-04-16 Thread Tim Dudgeon

Rodrigo

I retried having replaced the node that failed and this time all 3 pods 
started correctly.


If this happens again (I suspect it will) I will report the outputs you 
mention.


Tim


On 16/04/18 14:06, Rodrigo Bersa wrote:

Hi Tim,

Looks like there's a problem accessing the Node, or the device 
(/dev/vdb) on this Node.


Can you share the output of 'oc logs' for the failing glusterfs POD and 
the heketi POD?
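
Something along these lines should capture both (the glusterfs pod name is the one 
from your output; substitute the actual heketi pod name):

oc logs -n glusterfs glusterfs-storage-gbzd8
oc logs -n glusterfs deploy-heketi-storage-1-5svjh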



Best regards,


Rodrigo Bersa

Cloud Consultant, RHCVA, RHCE

Red Hat Brasil <https://www.redhat.com>

rbe...@redhat.com <mailto:rbe...@redhat.com> M: +55-11-99557-5841 
<tel:+55-11-99557-5841>


<https://red.ht/sig>  
TRIED. TESTED. TRUSTED. <https://redhat.com/trusted>

Red Hat é reconhecida entre as melhores empresas para trabalhar no 
Brasil pelo *Great Place to Work*.


On Mon, Apr 16, 2018 at 8:07 AM, Tim Dudgeon <tdudgeon...@gmail.com 
<mailto:tdudgeon...@gmail.com>> wrote:


I'm having problems deploying GlusterFS to an Origin cluster.

I have 3 identical nodes for running glusterfs, but the deployment
seems to randomly fail on one of the nodes sometimes. This is a
typical error (with the json reformatted). Notice how node 001 and
003 work fine, but 002 fails.
All three nodes are equivalent in config.

TASK [openshift_storage_glusterfs : Load heketi topology]


Monday 16 April 2018  10:49:57 + (0:00:01.414) 0:44:22.372
**

{
  "changed": true,
  "cmd": [
    "oc",
"--config=/tmp/openshift-glusterfs-ansible-Eb85yA/admin.kubeconfig",
    "rsh",
    "--namespace=glusterfs",
    "deploy-heketi-storage-1-5svjh",
    "heketi-cli",
    "-s",
    "http://localhost:8080;,
    "--user",
    "admin",
    "--secret",
    "JsSOzmoF6nP6nfuJJ1RQigRQNkUiD88xl8FLfu+xhpk=",
    "topology",
    "load",
"--json=/tmp/openshift-glusterfs-ansible-Eb85yA/topology.json",
    "2>&1"
  ],
  "delta": "0:02:08.608619",
  "end": "2018-04-16 10:52:06.930155",
  "failed_when_result": true,
  "rc": 0,
  "start": "2018-04-16 10:49:58.321536",
  "stderr": "",
  "stderr_lines": [],
  "stdout": "Creating cluster ... ID:
69b19096f118186c5a09f9e78f9cb9aa\n\tAllowing file volumes on
cluster.\n\tAllowing block volumes on cluster.\n\tCreating node
orn-gluster-storage-001.openstacklocal ... ID:
ec9d615910d52bc5db9f4b18fdb714f3\n\t\tAdding device /dev/vdb ...
OK\n\tCreating node orn-gluster-storage-002.openstacklocal ...
Unable to create node: Unable to execute command on
glusterfs-storage-gbzd8:\n\tCreating node
orn-gluster-storage-003.openstacklocal ... ID:
9e69ad050cdc41af61707319612e5f58\n\t\tAdding device /dev/vdb ... OK",
  "stdout_lines": [
    "Creating cluster ... ID: 69b19096f118186c5a09f9e78f9cb9aa",
    "\tAllowing file volumes on cluster.",
    "\tAllowing block volumes on cluster.",
    "\tCreating node orn-gluster-storage-001.openstacklocal ...
ID: ec9d615910d52bc5db9f4b18fdb714f3",
    "\t\tAdding device /dev/vdb ... OK",
    "\tCreating node orn-gluster-storage-002.openstacklocal ...
Unable to create node: Unable to execute command on
glusterfs-storage-gbzd8:",
    "\tCreating node orn-gluster-storage-003.openstacklocal ...
ID: 9e69ad050cdc41af61707319612e5f58",
    "\t\tAdding device /dev/vdb ... OK"
  ]
}

Any idea what's going wrong?

Tim

___
users mailing list
users@lists.openshift.redhat.com
<mailto:users@lists.openshift.redhat.com>
http://lists.openshift.redhat.com/openshiftmm/listinfo/users
<http://lists.openshift.redhat.com/openshiftmm/listinfo/users>




___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


GlusterFS failing to deploy

2018-04-16 Thread Tim Dudgeon

I'm having problems deploying GlusterFS to an Origin cluster.

I have 3 identical nodes for running glusterfs, but the deployment seems 
to randomly fail on one of the nodes sometimes. This is a typical error 
(with the json reformatted). Notice how node 001 and 003 work fine, but 
002 fails.

All three nodes are equivalent in config.

TASK [openshift_storage_glusterfs : Load heketi topology] 


Monday 16 April 2018  10:49:57 + (0:00:01.414) 0:44:22.372 **

{
  "changed": true,
  "cmd": [
    "oc",
"--config=/tmp/openshift-glusterfs-ansible-Eb85yA/admin.kubeconfig",
    "rsh",
    "--namespace=glusterfs",
    "deploy-heketi-storage-1-5svjh",
    "heketi-cli",
    "-s",
    "http://localhost:8080;,
    "--user",
    "admin",
    "--secret",
    "JsSOzmoF6nP6nfuJJ1RQigRQNkUiD88xl8FLfu+xhpk=",
    "topology",
    "load",
"--json=/tmp/openshift-glusterfs-ansible-Eb85yA/topology.json",
    "2>&1"
  ],
  "delta": "0:02:08.608619",
  "end": "2018-04-16 10:52:06.930155",
  "failed_when_result": true,
  "rc": 0,
  "start": "2018-04-16 10:49:58.321536",
  "stderr": "",
  "stderr_lines": [],
  "stdout": "Creating cluster ... ID: 
69b19096f118186c5a09f9e78f9cb9aa\n\tAllowing file volumes on 
cluster.\n\tAllowing block volumes on cluster.\n\tCreating node 
orn-gluster-storage-001.openstacklocal ... ID: 
ec9d615910d52bc5db9f4b18fdb714f3\n\t\tAdding device /dev/vdb ... 
OK\n\tCreating node orn-gluster-storage-002.openstacklocal ... Unable to 
create node: Unable to execute command on 
glusterfs-storage-gbzd8:\n\tCreating node 
orn-gluster-storage-003.openstacklocal ... ID: 
9e69ad050cdc41af61707319612e5f58\n\t\tAdding device /dev/vdb ... OK",

  "stdout_lines": [
    "Creating cluster ... ID: 69b19096f118186c5a09f9e78f9cb9aa",
    "\tAllowing file volumes on cluster.",
    "\tAllowing block volumes on cluster.",
    "\tCreating node orn-gluster-storage-001.openstacklocal ... ID: 
ec9d615910d52bc5db9f4b18fdb714f3",

    "\t\tAdding device /dev/vdb ... OK",
    "\tCreating node orn-gluster-storage-002.openstacklocal ... Unable 
to create node: Unable to execute command on glusterfs-storage-gbzd8:",
    "\tCreating node orn-gluster-storage-003.openstacklocal ... ID: 
9e69ad050cdc41af61707319612e5f58",

    "\t\tAdding device /dev/vdb ... OK"
  ]
}

Any idea what's going wrong?

Tim

___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: /etc/cni/net.d/ is sometimes empty

2018-04-16 Thread Tim Dudgeon

I created this issue that summarises the problem:
https://github.com/openshift/openshift-ansible/issues/7967

___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: Best way to get installed?

2018-04-13 Thread Tim Dudgeon

If you must deploy to GCE then Minishift is not the answer.
It's designed to run on your laptop so that you can test things and get 
up to speed with OpenShift.

For that it's ideal.

For a simple 1 server env you might want to look at 'oc cluster up':
https://github.com/openshift/origin/blob/master/docs/cluster_up_down.md

It might be a good way to get you going.
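
A minimal sketch, assuming docker is already installed and configured with the 
insecure registry range the cluster up docs describe (the hostname flag is only 
needed if you want to reach the console from outside the GCE instance):

oc cluster up --public-hostname=<external-ip-or-dns>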


On 13/04/18 20:25, Tracy Reed wrote:

On Fri, Apr 13, 2018 at 12:22:26AM PDT, Tim Dudgeon spake thusly:

Depends on what you are wanting to do.
To get some basic experience with using OpenShift you could try Minishift:

https://docs.openshift.org/latest/minishift/index.html

Thanks. This is so far the only suggestion. However, I have to deploy
this in Google Compute Engine. Minishift requires access to a supported
hypervisor so that it can spin up the VM itself. So unfortunately, this
won't work. I found the minishift github repo where someone had
requested that minishift be able to be provisioned on a pre-existing VM:
https://github.com/minishift/minishift/issues/467 but this is rejected
as they want to have more control over the environment in terms of
storage, cpu, installed OS, etc. So my search continues...

Thanks!



___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: CentOS Origin packages in testing for 3.7, 3.8 and 3.9

2018-04-13 Thread Tim Dudgeon

How to best enable these test repos?
I found this ansible property that looks right, but it doesn't help:

openshift_repos_enable_testing=true

When I set these props:

openshift_deployment_type=origin
openshift_release=v3.9

you get this error when running the playbooks/deploy_cluster.yml playbook:

Failure summary:

  1. Hosts:    test39-master
 Play: Determine openshift_version to configure on first master
 Task: openshift_version : fail
 Message:  Package 'origin-3.9*' not found

Other notes for people wanting to try 3.9:
1. you may need to upgrade ansible on the machine you are deploying from
2. the nodes now need the python-ipaddress module to be `yum installed`
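
In case it helps others testing, one way to point yum at the testing repo by hand is 
a drop-in repo file; a sketch only, using the 3.9 baseurl from Troy's mail below 
(gpgcheck is disabled here on the assumption that these test builds are unsigned, 
which is worth verifying):

# /etc/yum.repos.d/centos-openshift-origin39-testing.repo
[centos-openshift-origin39-testing]
name=CentOS OpenShift Origin 3.9 (testing)
baseurl=https://buildlogs.centos.org/centos/7/paas/x86_64/openshift-origin39/
enabled=1
gpgcheck=0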

On 12/04/18 14:32, Troy Dawson wrote:

We have origin packages for 3.7.2, 3.8.0 and 3.9.0.  We also have the
corresponding openshift-ansible packages. They have been put in our
testing repos.

DO NOT USE ORIGIN 3.8.0, IT IS FOR UPGRADE PURPOSES ONLY

These will not be released until *someone* has tested them.  So
please, someone, anyone, test them, and let us know.

origin 3.9 testing
https://buildlogs.centos.org/centos/7/paas/x86_64/openshift-origin39/
origin-3.9.0-1.el7.git.0.ba7faec
openshift-ansible-3.9.0-0.53.0.git.1.af49d87.el7

origin 3.8 testing
https://buildlogs.centos.org/centos/7/paas/x86_64/openshift-origin38/
origin-3.8.0-1.el7.git.0.dd1558c
openshift-ansible-3.8.37-1.git.1.151d57f.el7

origin 3.7 testing
https://buildlogs.centos.org/centos/7/paas/x86_64/openshift-origin37/
origin-3.7.2-1.el7.git.0.cd74924
openshift-ansible-3.7.43-1.git.1.ed51ddd.el7

Once again, test and let us know.  If you don't know who to send it
to, just reply to this email.

Thanks
Paas Sig Group

___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: /etc/cni/net.d/ is sometimes empty

2018-04-13 Thread Tim Dudgeon

Yes, that's exactly the sort of behaviour I'm seeing.
But I also see it when deploying a cluster (the playbooks/byo/config.yml 
playbook) as well as when scaling up.


On one of my (OpenStack) environments it seems to happen on about 50% of 
the nodes!



On 13/04/18 14:51, Rodrigo Bersa wrote:

Hi TIm,

Yes, I've seen this error sometimes, mainly during the Scaleup process.

What I did that apparently solved this issue is to remove the /etc/cni 
directory and let the installation/scaleup process create it, but I 
don't know the root cause either.
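
In command form that workaround is roughly the following, run on the affected node 
before re-running the playbook (the scaleup playbook path is the release-3.7 one and 
may differ on other branches):

sudo rm -rf /etc/cni
# then, from the deployment host:
ansible-playbook -i <inventory_file> playbooks/byo/openshift-node/scaleup.yml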


As you said, it happens randomly and doesn't seem to have a pattern. The 
first time I faced it, I was scaling a cluster and adding four new 
Nodes, and just one presented the error, the other three were added to 
the cluster with no errors.



Best regards,


Rodrigo Bersa

Cloud Consultant, RHCVA, RHCE

Red Hat Brasil <https://www.redhat.com>

rbe...@redhat.com <mailto:rbe...@redhat.com> M: +55-11-99557-5841 
<tel:+55-11-99557-5841>


<https://red.ht/sig>  
TRIED. TESTED. TRUSTED. <https://redhat.com/trusted>

Red Hat é reconhecida entre as melhores empresas para trabalhar no 
Brasil pelo *Great Place to Work*.


On Fri, Apr 13, 2018 at 10:10 AM, Tim Dudgeon <tdudgeon...@gmail.com 
<mailto:tdudgeon...@gmail.com>> wrote:


We've long been encountering a seemingly random problem installing
Origin 3.7 on Centos nodes.
This is manifested in the /etc/cni/net.d/ directory on the node
being empty (it should contain one file named
80-openshift-sdn.conf) and that prevents the origin-node service
from starting, with the key error in the logs (using journalctl)
being something like this:

Apr 13 12:23:44 ip-10-0-0-61.eu-central-1.compute.internal
origin-node[26683]: W0413 12:23:44.933963   26683 cni.go:189]
Unable to update cni config: No networks found in /etc/cni/net.d

Something is preventing the ansible installer from creating this
file on the nodes (though the real cause may be upstream of this).

This seems to happen randomly, and with differing frequencies on
different environments. On one environment about 50% of the
nodes fail in this way. On others it's much less frequent. We
thought this was a problem with our OpenStack environment but we
have now also seen this on AWS, so it looks like it's an
OpenShift-specific problem.

Has anyone else seen this or know what causes it?
It's been a really big impediment to rolling out a cluster.

Tim


___
users mailing list
users@lists.openshift.redhat.com
<mailto:users@lists.openshift.redhat.com>
http://lists.openshift.redhat.com/openshiftmm/listinfo/users
<http://lists.openshift.redhat.com/openshiftmm/listinfo/users>




___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


adding glusterfs to an existing cluster

2018-04-13 Thread Tim Dudgeon
I'm having unreliability problems installing a complete cluster, so am 
trying to do this piece by piece.
First I deploy a basic origin cluster and then I try to deploy glusterfs 
using the playbooks/common/openshift-glusterfs/config.yml playbook (this 
is using v3.7 and the release-3.7 branch of openshift-ansible).


I already have the three gluster nodes as normal nodes in the cluster, 
and now add the gluster sections to the inventory file like this:


[glusterfs]
orn-gluster-storage-001 glusterfs_ip=10.0.0.30 glusterfs_devices='[ 
"/dev/vdb" ]'
orn-gluster-storage-002 glusterfs_ip=10.0.0.33 glusterfs_devices='[ 
"/dev/vdb" ]'
orn-gluster-storage-003 glusterfs_ip=10.0.0.7 glusterfs_devices='[ 
"/dev/vdb" ]'


[nodes]

orn-gluster-storage-001 
openshift_hostname=orn-gluster-storage-001.openstacklocal
orn-gluster-storage-002 
openshift_hostname=orn-gluster-storage-002.openstacklocal
orn-gluster-storage-003 
openshift_hostname=orn-gluster-storage-003.openstacklocal



But when I run the playbooks/common/openshift-glusterfs/config.yml 
playbook gluster does not get installed and I see this in the log:


PLAY [Configure GlusterFS] 


skipping: no hosts matched
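
One likely cause of the "no hosts matched" (an assumption, since the [OSEv3:children] 
section isn't shown above): the glusterfs group also has to be listed as a child of 
OSEv3 for the play to pick those hosts up, something like:

[OSEv3:children]
masters
nodes
etcd
glusterfs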

What's the right procedure for doing this?

Tim

___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: Best way to get installed?

2018-04-13 Thread Tim Dudgeon

Depends on what you are wanting to do.
To get some basic experience with using OpenShift you could try Minishift:

https://docs.openshift.org/latest/minishift/index.html

Tim


On 12/04/18 22:26, Tracy Reed wrote:

So I've been tasked with setting up an OpenShift cluster for some light
testing. Not prod. I was originally given
https://github.com/RedHatWorkshops/openshiftv3-ops-workshop/blob/master/setting_up_nonha_ocp_cluster.md
as the install guide.

This tutorial takes quite a while to manually set up the 4 nodes (in
GCE), plus storage, etc. and then launches into an hour-long ansible
run.  I've been through it 4 times now and each time ran into various
odd problems (which I could document for you if necessary).

Is there currently any other simpler and faster way to install
a basic OpenShift setup?

Googling produces a number of other OpenShift tutorials, many of which
now have comments on them about bugs or being out of date etc.

What's the current state of the art in simple openshift install
guides?

Thanks!



___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: How to deploy openshift origin cluster on openstack?

2018-04-10 Thread Tim Dudgeon

Basically you do 2 things:

1. create your openstack environment with the instances you need and the 
appropriate networking (just like you would for any environment)


2. deploy openshift using the ansible playbooks [1]

But there is a lot of devil in the detail and it depends a bit on what 
you are wanting to deploy (openstack cloud provider, glusterfs ...).
We have used a number of openstack environments, and found them all to 
be a bit fragile. Added to this the openshift environment is continually 
changing (playbooks, RPMs, Docker images) so the whole process is a bit 
temperamental, but it can be made to work.


For sure you should look at the parts of the openshift documentation 
that cover openstack [2, 3] as well as these contrib playbooks that also 
handle creation of the openstack parts [4] (but IMHO these are not 
really suitable for creating a real cluster as they are).


[1] https://github.com/openshift/openshift-ansible/
[2] 
https://docs.openshift.org/latest/install_config/configuring_openstack.html
[3] 
https://docs.openshift.org/latest/install_config/persistent_storage/persistent_storage_cinder.html
[4] 
https://github.com/openshift/openshift-ansible-contrib/tree/master/playbooks/provisioning/openstack
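
As a concrete sketch of step 2 (the inventory path is a placeholder; release-3.7 uses 
the byo entry point, newer branches use deploy_cluster.yml):

# release-3.7 branch
ansible-playbook -i <inventory_file> playbooks/byo/config.yml
# later branches
ansible-playbook -i <inventory_file> playbooks/deploy_cluster.yml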


On 10/04/18 11:27, Yu Wei wrote:

Hi,
How to deploy openshift origin cluster on openstack?
Could I use magnum, heat or other components?

Is there any document about this?

Thanks,

Jared, (韦煜)
Software developer
Interested in open source software, big data, Linux



___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: glusterfs setup

2018-03-28 Thread Tim Dudgeon

Ah!, that's a shame.

Tim


On 28/03/18 14:11, Joel Pearson wrote:

“Distributed-Three-way replication is the only supported volume type.”

https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.2/html/container-native_storage_for_openshift_container_platform/ch03s02


On Thu, 29 Mar 2018 at 12:00 am, Tim Dudgeon <tdudgeon...@gmail.com 
<mailto:tdudgeon...@gmail.com>> wrote:


When using native glusterfs it's not clear to me how to configure the
types of storage.

As described in the glusterfs docs [1] there are multiple types of
volume that can be created (Distributed, Replicated, Distributed
Replicated, Striped, Distributed Striped).

In the example ansible inventory file [2] you are suggested to set up
the glusterfs_devices variable like this:

[glusterfs]
node0  glusterfs_devices='[ "/dev/vdb", "/dev/vdc", "/dev/vdd" ]'
node1  glusterfs_devices='[ "/dev/vdb", "/dev/vdc", "/dev/vdd" ]'
node2  glusterfs_devices='[ "/dev/vdb", "/dev/vdc", "/dev/vdd" ]'

But how do you control the way those block devices are utilised to create a
particular type of volume?

How would you specify that you wanted multiple types of volume
(presumably each with its own storage class)?

Thanks
Tim

[1]

https://docs.gluster.org/en/latest/Quick-Start-Guide/Architecture/#types-of-volumes
[2]

https://github.com/openshift/openshift-ansible/blob/master/inventory/hosts.glusterfs.native.example

___
users mailing list
users@lists.openshift.redhat.com
<mailto:users@lists.openshift.redhat.com>
http://lists.openshift.redhat.com/openshiftmm/listinfo/users



___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


glusterfs setup

2018-03-28 Thread Tim Dudgeon
When using native glusterfs it's not clear to me how to configure the 
types of storage.


As described in the glusterfs docs [1] there are multiple types of 
volume that can be created (Distributed, Replicated, Distributed 
Replicated, Striped, Distributed Striped).


In the example ansible inventory file [2] you are suggested to set up 
the glusterfs_devices variable like this:


[glusterfs]
node0  glusterfs_devices='[ "/dev/vdb", "/dev/vdc", "/dev/vdd" ]'
node1  glusterfs_devices='[ "/dev/vdb", "/dev/vdc", "/dev/vdd" ]'
node2  glusterfs_devices='[ "/dev/vdb", "/dev/vdc", "/dev/vdd" ]'

But how do you control the way those block devices are utilised to create a 
particular type of volume?


How would you specify that you wanted multiple types of volume 
(presumably each with its own storage class)?


Thanks
Tim

[1] 
https://docs.gluster.org/en/latest/Quick-Start-Guide/Architecture/#types-of-volumes
[2] 
https://github.com/openshift/openshift-ansible/blob/master/inventory/hosts.glusterfs.native.example


___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: Not able to route to services

2018-03-28 Thread Tim Dudgeon

A little more on this.
I have two systems, installed in an identical manner as is possible.
One works fine, on the other I can't connect to services.

For instance, from the master node I try to connect to the docker-registry 
service on the infrastructure node. If I try:


curl -I https://:5000/healthz

It works on the working environment, but gets a "No route to host" error 
on the failing one.


If I try:

sudo traceroute -T -p 5000 

it confirms the problem. On the working environment:

$ sudo traceroute -T -p 5000 172.30.145.23
traceroute to 172.30.145.23 (172.30.145.23), 30 hops max, 60 byte packets
 1  docker-registry.default.svc.cluster.local (172.30.145.23)  3.044 
ms  2.723 ms  2.307 ms


On the failing one:

$ sudo traceroute -T -p 5000 172.30.76.145
traceroute to 172.30.76.145 (172.30.76.145), 30 hops max, 60 byte packets
 1  docker-registry.default.svc.cluster.local (172.30.76.145) 3004.572 
ms !H  3004.517 ms !H  3004.502 ms !H


The !H means the host is unreachable.
If I run the same commands from the infrastructure node where the 
service is actually running then it works OK.


The security group for both servers leaves all TCP traffic open. e.g.

ALLOW IPv4 1-65535/tcp to 0.0.0.0/0
ALLOW IPv4 1-65535/tcp from 0.0.0.0/0

Any thoughts on what is blocking the traffic?
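
One thing worth checking (an observation, not a confirmed diagnosis): the rules above 
only cover TCP, while the OpenShift SDN tunnels pod and service traffic between nodes 
over VXLAN on UDP port 4789, so a UDP rule along these lines may also be needed:

openstack security group rule create --protocol udp --dst-port 4789 <security-group>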

Tim



On 27/03/18 21:54, Tim Dudgeon wrote:


Sorry, I am using port 5000. I wrote that bit incorrectly.
I did do some more digging based on what's here 
(https://docs.openshift.org/latest/admin_guide/sdn_troubleshooting.html) 
and it looks like there's something wrong with the node to node 
communications.

From the master I try to contact the infrastructure node:

$ ping 192.168.253.126
PING 192.168.253.126 (192.168.253.126) 56(84) bytes of data.
64 bytes from 192.168.253.126: icmp_seq=1 ttl=64 time=0.657 ms
64 bytes from 192.168.253.126: icmp_seq=2 ttl=64 time=0.588 ms
64 bytes from 192.168.253.126: icmp_seq=3 ttl=64 time=0.605 ms
^C
--- 192.168.253.126 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.588/0.616/0.657/0.041 ms

$ tracepath 192.168.253.126
 1?: [LOCALHOST] pmtu 1450
 1:  no reply
 2:  no reply
 3:  no reply
 4:  no reply
^C

I can ping the node but tracepath can't reach it. On a working 
cluster tracepath has no problems.


I don't know the cause. Any ideas?


On 27/03/18 21:46, Louis Santillan wrote:
Isn't the default port for your Registry 5000? Try `curl -kv 
https://docker-registry.default.svc:5000/healthz` 
<https://docker-registry.default.svc:5000/> [0][1].


[0] https://access.redhat.com/solutions/1616953#health
[1] 
https://docs.openshift.com/container-platform/3.7/install_config/registry/accessing_registry.html#accessing-registry-metrics


___

LOUIS P.SANTILLAN

Architect, OPENSHIFT, MIDDLEWARE & DEVOPS

Red Hat Consulting, <https://www.redhat.com/> Container and PaaS Practice

lsant...@redhat.com <mailto:lsant...@redhat.com>  M: 3236334854 



<https://red.ht/sig>  
TRIED. TESTED. TRUSTED. <https://redhat.com/trusted>




On Tue, Mar 27, 2018 at 6:39 AM, Tim Dudgeon <tdudgeon...@gmail.com 
<mailto:tdudgeon...@gmail.com>> wrote:


Something strange has happened in my environment which has
resulted in not being able to route to any of the services.
Earlier this was all working fine. The install was done using the
ansible installer and this is happening with 3.6.1 and 3.7.1.
The services are all there and running fine, and DNS is working,
but I can't reach them, e.g. from the master node:

$ host docker-registry.default.svc
docker-registry.default.svc.cluster.local has address
172.30.243.173
$ curl -k https://docker-registry.default.svc/healthz
<https://docker-registry.default.svc/healthz>
curl: (7) Failed connect to docker-registry.default.svc:443; No
route to host

Any ideas on how to work out what's gone wrong?


___
users mailing list
users@lists.openshift.redhat.com
<mailto:users@lists.openshift.redhat.com>
http://lists.openshift.redhat.com/openshiftmm/listinfo/users
<http://lists.openshift.redhat.com/openshiftmm/listinfo/users>






___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: Not able to route to services

2018-03-27 Thread Tim Dudgeon

Sorry, I am using port 5000. I wrote that bit incorrectly.
I did do some more digging based on what's here 
(https://docs.openshift.org/latest/admin_guide/sdn_troubleshooting.html) 
and it looks like there's something wrong with the node to node 
communications.

From the master I try to contact the infrastructure node:

$ ping 192.168.253.126
PING 192.168.253.126 (192.168.253.126) 56(84) bytes of data.
64 bytes from 192.168.253.126: icmp_seq=1 ttl=64 time=0.657 ms
64 bytes from 192.168.253.126: icmp_seq=2 ttl=64 time=0.588 ms
64 bytes from 192.168.253.126: icmp_seq=3 ttl=64 time=0.605 ms
^C
--- 192.168.253.126 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.588/0.616/0.657/0.041 ms

$ tracepath 192.168.253.126
 1?: [LOCALHOST] pmtu 1450
 1:  no reply
 2:  no reply
 3:  no reply
 4:  no reply
^C

I can ping the node but tracepath can't reach it. On a working cluster 
tracepath has no problems.


I don't know the cause. Any ideas?


On 27/03/18 21:46, Louis Santillan wrote:
Isn't the default port for your Registry 5000? Try `curl -kv 
https://docker-registry.default.svc:5000/healthz` 
<https://docker-registry.default.svc:5000/> [0][1].


[0] https://access.redhat.com/solutions/1616953#health
[1] 
https://docs.openshift.com/container-platform/3.7/install_config/registry/accessing_registry.html#accessing-registry-metrics


___

LOUIS P.SANTILLAN

Architect, OPENSHIFT, MIDDLEWARE & DEVOPS

Red Hat Consulting, <https://www.redhat.com/> Container and PaaS Practice

lsant...@redhat.com <mailto:lsant...@redhat.com>  M: 3236334854 



<https://red.ht/sig>  
TRIED. TESTED. TRUSTED. <https://redhat.com/trusted>




On Tue, Mar 27, 2018 at 6:39 AM, Tim Dudgeon <tdudgeon...@gmail.com 
<mailto:tdudgeon...@gmail.com>> wrote:


Something strange has happened in my environment which has
resulted in not being able to route to any of the services.
Earlier this was all working fine. The install was done using the
ansible installer and this is happening with 3.6.1 and 3.7.1.
The services are all there and running fine, and DNS is working,
but I can't reach them, e.g. from the master node:

$ host docker-registry.default.svc
docker-registry.default.svc.cluster.local has address
172.30.243.173
$ curl -k https://docker-registry.default.svc/healthz
<https://docker-registry.default.svc/healthz>
curl: (7) Failed connect to docker-registry.default.svc:443; No
route to host

Any ideas on how to work out what's gone wrong?


___
users mailing list
users@lists.openshift.redhat.com
<mailto:users@lists.openshift.redhat.com>
http://lists.openshift.redhat.com/openshiftmm/listinfo/users
<http://lists.openshift.redhat.com/openshiftmm/listinfo/users>




___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Not able to route to services

2018-03-27 Thread Tim Dudgeon
Something strange has happened in my environment which has resulted in 
not being able to route to any of the services.
Earlier this was all working fine. The install was done using the 
ansible installer and this is happening with 3.6.1 and 3.7.1.
The services are all there and running fine, and DNS is working, but I 
can't reach them. e.g. from the master node:


$ host docker-registry.default.svc
docker-registry.default.svc.cluster.local has address 172.30.243.173
$ curl -k https://docker-registry.default.svc/healthz
curl: (7) Failed connect to docker-registry.default.svc:443; No route to 
host


Any ideas on how to work out what's gone wrong?


___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: Uninstalling OpenShift

2018-03-27 Thread Tim Dudgeon

The uninstall playbook works great.
Seems to restore the machines back to their initial state pretty well.
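
For reference, the invocation is just the adhoc playbook from [1] in Aleksandar's 
mail below, run with the same openshift-ansible checkout that did the install:

ansible-playbook -i <inventory_file> playbooks/adhoc/uninstall.yml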

Tim


On 27/03/18 11:07, Aleksandar Kostadinov wrote:
I see an uninstall playbook [1]. I've never used it, but I guess it should do 
the necessary things. I mean, look in the place where your install 
playbook was. I don't suggest you use an uninstall playbook from a 
different version than what you have installed.


Additionally, I think you can just disable the atomic-openshift* 
services with `systemctl` if you don't care about the leftover software.


Maybe somebody with more experience of uninstalling can chime in. I've 
always just used dedicated VMs that were removed after use. So I never 
actually reinstalled.


[1] 
https://github.com/openshift/openshift-ansible/blob/master/playbooks/adhoc/uninstall.yml


Alfredo Palhares wrote on 03/27/18 12:54:

Hello everyone,


How can I come to uninstall openshift from openshift ansible?

I would prefer not to do a total wipeout of the systems, since I 
don't have "jurisdiction" on that front and it would take me several 
weeks to even get that done on 4 machines.


So is there an uninstall playbook? All I need is to be left with docker.


Regards,
Alfredo Palhares


___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users



___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: TSB fails to start

2018-03-20 Thread Tim Dudgeon

I'm just using the default SDN.

This seems to be some issue with Origin 3.7.  Switching back to 3.6.1 
works fine.
I'm struggling to work out what is going on as it is not very 
reproducible (and 3.7 is broken at present).


Tim


On 20/03/18 08:10, Joel Pearson wrote:
Are you using calico or something like that? If so why not consider a 
regular overlay network just to get it working?
On Thu, 15 Mar 2018 at 5:26 am, Tim Dudgeon <tdudgeon...@gmail.com 
<mailto:tdudgeon...@gmail.com>> wrote:


A little more on this.
On the nodes that are not working the file
/etc/cni/net.d/80-openshift-network.conf is not present.
This seems to cause errors like this in the origin-node service:

Mar 14 18:21:45 zzz-infra.openstacklocal origin-node[17833]: W0314
18:21:45.711715   17833 cni.go:189] Unable to update cni config:
No networks found in /etc/cni/net.d

Where in the installation process does the
80-openshift-network.conf file get created?
I don't see anything in the ansible installer logs suggesting
anything has gone wrong.



On 13/03/18 17:02, Tim Dudgeon wrote:


This is still troubling me. I would welcome any input on this.

When I run an ansible install (using Origin 3.7.1 on Centos7
nodes) the DNS setup on some nodes seems to randomly get messed
up. For instance I've just run a setup with 1 master, 1 infra and
2 identical worker nodes.

During the installation one of the worker nodes starts responding
very slowly. The other is fine.
Looking deeper, on the slow responding one I see a DNS setup like
this:


[centos@xxx-node-001 ~]$ sudo netstat -tunlp | grep tcp | grep
:53 | grep -v tcp6
tcp    0  0 10.0.0.20:53 <http://10.0.0.20:53>
0.0.0.0:*   LISTEN  14727/dnsmasq
tcp    0  0 172.17.0.1:53 <http://172.17.0.1:53>
0.0.0.0:*   LISTEN  14727/dnsmasq
[centos@xxx-node-001 ~]$ host orndev-bastion-002
;; connection timed out; trying next origin
orndev-bastion-002.openstacklocal has address 10.0.0.9


Whilst on the good one it looks like this:


[centos@xxx-node-002 ~]$ sudo netstat -tunlp | grep tcp | grep
:53 | grep -v tcp6
tcp    0  0 127.0.0.1:53 <http://127.0.0.1:53>
0.0.0.0:*   LISTEN  17231/openshift
tcp    0  0 10.129.0.1:53 <http://10.129.0.1:53>
0.0.0.0:*   LISTEN  14563/dnsmasq
tcp    0  0 10.0.0.22:53 <http://10.0.0.22:53>
0.0.0.0:*   LISTEN  14563/dnsmasq
tcp    0  0 172.17.0.1:53 <http://172.17.0.1:53>
0.0.0.0:*   LISTEN  14563/dnsmasq
[centos@xxx-node-002 ~]$ host orndev-bastion-002
orndev-bastion-002.openstacklocal has address 10.0.0.9

Notice how 2 DNS listeners are not present, and how this causes
the DNS lookup to timeout locally before falling back to an
upstream server.

Getting into this state seems to be a random event.

Any thoughts?



On 01/03/18 14:30, Tim Dudgeon wrote:


Yes, I think it is related to DNS.

On a similar, but working, OpenStack environment ` netstat
-tunlp | grep ...` shows this:

tcp    0  0 127.0.0.1:53 <http://127.0.0.1:53>
0.0.0.0:*   LISTEN 16957/openshift
tcp    0  0 10.128.0.1:53 <http://10.128.0.1:53>
0.0.0.0:*   LISTEN 16248/dnsmasq
tcp    0  0 10.0.0.5:53 <http://10.0.0.5:53>
0.0.0.0:*   LISTEN 16248/dnsmasq
tcp    0  0 172.17.0.1:53 <http://172.17.0.1:53>
0.0.0.0:*   LISTEN 16248/dnsmasq
tcp    0  0 0.0.0.0:8053 <http://0.0.0.0:8053>
0.0.0.0:*   LISTEN  12270/openshift

On the environment where the TSB is failing to start I'm seeing:

tcp    0  0 127.0.0.1:53 <http://127.0.0.1:53>
0.0.0.0:*   LISTEN 19067/openshift
tcp    0  0 10.129.0.1:53 <http://10.129.0.1:53>
0.0.0.0:*   LISTEN 16062/dnsmasq
tcp    0  0 172.17.0.1:53 <http://172.17.0.1:53>
0.0.0.0:*   LISTEN 16062/dnsmasq
tcp    0  0 0.0.0.0:8053 <http://0.0.0.0:8053>
0.0.0.0:*   LISTEN  11628/openshift

Notice that in the first case dnsmasq is listening on the
machine's IP address (line 3) but in the second case this is
missing.

Both environments have been created with the openshift-ansible
playbooks using an approach that is as equivalent as is possible.
The contents of /etc/dnsmasq.d/ on the two systems also seem to
be equivalent.

Any thoughts?



On 28/02/18 18:50, Nobuhiro Sue wrote:

Tim,

It seems to be a DNS issue. I guess your environment is on
OpenStack, so please check resolver (lookup / reverse lookup).
You can see how DNS works on Open

Re: TSB fails to start

2018-03-14 Thread Tim Dudgeon

A little more on this.
On the nodes that are not working the file 
/etc/cni/net.d/80-openshift-network.conf is not present.

This seems to cause errors like this in the origin-node service:

Mar 14 18:21:45 zzz-infra.openstacklocal origin-node[17833]: W0314 
18:21:45.711715   17833 cni.go:189] Unable to update cni config: No 
networks found in /etc/cni/net.d


Where in the installation process does the 80-openshift-network.conf 
file get created?
I don't see anything in the ansible installer logs suggesting anything 
has gone wrong.




On 13/03/18 17:02, Tim Dudgeon wrote:


This is still troubling me. I would welcome any input on this.

When I run an ansible install (using Origin 3.7.1 on Centos7 nodes) 
the DNS setup on some nodes seems to randomly get messed up. For 
instance I've just run a setup with 1 master, 1 infra and 2 identical 
worker nodes.


During the installation one of the worker nodes starts responding very 
slowly. The other is fine.

Looking deeper, on the slow responding one I see a DNS setup like this:

[centos@xxx-node-001 ~]$ sudo netstat -tunlp | grep tcp | grep :53 | 
grep -v tcp6
tcp    0  0 10.0.0.20:53 0.0.0.0:*   LISTEN  
14727/dnsmasq
tcp    0  0 172.17.0.1:53 0.0.0.0:*   LISTEN  
14727/dnsmasq

[centos@xxx-node-001 ~]$ host orndev-bastion-002
;; connection timed out; trying next origin
orndev-bastion-002.openstacklocal has address 10.0.0.9


Whilst on the good one it looks like this:

[centos@xxx-node-002 ~]$ sudo netstat -tunlp | grep tcp | grep :53 | 
grep -v tcp6
tcp    0  0 127.0.0.1:53 0.0.0.0:*   LISTEN  
17231/openshift
tcp    0  0 10.129.0.1:53 0.0.0.0:*   LISTEN  
14563/dnsmasq
tcp    0  0 10.0.0.22:53 0.0.0.0:*   LISTEN  
14563/dnsmasq
tcp    0  0 172.17.0.1:53 0.0.0.0:*   LISTEN  
14563/dnsmasq

[centos@xxx-node-002 ~]$ host orndev-bastion-002
orndev-bastion-002.openstacklocal has address 10.0.0.9
Notice how 2 DNS listeners are not present, and how this causes the 
DNS lookup to timeout locally before falling back to an upstream server.


Getting into this state seems to be a random event.

Any thoughts?



On 01/03/18 14:30, Tim Dudgeon wrote:


Yes, I think it is related to DNS.

On a similar, but working, OpenStack environment ` netstat -tunlp | 
grep ...` shows this:


tcp    0  0 127.0.0.1:53 0.0.0.0:*   LISTEN  
16957/openshift
tcp    0  0 10.128.0.1:53 0.0.0.0:*   LISTEN  
16248/dnsmasq
tcp    0  0 10.0.0.5:53 0.0.0.0:*   LISTEN  
16248/dnsmasq
tcp    0  0 172.17.0.1:53 0.0.0.0:*   LISTEN  
16248/dnsmasq
tcp    0  0 0.0.0.0:8053 0.0.0.0:*   LISTEN  
12270/openshift


On the environment where the TSB is failing to start I'm seeing:

tcp    0  0 127.0.0.1:53 0.0.0.0:*   LISTEN  
19067/openshift
tcp    0  0 10.129.0.1:53 0.0.0.0:*   LISTEN  
16062/dnsmasq
tcp    0  0 172.17.0.1:53 0.0.0.0:*   LISTEN  
16062/dnsmasq
tcp    0  0 0.0.0.0:8053 0.0.0.0:*   LISTEN  
11628/openshift


Notice that in the first case dnsmasq is listening on the machine's 
IP address (line 3) but in the second case this is missing.


Both environments have been created with the openshift-ansible 
playbooks using an approach that is as equivalent as is possible.
The contents of /etc/dnsmasq.d/ on the two systems also seem to be 
equivalent.


Any thoughts?



On 28/02/18 18:50, Nobuhiro Sue wrote:

Tim,

It seems to be a DNS issue. I guess your environment is on OpenStack, 
so please check resolver (lookup / reverse lookup).

You can see how DNS works on OpenShift 3.6 or above:
https://blog.openshift.com/dns-changes-red-hat-openshift-container-platform-3-6/

2018-03-01 0:06 GMT+09:00 Tim Dudgeon <tdudgeon...@gmail.com 
<mailto:tdudgeon...@gmail.com>>:


Hi

I'm having problems getting an Origin cluster running, using the
ansible playbooks.
It fails at this point:

TASK [template_service_broker : Verify that TSB is running]

**
FAILED - RETRYING: Verify that TSB is running (120 retries left).
FAILED - RETRYING: Verify that TSB is running (119 retries left).

FAILED - RETRYING: Verify that TSB is running (1 retries left).
fatal: [master-01.novalocal]: FAILED! => {"attempts": 120,
"changed": false, "cmd": ["curl", "-k",
"https://apiserver.openshift-template-service-broker.svc/healthz
<https://apiserver.openshift-template-service-broker.svc/healthz>"],
"delta": "0:00:01.529402", "end": "2018-02-28 14:49:30.190842",
&quo

GlusterFS install fail

2018-03-12 Thread Tim Dudgeon
I'm trying to do a containerised GlusterFS install on an Origin/Centos7 
environment.


It's failing at this point:

TASK [openshift_storage_glusterfs : Verify heketi service] 


Monday 12 March 2018  18:14:34 + (0:00:00.136) 0:03:41.264 **
fatal: [orndev-master]: FAILED! => {"changed": false, "cmd": ["oc", 
"rsh", "--namespace=glusterfs", "deploy-heketi-storage-1-mfjlw", 
"heketi-cli", "-s", "http://localhost:8080;, "--user", "admin", 
"--secret", "OsXRF3zbk+vbybrLo2aVYkrt8gXyHdgORGA97UjWsZI=", "cluster", 
"list"], "delta": "0:00:00.793413", "end": "2018-03-12 18:14:36.177485", 
"msg": "non-zero return code", "rc": 255, "start": "2018-03-12 
18:14:35.384072", "stderr": "Error: signature is invalid\ncommand 
terminated with exit code 255", "stderr_lines": ["Error: signature is 
invalid", "command terminated with exit code 255"], "stdout": "", 
"stdout_lines": []}


The "signature is invalid" bit looks suspicious.
Any ideas what could be causing this?



___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: TSB fails to start

2018-03-01 Thread Tim Dudgeon

Thanks for the suggestions, but I don't think its either of these.
1. an infra region is defined
2. I tried specifically setting the 
openshift_template_service_broker_namespaces property as you suggested 
but it makes no difference. In fact I believe that if it is not set it 
defaults to using the openshift project only.


I think the problem is DNS related (see other message in this thread).


On 28/02/18 19:17, Gaurav Ojha wrote:

Hi,

I had a similar issue when setting up OpenShift through the playbooks. 
What solved it for me was realizing that I had not defined my node 
with region infra which is required for the router and registry to run 
(link here 
<https://docs.openshift.org/latest/install_config/install/advanced_install.html#configuring-dedicated-infrastructure-nodes>), 
and also that I hadn't configured the Template Service Broker properly 
(here 
<https://docs.openshift.org/latest/install_config/install/advanced_install.html#configuring-template-service-broker>). 
I did these two, and it all started to work.


I am not sure if this is something which you also might have 
overlooked, so you could confirm if that is the case.


Regards
Gaurav

On Wed, Feb 28, 2018 at 1:50 PM, Nobuhiro Sue <no...@redhat.com 
<mailto:no...@redhat.com>> wrote:


Tim,

It seems to be a DNS issue. I guess your environment is on
OpenStack, so please check resolver (lookup / reverse lookup).
You can see how DNS works on OpenShift 3.6 or above:

https://blog.openshift.com/dns-changes-red-hat-openshift-container-platform-3-6/

<https://blog.openshift.com/dns-changes-red-hat-openshift-container-platform-3-6/>

    2018-03-01 0:06 GMT+09:00 Tim Dudgeon <tdudgeon...@gmail.com
<mailto:tdudgeon...@gmail.com>>:

Hi

I'm having problems getting an Origin cluster running, using
the ansible playbooks.
It fails at this point:

TASK [template_service_broker : Verify that TSB is running]

**
FAILED - RETRYING: Verify that TSB is running (120 retries left).
FAILED - RETRYING: Verify that TSB is running (119 retries left).

FAILED - RETRYING: Verify that TSB is running (1 retries left).
fatal: [master-01.novalocal]: FAILED! => {"attempts": 120,
"changed": false, "cmd": ["curl", "-k",
"https://apiserver.openshift-template-service-broker.svc/healthz
<https://apiserver.openshift-template-service-broker.svc/healthz>"],
"delta": "0:00:01.529402", "end": "2018-02-28
14:49:30.190842", "msg": "non-zero return code", "rc": 7,
"start": "2018-02-28 14:49:28.661440", "stderr": "  % Total   
% Received % Xferd Average Speed   Time    Time Time
Current\n Dload  Upload   Total Spent    Left  Speed\n\r 0
0    0 0    0 0 0  0 --:--:-- --:--:--
--:--:-- 0\r  0 0    0 0 0 0  0  0
--:--:--  0:00:01 --:--:-- 0curl: (7) Failed connect to
apiserver.openshift-template-service-broker.svc:443; No route
to host", "stderr_lines": ["  % Total % Received % Xferd 
Average Speed   Time Time Time  Current", "     Dload
Upload   Total   Spent Left  Speed", "", "  0 0    0 0   
0 0  0  0 --:--:-- --:--:-- --:--:-- 0", " 
0 0    0 0 0 0  0  0 --:--:--  0:00:01
--:--:-- 0curl: (7) Failed connect to
apiserver.openshift-template-service-broker.svc:443; No route
to host"], "stdout": "", "stdout_lines": []}

All I can find in the logs on the master that seems relevant is:

Feb 28 14:43:25 master-01.novalocal
origin-master-controllers[9396]: E0228 14:43:25.394326    9396
daemoncontroller.go:255]
openshift-template-service-broker/apiserver failed with :
error storing status for daemon set
{TypeMeta:v1.TypeMeta{Kind:"",
APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"apiserver",
GenerateName:"",
Namespace:"openshift-template-service-broker",

SelfLink:"/apis/extensions/v1beta1/namespaces/openshift-template-service-broker/daemonsets/apiserver",
UID:"baa14f98-1c95-11e8-8a02-fa163e3f98d8",
ResourceVersion:"2972", Generation:1,
CreationTimestamp:v1.Time{Time:time.Time{sec:63655425804,
nsec:0, loc:(*time.Location)(0x111a3dc0)}},
DeletionTimestamp:(*v1.Time)(nil),
DeletionGraceP

Re: TSB fails to start

2018-03-01 Thread Tim Dudgeon

Yes, I think it is related to DNS.

On a similar, but working, OpenStack environment ` netstat -tunlp | grep 
...` shows this:


tcp    0  0 127.0.0.1:53 0.0.0.0:*   LISTEN  
16957/openshift
tcp    0  0 10.128.0.1:53 0.0.0.0:*   LISTEN  
16248/dnsmasq
tcp    0  0 10.0.0.5:53 0.0.0.0:*   LISTEN  
16248/dnsmasq
tcp    0  0 172.17.0.1:53 0.0.0.0:*   LISTEN  
16248/dnsmasq
tcp    0  0 0.0.0.0:8053 0.0.0.0:*   LISTEN  
12270/openshift


On the environment where the TSB is failing to start I'm seeing:

tcp    0  0 127.0.0.1:53 0.0.0.0:*   LISTEN  
19067/openshift
tcp    0  0 10.129.0.1:53 0.0.0.0:*   LISTEN  
16062/dnsmasq
tcp    0  0 172.17.0.1:53 0.0.0.0:*   LISTEN  
16062/dnsmasq
tcp    0  0 0.0.0.0:8053 0.0.0.0:*   LISTEN  
11628/openshift


Notice that in the first case dnsmasq is listening on the machine's IP 
address (line 3) but in the second case this is missing.


Both environments have been created with the openshift-ansible playbooks 
using an approach that is as equivalent as is possible.
The contents of /etc/dnsmasq.d/ on the two systems also seem to be 
equivalent.
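
One more place that may be worth comparing (an assumption based on how the 3.6+ DNS 
setup described later in this thread is wired up, not something I have confirmed 
here): the NetworkManager dispatcher script that writes the dnsmasq listeners, e.g.

ls -l /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
systemctl status NetworkManager dnsmasq
# re-running the dispatcher by bouncing NetworkManager sometimes restores the missing listener
sudo systemctl restart NetworkManager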


Any thoughts?



On 28/02/18 18:50, Nobuhiro Sue wrote:

Tim,

It seems to be a DNS issue. I guess your environment is on OpenStack, so 
please check resolver (lookup / reverse lookup).

You can see how DNS works on OpenShift 3.6 or above:
https://blog.openshift.com/dns-changes-red-hat-openshift-container-platform-3-6/

2018-03-01 0:06 GMT+09:00 Tim Dudgeon <tdudgeon...@gmail.com 
<mailto:tdudgeon...@gmail.com>>:


Hi

I'm having problems getting an Origin cluster running, using the
ansible playbooks.
It fails at this point:

TASK [template_service_broker : Verify that TSB is running]

**
FAILED - RETRYING: Verify that TSB is running (120 retries left).
FAILED - RETRYING: Verify that TSB is running (119 retries left).

FAILED - RETRYING: Verify that TSB is running (1 retries left).
fatal: [master-01.novalocal]: FAILED! => {"attempts": 120,
"changed": false, "cmd": ["curl", "-k",
"https://apiserver.openshift-template-service-broker.svc/healthz
<https://apiserver.openshift-template-service-broker.svc/healthz>"],
"delta": "0:00:01.529402", "end": "2018-02-28 14:49:30.190842",
"msg": "non-zero return code", "rc": 7, "start": "2018-02-28
14:49:28.661440", "stderr": "  % Total    % Received % Xferd
Average Speed   Time    Time Time Current\n     Dload
Upload   Total Spent    Left  Speed\n\r  0 0    0 0    0 0
0  0 --:--:-- --:--:-- --:--:-- 0\r 0 0    0 0
0 0  0  0 --:--:--  0:00:01 --:--:-- 0curl: (7)
Failed connect to
apiserver.openshift-template-service-broker.svc:443; No route to
host", "stderr_lines": ["  % Total    % Received % Xferd  Average
Speed   Time    Time Time  Current", "     Dload  Upload
Total   Spent Left  Speed", "", "  0 0    0 0 0 0 
0  0 --:--:-- --:--:-- --:--:-- 0", " 0 0    0
0    0 0  0  0 --:--:--  0:00:01 --:--:-- 0curl: (7)
Failed connect to
apiserver.openshift-template-service-broker.svc:443; No route to
host"], "stdout": "", "stdout_lines": []}

All I can find in the logs on the master that seems relevant is:

Feb 28 14:43:25 master-01.novalocal
origin-master-controllers[9396]: E0228 14:43:25.394326    9396
daemoncontroller.go:255]
openshift-template-service-broker/apiserver failed with : error
storing status for daemon set
{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""},
ObjectMeta:v1.ObjectMeta{Name:"apiserver", GenerateName:"",
Namespace:"openshift-template-service-broker",

SelfLink:"/apis/extensions/v1beta1/namespaces/openshift-template-service-broker/daemonsets/apiserver",
UID:"baa14f98-1c95-11e8-8a02-fa163e3f98d8",
ResourceVersion:"2972", Generation:1,
CreationTimestamp:v1.Time{Time:time.Time{sec:63655425804, nsec:0,
loc:(*time.Location)(0x111a3dc0)}},
DeletionTimestamp:(*v1.Time)(nil),
DeletionGracePeriodSeconds:(*int64)(nil),
Labels:map[string]string{"apiserver":"true"},

Annotations:map[string]string{"kubectl.kubernetes.io/last-applied-configuration

<http://kubectl.kubernetes.io/last-a

Re: OpenStack cloud provider problems

2018-01-17 Thread Tim Dudgeon
No, not yet, but first I think I need to understand what OpenShift is 
trying to do at this point.


Any Red Hatters out there who understand this?


On 17/01/18 10:56, Joel Pearson wrote:
Have you tried an OpenStack users list? It sounds like you need 
someone with in-depth OpenStack knowledge
On Wed, 17 Jan 2018 at 9:55 pm, Tim Dudgeon <tdudgeon...@gmail.com 
<mailto:tdudgeon...@gmail.com>> wrote:


So what does "complete an install" entail?
Presumably  OpenShift/Kubernetes is trying to do something in
OpenStack but this is failing.

But what is it trying to do?


On 17/01/18 10:49, Joel Pearson wrote:

Complete stab in the dark, but maybe your OpenStack account
doesn’t have enough privileges to be able to complete an install?
On Wed, 17 Jan 2018 at 9:46 pm, Tim Dudgeon
<tdudgeon...@gmail.com <mailto:tdudgeon...@gmail.com>> wrote:

I'm still having problems getting the OpenStack cloud
provider running.

I have a minimal OpenShift Origin 3.7 Ansible install that
runs OK. But
when I add the definition for the OpenStack cloud provider
(just the
cloud provider definition, nothing yet that uses it) the
installation
fails like this:

TASK [nickhammond.logrotate : nickhammond.logrotate | Setup
logrotate.d
scripts]

***

RUNNING HANDLER [openshift_node : restart node]


FAILED - RETRYING: restart node (3 retries left).
FAILED - RETRYING: restart node (3 retries left).
FAILED - RETRYING: restart node (3 retries left).
FAILED - RETRYING: restart node (3 retries left).
FAILED - RETRYING: restart node (3 retries left).
FAILED - RETRYING: restart node (2 retries left).
FAILED - RETRYING: restart node (2 retries left).
FAILED - RETRYING: restart node (2 retries left).
FAILED - RETRYING: restart node (2 retries left).
FAILED - RETRYING: restart node (2 retries left).
FAILED - RETRYING: restart node (1 retries left).
FAILED - RETRYING: restart node (1 retries left).
FAILED - RETRYING: restart node (1 retries left).
FAILED - RETRYING: restart node (1 retries left).
FAILED - RETRYING: restart node (1 retries left).
fatal: [orndev-node-000]: FAILED! => {"attempts": 3,
"changed": false,
"msg": "Unable to restart service origin-node: Job for
origin-node.service failed because the control process exited
with error
code. See \"systemctl status origin-node.service\" and
\"journalctl
-xe\" for details.\n"}
fatal: [orndev-node-001]: FAILED! => {"attempts": 3,
"changed": false,
"msg": "Unable to restart service origin-node: Job for
origin-node.service failed because the control process exited
with error
code. See \"systemctl status origin-node.service\" and
\"journalctl
-xe\" for details.\n"}
fatal: [orndev-master-000]: FAILED! => {"attempts": 3,
"changed": false,
"msg": "Unable to restart service origin-node: Job for
origin-node.service failed because the control process exited
with error
code. See \"systemctl status origin-node.service\" and
\"journalctl
-xe\" for details.\n"}
fatal: [orndev-node-002]: FAILED! => {"attempts": 3,
"changed": false,
"msg": "Unable to restart service origin-node: Job for
origin-node.service failed because the control process exited
with error
code. See \"systemctl status origin-node.service\" and
\"journalctl
-xe\" for details.\n"}
fatal: [orndev-infra-000]: FAILED! => {"attempts": 3,
"changed": false,
"msg": "Unable to restart service origin-node: Job for
origin-node.service failed because the control process exited
with error
code. See \"systemctl status origin-node.service\" and
\"journalctl
-xe\" for details.\n"}

RUNNING HANDLER [openshift_node : reload systemd units]


 to retry, use: --limit
@/home/centos/openshift-ansible/playbook

Re: OpenStack cloud provider problems

2018-01-17 Thread Tim Dudgeon

So what does "complete an install" entail?
Presumably  OpenShift/Kubernetes is trying to do something in OpenStack 
but this is failing.


But what is it trying to do?


On 17/01/18 10:49, Joel Pearson wrote:
Complete stab in the dark, but maybe your OpenStack account doesn’t 
have enough privileges to be able to complete an install?
On Wed, 17 Jan 2018 at 9:46 pm, Tim Dudgeon <tdudgeon...@gmail.com 
<mailto:tdudgeon...@gmail.com>> wrote:


I'm still having problems getting the OpenStack cloud provider
running.

I have a minimal OpenShift Origin 3.7 Ansible install that runs
OK. But
when I add the definition for the OpenStack cloud provider (just the
cloud provider definition, nothing yet that uses it) the installation
fails like this:

TASK [nickhammond.logrotate : nickhammond.logrotate | Setup
logrotate.d
scripts]

***

RUNNING HANDLER [openshift_node : restart node]


FAILED - RETRYING: restart node (3 retries left).
FAILED - RETRYING: restart node (3 retries left).
FAILED - RETRYING: restart node (3 retries left).
FAILED - RETRYING: restart node (3 retries left).
FAILED - RETRYING: restart node (3 retries left).
FAILED - RETRYING: restart node (2 retries left).
FAILED - RETRYING: restart node (2 retries left).
FAILED - RETRYING: restart node (2 retries left).
FAILED - RETRYING: restart node (2 retries left).
FAILED - RETRYING: restart node (2 retries left).
FAILED - RETRYING: restart node (1 retries left).
FAILED - RETRYING: restart node (1 retries left).
FAILED - RETRYING: restart node (1 retries left).
FAILED - RETRYING: restart node (1 retries left).
FAILED - RETRYING: restart node (1 retries left).
fatal: [orndev-node-000]: FAILED! => {"attempts": 3, "changed": false,
"msg": "Unable to restart service origin-node: Job for
origin-node.service failed because the control process exited with
error
code. See \"systemctl status origin-node.service\" and \"journalctl
-xe\" for details.\n"}
fatal: [orndev-node-001]: FAILED! => {"attempts": 3, "changed": false,
"msg": "Unable to restart service origin-node: Job for
origin-node.service failed because the control process exited with
error
code. See \"systemctl status origin-node.service\" and \"journalctl
-xe\" for details.\n"}
fatal: [orndev-master-000]: FAILED! => {"attempts": 3, "changed":
false,
"msg": "Unable to restart service origin-node: Job for
origin-node.service failed because the control process exited with
error
code. See \"systemctl status origin-node.service\" and \"journalctl
-xe\" for details.\n"}
fatal: [orndev-node-002]: FAILED! => {"attempts": 3, "changed": false,
"msg": "Unable to restart service origin-node: Job for
origin-node.service failed because the control process exited with
error
code. See \"systemctl status origin-node.service\" and \"journalctl
-xe\" for details.\n"}
fatal: [orndev-infra-000]: FAILED! => {"attempts": 3, "changed":
false,
"msg": "Unable to restart service origin-node: Job for
origin-node.service failed because the control process exited with
error
code. See \"systemctl status origin-node.service\" and \"journalctl
-xe\" for details.\n"}

RUNNING HANDLER [openshift_node : reload systemd units]


 to retry, use: --limit
@/home/centos/openshift-ansible/playbooks/byo/config.retry


Looking on one of the nodes I see this error in the
origin-node.service
logs:

Jan 17 09:40:49 orndev-master-000 origin-node[2419]: E0117
09:40:49.746806    2419 kubelet_node_status.go:106] Unable to register
node "orndev-master-000" with API server: nodes "orndev-master-000" is
forbidden: node 10.0.0.6 cannot modify node orndev-master-000

The /etc/origin/cloudprovider/openstack.conf file has been created OK,
and looks to be what is expected.
But I can't be sure it's specified correctly and will work. In fact,
if I
deliberately change the configuration to use an invalid openstack
username the install fails at the same place, but the error message on
the node is different:

Ja
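
For reference, the generated file follows the usual Kubernetes OpenStack
cloud provider format, roughly like this (the values below are
placeholders rather than the real ones from this cluster):

[Global]
auth-url = https://keystone.example.org:5000/v3
username = openshift
password = changeme
domain-name = Default
tenant-name = openshift-project
region = RegionOne

A quick sanity check is to try the same credentials with the plain
openstack CLI (via the usual OS_* environment variables) to rule out a
simple authentication problem.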

cloud provider problems

2018-01-09 Thread Tim Dudgeon
I'm having problems setting up OpenStack as a cloud provider. In the
Ansible inventory file I have this, along with other parameters defining
the cloud provider.


openshift_cloudprovider_kind = openstack
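
The other parameters are along these lines (check the example inventory
in openshift-ansible for the exact variable names; the values here are
just placeholders):

openshift_cloudprovider_openstack_auth_url=https://keystone.example.org:5000/v3
openshift_cloudprovider_openstack_username=openshift
openshift_cloudprovider_openstack_password=changeme
openshift_cloudprovider_openstack_domain_name=Default
openshift_cloudprovider_openstack_tenant_name=openshift-project
openshift_cloudprovider_openstack_region=RegionOne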

When this is present, OpenShift fails to deploy, and I get this error on
the nodes as reported by "journalctl -xe":


kubelet_node_status.go:106] Unable to register node "orndev-master-000" 
with API server: nodes "orndev-master-000" is forbidden: node 10.0.0.14 
cannot modify node orndev-master-000


"orndev-master-000" is the resolvable hostname of the node and 10.0.0.14 
is its IP address.


Any suggestions what the "Unable to register node" error is about?
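
In case it helps with diagnosis: as far as I understand it, the kubelet
registers under the nodeName from /etc/origin/node/node-config.yaml, or
under the name the cloud provider reports when one is enabled, so a
first check is to compare those with what the API server sees:

grep nodeName /etc/origin/node/node-config.yaml
oc get nodes -o wide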

___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: Issues with logging and metrics on Origin 3.7

2018-01-08 Thread Tim Dudgeon

Ah, so that makes more sense.

So can I define the persistence properties (e.g. using NFS) in the
inventory file but specify 'openshift_metrics_install_metrics=false',
and then run the byo/config.yml playbook? Will that create the PVs but
not deploy metrics? Then I could later run the
byo/openshift-cluster/openshift-metrics.yml playbook to actually deploy the metrics.


The reason I'm doing this in 2 stages is that I sometimes hit 'Unable to 
allocate memory' problems when trying to deploy everything with 
byo/config.yml (possibly due to the 'forks' setting in ansible.cfg).
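
If it helps, the two-stage run I have in mind would look roughly like
this (assuming the metrics storage variables stay in the inventory for
both runs and only openshift_metrics_install_metrics gets flipped):

# stage 1: cluster install with metrics disabled but the NFS storage vars set
ansible-playbook openshift-ansible/playbooks/byo/config.yml
# stage 2: set openshift_metrics_install_metrics=true, then deploy metrics
ansible-playbook openshift-ansible/playbooks/byo/openshift-cluster/openshift-metrics.yml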



On 08/01/18 17:49, Eric Wolinetz wrote:
I think the issue you're seeing stems from the fact that the logging 
and metrics playbooks do not create their own PVs. That is handled by 
the cluster install playbook.
The logging and metrics playbooks only create the PVCs that their 
objects may require (unless ephemeral storage is configured).


I admit the naming of the variables makes that confusing; however, it
is described in our docs, umbrella'd under the advanced install section,
which uses the cluster playbook...

https://docs.openshift.com/container-platform/3.7/install_config/install/advanced_install.html#advanced-install-cluster-metrics

On Mon, Jan 8, 2018 at 11:22 AM, Tim Dudgeon <tdudgeon...@gmail.com 
<mailto:tdudgeon...@gmail.com>> wrote:


On 08/01/18 16:51, Luke Meyer wrote:



On Thu, Jan 4, 2018 at 10:39 AM, Tim Dudgeon
<tdudgeon...@gmail.com <mailto:tdudgeon...@gmail.com>> wrote:

I'm hitting a number of issues with installing logging and
metrics on Origin 3.7.
This is using Centos7 hosts, the release-3.7 branch of
openshift-ansible and NFS for persistent storage.

I first do a minimal deploy with logging and metrics turned off.
This goes fine. On the NFS server I see various volumes
exported under /exports for logging, metrics, prometheus,
even though these are not deployed, but that's fine, they
are there if they become needed.
As expected there are no PVs related to metrics and logging.

So I try to install metrics. I add this to the inventory file:

openshift_metrics_install_metrics=true
openshift_metrics_storage_kind=nfs
openshift_metrics_storage_access_modes=['ReadWriteOnce']
openshift_metrics_storage_nfs_directory=/exports
openshift_metrics_storage_nfs_options='*(rw,root_squash)'
openshift_metrics_storage_volume_name=metrics
openshift_metrics_storage_volume_size=10Gi
openshift_metrics_storage_labels={'storage': 'metrics'}

and run:

ansible-playbook
openshift-ansible/playbooks/byo/openshift-cluster/openshift-metrics.yml

All seems to install OK, but metrics can't start, and it
turns out that no PV is created so the PVC needed by Cassandra
can't be satisfied.
So I manually create the PV using this definition:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: metrics-pv
  labels:
    storage: metrics
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Recycle
  nfs:
    path: /exports/metrics
    server: nfsserver

Now the PVC is satisfied and metrics can be started (though
pods may need to be bounced because they have timed out).

ISSUE 1: why does the metrics PV not get created?


So now on to trying to install logging. The approach is
similar. Add this to the inventory file:

openshift_logging_install_logging=true
openshift_logging_storage_kind=nfs
openshift_logging_storage_access_modes=['ReadWriteOnce']
openshift_logging_storage_nfs_directory=/exports
openshift_logging_storage_nfs_options='*(rw,root_squash)'
openshift_logging_storage_volume_name=logging
openshift_logging_storage_volume_size=10Gi
openshift_logging_storage_labels={'storage': 'logging'}

and run:
ansible-playbook
openshift-ansible/playbooks/byo/openshift-cluster/openshift-logging.yml

Logging installs fine, and is running fine. Kibana shows logs.
But look at what has been installed and there are no PVs or
PVCs for logging. It seems it has ignored the instructions to
use NFS and deployed using ephemeral storage.

ISSUE 2: why do the persistence definitions get ignored?


I'm not entirely sure that under kind=nfs it's *supposed* to
create a PVC. Might just directly mount the volume.

One thing to check: did you set up a host in the [nfs] group in
your inventory?

Yes, there is an NFS server, and it's working fine (e.g. for the
docker registry)



And finally, looking at the metrics and logging images on
Docker Hub there are 

Re: Issues with logging and metrics on Origin 3.7

2018-01-08 Thread Tim Dudgeon

On 08/01/18 16:51, Luke Meyer wrote:



On Thu, Jan 4, 2018 at 10:39 AM, Tim Dudgeon <tdudgeon...@gmail.com 
<mailto:tdudgeon...@gmail.com>> wrote:


I'm hitting a number of issues with installing logging and metrics
on Origin 3.7.
This is using Centos7 hosts, the release-3.7 branch of
openshift-ansible and NFS for persistent storage.

I first do a minimal deploy with logging and metrics turned off.
This goes fine. On the NFS server I see various volumes exported
under /exports for logging, metrics, prometheus, even though
these are not deployed, but that's fine, they are there if they
become needed.
As expected there are no PVs related to metrics and logging.

So I try to install metrics. I add this to the inventory file:

openshift_metrics_install_metrics=true
openshift_metrics_storage_kind=nfs
openshift_metrics_storage_access_modes=['ReadWriteOnce']
openshift_metrics_storage_nfs_directory=/exports
openshift_metrics_storage_nfs_options='*(rw,root_squash)'
openshift_metrics_storage_volume_name=metrics
openshift_metrics_storage_volume_size=10Gi
openshift_metrics_storage_labels={'storage': 'metrics'}

and run:

ansible-playbook
openshift-ansible/playbooks/byo/openshift-cluster/openshift-metrics.yml

All seems to install OK, but metrics can't start, and it turns out
that no PV is created so the PVC needed by Cassandra can't be
satisfied.
So I manually create the PV using this definition:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: metrics-pv
  labels:
    storage: metrics
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Recycle
  nfs:
    path: /exports/metrics
    server: nfsserver

Now the PVC is satisfied and metrics can be started (though pods
may need to be bounced because they have timed out).

ISSUE 1: why does the metrics PV not get created?


So now on to trying to install logging. The approach is similar.
Add this to the inventory file:

openshift_logging_install_logging=true
openshift_logging_storage_kind=nfs
openshift_logging_storage_access_modes=['ReadWriteOnce']
openshift_logging_storage_nfs_directory=/exports
openshift_logging_storage_nfs_options='*(rw,root_squash)'
openshift_logging_storage_volume_name=logging
openshift_logging_storage_volume_size=10Gi
openshift_logging_storage_labels={'storage': 'logging'}

and run:
ansible-playbook
openshift-ansible/playbooks/byo/openshift-cluster/openshift-logging.yml

Logging installs fine, and is running fine. Kibana shows logs.
But look at what has been installed and there are no PVs or PVCs
for logging. It seems it has ignored the instructions to use NFS
and deployed using ephemeral storage.

ISSUE 2: why do the persistence definitions get ignored?


I'm not entirely sure that under kind=nfs it's *supposed* to create a 
PVC. Might just directly mount the volume.


One thing to check: did you set up a host in the [nfs] group in your 
inventory?
Yes, there is an NFS server, and it's working fine (e.g. for the docker
registry)



And finally, looking at the metrics and logging images on Docker
Hub there are none with
v3.7.0 or v3.7 tags. The only tag related to 3.7 is v3.7.0-rc.0.
For example look here:
https://hub.docker.com/r/openshift/origin-metrics-hawkular-metrics/tags/
<https://hub.docker.com/r/openshift/origin-metrics-hawkular-metrics/tags/>
But for other openshift components there is a v3.7.0 tag present.
Without specifying any particular tag to use for metrics or
logging it seems you get 'latest' installed.

ISSUE 3: is 3.7 officially released yet (there's no docs for this
here either: https://docs.openshift.org/index.html
<https://docs.openshift.org/index.html>)?



3.7 is released. Seems like those dockerhub images (tags) got lost in 
the shuffle though.

OK. They will presumably appear sometime soon?
What about docs for 3.7? https://docs.openshift.org/index.html
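
Until the v3.7.0 tags appear, I guess a workaround is to pin a tag that
does exist via the inventory, something like this (assuming the usual
image version variables; adjust the tag as appropriate):

openshift_metrics_image_version=v3.7.0-rc.0
openshift_logging_image_version=v3.7.0-rc.0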


___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: Deployment to OpenStack

2018-01-05 Thread Tim Dudgeon
OK, so I tried setting `openstack_use_bastion: True`. Servers were 
provisioned OK. Public IP addresses were only applied to the infra and 
dns nodes (not master).


But the inventory/hosts file that gets auto-generated by this process 
still contains the "public" hostnames that can't be reached, even if put 
into DNS. Also, I expected to see a bastion node, but none was created.


I find the docs for this a bit baffling. Is there anyone on this list 
who was involved with creating this who can help get this straight?


On 04/01/18 23:13, Joel Pearson wrote:

Hi Tim,

Yes, I only discovered what the bastion setting did by looking at the
heat template, as I was going to try and remove the need for the 
bastion by myself.


I found this line in the heat template:
https://github.com/openshift/openshift-ansible-contrib/blob/master/roles/openstack-stack/templates/heat_stack.yaml.j2#L75

I don't know what provider_network does. But you might want to grep 
around the repo chasing down those settings to see if it suits your 
purposes. It seems a bit undocumented.


In regards to creating private floating IPs, this is what we did for
our on-premise OpenStack, because we wanted to have floating IPs that
allowed other computers outside the OpenStack network to be able to
connect to individual servers.


I don't know what sort of privileges you need to run this command, so 
it might not work for you.


openstack network create --external --provider-physical-network flat
--provider-network-type flat public
openstack subnet create --network public --allocation-pool
start=10.2.100.1,end=10.2.100.254 --dns-nameserver 10.2.0.1 --gateway
10.2.0.1 --subnet-range 10.2.0.0/16 public


Instead of public, you could call it something else.

So the end result of that command was that when openshift ansible 
asked for a floating ip, we'd get an IP address in the range of 
10.2.100.1-254.


Hope it helps.

Thanks,

Joel

On Fri, Jan 5, 2018 at 8:18 AM Tim Dudgeon <tdudgeon...@gmail.com 
<mailto:tdudgeon...@gmail.com>> wrote:


Joel,

Thanks for that.
I had seen this but didn't really understand what it meant.
Having read through it again I still don't!
I'll give it a try tomorrow and see what happens.

As for the warning about scaling up/down then yes, that is a big
concern. That's the whole point of getting automation in place.
So if anyone can shed any light on this then please do so!

Could you explain more about 'an alternative is to create a
floating IP range that uses private non-routable IP addresses'?


On 04/01/18 20:17, Joel Pearson wrote:

I had exactly the same concern and I discovered that inside the
heat template there is a bastion mode, which, once enabled,
doesn't use floating IPs any more.

Have a look at

https://github.com/openshift/openshift-ansible-contrib/blob/master/playbooks/provisioning/openstack/advanced-configuration.md

I think you want openstack_use_bastion: True but I am yet to test
it out so I’d recommend checking the heat template to see if it
does what I think it does.

At the bottom of that advanced page it mentions that in bastion
mode scale up doesn’t work for some reason, so I don’t know if
that matters for you.

Otherwise an alternative is to create a floating IP range that
uses private non-routable IP addresses. That's what we're using
in our on-premise OpenStack. But only because we hadn’t
discovered the bastion mode at the time.

Hope that helps.
On Fri, 5 Jan 2018 at 4:10 am, Tim Dudgeon <tdudgeon...@gmail.com
<mailto:tdudgeon...@gmail.com>> wrote:

I hope this is the right place to ask questions about the
openshift/openshift-ansible-contrib GitHub repo, and
specifically the
playbooks for installing OpenShift on OpenStack:

https://github.com/openshift/openshift-ansible-contrib/tree/master/playbooks/provisioning/openstack
If not then please redirect me.

By following the instructions in that link I successfully ran
a basic
deployment that involved provisioning the OpenStack servers
and then
deploying OpenShift using the byo config.yaml playbook. But
in doing so
it's immediately obvious that this approach is not really
viable as
public IP addresses are assigned to every node. It should only be
necessary to have public IP addresses for the master and the
infrastructure node hosting the router.

My expectation is that the best way to handle this would be to:

1. provision the basic openstack networking environment plus
a bastion
node from outside the openstack environment
2. from that bastion node provision the nodes that will form the
OpenShift cluster and deploy OpenShift to those.


Re: Deployment to OpenStack

2018-01-04 Thread Tim Dudgeon

Joel,

Thanks for that.
I had seen this but didn't really understand what it meant.
Having read through it again I still don't!
I'll give it a try tomorrow and see what happens.

As for the warning about scaling up/down then yes, that is a big 
concern. That's the whole point of getting automation in place.

So if anyone can shed any light on this then please do so!

Could you explain more about 'an alternative is to create a floating IP
range that uses private non-routable IP addresses'?



On 04/01/18 20:17, Joel Pearson wrote:
I had exactly the same concern and I discovered that inside the heat
template there is a bastion mode, which, once enabled, doesn't use
floating IPs any more.


Have a look at 
https://github.com/openshift/openshift-ansible-contrib/blob/master/playbooks/provisioning/openstack/advanced-configuration.md


I think you want openstack_use_bastion: True but I am yet to test it 
out so I’d recommend checking the heat template to see if it does what 
I think it does.


At the bottom of that advanced page it mentions that in bastion mode 
scale up doesn’t work for some reason, so I don’t know if that matters 
for you.


Otherwise an alternative is to create a floating IP range that uses
private non-routable IP addresses. That's what we're using in our
on-premise OpenStack. But only because we hadn't discovered the
bastion mode at the time.


Hope that helps.
On Fri, 5 Jan 2018 at 4:10 am, Tim Dudgeon <tdudgeon...@gmail.com 
<mailto:tdudgeon...@gmail.com>> wrote:


I hope this is the right place to ask questions about the
openshift/openshift-ansible-contrib GitHub repo, and specifically the
playbooks for installing OpenShift on OpenStack:

https://github.com/openshift/openshift-ansible-contrib/tree/master/playbooks/provisioning/openstack
If not then please redirect me.

By following the instructions in that link I successfully ran a basic
deployment that involved provisioning the OpenStack servers and then
deploying OpenShift using the byo config.yaml playbook. But in
doing so
it's immediately obvious that this approach is not really viable as
public IP addresses are assigned to every node. It should only be
necessary to have public IP addresses for the master and the
infrastructure node hosting the router.

My expectation is that the best way to handle this would be to:

1. provision the basic openstack networking environment plus a bastion
node from outside the openstack environment
2. from that bastion node provision the nodes that will form the
OpenShift cluster and deploy OpenShift to those.

Are there any examples along those lines?


___
users mailing list
users@lists.openshift.redhat.com
<mailto:users@lists.openshift.redhat.com>
http://lists.openshift.redhat.com/openshiftmm/listinfo/users



___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Deployment to OpenStack

2018-01-04 Thread Tim Dudgeon
I hope this is the right place to ask questions about the 
openshift/openshift-ansible-contrib GitHub repo, and specifically the 
playbooks for installing OpenShift on OpenStack:

https://github.com/openshift/openshift-ansible-contrib/tree/master/playbooks/provisioning/openstack
If not then please redirect me.

By following the instructions in that link I successfully ran a basic 
deployment that involved provisioning the OpenStack servers and then 
deploying OpenShift using the byo config.yaml playbook. But in doing so 
it's immediately obvious that this approach is not really viable as 
public IP addresses are assigned to every node. It should only be 
necessary to have public IP addresses for the master and the 
infrastructure node hosting the router.


My expectation is that the best way to handle this would be to:

1. provision the basic openstack networking environment plus a bastion 
node from outside the openstack environment
2. from that bastion node provision the nodes that will form the 
OpenShift cluster and deploy OpenShift to those.


Are there any examples along those lines?


___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Issues with logging and metrics on Origin 3.7

2018-01-04 Thread Tim Dudgeon
I'm hitting a number of issues with installing logging and metrics on 
Origin 3.7.
This is using Centos7 hosts, the release-3.7 branch of openshift-ansible 
and NFS for persistent storage.


I first do a minimal deploy with logging and metrics turned off.
This goes fine. On the NFS server I see various volumes exported under 
/exports for logging, metrics, prometheus, even though these are not
deployed, but that's fine, they are there if they become needed.

As expected there are no PVs related to metrics and logging.

So I try to install metrics. I add this to the inventory file:

openshift_metrics_install_metrics=true
openshift_metrics_storage_kind=nfs
openshift_metrics_storage_access_modes=['ReadWriteOnce']
openshift_metrics_storage_nfs_directory=/exports
openshift_metrics_storage_nfs_options='*(rw,root_squash)'
openshift_metrics_storage_volume_name=metrics
openshift_metrics_storage_volume_size=10Gi
openshift_metrics_storage_labels={'storage': 'metrics'}

and run:

ansible-playbook 
openshift-ansible/playbooks/byo/openshift-cluster/openshift-metrics.yml


All seems to install OK, but metrics can't start, and it turns out that 
no PV is created so the PVC needed by Cassandra can't be satisfied.

So I manually create the PV using this definition:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: metrics-pv
  labels:
    storage: metrics
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Recycle
  nfs:
    path: /exports/metrics
    server: nfsserver

Now the PVC is satisfied and metrics can be started (though pods may 
need to be bounced because they have timed out).
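
A quick way to confirm the PV actually bound (assuming metrics went into
the default openshift-infra project):

oc get pv
oc get pvc -n openshift-infra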


ISSUE 1: why does the metrics PV not get created?


So now on to trying to install logging. The approach is similar. Add 
this to the inventory file:


openshift_logging_install_logging=true
openshift_logging_storage_kind=nfs
openshift_logging_storage_access_modes=['ReadWriteOnce']
openshift_logging_storage_nfs_directory=/exports
openshift_logging_storage_nfs_options='*(rw,root_squash)'
openshift_logging_storage_volume_name=logging
openshift_logging_storage_volume_size=10Gi
openshift_logging_storage_labels={'storage': 'logging'}

and run:
ansible-playbook 
openshift-ansible/playbooks/byo/openshift-cluster/openshift-logging.yml


Logging installs fine, and is running fine. Kibana shows logs.
But look at what has been installed and there are no PVs or PVCs for
logging. It seems it has ignored the instructions to use NFS and
deployed using ephemeral storage.


ISSUE 2: why do the persistence definitions get ignored?


And finally, looking at the metrics and logging images on Docker Hub 
there are none with
v3.7.0 or v3.7 tags. The only tag related to 3.7 is v3.7.0-rc.0. For 
example look here:

https://hub.docker.com/r/openshift/origin-metrics-hawkular-metrics/tags/
But for other openshift components there is a v3.7.0 tag present.
Without specifying any particular tag to use for metrics or logging it 
seems you get 'latest' installed.


ISSUE 3: is 3.7 officially released yet (there's no docs for this here 
either: https://docs.openshift.org/index.html)?


___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: openvswitch?

2018-01-03 Thread Tim Dudgeon
Looks like this problem has fixed itself over the last couple of weeks
(I just updated openshift-ansible on the release-3.7 branch).

That package dependency error is no longer happening.
It now seems possible to deploy a minimal 3.7 distribution using the 
Ansible installer.

I have no idea what the source of the problem was or what has changed.
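
For anyone hitting the same thing later, a quick way to check whether
openvswitch is resolvable and which repo would supply it (on a stock
CentOS 7 host):

yum repolist enabled | grep -i -e paas -e openshift
yum info openvswitch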


On 22/12/17 10:09, Tim Dudgeon wrote:


I tried disabling the package checks but this just pushes the failure 
down the line:


  1. Hosts:    host-10-0-0-10, host-10-0-0-12, host-10-0-0-13, 
host-10-0-0-6, host-10-0-0-9

 Play: Configure nodes
 Task: Install sdn-ovs package
 Message:  Error: Package: origin-sdn-ovs-3.7.0-1.0.7ed6862.x86_64 
(centos-openshift-origin37)

  Requires: openvswitch >= 2.6.1

Something seems broken with the package dependencies?

This happens when trying to install v3.7 using openshift-ansible from 
branch release-3.7.

openshift_deployment_type=origin
openshift_release=v3.7


On 21/12/17 16:48, Tim Dudgeon wrote:


Yes, but is this error a result of broken dependencies in the RPMs?
There's no mention of needing to install openvswitch as part of the
prerequisites mentioned here:
https://docs.openshift.org/latest/install_config/install/host_preparation.html 




On 20/12/17 20:27, Joel Pearson wrote:
It’s in the paas repo 
http://mirror.centos.org/centos/7/paas/x86_64/openshift-origin/
On Thu, 21 Dec 2017 at 1:09 am, Tim Dudgeon <tdudgeon...@gmail.com 
<mailto:tdudgeon...@gmail.com>> wrote:


I just started hitting this error when using the ansible installer
(installing v3.7 from openshift-ansible on branch release-3.7).

1. Hosts:    host-10-0-0-10, host-10-0-0-13, host-10-0-0-7,
host-10-0-0-8, host-10-0-0-9
  Play: OpenShift Health Checks
  Task: Run health checks (install) - EL
  Message:  One or more checks failed
  Details:  check "package_availability":
    Could not perform a yum update.
    Errors from dependency resolution:
  origin-sdn-ovs-3.7.0-1.0.7ed6862.x86_64 requires
openvswitch >= 2.6.1
    You should resolve these issues before
proceeding with
an install.
    You may need to remove or downgrade packages or
enable/disable yum repositories.

    check "package_version":
    Not all of the required packages are available
at their
requested version
    openvswitch:['2.6', '2.7', '2.8']
    Please check your subscriptions and enabled
repositories.

This was not happening before. Where does openvswitch come from?
Can't
find it in the standard rpm repos.

Tim

___
users mailing list
users@lists.openshift.redhat.com
<mailto:users@lists.openshift.redhat.com>
http://lists.openshift.redhat.com/openshiftmm/listinfo/users







___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: Trying to import Tomcat container

2018-01-01 Thread Tim Dudgeon

Which Tomcat image is this based on?
Probably the problem is that the image runs as the root user (e.g. the
standard Docker image on Docker Hub does this).
By default OpenShift does not allow containers to run as the root user;
instead it tries to run them as an OpenShift-assigned non-privileged user.
This is probably the case for you, so your container is running as a user
that does not have privileges to read the server.xml file.


A workaround for this is described here:
https://docs.openshift.org/latest/admin_guide/manage_scc.html#enable-dockerhub-images-that-require-root
But it is better to use an image that can run as an arbitrarily assigned
user ID.
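
For completeness, the workaround in that link boils down to something
like the following, run as a cluster admin ("myproject" being a
placeholder for the project the image runs in):

oc adm policy add-scc-to-user anyuid -z default -n myproject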


On 01/01/18 15:15, Hetz Ben Hamo wrote:

Hi,

I'm trying to import a simple Tomcat container with the 'calendar'
webapp (already in the image) into OpenShift Origin (3.7).


As a container, it runs well and I can see the webapp running on the 
browser. I save the container, import it to OpenShift - import ok.


Then when I try to create a new app with the Image, I'm getting this 
error:


Jan 01, 2018 3:09:19 PM org.apache.catalina.startup.Catalina load
WARNING: Unable to load server configuration from 
[/opt/apache-tomcat-8.5.14/conf/server.xml]

Jan 01, 2018 3:09:19 PM org.apache.catalina.startup.Catalina load
WARNING: Unable to load server configuration from 
[/opt/apache-tomcat-8.5.14/conf/server.xml]

Jan 01, 2018 3:09:19 PM org.apache.catalina.startup.Catalina start
SEVERE: Cannot start server. Server instance is not configured.

I checked the container, and the 
/opt/apache-tomcat-8.5.14/conf/server.xml is right there in the
correct path.


Am I missing something?

Thanks


___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: openvswitch?

2017-12-22 Thread Tim Dudgeon
I tried disabling the package checks but this just pushes the failure 
down the line:


  1. Hosts:    host-10-0-0-10, host-10-0-0-12, host-10-0-0-13, 
host-10-0-0-6, host-10-0-0-9

 Play: Configure nodes
 Task: Install sdn-ovs package
 Message:  Error: Package: origin-sdn-ovs-3.7.0-1.0.7ed6862.x86_64 
(centos-openshift-origin37)

  Requires: openvswitch >= 2.6.1

Something seems broken with the package dependencies?

This happens when trying to install v3.7 using openshift-ansible from 
branch release-3.7.

openshift_deployment_type=origin
openshift_release=v3.7


On 21/12/17 16:48, Tim Dudgeon wrote:


Yes, but is this error a result of broken dependencies in the RPMs?
There's no mention of needing to install openvswitch as part of the
prerequisites mentioned here:
https://docs.openshift.org/latest/install_config/install/host_preparation.html 




On 20/12/17 20:27, Joel Pearson wrote:
It’s in the paas repo 
http://mirror.centos.org/centos/7/paas/x86_64/openshift-origin/
On Thu, 21 Dec 2017 at 1:09 am, Tim Dudgeon <tdudgeon...@gmail.com 
<mailto:tdudgeon...@gmail.com>> wrote:


I just started hitting this error when using the ansible installer
(installing v3.7 from openshift-ansible on branch release-3.7).

1. Hosts:    host-10-0-0-10, host-10-0-0-13, host-10-0-0-7,
host-10-0-0-8, host-10-0-0-9
  Play: OpenShift Health Checks
  Task: Run health checks (install) - EL
  Message:  One or more checks failed
  Details:  check "package_availability":
    Could not perform a yum update.
    Errors from dependency resolution:
  origin-sdn-ovs-3.7.0-1.0.7ed6862.x86_64 requires
openvswitch >= 2.6.1
    You should resolve these issues before proceeding
with
an install.
    You may need to remove or downgrade packages or
enable/disable yum repositories.

    check "package_version":
    Not all of the required packages are available at
their
requested version
    openvswitch:['2.6', '2.7', '2.8']
    Please check your subscriptions and enabled
repositories.

This was not happening before. Where does openvswitch come from?
Can't
find it in the standard rpm repos.

Tim

___
users mailing list
users@lists.openshift.redhat.com
<mailto:users@lists.openshift.redhat.com>
http://lists.openshift.redhat.com/openshiftmm/listinfo/users





___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: openvswitch?

2017-12-21 Thread Tim Dudgeon

Yes, but is this error a result of broken dependencies in the RPMs?
There's no mention of needing to install openvswitch as part of the
prerequisites mentioned here:
https://docs.openshift.org/latest/install_config/install/host_preparation.html 




On 20/12/17 20:27, Joel Pearson wrote:
It’s in the paas repo 
http://mirror.centos.org/centos/7/paas/x86_64/openshift-origin/
On Thu, 21 Dec 2017 at 1:09 am, Tim Dudgeon <tdudgeon...@gmail.com 
<mailto:tdudgeon...@gmail.com>> wrote:


I just started hitting this error when using the ansible installer
(installing v3.7 from openshift-ansible on branch release-3.7).

1. Hosts:    host-10-0-0-10, host-10-0-0-13, host-10-0-0-7,
host-10-0-0-8, host-10-0-0-9
  Play: OpenShift Health Checks
  Task: Run health checks (install) - EL
  Message:  One or more checks failed
  Details:  check "package_availability":
    Could not perform a yum update.
    Errors from dependency resolution:
  origin-sdn-ovs-3.7.0-1.0.7ed6862.x86_64 requires
openvswitch >= 2.6.1
    You should resolve these issues before proceeding with
an install.
    You may need to remove or downgrade packages or
enable/disable yum repositories.

    check "package_version":
    Not all of the required packages are available at
their
requested version
    openvswitch:['2.6', '2.7', '2.8']
    Please check your subscriptions and enabled
repositories.

This was not happening before. Where does openvswitch come from? Can't
find it in the standard rpm repos.

Tim

___
users mailing list
users@lists.openshift.redhat.com
<mailto:users@lists.openshift.redhat.com>
http://lists.openshift.redhat.com/openshiftmm/listinfo/users



___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users

