RE: 3.9 Default Router Malfunction When 1 of 3 Pods is Down

2018-09-02 Thread Stan Varlamov
Yes, the Reference architecture 
(https://access.redhat.com/documentation/en-us/reference_architectures/2018/html-single/deploying_and_managing_openshift_3.9_on_amazon_web_services/)
 describes the masters located on boxes separate from those for the 
Infrastructure nodes, and it shows separate ELBs for the two sets as well. I 
don’t see it specifically explained which URLs should be assigned to which of 
the two ELBs, but I assume that only the web console URL is assigned to the 
Master’s ELB and the apps URL – to the Router’s ELB. In my case, having an 
expansive infrastructure for the system vs. the application nodes is 
cost-prohibitive compared to solutions like AWS ECS, so I’m looking to migrate 
my apps off the OpenShift install anyway, but it is still puzzling what 
specifically caused the outage. In the initial install, I had 3 masters 
co-located with the etcd and infrastructure nodes, and the ALB passing all port 
80/443 and 8443 traffic to those machines – this is a pretty typical install 
described in user blogs. Back to the issue: when one of the 3 machines linked 
to the ALB did not have a working oc router on it, some app routes were not 
accessible. It would be nice to get an explanation for this that can also 
benefit others, e.g., if this configuration is particularly dangerous for this 
specific reason.
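
As a sanity check, something like the following (a rough sketch; the label and 
output depend on the actual install) shows which nodes really run router pods, 
i.e. which instances the apps 80/443 load balancer should target, while the 
masters’ ELB only needs the 8443 console/API hostname:

# Pods created by the default router deploymentconfig carry the
# deploymentconfig=router label; -o wide shows the node each replica runs on.
oc -n default get pods -l deploymentconfig=router -o wide

# The router service and ports the apps load balancer is expected to reach:
oc -n default get svc router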

 
> I’m really confused what you are trying to do.  You should not front the 
> apiserver with a router.  The router and the masters are generally best not 
> to collocate unless your bandwidth requirements are low, but it’s much more 
> effective to schedule the routers on nodes and keep that traffic separate 
> from a resiliency perspective. 

> The routers need the masters to be available (2/3 min) to receive their route 
> configuration when restarting, but require no interconnection to serve 
> traffic.




___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: openshift-ansible release-3.10 - Install fails with control plane pods

2018-09-02 Thread Marc Schlegel
Well, I found two options for the inventory:

openshift_ip

# host group for masters
[masters]
master openshift_ip=192.168.60.150
# host group for etcd
[etcd]
master openshift_ip=192.168.60.150
# host group for nodes, includes region info
[nodes]
master openshift_node_group_name='node-config-master' openshift_ip=192.168.60.150
infra openshift_node_group_name='node-config-infra' openshift_ip=192.168.60.151
app1 openshift_node_group_name='node-config-compute' openshift_ip=192.168.60.152
app2 openshift_node_group_name='node-config-compute' openshift_ip=192.168.60.153


and flannel

openshift_use_openshift_sdn=false 
openshift_use_flannel=true 
flannel_interface=eth1


The etcd logs are looking good now, but the problem still seems to be that there 
is no SSL port open.

Here are some lines I could pull from journalctl on the master:

Sep 02 19:17:38 master.vnet.de origin-node[6300]: I0902 19:17:38.720037 6300 
certificate_manager.go:216] Certificate rotation is enabled.
Sep 02 19:17:38 master.vnet.de origin-node[6300]: I0902 19:17:38.720453 6300 
manager.go:154] cAdvisor running in container: "/sys/fs/cgroup/cpu,cpuacct"
Sep 02 19:17:38 master.vnet.de origin-node[6300]: I0902 19:17:38.738257 6300 
certificate_manager.go:287] Rotating certificates
Sep 02 19:17:38 master.vnet.de origin-node[6300]: E0902 19:17:38.752531 6300 
certificate_manager.go:299] Failed while requesting a signed certificate from 
the master: cannot create certificate signing request: Post 
https://master.vnet.de:8443/apis/certificates.k8s.io/v
Sep 02 19:17:38 master.vnet.de origin-node[6300]: I0902 19:17:38.778490 6300 
fs.go:142] Filesystem UUIDs: map[570897ca-e759-4c81-90cf-389da6eee4cc:/dev/vda2 
b60e9498-0baa-4d9f-90aa-069048217fee:/dev/dm-0 
c39c5bed-f37c-4263-bee8-aeb6a6659d7b:/dev/dm-1]
Sep 02 19:17:38 master.vnet.de origin-node[6300]: I0902 19:17:38.778506 6300 
fs.go:143] Filesystem partitions: map[tmpfs:{mountpoint:/dev/shm major:0 
minor:19 fsType:tmpfs blockSize:0} 
/dev/mapper/VolGroup00-LogVol00:{mountpoint:/var/lib/docker/overlay2 major:253 
minor
Sep 02 19:17:38 master.vnet.de origin-node[6300]: I0902 19:17:38.780130 6300 
manager.go:227] Machine: {NumCores:1 CpuFrequency:2808000 
MemoryCapacity:3974230016 HugePages:[{PageSize:1048576 NumPages:0} 
{PageSize:2048 NumPages:0}] MachineID:6c1357b9e4a54b929e1d09cacf37e
Sep 02 19:17:38 master.vnet.de origin-node[6300]: I0902 19:17:38.783655 6300 
manager.go:233] Version: {KernelVersion:3.10.0-862.2.3.el7.x86_64 
ContainerOsVersion:CentOS Linux 7 (Core) DockerVersion:1.13.1 
DockerAPIVersion:1.26 CadvisorVersion: CadvisorRevision:}
Sep 02 19:17:38 master.vnet.de origin-node[6300]: I0902 19:17:38.784251 6300 
server.go:621] --cgroups-per-qos enabled, but --cgroup-root was not specified.  
defaulting to /
Sep 02 19:17:38 master.vnet.de origin-node[6300]: I0902 19:17:38.784524 6300 
container_manager_linux.go:242] container manager verified user specified 
cgroup-root exists: /
Sep 02 19:17:38 master.vnet.de origin-node[6300]: I0902 19:17:38.784533 6300 
container_manager_linux.go:247] Creating Container Manager object based on Node 
Config: {RuntimeCgroupsName: SystemCgroupsName: KubeletCgroupsName: 
ContainerRuntime:docker CgroupsPerQOS:true C
Sep 02 19:17:38 master.vnet.de origin-node[6300]: I0902 19:17:38.784609 6300 
container_manager_linux.go:266] Creating device plugin manager: true
Sep 02 19:17:38 master.vnet.de origin-node[6300]: I0902 19:17:38.784616 6300 
manager.go:102] Creating Device Plugin manager at 
/var/lib/kubelet/device-plugins/kubelet.sock
Sep 02 19:17:38 master.vnet.de origin-node[6300]: I0902 19:17:38.784714 6300 
state_mem.go:36] [cpumanager] initializing new in-memory state store
Sep 02 19:17:38 master.vnet.de origin-node[6300]: I0902 19:17:38.784944 6300 
state_file.go:82] [cpumanager] state file: created new state file 
"/var/lib/origin/openshift.local.volumes/cpu_manager_state"
Sep 02 19:17:38 master.vnet.de origin-node[6300]: I0902 19:17:38.784988 6300 
server.go:895] Using root directory: /var/lib/origin/openshift.local.volumes
Sep 02 19:17:38 master.vnet.de origin-node[6300]: I0902 19:17:38.785013 6300 
kubelet.go:273] Adding pod path: /etc/origin/node/pods
Sep 02 19:17:38 master.vnet.de origin-node[6300]: I0902 19:17:38.785046 6300 
file.go:52] Watching path "/etc/origin/node/pods"
Sep 02 19:17:38 master.vnet.de origin-node[6300]: I0902 19:17:38.785054 6300 
kubelet.go:298] Watching apiserver
Sep 02 19:17:38 master.vnet.de origin-node[6300]: E0902 19:17:38.796651 6300 
reflector.go:205] 
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/kubelet.go:461:
 Failed to list *v1.Node: Get 
https://master.vnet.de:8443/api/v1/nodes?fieldSelector=metadata.
Sep 02 19:17:38 master.vnet.de origin-node[6300]: E0902 19:17:38.796695 6300 
reflector.go:205] 
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/kubelet.go:452:
 Failed to list *v1.Service: Get 
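
A rough way to narrow down whether the API is listening at all (a sketch; the 
hostname and port are taken from the logs above):

sudo ss -tlnp | grep 8443                     # is anything bound to 8443 on the master?
curl -k https://master.vnet.de:8443/healthz   # should return "ok" once the apiserver is up
ls /etc/origin/node/pods/                     # static pod manifests the node service is supposed to launch
sudo docker ps -a | grep api                  # is the api container running, crash-looping, or missing entirely?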

Re: openshift-ansible release-3.10 - Install fails with control plane pods

2018-09-02 Thread Marc Schlegel
I might have found something... it could be a Vagrant issue.

Vagrant uses two network interfaces: one for its own SSH access, the other uses 
the IP configured in the Vagrantfile.
Here's a log from the etcd pod:

...
2018-09-02 17:15:43.896539 I | etcdserver: published {Name:master.vnet.de 
ClientURLs:[https://192.168.121.202:2379]} to cluster 6d42105e200fef69
2018-09-02 17:15:43.896651 I | embed: ready to serve client requests
2018-09-02 17:15:43.897149 I | embed: serving client requests on 
192.168.121.202:2379


The interesting part is that it is serving on 192.168.121.202, but the IP 
that should be used is 192.168.60.150.

[vagrant@master ~]$ ip ad 
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN group 
default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
   valid_lft forever preferred_lft forever
inet6 ::1/128 scope host 
   valid_lft forever preferred_lft forever
2: eth0:  mtu 1500 qdisc pfifo_fast state UP 
group default qlen 1000
link/ether 52:54:00:87:13:01 brd ff:ff:ff:ff:ff:ff
inet 192.168.121.202/24 brd 192.168.121.255 scope global noprefixroute 
dynamic eth0
   valid_lft 3387sec preferred_lft 3387sec
inet6 fe80::5054:ff:fe87:1301/64 scope link 
   valid_lft forever preferred_lft forever
3: eth1:  mtu 1500 qdisc pfifo_fast state UP 
group default qlen 1000
link/ether 5c:a1:ab:1e:00:02 brd ff:ff:ff:ff:ff:ff
inet 192.168.60.150/24 brd 192.168.60.255 scope global noprefixroute eth1
   valid_lft forever preferred_lft forever
inet6 fe80::5ea1:abff:fe1e:2/64 scope link 
   valid_lft forever preferred_lft forever
4: docker0:  mtu 1500 qdisc noqueue state 
DOWN group default 
link/ether 02:42:8b:fa:b7:b0 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 scope global docker0
   valid_lft forever preferred_lft forever


Is there any way I can configure my inventory to use a dedicated network 
interface (eth1 in my Vagrant case)?
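
Not a definitive answer to the inventory question, but a quick way to see what 
etcd was actually told to bind to (a sketch, assuming the usual 
/etc/etcd/etcd.conf location used by the openshift-ansible etcd role):

# Which client addresses etcd is configured to listen/advertise on:
sudo grep -E 'ETCD_LISTEN_CLIENT_URLS|ETCD_ADVERTISE_CLIENT_URLS' /etc/etcd/etcd.conf
# And what is actually bound right now:
sudo ss -tlnp | grep 2379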



On Friday, August 31, 2018, at 21:15:12 CEST, you wrote:
> The dependency chain for control plane is node then etcd then api then
> controllers. From your previous post it looks like there's no apiserver
> running. I'd look into what's wrong there.
> 
> Check `master-logs api api`; if that doesn't provide any hints, then check
> the logs for the node service, but I can't think of anything that would
> fail there yet result in successfully starting the controller pods.
> The apiserver and controller pods use the same image. Each pod will have
> two containers, the k8s_POD containers are rarely interesting.
> 
> On Thu, Aug 30, 2018 at 2:37 PM Marc Schlegel  wrote:
> 
> > Thanks for the link. It looks like the api-pod is not coming up at all!
> >
> > Log from k8s_controllers_master-controllers-*
> >
> > [vagrant@master ~]$ sudo docker logs
> > k8s_controllers_master-controllers-master.vnet.de_kube-system_a3c3ca56f69ed817bad799176cba5ce8_1
> > E0830 18:28:05.787358   1 reflector.go:205]
> > github.com/openshift/origin/vendor/k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:594:
> > Failed to list *v1.Pod: Get
> > https://master.vnet.de:8443/api/v1/pods?fieldSelector=spec.schedulerName%3Ddefault-scheduler%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded=500=0:
> > dial tcp 127.0.0.1:8443: getsockopt: connection refused
> > E0830 18:28:05.788589   1 reflector.go:205]
> > github.com/openshift/origin/vendor/k8s.io/client-go/informers/factory.go:87:
> > Failed to list *v1.ReplicationController: Get
> > https://master.vnet.de:8443/api/v1/replicationcontrollers?limit=500=0:
> > dial tcp 127.0.0.1:8443: getsockopt: connection refused
> > E0830 18:28:05.804239   1 reflector.go:205]
> > github.com/openshift/origin/vendor/k8s.io/client-go/informers/factory.go:87:
> > Failed to list *v1.Node: Get
> > https://master.vnet.de:8443/api/v1/nodes?limit=500=0:
> > dial tcp 127.0.0.1:8443: getsockopt: connection refused
> > E0830 18:28:05.806879   1 reflector.go:205]
> > github.com/openshift/origin/vendor/k8s.io/client-go/informers/factory.go:87:
> > Failed to list *v1beta1.StatefulSet: Get
> > https://master.vnet.de:8443/apis/apps/v1beta1/statefulsets?limit=500=0:
> > dial tcp 127.0.0.1:8443: getsockopt: connection refused
> > E0830 18:28:05.808195   1 reflector.go:205]
> > github.com/openshift/origin/vendor/k8s.io/client-go/informers/factory.go:87:
> > Failed to list *v1beta1.PodDisruptionBudget: Get
> > https://master.vnet.de:8443/apis/policy/v1beta1/poddisruptionbudgets?limit=500=0:
> > dial tcp 127.0.0.1:8443: getsockopt: connection refused
> > E0830 18:28:06.673507   1 reflector.go:205]
> > github.com/openshift/origin/vendor/k8s.io/client-go/informers/factory.go:87:
> > Failed to list *v1.PersistentVolume: Get
> > https://master.vnet.de:8443/api/v1/persistentvolumes?limit=500=0:
> > dial tcp 127.0.0.1:8443: getsockopt: connection refused
> > E0830 18:28:06.770141   1 reflector.go:205]
> > 

Re: 3.9 Default Router Malfunction When 1 of 3 Pods is Down

2018-09-02 Thread Clayton Coleman
On Sep 2, 2018, at 9:51 AM, Stan Varlamov  wrote:

I think this is the cause. If using the ALB, each target master must have
the Router working. Either this is not documented well enough, or I’m not
reading the docs correctly, but my understanding of how this works was that
the link comes into some generic receiver, and then OpenShift would take over
from there. With the ALB, the link comes into the actual designated
master box, and that box, apparently, must have all the means of acting as
a designated oc master. Looks like I may need to remove the masters that I
don’t consider real ones anymore from the ALB targets, and that would take
care of my situation.


I’m really confused what you are trying to do.  You should not front the
apiserver with a router.  The router and the masters are generally best not
to collocate unless your bandwidth requirements are low, but it’s much more
effective to schedule the routers on nodes and keep that traffic separate
from a resiliency perspective.

The routers need the masters to be available (2/3 min) to receive their
route configuration when restarting, but require no interconnection to
serve traffic.
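
For what it's worth, one way to check this from the load-balancer side: the 
default HAProxy router exposes its health check on port 1936, which is what an 
ELB/ALB target group would normally probe on each registered node (a sketch; 
the node names below are placeholders):

for node in infra-1 infra-2 infra-3; do
  # Expect HTTP 200 from every node that is supposed to serve router traffic.
  curl -s -o /dev/null -w "%{http_code} ${node}\n" "http://${node}:1936/healthz"
done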



*From:* Clayton Coleman 
*Sent:* Sunday, September 2, 2018 9:31 PM
*To:* Stan Varlamov 
*Cc:* users@lists.openshift.redhat.com
*Subject:* Re: 3.9 Default Router Malfunction When 1 of 3 Pods is Down



When you were experiencing the outage, was the ALB listing 2/3 healthy
backends?  I’m not as familiar with the ALB as with the ELB, but what you are
describing sounds like the frontend only was able to see one of the pods.


On Sep 2, 2018, at 9:21 AM, Stan Varlamov  wrote:

AWS ALB



*From:* Clayton Coleman 
*Sent:* Sunday, September 2, 2018 9:11 PM
*To:* Stan Varlamov 
*Cc:* users@lists.openshift.redhat.com
*Subject:* Re: 3.9 Default Router Malfunction When 1 of 3 Pods is Down



Routers all watch all routes.  What are you fronting your routers with for
HA?  VRRP?  An F5 or cloud load balancer?  DNS?


On Sep 2, 2018, at 6:18 AM, Stan Varlamov  wrote:

Went through a pretty scary experience of a partial and uncontrollable outage
in a 3.9 cluster that happened to be caused by issues in the default
out-of-the-box Router. The original installation had 3 region=infra nodes where
the 3 router pods got installed via the generic ansible cluster installation.
2 of the 3 nodes were subsequently re-labeled at some point in the past, and
after one node was restarted, all of a sudden random routes started
“disappearing”, causing 502s. I noticed that one of the 3 Router pods was in
Pending – due to lack of available nodes. Bottom line, until I got all 3 pods
back into operation (tried dropping the nodeselector requirements but ended up
re-labeling the nodes back to infra), the routes would not come back. I would
expect that even one working Router can control all routes in the cluster – but
no. I couldn’t find a pattern for which routes were off vs. those that stayed
on, and some routes would pop in and out of operation. Is there something in
the Router design that relies on all its pods working? It appears that
individual Router pods are “responsible” for some routes in the cluster vs.
just providing redundancy.







___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


RE: 3.9 Default Router Malfunction When 1 of 3 Pods is Down

2018-09-02 Thread Stan Varlamov
I think this is the cause. If using the ALB, each target master must have the 
Router working. Either this is not documented well enough, or I’m not reading 
the docs correctly, but my understanding of how this works was that the link 
comes into some generic receiver, and then OpenShift would take over from 
there. With the ALB, the link comes into the actual designated master box, 
and that box, apparently, must have all the means of acting as a designated oc 
master. Looks like I may need to remove the masters that I don’t consider real 
ones anymore from the ALB targets, and that would take care of my situation. 
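
If it helps anyone later, deregistering those masters is a one-liner with the 
AWS CLI (a sketch; the target group ARN and instance IDs are placeholders):

# Check current target health first, then remove the masters that no longer run a router.
aws elbv2 describe-target-health --target-group-arn "$APPS_TARGET_GROUP_ARN"
aws elbv2 deregister-targets --target-group-arn "$APPS_TARGET_GROUP_ARN" \
    --targets Id=i-0aaaaaaaaaaaaaaaa Id=i-0bbbbbbbbbbbbbbbb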

 
From: Clayton Coleman  
Sent: Sunday, September 2, 2018 9:31 PM
To: Stan Varlamov 
Cc: users@lists.openshift.redhat.com
Subject: Re: 3.9 Default Router Malfunction When 1 of 3 Pods is Down

 
When you were experiencing the outage, was the ALB listing 2/3 healthy backends?  
I’m not as familiar with the ALB as with the ELB, but what you are describing sounds like 
the frontend only was able to see one of the pods.


On Sep 2, 2018, at 9:21 AM, Stan Varlamov <stan.varla...@exlinc.com> wrote:

AWS ALB

 
From: Clayton Coleman <ccole...@redhat.com>
Sent: Sunday, September 2, 2018 9:11 PM
To: Stan Varlamov <stan.varla...@exlinc.com>
Cc: users@lists.openshift.redhat.com  
Subject: Re: 3.9 Default Router Malfunction When 1 of 3 Pods is Down

 
Routers all watch all routes.  What are you fronting your routers with for HA?  
VRRP?  An F5 or cloud load balancer?  DNS?


On Sep 2, 2018, at 6:18 AM, Stan Varlamov <stan.varla...@exlinc.com> wrote:

Went through a pretty scary experience of a partial and uncontrollable outage
in a 3.9 cluster that happened to be caused by issues in the default
out-of-the-box Router. The original installation had 3 region=infra nodes where
the 3 router pods got installed via the generic ansible cluster installation.
2 of the 3 nodes were subsequently re-labeled at some point in the past, and
after one node was restarted, all of a sudden random routes started
“disappearing”, causing 502s. I noticed that one of the 3 Router pods was in
Pending – due to lack of available nodes. Bottom line, until I got all 3 pods
back into operation (tried dropping the nodeselector requirements but ended up
re-labeling the nodes back to infra), the routes would not come back. I would
expect that even one working Router can control all routes in the cluster – but
no. I couldn’t find a pattern for which routes were off vs. those that stayed
on, and some routes would pop in and out of operation. Is there something in
the Router design that relies on all its pods working? It appears that
individual Router pods are “responsible” for some routes in the cluster vs.
just providing redundancy.

 
 
 

___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: 3.9 Default Router Malfunction When 1 of 3 Pods is Down

2018-09-02 Thread Clayton Coleman
When you were experiencing the outage, was the ALB listing 2/3 healthy
backends?  I’m not as familiar with the ALB as with the ELB, but what you are
describing sounds like the frontend only was able to see one of the pods.

On Sep 2, 2018, at 9:21 AM, Stan Varlamov  wrote:

AWS ALB



*From:* Clayton Coleman 
*Sent:* Sunday, September 2, 2018 9:11 PM
*To:* Stan Varlamov 
*Cc:* users@lists.openshift.redhat.com
*Subject:* Re: 3.9 Default Router Malfunction When 1 of 3 Pods is Down



Routers all watch all routes.  What are you fronting your routers with for
HA?  VRRP?  An F5 or cloud load balancer?  DNS?


On Sep 2, 2018, at 6:18 AM, Stan Varlamov  wrote:

Went through a pretty scary experience of a partial and uncontrollable outage
in a 3.9 cluster that happened to be caused by issues in the default
out-of-the-box Router. The original installation had 3 region=infra nodes where
the 3 router pods got installed via the generic ansible cluster installation.
2 of the 3 nodes were subsequently re-labeled at some point in the past, and
after one node was restarted, all of a sudden random routes started
“disappearing”, causing 502s. I noticed that one of the 3 Router pods was in
Pending – due to lack of available nodes. Bottom line, until I got all 3 pods
back into operation (tried dropping the nodeselector requirements but ended up
re-labeling the nodes back to infra), the routes would not come back. I would
expect that even one working Router can control all routes in the cluster – but
no. I couldn’t find a pattern for which routes were off vs. those that stayed
on, and some routes would pop in and out of operation. Is there something in
the Router design that relies on all its pods working? It appears that
individual Router pods are “responsible” for some routes in the cluster vs.
just providing redundancy.







___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


RE: 3.9 Default Router Malfunction When 1 of 3 Pods is Down

2018-09-02 Thread Stan Varlamov
AWS ALB

 
From: Clayton Coleman  
Sent: Sunday, September 2, 2018 9:11 PM
To: Stan Varlamov 
Cc: users@lists.openshift.redhat.com
Subject: Re: 3.9 Default Router Malfunction When 1 of 3 Pods is Down

 
Routers all watch all routes.  What are you fronting your routers with for HA?  
VRRP?  An F5 or cloud load balancer?  DNS?


On Sep 2, 2018, at 6:18 AM, Stan Varlamov <stan.varla...@exlinc.com> wrote:

Went through a pretty scary experience of a partial and uncontrollable outage
in a 3.9 cluster that happened to be caused by issues in the default
out-of-the-box Router. The original installation had 3 region=infra nodes where
the 3 router pods got installed via the generic ansible cluster installation.
2 of the 3 nodes were subsequently re-labeled at some point in the past, and
after one node was restarted, all of a sudden random routes started
“disappearing”, causing 502s. I noticed that one of the 3 Router pods was in
Pending – due to lack of available nodes. Bottom line, until I got all 3 pods
back into operation (tried dropping the nodeselector requirements but ended up
re-labeling the nodes back to infra), the routes would not come back. I would
expect that even one working Router can control all routes in the cluster – but
no. I couldn’t find a pattern for which routes were off vs. those that stayed
on, and some routes would pop in and out of operation. Is there something in
the Router design that relies on all its pods working? It appears that
individual Router pods are “responsible” for some routes in the cluster vs.
just providing redundancy.

 
 
 

___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: 3.9 Default Router Malfunction When 1 of 3 Pods is Down

2018-09-02 Thread Clayton Coleman
Routers all watch all routes.  What are you fronting your routers with for
HA?  VRRP?  An F5 or cloud load balancer?  DNS?
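
A rough way to confirm that every router replica carries the full route set (a 
sketch; the pod names are placeholders, and the config path is the usual 
location inside the HAProxy router image):

oc -n default get pods -l deploymentconfig=router -o wide
# Each running router pod should report the same backend count:
for pod in router-1-aaaaa router-1-bbbbb; do
  oc -n default exec "$pod" -- grep -c '^backend ' /var/lib/haproxy/conf/haproxy.config
done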

On Sep 2, 2018, at 6:18 AM, Stan Varlamov  wrote:

Went through a pretty scary experience of a partial and uncontrollable outage
in a 3.9 cluster that happened to be caused by issues in the default
out-of-the-box Router. The original installation had 3 region=infra nodes where
the 3 router pods got installed via the generic ansible cluster installation.
2 of the 3 nodes were subsequently re-labeled at some point in the past, and
after one node was restarted, all of a sudden random routes started
“disappearing”, causing 502s. I noticed that one of the 3 Router pods was in
Pending – due to lack of available nodes. Bottom line, until I got all 3 pods
back into operation (tried dropping the nodeselector requirements but ended up
re-labeling the nodes back to infra), the routes would not come back. I would
expect that even one working Router can control all routes in the cluster – but
no. I couldn’t find a pattern for which routes were off vs. those that stayed
on, and some routes would pop in and out of operation. Is there something in
the Router design that relies on all its pods working? It appears that
individual Router pods are “responsible” for some routes in the cluster vs.
just providing redundancy.







___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


3.9 Default Router Malfunction When 1 of 3 Pods is Down

2018-09-02 Thread Stan Varlamov
Went through a pretty scary experience of a partial and uncontrollable outage
in a 3.9 cluster that happened to be caused by issues in the default
out-of-the-box Router. The original installation had 3 region=infra nodes where
the 3 router pods got installed via the generic ansible cluster installation.
2 of the 3 nodes were subsequently re-labeled at some point in the past, and
after one node was restarted, all of a sudden random routes started
“disappearing”, causing 502s. I noticed that one of the 3 Router pods was in
Pending – due to lack of available nodes. Bottom line, until I got all 3 pods
back into operation (tried dropping the nodeselector requirements but ended up
re-labeling the nodes back to infra), the routes would not come back. I would
expect that even one working Router can control all routes in the cluster – but
no. I couldn’t find a pattern for which routes were off vs. those that stayed
on, and some routes would pop in and out of operation. Is there something in
the Router design that relies on all its pods working? It appears that
individual Router pods are “responsible” for some routes in the cluster vs.
just providing redundancy.
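
For anyone hitting the same thing, the checks that would have surfaced the 
problem earlier look roughly like this (a sketch; the label and resource names 
match the default install, but verify them against your cluster):

oc -n default get pods -l deploymentconfig=router -o wide   # any router pod stuck in Pending, and on which nodes?
oc -n default get dc router -o jsonpath='{.spec.template.spec.nodeSelector}{"\n"}'
oc get nodes -l region=infra                                # are enough nodes still labeled to satisfy the selector?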

 
 
 
___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: openshift-ansible release-3.10 - Install fails with control plane pods

2018-09-02 Thread klaasdemter

Hi,
I've seen this issue reproducibly after uninstalling a (failed/completed) 
installation and then reinstalling. It is, however, solved by rebooting 
all involved nodes/masters, so I did not investigate further.


Greetings
Klaas

On 31.08.2018 21:26, Marc Schlegel wrote:

Sure, see attached.

Before each attempt I pull the latest release-3.10 branch for openshift-ansible.

@Scott Dodson: I am going to investigate again using your suggestions.


Marc,

Is it possible to share your Ansible inventory file so we can review your
OpenShift installation? I know there are some changes in the 3.10 installation
that might be reflected in the inventory.

On Thu, Aug 30, 2018 at 3:37 PM Marc Schlegel  wrote:


Thanks for the link. It looks like the api-pod is not coming up at all!

Log from k8s_controllers_master-controllers-*

[vagrant@master ~]$ sudo docker logs
k8s_controllers_master-controllers-master.vnet.de_kube-system_a3c3ca56f69ed817bad799176cba5ce8_1
E0830 18:28:05.787358   1 reflector.go:205]
github.com/openshift/origin/vendor/k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:594:
Failed to list *v1.Pod: Get
https://master.vnet.de:8443/api/v1/pods?fieldSelector=spec.schedulerName%3Ddefault-scheduler%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded=500=0:
dial tcp 127.0.0.1:8443: getsockopt: connection refused
E0830 18:28:05.788589   1 reflector.go:205]
github.com/openshift/origin/vendor/k8s.io/client-go/informers/factory.go:87:
Failed to list *v1.ReplicationController: Get
https://master.vnet.de:8443/api/v1/replicationcontrollers?limit=500=0:
dial tcp 127.0.0.1:8443: getsockopt: connection refused
E0830 18:28:05.804239   1 reflector.go:205]
github.com/openshift/origin/vendor/k8s.io/client-go/informers/factory.go:87:
Failed to list *v1.Node: Get
https://master.vnet.de:8443/api/v1/nodes?limit=500=0:
dial tcp 127.0.0.1:8443: getsockopt: connection refused
E0830 18:28:05.806879   1 reflector.go:205]
github.com/openshift/origin/vendor/k8s.io/client-go/informers/factory.go:87:
Failed to list *v1beta1.StatefulSet: Get
https://master.vnet.de:8443/apis/apps/v1beta1/statefulsets?limit=500=0:
dial tcp 127.0.0.1:8443: getsockopt: connection refused
E0830 18:28:05.808195   1 reflector.go:205]
github.com/openshift/origin/vendor/k8s.io/client-go/informers/factory.go:87:
Failed to list *v1beta1.PodDisruptionBudget: Get
https://master.vnet.de:8443/apis/policy/v1beta1/poddisruptionbudgets?limit=500=0:
dial tcp 127.0.0.1:8443: getsockopt: connection refused
E0830 18:28:06.673507   1 reflector.go:205]
github.com/openshift/origin/vendor/k8s.io/client-go/informers/factory.go:87:
Failed to list *v1.PersistentVolume: Get
https://master.vnet.de:8443/api/v1/persistentvolumes?limit=500=0:
dial tcp 127.0.0.1:8443: getsockopt: connection refused
E0830 18:28:06.770141   1 reflector.go:205]
github.com/openshift/origin/vendor/k8s.io/client-go/informers/factory.go:87:
Failed to list *v1beta1.ReplicaSet: Get
https://master.vnet.de:8443/apis/extensions/v1beta1/replicasets?limit=500=0:
dial tcp 127.0.0.1:8443: getsockopt: connection refused
E0830 18:28:06.773878   1 reflector.go:205]
github.com/openshift/origin/vendor/k8s.io/client-go/informers/factory.go:87:
Failed to list *v1.Service: Get
https://master.vnet.de:8443/api/v1/services?limit=500=0:
dial tcp 127.0.0.1:8443: getsockopt: connection refused
E0830 18:28:06.778204   1 reflector.go:205]
github.com/openshift/origin/vendor/k8s.io/client-go/informers/factory.go:87:
Failed to list *v1.StorageClass: Get
https://master.vnet.de:8443/apis/storage.k8s.io/v1/storageclasses?limit=500=0:
dial tcp 127.0.0.1:8443: getsockopt: connection refused
E0830 18:28:06.784874   1 reflector.go:205]
github.com/openshift/origin/vendor/k8s.io/client-go/informers/factory.go:87:
Failed to list *v1.PersistentVolumeClaim: Get
https://master.vnet.de:8443/api/v1/persistentvolumeclaims?limit=500=0:
dial tcp 127.0.0.1:8443: getsockopt: connection refused

The log is full of those. Since it is all about the api, I tried to get the
logs from k8s_POD_master-api-master.vnet.de_kube-system_*, which are
completely empty :-/

[vagrant@master ~]$ sudo docker logs
k8s_POD_master-api-master.vnet.de_kube-system_86017803919d833e39cb3d694c249997_1
[vagrant@master ~]$

Is there any special prerequisite for the api-pod?

regards
Marc
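
Regarding the api-pod question: the k8s_POD container is just the
infrastructure (pause) container, so its empty log is expected; the interesting
logs live in the api container itself. A hedged set of checks on the master:

ls /etc/origin/node/pods/        # static pod definitions the node service should launch
sudo docker ps -a | grep api     # is there an api container at all, or only the k8s_POD pause container?
sudo master-logs api api         # apiserver logs, per the advice quoted above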



Marc,

could you please look over the issue [1] and pull the master pod logs and
see if you bumped into the same issue mentioned by the other folks?
Also make sure the openshift-ansible release is the latest one.

Dani

[1] https://github.com/openshift/openshift-ansible/issues/9575

On Wed, Aug 29, 2018 at 7:36 PM Marc Schlegel wrote:

Hello everyone

I am having trouble getting a working Origin 3.10 installation using the
openshift-ansible installer. My install always fails because the control
plane pods are not available. I've checked out the release-3.10 branch from
openshift-ansible and configured the inventory accordingly.


TASK [openshift_control_plane : Start and