Re: OKD3.11 install blocked - Could not find csr for nodes
Hi Dani,

I'm using the openshift-ansible release-3.11 tag.

Dan

On Tue, 4 Jun 2019 at 09:54, Daniel Comnea wrote:
> Hi Dan,
>
> Which openshift-ansible release tag have you used?
>
> Cheers,
> Dani
Re: OKD3.11 install blocked - Could not find csr for nodes
Hi Dan,

Which openshift-ansible release tag have you used?

Cheers,
Dani
Re: OKD3.11 install blocked - Could not find csr for nodes
Thank you very much for the extensive response, Samuel!

I've found that I do have a DNS misconfiguration, so the CSR error from
the title is not caused by the OpenShift installer procedure itself.

Somehow (I haven't found the reason yet, but I'm still looking) dnsmasq
fills the upstream DNS configuration with some public nameservers instead
of my "internal" DNS.
So after the related openshift-ansible playbook installs dnsmasq and calls
the /etc/NetworkManager/dispatcher.d/99-origin-dns.sh script (restarts
NetworkManager), all nodes end up with "bad" upstream nameservers in the
/etc/dnsmasq.d/origin-upstream-dns.conf and /etc/origin/node/resolv.conf
files.
Even though the /etc/resolv.conf file on each host has the right nameserver
and search domain, dnsmasq populates the OKD-related conf files above with
a different nameserver.

I think this is related to dnsmasq/NetworkManager-specific configuration;
I will have to look into it and figure out what's not going as expected
and why. I believe these nameservers are served by the DHCP server, but
I'm still looking for a way to address this.

Anyway, thanks again for the input, it put me on the right track! :)

Dan

On Sun, 2 Jun 2019 at 22:04, Samuel Martín Moro wrote:
> Hi,
>
> This is quite puzzling... Could you share your inventory with us? Make
> sure to obfuscate any sensitive data (LDAP/htpasswd credentials, among
> others). I'm mostly interested in potential openshift_node_groups edits,
> although something else might come up.
>
> At first glance, you are right, it sounds like a firewalling issue.
> Yet from your description, you did open all required ports.
> I would suggest you double-check these and make sure your data is
> accurate, although I would assume it is.
> Also: if using CRI-O as a runtime, note that you would be missing port
> 10010, which should be opened on all nodes. Yet I don't think that one
> would be related to node registration against your master API.
>
> Another explanation could be related to DNS (can your infra/compute nodes
> properly resolve your master's name? The contrary would be unusual, but
> it could explain what's going on).
>
> As a general rule, at that stage, I would restart the origin-node service
> on those hosts that fail to register, keeping an eye on /var/log/messages
> (or journalctl -f).
> If that doesn't help, I might raise log levels in
> /etc/sysconfig/origin-node (there's a variable which defaults to 2; you
> can change it to 99. Beware: it would give you a lot of logs and could
> saturate your disks at some point, so don't keep it like this over a long
> period).
>
> When dealing with large volumes of logs, note that OpenShift services
> tend to store messages with a prefix based on severity: you might be able
> to "| grep -E 'E[0-9][0-9]'" to focus on error messages, or W[0-9][0-9]
> for warnings, ...
>
> Your issue being potentially related to firewalling, I might also use
> tcpdump to look into what's being exchanged between nodes.
> Look for any packets with a SYN flag ("[S]") that are not followed by a
> SYN-ACK ("[S.]").
>
> Let us know how that goes,
>
> Good luck.
> Failing during the "Approve node certificate" steps is relatively common,
> and could have several causes, from node group configuration to DNS,
> firewalls, broken TCP handshakes, an MTU not allowing certificates to go
> through, ... We'll want to dig deeper to elucidate that issue.
>
> Regards.
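For anyone landing on this thread with the same symptom, here is a minimal sketch of the check Dan describes: confirming whether the two installer-managed files actually list the internal DNS server. It assumes that /etc/dnsmasq.d/origin-upstream-dns.conf holds `server=ADDRESS` lines and that /etc/origin/node/resolv.conf holds ordinary `nameserver ADDRESS` lines. The function names and the address 192.0.2.53 are hypothetical, chosen only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: the file formats below are assumptions, not taken from the thread.

# Print the addresses from "nameserver ADDR" lines of a resolv.conf-style file.
resolv_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "$1" | sort -u
}

# Print the addresses from "server=ADDR" lines of a dnsmasq config file.
dnsmasq_upstreams() {
    sed -n 's/^server=//p' "$1" | sort -u
}

# Check that both installer-managed files list the expected internal DNS.
# Usage: check_upstream_dns EXPECTED_IP [DNSMASQ_CONF] [NODE_RESOLV_CONF]
# Returns 0 when both files list EXPECTED_IP, 1 otherwise.
check_upstream_dns() {
    local expected="$1"
    local dnsmasq_conf="${2:-/etc/dnsmasq.d/origin-upstream-dns.conf}"
    local node_resolv="${3:-/etc/origin/node/resolv.conf}"
    local status=0

    if ! dnsmasq_upstreams "$dnsmasq_conf" | grep -qxF "$expected"; then
        echo "MISMATCH: $dnsmasq_conf does not list $expected"
        status=1
    fi
    if ! resolv_nameservers "$node_resolv" | grep -qxF "$expected"; then
        echo "MISMATCH: $node_resolv does not list $expected"
        status=1
    fi
    if [ "$status" -eq 0 ]; then
        echo "OK: both files list $expected"
    fi
    return "$status"
}

# Example (hypothetical internal DNS address), run on each node:
#   check_upstream_dns 192.0.2.53
```

Running this on every node after the playbook has restarted NetworkManager should show quickly whether the upstream servers were replaced by something other than the internal DNS.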
OKD3.11 install blocked - Could not find csr for nodes
Hello all!

I'm hitting a problem when trying to install OKD 3.11 on one master, 2
infra and 2 compute nodes. The hosts are VMs running CentOS 7.
I've gone through the issues related to this subject:
https://access.redhat.com/solutions/3680401 which suggests naming the hosts
by FQDN. I tried that, and the same problem appears for the same set of
hosts (all except the master).

In my case the error is only for the 2 infra nodes and 2 compute nodes, so
not for the master as well.

oc get nodes gives me just the master node, but I guess this is the case
because the other OKD nodes are meant to be created by the process that
fails. Am I wrong?

oc get csr gives me a result of 3 CSRs:
[root@master ~]# oc get csr
NAME        AGE   REQUESTOR            CONDITION
csr-4xjjb   24m   system:admin         Approved,Issued
csr-b6x45   24m   system:admin         Approved,Issued
csr-hgmpf   20m   system:node:master   Approved,Issued

Here I believe I have 2 CSRs for system:admin because I ran
playbooks/openshift-node/join.yml a second time.

The bootstrapping certificates on the master look fine(??)
[root@master ~]# ll /etc/origin/node/certificates/
total 20
-rw---. 1 root root 2830 iun 1 11:30 kubelet-client-2019-06-01-11-30-04.pem
-rw---. 1 root root 1135 iun 1 11:31 kubelet-client-2019-06-01-11-31-23.pem
lrwxrwxrwx. 1 root root 68 iun 1 11:31 kubelet-client-current.pem -> /etc/origin/node/certificates/kubelet-client-2019-06-01-11-31-23.pem
-rw---. 1 root root 1179 iun 1 11:35 kubelet-server-2019-06-01-11-35-42.pem
lrwxrwxrwx. 1 root root 68 iun 1 11:35 kubelet-server-current.pem -> /etc/origin/node/certificates/kubelet-server-2019-06-01-11-35-42.pem

I've rechecked the open ports, thinking the issue lies in some
network-related config.
- all hosts have the node-related ports opened: 53/udp, 10250/tcp, 4789/udp
- master (with etcd): 8053/udp+tcp, 2049/udp+tcp, 8443/tcp, 8444/tcp,
  4789/udp, 53/udp
- infra has, on top of the node ones, the ports related to the router/routes
  and logging components which it will host

The chosen SDN is
os_sdn_network_plugin_name='redhat/openshift-ovs-multitenant' with no extra
config in the inventory file. (Do I need any?)

Any hints about where and what to check would be much appreciated!

Best regards,
Dan Pungă

___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users
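To make the symptom in the post above easier to spot, here is a small sketch that reads `oc get csr` output and reports which expected nodes have no CSR at all. It assumes the four-column layout shown in the post (NAME, AGE, REQUESTOR, CONDITION) and that node CSRs carry a `system:node:<name>` requestor, as the master's does there. The function names and the host names in the usage line are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: relies on the column layout and requestor format seen in the post.

# Read `oc get csr` output on stdin and print the node names that have a CSR,
# taken from REQUESTOR values of the form "system:node:<name>".
csr_node_names() {
    awk 'NR > 1 && $3 ~ /^system:node:/ { sub(/^system:node:/, "", $3); print $3 }' | sort -u
}

# Read `oc get csr` output on stdin and print every expected node (given as
# arguments) for which no node CSR was found.
nodes_without_csr() {
    local have
    local node
    have="$(csr_node_names)"
    for node in "$@"; do
        printf '%s\n' "$have" | grep -qxF "$node" || echo "$node"
    done
}

# Example (hypothetical host names), run on the master:
#   oc get csr | nodes_without_csr master infra1 infra2 compute1 compute2
```

Any name printed is a node that never submitted a request to the master, which points at the node side (DNS, firewall, node service) and not at the approval step itself.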