Re: OKD3.11 install blocked - Could not find csr for nodes

2019-06-04 Thread Punga Dan
Hi Dani,

I'm using the release-3.11 tag of openshift-ansible.

Dan


Re: OKD3.11 install blocked - Could not find csr for nodes

2019-06-04 Thread Daniel Comnea
Hi Dan,

Which openshift-ansible release tag have you used?


Cheers,
Dani


Re: OKD3.11 install blocked - Could not find csr for nodes

2019-06-03 Thread Punga Dan
Thank you very much for the extensive response, Samuel!

I've found that I do have a DNS misconfiguration, so the CSR error from the
title is not caused by the OpenShift installer procedure itself.

Somehow (I haven't found the reason yet, but I'm still looking for it)
dnsmasq fills the upstream DNS configuration with some public nameservers
instead of my "internal" DNS.
After the related openshift-ansible playbook installs dnsmasq and calls the
/etc/NetworkManager/dispatcher.d/99-origin-dns.sh script (restarts
NetworkManager), all nodes end up with "bad" upstream nameservers in the
/etc/dnsmasq.d/origin-upstream-dns.conf and /etc/origin/node/resolv.conf
files.
Even though the /etc/resolv.conf file on each host has the right nameserver
and search domain, dnsmasq populates the OKD-related conf files above with
a different nameserver.

I think this is related to the dnsmasq/NetworkManager-specific
configuration. I will have to look into it and figure out what's not going
as expected and why. I believe these nameservers are served by the DHCP
server, but I'm still looking for a way to address this.
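
For anyone hitting the same thing, a minimal sketch of how the mismatch could be checked on each host. It assumes the dnsmasq file holds `server=<ip>` lines and the node file holds `nameserver <ip>` lines; the function names and the 10.0.0.2 address are hypothetical, not something the installer provides.

```shell
# Print the nameserver addresses found in a file, one per line, sorted.
# Assumed formats: "nameserver <ip>" (resolv.conf style) and
# "server=<ip>" (dnsmasq style).
list_nameservers() {
  awk '
    $1 == "nameserver" { print $2 }
    /^server=/         { sub(/^server=/, ""); print }
  ' "$1" | sort -u
}

# Report, for each file, whether the expected nameserver appears in it.
# Usage: check_upstream_dns <expected-ip> <file>...
check_upstream_dns() {
  local expected file
  expected=$1
  shift
  for file in "$@"; do
    if [ ! -r "$file" ]; then
      echo "unreadable: $file"
    elif list_nameservers "$file" | grep -qxF -- "$expected"; then
      echo "ok: $file"
    else
      # Show what was found instead, joined with commas.
      echo "MISMATCH: $file ($(list_nameservers "$file" | paste -sd, -))"
    fi
  done
}
```

With the internal DNS at the hypothetical 10.0.0.2, this could be run as `check_upstream_dns 10.0.0.2 /etc/dnsmasq.d/origin-upstream-dns.conf /etc/origin/node/resolv.conf`; any MISMATCH line points at a file that picked up a different upstream.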

Anyway thanks again for the input, it put me on the right track! :)

Dan

On Sun, Jun 2, 2019 at 22:04, Samuel Martín Moro wrote:

> Hi,
>
>
> This is quite puzzling. Could you share your inventory with us? Make
> sure to obfuscate any sensitive data (LDAP/htpasswd credentials, among
> others).
> I'm mostly interested in any changes to openshift_node_groups, although
> something else might come up.
>
>
> At first glance, you are right, it sounds like a firewalling issue.
> Yet from your description, you did open all required ports.
> I would suggest you check these again and make sure your data is accurate,
> although I would assume it is.
> Also: if using CRI-O as a runtime, note that you would be missing port
> 10010, which should be opened on all nodes. Yet I don't think that one
> would be related to node registration against your master API.
>
> Another explanation could be related to DNS: can your infra/compute nodes
> properly resolve your master's name? A failure there would be unusual, but
> it could explain what's going on.
>
> As a general rule, at that stage, I would restart the origin-node service
> on those hosts that fail to register, keeping an eye on /var/log/messages
> (or journalctl -f).
> If that doesn't help, I might raise the log level in
> /etc/sysconfig/origin-node (there's a variable which defaults to 2; you can
> change it to 99). Beware: that produces a lot of logs and could fill your
> disks at some point, so don't keep it like this over a long period.
>
> When dealing with large volumes of logs, note that OpenShift services tend
> to prefix messages based on severity: you might be able to use
> "| grep -E 'E[0-9][0-9]'" to focus on error messages, or W[0-9][0-9] for
> warnings, ...
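
A slightly stricter version of that filter, as a sketch. It assumes the message prefix is a severity letter followed by a four-digit date and a timestamp (for example `E0601 11:30:04`); `filter_severity` is a hypothetical helper name.

```shell
# Keep only log lines whose message starts with the given severity letter
# followed by a four-digit date and a time, e.g. "E0601 11:30:04...".
# Usage: <log source> | filter_severity E
filter_severity() {
  grep -E "(^|[[:space:]])$1[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}"
}
```

For example, `journalctl -u origin-node --no-pager | filter_severity E` would keep the error lines only, and `filter_severity W` the warnings. Anchoring on the date and time avoids matching unrelated tokens such as hex strings that happen to contain "E12".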
>
> Your issue being potentially related to firewalling, I might also use
> tcpdump to look into what's being exchanged between nodes.
> Look for any packets with a SYN flag ("[S]") that are not followed by
> a SYN-ACK ("[S.]").
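
The SYN / SYN-ACK check can be sketched as a small filter over tcpdump's text output. This assumes the usual `src.port > dst.port: Flags [S], ...` line layout; `unanswered_syns` and `capture.pcap` are hypothetical names.

```shell
# Read tcpdump text output on stdin and print "client > server" pairs that
# sent a SYN ("[S]") but never received a SYN-ACK ("[S.]") in the capture.
unanswered_syns() {
  awk '
    # Locate the "src > dst:" fields wherever they sit on the line.
    function endpoints(   i) {
      for (i = 2; i < NF; i++)
        if ($i == ">") {
          src = $(i - 1); dst = $(i + 1)
          sub(/:$/, "", dst)
          return 1
        }
      return 0
    }
    /Flags \[S\],/   { if (endpoints()) pending[src " > " dst] = 1 }
    /Flags \[S\.\],/ { if (endpoints()) delete pending[dst " > " src] }
    END { for (pair in pending) print pair }
  ' | sort
}
```

Something like `tcpdump -nn -r capture.pcap 'tcp port 8443' | unanswered_syns` would then list connection attempts towards the master API that got no answer. It only sees what is in the capture, so a SYN-ACK that arrived after the capture stopped would show up as a false positive.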
>
>
> Let us know how that goes,
>
>
> Good luck.
> Failing during the "Approve node certificate" steps is relatively common,
> and could have several causes, from node groups configuration, to DNS,
> firewalls, broken TCP handshake, MTU not allowing for certificates to go
> through, ... we'll want to dig deeper, to elucidate that issue.
>
>
> Regards.

OKD3.11 install blocked - Could not find csr for nodes

2019-06-01 Thread Punga Dan
Hello all!

I'm hitting a problem when trying to install OKD 3.11 on one master, 2
infra and 2 compute nodes. The hosts are VMs running CentOS 7.
I've gone through the issues related to this subject:
https://access.redhat.com/solutions/3680401, which suggests naming the hosts
by FQDN. I tried that, and the same problem appears for the same set of
hosts (all except the master).

In my case the error occurs only for the 2 infra nodes and the 2 compute
nodes, not for the master.

oc get nodes gives me just the master node, but I guess this is expected,
as the other OKD nodes are supposed to be created by the process that fails.
Am I wrong?

oc get csr gives me a result of 3 CSRs:
[root@master ~]# oc get csr
NAME        AGE   REQUESTOR            CONDITION
csr-4xjjb   24m   system:admin         Approved,Issued
csr-b6x45   24m   system:admin         Approved,Issued
csr-hgmpf   20m   system:node:master   Approved,Issued

Here I believe I have 2 CSRs for system:admin because I ran
playbooks/openshift-node/join.yml a second time.
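
To make the "which nodes never requested a certificate" check explicit, here is a minimal sketch. It assumes the REQUESTOR column has the `system:node:<name>` form seen in the listing above; `nodes_without_csr` and the node names in the usage line are hypothetical.

```shell
# Read `oc get csr` output on stdin and print each expected node name that
# has no CSR whose REQUESTOR (third column) is system:node:<name>.
# Usage: oc get csr --no-headers | nodes_without_csr <node>...
nodes_without_csr() {
  local listing node
  listing=$(cat)
  for node in "$@"; do
    # awk exits 0 when a matching requestor is found, 1 otherwise.
    printf '%s\n' "$listing" |
      awk -v want="system:node:$node" '$3 == want { found = 1 } END { exit !found }' ||
      printf '%s\n' "$node"
  done
}
```

For example, `oc get csr --no-headers | nodes_without_csr master infra1 infra2 node1 node2` would print every host from that list that has not submitted a request under its node identity.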

The bootstrapping certificates on the master look fine (??):
[root@master ~]# ll /etc/origin/node/certificates/
total 20
-rw-------. 1 root root 2830 iun  1 11:30 kubelet-client-2019-06-01-11-30-04.pem
-rw-------. 1 root root 1135 iun  1 11:31 kubelet-client-2019-06-01-11-31-23.pem
lrwxrwxrwx. 1 root root   68 iun  1 11:31 kubelet-client-current.pem -> /etc/origin/node/certificates/kubelet-client-2019-06-01-11-31-23.pem
-rw-------. 1 root root 1179 iun  1 11:35 kubelet-server-2019-06-01-11-35-42.pem
lrwxrwxrwx. 1 root root   68 iun  1 11:35 kubelet-server-current.pem -> /etc/origin/node/certificates/kubelet-server-2019-06-01-11-35-42.pem

I've rechecked the open ports, thinking the issue lies in some
network-related config.
- all hosts have the node-related ports opened: 53/udp, 10250/tcp, 4789/udp
- master (with etcd): 8053/udp+tcp, 2049/udp+tcp, 8443/tcp, 8444/tcp,
4789/udp, 53/udp
- infra nodes have, on top of the node ones, the ports related to the
router/routes and logging components which they will host
The chosen SDN is
os_sdn_network_plugin_name='redhat/openshift-ovs-multitenant', with no
extra config in the inventory file. (Do I need any?)
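
The port check above can be scripted along these lines. This is only a sketch: `missing_ports` is a hypothetical helper, and it assumes some command on the host can list the opened ports as whitespace-separated `port/proto` tokens.

```shell
# Read the currently opened ports on stdin as whitespace-separated
# "port/proto" tokens and print every required port that is absent.
# Usage: <command listing opened ports> | missing_ports 53/udp 10250/tcp ...
missing_ports() {
  local opened port
  # Normalise the input to one token per line.
  opened=$(tr -s '[:space:]' '\n')
  for port in "$@"; do
    printf '%s\n' "$opened" | grep -qxF -- "$port" || printf '%s\n' "$port"
  done
}
```

If firewalld manages the host (an assumption, the installer may be using plain iptables instead), `firewall-cmd --list-ports | missing_ports 53/udp 10250/tcp 4789/udp` would print the node ports that are not open. It checks the local firewall configuration only, not whether the port is reachable from another host.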


Any hints about where and what to check would be much appreciated!

Best regards,
Dan Pungă