[jira] [Commented] (MESOS-10190) libprocess fails with "Failed to obtain the IP address for " when using CNI on some hosts
[ https://issues.apache.org/jira/browse/MESOS-10190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17349137#comment-17349137 ]

Andreas Peters commented on MESOS-10190:

That's the IP configuration of your running Mesos task. That is why eth0 should be fine: it is the interface inside your Mesos task's sandbox.

Another question: do you still have this issue, or can I close it? :)

> libprocess fails with "Failed to obtain the IP address for " when using CNI on some hosts
> ---
>
> Key: MESOS-10190
> URL: https://issues.apache.org/jira/browse/MESOS-10190
> Project: Mesos
> Issue Type: Bug
> Components: executor
> Affects Versions: 1.9.0
> Reporter: acecile555
> Priority: Major
>
> Hello,
>
> We deployed CNI support and 3 of our hosts (all the same) are failing to start containers with CNI enabled. The log file is:
> {noformat}
> E0917 16:58:11.481551 16770 process.cpp:1153] EXIT with status 1: Failed to obtain the IP address for '7c4beac7-5385-4dfa-845a-beb01e13c77c'; the DNS service may not be able to resolve it: Name or service not known
> {noformat}
> So I tried enforcing LIBPROCESS_IP using an environment variable, but I saw that Mesos overwrites it. So I rebuilt Mesos with additional debugging and here is the log:
> {noformat}
> Overwriting environment variable 'LIBPROCESS_IP' from '10.99.50.3' to '0.0.0.0'
> E0917 16:34:49.779429 31428 process.cpp:1153] EXIT with status 1: Failed to obtain the IP address for 'de65bbd8-b237-4884-ba87-7e13cb85078f'; the DNS service may not be able to resolve it: Name or service not known
> {noformat}
> According to the code, it's expected to be set to 0.0.0.0 (MESOS-5127). So I tried to understand why libprocess attempts to resolve a container run UUID instead of the hostname. Here is the libprocess code:
>
> {noformat}
> // Resolve the hostname if ip is 0.0.0.0 in case we actually have
> // a valid external IP address. Note that we need only one IP
> // address, so that other processes can send and receive and
> // don't get confused as to whom they are sending to.
> if (__address__.ip.isAny()) {
>   char hostname[512];
>
>   if (gethostname(hostname, sizeof(hostname)) < 0) {
>     PLOG(FATAL) << "Failed to initialize, gethostname";
>   }
>
>   // Lookup an IP address of local hostname, taking the first result.
>   Try<net::IP> ip = net::getIP(hostname, __address__.ip.family());
>
>   if (ip.isError()) {
>     EXIT(EXIT_FAILURE)
>       << "Failed to obtain the IP address for '" << hostname << "';"
>       << " the DNS service may not be able to resolve it: " << ip.error();
>   }
>
>   __address__.ip = ip.get();
> }
> {noformat}
>
> Well, actually this is perfectly fine, except "gethostname" returns the container UUID instead of a valid host IP address. How is that even possible?
>
> Any help would be greatly appreciated.
> Regards, Adam.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
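The lookup that fails in the quoted libprocess snippet can be reproduced outside Mesos with plain POSIX calls. The following is a minimal sketch, not libprocess code: `localHostname` and `firstIPv4` are hypothetical helper names, and the sketch assumes that `net::getIP` ends up in a `getaddrinfo`-style lookup, which the "Name or service not known" text in the log suggests (that is glibc's message for `EAI_NONAME`).

```cpp
#include <arpa/inet.h>
#include <netdb.h>
#include <sys/socket.h>
#include <unistd.h>

#include <string>

// Hypothetical helper: returns the name reported by gethostname(),
// or an empty string if the call fails.
std::string localHostname() {
  char buffer[512];
  if (gethostname(buffer, sizeof(buffer)) != 0) {
    return "";
  }
  // gethostname() is not guaranteed to NUL-terminate on truncation.
  buffer[sizeof(buffer) - 1] = '\0';
  return buffer;
}

// Hypothetical helper: resolves `name` to its first IPv4 address in
// dotted notation. On failure it returns an empty string and stores
// the resolver's message in `error` (if non-null).
std::string firstIPv4(const std::string& name, std::string* error) {
  addrinfo hints = {};
  hints.ai_family = AF_INET;
  hints.ai_socktype = SOCK_STREAM;

  addrinfo* results = nullptr;
  const int rc = getaddrinfo(name.c_str(), nullptr, &hints, &results);
  if (rc != 0) {
    if (error != nullptr) {
      *error = gai_strerror(rc);
    }
    return "";
  }

  // Take the first result, as the quoted snippet does.
  char text[INET_ADDRSTRLEN] = {0};
  const sockaddr_in* address =
      reinterpret_cast<const sockaddr_in*>(results->ai_addr);
  const char* converted =
      inet_ntop(AF_INET, &address->sin_addr, text, sizeof(text));
  freeaddrinfo(results);

  if (converted == nullptr) {
    if (error != nullptr) {
      *error = "inet_ntop failed";
    }
    return "";
  }
  return text;
}
```

Calling `firstIPv4(localHostname(), &error)` from a small program started inside the container's namespaces should show whether the container ID resolves there; an empty result with `error` set would match the failure in the log.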
[jira] [Commented] (MESOS-10190) libprocess fails with "Failed to obtain the IP address for " when using CNI on some hosts
[ https://issues.apache.org/jira/browse/MESOS-10190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17203933#comment-17203933 ]

acecile555 commented on MESOS-10190:

I might have found something here:

/run/mesos/isolators/network/cni/fef2fe5a-52a8-428d-bd8f-710fda758c9d/mesos/eth0/network.info
{
  "cniVersion": "0.2.0",
  "ip4": {
    "ip": "10.96.0.73/32"
  },
  "dns": {}
}

I'm not sure what this file is for, but eth0 is definitely not correct here: the machine uses bonding, so the correct interface is bond0...
[jira] [Commented] (MESOS-10190) libprocess fails with "Failed to obtain the IP address for " when using CNI on some hosts
[ https://issues.apache.org/jira/browse/MESOS-10190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17203930#comment-17203930 ]

acecile555 commented on MESOS-10190:

So I checked the slave logs and managed to spot this:

I0929 15:31:31.490895 24317 cni.cpp:974] Bind mounted '/proc/30726/ns/net' to '/run/mesos/isolators/network/cni/8abbbf69-80c6-4d5b-b424-8aeda80504c7/ns' for container 8abbbf69-80c6-4d5b-b424-8aeda80504c7
I0929 15:31:31.503661 24310 cni.cpp:974] Bind mounted '/proc/30727/ns/net' to '/run/mesos/isolators/network/cni/accdf85c-49da-4943-92e0-a208a40180da/ns' for container accdf85c-49da-4943-92e0-a208a40180da
I0929 15:31:33.723649 24318 cni.cpp:1731] Unmounted the network namespace handle '/run/mesos/isolators/network/cni/8abbbf69-80c6-4d5b-b424-8aeda80504c7/ns' for container 8abbbf69-80c6-4d5b-b424-8aeda80504c7
I0929 15:31:33.747712 24318 cni.cpp:1731] Unmounted the network namespace handle '/run/mesos/isolators/network/cni/accdf85c-49da-4943-92e0-a208a40180da/ns' for container accdf85c-49da-4943-92e0-a208a40180da

The roughly two seconds between creation and destruction are more than enough for me to capture the actual content. So here is an example of the generated files for container fef2fe5a-52a8-428d-bd8f-710fda758c9d:

/run/mesos/isolators/network/cni/fef2fe5a-52a8-428d-bd8f-710fda758c9d/hostname
fef2fe5a-52a8-428d-bd8f-710fda758c9d

/run/mesos/isolators/network/cni/fef2fe5a-52a8-428d-bd8f-710fda758c9d/hosts
127.0.0.1 localhost
10.96.0.73 fef2fe5a-52a8-428d-bd8f-710fda758c9d

Both files look OK to me, but I'm not sure how they are then "overlay-mounted"; under another ID?
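To check captured `hosts` content like the above without eyeballing it, the hosts-file format can be scanned for the container ID. This is a minimal sketch under simplifying assumptions: `lookupInHosts` is a hypothetical helper, it works on the file content passed as a string, and it compares names case-sensitively, which is stricter than a real resolver.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: returns the address on the first hosts-format
// line that lists `name` as a hostname or alias, or an empty string
// if no line does.
std::string lookupInHosts(const std::string& hostsContent,
                          const std::string& name) {
  std::istringstream lines(hostsContent);
  std::string line;

  while (std::getline(lines, line)) {
    // Everything after '#' is a comment in the hosts format.
    const std::string::size_type hash = line.find('#');
    if (hash != std::string::npos) {
      line.erase(hash);
    }

    // First field is the address, the remaining fields are names.
    std::istringstream fields(line);
    std::string address;
    if (!(fields >> address)) {
      continue;  // Blank or comment-only line.
    }

    std::string alias;
    while (fields >> alias) {
      if (alias == name) {
        return address;
      }
    }
  }

  return "";
}
```

With the two lines captured above as input, looking up `fef2fe5a-52a8-428d-bd8f-710fda758c9d` yields `10.96.0.73`. That only confirms the file content is well formed; it does not show whether the file is actually visible as `/etc/hosts` inside the container.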
[jira] [Commented] (MESOS-10190) libprocess fails with "Failed to obtain the IP address for " when using CNI on some hosts
[ https://issues.apache.org/jira/browse/MESOS-10190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17203878#comment-17203878 ]

acecile555 commented on MESOS-10190:

Hello [~qianzhang],

How can I check the "mount namespace" content? The container fails to start, so I guess it gets deleted. I guess the only way is to modify the code you linked so that it outputs some debugging information?

Sadly, I'm not skilled enough to play with gdb, but I can rebuild Mesos with added debug info printed to stderr.

Regards, Adam.
[jira] [Commented] (MESOS-10190) libprocess fails with "Failed to obtain the IP address for " when using CNI on some hosts
[ https://issues.apache.org/jira/browse/MESOS-10190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17202779#comment-17202779 ]

Qian Zhang commented on MESOS-10190:

[~acecile555] Yes, we set the container's hostname to its container ID (in UUID format) by writing the container ID into the `/etc/hostname` file in the container's mount namespace, and we also write `container-IP container-ID` into the container's `/etc/hosts`, so libprocess should usually be able to get the container's IP.

I'd suggest checking whether the `/etc/hostname` and `/etc/hosts` files are correctly written by Mesos for your containers. You can use gdb to start or attach to the Mesos agent and step into [this method|https://github.com/apache/mesos/blob/1.10.0/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L997] to check whether those files are correctly updated.
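The file contents described above can be sketched as follows, which gives something to compare captured files against. This is an illustration only and not the Mesos implementation: `NetworkFiles`, `stripPrefixLength` and `renderNetworkFiles` are hypothetical names, and the `127.0.0.1 localhost` line is an assumption taken from the `hosts` file captured elsewhere in this thread.

```cpp
#include <string>

// Hypothetical container for the two expected file contents.
struct NetworkFiles {
  std::string hostname;  // Expected content of /etc/hostname.
  std::string hosts;     // Expected content of /etc/hosts.
};

// Strips an optional "/prefix-length" suffix from a CIDR string,
// e.g. the "ip" value found in a CNI network.info file.
std::string stripPrefixLength(const std::string& cidr) {
  // find() returns npos when there is no '/', and substr(0, npos)
  // then returns the whole string unchanged.
  return cidr.substr(0, cidr.find('/'));
}

// Builds the expected contents: the container ID as the hostname, and
// a "container-IP container-ID" entry in the hosts file.
NetworkFiles renderNetworkFiles(const std::string& containerId,
                                const std::string& cidr) {
  NetworkFiles files;
  files.hostname = containerId + "\n";
  files.hosts = "127.0.0.1 localhost\n" +
                stripPrefixLength(cidr) + " " + containerId + "\n";
  return files;
}
```

If the files written on the agent match this shape but the lookup still fails, the more likely suspect is whether they are mounted into the container's mount namespace rather than what they contain.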
[jira] [Commented] (MESOS-10190) libprocess fails with "Failed to obtain the IP address for " when using CNI on some hosts
[ https://issues.apache.org/jira/browse/MESOS-10190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201746#comment-17201746 ]

Benjamin Mahler commented on MESOS-10190:

cc [~qianzhang]
[jira] [Commented] (MESOS-10190) libprocess fails with "Failed to obtain the IP address for " when using CNI on some hosts
[ https://issues.apache.org/jira/browse/MESOS-10190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199903#comment-17199903 ]

acecile555 commented on MESOS-10190:

Hello,

I rebuilt Mesos with a patch that logs the gethostname call in libprocess. On both a working agent and a non-working one, the call yields a UUID related to the container, so I guess this is normal behavior.

I'm willing to dig deeper into the debugging, but I'm missing some information now. Can someone give me a hint about how libprocess is supposed to resolve the container UUID into an IP address?

Best regards, Adam.