[jira] [Commented] (MESOS-10190) libprocess fails with "Failed to obtain the IP address for " when using CNI on some hosts

2020-09-29 Thread acecile5555555 (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203933#comment-17203933
 ] 

acecile555 commented on MESOS-10190:


Might have something here:

 

/run/mesos/isolators/network/cni/fef2fe5a-52a8-428d-bd8f-710fda758c9d/mesos/eth0/network.info

{
  "cniVersion": "0.2.0",
  "ip4": {
    "ip": "10.96.0.73/32"
  },
  "dns": {}
}

 

I'm not sure what the purpose of this file is, but eth0 is definitely not 
correct here: the machine uses bonding, so the correct interface is bond0.
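
For readers following along, the "ip" value in network.info is in CIDR notation (address, slash, prefix length). A minimal sketch of splitting it, assuming the value always has the form shown above; splitCidr and exampleSplit are hypothetical helpers for illustration and are not part of Mesos:

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <utility>

// Splits a CIDR string such as "10.96.0.73/32" into the bare address and
// the prefix length. Returns {"", -1} when there is no '/' separator or
// nothing follows it. Hypothetical helper, not part of Mesos.
std::pair<std::string, int> splitCidr(const std::string& cidr) {
  const std::string::size_type slash = cidr.find('/');
  if (slash == std::string::npos || slash + 1 >= cidr.size()) {
    return {"", -1};
  }
  return {cidr.substr(0, slash), std::atoi(cidr.substr(slash + 1).c_str())};
}

// Usage example with the value from the network.info file above.
void exampleSplit() {
  const std::pair<std::string, int> parts = splitCidr("10.96.0.73/32");
  assert(parts.first == "10.96.0.73");
  assert(parts.second == 32);
}
```

The address part is the same 10.96.0.73 that appears in the container's generated hosts file.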

> libprocess fails with "Failed to obtain the IP address for " when using 
> CNI on some hosts
> ---
>
> Key: MESOS-10190
> URL: https://issues.apache.org/jira/browse/MESOS-10190
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Affects Versions: 1.9.0
>Reporter: acecile555
>Priority: Major
>
> Hello,
>  
> We deployed CNI support and 3 of our hosts (all identical) fail to 
> start containers with CNI enabled. The log file is:
> {noformat}
> E0917 16:58:11.481551 16770 process.cpp:1153] EXIT with status 1: Failed to 
> obtain the IP address for '7c4beac7-5385-4dfa-845a-beb01e13c77c'; the DNS 
> service may not be able to resolve it: Name or service not known{noformat}
> So I tried forcing LIBPROCESS_IP through an environment variable, but I saw 
> that Mesos overwrites it. I then rebuilt Mesos with additional debugging, and 
> here is the log:
> {noformat}
> Overwriting environment variable 'LIBPROCESS_IP' from '10.99.50.3' to 
> '0.0.0.0'
> E0917 16:34:49.779429 31428 process.cpp:1153] EXIT with status 1: Failed to 
> obtain the IP address for 'de65bbd8-b237-4884-ba87-7e13cb85078f'; the DNS 
> service may not be able to resolve it: Name or service not known{noformat}
> According to the code, it's expected to be set to 0.0.0.0 (MESOS-5127). So I 
> tried to understand why libprocess attempts to resolve a container run UUID 
> instead of the hostname. Here is the libprocess code:
>  
> {noformat}
> // Resolve the hostname if ip is 0.0.0.0 in case we actually have
> // a valid external IP address. Note that we need only one IP
> // address, so that other processes can send and receive and
> // don't get confused as to whom they are sending to.
> if (__address__.ip.isAny()) {
>   char hostname[512];
>
>   if (gethostname(hostname, sizeof(hostname)) < 0) {
>     PLOG(FATAL) << "Failed to initialize, gethostname";
>   }
>
>   // Lookup an IP address of local hostname, taking the first result.
>   Try<net::IP> ip = net::getIP(hostname, __address__.ip.family());
>
>   if (ip.isError()) {
>     EXIT(EXIT_FAILURE)
>       << "Failed to obtain the IP address for '" << hostname << "';"
>       << " the DNS service may not be able to resolve it: " << ip.error();
>   }
>
>   __address__.ip = ip.get();
> }
> {noformat}
>  
> Well, actually this is perfectly fine, except that "gethostname" returns the 
> container UUID instead of a valid hostname. How is that even possible?
>  
> Any help would be greatly appreciated.
> Regards, Adam.
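
The failing path quoted above boils down to two libc calls: gethostname() followed by a name lookup. The sketch below reproduces that sequence outside Mesos, assuming the lookup behaves like a plain getaddrinfo() call restricted to IPv4. It is an approximation for experimenting on an affected host, not the libprocess implementation, and the helper names localHostname, firstIPv4 and exampleNumericLookup are made up for illustration:

```cpp
#include <arpa/inet.h>
#include <netdb.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cassert>
#include <string>

// Returns the hostname as seen by the calling process, or "" on failure.
// In a process with its own UTS namespace this can differ from the name
// of the machine.
std::string localHostname() {
  char buffer[512];
  if (gethostname(buffer, sizeof(buffer)) < 0) {
    return "";
  }
  // gethostname() may leave the buffer unterminated if the name is truncated.
  buffer[sizeof(buffer) - 1] = '\0';
  return buffer;
}

// Returns the first IPv4 address that getaddrinfo() reports for `name`,
// or "" when the name cannot be resolved (neither hosts file nor DNS).
std::string firstIPv4(const std::string& name) {
  addrinfo hints = {};
  hints.ai_family = AF_INET;
  hints.ai_socktype = SOCK_STREAM;

  addrinfo* result = nullptr;
  if (getaddrinfo(name.c_str(), nullptr, &hints, &result) != 0 ||
      result == nullptr) {
    return "";
  }

  char text[INET_ADDRSTRLEN] = {0};
  const sockaddr_in* address =
      reinterpret_cast<const sockaddr_in*>(result->ai_addr);
  const char* ok = inet_ntop(AF_INET, &address->sin_addr, text, sizeof(text));
  freeaddrinfo(result);
  return ok != nullptr ? std::string(text) : std::string();
}

// Usage example: a numeric address resolves without hosts file or DNS.
void exampleNumericLookup() {
  assert(firstIPv4("127.0.0.1") == "127.0.0.1");
}
```

Calling firstIPv4(localHostname()) from inside the container's namespaces would show whether the container UUID resolves there; an empty result corresponds to the "Name or service not known" failure in the log.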



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10190) libprocess fails with "Failed to obtain the IP address for " when using CNI on some hosts

2020-09-29 Thread acecile5555555 (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203930#comment-17203930
 ] 

acecile555 commented on MESOS-10190:


So I checked the slave logs and managed to spot this:

 

I0929 15:31:31.490895 24317 cni.cpp:974] Bind mounted '/proc/30726/ns/net' to 
'/run/mesos/isolators/network/cni/8abbbf69-80c6-4d5b-b424-8aeda80504c7/ns' for 
container 8abbbf69-80c6-4d5b-b424-8aeda80504c7
I0929 15:31:31.503661 24310 cni.cpp:974] Bind mounted '/proc/30727/ns/net' to 
'/run/mesos/isolators/network/cni/accdf85c-49da-4943-92e0-a208a40180da/ns' for 
container accdf85c-49da-4943-92e0-a208a40180da
I0929 15:31:33.723649 24318 cni.cpp:1731] Unmounted the network namespace 
handle 
'/run/mesos/isolators/network/cni/8abbbf69-80c6-4d5b-b424-8aeda80504c7/ns' for 
container 8abbbf69-80c6-4d5b-b424-8aeda80504c7
I0929 15:31:33.747712 24318 cni.cpp:1731] Unmounted the network namespace 
handle 
'/run/mesos/isolators/network/cni/accdf85c-49da-4943-92e0-a208a40180da/ns' for 
container accdf85c-49da-4943-92e0-a208a40180da

 

One second between creation and destruction is more than enough for me to 
capture the actual content.

 

So here is an example of the generated files for container 
fef2fe5a-52a8-428d-bd8f-710fda758c9d:

/run/mesos/isolators/network/cni/fef2fe5a-52a8-428d-bd8f-710fda758c9d/hostname

fef2fe5a-52a8-428d-bd8f-710fda758c9d

/run/mesos/isolators/network/cni/fef2fe5a-52a8-428d-bd8f-710fda758c9d/hosts

127.0.0.1 localhost
10.96.0.73 fef2fe5a-52a8-428d-bd8f-710fda758c9d

 

Both files look OK to me, but I'm not sure how they are "overlay-mounted" 
afterwards. Any other ideas?
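
To make "look OK" concrete: if the generated hosts file were the one consulted inside the container, the container UUID would map to 10.96.0.73. The sketch below performs that lookup on hosts-formatted text. It is a simplification (real resolvers compare names case-insensitively and consult other sources as well), and lookupInHosts and exampleLookup are hypothetical helpers, not Mesos or libc functions:

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Looks up `name` in hosts-file formatted text and returns the address of
// the first line that lists it, or "" when no line does.
// Hypothetical helper for illustration only.
std::string lookupInHosts(const std::string& hosts, const std::string& name) {
  std::istringstream lines(hosts);
  std::string line;
  while (std::getline(lines, line)) {
    // Drop a trailing comment, if any.
    const std::string::size_type hash = line.find('#');
    if (hash != std::string::npos) {
      line.erase(hash);
    }

    std::istringstream fields(line);
    std::string address;
    if (!(fields >> address)) {
      continue;  // Blank or comment-only line.
    }

    // Every remaining field on the line is a name for `address`.
    std::string alias;
    while (fields >> alias) {
      if (alias == name) {
        return address;
      }
    }
  }
  return "";
}

// Usage example with the generated hosts file shown above.
void exampleLookup() {
  const std::string hosts =
      "127.0.0.1 localhost\n"
      "10.96.0.73 fef2fe5a-52a8-428d-bd8f-710fda758c9d\n";
  assert(lookupInHosts(hosts, "fef2fe5a-52a8-428d-bd8f-710fda758c9d") ==
         "10.96.0.73");
}
```

If the lookup still fails at runtime, one possible reading is that this file is not the one visible as /etc/hosts inside the container, which is what the mount question is about.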






[jira] [Commented] (MESOS-10188) Master check failure : scalars does not contain agent

2020-09-29 Thread Andrei Sekretenko (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203898#comment-17203898
 ] 

Andrei Sekretenko commented on MESOS-10188:
---

Given that we don't see this bug triggered in other environments and that I 
wasn't able to reproduce this in simple tests, I've lowered the priority to 
"Major".

[~Jerome Soussens] Can you please update this issue if you find something or 
manage to reproduce this again? A stacktrace of this crash would be *very* 
helpful.

> Master check failure : scalars does not contain agent
> -
>
> Key: MESOS-10188
> URL: https://issues.apache.org/jira/browse/MESOS-10188
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Jerome Soussens
>Priority: Major
> Attachments: image-2020-09-14-10-07-42-622.png, 
> mesos-master.dev-eu-w-01-sgmm-0-0.root.log.ERROR.20200911-064325.46082-20200912.gz,
>  
> mesos-master.dev-eu-w-01-sgmm-0-0.root.log.FATAL.20200911-064325.46082-20200912.gz,
>  
> mesos-master.dev-eu-w-01-sgmm-0-0.root.log.INFO.20200910-200737.46082-20200911.gz,
>  
> mesos-master.dev-eu-w-01-sgmm-0-0.root.log.WARNING.20200830-004426.46082-20200911.gz
>
>
> Mesos master restarted with the error message :
> {code:java}
> F0911 06:43:25.109040 46181 hierarchical.cpp:232] Check failed: scalars does 
> not contain 8f1f65e8-c38d-4563-bfba-eaa079271b2b-S732{code}
> See attached log files.
> FYI, agent S732 had a network outage between 06:40 and 06:44:
> !image-2020-09-14-10-07-42-622.png|width=1545,height=435!
>  
> At the end of the outage, the Mesos master logged the following:
> {code:java}
> I0911 06:43:20.392347 46184 master.cpp:6513] Received reregister agent 
> message from agent 8f1f65e8-c38d-4563-bfba-eaa079271b2b-S732 at 
> slave(1)@172.17.50.35:5051 (dev-eu-w-03-sgma-1)
> W0911 06:43:20.421454 46191 master.cpp:10618] Possibly orphaned completed 
> task b92038e7-b42c-4e23-ae55-9be4325a4d32 of framework 
> d65e2494-c7c5-456b-aad6-fc44cadf2f50 that ran on agent 
> 8f1f65e8-c38d-4563-bfba-eaa079271b2b-S732 at slave(1)@172.17.50.35:5051 
> (dev-eu-w-03-sgma-1)
> {code}





[jira] [Commented] (MESOS-10190) libprocess fails with "Failed to obtain the IP address for " when using CNI on some hosts

2020-09-29 Thread acecile5555555 (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203878#comment-17203878
 ] 

acecile555 commented on MESOS-10190:


Hello [~qianzhang]

 

How can I check the content of the mount namespace? The container fails to 
start, so I guess the namespace gets deleted.

I guess the only way is to modify the code you linked so that it outputs some 
debugging information? Sadly, I'm not skilled enough to use gdb, but I can 
rebuild Mesos with extra debug information printed to stderr.

 

Regards, Adam.






[jira] [Commented] (MESOS-10153) Implement the `prepare` method of the `volume/csi` isolator

2020-09-29 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203728#comment-17203728
 ] 

Qian Zhang commented on MESOS-10153:


commit 8700dd8d5ece658804d7b7a40863800dcc5c72bc
Author: Qian Zhang 
Date: Sat Sep 19 11:11:04 2020 +0800

Inferred CSI volume's `readonly` field from volume mode.
 
 Review: https://reviews.apache.org/r/72888

> Implement the `prepare` method of the `volume/csi` isolator
> ---
>
> Key: MESOS-10153
> URL: https://issues.apache.org/jira/browse/MESOS-10153
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
> Fix For: 1.11.0
>
>



