[jira] [Commented] (MESOS-7975) The command/default executor can incorrectly send a TASK_FINISHED update even when the task is killed
[ https://issues.apache.org/jira/browse/MESOS-7975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175836#comment-16175836 ]

Qian Zhang commented on MESOS-7975:

[~alexr] I have sent a mail to the lists just now, let's wait for the feedback from the community.

> The command/default executor can incorrectly send a TASK_FINISHED update even
> when the task is killed
>
> Key: MESOS-7975
> URL: https://issues.apache.org/jira/browse/MESOS-7975
> Project: Mesos
> Issue Type: Bug
> Reporter: Anand Mazumdar
> Assignee: Qian Zhang
> Priority: Critical
> Labels: mesosphere
>
> Currently, when a task is killed, the default and the command executor
> incorrectly send a {{TASK_FINISHED}} status update instead of
> {{TASK_KILLED}}. This is due to an unfortunate missed conditional check when
> the task exits with a zero status code.
>
> {code}
> if (WSUCCEEDED(status)) {
>   taskState = TASK_FINISHED;
> } else if (killed) {
>   // Send TASK_KILLED if the task was killed as a result of
>   // kill() or shutdown().
>   taskState = TASK_KILLED;
> } else {
>   taskState = TASK_FAILED;
> }
> {code}
>
> We should modify the code to correctly send {{TASK_KILLED}} status updates
> when a task is killed.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (MESOS-7975) The command/default executor can incorrectly send a TASK_FINISHED update even when the task is killed
[ https://issues.apache.org/jira/browse/MESOS-7975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175793#comment-16175793 ]

Qian Zhang commented on MESOS-7975:

[~jpe...@apache.org] When the scheduler sends a kill, will your executor send a SIGTERM to the task or a SIGKILL? If it is SIGTERM, and the task handles it gracefully and exits with 0, do you think it is reasonable for the executor to send a TASK_FINISHED in this case?
[jira] [Commented] (MESOS-7962) Display task state counters in the framework page of the webui.
[ https://issues.apache.org/jira/browse/MESOS-7962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175636#comment-16175636 ]

ASF GitHub Bot commented on MESOS-7962:

Github user asfgit closed the pull request at:
https://github.com/apache/mesos/pull/234

> Display task state counters in the framework page of the webui.
>
> Key: MESOS-7962
> URL: https://issues.apache.org/jira/browse/MESOS-7962
> Project: Mesos
> Issue Type: Improvement
> Components: webui
> Reporter: Benjamin Mahler
> Assignee: Tomasz Janiszewski
>
> Currently the webui displays task state counters across all frameworks on the
> home page, but it does not display the per-framework task state counters when
> you click in to a particular framework. We should add the task state counters
> to the per-framework page.
[jira] [Commented] (MESOS-2657) Support multiple reasons in status update message.
[ https://issues.apache.org/jira/browse/MESOS-2657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175579#comment-16175579 ]

James Peach commented on MESOS-2657:

[~haosd...@gmail.com] [~jieyu] As part of the refactoring for MESOS-7963, I'm planning to remove the multiple reasons in the {{ContainerTermination}} message. I think the main use case for that was supporting multiple limitations from isolators, but that never worked and I'm removing that as well :) Please let me know if you see any problems with this.

> Support multiple reasons in status update message.
>
> Key: MESOS-2657
> URL: https://issues.apache.org/jira/browse/MESOS-2657
> Project: Mesos
> Issue Type: Improvement
> Reporter: Jie Yu
> Assignee: haosdent
>
> Sometimes, a single reason in the status update message makes it very hard
> for frameworks to understand the cause of a status update. For example, we
> have REASON_EXECUTOR_TERMINATED, but that's a very general reason and
> sometimes we want a sub-reason for that (e.g., REASON_CONTAINER_LAUNCH_FAILED)
> so that the framework can better react to the status update.
> We could change the 'reason' field in TaskStatus to be a repeated field (should
> be backward compatible). For instance, for a containerizer launch failure, we
> probably need two reasons for TASK_LOST: 1) the top level reason
> REASON_EXECUTOR_TERMINATED; 2) the second level reason
> REASON_CONTAINER_LAUNCH_FAILED.
> Another example: we may want to have a generic reason when a resource limit is
> reached, REASON_RESOURCE_LIMIT_EXCEEDED, and a second level sub-reason,
> REASON_OUT_OF_MEMORY.
[jira] [Commented] (MESOS-7990) Support systemd named hierarchy (name=systemd) for Mesos Containerizer.
[ https://issues.apache.org/jira/browse/MESOS-7990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175460#comment-16175460 ]

Jie Yu commented on MESOS-7990:

[~jasonlai] Yes, I am aware of that (the naming convention). In fact, we're not supposed to touch the named systemd cgroup hierarchy manually. However, major container orchestrators (Docker, k8s) all manipulate the systemd cgroup hierarchy directly. They all have an alternative mode that supports systemd more natively (using machined or a system slice for containers, like rkt does). We want to support both too. The native systemd support will be added later. This ticket is for the cgroupfs support.

> Support systemd named hierarchy (name=systemd) for Mesos Containerizer.
>
> Key: MESOS-7990
> URL: https://issues.apache.org/jira/browse/MESOS-7990
> Project: Mesos
> Issue Type: Improvement
> Components: containerization
> Reporter: Jie Yu
>
> Similar to docker's cgroupfs cgroup driver, we should create cgroups under
> /sys/fs/cgroup/systemd (if it exists), and move the container pid into the
> corresponding cgroup (/sys/fs/cgroup/systemd/mesos/).
> This can give us a bunch of benefits:
> 1) systemd-cgls can list mesos containers
> 2) systemd-cgtop can show stats for mesos containers
> ...
[jira] [Commented] (MESOS-7990) Support systemd named hierarchy (name=systemd) for Mesos Containerizer.
[ https://issues.apache.org/jira/browse/MESOS-7990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175448#comment-16175448 ]

Jason Lai commented on MESOS-7990:

I'm all for the systemd support, but it isn't as simple as {{/systemd/mesos/}}, as systemd has imposed some conventions on tasks' cgroup names. There are some references [here|https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/resource_management_guide/sec-default_cgroup_hierarchies]. AFAIK, rkt has aligned pretty well with the systemd conventions for their containers. It would be worth looking at what they're doing.
[jira] [Comment Edited] (MESOS-7999) Add and document ability to expose new /monitor modules on agents
[ https://issues.apache.org/jira/browse/MESOS-7999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175118#comment-16175118 ]

James Peach edited comment on MESOS-7999 at 9/21/17 7:56 PM:

You can write an anonymous Mesos module that uses the libprocess [metrics|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/include/process/metrics/metrics.hpp#L94] API to expose metrics into {{/metrics/snapshot}}.

was (Author: jamespeach):
You can write an anonymous Mesos module that uses the libprocess [metrics|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/include/process/metrics/metrics.hpp#L94] API to expose metrics into {{/metrics/snapshot}}/

> Add and document ability to expose new /monitor modules on agents
>
> Key: MESOS-7999
> URL: https://issues.apache.org/jira/browse/MESOS-7999
> Project: Mesos
> Issue Type: Wish
> Components: agent, json api, modules, statistics
> Reporter: Charles Allen
>
> When looking at how to collect data about the cluster, the best way to
> support functionality similar to Kubernetes DaemonSets is not completely
> clear.
> One key use case for DaemonSets is a monitor for system metrics. This ask is
> that agents are able to have a module which either exposes new endpoints in
> {{/monitor}} or allows pluggable entries to be added to
> {{/monitor/statistics}}.
[jira] [Commented] (MESOS-7312) Update Resource proto for storage resource providers.
[ https://issues.apache.org/jira/browse/MESOS-7312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175316#comment-16175316 ]

Benjamin Bannier commented on MESOS-7312:

{noformat}
commit 91e279ad1855ac7f1ae628778731173aa603d5e3
Author: Benjamin Bannier
Date:   Thu Sep 21 15:03:22 2017 +0200

    Added 'id' and 'metadata' fields to 'Resource.DiskInfo.Source'.

    IDs will allow to create distinguishable resources, e.g., of RAW or
    BLOCK type. We also add a metadata field which can be used to expose
    additional disk information.

    Review: https://reviews.apache.org/r/58048/
{noformat}

> Update Resource proto for storage resource providers.
>
> Key: MESOS-7312
> URL: https://issues.apache.org/jira/browse/MESOS-7312
> Project: Mesos
> Issue Type: Bug
> Reporter: Benjamin Bannier
> Assignee: Benjamin Bannier
> Labels: storage
>
> Storage resource provider support requires a number of changes to the
> {{Resource}} proto:
> * support for {{RAW}} and {{BLOCK}} type {{Resource::DiskInfo::Source}}
> * {{ResourceProviderID}} in Resource
> * {{Resource::DiskInfo::Source::Path}} should be {{optional}}.
[jira] [Commented] (MESOS-7975) The command/default executor can incorrectly send a TASK_FINISHED update even when the task is killed
[ https://issues.apache.org/jira/browse/MESOS-7975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175135#comment-16175135 ]

James Peach commented on MESOS-7975:

FWIW the rule we have in our executor is that if we terminated a task because the scheduler sent a kill, we always send a {{TASK_KILLED}} status. That is the only reason we send this status.
[jira] [Updated] (MESOS-8003) PersistentVolumeEndpointsTest.SlavesEndpointFullResources is flaky.
[ https://issues.apache.org/jira/browse/MESOS-8003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Rukletsov updated MESOS-8003:

Attachment: PersistentVolumeEndpointsTest.SlavesEndpointFullResources_badrun.txt

> PersistentVolumeEndpointsTest.SlavesEndpointFullResources is flaky.
>
> Key: MESOS-8003
> URL: https://issues.apache.org/jira/browse/MESOS-8003
> Project: Mesos
> Issue Type: Bug
> Components: test
> Affects Versions: 1.5.0
> Environment: Fedora 23
> Reporter: Alexander Rukletsov
> Labels: flaky-test, mesosphere
> Attachments: PersistentVolumeEndpointsTest.SlavesEndpointFullResources_badrun.txt
>
> Observed on internal CI:
> {noformat}
> ../../src/tests/persistent_volume_endpoints_tests.cpp:1952
> Value of: (response).get().status
>   Actual: "409 Conflict"
> Expected: Accepted().status
> Which is: "202 Accepted"
> {noformat}
> Full log attached.
[jira] [Commented] (MESOS-7999) Add and document ability to expose new /monitor modules on agents
[ https://issues.apache.org/jira/browse/MESOS-7999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175118#comment-16175118 ]

James Peach commented on MESOS-7999:

You can write an anonymous Mesos module that uses the libprocess [metrics|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/include/process/metrics/metrics.hpp#L94] API to expose metrics into {{/metrics/snapshot}}.
[jira] [Created] (MESOS-8003) PersistentVolumeEndpointsTest.SlavesEndpointFullResources is flaky.
Alexander Rukletsov created MESOS-8003:

Summary: PersistentVolumeEndpointsTest.SlavesEndpointFullResources is flaky.
Key: MESOS-8003
URL: https://issues.apache.org/jira/browse/MESOS-8003
Project: Mesos
Issue Type: Bug
Components: test
Affects Versions: 1.5.0
Environment: Fedora 23
Reporter: Alexander Rukletsov

Observed on internal CI:

{noformat}
../../src/tests/persistent_volume_endpoints_tests.cpp:1952
Value of: (response).get().status
  Actual: "409 Conflict"
Expected: Accepted().status
Which is: "202 Accepted"
{noformat}

Full log attached.
[jira] [Updated] (MESOS-8001) PersistentVolumeEndpointsTest.NoAuthentication is flaky.
[ https://issues.apache.org/jira/browse/MESOS-8001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Rukletsov updated MESOS-8001:

Attachment: PersistentVolumeEndpointsTest.NoAuthentication_badrun.txt

> PersistentVolumeEndpointsTest.NoAuthentication is flaky.
>
> Key: MESOS-8001
> URL: https://issues.apache.org/jira/browse/MESOS-8001
> Project: Mesos
> Issue Type: Bug
> Components: test
> Affects Versions: 1.5.0
> Environment: Ubuntu 16.04 with SSL
> Reporter: Alexander Rukletsov
> Labels: flaky-test, mesosphere
> Attachments: PersistentVolumeEndpointsTest.NoAuthentication_badrun.txt
>
> Observed a failure on internal CI:
> {noformat}
> ../../src/tests/persistent_volume_endpoints_tests.cpp:1385
> Value of: (response).get().status
>   Actual: "409 Conflict"
> Expected: Accepted().status
> Which is: "202 Accepted"
> {noformat}
> Full log attached.
[jira] [Created] (MESOS-8002) Marathon can't start on macOS 10.12.x with Mesos 1.3.0
Alex Lee created MESOS-8002:

Summary: Marathon can't start on macOS 10.12.x with Mesos 1.3.0
Key: MESOS-8002
URL: https://issues.apache.org/jira/browse/MESOS-8002
Project: Mesos
Issue Type: Bug
Components: master
Affects Versions: 1.3.0
Environment: macOS 10.12.x
Reporter: Alex Lee

We upgraded our Mesos cluster to 1.3.0 and ran into the following error when starting Marathon 1.4.7:

```
I0823 17:19:17.498087 101744640 group.cpp:340] Group process (zookeeper-group(1)@127.0.0.1:57708) connected to ZooKeeper
I0823 17:19:17.498652 101744640 group.cpp:830] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0823 17:19:17.499153 101744640 group.cpp:418] Trying to create path '/mesos/master' in ZooKeeper
Assertion failed: (0), function hash, file /BuildRoot/Library/Caches/com.apple.xbs/Sources/cmph/cmph-6/src/hash.c, line 35.
```

This was reported in: https://jira.mesosphere.com/browse/MARATHON-7727

Interestingly, Marathon was able to start in the same cluster on a macOS 10.11.6 host. We initially suspected an OS version issue and opened an issue with Apple, but the macOS team responded that there may be a regression in Mesos: the assertion is being raised in libcmph, which libmesos.dylib invokes with invalid input, and the hash functions in libcmph don't look like they've changed between 10.11.6 and 10.12.6, at least with respect to that assert(0) being around.
[jira] [Commented] (MESOS-7963) Task groups can lose the container limitation status.
[ https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16174971#comment-16174971 ]

Qian Zhang commented on MESOS-7963:

Can you let me know what the special case is? I think that currently, when the default executor gets a limitation, it will kill all the other nested containers and then terminate itself; I do not think we need to change this. And even without my proposal (i.e., raising the limitation only for the root container), all the nested containers would be killed as well (by the Mesos containerizer), so the result is the same. I am not sure when we would need to restart a nested container.

> Task groups can lose the container limitation status.
>
> Key: MESOS-7963
> URL: https://issues.apache.org/jira/browse/MESOS-7963
> Project: Mesos
> Issue Type: Bug
> Components: containerization, executor
> Reporter: James Peach
>
> If you run a single task in a task group and that task fails with a container
> limitation, that status update can be lost and only the executor failure will
> be reported to the framework.
> {noformat} > exec /opt/mesos/bin/mesos-execute --content_type=json > --master=jpeach.apple.com:5050 '--task_group={ > "tasks": > [ > { > "name": "7f141aca-55fe-4bb0-af4b-87f5ee26986a", > "task_id": {"value" : "2866368d-7279-4657-b8eb-bf1d968e8ebf"}, > "agent_id": {"value" : ""}, > "resources": [{ > "name": "cpus", > "type": "SCALAR", > "scalar": { > "value": 0.2 > } > }, { > "name": "mem", > "type": "SCALAR", > "scalar": { > "value": 32 > } > }, { > "name": "disk", > "type": "SCALAR", > "scalar": { > "value": 2 > } > } > ], > "command": { > "value": "sleep 2 ; /usr/bin/dd if=/dev/zero of=out.dat bs=1M > count=64 ; sleep 1" > } > } > ] > }' > I0911 11:48:01.480689 7340 scheduler.cpp:184] Version: 1.5.0 > I0911 11:48:01.488868 7339 scheduler.cpp:470] New master detected at > master@17.228.224.108:5050 > Subscribed with ID aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010 > Submitted task group with tasks [ 2866368d-7279-4657-b8eb-bf1d968e8ebf ] to > agent 'aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-S0' > Received status update TASK_RUNNING for task > '2866368d-7279-4657-b8eb-bf1d968e8ebf' > source: SOURCE_EXECUTOR > Received status update TASK_FAILED for task > '2866368d-7279-4657-b8eb-bf1d968e8ebf' > message: 'Command terminated with signal Killed' > source: SOURCE_EXECUTOR > {noformat} > However, the agent logs show that this failed with a memory limitation: > {noformat} > I0911 11:48:02.235818 7012 http.cpp:532] Processing call > WAIT_NESTED_CONTAINER > I0911 11:48:02.236395 7013 status_update_manager.cpp:323] Received status > update TASK_RUNNING (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task > 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework > aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010 > I0911 11:48:02.237083 7016 slave.cpp:4875] Forwarding the update > TASK_RUNNING (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task > 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework > aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010 to master@17.228.224.108:5050 > I0911 11:48:02.283661 
7007 status_update_manager.cpp:395] Received status > update acknowledgement (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task > 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework > aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010 > I0911 11:48:04.771455 7014 memory.cpp:516] OOM detected for container > 474388fe-43c3-4372-b903-eaca22740996 > I0911 11:48:04.776445 7014 memory.cpp:556] Memory limit exceeded: Requested: > 64MB Maximum Used: 64MB > ... > I0911 11:48:04.776943 7012 containerizer.cpp:2681] Container > 474388fe-43c3-4372-b903-eaca22740996 has reached its limit for resource > [{"name":"mem","scalar":{"value":64.0},"type":"SCALAR"}] and will be > terminated > {noformat} > The following {{mesos-execute}} task will show the container limitation > correctly: > {noformat} > exec /opt/mesos/bin/mesos-execute --content_type=json > --master=jpeach.apple.com:5050 '--task_group={ > "tasks": > [ > { > "name": "37db08f6-4f0f-4ef6-97ee-b10a5c5cc211", > "task_id": {"value" : "1372b2e2-c501-4e80-bcbd-1a5c5194e206"}, > "agent_id": {"value" : ""}, > "resources": [{ > "name": "cpus", > "type": "SCALAR", > "scalar": { > "value": 0.2 > } >
[jira] [Commented] (MESOS-7995) libprocess tests breaking on macOS.
[ https://issues.apache.org/jira/browse/MESOS-7995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16174936#comment-16174936 ] Till Toenshoff commented on MESOS-7995: --- Downgrading from blocker as the workaround is to downgrade libevent towards 2.0.22. > libprocess tests breaking on macOS. > --- > > Key: MESOS-7995 > URL: https://issues.apache.org/jira/browse/MESOS-7995 > Project: Mesos > Issue Type: Bug > Components: libprocess, test >Affects Versions: 1.5.0 > Environment: libevent 2.1.8 >Reporter: Till Toenshoff >Priority: Blocker > > Many libprocess tests fail on macOS, some even abort. > Examples: > {noformat} > [--] 8 tests from HTTPConnectionTest > [ RUN ] HTTPConnectionTest.GzipRequestBody > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:972: Failure > Failed to wait 15secs for connect > [ FAILED ] HTTPConnectionTest.GzipRequestBody (15001 ms) > [ RUN ] HTTPConnectionTest.Serial > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1015: Failure > (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.Serial (0 ms) > [ RUN ] HTTPConnectionTest.Pipeline > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1094: Failure > (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.Pipeline (1 ms) > [ RUN ] HTTPConnectionTest.ClosingRequest > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1190: Failure > (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.ClosingRequest (0 ms) > [ RUN ] HTTPConnectionTest.ClosingResponse > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1245: Failure > (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.ClosingResponse (0 ms) > [ RUN ] HTTPConnectionTest.ReferenceCounting > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1306: Failure > (*connect).failure(): 
Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.ReferenceCounting (1 ms) > [ RUN ] HTTPConnectionTest.Equality > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1333: Failure > (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.Equality (0 ms) > [ RUN ] HTTPConnectionTest.RequestStreaming > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1360: Failure > (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.RequestStreaming (0 ms) > [--] 8 tests from HTTPConnectionTest (15003 ms total) > {noformat} > {noformat} > [--] 8 tests from HttpAuthenticationTest > [ RUN ] HttpAuthenticationTest.NoAuthenticator > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1792: Failure > (response).failure(): Failed to connect to 192.168.178.20:51437: Host is down > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1786: Failure > Actual function call count doesn't match EXPECT_CALL(*http.process, > authenticated(_, Option::none()))... 
> Expected: to be called once >Actual: never called - unsatisfied and active > [ FAILED ] HttpAuthenticationTest.NoAuthenticator (1 ms) > [ RUN ] HttpAuthenticationTest.Unauthorized > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1816: Failure > (response).failure(): Failed to connect to 192.168.178.20:51437: Host is down > WARNING: Logging before InitGoogleLogging() is written to STDERR > F0921 12:18:19.947710 2519827264 future.hpp:1151] Check failed: !isFailed() > Future::get() but state == FAILED: Failed to connect to 192.168.178.20:51437: > Host is down > *** Check failure stack trace: *** > *** Aborted at 1505989099 (unix time) try "date -d @1505989099" if you are > using GNU date *** > PC: @ 0x7fff5cd45fce __pthread_kill > *** SIGABRT (@0x7fff5cd45fce) received by PID 23916 (TID 0x7fff96318340) > stack trace: *** > @ 0x7fff5ce76f5a _sigtramp > @ 0x7fff5ac5e526 std::__1::locale::facet::__on_zero_shared() > @ 0x7fff5cca232a abort > @0x1077b9659 google::logging_fail() > @0x1077b964a google::LogMessage::Fail() > @0x1077b72fc google::LogMessage::SendToLog() > @0x1077b8089 google::LogMessage::Flush() > @0x1077c12e9 google::LogMessageFatal::~LogMessageFatal() > @0x1077b9b35 google::LogMessageFatal::~LogMessageFatal() > @0x106998ad1 process::Future<>::get() > @0x1069d4d5b HttpAuthenticationTest_Unauthorized_Test::TestBody() > @0x1070a828e > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @0x10704a96b >
[jira] [Updated] (MESOS-7995) libprocess tests breaking on macOS.
[ https://issues.apache.org/jira/browse/MESOS-7995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Till Toenshoff updated MESOS-7995:

Priority: Major (was: Blocker)
[jira] [Created] (MESOS-8001) PersistentVolumeEndpointsTest.NoAuthentication is flaky.
Alexander Rukletsov created MESOS-8001: -- Summary: PersistentVolumeEndpointsTest.NoAuthentication is flaky. Key: MESOS-8001 URL: https://issues.apache.org/jira/browse/MESOS-8001 Project: Mesos Issue Type: Bug Components: test Affects Versions: 1.5.0 Environment: Ubuntu 16.04 with SSL Reporter: Alexander Rukletsov Observed a failure on internal CI: {noformat} ../../src/tests/persistent_volume_endpoints_tests.cpp:1385 Value of: (response).get().status Actual: "409 Conflict" Expected: Accepted().status Which is: "202 Accepted" {noformat} Full log attached. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7995) libprocess tests breaking on macOS.
[ https://issues.apache.org/jira/browse/MESOS-7995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Toenshoff updated MESOS-7995: -- Environment: libevent 2.1.8 > libprocess tests breaking on macOS. > --- > > Key: MESOS-7995 > URL: https://issues.apache.org/jira/browse/MESOS-7995 > Project: Mesos > Issue Type: Bug > Components: libprocess, test >Affects Versions: 1.5.0 > Environment: libevent 2.1.8 >Reporter: Till Toenshoff >Priority: Blocker > > Many libprocess tests fail on macOS, some even abort. > Examples: > {noformat} > [--] 8 tests from HTTPConnectionTest > [ RUN ] HTTPConnectionTest.GzipRequestBody > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:972: Failure > Failed to wait 15secs for connect > [ FAILED ] HTTPConnectionTest.GzipRequestBody (15001 ms) > [ RUN ] HTTPConnectionTest.Serial > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1015: Failure > (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.Serial (0 ms) > [ RUN ] HTTPConnectionTest.Pipeline > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1094: Failure > (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.Pipeline (1 ms) > [ RUN ] HTTPConnectionTest.ClosingRequest > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1190: Failure > (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.ClosingRequest (0 ms) > [ RUN ] HTTPConnectionTest.ClosingResponse > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1245: Failure > (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.ClosingResponse (0 ms) > [ RUN ] HTTPConnectionTest.ReferenceCounting > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1306: Failure > (*connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] 
HTTPConnectionTest.ReferenceCounting (1 ms) > [ RUN ] HTTPConnectionTest.Equality > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1333: Failure > (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.Equality (0 ms) > [ RUN ] HTTPConnectionTest.RequestStreaming > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1360: Failure > (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.RequestStreaming (0 ms) > [--] 8 tests from HTTPConnectionTest (15003 ms total) > {noformat} > {noformat} > [--] 8 tests from HttpAuthenticationTest > [ RUN ] HttpAuthenticationTest.NoAuthenticator > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1792: Failure > (response).failure(): Failed to connect to 192.168.178.20:51437: Host is down > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1786: Failure > Actual function call count doesn't match EXPECT_CALL(*http.process, > authenticated(_, Option::none()))... 
> Expected: to be called once >Actual: never called - unsatisfied and active > [ FAILED ] HttpAuthenticationTest.NoAuthenticator (1 ms) > [ RUN ] HttpAuthenticationTest.Unauthorized > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1816: Failure > (response).failure(): Failed to connect to 192.168.178.20:51437: Host is down > WARNING: Logging before InitGoogleLogging() is written to STDERR > F0921 12:18:19.947710 2519827264 future.hpp:1151] Check failed: !isFailed() > Future::get() but state == FAILED: Failed to connect to 192.168.178.20:51437: > Host is down > *** Check failure stack trace: *** > *** Aborted at 1505989099 (unix time) try "date -d @1505989099" if you are > using GNU date *** > PC: @ 0x7fff5cd45fce __pthread_kill > *** SIGABRT (@0x7fff5cd45fce) received by PID 23916 (TID 0x7fff96318340) > stack trace: *** > @ 0x7fff5ce76f5a _sigtramp > @ 0x7fff5ac5e526 std::__1::locale::facet::__on_zero_shared() > @ 0x7fff5cca232a abort > @0x1077b9659 google::logging_fail() > @0x1077b964a google::LogMessage::Fail() > @0x1077b72fc google::LogMessage::SendToLog() > @0x1077b8089 google::LogMessage::Flush() > @0x1077c12e9 google::LogMessageFatal::~LogMessageFatal() > @0x1077b9b35 google::LogMessageFatal::~LogMessageFatal() > @0x106998ad1 process::Future<>::get() > @0x1069d4d5b HttpAuthenticationTest_Unauthorized_Test::TestBody() > @0x1070a828e > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @0x10704a96b > testing::internal::HandleExceptionsInMethodIfSupported<>() > @0x10704a896
[jira] [Updated] (MESOS-8000) DefaultExecutorCniTest.ROOT_VerifyContainerIP is flaky.
[ https://issues.apache.org/jira/browse/MESOS-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-8000: --- Attachment: ROOT_VerifyContainerIP_badrun.txt ROOT_VerifyContainerIP_goodrun.txt > DefaultExecutorCniTest.ROOT_VerifyContainerIP is flaky. > --- > > Key: MESOS-8000 > URL: https://issues.apache.org/jira/browse/MESOS-8000 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.5.0 > Environment: Ubuntu 16.04 >Reporter: Alexander Rukletsov > Labels: flaky-test, mesosphere > Attachments: ROOT_VerifyContainerIP_badrun.txt, > ROOT_VerifyContainerIP_goodrun.txt > > > Observed a failure on internal CI: > {noformat} > ../../src/tests/containerizer/cni_isolator_tests.cpp:1419 > Failed to wait 15secs for subscribed > {noformat} > Full log attached. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7963) Task groups can lose the container limitation status.
[ https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16174890#comment-16174890 ] James Peach commented on MESOS-7963: Right now, if an executor gets any limitation, it knows it will be terminated. The special case is that in your proposal some kinds of limitation would not cause the executor to be terminated, so the executor needs to decide how to handle that by either manually tearing everything down or restarting the nested container. > Task groups can lose the container limitation status. > - > > Key: MESOS-7963 > URL: https://issues.apache.org/jira/browse/MESOS-7963 > Project: Mesos > Issue Type: Bug > Components: containerization, executor >Reporter: James Peach > > If you run a single task in a task group and that task fails with a container > limitation, that status update can be lost and only the executor failure will > be reported to the framework. > {noformat} > exec /opt/mesos/bin/mesos-execute --content_type=json > --master=jpeach.apple.com:5050 '--task_group={ > "tasks": > [ > { > "name": "7f141aca-55fe-4bb0-af4b-87f5ee26986a", > "task_id": {"value" : "2866368d-7279-4657-b8eb-bf1d968e8ebf"}, > "agent_id": {"value" : ""}, > "resources": [{ > "name": "cpus", > "type": "SCALAR", > "scalar": { > "value": 0.2 > } > }, { > "name": "mem", > "type": "SCALAR", > "scalar": { > "value": 32 > } > }, { > "name": "disk", > "type": "SCALAR", > "scalar": { > "value": 2 > } > } > ], > "command": { > "value": "sleep 2 ; /usr/bin/dd if=/dev/zero of=out.dat bs=1M > count=64 ; sleep 1" > } > } > ] > }' > I0911 11:48:01.480689 7340 scheduler.cpp:184] Version: 1.5.0 > I0911 11:48:01.488868 7339 scheduler.cpp:470] New master detected at > master@17.228.224.108:5050 > Subscribed with ID aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010 > Submitted task group with tasks [ 2866368d-7279-4657-b8eb-bf1d968e8ebf ] to > agent 'aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-S0' > Received status update TASK_RUNNING for task > 
'2866368d-7279-4657-b8eb-bf1d968e8ebf' > source: SOURCE_EXECUTOR > Received status update TASK_FAILED for task > '2866368d-7279-4657-b8eb-bf1d968e8ebf' > message: 'Command terminated with signal Killed' > source: SOURCE_EXECUTOR > {noformat} > However, the agent logs show that this failed with a memory limitation: > {noformat} > I0911 11:48:02.235818 7012 http.cpp:532] Processing call > WAIT_NESTED_CONTAINER > I0911 11:48:02.236395 7013 status_update_manager.cpp:323] Received status > update TASK_RUNNING (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task > 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework > aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010 > I0911 11:48:02.237083 7016 slave.cpp:4875] Forwarding the update > TASK_RUNNING (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task > 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework > aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010 to master@17.228.224.108:5050 > I0911 11:48:02.283661 7007 status_update_manager.cpp:395] Received status > update acknowledgement (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task > 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework > aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010 > I0911 11:48:04.771455 7014 memory.cpp:516] OOM detected for container > 474388fe-43c3-4372-b903-eaca22740996 > I0911 11:48:04.776445 7014 memory.cpp:556] Memory limit exceeded: Requested: > 64MB Maximum Used: 64MB > ... 
> I0911 11:48:04.776943 7012 containerizer.cpp:2681] Container > 474388fe-43c3-4372-b903-eaca22740996 has reached its limit for resource > [{"name":"mem","scalar":{"value":64.0},"type":"SCALAR"}] and will be > terminated > {noformat} > The following {{mesos-execute}} task will show the container limitation > correctly: > {noformat} > exec /opt/mesos/bin/mesos-execute --content_type=json > --master=jpeach.apple.com:5050 '--task_group={ > "tasks": > [ > { > "name": "37db08f6-4f0f-4ef6-97ee-b10a5c5cc211", > "task_id": {"value" : "1372b2e2-c501-4e80-bcbd-1a5c5194e206"}, > "agent_id": {"value" : ""}, > "resources": [{ > "name": "cpus", > "type": "SCALAR", > "scalar": { > "value": 0.2 > } > }, > { > "name": "mem", > "type": "SCALAR", > "scalar": { >
[jira] [Created] (MESOS-8000) DefaultExecutorCniTest.ROOT_VerifyContainerIP is flaky.
Alexander Rukletsov created MESOS-8000: -- Summary: DefaultExecutorCniTest.ROOT_VerifyContainerIP is flaky. Key: MESOS-8000 URL: https://issues.apache.org/jira/browse/MESOS-8000 Project: Mesos Issue Type: Bug Components: test Affects Versions: 1.5.0 Environment: Ubuntu 16.04 Reporter: Alexander Rukletsov Observed a failure on internal CI: {noformat} ../../src/tests/containerizer/cni_isolator_tests.cpp:1419 Failed to wait 15secs for subscribed {noformat} Full log attached. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-7999) Add and document ability to expose new /monitor modules on agents
Charles Allen created MESOS-7999: Summary: Add and document ability to expose new /monitor modules on agents Key: MESOS-7999 URL: https://issues.apache.org/jira/browse/MESOS-7999 Project: Mesos Issue Type: Wish Components: agent, json api, modules, statistics Reporter: Charles Allen When looking at how to collect data about the cluster, the best way to support functionality similar to Kubernetes DaemonSets is not completely clear. One key use case for DaemonSets is a monitor for system metrics. The ask is that agents be able to load a module which either exposes new endpoints under {{/monitor}} or allows pluggable entries to be added to {{/monitor/statistics}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7997) ContentType/MasterAPITest.CreateAndDestroyVolumes is flaky.
[ https://issues.apache.org/jira/browse/MESOS-7997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-7997: --- Attachment: CreateAndDestroyVolumes_goodrun.txt CreateAndDestroyVolumes_badrun.txt > ContentType/MasterAPITest.CreateAndDestroyVolumes is flaky. > --- > > Key: MESOS-7997 > URL: https://issues.apache.org/jira/browse/MESOS-7997 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.5.0 > Environment: Ubuntu 17.04 with SSL >Reporter: Alexander Rukletsov > Labels: flaky-test, mesosphere > Attachments: CreateAndDestroyVolumes_badrun.txt, > CreateAndDestroyVolumes_goodrun.txt > > > Observed a failure on the internal CI: > {noformat} > ../../src/tests/api_tests.cpp:3052 > Value of: Resources(offer.resources()).contains( allocatedResources(volume, > frameworkInfo.role())) > Actual: false > Expected: true > {noformat} > Full log attached. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7998) PersistentVolumeEndpointsTest.UnreserveVolumeResources is flaky.
[ https://issues.apache.org/jira/browse/MESOS-7998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-7998: --- Attachment: UnreserveVolumeResources_badrun.txt > PersistentVolumeEndpointsTest.UnreserveVolumeResources is flaky. > > > Key: MESOS-7998 > URL: https://issues.apache.org/jira/browse/MESOS-7998 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.5.0 > Environment: Ubuntu 17.04 with SSL >Reporter: Alexander Rukletsov > Labels: flaky-test, mesosphere > Attachments: UnreserveVolumeResources_badrun.txt > > > Observed a failure on the internal CI: > {noformat} > ../../src/tests/persistent_volume_endpoints_tests.cpp:450 > Value of: (response).get().status > Actual: "409 Conflict" > Expected: Accepted().status > Which is: "202 Accepted" > {noformat} > Full log attached. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7998) PersistentVolumeEndpointsTest.UnreserveVolumeResources is flaky.
[ https://issues.apache.org/jira/browse/MESOS-7998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-7998: --- Summary: PersistentVolumeEndpointsTest.UnreserveVolumeResources is flaky. (was: UnreserveVolumeResources is flaky.) > PersistentVolumeEndpointsTest.UnreserveVolumeResources is flaky. > > > Key: MESOS-7998 > URL: https://issues.apache.org/jira/browse/MESOS-7998 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.5.0 > Environment: Ubuntu 17.04 with SSL >Reporter: Alexander Rukletsov > Labels: flaky-test, mesosphere > > Observed a failure on the internal CI: > {noformat} > ../../src/tests/persistent_volume_endpoints_tests.cpp:450 > Value of: (response).get().status > Actual: "409 Conflict" > Expected: Accepted().status > Which is: "202 Accepted" > {noformat} > Full log attached. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7998) UnreserveVolumeResources is flaky.
[ https://issues.apache.org/jira/browse/MESOS-7998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-7998: --- Environment: Ubuntu 17.04 with SSL (was: Ubuntu 17.07 with SSL) > UnreserveVolumeResources is flaky. > -- > > Key: MESOS-7998 > URL: https://issues.apache.org/jira/browse/MESOS-7998 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.5.0 > Environment: Ubuntu 17.04 with SSL >Reporter: Alexander Rukletsov > Labels: flaky-test, mesosphere > > Observed a failure on the internal CI: > {noformat} > ../../src/tests/persistent_volume_endpoints_tests.cpp:450 > Value of: (response).get().status > Actual: "409 Conflict" > Expected: Accepted().status > Which is: "202 Accepted" > {noformat} > Full log attached. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-7998) UnreserveVolumeResources is flaky.
Alexander Rukletsov created MESOS-7998: -- Summary: UnreserveVolumeResources is flaky. Key: MESOS-7998 URL: https://issues.apache.org/jira/browse/MESOS-7998 Project: Mesos Issue Type: Bug Components: test Affects Versions: 1.5.0 Environment: Ubuntu 17.07 with SSL Reporter: Alexander Rukletsov Observed a failure on the internal CI: {noformat} ../../src/tests/persistent_volume_endpoints_tests.cpp:450 Value of: (response).get().status Actual: "409 Conflict" Expected: Accepted().status Which is: "202 Accepted" {noformat} Full log attached. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-7997) ContentType/MasterAPITest.CreateAndDestroyVolumes is flaky.
Alexander Rukletsov created MESOS-7997: -- Summary: ContentType/MasterAPITest.CreateAndDestroyVolumes is flaky. Key: MESOS-7997 URL: https://issues.apache.org/jira/browse/MESOS-7997 Project: Mesos Issue Type: Bug Components: test Affects Versions: 1.5.0 Environment: Ubuntu 17.04 with SSL Reporter: Alexander Rukletsov Observed a failure on the internal CI: {noformat} ../../src/tests/api_tests.cpp:3052 Value of: Resources(offer.resources()).contains( allocatedResources(volume, frameworkInfo.role())) Actual: false Expected: true {noformat} Full log attached. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7996) ContentType/SchedulerTest.NoOffersWithAllRolesSuppressed is flaky.
[ https://issues.apache.org/jira/browse/MESOS-7996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-7996: --- Attachment: SchedulerTest.NoOffersWithAllRolesSuppressed_goodrun.txt SchedulerTest.NoOffersWithAllRolesSuppressed_badrun.txt > ContentType/SchedulerTest.NoOffersWithAllRolesSuppressed is flaky. > -- > > Key: MESOS-7996 > URL: https://issues.apache.org/jira/browse/MESOS-7996 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.5.0 > Environment: Observed on Ubuntu 17.04 with SSL enabled >Reporter: Alexander Rukletsov > Labels: flaky-test, mesosphere > Attachments: SchedulerTest.NoOffersWithAllRolesSuppressed_badrun.txt, > SchedulerTest.NoOffersWithAllRolesSuppressed_goodrun.txt > > > Observed the failure on internal CI: > {noformat} > ../../src/tests/scheduler_tests.cpp:1474 > Mock function called more times than expected - returning directly. > Function call: offers(0x7b085d90, @0x7f1a88003590 48-byte object > <48-82 52-9F 1A-7F 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 > 00-00 00-00 00-00 00-00 01-00 00-00 04-00 00-00 20-4D 00-88 1A-7F 00-00>) > Expected: to be never called >Actual: called once - over-saturated and active > {noformat} > Full log attached. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-7996) ContentType/SchedulerTest.NoOffersWithAllRolesSuppressed is flaky.
Alexander Rukletsov created MESOS-7996: -- Summary: ContentType/SchedulerTest.NoOffersWithAllRolesSuppressed is flaky. Key: MESOS-7996 URL: https://issues.apache.org/jira/browse/MESOS-7996 Project: Mesos Issue Type: Bug Components: test Affects Versions: 1.5.0 Environment: Observed on Ubuntu 17.04 with SSL enabled Reporter: Alexander Rukletsov Observed the failure on internal CI: {noformat} ../../src/tests/scheduler_tests.cpp:1474 Mock function called more times than expected - returning directly. Function call: offers(0x7b085d90, @0x7f1a88003590 48-byte object <48-82 52-9F 1A-7F 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 01-00 00-00 04-00 00-00 20-4D 00-88 1A-7F 00-00>) Expected: to be never called Actual: called once - over-saturated and active {noformat} Full log attached. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-6162) Add support for cgroups blkio subsystem blkio statistics.
[ https://issues.apache.org/jira/browse/MESOS-6162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16174787#comment-16174787 ] Qian Zhang commented on MESOS-6162: --- I ran more tests for this performance issue with Mesos (rather than just testing it manually with {{dd}} as in my previous post): I used {{mesos-execute}} to launch a task that runs {{dd}} like this: {code}mesos-execute --master=192.168.1.6:5050 --name=test --command="dd if=/dev/zero of=test.bin bs=512 count=1000 oflag=dsync"{code} I found this performance issue will *always* happen as long as the combination of {{ext4/ext3 with the data=ordered option}} + {{cfq IO scheduler}} is met, *no matter whether `cgroups/blkio` isolation is enabled or not*, i.e., if that combination is met, the task will always take much longer to complete (~16s) than it would (~1.2s) if the combination were not met, regardless of whether `cgroups/blkio` is enabled. So it seems this performance issue has nothing to do with `cgroups/blkio`, since it happens even when `cgroups/blkio` is not enabled at all. However, a strange thing I found is that if the process is assigned to the *root* blkio cgroup, this performance issue will *not* happen even when that combination is met: {code} # echo $$ > /sys/fs/cgroup/blkio/cgroup.procs # dd if=/dev/zero of=test.bin bs=512 count=1000 oflag=dsync 1000+0 records in 1000+0 records out 512000 bytes (512 kB, 500 KiB) copied, 1.19546 s, 428 kB/s <--- No performance issue. {code} So the conclusion is, when the combination is met: # If the process is not assigned to any blkio cgroup (i.e., `cgroups/blkio` isolation is not enabled), the performance issue will happen. # If the process is assigned to a sub blkio cgroup (i.e., `cgroups/blkio` isolation is enabled), the performance issue will happen. # If the process is assigned to the root blkio cgroup, the performance issue will not happen. 
I think 1 and 2 will happen in the Mesos context but not 3, since a container launched by Mesos will never be assigned to the root blkio cgroup. Originally I thought we should add a note about the performance issue in the `cgroups/blkio` doc, but now I think that may not be the right place to mention it; instead we should add such a note in the docs {{mesos-containerizer.md}} and {{persistent-volume.md}}. > Add support for cgroups blkio subsystem blkio statistics. > - > > Key: MESOS-6162 > URL: https://issues.apache.org/jira/browse/MESOS-6162 > Project: Mesos > Issue Type: Task > Components: cgroups, containerization >Reporter: haosdent >Assignee: Jason Lai > Labels: cgroups, containerizer, mesosphere > Fix For: 1.4.0 > > > Noted that cgroups blkio subsystem may have performance issue, refer to > https://github.com/opencontainers/runc/issues/861 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
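The problematic combination described in the comment above can be checked mechanically. A minimal sketch (Python; the function name and the example input strings are illustrative assumptions, not from the original report — on a real host the inputs would come from /sys/block/<dev>/queue/scheduler and the mount options of the filesystem under test):

```python
def has_blkio_slow_combo(scheduler_line: str, mount_options: str) -> bool:
    """Detect the combination reported above: an ext3/ext4 filesystem
    mounted with data=ordered together with the cfq I/O scheduler.

    scheduler_line: contents of /sys/block/<dev>/queue/scheduler, where
    the active scheduler is shown in square brackets, e.g. "noop [cfq]".
    mount_options: the mount option string for the filesystem.
    """
    cfq_active = "[cfq]" in scheduler_line
    ordered_mode = "data=ordered" in mount_options
    return cfq_active and ordered_mode


# Example inputs as they might appear on a Linux host:
print(has_blkio_slow_combo("noop deadline [cfq]", "rw,relatime,data=ordered"))    # True
print(has_blkio_slow_combo("noop [deadline] cfq", "rw,relatime,data=ordered"))    # False
print(has_blkio_slow_combo("noop deadline [cfq]", "rw,relatime,data=writeback"))  # False
```

Per the comment, only the first case (cfq active and data=ordered) is expected to show the slowdown, and only when the process is outside the root blkio cgroup.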
[jira] [Commented] (MESOS-7500) Command checks via agent lead to flaky tests.
[ https://issues.apache.org/jira/browse/MESOS-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16174785#comment-16174785 ] Andrei Budnik commented on MESOS-7500: -- Another example from a failed run, including debug output (https://reviews.apache.org/r/59107): https://pastebin.com/iKA1WaZB > Command checks via agent lead to flaky tests. > - > > Key: MESOS-7500 > URL: https://issues.apache.org/jira/browse/MESOS-7500 > Project: Mesos > Issue Type: Bug >Reporter: Alexander Rukletsov >Assignee: Gastón Kleiman > Labels: check, flaky-test, health-check, mesosphere > > Tests that rely on command checks via agent are flaky on Apache CI. Here is > an example from one of the failed runs: https://pastebin.com/g2mPgYzu -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7742) ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky
[ https://issues.apache.org/jira/browse/MESOS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16174742#comment-16174742 ] Alexander Rukletsov commented on MESOS-7742: Observed this on internal CI, for both {{application/x-protobuf}} and {{application/json}}. Same failure: {noformat} ../../src/tests/api_tests.cpp:6701 Value of: (response).get().status Actual: "500 Internal Server Error" Expected: http::OK().status Which is: "200 OK" {noformat} > ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky > -- > > Key: MESOS-7742 > URL: https://issues.apache.org/jira/browse/MESOS-7742 > Project: Mesos > Issue Type: Bug >Reporter: Vinod Kone >Assignee: Gastón Kleiman > Labels: flaky-test, mesosphere-oncall > > Observed this on ASF CI. > [~gkleiman] mind triaging this? > {code} > [ RUN ] > ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession/0 > I0629 05:49:33.180673 25301 cluster.cpp:162] Creating default 'local' > authorizer > I0629 05:49:33.182234 25306 master.cpp:436] Master > 90ea1640-bdf3-49ba-b78f-b2ba7ea30077 (296af9b598c3) started on > 172.17.0.3:45726 > I0629 05:49:33.182289 25306 master.cpp:438] Flags at startup: --acls="" > --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="1secs" - > -allocator="HierarchicalDRF" --authenticate_agents="true" > --authenticate_frameworks="true" --authenticate_http_frameworks="true" > --authenticate_http_readonly="true" --au > thenticate_http_readwrite="true" --authenticators="crammd5" > --authorizers="local" --credentials="/tmp/a5h5J3/credentials" > --framework_sorter="drf" --help="false" --hostn > ame_lookup="true" --http_authenticators="basic" > --http_framework_authenticators="basic" --initialize_driver_logging="true" > --log_auto_initialize="true" --logbufsecs="0" > --logging_level="INFO" --max_agent_ping_timeouts="5" > --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" > 
--max_unreachable_tasks_per_framework="10 > 00" --port="5050" --quiet="false" --recovery_agent_removal_limit="100%" > --registry="in_memory" --registry_fetch_timeout="1mins" > --registry_gc_interval="15mins" --registr > y_max_agent_age="2weeks" --registry_max_agent_count="102400" > --registry_store_timeout="100secs" --registry_strict="false" > --root_submissions="true" --user_sorter="drf" - > -version="false" --webui_dir="/usr/local/share/mesos/webui" > --work_dir="/tmp/a5h5J3/master" --zk_session_timeout="10secs" > I0629 05:49:33.182561 25306 master.cpp:488] Master only allowing > authenticated frameworks to register > I0629 05:49:33.182610 25306 master.cpp:502] Master only allowing > authenticated agents to register > I0629 05:49:33.182636 25306 master.cpp:515] Master only allowing > authenticated HTTP frameworks to register > I0629 05:49:33.182656 25306 credentials.hpp:37] Loading credentials for > authentication from '/tmp/a5h5J3/credentials' > I0629 05:49:33.182915 25306 master.cpp:560] Using default 'crammd5' > authenticator > I0629 05:49:33.183009 25306 http.cpp:975] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readonly' > I0629 05:49:33.183151 25306 http.cpp:975] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readwrite' > I0629 05:49:33.183218 25306 http.cpp:975] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-scheduler' > I0629 05:49:33.183284 25306 master.cpp:640] Authorization enabled > I0629 05:49:33.183462 25309 hierarchical.cpp:158] Initialized hierarchical > allocator process > I0629 05:49:33.183504 25309 whitelist_watcher.cpp:77] No whitelist given > I0629 05:49:33.184311 25308 master.cpp:2161] Elected as the leading master! 
> I0629 05:49:33.184341 25308 master.cpp:1700] Recovering from registrar > I0629 05:49:33.184404 25308 registrar.cpp:345] Recovering registrar > I0629 05:49:33.184622 25308 registrar.cpp:389] Successfully fetched the > registry (0B) in 183040ns > I0629 05:49:33.184687 25308 registrar.cpp:493] Applied 1 operations in > 6441ns; attempting to update the registry > I0629 05:49:33.184885 25304 registrar.cpp:550] Successfully updated the > registry in 147200ns > I0629 05:49:33.184993 25304 registrar.cpp:422] Successfully recovered > registrar > I0629 05:49:33.185148 25308 master.cpp:1799] Recovered 0 agents from the > registry (129B); allowing 10mins for agents to re-register > I0629 05:49:33.185161 25302 hierarchical.cpp:185] Skipping recovery of > hierarchical allocator: nothing to recover > I0629 05:49:33.186769 25301 containerizer.cpp:221] Using isolation: > posix/cpu,posix/mem,filesystem/posix,network/cni > W0629 05:49:33.187232 25301 backend.cpp:76] Failed to create 'aufs' backend: > AufsBackend
[jira] [Commented] (MESOS-7995) libprocess tests breaking on macOS.
[ https://issues.apache.org/jira/browse/MESOS-7995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16174656#comment-16174656 ] Jan Schlicht commented on MESOS-7995: - Forgot to mention it: Mine's also an SSL build (--enable-libevent --enable-ssl), using libevent 2.0.22. Latest HEAD (c0293a6f7d457a595a3763662e3a9740db31859b). > libprocess tests breaking on macOS. > --- > > Key: MESOS-7995 > URL: https://issues.apache.org/jira/browse/MESOS-7995 > Project: Mesos > Issue Type: Bug > Components: libprocess, test >Affects Versions: 1.5.0 >Reporter: Till Toenshoff >Priority: Blocker > > Many libprocess tests fail on macOS, some even abort. > Examples: > {noformat} > [--] 8 tests from HTTPConnectionTest > [ RUN ] HTTPConnectionTest.GzipRequestBody > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:972: Failure > Failed to wait 15secs for connect > [ FAILED ] HTTPConnectionTest.GzipRequestBody (15001 ms) > [ RUN ] HTTPConnectionTest.Serial > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1015: Failure > (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.Serial (0 ms) > [ RUN ] HTTPConnectionTest.Pipeline > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1094: Failure > (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.Pipeline (1 ms) > [ RUN ] HTTPConnectionTest.ClosingRequest > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1190: Failure > (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.ClosingRequest (0 ms) > [ RUN ] HTTPConnectionTest.ClosingResponse > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1245: Failure > (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.ClosingResponse (0 ms) > [ RUN ] HTTPConnectionTest.ReferenceCounting > 
../../../3rdparty/libprocess/src/tests/http_tests.cpp:1306: Failure > (*connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.ReferenceCounting (1 ms) > [ RUN ] HTTPConnectionTest.Equality > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1333: Failure > (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.Equality (0 ms) > [ RUN ] HTTPConnectionTest.RequestStreaming > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1360: Failure > (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down > [ FAILED ] HTTPConnectionTest.RequestStreaming (0 ms) > [--] 8 tests from HTTPConnectionTest (15003 ms total) > {noformat} > {noformat} > [--] 8 tests from HttpAuthenticationTest > [ RUN ] HttpAuthenticationTest.NoAuthenticator > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1792: Failure > (response).failure(): Failed to connect to 192.168.178.20:51437: Host is down > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1786: Failure > Actual function call count doesn't match EXPECT_CALL(*http.process, > authenticated(_, Option::none()))... 
> Expected: to be called once >Actual: never called - unsatisfied and active > [ FAILED ] HttpAuthenticationTest.NoAuthenticator (1 ms) > [ RUN ] HttpAuthenticationTest.Unauthorized > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1816: Failure > (response).failure(): Failed to connect to 192.168.178.20:51437: Host is down > WARNING: Logging before InitGoogleLogging() is written to STDERR > F0921 12:18:19.947710 2519827264 future.hpp:1151] Check failed: !isFailed() > Future::get() but state == FAILED: Failed to connect to 192.168.178.20:51437: > Host is down > *** Check failure stack trace: *** > *** Aborted at 1505989099 (unix time) try "date -d @1505989099" if you are > using GNU date *** > PC: @ 0x7fff5cd45fce __pthread_kill > *** SIGABRT (@0x7fff5cd45fce) received by PID 23916 (TID 0x7fff96318340) > stack trace: *** > @ 0x7fff5ce76f5a _sigtramp > @ 0x7fff5ac5e526 std::__1::locale::facet::__on_zero_shared() > @ 0x7fff5cca232a abort > @0x1077b9659 google::logging_fail() > @0x1077b964a google::LogMessage::Fail() > @0x1077b72fc google::LogMessage::SendToLog() > @0x1077b8089 google::LogMessage::Flush() > @0x1077c12e9 google::LogMessageFatal::~LogMessageFatal() > @0x1077b9b35 google::LogMessageFatal::~LogMessageFatal() > @0x106998ad1 process::Future<>::get() > @0x1069d4d5b HttpAuthenticationTest_Unauthorized_Test::TestBody() > @0x1070a828e > testing::internal::HandleSehExceptionsInMethodIfSupported<>() >
[jira] [Commented] (MESOS-7995) libprocess tests breaking on macOS.
[ https://issues.apache.org/jira/browse/MESOS-7995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16174651#comment-16174651 ] Benjamin Bannier commented on MESOS-7995: - I can only repro this in an SSL-build. > libprocess tests breaking on macOS. > --- > > Key: MESOS-7995 > URL: https://issues.apache.org/jira/browse/MESOS-7995 > Project: Mesos > Issue Type: Bug > Components: libprocess, test > Affects Versions: 1.5.0 > Reporter: Till Toenshoff > Priority: Blocker > > Many libprocess tests fail on macOS, some even abort.
[jira] [Commented] (MESOS-7995) libprocess tests breaking on macOS.
[ https://issues.apache.org/jira/browse/MESOS-7995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16174636#comment-16174636 ] Jan Schlicht commented on MESOS-7995: - Is there something specific that is different in your environment? I can't reproduce this on macOS 10.13 with Apple Clang 9.0.0; all libprocess tests pass. > libprocess tests breaking on macOS. > --- > > Key: MESOS-7995 > URL: https://issues.apache.org/jira/browse/MESOS-7995 > Project: Mesos > Issue Type: Bug > Components: libprocess, test > Affects Versions: 1.5.0 > Reporter: Till Toenshoff > Priority: Blocker > > Many libprocess tests fail on macOS, some even abort.
[jira] [Commented] (MESOS-7975) The command/default executor can incorrectly send a TASK_FINISHED update even when the task is killed
[ https://issues.apache.org/jira/browse/MESOS-7975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16174611#comment-16174611 ] Alexander Rukletsov commented on MESOS-7975: [~qianzhang] I think we should send an email to the lists. I understand that this might seem like a lot of work for "an easy fix", but it is an important behavioral change even though the code change itself is small. > The command/default executor can incorrectly send a TASK_FINISHED update even > when the task is killed > - > > Key: MESOS-7975 > URL: https://issues.apache.org/jira/browse/MESOS-7975 > Project: Mesos > Issue Type: Bug > Reporter: Anand Mazumdar > Assignee: Qian Zhang > Priority: Critical > Labels: mesosphere > > Currently, when a task is killed, the default and the command executor > incorrectly send a {{TASK_FINISHED}} status update instead of > {{TASK_KILLED}}. This is due to an unfortunate missed conditional check when > the task exits with a zero status code. > {code} > if (WSUCCEEDED(status)) { > taskState = TASK_FINISHED; > } else if (killed) { > // Send TASK_KILLED if the task was killed as a result of > // kill() or shutdown(). > taskState = TASK_KILLED; > } else { > taskState = TASK_FAILED; > } > {code} > We should modify the code to correctly send {{TASK_KILLED}} status updates > when a task is killed. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7995) libprocess tests breaking on macOS.
[ https://issues.apache.org/jira/browse/MESOS-7995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16174534#comment-16174534 ] Till Toenshoff commented on MESOS-7995: --- Example with extended logging (GLOG_v=2) {noformat} [--] 8 tests from HTTPConnectionTest [ RUN ] HTTPConnectionTest.GzipRequestBody I0921 12:25:35.704711 115154944 process.cpp:3245] Resuming __latch__(39)@192.168.178.20:51793 at 2017-09-25 22:25:35.705722944+00:00 I0921 12:25:35.704720 116764672 process.cpp:3245] Resuming help@192.168.178.20:51793 at 2017-09-25 22:25:35.705730112+00:00 I0921 12:25:35.704736 115154944 process.cpp:3383] Cleaning up __latch__(39)@192.168.178.20:51793 I0921 12:25:35.704778 2519827264 process.cpp:3235] Spawned process (1)@192.168.178.20:51793 I0921 12:25:35.704787 114081792 process.cpp:3245] Resuming (1)@192.168.178.20:51793 at 2017-09-25 22:25:35.705836096+00:00 I0921 12:25:35.704764 114618368 process.cpp:3245] Resuming help@192.168.178.20:51793 at 2017-09-25 22:25:35.705777984+00:00 I0921 12:25:35.705045 2519827264 process.cpp:3235] Spawned process __latch__(40)@192.168.178.20:51793 I0921 12:25:35.705051 116228096 process.cpp:3245] Resuming __latch__(40)@192.168.178.20:51793 at 2017-09-25 22:25:35.706059072+00:00 I0921 12:25:35.705068 116228096 process.cpp:3383] Cleaning up __latch__(40)@192.168.178.20:51793 I0921 12:25:35.705090 115691520 process.cpp:3245] Resuming help@192.168.178.20:51793 at 2017-09-25 22:25:35.706096960+00:00 ../../../3rdparty/libprocess/src/tests/http_tests.cpp:972: Failure (connect).failure(): Failed to connect to 192.168.178.20:51793: Host is down I0921 12:25:35.705114 2519827264 process.cpp:3555] Donating thread to (1)@192.168.178.20:51793 while waiting I0921 12:25:35.705135 2519827264 process.cpp:3245] Resuming (1)@192.168.178.20:51793 at 2017-09-25 22:25:35.706139968+00:00 I0921 12:25:35.705147 2519827264 process.cpp:3383] Cleaning up (1)@192.168.178.20:51793 I0921 12:25:35.705168 113008640 process.cpp:3245] Resuming 
help@192.168.178.20:51793 at 2017-09-25 22:25:35.706178112+00:00 [ FAILED ] HTTPConnectionTest.GzipRequestBody (1 ms) {noformat} > libprocess tests breaking on macOS. > --- > > Key: MESOS-7995 > URL: https://issues.apache.org/jira/browse/MESOS-7995 > Project: Mesos > Issue Type: Bug > Components: libprocess, test > Affects Versions: 1.5.0 > Reporter: Till Toenshoff > Priority: Blocker > > Many libprocess tests fail on macOS, some even abort.
[jira] [Created] (MESOS-7995) libprocess tests breaking on macOS.
Till Toenshoff created MESOS-7995: - Summary: libprocess tests breaking on macOS. Key: MESOS-7995 URL: https://issues.apache.org/jira/browse/MESOS-7995 Project: Mesos Issue Type: Bug Components: libprocess, test Affects Versions: 1.5.0 Reporter: Till Toenshoff Priority: Blocker Many libprocess tests fail on macOS, some even abort. Examples: {noformat} [--] 8 tests from HTTPConnectionTest [ RUN ] HTTPConnectionTest.GzipRequestBody ../../../3rdparty/libprocess/src/tests/http_tests.cpp:972: Failure Failed to wait 15secs for connect [ FAILED ] HTTPConnectionTest.GzipRequestBody (15001 ms) [ RUN ] HTTPConnectionTest.Serial ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1015: Failure (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down [ FAILED ] HTTPConnectionTest.Serial (0 ms) [ RUN ] HTTPConnectionTest.Pipeline ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1094: Failure (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down [ FAILED ] HTTPConnectionTest.Pipeline (1 ms) [ RUN ] HTTPConnectionTest.ClosingRequest ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1190: Failure (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down [ FAILED ] HTTPConnectionTest.ClosingRequest (0 ms) [ RUN ] HTTPConnectionTest.ClosingResponse ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1245: Failure (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down [ FAILED ] HTTPConnectionTest.ClosingResponse (0 ms) [ RUN ] HTTPConnectionTest.ReferenceCounting ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1306: Failure (*connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down [ FAILED ] HTTPConnectionTest.ReferenceCounting (1 ms) [ RUN ] HTTPConnectionTest.Equality ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1333: Failure (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down [ FAILED ] HTTPConnectionTest.Equality (0 ms) [ 
RUN ] HTTPConnectionTest.RequestStreaming ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1360: Failure (connect).failure(): Failed to connect to 192.168.178.20:51437: Host is down [ FAILED ] HTTPConnectionTest.RequestStreaming (0 ms) [--] 8 tests from HTTPConnectionTest (15003 ms total) {noformat} {noformat} [--] 8 tests from HttpAuthenticationTest [ RUN ] HttpAuthenticationTest.NoAuthenticator ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1792: Failure (response).failure(): Failed to connect to 192.168.178.20:51437: Host is down ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1786: Failure Actual function call count doesn't match EXPECT_CALL(*http.process, authenticated(_, Option::none()))... Expected: to be called once Actual: never called - unsatisfied and active [ FAILED ] HttpAuthenticationTest.NoAuthenticator (1 ms) [ RUN ] HttpAuthenticationTest.Unauthorized ../../../3rdparty/libprocess/src/tests/http_tests.cpp:1816: Failure (response).failure(): Failed to connect to 192.168.178.20:51437: Host is down WARNING: Logging before InitGoogleLogging() is written to STDERR F0921 12:18:19.947710 2519827264 future.hpp:1151] Check failed: !isFailed() Future::get() but state == FAILED: Failed to connect to 192.168.178.20:51437: Host is down *** Check failure stack trace: *** *** Aborted at 1505989099 (unix time) try "date -d @1505989099" if you are using GNU date *** PC: @ 0x7fff5cd45fce __pthread_kill *** SIGABRT (@0x7fff5cd45fce) received by PID 23916 (TID 0x7fff96318340) stack trace: *** @ 0x7fff5ce76f5a _sigtramp @ 0x7fff5ac5e526 std::__1::locale::facet::__on_zero_shared() @ 0x7fff5cca232a abort @0x1077b9659 google::logging_fail() @0x1077b964a google::LogMessage::Fail() @0x1077b72fc google::LogMessage::SendToLog() @0x1077b8089 google::LogMessage::Flush() @0x1077c12e9 google::LogMessageFatal::~LogMessageFatal() @0x1077b9b35 google::LogMessageFatal::~LogMessageFatal() @0x106998ad1 process::Future<>::get() @0x1069d4d5b 
HttpAuthenticationTest_Unauthorized_Test::TestBody() @0x1070a828e testing::internal::HandleSehExceptionsInMethodIfSupported<>() @0x10704a96b testing::internal::HandleExceptionsInMethodIfSupported<>() @0x10704a896 testing::Test::Run() @0x10704c60d testing::TestInfo::Run() @0x10704dc0c testing::TestCase::Run() @0x10705e14c testing::internal::UnitTestImpl::RunAllTests() @0x1070ac2fe testing::internal::HandleSehExceptionsInMethodIfSupported<>() @0x10705db7b testing::internal::HandleExceptionsInMethodIfSupported<>() @
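Every failure above mentions the LAN address 192.168.178.20 being reported as "Host is down". One diagnostic step, sketched here, is to pin libprocess to loopback via the `LIBPROCESS_IP` environment variable (which libprocess consults for its bind/advertise address) so the tests no longer depend on the host's network state; the test-binary path in the comment is a typical autotools layout, not guaranteed.

```shell
# Diagnostic sketch: force libprocess to bind to loopback instead of
# the LAN address that the failing tests report as "down".
export LIBPROCESS_IP=127.0.0.1
echo "LIBPROCESS_IP=${LIBPROCESS_IP}"
# Then rerun the failing suite, e.g. (path is an assumption):
#   ./3rdparty/libprocess/libprocess-tests --gtest_filter='HTTPConnectionTest.*'
```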
[jira] [Updated] (MESOS-7994) Hard-coded protobuf version in mesos.pom.in
[ https://issues.apache.org/jira/browse/MESOS-7994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-7994: Component/s: java api > Hard-coded protobuf version in mesos.pom.in > --- > > Key: MESOS-7994 > URL: https://issues.apache.org/jira/browse/MESOS-7994 > Project: Mesos > Issue Type: Bug > Components: java api > Reporter: Benno Evers
[jira] [Created] (MESOS-7994) Hard-coded protobuf version in mesos.pom.in
Benno Evers created MESOS-7994: -- Summary: Hard-coded protobuf version in mesos.pom.in Key: MESOS-7994 URL: https://issues.apache.org/jira/browse/MESOS-7994 Project: Mesos Issue Type: Bug Reporter: Benno Evers Currently, the version of protobuf.jar used by Maven is hard-coded to 3.3.0 in `src/java/mesos.pom.in`. When building against a non-bundled protobuf, this is likely to cause a version mismatch and hence build errors, because the Java build then tries to compile Java sources generated by the non-bundled protoc.
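One conceivable fix, sketched below under the assumption that the Mesos configure step performs `@VAR@` substitution on `*.in` templates (which the `mesos.pom.in` suffix suggests), is to let the build inject the detected protobuf version instead of pinning 3.3.0. The substitution variable name `PROTOBUF_VERSION` is hypothetical, not taken from the actual build scripts.

```xml
<!-- Hypothetical sketch for src/java/mesos.pom.in: have configure
     substitute the detected protobuf version rather than pinning
     3.3.0. The variable name PROTOBUF_VERSION is an assumption. -->
<dependency>
  <groupId>com.google.protobuf</groupId>
  <artifactId>protobuf-java</artifactId>
  <version>@PROTOBUF_VERSION@</version>
</dependency>
```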