[jira] [Created] (MESOS-9591) Remove obsolete recovery code in Docker containerizer
Qian Zhang created MESOS-9591:
---------------------------------

             Summary: Remove obsolete recovery code in Docker containerizer
                 Key: MESOS-9591
                 URL: https://issues.apache.org/jira/browse/MESOS-9591
             Project: Mesos
          Issue Type: Improvement
          Components: containerization
            Reporter: Qian Zhang


When fixing MESOS-8125, [logic|https://github.com/apache/mesos/blob/1.7.1/src/slave/containerizer/docker.cpp#L1028:L1063] was added to the Docker containerizer to reap the executor process only if the executor can be reached via a TCP socket, so that an unrelated process is not reaped after the agent host is rebooted.

However, when fixing MESOS-9501 we made the agent not read the forked pid and libprocess pid after a reboot. I therefore think the logic added to the Docker containerizer will never be hit and can be removed: after an agent host reboot, `run->forkedPid` and `run->libprocessPid` will always be `None()`, i.e., the container will be skipped [here|https://github.com/apache/mesos/blob/1.7.1/src/slave/containerizer/docker.cpp#L982:L984].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
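The argument above can be illustrated with a small model. This is not the code in `docker.cpp`; it is a minimal Python sketch, assuming a hypothetical `Run` record and `recovery_action` function, of why a connectivity check placed behind the `None()` check becomes unreachable once both pids are always unset after a reboot.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Run:
    """Hypothetical stand-in for the agent's checkpointed run state."""
    forked_pid: Optional[int]
    libprocess_pid: Optional[str]


def recovery_action(run: Run, rebooted: bool,
                    can_connect: Callable[[str], bool]) -> str:
    # Early exit: without both pids there is nothing to reap.
    if run.forked_pid is None or run.libprocess_pid is None:
        return "skip"
    # Post-reboot guard (the logic proposed for removal): only reap if the
    # executor answers on its TCP socket.
    if rebooted and not can_connect(run.libprocess_pid):
        return "skip-unreachable"
    return "reap"


def never_called(_: str) -> bool:
    # Fails loudly if the connectivity check is ever evaluated.
    raise AssertionError("connectivity check should be unreachable")


# After a reboot both pids are unset, so the guard is never evaluated.
action_after_reboot = recovery_action(
    Run(forked_pid=None, libprocess_pid=None),
    rebooted=True,
    can_connect=never_called)
```

In this model the `rebooted` branch can only run when both pids are set, which the ticket says no longer happens after a reboot.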
[jira] [Comment Edited] (MESOS-9574) Operation status update streams are not properly garbage collected.
    [ https://issues.apache.org/jira/browse/MESOS-9574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767680#comment-16767680 ]

Gastón Kleiman edited comment on MESOS-9574 at 2/21/19 12:44 AM:
-----------------------------------------------------------------

https://reviews.apache.org/r/69978/


was (Author: gkleiman):
https://reviews.apache.org/r/69978/
https://reviews.apache.org/r/70028/


> Operation status update streams are not properly garbage collected.
> -------------------------------------------------------------------
>
>                 Key: MESOS-9574
>                 URL: https://issues.apache.org/jira/browse/MESOS-9574
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>            Reporter: Gastón Kleiman
>            Assignee: Gastón Kleiman
>            Priority: Major
>              Labels: foundations, mesosphere
>
> After successfully handling the acknowledgment of a terminal operation status update for an operation affecting agent's default resources, the agent should garbage collect the corresponding operation status update stream.
[jira] [Comment Edited] (MESOS-9574) Operation status update streams are not properly garbage collected.
    [ https://issues.apache.org/jira/browse/MESOS-9574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767680#comment-16767680 ]

Gastón Kleiman edited comment on MESOS-9574 at 2/21/19 12:19 AM:
-----------------------------------------------------------------

https://reviews.apache.org/r/69978/
https://reviews.apache.org/r/70028/


was (Author: gkleiman):
https://reviews.apache.org/r/69978/


> Operation status update streams are not properly garbage collected.
> -------------------------------------------------------------------
>
>                 Key: MESOS-9574
>                 URL: https://issues.apache.org/jira/browse/MESOS-9574
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>            Reporter: Gastón Kleiman
>            Assignee: Gastón Kleiman
>            Priority: Major
>              Labels: foundations, mesosphere
>
> After successfully handling the acknowledgment of a terminal operation status update for an operation affecting agent's default resources, the agent should garbage collect the corresponding operation status update stream.
[jira] [Commented] (MESOS-9590) Mesos CI sometimes, incorrectly, overwrites already-pushed mesos master nightly images with new images built from non-master branches.
    [ https://issues.apache.org/jira/browse/MESOS-9590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16773517#comment-16773517 ]

Jie Yu commented on MESOS-9590:
-------------------------------

commit 8143d006f1032bb1c43364bd9f6741ee3dfbfc0b (HEAD -> master, origin/master, origin/HEAD)
Author: Jie Yu
Date:   Wed Feb 20 11:16:57 2019 -0800

    Blacklisted the "ubuntu-4" Jenkins box.

    The git version installed on the box is too low.

    Review: https://reviews.apache.org/r/70025

commit e9acc79ed535dd95b71227412a0e19868cf453d9
Author: Jie Yu
Date:   Wed Feb 20 11:14:26 2019 -0800

    Failed the scripts if `--points-at` is not supported.

    On some Jenkins boxes, the git installed on the box does not support
    `--points-at`. Instead of silently assume the 'master' branch in the
    scripts (which could be wrong), we fail hard here.

    Review: https://reviews.apache.org/r/70024


> Mesos CI sometimes, incorrectly, overwrites already-pushed mesos master nightly images with new images built from non-master branches.
> --------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-9590
>                 URL: https://issues.apache.org/jira/browse/MESOS-9590
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: James DeFelice
>            Assignee: Jie Yu
>            Priority: Major
>              Labels: mesosphere
>
> I pulled image mesos/mesos-centos:master-2019-02-15 some time on the 15th and worked with it locally, on my laptop, for about a week. Part of that work included downloading the related mesos-xxx-devel.rpm from the same CI build that produced the image so that I could build 3rd party mesos modules from the master base image. The rpm was labeled as pre-1.8.0.
> This worked great until I tried to repeat the work on another machine. The other machine pulled the "same" dockerhub image (mesos/mesos-centos:master-2019-02-15) which was somehow built with a mesos-xxx.rpm labeled as pre-1.7.2. I couldn't build my docker image using this strangely new base because the mesos-xxx-devel.rpm I had hardcoded into the dockerfile no longer aligned with the version of the mesos RPM that was shipping in the base image.
> The base image had changed, such that the mesos RPM version went from 1.8.0 to 1.7.2. This should never happen.
> [~jieyu] investigated and found that the problem appears to happen at random. Current thinking is that one of the mesos CI boxes uses a version of git that's too old, and that the CI scripts are incorrectly ignoring a git command failure: the git command fails because the git version is too old, and the script subsequently ignores any failures from the command pipeline in which this command is executed. As a result, the "version" of the branch being built cannot be detected and therefore defaults to master, overwriting *actual* master image builds.
> [~jieyu] also wrote some patches, which I'll link here:
> * https://reviews.apache.org/r/70024/
> * https://reviews.apache.org/r/70025/
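The "fail hard instead of defaulting to master" behaviour described in the second commit can be sketched as follows. This is not the CI script itself: the function name `detect_branch`, the injected `run` callable, and the exact command `git branch --points-at HEAD` are assumptions made for illustration, and the command runner is injected so the sketch runs without git installed.

```python
from typing import Callable, List, Tuple

# A runner takes an argv list and returns (exit code, stdout).
Runner = Callable[[List[str]], Tuple[int, str]]


def detect_branch(run: Runner) -> str:
    """Return the first branch pointing at HEAD, or raise instead of guessing."""
    # Assumed invocation; the real CI scripts may use a different command.
    code, out = run(["git", "branch", "--points-at", "HEAD"])
    if code != 0:
        # An old git rejects the option; do not fall back to 'master'.
        raise RuntimeError("branch detection failed (exit code %d)" % code)
    names = [line.lstrip("* ").strip()
             for line in out.splitlines() if line.strip()]
    if not names:
        raise RuntimeError("no branch points at HEAD")
    return names[0]


def new_git(argv: List[str]) -> Tuple[int, str]:
    # Simulated output of a git that supports the option.
    return 0, "* 1.7.x\n"


def old_git(argv: List[str]) -> Tuple[int, str]:
    # Simulated failure of a git that does not support the option.
    return 1, ""


detected = detect_branch(new_git)

try:
    detect_branch(old_git)
    failed_hard = False
except RuntimeError:
    failed_hard = True
```

The design point is that the error path raises rather than returning a default, so a box with an outdated git stops the build instead of publishing under the wrong tag.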
[jira] [Created] (MESOS-9590) Mesos CI sometimes, incorrectly, overwrites already-pushed mesos master nightly images with new images built from non-master branches.
James DeFelice created MESOS-9590:
-------------------------------------

             Summary: Mesos CI sometimes, incorrectly, overwrites already-pushed mesos master nightly images with new images built from non-master branches.
                 Key: MESOS-9590
                 URL: https://issues.apache.org/jira/browse/MESOS-9590
             Project: Mesos
          Issue Type: Bug
            Reporter: James DeFelice
            Assignee: Jie Yu


I pulled image mesos/mesos-centos:master-2019-02-15 some time on the 15th and worked with it locally, on my laptop, for about a week. Part of that work included downloading the related mesos-xxx-devel.rpm from the same CI build that produced the image so that I could build 3rd party mesos modules from the master base image. The rpm was labeled as pre-1.8.0.

This worked great until I tried to repeat the work on another machine. The other machine pulled the "same" dockerhub image (mesos/mesos-centos:master-2019-02-15) which was somehow built with a mesos-xxx.rpm labeled as pre-1.7.2. I couldn't build my docker image using this strangely new base because the mesos-xxx-devel.rpm I had hardcoded into the dockerfile no longer aligned with the version of the mesos RPM that was shipping in the base image.

The base image had changed, such that the mesos RPM version went from 1.8.0 to 1.7.2. This should never happen.

[~jieyu] investigated and found that the problem appears to happen at random. Current thinking is that one of the mesos CI boxes uses a version of git that's too old, and that the CI scripts are incorrectly ignoring a git command failure: the git command fails because the git version is too old, and the script subsequently ignores any failures from the command pipeline in which this command is executed. As a result, the "version" of the branch being built cannot be detected and therefore defaults to master, overwriting *actual* master image builds.

[~jieyu] also wrote some patches, which I'll link here:
* https://reviews.apache.org/r/70024/
* https://reviews.apache.org/r/70025/
[jira] [Commented] (MESOS-7568) Introduce a heartbeat mechanism for v0 executor <-> agent links.
    [ https://issues.apache.org/jira/browse/MESOS-7568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16773263#comment-16773263 ]

Greg Mann commented on MESOS-7568:
----------------------------------

[~kaysoky] is this still an issue for v0 executors?

> Introduce a heartbeat mechanism for v0 executor <-> agent links.
> ----------------------------------------------------------------
>
>                 Key: MESOS-7568
>                 URL: https://issues.apache.org/jira/browse/MESOS-7568
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Anand Mazumdar
>            Priority: Critical
>              Labels: foundations, mesosphere
>
> Currently, we do not have heartbeats for executor <-> agent communication. This is especially problematic in scenarios where IPFilters are enabled, since the default conntrack keep alive timeout is 5 days. When that timeout elapses, the executor doesn't get notified via a socket disconnection when the agent process restarts. The executor would then get killed if it doesn't re-register when the agent recovery process is completed.
> Enabling application-level heartbeats or TCP keepalives is a possible way to fix this issue.
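As an illustration of the TCP keepalive option mentioned in the ticket, the following Python sketch enables keepalive probes on a socket. It is not taken from Mesos or libprocess; the helper name `enable_keepalive` and the timing values are illustrative assumptions, chosen only to be far shorter than a multi-day connection-tracking timeout.

```python
import socket


def enable_keepalive(sock: socket.socket, idle_secs: int = 60,
                     interval_secs: int = 10, probes: int = 5) -> None:
    """Turn on TCP keepalive probes for an already-created socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning options below are platform specific, so each one is only
    # set when the running platform exposes it.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_secs)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_secs)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

Keepalive probes keep idle connections active at the transport level and let the peer notice a dead link; an application-level heartbeat would additionally confirm that the remote process itself is still responsive.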
[jira] [Created] (MESOS-9589) AgentContainerAPITest.NestedContainerIdempotentLaunch is flaky
Greg Mann created MESOS-9589:
--------------------------------

             Summary: AgentContainerAPITest.NestedContainerIdempotentLaunch is flaky
                 Key: MESOS-9589
                 URL: https://issues.apache.org/jira/browse/MESOS-9589
             Project: Mesos
          Issue Type: Bug
            Reporter: Greg Mann


I've observed a couple different failure modes in this test:

{code}
../../src/tests/agent_container_api_tests.cpp:828: Failure
Failed to wait 15secs for launchNestedContainer(slave.get()->pid, containerId)
{code}

and

{code}
/tmp/SRC/src/tests/agent_container_api_tests.cpp:847: Failure
Failed to wait 15secs for wait
{code}
[jira] [Commented] (MESOS-8930) THREADSAFE_SnapshotTimeout is flaky.
    [ https://issues.apache.org/jira/browse/MESOS-8930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16773095#comment-16773095 ]

Vinod Kone commented on MESOS-8930:
-----------------------------------

Saw this when testing 1.7.2 rc.

{code}
2: [ RUN ] MetricsTest.THREADSAFE_SnapshotTimeout
2: I0219 23:34:37.010373 23554 process.cpp:3588] Handling HTTP event for process 'metrics' with path: '/metrics/snapshot'
2: I0219 23:34:37.062614 23555 process.cpp:3588] Handling HTTP event for process 'metrics' with path: '/metrics/snapshot'
2: /tmp/SRC/3rdparty/libprocess/src/tests/metrics_tests.cpp:425: Failure
{code}

> THREADSAFE_SnapshotTimeout is flaky.
> ------------------------------------
>
>                 Key: MESOS-8930
>                 URL: https://issues.apache.org/jira/browse/MESOS-8930
>             Project: Mesos
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 1.7.2
>         Environment: Ubuntu 16.04
>            Reporter: Alexander Rukletsov
>            Assignee: Benjamin Mahler
>            Priority: Major
>              Labels: flaky-test, foundations, mesosphere
>
> Observed on ASF CI, might be related to a recent test change https://reviews.apache.org/r/66831/
> {noformat}
> 18:23:31 2: [ RUN ] MetricsTest.THREADSAFE_SnapshotTimeout
> 18:23:31 2: I0516 18:23:31.747611 16246 process.cpp:3583] Handling HTTP event for process 'metrics' with path: '/metrics/snapshot'
> 18:23:31 2: I0516 18:23:31.796871 16251 process.cpp:3583] Handling HTTP event for process 'metrics' with path: '/metrics/snapshot'
> 18:23:46 2: /tmp/SRC/3rdparty/libprocess/src/tests/metrics_tests.cpp:425: Failure
> 18:23:46 2: Failed to wait 15secs for response
> 22:57:13 Build timed out (after 300 minutes). Marking the build as failed.
> {noformat}
[jira] [Commented] (MESOS-8994) Ensure that the cmake build knows about all source files in the autotools build
    [ https://issues.apache.org/jira/browse/MESOS-8994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16773080#comment-16773080 ]

Benjamin Bannier commented on MESOS-8994:
-----------------------------------------

[~tillt], we have a tool {{support/check-cmake-missing-files.sh}} in tree which detects missing files. We are mostly clean here, with only one recently added file missing in the cmake setup.

Remaining work would be to integrate this tool into e.g., commit hooks or other tooling, depending on our stance on acceptable runtime for linters.

> Ensure that the cmake build knows about all source files in the autotools build
> -------------------------------------------------------------------------------
>
>                 Key: MESOS-8994
>                 URL: https://issues.apache.org/jira/browse/MESOS-8994
>             Project: Mesos
>          Issue Type: Improvement
>          Components: build, cmake
>            Reporter: Benjamin Bannier
>            Assignee: Benjamin Bannier
>            Priority: Minor
>
> We currently maintain two build systems in parallel with autotools still being used by the larger part of contributors and cmake catching up in terms of coverage and features.
>
> This has led to situations where certain features were added only to the autotools build while updating the cmake build was either implicitly (without creating a ticket) deferred or forgotten. Such untracked gaps in coverage make it harder to gauge where the two build systems stand in terms of feature parity and how much work is left before autotools can be retired.
> We should update the cmake build setup to explicitly check whether any source files (headers and sources) unknown to it exist in the tree. Until full parity is reached we would likely need to maintain a whitelist of files known to be missing in the cmake build (this whitelist would at the same time serve as a {{TODO}} list). The LLVM project uses the following function to perform closely related work, https://github.com/llvm-mirror/llvm/blob/master/cmake/modules/LLVMProcessSources.cmake#L70-L111.
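The check proposed in the ticket amounts to a set difference with a whitelist. The sketch below shows that idea only; it does not reproduce how {{support/check-cmake-missing-files.sh}} works, and the function name and file names are hypothetical.

```python
from typing import Iterable, Set


def files_missing_from_cmake(tree_sources: Iterable[str],
                             cmake_sources: Iterable[str],
                             known_missing: Iterable[str] = ()) -> Set[str]:
    """Return source files in the tree that cmake neither lists nor whitelists."""
    return set(tree_sources) - set(cmake_sources) - set(known_missing)


# Hypothetical inputs: files found in the tree, files named in the cmake
# setup, and a whitelist that doubles as a TODO list.
tree = ["src/a.cpp", "src/a.hpp", "src/b.cpp", "src/c.cpp"]
cmake = ["src/a.cpp", "src/a.hpp"]
whitelist = ["src/c.cpp"]

unexpected = files_missing_from_cmake(tree, cmake, whitelist)
```

A linter built on this would fail when `unexpected` is non-empty, and entries would be removed from the whitelist as the cmake build catches up.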
[jira] [Assigned] (MESOS-8994) Ensure that the cmake build knows about all source files in the autotools build
     [ https://issues.apache.org/jira/browse/MESOS-8994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Bannier reassigned MESOS-8994:
---------------------------------------

    Assignee:     (was: Benjamin Bannier)

> Ensure that the cmake build knows about all source files in the autotools build
> -------------------------------------------------------------------------------
>
>                 Key: MESOS-8994
>                 URL: https://issues.apache.org/jira/browse/MESOS-8994
>             Project: Mesos
>          Issue Type: Improvement
>          Components: build, cmake
>            Reporter: Benjamin Bannier
>            Priority: Minor
>
> We currently maintain two build systems in parallel with autotools still being used by the larger part of contributors and cmake catching up in terms of coverage and features.
>
> This has led to situations where certain features were added only to the autotools build while updating the cmake build was either implicitly (without creating a ticket) deferred or forgotten. Such untracked gaps in coverage make it harder to gauge where the two build systems stand in terms of feature parity and how much work is left before autotools can be retired.
> We should update the cmake build setup to explicitly check whether any source files (headers and sources) unknown to it exist in the tree. Until full parity is reached we would likely need to maintain a whitelist of files known to be missing in the cmake build (this whitelist would at the same time serve as a {{TODO}} list). The LLVM project uses the following function to perform closely related work, https://github.com/llvm-mirror/llvm/blob/master/cmake/modules/LLVMProcessSources.cmake#L70-L111.
[jira] [Commented] (MESOS-8994) Ensure that the cmake build knows about all source files in the autotools build
    [ https://issues.apache.org/jira/browse/MESOS-8994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16773059#comment-16773059 ]

Till Toenshoff commented on MESOS-8994:
---------------------------------------

[~bbannier] where are we with this right now?

> Ensure that the cmake build knows about all source files in the autotools build
> -------------------------------------------------------------------------------
>
>                 Key: MESOS-8994
>                 URL: https://issues.apache.org/jira/browse/MESOS-8994
>             Project: Mesos
>          Issue Type: Improvement
>          Components: build, cmake
>            Reporter: Benjamin Bannier
>            Assignee: Benjamin Bannier
>            Priority: Minor
>
> We currently maintain two build systems in parallel with autotools still being used by the larger part of contributors and cmake catching up in terms of coverage and features.
>
> This has led to situations where certain features were added only to the autotools build while updating the cmake build was either implicitly (without creating a ticket) deferred or forgotten. Such untracked gaps in coverage make it harder to gauge where the two build systems stand in terms of feature parity and how much work is left before autotools can be retired.
> We should update the cmake build setup to explicitly check whether any source files (headers and sources) unknown to it exist in the tree. Until full parity is reached we would likely need to maintain a whitelist of files known to be missing in the cmake build (this whitelist would at the same time serve as a {{TODO}} list). The LLVM project uses the following function to perform closely related work, https://github.com/llvm-mirror/llvm/blob/master/cmake/modules/LLVMProcessSources.cmake#L70-L111.
[jira] [Created] (MESOS-9588) Add a way to view current offer filters
Benno Evers created MESOS-9588:
----------------------------------

             Summary: Add a way to view current offer filters
                 Key: MESOS-9588
                 URL: https://issues.apache.org/jira/browse/MESOS-9588
             Project: Mesos
          Issue Type: Improvement
            Reporter: Benno Evers


Looking at just mesos, it's currently not possible to see which offer filters are active and for how long.

The closest one can get is to check whether a filter currently exists, either by looking at `metrics/snapshot` if per-framework metrics are enabled, or by scanning the master logs for this message:

{noformat}
VLOG(1) << "Filtered offer with " << resources
        << " on agent " << slaveId
        << " for role " << role
        << " of framework " << frameworkId;
{noformat}

However, that does not tell the user how long the filter has been in place, which resources it contains, or how long it will remain.

Maybe MESOS-8621 would be a viable way to surface this information.
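To make the request concrete, the sketch below models the three pieces of information the ticket says are currently invisible: how long a filter has existed, which resources it covers, and how long it will remain. It is a hypothetical data model, not a Mesos API; the class `OfferFilterView` and all of its fields are assumptions, with `refuse_seconds` named after the duration a framework supplies when declining an offer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OfferFilterView:
    """Hypothetical read-only view of one active offer filter."""
    framework_id: str
    agent_id: str
    role: str
    resources: str          # e.g. "cpus:2;mem:1024"
    created_at: float       # seconds on some monotonic clock
    refuse_seconds: float   # how long the filter was requested for

    def age(self, now: float) -> float:
        # How long the filter has been in place.
        return now - self.created_at

    def remaining(self, now: float) -> float:
        # How long the filter will stay, never negative.
        return max(0.0, self.created_at + self.refuse_seconds - now)


view = OfferFilterView(
    framework_id="framework-1", agent_id="agent-1", role="dev",
    resources="cpus:2;mem:1024", created_at=100.0, refuse_seconds=5.0)

age_at_102 = view.age(102.0)
remaining_at_102 = view.remaining(102.0)
```

An endpoint or metric exposing records of this shape would answer all three questions without log scanning.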
[jira] [Created] (MESOS-9587) SlaveRecoveryTest/0.RecoverTerminatedHTTPExecutor is flaky
Benjamin Bannier created MESOS-9587:
---------------------------------------

             Summary: SlaveRecoveryTest/0.RecoverTerminatedHTTPExecutor is flaky
                 Key: MESOS-9587
                 URL: https://issues.apache.org/jira/browse/MESOS-9587
             Project: Mesos
          Issue Type: Bug
          Components: test
    Affects Versions: 1.8.0
            Reporter: Benjamin Bannier
         Attachments: log


The test {{SlaveRecoveryTest/0.RecoverTerminatedHTTPExecutor}} failed in our internal CI with {{cf4d3f70f00739e7574ab5af037feda8d4676afc}}.

{noformat}
../../src/tests/slave_recovery_tests.cpp:1548
Expected: TASK_FAILED
To be equal to: status->state()
Which is: TASK_LOST
{noformat}