[jira] [Created] (MESOS-9591) Remove obsolete recovery code in Docker containerizer

2019-02-20 Thread Qian Zhang (JIRA)
Qian Zhang created MESOS-9591:
-

 Summary: Remove obsolete recovery code in Docker containerizer
 Key: MESOS-9591
 URL: https://issues.apache.org/jira/browse/MESOS-9591
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: Qian Zhang


When fixing MESOS-8125, 
[logic|https://github.com/apache/mesos/blob/1.7.1/src/slave/containerizer/docker.cpp#L1028:L1063]
was added to the Docker containerizer to reap the executor process only if the 
executor can be reached via a TCP socket, which avoids reaping an unrelated 
process after the agent host is rebooted. However, when fixing MESOS-9501 we 
made the agent not read the forked pid and libprocess pid after a reboot, so I 
think this logic will never be hit and can be removed: after an agent host 
reboot, `run->forkedPid` and `run->libprocessPid` will always be `None()`, 
i.e., the container will be skipped 
[here|https://github.com/apache/mesos/blob/1.7.1/src/slave/containerizer/docker.cpp#L982:L984].
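
To make the argument concrete, here is a minimal Python sketch of the recovery 
decision described above. It is a simplified model, not the code in 
docker.cpp: the names `RunState`, `read_run_state` and `recovery_action` are 
hypothetical, and it assumes the early skip check tests whether the forked pid 
is present.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunState:
    """Hypothetical stand-in for an executor run's checkpointed state."""
    forked_pid: Optional[int] = None
    libprocess_pid: Optional[str] = None


def read_run_state(rebooted: bool, forked_pid: int, libprocess_pid: str) -> RunState:
    # Models the MESOS-9501 behavior described above: after a host reboot
    # the agent does not read the checkpointed pids, so both stay None.
    if rebooted:
        return RunState()
    return RunState(forked_pid=forked_pid, libprocess_pid=libprocess_pid)


def recovery_action(run: RunState, can_connect: Callable[[str], bool]) -> str:
    # Early check (assumption: it tests the forked pid): nothing to reap.
    if run.forked_pid is None:
        return "skip"
    # MESOS-8125 style guard: only reap if the executor answers over TCP.
    if run.libprocess_pid is None or not can_connect(run.libprocess_pid):
        return "do-not-reap"
    return "reap"
```

With `rebooted=True` the model returns "skip" whatever `can_connect` reports, 
which is the sense in which the TCP guard can no longer be reached after a 
reboot.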



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-9574) Operation status update streams are not properly garbage collected.

2019-02-20 Thread JIRA


[ 
https://issues.apache.org/jira/browse/MESOS-9574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767680#comment-16767680
 ] 

Gastón Kleiman edited comment on MESOS-9574 at 2/21/19 12:44 AM:
-

https://reviews.apache.org/r/69978/


was (Author: gkleiman):
https://reviews.apache.org/r/69978/
https://reviews.apache.org/r/70028/

> Operation status update streams are not properly garbage collected.
> ---
>
> Key: MESOS-9574
> URL: https://issues.apache.org/jira/browse/MESOS-9574
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Gastón Kleiman
>Assignee: Gastón Kleiman
>Priority: Major
>  Labels: foundations, mesosphere
>
> After successfully handling the acknowledgment of a terminal operation status 
> update for an operation affecting the agent's default resources, the agent should 
> garbage collect the corresponding operation status update stream.





[jira] [Comment Edited] (MESOS-9574) Operation status update streams are not properly garbage collected.

2019-02-20 Thread JIRA


[ 
https://issues.apache.org/jira/browse/MESOS-9574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767680#comment-16767680
 ] 

Gastón Kleiman edited comment on MESOS-9574 at 2/21/19 12:19 AM:
-

https://reviews.apache.org/r/69978/
https://reviews.apache.org/r/70028/


was (Author: gkleiman):
https://reviews.apache.org/r/69978/

> Operation status update streams are not properly garbage collected.
> ---
>
> Key: MESOS-9574
> URL: https://issues.apache.org/jira/browse/MESOS-9574
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Gastón Kleiman
>Assignee: Gastón Kleiman
>Priority: Major
>  Labels: foundations, mesosphere
>
> After successfully handling the acknowledgment of a terminal operation status 
> update for an operation affecting the agent's default resources, the agent should 
> garbage collect the corresponding operation status update stream.





[jira] [Commented] (MESOS-9590) Mesos CI sometimes, incorrectly, overwrites already-pushed mesos master nightly images with new images built from non-master branches.

2019-02-20 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16773517#comment-16773517
 ] 

Jie Yu commented on MESOS-9590:
---

commit 8143d006f1032bb1c43364bd9f6741ee3dfbfc0b (HEAD -> master, origin/master, origin/HEAD)
Author: Jie Yu
Date:   Wed Feb 20 11:16:57 2019 -0800

    Blacklisted the "ubuntu-4" Jenkins box.

    The git version installed on the box is too low.

    Review: https://reviews.apache.org/r/70025

commit e9acc79ed535dd95b71227412a0e19868cf453d9
Author: Jie Yu
Date:   Wed Feb 20 11:14:26 2019 -0800

    Failed the scripts if `--points-at` is not supported.

    On some Jenkins boxes, the git installed on the box does not support
    `--points-at`. Instead of silently assume the 'master' branch in the
    scripts (which could be wrong), we fail hard here.

    Review: https://reviews.apache.org/r/70024



> Mesos CI sometimes, incorrectly, overwrites already-pushed mesos master 
> nightly images with new images built from non-master branches.
> --
>
> Key: MESOS-9590
> URL: https://issues.apache.org/jira/browse/MESOS-9590
> Project: Mesos
>  Issue Type: Bug
>Reporter: James DeFelice
>Assignee: Jie Yu
>Priority: Major
>  Labels: mesosphere
>
> I pulled image mesos/mesos-centos:master-2019-02-15 some time on the 15th and 
> worked with it locally, on my laptop, for about a week. Part of that work 
> included downloading the related mesos-xxx-devel.rpm from the same CI build 
> that produced the image so that I could build 3rd party mesos modules from 
> the master base image. The rpm was labeled as pre-1.8.0.
> This worked great until I tried to repeat the work on another machine. The 
> other machine pulled the "same" dockerhub image 
> (mesos/mesos-centos:master-2019-02-15) which was somehow built with a 
> mesos-xxx.rpm labeled as pre-1.7.2. I couldn't build my docker image using 
> this strangely new base because the mesos-xxx-devel.rpm I had hardcoded into 
> the dockerfile no longer aligned with the version of the mesos RPM that was 
> shipping in the base image.
> The base image had changed, such that the mesos RPM version went from 1.8.0 
> to 1.7.2. This should never happen.
> [~jieyu] investigated and found that the problem appears to happen at random. 
> Current thinking is that one of the mesos CI boxes uses a version of git 
> that's too old, and that the CI scripts are incorrectly ignoring a git 
> command failure: the git command fails because the git version is too old, 
> and the script subsequently ignores any failures from the command pipeline in 
> which this command is executed. As a result, the "version" of the branch 
> being built cannot be detected and therefore defaults to master, 
> overwriting *actual* master image builds.
> [~jieyu] also wrote some patches, which I'll link here:
> * https://reviews.apache.org/r/70024/
> * https://reviews.apache.org/r/70025/





[jira] [Created] (MESOS-9590) Mesos CI sometimes, incorrectly, overwrites already-pushed mesos master nightly images with new images built from non-master branches.

2019-02-20 Thread James DeFelice (JIRA)
James DeFelice created MESOS-9590:
-

 Summary: Mesos CI sometimes, incorrectly, overwrites 
already-pushed mesos master nightly images with new images built from 
non-master branches.
 Key: MESOS-9590
 URL: https://issues.apache.org/jira/browse/MESOS-9590
 Project: Mesos
  Issue Type: Bug
Reporter: James DeFelice
Assignee: Jie Yu


I pulled image mesos/mesos-centos:master-2019-02-15 some time on the 15th and 
worked with it locally, on my laptop, for about a week. Part of that work 
included downloading the related mesos-xxx-devel.rpm from the same CI build 
that produced the image so that I could build 3rd party mesos modules from the 
master base image. The rpm was labeled as pre-1.8.0.

This worked great until I tried to repeat the work on another machine. The 
other machine pulled the "same" dockerhub image 
(mesos/mesos-centos:master-2019-02-15) which was somehow built with a 
mesos-xxx.rpm labeled as pre-1.7.2. I couldn't build my docker image using this 
strangely new base because the mesos-xxx-devel.rpm I had hardcoded into the 
dockerfile no longer aligned with the version of the mesos RPM that was 
shipping in the base image.

The base image had changed, such that the mesos RPM version went from 1.8.0 to 
1.7.2. This should never happen.

[~jieyu] investigated and found that the problem appears to happen at random. 
Current thinking is that one of the mesos CI boxes uses a version of git that's 
too old, and that the CI scripts are incorrectly ignoring a git command 
failure: the git command fails because the git version is too old, and the 
script subsequently ignores any failures from the command pipeline in which 
this command is executed. As a result, the "version" of the branch being 
built cannot be detected and therefore defaults to master, overwriting 
*actual* master image builds.

[~jieyu] also wrote some patches, which I'll link here:

* https://reviews.apache.org/r/70024/
* https://reviews.apache.org/r/70025/
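
The failure mode described above (a failed command in a pipeline being 
ignored, followed by a silent fallback to master) can be illustrated with a 
small Python sketch. This is not the CI script from the linked reviews; 
`detect_branch` is a hypothetical helper, and the command it runs is supplied 
by the caller because the exact git invocation is not shown in this ticket.

```python
import subprocess
from typing import List


def detect_branch(command: List[str]) -> str:
    """Run `command` and return the branch name it prints.

    Raises RuntimeError instead of falling back to "master" when the
    command fails or prints nothing.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        # Fail hard: an old git that rejects an option ends up here.
        raise RuntimeError(
            "branch detection failed with exit code %d: %s"
            % (result.returncode, result.stderr.strip()))
    branch = result.stdout.strip()
    if not branch:
        # An empty answer is also treated as an error, never as "master".
        raise RuntimeError("branch detection produced no output")
    return branch
```

The design choice is that the caller never sees a default value: either a 
branch name was actually reported, or the build stops before any image is 
tagged and pushed.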





[jira] [Commented] (MESOS-7568) Introduce a heartbeat mechanism for v0 executor <-> agent links.

2019-02-20 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16773263#comment-16773263
 ] 

Greg Mann commented on MESOS-7568:
--

[~kaysoky] is this still an issue for v0 executors?

> Introduce a heartbeat mechanism for v0 executor <-> agent links.
> 
>
> Key: MESOS-7568
> URL: https://issues.apache.org/jira/browse/MESOS-7568
> Project: Mesos
>  Issue Type: Bug
>Reporter: Anand Mazumdar
>Priority: Critical
>  Labels: foundations, mesosphere
>
> Currently, we do not have heartbeats for executor <-> agent communication. 
> This is especially problematic in scenarios where IPFilters are enabled, since 
> the default conntrack keepalive timeout is 5 days. When that timeout 
> elapses, the executor doesn't get notified via a socket disconnection when 
> the agent process restarts. The executor would then get killed if it doesn't 
> re-register when the agent recovery process is completed.
> Enabling application-level heartbeats or TCP keepalives is one possible 
> way to fix this issue.
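
As an illustration of the TCP keepalive option mentioned in the description, 
here is a minimal Python sketch using the standard `socket` module. It is not 
libprocess code; `enable_keepalive` and its `idle_seconds` default are 
hypothetical, and the idle-time option is only set on platforms where Python 
exposes `TCP_KEEPIDLE`.

```python
import socket


def enable_keepalive(sock: socket.socket, idle_seconds: int = 60) -> None:
    # Ask the kernel to send keepalive probes on an otherwise idle link.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Optionally shorten the idle time before the first probe. The
    # constant is platform specific, so guard its use.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
```

Choosing an idle time well below the conntrack timeout would keep the 
connection tracking entry alive; the right value depends on the deployment.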





[jira] [Created] (MESOS-9589) AgentContainerAPITest.NestedContainerIdempotentLaunch is flaky

2019-02-20 Thread Greg Mann (JIRA)
Greg Mann created MESOS-9589:


 Summary: AgentContainerAPITest.NestedContainerIdempotentLaunch is 
flaky
 Key: MESOS-9589
 URL: https://issues.apache.org/jira/browse/MESOS-9589
 Project: Mesos
  Issue Type: Bug
Reporter: Greg Mann


I've observed a couple of different failure modes in this test:
{code}
../../src/tests/agent_container_api_tests.cpp:828: Failure
Failed to wait 15secs for launchNestedContainer(slave.get()->pid, containerId)
{code}
and
{code}
/tmp/SRC/src/tests/agent_container_api_tests.cpp:847: Failure
Failed to wait 15secs for wait
{code}





[jira] [Commented] (MESOS-8930) THREADSAFE_SnapshotTimeout is flaky.

2019-02-20 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16773095#comment-16773095
 ] 

Vinod Kone commented on MESOS-8930:
---

Saw this when testing 1.7.2 rc.

{code}
2: [ RUN      ] MetricsTest.THREADSAFE_SnapshotTimeout
2: I0219 23:34:37.010373 23554 process.cpp:3588] Handling HTTP event for 
process 'metrics' with path: '/metrics/snapshot'
2: I0219 23:34:37.062614 23555 process.cpp:3588] Handling HTTP event for 
process 'metrics' with path: '/metrics/snapshot'
2: /tmp/SRC/3rdparty/libprocess/src/tests/metrics_tests.cpp:425: Failure
{code}

> THREADSAFE_SnapshotTimeout is flaky.
> 
>
> Key: MESOS-8930
> URL: https://issues.apache.org/jira/browse/MESOS-8930
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.7.2
> Environment: Ubuntu 16.04
>Reporter: Alexander Rukletsov
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: flaky-test, foundations, mesosphere
>
> Observed on ASF CI, might be related to a recent test change 
> https://reviews.apache.org/r/66831/
> {noformat}
> 18:23:31 2: [ RUN      ] MetricsTest.THREADSAFE_SnapshotTimeout
> 18:23:31 2: I0516 18:23:31.747611 16246 process.cpp:3583] Handling HTTP event 
> for process 'metrics' with path: '/metrics/snapshot'
> 18:23:31 2: I0516 18:23:31.796871 16251 process.cpp:3583] Handling HTTP event 
> for process 'metrics' with path: '/metrics/snapshot'
> 18:23:46 2: /tmp/SRC/3rdparty/libprocess/src/tests/metrics_tests.cpp:425: 
> Failure
> 18:23:46 2: Failed to wait 15secs for response
> 22:57:13 Build timed out (after 300 minutes). Marking the build as failed.
> {noformat}





[jira] [Commented] (MESOS-8994) Ensure that the cmake build knows about all source files in the autotools build

2019-02-20 Thread Benjamin Bannier (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16773080#comment-16773080
 ] 

Benjamin Bannier commented on MESOS-8994:
-

[~tillt], we have a tool {{support/check-cmake-missing-files.sh}} in tree which 
detects missing files. We are mostly clean here with only one recently added 
file missing in the cmake setup.

Remaining work would be to integrate this tool into e.g., commit hooks or other 
tooling, depending on our stance on acceptable runtime for linters.

> Ensure that the cmake build knows about all source files in the autotools 
> build
> ---
>
> Key: MESOS-8994
> URL: https://issues.apache.org/jira/browse/MESOS-8994
> Project: Mesos
>  Issue Type: Improvement
>  Components: build, cmake
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>Priority: Minor
>
> We currently maintain two build systems in parallel with autotools still 
> being used by the larger part of contributors and cmake catching up in terms 
> of coverage and features.
>  
> This has led to situations where certain features were added only to the 
> autotools build while updating the cmake build was either implicitly (without 
> creating a ticket) deferred or forgotten. Such missing coverage 
> makes it harder to gauge where the two build systems stand in terms of 
> feature parity and how much work is left before autotools can be retired.
> We should update the cmake build setup to explicitly check whether any 
> source files (headers and sources) unknown to it exist in the tree. Until 
> full parity is reached we would likely need to maintain a whitelist of files 
> known to be missing in the cmake build (this whitelist would at the same time 
> serve as a {{TODO}} list). The LLVM project uses the following function to 
> perform closely related work, 
> https://github.com/llvm-mirror/llvm/blob/master/cmake/modules/LLVMProcessSources.cmake#L70-L111.
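
The whitelist-based check proposed in the description can be sketched in a few 
lines of Python. This is neither the in-tree 
{{support/check-cmake-missing-files.sh}} tool nor the LLVM function; 
`check_cmake_coverage` and its parameters are hypothetical, and the file lists 
are assumed to have been collected elsewhere.

```python
from typing import Iterable, Set, Tuple


def check_cmake_coverage(
    tree_sources: Iterable[str],
    cmake_sources: Iterable[str],
    whitelist: Iterable[str],
) -> Tuple[Set[str], Set[str]]:
    """Return (missing, stale).

    missing: files in the tree that cmake does not know about and that
             are not excused by the whitelist.
    stale:   whitelist entries that are no longer needed, because cmake
             now knows the file or the file left the tree.
    """
    tree = set(tree_sources)
    known = set(cmake_sources)
    allowed = set(whitelist)
    missing = tree - known - allowed
    stale = (allowed & known) | (allowed - tree)
    return missing, stale
```

Reporting stale entries keeps the whitelist usable as the TODO list the 
description mentions: it shrinks as the cmake build catches up.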





[jira] [Assigned] (MESOS-8994) Ensure that the cmake build knows about all source files in the autotools build

2019-02-20 Thread Benjamin Bannier (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier reassigned MESOS-8994:
---

Assignee: (was: Benjamin Bannier)

> Ensure that the cmake build knows about all source files in the autotools 
> build
> ---
>
> Key: MESOS-8994
> URL: https://issues.apache.org/jira/browse/MESOS-8994
> Project: Mesos
>  Issue Type: Improvement
>  Components: build, cmake
>Reporter: Benjamin Bannier
>Priority: Minor
>
> We currently maintain two build systems in parallel with autotools still 
> being used by the larger part of contributors and cmake catching up in terms 
> of coverage and features.
>  
> This has led to situations where certain features were added only to the 
> autotools build while updating the cmake build was either implicitly (without 
> creating a ticket) deferred or forgotten. Such missing coverage 
> makes it harder to gauge where the two build systems stand in terms of 
> feature parity and how much work is left before autotools can be retired.
> We should update the cmake build setup to explicitly check whether any 
> source files (headers and sources) unknown to it exist in the tree. Until 
> full parity is reached we would likely need to maintain a whitelist of files 
> known to be missing in the cmake build (this whitelist would at the same time 
> serve as a {{TODO}} list). The LLVM project uses the following function to 
> perform closely related work, 
> https://github.com/llvm-mirror/llvm/blob/master/cmake/modules/LLVMProcessSources.cmake#L70-L111.





[jira] [Commented] (MESOS-8994) Ensure that the cmake build knows about all source files in the autotools build

2019-02-20 Thread Till Toenshoff (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16773059#comment-16773059
 ] 

Till Toenshoff commented on MESOS-8994:
---

[~bbannier] where are we with this right now?

> Ensure that the cmake build knows about all source files in the autotools 
> build
> ---
>
> Key: MESOS-8994
> URL: https://issues.apache.org/jira/browse/MESOS-8994
> Project: Mesos
>  Issue Type: Improvement
>  Components: build, cmake
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>Priority: Minor
>
> We currently maintain two build systems in parallel with autotools still 
> being used by the larger part of contributors and cmake catching up in terms 
> of coverage and features.
>  
> This has led to situations where certain features were added only to the 
> autotools build while updating the cmake build was either implicitly (without 
> creating a ticket) deferred or forgotten. Such missing coverage 
> makes it harder to gauge where the two build systems stand in terms of 
> feature parity and how much work is left before autotools can be retired.
> We should update the cmake build setup to explicitly check whether any 
> source files (headers and sources) unknown to it exist in the tree. Until 
> full parity is reached we would likely need to maintain a whitelist of files 
> known to be missing in the cmake build (this whitelist would at the same time 
> serve as a {{TODO}} list). The LLVM project uses the following function to 
> perform closely related work, 
> https://github.com/llvm-mirror/llvm/blob/master/cmake/modules/LLVMProcessSources.cmake#L70-L111.





[jira] [Created] (MESOS-9588) Add a way to view current offer filters

2019-02-20 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9588:
--

 Summary: Add a way to view current offer filters
 Key: MESOS-9588
 URL: https://issues.apache.org/jira/browse/MESOS-9588
 Project: Mesos
  Issue Type: Improvement
Reporter: Benno Evers


Looking at just Mesos, it's currently not possible to see which offer filters 
are active and for how long.

The closest one can get is to check whether a filter currently exists, either 
via `metrics/snapshot` if per-framework metrics are enabled or by scanning the 
master logs for this message
{noformat}
  VLOG(1) << "Filtered offer with " << resources
          << " on agent " << slaveId
          << " for role " << role
          << " of framework " << frameworkId;
{noformat}

However, that does not tell the user how long the filter has been in place, 
which resources it contains, or how long it will remain.

Maybe MESOS-8621 would be a viable way to surface this information.
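
For reference, the log-scanning workaround mentioned above could look roughly 
like the following Python sketch. The regular expression is derived from the 
`VLOG` statement quoted in this ticket; `find_filtered_offers` is a 
hypothetical helper, and the log prefix in front of the message is not 
assumed to have any particular shape.

```python
import re
from typing import Dict, Iterable, List

# Mirrors the message built by the VLOG statement quoted above.
FILTERED_OFFER = re.compile(
    r"Filtered offer with (?P<resources>.+) on agent (?P<agent>\S+)"
    r" for role (?P<role>\S+) of framework (?P<framework>\S+)"
)


def find_filtered_offers(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Collect the fields of every 'Filtered offer' message in `lines`."""
    matches = []
    for line in lines:
        found = FILTERED_OFFER.search(line)
        if found:
            matches.append(found.groupdict())
    return matches
```

This only shows that a filter was applied at the time the line was logged. It 
says nothing about when the filter was installed or when it expires, which is 
the gap this ticket describes.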





[jira] [Created] (MESOS-9587) SlaveRecoveryTest/0.RecoverTerminatedHTTPExecutor is flaky

2019-02-20 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-9587:
---

 Summary: SlaveRecoveryTest/0.RecoverTerminatedHTTPExecutor is flaky
 Key: MESOS-9587
 URL: https://issues.apache.org/jira/browse/MESOS-9587
 Project: Mesos
  Issue Type: Bug
  Components: test
Affects Versions: 1.8.0
Reporter: Benjamin Bannier
 Attachments: log

The test {{SlaveRecoveryTest/0.RecoverTerminatedHTTPExecutor}} failed in our 
internal CI with {{cf4d3f70f00739e7574ab5af037feda8d4676afc}}.
{noformat}
../../src/tests/slave_recovery_tests.cpp:1548
      Expected: TASK_FAILED
To be equal to: status->state()
      Which is: TASK_LOST
{noformat}


