[jira] [Commented] (MESOS-9963) URI stringification constructs malformed URIs.
[ https://issues.apache.org/jira/browse/MESOS-9963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933051#comment-16933051 ] James Peach commented on MESOS-9963: Verified that this issue doesn't cause any problems in the current code, because callers are careful to ensure that the path component begins with '/'. > URI stringification constructs malformed URIs. > -- > > Key: MESOS-9963 > URL: https://issues.apache.org/jira/browse/MESOS-9963 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: James Peach >Assignee: James Peach >Priority: Major > Labels: containerization > > Setting {{docker_registry="https://docker-cache.example.com/"}} and then > pulling an image named {{org/image-name:latest}} fails. The Docker image > puller ends up constructing a malformed URL for the manifest: > {noformat} > Pulling image 'org/siri-centos6:stage' from > 'docker-manifest://docker-cache.example.com:443org/image-name?latest#https' > to '/tmp/mesos/store/docker/staging/LGArHA' > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
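The malformed manifest URL quoted above can be reproduced with a small sketch. This is not the Mesos implementation: the helper names `stringify_naive` and `stringify_checked` are hypothetical, and the sketch only assumes that stringification joins the components without checking that a path following an authority starts with '/'.

```python
def stringify_naive(scheme, host, port, path, query=None, fragment=None):
    """Concatenate URI components as-is (hypothetical helper, not Mesos code)."""
    uri = f"{scheme}://{host}:{port}{path}"
    if query is not None:
        uri += f"?{query}"
    if fragment is not None:
        uri += f"#{fragment}"
    return uri


def stringify_checked(scheme, host, port, path, query=None, fragment=None):
    """Same as above, but prefix a non-empty path with '/' when it lacks one."""
    if path and not path.startswith("/"):
        path = "/" + path
    return stringify_naive(scheme, host, port, path, query, fragment)


# Components taken from the log line in the ticket.
parts = ("docker-manifest", "docker-cache.example.com", 443,
         "org/image-name", "latest", "https")
print(stringify_naive(*parts))    # port and path run together
print(stringify_checked(*parts))  # path is separated from the port by '/'
```

Whether the normalization belongs in the stringification routine or in its callers is a design choice; the comment above notes that current callers already ensure the leading '/'.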
[jira] [Comment Edited] (MESOS-8608) RmdirContinueOnErrorTest.RemoveWithContinueOnError fails.
[ https://issues.apache.org/jira/browse/MESOS-8608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932930#comment-16932930 ] Alan Drees edited comment on MESOS-8608 at 9/18/19 11:53 PM: - I am also attempting to build Mesos inside a Docker container. Does this test failure suggest that any attempt to create a bind mount will also fail? (I am attempting to build 1.1.0.) was (Author: adrees): I am also attempting to build Mesos inside a Docker container. Does this test failure suggest that any attempt to create a bind mount will also fail? > RmdirContinueOnErrorTest.RemoveWithContinueOnError fails. > - > > Key: MESOS-8608 > URL: https://issues.apache.org/jira/browse/MESOS-8608 > Project: Mesos > Issue Type: Bug > Components: cmake >Affects Versions: 1.4.1, 1.8.0 > Environment: Docker 17.12.0 > Ubuntu 16.04, Ubuntu 18.04 >Reporter: Pierre-Louis Chevallier >Priority: Critical > Labels: flaky-test, foundations, newbie, test > > I'm trying to run Mesos on Docker, and when I run "make check", one test > fails. I followed all the requirements and instructions in the Mesos getting > started guide. The failed test is > RmdirContinueOnErrorTest.RemoveWithContinueOnError
[jira] [Commented] (MESOS-8608) RmdirContinueOnErrorTest.RemoveWithContinueOnError fails.
[ https://issues.apache.org/jira/browse/MESOS-8608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932930#comment-16932930 ] Alan Drees commented on MESOS-8608: --- I am also attempting to build Mesos inside a Docker container. Does this test failure suggest that any attempt to create a bind mount will also fail? > RmdirContinueOnErrorTest.RemoveWithContinueOnError fails. > - > > Key: MESOS-8608 > URL: https://issues.apache.org/jira/browse/MESOS-8608 > Project: Mesos > Issue Type: Bug > Components: cmake >Affects Versions: 1.4.1, 1.8.0 > Environment: Docker 17.12.0 > Ubuntu 16.04, Ubuntu 18.04 >Reporter: Pierre-Louis Chevallier >Priority: Critical > Labels: flaky-test, foundations, newbie, test > > I'm trying to run Mesos on Docker, and when I run "make check", one test > fails. I followed all the requirements and instructions in the Mesos getting > started guide. The failed test is > RmdirContinueOnErrorTest.RemoveWithContinueOnError
[jira] [Assigned] (MESOS-9971) 'dist' and 'distcheck' cmake targets are implemented as shell scripts, so fail on Windows/MSVC.
[ https://issues.apache.org/jira/browse/MESOS-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Wu reassigned MESOS-9971: Sprint: Foundations: RI-18 55 Story Points: 1 Assignee: Joseph Wu Labels: foundations (was: ) Priority: Trivial (was: Major) > 'dist' and 'distcheck' cmake targets are implemented as shell scripts, so > fail on Windows/MSVC. > --- > > Key: MESOS-9971 > URL: https://issues.apache.org/jira/browse/MESOS-9971 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: master > Environment: {color:#172b4d}VS 2017 + Windows Server 2016{color} >Reporter: LinGao >Assignee: Joseph Wu >Priority: Trivial > Labels: foundations > Attachments: log_x64_build.log > > > Mesos failed to build due to error MSB6006: "cmd.exe" exited with code 1 on > Windows using MSVC. It can first be reproduced at revision > {color:#24292e}e0f7e2d{color} on the master branch. Could you please > take a look at this issue? Thanks a lot! > Steps to reproduce: > 1. git clone -c core.autocrlf=true [https://github.com/apache/mesos] > D:\mesos\src > 2. Open a VS 2017 x64 command prompt as admin and browse to D:\mesos > 3. cd src > 4. .\bootstrap.bat > 5. cd .. > 6. mkdir build_x64 && pushd build_x64 > 7. cmake ..\src -G "Visual Studio 15 2017 Win64" > -DCMAKE_SYSTEM_VERSION=10.0.17134.0 -DENABLE_LIBEVENT=1 > -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="C:\gnuwin32\bin" -T host=x64 > 8. msbuild Mesos.sln /p:Configuration=Debug /p:Platform=x64 /maxcpucount:4 > /t:Rebuild > > Error message: > 67>PrepareForBuild: > Creating directory "x64\Debug\dist\dist.tlog\". > InitializeBuildStatus: > Creating "x64\Debug\dist\dist.tlog\unsuccessfulbuild" because > "AlwaysCreate" was specified. > 67>C:\Program Files (x86)\Microsoft Visual > Studio\2017\Enterprise\Common7\IDE\VC\VCTargets\Microsoft.CppCommon.targets(209,5): > error MSB6006: "cmd.exe" exited with code 1. > [D:\Mesos\build_x64\dist.vcxproj] > 67>Done Building Project "D:\Mesos\build_x64\dist.vcxproj" (Rebuild > target(s)) -- FAILED.
[jira] [Created] (MESOS-9975) Sorter may leak clients.
Meng Zhu created MESOS-9975: --- Summary: Sorter may leak clients. Key: MESOS-9975 URL: https://issues.apache.org/jira/browse/MESOS-9975 Project: Mesos Issue Type: Bug Components: allocation Reporter: Meng Zhu In MESOS-9015, we allowed resource quantities to change when updating an existing allocation. However, when the allocation is updated to empty (that is, `newAllocation` is `empty()`), `sorter::update()` forgets to remove the client from the map. https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/sorter/drf/sorter.hpp#L382-L384 This case can happen, for example, when a CSI volume with a stale profile is destroyed: it would be better to convert it into an empty resource, since the disk space is no longer available.
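The leak described above can be illustrated with a toy model. This is a sketch under stated assumptions, not the DRF sorter code: `ToySorter`, its methods, and the single integer quantity per client are all hypothetical simplifications of the map that `sorter::update()` maintains.

```python
class ToySorter:
    """Toy model of a sorter that tracks one allocation quantity per client."""

    def __init__(self):
        self.allocations = {}  # client name -> allocated quantity

    def add(self, client, quantity):
        self.allocations[client] = quantity

    def update_leaky(self, client, new_quantity):
        # Overwrites the entry even when the new allocation is empty,
        # so the client's key stays in the map.
        self.allocations[client] = new_quantity

    def update_fixed(self, client, new_quantity):
        # Drops the entry once the allocation becomes empty.
        if new_quantity == 0:
            self.allocations.pop(client, None)
        else:
            self.allocations[client] = new_quantity


leaky, fixed = ToySorter(), ToySorter()
for sorter in (leaky, fixed):
    sorter.add("framework-a", 4)
leaky.update_leaky("framework-a", 0)
fixed.update_fixed("framework-a", 0)
print("framework-a" in leaky.allocations, "framework-a" in fixed.allocations)
```

In the leaky variant the stale key remains after the allocation drops to empty, which is the kind of leftover entry the ticket title refers to.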
[jira] [Assigned] (MESOS-9975) Sorter may leak clients.
[ https://issues.apache.org/jira/browse/MESOS-9975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meng Zhu reassigned MESOS-9975: --- Assignee: Meng Zhu > Sorter may leak clients. > > > Key: MESOS-9975 > URL: https://issues.apache.org/jira/browse/MESOS-9975 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: resource-management > > In MESOS-9015, we allowed resource quantities to change when updating an > existing allocation. However, when the allocation is updated to empty (that > is, `newAllocation` is `empty()`), `sorter::update()` forgets to remove the > client from the map. > https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/sorter/drf/sorter.hpp#L382-L384 > This case can happen, for example, when a CSI volume with a stale > profile is destroyed: it would be better to convert it into an empty resource, > since the disk space is no longer available.
[jira] [Commented] (MESOS-9971) 'dist' and 'distcheck' cmake targets are implemented as shell scripts, so fail on Windows/MSVC.
[ https://issues.apache.org/jira/browse/MESOS-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932768#comment-16932768 ] Joseph Wu commented on MESOS-9971: -- Disabling the targets: https://reviews.apache.org/r/71507/ > 'dist' and 'distcheck' cmake targets are implemented as shell scripts, so > fail on Windows/MSVC. > --- > > Key: MESOS-9971 > URL: https://issues.apache.org/jira/browse/MESOS-9971 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: master > Environment: {color:#172b4d}VS 2017 + Windows Server 2016{color} >Reporter: LinGao >Priority: Major > Attachments: log_x64_build.log > > > Mesos failed to build due to error MSB6006: "cmd.exe" exited with code 1 on > Windows using MSVC. It can first be reproduced at revision > {color:#24292e}e0f7e2d{color} on the master branch. Could you please > take a look at this issue? Thanks a lot! > Steps to reproduce: > 1. git clone -c core.autocrlf=true [https://github.com/apache/mesos] > D:\mesos\src > 2. Open a VS 2017 x64 command prompt as admin and browse to D:\mesos > 3. cd src > 4. .\bootstrap.bat > 5. cd .. > 6. mkdir build_x64 && pushd build_x64 > 7. cmake ..\src -G "Visual Studio 15 2017 Win64" > -DCMAKE_SYSTEM_VERSION=10.0.17134.0 -DENABLE_LIBEVENT=1 > -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="C:\gnuwin32\bin" -T host=x64 > 8. msbuild Mesos.sln /p:Configuration=Debug /p:Platform=x64 /maxcpucount:4 > /t:Rebuild > > Error message: > 67>PrepareForBuild: > Creating directory "x64\Debug\dist\dist.tlog\". > InitializeBuildStatus: > Creating "x64\Debug\dist\dist.tlog\unsuccessfulbuild" because > "AlwaysCreate" was specified. > 67>C:\Program Files (x86)\Microsoft Visual > Studio\2017\Enterprise\Common7\IDE\VC\VCTargets\Microsoft.CppCommon.targets(209,5): > error MSB6006: "cmd.exe" exited with code 1. > [D:\Mesos\build_x64\dist.vcxproj] > 67>Done Building Project "D:\Mesos\build_x64\dist.vcxproj" (Rebuild > target(s)) -- FAILED.
[jira] [Commented] (MESOS-9971) Mesos failed to build due to error MSB6006 on Windows with MSVC.
[ https://issues.apache.org/jira/browse/MESOS-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932759#comment-16932759 ] Joseph Wu commented on MESOS-9971: -- The {{dist}} and {{distcheck}} targets, which were recently added to mirror the corresponding targets in the autotools build, are currently implemented as {{.sh}} scripts and are not expected to work on Windows. I'll consider removing those targets from the Windows build if we deem the feature (making a clean source package) unnecessary there. > Mesos failed to build due to error MSB6006 on Windows with MSVC. > > > Key: MESOS-9971 > URL: https://issues.apache.org/jira/browse/MESOS-9971 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: master > Environment: {color:#172b4d}VS 2017 + Windows Server 2016{color} >Reporter: LinGao >Priority: Major > Attachments: log_x64_build.log > > > Mesos failed to build due to error MSB6006: "cmd.exe" exited with code 1 on > Windows using MSVC. It can first be reproduced at revision > {color:#24292e}e0f7e2d{color} on the master branch. Could you please > take a look at this issue? Thanks a lot! > Steps to reproduce: > 1. git clone -c core.autocrlf=true [https://github.com/apache/mesos] > D:\mesos\src > 2. Open a VS 2017 x64 command prompt as admin and browse to D:\mesos > 3. cd src > 4. .\bootstrap.bat > 5. cd .. > 6. mkdir build_x64 && pushd build_x64 > 7. cmake ..\src -G "Visual Studio 15 2017 Win64" > -DCMAKE_SYSTEM_VERSION=10.0.17134.0 -DENABLE_LIBEVENT=1 > -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="C:\gnuwin32\bin" -T host=x64 > 8. msbuild Mesos.sln /p:Configuration=Debug /p:Platform=x64 /maxcpucount:4 > /t:Rebuild > > Error message: > 67>PrepareForBuild: > Creating directory "x64\Debug\dist\dist.tlog\". > InitializeBuildStatus: > Creating "x64\Debug\dist\dist.tlog\unsuccessfulbuild" because > "AlwaysCreate" was specified. > 67>C:\Program Files (x86)\Microsoft Visual > Studio\2017\Enterprise\Common7\IDE\VC\VCTargets\Microsoft.CppCommon.targets(209,5): > error MSB6006: "cmd.exe" exited with code 1. > [D:\Mesos\build_x64\dist.vcxproj] > 67>Done Building Project "D:\Mesos\build_x64\dist.vcxproj" (Rebuild > target(s)) -- FAILED.
[jira] [Commented] (MESOS-9971) Mesos failed to build due to error MSB6006 on Windows with MSVC.
[ https://issues.apache.org/jira/browse/MESOS-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932747#comment-16932747 ] Greg Mann commented on MESOS-9971: -- [~kaysoky] have you seen this before? > Mesos failed to build due to error MSB6006 on Windows with MSVC. > > > Key: MESOS-9971 > URL: https://issues.apache.org/jira/browse/MESOS-9971 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: master > Environment: {color:#172b4d}VS 2017 + Windows Server 2016{color} >Reporter: LinGao >Priority: Major > Attachments: log_x64_build.log > > > Mesos failed to build due to error MSB6006: "cmd.exe" exited with code 1 on > Windows using MSVC. It can first be reproduced at revision > {color:#24292e}e0f7e2d{color} on the master branch. Could you please > take a look at this issue? Thanks a lot! > Steps to reproduce: > 1. git clone -c core.autocrlf=true [https://github.com/apache/mesos] > D:\mesos\src > 2. Open a VS 2017 x64 command prompt as admin and browse to D:\mesos > 3. cd src > 4. .\bootstrap.bat > 5. cd .. > 6. mkdir build_x64 && pushd build_x64 > 7. cmake ..\src -G "Visual Studio 15 2017 Win64" > -DCMAKE_SYSTEM_VERSION=10.0.17134.0 -DENABLE_LIBEVENT=1 > -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="C:\gnuwin32\bin" -T host=x64 > 8. msbuild Mesos.sln /p:Configuration=Debug /p:Platform=x64 /maxcpucount:4 > /t:Rebuild > > Error message: > 67>PrepareForBuild: > Creating directory "x64\Debug\dist\dist.tlog\". > InitializeBuildStatus: > Creating "x64\Debug\dist\dist.tlog\unsuccessfulbuild" because > "AlwaysCreate" was specified. > 67>C:\Program Files (x86)\Microsoft Visual > Studio\2017\Enterprise\Common7\IDE\VC\VCTargets\Microsoft.CppCommon.targets(209,5): > error MSB6006: "cmd.exe" exited with code 1. > [D:\Mesos\build_x64\dist.vcxproj] > 67>Done Building Project "D:\Mesos\build_x64\dist.vcxproj" (Rebuild > target(s)) -- FAILED.
[jira] [Commented] (MESOS-9630) Consider moving linter setup to pre-commit
[ https://issues.apache.org/jira/browse/MESOS-9630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932300#comment-16932300 ] Benjamin Bannier commented on MESOS-9630: - Moving r/71300 to MESOS-9974 so this ticket can be closed. > Consider moving linter setup to pre-commit > -- > > Key: MESOS-9630 > URL: https://issues.apache.org/jira/browse/MESOS-9630 > Project: Mesos > Issue Type: Wish >Reporter: Benjamin Bannier >Assignee: Benjamin Bannier >Priority: Minor > > Mesos currently uses a mix of hand-crafted git commit hooks and mesos-style > to perform linting. While this has served us well, our current approach also > has some drawbacks, e.g., > * the linter setup is spread between hooks and {{support/mesos-style.py}} > * adding new linters can be cumbersome > * mesos-style.py creates a single virtualenv, tied to the source tree, in > which it installs all linters > * linter dependencies are only cached to an extent, and it is easy to run > into a situation where one needs to update linter dependencies over the > network even though one has successfully linted a revision before > * {{support/mesos-style.py}} lacks a number of features, e.g., running over > only staged files, running linters in parallel for improved throughput, > running only specific linters or disabling certain linters, and the > parameterization of the linters is strongly coupled to the implementation of > the style checker itself. > The [pre-commit tool|https://pre-commit.com/] solves most of these issues, and > using it in Mesos would not only allow us to get rid of tooling which is hard > to maintain, but also unlock other features. It is licensed under an MIT > license. We should consider moving our linting setup over to pre-commit.
[jira] [Created] (MESOS-9974) Remove support/mesos-style.py transition script
Benjamin Bannier created MESOS-9974: --- Summary: Remove support/mesos-style.py transition script Key: MESOS-9974 URL: https://issues.apache.org/jira/browse/MESOS-9974 Project: Mesos Issue Type: Task Affects Versions: 1.10 Reporter: Benjamin Bannier Assignee: Benjamin Bannier In MESOS-9630 we moved our linter stack to pre-commit. We still have a dummy script {{support/mesos-style.py}} in the tree that instructs developers to migrate. We should remove it before releasing {{1.10.0}}, but allow enough time for developers to transition their setups.
[jira] [Commented] (MESOS-9630) Consider moving linter setup to pre-commit
[ https://issues.apache.org/jira/browse/MESOS-9630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932299#comment-16932299 ] Benjamin Bannier commented on MESOS-9630: - The following patches have landed on {{master}} (1.10-dev): {noformat} commit fb467a03cb8a5bde5147dc06ca6b73c9df04ff48 Author: Benjamin Bannier Date: Wed Sep 18 11:37:19 2019 +0200 Enabled a number of additional pre-commit checks. This patch enables checkers for well-formed YAML and JSON, and a linter which checks that all executable scripts have a valid shebang line. Review: https://reviews.apache.org/r/71209/ commit 2af339668fd90212999bae06a050a05824f2971e Author: Benjamin Bannier Date: Wed Sep 18 11:37:18 2019 +0200 Revert "Updated cpplint to be compatible with Python 3." This reverts commit 89db66e3df831eaa50fffb4149a3894097505c14. This patch was necessary when we were running cpplint in the python3 environment used e.g., also for bindings and other scripts. With pre-commit we have freedom to choose the Python environment needed so we can undo our adjustments here to stay closer to upstream. Review: https://reviews.apache.org/r/71208/ commit 3478e40c656160b8f08e0ad8c154289417bb6aaa Author: Benjamin Bannier Date: Wed Sep 18 11:37:17 2019 +0200 Revert "Updated cpplint.py to be less verbose when there is no linting issue." This reverts commit c0f8f56d5a93f3fb870e448fedfd22f1491356ca. This patch was necessary when we were running cpplint via `support/mesos-style.py` to prevent it from cluttering up the hook output. When running under pre-commit linter output is not shown if no errors occur so we can undo our change to stay closer to upstream. Review: https://reviews.apache.org/r/71207/ commit 37d76fff124d28a0281b9231058bb1b92fc65abe Author: Benjamin Bannier Date: Wed Sep 18 11:37:15 2019 +0200 Removed old mesos-style and references. This patch removes references to `support/mesos-style.py` which was replaced with a pre-commit setup in a previous commit. 
We also remove the tool itself. Review: https://reviews.apache.org/r/71206/ commit 454661dd0dcbb7a7bc87ac58ad74fd6dd04c5c15 Author: Benjamin Bannier Date: Wed Sep 18 11:37:14 2019 +0200 Switched commit hooks to pre-commit. This patch switches commit hooks to be orchestrated by the pre-commit tool mirroring the previous linters invoked through git commit hooks (orchestrated by `support/mesos-style.py` or standalone hooks). Using pre-commit removes the burden of maintaining `support/mesos-style.py`, making sure that hooks have the expected environment (e.g., Python version, Node installed). Additionally, upstream provides a number of additional linters which are not hard to add to Mesos' hooks. Review: https://reviews.apache.org/r/71205/ commit a138c2bd7cb3749f1dceb0e520e1138536abb531 Author: Benjamin Bannier Date: Wed Sep 18 11:37:13 2019 +0200 Added separate script to install developer setup. This patch breaks the installation of developer tools (i.e., linter configuration files and git hooks) out of `./bootstrap`. This not only simplifies and streamlines the setup, but will allow us to add developer-only features without breaking users who are just interested in building a distribution tarball. Review: https://reviews.apache.org/r/71299/ commit cbaca81a54720771662c119c80aec6101f120afc Author: Benjamin Bannier Date: Wed Sep 18 11:37:11 2019 +0200 Added gitlint config. This patch adds a config for the gitlint tool which is slated to replace a custom commit-msg hook once we switch our hook infrastructure to the pre-commit tool. Review: https://reviews.apache.org/r/71204/ commit 526043b586da0201fd7e374197139e75b249e299 Author: Benjamin Bannier Date: Wed Sep 18 11:37:10 2019 +0200 Added check script to check for license headers. This check adds a script which validates that source files have valid license headers. This will allow us to reuse this functionality with e.g., the pre-commit tool. 
At the moment the code added here is not invoked from `support/mesos-style.py` since it will be removed in a follow-up commit. Review: https://reviews.apache.org/r/71203/ commit 2232d48ce5b07c4e094c6850cb28212495824110 Author: Benjamin Bannier Date: Wed Sep 18 11:37:09 2019 +0200 Moved cpplint configuration into dedicated file. With this change we not only reduce the amount of code in `support/mesos-style.py` in favor of a configuration supported by upstream, but we also make it easier to interoperate with editor integrations for cpplint. Review: https://reviews.apache.org/r/70096/ {noformat} > Consider moving linter setup to pre-commit >
[jira] [Created] (MESOS-9973) Remove Deprecated Names for libprocess TLS flags
Benno Evers created MESOS-9973: -- Summary: Remove Deprecated Names for libprocess TLS flags Key: MESOS-9973 URL: https://issues.apache.org/jira/browse/MESOS-9973 Project: Mesos Issue Type: Task Reporter: Benno Evers The names `LIBPROCESS_SSL_VERIFY_CERT` and `LIBPROCESS_SSL_REQUIRE_CERT` will become deprecated when https://reviews.apache.org/r/71497 lands. They should be removed at some point. NOTE: This ticket is just to satisfy bureaucracy, I don't think we should actually remove the old names when we release Mesos 2.0.
[jira] [Comment Edited] (MESOS-9966) Agent crashes when trying to destroy orphaned nested container if root container is orphaned as well
[ https://issues.apache.org/jira/browse/MESOS-9966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932151#comment-16932151 ] Qian Zhang edited comment on MESOS-9966 at 9/18/19 10:41 AM: - Actually, I have reproduced this issue: launch the Mesos agent with `--gc_non_executor_container_sandboxes` enabled, launch a task group without checkpointing enabled, and then restart the Mesos agent, which will crash. was (Author: qianzhang): Actually, I have reproduced this issue: launch the Mesos agent with `--gc_non_executor_container_sandboxes` enabled, launch a task group, and then restart the Mesos agent, which will crash. > Agent crashes when trying to destroy orphaned nested container if root > container is orphaned as well > > > Key: MESOS-9966 > URL: https://issues.apache.org/jira/browse/MESOS-9966 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 1.7.3 >Reporter: Jan Schlicht >Assignee: Qian Zhang >Priority: Major > > Noticed an agent crash-looping when trying to recover. It recognized a > container and its nested container as orphaned. When trying to destroy the > nested container, the agent crashes, probably when trying to [get the sandbox > path of the root > container|https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L2966].
> {noformat} > 2019-09-09 05:04:26: I0909 05:04:26.382326 89950 linux_launcher.cpp:286] > Recovering Linux launcher > 2019-09-09 05:04:26: I0909 05:04:26.383162 89950 linux_launcher.cpp:331] Not > recovering cgroup mesos/a127917b-96fe-4100-b73d-5f876ce9ffc1/mesos > 2019-09-09 05:04:26: I0909 05:04:26.383199 89950 linux_launcher.cpp:343] > Recovered container > a127917b-96fe-4100-b73d-5f876ce9ffc1.9783e2bb-7c2e-4930-9d39-4225bb6f1b97 > 2019-09-09 05:04:26: I0909 05:04:26.383216 89950 linux_launcher.cpp:331] Not > recovering cgroup > mesos/a127917b-96fe-4100-b73d-5f876ce9ffc1/mesos/9783e2bb-7c2e-4930-9d39-4225bb6f1b97/mesos > 2019-09-09 05:04:26: I0909 05:04:26.383229 89950 linux_launcher.cpp:343] > Recovered container 2ee154e2-3cc4-420a-99fb-065e740f3091 > 2019-09-09 05:04:26: I0909 05:04:26.383237 89950 linux_launcher.cpp:343] > Recovered container a127917b-96fe-4100-b73d-5f876ce9ffc1 > 2019-09-09 05:04:26: I0909 05:04:26.383249 89950 linux_launcher.cpp:343] > Recovered container > 2ee154e2-3cc4-420a-99fb-065e740f3091.49fe2bf9-17af-415f-92b6-92a4db619436 > 2019-09-09 05:04:26: I0909 05:04:26.383260 89950 linux_launcher.cpp:331] Not > recovering cgroup mesos/2ee154e2-3cc4-420a-99fb-065e740f3091/mesos > 2019-09-09 05:04:26: I0909 05:04:26.383271 89950 linux_launcher.cpp:331] Not > recovering cgroup > mesos/2ee154e2-3cc4-420a-99fb-065e740f3091/mesos/49fe2bf9-17af-415f-92b6-92a4db619436/mesos > 2019-09-09 05:04:26: I0909 05:04:26.383280 89950 linux_launcher.cpp:437] > 2ee154e2-3cc4-420a-99fb-065e740f3091.49fe2bf9-17af-415f-92b6-92a4db619436 is > a known orphaned container > 2019-09-09 05:04:26: I0909 05:04:26.383289 89950 linux_launcher.cpp:437] > a127917b-96fe-4100-b73d-5f876ce9ffc1 is a known orphaned container > 2019-09-09 05:04:26: I0909 05:04:26.383296 89950 linux_launcher.cpp:437] > 2ee154e2-3cc4-420a-99fb-065e740f3091 is a known orphaned container > 2019-09-09 05:04:26: I0909 05:04:26.383304 89950 linux_launcher.cpp:437] > 
a127917b-96fe-4100-b73d-5f876ce9ffc1.9783e2bb-7c2e-4930-9d39-4225bb6f1b97 is > a known orphaned container > 2019-09-09 05:04:26: I0909 05:04:26.383414 89950 containerizer.cpp:1092] > Recovering isolators > 2019-09-09 05:04:26: I0909 05:04:26.385931 89977 memory.cpp:478] Started > listening for OOM events for container a127917b-96fe-4100-b73d-5f876ce9ffc1 > 2019-09-09 05:04:26: I0909 05:04:26.386118 89977 memory.cpp:590] Started > listening on 'low' memory pressure events for container > a127917b-96fe-4100-b73d-5f876ce9ffc1 > 2019-09-09 05:04:26: I0909 05:04:26.386152 89977 memory.cpp:590] Started > listening on 'medium' memory pressure events for container > a127917b-96fe-4100-b73d-5f876ce9ffc1 > 2019-09-09 05:04:26: I0909 05:04:26.386175 89977 memory.cpp:590] Started > listening on 'critical' memory pressure events for container > a127917b-96fe-4100-b73d-5f876ce9ffc1 > 2019-09-09 05:04:26: I0909 05:04:26.386227 89977 memory.cpp:478] Started > listening for OOM events for container 2ee154e2-3cc4-420a-99fb-065e740f3091 > 2019-09-09 05:04:26: I0909 05:04:26.386248 89977 memory.cpp:590] Started > listening on 'low' memory pressure events for container > 2ee154e2-3cc4-420a-99fb-065e740f3091 > 2019-09-09 05:04:26: I0909 05:04:26.386270 89977 memory.cpp:590] Started > listening on 'medium' memory pressure events for container > 2ee154e2-3cc4-420a-99fb-065e740f3091 >
[jira] [Commented] (MESOS-9966) Agent crashes when trying to destroy orphaned nested container if root container is orphaned as well
[ https://issues.apache.org/jira/browse/MESOS-9966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932291#comment-16932291 ] Qian Zhang commented on MESOS-9966: --- The root cause of this issue is that, when we destroy a nested container with the agent flag `--gc_non_executor_container_sandboxes` enabled, we try to get the nested container's sandbox based on its root container's sandbox (see [this code|https://github.com/apache/mesos/blob/1.9.0/src/slave/containerizer/mesos/containerizer.cpp#L2966:L2967] for details): {code:cpp} const string sandboxPath = containerizer::paths::getSandboxPath( containers_[rootContainerId]->directory.get(), containerId);{code} The problem is that we do not keep track of the sandbox for orphan containers (see [this comment|https://github.com/apache/mesos/blob/1.9.0/src/slave/containerizer/mesos/containerizer.hpp#L370:L371] for details), so `containers_[rootContainerId]->directory.get()` crashes the agent. > Agent crashes when trying to destroy orphaned nested container if root > container is orphaned as well > > > Key: MESOS-9966 > URL: https://issues.apache.org/jira/browse/MESOS-9966 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 1.7.3 >Reporter: Jan Schlicht >Assignee: Qian Zhang >Priority: Major > > Noticed an agent crash-looping when trying to recover. It recognized a > container and its nested container as orphaned. When trying to destroy the > nested container, the agent crashes, probably when trying to [get the sandbox > path of the root > container|https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L2966].
> {noformat} > 2019-09-09 05:04:26: I0909 05:04:26.382326 89950 linux_launcher.cpp:286] > Recovering Linux launcher > 2019-09-09 05:04:26: I0909 05:04:26.383162 89950 linux_launcher.cpp:331] Not > recovering cgroup mesos/a127917b-96fe-4100-b73d-5f876ce9ffc1/mesos > 2019-09-09 05:04:26: I0909 05:04:26.383199 89950 linux_launcher.cpp:343] > Recovered container > a127917b-96fe-4100-b73d-5f876ce9ffc1.9783e2bb-7c2e-4930-9d39-4225bb6f1b97 > 2019-09-09 05:04:26: I0909 05:04:26.383216 89950 linux_launcher.cpp:331] Not > recovering cgroup > mesos/a127917b-96fe-4100-b73d-5f876ce9ffc1/mesos/9783e2bb-7c2e-4930-9d39-4225bb6f1b97/mesos > 2019-09-09 05:04:26: I0909 05:04:26.383229 89950 linux_launcher.cpp:343] > Recovered container 2ee154e2-3cc4-420a-99fb-065e740f3091 > 2019-09-09 05:04:26: I0909 05:04:26.383237 89950 linux_launcher.cpp:343] > Recovered container a127917b-96fe-4100-b73d-5f876ce9ffc1 > 2019-09-09 05:04:26: I0909 05:04:26.383249 89950 linux_launcher.cpp:343] > Recovered container > 2ee154e2-3cc4-420a-99fb-065e740f3091.49fe2bf9-17af-415f-92b6-92a4db619436 > 2019-09-09 05:04:26: I0909 05:04:26.383260 89950 linux_launcher.cpp:331] Not > recovering cgroup mesos/2ee154e2-3cc4-420a-99fb-065e740f3091/mesos > 2019-09-09 05:04:26: I0909 05:04:26.383271 89950 linux_launcher.cpp:331] Not > recovering cgroup > mesos/2ee154e2-3cc4-420a-99fb-065e740f3091/mesos/49fe2bf9-17af-415f-92b6-92a4db619436/mesos > 2019-09-09 05:04:26: I0909 05:04:26.383280 89950 linux_launcher.cpp:437] > 2ee154e2-3cc4-420a-99fb-065e740f3091.49fe2bf9-17af-415f-92b6-92a4db619436 is > a known orphaned container > 2019-09-09 05:04:26: I0909 05:04:26.383289 89950 linux_launcher.cpp:437] > a127917b-96fe-4100-b73d-5f876ce9ffc1 is a known orphaned container > 2019-09-09 05:04:26: I0909 05:04:26.383296 89950 linux_launcher.cpp:437] > 2ee154e2-3cc4-420a-99fb-065e740f3091 is a known orphaned container > 2019-09-09 05:04:26: I0909 05:04:26.383304 89950 linux_launcher.cpp:437] > 
a127917b-96fe-4100-b73d-5f876ce9ffc1.9783e2bb-7c2e-4930-9d39-4225bb6f1b97 is > a known orphaned container > 2019-09-09 05:04:26: I0909 05:04:26.383414 89950 containerizer.cpp:1092] > Recovering isolators > 2019-09-09 05:04:26: I0909 05:04:26.385931 89977 memory.cpp:478] Started > listening for OOM events for container a127917b-96fe-4100-b73d-5f876ce9ffc1 > 2019-09-09 05:04:26: I0909 05:04:26.386118 89977 memory.cpp:590] Started > listening on 'low' memory pressure events for container > a127917b-96fe-4100-b73d-5f876ce9ffc1 > 2019-09-09 05:04:26: I0909 05:04:26.386152 89977 memory.cpp:590] Started > listening on 'medium' memory pressure events for container > a127917b-96fe-4100-b73d-5f876ce9ffc1 > 2019-09-09 05:04:26: I0909 05:04:26.386175 89977 memory.cpp:590] Started > listening on 'critical' memory pressure events for container > a127917b-96fe-4100-b73d-5f876ce9ffc1 > 2019-09-09 05:04:26: I0909 05:04:26.386227 89977 memory.cpp:478] Started > listening for OOM events for container 2ee154e2-3cc4-420a-99fb-065e740f3091 > 2019-09-09
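The root cause described in the MESOS-9966 comment (reading a sandbox directory that is not tracked for orphaned containers) can be modelled in a few lines. This is a sketch only: `Container`, `nested_sandbox_path`, and the `containers/<id>` path layout are assumptions made for illustration, and the guard shown is one possible way to avoid the crash, not necessarily the fix adopted in Mesos.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Container:
    # None models an orphaned container whose sandbox is not tracked.
    directory: Optional[str]


def nested_sandbox_path(containers: Dict[str, Container],
                        root_id: str, nested_id: str) -> Optional[str]:
    """Return the nested container's sandbox path, or None if unknown."""
    root = containers.get(root_id)
    if root is None or root.directory is None:
        # Unguarded access here corresponds to calling get() on an
        # empty Option, which trips the `isSome()` assertion.
        return None
    return f"{root.directory}/containers/{nested_id}"


containers = {
    "root-tracked": Container(directory="/sandbox/root-tracked"),
    "root-orphan": Container(directory=None),
}
print(nested_sandbox_path(containers, "root-tracked", "child"))
print(nested_sandbox_path(containers, "root-orphan", "child"))
```

With such a guard, a caller that garbage-collects nested sandboxes would skip the cleanup step when the path is unknown instead of aborting.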
[jira] [Created] (MESOS-9972) Update Names for TLS-related environment variables in libprocess.
Benno Evers created MESOS-9972: -- Summary: Update Names for TLS-related environment variables in libprocess. Key: MESOS-9972 URL: https://issues.apache.org/jira/browse/MESOS-9972 Project: Mesos Issue Type: Bug Reporter: Benno Evers The environment variables `LIBPROCESS_SSL_VERIFY_CERT` and `LIBPROCESS_SSL_REQUIRE_CERT` regularly cause confusion because they do not precisely describe their function. In particular, one might mistakenly assume that certificates are not required when setting `LIBPROCESS_SSL_REQUIRE_CERT=false`, or that all certificates are verified when `LIBPROCESS_SSL_VERIFY_CERT=true`. We should rename the options to `LIBPROCESS_SSL_VERIFY_SERVER_CERT` and `LIBPROCESS_SSL_REQUIRE_CLIENT_CERT` to make the semantics clearer.
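A rename like the one proposed in MESOS-9972 is commonly rolled out by honoring both names for a while. The sketch below shows that pattern; it is not libprocess code. The helper `read_bool_flag`, the precedence of the new name over the deprecated one, and the accepted spellings of a true value are all assumptions.

```python
import os


def read_bool_flag(environ, new_name, deprecated_name, default=False):
    """Read a boolean flag, preferring the new name over the deprecated one."""
    for name in (new_name, deprecated_name):
        value = environ.get(name)
        if value is not None:
            # Assumed parsing rule: "1" or "true" (any case) means enabled.
            return value.strip().lower() in ("1", "true")
    return default


verify_server_cert = read_bool_flag(
    os.environ,
    "LIBPROCESS_SSL_VERIFY_SERVER_CERT",
    "LIBPROCESS_SSL_VERIFY_CERT",
)
require_client_cert = read_bool_flag(
    os.environ,
    "LIBPROCESS_SSL_REQUIRE_CLIENT_CERT",
    "LIBPROCESS_SSL_REQUIRE_CERT",
)
print(verify_server_cert, require_client_cert)
```

Passing the environment as a parameter keeps the lookup easy to exercise with a plain dictionary, for example to check that an old-style setting is still picked up when the new name is unset.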
[jira] [Created] (MESOS-9971) Mesos failed to build due to error MSB6006 on Windows with MSVC.
LinGao created MESOS-9971:

Summary: Mesos failed to build due to error MSB6006 on Windows with MSVC.
Key: MESOS-9971
URL: https://issues.apache.org/jira/browse/MESOS-9971
Project: Mesos
Issue Type: Bug
Components: build
Affects Versions: master
Environment: VS 2017 + Windows Server 2016
Reporter: LinGao
Attachments: log_x64_build.log

Mesos failed to build due to error MSB6006: "cmd.exe" exited with code 1 on Windows using MSVC. It can first be reproduced at revision e0f7e2d on the master branch. Could you please take a look at this issue? Thanks a lot!

Reproduce steps:
1. git clone -c core.autocrlf=true https://github.com/apache/mesos D:\mesos\src
2. Open a VS 2017 x64 command prompt as admin and browse to D:\mesos
3. cd src
4. .\bootstrap.bat
5. cd ..
6. mkdir build_x64 && pushd build_x64
7. cmake ..\src -G "Visual Studio 15 2017 Win64" -DCMAKE_SYSTEM_VERSION=10.0.17134.0 -DENABLE_LIBEVENT=1 -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="C:\gnuwin32\bin" -T host=x64
8. msbuild Mesos.sln /p:Configuration=Debug /p:Platform=x64 /maxcpucount:4 /t:Rebuild

ErrorMessage:
67>PrepareForBuild:
  Creating directory "x64\Debug\dist\dist.tlog\".
InitializeBuildStatus:
  Creating "x64\Debug\dist\dist.tlog\unsuccessfulbuild" because "AlwaysCreate" was specified.
67>C:\Program Files (x86)\Microsoft Visual Studio\2017\Enterprise\Common7\IDE\VC\VCTargets\Microsoft.CppCommon.targets(209,5): error MSB6006: "cmd.exe" exited with code 1. [D:\Mesos\build_x64\dist.vcxproj]
67>Done Building Project "D:\Mesos\build_x64\dist.vcxproj" (Rebuild target(s)) -- FAILED.
[jira] [Commented] (MESOS-9969) Agent crashes when trying to clean up volue
[ https://issues.apache.org/jira/browse/MESOS-9969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932203#comment-16932203 ] Tomas Barton commented on MESOS-9969: - I was shutting down the agent using: {code} systemctl kill -s SIGUSR1 dcos-mesos-slave && systemctl stop dcos-mesos-slave {code} The volume is just a local file, otherwise default config. {{--gc_non_executor_container_sandboxes}} is set to {{true}} > Agent crashes when trying to clean up volue > --- > > Key: MESOS-9969 > URL: https://issues.apache.org/jira/browse/MESOS-9969 > Project: Mesos > Issue Type: Bug > Components: agent >Affects Versions: 1.8.2 >Reporter: Tomas Barton >Priority: Major > > {code} > Sep 17 13:49:26 w03 mesos-agent[21803]: I0917 13:49:26.081748 21828 > linux_launcher.cpp:650] Destroying cgroup > '/sys/fs/cgroup/systemd/mesos/370ed262-4041-4180-a7e1-9ea78070e3a6' > Sep 17 13:49:26 w03 mesos-agent[21803]: I0917 13:49:26.081876 21832 > containerizer.cpp:2907] Checkpointing termination state to nested container's > runtime directory > '/var/run/mesos/containers/8e3997e7-c53a-4043-9a7e-26a2e436a041/containers/ae0bdc6d-c738-4352-b5d4-7572182671d5/termination' > Sep 17 13:49:26 w03 mesos-agent[21803]: mesos-agent: > /pkg/src/mesos/3rdparty/stout/include/stout/option.hpp:120: T& > Option::get() & [with T = std::basic_string]: Assertion `isSome()' > failed. 
> Sep 17 13:49:26 w03 mesos-agent[21803]: *** Aborted at 1568728166 (unix time) > try "date -d @1568728166" if you are using GNU date *** > Sep 17 13:49:26 w03 mesos-agent[21803]: W0917 13:49:26.082281 21835 > disk.cpp:453] Ignoring cleanup for unknown container > a9ba6959-ea02-4543-b7d5-92a63940 > Sep 17 13:49:26 w03 mesos-agent[21803]: PC: @ 0x7f16a3867fff (unknown) > Sep 17 13:49:26 w03 mesos-agent[21803]: *** SIGABRT (@0x552b) received by PID > 21803 (TID 0x7f169e47d700) from PID 21803; stack trace: *** > Sep 17 13:49:26 w03 mesos-agent[21803]: E0917 13:49:26.082608 21835 > memory.cpp:501] Listening on OOM events failed for container > a9ba6959-ea02-4543-b7d5-92a63940: Event listener is terminating > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a3be50e0 (unknown) > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a3867fff (unknown) > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a386942a (unknown) > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a3860e67 (unknown) > Sep 17 13:49:26 w03 mesos-agent[21803]: I0917 13:49:26.083741 21835 > linux.cpp:1074] Unmounting volume > '/var/lib/mesos/slave/slaves/04e596b7-f03d-4cba-bbbc-fa9e0aebb5d2-S17/frameworks/04e596b7-f03d-4cba-bbbc-fa9e0aebb5d2-0003/executors/es01__coordinator__8591ac8e-3d9d-45ac-bb68-bee379c8c4a4/runs/a9ba6959-ea02-4543-b7d5-92a63940/container-path' > for con > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a3860f12 (unknown) > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a7654f13 > _ZNR6OptionISsE3getEv.part.152 > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a7666b2f > mesos::internal::slave::MesosContainerizerProcess::__destroy() > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a861cb41 > process::ProcessBase::consume() > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a8633c9c > process::ProcessManager::resume() > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a86398a6 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv > Sep 17 13:49:26 
w03 mesos-agent[21803]: @ 0x7f16a43c6200 (unknown) > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a3bdb4a4 start_thread > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a391dd0f (unknown) > Sep 17 13:49:26 w03 systemd[1]: dcos-mesos-slave.service: Main process > exited, code=killed, status=6/ABRT > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
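For readers unfamiliar with the failure mode in the trace above: the abort comes from calling `get()` on an empty `Option` of a string, reached from `MesosContainerizerProcess::__destroy()`. The toy below mimics that behaviour and shows the usual guard. It is a rough sketch only; `Option` here is a stand-in written for illustration rather than the stout class, and `sandbox_for_cleanup` is a hypothetical helper, since the trace does not say which value was empty.

```python
class Option:
    """Toy optional value for illustration; not the stout implementation."""

    def __init__(self, value=None, present=False):
        self._value = value
        self._present = present

    @classmethod
    def some(cls, value):
        return cls(value, True)

    @classmethod
    def none(cls):
        return cls()

    def is_some(self):
        return self._present

    def get(self):
        # Mirrors the failed assertion in the log: get() on an empty
        # Option is a programming error and aborts.
        if not self._present:
            raise AssertionError("Assertion `isSome()' failed.")
        return self._value


def sandbox_for_cleanup(directory):
    """Hypothetical helper: return the path, or None if none was recorded."""
    if not directory.is_some():
        return None
    return directory.get()
```

Checking `is_some()` before `get()` turns the crash into a case the caller has to handle explicitly, for example by skipping cleanup for that container.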
[jira] [Commented] (MESOS-9969) Agent crashes when trying to clean up volue
[ https://issues.apache.org/jira/browse/MESOS-9969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932173#comment-16932173 ] Jan Schlicht commented on MESOS-9969: - This looks like MESOS-9966. > Agent crashes when trying to clean up volue > --- > > Key: MESOS-9969 > URL: https://issues.apache.org/jira/browse/MESOS-9969 > Project: Mesos > Issue Type: Bug > Components: agent >Affects Versions: 1.8.2 >Reporter: Tomas Barton >Priority: Major
[jira] [Commented] (MESOS-9966) Agent crashes when trying to destroy orphaned nested container if root container is orphaned as well
[ https://issues.apache.org/jira/browse/MESOS-9966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932152#comment-16932152 ] Jan Schlicht commented on MESOS-9966: - You're right, the flag is enabled. > Agent crashes when trying to destroy orphaned nested container if root > container is orphaned as well > > > Key: MESOS-9966 > URL: https://issues.apache.org/jira/browse/MESOS-9966 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 1.7.3 >Reporter: Jan Schlicht >Assignee: Qian Zhang >Priority: Major > > Noticed an agent crash-looping when trying to recover. It recognized a > container and its nested container as orphaned. When trying to destroy the > nested container, the agent crashes. Probably when trying to [get the sandbox > path of the root > container|https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L2966]. > {noformat} > 2019-09-09 05:04:26: I0909 05:04:26.382326 89950 linux_launcher.cpp:286] > Recovering Linux launcher > 2019-09-09 05:04:26: I0909 05:04:26.383162 89950 linux_launcher.cpp:331] Not > recovering cgroup mesos/a127917b-96fe-4100-b73d-5f876ce9ffc1/mesos > 2019-09-09 05:04:26: I0909 05:04:26.383199 89950 linux_launcher.cpp:343] > Recovered container > a127917b-96fe-4100-b73d-5f876ce9ffc1.9783e2bb-7c2e-4930-9d39-4225bb6f1b97 > 2019-09-09 05:04:26: I0909 05:04:26.383216 89950 linux_launcher.cpp:331] Not > recovering cgroup > mesos/a127917b-96fe-4100-b73d-5f876ce9ffc1/mesos/9783e2bb-7c2e-4930-9d39-4225bb6f1b97/mesos > 2019-09-09 05:04:26: I0909 05:04:26.383229 89950 linux_launcher.cpp:343] > Recovered container 2ee154e2-3cc4-420a-99fb-065e740f3091 > 2019-09-09 05:04:26: I0909 05:04:26.383237 89950 linux_launcher.cpp:343] > Recovered container a127917b-96fe-4100-b73d-5f876ce9ffc1 > 2019-09-09 05:04:26: I0909 05:04:26.383249 89950 linux_launcher.cpp:343] > Recovered container > 2ee154e2-3cc4-420a-99fb-065e740f3091.49fe2bf9-17af-415f-92b6-92a4db619436 > 
2019-09-09 05:04:26: I0909 05:04:26.383260 89950 linux_launcher.cpp:331] Not > recovering cgroup mesos/2ee154e2-3cc4-420a-99fb-065e740f3091/mesos > 2019-09-09 05:04:26: I0909 05:04:26.383271 89950 linux_launcher.cpp:331] Not > recovering cgroup > mesos/2ee154e2-3cc4-420a-99fb-065e740f3091/mesos/49fe2bf9-17af-415f-92b6-92a4db619436/mesos > 2019-09-09 05:04:26: I0909 05:04:26.383280 89950 linux_launcher.cpp:437] > 2ee154e2-3cc4-420a-99fb-065e740f3091.49fe2bf9-17af-415f-92b6-92a4db619436 is > a known orphaned container > 2019-09-09 05:04:26: I0909 05:04:26.383289 89950 linux_launcher.cpp:437] > a127917b-96fe-4100-b73d-5f876ce9ffc1 is a known orphaned container > 2019-09-09 05:04:26: I0909 05:04:26.383296 89950 linux_launcher.cpp:437] > 2ee154e2-3cc4-420a-99fb-065e740f3091 is a known orphaned container > 2019-09-09 05:04:26: I0909 05:04:26.383304 89950 linux_launcher.cpp:437] > a127917b-96fe-4100-b73d-5f876ce9ffc1.9783e2bb-7c2e-4930-9d39-4225bb6f1b97 is > a known orphaned container > 2019-09-09 05:04:26: I0909 05:04:26.383414 89950 containerizer.cpp:1092] > Recovering isolators > 2019-09-09 05:04:26: I0909 05:04:26.385931 89977 memory.cpp:478] Started > listening for OOM events for container a127917b-96fe-4100-b73d-5f876ce9ffc1 > 2019-09-09 05:04:26: I0909 05:04:26.386118 89977 memory.cpp:590] Started > listening on 'low' memory pressure events for container > a127917b-96fe-4100-b73d-5f876ce9ffc1 > 2019-09-09 05:04:26: I0909 05:04:26.386152 89977 memory.cpp:590] Started > listening on 'medium' memory pressure events for container > a127917b-96fe-4100-b73d-5f876ce9ffc1 > 2019-09-09 05:04:26: I0909 05:04:26.386175 89977 memory.cpp:590] Started > listening on 'critical' memory pressure events for container > a127917b-96fe-4100-b73d-5f876ce9ffc1 > 2019-09-09 05:04:26: I0909 05:04:26.386227 89977 memory.cpp:478] Started > listening for OOM events for container 2ee154e2-3cc4-420a-99fb-065e740f3091 > 2019-09-09 05:04:26: I0909 05:04:26.386248 89977 memory.cpp:590] Started > 
listening on 'low' memory pressure events for container > 2ee154e2-3cc4-420a-99fb-065e740f3091 > 2019-09-09 05:04:26: I0909 05:04:26.386270 89977 memory.cpp:590] Started > listening on 'medium' memory pressure events for container > 2ee154e2-3cc4-420a-99fb-065e740f3091 > 2019-09-09 05:04:26: I0909 05:04:26.386376 89977 memory.cpp:590] Started > listening on 'critical' memory pressure events for container > 2ee154e2-3cc4-420a-99fb-065e740f3091 > 2019-09-09 05:04:26: I0909 05:04:26.386694 89921 containerizer.cpp:1131] > Recovering provisioner > 2019-09-09 05:04:26: I0909 05:04:26.388226 90010 metadata_manager.cpp:286] > Successfully loaded 64 Docker images > 2019-09-09 05:04:26: I0909 05:04:26.388420
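The situation described in this report (a nested container and its root both reported as orphaned) can be spotted mechanically in the launcher log. The following is a minimal sketch under two assumptions: that wrapped log lines have been rejoined, and that the dotted IDs in the log mean `<root>.<nested>`, which the cgroup paths in the same log suggest. The function names are hypothetical, and the sample lines are copied from the log above.

```python
import re

# Matches the launcher's "known orphaned container" lines shown in the log.
ORPHAN_RE = re.compile(r"(\S+) is a known orphaned container")

SAMPLE = [
    "I0909 05:04:26.383280 89950 linux_launcher.cpp:437] "
    "2ee154e2-3cc4-420a-99fb-065e740f3091.49fe2bf9-17af-415f-92b6-92a4db619436 "
    "is a known orphaned container",
    "I0909 05:04:26.383289 89950 linux_launcher.cpp:437] "
    "a127917b-96fe-4100-b73d-5f876ce9ffc1 is a known orphaned container",
    "I0909 05:04:26.383296 89950 linux_launcher.cpp:437] "
    "2ee154e2-3cc4-420a-99fb-065e740f3091 is a known orphaned container",
    "I0909 05:04:26.383304 89950 linux_launcher.cpp:437] "
    "a127917b-96fe-4100-b73d-5f876ce9ffc1.9783e2bb-7c2e-4930-9d39-4225bb6f1b97 "
    "is a known orphaned container",
]


def orphaned_containers(lines):
    """Return the container IDs the launcher reported as orphaned."""
    found = []
    for line in lines:
        match = ORPHAN_RE.search(line)
        if match:
            found.append(match.group(1))
    return found


def nested_with_orphaned_root(orphans):
    """Return (nested_id, root_id) pairs where both are orphaned."""
    known = set(orphans)
    pairs = []
    for container_id in orphans:
        if "." in container_id:
            # Assumption: the part before the first dot is the root container.
            root = container_id.split(".", 1)[0]
            if root in known:
                pairs.append((container_id, root))
    return pairs
```

Running `nested_with_orphaned_root(orphaned_containers(SAMPLE))` on the sample lines yields both nested containers, each paired with its orphaned root.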
[jira] [Comment Edited] (MESOS-9966) Agent crashes when trying to destroy orphaned nested container if root container is orphaned as well
[ https://issues.apache.org/jira/browse/MESOS-9966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932151#comment-16932151 ] Qian Zhang edited comment on MESOS-9966 at 9/18/19 7:19 AM: Actually I have reproduced this issue, just launch Mesos agent with `--gc_non_executor_container_sandboxes` enabled, launch a task group, and then restart Mesos agent which will crash. was (Author: qianzhang): Actually I have reproduced this issue, just launch Mesos agent with `--gc_non_executor_container_sandboxes` enabled, launched a task group, and then restart Mesos agent which will crash. > Agent crashes when trying to destroy orphaned nested container if root > container is orphaned as well > > > Key: MESOS-9966 > URL: https://issues.apache.org/jira/browse/MESOS-9966 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 1.7.3 >Reporter: Jan Schlicht >Assignee: Qian Zhang >Priority: Major > > Noticed an agent crash-looping when trying to recover. It recognized a > container and its nested container as orphaned. When trying to destroy the > nested container, the agent crashes. Probably when trying to [get the sandbox > path of the root > container|https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L2966]. 
[jira] [Commented] (MESOS-9966) Agent crashes when trying to destroy orphaned nested container if root container is orphaned as well
[ https://issues.apache.org/jira/browse/MESOS-9966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932151#comment-16932151 ] Qian Zhang commented on MESOS-9966: --- Actually I have reproduced this issue, just launch Mesos agent with `--gc_non_executor_container_sandboxes` enabled, launched a task group, and then restart Mesos agent which will crash. > Agent crashes when trying to destroy orphaned nested container if root > container is orphaned as well > > > Key: MESOS-9966 > URL: https://issues.apache.org/jira/browse/MESOS-9966 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 1.7.3 >Reporter: Jan Schlicht >Assignee: Qian Zhang >Priority: Major > > Noticed an agent crash-looping when trying to recover. It recognized a > container and its nested container as orphaned. When trying to destroy the > nested container, the agent crashes. Probably when trying to [get the sandbox > path of the root > container|https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L2966]. 
[jira] [Commented] (MESOS-9966) Agent crashes when trying to destroy orphaned nested container if root container is orphaned as well
[ https://issues.apache.org/jira/browse/MESOS-9966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932150#comment-16932150 ] Jan Schlicht commented on MESOS-9966: - According to the stack trace we are hitting the code. Let me double-check if {{gc_non_executor_container_sandboxes}} is enabled. > Agent crashes when trying to destroy orphaned nested container if root > container is orphaned as well > > > Key: MESOS-9966 > URL: https://issues.apache.org/jira/browse/MESOS-9966 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 1.7.3 >Reporter: Jan Schlicht >Assignee: Qian Zhang >Priority: Major > > Noticed an agent crash-looping when trying to recover. It recognized a > container and its nested container as orphaned. When trying to destroy the > nested container, the agent crashes. Probably when trying to [get the sandbox > path of the root > container|https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L2966]. 