[jira] [Assigned] (MESOS-3094) Mesos on Windows
[ https://issues.apache.org/jira/browse/MESOS-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Li reassigned MESOS-3094: Assignee: Li Li (was: Alex Clemmer) > Mesos on Windows > > > Key: MESOS-3094 > URL: https://issues.apache.org/jira/browse/MESOS-3094 > Project: Mesos > Issue Type: Epic > Components: containerization, libprocess, stout > Reporter: Joseph Wu > Assignee: Li Li > Labels: mesosphere > > The ultimate goal of this is to have all containerizer tests running and > passing on Windows Server. > # It must build (see MESOS-898). > # All OS-specific code (that is touched by the containerizer) must be ported > to Windows. > # The containerizer itself must be ported to Windows, alongside the > MesosContainerizer. > Note: Isolation (cgroups) will probably not exist on Windows. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7337) DefaultExecutorCheckTest.CommandCheckTimeout becomes flaky under load
[ https://issues.apache.org/jira/browse/MESOS-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-7337: Attachment: DefaultExecutorCheckTest.CommandCheckTimeout.log [~anandmazumdar]: Attached output for {{MESOS_VERBOSE=1 GLOG_v=1}}, again from parallel execution. I was not able to reproduce this in standalone execution with extra load from e.g., concurrent compilation jobs or {{stress}}. > DefaultExecutorCheckTest.CommandCheckTimeout becomes flaky under load > - > > Key: MESOS-7337 > URL: https://issues.apache.org/jira/browse/MESOS-7337 > Project: Mesos > Issue Type: Bug > Components: flaky, test > Environment: Mac OS 10.12.4 (16E195), SSL debug build w/o > optimizations, clang version 5.0.0 (http://llvm.org/git/clang > c511a96ffe744933459ef64bf963629538057a90) (http://llvm.org/git/llvm > 0cd81d8a1055f167e0f588dd1b476863b00da3d5) >Reporter: Benjamin Bannier > Labels: flaky-test, mesosphere > Attachments: DefaultExecutorCheckTest.CommandCheckTimeout.log > > > The test {{DefaultExecutorCheckTest.CommandCheckTimeout}} randomly fails for > me when executing tests in parallel, e.g., > {code} > [ RUN ] DefaultExecutorCheckTest.CommandCheckTimeout > ../../src/tests/check_tests.cpp:1374: Failure > Failed to wait 15secs for updateCheckResultTimeout > ../../src/tests/check_tests.cpp:1334: Failure > Actual function call count doesn't match EXPECT_CALL(*scheduler, update(_, > _))... > Expected: to be called at least 3 times >Actual: called twice - unsatisfied and active > [ FAILED ] DefaultExecutorCheckTest.CommandCheckTimeout (25351 ms) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Comment Edited] (MESOS-7337) DefaultExecutorCheckTest.CommandCheckTimeout becomes flaky under load
[ https://issues.apache.org/jira/browse/MESOS-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954151#comment-15954151 ] Benjamin Bannier edited comment on MESOS-7337 at 4/3/17 8:54 PM: - [~anandmazumdar]: Attached output for {{MESOS_VERBOSE=1 GLOG_v=1}}, again from parallel execution. I was not able to reproduce this in standalone execution with extra load from e.g., concurrent compilation jobs or {{stress}}, so this could also be related to the test not behaving well in parallel execution due to conflicts. was (Author: bbannier): [~anandmazumdar]: Attached output for {{MESOS_VERBOSE=1 GLOG_v=1}}, again from parallel execution. I was not able to reproduce this in standalone execution with extra load from e.g., concurrent compilation jobs or {{stress}}. > DefaultExecutorCheckTest.CommandCheckTimeout becomes flaky under load > - > > Key: MESOS-7337 > URL: https://issues.apache.org/jira/browse/MESOS-7337 > Project: Mesos > Issue Type: Bug > Components: flaky, test > Environment: Mac OS 10.12.4 (16E195), SSL debug build w/o > optimizations, clang version 5.0.0 (http://llvm.org/git/clang > c511a96ffe744933459ef64bf963629538057a90) (http://llvm.org/git/llvm > 0cd81d8a1055f167e0f588dd1b476863b00da3d5) > Reporter: Benjamin Bannier > Labels: flaky-test, mesosphere > Attachments: DefaultExecutorCheckTest.CommandCheckTimeout.log > > > The test {{DefaultExecutorCheckTest.CommandCheckTimeout}} randomly fails for > me when executing tests in parallel, e.g., > {code} > [ RUN ] DefaultExecutorCheckTest.CommandCheckTimeout > ../../src/tests/check_tests.cpp:1374: Failure > Failed to wait 15secs for updateCheckResultTimeout > ../../src/tests/check_tests.cpp:1334: Failure > Actual function call count doesn't match EXPECT_CALL(*scheduler, update(_, > _))... > Expected: to be called at least 3 times > Actual: called twice - unsatisfied and active > [ FAILED ] DefaultExecutorCheckTest.CommandCheckTimeout (25351 ms) > {code}
[jira] [Updated] (MESOS-6742) Adding support for s390x architecture
[ https://issues.apache.org/jira/browse/MESOS-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-6742: -- Shepherd: Till Toenshoff (was: Vinod Kone) I'll let [~tillt] shepherd this since he has been reviewing the patch. > Adding support for s390x architecture > -- > > Key: MESOS-6742 > URL: https://issues.apache.org/jira/browse/MESOS-6742 > Project: Mesos > Issue Type: Bug >Reporter: Ayanampudi Varsha >Assignee: Ayanampudi Varsha > > There are 2 issues: > 1. LdcacheTest.Parse test case fails on s390x machines. > 2. From the value of flag docker_registry in slave/flags.cpp, amd64 images > get downloaded due to which test cases fail on s390x with "Exec format Error" -- This message was sent by Atlassian JIRA (v6.3.15#6346)
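The second problem in the report above is an architecture mismatch: amd64 images are pulled onto an s390x host and then fail with "Exec format Error". As a rough illustration only (this is not the patch under review, and the helper names `docker_arch` and `image_matches_host` are made up for this sketch), a check of that kind could map the kernel's machine name to the architecture name used for Docker images and compare:

```python
import platform

# Machine names as reported by the OS (e.g. `uname -m`), mapped to the
# architecture names commonly used for Docker images.
MACHINE_TO_DOCKER_ARCH = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "s390x": "s390x",
    "ppc64le": "ppc64le",
}


def docker_arch(machine=None):
    """Return the Docker architecture name for a machine name, or None if unknown.

    Hypothetical helper: defaults to the machine this code runs on.
    """
    machine = (machine or platform.machine()).lower()
    return MACHINE_TO_DOCKER_ARCH.get(machine)


def image_matches_host(image_arch, machine=None):
    """True if an image built for `image_arch` can execute on the given machine."""
    return docker_arch(machine) == image_arch
```

With this sketch, `image_matches_host("amd64", "s390x")` is false, which corresponds to the failing case described in the issue.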
[jira] [Commented] (MESOS-7209) Mesos failed to build due to error MSB6006: "cmd.exe" exited with code 255 on windows
[ https://issues.apache.org/jira/browse/MESOS-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954029#comment-15954029 ] Li Li commented on MESOS-7209: -- I hit the same issue with main stream. I will take a look. Thanks. > Mesos failed to build due to error MSB6006: "cmd.exe" exited with code 255 on > windows > - > > Key: MESOS-7209 > URL: https://issues.apache.org/jira/browse/MESOS-7209 > Project: Mesos > Issue Type: Bug > Environment: Windows 10 (64bit) + VS2015 Update 3 > Reporter: Karen Huang > > I tried to build Mesos with the Debug|x64 configuration on Windows. It failed to > build due to error MSB6006: "cmd.exe" exited with code > 255.[F:\mesos\build_x64\ensure_tool_arch.vcxproj]. This error is reported > when building the ensure_tool_arch.vcxproj project. > Here are the repro steps: > 1. git clone -c core.autocrlf=true https://github.com/apache/mesos > F:\mesos\src > 2. Open a VS amd64 command prompt as admin and browse to F:\mesos\src > 3. set PreferredToolArchitecture=x64 > 4. bootstrap.bat > 5. mkdir build_x64 && pushd build_x64 > 6. cmake ..\src -G "Visual Studio 14 2015 Win64" -DENABLE_LIBEVENT=1 > -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="C:\gnuwin32\bin" > 7. msbuild Mesos.sln /p:Configuration=Debug /p:Platform=x64 /m /t:Rebuild > Error message: > CustomBuild: > Building Custom Rule F:/mesos/src/CMakeLists.txt > CMake does not need to re-run because > F:\mesos\build_x64\CMakeFiles\generate.stamp is up-to-date. > ( was unexpected at this time. > 43>C:\Program Files > (x86)\MSBuild\Microsoft.Cpp\v4.0\V140\Microsoft.CppCommon.targets(171,5): > error MSB6006: "cmd.exe" exited with code 255. > [F:\mesos\build_x64\ensure_tool_arch.vcxproj] > If you build the project ensure_tool_arch.vcxproj in the VS IDE separately, the > error info is as below: > 2>-- Rebuild All started: Project: ensure_tool_arch, Configuration: Debug > x64 -- > 2> Building Custom Rule D:/Mesos/src/CMakeLists.txt > 2> CMake does not need to re-run because > D:\Mesos\build_x64\CMakeFiles\generate.stamp is up-to-date. > 2> ( was unexpected at this time. > 2>C:\Program Files > (x86)\MSBuild\Microsoft.Cpp\v4.0\V140\Microsoft.CppCommon.targets(171,5): > error MSB6006: "cmd.exe" exited with code 255.
[jira] [Created] (MESOS-7338) Implement clang-tidy check(s) for namespace usage
Neil Conway created MESOS-7338: -- Summary: Implement clang-tidy check(s) for namespace usage Key: MESOS-7338 URL: https://issues.apache.org/jira/browse/MESOS-7338 Project: Mesos Issue Type: Bug Reporter: Neil Conway For example, if a {{.cpp}} file contains both {{x::y}} and {{using x::y}}, that is typically a style mistake. We could potentially identify unused {{using}} statements, although that can be tricky due to C++ identifier lookup rules. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
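To make the kind of mistake described above concrete, here is a minimal sketch that flags names which appear both in a `using x::y;` declaration and fully qualified elsewhere in the same file. It is a regex-based approximation written for illustration, not a clang-tidy check (a real check would work on the AST and respect C++ lookup rules), and the function name `redundant_qualifications` is hypothetical:

```python
import re

# Matches using-declarations such as `using process::Future;`.
# It does not match using-directives (`using namespace x;`).
USING_RE = re.compile(r"^\s*using\s+((?:\w+::)+\w+)\s*;", re.MULTILINE)


def redundant_qualifications(source):
    """Map each name from a `using x::y;` declaration to the number of times
    `x::y` is still written fully qualified elsewhere in `source`."""
    declared = set(USING_RE.findall(source))
    # Drop the declarations themselves so they are not counted as uses.
    remainder = USING_RE.sub("", source)
    findings = {}
    for name in sorted(declared):
        # Require that the match is not part of a longer qualified name.
        pattern = re.compile(r"(?<![\w:])" + re.escape(name) + r"(?![\w:])")
        hits = len(pattern.findall(remainder))
        if hits:
            findings[name] = hits
    return findings
```

For a file that declares `using process::Future;` and later writes `process::Future<string>`, the sketch reports `{"process::Future": 1}`. Text inside comments and string literals is not excluded, which is one reason a real implementation belongs in clang-tidy.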
[jira] [Commented] (MESOS-7332) when submit an docker job to mesos, the agent show errors
[ https://issues.apache.org/jira/browse/MESOS-7332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953878#comment-15953878 ] Gilbert Song commented on MESOS-7332: - [~helloiss], I tested on Ubuntu 14.04 and 15.10, but I am in the US. Based on [~yuyang]'s comment on MESOS-6810, this seems to be due to Docker Hub being blocked in China. [~helloiss], could you try out the solution [~yuyang] posted? > when submit an docker job to mesos, the agent show errors > -- > > Key: MESOS-7332 > URL: https://issues.apache.org/jira/browse/MESOS-7332 > Project: Mesos > Issue Type: Bug > Components: agent > Affects Versions: 1.2.0 > Environment: ubuntu 16.04 > Reporter: liliang > Priority: Critical > > When running the following task > sudo mesos-execute --master=10.139.176.201:5050 --name=test > --docker_image=hello-world --shell=true > I got errors on the agent side: > Running on machine: gorilla > Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg > E0331 22:45:05.706740 20993 shell.hpp:107] Command 'hadoop version 2>&1' > failed; this is the output: > sh: 1: hadoop: not found > E0331 22:46:14.837656 21014 slave.cpp:4650] Container > '48256fd6-9f45-4725-a72f-195326798f2d' for executor 'test' of framework > 0a104215-eb82-4b1b-93e6-3fc765fc67af-0003 failed to start: Collect failed: > Failed to perform 'curl': curl: (35) gnutls_handshake() failed: Error in the > push function.
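The agent output quoted in the message above follows the log line format stated in its header (severity letter, month and day, time, thread id, `file:line]`, message). As a small aid for reading such logs, the following sketch splits one line into those fields. It assumes a six-digit fractional second, as seen in the sample lines, and the function name `parse_glog_line` is made up for this example:

```python
import re

# Severity, mmdd, hh:mm:ss.uuuuuu, thread id, file:line] message.
GLOG_RE = re.compile(
    r"^(?P<severity>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<file>[^:\s]+):(?P<line>\d+)\] (?P<msg>.*)$"
)


def parse_glog_line(line):
    """Return the fields of one log line as a dict, or None for lines that
    are not log records (for example continuation output such as
    `sh: 1: hadoop: not found`)."""
    match = GLOG_RE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["thread"] = int(record["thread"])
    record["line"] = int(record["line"])
    return record
```

Filtering on `severity == "E"` and grouping by `file` is then enough to separate the harmless `hadoop version` probe failure in `shell.hpp` from the container start failure in `slave.cpp`.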
[jira] [Commented] (MESOS-7337) DefaultExecutorCheckTest.CommandCheckTimeout becomes flaky under load
[ https://issues.apache.org/jira/browse/MESOS-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953826#comment-15953826 ] Anand Mazumdar commented on MESOS-7337: --- [~bbannier] Can you attach the verbose logs? cc: [~gkleiman] > DefaultExecutorCheckTest.CommandCheckTimeout becomes flaky under load > - > > Key: MESOS-7337 > URL: https://issues.apache.org/jira/browse/MESOS-7337 > Project: Mesos > Issue Type: Bug > Components: flaky, test > Environment: Mac OS 10.12.4 (16E195), SSL debug build w/o > optimizations, clang version 5.0.0 (http://llvm.org/git/clang > c511a96ffe744933459ef64bf963629538057a90) (http://llvm.org/git/llvm > 0cd81d8a1055f167e0f588dd1b476863b00da3d5) >Reporter: Benjamin Bannier > Labels: flaky-test, mesosphere > > The test {{DefaultExecutorCheckTest.CommandCheckTimeout}} randomly fails for > me when executing tests in parallel, e.g., > {code} > [ RUN ] DefaultExecutorCheckTest.CommandCheckTimeout > ../../src/tests/check_tests.cpp:1374: Failure > Failed to wait 15secs for updateCheckResultTimeout > ../../src/tests/check_tests.cpp:1334: Failure > Actual function call count doesn't match EXPECT_CALL(*scheduler, update(_, > _))... > Expected: to be called at least 3 times >Actual: called twice - unsatisfied and active > [ FAILED ] DefaultExecutorCheckTest.CommandCheckTimeout (25351 ms) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7337) DefaultExecutorCheckTest.CommandCheckTimeout becomes flaky under load
[ https://issues.apache.org/jira/browse/MESOS-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-7337: Environment: Mac OS 10.12.4 (16E195), SSL debug build w/o optimizations, clang version 5.0.0 (http://llvm.org/git/clang c511a96ffe744933459ef64bf963629538057a90) (http://llvm.org/git/llvm 0cd81d8a1055f167e0f588dd1b476863b00da3d5) (was: Mac OS 10.12.4 (16E195), SSL debug build w/o optimizations, clang version 5.0.0 (http://llvm.org/git/clang c511a96ffe744933459ef64bf963629538057a90) (http://llvm.org/git/llvm 0cd81d8a1055f167e0f588dd1b476863b00da3d5), ) > DefaultExecutorCheckTest.CommandCheckTimeout becomes flaky under load > - > > Key: MESOS-7337 > URL: https://issues.apache.org/jira/browse/MESOS-7337 > Project: Mesos > Issue Type: Bug > Components: flaky, test > Environment: Mac OS 10.12.4 (16E195), SSL debug build w/o > optimizations, clang version 5.0.0 (http://llvm.org/git/clang > c511a96ffe744933459ef64bf963629538057a90) (http://llvm.org/git/llvm > 0cd81d8a1055f167e0f588dd1b476863b00da3d5) >Reporter: Benjamin Bannier > Labels: flaky-test, mesosphere > > The test {{DefaultExecutorCheckTest.CommandCheckTimeout}} randomly fails for > me when executing tests in parallel, e.g., > {code} > [ RUN ] DefaultExecutorCheckTest.CommandCheckTimeout > ../../src/tests/check_tests.cpp:1374: Failure > Failed to wait 15secs for updateCheckResultTimeout > ../../src/tests/check_tests.cpp:1334: Failure > Actual function call count doesn't match EXPECT_CALL(*scheduler, update(_, > _))... > Expected: to be called at least 3 times >Actual: called twice - unsatisfied and active > [ FAILED ] DefaultExecutorCheckTest.CommandCheckTimeout (25351 ms) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7337) DefaultExecutorCheckTest.CommandCheckTimeout becomes flaky under load
[ https://issues.apache.org/jira/browse/MESOS-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-7337: Environment: Mac OS 10.12.4 (16E195), SSL debug build w/o optimizations, clang version 5.0.0 (http://llvm.org/git/clang c511a96ffe744933459ef64bf963629538057a90) (http://llvm.org/git/llvm 0cd81d8a1055f167e0f588dd1b476863b00da3d5), (was: Mac OS 10.12.4 (16E195), SSL bug, debug build w/o optimizations, clang version 5.0.0 (http://llvm.org/git/clang c511a96ffe744933459ef64bf963629538057a90) (http://llvm.org/git/llvm 0cd81d8a1055f167e0f588dd1b476863b00da3d5), ) > DefaultExecutorCheckTest.CommandCheckTimeout becomes flaky under load > - > > Key: MESOS-7337 > URL: https://issues.apache.org/jira/browse/MESOS-7337 > Project: Mesos > Issue Type: Bug > Components: flaky, test > Environment: Mac OS 10.12.4 (16E195), SSL debug build w/o > optimizations, clang version 5.0.0 (http://llvm.org/git/clang > c511a96ffe744933459ef64bf963629538057a90) (http://llvm.org/git/llvm > 0cd81d8a1055f167e0f588dd1b476863b00da3d5), >Reporter: Benjamin Bannier > Labels: flaky-test, mesosphere > > The test {{DefaultExecutorCheckTest.CommandCheckTimeout}} randomly fails for > me when executing tests in parallel, e.g., > {code} > [ RUN ] DefaultExecutorCheckTest.CommandCheckTimeout > ../../src/tests/check_tests.cpp:1374: Failure > Failed to wait 15secs for updateCheckResultTimeout > ../../src/tests/check_tests.cpp:1334: Failure > Actual function call count doesn't match EXPECT_CALL(*scheduler, update(_, > _))... > Expected: to be called at least 3 times >Actual: called twice - unsatisfied and active > [ FAILED ] DefaultExecutorCheckTest.CommandCheckTimeout (25351 ms) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7337) DefaultExecutorCheckTest.CommandCheckTimeout becomes flaky under load
[ https://issues.apache.org/jira/browse/MESOS-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-7337: Environment: Mac OS 10.12.4 (16E195), SSL bug, debug build w/o optimizations, clang version 5.0.0 (http://llvm.org/git/clang c511a96ffe744933459ef64bf963629538057a90) (http://llvm.org/git/llvm 0cd81d8a1055f167e0f588dd1b476863b00da3d5), > DefaultExecutorCheckTest.CommandCheckTimeout becomes flaky under load > - > > Key: MESOS-7337 > URL: https://issues.apache.org/jira/browse/MESOS-7337 > Project: Mesos > Issue Type: Bug > Components: flaky, test > Environment: Mac OS 10.12.4 (16E195), SSL bug, debug build w/o > optimizations, clang version 5.0.0 (http://llvm.org/git/clang > c511a96ffe744933459ef64bf963629538057a90) (http://llvm.org/git/llvm > 0cd81d8a1055f167e0f588dd1b476863b00da3d5), >Reporter: Benjamin Bannier > Labels: flaky-test, mesosphere > > The test {{DefaultExecutorCheckTest.CommandCheckTimeout}} randomly fails for > me when executing tests in parallel, e.g., > {code} > [ RUN ] DefaultExecutorCheckTest.CommandCheckTimeout > ../../src/tests/check_tests.cpp:1374: Failure > Failed to wait 15secs for updateCheckResultTimeout > ../../src/tests/check_tests.cpp:1334: Failure > Actual function call count doesn't match EXPECT_CALL(*scheduler, update(_, > _))... > Expected: to be called at least 3 times >Actual: called twice - unsatisfied and active > [ FAILED ] DefaultExecutorCheckTest.CommandCheckTimeout (25351 ms) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7210) MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( pid namespace mismatch )
[ https://issues.apache.org/jira/browse/MESOS-7210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-7210: --- Affects Version/s: 1.1.1 1.2.0 Story Points: 3 Target Version/s: 1.1.2, 1.2.1, 1.3.0 Priority: Critical (was: Major) > MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( > pid namespace mismatch ) > --- > > Key: MESOS-7210 > URL: https://issues.apache.org/jira/browse/MESOS-7210 > Project: Mesos > Issue Type: Bug > Components: docker >Affects Versions: 1.1.0, 1.1.1, 1.2.0 > Environment: Ubuntu 16.04.02 > Docker version 1.13.1 > mesos 1.1.0, runs from container > docker containers spawned by marathon 1.4.1 >Reporter: Wojciech Sielski >Assignee: haosdent >Priority: Critical > > When running mesos-slave with option "docker_mesos_image" like: > {code} > --master=zk://standalone:2181/mesos --containerizers=docker,mesos > --executor_registration_timeout=5mins --hostname=standalone --ip=0.0.0.0 > --docker_stop_timeout=5secs --gc_delay=1days > --docker_socket=/var/run/docker.sock --no-systemd_enable_support > --work_dir=/tmp/mesos --docker_mesos_image=panteras/paas-in-a-box:0.4.0 > {code} > from the container that was started with option "pid: host" like: > {code} > net:host > privileged: true > pid:host > {code} > and example marathon job, that use MESOS_HTTP checks like: > {code} > { > "id": "python-example-stable", > "cmd": "python3 -m http.server 8080", > "mem": 16, > "cpus": 0.1, > "instances": 2, > "container": { >"type": "DOCKER", >"docker": { > "image": "python:alpine", > "network": "BRIDGE", > "portMappings": [ > { "containerPort": 8080, "hostPort": 0, "protocol": "tcp" } > ] >} > }, > "env": { >"SERVICE_NAME" : "python" > }, > "healthChecks": [ >{ > "path": "/", > "portIndex": 0, > "protocol": "MESOS_HTTP", > "gracePeriodSeconds": 30, > "intervalSeconds": 10, > "timeoutSeconds": 30, > "maxConsecutiveFailures": 3 >} > ] > } > {code} > I see the errors like: > {code} > F0306 
07:41:58.844293 35 health_checker.cpp:94] Failed to enter the net > namespace of task (pid: '13527'): Pid 13527 does not exist > *** Check failure stack trace: *** > @ 0x7f51770b0c1d google::LogMessage::Fail() > @ 0x7f51770b29d0 google::LogMessage::SendToLog() > @ 0x7f51770b0803 google::LogMessage::Flush() > @ 0x7f51770b33f9 google::LogMessageFatal::~LogMessageFatal() > @ 0x7f517647ce46 > _ZNSt17_Function_handlerIFivEZN5mesos8internal6health14cloneWithSetnsERKSt8functionIS0_E6OptionIiERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaISG_EEEUlvE_E9_M_invokeERKSt9_Any_data > @ 0x7f517647bf2b mesos::internal::health::cloneWithSetns() > @ 0x7f517648374b std::_Function_handler<>::_M_invoke() > @ 0x7f5177068167 process::internal::cloneChild() > @ 0x7f5177065c32 process::subprocess() > @ 0x7f5176481a9d > mesos::internal::health::HealthCheckerProcess::_httpHealthCheck() > @ 0x7f51764831f7 > mesos::internal::health::HealthCheckerProcess::_healthCheck() > @ 0x7f517701f38c process::ProcessBase::visit() > @ 0x7f517702c8b3 process::ProcessManager::resume() > @ 0x7f517702fb77 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv > @ 0x7f51754ddc80 (unknown) > @ 0x7f5174cf06ba start_thread > @ 0x7f5174a2682d (unknown) > I0306 07:41:59.077986 9 health_checker.cpp:199] Ignoring failure as > health check still in grace period > {code} > It looks like, with the docker_mesos_image option, the newly started Mesos job does > not use the "pid: host" option that the parent container was started with, but gets its > own PID namespace (so regardless of whether the parent container was started with > "pid: host", it will never be able to find the PID)
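The failure reported above ("Pid 13527 does not exist") is what one would expect when a PID recorded in one PID namespace is looked up from another. As an illustration, on Linux the namespace a process belongs to can be read from the `/proc/<pid>/ns/pid` symlink; a PID that is not visible from the current namespace has no `/proc` entry at all. The sketch below is a diagnostic aid under those assumptions, not Mesos code, and the function names are hypothetical:

```python
import os


def pid_namespace(pid, proc_root="/proc"):
    """Return the PID-namespace identifier of `pid` (a string such as
    'pid:[4026531836]'), or None if the process is not visible or its
    namespace link cannot be read from here."""
    try:
        return os.readlink(os.path.join(proc_root, str(pid), "ns", "pid"))
    except OSError:
        return None


def same_pid_namespace(pid_a, pid_b, proc_root="/proc"):
    """True only if both processes are visible and share one PID namespace."""
    ns_a = pid_namespace(pid_a, proc_root)
    ns_b = pid_namespace(pid_b, proc_root)
    return ns_a is not None and ns_a == ns_b
```

Run from inside the executor container, `pid_namespace(13527)` returning `None` while the same call on the host returns a namespace identifier would be consistent with the mismatch described in the issue. Reading another process's namespace link may require elevated privileges.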
[jira] [Commented] (MESOS-7210) MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( pid namespace mismatch )
[ https://issues.apache.org/jira/browse/MESOS-7210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953242#comment-15953242 ] Alexander Rukletsov commented on MESOS-7210: [~xds2000], [~haosd...@gmail.com] Let's fix it and backport. > MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( > pid namespace mismatch ) > --- > > Key: MESOS-7210 > URL: https://issues.apache.org/jira/browse/MESOS-7210 > Project: Mesos > Issue Type: Bug > Components: docker > Affects Versions: 1.1.0, 1.1.1, 1.2.0 > Environment: Ubuntu 16.04.02 > Docker version 1.13.1 > mesos 1.1.0, runs from container > docker containers spawned by marathon 1.4.1 > Reporter: Wojciech Sielski > Assignee: haosdent > Priority: Critical > > When running mesos-slave with option "docker_mesos_image" like: > {code} > --master=zk://standalone:2181/mesos --containerizers=docker,mesos > --executor_registration_timeout=5mins --hostname=standalone --ip=0.0.0.0 > --docker_stop_timeout=5secs --gc_delay=1days > --docker_socket=/var/run/docker.sock --no-systemd_enable_support > --work_dir=/tmp/mesos --docker_mesos_image=panteras/paas-in-a-box:0.4.0 > {code} > from the container that was started with option "pid: host" like: > {code} > net:host > privileged: true > pid:host > {code} > and example marathon job, that use MESOS_HTTP checks like: > {code} > { > "id": "python-example-stable", > "cmd": "python3 -m http.server 8080", > "mem": 16, > "cpus": 0.1, > "instances": 2, > "container": { > "type": "DOCKER", > "docker": { > "image": "python:alpine", > "network": "BRIDGE", > "portMappings": [ > { "containerPort": 8080, "hostPort": 0, "protocol": "tcp" } > ] > } > }, > "env": { > "SERVICE_NAME" : "python" > }, > "healthChecks": [ > { > "path": "/", > "portIndex": 0, > "protocol": "MESOS_HTTP", > "gracePeriodSeconds": 30, > "intervalSeconds": 10, > "timeoutSeconds": 30, > "maxConsecutiveFailures": 3 > } > ] > } > {code} > I see the errors like: > {code} > F0306 07:41:58.844293 35 health_checker.cpp:94] Failed to enter the net > namespace of task (pid: '13527'): Pid 13527 does not exist > *** Check failure stack trace: *** > @ 0x7f51770b0c1d google::LogMessage::Fail() > @ 0x7f51770b29d0 google::LogMessage::SendToLog() > @ 0x7f51770b0803 google::LogMessage::Flush() > @ 0x7f51770b33f9 google::LogMessageFatal::~LogMessageFatal() > @ 0x7f517647ce46 > _ZNSt17_Function_handlerIFivEZN5mesos8internal6health14cloneWithSetnsERKSt8functionIS0_E6OptionIiERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaISG_EEEUlvE_E9_M_invokeERKSt9_Any_data > @ 0x7f517647bf2b mesos::internal::health::cloneWithSetns() > @ 0x7f517648374b std::_Function_handler<>::_M_invoke() > @ 0x7f5177068167 process::internal::cloneChild() > @ 0x7f5177065c32 process::subprocess() > @ 0x7f5176481a9d > mesos::internal::health::HealthCheckerProcess::_httpHealthCheck() > @ 0x7f51764831f7 > mesos::internal::health::HealthCheckerProcess::_healthCheck() > @ 0x7f517701f38c process::ProcessBase::visit() > @ 0x7f517702c8b3 process::ProcessManager::resume() > @ 0x7f517702fb77 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv > @ 0x7f51754ddc80 (unknown) > @ 0x7f5174cf06ba start_thread > @ 0x7f5174a2682d (unknown) > I0306 07:41:59.077986 9 health_checker.cpp:199] Ignoring failure as > health check still in grace period > {code} > It looks like, with the docker_mesos_image option, the newly started Mesos job does > not use the "pid: host" option that the parent container was started with, but gets its > own PID namespace (so regardless of whether the parent container was started with > "pid: host", it will never be able to find the PID)
[jira] [Created] (MESOS-7337) DefaultExecutorCheckTest.CommandCheckTimeout becomes flaky under load
Benjamin Bannier created MESOS-7337: --- Summary: DefaultExecutorCheckTest.CommandCheckTimeout becomes flaky under load Key: MESOS-7337 URL: https://issues.apache.org/jira/browse/MESOS-7337 Project: Mesos Issue Type: Bug Components: flaky, test Reporter: Benjamin Bannier The test {{DefaultExecutorCheckTest.CommandCheckTimeout}} randomly fails for me when executing tests in parallel, e.g., {code} [ RUN ] DefaultExecutorCheckTest.CommandCheckTimeout ../../src/tests/check_tests.cpp:1374: Failure Failed to wait 15secs for updateCheckResultTimeout ../../src/tests/check_tests.cpp:1334: Failure Actual function call count doesn't match EXPECT_CALL(*scheduler, update(_, _))... Expected: to be called at least 3 times Actual: called twice - unsatisfied and active [ FAILED ] DefaultExecutorCheckTest.CommandCheckTimeout (25351 ms) {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-7269) Migrate setting in config.py to a TOML file
[ https://issues.apache.org/jira/browse/MESOS-7269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953131#comment-15953131 ] Armand Grillet commented on MESOS-7269: --- https://reviews.apache.org/r/57951/ > Migrate setting in config.py to a TOML file > --- > > Key: MESOS-7269 > URL: https://issues.apache.org/jira/browse/MESOS-7269 > Project: Mesos > Issue Type: Task > Components: cli >Reporter: Avinash Sridharan >Assignee: Armand Grillet > > We want Mesos CLI configuration to be given as a TOML file by the user. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-7210) MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( pid namespace mismatch )
[ https://issues.apache.org/jira/browse/MESOS-7210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953071#comment-15953071 ] Wojciech Sielski commented on MESOS-7210: - [~xds2000] Exactly, the mesos-slave (container) and the docker executor (container) need to run in the same pid pool (host). > MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( > pid namespace mismatch ) > --- > > Key: MESOS-7210 > URL: https://issues.apache.org/jira/browse/MESOS-7210 > Project: Mesos > Issue Type: Bug > Components: docker > Affects Versions: 1.1.0 > Environment: Ubuntu 16.04.02 > Docker version 1.13.1 > mesos 1.1.0, runs from container > docker containers spawned by marathon 1.4.1 > Reporter: Wojciech Sielski > Assignee: haosdent > > When running mesos-slave with option "docker_mesos_image" like: > {code} > --master=zk://standalone:2181/mesos --containerizers=docker,mesos > --executor_registration_timeout=5mins --hostname=standalone --ip=0.0.0.0 > --docker_stop_timeout=5secs --gc_delay=1days > --docker_socket=/var/run/docker.sock --no-systemd_enable_support > --work_dir=/tmp/mesos --docker_mesos_image=panteras/paas-in-a-box:0.4.0 > {code} > from the container that was started with option "pid: host" like: > {code} > net:host > privileged: true > pid:host > {code} > and example marathon job, that use MESOS_HTTP checks like: > {code} > { > "id": "python-example-stable", > "cmd": "python3 -m http.server 8080", > "mem": 16, > "cpus": 0.1, > "instances": 2, > "container": { > "type": "DOCKER", > "docker": { > "image": "python:alpine", > "network": "BRIDGE", > "portMappings": [ > { "containerPort": 8080, "hostPort": 0, "protocol": "tcp" } > ] > } > }, > "env": { > "SERVICE_NAME" : "python" > }, > "healthChecks": [ > { > "path": "/", > "portIndex": 0, > "protocol": "MESOS_HTTP", > "gracePeriodSeconds": 30, > "intervalSeconds": 10, > "timeoutSeconds": 30, > "maxConsecutiveFailures": 3 > } > ] > } > {code} > I see the errors like: > {code} > F0306 07:41:58.844293 35 health_checker.cpp:94] Failed to enter the net > namespace of task (pid: '13527'): Pid 13527 does not exist > *** Check failure stack trace: *** > @ 0x7f51770b0c1d google::LogMessage::Fail() > @ 0x7f51770b29d0 google::LogMessage::SendToLog() > @ 0x7f51770b0803 google::LogMessage::Flush() > @ 0x7f51770b33f9 google::LogMessageFatal::~LogMessageFatal() > @ 0x7f517647ce46 > _ZNSt17_Function_handlerIFivEZN5mesos8internal6health14cloneWithSetnsERKSt8functionIS0_E6OptionIiERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaISG_EEEUlvE_E9_M_invokeERKSt9_Any_data > @ 0x7f517647bf2b mesos::internal::health::cloneWithSetns() > @ 0x7f517648374b std::_Function_handler<>::_M_invoke() > @ 0x7f5177068167 process::internal::cloneChild() > @ 0x7f5177065c32 process::subprocess() > @ 0x7f5176481a9d > mesos::internal::health::HealthCheckerProcess::_httpHealthCheck() > @ 0x7f51764831f7 > mesos::internal::health::HealthCheckerProcess::_healthCheck() > @ 0x7f517701f38c process::ProcessBase::visit() > @ 0x7f517702c8b3 process::ProcessManager::resume() > @ 0x7f517702fb77 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv > @ 0x7f51754ddc80 (unknown) > @ 0x7f5174cf06ba start_thread > @ 0x7f5174a2682d (unknown) > I0306 07:41:59.077986 9 health_checker.cpp:199] Ignoring failure as > health check still in grace period > {code} > It looks like, with the docker_mesos_image option, the newly started Mesos job does > not use the "pid: host" option that the parent container was started with, but gets its > own PID namespace (so regardless of whether the parent container was started with > "pid: host", it will never be able to find the PID)
[jira] [Commented] (MESOS-3545) Investigate restoring tasks/executors after machine reboot.
[ https://issues.apache.org/jira/browse/MESOS-3545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953046#comment-15953046 ] Yan Xu commented on MESOS-3545: --- [~xds2000] It's being worked on, but the first patch will probably be weeks out as I am on leave right now and we have to settle some design issues. > Investigate restoring tasks/executors after machine reboot. > --- > > Key: MESOS-3545 > URL: https://issues.apache.org/jira/browse/MESOS-3545 > Project: Mesos > Issue Type: Epic > Components: agent > Reporter: Benjamin Hindman > Assignee: Megha Sharma > > If a task/executor is restartable (see MESOS-3544) it might make sense to > force an agent to restart these tasks/executors _before_ after a machine > reboot in the event that the machine is network partitioned away from the > master (or the master has failed) but we'd like to get these services running > again. Assuming the agent(s) running on the machine has not been disconnected > from the master for longer than the master's agent re-registration timeout, > the agent should be able to re-register (i.e., after a network partition is > resolved) without a problem. However, in the same way that a framework would > be interested in knowing that its tasks/executors were restarted, we'd want > to send something like a TASK_RESTARTED status update.