[jira] [Assigned] (MESOS-7029) FaultToleranceTest.FrameworkReregister is flaky
[ https://issues.apache.org/jira/browse/MESOS-7029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jay Guo reassigned MESOS-7029:
------------------------------

    Assignee: Jay Guo

> FaultToleranceTest.FrameworkReregister is flaky
> -----------------------------------------------
>
>                 Key: MESOS-7029
>                 URL: https://issues.apache.org/jira/browse/MESOS-7029
>             Project: Mesos
>          Issue Type: Bug
>          Components: test, tests
>         Environment: ASF CI, cmake, gcc, Ubuntu 14.04, libevent/SSL enabled
>            Reporter: Greg Mann
>            Assignee: Jay Guo
>              Labels: flaky, flaky-test
>         Attachments: FaultToleranceTest.FrameworkReregister.txt
>
> This was observed on ASF CI:
> {code}
> /mesos/src/tests/fault_tolerance_tests.cpp:903: Failure
> The difference between registerTime.secs() and
> framework.values["registered_time"].as().as() is
> 1.0100052356719971, which exceeds 1, where
> registerTime.secs() evaluates to 1485732879.7673652,
> framework.values["registered_time"].as().as() evaluates
> to 1485732878.75736, and
> 1 evaluates to 1.
> {code}
> Find the full log attached.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
[jira] [Commented] (MESOS-7029) FaultToleranceTest.FrameworkReregister is flaky
[ https://issues.apache.org/jira/browse/MESOS-7029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898840#comment-15898840 ]

Jay Guo commented on MESOS-7029:
--------------------------------

RR: https://reviews.apache.org/r/57364/
[jira] [Commented] (MESOS-7029) FaultToleranceTest.FrameworkReregister is flaky
[ https://issues.apache.org/jira/browse/MESOS-7029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898835#comment-15898835 ]

Jay Guo commented on MESOS-7029:
--------------------------------

[~neilc] I think it is due to the intentional delays here:
https://github.com/apache/mesos/blob/master/src/tests/fault_tolerance_tests.cpp#L824-L826
whose sum may exceed 1 second.
[jira] [Commented] (MESOS-7209) Mesos failed to build due to error MSB6006: "cmd.exe" exited with code 255 on windows
[ https://issues.apache.org/jira/browse/MESOS-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898800#comment-15898800 ]

Karen Huang commented on MESOS-7209:
------------------------------------

Hi Joseph,

The code in the CMake file "CompilationConfigure.cmake" is as below:
{code}
ADD_CUSTOM_TARGET(
  ${ENSURE_TOOL_ARCH} ALL
  COMMAND IF NOT "%PreferredToolArchitecture%"=="x64" (
    echo "ERROR: Environment variable 'PreferredToolArchitecture' must be set to 'x64', see MESOS-6720 for details" 1>&2 && EXIT 1 )
  )
{code}
But after we generated the project file ensure_tool_arch.vcxproj using CMake, there are no quotes around the variable %PreferredToolArchitecture% in the project file. It seems that "%PreferredToolArchitecture%"=="x64" is converted to %PreferredToolArchitecture%=="x64". I tried changing the CMake file as below, and it works.
From:
{code}
ADD_CUSTOM_TARGET(
  ${ENSURE_TOOL_ARCH} ALL
  COMMAND IF NOT "%PreferredToolArchitecture%"=="x64" (
    echo "ERROR: Environment variable 'PreferredToolArchitecture' must be set to 'x64', see MESOS-6720 for details" 1>&2 && EXIT 1 )
  )
{code}
changed to:
{code}
ADD_CUSTOM_TARGET(
  ${ENSURE_TOOL_ARCH} ALL
  COMMAND IF NOT "'%PreferredToolArchitecture%'"=='x64' (
    echo "ERROR: Environment variable 'PreferredToolArchitecture' must be set to 'x64', see MESOS-6720 for details" 1>&2 && EXIT 1 )
  )
{code}

> Mesos failed to build due to error MSB6006: "cmd.exe" exited with code 255 on
> windows
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-7209
>                 URL: https://issues.apache.org/jira/browse/MESOS-7209
>             Project: Mesos
>          Issue Type: Bug
>         Environment: Windows 10 (64bit) + VS2015 Update 3
>            Reporter: Karen Huang
>
> I tried to build Mesos with the Debug|x64 configuration on Windows. It failed
> to build due to error MSB6006: "cmd.exe" exited with code 255
> [F:\mesos\build_x64\ensure_tool_arch.vcxproj]. This error is reported when
> building the ensure_tool_arch.vcxproj project.
> Repro steps:
> 1. git clone -c core.autocrlf=true https://github.com/apache/mesos F:\mesos\src
> 2. Open a VS amd64 command prompt as admin and browse to F:\mesos\src
> 3. set PreferredToolArchitecture=x64
> 4. bootstrap.bat
> 5. mkdir build_x64 && pushd build_x64
> 6. cmake ..\src -G "Visual Studio 14 2015 Win64" -DENABLE_LIBEVENT=1 -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="C:\gnuwin32\bin"
> 7. msbuild Mesos.sln /p:Configuration=Debug /p:Platform=x64 /m /t:Rebuild
> Error message:
> CustomBuild:
>   Building Custom Rule F:/mesos/src/CMakeLists.txt
>   CMake does not need to re-run because F:\mesos\build_x64\CMakeFiles\generate.stamp is up-to-date.
>   ( was unexpected at this time.
> 43>C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\V140\Microsoft.CppCommon.targets(171,5): error MSB6006: "cmd.exe" exited with code 255. [F:\mesos\build_x64\ensure_tool_arch.vcxproj]
> If you build the project ensure_tool_arch.vcxproj separately in the VS IDE, the error info is as below:
> 2>------ Rebuild All started: Project: ensure_tool_arch, Configuration: Debug x64 ------
> 2>  Building Custom Rule D:/Mesos/src/CMakeLists.txt
> 2>  CMake does not need to re-run because D:\Mesos\build_x64\CMakeFiles\generate.stamp is up-to-date.
> 2>  ( was unexpected at this time.
> 2>C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\V140\Microsoft.CppCommon.targets(171,5): error MSB6006: "cmd.exe" exited with code 255.
[jira] [Assigned] (MESOS-7149) Support reservations for role subtrees
[ https://issues.apache.org/jira/browse/MESOS-7149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jay Guo reassigned MESOS-7149:
------------------------------

    Assignee: Jay Guo

> Support reservations for role subtrees
> --------------------------------------
>
>                 Key: MESOS-7149
>                 URL: https://issues.apache.org/jira/browse/MESOS-7149
>             Project: Mesos
>          Issue Type: Task
>          Components: master
>            Reporter: Neil Conway
>            Assignee: Jay Guo
>              Labels: mesosphere
>
> When a reservation is made for a role path {{x}}, the reserved resource
> should be offered to all frameworks registered in {{x}} _or any nested role
> in the sub-tree under x_. For example, if a reservation is made for {{eng}},
> the reserved resource should be a candidate to appear in resource offers to
> frameworks in any of the roles {{eng}}, {{eng/dev}}, and {{eng/prod}}.
[jira] [Comment Edited] (MESOS-7195) Use C++11 variadic templates for process::dispatch/defer/delay/async/run
[ https://issues.apache.org/jira/browse/MESOS-7195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898743#comment-15898743 ]

Michael Park edited comment on MESOS-7195 at 3/7/17 4:50 AM:
-------------------------------------------------------------

[~xujyan] Here's a small example that captures the limitation of variadic templates in this context:

{code}
struct S
{
  void f(int) const {}
  void f(int, int) const {}
};

template <typename R, typename T, typename P, typename A>
void macro(R (T::*)(P) const, const T&, A) {}

template <typename R, typename T, typename P1, typename P2,
          typename A1, typename A2>
void macro(R (T::*)(P1, P2) const, const T&, A1, A2) {}

template <typename R, typename T, typename... Ps, typename... As>
void variadic(R (T::*)(Ps...) const, const T&, As...) {}

int main()
{
  S s;
  macro(&S::f, s, 42);        // selects `void S::f(int) const`
  macro(&S::f, s, 101, 202);  // selects `void S::f(int, int) const`
  // variadic(&S::f, s, 42);        // error.
  // variadic(&S::f, s, 101, 202);  // error.
}
{code}

We have situations where there are overloaded member functions, and we happen to use the # of arguments provided to narrow down the # of parameters we need to match. The same trick doesn't work for variadic templates, since the parameters and arguments are both free-form. As far as I know, there's no way to express the same with variadic templates.

The macro form, of course, isn't good enough anyway, since it wouldn't work if {{f}} were overloaded with different types and the same # of parameters, but we haven't run into that just yet.

By API changes, I mean that to make {{variadic}} work, we'll need to require the user to pass something like:

{code}
variadic([](const S& s, auto... args) { s.f(args...); }, s, 101, 202);
{code}

This is not as generic as it needs to be, since it'll only call {{const}} functions. To get the cv/ref qualifiers correct, we'd have to provide the proper overloads, and maybe try to "hide" it with a macro... but it gets ugly...

{code}
variadic(MEM_FN(S, f), s, 101, 202);
{code}

where {{MEM_FN}} produces an overloaded function object. Here's a rough sketch of how this could look: http://melpon.org/wandbox/permlink/BO8mf7r0CVr3akbu Note that the sketch is written in C++14.
was (Author: mcypark):
[~xujyan] Here's a small example that captures the limitation of variadic templates in this context:

{code}
struct S
{
  void f(int) {}
  void f(int, int) {}
};

template <typename R, typename T, typename P, typename A>
void macro(R (T::*)(P), A) {}

template <typename R, typename T, typename P1, typename P2,
          typename A1, typename A2>
void macro(R (T::*)(P1, P2), A1, A2) {}

template <typename R, typename T, typename... Ps, typename... As>
void variadic(R (T::*)(Ps...), As...) {}

int main()
{
  macro(&S::f, 42);        // selects `void S::f(int)`
  macro(&S::f, 101, 202);  // selects `void S::f(int, int)`
  // variadic(&S::f, 42);        // error.
  // variadic(&S::f, 101, 202);  // error.
}
{code}

We have situations where there are overloaded member functions, and we happen to use the # of arguments provided to narrow down the # of parameters we need to match. The same trick doesn't work for variadic templates, since the parameters and arguments are both free-form. As far as I know, there's no way to express the same with variadic templates.

The macro form, of course, isn't good enough anyway, since it wouldn't work if {{f}} were overloaded with different types and the same # of parameters, but we haven't run into that just yet.

By API changes, I mean that to make {{variadic}} work, we'll need to require the user to pass something like:

{code}
variadic([](const S& s, auto... args) { s.f(args...); }, 101, 202);
{code}

This is not as generic as it needs to be, since it'll only call {{const}} functions. To get the cv/ref qualifiers correct, we'd have to provide the proper overloads, and maybe try to "hide" it with a macro... but it gets ugly...

{code}
variadic(MEM_FN(S, f), 101, 202);
{code}

where {{MEM_FN}} produces an overloaded function object. Here's a rough sketch of how this could look: http://melpon.org/wandbox/permlink/BO8mf7r0CVr3akbu Note that the sketch is written in C++14.
> Use C++11 variadic templates for process::dispatch/defer/delay/async/run
> ------------------------------------------------------------------------
>
>                 Key: MESOS-7195
>                 URL: https://issues.apache.org/jira/browse/MESOS-7195
>             Project: Mesos
>          Issue Type: Improvement
>          Components: libprocess
>            Reporter: Yan Xu
>
> These methods are currently implemented using {{REPEAT_FROM_TO}} (i.e.,
> {{BOOST_PP_REPEAT_FROM_TO}}):
> {code:title=}
> REPEAT_FROM_TO(1, 11, TEMPLATE, _) // Args A0 -> A9.
> {code}
> This means we have to bump up the number of repetitions whenever we have a
> new method with more args.
> Seems like we can replace this with C++11 variadic templates.
[jira] [Commented] (MESOS-7195) Use C++11 variadic templates for process::dispatch/defer/delay/async/run
[ https://issues.apache.org/jira/browse/MESOS-7195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898743#comment-15898743 ]

Michael Park commented on MESOS-7195:
-------------------------------------

[~xujyan] Here's a small example that captures the limitation of variadic templates in this context:

{code}
struct S
{
  void f(int) {}
  void f(int, int) {}
};

template <typename R, typename T, typename P, typename A>
void macro(R (T::*)(P), A) {}

template <typename R, typename T, typename P1, typename P2,
          typename A1, typename A2>
void macro(R (T::*)(P1, P2), A1, A2) {}

template <typename R, typename T, typename... Ps, typename... As>
void variadic(R (T::*)(Ps...), As...) {}

int main()
{
  macro(&S::f, 42);        // selects `void S::f(int)`
  macro(&S::f, 101, 202);  // selects `void S::f(int, int)`
  // variadic(&S::f, 42);        // error.
  // variadic(&S::f, 101, 202);  // error.
}
{code}

We have situations where there are overloaded member functions, and we happen to use the # of arguments provided to narrow down the # of parameters we need to match. The same trick doesn't work for variadic templates, since the parameters and arguments are both free-form. As far as I know, there's no way to express the same with variadic templates.

The macro form, of course, isn't good enough anyway, since it wouldn't work if {{f}} were overloaded with different types and the same # of parameters, but we haven't run into that just yet.

By API changes, I mean that to make {{variadic}} work, we'll need to require the user to pass something like:

{code}
variadic([](const S& s, auto... args) { s.f(args...); }, 101, 202);
{code}

This is not as generic as it needs to be, since it'll only call {{const}} functions. To get the cv/ref qualifiers correct, we'd have to provide the proper overloads, and maybe try to "hide" it with a macro... but it gets ugly...

{code}
variadic(MEM_FN(S, f), 101, 202);
{code}

where {{MEM_FN}} produces an overloaded function object. Here's a rough sketch of how this could look: http://melpon.org/wandbox/permlink/BO8mf7r0CVr3akbu Note that the sketch is written in C++14.
[jira] [Commented] (MESOS-7215) Master sends ShutdownFrameworkMessage for all non-partition-aware frameworks
[ https://issues.apache.org/jira/browse/MESOS-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898721#comment-15898721 ]

Vinod Kone commented on MESOS-7215:
-----------------------------------

Not sure if [~neilc] has cycles. [~xujyan] is this something you can take up?

> Master sends ShutdownFrameworkMessage for all non-partition-aware frameworks
> ----------------------------------------------------------------------------
>
>                 Key: MESOS-7215
>                 URL: https://issues.apache.org/jira/browse/MESOS-7215
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Yan Xu
>            Priority: Critical
>
> Prior to the partition-awareness work MESOS-5344, upon agent reregistration
> after the agent had been removed, the master only sent
> ShutdownFrameworkMessages to the agent for frameworks that it knew had been
> torn down.
> With the new logic in MESOS-5344, Mesos is now sending
> {{ShutdownFrameworkMessages}} to the agent for all non-partition-aware
> frameworks (including the ones that are still registered).
> This is problematic. The offer from this agent can still go to the same
> framework, which can then launch new tasks. The agent then receives tasks of
> the same framework and ignores them because it thinks the framework is
> shutting down. The framework is not shutting down, of course, so from the
> master's and the scheduler's perspective the task is pending in STAGING
> forever until the next agent reregistration, which could happen much later.
> This also makes the semantics of {{ShutdownFrameworkMessage}} ambiguous: the
> agent assumes the framework is going away (and acts accordingly) when it's
> not.
[jira] [Comment Edited] (MESOS-7215) Master sends ShutdownFrameworkMessage for all non-partition-aware frameworks
[ https://issues.apache.org/jira/browse/MESOS-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898716#comment-15898716 ]

Avinash Sridharan edited comment on MESOS-7215 at 3/7/17 4:16 AM:
------------------------------------------------------------------

[~vi...@twitter.com] whom should this ticket be assigned to? [~neilc]

was (Author: avin...@mesosphere.io):
[~vi...@twitter.com] whom should this ticket be assigned to?
[jira] [Commented] (MESOS-7215) Master sends ShutdownFrameworkMessage for all non-partition-aware frameworks
[ https://issues.apache.org/jira/browse/MESOS-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898716#comment-15898716 ]

Avinash Sridharan commented on MESOS-7215:
------------------------------------------

[~vi...@twitter.com] whom should this ticket be assigned to?
[jira] [Comment Edited] (MESOS-7210) MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( pid namespace mismatch )
[ https://issues.apache.org/jira/browse/MESOS-7210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898712#comment-15898712 ]

Avinash Sridharan edited comment on MESOS-7210 at 3/7/17 4:13 AM:
------------------------------------------------------------------

[~alexr] ^^ [~gkleiman]

was (Author: avin...@mesosphere.io):
[~alexr] ^^ @gaston kleiman

> MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image
> ( pid namespace mismatch )
> ------------------------------------------------------------------------
>
>                 Key: MESOS-7210
>                 URL: https://issues.apache.org/jira/browse/MESOS-7210
>             Project: Mesos
>          Issue Type: Bug
>          Components: docker
>    Affects Versions: 1.1.0
>         Environment: Ubuntu 16.04.02
>                      Docker version 1.13.1
>                      mesos 1.1.0, runs from container
>                      docker containers spawned by marathon 1.4.1
>            Reporter: Wojciech Sielski
>            Assignee: Gastón Kleiman
>
> When running mesos-slave with the "docker_mesos_image" option like:
> {code}
> --master=zk://standalone:2181/mesos --containerizers=docker,mesos
> --executor_registration_timeout=5mins --hostname=standalone --ip=0.0.0.0
> --docker_stop_timeout=5secs --gc_delay=1days
> --docker_socket=/var/run/docker.sock --no-systemd_enable_support
> --work_dir=/tmp/mesos --docker_mesos_image=panteras/paas-in-a-box:0.4.0
> {code}
> from a container that was started with the "pid: host" option like:
> {code}
> net: host
> privileged: true
> pid: host
> {code}
> and an example marathon job that uses MESOS_HTTP checks like:
> {code}
> {
>   "id": "python-example-stable",
>   "cmd": "python3 -m http.server 8080",
>   "mem": 16,
>   "cpus": 0.1,
>   "instances": 2,
>   "container": {
>     "type": "DOCKER",
>     "docker": {
>       "image": "python:alpine",
>       "network": "BRIDGE",
>       "portMappings": [
>         { "containerPort": 8080, "hostPort": 0, "protocol": "tcp" }
>       ]
>     }
>   },
>   "env": {
>     "SERVICE_NAME": "python"
>   },
>   "healthChecks": [
>     {
>       "path": "/",
>       "portIndex": 0,
>       "protocol": "MESOS_HTTP",
>       "gracePeriodSeconds": 30,
>       "intervalSeconds": 10,
>       "timeoutSeconds": 30,
>       "maxConsecutiveFailures": 3
>     }
>   ]
> }
> {code}
> I see errors like:
> {code}
> F0306 07:41:58.844293 35 health_checker.cpp:94] Failed to enter the net
> namespace of task (pid: '13527'): Pid 13527 does not exist
> *** Check failure stack trace: ***
> @ 0x7f51770b0c1d google::LogMessage::Fail()
> @ 0x7f51770b29d0 google::LogMessage::SendToLog()
> @ 0x7f51770b0803 google::LogMessage::Flush()
> @ 0x7f51770b33f9 google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f517647ce46 _ZNSt17_Function_handlerIFivEZN5mesos8internal6health14cloneWithSetnsERKSt8functionIS0_E6OptionIiERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaISG_EEEUlvE_E9_M_invokeERKSt9_Any_data
> @ 0x7f517647bf2b mesos::internal::health::cloneWithSetns()
> @ 0x7f517648374b std::_Function_handler<>::_M_invoke()
> @ 0x7f5177068167 process::internal::cloneChild()
> @ 0x7f5177065c32 process::subprocess()
> @ 0x7f5176481a9d mesos::internal::health::HealthCheckerProcess::_httpHealthCheck()
> @ 0x7f51764831f7 mesos::internal::health::HealthCheckerProcess::_healthCheck()
> @ 0x7f517701f38c process::ProcessBase::visit()
> @ 0x7f517702c8b3 process::ProcessManager::resume()
> @ 0x7f517702fb77 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7f51754ddc80 (unknown)
> @ 0x7f5174cf06ba start_thread
> @ 0x7f5174a2682d (unknown)
> I0306 07:41:59.077986 9 health_checker.cpp:199] Ignoring failure as
> health check still in grace period
> {code}
> It looks like the docker_mesos_image option causes the newly started Mesos
> task container not to use the same "pid: host" option that the parent
> container was started with; instead it gets its own PID namespace (so no
> matter whether the parent container was started with "pid: host", it will
> never be able to find the PID).
[jira] [Assigned] (MESOS-7210) MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( pid namespace mismatch )
[ https://issues.apache.org/jira/browse/MESOS-7210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Avinash Sridharan reassigned MESOS-7210:
----------------------------------------

    Assignee: Gastón Kleiman
[jira] [Comment Edited] (MESOS-7210) MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( pid namespace mismatch )
[ https://issues.apache.org/jira/browse/MESOS-7210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898712#comment-15898712 ]

Avinash Sridharan edited comment on MESOS-7210 at 3/7/17 4:13 AM:
------------------------------------------------------------------

[~alexr] ^^ @gaston kleiman

was (Author: avin...@mesosphere.io):
[~alexr] ^^
[jira] [Commented] (MESOS-7210) MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( pid namespace mismatch )
[ https://issues.apache.org/jira/browse/MESOS-7210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898712#comment-15898712 ] Avinash Sridharan commented on MESOS-7210: -- [~alexr] ^^ > MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( > pid namespace mismatch ) > --- > > Key: MESOS-7210 > URL: https://issues.apache.org/jira/browse/MESOS-7210 > Project: Mesos > Issue Type: Bug > Components: docker >Affects Versions: 1.1.0 > Environment: Ubuntu 16.04.02 > Docker version 1.13.1 > mesos 1.1.0, runs from container > docker containers spawned by marathon 1.4.1 >Reporter: Wojciech Sielski > > When running mesos-slave with option "docker_mesos_image" like: > {code} > --master=zk://standalone:2181/mesos --containerizers=docker,mesos > --executor_registration_timeout=5mins --hostname=standalone --ip=0.0.0.0 > --docker_stop_timeout=5secs --gc_delay=1days > --docker_socket=/var/run/docker.sock --no-systemd_enable_support > --work_dir=/tmp/mesos --docker_mesos_image=panteras/paas-in-a-box:0.4.0 > {code} > from the container that was started with option "pid: host" like: > {code} > net:host > privileged: true > pid:host > {code} > and example marathon job, that use MESOS_HTTP checks like: > {code} > { > "id": "python-example-stable", > "cmd": "python3 -m http.server 8080", > "mem": 16, > "cpus": 0.1, > "instances": 2, > "container": { >"type": "DOCKER", >"docker": { > "image": "python:alpine", > "network": "BRIDGE", > "portMappings": [ > { "containerPort": 8080, "hostPort": 0, "protocol": "tcp" } > ] >} > }, > "env": { >"SERVICE_NAME" : "python" > }, > "healthChecks": [ >{ > "path": "/", > "portIndex": 0, > "protocol": "MESOS_HTTP", > "gracePeriodSeconds": 30, > "intervalSeconds": 10, > "timeoutSeconds": 30, > "maxConsecutiveFailures": 3 >} > ] > } > {code} > I see the errors like: > {code} > F0306 07:41:58.84429335 health_checker.cpp:94] Failed to enter the net > namespace of task (pid: '13527'): Pid 13527 does not exist > *** 
Check failure stack trace: *** > @ 0x7f51770b0c1d google::LogMessage::Fail() > @ 0x7f51770b29d0 google::LogMessage::SendToLog() > @ 0x7f51770b0803 google::LogMessage::Flush() > @ 0x7f51770b33f9 google::LogMessageFatal::~LogMessageFatal() > @ 0x7f517647ce46 > _ZNSt17_Function_handlerIFivEZN5mesos8internal6health14cloneWithSetnsERKSt8functionIS0_E6OptionIiERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaISG_EEEUlvE_E9_M_invokeERKSt9_Any_data > @ 0x7f517647bf2b mesos::internal::health::cloneWithSetns() > @ 0x7f517648374b std::_Function_handler<>::_M_invoke() > @ 0x7f5177068167 process::internal::cloneChild() > @ 0x7f5177065c32 process::subprocess() > @ 0x7f5176481a9d > mesos::internal::health::HealthCheckerProcess::_httpHealthCheck() > @ 0x7f51764831f7 > mesos::internal::health::HealthCheckerProcess::_healthCheck() > @ 0x7f517701f38c process::ProcessBase::visit() > @ 0x7f517702c8b3 process::ProcessManager::resume() > @ 0x7f517702fb77 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv > @ 0x7f51754ddc80 (unknown) > @ 0x7f5174cf06ba start_thread > @ 0x7f5174a2682d (unknown) > I0306 07:41:59.077986 9 health_checker.cpp:199] Ignoring failure as > health check still in grace period > {code} > It looks like the docker_mesos_image option causes the newly started Mesos > executor container to stop sharing the host PID namespace, even though its parent > container was started with "pid: host"; it gets its own PID namespace instead, so > regardless of whether the parent container used "pid: host", the health checker > will never be able to find the task's PID. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Comment Edited] (MESOS-6480) Support for docker live-restore option in Mesos
[ https://issues.apache.org/jira/browse/MESOS-6480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898649#comment-15898649 ] haosdent edited comment on MESOS-6480 at 3/7/17 2:59 AM: - As checked, all docker commands fail after {{service docker stop}} when {{--live-restore}} is in use, including {{docker logs}}, no matter which log-driver we use. After chatting with [~jieyu], the possible ways to resolve this are: 1. * {{docker run -d}} to start the program. * {{docker logs --since xxx --follow}} to read the log. * If {{docker logs}} fails, check whether {{/proc/$taskPid}} exists; if the task process still exists, keep retrying {{docker logs}} until {{/proc/$taskPid}} disappears or {{docker logs}} succeeds again. The problem with this approach is that it is a bit tricky to find the right timestamp parameter for {{docker logs --since}}, and some logs may be missed. 2. * Read {{/run/docker/libcontainerd/$container_id/init-stdout}} and {{/run/docker/libcontainerd/$container_id/init-stderr}} directly. This is tricky as well, because it depends on the implementation of Docker across different versions, and it doesn't allow multiple consumers: if we read these files directly, other consumers of {{docker logs}} would not see the log we got from them. In short, I think we don't have a perfect solution for this problem unless we allow some log loss. was (Author: haosd...@gmail.com): As checked, all docker commands fail after {{service docker stop}} when {{--live-restore}} is in use, including {{docker logs}}, no matter which log-driver we use. After chatting with Jie Yu, the possible ways to resolve this are: 1. * {{docker run -d}} to start the program. * {{docker logs --since xxx --follow}} to read the log. * If {{docker logs}} fails, check whether {{/proc/$taskPid}} exists; if the task process still exists, keep retrying {{docker logs}} until {{/proc/$taskPid}} disappears or {{docker logs}} succeeds again. The problem with this approach is that it is a bit tricky to find the right timestamp parameter for {{docker logs --since}}, and some logs may be missed. 2. * Read {{/run/docker/libcontainerd/$container_id/init-stdout}} and {{/run/docker/libcontainerd/$container_id/init-stderr}} directly. This is tricky as well, because it depends on the implementation of Docker across different versions, and it doesn't allow multiple consumers: if we read these files directly, other consumers of {{docker logs}} would not see the log we got from them. In short, I think we don't have a perfect solution for this problem unless we allow some log loss. > Support for docker live-restore option in Mesos > --- > > Key: MESOS-6480 > URL: https://issues.apache.org/jira/browse/MESOS-6480 > Project: Mesos > Issue Type: Task >Reporter: Milind Chawre > > Docker 1.12 supports the live-restore option, which keeps containers alive during > docker daemon downtime: https://docs.docker.com/engine/admin/live-restore/ > I tried to use this option in my Mesos setup and observed this: > 1. On a mesos worker node, stop the docker daemon. > 2. After some time, start the docker daemon. All the containers running on > that node are still visible using "docker ps". This is the expected behaviour of > the live-restore option. > 3. When I check the mesos and marathon UIs, they show no active tasks running on > that node. The containers which are still running on that node are now > scheduled on different mesos nodes, which is not right, since I can still see the > containers in "docker ps" output because of the live-restore option. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-6480) Support for docker live-restore option in Mesos
[ https://issues.apache.org/jira/browse/MESOS-6480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898649#comment-15898649 ] haosdent commented on MESOS-6480: - As checked, all docker commands fail after {{service docker stop}} when {{--live-restore}} is in use, including {{docker logs}}, no matter which log-driver we use. After chatting with Jie Yu, the possible ways to resolve this are: 1. * {{docker run -d}} to start the program. * {{docker logs --since xxx --follow}} to read the log. * If {{docker logs}} fails, check whether {{/proc/$taskPid}} exists; if the task process still exists, keep retrying {{docker logs}} until {{/proc/$taskPid}} disappears or {{docker logs}} succeeds again. The problem with this approach is that it is a bit tricky to find the right timestamp parameter for {{docker logs --since}}, and some logs may be missed. 2. * Read {{/run/docker/libcontainerd/$container_id/init-stdout}} and {{/run/docker/libcontainerd/$container_id/init-stderr}} directly. This is tricky as well, because it depends on the implementation of Docker across different versions, and it doesn't allow multiple consumers: if we read these files directly, other consumers of {{docker logs}} would not see the log we got from them. In short, I think we don't have a perfect solution for this problem unless we allow some log loss. > Support for docker live-restore option in Mesos > --- > > Key: MESOS-6480 > URL: https://issues.apache.org/jira/browse/MESOS-6480 > Project: Mesos > Issue Type: Task >Reporter: Milind Chawre > > Docker 1.12 supports the live-restore option, which keeps containers alive during > docker daemon downtime: https://docs.docker.com/engine/admin/live-restore/ > I tried to use this option in my Mesos setup and observed this: > 1. On a mesos worker node, stop the docker daemon. > 2. After some time, start the docker daemon. All the containers running on > that node are still visible using "docker ps". This is the expected behaviour of > the live-restore option. > 3. When I check the mesos and marathon UIs, they show no active tasks running on > that node. The containers which are still running on that node are now > scheduled on different mesos nodes, which is not right, since I can still see the > containers in "docker ps" output because of the live-restore option. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (MESOS-6919) Libprocess reinit code leaks SSL server socket FD
[ https://issues.apache.org/jira/browse/MESOS-6919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Wu reassigned MESOS-6919: Assignee: Joseph Wu > Libprocess reinit code leaks SSL server socket FD > - > > Key: MESOS-6919 > URL: https://issues.apache.org/jira/browse/MESOS-6919 > Project: Mesos > Issue Type: Bug > Components: libprocess >Reporter: Greg Mann >Assignee: Joseph Wu > Labels: libprocess, ssl > > After [this commit|https://github.com/apache/mesos/commit/789e9f7], it was > discovered that tests which use {{process::reinitialize}} to switch between > SSL and non-SSL modes will leak the file descriptor associated with the > server socket {{\_\_s\_\_}}. This can be reproduced by running the following > trivial test in repetition: > {code} > diff --git a/src/tests/scheduler_tests.cpp b/src/tests/scheduler_tests.cpp > index 1ff423f..d5fd575 100644 > --- a/src/tests/scheduler_tests.cpp > +++ b/src/tests/scheduler_tests.cpp > @@ -1821,6 +1821,12 @@ INSTANTIATE_TEST_CASE_P( > #endif // USE_SSL_SOCKET > +TEST_P(SchedulerSSLTest, LeakTest) > +{ > + ::sleep(1); > +} > + > + > // Tests that a scheduler can subscribe, run a task, and then tear itself > down. > TEST_P(SchedulerSSLTest, RunTaskAndTeardown) > { > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-6919) Libprocess reinit code leaks SSL server socket FD
[ https://issues.apache.org/jira/browse/MESOS-6919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898484#comment-15898484 ] Joseph Wu commented on MESOS-6919: -- Looks like this affects Unix sockets too:
{code}
while (true) {
  Try<unix::Socket> create = unix::Socket::create();
  ASSERT_SOME(create);

  Try<unix::Address> address =
    unix::Address::create(os::mkdtemp().get() + "/a");
  ASSERT_SOME(address);

  Try<unix::Address> bind = create->bind(address.get());
  ASSERT_SOME(bind);

  Try<Nothing> listen = create->listen(10);
  ASSERT_SOME(listen);

  create->accept().discard();
}
{code}
> Libprocess reinit code leaks SSL server socket FD > - > > Key: MESOS-6919 > URL: https://issues.apache.org/jira/browse/MESOS-6919 > Project: Mesos > Issue Type: Bug > Components: libprocess >Reporter: Greg Mann > Labels: libprocess, ssl > > After [this commit|https://github.com/apache/mesos/commit/789e9f7], it was > discovered that tests which use {{process::reinitialize}} to switch between > SSL and non-SSL modes will leak the file descriptor associated with the > server socket {{\_\_s\_\_}}. This can be reproduced by running the following > trivial test in repetition: > {code} > diff --git a/src/tests/scheduler_tests.cpp b/src/tests/scheduler_tests.cpp > index 1ff423f..d5fd575 100644 > --- a/src/tests/scheduler_tests.cpp > +++ b/src/tests/scheduler_tests.cpp > @@ -1821,6 +1821,12 @@ INSTANTIATE_TEST_CASE_P( > #endif // USE_SSL_SOCKET > +TEST_P(SchedulerSSLTest, LeakTest) > +{ > + ::sleep(1); > +} > + > + > // Tests that a scheduler can subscribe, run a task, and then tear itself > down. > TEST_P(SchedulerSSLTest, RunTaskAndTeardown) > { > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Comment Edited] (MESOS-6919) Libprocess reinit code leaks SSL server socket FD
[ https://issues.apache.org/jira/browse/MESOS-6919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898414#comment-15898414 ] Joseph Wu edited comment on MESOS-6919 at 3/7/17 12:02 AM: --- This leak is not strictly limited to the reinitialization logic. Here is an even smaller repro (assuming libprocess is started with SSL):
{code}
while (true) {
  Try<Socket> create = Socket::create();
  ASSERT_SOME(create);

  Socket* __s__ = new Socket(create.get());

  Try<Address> bind = __s__->bind(Address::ANY_ANY());
  ASSERT_SOME(bind);

  Try<Nothing> listen = __s__->listen(10);
  ASSERT_SOME(listen);

  __s__->accept().discard();

  delete __s__;
  __s__ = nullptr;
}
{code}
was (Author: kaysoky): This leak is not strictly limited to the reinitialization logic. Here is an even smaller repro (assuming libprocess is started with SSL):
{code}
while (true) {
  Try<Socket> create = Socket::create();
  ASSERT_SOME(create);

  Socket* __s__ = new Socket(create.get());
  std::cout << "Test socket == " << __s__->get() << std::endl;

  Try<Address> bind = __s__->bind(Address::ANY_ANY());
  ASSERT_SOME(bind);

  Try<Nothing> listen = __s__->listen(10);
  ASSERT_SOME(listen);

  __s__->accept().discard();

  delete __s__;
  __s__ = nullptr;
}
{code}
> Libprocess reinit code leaks SSL server socket FD > - > > Key: MESOS-6919 > URL: https://issues.apache.org/jira/browse/MESOS-6919 > Project: Mesos > Issue Type: Bug > Components: libprocess >Reporter: Greg Mann > Labels: libprocess, ssl > > After [this commit|https://github.com/apache/mesos/commit/789e9f7], it was > discovered that tests which use {{process::reinitialize}} to switch between > SSL and non-SSL modes will leak the file descriptor associated with the > server socket {{\_\_s\_\_}}. This can be reproduced by running the following > trivial test in repetition: > {code} > diff --git a/src/tests/scheduler_tests.cpp b/src/tests/scheduler_tests.cpp > index 1ff423f..d5fd575 100644 > --- a/src/tests/scheduler_tests.cpp > +++ b/src/tests/scheduler_tests.cpp > @@ -1821,6 +1821,12 @@ INSTANTIATE_TEST_CASE_P( > #endif // USE_SSL_SOCKET > +TEST_P(SchedulerSSLTest, LeakTest) > +{ > + ::sleep(1); > +} > + > + > // Tests that a scheduler can subscribe, run a task, and then tear itself > down. > TEST_P(SchedulerSSLTest, RunTaskAndTeardown) > { > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-6919) Libprocess reinit code leaks SSL server socket FD
[ https://issues.apache.org/jira/browse/MESOS-6919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898414#comment-15898414 ] Joseph Wu commented on MESOS-6919: -- This leak is not strictly limited to the reinitialization logic. Here is an even smaller repro (assuming libprocess is started with SSL):
{code}
while (true) {
  Try<Socket> create = Socket::create();
  ASSERT_SOME(create);

  Socket* __s__ = new Socket(create.get());
  std::cout << "Test socket == " << __s__->get() << std::endl;

  Try<Address> bind = __s__->bind(Address::ANY_ANY());
  ASSERT_SOME(bind);

  Try<Nothing> listen = __s__->listen(10);
  ASSERT_SOME(listen);

  __s__->accept().discard();

  delete __s__;
  __s__ = nullptr;
}
{code}
> Libprocess reinit code leaks SSL server socket FD > - > > Key: MESOS-6919 > URL: https://issues.apache.org/jira/browse/MESOS-6919 > Project: Mesos > Issue Type: Bug > Components: libprocess >Reporter: Greg Mann > Labels: libprocess, ssl > > After [this commit|https://github.com/apache/mesos/commit/789e9f7], it was > discovered that tests which use {{process::reinitialize}} to switch between > SSL and non-SSL modes will leak the file descriptor associated with the > server socket {{\_\_s\_\_}}. This can be reproduced by running the following > trivial test in repetition: > {code} > diff --git a/src/tests/scheduler_tests.cpp b/src/tests/scheduler_tests.cpp > index 1ff423f..d5fd575 100644 > --- a/src/tests/scheduler_tests.cpp > +++ b/src/tests/scheduler_tests.cpp > @@ -1821,6 +1821,12 @@ INSTANTIATE_TEST_CASE_P( > #endif // USE_SSL_SOCKET > +TEST_P(SchedulerSSLTest, LeakTest) > +{ > + ::sleep(1); > +} > + > + > // Tests that a scheduler can subscribe, run a task, and then tear itself > down. > TEST_P(SchedulerSSLTest, RunTaskAndTeardown) > { > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-7195) Use C++11 variadic templates for process::dispatch/defer/delay/async/run
[ https://issues.apache.org/jira/browse/MESOS-7195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898378#comment-15898378 ] Yan Xu commented on MESOS-7195: --- [~mcypark] I am thinking of investigating this and would just like to solicit some feedback first to help me get started. In the slack channel you mentioned: {quote} it’ll be some work to implement the variadic template versions because it’ll involve some API changes {quote} Could you elaborate a bit further? > Use C++11 variadic templates for process::dispatch/defer/delay/async/run > > > Key: MESOS-7195 > URL: https://issues.apache.org/jira/browse/MESOS-7195 > Project: Mesos > Issue Type: Improvement > Components: libprocess >Reporter: Yan Xu > > These methods are currently implemented using {{REPEAT_FROM_TO}} (i.e., > {{BOOST_PP_REPEAT_FROM_TO}}): > {code:title=} > REPEAT_FROM_TO(1, 11, TEMPLATE, _) // Args A0 -> A9. > {code} > This means we have to bump up the number of repetitions whenever we add a new > method with more args. > It seems we can replace this with C++11 variadic templates. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Comment Edited] (MESOS-7195) Use C++11 variadic templates for process::dispatch/defer/delay/async/run
[ https://issues.apache.org/jira/browse/MESOS-7195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898378#comment-15898378 ] Yan Xu edited comment on MESOS-7195 at 3/6/17 11:30 PM: [~mcypark] I am thinking of investigating this and would just like to solicit some feedback first to help me get started. In the slack channel you mentioned: {quote} it’ll be some work to implement the variadic template versions because it’ll involve some API changes {quote} Could you elaborate a bit further (plus other suggestions)? was (Author: xujyan): [~mcypark] I am thinking of investigating this and would just like to solicit some feedback first to help me get started. In the slack channel you mentioned: {quote} it’ll be some work to implement the variadic template versions because it’ll involve some API changes {quote} Could you elaborate a bit further? > Use C++11 variadic templates for process::dispatch/defer/delay/async/run > > > Key: MESOS-7195 > URL: https://issues.apache.org/jira/browse/MESOS-7195 > Project: Mesos > Issue Type: Improvement > Components: libprocess >Reporter: Yan Xu > > These methods are currently implemented using {{REPEAT_FROM_TO}} (i.e., > {{BOOST_PP_REPEAT_FROM_TO}}): > {code:title=} > REPEAT_FROM_TO(1, 11, TEMPLATE, _) // Args A0 -> A9. > {code} > This means we have to bump up the number of repetitions whenever we add a new > method with more args. > It seems we can replace this with C++11 variadic templates. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7214) StatusUpdateManagerProcess::resume() doesn't support resuming a single stream.
[ https://issues.apache.org/jira/browse/MESOS-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-7214: -- Issue Type: Improvement (was: Bug) +1 to fix this. Will be happy to shepherd. > StatusUpdateManagerProcess::resume() doesn't support resuming a single stream. > -- > > Key: MESOS-7214 > URL: https://issues.apache.org/jira/browse/MESOS-7214 > Project: Mesos > Issue Type: Improvement >Reporter: Yan Xu > > Therefore resume() gets called repeatedly for each {{UpdateFrameworkMessage}} > and all status updates for ALL frameworks are resent unnecessarily. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7215) Master sends ShutdownFrameworkMessage for all non-partition-aware frameworks
[ https://issues.apache.org/jira/browse/MESOS-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-7215: -- Priority: Critical (was: Major) > Master sends ShutdownFrameworkMessage for all non-partition-aware frameworks > > > Key: MESOS-7215 > URL: https://issues.apache.org/jira/browse/MESOS-7215 > Project: Mesos > Issue Type: Bug >Reporter: Yan Xu >Priority: Critical > > Prior to the partition-awareness work MESOS-5344, upon agent reregistration > after it has been removed, the master only sends ShutdownFrameworkMessages to > the agent for frameworks that it knows have been torn down. > With the new logic in MESOS-5344, Mesos is now sending > {{ShutdownFrameworkMessages}} to the agent for all non-partition-aware > frameworks (including the ones that are still registered) > This is problematic. The offer from this agent can still go to the same > framework which can then launch new tasks. The agent then receives tasks of > the same framework and ignores them because it thinks the framework is > shutting down. The framework is not shutting down of course, so from the > master and the scheduler's perspective the task is pending in STAGING forever > until the next agent reregistration, which could happen much later. > This also makes the semantics of `ShutdownFrameworkMessage` ambiguous: the > agent is assuming the framework to be going away (and act accordingly) when > it's not. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-7215) Master sends ShutdownFrameworkMessage for all non-partition-aware frameworks
[ https://issues.apache.org/jira/browse/MESOS-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898358#comment-15898358 ] Vinod Kone commented on MESOS-7215: --- Interesting. I guess we never explicitly called out that `ShutdownFrameworkMessage` should only be sent when framework is being torn down. But I'm surprised to hear that as a consequence of the recent changes the task stays in STAGING forever. I'm assuming this is because agent doesn't send a TASK_DROPPED status update since it thinks the framework is shutting down. Sending a `KillTaskMessage` instead of `ShutdownFrameworkMessage` sounds good to me. > Master sends ShutdownFrameworkMessage for all non-partition-aware frameworks > > > Key: MESOS-7215 > URL: https://issues.apache.org/jira/browse/MESOS-7215 > Project: Mesos > Issue Type: Bug >Reporter: Yan Xu > > Prior to the partition-awareness work MESOS-5344, upon agent reregistration > after it has been removed, the master only sends ShutdownFrameworkMessages to > the agent for frameworks that it knows have been torn down. > With the new logic in MESOS-5344, Mesos is now sending > {{ShutdownFrameworkMessages}} to the agent for all non-partition-aware > frameworks (including the ones that are still registered) > This is problematic. The offer from this agent can still go to the same > framework which can then launch new tasks. The agent then receives tasks of > the same framework and ignores them because it thinks the framework is > shutting down. The framework is not shutting down of course, so from the > master and the scheduler's perspective the task is pending in STAGING forever > until the next agent reregistration, which could happen much later. > This also makes the semantics of `ShutdownFrameworkMessage` ambiguous: the > agent is assuming the framework to be going away (and act accordingly) when > it's not. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-7215) Master sends ShutdownFrameworkMessage for all non-partition-aware frameworks
[ https://issues.apache.org/jira/browse/MESOS-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898344#comment-15898344 ] Yan Xu commented on MESOS-7215: --- /cc [~neilc] [~vinodkone] Perhaps we should keep the logic that transitions the tasks to {{TASK_LOST}} on the master and have the master kill these tasks on the agent without sending {{ShutdownFrameworkMessage}}? > Master sends ShutdownFrameworkMessage for all non-partition-aware frameworks > > > Key: MESOS-7215 > URL: https://issues.apache.org/jira/browse/MESOS-7215 > Project: Mesos > Issue Type: Bug >Reporter: Yan Xu > > Prior to the partition-awareness work MESOS-5344, upon agent reregistration > after it has been removed, the master only sends ShutdownFrameworkMessages to > the agent for frameworks that it knows have been torn down. > With the new logic in MESOS-5344, Mesos is now sending > {{ShutdownFrameworkMessages}} to the agent for all non-partition-aware > frameworks (including the ones that are still registered) > This is problematic. The offer from this agent can still go to the same > framework which can then launch new tasks. The agent then receives tasks of > the same framework and ignores them because it thinks the framework is > shutting down. The framework is not shutting down of course, so from the > master and the scheduler's perspective the task is pending in STAGING forever > until the next agent reregistration, which could happen much later. > This also makes the semantics of `ShutdownFrameworkMessage` ambiguous: the > agent is assuming the framework to be going away (and act accordingly) when > it's not. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7215) Master sends ShutdownFrameworkMessage for all non-partition-aware frameworks
[ https://issues.apache.org/jira/browse/MESOS-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Xu updated MESOS-7215: -- Summary: Master sends ShutdownFrameworkMessage for all non-partition-aware frameworks (was: Master sends ShutdownFrameworkMessage for all partition-aware frameworks) > Master sends ShutdownFrameworkMessage for all non-partition-aware frameworks > > > Key: MESOS-7215 > URL: https://issues.apache.org/jira/browse/MESOS-7215 > Project: Mesos > Issue Type: Bug >Reporter: Yan Xu > > Prior to the partition-awareness work MESOS-5344, upon agent reregistration > after it has been removed, the master only sends ShutdownFrameworkMessages to > the agent for frameworks that it knows have been torn down. > With the new logic in MESOS-5344, Mesos is now sending > {{ShutdownFrameworkMessages}} to the agent for all non-partition-aware > frameworks (including the ones that are still registered) > This is problematic. The offer from this agent can still go to the same > framework which can then launch new tasks. The agent then receives tasks of > the same framework and ignores them because it thinks the framework is > shutting down. The framework is not shutting down of course, so from the > master and the scheduler's perspective the task is pending in STAGING forever > until the next agent reregistration, which could happen much later. > This also makes the semantics of `ShutdownFrameworkMessage` ambiguous: the > agent is assuming the framework to be going away (and act accordingly) when > it's not. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (MESOS-7215) Master sends ShutdownFrameworkMessage for all partition-aware frameworks
Yan Xu created MESOS-7215: - Summary: Master sends ShutdownFrameworkMessage for all partition-aware frameworks Key: MESOS-7215 URL: https://issues.apache.org/jira/browse/MESOS-7215 Project: Mesos Issue Type: Bug Reporter: Yan Xu Prior to the partition-awareness work MESOS-5344, upon agent reregistration after it has been removed, the master only sends ShutdownFrameworkMessages to the agent for frameworks that it knows have been torn down. With the new logic in MESOS-5344, Mesos is now sending {{ShutdownFrameworkMessages}} to the agent for all non-partition-aware frameworks (including the ones that are still registered) This is problematic. The offer from this agent can still go to the same framework which can then launch new tasks. The agent then receives tasks of the same framework and ignores them because it thinks the framework is shutting down. The framework is not shutting down of course, so from the master and the scheduler's perspective the task is pending in STAGING forever until the next agent reregistration, which could happen much later. This also makes the semantics of `ShutdownFrameworkMessage` ambiguous: the agent is assuming the framework to be going away (and act accordingly) when it's not. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7214) StatusUpdateManagerProcess::resume() doesn't support resuming a single stream.
[ https://issues.apache.org/jira/browse/MESOS-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Xu updated MESOS-7214: -- Description: Therefore resume() gets called repeatedly for each {{UpdateFrameworkMessage}} and all status updates for ALL frameworks are resent unnecessarily. (was: Therefore when resume() gets called repeatedly it re-flushes all messages unnecessarily.) > StatusUpdateManagerProcess::resume() doesn't support resuming a single stream. > -- > > Key: MESOS-7214 > URL: https://issues.apache.org/jira/browse/MESOS-7214 > Project: Mesos > Issue Type: Bug >Reporter: Yan Xu > > Therefore resume() gets called repeatedly for each {{UpdateFrameworkMessage}} > and all status updates for ALL frameworks are resent unnecessarily. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7214) StatusUpdateManagerProcess::resume() doesn't support resuming a single stream.
[ https://issues.apache.org/jira/browse/MESOS-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Xu updated MESOS-7214: -- Summary: StatusUpdateManagerProcess::resume() doesn't support resuming a single stream. (was: StatusUpdateManagerProcess::resume() doesn') > StatusUpdateManagerProcess::resume() doesn't support resuming a single stream. > -- > > Key: MESOS-7214 > URL: https://issues.apache.org/jira/browse/MESOS-7214 > Project: Mesos > Issue Type: Bug >Reporter: Yan Xu > > Therefore when resume() gets called repeatedly it re-flushes all messages > unnecessarily. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (MESOS-7214) StatusUpdateManagerProcess::resume() doesn'
Yan Xu created MESOS-7214: - Summary: StatusUpdateManagerProcess::resume() doesn' Key: MESOS-7214 URL: https://issues.apache.org/jira/browse/MESOS-7214 Project: Mesos Issue Type: Bug Reporter: Yan Xu Therefore when resume() gets called repeatedly it re-flushes all messages unnecessarily. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-5689) `PortMappingIsolatorTest.ROOT_ContainerICMPExternal` fails on Fedora 23.
[ https://issues.apache.org/jira/browse/MESOS-5689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898275#comment-15898275 ] Till Toenshoff commented on MESOS-5689: --- We need to clarify whether this is simply a test failure or an actual bug in conjunction with Fedora 23. I am still seeing this when testing 1.1.1-rc2. > `PortMappingIsolatorTest.ROOT_ContainerICMPExternal` fails on Fedora 23. > > > Key: MESOS-5689 > URL: https://issues.apache.org/jira/browse/MESOS-5689 > Project: Mesos > Issue Type: Bug > Components: isolation, network > Environment: Fedora 23 with network isolation >Reporter: Gilbert Song > Labels: isolation, mesosphere, networking, tests > > Here is the log: > {noformat} > [20:17:53] : [Step 10/10] [ RUN ] > PortMappingIsolatorTest.ROOT_ContainerICMPExternal > [20:17:53]W: [Step 10/10] I0622 20:17:53.890225 28395 > port_mapping_tests.cpp:229] Using eth0 as the public interface > [20:17:53]W: [Step 10/10] I0622 20:17:53.890532 28395 > port_mapping_tests.cpp:237] Using lo as the loopback interface > [20:17:53]W: [Step 10/10] I0622 20:17:53.904742 28395 resources.cpp:572] > Parsing resources as JSON failed: > cpus:2;mem:1024;disk:1024;ephemeral_ports:[30001-30999];ports:[31000-32000] > [20:17:53]W: [Step 10/10] Trying semicolon-delimited string format instead > [20:17:53]W: [Step 10/10] I0622 20:17:53.905855 28395 > port_mapping.cpp:1557] Using eth0 as the public interface > [20:17:53]W: [Step 10/10] I0622 20:17:53.906159 28395 > port_mapping.cpp:1582] Using lo as the loopback interface > [20:17:53]W: [Step 10/10] I0622 20:17:53.907315 28395 > port_mapping.cpp:1869] /proc/sys/net/ipv4/neigh/default/gc_thresh3 = '1024' > [20:17:53]W: [Step 10/10] I0622 20:17:53.907362 28395 > port_mapping.cpp:1869] /proc/sys/net/ipv4/neigh/default/gc_thresh1 = '128' > [20:17:53]W: [Step 10/10] I0622 20:17:53.907418 28395 > port_mapping.cpp:1869] /proc/sys/net/ipv4/tcp_wmem = '4096 16384 4194304' > [20:17:53]W: [Step 10/10] I0622 20:17:53.907454 28395 > 
port_mapping.cpp:1869] /proc/sys/net/ipv4/tcp_synack_retries = '5' > [20:17:53]W: [Step 10/10] I0622 20:17:53.907491 28395 > port_mapping.cpp:1869] /proc/sys/net/core/rmem_max = '212992' > [20:17:53]W: [Step 10/10] I0622 20:17:53.907524 28395 > port_mapping.cpp:1869] /proc/sys/net/core/somaxconn = '128' > [20:17:53]W: [Step 10/10] I0622 20:17:53.907557 28395 > port_mapping.cpp:1869] /proc/sys/net/core/wmem_max = '212992' > [20:17:53]W: [Step 10/10] I0622 20:17:53.907588 28395 > port_mapping.cpp:1869] /proc/sys/net/ipv4/tcp_rmem = '4096 87380 6291456' > [20:17:53]W: [Step 10/10] I0622 20:17:53.907618 28395 > port_mapping.cpp:1869] /proc/sys/net/ipv4/tcp_keepalive_time = '7200' > [20:17:53]W: [Step 10/10] I0622 20:17:53.907649 28395 > port_mapping.cpp:1869] /proc/sys/net/ipv4/neigh/default/gc_thresh2 = '512' > [20:17:53]W: [Step 10/10] I0622 20:17:53.907680 28395 > port_mapping.cpp:1869] /proc/sys/net/core/netdev_max_backlog = '1000' > [20:17:53]W: [Step 10/10] I0622 20:17:53.907711 28395 > port_mapping.cpp:1869] /proc/sys/net/ipv4/tcp_keepalive_intvl = '75' > [20:17:53]W: [Step 10/10] I0622 20:17:53.907742 28395 > port_mapping.cpp:1869] /proc/sys/net/ipv4/tcp_keepalive_probes = '9' > [20:17:53]W: [Step 10/10] I0622 20:17:53.907773 28395 > port_mapping.cpp:1869] /proc/sys/net/ipv4/tcp_max_syn_backlog = '512' > [20:17:53]W: [Step 10/10] I0622 20:17:53.907802 28395 > port_mapping.cpp:1869] /proc/sys/net/ipv4/tcp_retries2 = '15' > [20:17:53]W: [Step 10/10] I0622 20:17:53.916348 28395 > linux_launcher.cpp:101] Using /sys/fs/cgroup/freezer as the freezer hierarchy > for the Linux launcher > [20:17:53]W: [Step 10/10] I0622 20:17:53.916575 28395 resources.cpp:572] > Parsing resources as JSON failed: ports:[31000-31499] > [20:17:53]W: [Step 10/10] Trying semicolon-delimited string format instead > [20:17:53]W: [Step 10/10] I0622 20:17:53.917032 28412 > port_mapping.cpp:2512] Using non-ephemeral ports {[31000,31500)} and > ephemeral ports [30016,30032) for container container1 
of executor '' > [20:17:53]W: [Step 10/10] I0622 20:17:53.918092 28395 > linux_launcher.cpp:281] Cloning child process with flags = CLONE_NEWNS | > CLONE_NEWNET > [20:17:53]W: [Step 10/10] I0622 20:17:53.951756 28410 > port_mapping.cpp:2576] Bind mounted '/proc/15611/ns/net' to > '/run/netns/15611' for container container1 > [20:17:53]W: [Step 10/10] I0622 20:17:53.951918 28410 > port_mapping.cpp:2607] Created network namespace handle symlink > '/var/run/mesos/netns/container1' -> '/run/netns/15611' > [20:17:53]W: [Step 10/10] I0622 20:17:53.952893 28410 > port_mapping.cpp:2667] Adding IP packet filters with ports [30016,30031] for > container container1 >
[jira] [Assigned] (MESOS-6919) Libprocess reinit code leaks SSL server socket FD
[ https://issues.apache.org/jira/browse/MESOS-6919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-6919: Assignee: (was: Greg Mann) > Libprocess reinit code leaks SSL server socket FD > - > > Key: MESOS-6919 > URL: https://issues.apache.org/jira/browse/MESOS-6919 > Project: Mesos > Issue Type: Bug > Components: libprocess >Reporter: Greg Mann > Labels: libprocess, ssl > > After [this commit|https://github.com/apache/mesos/commit/789e9f7], it was > discovered that tests which use {{process::reinitialize}} to switch between > SSL and non-SSL modes will leak the file descriptor associated with the > server socket {{\_\_s\_\_}}. This can be reproduced by running the following > trivial test in repetition: > {code} > diff --git a/src/tests/scheduler_tests.cpp b/src/tests/scheduler_tests.cpp > index 1ff423f..d5fd575 100644 > --- a/src/tests/scheduler_tests.cpp > +++ b/src/tests/scheduler_tests.cpp > @@ -1821,6 +1821,12 @@ INSTANTIATE_TEST_CASE_P( > #endif // USE_SSL_SOCKET > +TEST_P(SchedulerSSLTest, LeakTest) > +{ > + ::sleep(1); > +} > + > + > // Tests that a scheduler can subscribe, run a task, and then tear itself > down. > TEST_P(SchedulerSSLTest, RunTaskAndTeardown) > { > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
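A quick way to confirm a descriptor leak like the one above, independently of Mesos, is to count the process's open file descriptors across iterations. The sketch below is Linux-only (it reads `/proc/self/fd`, which is a platform assumption) and uses an ordinary socket as a stand-in for the leaked server socket `__s__`:

```python
import os
import socket

def open_fd_count():
    # Linux-specific: /proc/self/fd holds one entry per open descriptor.
    return len(os.listdir("/proc/self/fd"))

before = open_fd_count()
leaked = socket.socket()  # stands in for a server socket never closed on reinit
after = open_fd_count()
print(after - before)  # one descriptor gained and not released
leaked.close()
```

Running such a counter before and after each `process::reinitialize` cycle would show the count growing by one per iteration if the server socket's descriptor is never closed.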
[jira] [Commented] (MESOS-7082) ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0 is flaky
[ https://issues.apache.org/jira/browse/MESOS-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897910#comment-15897910 ] Greg Mann commented on MESOS-7082: -- I just observed this failure again on our internal CI. > ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0 is > flaky > > > Key: MESOS-7082 > URL: https://issues.apache.org/jira/browse/MESOS-7082 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.2.0 > Environment: ubuntu 16.04 with/without SSL >Reporter: Anand Mazumdar > Labels: flaky, flaky-test, mesosphere > > Showed up on our internal CI > {noformat} > 07:00:17 [ RUN ] > ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0 > 07:00:17 I0207 07:00:17.775459 2952 cluster.cpp:160] Creating default > 'local' authorizer > 07:00:17 I0207 07:00:17.776511 2970 master.cpp:383] Master > fa1554c4-572a-4b89-8994-a89460f588d3 (ip-10-153-254-29.ec2.internal) started > on 10.153.254.29:38570 > 07:00:17 I0207 07:00:17.776538 2970 master.cpp:385] Flags at startup: > --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate_agents="true" --authenticate_frameworks="true" > --authenticate_http_frameworks="true" --authenticate_http_readonly="true" > --authenticate_http_readwrite="true" --authenticators="crammd5" > --authorizers="local" --credentials="/tmp/ZROfJk/credentials" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --http_authenticators="basic" --http_framework_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" > --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" > --max_unreachable_tasks_per_framework="1000" --quiet="false" > --recovery_agent_removal_limit="100%" --registry="in_memory" > --registry_fetch_timeout="1mins" 
--registry_gc_interval="15mins" > --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" > --registry_store_timeout="100secs" --registry_strict="false" > --root_submissions="true" --user_sorter="drf" --version="false" > --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/ZROfJk/master" > --zk_session_timeout="10secs" > 07:00:17 I0207 07:00:17.776674 2970 master.cpp:435] Master only allowing > authenticated frameworks to register > 07:00:17 I0207 07:00:17.776687 2970 master.cpp:449] Master only allowing > authenticated agents to register > 07:00:17 I0207 07:00:17.776695 2970 master.cpp:462] Master only allowing > authenticated HTTP frameworks to register > 07:00:17 I0207 07:00:17.776703 2970 credentials.hpp:37] Loading credentials > for authentication from '/tmp/ZROfJk/credentials' > 07:00:17 I0207 07:00:17.776779 2970 master.cpp:507] Using default 'crammd5' > authenticator > 07:00:17 I0207 07:00:17.776841 2970 http.cpp:919] Using default 'basic' HTTP > authenticator for realm 'mesos-master-readonly' > 07:00:17 I0207 07:00:17.776919 2970 http.cpp:919] Using default 'basic' HTTP > authenticator for realm 'mesos-master-readwrite' > 07:00:17 I0207 07:00:17.776970 2970 http.cpp:919] Using default 'basic' HTTP > authenticator for realm 'mesos-master-scheduler' > 07:00:17 I0207 07:00:17.777009 2970 master.cpp:587] Authorization enabled > 07:00:17 I0207 07:00:17.777122 2975 hierarchical.cpp:161] Initialized > hierarchical allocator process > 07:00:17 I0207 07:00:17.777138 2974 whitelist_watcher.cpp:77] No whitelist > given > 07:00:17 I0207 07:00:17.04 2976 master.cpp:2123] Elected as the leading > master! 
> 07:00:17 I0207 07:00:17.26 2976 master.cpp:1645] Recovering from > registrar > 07:00:17 I0207 07:00:17.84 2975 registrar.cpp:329] Recovering registrar > 07:00:17 I0207 07:00:17.777989 2973 registrar.cpp:362] Successfully fetched > the registry (0B) in 176384ns > 07:00:17 I0207 07:00:17.778023 2973 registrar.cpp:461] Applied 1 operations > in 7573ns; attempting to update the registry > 07:00:17 I0207 07:00:17.778249 2976 registrar.cpp:506] Successfully updated > the registry in 210944ns > 07:00:17 I0207 07:00:17.778290 2976 registrar.cpp:392] Successfully > recovered registrar > 07:00:17 I0207 07:00:17.778373 2976 master.cpp:1761] Recovered 0 agents from > the registry (172B); allowing 10mins for agents to re-register > 07:00:17 I0207 07:00:17.778394 2974 hierarchical.cpp:188] Skipping recovery > of hierarchical allocator: nothing to recover > 07:00:17 I0207 07:00:17.869381 2952 containerizer.cpp:220] Using isolation: > posix/cpu,posix/mem,filesystem/posix,network/cni > 07:00:17 I0207 07:00:17.872557 2952 linux_launcher.cpp:150] Using > /sys/fs/cgroup/freezer as
[jira] [Issue Comment Deleted] (MESOS-6792) MasterSlaveReconciliationTest.ReconcileLostTask test is flaky
[ https://issues.apache.org/jira/browse/MESOS-6792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Toenshoff updated MESOS-6792: -- Comment: was deleted (was: Seeing this fail on ubuntu-14.04 while testing 1.1.1-rc2 -- not a crash though! {noformat} ../../src/tests/containerizer/cgroups_isolator_tests.cpp:438 Expected: (0.05) <= (cpuTime), actual: 0.05 vs 0.04 {noformat}) > MasterSlaveReconciliationTest.ReconcileLostTask test is flaky > - > > Key: MESOS-6792 > URL: https://issues.apache.org/jira/browse/MESOS-6792 > Project: Mesos > Issue Type: Bug > Components: technical debt, test > Environment: Fedora 25, clang, w/ optimizations, SSL build >Reporter: Benjamin Bannier > Labels: mesosphere > > The test {{MasterSlaveReconciliationTest.ReconcileLostTask}} is flaky for me > as of {{e99ea9ce8b1de01dd8b3cac6675337edb6320f38}}, > {code} > Repeating all tests (iteration 912) . . . > Note: Google Test filter = >
[jira] [Comment Edited] (MESOS-6792) MasterSlaveReconciliationTest.ReconcileLostTask test is flaky
[ https://issues.apache.org/jira/browse/MESOS-6792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897894#comment-15897894 ] Till Toenshoff edited comment on MESOS-6792 at 3/6/17 7:36 PM: --- Seeing this fail on ubuntu-14.04 while testing 1.1.1-rc2 -- not a crash though! {noformat} ../../src/tests/containerizer/cgroups_isolator_tests.cpp:438 Expected: (0.05) <= (cpuTime), actual: 0.05 vs 0.04 {noformat} was (Author: tillt): Seeing the same on ubuntu-14.04 while testing 1.1.1-rc2. > MasterSlaveReconciliationTest.ReconcileLostTask test is flaky > - > > Key: MESOS-6792 > URL: https://issues.apache.org/jira/browse/MESOS-6792 > Project: Mesos > Issue Type: Bug > Components: technical debt, test > Environment: Fedora 25, clang, w/ optimizations, SSL build >Reporter: Benjamin Bannier > Labels: mesosphere > > The test {{MasterSlaveReconciliationTest.ReconcileLostTask}} is flaky for me > as of {{e99ea9ce8b1de01dd8b3cac6675337edb6320f38}}, > {code} > Repeating all tests (iteration 912) . . . > Note: Google Test filter = >
[jira] [Commented] (MESOS-6792) MasterSlaveReconciliationTest.ReconcileLostTask test is flaky
[ https://issues.apache.org/jira/browse/MESOS-6792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897894#comment-15897894 ] Till Toenshoff commented on MESOS-6792: --- Seeing the same on ubuntu-14.04 while testing 1.1.1-rc2. > MasterSlaveReconciliationTest.ReconcileLostTask test is flaky > - > > Key: MESOS-6792 > URL: https://issues.apache.org/jira/browse/MESOS-6792 > Project: Mesos > Issue Type: Bug > Components: technical debt, test > Environment: Fedora 25, clang, w/ optimizations, SSL build >Reporter: Benjamin Bannier > Labels: mesosphere > > The test {{MasterSlaveReconciliationTest.ReconcileLostTask}} is flaky for me > as of {{e99ea9ce8b1de01dd8b3cac6675337edb6320f38}}, > {code} > Repeating all tests (iteration 912) . . . > Note: Google Test filter = >
[jira] [Created] (MESOS-7213) SlaveRecoveryTest/0.RecoverUnregisteredHTTPExecutor fails.
Till Toenshoff created MESOS-7213: - Summary: SlaveRecoveryTest/0.RecoverUnregisteredHTTPExecutor fails. Key: MESOS-7213 URL: https://issues.apache.org/jira/browse/MESOS-7213 Project: Mesos Issue Type: Bug Components: tests Affects Versions: 1.1.1 Environment: Debian 8, SSL/libevent build Reporter: Till Toenshoff The following happened while testing 1.1.1-rc2; may be flaky. {noformat} [ RUN ] SlaveRecoveryTest/0.RecoverUnregisteredHTTPExecutor I0306 14:16:42.640406 27141 cluster.cpp:158] Creating default 'local' authorizer I0306 14:16:42.648387 27141 leveldb.cpp:174] Opened db in 7.851169ms I0306 14:16:42.649245 27141 leveldb.cpp:181] Compacted db in 832265ns I0306 14:16:42.649266 27141 leveldb.cpp:196] Created db iterator in 4269ns I0306 14:16:42.649271 27141 leveldb.cpp:202] Seeked to beginning of db in 840ns I0306 14:16:42.649276 27141 leveldb.cpp:271] Iterated through 0 keys in the db in 448ns I0306 14:16:42.649291 27141 replica.cpp:776] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned I0306 14:16:42.649471 27163 recover.cpp:451] Starting replica recovery I0306 14:16:42.649528 27166 recover.cpp:477] Replica is in EMPTY status I0306 14:16:42.649864 27166 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from __req_res__(5541)@10.99.136.60:39312 I0306 14:16:42.649952 27160 recover.cpp:197] Received a recover response from a replica in EMPTY status I0306 14:16:42.650060 27164 recover.cpp:568] Updating replica status to STARTING I0306 14:16:42.650842 27160 master.cpp:380] Master 81fb2ed1-6c17-4dfb-a44f-160cfde9741e (ip-10-99-136-60.ec2.internal) started on 10.99.136.60:39312 I0306 14:16:42.650862 27160 master.cpp:382] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/syBZyN/credentials" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --quiet="false" --recovery_agent_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --root_submissions="true" --user_sorter="drf" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/syBZyN/master" --zk_session_timeout="10secs" I0306 14:16:42.651005 27160 master.cpp:432] Master only allowing authenticated frameworks to register I0306 14:16:42.651010 27160 master.cpp:446] Master only allowing authenticated agents to register I0306 14:16:42.651012 27160 master.cpp:459] Master only allowing authenticated HTTP frameworks to register I0306 14:16:42.651016 27160 credentials.hpp:37] Loading credentials for authentication from '/tmp/syBZyN/credentials' I0306 14:16:42.651876 27160 master.cpp:504] Using default 'crammd5' authenticator I0306 14:16:42.651919 27160 http.cpp:887] Using default 'basic' HTTP authenticator for realm 'mesos-master-readonly' I0306 14:16:42.652004 27160 http.cpp:887] Using default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' I0306 14:16:42.652042 27160 http.cpp:887] Using default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' I0306 14:16:42.652107 27160 master.cpp:584] Authorization enabled I0306 14:16:42.652220 27161 whitelist_watcher.cpp:77] No whitelist given I0306 14:16:42.652228 27165 hierarchical.cpp:149] Initialized hierarchical 
allocator process I0306 14:16:42.652447 27162 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 2.311958ms I0306 14:16:42.652469 27162 replica.cpp:320] Persisted replica status to STARTING I0306 14:16:42.652573 27162 recover.cpp:477] Replica is in STARTING status I0306 14:16:42.652951 27162 replica.cpp:673] Replica in STARTING status received a broadcasted recover request from __req_res__(5542)@10.99.136.60:39312 I0306 14:16:42.652984 27163 master.cpp:2017] Elected as the leading master! I0306 14:16:42.652999 27163 master.cpp:1560] Recovering from registrar I0306 14:16:42.653074 27164 registrar.cpp:329] Recovering registrar I0306 14:16:42.653125 27165 recover.cpp:197] Received a recover response from a replica in STARTING status I0306 14:16:42.653259 27161 recover.cpp:568] Updating replica status to VOTING
[jira] [Commented] (MESOS-4736) DockerContainerizerTest.ROOT_DOCKER_LaunchWithPersistentVolumes fails on CentOS 6
[ https://issues.apache.org/jira/browse/MESOS-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897765#comment-15897765 ] Till Toenshoff commented on MESOS-4736: --- [~kaysoky] do you have any update here? > DockerContainerizerTest.ROOT_DOCKER_LaunchWithPersistentVolumes fails on > CentOS 6 > - > > Key: MESOS-4736 > URL: https://issues.apache.org/jira/browse/MESOS-4736 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.28.0 > Environment: Centos6 + GCC 4.9 on AWS >Reporter: Joseph Wu >Assignee: Joseph Wu > Labels: flaky, mesosphere, test > > This test passes consistently on other OS's, but fails consistently on CentOS > 6. > Verbose logs from test failure: > {code} > [ RUN ] DockerContainerizerTest.ROOT_DOCKER_LaunchWithPersistentVolumes > I0222 18:16:12.327957 26681 leveldb.cpp:174] Opened db in 7.466102ms > I0222 18:16:12.330528 26681 leveldb.cpp:181] Compacted db in 2.540139ms > I0222 18:16:12.330580 26681 leveldb.cpp:196] Created db iterator in 16908ns > I0222 18:16:12.330592 26681 leveldb.cpp:202] Seeked to beginning of db in > 1403ns > I0222 18:16:12.330600 26681 leveldb.cpp:271] Iterated through 0 keys in the > db in 315ns > I0222 18:16:12.330634 26681 replica.cpp:779] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0222 18:16:12.331082 26698 recover.cpp:447] Starting replica recovery > I0222 18:16:12.331289 26698 recover.cpp:473] Replica is in EMPTY status > I0222 18:16:12.332162 26703 replica.cpp:673] Replica in EMPTY status received > a broadcasted recover request from (13761)@172.30.2.148:35274 > I0222 18:16:12.332701 26701 recover.cpp:193] Received a recover response from > a replica in EMPTY status > I0222 18:16:12.333230 26699 recover.cpp:564] Updating replica status to > STARTING > I0222 18:16:12.334102 26698 master.cpp:376] Master > 652149b4-3932-4d8b-ba6f-8c9d9045be70 (ip-172-30-2-148.mesosphere.io) started > on 172.30.2.148:35274 > I0222 18:16:12.334116 26698 master.cpp:378] Flags at 
startup: --acls="" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate="true" --authenticate_http="true" --authenticate_slaves="true" > --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/QEhLBS/credentials" --framework_sorter="drf" > --help="false" --hostname_lookup="true" --http_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" > --max_completed_tasks_per_framework="1000" --max_slave_ping_timeouts="5" > --quiet="false" --recovery_slave_removal_limit="100%" > --registry="replicated_log" --registry_fetch_timeout="1mins" > --registry_store_timeout="100secs" --registry_strict="true" > --root_submissions="true" --slave_ping_timeout="15secs" > --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" > --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/QEhLBS/master" > --zk_session_timeout="10secs" > I0222 18:16:12.334354 26698 master.cpp:423] Master only allowing > authenticated frameworks to register > I0222 18:16:12.334363 26698 master.cpp:428] Master only allowing > authenticated slaves to register > I0222 18:16:12.334369 26698 credentials.hpp:35] Loading credentials for > authentication from '/tmp/QEhLBS/credentials' > I0222 18:16:12.335366 26698 master.cpp:468] Using default 'crammd5' > authenticator > I0222 18:16:12.335492 26698 master.cpp:537] Using default 'basic' HTTP > authenticator > I0222 18:16:12.335623 26698 master.cpp:571] Authorization enabled > I0222 18:16:12.335752 26703 leveldb.cpp:304] Persisting metadata (8 bytes) to > leveldb took 2.314693ms > I0222 18:16:12.335769 26700 whitelist_watcher.cpp:77] No whitelist given > I0222 18:16:12.335778 26703 replica.cpp:320] Persisted replica status to > STARTING > I0222 18:16:12.335821 26697 hierarchical.cpp:144] Initialized hierarchical > allocator process > I0222 18:16:12.335965 26701 recover.cpp:473] Replica is in STARTING 
status > I0222 18:16:12.336771 26703 replica.cpp:673] Replica in STARTING status > received a broadcasted recover request from (13763)@172.30.2.148:35274 > I0222 18:16:12.337191 26696 recover.cpp:193] Received a recover response from > a replica in STARTING status > I0222 18:16:12.337635 26700 recover.cpp:564] Updating replica status to VOTING > I0222 18:16:12.337671 26703 master.cpp:1712] The newly elected leader is > master@172.30.2.148:35274 with id 652149b4-3932-4d8b-ba6f-8c9d9045be70 > I0222 18:16:12.337698 26703 master.cpp:1725] Elected as the leading master! > I0222 18:16:12.337713 26703 master.cpp:1470] Recovering from registrar > I0222 18:16:12.337828 26696 registrar.cpp:307] Recovering registrar > I0222
[jira] [Updated] (MESOS-7208) Persistent volume ownership is set to root when task is running with non-root user
[ https://issues.apache.org/jira/browse/MESOS-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-7208: -- Affects Version/s: 1.0.2 > Persistent volume ownership is set to root when task is running with non-root > user > -- > > Key: MESOS-7208 > URL: https://issues.apache.org/jira/browse/MESOS-7208 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.0.2, 1.1.0 >Reporter: Nikolay Ustinov >Assignee: Gilbert Song > > I’m running docker container in universal containerizer, mesos 1.1.0. > switch_user=true, isolator=filesystem/linux,docker/runtime. Container is > launched with marathon, “user”:”someappuser”. I’d want to use persistent > volume, but it’s exposed to container with root user permissions even if root > folder is created with someppuser ownership (looks like mesos do chown to > this folder). > here logs for my container: > {code} > I0305 22:51:36.414655 10175 slave.cpp:1701] Launching task > 'md_hdfs_journal.23f813ab-01dd-11e7-a012-0242ce94d92a' for framework > e9d0e39e-b67d-4142-b95d-b0987998eb92- > I0305 22:51:36.415118 10175 paths.cpp:536] Trying to chown > '/export/intssd/mesos-slave/workdir/slaves/85150805-a201-4b23-ab21-b332a458fc97-S10/frameworks/e9d0e39e-b67d-4142-b95d-b0987998eb92-/executors/md_hdfs_journal.23f813ab-01dd-11e7-a012-0242ce94d92a/runs/e978d4eb-5ec1-44ad-b50a-9ae6bfe1065a' > to user 'root' > I0305 22:51:36.422992 10175 slave.cpp:6179] Launching executor > 'md_hdfs_journal.23f813ab-01dd-11e7-a012-0242ce94d92a' of framework > e9d0e39e-b67d-4142-b95d-b0987998eb92- with resources cpus(*):0.1; > mem(*):32 in work directory > '/export/intssd/mesos-slave/workdir/slaves/85150805-a201-4b23-ab21-b332a458fc97-S10/frameworks/e9d0e39e-b67d-4142-b95d-b0987998eb92-/executors/md_hdfs_journal.23f813ab-01dd-11e7-a012-0242ce94d92a/runs/e978d4eb-5ec1-44ad-b50a-9ae6bfe1065a' > I0305 22:51:36.424278 10175 slave.cpp:1987] Queued task > 'md_hdfs_journal.23f813ab-01dd-11e7-a012-0242ce94d92a' for executor > 
'md_hdfs_journal.23f813ab-01dd-11e7-a012-0242ce94d92a' of framework > e9d0e39e-b67d-4142-b95d-b0987998eb92- > I0305 22:51:36.424347 10158 docker.cpp:1000] Skipping non-docker container > I0305 22:51:36.425639 10142 containerizer.cpp:938] Starting container > e978d4eb-5ec1-44ad-b50a-9ae6bfe1065a for executor > 'md_hdfs_journal.23f813ab-01dd-11e7-a012-0242ce94d92a' of framework > e9d0e39e-b67d-4142-b95d-b0987998eb92- > I0305 22:51:36.428725 10166 provisioner.cpp:294] Provisioning image rootfs > '/export/intssd/mesos-slave/workdir/provisioner/containers/e978d4eb-5ec1-44ad-b50a-9ae6bfe1065a/backends/copy/rootfses/0e2181e9-1bf2-42d4-8cb0-ee70e466c3ae' > for container e978d4eb-5ec1-44ad-b50a-9ae6bfe1065a > I0305 22:51:42.981240 10149 linux.cpp:695] Changing the ownership of the > persistent volume at > '/export/intssd/mesos-slave/data/volumes/roles/general_marathon_service_role/md_hdfs_journal#data#23f813aa-01dd-11e7-a012-0242ce94d92a' > with uid 0 and gid 0 > I0305 22:51:42.986593 10136 linux_launcher.cpp:421] Launching container > e978d4eb-5ec1-44ad-b50a-9ae6bfe1065a and cloning with namespaces CLONE_NEWNS > {code} > {code} > ls -la > /export/intssd/mesos-slave/workdir/slaves/85150805-a201-4b23-ab21-b332a458fc97-S10/frameworks/e9d0e39e-b67d-4142-b95d-b0987998eb92-/executors/md_hdfs_journal.23f813ab-01dd-11e7-a012-0242ce94d92a/runs/e978d4eb-5ec1-44ad-b50a-9ae6bfe1065a/ > drwxr-xr-x 3 someappuser someappgroup 4096 22:51 . > drwxr-xr-x 3 root root4096 22:51 .. > drwxr-xr-x 2 root root4096 22:51 data > -rw-r--r-- 1 root root 169 22:51 stderr > -rw-r--r-- 1 root root 183012 23:00 stdout > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7208) Persistent volume ownership is set to root when task is running with non-root user
[ https://issues.apache.org/jira/browse/MESOS-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-7208: -- Target Version/s: 1.2.0 > Persistent volume ownership is set to root when task is running with non-root > user > -- > > Key: MESOS-7208 > URL: https://issues.apache.org/jira/browse/MESOS-7208 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.0.2, 1.1.0 >Reporter: Nikolay Ustinov >Assignee: Gilbert Song > > I’m running docker container in universal containerizer, mesos 1.1.0. > switch_user=true, isolator=filesystem/linux,docker/runtime. Container is > launched with marathon, “user”:”someappuser”. I’d want to use persistent > volume, but it’s exposed to container with root user permissions even if root > folder is created with someppuser ownership (looks like mesos do chown to > this folder). > here logs for my container: > {code} > I0305 22:51:36.414655 10175 slave.cpp:1701] Launching task > 'md_hdfs_journal.23f813ab-01dd-11e7-a012-0242ce94d92a' for framework > e9d0e39e-b67d-4142-b95d-b0987998eb92- > I0305 22:51:36.415118 10175 paths.cpp:536] Trying to chown > '/export/intssd/mesos-slave/workdir/slaves/85150805-a201-4b23-ab21-b332a458fc97-S10/frameworks/e9d0e39e-b67d-4142-b95d-b0987998eb92-/executors/md_hdfs_journal.23f813ab-01dd-11e7-a012-0242ce94d92a/runs/e978d4eb-5ec1-44ad-b50a-9ae6bfe1065a' > to user 'root' > I0305 22:51:36.422992 10175 slave.cpp:6179] Launching executor > 'md_hdfs_journal.23f813ab-01dd-11e7-a012-0242ce94d92a' of framework > e9d0e39e-b67d-4142-b95d-b0987998eb92- with resources cpus(*):0.1; > mem(*):32 in work directory > '/export/intssd/mesos-slave/workdir/slaves/85150805-a201-4b23-ab21-b332a458fc97-S10/frameworks/e9d0e39e-b67d-4142-b95d-b0987998eb92-/executors/md_hdfs_journal.23f813ab-01dd-11e7-a012-0242ce94d92a/runs/e978d4eb-5ec1-44ad-b50a-9ae6bfe1065a' > I0305 22:51:36.424278 10175 slave.cpp:1987] Queued task > 'md_hdfs_journal.23f813ab-01dd-11e7-a012-0242ce94d92a' for executor > 
'md_hdfs_journal.23f813ab-01dd-11e7-a012-0242ce94d92a' of framework > e9d0e39e-b67d-4142-b95d-b0987998eb92- > I0305 22:51:36.424347 10158 docker.cpp:1000] Skipping non-docker container > I0305 22:51:36.425639 10142 containerizer.cpp:938] Starting container > e978d4eb-5ec1-44ad-b50a-9ae6bfe1065a for executor > 'md_hdfs_journal.23f813ab-01dd-11e7-a012-0242ce94d92a' of framework > e9d0e39e-b67d-4142-b95d-b0987998eb92- > I0305 22:51:36.428725 10166 provisioner.cpp:294] Provisioning image rootfs > '/export/intssd/mesos-slave/workdir/provisioner/containers/e978d4eb-5ec1-44ad-b50a-9ae6bfe1065a/backends/copy/rootfses/0e2181e9-1bf2-42d4-8cb0-ee70e466c3ae' > for container e978d4eb-5ec1-44ad-b50a-9ae6bfe1065a > I0305 22:51:42.981240 10149 linux.cpp:695] Changing the ownership of the > persistent volume at > '/export/intssd/mesos-slave/data/volumes/roles/general_marathon_service_role/md_hdfs_journal#data#23f813aa-01dd-11e7-a012-0242ce94d92a' > with uid 0 and gid 0 > I0305 22:51:42.986593 10136 linux_launcher.cpp:421] Launching container > e978d4eb-5ec1-44ad-b50a-9ae6bfe1065a and cloning with namespaces CLONE_NEWNS > {code} > {code} > ls -la > /export/intssd/mesos-slave/workdir/slaves/85150805-a201-4b23-ab21-b332a458fc97-S10/frameworks/e9d0e39e-b67d-4142-b95d-b0987998eb92-/executors/md_hdfs_journal.23f813ab-01dd-11e7-a012-0242ce94d92a/runs/e978d4eb-5ec1-44ad-b50a-9ae6bfe1065a/ > drwxr-xr-x 3 someappuser someappgroup 4096 22:51 . > drwxr-xr-x 3 root root4096 22:51 .. > drwxr-xr-x 2 root root4096 22:51 data > -rw-r--r-- 1 root root 169 22:51 stderr > -rw-r--r-- 1 root root 183012 23:00 stdout > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7211) Document SUPPRESS HTTP call
[ https://issues.apache.org/jira/browse/MESOS-7211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-7211: -- Labels: newbie (was: ) > Document SUPPRESS HTTP call > --- > > Key: MESOS-7211 > URL: https://issues.apache.org/jira/browse/MESOS-7211 > Project: Mesos > Issue Type: Documentation > Components: documentation >Affects Versions: 1.1.0 >Reporter: Bruce Merry >Priority: Minor > Labels: newbie > > The documentation at > http://mesos.apache.org/documentation/latest/scheduler-http-api/ doesn't list > the SUPPRESS call as one of the call types, but it does seem to be > implemented. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-7212) CommandInfo first argument is ignored
[ https://issues.apache.org/jira/browse/MESOS-7212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897516#comment-15897516 ] Gastón Kleiman commented on MESOS-7212: --- This behaviour is documented in {{mesos.proto}} (https://github.com/apache/mesos/blob/5ac6e156390717c34586e6e19fee4bc4cb6b01d5/include/mesos/mesos.proto#L621-L626): {noformat} // 2) If 'shell == false', the command will be launched by passing // arguments to an executable. The 'value' specified will be // treated as the filename of the executable. The 'arguments' // will be treated as the arguments to the executable. This is // similar to how POSIX exec families launch processes (i.e., // execlp(value, arguments(0), arguments(1), ...)). {noformat} The POSIX exec calls expect the first argument to be the name of the exec'd file. > CommandInfo first argument is ignored > - > > Key: MESOS-7212 > URL: https://issues.apache.org/jira/browse/MESOS-7212 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.1.0 > Environment: MacOS Sierra 10.12.2 >Reporter: Egor Ryashin > > First argument of CommandInfo is ignored, for example using: > {code} > CommandInfo commandInfo = CommandInfo.newBuilder() > .setShell(false) > .addArguments("1") > .addArguments("2") > .addArguments("3") > .setValue("echo") > {code} > I get in the sandbox stdout: > {noformat} > Starting task ta3e2-6234-4f8c-a609-e4b9064b4cf5 > /usr/local/Cellar/mesos/1.1.0/libexec/mesos/mesos-containerizer launch > --command="{"arguments":["1","2","3"],"shell":false,"value":"echo"}" > --help="false" > Forked command at 95660 > 2 3 > Command exited with status 0 (pid: 95660) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
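The behavior the comment describes can be reproduced outside Mesos. The sketch below uses Python's `subprocess` with a separate `executable` argument, which mirrors `execlp(value, arguments(0), arguments(1), ...)`: the first list element is consumed as `argv[0]`, the program's own name. The path `/bin/echo` is an assumption for a typical Linux system.

```python
import subprocess

# Mimics CommandInfo with shell=false, value="echo", arguments=["1","2","3"]:
# the argument list becomes argv directly, so "1" is consumed as argv[0]
# and echo only ever sees "2" and "3".
out = subprocess.run(
    ["1", "2", "3"],         # arguments -> argv; "1" becomes argv[0]
    executable="/bin/echo",  # value -> the file actually executed
    capture_output=True,
    text=True,
).stdout
print(out)  # -> "2 3"
```

Repeating the executable name as the first argument (i.e., `arguments=["echo", "1", "2", "3"]`) makes all three values visible to the command, which is why this is documented behavior rather than a bug.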
[jira] [Created] (MESOS-7212) CommandInfo first argument is ignored
Egor Ryashin created MESOS-7212: --- Summary: CommandInfo first argument is ignored Key: MESOS-7212 URL: https://issues.apache.org/jira/browse/MESOS-7212 Project: Mesos Issue Type: Bug Affects Versions: 1.1.0 Environment: MacOS Sierra 10.12.2 Reporter: Egor Ryashin First argument of CommandInfo is ignored, for example using: {code} CommandInfo commandInfo = CommandInfo.newBuilder() .setShell(false) .addArguments("1") .addArguments("2") .addArguments("3") .setValue("echo") {code} I get in the sandbox stdout: {noformat} Starting task ta3e2-6234-4f8c-a609-e4b9064b4cf5 /usr/local/Cellar/mesos/1.1.0/libexec/mesos/mesos-containerizer launch --command="{"arguments":["1","2","3"],"shell":false,"value":"echo"}" --help="false" Forked command at 95660 2 3 Command exited with status 0 (pid: 95660) {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-7095) Basic make check from getting started link fails
[ https://issues.apache.org/jira/browse/MESOS-7095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897364#comment-15897364 ] Alexander Rukletsov commented on MESOS-7095: The problem is likely with the {{brew}}-packaged {{apr}}. Here is what I see on my machine: {noformat} alex@alexr.local: /usr/local/Cellar/apr/1.5.2_3 $ lla total 88 -rw-r--r-- 1 alex wheel 7.7K Apr 25 2015 CHANGES -rw-r--r-- 1 alex staff 534B Feb 7 21:44 INSTALL_RECEIPT.json -rw-r--r-- 1 alex wheel 18K Apr 25 2015 LICENSE -rw-r--r-- 1 alex wheel 527B Apr 25 2015 NOTICE -rw-r--r-- 1 alex wheel 5.5K Apr 25 2015 README drwxr-xr-x 3 alex wheel 102B Apr 25 2015 bin/ drwxr-xr-x 6 alex wheel 204B Apr 25 2015 libexec/ alex@alexr.local: /usr/local/Cellar/apr/1.5.2_3 $ lla libexec/include total 0 drwxr-xr-x 40 alex wheel 1.3K Apr 25 2015 apr-1/ {noformat} Hence the configure script cannot find {{apr}}'s includes. Try telling {{configure}} explicitly where {{apr}}'s includes are. > Basic make check from getting started link fails > > > Key: MESOS-7095 > URL: https://issues.apache.org/jira/browse/MESOS-7095 > Project: Mesos > Issue Type: Bug > Components: build >Reporter: Alec Bruns > > {code} > *** Aborted at 1486657215 (unix time) try "date -d @1486657215" if you are > using GNU date *** PC: @ 0x1080b7367 apr_pool_create_ex *** SIGSEGV > (@0x30) received by PID 25167 (TID 0x7fffbdd073c0) stack trace: *** > @ 0x7fffb50c7bba _sigtramp > @ 0x72c0517 (unknown) > @ 0x107eaa13a svn_pool_create_ex > @ 0x107691d6e svn::diff() > @ 0x107691042 SVNTest_DiffPatch_Test::TestBody() > @ 0x1077026ba > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0x1076b3ad7 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0x1076b3985 testing::Test::Run() > @ 0x1076b54f8 testing::TestInfo::Run() > @ 0x1076b6867 testing::TestCase::Run() > @ 0x1076c65dc testing::internal::UnitTestImpl::RunAllTests() > @ 0x1077033da > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0x1076c6007 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0x1076c5ed8 testing::UnitTest::Run() > @ 0x1074d55c1 RUN_ALL_TESTS() > @ 0x1074d5580 main > @ 0x7fffb4eba255 start > make[6]: *** [check-local] Segmentation fault: 11 > make[5]: *** [check-am] Error 2 > make[4]: *** [check-recursive] Error 1 > make[3]: *** [check] Error 2 > make[2]: *** [check-recursive] Error 1 > make[1]: *** [check] Error 2 > make: *** [check-recursive] Error 1 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
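Concretely, the suggested workaround can be sketched as follows (a sketch, assuming the Homebrew {{apr}} layout shown in the listing above; {{/usr/local/opt/apr}} is brew's stable symlink to the versioned keg, so it survives {{apr}} upgrades):

```shell
# Homebrew's apr keeps its headers under libexec/include (see the listing
# above), where configure does not look by default. Point configure at the
# keg explicitly via the stable /usr/local/opt symlink, then re-run the build.
../configure --with-apr=/usr/local/opt/apr/libexec
make check
```

The same pattern applies to other brew-relocated dependencies (e.g. a {{--with-svn=/usr/local/opt/subversion}} flag for the Subversion headers, which the SVNTest stack trace above also touches).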
[jira] [Created] (MESOS-7211) Document SUPPRESS HTTP call
Bruce Merry created MESOS-7211: -- Summary: Document SUPPRESS HTTP call Key: MESOS-7211 URL: https://issues.apache.org/jira/browse/MESOS-7211 Project: Mesos Issue Type: Documentation Components: documentation Affects Versions: 1.1.0 Reporter: Bruce Merry Priority: Minor The documentation at http://mesos.apache.org/documentation/latest/scheduler-http-api/ doesn't list the SUPPRESS call as one of the call types, but it does seem to be implemented. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
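For reference while the documentation gap stands, a sketch of what a SUPPRESS call looks like against the v1 scheduler HTTP API (the framework id value and master address below are placeholders, not taken from this issue):

```shell
# Build the JSON body of a SUPPRESS call for the v1 scheduler API.
# The call is POSTed to /api/v1/scheduler on the master, like the other
# documented call types (SUBSCRIBE, ACCEPT, DECLINE, REVIVE, ...).
suppress_body() {
    printf '{"framework_id": {"value": "%s"}, "type": "SUPPRESS"}' "$1"
}

suppress_body "0907c2f7-a9ea-4a72-b0f5-dc01dbc39c70-0001"
# Illustrative send (placeholder master address):
#   curl -X POST http://<master>:5050/api/v1/scheduler \
#        -H 'Content-Type: application/json' \
#        -d "$(suppress_body <framework-id>)"
```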
[jira] [Assigned] (MESOS-5824) Include disk source information in stringification
[ https://issues.apache.org/jira/browse/MESOS-5824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rojas reassigned MESOS-5824: -- Assignee: (was: Tim Harper) > Include disk source information in stringification > -- > > Key: MESOS-5824 > URL: https://issues.apache.org/jira/browse/MESOS-5824 > Project: Mesos > Issue Type: Improvement > Components: stout >Affects Versions: 0.28.2 >Reporter: Tim Harper >Priority: Minor > Labels: mesosphere > > Some frameworks (like kafka_mesos) ignore the Source field when trying to > reserve an offered mount or path persistent volume; the resulting error > message is bewildering: > {code:none} > Task uses more resources > cpus(*):4; mem(*):4096; ports(*):[31000-31000]; disk(kafka, > kafka)[kafka_0:data]:960679 > than available > cpus(*):32; mem(*):256819; ports(*):[31000-32000]; disk(kafka, > kafka)[kafka_0:data]:960679; disk(*):240169; > {code} > The stringification of disk resources should include source information. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-5824) Include disk source information in stringification
[ https://issues.apache.org/jira/browse/MESOS-5824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896945#comment-15896945 ] Alexander Rojas commented on MESOS-5824: Review is closed due to inactivity. > Include disk source information in stringification > -- > > Key: MESOS-5824 > URL: https://issues.apache.org/jira/browse/MESOS-5824 > Project: Mesos > Issue Type: Improvement > Components: stout >Affects Versions: 0.28.2 >Reporter: Tim Harper >Priority: Minor > Labels: mesosphere > > Some frameworks (like kafka_mesos) ignore the Source field when trying to > reserve an offered mount or path persistent volume; the resulting error > message is bewildering: > {code:none} > Task uses more resources > cpus(*):4; mem(*):4096; ports(*):[31000-31000]; disk(kafka, > kafka)[kafka_0:data]:960679 > than available > cpus(*):32; mem(*):256819; ports(*):[31000-32000]; disk(kafka, > kafka)[kafka_0:data]:960679; disk(*):240169; > {code} > The stringification of disk resources should include source information. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7210) MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( pid namespace mismatch )
[ https://issues.apache.org/jira/browse/MESOS-7210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wojciech Sielski updated MESOS-7210: Summary: MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( pid namespace mismatch ) (was: MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( pid namespace missmatch )) > MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( > pid namespace mismatch ) > --- > > Key: MESOS-7210 > URL: https://issues.apache.org/jira/browse/MESOS-7210 > Project: Mesos > Issue Type: Bug > Components: docker >Affects Versions: 1.1.0 > Environment: Ubuntu 16.04.02 > Docker version 1.13.1 > mesos 1.1.0, runs from container > docker containers spawned by marathon 1.4.1 >Reporter: Wojciech Sielski > > When running mesos-slave with option "docker_mesos_image" like: > {code} > --master=zk://standalone:2181/mesos --containerizers=docker,mesos > --executor_registration_timeout=5mins --hostname=standalone --ip=0.0.0.0 > --docker_stop_timeout=5secs --gc_delay=1days > --docker_socket=/var/run/docker.sock --no-systemd_enable_support > --work_dir=/tmp/mesos --docker_mesos_image=panteras/paas-in-a-box:0.4.0 > {code} > from the container that was started with option "pid: host" like: > {code} > net:host > privileged: true > pid:host > {code} > and example marathon job, that use MESOS_HTTP checks like: > {code} > { > "id": "python-example-stable", > "cmd": "python3 -m http.server 8080", > "mem": 16, > "cpus": 0.1, > "instances": 2, > "container": { >"type": "DOCKER", >"docker": { > "image": "python:alpine", > "network": "BRIDGE", > "portMappings": [ > { "containerPort": 8080, "hostPort": 0, "protocol": "tcp" } > ] >} > }, > "env": { >"SERVICE_NAME" : "python" > }, > "healthChecks": [ >{ > "path": "/", > "portIndex": 0, > "protocol": "MESOS_HTTP", > "gracePeriodSeconds": 30, > "intervalSeconds": 10, > "timeoutSeconds": 30, > "maxConsecutiveFailures": 3 >} > ] > } > {code} > I see 
errors like: > {code} > F0306 07:41:58.84429335 health_checker.cpp:94] Failed to enter the net > namespace of task (pid: '13527'): Pid 13527 does not exist > *** Check failure stack trace: *** > @ 0x7f51770b0c1d google::LogMessage::Fail() > @ 0x7f51770b29d0 google::LogMessage::SendToLog() > @ 0x7f51770b0803 google::LogMessage::Flush() > @ 0x7f51770b33f9 google::LogMessageFatal::~LogMessageFatal() > @ 0x7f517647ce46 > _ZNSt17_Function_handlerIFivEZN5mesos8internal6health14cloneWithSetnsERKSt8functionIS0_E6OptionIiERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaISG_EEEUlvE_E9_M_invokeERKSt9_Any_data > @ 0x7f517647bf2b mesos::internal::health::cloneWithSetns() > @ 0x7f517648374b std::_Function_handler<>::_M_invoke() > @ 0x7f5177068167 process::internal::cloneChild() > @ 0x7f5177065c32 process::subprocess() > @ 0x7f5176481a9d > mesos::internal::health::HealthCheckerProcess::_httpHealthCheck() > @ 0x7f51764831f7 > mesos::internal::health::HealthCheckerProcess::_healthCheck() > @ 0x7f517701f38c process::ProcessBase::visit() > @ 0x7f517702c8b3 process::ProcessManager::resume() > @ 0x7f517702fb77 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv > @ 0x7f51754ddc80 (unknown) > @ 0x7f5174cf06ba start_thread > @ 0x7f5174a2682d (unknown) > I0306 07:41:59.077986 9 health_checker.cpp:199] Ignoring failure as > health check still in grace period > {code} > It looks like the docker_mesos_image option causes the newly spawned Mesos process not to use the "pid: host" setting of its parent container, but to get its own PID namespace instead (so regardless of whether the parent container was started with "pid: host", it will never be able to find the task's PID). -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (MESOS-7210) MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( pid namespace missmatch )
Wojciech Sielski created MESOS-7210: --- Summary: MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( pid namespace missmatch ) Key: MESOS-7210 URL: https://issues.apache.org/jira/browse/MESOS-7210 Project: Mesos Issue Type: Bug Components: docker Affects Versions: 1.1.0 Environment: Ubuntu 16.04.02 Docker version 1.13.1 mesos 1.1.0, runs from container docker containers spawned by marathon 1.4.1 Reporter: Wojciech Sielski When running mesos-slave with option "docker_mesos_image" like: {code} --master=zk://standalone:2181/mesos --containerizers=docker,mesos --executor_registration_timeout=5mins --hostname=standalone --ip=0.0.0.0 --docker_stop_timeout=5secs --gc_delay=1days --docker_socket=/var/run/docker.sock --no-systemd_enable_support --work_dir=/tmp/mesos --docker_mesos_image=panteras/paas-in-a-box:0.4.0 {code} from the container that was started with option "pid: host" like: {code} net:host privileged: true pid:host {code} and example marathon job, that use MESOS_HTTP checks like: {code} { "id": "python-example-stable", "cmd": "python3 -m http.server 8080", "mem": 16, "cpus": 0.1, "instances": 2, "container": { "type": "DOCKER", "docker": { "image": "python:alpine", "network": "BRIDGE", "portMappings": [ { "containerPort": 8080, "hostPort": 0, "protocol": "tcp" } ] } }, "env": { "SERVICE_NAME" : "python" }, "healthChecks": [ { "path": "/", "portIndex": 0, "protocol": "MESOS_HTTP", "gracePeriodSeconds": 30, "intervalSeconds": 10, "timeoutSeconds": 30, "maxConsecutiveFailures": 3 } ] } {code} I see the errors like: {code} F0306 07:41:58.84429335 health_checker.cpp:94] Failed to enter the net namespace of task (pid: '13527'): Pid 13527 does not exist *** Check failure stack trace: *** @ 0x7f51770b0c1d google::LogMessage::Fail() @ 0x7f51770b29d0 google::LogMessage::SendToLog() @ 0x7f51770b0803 google::LogMessage::Flush() @ 0x7f51770b33f9 google::LogMessageFatal::~LogMessageFatal() @ 0x7f517647ce46 
_ZNSt17_Function_handlerIFivEZN5mesos8internal6health14cloneWithSetnsERKSt8functionIS0_E6OptionIiERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaISG_EEEUlvE_E9_M_invokeERKSt9_Any_data @ 0x7f517647bf2b mesos::internal::health::cloneWithSetns() @ 0x7f517648374b std::_Function_handler<>::_M_invoke() @ 0x7f5177068167 process::internal::cloneChild() @ 0x7f5177065c32 process::subprocess() @ 0x7f5176481a9d mesos::internal::health::HealthCheckerProcess::_httpHealthCheck() @ 0x7f51764831f7 mesos::internal::health::HealthCheckerProcess::_healthCheck() @ 0x7f517701f38c process::ProcessBase::visit() @ 0x7f517702c8b3 process::ProcessManager::resume() @ 0x7f517702fb77 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv @ 0x7f51754ddc80 (unknown) @ 0x7f5174cf06ba start_thread @ 0x7f5174a2682d (unknown) I0306 07:41:59.077986 9 health_checker.cpp:199] Ignoring failure as health check still in grace period {code} It looks like the docker_mesos_image option causes the newly spawned Mesos process not to use the "pid: host" setting of its parent container, but to get its own PID namespace instead (so regardless of whether the parent container was started with "pid: host", it will never be able to find the task's PID). -- This message was sent by Atlassian JIRA (v6.3.15#6346)
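The failure mode in the log above can be illustrated independently of Mesos. Per the error message, the health checker resolves the task's namespace handle under /proc/<task-pid> before entering it, so the task's host pid must be visible inside the checker's own PID namespace. A minimal sketch of that visibility check (Linux only; 13527 is the task pid from the log, and whether it exists on any given host is incidental):

```shell
# Succeeds when <pid> is visible in the current PID namespace and its net
# namespace handle can be opened. This is the kind of lookup that fails with
# "Pid 13527 does not exist" when the checker runs in a separate PID
# namespace instead of the host's ("pid: host").
pid_net_ns_visible() {
    [ -e "/proc/$1/ns/net" ]
}

if pid_net_ns_visible 13527; then
    echo "net namespace of pid 13527 is reachable"
else
    echo "Pid 13527 does not exist in this PID namespace"
fi
```

Inside a container with its own PID namespace, only pids of that namespace appear under /proc, which is why the host-side task pid can never be found there.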