[jira] [Assigned] (MESOS-3094) Mesos on Windows

2017-04-03 Thread Li Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Li reassigned MESOS-3094:


Assignee: Li Li  (was: Alex Clemmer)

> Mesos on Windows
> 
>
> Key: MESOS-3094
> URL: https://issues.apache.org/jira/browse/MESOS-3094
> Project: Mesos
>  Issue Type: Epic
>  Components: containerization, libprocess, stout
>Reporter: Joseph Wu
>Assignee: Li Li
>  Labels: mesosphere
>
> The ultimate goal of this is to have all containerizer tests running and 
> passing on Windows Server.
> # It must build (see MESOS-898).
> # All OS-specific code (that is touched by the containerizer) must be ported 
> to Windows.
> # The containerizer itself must be ported to Windows, alongside the 
> MesosContainerizer.
> Note: Isolation (cgroups) will probably not exist on Windows.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7337) DefaultExecutorCheckTest.CommandCheckTimeout becomes flaky under load

2017-04-03 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-7337:

Attachment: DefaultExecutorCheckTest.CommandCheckTimeout.log

[~anandmazumdar]: Attached output for {{MESOS_VERBOSE=1 GLOG_v=1}}, again from 
parallel execution. I was not able to reproduce this in standalone execution 
with extra load from e.g., concurrent compilation jobs or {{stress}}.

> DefaultExecutorCheckTest.CommandCheckTimeout becomes flaky under load
> -
>
> Key: MESOS-7337
> URL: https://issues.apache.org/jira/browse/MESOS-7337
> Project: Mesos
>  Issue Type: Bug
>  Components: flaky, test
> Environment: Mac OS 10.12.4 (16E195), SSL debug build w/o 
> optimizations, clang version 5.0.0 (http://llvm.org/git/clang 
> c511a96ffe744933459ef64bf963629538057a90) (http://llvm.org/git/llvm 
> 0cd81d8a1055f167e0f588dd1b476863b00da3d5)
>Reporter: Benjamin Bannier
>  Labels: flaky-test, mesosphere
> Attachments: DefaultExecutorCheckTest.CommandCheckTimeout.log
>
>
> The test {{DefaultExecutorCheckTest.CommandCheckTimeout}} randomly fails for 
> me when executing tests in parallel, e.g.,
> {code}
> [ RUN  ] DefaultExecutorCheckTest.CommandCheckTimeout
> ../../src/tests/check_tests.cpp:1374: Failure
> Failed to wait 15secs for updateCheckResultTimeout
> ../../src/tests/check_tests.cpp:1334: Failure
> Actual function call count doesn't match EXPECT_CALL(*scheduler, update(_, 
> _))...
>  Expected: to be called at least 3 times
>Actual: called twice - unsatisfied and active
> [  FAILED  ] DefaultExecutorCheckTest.CommandCheckTimeout (25351 ms)
> {code}
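
One way to approximate the parallel execution described above is to start several copies of the same test at once. The following is a minimal sketch, not the runner used in the report: the binary path `./mesos-tests` and both helper names are assumptions, while `--gtest_filter` and `--gtest_repeat` are standard GoogleTest flags.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def gtest_command(binary: str, test_filter: str, repeat: int) -> list[str]:
    """Build a GoogleTest command line that repeats one test `repeat` times."""
    return [binary, f"--gtest_filter={test_filter}", f"--gtest_repeat={repeat}"]


def run_concurrently(command: list[str], copies: int) -> list[int]:
    """Start `copies` instances of `command` at once and return their exit codes."""
    def run_one(_index: int) -> int:
        return subprocess.run(command, capture_output=True).returncode

    with ThreadPoolExecutor(max_workers=copies) as pool:
        return list(pool.map(run_one, range(copies)))


if __name__ == "__main__":
    # Hypothetical binary path; adjust to wherever the test binary was built.
    cmd = gtest_command(
        "./mesos-tests", "DefaultExecutorCheckTest.CommandCheckTimeout", 10)
    print(run_concurrently(cmd, copies=4))
```

A non-zero exit code from any copy indicates that at least one repetition failed in that process.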





[jira] [Comment Edited] (MESOS-7337) DefaultExecutorCheckTest.CommandCheckTimeout becomes flaky under load

2017-04-03 Thread Benjamin Bannier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954151#comment-15954151
 ] 

Benjamin Bannier edited comment on MESOS-7337 at 4/3/17 8:54 PM:
-

[~anandmazumdar]: Attached output for {{MESOS_VERBOSE=1 GLOG_v=1}}, again from 
parallel execution. I was not able to reproduce this in standalone execution 
with extra load from e.g., concurrent compilation jobs or {{stress}}, so this 
could also be related to the test not behaving well in parallel execution due to 
conflicts.


was (Author: bbannier):
[~anandmazumdar]: Attached output for {{MESOS_VERBOSE=1 GLOG_v=1}}, again from 
parallel execution. I was not able to reproduce this in standalone execution 
with extra load from e.g., concurrent compilation jobs or {{stress}}.

> DefaultExecutorCheckTest.CommandCheckTimeout becomes flaky under load
> -
>
> Key: MESOS-7337
> URL: https://issues.apache.org/jira/browse/MESOS-7337
> Project: Mesos
>  Issue Type: Bug
>  Components: flaky, test
> Environment: Mac OS 10.12.4 (16E195), SSL debug build w/o 
> optimizations, clang version 5.0.0 (http://llvm.org/git/clang 
> c511a96ffe744933459ef64bf963629538057a90) (http://llvm.org/git/llvm 
> 0cd81d8a1055f167e0f588dd1b476863b00da3d5)
>Reporter: Benjamin Bannier
>  Labels: flaky-test, mesosphere
> Attachments: DefaultExecutorCheckTest.CommandCheckTimeout.log
>
>
> The test {{DefaultExecutorCheckTest.CommandCheckTimeout}} randomly fails for 
> me when executing tests in parallel, e.g.,
> {code}
> [ RUN  ] DefaultExecutorCheckTest.CommandCheckTimeout
> ../../src/tests/check_tests.cpp:1374: Failure
> Failed to wait 15secs for updateCheckResultTimeout
> ../../src/tests/check_tests.cpp:1334: Failure
> Actual function call count doesn't match EXPECT_CALL(*scheduler, update(_, 
> _))...
>  Expected: to be called at least 3 times
>Actual: called twice - unsatisfied and active
> [  FAILED  ] DefaultExecutorCheckTest.CommandCheckTimeout (25351 ms)
> {code}





[jira] [Updated] (MESOS-6742) Adding support for s390x architecture

2017-04-03 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-6742:
--
Shepherd: Till Toenshoff  (was: Vinod Kone)

I'll let [~tillt] shepherd this since he has been reviewing the patch.

> Adding support for s390x architecture 
> --
>
> Key: MESOS-6742
> URL: https://issues.apache.org/jira/browse/MESOS-6742
> Project: Mesos
>  Issue Type: Bug
>Reporter: Ayanampudi Varsha
>Assignee: Ayanampudi Varsha
>
> There are 2 issues:
> 1. LdcacheTest.Parse test case fails on s390x machines.
> 2. From the value of flag docker_registry in slave/flags.cpp, amd64 images 
> get downloaded due to which test cases fail on s390x with "Exec format Error"





[jira] [Commented] (MESOS-7209) Mesos failed to build due to error MSB6006: "cmd.exe" exited with code 255 on windows

2017-04-03 Thread Li Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954029#comment-15954029
 ] 

Li Li commented on MESOS-7209:
--

I hit the same issue with main stream. I will take a look. Thanks. 

> Mesos failed to build due to error MSB6006: "cmd.exe" exited with code 255 on 
> windows
> -
>
> Key: MESOS-7209
> URL: https://issues.apache.org/jira/browse/MESOS-7209
> Project: Mesos
>  Issue Type: Bug
> Environment: Windows 10 (64bit) + VS2015 Update 3
>Reporter: Karen Huang
>
> I tried to build Mesos with the Debug|x64 configuration on Windows. The build 
> failed with error MSB6006: "cmd.exe" exited with code 255 
> [F:\mesos\build_x64\ensure_tool_arch.vcxproj]. The error is reported when 
> building the ensure_tool_arch.vcxproj project.
> Here are the repro steps:
> 1. git clone -c core.autocrlf=true https://github.com/apache/mesos 
> F:\mesos\src
> 2. Open a VS amd64 command prompt as admin and browse to F:\mesos\src
> 3. set PreferredToolArchitecture=x64
> 4. bootstrap.bat
> 5. mkdir build_x64 && pushd build_x64
> 6. cmake ..\src -G "Visual Studio 14 2015 Win64" -DENABLE_LIBEVENT=1 
> -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="C:\gnuwin32\bin"
> 7. msbuild Mesos.sln /p:Configuration=Debug /p:Platform=x64 /m /t:Rebuild
> Error message:
>  CustomBuild:
>  Building Custom Rule F:/mesos/src/CMakeLists.txt
>  CMake does not need to re-run because 
> F:\mesos\build_x64\CMakeFiles\generate.stamp is up-to-date.
>  ( was unexpected at this time.
> 43>C:\Program Files 
> (x86)\MSBuild\Microsoft.Cpp\v4.0\V140\Microsoft.CppCommon.targets(171,5): 
> error MSB6006: "cmd.exe" exited with code 255. 
> [F:\mesos\build_x64\ensure_tool_arch.vcxproj]
> If you build the project ensure_tool_arch.vcxproj separately in the VS IDE, the 
> error output is as below:
> 2>-- Rebuild All started: Project: ensure_tool_arch, Configuration: Debug 
> x64 --
> 2>  Building Custom Rule D:/Mesos/src/CMakeLists.txt
> 2>  CMake does not need to re-run because 
> D:\Mesos\build_x64\CMakeFiles\generate.stamp is up-to-date.
> 2>  ( was unexpected at this time.
> 2>C:\Program Files 
> (x86)\MSBuild\Microsoft.Cpp\v4.0\V140\Microsoft.CppCommon.targets(171,5): 
> error MSB6006: "cmd.exe" exited with code 255.





[jira] [Created] (MESOS-7338) Implement clang-tidy check(s) for namespace usage

2017-04-03 Thread Neil Conway (JIRA)
Neil Conway created MESOS-7338:
--

 Summary: Implement clang-tidy check(s) for namespace usage
 Key: MESOS-7338
 URL: https://issues.apache.org/jira/browse/MESOS-7338
 Project: Mesos
  Issue Type: Bug
Reporter: Neil Conway


For example, if a {{.cpp}} file contains both {{x::y}} and {{using x::y}}, that 
is typically a style mistake.

We could potentially identify unused {{using}} statements, although that can be 
tricky due to C++ identifier lookup rules.
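
To make the first case concrete, here is a rough text-based sketch of the idea. It is not a clang-tidy check: a real check would work on the AST, whereas this regex heuristic ignores comments, strings, and scoping. The function name is an assumption made for illustration.

```python
import re

# Matches using-declarations such as `using std::string;` at the start of a
# line, but not `using namespace x;` or alias declarations.
USING_DECL = re.compile(r"^\s*using\s+((?:\w+::)+\w+)\s*;", re.MULTILINE)


def redundant_qualifications(source: str) -> list[str]:
    """Return names that have a using-declaration but are still written qualified."""
    flagged = []
    for match in USING_DECL.finditer(source):
        qualified = match.group(1)
        # Drop the using-declaration itself, then search what is left.
        rest = source[:match.start()] + source[match.end():]
        pattern = r"(?<![\w:])" + re.escape(qualified) + r"(?!\w)"
        if re.search(pattern, rest):
            flagged.append(qualified)
    return flagged


if __name__ == "__main__":
    example = (
        "using std::string;\n"
        "using std::vector;\n"
        "std::string name;\n"
        "vector<int> values;\n"
    )
    print(redundant_qualifications(example))
```

In the example input only `std::string` appears both in a using-declaration and in qualified form, so it is the only name reported.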





[jira] [Commented] (MESOS-7332) when submit an docker job to mesos, the agent show errors

2017-04-03 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953878#comment-15953878
 ] 

Gilbert Song commented on MESOS-7332:
-

[~helloiss], I tested on Ubuntu 14.04 and 15.10, but I am in the US. Based on 
[~yuyang]'s comment on MESOS-6810, this seems to be because Docker Hub is blocked 
in China. [~helloiss], could you try out the solution [~yuyang] posted?

> when submit an docker job to mesos, the agent show errors 
> --
>
> Key: MESOS-7332
> URL: https://issues.apache.org/jira/browse/MESOS-7332
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.2.0
> Environment: ubuntu 16.04
>Reporter: liliang
>Priority: Critical
>
> When I run the following task 
> sudo mesos-execute --master=10.139.176.201:5050 --name=test 
> --docker_image=hello-world --shell=true
> I get an error on the agent side:
> Running on machine: gorilla
> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
> E0331 22:45:05.706740 20993 shell.hpp:107] Command 'hadoop version 2>&1' 
> failed; this is the output:
> sh: 1: hadoop: not found
> E0331 22:46:14.837656 21014 slave.cpp:4650] Container 
> '48256fd6-9f45-4725-a72f-195326798f2d' for executor 'test' of framework 
> 0a104215-eb82-4b1b-93e6-3fc765fc67af-0003 failed to start: Collect failed: 
> Failed to perform 'curl': curl: (35) gnutls_handshake() failed: Error in the 
> push function.





[jira] [Commented] (MESOS-7337) DefaultExecutorCheckTest.CommandCheckTimeout becomes flaky under load

2017-04-03 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953826#comment-15953826
 ] 

Anand Mazumdar commented on MESOS-7337:
---

[~bbannier] Can you attach the verbose logs?

cc: [~gkleiman]

> DefaultExecutorCheckTest.CommandCheckTimeout becomes flaky under load
> -
>
> Key: MESOS-7337
> URL: https://issues.apache.org/jira/browse/MESOS-7337
> Project: Mesos
>  Issue Type: Bug
>  Components: flaky, test
> Environment: Mac OS 10.12.4 (16E195), SSL debug build w/o 
> optimizations, clang version 5.0.0 (http://llvm.org/git/clang 
> c511a96ffe744933459ef64bf963629538057a90) (http://llvm.org/git/llvm 
> 0cd81d8a1055f167e0f588dd1b476863b00da3d5)
>Reporter: Benjamin Bannier
>  Labels: flaky-test, mesosphere
>
> The test {{DefaultExecutorCheckTest.CommandCheckTimeout}} randomly fails for 
> me when executing tests in parallel, e.g.,
> {code}
> [ RUN  ] DefaultExecutorCheckTest.CommandCheckTimeout
> ../../src/tests/check_tests.cpp:1374: Failure
> Failed to wait 15secs for updateCheckResultTimeout
> ../../src/tests/check_tests.cpp:1334: Failure
> Actual function call count doesn't match EXPECT_CALL(*scheduler, update(_, 
> _))...
>  Expected: to be called at least 3 times
>Actual: called twice - unsatisfied and active
> [  FAILED  ] DefaultExecutorCheckTest.CommandCheckTimeout (25351 ms)
> {code}





[jira] [Updated] (MESOS-7337) DefaultExecutorCheckTest.CommandCheckTimeout becomes flaky under load

2017-04-03 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-7337:

Environment: Mac OS 10.12.4 (16E195), SSL debug build w/o optimizations, 
clang version 5.0.0 (http://llvm.org/git/clang 
c511a96ffe744933459ef64bf963629538057a90) (http://llvm.org/git/llvm 
0cd81d8a1055f167e0f588dd1b476863b00da3d5)  (was: Mac OS 10.12.4 (16E195), SSL 
debug build w/o optimizations, clang version 5.0.0 (http://llvm.org/git/clang 
c511a96ffe744933459ef64bf963629538057a90) (http://llvm.org/git/llvm 
0cd81d8a1055f167e0f588dd1b476863b00da3d5), )

> DefaultExecutorCheckTest.CommandCheckTimeout becomes flaky under load
> -
>
> Key: MESOS-7337
> URL: https://issues.apache.org/jira/browse/MESOS-7337
> Project: Mesos
>  Issue Type: Bug
>  Components: flaky, test
> Environment: Mac OS 10.12.4 (16E195), SSL debug build w/o 
> optimizations, clang version 5.0.0 (http://llvm.org/git/clang 
> c511a96ffe744933459ef64bf963629538057a90) (http://llvm.org/git/llvm 
> 0cd81d8a1055f167e0f588dd1b476863b00da3d5)
>Reporter: Benjamin Bannier
>  Labels: flaky-test, mesosphere
>
> The test {{DefaultExecutorCheckTest.CommandCheckTimeout}} randomly fails for 
> me when executing tests in parallel, e.g.,
> {code}
> [ RUN  ] DefaultExecutorCheckTest.CommandCheckTimeout
> ../../src/tests/check_tests.cpp:1374: Failure
> Failed to wait 15secs for updateCheckResultTimeout
> ../../src/tests/check_tests.cpp:1334: Failure
> Actual function call count doesn't match EXPECT_CALL(*scheduler, update(_, 
> _))...
>  Expected: to be called at least 3 times
>Actual: called twice - unsatisfied and active
> [  FAILED  ] DefaultExecutorCheckTest.CommandCheckTimeout (25351 ms)
> {code}





[jira] [Updated] (MESOS-7337) DefaultExecutorCheckTest.CommandCheckTimeout becomes flaky under load

2017-04-03 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-7337:

Environment: Mac OS 10.12.4 (16E195), SSL debug build w/o optimizations, 
clang version 5.0.0 (http://llvm.org/git/clang 
c511a96ffe744933459ef64bf963629538057a90) (http://llvm.org/git/llvm 
0cd81d8a1055f167e0f588dd1b476863b00da3d5),   (was: Mac OS 10.12.4 (16E195), SSL 
bug, debug build w/o optimizations, clang version 5.0.0 
(http://llvm.org/git/clang c511a96ffe744933459ef64bf963629538057a90) 
(http://llvm.org/git/llvm 0cd81d8a1055f167e0f588dd1b476863b00da3d5), )

> DefaultExecutorCheckTest.CommandCheckTimeout becomes flaky under load
> -
>
> Key: MESOS-7337
> URL: https://issues.apache.org/jira/browse/MESOS-7337
> Project: Mesos
>  Issue Type: Bug
>  Components: flaky, test
> Environment: Mac OS 10.12.4 (16E195), SSL debug build w/o 
> optimizations, clang version 5.0.0 (http://llvm.org/git/clang 
> c511a96ffe744933459ef64bf963629538057a90) (http://llvm.org/git/llvm 
> 0cd81d8a1055f167e0f588dd1b476863b00da3d5), 
>Reporter: Benjamin Bannier
>  Labels: flaky-test, mesosphere
>
> The test {{DefaultExecutorCheckTest.CommandCheckTimeout}} randomly fails for 
> me when executing tests in parallel, e.g.,
> {code}
> [ RUN  ] DefaultExecutorCheckTest.CommandCheckTimeout
> ../../src/tests/check_tests.cpp:1374: Failure
> Failed to wait 15secs for updateCheckResultTimeout
> ../../src/tests/check_tests.cpp:1334: Failure
> Actual function call count doesn't match EXPECT_CALL(*scheduler, update(_, 
> _))...
>  Expected: to be called at least 3 times
>Actual: called twice - unsatisfied and active
> [  FAILED  ] DefaultExecutorCheckTest.CommandCheckTimeout (25351 ms)
> {code}





[jira] [Updated] (MESOS-7337) DefaultExecutorCheckTest.CommandCheckTimeout becomes flaky under load

2017-04-03 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-7337:

Environment: Mac OS 10.12.4 (16E195), SSL bug, debug build w/o 
optimizations, clang version 5.0.0 (http://llvm.org/git/clang 
c511a96ffe744933459ef64bf963629538057a90) (http://llvm.org/git/llvm 
0cd81d8a1055f167e0f588dd1b476863b00da3d5), 

> DefaultExecutorCheckTest.CommandCheckTimeout becomes flaky under load
> -
>
> Key: MESOS-7337
> URL: https://issues.apache.org/jira/browse/MESOS-7337
> Project: Mesos
>  Issue Type: Bug
>  Components: flaky, test
> Environment: Mac OS 10.12.4 (16E195), SSL bug, debug build w/o 
> optimizations, clang version 5.0.0 (http://llvm.org/git/clang 
> c511a96ffe744933459ef64bf963629538057a90) (http://llvm.org/git/llvm 
> 0cd81d8a1055f167e0f588dd1b476863b00da3d5), 
>Reporter: Benjamin Bannier
>  Labels: flaky-test, mesosphere
>
> The test {{DefaultExecutorCheckTest.CommandCheckTimeout}} randomly fails for 
> me when executing tests in parallel, e.g.,
> {code}
> [ RUN  ] DefaultExecutorCheckTest.CommandCheckTimeout
> ../../src/tests/check_tests.cpp:1374: Failure
> Failed to wait 15secs for updateCheckResultTimeout
> ../../src/tests/check_tests.cpp:1334: Failure
> Actual function call count doesn't match EXPECT_CALL(*scheduler, update(_, 
> _))...
>  Expected: to be called at least 3 times
>Actual: called twice - unsatisfied and active
> [  FAILED  ] DefaultExecutorCheckTest.CommandCheckTimeout (25351 ms)
> {code}





[jira] [Updated] (MESOS-7210) MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( pid namespace mismatch )

2017-04-03 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7210:
---
Affects Version/s: 1.1.1
   1.2.0
 Story Points: 3
 Target Version/s: 1.1.2, 1.2.1, 1.3.0
 Priority: Critical  (was: Major)

> MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( 
> pid namespace mismatch )
> ---
>
> Key: MESOS-7210
> URL: https://issues.apache.org/jira/browse/MESOS-7210
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 1.1.0, 1.1.1, 1.2.0
> Environment: Ubuntu 16.04.02
> Docker version 1.13.1
> mesos 1.1.0, runs from container
> docker containers  spawned by marathon 1.4.1
>Reporter: Wojciech Sielski
>Assignee: haosdent
>Priority: Critical
>
> When running mesos-slave with option "docker_mesos_image" like:
> {code}
> --master=zk://standalone:2181/mesos  --containerizers=docker,mesos  
> --executor_registration_timeout=5mins  --hostname=standalone  --ip=0.0.0.0  
> --docker_stop_timeout=5secs  --gc_delay=1days  
> --docker_socket=/var/run/docker.sock  --no-systemd_enable_support  
> --work_dir=/tmp/mesos  --docker_mesos_image=panteras/paas-in-a-box:0.4.0
> {code}
> from the container that was started with option "pid: host" like:
> {code}
>   net:host
>   privileged: true
>   pid:host
> {code}
> and an example Marathon job that uses MESOS_HTTP checks like:
> {code}
> {
>   "id": "python-example-stable",
>   "cmd": "python3 -m http.server 8080",
>   "mem": 16,
>   "cpus": 0.1,
>   "instances": 2,
>   "container": {
>     "type": "DOCKER",
>     "docker": {
>       "image": "python:alpine",
>       "network": "BRIDGE",
>       "portMappings": [
>         { "containerPort": 8080, "hostPort": 0, "protocol": "tcp" }
>       ]
>     }
>   },
>   "env": {
>     "SERVICE_NAME": "python"
>   },
>   "healthChecks": [
>     {
>       "path": "/",
>       "portIndex": 0,
>       "protocol": "MESOS_HTTP",
>       "gracePeriodSeconds": 30,
>       "intervalSeconds": 10,
>       "timeoutSeconds": 30,
>       "maxConsecutiveFailures": 3
>     }
>   ]
> }
> {code}
> I see errors like:
> {code}
> F0306 07:41:58.844293 35 health_checker.cpp:94] Failed to enter the net 
> namespace of task (pid: '13527'): Pid 13527 does not exist
> *** Check failure stack trace: ***
> @ 0x7f51770b0c1d  google::LogMessage::Fail()
> @ 0x7f51770b29d0  google::LogMessage::SendToLog()
> @ 0x7f51770b0803  google::LogMessage::Flush()
> @ 0x7f51770b33f9  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f517647ce46  
> _ZNSt17_Function_handlerIFivEZN5mesos8internal6health14cloneWithSetnsERKSt8functionIS0_E6OptionIiERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaISG_EEEUlvE_E9_M_invokeERKSt9_Any_data
> @ 0x7f517647bf2b  mesos::internal::health::cloneWithSetns()
> @ 0x7f517648374b  std::_Function_handler<>::_M_invoke()
> @ 0x7f5177068167  process::internal::cloneChild()
> @ 0x7f5177065c32  process::subprocess()
> @ 0x7f5176481a9d  
> mesos::internal::health::HealthCheckerProcess::_httpHealthCheck()
> @ 0x7f51764831f7  
> mesos::internal::health::HealthCheckerProcess::_healthCheck()
> @ 0x7f517701f38c  process::ProcessBase::visit()
> @ 0x7f517702c8b3  process::ProcessManager::resume()
> @ 0x7f517702fb77  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7f51754ddc80  (unknown)
> @ 0x7f5174cf06ba  start_thread
> @ 0x7f5174a2682d  (unknown)
> I0306 07:41:59.077986 9 health_checker.cpp:199] Ignoring failure as 
> health check still in grace period
> {code}
> It looks like the docker_mesos_image option causes the newly started Mesos job 
> to get its own PID namespace instead of using the "pid host" option that the 
> parent container was started with (so whether or not the parent container was 
> started with "pid host", it will never be able to find the PID).
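
The "Pid 13527 does not exist" failure can be illustrated with a small sketch. It assumes, as the log message suggests, that the health checker locates the task through its entry under `/proc` before entering the task's network namespace; a process running in a different PID namespace has no such entry. Both function names and the `proc_root` parameter are hypothetical and exist only for this example.

```python
import os


def pid_visible(pid: int, proc_root: str = "/proc") -> bool:
    """Return True if `pid` has an entry under `proc_root`.

    A PID that belongs to another PID namespace has no entry in this
    namespace's procfs, so the lookup fails even though the process runs.
    """
    return os.path.isdir(os.path.join(proc_root, str(pid)))


def net_namespace_path(pid: int, proc_root: str = "/proc") -> str:
    """Return the procfs path of the network namespace handle for `pid`."""
    return os.path.join(proc_root, str(pid), "ns", "net")


if __name__ == "__main__":
    pid = 13527  # the PID from the log above
    if pid_visible(pid):
        print("would enter", net_namespace_path(pid))
    else:
        print(f"Pid {pid} is not visible from this PID namespace")
```

Running the check from inside the executor container and again on the host would show whether the two see the same set of PIDs.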





[jira] [Commented] (MESOS-7210) MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( pid namespace mismatch )

2017-04-03 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953242#comment-15953242
 ] 

Alexander Rukletsov commented on MESOS-7210:


[~xds2000], [~haosd...@gmail.com] Let's fix it and backport.

> MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( 
> pid namespace mismatch )
> ---
>
> Key: MESOS-7210
> URL: https://issues.apache.org/jira/browse/MESOS-7210
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 1.1.0, 1.1.1, 1.2.0
> Environment: Ubuntu 16.04.02
> Docker version 1.13.1
> mesos 1.1.0, runs from container
> docker containers  spawned by marathon 1.4.1
>Reporter: Wojciech Sielski
>Assignee: haosdent
>Priority: Critical
>
> When running mesos-slave with option "docker_mesos_image" like:
> {code}
> --master=zk://standalone:2181/mesos  --containerizers=docker,mesos  
> --executor_registration_timeout=5mins  --hostname=standalone  --ip=0.0.0.0  
> --docker_stop_timeout=5secs  --gc_delay=1days  
> --docker_socket=/var/run/docker.sock  --no-systemd_enable_support  
> --work_dir=/tmp/mesos  --docker_mesos_image=panteras/paas-in-a-box:0.4.0
> {code}
> from the container that was started with option "pid: host" like:
> {code}
>   net:host
>   privileged: true
>   pid:host
> {code}
> and an example Marathon job that uses MESOS_HTTP checks like:
> {code}
> {
>   "id": "python-example-stable",
>   "cmd": "python3 -m http.server 8080",
>   "mem": 16,
>   "cpus": 0.1,
>   "instances": 2,
>   "container": {
>     "type": "DOCKER",
>     "docker": {
>       "image": "python:alpine",
>       "network": "BRIDGE",
>       "portMappings": [
>         { "containerPort": 8080, "hostPort": 0, "protocol": "tcp" }
>       ]
>     }
>   },
>   "env": {
>     "SERVICE_NAME": "python"
>   },
>   "healthChecks": [
>     {
>       "path": "/",
>       "portIndex": 0,
>       "protocol": "MESOS_HTTP",
>       "gracePeriodSeconds": 30,
>       "intervalSeconds": 10,
>       "timeoutSeconds": 30,
>       "maxConsecutiveFailures": 3
>     }
>   ]
> }
> {code}
> I see errors like:
> {code}
> F0306 07:41:58.844293 35 health_checker.cpp:94] Failed to enter the net 
> namespace of task (pid: '13527'): Pid 13527 does not exist
> *** Check failure stack trace: ***
> @ 0x7f51770b0c1d  google::LogMessage::Fail()
> @ 0x7f51770b29d0  google::LogMessage::SendToLog()
> @ 0x7f51770b0803  google::LogMessage::Flush()
> @ 0x7f51770b33f9  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f517647ce46  
> _ZNSt17_Function_handlerIFivEZN5mesos8internal6health14cloneWithSetnsERKSt8functionIS0_E6OptionIiERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaISG_EEEUlvE_E9_M_invokeERKSt9_Any_data
> @ 0x7f517647bf2b  mesos::internal::health::cloneWithSetns()
> @ 0x7f517648374b  std::_Function_handler<>::_M_invoke()
> @ 0x7f5177068167  process::internal::cloneChild()
> @ 0x7f5177065c32  process::subprocess()
> @ 0x7f5176481a9d  
> mesos::internal::health::HealthCheckerProcess::_httpHealthCheck()
> @ 0x7f51764831f7  
> mesos::internal::health::HealthCheckerProcess::_healthCheck()
> @ 0x7f517701f38c  process::ProcessBase::visit()
> @ 0x7f517702c8b3  process::ProcessManager::resume()
> @ 0x7f517702fb77  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7f51754ddc80  (unknown)
> @ 0x7f5174cf06ba  start_thread
> @ 0x7f5174a2682d  (unknown)
> I0306 07:41:59.077986 9 health_checker.cpp:199] Ignoring failure as 
> health check still in grace period
> {code}
> It looks like the docker_mesos_image option causes the newly started Mesos job 
> to get its own PID namespace instead of using the "pid host" option that the 
> parent container was started with (so whether or not the parent container was 
> started with "pid host", it will never be able to find the PID).





[jira] [Created] (MESOS-7337) DefaultExecutorCheckTest.CommandCheckTimeout becomes flaky under load

2017-04-03 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-7337:
---

 Summary: DefaultExecutorCheckTest.CommandCheckTimeout becomes 
flaky under load
 Key: MESOS-7337
 URL: https://issues.apache.org/jira/browse/MESOS-7337
 Project: Mesos
  Issue Type: Bug
  Components: flaky, test
Reporter: Benjamin Bannier


The test {{DefaultExecutorCheckTest.CommandCheckTimeout}} randomly fails for me 
when executing tests in parallel, e.g.,
{code}
[ RUN  ] DefaultExecutorCheckTest.CommandCheckTimeout
../../src/tests/check_tests.cpp:1374: Failure
Failed to wait 15secs for updateCheckResultTimeout
../../src/tests/check_tests.cpp:1334: Failure
Actual function call count doesn't match EXPECT_CALL(*scheduler, update(_, 
_))...
 Expected: to be called at least 3 times
   Actual: called twice - unsatisfied and active
[  FAILED  ] DefaultExecutorCheckTest.CommandCheckTimeout (25351 ms)
{code}





[jira] [Commented] (MESOS-7269) Migrate setting in config.py to a TOML file

2017-04-03 Thread Armand Grillet (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953131#comment-15953131
 ] 

Armand Grillet commented on MESOS-7269:
---

https://reviews.apache.org/r/57951/

> Migrate setting in config.py to a TOML file
> ---
>
> Key: MESOS-7269
> URL: https://issues.apache.org/jira/browse/MESOS-7269
> Project: Mesos
>  Issue Type: Task
>  Components: cli
>Reporter: Avinash Sridharan
>Assignee: Armand Grillet
>
> We want Mesos CLI configuration to be given as a TOML file by the user.





[jira] [Commented] (MESOS-7210) MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( pid namespace mismatch )

2017-04-03 Thread Wojciech Sielski (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953071#comment-15953071
 ] 

Wojciech Sielski commented on MESOS-7210:
-

[~xds2000] exactly, the mesos-slave (container) and the docker executor 
(container) need to run in the same pid pool (host).

> MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( 
> pid namespace mismatch )
> ---
>
> Key: MESOS-7210
> URL: https://issues.apache.org/jira/browse/MESOS-7210
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 1.1.0
> Environment: Ubuntu 16.04.02
> Docker version 1.13.1
> mesos 1.1.0, runs from container
> docker containers  spawned by marathon 1.4.1
>Reporter: Wojciech Sielski
>Assignee: haosdent
>
> When running mesos-slave with option "docker_mesos_image" like:
> {code}
> --master=zk://standalone:2181/mesos  --containerizers=docker,mesos  
> --executor_registration_timeout=5mins  --hostname=standalone  --ip=0.0.0.0  
> --docker_stop_timeout=5secs  --gc_delay=1days  
> --docker_socket=/var/run/docker.sock  --no-systemd_enable_support  
> --work_dir=/tmp/mesos  --docker_mesos_image=panteras/paas-in-a-box:0.4.0
> {code}
> from the container that was started with option "pid: host" like:
> {code}
>   net:host
>   privileged: true
>   pid:host
> {code}
> and an example Marathon job that uses MESOS_HTTP checks like:
> {code}
> {
>   "id": "python-example-stable",
>   "cmd": "python3 -m http.server 8080",
>   "mem": 16,
>   "cpus": 0.1,
>   "instances": 2,
>   "container": {
>     "type": "DOCKER",
>     "docker": {
>       "image": "python:alpine",
>       "network": "BRIDGE",
>       "portMappings": [
>         { "containerPort": 8080, "hostPort": 0, "protocol": "tcp" }
>       ]
>     }
>   },
>   "env": {
>     "SERVICE_NAME": "python"
>   },
>   "healthChecks": [
>     {
>       "path": "/",
>       "portIndex": 0,
>       "protocol": "MESOS_HTTP",
>       "gracePeriodSeconds": 30,
>       "intervalSeconds": 10,
>       "timeoutSeconds": 30,
>       "maxConsecutiveFailures": 3
>     }
>   ]
> }
> {code}
> I see errors like:
> {code}
> F0306 07:41:58.844293 35 health_checker.cpp:94] Failed to enter the net 
> namespace of task (pid: '13527'): Pid 13527 does not exist
> *** Check failure stack trace: ***
> @ 0x7f51770b0c1d  google::LogMessage::Fail()
> @ 0x7f51770b29d0  google::LogMessage::SendToLog()
> @ 0x7f51770b0803  google::LogMessage::Flush()
> @ 0x7f51770b33f9  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f517647ce46  
> _ZNSt17_Function_handlerIFivEZN5mesos8internal6health14cloneWithSetnsERKSt8functionIS0_E6OptionIiERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaISG_EEEUlvE_E9_M_invokeERKSt9_Any_data
> @ 0x7f517647bf2b  mesos::internal::health::cloneWithSetns()
> @ 0x7f517648374b  std::_Function_handler<>::_M_invoke()
> @ 0x7f5177068167  process::internal::cloneChild()
> @ 0x7f5177065c32  process::subprocess()
> @ 0x7f5176481a9d  
> mesos::internal::health::HealthCheckerProcess::_httpHealthCheck()
> @ 0x7f51764831f7  
> mesos::internal::health::HealthCheckerProcess::_healthCheck()
> @ 0x7f517701f38c  process::ProcessBase::visit()
> @ 0x7f517702c8b3  process::ProcessManager::resume()
> @ 0x7f517702fb77  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7f51754ddc80  (unknown)
> @ 0x7f5174cf06ba  start_thread
> @ 0x7f5174a2682d  (unknown)
> I0306 07:41:59.077986 9 health_checker.cpp:199] Ignoring failure as 
> health check still in grace period
> {code}
> It looks like the docker_mesos_image option causes the newly started Mesos job 
> to get its own PID namespace instead of using the "pid host" option that the 
> parent container was started with (so whether or not the parent container was 
> started with "pid host", it will never be able to find the PID).





[jira] [Commented] (MESOS-3545) Investigate restoring tasks/executors after machine reboot.

2017-04-03 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953046#comment-15953046
 ] 

Yan Xu commented on MESOS-3545:
---

[~xds2000] it's being worked on, but the first patch will probably be weeks out 
as I am on leave right now and we still have to settle some design issues.

> Investigate restoring tasks/executors after machine reboot.
> ---
>
> Key: MESOS-3545
> URL: https://issues.apache.org/jira/browse/MESOS-3545
> Project: Mesos
>  Issue Type: Epic
>  Components: agent
>Reporter: Benjamin Hindman
>Assignee: Megha Sharma
>
> If a task/executor is restartable (see MESOS-3544) it might make sense to 
> force an agent to restart these tasks/executors _before_ after a machine 
> reboot in the event that the machine is network partitioned away from the 
> master (or the master has failed) but we'd like to get these services running 
> again. Assuming the agent(s) running on the machine has not been disconnected 
> from the master for longer than the master's agent re-registration timeout 
> the agent should be able to re-register (i.e., after a network partition is 
> resolved) without a problem. However, in the same way that a framework would 
> be interested in knowing that its tasks/executors were restarted, we'd want 
> to send something like a TASK_RESTARTED status update.


