[jira] [Updated] (MESOS-6758) Support 'Basic' auth docker private registry on Unified Containerizer.

2017-01-26, Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-6758:
--
Priority: Critical  (was: Major)

> Support 'Basic' auth docker private registry on Unified Containerizer.
> --
>
> Key: MESOS-6758
> URL: https://issues.apache.org/jira/browse/MESOS-6758
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>Priority: Critical
>  Labels: containerizer
>
> Currently, the Unified Containerizer only supports private docker registries 
> that use 'Bearer' authorization (a token is needed from the auth server). We 
> should support 'Basic' auth registries as well.
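
For reference, the two schemes differ only in the {{Authorization}} header the 
registry client sends. A minimal sketch (the credentials and token are 
placeholders; the base64 helper is hand-rolled, RFC 4648 alphabet, only to keep 
the example self-contained):

{code}
// Minimal sketch: the 'Authorization' header for the two registry auth
// schemes. Credentials and token are placeholders.
#include <iostream>
#include <string>

std::string base64(const std::string& in) {
  static const char* alphabet =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  std::string out;
  int val = 0, bits = -6;
  for (unsigned char c : in) {
    val = (val << 8) + c;
    bits += 8;
    while (bits >= 0) {
      out.push_back(alphabet[(val >> bits) & 0x3F]);
      bits -= 6;
    }
  }
  if (bits > -6) out.push_back(alphabet[((val << 8) >> (bits + 8)) & 0x3F]);
  while (out.size() % 4 != 0) out.push_back('=');
  return out;
}

int main() {
  // 'Basic': the credentials travel directly in the request header.
  std::cout << "Authorization: Basic " << base64("user:secret") << "\n";
  // 'Bearer': a token previously fetched from the registry's auth server.
  std::cout << "Authorization: Bearer <token-from-auth-server>\n";
}
{code}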



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6432) Roles with quota assigned can "game" the system to receive excessive resources.

2017-01-26, Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840862#comment-15840862
 ] 

Adam B commented on MESOS-6432:
---

Dropping this from Blocker to Critical, since it's not new behavior (it has 
existed since the introduction of Quota) and it doesn't have any reviews yet, 
so we won't block the 1.2 release for it.
[~bmahler], [~alexr] Do either of you have time to review this soon?

> Roles with quota assigned can "game" the system to receive excessive 
> resources.
> ---
>
> Key: MESOS-6432
> URL: https://issues.apache.org/jira/browse/MESOS-6432
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: Benjamin Mahler
>Assignee: Benjamin Bannier
>Priority: Blocker
>
> The current implementation of quota allocation attempts to satisfy each 
> resource quota for a role, but in doing so can far exceed the quota assigned 
> to the role.
> For example, if a role has quota for {{\[30,20,10\]}}, it can consume up to 
> {{\[∞, ∞, 10\]}}, {{\[∞, 20, ∞\]}}, or {{\[30, ∞, ∞\]}}, because we only stop 
> allocating an agent's resources to the role once every resource in the quota 
> vector is satisfied!
> As a first step for preventing gaming, we could consider quota satisfied once 
> any of the resources in the vector has its quota satisfied. This approach 
> works reasonably well for resources that are required and present on every 
> agent (cpus, mem, disk). However, it doesn't work well for resources that are 
> optional / only present on some agents (e.g. gpus), a.k.a. non-ubiquitous / 
> scarce resources. For these we would need to determine which agents have 
> resources that can satisfy the quota prior to performing the allocation.
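
To make the overshoot concrete, here is a toy model (not allocator code) of 
the current behavior: whole agents keep being allocated until every element of 
the quota vector is satisfied, so the other elements can be exceeded by orders 
of magnitude:

{code}
// Toy model: allocate whole-agent bundles of [cpus, mem, disk] until
// *every* element of the role's quota is satisfied.
#include <array>
#include <iostream>

int main() {
  const std::array<double, 3> quota = {30, 20, 10};   // [cpus, mem, disk]
  const std::array<double, 3> agent = {4, 16, 100};   // one agent's resources
  std::array<double, 3> allocated = {0, 0, 0};

  auto satisfied = [&]() {
    for (size_t i = 0; i < 3; i++)
      if (allocated[i] < quota[i]) return false;
    return true;
  };

  while (!satisfied())
    for (size_t i = 0; i < 3; i++) allocated[i] += agent[i];

  std::cout << "quota [30,20,10] -> allocated ["
            << allocated[0] << "," << allocated[1] << ","
            << allocated[2] << "]\n";  // prints [32,128,800]: far over quota
}
{code}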



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6432) Roles with quota assigned can "game" the system to receive excessive resources.

2017-01-26, Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-6432:
--
Priority: Critical  (was: Blocker)

> Roles with quota assigned can "game" the system to receive excessive 
> resources.
> ---
>
> Key: MESOS-6432
> URL: https://issues.apache.org/jira/browse/MESOS-6432
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: Benjamin Mahler
>Assignee: Benjamin Bannier
>Priority: Critical
>
> The current implementation of quota allocation attempts to satisfy each 
> resource quota for a role, but in doing so can far exceed the quota assigned 
> to the role.
> For example, if a role has quota for {{\[30,20,10\]}}, it can consume up to 
> {{\[∞, ∞, 10\]}}, {{\[∞, 20, ∞\]}}, or {{\[30, ∞, ∞\]}}, because we only stop 
> allocating an agent's resources to the role once every resource in the quota 
> vector is satisfied!
> As a first step for preventing gaming, we could consider quota satisfied once 
> any of the resources in the vector has its quota satisfied. This approach 
> works reasonably well for resources that are required and present on every 
> agent (cpus, mem, disk). However, it doesn't work well for resources that are 
> optional / only present on some agents (e.g. gpus), a.k.a. non-ubiquitous / 
> scarce resources. For these we would need to determine which agents have 
> resources that can satisfy the quota prior to performing the allocation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6665) io::redirect might cause stack overflow.

2017-01-26, Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-6665:
--
Shepherd: Jie Yu

> io::redirect might cause stack overflow.
> 
>
> Key: MESOS-6665
> URL: https://issues.apache.org/jira/browse/MESOS-6665
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>Assignee: Benjamin Hindman
>Priority: Blocker
>  Labels: mesosphere
>
> This can be reproduced on macOS Sierra:
> {noformat}
> [--] 6 tests from IOTest
> [ RUN  ] IOTest.Poll
> [   OK ] IOTest.Poll (0 ms)
> [ RUN  ] IOTest.Read
> [   OK ] IOTest.Read (3 ms)
> [ RUN  ] IOTest.BufferedRead
> [   OK ] IOTest.BufferedRead (5 ms)
> [ RUN  ] IOTest.Write
> [   OK ] IOTest.Write (1 ms)
> [ RUN  ] IOTest.Redirect
> make[6]: *** [check-local] Illegal instruction: 4
> make[5]: *** [check-am] Error 2
> make[4]: *** [check-recursive] Error 1
> make[3]: *** [check] Error 2
> make[2]: *** [check-recursive] Error 1
> make[1]: *** [check] Error 2
> make: *** [check-recursive] Error 1
> (reverse-i-search)`k': make check -j3
> Jies-MacBook-Pro:build jie$ lldb 3rdparty/libprocess/libprocess-tests
> (lldb) target create "3rdparty/libprocess/libprocess-tests"
> Current executable set to '3rdparty/libprocess/libprocess-tests' (x86_64).
> (lldb) run --gtest_filter=IOTest.Redirect
> Process 26064 launched: 
> '/Users/jie/workspace/dist/mesos/build/3rdparty/libprocess/libprocess-tests' 
> (x86_64)
> Note: Google Test filter = IOTest.Redirect
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from IOTest
> [ RUN  ] IOTest.Redirect
> Process 26064 stopped
> * thread #2: tid = 0x152c5c, 0x7fffd6d463e0 
> libsystem_malloc.dylib`szone_malloc_should_clear + 78, stop reason = 
> EXC_BAD_ACCESS (code=2, address=0x7eb16ff8)
> frame #0: 0x7fffd6d463e0 
> libsystem_malloc.dylib`szone_malloc_should_clear + 78
> libsystem_malloc.dylib`szone_malloc_should_clear:
> ->  0x7fffd6d463e0 <+78>: movq   %rax, -0x78(%rbp)
> 0x7fffd6d463e4 <+82>: movq   0x10f0(%r12), %r13
> 0x7fffd6d463ec <+90>: leaq   (%rax,%rax,4), %r14
> 0x7fffd6d463f0 <+94>: shlq   $0x9, %r14
> (lldb) bt
> .
> frame #2794: 0x7fffd6ddb221 libsystem_pthread.dylib`thread_start + 13
> {noformat}
> Changing the test to redirect just 1KB of data hides the issue.
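
The thousands of stack frames in the backtrace are the signature of a 
recursive continuation chain whose depth grows with the amount of data 
redirected, which would also explain why redirecting only 1KB hides it. As a 
generic illustration of the difference, not the libprocess implementation:

{code}
// Illustration only: a redirect written as "read one chunk, then recurse"
// uses O(bytes) stack, while the loop form uses constant stack.
#include <unistd.h>
#include <array>

void spliceRecursive(int from, int to) {
  std::array<char, 4096> buf;
  ssize_t n = read(from, buf.data(), buf.size());
  if (n <= 0) return;
  (void)write(to, buf.data(), n);
  spliceRecursive(from, to);  // one stack frame per chunk
}

void spliceLoop(int from, int to) {
  std::array<char, 4096> buf;
  ssize_t n;
  while ((n = read(from, buf.data(), buf.size())) > 0)
    (void)write(to, buf.data(), n);  // constant stack, any data size
}

int main() {
  int fds[2];
  if (pipe(fds) != 0) return 1;
  (void)write(fds[1], "hello\n", 6);
  close(fds[1]);
  spliceLoop(fds[0], STDOUT_FILENO);  // the iterative, overflow-free shape
}
{code}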



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6813) IOSwitchboardServerTest.AttachOutput has stack overflow issue.

2017-01-26, Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-6813:
--
Shepherd: Jie Yu

> IOSwitchboardServerTest.AttachOutput has stack overflow issue.
> --
>
> Key: MESOS-6813
> URL: https://issues.apache.org/jira/browse/MESOS-6813
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>Assignee: Benjamin Hindman
>Priority: Blocker
>
> {noformat}
> bin/lldb-mesos-tests.sh --gtest_filter=IOSwitchboardServerTest.AttachOutput 
> --verbose
> 
> (lldb) run
> ...
> frame #3543: 0x000106a35cbd libmesos-1.2.0.dylib`bool 
> process::Future<short>::_set(short&&) + 445
> frame #3544: 0x000106a35af5 
> libmesos-1.2.0.dylib`process::Future<short>::set(short&&) + 37
> frame #3545: 0x000106a35ab0 libmesos-1.2.0.dylib`bool 
> process::Promise<short>::_set(short&&) + 80
> frame #3546: 0x000106a33285 
> libmesos-1.2.0.dylib`process::Promise<short>::set(short&&) + 37
> frame #3547: 0x000106a3322e 
> libmesos-1.2.0.dylib`process::polled(ev_loop*, ev_io*, int) + 110
> frame #3548: 0x000106abea59 
> libmesos-1.2.0.dylib`ev_invoke_pending(loop=<unavailable>) + 105 at ev.c:3288 
> [opt]
> frame #3549: 0x000106abf342 
> libmesos-1.2.0.dylib`ev_run(loop=<unavailable>, flags=<unavailable>) + 2242 
> at ev.c:3688 [opt]
> frame #3550: 0x000106a2ac5b libmesos-1.2.0.dylib`ev_loop(ev_loop*, 
> int) + 27
> frame #3551: 0x000106a2abc6 
> libmesos-1.2.0.dylib`process::EventLoop::run() + 134
> frame #3552: 0x00010698eff6 libmesos-1.2.0.dylib`void* 
> std::__1::__thread_proxy(void*) + 390
> frame #3553: 0x7fffd6ddbaab libsystem_pthread.dylib`_pthread_body + 
> 180
> frame #3554: 0x7fffd6ddb9f7 libsystem_pthread.dylib`_pthread_start + 
> 286
> frame #3555: 0x7fffd6ddb221 libsystem_pthread.dylib`thread_start + 13
> (lldb) quit
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4705) Linux 'perf' parsing logic may fail when OS distribution has perf backports.

2017-01-26, Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840809#comment-15840809
 ] 

Benjamin Mahler commented on MESOS-4705:


Sure, let's make sure these are linked to a ticket that captures using the perf 
API rather than parsing output, per [~wangcong]'s approach above.

> Linux 'perf' parsing logic may fail when OS distribution has perf backports.
> 
>
> Key: MESOS-4705
> URL: https://issues.apache.org/jira/browse/MESOS-4705
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups, isolation
>Affects Versions: 0.27.1
>Reporter: Fan Du
>Assignee: Fan Du
> Fix For: 0.26.2, 0.27.3, 0.28.2, 1.0.0
>
>
> When sampling container with perf event on Centos7 with kernel 
> 3.10.0-123.el7.x86_64, slave complained with below error spew:
> {code}
> E0218 16:32:00.591181  8376 perf_event.cpp:408] Failed to get perf sample: 
> Failed to parse perf sample: Failed to parse perf sample line 
> '25871993253,,cycles,mesos/5f23ffca-87ed-4ff6-84f2-6ec3d4098ab8,10059827422,100.00':
>  Unexpected number of fields
> {code}
> It's caused by the current perf format [assumption | 
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob;f=src/linux/perf.cpp;h=1c113a2b3f57877e132bbd65e01fb2f045132128;hb=HEAD#l430]
>  for kernel versions below 3.12.
> On the 3.10.0-123.el7.x86_64 kernel, the format has 6 tokens, as below:
> value,unit,event,cgroup,running,ratio
> A local modification fixed this error on my test bed; please review this 
> ticket.
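
A minimal sketch of a more tolerant parser (not the Mesos code): tokenize the 
CSV sample line and key the interpretation on the observed field count rather 
than on the kernel version, so a backported 6-token format like the one above 
still parses:

{code}
// Sketch: dispatch on the number of comma-separated fields in a perf
// sample line instead of assuming a format from the kernel version.
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

std::vector<std::string> split(const std::string& line) {
  std::vector<std::string> fields;
  std::stringstream ss(line);
  std::string field;
  while (std::getline(ss, field, ',')) fields.push_back(field);
  return fields;
}

int main() {
  const std::string line =
      "25871993253,,cycles,mesos/5f23ffca-87ed-4ff6-84f2-6ec3d4098ab8,"
      "10059827422,100.00";
  std::vector<std::string> f = split(line);
  if (f.size() == 6) {
    // value,unit,event,cgroup,running,ratio (3.10.0-123.el7 backport)
    std::cout << f[2] << " = " << f[0] << " (" << f[5] << "% running)\n";
  } else {
    std::cerr << "Unexpected number of fields: " << f.size() << "\n";
  }
}
{code}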



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6985) os::getenv() can segfault

2017-01-26, Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840771#comment-15840771
 ] 

Greg Mann commented on MESOS-6985:
--

Yep, it's definitely occurring in {{::getenv}}. Here's the result of a failed 
test run within {{gdb}}:
{code}
[ RUN  ] MasterTest.MultipleExecutors
I0127 00:39:33.120487  1809 cluster.cpp:160] Creating default 'local' authorizer
I0127 00:39:33.122427  1815 master.cpp:383] Master 
ac440d30-722b-43a5-9f61-cea98b3e576a (vagrant-ubuntu-trusty-64) started on 
10.0.2.15:51845
I0127 00:39:33.122498  1815 master.cpp:385] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/b7WHq9/credentials" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--root_submissions="true" --user_sorter="drf" --version="false" 
--webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/b7WHq9/master" 
--zk_session_timeout="10secs"
I0127 00:39:33.122836  1815 master.cpp:435] Master only allowing authenticated 
frameworks to register
I0127 00:39:33.122858  1815 master.cpp:449] Master only allowing authenticated 
agents to register
I0127 00:39:33.122875  1815 master.cpp:462] Master only allowing authenticated 
HTTP frameworks to register
I0127 00:39:33.122891  1815 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/b7WHq9/credentials'
I0127 00:39:33.123128  1815 master.cpp:507] Using default 'crammd5' 
authenticator
I0127 00:39:33.123265  1815 http.cpp:922] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0127 00:39:33.123394  1815 http.cpp:922] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0127 00:39:33.123631  1815 http.cpp:922] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0127 00:39:33.123884  1815 master.cpp:587] Authorization enabled
I0127 00:39:33.127008  1819 master.cpp:2119] Elected as the leading master!
I0127 00:39:33.127084  1819 master.cpp:1641] Recovering from registrar
I0127 00:39:33.127766  1818 registrar.cpp:362] Successfully fetched the 
registry (0B) in 408832ns
I0127 00:39:33.127883  1818 registrar.cpp:461] Applied 1 operations in 22092ns; 
attempting to update the registry
I0127 00:39:33.130798  1818 registrar.cpp:506] Successfully updated the 
registry in 2.779136ms
I0127 00:39:33.130934  1818 registrar.cpp:392] Successfully recovered registrar
I0127 00:39:33.131573  1818 master.cpp:1757] Recovered 0 agents from the 
registry (153B); allowing 10mins for agents to re-register
I0127 00:39:33.134503  1809 cluster.cpp:446] Creating default 'local' authorizer
I0127 00:39:33.135774  1818 slave.cpp:209] Mesos agent started on 
(8)@10.0.2.15:51845
I0127 00:39:33.135824  1818 slave.cpp:210] Flags at startup: --acls="" 
> --appc_simple_discovery_uri_prefix="http://" 
--appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticatee="crammd5" 
--authentication_backoff_factor="1secs" --authorizer="local" 
--cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
--cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
--cgroups_root="mesos" --container_disk_watch_interval="15secs" 
--containerizers="mesos" 
--credential="/tmp/MasterTest_MultipleExecutors_ruv9Vu/credential" 
--default_role="*" --disk_watch_interval="1mins" --docker="docker" 
> --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io" 
--docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" 
--docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" 
--docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
--enforce_container_disk_quota="false" --executor_registration_timeout="1mins" 
--executor_shutdown_grace_period="5secs" 
--fetcher_cache_dir="/tmp/MasterTest_MultipleExecutors_ruv9Vu/fetch" 
--fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" 
--gc_disk_headroom="0.1" --hadoop_home="" --help="false" 
--hostname_lookup="true" 

[jira] [Updated] (MESOS-6989) Docker executor segfaults in ~MesosExecutorDriver()

2017-01-26, Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-6989:
-
Shepherd: Anand Mazumdar

> Docker executor segfaults in ~MesosExecutorDriver()
> ---
>
> Key: MESOS-6989
> URL: https://issues.apache.org/jira/browse/MESOS-6989
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Reporter: Jan-Philip Gehrcke
>Assignee: Joseph Wu
>Priority: Blocker
>  Labels: mesosphere
>
> With the current Mesos master state (commit 
> 42e515bc5c175a318e914d34473016feda4db6ff), the Docker executor segfaults 
> during shutdown. 
> Steps to reproduce:
> 1) Start master:
> {code}
> $ ./bin/mesos-master.sh --ip=127.0.0.1 --work_dir=/tmp/jp/mesos
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0125 13:41:15.963775 14744 main.cpp:278] Build: 2017-01-25 13:37:42 by jp
> I0125 13:41:15.963868 14744 main.cpp:279] Version: 1.2.0
> I0125 13:41:15.963877 14744 main.cpp:286] Git SHA: 
> 42e515bc5c175a318e914d34473016feda4db6ff
> {code}
> (note that building it at 13:37 is not part of the repro)
> 2) Start agent:
> {code}
> $ ./bin/mesos-slave.sh --containerizers=mesos,docker --master=127.0.0.1:5050 
> --work_dir=/tmp/jp/mesos
> {code}
> 3) Run {{mesos-execute}} with the Docker containerizer:
> {code}
> $ ./src/mesos-execute --master=127.0.0.1:5050 --name=testcommand 
> --containerizer=docker --docker_image=debian --command=env
> I0125 13:43:59.704973 14951 scheduler.cpp:184] Version: 1.2.0
> I0125 13:43:59.706425 14952 scheduler.cpp:470] New master detected at 
> master@127.0.0.1:5050
> Subscribed with ID 57596743-06f4-45f1-a975-348cf70589b1-
> Submitted task 'testcommand' to agent 
> '57596743-06f4-45f1-a975-348cf70589b1-S0'
> Received status update TASK_RUNNING for task 'testcommand'
>   source: SOURCE_EXECUTOR
> Received status update TASK_FINISHED for task 'testcommand'
>   message: 'Container exited with status 0'
>   source: SOURCE_EXECUTOR
> {code}
> Relevant agent output that shows the executor segfault:
> {code}
> [...]
> I0125 13:44:16.249191 14823 slave.cpp:4328] Got exited event for 
> executor(1)@192.99.40.208:33529
> I0125 13:44:16.347095 14830 docker.cpp:2358] Executor for container 
> 396282a9-7bf0-48ee-ba07-3ff2ca801d53 has exited
> I0125 13:44:16.347127 14830 docker.cpp:2052] Destroying container 
> 396282a9-7bf0-48ee-ba07-3ff2ca801d53
> I0125 13:44:16.347439 14830 docker.cpp:2179] Running docker stop on container 
> 396282a9-7bf0-48ee-ba07-3ff2ca801d53
> I0125 13:44:16.349215 14826 slave.cpp:4691] Executor 'testcommand' of 
> framework 57596743-06f4-45f1-a975-348cf70589b1- terminated with signal 
> Segmentation fault (core dumped)
> [...]
> {code}
> The complete task stderr:
> {code}
> $ cat 
> /tmp/jp/mesos/slaves/57596743-06f4-45f1-a975-348cf70589b1-S0/frameworks/57596743-06f4-45f1-a975-348cf70589b1-/executors/testcommand/runs/latest/stderr
>  
> I0125 13:44:12.850073 15030 exec.cpp:162] Version: 1.2.0
> I0125 13:44:12.864229 15050 exec.cpp:237] Executor registered on agent 
> 57596743-06f4-45f1-a975-348cf70589b1-S0
> I0125 13:44:12.865842 15054 docker.cpp:850] Running docker -H 
> unix:///var/run/docker.sock run --cpu-shares 1024 --memory 134217728 
> --env-file /tmp/xFZ8G9 -v 
> /tmp/jp/mesos/slaves/57596743-06f4-45f1-a975-348cf70589b1-S0/frameworks/57596743-06f4-45f1-a975-348cf70589b1-/executors/testcommand/runs/396282a9-7bf0-48ee-ba07-3ff2ca801d53:/mnt/mesos/sandbox
>  --net host --entrypoint /bin/sh --name 
> mesos-57596743-06f4-45f1-a975-348cf70589b1-S0.396282a9-7bf0-48ee-ba07-3ff2ca801d53
>  debian -c env
> I0125 13:44:15.248721 15064 exec.cpp:410] Executor asked to shutdown
> *** Aborted at 1485369856 (unix time) try "date -d @1485369856" if you are 
> using GNU date ***
> PC: @ 0x7fb38f153dd0 (unknown)
> *** SIGSEGV (@0x68) received by PID 15030 (TID 0x7fb3961a88c0) from PID 104; 
> stack trace: ***
> @ 0x7fb38f15b5c0 (unknown)
> @ 0x7fb38f153dd0 (unknown)
> @ 0x7fb39332c607 __gthread_mutex_lock()
> @ 0x7fb39332c657 __gthread_recursive_mutex_lock()
> @ 0x7fb39332edca std::recursive_mutex::lock()
> @ 0x7fb393337bd8 
> _ZZ11synchronizeISt15recursive_mutexE12SynchronizedIT_EPS2_ENKUlPS0_E_clES5_
> @ 0x7fb393337bf8 
> _ZZ11synchronizeISt15recursive_mutexE12SynchronizedIT_EPS2_ENUlPS0_E_4_FUNES5_
> @ 0x7fb39333ba6b Synchronized<>::Synchronized()
> @ 0x7fb393337cac synchronize<>()
> @ 0x7fb39492f15c process::ProcessManager::wait()
> @ 0x7fb3949353f0 process::wait()
> @ 0x55fd63f31fe5 process::wait()
> @ 0x7fb39332ce3c mesos::MesosExecutorDriver::~MesosExecutorDriver()
> @ 0x55fd63f2bd86 main
> @ 0x7fb38e4fc401 __libc_start_main
> @ 0x55fd63f2ab5a _start
> {code}
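
The trace shows {{~MesosExecutorDriver()}} reaching 
{{process::ProcessManager::wait()}} through a {{recursive_mutex}} that is 
apparently already gone. While the destructor itself gets fixed, a defensive 
pattern for executor {{main()}} functions is to stop and join the driver 
explicitly, so that the destructor runs on a fully quiesced driver. A sketch 
(the no-op executor stub is just scaffolding to keep it compilable; whether 
this actually avoids the crash here still needs verifying):

{code}
#include <string>
#include <mesos/executor.hpp>

using namespace mesos;

// Minimal no-op executor, just enough to show the teardown order.
class NoopExecutor : public Executor {
public:
  void registered(ExecutorDriver*, const ExecutorInfo&,
                  const FrameworkInfo&, const SlaveInfo&) override {}
  void reregistered(ExecutorDriver*, const SlaveInfo&) override {}
  void disconnected(ExecutorDriver*) override {}
  void launchTask(ExecutorDriver*, const TaskInfo&) override {}
  void killTask(ExecutorDriver*, const TaskID&) override {}
  void frameworkMessage(ExecutorDriver*, const std::string&) override {}
  void shutdown(ExecutorDriver*) override {}
  void error(ExecutorDriver*, const std::string&) override {}
};

int main() {
  NoopExecutor executor;
  MesosExecutorDriver driver(&executor);

  Status status = driver.run();

  driver.stop();  // make shutdown explicit...
  driver.join();  // ...and wait for it to finish
  return status == DRIVER_STOPPED ? 0 : 1;
}  // ~MesosExecutorDriver() now runs on a fully stopped driver
{code}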

[jira] [Commented] (MESOS-6981) Allow disabling name based SSL checks

2017-01-26, Kevin Klues (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840706#comment-15840706
 ] 

Kevin Klues commented on MESOS-6981:


Till, can I assign this ticket to you and target it for 1.3?

> Allow disabling name based SSL checks
> -
>
> Key: MESOS-6981
> URL: https://issues.apache.org/jira/browse/MESOS-6981
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Kevin Cox
>  Labels: mesosphere, security
>
> Currently, if you want to use verified certificates, you need to enable 
> validation by hostname or IP. However, if you are running your own CA for 
> these certificates, it is often sufficient to verify solely based on the CA 
> signature.
> For example, if an admin wants to connect, it is a pain to make sure that 
> they always have a valid certificate for their IP or reverse DNS. It would be 
> nice if the admin could be given a certificate that is trusted no matter 
> where they connect from.
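
With raw OpenSSL, this mode falls out of what you *don't* configure: chain 
verification against the CA stays on, and no hostname/IP check is installed. A 
sketch (libprocess wires its SSL context differently; this just illustrates 
the verification split):

{code}
// Sketch: verify the peer certificate against our CA bundle, but install
// no hostname/IP check, so any cert signed by the CA is accepted
// regardless of where the peer connects from.
#include <openssl/ssl.h>

SSL_CTX* makeContext(const char* ca_file) {
  SSL_library_init();
  SSL_CTX* ctx = SSL_CTX_new(SSLv23_method());

  // Chain-of-trust verification stays on...
  SSL_CTX_load_verify_locations(ctx, ca_file, nullptr);
  SSL_CTX_set_verify(ctx, SSL_VERIFY_PEER, nullptr);

  // ...but no X509_VERIFY_PARAM_set1_host()/set1_ip() call is made,
  // which is the "CA signature only" mode this ticket asks to allow.
  return ctx;
}
{code}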



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-6588) LinuxRootfs misses required files

2017-01-26, Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15667754#comment-15667754
 ] 

Yan Xu edited comment on MESOS-6588 at 1/26/17 11:36 PM:
-

|Move containerizer Rootfs support to a cpp file. 
|[https://reviews.apache.org/r/53790|https://reviews.apache.org/r/53790] |
|Use the stout ELF parser to implement 
ldd.||
|Add some simple ldd() tests.||
|Use the stout ELF parser to collect Linux rootfs files. 
|[https://reviews.apache.org/r/53791|https://reviews.apache.org/r/53791] |


was (Author: jamespeach):
|Move containerizer Rootfs support to a cpp file. 
|[https://reviews.apache.org/r/53790|https://reviews.apache.org/r/53790] |
|Use the stout ELF parser to collect Linux rootfs files. 
|[https://reviews.apache.org/r/53791|https://reviews.apache.org/r/53791] |

> LinuxRootfs misses required files
> -
>
> Key: MESOS-6588
> URL: https://issues.apache.org/jira/browse/MESOS-6588
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, tests
>Reporter: James Peach
>Assignee: James Peach
>
> The hard-coded list of required files in 
> {{src/tests/containerizer/rootfs.hpp}} is out of date for Fedora 24. F24 now 
> requires {{libtinfo.so.6}} and {{/lib64/libcrypto.so.10}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6951) Docker containerizer: mangled environment when env value contains LF byte

2017-01-26, Jan-Philip Gehrcke (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840658#comment-15840658
 ] 

Jan-Philip Gehrcke commented on MESOS-6951:
---

I am currently looking into proposing a patch for this.

> Docker containerizer: mangled environment when env value contains LF byte
> -
>
> Key: MESOS-6951
> URL: https://issues.apache.org/jira/browse/MESOS-6951
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Jan-Philip Gehrcke
>
> Consider this Marathon app definition:
> {code}
> {
>   "id": "/testapp",
>   "cmd": "env && tail -f /dev/null",
>   "env":{
> "TESTVAR":"line1\nline2"
>   },
>   "cpus": 0.1,
>   "mem": 10,
>   "instances": 1,
>   "container": {
> "type": "DOCKER",
> "docker": {
>   "image": "alpine"
> }
>   }
> }
> {code}
> The JSON-encoded newline in the value of the {{TESTVAR}} environment variable 
> leads to a corrupted task environment. What follows is a subset of the 
> resulting task environment (as printed via {{env}}, i.e. in key=value 
> notation):
> {code}
> line2=
> TESTVAR=line1
> {code}
> That is, the trailing part of the intended value ended up being interpreted 
> as variable name, and only the leading part of the intended value was used as 
> actual value for {{TESTVAR}}.
> Common application scenarios that would badly break with that involve 
> pretty-printed JSON documents or YAML documents passed along via the 
> environment.
> Following the code and information flow led to the conclusion that Docker's 
> {{--env-file}} command line interface is the weak point in the flow. It is 
> currently used in Mesos' Docker containerizer for passing the environment to 
> the container:
> {code}
>   argv.push_back("--env-file");
>   argv.push_back(environmentFile);
> {code}
> (Ref: 
> [code|https://github.com/apache/mesos/blob/c0aee8cc10b1d1f4b2db5ff12b771372fdd5b1f3/src/docker/docker.cpp#L584])
> Docker's {{--env-file}} argument behavior is documented via
> {quote}
> The --env-file flag takes a filename as an argument
> and expects each line to be in the VAR=VAL format,
> {quote}
> (Ref: https://docs.docker.com/engine/reference/commandline/run/)
> That is, Docker identifies individual environment variable key/value pair 
> definitions based on newline bytes in that file which explains the observed 
> environment variable value fragmentation. Notably, Docker does not provide a 
> mechanism for escaping newline bytes in the values specified in this 
> environment file.
> I think it is important to understand that Docker's {{--env-file}} mechanism 
> is ill-posed in the sense that it is not capable of transmitting the whole 
> range of environment variable values allowed by POSIX. That's what the Single 
> UNIX Specification, Version 3 has to say about environment variable values:
> {quote}
> the value shall be composed of characters from the
> portable character set (except NUL and as indicated below). 
> {quote}
> (Ref: http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap08.html)
> About "The portable character set": 
> http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap06.html#tagtcjh_3
> It includes (among others) the LF byte. Understandably, the current Docker 
> {{--env-file}} behavior will not change, so this is not an issue that can be 
> deferred to Docker: https://github.com/docker/docker/issues/12997
> Notably, the {{--env-file}} method for communicating environment variables to 
> Docker containers was just recently introduced to Mesos as of 
> https://issues.apache.org/jira/browse/MESOS-6566, for not leaking secrets 
> through the process listing. Previously, we specified env key/value pairs on 
> the command line which leaked secrets to the process list and probably also 
> did not support the full range of valid environment variable values.
> We need a solution that
> 1) does not leak sensitive values (i.e. is compliant with MESOS-6566).
> 2) allows for passing arbitrary environment variable values.
> It seems that Docker's {{--env}} method can be used for that. It can be used 
> to define _just the names of the environment variables_ to-be-passed-along, 
> in which case the docker binary will read the corresponding values from its 
> own environment, which we can clearly prepare appropriately when we invoke 
> the corresponding child process. This method would still leak environment 
> variable _names_ to the process listing, but (especially if documented) this 
> should be fine.
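
A minimal sketch of that name-only scheme (not the eventual patch): the value, 
LF bytes and all, is placed in the parent process environment, and only the 
variable's *name* goes on the docker command line:

{code}
// Sketch: `docker run --env NAME` (no "=value") makes docker read the
// value from its own inherited environment, so LF bytes survive and no
// value appears in the process listing.
#include <cstdlib>
#include <string>
#include <vector>

int main() {
  const std::string name = "TESTVAR";
  const std::string value = "line1\nline2";  // would corrupt an --env-file

  ::setenv(name.c_str(), value.c_str(), 1);  // the docker child inherits this

  std::vector<std::string> argv = {"docker", "run"};
  argv.push_back("--env");
  argv.push_back(name);  // name only: no value on the command line
  argv.push_back("alpine");
  // ... exec argv; only TESTVAR's name leaks to the process listing
}
{code}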



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6989) Docker executor segfaults in ~MesosExecutorDriver()

2017-01-26, Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6989:
--
Priority: Blocker  (was: Major)

Moving it to blocker since it does result in a stack trace in the task's 
stdout. Note that our existing tests might not be catching this because they 
might not be validating that the executor's exit status code is non-zero for 
the docker/default executor.

> Docker executor segfaults in ~MesosExecutorDriver()
> ---
>
> Key: MESOS-6989
> URL: https://issues.apache.org/jira/browse/MESOS-6989
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Reporter: Jan-Philip Gehrcke
>Assignee: Joseph Wu
>Priority: Blocker
>  Labels: mesosphere
>
> With the current Mesos master state (commit 
> 42e515bc5c175a318e914d34473016feda4db6ff), the Docker executor segfaults 
> during shutdown. 
> Steps to reproduce:
> 1) Start master:
> {code}
> $ ./bin/mesos-master.sh --ip=127.0.0.1 --work_dir=/tmp/jp/mesos
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0125 13:41:15.963775 14744 main.cpp:278] Build: 2017-01-25 13:37:42 by jp
> I0125 13:41:15.963868 14744 main.cpp:279] Version: 1.2.0
> I0125 13:41:15.963877 14744 main.cpp:286] Git SHA: 
> 42e515bc5c175a318e914d34473016feda4db6ff
> {code}
> (note that building it at 13:37 is not part of the repro)
> 2) Start agent:
> {code}
> $ ./bin/mesos-slave.sh --containerizers=mesos,docker --master=127.0.0.1:5050 
> --work_dir=/tmp/jp/mesos
> {code}
> 3) Run {{mesos-execute}} with the Docker containerizer:
> {code}
> $ ./src/mesos-execute --master=127.0.0.1:5050 --name=testcommand 
> --containerizer=docker --docker_image=debian --command=env
> I0125 13:43:59.704973 14951 scheduler.cpp:184] Version: 1.2.0
> I0125 13:43:59.706425 14952 scheduler.cpp:470] New master detected at 
> master@127.0.0.1:5050
> Subscribed with ID 57596743-06f4-45f1-a975-348cf70589b1-
> Submitted task 'testcommand' to agent 
> '57596743-06f4-45f1-a975-348cf70589b1-S0'
> Received status update TASK_RUNNING for task 'testcommand'
>   source: SOURCE_EXECUTOR
> Received status update TASK_FINISHED for task 'testcommand'
>   message: 'Container exited with status 0'
>   source: SOURCE_EXECUTOR
> {code}
> Relevant agent output that shows the executor segfault:
> {code}
> [...]
> I0125 13:44:16.249191 14823 slave.cpp:4328] Got exited event for 
> executor(1)@192.99.40.208:33529
> I0125 13:44:16.347095 14830 docker.cpp:2358] Executor for container 
> 396282a9-7bf0-48ee-ba07-3ff2ca801d53 has exited
> I0125 13:44:16.347127 14830 docker.cpp:2052] Destroying container 
> 396282a9-7bf0-48ee-ba07-3ff2ca801d53
> I0125 13:44:16.347439 14830 docker.cpp:2179] Running docker stop on container 
> 396282a9-7bf0-48ee-ba07-3ff2ca801d53
> I0125 13:44:16.349215 14826 slave.cpp:4691] Executor 'testcommand' of 
> framework 57596743-06f4-45f1-a975-348cf70589b1- terminated with signal 
> Segmentation fault (core dumped)
> [...]
> {code}
> The complete task stderr:
> {code}
> $ cat 
> /tmp/jp/mesos/slaves/57596743-06f4-45f1-a975-348cf70589b1-S0/frameworks/57596743-06f4-45f1-a975-348cf70589b1-/executors/testcommand/runs/latest/stderr
>  
> I0125 13:44:12.850073 15030 exec.cpp:162] Version: 1.2.0
> I0125 13:44:12.864229 15050 exec.cpp:237] Executor registered on agent 
> 57596743-06f4-45f1-a975-348cf70589b1-S0
> I0125 13:44:12.865842 15054 docker.cpp:850] Running docker -H 
> unix:///var/run/docker.sock run --cpu-shares 1024 --memory 134217728 
> --env-file /tmp/xFZ8G9 -v 
> /tmp/jp/mesos/slaves/57596743-06f4-45f1-a975-348cf70589b1-S0/frameworks/57596743-06f4-45f1-a975-348cf70589b1-/executors/testcommand/runs/396282a9-7bf0-48ee-ba07-3ff2ca801d53:/mnt/mesos/sandbox
>  --net host --entrypoint /bin/sh --name 
> mesos-57596743-06f4-45f1-a975-348cf70589b1-S0.396282a9-7bf0-48ee-ba07-3ff2ca801d53
>  debian -c env
> I0125 13:44:15.248721 15064 exec.cpp:410] Executor asked to shutdown
> *** Aborted at 1485369856 (unix time) try "date -d @1485369856" if you are 
> using GNU date ***
> PC: @ 0x7fb38f153dd0 (unknown)
> *** SIGSEGV (@0x68) received by PID 15030 (TID 0x7fb3961a88c0) from PID 104; 
> stack trace: ***
> @ 0x7fb38f15b5c0 (unknown)
> @ 0x7fb38f153dd0 (unknown)
> @ 0x7fb39332c607 __gthread_mutex_lock()
> @ 0x7fb39332c657 __gthread_recursive_mutex_lock()
> @ 0x7fb39332edca std::recursive_mutex::lock()
> @ 0x7fb393337bd8 
> _ZZ11synchronizeISt15recursive_mutexE12SynchronizedIT_EPS2_ENKUlPS0_E_clES5_
> @ 0x7fb393337bf8 
> _ZZ11synchronizeISt15recursive_mutexE12SynchronizedIT_EPS2_ENUlPS0_E_4_FUNES5_
> @ 0x7fb39333ba6b Synchronized<>::Synchronized()
> @ 0x7fb393337cac synchronize<>()
> @ 0x7fb39492f15c process::ProcessManager::wait()
> @ 

[jira] [Created] (MESOS-7019) SCRAM authentication.

2017-01-26, James Peach (JIRA)
James Peach created MESOS-7019:
--

 Summary: SCRAM authentication.
 Key: MESOS-7019
 URL: https://issues.apache.org/jira/browse/MESOS-7019
 Project: Mesos
  Issue Type: Improvement
Reporter: James Peach


Add support for the SCRAM authentication method, [RFC 5802 | 
https://tools.ietf.org/html/rfc5802] is the SASL mechanism and [RFC 7804 | 
https://tools.ietf.org/html/rfc7804 ] is the equivalent HTTP authentication 
mechanism.

SCRAM is a very simple digest-style authentication mechanism that has both a 
strong digest scheme and mutual authentication. The server is not required to 
have the cleartext passwords. It is suitable for use with both the agent 
authentication API and the HTTP authentication API.
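
As a sketch of what the server stores, here is the key derivation from RFC 
5802, section 3 (SCRAM-SHA-1 variant) written out with OpenSSL: the server 
keeps only {{StoredKey}} and {{ServerKey}}, never the cleartext password:

{code}
// Sketch of the RFC 5802 key derivation using OpenSSL one-shot helpers.
#include <openssl/evp.h>
#include <openssl/hmac.h>
#include <openssl/sha.h>
#include <string>

struct ScramKeys {
  unsigned char storedKey[SHA_DIGEST_LENGTH];
  unsigned char serverKey[SHA_DIGEST_LENGTH];
};

ScramKeys derive(const std::string& password,
                 const unsigned char* salt, int saltLen, int iterations) {
  unsigned char salted[SHA_DIGEST_LENGTH];
  // SaltedPassword := Hi(password, salt, i), i.e. PBKDF2-HMAC-SHA-1.
  PKCS5_PBKDF2_HMAC_SHA1(password.data(), password.size(),
                         salt, saltLen, iterations,
                         sizeof(salted), salted);

  unsigned char clientKey[SHA_DIGEST_LENGTH];
  unsigned int len = 0;
  // ClientKey := HMAC(SaltedPassword, "Client Key")
  HMAC(EVP_sha1(), salted, sizeof(salted),
       reinterpret_cast<const unsigned char*>("Client Key"), 10,
       clientKey, &len);

  ScramKeys keys;
  // StoredKey := H(ClientKey): all the server needs to verify a proof.
  SHA1(clientKey, sizeof(clientKey), keys.storedKey);
  // ServerKey := HMAC(SaltedPassword, "Server Key"): for mutual auth.
  HMAC(EVP_sha1(), salted, sizeof(salted),
       reinterpret_cast<const unsigned char*>("Server Key"), 10,
       keys.serverKey, &len);
  return keys;
}
{code}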



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6935) Operator API to get current frameworks only.

2017-01-26, James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840548#comment-15840548
 ] 

James Peach commented on MESOS-6935:


Same problem for tasks. Mostly, the set of completed tasks is large and 
uninteresting.

> Operator API to get current frameworks only.
> 
>
> Key: MESOS-6935
> URL: https://issues.apache.org/jira/browse/MESOS-6935
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: James Peach
>
> The master {{GET_FRAMEWORKS}} operator API always returns both the current 
> frameworks and the {{completed_frameworks}}. Since the set of 
> {{completed_frameworks}} can be very large and is often not wanted, it would 
> be helpful if there were a way to exclude those.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-7017) HTTP API responses can crash the master.

2017-01-26, James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840546#comment-15840546
 ] 

James Peach commented on MESOS-7017:


Need to verify, but experimentally it looks like the whole response is buffered 
in memory before sending anything. We ought to stream the response.
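
A generic illustration of the streaming shape (this is not the libprocess 
API): emit the response as HTTP/1.1 chunked transfer encoding, one record per 
chunk, so peak memory is one record rather than the whole serialized response:

{code}
// Sketch: write each record directly as its own HTTP chunk instead of
// serializing the entire response into one huge buffer first.
#include <cstdio>
#include <string>
#include <vector>

void writeChunk(const std::string& data) {
  std::printf("%zx\r\n", data.size());          // chunk size in hex
  std::fwrite(data.data(), 1, data.size(), stdout);
  std::fputs("\r\n", stdout);
}

int main() {
  std::fputs("HTTP/1.1 200 OK\r\nTransfer-Encoding: chunked\r\n\r\n", stdout);
  std::vector<std::string> tasks = {"{\"id\":\"t1\"}", "{\"id\":\"t2\"}"};
  for (const std::string& task : tasks)
    writeChunk(task);               // each task is flushed as its own chunk
  std::fputs("0\r\n\r\n", stdout);  // terminating chunk
}
{code}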

> HTTP API responses can crash the master.
> 
>
> Key: MESOS-7017
> URL: https://issues.apache.org/jira/browse/MESOS-7017
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Reporter: James Peach
>
> The master can crash when generating large responses to small API requests. 
> One manifestation of this is querying the tasks.
> {noformat}
> [libprotobuf ERROR google/protobuf/io/coded_stream.cc:180] A protocol message 
> was rejected because it was too big (more than 67108864 bytes).  To increase 
> the limit (or to disable these warnings), see 
> CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
> F0126 18:34:18.790386 26230 evolve.cpp:63] Check failed: 
> t.ParsePartialFromString(data) Failed to parse mesos.v1.master.Response while 
> evolving from mesos.master.Response
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-7017) HTTP API responses can crash the master.

2017-01-26, James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach updated MESOS-7017:
---
Summary: HTTP API responses can crash the master.  (was: HTTP API responses 
can crash the master)

> HTTP API responses can crash the master.
> 
>
> Key: MESOS-7017
> URL: https://issues.apache.org/jira/browse/MESOS-7017
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Reporter: James Peach
>
> The master can crash when generating large responses to small API requests. 
> One manifestation of this is querying the tasks.
> {noformat}
> [libprotobuf ERROR google/protobuf/io/coded_stream.cc:180] A protocol message 
> was rejected because it was too big (more than 67108864 bytes).  To increase 
> the limit (or to disable these warnings), see 
> CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
> F0126 18:34:18.790386 26230 evolve.cpp:63] Check failed: 
> t.ParsePartialFromString(data) Failed to parse mesos.v1.master.Response while 
> evolving from mesos.master.Response
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-7018) Move historical data out of memory.

2017-01-26, James Peach (JIRA)
James Peach created MESOS-7018:
--

 Summary: Move historical data out of memory.
 Key: MESOS-7018
 URL: https://issues.apache.org/jira/browse/MESOS-7018
 Project: Mesos
  Issue Type: Bug
Reporter: James Peach


There is a bunch of history (e.g., completed tasks, completed frameworks) that 
is kept in memory. This information is not commonly needed, so keeping it in 
memory is wasteful, and it also limits the amount of history you can keep.

If we spooled this history to disk, we could keep a much longer history at 
lower cost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-7017) HTTP API responses can crash the master

2017-01-26, James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach updated MESOS-7017:
---
Component/s: HTTP API

> HTTP API responses can crash the master
> ---
>
> Key: MESOS-7017
> URL: https://issues.apache.org/jira/browse/MESOS-7017
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Reporter: James Peach
>
> The master can crash when generating large responses to small API requests. 
> One manifestation of this is querying the tasks.
> {noformat}
> [libprotobuf ERROR google/protobuf/io/coded_stream.cc:180] A protocol message 
> was rejected because it was too big (more than 67108864 bytes).  To increase 
> the limit (or to disable these warnings), see 
> CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
> F0126 18:34:18.790386 26230 evolve.cpp:63] Check failed: 
> t.ParsePartialFromString(data) Failed to parse mesos.v1.master.Response while 
> evolving from mesos.master.Response
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-7017) HTTP API responses can crash the master

2017-01-26, James Peach (JIRA)
James Peach created MESOS-7017:
--

 Summary: HTTP API responses can crash the master
 Key: MESOS-7017
 URL: https://issues.apache.org/jira/browse/MESOS-7017
 Project: Mesos
  Issue Type: Bug
Reporter: James Peach


The master can crash when generating large responses to small API requests. One 
manifestation of this is querying the tasks.

{noformat}
[libprotobuf ERROR google/protobuf/io/coded_stream.cc:180] A protocol message 
was rejected because it was too big (more than 67108864 bytes).  To increase 
the limit (or to disable these warnings), see 
CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
F0126 18:34:18.790386 26230 evolve.cpp:63] Check failed: 
t.ParsePartialFromString(data) Failed to parse mesos.v1.master.Response while 
evolving from mesos.master.Response
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-7016) Make default AWAIT_* duration configurable

2017-01-26, Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-7016:

Summary: Make default AWAIT_* duration configurable  (was: Make default 
AWAIT_* duration)

> Make default AWAIT_* duration configurable
> --
>
> Key: MESOS-7016
> URL: https://issues.apache.org/jira/browse/MESOS-7016
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess, test
>Reporter: Benjamin Bannier
>
> libprocess defines a number of {{AWAIT_*}} helpers to wait for a 
> {{process::Future}} to reach a terminal state. These helpers are used in tests.
> Currently the default duration to wait before triggering an assertion failure 
> is 15s. This value was chosen as a compromise between failing fast on likely 
> fast developer machines and allowing enough time for tests to pass in 
> high-contention environments (e.g., overbooked CI machines).
> If a machine is more overloaded than expected, {{Futures}} might take longer 
> to reach the desired state, and tests could fail. Ultimately we should 
> consider running tests with a paused clock to eliminate this source of test 
> flakiness (see MESOS-4101), but as an intermediate measure we should make the 
> default timeout duration configurable.
> A simple approach might be to expose a build variable allowing users to set 
> at configure/cmake time a desired timeout duration for the setup they are 
> building for. This would allow us to define longer timeouts in the CI build 
> scripts, while keeping default timeouts as short as possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-7016) Make default AWAIT_* duration

2017-01-26, Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-7016:
---

 Summary: Make default AWAIT_* duration
 Key: MESOS-7016
 URL: https://issues.apache.org/jira/browse/MESOS-7016
 Project: Mesos
  Issue Type: Improvement
  Components: libprocess, test
Reporter: Benjamin Bannier


libprocess defines a number of {{AWAIT_*}} helpers to wait for a 
{{process::Future}} to reach a terminal state. These helpers are used in tests.

Currently the default duration to wait before triggering an assertion failure 
is 15s. This value was chosen as a compromise between failing fast on likely 
fast developer machines and allowing enough time for tests to pass in 
high-contention environments (e.g., overbooked CI machines).

If a machine is more overloaded than expected, {{Futures}} might take longer to 
reach the desired state, and tests could fail. Ultimately we should consider 
running tests with a paused clock to eliminate this source of test flakiness 
(see MESOS-4101), but as an intermediate measure we should make the default 
timeout duration configurable.

A simple approach might be to expose a build variable allowing users to set at 
configure/cmake time a desired timeout duration for the setup they are building 
for. This would allow us to define longer timeouts in the CI build scripts, 
while keeping default timeouts as short as possible.
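
One possible shape for that knob, as a sketch (the macro name is hypothetical, 
and libprocess would use its own {{Duration}} type rather than 
{{std::chrono}}): a preprocessor default that configure/CMake can override, 
e.g. with {{-DDEFAULT_AWAIT_TIMEOUT_SECS=120}} on overbooked CI machines:

{code}
// Hypothetical build-time knob: the default comes from a -D flag that
// ./configure or CMake can set per environment.
#include <chrono>

#ifndef DEFAULT_AWAIT_TIMEOUT_SECS
#define DEFAULT_AWAIT_TIMEOUT_SECS 15  // today's hard-coded 15s default
#endif

inline std::chrono::seconds defaultAwaitTimeout() {
  return std::chrono::seconds(DEFAULT_AWAIT_TIMEOUT_SECS);
}
{code}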



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6953) A compromised mesos-master node can execute code as root on agents.

2017-01-26, Anindya Sinha (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840529#comment-15840529
 ] 

Anindya Sinha commented on MESOS-6953:
--

The main motivation here is to not allow agents to run tasks with {{root}} 
privileges that could do bad things on the agent. However, I agree we can 
extend this to other operations such as {{TEARDOWN_FRAMEWORK}}, and maybe to 
{{CREATE_VOLUME}} and {{DESTROY_VOLUME}} as well, in addition to the launching 
of tasks.

If we decide to do a long-term solution, that can be tracked separately. This 
ticket captures what we can do right now to protect against the said scenario.
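
A sketch of the agent-side guard being discussed (the flag name is 
hypothetical, and the real check would live in the agent's task validation 
path):

{code}
// Hypothetical agent-side guard: reject launch requests whose effective
// user is root unless the operator explicitly allowed root tasks.
#include <string>

struct Flags {
  bool allow_root_tasks = false;  // hypothetical agent flag
};

// Returns an error message, or "" if the task user is acceptable.
std::string validateTaskUser(const Flags& flags, const std::string& user) {
  if (user == "root" && !flags.allow_root_tasks) {
    return "Running tasks as 'root' is disallowed on this agent";
  }
  return "";
}
{code}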

> A compromised mesos-master node can execute code as root on agents.
> ---
>
> Key: MESOS-6953
> URL: https://issues.apache.org/jira/browse/MESOS-6953
> Project: Mesos
>  Issue Type: Bug
>  Components: security
>Reporter: Anindya Sinha
>Assignee: Anindya Sinha
>  Labels: security, slave
>
> mesos-master has a `--[no-]root_submissions` flag that controls whether 
> frameworks with the `root` user are admitted to the cluster.
> However, if a mesos-master node is compromised, it can attempt to schedule 
> tasks on an agent as the `root` user. Since mesos-agent has no check against 
> tasks running as specific users, tasks can end up running with `root` 
> privileges within the container on the agent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-6989) Docker executor segfaults in ~MesosExecutorDriver()

2017-01-26, Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-6989:


Assignee: Joseph Wu

> Docker executor segfaults in ~MesosExecutorDriver()
> ---
>
> Key: MESOS-6989
> URL: https://issues.apache.org/jira/browse/MESOS-6989
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Reporter: Jan-Philip Gehrcke
>Assignee: Joseph Wu
>
> With the current Mesos master state (commit 
> 42e515bc5c175a318e914d34473016feda4db6ff), the Docker executor segfaults 
> during shutdown. 
> Steps to reproduce:
> 1) Start master:
> {code}
> $ ./bin/mesos-master.sh --ip=127.0.0.1 --work_dir=/tmp/jp/mesos
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0125 13:41:15.963775 14744 main.cpp:278] Build: 2017-01-25 13:37:42 by jp
> I0125 13:41:15.963868 14744 main.cpp:279] Version: 1.2.0
> I0125 13:41:15.963877 14744 main.cpp:286] Git SHA: 
> 42e515bc5c175a318e914d34473016feda4db6ff
> {code}
> (note that building it at 13:37 is not part of the repro)
> 2) Start agent:
> {code}
> $ ./bin/mesos-slave.sh --containerizers=mesos,docker --master=127.0.0.1:5050 
> --work_dir=/tmp/jp/mesos
> {code}
> 3) Run {{mesos-execute}} with the Docker containerizer:
> {code}
> $ ./src/mesos-execute --master=127.0.0.1:5050 --name=testcommand 
> --containerizer=docker --docker_image=debian --command=env
> I0125 13:43:59.704973 14951 scheduler.cpp:184] Version: 1.2.0
> I0125 13:43:59.706425 14952 scheduler.cpp:470] New master detected at 
> master@127.0.0.1:5050
> Subscribed with ID 57596743-06f4-45f1-a975-348cf70589b1-
> Submitted task 'testcommand' to agent 
> '57596743-06f4-45f1-a975-348cf70589b1-S0'
> Received status update TASK_RUNNING for task 'testcommand'
>   source: SOURCE_EXECUTOR
> Received status update TASK_FINISHED for task 'testcommand'
>   message: 'Container exited with status 0'
>   source: SOURCE_EXECUTOR
> {code}
> Relevant agent output that shows the executor segfault:
> {code}
> [...]
> I0125 13:44:16.249191 14823 slave.cpp:4328] Got exited event for 
> executor(1)@192.99.40.208:33529
> I0125 13:44:16.347095 14830 docker.cpp:2358] Executor for container 
> 396282a9-7bf0-48ee-ba07-3ff2ca801d53 has exited
> I0125 13:44:16.347127 14830 docker.cpp:2052] Destroying container 
> 396282a9-7bf0-48ee-ba07-3ff2ca801d53
> I0125 13:44:16.347439 14830 docker.cpp:2179] Running docker stop on container 
> 396282a9-7bf0-48ee-ba07-3ff2ca801d53
> I0125 13:44:16.349215 14826 slave.cpp:4691] Executor 'testcommand' of 
> framework 57596743-06f4-45f1-a975-348cf70589b1- terminated with signal 
> Segmentation fault (core dumped)
> [...]
> {code}
> The complete task stderr:
> {code}
> $ cat 
> /tmp/jp/mesos/slaves/57596743-06f4-45f1-a975-348cf70589b1-S0/frameworks/57596743-06f4-45f1-a975-348cf70589b1-/executors/testcommand/runs/latest/stderr
>  
> I0125 13:44:12.850073 15030 exec.cpp:162] Version: 1.2.0
> I0125 13:44:12.864229 15050 exec.cpp:237] Executor registered on agent 
> 57596743-06f4-45f1-a975-348cf70589b1-S0
> I0125 13:44:12.865842 15054 docker.cpp:850] Running docker -H 
> unix:///var/run/docker.sock run --cpu-shares 1024 --memory 134217728 
> --env-file /tmp/xFZ8G9 -v 
> /tmp/jp/mesos/slaves/57596743-06f4-45f1-a975-348cf70589b1-S0/frameworks/57596743-06f4-45f1-a975-348cf70589b1-/executors/testcommand/runs/396282a9-7bf0-48ee-ba07-3ff2ca801d53:/mnt/mesos/sandbox
>  --net host --entrypoint /bin/sh --name 
> mesos-57596743-06f4-45f1-a975-348cf70589b1-S0.396282a9-7bf0-48ee-ba07-3ff2ca801d53
>  debian -c env
> I0125 13:44:15.248721 15064 exec.cpp:410] Executor asked to shutdown
> *** Aborted at 1485369856 (unix time) try "date -d @1485369856" if you are 
> using GNU date ***
> PC: @ 0x7fb38f153dd0 (unknown)
> *** SIGSEGV (@0x68) received by PID 15030 (TID 0x7fb3961a88c0) from PID 104; 
> stack trace: ***
> @ 0x7fb38f15b5c0 (unknown)
> @ 0x7fb38f153dd0 (unknown)
> @ 0x7fb39332c607 __gthread_mutex_lock()
> @ 0x7fb39332c657 __gthread_recursive_mutex_lock()
> @ 0x7fb39332edca std::recursive_mutex::lock()
> @ 0x7fb393337bd8 
> _ZZ11synchronizeISt15recursive_mutexE12SynchronizedIT_EPS2_ENKUlPS0_E_clES5_
> @ 0x7fb393337bf8 
> _ZZ11synchronizeISt15recursive_mutexE12SynchronizedIT_EPS2_ENUlPS0_E_4_FUNES5_
> @ 0x7fb39333ba6b Synchronized<>::Synchronized()
> @ 0x7fb393337cac synchronize<>()
> @ 0x7fb39492f15c process::ProcessManager::wait()
> @ 0x7fb3949353f0 process::wait()
> @ 0x55fd63f31fe5 process::wait()
> @ 0x7fb39332ce3c mesos::MesosExecutorDriver::~MesosExecutorDriver()
> @ 0x55fd63f2bd86 main
> @ 0x7fb38e4fc401 __libc_start_main
> @ 0x55fd63f2ab5a _start
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-7008) Quota not recovered from registry in empty cluster

2017-01-26, Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-7008:
---
Summary: Quota not recovered from registry in empty cluster  (was: 
Incomplete recovery of roles leading to fatal CHECK failure)

> Quota not recovered from registry in empty cluster
> --
>
> Key: MESOS-7008
> URL: https://issues.apache.org/jira/browse/MESOS-7008
> Project: Mesos
>  Issue Type: Bug
>  Components: master
> Environment: OS X, SSL build
>Reporter: Benjamin Bannier
>Assignee: Neil Conway
>  Labels: quota, roles
>
> When a quota was set and the master is restarted, removal of the quota 
> reliably leads to a {{CHECK}} failure for me.
> Start a master:
> {code}
> $ mesos-master --work_dir=work_dir
> {code}
> Set a quota. This creates an implicit role.
> {code}
> $ cat quota.json
> {
> "role": "role2",
> "force": true,
> "guarantee": [
> {
> "name": "cpus",
> "type": "SCALAR",
> "scalar": { "value": 1 }
> }
> ]
> }
> $ cat quota.json| http POST :5050/quota
> HTTP/1.1 200 OK
> Content-Length: 0
> Date: Thu, 26 Jan 2017 12:33:38 GMT
> $ http GET :5050/quota
> HTTP/1.1 200 OK
> Content-Length: 108
> Content-Type: application/json
> Date: Thu, 26 Jan 2017 12:33:56 GMT
> {
> "infos": [
> {
> "guarantee": [
> {
> "name": "cpus",
> "role": "*",
> "scalar": {
> "value": 1.0
> },
> "type": "SCALAR"
> }
> ],
> "role": "role2"
> }
> ]
> }
> $ http GET :5050/roles
> HTTP/1.1 200 OK
> Content-Length: 106
> Content-Type: application/json
> Date: Thu, 26 Jan 2017 12:34:10 GMT
> {
> "roles": [
> {
> "frameworks": [],
> "name": "role2",
> "resources": {
> "cpus": 0,
> "disk": 0,
> "gpus": 0,
> "mem": 0
> },
> "weight": 1.0
> }
> ]
> }
> {code}
> Restart the master process using the same {{work_dir}} and attempt to delete 
> the quota after the master is started. The {{DELETE}} succeeds with an {{OK}}.
> {code}
> $ http DELETE :5050/quota/role2
> HTTP/1.1 200 OK
> Content-Length: 0
> Date: Thu, 26 Jan 2017 12:36:04 GMT
> {code}
> After handling the request, the master hits a {{CHECK}} failure and is 
> aborted.
> {code}
> $ mesos-master --work_dir=work_dir
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0126 13:34:57.528599 3145483200 main.cpp:278] Build: 2017-01-23 07:57:34 by 
> bbannier
> I0126 13:34:57.529131 3145483200 main.cpp:279] Version: 1.2.0
> I0126 13:34:57.529139 3145483200 main.cpp:286] Git SHA: 
> dd07d025d40975ec660ed17031d95ec0dba842d2
> [warn] kq_init: detected broken kqueue; not using.: No such process
> I0126 13:34:57.758896 3145483200 main.cpp:385] Using 'HierarchicalDRF' 
> allocator
> I0126 13:34:57.764276 3145483200 replica.cpp:778] Replica recovered with log 
> positions 3 -> 4 with 0 holes and 0 unlearned
> I0126 13:34:57.765278 256114688 recover.cpp:451] Starting replica recovery
> I0126 13:34:57.765547 256114688 recover.cpp:477] Replica is in VOTING status
> I0126 13:34:57.795964 257187840 master.cpp:383] Master 
> 569073cc-1195-45e9-b0d4-e2e1bf0d13d5 (172.18.9.56) started on 172.18.9.56:5050
> I0126 13:34:57.796023 257187840 master.cpp:385] Flags at startup: 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="false" --authenticate_frameworks="false" 
> --authenticate_http_frameworks="false" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticators="crammd5" 
> --authorizers="local" --framework_sorter="drf" --help="false" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="20secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" --work_dir="work_dir" 
> --zk_session_timeout="10secs"
> I0126 13:34:57.796478 257187840 master.cpp:437] Master allowing 
> 

[jira] [Updated] (MESOS-7008) Incomplete recovery of roles leading to fatal CHECK failure

2017-01-26, Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-7008:
---
Shepherd: Alexander Rukletsov

The likely culprit is that we don't set quota when recovering the allocator if 
{{expectedAgentCount}} is zero:

https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.cpp#L196
https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.cpp#L211
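
Distilled to its control-flow shape (this is a toy, not the actual allocator 
code), the suspicion is an early return that skips applying recovered quotas 
when no agents are expected back:

{code}
// Toy shape of the suspected bug: the quota map is only applied on the
// branch that waits for agents to reregister, so an empty cluster
// (expectedAgentCount == 0) skips it entirely.
#include <map>
#include <string>

struct Allocator {
  std::map<std::string, double> quotas;  // role -> cpus guarantee (toy)

  void recover(int expectedAgentCount,
               const std::map<std::string, double>& recovered) {
    if (expectedAgentCount == 0) {
      return;  // BUG: `recovered` never reaches `quotas` on this path
    }
    quotas = recovered;  // only reached when agents are expected
  }
};
{code}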

> Incomplete recovery of roles leading to fatal CHECK failure
> ---
>
> Key: MESOS-7008
> URL: https://issues.apache.org/jira/browse/MESOS-7008
> Project: Mesos
>  Issue Type: Bug
>  Components: master
> Environment: OS X, SSL build
>Reporter: Benjamin Bannier
>Assignee: Neil Conway
>  Labels: quota, roles
>
> When a quota was set and the master is restarted, removal of the quota 
> reliably leads to a {{CHECK}} failure for me.
> Start a master:
> {code}
> $ mesos-master --work_dir=work_dir
> {code}
> Set a quota. This creates an implicit role.
> {code}
> $ cat quota.json
> {
> "role": "role2",
> "force": true,
> "guarantee": [
> {
> "name": "cpus",
> "type": "SCALAR",
> "scalar": { "value": 1 }
> }
> ]
> }
> $ cat quota.json| http POST :5050/quota
> HTTP/1.1 200 OK
> Content-Length: 0
> Date: Thu, 26 Jan 2017 12:33:38 GMT
> $ http GET :5050/quota
> HTTP/1.1 200 OK
> Content-Length: 108
> Content-Type: application/json
> Date: Thu, 26 Jan 2017 12:33:56 GMT
> {
> "infos": [
> {
> "guarantee": [
> {
> "name": "cpus",
> "role": "*",
> "scalar": {
> "value": 1.0
> },
> "type": "SCALAR"
> }
> ],
> "role": "role2"
> }
> ]
> }
> $ http GET :5050/roles
> HTTP/1.1 200 OK
> Content-Length: 106
> Content-Type: application/json
> Date: Thu, 26 Jan 2017 12:34:10 GMT
> {
> "roles": [
> {
> "frameworks": [],
> "name": "role2",
> "resources": {
> "cpus": 0,
> "disk": 0,
> "gpus": 0,
> "mem": 0
> },
> "weight": 1.0
> }
> ]
> }
> {code}
> Restart the master process using the same {{work_dir}} and attempt to delete 
> the quota after the master is started. The {{DELETE}} succeeds with an {{OK}}.
> {code}
> $ http DELETE :5050/quota/role2
> HTTP/1.1 200 OK
> Content-Length: 0
> Date: Thu, 26 Jan 2017 12:36:04 GMT
> {code}
> After handling the request, the master hits a {{CHECK}} failure and is 
> aborted.
> {code}
> $ mesos-master --work_dir=work_dir
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0126 13:34:57.528599 3145483200 main.cpp:278] Build: 2017-01-23 07:57:34 by 
> bbannier
> I0126 13:34:57.529131 3145483200 main.cpp:279] Version: 1.2.0
> I0126 13:34:57.529139 3145483200 main.cpp:286] Git SHA: 
> dd07d025d40975ec660ed17031d95ec0dba842d2
> [warn] kq_init: detected broken kqueue; not using.: No such process
> I0126 13:34:57.758896 3145483200 main.cpp:385] Using 'HierarchicalDRF' 
> allocator
> I0126 13:34:57.764276 3145483200 replica.cpp:778] Replica recovered with log 
> positions 3 -> 4 with 0 holes and 0 unlearned
> I0126 13:34:57.765278 256114688 recover.cpp:451] Starting replica recovery
> I0126 13:34:57.765547 256114688 recover.cpp:477] Replica is in VOTING status
> I0126 13:34:57.795964 257187840 master.cpp:383] Master 
> 569073cc-1195-45e9-b0d4-e2e1bf0d13d5 (172.18.9.56) started on 172.18.9.56:5050
> I0126 13:34:57.796023 257187840 master.cpp:385] Flags at startup: 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="false" --authenticate_frameworks="false" 
> --authenticate_http_frameworks="false" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticators="crammd5" 
> --authorizers="local" --framework_sorter="drf" --help="false" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="20secs" --registry_strict="false" 
> 

[jira] [Assigned] (MESOS-7008) Incomplete recovery of roles leading to fatal CHECK failure

2017-01-26 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway reassigned MESOS-7008:
--

Assignee: Neil Conway

> Incomplete recovery of roles leading to fatal CHECK failure
> ---
>
> Key: MESOS-7008
> URL: https://issues.apache.org/jira/browse/MESOS-7008
> Project: Mesos
>  Issue Type: Bug
>  Components: master
> Environment: OS X, SSL build
>Reporter: Benjamin Bannier
>Assignee: Neil Conway
>  Labels: quota, roles
>
> When a quota was set and the master is restarted, removal of the quota 
> reliably leads to a {{CHECK}} failure for me.
> Start a master:
> {code}
> $ mesos-master --work_dir=work_dir
> {code}
> Set a quota. This creates an implicit role.
> {code}
> $ cat quota.json
> {
> "role": "role2",
> "force": true,
> "guarantee": [
> {
> "name": "cpus",
> "type": "SCALAR",
> "scalar": { "value": 1 }
> }
> ]
> }
> $ cat quota.json| http POST :5050/quota
> HTTP/1.1 200 OK
> Content-Length: 0
> Date: Thu, 26 Jan 2017 12:33:38 GMT
> $ http GET :5050/quota
> HTTP/1.1 200 OK
> Content-Length: 108
> Content-Type: application/json
> Date: Thu, 26 Jan 2017 12:33:56 GMT
> {
> "infos": [
> {
> "guarantee": [
> {
> "name": "cpus",
> "role": "*",
> "scalar": {
> "value": 1.0
> },
> "type": "SCALAR"
> }
> ],
> "role": "role2"
> }
> ]
> }
> $ http GET :5050/roles
> HTTP/1.1 200 OK
> Content-Length: 106
> Content-Type: application/json
> Date: Thu, 26 Jan 2017 12:34:10 GMT
> {
> "roles": [
> {
> "frameworks": [],
> "name": "role2",
> "resources": {
> "cpus": 0,
> "disk": 0,
> "gpus": 0,
> "mem": 0
> },
> "weight": 1.0
> }
> ]
> }
> {code}
> Restart the master process using the same {{work_dir}} and attempt to delete 
> the quota after the master is started. The {{DELETE}} succeeds with an {{OK}}.
> {code}
> $ http DELETE :5050/quota/role2
> HTTP/1.1 200 OK
> Content-Length: 0
> Date: Thu, 26 Jan 2017 12:36:04 GMT
> {code}
> After handling the request, the master hits a {{CHECK}} failure and is 
> aborted.
> {code}
> $ mesos-master --work_dir=work_dir
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0126 13:34:57.528599 3145483200 main.cpp:278] Build: 2017-01-23 07:57:34 by 
> bbannier
> I0126 13:34:57.529131 3145483200 main.cpp:279] Version: 1.2.0
> I0126 13:34:57.529139 3145483200 main.cpp:286] Git SHA: 
> dd07d025d40975ec660ed17031d95ec0dba842d2
> [warn] kq_init: detected broken kqueue; not using.: No such process
> I0126 13:34:57.758896 3145483200 main.cpp:385] Using 'HierarchicalDRF' 
> allocator
> I0126 13:34:57.764276 3145483200 replica.cpp:778] Replica recovered with log 
> positions 3 -> 4 with 0 holes and 0 unlearned
> I0126 13:34:57.765278 256114688 recover.cpp:451] Starting replica recovery
> I0126 13:34:57.765547 256114688 recover.cpp:477] Replica is in VOTING status
> I0126 13:34:57.795964 257187840 master.cpp:383] Master 
> 569073cc-1195-45e9-b0d4-e2e1bf0d13d5 (172.18.9.56) started on 172.18.9.56:5050
> I0126 13:34:57.796023 257187840 master.cpp:385] Flags at startup: 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="false" --authenticate_frameworks="false" 
> --authenticate_http_frameworks="false" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticators="crammd5" 
> --authorizers="local" --framework_sorter="drf" --help="false" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="20secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" --work_dir="work_dir" 
> --zk_session_timeout="10secs"
> I0126 13:34:57.796478 257187840 master.cpp:437] Master allowing 
> unauthenticated frameworks to register
> I0126 13:34:57.796507 257187840 

[jira] [Created] (MESOS-7015) Frameworks should be able to (re)register in suppressed state

2017-01-26 Thread Anindya Sinha (JIRA)
Anindya Sinha created MESOS-7015:


 Summary: Frameworks should be able to (re)register in suppressed 
state
 Key: MESOS-7015
 URL: https://issues.apache.org/jira/browse/MESOS-7015
 Project: Mesos
  Issue Type: Improvement
  Components: allocation, framework
Reporter: Anindya Sinha


We should consider allowing frameworks to specify their "suppressed mode" when 
they register or re-register with the Mesos master.
This should help keep traffic and load on the cluster low, especially when 
there is a high number of frameworks and/or agents in the cluster during 
failovers.
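
A hedged sketch of what this could look like in the v1 scheduler API (the 
{{suppressed}} field below is purely illustrative; the actual field name and 
placement are exactly what this ticket needs to decide):
{code}
{
  "type": "SUBSCRIBE",
  "subscribe": {
    "framework_info": {
      "user": "foo",
      "name": "example-framework",
      "role": "role1"
    },
    "suppressed": true
  }
}
{code}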



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6999) Add agent flag to generate and pass executor secrets

2017-01-26 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-6999:
-
Description: 
A new agent flag {{--generate_executor_secrets}} is needed to support executor 
authentication. It should enable the generation of default executor secrets, 
which will entail:
* loading the default {{SecretGenerator}} module
* calling the secret generator when launching an executor
* passing the generated secret into the executor's environment

  was:
A new agent flag {{--generate_executor_credentials}} is needed to support 
executor authentication. It should enable the generation of default executor 
credentials, which will entail:
* loading the default {{CredentialGenerator}} module
* calling the credential generator when launching an executor
* passing the generated credential into the executor's environment
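
A hedged sketch of how the three steps in the updated description could fit 
together at executor launch (all names below, including the environment 
variable, are invented for illustration; the real interfaces are what 
MESOS-6997 and the related tickets will define):
{code}
#include <iostream>
#include <map>
#include <string>

// Invented stand-in for the SecretGenerator module interface.
struct SecretGenerator
{
  virtual ~SecretGenerator() {}
  virtual std::string generate(const std::string& containerId) = 0;
};

struct JWTSecretGenerator : SecretGenerator
{
  std::string generate(const std::string& containerId) override
  {
    return "jwt-for-" + containerId;  // Placeholder for a signed JWT.
  }
};

// When --generate_executor_secrets is set, the agent would call the
// generator at launch time and inject the result into the executor's
// environment.
std::map<std::string, std::string> executorEnvironment(
    SecretGenerator* generator,
    const std::string& containerId)
{
  std::map<std::string, std::string> environment;
  if (generator != nullptr) {
    environment["EXECUTOR_SECRET"] = generator->generate(containerId);
  }
  return environment;
}

int main()
{
  JWTSecretGenerator generator;
  for (const auto& variable : executorEnvironment(&generator, "container-1")) {
    std::cout << variable.first << "=" << variable.second << std::endl;
  }
}
{code}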


> Add agent flag to generate and pass executor secrets
> 
>
> Key: MESOS-6999
> URL: https://issues.apache.org/jira/browse/MESOS-6999
> Project: Mesos
>  Issue Type: Task
>  Components: agent, security
>Reporter: Greg Mann
>  Labels: agent, executor, flags, security
>
> A new agent flag {{--generate_executor_secrets}} is needed to support 
> executor authentication. It should enable the generation of default executor 
> secrets, which will entail:
> * loading the default {{SecretGenerator}} module
> * calling the secret generator when launching an executor
> * passing the generated secret into the executor's environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6999) Add agent flag to generate and pass executor secrets

2017-01-26 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-6999:
-
Summary: Add agent flag to generate and pass executor secrets  (was: Add 
agent flag to generate and pass executor credentials)

> Add agent flag to generate and pass executor secrets
> 
>
> Key: MESOS-6999
> URL: https://issues.apache.org/jira/browse/MESOS-6999
> Project: Mesos
>  Issue Type: Task
>  Components: agent, security
>Reporter: Greg Mann
>  Labels: agent, executor, flags, security
>
> A new agent flag {{--generate_executor_credentials}} is needed to support 
> executor authentication. It should enable the generation of default executor 
> credentials, which will entail:
> * loading the default {{CredentialGenerator}} module
> * calling the credential generator when launching an executor
> * passing the generated credential into the executor's environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-7000) Implement a JWT SecretGenerator

2017-01-26 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-7000:
-
Summary: Implement a JWT SecretGenerator  (was: Implement a JWT 
CredentialGenerator)

> Implement a JWT SecretGenerator
> ---
>
> Key: MESOS-7000
> URL: https://issues.apache.org/jira/browse/MESOS-7000
> Project: Mesos
>  Issue Type: Task
>  Components: agent, modules, security
>Reporter: Greg Mann
>  Labels: agent, executor, module, security
>
> The default {{CredentialGenerator}} for the generation of default executor 
> credentials will be a module which generates JSON web tokens. This module 
> will be loaded by default when executor credential generation is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-7000) Implement a JWT SecretGenerator

2017-01-26 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-7000:
-
Description: The default {{SecretGenerator}} for the generation of default 
executor credentials will be a module which generates JSON web tokens. This 
module will be loaded by default when executor secret generation is enabled.  
(was: The default {{SecretGenerator}} for the generation of default executor 
credentials will be a module which generates JSON web tokens. This module will 
be loaded by default when executor credential generation is enabled.)

> Implement a JWT SecretGenerator
> ---
>
> Key: MESOS-7000
> URL: https://issues.apache.org/jira/browse/MESOS-7000
> Project: Mesos
>  Issue Type: Task
>  Components: agent, modules, security
>Reporter: Greg Mann
>  Labels: agent, executor, module, security
>
> The default {{SecretGenerator}} for the generation of default executor 
> credentials will be a module which generates JSON web tokens. This module 
> will be loaded by default when executor secret generation is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-7000) Implement a JWT SecretGenerator

2017-01-26 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-7000:
-
Description: The default {{SecretGenerator}} for the generation of default 
executor credentials will be a module which generates JSON web tokens. This 
module will be loaded by default when executor credential generation is 
enabled.  (was: The default {{CredentialGenerator}} for the generation of 
default executor credentials will be a module which generates JSON web tokens. 
This module will be loaded by default when executor credential generation is 
enabled.)

> Implement a JWT SecretGenerator
> ---
>
> Key: MESOS-7000
> URL: https://issues.apache.org/jira/browse/MESOS-7000
> Project: Mesos
>  Issue Type: Task
>  Components: agent, modules, security
>Reporter: Greg Mann
>  Labels: agent, executor, module, security
>
> The default {{SecretGenerator}} for the generation of default executor 
> credentials will be a module which generates JSON web tokens. This module 
> will be loaded by default when executor credential generation is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-7013) Update the authorizer interface for executor authentication

2017-01-26 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-7013:
-
Description: 
The authorizer interface must be updated to accommodate changes introduced by 
the implementation of executor authentication:
* The {{authorization::Subject}} message must be extended to include the 
{{claims}} from an {{AuthenticationContext}}
* The local authorizer must be updated to accommodate this interface change

  was:
The authorizer interface must be updated to accommodate changes introduced by 
the implementation of executor authentication:
* The {{authorization::Subject}} message must be extended to include the 
{{claims}} from an {{AuthenticationContext}}
* The local authorizer must be updated to accommodate this interface change

Also, authorization actions should be added for the V1 executor calls:
* Subscribe
* Update
* Message


> Update the authorizer interface for executor authentication
> ---
>
> Key: MESOS-7013
> URL: https://issues.apache.org/jira/browse/MESOS-7013
> Project: Mesos
>  Issue Type: Task
>  Components: modules, security
>Reporter: Greg Mann
>  Labels: authorization, executor, mesosphere, module, security
>
> The authorizer interface must be updated to accommodate changes introduced by 
> the implementation of executor authentication:
> * The {{authorization::Subject}} message must be extended to include the 
> {{claims}} from an {{AuthenticationContext}}
> * The local authorizer must be updated to accommodate this interface change



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-7013) Update the authorizer interface for executor authentication

2017-01-26 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-7013:
-
Description: 
The authorizer interface must be updated to accommodate changes introduced by 
the implementation of executor authentication:
* The {{authorization::Subject}} message must be extended to include the 
{{claims}} from an {{AuthenticationContext}}
* The local authorizer must be updated to accommodate this interface change

Also, authorization actions should be added for the V1 executor calls:
* Subscribe
* Update
* Message

  was:
The authorizer interface must be updated to accommodate changes introduced by 
the implementation of executor authentication:
* The {{authorization::Subject}} message must be extended to include the 
{{claims}} from an {{AuthenticationContext}}
* The local authorizer must be updated to accommodate this interface change


> Update the authorizer interface for executor authentication
> ---
>
> Key: MESOS-7013
> URL: https://issues.apache.org/jira/browse/MESOS-7013
> Project: Mesos
>  Issue Type: Task
>  Components: modules, security
>Reporter: Greg Mann
>  Labels: authorization, executor, mesosphere, module, security
>
> The authorizer interface must be updated to accommodate changes introduced by 
> the implementation of executor authentication:
> * The {{authorization::Subject}} message must be extended to include the 
> {{claims}} from an {{AuthenticationContext}}
> * The local authorizer must be updated to accommodate this interface change
> Also, authorization actions should be added for the V1 executor calls:
> * Subscribe
> * Update
> * Message



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6517) Health checking only on 127.0.0.1 is limiting.

2017-01-26 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15840124#comment-15840124
 ] 

Avinash Sridharan commented on MESOS-6517:
--

[~haosd...@gmail.com] I don't think asking the framework/agent to set the IP 
address/hostname would make sense here since, as [~jieyu] mentioned, what do 
you do if a container joins multiple networks? I am fine with the LOCALHOST 
semantics since they are consistent; the only challenge is containers that 
bind to a specific IP address in their network namespace. While a corner case, 
this is of course a possibility. If we want to address this problem, we should 
infer all the IP addresses within the container's network namespace, poll each 
one of them, and report health status for each of them. We fail if any of the 
IP addresses is unreachable.
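
An illustrative sketch of that "probe every address in the namespace" idea 
(<container-pid> and the port are placeholders; nsenter/ip/curl are standard 
Linux tools, and the actual health-check integration is what this ticket is 
about):
{noformat}
# Enumerate IPv4 addresses inside the container's network namespace
# and probe each one; report any address without reachability.
nsenter --target <container-pid> --net ip -o -4 addr show \
  | awk '{print $4}' | cut -d/ -f1 \
  | while read ip; do
      curl --max-time 1 -fsS "http://${ip}:8080/health" > /dev/null \
        || echo "unreachable: ${ip}"
    done
{noformat}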

> Health checking only on 127.0.0.1 is limiting.
> --
>
> Key: MESOS-6517
> URL: https://issues.apache.org/jira/browse/MESOS-6517
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Alexander Rukletsov
>  Labels: health-check, mesosphere
>
> As of Mesos 1.1.0, HTTP and TCP health checks always use 127.0.0.1 as the 
> target IP. This is not configurable. As a result, tasks should listen on all 
> interfaces if they want to support HTTP and TCP health checks. However, there 
> might be some cases where tasks or containers will end up binding to a 
> specific IP address. 
> To make health checking more robust we can:
> * look at all interfaces in a given network namespace and do health check on 
> all the IP addresses;
> * allow users to specify the IP to health check;
> * deduce the target IP from task's discovery information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-7005) Add executor authentication documentation

2017-01-26 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-7005:
-
Description: 
Documentation should be added regarding executor authentication. This will 
include updating:
* the configuration docs to include new agent flags
* the authentication documentation
* the authorization documentation
* the upgrade documentation
* the CHANGELOG

  was:
Documentation should be added regarding executor authentication. This will 
include:
* Adding the new flags to the configuration docs
* Updating the authentication documentation
* Updating the upgrade documentation
* Updating the CHANGELOG


> Add executor authentication documentation
> -
>
> Key: MESOS-7005
> URL: https://issues.apache.org/jira/browse/MESOS-7005
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Reporter: Greg Mann
>  Labels: documentation
>
> Documentation should be added regarding executor authentication. This will 
> include updating:
> * the configuration docs to include new agent flags
> * the authentication documentation
> * the authorization documentation
> * the upgrade documentation
> * the CHANGELOG



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-7014) Add implicit executor authorization to local authorizer

2017-01-26 Thread Greg Mann (JIRA)
Greg Mann created MESOS-7014:


 Summary: Add implicit executor authorization to local authorizer
 Key: MESOS-7014
 URL: https://issues.apache.org/jira/browse/MESOS-7014
 Project: Mesos
  Issue Type: Task
  Components: security
Reporter: Greg Mann


The local authorizer should be updated to perform implicit authorization of 
executor actions. When executors authenticate using a default executor secret, 
the authorizer will receive an authorization {{Subject}} which contains claims, 
but no principal. In this case, implicit authorization should be performed.
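
A hedged sketch of the dispatch this implies (the {{Subject}} layout and the 
"cid" claim are invented stand-ins; the actual interface change is tracked in 
MESOS-7013):
{code}
#include <iostream>
#include <map>
#include <string>

// Stand-in for authorization::Subject after the planned extension.
struct Subject
{
  std::string principal;                      // Empty for default executor secrets.
  std::map<std::string, std::string> claims;  // e.g. {"cid": "<container id>"}
};

// With a principal, evaluate the configured ACLs as before; with only
// claims, fall back to an implicit rule, e.g. an executor may act only
// on its own container.
bool authorized(const Subject& subject, const std::string& targetContainerId)
{
  if (!subject.principal.empty()) {
    return true;  // Placeholder for the usual ACL evaluation.
  }

  auto cid = subject.claims.find("cid");
  return cid != subject.claims.end() && cid->second == targetContainerId;
}

int main()
{
  Subject executor{"", {{"cid", "container-1"}}};

  std::cout << std::boolalpha
            << authorized(executor, "container-1") << std::endl   // true
            << authorized(executor, "container-2") << std::endl;  // false
}
{code}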



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5068) Kill task buttons

2017-01-26 Thread Laurent Hoss (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15840114#comment-15840114
 ] 

Laurent Hoss commented on MESOS-5068:
-

+100
(In general the Mesos/Marathon UI lacks many features that would both be easy 
to implement (the APIs already exist) and save a ton of time for the Mesos 
cluster maintainer, who would no longer have to look up the API call, 
construct the URL, etc.)

> Kill task buttons
> -
>
> Key: MESOS-5068
> URL: https://issues.apache.org/jira/browse/MESOS-5068
> Project: Mesos
>  Issue Type: Wish
>Reporter: Guillermo Rodriguez
>Priority: Minor
>
> Is it possible to add a button to each task so that we can send a termination 
> signal to that particular task?
> Sometimes I would just like to go to the mesos-master UI, right-click a task 
> and press Terminate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-7013) Update the authorizer interface for executor authentication

2017-01-26 Thread Greg Mann (JIRA)
Greg Mann created MESOS-7013:


 Summary: Update the authorizer interface for executor 
authentication
 Key: MESOS-7013
 URL: https://issues.apache.org/jira/browse/MESOS-7013
 Project: Mesos
  Issue Type: Task
  Components: modules, security
Reporter: Greg Mann


The authorizer interface must be updated to accommodate changes introduced by 
the implementation of executor authentication:
* The {{authorization::Subject}} message must be extended to include the 
{{claims}} from an {{AuthenticationContext}}
* The local authorizer must be updated to accommodate this interface change



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-7012) Add authorization actions for V1 executor calls

2017-01-26 Thread Greg Mann (JIRA)
Greg Mann created MESOS-7012:


 Summary: Add authorization actions for V1 executor calls
 Key: MESOS-7012
 URL: https://issues.apache.org/jira/browse/MESOS-7012
 Project: Mesos
  Issue Type: Task
  Components: agent, executor, security
Reporter: Greg Mann


Authorization actions should be added for the V1 executor calls:
* Subscribe
* Update
* Message



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-7011) Add a '--secret' flag to the agent

2017-01-26 Thread Greg Mann (JIRA)
Greg Mann created MESOS-7011:


 Summary: Add a '--secret' flag to the agent
 Key: MESOS-7011
 URL: https://issues.apache.org/jira/browse/MESOS-7011
 Project: Mesos
  Issue Type: Task
  Components: agent, security
Reporter: Greg Mann


A new {{--secret}} flag should be added to the agent to allow the operator to 
specify a secret file to be loaded into the default executor JWT authenticator 
and SecretGenerator modules. This secret will be used to generate default 
executor secrets when {{--generate_executor_secrets}} is set, and will be used 
to verify those secrets when {{--authenticate_http_executors}} is set.
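
A hedged sketch of how these flags might compose on an agent command line 
(flag names are taken from this and the related tickets; whether they land 
under exactly these names is still open):
{noformat}
mesos-agent --master=zk://master.example.com:2181/mesos \
  --work_dir=/var/lib/mesos \
  --secret=/etc/mesos/executor-secret.key \
  --generate_executor_secrets \
  --authenticate_http_executors
{noformat}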



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-7010) Add the HttpAuthenticatee module interface

2017-01-26 Thread Greg Mann (JIRA)
Greg Mann created MESOS-7010:


 Summary: Add the HttpAuthenticatee module interface
 Key: MESOS-7010
 URL: https://issues.apache.org/jira/browse/MESOS-7010
 Project: Mesos
  Issue Type: Task
  Components: modules, security
Reporter: Greg Mann


A new {{HttpAuthenticatee}} module interface should be added to permit the 
default executor to authenticate with the agent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6997) Add the SecretGenerator module interface

2017-01-26 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-6997:
-
Description: A new {{SecretGenerator}} module interface will be added to 
permit the agent to generate default executor credentials.  (was: Two new 
module interfaces are needed to accommodate executor authentication:
* {{CredentialGenerator}}
* {{HttpAuthenticatee}})

> Add the SecretGenerator module interface
> 
>
> Key: MESOS-6997
> URL: https://issues.apache.org/jira/browse/MESOS-6997
> Project: Mesos
>  Issue Type: Task
>  Components: executor, modules, security
>Reporter: Greg Mann
>  Labels: executor, module, security
>
> A new {{SecretGenerator}} module interface will be added to permit the agent 
> to generate default executor credentials.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6997) Add the SecretGenerator module interface

2017-01-26 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-6997:
-
Summary: Add the SecretGenerator module interface  (was: Add new module 
interfaces for executor authentication)

> Add the SecretGenerator module interface
> 
>
> Key: MESOS-6997
> URL: https://issues.apache.org/jira/browse/MESOS-6997
> Project: Mesos
>  Issue Type: Task
>  Components: executor, modules, security
>Reporter: Greg Mann
>  Labels: executor, module, security
>
> Two new module interfaces are needed to accommodate executor authentication:
> * {{CredentialGenerator}}
> * {{HttpAuthenticatee}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-7009) Add a 'secret' field to the 'Environment' message

2017-01-26 Thread Greg Mann (JIRA)
Greg Mann created MESOS-7009:


 Summary: Add a 'secret' field to the 'Environment' message
 Key: MESOS-7009
 URL: https://issues.apache.org/jira/browse/MESOS-7009
 Project: Mesos
  Issue Type: Task
  Components: security
Reporter: Greg Mann


A new field of type {{Secret}} should be added to the {{Environment}} message 
to enable the inclusion of secrets in executor and task environments.
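
A hedged protobuf sketch of the shape this suggests (the {{Secret}} layout and 
the placement on {{Variable}} are illustrative only; both were still under 
design at this point, see also MESOS-6996):
{code}
// Placeholder shape; the real design may use typed variants
// (e.g. reference vs. value secrets).
message Secret {
  optional string value = 1;
}

message Environment {
  message Variable {
    required string name = 1;
    optional string value = 2;
    optional Secret secret = 3;  // New: resolved by the agent at launch time.
  }
  repeated Variable variables = 1;
}
{code}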



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6996) Add a 'Secret' protobuf message

2017-01-26 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-6996:
-
Description: A {{Secret}} protobuf message should be added to serve as a 
generic message for sending credentials and other secrets throughout Mesos.  
(was: A {{Secret}} protobuf message should be added to serve as a generic 
message for sending credentials and other secrets throughout Mesos.

A new field of type {{Secret}} should also be added to the {{Environment}} 
message to enable the inclusion of secrets in executor and task environments.)

> Add a 'Secret' protobuf message
> ---
>
> Key: MESOS-6996
> URL: https://issues.apache.org/jira/browse/MESOS-6996
> Project: Mesos
>  Issue Type: Task
>  Components: security
>Reporter: Greg Mann
>  Labels: security
>
> A {{Secret}} protobuf message should be added to serve as a generic message 
> for sending credentials and other secrets throughout Mesos.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-7007) filesystem/shared and --default_container_info broken since 1.1

2017-01-26 Thread Pierre Cheynier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Cheynier updated MESOS-7007:
---
Description: 
I am facing an issue that prevents me from upgrading to 1.1.0 (hence the 
change that causes it was introduced in this version):

I'm using default_container_info to mount a /tmp volume in the container's 
mount namespace from its current sandbox, meaning that each container has a 
dedicated /tmp, thanks to the {{filesystem/shared}} isolator.

I noticed through our automation pipeline that integration tests were failing 
and found that this is because the contents of /tmp (the one from the host!) 
are trashed each time a container is created.

Here is my setup: 
* 
{{--isolation='cgroups/cpu,cgroups/mem,namespaces/pid,*disk/du,filesystem/shared,filesystem/linux*,docker/runtime'}}
* 
{{--default_container_info='\{"type":"MESOS","volumes":\[\{"host_path":"tmp","container_path":"/tmp","mode":"RW"\}\]\}'}}

I discovered this issue in the early days of 1.1 (end of Nov, spoke with 
someone on Slack), but unfortunately had no time to dig into the symptoms a 
bit more.

I found nothing interesting even with GLOG_v=3.

Maybe it's a bad usage of isolators that triggers this issue? If that's the 
case, then at least a documentation update should be done.

Let me know if more information is needed.

  was:
I am facing an issue that prevents me from upgrading to 1.1.0 (hence the 
change that causes it was introduced in this version):

I'm using default_container_info to mount a /tmp volume in the container's 
mount namespace from its current sandbox, meaning that each container has a 
dedicated /tmp, thanks to the {filesystem/shared} isolator.

I noticed through our automation pipeline that integration tests were failing 
and found that this is because the contents of /tmp (the one from the host!) 
are trashed each time a container is created.

Here is my setup: 
* 
{{--isolation='cgroups/cpu,cgroups/mem,namespaces/pid,*disk/du,filesystem/shared,filesystem/linux*,docker/runtime'}}
* 
{{--default_container_info='\{"type":"MESOS","volumes":\[\{"host_path":"tmp","container_path":"/tmp","mode":"RW"\}\]\}'}}

I discovered this issue in the early days of 1.1 (end of Nov, spoke with 
someone on Slack), but unfortunately had no time to dig into the symptoms a 
bit more.

I found nothing interesting even with GLOG_v=3.

Maybe it's a bad usage of isolators that triggers this issue? If that's the 
case, then at least a documentation update should be done.

Let me know if more information is needed.


> filesystem/shared and --default_container_info broken since 1.1
> ---
>
> Key: MESOS-7007
> URL: https://issues.apache.org/jira/browse/MESOS-7007
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.1.0
>Reporter: Pierre Cheynier
>
> I am facing an issue that prevents me from upgrading to 1.1.0 (hence the 
> change that causes it was introduced in this version):
> I'm using default_container_info to mount a /tmp volume in the container's 
> mount namespace from its current sandbox, meaning that each container has a 
> dedicated /tmp, thanks to the {{filesystem/shared}} isolator.
> I noticed through our automation pipeline that integration tests were failing 
> and found that this is because the contents of /tmp (the one from the host!) 
> are trashed each time a container is created.
> Here is my setup: 
> * 
> {{--isolation='cgroups/cpu,cgroups/mem,namespaces/pid,*disk/du,filesystem/shared,filesystem/linux*,docker/runtime'}}
> * 
> {{--default_container_info='\{"type":"MESOS","volumes":\[\{"host_path":"tmp","container_path":"/tmp","mode":"RW"\}\]\}'}}
> I discovered this issue in the early days of 1.1 (end of Nov, spoke with 
> someone on Slack), but unfortunately had no time to dig into the symptoms a 
> bit more.
> I found nothing interesting even with GLOG_v=3.
> Maybe it's a bad usage of isolators that triggers this issue? If that's the 
> case, then at least a documentation update should be done.
> Let me know if more information is needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-7008) Incomplete recovery of roles leading to fatal CHECK failure

2017-01-26 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-7008:

Labels: quota roles  (was: quota)

> Incomplete recovery of roles leading to fatal CHECK failure
> ---
>
> Key: MESOS-7008
> URL: https://issues.apache.org/jira/browse/MESOS-7008
> Project: Mesos
>  Issue Type: Bug
>  Components: master
> Environment: OS X, SSL build
>Reporter: Benjamin Bannier
>  Labels: quota, roles
>
> When a quota was set and the master is restarted, removal of the quota 
> reliably leads to a {{CHECK}} failure for me.
> Start a master:
> {code}
> $ mesos-master --work_dir=work_dir
> {code}
> Set a quota. This creates an implicit role.
> {code}
> $ cat quota.json
> {
> "role": "role2",
> "force": true,
> "guarantee": [
> {
> "name": "cpus",
> "type": "SCALAR",
> "scalar": { "value": 1 }
> }
> ]
> }
> $ cat quota.json| http POST :5050/quota
> HTTP/1.1 200 OK
> Content-Length: 0
> Date: Thu, 26 Jan 2017 12:33:38 GMT
> $ http GET :5050/quota
> HTTP/1.1 200 OK
> Content-Length: 108
> Content-Type: application/json
> Date: Thu, 26 Jan 2017 12:33:56 GMT
> {
> "infos": [
> {
> "guarantee": [
> {
> "name": "cpus",
> "role": "*",
> "scalar": {
> "value": 1.0
> },
> "type": "SCALAR"
> }
> ],
> "role": "role2"
> }
> ]
> }
> $ http GET :5050/roles
> HTTP/1.1 200 OK
> Content-Length: 106
> Content-Type: application/json
> Date: Thu, 26 Jan 2017 12:34:10 GMT
> {
> "roles": [
> {
> "frameworks": [],
> "name": "role2",
> "resources": {
> "cpus": 0,
> "disk": 0,
> "gpus": 0,
> "mem": 0
> },
> "weight": 1.0
> }
> ]
> }
> {code}
> Restart the master process using the same {{work_dir}} and attempt to delete 
> the quota after the master is started. The {{DELETE}} succeeds with an {{OK}}.
> {code}
> $ http DELETE :5050/quota/role2
> HTTP/1.1 200 OK
> Content-Length: 0
> Date: Thu, 26 Jan 2017 12:36:04 GMT
> {code}
> After handling the request, the master hits a {{CHECK}} failure and is 
> aborted.
> {code}
> $ mesos-master --work_dir=work_dir
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0126 13:34:57.528599 3145483200 main.cpp:278] Build: 2017-01-23 07:57:34 by 
> bbannier
> I0126 13:34:57.529131 3145483200 main.cpp:279] Version: 1.2.0
> I0126 13:34:57.529139 3145483200 main.cpp:286] Git SHA: 
> dd07d025d40975ec660ed17031d95ec0dba842d2
> [warn] kq_init: detected broken kqueue; not using.: No such process
> I0126 13:34:57.758896 3145483200 main.cpp:385] Using 'HierarchicalDRF' 
> allocator
> I0126 13:34:57.764276 3145483200 replica.cpp:778] Replica recovered with log 
> positions 3 -> 4 with 0 holes and 0 unlearned
> I0126 13:34:57.765278 256114688 recover.cpp:451] Starting replica recovery
> I0126 13:34:57.765547 256114688 recover.cpp:477] Replica is in VOTING status
> I0126 13:34:57.795964 257187840 master.cpp:383] Master 
> 569073cc-1195-45e9-b0d4-e2e1bf0d13d5 (172.18.9.56) started on 172.18.9.56:5050
> I0126 13:34:57.796023 257187840 master.cpp:385] Flags at startup: 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="false" --authenticate_frameworks="false" 
> --authenticate_http_frameworks="false" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticators="crammd5" 
> --authorizers="local" --framework_sorter="drf" --help="false" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="20secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" --work_dir="work_dir" 
> --zk_session_timeout="10secs"
> I0126 13:34:57.796478 257187840 master.cpp:437] Master allowing 
> unauthenticated frameworks to register
> I0126 13:34:57.796507 257187840 master.cpp:451] Master allowing 

[jira] [Commented] (MESOS-7008) Incomplete recovery of roles leading to fatal CHECK failure

2017-01-26 Thread Benjamin Bannier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15839718#comment-15839718
 ] 

Benjamin Bannier commented on MESOS-7008:
-

A mechanical reproducer:
{code}
#!/usr/bin/env bash
# bash rather than plain sh: 'set -o pipefail' is a bashism.

set -e
set -o pipefail

source support/atexit.sh

atexit rm -rf WORK_DIR

# Start a master and set a quota for the (implicit) role 'role2'.
./src/mesos-master --work_dir=WORK_DIR --port=12345 &
PID=${!}

atexit kill -9 $PID

echo '{ "role": "role2", "force": true, "guarantee": [ { "name": "cpus", "type": "SCALAR", "scalar": { "value": 1 } } ] }' | http :12345/quota

kill ${PID}

# Restart the master with the same work_dir; deleting the quota now
# triggers the CHECK failure.
./src/mesos-master --work_dir=WORK_DIR --port=12345 &

http DELETE :12345/quota/role2
{code}

> Incomplete recovery of roles leading to fatal CHECK failure
> ---
>
> Key: MESOS-7008
> URL: https://issues.apache.org/jira/browse/MESOS-7008
> Project: Mesos
>  Issue Type: Bug
>  Components: master
> Environment: OS X, SSL build
>Reporter: Benjamin Bannier
>  Labels: quota, roles
>
> When a quota was set and the master is restarted, removal of the quota 
> reliably leads to a {{CHECK}} failure for me.
> Start a master:
> {code}
> $ mesos-master --work_dir=work_dir
> {code}
> Set a quota. This creates an implicit role.
> {code}
> $ cat quota.json
> {
> "role": "role2",
> "force": true,
> "guarantee": [
> {
> "name": "cpus",
> "type": "SCALAR",
> "scalar": { "value": 1 }
> }
> ]
> }
> $ cat quota.json| http POST :5050/quota
> HTTP/1.1 200 OK
> Content-Length: 0
> Date: Thu, 26 Jan 2017 12:33:38 GMT
> $ http GET :5050/quota
> HTTP/1.1 200 OK
> Content-Length: 108
> Content-Type: application/json
> Date: Thu, 26 Jan 2017 12:33:56 GMT
> {
> "infos": [
> {
> "guarantee": [
> {
> "name": "cpus",
> "role": "*",
> "scalar": {
> "value": 1.0
> },
> "type": "SCALAR"
> }
> ],
> "role": "role2"
> }
> ]
> }
> $ http GET :5050/roles
> HTTP/1.1 200 OK
> Content-Length: 106
> Content-Type: application/json
> Date: Thu, 26 Jan 2017 12:34:10 GMT
> {
> "roles": [
> {
> "frameworks": [],
> "name": "role2",
> "resources": {
> "cpus": 0,
> "disk": 0,
> "gpus": 0,
> "mem": 0
> },
> "weight": 1.0
> }
> ]
> }
> {code}
> Restart the master process using the same {{work_dir}} and attempt to delete 
> the quota after the master is started. The {{DELETE}} succeeds with an {{OK}}.
> {code}
> $ http DELETE :5050/quota/role2
> HTTP/1.1 200 OK
> Content-Length: 0
> Date: Thu, 26 Jan 2017 12:36:04 GMT
> {code}
> After handling the request, the master hits a {{CHECK}} failure and is 
> aborted.
> {code}
> $ mesos-master --work_dir=work_dir
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0126 13:34:57.528599 3145483200 main.cpp:278] Build: 2017-01-23 07:57:34 by 
> bbannier
> I0126 13:34:57.529131 3145483200 main.cpp:279] Version: 1.2.0
> I0126 13:34:57.529139 3145483200 main.cpp:286] Git SHA: 
> dd07d025d40975ec660ed17031d95ec0dba842d2
> [warn] kq_init: detected broken kqueue; not using.: No such process
> I0126 13:34:57.758896 3145483200 main.cpp:385] Using 'HierarchicalDRF' 
> allocator
> I0126 13:34:57.764276 3145483200 replica.cpp:778] Replica recovered with log 
> positions 3 -> 4 with 0 holes and 0 unlearned
> I0126 13:34:57.765278 256114688 recover.cpp:451] Starting replica recovery
> I0126 13:34:57.765547 256114688 recover.cpp:477] Replica is in VOTING status
> I0126 13:34:57.795964 257187840 master.cpp:383] Master 
> 569073cc-1195-45e9-b0d4-e2e1bf0d13d5 (172.18.9.56) started on 172.18.9.56:5050
> I0126 13:34:57.796023 257187840 master.cpp:385] Flags at startup: 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="false" --authenticate_frameworks="false" 
> --authenticate_http_frameworks="false" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticators="crammd5" 
> --authorizers="local" --framework_sorter="drf" --help="false" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> 

[jira] [Commented] (MESOS-6981) Allow disabling name based SSL checks

2017-01-26 Thread Kevin Cox (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15839659#comment-15839659
 ] 

Kevin Cox commented on MESOS-6981:
--

That sounds exactly like what I want.

> Allow disabling name based SSL checks
> -
>
> Key: MESOS-6981
> URL: https://issues.apache.org/jira/browse/MESOS-6981
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Kevin Cox
>  Labels: mesosphere, security
>
> Currently if you want to use verified certificates you need to enable 
> validation by hostname or IP. However if you are running your own CA for 
> these certificates it is often sufficient to verify solely based on the CA 
> signature.
> For example if an admin wants to connect it is a pain to make sure that they 
> always have a valid certificate for their IP or reverse DNS. It would be nice 
> if the admin could be given a certificate that was trusted no matter where he 
> is.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-7008) Incomplete recovery of roles leading to fatal CHECK failure

2017-01-26 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-7008:

Labels: quota  (was: )

> Incomplete recovery of roles leading to fatal CHECK failure
> ---
>
> Key: MESOS-7008
> URL: https://issues.apache.org/jira/browse/MESOS-7008
> Project: Mesos
>  Issue Type: Bug
>  Components: master
> Environment: OS X, SSL build
>Reporter: Benjamin Bannier
>  Labels: quota
>
> When a quota was set and the master is restarted, removal of the quota 
> reliably leads to a {{CHECK}} failure for me.
> Start a master:
> {code}
> $ mesos-master --work_dir=work_dir
> {code}
> Set a quota. This creates an implicit role.
> {code}
> $ cat quota.json
> {
> "role": "role2",
> "force": true,
> "guarantee": [
> {
> "name": "cpus",
> "type": "SCALAR",
> "scalar": { "value": 1 }
> }
> ]
> }
> $ cat quota.json| http POST :5050/quota
> HTTP/1.1 200 OK
> Content-Length: 0
> Date: Thu, 26 Jan 2017 12:33:38 GMT
> $ http GET :5050/quota
> HTTP/1.1 200 OK
> Content-Length: 108
> Content-Type: application/json
> Date: Thu, 26 Jan 2017 12:33:56 GMT
> {
> "infos": [
> {
> "guarantee": [
> {
> "name": "cpus",
> "role": "*",
> "scalar": {
> "value": 1.0
> },
> "type": "SCALAR"
> }
> ],
> "role": "role2"
> }
> ]
> }
> $ http GET :5050/roles
> HTTP/1.1 200 OK
> Content-Length: 106
> Content-Type: application/json
> Date: Thu, 26 Jan 2017 12:34:10 GMT
> {
> "roles": [
> {
> "frameworks": [],
> "name": "role2",
> "resources": {
> "cpus": 0,
> "disk": 0,
> "gpus": 0,
> "mem": 0
> },
> "weight": 1.0
> }
> ]
> }
> {code}
> Restart the master process using the same {{work_dir}} and attempt to delete 
> the quota after the master is started. The {{DELETE}} succeeds with an {{OK}}.
> {code}
> $ http DELETE :5050/quota/role2
> HTTP/1.1 200 OK
> Content-Length: 0
> Date: Thu, 26 Jan 2017 12:36:04 GMT
> {code}
> After handling the request, the master hits a {{CHECK}} failure and is 
> aborted.
> {code}
> $ mesos-master --work_dir=work_dir
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0126 13:34:57.528599 3145483200 main.cpp:278] Build: 2017-01-23 07:57:34 by 
> bbannier
> I0126 13:34:57.529131 3145483200 main.cpp:279] Version: 1.2.0
> I0126 13:34:57.529139 3145483200 main.cpp:286] Git SHA: 
> dd07d025d40975ec660ed17031d95ec0dba842d2
> [warn] kq_init: detected broken kqueue; not using.: No such process
> I0126 13:34:57.758896 3145483200 main.cpp:385] Using 'HierarchicalDRF' 
> allocator
> I0126 13:34:57.764276 3145483200 replica.cpp:778] Replica recovered with log 
> positions 3 -> 4 with 0 holes and 0 unlearned
> I0126 13:34:57.765278 256114688 recover.cpp:451] Starting replica recovery
> I0126 13:34:57.765547 256114688 recover.cpp:477] Replica is in VOTING status
> I0126 13:34:57.795964 257187840 master.cpp:383] Master 
> 569073cc-1195-45e9-b0d4-e2e1bf0d13d5 (172.18.9.56) started on 172.18.9.56:5050
> I0126 13:34:57.796023 257187840 master.cpp:385] Flags at startup: 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="false" --authenticate_frameworks="false" 
> --authenticate_http_frameworks="false" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticators="crammd5" 
> --authorizers="local" --framework_sorter="drf" --help="false" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="20secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" --work_dir="work_dir" 
> --zk_session_timeout="10secs"
> I0126 13:34:57.796478 257187840 master.cpp:437] Master allowing 
> unauthenticated frameworks to register
> I0126 13:34:57.796507 257187840 master.cpp:451] Master allowing 
> unauthenticated 

[jira] [Commented] (MESOS-6790) Wrong task started time in webui

2017-01-26 Thread Tomasz Janiszewski (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15839646#comment-15839646
 ] 

Tomasz Janiszewski commented on MESOS-6790:
---

Yes, I think this could work. Do we have a guarantee that task statuses arrive 
in order, so there won't be a situation where we set the start time and then 
get an update with an earlier timestamp?
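
A hedged sketch of the kind of guard being discussed (statuses/timestamp 
follow the controllers.js link in the quoted description; this is illustrative 
only, and it only helps with ordering, not with old statuses being pruned):
{code}
// Derive the start time as the earliest status timestamp seen, so a
// health-check update arriving out of order cannot move the displayed
// start time forward.
function startTime(task) {
  return Math.min.apply(null, task.statuses.map(function(status) {
    return status.timestamp;
  }));
}
{code}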

> Wrong task started time in webui
> 
>
> Key: MESOS-6790
> URL: https://issues.apache.org/jira/browse/MESOS-6790
> Project: Mesos
>  Issue Type: Bug
>  Components: webui
>Reporter: haosdent
>Assignee: Tomasz Janiszewski
>  Labels: health-check, webui
>
> Reported by [~janisz]
> {quote}
> Hi
> When task has enabled Mesos healthcheck start time in UI can show wrong
> time. This happens because UI assumes that first status is task started
> [0]. This is not always true because Mesos keeps only recent tasks statuses
> [1] so when healthcheck updates tasks status it can override task start
> time displayed in webui.
> Best
> Tomek
> [0]
> https://github.com/apache/mesos/blob/master/src/webui/master/static/js/controllers.js#L140
> [1]
> https://github.com/apache/mesos/blob/f2adc8a95afda943f6a10e771aad64300da19047/src/common/protobuf_utils.cpp#L263-L265
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-7008) Incomplete recovery of roles leading to fatal CHECK failure

2017-01-26 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-7008:
---

 Summary: Incomplete recovery of roles leading to fatal CHECK 
failure
 Key: MESOS-7008
 URL: https://issues.apache.org/jira/browse/MESOS-7008
 Project: Mesos
  Issue Type: Bug
  Components: master
 Environment: OS X, SSL build
Reporter: Benjamin Bannier


When a quota was set and the master is restarted, removal of the quota reliably 
leads to a {{CHECK}} failure for me.

Start a master:
{code}
$ mesos-master --work_dir=work_dir
{code}

Set a quota. This creates an implicit role.
{code}
$ cat quota.json
{
"role": "role2",
"force": true,
"guarantee": [
{
"name": "cpus",
"type": "SCALAR",
"scalar": { "value": 1 }
}
]
}

$ cat quota.json| http POST :5050/quota
HTTP/1.1 200 OK
Content-Length: 0
Date: Thu, 26 Jan 2017 12:33:38 GMT

$ http GET :5050/quota
HTTP/1.1 200 OK
Content-Length: 108
Content-Type: application/json
Date: Thu, 26 Jan 2017 12:33:56 GMT

{
"infos": [
{
"guarantee": [
{
"name": "cpus",
"role": "*",
"scalar": {
"value": 1.0
},
"type": "SCALAR"
}
],
"role": "role2"
}
]
}

$ http GET :5050/roles
HTTP/1.1 200 OK
Content-Length: 106
Content-Type: application/json
Date: Thu, 26 Jan 2017 12:34:10 GMT

{
"roles": [
{
"frameworks": [],
"name": "role2",
"resources": {
"cpus": 0,
"disk": 0,
"gpus": 0,
"mem": 0
},
"weight": 1.0
}
]
}
{code}

Restart the master process using the same {{work_dir}} and attempt to delete 
the quota after the master is started. The {{DELETE}} succeeds with an {{OK}}.
{code}
$ http DELETE :5050/quota/role2
HTTP/1.1 200 OK
Content-Length: 0
Date: Thu, 26 Jan 2017 12:36:04 GMT
{code}

After handling the request, the master hits a {{CHECK}} failure and is aborted.
{code}
$ mesos-master --work_dir=work_dir
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0126 13:34:57.528599 3145483200 main.cpp:278] Build: 2017-01-23 07:57:34 by 
bbannier
I0126 13:34:57.529131 3145483200 main.cpp:279] Version: 1.2.0
I0126 13:34:57.529139 3145483200 main.cpp:286] Git SHA: 
dd07d025d40975ec660ed17031d95ec0dba842d2
[warn] kq_init: detected broken kqueue; not using.: No such process
I0126 13:34:57.758896 3145483200 main.cpp:385] Using 'HierarchicalDRF' allocator
I0126 13:34:57.764276 3145483200 replica.cpp:778] Replica recovered with log 
positions 3 -> 4 with 0 holes and 0 unlearned
I0126 13:34:57.765278 256114688 recover.cpp:451] Starting replica recovery
I0126 13:34:57.765547 256114688 recover.cpp:477] Replica is in VOTING status
I0126 13:34:57.795964 257187840 master.cpp:383] Master 
569073cc-1195-45e9-b0d4-e2e1bf0d13d5 (172.18.9.56) started on 172.18.9.56:5050
I0126 13:34:57.796023 257187840 master.cpp:385] Flags at startup: 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="false" --authenticate_frameworks="false" 
--authenticate_http_frameworks="false" --authenticate_http_readonly="false" 
--authenticate_http_readwrite="false" --authenticators="crammd5" 
--authorizers="local" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="replicated_log" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="20secs" --registry_strict="false" 
--root_submissions="true" --user_sorter="drf" --version="false" 
--webui_dir="/usr/local/share/mesos/webui" --work_dir="work_dir" 
--zk_session_timeout="10secs"
I0126 13:34:57.796478 257187840 master.cpp:437] Master allowing unauthenticated 
frameworks to register
I0126 13:34:57.796507 257187840 master.cpp:451] Master allowing unauthenticated 
agents to register
I0126 13:34:57.796517 257187840 master.cpp:465] Master allowing HTTP frameworks 
to register without authentication
I0126 13:34:57.796540 257187840 master.cpp:507] Using default 'crammd5' 
authenticator
W0126 13:34:57.796573 257187840 authenticator.cpp:512] No credentials provided, 
authentication requests will be refused
I0126 13:34:57.796584 257187840 authenticator.cpp:519] Initializing server SASL
I0126 13:34:57.825337 255578112 

[jira] [Comment Edited] (MESOS-7007) filesystem/shared and --default_container_info broken since 1.1

2017-01-26 Thread Pierre Cheynier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15839607#comment-15839607
 ] 

Pierre Cheynier edited comment on MESOS-7007 at 1/26/17 11:52 AM:
--

Here is how to reproduce the case with a minimal custom setup:
{noformat}
# Define $MASTER and $WORKDIR
# Launch agent: 
mesos-agent --advertise_ip=127.0.0.1 --cgroups_hierarchy=/sys/fs/cgroup 
--containerizers=mesos,docker 
--default_container_info='{"type":"MESOS","volumes":[{"host_path":"tmp","container_path":"/tmp","mode":"RW"}]}'
 --default_role=default 
--isolation='cgroups/cpu,cgroups/mem,namespaces/pid,disk/du,filesystem/shared,filesystem/linux,docker/runtime'
 --master=$MASTER --port=5051 --strict --work_dir=$WORKDIR
# Create dummy file
touch /tmp/example
# Launch a container
mesos-execute --master='localhost:5050' --name=ls --command=ls
# List dir content, it's empty
ls /tmp
{noformat}


was (Author: pierrecdn):
Here is how to reproduce the case with a minimal custom setup:
{noformat}
# Define $MASTER and $WORKDIR
# Launch agent: 
/usr/sbin/mesos-slave --advertise_ip=127.0.0.1 
--cgroups_hierarchy=/sys/fs/cgroup --containerizers=mesos,docker 
--default_container_info='{"type":"MESOS","volumes":[{"host_path":"tmp","container_path":"/tmp","mode":"RW"}]}'
 --default_role=default 
--isolation='cgroups/cpu,cgroups/mem,namespaces/pid,disk/du,filesystem/shared,filesystem/linux,docker/runtime'
 --master=$MASTER --port=5051 --strict --work_dir=$WORKDIR
# Create dummy file
touch /tmp/example
# Launch a container
mesos-execute --master='localhost:5050' --name=ls --command=ls
# List dir content, it's empty
ls /tmp
{noformat}

> filesystem/shared and --default_container_info broken since 1.1
> ---
>
> Key: MESOS-7007
> URL: https://issues.apache.org/jira/browse/MESOS-7007
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.1.0
>Reporter: Pierre Cheynier
>
> I am facing an issue that prevents me from upgrading to 1.1.0 (hence the 
> change that causes it was introduced in this version):
> I'm using default_container_info to mount a /tmp volume in the container's 
> mount namespace from its current sandbox, meaning that each container has a 
> dedicated /tmp, thanks to the {filesystem/shared} isolator.
> I noticed through our automation pipeline that integration tests were failing 
> and found that this is because the contents of /tmp (the one from the host!) 
> are trashed each time a container is created.
> Here is my setup: 
> * 
> {{--isolation='cgroups/cpu,cgroups/mem,namespaces/pid,*disk/du,filesystem/shared,filesystem/linux*,docker/runtime'}}
> * 
> {{--default_container_info='\{"type":"MESOS","volumes":\[\{"host_path":"tmp","container_path":"/tmp","mode":"RW"\}\]\}'}}
> I discovered this issue in the early days of 1.1 (end of Nov, spoke with 
> someone on Slack), but unfortunately had no time to dig into the symptoms a 
> bit more.
> I found nothing interesting even with GLOG_v=3.
> Maybe it's a bad usage of isolators that triggers this issue? If that's the 
> case, then at least a documentation update should be done.
> Let me know if more information is needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-7007) filesystem/shared and --default_container_info broken since 1.1

2017-01-26 Thread Pierre Cheynier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15839607#comment-15839607
 ] 

Pierre Cheynier commented on MESOS-7007:


Here is how to reproduce the case with minimal custom setup:
{noformat}
# Define $MASTER and $WORKDIR
# Launch agent: 
/usr/sbin/mesos-slave --advertise_ip=127.0.0.1 \
--cgroups_hierarchy=/sys/fs/cgroup --containerizers=mesos,docker \
--default_container_info='{"type":"MESOS","volumes":[{"host_path":"tmp","container_path":"/tmp","mode":"RW"}]}' \
--default_role=default \
--isolation='cgroups/cpu,cgroups/mem,namespaces/pid,disk/du,filesystem/shared,filesystem/linux,docker/runtime' \
--master=$MASTER --port=5051 --strict --work_dir=$WORKDIR
# Create dummy file
touch /tmp/example
# Launch a container
mesos-execute --master='localhost:5050' --name=ls --command=ls
# List dir content, it's empty
ls /tmp
{noformat}

> filesystem/shared and --default_container_info broken since 1.1
> ---
>
> Key: MESOS-7007
> URL: https://issues.apache.org/jira/browse/MESOS-7007
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.1.0
>Reporter: Pierre Cheynier
>
> I'm facing an issue that prevents me from upgrading to 1.1.0 (the change that 
> causes it was introduced in this version):
> I'm using default_container_info to mount a /tmp volume into the container's 
> mount namespace from its current sandbox, meaning that each container has a 
> dedicated /tmp, thanks to the {{filesystem/shared}} isolator.
> I noticed through our automation pipeline that integration tests were failing 
> and found that this is because the contents of /tmp (the one from the host!) 
> are trashed each time a container is created.
> Here is my setup: 
> * 
> {{--isolation='cgroups/cpu,cgroups/mem,namespaces/pid,*disk/du,filesystem/shared,filesystem/linux*,docker/runtime'}}
> * 
> {{--default_container_info='\{"type":"MESOS","volumes":\[\{"host_path":"tmp","container_path":"/tmp","mode":"RW"\}\]\}'}}
> I discovered this issue in the early days of 1.1 (end of November, spoke with 
> someone on Slack), but unfortunately had no time to dig into the symptoms any 
> further.
> I found nothing interesting even with GLOG_v=3.
> Maybe it's a bad usage of isolators that triggers this issue? If that's the 
> case, then at least a documentation update should be done.
> Let me know if more information is needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-7007) filesystem/shared and --default_container_info broken since 1.1

2017-01-26 Thread Pierre Cheynier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15839607#comment-15839607
 ] 

Pierre Cheynier edited comment on MESOS-7007 at 1/26/17 11:49 AM:
--

Here is how to reproduce the case with minimal custom setup:
{noformat}
# Define $MASTER and $WORKDIR
# Launch agent: 
/usr/sbin/mesos-slave --advertise_ip=127.0.0.1 \
--cgroups_hierarchy=/sys/fs/cgroup --containerizers=mesos,docker \
--default_container_info='{"type":"MESOS","volumes":[{"host_path":"tmp","container_path":"/tmp","mode":"RW"}]}' \
--default_role=default \
--isolation='cgroups/cpu,cgroups/mem,namespaces/pid,disk/du,filesystem/shared,filesystem/linux,docker/runtime' \
--master=$MASTER --port=5051 --strict --work_dir=$WORKDIR
# Create dummy file
touch /tmp/example
# Launch a container
mesos-execute --master='localhost:5050' --name=ls --command=ls
# List dir content, it's empty
ls /tmp
{noformat}


was (Author: pierrecdn):
Here is how to reproduce the case with minimal custom setup:
{noformat}
# Define $MASTER and $WORKDIR
# Launch agent: 
/usr/sbin/mesos-slave --advertise_ip=127.0.0.1 \
--cgroups_hierarchy=/sys/fs/cgroup --containerizers=mesos,docker \
--default_container_info='{"type":"MESOS","volumes":[{"host_path":"tmp","container_path":"/tmp","mode":"RW"}]}' \
--default_role=default \
--isolation='cgroups/cpu,cgroups/mem,namespaces/pid,disk/du,filesystem/shared,filesystem/linux,docker/runtime' \
--master=$MASTER --port=5051 --strict --work_dir=$WORKDIR
# Create dummy file
touch /tmp/example
# Launch a container
mesos-execute --master='localhost:5050' --name=ls --command=ls
# List dir content, it's empty
ls /tmp
{noformat}

> filesystem/shared and --default_container_info broken since 1.1
> ---
>
> Key: MESOS-7007
> URL: https://issues.apache.org/jira/browse/MESOS-7007
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.1.0
>Reporter: Pierre Cheynier
>
> I'm facing an issue that prevents me from upgrading to 1.1.0 (the change that 
> causes it was introduced in this version):
> I'm using default_container_info to mount a /tmp volume into the container's 
> mount namespace from its current sandbox, meaning that each container has a 
> dedicated /tmp, thanks to the {{filesystem/shared}} isolator.
> I noticed through our automation pipeline that integration tests were failing 
> and found that this is because the contents of /tmp (the one from the host!) 
> are trashed each time a container is created.
> Here is my setup: 
> * 
> {{--isolation='cgroups/cpu,cgroups/mem,namespaces/pid,*disk/du,filesystem/shared,filesystem/linux*,docker/runtime'}}
> * 
> {{--default_container_info='\{"type":"MESOS","volumes":\[\{"host_path":"tmp","container_path":"/tmp","mode":"RW"\}\]\}'}}
> I discovered this issue in the early days of 1.1 (end of November, spoke with 
> someone on Slack), but unfortunately had no time to dig into the symptoms any 
> further.
> I found nothing interesting even with GLOG_v=3.
> Maybe it's a bad usage of isolators that triggers this issue? If that's the 
> case, then at least a documentation update should be done.
> Let me know if more information is needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-7007) filesystem/shared and --default_container_info broken since 1.1

2017-01-26 Thread Pierre Cheynier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15839607#comment-15839607
 ] 

Pierre Cheynier edited comment on MESOS-7007 at 1/26/17 11:49 AM:
--

Here is how to reproduce the case with minimal custom setup:
{noformat}
# Define $MASTER and $WORKDIR
# Launch agent: 
/usr/sbin/mesos-slave --advertise_ip=127.0.0.1 \
--cgroups_hierarchy=/sys/fs/cgroup --containerizers=mesos,docker \
--default_container_info='{"type":"MESOS","volumes":[{"host_path":"tmp","container_path":"/tmp","mode":"RW"}]}' \
--default_role=default \
--isolation='cgroups/cpu,cgroups/mem,namespaces/pid,disk/du,filesystem/shared,filesystem/linux,docker/runtime' \
--master=$MASTER --port=5051 --strict --work_dir=$WORKDIR
# Create dummy file
touch /tmp/example
# Launch a container
mesos-execute --master='localhost:5050' --name=ls --command=ls
# List dir content, it's empty
ls /tmp
{noformat}


was (Author: pierrecdn):
Here is how to reproduce the case with minimal custom setup:
{noformat}
# Define $MASTER and $WORKDIR
# Launch agent: 
/usr/sbin/mesos-slave --advertise_ip=127.0.0.1 \
--cgroups_hierarchy=/sys/fs/cgroup --containerizers=mesos,docker \
--default_container_info='{"type":"MESOS","volumes":[{"host_path":"tmp","container_path":"/tmp","mode":"RW"}]}' \
--default_role=default \
--isolation='cgroups/cpu,cgroups/mem,namespaces/pid,disk/du,filesystem/shared,filesystem/linux,docker/runtime' \
--master=$MASTER --port=5051 --strict --work_dir=$WORKDIR
# Create dummy file
touch /tmp/example
# Launch a container
mesos-execute --master='localhost:5050' --name=ls --command=ls
# List dir content, it's empty
ls /tmp
{noformat}

> filesystem/shared and --default_container_info broken since 1.1
> ---
>
> Key: MESOS-7007
> URL: https://issues.apache.org/jira/browse/MESOS-7007
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.1.0
>Reporter: Pierre Cheynier
>
> I'm facing an issue that prevents me from upgrading to 1.1.0 (the change that 
> causes it was introduced in this version):
> I'm using default_container_info to mount a /tmp volume into the container's 
> mount namespace from its current sandbox, meaning that each container has a 
> dedicated /tmp, thanks to the {{filesystem/shared}} isolator.
> I noticed through our automation pipeline that integration tests were failing 
> and found that this is because the contents of /tmp (the one from the host!) 
> are trashed each time a container is created.
> Here is my setup: 
> * 
> {{--isolation='cgroups/cpu,cgroups/mem,namespaces/pid,*disk/du,filesystem/shared,filesystem/linux*,docker/runtime'}}
> * 
> {{--default_container_info='\{"type":"MESOS","volumes":\[\{"host_path":"tmp","container_path":"/tmp","mode":"RW"\}\]\}'}}
> I discovered this issue in the early days of 1.1 (end of November, spoke with 
> someone on Slack), but unfortunately had no time to dig into the symptoms any 
> further.
> I found nothing interesting even with GLOG_v=3.
> Maybe it's a bad usage of isolators that triggers this issue? If that's the 
> case, then at least a documentation update should be done.
> Let me know if more information is needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-7007) filesystem/shared and --default_container_info broken since 1.1

2017-01-26 Thread Pierre Cheynier (JIRA)
Pierre Cheynier created MESOS-7007:
--

 Summary: filesystem/shared and --default_container_info broken 
since 1.1
 Key: MESOS-7007
 URL: https://issues.apache.org/jira/browse/MESOS-7007
 Project: Mesos
  Issue Type: Bug
  Components: agent
Affects Versions: 1.1.0
Reporter: Pierre Cheynier


I'm facing an issue that prevents me from upgrading to 1.1.0 (the change that 
causes it was introduced in this version):

I'm using default_container_info to mount a /tmp volume into the container's 
mount namespace from its current sandbox, meaning that each container has a 
dedicated /tmp, thanks to the {{filesystem/shared}} isolator.

I noticed through our automation pipeline that integration tests were failing 
and found that this is because the contents of /tmp (the one from the host!) 
are trashed each time a container is created.

Here is my setup: 
* 
{{--isolation='cgroups/cpu,cgroups/mem,namespaces/pid,*disk/du,filesystem/shared,filesystem/linux*,docker/runtime'}}
* 
{{--default_container_info='\{"type":"MESOS","volumes":\[\{"host_path":"tmp","container_path":"/tmp","mode":"RW"\}\]\}'}}

I discovered this issue in the early days of 1.1 (end of November, spoke with 
someone on Slack), but unfortunately had no time to dig into the symptoms any 
further.

I found nothing interesting even with GLOG_v=3.

Maybe it's a bad usage of isolators that triggers this issue? If that's the 
case, then at least a documentation update should be done.

Let me know if more information is needed.
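
For reference, here is a quick way to check whether the sandbox-relative volume 
behaves as intended (a sketch only; the {{find}} pattern assumes the usual 
agent work_dir sandbox layout):
{noformat}
# Expected behaviour: /tmp inside the container is the task sandbox's
# ./tmp directory, and the host's /tmp is left untouched.
mesos-execute --master='localhost:5050' --name=check \
  --command='touch /tmp/from-container'
ls /tmp/example                      # should still exist on the host
find $WORKDIR -name from-container   # should appear under a sandbox's tmp/
{noformat}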



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-7006) Launch docker containers with --cpus instead of cpu-shares

2017-01-26 Thread Craig W (JIRA)
Craig W created MESOS-7006:
--

 Summary: Launch docker containers with --cpus instead of cpu-shares
 Key: MESOS-7006
 URL: https://issues.apache.org/jira/browse/MESOS-7006
 Project: Mesos
  Issue Type: Improvement
Affects Versions: 1.1.0
Reporter: Craig W


Docker 1.13 was recently released, and it has a new --cpus flag which allows 
a user to specify how many CPUs a container should have. This is much simpler 
for users to reason about.

Mesos should switch to starting containers with --cpus instead of 
--cpu-shares, or at least make this configurable.

https://blog.docker.com/2017/01/cpu-management-docker-1-13/
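
For illustration, a sketch of the difference (the image name is illustrative; 
Mesos today derives --cpu-shares from the task's cpus, roughly cpus * 1024):
{noformat}
# Today: a relative weight only; the container can still burst onto all cores.
docker run --cpu-shares=1536 myimage

# With docker >= 1.13: a hard ceiling of 1.5 CPUs, which matches how users
# think about a "cpus: 1.5" resource.
docker run --cpus=1.5 myimage
{noformat}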



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6304) Add authentication support to the default executor

2017-01-26 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-6304:
--
Labels: executor mesosphere module security  (was: executor module security)

> Add authentication support to the default executor
> --
>
> Key: MESOS-6304
> URL: https://issues.apache.org/jira/browse/MESOS-6304
> Project: Mesos
>  Issue Type: Improvement
>  Components: executor, modules, security
>Reporter: Galen Pewtherer
>Assignee: Greg Mann
>  Labels: executor, mesosphere, module, security
>
> The default executor should be updated to authenticate with the agent when 
> HTTP executor authentication is enabled. This will entail:
> * loading the default JWT authenticatee module
> * calling into the authenticatee before making requests to the agent
> * decorating requests with the headers returned by the authenticatee
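
For context, a request decorated this way would look roughly like the sketch 
below (the header value is illustrative):
{noformat}
POST /api/v1/executor HTTP/1.1
Content-Type: application/json
Authorization: Bearer <token-returned-by-the-jwt-authenticatee>
{noformat}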



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6933) Executor does not respect grace period

2017-01-26 Thread Tomasz Janiszewski (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15839526#comment-15839526
 ] 

Tomasz Janiszewski commented on MESOS-6933:
---

I think ??{{/bin/sh}} doesn't forward signals to any child processes?? is not 
the problem: {{killTree}} delivers the signal to every {{sh}} child. The 
problem is that {{sh}} terminates fast, while the children may need some time 
to shut down gracefully.
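
A minimal sketch of the race (assuming a hypothetical {{./slow-shutdown}} 
binary that traps SIGTERM and needs a few seconds to exit):
{noformat}
sh -c './slow-shutdown' &
SH_PID=$!
pkill -TERM -P "$SH_PID"   # SIGTERM to sh's children, as killTree does...
kill -TERM "$SH_PID"       # ...and to sh itself
wait "$SH_PID"             # returns almost immediately: sh is gone, but
                           # slow-shutdown may still be shutting down
{noformat}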

> Executor does not respect grace period
> --
>
> Key: MESOS-6933
> URL: https://issues.apache.org/jira/browse/MESOS-6933
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Reporter: Tomasz Janiszewski
>
> Mesos Command Executor tries to support the grace period with escalation, but 
> unfortunately it does not work. It launches {{command}} by wrapping it in 
> {{sh -c}}, which causes the process tree to look like this
> {code}
> Received killTask
> Shutting down
> Sending SIGTERM to process tree at pid 18
> Sent SIGTERM to the following process trees:
> [ 
> -+- 18 sh -c cd offer-i18n-0.1.24 && LD_PRELOAD=../librealresources.so 
> ./bin/offer-i18n -e prod -p $PORT0 
>  \--- 19 command...
> ]
> Command terminated with signal Terminated (pid: 18)
> {code}
> This causes {{sh}} to close immediately, and the executor with it, while the 
> wrapped {{command}} might need some more time to finish. The executor then 
> thinks the command executed gracefully, so it won't 
> [escalate|https://github.com/apache/mesos/blob/1.1.0/src/launcher/executor.cpp#L695]
>  to SIGKILL.
> This causes leaks when the POSIX containerizer is used, because if the command 
> ignores SIGTERM it stays attached to init and never gets killed. Using 
> pid/namespace only masks the problem, because the hanging process is captured 
> before it can shut down gracefully.
> The fix for this is to send SIGTERM only to {{sh}}'s children; {{sh}} will 
> exit when all of its child processes finish. If not, they will be killed by 
> escalation to SIGKILL.
> All versions from 0.20 onwards are affected.
> This test should pass 
> [src/tests/command_executor_tests.cpp:342|https://github.com/apache/mesos/blob/2c856178b59593ff8068ea8d6c6593943c33008c/src/tests/command_executor_tests.cpp#L342-L343]
> [Mailing list 
> thread|https://lists.apache.org/thread.html/1025dca0cf4418aee50b14330711500af864f08b53eb82d10cd5c04c@%3Cuser.mesos.apache.org%3E]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6517) Health checking only on 127.0.0.1 is limiting.

2017-01-26 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15839523#comment-15839523
 ] 

haosdent commented on MESOS-6517:
-

Hi, [~alexr][~avinash.mesos][~gkleiman][~jieyu] Should we add an {{ip}} or 
{{hostname}} field to {{HTTPCheckInfo}} to address this?
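
For example (a sketch only; the {{ip}} field below is the proposed addition, 
not an existing one, and the values are illustrative):
{code}
"health_check": {
  "type": "HTTP",
  "http": {
    "port": 8080,
    "path": "/health",
    "ip": "10.0.1.5"
  }
}
{code}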

Refer to the discussion at 
http://search-hadoop.com/m/Mesos/0Vlr6jCHiaMC2pm1?subj=Re+customized+IP+for+health+check
 
It looks like this ticket is invalid. 

> Health checking only on 127.0.0.1 is limiting.
> --
>
> Key: MESOS-6517
> URL: https://issues.apache.org/jira/browse/MESOS-6517
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Alexander Rukletsov
>  Labels: health-check, mesosphere
>
> As of Mesos 1.1.0, HTTP and TCP health checks always use 127.0.0.1 as the 
> target IP. This is not configurable. As a result, tasks should listen on all 
> interfaces if they want to support HTTP and TCP health checks. However, there 
> might be some cases where tasks or containers will end up binding to a 
> specific IP address. 
> To make health checking more robust we can:
> * look at all interfaces in a given network namespace and do health check on 
> all the IP addresses;
> * allow users to specify the IP to health check;
> * deduce the target IP from task's discovery information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4705) Linux 'perf' parsing logic may fail when OS distribution has perf backports.

2017-01-26 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15839508#comment-15839508
 ] 

haosdent commented on MESOS-4705:
-

[~bmahler] It looks like there are more fields in recent perf versions.

https://github.com/torvalds/linux/blob/v4.9/tools/perf/util/stat-shadow.c#L528
https://github.com/torvalds/linux/blob/v4.9/tools/perf/builtin-stat.c#L1149

Should we create a new ticket for this, since this one is marked "resolved"?
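
A quick way to check how many fields a local perf emits (a sketch; the event 
and cgroup are illustrative, and the cgroup must exist in the perf_event 
hierarchy):
{noformat}
perf stat -x, -e cycles --cgroup mesos -a sleep 1 2>&1 \
  | tail -n 1 | awk -F, '{print NF" fields: "$0}'
{noformat}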

> Linux 'perf' parsing logic may fail when OS distribution has perf backports.
> 
>
> Key: MESOS-4705
> URL: https://issues.apache.org/jira/browse/MESOS-4705
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups, isolation
>Affects Versions: 0.27.1
>Reporter: Fan Du
>Assignee: Fan Du
> Fix For: 0.26.2, 0.27.3, 0.28.2, 1.0.0
>
>
> When sampling a container with the perf event isolator on CentOS 7 with kernel 
> 3.10.0-123.el7.x86_64, the slave complained with the error spew below:
> {code}
> E0218 16:32:00.591181  8376 perf_event.cpp:408] Failed to get perf sample: 
> Failed to parse perf sample: Failed to parse perf sample line 
> '25871993253,,cycles,mesos/5f23ffca-87ed-4ff6-84f2-6ec3d4098ab8,10059827422,100.00':
>  Unexpected number of fields
> {code}
> It's caused by the current perf format [assumption | 
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob;f=src/linux/perf.cpp;h=1c113a2b3f57877e132bbd65e01fb2f045132128;hb=HEAD#l430]
> , which only holds for kernel versions below 3.12.
> On the 3.10.0-123.el7.x86_64 kernel, the format has 6 tokens, as below:
> value,unit,event,cgroup,running,ratio
> A local modification fixed this error on my test bed; please review this 
> ticket.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6790) Wrong task started time in webui

2017-01-26 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15839423#comment-15839423
 ] 

haosdent commented on MESOS-6790:
-

ping [~janisz] Do you think this approach is acceptable?

> Wrong task started time in webui
> 
>
> Key: MESOS-6790
> URL: https://issues.apache.org/jira/browse/MESOS-6790
> Project: Mesos
>  Issue Type: Bug
>  Components: webui
>Reporter: haosdent
>Assignee: Tomasz Janiszewski
>  Labels: health-check, webui
>
> Reported by [~janisz]
> {quote}
> Hi
> When a task has Mesos health checks enabled, its start time in the UI can be 
> wrong. This happens because the UI assumes that the first status is the task 
> start [0]. This is not always true, because Mesos keeps only the most recent 
> task statuses [1], so when a health check updates the task's status it can 
> override the task start time displayed in the webui.
> Best
> Tomek
> [0]
> https://github.com/apache/mesos/blob/master/src/webui/master/static/js/controllers.js#L140
> [1]
> https://github.com/apache/mesos/blob/f2adc8a95afda943f6a10e771aad64300da19047/src/common/protobuf_utils.cpp#L263-L265
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6933) Executor does not respect grace period

2017-01-26 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15839421#comment-15839421
 ] 

haosdent commented on MESOS-6933:
-

[~klueska][~janisz] This is an {{sh}} problem rather than a Mesos bug, because 
{{/bin/sh}} doesn't forward signals to its child processes. 

Docker has a similar problem when you try to exit gracefully and you use 
{{sh}} to launch commands; refer to 
https://www.ctl.io/developers/blog/post/gracefully-stopping-docker-containers/ 
for the details.

So the correct way to implement graceful exit in Docker, Mesos, and other 
applications is to avoid using {{sh}}. More precisely, users should set 
{{CommandInfo.shell}} to false and use the {{exec}} form to launch tasks if 
they would like tasks to exit gracefully. Make sense? 
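
For example, a task's {{CommandInfo}} might look like this in JSON form (a 
sketch; the binary path and arguments are illustrative). Note that with 
{{shell}} set to false, {{value}} is the executable and {{arguments}} form the 
argv, with the first element used as argv[0]:
{code}
"command": {
  "shell": false,
  "value": "/opt/offer-i18n/bin/offer-i18n",
  "arguments": ["offer-i18n", "-e", "prod"]
}
{code}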

> Executor does not respect grace period
> --
>
> Key: MESOS-6933
> URL: https://issues.apache.org/jira/browse/MESOS-6933
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Reporter: Tomasz Janiszewski
>
> Mesos Command Executor tries to support the grace period with escalation, but 
> unfortunately it does not work. It launches {{command}} by wrapping it in 
> {{sh -c}}, which causes the process tree to look like this
> {code}
> Received killTask
> Shutting down
> Sending SIGTERM to process tree at pid 18
> Sent SIGTERM to the following process trees:
> [ 
> -+- 18 sh -c cd offer-i18n-0.1.24 && LD_PRELOAD=../librealresources.so 
> ./bin/offer-i18n -e prod -p $PORT0 
>  \--- 19 command...
> ]
> Command terminated with signal Terminated (pid: 18)
> {code}
> This causes {{sh}} to close immediately, and the executor with it, while the 
> wrapped {{command}} might need some more time to finish. The executor then 
> thinks the command executed gracefully, so it won't 
> [escalate|https://github.com/apache/mesos/blob/1.1.0/src/launcher/executor.cpp#L695]
>  to SIGKILL.
> This causes leaks when the POSIX containerizer is used, because if the command 
> ignores SIGTERM it stays attached to init and never gets killed. Using 
> pid/namespace only masks the problem, because the hanging process is captured 
> before it can shut down gracefully.
> The fix for this is to send SIGTERM only to {{sh}}'s children; {{sh}} will 
> exit when all of its child processes finish. If not, they will be killed by 
> escalation to SIGKILL.
> All versions from 0.20 onwards are affected.
> This test should pass 
> [src/tests/command_executor_tests.cpp:342|https://github.com/apache/mesos/blob/2c856178b59593ff8068ea8d6c6593943c33008c/src/tests/command_executor_tests.cpp#L342-L343]
> [Mailing list 
> thread|https://lists.apache.org/thread.html/1025dca0cf4418aee50b14330711500af864f08b53eb82d10cd5c04c@%3Cuser.mesos.apache.org%3E]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6933) Executor does not respect grace period

2017-01-26 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-6933:

Description: 
Mesos Command Executor tries to support the grace period with escalation, but 
unfortunately it does not work. It launches {{command}} by wrapping it in 
{{sh -c}}, which causes the process tree to look like this

{code}
Received killTask
Shutting down
Sending SIGTERM to process tree at pid 18
Sent SIGTERM to the following process trees:
[ 
-+- 18 sh -c cd offer-i18n-0.1.24 && LD_PRELOAD=../librealresources.so 
./bin/offer-i18n -e prod -p $PORT0 
 \--- 19 command...
]
Command terminated with signal Terminated (pid: 18)
{code}

This causes {{sh}} to close immediately, and the executor with it, while the 
wrapped {{command}} might need some more time to finish. The executor then 
thinks the command executed gracefully, so it won't 
[escalate|https://github.com/apache/mesos/blob/1.1.0/src/launcher/executor.cpp#L695]
 to SIGKILL.

This causes leaks when the POSIX containerizer is used, because if the command 
ignores SIGTERM it stays attached to init and never gets killed. Using 
pid/namespace only masks the problem, because the hanging process is captured 
before it can shut down gracefully.

The fix for this is to send SIGTERM only to {{sh}}'s children; {{sh}} will 
exit when all of its child processes finish. If not, they will be killed by 
escalation to SIGKILL.

All versions from 0.20 onwards are affected.

This test should pass 
[src/tests/command_executor_tests.cpp:342|https://github.com/apache/mesos/blob/2c856178b59593ff8068ea8d6c6593943c33008c/src/tests/command_executor_tests.cpp#L342-L343]
[Mailing list 
thread|https://lists.apache.org/thread.html/1025dca0cf4418aee50b14330711500af864f08b53eb82d10cd5c04c@%3Cuser.mesos.apache.org%3E]

  was:
Mesos Command Executor tries to support the grace period with escalation, but 
unfortunately it does not work. It launches {{command}} by wrapping it in 
{{sh -c}}, which causes the process tree to look like this

{code}
Received killTask
Shutting down
Sending SIGTERM to process tree at pid 18
Sent SIGTERM to the following process trees:
[ 
-+- 18 sh -c cd offer-i18n-0.1.24 && LD_PRELOAD=../librealresources.so 
./bin/offer-i18n -e prod -p $PORT0 
 \--- 19 command...
]
Command terminated with signal Terminated (pid: 18)
{code}

This causes {{sh}} to immediately close and so does the executor, while the 
wrapped {{command}} might need some more time to finish. Finally, the executor 
thinks the command executed gracefully so it won't 
[escalate|https://github.com/apache/mesos/blob/1.1.0/src/launcher/executor.cpp#L695]
 to SIGKILL.

This causes leaks when the POSIX containerizer is used because if the command 
ignores SIGTERM it will be attached to init and never get killed. Using 
pid/namespace only masks the problem because the hanging process is captured 
before it can gracefully shutdown.

The fix for this is to send SIGTERM only to {{sh}} children. {{sh}} will exit 
when all sub processes finish. If not, they will be killed by escalation to 
SIGKILL.

All versions from 0.20 are affected.

This test should pass 
[src/tests/command_executor_tests.cpp:342|https://github.com/apache/mesos/blob/2c856178b59593ff8068ea8d6c6593943c33008c/src/tests/command_executor_tests.cpp#L342-L343]
[Mailing list 
thread|https://lists.apache.org/thread.html/1025dca0cf4418aee50b14330711500af864f08b53eb82d10cd5c04c@%3Cuser.mesos.apache.org%3E]


> Executor does not respect grace period
> --
>
> Key: MESOS-6933
> URL: https://issues.apache.org/jira/browse/MESOS-6933
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Reporter: Tomasz Janiszewski
>
> Mesos Command Executor tries to support the grace period with escalation, but 
> unfortunately it does not work. It launches {{command}} by wrapping it in 
> {{sh -c}}, which causes the process tree to look like this
> {code}
> Received killTask
> Shutting down
> Sending SIGTERM to process tree at pid 18
> Sent SIGTERM to the following process trees:
> [ 
> -+- 18 sh -c cd offer-i18n-0.1.24 && LD_PRELOAD=../librealresources.so 
> ./bin/offer-i18n -e prod -p $PORT0 
>  \--- 19 command...
> ]
> Command terminated with signal Terminated (pid: 18)
> {code}
> This causes {{sh}} to close immediately, and the executor with it, while the 
> wrapped {{command}} might need some more time to finish. The executor then 
> thinks the command executed gracefully, so it won't 
> [escalate|https://github.com/apache/mesos/blob/1.1.0/src/launcher/executor.cpp#L695]
>  to SIGKILL.
> This causes leaks when the POSIX containerizer is used, because if the command 
> ignores SIGTERM it stays attached to init and never gets killed. Using 
> pid/namespace only masks the problem, because the hanging process is captured 
> before it can shut down gracefully.
> The fix for this is to send SIGTERM only to {{sh}}'s children; {{sh}} will 
> exit when all of its child processes finish. If not, they will be killed by 
> escalation to SIGKILL.

[jira] [Updated] (MESOS-6933) Executor does not respect grace period

2017-01-26 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-6933:

Description: 
Mesos Command Executor tries to support the grace period with escalation, but 
unfortunately it does not work. It launches {{command}} by wrapping it in 
{{sh -c}}, which causes the process tree to look like this

{code}
Received killTask
Shutting down
Sending SIGTERM to process tree at pid 18
Sent SIGTERM to the following process trees:
[ 
-+- 18 sh -c cd offer-i18n-0.1.24 && LD_PRELOAD=../librealresources.so 
./bin/offer-i18n -e prod -p $PORT0 
 \--- 19 command...
]
Command terminated with signal Terminated (pid: 18)
{code}

This causes {{sh}} to immediately close and so does the executor, while the 
wrapped {{command}} might need some more time to finish. Finally, the executor 
thinks the command executed gracefully so it won't 
[escalate|https://github.com/apache/mesos/blob/1.1.0/src/launcher/executor.cpp#L695]
 to SIGKILL.

This causes leaks when the POSIX containerizer is used because if the command 
ignores SIGTERM it will be attached to init and never get killed. Using 
pid/namespace only masks the problem because the hanging process is captured 
before it can gracefully shutdown.

The fix for this is to send SIGTERM only to {{sh}} children. {{sh}} will exit 
when all sub processes finish. If not, they will be killed by escalation to 
SIGKILL.

All versions from 0.20 are affected.

This test should pass 
[src/tests/command_executor_tests.cpp:342|https://github.com/apache/mesos/blob/2c856178b59593ff8068ea8d6c6593943c33008c/src/tests/command_executor_tests.cpp#L342-L343]
[Mailing list 
thread|https://lists.apache.org/thread.html/1025dca0cf4418aee50b14330711500af864f08b53eb82d10cd5c04c@%3Cuser.mesos.apache.org%3E]

  was:
Mesos Default Executor tries to support the grace period with escalation, but 
unfortunately it does not work. It launches {{command}} by wrapping it in 
{{sh -c}}, which causes the process tree to look like this

{code}
Received killTask
Shutting down
Sending SIGTERM to process tree at pid 18
Sent SIGTERM to the following process trees:
[ 
-+- 18 sh -c cd offer-i18n-0.1.24 && LD_PRELOAD=../librealresources.so 
./bin/offer-i18n -e prod -p $PORT0 
 \--- 19 command...
]
Command terminated with signal Terminated (pid: 18)
{code}

This causes {{sh}} to immediately close and so does the executor, while the 
wrapped {{command}} might need some more time to finish. Finally, the executor 
thinks the command executed gracefully so it won't 
[escalate|https://github.com/apache/mesos/blob/1.1.0/src/launcher/executor.cpp#L695]
 to SIGKILL.

This causes leaks when the POSIX containerizer is used because if the command 
ignores SIGTERM it will be attached to init and never get killed. Using 
pid/namespace only masks the problem because the hanging process is captured 
before it can gracefully shutdown.

The fix for this is to send SIGTERM only to {{sh}} children. {{sh}} will exit 
when all sub processes finish. If not, they will be killed by escalation to 
SIGKILL.

All versions from 0.20 are affected.

This test should pass 
[src/tests/command_executor_tests.cpp:342|https://github.com/apache/mesos/blob/2c856178b59593ff8068ea8d6c6593943c33008c/src/tests/command_executor_tests.cpp#L342-L343]
[Mailing list 
thread|https://lists.apache.org/thread.html/1025dca0cf4418aee50b14330711500af864f08b53eb82d10cd5c04c@%3Cuser.mesos.apache.org%3E]


> Executor does not respect grace period
> --
>
> Key: MESOS-6933
> URL: https://issues.apache.org/jira/browse/MESOS-6933
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Reporter: Tomasz Janiszewski
>
> Mesos Command Executor tries to support the grace period with escalation, but 
> unfortunately it does not work. It launches {{command}} by wrapping it in 
> {{sh -c}}, which causes the process tree to look like this
> {code}
> Received killTask
> Shutting down
> Sending SIGTERM to process tree at pid 18
> Sent SIGTERM to the following process trees:
> [ 
> -+- 18 sh -c cd offer-i18n-0.1.24 && LD_PRELOAD=../librealresources.so 
> ./bin/offer-i18n -e prod -p $PORT0 
>  \--- 19 command...
> ]
> Command terminated with signal Terminated (pid: 18)
> {code}
> This causes {{sh}} to close immediately, and the executor with it, while the 
> wrapped {{command}} might need some more time to finish. The executor then 
> thinks the command executed gracefully, so it won't 
> [escalate|https://github.com/apache/mesos/blob/1.1.0/src/launcher/executor.cpp#L695]
>  to SIGKILL.
> This causes leaks when the POSIX containerizer is used, because if the command 
> ignores SIGTERM it stays attached to init and never gets killed. Using 
> pid/namespace only masks the problem, because the hanging process is captured 
> before it can shut down gracefully.
> The fix for this is to send SIGTERM only to {{sh}} children. {{sh}} will exit 
> when all sub processes finish. If not, they will be killed by escalation to 
> SIGKILL.
> All versions from 0.20 are affected.

[jira] [Created] (MESOS-7005) Add executor authentication documentation

2017-01-26 Thread Greg Mann (JIRA)
Greg Mann created MESOS-7005:


 Summary: Add executor authentication documentation
 Key: MESOS-7005
 URL: https://issues.apache.org/jira/browse/MESOS-7005
 Project: Mesos
  Issue Type: Documentation
  Components: documentation
Reporter: Greg Mann


Documentation should be added regarding executor authentication. This will 
include:
* Adding the new flags to the configuration docs
* Updating the authentication documentation
* Updating the upgrade documentation
* Updating the CHANGELOG



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5410) Support cgroup namespace in unified container

2017-01-26 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-5410:

Component/s: isolation

> Support cgroup namespace in unified container
> -
>
> Key: MESOS-5410
> URL: https://issues.apache.org/jira/browse/MESOS-5410
> Project: Mesos
>  Issue Type: Improvement
>  Components: isolation
>Reporter: Qian Zhang
>Assignee: haosdent
>
> In the Linux 4.6 kernel, a new namespace (the cgroup namespace) was introduced 
> so that a process can be created in its own cgroup namespace and the global 
> cgroup hierarchy is not leaked to the process. See the following link for 
> more details about this namespace:
> http://man7.org/linux/man-pages/man7/cgroup_namespaces.7.html
> We need to support this namespace in the unified containerizer to provide 
> better isolation for the containers created by Mesos.
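
A quick way to see the effect on a recent kernel (a sketch; requires Linux >= 
4.6 and a util-linux whose {{unshare}} supports --cgroup):
{noformat}
# Outside: /proc/self/cgroup shows paths in the global hierarchy.
cat /proc/self/cgroup
# Inside a fresh cgroup namespace the paths are relative to the process's
# own cgroup, so the global hierarchy is not leaked.
sudo unshare --cgroup --fork cat /proc/self/cgroup
{noformat}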



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-7004) Enable multiple authenticator modules

2017-01-26 Thread Greg Mann (JIRA)
Greg Mann created MESOS-7004:


 Summary: Enable multiple authenticator modules
 Key: MESOS-7004
 URL: https://issues.apache.org/jira/browse/MESOS-7004
 Project: Mesos
  Issue Type: Task
  Components: modules, security
Reporter: Greg Mann


To accommodate executor authentication, we will add support for the loading of 
multiple authenticator modules. The {{--http_authenticators}} flag is already 
set up for this, but we must relax the constraint in Mesos which enforces just 
a single authenticator, and libprocess must implement this infrastructure.

Also, the {{Headers}} type in Mesos must be changed to a multihashmap to 
accommodate multiple {{WWW-Authenticate}} headers when multiple authenticators 
are loaded and none of them return a successful result.
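
For illustration, a failed request could then carry one challenge per loaded 
authenticator (a sketch; the realm values are illustrative):
{noformat}
HTTP/1.1 401 Unauthorized
WWW-Authenticate: Basic realm="mesos"
WWW-Authenticate: Bearer realm="mesos"
{noformat}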



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-7003) Introduce the AuthenticationContext

2017-01-26 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-7003:
-
Story Points: 5  (was: 8)

> Introduce the AuthenticationContext
> ---
>
> Key: MESOS-7003
> URL: https://issues.apache.org/jira/browse/MESOS-7003
> Project: Mesos
>  Issue Type: Task
>  Components: executor, security
>Reporter: Greg Mann
>  Labels: executor, security
>
> We will introduce a new type to represent the identity of an authenticated 
> entity in Mesos: the {{AuthenticationContext}}. To accomplish this, the 
> following should be done:
> * Add the new AuthenticationContext type
> * Update the AuthenticationResult type to use the AuthenticationContext
> * Update all authenticated endpoint handlers to handle this new type
> * Update the default authenticator modules to use the new type



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-7003) Introduce the AuthenticationContext

2017-01-26 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-7003:
-
Description: 
We will introduce a new type to represent the identity of an authenticated 
entity in Mesos: the {{AuthenticationContext}}. To accomplish this, the 
following should be done:
* Add the new AuthenticationContext type
* Update the AuthenticationResult type to use the AuthenticationContext
* Update all authenticated endpoint handlers to handle this new type
* Update the default authenticator modules to use the new type

  was:
We will introduce a new type to represent the identity of an authenticated 
entity in Mesos: the {{AuthenticationContext}}. To accomplish this, the 
following should be done:
* Add the new AuthenticationContext type
* Update the AuthenticationResult type to use the AuthenticationContext
* Update all authenticated endpoint handlers to handle this new type
* Update the default authenticator modules to use the new type
* Update the authorizer interface to accept the new type
* Update the local authorizer to accept the new type


> Introduce the AuthenticationContext
> ---
>
> Key: MESOS-7003
> URL: https://issues.apache.org/jira/browse/MESOS-7003
> Project: Mesos
>  Issue Type: Task
>  Components: executor, security
>Reporter: Greg Mann
>  Labels: executor, security
>
> We will introduce a new type to represent the identity of an authenticated 
> entity in Mesos: the {{AuthenticationContext}}. To accomplish this, the 
> following should be done:
> * Add the new AuthenticationContext type
> * Update the AuthenticationResult type to use the AuthenticationContext
> * Update all authenticated endpoint handlers to handle this new type
> * Update the default authenticator modules to use the new type



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-7003) Introduce the AuthenticationContext

2017-01-26 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-7003:
-
Description: 
We will introduce a new type to represent the identity of an authenticated 
entity in Mesos: the {{AuthenticationContext}}. To accomplish this, the 
following should be done:
* Add the new AuthenticationContext type
* Update the AuthenticationResult type to use the AuthenticationContext
* Update all authenticated endpoint handlers to handle this new type
* Update the default authenticator modules to use the new type
* Update the authorizer interface to accept the new type
* Update the local authorizer to accept the new type

  was:
We will introduce a new type to represent the identity of an authenticated 
entity in Mesos: the {{AuthenticationContext}}. To accomplish this, the 
following should be done:
* Add the new AuthenticationContext type
* Update the AuthenticationResult type to use the AuthenticationContext
* Update all authenticated endpoint handlers to handle this new type
* Update the default authenticator modules to use the new type


> Introduce the AuthenticationContext
> ---
>
> Key: MESOS-7003
> URL: https://issues.apache.org/jira/browse/MESOS-7003
> Project: Mesos
>  Issue Type: Task
>  Components: executor, security
>Reporter: Greg Mann
>  Labels: executor, security
>
> We will introduce a new type to represent the identity of an authenticated 
> entity in Mesos: the {{AuthenticationContext}}. To accomplish this, the 
> following should be done:
> * Add the new AuthenticationContext type
> * Update the AuthenticationResult type to use the AuthenticationContext
> * Update all authenticated endpoint handlers to handle this new type
> * Update the default authenticator modules to use the new type
> * Update the authorizer interface to accept the new type
> * Update the local authorizer to accept the new type



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-7003) Introduce the AuthenticationContext

2017-01-26 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-7003:
-
Story Points: 8  (was: 5)

> Introduce the AuthenticationContext
> ---
>
> Key: MESOS-7003
> URL: https://issues.apache.org/jira/browse/MESOS-7003
> Project: Mesos
>  Issue Type: Task
>  Components: executor, security
>Reporter: Greg Mann
>  Labels: executor, security
>
> We will introduce a new type to represent the identity of an authenticated 
> entity in Mesos: the {{AuthenticationContext}}. To accomplish this, the 
> following should be done:
> * Add the new AuthenticationContext type
> * Update the AuthenticationResult type to use the AuthenticationContext
> * Update all authenticated endpoint handlers to handle this new type
> * Update the default authenticator modules to use the new type
> * Update the authorizer interface to accept the new type
> * Update the local authorizer to accept the new type



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-7003) Introduce the AuthenticationContext

2017-01-26 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-7003:
-
Description: 
We will introduce a new type to represent the identity of an authenticated 
entity in Mesos: the {{AuthenticationContext}}. To accomplish this, the 
following should be done:
* Add the new AuthenticationContext type
* Update the AuthenticationResult type to use the AuthenticationContext
* Update all authenticated endpoint handlers to handle this new type
* Update the default authenticator modules to use the new type

  was:
The default executor should be updated to authenticate with the agent when HTTP 
executor authentication is enabled. This will entail:
* loading the default JWT authenticatee module
* calling into the authenticatee before making requests to the agent
* decorating requests with the headers returned by the authenticatee


> Introduce the AuthenticationContext
> ---
>
> Key: MESOS-7003
> URL: https://issues.apache.org/jira/browse/MESOS-7003
> Project: Mesos
>  Issue Type: Task
>  Components: executor, security
>Reporter: Greg Mann
>  Labels: executor, security
>
> We will introduce a new type to represent the identity of an authenticated 
> entity in Mesos: the {{AuthenticationContext}}. To accomplish this, the 
> following should be done:
> * Add the new AuthenticationContext type
> * Update the AuthenticationResult type to use the AuthenticationContext
> * Update all authenticated endpoint handlers to handle this new type
> * Update the default authenticator modules to use the new type



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)