[jira] [Created] (MESOS-6324) CNI should not use `ifconfig` in executors `pre_exec_command`

2016-10-06 Thread Avinash Sridharan (JIRA)
Avinash Sridharan created MESOS-6324:


 Summary: CNI should not use `ifconfig` in executors 
`pre_exec_command`
 Key: MESOS-6324
 URL: https://issues.apache.org/jira/browse/MESOS-6324
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Reporter: Avinash Sridharan
Assignee: Avinash Sridharan


Currently the `network/cni` isolator sets up a `pre_exec_command` for 
executors when a container needs to be launched on a non-host network. The 
`pre_exec_command` is `ifconfig lo up`, and is there primarily to bring 
loopback up in the new network namespace.

Using the `pre_exec_command` to bring loopback up is problematic since the 
executor's PATH variable is generally very limited (it doesn't contain all the 
paths that the agent's PATH variable has, due to security concerns), so 
`ifconfig` might not be found.

Therefore, instead of running `ifconfig lo up` in the `pre_exec_command`, we 
should run it in the `NetworkCniIsolatorSetup` subcommand, which runs with the 
same PATH variable as the agent.
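
A more robust direction than shelling out at all is to bring the interface up 
programmatically. Below is a minimal sketch (an illustration, not the actual 
Mesos change) that sets loopback up via ioctl(2), with no PATH dependency:

{code}
// Sketch: bring "lo" up directly via ioctl(2), avoiding any dependency
// on `ifconfig` being present on the executor's PATH.
#include <cstring>

#include <net/if.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

static int bringUpLoopback()
{
  int fd = socket(AF_INET, SOCK_DGRAM, 0);
  if (fd < 0) {
    return -1;
  }

  struct ifreq ifr;
  memset(&ifr, 0, sizeof(ifr));
  strncpy(ifr.ifr_name, "lo", IFNAMSIZ - 1);

  // Read the current flags, then add IFF_UP | IFF_RUNNING.
  if (ioctl(fd, SIOCGIFFLAGS, &ifr) < 0) {
    close(fd);
    return -1;
  }

  ifr.ifr_flags |= IFF_UP | IFF_RUNNING;

  int result = ioctl(fd, SIOCSIFFLAGS, &ifr);
  close(fd);
  return result;
}
{code}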



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6118) Agent would crash with docker container tasks due to host mount table read.

2016-10-06 Thread Kevin Klues (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Klues updated MESOS-6118:
---
Priority: Blocker  (was: Critical)

> Agent would crash with docker container tasks due to host mount table read.
> ---
>
> Key: MESOS-6118
> URL: https://issues.apache.org/jira/browse/MESOS-6118
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 1.0.1
> Environment: Build: 2016-08-26 23:06:27 by centos
> Version: 1.0.1
> Git tag: 1.0.1
> Git SHA: 3611eb0b7eea8d144e9b2e840e0ba16f2f659ee3
> systemd version `219` detected
> Inializing systemd state
> Created systemd slice: `/run/systemd/system/mesos_executors.slice`
> Started systemd slice `mesos_executors.slice`
> Using isolation: posix/cpu,posix/mem,filesystem/posix,network/cni
>  Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> Linux ip-10-254-192-40 3.10.0-327.28.3.el7.x86_64 #1 SMP Thu Aug 18 19:05:49 
> UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Jamie Briant
>Assignee: Kevin Klues
>Priority: Blocker
>  Labels: linux, slave
> Attachments: crashlogfull.log, cycle2.log, cycle3.log, cycle5.log, 
> cycle6.log, slave-crash.log
>
>
> I have a framework which schedules thousands of short-running tasks (a few 
> seconds to a few minutes each) over a period of several minutes. In 1.0.1, 
> the slave process will crash every few minutes (with systemd restarting it).
> Crash is:
> Sep 01 20:52:23 ip-10-254-192-99 mesos-slave: F0901 20:52:23.905678  1232 
> fs.cpp:140] Check failed: !visitedParents.contains(parentId)
> Sep 01 20:52:23 ip-10-254-192-99 mesos-slave: *** Check failure stack trace: 
> ***
> Version 1.0.0 works without this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6323) 'mesos-containerizer launch' should inherit agent environment variables.

2016-10-06 Thread Jie Yu (JIRA)
Jie Yu created MESOS-6323:
-

 Summary: 'mesos-containerizer launch' should inherit agent 
environment variables.
 Key: MESOS-6323
 URL: https://issues.apache.org/jira/browse/MESOS-6323
 Project: Mesos
  Issue Type: Bug
Reporter: Jie Yu
Priority: Critical


If some dynamic libraries that the agent depends on are stored in a 
non-standard location and the operator starts the agent using LD_LIBRARY_PATH, 
then when we fork/exec the 'mesos-containerizer launch' helper, we need to 
make sure it inherits the agent's environment variables. Otherwise, it might 
throw linking errors. This makes sense because the helper is a Mesos 
controlled process.

However, when the helper actually fork/execs the user container (or executor), 
we need to make sure to strip the agent environment variables.

The tricky case is the default executor and the command executor. These two 
are controlled by Mesos as well, so we also want them to have the agent 
environment variables. We need to somehow distinguish this from the custom 
executor case.
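
A minimal sketch of that distinction (the function names and structure here 
are assumptions for illustration, not the actual Mesos code):

{code}
// Sketch: exec a Mesos-controlled helper with the agent's full
// environment vs. exec a user command with a stripped environment.
#include <unistd.h>

extern char** environ;  // The agent's environment.

// Mesos-controlled process (e.g. 'mesos-containerizer launch'):
// inherit everything so that LD_LIBRARY_PATH et al. still apply.
// Assumes argv[0] is the helper's absolute path.
void execHelper(char* const argv[])
{
  execve(argv[0], argv, environ);
  _exit(1);  // Only reached if execve() failed.
}

// User container (or custom executor): pass an explicitly constructed
// environment containing only what the task is supposed to see.
void execUserCommand(char* const argv[], char* const taskEnv[])
{
  execve(argv[0], argv, taskEnv);
  _exit(1);
}
{code}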



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6322) Agent fails to kill empty parent container

2016-10-06 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15553587#comment-15553587
 ] 

Anand Mazumdar commented on MESOS-6322:
---

Hmm, looks like we need logic similar to what we introduced for MESOS-5380 to 
guard against these cases. A bit surprised that we did not add the logic to the 
{{subscribe}} handler on the agent for HTTP based executors but only added it 
for driver based executors (https://reviews.apache.org/r/47381).

> Agent fails to kill empty parent container
> --
>
> Key: MESOS-6322
> URL: https://issues.apache.org/jira/browse/MESOS-6322
> Project: Mesos
>  Issue Type: Bug
>Reporter: Greg Mann
>Assignee: Anand Mazumdar
>Priority: Blocker
>  Labels: mesosphere
>
> I launched a pod using Marathon, which led to the launching of a task group 
> on a Mesos agent. The pod spec was flawed, so this led to Marathon repeatedly 
> re-launching multiple instances of the task group. After this went on for a 
> few minutes, I told Marathon to scale the app to 0 instances, which sends 
> {{TASK_KILLED}} for one task in each task group. After this, the Mesos agent 
> reports {{TASK_KILLED}} status updates for all 3 tasks in the pod, but 
> hitting the {{/containers}} endpoint on the agent reveals that the executor 
> container for this task group is still running.
> Here is the task group launching on the agent:
> {code}
> slave.cpp:1696] Launching task group containing tasks [ 
> test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1, 
> test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask2, 
> test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.clientTask ] for 
> framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-
> {code}
> and here is the executor container starting:
> {code}
> mesos-agent[2994]: I1006 20:23:27.407563  3094 containerizer.cpp:965] 
> Starting container bf38ff09-3da1-487a-8926-1f4cc88bce32 for executor 
> 'instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601' of framework 
> 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-
> {code}
> and here is the output showing the {{TASK_KILLED}} updates for one task group:
> {code}
> mesos-agent[2994]: I1006 20:23:28.728224  3097 slave.cpp:2283] Asked to kill 
> task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1 of 
> framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-
> mesos-agent[2994]: W1006 20:23:28.728304  3097 slave.cpp:2364] Transitioning 
> the state of task 
> test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1 of 
> framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- to TASK_KILLED because 
> the executor is not registered
> mesos-agent[2994]: I1006 20:23:28.728659  3097 slave.cpp:3609] Handling 
> status update TASK_KILLED (UUID: 1cb8197a-7829-4a05-9cb1-14ba97519c42) for 
> task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1 of 
> framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- from @0.0.0.0:0
> mesos-agent[2994]: I1006 20:23:28.728817  3097 slave.cpp:3609] Handling 
> status update TASK_KILLED (UUID: e377e9fb-6466-4ce5-b32a-37d840b9f87c) for 
> task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask2 of 
> framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- from @0.0.0.0:0
> mesos-agent[2994]: I1006 20:23:28.728912  3097 slave.cpp:3609] Handling 
> status update TASK_KILLED (UUID: 24d44b25-ea52-43a1-afdb-6c04389879d2) for 
> task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.clientTask of 
> framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- from @0.0.0.0:0
> {code}
> however, if we grep the log for the executor's ID, the last line mentioning 
> it is:
> {code}
> slave.cpp:3080] Creating a marker file for HTTP based executor 
> 'instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601' of framework 
> 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- (via HTTP) at path 
> '/var/lib/mesos/slave/meta/slaves/42838ca8-8d6b-475b-9b3b-59f3cd0d6834-S0/frameworks/42838ca8-8d6b-475b-9b3b-59f3cd0d6834-/executors/instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601/runs/bf38ff09-3da1-487a-8926-1f4cc88bce32/http.marker'
> {code}
> so it seems the executor never exited. If we hit the agent's {{/containers}} 
> endpoint, we get output which includes this executor container:
> {code}
> {
> "container_id": "bf38ff09-3da1-487a-8926-1f4cc88bce32",
> "executor_id": "instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601",
> "executor_name": "",
> "framework_id": "42838ca8-8d6b-475b-9b3b-59f3cd0d6834-",
> "source": "",
> "statistics": {
>   "cpus_limit": 0.1,
>   "cpus_nr_periods": 17,
>   "cpus_nr_throttled": 11,
>   "cpus_system_time_secs": 0.02,
>   "cpus_throttled_time_secs": 0.784142448,
>   "cpus_user_time_secs": 0.09,
>   "disk_limit_bytes": 

[jira] [Created] (MESOS-6322) Agent fails to kill empty parent container

2016-10-06 Thread Greg Mann (JIRA)
Greg Mann created MESOS-6322:


 Summary: Agent fails to kill empty parent container
 Key: MESOS-6322
 URL: https://issues.apache.org/jira/browse/MESOS-6322
 Project: Mesos
  Issue Type: Bug
Reporter: Greg Mann
Assignee: Anand Mazumdar
Priority: Blocker


I launched a pod using Marathon, which led to the launching of a task group on 
a Mesos agent. The pod spec was flawed, so this led to Marathon repeatedly 
re-launching multiple instances of the task group. After this went on for a few 
minutes, I told Marathon to scale the app to 0 instances, which sends 
{{TASK_KILLED}} for one task in each task group. After this, the Mesos agent 
reports {{TASK_KILLED}} status updates for all 3 tasks in the pod, but hitting 
the {{/containers}} endpoint on the agent reveals that the executor container 
for this task group is still running.

Here is the task group launching on the agent:
{code}
slave.cpp:1696] Launching task group containing tasks [ 
test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1, 
test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask2, 
test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.clientTask ] for 
framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-
{code}
and here is the executor container starting:
{code}
mesos-agent[2994]: I1006 20:23:27.407563  3094 containerizer.cpp:965] Starting 
container bf38ff09-3da1-487a-8926-1f4cc88bce32 for executor 
'instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601' of framework 
42838ca8-8d6b-475b-9b3b-59f3cd0d6834-
{code}
and here is the output showing the {{TASK_KILLED}} updates for one task group:
{code}
mesos-agent[2994]: I1006 20:23:28.728224  3097 slave.cpp:2283] Asked to kill 
task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1 of 
framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834-
mesos-agent[2994]: W1006 20:23:28.728304  3097 slave.cpp:2364] Transitioning 
the state of task 
test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1 of framework 
42838ca8-8d6b-475b-9b3b-59f3cd0d6834- to TASK_KILLED because the executor 
is not registered
mesos-agent[2994]: I1006 20:23:28.728659  3097 slave.cpp:3609] Handling status 
update TASK_KILLED (UUID: 1cb8197a-7829-4a05-9cb1-14ba97519c42) for task 
test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1 of framework 
42838ca8-8d6b-475b-9b3b-59f3cd0d6834- from @0.0.0.0:0
mesos-agent[2994]: I1006 20:23:28.728817  3097 slave.cpp:3609] Handling status 
update TASK_KILLED (UUID: e377e9fb-6466-4ce5-b32a-37d840b9f87c) for task 
test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask2 of framework 
42838ca8-8d6b-475b-9b3b-59f3cd0d6834- from @0.0.0.0:0
mesos-agent[2994]: I1006 20:23:28.728912  3097 slave.cpp:3609] Handling status 
update TASK_KILLED (UUID: 24d44b25-ea52-43a1-afdb-6c04389879d2) for task 
test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.clientTask of framework 
42838ca8-8d6b-475b-9b3b-59f3cd0d6834- from @0.0.0.0:0
{code}
however, if we grep the log for the executor's ID, the last line mentioning it 
is:
{code}
slave.cpp:3080] Creating a marker file for HTTP based executor 
'instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601' of framework 
42838ca8-8d6b-475b-9b3b-59f3cd0d6834- (via HTTP) at path 
'/var/lib/mesos/slave/meta/slaves/42838ca8-8d6b-475b-9b3b-59f3cd0d6834-S0/frameworks/42838ca8-8d6b-475b-9b3b-59f3cd0d6834-/executors/instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601/runs/bf38ff09-3da1-487a-8926-1f4cc88bce32/http.marker'
{code}
so it seems the executor never exited. If we hit the agent's {{/containers}} 
endpoint, we get output which includes this executor container:
{code}
{
"container_id": "bf38ff09-3da1-487a-8926-1f4cc88bce32",
"executor_id": "instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601",
"executor_name": "",
"framework_id": "42838ca8-8d6b-475b-9b3b-59f3cd0d6834-",
"source": "",
"statistics": {
  "cpus_limit": 0.1,
  "cpus_nr_periods": 17,
  "cpus_nr_throttled": 11,
  "cpus_system_time_secs": 0.02,
  "cpus_throttled_time_secs": 0.784142448,
  "cpus_user_time_secs": 0.09,
  "disk_limit_bytes": 10485760,
  "disk_used_bytes": 20480,
  "mem_anon_bytes": 11337728,
  "mem_cache_bytes": 0,
  "mem_critical_pressure_counter": 0,
  "mem_file_bytes": 0,
  "mem_limit_bytes": 33554432,
  "mem_low_pressure_counter": 0,
  "mem_mapped_file_bytes": 0,
  "mem_medium_pressure_counter": 0,
  "mem_rss_bytes": 11337728,
  "mem_swap_bytes": 0,
  "mem_total_bytes": 12013568,
  "mem_unevictable_bytes": 0,
  "timestamp": 1475792290.12373
},
"status": {
  "executor_pid": 9068,
  "network_infos": [
{
  "ip_addresses": [
{
  "ip_address": "9.0.1.34",
  "protocol": "IPv4"

[jira] [Updated] (MESOS-6031) Collect throttle related metrics for DockerContainerizer.

2016-10-06 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-6031:
-
Target Version/s: 1.1.0
   Fix Version/s: (was: 1.1.0)

> Collect throttle related metrics for DockerContainerizer.
> -
>
> Key: MESOS-6031
> URL: https://issues.apache.org/jira/browse/MESOS-6031
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 1.0.0
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>  Labels: containerizer, docker
>
> MESOS-2154 added support for porting the CFS quota to the Docker 
> containerizer, but the metric collection part is still missing.
> We can collect the related metrics in the Docker containerizer in a fashion 
> similar to cgroups/cpushare.cpp.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4948) Move maintenance tests to use the new scheduler library interface.

2016-10-06 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15553459#comment-15553459
 ] 

Ilya Pronin commented on MESOS-4948:


Review request: https://reviews.apache.org/r/52620/

> Move maintenance tests to use the new scheduler library interface.
> --
>
> Key: MESOS-4948
> URL: https://issues.apache.org/jira/browse/MESOS-4948
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
> Environment: Ubuntu 14.04, using gcc, with libevent and SSL enabled 
> (on ASF CI)
>Reporter: Greg Mann
>Assignee: Ilya Pronin
>  Labels: flaky-test, maintenance, mesosphere, newbie
>
> We need to move the existing maintenance tests to use the new scheduler 
> interface. We have already moved 1 test, 
> {{MasterMaintenanceTest.PendingUnavailabilityTest}}, to use the new 
> interface. It would be good to move the other 2 remaining tests to the new 
> interface, since the old interface can lead to failures around a stack 
> object being referenced after it has already been destroyed. A detailed log 
> from an ASF CI build failure follows:
> {code}
> [ RUN  ] MasterMaintenanceTest.InverseOffers
> I0315 04:16:50.786032  2681 leveldb.cpp:174] Opened db in 125.361171ms
> I0315 04:16:50.836374  2681 leveldb.cpp:181] Compacted db in 50.254411ms
> I0315 04:16:50.836470  2681 leveldb.cpp:196] Created db iterator in 25917ns
> I0315 04:16:50.836488  2681 leveldb.cpp:202] Seeked to beginning of db in 
> 3291ns
> I0315 04:16:50.836498  2681 leveldb.cpp:271] Iterated through 0 keys in the 
> db in 253ns
> I0315 04:16:50.836549  2681 replica.cpp:779] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0315 04:16:50.837474  2702 recover.cpp:447] Starting replica recovery
> I0315 04:16:50.837565  2681 cluster.cpp:183] Creating default 'local' 
> authorizer
> I0315 04:16:50.838191  2702 recover.cpp:473] Replica is in EMPTY status
> I0315 04:16:50.839532  2704 replica.cpp:673] Replica in EMPTY status received 
> a broadcasted recover request from (4784)@172.17.0.4:39845
> I0315 04:16:50.839754  2705 recover.cpp:193] Received a recover response from 
> a replica in EMPTY status
> I0315 04:16:50.841893  2704 recover.cpp:564] Updating replica status to 
> STARTING
> I0315 04:16:50.842566  2703 master.cpp:376] Master 
> c326bc68-2581-48d4-9dc4-0d6f270bdda1 (01fcd642f65f) started on 
> 172.17.0.4:39845
> I0315 04:16:50.842644  2703 master.cpp:378] Flags at startup: --acls="" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate="false" --authenticate_http="true" 
> --authenticate_slaves="true" --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/DE2Uaw/credentials" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" --max_slave_ping_timeouts="5" 
> --quiet="false" --recovery_slave_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_store_timeout="100secs" --registry_strict="true" 
> --root_submissions="true" --slave_ping_timeout="15secs" 
> --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" 
> --webui_dir="/mesos/mesos-0.29.0/_inst/share/mesos/webui" 
> --work_dir="/tmp/DE2Uaw/master" --zk_session_timeout="10secs"
> I0315 04:16:50.843168  2703 master.cpp:425] Master allowing unauthenticated 
> frameworks to register
> I0315 04:16:50.843227  2703 master.cpp:428] Master only allowing 
> authenticated slaves to register
> I0315 04:16:50.843302  2703 credentials.hpp:35] Loading credentials for 
> authentication from '/tmp/DE2Uaw/credentials'
> I0315 04:16:50.843737  2703 master.cpp:468] Using default 'crammd5' 
> authenticator
> I0315 04:16:50.843969  2703 master.cpp:537] Using default 'basic' HTTP 
> authenticator
> I0315 04:16:50.844177  2703 master.cpp:571] Authorization enabled
> I0315 04:16:50.844360  2708 hierarchical.cpp:144] Initialized hierarchical 
> allocator process
> I0315 04:16:50.844430  2708 whitelist_watcher.cpp:77] No whitelist given
> I0315 04:16:50.848227  2703 master.cpp:1806] The newly elected leader is 
> master@172.17.0.4:39845 with id c326bc68-2581-48d4-9dc4-0d6f270bdda1
> I0315 04:16:50.848269  2703 master.cpp:1819] Elected as the leading master!
> I0315 04:16:50.848292  2703 master.cpp:1508] Recovering from registrar
> I0315 04:16:50.848563  2703 registrar.cpp:307] Recovering registrar
> I0315 04:16:50.876277  2711 leveldb.cpp:304] Persisting metadata (8 bytes) to 
> leveldb took 34.178445ms
> I0315 04:16:50.876365  2711 replica.cpp:320] Persisted replica status to 
> STARTING
> I0315 04:16:50.876776  2711 

[jira] [Comment Edited] (MESOS-6223) Allow agents to re-register post a host reboot

2016-10-06 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15553432#comment-15553432
 ] 

Yan Xu edited comment on MESOS-6223 at 10/6/16 10:48 PM:
-

[~neilc] [~vinodkone] I can think of ways we can implement restarting tasks 
post-reboot (MESOS-3545, will have a design doc out soon) via either the 
approach in this ticket or the one in MESOS-5368, but this one feels simpler. 
Treating reboot as a special case sounds to me like an optimization that will 
no longer hold true once tasks can be restarted. Then the questions are:

1) Should the agent ID *always* change after a reboot?
2) Does the agent ID *ever have to* change when its {{work_dir}} hasn't 
changed?

For 1), sounds like no.

For 2), on the master the only error case where we disallow an agent to 
reregister but do allow the agent to register is [when the agent's ip or 
hostname has 
changed|https://github.com/apache/mesos/blob/3902b051f2cff59c55535dae08ebd4223833b0a0/src/master/master.cpp#L5228]
 (a hostname change already prevents the agent from restarting). I can imagine 
we'd want to force the agent to get rid of its {{work_dir//slave_id}} 
but keep the checkpointed resources etc.?

To summarize, it seems like we can keep both this ticket and MESOS-5368, but 
change MESOS-5368 to not change the session ID in the reboot case?

Thoughts?


was (Author: xujyan):
[~neilc] [~vinodkone] I can think of ways we can implement restarting tasks 
post-reboot (MESOS-3545, will have a design doc out soon) via either the 
approach in this ticket or the one in MESOS-5368, but this one feels simpler. 
Treating reboot as a special case sounds to me like an optimization that will 
no longer hold true once tasks can be restarted. Then the questions are:

1) Should the agent ID *always* change after a reboot?
2) Does the agent ID *ever have to* change when its {{work_dir}} hasn't 
changed?

For 1), sounds like no.

For 2), on the master the only error case where we disallow an agent to 
reregister but do allow the agent to register is [when the agent's ip or 
hostname has 
changed|https://github.com/apache/mesos/blob/3902b051f2cff59c55535dae08ebd4223833b0a0/src/master/master.cpp#L5228].
 I can imagine we'd want to force the agent to get rid of its 
{{work_dir//slave_id}} but keep the checkpointed resources etc.?

To summarize, it seems like we can keep both this ticket and MESOS-5368, but 
change MESOS-5368 to not change the session ID in the reboot case?

Thoughts?

> Allow agents to re-register post a host reboot
> --
>
> Key: MESOS-6223
> URL: https://issues.apache.org/jira/browse/MESOS-6223
> Project: Mesos
>  Issue Type: Improvement
>  Components: slave
>Reporter: Megha
>
> The agent doesn't recover its state post a host reboot; it registers with 
> the master and gets a new SlaveID. With partition awareness, agents are now 
> allowed to re-register after they have been marked Unreachable. The 
> executors are terminated on the agent anyway when it reboots, so there is no 
> harm in letting the agent keep its SlaveID, re-register with the master, and 
> reconcile the lost executors. This is a pre-requisite for supporting 
> persistent/restartable tasks in Mesos (MESOS-3545).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot

2016-10-06 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15553432#comment-15553432
 ] 

Yan Xu commented on MESOS-6223:
---

[~neilc] [~vinodkone] I can think of ways we can implement restarting tasks 
post-reboot (MESOS-3545, will have a design doc out soon) via either the 
approach in this ticket or the one in MESOS-5368, but this one feels simpler. 
Treating reboot as a special case sounds to me like an optimization that will 
no longer hold true once tasks can be restarted. Then the questions are:

1) Should the agent ID *always* change after a reboot?
2) Does the agent ID *ever have to* change when its {{work_dir}} hasn't 
changed?

For 1), sounds like no.

For 2), on the master the only error case where we disallow an agent to 
reregister but do allow the agent to register is [when the agent's ip or 
hostname has 
changed|https://github.com/apache/mesos/blob/3902b051f2cff59c55535dae08ebd4223833b0a0/src/master/master.cpp#L5228].
 I can imagine we'd want to force the agent to get rid of its 
{{work_dir//slave_id}} but keep the checkpointed resources etc.?

To summarize, it seems like we can keep both this ticket and MESOS-5368, but 
change MESOS-5368 to not change the session ID in the reboot case?

Thoughts?

> Allow agents to re-register post a host reboot
> --
>
> Key: MESOS-6223
> URL: https://issues.apache.org/jira/browse/MESOS-6223
> Project: Mesos
>  Issue Type: Improvement
>  Components: slave
>Reporter: Megha
>
> The agent doesn't recover its state post a host reboot; it registers with 
> the master and gets a new SlaveID. With partition awareness, agents are now 
> allowed to re-register after they have been marked Unreachable. The 
> executors are terminated on the agent anyway when it reboots, so there is no 
> harm in letting the agent keep its SlaveID, re-register with the master, and 
> reconcile the lost executors. This is a pre-requisite for supporting 
> persistent/restartable tasks in Mesos (MESOS-3545).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6228) Add timeout to /metrics/snapshot calls in tests

2016-10-06 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15553271#comment-15553271
 ] 

Neil Conway commented on MESOS-6228:


If we added a request timeout, the HTTP request would return successfully even 
if fetching some metric timed out. It's not clear that this is actually better 
behavior. In this situation, we would {{VLOG(1)}} which metric has timed out; 
we could perhaps increase the verbosity of that error message and then enable 
the request timeout.
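
For reference, a sketch of the test-side call with the endpoint's {{timeout}} 
query parameter (the UPID construction mirrors the usual test helper; treat 
the details as an assumption):

{code}
// Sketch: fetch /metrics/snapshot with an explicit timeout so a stuck
// metric fails the request quickly instead of hanging the test.
process::UPID upid("metrics", process::address());

process::Future<process::http::Response> response =
  process::http::get(upid, "snapshot", "timeout=15secs");

AWAIT_EXPECT_RESPONSE_STATUS_EQ(process::http::OK().status, response);
{code}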

> Add timeout to /metrics/snapshot calls in tests
> ---
>
> Key: MESOS-6228
> URL: https://issues.apache.org/jira/browse/MESOS-6228
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: Neil Conway
>  Labels: mesosphere, newbie++
>
> In the unit tests, {{Metrics()}} does an {{http::get}} of the 
> {{/metrics/snapshot}} endpoint. No {{timeout}} parameter is provided. That 
> means if any metric cannot be fetched, the request hangs for 15 seconds and 
> then dies with a mysterious / unclear error message. Digging into which 
> metric has hung and for what reason requires a lot of time / debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6228) Add timeout to /metrics/snapshot calls in tests

2016-10-06 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-6228:
---
Description: In the unit tests, {{Metrics()}} does an {{http::get}} of the 
{{/metrics/snapshot}} endpoint. No {{timeout}} parameter is provided. That 
means if any metric cannot be fetched, the request hangs for 15 seconds and 
then dies with a mysterious / unclear error message. Digging into which metric 
has hung and for what reason requires a lot of time / debugging.  (was: In the 
unit tests, {{Metrics()}} does an {{http::get}} of the {{/metrics/snapshot}} 
endpoint. No {{timeout}} parameter is provided. That means if any metric cannot 
be fetched, the request hangs for 15 seconds and then dies with a mysterious / 
unclear error message. Digging into which metric has hung and for what reason 
requires a lot of time / debugging.

Instead, we should specify a reasonable timeout (e.g., 15 seconds) and fail the 
test if the timeout fires.)

> Add timeout to /metrics/snapshot calls in tests
> ---
>
> Key: MESOS-6228
> URL: https://issues.apache.org/jira/browse/MESOS-6228
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: Neil Conway
>  Labels: mesosphere, newbie++
>
> In the unit tests, {{Metrics()}} does an {{http::get}} of the 
> {{/metrics/snapshot}} endpoint. No {{timeout}} parameter is provided. That 
> means if any metric cannot be fetched, the request hangs for 15 seconds and 
> then dies with a mysterious / unclear error message. Digging into which 
> metric has hung and for what reason requires a lot of time / debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6228) Add timeout to /metrics/snapshot calls in tests

2016-10-06 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-6228:
---
Description: 
In the unit tests, {{Metrics()}} does an {{http::get}} of the 
{{/metrics/snapshot}} endpoint. No {{timeout}} parameter is provided. That 
means if any metric cannot be fetched, the request hangs for 15 seconds and 
then dies with a mysterious / unclear error message. Digging into which metric 
has hung and for what reason requires a lot of time / debugging.

Instead, we should specify a reasonable timeout (e.g., 15 seconds) and fail the 
test if the timeout fires.

  was:
In the unit tests, {{Metrics()}} does an {{http::get}} of the 
{{/metrics/snapshot}} endpoint. No {{timeout}} parameter is provided. That 
means if any metric cannot be fetched, the request hangs forever.

Instead, we should specify a reasonable timeout (e.g., 15 seconds) and fail the 
test if the timeout fires.


> Add timeout to /metrics/snapshot calls in tests
> ---
>
> Key: MESOS-6228
> URL: https://issues.apache.org/jira/browse/MESOS-6228
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: Neil Conway
>  Labels: mesosphere, newbie++
>
> In the unit tests, {{Metrics()}} does an {{http::get}} of the 
> {{/metrics/snapshot}} endpoint. No {{timeout}} parameter is provided. That 
> means if any metric cannot be fetched, the request hangs for 15 seconds and 
> then dies with a mysterious / unclear error message. Digging into which 
> metric has hung and for what reason requires a lot of time / debugging.
> Instead, we should specify a reasonable timeout (e.g., 15 seconds) and fail 
> the test if the timeout fires.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5368) Consider introducing persistent agent ID

2016-10-06 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15553204#comment-15553204
 ] 

Yan Xu commented on MESOS-5368:
---

[~neilc] In an alternative approach, would we achieve the same thing if we 
changed the semantics to have the agent *only* change its ID when we 
"permanently remove (decommission) an agent ({{work_dir}})"?

> Consider introducing persistent agent ID
> 
>
> Key: MESOS-5368
> URL: https://issues.apache.org/jira/browse/MESOS-5368
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Neil Conway
>Assignee: Abhishek Dasgupta
>  Labels: mesosphere
>
> Currently, agent IDs identify a single "session" by an agent: that is, an 
> agent receives an agent ID when it registers with the master; it reuses that 
> agent ID if it disconnects and successfully reregisters; if the agent shuts 
> down and restarts, it registers anew and receives a new agent ID.
> It would be convenient to have a "persistent agent ID" that remains the same 
> for the duration of a given agent {{work_dir}}. This would mean that a given 
> persistent volume would not migrate between different persistent agent IDs 
> over time, for example (see MESOS-4894). If we supported permanently removing 
> an agent from the cluster (i.e., the {{work_dir}} and any volumes used by the 
> agent will never be reused), we could use the persistent agent ID to report 
> which agent has been removed.
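
For illustration, a minimal sketch (an assumption, not a proposed patch) of 
how such a persistent ID could be tied to the {{work_dir}} by checkpointing it 
there on first start:

{code}
// Sketch: same work_dir => same persistent agent ID across restarts.
#include <fstream>
#include <string>

std::string persistentAgentId(const std::string& workDir)
{
  const std::string path = workDir + "/persistent_agent_id";

  std::ifstream in(path);
  std::string id;
  if (in >> id) {
    return id;  // Recovered the checkpointed ID.
  }

  id = newUUID();  // Hypothetical helper that generates a fresh UUID.

  std::ofstream out(path);
  out << id;
  return id;
}
{code}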



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6321) CHECK failure in HierarchicalAllocatorTest.NoDoubleAccounting

2016-10-06 Thread Neil Conway (JIRA)
Neil Conway created MESOS-6321:
--

 Summary: CHECK failure in 
HierarchicalAllocatorTest.NoDoubleAccounting
 Key: MESOS-6321
 URL: https://issues.apache.org/jira/browse/MESOS-6321
 Project: Mesos
  Issue Type: Bug
Reporter: Neil Conway
Assignee: Alexander Rukletsov


Observed in internal CI:

{noformat}
[15:52:21] : [Step 10/10] [ RUN  ] 
HierarchicalAllocatorTest.NoDoubleAccounting
[15:52:21]W: [Step 10/10] I1006 15:52:21.813817 23713 hierarchical.cpp:275] 
Added framework framework1
[15:52:21]W: [Step 10/10] I1006 15:52:21.814100 23713 
hierarchical.cpp:1694] No allocations performed
[15:52:21]W: [Step 10/10] I1006 15:52:21.814102 23712 process.cpp:3377] 
Handling HTTP event for process 'metrics' with path: '/metrics/snapshot'
[15:52:21]W: [Step 10/10] I1006 15:52:21.814121 23713 
hierarchical.cpp:1789] No inverse offers to send out!
[15:52:21]W: [Step 10/10] I1006 15:52:21.814146 23713 
hierarchical.cpp:1286] Performed allocation for 0 agents in 52445ns
[15:52:21]W: [Step 10/10] I1006 15:52:21.814206 23713 hierarchical.cpp:485] 
Added agent agent1 (agent1) with cpus(*):1 (allocated: cpus(*):1)
[15:52:21]W: [Step 10/10] I1006 15:52:21.814237 23713 
hierarchical.cpp:1694] No allocations performed
[15:52:21]W: [Step 10/10] I1006 15:52:21.814247 23713 
hierarchical.cpp:1789] No inverse offers to send out!
[15:52:21]W: [Step 10/10] I1006 15:52:21.814259 23713 
hierarchical.cpp:1309] Performed allocation for agent agent1 in 33887ns
[15:52:21]W: [Step 10/10] I1006 15:52:21.814294 23713 hierarchical.cpp:485] 
Added agent agent2 (agent2) with cpus(*):1 (allocated: cpus(*):1)
[15:52:21]W: [Step 10/10] I1006 15:52:21.814332 23713 
hierarchical.cpp:1694] No allocations performed
[15:52:21]W: [Step 10/10] I1006 15:52:21.814342 23713 
hierarchical.cpp:1789] No inverse offers to send out!
[15:52:21]W: [Step 10/10] I1006 15:52:21.814349 23713 
hierarchical.cpp:1309] Performed allocation for agent agent2 in 42682ns
[15:52:21]W: [Step 10/10] I1006 15:52:21.814417 23713 hierarchical.cpp:275] 
Added framework framework2
[15:52:21]W: [Step 10/10] I1006 15:52:21.814445 23713 
hierarchical.cpp:1694] No allocations performed
[15:52:21]W: [Step 10/10] I1006 15:52:21.814455 23713 
hierarchical.cpp:1789] No inverse offers to send out!
[15:52:21]W: [Step 10/10] I1006 15:52:21.814469 23713 
hierarchical.cpp:1286] Performed allocation for 2 agents in 37976ns
[15:52:21]W: [Step 10/10] F1006 15:52:21.824954 23692 json.hpp:334] Check 
failed: 'boost::get(this)' Must be non NULL
[15:52:21]W: [Step 10/10] *** Check failure stack trace: ***
[15:52:21]W: [Step 10/10] @ 0x7fe953bbd71d  
google::LogMessage::Fail()
[15:52:21]W: [Step 10/10] @ 0x7fe953bbf55d  
google::LogMessage::SendToLog()
[15:52:21]W: [Step 10/10] @ 0x7fe953bbd30c  
google::LogMessage::Flush()
[15:52:21]W: [Step 10/10] @ 0x7fe953bbfe59  
google::LogMessageFatal::~LogMessageFatal()
[15:52:21]W: [Step 10/10] @   0x7cc903  JSON::Value::as<>()
[15:52:21]W: [Step 10/10] @   0x8b633c  
mesos::internal::tests::HierarchicalAllocatorTest_NoDoubleAccounting_Test::TestBody()
[15:52:21]W: [Step 10/10] @  0x129ce23  
testing::internal::HandleExceptionsInMethodIfSupported<>()
[15:52:21]W: [Step 10/10] @  0x1292f07  testing::Test::Run()
[15:52:21]W: [Step 10/10] @  0x1292fae  testing::TestInfo::Run()
[15:52:21]W: [Step 10/10] @  0x12930b5  testing::TestCase::Run()
[15:52:21]W: [Step 10/10] @  0x1293368  
testing::internal::UnitTestImpl::RunAllTests()
[15:52:21]W: [Step 10/10] @  0x1293624  testing::UnitTest::Run()
[15:52:21]W: [Step 10/10] @   0x507254  main
[15:52:21]W: [Step 10/10] @ 0x7fe95122876d  (unknown)
[15:52:21]W: [Step 10/10] @   0x51e341  (unknown)
[15:52:21]W: [Step 10/10] Aborted (core dumped)
[15:52:21]W: [Step 10/10] Process exited with code 134
{noformat}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6319) ContentType/AgentAPITest.NestedContainerLaunch/1 is flaky

2016-10-06 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-6319:
---
Assignee: Benjamin Mahler
  Sprint: Mesosphere Sprint 44

> ContentType/AgentAPITest.NestedContainerLaunch/1 is flaky
> -
>
> Key: MESOS-6319
> URL: https://issues.apache.org/jira/browse/MESOS-6319
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, tests
>Affects Versions: 1.1.0
> Environment: ubuntu-14.04, autotools build, verbose build
>Reporter: Benjamin Bannier
>Assignee: Benjamin Mahler
>  Labels: flaky-test
> Attachments: build.log
>
>
> {{ContentType/AgentAPITest.NestedContainerLaunch/1}} is flaky; I saw this 
> fail in ASF CI (https://builds.apache.org/job/mesos-reviewbot/15545/):
> {code}
> ../../src/tests/api_tests.cpp:3552: Failure
> (wait).failure(): Unexpected response status 404 Not Found
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-6304) Add authentication support to the default executor

2016-10-06 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-6304:


Assignee: Greg Mann  (was: Artem Harutyunyan)

> Add authentication support to the default executor
> --
>
> Key: MESOS-6304
> URL: https://issues.apache.org/jira/browse/MESOS-6304
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Galen Pewtherer
>Assignee: Greg Mann
>
> Right now the default executor (used to launch task groups) does not 
> authenticate with either the executor API (/v1/executor) or the agent API 
> (v1). Of course, the driver-based executor doesn't authenticate either.
> It would be great to come up with a solution that works for both the 
> built-in executors and custom executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6319) ContentType/AgentAPITest.NestedContainerLaunch/1 is flaky

2016-10-06 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15552542#comment-15552542
 ] 

Anand Mazumdar commented on MESOS-6319:
---

[~bmahler] Can you take a look since you added this test recently?

> ContentType/AgentAPITest.NestedContainerLaunch/1 is flaky
> -
>
> Key: MESOS-6319
> URL: https://issues.apache.org/jira/browse/MESOS-6319
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, tests
>Affects Versions: 1.1.0
> Environment: ubuntu-14.04, autotools build, verbose build
>Reporter: Benjamin Bannier
>  Labels: flaky-test
> Attachments: build.log
>
>
> {{ContentType/AgentAPITest.NestedContainerLaunch/1}} is flaky; I saw this 
> fail in ASF CI (https://builds.apache.org/job/mesos-reviewbot/15545/):
> {code}
> ../../src/tests/api_tests.cpp:3552: Failure
> (wait).failure(): Unexpected response status 404 Not Found
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot

2016-10-06 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15552427#comment-15552427
 ] 

Neil Conway commented on MESOS-6223:


cc [~vinodkone]

> Allow agents to re-register post a host reboot
> --
>
> Key: MESOS-6223
> URL: https://issues.apache.org/jira/browse/MESOS-6223
> Project: Mesos
>  Issue Type: Improvement
>  Components: slave
>Reporter: Megha
>
> The agent doesn't recover its state post a host reboot; it registers with 
> the master and gets a new SlaveID. With partition awareness, agents are now 
> allowed to re-register after they have been marked Unreachable. The 
> executors are terminated on the agent anyway when it reboots, so there is no 
> harm in letting the agent keep its SlaveID, re-register with the master, and 
> reconcile the lost executors. This is a pre-requisite for supporting 
> persistent/restartable tasks in Mesos (MESOS-3545).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot

2016-10-06 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15552425#comment-15552425
 ] 

Neil Conway commented on MESOS-6223:


Another way to go here would be to introduce a new type of "persistent agent 
ID", as discussed in MESOS-5368 -- that would essentially be an ID for a given 
{{work_dir}}, whereas the existing Agent ID would remain closer to a "session 
ID".

> Allow agents to re-register post a host reboot
> --
>
> Key: MESOS-6223
> URL: https://issues.apache.org/jira/browse/MESOS-6223
> Project: Mesos
>  Issue Type: Improvement
>  Components: slave
>Reporter: Megha
>
> The agent doesn't recover its state post a host reboot; it registers with 
> the master and gets a new SlaveID. With partition awareness, agents are now 
> allowed to re-register after they have been marked Unreachable. The 
> executors are terminated on the agent anyway when it reboots, so there is no 
> harm in letting the agent keep its SlaveID, re-register with the master, and 
> reconcile the lost executors. This is a pre-requisite for supporting 
> persistent/restartable tasks in Mesos (MESOS-3545).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6288) The default executor should maintain launcher_dir.

2016-10-06 Thread Gastón Kleiman (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15552417#comment-15552417
 ] 

Gastón Kleiman commented on MESOS-6288:
---

Patches:

https://reviews.apache.org/r/52556
https://reviews.apache.org/r/52608/

> The default executor should maintain launcher_dir.
> --
>
> Key: MESOS-6288
> URL: https://issues.apache.org/jira/browse/MESOS-6288
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Assignee: Gastón Kleiman
>  Labels: health-check, mesosphere
>
> Both the command and docker executors require that {{launcher_dir}} be 
> provided via a flag. This directory contains Mesos binaries, e.g., the TCP 
> checker necessary for TCP health checks. The default executor should somehow 
> obtain (via a flag or an env var) and maintain this directory for the health 
> checker to use.
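
A sketch of the flag half of this, modeled on the stout flags style used by 
the built-in executors (the exact wiring in the default executor is an 
assumption):

{code}
// Sketch: a launcher_dir flag; PKGLIBEXECDIR is the compiled-in
// default the built-in executors use (assumed defined by the build).
#include <string>

#include <stout/flags.hpp>

struct Flags : public virtual flags::FlagsBase
{
  Flags()
  {
    add(&Flags::launcher_dir,
        "launcher_dir",
        "Directory path of Mesos binaries (e.g., the health check helper).",
        PKGLIBEXECDIR);
  }

  std::string launcher_dir;
};
{code}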



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6311) Consider supporting implicit reconciliation per agent

2016-10-06 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15552411#comment-15552411
 ] 

Neil Conway commented on MESOS-6311:


Seems reasonable to me, although I'd like to think about this in the context 
of making broader changes to the reconciliation API (see MESOS-5950, 
MESOS-4050, etc.).

> Consider supporting implicit reconciliation per agent
> -
>
> Key: MESOS-6311
> URL: https://issues.apache.org/jira/browse/MESOS-6311
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Joris Van Remoortere
>
> Currently Mesos only supports:
> - total implicit reconciliation
> - explicit reconciliation per task
> Since agents can slowly rejoin the master after a master failover, it is 
> hard to put a low time bound on implicit reconciliation for tasks.
> Performing the current implicit reconciliation is expensive on big clusters, 
> so it should not be done every N seconds.
> If we could perform implicit reconciliation for a particular agent, it would 
> be cheap enough to do right after we notice that particular agent rejoining 
> the cluster.
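
Purely for illustration, a hypothetical shape for such a call (nothing below 
exists in the v1 scheduler API today; the agent scoping is exactly what this 
ticket proposes):

{code}
// Hypothetical sketch, using mesos::v1::scheduler::Call. A RECONCILE
// call with an empty task list currently reconciles the whole
// framework; the proposal would allow restricting it by agent.
Call call;
call.set_type(Call::RECONCILE);

Call::Reconcile* reconcile = call.mutable_reconcile();

// No tasks listed => "implicit" reconciliation...
// ...but scoped to a single agent (proposed field, does not exist yet):
reconcile->mutable_agent_id()->set_value("42838ca8-...-S0");

mesos.send(call);  // v1 scheduler library send.
{code}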



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6278) Add test cases for the HTTP health checks

2016-10-06 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6278:
---
Target Version/s:   (was: 1.1.0)

> Add test cases for the HTTP health checks
> -
>
> Key: MESOS-6278
> URL: https://issues.apache.org/jira/browse/MESOS-6278
> Project: Mesos
>  Issue Type: Task
>  Components: tests
>Reporter: haosdent
>Assignee: haosdent
>  Labels: health-check, mesosphere, test
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6279) Add test cases for the TCP health check

2016-10-06 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6279:
---
Target Version/s:   (was: 1.1.0)

> Add test cases for the TCP health check
> ---
>
> Key: MESOS-6279
> URL: https://issues.apache.org/jira/browse/MESOS-6279
> Project: Mesos
>  Issue Type: Task
>  Components: tests
>Reporter: haosdent
>Assignee: haosdent
>  Labels: health-check, mesosphere, test
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6288) The default executor should maintain launcher_dir.

2016-10-06 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6288:
---
Fix Version/s: (was: 1.1.0)

> The default executor should maintain launcher_dir.
> --
>
> Key: MESOS-6288
> URL: https://issues.apache.org/jira/browse/MESOS-6288
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Assignee: Gastón Kleiman
>  Labels: health-check, mesosphere
>
> Both the command and docker executors require that {{launcher_dir}} be 
> provided via a flag. This directory contains Mesos binaries, e.g., the TCP 
> checker necessary for TCP health checks. The default executor should somehow 
> obtain (via a flag or an env var) and maintain this directory for the health 
> checker to use.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6184) Health checks should use a general mechanism to enter namespaces of the task.

2016-10-06 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6184:
---
Fix Version/s: (was: 1.1.0)

> Health checks should use a general mechanism to enter namespaces of the task.
> -
>
> Key: MESOS-6184
> URL: https://issues.apache.org/jira/browse/MESOS-6184
> Project: Mesos
>  Issue Type: Improvement
>Reporter: haosdent
>Assignee: haosdent
>Priority: Blocker
>  Labels: health-check, mesosphere
>
> To perform health checks for tasks, we need to enter the corresponding 
> namespaces of the container. For now, the health check uses a custom clone 
> to implement this:
> {code}
>   return process::defaultClone([=]() -> int {
>     if (taskPid.isSome()) {
>       foreach (const string& ns, namespaces) {
>         Try<Nothing> setns = ns::setns(taskPid.get(), ns);
>         if (setns.isError()) {
>           ...
>         }
>       }
>     }
>     return func();
>   });
> {code}
> After the childHooks patches are merged, we could change the health check to 
> use childHooks to call {{setns}} and make {{process::defaultClone}} private 
> again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6119) TCP health checks are not portable.

2016-10-06 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6119:
---
 Priority: Major  (was: Blocker)
Fix Version/s: (was: 1.1.0)

> TCP health checks are not portable.
> ---
>
> Key: MESOS-6119
> URL: https://issues.apache.org/jira/browse/MESOS-6119
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>  Labels: health-check, mesosphere
>
> MESOS-3567 introduced a dependency on "bash" for TCP health checks, which is 
> undesirable. We should implement a portable solution for TCP health checks.
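
One portable direction is a tiny compiled connect checker instead of the bash 
one-liner; a sketch under that assumption (this is not the actual Mesos 
implementation):

{code}
// Sketch: exit 0 iff a TCP connection to <ip> <port> succeeds.
#include <cstdio>
#include <cstdlib>
#include <cstring>

#include <arpa/inet.h>
#include <sys/socket.h>
#include <unistd.h>

int main(int argc, char** argv)
{
  if (argc != 3) {
    fprintf(stderr, "Usage: %s <ip> <port>\n", argv[0]);
    return EXIT_FAILURE;
  }

  sockaddr_in addr;
  memset(&addr, 0, sizeof(addr));
  addr.sin_family = AF_INET;
  addr.sin_port = htons(static_cast<uint16_t>(atoi(argv[2])));

  if (inet_pton(AF_INET, argv[1], &addr.sin_addr) != 1) {
    return EXIT_FAILURE;  // Unparsable IP address.
  }

  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) {
    return EXIT_FAILURE;
  }

  int result = connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  close(fd);

  return result == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
{code}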



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6157) ContainerInfo is not validated.

2016-10-06 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15552352#comment-15552352
 ] 

Alexander Rukletsov commented on MESOS-6157:


Apparently, {{ContainerInfo}} could also be set for non-container tasks, and 
it can also be interpreted as indicating which containerizer to use. I've 
reverted the validation; see https://reviews.apache.org/r/51865 for details.
{noformat}
Commit: f93f4fca57added6b0bff04a3e12699eaef13da9 [f93f4fc]
Parents: 001c55c306
Author: Alexander Rukletsov 
Date: 20 September 2016 at 14:41:15 GMT+2
Commit Date: 20 September 2016 at 16:58:19 GMT+2
Labels: alexr/container-additions-revert

Revert "Added validation for `ContainerInfo`."

This reverts commit e65f580bf0cbea64cedf521cf169b9b4c9f85454.
{noformat}

> ContainerInfo is not validated.
> ---
>
> Key: MESOS-6157
> URL: https://issues.apache.org/jira/browse/MESOS-6157
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>Priority: Blocker
>  Labels: containerizer, mesos-containerizer, mesosphere
> Fix For: 1.1.0
>
>
> Currently Mesos does not validate {{ContainerInfo}} provided with 
> {{TaskInfo}} or {{ExecutorInfo}}, hence invalid task configurations can be 
> accepted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6320) Implement clang-tidy check to catch incorrect flags hierarchies

2016-10-06 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-6320:
---

 Summary: Implement clang-tidy check to catch incorrect flags 
hierarchies
 Key: MESOS-6320
 URL: https://issues.apache.org/jira/browse/MESOS-6320
 Project: Mesos
  Issue Type: Bug
Reporter: Benjamin Bannier


Classes always need to use {{virtual}} inheritance when deriving from 
{{FlagsBase}}. Also, in order to compose such derived flags, they should be 
inherited virtually again.

Some examples:
{code}
struct A : virtual FlagsBase {}; // OK
struct B : FlagsBase {}; // ERROR
struct C : A {}; // ERROR
{code}


We should implement a clang-tidy check to catch such incorrect inheritance.
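
For context, the problem is the diamond that arises when composing flag 
structs; a small sketch with illustrative names:

{code}
// Sketch: with virtual inheritance, composed flag structs share one
// FlagsBase subobject, so all flags register in a single table.
struct LoggingFlags : virtual flags::FlagsBase
{
  LoggingFlags() { add(&LoggingFlags::logDir, "log_dir", "Log directory."); }
  Option<std::string> logDir;
};

struct MetricsFlags : virtual flags::FlagsBase
{
  MetricsFlags() { add(&MetricsFlags::port, "metrics_port", "Metrics port."); }
  Option<int> port;
};

// Both bases must be inherited virtually again; otherwise there would
// be two distinct FlagsBase subobjects (the diamond) and the flag
// table would be split.
struct AgentFlags : virtual LoggingFlags, virtual MetricsFlags {};
{code}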



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6100) Make fails compiling 1.0.1

2016-10-06 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15551993#comment-15551993
 ] 

Neil Conway commented on MESOS-6100:


[~klueska] -- seems I can't edit reviews that have already been marked as 
submitted...

> Make fails compiling 1.0.1 
> ---
>
> Key: MESOS-6100
> URL: https://issues.apache.org/jira/browse/MESOS-6100
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.1
> Environment: Alpine Linux  (Edge)
> GCC 6.1.1
>Reporter: Gennady Feldman
>Assignee: Kevin Klues
> Fix For: 1.1.0, 1.0.2
>
>
> linux/fs.cpp: In static member function 'static 
> Try 
> mesos::internal::fs::MountInfoTable::read(const Option&, bool)':
> linux/fs.cpp:152:27: error: 'rootParentId' may be used uninitialized in this 
> function [-Werror=maybe-uninitialized]
>  sortFrom(rootParentId);
>^
> cc1plus: all warnings being treated as errors
> P.S. This is something new since I am able to compile 1.0.0 just fine.
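
The usual shape of a fix for a {{-Werror=maybe-uninitialized}} diagnostic 
like this, sketched on simplified code (the actual fs.cpp change may differ):

{code}
// Sketch: GCC cannot prove `rootParentId` is assigned on every path,
// so give it a definite initial value before the conditional
// assignment, and guard the use.
int rootParentId = -1;

foreach (const MountInfoTable::Entry& entry, table.entries) {
  if (entry.target == "/") {
    rootParentId = entry.parent;
  }
}

if (rootParentId != -1) {
  sortFrom(rootParentId);
}
{code}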



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6308) CHECK failure in DRF sorter.

2016-10-06 Thread Benjamin Bannier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15551914#comment-15551914
 ] 

Benjamin Bannier commented on MESOS-6308:
-

Unrelated to the issue of an unexpected {{name}} value showing up: I am not 
sure we want a hard {{CHECK}} here. We should be perfectly capable of 
returning a sensible value even for an unknown {{name}}, e.g., a share of 
zero, and could just replace the {{CHECK}} with an early {{return 0}}.
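
A sketch of that suggestion (the function shape follows the sorter code 
linked below; the member names are assumptions):

{code}
// Sketch: turn the hard CHECK in the DRF share calculation into a
// graceful fallback for unknown clients.
double DRFSorter::calculateShare(const std::string& name)
{
  // Before: CHECK(allocations.contains(name)) crashed the process.
  if (!allocations.contains(name)) {
    return 0.0;  // An unknown client holds no allocation => zero share.
  }

  // ... existing dominant-share computation ...
}
{code}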

> CHECK failure in DRF sorter.
> 
>
> Key: MESOS-6308
> URL: https://issues.apache.org/jira/browse/MESOS-6308
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>Assignee: Guangya Liu
>
> Saw this CHECK fail in our internal CI:
> https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L450
> {noformat}
> [03:08:28] :   [Step 10/10] [ RUN  ] PartitionTest.DisconnectedFramework
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.200443   577 cluster.cpp:158] 
> Creating default 'local' authorizer
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.206408   577 leveldb.cpp:174] 
> Opened db in 5.827159ms
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208127   577 leveldb.cpp:181] 
> Compacted db in 1.697508ms
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208150   577 leveldb.cpp:196] 
> Created db iterator in 5756ns
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208160   577 leveldb.cpp:202] 
> Seeked to beginning of db in 1483ns
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208168   577 leveldb.cpp:271] 
> Iterated through 0 keys in the db in 1101ns
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208184   577 replica.cpp:776] 
> Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208452   591 recover.cpp:451] 
> Starting replica recovery
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208664   596 recover.cpp:477] 
> Replica is in EMPTY status
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209079   591 replica.cpp:673] 
> Replica in EMPTY status received a broadcasted recover request from 
> __req_res__(3666)@172.30.2.234:37300
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209203   593 recover.cpp:197] 
> Received a recover response from a replica in EMPTY status
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209394   598 recover.cpp:568] 
> Updating replica status to STARTING
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209473   598 master.cpp:380] 
> Master dd11d4ad-2087-4324-99ef-873e83ff09a1 (ip-172-30-2-234.mesosphere.io) 
> started on 172.30.2.234:37300
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209489   598 master.cpp:382] Flags 
> at startup: --acls="" --agent_ping_timeout="15secs" 
> --agent_reregister_timeout="10mins" --allocation_interval="1secs" 
> --allocator="HierarchicalDRF" --authenticate_agents="true" 
> --authenticate_frameworks="true" --authenticate_http_frameworks="true" 
> --authenticate_http_readonly="true" --authenticate_http_readwrite="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/7rr0oB/credentials" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/7rr0oB/master" 
> --zk_session_timeout="10secs"
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209692   598 master.cpp:432] 
> Master only allowing authenticated frameworks to register
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209699   598 master.cpp:446] 
> Master only allowing authenticated agents to register
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209704   598 master.cpp:459] 
> Master only allowing authenticated HTTP frameworks to register
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209709   598 credentials.hpp:37] 
> Loading credentials for authentication from '/tmp/7rr0oB/credentials'
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209810   598 master.cpp:504] Using 
> default 'crammd5' authenticator
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209853   598 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-readonly'
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209897   598 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 

[jira] [Updated] (MESOS-6319) ContentType/AgentAPITest.NestedContainerLaunch/1 is flaky

2016-10-06 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-6319:

Environment: ubuntu-14.04, autotools build, verbose build

> ContentType/AgentAPITest.NestedContainerLaunch/1 is flaky
> -
>
> Key: MESOS-6319
> URL: https://issues.apache.org/jira/browse/MESOS-6319
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, tests
>Affects Versions: 1.1.0
> Environment: ubuntu-14.04, autotools build, verbose build
>Reporter: Benjamin Bannier
>  Labels: flaky-test
> Attachments: build.log
>
>
> {{ContentType/AgentAPITest.NestedContainerLaunch/1}} is flaky; I saw this 
> fail in ASF CI (https://builds.apache.org/job/mesos-reviewbot/15545/):
> {code}
> ../../src/tests/api_tests.cpp:3552: Failure
> (wait).failure(): Unexpected response status 404 Not Found
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6319) ContentType/AgentAPITest.NestedContainerLaunch/1 is flaky

2016-10-06 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-6319:

Attachment: build.log

> ContentType/AgentAPITest.NestedContainerLaunch/1 is flaky
> -
>
> Key: MESOS-6319
> URL: https://issues.apache.org/jira/browse/MESOS-6319
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, tests
>Affects Versions: 1.1.0
>Reporter: Benjamin Bannier
>  Labels: flaky-test
> Attachments: build.log
>
>
> {{ContentType/AgentAPITest.NestedContainerLaunch/1}} is flaky; I saw this 
> fail in ASF CI (https://builds.apache.org/job/mesos-reviewbot/15545/):
> {code}
> ../../src/tests/api_tests.cpp:3552: Failure
> (wait).failure(): Unexpected response status 404 Not Found
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6319) ContentType/AgentAPITest.NestedContainerLaunch/1 is flaky

2016-10-06 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-6319:
---

 Summary: ContentType/AgentAPITest.NestedContainerLaunch/1 is flaky
 Key: MESOS-6319
 URL: https://issues.apache.org/jira/browse/MESOS-6319
 Project: Mesos
  Issue Type: Bug
  Components: containerization, tests
Affects Versions: 1.1.0
Reporter: Benjamin Bannier


{{ContentType/AgentAPITest.NestedContainerLaunch/1}} is flaky; I saw this fail 
in ASF CI (https://builds.apache.org/job/mesos-reviewbot/15545/):

{code}
../../src/tests/api_tests.cpp:3552: Failure
(wait).failure(): Unexpected response status 404 Not Found
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-6238) SSL / libevent support broken in IPv6 patch from https://github.com/lava/mesos/tree/bennoe/ipv6

2016-10-06 Thread Benno Evers (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-6238:
--

Assignee: Benno Evers

> SSL / libevent support broken in IPv6 patch from 
> https://github.com/lava/mesos/tree/bennoe/ipv6
> ---
>
> Key: MESOS-6238
> URL: https://issues.apache.org/jira/browse/MESOS-6238
> Project: Mesos
>  Issue Type: Bug
>Reporter: Lukas Loesche
>Assignee: Benno Evers
>
> Affects https://github.com/lava/mesos/tree/bennoe/ipv6 at commit 
> 2199a24c0b7a782a0381aad8cceacbc95ec3d5c9 
> make fails when the configure options --enable-ssl --enable-libevent are 
> given.
> Error message:
> {noformat}
> ...
> ...
> ../../../3rdparty/libprocess/src/process.cpp: In member function ‘void 
> process::SocketManager::link_connect(const process::Future&, 
> process::network::Socket, const process::UPID&)’:
> ../../../3rdparty/libprocess/src/process.cpp:1457:25: error: ‘url’ was not 
> declared in this scope
>Try ip = url.ip;
>  ^
> Makefile:997: recipe for target 'libprocess_la-process.lo' failed
> make[5]: *** [libprocess_la-process.lo] Error 1
> ...
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6318) Update Mesos version that appears in Getting Started webpage

2016-10-06 Thread Armand Grillet (JIRA)
Armand Grillet created MESOS-6318:
-

 Summary: Update Mesos version that appears in Getting Started 
webpage
 Key: MESOS-6318
 URL: https://issues.apache.org/jira/browse/MESOS-6318
 Project: Mesos
  Issue Type: Task
  Components: project website
Reporter: Armand Grillet
Priority: Minor


The first step in the [Getting Started 
guide|http://mesos.apache.org/gettingstarted/] is to download the latest stable 
release, but the version given in the snippet is 0.28.2. This problem does not 
concern 
[docs/getting-started.md|https://github.com/apache/mesos/blob/master/docs/getting-started.md].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6118) Agent would crash with docker container tasks due to host mount table read.

2016-10-06 Thread Kevin Klues (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15551191#comment-15551191
 ] 

Kevin Klues commented on MESOS-6118:


I've added two new patches to try and address this:
https://reviews.apache.org/r/52597/
https://reviews.apache.org/r/52596/

[~jamiebriant] [~bobrik] Could you please try things out with these patches and 
see if they fix your issues?

> Agent would crash with docker container tasks due to host mount table read.
> ---
>
> Key: MESOS-6118
> URL: https://issues.apache.org/jira/browse/MESOS-6118
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 1.0.1
> Environment: Build: 2016-08-26 23:06:27 by centos
> Version: 1.0.1
> Git tag: 1.0.1
> Git SHA: 3611eb0b7eea8d144e9b2e840e0ba16f2f659ee3
> systemd version `219` detected
> Inializing systemd state
> Created systemd slice: `/run/systemd/system/mesos_executors.slice`
> Started systemd slice `mesos_executors.slice`
> Using isolation: posix/cpu,posix/mem,filesystem/posix,network/cni
>  Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> Linux ip-10-254-192-40 3.10.0-327.28.3.el7.x86_64 #1 SMP Thu Aug 18 19:05:49 
> UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Jamie Briant
>Assignee: Kevin Klues
>Priority: Critical
>  Labels: linux, slave
> Attachments: crashlogfull.log, cycle2.log, cycle3.log, cycle5.log, 
> cycle6.log, slave-crash.log
>
>
> I have a framework which schedules thousands of short-running tasks (a few 
> seconds to a few minutes each) over a period of several minutes. In 1.0.1, 
> the slave process will crash every few minutes (with systemd restarting it).
> Crash is:
> Sep 01 20:52:23 ip-10-254-192-99 mesos-slave: F0901 20:52:23.905678  1232 
> fs.cpp:140] Check failed: !visitedParents.contains(parentId)
> Sep 01 20:52:23 ip-10-254-192-99 mesos-slave: *** Check failure stack trace: 
> ***
> Version 1.0.0 works without this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)