[jira] [Created] (MESOS-6324) CNI should not use `ifconfig` in executors `pre_exec_command`
Avinash Sridharan created MESOS-6324: Summary: CNI should not use `ifconfig` in executors `pre_exec_command` Key: MESOS-6324 URL: https://issues.apache.org/jira/browse/MESOS-6324 Project: Mesos Issue Type: Bug Components: containerization Reporter: Avinash Sridharan Assignee: Avinash Sridharan Currently the `network/cni` isolator sets up the `pre_exec_command` for executors when a container needs to be launched on a non-host network. The `pre_exec_command` is `ifconfig lo up`. This is done primarily to bring the loopback interface up in the new network namespace. Setting up the `pre_exec_command` to bring loopback up is problematic since the executor's PATH variable is generally very limited (it doesn't contain all the paths that the agent's PATH variable has, due to security concerns). Therefore, instead of running `ifconfig lo up` in the `pre_exec_command`, we should run it in the `NetworkCniIsolatorSetup` subcommand, which runs with the same PATH variable as the agent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6118) Agent would crash with docker container tasks due to host mount table read.
[ https://issues.apache.org/jira/browse/MESOS-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Klues updated MESOS-6118: --- Priority: Blocker (was: Critical) > Agent would crash with docker container tasks due to host mount table read. > --- > > Key: MESOS-6118 > URL: https://issues.apache.org/jira/browse/MESOS-6118 > Project: Mesos > Issue Type: Bug > Components: slave >Affects Versions: 1.0.1 > Environment: Build: 2016-08-26 23:06:27 by centos > Version: 1.0.1 > Git tag: 1.0.1 > Git SHA: 3611eb0b7eea8d144e9b2e840e0ba16f2f659ee3 > systemd version `219` detected > Inializing systemd state > Created systemd slice: `/run/systemd/system/mesos_executors.slice` > Started systemd slice `mesos_executors.slice` > Using isolation: posix/cpu,posix/mem,filesystem/posix,network/cni > Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher > Linux ip-10-254-192-40 3.10.0-327.28.3.el7.x86_64 #1 SMP Thu Aug 18 19:05:49 > UTC 2016 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Jamie Briant >Assignee: Kevin Klues >Priority: Blocker > Labels: linux, slave > Attachments: crashlogfull.log, cycle2.log, cycle3.log, cycle5.log, > cycle6.log, slave-crash.log > > > I have a framework which schedules thousands of short-running tasks (a few seconds > to a few minutes each) over a period of several minutes. In 1.0.1, the > slave process will crash every few minutes (with systemd restarting it). > The crash is: > Sep 01 20:52:23 ip-10-254-192-99 mesos-slave: F0901 20:52:23.905678 1232 > fs.cpp:140] Check failed: !visitedParents.contains(parentId) > Sep 01 20:52:23 ip-10-254-192-99 mesos-slave: *** Check failure stack trace: > *** > Version 1.0.0 works without this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6323) 'mesos-containerizer launch' should inherit agent environment variables.
Jie Yu created MESOS-6323: - Summary: 'mesos-containerizer launch' should inherit agent environment variables. Key: MESOS-6323 URL: https://issues.apache.org/jira/browse/MESOS-6323 Project: Mesos Issue Type: Bug Reporter: Jie Yu Priority: Critical If some dynamic libraries that the agent depends on are stored in a non-standard location and the operator starts the agent using LD_LIBRARY_PATH, then when we fork/exec the 'mesos-containerizer launch' helper we need to make sure it inherits the agent's environment variables. Otherwise, it might throw linking errors. This makes sense because it is a Mesos-controlled process. However, when the helper actually fork/execs the user container (or executor), we need to make sure to strip the agent's environment variables. The tricky case is the default executor and the command executor. These two are controlled by Mesos as well, so we also want them to have the agent's environment variables. We need to somehow distinguish this from the custom executor case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6322) Agent fails to kill empty parent container
[ https://issues.apache.org/jira/browse/MESOS-6322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15553587#comment-15553587 ] Anand Mazumdar commented on MESOS-6322: --- hmm, looks like we need similar logic that we had introduced for MESOS-5380 to guard against these cases. A bit surprised that we did not add the logic to the {{subscribe}} handler on the agent for HTTP based executors but only added it for driver based executors (https://reviews.apache.org/r/47381). > Agent fails to kill empty parent container > -- > > Key: MESOS-6322 > URL: https://issues.apache.org/jira/browse/MESOS-6322 > Project: Mesos > Issue Type: Bug >Reporter: Greg Mann >Assignee: Anand Mazumdar >Priority: Blocker > Labels: mesosphere > > I launched a pod using Marathon, which led to the launching of a task group > on a Mesos agent. The pod spec was flawed, so this led to Marathon repeatedly > re-launching multiple instances of the task group. After this went on for a > few minutes, I told Marathon to scale the app to 0 instances, which sends > {{TASK_KILLED}} for one task in each task group. After this, the Mesos agent > reports {{TASK_KILLED}} status updates for all 3 tasks in the pod, but > hitting the {{/containers}} endpoint on the agent reveals that the executor > container for this task group is still running. 
> Here is the task group launching on the agent: > {code} > slave.cpp:1696] Launching task group containing tasks [ > test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1, > test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask2, > test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.clientTask ] for > framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- > {code} > and here is the executor container starting: > {code} > mesos-agent[2994]: I1006 20:23:27.407563 3094 containerizer.cpp:965] > Starting container bf38ff09-3da1-487a-8926-1f4cc88bce32 for executor > 'instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601' of framework > 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- > {code} > and here is the output showing the {{TASK_KILLED}} updates for one task group: > {code} > mesos-agent[2994]: I1006 20:23:28.728224 3097 slave.cpp:2283] Asked to kill > task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1 of > framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- > mesos-agent[2994]: W1006 20:23:28.728304 3097 slave.cpp:2364] Transitioning > the state of task > test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1 of > framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- to TASK_KILLED because > the executor is not registered > mesos-agent[2994]: I1006 20:23:28.728659 3097 slave.cpp:3609] Handling > status update TASK_KILLED (UUID: 1cb8197a-7829-4a05-9cb1-14ba97519c42) for > task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1 of > framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- from @0.0.0.0:0 > mesos-agent[2994]: I1006 20:23:28.728817 3097 slave.cpp:3609] Handling > status update TASK_KILLED (UUID: e377e9fb-6466-4ce5-b32a-37d840b9f87c) for > task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask2 of > framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- from @0.0.0.0:0 > mesos-agent[2994]: I1006 20:23:28.728912 3097 slave.cpp:3609] Handling > status update TASK_KILLED (UUID: 
24d44b25-ea52-43a1-afdb-6c04389879d2) for > task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.clientTask of > framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- from @0.0.0.0:0 > {code} > however, if we grep the log for the executor's ID, the last line mentioning > it is: > {code} > slave.cpp:3080] Creating a marker file for HTTP based executor > 'instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601' of framework > 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- (via HTTP) at path > '/var/lib/mesos/slave/meta/slaves/42838ca8-8d6b-475b-9b3b-59f3cd0d6834-S0/frameworks/42838ca8-8d6b-475b-9b3b-59f3cd0d6834-/executors/instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601/runs/bf38ff09-3da1-487a-8926-1f4cc88bce32/http.marker' > {code} > so it seems the executor never exited. If we hit the agent's {{/containers}} > endpoint, we get output which includes this executor container: > {code} > { > "container_id": "bf38ff09-3da1-487a-8926-1f4cc88bce32", > "executor_id": "instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601", > "executor_name": "", > "framework_id": "42838ca8-8d6b-475b-9b3b-59f3cd0d6834-", > "source": "", > "statistics": { > "cpus_limit": 0.1, > "cpus_nr_periods": 17, > "cpus_nr_throttled": 11, > "cpus_system_time_secs": 0.02, > "cpus_throttled_time_secs": 0.784142448, > "cpus_user_time_secs": 0.09, > "disk_limit_bytes":
[jira] [Created] (MESOS-6322) Agent fails to kill empty parent container
Greg Mann created MESOS-6322: Summary: Agent fails to kill empty parent container Key: MESOS-6322 URL: https://issues.apache.org/jira/browse/MESOS-6322 Project: Mesos Issue Type: Bug Reporter: Greg Mann Assignee: Anand Mazumdar Priority: Blocker I launched a pod using Marathon, which led to the launching of a task group on a Mesos agent. The pod spec was flawed, so this led to Marathon repeatedly re-launching multiple instances of the task group. After this went on for a few minutes, I told Marathon to scale the app to 0 instances, which sends {{TASK_KILLED}} for one task in each task group. After this, the Mesos agent reports {{TASK_KILLED}} status updates for all 3 tasks in the pod, but hitting the {{/containers}} endpoint on the agent reveals that the executor container for this task group is still running. Here is the task group launching on the agent: {code} slave.cpp:1696] Launching task group containing tasks [ test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1, test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask2, test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.clientTask ] for framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- {code} and here is the executor container starting: {code} mesos-agent[2994]: I1006 20:23:27.407563 3094 containerizer.cpp:965] Starting container bf38ff09-3da1-487a-8926-1f4cc88bce32 for executor 'instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601' of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- {code} and here is the output showing the {{TASK_KILLED}} updates for one task group: {code} mesos-agent[2994]: I1006 20:23:28.728224 3097 slave.cpp:2283] Asked to kill task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1 of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- mesos-agent[2994]: W1006 20:23:28.728304 3097 slave.cpp:2364] Transitioning the state of task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1 of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- to 
TASK_KILLED because the executor is not registered mesos-agent[2994]: I1006 20:23:28.728659 3097 slave.cpp:3609] Handling status update TASK_KILLED (UUID: 1cb8197a-7829-4a05-9cb1-14ba97519c42) for task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask1 of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- from @0.0.0.0:0 mesos-agent[2994]: I1006 20:23:28.728817 3097 slave.cpp:3609] Handling status update TASK_KILLED (UUID: e377e9fb-6466-4ce5-b32a-37d840b9f87c) for task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.healthTask2 of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- from @0.0.0.0:0 mesos-agent[2994]: I1006 20:23:28.728912 3097 slave.cpp:3609] Handling status update TASK_KILLED (UUID: 24d44b25-ea52-43a1-afdb-6c04389879d2) for task test-pod.instance-bd0f7a5b-8c02-11e6-ad52-6eec1b96a601.clientTask of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- from @0.0.0.0:0 {code} however, if we grep the log for the executor's ID, the last line mentioning it is: {code} slave.cpp:3080] Creating a marker file for HTTP based executor 'instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601' of framework 42838ca8-8d6b-475b-9b3b-59f3cd0d6834- (via HTTP) at path '/var/lib/mesos/slave/meta/slaves/42838ca8-8d6b-475b-9b3b-59f3cd0d6834-S0/frameworks/42838ca8-8d6b-475b-9b3b-59f3cd0d6834-/executors/instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601/runs/bf38ff09-3da1-487a-8926-1f4cc88bce32/http.marker' {code} so it seems the executor never exited. 
If we hit the agent's {{/containers}} endpoint, we get output which includes this executor container: {code} { "container_id": "bf38ff09-3da1-487a-8926-1f4cc88bce32", "executor_id": "instance-test-pod.bd0f7a5b-8c02-11e6-ad52-6eec1b96a601", "executor_name": "", "framework_id": "42838ca8-8d6b-475b-9b3b-59f3cd0d6834-", "source": "", "statistics": { "cpus_limit": 0.1, "cpus_nr_periods": 17, "cpus_nr_throttled": 11, "cpus_system_time_secs": 0.02, "cpus_throttled_time_secs": 0.784142448, "cpus_user_time_secs": 0.09, "disk_limit_bytes": 10485760, "disk_used_bytes": 20480, "mem_anon_bytes": 11337728, "mem_cache_bytes": 0, "mem_critical_pressure_counter": 0, "mem_file_bytes": 0, "mem_limit_bytes": 33554432, "mem_low_pressure_counter": 0, "mem_mapped_file_bytes": 0, "mem_medium_pressure_counter": 0, "mem_rss_bytes": 11337728, "mem_swap_bytes": 0, "mem_total_bytes": 12013568, "mem_unevictable_bytes": 0, "timestamp": 1475792290.12373 }, "status": { "executor_pid": 9068, "network_infos": [ { "ip_addresses": [ { "ip_address": "9.0.1.34", "protocol": "IPv4"
[jira] [Updated] (MESOS-6031) Collect throttle related metrics for DockerContainerizer.
[ https://issues.apache.org/jira/browse/MESOS-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li updated MESOS-6031: - Target Version/s: 1.1.0 Fix Version/s: (was: 1.1.0) > Collect throttle related metrics for DockerContainerizer. > - > > Key: MESOS-6031 > URL: https://issues.apache.org/jira/browse/MESOS-6031 > Project: Mesos > Issue Type: Bug > Components: containerization, docker >Affects Versions: 1.0.0 >Reporter: Zhitao Li >Assignee: Zhitao Li > Labels: containerizer, docker > > MESOS-2154 added support for porting the CFS quota to the Docker containerizer, > but the metric collection part is still missing. > We can collect the related metrics in the Docker containerizer in a similar > fashion to cgroups/cpushare.cpp. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4948) Move maintenance tests to use the new scheduler library interface.
[ https://issues.apache.org/jira/browse/MESOS-4948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15553459#comment-15553459 ] Ilya Pronin commented on MESOS-4948: Review request: https://reviews.apache.org/r/52620/ > Move maintenance tests to use the new scheduler library interface. > -- > > Key: MESOS-4948 > URL: https://issues.apache.org/jira/browse/MESOS-4948 > Project: Mesos > Issue Type: Bug > Components: tests > Environment: Ubuntu 14.04, using gcc, with libevent and SSL enabled > (on ASF CI) >Reporter: Greg Mann >Assignee: Ilya Pronin > Labels: flaky-test, maintenance, mesosphere, newbie > > We need to move the existing maintenance tests to use the new scheduler > interface. We have already moved 1 test, > {{MasterMaintenanceTest.PendingUnavailabilityTest}}, to use the new interface. > It would be good to move the other 2 remaining tests to the new interface, > since the old one can lead to failures around the stack object being referenced > after it has already been destroyed. Detailed log from an ASF CI build failure:
> {code} > [ RUN ] MasterMaintenanceTest.InverseOffers > I0315 04:16:50.786032 2681 leveldb.cpp:174] Opened db in 125.361171ms > I0315 04:16:50.836374 2681 leveldb.cpp:181] Compacted db in 50.254411ms > I0315 04:16:50.836470 2681 leveldb.cpp:196] Created db iterator in 25917ns > I0315 04:16:50.836488 2681 leveldb.cpp:202] Seeked to beginning of db in > 3291ns > I0315 04:16:50.836498 2681 leveldb.cpp:271] Iterated through 0 keys in the > db in 253ns > I0315 04:16:50.836549 2681 replica.cpp:779] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0315 04:16:50.837474 2702 recover.cpp:447] Starting replica recovery > I0315 04:16:50.837565 2681 cluster.cpp:183] Creating default 'local' > authorizer > I0315 04:16:50.838191 2702 recover.cpp:473] Replica is in EMPTY status > I0315 04:16:50.839532 2704 replica.cpp:673] Replica in EMPTY status received > a broadcasted recover request from (4784)@172.17.0.4:39845 > I0315 04:16:50.839754 2705 recover.cpp:193] Received a recover response from > a replica in EMPTY status > I0315 04:16:50.841893 2704 recover.cpp:564] Updating replica status to > STARTING > I0315 04:16:50.842566 2703 master.cpp:376] Master > c326bc68-2581-48d4-9dc4-0d6f270bdda1 (01fcd642f65f) started on > 172.17.0.4:39845 > I0315 04:16:50.842644 2703 master.cpp:378] Flags at startup: --acls="" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate="false" --authenticate_http="true" > --authenticate_slaves="true" --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/DE2Uaw/credentials" --framework_sorter="drf" > --help="false" --hostname_lookup="true" --http_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" > --max_completed_tasks_per_framework="1000" --max_slave_ping_timeouts="5" > --quiet="false" --recovery_slave_removal_limit="100%" > --registry="replicated_log" 
--registry_fetch_timeout="1mins" > --registry_store_timeout="100secs" --registry_strict="true" > --root_submissions="true" --slave_ping_timeout="15secs" > --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" > --webui_dir="/mesos/mesos-0.29.0/_inst/share/mesos/webui" > --work_dir="/tmp/DE2Uaw/master" --zk_session_timeout="10secs" > I0315 04:16:50.843168 2703 master.cpp:425] Master allowing unauthenticated > frameworks to register > I0315 04:16:50.843227 2703 master.cpp:428] Master only allowing > authenticated slaves to register > I0315 04:16:50.843302 2703 credentials.hpp:35] Loading credentials for > authentication from '/tmp/DE2Uaw/credentials' > I0315 04:16:50.843737 2703 master.cpp:468] Using default 'crammd5' > authenticator > I0315 04:16:50.843969 2703 master.cpp:537] Using default 'basic' HTTP > authenticator > I0315 04:16:50.844177 2703 master.cpp:571] Authorization enabled > I0315 04:16:50.844360 2708 hierarchical.cpp:144] Initialized hierarchical > allocator process > I0315 04:16:50.844430 2708 whitelist_watcher.cpp:77] No whitelist given > I0315 04:16:50.848227 2703 master.cpp:1806] The newly elected leader is > master@172.17.0.4:39845 with id c326bc68-2581-48d4-9dc4-0d6f270bdda1 > I0315 04:16:50.848269 2703 master.cpp:1819] Elected as the leading master! > I0315 04:16:50.848292 2703 master.cpp:1508] Recovering from registrar > I0315 04:16:50.848563 2703 registrar.cpp:307] Recovering registrar > I0315 04:16:50.876277 2711 leveldb.cpp:304] Persisting metadata (8 bytes) to > leveldb took 34.178445ms > I0315 04:16:50.876365 2711 replica.cpp:320] Persisted replica status to > STARTING > I0315 04:16:50.876776 2711
[jira] [Comment Edited] (MESOS-6223) Allow agents to re-register post a host reboot
[ https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15553432#comment-15553432 ] Yan Xu edited comment on MESOS-6223 at 10/6/16 10:48 PM: - [~neilc] [~vinodkone] I can think of ways we can implement restarting tasks post-reboot (MESOS-3545, will have design doc out soon) via either the approach in this ticket or in MESOS-5368, but this one feels simpler. Reboot as a special case sounds to me like an optimization which will no longer hold true with tasks being restarted. Then the question is 1) Should the agent ID *always* change after a reboot? 2) Does the agent ID *ever have to* change when its {{work_dir}} hasn't changed? 1) Sounds like no. For 2), on the master the only error case where we disallow an agent from reregistering but do allow it to register is [when the agent's ip or hostname has changed|https://github.com/apache/mesos/blob/3902b051f2cff59c55535dae08ebd4223833b0a0/src/master/master.cpp#L5228] (hostname change already prevents the agent from restarting). I can imagine we'd want to force the agent to get rid of its {{work_dir//slave_id}} but keep the checkpointed resources etc.? To summarize, seems like we can keep both this ticket and MESOS-5368, but change MESOS-5368 to not change the session ID in the reboot case? Thoughts? was (Author: xujyan): [~neilc] [~vinodkone] I can think of ways we can implement restarting tasks post-reboot (MESOS-3545, will have design doc out soon) via either the approach in this ticket or in MESOS-5368 but this one feels simpler. Reboot as a special case sounds to me an optimization which will no longer hold true with tasks being restarted. Then the question is 1) Should the agent ID *always* change after a reboot? 2) Does the agent ID *ever has to* change when its {{work_dir}} hasn't changed? 1) Sounds like no.
For 2), on the master the only error case where we disallow an agent from reregistering but do allow it to register is [when the agent's ip or hostname has changed|https://github.com/apache/mesos/blob/3902b051f2cff59c55535dae08ebd4223833b0a0/src/master/master.cpp#L5228]. I can imagine we'd want to force the agent to get rid of its {{work_dir//slave_id}} but keep the checkpointed resources etc.? To summarize, seems like we can keep both this ticket and MESOS-5368, but change MESOS-5368 to not change the session ID in the reboot case? Thoughts? > Allow agents to re-register post a host reboot > -- > > Key: MESOS-6223 > URL: https://issues.apache.org/jira/browse/MESOS-6223 > Project: Mesos > Issue Type: Improvement > Components: slave >Reporter: Megha > > The agent doesn’t recover its state post a host reboot; it registers with the > master and gets a new SlaveID. With partition awareness, the agents are now > allowed to re-register after they have been marked Unreachable. The executors > are anyway terminated on the agent when it reboots, so there is no harm in > letting the agent keep its SlaveID, re-register with the master, and reconcile > the lost executors. This is a pre-requisite for supporting > persistent/restartable tasks in mesos (MESOS-3545). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot
[ https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15553432#comment-15553432 ] Yan Xu commented on MESOS-6223: --- [~neilc] [~vinodkone] I can think of ways we can implement restarting tasks post-reboot (MESOS-3545, will have design doc out soon) via either the approach in this ticket or in MESOS-5368, but this one feels simpler. Reboot as a special case sounds to me like an optimization which will no longer hold true with tasks being restarted. Then the question is 1) Should the agent ID *always* change after a reboot? 2) Does the agent ID *ever have to* change when its {{work_dir}} hasn't changed? 1) Sounds like no. For 2), on the master the only error case where we disallow an agent from reregistering but do allow it to register is [when the agent's ip or hostname has changed|https://github.com/apache/mesos/blob/3902b051f2cff59c55535dae08ebd4223833b0a0/src/master/master.cpp#L5228]. I can imagine we'd want to force the agent to get rid of its {{work_dir//slave_id}} but keep the checkpointed resources etc.? To summarize, seems like we can keep both this ticket and MESOS-5368, but change MESOS-5368 to not change the session ID in the reboot case? Thoughts? > Allow agents to re-register post a host reboot > -- > > Key: MESOS-6223 > URL: https://issues.apache.org/jira/browse/MESOS-6223 > Project: Mesos > Issue Type: Improvement > Components: slave >Reporter: Megha > > The agent doesn’t recover its state post a host reboot; it registers with the > master and gets a new SlaveID. With partition awareness, the agents are now > allowed to re-register after they have been marked Unreachable. The executors > are anyway terminated on the agent when it reboots, so there is no harm in > letting the agent keep its SlaveID, re-register with the master, and reconcile > the lost executors. This is a pre-requisite for supporting > persistent/restartable tasks in mesos (MESOS-3545).
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6228) Add timeout to /metrics/snapshot calls in tests
[ https://issues.apache.org/jira/browse/MESOS-6228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15553271#comment-15553271 ] Neil Conway commented on MESOS-6228: If we added a request timeout, the HTTP request would return successfully if fetching any metric times out. Not clear that this is actually better behavior. In this situation, we would {{VLOG(1)}} which metric has timed out; we could perhaps increase the verbosity of that error message and then enable the request timeout. > Add timeout to /metrics/snapshot calls in tests > --- > > Key: MESOS-6228 > URL: https://issues.apache.org/jira/browse/MESOS-6228 > Project: Mesos > Issue Type: Bug > Components: tests >Reporter: Neil Conway > Labels: mesosphere, newbie++ > > In the unit tests, {{Metrics()}} does an {{http::get}} of the > {{/metrics/snapshot}} endpoint. No {{timeout}} parameter is provided. That > means if any metric cannot be fetched, the request hangs for 15 seconds and > then dies with a mysterious / unclear error message. Digging into which > metric has hung and for what reason requires a lot of time / debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6228) Add timeout to /metrics/snapshot calls in tests
[ https://issues.apache.org/jira/browse/MESOS-6228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neil Conway updated MESOS-6228: --- Description: In the unit tests, {{Metrics()}} does an {{http::get}} of the {{/metrics/snapshot}} endpoint. No {{timeout}} parameter is provided. That means if any metric cannot be fetched, the request hangs for 15 seconds and then dies with a mysterious / unclear error message. Digging into which metric has hung and for what reason requires a lot of time / debugging. (was: In the unit tests, {{Metrics()}} does an {{http::get}} of the {{/metrics/snapshot}} endpoint. No {{timeout}} parameter is provided. That means if any metric cannot be fetched, the request hangs for 15 seconds and then dies with a mysterious / unclear error message. Digging into which metric has hung and for what reason requires a lot of time / debugging. Instead, we should specify a reasonable timeout (e.g., 15 seconds) and fail the test if the timeout fires.) > Add timeout to /metrics/snapshot calls in tests > --- > > Key: MESOS-6228 > URL: https://issues.apache.org/jira/browse/MESOS-6228 > Project: Mesos > Issue Type: Bug > Components: tests >Reporter: Neil Conway > Labels: mesosphere, newbie++ > > In the unit tests, {{Metrics()}} does an {{http::get}} of the > {{/metrics/snapshot}} endpoint. No {{timeout}} parameter is provided. That > means if any metric cannot be fetched, the request hangs for 15 seconds and > then dies with a mysterious / unclear error message. Digging into which > metric has hung and for what reason requires a lot of time / debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6228) Add timeout to /metrics/snapshot calls in tests
[ https://issues.apache.org/jira/browse/MESOS-6228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neil Conway updated MESOS-6228: --- Description: In the unit tests, {{Metrics()}} does an {{http::get}} of the {{/metrics/snapshot}} endpoint. No {{timeout}} parameter is provided. That means if any metric cannot be fetched, the request hangs for 15 seconds and then dies with a mysterious / unclear error message. Digging into which metric has hung and for what reason requires a lot of time / debugging. Instead, we should specify a reasonable timeout (e.g., 15 seconds) and fail the test if the timeout fires. was: In the unit tests, {{Metrics()}} does an {{http::get}} of the {{/metrics/snapshot}} endpoint. No {{timeout}} parameter is provided. That means if any metric cannot be fetched, the request hangs forever. Instead, we should specify a reasonable timeout (e.g., 15 seconds) and fail the test if the timeout fires. > Add timeout to /metrics/snapshot calls in tests > --- > > Key: MESOS-6228 > URL: https://issues.apache.org/jira/browse/MESOS-6228 > Project: Mesos > Issue Type: Bug > Components: tests >Reporter: Neil Conway > Labels: mesosphere, newbie++ > > In the unit tests, {{Metrics()}} does an {{http::get}} of the > {{/metrics/snapshot}} endpoint. No {{timeout}} parameter is provided. That > means if any metric cannot be fetched, the request hangs for 15 seconds and > then dies with a mysterious / unclear error message. Digging into which > metric has hung and for what reason requires a lot of time / debugging. > Instead, we should specify a reasonable timeout (e.g., 15 seconds) and fail > the test if the timeout fires. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5368) Consider introducing persistent agent ID
[ https://issues.apache.org/jira/browse/MESOS-5368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15553204#comment-15553204 ] Yan Xu commented on MESOS-5368: --- [~neilc] In an alternative approach, would we achieve the same thing if we changed the semantics so that the agent *only* changes its ID when we permanently remove (decommission) an agent ({{work_dir}})? > Consider introducing persistent agent ID > > > Key: MESOS-5368 > URL: https://issues.apache.org/jira/browse/MESOS-5368 > Project: Mesos > Issue Type: Improvement >Reporter: Neil Conway >Assignee: Abhishek Dasgupta > Labels: mesosphere > > Currently, agent IDs identify a single "session" by an agent: that is, an > agent receives an agent ID when it registers with the master; it reuses that > agent ID if it disconnects and successfully reregisters; if the agent shuts > down and restarts, it registers anew and receives a new agent ID. > It would be convenient to have a "persistent agent ID" that remains the same > for the duration of a given agent {{work_dir}}. This would mean that a given > persistent volume would not migrate between different persistent agent IDs > over time, for example (see MESOS-4894). If we supported permanently removing > an agent from the cluster (i.e., the {{work_dir}} and any volumes used by the > agent will never be reused), we could use the persistent agent ID to report > which agent has been removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6321) CHECK failure in HierarchicalAllocatorTest.NoDoubleAccounting
Neil Conway created MESOS-6321: -- Summary: CHECK failure in HierarchicalAllocatorTest.NoDoubleAccounting Key: MESOS-6321 URL: https://issues.apache.org/jira/browse/MESOS-6321 Project: Mesos Issue Type: Bug Reporter: Neil Conway Assignee: Alexander Rukletsov Observed in internal CI: {noformat} [15:52:21] : [Step 10/10] [ RUN ] HierarchicalAllocatorTest.NoDoubleAccounting [15:52:21]W: [Step 10/10] I1006 15:52:21.813817 23713 hierarchical.cpp:275] Added framework framework1 [15:52:21]W: [Step 10/10] I1006 15:52:21.814100 23713 hierarchical.cpp:1694] No allocations performed [15:52:21]W: [Step 10/10] I1006 15:52:21.814102 23712 process.cpp:3377] Handling HTTP event for process 'metrics' with path: '/metrics/snapshot' [15:52:21]W: [Step 10/10] I1006 15:52:21.814121 23713 hierarchical.cpp:1789] No inverse offers to send out! [15:52:21]W: [Step 10/10] I1006 15:52:21.814146 23713 hierarchical.cpp:1286] Performed allocation for 0 agents in 52445ns [15:52:21]W: [Step 10/10] I1006 15:52:21.814206 23713 hierarchical.cpp:485] Added agent agent1 (agent1) with cpus(*):1 (allocated: cpus(*):1) [15:52:21]W: [Step 10/10] I1006 15:52:21.814237 23713 hierarchical.cpp:1694] No allocations performed [15:52:21]W: [Step 10/10] I1006 15:52:21.814247 23713 hierarchical.cpp:1789] No inverse offers to send out! [15:52:21]W: [Step 10/10] I1006 15:52:21.814259 23713 hierarchical.cpp:1309] Performed allocation for agent agent1 in 33887ns [15:52:21]W: [Step 10/10] I1006 15:52:21.814294 23713 hierarchical.cpp:485] Added agent agent2 (agent2) with cpus(*):1 (allocated: cpus(*):1) [15:52:21]W: [Step 10/10] I1006 15:52:21.814332 23713 hierarchical.cpp:1694] No allocations performed [15:52:21]W: [Step 10/10] I1006 15:52:21.814342 23713 hierarchical.cpp:1789] No inverse offers to send out! 
[15:52:21]W: [Step 10/10] I1006 15:52:21.814349 23713 hierarchical.cpp:1309] Performed allocation for agent agent2 in 42682ns [15:52:21]W: [Step 10/10] I1006 15:52:21.814417 23713 hierarchical.cpp:275] Added framework framework2 [15:52:21]W: [Step 10/10] I1006 15:52:21.814445 23713 hierarchical.cpp:1694] No allocations performed [15:52:21]W: [Step 10/10] I1006 15:52:21.814455 23713 hierarchical.cpp:1789] No inverse offers to send out! [15:52:21]W: [Step 10/10] I1006 15:52:21.814469 23713 hierarchical.cpp:1286] Performed allocation for 2 agents in 37976ns [15:52:21]W: [Step 10/10] F1006 15:52:21.824954 23692 json.hpp:334] Check failed: 'boost::get(this)' Must be non NULL [15:52:21]W: [Step 10/10] *** Check failure stack trace: *** [15:52:21]W: [Step 10/10] @ 0x7fe953bbd71d google::LogMessage::Fail() [15:52:21]W: [Step 10/10] @ 0x7fe953bbf55d google::LogMessage::SendToLog() [15:52:21]W: [Step 10/10] @ 0x7fe953bbd30c google::LogMessage::Flush() [15:52:21]W: [Step 10/10] @ 0x7fe953bbfe59 google::LogMessageFatal::~LogMessageFatal() [15:52:21]W: [Step 10/10] @ 0x7cc903 JSON::Value::as<>() [15:52:21]W: [Step 10/10] @ 0x8b633c mesos::internal::tests::HierarchicalAllocatorTest_NoDoubleAccounting_Test::TestBody() [15:52:21]W: [Step 10/10] @ 0x129ce23 testing::internal::HandleExceptionsInMethodIfSupported<>() [15:52:21]W: [Step 10/10] @ 0x1292f07 testing::Test::Run() [15:52:21]W: [Step 10/10] @ 0x1292fae testing::TestInfo::Run() [15:52:21]W: [Step 10/10] @ 0x12930b5 testing::TestCase::Run() [15:52:21]W: [Step 10/10] @ 0x1293368 testing::internal::UnitTestImpl::RunAllTests() [15:52:21]W: [Step 10/10] @ 0x1293624 testing::UnitTest::Run() [15:52:21]W: [Step 10/10] @ 0x507254 main [15:52:21]W: [Step 10/10] @ 0x7fe95122876d (unknown) [15:52:21]W: [Step 10/10] @ 0x51e341 (unknown) [15:52:21]W: [Step 10/10] Aborted (core dumped) [15:52:21]W: [Step 10/10] Process exited with code 134 {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
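The fatal line in the trace is {{json.hpp:334] Check failed: 'boost::get(this)' Must be non NULL}} inside {{JSON::Value::as<>()}}: the test read a field from the {{/metrics/snapshot}} response as a type the JSON value did not actually hold, so the CHECK on the null {{boost::get}} result aborted. A toy C++17 model of that failure mode, with {{std::variant}} standing in for the boost variant stout uses ({{Value}} and {{tryAs}} are hypothetical names, not stout's API):

```cpp
#include <string>
#include <variant>

// A JSON value that is currently either a number or a string.
using Value = std::variant<double, std::string>;

// Non-aborting analogue of JSON::Value::as<T>(): as<T>() dereferences the
// result of boost::get<T>(this) under a CHECK, so a type mismatch aborts;
// std::get_if returns nullptr instead, letting the caller branch safely.
template <typename T>
const T* tryAs(const Value& v) {
  return std::get_if<T>(&v);
}
```

Test code that probes metrics could check for {{nullptr}} (or, with stout, call {{is<T>()}} before {{as<T>()}}) rather than crashing the whole test binary on an absent or differently-typed metric.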
[jira] [Updated] (MESOS-6319) ContentType/AgentAPITest.NestedContainerLaunch/1 is flaky
[ https://issues.apache.org/jira/browse/MESOS-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-6319: --- Assignee: Benjamin Mahler Sprint: Mesosphere Sprint 44 > ContentType/AgentAPITest.NestedContainerLaunch/1 is flaky > - > > Key: MESOS-6319 > URL: https://issues.apache.org/jira/browse/MESOS-6319 > Project: Mesos > Issue Type: Bug > Components: containerization, tests >Affects Versions: 1.1.0 > Environment: ubuntu-14.04, autotools build, verbose build >Reporter: Benjamin Bannier >Assignee: Benjamin Mahler > Labels: flaky-test > Attachments: build.log > > > {{ContentType/AgentAPITest.NestedContainerLaunch/1}} is flaky, saw this fail > in ASF CI (https://builds.apache.org/job/mesos-reviewbot/15545/) > {code} > ../../src/tests/api_tests.cpp:3552: Failure > (wait).failure(): Unexpected response status 404 Not Found > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-6304) Add authentication support to the default executor
[ https://issues.apache.org/jira/browse/MESOS-6304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-6304: Assignee: Greg Mann (was: Artem Harutyunyan) > Add authentication support to the default executor > -- > > Key: MESOS-6304 > URL: https://issues.apache.org/jira/browse/MESOS-6304 > Project: Mesos > Issue Type: Improvement >Reporter: Galen Pewtherer >Assignee: Greg Mann > > Right now the default executor (used to launch task groups) does not > authenticate with either the executor API (/v1/executor) or the agent API > (v1). Of course, the driver-based executor doesn't authenticate either. > It would be great to come up with a solution that works for both the built-in > executors and custom executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6319) ContentType/AgentAPITest.NestedContainerLaunch/1 is flaky
[ https://issues.apache.org/jira/browse/MESOS-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15552542#comment-15552542 ] Anand Mazumdar commented on MESOS-6319: --- [~bmahler] Can you take a look since you added this test recently? > ContentType/AgentAPITest.NestedContainerLaunch/1 is flaky > - > > Key: MESOS-6319 > URL: https://issues.apache.org/jira/browse/MESOS-6319 > Project: Mesos > Issue Type: Bug > Components: containerization, tests >Affects Versions: 1.1.0 > Environment: ubuntu-14.04, autotools build, verbose build >Reporter: Benjamin Bannier > Labels: flaky-test > Attachments: build.log > > > {{ContentType/AgentAPITest.NestedContainerLaunch/1}} is flaky, saw this fail > in ASF CI (https://builds.apache.org/job/mesos-reviewbot/15545/) > {code} > ../../src/tests/api_tests.cpp:3552: Failure > (wait).failure(): Unexpected response status 404 Not Found > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot
[ https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15552427#comment-15552427 ] Neil Conway commented on MESOS-6223: cc [~vinodkone] > Allow agents to re-register post a host reboot > -- > > Key: MESOS-6223 > URL: https://issues.apache.org/jira/browse/MESOS-6223 > Project: Mesos > Issue Type: Improvement > Components: slave >Reporter: Megha > > The agent doesn't recover its state post a host reboot; it registers with the > master and gets a new SlaveID. With partition awareness, the agents are now > allowed to re-register after they have been marked Unreachable. The executors > are anyway terminated on the agent when it reboots, so there is no harm in > letting the agent keep its SlaveID, re-register with the master and reconcile > the lost executors. This is a pre-requisite for supporting > persistent/restartable tasks in mesos (MESOS-3545). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot
[ https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15552425#comment-15552425 ] Neil Conway commented on MESOS-6223: Another way to go here would be to introduce a new type of "persistent agent ID", as discussed in MESOS-5368 -- that would essentially be an ID for a given {{work_dir}}, whereas the existing Agent ID would remain closer to a "session ID". > Allow agents to re-register post a host reboot > -- > > Key: MESOS-6223 > URL: https://issues.apache.org/jira/browse/MESOS-6223 > Project: Mesos > Issue Type: Improvement > Components: slave >Reporter: Megha > > The agent doesn't recover its state post a host reboot; it registers with the > master and gets a new SlaveID. With partition awareness, the agents are now > allowed to re-register after they have been marked Unreachable. The executors > are anyway terminated on the agent when it reboots, so there is no harm in > letting the agent keep its SlaveID, re-register with the master and reconcile > the lost executors. This is a pre-requisite for supporting > persistent/restartable tasks in mesos (MESOS-3545). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6288) The default executor should maintain launcher_dir.
[ https://issues.apache.org/jira/browse/MESOS-6288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15552417#comment-15552417 ] Gastón Kleiman commented on MESOS-6288: --- Patches: https://reviews.apache.org/r/52556 https://reviews.apache.org/r/52608/ > The default executor should maintain launcher_dir. > -- > > Key: MESOS-6288 > URL: https://issues.apache.org/jira/browse/MESOS-6288 > Project: Mesos > Issue Type: Bug >Reporter: Alexander Rukletsov >Assignee: Gastón Kleiman > Labels: health-check, mesosphere > > Both the command and docker executors require that {{launcher_dir}} be provided via a > flag. This directory contains mesos binaries, e.g. the TCP checker necessary > for TCP health checks. The default executor should somehow obtain this directory (via a flag or env > var) and maintain it for the health checker to use. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6311) Consider supporting implicit reconciliation per agent
[ https://issues.apache.org/jira/browse/MESOS-6311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15552411#comment-15552411 ] Neil Conway commented on MESOS-6311: Seems reasonable to me, although I'd like to think about this in the context of making broader changes to the reconciliation API (see MESOS-5950, MESOS-4050, etc.). > Consider supporting implicit reconciliation per agent > - > > Key: MESOS-6311 > URL: https://issues.apache.org/jira/browse/MESOS-6311 > Project: Mesos > Issue Type: Improvement > Components: master >Reporter: Joris Van Remoortere > > Currently mesos only supports: > - total implicit reconciliation > - explicit reconciliation per task > Since agents can slowly rejoin the master after a master failover, it is hard > to have a low time bound on implicit reconciliation for tasks. > Performing the current implicit reconciliation is expensive on big clusters, > so it should not be done every N seconds. > If we could perform implicit reconciliation for a particular agent, then it > would be cheap enough to run after we notice that particular agent rejoining the > cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6278) Add test cases for the HTTP health checks
[ https://issues.apache.org/jira/browse/MESOS-6278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-6278: --- Target Version/s: (was: 1.1.0) > Add test cases for the HTTP health checks > - > > Key: MESOS-6278 > URL: https://issues.apache.org/jira/browse/MESOS-6278 > Project: Mesos > Issue Type: Task > Components: tests >Reporter: haosdent >Assignee: haosdent > Labels: health-check, mesosphere, test > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6279) Add test cases for the TCP health check
[ https://issues.apache.org/jira/browse/MESOS-6279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-6279: --- Target Version/s: (was: 1.1.0) > Add test cases for the TCP health check > --- > > Key: MESOS-6279 > URL: https://issues.apache.org/jira/browse/MESOS-6279 > Project: Mesos > Issue Type: Task > Components: tests >Reporter: haosdent >Assignee: haosdent > Labels: health-check, mesosphere, test > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6288) The default executor should maintain launcher_dir.
[ https://issues.apache.org/jira/browse/MESOS-6288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-6288: --- Fix Version/s: (was: 1.1.0) > The default executor should maintain launcher_dir. > -- > > Key: MESOS-6288 > URL: https://issues.apache.org/jira/browse/MESOS-6288 > Project: Mesos > Issue Type: Bug >Reporter: Alexander Rukletsov >Assignee: Gastón Kleiman > Labels: health-check, mesosphere > > Both the command and docker executors require that {{launcher_dir}} be provided via a > flag. This directory contains mesos binaries, e.g. the TCP checker necessary > for TCP health checks. The default executor should somehow obtain this directory (via a flag or env > var) and maintain it for the health checker to use. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6184) Health checks should use a general mechanism to enter namespaces of the task.
[ https://issues.apache.org/jira/browse/MESOS-6184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-6184: --- Fix Version/s: (was: 1.1.0) > Health checks should use a general mechanism to enter namespaces of the task. > - > > Key: MESOS-6184 > URL: https://issues.apache.org/jira/browse/MESOS-6184 > Project: Mesos > Issue Type: Improvement >Reporter: haosdent >Assignee: haosdent >Priority: Blocker > Labels: health-check, mesosphere > > To perform health checks for tasks, we need to enter the corresponding > namespaces of the container. For now the health check uses a custom clone to > implement this: > {code} > return process::defaultClone([=]() -> int { > if (taskPid.isSome()) { > foreach (const string& ns, namespaces) { > Try<Nothing> setns = ns::setns(taskPid.get(), ns); > if (setns.isError()) { > ... > } > } > } > return func(); > }); > {code} > After the childHooks patches are merged, we could change the health check to use > childHooks to call {{setns}} and make {{process::defaultClone}} private > again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
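Independent of the clone-vs-child-hook plumbing discussed above, the namespace entry itself boils down to opening {{/proc/<pid>/ns/<name>}} and calling {{setns(2)}} before executing the check. A hedged Linux-only sketch; {{enterNamespaces}} and its shape are illustrative, not Mesos code:

```cpp
#include <fcntl.h>
#include <sched.h>
#include <sys/types.h>
#include <unistd.h>

#include <string>
#include <vector>

// Enter each listed namespace (e.g. "net", "mnt") of `pid`. Returns false
// on the first failure; on success the calling thread observes the task's
// view of those namespaces. Requires sufficient privileges to succeed.
bool enterNamespaces(pid_t pid, const std::vector<std::string>& namespaces) {
  for (const std::string& ns : namespaces) {
    const std::string path = "/proc/" + std::to_string(pid) + "/ns/" + ns;

    int fd = ::open(path.c_str(), O_RDONLY);
    if (fd < 0) {
      return false;  // no such namespace, or no permission to read it
    }

    // nstype 0 lets the kernel infer the namespace type from the fd.
    int result = ::setns(fd, 0);
    ::close(fd);

    if (result != 0) {
      return false;
    }
  }
  return true;
}
```

Doing this in a post-clone child hook (rather than in a replacement for {{process::defaultClone}}) keeps the namespace switch confined to the short-lived child that runs the health check command.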
[jira] [Updated] (MESOS-6119) TCP health checks are not portable.
[ https://issues.apache.org/jira/browse/MESOS-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-6119: --- Priority: Major (was: Blocker) Fix Version/s: (was: 1.1.0) > TCP health checks are not portable. > --- > > Key: MESOS-6119 > URL: https://issues.apache.org/jira/browse/MESOS-6119 > Project: Mesos > Issue Type: Bug >Reporter: Alexander Rukletsov >Assignee: Alexander Rukletsov > Labels: health-check, mesosphere > > MESOS-3567 introduced a dependency on "bash" for TCP health checks, which is > undesirable. We should implement a portable solution for TCP health checks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6157) ContainerInfo is not validated.
[ https://issues.apache.org/jira/browse/MESOS-6157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15552352#comment-15552352 ] Alexander Rukletsov commented on MESOS-6157: Apparently, {{ContainerInfo}} could also be set for non-container tasks and can also be interpreted as an indication of which containerizer to use. I've reverted the validation, see https://reviews.apache.org/r/51865 for details. {noformat} Commit: f93f4fca57added6b0bff04a3e12699eaef13da9 [f93f4fc] Parents: 001c55c306 Author: Alexander Rukletsov Date: 20 September 2016 at 14:41:15 GMT+2 Commit Date: 20 September 2016 at 16:58:19 GMT+2 Labels: alexr/container-additions-revert Revert "Added validation for `ContainerInfo`." This reverts commit e65f580bf0cbea64cedf521cf169b9b4c9f85454. {noformat} > ContainerInfo is not validated. > --- > > Key: MESOS-6157 > URL: https://issues.apache.org/jira/browse/MESOS-6157 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Alexander Rukletsov >Assignee: Alexander Rukletsov >Priority: Blocker > Labels: containerizer, mesos-containerizer, mesosphere > Fix For: 1.1.0 > > > Currently Mesos does not validate {{ContainerInfo}} provided with > {{TaskInfo}} or {{ExecutorInfo}}, hence invalid task configurations can be > accepted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6320) Implement clang-tidy check to catch incorrect flags hierarchies
Benjamin Bannier created MESOS-6320: --- Summary: Implement clang-tidy check to catch incorrect flags hierarchies Key: MESOS-6320 URL: https://issues.apache.org/jira/browse/MESOS-6320 Project: Mesos Issue Type: Bug Reporter: Benjamin Bannier Classes derived from {{FlagsBase}} must always use {{virtual}} inheritance. Likewise, in order to compose such derived flags, they should again be inherited virtually. Some examples: {code} struct A : virtual FlagsBase {}; // OK struct B : FlagsBase {}; // ERROR struct C : A {}; // ERROR {code} We should implement a clang-tidy check to catch such incorrect inheritance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
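The reason for the rule is the classic diamond: composed flag groups must share a single {{FlagsBase}} subobject (one flag registry), which only virtual inheritance guarantees. A self-contained sketch with a stand-in base; all names here are hypothetical, not stout's actual {{FlagsBase}}:

```cpp
// Stand-in for stout's FlagsBase; `registered` models shared
// flag-registration state.
struct FlagsBase {
  int registered = 0;
};

// The pattern the checker should accept: virtual inheritance throughout.
struct LoggingFlags : virtual FlagsBase {
  LoggingFlags() { registered++; }  // e.g. registers --log_dir
};

struct MetricsFlags : virtual FlagsBase {
  MetricsFlags() { registered++; }  // e.g. registers --metrics_port
};

// Composition yields exactly one FlagsBase subobject, so both groups
// register into the same state.
struct AgentFlags : virtual LoggingFlags, virtual MetricsFlags {};

// The pattern the checker should flag: non-virtual inheritance duplicates
// FlagsBase on composition, splitting the registry in two.
struct BadLogging : FlagsBase { BadLogging() { registered++; } };
struct BadMetrics : FlagsBase { BadMetrics() { registered++; } };
struct BadAgent : BadLogging, BadMetrics {};
```

With the virtual hierarchy, {{AgentFlags().registered}} is 2 (both groups hit one base); with the non-virtual one, {{BadAgent}} carries two independent {{FlagsBase}} copies and plain member access is ambiguous.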
[jira] [Commented] (MESOS-6100) Make fails compiling 1.0.1
[ https://issues.apache.org/jira/browse/MESOS-6100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15551993#comment-15551993 ] Neil Conway commented on MESOS-6100: [~klueska] -- seems I can't edit reviews that have already been marked as submitted... > Make fails compiling 1.0.1 > --- > > Key: MESOS-6100 > URL: https://issues.apache.org/jira/browse/MESOS-6100 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.0.1 > Environment: Alpine Linux (Edge) > GCC 6.1.1 >Reporter: Gennady Feldman >Assignee: Kevin Klues > Fix For: 1.1.0, 1.0.2 > > > linux/fs.cpp: In static member function 'static > Try<mesos::internal::fs::MountInfoTable> > mesos::internal::fs::MountInfoTable::read(const Option<pid_t>&, bool)': > linux/fs.cpp:152:27: error: 'rootParentId' may be used uninitialized in this > function [-Werror=maybe-uninitialized] > sortFrom(rootParentId); >^ > cc1plus: all warnings being treated as errors > P.S. This is something new since I am able to compile 1.0.0 just fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
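The warning class behind this build break, in miniature: GCC 6 cannot prove the variable is assigned on every path before it is read, and {{-Werror}} promotes the maybe-uninitialized warning to an error. Initializing at declaration (or restructuring the control flow so every path assigns) is the usual fix. The function below is an illustrative reduction, not the actual fs.cpp code:

```cpp
// Compiled with g++ -O1 -Werror=maybe-uninitialized, leaving
// `rootParentId` uninitialized here would fail the build, because the
// compiler cannot rule out the !found path reaching the final read.
int resolveRootParent(bool found, int candidateId) {
  int rootParentId = -1;  // initialize: silences -Wmaybe-uninitialized

  if (found) {
    rootParentId = candidateId;
  }

  return rootParentId;  // read happens on every path
}
```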
[jira] [Commented] (MESOS-6308) CHECK failure in DRF sorter.
[ https://issues.apache.org/jira/browse/MESOS-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15551914#comment-15551914 ] Benjamin Bannier commented on MESOS-6308: - Unrelated to the issue of an unexpected {{name}} value showing up, I am not sure we want a hard {{CHECK}} here. We should be perfectly capable of returning a sensible value even for an unknown {{name}}, e.g., a share of zero, and could just replace the {{CHECK}} with an early {{return 0}}. > CHECK failure in DRF sorter. > > > Key: MESOS-6308 > URL: https://issues.apache.org/jira/browse/MESOS-6308 > Project: Mesos > Issue Type: Bug >Reporter: Jie Yu >Assignee: Guangya Liu > > Saw this CHECK failed in our internal CI: > https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L450 > {noformat} > [03:08:28] : [Step 10/10] [ RUN ] PartitionTest.DisconnectedFramework > [03:08:28]W: [Step 10/10] I1004 03:08:28.200443 577 cluster.cpp:158] > Creating default 'local' authorizer > [03:08:28]W: [Step 10/10] I1004 03:08:28.206408 577 leveldb.cpp:174] > Opened db in 5.827159ms > [03:08:28]W: [Step 10/10] I1004 03:08:28.208127 577 leveldb.cpp:181] > Compacted db in 1.697508ms > [03:08:28]W: [Step 10/10] I1004 03:08:28.208150 577 leveldb.cpp:196] > Created db iterator in 5756ns > [03:08:28]W: [Step 10/10] I1004 03:08:28.208160 577 leveldb.cpp:202] > Seeked to beginning of db in 1483ns > [03:08:28]W: [Step 10/10] I1004 03:08:28.208168 577 leveldb.cpp:271] > Iterated through 0 keys in the db in 1101ns > [03:08:28]W: [Step 10/10] I1004 03:08:28.208184 577 replica.cpp:776] > Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned > [03:08:28]W: [Step 10/10] I1004 03:08:28.208452 591 recover.cpp:451] > Starting replica recovery > [03:08:28]W: [Step 10/10] I1004 03:08:28.208664 596 recover.cpp:477] > Replica is in EMPTY status > [03:08:28]W: [Step 10/10] I1004 03:08:28.209079 591 replica.cpp:673] > Replica in EMPTY status received a broadcasted 
recover request from > __req_res__(3666)@172.30.2.234:37300 > [03:08:28]W: [Step 10/10] I1004 03:08:28.209203 593 recover.cpp:197] > Received a recover response from a replica in EMPTY status > [03:08:28]W: [Step 10/10] I1004 03:08:28.209394 598 recover.cpp:568] > Updating replica status to STARTING > [03:08:28]W: [Step 10/10] I1004 03:08:28.209473 598 master.cpp:380] > Master dd11d4ad-2087-4324-99ef-873e83ff09a1 (ip-172-30-2-234.mesosphere.io) > started on 172.30.2.234:37300 > [03:08:28]W: [Step 10/10] I1004 03:08:28.209489 598 master.cpp:382] Flags > at startup: --acls="" --agent_ping_timeout="15secs" > --agent_reregister_timeout="10mins" --allocation_interval="1secs" > --allocator="HierarchicalDRF" --authenticate_agents="true" > --authenticate_frameworks="true" --authenticate_http_frameworks="true" > --authenticate_http_readonly="true" --authenticate_http_readwrite="true" > --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/7rr0oB/credentials" --framework_sorter="drf" > --help="false" --hostname_lookup="true" --http_authenticators="basic" > --http_framework_authenticators="basic" --initialize_driver_logging="true" > --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" > --max_agent_ping_timeouts="5" --max_completed_frameworks="50" > --max_completed_tasks_per_framework="1000" --quiet="false" > --recovery_agent_removal_limit="100%" --registry="replicated_log" > --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" > --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" > --registry_store_timeout="100secs" --registry_strict="false" > --root_submissions="true" --user_sorter="drf" --version="false" > --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/7rr0oB/master" > --zk_session_timeout="10secs" > [03:08:28]W: [Step 10/10] I1004 03:08:28.209692 598 master.cpp:432] > Master only allowing authenticated frameworks to register > [03:08:28]W: [Step 10/10] I1004 03:08:28.209699 598 master.cpp:446] > 
Master only allowing authenticated agents to register > [03:08:28]W: [Step 10/10] I1004 03:08:28.209704 598 master.cpp:459] > Master only allowing authenticated HTTP frameworks to register > [03:08:28]W: [Step 10/10] I1004 03:08:28.209709 598 credentials.hpp:37] > Loading credentials for authentication from '/tmp/7rr0oB/credentials' > [03:08:28]W: [Step 10/10] I1004 03:08:28.209810 598 master.cpp:504] Using > default 'crammd5' authenticator > [03:08:28]W: [Step 10/10] I1004 03:08:28.209853 598 http.cpp:883] Using > default 'basic' HTTP authenticator for realm 'mesos-master-readonly' > [03:08:28]W: [Step 10/10] I1004 03:08:28.209897 598 http.cpp:883] Using > default 'basic' HTTP authenticator for realm
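The softening suggested in the comment above can be sketched as follows; {{shareOf}} and its map-based lookup are a simplification of the sorter's actual client bookkeeping, not Mesos code:

```cpp
#include <map>
#include <string>

// Instead of CHECK-failing when `name` is unknown (the abort seen at
// sorter.cpp:450), return a neutral share of zero and keep the allocator
// alive; an unknown client simply sorts as having no allocation.
double shareOf(const std::map<std::string, double>& shares,
               const std::string& name) {
  auto it = shares.find(name);
  if (it == shares.end()) {
    return 0.0;  // early return replaces CHECK(contains(name))
  }
  return it->second;
}
```

This does not explain how the unexpected {{name}} got there, but it turns a process-killing invariant into a recoverable condition while that root cause is investigated.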
[jira] [Updated] (MESOS-6319) ContentType/AgentAPITest.NestedContainerLaunch/1 is flaky
[ https://issues.apache.org/jira/browse/MESOS-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-6319: Environment: ubuntu-14.04, autotools build, verbose build > ContentType/AgentAPITest.NestedContainerLaunch/1 is flaky > - > > Key: MESOS-6319 > URL: https://issues.apache.org/jira/browse/MESOS-6319 > Project: Mesos > Issue Type: Bug > Components: containerization, tests >Affects Versions: 1.1.0 > Environment: ubuntu-14.04, autotools build, verbose build >Reporter: Benjamin Bannier > Labels: flaky-test > Attachments: build.log > > > {{ContentType/AgentAPITest.NestedContainerLaunch/1}} is flaky, saw this fail > in ASF CI (https://builds.apache.org/job/mesos-reviewbot/15545/) > {code} > ../../src/tests/api_tests.cpp:3552: Failure > (wait).failure(): Unexpected response status 404 Not Found > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6319) ContentType/AgentAPITest.NestedContainerLaunch/1 is flaky
[ https://issues.apache.org/jira/browse/MESOS-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-6319: Attachment: build.log > ContentType/AgentAPITest.NestedContainerLaunch/1 is flaky > - > > Key: MESOS-6319 > URL: https://issues.apache.org/jira/browse/MESOS-6319 > Project: Mesos > Issue Type: Bug > Components: containerization, tests >Affects Versions: 1.1.0 >Reporter: Benjamin Bannier > Labels: flaky-test > Attachments: build.log > > > {{ContentType/AgentAPITest.NestedContainerLaunch/1}} is flaky, saw this fail > in ASF CI (https://builds.apache.org/job/mesos-reviewbot/15545/) > {code} > ../../src/tests/api_tests.cpp:3552: Failure > (wait).failure(): Unexpected response status 404 Not Found > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6319) ContentType/AgentAPITest.NestedContainerLaunch/1 is flaky
Benjamin Bannier created MESOS-6319: --- Summary: ContentType/AgentAPITest.NestedContainerLaunch/1 is flaky Key: MESOS-6319 URL: https://issues.apache.org/jira/browse/MESOS-6319 Project: Mesos Issue Type: Bug Components: containerization, tests Affects Versions: 1.1.0 Reporter: Benjamin Bannier {{ContentType/AgentAPITest.NestedContainerLaunch/1}} is flaky, saw this fail in ASF CI (https://builds.apache.org/job/mesos-reviewbot/15545/) {code} ../../src/tests/api_tests.cpp:3552: Failure (wait).failure(): Unexpected response status 404 Not Found {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-6238) SSL / libevent support broken in IPv6 patch from https://github.com/lava/mesos/tree/bennoe/ipv6
[ https://issues.apache.org/jira/browse/MESOS-6238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers reassigned MESOS-6238: -- Assignee: Benno Evers > SSL / libevent support broken in IPv6 patch from > https://github.com/lava/mesos/tree/bennoe/ipv6 > --- > > Key: MESOS-6238 > URL: https://issues.apache.org/jira/browse/MESOS-6238 > Project: Mesos > Issue Type: Bug >Reporter: Lukas Loesche >Assignee: Benno Evers > > Affects https://github.com/lava/mesos/tree/bennoe/ipv6 at commit > 2199a24c0b7a782a0381aad8cceacbc95ec3d5c9 > make fails when the configure options --enable-ssl and --enable-libevent are given. > Error message: > {noformat} > ... > ... > ../../../3rdparty/libprocess/src/process.cpp: In member function ‘void > process::SocketManager::link_connect(const process::Future&, > process::network::Socket, const process::UPID&)’: > ../../../3rdparty/libprocess/src/process.cpp:1457:25: error: ‘url’ was not > declared in this scope >Try ip = url.ip; > ^ > Makefile:997: recipe for target 'libprocess_la-process.lo' failed > make[5]: *** [libprocess_la-process.lo] Error 1 > ... > ... > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6318) Update Mesos version that appears in Getting Started webpage
Armand Grillet created MESOS-6318: - Summary: Update Mesos version that appears in Getting Started webpage Key: MESOS-6318 URL: https://issues.apache.org/jira/browse/MESOS-6318 Project: Mesos Issue Type: Task Components: project website Reporter: Armand Grillet Priority: Minor The first step in the [Getting Started guide|http://mesos.apache.org/gettingstarted/] is to download the latest stable release but the version given in the snippet is 0.28.2. This problem does not concern [docs/getting-started.md|https://github.com/apache/mesos/blob/master/docs/getting-started.md]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6118) Agent would crash with docker container tasks due to host mount table read.
[ https://issues.apache.org/jira/browse/MESOS-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15551191#comment-15551191 ] Kevin Klues commented on MESOS-6118: I've added two new patches to try and address this: https://reviews.apache.org/r/52597/ https://reviews.apache.org/r/52596/ [~jamiebriant] [~bobrik] Could you please try things out with these patches and see if they fix your issues? > Agent would crash with docker container tasks due to host mount table read. > --- > > Key: MESOS-6118 > URL: https://issues.apache.org/jira/browse/MESOS-6118 > Project: Mesos > Issue Type: Bug > Components: slave >Affects Versions: 1.0.1 > Environment: Build: 2016-08-26 23:06:27 by centos > Version: 1.0.1 > Git tag: 1.0.1 > Git SHA: 3611eb0b7eea8d144e9b2e840e0ba16f2f659ee3 > systemd version `219` detected > Inializing systemd state > Created systemd slice: `/run/systemd/system/mesos_executors.slice` > Started systemd slice `mesos_executors.slice` > Using isolation: posix/cpu,posix/mem,filesystem/posix,network/cni > Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher > Linux ip-10-254-192-40 3.10.0-327.28.3.el7.x86_64 #1 SMP Thu Aug 18 19:05:49 > UTC 2016 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Jamie Briant >Assignee: Kevin Klues >Priority: Critical > Labels: linux, slave > Attachments: crashlogfull.log, cycle2.log, cycle3.log, cycle5.log, > cycle6.log, slave-crash.log > > > I have a framework which schedules thousands of short-running tasks (a few seconds > to a few minutes each) over a period of several minutes. In 1.0.1, the > slave process will crash every few minutes (with systemd restarting it). > The crash is: > Sep 01 20:52:23 ip-10-254-192-99 mesos-slave: F0901 20:52:23.905678 1232 > fs.cpp:140] Check failed: !visitedParents.contains(parentId) > Sep 01 20:52:23 ip-10-254-192-99 mesos-slave: *** Check failure stack trace: > *** > Version 1.0.0 works without this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
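The failing {{Check failed: !visitedParents.contains(parentId)}} is a cycle guard in the mount table traversal: if the parent/child relation read from the host mount table ever revisits an id, fs.cpp aborts the agent. A toy model of that guard (the real code sorts {{MountInfoTable}} entries hierarchically; the names and the map-of-children representation here are illustrative):

```cpp
#include <map>
#include <set>
#include <vector>

// Walk the parent->children relation starting at `id`. Returns false on a
// revisited id -- the condition the Mesos CHECK asserts can never happen,
// but which this bug shows a concurrently-changing mount table can trip.
bool visitMounts(int id,
                 const std::map<int, std::vector<int>>& children,
                 std::set<int>& visitedParents) {
  if (visitedParents.count(id) > 0) {
    return false;  // cycle (or duplicate parent) detected
  }
  visitedParents.insert(id);

  auto it = children.find(id);
  if (it != children.end()) {
    for (int child : it->second) {
      if (!visitMounts(child, children, visitedParents)) {
        return false;
      }
    }
  }
  return true;
}
```

Returning an error up the stack (as the linked review patches move toward) lets the caller retry the mount table read instead of killing the slave process.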