[jira] [Commented] (MESOS-5081) Posix disk isolator allows unrestricted sandbox disk usage if the executor/task doesn't specify disk resource
[ https://issues.apache.org/jira/browse/MESOS-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15550653#comment-15550653 ]

Charles Allen commented on MESOS-5081:
--------------------------------------

Does fixing this mean that things that are kind of dumb about disk (like Spark) won't be able to run on slaves which specify disk resources?

> Posix disk isolator allows unrestricted sandbox disk usage if the
> executor/task doesn't specify disk resource
> -----------------------------------------------------------------
>
>                 Key: MESOS-5081
>                 URL: https://issues.apache.org/jira/browse/MESOS-5081
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>            Reporter: Yan Xu
>              Labels: mesosphere
>
> This is the case even if {{flags.enforce_container_disk_quota}} is true. When
> a task/executor doesn't specify a disk resource, it still gets to write to
> the container sandbox. However the posix disk isolator doesn't limit it.
> Even though tasks always have access to the sandbox, it should be able to
> write zero bytes if it doesn't have any {{disk}} resource (it can still touch
> files). This likely will cause tasks to immediately fail due to
> stdout/stderr/executor download, etc. but should be the correct behavior
> (when {{flags.enforce_container_disk_quota}} is true).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Assigned] (MESOS-6308) CHECK failure in DRF sorter.
[ https://issues.apache.org/jira/browse/MESOS-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guangya Liu reassigned MESOS-6308:
----------------------------------

    Assignee: Guangya Liu

> CHECK failure in DRF sorter.
> ----------------------------
>
>                 Key: MESOS-6308
>                 URL: https://issues.apache.org/jira/browse/MESOS-6308
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Jie Yu
>            Assignee: Guangya Liu
>
> Saw this CHECK failed in our internal CI:
> https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L450
> {noformat}
> [03:08:28] : [Step 10/10] [ RUN      ] PartitionTest.DisconnectedFramework
> [03:08:28]W: [Step 10/10] I1004 03:08:28.200443   577 cluster.cpp:158] Creating default 'local' authorizer
> [03:08:28]W: [Step 10/10] I1004 03:08:28.206408   577 leveldb.cpp:174] Opened db in 5.827159ms
> [03:08:28]W: [Step 10/10] I1004 03:08:28.208127   577 leveldb.cpp:181] Compacted db in 1.697508ms
> [03:08:28]W: [Step 10/10] I1004 03:08:28.208150   577 leveldb.cpp:196] Created db iterator in 5756ns
> [03:08:28]W: [Step 10/10] I1004 03:08:28.208160   577 leveldb.cpp:202] Seeked to beginning of db in 1483ns
> [03:08:28]W: [Step 10/10] I1004 03:08:28.208168   577 leveldb.cpp:271] Iterated through 0 keys in the db in 1101ns
> [03:08:28]W: [Step 10/10] I1004 03:08:28.208184   577 replica.cpp:776] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
> [03:08:28]W: [Step 10/10] I1004 03:08:28.208452   591 recover.cpp:451] Starting replica recovery
> [03:08:28]W: [Step 10/10] I1004 03:08:28.208664   596 recover.cpp:477] Replica is in EMPTY status
> [03:08:28]W: [Step 10/10] I1004 03:08:28.209079   591 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from __req_res__(3666)@172.30.2.234:37300
> [03:08:28]W: [Step 10/10] I1004 03:08:28.209203   593 recover.cpp:197] Received a recover response from a replica in EMPTY status
> [03:08:28]W: [Step 10/10] I1004 03:08:28.209394   598 recover.cpp:568] Updating replica status to STARTING
> [03:08:28]W: [Step 10/10] I1004 03:08:28.209473   598 master.cpp:380] Master dd11d4ad-2087-4324-99ef-873e83ff09a1 (ip-172-30-2-234.mesosphere.io) started on 172.30.2.234:37300
> [03:08:28]W: [Step 10/10] I1004 03:08:28.209489   598 master.cpp:382] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/7rr0oB/credentials" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --quiet="false" --recovery_agent_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --root_submissions="true" --user_sorter="drf" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/7rr0oB/master" --zk_session_timeout="10secs"
> [03:08:28]W: [Step 10/10] I1004 03:08:28.209692   598 master.cpp:432] Master only allowing authenticated frameworks to register
> [03:08:28]W: [Step 10/10] I1004 03:08:28.209699   598 master.cpp:446] Master only allowing authenticated agents to register
> [03:08:28]W: [Step 10/10] I1004 03:08:28.209704   598 master.cpp:459] Master only allowing authenticated HTTP frameworks to register
> [03:08:28]W: [Step 10/10] I1004 03:08:28.209709   598 credentials.hpp:37] Loading credentials for authentication from '/tmp/7rr0oB/credentials'
> [03:08:28]W: [Step 10/10] I1004 03:08:28.209810   598 master.cpp:504] Using default 'crammd5' authenticator
> [03:08:28]W: [Step 10/10] I1004 03:08:28.209853   598 http.cpp:883] Using default 'basic' HTTP authenticator for realm 'mesos-master-readonly'
> [03:08:28]W: [Step 10/10] I1004 03:08:28.209897   598 http.cpp:883] Using default 'basic' HTTP authenticator for realm 'mesos-master-readwrite'
> [03:08:28]W: [Step 10/10] I1004 03:08:28.209940   598 http.cpp:883] Using default 'basic' HTTP authenticator for realm 'mesos-master-scheduler'
> [03:08:28]W: [Step 10/10] I1004 03:08:28.209962   598 master.cpp:584] Authorization enabled
> [03:08:28]W: [Step 10/10] I1004
[jira] [Commented] (MESOS-6308) CHECK failure in DRF sorter.
[ https://issues.apache.org/jira/browse/MESOS-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15550554#comment-15550554 ]

Guangya Liu commented on MESOS-6308:
------------------------------------

I was trying to reproduce this issue but with no luck, even with {{--gtest_repeat=100}}. I will try to increase the workload as you suggested to see if I can reproduce it first.

> CHECK failure in DRF sorter.
> ----------------------------
>
>                 Key: MESOS-6308
>                 URL: https://issues.apache.org/jira/browse/MESOS-6308
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Jie Yu
>
> Saw this CHECK failed in our internal CI:
> https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L450
[jira] [Updated] (MESOS-5613) mesos-local fails to start if MESOS_WORK_DIR isn't set.
[ https://issues.apache.org/jira/browse/MESOS-5613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kone updated MESOS-5613:
------------------------------

    Fix Version/s: 1.0.2

Backported to 1.0.x.

commit 6773b8ffef6672ef012f8ccb6a9fe73d40e02ae0
Author: Vinod Kone
Date:   Wed Oct 5 18:24:57 2016 -0700

    Added MESOS-5613 to 1.0.2 CHANGELOG.

commit e7a35ebe1c52aa3dc3d90feabae13c1f7220723f
Author: Ammar Askar
Date:   Wed Aug 10 17:57:09 2016 -0700

    Propagated work_dir flag from local runs to agents/masters.

    Review: https://reviews.apache.org/r/50003/

> mesos-local fails to start if MESOS_WORK_DIR isn't set.
> -------------------------------------------------------
>
>                 Key: MESOS-5613
>                 URL: https://issues.apache.org/jira/browse/MESOS-5613
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.0.0
>            Reporter: Jan Schlicht
>            Assignee: Ammar Askar
>             Fix For: 1.1.0, 1.0.2
>
> Running {{mesos-local}} fails with
> {noformat}
> Failed to start a local cluster while loading agent flags from the
> environment: Flag 'work_dir' is required, but it was not provided
> {noformat}
> if {{MESOS_WORK_DIR}} isn't set.
> This seems to be due to the changed behavior of making the {{work_dir}} flag
> mandatory (MESOS-5064). While {{MESOS_WORK_DIR}} is being set in a
> development environment in {{./bin/mesos-local.sh}}, this isn't true if
> {{mesos-local}} is installed on the system after a {{make install}}.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-6216) LibeventSSLSocketImpl::create is not safe to call concurrently with os::getenv
[ https://issues.apache.org/jira/browse/MESOS-6216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15550550#comment-15550550 ]

Vinod Kone commented on MESOS-6216:
-----------------------------------

What's the ETA for this to land on master and be backported to 1.0.x?

> LibeventSSLSocketImpl::create is not safe to call concurrently with os::getenv
> ------------------------------------------------------------------------------
>
>                 Key: MESOS-6216
>                 URL: https://issues.apache.org/jira/browse/MESOS-6216
>             Project: Mesos
>          Issue Type: Bug
>          Components: security
>            Reporter: Benjamin Bannier
>            Assignee: Benjamin Bannier
>              Labels: mesosphere
>         Attachments: build.log
>
> {{LibeventSSLSocketImpl::create}} is called whenever a potentially
> ssl-enabled socket is created. It in turn calls {{openssl::initialize}},
> which calls a function {{reinitialize}} using {{os::setenv}}. Here
> {{os::setenv}} is used to set up SSL-related libprocess environment
> variables {{LIBPROCESS_SSL_*}}.
> Since {{os::setenv}} is not thread-safe, just like the {{::setenv}} it wraps,
> calling functions like {{os::getenv}} (or {{os::environment}})
> concurrently with the first invocation of {{LibeventSSLSocketImpl::create}}
> performs unsynchronized r/w access to the same data structure in the runtime.
> We usually perform most setup of the environment before we start the
> libprocess runtime with {{process::initialize}} from a {{main}} function, see
> e.g. {{src/slave/main.cpp}} or {{src/master/main.cpp}} and others. It
> appears that we should move the setup of libprocess' SSL environment
> variables to a similar spot.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (MESOS-6317) Race in master update slave.
Guangya Liu created MESOS-6317:
----------------------------------

             Summary: Race in master update slave.
                 Key: MESOS-6317
                 URL: https://issues.apache.org/jira/browse/MESOS-6317
             Project: Mesos
          Issue Type: Bug
            Reporter: Guangya Liu
            Assignee: Guangya Liu


Currently, when {{updateSlave}} runs in the master, it first rescinds offers and then calls updateSlave on the allocator. But there is a race here: a batch allocation may be inserted between the two steps. In that case the order becomes rescind offer -> batch allocation -> update slave.

This order causes problems when the oversubscribed resources decrease. Suppose the oversubscribed resources decrease from 2 to 1: after the rescind finishes, the batch allocation will allocate the old 2 oversubscribed resources again, and then update slave will set the total oversubscribed resources to 1. The agent host is then overcommitted for some time, because tasks can still use 2 oversubscribed resources rather than 1; once the tasks using the 2 oversubscribed resources finish, everything goes back to normal.

So we should adjust the order of rescind offer and updateSlave in the master to avoid resource overcommit. If we update the slave first and then rescind offers, the order becomes update slave -> batch allocation -> rescind offer, which has no problem when decreasing resources. Suppose the oversubscribed resources decrease from 2 to 1: update slave sets the total oversubscribed resources to 1 directly, the batch allocation then allocates no oversubscribed resources since more is already allocated than the new total, and the rescind then revokes all offers using oversubscribed resources. This does not leave the agent host overcommitted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (MESOS-6142) Frameworks may RESERVE for an arbitrary role.
[ https://issues.apache.org/jira/browse/MESOS-6142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kone updated MESOS-6142:
------------------------------

    Target Version/s: 1.1.0
       Fix Version/s: (was: 1.1.0)

> Frameworks may RESERVE for an arbitrary role.
> ---------------------------------------------
>
>                 Key: MESOS-6142
>                 URL: https://issues.apache.org/jira/browse/MESOS-6142
>             Project: Mesos
>          Issue Type: Bug
>          Components: allocation, master
>    Affects Versions: 1.0.0
>            Reporter: Alexander Rukletsov
>            Assignee: Gastón Kleiman
>            Priority: Blocker
>              Labels: mesosphere, reservations
>
> The master does not validate that resources from a reservation request have
> the same role the framework is registered with. As a result, frameworks may
> reserve resources for arbitrary roles.
> I've modified the role in [the {{ReserveThenUnreserve}}
> test|https://github.com/apache/mesos/blob/bca600cf5602ed8227d91af9f73d689da14ad786/src/tests/reservation_tests.cpp#L117]
> to "yoyo" and observed the following in the test's log:
> {noformat}
> I0908 18:35:43.379122 2138112 master.cpp:3362] Processing ACCEPT call for offers: [ dfaf67e6-7c1c-4988-b427-c49842cb7bb7-O0 ] on agent dfaf67e6-7c1c-4988-b427-c49842cb7bb7-S0 at slave(1)@10.200.181.237:60116 (alexr.railnet.train) for framework dfaf67e6-7c1c-4988-b427-c49842cb7bb7- (default) at scheduler-ca12a660-9f08-49de-be4e-d452aa3aa6da@10.200.181.237:60116
> I0908 18:35:43.379170 2138112 master.cpp:3022] Authorizing principal 'test-principal' to reserve resources 'cpus(yoyo, test-principal):1; mem(yoyo, test-principal):512'
> I0908 18:35:43.379678 2138112 master.cpp:3642] Applying RESERVE operation for resources cpus(yoyo, test-principal):1; mem(yoyo, test-principal):512 from framework dfaf67e6-7c1c-4988-b427-c49842cb7bb7- (default) at scheduler-ca12a660-9f08-49de-be4e-d452aa3aa6da@10.200.181.237:60116 to agent dfaf67e6-7c1c-4988-b427-c49842cb7bb7-S0 at slave(1)@10.200.181.237:60116 (alexr.railnet.train)
> I0908 18:35:43.379767 2138112 master.cpp:7341] Sending checkpointed resources cpus(yoyo, test-principal):1; mem(yoyo, test-principal):512 to agent dfaf67e6-7c1c-4988-b427-c49842cb7bb7-S0 at slave(1)@10.200.181.237:60116 (alexr.railnet.train)
> I0908 18:35:43.380273 3211264 slave.cpp:2497] Updated checkpointed resources from  to cpus(yoyo, test-principal):1; mem(yoyo, test-principal):512
> I0908 18:35:43.380574 2674688 hierarchical.cpp:760] Updated allocation of framework dfaf67e6-7c1c-4988-b427-c49842cb7bb7- on agent dfaf67e6-7c1c-4988-b427-c49842cb7bb7-S0 from cpus(*):1; mem(*):512; disk(*):470841; ports(*):[31000-32000] to ports(*):[31000-32000]; cpus(yoyo, test-principal):1; disk(*):470841; mem(yoyo, test-principal):512 with RESERVE operation
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-6157) ContainerInfo is not validated.
[ https://issues.apache.org/jira/browse/MESOS-6157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15550491#comment-15550491 ]

Vinod Kone commented on MESOS-6157:
-----------------------------------

[~alexr] Should this be resolved?

> ContainerInfo is not validated.
> -------------------------------
>
>                 Key: MESOS-6157
>                 URL: https://issues.apache.org/jira/browse/MESOS-6157
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.0.0
>            Reporter: Alexander Rukletsov
>            Assignee: Alexander Rukletsov
>            Priority: Blocker
>              Labels: containerizer, mesos-containerizer, mesosphere
>             Fix For: 1.1.0
>
> Currently Mesos does not validate {{ContainerInfo}} provided with
> {{TaskInfo}} or {{ExecutorInfo}}, hence invalid task configurations can be
> accepted.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (MESOS-6118) Agent would crash with docker container tasks due to host mount table read.
[ https://issues.apache.org/jira/browse/MESOS-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kone updated MESOS-6118:
------------------------------

    Target Version/s: 1.1.0, 1.0.2
       Fix Version/s: (was: 1.0.2)
                      (was: 1.1.0)

> Agent would crash with docker container tasks due to host mount table read.
> ---------------------------------------------------------------------------
>
>                 Key: MESOS-6118
>                 URL: https://issues.apache.org/jira/browse/MESOS-6118
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>    Affects Versions: 1.0.1
>         Environment: Build: 2016-08-26 23:06:27 by centos
>                      Version: 1.0.1
>                      Git tag: 1.0.1
>                      Git SHA: 3611eb0b7eea8d144e9b2e840e0ba16f2f659ee3
>                      systemd version `219` detected
>                      Inializing systemd state
>                      Created systemd slice: `/run/systemd/system/mesos_executors.slice`
>                      Started systemd slice `mesos_executors.slice`
>                      Using isolation: posix/cpu,posix/mem,filesystem/posix,network/cni
>                      Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
>                      Linux ip-10-254-192-40 3.10.0-327.28.3.el7.x86_64 #1 SMP Thu Aug 18 19:05:49 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>            Reporter: Jamie Briant
>            Assignee: Kevin Klues
>            Priority: Critical
>              Labels: linux, slave
>         Attachments: crashlogfull.log, cycle2.log, cycle3.log, cycle5.log, cycle6.log, slave-crash.log
>
> I have a framework which schedules thousands of short running (a few seconds
> to a few minutes) tasks over a period of several minutes. In 1.0.1, the
> slave process will crash every few minutes (with systemd restarting it).
> The crash is:
> Sep 01 20:52:23 ip-10-254-192-99 mesos-slave: F0901 20:52:23.905678  1232 fs.cpp:140] Check failed: !visitedParents.contains(parentId)
> Sep 01 20:52:23 ip-10-254-192-99 mesos-slave: *** Check failure stack trace: ***
> Version 1.0.0 works without this issue.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-6302) Agent recovery can fail after nested containers are launched
[ https://issues.apache.org/jira/browse/MESOS-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15550349#comment-15550349 ]

Jie Yu commented on MESOS-6302:
-------------------------------

commit 8bab70c691a3efeda301f72956de4f80b258464e
Author: Gilbert Song <songzihao1...@gmail.com>
Date:   Mon Oct 3 15:28:39 2016 -0700

    Fixed provisioner recovering with nested containers existed.

    Previously, in provisioner recover, we firstly get all container ids
    from the provisioner directory, and then find all rootfses from each
    container's 'backends' directory. We made an assumption that if a
    'container_id' directory exists in the provisioner directory, it must
    contain a 'backends' directory underneath, which contains at least
    one rootfs for this container.

    However, this is no longer true since we added support for nested
    containers, because we allow the case that a nested container is
    specified with a container image while its parent does not have an
    image specified. In this case, when the provisioner recovers, it can
    still find the parent container's id in the provisioner directory
    while no 'backends' directory exists, since all nested containers'
    backend information is under its parent container's directory. As a
    result, we should skip recovering the 'Info' struct in provisioner
    for the parent container if it never provisions any image.

    Review: https://reviews.apache.org/r/52480/

> Agent recovery can fail after nested containers are launched
> ------------------------------------------------------------
>
>                 Key: MESOS-6302
>                 URL: https://issues.apache.org/jira/browse/MESOS-6302
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>            Reporter: Greg Mann
>            Assignee: Gilbert Song
>            Priority: Blocker
>              Labels: mesosphere
>             Fix For: 1.1.0
>         Attachments: read_write_app.json
>
> After launching a nested container which used a Docker image, I restarted the
> agent which ran that task group and saw the following in the agent logs
> during recovery:
> {code}
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: I1001 01:45:10.813596  4640 status_update_manager.cpp:203] Recovering status update manager
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: I1001 01:45:10.813622  4640 status_update_manager.cpp:211] Recovering executor 'instance-testvolume.02c26bce-8778-11e6-9ff3-7a3cd7c1568e' of framework 118ca38d-daee-4b2d-b584-b5581738a3dd-
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: I1001 01:45:10.814249  4639 docker.cpp:745] Recovering Docker containers
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: I1001 01:45:10.815294  4642 containerizer.cpp:581] Recovering containerizer
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: Failed to perform recovery: Collect failed: Unable to list rootfses belonged to container a7d576da-fd0f-4dc1-bd5a-6d0a93ac8a53: Unable to list the container directory: Failed to opendir '/var/lib/mesos/slave/provisioner/containers/a7d576da-fd0f-4dc1-bd5a-6d0a93ac8a53/backends': No such file or directory
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: To remedy this do as follows:
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: Step 1: rm -f /var/lib/mesos/slave/meta/slaves/latest
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: This ensures agent doesn't recover old live executors.
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: Step 2: Restart the agent.
> {code}
> and the agent continues to restart in this fashion. Attached is the Marathon
> app definition that I used to launch the task group.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-6308) CHECK failure in DRF sorter.
[ https://issues.apache.org/jira/browse/MESOS-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15550284#comment-15550284 ]

Benjamin Mahler commented on MESOS-6308:
----------------------------------------

[~gyliu] have you seen this before?

> CHECK failure in DRF sorter.
> ----------------------------
>
>                 Key: MESOS-6308
>                 URL: https://issues.apache.org/jira/browse/MESOS-6308
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Jie Yu
>
> Saw this CHECK failed in our internal CI:
> https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L450
[jira] [Updated] (MESOS-6316) CREATE of shared volumes should not be allowed by frameworks not opted in to the capability.
[ https://issues.apache.org/jira/browse/MESOS-6316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anindya Sinha updated MESOS-6316:
---------------------------------

    Labels: persistent-volumes  (was: )

> CREATE of shared volumes should not be allowed by frameworks not opted in to
> the capability.
> ----------------------------------------------------------------------------
>
>                 Key: MESOS-6316
>                 URL: https://issues.apache.org/jira/browse/MESOS-6316
>             Project: Mesos
>          Issue Type: Bug
>          Components: general
>            Reporter: Anindya Sinha
>            Assignee: Anindya Sinha
>            Priority: Minor
>              Labels: persistent-volumes
>
> Even though shared resources are not offered to a framework that has not
> opted in to the SHARED_RESOURCES capability, such a framework can
> inadvertently CREATE a shared volume. Although this volume shall not be
> offered to such frameworks, Mesos should disallow such CREATE operations.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (MESOS-6316) CREATE of shared volumes should not be allowed by frameworks not opted in to the capability.
Anindya Sinha created MESOS-6316:
------------------------------------

             Summary: CREATE of shared volumes should not be allowed by frameworks not opted in to the capability.
                 Key: MESOS-6316
                 URL: https://issues.apache.org/jira/browse/MESOS-6316
             Project: Mesos
          Issue Type: Bug
          Components: general
            Reporter: Anindya Sinha
            Assignee: Anindya Sinha
            Priority: Minor


Even though shared resources are not offered to a framework that has not opted in to the SHARED_RESOURCES capability, such a framework can inadvertently CREATE a shared volume. Although this volume shall not be offered to such frameworks, Mesos should disallow such CREATE operations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (MESOS-6269) CNI isolator doesn't activate loopback interface
[ https://issues.apache.org/jira/browse/MESOS-6269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-6269: -- Fix Version/s: 1.0.2 > CNI isolator doesn't activate loopback interface > > > Key: MESOS-6269 > URL: https://issues.apache.org/jira/browse/MESOS-6269 > Project: Mesos > Issue Type: Bug > Components: isolation, network >Affects Versions: 1.0.1 >Reporter: Greg Mann >Assignee: Avinash Sridharan >Priority: Blocker > Labels: isolation, networking > Fix For: 1.1.0, 1.0.2 > > > Launching a nested CNI-enabled container yielded the following agent log > output: > {code} > cni.cpp:1255] Got assigned IPv4 address '9.0.1.25/25' from CNI network 'dcos' > for container 7c1ef3c4-ba7b-4b43-ba33-0612d84100cc > {code} > indicating that the container was successfully assigned an IP. Running > {{ifconfig -a}} inside the container yields: > {code} > eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1420 > inet 9.0.1.25 netmask 255.255.255.128 broadcast 0.0.0.0 > inet6 fe80::e004:4bff:fefc:6816 prefixlen 64 scopeid 0x20<link> > ether 0a:58:09:00:01:19 txqueuelen 0 (Ethernet) > RX packets 31 bytes 5052 (4.9 KiB) > RX errors 0 dropped 0 overruns 0 frame 0 > TX packets 36 bytes 5689 (5.5 KiB) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > lo: flags=8<LOOPBACK> mtu 65536 > loop txqueuelen 1 (Local Loopback) > RX packets 0 bytes 0 (0.0 B) > RX errors 0 dropped 0 overruns 0 frame 0 > TX packets 0 bytes 0 (0.0 B) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > {code} > It can be seen that the loopback interface is not activated. {{ifconfig lo > up}} must be run before a process within the container can bind to that > interface, but this should be handled by the CNI isolator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
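For reference, the state shown in the {{ifconfig}} output above can be queried programmatically. The isolator's fix would additionally set {{IFF_UP}} via the {{SIOCSIFFLAGS}} ioctl (the equivalent of {{ifconfig lo up}}), which requires CAP_NET_ADMIN inside the container's network namespace. Below is a minimal Linux-only sketch that only reads the flags; it is an illustration, not the isolator's actual code.

```cpp
#include <cstring>
#include <net/if.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

// Returns true if the named interface has IFF_UP set, i.e. it is activated.
// Reading flags needs no privileges; actually activating the interface (the
// fix described above) would issue SIOCSIFFLAGS with flags | IFF_UP, which
// needs CAP_NET_ADMIN in the container's network namespace.
bool interfaceIsUp(const char* name) {
  int fd = socket(AF_INET, SOCK_DGRAM, 0);
  if (fd < 0) {
    return false;
  }

  struct ifreq ifr;
  std::memset(&ifr, 0, sizeof(ifr));
  std::strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);

  bool up = ioctl(fd, SIOCGIFFLAGS, &ifr) == 0 && (ifr.ifr_flags & IFF_UP);
  close(fd);
  return up;
}
```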
[jira] [Comment Edited] (MESOS-6249) On Mesos master failover the reregistered callback is not triggered
[ https://issues.apache.org/jira/browse/MESOS-6249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15549924#comment-15549924 ] Benjamin Mahler edited comment on MESOS-6249 at 10/5/16 8:58 PM: - Linking in MESOS-786 which describes the lifecycle of registered and re-registered callbacks. Note that MESOS-786 was resolved but AFAICT we did not update to the newer semantics described in this ticket for schedulers that use the old-style driver. However, it sounds like you care about this because you're trying to detect that the master has failed over. To do this you must introspect the {{MasterInfo}} provided to you in order to see if {{MasterInfo.id}} has changed. was (Author: bmahler): Linking in MESOS-786 which describes the lifecycle of registered and re-registered callbacks. Note that MESOS-786 was resolved but AFAICT we did not update to the newer semantics described in this ticket for schedulers that use the old-style driver. However, it sounds like you care about this because you're to detect that the master has failed over. To do this you must introspect the {{MasterInfo}} provided to you in order to see if {{MasterInfo.id}} has changed. > On Mesos master failover the reregistered callback is not triggered > --- > > Key: MESOS-6249 > URL: https://issues.apache.org/jira/browse/MESOS-6249 > Project: Mesos > Issue Type: Bug > Components: java api >Affects Versions: 0.28.0, 0.28.1, 1.0.1 > Environment: OS X 10.11.6 >Reporter: Markus Jura > > On a Mesos master failover the reregistered callback of the Java API is not > triggered. Only the registration callback is triggered which makes it hard > for a framework to distinguish between these scenarios. > This behaviour has been tested with the ConductR framework, both with the > Java API version 0.28.0, 0.28.1 and 1.0.1. Below you find the logs from the > master that got re-elected and from the ConductR framework. 
> *Log: Mesos master on a master re-election* > {code:bash} > I0926 11:44:20.008306 3747840 zookeeper.cpp:259] A new leading master > (UPID=master@127.0.0.1:5050) is detected > I0926 11:44:20.008458 3747840 master.cpp:1847] The newly elected leader is > master@127.0.0.1:5050 with id ca5b9713-1eec-43e1-9d27-9ebc5c0f95b1 > I0926 11:44:20.008484 3747840 master.cpp:1860] Elected as the leading master! > I0926 11:44:20.008498 3747840 master.cpp:1547] Recovering from registrar > I0926 11:44:20.008607 3747840 registrar.cpp:332] Recovering registrar > I0926 11:44:20.016340 4284416 registrar.cpp:365] Successfully fetched the > registry (0B) in 7.702016ms > I0926 11:44:20.016393 4284416 registrar.cpp:464] Applied 1 operations in > 12us; attempting to update the 'registry' > I0926 11:44:20.021428 4284416 registrar.cpp:509] Successfully updated the > 'registry' in 5.019904ms > I0926 11:44:20.021481 4284416 registrar.cpp:395] Successfully recovered > registrar > I0926 11:44:20.021611 528384 master.cpp:1655] Recovered 0 agents from the > Registry (118B) ; allowing 10mins for agents to re-register > I0926 11:44:20.536859 3747840 master.cpp:2424] Received SUBSCRIBE call for > framework 'conductr' at > scheduler-3f8b9645-7a17-4e9f-8ad5-077fe8c23b39@192.168.2.106:57164 > I0926 11:44:20.536969 3747840 master.cpp:2500] Subscribing framework conductr > with checkpointing disabled and capabilities [ ] > I0926 11:44:20.537401 3211264 hierarchical.cpp:271] Added framework conductr > I0926 11:44:20.807895 528384 master.cpp:4787] Re-registering agent > b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 at slave(1)@127.0.0.1:5051 (127.0.0.1) > I0926 11:44:20.808145 1601536 registrar.cpp:464] Applied 1 operations in > 38us; attempting to update the 'registry' > I0926 11:44:20.815757 1601536 registrar.cpp:509] Successfully updated the > 'registry' in 7.568896ms > I0926 11:44:20.815992 3747840 master.cpp:7447] Adding task > 6abce9bb-895f-4f6f-be5b-25f6bd09f548 with resources mem(*):0 on agent > 
b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 (127.0.0.1) > I0926 11:44:20.816339 3747840 master.cpp:4872] Re-registered agent > b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 at slave(1)@127.0.0.1:5051 > (127.0.0.1) with cpus(*):8; mem(*):15360; disk(*):470832; > ports(*):[31000-32000] > I0926 11:44:20.816385 1601536 hierarchical.cpp:478] Added agent > b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 (127.0.0.1) with cpus(*):8; > mem(*):15360; disk(*):470832; ports(*):[31000-32000] (allocated: cpus(*):0.9; > mem(*):402.653; disk(*):1000; ports(*):[31000-31000, 31001-31500]) > I0926 11:44:20.816437 3747840 master.cpp:4940] Sending updated checkpointed > resources to agent b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 at > slave(1)@127.0.0.1:5051 (127.0.0.1) > I0926 11:44:20.816787 4284416 master.cpp:5725] Sending 1 offers to framework >
[jira] [Commented] (MESOS-6249) On Mesos master failover the reregistered callback is not triggered
[ https://issues.apache.org/jira/browse/MESOS-6249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15549924#comment-15549924 ] Benjamin Mahler commented on MESOS-6249: Linking in MESOS-786 which describes the lifecycle of registered and re-registered callbacks. Note that MESOS-786 was resolved but AFAICT we did not update to the newer semantics described in this ticket for schedulers that use the old-style driver. However, it sounds like you care about this because you're trying to detect that the master has failed over. To do this you must introspect the {{MasterInfo}} provided to you in order to see if {{MasterInfo.id}} has changed. > On Mesos master failover the reregistered callback is not triggered > --- > > Key: MESOS-6249 > URL: https://issues.apache.org/jira/browse/MESOS-6249 > Project: Mesos > Issue Type: Bug > Components: java api >Affects Versions: 0.28.0, 0.28.1, 1.0.1 > Environment: OS X 10.11.6 >Reporter: Markus Jura > > On a Mesos master failover the reregistered callback of the Java API is not > triggered. Only the registration callback is triggered which makes it hard > for a framework to distinguish between these scenarios. > This behaviour has been tested with the ConductR framework, both with the > Java API version 0.28.0, 0.28.1 and 1.0.1. Below you find the logs from the > master that got re-elected and from the ConductR framework. > *Log: Mesos master on a master re-election* > {code:bash} > I0926 11:44:20.008306 3747840 zookeeper.cpp:259] A new leading master > (UPID=master@127.0.0.1:5050) is detected > I0926 11:44:20.008458 3747840 master.cpp:1847] The newly elected leader is > master@127.0.0.1:5050 with id ca5b9713-1eec-43e1-9d27-9ebc5c0f95b1 > I0926 11:44:20.008484 3747840 master.cpp:1860] Elected as the leading master! 
> I0926 11:44:20.008498 3747840 master.cpp:1547] Recovering from registrar > I0926 11:44:20.008607 3747840 registrar.cpp:332] Recovering registrar > I0926 11:44:20.016340 4284416 registrar.cpp:365] Successfully fetched the > registry (0B) in 7.702016ms > I0926 11:44:20.016393 4284416 registrar.cpp:464] Applied 1 operations in > 12us; attempting to update the 'registry' > I0926 11:44:20.021428 4284416 registrar.cpp:509] Successfully updated the > 'registry' in 5.019904ms > I0926 11:44:20.021481 4284416 registrar.cpp:395] Successfully recovered > registrar > I0926 11:44:20.021611 528384 master.cpp:1655] Recovered 0 agents from the > Registry (118B) ; allowing 10mins for agents to re-register > I0926 11:44:20.536859 3747840 master.cpp:2424] Received SUBSCRIBE call for > framework 'conductr' at > scheduler-3f8b9645-7a17-4e9f-8ad5-077fe8c23b39@192.168.2.106:57164 > I0926 11:44:20.536969 3747840 master.cpp:2500] Subscribing framework conductr > with checkpointing disabled and capabilities [ ] > I0926 11:44:20.537401 3211264 hierarchical.cpp:271] Added framework conductr > I0926 11:44:20.807895 528384 master.cpp:4787] Re-registering agent > b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 at slave(1)@127.0.0.1:5051 (127.0.0.1) > I0926 11:44:20.808145 1601536 registrar.cpp:464] Applied 1 operations in > 38us; attempting to update the 'registry' > I0926 11:44:20.815757 1601536 registrar.cpp:509] Successfully updated the > 'registry' in 7.568896ms > I0926 11:44:20.815992 3747840 master.cpp:7447] Adding task > 6abce9bb-895f-4f6f-be5b-25f6bd09f548 with resources mem(*):0 on agent > b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 (127.0.0.1) > I0926 11:44:20.816339 3747840 master.cpp:4872] Re-registered agent > b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 at slave(1)@127.0.0.1:5051 > (127.0.0.1) with cpus(*):8; mem(*):15360; disk(*):470832; > ports(*):[31000-32000] > I0926 11:44:20.816385 1601536 hierarchical.cpp:478] Added agent > b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 (127.0.0.1) with cpus(*):8; 
> mem(*):15360; disk(*):470832; ports(*):[31000-32000] (allocated: cpus(*):0.9; > mem(*):402.653; disk(*):1000; ports(*):[31000-31000, 31001-31500]) > I0926 11:44:20.816437 3747840 master.cpp:4940] Sending updated checkpointed > resources to agent b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 at > slave(1)@127.0.0.1:5051 (127.0.0.1) > I0926 11:44:20.816787 4284416 master.cpp:5725] Sending 1 offers to framework > conductr (conductr) at > scheduler-3f8b9645-7a17-4e9f-8ad5-077fe8c23b39@192.168.2.106:57164 > {code} > *Log: ConductR framework* > {code:bash} > I0926 11:44:20.007189 66441216 detector.cpp:152] Detected a new leader: > (id='87') > I0926 11:44:20.007524 64294912 group.cpp:706] Trying to get > '/mesos/json.info_87' in ZooKeeper > I0926 11:44:20.008625 63758336 zookeeper.cpp:259] A new leading master > (UPID=master@127.0.0.1:5050) is detected > I0926 11:44:20.008965 63758336 sched.cpp:330] New master detected at > master@127.0.0.1:5050 >
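The workaround suggested in the comment above can be captured in a few lines: remember the id from the {{MasterInfo}} passed to the registered callback and compare it on each subsequent registration. A sketch (shown in C++ for brevity; the same logic applies in the Java scheduler callback, and the {{MasterInfo}} struct here is an illustrative stand-in for the protobuf message):

```cpp
#include <string>

// Stand-in for the MasterInfo protobuf; only the id field matters here.
struct MasterInfo {
  std::string id;
};

// Tracks the last seen master id so a scheduler can detect a master
// failover even when only the registered (not reregistered) callback fires.
class FailoverDetector {
public:
  // Call from the registered callback; returns true if the master changed,
  // i.e. a failover happened since the previous registration.
  bool onRegistered(const MasterInfo& masterInfo) {
    bool failedOver = !lastMasterId.empty() && lastMasterId != masterInfo.id;
    lastMasterId = masterInfo.id;
    return failedOver;
  }

private:
  std::string lastMasterId;
};
```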
[jira] [Commented] (MESOS-6312) Add requirement in upgrade.md and getting-started.md for agent '--runtime_dir' when running as non-root
[ https://issues.apache.org/jira/browse/MESOS-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15549883#comment-15549883 ] Kevin Klues commented on MESOS-6312: Possibly. How easy is it to change for just that one binary? I am not very familiar with {{mesos-local}}. > Add requirement in upgrade.md and getting-started.md for agent > '--runtime_dir' when running as non-root > -- > > Key: MESOS-6312 > URL: https://issues.apache.org/jira/browse/MESOS-6312 > Project: Mesos > Issue Type: Task >Reporter: Kevin Klues >Priority: Blocker > Fix For: 1.1.0 > > > We recently introduced a new agent flag for {{\-\-runtime_dir}}. Unlike the > {{\-\-work_dir}}, this directory is designed to hold the state of a running > agent between subsequent agent-restarts (but not across host reboots). > By default, this flag is set to {{/var/run/mesos}} since this is a {{tmpfs}} > on Linux that gets automatically cleaned up on reboot. However, on most > systems {{/var/run/mesos}} is only writable by root, causing problems when > launching an agent as non-root and not pointing {{--runtime_dir}} to a > different location. > We need to call this out in the upgrade.md and getting-started.md docs so > that people know they may need to set this going forward. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6312) Add requirement in upgrade.md and getting-started.md for agent '--runtime_dir' when running as non-root
[ https://issues.apache.org/jira/browse/MESOS-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Klues updated MESOS-6312: --- Description: We recently introduced a new agent flag for {{\-\-runtime_dir}}. Unlike the {{\-\-work_dir}}, this directory is designed to hold the state of a running agent between subsequent agent-restarts (but not across host reboots). By default, this flag is set to {{/var/run/mesos}} since this is a {{tmpfs}} on Linux that gets automatically cleaned up on reboot. However, on most systems {{/var/run/mesos}} is only writable by root, causing problems when launching an agent as non-root and not pointing {{--runtime_dir}} to a different location. We need to call this out in the upgrade.md and getting-started.md docs so that people know they may need to set this going forward. was: We recently introduced a new agent flag for {{--runtime_dir}}. Unlike the {{--work_dir}}, this directory is designed to hold the state of a running agent between subsequent agent-restarts (but not across host reboots). By default, this flag is set to {{/var/run/mesos}} since this is a {{tmpfs}} on Linux that gets automatically cleaned up on reboot. However, on most systems {{/var/run/mesos}} is only writable by root, causing problems when launching an agent as non-root and not pointing {{--runtime_dir}} to a different location. We need to call this out in the upgrade.md and getting-started.md docs so that people know they may need to set this going forward. > Add requirement in upgrade.md and getting-started.md for agent > '--runtime_dir' when running as non-root > -- > > Key: MESOS-6312 > URL: https://issues.apache.org/jira/browse/MESOS-6312 > Project: Mesos > Issue Type: Task >Reporter: Kevin Klues >Priority: Blocker > Fix For: 1.1.0 > > > We recently introduced a new agent flag for {{\-\-runtime_dir}}. 
Unlike the > {{\-\-work_dir}}, this directory is designed to hold the state of a running > agent between subsequent agent-restarts (but not across host reboots). > By default, this flag is set to {{/var/run/mesos}} since this is a {{tmpfs}} > on Linux that gets automatically cleaned up on reboot. However, on most > systems {{/var/run/mesos}} is only writable by root, causing problems when > launching an agent as non-root and not pointing {{--runtime_dir}} to a > different location. > We need to call this out in the upgrade.md and getting-started.md docs so > that people know they may need to set this going forward. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
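The failure mode described above boils down to a write-permission check on the default directory. A sketch (this is illustrative, not the agent's actual validation code) of the check that explains why a non-root agent must override the default, e.g. by passing a {{--runtime_dir}} under the user's home directory:

```cpp
#include <string>

#include <unistd.h>

// Returns true if the current user could use `path` as the agent's
// --runtime_dir, i.e. the directory is writable and traversable. On most
// systems /var/run is writable only by root, so an agent launched as
// non-root must point --runtime_dir at a different location.
bool runtimeDirUsable(const std::string& path) {
  return access(path.c_str(), W_OK | X_OK) == 0;
}
```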
[jira] [Updated] (MESOS-5967) Add support for 'docker image inspect' in our docker abstraction
[ https://issues.apache.org/jira/browse/MESOS-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Klues updated MESOS-5967: --- Labels: gpu (was: gpu mesosphere) > Add support for 'docker image inspect' in our docker abstraction > > > Key: MESOS-5967 > URL: https://issues.apache.org/jira/browse/MESOS-5967 > Project: Mesos > Issue Type: Improvement >Reporter: Kevin Klues >Assignee: Guangya Liu > Labels: gpu > Fix For: 1.1.0 > > > Docker's command line tool for {{docker inspect}} can take either a > {{container}}, an {{image}}, or a {{task}} as its argument, and return a JSON > array containing low-level information about that container, image or task. > However, the current {{docker inspect}} support in our docker abstraction > only supports inspecting containers (not images or tasks). We should expand > this to (at least) support images. > In particular, this additional functionality is motivated by the upcoming GPU > support, which needs to inspect the labels in a docker image to decide if it > should inject the required Nvidia volumes into a container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5967) Add support for 'docker image inspect' in our docker abstraction
[ https://issues.apache.org/jira/browse/MESOS-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Klues updated MESOS-5967: --- Description: Docker's command line tool for {{docker inspect}} can take either a {{container}}, an {{image}}, or a {{task}} as its argument, and return a JSON array containing low-level information about that container, image or task. However, the current {{docker inspect}} support in our docker abstraction only supports inspecting containers (not images or tasks). We should expand this to (at least) support images. In particular, this additional functionality is motivated by the upcoming GPU support, which needs to inspect the labels in a docker image to decide if it should inject the required Nvidia volumes into a container. was: Docker's command line tool for {{docker inspect}} can take either a {{container}}, an {{image}}, or a {{task}} as its argument, and return a JSON array containing low-level information about that {{container}}, {{image}} or {{task}}. However, the current {{docker inspect}} support in our docker abstraction only supports inspecting containers (not images or tasks). We should expand this support to images. In particular, this additional functionality is motivated by the upcoming GPU support, which needs to inspect the labels in a docker image to decide if it should inject the required Nvidia volumes into a container. > Add support for 'docker image inspect' in our docker abstraction > > > Key: MESOS-5967 > URL: https://issues.apache.org/jira/browse/MESOS-5967 > Project: Mesos > Issue Type: Improvement >Reporter: Kevin Klues >Assignee: Kevin Klues > Labels: gpu, mesosphere > Fix For: 1.1.0 > > > Docker's command line tool for {{docker inspect}} can take either a > {{container}}, an {{image}}, or a {{task}} as its argument, and return a JSON > array containing low-level information about that container, image or task. 
> However, the current {{docker inspect}} support in our docker abstraction > only supports inspecting containers (not images or tasks). We should expand > this to (at least) support images. > In particular, this additional functionality is motivated by the upcoming GPU > support, which needs to inspect the labels in a docker image to decide if it > should inject the required Nvidia volumes into a container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5967) Add support for 'docker image inspect' in our docker abstraction
[ https://issues.apache.org/jira/browse/MESOS-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Klues updated MESOS-5967: --- Description: Docker's command line tool for {{docker inspect}} can take either a {{container}}, an {{image}}, or a {{task}} as its argument, and return a JSON array containing low-level information about that {{container}}, {{image}} or {{task}}. However, the current {{docker inspect}} support in our docker abstraction only supports inspecting containers (not images or tasks). We should expand this support to images. In particular, this additional functionality is motivated by the upcoming GPU support, which needs to inspect the labels in a docker image to decide if it should inject the required Nvidia volumes into a container. was:Our current {{docker inspect}} support in our docker abstraction only supports inspecting containers (not images). We should expand this support to images. > Add support for 'docker image inspect' in our docker abstraction > > > Key: MESOS-5967 > URL: https://issues.apache.org/jira/browse/MESOS-5967 > Project: Mesos > Issue Type: Improvement >Reporter: Kevin Klues >Assignee: Kevin Klues > Labels: gpu, mesosphere > Fix For: 1.1.0 > > > Docker's command line tool for {{docker inspect}} can take either a > {{container}}, an {{image}}, or a {{task}} as its argument, and return a JSON > array containing low-level information about that {{container}}, {{image}} or > {{task}}. > However, the current {{docker inspect}} support in our docker abstraction > only supports inspecting containers (not images or tasks). We should expand > this support to images. > In particular, this additional functionality is motivated by the upcoming GPU > support, which needs to inspect the labels in a docker image to decide if it > should inject the required Nvidia volumes into a container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
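At the CLI level the requested behavior maps onto {{docker inspect --type=image <image>}}. A sketch of how an abstraction might assemble that command line (argument assembly only; actually shelling out and parsing the JSON array the command prints, which is where the GPU labels would be read, is omitted, and the image name in the test is just an example):

```cpp
#include <string>
#include <vector>

// Builds the argv for inspecting an image (as opposed to a container),
// mirroring `docker inspect --type=image <image>`. The labels needed by the
// GPU support would then be read from the JSON this command prints.
std::vector<std::string> imageInspectCommand(
    const std::string& dockerPath,
    const std::string& image) {
  return {dockerPath, "inspect", "--type=image", image};
}
```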
[jira] [Created] (MESOS-6315) `killtree` can accidentally kill containerizer / executor
Joris Van Remoortere created MESOS-6315: --- Summary: `killtree` can accidentally kill containerizer / executor Key: MESOS-6315 URL: https://issues.apache.org/jira/browse/MESOS-6315 Project: Mesos Issue Type: Bug Affects Versions: 1.0.0 Reporter: Joris Van Remoortere The implementation of killtree is buggy. [~jieyu] has some ideas. ltrace of mesos-local: {code} [pid 19501] [0x7f89d77a61ab] libmesos-1.1.0.so->kill(29985, SIGKILL) = 0 [pid 19501] [0x7f89d77a61ab] libmesos-1.1.0.so->kill(31349, SIGKILL [pid 31359] [0x] +++ killed by SIGKILL +++ [pid 31358] [0x] +++ killed by SIGKILL +++ [pid 31357] [0x] +++ killed by SIGKILL +++ [pid 31356] [0x] +++ killed by SIGKILL +++ [pid 31354] [0x] +++ killed by SIGKILL +++ [pid 31353] [0x] +++ killed by SIGKILL +++ [pid 31351] [0x] +++ killed by SIGKILL +++ [pid 31350] [0x] +++ killed by SIGKILL +++ [pid 19501] [0x7f89d77a61ab] <... kill resumed> ) = 0 [pid 19501] [0x7f89d77a61dd] libmesos-1.1.0.so->kill(29985, SIGCONT [pid 29985] [0x] +++ killed by SIGKILL +++ [pid 19493] [0x7f89d64ceda0] --- SIGCHLD (Child exited) --- [pid 31352] [0x] +++ killed by SIGKILL +++ [pid 31349] [0x] +++ killed by SIGKILL +++ [pid 19501] [0x7f89d77a61dd] <... kill resumed> ) = 0 [pid 19501] [0x7f89d77a61dd] libmesos-1.1.0.so->kill(31349, SIGCONT) = -1 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6315) `killtree` can accidentally kill containerizer / executor
[ https://issues.apache.org/jira/browse/MESOS-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15549602#comment-15549602 ] Joris Van Remoortere commented on MESOS-6315: - Since {{killtree}} is only used in the posix containerizer this is not a blocker. > `killtree` can accidentally kill containerizer / executor > - > > Key: MESOS-6315 > URL: https://issues.apache.org/jira/browse/MESOS-6315 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Joris Van Remoortere > > The implementation of killtree is buggy. [~jieyu] has some ideas. > ltrace of mesos-local: > {code} > [pid 19501] [0x7f89d77a61ab] libmesos-1.1.0.so->kill(29985, SIGKILL) > = 0 > [pid 19501] [0x7f89d77a61ab] libmesos-1.1.0.so->kill(31349, SIGKILL return ...> > [pid 31359] [0x] +++ killed by SIGKILL +++ > [pid 31358] [0x] +++ killed by SIGKILL +++ > [pid 31357] [0x] +++ killed by SIGKILL +++ > [pid 31356] [0x] +++ killed by SIGKILL +++ > [pid 31354] [0x] +++ killed by SIGKILL +++ > [pid 31353] [0x] +++ killed by SIGKILL +++ > [pid 31351] [0x] +++ killed by SIGKILL +++ > [pid 31350] [0x] +++ killed by SIGKILL +++ > [pid 19501] [0x7f89d77a61ab] <... kill resumed> ) > = 0 > [pid 19501] [0x7f89d77a61dd] libmesos-1.1.0.so->kill(29985, SIGCONT return ...> > [pid 29985] [0x] +++ killed by SIGKILL +++ > [pid 19493] [0x7f89d64ceda0] --- SIGCHLD (Child exited) --- > [pid 31352] [0x] +++ killed by SIGKILL +++ > [pid 31349] [0x] +++ killed by SIGKILL +++ > [pid 19501] [0x7f89d77a61dd] <... kill resumed> ) > = 0 > [pid 19501] [0x7f89d77a61dd] libmesos-1.1.0.so->kill(31349, SIGCONT) > = -1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6119) TCP health checks are not portable.
[ https://issues.apache.org/jira/browse/MESOS-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-6119: --- Shepherd: Till Toenshoff > TCP health checks are not portable. > --- > > Key: MESOS-6119 > URL: https://issues.apache.org/jira/browse/MESOS-6119 > Project: Mesos > Issue Type: Bug >Reporter: Alexander Rukletsov >Assignee: Alexander Rukletsov >Priority: Blocker > Labels: health-check, mesosphere > Fix For: 1.1.0 > > > MESOS-3567 introduced a dependency on "bash" for TCP health checks, which is > undesirable. We should implement a portable solution for TCP health checks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
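The bash dependency from MESOS-3567 came from the {{/dev/tcp}} redirection trick; a portable replacement is a plain {{connect(2)}} attempt. A minimal POSIX sketch (error reporting and timeouts omitted; the real implementation would also need a Winsock counterpart, which is the point of the ticket):

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cstdint>
#include <string>

// Returns true if a TCP connection to ip:port succeeds, which is all a TCP
// health check needs to verify. Unlike the bash `/dev/tcp` trick, this uses
// only the BSD sockets API, so it works on any POSIX system.
bool tcpHealthCheck(const std::string& ip, uint16_t port) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) {
    return false;
  }

  sockaddr_in addr = {};
  addr.sin_family = AF_INET;
  addr.sin_port = htons(port);
  if (inet_pton(AF_INET, ip.c_str(), &addr.sin_addr) != 1) {
    close(fd);
    return false;
  }

  bool healthy =
    connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0;
  close(fd);
  return healthy;
}
```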
[jira] [Comment Edited] (MESOS-6279) Add test cases for the TCP health check
[ https://issues.apache.org/jira/browse/MESOS-6279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15533465#comment-15533465 ] haosdent edited comment on MESOS-6279 at 10/5/16 5:27 PM: -- | Added test cases for TCP health check. | https://reviews.apache.org/r/52251/ | was (Author: haosd...@gmail.com): | Added test case `HealthCheckTest.HealthyTaskViaTCP`. | https://reviews.apache.org/r/52251/ | | Added test case `HealthCheckTest.ROOT_DOCKER_DockerHealthyTaskViaTCP`. | https://reviews.apache.org/r/52253/ | | Added test case `ROOT_HealthyTaskViaTCPWithContainerImage`. | https://reviews.apache.org/r/52558/ | > Add test cases for the TCP health check > --- > > Key: MESOS-6279 > URL: https://issues.apache.org/jira/browse/MESOS-6279 > Project: Mesos > Issue Type: Task > Components: tests >Reporter: haosdent >Assignee: haosdent > Labels: health-check, mesosphere, test > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-6278) Add test cases for the HTTP health checks
[ https://issues.apache.org/jira/browse/MESOS-6278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15533460#comment-15533460 ] haosdent edited comment on MESOS-6278 at 10/5/16 5:26 PM: -- | Added test cases for HTTP health check. | https://reviews.apache.org/r/52250/ | was (Author: haosd...@gmail.com): | Added test case `HealthCheckTest.HealthyTaskViaHTTP`. | https://reviews.apache.org/r/52250/ | | Added test case `HealthCheckTest.ROOT_DOCKER_DockerHealthyTaskViaHTTP`. | https://reviews.apache.org/r/52252/ | | Added test case `ROOT_HealthyTaskViaHTTPWithContainerImage`. | https://reviews.apache.org/r/52557/ | > Add test cases for the HTTP health checks > - > > Key: MESOS-6278 > URL: https://issues.apache.org/jira/browse/MESOS-6278 > Project: Mesos > Issue Type: Task > Components: tests >Reporter: haosdent >Assignee: haosdent > Labels: health-check, mesosphere, test > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-6279) Add test cases for the TCP health check
[ https://issues.apache.org/jira/browse/MESOS-6279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15533465#comment-15533465 ] haosdent edited comment on MESOS-6279 at 10/5/16 4:12 PM: -- | Added test case `HealthCheckTest.HealthyTaskViaTCP`. | https://reviews.apache.org/r/52251/ | | Added test case `HealthCheckTest.ROOT_DOCKER_DockerHealthyTaskViaTCP`. | https://reviews.apache.org/r/52253/ | | Added test case `ROOT_HealthyTaskViaTCPWithContainerImage`. | https://reviews.apache.org/r/52558/ | was (Author: haosd...@gmail.com): | Added test case `HealthCheckTest.HealthyTaskViaTCP`. | https://reviews.apache.org/r/52251/ | | Added test case `HealthCheckTest.ROOT_DOCKER_DockerHealthyTaskViaTCP`. | https://reviews.apache.org/r/52253/ | > Add test cases for the TCP health check > --- > > Key: MESOS-6279 > URL: https://issues.apache.org/jira/browse/MESOS-6279 > Project: Mesos > Issue Type: Task > Components: tests >Reporter: haosdent >Assignee: haosdent > Labels: health-check, mesosphere, test > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-6278) Add test cases for the HTTP health checks
[ https://issues.apache.org/jira/browse/MESOS-6278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15533460#comment-15533460 ] haosdent edited comment on MESOS-6278 at 10/5/16 4:11 PM: -- | Added test case `HealthCheckTest.HealthyTaskViaHTTP`. | https://reviews.apache.org/r/52250/ | | Added test case `HealthCheckTest.ROOT_DOCKER_DockerHealthyTaskViaHTTP`. | https://reviews.apache.org/r/52252/ | | Added test case `ROOT_HealthyTaskViaHTTPWithContainerImage`. | https://reviews.apache.org/r/52557/ | was (Author: haosd...@gmail.com): | Added test case `HealthCheckTest.HealthyTaskViaHTTP`. | https://reviews.apache.org/r/52250/ | | Added test case `HealthCheckTest.ROOT_DOCKER_DockerHealthyTaskViaHTTP`. | https://reviews.apache.org/r/52252/ | > Add test cases for the HTTP health checks > - > > Key: MESOS-6278 > URL: https://issues.apache.org/jira/browse/MESOS-6278 > Project: Mesos > Issue Type: Task > Components: tests >Reporter: haosdent >Assignee: haosdent > Labels: health-check, mesosphere, test > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-6264) Investigate the high memory usage of the default executor.
[ https://issues.apache.org/jira/browse/MESOS-6264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15549111#comment-15549111 ] Joris Van Remoortere edited comment on MESOS-6264 at 10/5/16 3:45 PM: -- cc [~vinodkone][~jieyu] The bulk of this comes from loading in {{libmesos.so}}. We do this because the autoconf build treats libmesos as a dynamic dependency. Since we load libmesos dynamically, there is no chance for the linker to strip unused code. This means that all of the code in libmesos regardless of use gets loaded into resident memory. In contrast the cmake build generates a static library for {{libmesos.a}}. This is then used to build the {{mesos-executor}} binary without a dynamic dependency on libmesos. The benefit of this approach is that the linker is able to strip out all unused code. In an optimized build this is {{~10MB}}. Some approaches for the quick win are: # Consider using the cmake build. This only needs to be modified slightly to strip symbols from the final executor binary {{-s}}. # Modify the autoconf build to build a {{libmesos.a}} so that we can statically link it into the {{mesos-executor}} binary and allow the linker to strip unused code. Regardless of the above approach, {{libmesos}} would still be by far the largest contributor of the {{RSS}}. This is for 2 reasons: # Much of our code is structured such that the linker can't determine if it is unused. We would need to adjust our patterns such that the unused code analyzer can do a better job. # Much of our code is {{inlined}} or written such that it can't be optimized. 2 examples are: ## https://github.com/apache/mesos/blob/9beb8eae6408249cdb3e2f16ba68b31a00d3452c/3rdparty/libprocess/include/process/mime.hpp#L35-L154 This code could be moved to a {{.cpp}} file and should be a {{static const std::unordered_map}} that we {{insert(begin(), end())}} into {{types}}. This would reduce the size of libmesos by {{~20KB}}! 
## https://github.com/apache/mesos/blob/9beb8eae6408249cdb3e2f16ba68b31a00d3452c/3rdparty/libprocess/include/process/http.hpp#L453-L517 This code and sibling {{struct Request}} have auto-generated {{inlined}} destructors. These are very expensive. Just declaring and then defining in the {{.cpp}} the default destructor can remove another {{~20KB}} each from libmesos. There are plenty of other opportunities like this scattered through the codebase. It's work to find them and the returns are small for each, but they add up to much of the {{9MB}} left over. was (Author: jvanremoortere): cc [~vinodkone][~jieyu] The bulk of this comes from loading in {{libmesos.so}}. We do this because the autoconf build treats libmesos as a dynamic dependency. Since we load libmesos dynamically, there is no chance for the linker to strip unused code. This means that all of the code in libmesos regardless of use gets loaded into resident memory. In contrast the cmake build generates a static library for {{libmesos.a}}. This is then used to build the {{mesos-executor}} binary without a dynamic dependency on libmesos. The benefit of this approach is that the linker is able to strip out all unused code. In an optimized build this is {{~10MB}}. Some approaches for the quick win are: # Consider using the cmake build. This only needs to be modified slightly to strip symbols from the final executor binary {{-s}}. # Modify the autoconf build to build a {{libmesos.a}} so that we can statically link it into the {{mesos-executor}} binary and allow the linker to strip unused code. Regardless of the above approach, {{libmesos}} would still be by far the largest contributor of the {{RSS}}. This is for 2 reasons: # Much of our code is structured such that the linker can't determine if it is unused. We would need to adjust our patterns such that the unused code analyzer can do a better job. # Much of our code is {{inlined}} or written such that it can't be optimized. 
2 examples are: ## https://github.com/apache/mesos/blob/9beb8eae6408249cdb3e2f16ba68b31a00d3452c/3rdparty/libprocess/include/process/mime.hpp#L35-L154 This code could be moved to a {{.cpp}} file and should be a {{static const std::unordered_map}} that we {{insert(begin(), end())}} into {{types}}. This would reduce the size of libmesos by {{~20KB}}! ## https://github.com/apache/mesos/blob/master/3rdparty/libprocess/include/process/http.hpp#L453-L517 This code and the sibling {{struct Request}} have auto-generated {{inlined}} destructors. These are very expensive. Just declaring the default destructor, and then defining it in the {{.cpp}}, can remove another {{~20KB}} each from libmesos. There are plenty of other opportunities like this scattered through the codebase. It's work to find them and the returns are small for each, but they add up to much of the {{9MB}} left over. > Investigate the high memory usage of the default
[jira] [Commented] (MESOS-6264) Investigate the high memory usage of the default executor.
[ https://issues.apache.org/jira/browse/MESOS-6264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15549111#comment-15549111 ] Joris Van Remoortere commented on MESOS-6264: - cc [~vinodkone][~jieyu] The bulk of this comes from loading in {{libmesos.so}}. We do this because the autoconf build treats libmesos as a dynamic dependency. Since we load libmesos dynamically, there is no chance for the linker to strip unused code. This means that all of the code in libmesos, regardless of use, gets loaded into resident memory. In contrast, the cmake build generates a static library, {{libmesos.a}}. This is then used to build the {{mesos-executor}} binary without a dynamic dependency on libmesos. The benefit of this approach is that the linker is able to strip out all unused code. In an optimized build this saves {{~10MB}}. Some approaches for the quick win are: # Consider using the cmake build. This only needs to be modified slightly to strip symbols from the final executor binary ({{-s}}). # Modify the autoconf build to produce a {{libmesos.a}} so that we can statically link it into the {{mesos-executor}} binary and allow the linker to strip unused code. Regardless of the approach, {{libmesos}} would still be by far the largest contributor to the {{RSS}}. This is for 2 reasons: # Much of our code is structured such that the linker can't determine whether it is unused. We would need to adjust our patterns so that the unused-code analysis can do a better job. # Much of our code is {{inlined}} or written such that it can't be optimized. 2 examples are: ## https://github.com/apache/mesos/blob/9beb8eae6408249cdb3e2f16ba68b31a00d3452c/3rdparty/libprocess/include/process/mime.hpp#L35-L154 This code could be moved to a {{.cpp}} file and should be a {{static const std::unordered_map}} that we {{insert(begin(), end())}} into {{types}}. This would reduce the size of libmesos by {{~20KB}}! 
## https://github.com/apache/mesos/blob/master/3rdparty/libprocess/include/process/http.hpp#L453-L517 This code and the sibling {{struct Request}} have auto-generated {{inlined}} destructors. These are very expensive. Just declaring the default destructor, and then defining it in the {{.cpp}}, can remove another {{~20KB}} each from libmesos. There are plenty of other opportunities like this scattered through the codebase. It's work to find them and the returns are small for each, but they add up to much of the {{9MB}} left over. > Investigate the high memory usage of the default executor. > -- > > Key: MESOS-6264 > URL: https://issues.apache.org/jira/browse/MESOS-6264 > Project: Mesos > Issue Type: Bug >Reporter: Anand Mazumdar > Labels: mesosphere > Fix For: 1.1.0 > > Attachments: pmap_output_for_the_default_executor.txt > > > It seems that a default executor with two sleep tasks is using ~32 mb on > average and can sometimes lead to it being killed for some tests like > {{SlaveRecoveryTest/0.ROOT_CGROUPS_ReconnectDefaultExecutor}} on our internal > CI. Attached the {{pmap}} output for the default executor. Please note that > the command executor memory usage is also pretty high (~26 mb). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
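Both header-bloat fixes suggested in the comment above can be sketched in isolation. This is a hedged illustration with hypothetical names and a single file standing in for the header/source split; the real candidates are {{types}} in process/mime.hpp and the {{Request}}/{{Response}} structs in process/http.hpp:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// --- mime.hpp (sketch): declaration only. Includers see just the
// declaration, so the map's construction code is not inlined into
// every translation unit.
namespace mime {
extern const std::unordered_map<std::string, std::string> types;
}

// --- mime.cpp (sketch): the single definition; the initializer code is
// emitted in exactly one object file.
namespace mime {
const std::unordered_map<std::string, std::string> types = {
    {".html", "text/html"},
    {".json", "application/json"},
};
}

// --- http.hpp (sketch): declare the destructor instead of letting the
// compiler auto-generate an inline one wherever the type is used.
struct Response {
  ~Response();  // out-of-line; defaulted in the .cpp below
  std::string body;
};

// --- http.cpp (sketch): default the destructor in one place.
Response::~Response() = default;
```

Whether each pattern actually saves the quoted ~20KB depends on the compiler and optimization level; the mechanism is simply that the generated code exists once instead of once per translation unit.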
[jira] [Updated] (MESOS-6278) Add test cases for the HTTP health checks
[ https://issues.apache.org/jira/browse/MESOS-6278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-6278: --- Shepherd: Alexander Rukletsov Sprint: Mesosphere Sprint 44 Story Points: 3 Target Version/s: 1.1.0 Labels: health-check mesosphere test (was: health-check test) Component/s: tests > Add test cases for the HTTP health checks > - > > Key: MESOS-6278 > URL: https://issues.apache.org/jira/browse/MESOS-6278 > Project: Mesos > Issue Type: Task > Components: tests >Reporter: haosdent >Assignee: haosdent > Labels: health-check, mesosphere, test > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6207) Python bindings fail to build with custom SVN installation path
[ https://issues.apache.org/jira/browse/MESOS-6207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Toenshoff updated MESOS-6207: -- Shepherd: Till Toenshoff > Python bindings fail to build with custom SVN installation path > --- > > Key: MESOS-6207 > URL: https://issues.apache.org/jira/browse/MESOS-6207 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: 1.0.1 >Reporter: Ilya Pronin >Assignee: Ilya Pronin >Priority: Trivial > > In {{src/Makefile.am}} {{PYTHON_LDFLAGS}} variable is used while building > Python bindings. This variable picks {{LDFLAGS}} during configuration phase > before we check for custom SVN installation path and misses > {{-L$\{with_svn\}/lib}} flag. That causes a link error on systems with > uncommon SVN installation path. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6207) Python bindings fail to build with custom SVN installation path
[ https://issues.apache.org/jira/browse/MESOS-6207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15548754#comment-15548754 ] Till Toenshoff commented on MESOS-6207: --- Thanks for your patience, Ilya - I have taken over the shepherding after Vinod nagged me enough ;). > Python bindings fail to build with custom SVN installation path > --- > > Key: MESOS-6207 > URL: https://issues.apache.org/jira/browse/MESOS-6207 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: 1.0.1 >Reporter: Ilya Pronin >Assignee: Ilya Pronin >Priority: Trivial > > In {{src/Makefile.am}} {{PYTHON_LDFLAGS}} variable is used while building > Python bindings. This variable picks {{LDFLAGS}} during configuration phase > before we check for custom SVN installation path and misses > {{-L$\{with_svn\}/lib}} flag. That causes a link error on systems with > uncommon SVN installation path. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6279) Add test cases for the TCP health check
[ https://issues.apache.org/jira/browse/MESOS-6279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-6279: --- Shepherd: Alexander Rukletsov Sprint: Mesosphere Sprint 44 Story Points: 3 Target Version/s: 1.1.0 Labels: health-check mesosphere test (was: health-check) Component/s: tests > Add test cases for the TCP health check > --- > > Key: MESOS-6279 > URL: https://issues.apache.org/jira/browse/MESOS-6279 > Project: Mesos > Issue Type: Task > Components: tests >Reporter: haosdent >Assignee: haosdent > Labels: health-check, mesosphere, test > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6247) Enable Framework to set weight
[ https://issues.apache.org/jira/browse/MESOS-6247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15548683#comment-15548683 ] Klaus Ma commented on MESOS-6247: - [~jvanremoortere], yes, they cannot share the reserved resources with each other when they are in different roles. As for the weight, it's better to let Mesos allocate resources within a role: other frameworks, e.g. Storm, may be deployed in this environment, and it would be a huge amount of work to modify those frameworks one by one. I agree with you and BenM that hierarchical roles are the long-term solution, but any suggestion on the target date? BTW, what about other users' scenarios with multiple frameworks? > Enable Framework to set weight > -- > > Key: MESOS-6247 > URL: https://issues.apache.org/jira/browse/MESOS-6247 > Project: Mesos > Issue Type: Bug > Components: allocation > Environment: all >Reporter: Klaus Ma >Priority: Critical > > We'd like to enable a framework's weight when it registers, so the framework can > share resources based on weight within the same role. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1653) HealthCheckTest.GracePeriod is flaky.
[ https://issues.apache.org/jira/browse/MESOS-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gastón Kleiman updated MESOS-1653: -- Shepherd: Alexander Rukletsov Sprint: Mesosphere Sprint 44 Target Version/s: 1.1.0 > HealthCheckTest.GracePeriod is flaky. > - > > Key: MESOS-1653 > URL: https://issues.apache.org/jira/browse/MESOS-1653 > Project: Mesos > Issue Type: Bug > Components: test >Reporter: Benjamin Mahler >Assignee: Gastón Kleiman > Labels: flaky, health-check, mesosphere > > {noformat} > [--] 3 tests from HealthCheckTest > [ RUN ] HealthCheckTest.GracePeriod > Using temporary directory '/tmp/HealthCheckTest_GracePeriod_d7zCPr' > I0729 17:10:10.484951 1176 leveldb.cpp:176] Opened db in 28.883552ms > I0729 17:10:10.499487 1176 leveldb.cpp:183] Compacted db in 13.674118ms > I0729 17:10:10.500200 1176 leveldb.cpp:198] Created db iterator in 7394ns > I0729 17:10:10.500692 1176 leveldb.cpp:204] Seeked to beginning of db in > 2317ns > I0729 17:10:10.501113 1176 leveldb.cpp:273] Iterated through 0 keys in the > db in 1367ns > I0729 17:10:10.501535 1176 replica.cpp:741] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0729 17:10:10.502233 1212 recover.cpp:425] Starting replica recovery > I0729 17:10:10.502295 1212 recover.cpp:451] Replica is in EMPTY status > I0729 17:10:10.502825 1212 replica.cpp:638] Replica in EMPTY status received > a broadcasted recover request > I0729 17:10:10.502877 1212 recover.cpp:188] Received a recover response from > a replica in EMPTY status > I0729 17:10:10.502980 1212 recover.cpp:542] Updating replica status to > STARTING > I0729 17:10:10.508482 1213 master.cpp:289] Master > 20140729-171010-16842879-54701-1176 (trusty) started on 127.0.1.1:54701 > I0729 17:10:10.508607 1213 master.cpp:326] Master only allowing > authenticated frameworks to register > I0729 17:10:10.508632 1213 master.cpp:331] Master only allowing > authenticated slaves to register > I0729 17:10:10.508656 1213 
credentials.hpp:36] Loading credentials for > authentication from '/tmp/HealthCheckTest_GracePeriod_d7zCPr/credentials' > I0729 17:10:10.509407 1213 master.cpp:360] Authorization enabled > I0729 17:10:10.510030 1207 hierarchical_allocator_process.hpp:301] > Initializing hierarchical allocator process with master : > master@127.0.1.1:54701 > I0729 17:10:10.510113 1207 master.cpp:123] No whitelist given. Advertising > offers for all slaves > I0729 17:10:10.511699 1213 master.cpp:1129] The newly elected leader is > master@127.0.1.1:54701 with id 20140729-171010-16842879-54701-1176 > I0729 17:10:10.512230 1213 master.cpp:1142] Elected as the leading master! > I0729 17:10:10.512692 1213 master.cpp:960] Recovering from registrar > I0729 17:10:10.513226 1210 registrar.cpp:313] Recovering registrar > I0729 17:10:10.516006 1212 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 12.946461ms > I0729 17:10:10.516047 1212 replica.cpp:320] Persisted replica status to > STARTING > I0729 17:10:10.516129 1212 recover.cpp:451] Replica is in STARTING status > I0729 17:10:10.516520 1212 replica.cpp:638] Replica in STARTING status > received a broadcasted recover request > I0729 17:10:10.516592 1212 recover.cpp:188] Received a recover response from > a replica in STARTING status > I0729 17:10:10.516767 1212 recover.cpp:542] Updating replica status to VOTING > I0729 17:10:10.528376 1212 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 11.537102ms > I0729 17:10:10.528430 1212 replica.cpp:320] Persisted replica status to > VOTING > I0729 17:10:10.528501 1212 recover.cpp:556] Successfully joined the Paxos > group > I0729 17:10:10.528565 1212 recover.cpp:440] Recover process terminated > I0729 17:10:10.528700 1212 log.cpp:656] Attempting to start the writer > I0729 17:10:10.528960 1212 replica.cpp:474] Replica received implicit > promise request with proposal 1 > I0729 17:10:10.537821 1212 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 
8.830863ms > I0729 17:10:10.537869 1212 replica.cpp:342] Persisted promised to 1 > I0729 17:10:10.540550 1209 coordinator.cpp:230] Coordinator attemping to > fill missing position > I0729 17:10:10.540856 1209 replica.cpp:375] Replica received explicit > promise request for position 0 with proposal 2 > I0729 17:10:10.547430 1209 leveldb.cpp:343] Persisting action (8 bytes) to > leveldb took 6.548344ms > I0729 17:10:10.547471 1209 replica.cpp:676] Persisted action at 0 > I0729 17:10:10.547732 1209 replica.cpp:508] Replica received write request > for position 0 > I0729 17:10:10.547765 1209 leveldb.cpp:438] Reading position from leveldb > took 15676ns > I0729 17:10:10.557169 1209 leveldb.cpp:343] Persisting action
[jira] [Commented] (MESOS-6247) Enable Framework to set weight
[ https://issues.apache.org/jira/browse/MESOS-6247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15548652#comment-15548652 ] Joris Van Remoortere commented on MESOS-6247: - [~klaus1982] Do you mean they cannot share reserved resources with each other? If they are in the same role, they are supposed to be co-operative. At that point, why does the weight matter? They should both be yielding all unavailable resources to each other. If we add support for weights now, it will make it *even* harder to move people into the hierarchical role world described by benm. It seems like the frameworks co-operating (as they should per the contract of sharing a role) is the right temporary solution for you. > Enable Framework to set weight > -- > > Key: MESOS-6247 > URL: https://issues.apache.org/jira/browse/MESOS-6247 > Project: Mesos > Issue Type: Bug > Components: allocation > Environment: all >Reporter: Klaus Ma >Priority: Critical > > We'd like to enable a framework's weight when it registers, so the framework can > share resources based on weight within the same role. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
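For context on how weights behave at the role level (the behavior this ticket asks to extend to frameworks within a role), here is a hedged sketch; {{weightedShare}} is an illustrative name, not Mesos' actual sorter code:

```cpp
// Sketch of the weighted-DRF idea: a client's dominant share is divided
// by its weight before sorting. Clients with the smaller weighted share
// are allocated to first, so a higher weight means earlier/more offers.
double weightedShare(double dominantShare, double weight) {
  return dominantShare / weight;
}
```

With equal usage, a weight-2 role thus sorts ahead of a weight-1 role and keeps receiving offers until its weighted share catches up.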
[jira] [Updated] (MESOS-1653) HealthCheckTest.GracePeriod is flaky.
[ https://issues.apache.org/jira/browse/MESOS-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] haosdent updated MESOS-1653: Assignee: Gastón Kleiman (was: haosdent) > HealthCheckTest.GracePeriod is flaky. > - > > Key: MESOS-1653 > URL: https://issues.apache.org/jira/browse/MESOS-1653 > Project: Mesos > Issue Type: Bug > Components: test >Reporter: Benjamin Mahler >Assignee: Gastón Kleiman > Labels: flaky, health-check, mesosphere > > {noformat} > [--] 3 tests from HealthCheckTest > [ RUN ] HealthCheckTest.GracePeriod > Using temporary directory '/tmp/HealthCheckTest_GracePeriod_d7zCPr' > I0729 17:10:10.484951 1176 leveldb.cpp:176] Opened db in 28.883552ms > I0729 17:10:10.499487 1176 leveldb.cpp:183] Compacted db in 13.674118ms > I0729 17:10:10.500200 1176 leveldb.cpp:198] Created db iterator in 7394ns > I0729 17:10:10.500692 1176 leveldb.cpp:204] Seeked to beginning of db in > 2317ns > I0729 17:10:10.501113 1176 leveldb.cpp:273] Iterated through 0 keys in the > db in 1367ns > I0729 17:10:10.501535 1176 replica.cpp:741] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0729 17:10:10.502233 1212 recover.cpp:425] Starting replica recovery > I0729 17:10:10.502295 1212 recover.cpp:451] Replica is in EMPTY status > I0729 17:10:10.502825 1212 replica.cpp:638] Replica in EMPTY status received > a broadcasted recover request > I0729 17:10:10.502877 1212 recover.cpp:188] Received a recover response from > a replica in EMPTY status > I0729 17:10:10.502980 1212 recover.cpp:542] Updating replica status to > STARTING > I0729 17:10:10.508482 1213 master.cpp:289] Master > 20140729-171010-16842879-54701-1176 (trusty) started on 127.0.1.1:54701 > I0729 17:10:10.508607 1213 master.cpp:326] Master only allowing > authenticated frameworks to register > I0729 17:10:10.508632 1213 master.cpp:331] Master only allowing > authenticated slaves to register > I0729 17:10:10.508656 1213 credentials.hpp:36] Loading credentials for > 
authentication from '/tmp/HealthCheckTest_GracePeriod_d7zCPr/credentials' > I0729 17:10:10.509407 1213 master.cpp:360] Authorization enabled > I0729 17:10:10.510030 1207 hierarchical_allocator_process.hpp:301] > Initializing hierarchical allocator process with master : > master@127.0.1.1:54701 > I0729 17:10:10.510113 1207 master.cpp:123] No whitelist given. Advertising > offers for all slaves > I0729 17:10:10.511699 1213 master.cpp:1129] The newly elected leader is > master@127.0.1.1:54701 with id 20140729-171010-16842879-54701-1176 > I0729 17:10:10.512230 1213 master.cpp:1142] Elected as the leading master! > I0729 17:10:10.512692 1213 master.cpp:960] Recovering from registrar > I0729 17:10:10.513226 1210 registrar.cpp:313] Recovering registrar > I0729 17:10:10.516006 1212 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 12.946461ms > I0729 17:10:10.516047 1212 replica.cpp:320] Persisted replica status to > STARTING > I0729 17:10:10.516129 1212 recover.cpp:451] Replica is in STARTING status > I0729 17:10:10.516520 1212 replica.cpp:638] Replica in STARTING status > received a broadcasted recover request > I0729 17:10:10.516592 1212 recover.cpp:188] Received a recover response from > a replica in STARTING status > I0729 17:10:10.516767 1212 recover.cpp:542] Updating replica status to VOTING > I0729 17:10:10.528376 1212 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 11.537102ms > I0729 17:10:10.528430 1212 replica.cpp:320] Persisted replica status to > VOTING > I0729 17:10:10.528501 1212 recover.cpp:556] Successfully joined the Paxos > group > I0729 17:10:10.528565 1212 recover.cpp:440] Recover process terminated > I0729 17:10:10.528700 1212 log.cpp:656] Attempting to start the writer > I0729 17:10:10.528960 1212 replica.cpp:474] Replica received implicit > promise request with proposal 1 > I0729 17:10:10.537821 1212 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 8.830863ms > I0729 17:10:10.537869 1212 
replica.cpp:342] Persisted promised to 1 > I0729 17:10:10.540550 1209 coordinator.cpp:230] Coordinator attemping to > fill missing position > I0729 17:10:10.540856 1209 replica.cpp:375] Replica received explicit > promise request for position 0 with proposal 2 > I0729 17:10:10.547430 1209 leveldb.cpp:343] Persisting action (8 bytes) to > leveldb took 6.548344ms > I0729 17:10:10.547471 1209 replica.cpp:676] Persisted action at 0 > I0729 17:10:10.547732 1209 replica.cpp:508] Replica received write request > for position 0 > I0729 17:10:10.547765 1209 leveldb.cpp:438] Reading position from leveldb > took 15676ns > I0729 17:10:10.557169 1209 leveldb.cpp:343] Persisting action (14 bytes) to > leveldb took 9.373798ms > I0729 17:10:10.557241 1209
[jira] [Issue Comment Deleted] (MESOS-1653) HealthCheckTest.GracePeriod is flaky.
[ https://issues.apache.org/jira/browse/MESOS-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] haosdent updated MESOS-1653: Comment: was deleted (was: Patch: https://reviews.apache.org/r/47089/) > HealthCheckTest.GracePeriod is flaky. > - > > Key: MESOS-1653 > URL: https://issues.apache.org/jira/browse/MESOS-1653 > Project: Mesos > Issue Type: Bug > Components: test >Reporter: Benjamin Mahler >Assignee: haosdent > Labels: flaky, health-check, mesosphere > > {noformat} > [--] 3 tests from HealthCheckTest > [ RUN ] HealthCheckTest.GracePeriod > Using temporary directory '/tmp/HealthCheckTest_GracePeriod_d7zCPr' > I0729 17:10:10.484951 1176 leveldb.cpp:176] Opened db in 28.883552ms > I0729 17:10:10.499487 1176 leveldb.cpp:183] Compacted db in 13.674118ms > I0729 17:10:10.500200 1176 leveldb.cpp:198] Created db iterator in 7394ns > I0729 17:10:10.500692 1176 leveldb.cpp:204] Seeked to beginning of db in > 2317ns > I0729 17:10:10.501113 1176 leveldb.cpp:273] Iterated through 0 keys in the > db in 1367ns > I0729 17:10:10.501535 1176 replica.cpp:741] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0729 17:10:10.502233 1212 recover.cpp:425] Starting replica recovery > I0729 17:10:10.502295 1212 recover.cpp:451] Replica is in EMPTY status > I0729 17:10:10.502825 1212 replica.cpp:638] Replica in EMPTY status received > a broadcasted recover request > I0729 17:10:10.502877 1212 recover.cpp:188] Received a recover response from > a replica in EMPTY status > I0729 17:10:10.502980 1212 recover.cpp:542] Updating replica status to > STARTING > I0729 17:10:10.508482 1213 master.cpp:289] Master > 20140729-171010-16842879-54701-1176 (trusty) started on 127.0.1.1:54701 > I0729 17:10:10.508607 1213 master.cpp:326] Master only allowing > authenticated frameworks to register > I0729 17:10:10.508632 1213 master.cpp:331] Master only allowing > authenticated slaves to register > I0729 17:10:10.508656 1213 credentials.hpp:36] Loading 
credentials for > authentication from '/tmp/HealthCheckTest_GracePeriod_d7zCPr/credentials' > I0729 17:10:10.509407 1213 master.cpp:360] Authorization enabled > I0729 17:10:10.510030 1207 hierarchical_allocator_process.hpp:301] > Initializing hierarchical allocator process with master : > master@127.0.1.1:54701 > I0729 17:10:10.510113 1207 master.cpp:123] No whitelist given. Advertising > offers for all slaves > I0729 17:10:10.511699 1213 master.cpp:1129] The newly elected leader is > master@127.0.1.1:54701 with id 20140729-171010-16842879-54701-1176 > I0729 17:10:10.512230 1213 master.cpp:1142] Elected as the leading master! > I0729 17:10:10.512692 1213 master.cpp:960] Recovering from registrar > I0729 17:10:10.513226 1210 registrar.cpp:313] Recovering registrar > I0729 17:10:10.516006 1212 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 12.946461ms > I0729 17:10:10.516047 1212 replica.cpp:320] Persisted replica status to > STARTING > I0729 17:10:10.516129 1212 recover.cpp:451] Replica is in STARTING status > I0729 17:10:10.516520 1212 replica.cpp:638] Replica in STARTING status > received a broadcasted recover request > I0729 17:10:10.516592 1212 recover.cpp:188] Received a recover response from > a replica in STARTING status > I0729 17:10:10.516767 1212 recover.cpp:542] Updating replica status to VOTING > I0729 17:10:10.528376 1212 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 11.537102ms > I0729 17:10:10.528430 1212 replica.cpp:320] Persisted replica status to > VOTING > I0729 17:10:10.528501 1212 recover.cpp:556] Successfully joined the Paxos > group > I0729 17:10:10.528565 1212 recover.cpp:440] Recover process terminated > I0729 17:10:10.528700 1212 log.cpp:656] Attempting to start the writer > I0729 17:10:10.528960 1212 replica.cpp:474] Replica received implicit > promise request with proposal 1 > I0729 17:10:10.537821 1212 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 8.830863ms > I0729 17:10:10.537869 
1212 replica.cpp:342] Persisted promised to 1 > I0729 17:10:10.540550 1209 coordinator.cpp:230] Coordinator attemping to > fill missing position > I0729 17:10:10.540856 1209 replica.cpp:375] Replica received explicit > promise request for position 0 with proposal 2 > I0729 17:10:10.547430 1209 leveldb.cpp:343] Persisting action (8 bytes) to > leveldb took 6.548344ms > I0729 17:10:10.547471 1209 replica.cpp:676] Persisted action at 0 > I0729 17:10:10.547732 1209 replica.cpp:508] Replica received write request > for position 0 > I0729 17:10:10.547765 1209 leveldb.cpp:438] Reading position from leveldb > took 15676ns > I0729 17:10:10.557169 1209 leveldb.cpp:343] Persisting action (14 bytes) to > leveldb took 9.373798ms > I0729
[jira] [Commented] (MESOS-1653) HealthCheckTest.GracePeriod is flaky.
[ https://issues.apache.org/jira/browse/MESOS-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15548644#comment-15548644 ] Gastón Kleiman commented on MESOS-1653: --- Patch: https://reviews.apache.org/r/52432/ > HealthCheckTest.GracePeriod is flaky. > - > > Key: MESOS-1653 > URL: https://issues.apache.org/jira/browse/MESOS-1653 > Project: Mesos > Issue Type: Bug > Components: test >Reporter: Benjamin Mahler >Assignee: haosdent > Labels: flaky, health-check, mesosphere > > {noformat} > [--] 3 tests from HealthCheckTest > [ RUN ] HealthCheckTest.GracePeriod > Using temporary directory '/tmp/HealthCheckTest_GracePeriod_d7zCPr' > I0729 17:10:10.484951 1176 leveldb.cpp:176] Opened db in 28.883552ms > I0729 17:10:10.499487 1176 leveldb.cpp:183] Compacted db in 13.674118ms > I0729 17:10:10.500200 1176 leveldb.cpp:198] Created db iterator in 7394ns > I0729 17:10:10.500692 1176 leveldb.cpp:204] Seeked to beginning of db in > 2317ns > I0729 17:10:10.501113 1176 leveldb.cpp:273] Iterated through 0 keys in the > db in 1367ns > I0729 17:10:10.501535 1176 replica.cpp:741] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0729 17:10:10.502233 1212 recover.cpp:425] Starting replica recovery > I0729 17:10:10.502295 1212 recover.cpp:451] Replica is in EMPTY status > I0729 17:10:10.502825 1212 replica.cpp:638] Replica in EMPTY status received > a broadcasted recover request > I0729 17:10:10.502877 1212 recover.cpp:188] Received a recover response from > a replica in EMPTY status > I0729 17:10:10.502980 1212 recover.cpp:542] Updating replica status to > STARTING > I0729 17:10:10.508482 1213 master.cpp:289] Master > 20140729-171010-16842879-54701-1176 (trusty) started on 127.0.1.1:54701 > I0729 17:10:10.508607 1213 master.cpp:326] Master only allowing > authenticated frameworks to register > I0729 17:10:10.508632 1213 master.cpp:331] Master only allowing > authenticated slaves to register > I0729 17:10:10.508656 1213 
credentials.hpp:36] Loading credentials for > authentication from '/tmp/HealthCheckTest_GracePeriod_d7zCPr/credentials' > I0729 17:10:10.509407 1213 master.cpp:360] Authorization enabled > I0729 17:10:10.510030 1207 hierarchical_allocator_process.hpp:301] > Initializing hierarchical allocator process with master : > master@127.0.1.1:54701 > I0729 17:10:10.510113 1207 master.cpp:123] No whitelist given. Advertising > offers for all slaves > I0729 17:10:10.511699 1213 master.cpp:1129] The newly elected leader is > master@127.0.1.1:54701 with id 20140729-171010-16842879-54701-1176 > I0729 17:10:10.512230 1213 master.cpp:1142] Elected as the leading master! > I0729 17:10:10.512692 1213 master.cpp:960] Recovering from registrar > I0729 17:10:10.513226 1210 registrar.cpp:313] Recovering registrar > I0729 17:10:10.516006 1212 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 12.946461ms > I0729 17:10:10.516047 1212 replica.cpp:320] Persisted replica status to > STARTING > I0729 17:10:10.516129 1212 recover.cpp:451] Replica is in STARTING status > I0729 17:10:10.516520 1212 replica.cpp:638] Replica in STARTING status > received a broadcasted recover request > I0729 17:10:10.516592 1212 recover.cpp:188] Received a recover response from > a replica in STARTING status > I0729 17:10:10.516767 1212 recover.cpp:542] Updating replica status to VOTING > I0729 17:10:10.528376 1212 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 11.537102ms > I0729 17:10:10.528430 1212 replica.cpp:320] Persisted replica status to > VOTING > I0729 17:10:10.528501 1212 recover.cpp:556] Successfully joined the Paxos > group > I0729 17:10:10.528565 1212 recover.cpp:440] Recover process terminated > I0729 17:10:10.528700 1212 log.cpp:656] Attempting to start the writer > I0729 17:10:10.528960 1212 replica.cpp:474] Replica received implicit > promise request with proposal 1 > I0729 17:10:10.537821 1212 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 
8.830863ms > I0729 17:10:10.537869 1212 replica.cpp:342] Persisted promised to 1 > I0729 17:10:10.540550 1209 coordinator.cpp:230] Coordinator attemping to > fill missing position > I0729 17:10:10.540856 1209 replica.cpp:375] Replica received explicit > promise request for position 0 with proposal 2 > I0729 17:10:10.547430 1209 leveldb.cpp:343] Persisting action (8 bytes) to > leveldb took 6.548344ms > I0729 17:10:10.547471 1209 replica.cpp:676] Persisted action at 0 > I0729 17:10:10.547732 1209 replica.cpp:508] Replica received write request > for position 0 > I0729 17:10:10.547765 1209 leveldb.cpp:438] Reading position from leveldb > took 15676ns > I0729 17:10:10.557169 1209 leveldb.cpp:343] Persisting action (14 bytes) to > leveldb took
[jira] [Commented] (MESOS-6249) On Mesos master failover the reregistered callback is not triggered
[ https://issues.apache.org/jira/browse/MESOS-6249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15548632#comment-15548632 ] Joris Van Remoortere commented on MESOS-6249: - [~markusjura] It seems like you are hitting some logic around https://issues.apache.org/jira/browse/MESOS-786 You can see the comment here. https://github.com/apache/mesos/blob/b70a22bad22e5e8668f9af62c575902dec7b0125/src/master/master.cpp#L2813-L2820 pinging [~bmahler] who wrote the comment, and [~anandmazumdar] for reference. > On Mesos master failover the reregistered callback is not triggered > --- > > Key: MESOS-6249 > URL: https://issues.apache.org/jira/browse/MESOS-6249 > Project: Mesos > Issue Type: Bug > Components: java api >Affects Versions: 0.28.0, 0.28.1, 1.0.1 > Environment: OS X 10.11.6 >Reporter: Markus Jura > > On a Mesos master failover the reregistered callback of the Java API is not > triggered. Only the registration callback is triggered which makes it hard > for a framework to distinguish between these scenarios. > This behaviour has been tested with the ConductR framework, both with the > Java API version 0.28.0, 0.28.1 and 1.0.1. Below you find the logs from the > master that got re-elected and from the ConductR framework. > *Log: Mesos master on a master re-election* > {code:bash} > I0926 11:44:20.008306 3747840 zookeeper.cpp:259] A new leading master > (UPID=master@127.0.0.1:5050) is detected > I0926 11:44:20.008458 3747840 master.cpp:1847] The newly elected leader is > master@127.0.0.1:5050 with id ca5b9713-1eec-43e1-9d27-9ebc5c0f95b1 > I0926 11:44:20.008484 3747840 master.cpp:1860] Elected as the leading master! 
> I0926 11:44:20.008498 3747840 master.cpp:1547] Recovering from registrar > I0926 11:44:20.008607 3747840 registrar.cpp:332] Recovering registrar > I0926 11:44:20.016340 4284416 registrar.cpp:365] Successfully fetched the > registry (0B) in 7.702016ms > I0926 11:44:20.016393 4284416 registrar.cpp:464] Applied 1 operations in > 12us; attempting to update the 'registry' > I0926 11:44:20.021428 4284416 registrar.cpp:509] Successfully updated the > 'registry' in 5.019904ms > I0926 11:44:20.021481 4284416 registrar.cpp:395] Successfully recovered > registrar > I0926 11:44:20.021611 528384 master.cpp:1655] Recovered 0 agents from the > Registry (118B) ; allowing 10mins for agents to re-register > I0926 11:44:20.536859 3747840 master.cpp:2424] Received SUBSCRIBE call for > framework 'conductr' at > scheduler-3f8b9645-7a17-4e9f-8ad5-077fe8c23b39@192.168.2.106:57164 > I0926 11:44:20.536969 3747840 master.cpp:2500] Subscribing framework conductr > with checkpointing disabled and capabilities [ ] > I0926 11:44:20.537401 3211264 hierarchical.cpp:271] Added framework conductr > I0926 11:44:20.807895 528384 master.cpp:4787] Re-registering agent > b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 at slave(1)@127.0.0.1:5051 (127.0.0.1) > I0926 11:44:20.808145 1601536 registrar.cpp:464] Applied 1 operations in > 38us; attempting to update the 'registry' > I0926 11:44:20.815757 1601536 registrar.cpp:509] Successfully updated the > 'registry' in 7.568896ms > I0926 11:44:20.815992 3747840 master.cpp:7447] Adding task > 6abce9bb-895f-4f6f-be5b-25f6bd09f548 with resources mem(*):0 on agent > b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 (127.0.0.1) > I0926 11:44:20.816339 3747840 master.cpp:4872] Re-registered agent > b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 at slave(1)@127.0.0.1:5051 > (127.0.0.1) with cpus(*):8; mem(*):15360; disk(*):470832; > ports(*):[31000-32000] > I0926 11:44:20.816385 1601536 hierarchical.cpp:478] Added agent > b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 (127.0.0.1) with cpus(*):8; 
> mem(*):15360; disk(*):470832; ports(*):[31000-32000] (allocated: cpus(*):0.9; > mem(*):402.653; disk(*):1000; ports(*):[31000-31000, 31001-31500]) > I0926 11:44:20.816437 3747840 master.cpp:4940] Sending updated checkpointed > resources to agent b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 at > slave(1)@127.0.0.1:5051 (127.0.0.1) > I0926 11:44:20.816787 4284416 master.cpp:5725] Sending 1 offers to framework > conductr (conductr) at > scheduler-3f8b9645-7a17-4e9f-8ad5-077fe8c23b39@192.168.2.106:57164 > {code} > *Log: ConductR framework* > {code:bash} > I0926 11:44:20.007189 66441216 detector.cpp:152] Detected a new leader: > (id='87') > I0926 11:44:20.007524 64294912 group.cpp:706] Trying to get > '/mesos/json.info_87' in ZooKeeper > I0926 11:44:20.008625 63758336 zookeeper.cpp:259] A new leading master > (UPID=master@127.0.0.1:5050) is detected > I0926 11:44:20.008965 63758336 sched.cpp:330] New master detected at > master@127.0.0.1:5050 > 2016-09-26T09:44:20Z MacBook-Pro-6.local INFO MesosSchedulerClient > [sourceThread=conductr-akka.actor.default-dispatcher-2, >
[jira] [Created] (MESOS-6314) It looks like getgrouplist returns duplicated results
Marc Villacorta created MESOS-6314: -- Summary: It looks like getgrouplist returns duplicated results Key: MESOS-6314 URL: https://issues.apache.org/jira/browse/MESOS-6314 Project: Mesos Issue Type: Bug Components: tests Affects Versions: 1.0.2 Environment: Inside Docker container {{alpine:3.4}} Reporter: Marc Villacorta In my Alpine 3.4 system OsTest.User fails: {code:none} /mesos/build # id -G 0 1 2 3 4 6 10 11 20 26 27 {code} {code:none} RUN ] OsTest.User ../../../3rdparty/stout/tests/os_tests.cpp:696: Failure Value of: expected_gids Actual: { "0", "0", "1", "10", "11", "2", "20", "26", "27", "3", "4", "6" } Expected: tokens.get() Which is: { "0", "1", "10", "11", "2", "20", "26", "27", "3", "4", "6" } [ FAILED ] OsTest.User (6 ms) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6314) OsTest.User: It looks like getgrouplist returns duplicated results
[ https://issues.apache.org/jira/browse/MESOS-6314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marc Villacorta updated MESOS-6314: --- Summary: OsTest.User: It looks like getgrouplist returns duplicated results (was: It looks like getgrouplist returns duplicated results) > OsTest.User: It looks like getgrouplist returns duplicated results > -- > > Key: MESOS-6314 > URL: https://issues.apache.org/jira/browse/MESOS-6314 > Project: Mesos > Issue Type: Bug > Components: tests >Affects Versions: 1.0.2 > Environment: Inside Docker container {{alpine:3.4}} >Reporter: Marc Villacorta > > In my Alpine 3.4 system OsTest.User fails: > {code:none} > /mesos/build # id -G > 0 1 2 3 4 6 10 11 20 26 27 > {code} > {code:none} > RUN ] OsTest.User > ../../../3rdparty/stout/tests/os_tests.cpp:696: Failure > Value of: expected_gids > Actual: { "0", "0", "1", "10", "11", "2", "20", "26", "27", "3", "4", "6" } > Expected: tokens.get() > Which is: { "0", "1", "10", "11", "2", "20", "26", "27", "3", "4", "6" } > [ FAILED ] OsTest.User (6 ms) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5909) Stout "OsTest.User" test can fail on some systems
[ https://issues.apache.org/jira/browse/MESOS-5909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15548374#comment-15548374 ] haosdent commented on MESOS-5909: - Yep, please open a new one. It looks like {{getgrouplist}} returns duplicated results. > Stout "OsTest.User" test can fail on some systems > - > > Key: MESOS-5909 > URL: https://issues.apache.org/jira/browse/MESOS-5909 > Project: Mesos > Issue Type: Bug > Components: stout >Reporter: Kapil Arya >Assignee: Mao Geng > Labels: mesosphere > Fix For: 1.1.0 > > Attachments: MESOS-5909-fix.diff > > > Libc call {{getgrouplist}} doesn't return the {{gid}} list in a sorted manner > (in my case, it's returning "471 100") ... whereas {{id -G}} return a sorted > list ("100 471" in my case) causing the validation inside the loop to fail. > We should sort both lists before comparing the values. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6313) In Mesos Console, "Completed Tasks" in a tab next to "Active Tasks"
[ https://issues.apache.org/jira/browse/MESOS-6313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15548362#comment-15548362 ] haosdent commented on MESOS-6313: - Cool, we could add {code} {code} in the navbar to implement this. > In Mesos Console, "Completed Tasks" in a tab next to "Active Tasks" > --- > > Key: MESOS-6313 > URL: https://issues.apache.org/jira/browse/MESOS-6313 > Project: Mesos > Issue Type: Improvement > Components: webui >Reporter: Roman Leventov > > It will ease navigation between them (clicks in close areas of the screen, rather > than scrolling) and will make "Completed Tasks" *visible* even when the list of > active tasks is very long. This is important for those who are not familiar > with the Mesos UI and expect everything to be accessible through menus/tabs at the > top of the screen, not through scrolling. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5909) Stout "OsTest.User" test can fail on some systems
[ https://issues.apache.org/jira/browse/MESOS-5909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15548353#comment-15548353 ] Marc Villacorta commented on MESOS-5909: In my Alpine 3.4 system this test still fails: {code:none} /mesos/build # id -G 0 1 2 3 4 6 10 11 20 26 27 {code} {code:none} RUN ] OsTest.User ../../../3rdparty/stout/tests/os_tests.cpp:696: Failure Value of: expected_gids Actual: { "0", "0", "1", "10", "11", "2", "20", "26", "27", "3", "4", "6" } Expected: tokens.get() Which is: { "0", "1", "10", "11", "2", "20", "26", "27", "3", "4", "6" } [ FAILED ] OsTest.User (6 ms) {code} Should I open a new Jira? > Stout "OsTest.User" test can fail on some systems > - > > Key: MESOS-5909 > URL: https://issues.apache.org/jira/browse/MESOS-5909 > Project: Mesos > Issue Type: Bug > Components: stout >Reporter: Kapil Arya >Assignee: Mao Geng > Labels: mesosphere > Fix For: 1.1.0 > > Attachments: MESOS-5909-fix.diff > > > Libc call {{getgrouplist}} doesn't return the {{gid}} list in a sorted manner > (in my case, it's returning "471 100") ... whereas {{id -G}} return a sorted > list ("100 471" in my case) causing the validation inside the loop to fail. > We should sort both lists before comparing the values. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6312) Add requirement in upgrade.md and getting-started.md for agent '--runtime_dir' when running as non-root
[ https://issues.apache.org/jira/browse/MESOS-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15548338#comment-15548338 ] haosdent commented on MESOS-6312: - [~klueska] For mesos-local, should we change its default value? > Add requirement in upgrade.md and getting-started.md for agent > '--runtime_dir' when running as non-root > -- > > Key: MESOS-6312 > URL: https://issues.apache.org/jira/browse/MESOS-6312 > Project: Mesos > Issue Type: Task >Reporter: Kevin Klues >Priority: Blocker > Fix For: 1.1.0 > > > We recently introduced a new agent flag, {{--runtime_dir}}. Unlike the > {{--work_dir}}, this directory is designed to hold the state of a running > agent between subsequent agent restarts (but not across host reboots). > By default, this flag is set to {{/var/run/mesos}} since this is a {{tmpfs}} > on Linux that gets automatically cleaned up on reboot. However, on most > systems {{/var/run/mesos}} is only writable by root, causing problems when > launching an agent as non-root without pointing {{--runtime_dir}} to a > different location. > We need to call this out in the upgrade.md and getting-started.md docs so > that people know they may need to set this going forward. -- This message was sent by Atlassian JIRA (v6.3.4#6332)