[jira] [Commented] (MESOS-5081) Posix disk isolator allows unrestricted sandbox disk usage if the executor/task doesn't specify disk resource

2016-10-05 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15550653#comment-15550653
 ] 

Charles Allen commented on MESOS-5081:
--

Does fixing this mean that things that are kind of dumb about disk (like Spark) 
won't be able to run on slaves which specify disk resources?

> Posix disk isolator allows unrestricted sandbox disk usage if the 
> executor/task doesn't specify disk resource
> -
>
> Key: MESOS-5081
> URL: https://issues.apache.org/jira/browse/MESOS-5081
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Yan Xu
>  Labels: mesosphere
>
> This is the case even if {{flags.enforce_container_disk_quota}} is true. When 
> a task/executor doesn't specify a disk resource, it still gets to write to 
> the container sandbox, and the posix disk isolator doesn't limit it.
> Even though tasks always have access to the sandbox, a task without any 
> {{disk}} resource should only be able to write zero bytes (it can still 
> touch files). This will likely cause such tasks to fail immediately due to 
> stdout/stderr/executor downloads, etc., but it is the correct behavior 
> (when {{flags.enforce_container_disk_quota}} is true).
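
Once that behavior is enforced, a framework that needs sandbox writes must 
declare an explicit {{disk}} resource on its tasks. A minimal sketch using the 
C++ protobuf API (the 128 MB amount is just an example):

{code}
#include <mesos/mesos.hpp>

// Attach an explicit disk resource to a task so the posix disk isolator
// grants it a non-zero sandbox quota. The 128 MB figure is illustrative.
void addSandboxDisk(mesos::TaskInfo* task)
{
  mesos::Resource* disk = task->add_resources();
  disk->set_name("disk");
  disk->set_type(mesos::Value::SCALAR);
  disk->mutable_scalar()->set_value(128);  // Megabytes, the unit Mesos uses.
}
{code}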



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-6308) CHECK failure in DRF sorter.

2016-10-05 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu reassigned MESOS-6308:
--

Assignee: Guangya Liu

> CHECK failure in DRF sorter.
> 
>
> Key: MESOS-6308
> URL: https://issues.apache.org/jira/browse/MESOS-6308
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>Assignee: Guangya Liu
>
> Saw this CHECK fail in our internal CI:
> https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L450
> {noformat}
> [03:08:28] :   [Step 10/10] [ RUN  ] PartitionTest.DisconnectedFramework
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.200443   577 cluster.cpp:158] 
> Creating default 'local' authorizer
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.206408   577 leveldb.cpp:174] 
> Opened db in 5.827159ms
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208127   577 leveldb.cpp:181] 
> Compacted db in 1.697508ms
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208150   577 leveldb.cpp:196] 
> Created db iterator in 5756ns
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208160   577 leveldb.cpp:202] 
> Seeked to beginning of db in 1483ns
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208168   577 leveldb.cpp:271] 
> Iterated through 0 keys in the db in 1101ns
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208184   577 replica.cpp:776] 
> Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208452   591 recover.cpp:451] 
> Starting replica recovery
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208664   596 recover.cpp:477] 
> Replica is in EMPTY status
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209079   591 replica.cpp:673] 
> Replica in EMPTY status received a broadcasted recover request from 
> __req_res__(3666)@172.30.2.234:37300
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209203   593 recover.cpp:197] 
> Received a recover response from a replica in EMPTY status
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209394   598 recover.cpp:568] 
> Updating replica status to STARTING
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209473   598 master.cpp:380] 
> Master dd11d4ad-2087-4324-99ef-873e83ff09a1 (ip-172-30-2-234.mesosphere.io) 
> started on 172.30.2.234:37300
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209489   598 master.cpp:382] Flags 
> at startup: --acls="" --agent_ping_timeout="15secs" 
> --agent_reregister_timeout="10mins" --allocation_interval="1secs" 
> --allocator="HierarchicalDRF" --authenticate_agents="true" 
> --authenticate_frameworks="true" --authenticate_http_frameworks="true" 
> --authenticate_http_readonly="true" --authenticate_http_readwrite="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/7rr0oB/credentials" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/7rr0oB/master" 
> --zk_session_timeout="10secs"
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209692   598 master.cpp:432] 
> Master only allowing authenticated frameworks to register
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209699   598 master.cpp:446] 
> Master only allowing authenticated agents to register
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209704   598 master.cpp:459] 
> Master only allowing authenticated HTTP frameworks to register
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209709   598 credentials.hpp:37] 
> Loading credentials for authentication from '/tmp/7rr0oB/credentials'
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209810   598 master.cpp:504] Using 
> default 'crammd5' authenticator
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209853   598 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-readonly'
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209897   598 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-readwrite'
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209940   598 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-scheduler'
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209962   598 master.cpp:584] 
> Authorization enabled
> [03:08:28]W:   [Step 10/10] I1004 

[jira] [Commented] (MESOS-6308) CHECK failure in DRF sorter.

2016-10-05 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15550554#comment-15550554
 ] 

Guangya Liu commented on MESOS-6308:


I have been trying to reproduce this issue, but with no luck so far, even with 
{{--gtest_repeat=100}}. I will try to increase the workload as you suggested to 
see if I can reproduce it first.
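
For reference, an invocation along these lines (the test binary path may 
differ per build tree):

{noformat}
./src/mesos-tests --gtest_filter=PartitionTest.DisconnectedFramework \
  --gtest_repeat=100 --gtest_break_on_failure
{noformat}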

> CHECK failure in DRF sorter.
> 
>
> Key: MESOS-6308
> URL: https://issues.apache.org/jira/browse/MESOS-6308
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>
> Saw this CHECK fail in our internal CI:
> https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L450
> {noformat}
> [03:08:28] :   [Step 10/10] [ RUN  ] PartitionTest.DisconnectedFramework
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.200443   577 cluster.cpp:158] 
> Creating default 'local' authorizer
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.206408   577 leveldb.cpp:174] 
> Opened db in 5.827159ms
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208127   577 leveldb.cpp:181] 
> Compacted db in 1.697508ms
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208150   577 leveldb.cpp:196] 
> Created db iterator in 5756ns
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208160   577 leveldb.cpp:202] 
> Seeked to beginning of db in 1483ns
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208168   577 leveldb.cpp:271] 
> Iterated through 0 keys in the db in 1101ns
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208184   577 replica.cpp:776] 
> Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208452   591 recover.cpp:451] 
> Starting replica recovery
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208664   596 recover.cpp:477] 
> Replica is in EMPTY status
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209079   591 replica.cpp:673] 
> Replica in EMPTY status received a broadcasted recover request from 
> __req_res__(3666)@172.30.2.234:37300
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209203   593 recover.cpp:197] 
> Received a recover response from a replica in EMPTY status
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209394   598 recover.cpp:568] 
> Updating replica status to STARTING
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209473   598 master.cpp:380] 
> Master dd11d4ad-2087-4324-99ef-873e83ff09a1 (ip-172-30-2-234.mesosphere.io) 
> started on 172.30.2.234:37300
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209489   598 master.cpp:382] Flags 
> at startup: --acls="" --agent_ping_timeout="15secs" 
> --agent_reregister_timeout="10mins" --allocation_interval="1secs" 
> --allocator="HierarchicalDRF" --authenticate_agents="true" 
> --authenticate_frameworks="true" --authenticate_http_frameworks="true" 
> --authenticate_http_readonly="true" --authenticate_http_readwrite="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/7rr0oB/credentials" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/7rr0oB/master" 
> --zk_session_timeout="10secs"
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209692   598 master.cpp:432] 
> Master only allowing authenticated frameworks to register
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209699   598 master.cpp:446] 
> Master only allowing authenticated agents to register
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209704   598 master.cpp:459] 
> Master only allowing authenticated HTTP frameworks to register
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209709   598 credentials.hpp:37] 
> Loading credentials for authentication from '/tmp/7rr0oB/credentials'
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209810   598 master.cpp:504] Using 
> default 'crammd5' authenticator
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209853   598 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-readonly'
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209897   598 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-readwrite'
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209940   598 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 

[jira] [Updated] (MESOS-5613) mesos-local fails to start if MESOS_WORK_DIR isn't set.

2016-10-05 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-5613:
--
Fix Version/s: 1.0.2

Backported to 1.0.x.

commit 6773b8ffef6672ef012f8ccb6a9fe73d40e02ae0
Author: Vinod Kone 
Date:   Wed Oct 5 18:24:57 2016 -0700

Added MESOS-5613 to 1.0.2 CHANGELOG.

commit e7a35ebe1c52aa3dc3d90feabae13c1f7220723f
Author: Ammar Askar 
Date:   Wed Aug 10 17:57:09 2016 -0700

Propagated work_dir flag from local runs to agents/masters.

Review: https://reviews.apache.org/r/50003/



> mesos-local fails to start if MESOS_WORK_DIR isn't set.
> ---
>
> Key: MESOS-5613
> URL: https://issues.apache.org/jira/browse/MESOS-5613
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Jan Schlicht
>Assignee: Ammar Askar
> Fix For: 1.1.0, 1.0.2
>
>
> Running {{mesos-local}} fails with
> {noformat}
> Failed to start a local cluster while loading agent flags from the 
> environment: Flag 'work_dir' is required, but it was not provided
> {noformat}
> if {{MESOS_WORK_DIR}} isn't set.
> This seems to be due to the changed behavior of making the {{work_dir}} flag 
> mandatory (MESOS-5064). While {{MESOS_WORK_DIR}} is being set in a 
> development environment in {{./bin/mesos-local.sh}}, this isn't true if 
> {{mesos-local}} is installed on the system after a {{make install}}.
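
Until the fix is picked up, a workaround consistent with the description above 
is to supply the flag through the environment (the path is just an example):

{noformat}
MESOS_WORK_DIR=/tmp/mesos mesos-local
{noformat}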



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6216) LibeventSSLSocketImpl::create is not safe to call concurrently with os::getenv

2016-10-05 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15550550#comment-15550550
 ] 

Vinod Kone commented on MESOS-6216:
---

What's the ETA for this to land on master and be backported to 1.0.x?

> LibeventSSLSocketImpl::create is not safe to call concurrently with os::getenv
> --
>
> Key: MESOS-6216
> URL: https://issues.apache.org/jira/browse/MESOS-6216
> Project: Mesos
>  Issue Type: Bug
>  Components: security
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>  Labels: mesosphere
> Attachments: build.log
>
>
> {{LibeventSSLSocketImpl::create}} is called whenever a potentially 
> ssl-enabled socket is created. It in turn calls {{openssl::initialize}} which 
> calls a function {{reinitialize}} using {{os::setenv}}. Here {{os::setenv}} 
> is used to set up SSL-related libprocess environment variables 
> {{LIBPROCESS_SSL_*}}.
> Since {{os::setenv}} is not thread-safe just like the {{::setenv}} it wraps, 
> any calling of functions like {{os::getenv}} (or via {{os::environment}}) 
> concurrently with the first invocation of {{LibeventSSLSocketImpl::create}} 
> performs unsynchronized r/w access to the same data structure in the runtime.
> We usually perform most setup of the environment before we start the 
> libprocess runtime with {{process::initialize}} from a {{main}} function, see 
> e.g., {{src/slave/main.cpp}} or {{src/master/main.cpp}} and others. It 
> appears that we should move the setup of libprocess' SSL environment 
> variables to a similar spot.
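
A minimal sketch of that ordering (not the eventual patch; the variable values 
are placeholders):

{code}
#include <process/process.hpp>

#include <stout/os.hpp>

int main(int argc, char** argv)
{
  // os::setenv wraps ::setenv and is only safe while the process is still
  // single-threaded.
  os::setenv("LIBPROCESS_SSL_ENABLED", "true");
  os::setenv("LIBPROCESS_SSL_CERT_FILE", "/path/to/cert.pem");

  // Threads spawned from here on may call os::getenv concurrently, so no
  // further writes to the environment should happen.
  process::initialize();

  // ... start the master/agent/scheduler ...
  return 0;
}
{code}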



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6317) Race in master update slave.

2016-10-05 Thread Guangya Liu (JIRA)
Guangya Liu created MESOS-6317:
--

 Summary: Race in master update slave.
 Key: MESOS-6317
 URL: https://issues.apache.org/jira/browse/MESOS-6317
 Project: Mesos
  Issue Type: Bug
Reporter: Guangya Liu
Assignee: Guangya Liu


Currently, when {{updateSlave}} runs in the master, it first rescinds offers 
and then calls {{updateSlave}} on the allocator. There is a race here: a batch 
allocation can be inserted between the two, so the effective order becomes 
rescind offer -> batch allocation -> update slave. This order causes problems 
when the oversubscribed resources are decreased.

Suppose the oversubscribed resources are decreased from 2 to 1. After the 
rescind finishes, the batch allocation will allocate the old 2 oversubscribed 
resources again, and only then will update slave set the total oversubscribed 
resources to 1. The agent host is then overcommitted for a while, because 
tasks can still use 2 oversubscribed resources instead of 1; once the tasks 
using the 2 oversubscribed resources finish, everything goes back to normal.

So we should swap the order of rescinding offers and {{updateSlave}} in the 
master to avoid resource overcommit.

If we update the slave first and then rescind offers, the order becomes update 
slave -> batch allocation -> rescind offer, which has no problem when 
decreasing resources. Suppose the oversubscribed resources are decreased from 
2 to 1: update slave sets the total oversubscribed resources to 1 directly, 
the batch allocation then allocates no oversubscribed resources (more is 
already allocated than the total), and finally the rescind revokes all offers 
using oversubscribed resources. This does not leave the agent host 
overcommitted.
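
To make the difference concrete, a toy model (plain C++, not Mesos code) of 
the 2 -> 1 scenario above:

{code}
#include <iostream>

int main()
{
  // Order A: rescind offer -> batch allocation -> update slave.
  int total = 2, allocated = 2;
  allocated = 0;      // Rescind all offers using oversubscription.
  allocated = total;  // Batch allocation re-offers the stale total of 2.
  total = 1;          // Update slave finally shrinks the total.
  std::cout << "rescind first: overcommitted = " << std::boolalpha
            << (allocated > total) << std::endl;  // true

  // Order B: update slave -> batch allocation -> rescind offer.
  total = 2, allocated = 2;
  total = 1;          // Update slave shrinks the total first.
  // Batch allocation adds nothing: allocated (2) already exceeds total (1).
  allocated = 0;      // Rescind all offers using oversubscription.
  std::cout << "update first: overcommitted = " << std::boolalpha
            << (allocated > total) << std::endl;  // false

  return 0;
}
{code}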



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6142) Frameworks may RESERVE for an arbitrary role.

2016-10-05 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-6142:
--
Target Version/s: 1.1.0
   Fix Version/s: (was: 1.1.0)

> Frameworks may RESERVE for an arbitrary role.
> -
>
> Key: MESOS-6142
> URL: https://issues.apache.org/jira/browse/MESOS-6142
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, master
>Affects Versions: 1.0.0
>Reporter: Alexander Rukletsov
>Assignee: Gastón Kleiman
>Priority: Blocker
>  Labels: mesosphere, reservations
>
> The master does not validate that resources from a reservation request have 
> the same role the framework is registered with. As a result, frameworks may 
> reserve resources for arbitrary roles.
> I've modified the role in [the {{ReserveThenUnreserve}} 
> test|https://github.com/apache/mesos/blob/bca600cf5602ed8227d91af9f73d689da14ad786/src/tests/reservation_tests.cpp#L117]
>  to "yoyo" and observed the following in the test's log:
> {noformat}
> I0908 18:35:43.379122 2138112 master.cpp:3362] Processing ACCEPT call for 
> offers: [ dfaf67e6-7c1c-4988-b427-c49842cb7bb7-O0 ] on agent 
> dfaf67e6-7c1c-4988-b427-c49842cb7bb7-S0 at slave(1)@10.200.181.237:60116 
> (alexr.railnet.train) for framework dfaf67e6-7c1c-4988-b427-c49842cb7bb7- 
> (default) at 
> scheduler-ca12a660-9f08-49de-be4e-d452aa3aa6da@10.200.181.237:60116
> I0908 18:35:43.379170 2138112 master.cpp:3022] Authorizing principal 
> 'test-principal' to reserve resources 'cpus(yoyo, test-principal):1; 
> mem(yoyo, test-principal):512'
> I0908 18:35:43.379678 2138112 master.cpp:3642] Applying RESERVE operation for 
> resources cpus(yoyo, test-principal):1; mem(yoyo, test-principal):512 from 
> framework dfaf67e6-7c1c-4988-b427-c49842cb7bb7- (default) at 
> scheduler-ca12a660-9f08-49de-be4e-d452aa3aa6da@10.200.181.237:60116 to agent 
> dfaf67e6-7c1c-4988-b427-c49842cb7bb7-S0 at slave(1)@10.200.181.237:60116 
> (alexr.railnet.train)
> I0908 18:35:43.379767 2138112 master.cpp:7341] Sending checkpointed resources 
> cpus(yoyo, test-principal):1; mem(yoyo, test-principal):512 to agent 
> dfaf67e6-7c1c-4988-b427-c49842cb7bb7-S0 at slave(1)@10.200.181.237:60116 
> (alexr.railnet.train)
> I0908 18:35:43.380273 3211264 slave.cpp:2497] Updated checkpointed resources 
> from  to cpus(yoyo, test-principal):1; mem(yoyo, test-principal):512
> I0908 18:35:43.380574 2674688 hierarchical.cpp:760] Updated allocation of 
> framework dfaf67e6-7c1c-4988-b427-c49842cb7bb7- on agent 
> dfaf67e6-7c1c-4988-b427-c49842cb7bb7-S0 from cpus(*):1; mem(*):512; 
> disk(*):470841; ports(*):[31000-32000] to ports(*):[31000-32000]; cpus(yoyo, 
> test-principal):1; disk(*):470841; mem(yoyo, test-principal):512 with RESERVE 
> operation
> {noformat}
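
A hypothetical sketch of the missing master-side check (illustrative only, not 
the actual fix):

{code}
#include <mesos/mesos.hpp>

#include <stout/error.hpp>
#include <stout/none.hpp>
#include <stout/option.hpp>

// Reject RESERVE operations whose resources carry a role other than the
// framework's registered role.
Option<Error> validateReserveRole(
    const mesos::FrameworkInfo& framework,
    const mesos::Offer::Operation::Reserve& reserve)
{
  for (const mesos::Resource& resource : reserve.resources()) {
    if (resource.role() != framework.role()) {
      return Error(
          "Resource '" + resource.name() + "' is reserved for role '" +
          resource.role() + "' but the framework has role '" +
          framework.role() + "'");
    }
  }

  return None();
}
{code}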



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6157) ContainerInfo is not validated.

2016-10-05 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15550491#comment-15550491
 ] 

Vinod Kone commented on MESOS-6157:
---

[~alexr] Should this be resolved?

> ContainerInfo is not validated.
> ---
>
> Key: MESOS-6157
> URL: https://issues.apache.org/jira/browse/MESOS-6157
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>Priority: Blocker
>  Labels: containerizer, mesos-containerizer, mesosphere
> Fix For: 1.1.0
>
>
> Currently Mesos does not validate {{ContainerInfo}} provided with 
> {{TaskInfo}} or {{ExecutorInfo}}, hence invalid task configurations can be 
> accepted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6118) Agent would crash with docker container tasks due to host mount table read.

2016-10-05 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-6118:
--
Target Version/s: 1.1.0, 1.0.2
   Fix Version/s: (was: 1.0.2)
  (was: 1.1.0)

> Agent would crash with docker container tasks due to host mount table read.
> ---
>
> Key: MESOS-6118
> URL: https://issues.apache.org/jira/browse/MESOS-6118
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 1.0.1
> Environment: Build: 2016-08-26 23:06:27 by centos
> Version: 1.0.1
> Git tag: 1.0.1
> Git SHA: 3611eb0b7eea8d144e9b2e840e0ba16f2f659ee3
> systemd version `219` detected
> Inializing systemd state
> Created systemd slice: `/run/systemd/system/mesos_executors.slice`
> Started systemd slice `mesos_executors.slice`
> Using isolation: posix/cpu,posix/mem,filesystem/posix,network/cni
>  Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> Linux ip-10-254-192-40 3.10.0-327.28.3.el7.x86_64 #1 SMP Thu Aug 18 19:05:49 
> UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Jamie Briant
>Assignee: Kevin Klues
>Priority: Critical
>  Labels: linux, slave
> Attachments: crashlogfull.log, cycle2.log, cycle3.log, cycle5.log, 
> cycle6.log, slave-crash.log
>
>
> I have a framework which schedules thousands of short-running tasks (a few 
> seconds to a few minutes each) over a period of several minutes. In 1.0.1, 
> the slave process will crash every few minutes (with systemd restarting it).
> Crash is:
> Sep 01 20:52:23 ip-10-254-192-99 mesos-slave: F0901 20:52:23.905678  1232 
> fs.cpp:140] Check failed: !visitedParents.contains(parentId)
> Sep 01 20:52:23 ip-10-254-192-99 mesos-slave: *** Check failure stack trace: 
> ***
> Version 1.0.0 works without this issue.
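
For context, the failing CHECK guards against cycles while walking mount table 
entries up their parent ids. A toy illustration of that invariant (plain C++, 
not the actual fs.cpp code):

{code}
#include <iostream>
#include <map>
#include <set>

// Returns true if following parent ids from `id` ever revisits a parent --
// the situation in which the CHECK in fs.cpp fires.
bool hasCycle(const std::map<int, int>& parentOf, int id)
{
  std::set<int> visitedParents;

  while (parentOf.count(id) > 0) {
    const int parentId = parentOf.at(id);
    if (visitedParents.count(parentId) > 0) {
      return true;
    }
    visitedParents.insert(parentId);
    id = parentId;
  }

  return false;
}

int main()
{
  // 3 -> 2 -> 1 terminates; 5 -> 4 -> 5 is a cycle.
  std::map<int, int> table = {{3, 2}, {2, 1}, {5, 4}, {4, 5}};
  std::cout << hasCycle(table, 3) << " " << hasCycle(table, 5) << std::endl;
  return 0;
}
{code}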



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6302) Agent recovery can fail after nested containers are launched

2016-10-05 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15550349#comment-15550349
 ] 

Jie Yu commented on MESOS-6302:
---

commit 8bab70c691a3efeda301f72956de4f80b258464e
Author: Gilbert Song songzihao1...@gmail.com
Date:   Mon Oct 3 15:28:39 2016 -0700

Fixed provisioner recovering with nested containers existed.

Previously, in provisioner recover, we firstly get all container
ids from the provisioner directory, and then find all rootfses
from each container's 'backends' directory. We made an assumption
that if a 'container_id' directory exists in the provisioner
directory, it must contain a 'backends' directory underneath,
which contains at least one rootfs for this container.

However, this is no longer true since we added support for nested
containers. Because we allow the case that a nested container is
specified with a container image while its parent does not have
an image specified. In this case, when the provisioner recovers,
it can still find the parent container's id in the provisioner
directory while no 'backends' directory exists, since all nested
containers backend information are under its parent container's
directory.

As a result, we should skip recovering the 'Info' struct in
provisioner for the parent container if it never provisions any
image.

Review: https://reviews.apache.org/r/52480/
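
A minimal sketch of the check the commit describes (the helper name is mine, 
not the patch's):

{code}
#include <string>

#include <stout/os.hpp>
#include <stout/path.hpp>

// A container counts as provisioned only if its provisioner directory
// contains a 'backends' subdirectory. Parents of image-less nested
// containers lack it, so recovery should skip their 'Info' structs.
bool provisioned(
    const std::string& provisionerDir,
    const std::string& containerId)
{
  const std::string backends =
    path::join(provisionerDir, "containers", containerId, "backends");

  return os::exists(backends);
}
{code}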

> Agent recovery can fail after nested containers are launched
> 
>
> Key: MESOS-6302
> URL: https://issues.apache.org/jira/browse/MESOS-6302
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Greg Mann
>Assignee: Gilbert Song
>Priority: Blocker
>  Labels: mesosphere
> Fix For: 1.1.0
>
> Attachments: read_write_app.json
>
>
> After launching a nested container which used a Docker image, I restarted the 
> agent which ran that task group and saw the following in the agent logs 
> during recovery:
> {code}
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: 
> I1001 01:45:10.813596  4640 status_update_manager.cpp:203] Recovering status 
> update manager
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: 
> I1001 01:45:10.813622  4640 status_update_manager.cpp:211] Recovering 
> executor 'instance-testvolume.02c26bce-8778-11e6-9ff3-7a3cd7c1568e' of 
> framework 118ca38d-daee-4b2d-b584-b5581738a3dd-
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: 
> I1001 01:45:10.814249  4639 docker.cpp:745] Recovering Docker containers
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: 
> I1001 01:45:10.815294  4642 containerizer.cpp:581] Recovering containerizer
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: 
> Failed to perform recovery: Collect failed: Unable to list rootfses belonged 
> to container a7d576da-fd0f-4dc1-bd5a-6d0a93ac8a53: Unable to list the 
> container directory: Failed to opendir 
> '/var/lib/mesos/slave/provisioner/containers/a7d576da-fd0f-4dc1-bd5a-6d0a93ac8a53/backends':
>  No such file or directory
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: 
> To remedy this do as follows:
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: 
> Step 1: rm -f /var/lib/mesos/slave/meta/slaves/latest
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]:   
>   This ensures agent doesn't recover old live executors.
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: 
> Step 2: Restart the agent.
> {code}
> and the agent continues to restart in this fashion. Attached is the Marathon 
> app definition that I used to launch the task group.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6308) CHECK failure in DRF sorter.

2016-10-05 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15550284#comment-15550284
 ] 

Benjamin Mahler commented on MESOS-6308:


[~gyliu] have you seen this before?

> CHECK failure in DRF sorter.
> 
>
> Key: MESOS-6308
> URL: https://issues.apache.org/jira/browse/MESOS-6308
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>
> Saw this CHECK fail in our internal CI:
> https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L450
> {noformat}
> [03:08:28] :   [Step 10/10] [ RUN  ] PartitionTest.DisconnectedFramework
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.200443   577 cluster.cpp:158] 
> Creating default 'local' authorizer
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.206408   577 leveldb.cpp:174] 
> Opened db in 5.827159ms
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208127   577 leveldb.cpp:181] 
> Compacted db in 1.697508ms
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208150   577 leveldb.cpp:196] 
> Created db iterator in 5756ns
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208160   577 leveldb.cpp:202] 
> Seeked to beginning of db in 1483ns
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208168   577 leveldb.cpp:271] 
> Iterated through 0 keys in the db in 1101ns
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208184   577 replica.cpp:776] 
> Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208452   591 recover.cpp:451] 
> Starting replica recovery
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.208664   596 recover.cpp:477] 
> Replica is in EMPTY status
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209079   591 replica.cpp:673] 
> Replica in EMPTY status received a broadcasted recover request from 
> __req_res__(3666)@172.30.2.234:37300
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209203   593 recover.cpp:197] 
> Received a recover response from a replica in EMPTY status
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209394   598 recover.cpp:568] 
> Updating replica status to STARTING
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209473   598 master.cpp:380] 
> Master dd11d4ad-2087-4324-99ef-873e83ff09a1 (ip-172-30-2-234.mesosphere.io) 
> started on 172.30.2.234:37300
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209489   598 master.cpp:382] Flags 
> at startup: --acls="" --agent_ping_timeout="15secs" 
> --agent_reregister_timeout="10mins" --allocation_interval="1secs" 
> --allocator="HierarchicalDRF" --authenticate_agents="true" 
> --authenticate_frameworks="true" --authenticate_http_frameworks="true" 
> --authenticate_http_readonly="true" --authenticate_http_readwrite="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/7rr0oB/credentials" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/7rr0oB/master" 
> --zk_session_timeout="10secs"
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209692   598 master.cpp:432] 
> Master only allowing authenticated frameworks to register
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209699   598 master.cpp:446] 
> Master only allowing authenticated agents to register
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209704   598 master.cpp:459] 
> Master only allowing authenticated HTTP frameworks to register
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209709   598 credentials.hpp:37] 
> Loading credentials for authentication from '/tmp/7rr0oB/credentials'
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209810   598 master.cpp:504] Using 
> default 'crammd5' authenticator
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209853   598 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-readonly'
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209897   598 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-readwrite'
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209940   598 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-scheduler'
> [03:08:28]W:   [Step 10/10] I1004 03:08:28.209962   598 master.cpp:584] 
> Authorization enabled
> [03:08:28]W:   [Step 

[jira] [Updated] (MESOS-6316) CREATE of shared volumes should not be allowed by frameworks not opted in to the capability.

2016-10-05 Thread Anindya Sinha (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anindya Sinha updated MESOS-6316:
-
Labels: persistent-volumes  (was: )

> CREATE of shared volumes should not be allowed by frameworks not opted in to 
> the capability.
> 
>
> Key: MESOS-6316
> URL: https://issues.apache.org/jira/browse/MESOS-6316
> Project: Mesos
>  Issue Type: Bug
>  Components: general
>Reporter: Anindya Sinha
>Assignee: Anindya Sinha
>Priority: Minor
>  Labels: persistent-volumes
>
> Even though shared resources are not offered to a framework that has not 
> opted in to the SHARED_RESOURCES capability, such a framework can still 
> inadvertently CREATE a shared volume. Although this volume will not be 
> offered to such frameworks, Mesos should not allow such CREATE operations 
> to succeed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6316) CREATE of shared volumes should not be allowed by frameworks not opted in to the capability.

2016-10-05 Thread Anindya Sinha (JIRA)
Anindya Sinha created MESOS-6316:


 Summary: CREATE of shared volumes should not be allowed by 
frameworks not opted in to the capability.
 Key: MESOS-6316
 URL: https://issues.apache.org/jira/browse/MESOS-6316
 Project: Mesos
  Issue Type: Bug
  Components: general
Reporter: Anindya Sinha
Assignee: Anindya Sinha
Priority: Minor


Even though shared resources are not offered to a framework that has not opted 
in to the SHARED_RESOURCES capability, such a framework can still inadvertently 
CREATE a shared volume. Although this volume will not be offered to such 
frameworks, Mesos should not allow such CREATE operations to succeed.
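
A hypothetical sketch of the missing guard (illustrative only, not the 
eventual patch):

{code}
#include <mesos/mesos.hpp>

#include <stout/error.hpp>
#include <stout/none.hpp>
#include <stout/option.hpp>

// Refuse CREATE of a shared volume when the framework lacks the
// SHARED_RESOURCES capability.
Option<Error> validateSharedCreate(
    const mesos::FrameworkInfo& framework,
    const mesos::Offer::Operation::Create& create)
{
  bool optedIn = false;
  for (const mesos::FrameworkInfo::Capability& capability :
       framework.capabilities()) {
    if (capability.type() ==
        mesos::FrameworkInfo::Capability::SHARED_RESOURCES) {
      optedIn = true;
    }
  }

  for (const mesos::Resource& volume : create.volumes()) {
    if (volume.has_shared() && !optedIn) {
      return Error(
          "Framework has not opted in to the SHARED_RESOURCES capability");
    }
  }

  return None();
}
{code}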




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6269) CNI isolator doesn't activate loopback interface

2016-10-05 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-6269:
--
Fix Version/s: 1.0.2

> CNI isolator doesn't activate loopback interface
> 
>
> Key: MESOS-6269
> URL: https://issues.apache.org/jira/browse/MESOS-6269
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation, network
>Affects Versions: 1.0.1
>Reporter: Greg Mann
>Assignee: Avinash Sridharan
>Priority: Blocker
>  Labels: isolation, networking
> Fix For: 1.1.0, 1.0.2
>
>
> Launching a nested CNI-enabled container yielded the following agent log 
> output:
> {code}
> cni.cpp:1255] Got assigned IPv4 address '9.0.1.25/25' from CNI network 'dcos' 
> for container 7c1ef3c4-ba7b-4b43-ba33-0612d84100cc
> {code}
> indicating that the container was successfully assigned an IP. Running 
> {{ifconfig -a}} inside the container yields:
> {code}
> eth0: flags=4163  mtu 1420
> inet 9.0.1.25  netmask 255.255.255.128  broadcast 0.0.0.0
> inet6 fe80::e004:4bff:fefc:6816  prefixlen 64  scopeid 0x20
> ether 0a:58:09:00:01:19  txqueuelen 0  (Ethernet)
> RX packets 31  bytes 5052 (4.9 KiB)
> RX errors 0  dropped 0  overruns 0  frame 0
> TX packets 36  bytes 5689 (5.5 KiB)
> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> lo: flags=8  mtu 65536
> loop  txqueuelen 1  (Local Loopback)
> RX packets 0  bytes 0 (0.0 B)
> RX errors 0  dropped 0  overruns 0  frame 0
> TX packets 0  bytes 0 (0.0 B)
> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> {code}
> it can be seen that the loopback interface is not activated. {{ifconfig lo 
> up}} must be run before a process within the container can bind to that 
> interface, but this should be handled by the CNI isolator.
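
For reference, a minimal sketch of what activating loopback amounts to 
(standard Linux ioctls, equivalent to {{ifconfig lo up}}; this is not the 
isolator patch itself and would need to run inside the container's network 
namespace):

{code}
#include <cstring>

#include <net/if.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

// Bring up the 'lo' interface: read its flags, set IFF_UP | IFF_RUNNING,
// and write them back. Returns 0 on success, -1 on failure.
int bringUpLoopback()
{
  int fd = socket(AF_INET, SOCK_DGRAM, 0);
  if (fd < 0) {
    return -1;
  }

  struct ifreq ifr;
  memset(&ifr, 0, sizeof(ifr));
  strncpy(ifr.ifr_name, "lo", IFNAMSIZ - 1);

  if (ioctl(fd, SIOCGIFFLAGS, &ifr) < 0) {
    close(fd);
    return -1;
  }

  ifr.ifr_flags |= IFF_UP | IFF_RUNNING;
  const int result = ioctl(fd, SIOCSIFFLAGS, &ifr);

  close(fd);
  return result;
}
{code}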



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-6249) On Mesos master failover the reregistered callback is not triggered

2016-10-05 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15549924#comment-15549924
 ] 

Benjamin Mahler edited comment on MESOS-6249 at 10/5/16 8:58 PM:
-

Linking in MESOS-786 which describes the lifecycle of registered and 
re-registered callbacks. Note that MESOS-786 was resolved but AFAICT we did not 
update to the newer semantics described in this ticket for schedulers that use 
the old-style driver.

However, it sounds like you care about this because you're trying to detect 
that the master has failed over. To do this you must introspect the 
{{MasterInfo}} provided to you in order to see if {{MasterInfo.id}} has changed.


was (Author: bmahler):
Linking in MESOS-786 which describes the lifecycle of registered and 
re-registered callbacks. Note that MESOS-786 was resolved but AFAICT we did not 
update to the newer semantics described in this ticket for schedulers that use 
the old-style driver.

However, it sounds like you care about this because you're to detect that the 
master has failed over. To do this you must introspect the {{MasterInfo}} 
provided to you in order to see if {{MasterInfo.id}} has changed.

> On Mesos master failover the reregistered callback is not triggered
> ---
>
> Key: MESOS-6249
> URL: https://issues.apache.org/jira/browse/MESOS-6249
> Project: Mesos
>  Issue Type: Bug
>  Components: java api
>Affects Versions: 0.28.0, 0.28.1, 1.0.1
> Environment: OS X 10.11.6
>Reporter: Markus Jura
>
> On a Mesos master failover the reregistered callback of the Java API is not 
> triggered. Only the registration callback is triggered, which makes it hard 
> for a framework to distinguish between these scenarios.
> This behaviour has been tested with the ConductR framework, both with the 
> Java API version 0.28.0, 0.28.1 and 1.0.1. Below you find the logs from the 
> master that got re-elected and from the ConductR framework.
> *Log: Mesos master on a master re-election*
> {code:bash}
> I0926 11:44:20.008306 3747840 zookeeper.cpp:259] A new leading master 
> (UPID=master@127.0.0.1:5050) is detected
> I0926 11:44:20.008458 3747840 master.cpp:1847] The newly elected leader is 
> master@127.0.0.1:5050 with id ca5b9713-1eec-43e1-9d27-9ebc5c0f95b1
> I0926 11:44:20.008484 3747840 master.cpp:1860] Elected as the leading master!
> I0926 11:44:20.008498 3747840 master.cpp:1547] Recovering from registrar
> I0926 11:44:20.008607 3747840 registrar.cpp:332] Recovering registrar
> I0926 11:44:20.016340 4284416 registrar.cpp:365] Successfully fetched the 
> registry (0B) in 7.702016ms
> I0926 11:44:20.016393 4284416 registrar.cpp:464] Applied 1 operations in 
> 12us; attempting to update the 'registry'
> I0926 11:44:20.021428 4284416 registrar.cpp:509] Successfully updated the 
> 'registry' in 5.019904ms
> I0926 11:44:20.021481 4284416 registrar.cpp:395] Successfully recovered 
> registrar
> I0926 11:44:20.021611 528384 master.cpp:1655] Recovered 0 agents from the 
> Registry (118B) ; allowing 10mins for agents to re-register
> I0926 11:44:20.536859 3747840 master.cpp:2424] Received SUBSCRIBE call for 
> framework 'conductr' at 
> scheduler-3f8b9645-7a17-4e9f-8ad5-077fe8c23b39@192.168.2.106:57164
> I0926 11:44:20.536969 3747840 master.cpp:2500] Subscribing framework conductr 
> with checkpointing disabled and capabilities [  ]
> I0926 11:44:20.537401 3211264 hierarchical.cpp:271] Added framework conductr
> I0926 11:44:20.807895 528384 master.cpp:4787] Re-registering agent 
> b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 at slave(1)@127.0.0.1:5051 (127.0.0.1)
> I0926 11:44:20.808145 1601536 registrar.cpp:464] Applied 1 operations in 
> 38us; attempting to update the 'registry'
> I0926 11:44:20.815757 1601536 registrar.cpp:509] Successfully updated the 
> 'registry' in 7.568896ms
> I0926 11:44:20.815992 3747840 master.cpp:7447] Adding task 
> 6abce9bb-895f-4f6f-be5b-25f6bd09f548 with resources mem(*):0 on agent 
> b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 (127.0.0.1)
> I0926 11:44:20.816339 3747840 master.cpp:4872] Re-registered agent 
> b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 at slave(1)@127.0.0.1:5051 
> (127.0.0.1) with cpus(*):8; mem(*):15360; disk(*):470832; 
> ports(*):[31000-32000]
> I0926 11:44:20.816385 1601536 hierarchical.cpp:478] Added agent 
> b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 (127.0.0.1) with cpus(*):8; 
> mem(*):15360; disk(*):470832; ports(*):[31000-32000] (allocated: cpus(*):0.9; 
> mem(*):402.653; disk(*):1000; ports(*):[31000-31000, 31001-31500])
> I0926 11:44:20.816437 3747840 master.cpp:4940] Sending updated checkpointed 
> resources  to agent b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 at 
> slave(1)@127.0.0.1:5051 (127.0.0.1)
> I0926 11:44:20.816787 4284416 master.cpp:5725] Sending 1 offers to framework 
> 

[jira] [Commented] (MESOS-6249) On Mesos master failover the reregistered callback is not triggered

2016-10-05 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15549924#comment-15549924
 ] 

Benjamin Mahler commented on MESOS-6249:


Linking in MESOS-786 which describes the lifecycle of registered and 
re-registered callbacks. Note that MESOS-786 was resolved but AFAICT we did not 
update to the newer semantics described in this ticket for schedulers that use 
the old-style driver.

However, it sounds like you care about this because you're trying to detect 
that the master has failed over. To do this you must introspect the 
{{MasterInfo}} provided to you in order to see if {{MasterInfo.id}} has changed.
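
A minimal sketch of that introspection (the helper name is mine, not framework 
code):

{code}
#include <string>

#include <mesos/mesos.hpp>

#include <stout/option.hpp>

// Remember the last MasterInfo.id seen in `registered` and treat a change
// as a master failover.
bool isMasterFailover(
    Option<std::string>& lastMasterId,
    const mesos::MasterInfo& masterInfo)
{
  const bool failedOver =
    lastMasterId.isSome() && lastMasterId.get() != masterInfo.id();

  lastMasterId = masterInfo.id();
  return failedOver;
}
{code}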

> On Mesos master failover the reregistered callback is not triggered
> ---
>
> Key: MESOS-6249
> URL: https://issues.apache.org/jira/browse/MESOS-6249
> Project: Mesos
>  Issue Type: Bug
>  Components: java api
>Affects Versions: 0.28.0, 0.28.1, 1.0.1
> Environment: OS X 10.11.6
>Reporter: Markus Jura
>
> On a Mesos master failover the reregistered callback of the Java API is not 
> triggered. Only the registration callback is triggered, which makes it hard 
> for a framework to distinguish between these scenarios.
> This behaviour has been tested with the ConductR framework, both with the 
> Java API version 0.28.0, 0.28.1 and 1.0.1. Below you find the logs from the 
> master that got re-elected and from the ConductR framework.
> *Log: Mesos master on a master re-election*
> {code:bash}
> I0926 11:44:20.008306 3747840 zookeeper.cpp:259] A new leading master 
> (UPID=master@127.0.0.1:5050) is detected
> I0926 11:44:20.008458 3747840 master.cpp:1847] The newly elected leader is 
> master@127.0.0.1:5050 with id ca5b9713-1eec-43e1-9d27-9ebc5c0f95b1
> I0926 11:44:20.008484 3747840 master.cpp:1860] Elected as the leading master!
> I0926 11:44:20.008498 3747840 master.cpp:1547] Recovering from registrar
> I0926 11:44:20.008607 3747840 registrar.cpp:332] Recovering registrar
> I0926 11:44:20.016340 4284416 registrar.cpp:365] Successfully fetched the 
> registry (0B) in 7.702016ms
> I0926 11:44:20.016393 4284416 registrar.cpp:464] Applied 1 operations in 
> 12us; attempting to update the 'registry'
> I0926 11:44:20.021428 4284416 registrar.cpp:509] Successfully updated the 
> 'registry' in 5.019904ms
> I0926 11:44:20.021481 4284416 registrar.cpp:395] Successfully recovered 
> registrar
> I0926 11:44:20.021611 528384 master.cpp:1655] Recovered 0 agents from the 
> Registry (118B) ; allowing 10mins for agents to re-register
> I0926 11:44:20.536859 3747840 master.cpp:2424] Received SUBSCRIBE call for 
> framework 'conductr' at 
> scheduler-3f8b9645-7a17-4e9f-8ad5-077fe8c23b39@192.168.2.106:57164
> I0926 11:44:20.536969 3747840 master.cpp:2500] Subscribing framework conductr 
> with checkpointing disabled and capabilities [  ]
> I0926 11:44:20.537401 3211264 hierarchical.cpp:271] Added framework conductr
> I0926 11:44:20.807895 528384 master.cpp:4787] Re-registering agent 
> b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 at slave(1)@127.0.0.1:5051 (127.0.0.1)
> I0926 11:44:20.808145 1601536 registrar.cpp:464] Applied 1 operations in 
> 38us; attempting to update the 'registry'
> I0926 11:44:20.815757 1601536 registrar.cpp:509] Successfully updated the 
> 'registry' in 7.568896ms
> I0926 11:44:20.815992 3747840 master.cpp:7447] Adding task 
> 6abce9bb-895f-4f6f-be5b-25f6bd09f548 with resources mem(*):0 on agent 
> b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 (127.0.0.1)
> I0926 11:44:20.816339 3747840 master.cpp:4872] Re-registered agent 
> b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 at slave(1)@127.0.0.1:5051 
> (127.0.0.1) with cpus(*):8; mem(*):15360; disk(*):470832; 
> ports(*):[31000-32000]
> I0926 11:44:20.816385 1601536 hierarchical.cpp:478] Added agent 
> b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 (127.0.0.1) with cpus(*):8; 
> mem(*):15360; disk(*):470832; ports(*):[31000-32000] (allocated: cpus(*):0.9; 
> mem(*):402.653; disk(*):1000; ports(*):[31000-31000, 31001-31500])
> I0926 11:44:20.816437 3747840 master.cpp:4940] Sending updated checkpointed 
> resources  to agent b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 at 
> slave(1)@127.0.0.1:5051 (127.0.0.1)
> I0926 11:44:20.816787 4284416 master.cpp:5725] Sending 1 offers to framework 
> conductr (conductr) at 
> scheduler-3f8b9645-7a17-4e9f-8ad5-077fe8c23b39@192.168.2.106:57164
> {code}
> *Log: ConductR framework*
> {code:bash}
> I0926 11:44:20.007189 66441216 detector.cpp:152] Detected a new leader: 
> (id='87')
> I0926 11:44:20.007524 64294912 group.cpp:706] Trying to get 
> '/mesos/json.info_87' in ZooKeeper
> I0926 11:44:20.008625 63758336 zookeeper.cpp:259] A new leading master 
> (UPID=master@127.0.0.1:5050) is detected
> I0926 11:44:20.008965 63758336 sched.cpp:330] New master detected at 
> master@127.0.0.1:5050
> 

[jira] [Commented] (MESOS-6312) Add requirement in upgrade.md and getting-started.md for agent '--runtime_dir' when running as non-root

2016-10-05 Thread Kevin Klues (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15549883#comment-15549883
 ] 

Kevin Klues commented on MESOS-6312:


Possibly. How easy is it to change for just that one binary? I am not very 
familiar with {{mesos-local}}.

> Add requirement in upgrade.md and getting-started.md for agent 
> '--runtime_dir' when running as non-root
> --
>
> Key: MESOS-6312
> URL: https://issues.apache.org/jira/browse/MESOS-6312
> Project: Mesos
>  Issue Type: Task
>Reporter: Kevin Klues
>Priority: Blocker
> Fix For: 1.1.0
>
>
> We recently introduced a new agent flag for {{\-\-runtime_dir}}. Unlike the 
> {{\-\-work_dir}}, this directory is designed to hold the state of a running 
> agent between subsequent agent-restarts (but not across host reboots).
> By default, this flag is set to {{/var/run/mesos}} since this is a {{tmpfs}} 
> on linux that gets automatically cleaned up on reboot. However, on most 
> systems {{/var/run/mesos}} is only writable by root, causing problems when 
> launching an agent as non-root and not pointing {{--runtime_dir}} to a 
> different location.
> We need to call this out in the upgrade.md and getting-started.md docs so 
> that people know they may need to set this going forward.
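
For example, a non-root agent launch would need something like this (the paths 
are illustrative):

{noformat}
mesos-agent --master=127.0.0.1:5050 \
  --work_dir=$HOME/.mesos/work \
  --runtime_dir=$HOME/.mesos/runtime
{noformat}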



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6312) Add requirement in upgrade.md and getting-started.md for agent '--runtime_dir' when running as non-root

2016-10-05 Thread Kevin Klues (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Klues updated MESOS-6312:
---
Description: 
We recently introduced a new agent flag for {{\-\-runtime_dir}}. Unlike the 
{{\-\-work_dir}}, this directory is designed to hold the state of a running 
agent between subsequent agent-restarts (but not across host reboots).

By default, this flag is set to {{/var/run/mesos}} since this is a {{tmpfs}} 
on linux that gets automatically cleaned up on reboot. However, on most systems 
{{/var/run/mesos}} is only writable by root, causing problems when launching an 
agent as non-root and not pointing {{--runtime_dir}} to a different location.

We need to call this out in the upgrade.md and getting-started.md docs so that 
people know they may need to set this going forward.

  was:
We recently introduced a new agent flag for {{--runtime_dir}}. Unlike the 
{{--work_dir}}, this directory is designed to hold the state of a running agent 
between subsequent agent-restarts (but not across host reboots).

By default, this flag is set to {{/var/run/mesos}} since this is a {{tmpfs}} 
on linux that gets automatically cleaned up on reboot. However, on most systems 
{{/var/run/mesos}} is only writable by root, causing problems when launching an 
agent as non-root and not pointing {{--runtime_dir}} to a different location.

We need to call this out in the upgrade.md and getting-started.md docs so that 
people know they may need to set this going forward.


> Add requirement in upgrade.md and getting-started.md for agent 
> '--runtime_dir' when running as non-root
> --
>
> Key: MESOS-6312
> URL: https://issues.apache.org/jira/browse/MESOS-6312
> Project: Mesos
>  Issue Type: Task
>Reporter: Kevin Klues
>Priority: Blocker
> Fix For: 1.1.0
>
>
> We recently introduced a new agent flag for {{\-\-runtime_dir}}. Unlike the 
> {{\-\-work_dir}}, this directory is designed to hold the state of a running 
> agent between subsequent agent-restarts (but not across host reboots).
> By default, this flag is set to {{/var/run/mesos}} since this is a {{tmpfs}} 
> on linux that gets automatically cleaned up on reboot. However, on most 
> systems {{/var/run/mesos}} is only writable by root, causing problems when 
> launching an agent as non-root and not pointing {{--runtime_dir}} to a 
> different location.
> We need to call this out in the upgrade.md and getting-started.md docs so 
> that people know they may need to set this going forward.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5967) Add support for 'docker image inspect' in our docker abstraction

2016-10-05 Thread Kevin Klues (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Klues updated MESOS-5967:
---
Labels: gpu  (was: gpu mesosphere)

> Add support for 'docker image inspect' in our docker abstraction
> 
>
> Key: MESOS-5967
> URL: https://issues.apache.org/jira/browse/MESOS-5967
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Kevin Klues
>Assignee: Guangya Liu
>  Labels: gpu
> Fix For: 1.1.0
>
>
> Docker's command line tool for {{docker inspect}} can take either a 
> {{container}}, an {{image}}, or a {{task}} as its argument, and return a JSON 
> array containing low-level information about that container, image or task. 
> However, the current {{docker inspect}} support in our docker abstraction 
> only supports inspecting containers (not images or tasks).  We should expand 
> this to (at least) support images.
> In particular, this additional functionality is motivated by the upcoming GPU 
> support, which needs to inspect the labels in a docker image to decide if it 
> should inject the required Nvidia volumes into a container.  
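
For context, the CLI equivalent of what the abstraction would expose (the 
image name is just an example):

{noformat}
docker inspect --type=image --format '{{json .Config.Labels}}' nvidia/cuda
{noformat}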



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5967) Add support for 'docker image inspect' in our docker abstraction

2016-10-05 Thread Kevin Klues (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Klues updated MESOS-5967:
---
Description: 
Docker's command line tool for {{docker inspect}} can take either a 
{{container}}, an {{image}}, or a {{task}} as its argument, and return a JSON 
array containing low-level information about that container, image or task. 

However, the current {{docker inspect}} support in our docker abstraction only 
supports inspecting containers (not images or tasks).  We should expand this to 
(at least) support images.

In particular, this additional functionality is motivated by the upcoming GPU 
support, which needs to inspect the labels in a docker image to decide if it 
should inject the required Nvidia volumes into a container.  

  was:
Docker's command line tool for {{docker inspect}} can take either a 
{{container}}, an {{image}}, or a {{task}} as its argument, and return a JSON 
array containing low-level information about that {{container}}, {{image}} or 
{{task}}. 

However, the current {{docker inspect}} support in our docker abstraction only 
supports inspecting containers (not images or tasks).  We should expand this 
support to images.

In particular, this additional functionality is motivated by the upcoming GPU 
support, which needs to inspect the labels in a docker image to decide if it 
should inject the required Nvidia volumes into a container.  


> Add support for 'docker image inspect' in our docker abstraction
> 
>
> Key: MESOS-5967
> URL: https://issues.apache.org/jira/browse/MESOS-5967
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>  Labels: gpu, mesosphere
> Fix For: 1.1.0
>
>
> Docker's command line tool for {{docker inspect}} can take either a 
> {{container}}, an {{image}}, or a {{task}} as its argument, and return a JSON 
> array containing low-level information about that container, image or task. 
> However, the current {{docker inspect}} support in our docker abstraction 
> only supports inspecting containers (not images or tasks).  We should expand 
> this to (at least) support images.
> In particular, this additional functionality is motivated by the upcoming GPU 
> support, which needs to inspect the labels in a docker image to decide if it 
> should inject the required Nvidia volumes into a container.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5967) Add support for 'docker image inspect' in our docker abstraction

2016-10-05 Thread Kevin Klues (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Klues updated MESOS-5967:
---
Description: 
Docker's command line tool for {{docker inspect}} can take either a 
{{container}}, an {{image}}, or a {{task}} as its argument, and return a JSON 
array containing low-level information about that {{container}}, {{image}} or 
{{task}}. 

However, the current {{docker inspect}} support in our docker abstraction only 
supports inspecting containers (not images or tasks).  We should expand this 
support to images.

In particular, this additional functionality is motivated by the upcoming GPU 
support, which needs to inspect the labels in a docker image to decide if it 
should inject the required Nvidia volumes into a container.  

  was:Our current {{docker inspect}} support in our docker abstraction only 
supports inspecting containers (not images).  We should expand this support to 
images.


> Add support for 'docker image inspect' in our docker abstraction
> 
>
> Key: MESOS-5967
> URL: https://issues.apache.org/jira/browse/MESOS-5967
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>  Labels: gpu, mesosphere
> Fix For: 1.1.0
>
>
> Docker's command line tool for {{docker inspect}} can take either a 
> {{container}}, an {{image}}, or a {{task}} as its argument, and return a JSON 
> array containing low-level information about that {{container}}, {{image}} or 
> {{task}}. 
> However, the current {{docker inspect}} support in our docker abstraction 
> only supports inspecting containers (not images or tasks).  We should expand 
> this support to images.
> In particular, this additional functionality is motivated by the upcoming GPU 
> support, which needs to inspect the labels in a docker image to decide if it 
> should inject the required Nvidia volumes into a container.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6315) `killtree` can accidentally kill containerizer / executor

2016-10-05 Thread Joris Van Remoortere (JIRA)
Joris Van Remoortere created MESOS-6315:
---

 Summary: `killtree` can accidentally kill containerizer / executor
 Key: MESOS-6315
 URL: https://issues.apache.org/jira/browse/MESOS-6315
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Joris Van Remoortere


The implementation of killtree is buggy. [~jieyu] has some ideas.

ltrace of mesos-local:
{code}
[pid 19501] [0x7f89d77a61ab] libmesos-1.1.0.so->kill(29985, SIGKILL)
   = 0
[pid 19501] [0x7f89d77a61ab] libmesos-1.1.0.so->kill(31349, SIGKILL <unfinished ...>
[pid 31359] [0x] +++ killed by SIGKILL +++
[pid 31358] [0x] +++ killed by SIGKILL +++
[pid 31357] [0x] +++ killed by SIGKILL +++
[pid 31356] [0x] +++ killed by SIGKILL +++
[pid 31354] [0x] +++ killed by SIGKILL +++
[pid 31353] [0x] +++ killed by SIGKILL +++
[pid 31351] [0x] +++ killed by SIGKILL +++
[pid 31350] [0x] +++ killed by SIGKILL +++
[pid 19501] [0x7f89d77a61ab] <... kill resumed> )   
   = 0
[pid 19501] [0x7f89d77a61dd] libmesos-1.1.0.so->kill(29985, SIGCONT <unfinished ...>
[pid 29985] [0x] +++ killed by SIGKILL +++
[pid 19493] [0x7f89d64ceda0] --- SIGCHLD (Child exited) ---
[pid 31352] [0x] +++ killed by SIGKILL +++
[pid 31349] [0x] +++ killed by SIGKILL +++
[pid 19501] [0x7f89d77a61dd] <... kill resumed> )   
   = 0
[pid 19501] [0x7f89d77a61dd] libmesos-1.1.0.so->kill(31349, SIGCONT)
   = -1
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6315) `killtree` can accidentally kill containerizer / executor

2016-10-05 Thread Joris Van Remoortere (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15549602#comment-15549602
 ] 

Joris Van Remoortere commented on MESOS-6315:
-

Since {{killtree}} is only used in the posix containerizer, this is not a 
blocker.

> `killtree` can accidentally kill containerizer / executor
> -
>
> Key: MESOS-6315
> URL: https://issues.apache.org/jira/browse/MESOS-6315
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Joris Van Remoortere
>
> The implementation of killtree is buggy. [~jieyu] has some ideas.
> ltrace of mesos-local:
> {code}
> [pid 19501] [0x7f89d77a61ab] libmesos-1.1.0.so->kill(29985, SIGKILL)  
>  = 0
> [pid 19501] [0x7f89d77a61ab] libmesos-1.1.0.so->kill(31349, SIGKILL <no return ...>
> [pid 31359] [0x] +++ killed by SIGKILL +++
> [pid 31358] [0x] +++ killed by SIGKILL +++
> [pid 31357] [0x] +++ killed by SIGKILL +++
> [pid 31356] [0x] +++ killed by SIGKILL +++
> [pid 31354] [0x] +++ killed by SIGKILL +++
> [pid 31353] [0x] +++ killed by SIGKILL +++
> [pid 31351] [0x] +++ killed by SIGKILL +++
> [pid 31350] [0x] +++ killed by SIGKILL +++
> [pid 19501] [0x7f89d77a61ab] <... kill resumed> ) 
>  = 0
> [pid 19501] [0x7f89d77a61dd] libmesos-1.1.0.so->kill(29985, SIGCONT <no return ...>
> [pid 29985] [0x] +++ killed by SIGKILL +++
> [pid 19493] [0x7f89d64ceda0] --- SIGCHLD (Child exited) ---
> [pid 31352] [0x] +++ killed by SIGKILL +++
> [pid 31349] [0x] +++ killed by SIGKILL +++
> [pid 19501] [0x7f89d77a61dd] <... kill resumed> ) 
>  = 0
> [pid 19501] [0x7f89d77a61dd] libmesos-1.1.0.so->kill(31349, SIGCONT)  
>  = -1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6119) TCP health checks are not portable.

2016-10-05 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6119:
---
Shepherd: Till Toenshoff

> TCP health checks are not portable.
> ---
>
> Key: MESOS-6119
> URL: https://issues.apache.org/jira/browse/MESOS-6119
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>Priority: Blocker
>  Labels: health-check, mesosphere
> Fix For: 1.1.0
>
>
> MESOS-3567 introduced a dependency on "bash" for TCP health checks, which is 
> undesirable. We should implement a portable solution for TCP health checks.
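
As a rough illustration of the intended direction (a sketch under my own 
assumptions, not the actual patch), a TCP health check reduces to attempting a 
TCP connection, with no shell involved. This version is POSIX-only; a truly 
portable solution would also need to cover the Windows socket API:

{code}
// Minimal POSIX sketch of a shell-free TCP health check: the check
// succeeds iff a TCP connection to ip:port can be established.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cstdint>

bool tcpHealthCheck(const char* ip, int port)
{
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) {
    return false;
  }

  sockaddr_in addr = {};
  addr.sin_family = AF_INET;
  addr.sin_port = htons(static_cast<uint16_t>(port));
  inet_pton(AF_INET, ip, &addr.sin_addr);

  const bool connected =
    connect(fd, reinterpret_cast<const sockaddr*>(&addr), sizeof(addr)) == 0;

  close(fd);
  return connected;
}
{code}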



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-6279) Add test cases for the TCP health check

2016-10-05 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15533465#comment-15533465
 ] 

haosdent edited comment on MESOS-6279 at 10/5/16 5:27 PM:
--

| Added test cases for TCP health check. | https://reviews.apache.org/r/52251/ |


was (Author: haosd...@gmail.com):
| Added test case `HealthCheckTest.HealthyTaskViaTCP`. | 
https://reviews.apache.org/r/52251/ |
| Added test case `HealthCheckTest.ROOT_DOCKER_DockerHealthyTaskViaTCP`. | 
https://reviews.apache.org/r/52253/ |
| Added test case `ROOT_HealthyTaskViaTCPWithContainerImage`. | 
https://reviews.apache.org/r/52558/ |

> Add test cases for the TCP health check
> ---
>
> Key: MESOS-6279
> URL: https://issues.apache.org/jira/browse/MESOS-6279
> Project: Mesos
>  Issue Type: Task
>  Components: tests
>Reporter: haosdent
>Assignee: haosdent
>  Labels: health-check, mesosphere, test
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-6278) Add test cases for the HTTP health checks

2016-10-05 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15533460#comment-15533460
 ] 

haosdent edited comment on MESOS-6278 at 10/5/16 5:26 PM:
--

| Added test cases for HTTP health check. | https://reviews.apache.org/r/52250/ |


was (Author: haosd...@gmail.com):
| Added test case `HealthCheckTest.HealthyTaskViaHTTP`. | 
https://reviews.apache.org/r/52250/ |
| Added test case `HealthCheckTest.ROOT_DOCKER_DockerHealthyTaskViaHTTP`. | 
https://reviews.apache.org/r/52252/ |
| Added test case `ROOT_HealthyTaskViaHTTPWithContainerImage`. | 
https://reviews.apache.org/r/52557/ |

> Add test cases for the HTTP health checks
> -
>
> Key: MESOS-6278
> URL: https://issues.apache.org/jira/browse/MESOS-6278
> Project: Mesos
>  Issue Type: Task
>  Components: tests
>Reporter: haosdent
>Assignee: haosdent
>  Labels: health-check, mesosphere, test
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-6279) Add test cases for the TCP health check

2016-10-05 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15533465#comment-15533465
 ] 

haosdent edited comment on MESOS-6279 at 10/5/16 4:12 PM:
--

| Added test case `HealthCheckTest.HealthyTaskViaTCP`. | 
https://reviews.apache.org/r/52251/ |
| Added test case `HealthCheckTest.ROOT_DOCKER_DockerHealthyTaskViaTCP`. | 
https://reviews.apache.org/r/52253/ |
| Added test case `ROOT_HealthyTaskViaTCPWithContainerImage`. | 
https://reviews.apache.org/r/52558/ |


was (Author: haosd...@gmail.com):
| Added test case `HealthCheckTest.HealthyTaskViaTCP`. | 
https://reviews.apache.org/r/52251/ |
| Added test case `HealthCheckTest.ROOT_DOCKER_DockerHealthyTaskViaTCP`. | 
https://reviews.apache.org/r/52253/ |

> Add test cases for the TCP health check
> ---
>
> Key: MESOS-6279
> URL: https://issues.apache.org/jira/browse/MESOS-6279
> Project: Mesos
>  Issue Type: Task
>  Components: tests
>Reporter: haosdent
>Assignee: haosdent
>  Labels: health-check, mesosphere, test
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-6278) Add test cases for the HTTP health checks

2016-10-05 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15533460#comment-15533460
 ] 

haosdent edited comment on MESOS-6278 at 10/5/16 4:11 PM:
--

| Added test case `HealthCheckTest.HealthyTaskViaHTTP`. | 
https://reviews.apache.org/r/52250/ |
| Added test case `HealthCheckTest.ROOT_DOCKER_DockerHealthyTaskViaHTTP`. | 
https://reviews.apache.org/r/52252/ |
| Added test case `ROOT_HealthyTaskViaHTTPWithContainerImage`. | 
https://reviews.apache.org/r/52557/ |


was (Author: haosd...@gmail.com):
| Added test case `HealthCheckTest.HealthyTaskViaHTTP`. | 
https://reviews.apache.org/r/52250/ |
| Added test case `HealthCheckTest.ROOT_DOCKER_DockerHealthyTaskViaHTTP`. | 
https://reviews.apache.org/r/52252/ |

> Add test cases for the HTTP health checks
> -
>
> Key: MESOS-6278
> URL: https://issues.apache.org/jira/browse/MESOS-6278
> Project: Mesos
>  Issue Type: Task
>  Components: tests
>Reporter: haosdent
>Assignee: haosdent
>  Labels: health-check, mesosphere, test
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-6264) Investigate the high memory usage of the default executor.

2016-10-05 Thread Joris Van Remoortere (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15549111#comment-15549111
 ] 

Joris Van Remoortere edited comment on MESOS-6264 at 10/5/16 3:45 PM:
--

cc [~vinodkone] [~jieyu]
The bulk of this comes from loading in {{libmesos.so}}.
We do this because the autoconf build treats libmesos as a dynamic dependency.
Since we load libmesos dynamically, there is no chance for the linker to strip 
unused code. This means that all of the code in libmesos, regardless of use, 
gets loaded into resident memory.
In contrast, the cmake build generates a static library, {{libmesos.a}}. This 
is then used to build the {{mesos-executor}} binary without a dynamic 
dependency on libmesos. The benefit of this approach is that the linker is able 
to strip out all unused code. In an optimized build this is {{~10MB}}.

Some approaches for the quick win are:
# Consider using the cmake build. This only needs to be modified slightly to 
strip symbols from the final executor binary ({{-s}}).
# Modify the autoconf build to build a {{libmesos.a}} so that we can statically 
link it into the {{mesos-executor}} binary and allow the linker to strip 
unused code.

Regardless of the above approach, {{libmesos}} would still be by far the 
largest contributor to the {{RSS}}. This is for two reasons:
# Much of our code is structured such that the linker can't determine if it is 
unused. We would need to adjust our patterns so that the unused-code analyzer 
can do a better job.
# Much of our code is {{inlined}} or written such that it can't be optimized. 
Two examples are:
## https://github.com/apache/mesos/blob/9beb8eae6408249cdb3e2f16ba68b31a00d3452c/3rdparty/libprocess/include/process/mime.hpp#L35-L154
This code could be moved to a {{.cpp}} file and should be a {{static const 
std::unordered_map<std::string, std::string>}} that we {{insert(begin(), 
end())}} into {{types}}. This would reduce the size of libmesos by {{~20KB}}!
## https://github.com/apache/mesos/blob/9beb8eae6408249cdb3e2f16ba68b31a00d3452c/3rdparty/libprocess/include/process/http.hpp#L453-L517
This code and its sibling {{struct Request}} have auto-generated {{inlined}} 
destructors. These are very expensive. Just declaring the default destructor 
and then defining it in the {{.cpp}} can remove another {{~20KB}} each from 
libmesos. There are plenty of other opportunities like this scattered through 
the codebase. It's work to find them and the returns are small for each, but 
they add up to much of the {{9MB}} left over.


was (Author: jvanremoortere):
cc [~vinodkone] [~jieyu]
The bulk of this comes from loading in {{libmesos.so}}.
We do this because the autoconf build treats libmesos as a dynamic dependency.
Since we load libmesos dynamically, there is no chance for the linker to strip 
unused code. This means that all of the code in libmesos, regardless of use, 
gets loaded into resident memory.
In contrast, the cmake build generates a static library, {{libmesos.a}}. This 
is then used to build the {{mesos-executor}} binary without a dynamic 
dependency on libmesos. The benefit of this approach is that the linker is able 
to strip out all unused code. In an optimized build this is {{~10MB}}.

Some approaches for the quick win are:
# Consider using the cmake build. This only needs to be modified slightly to 
strip symbols from the final executor binary ({{-s}}).
# Modify the autoconf build to build a {{libmesos.a}} so that we can statically 
link it into the {{mesos-executor}} binary and allow the linker to strip 
unused code.

Regardless of the above approach, {{libmesos}} would still be by far the 
largest contributor to the {{RSS}}. This is for two reasons:
# Much of our code is structured such that the linker can't determine if it is 
unused. We would need to adjust our patterns so that the unused-code analyzer 
can do a better job.
# Much of our code is {{inlined}} or written such that it can't be optimized. 
Two examples are:
## https://github.com/apache/mesos/blob/9beb8eae6408249cdb3e2f16ba68b31a00d3452c/3rdparty/libprocess/include/process/mime.hpp#L35-L154
This code could be moved to a {{.cpp}} file and should be a {{static const 
std::unordered_map<std::string, std::string>}} that we {{insert(begin(), 
end())}} into {{types}}. This would reduce the size of libmesos by {{~20KB}}!
## https://github.com/apache/mesos/blob/master/3rdparty/libprocess/include/process/http.hpp#L453-L517
This code and its sibling {{struct Request}} have auto-generated {{inlined}} 
destructors. These are very expensive. Just declaring the default destructor 
and then defining it in the {{.cpp}} can remove another {{~20KB}} each from 
libmesos. There are plenty of other opportunities like this scattered through 
the codebase. It's work to find them and the returns are small for each, but 
they add up to much of the {{9MB}} left over.

> Investigate the high memory usage of the default 

[jira] [Commented] (MESOS-6264) Investigate the high memory usage of the default executor.

2016-10-05 Thread Joris Van Remoortere (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15549111#comment-15549111
 ] 

Joris Van Remoortere commented on MESOS-6264:
-

cc [~vinodkone] [~jieyu]
The bulk of this comes from loading in {{libmesos.so}}.
We do this because the autoconf build treats libmesos as a dynamic dependency.
Since we load libmesos dynamically, there is no chance for the linker to strip 
unused code. This means that all of the code in libmesos, regardless of use, 
gets loaded into resident memory.
In contrast, the cmake build generates a static library, {{libmesos.a}}. This 
is then used to build the {{mesos-executor}} binary without a dynamic 
dependency on libmesos. The benefit of this approach is that the linker is able 
to strip out all unused code. In an optimized build this is {{~10MB}}.

Some approaches for the quick win are:
# Consider using the cmake build. This only needs to be modified slightly to 
strip symbols from the final executor binary ({{-s}}).
# Modify the autoconf build to build a {{libmesos.a}} so that we can statically 
link it into the {{mesos-executor}} binary and allow the linker to strip 
unused code.

Regardless of the above approach, {{libmesos}} would still be by far the 
largest contributor to the {{RSS}}. This is for two reasons:
# Much of our code is structured such that the linker can't determine if it is 
unused. We would need to adjust our patterns so that the unused-code analyzer 
can do a better job.
# Much of our code is {{inlined}} or written such that it can't be optimized. 
Two examples are:
## https://github.com/apache/mesos/blob/9beb8eae6408249cdb3e2f16ba68b31a00d3452c/3rdparty/libprocess/include/process/mime.hpp#L35-L154
This code could be moved to a {{.cpp}} file and should be a {{static const 
std::unordered_map<std::string, std::string>}} that we {{insert(begin(), 
end())}} into {{types}}. This would reduce the size of libmesos by {{~20KB}}!
## https://github.com/apache/mesos/blob/master/3rdparty/libprocess/include/process/http.hpp#L453-L517
This code and its sibling {{struct Request}} have auto-generated {{inlined}} 
destructors. These are very expensive. Just declaring the default destructor 
and then defining it in the {{.cpp}} can remove another {{~20KB}} each from 
libmesos. There are plenty of other opportunities like this scattered through 
the codebase. It's work to find them and the returns are small for each, but 
they add up to much of the {{9MB}} left over.
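
To make the two suggestions above concrete, here are minimal sketches (names 
and values are illustrative assumptions, not the actual libprocess code):

{code}
// Sketch 1: move the mime table out of the header. One .cpp owns the
// data, so it is emitted once instead of at every inclusion site.
#include <map>
#include <string>
#include <unordered_map>

namespace mime {

// Assumed to mirror process::mime::types.
std::map<std::string, std::string> types;

void initialize()
{
  static const std::unordered_map<std::string, std::string> defaults = {
    {".html", "text/html"},
    {".json", "application/json"},
    // ... remaining entries ...
  };

  types.insert(defaults.begin(), defaults.end());
}

} // namespace mime
{code}

{code}
// Sketch 2: out-line a compiler-generated destructor.

// In the header: declare the destructor instead of letting the compiler
// generate it inline at every destruction site.
struct Request
{
  ~Request();
  // ... members with non-trivial destructors ...
};

// In the .cpp: the (expensive) destructor body is now emitted exactly once.
Request::~Request() = default;
{code}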

> Investigate the high memory usage of the default executor.
> --
>
> Key: MESOS-6264
> URL: https://issues.apache.org/jira/browse/MESOS-6264
> Project: Mesos
>  Issue Type: Bug
>Reporter: Anand Mazumdar
>  Labels: mesosphere
> Fix For: 1.1.0
>
> Attachments: pmap_output_for_the_default_executor.txt
>
>
> It seems that a default executor with two sleep tasks is using ~32 mb on 
> average and can sometimes lead to it being killed for some tests like 
> {{SlaveRecoveryTest/0.ROOT_CGROUPS_ReconnectDefaultExecutor}} on our internal 
> CI. Attached the {{pmap}} output for the default executor. Please note that 
> the command executor memory usage is also pretty high (~26 mb).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6278) Add test cases for the HTTP health checks

2016-10-05 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6278:
---
Shepherd: Alexander Rukletsov
  Sprint: Mesosphere Sprint 44
Story Points: 3
Target Version/s: 1.1.0
  Labels: health-check mesosphere test  (was: health-check test)
 Component/s: tests

> Add test cases for the HTTP health checks
> -
>
> Key: MESOS-6278
> URL: https://issues.apache.org/jira/browse/MESOS-6278
> Project: Mesos
>  Issue Type: Task
>  Components: tests
>Reporter: haosdent
>Assignee: haosdent
>  Labels: health-check, mesosphere, test
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6207) Python bindings fail to build with custom SVN installation path

2016-10-05 Thread Till Toenshoff (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Toenshoff updated MESOS-6207:
--
Shepherd: Till Toenshoff

> Python bindings fail to build with custom SVN installation path
> ---
>
> Key: MESOS-6207
> URL: https://issues.apache.org/jira/browse/MESOS-6207
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.0.1
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>Priority: Trivial
>
> In {{src/Makefile.am}} the {{PYTHON_LDFLAGS}} variable is used while building 
> the Python bindings. This variable picks up {{LDFLAGS}} during the 
> configuration phase, before we check for a custom SVN installation path, and 
> so misses the {{-L$\{with_svn\}/lib}} flag. That causes a link error on 
> systems with an uncommon SVN installation path.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6207) Python bindings fail to build with custom SVN installation path

2016-10-05 Thread Till Toenshoff (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15548754#comment-15548754
 ] 

Till Toenshoff commented on MESOS-6207:
---

Thanks for your patience, Ilya - I have taken over the shepherding after Vinod 
nagged me enough ;).

> Python bindings fail to build with custom SVN installation path
> ---
>
> Key: MESOS-6207
> URL: https://issues.apache.org/jira/browse/MESOS-6207
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.0.1
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>Priority: Trivial
>
> In {{src/Makefile.am}} the {{PYTHON_LDFLAGS}} variable is used while building 
> the Python bindings. This variable picks up {{LDFLAGS}} during the 
> configuration phase, before we check for a custom SVN installation path, and 
> so misses the {{-L$\{with_svn\}/lib}} flag. That causes a link error on 
> systems with an uncommon SVN installation path.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6279) Add test cases for the TCP health check

2016-10-05 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6279:
---
Shepherd: Alexander Rukletsov
  Sprint: Mesosphere Sprint 44
Story Points: 3
Target Version/s: 1.1.0
  Labels: health-check mesosphere test  (was: health-check)
 Component/s: tests

> Add test cases for the TCP health check
> ---
>
> Key: MESOS-6279
> URL: https://issues.apache.org/jira/browse/MESOS-6279
> Project: Mesos
>  Issue Type: Task
>  Components: tests
>Reporter: haosdent
>Assignee: haosdent
>  Labels: health-check, mesosphere, test
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6247) Enable Framework to set weight

2016-10-05 Thread Klaus Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15548683#comment-15548683
 ] 

Klaus Ma commented on MESOS-6247:
-

[~jvanremoortere], yes, frameworks in different roles cannot share the 
reserved resources with each other.

For the weight, it's better to let Mesos allocate resources within a role: 
other frameworks, e.g. Storm, may be deployed in this environment, and it 
would be a huge amount of work to modify those frameworks one by one.

I agree with you and BenM that hierarchical roles are the long-term solution; 
but is there any suggestion on the target date?

BTW, what are other users' scenarios with multiple frameworks?

> Enable Framework to set weight
> --
>
> Key: MESOS-6247
> URL: https://issues.apache.org/jira/browse/MESOS-6247
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
> Environment: all
>Reporter: Klaus Ma
>Priority: Critical
>
> We'd like to enable a framework to set its weight when it registers, so that 
> frameworks can share resources based on weight within the same role.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1653) HealthCheckTest.GracePeriod is flaky.

2016-10-05 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/MESOS-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gastón Kleiman updated MESOS-1653:
--
Shepherd: Alexander Rukletsov
  Sprint: Mesosphere Sprint 44
Target Version/s: 1.1.0

> HealthCheckTest.GracePeriod is flaky.
> -
>
> Key: MESOS-1653
> URL: https://issues.apache.org/jira/browse/MESOS-1653
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Benjamin Mahler
>Assignee: Gastón Kleiman
>  Labels: flaky, health-check, mesosphere
>
> {noformat}
> [--] 3 tests from HealthCheckTest
> [ RUN  ] HealthCheckTest.GracePeriod
> Using temporary directory '/tmp/HealthCheckTest_GracePeriod_d7zCPr'
> I0729 17:10:10.484951  1176 leveldb.cpp:176] Opened db in 28.883552ms
> I0729 17:10:10.499487  1176 leveldb.cpp:183] Compacted db in 13.674118ms
> I0729 17:10:10.500200  1176 leveldb.cpp:198] Created db iterator in 7394ns
> I0729 17:10:10.500692  1176 leveldb.cpp:204] Seeked to beginning of db in 
> 2317ns
> I0729 17:10:10.501113  1176 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 1367ns
> I0729 17:10:10.501535  1176 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0729 17:10:10.502233  1212 recover.cpp:425] Starting replica recovery
> I0729 17:10:10.502295  1212 recover.cpp:451] Replica is in EMPTY status
> I0729 17:10:10.502825  1212 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0729 17:10:10.502877  1212 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0729 17:10:10.502980  1212 recover.cpp:542] Updating replica status to 
> STARTING
> I0729 17:10:10.508482  1213 master.cpp:289] Master 
> 20140729-171010-16842879-54701-1176 (trusty) started on 127.0.1.1:54701
> I0729 17:10:10.508607  1213 master.cpp:326] Master only allowing 
> authenticated frameworks to register
> I0729 17:10:10.508632  1213 master.cpp:331] Master only allowing 
> authenticated slaves to register
> I0729 17:10:10.508656  1213 credentials.hpp:36] Loading credentials for 
> authentication from '/tmp/HealthCheckTest_GracePeriod_d7zCPr/credentials'
> I0729 17:10:10.509407  1213 master.cpp:360] Authorization enabled
> I0729 17:10:10.510030  1207 hierarchical_allocator_process.hpp:301] 
> Initializing hierarchical allocator process with master : 
> master@127.0.1.1:54701
> I0729 17:10:10.510113  1207 master.cpp:123] No whitelist given. Advertising 
> offers for all slaves
> I0729 17:10:10.511699  1213 master.cpp:1129] The newly elected leader is 
> master@127.0.1.1:54701 with id 20140729-171010-16842879-54701-1176
> I0729 17:10:10.512230  1213 master.cpp:1142] Elected as the leading master!
> I0729 17:10:10.512692  1213 master.cpp:960] Recovering from registrar
> I0729 17:10:10.513226  1210 registrar.cpp:313] Recovering registrar
> I0729 17:10:10.516006  1212 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 12.946461ms
> I0729 17:10:10.516047  1212 replica.cpp:320] Persisted replica status to 
> STARTING
> I0729 17:10:10.516129  1212 recover.cpp:451] Replica is in STARTING status
> I0729 17:10:10.516520  1212 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I0729 17:10:10.516592  1212 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I0729 17:10:10.516767  1212 recover.cpp:542] Updating replica status to VOTING
> I0729 17:10:10.528376  1212 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 11.537102ms
> I0729 17:10:10.528430  1212 replica.cpp:320] Persisted replica status to 
> VOTING
> I0729 17:10:10.528501  1212 recover.cpp:556] Successfully joined the Paxos 
> group
> I0729 17:10:10.528565  1212 recover.cpp:440] Recover process terminated
> I0729 17:10:10.528700  1212 log.cpp:656] Attempting to start the writer
> I0729 17:10:10.528960  1212 replica.cpp:474] Replica received implicit 
> promise request with proposal 1
> I0729 17:10:10.537821  1212 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 8.830863ms
> I0729 17:10:10.537869  1212 replica.cpp:342] Persisted promised to 1
> I0729 17:10:10.540550  1209 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0729 17:10:10.540856  1209 replica.cpp:375] Replica received explicit 
> promise request for position 0 with proposal 2
> I0729 17:10:10.547430  1209 leveldb.cpp:343] Persisting action (8 bytes) to 
> leveldb took 6.548344ms
> I0729 17:10:10.547471  1209 replica.cpp:676] Persisted action at 0
> I0729 17:10:10.547732  1209 replica.cpp:508] Replica received write request 
> for position 0
> I0729 17:10:10.547765  1209 leveldb.cpp:438] Reading position from leveldb 
> took 15676ns
> I0729 17:10:10.557169  1209 leveldb.cpp:343] Persisting action 

[jira] [Commented] (MESOS-6247) Enable Framework to set weight

2016-10-05 Thread Joris Van Remoortere (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15548652#comment-15548652
 ] 

Joris Van Remoortere commented on MESOS-6247:
-

[~klaus1982] Do you mean they cannot share reserved resources with each other?

If they are in the same role they are supposed to be co-operative. At that 
point why does the weight matter? They should both be yielding any resources 
they don't need to each other.

If we add support for weights now it will make it *even* harder to move people 
into the hierarchical role world described by BenM. It seems like the 
frameworks co-operating (as they should per the contract of sharing a role) is 
the right temporary solution for you.

> Enable Framework to set weight
> --
>
> Key: MESOS-6247
> URL: https://issues.apache.org/jira/browse/MESOS-6247
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
> Environment: all
>Reporter: Klaus Ma
>Priority: Critical
>
> We'd like to enable a framework to set its weight when it registers, so that 
> frameworks can share resources based on weight within the same role.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1653) HealthCheckTest.GracePeriod is flaky.

2016-10-05 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-1653:

Assignee: Gastón Kleiman  (was: haosdent)

> HealthCheckTest.GracePeriod is flaky.
> -
>
> Key: MESOS-1653
> URL: https://issues.apache.org/jira/browse/MESOS-1653
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Benjamin Mahler
>Assignee: Gastón Kleiman
>  Labels: flaky, health-check, mesosphere
>
> {noformat}
> [--] 3 tests from HealthCheckTest
> [ RUN  ] HealthCheckTest.GracePeriod
> Using temporary directory '/tmp/HealthCheckTest_GracePeriod_d7zCPr'
> I0729 17:10:10.484951  1176 leveldb.cpp:176] Opened db in 28.883552ms
> I0729 17:10:10.499487  1176 leveldb.cpp:183] Compacted db in 13.674118ms
> I0729 17:10:10.500200  1176 leveldb.cpp:198] Created db iterator in 7394ns
> I0729 17:10:10.500692  1176 leveldb.cpp:204] Seeked to beginning of db in 
> 2317ns
> I0729 17:10:10.501113  1176 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 1367ns
> I0729 17:10:10.501535  1176 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0729 17:10:10.502233  1212 recover.cpp:425] Starting replica recovery
> I0729 17:10:10.502295  1212 recover.cpp:451] Replica is in EMPTY status
> I0729 17:10:10.502825  1212 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0729 17:10:10.502877  1212 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0729 17:10:10.502980  1212 recover.cpp:542] Updating replica status to 
> STARTING
> I0729 17:10:10.508482  1213 master.cpp:289] Master 
> 20140729-171010-16842879-54701-1176 (trusty) started on 127.0.1.1:54701
> I0729 17:10:10.508607  1213 master.cpp:326] Master only allowing 
> authenticated frameworks to register
> I0729 17:10:10.508632  1213 master.cpp:331] Master only allowing 
> authenticated slaves to register
> I0729 17:10:10.508656  1213 credentials.hpp:36] Loading credentials for 
> authentication from '/tmp/HealthCheckTest_GracePeriod_d7zCPr/credentials'
> I0729 17:10:10.509407  1213 master.cpp:360] Authorization enabled
> I0729 17:10:10.510030  1207 hierarchical_allocator_process.hpp:301] 
> Initializing hierarchical allocator process with master : 
> master@127.0.1.1:54701
> I0729 17:10:10.510113  1207 master.cpp:123] No whitelist given. Advertising 
> offers for all slaves
> I0729 17:10:10.511699  1213 master.cpp:1129] The newly elected leader is 
> master@127.0.1.1:54701 with id 20140729-171010-16842879-54701-1176
> I0729 17:10:10.512230  1213 master.cpp:1142] Elected as the leading master!
> I0729 17:10:10.512692  1213 master.cpp:960] Recovering from registrar
> I0729 17:10:10.513226  1210 registrar.cpp:313] Recovering registrar
> I0729 17:10:10.516006  1212 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 12.946461ms
> I0729 17:10:10.516047  1212 replica.cpp:320] Persisted replica status to 
> STARTING
> I0729 17:10:10.516129  1212 recover.cpp:451] Replica is in STARTING status
> I0729 17:10:10.516520  1212 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I0729 17:10:10.516592  1212 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I0729 17:10:10.516767  1212 recover.cpp:542] Updating replica status to VOTING
> I0729 17:10:10.528376  1212 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 11.537102ms
> I0729 17:10:10.528430  1212 replica.cpp:320] Persisted replica status to 
> VOTING
> I0729 17:10:10.528501  1212 recover.cpp:556] Successfully joined the Paxos 
> group
> I0729 17:10:10.528565  1212 recover.cpp:440] Recover process terminated
> I0729 17:10:10.528700  1212 log.cpp:656] Attempting to start the writer
> I0729 17:10:10.528960  1212 replica.cpp:474] Replica received implicit 
> promise request with proposal 1
> I0729 17:10:10.537821  1212 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 8.830863ms
> I0729 17:10:10.537869  1212 replica.cpp:342] Persisted promised to 1
> I0729 17:10:10.540550  1209 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0729 17:10:10.540856  1209 replica.cpp:375] Replica received explicit 
> promise request for position 0 with proposal 2
> I0729 17:10:10.547430  1209 leveldb.cpp:343] Persisting action (8 bytes) to 
> leveldb took 6.548344ms
> I0729 17:10:10.547471  1209 replica.cpp:676] Persisted action at 0
> I0729 17:10:10.547732  1209 replica.cpp:508] Replica received write request 
> for position 0
> I0729 17:10:10.547765  1209 leveldb.cpp:438] Reading position from leveldb 
> took 15676ns
> I0729 17:10:10.557169  1209 leveldb.cpp:343] Persisting action (14 bytes) to 
> leveldb took 9.373798ms
> I0729 17:10:10.557241  1209 

[jira] [Issue Comment Deleted] (MESOS-1653) HealthCheckTest.GracePeriod is flaky.

2016-10-05 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-1653:

Comment: was deleted

(was: Patch:

https://reviews.apache.org/r/47089/)

> HealthCheckTest.GracePeriod is flaky.
> -
>
> Key: MESOS-1653
> URL: https://issues.apache.org/jira/browse/MESOS-1653
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Benjamin Mahler
>Assignee: haosdent
>  Labels: flaky, health-check, mesosphere
>
> {noformat}
> [--] 3 tests from HealthCheckTest
> [ RUN  ] HealthCheckTest.GracePeriod
> Using temporary directory '/tmp/HealthCheckTest_GracePeriod_d7zCPr'
> I0729 17:10:10.484951  1176 leveldb.cpp:176] Opened db in 28.883552ms
> I0729 17:10:10.499487  1176 leveldb.cpp:183] Compacted db in 13.674118ms
> I0729 17:10:10.500200  1176 leveldb.cpp:198] Created db iterator in 7394ns
> I0729 17:10:10.500692  1176 leveldb.cpp:204] Seeked to beginning of db in 
> 2317ns
> I0729 17:10:10.501113  1176 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 1367ns
> I0729 17:10:10.501535  1176 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0729 17:10:10.502233  1212 recover.cpp:425] Starting replica recovery
> I0729 17:10:10.502295  1212 recover.cpp:451] Replica is in EMPTY status
> I0729 17:10:10.502825  1212 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0729 17:10:10.502877  1212 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0729 17:10:10.502980  1212 recover.cpp:542] Updating replica status to 
> STARTING
> I0729 17:10:10.508482  1213 master.cpp:289] Master 
> 20140729-171010-16842879-54701-1176 (trusty) started on 127.0.1.1:54701
> I0729 17:10:10.508607  1213 master.cpp:326] Master only allowing 
> authenticated frameworks to register
> I0729 17:10:10.508632  1213 master.cpp:331] Master only allowing 
> authenticated slaves to register
> I0729 17:10:10.508656  1213 credentials.hpp:36] Loading credentials for 
> authentication from '/tmp/HealthCheckTest_GracePeriod_d7zCPr/credentials'
> I0729 17:10:10.509407  1213 master.cpp:360] Authorization enabled
> I0729 17:10:10.510030  1207 hierarchical_allocator_process.hpp:301] 
> Initializing hierarchical allocator process with master : 
> master@127.0.1.1:54701
> I0729 17:10:10.510113  1207 master.cpp:123] No whitelist given. Advertising 
> offers for all slaves
> I0729 17:10:10.511699  1213 master.cpp:1129] The newly elected leader is 
> master@127.0.1.1:54701 with id 20140729-171010-16842879-54701-1176
> I0729 17:10:10.512230  1213 master.cpp:1142] Elected as the leading master!
> I0729 17:10:10.512692  1213 master.cpp:960] Recovering from registrar
> I0729 17:10:10.513226  1210 registrar.cpp:313] Recovering registrar
> I0729 17:10:10.516006  1212 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 12.946461ms
> I0729 17:10:10.516047  1212 replica.cpp:320] Persisted replica status to 
> STARTING
> I0729 17:10:10.516129  1212 recover.cpp:451] Replica is in STARTING status
> I0729 17:10:10.516520  1212 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I0729 17:10:10.516592  1212 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I0729 17:10:10.516767  1212 recover.cpp:542] Updating replica status to VOTING
> I0729 17:10:10.528376  1212 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 11.537102ms
> I0729 17:10:10.528430  1212 replica.cpp:320] Persisted replica status to 
> VOTING
> I0729 17:10:10.528501  1212 recover.cpp:556] Successfully joined the Paxos 
> group
> I0729 17:10:10.528565  1212 recover.cpp:440] Recover process terminated
> I0729 17:10:10.528700  1212 log.cpp:656] Attempting to start the writer
> I0729 17:10:10.528960  1212 replica.cpp:474] Replica received implicit 
> promise request with proposal 1
> I0729 17:10:10.537821  1212 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 8.830863ms
> I0729 17:10:10.537869  1212 replica.cpp:342] Persisted promised to 1
> I0729 17:10:10.540550  1209 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0729 17:10:10.540856  1209 replica.cpp:375] Replica received explicit 
> promise request for position 0 with proposal 2
> I0729 17:10:10.547430  1209 leveldb.cpp:343] Persisting action (8 bytes) to 
> leveldb took 6.548344ms
> I0729 17:10:10.547471  1209 replica.cpp:676] Persisted action at 0
> I0729 17:10:10.547732  1209 replica.cpp:508] Replica received write request 
> for position 0
> I0729 17:10:10.547765  1209 leveldb.cpp:438] Reading position from leveldb 
> took 15676ns
> I0729 17:10:10.557169  1209 leveldb.cpp:343] Persisting action (14 bytes) to 
> leveldb took 9.373798ms
> I0729 

[jira] [Commented] (MESOS-1653) HealthCheckTest.GracePeriod is flaky.

2016-10-05 Thread JIRA

[ 
https://issues.apache.org/jira/browse/MESOS-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15548644#comment-15548644
 ] 

Gastón Kleiman commented on MESOS-1653:
---

Patch: https://reviews.apache.org/r/52432/

> HealthCheckTest.GracePeriod is flaky.
> -
>
> Key: MESOS-1653
> URL: https://issues.apache.org/jira/browse/MESOS-1653
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Benjamin Mahler
>Assignee: haosdent
>  Labels: flaky, health-check, mesosphere
>
> {noformat}
> [--] 3 tests from HealthCheckTest
> [ RUN  ] HealthCheckTest.GracePeriod
> Using temporary directory '/tmp/HealthCheckTest_GracePeriod_d7zCPr'
> I0729 17:10:10.484951  1176 leveldb.cpp:176] Opened db in 28.883552ms
> I0729 17:10:10.499487  1176 leveldb.cpp:183] Compacted db in 13.674118ms
> I0729 17:10:10.500200  1176 leveldb.cpp:198] Created db iterator in 7394ns
> I0729 17:10:10.500692  1176 leveldb.cpp:204] Seeked to beginning of db in 
> 2317ns
> I0729 17:10:10.501113  1176 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 1367ns
> I0729 17:10:10.501535  1176 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0729 17:10:10.502233  1212 recover.cpp:425] Starting replica recovery
> I0729 17:10:10.502295  1212 recover.cpp:451] Replica is in EMPTY status
> I0729 17:10:10.502825  1212 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0729 17:10:10.502877  1212 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0729 17:10:10.502980  1212 recover.cpp:542] Updating replica status to 
> STARTING
> I0729 17:10:10.508482  1213 master.cpp:289] Master 
> 20140729-171010-16842879-54701-1176 (trusty) started on 127.0.1.1:54701
> I0729 17:10:10.508607  1213 master.cpp:326] Master only allowing 
> authenticated frameworks to register
> I0729 17:10:10.508632  1213 master.cpp:331] Master only allowing 
> authenticated slaves to register
> I0729 17:10:10.508656  1213 credentials.hpp:36] Loading credentials for 
> authentication from '/tmp/HealthCheckTest_GracePeriod_d7zCPr/credentials'
> I0729 17:10:10.509407  1213 master.cpp:360] Authorization enabled
> I0729 17:10:10.510030  1207 hierarchical_allocator_process.hpp:301] 
> Initializing hierarchical allocator process with master : 
> master@127.0.1.1:54701
> I0729 17:10:10.510113  1207 master.cpp:123] No whitelist given. Advertising 
> offers for all slaves
> I0729 17:10:10.511699  1213 master.cpp:1129] The newly elected leader is 
> master@127.0.1.1:54701 with id 20140729-171010-16842879-54701-1176
> I0729 17:10:10.512230  1213 master.cpp:1142] Elected as the leading master!
> I0729 17:10:10.512692  1213 master.cpp:960] Recovering from registrar
> I0729 17:10:10.513226  1210 registrar.cpp:313] Recovering registrar
> I0729 17:10:10.516006  1212 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 12.946461ms
> I0729 17:10:10.516047  1212 replica.cpp:320] Persisted replica status to 
> STARTING
> I0729 17:10:10.516129  1212 recover.cpp:451] Replica is in STARTING status
> I0729 17:10:10.516520  1212 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I0729 17:10:10.516592  1212 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I0729 17:10:10.516767  1212 recover.cpp:542] Updating replica status to VOTING
> I0729 17:10:10.528376  1212 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 11.537102ms
> I0729 17:10:10.528430  1212 replica.cpp:320] Persisted replica status to 
> VOTING
> I0729 17:10:10.528501  1212 recover.cpp:556] Successfully joined the Paxos 
> group
> I0729 17:10:10.528565  1212 recover.cpp:440] Recover process terminated
> I0729 17:10:10.528700  1212 log.cpp:656] Attempting to start the writer
> I0729 17:10:10.528960  1212 replica.cpp:474] Replica received implicit 
> promise request with proposal 1
> I0729 17:10:10.537821  1212 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 8.830863ms
> I0729 17:10:10.537869  1212 replica.cpp:342] Persisted promised to 1
> I0729 17:10:10.540550  1209 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0729 17:10:10.540856  1209 replica.cpp:375] Replica received explicit 
> promise request for position 0 with proposal 2
> I0729 17:10:10.547430  1209 leveldb.cpp:343] Persisting action (8 bytes) to 
> leveldb took 6.548344ms
> I0729 17:10:10.547471  1209 replica.cpp:676] Persisted action at 0
> I0729 17:10:10.547732  1209 replica.cpp:508] Replica received write request 
> for position 0
> I0729 17:10:10.547765  1209 leveldb.cpp:438] Reading position from leveldb 
> took 15676ns
> I0729 17:10:10.557169  1209 leveldb.cpp:343] Persisting action (14 bytes) to 
> leveldb took 

[jira] [Commented] (MESOS-6249) On Mesos master failover the reregistered callback is not triggered

2016-10-05 Thread Joris Van Remoortere (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15548632#comment-15548632
 ] 

Joris Van Remoortere commented on MESOS-6249:
-

[~markusjura] It seems like you are hitting some logic around 
https://issues.apache.org/jira/browse/MESOS-786
You can see the comment here:
https://github.com/apache/mesos/blob/b70a22bad22e5e8668f9af62c575902dec7b0125/src/master/master.cpp#L2813-L2820

Pinging [~bmahler], who wrote the comment, and [~anandmazumdar] for reference.

> On Mesos master failover the reregistered callback is not triggered
> ---
>
> Key: MESOS-6249
> URL: https://issues.apache.org/jira/browse/MESOS-6249
> Project: Mesos
>  Issue Type: Bug
>  Components: java api
>Affects Versions: 0.28.0, 0.28.1, 1.0.1
> Environment: OS X 10.11.6
>Reporter: Markus Jura
>
> On a Mesos master failover the reregistered callback of the Java API is not 
> triggered. Only the registration callback is triggered which makes it hard 
> for a framework to distinguish between these scenarios.
> This behaviour has been tested with the ConductR framework, both with the 
> Java API version 0.28.0, 0.28.1 and 1.0.1. Below you find the logs from the 
> master that got re-elected and from the ConductR framework.
> *Log: Mesos master on a master re-election*
> {code:bash}
> I0926 11:44:20.008306 3747840 zookeeper.cpp:259] A new leading master 
> (UPID=master@127.0.0.1:5050) is detected
> I0926 11:44:20.008458 3747840 master.cpp:1847] The newly elected leader is 
> master@127.0.0.1:5050 with id ca5b9713-1eec-43e1-9d27-9ebc5c0f95b1
> I0926 11:44:20.008484 3747840 master.cpp:1860] Elected as the leading master!
> I0926 11:44:20.008498 3747840 master.cpp:1547] Recovering from registrar
> I0926 11:44:20.008607 3747840 registrar.cpp:332] Recovering registrar
> I0926 11:44:20.016340 4284416 registrar.cpp:365] Successfully fetched the 
> registry (0B) in 7.702016ms
> I0926 11:44:20.016393 4284416 registrar.cpp:464] Applied 1 operations in 
> 12us; attempting to update the 'registry'
> I0926 11:44:20.021428 4284416 registrar.cpp:509] Successfully updated the 
> 'registry' in 5.019904ms
> I0926 11:44:20.021481 4284416 registrar.cpp:395] Successfully recovered 
> registrar
> I0926 11:44:20.021611 528384 master.cpp:1655] Recovered 0 agents from the 
> Registry (118B) ; allowing 10mins for agents to re-register
> I0926 11:44:20.536859 3747840 master.cpp:2424] Received SUBSCRIBE call for 
> framework 'conductr' at 
> scheduler-3f8b9645-7a17-4e9f-8ad5-077fe8c23b39@192.168.2.106:57164
> I0926 11:44:20.536969 3747840 master.cpp:2500] Subscribing framework conductr 
> with checkpointing disabled and capabilities [  ]
> I0926 11:44:20.537401 3211264 hierarchical.cpp:271] Added framework conductr
> I0926 11:44:20.807895 528384 master.cpp:4787] Re-registering agent 
> b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 at slave(1)@127.0.0.1:5051 (127.0.0.1)
> I0926 11:44:20.808145 1601536 registrar.cpp:464] Applied 1 operations in 
> 38us; attempting to update the 'registry'
> I0926 11:44:20.815757 1601536 registrar.cpp:509] Successfully updated the 
> 'registry' in 7.568896ms
> I0926 11:44:20.815992 3747840 master.cpp:7447] Adding task 
> 6abce9bb-895f-4f6f-be5b-25f6bd09f548 with resources mem(*):0 on agent 
> b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 (127.0.0.1)
> I0926 11:44:20.816339 3747840 master.cpp:4872] Re-registered agent 
> b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 at slave(1)@127.0.0.1:5051 
> (127.0.0.1) with cpus(*):8; mem(*):15360; disk(*):470832; 
> ports(*):[31000-32000]
> I0926 11:44:20.816385 1601536 hierarchical.cpp:478] Added agent 
> b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 (127.0.0.1) with cpus(*):8; 
> mem(*):15360; disk(*):470832; ports(*):[31000-32000] (allocated: cpus(*):0.9; 
> mem(*):402.653; disk(*):1000; ports(*):[31000-31000, 31001-31500])
> I0926 11:44:20.816437 3747840 master.cpp:4940] Sending updated checkpointed 
> resources  to agent b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 at 
> slave(1)@127.0.0.1:5051 (127.0.0.1)
> I0926 11:44:20.816787 4284416 master.cpp:5725] Sending 1 offers to framework 
> conductr (conductr) at 
> scheduler-3f8b9645-7a17-4e9f-8ad5-077fe8c23b39@192.168.2.106:57164
> {code}
> *Log: ConductR framework*
> {code:bash}
> I0926 11:44:20.007189 66441216 detector.cpp:152] Detected a new leader: 
> (id='87')
> I0926 11:44:20.007524 64294912 group.cpp:706] Trying to get 
> '/mesos/json.info_87' in ZooKeeper
> I0926 11:44:20.008625 63758336 zookeeper.cpp:259] A new leading master 
> (UPID=master@127.0.0.1:5050) is detected
> I0926 11:44:20.008965 63758336 sched.cpp:330] New master detected at 
> master@127.0.0.1:5050
> 2016-09-26T09:44:20Z MacBook-Pro-6.local INFO  MesosSchedulerClient 
> [sourceThread=conductr-akka.actor.default-dispatcher-2, 
> 

[jira] [Created] (MESOS-6314) It looks like getgrouplist returns duplicated results

2016-10-05 Thread Marc Villacorta (JIRA)
Marc Villacorta created MESOS-6314:
--

 Summary: It looks like getgrouplist returns duplicated results
 Key: MESOS-6314
 URL: https://issues.apache.org/jira/browse/MESOS-6314
 Project: Mesos
  Issue Type: Bug
  Components: tests
Affects Versions: 1.0.2
 Environment: Inside Docker container {{alpine:3.4}}
Reporter: Marc Villacorta


In my Alpine 3.4 system OsTest.User fails:
{code:none}
/mesos/build # id -G
0 1 2 3 4 6 10 11 20 26 27
{code}

{code:none}
[ RUN  ] OsTest.User
../../../3rdparty/stout/tests/os_tests.cpp:696: Failure
Value of: expected_gids
  Actual: { "0", "0", "1", "10", "11", "2", "20", "26", "27", "3", "4", "6" }
Expected: tokens.get()
Which is: { "0", "1", "10", "11", "2", "20", "26", "27", "3", "4", "6" }
[  FAILED  ] OsTest.User (6 ms)
{code}
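
A minimal sketch of the comparison fix implied here (assuming the Linux 
{{getgrouplist}} prototype; on macOS the buffer is {{int*}}): sort and 
deduplicate both group lists before comparing, so duplicated entries cannot 
cause spurious failures:

{code}
// Sketch only -- not the actual stout test code. Normalize a group list
// by sorting it and removing duplicates before any comparison.
#include <grp.h>
#include <sys/types.h>

#include <algorithm>
#include <vector>

std::vector<gid_t> normalizedGroups(const char* user, gid_t gid)
{
  std::vector<gid_t> groups(32);
  int ngroups = static_cast<int>(groups.size());

  // If the initial buffer is too small, getgrouplist() returns -1 and
  // reports the required size in 'ngroups'; retry with a bigger buffer.
  if (getgrouplist(user, gid, groups.data(), &ngroups) == -1) {
    groups.resize(ngroups);
    getgrouplist(user, gid, groups.data(), &ngroups);
  }
  groups.resize(ngroups);

  std::sort(groups.begin(), groups.end());
  groups.erase(std::unique(groups.begin(), groups.end()), groups.end());

  return groups;
}
{code}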



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6314) OsTest.User: It looks like getgrouplist returns duplicated results

2016-10-05 Thread Marc Villacorta (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marc Villacorta updated MESOS-6314:
---
Summary: OsTest.User: It looks like getgrouplist returns duplicated results 
 (was: It looks like getgrouplist returns duplicated results)

> OsTest.User: It looks like getgrouplist returns duplicated results
> --
>
> Key: MESOS-6314
> URL: https://issues.apache.org/jira/browse/MESOS-6314
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Affects Versions: 1.0.2
> Environment: Inside Docker container {{alpine:3.4}}
>Reporter: Marc Villacorta
>
> In my Alpine 3.4 system OsTest.User fails:
> {code:none}
> /mesos/build # id -G
> 0 1 2 3 4 6 10 11 20 26 27
> {code}
> {code:none}
> [ RUN  ] OsTest.User
> ../../../3rdparty/stout/tests/os_tests.cpp:696: Failure
> Value of: expected_gids
>   Actual: { "0", "0", "1", "10", "11", "2", "20", "26", "27", "3", "4", "6" }
> Expected: tokens.get()
> Which is: { "0", "1", "10", "11", "2", "20", "26", "27", "3", "4", "6" }
> [  FAILED  ] OsTest.User (6 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5909) Stout "OsTest.User" test can fail on some systems

2016-10-05 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15548374#comment-15548374
 ] 

haosdent commented on MESOS-5909:
-

Yep, please open a new one. It looks like {{getgrouplist}} returns duplicated 
results.

> Stout "OsTest.User" test can fail on some systems
> -
>
> Key: MESOS-5909
> URL: https://issues.apache.org/jira/browse/MESOS-5909
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
>Reporter: Kapil Arya
>Assignee: Mao Geng
>  Labels: mesosphere
> Fix For: 1.1.0
>
> Attachments: MESOS-5909-fix.diff
>
>
> The libc call {{getgrouplist}} doesn't return the {{gid}} list in a sorted 
> manner (in my case, it's returning "471 100") ... whereas {{id -G}} returns a 
> sorted list ("100 471" in my case), causing the validation inside the loop to fail.
> We should sort both lists before comparing the values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6313) In Mesos Console, "Completed Tasks" in a tab next to "Active Tasks"

2016-10-05 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15548362#comment-15548362
 ] 

haosdent commented on MESOS-6313:
-

Cool, we could add 
{code}


{code}

in the navbar to implement this.

> In Mesos Console, "Completed Tasks" in a tab next to "Active Tasks"
> ---
>
> Key: MESOS-6313
> URL: https://issues.apache.org/jira/browse/MESOS-6313
> Project: Mesos
>  Issue Type: Improvement
>  Components: webui
>Reporter: Roman Leventov
>
> It will ease navigation between them (clicks in close areas of the screen, 
> rather than scrolling) and will make "Completed Tasks" even *visible* when the 
> list of active tasks is very long. This is important for those who are not 
> familiar with the Mesos UI and expect everything to be accessible through 
> menus/tabs at the top of the screen, not through scrolling.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5909) Stout "OsTest.User" test can fail on some systems

2016-10-05 Thread Marc Villacorta (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15548353#comment-15548353
 ] 

Marc Villacorta commented on MESOS-5909:


In my Alpine 3.4 system this test still fails:
{code:none}
/mesos/build # id -G
0 1 2 3 4 6 10 11 20 26 27
{code}

{code:none}
[ RUN  ] OsTest.User
../../../3rdparty/stout/tests/os_tests.cpp:696: Failure
Value of: expected_gids
  Actual: { "0", "0", "1", "10", "11", "2", "20", "26", "27", "3", "4", "6" }
Expected: tokens.get()
Which is: { "0", "1", "10", "11", "2", "20", "26", "27", "3", "4", "6" }
[  FAILED  ] OsTest.User (6 ms)
{code}

Should I open a new Jira?

> Stout "OsTest.User" test can fail on some systems
> -
>
> Key: MESOS-5909
> URL: https://issues.apache.org/jira/browse/MESOS-5909
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
>Reporter: Kapil Arya
>Assignee: Mao Geng
>  Labels: mesosphere
> Fix For: 1.1.0
>
> Attachments: MESOS-5909-fix.diff
>
>
> The libc call {{getgrouplist}} doesn't return the {{gid}} list in a sorted 
> manner (in my case, it's returning "471 100") ... whereas {{id -G}} returns a 
> sorted list ("100 471" in my case), causing the validation inside the loop to fail.
> We should sort both lists before comparing the values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6312) Add requirement in upgrade.md and getting-started.md for agent '--runtime_dir' in when running as non-root

2016-10-05 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15548338#comment-15548338
 ] 

haosdent commented on MESOS-6312:
-

[~klueska] For mesos-local, should we change its default value?

> Add requirement in upgrade.md and getting-started.md for agent 
> '--runtime_dir' in when running as non-root
> --
>
> Key: MESOS-6312
> URL: https://issues.apache.org/jira/browse/MESOS-6312
> Project: Mesos
>  Issue Type: Task
>Reporter: Kevin Klues
>Priority: Blocker
> Fix For: 1.1.0
>
>
> We recently introduced a new agent flag for {{--runtime_dir}}. Unlike the 
> {{--work_dir}}, this directory is designed to hold the state of a running 
> agent between subsequent agent-restarts (but not across host reboots).
> By default, this flag is set to {{/var/run/mesos}} since this is a {{tmpfs}} 
> on Linux that gets automatically cleaned up on reboot. However, on most 
> systems {{/var/run/mesos}} is only writable by root, causing problems when 
> launching an agent as non-root without pointing {{--runtime_dir}} to a 
> different location.
> We need to call this out in the upgrade.md and getting-started.md docs so 
> that people know they may need to set this going forward.
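
For example (the paths below are illustrative assumptions, not values 
prescribed by the docs), a non-root agent could be pointed at a user-writable 
runtime directory:

{code:none}
mesos-agent --work_dir=/var/lib/mesos \
            --runtime_dir=$HOME/mesos/runtime
{code}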



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)