[jira] [Updated] (MESOS-3790) Zk connection should retry on EAI_NONAME

2016-01-08 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-3790:
---
Assignee: (was: Neil Conway)

> Zk connection should retry on EAI_NONAME
> 
>
> Key: MESOS-3790
> URL: https://issues.apache.org/jira/browse/MESOS-3790
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>Priority: Minor
>  Labels: mesosphere, zookeeper
>
> The ZooKeeper interface is designed to retry (once per second for up to ten
> minutes) if one or more of the ZooKeeper hostnames can't be resolved (see
> [MESOS-1326] and [MESOS-1523]).
> However, the current implementation assumes that a DNS resolution failure is
> indicated by zookeeper_init() returning NULL and errno being set to EINVAL
> (Zk translates getaddrinfo() failures into errno values). But the current Zk
> code actually does:
> {code}
> static int getaddrinfo_errno(int rc) {
>     switch (rc) {
>     case EAI_NONAME:
> // ZOOKEEPER-1323 EAI_NODATA and EAI_ADDRFAMILY are deprecated in FreeBSD.
> #if defined EAI_NODATA && EAI_NODATA != EAI_NONAME
>     case EAI_NODATA:
> #endif
>         return ENOENT;
>     case EAI_MEMORY:
>         return ENOMEM;
>     default:
>         return EINVAL;
>     }
> }
> {code}
> getaddrinfo() returns EAI_NONAME when "the node or service is not known"; per 
> discussion in [MESOS-2186], this seems to happen intermittently due to DNS 
> failures.
> Proposed fix: looking at errno is always going to be somewhat fragile, but if 
> we're going to continue doing that, we should check for ENOENT as well as 
> EINVAL.
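
A minimal sketch of the proposed check (illustrative only; the helper name and
the retry call site are assumptions, not the actual Mesos code):

{code}
#include <cerrno>

// Treat both errno values that ZooKeeper's getaddrinfo_errno() can
// produce for a name-resolution failure as retryable: EINVAL (the
// default mapping) and ENOENT (the EAI_NONAME/EAI_NODATA mapping).
static bool isRetryableDnsFailure(int error)
{
  return error == EINVAL || error == ENOENT;
}

// Hypothetical call site, after a failed zookeeper_init():
//
//   if (handle == NULL && isRetryableDnsFailure(errno)) {
//     // ... schedule another connection attempt ...
//   }
{code}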





[jira] [Updated] (MESOS-3746) Consider introducing a mechanism to provide feedback on offer operations

2016-01-08 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-3746:
---
Assignee: (was: Neil Conway)

> Consider introducing a mechanism to provide feedback on offer operations
> 
>
> Key: MESOS-3746
> URL: https://issues.apache.org/jira/browse/MESOS-3746
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Michael Park
>  Labels: mesosphere, persistent-volumes, reservations
>
> Currently, the master does not provide direct feedback to the framework
> when an operation is dropped:
> https://github.com/apache/mesos/blob/master/src/master/master.cpp#L1713-L1715
> A "subsequent offer" is used as the mechanism to determine whether an
> operation succeeded, which is not sufficient if a framework mistakenly
> sends invalid operations. There should be immediate feedback as to whether
> the request was "accepted".





[jira] [Created] (MESOS-4317) Document use of mesos specific future design patterns in gmock test framework

2016-01-08 Thread Avinash Sridharan (JIRA)
Avinash Sridharan created MESOS-4317:


 Summary: Document use of mesos specific future design patterns in 
gmock test framework
 Key: MESOS-4317
 URL: https://issues.apache.org/jira/browse/MESOS-4317
 Project: Mesos
  Issue Type: Documentation
Reporter: Avinash Sridharan
Priority: Minor


Mesos relies heavily on the Google Test and Google Mock frameworks for its unit
test infrastructure. In order to support unit testing of Mesos classes that are
inherently multi-threaded (or multi-process) and asynchronous in nature, the
libprocess future/promise design patterns have been used to expose a set of
APIs that allow for asynchronous callbacks within the Mesos-specific gmock test
framework (3rdparty/libprocess/include/process/gmock.hpp).

Given that these future/promise-based APIs are very specific to the Apache
Mesos test framework, it would be good to have documentation to better inform
developers (especially newbies) of the infrastructure and its use cases.
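
As a concrete illustration, here is a minimal, hypothetical sketch of the kind
of pattern such documentation would cover (the macros come from
process/gmock.hpp and process/gtest.hpp; the message name "ping" and the test
body are assumptions made for illustration):

{code}
#include <gtest/gtest.h>

#include <process/future.hpp>
#include <process/gmock.hpp>
#include <process/gtest.hpp>
#include <process/message.hpp>

using process::Future;
using process::Message;

using testing::_;
using testing::Eq;

TEST(ExampleTest, WaitsForMessage)
{
  // Capture the next libprocess message named "ping", sent between any
  // two processes, as a Future the test can synchronize on.
  Future<Message> ping = FUTURE_MESSAGE(Eq("ping"), _, _);

  // ... trigger the (asynchronous) code under test that sends "ping" ...

  AWAIT_READY(ping);  // Fails the test if the message never arrives.
}
{code}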





[jira] [Updated] (MESOS-4317) Document use of mesos specific future design patterns in gmock test framework

2016-01-08 Thread Avinash Sridharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Avinash Sridharan updated MESOS-4317:
-
Description: 
Mesos relies heavily on the Google Test and Google Mock frameworks for its unit
test infrastructure. In order to support unit testing of Mesos classes that are
inherently multi-threaded (or multi-process) and asynchronous in nature, the
libprocess future/promise design patterns have been used to expose a set of
APIs that allow for asynchronous callbacks within the Mesos-specific gmock test
framework (3rdparty/libprocess/include/process/gmock.hpp).

Given that these future/promise-based APIs are very specific to the Apache
Mesos test framework, it would be good to have documentation about their use
cases to better inform developers (especially newbies) of this infrastructure.

  was:
Mesos relies heavily on the Google Test and Google Mock frameworks for its unit
test infrastructure. In order to support unit testing of Mesos classes that are
inherently multi-threaded (or multi-process) and asynchronous in nature, the
libprocess future/promise design patterns have been used to expose a set of
APIs that allow for asynchronous callbacks within the Mesos-specific gmock test
framework (3rdparty/libprocess/include/process/gmock.hpp).

Given that these future/promise-based APIs are very specific to the Apache
Mesos test framework, it would be good to have documentation to better inform
developers (especially newbies) of the infrastructure and its use cases.


> Document use of mesos specific future design patterns in gmock test framework
> -
>
> Key: MESOS-4317
> URL: https://issues.apache.org/jira/browse/MESOS-4317
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Avinash Sridharan
>Priority: Minor
>
> Mesos relies heavily on the Google Test and Google Mock frameworks for its
> unit test infrastructure. In order to support unit testing of Mesos classes
> that are inherently multi-threaded (or multi-process) and asynchronous in
> nature, the libprocess future/promise design patterns have been used to
> expose a set of APIs that allow for asynchronous callbacks within the
> Mesos-specific gmock test framework
> (3rdparty/libprocess/include/process/gmock.hpp).
> Given that these future/promise-based APIs are very specific to the Apache
> Mesos test framework, it would be good to have documentation about their use
> cases to better inform developers (especially newbies) of this
> infrastructure.





[jira] [Commented] (MESOS-4229) Docker containers left running on disk after reviewbot builds

2016-01-08 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089973#comment-15089973
 ] 

Adam B commented on MESOS-4229:
---

May have been introduced by MESOS-3900?
cc: [~jojy] [~vinodkone]

> Docker containers left running on disk after reviewbot builds
> -
>
> Key: MESOS-4229
> URL: https://issues.apache.org/jira/browse/MESOS-4229
> Project: Mesos
>  Issue Type: Bug
> Environment: ASF Mesos Reviewbot
>Reporter: Greg Mann
>  Labels: build, mesosphere, test
>
> The Mesos Reviewbot builds recently failed due to Docker containers being 
> left running on the disk, eventually leading to a full disk: 
> https://issues.apache.org/jira/browse/INFRA-10984
> These containers should be automatically cleaned up to avoid this problem in 
> the future.





[jira] [Updated] (MESOS-4318) PersistentVolumeTest.BadACLNoPrincipal is flaky

2016-01-08 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-4318:
--
Labels: flaky-test  (was: )

> PersistentVolumeTest.BadACLNoPrincipal is flaky
> ---
>
> Key: MESOS-4318
> URL: https://issues.apache.org/jira/browse/MESOS-4318
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>  Labels: flaky-test
>
> https://builds.apache.org/job/Mesos/1457/COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,OS=centos:7,label_exp=docker%7C%7CHadoop/consoleFull
> {noformat}
> [ RUN  ] PersistentVolumeTest.BadACLNoPrincipal
> I0108 01:13:16.117883  1325 leveldb.cpp:174] Opened db in 2.614722ms
> I0108 01:13:16.118650  1325 leveldb.cpp:181] Compacted db in 706567ns
> I0108 01:13:16.118702  1325 leveldb.cpp:196] Created db iterator in 24489ns
> I0108 01:13:16.118723  1325 leveldb.cpp:202] Seeked to beginning of db in 
> 2436ns
> I0108 01:13:16.118738  1325 leveldb.cpp:271] Iterated through 0 keys in the 
> db in 397ns
> I0108 01:13:16.118793  1325 replica.cpp:779] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0108 01:13:16.119627  1348 recover.cpp:447] Starting replica recovery
> I0108 01:13:16.120352  1348 recover.cpp:473] Replica is in EMPTY status
> I0108 01:13:16.121750  1357 replica.cpp:673] Replica in EMPTY status received 
> a broadcasted recover request from (7084)@172.17.0.2:32801
> I0108 01:13:16.122297  1353 recover.cpp:193] Received a recover response from 
> a replica in EMPTY status
> I0108 01:13:16.122747  1350 recover.cpp:564] Updating replica status to 
> STARTING
> I0108 01:13:16.123625  1354 master.cpp:365] Master 
> 773d31e8-383d-4e4b-aa68-f9a3fb9f1fc2 (d9632dd1c41e) started on 
> 172.17.0.2:32801
> I0108 01:13:16.123946  1347 leveldb.cpp:304] Persisting metadata (8 bytes) to 
> leveldb took 728242ns
> I0108 01:13:16.123999  1347 replica.cpp:320] Persisted replica status to 
> STARTING
> I0108 01:13:16.123708  1354 master.cpp:367] Flags at startup: 
> --acls="create_volumes {
>   principals {
> values: "test-principal"
>   }
>   volume_types {
> type: ANY
>   }
> }
> create_volumes {
>   principals {
> type: ANY
>   }
>   volume_types {
> type: NONE
>   }
> }
> " --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate="false" --authenticate_slaves="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/f2rA75/credentials" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_slave_ping_timeouts="5" --quiet="false" 
> --recovery_slave_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_store_timeout="25secs" 
> --registry_strict="true" --roles="role1" --root_submissions="true" 
> --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" 
> --user_sorter="drf" --version="false" 
> --webui_dir="/mesos/mesos-0.27.0/_inst/share/mesos/webui" 
> --work_dir="/tmp/f2rA75/master" --zk_session_timeout="10secs"
> I0108 01:13:16.124219  1354 master.cpp:414] Master allowing unauthenticated 
> frameworks to register
> I0108 01:13:16.124236  1354 master.cpp:417] Master only allowing 
> authenticated slaves to register
> I0108 01:13:16.124248  1354 credentials.hpp:35] Loading credentials for 
> authentication from '/tmp/f2rA75/credentials'
> I0108 01:13:16.124294  1358 recover.cpp:473] Replica is in STARTING status
> I0108 01:13:16.124644  1354 master.cpp:456] Using default 'crammd5' 
> authenticator
> I0108 01:13:16.124820  1354 master.cpp:493] Authorization enabled
> W0108 01:13:16.124843  1354 master.cpp:553] The '--roles' flag is deprecated. 
> This flag will be removed in the future. See the Mesos 0.27 upgrade notes for 
> more information
> I0108 01:13:16.125154  1348 hierarchical.cpp:147] Initialized hierarchical 
> allocator process
> I0108 01:13:16.125334  1345 whitelist_watcher.cpp:77] No whitelist given
> I0108 01:13:16.126065  1346 replica.cpp:673] Replica in STARTING status 
> received a broadcasted recover request from (7085)@172.17.0.2:32801
> I0108 01:13:16.126806  1348 recover.cpp:193] Received a recover response from 
> a replica in STARTING status
> I0108 01:13:16.128237  1354 recover.cpp:564] Updating replica status to VOTING
> I0108 01:13:16.128402  1359 master.cpp:1629] The newly elected leader is 
> master@172.17.0.2:32801 with id 773d31e8-383d-4e4b-aa68-f9a3fb9f1fc2
> I0108 01:13:16.128489  1359 master.cpp:1642] Elected as the leading master!
> I0108 01:13:16.128523  1359 master.cpp:1387] Recovering from registrar
> I0108 01:13:16.128756  1355 registrar.cpp:307] Recovering registrar
> I0108 01:13:16.129259  1344 leveldb.cpp:304] Persisting metadata (8 bytes) to 

[jira] [Updated] (MESOS-4258) Generate xml test reports in the jenkins build.

2016-01-08 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-4258:
---
Shepherd: Benjamin Mahler

> Generate xml test reports in the jenkins build.
> ---
>
> Key: MESOS-4258
> URL: https://issues.apache.org/jira/browse/MESOS-4258
> Project: Mesos
>  Issue Type: Task
>  Components: test
>Reporter: Benjamin Mahler
>Assignee: Shuai Lin
>  Labels: newbie
>
> Google test has a flag for generating reports:
> {{--gtest_output=xml:report.xml}}
> Jenkins can display these reports via the xUnit plugin, which has support for 
> google test xml: https://wiki.jenkins-ci.org/display/JENKINS/xUnit+Plugin
> This lets us quickly see which test failed, as well as the time that each 
> test took to run.
> We should wire this up. One difficulty is that 'make distclean' complains 
> because the .xml files are left over (we could update distclean to wipe any 
> .xml files within the test locations):
> {noformat}
> ERROR: files left in build directory after distclean:
> ./3rdparty/libprocess/3rdparty/report.xml
> ./3rdparty/libprocess/report.xml
> ./src/report.xml
> make[1]: *** [distcleancheck] Error 1
> {noformat}
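
For reference, a minimal sketch of what the flag does, set programmatically as
a default instead of on the command line (a standard gtest main is assumed
here; Mesos' own test main differs):

{code}
#include <gtest/gtest.h>

int main(int argc, char** argv)
{
  // Default to writing a JUnit-style XML report that Jenkins' xUnit
  // plugin can consume; --gtest_output on the command line still wins.
  testing::GTEST_FLAG(output) = "xml:report.xml";

  testing::InitGoogleTest(&argc, argv);
  return RUN_ALL_TESTS();
}
{code}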





[jira] [Created] (MESOS-4318) PersistentVolumeTest.BadACLNoPrincipal is flaky

2016-01-08 Thread Jie Yu (JIRA)
Jie Yu created MESOS-4318:
-

 Summary: PersistentVolumeTest.BadACLNoPrincipal is flaky
 Key: MESOS-4318
 URL: https://issues.apache.org/jira/browse/MESOS-4318
 Project: Mesos
  Issue Type: Bug
Reporter: Jie Yu


https://builds.apache.org/job/Mesos/1457/COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,OS=centos:7,label_exp=docker%7C%7CHadoop/consoleFull

{noformat}
[ RUN  ] PersistentVolumeTest.BadACLNoPrincipal
I0108 01:13:16.117883  1325 leveldb.cpp:174] Opened db in 2.614722ms
I0108 01:13:16.118650  1325 leveldb.cpp:181] Compacted db in 706567ns
I0108 01:13:16.118702  1325 leveldb.cpp:196] Created db iterator in 24489ns
I0108 01:13:16.118723  1325 leveldb.cpp:202] Seeked to beginning of db in 2436ns
I0108 01:13:16.118738  1325 leveldb.cpp:271] Iterated through 0 keys in the db 
in 397ns
I0108 01:13:16.118793  1325 replica.cpp:779] Replica recovered with log 
positions 0 -> 0 with 1 holes and 0 unlearned
I0108 01:13:16.119627  1348 recover.cpp:447] Starting replica recovery
I0108 01:13:16.120352  1348 recover.cpp:473] Replica is in EMPTY status
I0108 01:13:16.121750  1357 replica.cpp:673] Replica in EMPTY status received a 
broadcasted recover request from (7084)@172.17.0.2:32801
I0108 01:13:16.122297  1353 recover.cpp:193] Received a recover response from a 
replica in EMPTY status
I0108 01:13:16.122747  1350 recover.cpp:564] Updating replica status to STARTING
I0108 01:13:16.123625  1354 master.cpp:365] Master 
773d31e8-383d-4e4b-aa68-f9a3fb9f1fc2 (d9632dd1c41e) started on 172.17.0.2:32801
I0108 01:13:16.123946  1347 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 728242ns
I0108 01:13:16.123999  1347 replica.cpp:320] Persisted replica status to 
STARTING
I0108 01:13:16.123708  1354 master.cpp:367] Flags at startup: 
--acls="create_volumes {
  principals {
values: "test-principal"
  }
  volume_types {
type: ANY
  }
}
create_volumes {
  principals {
type: ANY
  }
  volume_types {
type: NONE
  }
}
" --allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate="false" --authenticate_slaves="true" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/f2rA75/credentials" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" 
--quiet="false" --recovery_slave_removal_limit="100%" 
--registry="replicated_log" --registry_fetch_timeout="1mins" 
--registry_store_timeout="25secs" --registry_strict="true" --roles="role1" 
--root_submissions="true" --slave_ping_timeout="15secs" 
--slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" 
--webui_dir="/mesos/mesos-0.27.0/_inst/share/mesos/webui" 
--work_dir="/tmp/f2rA75/master" --zk_session_timeout="10secs"
I0108 01:13:16.124219  1354 master.cpp:414] Master allowing unauthenticated 
frameworks to register
I0108 01:13:16.124236  1354 master.cpp:417] Master only allowing authenticated 
slaves to register
I0108 01:13:16.124248  1354 credentials.hpp:35] Loading credentials for 
authentication from '/tmp/f2rA75/credentials'
I0108 01:13:16.124294  1358 recover.cpp:473] Replica is in STARTING status
I0108 01:13:16.124644  1354 master.cpp:456] Using default 'crammd5' 
authenticator
I0108 01:13:16.124820  1354 master.cpp:493] Authorization enabled
W0108 01:13:16.124843  1354 master.cpp:553] The '--roles' flag is deprecated. 
This flag will be removed in the future. See the Mesos 0.27 upgrade notes for 
more information
I0108 01:13:16.125154  1348 hierarchical.cpp:147] Initialized hierarchical 
allocator process
I0108 01:13:16.125334  1345 whitelist_watcher.cpp:77] No whitelist given
I0108 01:13:16.126065  1346 replica.cpp:673] Replica in STARTING status 
received a broadcasted recover request from (7085)@172.17.0.2:32801
I0108 01:13:16.126806  1348 recover.cpp:193] Received a recover response from a 
replica in STARTING status
I0108 01:13:16.128237  1354 recover.cpp:564] Updating replica status to VOTING
I0108 01:13:16.128402  1359 master.cpp:1629] The newly elected leader is 
master@172.17.0.2:32801 with id 773d31e8-383d-4e4b-aa68-f9a3fb9f1fc2
I0108 01:13:16.128489  1359 master.cpp:1642] Elected as the leading master!
I0108 01:13:16.128523  1359 master.cpp:1387] Recovering from registrar
I0108 01:13:16.128756  1355 registrar.cpp:307] Recovering registrar
I0108 01:13:16.129259  1344 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 531437ns
I0108 01:13:16.129292  1344 replica.cpp:320] Persisted replica status to VOTING
I0108 01:13:16.129425  1358 recover.cpp:578] Successfully joined the Paxos group
I0108 01:13:16.129680  1358 recover.cpp:462] Recover process terminated
I0108 01:13:16.130187  1358 log.cpp:659] Attempting to start the writer
I0108 01:13:16.131613  1352 replica.cpp:493] Replica received implicit 

[jira] [Commented] (MESOS-3003) Support mounting in default configuration files/volumes into every new container

2016-01-08 Thread Timothy Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089979#comment-15089979
 ] 

Timothy Chen commented on MESOS-3003:
-

I think, following what libcontainer/runc does, we should create a list of /etc
files to mount (i.e., /etc/hosts and /etc/resolv.conf) in the container when we
see that /etc is not mounted already from the host.
For now I think this should suffice, and we need to test different containers
to see whether there are any more configuration files that we need to pass in.
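
A minimal sketch of that selection logic (a hypothetical helper, not the
actual isolator code; stout's os::exists and path::join are assumed):

{code}
#include <string>
#include <vector>

#include <stout/os.hpp>
#include <stout/path.hpp>

// Return the default /etc files to bind-mount from the host into the
// container, skipping any file the image already provides.
static std::vector<std::string> defaultEtcMounts(const std::string& rootfs)
{
  const std::vector<std::string> candidates = {
    "/etc/hosts",
    "/etc/resolv.conf",
  };

  std::vector<std::string> mounts;
  for (const std::string& file : candidates) {
    if (!os::exists(path::join(rootfs, file))) {
      mounts.push_back(file);
    }
  }

  return mounts;
}
{code}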

> Support mounting in default configuration files/volumes into every new 
> container
> 
>
> Key: MESOS-3003
> URL: https://issues.apache.org/jira/browse/MESOS-3003
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Timothy Chen
>  Labels: mesosphere, unified-containerizer-mvp
>
> Most container images leave out system configuration (e.g. /etc/*) and expect
> the container runtime to mount specific configuration files, such as
> /etc/resolv.conf, from the host into the container as needed.
> We need to support mounting in specific configuration files for the command
> executor to work, and also allow the user to optionally define other
> configuration files to mount in via flags.





[jira] [Updated] (MESOS-4318) PersistentVolumeTest.BadACLNoPrincipal is flaky

2016-01-08 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-4318:
--
Shepherd: Jie Yu
  Sprint: Mesosphere Sprint 26

> PersistentVolumeTest.BadACLNoPrincipal is flaky
> ---
>
> Key: MESOS-4318
> URL: https://issues.apache.org/jira/browse/MESOS-4318
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>Assignee: Greg Mann
>  Labels: flaky-test
>
> https://builds.apache.org/job/Mesos/1457/COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,OS=centos:7,label_exp=docker%7C%7CHadoop/consoleFull
> {noformat}
> [ RUN  ] PersistentVolumeTest.BadACLNoPrincipal
> I0108 01:13:16.117883  1325 leveldb.cpp:174] Opened db in 2.614722ms
> I0108 01:13:16.118650  1325 leveldb.cpp:181] Compacted db in 706567ns
> I0108 01:13:16.118702  1325 leveldb.cpp:196] Created db iterator in 24489ns
> I0108 01:13:16.118723  1325 leveldb.cpp:202] Seeked to beginning of db in 
> 2436ns
> I0108 01:13:16.118738  1325 leveldb.cpp:271] Iterated through 0 keys in the 
> db in 397ns
> I0108 01:13:16.118793  1325 replica.cpp:779] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0108 01:13:16.119627  1348 recover.cpp:447] Starting replica recovery
> I0108 01:13:16.120352  1348 recover.cpp:473] Replica is in EMPTY status
> I0108 01:13:16.121750  1357 replica.cpp:673] Replica in EMPTY status received 
> a broadcasted recover request from (7084)@172.17.0.2:32801
> I0108 01:13:16.122297  1353 recover.cpp:193] Received a recover response from 
> a replica in EMPTY status
> I0108 01:13:16.122747  1350 recover.cpp:564] Updating replica status to 
> STARTING
> I0108 01:13:16.123625  1354 master.cpp:365] Master 
> 773d31e8-383d-4e4b-aa68-f9a3fb9f1fc2 (d9632dd1c41e) started on 
> 172.17.0.2:32801
> I0108 01:13:16.123946  1347 leveldb.cpp:304] Persisting metadata (8 bytes) to 
> leveldb took 728242ns
> I0108 01:13:16.123999  1347 replica.cpp:320] Persisted replica status to 
> STARTING
> I0108 01:13:16.123708  1354 master.cpp:367] Flags at startup: 
> --acls="create_volumes {
>   principals {
> values: "test-principal"
>   }
>   volume_types {
> type: ANY
>   }
> }
> create_volumes {
>   principals {
> type: ANY
>   }
>   volume_types {
> type: NONE
>   }
> }
> " --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate="false" --authenticate_slaves="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/f2rA75/credentials" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_slave_ping_timeouts="5" --quiet="false" 
> --recovery_slave_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_store_timeout="25secs" 
> --registry_strict="true" --roles="role1" --root_submissions="true" 
> --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" 
> --user_sorter="drf" --version="false" 
> --webui_dir="/mesos/mesos-0.27.0/_inst/share/mesos/webui" 
> --work_dir="/tmp/f2rA75/master" --zk_session_timeout="10secs"
> I0108 01:13:16.124219  1354 master.cpp:414] Master allowing unauthenticated 
> frameworks to register
> I0108 01:13:16.124236  1354 master.cpp:417] Master only allowing 
> authenticated slaves to register
> I0108 01:13:16.124248  1354 credentials.hpp:35] Loading credentials for 
> authentication from '/tmp/f2rA75/credentials'
> I0108 01:13:16.124294  1358 recover.cpp:473] Replica is in STARTING status
> I0108 01:13:16.124644  1354 master.cpp:456] Using default 'crammd5' 
> authenticator
> I0108 01:13:16.124820  1354 master.cpp:493] Authorization enabled
> W0108 01:13:16.124843  1354 master.cpp:553] The '--roles' flag is deprecated. 
> This flag will be removed in the future. See the Mesos 0.27 upgrade notes for 
> more information
> I0108 01:13:16.125154  1348 hierarchical.cpp:147] Initialized hierarchical 
> allocator process
> I0108 01:13:16.125334  1345 whitelist_watcher.cpp:77] No whitelist given
> I0108 01:13:16.126065  1346 replica.cpp:673] Replica in STARTING status 
> received a broadcasted recover request from (7085)@172.17.0.2:32801
> I0108 01:13:16.126806  1348 recover.cpp:193] Received a recover response from 
> a replica in STARTING status
> I0108 01:13:16.128237  1354 recover.cpp:564] Updating replica status to VOTING
> I0108 01:13:16.128402  1359 master.cpp:1629] The newly elected leader is 
> master@172.17.0.2:32801 with id 773d31e8-383d-4e4b-aa68-f9a3fb9f1fc2
> I0108 01:13:16.128489  1359 master.cpp:1642] Elected as the leading master!
> I0108 01:13:16.128523  1359 master.cpp:1387] Recovering from registrar
> I0108 01:13:16.128756  1355 registrar.cpp:307] Recovering registrar
> I0108 01:13:16.129259 

[jira] [Updated] (MESOS-4318) PersistentVolumeTest.BadACLNoPrincipal is flaky

2016-01-08 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-4318:
--
Assignee: Greg Mann

> PersistentVolumeTest.BadACLNoPrincipal is flaky
> ---
>
> Key: MESOS-4318
> URL: https://issues.apache.org/jira/browse/MESOS-4318
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>Assignee: Greg Mann
>  Labels: flaky-test
>
> https://builds.apache.org/job/Mesos/1457/COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,OS=centos:7,label_exp=docker%7C%7CHadoop/consoleFull
> {noformat}
> [ RUN  ] PersistentVolumeTest.BadACLNoPrincipal
> I0108 01:13:16.117883  1325 leveldb.cpp:174] Opened db in 2.614722ms
> I0108 01:13:16.118650  1325 leveldb.cpp:181] Compacted db in 706567ns
> I0108 01:13:16.118702  1325 leveldb.cpp:196] Created db iterator in 24489ns
> I0108 01:13:16.118723  1325 leveldb.cpp:202] Seeked to beginning of db in 
> 2436ns
> I0108 01:13:16.118738  1325 leveldb.cpp:271] Iterated through 0 keys in the 
> db in 397ns
> I0108 01:13:16.118793  1325 replica.cpp:779] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0108 01:13:16.119627  1348 recover.cpp:447] Starting replica recovery
> I0108 01:13:16.120352  1348 recover.cpp:473] Replica is in EMPTY status
> I0108 01:13:16.121750  1357 replica.cpp:673] Replica in EMPTY status received 
> a broadcasted recover request from (7084)@172.17.0.2:32801
> I0108 01:13:16.122297  1353 recover.cpp:193] Received a recover response from 
> a replica in EMPTY status
> I0108 01:13:16.122747  1350 recover.cpp:564] Updating replica status to 
> STARTING
> I0108 01:13:16.123625  1354 master.cpp:365] Master 
> 773d31e8-383d-4e4b-aa68-f9a3fb9f1fc2 (d9632dd1c41e) started on 
> 172.17.0.2:32801
> I0108 01:13:16.123946  1347 leveldb.cpp:304] Persisting metadata (8 bytes) to 
> leveldb took 728242ns
> I0108 01:13:16.123999  1347 replica.cpp:320] Persisted replica status to 
> STARTING
> I0108 01:13:16.123708  1354 master.cpp:367] Flags at startup: 
> --acls="create_volumes {
>   principals {
> values: "test-principal"
>   }
>   volume_types {
> type: ANY
>   }
> }
> create_volumes {
>   principals {
> type: ANY
>   }
>   volume_types {
> type: NONE
>   }
> }
> " --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate="false" --authenticate_slaves="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/f2rA75/credentials" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_slave_ping_timeouts="5" --quiet="false" 
> --recovery_slave_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_store_timeout="25secs" 
> --registry_strict="true" --roles="role1" --root_submissions="true" 
> --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" 
> --user_sorter="drf" --version="false" 
> --webui_dir="/mesos/mesos-0.27.0/_inst/share/mesos/webui" 
> --work_dir="/tmp/f2rA75/master" --zk_session_timeout="10secs"
> I0108 01:13:16.124219  1354 master.cpp:414] Master allowing unauthenticated 
> frameworks to register
> I0108 01:13:16.124236  1354 master.cpp:417] Master only allowing 
> authenticated slaves to register
> I0108 01:13:16.124248  1354 credentials.hpp:35] Loading credentials for 
> authentication from '/tmp/f2rA75/credentials'
> I0108 01:13:16.124294  1358 recover.cpp:473] Replica is in STARTING status
> I0108 01:13:16.124644  1354 master.cpp:456] Using default 'crammd5' 
> authenticator
> I0108 01:13:16.124820  1354 master.cpp:493] Authorization enabled
> W0108 01:13:16.124843  1354 master.cpp:553] The '--roles' flag is deprecated. 
> This flag will be removed in the future. See the Mesos 0.27 upgrade notes for 
> more information
> I0108 01:13:16.125154  1348 hierarchical.cpp:147] Initialized hierarchical 
> allocator process
> I0108 01:13:16.125334  1345 whitelist_watcher.cpp:77] No whitelist given
> I0108 01:13:16.126065  1346 replica.cpp:673] Replica in STARTING status 
> received a broadcasted recover request from (7085)@172.17.0.2:32801
> I0108 01:13:16.126806  1348 recover.cpp:193] Received a recover response from 
> a replica in STARTING status
> I0108 01:13:16.128237  1354 recover.cpp:564] Updating replica status to VOTING
> I0108 01:13:16.128402  1359 master.cpp:1629] The newly elected leader is 
> master@172.17.0.2:32801 with id 773d31e8-383d-4e4b-aa68-f9a3fb9f1fc2
> I0108 01:13:16.128489  1359 master.cpp:1642] Elected as the leading master!
> I0108 01:13:16.128523  1359 master.cpp:1387] Recovering from registrar
> I0108 01:13:16.128756  1355 registrar.cpp:307] Recovering registrar
> I0108 01:13:16.129259  1344 leveldb.cpp:304] 

[jira] [Comment Edited] (MESOS-4289) Design doc for simple appc image discovery

2016-01-08 Thread Jojy Varghese (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15087847#comment-15087847
 ] 

Jojy Varghese edited comment on MESOS-4289 at 1/8/16 11:25 PM:
---


https://docs.google.com/document/d/1EeL4JApd2-cW6p3xdBatOc9foT3W3E5atQLJ2iVj5Ow/edit?usp=sharing


was (Author: jojy):
https://docs.google.com/document/d/1EeL4JApd2-cW6p3xdBatOc9foT3W3E5atQLJ2iVj5Ow/edit#heading=h.xof8uidxnjzv

> Design doc for simple appc image discovery
> --
>
> Key: MESOS-4289
> URL: https://issues.apache.org/jira/browse/MESOS-4289
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Jojy Varghese
>Assignee: Jojy Varghese
>  Labels: mesosphere
>
> Create a design document describing the following:
> - Model and abstraction of the Discoverer
> - Workflow of the discovery process





[jira] [Commented] (MESOS-4258) Generate xml test reports in the jenkins build.

2016-01-08 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15090303#comment-15090303
 ] 

Benjamin Mahler commented on MESOS-4258:


Your patch is committed, so now the report files are generated. The next part
is to process the reports in Jenkins. I think we'll want to use '[docker
cp|https://docs.docker.com/engine/reference/commandline/cp/]' to copy the
report files out of the container and into the Jenkins workspace. This likely
means removing {{--rm}} from our {{docker run}} invocation and placing the rm
command within the EXIT trap. [~lins05], can you do this next part as well?

> Generate xml test reports in the jenkins build.
> ---
>
> Key: MESOS-4258
> URL: https://issues.apache.org/jira/browse/MESOS-4258
> Project: Mesos
>  Issue Type: Task
>  Components: test
>Reporter: Benjamin Mahler
>Assignee: Shuai Lin
>  Labels: newbie
>
> Google test has a flag for generating reports:
> {{--gtest_output=xml:report.xml}}
> Jenkins can display these reports via the xUnit plugin, which has support for 
> google test xml: https://wiki.jenkins-ci.org/display/JENKINS/xUnit+Plugin
> This lets us quickly see which test failed, as well as the time that each 
> test took to run.
> We should wire this up. One difficulty is that 'make distclean' complains 
> because the .xml files are left over (we could update distclean to wipe any 
> .xml files within the test locations):
> {noformat}
> ERROR: files left in build directory after distclean:
> ./3rdparty/libprocess/3rdparty/report.xml
> ./3rdparty/libprocess/report.xml
> ./src/report.xml
> make[1]: *** [distcleancheck] Error 1
> {noformat}





[jira] [Commented] (MESOS-4318) PersistentVolumeTest.BadACLNoPrincipal is flaky

2016-01-08 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15090311#comment-15090311
 ] 

Greg Mann commented on MESOS-4318:
--

Review here: https://reviews.apache.org/r/42096/

> PersistentVolumeTest.BadACLNoPrincipal is flaky
> ---
>
> Key: MESOS-4318
> URL: https://issues.apache.org/jira/browse/MESOS-4318
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>Assignee: Greg Mann
>  Labels: flaky-test
>
> https://builds.apache.org/job/Mesos/1457/COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,OS=centos:7,label_exp=docker%7C%7CHadoop/consoleFull
> {noformat}
> [ RUN  ] PersistentVolumeTest.BadACLNoPrincipal
> I0108 01:13:16.117883  1325 leveldb.cpp:174] Opened db in 2.614722ms
> I0108 01:13:16.118650  1325 leveldb.cpp:181] Compacted db in 706567ns
> I0108 01:13:16.118702  1325 leveldb.cpp:196] Created db iterator in 24489ns
> I0108 01:13:16.118723  1325 leveldb.cpp:202] Seeked to beginning of db in 
> 2436ns
> I0108 01:13:16.118738  1325 leveldb.cpp:271] Iterated through 0 keys in the 
> db in 397ns
> I0108 01:13:16.118793  1325 replica.cpp:779] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0108 01:13:16.119627  1348 recover.cpp:447] Starting replica recovery
> I0108 01:13:16.120352  1348 recover.cpp:473] Replica is in EMPTY status
> I0108 01:13:16.121750  1357 replica.cpp:673] Replica in EMPTY status received 
> a broadcasted recover request from (7084)@172.17.0.2:32801
> I0108 01:13:16.122297  1353 recover.cpp:193] Received a recover response from 
> a replica in EMPTY status
> I0108 01:13:16.122747  1350 recover.cpp:564] Updating replica status to 
> STARTING
> I0108 01:13:16.123625  1354 master.cpp:365] Master 
> 773d31e8-383d-4e4b-aa68-f9a3fb9f1fc2 (d9632dd1c41e) started on 
> 172.17.0.2:32801
> I0108 01:13:16.123946  1347 leveldb.cpp:304] Persisting metadata (8 bytes) to 
> leveldb took 728242ns
> I0108 01:13:16.123999  1347 replica.cpp:320] Persisted replica status to 
> STARTING
> I0108 01:13:16.123708  1354 master.cpp:367] Flags at startup: 
> --acls="create_volumes {
>   principals {
> values: "test-principal"
>   }
>   volume_types {
> type: ANY
>   }
> }
> create_volumes {
>   principals {
> type: ANY
>   }
>   volume_types {
> type: NONE
>   }
> }
> " --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate="false" --authenticate_slaves="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/f2rA75/credentials" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_slave_ping_timeouts="5" --quiet="false" 
> --recovery_slave_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_store_timeout="25secs" 
> --registry_strict="true" --roles="role1" --root_submissions="true" 
> --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" 
> --user_sorter="drf" --version="false" 
> --webui_dir="/mesos/mesos-0.27.0/_inst/share/mesos/webui" 
> --work_dir="/tmp/f2rA75/master" --zk_session_timeout="10secs"
> I0108 01:13:16.124219  1354 master.cpp:414] Master allowing unauthenticated 
> frameworks to register
> I0108 01:13:16.124236  1354 master.cpp:417] Master only allowing 
> authenticated slaves to register
> I0108 01:13:16.124248  1354 credentials.hpp:35] Loading credentials for 
> authentication from '/tmp/f2rA75/credentials'
> I0108 01:13:16.124294  1358 recover.cpp:473] Replica is in STARTING status
> I0108 01:13:16.124644  1354 master.cpp:456] Using default 'crammd5' 
> authenticator
> I0108 01:13:16.124820  1354 master.cpp:493] Authorization enabled
> W0108 01:13:16.124843  1354 master.cpp:553] The '--roles' flag is deprecated. 
> This flag will be removed in the future. See the Mesos 0.27 upgrade notes for 
> more information
> I0108 01:13:16.125154  1348 hierarchical.cpp:147] Initialized hierarchical 
> allocator process
> I0108 01:13:16.125334  1345 whitelist_watcher.cpp:77] No whitelist given
> I0108 01:13:16.126065  1346 replica.cpp:673] Replica in STARTING status 
> received a broadcasted recover request from (7085)@172.17.0.2:32801
> I0108 01:13:16.126806  1348 recover.cpp:193] Received a recover response from 
> a replica in STARTING status
> I0108 01:13:16.128237  1354 recover.cpp:564] Updating replica status to VOTING
> I0108 01:13:16.128402  1359 master.cpp:1629] The newly elected leader is 
> master@172.17.0.2:32801 with id 773d31e8-383d-4e4b-aa68-f9a3fb9f1fc2
> I0108 01:13:16.128489  1359 master.cpp:1642] Elected as the leading master!
> I0108 01:13:16.128523  1359 master.cpp:1387] Recovering from registrar
> I0108 01:13:16.128756  1355 registrar.cpp:307] 

[jira] [Commented] (MESOS-4258) Generate xml test reports in the jenkins build.

2016-01-08 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15090327#comment-15090327
 ] 

Shuai Lin commented on MESOS-4258:
--

[~bmahler] sure, I'll do that.

> Generate xml test reports in the jenkins build.
> ---
>
> Key: MESOS-4258
> URL: https://issues.apache.org/jira/browse/MESOS-4258
> Project: Mesos
>  Issue Type: Task
>  Components: test
>Reporter: Benjamin Mahler
>Assignee: Shuai Lin
>  Labels: newbie
>
> Google test has a flag for generating reports:
> {{--gtest_output=xml:report.xml}}
> Jenkins can display these reports via the xUnit plugin, which has support for 
> google test xml: https://wiki.jenkins-ci.org/display/JENKINS/xUnit+Plugin
> This lets us quickly see which test failed, as well as the time that each 
> test took to run.
> We should wire this up. One difficulty is that 'make distclean' complains 
> because the .xml files are left over (we could update distclean to wipe any 
> .xml files within the test locations):
> {noformat}
> ERROR: files left in build directory after distclean:
> ./3rdparty/libprocess/3rdparty/report.xml
> ./3rdparty/libprocess/report.xml
> ./src/report.xml
> make[1]: *** [distcleancheck] Error 1
> {noformat}





[jira] [Updated] (MESOS-4249) Mesos fetcher step skipped with MESOS_DOCKER_MESOS_IMAGE flag

2016-01-08 Thread Shuai Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuai Lin updated MESOS-4249:
-
Shepherd: Timothy Chen

> Mesos fetcher step skipped with MESOS_DOCKER_MESOS_IMAGE flag
> -
>
> Key: MESOS-4249
> URL: https://issues.apache.org/jira/browse/MESOS-4249
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.26.0
> Environment: mesos 0.26.0-0.2.145.ubuntu1404
>Reporter: Marica Antonacci
>Assignee: Shuai Lin
>
> The following behaviour has been observed using a dockerized Mesos slave.
> If the slave is running inside a Docker container with the docker_mesos_image
> startup flag and you submit the deployment of a dockerized application or job
> (through Marathon/Chronos), the fetcher step is not performed. On the other
> hand, if you request the deployment of a non-dockerized application, the URIs
> are correctly fetched. Moreover, if I don't provide the docker_mesos_image
> flag, the fetcher works fine again for both dockerized and non-dockerized
> applications.
> More details in the user mailing list
> (https://www.mail-archive.com/user@mesos.apache.org/msg05429.html).





[jira] [Commented] (MESOS-4301) Accepting an inverse offer prints misleading logs

2016-01-08 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15090140#comment-15090140
 ] 

Joseph Wu commented on MESOS-4301:
--

While fixing this log line, I found another bug.

Essentially:
# {{validation::offer::validate}} returns an error when an {{InverseOffer}} is 
accepted.
# If an {{Offer}} is part of the same {{Call::ACCEPT}}, the master sees 
{{error.isSome()}} and returns a {{TASK_LOST}} for normal offers.  
(https://github.com/apache/mesos/blob/fafbdca610d0a150b9fa9cb62d1c63cb7a6fdaf3/src/master/master.cpp#L3117)

Regression test:
https://reviews.apache.org/r/42092/

> Accepting an inverse offer prints misleading logs
> -
>
> Key: MESOS-4301
> URL: https://issues.apache.org/jira/browse/MESOS-4301
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.25.0
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: log, maintenance, mesosphere
>
> Whenever a scheduler accepts an inverse offer, Mesos will print a line like 
> this in the master logs:
> {code}
> W1125 10:05:53.155109 29362 master.cpp:2897] ACCEPT call used invalid offers 
> '[ 932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 ]': Offer 
> 932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 is no longer valid
> {code}
> Inverse offers should not trigger this warning.





[jira] [Commented] (MESOS-4301) Accepting an inverse offer prints misleading logs

2016-01-08 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15090173#comment-15090173
 ] 

Joseph Wu commented on MESOS-4301:
--

Review to fix the logging and the regression test above:
https://reviews.apache.org/r/42086/

> Accepting an inverse offer prints misleading logs
> -
>
> Key: MESOS-4301
> URL: https://issues.apache.org/jira/browse/MESOS-4301
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.25.0
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: log, maintenance, mesosphere
>
> Whenever a scheduler accepts an inverse offer, Mesos will print a line like 
> this in the master logs:
> {code}
> W1125 10:05:53.155109 29362 master.cpp:2897] ACCEPT call used invalid offers 
> '[ 932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 ]': Offer 
> 932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 is no longer valid
> {code}
> Inverse offers should not trigger this warning.





[jira] [Commented] (MESOS-4229) Docker containers left running on disk after reviewbot builds

2016-01-08 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15090192#comment-15090192
 ] 

Greg Mann commented on MESOS-4229:
--

The issue may be due to a hung build. Jenkins should kill such a build after a
specified period, but in that case perhaps Docker cleanup doesn't occur as
normal.

> Docker containers left running on disk after reviewbot builds
> -
>
> Key: MESOS-4229
> URL: https://issues.apache.org/jira/browse/MESOS-4229
> Project: Mesos
>  Issue Type: Bug
> Environment: ASF Mesos Reviewbot
>Reporter: Greg Mann
>  Labels: build, mesosphere, test
>
> The Mesos Reviewbot builds recently failed due to Docker containers being 
> left running on the disk, eventually leading to a full disk: 
> https://issues.apache.org/jira/browse/INFRA-10984
> These containers should be automatically cleaned up to avoid this problem in 
> the future.





[jira] [Commented] (MESOS-3472) RegistryTokenTest.ExpiredToken test is flaky

2016-01-08 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15090037#comment-15090037
 ] 

Neil Conway commented on MESOS-3472:


Weird: Clock::now.secs of 1483973693 is a time in 2017.

Looking into this, the problem seems to be that every time we run the test 
suite, we {{Clock::advance}} by about 4 weeks. So if you run the entire test 
suite with {{gtest_repeat}} set to ~12 or more, we'll eventually move the clock 
forward one year, which means the token we create in 
{{RegistryTokenTest.ExpiredToken}} will no longer be expired and the test will 
fail.

Possible fixes:

1. Have {{RegistryTokenTest.ExpiredToken}} use an offset of more than 1 year. 
Obviously this is kludgy.
2. Have {{RegistryTokenTest.ExpiredToken}} use a fixed time in the past, rather 
than picking one relative to {{Clock::now}}. Again, somewhat kludgy, although 
better than #1.
3. Have {{MesosTest::TearDown}} reset the clock (via {{Clock::update}}) to some 
"initial" value. Right now we don't capture an appropriate initial value, 
however.
4. Introduce {{Clock::resetAdvance()}} which clears the effect of any 
{{Clock::advance}} calls, and then invoke this in {{MesosTest::TearDown}}.

I'm inclined to do #4.
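
For reference, a minimal sketch of the drift itself (libprocess's Clock API;
the four-week figure comes from the observation above):

{code}
#include <process/clock.hpp>

#include <stout/duration.hpp>

int main()
{
  // Each full run of the test suite leaves the test clock roughly four
  // weeks ahead; nothing rewinds it between gtest_repeat repetitions.
  process::Clock::pause();
  process::Clock::advance(Weeks(4));
  process::Clock::resume();

  // After ~12 repetitions, Clock::now() has moved more than a year
  // forward, so the "expired" token is no longer expired.
  return 0;
}
{code}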

> RegistryTokenTest.ExpiredToken test is flaky
> 
>
> Key: MESOS-3472
> URL: https://issues.apache.org/jira/browse/MESOS-3472
> Project: Mesos
>  Issue Type: Bug
>Reporter: Artem Harutyunyan
>Assignee: Neil Conway
>  Labels: flaky, mesosphere
>
> RegistryTokenTest.ExpiredToken test is flaky. Here is the error I got on OSX 
> after running it for several times:
> {noformat}
> [ RUN  ] RegistryTokenTest.ExpiredToken
> ../../src/tests/containerizer/provisioner_docker_tests.cpp:167: Failure
> Value of: token.isError()
>   Actual: false
> Expected: true
> libc++abi.dylib: terminating with uncaught exception of type 
> testing::internal::GoogleTestFailureException: 
> ../../src/tests/containerizer/provisioner_docker_tests.cpp:167: Failure
> Value of: token.isError()
>   Actual: false
> Expected: true
> *** Aborted at 1442708631 (unix time) try "date -d @1442708631" if you are 
> using GNU date ***
> PC: @ 0x7fff925fd286 __pthread_kill
> *** SIGABRT (@0x7fff925fd286) received by PID 7082 (TID 0x7fff7d7ad300) stack 
> trace: ***
> @ 0x7fff9041af1a _sigtramp
> @ 0x7fff59759968 (unknown)
> @ 0x7fff9bb429b3 abort
> @ 0x7fff90ce1a21 abort_message
> @ 0x7fff90d099b9 default_terminate_handler()
> @ 0x7fff994767eb _objc_terminate()
> @ 0x7fff90d070a1 std::__terminate()
> @ 0x7fff90d06d48 __cxa_rethrow
> @0x10781bb16 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @0x1077e9d30 testing::UnitTest::Run()
> @0x106d59a91 RUN_ALL_TESTS()
> @0x106d55d47 main
> @ 0x7fff8fc395c9 start
> @0x3 (unknown)
> Abort trap: 6
> ~/src/mesos/build ((3ee82e3...)) $
> {noformat}





[jira] [Commented] (MESOS-4229) Docker containers left running on disk after reviewbot builds

2016-01-08 Thread Jojy Varghese (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15090180#comment-15090180
 ] 

Jojy Varghese commented on MESOS-4229:
--

We use the *--rm* flag when launching *docker run*. I thought that was
sufficient for cleanup.

> Docker containers left running on disk after reviewbot builds
> -
>
> Key: MESOS-4229
> URL: https://issues.apache.org/jira/browse/MESOS-4229
> Project: Mesos
>  Issue Type: Bug
> Environment: ASF Mesos Reviewbot
>Reporter: Greg Mann
>  Labels: build, mesosphere, test
>
> The Mesos Reviewbot builds recently failed due to Docker containers being 
> left running on the disk, eventually leading to a full disk: 
> https://issues.apache.org/jira/browse/INFRA-10984
> These containers should be automatically cleaned up to avoid this problem in 
> the future.





[jira] [Commented] (MESOS-3472) RegistryTokenTest.ExpiredToken test is flaky

2016-01-08 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15090226#comment-15090226
 ] 

Neil Conway commented on MESOS-3472:


Good point -- some initial experiments seem to confirm that moving the clock 
backwards in {{MesosTest::TearDown}} will not be trivial. I guess we should do 
#1 or #2 for now.

> RegistryTokenTest.ExpiredToken test is flaky
> 
>
> Key: MESOS-3472
> URL: https://issues.apache.org/jira/browse/MESOS-3472
> Project: Mesos
>  Issue Type: Bug
>Reporter: Artem Harutyunyan
>Assignee: Neil Conway
>  Labels: flaky, mesosphere
>
> RegistryTokenTest.ExpiredToken test is flaky. Here is the error I got on OSX 
> after running it for several times:
> {noformat}
> [ RUN  ] RegistryTokenTest.ExpiredToken
> ../../src/tests/containerizer/provisioner_docker_tests.cpp:167: Failure
> Value of: token.isError()
>   Actual: false
> Expected: true
> libc++abi.dylib: terminating with uncaught exception of type 
> testing::internal::GoogleTestFailureException: 
> ../../src/tests/containerizer/provisioner_docker_tests.cpp:167: Failure
> Value of: token.isError()
>   Actual: false
> Expected: true
> *** Aborted at 1442708631 (unix time) try "date -d @1442708631" if you are 
> using GNU date ***
> PC: @ 0x7fff925fd286 __pthread_kill
> *** SIGABRT (@0x7fff925fd286) received by PID 7082 (TID 0x7fff7d7ad300) stack 
> trace: ***
> @ 0x7fff9041af1a _sigtramp
> @ 0x7fff59759968 (unknown)
> @ 0x7fff9bb429b3 abort
> @ 0x7fff90ce1a21 abort_message
> @ 0x7fff90d099b9 default_terminate_handler()
> @ 0x7fff994767eb _objc_terminate()
> @ 0x7fff90d070a1 std::__terminate()
> @ 0x7fff90d06d48 __cxa_rethrow
> @0x10781bb16 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @0x1077e9d30 testing::UnitTest::Run()
> @0x106d59a91 RUN_ALL_TESTS()
> @0x106d55d47 main
> @ 0x7fff8fc395c9 start
> @0x3 (unknown)
> Abort trap: 6
> ~/src/mesos/build ((3ee82e3...)) $
> {noformat}





[jira] [Commented] (MESOS-4312) Porting Mesos on Power (ppc64le)

2016-01-08 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089066#comment-15089066
 ] 

Qian Zhang commented on MESOS-4312:
---

RR:
https://reviews.apache.org/r/42068/
https://reviews.apache.org/r/42069/

> Porting Mesos on Power (ppc64le)
> 
>
> Key: MESOS-4312
> URL: https://issues.apache.org/jira/browse/MESOS-4312
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>
> The goal of this ticket is to make IBM Power (ppc64le) a supported
> hardware platform for Mesos. Currently the latest Mesos code cannot be
> successfully built on ppc64le; we will resolve the build errors in this
> ticket, and also make sure the Mesos test suite ("make check") can be run
> successfully on ppc64le.





[jira] [Updated] (MESOS-4279) Graceful restart of docker task

2016-01-08 Thread Qian Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang updated MESOS-4279:
--
Shepherd: Timothy Chen

> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0
>Reporter: Martin Bydzovsky
>Assignee: Qian Zhang
>
> I'm implementing graceful restarts of our mesos-marathon-docker setup and I
> came across the following issue:
> (it was already discussed on
> https://github.com/mesosphere/marathon/issues/2876 and the guys from
> Mesosphere got to the point that it's probably a Docker containerizer
> problem...)
> To sum it up:
> When I deploy this simple Python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
>
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
>
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
>
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get the expected result - the task receives SIGTERM
> and dies peacefully (within my script-specified 2-second period).
> But when I wrap this Python script in a Docker image:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run the corresponding application via Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>   type: "DOCKER",
>   docker: {
>   image: "bydga/marathon-test-api"
>   },
>   forcePullImage: yes
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> During a restart (issued from Marathon), the task dies immediately without
> having a chance to do any cleanup.





[jira] [Updated] (MESOS-3421) Support sharing of resources across task instances

2016-01-08 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-3421:
--
Shepherd: Adam B

I'll volunteer to shepherd this.
cc: [~anandmazumdar] who wanted to help review/implement.

> Support sharing of resources across task instances
> --
>
> Key: MESOS-3421
> URL: https://issues.apache.org/jira/browse/MESOS-3421
> Project: Mesos
>  Issue Type: Improvement
>  Components: general
>Affects Versions: 0.23.0
>Reporter: Anindya Sinha
>Assignee: Anindya Sinha
>  Labels: external-volumes, persistent-volumes
>
> A service that needs a persistent volume may need access to the same
> persistent volume (RW) from multiple task instances on the same agent
> node. Currently, a persistent volume, once offered to the framework(s), can
> be scheduled to a task, and until that task terminates, the persistent
> volume cannot be used by another task.
> Explore providing the capability to share persistent volumes across task
> instances scheduled on a single agent node.
> Based on discussion within the community, we would allow sharing of resources
> in general, and add support to enable shareability for persistent volumes.





[jira] [Comment Edited] (MESOS-4301) Accepting an inverse offer prints misleading logs

2016-01-08 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15090173#comment-15090173
 ] 

Joseph Wu edited comment on MESOS-4301 at 1/9/16 1:57 AM:
--

Review to:
* Fix the logging.
* Fix the bug found above.
* Refactor {{Master::accept}} to read more sequentially.

https://reviews.apache.org/r/42086/


was (Author: kaysoky):
Review to fix the logging and the regression test above:
https://reviews.apache.org/r/42086/

> Accepting an inverse offer prints misleading logs
> -
>
> Key: MESOS-4301
> URL: https://issues.apache.org/jira/browse/MESOS-4301
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.25.0
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: log, maintenance, mesosphere
>
> Whenever a scheduler accepts an inverse offer, Mesos will print a line like 
> this in the master logs:
> {code}
> W1125 10:05:53.155109 29362 master.cpp:2897] ACCEPT call used invalid offers 
> '[ 932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 ]': Offer 
> 932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 is no longer valid
> {code}
> Inverse offers should not trigger this warning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4258) Generate xml test reports in the jenkins build.

2016-01-08 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090360#comment-15090360
 ] 

Shuai Lin commented on MESOS-4258:
--

Besides the patch, this would also require a Jenkins admin to configure the 
locations of the xml files, as described in the "Configuration" section of 
[Jenkins xUnit Plugin 
Page|https://wiki.jenkins-ci.org/display/JENKINS/xUnit+Plugin]. Here we have 
three xml reports:

- 3rdparty/libprocess/3rdparty/report.xml
- 3rdparty/libprocess/report.xml
- src/report.xml
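
For a quick local look at what these reports contain, here is a minimal Python 
sketch (assuming only the standard gtest XML schema: {{testsuite}} elements 
holding {{testcase}} elements, with failures embedded as {{failure}} children) 
that lists the failing cases and their runtimes:
{code}
import xml.etree.ElementTree as ET

def failed_tests(path):
    # Walk the gtest report; a test case failed if it embeds a <failure>.
    root = ET.parse(path).getroot()
    for suite in root.iter("testsuite"):
        for case in suite.iter("testcase"):
            if case.find("failure") is not None:
                yield "%s.%s (%ss)" % (suite.get("name"),
                                       case.get("name"),
                                       case.get("time"))

for test in failed_tests("src/report.xml"):
    print(test)
{code}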


> Generate xml test reports in the jenkins build.
> ---
>
> Key: MESOS-4258
> URL: https://issues.apache.org/jira/browse/MESOS-4258
> Project: Mesos
>  Issue Type: Task
>  Components: test
>Reporter: Benjamin Mahler
>Assignee: Shuai Lin
>  Labels: newbie
>
> Google test has a flag for generating reports:
> {{--gtest_output=xml:report.xml}}
> Jenkins can display these reports via the xUnit plugin, which has support for 
> google test xml: https://wiki.jenkins-ci.org/display/JENKINS/xUnit+Plugin
> This lets us quickly see which test failed, as well as the time that each 
> test took to run.
> We should wire this up. One difficulty is that 'make distclean' complains 
> because the .xml files are left over (we could update distclean to wipe any 
> .xml files within the test locations):
> {noformat}
> ERROR: files left in build directory after distclean:
> ./3rdparty/libprocess/3rdparty/report.xml
> ./3rdparty/libprocess/report.xml
> ./src/report.xml
> make[1]: *** [distcleancheck] Error 1
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3746) Consider introducing a mechanism to provide feedback on offer operations

2016-01-08 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu updated MESOS-3746:
---
Description: 
Currently, the master does not provide a direct feedback to the framework when 
an operation is dropped: 
https://github.com/apache/mesos/blob/0.26.0/src/master/master.cpp#L1703-L1717

A "subsequent offer" is used as the mechanism to determine whether an operation 
succeeded or not, which is not sufficient if a framework mistakenly sends 
invalid operations. There should be an immediate feedback as to whether the 
request was "accepted".

  was:
Currently, the master does not provide a direct feedback to the framework when 
an operation is dropped: 
https://github.com/apache/mesos/blob/master/src/master/master.cpp#L1713-L1715

A "subsequent offer" is used as the mechanism to determine whether an operation 
succeeded or not, which is not sufficient if a framework mistakenly sends 
invalid operations. There should be an immediate feedback as to whether the 
request was "accepted".


> Consider introducing a mechanism to provide feedback on offer operations
> 
>
> Key: MESOS-3746
> URL: https://issues.apache.org/jira/browse/MESOS-3746
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Michael Park
>  Labels: mesosphere, persistent-volumes, reservations
>
> Currently, the master does not provide a direct feedback to the framework 
> when an operation is dropped: 
> https://github.com/apache/mesos/blob/0.26.0/src/master/master.cpp#L1703-L1717
> A "subsequent offer" is used as the mechanism to determine whether an 
> operation succeeded or not, which is not sufficient if a framework mistakenly 
> sends invalid operations. There should be an immediate feedback as to whether 
> the request was "accepted".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4313) S

2016-01-08 Thread Joerg Schad (JIRA)
Joerg Schad created MESOS-4313:
--

 Summary: S
 Key: MESOS-4313
 URL: https://issues.apache.org/jira/browse/MESOS-4313
 Project: Mesos
  Issue Type: Bug
Reporter: Joerg Schad






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4314) Publish Quota Documentation

2016-01-08 Thread Joerg Schad (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joerg Schad updated MESOS-4314:
---
Sprint: Mesosphere Sprint 26

> Publish Quota Documentation
> ---
>
> Key: MESOS-4314
> URL: https://issues.apache.org/jira/browse/MESOS-4314
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Joerg Schad
>Assignee: Joerg Schad
>
> Publish and finish the operator guide draft for quota, which describes basic 
> usage of the endpoints and a few basic and advanced use cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4316) Support get non-default weights by /weights

2016-01-08 Thread Yongqiao Wang (JIRA)
Yongqiao Wang created MESOS-4316:


 Summary: Support get non-default weights by /weights
 Key: MESOS-4316
 URL: https://issues.apache.org/jira/browse/MESOS-4316
 Project: Mesos
  Issue Type: Task
Reporter: Yongqiao Wang
Assignee: Yongqiao Wang
Priority: Minor


Like /quota, we should also add query logic for /weights to keep things 
consistent. Then /roles would no longer need to show weight information.
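
A rough usage sketch in Python 2 of the proposed query support (hypothetical: 
this ticket does not specify the response format, so the {{role}}/{{weight}} 
field names and the default weight of 1.0 below are assumptions):
{code}
import json
import urllib2

def non_default_weights(master="http://mesos-master:5050"):
    # Hypothetical GET on the proposed /weights endpoint; assumes it returns
    # a JSON array of {"role": ..., "weight": ...} objects.
    weights = json.load(urllib2.urlopen(master + "/weights"))
    return [(w["role"], w["weight"]) for w in weights if w["weight"] != 1.0]
{code}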



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3877) Draft operator documentation for quota

2016-01-08 Thread Joerg Schad (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089179#comment-15089179
 ] 

Joerg Schad commented on MESOS-3877:


Finished draft and published review via MESOS-4314.

> Draft operator documentation for quota
> --
>
> Key: MESOS-3877
> URL: https://issues.apache.org/jira/browse/MESOS-3877
> Project: Mesos
>  Issue Type: Task
>  Components: documentation
>Reporter: Alexander Rukletsov
>Assignee: Joerg Schad
>  Labels: mesosphere
>
> Draft an operator guide for quota which describes basic usage of the 
> endpoints and a few basic and advanced use cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3307) Configurable size of completed task / framework history

2016-01-08 Thread Ian Babrou (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089129#comment-15089129
 ] 

Ian Babrou commented on MESOS-3307:
---

Having API params to fetch only interesting tasks would be very nice. Mesos-DNS 
and similar tools don't care about the size of the completed task history; they 
only care about alive tasks. Many tools also only care about tasks with certain 
labels and/or ports allocated.

Having a Mesos event bus similar to Marathon's event bus would eliminate the 
need to do active polling altogether, but that takes time (is there an issue 
for this, btw?).

I'm okay with having flags for history size, though, since [that's what I use 
now|https://github.com/cloudflare/mesos/commit/d247372226d6cbbe57fa856a0b3788e60200ef92].
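
To illustrate how little of the state such tools actually need, here is a small 
client-side sketch in Python 2 (today's {{/master/state.json}} has no such 
query parameter, so this simply drops the completed tasks after fetching the 
full document):
{code}
import json
import urllib2

def alive_tasks(master="http://mesos-master:5050"):
    # Fetch the full state and keep only active tasks; the per-framework
    # "tasks" lists exclude the "completed_tasks" history that bloats the
    # response.
    state = json.load(urllib2.urlopen(master + "/master/state.json"))
    return [task
            for framework in state.get("frameworks", [])
            for task in framework.get("tasks", [])]
{code}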

> Configurable size of completed task / framework history
> ---
>
> Key: MESOS-3307
> URL: https://issues.apache.org/jira/browse/MESOS-3307
> Project: Mesos
>  Issue Type: Bug
>Reporter: Ian Babrou
>Assignee: Kevin Klues
>  Labels: mesosphere
>
> We try to make Mesos work with multiple frameworks and mesos-dns at the same 
> time. The goal is to have a set of frameworks per team / project on a single 
> Mesos cluster.
> At this point our mesos state.json is at 4mb and it takes a while to 
> assemble. 5 mesos-dns instances hit state.json every 5 seconds, effectively 
> pushing mesos-master CPU usage through the roof. It's at 100%+ all the time.
> Here's the problem:
> {noformat}
> mesos λ curl -s http://mesos-master:5050/master/state.json | jq 
> .frameworks[].completed_tasks[].framework_id | sort | uniq -c | sort -n
>1 "20150606-001827-252388362-5050-5982-0003"
>   16 "20150606-001827-252388362-5050-5982-0005"
>   18 "20150606-001827-252388362-5050-5982-0029"
>   73 "20150606-001827-252388362-5050-5982-0007"
>  141 "20150606-001827-252388362-5050-5982-0009"
>  154 "20150820-154817-302720010-5050-15320-"
>  289 "20150606-001827-252388362-5050-5982-0004"
>  510 "20150606-001827-252388362-5050-5982-0012"
>  666 "20150606-001827-252388362-5050-5982-0028"
>  923 "20150116-002612-269165578-5050-32204-0003"
> 1000 "20150606-001827-252388362-5050-5982-0001"
> 1000 "20150606-001827-252388362-5050-5982-0006"
> 1000 "20150606-001827-252388362-5050-5982-0010"
> 1000 "20150606-001827-252388362-5050-5982-0011"
> 1000 "20150606-001827-252388362-5050-5982-0027"
> mesos λ fgrep 1000 -r src/master
> src/master/constants.cpp:const size_t MAX_REMOVED_SLAVES = 100000;
> src/master/constants.cpp:const uint32_t MAX_COMPLETED_TASKS_PER_FRAMEWORK = 
> 1000;
> {noformat}
> Active tasks are just 6% of state.json response:
> {noformat}
> mesos λ cat ~/temp/mesos-state.json | jq -c . | wc
>1   14796 4138942
> mesos λ cat ~/temp/mesos-state.json | jq .frameworks[].tasks | jq -c . | wc
>   16  37  252774
> {noformat}
> I see four options that can improve the situation:
> 1. Add query string param to exclude completed tasks from state.json and use 
> it in mesos-dns and similar tools. There is no need for mesos-dns to know 
> about completed tasks, it's just extra load on master and mesos-dns.
> 2. Make history size configurable.
> 3. Make JSON serialization faster. With 1000s of tasks even without history 
> it would take a lot of time to serialize tasks for mesos-dns. Doing it every 
> 60 seconds instead of every 5 seconds isn't really an option.
> 4. Create event bus for mesos master. Marathon has it and it'd be nice to 
> have it in Mesos. This way mesos-dns could avoid polling master state and 
> switch to listening for events.
> All can be done independently.
> Note to mesosphere folks: please start distributing debug symbols with your 
> distribution. I was asking for it for a while and it is really helpful: 
> https://github.com/mesosphere/marathon/issues/1497#issuecomment-104182501
> Perf report for leading master: 
> !http://i.imgur.com/iz7C3o0.png!
> I'm on 0.23.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4314) Publish Quota Documentation

2016-01-08 Thread Joerg Schad (JIRA)
Joerg Schad created MESOS-4314:
--

 Summary: Publish Quota Documentation
 Key: MESOS-4314
 URL: https://issues.apache.org/jira/browse/MESOS-4314
 Project: Mesos
  Issue Type: Documentation
Reporter: Joerg Schad
Assignee: Joerg Schad


Publish and finish the operator guide draft for quota, which describes basic 
usage of the endpoints and a few basic and advanced use cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4315) Improve Quota Failover Logic

2016-01-08 Thread Joerg Schad (JIRA)
Joerg Schad created MESOS-4315:
--

 Summary: Improve Quota Failover Logic
 Key: MESOS-4315
 URL: https://issues.apache.org/jira/browse/MESOS-4315
 Project: Mesos
  Issue Type: Improvement
Reporter: Joerg Schad


The quota failover logic introduced with MESOS-3865 changes master failover 
recovery significantly if at least one quota is set.

Now, if any previously set quota is detected upon recovery, the allocator 
enters recovery mode, during which it does not issue offers. The recovery 
mode, and therefore the suspension of offers, ends when either:

* a certain fraction of agents reregisters (by default 80% of the agents 
known before the failover), or
* a timeout expires (by default 10 minutes).

We could also safely exit recovery mode once all quota has been satisfied 
(i.e. all agents participating in satisfying quota have reconnected).
For small clusters with a large percentage of quota'ed resources this will 
not make much difference compared to the existing rules. But for larger 
clusters this condition could be fulfilled much faster than the 80% condition.

We should at least consider whether such a condition is worth the added 
complexity; a sketch of the combined exit predicate follows below.
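
A minimal sketch in Python of the proposed exit predicate (illustration only: 
the real allocator is C++, the names here are invented, and the 0.8/10-minute 
values simply mirror the defaults quoted above):
{code}
AGENT_FRACTION = 0.8       # default: 80% of agents known before the failover
TIMEOUT_SECONDS = 10 * 60  # default: 10 minutes

def may_exit_recovery(reregistered, known_before, elapsed, quota_satisfied):
    # Leave recovery mode (and resume issuing offers) as soon as any
    # condition holds; quota_satisfied is the proposed new condition.
    return (reregistered >= AGENT_FRACTION * known_before
            or elapsed >= TIMEOUT_SECONDS
            or quota_satisfied)
{code}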





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4035) UserCgroupIsolatorTest.ROOT_CGROUPS_UserCgroup fails on CentOS 6.6

2016-01-08 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089240#comment-15089240
 ] 

Jan Schlicht commented on MESOS-4035:
-

I assume that this was in a virtual machine and that something like {{sudo 
./bin/mesos-tests.sh}} was running prior to this? I tried reproducing this and 
am pretty sure that I've seen the exact same error before, but could only find 
something that is quite similar and probably has the same cause:
Some virtual machines (e.g. Virtualbox) don't provide _CPU performance 
counters_ for their guests. This affects some root tests of Mesos that try to 
use {{perf}} to sample the {{cycles}} event. One of these tests is 
{{PerfEventIsolatorTest.ROOT_CGROUPS_Sample}}. Running {{sudo 
./bin/mesos-tests.sh --gtest_filter="*ROOT_CGROUPS_Sample"}} in such an 
environment will fail and keep a child process running that will block some 
cgroups from being removed. This affects all test processes that are run 
afterwards that try to clean up some cgroups before being run. 
{{UserCgroupIsolatorTest.ROOT_CGROUPS_UserCgroup}} is one of those. Restarting 
the VM will reset this behavior.
So, in a fresh VM, running {{sudo ./bin/mesos-tests.sh 
--gtest_filter="*ROOT_CGROUPS_UserCgroup"}} should pass, but doing this after 
running {{sudo ./bin/mesos-tests.sh --gtest_filter="*ROOT_CGROUPS_Sample"}} 
should fail.

> UserCgroupIsolatorTest.ROOT_CGROUPS_UserCgroup fails on CentOS 6.6
> --
>
> Key: MESOS-4035
> URL: https://issues.apache.org/jira/browse/MESOS-4035
> Project: Mesos
>  Issue Type: Bug
> Environment: CentOS6.6
>Reporter: Gilbert Song
>Assignee: Jan Schlicht
>
> `ROOT_CGROUPS_UserCgroup` fails on CentOS6.6 with 0.26rc3. The environment setup on 
> CentOS6.6 is based on the latest update of /docs/getting-started.md. Either using 
> devtoolset-2 or devtoolset-3 returns the same failure. 
> If running `sudo ./bin/mesos-tests.sh 
> --gtest_filter="*ROOT_CGROUPS_UserCgroup*"`, it returns failures as in the 
> following log:
> {noformat}
> [==] Running 3 tests from 3 test cases.
> [--] Global test environment set-up.
> [--] 1 test from UserCgroupIsolatorTest/0, where TypeParam = 
> mesos::internal::slave::CgroupsMemIsolatorProcess
> userdel: user 'mesos.test.unprivileged.user' does not exist
> [ RUN  ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup
> ../../src/tests/mesos.cpp:722: Failure
> cgroups::mount(hierarchy, subsystem): '/tmp/mesos_test_cgroup/perf_event' 
> already exists in the file system
> -
> We cannot run any cgroups tests that require
> a hierarchy with subsystem 'perf_event'
> because we failed to find an existing hierarchy
> or create a new one (tried '/tmp/mesos_test_cgroup/perf_event').
> You can either remove all existing
> hierarchies, or disable this test case
> (i.e., --gtest_filter=-UserCgroupIsolatorTest/0.*).
> -
> ../../src/tests/mesos.cpp:776: Failure
> cgroups: '/tmp/mesos_test_cgroup/perf_event' is not a valid hierarchy
> [  FAILED  ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup, where 
> TypeParam = mesos::internal::slave::CgroupsMemIsolatorProcess (1 ms)
> [--] 1 test from UserCgroupIsolatorTest/0 (1 ms total)
> [--] 1 test from UserCgroupIsolatorTest/1, where TypeParam = 
> mesos::internal::slave::CgroupsCpushareIsolatorProcess
> userdel: user 'mesos.test.unprivileged.user' does not exist
> [ RUN  ] UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup
> ../../src/tests/mesos.cpp:722: Failure
> cgroups::mount(hierarchy, subsystem): '/tmp/mesos_test_cgroup/perf_event' 
> already exists in the file system
> -
> We cannot run any cgroups tests that require
> a hierarchy with subsystem 'perf_event'
> because we failed to find an existing hierarchy
> or create a new one (tried '/tmp/mesos_test_cgroup/perf_event').
> You can either remove all existing
> hierarchies, or disable this test case
> (i.e., --gtest_filter=-UserCgroupIsolatorTest/1.*).
> -
> ../../src/tests/mesos.cpp:776: Failure
> cgroups: '/tmp/mesos_test_cgroup/perf_event' is not a valid hierarchy
> [  FAILED  ] UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup, where 
> TypeParam = mesos::internal::slave::CgroupsCpushareIsolatorProcess (4 ms)
> [--] 1 test from UserCgroupIsolatorTest/1 (5 ms total)
> [--] 1 test from UserCgroupIsolatorTest/2, where TypeParam = 
> mesos::internal::slave::CgroupsPerfEventIsolatorProcess
> userdel: user 'mesos.test.unprivileged.user' does not exist
> [ RUN  ] UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup
> 

[jira] [Issue Comment Deleted] (MESOS-4035) UserCgroupIsolatorTest.ROOT_CGROUPS_UserCgroup fails on CentOS 6.6

2016-01-08 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht updated MESOS-4035:

Comment: was deleted

(was: I assume that this was in a virtual machine and that something like 
{{sudo ./bin/mesos-tests.sh}} was running prior to this? I tried reproducing 
this and am pretty sure that I've seen the exact same error before, but could 
only find something that is quite similar and probably has the same cause:
Some virtual machines (e.g. Virtualbox) don't provide _CPU performance 
counters_ for their guests. This affects some root tests of Mesos that try to 
use {{perf}} to sample the {{cycles}} event. One of these tests is 
{{PerfEventIsolatorTest.ROOT_CGROUPS_Sample}}. Running {{sudo 
./bin/mesos-tests.sh --gtest_filter="*ROOT_CGROUPS_Sample"}} in such an 
environment will fail and keep a child process running that will block some 
cgroups from being removed. This affects all test processes that are run 
afterwards that try to clean up some cgroups before being run. 
{{UserCgroupIsolatorTest.ROOT_CGROUPS_UserCgroup}} is one of those. Restarting 
the VM will reset this behavior.
So, in a fresh VM, running {{sudo ./bin/mesos-tests.sh 
--gtest_filter="*ROOT_CGROUPS_UserCgroup"}} should pass, but doing this after 
running {{sudo ./bin/mesos-tests.sh --gtest_filter="*ROOT_CGROUPS_Sample"}} 
should fail.)

> UserCgroupIsolatorTest.ROOT_CGROUPS_UserCgroup fails on CentOS 6.6
> --
>
> Key: MESOS-4035
> URL: https://issues.apache.org/jira/browse/MESOS-4035
> Project: Mesos
>  Issue Type: Bug
> Environment: CentOS6.6
>Reporter: Gilbert Song
>Assignee: Jan Schlicht
>
> `ROOT_CGROUPS_UserCgroup` fails on CentOS6.6 with 0.26rc3. The environment setup on 
> CentOS6.6 is based on the latest update of /docs/getting-started.md. Either using 
> devtoolset-2 or devtoolset-3 returns the same failure. 
> If running `sudo ./bin/mesos-tests.sh 
> --gtest_filter="*ROOT_CGROUPS_UserCgroup*"`, it returns failures as in the 
> following log:
> {noformat}
> [==] Running 3 tests from 3 test cases.
> [--] Global test environment set-up.
> [--] 1 test from UserCgroupIsolatorTest/0, where TypeParam = 
> mesos::internal::slave::CgroupsMemIsolatorProcess
> userdel: user 'mesos.test.unprivileged.user' does not exist
> [ RUN  ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup
> ../../src/tests/mesos.cpp:722: Failure
> cgroups::mount(hierarchy, subsystem): '/tmp/mesos_test_cgroup/perf_event' 
> already exists in the file system
> -
> We cannot run any cgroups tests that require
> a hierarchy with subsystem 'perf_event'
> because we failed to find an existing hierarchy
> or create a new one (tried '/tmp/mesos_test_cgroup/perf_event').
> You can either remove all existing
> hierarchies, or disable this test case
> (i.e., --gtest_filter=-UserCgroupIsolatorTest/0.*).
> -
> ../../src/tests/mesos.cpp:776: Failure
> cgroups: '/tmp/mesos_test_cgroup/perf_event' is not a valid hierarchy
> [  FAILED  ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup, where 
> TypeParam = mesos::internal::slave::CgroupsMemIsolatorProcess (1 ms)
> [--] 1 test from UserCgroupIsolatorTest/0 (1 ms total)
> [--] 1 test from UserCgroupIsolatorTest/1, where TypeParam = 
> mesos::internal::slave::CgroupsCpushareIsolatorProcess
> userdel: user 'mesos.test.unprivileged.user' does not exist
> [ RUN  ] UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup
> ../../src/tests/mesos.cpp:722: Failure
> cgroups::mount(hierarchy, subsystem): '/tmp/mesos_test_cgroup/perf_event' 
> already exists in the file system
> -
> We cannot run any cgroups tests that require
> a hierarchy with subsystem 'perf_event'
> because we failed to find an existing hierarchy
> or create a new one (tried '/tmp/mesos_test_cgroup/perf_event').
> You can either remove all existing
> hierarchies, or disable this test case
> (i.e., --gtest_filter=-UserCgroupIsolatorTest/1.*).
> -
> ../../src/tests/mesos.cpp:776: Failure
> cgroups: '/tmp/mesos_test_cgroup/perf_event' is not a valid hierarchy
> [  FAILED  ] UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup, where 
> TypeParam = mesos::internal::slave::CgroupsCpushareIsolatorProcess (4 ms)
> [--] 1 test from UserCgroupIsolatorTest/1 (5 ms total)
> [--] 1 test from UserCgroupIsolatorTest/2, where TypeParam = 
> mesos::internal::slave::CgroupsPerfEventIsolatorProcess
> userdel: user 'mesos.test.unprivileged.user' does not exist
> [ RUN  ] UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup
> ../../src/tests/mesos.cpp:722: 

[jira] [Commented] (MESOS-4035) UserCgroupIsolatorTest.ROOT_CGROUPS_UserCgroup fails on CentOS 6.6

2016-01-08 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089264#comment-15089264
 ] 

Jan Schlicht commented on MESOS-4035:
-

This doesn't seem to be related to the perf support. I could reproduce this on 
a virtual machine where {{sudo ./bin/mesos-tests.sh 
--gtest_filter="*ROOT_CGROUPS_UserCgroup"}} was the first command being run. 
There are some issues where perf related tests could fail and leave a running 
process that could influence subsequent test runs, but this problem seems to be 
different.

> UserCgroupIsolatorTest.ROOT_CGROUPS_UserCgroup fails on CentOS 6.6
> --
>
> Key: MESOS-4035
> URL: https://issues.apache.org/jira/browse/MESOS-4035
> Project: Mesos
>  Issue Type: Bug
> Environment: CentOS6.6
>Reporter: Gilbert Song
>Assignee: Jan Schlicht
>
> `ROOT_CGROUPS_UserCgroup` fails on CentOS6.6 with 0.26rc3. The environment setup on 
> CentOS6.6 is based on the latest update of /docs/getting-started.md. Either using 
> devtoolset-2 or devtoolset-3 returns the same failure. 
> If running `sudo ./bin/mesos-tests.sh 
> --gtest_filter="*ROOT_CGROUPS_UserCgroup*"`, it returns failures as in the 
> following log:
> {noformat}
> [==] Running 3 tests from 3 test cases.
> [--] Global test environment set-up.
> [--] 1 test from UserCgroupIsolatorTest/0, where TypeParam = 
> mesos::internal::slave::CgroupsMemIsolatorProcess
> userdel: user 'mesos.test.unprivileged.user' does not exist
> [ RUN  ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup
> ../../src/tests/mesos.cpp:722: Failure
> cgroups::mount(hierarchy, subsystem): '/tmp/mesos_test_cgroup/perf_event' 
> already exists in the file system
> -
> We cannot run any cgroups tests that require
> a hierarchy with subsystem 'perf_event'
> because we failed to find an existing hierarchy
> or create a new one (tried '/tmp/mesos_test_cgroup/perf_event').
> You can either remove all existing
> hierarchies, or disable this test case
> (i.e., --gtest_filter=-UserCgroupIsolatorTest/0.*).
> -
> ../../src/tests/mesos.cpp:776: Failure
> cgroups: '/tmp/mesos_test_cgroup/perf_event' is not a valid hierarchy
> [  FAILED  ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup, where 
> TypeParam = mesos::internal::slave::CgroupsMemIsolatorProcess (1 ms)
> [--] 1 test from UserCgroupIsolatorTest/0 (1 ms total)
> [--] 1 test from UserCgroupIsolatorTest/1, where TypeParam = 
> mesos::internal::slave::CgroupsCpushareIsolatorProcess
> userdel: user 'mesos.test.unprivileged.user' does not exist
> [ RUN  ] UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup
> ../../src/tests/mesos.cpp:722: Failure
> cgroups::mount(hierarchy, subsystem): '/tmp/mesos_test_cgroup/perf_event' 
> already exists in the file system
> -
> We cannot run any cgroups tests that require
> a hierarchy with subsystem 'perf_event'
> because we failed to find an existing hierarchy
> or create a new one (tried '/tmp/mesos_test_cgroup/perf_event').
> You can either remove all existing
> hierarchies, or disable this test case
> (i.e., --gtest_filter=-UserCgroupIsolatorTest/1.*).
> -
> ../../src/tests/mesos.cpp:776: Failure
> cgroups: '/tmp/mesos_test_cgroup/perf_event' is not a valid hierarchy
> [  FAILED  ] UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup, where 
> TypeParam = mesos::internal::slave::CgroupsCpushareIsolatorProcess (4 ms)
> [--] 1 test from UserCgroupIsolatorTest/1 (5 ms total)
> [--] 1 test from UserCgroupIsolatorTest/2, where TypeParam = 
> mesos::internal::slave::CgroupsPerfEventIsolatorProcess
> userdel: user 'mesos.test.unprivileged.user' does not exist
> [ RUN  ] UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup
> ../../src/tests/mesos.cpp:722: Failure
> cgroups::mount(hierarchy, subsystem): '/tmp/mesos_test_cgroup/perf_event' 
> already exists in the file system
> -
> We cannot run any cgroups tests that require
> a hierarchy with subsystem 'perf_event'
> because we failed to find an existing hierarchy
> or create a new one (tried '/tmp/mesos_test_cgroup/perf_event').
> You can either remove all existing
> hierarchies, or disable this test case
> (i.e., --gtest_filter=-UserCgroupIsolatorTest/2.*).
> -
> ../../src/tests/mesos.cpp:776: Failure
> cgroups: '/tmp/mesos_test_cgroup/perf_event' is not a valid hierarchy
> [  FAILED  ] UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup, where 
> TypeParam = 

[jira] [Commented] (MESOS-4279) Graceful restart of docker task

2016-01-08 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089301#comment-15089301
 ] 

Qian Zhang commented on MESOS-4279:
---

When creating an app of Docker type in Marathon, the processes launched on the 
Mesos agent look like this:
{code}
root  2086  2063  0 Jan06 ?        00:00:49 docker -H 
unix:///var/run/docker.sock run -c 102 -m 33554432 -e 
MARATHON_APP_VERSION=2016-01-06T14:24:40.412Z -e HOST=mesos -e 
MARATHON_APP_DOCKER_IMAGE=mesos-4279 -e PORT_1=31433 -e 
MESOS_TASK_ID=app-docker1.af64d5d2-b481-11e5-bdf1-0242497320ff -e PORT=31433 -e 
PORTS=31433 -e MARATHON_APP_ID=/app-docker1 -e PORT0=31433 -e 
MESOS_SANDBOX=/mnt/mesos/sandbox -e 
MESOS_CONTAINER_NAME=mesos-9ee670be-3c38-4c23-91c1-826b283dd283-S7.a919ce36-9b6e-4086-bfe8-9f0a34a3f471
 -v 
/tmp/mesos/slaves/9ee670be-3c38-4c23-91c1-826b283dd283-S7/frameworks/83ced7f5-69b3-409b-abe5-a582a5d278cd-/executors/app-docker1.af64d5d2-b481-11e5-bdf1-0242497320ff/runs/a919ce36-9b6e-4086-bfe8-9f0a34a3f471:/mnt/mesos/sandbox
 --net bridge --entrypoint /bin/sh --name 
mesos-9ee670be-3c38-4c23-91c1-826b283dd283-S7.a919ce36-9b6e-4086-bfe8-9f0a34a3f471
 mesos-4279 -c python /app/script.py
root  2124  2103  0 Jan06 ?        00:00:00 /bin/sh -c python /app/script.py
root  2140  2124  0 Jan06 ?        00:00:35 python /app/script.py
{code}

The first process (2086) is the "docker run" command launched by the Mesos 
docker executor, and the second & third processes (2124 & 2140) are the app 
processes launched by the Docker daemon. When restarting the app in Marathon, 
the Mesos docker executor will kill the app processes first. The way it does 
the "kill" is to run the "docker stop" command 
(https://github.com/apache/mesos/blob/0.26.0/src/docker/executor.cpp#L218), and 
the "docker stop" command will ONLY send SIGTERM to process 2124, but NOT 
to 2140 (the actual user script), which is why the signal handler in the user 
script is not triggered.

However, for an app which is not of Docker type, when killing it, the executor 
will send SIGTERM to the whole process group 
(https://github.com/apache/mesos/blob/0.26.0/src/launcher/executor.cpp#L419), 
so the user script gets the signal too.

I am not sure if there is a way for "docker stop" to not only send SIGTERM to 
the parent process of the user script but also to the user script process 
itself ... 
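
The difference can be illustrated with a small Python sketch (a simplified 
illustration only; the real command executor is C++ and also escalates to 
SIGKILL after a grace period):
{code}
import os
import signal

def stop_like_command_executor(leader_pid):
    # Signal the whole process group, so a shell wrapper AND the user
    # script behind it both receive SIGTERM (what the command executor in
    # src/launcher/executor.cpp effectively does).
    os.killpg(os.getpgid(leader_pid), signal.SIGTERM)

def stop_like_docker_stop(pid1):
    # Signal only the container's PID 1; a child such as "python
    # /app/script.py" behind a "/bin/sh -c" wrapper never sees the SIGTERM.
    os.kill(pid1, signal.SIGTERM)
{code}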

> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0
>Reporter: Martin Bydzovsky
>Assignee: Qian Zhang
>
> I'm implementing graceful restarts of our mesos-marathon-docker setup and I 
> came across the following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876 and the guys from mesosphere 
> got to the point that it's probably a docker containerizer problem...)
> To sum it up:
> When I deploy a simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
> print "got %i" % _signo
> print datetime.datetime.now().time()
> sys.stdout.flush()
> sleep(2)
> print datetime.datetime.now().time()
> print "ending"
> sys.stdout.flush()
> sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
> print "Hello"
> i = 0
> while True:
> i += 1
> print datetime.datetime.now().time()
> print "Iteration #%i" % i
> sys.stdout.flush()
> sleep(1)
> finally:
> print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get the expected result - the task receives SIGTERM and 
> dies peacefully (within my script-specified 2-second grace period)
> But when I wrap this python script in a Docker image:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run the corresponding application via Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>     type: "DOCKER",
>     docker: {
>       image: "bydga/marathon-test-api"
>     },
>     forcePullImage: true
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> During a restart (issued from Marathon), the task dies immediately without 
> having a chance to do any cleanup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3082) Perf related tests rely on 'cycles' which might not always be present.

2016-01-08 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089443#comment-15089443
 ] 

Jan Schlicht commented on MESOS-3082:
-

Tests trying to sample using perf with the 'cycles' value can cause failures of 
other tests if run on a virtual machine that does not support _CPU performance 
counters_. E.g. running {{sudo ./bin/mesos-tests.sh 
--gtest_filter="*ROOT_CGROUPS_Sample"}} will fail and sometimes keep a child 
process running. This process will block some cgroups from being removed. This 
affects all test processes that are run afterwards that try to clean up some 
cgroups before being run (mostly {{ROOT_CGROUPS_*}}).

I'd suggest to disable these test if in a virtual machine without _CPU 
performance counters_.
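
One way such a guard could look, as a Python sketch (an assumption for 
illustration, not how Mesos selects tests; it requires the perf(1) tool 
inside the VM):
{code}
import subprocess

def cycles_supported():
    # Ask perf to count the 'cycles' hardware event for a trivial command;
    # VMs without CPU performance counters report it as "<not supported>".
    proc = subprocess.Popen(["perf", "stat", "-e", "cycles", "true"],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            universal_newlines=True)
    _, err = proc.communicate()
    return proc.returncode == 0 and "<not supported>" not in err
{code}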

> Perf related tests rely on 'cycles' which might not always be present.
> --
>
> Key: MESOS-3082
> URL: https://issues.apache.org/jira/browse/MESOS-3082
> Project: Mesos
>  Issue Type: Bug
> Environment: Ubuntu 14.04 (in a virtual machine)
>Reporter: Benjamin Hindman
>  Labels: mesosphere
>
> When running the tests on Ubuntu 14.04 the 'cycles' value collected by perf 
> is always 0, meaning certain tests always fail. These lines in the test have 
> been commented out for now and a TODO has been attached which links to this 
> JIRA issue, since the solution is unclear. In particular, 'cycles' might not 
> properly be counted because it is a hardware counter and this particular 
> machine was a virtual machine. Either way, we should determine the best 
> events to collect from perf in either VM or physical settings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4279) Graceful restart of docker task

2016-01-08 Thread Martin Bydzovsky (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089364#comment-15089364
 ] 

Martin Bydzovsky commented on MESOS-4279:
-

Well, I guess you introduced another "issue" in your test example. It's related 
to the way you started the Marathon app. Please look at the explanation 
here: 
https://mesosphere.github.io/marathon/docs/native-docker.html#command-vs-args. 
In your {{ps}} output, you can see that the actual command is {{/bin/sh -c 
python /app/script.py}} - wrapped by sh -c.

Seems like you started your Marathon app with something like: 
{code}curl -XPOST http://marathon:8080/v2/apps --data={id: "test-app", cmd: 
"python script.py", ...} {code}
What I was showing in my examples above was: 
{code}curl -XPOST http://marathon:8080/v2/apps --data={id: "test-app", args: 
["/tmp/script.py"], ...} {code}

Usually this is called a "PID 1 problem" - 
https://medium.com/@gchudnov/trapping-signals-in-docker-containers-7a57fdda7d86#.zcxhq8yqn.

Simply put, in your example the PID 1 inside the docker container is the shell 
process and the actual python script is PID 2. The default signal handlers for 
all processes EXCEPT PID 1 are to shut down on SIGINT/SIGTERM; PID 1's default 
signal handlers just ignore them. (A common workaround is sketched right below.)
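
A minimal init-style signal forwarder in Python, as a sketch of that 
workaround (purpose-built tools such as tini or dumb-init do this properly; 
this is only an illustration):
{code}
import signal
import subprocess
import sys

# Run the real workload as a child; this wrapper becomes PID 1 in the
# container and forwards termination signals instead of ignoring them.
child = subprocess.Popen(sys.argv[1:])

def forward(signo, _frame):
    child.send_signal(signo)

signal.signal(signal.SIGTERM, forward)
signal.signal(signal.SIGINT, forward)

sys.exit(child.wait())
{code}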

So you could retry the example and use args instead of cmd. Then your {{ps}} 
output should look like:
{code}
root 10738  0.0  0.0 218228 14236 ? 15:22   0:00 docker run -c 102 -m 
268435456 -e PORT_10002=31123 -e MARATHON_APP_VERSION=2016-01-08T15:22:49.646Z 
-e HOST=mesos-slave1.example.com -e 
MARATHON_APP_DOCKER_IMAGE=bydga/marathon-test-api -e 
MESOS_TASK_ID=marathon-test-api.ad9cbac5-b61b-11e5-af54-023bd987a59b -e 
PORT=31123 -e PORTS=31123 -e MARATHON_APP_ID=/marathon-test-api -e PORT0=31123 
-e MESOS_SANDBOX=/mnt/mesos/sandbox -v 
/srv/mesos/slaves/20160106-114735-3423223818-5050-1508-S3/frameworks/20160106-083626-1258962954-5050-9311-/executors/marathon-test-api.ad9cbac5-b61b-11e5-af54-023bd987a59b/runs/bbeb80ab-e8d0-4b93-b7a0-6475787e090f:/mnt/mesos/sandbox
 --net host --name 
mesos-20160106-114735-3423223818-5050-1508-S3.bbeb80ab-e8d0-4b93-b7a0-6475787e090f
 bydga/marathon-test-api ./script.py
root 10749  0.0  0.0  21576  4336 ? 15:22   0:00 /usr/bin/python 
./script.py
{code}

With this setup, the docker stop works as expected:
{code}
bydzovskym mesos-slave1:aws ~   docker ps
CONTAINER IDIMAGE 
COMMAND  CREATED STATUS  PORTS  
 NAMES
ed4a35e4372cbydga/marathon-test-api   
"./script.py"7 minutes ago   Up 7 minutes   
 
mesos-20160106-114735-3423223818-5050-1508-S3.bbeb80ab-e8d0-4b93-b7a0-6475787e090f

bydzovskym mesos-slave1:aws ~   time docker stop ed4a35e4372c
ed4a35e4372c

real    0m2.184s
user    0m0.016s
sys     0m0.042s
{code}
and the output of the docker container:
{code}
bydzovskym mesos-slave1:aws ~   docker logs -f ed4a35e4372c
Hello
15:15:57.943294
Iteration #1
15:15:58.944470
Iteration #2
15:15:59.945631
Iteration #3
15:16:00.946794
got 15
15:16:40.473517
15:16:42.475655
ending
Goodbye
{code}

The docker stop took a little more than 2 seconds - matching the grace period 
in the python script.

I still guess the problem is somewhere in how Mesos orchestrates docker - 
either it sends a wrong {{docker kill}}, or it kills it even more painfully 
(killing the {{docker run}} process with the Linux {{kill}} command)...

> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0
>Reporter: Martin Bydzovsky
>Assignee: Qian Zhang
>
> I'm implementing graceful restarts of our mesos-marathon-docker setup and I 
> came across the following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876 and the guys from mesosphere 
> got to the point that it's probably a docker containerizer problem...)
> To sum it up:
> When I deploy a simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
> print "got %i" % _signo
> print datetime.datetime.now().time()
> sys.stdout.flush()
> sleep(2)
> print datetime.datetime.now().time()
> print "ending"
> sys.stdout.flush()
> sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
> print "Hello"
> i = 0
> while True:
> i += 1
> print datetime.datetime.now().time()
> print "Iteration #%i" % i
> sys.stdout.flush()
> 

[jira] [Comment Edited] (MESOS-3082) Perf related tests rely on 'cycles' which might not always be present.

2016-01-08 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089443#comment-15089443
 ] 

Jan Schlicht edited comment on MESOS-3082 at 1/8/16 4:22 PM:
-

Tests trying to sample using perf with the 'cycles' value can cause failures of 
other tests if run in a virtual machine that does not support _CPU performance 
counters_. E.g. running {{sudo ./bin/mesos-tests.sh 
--gtest_filter="\*ROOT_CGROUPS_Sample"}} will fail and sometimes keep a child 
process running. This process will block some cgroups from being removed. This 
affects all test processes that are run afterwards that try to clean up some 
cgroups before being run (mostly {{ROOT_CGROUPS_*}}).

I'd suggest to disable these test if in a virtual machine without _CPU 
performance counters_.


was (Author: nfnt):
Tests trying to sample using perf with the 'cycles' value can cause failures of 
other tests if run on a virtual machine that does not support _CPU performance 
counters_. E.g. running {{sudo ./bin/mesos-tests.sh 
--gtest_filter="\*ROOT_CGROUPS_Sample"}} will fail and sometimes keep a child 
process running. This process will block some cgroups from being removed. This 
affects all test processes that are run afterwards that try to clean up some 
cgroups before being run (mostly {{ROOT_CGROUPS_*}}).

I'd suggest to disable these test if in a virtual machine without _CPU 
performance counters_.

> Perf related tests rely on 'cycles' which might not always be present.
> --
>
> Key: MESOS-3082
> URL: https://issues.apache.org/jira/browse/MESOS-3082
> Project: Mesos
>  Issue Type: Bug
> Environment: Ubuntu 14.04 (in a virtual machine)
>Reporter: Benjamin Hindman
>Assignee: Jan Schlicht
>  Labels: mesosphere
>
> When running the tests on Ubuntu 14.04 the 'cycles' value collected by perf 
> is always 0, meaning certain tests always fail. These lines in the test have 
> been commented out for now and a TODO has been attached which links to this 
> JIRA issue, since the solution is unclear. In particular, 'cycles' might not 
> properly be counted because it is a hardware counter and this particular 
> machine was a virtual machine. Either way, we should determine the best 
> events to collect from perf in either VM or physical settings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-3082) Perf related tests rely on 'cycles' which might not always be present.

2016-01-08 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089443#comment-15089443
 ] 

Jan Schlicht edited comment on MESOS-3082 at 1/8/16 4:26 PM:
-

Tests trying to sample using perf with the 'cycles' value can cause failures of 
other tests if run in a virtual machine that does not support _CPU performance 
counters_. E.g. running {{sudo ./bin/mesos-tests.sh 
--gtest_filter="\*ROOT_CGROUPS_Sample"}} will fail and sometimes keep a child 
process running. This process will block some cgroups from being removed. This 
affects all test processes that are run afterwards that try to clean up some 
cgroups before being run (mostly {{ROOT_CGROUPS_*}}).

I'd suggest to disable these tests if in a virtual machine without _CPU 
performance counters_.


was (Author: nfnt):
Tests trying to sample using perf with the 'cycles' value can cause failures of 
other tests if run in a virtual machine that does not support _CPU performance 
counters_. E.g. running {{sudo ./bin/mesos-tests.sh 
--gtest_filter="\*ROOT_CGROUPS_Sample"}} will fail and sometimes keep a child 
process running. This process will block some cgroups from being removed. This 
affects all test processes that are run afterwards that try to clean up some 
cgroups before being run (mostly {{ROOT_CGROUPS_*}}).

I'd suggest to disable these test if in a virtual machine without _CPU 
performance counters_.

> Perf related tests rely on 'cycles' which might not always be present.
> --
>
> Key: MESOS-3082
> URL: https://issues.apache.org/jira/browse/MESOS-3082
> Project: Mesos
>  Issue Type: Bug
> Environment: Ubuntu 14.04 (in a virtual machine)
>Reporter: Benjamin Hindman
>Assignee: Jan Schlicht
>  Labels: mesosphere
>
> When running the tests on Ubuntu 14.04 the 'cycles' value collected by perf 
> is always 0, meaning certain tests always fail. These lines in the test have 
> been commented out for now and a TODO has been attached which links to this 
> JIRA issue, since the solution is unclear. In particular, 'cycles' might not 
> properly be counted because it is a hardware counter and this particular 
> machine was a virtual machine. Either way, we should determine the best 
> events to collect from perf in either VM or physical settings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3082) Perf related tests rely on 'cycles' which might not always be present.

2016-01-08 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht updated MESOS-3082:

Sprint: Mesosphere Sprint 26

> Perf related tests rely on 'cycles' which might not always be present.
> --
>
> Key: MESOS-3082
> URL: https://issues.apache.org/jira/browse/MESOS-3082
> Project: Mesos
>  Issue Type: Bug
> Environment: Ubuntu 14.04 (in a virtual machine)
>Reporter: Benjamin Hindman
>Assignee: Jan Schlicht
>  Labels: mesosphere
>
> When running the tests on Ubuntu 14.04 the 'cycles' value collected by perf 
> is always 0, meaning certain tests always fail. These lines in the test have 
> been commented out for now and a TODO has been attached which links to this 
> JIRA issue, since the solution is unclear. In particular, 'cycles' might not 
> properly be counted because it is a hardware counter and this particular 
> machine was a virtual machine. Either way, we should determine the best 
> events to collect from perf in either VM or physical settings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-3082) Perf related tests rely on 'cycles' which might not always be present.

2016-01-08 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht reassigned MESOS-3082:
---

Assignee: Jan Schlicht

> Perf related tests rely on 'cycles' which might not always be present.
> --
>
> Key: MESOS-3082
> URL: https://issues.apache.org/jira/browse/MESOS-3082
> Project: Mesos
>  Issue Type: Bug
> Environment: Ubuntu 14.04 (in a virtual machine)
>Reporter: Benjamin Hindman
>Assignee: Jan Schlicht
>  Labels: mesosphere
>
> When running the tests on Ubuntu 14.04 the 'cycles' value collected by perf 
> is always 0, meaning certain tests always fail. These lines in the test have 
> been commented out for now and a TODO has been attached which links to this 
> JIRA issue, since the solution is unclear. In particular, 'cycles' might not 
> properly be counted because it is a hardware counter and this particular 
> machine was a virtual machine. Either way, we should determine the best 
> events to collect from perf in either VM or physical settings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-3082) Perf related tests rely on 'cycles' which might not always be present.

2016-01-08 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089443#comment-15089443
 ] 

Jan Schlicht edited comment on MESOS-3082 at 1/8/16 4:20 PM:
-

Tests trying to sample using perf with the 'cycles' value can cause failures of 
other tests if run on a virtual machine that does not support _CPU performance 
counters_. E.g. running {{sudo ./bin/mesos-tests.sh 
--gtest_filter="\*ROOT_CGROUPS_Sample"}} will fail and sometimes keep a child 
process running. This process will block some cgroups from being removed. This 
affects all test processes that are run afterwards that try to clean up some 
cgroups before being run (mostly {{ROOT_CGROUPS_*}}).

I'd suggest to disable these test if in a virtual machine without _CPU 
performance counters_.


was (Author: nfnt):
Tests trying to sample using perf with the 'cycles' value can cause failures of 
other tests if run on a virtual machine that does not support _CPU performance 
counters_. E.g. running {{sudo ./bin/mesos-tests.sh 
--gtest_filter="*ROOT_CGROUPS_Sample"}} will fail and sometimes keep a child 
process running. This process will block some cgroups from being removed. This 
affects all test processes that are run afterwards that try to clean up some 
cgroups before being run (mostly {{ROOT_CGROUPS_*}}).

I'd suggest to disable these test if in a virtual machine without _CPU 
performance counters_.

> Perf related tests rely on 'cycles' which might not always be present.
> --
>
> Key: MESOS-3082
> URL: https://issues.apache.org/jira/browse/MESOS-3082
> Project: Mesos
>  Issue Type: Bug
> Environment: Ubuntu 14.04 (in a virtual machine)
>Reporter: Benjamin Hindman
>  Labels: mesosphere
>
> When running the tests on Ubuntu 14.04 the 'cycles' value collected by perf 
> is always 0, meaning certain tests always fail. These lines in the test have 
> been commented out for now and a TODO has been attached which links to this 
> JIRA issue, since the solution is unclear. In particular, 'cycles' might not 
> properly be counted because it is a hardware counter and this particular 
> machine was a virtual machine. Either way, we should determine the best 
> events to collect from perf in either VM or physical settings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)