[jira] [Commented] (MESOS-1806) Substituting etcd for Zookeeper

2016-02-05 Thread Brandon Philips (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15135189#comment-15135189
 ] 

Brandon Philips commented on MESOS-1806:


I added an overview section to the etcd v3 API docs, with video overviews of
the changes:
https://github.com/coreos/etcd/blob/master/Documentation/rfc/v3api.md#overview






> Substituting etcd for Zookeeper
> ---
>
> Key: MESOS-1806
> URL: https://issues.apache.org/jira/browse/MESOS-1806
> Project: Mesos
>  Issue Type: Task
>  Components: leader election
>Reporter: Ed Ropple
>Assignee: Shuai Lin
>Priority: Minor
>
>eropple: Could you also file a new JIRA for Mesos to drop ZK 
> in favor of etcd or ReplicatedLog? Would love to get some momentum going on 
> that one.
> --
> Consider it filed. =)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1806) Substituting etcd for Zookeeper

2016-02-05 Thread JIRA

[ 
https://issues.apache.org/jira/browse/MESOS-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15134372#comment-15134372
 ] 

Ivan Vučica commented on MESOS-1806:


FWIW, being able to not run Zookeeper would mean one less JVM-based service 
running on my low-memory VPS nodes.

> Substituting etcd for Zookeeper
> ---
>
> Key: MESOS-1806
> URL: https://issues.apache.org/jira/browse/MESOS-1806
> Project: Mesos
>  Issue Type: Task
>  Components: leader election
>Reporter: Ed Ropple
>Assignee: Shuai Lin
>Priority: Minor
>
>eropple: Could you also file a new JIRA for Mesos to drop ZK 
> in favor of etcd or ReplicatedLog? Would love to get some momentum going on 
> that one.
> --
> Consider it filed. =)





[jira] [Commented] (MESOS-4071) Master crash during framework teardown ( Check failed: total.resources.contains(slaveId))

2016-02-05 Thread alexius ludeman (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15134309#comment-15134309
 ] 

alexius ludeman commented on MESOS-4071:


The tester runs in a continuous serial loop over 8 tests.  All tasks use a CPU 
allocation of 0.1.  Between 1 and 8 tasks are launched per test.  
At the end of each test, all tasks are killed.  No dynamic reservations are 
used for any test.

If further information is needed to reproduce, please follow up with me.

Thanks

> Master crash during framework teardown ( Check failed: 
> total.resources.contains(slaveId))
> -
>
> Key: MESOS-4071
> URL: https://issues.apache.org/jira/browse/MESOS-4071
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.25.0
>Reporter: Mandeep Chadha
>Assignee: Neil Conway
>  Labels: mesosphere
>
> Stack Trace :
> NOTE : Replaced IP address with XX.XX.XX.XX 
> {code}
> I1204 10:31:03.391127 2588810 master.cpp:5564] Processing TEARDOWN call for 
> framework 61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014 
> (mloop-coprocesses-183c4999-9ce9-47b2-bc96-a865c672fcbb (TEST) at 
> scheduler-c8ab2103-cf36-40d8-8a2d-a6b69a8fc...@xx.xx.xx.xx:35237
> I1204 10:31:03.391177 2588810 master.cpp:5576] Removing framework 
> 61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014 
> (mloop-coprocesses-183c4999-9ce9-47b2-bc96-a865c672fcbb (TEST)) at 
> schedulerc8ab2103-cf36-40d8-8a2d-a6b69a8fc...@xx.xx.xx.xx:35237
> I1204 10:31:03.391337 2588805 hierarchical.hpp:605] Deactivated framework 
> 61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014
> F1204 10:31:03.395500 2588810 sorter.cpp:233] Check failed: 
> total.resources.contains(slaveId)
> *** Check failure stack trace: ***
> @ 0x7f2b3dda53d8  google::LogMessage::Fail()
> @ 0x7f2b3dda5327  google::LogMessage::SendToLog()
> @ 0x7f2b3dda4d38  google::LogMessage::Flush()
> @ 0x7f2b3dda7a6c  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f2b3d3351a1  
> mesos::internal::master::allocator::DRFSorter::remove()
> @ 0x7f2b3d0b8c29  
> mesos::internal::master::allocator::HierarchicalAllocatorProcess<>::removeFramework()
> @ 0x7f2b3d0ca823 
> _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_11FrameworkIDES6_EEvRKNS_3PIDIT_EEMSA_FvT0_ET1_ENKUlPNS_11ProcessBaseEE_clESJ_
> @ 0x7f2b3d0dc8dc  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS5_11FrameworkIDESA_EEvRKNS0_3PIDIT_EEMSE_FvT0_ET1_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x7f2b3dd2cc35  std::function<>::operator()()
> @ 0x7f2b3dd15ae5  process::ProcessBase::visit()
> @ 0x7f2b3dd188e2  process::DispatchEvent::visit()
> @   0x472366  process::ProcessBase::serve()
> @ 0x7f2b3dd1203f  process::ProcessManager::resume()
> @ 0x7f2b3dd061b2  process::internal::schedule()
> @ 0x7f2b3dd63efd  
> _ZNSt12_Bind_simpleIFPFvvEvEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
> @ 0x7f2b3dd63e4d  std::_Bind_simple<>::operator()()
> @ 0x7f2b3dd63de6  std::thread::_Impl<>::_M_run()
> @   0x318c2b6470  (unknown)
> @   0x318b2079d1  (unknown)
> @   0x318aae8b5d  (unknown)
> @  (nil)  (unknown)
> Aborted (core dumped)
> {code}





[jira] [Created] (MESOS-4608) Consider deprecating `slave(1)` delegate in favor of `slave` on Agent

2016-02-05 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-4608:
-

 Summary: Consider deprecating `slave(1)` delegate in favor of 
`slave` on Agent
 Key: MESOS-4608
 URL: https://issues.apache.org/jira/browse/MESOS-4608
 Project: Mesos
  Issue Type: Improvement
  Components: HTTP API
Reporter: Anand Mazumdar


Historically, we were using a {{slave(1)}} delegate on the agent while 
initializing {{libprocess}}. This meant that all root HTTP requests to the 
agent's {{ip:port}} were forwarded to the {{slave(1)}} route.

With MESOS-4255, we added the ability to pass in the process ID to the agent 
constructor. Hence, we should now be able to use {{slave}} as the delegate 
instead of {{slave(1)}}.

This would however need to go through a deprecation cycle as there might be 
existing users relying on the {{slave(1)}} endpoint.





[jira] [Updated] (MESOS-4609) Subprocess should be more intelligent about setting/inheriting libprocess environment variables

2016-02-05 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-4609:
-
Fix Version/s: (was: 0.27.1)

> Subprocess should be more intelligent about setting/inheriting libprocess 
> environment variables 
> 
>
> Key: MESOS-4609
> URL: https://issues.apache.org/jira/browse/MESOS-4609
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.27.0
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> Mostly copied from [this 
> comment|https://issues.apache.org/jira/browse/MESOS-4598?focusedCommentId=15133497=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15133497]
> A subprocess inheriting the environment variables {{LIBPROCESS_*}} may run 
> into some accidental fatalities:
> | || Subprocess uses libprocess || Subprocess is something else ||
> || Subprocess sets/inherits the same {{PORT}} by accident | Bind failure -> exit (Option #1 above prevents accidental inheritance) | Nothing happens (?) |
> || Subprocess sets a different {{PORT}} on purpose | Bind success (?) | Nothing happens (?) |
> (?) means this is usually the case, but not 100%.
> A complete fix would look something like:
> * If the {{subprocess}} call gets {{environment = None()}}, we should 
> automatically remove {{LIBPROCESS_PORT}} from the inherited environment.  
> * The parts of 
> [{{executorEnvironment}}|https://github.com/apache/mesos/blame/master/src/slave/containerizer/containerizer.cpp#L265]
>  dealing with libprocess & libmesos should be refactored into libprocess as a 
> helper.  We would use this helper for the Containerizer, Fetcher, and 
> ContainerLogger module.
> * If the {{subprocess}} call is given {{LIBPROCESS_PORT == 
> os::getenv("LIBPROCESS_PORT")}}, we can LOG(WARN) and unset the env var 
> locally.





[jira] [Updated] (MESOS-4609) Subprocess should be more intelligent about setting/inheriting libprocess environment variables

2016-02-05 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-4609:
-
Description: 
Mostly copied from [this 
comment|https://issues.apache.org/jira/browse/MESOS-4598?focusedCommentId=15133497=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15133497]

A subprocess inheriting the environment variables {{LIBPROCESS_*}} may run into 
some accidental fatalities:

| || Subprocess uses libprocess || Subprocess is something else ||
|| Subprocess sets/inherits the same {{PORT}} by accident | Bind failure -> exit (Option #1 above prevents accidental inheritance) | Nothing happens (?) |
|| Subprocess sets a different {{PORT}} on purpose | Bind success (?) | Nothing happens (?) |

(?) means this is usually the case, but not 100%.

A complete fix would look something like:
* If the {{subprocess}} call gets {{environment = None()}}, we should 
automatically remove {{LIBPROCESS_PORT}} from the inherited environment.  
* The parts of 
[{{executorEnvironment}}|https://github.com/apache/mesos/blame/master/src/slave/containerizer/containerizer.cpp#L265]
 dealing with libprocess & libmesos should be refactored into libprocess as a 
helper.  We would use this helper for the Containerizer, Fetcher, and 
ContainerLogger module.
* If the {{subprocess}} call is given {{LIBPROCESS_PORT == 
os::getenv("LIBPROCESS_PORT")}}, we can {{LOG(WARNING)}} and unset the env var locally.

  was:
The {{LogrotateContainerLogger}} starts libprocess-using subprocesses.  
Libprocess initialization will attempt to resolve the IP from the hostname.  If 
a DNS service is not available, this step will fail, which terminates the 
logger subprocess prematurely.

Since the logger subprocesses live on the agent, they should use the same 
{{LIBPROCESS_IP}} supplied to the agent.


> Subprocess should be more intelligent about setting/inheriting libprocess 
> environment variables 
> 
>
> Key: MESOS-4609
> URL: https://issues.apache.org/jira/browse/MESOS-4609
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.27.0
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> Mostly copied from [this 
> comment|https://issues.apache.org/jira/browse/MESOS-4598?focusedCommentId=15133497=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15133497]
> A subprocess inheriting the environment variables {{LIBPROCESS_*}} may run 
> into some accidental fatalities:
> | || Subprocess uses libprocess || Subprocess is something else ||
> || Subprocess sets/inherits the same {{PORT}} by accident | Bind failure -> exit (Option #1 above prevents accidental inheritance) | Nothing happens (?) |
> || Subprocess sets a different {{PORT}} on purpose | Bind success (?) | Nothing happens (?) |
> (?) means this is usually the case, but not 100%.
> A complete fix would look something like:
> * If the {{subprocess}} call gets {{environment = None()}}, we should 
> automatically remove {{LIBPROCESS_PORT}} from the inherited environment.  
> * The parts of 
> [{{executorEnvironment}}|https://github.com/apache/mesos/blame/master/src/slave/containerizer/containerizer.cpp#L265]
>  dealing with libprocess & libmesos should be refactored into libprocess as a 
> helper.  We would use this helper for the Containerizer, Fetcher, and 
> ContainerLogger module.
> * If the {{subprocess}} call is given {{LIBPROCESS_PORT == 
> os::getenv("LIBPROCESS_PORT")}}, we can LOG(WARN) and unset the env var 
> locally.





[jira] [Updated] (MESOS-4587) Docker environment variables must be able to contain the equal sign

2016-02-05 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-4587:
--
Shepherd: Jie Yu

> Docker environment variables must be able to contain the equal sign
> ---
>
> Key: MESOS-4587
> URL: https://issues.apache.org/jira/browse/MESOS-4587
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 0.25.0, 0.26.0, 0.27.0
>Reporter: Martin Tapp
>Assignee: Shuai Lin
>  Labels: containerizer
> Fix For: 0.27.1
>
>
> Note: Affects 0.26 and 0.27.
> The Jupyter Docker all-spark-notebook uses the equal sign ('=') in Docker ENV 
> declarations (for instance, 
> https://github.com/jupyter/docker-stacks/blob/master/all-spark-notebook/Dockerfile#L51).
> This causes a Mesos "Unexpected Env format for 'ContainerConfig.Env'" error.
> The problem is the tokenization code at 
> https://github.com/apache/mesos/blob/21e080c5ae6ef03556c7a2b588e034a916c7a05a/src/docker/docker.cpp#L386
>  which should split only on the first equal sign. Docker ENV declarations 
> can also be empty.





[jira] [Updated] (MESOS-4609) Subprocess should be more intelligent about setting/inheriting libprocess environment variables

2016-02-05 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-4609:
-
Story Points: 2  (was: 1)

> Subprocess should be more intelligent about setting/inheriting libprocess 
> environment variables 
> 
>
> Key: MESOS-4609
> URL: https://issues.apache.org/jira/browse/MESOS-4609
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.27.0
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> Mostly copied from [this 
> comment|https://issues.apache.org/jira/browse/MESOS-4598?focusedCommentId=15133497=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15133497]
> A subprocess inheriting the environment variables {{LIBPROCESS_*}} may run 
> into some accidental fatalities:
> | || Subprocess uses libprocess || Subprocess is something else ||
> || Subprocess sets/inherits the same {{PORT}} by accident | Bind failure -> exit (Option #1 above prevents accidental inheritance) | Nothing happens (?) |
> || Subprocess sets a different {{PORT}} on purpose | Bind success (?) | Nothing happens (?) |
> (?) means this is usually the case, but not 100%.
> A complete fix would look something like:
> * If the {{subprocess}} call gets {{environment = None()}}, we should 
> automatically remove {{LIBPROCESS_PORT}} from the inherited environment.  
> * The parts of 
> [{{executorEnvironment}}|https://github.com/apache/mesos/blame/master/src/slave/containerizer/containerizer.cpp#L265]
>  dealing with libprocess & libmesos should be refactored into libprocess as a 
> helper.  We would use this helper for the Containerizer, Fetcher, and 
> ContainerLogger module.
> * If the {{subprocess}} call is given {{LIBPROCESS_PORT == 
> os::getenv("LIBPROCESS_PORT")}}, we can LOG(WARN) and unset the env var 
> locally.





[jira] [Updated] (MESOS-4610) MasterContender/MasterDetector should be loadable as modules

2016-02-05 Thread Mark Cavage (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Cavage updated MESOS-4610:
---
External issue ID: MESOS-1806

> MasterContender/MasterDetector should be loadable as modules
> 
>
> Key: MESOS-4610
> URL: https://issues.apache.org/jira/browse/MESOS-4610
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Mark Cavage
>
> Currently Mesos depends on ZooKeeper for leader election and notification to 
> slaves, although there is a C++ hierarchy in the code to support alternatives 
> (e.g., unit tests use an in-memory implementation). From an operational 
> perspective, many organizations/users do not want to take a dependency on 
> ZooKeeper and use an alternative solution for implementing leader election. 
> Our organization in particular very much wants this, and as a reference 
> there have been several requests from the community (see referenced tickets) 
> to replace it with etcd/consul/etc.
> This ticket will serve as the work effort to modularize the 
> MasterContender/MasterDetector APIs such that integrators can build a 
> pluggable solution of their choice; this ticket will not fold in any 
> implementations such as etcd et al., but simply move this hierarchy to be 
> fully pluggable.
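For context, Mesos modules are loaded at startup from a JSON manifest passed via the {{--modules}} flag, so a pluggable contender/detector would presumably be wired up the same way. The library path and module names below are hypothetical placeholders, not part of this ticket:

```json
{
  "libraries": [
    {
      "file": "/usr/lib/mesos/libexample_contender.so",
      "modules": [
        { "name": "org_example_EtcdMasterContender" },
        { "name": "org_example_EtcdMasterDetector" }
      ]
    }
  ]
}
```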





[jira] [Updated] (MESOS-4610) MasterContender/MasterDetector should be loadable as modules

2016-02-05 Thread Mark Cavage (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Cavage updated MESOS-4610:
---
External issue ID:   (was: MESOS-1806)

> MasterContender/MasterDetector should be loadable as modules
> 
>
> Key: MESOS-4610
> URL: https://issues.apache.org/jira/browse/MESOS-4610
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Mark Cavage
>
> Currently Mesos depends on ZooKeeper for leader election and notification to 
> slaves, although there is a C++ hierarchy in the code to support alternatives 
> (e.g., unit tests use an in-memory implementation). From an operational 
> perspective, many organizations/users do not want to take a dependency on 
> ZooKeeper and use an alternative solution for implementing leader election. 
> Our organization in particular very much wants this, and as a reference 
> there have been several requests from the community (see referenced tickets) 
> to replace it with etcd/consul/etc.
> This ticket will serve as the work effort to modularize the 
> MasterContender/MasterDetector APIs such that integrators can build a 
> pluggable solution of their choice; this ticket will not fold in any 
> implementations such as etcd et al., but simply move this hierarchy to be 
> fully pluggable.





[jira] [Updated] (MESOS-4587) Docker environment variables must be able to contain the equal sign

2016-02-05 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-4587:
--
Target Version/s: 0.27.1

> Docker environment variables must be able to contain the equal sign
> ---
>
> Key: MESOS-4587
> URL: https://issues.apache.org/jira/browse/MESOS-4587
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 0.25.0, 0.26.0, 0.27.0
>Reporter: Martin Tapp
>Assignee: Shuai Lin
>  Labels: containerizer
> Fix For: 0.27.1
>
>
> Note: Affects 0.26 and 0.27.
> The Jupyter Docker all-spark-notebook uses the equal sign ('=') in Docker ENV 
> declarations (for instance, 
> https://github.com/jupyter/docker-stacks/blob/master/all-spark-notebook/Dockerfile#L51).
> This causes a Mesos "Unexpected Env format for 'ContainerConfig.Env'" error.
> The problem is the tokenization code at 
> https://github.com/apache/mesos/blob/21e080c5ae6ef03556c7a2b588e034a916c7a05a/src/docker/docker.cpp#L386
>  which should split only on the first equal sign. Docker ENV declarations 
> can also be empty.





[jira] [Created] (MESOS-4607) Docker image create should not return any error with env var

2016-02-05 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-4607:
---

 Summary: Docker image create should not return any error with env 
var
 Key: MESOS-4607
 URL: https://issues.apache.org/jira/browse/MESOS-4607
 Project: Mesos
  Issue Type: Bug
  Components: docker
Reporter: Gilbert Song
Priority: Minor


In docker image create behavior, entrypoint and environment variables are read 
from docker inspect. An error should not be returned upon finding a 
malformed env var, since that could block the docker containerizer entirely.

Specifically, we may want to just `LOG(WARNING)` for those unexpected env vars 
(please see 
https://github.com/apache/mesos/blob/master/src/docker/docker.cpp#L388~#L395).





[jira] [Updated] (MESOS-4609) Subprocess should be more intelligent about setting/inheriting libprocess environment variables

2016-02-05 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-4609:
-
Target Version/s: 0.28.0  (was: 0.27.1)

> Subprocess should be more intelligent about setting/inheriting libprocess 
> environment variables 
> 
>
> Key: MESOS-4609
> URL: https://issues.apache.org/jira/browse/MESOS-4609
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.27.0
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> Mostly copied from [this 
> comment|https://issues.apache.org/jira/browse/MESOS-4598?focusedCommentId=15133497=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15133497]
> A subprocess inheriting the environment variables {{LIBPROCESS_*}} may run 
> into some accidental fatalities:
> | || Subprocess uses libprocess || Subprocess is something else ||
> || Subprocess sets/inherits the same {{PORT}} by accident | Bind failure -> exit (Option #1 above prevents accidental inheritance) | Nothing happens (?) |
> || Subprocess sets a different {{PORT}} on purpose | Bind success (?) | Nothing happens (?) |
> (?) means this is usually the case, but not 100%.
> A complete fix would look something like:
> * If the {{subprocess}} call gets {{environment = None()}}, we should 
> automatically remove {{LIBPROCESS_PORT}} from the inherited environment.  
> * The parts of 
> [{{executorEnvironment}}|https://github.com/apache/mesos/blame/master/src/slave/containerizer/containerizer.cpp#L265]
>  dealing with libprocess & libmesos should be refactored into libprocess as a 
> helper.  We would use this helper for the Containerizer, Fetcher, and 
> ContainerLogger module.
> * If the {{subprocess}} call is given {{LIBPROCESS_PORT == 
> os::getenv("LIBPROCESS_PORT")}}, we can LOG(WARN) and unset the env var 
> locally.





[jira] [Commented] (MESOS-3307) Configurable size of completed task / framework history

2016-02-05 Thread Kevin Klues (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15134538#comment-15134538
 ] 

Kevin Klues commented on MESOS-3307:


I'm all for query parameters to filter this stuff, but others seem to disagree. 
(See the thread above).

> Configurable size of completed task / framework history
> ---
>
> Key: MESOS-3307
> URL: https://issues.apache.org/jira/browse/MESOS-3307
> Project: Mesos
>  Issue Type: Bug
>Reporter: Ian Babrou
>Assignee: Kevin Klues
>  Labels: mesosphere
> Fix For: 0.27.0
>
>
> We try to make Mesos work with multiple frameworks and mesos-dns at the same 
> time. The goal is to have set of frameworks per team / project on a single 
> Mesos cluster.
> At this point our mesos state.json is at 4mb and it takes a while to 
> assemble. 5 mesos-dns instances hit state.json every 5 seconds, effectively 
> pushing mesos-master CPU usage through the roof. It's at 100%+ all the time.
> Here's the problem:
> {noformat}
> mesos λ curl -s http://mesos-master:5050/master/state.json | jq 
> .frameworks[].completed_tasks[].framework_id | sort | uniq -c | sort -n
>1 "20150606-001827-252388362-5050-5982-0003"
>   16 "20150606-001827-252388362-5050-5982-0005"
>   18 "20150606-001827-252388362-5050-5982-0029"
>   73 "20150606-001827-252388362-5050-5982-0007"
>  141 "20150606-001827-252388362-5050-5982-0009"
>  154 "20150820-154817-302720010-5050-15320-"
>  289 "20150606-001827-252388362-5050-5982-0004"
>  510 "20150606-001827-252388362-5050-5982-0012"
>  666 "20150606-001827-252388362-5050-5982-0028"
>  923 "20150116-002612-269165578-5050-32204-0003"
> 1000 "20150606-001827-252388362-5050-5982-0001"
> 1000 "20150606-001827-252388362-5050-5982-0006"
> 1000 "20150606-001827-252388362-5050-5982-0010"
> 1000 "20150606-001827-252388362-5050-5982-0011"
> 1000 "20150606-001827-252388362-5050-5982-0027"
> mesos λ fgrep 1000 -r src/master
> src/master/constants.cpp:const size_t MAX_REMOVED_SLAVES = 10;
> src/master/constants.cpp:const uint32_t MAX_COMPLETED_TASKS_PER_FRAMEWORK = 
> 1000;
> {noformat}
> Active tasks are just 6% of state.json response:
> {noformat}
> mesos λ cat ~/temp/mesos-state.json | jq -c . | wc
>1   14796 4138942
> mesos λ cat ~/temp/mesos-state.json | jq .frameworks[].tasks | jq -c . | wc
>   16  37  252774
> {noformat}
> I see four options that can improve the situation:
> 1. Add query string param to exclude completed tasks from state.json and use 
> it in mesos-dns and similar tools. There is no need for mesos-dns to know 
> about completed tasks, it's just extra load on master and mesos-dns.
> 2. Make history size configurable.
> 3. Make JSON serialization faster. With 1000s of tasks, even without history, 
> it would take a lot of time to serialize tasks for mesos-dns. Doing it every 
> 60 seconds instead of every 5 seconds isn't really an option.
> 4. Create event bus for mesos master. Marathon has it and it'd be nice to 
> have it in Mesos. This way mesos-dns could avoid polling master state and 
> switch to listening for events.
> All can be done independently.
> Note to mesosphere folks: please start distributing debug symbols with your 
> distribution. I was asking for it for a while and it is really helpful: 
> https://github.com/mesosphere/marathon/issues/1497#issuecomment-104182501
> Perf report for leading master: 
> !http://i.imgur.com/iz7C3o0.png!
> I'm on 0.23.0.





[jira] [Updated] (MESOS-4590) Add test case for reservations with same role, different principals

2016-02-05 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-4590:
---
Shepherd: Michael Park

> Add test case for reservations with same role, different principals
> ---
>
> Key: MESOS-4590
> URL: https://issues.apache.org/jira/browse/MESOS-4590
> Project: Mesos
>  Issue Type: Task
>  Components: master, test
>Reporter: Neil Conway
>Assignee: Neil Conway
>  Labels: mesosphere, reservations, test
>
> We don't have a test case that covers $SUBJECT; we probably should.





[jira] [Commented] (MESOS-4603) GTEST crashes when starting/stopping many times in succession

2016-02-05 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15134673#comment-15134673
 ] 

Joseph Wu commented on MESOS-4603:
--

Possibly related to some races in {{process::finalize}}, which only gets called 
at the end of the libprocess tests currently.

> GTEST crashes when starting/stopping many times in succession
> -
>
> Key: MESOS-4603
> URL: https://issues.apache.org/jira/browse/MESOS-4603
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
> Environment: clang 3.4, ubuntu 14.04
>Reporter: Kevin Klues
>  Labels: tests
>
> After running:
> run-one-until-failure 3rdparty/libprocess/libprocess-tests
> At least one iteration of running the tests fails in under a minute with the 
> following stack trace.  The stack trace is different sometimes, but it always 
> seems to error out in ~ProcessManager().
> {noformat}
> *** Aborted at 1454643530 (unix time) try "date -d @1454643530" if you are 
> using GNU date ***
> PC: @ 0x7f7812f4d1a0 (unknown)
> *** SIGSEGV (@0x0) received by PID 168122 (TID 0x7f780298f700) from PID 0; 
> stack trace: ***
> @ 0x7f7814451340 (unknown)
> @ 0x7f7812f4d1a0 (unknown)
> @   0x5f06a0 process::Process<>::self()
> @   0x777220 
> _ZN7process8dispatchI7NothingNS_20AsyncExecutorProcessERKZZNS_4http8internal7requestERKNS3_7RequestEbENK3$_1clENS3_10ConnectionEEUlvE_PvSA_SD_EENS_6FutureIT_EEPKNS_7ProcessIT0_EEMSI_FSF_T1_T2_ET3_T4_
> @   0x77714c 
> _ZN7process13AsyncExecutor7executeIZZNS_4http8internal7requestERKNS2_7RequestEbENK3$_1clENS2_10ConnectionEEUlvE_EENS_6FutureI7NothingEERKT_PN5boost9enable_ifINSG_7is_voidINSt9result_ofIFSD_vEE4typeEEEvE4typeE
> @   0x77709e 
> _ZN7process5asyncIZZNS_4http8internal7requestERKNS1_7RequestEbENK3$_1clENS1_10ConnectionEEUlvE_EENS_6FutureI7NothingEERKT_PN5boost9enable_ifINSF_7is_voidINSt9result_ofIFSC_vEE4typeEEEvE4typeE
> @   0x777046 
> _ZZZN7process4http8internal7requestERKNS0_7RequestEbENK3$_1clENS0_10ConnectionEENKUlvE0_clEv
> @   0x777019 
> _ZZNK7process6FutureI7NothingE5onAnyIZZNS_4http8internal7requestERKNS4_7RequestEbENK3$_1clENS4_10ConnectionEEUlvE0_vEERKS2_OT_NS2_10LessPreferEENUlSD_E_clESD_
> @   0x776e02 
> _ZNSt17_Function_handlerIFvRKN7process6FutureI7NothingEEEZNKS3_5onAnyIZZNS0_4http8internal7requestERKNS8_7RequestEbENK3$_1clENS8_10ConnectionEEUlvE0_vEES5_OT_NS3_10LessPreferEEUlS5_E_E9_M_invokeERKSt9_Any_dataS5_
> @   0x43f888 std::function<>::operator()()
> @   0x4464ec 
> _ZN7process8internal3runISt8functionIFvRKNS_6FutureI7NothingJRS5_EEEvRKSt6vectorIT_SaISC_EEDpOT0_
> @   0x446305 process::Future<>::set()
> @   0x44f90a 
> _ZNKSt7_Mem_fnIMN7process6FutureI7NothingEEFbRKS2_EEclIJS5_EvEEbRS3_DpOT_
> @   0x44f7ae 
> _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureI7NothingEEFbRKS3_EES4_St12_PlaceholderILi16__callIbJS6_EJLm0ELm1T_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
> @   0x44f72d 
> _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureI7NothingEEFbRKS3_EES4_St12_PlaceholderILi1clIJS6_EbEET0_DpOT_
> @   0x44f6dd 
> _ZZNK7process6FutureI7NothingE7onReadyISt5_BindIFSt7_Mem_fnIMS2_FbRKS1_EES2_St12_PlaceholderILi1bEERKS2_OT_NS2_6PreferEENUlS7_E_clES7_
> @   0x44f492 
> _ZNSt17_Function_handlerIFvRK7NothingEZNK7process6FutureIS0_E7onReadyISt5_BindIFSt7_Mem_fnIMS6_FbS2_EES6_St12_PlaceholderILi1bEERKS6_OT_NS6_6PreferEEUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @   0x446d68 std::function<>::operator()()
> @   0x44644c 
> _ZN7process8internal3runISt8functionIFvRK7NothingEEJRS3_EEEvRKSt6vectorIT_SaISA_EEDpOT0_
> @   0x4462e7 process::Future<>::set()
> @   0x50d5c7 process::Promise<>::set()
> @   0x77c53b 
> process::http::internal::ConnectionProcess::disconnect()
> @   0x792710 process::http::internal::ConnectionProcess::_read()
> @   0x794356 
> _ZZN7process8dispatchINS_4http8internal17ConnectionProcessERKNS_6FutureISsEES5_EEvRKNS_3PIDIT_EEMS9_FvT0_ET1_ENKUlPNS_11ProcessBaseEE_clESI_
> @   0x793fa2 
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchINS0_4http8internal17ConnectionProcessERKNS0_6FutureISsEES9_EEvRKNS0_3PIDIT_EEMSD_FvT0_ET1_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @   0x810958 std::function<>::operator()()
> @   0x7fb854 process::ProcessBase::visit()
> @   0x8581ce process::DispatchEvent::visit()
> @   0x43d631 process::ProcessBase::serve()
> @   0x7f9604 process::ProcessManager::resume()
> @   0x8017a5 
> process::ProcessManager::init_threads()::$_1::operator()()
> @   

[jira] [Created] (MESOS-4609) Subprocess should be more intelligent about setting/inheriting libprocess environment variables

2016-02-05 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-4609:


 Summary: Subprocess should be more intelligent about 
setting/inheriting libprocess environment variables 
 Key: MESOS-4609
 URL: https://issues.apache.org/jira/browse/MESOS-4609
 Project: Mesos
  Issue Type: Bug
Affects Versions: 0.27.0
Reporter: Joseph Wu
Assignee: Joseph Wu
 Fix For: 0.27.1


The {{LogrotateContainerLogger}} starts libprocess-using subprocesses.  
Libprocess initialization will attempt to resolve the IP from the hostname.  If 
a DNS service is not available, this step will fail, which terminates the 
logger subprocess prematurely.

Since the logger subprocesses live on the agent, they should use the same 
{{LIBPROCESS_IP}} supplied to the agent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-4005) Support workdir runtime configuration from image

2016-02-05 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15113225#comment-15113225
 ] 

Gilbert Song edited comment on MESOS-4005 at 2/5/16 9:37 PM:
-

https://reviews.apache.org/r/43167/
https://reviews.apache.org/r/43168/
https://reviews.apache.org/r/43083/


was (Author: gilbert):
https://reviews.apache.org/r/42540/

> Support workdir runtime configuration from image 
> -
>
> Key: MESOS-4005
> URL: https://issues.apache.org/jira/browse/MESOS-4005
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Timothy Chen
>Assignee: Gilbert Song
>  Labels: mesosphere, unified-containerizer-mvp
>
> We need to support workdir runtime configuration returned from image such as 
> Dockerfile.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-4004) Support default entrypoint and command runtime config in Mesos containerizer

2016-02-05 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15113228#comment-15113228
 ] 

Gilbert Song edited comment on MESOS-4004 at 2/5/16 9:36 PM:
-

https://reviews.apache.org/r/43081/
https://reviews.apache.org/r/43082/



was (Author: gilbert):
https://reviews.apache.org/r/42539/

> Support default entrypoint and command runtime config in Mesos containerizer
> 
>
> Key: MESOS-4004
> URL: https://issues.apache.org/jira/browse/MESOS-4004
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Timothy Chen
>Assignee: Gilbert Song
>  Labels: mesosphere, unified-containerizer-mvp
>
> We need to use the entrypoint and command runtime configuration returned from 
> image to be used in Mesos containerizer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4582) state.json serving duplicate "active" fields

2016-02-05 Thread Marco Massenzio (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15135137#comment-15135137
 ] 

Marco Massenzio commented on MESOS-4582:


I'm almost sure that duplicate keys are not legal JSON - worth checking the 
standard, but I'd be in favor of keeping the checks and throwing back a 400 
(Bad Request).

If you want, I can look it up later this weekend and find out what the JSON 
standard says?

Thanks for fixing it!

> state.json serving duplicate "active" fields
> 
>
> Key: MESOS-4582
> URL: https://issues.apache.org/jira/browse/MESOS-4582
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.27.0
>Reporter: Michael Gummelt
>Assignee: Michael Park
>Priority: Blocker
> Attachments: error.json
>
>
> state.json is serving duplicate "active" fields in frameworks.  See the 
> framework "47df96c2-3f85-4bc5-b781-709b2c30c752-" In the attached file



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-4582) state.json serving duplicate "active" fields

2016-02-05 Thread Marco Massenzio (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15135137#comment-15135137
 ] 

Marco Massenzio edited comment on MESOS-4582 at 2/5/16 10:17 PM:
-

I'm almost sure that duplicate keys are not legal JSON - worth checking the 
standard, but I'd be in favor of keeping the checks and throwing back a 400 
(Bad Request).

Incidentally, as almost *all* JSON libraries in most languages (I know of Java, 
Python, C++, Scala) model JSON documents with the {{map}} structure, it is 
virtually impossible (or, at best, extremely difficult) to generate a JSON 
document with duplicate keys (even assuming that such a thing is syntactically 
correct).

If you want, I can look it up later this weekend and find out what the JSON 
standard says?

Thanks for fixing it!


was (Author: marco-mesos):
I'm almost sure that duplicate keys are not legal JSON - worth checking the 
standard, but I'd be in favor of keeping the checks and throwing back a 400 
(Bad Request).

Incidentally, as almost *all* JSON libraries in most languages (I know of Java, 
Python, C++, Scala) model JSON documents with the {{map}} structure, it is 
virtually impossible (or, at best, extremely difficult) to generate a JSON 
document with duplicate keys (even assuming that such a thing is syntactically 
correct).

If you want, I can look it up later this weekend and find out what the JSON 
standard says?

Thanks for fixing it!

> state.json serving duplicate "active" fields
> 
>
> Key: MESOS-4582
> URL: https://issues.apache.org/jira/browse/MESOS-4582
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.27.0
>Reporter: Michael Gummelt
>Assignee: Michael Park
>Priority: Blocker
> Attachments: error.json
>
>
> state.json is serving duplicate "active" fields in frameworks.  See the 
> framework "47df96c2-3f85-4bc5-b781-709b2c30c752-" In the attached file



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4066) Agent should not return partial state when a request is made to /state endpoint during recovery.

2016-02-05 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-4066:
--
Summary: Agent should not return partial state when a request is made to 
/state endpoint during recovery.  (was: Expose when agent is recovering in the 
agent's /state endpoint.)

> Agent should not return partial state when a request is made to /state 
> endpoint during recovery.
> 
>
> Key: MESOS-4066
> URL: https://issues.apache.org/jira/browse/MESOS-4066
> Project: Mesos
>  Issue Type: Task
>  Components: slave
>Reporter: Benjamin Mahler
>Assignee: Vinod Kone
>  Labels: mesosphere
>
> Currently when a user is hitting /state.json on the agent, it may return 
> partial state if the agent has failed over and is recovering. There is 
> currently no clear way to tell if this is the case when looking at a 
> response, so the user may incorrectly interpret the agent as being empty of 
> tasks.
> We could consider exposing the 'state' enum of the agent in the endpoint:
> {code}
>   enum State
>   {
> RECOVERING,   // Slave is doing recovery.
> DISCONNECTED, // Slave is not connected to the master.
> RUNNING,  // Slave has (re-)registered.
> TERMINATING,  // Slave is shutting down.
>   } state;
> {code}
> This may be a bit tricky to maintain as far as backwards-compatibility of the 
> endpoint, if we were to alter this enum.
> Exposing this would allow users to be more informed about the state of the 
> agent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4582) state.json serving duplicate "active" fields

2016-02-05 Thread Michael Park (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15135228#comment-15135228
 ] 

Michael Park commented on MESOS-4582:
-

[~marco-mesos] I've already looked it up, and the presence of duplicate keys is 
valid JSON. Many JSON libraries (Go, Python, C#, etc) simply use the last 
instance of the duplicate keys. Those same libraries make it hard (impossible?) 
to generate a JSON with duplicate keys. My proposal here is to take the same 
approach where we are tolerant of input with duplicate keys, but don't generate 
JSON with duplicate keys in our output.

> state.json serving duplicate "active" fields
> 
>
> Key: MESOS-4582
> URL: https://issues.apache.org/jira/browse/MESOS-4582
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.27.0
>Reporter: Michael Gummelt
>Assignee: Michael Park
>Priority: Blocker
> Attachments: error.json
>
>
> state.json is serving duplicate "active" fields in frameworks.  See the 
> framework "47df96c2-3f85-4bc5-b781-709b2c30c752-" In the attached file



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-4609) Subprocess should be more intelligent about setting/inheriting libprocess environment variables

2016-02-05 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15134994#comment-15134994
 ] 

Joseph Wu edited comment on MESOS-4609 at 2/5/16 11:26 PM:
---

|| Reviews || Summary || 
| https://reviews.apache.org/r/43260/
https://reviews.apache.org/r/43261/ | Some refactoring of 
{{process::initialize}} |
| https://reviews.apache.org/r/43271/ | Modifications to {{subprocess}} |
| https://reviews.apache.org/r/43272/ | Refactor of containerizer, fetcher, 
container logger |


was (Author: kaysoky):
|| Reviews || Summary || 
| https://reviews.apache.org/r/43260/
https://reviews.apache.org/r/43261/ | Some refactoring of 
{{process::initialize}} |
| TODO | |

> Subprocess should be more intelligent about setting/inheriting libprocess 
> environment variables 
> 
>
> Key: MESOS-4609
> URL: https://issues.apache.org/jira/browse/MESOS-4609
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.27.0
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> Mostly copied from [this 
> comment|https://issues.apache.org/jira/browse/MESOS-4598?focusedCommentId=15133497=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15133497]
> A subprocess inheriting the environment variables {{LIBPROCESS_*}} may run 
> into some accidental fatalities:
> | || Subprocess uses libprocess || Subprocess is something else ||
> || Subprocess sets/inherits the same {{PORT}} by accident | Bind failure -> 
> exit | Nothing happens (?) |
> || Subprocess sets a different {{PORT}} on purpose | Bind success (?) | 
> Nothing happens (?) |
> (?) means this is usually the case, but not 100%.
> A complete fix would look something like:
> * If the {{subprocess}} call gets {{environment = None()}}, we should 
> automatically remove {{LIBPROCESS_PORT}} from the inherited environment.  
> * The parts of 
> [{{executorEnvironment}}|https://github.com/apache/mesos/blame/master/src/slave/containerizer/containerizer.cpp#L265]
>  dealing with libprocess & libmesos should be refactored into libprocess as a 
> helper.  We would use this helper for the Containerizer, Fetcher, and 
> ContainerLogger module.
> * If the {{subprocess}} call is given {{LIBPROCESS_PORT == 
> os::getenv("LIBPROCESS_PORT")}}, we can LOG(WARN) and unset the env var 
> locally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4517) Introduce docker runtime isolator.

2016-02-05 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15135048#comment-15135048
 ] 

Gilbert Song commented on MESOS-4517:
-

https://reviews.apache.org/r/43021/
https://reviews.apache.org/r/43022/
https://reviews.apache.org/r/43036/

> Introduce docker runtime isolator.
> --
>
> Key: MESOS-4517
> URL: https://issues.apache.org/jira/browse/MESOS-4517
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>  Labels: mesosphere
>
> Currently the docker image default configuration is included in 
> `ProvisionInfo`. We should grab the necessary config from `ProvisionInfo` 
> into `ContainerInfo`, and handle all this runtime information inside the 
> docker runtime isolator. Return a `ContainerLaunchInfo` containing 
> `working_dir`, `env`, merged `commandInfo`, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4578) docker run -c is deprecated

2016-02-05 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-4578:

Target Version/s: 0.27.1

> docker run -c is deprecated
> ---
>
> Key: MESOS-4578
> URL: https://issues.apache.org/jira/browse/MESOS-4578
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization, docker
>Affects Versions: 0.26.0
> Environment: CoreOS 7
>Reporter: Cody Maloney
>  Labels: mesosphere, newbie
> Fix For: 0.27.1
>
>
> When running mesos slave with the docker containerizer enabled on CoreOS 
> 766.4.0, launching docker containers results in the following in stderr:
> {noformat}
> Warning: '-c' is deprecated, it will be replaced by '--cpu-shares' soon. See 
> usage.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4610) MasterContender/MasterDetector should be loadable as modules

2016-02-05 Thread Mark Cavage (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15135190#comment-15135190
 ] 

Mark Cavage commented on MESOS-4610:


Review posted here: https://reviews.apache.org/r/43269

> MasterContender/MasterDetector should be loadable as modules
> 
>
> Key: MESOS-4610
> URL: https://issues.apache.org/jira/browse/MESOS-4610
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Mark Cavage
>
> Currently mesos depends on Zookeeper for leader election and notification to 
> slaves, although there is a C++ hierarchy in the code to support alternatives 
> (e.g., unit tests use an in-memory implementation). From an operational 
> perspective, many organizations/users do not want to take a dependency on 
> Zookeeper, and use an alternative solution to implementing leader election. 
> Our organization in particular, very much wants this, and as a reference 
> there have been several requests from the community (see referenced tickets) 
> to replace with etcd/consul/etc.
> This ticket will serve as the work effort to modularize the 
> MasterContender/MasterDetector APIs such that integrators can build a 
> pluggable solution of their choice; this ticket will not fold in any 
> implementations such as etcd et al., but simply move this hierarchy to be 
> fully pluggable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-1356) Uncaught exceptions

2016-02-05 Thread Michael Browning (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Browning reassigned MESOS-1356:
---

Assignee: Michael Browning

> Uncaught exceptions
> ---
>
> Key: MESOS-1356
> URL: https://issues.apache.org/jira/browse/MESOS-1356
> Project: Mesos
>  Issue Type: Bug
>Reporter: Niklas Quarfot Nielsen
>Assignee: Michael Browning
>  Labels: coverity, newbie
>
> We usually do _not_ use exceptions in Mesos, but some libraries may and we 
> should handle them and perhaps convert them into Try<>/Error.
> 
> *** CID 1213893:  Uncaught exception  (UNCAUGHT_EXCEPT)
> /src/slave/containerizer/linux_launcher.cpp: 148 in 
> mesos::internal::slave::_childMain(const std::tr1::function<int()>&, int*)()
> 142   return (*func)();
> 143 }
> 144
> 145
> 146 // Helper that creates a new session then blocks on reading the pipe
> 147 // before calling the supplied function.
> >>> CID 1213893:  Uncaught exception  (UNCAUGHT_EXCEPT)
> >>> In function "_childMain" an exception of type 
> >>> "std::tr1::bad_function_call" is thrown and never caught.
> 148 static int _childMain(
> 149 const lambda::function<int()>& childFunction,
> 150 int pipes[2])
> 151 {
> 152   // In child.
> 153   os::close(pipes[1]);
> 
> *** CID 1213894:  Uncaught exception  (UNCAUGHT_EXCEPT)
> /src/slave/containerizer/linux_launcher.cpp: 137 in 
> mesos::internal::slave::childMain(void *)()
> 131
> 132   return Nothing();
> 133 }
> 134
> 135
> 136 // Helper for clone() which expects an int(void*).
> >>> CID 1213894:  Uncaught exception  (UNCAUGHT_EXCEPT)
> >>> In function "childMain" an exception of type 
> >>> "std::tr1::bad_function_call" is thrown and never caught.
> 137 static int childMain(void* child)
> 138 {
> 139   const lambda::function<int()>* func =
> 140 static_cast<const lambda::function<int()>*>(child);
> 141
> 142   return (*func)();
> 
> *** CID 1213895:  Uncaught exception  (UNCAUGHT_EXCEPT)
> /src/usage/main.cpp: 72 in main()
> 66<< endl
> 67<< "Supported options:" << endl
> 68<< flags.usage();
> 69 }
> 70
> 71
> >>> CID 1213895:  Uncaught exception  (UNCAUGHT_EXCEPT)
> >>> In function "main" an exception of type 
> >>> "google::protobuf::FatalException" is thrown and never caught.
> 72 int main(int argc, char** argv)
> 73 {
> 74   GOOGLE_PROTOBUF_VERIFY_VERSION;
> 75
> 76   Flags flags;
> 77
> /src/usage/main.cpp: 72 in main()
> 66<< endl
> 67<< "Supported options:" << endl
> 68<< flags.usage();
> 69 }
> 70
> 71
> >>> CID 1213895:  Uncaught exception  (UNCAUGHT_EXCEPT)
> >>> In function "main" an exception of type 
> >>> "google::protobuf::FatalException" is thrown and never caught.
> 72 int main(int argc, char** argv)
> 73 {
> 74   GOOGLE_PROTOBUF_VERIFY_VERSION;
> 75
> 76   Flags flags;
> 77
> /src/usage/main.cpp: 72 in main()
> 66<< endl
> 67<< "Supported options:" << endl
> 68<< flags.usage();
> 69 }
> 70
> 71
> >>> CID 1213895:  Uncaught exception  (UNCAUGHT_EXCEPT)
> >>> In function "main" an exception of type 
> >>> "google::protobuf::FatalException" is thrown and never caught.
> 72 int main(int argc, char** argv)
> 73 {
> 74   GOOGLE_PROTOBUF_VERIFY_VERSION;
> 75
> 76   Flags flags;
> 77
> 
> *** CID 1213896:  Uncaught exception  (UNCAUGHT_EXCEPT)
> /src/launcher/executor.cpp: 423 in main()
> 417 };
> 418
> 419 } // namespace internal {
> 420 } // namespace mesos {
> 421
> 422
> >>> CID 1213896:  Uncaught exception  (UNCAUGHT_EXCEPT)
> >>> In function "main" an exception of type "std::tr1::bad_function_call" 
> >>> is thrown and never caught.
> 423 int main(int argc, char** argv)
> 424 {
> 425   mesos::internal::CommandExecutor executor;
> 426   mesos::MesosExecutorDriver driver();
> 427   return driver.run() == mesos::DRIVER_STOPPED ? 0 : 1;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4612) Update to Zookeeper 3.4.7

2016-02-05 Thread Cody Maloney (JIRA)
Cody Maloney created MESOS-4612:
---

 Summary: Update to Zookeeper 3.4.7
 Key: MESOS-4612
 URL: https://issues.apache.org/jira/browse/MESOS-4612
 Project: Mesos
  Issue Type: Improvement
Reporter: Cody Maloney


See: http://zookeeper.apache.org/doc/r3.4.7/releasenotes.html for improvements 
/ bug fixes



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4613) Mesos when used with --log_dir generates hundreds of thousands of log files per day

2016-02-05 Thread Lukas Loesche (JIRA)
Lukas Loesche created MESOS-4613:


 Summary: Mesos when used with --log_dir generates hundreds of 
thousands of log files per day
 Key: MESOS-4613
 URL: https://issues.apache.org/jira/browse/MESOS-4613
 Project: Mesos
  Issue Type: Bug
Reporter: Lukas Loesche


We're using mesos with --log_dir=/var/log/mesos
Lately in addition to the mesos-master and mesos-slave log there's also been 
mesos-fetcher logs written into this directory.

It seems that every process generates a new log file with a unique file name 
containing the date and pid. For mesos-master and mesos-slave this makes sense. 
For mesos-fetcher not so much.

On a moderately busy agent it's currently generating 200k log files per day. On 
our particular system this would cause logrotate to segfault. And standard 
tools like 'rm mesos-fetcher*' won't work because there are too many files for 
the shell to expand on the command line.

I also noted that a lot of the created files are zero bytes. So for now we're 
running a cron every minute
{noformat}
find /var/log/mesos -size 0 -name 'mesos-fetcher*' -delete
{noformat}
as a workaround.

Anyway, it would be nice if there was an option to make the mesos-fetcher write 
into a single log file instead of creating thousands of individual files.

Or if that's easier to implement an option to only write the master and slave 
log but not the fetcher logs.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4613) Mesos when used with --log_dir generates hundreds of thousands of log files per day

2016-02-05 Thread Till Toenshoff (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Toenshoff updated MESOS-4613:
--
Affects Version/s: 0.25.0

> Mesos when used with --log_dir generates hundreds of thousands of log files 
> per day
> ---
>
> Key: MESOS-4613
> URL: https://issues.apache.org/jira/browse/MESOS-4613
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.25.0
>Reporter: Lukas Loesche
>
> We're using mesos with --log_dir=/var/log/mesos
> Lately in addition to the mesos-master and mesos-slave log there's also been 
> mesos-fetcher logs written into this directory.
> It seems that every process generates a new log file with a unique file name 
> containing the date and pid. For mesos-master and mesos-slave this makes 
> sense. For mesos-fetcher not so much.
> On a moderately busy agent it's currently generating 200k log files per day. 
> On our particular system this would cause logrotate to segfault. And standard 
> tools like 'rm mesos-fetcher*' won't work because there are too many files 
> for the shell to expand on the command line.
> I also noted that a lot of the created files are zero bytes. So for now we're 
> running a cron every minute
> {noformat}
> find /var/log/mesos -size 0 -name 'mesos-fetcher*' -delete
> {noformat}
> as a workaround.
> Anyway, it would be nice if there was an option to make the mesos-fetcher 
> write into a single log file instead of creating thousands of individual 
> files.
> Or if that's easier to implement an option to only write the master and slave 
> log but not the fetcher logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4479) Implement reservation labels

2016-02-05 Thread Michael Park (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15135255#comment-15135255
 ] 

Michael Park commented on MESOS-4479:
-

{noformat}
commit 0226620747e1769434a1a83da547bfc3470a9549
Author: Neil Conway 
Date:   Thu Feb 4 14:47:13 2016 -0800

Used `std::any_of` instead of `std::count_if` when validating IDs.

This makes the intent slightly clearer. In principle, it should save a
few cycles as well, but nothing significant. Also, clarify the name of
a helper function.

Review: https://reviews.apache.org/r/42750/
{noformat}

> Implement reservation labels
> 
>
> Key: MESOS-4479
> URL: https://issues.apache.org/jira/browse/MESOS-4479
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Neil Conway
>Assignee: Neil Conway
>  Labels: labels, mesosphere, reservations
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4611) Passing a lambda to dispatch() always matches the template returning void

2016-02-05 Thread Kevin Klues (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Klues updated MESOS-4611:
---
Description: 
The following idiom does not currently compile:

{code}
  Future<Nothing> initialized = dispatch(pid, [] () -> Nothing {
return Nothing();
  });
{code}

This seems non-intuitive because the following template exists for dispatch:

{code}
template <typename T>
Future<T> dispatch(const UPID& pid, const std::function<T()>& f)
{
  std::shared_ptr<Promise<T>> promise(new Promise<T>());

  std::shared_ptr<std::function<void(ProcessBase*)>> f_(
      new std::function<void(ProcessBase*)>(
          [=](ProcessBase*) {
            promise->set(f());
          }));

  internal::dispatch(pid, f_);
  
  return promise->future();
} 
{code}

However, lambdas cannot be implicitly cast to a corresponding 
std::function<T()> type.
To make this work, you have to explicitly type the lambda before passing it to 
dispatch.

{code}
  std::function<Nothing()> f = []() { return Nothing(); };
  Future<Nothing> initialized = dispatch(pid, f);
{code}

We should add template support to allow lambdas to be passed to dispatch() 
without explicit typing. 


  was:
The following idiom does not currently compile:

{code}
  Future<Nothing> initialized = dispatch(pid, [] () -> Nothing {
return Nothing();
  })
{code}

This seems non-intuitive because the following template exists for dispatch:

{code}
template <typename T>
Future<T> dispatch(const UPID& pid, const std::function<T()>& f)
{
  std::shared_ptr<Promise<T>> promise(new Promise<T>());

  std::shared_ptr<std::function<void(ProcessBase*)>> f_(
      new std::function<void(ProcessBase*)>(
          [=](ProcessBase*) {
            promise->set(f());
          }));

  internal::dispatch(pid, f_);
  
  return promise->future();
} 
{code}

To make this work, you have to explicitly type the lambda before passing it to 
dispatch.

{code}
  std::function<Nothing()> f = []() { return Nothing(); };
  Future<Nothing> initialized = dispatch(pid, f);
{code}

We should add template support to allow lambdas to be passed to dispatch() 
without explicit typing. 



> Passing a lambda to dispatch() always matches the template returning void
> -
>
> Key: MESOS-4611
> URL: https://issues.apache.org/jira/browse/MESOS-4611
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Kevin Klues
>  Labels: dispatch, libprocess, mesosphere
>
> The following idiom does not currently compile:
> {code}
>   Future<Nothing> initialized = dispatch(pid, [] () -> Nothing {
>     return Nothing();
>   });
> {code}
> This seems non-intuitive because the following template exists for dispatch:
> {code}
> template <typename T>
> Future<T> dispatch(const UPID& pid, const std::function<T()>& f)
> {
>   std::shared_ptr<Promise<T>> promise(new Promise<T>());
>
>   std::shared_ptr<std::function<void(ProcessBase*)>> f_(
>       new std::function<void(ProcessBase*)>(
>           [=](ProcessBase*) {
>             promise->set(f());
>           }));
>   internal::dispatch(pid, f_);
>   
>   return promise->future();
> } 
> {code}
> However, lambdas cannot be implicitly cast to a corresponding 
> std::function<T()> type.
> To make this work, you have to explicitly type the lambda before passing it 
> to dispatch.
> {code}
>   std::function<Nothing()> f = []() { return Nothing(); };
>   Future<Nothing> initialized = dispatch(pid, f);
> {code}
> We should add template support to allow lambdas to be passed to dispatch() 
> without explicit typing. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4611) Passing a lambda to dispatch() always matches the template returning void

2016-02-05 Thread Kevin Klues (JIRA)
Kevin Klues created MESOS-4611:
--

 Summary: Passing a lambda to dispatch() always matches the 
template returning void
 Key: MESOS-4611
 URL: https://issues.apache.org/jira/browse/MESOS-4611
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
Reporter: Kevin Klues


The following idiom does not currently compile:

{code}
  Future<Nothing> initialized = dispatch(pid, [] () -> Nothing {
return Nothing();
  })
{code}

This seems non-intuitive because the following template exists for dispatch:

{code}
template <typename T>
Future<T> dispatch(const UPID& pid, const std::function<T()>& f)
{
  std::shared_ptr<Promise<T>> promise(new Promise<T>());

  std::shared_ptr<std::function<void(ProcessBase*)>> f_(
      new std::function<void(ProcessBase*)>(
          [=](ProcessBase*) {
            promise->set(f());
          }));

  internal::dispatch(pid, f_);
  
  return promise->future();
} 
{code}

To make this work, you have to explicitly type the lambda before passing it to 
dispatch.

{code}
  std::function<Nothing()> f = []() { return Nothing(); };
  Future<Nothing> initialized = dispatch(pid, f);
{code}

We should add template support to allow lambdas to be passed to dispatch() 
without explicit typing. 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4601) Don't dump stack trace on failure to bind()

2016-02-05 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15135005#comment-15135005
 ] 

Joseph Wu commented on MESOS-4601:
--

Note, I effectively made this change in the refactor here: 
https://reviews.apache.org/r/43261/

> Don't dump stack trace on failure to bind()
> ---
>
> Key: MESOS-4601
> URL: https://issues.apache.org/jira/browse/MESOS-4601
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Neil Conway
>Assignee: Yong Tang
>  Labels: errorhandling, libprocess, mesosphere, newbie
>
> We should do {{EXIT(EXIT_FAILURE)}} rather than {{LOG(FATAL)}}, both for this 
> code path and a few other expected error conditions in libprocess network 
> initialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4609) Subprocess should be more intelligent about setting/inheriting libprocess environment variables

2016-02-05 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15134994#comment-15134994
 ] 

Joseph Wu commented on MESOS-4609:
--

|| Reviews || Summary || 
| https://reviews.apache.org/r/43260/
https://reviews.apache.org/r/43261/ | Some refactoring of 
{{process::initialize}} |
| TODO | |

> Subprocess should be more intelligent about setting/inheriting libprocess 
> environment variables 
> 
>
> Key: MESOS-4609
> URL: https://issues.apache.org/jira/browse/MESOS-4609
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.27.0
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> Mostly copied from [this 
> comment|https://issues.apache.org/jira/browse/MESOS-4598?focusedCommentId=15133497=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15133497]
> A subprocess inheriting the environment variables {{LIBPROCESS_*}} may run 
> into some accidental fatalities:
> | || Subprocess uses libprocess || Subprocess is something else ||
> || Subprocess sets/inherits the same {{PORT}} by accident | Bind failure -> 
> exit | Nothing happens (?) |
> || Subprocess sets a different {{PORT}} on purpose | Bind success (?) | 
> Nothing happens (?) |
> (?) means this is usually the case, but not 100%.
> A complete fix would look something like:
> * If the {{subprocess}} call gets {{environment = None()}}, we should 
> automatically remove {{LIBPROCESS_PORT}} from the inherited environment.  
> * The parts of 
> [{{executorEnvironment}}|https://github.com/apache/mesos/blame/master/src/slave/containerizer/containerizer.cpp#L265]
>  dealing with libprocess & libmesos should be refactored into libprocess as a 
> helper.  We would use this helper for the Containerizer, Fetcher, and 
> ContainerLogger module.
> * If the {{subprocess}} call is given {{LIBPROCESS_PORT == 
> os::getenv("LIBPROCESS_PORT")}}, we can LOG(WARN) and unset the env var 
> locally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4605) Upgrading mesos should not (re)enable mesos master or slave

2016-02-05 Thread JIRA
Grégoire Bellon-Gervais created MESOS-4605:
--

 Summary: Upgrading mesos should not (re)enable mesos master or 
slave
 Key: MESOS-4605
 URL: https://issues.apache.org/jira/browse/MESOS-4605
 Project: Mesos
  Issue Type: Bug
  Components: general
Affects Versions: 0.27.0
 Environment: debian8
Reporter: Grégoire Bellon-Gervais
Priority: Minor


Hello,
I'm running Debian 8 and I use the official repository to install mesos (and 
the deb files):
deb http://repos.mesosphere.io/debian jessie main
I have 3 mesos masters and 3 mesos slaves.
On masters, mesos slaves are not started (and must not be started), same for 
mesos slaves, mesos masters must not be started.
During each upgrade, I have to manually disable the "not installed" component 
afterwards.
Here is the log on a mesos slave, for example:

Setting up mesos (0.27.0-0.2.190.debian81) ...
Created symlink from 
/etc/systemd/system/multi-user.target.wants/mesos-master.service to 
/lib/systemd/system/mesos-master.service.
Processing triggers for libc-bin (2.19-18+deb8u2) ...
...
So, once upgrade is done, I have to issue the following command :
systemctl disable mesos-master.service

It should not be necessary I think.





[jira] [Created] (MESOS-4606) Add IPv6 support to net::IP and net::IPNetwork

2016-02-05 Thread Benno Evers (JIRA)
Benno Evers created MESOS-4606:
--

 Summary: Add IPv6 support to net::IP and net::IPNetwork
 Key: MESOS-4606
 URL: https://issues.apache.org/jira/browse/MESOS-4606
 Project: Mesos
  Issue Type: Improvement
  Components: stout
Reporter: Benno Evers
Assignee: Benno Evers
Priority: Minor


The classes net::IP and net::IPNetwork should be able to store IPv6 
addresses.
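For illustration only, Python's {{ipaddress}} module shows the shape of a
version-agnostic address/network API, which is roughly what a widened
{{net::IP}} and {{net::IPNetwork}} would need to offer (an analogy, not the
stout interface):

```python
import ipaddress

# The factory returns an IPv4Address or IPv6Address depending on the input,
# the kind of polymorphism a widened net::IP would need.
v4 = ipaddress.ip_address("192.0.2.1")
v6 = ipaddress.ip_address("2001:db8::1")
print(v4.version, v6.version)  # 4 6

# Networks (address + prefix length) work the same way, analogous to
# net::IPNetwork.
net = ipaddress.ip_network("2001:db8::/32")
print(v6 in net)      # True
print(net.prefixlen)  # 32
```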





[jira] [Issue Comment Deleted] (MESOS-3307) Configurable size of completed task / framework history

2016-02-05 Thread Tymofii (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tymofii updated MESOS-3307:
---
Comment: was deleted

(was: Yes, it generates JSON much faster now, but we still having lots and lots 
completed tasks and frameworks there, which we don't care about for service 
discovery, but want to keep them for history.
Wouldn't it be great to have some basic filtering for /state endpoint to get 
only active tasks/frameworks, only tasks or particular framework, only slaves 
information etc.?
/state-summary endpoint introduced recently doesn't fit service discovery 
requirements.)

> Configurable size of completed task / framework history
> ---
>
> Key: MESOS-3307
> URL: https://issues.apache.org/jira/browse/MESOS-3307
> Project: Mesos
>  Issue Type: Bug
>Reporter: Ian Babrou
>Assignee: Kevin Klues
>  Labels: mesosphere
> Fix For: 0.27.0
>
>
> We try to make Mesos work with multiple frameworks and mesos-dns at the same 
> time. The goal is to have set of frameworks per team / project on a single 
> Mesos cluster.
> At this point our mesos state.json is at 4mb and it takes a while to 
> assemble. 5 mesos-dns instances hit state.json every 5 seconds, effectively 
> pushing mesos-master CPU usage through the roof. It's at 100%+ all the time.
> Here's the problem:
> {noformat}
> mesos λ curl -s http://mesos-master:5050/master/state.json | jq 
> .frameworks[].completed_tasks[].framework_id | sort | uniq -c | sort -n
>1 "20150606-001827-252388362-5050-5982-0003"
>   16 "20150606-001827-252388362-5050-5982-0005"
>   18 "20150606-001827-252388362-5050-5982-0029"
>   73 "20150606-001827-252388362-5050-5982-0007"
>  141 "20150606-001827-252388362-5050-5982-0009"
>  154 "20150820-154817-302720010-5050-15320-"
>  289 "20150606-001827-252388362-5050-5982-0004"
>  510 "20150606-001827-252388362-5050-5982-0012"
>  666 "20150606-001827-252388362-5050-5982-0028"
>  923 "20150116-002612-269165578-5050-32204-0003"
> 1000 "20150606-001827-252388362-5050-5982-0001"
> 1000 "20150606-001827-252388362-5050-5982-0006"
> 1000 "20150606-001827-252388362-5050-5982-0010"
> 1000 "20150606-001827-252388362-5050-5982-0011"
> 1000 "20150606-001827-252388362-5050-5982-0027"
> mesos λ fgrep 1000 -r src/master
> src/master/constants.cpp:const size_t MAX_REMOVED_SLAVES = 10;
> src/master/constants.cpp:const uint32_t MAX_COMPLETED_TASKS_PER_FRAMEWORK = 
> 1000;
> {noformat}
> Active tasks are just 6% of state.json response:
> {noformat}
> mesos λ cat ~/temp/mesos-state.json | jq -c . | wc
>1   14796 4138942
> mesos λ cat ~/temp/mesos-state.json | jq .frameworks[].tasks | jq -c . | wc
>   16  37  252774
> {noformat}
> I see four options that can improve the situation:
> 1. Add query string param to exclude completed tasks from state.json and use 
> it in mesos-dns and similar tools. There is no need for mesos-dns to know 
> about completed tasks, it's just extra load on master and mesos-dns.
> 2. Make history size configurable.
> 3. Make JSON serialization faster. With 1s of tasks even without history 
> it would take a lot of time to serialize tasks for mesos-dns. Doing it every 
> 60 seconds instead of every 5 seconds isn't really an option.
> 4. Create event bus for mesos master. Marathon has it and it'd be nice to 
> have it in Mesos. This way mesos-dns could avoid polling master state and 
> switch to listening for events.
> All can be done independently.
> Note to mesosphere folks: please start distributing debug symbols with your 
> distribution. I was asking for it for a while and it is really helpful: 
> https://github.com/mesosphere/marathon/issues/1497#issuecomment-104182501
> Perf report for leading master: 
> !http://i.imgur.com/iz7C3o0.png!
> I'm on 0.23.0.





[jira] [Commented] (MESOS-3307) Configurable size of completed task / framework history

2016-02-05 Thread Tymofii (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15133914#comment-15133914
 ] 

Tymofii commented on MESOS-3307:


Yes, it generates JSON much faster now, but we still have lots and lots of 
completed tasks and frameworks there, which we don't care about for service 
discovery but want to keep for history.
Wouldn't it be great to have some basic filtering for the /state endpoint, to 
get only active tasks/frameworks, only the tasks of a particular framework, 
only slave information, etc.?
The /state-summary endpoint introduced recently doesn't fit the service 
discovery requirements.
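Until the master offers server-side filtering, a discovery tool can at least
strip the history client-side after fetching /state. A minimal sketch over a
parsed response (the field names come from the state.json discussed above; the
helper name is made up):

```python
def active_view(state):
    # Return a copy of a parsed /state response without the completed_*
    # history that service discovery does not need.
    view = dict(state)
    view.pop("completed_frameworks", None)
    view["frameworks"] = [
        {k: v for k, v in fw.items() if k != "completed_tasks"}
        for fw in state.get("frameworks", [])
    ]
    return view

state = {
    "frameworks": [
        {"id": "fw-1",
         "tasks": [{"id": "t1"}],
         "completed_tasks": [{"id": "old"}] * 1000},
    ],
    "completed_frameworks": [{"id": "fw-0"}],
}
slim = active_view(state)
print(len(slim["frameworks"][0].get("completed_tasks", [])))  # 0
print("completed_frameworks" in slim)  # False
```

This only saves work on the client side, of course; the serialization cost on
the master remains, which is why a server-side query parameter is still worth
pursuing.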

> Configurable size of completed task / framework history
> ---
>
> Key: MESOS-3307
> URL: https://issues.apache.org/jira/browse/MESOS-3307
> Project: Mesos
>  Issue Type: Bug
>Reporter: Ian Babrou
>Assignee: Kevin Klues
>  Labels: mesosphere
> Fix For: 0.27.0
>
>
> We try to make Mesos work with multiple frameworks and mesos-dns at the same 
> time. The goal is to have set of frameworks per team / project on a single 
> Mesos cluster.
> At this point our mesos state.json is at 4mb and it takes a while to 
> assemble. 5 mesos-dns instances hit state.json every 5 seconds, effectively 
> pushing mesos-master CPU usage through the roof. It's at 100%+ all the time.
> Here's the problem:
> {noformat}
> mesos λ curl -s http://mesos-master:5050/master/state.json | jq 
> .frameworks[].completed_tasks[].framework_id | sort | uniq -c | sort -n
>1 "20150606-001827-252388362-5050-5982-0003"
>   16 "20150606-001827-252388362-5050-5982-0005"
>   18 "20150606-001827-252388362-5050-5982-0029"
>   73 "20150606-001827-252388362-5050-5982-0007"
>  141 "20150606-001827-252388362-5050-5982-0009"
>  154 "20150820-154817-302720010-5050-15320-"
>  289 "20150606-001827-252388362-5050-5982-0004"
>  510 "20150606-001827-252388362-5050-5982-0012"
>  666 "20150606-001827-252388362-5050-5982-0028"
>  923 "20150116-002612-269165578-5050-32204-0003"
> 1000 "20150606-001827-252388362-5050-5982-0001"
> 1000 "20150606-001827-252388362-5050-5982-0006"
> 1000 "20150606-001827-252388362-5050-5982-0010"
> 1000 "20150606-001827-252388362-5050-5982-0011"
> 1000 "20150606-001827-252388362-5050-5982-0027"
> mesos λ fgrep 1000 -r src/master
> src/master/constants.cpp:const size_t MAX_REMOVED_SLAVES = 10;
> src/master/constants.cpp:const uint32_t MAX_COMPLETED_TASKS_PER_FRAMEWORK = 
> 1000;
> {noformat}
> Active tasks are just 6% of state.json response:
> {noformat}
> mesos λ cat ~/temp/mesos-state.json | jq -c . | wc
>1   14796 4138942
> mesos λ cat ~/temp/mesos-state.json | jq .frameworks[].tasks | jq -c . | wc
>   16  37  252774
> {noformat}
> I see four options that can improve the situation:
> 1. Add query string param to exclude completed tasks from state.json and use 
> it in mesos-dns and similar tools. There is no need for mesos-dns to know 
> about completed tasks, it's just extra load on master and mesos-dns.
> 2. Make history size configurable.
> 3. Make JSON serialization faster. With 1s of tasks even without history 
> it would take a lot of time to serialize tasks for mesos-dns. Doing it every 
> 60 seconds instead of every 5 seconds isn't really an option.
> 4. Create event bus for mesos master. Marathon has it and it'd be nice to 
> have it in Mesos. This way mesos-dns could avoid polling master state and 
> switch to listening for events.
> All can be done independently.
> Note to mesosphere folks: please start distributing debug symbols with your 
> distribution. I was asking for it for a while and it is really helpful: 
> https://github.com/mesosphere/marathon/issues/1497#issuecomment-104182501
> Perf report for leading master: 
> !http://i.imgur.com/iz7C3o0.png!
> I'm on 0.23.0.





[jira] [Created] (MESOS-4604) ROOT_DOCKER_DockerHealthyTask is flaky.

2016-02-05 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-4604:
---

 Summary: ROOT_DOCKER_DockerHealthyTask is flaky.
 Key: MESOS-4604
 URL: https://issues.apache.org/jira/browse/MESOS-4604
 Project: Mesos
  Issue Type: Bug
  Components: tests
 Environment: CentOS 6/7, Ubuntu 15.04 on AWS.
Reporter: Jan Schlicht


Log from Teamcity that is running {{sudo ./bin/mesos-tests.sh}} on AWS EC2 
instances:
{noformat}
[18:27:14][Step 8/8] [--] 8 tests from HealthCheckTest
[18:27:14][Step 8/8] [ RUN  ] HealthCheckTest.HealthyTask
[18:27:17][Step 8/8] [   OK ] HealthCheckTest.HealthyTask ( ms)
[18:27:17][Step 8/8] [ RUN  ] HealthCheckTest.ROOT_DOCKER_DockerHealthyTask
[18:27:36][Step 8/8] ../../src/tests/health_check_tests.cpp:388: Failure
[18:27:36][Step 8/8] Failed to wait 15secs for termination
[18:27:36][Step 8/8] F0204 18:27:35.981302 23085 logging.cpp:64] RAW: Pure 
virtual method called
[18:27:36][Step 8/8] @ 0x7f7077055e1c  google::LogMessage::Fail()
[18:27:36][Step 8/8] @ 0x7f707705ba6f  google::RawLog__()
[18:27:36][Step 8/8] @ 0x7f70760f76c9  __cxa_pure_virtual
[18:27:36][Step 8/8] @   0xa9423c  
mesos::internal::tests::Cluster::Slaves::shutdown()
[18:27:36][Step 8/8] @  0x1074e45  
mesos::internal::tests::MesosTest::ShutdownSlaves()
[18:27:36][Step 8/8] @  0x1074de4  
mesos::internal::tests::MesosTest::Shutdown()
[18:27:36][Step 8/8] @  0x1070ec7  
mesos::internal::tests::MesosTest::TearDown()
[18:27:36][Step 8/8] @  0x16eb7b2  
testing::internal::HandleSehExceptionsInMethodIfSupported<>()
[18:27:36][Step 8/8] @  0x16e61a9  
testing::internal::HandleExceptionsInMethodIfSupported<>()
[18:27:36][Step 8/8] @  0x16c56aa  testing::Test::Run()
[18:27:36][Step 8/8] @  0x16c5e89  testing::TestInfo::Run()
[18:27:36][Step 8/8] @  0x16c650a  testing::TestCase::Run()
[18:27:36][Step 8/8] @  0x16cd1f6  
testing::internal::UnitTestImpl::RunAllTests()
[18:27:36][Step 8/8] @  0x16ec513  
testing::internal::HandleSehExceptionsInMethodIfSupported<>()
[18:27:36][Step 8/8] @  0x16e6df1  
testing::internal::HandleExceptionsInMethodIfSupported<>()
[18:27:36][Step 8/8] @  0x16cbe26  testing::UnitTest::Run()
[18:27:36][Step 8/8] @   0xe54c84  RUN_ALL_TESTS()
[18:27:36][Step 8/8] @   0xe54867  main
[18:27:36][Step 8/8] @ 0x7f7071560a40  (unknown)
[18:27:36][Step 8/8] @   0x9b52d9  _start
[18:27:36][Step 8/8] Aborted (core dumped)
[18:27:36][Step 8/8] Process exited with code 134
{noformat}
Happens with Ubuntu 15.04, CentOS 6, CentOS 7 _quite_ often. 





[jira] [Commented] (MESOS-4595) Add support for newest pre-defined Perf events to PerfEventIsolator

2016-02-05 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15134027#comment-15134027
 ] 

Niklas Quarfot Nielsen commented on MESOS-4595:
---

The structure of `PerfStatistics` is nice but, as you mention, it doesn't scale 
well with the massive number of available counters.
I like the idea of a Labels field with an encoding like you mention: 
"/hw_counters/XYZ", "/kernel_pmu/ZYX", etc.
Populating that field should probably be guarded by a flag to the perf 
isolator, so that the resource statistics don't explode in size for folks who 
don't need all the information. 
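A sketch of that encoding (hypothetical: the group prefixes and the separator
normalization below are illustrative, not a settled Mesos convention):

```python
def perf_event_label(group, event):
    # Encode an arbitrary perf event name as a label key. Characters that are
    # illegal in protobuf field names ("/" and ":") are fine inside a
    # string-valued label key, so we only normalize them for uniformity.
    name = event.strip("/").replace(":", "/")
    return "/%s/%s" % (group, name)

# A Labels-like key-value map fed dynamically by the isolator.
stats = {
    perf_event_label("kernel_pmu", "intel_cqm/llc_occupancy/"): 1234567,
    perf_event_label("tracepoint", "migrate:mm_migrate_pages"): 42,
}

for key in sorted(stats):
    print(key, stats[key])
# /kernel_pmu/intel_cqm/llc_occupancy 1234567
# /tracepoint/migrate/mm_migrate_pages 42
```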

> Add support for newest pre-defined Perf events to PerfEventIsolator
> ---
>
> Key: MESOS-4595
> URL: https://issues.apache.org/jira/browse/MESOS-4595
> Project: Mesos
>  Issue Type: Task
>  Components: isolation
>Reporter: Bartek Plotka
>Assignee: Bartek Plotka
>
> Currently, Perf Event Isolator is able to monitor all (specified in 
> {{--perf_events=...}}) Perf Events, but it can map only part of them in 
> {{ResourceUsage.proto}} (to be more exact in [PerfStatistics.proto | 
> https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L862])
> Since the last time {{PerfStatistics.proto}} was updated, the list of 
> supported events has expanded considerably and is growing constantly. I have 
> created a comparison table:
> || Events type || Num of matched events in PerfStatistics vs perf 4.3.3 || 
> perf 4.3.3 events ||
> | HW events  | 8  | 8  |
> | SW events | 9 | 10 |
> | HW cache event | 20 | 20 |
> | *Kernel PMU events* | *0* | *37* |
> | Tracepoint events | 0 | billion (: |
> For advanced analysis (e.g. during Oversubscription in QoS Controller) having 
> support for additional events is crucial. For instance in 
> [Serenity|https://github.com/mesosphere/serenity] we based some of our 
> revocation algorithms on the new [CMT| 
> https://01.org/packet-processing/cache-monitoring-technology-memory-bandwidth-monitoring-cache-allocation-technology-code-and-data]
>  feature which gives additional, useful event called {{llc_occupancy}}.
> I think we all agree that it would be great to support more (or even all) 
> perf events in {{Mesos PerfEventIsolator}} (:
> 
> Let's start a discussion over the approach. Within this task we have three 
> issues:
> # What events do we want to support in Mesos?
> ## all?
> ## only add Kernel PMU Events?
> ---
> I don't have a strong opinion on that, since I have never used {{Tracepoint 
> events}}. We currently need PMU events.
> # How to add new (or modify existing) events in {{mesos.proto}}?
> We can distinguish here 3 approaches:
> *# Add new events statically in {{PerfStatistics.proto}} as a separate 
> optional fields. (like it is currently)
> *# Instead of optional fields in {{PerfStatistics.proto}} message we could 
> have a {{key-value}} map (something like {{labels}} in other messages) and 
> feed it dynamically in {{PerfEventIsolator}}
> *# We could mix above approaches and just add mentioned map to existing 
> {{PerfStatistics.proto}} for additional events (:
> ---
> IMO: Approach 1 is somewhat explicit - users can view what events to expect 
> (although they are parsed in a different manner, e.g. {{"-"}} to {{"_"}}), but 
> we would end up with a looong message and a lot of copy-paste work. And we 
> have to maintain that!
> Approaches 2 & 3 are more elastic, and we don't have the problem mentioned in 
> the issue below (: And we *always* support *all* perf events in all kernel 
> versions (:
> IMO approaches 2 & 3 are the best.
> # How to support different naming format? For instance 
> {{intel_cqm/llc_occupancy/}} with {{"/"}} in name or  
> {{migrate:mm_migrate_pages}} with {{":"}}. I don't think it is possible to 
> have these as the field names in {{.proto}} syntax





[jira] [Updated] (MESOS-4595) Add support for newest pre-defined Perf events to PerfEventIsolator

2016-02-05 Thread Bartek Plotka (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bartek Plotka updated MESOS-4595:
-
Description: 
Currently, Perf Event Isolator is able to monitor all (specified in 
{{--perf_events=...}}) Perf Events, but it can map only part of them in 
{{ResourceUsage.proto}} (to be more exact in [PerfStatistics.proto | 
https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L862])

Since the last time {{PerfStatistics.proto}} was updated, the list of supported 
events has expanded considerably and is growing constantly. I have created a 
comparison table:

|| Events type || Num of matched events in PerfStatistics vs perf 4.3.3 || perf 
4.3.3 events ||
| HW events  | 8  | 8  |
| SW events | 9 | 10 |
| HW cache event | 20 | 20 |
| *Kernel PMU events* | *0* | *37* |
| Tracepoint events | 0 | billion (: |

For advanced analysis (e.g. during Oversubscription in QoS Controller) having 
support for additional events is crucial. For instance in 
[Serenity|https://github.com/mesosphere/serenity] we based some of our 
revocation algorithms on the new [CMT| 
https://01.org/packet-processing/cache-monitoring-technology-memory-bandwidth-monitoring-cache-allocation-technology-code-and-data]
 feature which gives additional, useful event called {{llc_occupancy}}.

I think we all agree that it would be great to support more (or even all) perf 
events in {{Mesos PerfEventIsolator}} (:

Let's start a discussion over the approach. Within this task we have three 
issues:
# What events do we want to support in Mesos?
## all?
## only add Kernel PMU Events?
---
I don't have a strong opinion on that, since I have never used {{Tracepoint 
events}}. We currently need PMU events.
# How to add new (or modify existing) events in {{mesos.proto}}?
We can distinguish here 3 approaches:
*# Add new events statically in {{PerfStatistics.proto}} as separate optional 
fields. (like it is currently)
*# Instead of optional fields in {{PerfStatistics.proto}} message we could have 
a {{key-value}} map (something like {{labels}} in other messages) and feed it 
dynamically in {{PerfEventIsolator}}
*# We could mix above approaches and just add mentioned map to existing 
{{PerfStatistics.proto}} for additional events (:
---
IMO: Approach 1 is somewhat explicit - users can view what events to expect 
(although they are parsed in a different manner, e.g. {{"-"}} to {{"_"}}), but we 
would end up with a looong message and a lot of copy-paste work. And we have to 
maintain that!
Approaches 2 & 3 are more elastic, and we don't have the problem mentioned in 
the issue below (: And we *always* support *all* perf events in all kernel 
versions (:
IMO approaches 2 & 3 are the best.
# How to support different naming format? For instance 
{{intel_cqm/llc_occupancy/}} with {{"/"}} in name or  
{{migrate:mm_migrate_pages}} with {{":"}}. I don't think it is possible to have 
these as the field names in {{.proto}} syntax


  was:
Currently, Perf Event Isolator is able to monitor all (specified in 
{{--perf_events=...}}) Perf Events, but it can map only part of them in 
{{ResourceUsage.proto}} (to be more exact in [PerfStatistics.proto | 
https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L862])

Since the last time {{PerfStatistics.proto}} was updated, list of supported 
events expanded much and is growing constantly. I have created some comparison 
table:

|| Events type || Num of matched events in PerfStatistics vs perf 4.3.3 || perf 
4.3.3 events ||
| HW events  | 8  | 8  |
| SW events | 9 | 10 |
| HW cache event | 20 | 20 |
| *Kernel PMU events* | *0* | *37* |
| Tracepoint events | 0 | billion (: |

For advance analysis (e.g during Oversubscription in QoS Controller) having 
support for additional events is crucial. For instance in 
[Serenity|https://github.com/mesosphere/serenity] we based some of our 
revocation algorithms on the new [CMT| 
https://01.org/packet-processing/cache-monitoring-technology-memory-bandwidth-monitoring-cache-allocation-technology-code-and-data]
 feature which gives additional, useful event called {{llc_occupancy}}.

I think we all agree that it would be great to support more (or even all) perf 
events in {{Mesos PerfEventIsolator}} (:

Let's start a discussion over the approach. Within this task we have three 
issues:
# What events do we want to support in Mesos?
## all?
## only add Kernel PMU Events?
---
I don't have a strong opinion on that, since i have never used {{Tracepoint 
events}}. We currently need PMU events.
# How to add new (or modify existing) events in {{mesos.proto}}?
We can distinguish here 3 approaches:
*# Add new events statically in {{PerfStatistics.proto}} as separate optional 
fields. (like it is currently)
*# Instead of optional fields in {{PerfStatistics.proto}} message we could have 
a {{key-value}} map (something like {{labels}} in other messages) and feed it 
dynamically in {{PerfEventIsolator}}

[jira] [Updated] (MESOS-4595) Add support for newest pre-defined Perf events to PerfEventIsolator

2016-02-05 Thread Bartek Plotka (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bartek Plotka updated MESOS-4595:
-
Description: 
Currently, Perf Event Isolator is able to monitor all (specified in 
{{--perf_events=...}}) Perf Events, but it can map only part of them in 
{{ResourceUsage.proto}} (to be more exact in [PerfStatistics.proto | 
https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L862])

Since the last time {{PerfStatistics.proto}} was updated, the list of supported 
events has expanded considerably and is growing constantly. I have created a 
comparison table:

|| Events type || Num of matched events in PerfStatistics vs perf 4.3.3 || perf 
4.3.3 events ||
| HW events  | 8  | 8  |
| SW events | 9 | 10 |
| HW cache event | 20 | 20 |
| *Kernel PMU events* | *0* | *37* |
| Tracepoint events | 0 | billion (: |

For advanced analysis (e.g. during Oversubscription in QoS Controller) having 
support for additional events is crucial. For instance in 
[Serenity|https://github.com/mesosphere/serenity] we based some of our 
revocation algorithms on the new [CMT| 
https://01.org/packet-processing/cache-monitoring-technology-memory-bandwidth-monitoring-cache-allocation-technology-code-and-data]
 feature which gives additional, useful event called {{llc_occupancy}}.

I think we all agree that it would be great to support more (or even all) perf 
events in {{Mesos PerfEventIsolator}} (:

Let's start a discussion over the approach. Within this task we have three 
issues:
# What events do we want to support in Mesos?
## all?
## only add Kernel PMU Events?
---
I don't have a strong opinion on that, since I have never used {{Tracepoint 
events}}. We currently need PMU events.
# How to add new (or modify existing) events in {{mesos.proto}}?
We can distinguish here 3 approaches:
*# Add new events statically in {{PerfStatistics.proto}} as separate optional 
fields. (like it is currently)
*# Instead of optional fields in {{PerfStatistics.proto}} message we could have 
a {{key-value}} map (something like {{labels}} in other messages) and feed it 
dynamically in {{PerfEventIsolator}}
*# We could mix above approaches and just add mentioned map to existing 
{{PerfStatistics.proto}} for additional events (:
---
IMO: Approach 1 is somewhat explicit - users can view what events to expect 
(although they are parsed in a different manner, e.g. {{"-"}} to {{"_"}}), but we 
would end up with a looong message and a lot of copy-paste work. And we have to 
maintain that!
Approaches 2 & 3 are more elastic, and we don't have the problem mentioned in 
the issue below (: And we *always* support *all* perf events in all kernel 
versions (:
IMO approaches 2 & 3 are the best.
# How to support different naming format? For instance 
{{intel_cqm/llc_occupancy/}} with {{"/"}} in name or  
{{migrate:mm_migrate_pages}} with {{":"}}. I don't think it is possible to have 
these as the field names in {{.proto}} syntax


  was:
Currently, Perf Event Isolator is able to monitor all (specified in 
{{--perf_events=...}}) Perf Events, but it can map only part of them in 
{{ResourceUsage.proto}} (to be more exact in [PerfStatistics.proto | 
https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L862])

Since the last time {{PerfStatistics.proto}} was updated, list of supported 
events expanded much and is growing constantly. I have created some comparison 
table:

|| Events type || Num of matched events in PerfStatistics vs perf 4.3.3 || perf 
4.3.3 events ||
| HW events  | 8  | 8  |
| SW events | 9 | 10 |
| HW cache event | 20 | 20 |
| *Kernel PMU events* | *0* | *37* |
| Tracepoint events | 0 | billion (: |

For advance analysis (e.g during Oversubscription in QoS Controller) having 
support for additional events is crucial. For instance in 
[Serenity|https://github.com/mesosphere/serenity] we based some of our 
revocation algorithms on the new [CMT| 
https://01.org/packet-processing/cache-monitoring-technology-memory-bandwidth-monitoring-cache-allocation-technology-code-and-data]
 feature which gives additional, useful event called {{llc_occupancy}}.

I think we all agree that it would be great to support more (or even all) perf 
events in {{Mesos PerfEventIsolator}} (:

Let's start a discussion over the approach. Within this task we have three 
issues:
# What events do we want to support in Mesos?
## all?
## only add Kernel PMU Events?
---
I don't have a strong opinion on that, since i have never used {{Tracepoint 
events}}. We currently need PMU events.
# How to add new (or modify existing) events in {{mesos.proto}}?
We can distinguish here 3 approaches:
*# Add new events statically in {{PerfStatistics.proto}} as a separate optional 
fields. (like it is currently)
*# Instead of optional fields in {{PerfStatistics.proto}} message we could have 
a {{key-value}} map (something like {{labels}} in other messages) and feed it 
dynamically in {{PerfEventIsolator}}

[jira] [Assigned] (MESOS-4601) Don't dump stack trace on failure to bind()

2016-02-05 Thread Yong Tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yong Tang reassigned MESOS-4601:


Assignee: Yong Tang

> Don't dump stack trace on failure to bind()
> ---
>
> Key: MESOS-4601
> URL: https://issues.apache.org/jira/browse/MESOS-4601
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Neil Conway
>Assignee: Yong Tang
>  Labels: errorhandling, libprocess, mesosphere, newbie
>
> We should do {{EXIT(EXIT_FAILURE)}} rather than {{LOG(FATAL)}}, both for this 
> code path and a few other expected error conditions in libprocess network 
> initialization.
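The distinction, illustrated in Python rather than libprocess C++ (an analogy,
not the actual fix): on an expected error such as a port already being in use,
emit one diagnostic line and exit nonzero (the analogue of
{{EXIT(EXIT_FAILURE)}}) instead of aborting with a stack trace (the analogue of
{{LOG(FATAL)}}).

```python
import socket
import sys

def bind_or_exit(port):
    # Bind a socket; on an *expected* failure (port in use), emit a single
    # diagnostic line and exit nonzero, instead of letting an unhandled
    # exception dump a stack trace.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
    except OSError as e:
        sys.exit("Failed to bind on port %d: %s" % (port, e.strerror))
    return sock

first = bind_or_exit(0)            # port 0: the OS picks a free port
port = first.getsockname()[1]
captured = None
try:
    bind_or_exit(port)             # same port again: expected failure
except SystemExit as e:            # caught here only to demonstrate
    captured = str(e.code)
print("clean exit:", captured)
```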





[jira] [Commented] (MESOS-4601) Don't dump stack trace on failure to bind()

2016-02-05 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15134107#comment-15134107
 ] 

Yong Tang commented on MESOS-4601:
--

Will take a look at this issue as I have some free time in the next couple of 
weeks.

> Don't dump stack trace on failure to bind()
> ---
>
> Key: MESOS-4601
> URL: https://issues.apache.org/jira/browse/MESOS-4601
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Neil Conway
>Assignee: Yong Tang
>  Labels: errorhandling, libprocess, mesosphere, newbie
>
> We should do {{EXIT(EXIT_FAILURE)}} rather than {{LOG(FATAL)}}, both for this 
> code path and a few other expected error conditions in libprocess network 
> initialization.





[jira] [Commented] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec

2016-02-05 Thread Craig W (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15134203#comment-15134203
 ] 

Craig W commented on MESOS-2162:


It would be nice to have the option to run with rkt, especially since it's hit 
1.0 (https://coreos.com/blog/rkt-hits-1.0.html).

> Consider a C++ implementation of CoreOS AppContainer spec
> -
>
> Key: MESOS-2162
> URL: https://issues.apache.org/jira/browse/MESOS-2162
> Project: Mesos
>  Issue Type: Story
>  Components: containerization
>Reporter: Dominic Hamon
>  Labels: gsoc2015, mesosphere, twitter
>
> CoreOS have released a 
> [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md]
>  for a container abstraction as an alternative to Docker. They have also 
> released a reference implementation, [rocket|https://coreos.com/blog/rocket/].
> We should consider a C++ implementation of the specification to have parity 
> with the community and then use this implementation for our containerizer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4479) Implement reservation labels

2016-02-05 Thread Michael Park (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15135259#comment-15135259
 ] 

Michael Park commented on MESOS-4479:
-

{noformat}
commit 8b5cb55e6f8f1ed78ee43e7d497d9f01f8f0e5fd
Author: Neil Conway 
Date:   Fri Feb 5 14:12:22 2016 -0800

Allowed `createLabel` to take an optional `value`.

This better matches the underlying protobuf definition.

Review: https://reviews.apache.org/r/42753/
{noformat}
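A hedged sketch of what the change above enables. The types here are simplified assumptions: `Label` stands in for the protobuf message (whose `value` field is optional), and `createLabel` for the Mesos test helper, whose real signature may differ.

```cpp
#include <optional>
#include <string>

// Simplified stand-in for the Label protobuf message: `value` is
// optional, so a label may carry a key alone.
struct Label {
  std::string key;
  std::optional<std::string> value;
};

// Test helper sketch: `value` defaults to "unset", better matching the
// underlying protobuf definition than a mandatory second argument.
Label createLabel(const std::string& key,
                  const std::optional<std::string>& value = std::nullopt) {
  return Label{key, value};
}
```

With the default, `createLabel("rack")` produces a key-only label, while `createLabel("rack", "us-east-1a")` sets both fields.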
{noformat}
commit 60015ea893dd0dbd96077035a9155c90012173bc
Author: Neil Conway 
Date:   Fri Feb 5 14:12:14 2016 -0800

Fixed some typos in test case comments.

Review: https://reviews.apache.org/r/42752/
{noformat}
{noformat}
commit b5833d4d7a8358326149abd1f8d090be0335a7c6
Author: Neil Conway 
Date:   Fri Feb 5 14:11:40 2016 -0800

Tweaked some resource test cases.

We should check that two reservations with the same role but different
principals are considered distinct.

Review: https://reviews.apache.org/r/42751/
{noformat}
{noformat}
commit d9d966d9e636fd4bee8b902742eaa9cf6dd1b342
Author: Neil Conway 
Date:   Fri Feb 5 14:11:33 2016 -0800

Added `Resources::size()`.

Review: https://reviews.apache.org/r/43239/
{noformat}

> Implement reservation labels
> 
>
> Key: MESOS-4479
> URL: https://issues.apache.org/jira/browse/MESOS-4479
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Neil Conway
>Assignee: Neil Conway
>  Labels: labels, mesosphere, reservations
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4603) GTEST crashes when starting/stopping many times in succession

2016-02-05 Thread Kevin Klues (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Klues updated MESOS-4603:
---
Labels: mesosphere tests  (was: tests)

> GTEST crashes when starting/stopping many times in succession
> -
>
> Key: MESOS-4603
> URL: https://issues.apache.org/jira/browse/MESOS-4603
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
> Environment: clang 3.4, ubuntu 14.04
>Reporter: Kevin Klues
>  Labels: mesosphere, tests
>
> After running:
> run-one-until-failure 3rdparty/libprocess/libprocess-tests
> At least one iteration of running the tests fails in under a minute with the 
> following stack trace. The stack trace differs between runs, but it always 
> seems to error out in ~ProcessManager().
> {noformat}
> *** Aborted at 1454643530 (unix time) try "date -d @1454643530" if you are 
> using GNU date ***
> PC: @ 0x7f7812f4d1a0 (unknown)
> *** SIGSEGV (@0x0) received by PID 168122 (TID 0x7f780298f700) from PID 0; 
> stack trace: ***
> @ 0x7f7814451340 (unknown)
> @ 0x7f7812f4d1a0 (unknown)
> @   0x5f06a0 process::Process<>::self()
> @   0x777220 
> _ZN7process8dispatchI7NothingNS_20AsyncExecutorProcessERKZZNS_4http8internal7requestERKNS3_7RequestEbENK3$_1clENS3_10ConnectionEEUlvE_PvSA_SD_EENS_6FutureIT_EEPKNS_7ProcessIT0_EEMSI_FSF_T1_T2_ET3_T4_
> @   0x77714c 
> _ZN7process13AsyncExecutor7executeIZZNS_4http8internal7requestERKNS2_7RequestEbENK3$_1clENS2_10ConnectionEEUlvE_EENS_6FutureI7NothingEERKT_PN5boost9enable_ifINSG_7is_voidINSt9result_ofIFSD_vEE4typeEEEvE4typeE
> @   0x77709e 
> _ZN7process5asyncIZZNS_4http8internal7requestERKNS1_7RequestEbENK3$_1clENS1_10ConnectionEEUlvE_EENS_6FutureI7NothingEERKT_PN5boost9enable_ifINSF_7is_voidINSt9result_ofIFSC_vEE4typeEEEvE4typeE
> @   0x777046 
> _ZZZN7process4http8internal7requestERKNS0_7RequestEbENK3$_1clENS0_10ConnectionEENKUlvE0_clEv
> @   0x777019 
> _ZZNK7process6FutureI7NothingE5onAnyIZZNS_4http8internal7requestERKNS4_7RequestEbENK3$_1clENS4_10ConnectionEEUlvE0_vEERKS2_OT_NS2_10LessPreferEENUlSD_E_clESD_
> @   0x776e02 
> _ZNSt17_Function_handlerIFvRKN7process6FutureI7NothingEEEZNKS3_5onAnyIZZNS0_4http8internal7requestERKNS8_7RequestEbENK3$_1clENS8_10ConnectionEEUlvE0_vEES5_OT_NS3_10LessPreferEEUlS5_E_E9_M_invokeERKSt9_Any_dataS5_
> @   0x43f888 std::function<>::operator()()
> @   0x4464ec 
> _ZN7process8internal3runISt8functionIFvRKNS_6FutureI7NothingJRS5_EEEvRKSt6vectorIT_SaISC_EEDpOT0_
> @   0x446305 process::Future<>::set()
> @   0x44f90a 
> _ZNKSt7_Mem_fnIMN7process6FutureI7NothingEEFbRKS2_EEclIJS5_EvEEbRS3_DpOT_
> @   0x44f7ae 
> _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureI7NothingEEFbRKS3_EES4_St12_PlaceholderILi16__callIbJS6_EJLm0ELm1T_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
> @   0x44f72d 
> _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureI7NothingEEFbRKS3_EES4_St12_PlaceholderILi1clIJS6_EbEET0_DpOT_
> @   0x44f6dd 
> _ZZNK7process6FutureI7NothingE7onReadyISt5_BindIFSt7_Mem_fnIMS2_FbRKS1_EES2_St12_PlaceholderILi1bEERKS2_OT_NS2_6PreferEENUlS7_E_clES7_
> @   0x44f492 
> _ZNSt17_Function_handlerIFvRK7NothingEZNK7process6FutureIS0_E7onReadyISt5_BindIFSt7_Mem_fnIMS6_FbS2_EES6_St12_PlaceholderILi1bEERKS6_OT_NS6_6PreferEEUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @   0x446d68 std::function<>::operator()()
> @   0x44644c 
> _ZN7process8internal3runISt8functionIFvRK7NothingEEJRS3_EEEvRKSt6vectorIT_SaISA_EEDpOT0_
> @   0x4462e7 process::Future<>::set()
> @   0x50d5c7 process::Promise<>::set()
> @   0x77c53b 
> process::http::internal::ConnectionProcess::disconnect()
> @   0x792710 process::http::internal::ConnectionProcess::_read()
> @   0x794356 
> _ZZN7process8dispatchINS_4http8internal17ConnectionProcessERKNS_6FutureISsEES5_EEvRKNS_3PIDIT_EEMS9_FvT0_ET1_ENKUlPNS_11ProcessBaseEE_clESI_
> @   0x793fa2 
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchINS0_4http8internal17ConnectionProcessERKNS0_6FutureISsEES9_EEvRKNS0_3PIDIT_EEMSD_FvT0_ET1_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @   0x810958 std::function<>::operator()()
> @   0x7fb854 process::ProcessBase::visit()
> @   0x8581ce process::DispatchEvent::visit()
> @   0x43d631 process::ProcessBase::serve()
> @   0x7f9604 process::ProcessManager::resume()
> @   0x8017a5 
> process::ProcessManager::init_threads()::$_1::operator()()
> @   0x8016e3 
> 

[jira] [Created] (MESOS-4614) SlaveRecoveryTest/0.CleanupHTTPExecutor is flaky

2016-02-05 Thread Greg Mann (JIRA)
Greg Mann created MESOS-4614:


 Summary: SlaveRecoveryTest/0.CleanupHTTPExecutor is flaky
 Key: MESOS-4614
 URL: https://issues.apache.org/jira/browse/MESOS-4614
 Project: Mesos
  Issue Type: Bug
  Components: HTTP API, slave, tests
Affects Versions: 0.27.0
 Environment: CentOS 7, gcc, libevent & SSL enabled
Reporter: Greg Mann


Just saw this failure on the ASF CI:

{code}
[ RUN  ] SlaveRecoveryTest/0.CleanupHTTPExecutor
I0206 00:22:44.791671  2824 leveldb.cpp:174] Opened db in 2.539372ms
I0206 00:22:44.792459  2824 leveldb.cpp:181] Compacted db in 740473ns
I0206 00:22:44.792510  2824 leveldb.cpp:196] Created db iterator in 24164ns
I0206 00:22:44.792532  2824 leveldb.cpp:202] Seeked to beginning of db in 1831ns
I0206 00:22:44.792548  2824 leveldb.cpp:271] Iterated through 0 keys in the db 
in 342ns
I0206 00:22:44.792605  2824 replica.cpp:779] Replica recovered with log 
positions 0 -> 0 with 1 holes and 0 unlearned
I0206 00:22:44.793256  2847 recover.cpp:447] Starting replica recovery
I0206 00:22:44.793480  2847 recover.cpp:473] Replica is in EMPTY status
I0206 00:22:44.794538  2847 replica.cpp:673] Replica in EMPTY status received a 
broadcasted recover request from (9472)@172.17.0.2:43484
I0206 00:22:44.795040  2848 recover.cpp:193] Received a recover response from a 
replica in EMPTY status
I0206 00:22:44.795644  2848 recover.cpp:564] Updating replica status to STARTING
I0206 00:22:44.796519  2850 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 752810ns
I0206 00:22:44.796545  2850 replica.cpp:320] Persisted replica status to 
STARTING
I0206 00:22:44.796725  2848 recover.cpp:473] Replica is in STARTING status
I0206 00:22:44.797828  2857 replica.cpp:673] Replica in STARTING status 
received a broadcasted recover request from (9473)@172.17.0.2:43484
I0206 00:22:44.798355  2850 recover.cpp:193] Received a recover response from a 
replica in STARTING status
I0206 00:22:44.799193  2850 recover.cpp:564] Updating replica status to VOTING
I0206 00:22:44.799583  2855 master.cpp:376] Master 
0b206a40-a9c3-4d44-a5bd-8032d60a32ca (6632562f1ade) started on 172.17.0.2:43484
I0206 00:22:44.799609  2855 master.cpp:378] Flags at startup: --acls="" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate="true" --authenticate_http="true" --authenticate_slaves="true" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/n2FxQV/credentials" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" --max_slave_ping_timeouts="5" 
--quiet="false" --recovery_slave_removal_limit="100%" 
--registry="replicated_log" --registry_fetch_timeout="1mins" 
--registry_store_timeout="100secs" --registry_strict="true" 
--root_submissions="true" --slave_ping_timeout="15secs" 
--slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" 
--webui_dir="/mesos/mesos-0.28.0/_inst/share/mesos/webui" 
--work_dir="/tmp/n2FxQV/master" --zk_session_timeout="10secs"
I0206 00:22:44.71  2855 master.cpp:423] Master only allowing authenticated 
frameworks to register
I0206 00:22:44.89  2855 master.cpp:428] Master only allowing authenticated 
slaves to register
I0206 00:22:44.800020  2855 credentials.hpp:35] Loading credentials for 
authentication from '/tmp/n2FxQV/credentials'
I0206 00:22:44.800245  2850 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 679345ns
I0206 00:22:44.800370  2850 replica.cpp:320] Persisted replica status to VOTING
I0206 00:22:44.800397  2855 master.cpp:468] Using default 'crammd5' 
authenticator
I0206 00:22:44.800693  2855 master.cpp:537] Using default 'basic' HTTP 
authenticator
I0206 00:22:44.800815  2855 master.cpp:571] Authorization enabled
I0206 00:22:44.801216  2850 recover.cpp:578] Successfully joined the Paxos group
I0206 00:22:44.801604  2850 recover.cpp:462] Recover process terminated
I0206 00:22:44.801759  2856 whitelist_watcher.cpp:77] No whitelist given
I0206 00:22:44.801725  2847 hierarchical.cpp:144] Initialized hierarchical 
allocator process
I0206 00:22:44.803982  2855 master.cpp:1712] The newly elected leader is 
master@172.17.0.2:43484 with id 0b206a40-a9c3-4d44-a5bd-8032d60a32ca
I0206 00:22:44.804026  2855 master.cpp:1725] Elected as the leading master!
I0206 00:22:44.804059  2855 master.cpp:1470] Recovering from registrar
I0206 00:22:44.804424  2855 registrar.cpp:307] Recovering registrar
I0206 00:22:44.805202  2855 log.cpp:659] Attempting to start the writer
I0206 00:22:44.806782  2856 replica.cpp:493] Replica received implicit promise 
request from (9475)@172.17.0.2:43484 with proposal 1
I0206 00:22:44.807368  2856 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb 

[jira] [Commented] (MESOS-4479) Implement reservation labels

2016-02-05 Thread Michael Park (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15135457#comment-15135457
 ] 

Michael Park commented on MESOS-4479:
-

{noformat}
commit 4dbebcfaf2e8c399b2343b932d19677790db020e
Author: Joseph Wu 
Date:   Fri Feb 5 17:56:00 2016 -0800

Fixed compilation on Ubuntu 15.

A few signed-unsigned comparisons introduced by
https://reviews.apache.org/r/42751/

Review: https://reviews.apache.org/r/43276/
{noformat}
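The kind of signed-unsigned comparison this fix addresses can be sketched as follows (illustrative only, not the actual Mesos test code): comparing an `int` against a `size_t` such as `vector::size()` triggers `-Wsign-compare`, which some compiler configurations treat as an error.

```cpp
#include <cstddef>
#include <vector>

// Comparing `int` with `size_t` directly triggers -Wsign-compare on
// strict builds; guard against negative values, then cast the signed
// operand so both sides of the comparison are unsigned.
bool hasExpectedSize(const std::vector<int>& v, int expected) {
  return expected >= 0 && v.size() == static_cast<std::size_t>(expected);
}
```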

> Implement reservation labels
> 
>
> Key: MESOS-4479
> URL: https://issues.apache.org/jira/browse/MESOS-4479
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Neil Conway
>Assignee: Neil Conway
>  Labels: labels, mesosphere, reservations
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4615) ContainerLoggerTest.DefaultToSandbox is flaky

2016-02-05 Thread Greg Mann (JIRA)
Greg Mann created MESOS-4615:


 Summary: ContainerLoggerTest.DefaultToSandbox is flaky
 Key: MESOS-4615
 URL: https://issues.apache.org/jira/browse/MESOS-4615
 Project: Mesos
  Issue Type: Bug
  Components: tests
Affects Versions: 0.27.0
 Environment: CentOS 7, gcc, libevent & SSL enabled
Reporter: Greg Mann


Just saw this failure on the ASF CI:

{code}
[ RUN  ] ContainerLoggerTest.DefaultToSandbox
I0206 01:25:03.766458  2824 leveldb.cpp:174] Opened db in 72.979786ms
I0206 01:25:03.811712  2824 leveldb.cpp:181] Compacted db in 45.162067ms
I0206 01:25:03.811810  2824 leveldb.cpp:196] Created db iterator in 26090ns
I0206 01:25:03.811828  2824 leveldb.cpp:202] Seeked to beginning of db in 3173ns
I0206 01:25:03.811839  2824 leveldb.cpp:271] Iterated through 0 keys in the db 
in 497ns
I0206 01:25:03.811900  2824 replica.cpp:779] Replica recovered with log 
positions 0 -> 0 with 1 holes and 0 unlearned
I0206 01:25:03.812785  2849 recover.cpp:447] Starting replica recovery
I0206 01:25:03.813043  2849 recover.cpp:473] Replica is in EMPTY status
I0206 01:25:03.814668  2854 replica.cpp:673] Replica in EMPTY status received a 
broadcasted recover request from (371)@172.17.0.8:37843
I0206 01:25:03.815210  2849 recover.cpp:193] Received a recover response from a 
replica in EMPTY status
I0206 01:25:03.815732  2854 recover.cpp:564] Updating replica status to STARTING
I0206 01:25:03.819664  2857 master.cpp:376] Master 
914b62f9-95f6-4c57-a7e3-9b06e2c1c8de (74ef606c4063) started on 172.17.0.8:37843
I0206 01:25:03.819703  2857 master.cpp:378] Flags at startup: --acls="" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate="true" --authenticate_http="true" --authenticate_slaves="true" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/h5vu5I/credentials" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" --max_slave_ping_timeouts="5" 
--quiet="false" --recovery_slave_removal_limit="100%" 
--registry="replicated_log" --registry_fetch_timeout="1mins" 
--registry_store_timeout="100secs" --registry_strict="true" 
--root_submissions="true" --slave_ping_timeout="15secs" 
--slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" 
--webui_dir="/mesos/mesos-0.28.0/_inst/share/mesos/webui" 
--work_dir="/tmp/h5vu5I/master" --zk_session_timeout="10secs"
I0206 01:25:03.820241  2857 master.cpp:423] Master only allowing authenticated 
frameworks to register
I0206 01:25:03.820257  2857 master.cpp:428] Master only allowing authenticated 
slaves to register
I0206 01:25:03.820269  2857 credentials.hpp:35] Loading credentials for 
authentication from '/tmp/h5vu5I/credentials'
I0206 01:25:03.821110  2857 master.cpp:468] Using default 'crammd5' 
authenticator
I0206 01:25:03.821311  2857 master.cpp:537] Using default 'basic' HTTP 
authenticator
I0206 01:25:03.821636  2857 master.cpp:571] Authorization enabled
I0206 01:25:03.821979  2846 hierarchical.cpp:144] Initialized hierarchical 
allocator process
I0206 01:25:03.822057  2846 whitelist_watcher.cpp:77] No whitelist given
I0206 01:25:03.825460  2847 master.cpp:1712] The newly elected leader is 
master@172.17.0.8:37843 with id 914b62f9-95f6-4c57-a7e3-9b06e2c1c8de
I0206 01:25:03.825512  2847 master.cpp:1725] Elected as the leading master!
I0206 01:25:03.825533  2847 master.cpp:1470] Recovering from registrar
I0206 01:25:03.825835  2847 registrar.cpp:307] Recovering registrar
I0206 01:25:03.848212  2854 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 32.226093ms
I0206 01:25:03.848299  2854 replica.cpp:320] Persisted replica status to 
STARTING
I0206 01:25:03.848702  2854 recover.cpp:473] Replica is in STARTING status
I0206 01:25:03.850728  2858 replica.cpp:673] Replica in STARTING status 
received a broadcasted recover request from (373)@172.17.0.8:37843
I0206 01:25:03.851230  2854 recover.cpp:193] Received a recover response from a 
replica in STARTING status
I0206 01:25:03.852018  2854 recover.cpp:564] Updating replica status to VOTING
I0206 01:25:03.881681  2854 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 29.184163ms
I0206 01:25:03.881772  2854 replica.cpp:320] Persisted replica status to VOTING
I0206 01:25:03.882058  2854 recover.cpp:578] Successfully joined the Paxos group
I0206 01:25:03.882258  2854 recover.cpp:462] Recover process terminated
I0206 01:25:03.883076  2854 log.cpp:659] Attempting to start the writer
I0206 01:25:03.885040  2854 replica.cpp:493] Replica received implicit promise 
request from (374)@172.17.0.8:37843 with proposal 1
I0206 01:25:03.915132  2854 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 

[jira] [Assigned] (MESOS-4604) ROOT_DOCKER_DockerHealthyTask is flaky.

2016-02-05 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-4604:


Assignee: Joseph Wu

> ROOT_DOCKER_DockerHealthyTask is flaky.
> ---
>
> Key: MESOS-4604
> URL: https://issues.apache.org/jira/browse/MESOS-4604
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
> Environment: CentOS 6/7, Ubuntu 15.04 on AWS.
>Reporter: Jan Schlicht
>Assignee: Joseph Wu
>  Labels: flaky-test, test
>
> Log from Teamcity that is running {{sudo ./bin/mesos-tests.sh}} on AWS EC2 
> instances:
> {noformat}
> [18:27:14][Step 8/8] [--] 8 tests from HealthCheckTest
> [18:27:14][Step 8/8] [ RUN  ] HealthCheckTest.HealthyTask
> [18:27:17][Step 8/8] [   OK ] HealthCheckTest.HealthyTask ( ms)
> [18:27:17][Step 8/8] [ RUN  ] 
> HealthCheckTest.ROOT_DOCKER_DockerHealthyTask
> [18:27:36][Step 8/8] ../../src/tests/health_check_tests.cpp:388: Failure
> [18:27:36][Step 8/8] Failed to wait 15secs for termination
> [18:27:36][Step 8/8] F0204 18:27:35.981302 23085 logging.cpp:64] RAW: Pure 
> virtual method called
> [18:27:36][Step 8/8] @ 0x7f7077055e1c  google::LogMessage::Fail()
> [18:27:36][Step 8/8] @ 0x7f707705ba6f  google::RawLog__()
> [18:27:36][Step 8/8] @ 0x7f70760f76c9  __cxa_pure_virtual
> [18:27:36][Step 8/8] @   0xa9423c  
> mesos::internal::tests::Cluster::Slaves::shutdown()
> [18:27:36][Step 8/8] @  0x1074e45  
> mesos::internal::tests::MesosTest::ShutdownSlaves()
> [18:27:36][Step 8/8] @  0x1074de4  
> mesos::internal::tests::MesosTest::Shutdown()
> [18:27:36][Step 8/8] @  0x1070ec7  
> mesos::internal::tests::MesosTest::TearDown()
> [18:27:36][Step 8/8] @  0x16eb7b2  
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> [18:27:36][Step 8/8] @  0x16e61a9  
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> [18:27:36][Step 8/8] @  0x16c56aa  testing::Test::Run()
> [18:27:36][Step 8/8] @  0x16c5e89  testing::TestInfo::Run()
> [18:27:36][Step 8/8] @  0x16c650a  testing::TestCase::Run()
> [18:27:36][Step 8/8] @  0x16cd1f6  
> testing::internal::UnitTestImpl::RunAllTests()
> [18:27:36][Step 8/8] @  0x16ec513  
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> [18:27:36][Step 8/8] @  0x16e6df1  
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> [18:27:36][Step 8/8] @  0x16cbe26  testing::UnitTest::Run()
> [18:27:36][Step 8/8] @   0xe54c84  RUN_ALL_TESTS()
> [18:27:36][Step 8/8] @   0xe54867  main
> [18:27:36][Step 8/8] @ 0x7f7071560a40  (unknown)
> [18:27:36][Step 8/8] @   0x9b52d9  _start
> [18:27:36][Step 8/8] Aborted (core dumped)
> [18:27:36][Step 8/8] Process exited with code 134
> {noformat}
> Happens with Ubuntu 15.04, CentOS 6, CentOS 7 _quite_ often. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4479) Implement reservation labels

2016-02-05 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15135349#comment-15135349
 ] 

Joseph Wu commented on MESOS-4479:
--

Fix for Ubuntu15 compilation: https://reviews.apache.org/r/43276/

> Implement reservation labels
> 
>
> Key: MESOS-4479
> URL: https://issues.apache.org/jira/browse/MESOS-4479
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Neil Conway
>Assignee: Neil Conway
>  Labels: labels, mesosphere, reservations
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4604) ROOT_DOCKER_DockerHealthyTask is flaky.

2016-02-05 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-4604:
-
  Sprint: Mesosphere Sprint 28
Story Points: 3
  Labels: flaky-test mesosphere test  (was: flaky-test test)

> ROOT_DOCKER_DockerHealthyTask is flaky.
> ---
>
> Key: MESOS-4604
> URL: https://issues.apache.org/jira/browse/MESOS-4604
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
> Environment: CentOS 6/7, Ubuntu 15.04 on AWS.
>Reporter: Jan Schlicht
>Assignee: Joseph Wu
>  Labels: flaky-test, mesosphere, test
>
> Log from Teamcity that is running {{sudo ./bin/mesos-tests.sh}} on AWS EC2 
> instances:
> {noformat}
> [18:27:14][Step 8/8] [--] 8 tests from HealthCheckTest
> [18:27:14][Step 8/8] [ RUN  ] HealthCheckTest.HealthyTask
> [18:27:17][Step 8/8] [   OK ] HealthCheckTest.HealthyTask ( ms)
> [18:27:17][Step 8/8] [ RUN  ] 
> HealthCheckTest.ROOT_DOCKER_DockerHealthyTask
> [18:27:36][Step 8/8] ../../src/tests/health_check_tests.cpp:388: Failure
> [18:27:36][Step 8/8] Failed to wait 15secs for termination
> [18:27:36][Step 8/8] F0204 18:27:35.981302 23085 logging.cpp:64] RAW: Pure 
> virtual method called
> [18:27:36][Step 8/8] @ 0x7f7077055e1c  google::LogMessage::Fail()
> [18:27:36][Step 8/8] @ 0x7f707705ba6f  google::RawLog__()
> [18:27:36][Step 8/8] @ 0x7f70760f76c9  __cxa_pure_virtual
> [18:27:36][Step 8/8] @   0xa9423c  
> mesos::internal::tests::Cluster::Slaves::shutdown()
> [18:27:36][Step 8/8] @  0x1074e45  
> mesos::internal::tests::MesosTest::ShutdownSlaves()
> [18:27:36][Step 8/8] @  0x1074de4  
> mesos::internal::tests::MesosTest::Shutdown()
> [18:27:36][Step 8/8] @  0x1070ec7  
> mesos::internal::tests::MesosTest::TearDown()
> [18:27:36][Step 8/8] @  0x16eb7b2  
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> [18:27:36][Step 8/8] @  0x16e61a9  
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> [18:27:36][Step 8/8] @  0x16c56aa  testing::Test::Run()
> [18:27:36][Step 8/8] @  0x16c5e89  testing::TestInfo::Run()
> [18:27:36][Step 8/8] @  0x16c650a  testing::TestCase::Run()
> [18:27:36][Step 8/8] @  0x16cd1f6  
> testing::internal::UnitTestImpl::RunAllTests()
> [18:27:36][Step 8/8] @  0x16ec513  
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> [18:27:36][Step 8/8] @  0x16e6df1  
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> [18:27:36][Step 8/8] @  0x16cbe26  testing::UnitTest::Run()
> [18:27:36][Step 8/8] @   0xe54c84  RUN_ALL_TESTS()
> [18:27:36][Step 8/8] @   0xe54867  main
> [18:27:36][Step 8/8] @ 0x7f7071560a40  (unknown)
> [18:27:36][Step 8/8] @   0x9b52d9  _start
> [18:27:36][Step 8/8] Aborted (core dumped)
> [18:27:36][Step 8/8] Process exited with code 134
> {noformat}
> Happens with Ubuntu 15.04, CentOS 6, CentOS 7 _quite_ often. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4614) SlaveRecoveryTest/0.CleanupHTTPExecutor is flaky

2016-02-05 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15135385#comment-15135385
 ] 

Anand Mazumdar commented on MESOS-4614:
---

The executor did not even send the {{Subscribe}} call after connecting to the 
agent.

This is similar to the behavior we have been observing with another flaky test 
in {{MESOS-3273}}, in which the example test framework does not send the 
initial {{SUBSCRIBE}} call.

> SlaveRecoveryTest/0.CleanupHTTPExecutor is flaky
> 
>
> Key: MESOS-4614
> URL: https://issues.apache.org/jira/browse/MESOS-4614
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API, slave, tests
>Affects Versions: 0.27.0
> Environment: CentOS 7, gcc, libevent & SSL enabled
>Reporter: Greg Mann
>  Labels: flaky-test, mesosphere
>
> Just saw this failure on the ASF CI:
> {code}
> [ RUN  ] SlaveRecoveryTest/0.CleanupHTTPExecutor
> I0206 00:22:44.791671  2824 leveldb.cpp:174] Opened db in 2.539372ms
> I0206 00:22:44.792459  2824 leveldb.cpp:181] Compacted db in 740473ns
> I0206 00:22:44.792510  2824 leveldb.cpp:196] Created db iterator in 24164ns
> I0206 00:22:44.792532  2824 leveldb.cpp:202] Seeked to beginning of db in 
> 1831ns
> I0206 00:22:44.792548  2824 leveldb.cpp:271] Iterated through 0 keys in the 
> db in 342ns
> I0206 00:22:44.792605  2824 replica.cpp:779] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0206 00:22:44.793256  2847 recover.cpp:447] Starting replica recovery
> I0206 00:22:44.793480  2847 recover.cpp:473] Replica is in EMPTY status
> I0206 00:22:44.794538  2847 replica.cpp:673] Replica in EMPTY status received 
> a broadcasted recover request from (9472)@172.17.0.2:43484
> I0206 00:22:44.795040  2848 recover.cpp:193] Received a recover response from 
> a replica in EMPTY status
> I0206 00:22:44.795644  2848 recover.cpp:564] Updating replica status to 
> STARTING
> I0206 00:22:44.796519  2850 leveldb.cpp:304] Persisting metadata (8 bytes) to 
> leveldb took 752810ns
> I0206 00:22:44.796545  2850 replica.cpp:320] Persisted replica status to 
> STARTING
> I0206 00:22:44.796725  2848 recover.cpp:473] Replica is in STARTING status
> I0206 00:22:44.797828  2857 replica.cpp:673] Replica in STARTING status 
> received a broadcasted recover request from (9473)@172.17.0.2:43484
> I0206 00:22:44.798355  2850 recover.cpp:193] Received a recover response from 
> a replica in STARTING status
> I0206 00:22:44.799193  2850 recover.cpp:564] Updating replica status to VOTING
> I0206 00:22:44.799583  2855 master.cpp:376] Master 
> 0b206a40-a9c3-4d44-a5bd-8032d60a32ca (6632562f1ade) started on 
> 172.17.0.2:43484
> I0206 00:22:44.799609  2855 master.cpp:378] Flags at startup: --acls="" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate="true" --authenticate_http="true" --authenticate_slaves="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/n2FxQV/credentials" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" --max_slave_ping_timeouts="5" 
> --quiet="false" --recovery_slave_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_store_timeout="100secs" --registry_strict="true" 
> --root_submissions="true" --slave_ping_timeout="15secs" 
> --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" 
> --webui_dir="/mesos/mesos-0.28.0/_inst/share/mesos/webui" 
> --work_dir="/tmp/n2FxQV/master" --zk_session_timeout="10secs"
> I0206 00:22:44.71  2855 master.cpp:423] Master only allowing 
> authenticated frameworks to register
> I0206 00:22:44.89  2855 master.cpp:428] Master only allowing 
> authenticated slaves to register
> I0206 00:22:44.800020  2855 credentials.hpp:35] Loading credentials for 
> authentication from '/tmp/n2FxQV/credentials'
> I0206 00:22:44.800245  2850 leveldb.cpp:304] Persisting metadata (8 bytes) to 
> leveldb took 679345ns
> I0206 00:22:44.800370  2850 replica.cpp:320] Persisted replica status to 
> VOTING
> I0206 00:22:44.800397  2855 master.cpp:468] Using default 'crammd5' 
> authenticator
> I0206 00:22:44.800693  2855 master.cpp:537] Using default 'basic' HTTP 
> authenticator
> I0206 00:22:44.800815  2855 master.cpp:571] Authorization enabled
> I0206 00:22:44.801216  2850 recover.cpp:578] Successfully joined the Paxos 
> group
> I0206 00:22:44.801604  2850 recover.cpp:462] Recover process terminated
> I0206 00:22:44.801759  2856 whitelist_watcher.cpp:77] No whitelist given
> I0206 00:22:44.801725  2847 hierarchical.cpp:144]