[jira] [Commented] (MESOS-6078) Add a agent teardown endpoint

2016-10-25 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15606268#comment-15606268
 ] 

Cody Maloney commented on MESOS-6078:
-

{{/machine/down}} is very complicated to use for this use case: it requires 
posting multiple JSON blobs, which must follow a format that includes timestamps 
in milliseconds and multiple fields that match exactly how a particular Mesos 
agent was launched.

It takes a _lot_ of code and debugging to use and manage it for what is a 
simple, common task. Things also get more complicated once there are existing 
schedules, or if you want the agent to re-register later.

> Add a agent teardown endpoint
> -
>
> Key: MESOS-6078
> URL: https://issues.apache.org/jira/browse/MESOS-6078
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Cody Maloney
>Assignee: Michael Park
>  Labels: mesosphere
>
> Currently, when a whole agent machine is unexpectedly terminated for good 
> (AWS terminated the instance without warning), it goes through the mesos 
> slave removal rate limit before it's gone.
> If a couple agents / a whole rack goes in a cluster of thousands of agents, 
> this can get to be a problem.
> If the agent can be shut down "cleanly" everything would get rescheduled, but 
> once the agent is gone, there currently is no good way for an administrator 
> to indicate the node is gone / gone and its tasks are lost / should be 
> rescheduled if appropriate as soon as possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6354) Treat a non-existent mesos modules directory the same as an empty mesos modules directory

2016-10-10 Thread Cody Maloney (JIRA)
Cody Maloney created MESOS-6354:
---

 Summary: Treat a non-existent mesos modules directory the same as 
an empty mesos modules directory
 Key: MESOS-6354
 URL: https://issues.apache.org/jira/browse/MESOS-6354
 Project: Mesos
  Issue Type: Bug
  Components: modules
Reporter: Cody Maloney
Assignee: Kapil Arya


When there are no modules, there is often no module directory. A non-existent 
modules directory indicates exactly the same thing as not having any modules 
inside the modules directory.

In DC/OS we have to carry some extra stuff to make sure we always have a 
existing modules directory even in cases where we don't have any real mesos 
modules in it (https://github.com/dcos/dcos/pull/849)
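
The equivalence argued for above can be sketched in a few lines. This is a minimal Python illustration, not Mesos code: the function name {{list_module_files}} and the idea of one file per module are assumptions made only for the example.

```python
import os
import tempfile


def list_module_files(modules_dir):
    """Return the sorted files in modules_dir.

    Hypothetical policy from this issue: a missing directory is treated
    exactly like an empty one, so callers never need to pre-create it.
    """
    try:
        entries = os.listdir(modules_dir)
    except FileNotFoundError:
        # No directory means no modules, not an error.
        return []
    return sorted(os.path.join(modules_dir, name) for name in entries)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # An empty directory and a non-existent one give the same answer.
        print(list_module_files(tmp))
        print(list_module_files(os.path.join(tmp, "does-not-exist")))
```

With this behavior a packaging layer would not need to ship an empty placeholder directory.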





[jira] [Created] (MESOS-6340) Set HOME for Mesos tasks

2016-10-07 Thread Cody Maloney (JIRA)
Cody Maloney created MESOS-6340:
---

 Summary: Set HOME for Mesos tasks
 Key: MESOS-6340
 URL: https://issues.apache.org/jira/browse/MESOS-6340
 Project: Mesos
  Issue Type: Bug
  Components: containerization, slave
Reporter: Cody Maloney
Assignee: Jie Yu


Quite a few programs assume {{$HOME}} points to a user-editable data file 
directory.

One example is Python, which tries to look up {{$HOME}} to find user-installed 
packages, and if that fails it tries to look up the user in the passwd database, 
which often goes badly (the container is running under the `nobody` user):

{code}
if i == 1:
    if 'HOME' not in os.environ:
        import pwd
        userhome = pwd.getpwuid(os.getuid()).pw_dir
    else:
        userhome = os.environ['HOME']
{code}

Just setting HOME by default to WORK_DIR would enable more software to work 
correctly out of the box. Software which needs to specialize / change it (or 
schedulers with specific preferences) should still be able to set it 
arbitrarily, and anything a scheduler explicitly sets should override the 
default value of $WORK_DIR.
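
The precedence proposed above (default first, scheduler value wins) could look roughly like this. It is a sketch under stated assumptions: {{build_task_environment}} and its parameters are hypothetical names, not part of any Mesos API.

```python
def build_task_environment(work_dir, scheduler_env=None):
    """Build a task environment where HOME defaults to the work directory.

    Hypothetical helper: anything the scheduler sets explicitly,
    including HOME, overrides the default.
    """
    env = {"HOME": work_dir}          # default proposed in this issue
    env.update(scheduler_env or {})   # explicit scheduler values win
    return env


if __name__ == "__main__":
    print(build_task_environment("/mnt/mesos/sandbox"))
    print(build_task_environment("/mnt/mesos/sandbox", {"HOME": "/data"}))
```

Applying the default before merging the scheduler's variables keeps the override rule in one obvious place.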





[jira] [Commented] (MESOS-6127) Implement suppport for HTTP/2

2016-09-05 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15465619#comment-15465619
 ] 

Cody Maloney commented on MESOS-6127:
-

As long as it's a protocol change, why not go to gRPC, which is going to have a 
lot more people developing, maintaining, and committed to it than an HTTP/2 
+ Protobuf thing that Mesos builds internally?

> Implement suppport for HTTP/2
> -
>
> Key: MESOS-6127
> URL: https://issues.apache.org/jira/browse/MESOS-6127
> Project: Mesos
>  Issue Type: Epic
>  Components: HTTP API, libprocess
>Reporter: Aaron Wood
>  Labels: performance
>
> HTTP/2 will allow us to take advantage of connection multiplexing, header 
> compression, streams, server push, etc. Add support for communication over 
> HTTP/2 between masters and agents, framework endpoints, etc.
> Should we support HTTP/2 without TLS? The spec allows for this but most major 
> browser vendors, libraries, and implementations aren't supporting it unless 
> TLS is used. If we do require TLS, what can be done to reduce the performance 
> hit of the TLS handshake? Might need to change more code to make sure that we 
> are taking advantage of connection sharing so that we can (ideally) only ever 
> have a one-time TLS handshake per shared connection.
> Potential library that could be helpful: 
> https://nghttp2.org/documentation/libnghttp2_asio.html





[jira] [Commented] (MESOS-6078) Add a agent teardown endpoint

2016-08-24 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435683#comment-15435683
 ] 

Cody Maloney commented on MESOS-6078:
-

For reference on the API for this: it needs to be simple enough to trigger from 
a button in a web UI (a single, simple HTTP request).

> Add a agent teardown endpoint
> -
>
> Key: MESOS-6078
> URL: https://issues.apache.org/jira/browse/MESOS-6078
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Cody Maloney
>Assignee: Michael Park
>  Labels: mesosphere
>
> Currently, when a whole agent machine is unexpectedly terminated for good 
> (AWS terminated the instance without warning), it goes through the mesos 
> slave removal rate limit before it's gone.
> If a couple agents / a whole rack goes in a cluster of thousands of agents, 
> this can get to be a problem.
> If the agent can be shut down "cleanly" everything would get rescheduled, but 
> once the agent is gone, there currently is no good way for an administrator 
> to indicate the node is gone / gone and its tasks are lost / should be 
> rescheduled if appropriate as soon as possible.





[jira] [Created] (MESOS-6078) Add a agent teardown endpoint

2016-08-24 Thread Cody Maloney (JIRA)
Cody Maloney created MESOS-6078:
---

 Summary: Add a agent teardown endpoint
 Key: MESOS-6078
 URL: https://issues.apache.org/jira/browse/MESOS-6078
 Project: Mesos
  Issue Type: Improvement
  Components: master
Affects Versions: 1.0.1, 1.0.0
Reporter: Cody Maloney
Assignee: Michael Park


Currently, when a whole agent machine is unexpectedly terminated for good (AWS 
terminated the instance without warning), it goes through the mesos slave 
removal rate limit before it's gone.

If a couple agents / a whole rack goes in a cluster of thousands of agents, 
this can get to be a problem.

If the agent can be shut down "cleanly" everything would get rescheduled, but once 
the agent is gone, there currently is no good way for an administrator to 
indicate the node is gone / gone and its tasks are lost / should be 
rescheduled if appropriate as soon as possible.





[jira] [Created] (MESOS-6069) Misspelt TASK_KILLED in mesos slave

2016-08-22 Thread Cody Maloney (JIRA)
Cody Maloney created MESOS-6069:
---

 Summary: Misspelt TASK_KILLED in mesos slave
 Key: MESOS-6069
 URL: https://issues.apache.org/jira/browse/MESOS-6069
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Cody Maloney


https://github.com/apache/mesos/blob/c3228f3c3d1a1b2c145d1377185cfe22da6079eb/src/slave/slave.cpp#L2127





[jira] [Created] (MESOS-5467) offer DECLINE / ACCEPT + Recovered resource messages are spammy

2016-05-26 Thread Cody Maloney (JIRA)
Cody Maloney created MESOS-5467:
---

 Summary: offer DECLINE / ACCEPT + Recovered resource messages are 
spammy
 Key: MESOS-5467
 URL: https://issues.apache.org/jira/browse/MESOS-5467
 Project: Mesos
  Issue Type: Bug
Reporter: Cody Maloney


In a decent-sized Mesos cluster, frameworks get sent hundreds of offers. 
When the framework then accepts/declines those offers, a line like

{noformat}
May 27 01:20:43 node-44a84216f97e mesos-master[110696]: I0527 01:20:43.361552 
110718 master.cpp:3297] Processing DECLINE call for offers: [ 
88bbf084-c8b7-4c91-af62-c91089c97eaf-O433278814 ] for framework 
20160406-160033-18415882-5050-35855- (mon-marathon-service) at 
scheduler-949644bc-b1f0-497b-a767-87d1201d5113@10.6.15.1:41319
{noformat}

will be printed for each of them, along with a line like:
{noformat}
May 27 01:20:43 node-44a84216f97e mesos-master[110696]: I0527 01:20:43.419852 
110703 hierarchical.cpp:744] Recovered cpus(*):37.75; mem(*):102992; 
ports(*):[31000-31214, 31216-32000]; disk(*):545870 (total: cpus(*):38; 
mem(*):103120; ports(*):[31000-32000]; disk(*):545870, allocated: cpus(*):0.25; 
mem(*):128; ports(*):[31215-31215]) on slave 
88bbf084-c8b7-4c91-af62-c91089c97eaf-S649 from framework 
20160406-160033-18415882-5050-35855-
{noformat}

It would be nice to not log the exact declines, or to log a summary instead. 
These messages end up being the vast majority of the logs I look at: 
multi-thousand-line blocks which aren't useful to the investigation. All that's 
needed is a sign that "offers are being processed for this framework".
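
One way to picture the summary idea: count declines per framework and emit a single line per flush instead of one line per offer. This is only a sketch in Python; the class name {{DeclineSummary}}, the flush model, and the message wording are invented for illustration and do not reflect how the Mesos master logs.

```python
from collections import Counter


class DeclineSummary:
    """Aggregate DECLINE calls so each framework yields one log line per flush."""

    def __init__(self):
        self._counts = Counter()

    def record(self, framework_id, offer_count=1):
        # Called once per DECLINE call instead of logging it immediately.
        self._counts[framework_id] += offer_count

    def flush(self):
        # Produce one summary line per framework, then reset the counters.
        lines = [
            f"Processed DECLINE for {count} offers from framework {framework_id}"
            for framework_id, count in sorted(self._counts.items())
        ]
        self._counts.clear()
        return lines


if __name__ == "__main__":
    summary = DeclineSummary()
    for _ in range(800):
        summary.record("mon-marathon-service")
    for line in summary.flush():
        print(line)
```

Flushing on a timer (say, once a second) would keep the "offers are being processed" signal while collapsing hundreds of lines into one.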





[jira] [Updated] (MESOS-5466) Master attempted to send message to disconnected framework logged 800 times in 1 second

2016-05-26 Thread Cody Maloney (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Maloney updated MESOS-5466:

Attachment: master-disconnect-message

> Master attempted to send message to disconnected framework logged 800 times 
> in 1 second
> ---
>
> Key: MESOS-5466
> URL: https://issues.apache.org/jira/browse/MESOS-5466
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Cody Maloney
>  Labels: mesosphere
> Attachments: master-disconnect-message
>
>
> One instance (attached) had 806 of exactly the same message in one second. 
> Anonymized log attached.





[jira] [Commented] (MESOS-1865) Redirect to the leader master when current master is not a leader

2016-04-14 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15241654#comment-15241654
 ] 

Cody Maloney commented on MESOS-1865:
-

Please not 301 "permanent redirect". Browsers cache that for a _long_ time, so 
if that master becomes the leader again you'll be permanently redirected away...

302 or 307. If we're concerned about breaking "dumb" / simple clients, then 307 
would seem to make the most sense. The odds are better that simple clients 
wouldn't know about 307 since it's newer, and would just report it as an error, 
which a sysadmin would see in their monitoring tools and be able to fix.
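
To make the suggestion concrete, here is a small Python sketch of the decision a non-leading master would make. The function {{leader_redirect}} and the example host are hypothetical; the only facts relied on are that 307 is "Temporary Redirect" and that redirects carry a {{Location}} header.

```python
from http import HTTPStatus


def leader_redirect(is_leader, leader_base_url, path):
    """Return None if this master should answer, else (status, headers).

    Hypothetical helper: a non-leader answers with 307 (temporary), never
    301, so clients do not cache the redirect after a later failover.
    """
    if is_leader:
        return None
    location = leader_base_url.rstrip("/") + path
    return HTTPStatus.TEMPORARY_REDIRECT, {"Location": location}


if __name__ == "__main__":
    print(leader_redirect(True, "http://master3.example:5050", "/master/tasks.json"))
    print(leader_redirect(False, "http://master3.example:5050", "/master/tasks.json"))
```

Because 307 also preserves the request method, a redirected POST stays a POST rather than being turned into a GET.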

> Redirect to the leader master when current master is not a leader
> -
>
> Key: MESOS-1865
> URL: https://issues.apache.org/jira/browse/MESOS-1865
> Project: Mesos
>  Issue Type: Bug
>  Components: json api
>Affects Versions: 0.20.1
>Reporter: Steven Schlansker
>Assignee: haosdent
>
> Some of the API endpoints, for example /master/tasks.json, will return bogus 
> information if you query a non-leading master:
> {code}
> [steven@Anesthetize:~]% curl 
> http://master1.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 
> 10
> {
>   "tasks": []
> }
> [steven@Anesthetize:~]% curl 
> http://master2.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 
> 10
> {
>   "tasks": []
> }
> [steven@Anesthetize:~]% curl 
> http://master3.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 
> 10
> {
>   "tasks": [
> {
>   "executor_id": "",
>   "framework_id": "20140724-231003-419644938-5050-1707-",
>   "id": 
> "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db",
>   "name": 
> "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db",
>   "resources": {
> "cpus": 0.25,
> "disk": 0,
> {code}
> This is very hard for end-users to work around.  For example if I query 
> "which master is leading" followed by "leader: which tasks are running" it is 
> possible that the leader fails over in between, leaving me with an incorrect 
> answer and no way to know that this happened.
> In my opinion the API should return the correct response (by asking the 
> current leader?) or an error (500 Not the leader?) but it's unacceptable to 
> return a successful wrong answer.





[jira] [Commented] (MESOS-5211) Allow docker puller to use docker image IDs in addition to tags

2016-04-13 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240330#comment-15240330
 ] 

Cody Maloney commented on MESOS-5211:
-

That's related purely to the unified containerizer. 

It's currently a bug that Mesos inspects docker containerizer image names for 
a {{:}} and, if it isn't there, always forcibly appends {{:latest}}.

The bug fix for the unified containerizer, so that it doesn't just check 
"has_tag" and then assume it should use latest, could definitely be covered by 
MESOS-3505.
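
A rough Python sketch of the check being asked for: only append {{:latest}} when the reference has neither a tag nor a digest. The function name {{normalize_image}} is made up for the example and this is not the Mesos implementation; it only assumes Docker's {{name@sha256:...}} digest form and that a {{:}} before the last {{/}} belongs to a registry port.

```python
def normalize_image(name):
    """Append ':latest' only when the reference has no tag and no digest."""
    if "@" in name:
        # Digest reference such as repo@sha256:<hex>; it pins exact bits.
        return name
    last_component = name.rsplit("/", 1)[-1]
    if ":" in last_component:
        # Explicit tag; a ':' earlier in the name is a registry port.
        return name
    return name + ":latest"


if __name__ == "__main__":
    for ref in ("ubuntu", "ubuntu:14.04", "ubuntu@sha256:abc123",
                "registry.example:5000/ubuntu"):
        print(ref, "->", normalize_image(ref))
```

Checking only for the presence of {{:}} anywhere in the string would misread both the digest form and a registry with a port.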

> Allow docker puller to use docker image IDs in addition to tags
> ---
>
> Key: MESOS-5211
> URL: https://issues.apache.org/jira/browse/MESOS-5211
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.28.0
>Reporter: Cody Maloney
>  Labels: containerizer, docker, mesosphere
>
> Docker added support for a {{@}} format instead of {{:}} in [1.6 
> via pull 11109|https://github.com/docker/docker/pull/11109]. 
> The {{@}} is useful because it allows reference to a specific set of 
> bits, rather than a tag (such as {{:latest}}) which can change over time.
> Currently a number of code paths, such as the [Mesos Docker 
> code|https://github.com/apache/mesos/blob/df29bf0338771c92d1b1d3848181a35429cdcf0f/src/docker/docker.cpp#L1070],
>  the [Mesos Containerizer Docker 
> Provisioner|https://github.com/apache/mesos/blob/df29bf0338771c92d1b1d3848181a35429cdcf0f/src/slave/containerizer/mesos/provisioner/docker/registry_puller.cpp#L206]
>  do not support pulling / fetching docker containers by id.





[jira] [Updated] (MESOS-5211) Allow docker puller to use docker image IDs in addition to tags

2016-04-13 Thread Cody Maloney (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Maloney updated MESOS-5211:

Description: 
Docker added support for a {{@}} format instead of {{:}} in [1.6 
via pull 11109|https://github.com/docker/docker/pull/11109]. 

The {{@}} is useful because it allows reference to a specific set of 
bits, rather than a tag (such as {{:latest}}) which can change over time.

Currently a number of code paths, such as the [Mesos Docker 
code|https://github.com/apache/mesos/blob/df29bf0338771c92d1b1d3848181a35429cdcf0f/src/docker/docker.cpp#L1070],
 the [Mesos Containerizer Docker 
Provisioner|https://github.com/apache/mesos/blob/df29bf0338771c92d1b1d3848181a35429cdcf0f/src/slave/containerizer/mesos/provisioner/docker/registry_puller.cpp#L206]
 do not support pulling / fetching docker containers by id.

  was:
Docker added support for a {{@}} format instead of {{:}} in 1.6. 

The {{@}} is useful because it allows reference to specific set of 
bits, rather than a tag (such as {{:latest}}) which can change over time.

Currently a number of code paths, such as the [Mesos Docker 
code|https://github.com/apache/mesos/blob/df29bf0338771c92d1b1d3848181a35429cdcf0f/src/docker/docker.cpp#L1070],
 the [Mesos Containerizer Docker 
Provisioner|https://github.com/apache/mesos/blob/df29bf0338771c92d1b1d3848181a35429cdcf0f/src/slave/containerizer/mesos/provisioner/docker/registry_puller.cpp#L206]
 do not support pulling / fetching docker containers by id.


> Allow docker puller to use docker image IDs in addition to tags
> ---
>
> Key: MESOS-5211
> URL: https://issues.apache.org/jira/browse/MESOS-5211
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.28.0
>Reporter: Cody Maloney
>  Labels: containerizer, docker, mesosphere
>
> Docker added support for a {{@}} format instead of {{:}} in [1.6 
> via pull 11109|https://github.com/docker/docker/pull/11109]. 
> The {{@}} is useful because it allows reference to a specific set of 
> bits, rather than a tag (such as {{:latest}}) which can change over time.
> Currently a number of code paths, such as the [Mesos Docker 
> code|https://github.com/apache/mesos/blob/df29bf0338771c92d1b1d3848181a35429cdcf0f/src/docker/docker.cpp#L1070],
>  the [Mesos Containerizer Docker 
> Provisioner|https://github.com/apache/mesos/blob/df29bf0338771c92d1b1d3848181a35429cdcf0f/src/slave/containerizer/mesos/provisioner/docker/registry_puller.cpp#L206]
>  do not support pulling / fetching docker containers by id.





[jira] [Created] (MESOS-5211) Allow docker puller to use docker image IDs in addition to tags

2016-04-13 Thread Cody Maloney (JIRA)
Cody Maloney created MESOS-5211:
---

 Summary: Allow docker puller to use docker image IDs in addition 
to tags
 Key: MESOS-5211
 URL: https://issues.apache.org/jira/browse/MESOS-5211
 Project: Mesos
  Issue Type: Bug
  Components: containerization, docker
Affects Versions: 0.28.0
Reporter: Cody Maloney


Docker added support for a {{@}} format instead of {{:}} in 1.6. 

The {{@}} is useful because it allows reference to a specific set of 
bits, rather than a tag (such as {{:latest}}) which can change over time.

Currently a number of code paths, such as the [Mesos Docker 
code|https://github.com/apache/mesos/blob/df29bf0338771c92d1b1d3848181a35429cdcf0f/src/docker/docker.cpp#L1070],
 the [Mesos Containerizer Docker 
Provisioner|https://github.com/apache/mesos/blob/df29bf0338771c92d1b1d3848181a35429cdcf0f/src/slave/containerizer/mesos/provisioner/docker/registry_puller.cpp#L206]
 do not support pulling / fetching docker containers by id.





[jira] [Commented] (MESOS-2281) Deprecate plain text Credential format.

2016-03-15 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196404#comment-15196404
 ] 

Cody Maloney commented on MESOS-2281:
-

The JSON format was added as part of MESOS-1391. The original author intended 
to deprecate the legacy credential format.

Original commit: 
https://github.com/apache/mesos/commit/2cb3761c6bfa80b956eaafde9c69eafaeac3deae
Review:
https://reviews.apache.org/r/2/

The JSON format should allow us to eliminate some code, as well as provide a 
more robust parser to ensure people don't read / write garbage (for example, a 
newline or space accidentally added to the name of one principal throws all the 
parsing off by a little bit, and things stop working properly).
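
The robustness point can be illustrated with two toy parsers in Python. Both functions are written only for this example and are not how Mesos reads credentials; the naive text parser is an assumption chosen to show how stray whitespace can silently end up inside a secret, while a JSON parser keeps field boundaries explicit.

```python
import json


def parse_json_credentials(text):
    """Parse {"credentials": [{"principal": ..., "secret": ...}]}."""
    data = json.loads(text)  # malformed input raises instead of guessing
    return {c["principal"]: c["secret"] for c in data["credentials"]}


def parse_plain_credentials(text):
    """Naive 'principal secret' per-line parser (illustrative only)."""
    credentials = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        principal, secret = line.split(" ", 1)
        credentials[principal] = secret  # trailing whitespace is kept!
    return credentials


if __name__ == "__main__":
    print(parse_json_credentials(
        '{"credentials": [{"principal": "sherman", "secret": "kitesurf"}]}'))
    # A stray trailing space silently becomes part of the secret.
    print(parse_plain_credentials("principal1 secret1 \nprincipal2 secret2\n"))
```

With JSON, whitespace between tokens is insignificant and anything inside the quotes is unambiguous, so this class of off-by-a-character mistakes is much harder to make.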

> Deprecate plain text Credential format.
> ---
>
> Key: MESOS-2281
> URL: https://issues.apache.org/jira/browse/MESOS-2281
> Project: Mesos
>  Issue Type: Improvement
>  Components: master, slave
>Affects Versions: 0.21.1
>Reporter: Cody Maloney
>Assignee: Jan Schlicht
>  Labels: mesosphere, security, tech-debt
>
> Currently two formats of credentials are supported: JSON
> {code}
> {
>   "credentials": [
>     {
>       "principal": "sherman",
>       "secret": "kitesurf"
>     }
>   ]
> }
> {code}
> And a newline-delimited file:
> {code}
> principal1 secret1
> principal2 secret2
> {code}
> We should deprecate the newline-delimited format and then remove support for it.





[jira] [Updated] (MESOS-2814) os::read should have one implementation

2016-02-18 Thread Cody Maloney (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Maloney updated MESOS-2814:

Description: 
In master there are currently three implementations of the function:
 
https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/os/read.hpp#L42
https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/os/read.hpp#L82
https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/os/read.hpp#L42

All of them have fairly radically different implementations (one uses C read(), 
one uses C++ ifstream, one uses C fopen()).

The read()-based one does an excess / unnecessary copy and buffer allocation: it 
reads into one temporary buffer, then copies into the result string. It would 
be more efficient to do a .reserve() on the result string and then fill the 
result buffer directly.

The ifstream/istreambuf_iterator one ignores the fact that an error can occur 
partway through reading a file: it doesn't detect the error or propagate it up.

The fopen() variant reads one newline-separated line at a time. This could 
produce interesting / unexpected results in the context of a binary file. It 
also causes glibc to insert null bytes at the end of the buffer it reads 
(excess computation). The result string isn't pre-allocated to the right 
length, meaning that most of the lines read will result in a realloc() and a 
lot of memory copies, which will be inefficient on large files.

  was:Currently stout os::read() has two radically different implementations 
when you give it a {{std::string}} vs. a {{const char *}}. Ideally these have 
one implementation that does things like intelligently size the buffer that it 
writes into rather than re-allocating repeatedly with every time it lengthens 
the string (resulting in copious copying). 


> os::read should have one implementation
> ---
>
> Key: MESOS-2814
> URL: https://issues.apache.org/jira/browse/MESOS-2814
> Project: Mesos
>  Issue Type: Improvement
>  Components: stout
>Reporter: Cody Maloney
>Assignee: Isabel Jimenez
>  Labels: mesosphere, tech-debt
>
> In master there are currently three implementations of the function:
>  
> https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/os/read.hpp#L42
> https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/os/read.hpp#L82
> https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/os/read.hpp#L42
> All of them have fairly radically different implementations (one uses C 
> read(), one uses C++ ifstream, one uses C fopen()).
> The read()-based one does an excess / unnecessary copy and buffer allocation: 
> it reads into one temporary buffer, then copies into the result string. It 
> would be more efficient to do a .reserve() on the result string and then fill 
> the result buffer directly.
> The ifstream/istreambuf_iterator one ignores the fact that an error can occur 
> partway through reading a file: it doesn't detect the error or propagate it up.
> The fopen() variant reads one newline-separated line at a time. This could 
> produce interesting / unexpected results in the context of a binary file. It 
> also causes glibc to insert null bytes at the end of the buffer it reads 
> (excess computation). The result string isn't pre-allocated to the right 
> length, meaning that most of the lines read will result in a realloc() and a 
> lot of memory copies, which will be inefficient on large files.





[jira] [Created] (MESOS-4645) Mesos agent shutdown on healtcheck timeout rather than lost and recovered

2016-02-10 Thread Cody Maloney (JIRA)
Cody Maloney created MESOS-4645:
---

 Summary: Mesos agent shutdown on healtcheck timeout rather than 
lost and recovered
 Key: MESOS-4645
 URL: https://issues.apache.org/jira/browse/MESOS-4645
 Project: Mesos
  Issue Type: Bug
Affects Versions: 0.27.1
Reporter: Cody Maloney


I expected slaves to have to be gone for the re-registration timeout before 
they'd be lost to the cluster, not just fail 5 health checks (failing the health 
checks indicates there is a network partition, not that the agent is gone for 
good and will never come back).

Is there some flag I'm missing here which I should be setting?

From my perspective I expect frameworks to not get offers for resources on 
agents which haven't been contacted recently (the framework wouldn't be able 
to launch anything on the agent). Once the re-registration period times out, 
the slave would be assumed completely lost and the tasks assumed terminated / 
able to be re-launched if desired. If an agent recovers between the 
health check timeout and the re-registration timeout, it should be able to 
re-join the cluster with its running tasks kept running.

Note: Some log lines have their start or tail truncated. The critical parts 
should all be there.

Master flags
{noformat}
Feb 11 00:22:19 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
I0211 00:22:19.690507  1362 master.cpp:369] Flags at startup: 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate="false" --authenticate_slaves="false" --authenticators="crammd5" 
--authorizers="local" --cluster="cody-cm52sd-2" --framework_sorter="drf" 
--help="false" --hostname_lookup="false" --initialize_driver_logging="true" 
--ip_discovery_command="/opt/mesosphere/bin/detect_ip" 
--log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" 
--logging_level="INFO" --max_slave_ping_timeouts="5" --port="5050" 
--quiet="false" --quorum="1" --recovery_slave_removal_limit="100%" 
--registry="replicated_log" --registry_fetch_timeout="1mins" 
--registry_store_timeout="5secs" --registry_strict="false" 
--roles="slave_public" --root_submissions="true" --slave_ping_timeout="15secs" 
--slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" 
--webui_dir="/opt/mesosphere/packages/mesos--4dd59ec6bde2052f6f2a0a0da415b6c92c3c418a/share/mesos/webui"
 --weights="slave_public=1" --work_dir="/var/lib/mesos/master" 
--zk="zk://127.0.0.1:2181/mesos" --zk_session_timeout="10secs"
{noformat}
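
For scale, the two windows implied by the master flags above can be worked out directly. This Python snippet is just arithmetic on the logged values; it assumes the agent is considered failed after {{max_slave_ping_timeouts}} consecutive pings that each time out after {{slave_ping_timeout}}, which is my reading of the flags rather than something the log itself states.

```python
from datetime import timedelta

# Values copied from the "Flags at startup" line of the master log.
slave_ping_timeout = timedelta(seconds=15)
max_slave_ping_timeouts = 5
slave_reregister_timeout = timedelta(minutes=10)

# Assumed model: removal after this many consecutive missed pings.
health_check_window = slave_ping_timeout * max_slave_ping_timeouts

print("health check window:", health_check_window)
print("re-registration timeout:", slave_reregister_timeout)
print("ratio:", slave_reregister_timeout / health_check_window)
```

Under that assumption the agent is dropped after 75 seconds, well short of the 10 minute re-registration timeout I expected to apply.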

Slave flags
{noformat}
Feb 11 00:34:13 ip-10-0-0-52.us-west-2.compute.internal mesos-slave[3914]: 
I0211 00:34:13.334395  3914 slave.cpp:192] Flags at startup: 
--appc_store_dir="/tmp/mesos/store/appc" --authenticatee="crammd5" 
--cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
--cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
--cgroups_root="mesos" --container_disk_watch_interval="15secs" 
--containerizers="docker,mesos" --default_role="*" 
--disk_watch_interval="1mins" --docker="docker" 
--docker_auth_server="auth.docker.io" --docker_auth_server_port="443" 
--docker_kill_orphans="true" 
--docker_local_archives_dir="/tmp/mesos/images/docker" --docker_puller="local" 
--docker_puller_timeout="60" --docker_registry="registry-1.docker.io" 
--docker_registry_port="443" --docker_remove_delay="1hrs" 
--docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" 
--docker_store_dir="/tmp/mesos/store/docker" 
--enforce_container_disk_quota="false" 
--executor_environment_variables="{"LD_LIBRARY_PATH":"\/opt\/mesosphere\/lib","PATH":"\/usr\/bin:\/bin","SASL_PATH":"\/opt\/mesosphere\/lib\/sasl2","SHELL":"\/usr\/bin\/bash"}"
 --executor_registration_timeout="5mins" 
--executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" 
--fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="2days" 
--gc_disk_headroom="0.1" --hadoop_home="" --help="false" 
--hostname_lookup="false" --image_provisioner_backend="copy" 
--initialize_driver_logging="true" 
--ip_discovery_command="/opt/mesosphere/bin/detect_ip" 
--isolation="cgroups/cpu,cgroups/mem" 
--launcher_dir="/opt/mesosphere/packages/mesos--4dd59ec6bde2052f6f2a0a0da415b6c92c3c418a/libexec/mesos"
 --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" 
--master="zk://leader.mesos:2181/mesos" 
--oversubscribed_resources_interval="15secs" --perf_duration="10secs" 
--perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" 
--quiet="false" --recover="reconnect" --recovery_timeout="15mins" 
--registration_backoff_factor="1secs" 
--resources="ports:[1025-2180,2182-3887,3889-5049,5052-8079,8082-8180,8182-32000]"
 --re
Feb 11 00:34:13 ip-10-0-0-52.us-west-2.compute.internal mesos-slave[3914]: 
vocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" 
--slave_subsystems="cpu,memory" --strict="true" --switch_user="true" 

[jira] [Commented] (MESOS-4612) Update to Zookeeper 3.4.7

2016-02-07 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15136547#comment-15136547
 ] 

Cody Maloney commented on MESOS-4612:
-

That code in CMake means that, depending on how you compile Mesos, you'll get 
very different behaviors (3.4.7 has several minor but critical behavior changes 
from 3.4.5). Mesos already patches ZooKeeper 3.4.5, patching 3.4.7 to compile 
under Windows (releases of ZooKeeper are unpredictable. Ideally we'd have 
ZooKeeper 3.5, which has a _ton_ of things improved, but that has an unknown 
release date at this point).

> Update to Zookeeper 3.4.7
> -
>
> Key: MESOS-4612
> URL: https://issues.apache.org/jira/browse/MESOS-4612
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Cody Maloney
>Assignee: haosdent
>  Labels: mesosphere, tech-debt
>
> See: http://zookeeper.apache.org/doc/r3.4.7/releasenotes.html for 
> improvements / bug fixes





[jira] [Created] (MESOS-4612) Update to Zookeeper 3.4.7

2016-02-05 Thread Cody Maloney (JIRA)
Cody Maloney created MESOS-4612:
---

 Summary: Update to Zookeeper 3.4.7
 Key: MESOS-4612
 URL: https://issues.apache.org/jira/browse/MESOS-4612
 Project: Mesos
  Issue Type: Improvement
Reporter: Cody Maloney


See: http://zookeeper.apache.org/doc/r3.4.7/releasenotes.html for improvements 
/ bug fixes





[jira] [Comment Edited] (MESOS-2814) os::read should have one implementation

2016-02-03 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131244#comment-15131244
 ] 

Cody Maloney edited comment on MESOS-2814 at 2/3/16 10:17 PM:
--

In master there are currently three implementations of the function:
 
https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/os/read.hpp#L42
https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/os/read.hpp#L82
https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/os/read.hpp#L42

All of them have fairly radically different implementations (one uses C read(), 
one uses C++ ifstream, one uses C fopen).

I'd argue read(fd, size) should be the underpinning of all three. When we're 
given a filename rather than an fd, we should open() the filename and then 
read() the whole thing (we could get the length by doing a stat of the file), 
or make a second implementation of read(int fd) which stops at EOF rather than 
after a fixed number of bytes.

All three overloads of the function can currently produce surprisingly 
different results because of their independent implementations.
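
To make the layering concrete, here is a rough sketch of the idea, written in 
Python (whose os module wraps the same POSIX calls) rather than in stout's C++. 
The names read_fd and read_path are hypothetical and are not the stout API: one 
fd-based loop does all the reading, and the path-based variant only opens the 
file, takes a size hint from fstat, and delegates.

```python
import os


def read_fd(fd, size=None, chunk=65536):
    """Read up to `size` bytes from fd, or until EOF when size is None."""
    parts = []
    remaining = size
    while remaining is None or remaining > 0:
        want = chunk if remaining is None else min(chunk, remaining)
        data = os.read(fd, want)
        if not data:  # an empty read means EOF
            break
        parts.append(data)
        if remaining is not None:
            remaining -= len(data)
    return b"".join(parts)


def read_path(path):
    """Path-based read built on top of read_fd (hypothetical helper)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # The fstat size is only a hint for the chunk size; the loop still
        # runs until EOF, so files that report size 0 are handled too.
        hint = os.fstat(fd).st_size
        return read_fd(fd, chunk=max(hint, 4096))
    finally:
        os.close(fd)
```

With this shape there is only one place where short reads and EOF are handled, 
so the overloads cannot drift apart in behavior.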


was (Author: cmaloney):
In master there are currently three implementations of the function:
 
https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/os/read.hpp#L42
https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/os/read.hpp#L82
https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/os/read.hpp#L42

All of them have fairly radically different implementations (one uses C read(), 
one uses C++ ifstream, one uses C fopen).


> os::read should have one implementation
> ---
>
> Key: MESOS-2814
> URL: https://issues.apache.org/jira/browse/MESOS-2814
> Project: Mesos
>  Issue Type: Improvement
>  Components: stout
>Reporter: Cody Maloney
>Assignee: Isabel Jimenez
>  Labels: mesosphere, tech-debt
>
> Currently stout os::read() has two radically different implementations when 
> you give it a {{std::string}} vs. a {{const char *}}. Ideally these have one 
> implementation that does things like intelligently size the buffer that it 
> writes into rather than re-allocating repeatedly every time it lengthens 
> the string (resulting in copious copying). 





[jira] [Commented] (MESOS-2814) os::read should have one implementation

2016-02-03 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131244#comment-15131244
 ] 

Cody Maloney commented on MESOS-2814:
-

In master there are currently three implementations of the function:
 
https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/os/read.hpp#L42
https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/os/read.hpp#L82
https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/os/read.hpp#L42

All of them have fairly radically different implementations (one uses C read(), 
one uses C++ ifstream, one uses C fopen).


> os::read should have one implementation
> ---
>
> Key: MESOS-2814
> URL: https://issues.apache.org/jira/browse/MESOS-2814
> Project: Mesos
>  Issue Type: Improvement
>  Components: stout
>Reporter: Cody Maloney
>Assignee: Isabel Jimenez
>  Labels: mesosphere, tech-debt
>
> Currently stout os::read() has two radically different implementations when 
> you give it a {{std::string}} vs. a {{const char *}}. Ideally these have one 
> implementation that does things like intelligently size the buffer that it 
> writes into rather than re-allocating repeatedly every time it lengthens 
> the string (resulting in copious copying). 





[jira] [Updated] (MESOS-416) Ensure master / slave do not get kernel OOM before executors, by setting oom_adj control.

2016-02-01 Thread Cody Maloney (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Maloney updated MESOS-416:
---
Labels: mesosphere twitter  (was: twitter)

> Ensure master / slave do not get kernel OOM before executors, by setting 
> oom_adj control.
> -
>
> Key: MESOS-416
> URL: https://issues.apache.org/jira/browse/MESOS-416
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Mahler
>  Labels: mesosphere, twitter
>
> We can adjust the /proc//oom_adj control during master / slave startup, 
> setting it to a low value to ensure we aren't killed first during an OOM.
> Relevant LWN article: http://lwn.net/Articles/317814/
> Also relevant: https://bugzilla.redhat.com/show_bug.cgi?id=239313
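
A minimal sketch of what the startup adjustment could look like, in Python for 
brevity. It assumes the newer oom_score_adj control (range -1000 to 1000), 
which replaces the deprecated oom_adj on current kernels. The function name and 
the proc_root parameter are hypothetical, and lowering the value normally 
requires elevated privileges.

```python
import os

# Kernel-defined bounds for the oom_score_adj control.
OOM_SCORE_ADJ_MIN = -1000
OOM_SCORE_ADJ_MAX = 1000


def set_oom_score_adj(value, pid="self", proc_root="/proc"):
    """Write `value` to <proc_root>/<pid>/oom_score_adj and return the path.

    proc_root is a parameter only so the sketch can be exercised against a
    scratch directory; a real daemon would always use /proc.
    """
    if not OOM_SCORE_ADJ_MIN <= value <= OOM_SCORE_ADJ_MAX:
        raise ValueError("oom_score_adj must be within [-1000, 1000]")
    path = os.path.join(proc_root, str(pid), "oom_score_adj")
    with open(path, "w") as control:
        control.write(f"{value}\n")
    return path
```

A master or slave would call something like set_oom_score_adj(-500) early in 
startup, so that executors and tasks are preferred as OOM victims.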





[jira] [Created] (MESOS-4578) docker run -c is deprecated

2016-02-01 Thread Cody Maloney (JIRA)
Cody Maloney created MESOS-4578:
---

 Summary: docker run -c is deprecated
 Key: MESOS-4578
 URL: https://issues.apache.org/jira/browse/MESOS-4578
 Project: Mesos
  Issue Type: Improvement
  Components: containerization, docker
Affects Versions: 0.26.0
 Environment: CoreOS 7
Reporter: Cody Maloney


When running mesos slave with the docker containerizer enabled on CoreOS 
766.4.0, launching docker containers results in the following in stderr:
{noformat}
Warning: '-c' is deprecated, it will be replaced by '--cpu-shares' soon. See 
usage.
{noformat}
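
For illustration, a sketch of building the docker run arguments with the long 
flag named in the warning. The helper name is hypothetical, and the 1024 
shares per CPU and minimum of 2 shares are assumptions about how cpus map to 
shares, not something taken from the containerizer source.

```python
CPU_SHARES_PER_CPU = 1024  # assumed conversion factor from cpus to shares
MIN_CPU_SHARES = 2         # assumed lower bound on shares


def docker_run_args(image, cpus, command=()):
    """Return an argv list that uses --cpu-shares instead of the short -c."""
    shares = max(int(cpus * CPU_SHARES_PER_CPU), MIN_CPU_SHARES)
    return ["docker", "run", "--cpu-shares", str(shares), image, *command]
```

For example, docker_run_args("busybox", 0.5, ["true"]) yields 
["docker", "run", "--cpu-shares", "512", "busybox", "true"].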





[jira] [Updated] (MESOS-4578) docker run -c is deprecated

2016-02-01 Thread Cody Maloney (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Maloney updated MESOS-4578:

Labels: mesosphere newbie  (was: mesosphere)

> docker run -c is deprecated
> ---
>
> Key: MESOS-4578
> URL: https://issues.apache.org/jira/browse/MESOS-4578
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization, docker
>Affects Versions: 0.26.0
> Environment: CoreOS 7
>Reporter: Cody Maloney
>  Labels: mesosphere, newbie
>
> When running mesos slave with the docker containerizer enabled on CoreOS 
> 766.4.0, launching docker containers results in the following in stderr:
> {noformat}
> Warning: '-c' is deprecated, it will be replaced by '--cpu-shares' soon. See 
> usage.
> {noformat}





[jira] [Created] (MESOS-4569) Re-Registered and Registered times are the same after agents re-register

2016-01-30 Thread Cody Maloney (JIRA)
Cody Maloney created MESOS-4569:
---

 Summary: Re-Registered and Registered times are the same after 
agents re-register
 Key: MESOS-4569
 URL: https://issues.apache.org/jira/browse/MESOS-4569
 Project: Mesos
  Issue Type: Bug
Affects Versions: 0.27.0
Reporter: Cody Maloney
Priority: Minor


When I launch a multi-master cluster with Mesos 0.27 and kill the leading 
master, all agents re-register with the new master, after which their 
"Registered" and "Re-registered" times are the same.





[jira] [Created] (MESOS-4546) Mesos Agents needs to re-resolve hosts in zk string on leader change / failure to connect

2016-01-28 Thread Cody Maloney (JIRA)
Cody Maloney created MESOS-4546:
---

 Summary: Mesos Agents needs to re-resolve hosts in zk string on 
leader change / failure to connect
 Key: MESOS-4546
 URL: https://issues.apache.org/jira/browse/MESOS-4546
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Cody Maloney
Assignee: Artem Harutyunyan
Priority: Blocker


Sample Mesos Agent log: https://gist.github.com/brndnmtthws/fb846fa988487250a809

Note, zookeeper has a function to change the list of servers at runtime: 
https://github.com/apache/zookeeper/blob/735ea78909e67c648a4978c8d31d63964986af73/src/c/src/zookeeper.c#L1207-L1232

This comes up when using an AWS AutoScalingGroup for managing the set of 
masters. 

When the agent comes up the first time, it resolves the hosts in the zk:// 
string. Once all the hosts that were in the original string have failed (each 
fails and is replaced by a new machine which has the same DNS name), the agent 
just keeps spinning in an internal loop, never re-resolving the DNS names.

Two solutions I see are:
1. Update the list of servers / re-resolve
2. Have the agent detect it hasn't connected recently, and kill itself (which 
will force a re-resolution when the agent starts back up)
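
Option 1 could look roughly like the sketch below, in Python rather than the 
agent's C++. All names are hypothetical, and try_connect stands in for whatever 
the real connection attempt is. The point is only that DNS resolution happens 
inside the retry loop instead of once at startup.

```python
import socket


def parse_zk_hosts(zk_url):
    """Split 'zk://h1:2181,h2:2181/path' into [(host, port), ...]."""
    if not zk_url.startswith("zk://"):
        raise ValueError("expected a zk:// URL")
    hosts_part = zk_url[len("zk://"):].split("/", 1)[0]
    hosts_part = hosts_part.rsplit("@", 1)[-1]  # drop optional user:pass@
    pairs = []
    for entry in hosts_part.split(","):
        host, _, port = entry.partition(":")
        pairs.append((host, int(port) if port else 2181))
    return pairs


def resolve(pairs, lookup=socket.getaddrinfo):
    """Resolve every host name to its current (ip, port) addresses."""
    addrs = []
    for host, port in pairs:
        try:
            infos = lookup(host, port, 0, socket.SOCK_STREAM)
        except OSError:
            continue  # name does not resolve right now; try the others
        for info in infos:
            addr = (info[4][0], info[4][1])
            if addr not in addrs:
                addrs.append(addr)
    return addrs


def connect_with_reresolve(zk_url, try_connect,
                           lookup=socket.getaddrinfo, max_rounds=3):
    """Re-resolve the names on every round, so replaced hosts are found."""
    pairs = parse_zk_hosts(zk_url)
    for _ in range(max_rounds):
        for addr in resolve(pairs, lookup):
            if try_connect(addr):
                return addr
    return None
```

A real agent would add backoff between rounds; that is left out here to keep 
the sketch short.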





[jira] [Updated] (MESOS-2718) Future created by State.names() throws an Illegal ExecutionException

2016-01-21 Thread Cody Maloney (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Maloney updated MESOS-2718:

Labels: mesosphere  (was: )

> Future created by State.names() throws an Illegal ExecutionException
> 
>
> Key: MESOS-2718
> URL: https://issues.apache.org/jira/browse/MESOS-2718
> Project: Mesos
>  Issue Type: Bug
>  Components: java api
>Affects Versions: 0.22.1
> Environment: OSX, Mesos 0.22.1
>Reporter: Matthias Veit
>  Labels: mesosphere
>
> During application startup, we call org.apache.mesos.state.State.names().
> This will return a java Future. 
> Everything is fine in the success case.
> In the error case, the future can throw either an InterruptedException, 
> ExecutionException or a RuntimeException.
> The ExecutionException indicates that the future was not successful.
> This is the text from the javadoc: 
> Exception thrown when attempting to retrieve the result of a task that 
> aborted by throwing an exception. This exception can be inspected using the 
> Throwable.getCause() method. See here: 
> https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutionException.html
> The ExecutionException thrown by mesos in the above method does not hold a 
> reference to the root cause, but returns a reference to itself as the cause 
> (ex == ex.getCause()). 
> ExecutionException really is a wrapper exception to indicate success or 
> failure of the java future and should always have a root cause. 
> With the current implementation we can't distinguish between a Future error 
> and an application error. Please always provide the exception cause.





[jira] [Updated] (MESOS-2281) Remove legacy Credential format

2016-01-10 Thread Cody Maloney (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Maloney updated MESOS-2281:

Labels: tech-debt  (was: )

> Remove legacy Credential format
> ---
>
> Key: MESOS-2281
> URL: https://issues.apache.org/jira/browse/MESOS-2281
> Project: Mesos
>  Issue Type: Improvement
>  Components: master, slave
>Affects Versions: 0.21.1
>Reporter: Cody Maloney
>  Labels: tech-debt
>
> Currently two formats of credentials are supported: JSON
> {code}
> {
>   "credentials": [
>     {
>       "principal": "sherman",
>       "secret": "kitesurf"
>     }
>   ]
> }
> {code}
> And a newline-delimited file:
> {code}
> principal1 secret1
> principal2 secret2
> {code}
> We should deprecate and remove support for the old format.
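
To show what supporting both formats implies, here is a small sketch of a 
loader that accepts either one. It is written in Python with a hypothetical 
function name and is not the actual Mesos parsing code. Removing the legacy 
format would amount to deleting the second branch.

```python
import json


def parse_credentials(text):
    """Return {principal: secret} from either credential file format."""
    stripped = text.lstrip()
    if stripped.startswith("{"):
        # JSON format: {"credentials": [{"principal": ..., "secret": ...}]}
        document = json.loads(stripped)
        return {c["principal"]: c["secret"] for c in document["credentials"]}
    # Legacy format: one "principal secret" pair per line.
    credentials = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 2:
            raise ValueError(f"malformed credential line: {line!r}")
        credentials[parts[0]] = parts[1]
    return credentials
```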





[jira] [Commented] (MESOS-4181) Change port range logging to different logging level.

2016-01-04 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082108#comment-15082108
 ] 

Cody Maloney commented on MESOS-4181:
-

Even with that change, the number of bytes to print grows non-linearly as you 
cut up the range. You'd need all the speed optimizations that went into the 
internal representation of ranges to go into the printing format as well...
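
One cheap alternative is to bound the printed form instead of optimizing it. 
The sketch below is Python with hypothetical names and an arbitrary default 
cap, not anything Mesos does today: it prints the first few ranges and 
summarizes the rest, so the output stays small no matter how fragmented the 
range set becomes.

```python
def format_ranges(ranges, max_ranges=8):
    """Render [(lo, hi), ...] as text, printing at most max_ranges entries."""
    shown = ", ".join(f"{lo}-{hi}" for lo, hi in ranges[:max_ranges])
    hidden = len(ranges) - max_ranges
    if hidden > 0:
        shown += f", ... ({hidden} more ranges)"
    return f"[{shown}]"
```

The trade-off is that the log line no longer carries the full allocation, so 
the complete list would need to be available somewhere else (for example at a 
more verbose logging level).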

> Change port range logging to different logging level.
> -
>
> Key: MESOS-4181
> URL: https://issues.apache.org/jira/browse/MESOS-4181
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.25.0
>Reporter: Cody Maloney
>Assignee: Joerg Schad
>  Labels: mesosphere, newbie
>
> Transforming from mesos' internal port range representation -> text is 
> non-linear in the number of bytes output. We end up with a massive amount of 
> log data like the following:
> {noformat}
> Dec 15 23:54:08 ip-10-0-7-60.us-west-2.compute.internal mesos-master[15919]: 
> I1215 23:51:58.891165 15925 hierarchical.hpp:1103] Recovered cpus(*):1e-05; 
> mem(*):10; ports(*):[5565-5565] (total: ports(*):[1025-2180, 2182-3887, 
> 3889-5049, 5052-8079, 8082-8180, 8182-32000]; cpus(*):4; mem(*):14019; 
> disk(*):32541, allocated: cpus(*):0.01815; ports(*):[1050-1050, 1092-1092, 
> 1094-1094, 1129-1129, 1132-1132, 1140-1140, 1177-1178, 1180-1180, 1192-1192, 
> 1205-1205, 1221-1221, 1308-1308, 1311-1311, 1323-1323, 1326-1326, 1335-1335, 
> 1365-1365, 1404-1404, 1412-1412, 1436-1436, 1455-1455, 1459-1459, 1472-1472, 
> 1477-1477, 1482-1482, 1491-1491, 1510-1510, 1551-1551, 1553-1553, 1559-1559, 
> 1573-1573, 1590-1590, 1592-1592, 1619-1619, 1635-1636, 1678-1678, 1738-1738, 
> 1742-1742, 1752-1752, 1770-1770, 1780-1782, 1790-1790, 1792-1792, 1799-1799, 
> 1804-1804, 1844-1844, 1852-1852, 1867-1867, 1899-1899, 1936-1936, 1945-1945, 
> 1954-1954, 2046-2046, 2055-2055, 2063-2063, 2070-2070, 2089-2089, 2104-2104, 
> 2117-2117, 2132-2132, 2173-2173, 2178-2178, 2188-2188, 2200-2200, 2218-2218, 
> 2223-2223, 2244-2244, 2248-2248, 2250-2250, 2270-2270, 2286-2286, 2302-2302, 
> 2332-2332, 2377-2377, 2397-2397, 2423-2423, 2435-2435, 2442-2442, 2448-2448, 
> 2477-2477, 2482-2482, 2522-2522, 2586-2586, 2594-2594, 2600-2600, 2602-2602, 
> 2643-2643, 2648-2648, 2659-2659, 2691-2691, 2716-2716, 2739-2739, 2794-2794, 
> 2802-2802, 2823-2823, 2831-2831, 2840-2840, 2848-2848, 2876-2876, 2894-2895, 
> 2900-2900, 2904-2904, 2912-2912, 2983-2983, 2991-2991, 2999-2999, 3011-3011, 
> 3025-3025, 3036-3036, 3041-3041, 3051-3051, 3074-3074, 3097-3097, 3107-3107, 
> 3121-3121, 3171-3171, 3176-3176, 3195-3195, 3197-3197, 3210-3210, 3221-3221, 
> 3234-3234, 3245-3245, 3250-3251, 3255-3255, 3270-3270, 3293-3293, 3298-3298, 
> 3312-3312, 3318-3318, 3325-3325, 3368-3368, 3379-3379, 3391-3391, 3412-3412, 
> 3414-3414, 3420-3420, 3492-3492, 3501-3501, 3538-3538, 3579-3579, 3631-3631, 
> 3680-3680, 3684-3684, 3695-3695, 3699-3699, 3738-3738, 3758-3758, 3793-3793, 
> 3808-3808, 3817-3817, 3854-3854, 3856-3856, 3900-3900, 3906-3906, 3909-3909, 
> 3912-3912, 3946-3946, 3956-3956, 3959-3959, 3963-3963, 3974-
> Dec 15 23:54:09 ip-10-0-7-60.us-west-2.compute.internal mesos-master[15919]: 
> 3974, 3981-3981, 3985-3985, 4134-4134, 4178-4178, 4206-4206, 4223-4223, 
> 4239-4239, 4245-4245, 4251-4251, 4262-4263, 4271-4271, 4308-4308, 4323-4323, 
> 4329-4329, 4368-4368, 4385-4385, 4404-4404, 4419-4419, 4430-4430, 4448-4448, 
> 4464-4464, 4481-4481, 4494-4494, 4499-4499, 4510-4510, 4534-4534, 4543-4543, 
> 4555-4555, 4561-4562, 4577-4577, 4601-4601, 4675-4675, 4722-4722, 4739-4739, 
> 4748-4748, 4752-4752, 4764-4764, 4771-4771, 4787-4787, 4827-4827, 4830-4830, 
> 4837-4837, 4848-4848, 4853-4853, 4879-4879, 4883-4883, 4897-4897, 4902-4902, 
> 4911-4911, 4940-4940, 4946-4946, 4957-4957, 4994-4994, 4996-4996, 5008-5008, 
> 5019-5019, 5043-5043, 5059-5059, 5109-5109, 5134-5135, 5157-5157, 5172-5172, 
> 5192-5192, 5211-5211, 5215-5215, 5234-5234, 5237-5237, 5246-5246, 5255-5255, 
> 5268-5268, 5311-5311, 5314-5314, 5316-5316, 5348-5348, 5391-5391, 5407-5407, 
> 5433-5433, 5446-5447, 5454-5454, 5456-5456, 5482-5482, 5514-5515, 5517-5517, 
> 5525-5525, 5542-5542, 5554-5554, 5581-5581, 5624-5624, 5647-5647, 5695-5695, 
> 5700-5700, 5703-5703, 5743-5743, 5747-5747, 5793-5793, 5850-5850, 5856-5856, 
> 5858-5858, 5899-5899, 5901-5901, 5940-5940, 5958-5958, 5962-5962, 5974-5974, 
> 5995-5995, 6000-6001, 6037-6037, 6053-6053, 6066-6066, 6078-6078, 6129-6129, 
> 6139-6139, 6160-6160, 6174-6174, 6193-6193, 6234-6234, 6263-6263, 6276-6276, 
> 6287-6287, 6292-6292, 6294-6294, 6296-6296, 6306-6307, 6333-6333, 6343-6343, 
> 6349-6349, 6377-6377, 6418-6418, 6454-6454, 6484-6484, 6496-6496, 6504-6504, 
> 6518-6518, 6589-6589, 6592-6592, 6606-6606, 6640-6640, 6713-6713, 6717-6717, 
> 

[jira] [Updated] (MESOS-4233) Logging is too verbose for sysadmins / syslog

2015-12-21 Thread Cody Maloney (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Maloney updated MESOS-4233:

Description: 
Currently mesos logs a lot. When launching a thousand tasks in the space of 10 
seconds it will print tens of thousands of log lines, overwhelming syslog 
(there is a max rate at which a process can send stuff over a unix socket) and 
not giving useful information to a sysadmin who cares about just the high-level 
activity and when something goes wrong.

Note mesos also blocks writing to its log locations, so when writing a lot of 
log messages, it can fill up the write buffer in the kernel, and be suspended 
until the syslog agent catches up reading from the socket (GLOG does a blocking 
fwrite to stderr). GLOG also has a big mutex around logging so only one thing 
logs at a time.

While for "internal debugging" it is useful to see things like "message went 
from internal component x to internal component y", from a sysadmin perspective 
I only care about the high-level actions taken (launched task for framework x, 
sent offer to framework y, got task failed from host z). Note those are what I'd 
expect at the "INFO" level. At the "WARNING" level I'd expect very little to be 
logged / almost nothing in normal operation. Just things like "WARN: Replicated 
log write took longer than expected". WARN would also get things like 
backtraces on crashes and abnormal exits / abort.

When trying to launch 3k+ tasks inside a second, mesos logging currently 
overwhelms syslog with 100k+ messages, many of which are thousands of bytes. 
Sysadmins expect to be able to use syslog to monitor basic events in their 
system. This is too much.

We can keep logging the messages to files, but the logging to stderr needs to 
be reduced significantly (stderr gets picked up and forwarded to syslog / 
central aggregation).

What I would like is if I can set the stderr logging level to be different / 
independent from the file logging level (Syslog giving the "sysadmin" 
aggregated overview, files useful for debugging in depth what happened in a 
cluster). A lot of what mesos currently logs at info is really debugging info / 
should show up as debug log level.

Some samples of mesos logging a lot more than a sysadmin would want / expect 
are attached, and some are below:

 - Every task gets printed multiple times for a basic launch:
{noformat}
Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: 
I1215 22:58:29.382644  1315 master.cpp:3248] Launching task 
envy.5b19a713-a37f-11e5-8b3e-0251692d6109 of framework 
5178f46d-71d6-422f-922c-5bbe82dff9cc- (marathon)
Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: 
I1215 22:58:29.382925  1315 master.hpp:176] Adding task 
envy.5b1958f2-a37f-11e5-8b3e-0251692d6109 with resources cpus(*):0.0001; 
mem(*):16; ports(*):[14047-14047]
{noformat}

 - Every task status update prints many log lines. Successful ones are part of 
normal operation and maybe should be logged at info / debug levels, but not 
shown to a sysadmin (just show when things fail, and maybe aggregate counters 
to convey the volume of work being done).
 - No log message should be really big / more than 1k characters (this would 
prevent the giant port list attached, and make oversized messages easily 
discoverable / bug fileable / fixable).


[jira] [Updated] (MESOS-4233) Logging is too verbose for sysadmins / syslog

2015-12-21 Thread Cody Maloney (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Maloney updated MESOS-4233:

Description: 
Currently mesos logs a lot. When launching a thousand tasks in the space of 10 
seconds it will print tens of thousands of log lines, overwhelming syslog 
(there is a max rate at which a process can send stuff over a unix socket) and 
not giving useful information to a sysadmin who cares about just the high-level 
activity and when something goes wrong.

Note mesos also blocks writing to its log locations, so when writing a lot of 
log messages, it can fill up the write buffer in the kernel, and be suspended 
until the syslog agent catches up reading from the socket (GLOG does a blocking 
fwrite to stderr). GLOG also has a big mutex around logging so only one thing 
logs at a time.

While for "internal debugging" it is useful to see things like "message went 
from internal component x to internal component y", from a sysadmin perspective 
I only care about the high-level actions taken (launched task for framework x, 
sent offer to framework y, got task failed from host z). Note those are what I'd 
expect at the "INFO" level. At the "WARNING" level I'd expect very little to be 
logged / almost nothing in normal operation. Just things like "WARN: Replicated 
log write took longer than expected". WARN would also get things like 
backtraces on crashes and abnormal exits / abort.

When trying to launch 3k+ tasks inside a second, mesos logging currently 
overwhelms syslog with 100k+ messages, many of which are thousands of bytes. 
Sysadmins expect to be able to use syslog to monitor basic events in their 
system. This is too much.

We can keep logging the messages to files, but the logging to stderr needs to 
be reduced significantly (stderr gets picked up and forwarded to syslog / 
central aggregation).

What I would like is if I can set the stderr logging level to be different / 
independent from the file logging level (Syslog giving the "sysadmin" 
aggregated overview, files useful for debugging in depth what happened in a 
cluster). A lot of what mesos currently logs at info is really debugging info / 
should show up as debug log level.

Some samples of mesos logging a lot more than a sysadmin would want / expect 
are attached, and some are below:

 - Every task gets printed multiple times for a basic launch:
{noformat}
Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: 
I1215 22:58:29.382644  1315 master.cpp:3248] Launching task 
envy.5b19a713-a37f-11e5-8b3e-0251692d6109 of framework 
5178f46d-71d6-422f-922c-5bbe82dff9cc- (marathon)
Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: 
I1215 22:58:29.382925  1315 master.hpp:176] Adding task 
envy.5b1958f2-a37f-11e5-8b3e-0251692d6109 with resources cpus(*):0.0001; 
mem(*):16; ports(*):[14047-14047]
{noformat}

 - Every task status update prints many log lines. Successful ones are part of 
normal operation and maybe should be logged at info / debug levels, but not 
shown to a sysadmin (just show when things fail, and maybe aggregate counters 
to convey the volume of work being done).
 - No log message should be really big / more than 1k characters (this would 
prevent the giant port list attached, and make oversized messages easily 
discoverable / bug fileable / fixable).


[jira] [Created] (MESOS-4233) Logging is too verbose for sysadmins / syslog

2015-12-21 Thread Cody Maloney (JIRA)
Cody Maloney created MESOS-4233:
---

 Summary: Logging is too verbose for sysadmins / syslog
 Key: MESOS-4233
 URL: https://issues.apache.org/jira/browse/MESOS-4233
 Project: Mesos
  Issue Type: Epic
Reporter: Cody Maloney


Currently mesos logs a lot. When launching a thousand tasks in the space of 10 
seconds it will print tens of thousands of log lines, overwhelming syslog 
(there is a max rate at which a process can send stuff over a unix socket) and 
not giving useful information to a sysadmin who cares about just the high-level 
activity and when something goes wrong.

Note mesos also blocks writing to its log locations, so when writing a lot of 
log messages, it can fill up the write buffer in the kernel, and be suspended 
until the syslog agent catches up reading from the socket (GLOG does a blocking 
fwrite to stderr). GLOG also has a big mutex around logging so only one thing 
logs at a time.

While for "internal debugging" it is useful to see things like "message went 
from internal component x to internal component y", from a sysadmin perspective 
I only care about the high-level actions taken (launched task for framework x, 
sent offer to framework y, got task failed from host z). Note those are what I'd 
expect at the "INFO" level. At the "WARNING" level I'd expect very little to be 
logged / almost nothing in normal operation. Just things like "WARN: Replicated 
log write took longer than expected". WARN would also get things like 
backtraces on crashes and abnormal exits / abort.

When trying to launch 3k+ tasks inside a second, mesos logging currently 
overwhelms syslog with 100k+ messages, many of which are thousands of bytes. 
Sysadmins expect to be able to use syslog to monitor basic events in their 
system. This is too much.

We can keep logging the messages to files, but the logging to stderr needs to 
be reduced significantly (stderr gets picked up and forwarded to syslog / 
central aggregation).

What I would like is if I can set the stderr logging level to be different / 
independent from the file logging level (Syslog giving the "sysadmin" 
aggregated overview, files useful for debugging in depth what happened in a 
cluster). A lot of what mesos currently logs at info is really debugging info / 
should show up as debug log level.
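
As an illustration of the split I have in mind, here is a sketch using Python's 
standard logging module rather than GLOG. The function name and default levels 
are only assumptions for the example. One logger feeds two sinks with 
independent thresholds: a quiet one standing in for stderr / syslog, and a 
detailed one standing in for the log files.

```python
import io
import logging


def build_logger(name, console_stream, detail_stream,
                 console_level=logging.WARNING, detail_level=logging.INFO):
    """Attach two sinks with independent thresholds to one logger."""
    logger = logging.getLogger(name)
    logger.setLevel(min(console_level, detail_level))
    logger.propagate = False
    logger.handlers.clear()
    sinks = ((console_stream, console_level), (detail_stream, detail_level))
    for stream, level in sinks:
        handler = logging.StreamHandler(stream)
        handler.setLevel(level)  # each sink filters on its own level
        handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


# Usage: the "console" sink only sees the warning, the detail sink sees both.
console, detail = io.StringIO(), io.StringIO()
log = build_logger("mesos-sketch", console, detail)
log.info("Launching task t1 of framework f1")
log.warning("Replicated log write took longer than expected")
```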

Some samples of mesos logging a lot more than a sysadmin would want / expect 
are attached, and some are below:

Every task gets printed multiple times for a basic launch:
{noformat}
Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: 
I1215 22:58:29.382644  1315 master.cpp:3248] Launching task 
envy.5b19a713-a37f-11e5-8b3e-0251692d6109 of framework 
5178f46d-71d6-422f-922c-5bbe82dff9cc- (marathon)
Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: 
I1215 22:58:29.382925  1315 master.hpp:176] Adding task 
envy.5b1958f2-a37f-11e5-8b3e-0251692d6109 with resources cpus(*):0.0001; 
mem(*):16; ports(*):[14047-14047]
{noformat}

Every task status update prints many log lines. Successful ones are part of 
normal operation and maybe should be logged at info / debug levels, but not 
shown to a sysadmin (just show when things fail, and maybe aggregate counters 
to convey the volume of work being done).

No log message should be really big / more than 1k characters (this would 
prevent the giant port list attached, and make oversized messages easily 
discoverable / bug fileable / fixable).
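
The size cap could be enforced in one place. A sketch with Python's standard 
logging module follows; the class name and the 1024 character limit are 
assumptions for the example, and this is not a GLOG feature. A filter 
truncates any oversized message and records how much was dropped, which also 
makes the offending call sites easy to find.

```python
import logging

MAX_CHARS = 1024  # assumed cap, matching the "1k characters" suggestion


class TruncatingFilter(logging.Filter):
    """Cut log messages longer than `limit` and note how much was removed."""

    def __init__(self, limit=MAX_CHARS):
        super().__init__()
        self.limit = limit

    def filter(self, record):
        message = record.getMessage()
        if len(message) > self.limit:
            dropped = len(message) - self.limit
            record.msg = f"{message[:self.limit]}... [truncated {dropped} chars]"
            record.args = None  # message is already fully formatted
        return True  # never drop the record, only shorten it
```

Attaching it is a single call, e.g. handler.addFilter(TruncatingFilter()) on 
whichever handler forwards to syslog.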





[jira] [Updated] (MESOS-4233) Logging is too verbose for sysadmins / syslog

2015-12-21 Thread Cody Maloney (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Maloney updated MESOS-4233:

Attachment: giant_port_range_logging

> Logging is too verbose for sysadmins / syslog
> -
>
> Key: MESOS-4233
> URL: https://issues.apache.org/jira/browse/MESOS-4233
> Project: Mesos
>  Issue Type: Epic
>Reporter: Cody Maloney
>  Labels: mesosphere
> Attachments: giant_port_range_logging
>
>
> Currently mesos logs a lot. When launching a thousand tasks in the space of 
> 10 seconds it will print tens of thousands of log lines, overwhelming syslog 
> (there is a max rate at which a process can send stuff over a unix socket) 
> and not giving useful information to a sysadmin who cares about just the 
> high-level activity and when something goes wrong.
> Note mesos also blocks writing to its log locations, so when writing a lot of 
> log messages, it can fill up the write buffer in the kernel, and be suspended 
> until the syslog agent catches up reading from the socket (GLOG does a 
> blocking fwrite to stderr). GLOG also has a big mutex around logging so only 
> one thing logs at a time.
> While for "internal debugging" it is useful to see things like "message went 
> from internal component x to internal component y", from a sysadmin 
> perspective I only care about the high-level actions taken (launched task for 
> framework x, sent offer to framework y, got task failed from host z). Note 
> those are what I'd expect at the "INFO" level. At the "WARNING" level I'd 
> expect very little to be logged / almost nothing in normal operation. Just 
> things like "WARN: Replicated log write took longer than expected". WARN 
> would also get things like backtraces on crashes and abnormal exits / abort.
> When trying to launch 3k+ tasks inside a second, mesos logging currently 
> overwhelms syslog with 100k+ messages, many of which are thousands of bytes. 
> Sysadmins expect to be able to use syslog to monitor basic events in their 
> system. This is too much.
> We can keep logging the messages to files, but the logging to stderr needs to 
> be reduced significantly (stderr gets picked up and forwarded to syslog / 
> central aggregation).
> What I would like is if I can set the stderr logging level to be different / 
> independent from the file logging level (Syslog giving the "sysadmin" 
> aggregated overview, files useful for debugging in depth what happened in a 
> cluster). A lot of what mesos currently logs at info is really debugging info 
> / should show up as debug log level.
> Some samples of mesos logging a lot more than a sysadmin would want / expect 
> are attached, and some are below:
> Every task gets printed multiple times for a basic launch:
> {noformat}
> Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal 
> mesos-master[1311]: I1215 22:58:29.382644  1315 master.cpp:3248] Launching 
> task envy.5b19a713-a37f-11e5-8b3e-0251692d6109 of framework 
> 5178f46d-71d6-422f-922c-5bbe82dff9cc- (marathon)
> Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: 
> I1215 22:58:29.382925  1315 master.hpp:176] Adding task 
> envy.5b1958f2-a37f-11e5-8b3e-0251692d6109 with resources cpus(*):0.0001; 
> mem(*):16; ports(*):[14047-14047]
> {noformat}
> Every task status update prints many log lines, successful ones are part of 
> normal operation and maybe should be logged at info / debug levels, but not 
> to a sysadmin (Just show when things fail, and maybe aggregate counters to 
> tell of the volume of working)
> No log messages should be really big / more than 1k characters (Would prevent 
> the giant port list attached, and make such cases easily discoverable / bug 
> filable / fixable) 
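
As a rough illustration of the size limit suggested above, here is a minimal C++ sketch of summarizing a port list before it is logged and of capping any single log line. PortRange, summarizePorts and capLogLine are hypothetical names made up for this example; none of them are Mesos or glog APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a port range (begin, end); not the Mesos type.
using PortRange = std::pair<unsigned, unsigned>;

// Print at most `maxShown` ranges, then a count of the ranges left out.
std::string summarizePorts(
    const std::vector<PortRange>& ranges, std::size_t maxShown)
{
  std::ostringstream out;
  out << "ports(*):[";
  const std::size_t shown =
    ranges.size() < maxShown ? ranges.size() : maxShown;
  for (std::size_t i = 0; i < shown; ++i) {
    if (i > 0) out << ", ";
    out << ranges[i].first << "-" << ranges[i].second;
  }
  if (ranges.size() > shown) {
    if (shown > 0) out << ", ";
    out << "... (+" << (ranges.size() - shown) << " more)";
  }
  out << "]";
  return out.str();
}

// Hard cap on a single log line; truncated lines carry a visible marker so
// an oversized message is easy to spot and report.
std::string capLogLine(const std::string& line, std::size_t maxLen)
{
  const std::string marker = "...[truncated]";
  assert(maxLen > marker.size());
  if (line.size() <= maxLen) return line;
  return line.substr(0, maxLen - marker.size()) + marker;
}
```

With a cap of 1000 characters, the allocator line from the attached samples would be cut off with the marker instead of spilling across several syslog records.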





[jira] [Created] (MESOS-4181) Don't log port ranges

2015-12-15 Thread Cody Maloney (JIRA)
Cody Maloney created MESOS-4181:
---

 Summary: Don't log port ranges
 Key: MESOS-4181
 URL: https://issues.apache.org/jira/browse/MESOS-4181
 Project: Mesos
  Issue Type: Bug
  Components: master
Affects Versions: 0.25.0
Reporter: Cody Maloney


Transforming from mesos' internal port range representation -> text is 
non-linear in the number of bytes output. We end up with a massive amount of 
log data like the following:
{noformat}
Dec 15 23:54:08 ip-10-0-7-60.us-west-2.compute.internal mesos-master[15919]: 
I1215 23:51:58.891165 15925 hierarchical.hpp:1103] Recovered cpus(*):1e-05; 
mem(*):10; ports(*):[5565-5565] (total: ports(*):[1025-2180, 2182-3887, 
3889-5049, 5052-8079, 8082-8180, 8182-32000]; cpus(*):4; mem(*):14019; 
disk(*):32541, allocated: cpus(*):0.01815; ports(*):[1050-1050, 1092-1092, 
1094-1094, 1129-1129, 1132-1132, 1140-1140, 1177-1178, 1180-1180, 1192-1192, 
1205-1205, 1221-1221, 1308-1308, 1311-1311, 1323-1323, 1326-1326, 1335-1335, 
1365-1365, 1404-1404, 1412-1412, 1436-1436, 1455-1455, 1459-1459, 1472-1472, 
1477-1477, 1482-1482, 1491-1491, 1510-1510, 1551-1551, 1553-1553, 1559-1559, 
1573-1573, 1590-1590, 1592-1592, 1619-1619, 1635-1636, 1678-1678, 1738-1738, 
1742-1742, 1752-1752, 1770-1770, 1780-1782, 1790-1790, 1792-1792, 1799-1799, 
1804-1804, 1844-1844, 1852-1852, 1867-1867, 1899-1899, 1936-1936, 1945-1945, 
1954-1954, 2046-2046, 2055-2055, 2063-2063, 2070-2070, 2089-2089, 2104-2104, 
2117-2117, 2132-2132, 2173-2173, 2178-2178, 2188-2188, 2200-2200, 2218-2218, 
2223-2223, 2244-2244, 2248-2248, 2250-2250, 2270-2270, 2286-2286, 2302-2302, 
2332-2332, 2377-2377, 2397-2397, 2423-2423, 2435-2435, 2442-2442, 2448-2448, 
2477-2477, 2482-2482, 2522-2522, 2586-2586, 2594-2594, 2600-2600, 2602-2602, 
2643-2643, 2648-2648, 2659-2659, 2691-2691, 2716-2716, 2739-2739, 2794-2794, 
2802-2802, 2823-2823, 2831-2831, 2840-2840, 2848-2848, 2876-2876, 2894-2895, 
2900-2900, 2904-2904, 2912-2912, 2983-2983, 2991-2991, 2999-2999, 3011-3011, 
3025-3025, 3036-3036, 3041-3041, 3051-3051, 3074-3074, 3097-3097, 3107-3107, 
3121-3121, 3171-3171, 3176-3176, 3195-3195, 3197-3197, 3210-3210, 3221-3221, 
3234-3234, 3245-3245, 3250-3251, 3255-3255, 3270-3270, 3293-3293, 3298-3298, 
3312-3312, 3318-3318, 3325-3325, 3368-3368, 3379-3379, 3391-3391, 3412-3412, 
3414-3414, 3420-3420, 3492-3492, 3501-3501, 3538-3538, 3579-3579, 3631-3631, 
3680-3680, 3684-3684, 3695-3695, 3699-3699, 3738-3738, 3758-3758, 3793-3793, 
3808-3808, 3817-3817, 3854-3854, 3856-3856, 3900-3900, 3906-3906, 3909-3909, 
3912-3912, 3946-3946, 3956-3956, 3959-3959, 3963-3963, 3974-
Dec 15 23:54:09 ip-10-0-7-60.us-west-2.compute.internal mesos-master[15919]: 
3974, 3981-3981, 3985-3985, 4134-4134, 4178-4178, 4206-4206, 4223-4223, 
4239-4239, 4245-4245, 4251-4251, 4262-4263, 4271-4271, 4308-4308, 4323-4323, 
4329-4329, 4368-4368, 4385-4385, 4404-4404, 4419-4419, 4430-4430, 4448-4448, 
4464-4464, 4481-4481, 4494-4494, 4499-4499, 4510-4510, 4534-4534, 4543-4543, 
4555-4555, 4561-4562, 4577-4577, 4601-4601, 4675-4675, 4722-4722, 4739-4739, 
4748-4748, 4752-4752, 4764-4764, 4771-4771, 4787-4787, 4827-4827, 4830-4830, 
4837-4837, 4848-4848, 4853-4853, 4879-4879, 4883-4883, 4897-4897, 4902-4902, 
4911-4911, 4940-4940, 4946-4946, 4957-4957, 4994-4994, 4996-4996, 5008-5008, 
5019-5019, 5043-5043, 5059-5059, 5109-5109, 5134-5135, 5157-5157, 5172-5172, 
5192-5192, 5211-5211, 5215-5215, 5234-5234, 5237-5237, 5246-5246, 5255-5255, 
5268-5268, 5311-5311, 5314-5314, 5316-5316, 5348-5348, 5391-5391, 5407-5407, 
5433-5433, 5446-5447, 5454-5454, 5456-5456, 5482-5482, 5514-5515, 5517-5517, 
5525-5525, 5542-5542, 5554-5554, 5581-5581, 5624-5624, 5647-5647, 5695-5695, 
5700-5700, 5703-5703, 5743-5743, 5747-5747, 5793-5793, 5850-5850, 5856-5856, 
5858-5858, 5899-5899, 5901-5901, 5940-5940, 5958-5958, 5962-5962, 5974-5974, 
5995-5995, 6000-6001, 6037-6037, 6053-6053, 6066-6066, 6078-6078, 6129-6129, 
6139-6139, 6160-6160, 6174-6174, 6193-6193, 6234-6234, 6263-6263, 6276-6276, 
6287-6287, 6292-6292, 6294-6294, 6296-6296, 6306-6307, 6333-6333, 6343-6343, 
6349-6349, 6377-6377, 6418-6418, 6454-6454, 6484-6484, 6496-6496, 6504-6504, 
6518-6518, 6589-6589, 6592-6592, 6606-6606, 6640-6640, 6713-6713, 6717-6717, 
6738-6738, 6757-6757, 6765-6765, 6778-6778, 6792-6792, 6798-6798, 6811-6811, 
6815-6815, 6828-6828, 6838-6839, 6856-6856, 6868-6868, 6877-6877, 6892-6892, 
6903-6903, 6908-6908, 6943-6943, 6973-6973, 6977-6977, 7003-7003, 7019-7019, 
7021-7021, 7031-7031, 7034-7034, 7038-7038, 7052-7052, 7060-7060, 7097-7097, 
7124-7124, 7151-7152, 7169-7169, 7171-7171, 7200-7200, 7204-7204, 7246-7246, 
7250-7250, 7292-7292, 7326-7326, 7347-7347, 7363-7363, 7369-7369, 7401-7401, 
7407-7407, 7421-7421, 7436-7436, 7447-7447, 7458-74
Dec 15 23:54:09 ip-10-0-7-60.us-west-2.compute.internal mesos-master[15919]: 
58, 7475-7475, 7477-7477, 7502-7502, 7531-7531, 

[jira] [Commented] (MESOS-1806) Substituting etcd for Zookeeper

2015-11-16 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007284#comment-15007284
 ] 

Cody Maloney commented on MESOS-1806:
-

Yes, it is a hard blocker. Restarting every machine in a large cluster when an 
etcd node goes down is going to result in a lot of cluster badness / thundering 
stampede.

> Substituting etcd for Zookeeper
> ---
>
> Key: MESOS-1806
> URL: https://issues.apache.org/jira/browse/MESOS-1806
> Project: Mesos
>  Issue Type: Task
>  Components: leader election
>Reporter: Ed Ropple
>Assignee: Shuai Lin
>Priority: Minor
>
>eropple: Could you also file a new JIRA for Mesos to drop ZK 
> in favor of etcd or ReplicatedLog? Would love to get some momentum going on 
> that one.
> --
> Consider it filed. =)





[jira] [Commented] (MESOS-3836) `--executor-environment-variables` may not apply to docker containers

2015-11-08 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996140#comment-14996140
 ] 

Cody Maloney commented on MESOS-3836:
-

Any solution which comes up here is going to land (at the soonest) in Mesos 
0.27. That would likely mean not the next DCOS, but the one after, so this is 
all about mid term planning at this point.

When I say fully containerized I mean every executor should adhere to the same 
isolators that tasks do. A framework shouldn't be able to write a custom 
executor which uses more than its share of a CPU when CPU isolation is enabled, 
or more disk than its disk quota allows / than the framework has accepted 
offers for on the host.

> `--executor-environment-variables` may not apply to docker containers
> -
>
> Key: MESOS-3836
> URL: https://issues.apache.org/jira/browse/MESOS-3836
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, slave
>Affects Versions: 0.25.0
> Environment: Mesos 0.25.0 configured with 
> --executor-environment-variables
>Reporter: Cody Maloney
>Assignee: Marco Massenzio
>Priority: Minor
>  Labels: mesosphere
>
> In our use case we set {{PATH}} as part of the 
> {{\-\-executor_environment_variables}} in order to limit what binaries all 
> tasks which are launched via Mesos have readily available to them, making it 
> much harder for people launching tasks on mesos to accidentally depend on 
> something which isn't part of the "guaranteed" environment / platform.
> Docker containers can be used as executors, and have a fully isolated 
> filesystem. For executors which run in docker containers setting {{PATH}}  to 
> our path on the host filesystem may potentially break the docker container.
> The previous code of only copying across environment variables when 
> {{includeOsEnvironment}} is set dealt with this 
> (https://github.com/apache/mesos/blob/56510afe149758a69a5a714dfaab16111dd0d9c3/src/slave/containerizer/containerizer.cpp#L267).
> If {{includeOsEnvironment}} is set then we should copy across the current 
> {{\-\-executor_environment_variables}}. If it isn't, then 
> {{\-\-executor_environment_variables}} shouldn't be used at all.
> Another option which could be useful is to make it so that there are two sets 
> of "Executor Environment Variables". One for when {{includeOsEnvironment}} is 
> set, and one for when it is not.





[jira] [Commented] (MESOS-3836) `--executor-environment-variables` may not apply to docker containers

2015-11-08 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996121#comment-14996121
 ] 

Cody Maloney commented on MESOS-3836:
-

From what we've seen in practice, every task gets whatever environment 
variables were set on the executor. Every marathon app task got every 
environment variable that mesos-slave had unless the marathon app definition 
explicitly overrode it.

Executors are in many ways like tasks and should be fully containerized like 
them, which is a direction Mesos has been moving in for a while (right now they 
aren't isolated at all, and having custom executors which are custom code 
running without isolation is not a great thing).

Arguably the model should be that no containerized task sees anything except 
what it is explicitly told to see. Things shouldn't leak through from the host 
whatsoever. Mesos tells the tasks the couple of things that they are allowed to 
use. In the case of filesystem isolation (such as docker does), Mesos doesn't 
tell the task about special filesystem things unless it also adds a volume 
mount for them (rkt / appc may introduce another root filesystem isolation).

From a DCOS perspective what we really want is for all tasks to be fully host 
isolated, so they all run filesystem isolated / even mesos native 
containerizer tasks run in effectively a chroot with very limited files and very 
limited environment variables set, so we only expose a small interface which 
we have to watch and version.

> `--executor-environment-variables` may not apply to docker containers
> -
>
> Key: MESOS-3836
> URL: https://issues.apache.org/jira/browse/MESOS-3836
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, slave
>Affects Versions: 0.25.0
> Environment: Mesos 0.25.0 configured with 
> --executor-environment-variables
>Reporter: Cody Maloney
>Assignee: Marco Massenzio
>Priority: Minor
>  Labels: mesosphere
>
> In our use case we set {{PATH}} as part of the 
> {{\-\-executor_environment_variables}} in order to limit what binaries all 
> tasks which are launched via Mesos have readily available to them, making it 
> much harder for people launching tasks on mesos to accidentally depend on 
> something which isn't part of the "guaranteed" environment / platform.
> Docker containers can be used as executors, and have a fully isolated 
> filesystem. For executors which run in docker containers setting {{PATH}}  to 
> our path on the host filesystem may potentially break the docker container.
> The previous code of only copying across environment variables when 
> {{includeOsEnvironment}} is set dealt with this 
> (https://github.com/apache/mesos/blob/56510afe149758a69a5a714dfaab16111dd0d9c3/src/slave/containerizer/containerizer.cpp#L267).
> If {{includeOsEnvironment}} is set then we should copy across the current 
> {{\-\-executor_environment_variables}}. If it isn't, then 
> {{\-\-executor_environment_variables}} shouldn't be used at all.
> Another option which could be useful is to make it so that there are two sets 
> of "Executor Environment Variables". One for when {{includeOsEnvironment}} is 
> set, and one for when it is not.





[jira] [Commented] (MESOS-3740) LIBPROCESS_IP not passed to Docker containers

2015-11-06 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994319#comment-14994319
 ] 

Cody Maloney commented on MESOS-3740:
-

The {{--executor-environment-variables}} value is given directly to executors, 
and currently gets inherited from the executor by all tasks the executors 
launch. We can't do just one generic flag of 
{{--docker-task-environment-variables}} which includes LIBPROCESS_IP, because 
LIBPROCESS_IP is something that Mesos can / will calculate (either using its 
classic reverse lookup behavior or --ip-detect-script). So I think that one 
still needs to be special-cased so that we always pass it through, which solves 
the present problem.

Adding a {{--docker-environment-variables}} which applies to all executors and 
tasks launched with the docker containerizer could be useful in some 
circumstances (although within DCOS we have no need to pass special / extra / 
explicit environment variables to docker containers). The 
{{--docker-environment-variables}} still wouldn't be able to capture 
LIBPROCESS_IP though.
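
To make the special-casing concrete, here is a minimal C++ sketch under stated assumptions: the Environment alias and dockerTaskEnvironment are hypothetical names for this example, not Mesos code, and the detected IP is simply passed in as a string. The statically configured variables are copied first, and the computed LIBPROCESS_IP is always set afterwards, so a static flag can never shadow it.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical representation of an environment: variable name -> value.
using Environment = std::map<std::string, std::string>;

// Build the environment for a docker task. `fromFlag` holds the variables an
// operator configured statically; `detectedIp` is the address the agent
// computed for itself (assumed to be available to the caller).
Environment dockerTaskEnvironment(
    const Environment& fromFlag, const std::string& detectedIp)
{
  assert(!detectedIp.empty());
  Environment env = fromFlag;
  // Special case: always pass the computed value through, overriding any
  // static value that happened to be configured.
  env["LIBPROCESS_IP"] = detectedIp;
  return env;
}
```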

> LIBPROCESS_IP not passed to Docker containers
> -
>
> Key: MESOS-3740
> URL: https://issues.apache.org/jira/browse/MESOS-3740
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0
> Environment: Mesos 0.24.1
>Reporter: Cody Maloney
>Assignee: Michael Park
>  Labels: mesosphere
>
> Docker containers aren't currently passed all the same environment variables 
> that Mesos Containerizer tasks are. See: 
> https://github.com/apache/mesos/blob/master/src/slave/containerizer/containerizer.cpp#L254
>  for all the environment variables explicitly set for mesos containers.
> While some of them don't necessarily make sense for docker containers, when 
> the docker container has a libprocess process inside of it (a mesos framework 
> scheduler) and is using {{--net=host}}, the task needs to have LIBPROCESS_IP 
> set; otherwise the same sort of problems that happen because of MESOS-3553 can 
> happen (libprocess will try to guess the machine's IP address, with likely bad 
> results in a number of operating environments).





[jira] [Updated] (MESOS-3751) MESOS_NATIVE_JAVA_LIBRARY not set on MesosContainerizer tasks with --executor_environment_variables

2015-11-04 Thread Cody Maloney (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Maloney updated MESOS-3751:

Fix Version/s: 0.26.0

> MESOS_NATIVE_JAVA_LIBRARY not set on MesosContainerizer tasks with 
> --executor_environment_variables
> ---
>
> Key: MESOS-3751
> URL: https://issues.apache.org/jira/browse/MESOS-3751
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 0.24.1, 0.25.0
>Reporter: Cody Maloney
>Assignee: Gilbert Song
>  Labels: mesosphere, newbie
> Fix For: 0.26.0
>
>
> When using --executor_environment_variables, and having 
> MESOS_NATIVE_JAVA_LIBRARY in the environment of mesos-slave, the mesos 
> containerizer does not set MESOS_NATIVE_JAVA_LIBRARY itself.
> Relevant code: 
> https://github.com/apache/mesos/blob/14f7967ef307f3d98e3a4b93d92d6b3a56399b20/src/slave/containerizer/containerizer.cpp#L281
> It sees that the variable is in the mesos-slave's environment (os::getenv), 
> rather than checking if it is set in the environment variable set.





[jira] [Commented] (MESOS-3740) LIBPROCESS_IP not passed to Docker containers

2015-11-01 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984461#comment-14984461
 ] 

Cody Maloney commented on MESOS-3740:
-

This came up when trying to launch a mesos framework inside of a docker 
container. The framework used libmesos, and that libmesos couldn't figure out 
what IP to use (the machine didn't have a hostname, and even if it did, the 
hostname may not resolve to the right IP address the mesos framework inside the 
docker container should announce as its own IP, due to something like having 
multiple addresses on the machine, or running in an IP-per-container type 
environment).

> LIBPROCESS_IP not passed to Docker containers
> -
>
> Key: MESOS-3740
> URL: https://issues.apache.org/jira/browse/MESOS-3740
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0
> Environment: Mesos 0.24.1
>Reporter: Cody Maloney
>Assignee: Michael Park
>  Labels: mesosphere
>
> Docker containers aren't currently passed all the same environment variables 
> that Mesos Containerizer tasks are. See: 
> https://github.com/apache/mesos/blob/master/src/slave/containerizer/containerizer.cpp#L254
>  for all the environment variables explicitly set for mesos containers.
> While some of them don't necessarily make sense for docker containers, when 
> the docker container has a libprocess process inside of it (a mesos framework 
> scheduler) and is using {{--net=host}}, the task needs to have LIBPROCESS_IP 
> set; otherwise the same sort of problems that happen because of MESOS-3553 can 
> happen (libprocess will try to guess the machine's IP address, with likely bad 
> results in a number of operating environments).





[jira] [Commented] (MESOS-3772) Consistency of quoted strings in error messages

2015-10-20 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965652#comment-14965652
 ] 

Cody Maloney commented on MESOS-3772:
-

What about generally preferring 
[std::quoted|http://en.cppreference.com/w/cpp/io/manip/quoted]? That does the 
escaping of quotes inside the string for you, as well as adding surrounding 
quotes (double quotes by default), so it is a predictable / reversible 
transformation.
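
A minimal sketch of what that looks like, assuming a C++14 compiler (std::quoted lives in <iomanip>). The helper names quoteForLog and unquoteFromLog are made up for this example and are not part of Mesos or stout.

```cpp
#include <cassert>
#include <iomanip>
#include <sstream>
#include <string>

// Quote a value for a log message. std::quoted wraps the value in double
// quotes and backslash-escapes embedded quotes and backslashes.
std::string quoteForLog(const std::string& value)
{
  std::ostringstream out;
  out << std::quoted(value);
  return out.str();
}

// Reverse the transformation: read one quoted token back into a string.
std::string unquoteFromLog(const std::string& quoted)
{
  std::istringstream in(quoted);
  std::string value;
  in >> std::quoted(value);
  return value;
}

// Usage: assert(unquoteFromLog(quoteForLog("task 13")) == "task 13");
```

The delimiter and escape character are parameters of std::quoted, so single quotes would be possible too if the log style calls for them.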

> Consistency of quoted strings in error messages
> ---
>
> Key: MESOS-3772
> URL: https://issues.apache.org/jira/browse/MESOS-3772
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>  Labels: mesosphere, newbie
>
> Example log output:
> {quote}
> I1020 18:56:02.933956  1790 slave.cpp:1270] Got assigned task 13 for 
> framework 496620b9-4368-4a71-b741-68216f3d909f-
> I1020 18:56:02.934185  1790 slave.cpp:1386] Launching task 13 for framework 
> 496620b9-4368-4a71-b741-68216f3d909f-
> I1020 18:56:02.934408  1790 slave.cpp:1618] Queuing task '13' for executor 
> default of framework '496620b9-4368-4a71-b741-68216f3d909f-
> I1020 18:56:02.935417  1790 slave.cpp:1760] Sending queued task '13' to 
> executor 'default' of framework 496620b9-4368-4a71-b741-68216f3d909f-
> {quote}
> Aside from the typo (unmatched quote) in the third line, these log messages 
> use quoting inconsistently: sometimes task, executor, and framework IDs are 
> quoted, other times they are not.
> We should probably adopt a general rule, a la 
> http://www.postgresql.org/docs/9.4/static/error-style-guide.html . My 
> proposal: when interpolating a variable, only use quotes if it is possible 
> that the value might contain whitespace or punctuation (in the latter case, 
> the punctuation should probably be escaped).





[jira] [Commented] (MESOS-2275) Document header include rules in style guide

2015-10-19 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964527#comment-14964527
 ] 

Cody Maloney commented on MESOS-2275:
-

Out of curiosity, does this format match any of the formats available in 
clang-format --sort-includes? (http://reviews.llvm.org/D11240)

> Document header include rules in style guide
> 
>
> Key: MESOS-2275
> URL: https://issues.apache.org/jira/browse/MESOS-2275
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Niklas Quarfot Nielsen
>Assignee: Jan Schlicht
>Priority: Trivial
>  Labels: beginner, docathon, mesosphere
>
> We have several ways of sorting, grouping and ordering header includes in 
> Mesos. We should agree on a rule set and do a style scan.





[jira] [Created] (MESOS-3751) MESOS_NATIVE_JAVA_LIBRARY not set on MesosContainerizer tasks with --executor_environment_variables

2015-10-16 Thread Cody Maloney (JIRA)
Cody Maloney created MESOS-3751:
---

 Summary: MESOS_NATIVE_JAVA_LIBRARY not set on MesosContainerizer 
tasks with --executor_environment_variables
 Key: MESOS-3751
 URL: https://issues.apache.org/jira/browse/MESOS-3751
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Affects Versions: 0.25.0, 0.24.1
Reporter: Cody Maloney


When using --executor_environment_variables, and having 
MESOS_NATIVE_JAVA_LIBRARY in the environment of mesos-slave, the mesos 
containerizer does not set MESOS_NATIVE_JAVA_LIBRARY itself.

Relevant code: 
https://github.com/apache/mesos/blob/14f7967ef307f3d98e3a4b93d92d6b3a56399b20/src/slave/containerizer/containerizer.cpp#L281

It sees that the variable is in the mesos-slave's environment (os::getenv), 
rather than checking if it is set in the environment variable set.
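
The difference between the two checks can be sketched as follows. This is an illustration only, not the code at the link above: the Environment alias and all three function names are hypothetical, and std::getenv stands in for os::getenv.

```cpp
#include <cassert>
#include <cstdlib>
#include <map>
#include <string>

// Hypothetical representation of the environment handed to an executor.
using Environment = std::map<std::string, std::string>;

// Shape of the reported behavior: the decision is based on the agent
// process's own environment.
bool setInProcessEnvironment(const std::string& name)
{
  return std::getenv(name.c_str()) != nullptr;
}

// Shape of the suggested behavior: the decision is based on the set of
// variables that will actually be passed to the executor.
bool setInExecutorEnvironment(
    const Environment& env, const std::string& name)
{
  return env.count(name) > 0;
}

// Add a default only when the executor environment does not define it.
void ensureVariable(
    Environment& env, const std::string& name, const std::string& fallback)
{
  if (!setInExecutorEnvironment(env, name)) {
    env[name] = fallback;
  }
}
```

With the second check, a variable that is present in the agent's environment but absent from the configured executor environment still gets its default filled in.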





[jira] [Created] (MESOS-3740) LIBPROCESS_IP not passed to Docker containers

2015-10-14 Thread Cody Maloney (JIRA)
Cody Maloney created MESOS-3740:
---

 Summary: LIBPROCESS_IP not passed to Docker containers
 Key: MESOS-3740
 URL: https://issues.apache.org/jira/browse/MESOS-3740
 Project: Mesos
  Issue Type: Bug
  Components: containerization, docker
Affects Versions: 0.25.0
 Environment: Mesos 0.24.1
Reporter: Cody Maloney


Docker containers aren't currently passed all the same environment variables 
that Mesos Containerizer tasks are. See: 
https://github.com/apache/mesos/blob/master/src/slave/containerizer/containerizer.cpp#L254
 for all the environment variables explicitly set for mesos containers.

While some of them don't necessarily make sense for docker containers, when the 
docker container has a libprocess process inside of it (a mesos framework 
scheduler) and is using {{--net=host}}, the task needs to have LIBPROCESS_IP set; 
otherwise the same sort of problems that happen because of MESOS-3553 can happen 
(libprocess will try to guess the machine's IP address, with likely bad results 
in a number of operating environments).





[jira] [Commented] (MESOS-3177) Make Mesos own configuration of roles/weights

2015-09-17 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14804289#comment-14804289
 ] 

Cody Maloney commented on MESOS-3177:
-

Currently the mesos master doesn't explicitly keep track of the roles it knows 
of, just the roles it is told it should know about via the flag. Storing 
them in the replicated log would be my preferred place to put / persist them.

If they are persisted in the replicated log and that is the authoritative 
source for them, I'd rather not have them be flags to the mesos master anymore, 
as after first mesos master start those flags would be meaningless and lead to 
a potentially bad user experience (I set the flags on mesos master but they 
aren't applying!?!?!). 

There is a `mesos-log` command that already exists, and there has been some 
design discussion that initialization of the replicated log shouldn't be 
implicit in master startup (it can potentially lead to bad cluster/error cases 
for some node replacement scenarios).

I would suggest only allowing adding roles in v1. Removing roles will require 
revoking offers, which sort of exists with inverse offers that recently became 
available, but is going to be a lot of engineering.

For other things you're going to need a Mesos Shepherd going forward for more 
design review, building out a proper design proposal, and getting things landed 
in time.

> Make Mesos own configuration of roles/weights
> -
>
> Key: MESOS-3177
> URL: https://issues.apache.org/jira/browse/MESOS-3177
> Project: Mesos
>  Issue Type: Improvement
>  Components: master, slave
>Reporter: Cody Maloney
>Assignee: Thomas Rampelberg
>  Labels: mesosphere
>
> All roles and weights must currently be specified up-front when starting 
> Mesos masters. In addition, they should be consistent on every 
> master, otherwise unexpected behavior could occur (You can have them be 
> inconsistent for some upgrade paths / changing the set).
> This makes it hard to introduce new groups of machines under new roles 
> dynamically (Have to generate a new master configuration, deploy that, before 
> we can connect slaves with a new role to the cluster).
> Ideally an administrator can manually add / remove / edit roles and have the 
> settings replicated / passed to all masters in the cluster by Mesos. 
> Effectively Mesos takes ownership of the setting, rather than requiring it to 
> be done externally.
> In addition, if a new slave joins the cluster with an unexpected / new role 
> that should just work, making it much easier to introduce machines with new 
> roles. (Policy around whether or not a slave can cause creation of a new 
> role, a given slave can register with a given role, etc. is out of scope, and 
> would be controls in the general registration process).





[jira] [Commented] (MESOS-3177) Make Mesos own configuration of roles/weights

2015-09-11 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14740398#comment-14740398
 ] 

Cody Maloney commented on MESOS-3177:
-

There hasn't been any design documentation building / development so far.

In my mind I've been thinking of it as "Before you start the mesos masters, you 
create the initial replicated log state which contains the first set of roles 
and weights to operate with". Then from that point on mesos has "add_role" 
and "remove_role" endpoints to manage them. Even better would be that if you 
don't have authentication turned on, as mesos sees new roles it just adds them 
(And as all things with that role disappear it removes them). If authentication 
is turned on, the authentication mechanism effectively "permanently" owns all 
the roles it defines (if it's just a static configuration file). If it's a 
dynamic source / database then the interface to talk about ownership would 
probably need to get more complicated.

> Make Mesos own configuration of roles/weights
> -
>
> Key: MESOS-3177
> URL: https://issues.apache.org/jira/browse/MESOS-3177
> Project: Mesos
>  Issue Type: Improvement
>  Components: master, slave
>Reporter: Cody Maloney
>Assignee: Thomas Rampelberg
>  Labels: mesosphere
>
> All roles and weights must currently be specified up-front when starting 
> Mesos masters. In addition, they should be consistent on every 
> master, otherwise unexpected behavior could occur (You can have them be 
> inconsistent for some upgrade paths / changing the set).
> This makes it hard to introduce new groups of machines under new roles 
> dynamically (Have to generate a new master configuration, deploy that, before 
> we can connect slaves with a new role to the cluster).
> Ideally an administrator can manually add / remove / edit roles and have the 
> settings replicated / passed to all masters in the cluster by Mesos. 
> Effectively Mesos takes ownership of the setting, rather than requiring it to 
> be done externally.
> In addition, if a new slave joins the cluster with an unexpected / new role 
> that should just work, making it much easier to introduce machines with new 
> roles. (Policy around whether or not a slave can cause creation of a new 
> role, a given slave can register with a given role, etc. is out of scope, and 
> would be controls in the general registration process).





[jira] [Created] (MESOS-3417) Log source address replicated log received broadcasts

2015-09-11 Thread Cody Maloney (JIRA)
Cody Maloney created MESOS-3417:
---

 Summary: Log source address replicated log received broadcasts
 Key: MESOS-3417
 URL: https://issues.apache.org/jira/browse/MESOS-3417
 Project: Mesos
  Issue Type: Improvement
  Components: replicated log
Affects Versions: 0.24.0, 0.23.0
 Environment: Mesos 0.23
Reporter: Cody Maloney
Assignee: Adam B
Priority: Minor


Currently Mesos doesn't log what machine a replicated log status broadcast was 
received from:
{code}
Sep 11 21:41:14 master-01 mesos-master[15625]: I0911 21:41:14.320164 15637 
replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
Sep 11 21:41:14 master-01 mesos-dns[15583]: I0911 21:41:14.321097   15583 
detect.go:118] ignoring children-changed event, leader has not changed: /mesos
Sep 11 21:41:14 master-01 mesos-master[15625]: I0911 21:41:14.353914 15639 
replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
Sep 11 21:41:14 master-01 mesos-master[15625]: I0911 21:41:14.479132 15639 
replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
{code}

It would be really useful for debugging replicated log startup issues to have 
info about where the message came from (libprocess address, IP, or hostname).





[jira] [Updated] (MESOS-2131) Add a reverse proxy endpoint to mesos

2015-08-11 Thread Cody Maloney (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Maloney updated MESOS-2131:

Assignee: (was: Cody Maloney)

 Add a reverse proxy endpoint to mesos
 -

 Key: MESOS-2131
 URL: https://issues.apache.org/jira/browse/MESOS-2131
 Project: Mesos
  Issue Type: Improvement
  Components: master, slave
Reporter: Cody Maloney
Priority: Minor
  Labels: mesosphere

 A new libprocess Process inside mesos which allows attaching/detaching known 
 endpoints at a specific path.
 Ideally I want to be able to do things like attach 'slave-id' and pass HTTP 
 requests on to that slave:
 Sample endpoint actions:
 C++ api:
 attach(std::string name, Node target): Add a new reverse proxy path
 detach(std::string name): Remove an established reverse proxy path
 HTTP endpoints:
 /proxy/go/{name}
  - Prefix matches a path, forwards the remaining path onto the remote endpoint
 /proxy/debug.json
  - Prints out all attached endpoints.
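
The attach / detach / forward behavior described above could look roughly like the following C++ sketch. It is only an illustration: Target is a plain "host:port" string standing in for the Node type, ReverseProxy and resolve are hypothetical names, and no HTTP handling or libprocess integration is shown.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for the proxy target: just "host:port".
using Target = std::string;

class ReverseProxy
{
public:
  // Add a new reverse proxy path under /proxy/go/{name}.
  void attach(const std::string& name, const Target& target)
  {
    targets_[name] = target;
  }

  // Remove an established reverse proxy path.
  void detach(const std::string& name) { targets_.erase(name); }

  // Map "/proxy/go/{name}/rest" to "{target}/rest". Returns an empty string
  // when the path is not under /proxy/go/ or the name is not attached.
  std::string resolve(const std::string& path) const
  {
    const std::string prefix = "/proxy/go/";
    if (path.compare(0, prefix.size(), prefix) != 0) return "";

    const std::size_t end = path.find('/', prefix.size());
    const std::string name = path.substr(
        prefix.size(),
        end == std::string::npos ? std::string::npos : end - prefix.size());

    const auto it = targets_.find(name);
    if (it == targets_.end()) return "";

    // Forward whatever follows the name; default to the target's root.
    const std::string rest = end == std::string::npos ? "/" : path.substr(end);
    return it->second + rest;
  }

private:
  std::map<std::string, Target> targets_;
};
```

A /proxy/debug.json style listing would then just iterate over the attached names and targets.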





[jira] [Updated] (MESOS-2130) Allow prefix routing of paths in libprocess

2015-08-11 Thread Cody Maloney (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Maloney updated MESOS-2130:

Assignee: (was: Cody Maloney)

 Allow prefix routing of paths in libprocess
 ---

 Key: MESOS-2130
 URL: https://issues.apache.org/jira/browse/MESOS-2130
 Project: Mesos
  Issue Type: Improvement
  Components: libprocess
Reporter: Cody Maloney
  Labels: mesosphere

 Currently libprocess can only route to UPIDs, and then within the upids one 
 top level command. Ideally you can attach C++ endpoints to arbitrary paths, 
 including taking everything that matches a prefix:
 Ex:
 /slaves/:slave_id/ could proxy to an individual slave
 /slaves/ 
  - Alias for /slave(1) if only one slave
 /slaves/{number} 
  - point to an individual slave rather than requiring people to properly 
 encode () in urls.
 /proxy/go/master-leader/files/browse.json
  - The endpoint would be /proxy/go, and then it internally processes the 
 request to find the host it should go to (What is the IP for the currently 
 elected master?) and then forwards the rest of the path to the target machine.
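
A rough sketch of the prefix matching being asked for, assuming a longest-prefix-wins rule. This is Python for illustration, not libprocess code; `PrefixRouter`, `route`, `match` and `dispatch` are hypothetical names, and handlers here simply receive the remaining path as a string.

```python
from typing import Callable, Dict, Optional, Tuple

Handler = Callable[[str], str]


class PrefixRouter:
    """Hypothetical router: the longest registered prefix wins."""

    def __init__(self) -> None:
        self._routes: Dict[str, Handler] = {}

    def route(self, prefix: str, handler: Handler) -> None:
        # Normalize so "/slaves" and "/slaves/" register the same prefix.
        self._routes[prefix.rstrip("/") + "/"] = handler

    def match(self, path: str) -> Optional[Tuple[Handler, str]]:
        """Return (handler, remaining path) for the longest matching prefix."""
        # Compare with a trailing slash so "/slaves" matches "/slaves/"
        # but "/slavesX" does not.
        candidate = path if path.endswith("/") else path + "/"
        best: Optional[str] = None
        for prefix in self._routes:
            if candidate.startswith(prefix) and (best is None or len(prefix) > len(best)):
                best = prefix
        if best is None:
            return None
        return self._routes[best], path[len(best):]

    def dispatch(self, path: str) -> str:
        found = self.match(path)
        if found is None:
            return "404 Not Found"
        handler, remainder = found
        return handler(remainder)
```

With a handler registered at /proxy/go, a request for /proxy/go/master-leader/files/browse.json reaches that handler with master-leader/files/browse.json as the remainder, leaving it free to pick the target host and forward the rest.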





[jira] [Created] (MESOS-3177) Make Mesos own configuration of roles/weights

2015-07-30 Thread Cody Maloney (JIRA)
Cody Maloney created MESOS-3177:
---

 Summary: Make Mesos own configuration of roles/weights
 Key: MESOS-3177
 URL: https://issues.apache.org/jira/browse/MESOS-3177
 Project: Mesos
  Issue Type: Improvement
  Components: master, slave
Reporter: Cody Maloney


All roles and weights must currently be specified up-front when starting Mesos 
masters. In addition, they should be consistent on every master, otherwise 
unexpected behavior could occur (they can be inconsistent for some upgrade 
paths / while changing the set).

This makes it hard to introduce new groups of machines under new roles 
dynamically (Have to generate a new master configuration, deploy that, before 
we can connect slaves with a new role to the cluster).

Ideally an administrator can manually add / remove / edit roles and have the 
settings replicated / passed to all masters in the cluster by Mesos. 
Effectively Mesos takes ownership of the setting, rather than requiring it to 
be done externally.

In addition, if a new slave joins the cluster with an unexpected / new role 
that should just work, making it much easier to introduce machines with new 
roles. (Policy around whether or not a slave can cause creation of a new role, 
a given slave can register with a given role, etc. is out of scope, and would 
be controls in the general registration process).





[jira] [Commented] (MESOS-2902) Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME

2015-07-14 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627014#comment-14627014
 ] 

Cody Maloney commented on MESOS-2902:
-

It is an argument against doing anything at runtime whenever possible. 
Unfortunately, the IP is something we can't know outside the machine we shipped 
Mesos to, so we can't bake it in. We would if we could, but in most of the 
environments we ship to we have found that we can't. If I send a Mesos package 
to a bunch of arbitrary hosts, they all have different IPs, even though all the 
other configuration parameters stay the same.

 Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME
 

 Key: MESOS-2902
 URL: https://issues.apache.org/jira/browse/MESOS-2902
 Project: Mesos
  Issue Type: Improvement
  Components: master, modules, slave
Reporter: Cody Maloney
Assignee: Marco Massenzio
Priority: Critical
  Labels: mesosphere

 Currently Mesos tries to guess the IP, HOSTNAME by doing a reverse DNS 
 lookup. This doesn't work on a lot of clouds as we want things like public 
 IPs (which aren't the default DNS), there aren't FQDN names (Azure), or the 
 correct way to figure it out is to call some cloud-specific endpoint.
 If Mesos / Libprocess could load a mesos-module (Or run a script) which is 
 provided per-cloud, we can figure out perfectly the IP / Hostname for the 
 given environment. It also means we can ship one identical set of files to 
 all hosts in a given provider which doesn't happen to have the DNS scheme + 
 hostnames that libprocess/Mesos expects. Currently we have to generate 
 host-specific config files which Mesos uses to guess.
 The host-specific files break / fall apart if machines change IP / hostname 
 without being reinstalled.





[jira] [Commented] (MESOS-2902) Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME

2015-07-14 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627125#comment-14627125
 ] 

Cody Maloney commented on MESOS-2902:
-

I've already covered why wrapper scripts aren't an option several times in this 
thread.



[jira] [Commented] (MESOS-2902) Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME

2015-07-14 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626999#comment-14626999
 ] 

Cody Maloney commented on MESOS-2902:
-

One thing as a follow-up from the discussion this morning. Generally, for 
shipping Mesos to lots of places in DCOS, we're trying to get everything to not 
happen on the host we're shipping to. Any code we execute on a host has a high 
probability of having some bugs and breaking in a lot of environments. As such, 
we bake everything off host, so that when it gets to the host itself it's just a 
matter of reading static variables whenever possible.

This effectively moves possible errors from runtime / machine startup time, 
when we have a really hard time fixing them, to configuration setup time on 
some remote machine. I can generate a config using some tools, test it out 
locally, and know that the target machine will behave the same since it will 
get the same bit-for-bit config. If it's some script, I have to predict what 
the script will do in the foreign environment (there are very few things we can 
rely on existing on the host: pretty much just bash and curl / wget. Everything 
else is an additional dependency we pick up, which makes it harder to install 
DCOS).




[jira] [Comment Edited] (MESOS-2902) Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME

2015-07-14 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626999#comment-14626999
 ] 

Cody Maloney edited comment on MESOS-2902 at 7/14/15 8:35 PM:
--

One thing as a follow-up from the discussion this morning. Generally, for 
shipping Mesos to lots of places in DCOS, we're trying to get everything to not 
happen on the host we're shipping to. Any code we execute on a host has a high 
probability of having some bugs and breaking in a lot of environments. As such, 
we bake everything off host, so that when it gets to the host itself it's just a 
matter of reading static variables whenever possible. Running a script on a 
hundred hosts that generates the same config file is much more likely to go 
wrong than running the script once somewhere I can validate the output, then 
shipping it to the hosts with integrity checking.
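
As a sketch of what "generate once, validate, ship with integrity checking" could look like, assuming a SHA-256 digest is recorded where the config is built and re-checked on each host. The function names and the sample config line are hypothetical; nothing here is the actual DCOS packaging code.

```python
import hashlib
import hmac


def config_digest(config_bytes: bytes) -> str:
    """Digest computed once, on the machine where the config is generated."""
    return hashlib.sha256(config_bytes).hexdigest()


def verify_config(config_bytes: bytes, expected_digest: str) -> bool:
    """Run on each host: True only if the shipped bytes are bit-for-bit identical."""
    return hmac.compare_digest(config_digest(config_bytes), expected_digest)
```

The host-side step is then a pure read-and-compare with no environment-dependent logic, which is the property being argued for.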

This effectively moves possible errors from runtime / machine startup time, 
when we have a really hard time fixing them, to configuration setup time on 
some remote machine. I can generate a config using some tools, test it out 
locally, and know that the target machine will behave the same since it will 
get the same bit-for-bit config. If it's some script, I have to predict what 
the script will do in the foreign environment (there are very few things we can 
rely on existing on the host: pretty much just bash and curl / wget. Everything 
else is an additional dependency we pick up, which makes it harder to install 
DCOS).



was (Author: cmaloney):
One thing as a follow-up from the discussion this morning. Generally, for 
shipping Mesos to lots of places in DCOS, we're trying to get everything to not 
happen on the host we're shipping to. Any code we execute on a host has a high 
probability of having some bugs and breaking in a lot of environments. As such, 
we bake everything off host, so that when it gets to the host itself it's just a 
matter of reading static variables whenever possible.

This effectively moves possible errors from runtime / machine startup time, 
when we have a really hard time fixing them, to configuration setup time on 
some remote machine. I can generate a config using some tools, test it out 
locally, and know that the target machine will behave the same since it will 
get the same bit-for-bit config. If it's some script, I have to predict what 
the script will do in the foreign environment (there are very few things we can 
rely on existing on the host: pretty much just bash and curl / wget. Everything 
else is an additional dependency we pick up, which makes it harder to install 
DCOS).




[jira] [Commented] (MESOS-2902) Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME

2015-07-09 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14621226#comment-14621226
 ] 

Cody Maloney commented on MESOS-2902:
-

[~bmahler] Mesos is much more particular and peculiar in its DNS / hostname / 
IP requirements than a lot of datacenter software. nginx, httpd, etc. don't 
actually use the machine's hostname; they purely use whatever a request comes 
in as. They also don't publish anywhere saying "This is me, come find me based 
on the DNS address of the local machine". They get a request in, they inspect 
what IP address / port that request came in on, and in the case of nginx / 
apache possibly what the {{Host}} HTTP header is, and deal with it from there. 
In the case of Mesos, for the masters for instance, if a master and framework 
disagree on the master IP, you just end up with lost packets with no logging 
currently. The HTTP API should help in this area, but we need to ship Mesos 
today / can't wait for that to come.

We only use cloud-init in some environments. And it only has coreos public / 
private IPv4. There are environments we install using the myriad of other host 
install / setup tools (chef, salt, fleet, ...). There are a lot of ways we ship 
this stuff to clients.

Adding one simple flag doesn't considerably add to the Mesos maintenance 
burden, and solves our use case at the moment. If adding a flag is unpalatable, 
it could be added as a mesos 'hook' module which does exactly the same thing, 
just makes the IP lookup pluggable. That would make it so someone could write a 
mesos module which does NetworkManager if they wished (although there will 
still be the problem that the Mesos slave can't handle its IP address changing).

This isn't teaching Mesos configuration management at all. It is trying to get 
Mesos out of the business of self-configuring badly in a lot of our customer 
environments, which leads to lots of headaches for the various customers we are 
trying to ship Mesos to as a component of DCOS.

The maintenance burden for this is no more than the `--ip` flag that Mesos has 
currently, which is the exact same as setting LIBPROCESS_IP. I believe it does 
not significantly affect organizations which do not need the flag / wish to use 
it, and if they don't give it, it will not change the behavior of their 
setups.



[jira] [Commented] (MESOS-2902) Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME

2015-07-08 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14619815#comment-14619815
 ] 

Cody Maloney commented on MESOS-2902:
-

I'd much rather have it output an IP than a hostname. Some of the cases we've 
run into where a hostname doesn't work: multiple NICs per box (each of which 
can have 1+ DNS address), clusters where boxes don't have resolvable hostnames, 
and clusters which have no DNS whatsoever.

If it's 'run a script which returns an IP' I can fairly reliably create 
cluster/environment-specific variants which get the right IP address.
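
To make the 'run a script which returns an IP' idea concrete, here is a small sketch of the calling side, under the assumption that the script prints exactly one address on stdout. The function name `detect_ip`, the command-as-list argument and the 5 second timeout are hypothetical choices for illustration, not an existing Mesos interface.

```python
import ipaddress
import subprocess
from typing import List


def detect_ip(command: List[str], timeout: float = 5.0) -> str:
    """Run a per-environment command and return the IP it prints on stdout."""
    result = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout, check=True
    )
    candidate = result.stdout.strip()
    # Raises ValueError if the output is not a valid IPv4/IPv6 address.
    address = ipaddress.ip_address(candidate)
    # A loopback or unspecified address would make the process unreachable
    # for the rest of the cluster, so fail loudly instead of starting up.
    if address.is_loopback or address.is_unspecified:
        raise ValueError(f"refusing unusable address: {candidate}")
    return str(address)
```

Validating the output up front means a broken cloud-specific script stops startup with a clear error rather than registering with an address nobody can reach.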



[jira] [Comment Edited] (MESOS-2902) Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME

2015-07-08 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14619815#comment-14619815
 ] 

Cody Maloney edited comment on MESOS-2902 at 7/9/15 3:15 AM:
-

I'd much rather have it output an IP than a hostname. Some of the cases we've 
run into where a hostname doesn't work: multiple NICs per box (each of which 
can have 1+ IP address, and an arbitrary grouping can have actual DNS), 
clusters where boxes don't have resolvable hostnames, and clusters which have 
no DNS whatsoever.

If it's 'run a script which returns an IP' I can fairly reliably create 
cluster/environment-specific variants which get the right IP address.


was (Author: cmaloney):
I'd much rather have it output an IP than a hostname. Some of the cases we've 
run into where a hostname doesn't work: multiple NICs per box (each of which 
can have 1+ DNS address), clusters where boxes don't have resolvable hostnames, 
and clusters which have no DNS whatsoever.

If it's 'run a script which returns an IP' I can fairly reliably create 
cluster/environment-specific variants which get the right IP address.



[jira] [Commented] (MESOS-2902) Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME

2015-07-08 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14619890#comment-14619890
 ] 

Cody Maloney commented on MESOS-2902:
-

Probably we should sync in person tomorrow and summarize on here.

We could potentially say "You have to run a script on every host which sets 
LIBPROCESS_IP" (or MESOS_IP, which turns into the --ip flag and therefore 
LIBPROCESS_IP). It adds complexity in the form of extra dependencies, and 
makes the cluster install + running Mesos not very self-contained.

What I like about having Mesos run a script, is we are able to ship that script 
inside the DCOS internal host packaging system to hosts, manage and update it 
appropriately inside of DCOS. Anything which doesn't live in there we can't 
touch, update, etc. during upgrades.

It's also important to note this affects us both for launching Mesos and for 
launching DCOS system frameworks (Marathon tries to do the same hostname -> IP 
logic inside libprocess, and it goes just as badly in a lot of our use cases).



[jira] [Commented] (MESOS-2902) Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME

2015-07-08 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14619857#comment-14619857
 ] 

Cody Maloney commented on MESOS-2902:
-

In DCOS we do all Mesos config via environment variables (allows better mixing 
and matching in various environments). We ship the same mesos-master systemd 
unit to every cluster, and then we change the configuration by swapping out 
environment variable files (see systemd's {{EnvironmentFile}} directive). 
Inside an {{EnvironmentFile}} we can't run arbitrary scripts. It is 
structurally infeasible to change the mesos-master systemd unit per cluster to 
include the 'Set the IP by running this script' step only in cases where we 
want to do that. There may also be cases where Mesos exits and we restart it, 
and it would refuse to start because it has a different IP (mesos slave might 
checkpoint it, although I'd have to double check).

The IP to use is a per-host thing, so I can't ship a generic config file to 
every host in the cluster which just sets {{LIBPROCESS_IP}} in an 
{{EnvironmentFile}}.

Writing a wrapper script which sets {{LIBPROCESS_IP}} and then does an {{exec 
mesos-master}} is feasible, although it obfuscates what is happening, and if 
someone we ship DCOS to has been hand-editing the script for their environment 
and gets the environment variable a little bit wrong, things will error really 
badly (We've had a number of customers with mesos figuring out that the host's 
IP is 127.0.0.1).
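
For reference, the wrapper approach being described amounts to something like the sketch below (Python here instead of shell). `build_env` and `exec_master` are hypothetical names; the loopback check is an added assumption showing the kind of guard a hand-edited wrapper typically lacks.

```python
import ipaddress
import os
from typing import Dict, List


def build_env(base_env: Dict[str, str], ip: str) -> Dict[str, str]:
    """Return a copy of base_env with LIBPROCESS_IP set to a validated address."""
    address = ipaddress.ip_address(ip)  # ValueError on malformed input
    if address.is_loopback:
        raise ValueError("LIBPROCESS_IP must not be a loopback address")
    env = dict(base_env)
    env["LIBPROCESS_IP"] = str(address)
    return env


def exec_master(ip: str, argv: List[str]) -> None:
    """Replace this process with argv, like `exec mesos-master` in a shell wrapper."""
    os.execvpe(argv[0], argv, build_env(dict(os.environ), ip))
```

Even with the guard, the wrapper still hides where the address comes from, which is the obfuscation concern raised above.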

As far as the hostname stuff: in general we need Mesos not to do anything with 
hostnames in a number of our environments because they are unreliable, esp. as 
a means for figuring out "what address should I talk on".



[jira] [Commented] (MESOS-2132) Allow sending http::Request objects

2015-07-04 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14613989#comment-14613989
 ] 

Cody Maloney commented on MESOS-2132:
-

Currently Mesos HTTP handlers receive an HTTP Request object. If you just want 
to forward the request with minimal changes (just the path), as the proxy 
process I was working on does, you need to copy every field out of the 
structure and pass the members as slightly differently formatted arguments to 
the http::post / http::get functions. Making it so those functions can just 
take an http::Request object makes it easier to forward requests, and also 
cleans up the HTTP get/post API so that, rather than a long string of optional 
parameters, there are just fields omitted from being set on a struct.
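
The shape of the API being argued for can be sketched as follows. This is an illustrative Python model, not the libprocess http::Request type: the `Request` fields and the `forward` helper are assumptions chosen to show that only the path needs to change when forwarding.

```python
from dataclasses import dataclass, field, replace
from typing import Dict


@dataclass(frozen=True)
class Request:
    """Hypothetical stand-in for a request struct; unset fields keep defaults."""
    method: str
    path: str
    headers: Dict[str, str] = field(default_factory=dict)
    body: bytes = b""


def forward(request: Request, new_path: str) -> Request:
    """Copy the incoming request, changing only the path."""
    return replace(request, path=new_path)
```

A sender that accepts the whole object needs no per-field copying, and callers that only care about method and path simply leave the other fields at their defaults.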

 Allow sending http::Request objects
 ---

 Key: MESOS-2132
 URL: https://issues.apache.org/jira/browse/MESOS-2132
 Project: Mesos
  Issue Type: Improvement
  Components: libprocess
Reporter: Cody Maloney
Assignee: Cody Maloney
Priority: Minor
  Labels: mesosphere

 Currently you can only send a collection of fields which more or less matches 
 those in an http::Request object.
 http::Request objects are used when calling http handlers in libprocess.
 The motivation for being able to send these is that we can then forward a 
 request that is received.





[jira] [Commented] (MESOS-1865) Redirect to the leader master when current master is not a leader

2015-07-01 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611005#comment-14611005
 ] 

Cody Maloney commented on MESOS-1865:
-

Following a redirect is entirely a client's choice. Practically, in HTTP there 
isn't a better alternative I know of that keeps simple / dumb clients working 
well. Right now a number of dumb client programs which want to pull 
master/state.json manually ask a master which master is leading, then go to 
that one directly and hope there isn't a race around it.

Practically, for systems which care to only monitor the exact master they are 
talking to, most HTTP libraries I have seen let you disable automatic redirect 
following. Currently these APIs sometimes returning incorrect / invalid / stale 
data has caused problems for things like proxy config generation scripts (they 
get the wrong master at just the wrong point in time and generate an empty 
config, leading to badness).
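
As an illustration of the client's choice, here is a sketch using Python's standard urllib, which follows redirects by default and can be told not to. The `fetch` helper and its return shape are hypothetical; it assumes a non-leading master would answer with a 3xx and a Location header, which is the behaviour this ticket proposes rather than something confirmed here.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Decline every redirect so the 3xx response surfaces as an HTTPError."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def fetch(url: str, follow_redirects: bool = True, timeout: float = 5.0):
    """Return (status, final URL or Location header, body)."""
    if follow_redirects:
        opener = urllib.request.build_opener()
    else:
        opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status, response.geturl(), response.read()
    except urllib.error.HTTPError as err:
        if 300 <= err.code < 400:
            # Not followed: report where the server wanted to send us.
            return err.code, err.headers.get("Location"), b""
        raise
```

A dumb client keeps the default and lands on the leader; a monitoring client passes follow_redirects=False and sees explicitly that the master it asked is not leading.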

 Redirect to the leader master when current master is not a leader
 -

 Key: MESOS-1865
 URL: https://issues.apache.org/jira/browse/MESOS-1865
 Project: Mesos
  Issue Type: Bug
  Components: json api
Affects Versions: 0.20.1
Reporter: Steven Schlansker
Assignee: haosdent

 Some of the API endpoints, for example /master/tasks.json, will return bogus 
 information if you query a non-leading master:
 {code}
 [steven@Anesthetize:~]% curl http://master1.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10
 {
   "tasks": []
 }
 [steven@Anesthetize:~]% curl http://master2.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10
 {
   "tasks": []
 }
 [steven@Anesthetize:~]% curl http://master3.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10
 {
   "tasks": [
     {
       "executor_id": "",
       "framework_id": "20140724-231003-419644938-5050-1707-",
       "id": "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db",
       "name": "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db",
       "resources": {
         "cpus": 0.25,
         "disk": 0,
 {code}
 This is very hard for end-users to work around.  For example if I query 
 "which master is leading" followed by "leader: which tasks are running" it is 
 possible that the leader fails over in between, leaving me with an incorrect 
 answer and no way to know that this happened.
 In my opinion the API should return the correct response (by asking the 
 current leader?) or an error (500 "Not the leader"?) but it's unacceptable to 
 return a successful wrong answer.





[jira] [Commented] (MESOS-2153) Add support for systemd journal for logging

2015-07-01 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611326#comment-14611326
 ] 

Cody Maloney commented on MESOS-2153:
-

This should also include individual task stdout/stderr and syslog messages 
being logged to the systemd journal (although those are more pieces of this as 
an epic). Right now, for long-running tasks, the stdout and stderr just grow 
forever. The systemd journal makes it so the stdout/stderr can be capped in 
size, and administrative policies can be set per app if desired.

 Add support for systemd journal for logging
 ---

 Key: MESOS-2153
 URL: https://issues.apache.org/jira/browse/MESOS-2153
 Project: Mesos
  Issue Type: Improvement
  Components: master, slave
Reporter: Alexander Rukletsov
Priority: Minor

 We should be able to redirect master and slave logs to systemd journal on the 
 systems where it's available.





[jira] [Commented] (MESOS-898) Introduce CMake as an alternative build system.

2015-06-24 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600263#comment-14600263
 ] 

Cody Maloney commented on MESOS-898:


I would suggest that with the move to CMake we switch to using a raw upstream 
packaged version of boost. We don't gain much by stripping out some of the 
headers, and it adds a lot more complexity. CMake has a lot of ready-made 
support for finding boost, and for downloading it if and only if it isn't 
present on the host machine, isn't the right version, etc. Rebuilding all of 
that logic/code just so that we can remove some files from a tarball, which 
shouldn't be embedded inside the repository anyway, doesn't seem like the best 
idea.

 Introduce CMake as an alternative build system.
 ---

 Key: MESOS-898
 URL: https://issues.apache.org/jira/browse/MESOS-898
 Project: Mesos
  Issue Type: Epic
  Components: build
Reporter: Timothy St. Clair
Assignee: Alex Clemmer
  Labels: build

 This is a rather substantial undertaking, so I would want upstream 
 debate+buy-in prior to full commitment.  The basic premise is: upstream 
 rebundles several of its dependencies in part to tightly control its stack.  
 This is not out of the norm, but in order to be picked up by distribution 
 channels it needs to be built against system dependencies, and rebundling is 
 strictly forbidden.  Given that Mesos' primary target platforms are 
 data-center distributions such as RHEL/CENTOS/SL, it makes sense to still have 
 bundling support for those who do not have dependencies in their channels 
 yet.  This is where cmake can be a win with its uber macros 
 (http://www.cmake.org/cmake/help/v2.8.8/cmake.html#module:ExternalProject).  
 I do not know of any equivalent in the autotools world, other than to brew 
 your own solution.   I've done this type of work in the past, and completely 
 transformed condor, and would leverage a lot of the work that was done there. 
 I currently have a tracking branch where I've started this work, but before I 
 go off into the woods, it makes sense to have a debate in public. 
 The primary benefits are: 
 1. Enable downstream channels to easily distro without carrying large patch 
 sets. 
 2. Still support existing non-proper distribution methods. 
 3. Harden / future proof dependent interfaces. 
 Side Benefits: 
 Audit current build mechanics.  
  - Presently the language-specific bindings are not installed.  (.py  .jar)
  - make -jX currently fails 
  - optionally look into ARM support. 
 Costs:
 1. Time
 2. Potential temporary destabilization
 3. Infrastructure around build+test may need to change.





[jira] [Updated] (MESOS-2129) Enable managing mesos without having to be able to connect to each slave

2015-06-23 Thread Cody Maloney (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Maloney updated MESOS-2129:

Assignee: (was: Cody Maloney)

 Enable managing mesos without having to be able to connect to each slave
 

 Key: MESOS-2129
 URL: https://issues.apache.org/jira/browse/MESOS-2129
 Project: Mesos
  Issue Type: Epic
Reporter: Cody Maloney
  Labels: mesosphere

 Ideally we want to use the full mesos WebUI from an office, which is 
 firewalled off from the vast majority of hosts in the datacenter (mesos 
 slaves). It also becomes burdensome to manage a precise firewall for 
 additional hosts: if we don't want to allow blanket access to the slave port, 
 we have to add / remove firewall rules every time a slave comes or goes.





[jira] [Created] (MESOS-2902) Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME

2015-06-19 Thread Cody Maloney (JIRA)
Cody Maloney created MESOS-2902:
---

 Summary: Enable Mesos to use arbitrary script / module to figure 
out IP, HOSTNAME
 Key: MESOS-2902
 URL: https://issues.apache.org/jira/browse/MESOS-2902
 Project: Mesos
  Issue Type: Improvement
  Components: master, modules, slave
Reporter: Cody Maloney
Priority: Minor


Currently Mesos tries to guess the IP, HOSTNAME by doing a reverse DNS lookup. 
This doesn't work on a lot of clouds as we want things like public IPs (which 
aren't the default DNS), there are no FQDNs (Azure), or the correct way to 
figure it out is to call some cloud-specific endpoint.

If Mesos / Libprocess could load a mesos-module (or run a script) which is 
provided per-cloud, we could determine the IP / hostname exactly for the given 
environment. It also means we could ship one identical set of files to all 
hosts in a given provider which doesn't happen to have the DNS scheme + 
hostnames that libprocess/Mesos expects. Currently we have to generate 
host-specific config files which Mesos uses to guess.

The host-specific files break / fall apart if machines change IP / hostname 
without being reinstalled.





[jira] [Comment Edited] (MESOS-2902) Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME

2015-06-19 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594068#comment-14594068
 ] 

Cody Maloney edited comment on MESOS-2902 at 6/19/15 10:58 PM:
---

I can't drop it in a systemd unit file which runs a command before mesos and 
passes the data along without making a temp file, which is an odd way to do 
the config generation.

I could make a new mesos-init-fetch-ip script which I run instead of mesos, and 
that script then execs mesos. This confuses init system tracking of processes 
somewhat, and obfuscates what the underlying commands being run are.

It also adds a lot of error scenarios. For example, if the wrapper script is 
updated and the change contains a typo (so it sets LIBPROCES_IP instead of 
LIBPROCESS_IP), Libprocess silently ignores the wrong environment variable. In 
the environment I'm in, Libprocess' internal logic guesses an IP that happens 
to work, so the slightly incorrect setup gets ingrained as it rolls out across 
the cluster.

Currently one of the biggest pain points in initially setting up a Mesos 
cluster is getting the right IPs + Hostnames set up. If Mesos Master and Mesos 
Slave had a required flag, {{\-\-ip\-detection=reverse_dns}} or 
{{--ip-detection=/usr/bin/detect_mesos_ip}}, users would see what mesos is 
doing and could make an informed decision, rather than running Mesos and 
having things break with really bad error messages (Wrong hostname/IP on your 
Scheduler? Nothing gets logged when things break...).
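
To make the script variant of the proposal concrete, here is a minimal sketch of how a detection script could be consumed. It assumes the script prints the IP on the first line of stdout and exits 0. The function name {{detectIpFromScript}} is hypothetical, and {{--ip-detection}} itself is only the flag proposed in this comment, not an existing Mesos option.

```cpp
#include <stdio.h>

#include <string>

// Hypothetical helper (not part of Mesos or libprocess): runs `command`
// through the shell and returns the first line it prints, with trailing
// whitespace removed. Returns an empty string if the command cannot be
// started, prints nothing, or exits with a non-zero status.
std::string detectIpFromScript(const std::string& command)
{
  FILE* pipe = ::popen(command.c_str(), "r");
  if (pipe == nullptr) {
    return "";
  }

  // Only the first line of output is considered.
  char line[256];
  std::string output;
  if (::fgets(line, sizeof(line), pipe) != nullptr) {
    output = line;
  }

  // A failing script must not be mistaken for a valid answer.
  if (::pclose(pipe) != 0) {
    return "";
  }

  const std::string::size_type end = output.find_last_not_of(" \t\r\n");
  return end == std::string::npos ? "" : output.substr(0, end + 1);
}
```

Returning an empty string on any failure lets the caller refuse to start with a clear error instead of silently falling back to a guess.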

As for generalizing it further: note I'm saying IP and HOSTNAME are 
host-specific, which is why this sort of capability makes sense. It is 
impossible for me to know, when I'm installing static config files to a Host, 
VM, or Docker container, what the IP and Hostname are going to be. That is not 
the case for {{\-\-resources}}, {{\-\-quiet}} and the like. They can be 
pre-determined for a host. IP and Hostname are runtime parameters of a machine 
(when you attach your machine to a network, they are assigned dynamically).



 Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME
 

 Key: MESOS-2902
 URL: https://issues.apache.org/jira/browse/MESOS-2902
 Project: Mesos
  Issue Type: Improvement
  Components: master, modules, slave
Reporter: Cody Maloney
Priority: Minor
  Labels: mesosphere

 Currently Mesos tries to guess the IP, HOSTNAME by doing a reverse DNS 
 lookup. This doesn't work on a lot of clouds as we want things like public 
 IPs (which aren't the default DNS), there aren't FQDN names (Azure), or the 
 correct way to figure it out is to call some cloud-specific endpoint.
 If Mesos / Libprocess could load a mesos-module (Or run a script) which is 
 provided per-cloud, we can figure out perfectly the IP / Hostname for the 
 given environment. It also means we can ship one identical set of files to 
 all hosts in a given provider which 

[jira] [Commented] (MESOS-2902) Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME

2015-06-19 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594068#comment-14594068
 ] 

Cody Maloney commented on MESOS-2902:
-

I can't drop it in a systemd unit file which runs a command before mesos and 
passes the data along without making a temp file, which is an odd way to do 
the config generation.

I could make a new mesos-init-fetch-ip script which I run instead of mesos, and 
that script then execs mesos. This confuses init system tracking of processes 
somewhat, and obfuscates what the underlying commands being run are.

It also adds a lot of error scenarios. For example, if the wrapper script is 
updated and the change contains a typo (so it sets LIBPROCES_IP instead of 
LIBPROCESS_IP), Libprocess silently ignores the wrong environment variable. In 
the environment I'm in, Libprocess' internal logic guesses an IP that happens 
to work, so the slightly incorrect setup gets ingrained as it rolls out across 
the cluster.

Currently one of the biggest pain points in initially setting up a Mesos 
cluster is getting the right IPs + Hostnames set up. If Mesos Master and Mesos 
Slave had a required flag, {{\-\-ip\-detection=reverse_dns}} or 
{{--ip-detection=/usr/bin/detect_mesos_ip}}, users would see what mesos is 
doing and could make an informed decision, rather than running Mesos and 
having things break with really bad error messages (Wrong hostname/IP on your 
Scheduler? Nothing gets logged when things break...).

As for generalizing it further: note I'm saying IP and HOSTNAME are 
host-specific, which is why this sort of capability makes sense. It is 
impossible for me to know, when I'm installing static config files to a Host, 
VM, or Docker container, what the IP and Hostname are going to be. That is not 
the case for {{\-\-resources}}, {{\-\-quiet}} and the like. They can be 
pre-determined for a host. IP and Hostname are runtime parameters of a machine 
(when you attach your machine to a network, they are assigned dynamically).

 Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME
 

 Key: MESOS-2902
 URL: https://issues.apache.org/jira/browse/MESOS-2902
 Project: Mesos
  Issue Type: Improvement
  Components: master, modules, slave
Reporter: Cody Maloney
Priority: Minor
  Labels: mesosphere

 Currently Mesos tries to guess the IP, HOSTNAME by doing a reverse DNS 
 lookup. This doesn't work on a lot of clouds as we want things like public 
 IPs (which aren't the default DNS), there are no FQDNs (Azure), or the 
 correct way to figure it out is to call some cloud-specific endpoint.
 If Mesos / Libprocess could load a mesos-module (or run a script) which is 
 provided per-cloud, we could determine the IP / hostname exactly for the 
 given environment. It also means we could ship one identical set of files to 
 all hosts in a given provider which doesn't happen to have the DNS scheme + 
 hostnames that libprocess/Mesos expects. Currently we have to generate 
 host-specific config files which Mesos uses to guess.
 The host-specific files break / fall apart if machines change IP / hostname 
 without being reinstalled.





[jira] [Commented] (MESOS-2832) Enable configuring Mesos with environment variables without having them leak to tasks launched

2015-06-17 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590499#comment-14590499
 ] 

Cody Maloney commented on MESOS-2832:
-

For DCOS at least we don't want to just strip some out. We want to replace the 
entire environment with one that is statically specified. The reason for this 
is that we explicitly want to make it hard to depend on special DCOS-internal 
components that mesos-slave has in its PATH and LD_LIBRARY_PATH but which DCOS 
Services should not have.

Removing variables by magic pattern matching seems more complicated to 
implement than "load the exact set of environment variables to use from this 
map, then add in the ones explicitly provided by the Mesos API, such as 
MESOS_SANDBOX".
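
The "exact set plus Mesos-provided variables" approach is small enough to sketch. This is only an illustration under stated assumptions: the {{Environment}} typedef and the {{buildTaskEnvironment}} name are hypothetical, and nothing is read from the slave's own environment.

```cpp
#include <map>
#include <string>

// Hypothetical alias; the real code might use a different map type.
typedef std::map<std::string, std::string> Environment;

// Hypothetical helper: builds the environment a task is launched with.
// It starts from the statically configured map and then layers the
// variables Mesos itself provides (for example MESOS_SANDBOX) on top.
// The slave's own environment is never consulted, so nothing can leak.
Environment buildTaskEnvironment(
    const Environment& configured,
    const Environment& mesosProvided)
{
  Environment result = configured;

  for (const auto& entry : mesosProvided) {
    // Mesos-provided values win over configured ones with the same name.
    result[entry.first] = entry.second;
  }

  return result;
}
```

Because the result is built only from the two explicit inputs, no pattern matching or per-variable cleanup is needed.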

 Enable configuring Mesos with environment variables without having them leak 
 to tasks launched
 --

 Key: MESOS-2832
 URL: https://issues.apache.org/jira/browse/MESOS-2832
 Project: Mesos
  Issue Type: Wish
Reporter: Cody Maloney
Assignee: Benjamin Hindman
Priority: Critical
  Labels: mesosphere

 Currently if mesos is configured with environment variables (MESOS_MODULES), 
 those show up in every task which is launched unless the executor explicitly 
 cleans them up. 
 If the task being launched happens to be something libprocess / mesos based, 
 this can often prevent the task from starting up (A scheduler has issues 
 loading a module intended for the slave).
 There are also cases where it would be nice to be able to change what the 
 PATH is that tasks launch with (the host may have more in the path than tasks 
 are supposed to / allowed to depend upon).





[jira] [Created] (MESOS-2862) mesos-fetcher won't fetch uris which begin with a " "

2015-06-11 Thread Cody Maloney (JIRA)
Cody Maloney created MESOS-2862:
---

 Summary: mesos-fetcher won't fetch uris which begin with a " "
 Key: MESOS-2862
 URL: https://issues.apache.org/jira/browse/MESOS-2862
 Project: Mesos
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.22.1
Reporter: Cody Maloney
Priority: Minor


Discovered while running mesos with marathon on top. If I launch a marathon 
task with a URI which is 
" http://apache.osuosl.org/mesos/0.22.1/mesos-0.22.1.tar.gz" (note the leading 
space), mesos will log to stderr:

{code}
I0611 22:39:22.815636 35673 logging.cpp:177] Logging to STDERR
I0611 22:39:25.643889 35673 fetcher.cpp:214] Fetching URI ' http://apache.osuosl.org/mesos/0.22.1/mesos-0.22.1.tar.gz'
I0611 22:39:25.648111 35673 fetcher.cpp:94] Hadoop Client not available, skipping fetch with Hadoop Client
Failed to fetch:  http://apache.osuosl.org/mesos/0.22.1/mesos-0.22.1.tar.gz
Failed to synchronize with slave (it's probably exited)
{code}

It would be nice if mesos trimmed leading whitespace before doing protocol 
detection so that simple mistakes are just fixed. 
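
A minimal sketch of the suggested behaviour, trimming before looking for the scheme. The helper names {{trimWhitespace}} and {{detectScheme}} are hypothetical and this is not the fetcher's actual code; stout may already offer an equivalent trim helper. The sketch trims both ends, which is slightly more than the leading-whitespace case reported here.

```cpp
#include <string>

// Hypothetical helper: removes leading and trailing ASCII whitespace.
std::string trimWhitespace(const std::string& s)
{
  const char* whitespace = " \t\n\r\f\v";

  const std::string::size_type begin = s.find_first_not_of(whitespace);
  if (begin == std::string::npos) {
    return "";  // The string is empty or all whitespace.
  }

  const std::string::size_type end = s.find_last_not_of(whitespace);
  return s.substr(begin, end - begin + 1);
}

// Hypothetical helper: returns the scheme of a URI ("http" for
// "http://host/path"), or an empty string if no "://" is present.
// The URI is trimmed first so stray whitespace does not defeat detection.
std::string detectScheme(const std::string& uri)
{
  const std::string trimmed = trimWhitespace(uri);
  const std::string::size_type pos = trimmed.find("://");
  return pos == std::string::npos ? "" : trimmed.substr(0, pos);
}
```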





[jira] [Updated] (MESOS-1739) Allow slave reconfiguration on restart

2015-06-09 Thread Cody Maloney (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Maloney updated MESOS-1739:

Assignee: (was: Cody Maloney)

 Allow slave reconfiguration on restart
 --

 Key: MESOS-1739
 URL: https://issues.apache.org/jira/browse/MESOS-1739
 Project: Mesos
  Issue Type: Epic
Reporter: Patrick Reilly
  Labels: mesosphere, myriad

 Make it so that, either via a slave restart or an out-of-process reconfigure 
 ping, the attributes and resources of a slave can be updated to be a superset 
 of what they used to be.





[jira] [Created] (MESOS-2830) Add an endpoint to slaves to allow launching system administration tasks

2015-06-08 Thread Cody Maloney (JIRA)
Cody Maloney created MESOS-2830:
---

 Summary: Add an endpoint to slaves to allow launching system 
administration tasks
 Key: MESOS-2830
 URL: https://issues.apache.org/jira/browse/MESOS-2830
 Project: Mesos
  Issue Type: Wish
  Components: slave
Reporter: Cody Maloney
Priority: Minor


As a system administrator, I often need to run an organization-mandated task 
on every machine in the cluster. Ideally I could do this within the framework 
of mesos resources if it is a cleanup or auditing task, but sometimes I just 
have to run something, and run it now, regardless of whether a machine has 
unaccounted resources (e.g. adding/removing a user).

Currently, to do this I have to completely bypass Mesos and SSH to the box. 
Ideally I could tell a mesos slave (with proper authentication) to run a 
container with the limited special permissions needed to get the task done.





[jira] [Created] (MESOS-2832) Enable configuring Mesos with environment variables without having them leak to tasks launched

2015-06-08 Thread Cody Maloney (JIRA)
Cody Maloney created MESOS-2832:
---

 Summary: Enable configuring Mesos with environment variables 
without having them leak to tasks launched
 Key: MESOS-2832
 URL: https://issues.apache.org/jira/browse/MESOS-2832
 Project: Mesos
  Issue Type: Wish
Reporter: Cody Maloney
Priority: Critical


Currently if mesos is configured with environment variables (MESOS_MODULES), 
those show up in every task which is launched unless the executor explicitly 
cleans them up. 

If the task being launched happens to be something libprocess / mesos based, 
this can often prevent the task from starting up (A scheduler has issues 
loading a module intended for the slave).

There are also cases where it would be nice to be able to change what the PATH 
is that tasks launch with (the host may have more in the path than tasks are 
supposed to / allowed to depend upon).





[jira] [Created] (MESOS-2810) mesos-executor reimplements subprocess

2015-06-03 Thread Cody Maloney (JIRA)
Cody Maloney created MESOS-2810:
---

 Summary: mesos-executor reimplements subprocess
 Key: MESOS-2810
 URL: https://issues.apache.org/jira/browse/MESOS-2810
 Project: Mesos
  Issue Type: Improvement
  Components: slave
Reporter: Cody Maloney


The launchTask method is a re-implementation of libprocess subprocess

https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L110





[jira] [Created] (MESOS-2811) process/subprocess.hpp API hard to use, extend

2015-06-03 Thread Cody Maloney (JIRA)
Cody Maloney created MESOS-2811:
---

 Summary: process/subprocess.hpp API hard to use, extend
 Key: MESOS-2811
 URL: https://issues.apache.org/jira/browse/MESOS-2811
 Project: Mesos
  Issue Type: Improvement
  Components: slave
Affects Versions: 0.22.1
Reporter: Cody Maloney


https://github.com/apache/mesos/blob/master/3rdparty/libprocess/include/process/subprocess.hpp

There are many overloads of subprocess() construction, and a lot of them are 
very similar.

It passes the environment in as an 
{{Option<std::map<std::string, std::string>>}}, which isn't what stout's 
os::environment() returns 
({{hashmap<std::string, std::string> environment()}}). Ideally those should 
match, so that environments are easy to pass around and manipulate.

It isn't possible to tell it not to copy in the environment of the running 
process (useful to isolate slave environments from the running process). This 
becomes critical when configuring mesos via environment variables. Currently 
mesos explicitly unsets LIBPROCESS_IP when launching new processes because 
that one is known to cause problems when mesos launches another 
libprocess-based thing.

ExecEnv is just weird: it isn't great / modern C++, it results in a lot of 
unnecessary / useless copies of things as it currently stands, and it doesn't 
follow modern C++ interface standards.

The code is hard to read / follow:
{code}

  // Close the copies. We need to make sure that we do not close the
  // file descriptor assigned to stdin/stdout/stderr in case the
  // parent has closed stdin/stdout/stderr when calling this
  // function (in that case, a dup'ed file descriptor may have the
  // same file descriptor number as stdin/stdout/stderr).
  if (stdinFd[0] != STDIN_FILENO &&
      stdinFd[0] != STDOUT_FILENO &&
      stdinFd[0] != STDERR_FILENO) {
    while (::close(stdinFd[0]) == -1 && errno == EINTR);
  }
  if (stdoutFd[1] != STDIN_FILENO &&
      stdoutFd[1] != STDOUT_FILENO &&
      stdoutFd[1] != STDERR_FILENO) {
    while (::close(stdoutFd[1]) == -1 && errno == EINTR);
  }
  if (stderrFd[1] != STDIN_FILENO &&
      stderrFd[1] != STDOUT_FILENO &&
      stderrFd[1] != STDERR_FILENO) {
    while (::close(stderrFd[1]) == -1 && errno == EINTR);
  }
{code}
Why do we switch between fd[0] and fd[1]? Why are we hand-coding while-EINTR 
loops over and over? Doesn't stout have an os::close?
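
The three near-identical blocks in the snippet above could be folded into one helper. A minimal sketch, assuming the hypothetical names {{isStdio}} and {{closeUnlessStdio}} (neither is an existing stout or libprocess function). It deliberately keeps the retry-on-EINTR behaviour of the original code rather than changing semantics.

```cpp
#include <cerrno>

#include <unistd.h>

// Hypothetical helper: true if `fd` is one of the three standard streams.
bool isStdio(int fd)
{
  return fd == STDIN_FILENO || fd == STDOUT_FILENO || fd == STDERR_FILENO;
}

// Hypothetical helper: closes `fd` unless it is a standard stream,
// retrying while close() is interrupted by a signal (as the original
// code does). Returns 0 if the descriptor was closed or skipped, and
// -1 with errno set otherwise.
int closeUnlessStdio(int fd)
{
  if (isStdio(fd)) {
    return 0;
  }

  int result;
  do {
    result = ::close(fd);
  } while (result == -1 && errno == EINTR);

  return result;
}
```

With this, each block reduces to a single call such as {{closeUnlessStdio(stdinFd[0])}}.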

https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L165
 -- os::execvpe() can fail for perfectly good reasons, so we should really log 
the name of the command / info about what was being run. There shouldn't be a 
backtrace printed (which abort does).

A lot of the subprocess overloads needlessly re-implement functionality which 
the underlying exec() C APIs provide. Using those APIs instead of 
re-implementing all the variations would be a much better model.

Mesos doesn't use / need most of the subprocess overloads that exist. A lot of 
the usage patterns probably could / should be removed.





[jira] [Created] (MESOS-2812) Document mesos internal launching a container path

2015-06-03 Thread Cody Maloney (JIRA)
Cody Maloney created MESOS-2812:
---

 Summary: Document mesos internal launching a container path
 Key: MESOS-2812
 URL: https://issues.apache.org/jira/browse/MESOS-2812
 Project: Mesos
  Issue Type: Improvement
  Components: slave
Affects Versions: 0.22.1
Reporter: Cody Maloney


Sometimes mesos uses LinuxLauncher, sometimes it uses PosixLauncher. These both 
share a lot of implementation. Just because we're on Linux doesn't mean we use 
the LinuxLauncher. These rely on mesos-containerizer (another subprocess 
implementation) and mesos-executor (yet another subprocess launcher, in its 
launchTask method).





[jira] [Created] (MESOS-2814) os::read should have one implementation

2015-06-03 Thread Cody Maloney (JIRA)
Cody Maloney created MESOS-2814:
---

 Summary: os::read should have one implementation
 Key: MESOS-2814
 URL: https://issues.apache.org/jira/browse/MESOS-2814
 Project: Mesos
  Issue Type: Improvement
  Components: stout
Reporter: Cody Maloney


Currently stout os::read() has two radically different implementations 
depending on whether you give it a {{std::string}} or a {{const char *}}. 
Ideally these would share one implementation that does things like 
intelligently sizing the buffer it writes into, rather than re-allocating 
every time it lengthens the string (resulting in copious copying).
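
A rough sketch of what "one implementation" could look like. The name {{readFile}} and its bool-plus-out-parameter signature are hypothetical and chosen to keep the example standalone; stout's real os::read() has a different interface. The {{const char *}} overload simply forwards to the {{std::string}} one, and the buffer is reserved up front from the size reported by fstat().

```cpp
#include <cerrno>
#include <string>

#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical single implementation: reads the whole file at `path`
// into `out`. Returns false (leaving `out` untouched) on any error.
bool readFile(const std::string& path, std::string* out)
{
  const int fd = ::open(path.c_str(), O_RDONLY);
  if (fd == -1) {
    return false;
  }

  // Use the file size as a hint so the string is allocated once instead
  // of growing on every append. The hint may be 0 (e.g. for pipes), in
  // which case the loop below still works, just with reallocations.
  std::string buffer;
  struct stat s;
  if (::fstat(fd, &s) == 0 && s.st_size > 0) {
    buffer.reserve(static_cast<size_t>(s.st_size));
  }

  char chunk[4096];
  for (;;) {
    const ssize_t n = ::read(fd, chunk, sizeof(chunk));
    if (n > 0) {
      buffer.append(chunk, static_cast<size_t>(n));
    } else if (n == 0) {
      break;  // End of file.
    } else if (errno != EINTR) {
      ::close(fd);
      return false;
    }
  }

  ::close(fd);
  out->swap(buffer);
  return true;
}

// The C-string overload only forwards, so there is one code path.
bool readFile(const char* path, std::string* out)
{
  if (path == nullptr) {
    return false;
  }
  return readFile(std::string(path), out);
}
```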





[jira] [Commented] (MESOS-2131) Add a reverse proxy endpoint to mesos

2015-05-18 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14549538#comment-14549538
 ] 

Cody Maloney commented on MESOS-2131:
-

This is stalled at the moment (I haven't been working on it, and I'm heading 
out of town). I can talk to someone about the remaining issues and the path 
forward if they resurrect it.

 Add a reverse proxy endpoint to mesos
 -

 Key: MESOS-2131
 URL: https://issues.apache.org/jira/browse/MESOS-2131
 Project: Mesos
  Issue Type: Improvement
  Components: master, slave
Reporter: Cody Maloney
Assignee: Cody Maloney
Priority: Minor
  Labels: mesosphere

 A new libprocess Process inside mesos which allows attaching/detaching known 
 endpoints at a specific path.
 Ideally I want to be able to do things like attach 'slave-id' and pass HTTP 
 requests on to that slave:
 Sample endpoint actions:
 C++ api:
 attach(std::string name, Node target): Add a new reverse proxy path
 detach(std::string name): Remove an established reverse proxy path
 HTTP endpoints:
 /proxy/go/{name}
  - Prefix matches a path, forwards the remaining path onto the remote endpoint
 /proxy/debug.json
  - Prints out all attached endpoints.
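
The attach/detach and prefix-matching behaviour described above can be sketched as a plain routing table. This is an illustration only: {{Target}}, {{Route}}, {{ProxyTable}} and {{resolve}} are hypothetical names, {{Target}} stands in for the libprocess {{Node}} mentioned in the description, and no actual HTTP forwarding is shown.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the address of an attached endpoint.
struct Target
{
  std::string host;
  int port;
};

// Result of resolving a request path against the table.
struct Route
{
  bool found;
  Target target;
  std::string remainingPath;  // Path to forward to the target.
};

class ProxyTable
{
public:
  // Adds (or replaces) a reverse proxy path.
  void attach(const std::string& name, const Target& target)
  {
    targets_[name] = target;
  }

  // Removes an established reverse proxy path.
  void detach(const std::string& name)
  {
    targets_.erase(name);
  }

  // Matches "/proxy/go/{name}[/rest]" and returns the attached target
  // together with the remaining path ("/" if nothing follows the name).
  Route resolve(const std::string& path) const
  {
    static const std::string prefix = "/proxy/go/";
    Route route{false, Target{"", 0}, ""};

    if (path.compare(0, prefix.size(), prefix) != 0) {
      return route;
    }

    const std::string::size_type nameEnd = path.find('/', prefix.size());
    const std::string name = nameEnd == std::string::npos
      ? path.substr(prefix.size())
      : path.substr(prefix.size(), nameEnd - prefix.size());

    const std::map<std::string, Target>::const_iterator it =
      targets_.find(name);
    if (it == targets_.end()) {
      return route;
    }

    route.found = true;
    route.target = it->second;
    route.remainingPath =
      nameEnd == std::string::npos ? "/" : path.substr(nameEnd);
    return route;
  }

private:
  std::map<std::string, Target> targets_;
};
```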





[jira] [Commented] (MESOS-1375) Log rotation capable

2015-05-15 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546016#comment-14546016
 ] 

Cody Maloney commented on MESOS-1375:
-

For configuring things, even with the Mesosphere init scripts, the current 
init wrappers let you add arbitrary flags as well as set environment 
variables, which will be sourced. That said, we've definitely felt the pain of 
those old init scripts (our newer mesos packaging used in DCOS foregoes them 
completely), and we may actually look at removing them in a new generation of 
the packaging.

 Log rotation capable
 

 Key: MESOS-1375
 URL: https://issues.apache.org/jira/browse/MESOS-1375
 Project: Mesos
  Issue Type: Improvement
  Components: master, slave
Affects Versions: 0.18.0
Reporter: Damien Hardy
  Labels: ops, twitter

 Please provide a way to let ops manage logs.
 A log4j-like configuration would be hard, but at least make rotation possible 
 without restarting the service. 
 Basing it on the external logrotate tool would be great:
  * write to a constant log file name
  * check for file change (recreated by logrotate) before write





[jira] [Commented] (MESOS-1303) ExamplesTest.{TestFramework, NoExecutorFramework} flaky

2015-05-13 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14542583#comment-14542583
 ] 

Cody Maloney commented on MESOS-1303:
-

[~tillt] Would it be reasonable to just implement dirname ourselves in C++? 
What people expect to happen isn't that hard to get right (although we need to 
make sure we don't break expectations around things that end in '/').
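
For reference, a minimal sketch of such a function, assuming the goal is to match the results POSIX specifies for dirname() while working on a {{std::string}}. The name {{dirnameOf}} is hypothetical. One reason to avoid the C function is that POSIX allows dirname() to modify its argument and to return a pointer to static storage.

```cpp
#include <string>

// Hypothetical helper: returns the directory portion of `path`,
// following the results POSIX specifies for dirname(), without
// modifying the input.
std::string dirnameOf(const std::string& path)
{
  if (path.empty()) {
    return ".";
  }

  // Ignore trailing slashes: "/usr/lib/" is treated like "/usr/lib".
  const std::string::size_type end = path.find_last_not_of('/');
  if (end == std::string::npos) {
    return "/";  // The path consists only of slashes.
  }

  // Find the slash that precedes the last component.
  const std::string::size_type slash = path.find_last_of('/', end);
  if (slash == std::string::npos) {
    return ".";  // No directory part, e.g. "usr" or "usr/".
  }

  // Drop the separating slash(es); "a//b" yields "a".
  const std::string::size_type dirEnd = path.find_last_not_of('/', slash);
  if (dirEnd == std::string::npos) {
    return "/";  // The last component sits directly under the root.
  }

  return path.substr(0, dirEnd + 1);
}
```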

 ExamplesTest.{TestFramework, NoExecutorFramework} flaky
 ---

 Key: MESOS-1303
 URL: https://issues.apache.org/jira/browse/MESOS-1303
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Ian Downes
  Labels: flaky

 I'm having trouble reproducing this but I did observe it once on my OSX 
 system:
 {noformat}
 [==] Running 2 tests from 1 test case.
 [--] Global test environment set-up.
 [--] 2 tests from ExamplesTest
 [ RUN  ] ExamplesTest.TestFramework
 ../../src/tests/script.cpp:81: Failure
 Failed
 test_framework_test.sh terminated with signal 'Abort trap: 6'
 [  FAILED  ] ExamplesTest.TestFramework (953 ms)
 [ RUN  ] ExamplesTest.NoExecutorFramework
 [   OK ] ExamplesTest.NoExecutorFramework (10162 ms)
 [--] 2 tests from ExamplesTest (5 ms total)
 [--] Global test environment tear-down
 [==] 2 tests from 1 test case ran. (11121 ms total)
 [  PASSED  ] 1 test.
 [  FAILED  ] 1 test, listed below:
 [  FAILED  ] ExamplesTest.TestFramework
 {noformat}
 when investigating a failed make check for https://reviews.apache.org/r/20971/
 {noformat}
 [--] 6 tests from ExamplesTest
 [ RUN  ] ExamplesTest.TestFramework
 [   OK ] ExamplesTest.TestFramework (8643 ms)
 [ RUN  ] ExamplesTest.NoExecutorFramework
 tests/script.cpp:81: Failure
 Failed
 no_executor_framework_test.sh terminated with signal 'Aborted'
 [  FAILED  ] ExamplesTest.NoExecutorFramework (7220 ms)
 [ RUN  ] ExamplesTest.JavaFramework
 [   OK ] ExamplesTest.JavaFramework (11181 ms)
 [ RUN  ] ExamplesTest.JavaException
 [   OK ] ExamplesTest.JavaException (5624 ms)
 [ RUN  ] ExamplesTest.JavaLog
 [   OK ] ExamplesTest.JavaLog (6472 ms)
 [ RUN  ] ExamplesTest.PythonFramework
 [   OK ] ExamplesTest.PythonFramework (14467 ms)
 [--] 6 tests from ExamplesTest (53607 ms total)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1739) Allow slave reconfiguration on restart

2015-05-09 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14536280#comment-14536280
 ] 

Cody Maloney commented on MESOS-1739:
-

The biggest thing which came up in my old patchset was race conditions around 
re-registering, caused by how the mesos registerSlave / reregisterSlave code is 
set up, which will probably need some structural reworking. 

The case that was broken in my patch set is when a slave tries to register 
multiple times because it hasn't gotten a response from the master yet, and 
one or more of those retries aren't identical to the first because they 
contain different resources / attributes (the slave started re-registration, 
then was restarted with new attributes before the master fully processed it). 
The master doesn't notice and just discards them as repeats.

 Allow slave reconfiguration on restart
 --

 Key: MESOS-1739
 URL: https://issues.apache.org/jira/browse/MESOS-1739
 Project: Mesos
  Issue Type: Epic
Reporter: Patrick Reilly
Assignee: Cody Maloney
  Labels: mesosphere

 Make it so that, either via a slave restart or an out-of-process reconfigure 
 ping, the attributes and resources of a slave can be updated to be a superset 
 of what they used to be.





[jira] [Commented] (MESOS-2690) --enable-optimize build fails with maybe-uninitialized

2015-05-06 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531119#comment-14531119
 ] 

Cody Maloney commented on MESOS-2690:
-

So with {{\-\-enable\-optimize}}, the script adds {{\-O2}} as a user shortcut. 
If people touch CXXFLAGS themselves, it doesn't do anything (or at least it 
didn't use to; with https://reviews.apache.org/r/33828/ we now always add a 
flag, even when we don't add {{\-O2}}, which is something I should have caught 
in my review...). The magic shortcuts just make these combinations easier to 
use and get right (much like how we add very specific flags when we see you 
are using compiler X, so that mesos builds without you needing to manually 
specify CXXFLAGS to work around specific compiler versions).

 --enable-optimize build fails with maybe-uninitialized
 --

 Key: MESOS-2690
 URL: https://issues.apache.org/jira/browse/MESOS-2690
 Project: Mesos
  Issue Type: Bug
  Components: build
 Environment: GCC 4.8 - 4.9
Reporter: Joris Van Remoortere
Assignee: Joris Van Remoortere
Priority: Blocker

 When building with the `--enable-optimize` flag, the build fails with 
 `maybe-uninitialized` errors.
 This is due to a bug in GCC when building optimized code triggering false 
 positives for this warning. Please see:
 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59970
 We can disable this warning when using GCC + --enable-optimize.
 A quick work-around until there is a patch:
 ../configure CXXFLAGS=-Wno-maybe-uninitialized your-other-flags-here





[jira] [Comment Edited] (MESOS-2690) --enable-optimize build fails with maybe-uninitialized

2015-05-06 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14531060#comment-14531060
 ] 

Cody Maloney edited comment on MESOS-2690 at 5/6/15 6:03 PM:
-

Grepping for {{-O2}} in {{CXXFLAGS}} is fairly fragile, and moderately unsafe, 
because the trigger is one particular GCC optimization which happens to be 
included in {{-O2}}. Unless we implement parsing all of GCC's flags and find 
which one enables the optimization that trips {{-Wmaybe-uninitialized}}, we've 
made a very, very environment-specific patch to work around a particular bug 
which could quite likely be fixed in a point release of GCC at some point, 
rendering the code incorrect.

In the spec file you can fairly simply add the flag to {{CXXFLAGS}} and pass it 
into configure like all the other manually-set {{CXXFLAGS}}. Whatever sets the 
optimization flag that doesn't work well with our warning flags should also set 
the bypass for the bug that pops up. It's all set on the outside and works its 
way in. 
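
To make the fragility concrete, here is a small sketch in Python (not the actual configure logic, which is shell). Both function names are made up for illustration. The first mimics a literal match for {{-O2}}; the second looks for any {{-O}} level and relies on GCC's documented rule that the last {{-O}} option on the command line is the effective one. Even the second is only an approximation of what GCC does with its flags.

```python
import shlex


def naive_has_optimization(cxxflags):
    """Mimics a literal grep for -O2 in the flag string."""
    return "-O2" in shlex.split(cxxflags)


def any_optimization_level(cxxflags):
    """Looks for any -O<level> flag; the last one given is the effective one."""
    levels = [flag for flag in shlex.split(cxxflags) if flag.startswith("-O")]
    return bool(levels) and levels[-1] != "-O0"
```

A user passing {{-g -O3}} gets optimized code, yet the literal match reports no optimization, so the workaround would silently not be applied.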

From the automake manual:
{code}
This section attempts to answer all the above questions. We will mostly discuss 
CPPFLAGS in our examples, but actually the answer holds for all the compile 
flags used in Automake: CCASFLAGS, CFLAGS, CPPFLAGS, CXXFLAGS, FCFLAGS, FFLAGS, 
GCJFLAGS, LDFLAGS, LFLAGS, LIBTOOLFLAGS, OBJCFLAGS, OBJCXXFLAGS, RFLAGS, 
UPCFLAGS, and YFLAGS.
{code}
...
{code}
You should not add options to these user variables within configure either, for 
the same reason. Occasionally you need to modify these variables to perform a 
test, but you should reset their values afterwards. In contrast, it is OK to 
modify the ‘AM_’ variables within configure if you AC_SUBST them, but it is 
rather rare that you need to do this, unless you really want to change the 
default definitions of the ‘AM_’ variables in all Makefiles.
{code} -- 
http://www.gnu.org/software/automake/manual/html_node/Flag-Variables-Ordering.html





 --enable-optimize build fails with maybe-uninitialized
 --

 Key: MESOS-2690
 URL: https://issues.apache.org/jira/browse/MESOS-2690
 Project: Mesos
  Issue Type: Bug
  Components: build
 Environment: GCC 4.8 - 4.9
Reporter: Joris Van Remoortere
Assignee: Joris Van Remoortere
Priority: Blocker

 When building with the `--enable-optimize` flag, the build fails with 
 `maybe-uninitialized` errors.
 This is due to a bug in GCC when building optimized code triggering false 
 positives for this warning. Please see:
 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59970
 We can disable this warning when using GCC + --enable-optimize.
 A quick work-around until there is a patch:
 ../configure CXXFLAGS=-Wno-maybe-uninitialized your-other-flags-here





[jira] [Commented] (MESOS-2690) --enable-optimize build fails with maybe-uninitialized

2015-05-06 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14531060#comment-14531060
 ] 

Cody Maloney commented on MESOS-2690:
-

Grepping for {{-O2}} in {{CXXFLAGS}} is fairly fragile, and moderately unsafe, 
because the trigger is one particular GCC optimization which happens to be 
included in {{-O2}}. Unless we implement parsing all of GCC's flags and find 
which one enables the optimization that trips {{-Wmaybe-uninitialized}}, we've 
made a very, very environment-specific patch to work around a particular bug 
which could quite likely be fixed in a point release of GCC at some point, 
rendering the code incorrect.

In the spec file you can fairly simply add the flag to {{CXXFLAGS}} and pass it 
into configure like all the other manually-set {{CXXFLAGS}}. Whatever sets the 
optimization flag that doesn't work well with our warning flags should also set 
the bypass for the bug that pops up. It's all set on the outside and works its 
way in. 

From the automake manual:
{code}
This section attempts to answer all the above questions. We will mostly discuss 
CPPFLAGS in our examples, but actually the answer holds for all the compile 
flags used in Automake: CCASFLAGS, CFLAGS, CPPFLAGS, CXXFLAGS, FCFLAGS, FFLAGS, 
GCJFLAGS, LDFLAGS, LFLAGS, LIBTOOLFLAGS, OBJCFLAGS, OBJCXXFLAGS, RFLAGS, 
UPCFLAGS, and YFLAGS.
{code}
...
{code}
You should not add options to these user variables within configure either, for 
the same reason. Occasionally you need to modify these variables to perform a 
test, but you should reset their values afterwards. In contrast, it is OK to 
modify the ‘AM_’ variables within configure if you AC_SUBST them, but it is 
rather rare that you need to do this, unless you really want to change the 
default definitions of the ‘AM_’ variables in all Makefiles.
{code} -- 
http://www.gnu.org/software/automake/manual/html_node/Flag-Variables-Ordering.html


 --enable-optimize build fails with maybe-uninitialized
 --

 Key: MESOS-2690
 URL: https://issues.apache.org/jira/browse/MESOS-2690
 Project: Mesos
  Issue Type: Bug
  Components: build
 Environment: GCC 4.8 - 4.9
Reporter: Joris Van Remoortere
Assignee: Joris Van Remoortere
Priority: Blocker

 When building with the `--enable-optimize` flag, the build fails with 
 `maybe-uninitialized` errors.
 This is due to a bug in GCC when building optimized code triggering false 
 positives for this warning. Please see:
 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59970
 We can disable this warning when using GCC + --enable-optimize.
 A quick work-around until there is a patch:
 ../configure CXXFLAGS=-Wno-maybe-uninitialized your-other-flags-here





[jira] [Commented] (MESOS-2690) --enable-optimize build fails with maybe-uninitialized

2015-05-04 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527543#comment-14527543
 ] 

Cody Maloney commented on MESOS-2690:
-

Not if you still want optimization / debug info. For that you need to include 
it in CXXFLAGS / CFLAGS yourself, so something like: `-O2 -Wno-maybe-uninitialized`. 
--enable-optimize and --enable-debug don't modify CFLAGS/CXXFLAGS if they are 
passed in by the user.
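
A rough sketch of that rule, written in Python for readability (the real logic lives in the configure script). The function name is invented, and using `-g` for --enable-debug is an assumption made for the example; only the `-O2` for --enable-optimize comes from this thread.

```python
def effective_cxxflags(user_cxxflags, enable_optimize=False, enable_debug=False):
    """Return the CXXFLAGS a configure-like script would end up using."""
    if user_cxxflags is not None:
        # The user passed CXXFLAGS explicitly: the shortcuts leave them alone.
        return user_cxxflags
    flags = []
    if enable_debug:
        flags.append("-g")  # assumed flag, for illustration only
    if enable_optimize:
        flags.append("-O2")
    return " ".join(flags)
```

So `CXXFLAGS=-Wno-maybe-uninitialized` together with --enable-optimize ends up with no `-O2` at all, which is why the optimization flag has to be spelled out alongside the workaround.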

 --enable-optimize build fails with maybe-uninitialized
 --

 Key: MESOS-2690
 URL: https://issues.apache.org/jira/browse/MESOS-2690
 Project: Mesos
  Issue Type: Bug
  Components: build
 Environment: GCC 4.8 - 4.9
Reporter: Joris Van Remoortere
Assignee: Joris Van Remoortere

 When building with the `--enable-optimize` flag, the build fails with 
 `maybe-uninitialized` errors.
 This is due to a bug in GCC when building optimized code triggering false 
 positives for this warning. Please see:
 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59970
 We can disable this warning when using GCC + --enable-optimize.
 A quick work-around until there is a patch:
 ../configure CXXFLAGS=-Wno-maybe-uninitialized your-other-flags-here





[jira] [Commented] (MESOS-1375) Log rotation capable

2015-05-04 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527512#comment-14527512
 ] 

Cody Maloney commented on MESOS-1375:
-

Another option that would be really nice is to integrate with systemd / 
journald when on one of those hosts and just use the journal. That way the log 
files are properly size-capped / rotated, and things could eventually use more 
structured, auditable logging if they want.

 Log rotation capable
 

 Key: MESOS-1375
 URL: https://issues.apache.org/jira/browse/MESOS-1375
 Project: Mesos
  Issue Type: Improvement
  Components: master, slave
Affects Versions: 0.18.0
Reporter: Damien Hardy
  Labels: ops, twitter

 Please provide a way to let ops manage logs.
 A log4j-like configuration would be hard, but at least make rotation possible 
 without restarting the service. 
 Basing it on the external logrotate tool would be great:
  * write to a constant log file name
  * check for file change (recreated by logrotate) before write
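
 The two points above could look roughly like the following Python sketch. It is an illustration under assumptions, not Mesos code: the class name is invented, and "file change" is detected by comparing the inode and device of the path on disk with those of the open handle.

```python
import os


class ReopeningLog:
    """Appends to a fixed path, reopening it if the file was replaced."""

    def __init__(self, path):
        self.path = path
        self.handle = open(path, "a")

    def _replaced(self):
        try:
            on_disk = os.stat(self.path)
        except FileNotFoundError:
            return True  # rotated away and not yet recreated
        ours = os.fstat(self.handle.fileno())
        return (on_disk.st_ino, on_disk.st_dev) != (ours.st_ino, ours.st_dev)

    def write(self, line):
        # Check for rotation before every write, then append and flush.
        if self._replaced():
            self.handle.close()
            self.handle = open(self.path, "a")
        self.handle.write(line + "\n")
        self.handle.flush()

    def close(self):
        self.handle.close()
```

 The extra stat call per write is the price for not needing a signal or a restart when logrotate moves the file.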





[jira] [Resolved] (MESOS-2604) Upgrade minimum required compilers for MESOS

2015-05-04 Thread Cody Maloney (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Maloney resolved MESOS-2604.
-
   Resolution: Fixed
Fix Version/s: 0.23.0

 Upgrade minimum required compilers for MESOS
 

 Key: MESOS-2604
 URL: https://issues.apache.org/jira/browse/MESOS-2604
 Project: Mesos
  Issue Type: Improvement
  Components: build
Affects Versions: 0.23.0
Reporter: Cody Maloney
Assignee: Cody Maloney
  Labels: c++11
 Fix For: 0.23.0


 As discussed in the last community meeting, we would like to upgrade the 
 minimum Mesos compiler versions to GCC 4.8+ and Clang 3.5+: GCC primarily for 
 Linux, and Clang for OS X as well as for Linux, to enable Mesos tooling 
 improvements 
 ([clang-format|http://mesos.apache.org/documentation/clang-format/], 
 clang-tidy among others).
 Some documents for reference:
 [Compilers by Distribution 
 Version|https://docs.google.com/spreadsheets/d/1Ji8p3p_1JqUsMxE31mJqqztHf7LDx7mGMXh253azWpU/edit?usp=sharing]
 Shows we can get GCC 4.8+ or clang 3.5+ on all supported platforms.
 C++11 features supported by each compiler:
 [https://gcc.gnu.org/projects/cxx0x.html]
 [http://clang.llvm.org/cxx_status.html]





[jira] [Commented] (MESOS-2604) Upgrade minimum required compilers for MESOS

2015-05-04 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527255#comment-14527255
 ] 

Cody Maloney commented on MESOS-2604:
-

{code}
author     Cody Maloney c...@mesosphere.io
           Thu, 23 Apr 2015 14:38:48 -0700 (14:38 -0700)
committer  Benjamin Hindman benjamin.hind...@gmail.com
           Sat, 25 Apr 2015 16:21:46 -0700 (16:21 -0700)
commit     0f5c78fad3423181f7227027eb42d162811514e7
tree       5c6158257e926e29279e5eee13f189f46cf8fe07
parent     b4bbfd6ae0c5287d0328caeff89d0c574ae4a546

Warn if g++ < 4.8 or a C++ standard library is too old for Mesos.

After this a whole bunch more of the C++11 checks can be removed, we
can unconditionally use -std=c++11, among other things with this
change.

Note that we don't explicitly check the clang version number since
extracting it is hard (OS X clang behaves differently than Linux
clang), and 'clang -dumpversion' always reports 4.2.1 for
compatibility with some random tools that used GCC.
{code}

{code}
author     Benjamin Hindman benjamin.hind...@gmail.com
           Sat, 25 Apr 2015 16:06:38 -0700 (16:06 -0700)
committer  Benjamin Hindman benjamin.hind...@gmail.com
           Sat, 25 Apr 2015 16:21:35 -0700 (16:21 -0700)
commit     b4bbfd6ae0c5287d0328caeff89d0c574ae4a546
tree       68a2adab47a3e93e95064ad5dce87e1a99f726c3
parent     4919aa52a9eae4af0874cb41e3a1a6d10c2eafa7

Warn if g++ < 4.8 or a C++ standard library is too old for libprocess.

After this a whole bunch more of the C++11 checks can be removed, we
can unconditionally use -std=c++11, among other things with this
change.

Note that we don't explicitly check the clang version number since
extracting it is hard (OS X clang behaves differently than Linux
clang), and 'clang -dumpversion' always reports 4.2.1 for
compatibility with some random tools that used GCC.
{code}

 Upgrade minimum required compilers for MESOS
 

 Key: MESOS-2604
 URL: https://issues.apache.org/jira/browse/MESOS-2604
 Project: Mesos
  Issue Type: Improvement
  Components: build
Affects Versions: 0.23.0
Reporter: Cody Maloney
Assignee: Cody Maloney
  Labels: c++11

 As discussed in the last community meeting, we would like to upgrade the 
 minimum Mesos compiler versions to GCC 4.8+ and Clang 3.5+: GCC primarily for 
 Linux, and Clang for OS X as well as for Linux, to enable Mesos tooling 
 improvements 
 ([clang-format|http://mesos.apache.org/documentation/clang-format/], 
 clang-tidy among others).
 Some documents for reference:
 [Compilers by Distribution 
 Version|https://docs.google.com/spreadsheets/d/1Ji8p3p_1JqUsMxE31mJqqztHf7LDx7mGMXh253azWpU/edit?usp=sharing]
 Shows we can get GCC 4.8+ or clang 3.5+ on all supported platforms.
 C++11 features supported by each compiler:
 [https://gcc.gnu.org/projects/cxx0x.html]
 [http://clang.llvm.org/cxx_status.html]





[jira] [Commented] (MESOS-2644) AS a framework developer I WANT to check and depend on a Mesos version

2015-04-22 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14507931#comment-14507931
 ] 

Cody Maloney commented on MESOS-2644:
-

We may also want to think about exposing 'feature' flags which schedulers can 
depend upon rather than hard version requirements. Could be useful for when a 
feature needs to be hot-patched in (Or when working off a fork for testing out 
a feature), then lands in a later release.
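
As a sketch of what depending on feature flags might look like from the scheduler side (Python, with invented names; Mesos does not advertise such a list today, and the feature name below is only an example):

```python
def missing_features(advertised, required):
    """Return the required feature names that the master does not advertise."""
    return sorted(set(required) - set(advertised))


def check_master(advertised, required):
    """Abort scheduler startup if any required feature is missing."""
    missing = missing_features(advertised, required)
    if missing:
        raise RuntimeError("master lacks required features: " + ", ".join(missing))


# A hot-patched 0.22 build could advertise the feature before 0.23 ships.
check_master(advertised={"persistent_volumes"}, required=["persistent_volumes"])
```

The check passes on any build that carries the feature, whatever its version string says.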

 AS a framework developer I WANT to check and depend on a Mesos version
 --

 Key: MESOS-2644
 URL: https://issues.apache.org/jira/browse/MESOS-2644
 Project: Mesos
  Issue Type: Story
  Components: framework
Affects Versions: 0.22.0
Reporter: Aaron Bell

 Example: I'm developing a framework that makes use of persistent volumes, 
 MESOS-1554. At startup I want my scheduler to verify the Mesos master's 
 version and abort if it's less than e.g. {{0.23.0}}, which I know is the 
 minimum version for that feature.
 I've looked at MESOS-753 and MESOS-986 and they don't seem to address this 
 cleanly.
 Version may be available in {{state.json}}, but this is an unboundedly large 
 value to parse. It would seem sensible to have an HTTP endpoint {{/version}} 
 or similar.
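
 For illustration, the startup check could be as small as the Python sketch below. The helper names are hypothetical, and how the version string is obtained (for example from the proposed {{/version}} endpoint) is deliberately left out. It assumes purely numeric dotted versions.

```python
def parse_version(text):
    """Turn a dotted numeric version such as '0.23.0' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def require_minimum(master_version, minimum):
    """Abort if the master is older than the minimum supported version."""
    if parse_version(master_version) < parse_version(minimum):
        raise RuntimeError(
            "Mesos master %s is older than required %s" % (master_version, minimum)
        )


require_minimum("0.23.0", "0.23.0")  # passes silently
```

 Comparing tuples rather than raw strings matters: as strings, "0.9.0" sorts after "0.23.0".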





[jira] [Commented] (MESOS-2144) Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread

2015-04-16 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498789#comment-14498789
 ] 

Cody Maloney commented on MESOS-2144:
-

Just got one of these with full backtrace:
{code}
I0416 12:21:01.673476 36776 authenticatee.hpp:115] Initializing client SASL
@0x110e9284a  google::LogMessage::Fail()
@0x110e917dd  google::LogMessage::SendToLog()
@0x110e924ea  google::LogMessage::Flush()
@0x110e99348  google::LogMessageFatal::~LogMessageFatal()
I0416 12:21:01.747539 308416512 process.cpp:2091] Resuming 
reaper(1)@127.0.0.1:52842 at 2015-04-16 19:21:33.747597056+00:00
@0x110e92ca5  google::LogMessageFatal::~LogMessageFatal()
@0x10f3d33d3  _CheckFatal::~_CheckFatal()
@0x10f3d3025  _CheckFatal::~_CheckFatal()
@0x10fd94da6  mesos::internal::slave::Slave::__recover()
@0x10fe7f09d  
_ZZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS_6FutureI7NothingEES7_EEvRKNS_3PIDIT_EEMSB_FvT0_ET1_ENKUlPNS_11ProcessBaseEE_clESK_
@0x10fe7ee7f  
_ZNSt3__110__function6__funcIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS2_6FutureI7NothingEESA_EEvRKNS2_3PIDIT_EEMSE_FvT0_ET1_EUlPNS2_11ProcessBaseEE_NS_9allocatorISO_EEFvSN_EEclEOSN_
@0x110d74e7b  std::__1::function::operator()()
@0x110d5c5bf  process::ProcessBase::visit()
@0x110de6c0e  process::DispatchEvent::visit()
@0x10f3d0841  process::ProcessBase::serve()
@0x110d45abe  process::ProcessManager::resume()
@0x110d451de  process::schedule()
@ 0x7fff8f1eb268  _pthread_body
@ 0x7fff8f1eb1e5  _pthread_start
@ 0x7fff8f1e941d  thread_start
{code}

The full log from the test (MESOS_VERBOSE, GLOG_v=2)
{code}
[ RUN  ] ExamplesTest.LowLevelSchedulerPthread
Using temporary directory '/tmp/ExamplesTest_LowLevelSchedulerPthread_vVqryS'
I0416 12:21:01.637110 2105078528 logging.cpp:177] Logging to STDERR
Enabling authentication for the scheduler
I0416 12:21:01.639566 2105078528 process.cpp:2081] Spawned process 
__gc__@127.0.0.1:52945
I0416 12:21:01.639770 2105078528 process.cpp:2081] Spawned process 
help@127.0.0.1:52945
I0416 12:21:01.639583 365723648 process.cpp:2091] Resuming 
__gc__@127.0.0.1:52945 at 2015-04-16 19:21:01.639622912+00:00
I0416 12:21:01.639777 367869952 process.cpp:2091] Resuming help@127.0.0.1:52945 
at 2015-04-16 19:21:01.639796992+00:00
I0416 12:21:01.639875 366260224 process.cpp:2091] Resuming 
logging@127.0.0.1:52945 at 2015-04-16 19:21:01.639906816+00:00
I0416 12:21:01.639909 2105078528 process.cpp:2081] Spawned process 
logging@127.0.0.1:52945
I0416 12:21:01.639978 367869952 process.cpp:2091] Resuming 
profiler@127.0.0.1:52945 at 2015-04-16 19:21:01.640003840+00:00
I0416 12:21:01.640033 368943104 process.cpp:2091] Resuming help@127.0.0.1:52945 
at 2015-04-16 19:21:01.640058880+00:00
I0416 12:21:01.640051 2105078528 process.cpp:2081] Spawned process 
profiler@127.0.0.1:52945
I0416 12:21:01.640246 368406528 process.cpp:2091] Resuming 
system@127.0.0.1:52945 at 2015-04-16 19:21:01.640268032+00:00
I0416 12:21:01.640236 368943104 process.cpp:2091] Resuming 
__gc__@127.0.0.1:52945 at 2015-04-16 19:21:01.640258048+00:00
I0416 12:21:01.640318 2105078528 process.cpp:2081] Spawned process 
system@127.0.0.1:52945
I0416 12:21:01.640321 368943104 process.cpp:2091] Resuming 
__limiter__(1)@127.0.0.1:52945 at 2015-04-16 19:21:01.640336128+00:00
I0416 12:21:01.640390 368406528 process.cpp:2081] Spawned process 
__limiter__(1)@127.0.0.1:52945
I0416 12:21:01.640425 365723648 process.cpp:2091] Resuming 
metrics@127.0.0.1:52945 at 2015-04-16 19:21:01.640440064+00:00
I0416 12:21:01.640472 368406528 process.cpp:2081] Spawned process 
metrics@127.0.0.1:52945
I0416 12:21:01.640521 367869952 process.cpp:2091] Resuming help@127.0.0.1:52945 
at 2015-04-16 19:21:01.640538880+00:00
I0416 12:21:01.640733 366796800 process.cpp:2091] Resuming help@127.0.0.1:52945 
at 2015-04-16 19:21:01.640760064+00:00
I0416 12:21:01.640913 2105078528 process.cpp:2081] Spawned process 
__processes__@127.0.0.1:52945
I0416 12:21:01.640919 366796800 process.cpp:2091] Resuming 
__processes__@127.0.0.1:52945 at 2015-04-16 19:21:01.640937984+00:00
I0416 12:21:01.640949 2105078528 process.cpp:912] libprocess is initialized on 
127.0.0.1:52945 for 8 cpus
I0416 12:21:01.640971 365723648 process.cpp:2091] Resuming help@127.0.0.1:52945 
at 2015-04-16 19:21:01.640985856+00:00
W0416 12:21:01.641326 2105078528 scheduler.cpp:134] 
**
Scheduler driver bound to loopback interface! Cannot communicate with remote 
master(s). You might want to set 'LIBPROCESS_IP' environment variable to use a 
routable IP address.
**
I0416 12:21:01.641348 2105078528 scheduler.cpp:149] Version: 0.23.0
I0416 

[jira] [Created] (MESOS-2627) ExamplesTest.PersistentVolumeFramework is flaky on OS X

2015-04-16 Thread Cody Maloney (JIRA)
Cody Maloney created MESOS-2627:
---

 Summary: ExamplesTest.PersistentVolumeFramework is flaky on OS X
 Key: MESOS-2627
 URL: https://issues.apache.org/jira/browse/MESOS-2627
 Project: Mesos
  Issue Type: Bug
 Environment: OS X Yosemite
Reporter: Cody Maloney


This just failed for the first time on our OS X bot (it is far less frequently 
flaky than the other ExamplesTest, but still flaky) while building master at 
commit f6620f851f635b3346c6ebf878152f38b3932ad9. None of the commits in the set 
touched / changed anything in the test.

{code}
[ RUN  ] ExamplesTest.PersistentVolumeFramework 
../../src/tests/script.cpp:83: Failure Failed 
persistent_volume_framework_test.sh terminated with signal Abort trap: 6 
[  FAILED  ] ExamplesTest.PersistentVolumeFramework (7865 ms)
{code}





[jira] [Commented] (MESOS-2627) ExamplesTest.PersistentVolumeFramework is flaky on OS X

2015-04-16 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498472#comment-14498472
 ] 

Cody Maloney commented on MESOS-2627:
-

[~jieyu] any clue why this might be flaky on OS X?

 ExamplesTest.PersistentVolumeFramework is flaky on OS X
 ---

 Key: MESOS-2627
 URL: https://issues.apache.org/jira/browse/MESOS-2627
 Project: Mesos
  Issue Type: Bug
 Environment: OS X Yosemite
Reporter: Cody Maloney
  Labels: flaky, flaky-test

 This just failed for the first time on our OS X bot (it is far less frequently 
 flaky than the other ExamplesTest, but still flaky) while building master at 
 commit f6620f851f635b3346c6ebf878152f38b3932ad9. None of the commits in the 
 set touched / changed anything in the test.
 {code}
 [ RUN  ] ExamplesTest.PersistentVolumeFramework 
 ../../src/tests/script.cpp:83: Failure Failed 
 persistent_volume_framework_test.sh terminated with signal Abort trap: 6 
 [  FAILED  ] ExamplesTest.PersistentVolumeFramework (7865 ms)
 {code}





[jira] [Commented] (MESOS-2627) ExamplesTest.PersistentVolumeFramework is flaky on OS X

2015-04-16 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498484#comment-14498484
 ] 

Cody Maloney commented on MESOS-2627:
-

No, want me to just turn GLOG_v=2 on for the box and I'll ping when it happens 
again?

 ExamplesTest.PersistentVolumeFramework is flaky on OS X
 ---

 Key: MESOS-2627
 URL: https://issues.apache.org/jira/browse/MESOS-2627
 Project: Mesos
  Issue Type: Bug
 Environment: OS X Yosemite
Reporter: Cody Maloney
  Labels: flaky, flaky-test

 This just failed for the first time on our OS X bot (it is far less frequently 
 flaky than the other ExamplesTest, but still flaky) while building master at 
 commit f6620f851f635b3346c6ebf878152f38b3932ad9. None of the commits in the 
 set touched / changed anything in the test.
 {code}
 [ RUN  ] ExamplesTest.PersistentVolumeFramework 
 ../../src/tests/script.cpp:83: Failure Failed 
 persistent_volume_framework_test.sh terminated with signal Abort trap: 6 
 [  FAILED  ] ExamplesTest.PersistentVolumeFramework (7865 ms)
 {code}





[jira] [Commented] (MESOS-2605) The slave sometimes does not send active executors during reregistration

2015-04-14 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14495003#comment-14495003
 ] 

Cody Maloney commented on MESOS-2605:
-

That sounds like this might be related to MESOS-2601 then. Mesos doesn't 
currently save which containerizer created / owns a container, so it just 
tries to recover the container with all of them.

 The slave sometimes does not send active executors during reregistration
 

 Key: MESOS-2605
 URL: https://issues.apache.org/jira/browse/MESOS-2605
 Project: Mesos
  Issue Type: Bug
Affects Versions: 0.22.0
Reporter: Elizabeth Lingg
Assignee: Michael Park
  Labels: mesosphere

 The slave sometimes does not send active executors during reregistration. 
 Framework checkpointing is enabled, and the executor successfully 
 reregisters. However, the tasks in that executor are LOST (by abnormal 
 executor termination) because the executor is removed by the mesos master as 
 unknown. See the example below, 
 task.journalnode.journalnode.NodeExecutor.1428609184051.
 See the Slave Logs here for the Task:
 {code}
 Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 
 19:53:06.778790 25126 status_update_manager.cpp:317] Received status update 
 TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task 
 task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 
 20150408-002100-4261056010-5050-1047-0008
 Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 
 19:53:06.779013 25126 status_update_manager.hpp:346] Checkpointing UPDATE for 
 status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for 
 task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 
 20150408-002100-4261056010-5050-1047-0008
 Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 
 19:53:06.781788 25123 slave.cpp:2753] Forwarding the update TASK_RUNNING 
 (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task 
 task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 
 20150408-002100-4261056010-5050-1047-0008 to master@10.142.250.253:5050
 Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 
 19:53:06.781889 25123 slave.cpp:2686] Sending acknowledgement for status 
 update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task 
 task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 
 20150408-002100-4261056010-5050-1047-0008 to executor(1)@10.168.119.78:47638
 Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 
 19:53:06.784503 25124 status_update_manager.cpp:389] Received status update 
 acknowledgement (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task 
 task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 
 20150408-002100-4261056010-5050-1047-0008
 Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 
 19:53:06.784567 25124 status_update_manager.hpp:346] Checkpointing ACK for 
 status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for 
 task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 
 20150408-002100-4261056010-5050-1047-0008
 {code}
 Master Logs:
 {code}
 Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: W0409 
 20:19:43.008666  1067 master.cpp:4015] Executor 
 executor.journalnode.NodeExecutor.1428609184051 of framework 
 20150408-002100-4261056010-5050-1047-0008 possibly unknown to the slave 
 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 
 (ec2-54-237-57-237.compute-1.amazonaws.com)
 Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 
 20:19:43.008652  1074 hierarchical.hpp:648] Recovered cpus(*):0.1; 
 mem(*):1536 (total allocatable: cpus(*):3.5; mem(*):21113; disk(*):142210; 
 ports(*):[3889-5044, 5046-5049, 2182-2958, 2960-3887, 1025-2180, 8082-9041, 
 9043-9159, 9161-, 5052-6999, 7002-7198, 7200-8079, 10001-65535]) on slave 
 20150407-233647-2059219722-5050-1659-S5 from framework 
 20150408-002100-4261056010-5050-1047-0008
 Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 
 20:19:43.008712  1067 master.cpp:4714] Removing executor 
 'executor.journalnode.NodeExecutor.1428609184051' with resources cpus(*):0.1; 
 mem(*):1536 of framework 20150408-002100-4261056010-5050-1047-0008 on slave 
 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 
 (ec2-54-237-57-237.compute-1.amazonaws.com)
 Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 
 20:19:43.010372  1067 master.cpp:3295] Status update TASK_LOST (UUID: 
 e5532567-e5b2-4fca-87aa-f3f98e371640) for task 
 task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 
 

[jira] [Commented] (MESOS-2550) Mesos doesn't compile with clang 3.6

2015-04-13 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493071#comment-14493071
 ] 

Cody Maloney commented on MESOS-2550:
-

https://reviews.apache.org/r/32747/
https://reviews.apache.org/r/32748/
https://reviews.apache.org/r/32749/

 Mesos doesn't compile with clang 3.6
 

 Key: MESOS-2550
 URL: https://issues.apache.org/jira/browse/MESOS-2550
 Project: Mesos
  Issue Type: Improvement
  Components: build
Affects Versions: 0.22.0
 Environment: ArchLinux with Clang 3.6
Reporter: Cody Maloney
Assignee: Cody Maloney

 The bundled libev fails to compile with the error:
 {code}
 ev.c:970:42: error: '_Noreturn' keyword must precede function declarator
   ecb_inline void ecb_unreachable (void) ecb_noreturn;
  ^~~~
   _Noreturn 
 {code}
 Can be patched by moving the noreturn to earlier in the line / where C++11 
 noreturn attributes go.
 Bundled boost fails with errors like:
 {code}
 ../3rdparty/libprocess/3rdparty/boost-1.53.0/boost/concept_check.hpp:653:11: 
 error: unused typedef
   'boost_concept_check653' [-Werror,-Wunused-local-typedef]
    BOOST_CONCEPT_ASSERT((InputIterator<const_iterator>));
   ^
 {code}
 Can be fixed by adding '-Wno-unused-local-typedef' if we detect clang 3.6
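
 The version gate for the flag could look roughly like the sketch below. It is an illustration only, not the Mesos build logic: the helper names parse_clang_version and extra_warning_flags are hypothetical, the "clang version X.Y" banner format is an assumption about what `clang --version` prints, and applying the flag to every release from 3.6 onward (rather than 3.6 alone) is also an assumption.

```python
import re

# Assumed banner format: "clang version 3.6.0 (...)".
_CLANG_VERSION = re.compile(r"clang version (\d+)\.(\d+)")


def parse_clang_version(banner):
    """Return (major, minor) from a `clang --version` banner, or None."""
    match = _CLANG_VERSION.search(banner)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def extra_warning_flags(banner):
    """Return the extra warning flags to pass for the given compiler banner."""
    version = parse_clang_version(banner)
    # Assumption: only pass the flag to clang 3.6 or newer, because an
    # older compiler may not recognise it.
    if version is not None and version >= (3, 6):
        return ["-Wno-unused-local-typedef"]
    return []


if __name__ == "__main__":
    print(extra_warning_flags("clang version 3.6.0"))
    print(extra_warning_flags("clang version 3.5.1"))
```

 In the real build this decision would sit in the configure step; the Python form is only meant to show the comparison.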





[jira] [Commented] (MESOS-2550) Mesos doesn't compile with clang 3.6

2015-04-13 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14493074#comment-14493074
 ] 

Cody Maloney commented on MESOS-2550:
-

A clang 3.6.1 release is planned for next month (its code freeze is May 5).
The patches might land in that release.

 Mesos doesn't compile with clang 3.6
 

 Key: MESOS-2550
 URL: https://issues.apache.org/jira/browse/MESOS-2550
 Project: Mesos
  Issue Type: Improvement
  Components: build
Affects Versions: 0.22.0
 Environment: ArchLinux with Clang 3.6
Reporter: Cody Maloney
Assignee: Cody Maloney

 The bundled libev fails to compile with the error:
 {code}
 ev.c:970:42: error: '_Noreturn' keyword must precede function declarator
   ecb_inline void ecb_unreachable (void) ecb_noreturn;
  ^~~~
   _Noreturn 
 {code}
 Can be patched by moving the noreturn specifier earlier in the line, to the
 position before the declarator where C++11 noreturn attributes go.
 Bundled boost fails with errors like:
 {code}
 ../3rdparty/libprocess/3rdparty/boost-1.53.0/boost/concept_check.hpp:653:11: 
 error: unused typedef
   'boost_concept_check653' [-Werror,-Wunused-local-typedef]
   BOOST_CONCEPT_ASSERT((InputIterator<const_iterator>));
   ^
 {code}
 Can be fixed by adding '-Wno-unused-local-typedef' if we detect clang 3.6





[jira] [Created] (MESOS-2604) Upgrade minimum required compilers for MESOS

2015-04-09 Thread Cody Maloney (JIRA)
Cody Maloney created MESOS-2604:
---

 Summary: Upgrade minimum required compilers for MESOS
 Key: MESOS-2604
 URL: https://issues.apache.org/jira/browse/MESOS-2604
 Project: Mesos
  Issue Type: Improvement
  Components: build
Affects Versions: 0.23.0
Reporter: Cody Maloney
Assignee: Cody Maloney


As discussed in the last community meeting, we would like to raise the minimum
compiler versions for Mesos to GCC 4.8+ and Clang 3.5+. GCC is primarily for
Linux. Clang is for OS X, and also for Linux, where it enables Mesos tooling
improvements ([clang-format|http://mesos.apache.org/documentation/clang-format/],
clang-tidy, among others).

Some documents for reference:
[Compilers by Distribution Version|https://docs.google.com/spreadsheets/d/1Ji8p3p_1JqUsMxE31mJqqztHf7LDx7mGMXh253azWpU/edit?usp=sharing]
shows that GCC 4.8+ or clang 3.5+ is available on all supported platforms.

C++11 features supported by each compiler:
[https://gcc.gnu.org/projects/cxx0x.html]
[http://clang.llvm.org/cxx_status.html]
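
The proposed minimums amount to a simple version comparison, sketched below. This is a minimal illustration and not part of the Mesos build: the names MINIMUM_VERSIONS, parse_version and meets_minimum are hypothetical, and a real check would belong in the configure step.

```python
# Proposed minimum versions from this ticket: GCC 4.8+, Clang 3.5+.
MINIMUM_VERSIONS = {"gcc": (4, 8), "clang": (3, 5)}


def parse_version(text):
    """Turn a dotted version such as '4.8.2' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(compiler, version_text):
    """Return True if the compiler version satisfies the proposed minimum.

    Raises KeyError for a compiler this ticket does not mention.
    """
    minimum = MINIMUM_VERSIONS[compiler]
    # Compare major and minor only; the patch level does not matter here.
    return parse_version(version_text)[:2] >= minimum


if __name__ == "__main__":
    for compiler, version in [("gcc", "4.6.3"), ("gcc", "4.8.2"), ("clang", "3.5")]:
        print(compiler, version, meets_minimum(compiler, version))
```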






[jira] [Commented] (MESOS-830) ExamplesTest.JavaFramework is flaky

2015-04-09 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14487890#comment-14487890
 ] 

Cody Maloney commented on MESOS-830:


Still failing on our OSX Buildbot (the frequency seems to have increased).

 ExamplesTest.JavaFramework is flaky
 ---

 Key: MESOS-830
 URL: https://issues.apache.org/jira/browse/MESOS-830
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Vinod Kone
  Labels: flaky

 [ RUN  ] ExamplesTest.JavaFramework
 Using temporary directory '/tmp/ExamplesTest_JavaFramework_wSc7u8'
 Enabling authentication for the framework
 I1120 15:13:39.820032 1681264640 master.cpp:285] Master started on 
 172.25.133.171:52576
 I1120 15:13:39.820180 1681264640 master.cpp:299] Master ID: 
 201311201513-2877626796-52576-3234
 I1120 15:13:39.820194 1681264640 master.cpp:302] Master only allowing 
 authenticated frameworks to register!
 I1120 15:13:39.821197 1679654912 slave.cpp:112] Slave started on 
 1)@172.25.133.171:52576
 I1120 15:13:39.821795 1679654912 slave.cpp:212] Slave resources: cpus(*):4; 
 mem(*):7168; disk(*):481998; ports(*):[31000-32000]
 I1120 15:13:39.822855 1682337792 slave.cpp:112] Slave started on 
 2)@172.25.133.171:52576
 I1120 15:13:39.823652 1682337792 slave.cpp:212] Slave resources: cpus(*):4; 
 mem(*):7168; disk(*):481998; ports(*):[31000-32000]
 I1120 15:13:39.825330 1679118336 master.cpp:744] The newly elected leader is 
 master@172.25.133.171:52576
 I1120 15:13:39.825445 1679118336 master.cpp:748] Elected as the leading 
 master!
 I1120 15:13:39.825907 1681264640 state.cpp:33] Recovering state from 
 '/tmp/ExamplesTest_JavaFramework_wSc7u8/0/meta'
 I1120 15:13:39.826127 1681264640 status_update_manager.cpp:180] Recovering 
 status update manager
 I1120 15:13:39.826331 1681801216 process_isolator.cpp:317] Recovering isolator
 I1120 15:13:39.826738 1682874368 slave.cpp:2743] Finished recovery
 I1120 15:13:39.827747 1682337792 state.cpp:33] Recovering state from 
 '/tmp/ExamplesTest_JavaFramework_wSc7u8/1/meta'
 I1120 15:13:39.827945 1680191488 slave.cpp:112] Slave started on 
 3)@172.25.133.171:52576
 I1120 15:13:39.828415 1682337792 status_update_manager.cpp:180] Recovering 
 status update manager
 I1120 15:13:39.828608 1680728064 sched.cpp:260] Authenticating with master 
 master@172.25.133.171:52576
 I1120 15:13:39.828606 1680191488 slave.cpp:212] Slave resources: cpus(*):4; 
 mem(*):7168; disk(*):481998; ports(*):[31000-32000]
 I1120 15:13:39.828680 1682874368 slave.cpp:497] New master detected at 
 master@172.25.133.171:52576
 I1120 15:13:39.828765 1682337792 process_isolator.cpp:317] Recovering isolator
 I1120 15:13:39.829828 1680728064 sched.cpp:229] Detecting new master
 I1120 15:13:39.830288 1679654912 authenticatee.hpp:100] Initializing client 
 SASL
 I1120 15:13:39.831635 1680191488 state.cpp:33] Recovering state from 
 '/tmp/ExamplesTest_JavaFramework_wSc7u8/2/meta'
 I1120 15:13:39.831991 1679118336 status_update_manager.cpp:158] New master 
 detected at master@172.25.133.171:52576
 I1120 15:13:39.832042 1682874368 slave.cpp:524] Detecting new master
 I1120 15:13:39.832314 1682337792 slave.cpp:2743] Finished recovery
 I1120 15:13:39.832309 1681264640 master.cpp:1266] Attempting to register 
 slave on vkone.local at slave(1)@172.25.133.171:52576
 I1120 15:13:39.832929 1680728064 status_update_manager.cpp:180] Recovering 
 status update manager
 I1120 15:13:39.833371 1681801216 slave.cpp:497] New master detected at 
 master@172.25.133.171:52576
 I1120 15:13:39.833273 1681264640 master.cpp:2513] Adding slave 
 201311201513-2877626796-52576-3234-0 at vkone.local with cpus(*):4; 
 mem(*):7168; disk(*):481998; ports(*):[31000-32000]
 I1120 15:13:39.833595 1680728064 process_isolator.cpp:317] Recovering isolator
 I1120 15:13:39.833859 1681801216 slave.cpp:524] Detecting new master
 I1120 15:13:39.833861 1682874368 status_update_manager.cpp:158] New master 
 detected at master@172.25.133.171:52576
 I1120 15:13:39.834092 1680191488 slave.cpp:542] Registered with master 
 master@172.25.133.171:52576; given slave ID 
 201311201513-2877626796-52576-3234-0
 I1120 15:13:39.834486 1681264640 master.cpp:1266] Attempting to register 
 slave on vkone.local at slave(2)@172.25.133.171:52576
 I1120 15:13:39.834549 1681264640 master.cpp:2513] Adding slave 
 201311201513-2877626796-52576-3234-1 at vkone.local with cpus(*):4; 
 mem(*):7168; disk(*):481998; ports(*):[31000-32000]
 I1120 15:13:39.834750 1680191488 slave.cpp:555] Checkpointing SlaveInfo to 
 '/tmp/ExamplesTest_JavaFramework_wSc7u8/0/meta/slaves/201311201513-2877626796-52576-3234-0/slave.info'
 I1120 15:13:39.834875 1682874368 hierarchical_allocator_process.hpp:445] 
 Added slave 201311201513-2877626796-52576-3234-0 (vkone.local) with 
 cpus(*):4; mem(*):7168; disk(*):481998; 

[jira] [Commented] (MESOS-2601) Tasks are not removed after recovery from slave and mesos containerizer

2015-04-08 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486105#comment-14486105
 ] 

Cody Maloney commented on MESOS-2601:
-

[~jieyu] The --containerizer flag has never been changed on the host. The
isolator flags have also never been changed at runtime on the host (only
together with a full workdir wipeout / reboot / kill all tasks / new slave id).

 Tasks are not removed after recovery from slave and mesos containerizer
 ---

 Key: MESOS-2601
 URL: https://issues.apache.org/jira/browse/MESOS-2601
 Project: Mesos
  Issue Type: Bug
  Components: containerization, slave
Affects Versions: 0.22.1
Reporter: Timothy Chen

 We've seen in our test cluster that tasks launched with the Mesos
 containerizer are recovered after a slave restart, but the actual command
 process is no longer running and the checkpointed executor is not marked as
 completed.
 The Mesos containerizer recovers, and none of the isolators can recover the
 task, but the containerizer itself is somehow never removed and the monitor
 keeps calling usage on the containerizer.
 Relevant log lines from the beginning of slave recovery:
 I0408 18:06:33.261379 32504 slave.cpp:577] Successfully attached file 
 '/hdd/mesos/slave/slaves/20150401-160104-251662508-5050-2197-S1/frameworks/20141222-194154-218108076-5050-4125-0004/executors/ct:1427921848104:0:EM
  DataDog Uploader:/runs/990741ed-909e-49cc-83f8-be63298872da'
 ...
 I0408 18:06:36.583277 32511 containerizer.cpp:350] Recovering container 
 '990741ed-909e-49cc-83f8-be63298872da' for executor 'ct:1427921848104:0:EM 
 DataDog Uploader:' of framework 20141222-194154-218108076-5050-4125-0004
 
 I0408 18:06:37.017122 32511 linux_launcher.cpp:162] Couldn't find freezer 
 cgroup for container 990741ed-909e-49cc-83f8-be63298872da, assuming already 
 destroyed
 W0408 18:06:37.074916 32496 cpushare.cpp:199] Couldn't find cgroup for 
 container 990741ed-909e-49cc-83f8-be63298872da
 I0408 18:06:37.075173 32486 mem.cpp:158] Couldn't find cgroup for container 
 990741ed-909e-49cc-83f8-be63298872da
 E0408 18:06:37.092279 32496 containerizer.cpp:1136] Error in a resource 
 limitation for container 990741ed-909e-49cc-83f8-be63298872da: Unknown 
 container
 I0408 18:06:37.092643 32496 containerizer.cpp:906] Destroying container 
 '990741ed-909e-49cc-83f8-be63298872da'
 W0408 18:06:37.229626 32501 containerizer.cpp:807] Ignoring update for 
 currently being destroyed container: 990741ed-909e-49cc-83f8-be63298872da
 W0408 18:06:38.129873 32484 containerizer.cpp:844] Skipping resource 
 statistic for container 990741ed-909e-49cc-83f8-be63298872da because: Unknown 
 container
 W0408 18:06:38.129909 32484 containerizer.cpp:844] Skipping resource 
 statistic for container 990741ed-909e-49cc-83f8-be63298872da because: Unknown 
 container





[jira] [Commented] (MESOS-1303) ExamplesTest.{TestFramework, NoExecutorFramework} flaky

2015-04-06 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482453#comment-14482453
 ] 

Cody Maloney commented on MESOS-1303:
-

This is definitely still flaky. Seen on our OSX Buildbot earlier today at
master commit 740dcb3d55944bc1410818d48efc49f0091b037d:

[----------] 8 tests from ExamplesTest
[ RUN      ] ExamplesTest.TestFramework
../../src/tests/script.cpp:83: Failure
Failed
test_framework_test.sh terminated with signal Abort trap: 6
[  FAILED  ] ExamplesTest.TestFramework (7925 ms)
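
To keep track of how often this shows up across Buildbot runs, the failed tests can be pulled out of the gtest output. A minimal sketch, assuming the usual "[  FAILED  ] Name (N ms)" result lines. The function name failed_tests is hypothetical, and the pattern is deliberately loose about the spacing inside the brackets.

```python
import re

# Matches per-test result lines such as
# "[  FAILED  ] ExamplesTest.TestFramework (7925 ms)".
# Summary lines without a duration are intentionally not matched.
_FAILED_LINE = re.compile(r"^\s*\[\s*FAILED\s*\]\s+(\S+)\s+\((\d+) ms\)\s*$")


def failed_tests(output):
    """Return a list of (test name, duration in ms) for each failed test."""
    results = []
    for line in output.splitlines():
        match = _FAILED_LINE.match(line)
        if match:
            results.append((match.group(1), int(match.group(2))))
    return results


if __name__ == "__main__":
    sample = (
        "[ RUN      ] ExamplesTest.TestFramework\n"
        "[  FAILED  ] ExamplesTest.TestFramework (7925 ms)\n"
    )
    print(failed_tests(sample))
```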

 ExamplesTest.{TestFramework, NoExecutorFramework} flaky
 ---

 Key: MESOS-1303
 URL: https://issues.apache.org/jira/browse/MESOS-1303
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Ian Downes
  Labels: flaky

 I'm having trouble reproducing this but I did observe it once on my OSX 
 system:
 {noformat}
 [==========] Running 2 tests from 1 test case.
 [----------] Global test environment set-up.
 [----------] 2 tests from ExamplesTest
 [ RUN      ] ExamplesTest.TestFramework
 ../../src/tests/script.cpp:81: Failure
 Failed
 test_framework_test.sh terminated with signal 'Abort trap: 6'
 [  FAILED  ] ExamplesTest.TestFramework (953 ms)
 [ RUN      ] ExamplesTest.NoExecutorFramework
 [       OK ] ExamplesTest.NoExecutorFramework (10162 ms)
 [----------] 2 tests from ExamplesTest (5 ms total)
 [----------] Global test environment tear-down
 [==========] 2 tests from 1 test case ran. (11121 ms total)
 [  PASSED  ] 1 test.
 [  FAILED  ] 1 test, listed below:
 [  FAILED  ] ExamplesTest.TestFramework
 {noformat}
 when investigating a failed make check for https://reviews.apache.org/r/20971/
 {noformat}
 [----------] 6 tests from ExamplesTest
 [ RUN      ] ExamplesTest.TestFramework
 [       OK ] ExamplesTest.TestFramework (8643 ms)
 [ RUN      ] ExamplesTest.NoExecutorFramework
 tests/script.cpp:81: Failure
 Failed
 no_executor_framework_test.sh terminated with signal 'Aborted'
 [  FAILED  ] ExamplesTest.NoExecutorFramework (7220 ms)
 [ RUN      ] ExamplesTest.JavaFramework
 [       OK ] ExamplesTest.JavaFramework (11181 ms)
 [ RUN      ] ExamplesTest.JavaException
 [       OK ] ExamplesTest.JavaException (5624 ms)
 [ RUN      ] ExamplesTest.JavaLog
 [       OK ] ExamplesTest.JavaLog (6472 ms)
 [ RUN      ] ExamplesTest.PythonFramework
 [       OK ] ExamplesTest.PythonFramework (14467 ms)
 [----------] 6 tests from ExamplesTest (53607 ms total)
 {noformat}





[jira] [Reopened] (MESOS-1303) ExamplesTest.{TestFramework, NoExecutorFramework} flaky

2015-04-06 Thread Cody Maloney (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Maloney reopened MESOS-1303:
-

 ExamplesTest.{TestFramework, NoExecutorFramework} flaky
 ---

 Key: MESOS-1303
 URL: https://issues.apache.org/jira/browse/MESOS-1303
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Ian Downes
  Labels: flaky

 I'm having trouble reproducing this but I did observe it once on my OSX 
 system:
 {noformat}
 [==========] Running 2 tests from 1 test case.
 [----------] Global test environment set-up.
 [----------] 2 tests from ExamplesTest
 [ RUN      ] ExamplesTest.TestFramework
 ../../src/tests/script.cpp:81: Failure
 Failed
 test_framework_test.sh terminated with signal 'Abort trap: 6'
 [  FAILED  ] ExamplesTest.TestFramework (953 ms)
 [ RUN      ] ExamplesTest.NoExecutorFramework
 [       OK ] ExamplesTest.NoExecutorFramework (10162 ms)
 [----------] 2 tests from ExamplesTest (5 ms total)
 [----------] Global test environment tear-down
 [==========] 2 tests from 1 test case ran. (11121 ms total)
 [  PASSED  ] 1 test.
 [  FAILED  ] 1 test, listed below:
 [  FAILED  ] ExamplesTest.TestFramework
 {noformat}
 when investigating a failed make check for https://reviews.apache.org/r/20971/
 {noformat}
 [----------] 6 tests from ExamplesTest
 [ RUN      ] ExamplesTest.TestFramework
 [       OK ] ExamplesTest.TestFramework (8643 ms)
 [ RUN      ] ExamplesTest.NoExecutorFramework
 tests/script.cpp:81: Failure
 Failed
 no_executor_framework_test.sh terminated with signal 'Aborted'
 [  FAILED  ] ExamplesTest.NoExecutorFramework (7220 ms)
 [ RUN      ] ExamplesTest.JavaFramework
 [       OK ] ExamplesTest.JavaFramework (11181 ms)
 [ RUN      ] ExamplesTest.JavaException
 [       OK ] ExamplesTest.JavaException (5624 ms)
 [ RUN      ] ExamplesTest.JavaLog
 [       OK ] ExamplesTest.JavaLog (6472 ms)
 [ RUN      ] ExamplesTest.PythonFramework
 [       OK ] ExamplesTest.PythonFramework (14467 ms)
 [----------] 6 tests from ExamplesTest (53607 ms total)
 {noformat}





[jira] [Assigned] (MESOS-2550) Mesos doesn't compile with clang 3.6

2015-04-01 Thread Cody Maloney (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Maloney reassigned MESOS-2550:
---

Assignee: Cody Maloney

 Mesos doesn't compile with clang 3.6
 

 Key: MESOS-2550
 URL: https://issues.apache.org/jira/browse/MESOS-2550
 Project: Mesos
  Issue Type: Improvement
  Components: build
Affects Versions: 0.22.0
 Environment: ArchLinux with Clang 3.6
Reporter: Cody Maloney
Assignee: Cody Maloney

 The bundled libev fails to compile with the error:
 {code}
 ev.c:970:42: error: '_Noreturn' keyword must precede function declarator
   ecb_inline void ecb_unreachable (void) ecb_noreturn;
  ^~~~
   _Noreturn 
 {code}
 Can be patched by moving the noreturn specifier earlier in the line, to the
 position before the declarator where C++11 noreturn attributes go.
 Bundled boost fails with errors like:
 {code}
 ../3rdparty/libprocess/3rdparty/boost-1.53.0/boost/concept_check.hpp:653:11: 
 error: unused typedef
   'boost_concept_check653' [-Werror,-Wunused-local-typedef]
   BOOST_CONCEPT_ASSERT((InputIterator<const_iterator>));
   ^
 {code}
 Can be fixed by adding '-Wno-unused-local-typedef' if we detect clang 3.6




