[jira] [Commented] (MESOS-6078) Add a agent teardown endpoint
[ https://issues.apache.org/jira/browse/MESOS-6078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15606268#comment-15606268 ] Cody Maloney commented on MESOS-6078: - {{/machine/down}} is very complicated to use for this use case: it requires posting multiple JSON blobs, which must follow a format that includes timestamps in milliseconds, and which must contain multiple fields that match exactly how a particular Mesos agent was launched. It takes a _lot_ of code and debugging to use and manage it for what is a simple, common task. Also, once there are existing schedules, things get more complicated (as they do if you want the agent to re-register later). > Add a agent teardown endpoint > - > > Key: MESOS-6078 > URL: https://issues.apache.org/jira/browse/MESOS-6078 > Project: Mesos > Issue Type: Improvement > Components: master >Affects Versions: 1.0.0, 1.0.1 >Reporter: Cody Maloney >Assignee: Michael Park > Labels: mesosphere > > Currently, when a whole agent machine is unexpectedly terminated for good > (AWS terminated the instance without warning), it goes through the Mesos > slave removal rate limit before it's gone. > If a couple agents / a whole rack goes in a cluster of thousands of agents, > this can get to be a problem. > If the agent can be shut down "cleanly" everything would get scheduled, but > once the agent is gone, there currently is no good way for an administrator > to indicate the node is gone / gone and its tasks are lost / should be > rescheduled if appropriate as soon as possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6354) Treat a non-existent mesos modules directory the same as an empty mesos modules directory
Cody Maloney created MESOS-6354: --- Summary: Treat a non-existent mesos modules directory the same as an empty mesos modules directory Key: MESOS-6354 URL: https://issues.apache.org/jira/browse/MESOS-6354 Project: Mesos Issue Type: Bug Components: modules Reporter: Cody Maloney Assignee: Kapil Arya When there are no modules, there is often no modules directory. A non-existent modules directory indicates exactly the same thing as a modules directory with no modules inside it. In DC/OS we have to carry some extra code to make sure we always have an existing modules directory, even in cases where we don't have any real Mesos modules in it (https://github.com/dcos/dcos/pull/849)
[jira] [Created] (MESOS-6340) Set HOME for Mesos tasks
Cody Maloney created MESOS-6340: --- Summary: Set HOME for Mesos tasks Key: MESOS-6340 URL: https://issues.apache.org/jira/browse/MESOS-6340 Project: Mesos Issue Type: Bug Components: containerization, slave Reporter: Cody Maloney Assignee: Jie Yu
Quite a few programs assume {{$HOME}} points to a user-editable data file directory. One example is Python, which tries to look up $HOME to find user-installed packages, and if that fails it tries to look up the user in the passwd database, which often goes badly (the container is running under the `nobody` user):
{code}
if i == 1:
    if 'HOME' not in os.environ:
        import pwd
        userhome = pwd.getpwuid(os.getuid()).pw_dir
    else:
        userhome = os.environ['HOME']
{code}
Just setting HOME by default to WORK_DIR would enable more software to work correctly out of the box. Software which needs to specialize / change it (or schedulers with specific preferences) should still be able to set it arbitrarily, and anything a scheduler explicitly sets should override the default value of $WORK_DIR
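The defaulting rule proposed in MESOS-6340 can be sketched in a few lines. This is only an illustration of the precedence (default first, scheduler-provided values win); the function name `build_task_environment` and its arguments are hypothetical and are not part of Mesos.

```python
def build_task_environment(scheduler_env, work_dir):
    """Return the task environment, defaulting HOME to the work directory.

    Hypothetical helper: `scheduler_env` stands for whatever variables the
    scheduler set explicitly, `work_dir` for the task's working directory.
    """
    env = {"HOME": work_dir}   # default proposed in the issue
    env.update(scheduler_env)  # anything set explicitly overrides the default
    return env
```

With an empty scheduler environment HOME falls back to the work directory; a scheduler that sets HOME itself keeps its own value.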
[jira] [Commented] (MESOS-6127) Implement suppport for HTTP/2
[ https://issues.apache.org/jira/browse/MESOS-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15465619#comment-15465619 ] Cody Maloney commented on MESOS-6127: - As long as it's a protocol change, why not go to gRPC which is going to have a lot more maintainers developing / maintaining and committed to it than a HTTP2 + Protobuf thing that Mesos internally builds. > Implement suppport for HTTP/2 > - > > Key: MESOS-6127 > URL: https://issues.apache.org/jira/browse/MESOS-6127 > Project: Mesos > Issue Type: Epic > Components: HTTP API, libprocess >Reporter: Aaron Wood > Labels: performance > > HTTP/2 will allow us to take advantage of connection multiplexing, header > compression, streams, server push, etc. Add support for communication over > HTTP/2 between masters and agents, framework endpoints, etc. > Should we support HTTP/2 without TLS? The spec allows for this but most major > browser vendors, libraries, and implementations aren't supporting it unless > TLS is used. If we do require TLS, what can be done to reduce the performance > hit of the TLS handshake? Might need to change more code to make sure that we > are taking advantage of connection sharing so that we can (ideally) only ever > have a one-time TLS handshake per shared connection. > Potential library that could be helpful: > https://nghttp2.org/documentation/libnghttp2_asio.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6078) Add a agent teardown endpoint
[ https://issues.apache.org/jira/browse/MESOS-6078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435683#comment-15435683 ] Cody Maloney commented on MESOS-6078: - For reference on the API for this: it needs to be simple enough to trigger from a button in a Web UI (a single, simple HTTP request). > Add a agent teardown endpoint > - > > Key: MESOS-6078 > URL: https://issues.apache.org/jira/browse/MESOS-6078 > Project: Mesos > Issue Type: Improvement > Components: master >Affects Versions: 1.0.0, 1.0.1 >Reporter: Cody Maloney >Assignee: Michael Park > Labels: mesosphere > > Currently, when a whole agent machine is unexpectedly terminated for good > (AWS terminated the instance without warning), it goes through the Mesos > slave removal rate limit before it's gone. > If a couple agents / a whole rack goes in a cluster of thousands of agents, > this can get to be a problem. > If the agent can be shut down "cleanly" everything would get scheduled, but > once the agent is gone, there currently is no good way for an administrator > to indicate the node is gone / gone and its tasks are lost / should be > rescheduled if appropriate as soon as possible.
[jira] [Created] (MESOS-6078) Add a agent teardown endpoint
Cody Maloney created MESOS-6078: --- Summary: Add a agent teardown endpoint Key: MESOS-6078 URL: https://issues.apache.org/jira/browse/MESOS-6078 Project: Mesos Issue Type: Improvement Components: master Affects Versions: 1.0.1, 1.0.0 Reporter: Cody Maloney Assignee: Michael Park Currently, when a whole agent machine is unexpectedly terminated for good (AWS terminated the instance without warning), it goes through the Mesos slave removal rate limit before it's gone. If a couple agents / a whole rack goes in a cluster of thousands of agents, this can get to be a problem. If the agent can be shut down "cleanly" everything would get scheduled, but once the agent is gone, there currently is no good way for an administrator to indicate the node is gone / gone and its tasks are lost / should be rescheduled if appropriate as soon as possible.
[jira] [Created] (MESOS-6069) Misspelt TASK_KILLED in mesos slave
Cody Maloney created MESOS-6069: --- Summary: Misspelt TASK_KILLED in mesos slave Key: MESOS-6069 URL: https://issues.apache.org/jira/browse/MESOS-6069 Project: Mesos Issue Type: Bug Components: slave Reporter: Cody Maloney https://github.com/apache/mesos/blob/c3228f3c3d1a1b2c145d1377185cfe22da6079eb/src/slave/slave.cpp#L2127
[jira] [Created] (MESOS-5467) offer DECLINE / ACCEPT + Recovered resource messages are spammy
Cody Maloney created MESOS-5467: --- Summary: offer DECLINE / ACCEPT + Recovered resource messages are spammy Key: MESOS-5467 URL: https://issues.apache.org/jira/browse/MESOS-5467 Project: Mesos Issue Type: Bug Reporter: Cody Maloney In a decent-sized Mesos cluster, frameworks get sent hundreds of offers. When the framework then accepts/declines those offers, a line like {noformat} May 27 01:20:43 node-44a84216f97e mesos-master[110696]: I0527 01:20:43.361552 110718 master.cpp:3297] Processing DECLINE call for offers: [ 88bbf084-c8b7-4c91-af62-c91089c97eaf-O433278814 ] for framework 20160406-160033-18415882-5050-35855- (mon-marathon-service) at scheduler-949644bc-b1f0-497b-a767-87d1201d5113@10.6.15.1:41319 {noformat} is printed for each of them, along with a: {noformat} May 27 01:20:43 node-44a84216f97e mesos-master[110696]: I0527 01:20:43.419852 110703 hierarchical.cpp:744] Recovered cpus(*):37.75; mem(*):102992; ports(*):[31000-31214, 31216-32000]; disk(*):545870 (total: cpus(*):38; mem(*):103120; ports(*):[31000-32000]; disk(*):545870, allocated: cpus(*):0.25; mem(*):128; ports(*):[31215-31215]) on slave 88bbf084-c8b7-4c91-af62-c91089c97eaf-S649 from framework 20160406-160033-18415882-5050-35855- {noformat} It would be nice to not log the exact declines, or to log a summary instead. These messages end up being the vast majority of the logs I look at: multi-thousand-line blocks which aren't useful to the investigation. All that is needed is a sign that "offers are being processed for this framework".
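As a rough picture of the "summary" asked for in MESOS-5467, the sketch below collapses per-offer DECLINE lines into one count per framework. It runs over log text after the fact and is not a change to the master's logging; the regular expression is an assumption derived only from the single DECLINE line quoted above, and `summarize_declines` is a hypothetical name.

```python
import re
from collections import Counter

# Pattern inferred from the one DECLINE log line quoted in the issue
# (an assumption; other Mesos versions may format the message differently).
DECLINE_PATTERN = re.compile(
    r"Processing DECLINE call for offers: \[.*?\] for framework (\S+)"
)


def summarize_declines(lines):
    """Count DECLINE log lines per framework ID instead of listing each one."""
    counts = Counter()
    for line in lines:
        match = DECLINE_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

A caller could then print one line per framework, such as the framework ID followed by its decline count, in place of thousands of near-identical entries.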
[jira] [Updated] (MESOS-5466) Master attempted to send message to disconnected framework logged 800 times in 1 second
[ https://issues.apache.org/jira/browse/MESOS-5466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Maloney updated MESOS-5466: Attachment: master-disconnect-message > Master attempted to send message to disconnected framework logged 800 times > in 1 second > --- > > Key: MESOS-5466 > URL: https://issues.apache.org/jira/browse/MESOS-5466 > Project: Mesos > Issue Type: Bug > Components: master >Reporter: Cody Maloney > Labels: mesosphere > Attachments: master-disconnect-message > > > One instance (attached) had 806 of exactly the same message in one second. > Anonymized log attached.
[jira] [Commented] (MESOS-1865) Redirect to the leader master when current master is not a leader
[ https://issues.apache.org/jira/browse/MESOS-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15241654#comment-15241654 ] Cody Maloney commented on MESOS-1865: - Please don't use 301 "permanent redirect". Browsers cache that for a _long_ time, so if that master becomes the leader again you'll be permanently redirected away... Use 302 or 307. If we're concerned about breaking "dumb" / simple clients, then 307 would seem to make the most sense. The odds are better that simple clients wouldn't know about 307 since it's newer, and would just report it as an error which a sysadmin would see in their monitoring tools and be able to fix. > Redirect to the leader master when current master is not a leader > - > > Key: MESOS-1865 > URL: https://issues.apache.org/jira/browse/MESOS-1865 > Project: Mesos > Issue Type: Bug > Components: json api >Affects Versions: 0.20.1 >Reporter: Steven Schlansker >Assignee: haosdent > > Some of the API endpoints, for example /master/tasks.json, will return bogus > information if you query a non-leading master: > {code} > [steven@Anesthetize:~]% curl > http://master1.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n > 10 > { > "tasks": [] > } > [steven@Anesthetize:~]% curl > http://master2.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n > 10 > { > "tasks": [] > } > [steven@Anesthetize:~]% curl > http://master3.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n > 10 > { > "tasks": [ > { > "executor_id": "", > "framework_id": "20140724-231003-419644938-5050-1707-", > "id": > "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db", > "name": > "pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db", > "resources": { > "cpus": 0.25, > "disk": 0, > {code} > This is very hard for end-users to work around. 
For example if I query > "which master is leading" followed by "leader: which tasks are running" it is > possible that the leader fails over in between, leaving me with an incorrect > answer and no way to know that this happened. > In my opinion the API should return the correct response (by asking the > current leader?) or an error (500 Not the leader?) but it's unacceptable to > return a successful wrong answer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
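The redirect behaviour argued for in the MESOS-1865 comment (a temporary 307 rather than a cacheable 301) can be sketched as a small decision function. This is not the Mesos master's code; `leader_redirect` and its parameters are hypothetical names used only to show the status-code choice.

```python
TEMPORARY_REDIRECT = 307  # not cached as permanent, unlike 301


def leader_redirect(is_leader, leader_url, path):
    """Return None if this master should answer, else (status, headers).

    Hypothetical helper: `leader_url` is the base URL of the current leader,
    `path` is the request path (for example "/master/tasks.json").
    """
    if is_leader:
        return None
    location = leader_url.rstrip("/") + path
    return TEMPORARY_REDIRECT, {"Location": location}
```

A non-leading master would send the returned status and `Location` header instead of an empty but successful response, so a client either follows the redirect or surfaces an error.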
[jira] [Commented] (MESOS-5211) Allow docker puller to use docker image IDs in addition to tags
[ https://issues.apache.org/jira/browse/MESOS-5211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240330#comment-15240330 ] Cody Maloney commented on MESOS-5211: - That's related purely to the unified containerizer. It's a bug currently that mesos inspects docker containerizer docker image names for a {{:}}, and if it isn't there, always forcibly appends {{:latest}}. The bugfixing for the unified containerizer to not just check "has_tag" then assume it should use latest definitely could be covered by MESOS-3505 > Allow docker puller to use docker image IDs in addition to tags > --- > > Key: MESOS-5211 > URL: https://issues.apache.org/jira/browse/MESOS-5211 > Project: Mesos > Issue Type: Bug > Components: containerization, docker >Affects Versions: 0.28.0 >Reporter: Cody Maloney > Labels: containerizer, docker, mesosphere > > Docker added support for a {{@}} format instead of {{:}} in [1.6 > via pull 11109|https://github.com/docker/docker/pull/11109]. > The {{@}} is useful because it allows reference to specific set of > bits, rather than a tag (such as {{:latest}}) which can change over time. > Currently a number of code paths, such as the [Mesos Docker > code|https://github.com/apache/mesos/blob/df29bf0338771c92d1b1d3848181a35429cdcf0f/src/docker/docker.cpp#L1070], > the [Mesos Containerizer Docker > Provisioner|https://github.com/apache/mesos/blob/df29bf0338771c92d1b1d3848181a35429cdcf0f/src/slave/containerizer/mesos/provisioner/docker/registry_puller.cpp#L206] > do not support pulling / fetching docker containers by id. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
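To make the "has a tag, else append {{:latest}}" problem concrete, here is a minimal sketch of reference parsing that only defaults to `latest` when neither a tag nor an `@` part is present. It assumes references of the form `name`, `name:tag` or `name@digest`, with an optional `host:port/` prefix; `parse_image_reference` is a hypothetical helper, not the Mesos or Docker implementation.

```python
def parse_image_reference(reference):
    """Split an image reference into (name, tag, digest).

    Assumed forms: "name", "name:tag", "name@digest", each optionally
    prefixed by a registry such as "host:5000/".
    """
    name, digest = reference, None
    if "@" in reference:
        name, digest = reference.split("@", 1)

    tag = None
    # Only a colon in the last path segment is a tag separator; a colon
    # earlier in the string belongs to a registry "host:port".
    last_segment = name.rsplit("/", 1)[-1]
    if ":" in last_segment:
        name, tag = name.rsplit(":", 1)

    # Default to "latest" only when the image is not pinned in any way.
    if tag is None and digest is None:
        tag = "latest"
    return name, tag, digest
```

The point of the design is the last branch: a reference pinned with `@` is left alone rather than having `:latest` appended to it.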
[jira] [Updated] (MESOS-5211) Allow docker puller to use docker image IDs in addition to tags
[ https://issues.apache.org/jira/browse/MESOS-5211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Maloney updated MESOS-5211: Description: Docker added support for a {{@}} format instead of {{:}} in [1.6 via pull 11109|https://github.com/docker/docker/pull/11109]. The {{@}} is useful because it allows reference to specific set of bits, rather than a tag (such as {{:latest}}) which can change over time. Currently a number of code paths, such as the [Mesos Docker code|https://github.com/apache/mesos/blob/df29bf0338771c92d1b1d3848181a35429cdcf0f/src/docker/docker.cpp#L1070], the [Mesos Containerizer Docker Provisioner|https://github.com/apache/mesos/blob/df29bf0338771c92d1b1d3848181a35429cdcf0f/src/slave/containerizer/mesos/provisioner/docker/registry_puller.cpp#L206] do not support pulling / fetching docker containers by id. was: Docker added support for a {{@}} format instead of {{:}} in 1.6. The {{@}} is useful because it allows reference to specific set of bits, rather than a tag (such as {{:latest}}) which can change over time. Currently a number of code paths, such as the [Mesos Docker code|https://github.com/apache/mesos/blob/df29bf0338771c92d1b1d3848181a35429cdcf0f/src/docker/docker.cpp#L1070], the [Mesos Containerizer Docker Provisioner|https://github.com/apache/mesos/blob/df29bf0338771c92d1b1d3848181a35429cdcf0f/src/slave/containerizer/mesos/provisioner/docker/registry_puller.cpp#L206] do not support pulling / fetching docker containers by id. > Allow docker puller to use docker image IDs in addition to tags > --- > > Key: MESOS-5211 > URL: https://issues.apache.org/jira/browse/MESOS-5211 > Project: Mesos > Issue Type: Bug > Components: containerization, docker >Affects Versions: 0.28.0 >Reporter: Cody Maloney > Labels: containerizer, docker, mesosphere > > Docker added support for a {{@}} format instead of {{:}} in [1.6 > via pull 11109|https://github.com/docker/docker/pull/11109]. 
> The {{@}} is useful because it allows reference to specific set of > bits, rather than a tag (such as {{:latest}}) which can change over time. > Currently a number of code paths, such as the [Mesos Docker > code|https://github.com/apache/mesos/blob/df29bf0338771c92d1b1d3848181a35429cdcf0f/src/docker/docker.cpp#L1070], > the [Mesos Containerizer Docker > Provisioner|https://github.com/apache/mesos/blob/df29bf0338771c92d1b1d3848181a35429cdcf0f/src/slave/containerizer/mesos/provisioner/docker/registry_puller.cpp#L206] > do not support pulling / fetching docker containers by id. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5211) Allow docker puller to use docker image IDs in addition to tags
Cody Maloney created MESOS-5211: --- Summary: Allow docker puller to use docker image IDs in addition to tags Key: MESOS-5211 URL: https://issues.apache.org/jira/browse/MESOS-5211 Project: Mesos Issue Type: Bug Components: containerization, docker Affects Versions: 0.28.0 Reporter: Cody Maloney Docker added support for a {{@}} format instead of {{:}} in 1.6. The {{@}} is useful because it allows reference to specific set of bits, rather than a tag (such as {{:latest}}) which can change over time. Currently a number of code paths, such as the [Mesos Docker code|https://github.com/apache/mesos/blob/df29bf0338771c92d1b1d3848181a35429cdcf0f/src/docker/docker.cpp#L1070], the [Mesos Containerizer Docker Provisioner|https://github.com/apache/mesos/blob/df29bf0338771c92d1b1d3848181a35429cdcf0f/src/slave/containerizer/mesos/provisioner/docker/registry_puller.cpp#L206] do not support pulling / fetching docker containers by id. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2281) Deprecate plain text Credential format.
[ https://issues.apache.org/jira/browse/MESOS-2281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196404#comment-15196404 ] Cody Maloney commented on MESOS-2281: - The JSON format was added as part of MESOS-1391. The original author intended to deprecate the legacy credential format. Original commit: https://github.com/apache/mesos/commit/2cb3761c6bfa80b956eaafde9c69eafaeac3deae Review: https://reviews.apache.org/r/2/ The JSON format should allow us to eliminate some code, as well as provide a more robust parser to ensure people don't read / write garbage (for example, a newline or space accidentally added to the name of one principal throws all the parsing off by a little bit, and things stop working properly) > Deprecate plain text Credential format. > --- > > Key: MESOS-2281 > URL: https://issues.apache.org/jira/browse/MESOS-2281 > Project: Mesos > Issue Type: Improvement > Components: master, slave >Affects Versions: 0.21.1 >Reporter: Cody Maloney >Assignee: Jan Schlicht > Labels: mesosphere, security, tech-debt > > Currently two formats of credentials are supported: JSON > {code} > { > "credentials": [ > { > "principal": "sherman", > "secret": "kitesurf" > } > ] > } > {code} > And a new line file: > {code} > principal1 secret1 > principal2 secret2 > {code} > We should deprecate the new line format and remove support for the old format.
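A migration from the newline format to the JSON format could look roughly like the sketch below. It assumes one whitespace-separated `principal secret` pair per line and rejects anything else outright, which is the kind of strictness the comment asks for; both function names are hypothetical.

```python
import json


def parse_plaintext_credentials(text):
    """Parse 'principal secret' lines, failing loudly on malformed input."""
    credentials = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines, nothing else
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(
                f"line {lineno}: expected 'principal secret', got {len(fields)} fields"
            )
        principal, secret = fields
        credentials.append({"principal": principal, "secret": secret})
    return credentials


def credentials_to_json(credentials):
    """Serialize credentials in the JSON layout shown in the issue."""
    return json.dumps({"credentials": credentials}, indent=2)
```

A stray space inside a principal name now raises an error that names the line, instead of silently shifting every later field.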
[jira] [Updated] (MESOS-2814) os::read should have one implementation
[ https://issues.apache.org/jira/browse/MESOS-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Maloney updated MESOS-2814: Description: In master there are currently three implementations of the function: https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/os/read.hpp#L42 https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/os/read.hpp#L82 https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/os/read.hpp#L42 All of them have fairly radically different implementations (one uses C read(), one uses C++ ifstream, one uses C fopen). The read() based one does an excess / unnecessary copy / buffer allocation (it reads into one temporary buffer, then copies into the result string; it would be more efficient to do a .reserve() on the result string, and then fill the result buffer directly). The ifstream/istreambuf_iterator one ignores that you can have an error partway through reading a file: it doesn't detect the error or propagate it up. The fopen() variant reads one newline-separated line at a time. This could produce interesting / unexpected results when reading a binary file. It also causes glibc to insert null bytes at the end of the buffer it reads (excess computation). The result isn't pre-allocated to be the right length, meaning that most of the successively read lines will result in a realloc() and a lot of memory copies, which will be inefficient on large files. was:Currently stout os::read() has two radically different implementations when you give it a {{std::string}} vs. a {{const char *}}. Ideally these have one implementation that does things like intelligently size the buffer that it writes into rather than re-allocating repeatedly every time it lengthens the string (resulting in copious copying). 
> os::read should have one implementation > --- > > Key: MESOS-2814 > URL: https://issues.apache.org/jira/browse/MESOS-2814 > Project: Mesos > Issue Type: Improvement > Components: stout >Reporter: Cody Maloney >Assignee: Isabel Jimenez > Labels: mesosphere, tech-debt > > In master there are currently three implementations of the function: > > https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/os/read.hpp#L42 > https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/os/read.hpp#L82 > https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/os/read.hpp#L42 > All of them have fairly radically different implementations (one uses C > read(), one uses C++ ifstream, one uses C fopen). > The read() based one does an excess / unnecessary copy / buffer allocation > (it reads into one temporary buffer, then copies into the result > string; it would be more efficient to do a .reserve() on the result string, and > then fill the result buffer directly). > The ifstream/istreambuf_iterator one ignores that you can have an error > partway through reading a file: it doesn't detect the error or propagate it up. > The fopen() variant reads one newline-separated line at a time. This could > produce interesting / unexpected results when reading a binary file. It > also causes glibc to insert null bytes at the end of the buffer it reads > (excess computation). The result isn't pre-allocated to be the right length, > meaning that most of the successively read lines will result in a realloc() and > a lot of memory copies, which will be inefficient on large files.
[jira] [Created] (MESOS-4645) Mesos agent shutdown on healtcheck timeout rather than lost and recovered
Cody Maloney created MESOS-4645: --- Summary: Mesos agent shutdown on healtcheck timeout rather than lost and recovered Key: MESOS-4645 URL: https://issues.apache.org/jira/browse/MESOS-4645 Project: Mesos Issue Type: Bug Affects Versions: 0.27.1 Reporter: Cody Maloney I expected slaves to have to be gone for the re-registration timeout before they'd be lost to the cluster, not just to fail 5 healthchecks (failing the healthchecks indicates there is a network partition, not that the agent is gone for good and will never come back). Is there some flag I'm missing here which I should be setting? From my perspective I expect frameworks to not get offers for resources on agents which haven't been contacted recently (the framework wouldn't be able to launch anything on the agent). Once the re-registration period times out, the slave would be assumed completely lost and the tasks assumed terminated / able to be re-launched if desired. If an agent recovers between the healthcheck timeout and the re-registration timeout, it should be able to re-join the cluster with its running tasks kept running. Note: Some log lines have their start or tail truncated. 
Critical stuff should all be there Master flags {noformat} Feb 11 00:22:19 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: I0211 00:22:19.690507 1362 master.cpp:369] Flags at startup: --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="false" --authenticate_slaves="false" --authenticators="crammd5" --authorizers="local" --cluster="cody-cm52sd-2" --framework_sorter="drf" --help="false" --hostname_lookup="false" --initialize_driver_logging="true" --ip_discovery_command="/opt/mesosphere/bin/detect_ip" --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" --port="5050" --quiet="false" --quorum="1" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="5secs" --registry_strict="false" --roles="slave_public" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/opt/mesosphere/packages/mesos--4dd59ec6bde2052f6f2a0a0da415b6c92c3c418a/share/mesos/webui" --weights="slave_public=1" --work_dir="/var/lib/mesos/master" --zk="zk://127.0.0.1:2181/mesos" --zk_session_timeout="10secs" {noformat} Slave flags {noformat} Feb 11 00:34:13 ip-10-0-0-52.us-west-2.compute.internal mesos-slave[3914]: I0211 00:34:13.334395 3914 slave.cpp:192] Flags at startup: --appc_store_dir="/tmp/mesos/store/appc" --authenticatee="crammd5" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="docker,mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_auth_server="auth.docker.io" --docker_auth_server_port="443" --docker_kill_orphans="true" --docker_local_archives_dir="/tmp/mesos/images/docker" --docker_puller="local" 
--docker_puller_timeout="60" --docker_registry="registry-1.docker.io" --docker_registry_port="443" --docker_remove_delay="1hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" --enforce_container_disk_quota="false" --executor_environment_variables="{"LD_LIBRARY_PATH":"\/opt\/mesosphere\/lib","PATH":"\/usr\/bin:\/bin","SASL_PATH":"\/opt\/mesosphere\/lib\/sasl2","SHELL":"\/usr\/bin\/bash"}" --executor_registration_timeout="5mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="2days" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname_lookup="false" --image_provisioner_backend="copy" --initialize_driver_logging="true" --ip_discovery_command="/opt/mesosphere/bin/detect_ip" --isolation="cgroups/cpu,cgroups/mem" --launcher_dir="/opt/mesosphere/packages/mesos--4dd59ec6bde2052f6f2a0a0da415b6c92c3c418a/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --master="zk://leader.mesos:2181/mesos" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --resources="ports:[1025-2180,2182-3887,3889-5049,5052-8079,8082-8180,8182-32000]" --re Feb 11 00:34:13 ip-10-0-0-52.us-west-2.compute.internal mesos-slave[3914]: vocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" --slave_subsystems="cpu,memory" --strict="true" --switch_user="true"
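The gap between the two timeouts in MESOS-4645 is easy to see with the values from the master flags above. The arithmetic below assumes the master gives up on an agent after `max_slave_ping_timeouts` consecutive missed pings, each waited on for `slave_ping_timeout`; that reading of the flags is an assumption, and the helper name is hypothetical.

```python
def healthcheck_removal_seconds(max_ping_timeouts, ping_timeout_secs):
    """Seconds of missed pings before the master gives up on an agent
    (assumed to be the product of the two flags)."""
    return max_ping_timeouts * ping_timeout_secs


# Values taken from the master flags quoted in the issue.
MAX_SLAVE_PING_TIMEOUTS = 5           # --max_slave_ping_timeouts="5"
SLAVE_PING_TIMEOUT_SECS = 15          # --slave_ping_timeout="15secs"
SLAVE_REREGISTER_TIMEOUT_SECS = 600   # --slave_reregister_timeout="10mins"

removal = healthcheck_removal_seconds(MAX_SLAVE_PING_TIMEOUTS, SLAVE_PING_TIMEOUT_SECS)
print(f"healthcheck removal after {removal}s, "
      f"re-registration timeout is {SLAVE_REREGISTER_TIMEOUT_SECS}s")
```

Under that assumption the agent is dropped after 75 seconds of missed pings, roughly eight times sooner than the 10 minute re-registration timeout the reporter expected to apply.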
[jira] [Commented] (MESOS-4612) Update to Zookeeper 3.4.7
[ https://issues.apache.org/jira/browse/MESOS-4612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15136547#comment-15136547 ] Cody Maloney commented on MESOS-4612: - That code in CMake means depending how you compile Mesos, you'll get very different behaviors (3.4.7 has several minor but critical behavior changes from 3.4.5). Mesos already patches Zookeeper 3.4.5, patching 3.4.7 to compile under Windows (Releases of zookeeper are unpredictable. Ideally we'd have zookeeper 3.5 which has a _ton_ of things improved, but that has an unknown release date at this point) > Update to Zookeeper 3.4.7 > - > > Key: MESOS-4612 > URL: https://issues.apache.org/jira/browse/MESOS-4612 > Project: Mesos > Issue Type: Improvement >Reporter: Cody Maloney >Assignee: haosdent > Labels: mesosphere, tech-debt > > See: http://zookeeper.apache.org/doc/r3.4.7/releasenotes.html for > improvements / bug fixes -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4612) Update to Zookeeper 3.4.7
Cody Maloney created MESOS-4612: --- Summary: Update to Zookeeper 3.4.7 Key: MESOS-4612 URL: https://issues.apache.org/jira/browse/MESOS-4612 Project: Mesos Issue Type: Improvement Reporter: Cody Maloney See: http://zookeeper.apache.org/doc/r3.4.7/releasenotes.html for improvements / bug fixes
[jira] [Comment Edited] (MESOS-2814) os::read should have one implementation
[ https://issues.apache.org/jira/browse/MESOS-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131244#comment-15131244 ] Cody Maloney edited comment on MESOS-2814 at 2/3/16 10:17 PM: -- In master there are currently three implementations of the function: https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/os/read.hpp#L42 https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/os/read.hpp#L82 https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/os/read.hpp#L42 All of them have fairly radically different implementations (one uses C read(), one uses C++ ifstream, one uses C fopen). I'd argue that read(fd, size) should be the underpinning of all three. When we're given a filename rather than an fd, we should open() the file, then read() the whole thing (we could get the length by doing a stat of the file), or make a second implementation of read(int fd) which stops at EOF rather than after a fixed number of bytes. 
All three overloads of the function can currently produce surprisingly different results in their independent implementations was (Author: cmaloney): In master there are currently three implementations of the function: https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/os/read.hpp#L42 https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/os/read.hpp#L82 https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/os/read.hpp#L42 All of them have fairly radically different implementations (One uses C read(), one uses c++ ifstream, one uses c fopen) > os::read should have one implementation > --- > > Key: MESOS-2814 > URL: https://issues.apache.org/jira/browse/MESOS-2814 > Project: Mesos > Issue Type: Improvement > Components: stout >Reporter: Cody Maloney >Assignee: Isabel Jimenez > Labels: mesosphere, tech-debt > > Currently stout os::read() has two radically different implementations when > you give it a {{std::string}} vs. a {{const char *}}. Ideally these have one > implementation that does things like intelligently size the buffer that it > writes into rather than re-allocating repeatedly with every time it lengthens > the string (resulting in copious copying). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
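The layering proposed in this comment, with one fd-based read loop underneath and the filename variant built on top of it, can be sketched as follows. This is a Python illustration of the structure rather than the stout C++ code; `read_fd` and `read_path` are hypothetical names.

```python
import os


def read_fd(fd, chunk_size=65536):
    """Read from a file descriptor until EOF and return all bytes.

    os.read raises OSError on failure, so an error partway through the
    file propagates to the caller instead of being silently dropped.
    """
    chunks = []
    while True:
        chunk = os.read(fd, chunk_size)
        if not chunk:  # empty read means EOF
            break
        chunks.append(chunk)
    return b"".join(chunks)  # a single final copy, no repeated reallocation


def read_path(path):
    """Filename variant: open the file, then delegate to the fd-based reader."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return read_fd(fd)
    finally:
        os.close(fd)
```

Because both entry points share one loop, binary content with embedded newlines or null bytes comes back identically whichever one is called.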
[jira] [Commented] (MESOS-2814) os::read should have one implementation
[ https://issues.apache.org/jira/browse/MESOS-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131244#comment-15131244 ] Cody Maloney commented on MESOS-2814: - In master there are currently three implementations of the function: https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/os/read.hpp#L42 https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/os/read.hpp#L82 https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/os/read.hpp#L42 All of them have fairly radically different implementations (one uses C read(), one uses C++ ifstream, one uses C fopen) > os::read should have one implementation > --- > > Key: MESOS-2814 > URL: https://issues.apache.org/jira/browse/MESOS-2814 > Project: Mesos > Issue Type: Improvement > Components: stout >Reporter: Cody Maloney >Assignee: Isabel Jimenez > Labels: mesosphere, tech-debt > > Currently stout os::read() has two radically different implementations when > you give it a {{std::string}} vs. a {{const char *}}. Ideally these would share one > implementation that does things like intelligently sizing the buffer that it > writes into, rather than re-allocating every time it lengthens > the string (resulting in copious copying). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-416) Ensure master / slave do not get kernel OOM before executors, by setting oom_adj control.
[ https://issues.apache.org/jira/browse/MESOS-416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Maloney updated MESOS-416: --- Labels: mesosphere twitter (was: twitter) > Ensure master / slave do not get kernel OOM before executors, by setting > oom_adj control. > - > > Key: MESOS-416 > URL: https://issues.apache.org/jira/browse/MESOS-416 > Project: Mesos > Issue Type: Improvement >Reporter: Benjamin Mahler > Labels: mesosphere, twitter > > We can adjust the /proc/<pid>/oom_adj control during master / slave startup, > setting it to a low value to ensure we aren't killed first during an OOM. > Relevant LWN article: http://lwn.net/Articles/317814/ > Also relevant: https://bugzilla.redhat.com/show_bug.cgi?id=239313 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
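A rough sketch of what the startup adjustment described in MESOS-416 could look like, not Mesos code. The function name writeOomAdjustment is hypothetical, and the control file path is a parameter so the logic can be exercised against an ordinary file. On Linux the legacy oom_adj file accepts -17 (never kill) through 15; newer kernels prefer oom_score_adj with a -1000 to 1000 range, and lowering either value normally needs elevated privileges.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: writes an OOM adjustment value to `path`.
// For a real daemon adjusting itself, `path` would be something like
// "/proc/self/oom_adj" (an assumption; the issue only names the
// per-pid oom_adj control).
bool writeOomAdjustment(const std::string& path, int value) {
  // Legacy oom_adj range: -17 disables OOM killing, 15 is most killable.
  if (value < -17 || value > 15) {
    return false;
  }
  std::ofstream control(path);
  if (!control.is_open()) {
    return false;  // Missing file or insufficient permissions.
  }
  control << value << '\n';
  control.flush();
  return control.good();
}
```

A master or agent would call this once during startup, before launching any executors, and log (rather than abort) on failure, since the process may not have permission to lower its own score.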
[jira] [Created] (MESOS-4578) docker run -c is deprecated
Cody Maloney created MESOS-4578: --- Summary: docker run -c is deprecated Key: MESOS-4578 URL: https://issues.apache.org/jira/browse/MESOS-4578 Project: Mesos Issue Type: Improvement Components: containerization, docker Affects Versions: 0.26.0 Environment: CoreOS 7 Reporter: Cody Maloney When running mesos slave with the docker containerizer enabled on CoreOS 766.4.0, launching docker containers results in the following in stderr: {noformat} Warning: '-c' is deprecated, it will be replaced by '--cpu-shares' soon. See usage. {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4578) docker run -c is deprecated
[ https://issues.apache.org/jira/browse/MESOS-4578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Maloney updated MESOS-4578: Labels: mesosphere newbie (was: mesosphere) > docker run -c is deprecated > --- > > Key: MESOS-4578 > URL: https://issues.apache.org/jira/browse/MESOS-4578 > Project: Mesos > Issue Type: Improvement > Components: containerization, docker >Affects Versions: 0.26.0 > Environment: CoreOS 7 >Reporter: Cody Maloney > Labels: mesosphere, newbie > > When running mesos slave with the docker containerizer enabled on CoreOS > 766.4.0, launching docker containers results in the following in stderr: > {noformat} > Warning: '-c' is deprecated, it will be replaced by '--cpu-shares' soon. See > usage. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4569) Re-Registered and Registered times are the same after agents re-register
Cody Maloney created MESOS-4569: --- Summary: Re-Registered and Registered times are the same after agents re-register Key: MESOS-4569 URL: https://issues.apache.org/jira/browse/MESOS-4569 Project: Mesos Issue Type: Bug Affects Versions: 0.27.0 Reporter: Cody Maloney Priority: Minor When I launch a Multi-Master cluster with Mesos 0.27, kill the leading master, and all agents re-register with the new master, the "Registered" and "Re-registered" times are the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4546) Mesos Agents needs to re-resolve hosts in zk string on leader change / failure to connect
Cody Maloney created MESOS-4546: --- Summary: Mesos Agents needs to re-resolve hosts in zk string on leader change / failure to connect Key: MESOS-4546 URL: https://issues.apache.org/jira/browse/MESOS-4546 Project: Mesos Issue Type: Bug Components: slave Reporter: Cody Maloney Assignee: Artem Harutyunyan Priority: Blocker Sample Mesos Agent log: https://gist.github.com/brndnmtthws/fb846fa988487250a809 Note, zookeeper has a function to change the list of servers at runtime: https://github.com/apache/zookeeper/blob/735ea78909e67c648a4978c8d31d63964986af73/src/c/src/zookeeper.c#L1207-L1232 This comes up when using an AWS AutoScalingGroup for managing the set of masters. When the agent comes up the first time, it resolves the hosts in the zk:// string. Once all the hosts that were in the original string have failed (each fails and is replaced by a new machine, which has the same DNS name), the agent just keeps spinning in an internal loop, never re-resolving the DNS names. Two solutions I see are 1. Update the list of servers / re-resolve 2. Have the agent detect it hasn't connected recently, and kill itself (which will force a re-resolution when the agent starts back up) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
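To illustrate solution 1 from MESOS-4546, here is a small sketch, assuming POSIX getaddrinfo() and a zk:// string of the form zk://host1:port1,host2:port2/path with no "user:password@" section. The helper names parseZkHosts and resolveNow are hypothetical; this is not how the agent or the ZooKeeper client is implemented. The point is only that addresses are looked up on every call instead of being cached from the first connection.

```cpp
#include <netdb.h>
#include <sys/socket.h>
#include <sys/types.h>

#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: splits "zk://host1:port1,host2:port2/path" into
// (host, port) pairs. Entries without an explicit port are skipped.
std::vector<std::pair<std::string, std::string>> parseZkHosts(
    const std::string& url) {
  std::vector<std::pair<std::string, std::string>> hosts;
  const std::string scheme = "zk://";
  if (url.compare(0, scheme.size(), scheme) != 0) {
    return hosts;
  }
  size_t end = url.find('/', scheme.size());
  std::string list = url.substr(
      scheme.size(),
      end == std::string::npos ? std::string::npos : end - scheme.size());
  size_t start = 0;
  while (start <= list.size()) {
    size_t comma = list.find(',', start);
    std::string entry = list.substr(
        start, comma == std::string::npos ? std::string::npos : comma - start);
    size_t colon = entry.rfind(':');
    if (colon != std::string::npos) {
      hosts.emplace_back(entry.substr(0, colon), entry.substr(colon + 1));
    }
    if (comma == std::string::npos) {
      break;
    }
    start = comma + 1;
  }
  return hosts;
}

// Hypothetical helper: resolves `host` afresh on every call and returns
// the numeric addresses, so replaced machines with the same DNS name are
// picked up on the next reconnect attempt.
std::vector<std::string> resolveNow(
    const std::string& host, const std::string& port) {
  std::vector<std::string> addresses;
  addrinfo hints{};
  hints.ai_family = AF_UNSPEC;
  hints.ai_socktype = SOCK_STREAM;
  addrinfo* results = nullptr;
  if (getaddrinfo(host.c_str(), port.c_str(), &hints, &results) != 0) {
    return addresses;  // Resolution failed; caller can retry later.
  }
  for (addrinfo* ai = results; ai != nullptr; ai = ai->ai_next) {
    char text[NI_MAXHOST];
    if (getnameinfo(ai->ai_addr, ai->ai_addrlen, text, sizeof(text),
                    nullptr, 0, NI_NUMERICHOST) == 0) {
      addresses.push_back(text);
    }
  }
  freeaddrinfo(results);
  return addresses;
}
```

A reconnect loop would call parseZkHosts once and resolveNow for each host before every connection attempt, handing the fresh address list to whatever client API updates the server set.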
[jira] [Updated] (MESOS-2718) Future created by State.names() throws an Illegal ExecutionException
[ https://issues.apache.org/jira/browse/MESOS-2718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Maloney updated MESOS-2718: Labels: mesosphere (was: ) > Future created by State.names() throws an Illegal ExecutionException > > > Key: MESOS-2718 > URL: https://issues.apache.org/jira/browse/MESOS-2718 > Project: Mesos > Issue Type: Bug > Components: java api >Affects Versions: 0.22.1 > Environment: OSX, Mesos 0.22.1 >Reporter: Matthias Veit > Labels: mesosphere > > During application startup, we call org.apache.mesos.state.State.names(). > This will return a java Future. > Everything is fine in the success case. > In the error case, the future can throw either an InterruptedException, > ExecutionException or a RuntimeException. > The ExecutionException indicates that the future was not successful. > This is the text from the javadoc: > Exception thrown when attempting to retrieve the result of a task that > aborted by throwing an exception. This exception can be inspected using the > Throwable.getCause() method. See here: > https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutionException.html > The ExecutionException thrown by mesos in the above method does not hold a > reference to the root cause, but returns a reference to itself as the cause (ex == > ex.getCause()). > ExecutionException really is a wrapper exception to indicate success or > failure of the java future and should always have a root cause. > With the current implementation we can't distinguish between a Future error > and an application error. Please always provide the exception cause. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2281) Remove legacy Credential format
[ https://issues.apache.org/jira/browse/MESOS-2281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Maloney updated MESOS-2281: Labels: tech-debt (was: ) > Remove legacy Credential format > --- > > Key: MESOS-2281 > URL: https://issues.apache.org/jira/browse/MESOS-2281 > Project: Mesos > Issue Type: Improvement > Components: master, slave >Affects Versions: 0.21.1 >Reporter: Cody Maloney > Labels: tech-debt > > Currently two formats of credentials are supported: JSON > {code} > { > "credentials": [ > { > "principal": "sherman", > "secret": "kitesurf" > } > ] > } > {code} > And a newline-delimited file: > {code} > principal1 secret1 > principal2 secret2 > {code} > We should deprecate and remove support for the old format. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
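If the legacy format from MESOS-2281 is removed, existing files need a one-time conversion. The following is a sketch of such a converter, assuming each legacy line is "principal secret" separated by whitespace and that neither field contains control characters. The function names escapeJson and legacyToJson are hypothetical and not part of Mesos.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: escapes only quotes and backslashes. This assumes
// principals and secrets contain no control characters.
std::string escapeJson(const std::string& text) {
  std::string escaped;
  for (char c : text) {
    if (c == '"' || c == '\\') {
      escaped.push_back('\\');
    }
    escaped.push_back(c);
  }
  return escaped;
}

// Hypothetical converter: turns "principal secret" lines into the JSON
// credentials document. Blank or malformed lines are skipped.
std::string legacyToJson(const std::string& legacy) {
  std::istringstream lines(legacy);
  std::ostringstream out;
  out << "{\"credentials\": [";
  std::string line;
  bool first = true;
  while (std::getline(lines, line)) {
    std::istringstream fields(line);
    std::string principal;
    std::string secret;
    if (!(fields >> principal >> secret)) {
      continue;
    }
    if (!first) {
      out << ", ";
    }
    first = false;
    out << "{\"principal\": \"" << escapeJson(principal)
        << "\", \"secret\": \"" << escapeJson(secret) << "\"}";
  }
  out << "]}";
  return out.str();
}
```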
[jira] [Commented] (MESOS-4181) Change port range logging to different logging level.
[ https://issues.apache.org/jira/browse/MESOS-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082108#comment-15082108 ] Cody Maloney commented on MESOS-4181: - Even with that change, the number of bytes to print grows non-linearly as you cut up the range; you'd need all the speed optimizations that went into the internal representation of ranges to go into the printing format... > Change port range logging to different logging level. > - > > Key: MESOS-4181 > URL: https://issues.apache.org/jira/browse/MESOS-4181 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 0.25.0 >Reporter: Cody Maloney >Assignee: Joerg Schad > Labels: mesosphere, newbie > > Transforming from mesos' internal port range representation -> text is > non-linear in the number of bytes output. We end up with a massive amount of > log data like the following: > {noformat} > Dec 15 23:54:08 ip-10-0-7-60.us-west-2.compute.internal mesos-master[15919]: > I1215 23:51:58.891165 15925 hierarchical.hpp:1103] Recovered cpus(*):1e-05; > mem(*):10; ports(*):[5565-5565] (total: ports(*):[1025-2180, 2182-3887, > 3889-5049, 5052-8079, 8082-8180, 8182-32000]; cpus(*):4; mem(*):14019; > disk(*):32541, allocated: cpus(*):0.01815; ports(*):[1050-1050, 1092-1092, > 1094-1094, 1129-1129, 1132-1132, 1140-1140, 1177-1178, 1180-1180, 1192-1192, > 1205-1205, 1221-1221, 1308-1308, 1311-1311, 1323-1323, 1326-1326, 1335-1335, > 1365-1365, 1404-1404, 1412-1412, 1436-1436, 1455-1455, 1459-1459, 1472-1472, > 1477-1477, 1482-1482, 1491-1491, 1510-1510, 1551-1551, 1553-1553, 1559-1559, > 1573-1573, 1590-1590, 1592-1592, 1619-1619, 1635-1636, 1678-1678, 1738-1738, > 1742-1742, 1752-1752, 1770-1770, 1780-1782, 1790-1790, 1792-1792, 1799-1799, > 1804-1804, 1844-1844, 1852-1852, 1867-1867, 1899-1899, 1936-1936, 1945-1945, > 1954-1954, 2046-2046, 2055-2055, 2063-2063, 2070-2070, 2089-2089, 2104-2104, > 2117-2117, 2132-2132, 2173-2173, 2178-2178, 2188-2188, 2200-2200, 2218-2218, > 
2223-2223, 2244-2244, 2248-2248, 2250-2250, 2270-2270, 2286-2286, 2302-2302, > 2332-2332, 2377-2377, 2397-2397, 2423-2423, 2435-2435, 2442-2442, 2448-2448, > 2477-2477, 2482-2482, 2522-2522, 2586-2586, 2594-2594, 2600-2600, 2602-2602, > 2643-2643, 2648-2648, 2659-2659, 2691-2691, 2716-2716, 2739-2739, 2794-2794, > 2802-2802, 2823-2823, 2831-2831, 2840-2840, 2848-2848, 2876-2876, 2894-2895, > 2900-2900, 2904-2904, 2912-2912, 2983-2983, 2991-2991, 2999-2999, 3011-3011, > 3025-3025, 3036-3036, 3041-3041, 3051-3051, 3074-3074, 3097-3097, 3107-3107, > 3121-3121, 3171-3171, 3176-3176, 3195-3195, 3197-3197, 3210-3210, 3221-3221, > 3234-3234, 3245-3245, 3250-3251, 3255-3255, 3270-3270, 3293-3293, 3298-3298, > 3312-3312, 3318-3318, 3325-3325, 3368-3368, 3379-3379, 3391-3391, 3412-3412, > 3414-3414, 3420-3420, 3492-3492, 3501-3501, 3538-3538, 3579-3579, 3631-3631, > 3680-3680, 3684-3684, 3695-3695, 3699-3699, 3738-3738, 3758-3758, 3793-3793, > 3808-3808, 3817-3817, 3854-3854, 3856-3856, 3900-3900, 3906-3906, 3909-3909, > 3912-3912, 3946-3946, 3956-3956, 3959-3959, 3963-3963, 3974- > Dec 15 23:54:09 ip-10-0-7-60.us-west-2.compute.internal mesos-master[15919]: > 3974, 3981-3981, 3985-3985, 4134-4134, 4178-4178, 4206-4206, 4223-4223, > 4239-4239, 4245-4245, 4251-4251, 4262-4263, 4271-4271, 4308-4308, 4323-4323, > 4329-4329, 4368-4368, 4385-4385, 4404-4404, 4419-4419, 4430-4430, 4448-4448, > 4464-4464, 4481-4481, 4494-4494, 4499-4499, 4510-4510, 4534-4534, 4543-4543, > 4555-4555, 4561-4562, 4577-4577, 4601-4601, 4675-4675, 4722-4722, 4739-4739, > 4748-4748, 4752-4752, 4764-4764, 4771-4771, 4787-4787, 4827-4827, 4830-4830, > 4837-4837, 4848-4848, 4853-4853, 4879-4879, 4883-4883, 4897-4897, 4902-4902, > 4911-4911, 4940-4940, 4946-4946, 4957-4957, 4994-4994, 4996-4996, 5008-5008, > 5019-5019, 5043-5043, 5059-5059, 5109-5109, 5134-5135, 5157-5157, 5172-5172, > 5192-5192, 5211-5211, 5215-5215, 5234-5234, 5237-5237, 5246-5246, 5255-5255, > 5268-5268, 5311-5311, 5314-5314, 5316-5316, 
5348-5348, 5391-5391, 5407-5407, > 5433-5433, 5446-5447, 5454-5454, 5456-5456, 5482-5482, 5514-5515, 5517-5517, > 5525-5525, 5542-5542, 5554-5554, 5581-5581, 5624-5624, 5647-5647, 5695-5695, > 5700-5700, 5703-5703, 5743-5743, 5747-5747, 5793-5793, 5850-5850, 5856-5856, > 5858-5858, 5899-5899, 5901-5901, 5940-5940, 5958-5958, 5962-5962, 5974-5974, > 5995-5995, 6000-6001, 6037-6037, 6053-6053, 6066-6066, 6078-6078, 6129-6129, > 6139-6139, 6160-6160, 6174-6174, 6193-6193, 6234-6234, 6263-6263, 6276-6276, > 6287-6287, 6292-6292, 6294-6294, 6296-6296, 6306-6307, 6333-6333, 6343-6343, > 6349-6349, 6377-6377, 6418-6418, 6454-6454, 6484-6484, 6496-6496, 6504-6504, > 6518-6518, 6589-6589, 6592-6592, 6606-6606, 6640-6640, 6713-6713, 6717-6717, >
[jira] [Updated] (MESOS-4233) Logging is too verbose for sysadmins / syslog
[ https://issues.apache.org/jira/browse/MESOS-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Maloney updated MESOS-4233: Description: Currently mesos logs a lot. When launching a thousand tasks in the space of 10 seconds it will print tens of thousands of log lines, overwhelming syslog (there is a max rate at which a process can send stuff over a unix socket) and not giving useful information to a sysadmin who cares about just the high-level activity and when something goes wrong. Note mesos also blocks writing to its log locations, so when writing a lot of log messages, it can fill up the write buffer in the kernel, and be suspended until the syslog agent catches up reading from the socket (GLOG does a blocking fwrite to stderr). GLOG also has a big mutex around logging so only one thing logs at a time. While for "internal debugging" it is useful to see things like "message went from internal component x to internal component y", from a sysadmin perspective I only care about the high level actions taken (launched task for framework x, sent offer to framework y, got task failed from host z). Note those are what I'd expect at the "INFO" level. At the "WARNING" level I'd expect very little to be logged / almost nothing in normal operation. Just things like "WARN: Replicated log write took longer than expected". WARN would also get things like backtraces on crashes and abnormal exits / abort. When trying to launch 3k+ tasks inside a second, mesos logging currently overwhelms syslog with 100k+ messages, many of which are thousands of bytes. Sysadmins expect to be able to use syslog to monitor basic events in their system. This is too much. We can keep logging the messages to files, but the logging to stderr needs to be reduced significantly (stderr gets picked up and forwarded to syslog / central aggregation). 
What I would like is to be able to set the stderr logging level to be different / independent from the file logging level (Syslog giving the "sysadmin" aggregated overview, files useful for debugging in depth what happened in a cluster). A lot of what mesos currently logs at info is really debugging info / should show up as debug log level. Some samples of mesos logging a lot more than a sysadmin would want / expect are attached, and some are below: - Every task gets printed multiple times for a basic launch: {noformat} Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: I1215 22:58:29.382644 1315 master.cpp:3248] Launching task envy.5b19a713-a37f-11e5-8b3e-0251692d6109 of framework 5178f46d-71d6-422f-922c-5bbe82dff9cc- (marathon) Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: I1215 22:58:29.382925 1315 master.hpp:176] Adding task envy.5b1958f2-a37f-11e5-8b3e-0251692d6109 with resources cpus(*):0.0001; mem(*):16; ports(*):[14047-14047] {noformat} - Every task status update prints many log lines; successful ones are part of normal operation and maybe should be logged at info / debug levels, but not shown to a sysadmin (just show when things fail, and maybe aggregate counters to indicate the volume of work) - No log messages should be really big / more than 1k characters (this would prevent the giant port list attached, and make such cases easily discoverable / bug fileable / fixable) was: Currently mesos logs a lot. When launching a thousand tasks in the space of 10 seconds it will print tens of thousands of log lines, overwhelming syslog (there is a max rate at which a process can send stuff over a unix socket) and not giving useful information to a sysadmin who cares about just the high-level activity and when something goes wrong. 
Note mesos also blocks writing to its log locations, so when writing a lot of log messages, it can fill up the write buffer in the kernel, and be suspended until the syslog agent catches up reading from the socket (GLOG does a blocking fwrite to stderr). GLOG also has a big mutex around logging so only one thing logs at a time. While for "internal debugging" it is useful to see things like "message went from internal component x to internal component y", from a sysadmin perspective I only care about the high level actions taken (launched task for framework x, sent offer to framework y, got task failed from host z). Note those are what I'd expect at the "INFO" level. At the "WARNING" level I'd expect very little to be logged / almost nothing in normal operation. Just things like "WARN: Replicated log write took longer than expected". WARN would also get things like backtraces on crashes and abnormal exits / abort. When trying to launch 3k+ tasks inside a second, mesos logging currently overwhelms syslog with 100k+ messages, many of which are thousands of bytes. Sysadmins expect to be able to use syslog to monitor basic events in their system. This is too much. We can keep logging the messages
[jira] [Updated] (MESOS-4233) Logging is too verbose for sysadmins / syslog
[ https://issues.apache.org/jira/browse/MESOS-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Maloney updated MESOS-4233: Description: Currently mesos logs a lot. When launching a thousand tasks in the space of 10 seconds it will print tens of thousands of log lines, overwhelming syslog (there is a max rate at which a process can send stuff over a unix socket) and not giving useful information to a sysadmin who cares about just the high-level activity and when something goes wrong. Note mesos also blocks writing to its log locations, so when writing a lot of log messages, it can fill up the write buffer in the kernel, and be suspended until the syslog agent catches up reading from the socket (GLOG does a blocking fwrite to stderr). GLOG also has a big mutex around logging so only one thing logs at a time. While for "internal debugging" it is useful to see things like "message went from internal component x to internal component y", from a sysadmin perspective I only care about the high level actions taken (launched task for framework x, sent offer to framework y, got task failed from host z). Note those are what I'd expect at the "INFO" level. At the "WARNING" level I'd expect very little to be logged / almost nothing in normal operation. Just things like "WARN: Replicated log write took longer than expected". WARN would also get things like backtraces on crashes and abnormal exits / abort. When trying to launch 3k+ tasks inside a second, mesos logging currently overwhelms syslog with 100k+ messages, many of which are thousands of bytes. Sysadmins expect to be able to use syslog to monitor basic events in their system. This is too much. We can keep logging the messages to files, but the logging to stderr needs to be reduced significantly (stderr gets picked up and forwarded to syslog / central aggregation). 
What I would like is to be able to set the stderr logging level to be different / independent from the file logging level (Syslog giving the "sysadmin" aggregated overview, files useful for debugging in depth what happened in a cluster). A lot of what mesos currently logs at info is really debugging info / should show up as debug log level. Some samples of mesos logging a lot more than a sysadmin would want / expect are attached, and some are below: - Every task gets printed multiple times for a basic launch: {noformat} Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: I1215 22:58:29.382644 1315 master.cpp:3248] Launching task envy.5b19a713-a37f-11e5-8b3e-0251692d6109 of framework 5178f46d-71d6-422f-922c-5bbe82dff9cc- (marathon) Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: I1215 22:58:29.382925 1315 master.hpp:176] Adding task envy.5b1958f2-a37f-11e5-8b3e-0251692d6109 with resources cpus(*):0.0001; mem(*):16; ports(*):[14047-14047] {noformat} - Every task status update prints many log lines; successful ones are part of normal operation and maybe should be logged at info / debug levels, but not shown to a sysadmin (just show when things fail, and maybe aggregate counters to indicate the volume of work) - No log messages should be really big / more than 1k characters (this would prevent the giant port list attached, and make such cases easily discoverable / bug fileable / fixable) was: Currently mesos logs a lot. When launching a thousand tasks in the space of 10 seconds it will print tens of thousands of log lines, overwhelming syslog (there is a max rate at which a process can send stuff over a unix socket) and not giving useful information to a sysadmin who cares about just the high-level activity and when something goes wrong. 
Note mesos also blocks writing to its log locations, so when writing a lot of log messages, it can fill up the write buffer in the kernel, and be suspended until the syslog agent catches up reading from the socket (GLOG does a blocking fwrite to stderr). GLOG also has a big mutex around logging so only one thing logs at a time. While for "internal debugging" it is useful to see things like "message went from internal component x to internal component y", from a sysadmin perspective I only care about the high level actions taken (launched task for framework x, sent offer to framework y, got task failed from host z). Note those are what I'd expect at the "INFO" level. At the "WARNING" level I'd expect very little to be logged / almost nothing in normal operation. Just things like "WARN: Replicated log write took longer than expected". WARN would also get things like backtraces on crashes and abnormal exits / abort. When trying to launch 3k+ tasks inside a second, mesos logging currently overwhelms syslog with 100k+ messages, many of which are thousands of bytes. Sysadmins expect to be able to use syslog to monitor
[jira] [Created] (MESOS-4233) Logging is too verbose for sysadmins / syslog
Cody Maloney created MESOS-4233: --- Summary: Logging is too verbose for sysadmins / syslog Key: MESOS-4233 URL: https://issues.apache.org/jira/browse/MESOS-4233 Project: Mesos Issue Type: Epic Reporter: Cody Maloney Currently mesos logs a lot. When launching a thousand tasks in the space of 10 seconds it will print tens of thousands of log lines, overwhelming syslog (there is a max rate at which a process can send stuff over a unix socket) and not giving useful information to a sysadmin who cares about just the high-level activity and when something goes wrong. Note mesos also blocks writing to its log locations, so when writing a lot of log messages, it can fill up the write buffer in the kernel, and be suspended until the syslog agent catches up reading from the socket (GLOG does a blocking fwrite to stderr). GLOG also has a big mutex around logging so only one thing logs at a time. While for "internal debugging" it is useful to see things like "message went from internal component x to internal component y", from a sysadmin perspective I only care about the high level actions taken (launched task for framework x, sent offer to framework y, got task failed from host z). Note those are what I'd expect at the "INFO" level. At the "WARNING" level I'd expect very little to be logged / almost nothing in normal operation. Just things like "WARN: Replicated log write took longer than expected". WARN would also get things like backtraces on crashes and abnormal exits / abort. When trying to launch 3k+ tasks inside a second, mesos logging currently overwhelms syslog with 100k+ messages, many of which are thousands of bytes. Sysadmins expect to be able to use syslog to monitor basic events in their system. This is too much. We can keep logging the messages to files, but the logging to stderr needs to be reduced significantly (stderr gets picked up and forwarded to syslog / central aggregation). 
What I would like is to be able to set the stderr logging level to be different / independent from the file logging level (Syslog giving the "sysadmin" aggregated overview, files useful for debugging in depth what happened in a cluster). A lot of what mesos currently logs at info is really debugging info / should show up as debug log level. Some samples of mesos logging a lot more than a sysadmin would want / expect are attached, and some are below: Every task gets printed multiple times for a basic launch: {noformat} Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: I1215 22:58:29.382644 1315 master.cpp:3248] Launching task envy.5b19a713-a37f-11e5-8b3e-0251692d6109 of framework 5178f46d-71d6-422f-922c-5bbe82dff9cc- (marathon) Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: I1215 22:58:29.382925 1315 master.hpp:176] Adding task envy.5b1958f2-a37f-11e5-8b3e-0251692d6109 with resources cpus(*):0.0001; mem(*):16; ports(*):[14047-14047] {noformat} Every task status update prints many log lines; successful ones are part of normal operation and maybe should be logged at info / debug levels, but not shown to a sysadmin (just show when things fail, and maybe aggregate counters to indicate the volume of work) No log messages should be really big / more than 1k characters (this would prevent the giant port list attached, and make such cases easily discoverable / bug fileable / fixable) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4233) Logging is too verbose for sysadmins / syslog
[ https://issues.apache.org/jira/browse/MESOS-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Maloney updated MESOS-4233: Attachment: giant_port_range_logging > Logging is too verbose for sysadmins / syslog > - > > Key: MESOS-4233 > URL: https://issues.apache.org/jira/browse/MESOS-4233 > Project: Mesos > Issue Type: Epic >Reporter: Cody Maloney > Labels: mesosphere > Attachments: giant_port_range_logging > > > Currently mesos logs a lot. When launching a thousand tasks in the space of > 10 seconds it will print tens of thousands of log lines, overwhelming syslog > (there is a max rate at which a process can send stuff over a unix socket) > and not giving useful information to a sysadmin who cares about just the > high-level activity and when something goes wrong. > Note mesos also blocks writing to its log locations, so when writing a lot of > log messages, it can fill up the write buffer in the kernel, and be suspended > until the syslog agent catches up reading from the socket (GLOG does a > blocking fwrite to stderr). GLOG also has a big mutex around logging so only > one thing logs at a time. > While for "internal debugging" it is useful to see things like "message went > from internal component x to internal component y", from a sysadmin > perspective I only care about the high level actions taken (launched task for > framework x, sent offer to framework y, got task failed from host z). Note > those are what I'd expect at the "INFO" level. At the "WARNING" level I'd > expect very little to be logged / almost nothing in normal operation. Just > things like "WARN: Replicated log write took longer than expected". WARN > would also get things like backtraces on crashes and abnormal exits / abort. > When trying to launch 3k+ tasks inside a second, mesos logging currently > overwhelms syslog with 100k+ messages, many of which are thousands of bytes. > Sysadmins expect to be able to use syslog to monitor basic events in their > system. 
This is too much. > We can keep logging the messages to files, but the logging to stderr needs to > be reduced significantly (stderr gets picked up and forwarded to syslog / > central aggregation). > What I would like is to be able to set the stderr logging level to be different / > independent from the file logging level (Syslog giving the "sysadmin" > aggregated overview, files useful for debugging in depth what happened in a > cluster). A lot of what mesos currently logs at info is really debugging info > / should show up as debug log level. > Some samples of mesos logging a lot more than a sysadmin would want / expect > are attached, and some are below: > Every task gets printed multiple times for a basic launch: > {noformat} > Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal > mesos-master[1311]: I1215 22:58:29.382644 1315 master.cpp:3248] Launching > task envy.5b19a713-a37f-11e5-8b3e-0251692d6109 of framework > 5178f46d-71d6-422f-922c-5bbe82dff9cc- (marathon) > Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: > I1215 22:58:29.382925 1315 master.hpp:176] Adding task > envy.5b1958f2-a37f-11e5-8b3e-0251692d6109 with resources cpus(*):0.0001; > mem(*):16; ports(*):[14047-14047] > {noformat} > Every task status update prints many log lines; successful ones are part of > normal operation and maybe should be logged at info / debug levels, but not > shown to a sysadmin (just show when things fail, and maybe aggregate counters to > indicate the volume of work) > No log messages should be really big / more than 1k characters (this would prevent > the giant port list attached, and make such cases easily discoverable / bug fileable / > fixable) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
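One way to honor the "no log message over 1k characters" point, sketched under assumptions: summarizeRanges and capMessage are hypothetical helpers, not Mesos functions, and the inclusive (begin, end) pair is a stand-in for Mesos' internal range type. The idea is to print only the first few ranges plus totals, and to truncate anything that still exceeds the limit.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Stand-in for a port range: inclusive [first, second].
using Range = std::pair<uint64_t, uint64_t>;

// Hypothetical helper: prints at most `maxShown` ranges, then a count of
// the ranges left out and the total number of ports covered.
std::string summarizeRanges(const std::vector<Range>& ranges,
                            size_t maxShown) {
  uint64_t total = 0;
  for (const Range& range : ranges) {
    total += range.second - range.first + 1;
  }
  std::ostringstream out;
  out << '[';
  for (size_t i = 0; i < ranges.size() && i < maxShown; ++i) {
    if (i > 0) {
      out << ", ";
    }
    out << ranges[i].first << '-' << ranges[i].second;
  }
  if (ranges.size() > maxShown) {
    if (maxShown > 0) {
      out << ", ";
    }
    out << "... " << (ranges.size() - maxShown) << " more range(s)";
  }
  out << "] (" << total << " ports total)";
  return out.str();
}

// Hypothetical helper: hard cap on message size, noting how much was cut.
std::string capMessage(const std::string& message, size_t limit) {
  if (message.size() <= limit) {
    return message;
  }
  std::ostringstream out;
  out << message.substr(0, limit) << " ... [truncated "
      << (message.size() - limit) << " bytes]";
  return out.str();
}
```

For example, three ranges 1050-1050, 1177-1178 and 1780-1782 with maxShown set to 2 come out as "[1050-1050, 1177-1178, ... 1 more range(s)] (6 ports total)", which stays short no matter how fragmented the allocation becomes.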
[jira] [Created] (MESOS-4181) Don't log port ranges
Cody Maloney created MESOS-4181: --- Summary: Don't log port ranges Key: MESOS-4181 URL: https://issues.apache.org/jira/browse/MESOS-4181 Project: Mesos Issue Type: Bug Components: master Affects Versions: 0.25.0 Reporter: Cody Maloney Transforming from mesos' internal port range representation -> text is non-linear in the number of bytes output. We end up with a massive amount of log data like the following: {noformat} Dec 15 23:54:08 ip-10-0-7-60.us-west-2.compute.internal mesos-master[15919]: I1215 23:51:58.891165 15925 hierarchical.hpp:1103] Recovered cpus(*):1e-05; mem(*):10; ports(*):[5565-5565] (total: ports(*):[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, 8182-32000]; cpus(*):4; mem(*):14019; disk(*):32541, allocated: cpus(*):0.01815; ports(*):[1050-1050, 1092-1092, 1094-1094, 1129-1129, 1132-1132, 1140-1140, 1177-1178, 1180-1180, 1192-1192, 1205-1205, 1221-1221, 1308-1308, 1311-1311, 1323-1323, 1326-1326, 1335-1335, 1365-1365, 1404-1404, 1412-1412, 1436-1436, 1455-1455, 1459-1459, 1472-1472, 1477-1477, 1482-1482, 1491-1491, 1510-1510, 1551-1551, 1553-1553, 1559-1559, 1573-1573, 1590-1590, 1592-1592, 1619-1619, 1635-1636, 1678-1678, 1738-1738, 1742-1742, 1752-1752, 1770-1770, 1780-1782, 1790-1790, 1792-1792, 1799-1799, 1804-1804, 1844-1844, 1852-1852, 1867-1867, 1899-1899, 1936-1936, 1945-1945, 1954-1954, 2046-2046, 2055-2055, 2063-2063, 2070-2070, 2089-2089, 2104-2104, 2117-2117, 2132-2132, 2173-2173, 2178-2178, 2188-2188, 2200-2200, 2218-2218, 2223-2223, 2244-2244, 2248-2248, 2250-2250, 2270-2270, 2286-2286, 2302-2302, 2332-2332, 2377-2377, 2397-2397, 2423-2423, 2435-2435, 2442-2442, 2448-2448, 2477-2477, 2482-2482, 2522-2522, 2586-2586, 2594-2594, 2600-2600, 2602-2602, 2643-2643, 2648-2648, 2659-2659, 2691-2691, 2716-2716, 2739-2739, 2794-2794, 2802-2802, 2823-2823, 2831-2831, 2840-2840, 2848-2848, 2876-2876, 2894-2895, 2900-2900, 2904-2904, 2912-2912, 2983-2983, 2991-2991, 2999-2999, 3011-3011, 3025-3025, 3036-3036, 3041-3041, 3051-3051, 
3074-3074, 3097-3097, 3107-3107, 3121-3121, 3171-3171, 3176-3176, 3195-3195, 3197-3197, 3210-3210, 3221-3221, 3234-3234, 3245-3245, 3250-3251, 3255-3255, 3270-3270, 3293-3293, 3298-3298, 3312-3312, 3318-3318, 3325-3325, 3368-3368, 3379-3379, 3391-3391, 3412-3412, 3414-3414, 3420-3420, 3492-3492, 3501-3501, 3538-3538, 3579-3579, 3631-3631, 3680-3680, 3684-3684, 3695-3695, 3699-3699, 3738-3738, 3758-3758, 3793-3793, 3808-3808, 3817-3817, 3854-3854, 3856-3856, 3900-3900, 3906-3906, 3909-3909, 3912-3912, 3946-3946, 3956-3956, 3959-3959, 3963-3963, 3974- Dec 15 23:54:09 ip-10-0-7-60.us-west-2.compute.internal mesos-master[15919]: 3974, 3981-3981, 3985-3985, 4134-4134, 4178-4178, 4206-4206, 4223-4223, 4239-4239, 4245-4245, 4251-4251, 4262-4263, 4271-4271, 4308-4308, 4323-4323, 4329-4329, 4368-4368, 4385-4385, 4404-4404, 4419-4419, 4430-4430, 4448-4448, 4464-4464, 4481-4481, 4494-4494, 4499-4499, 4510-4510, 4534-4534, 4543-4543, 4555-4555, 4561-4562, 4577-4577, 4601-4601, 4675-4675, 4722-4722, 4739-4739, 4748-4748, 4752-4752, 4764-4764, 4771-4771, 4787-4787, 4827-4827, 4830-4830, 4837-4837, 4848-4848, 4853-4853, 4879-4879, 4883-4883, 4897-4897, 4902-4902, 4911-4911, 4940-4940, 4946-4946, 4957-4957, 4994-4994, 4996-4996, 5008-5008, 5019-5019, 5043-5043, 5059-5059, 5109-5109, 5134-5135, 5157-5157, 5172-5172, 5192-5192, 5211-5211, 5215-5215, 5234-5234, 5237-5237, 5246-5246, 5255-5255, 5268-5268, 5311-5311, 5314-5314, 5316-5316, 5348-5348, 5391-5391, 5407-5407, 5433-5433, 5446-5447, 5454-5454, 5456-5456, 5482-5482, 5514-5515, 5517-5517, 5525-5525, 5542-5542, 5554-5554, 5581-5581, 5624-5624, 5647-5647, 5695-5695, 5700-5700, 5703-5703, 5743-5743, 5747-5747, 5793-5793, 5850-5850, 5856-5856, 5858-5858, 5899-5899, 5901-5901, 5940-5940, 5958-5958, 5962-5962, 5974-5974, 5995-5995, 6000-6001, 6037-6037, 6053-6053, 6066-6066, 6078-6078, 6129-6129, 6139-6139, 6160-6160, 6174-6174, 6193-6193, 6234-6234, 6263-6263, 6276-6276, 6287-6287, 6292-6292, 6294-6294, 6296-6296, 6306-6307, 
6333-6333, 6343-6343, 6349-6349, 6377-6377, 6418-6418, 6454-6454, 6484-6484, 6496-6496, 6504-6504, 6518-6518, 6589-6589, 6592-6592, 6606-6606, 6640-6640, 6713-6713, 6717-6717, 6738-6738, 6757-6757, 6765-6765, 6778-6778, 6792-6792, 6798-6798, 6811-6811, 6815-6815, 6828-6828, 6838-6839, 6856-6856, 6868-6868, 6877-6877, 6892-6892, 6903-6903, 6908-6908, 6943-6943, 6973-6973, 6977-6977, 7003-7003, 7019-7019, 7021-7021, 7031-7031, 7034-7034, 7038-7038, 7052-7052, 7060-7060, 7097-7097, 7124-7124, 7151-7152, 7169-7169, 7171-7171, 7200-7200, 7204-7204, 7246-7246, 7250-7250, 7292-7292, 7326-7326, 7347-7347, 7363-7363, 7369-7369, 7401-7401, 7407-7407, 7421-7421, 7436-7436, 7447-7447, 7458-74 Dec 15 23:54:09 ip-10-0-7-60.us-west-2.compute.internal mesos-master[15919]: 58, 7475-7475, 7477-7477, 7502-7502, 7531-7531,
[jira] [Commented] (MESOS-1806) Substituting etcd for Zookeeper
[ https://issues.apache.org/jira/browse/MESOS-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007284#comment-15007284 ] Cody Maloney commented on MESOS-1806: - Yes, it is a hard blocker. Restarting every machine in a large cluster when an etcd node goes down is going to result in a lot of cluster badness / thundering stampede. > Substituting etcd for Zookeeper > --- > > Key: MESOS-1806 > URL: https://issues.apache.org/jira/browse/MESOS-1806 > Project: Mesos > Issue Type: Task > Components: leader election >Reporter: Ed Ropple >Assignee: Shuai Lin >Priority: Minor > >eropple: Could you also file a new JIRA for Mesos to drop ZK > in favor of etcd or ReplicatedLog? Would love to get some momentum going on > that one. > -- > Consider it filed. =) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3836) `--executor-environment-variables` may not apply to docker containers
[ https://issues.apache.org/jira/browse/MESOS-3836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996140#comment-14996140 ] Cody Maloney commented on MESOS-3836: - Any solution that comes up here is going to land (at the soonest) in Mesos 0.27. That would likely mean not the next DCOS, but the one after, so this is all about mid-term planning at this point. When I say fully containerized I mean every executor should adhere to the same isolators that tasks do. A framework shouldn't be able to write a custom executor which uses more than its share of a CPU when CPU isolation is enabled, or more disk than its disk quota allows / than the framework has accepted offers for on the host. > `--executor-environment-variables` may not apply to docker containers > - > > Key: MESOS-3836 > URL: https://issues.apache.org/jira/browse/MESOS-3836 > Project: Mesos > Issue Type: Bug > Components: containerization, slave >Affects Versions: 0.25.0 > Environment: Mesos 0.25.0 configured with > --executor-environment-variables >Reporter: Cody Maloney >Assignee: Marco Massenzio >Priority: Minor > Labels: mesosphere > > In our use case we set {{PATH}} as part of the > {{\-\-executor_environment_variables}} in order to limit what binaries all > tasks which are launched via Mesos have readily available to them, making it > much harder for people launching tasks on mesos to accidentally depend on > something which isn't part of the "guaranteed" environment / platform. > Docker containers can be used as executors, and have a fully isolated > filesystem. For executors which run in docker containers setting {{PATH}} to > our path on the host filesystem may potentially break the docker container. 
> The previous code of only copying across environment variables when > {{includeOsEnvironment}} is set dealt with this > (https://github.com/apache/mesos/blob/56510afe149758a69a5a714dfaab16111dd0d9c3/src/slave/containerizer/containerizer.cpp#L267). > If {{includeOsEnvironment}} is set, then we should copy across the current > {{\-\-executor_environment_variables}}. If it isn't, then > {{\-\-executor_environment_variables}} shouldn't be used at all. > Another option which could be useful is to make it so that there are two sets > of "Executor Environment Variables": one for when {{includeOsEnvironment}} is > set, and one for when it is not. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3836) `--executor-environment-variables` may not apply to docker containers
[ https://issues.apache.org/jira/browse/MESOS-3836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996121#comment-14996121 ] Cody Maloney commented on MESOS-3836: - From what we've seen in practice, every task gets whatever environment variables were set on the executor. Every marathon app task got every environment variable that mesos-slave had unless the marathon app definition explicitly overrode it. Executors in many ways are like Tasks and should be fully containerized like them, which is a direction Mesos has been moving for a while (right now they aren't isolated at all, and having custom executors which are custom code running without isolation is not a great thing). Arguably the model should be that no containerized task sees anything except what it is explicitly told to see. Things shouldn't leak through from the host whatsoever. Mesos tells the tasks the couple of things that they are allowed to use. In the case of filesystem isolation (such as docker does) it doesn't tell tasks about special filesystem things unless it also adds a volume mount for them (rkt / appc may introduce another root filesystem isolation). From a DCOS perspective what we really want is for all tasks to be fully host isolated, so they all run filesystem isolated / even mesos native containerizer tasks run in effectively a chroot with very limited files and very limited environment variables set, so we only expose a small interface which we have to watch and version. 
[jira] [Commented] (MESOS-3740) LIBPROCESS_IP not passed to Docker containers
[ https://issues.apache.org/jira/browse/MESOS-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994319#comment-14994319 ] Cody Maloney commented on MESOS-3740: - The {{--executor-environment-variables}} flag is given directly to executors, and is currently inherited from the executor by all tasks the executors launch. We can't do just one generic flag of {{--docker-task-environment-variables}} which includes LIBPROCESS_IP, because LIBPROCESS_IP is something that Mesos can / will calculate (either using its classic reverse lookup behavior or --ip-detect-script). So I think that one still needs to be special cased so that we always just pass it through, to solve the present problem. Adding a {{--docker-environment-variables}} which applies to all executors and tasks launched with the docker containerizer could be useful in some circumstances (although within DCOS we have no need to pass special / extra / explicit environment variables to docker containers). The {{--docker-environment-variables}} still wouldn't be able to capture LIBPROCESS_IP though. > LIBPROCESS_IP not passed to Docker containers > - > > Key: MESOS-3740 > URL: https://issues.apache.org/jira/browse/MESOS-3740 > Project: Mesos > Issue Type: Bug > Components: containerization, docker >Affects Versions: 0.25.0 > Environment: Mesos 0.24.1 >Reporter: Cody Maloney >Assignee: Michael Park > Labels: mesosphere > > Docker containers aren't currently passed all the same environment variables > that Mesos Containerizer tasks are. See: > https://github.com/apache/mesos/blob/master/src/slave/containerizer/containerizer.cpp#L254 > for all the environment variables explicitly set for mesos containers. 
> While some of them don't necessarily make sense for docker containers, when > the docker container has inside of it a libprocess process (a mesos framework > scheduler) and is using {{--net=host}}, the task needs to have LIBPROCESS_IP > set; otherwise the same sort of problems that happen because of MESOS-3553 can > happen (libprocess will try to guess the machine's IP address, with likely bad > results in a number of operating environments). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
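The special-casing discussed in this comment could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the Mesos implementation: `dockerEnvArgs` and its parameters are hypothetical names made up for the example, and the only things taken from the thread are the `LIBPROCESS_IP` variable and the idea that the agent computes its value itself.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper (not a Mesos API): builds the `-e NAME=VALUE`
// arguments for a `docker run` command line.
//
// `operatorEnv` stands in for operator-supplied variables (for example,
// from a flag like the --docker-environment-variables idea in the thread).
// `libprocessIp` stands in for the address the agent computed itself.
std::vector<std::string> dockerEnvArgs(
    const std::map<std::string, std::string>& operatorEnv,
    const std::string& libprocessIp)
{
  std::map<std::string, std::string> env = operatorEnv;

  // Special case: always pass the computed address through, overriding any
  // operator-supplied value, because a static flag cannot know it.
  env["LIBPROCESS_IP"] = libprocessIp;

  std::vector<std::string> args;
  for (const auto& entry : env) {
    args.push_back("-e");
    args.push_back(entry.first + "=" + entry.second);
  }
  return args;
}
```

The point of the sketch is only that the computed value is merged in after the static configuration, so no flag has to carry it.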
[jira] [Updated] (MESOS-3751) MESOS_NATIVE_JAVA_LIBRARY not set on MesosContainerize tasks with --executor_environmnent_variables
[ https://issues.apache.org/jira/browse/MESOS-3751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Maloney updated MESOS-3751: Fix Version/s: 0.26.0 > MESOS_NATIVE_JAVA_LIBRARY not set on MesosContainerize tasks with > --executor_environmnent_variables > --- > > Key: MESOS-3751 > URL: https://issues.apache.org/jira/browse/MESOS-3751 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 0.24.1, 0.25.0 >Reporter: Cody Maloney >Assignee: Gilbert Song > Labels: mesosphere, newbie > Fix For: 0.26.0 > > > When using --executor_environment_variables, and having > MESOS_NATIVE_JAVA_LIBRARY in the environment of mesos-slave, the mesos > containerizer does not set MESOS_NATIVE_JAVA_LIBRARY itself. > Relevant code: > https://github.com/apache/mesos/blob/14f7967ef307f3d98e3a4b93d92d6b3a56399b20/src/slave/containerizer/containerizer.cpp#L281 > It sees that the variable is in the mesos-slave's environment (os::getenv), > rather than checking if it is set in the environment variable set. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3740) LIBPROCESS_IP not passed to Docker containers
[ https://issues.apache.org/jira/browse/MESOS-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984461#comment-14984461 ] Cody Maloney commented on MESOS-3740: - This came up when trying to launch a mesos framework inside of a docker container. The framework used libmesos, and that libmesos couldn't figure out what IP to use (the machine didn't have a hostname, and even if it did, the hostname may not resolve to the right IP address the mesos framework inside the docker container should announce as its own IP, due to something like having multiple addresses on the machine, or running in an IP per container type environment). > LIBPROCESS_IP not passed to Docker containers > - > > Key: MESOS-3740 > URL: https://issues.apache.org/jira/browse/MESOS-3740 > Project: Mesos > Issue Type: Bug > Components: containerization, docker >Affects Versions: 0.25.0 > Environment: Mesos 0.24.1 >Reporter: Cody Maloney >Assignee: Michael Park > Labels: mesosphere > > Docker containers aren't currently passed all the same environment variables > that Mesos Containerizer tasks are. See: > https://github.com/apache/mesos/blob/master/src/slave/containerizer/containerizer.cpp#L254 > for all the environment variables explicitly set for mesos containers. > While some of them don't necessarily make sense for docker containers, when > the docker container has inside of it a libprocess process (a mesos framework > scheduler) and is using {{--net=host}}, the task needs to have LIBPROCESS_IP > set; otherwise the same sort of problems that happen because of MESOS-3553 can > happen (libprocess will try to guess the machine's IP address, with likely bad > results in a number of operating environments). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3772) Consistency of quoted strings in error messages
[ https://issues.apache.org/jira/browse/MESOS-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965652#comment-14965652 ] Cody Maloney commented on MESOS-3772: - What about generally preferring [std::quoted|http://en.cppreference.com/w/cpp/io/manip/quoted]? That does the escaping of quotes inside the string for you, as well as adding surrounding quotes (double quotes by default; the delimiter is configurable), so it is a predictable / reversible transformation. > Consistency of quoted strings in error messages > --- > > Key: MESOS-3772 > URL: https://issues.apache.org/jira/browse/MESOS-3772 > Project: Mesos > Issue Type: Bug >Reporter: Neil Conway > Labels: mesosphere, newbie > > Example log output: > {quote} > I1020 18:56:02.933956 1790 slave.cpp:1270] Got assigned task 13 for > framework 496620b9-4368-4a71-b741-68216f3d909f- > I1020 18:56:02.934185 1790 slave.cpp:1386] Launching task 13 for framework > 496620b9-4368-4a71-b741-68216f3d909f- > I1020 18:56:02.934408 1790 slave.cpp:1618] Queuing task '13' for executor > default of framework '496620b9-4368-4a71-b741-68216f3d909f- > I1020 18:56:02.935417 1790 slave.cpp:1760] Sending queued task '13' to > executor 'default' of framework 496620b9-4368-4a71-b741-68216f3d909f- > {quote} > Aside from the typo (unmatched quote) in the third line, these log messages > use quoting inconsistently: sometimes task, executor, and framework IDs are > quoted, other times they are not. > We should probably adopt a general rule, a la > http://www.postgresql.org/docs/9.4/static/error-style-guide.html . My > proposal: when interpolating a variable, only use quotes if it is possible > that the value might contain whitespace or punctuation (in the latter case, > the punctuation should probably be escaped). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
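To make the `std::quoted` suggestion concrete, here is a small sketch of the round trip it provides. The helper names `quoteForLog` and `unquoteFromLog` are made up for illustration and are not part of Mesos or stout; `std::quoted` itself is standard C++14 (`<iomanip>`) and by default uses `"` as the delimiter and `\` as the escape character.

```cpp
#include <cassert>
#include <iomanip>
#include <sstream>
#include <string>

// Illustrative helper: wraps `value` in quotes and escapes any embedded
// quote or backslash characters, using std::quoted's defaults.
std::string quoteForLog(const std::string& value)
{
  std::ostringstream out;
  out << std::quoted(value);
  return out.str();
}

// Illustrative helper: reverses quoteForLog, removing the delimiters and
// the escapes that std::quoted added.
std::string unquoteFromLog(const std::string& quoted)
{
  std::istringstream in(quoted);
  std::string value;
  in >> std::quoted(value);
  return value;
}
```

With these, an ID such as `default` would be logged as `"default"`, and a value containing spaces or quotes survives `unquoteFromLog(quoteForLog(value))` unchanged, which is the "predictable / reversible" property mentioned in the comment.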
[jira] [Commented] (MESOS-2275) Document header include rules in style guide
[ https://issues.apache.org/jira/browse/MESOS-2275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964527#comment-14964527 ] Cody Maloney commented on MESOS-2275: - Out of curiosity, does this format match any of the formats available in clang-format --sort-includes? (http://reviews.llvm.org/D11240) > Document header include rules in style guide > > > Key: MESOS-2275 > URL: https://issues.apache.org/jira/browse/MESOS-2275 > Project: Mesos > Issue Type: Improvement >Reporter: Niklas Quarfot Nielsen >Assignee: Jan Schlicht >Priority: Trivial > Labels: beginner, docathon, mesosphere > > We have several ways of sorting, grouping and ordering headers includes in > Mesos. We should agree on a rule set and do a style scan. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3751) MESOS_NATIVE_JAVA_LIBRARY not set on MesosContainerizre tasks with --executor_environmnent_variables
Cody Maloney created MESOS-3751: --- Summary: MESOS_NATIVE_JAVA_LIBRARY not set on MesosContainerizre tasks with --executor_environmnent_variables Key: MESOS-3751 URL: https://issues.apache.org/jira/browse/MESOS-3751 Project: Mesos Issue Type: Bug Components: containerization Affects Versions: 0.25.0, 0.24.1 Reporter: Cody Maloney When using --executor_environment_variables, and having MESOS_NATIVE_JAVA_LIBRARY in the environment of mesos-slave, the mesos containerizer does not set MESOS_NATIVE_JAVA_LIBRARY itself. Relevant code: https://github.com/apache/mesos/blob/14f7967ef307f3d98e3a4b93d92d6b3a56399b20/src/slave/containerizer/containerizer.cpp#L281 It sees that the variable is in the mesos-slave's environment (os::getenv), rather than checking if it is set in the environment variable set. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
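The difference between the two checks described in this report can be modelled in a few lines. This is a simplified sketch under assumptions, not the code in `containerizer.cpp`: the environments are plain maps, and `buildAsReported` / `buildAsSuggested` are hypothetical names used only to contrast the behaviors.

```cpp
#include <cassert>
#include <map>
#include <string>

using Environment = std::map<std::string, std::string>;

const char* const kJavaLibrary = "MESOS_NATIVE_JAVA_LIBRARY";

// Models the reported behavior: the default is skipped whenever the
// *agent's* own environment has the variable, even though the executor
// only receives `executorEnv` and inherits nothing from the agent.
Environment buildAsReported(
    const Environment& agentEnv,
    const Environment& executorEnv,
    const std::string& defaultPath)
{
  Environment result = executorEnv;
  if (agentEnv.count(kJavaLibrary) == 0) {
    result[kJavaLibrary] = defaultPath;
  }
  return result;
}

// Models the suggested behavior: consult the environment that the
// executor will actually receive.
Environment buildAsSuggested(
    const Environment& executorEnv,
    const std::string& defaultPath)
{
  Environment result = executorEnv;
  if (result.count(kJavaLibrary) == 0) {
    result[kJavaLibrary] = defaultPath;
  }
  return result;
}
```

In the first variant, an agent that has the variable set but an explicit executor environment that lacks it ends up with no `MESOS_NATIVE_JAVA_LIBRARY` at all, which matches the symptom in the report.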
[jira] [Created] (MESOS-3740) LIBPROCESS_IP not passed to Docker containers
Cody Maloney created MESOS-3740: --- Summary: LIBPROCESS_IP not passed to Docker containers Key: MESOS-3740 URL: https://issues.apache.org/jira/browse/MESOS-3740 Project: Mesos Issue Type: Bug Components: containerization, docker Affects Versions: 0.25.0 Environment: Mesos 0.24.1 Reporter: Cody Maloney Docker containers aren't currently passed all the same environment variables that Mesos Containerizer tasks are. See: https://github.com/apache/mesos/blob/master/src/slave/containerizer/containerizer.cpp#L254 for all the environment variables explicitly set for mesos containers. While some of them don't necessarily make sense for docker containers, when the docker container has inside of it a libprocess process (a mesos framework scheduler) and is using {{--net=host}}, the task needs to have LIBPROCESS_IP set; otherwise the same sort of problems that happen because of MESOS-3553 can happen (libprocess will try to guess the machine's IP address, with likely bad results in a number of operating environments). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3177) Make Mesos own configuration of roles/weights
[ https://issues.apache.org/jira/browse/MESOS-3177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14804289#comment-14804289 ] Cody Maloney commented on MESOS-3177: - Currently the mesos master doesn't explicitly keep track of the roles it knows of, just the roles it was told about via the flag. Storing them in the replicated log would be my preferred place to put / persist them. If they are persisted in the replicated log and that is the authoritative source for them, I'd rather not have them be flags to the mesos master anymore, as after the first mesos master start those flags would be meaningless and lead to a potentially bad user experience (I set the flags on mesos master but they aren't applying!?!?!). There is a `mesos-log` command that already exists, and there has been some design discussion that initialization of the replicated log shouldn't be implicit in master startup (it can potentially lead to bad cluster/error cases for some node replacement scenarios). I would suggest only allowing adding roles in v1. Removing roles will require revoking offers, which sort of exists with the inverse offers that recently became available, but is going to be a lot of engineering. For other things you're going to need a Mesos Shepherd going forward for more design review, building out a proper design proposal, and getting things landed in time. > Make Mesos own configuration of roles/weights > - > > Key: MESOS-3177 > URL: https://issues.apache.org/jira/browse/MESOS-3177 > Project: Mesos > Issue Type: Improvement > Components: master, slave >Reporter: Cody Maloney >Assignee: Thomas Rampelberg > Labels: mesosphere > > All roles and weights must currently be specified up-front when starting > Mesos masters. In addition, they should be consistent on every > master, otherwise unexpected behavior could occur (You can have them be > inconsistent for some upgrade paths / changing the set). 
> This makes it hard to introduce new groups of machines under new roles > dynamically (Have to generate a new master configuration, deploy that, before > we can connect slaves with a new role to the cluster). > Ideally an administrator can manually add / remove / edit roles and have the > settings replicated / passed to all masters in the cluster by Mesos. > Effectively Mesos takes ownership of the setting, rather than requiring it to > be done externally. > In addition, if a new slave joins the cluster with an unexpected / new role > that should just work, making it much easier to introduce machines with new > roles. (Policy around whether or not a slave can cause creation of a new > role, a given slave can register with a given role, etc. is out of scope, and > would be controls in the general registration process). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3177) Make Mesos own configuration of roles/weights
[ https://issues.apache.org/jira/browse/MESOS-3177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14740398#comment-14740398 ] Cody Maloney commented on MESOS-3177: - There hasn't been any design documentation building / development so far. In my mind I've been thinking of it as "Before you start the mesos masters, you create the initial replicated log state which contains the first set of roles and weights to operate with". Then from that point on mesos has "add_role" and "remove_role" endpoints to manage them. Even better would be that if you don't have authentication turned on, as mesos sees new roles it just adds them (and as all things with that role disappear it removes them). If authentication is turned on, the authentication mechanism effectively "permanently" owns all the roles it defines (if it's just a static configuration file). If it's a dynamic source / database then the interface to talk about ownership would probably need to get more complicated. > Make Mesos own configuration of roles/weights > - > > Key: MESOS-3177 > URL: https://issues.apache.org/jira/browse/MESOS-3177 > Project: Mesos > Issue Type: Improvement > Components: master, slave >Reporter: Cody Maloney >Assignee: Thomas Rampelberg > Labels: mesosphere > > All roles and weights must currently be specified up-front when starting > Mesos masters. In addition, they should be consistent on every > master, otherwise unexpected behavior could occur (You can have them be > inconsistent for some upgrade paths / changing the set). > This makes it hard to introduce new groups of machines under new roles > dynamically (Have to generate a new master configuration, deploy that, before > we can connect slaves with a new role to the cluster). > Ideally an administrator can manually add / remove / edit roles and have the > settings replicated / passed to all masters in the cluster by Mesos. 
> Effectively Mesos takes ownership of the setting, rather than requiring it to > be done externally. > In addition, if a new slave joins the cluster with an unexpected / new role > that should just work, making it much easier to introduce machines with new > roles. (Policy around whether or not a slave can cause creation of a new > role, a given slave can register with a given role, etc. is out of scope, and > would be controls in the general registration process). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3417) Log source address replicated log recieved broadcasts
Cody Maloney created MESOS-3417: --- Summary: Log source address replicated log recieved broadcasts Key: MESOS-3417 URL: https://issues.apache.org/jira/browse/MESOS-3417 Project: Mesos Issue Type: Improvement Components: replicated log Affects Versions: 0.24.0, 0.23.0 Environment: Mesos 0.23 Reporter: Cody Maloney Assignee: Adam B Priority: Minor Currently Mesos doesn't log what machine a replicated log status broadcast was received from: {code} Sep 11 21:41:14 master-01 mesos-master[15625]: I0911 21:41:14.320164 15637 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request Sep 11 21:41:14 master-01 mesos-dns[15583]: I0911 21:41:14.321097 15583 detect.go:118] ignoring children-changed event, leader has not changed: /mesos Sep 11 21:41:14 master-01 mesos-master[15625]: I0911 21:41:14.353914 15639 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request Sep 11 21:41:14 master-01 mesos-master[15625]: I0911 21:41:14.479132 15639 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request {code} It would be really useful for debugging replicated log startup issues to have info about where the message came from (libprocess address, IP, or hostname). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2131) Add a reverse proxy endpoint to mesos
[ https://issues.apache.org/jira/browse/MESOS-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Maloney updated MESOS-2131: Assignee: (was: Cody Maloney) Add a reverse proxy endpoint to mesos - Key: MESOS-2131 URL: https://issues.apache.org/jira/browse/MESOS-2131 Project: Mesos Issue Type: Improvement Components: master, slave Reporter: Cody Maloney Priority: Minor Labels: mesosphere A new libprocess Process inside mesos which allows attaching/detaching known endpoints at a specific path. Ideally I want to be able to do things like attach 'slave-id' and pass HTTP requests on to that slave: Sample endpoint actions: C++ api: attach(std::string name, Node target): Add a new reverse proxy path detach(std::string name): Remove an established reverse proxy path HTTP endpoints: /proxy/go/{name} - Prefix matches a path, forwards the remaining path onto the remote endpoint /proxy/debug.json - Prints out all attached endpoints. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
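The attach/detach idea in this issue can be illustrated with a small lookup table. This is only a sketch under assumptions: `ReverseProxyTable` and `resolve` are hypothetical names, and the target is a plain `host:port` string here rather than the `Node` type named in the issue, so that the example stays self-contained. No HTTP forwarding is performed; the sketch only computes where a request would go.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical sketch of the attach/detach bookkeeping behind a
// /proxy/go/{name} endpoint.
class ReverseProxyTable
{
public:
  // Adds (or replaces) a reverse proxy path for `name`.
  void attach(const std::string& name, const std::string& target)
  {
    targets_[name] = target;
  }

  // Removes an established reverse proxy path.
  void detach(const std::string& name)
  {
    targets_.erase(name);
  }

  // Maps "/proxy/go/{name}/rest/of/path" to "{target}/rest/of/path".
  // Returns an empty string if the path is not a proxy path or the name
  // has not been attached.
  std::string resolve(const std::string& path) const
  {
    const std::string prefix = "/proxy/go/";
    if (path.compare(0, prefix.size(), prefix) != 0) {
      return "";
    }

    const std::string::size_type slash = path.find('/', prefix.size());
    const std::string name = (slash == std::string::npos)
      ? path.substr(prefix.size())
      : path.substr(prefix.size(), slash - prefix.size());

    const auto it = targets_.find(name);
    if (it == targets_.end()) {
      return "";
    }

    // Forward the remaining path; default to "/" when nothing remains.
    const std::string rest =
      (slash == std::string::npos) ? "/" : path.substr(slash);
    return it->second + rest;
  }

private:
  std::map<std::string, std::string> targets_;
};
```

A `/proxy/debug.json` style endpoint would then only need to iterate over the same table.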
[jira] [Updated] (MESOS-2130) Allow prefix routing of paths in libprocess
[ https://issues.apache.org/jira/browse/MESOS-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Maloney updated MESOS-2130: Assignee: (was: Cody Maloney) Allow prefix routing of paths in libprocess --- Key: MESOS-2130 URL: https://issues.apache.org/jira/browse/MESOS-2130 Project: Mesos Issue Type: Improvement Components: libprocess Reporter: Cody Maloney Labels: mesosphere Currently libprocess can only route to UPIDs, and then within the upids one top level command. Ideally you can attach C++ endpoints to arbitrary paths, including taking everything that matches a prefix: Ex: /slaves/:slave_id/ could proxy to an individual slave /slaves/ - Alias for /slave(1) if only one slave /slaves/{number} - point to an individual slave rather than requiring people to properly encode () in urls. /proxy/go/master-leader/files/browse.json - The endpoint would be /proxy/go, and then it internally processes the request to find the host it should go to (What is the IP for the currently elected master?) and then forwards the rest of the path to the target machine. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
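Prefix routing as requested here is commonly done by picking the longest registered prefix that matches the request path. The following is a minimal sketch of that technique, not libprocess code: `PrefixRouter` is a hypothetical class, and handlers are represented by plain names instead of callbacks to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical longest-prefix router. Prefixes should end in '/' so that
// "/slaves/" does not accidentally match "/slavesfoo".
class PrefixRouter
{
public:
  // Registers `handler` for every path that starts with `prefix`.
  void route(const std::string& prefix, const std::string& handler)
  {
    routes_[prefix] = handler;
  }

  // Returns the handler registered under the longest matching prefix,
  // or an empty string if no prefix matches.
  std::string match(const std::string& path) const
  {
    std::string best;
    std::size_t bestLength = 0;
    bool found = false;

    for (const auto& entry : routes_) {
      const std::string& prefix = entry.first;
      const bool matches = path.compare(0, prefix.size(), prefix) == 0;
      if (matches && (!found || prefix.size() > bestLength)) {
        best = entry.second;
        bestLength = prefix.size();
        found = true;
      }
    }
    return best;
  }

private:
  std::map<std::string, std::string> routes_;
};
```

A linear scan keeps the sketch simple; with many routes, a trie keyed on path segments would be the more usual choice.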
[jira] [Created] (MESOS-3177) Make Mesos own configuration of roles/weights
Cody Maloney created MESOS-3177: --- Summary: Make Mesos own configuration of roles/weights Key: MESOS-3177 URL: https://issues.apache.org/jira/browse/MESOS-3177 Project: Mesos Issue Type: Improvement Components: master, slave Reporter: Cody Maloney All roles and weights must currently be specified up-front when starting Mesos masters currently. In addition, they should be consistent on every master, otherwise unexpected behavior could occur (You can have them be inconsistent for some upgrade paths / changing the set). This makes it hard to introduce new groups of machines under new roles dynamically (Have to generate a new master configuration, deploy that, before we can connect slaves with a new role to the cluster). Ideally an administrator can manually add / remove / edit roles and have the settings replicated / passed to all masters in the cluster by Mesos. Effectively Mesos takes ownership of the setting, rather than requiring it to be done externally. In addition, if a new slave joins the cluster with an unexpected / new role that should just work, making it much easier to introduce machines with new roles. (Policy around whether or not a slave can cause creation of a new role, a given slave can register with a given role, etc. is out of scope, and would be controls in the general registration process). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2902) Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME
[ https://issues.apache.org/jira/browse/MESOS-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14627014#comment-14627014 ] Cody Maloney commented on MESOS-2902: - It is an argument against doing anything at runtime whenever possible. Unfortunately the IP is something we don't know outside the machine we shipped Mesos to, so we can't bake it in. We would if we could, but in most of the environments we're shipping to we have found that we can't. If I send a mesos package to a bunch of arbitrary hosts, they all have different IPs, even though all the other configuration parameters stay the same. Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME Key: MESOS-2902 URL: https://issues.apache.org/jira/browse/MESOS-2902 Project: Mesos Issue Type: Improvement Components: master, modules, slave Reporter: Cody Maloney Assignee: Marco Massenzio Priority: Critical Labels: mesosphere Currently Mesos tries to guess the IP, HOSTNAME by doing a reverse DNS lookup. This doesn't work on a lot of clouds as we want things like public IPs (which aren't the default DNS), there aren't FQDN names (Azure), or the correct way to figure it out is to call some cloud-specific endpoint. If Mesos / Libprocess could load a mesos-module (Or run a script) which is provided per-cloud, we can figure out perfectly the IP / Hostname for the given environment. It also means we can ship one identical set of files to all hosts in a given provider which doesn't happen to have the DNS scheme + hostnames that libprocess/Mesos expects. Currently we have to generate host-specific config files which Mesos uses to guess. The host-specific files break / fall apart if machines change IP / hostname without being reinstalled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
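If the IP does come from a per-cloud script, the consuming side mostly has to validate whatever the script printed. A minimal sketch of that step follows, assuming a POSIX system and IPv4 only: `parseDetectedIp` is a hypothetical helper, not a Mesos or libprocess function, and how the script is executed is deliberately left out.

```cpp
#include <arpa/inet.h>   // inet_pton (POSIX)
#include <netinet/in.h>  // in_addr
#include <sys/socket.h>  // AF_INET

#include <cassert>
#include <string>

// Hypothetical helper: takes the raw stdout of a detection script, trims
// surrounding whitespace, and returns the text only if it parses as an
// IPv4 address. Returns an empty string otherwise.
std::string parseDetectedIp(const std::string& output)
{
  const std::string whitespace = " \t\r\n";

  const std::string::size_type first = output.find_first_not_of(whitespace);
  if (first == std::string::npos) {
    return "";  // The script printed nothing usable.
  }

  const std::string::size_type last = output.find_last_not_of(whitespace);
  const std::string candidate = output.substr(first, last - first + 1);

  in_addr address;
  if (inet_pton(AF_INET, candidate.c_str(), &address) != 1) {
    return "";  // Not a well-formed dotted-quad address.
  }
  return candidate;
}
```

Rejecting malformed output at startup turns a bad script into an immediate, visible failure instead of a process that announces a wrong address.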
[jira] [Commented] (MESOS-2902) Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME
[ https://issues.apache.org/jira/browse/MESOS-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14627125#comment-14627125 ] Cody Maloney commented on MESOS-2902: - I've covered why not wrapper scripts several times in this thread already Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME Key: MESOS-2902 URL: https://issues.apache.org/jira/browse/MESOS-2902 Project: Mesos Issue Type: Improvement Components: master, modules, slave Reporter: Cody Maloney Assignee: Marco Massenzio Priority: Critical Labels: mesosphere Currently Mesos tries to guess the IP, HOSTNAME by doing a reverse DNS lookup. This doesn't work on a lot of clouds as we want things like public IPs (which aren't the default DNS), there aren't FQDN names (Azure), or the correct way to figure it out is to call some cloud-specific endpoint. If Mesos / Libprocess could load a mesos-module (Or run a script) which is provided per-cloud, we can figure out perfectly the IP / Hostname for the given environment. It also means we can ship one identical set of files to all hosts in a given provider which doesn't happen to have the DNS scheme + hostnames that libprocess/Mesos expects. Currently we have to generate host-specific config files which Mesos uses to guess. The host-specific files break / fall apart if machines change IP / hostname without being reinstalled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2902) Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME
[ https://issues.apache.org/jira/browse/MESOS-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626999#comment-14626999 ] Cody Maloney commented on MESOS-2902: - One thing as a follow up from the discussion this morning. Generally for shipping mesos lots of places in DCOS, we're trying to get everything to not happen on the host we're shipping to. Any code we execute on a host has a high probability of having some bugs and breaking in a lot of environments. As such, we bake everything off host, then when it gets to the host itself it's just a matter of reading static variables whenever possible. This is effectively pushing possible errors for us from runtime / machine startup time when we really have a hard time fixing them to configuration setup time on some remote machine. I can generate a config using some tools. Test it out locally, and know that the remote machine will behave the same since it will get the same bit for bit config. If it's some script, I have to predict what the script will do in the foreign environment (There are very few things we can rely on existing in the host. Pretty much just bash and curl / wget. Everything else is additional dependencies we pick up which make it harder to install DCOS). Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME Key: MESOS-2902 URL: https://issues.apache.org/jira/browse/MESOS-2902 Project: Mesos Issue Type: Improvement Components: master, modules, slave Reporter: Cody Maloney Assignee: Marco Massenzio Priority: Critical Labels: mesosphere Currently Mesos tries to guess the IP, HOSTNAME by doing a reverse DNS lookup. This doesn't work on a lot of clouds as we want things like public IPs (which aren't the default DNS), there aren't FQDN names (Azure), or the correct way to figure it out is to call some cloud-specific endpoint. 
If Mesos / Libprocess could load a mesos-module (Or run a script) which is provided per-cloud, we can figure out perfectly the IP / Hostname for the given environment. It also means we can ship one identical set of files to all hosts in a given provider which doesn't happen to have the DNS scheme + hostnames that libprocess/Mesos expects. Currently we have to generate host-specific config files which Mesos uses to guess. The host-specific files break / fall apart if machines change IP / hostname without being reinstalled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-2902) Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME
[ https://issues.apache.org/jira/browse/MESOS-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626999#comment-14626999 ] Cody Maloney edited comment on MESOS-2902 at 7/14/15 8:35 PM: -- One thing as a follow up from the discussion this morning. Generally for shipping mesos lots of places in DCOS, we're trying to get everything to not happen on the host we're shipping to. Any code we execute on a host has a high probability of having some bugs and breaking in a lot of environments. As such, we bake everything off host, then when it gets to the host itself it's just a matter of reading static variables whenever possible. Running a script on a hundred hosts that generates the same config file is much more likely to go wrong than running the script once somewhere I can validate the output, then shipping it to the hosts with integrity checking. This is effectively pushing possible errors for us from runtime / machine startup time, when we really have a hard time fixing them, to configuration setup time on some remote machine. I can generate a config using some tools, test it out locally, and know that the remote machine will behave the same since it will get the same bit for bit config. If it's some script, I have to predict what the script will do in the foreign environment (There are very few things we can rely on existing in the host. Pretty much just bash and curl / wget. Everything else is additional dependencies we pick up which make it harder to install DCOS). was (Author: cmaloney): One thing as a follow up from the discussion this morning. Generally for shipping mesos lots of places in DCOS, we're trying to get everything to not happen on the host we're shipping to. Any code we execute on a host has a high probability of having some bugs and breaking in a lot of environments. As such, we bake everything off host, then when it gets to the host itself it's just a matter of reading static variables whenever possible. 
This is effectively pushing possible errors for us from runtime / machine startup time when we really have a hard time fixing them to configuration setup time on some remote machine. I can generate a config using some tools. Test it out locally, and know that the remote machine will behave the same since it will get the same bit for bit config. If it's some script, I have to predict what the script will do in the foreign environment (There are very few things we can rely on existing in the host. Pretty much just bash and curl / wget. Everything else is additional dependencies we pick up which make it harder to install DCOS). Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME Key: MESOS-2902 URL: https://issues.apache.org/jira/browse/MESOS-2902 Project: Mesos Issue Type: Improvement Components: master, modules, slave Reporter: Cody Maloney Assignee: Marco Massenzio Priority: Critical Labels: mesosphere Currently Mesos tries to guess the IP, HOSTNAME by doing a reverse DNS lookup. This doesn't work on a lot of clouds as we want things like public IPs (which aren't the default DNS), there aren't FQDN names (Azure), or the correct way to figure it out is to call some cloud-specific endpoint. If Mesos / Libprocess could load a mesos-module (Or run a script) which is provided per-cloud, we can figure out perfectly the IP / Hostname for the given environment. It also means we can ship one identical set of files to all hosts in a given provider which doesn't happen to have the DNS scheme + hostnames that libprocess/Mesos expects. Currently we have to generate host-specific config files which Mesos uses to guess. The host-specific files break / fall apart if machines change IP / hostname without being reinstalled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2902) Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME
[ https://issues.apache.org/jira/browse/MESOS-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621226#comment-14621226 ] Cody Maloney commented on MESOS-2902: - [~bmahler] Mesos is much more particular and peculiar in its DNS / Hostname / IP requirements than a lot of datacenter software. nginx, httpd, etc. don't actually use the machine's hostname; they purely use whatever a request comes in as. They also don't publish anywhere saying 'This is me, come find me' based on the DNS address of the local machine. They get a request in, they inspect what IP address / port that request came in on, and in the case of nginx / apache possibly what the {{Host}} HTTP header is, and deal with it from there. In the case of Mesos, for the Masters for instance, if a master and framework disagree on the master IP, you just end up with lost packets with no logging currently. The HTTP API should help in this area, but we need to ship Mesos today / can't wait for that to come. We only use cloud-init in some environments. And it only has coreos public / private IPv4. There are environments we install using the myriad of other host install / setup tools (chef, salt, fleet, ...). There are a lot of ways we ship this stuff to clients. Adding one simple flag doesn't considerably add to the Mesos maintenance burden, and solves our use case at the moment. If adding a flag is unpalatable, it could be added as a mesos 'hook' module which does exactly the same thing, just makes the IP lookup pluggable. That would make it so someone could write a mesos module which does NetworkManager if they wished (Although there will still be the problem that a Mesos slave can't handle its IP address changing). This isn't teaching mesos configuration management at all. 
It is trying to get it out of the policy of self-configuring badly in a lot of our customer environments, which leads to lots of headaches for various customers we are trying to ship Mesos as a component of DCOS to. The maintenance burden for this is no more than the `--ip` flag that Mesos has currently, which is the exact same as setting LIBPROCESS_IP. I believe it does not significantly affect organizations which do not need / wish to use the flag, and if they don't give it, it will not change the behavior of their setups. Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME Key: MESOS-2902 URL: https://issues.apache.org/jira/browse/MESOS-2902 Project: Mesos Issue Type: Improvement Components: master, modules, slave Reporter: Cody Maloney Assignee: Marco Massenzio Priority: Critical Labels: mesosphere Currently Mesos tries to guess the IP, HOSTNAME by doing a reverse DNS lookup. This doesn't work on a lot of clouds as we want things like public IPs (which aren't the default DNS), there aren't FQDN names (Azure), or the correct way to figure it out is to call some cloud-specific endpoint. If Mesos / Libprocess could load a mesos-module (Or run a script) which is provided per-cloud, we can figure out perfectly the IP / Hostname for the given environment. It also means we can ship one identical set of files to all hosts in a given provider which doesn't happen to have the DNS scheme + hostnames that libprocess/Mesos expects. Currently we have to generate host-specific config files which Mesos uses to guess. The host-specific files break / fall apart if machines change IP / hostname without being reinstalled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2902) Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME
[ https://issues.apache.org/jira/browse/MESOS-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619815#comment-14619815 ] Cody Maloney commented on MESOS-2902: - I'd much rather have it output an IP than a hostname. Some of the cases we've run into where a hostname doesn't work: Multiple NICs per box (Each of which can have 1+ DNS address), clusters where boxes don't have resolvable hostnames, and clusters which have no DNS whatsoever. If it's 'run a script which returns an IP' I can fairly reliably create cluster/environment-specific variants which get the right IP address. Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME Key: MESOS-2902 URL: https://issues.apache.org/jira/browse/MESOS-2902 Project: Mesos Issue Type: Improvement Components: master, modules, slave Reporter: Cody Maloney Assignee: Marco Massenzio Priority: Critical Labels: mesosphere Currently Mesos tries to guess the IP, HOSTNAME by doing a reverse DNS lookup. This doesn't work on a lot of clouds as we want things like public IPs (which aren't the default DNS), there aren't FQDN names (Azure), or the correct way to figure it out is to call some cloud-specific endpoint. If Mesos / Libprocess could load a mesos-module (Or run a script) which is provided per-cloud, we can figure out perfectly the IP / Hostname for the given environment. It also means we can ship one identical set of files to all hosts in a given provider which doesn't happen to have the DNS scheme + hostnames that libprocess/Mesos expects. Currently we have to generate host-specific config files which Mesos uses to guess. The host-specific files break / fall apart if machines change IP / hostname without being reinstalled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
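As a rough illustration of the 'run a script which returns an IP' idea in the comment above, a per-environment variant might look something like the following Python sketch. The function name `choose_ip`, the candidate list, and the cluster network value are all hypothetical; nothing here is an existing Mesos interface. It only shows how such a script could skip loopback addresses and pick the one address that sits on the cluster's network when a box has several NICs.

```python
import ipaddress
from typing import Iterable, Optional


def choose_ip(candidates: Iterable[str], cluster_network: str) -> Optional[str]:
    """Return the first candidate that is a usable address on cluster_network.

    `candidates` is whatever the environment-specific part of the script
    collected (one entry per NIC address); how they are gathered is left out
    on purpose, since that is the per-cloud piece.
    """
    network = ipaddress.ip_network(cluster_network)
    for candidate in candidates:
        try:
            addr = ipaddress.ip_address(candidate.strip())
        except ValueError:
            # Not an IP at all (e.g. an unresolvable hostname): skip it.
            continue
        if addr.is_loopback:
            # 127.0.0.1 is never the address other hosts should use.
            continue
        if addr in network:
            return str(addr)
    return None


# Hypothetical box with a loopback, a management NIC, and a cluster NIC.
print(choose_ip(["127.0.0.1", "192.168.1.7", "10.0.3.12"], "10.0.0.0/8"))
```

A cloud-specific variant would only swap how the candidates are gathered (for example by asking a provider endpoint) and keep printing a single IP on stdout.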
[jira] [Comment Edited] (MESOS-2902) Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME
[ https://issues.apache.org/jira/browse/MESOS-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619815#comment-14619815 ] Cody Maloney edited comment on MESOS-2902 at 7/9/15 3:15 AM: - I'd much rather have it output an IP than a hostname. Some of the cases we've run into where a hostname doesn't work: Multiple NICs per box (Each of which can have 1+ IP Address, and an arbitrary grouping can have actual DNS), clusters where boxes don't have resolvable hostnames, and clusters which have no DNS whatsoever. If it's 'run a script which returns an IP' I can fairly reliably create cluster/environment-specific variants which get the right IP address. was (Author: cmaloney): I'd much rather have it output a IP than hostname. Some of the cases we've run into where a hostname doesn't work: Multiple NICs per box (Each of which can have 1+ DNS address), clusters where boxes don't have resolvable hostnames, and clusters which have no DNS whatsoever. If it's 'run a script which returns an IP' I can fairly reliably create cluster/environment-specific variants which get the right IP address. Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME Key: MESOS-2902 URL: https://issues.apache.org/jira/browse/MESOS-2902 Project: Mesos Issue Type: Improvement Components: master, modules, slave Reporter: Cody Maloney Assignee: Marco Massenzio Priority: Critical Labels: mesosphere Currently Mesos tries to guess the IP, HOSTNAME by doing a reverse DNS lookup. This doesn't work on a lot of clouds as we want things like public IPs (which aren't the default DNS), there aren't FQDN names (Azure), or the correct way to figure it out is to call some cloud-specific endpoint. If Mesos / Libprocess could load a mesos-module (Or run a script) which is provided per-cloud, we can figure out perfectly the IP / Hostname for the given environment. 
It also means we can ship one identical set of files to all hosts in a given provider which doesn't happen to have the DNS scheme + hostnames that libprocess/Mesos expects. Currently we have to generate host-specific config files which Mesos uses to guess. The host-specific files break / fall apart if machines change IP / hostname without being reinstalled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2902) Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME
[ https://issues.apache.org/jira/browse/MESOS-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619890#comment-14619890 ] Cody Maloney commented on MESOS-2902: - Probably we should sync in person tomorrow and summarize on here. We could potentially say You have to run a script on every host which sets LIBPROCESS_IP (Or MESOS_IP which turns into the --ip flag and therefore LIBPROCESS_IP). It adds complexity in the form of extra dependencies, and makes the cluster install + running Mesos not very self-contained. What I like about having Mesos run a script, is we are able to ship that script inside the DCOS internal host packaging system to hosts, manage and update it appropriately inside of DCOS. Anything which doesn't live in there we can't touch, update, etc. during upgrades. It's also important to note this affects us both for launching Mesos, as well as launching DCOS System frameworks (Marathon tries to do the same hostname - ip logic inside libprocess and it goes just as badly in a lot of our use cases). Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME Key: MESOS-2902 URL: https://issues.apache.org/jira/browse/MESOS-2902 Project: Mesos Issue Type: Improvement Components: master, modules, slave Reporter: Cody Maloney Assignee: Marco Massenzio Priority: Critical Labels: mesosphere Currently Mesos tries to guess the IP, HOSTNAME by doing a reverse DNS lookup. This doesn't work on a lot of clouds as we want things like public IPs (which aren't the default DNS), there aren't FQDN names (Azure), or the correct way to figure it out is to call some cloud-specific endpoint. If Mesos / Libprocess could load a mesos-module (Or run a script) which is provided per-cloud, we can figure out perfectly the IP / Hostname for the given environment. 
It also means we can ship one identical set of files to all hosts in a given provider which doesn't happen to have the DNS scheme + hostnames that libprocess/Mesos expects. Currently we have to generate host-specific config files which Mesos uses to guess. The host-specific files break / fall apart if machines change IP / hostname without being reinstalled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2902) Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME
[ https://issues.apache.org/jira/browse/MESOS-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619857#comment-14619857 ] Cody Maloney commented on MESOS-2902: - In DCOS we do all Mesos config via environment variables (Allows better mixing and matching in various environments). We ship the same mesos-master systemd unit to every cluster, and then we change the configuration by swapping out environment variable files (See Systemd's {{EnvironmentFile}} directive). Inside an {{EnvironmentFile}} we can't run arbitrary scripts. It is structurally infeasible to change the mesos-master systemd unit per cluster to include the 'Set the IP by running this script' only in cases where we want to do that. There may also be cases where Mesos exits and we restart it, and it would refuse to start because it has a different IP (mesos slave might checkpoint it, although I'd have to double check). The IP to use is a per-host thing, so I can't ship a generic config file to every host in the cluster which just sets {{LIBPROCESS_IP}} in an {{EnvironmentFile}}. Writing a wrapper script which sets {{LIBPROCESS_IP}} and then does an {{exec mesos-master}} is feasible, although it obfuscates what is happening, and if someone we ship DCOS to has been hand-editing the script for their environment and gets the environment variable a little bit wrong, things will error really badly (We've had a number of customers with mesos figuring out that the host's IP is 127.0.0.1). As far as the hostname stuff: In general we need Mesos not to do anything with hostnames in a number of our environments because they are unreliable, esp. as a means for figuring out 'what address should I talk on'. 
Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME Key: MESOS-2902 URL: https://issues.apache.org/jira/browse/MESOS-2902 Project: Mesos Issue Type: Improvement Components: master, modules, slave Reporter: Cody Maloney Assignee: Marco Massenzio Priority: Critical Labels: mesosphere Currently Mesos tries to guess the IP, HOSTNAME by doing a reverse DNS lookup. This doesn't work on a lot of clouds as we want things like public IPs (which aren't the default DNS), there aren't FQDN names (Azure), or the correct way to figure it out is to call some cloud-specific endpoint. If Mesos / Libprocess could load a mesos-module (Or run a script) which is provided per-cloud, we can figure out perfectly the IP / Hostname for the given environment. It also means we can ship one identical set of files to all hosts in a given provider which doesn't happen to have the DNS scheme + hostnames that libprocess/Mesos expects. Currently we have to generate host-specific config files which Mesos uses to guess. The host-specific files break / fall apart if machines change IP / hostname without being reinstalled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
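For what it's worth, the failure mode described in the comment above (a hand-edited wrapper ending up with 127.0.0.1) can at least be caught before Mesos starts. Below is a minimal Python sketch, assuming a hypothetical helper named `libprocess_env`; only the {{LIBPROCESS_IP}} variable name comes from the thread, everything else is made up for illustration.

```python
import ipaddress
from typing import Dict, Mapping


def libprocess_env(ip: str, base_env: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of base_env with LIBPROCESS_IP set to a validated address.

    Raises ValueError if `ip` is not an IP address, or if it is a loopback /
    unspecified address that other hosts could never reach.
    """
    addr = ipaddress.ip_address(ip.strip())  # ValueError on garbage input
    if addr.is_loopback or addr.is_unspecified:
        raise ValueError(f"refusing to advertise unusable address {addr}")
    env = dict(base_env)
    env["LIBPROCESS_IP"] = str(addr)
    return env


# Hypothetical detected value, with the trailing newline a script would emit.
env = libprocess_env("10.0.0.5\n", {"MESOS_WORK_DIR": "/var/lib/mesos"})
print(env["LIBPROCESS_IP"])
```

A wrapper would then hand the result to something like `os.execvpe("mesos-master", ["mesos-master"], env)`. This still has the drawbacks of a wrapper listed above; it just fails loudly instead of silently.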
[jira] [Commented] (MESOS-2132) Allow sending http::Request objects
[ https://issues.apache.org/jira/browse/MESOS-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14613989#comment-14613989 ] Cody Maloney commented on MESOS-2132: - Currently mesos http handlers receive an HTTP Request object. If you just want to forward the request with minimal changes (just the path), as the proxy process I was working on does, you need to copy every field out of the structure and pass the members as slightly differently formatted arguments to the http::post, get functions. Making it so those functions can just take an http request object makes it easier to forward requests, and also cleans up the http get/post API so that rather than a long string of optional parameters, there are just fields omitted from being set on a struct. Allow sending http::Request objects --- Key: MESOS-2132 URL: https://issues.apache.org/jira/browse/MESOS-2132 Project: Mesos Issue Type: Improvement Components: libprocess Reporter: Cody Maloney Assignee: Cody Maloney Priority: Minor Labels: mesosphere Currently you can only send a collection of fields which more or less matches those in an http::Request object. http::Request objects are used when calling http handlers in libprocess. The motivation for being able to send these is that we can then forward a request that is received. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
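To make the API-shape argument above a bit more concrete, here is a small Python sketch of the idea. It is not libprocess code (libprocess is C++), and the `Request` fields and the `forward` helper are hypothetical names chosen for the example. The point is only that when the send side accepts the same request object the handler received, forwarding with a changed path is a one-field copy rather than unpacking every member into separate arguments.

```python
from dataclasses import dataclass, field, replace
from typing import Dict


@dataclass(frozen=True)
class Request:
    """Hypothetical stand-in for an HTTP request object."""
    method: str
    path: str
    headers: Dict[str, str] = field(default_factory=dict)
    query: Dict[str, str] = field(default_factory=dict)
    body: str = ""


def forward(request: Request, new_path: str) -> Request:
    # Every field the caller did not mention is carried over untouched,
    # which is exactly what a proxy wants.
    return replace(request, path=new_path)


incoming = Request("POST", "/proxy/slave1/state.json",
                   headers={"Accept": "application/json"}, body="{}")
outgoing = forward(incoming, "/state.json")
print(outgoing.method, outgoing.path)
```

Fields that are simply left at their defaults play the role of the "long string of optional parameters" in a positional-argument API.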
[jira] [Commented] (MESOS-1865) Redirect to the leader master when current master is not a leader
[ https://issues.apache.org/jira/browse/MESOS-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611005#comment-14611005 ] Cody Maloney commented on MESOS-1865: - Following a redirect is entirely a client's choice. Practically in HTTP there isn't a better alternative I know of that keeps simple / dumb clients working well. Right now a number of dumb client programs which want to pull master/state.json manually ask a master which master is leading, then go to that one directly and hope there isn't a race around it. Practically, for systems which care to only monitor the exact master they are talking to, most HTTP libraries I have seen let you disable automatic redirect following. Currently these APIs sometimes returning incorrect / invalid / stale data has caused problems for things like proxy config generation scripts (They get the wrong master at just the wrong point in time and generate an empty config, leading to badness) Redirect to the leader master when current master is not a leader - Key: MESOS-1865 URL: https://issues.apache.org/jira/browse/MESOS-1865 Project: Mesos Issue Type: Bug Components: json api Affects Versions: 0.20.1 Reporter: Steven Schlansker Assignee: haosdent Some of the API endpoints, for example /master/tasks.json, will return bogus information if you query a non-leading master: {code} [steven@Anesthetize:~]% curl http://master1.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10 { tasks: [] } [steven@Anesthetize:~]% curl http://master2.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . | head -n 10 { tasks: [] } [steven@Anesthetize:~]% curl http://master3.mesos-vpcqa.otenv.com:5050/master/tasks.json | jq . 
| head -n 10 { tasks: [ { executor_id: , framework_id: 20140724-231003-419644938-5050-1707-, id: pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db, name: pp.guestcenterwebhealthmonitor.606cd6ee-4b50-11e4-825b-5212e05f35db, resources: { cpus: 0.25, disk: 0, {code} This is very hard for end-users to work around. For example if I query which master is leading followed by leader: which tasks are running it is possible that the leader fails over in between, leaving me with an incorrect answer and no way to know that this happened. In my opinion the API should return the correct response (by asking the current leader?) or an error (500 Not the leader?) but it's unacceptable to return a successful wrong answer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
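A small Python sketch of what the client side could look like if non-leading masters answered with a redirect, as this issue proposes. This is an assumption about the proposed behavior, not a description of what any Mesos version does; the host names are made up, and `fetch` is an injected function returning `(status, headers, body)` so the logic can be shown without committing to a particular HTTP library.

```python
from typing import Callable, Dict, Tuple
from urllib.parse import urljoin

Response = Tuple[int, Dict[str, str], str]  # (status, headers, body)
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def fetch_following_redirects(fetch: Callable[[str], Response], url: str,
                              max_redirects: int = 3) -> Tuple[str, int, str]:
    """Follow redirects by hand and report which URL finally answered.

    Returning the final URL lets a caller see which master the data came
    from, instead of asking "who leads?" and then querying in a second step.
    """
    for _ in range(max_redirects + 1):
        status, headers, body = fetch(url)
        if status not in REDIRECT_STATUSES:
            return url, status, body
        location = headers.get("Location")
        if not location:
            raise RuntimeError(f"redirect from {url} without a Location header")
        url = urljoin(url, location)  # handles relative Location values
    raise RuntimeError("too many redirects; leadership may be flapping")


# Fake two-master cluster: master1 is not leading and points at master3.
pages = {
    "http://master1.example:5050/master/tasks.json":
        (307, {"Location": "http://master3.example:5050/master/tasks.json"}, ""),
    "http://master3.example:5050/master/tasks.json":
        (200, {}, '{"tasks": []}'),
}
print(fetch_following_redirects(
    pages.__getitem__, "http://master1.example:5050/master/tasks.json"))
```

A monitoring client that only cares about the exact master it contacted would simply not call a helper like this and treat the redirect status as its answer.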
[jira] [Commented] (MESOS-2153) Add support for systemd journal for logging
[ https://issues.apache.org/jira/browse/MESOS-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611326#comment-14611326 ] Cody Maloney commented on MESOS-2153: - This should also include individual task stdout/stderr, syslog messages being logged to the systemd journal (although those are more bits of this as an epic). Right now for long-running tasks, the stdout and stderr just grow forever. The systemd journal makes it so the stdout/stderr can be capped in size, and administrative policies can be set per app if desired. Add support for systemd journal for logging --- Key: MESOS-2153 URL: https://issues.apache.org/jira/browse/MESOS-2153 Project: Mesos Issue Type: Improvement Components: master, slave Reporter: Alexander Rukletsov Priority: Minor We should be able to redirect master and slave logs to systemd journal on the systems where it's available. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-898) Introduce CMake as an alternative build system.
[ https://issues.apache.org/jira/browse/MESOS-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600263#comment-14600263 ] Cody Maloney commented on MESOS-898: I would suggest that with the move to CMake we switch to using a raw upstream packaged version of boost. There isn't a lot we gain by stripping out some of the headers, and it adds a lot more complexity. CMake has a lot of stuff ready-made for finding and downloading boost if and only if it isn't present on the host machine, isn't of the right version, etc. Forcing rebuilding all of that logic/code so that we can remove some files in a tarball which shouldn't be embedded inside the repository anyways does not seem like the best idea. Introduce CMake as an alternative build system. --- Key: MESOS-898 URL: https://issues.apache.org/jira/browse/MESOS-898 Project: Mesos Issue Type: Epic Components: build Reporter: Timothy St. Clair Assignee: Alex Clemmer Labels: build This is a rather substantial undertaking, so I would want upstream debate+buy-in prior to full commitment. The basic premise is: upstream rebundles several of its dependencies in part to tightly control its stack. This is not out of the norm, but in order to be picked up by distribution channels it needs to be built against system dependencies, and rebundling is strictly forbidden. Given that the primary mesos target platforms are data-center distributions such as RHEL/CENTOS/SL it makes sense to still have bundling support for those who do not have dependencies in their channels yet. This is where cmake can be a win with its uber macros (http://www.cmake.org/cmake/help/v2.8.8/cmake.html#module:ExternalProject). I do not know of any equivalent in the autotools world, other than to brew your own solution. I've done this type of work in the past, and completely transformed condor and would leverage a lot of the work that was done there. 
I currently have a tracking branch where I've started this work, but before I go off into the woods, it makes sense to have a debate in public. The primary benefits are: 1. Enable downstream channels to easily distro without carrying a large patch sets. 2. Still support existing non-proper distribution methods. 3. Harden / future proof dependent interfaces. Side Benefits: Audit current build mechanics. - Presently the language specific binding are not installed. (.py .jar) - make -jX currently fails - optionally look in arm support. Costs: 1. Time 2. Potential temporary destabilization 3. Infrastructure around build+test may need to change. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2129) Enable managing mesos without having to be able to connect to each slave
[ https://issues.apache.org/jira/browse/MESOS-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Maloney updated MESOS-2129: Assignee: (was: Cody Maloney) Enable managing mesos without having to be able to connect to each slave Key: MESOS-2129 URL: https://issues.apache.org/jira/browse/MESOS-2129 Project: Mesos Issue Type: Epic Reporter: Cody Maloney Labels: mesosphere Ideally we want to use the full mesos WebUI from an office, which is firewalled off from the vast majority of hosts in the datacenter (mesos slaves). It also becomes burdensome to manage a precise firewall for additional hosts, since every time a slave comes/goes if we don't want to allow blanket access to the slave port, we have to add / remove firewall rules -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2902) Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME
Cody Maloney created MESOS-2902: --- Summary: Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME Key: MESOS-2902 URL: https://issues.apache.org/jira/browse/MESOS-2902 Project: Mesos Issue Type: Improvement Components: master, modules, slave Reporter: Cody Maloney Priority: Minor Currently Mesos tries to guess the IP, HOSTNAME by doing a reverse DNS lookup. This doesn't work on a lot of clouds as we want things like public IPs (which aren't the default DNS), there aren't FQDN names (Azure), or the correct way to figure it out is to call some cloud-specific endpoint. If Mesos / Libprocess could load a mesos-module (Or run a script) which is provided per-cloud, we can figure out perfectly the IP / Hostname for the given environment. It also means we can ship one identical set of files to all hosts in a given provider which doesn't happen to have the DNS scheme + hostnames that libprocess/Mesos expects. Currently we have to generate host-specific config files which Mesos uses to guess. The host-specific files break / fall apart if machines change IP / hostname without being reinstalled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-2902) Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME
[ https://issues.apache.org/jira/browse/MESOS-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594068#comment-14594068 ] Cody Maloney edited comment on MESOS-2902 at 6/19/15 10:58 PM: --- I can't drop it in a systemd unit file which runs a command before mesos and pass the data without making a temp file, which is an odd way to do the config generation. I could make a new mesos-init-fetch-ip script which I run instead of mesos, and that script then execs mesos. This confuses init system tracking of processes somewhat, and obfuscates what the underlying commands being run are. It also adds a lot of error scenarios. For example: the wrapper script is updated and the change contains a typo (so it sets LIBPROCES_IP instead of LIBPROCESS_IP). Libprocess silently ignores the wrong environment variable. In the environment I'm in, Libprocess' internal logic guesses an IP that works. The mistake becomes ingrained as it rolls out across the cluster. Currently one of the biggest pain points in initially setting up a Mesos cluster is getting the right IPs + Hostnames setup. If Mesos Master and Mesos Slave had a required flag, {{--ip-detection=reverse_dns}} or {{--ip-detection=/usr/bin/detect_mesos_ip}}, it would make it so that users see what mesos is doing and make an informed decision, rather than running Mesos and having things break with really bad error messages (Wrong hostname/IP on your Scheduler? No logging of things breaking happens...). As far as generalizing it further. Note I'm saying IP, HOSTNAME are host-specific, which is why this sort of capability makes sense. It is impossible for me to know when I'm installing static config files to a Host, VM, Docker what the IP and Hostname are going to be. That is not the case for {{--resources}}, {{--quiet}} and the like. They are able to be pre-determined for a host. 
IP and Hostname are Runtime parameters of a machine (When you attach your machine to a network, they are assigned dynamically). was (Author: cmaloney): I can't drop it in a systemd unit file which runs a command before mesos and pass the data without making a temp file which is an odd way to do the config generation. I could make a new mesos-init-fetch-ip script which I run instead of mesos, and that script then execs mesos. This confuses init system tracking of processes somewhat, and obfuscates what the underlying commands being run are. It also adds a lot of error scenarios. For example, the wrapper script is updated and the change contains a typo, so it sets LIBPROCES_IP instead of LIBPROCESS_IP), Libprocess silently ignores the wrong environment variable. The environment I'm in Libprocess' internal logic guesses an IP that works. It gets engrained slightly incorrect as it rolls out across the cluster. Currently one of the biggest pain points in initially setting up a Mesos cluster is getting the right IPs + Hostnames setup. If Mesos Master and Mesos Slave had a flag which was required, {{ \-\-ip\-detection=reverse_dns}} or {{--ip-detection=,/usr/bin/detect_mesos_ip} }}. It would make it so that users see what mesos is doing and make an informed decision, rather than running Mesos, having things break with really bad error messages (Wrong hostname/IP on your Scheduler? No logging of things breaking happens...). As far as generalizing it further. Note I'm saying IP, HOSTNAME are host-specific, which is why this sort of capability makes sense. It is impossible for me to know when I'm installing static config files to a Host, VM, Docker what the IP and Hostname are going to be. That is not the case for {{\-\-resources}}, {{\-quiet}} and the like. They are able to be pre-determined for a host. IP and Hostname are Runtime parameters of a machine (When you attach your machine to a network, they are assigned dynamically). 
Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME Key: MESOS-2902 URL: https://issues.apache.org/jira/browse/MESOS-2902 Project: Mesos Issue Type: Improvement Components: master, modules, slave Reporter: Cody Maloney Priority: Minor Labels: mesosphere Currently Mesos tries to guess the IP, HOSTNAME by doing a reverse DNS lookup. This doesn't work on a lot of clouds as we want things like public IPs (which aren't the default DNS), there aren't FQDN names (Azure), or the correct way to figure it out is to call some cloud-specific endpoint. If Mesos / Libprocess could load a mesos-module (Or run a script) which is provided per-cloud, we can figure out perfectly the IP / Hostname for the given environment. It also means we can ship one identical set of files to all hosts in a given provider which
[jira] [Commented] (MESOS-2902) Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME
[ https://issues.apache.org/jira/browse/MESOS-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594068#comment-14594068 ] Cody Maloney commented on MESOS-2902: - I can't drop it in a systemd unit file which runs a command before mesos and pass the data without making a temp file, which is an odd way to do the config generation. I could make a new mesos-init-fetch-ip script which I run instead of mesos, and that script then execs mesos. This confuses init system tracking of processes somewhat, and obfuscates what the underlying commands being run are. It also adds a lot of error scenarios. For example, the wrapper script is updated and the change contains a typo, so it sets LIBPROCES_IP instead of LIBPROCESS_IP. Libprocess silently ignores the wrong environment variable. In the environment I'm in, Libprocess' internal logic guesses an IP that works. The slightly incorrect configuration gets ingrained as it rolls out across the cluster. Currently one of the biggest pain points in initially setting up a Mesos cluster is getting the right IPs + Hostnames set up. If Mesos Master and Mesos Slave had a required flag, {{\-\-ip\-detection=reverse_dns}} or {{--ip-detection=/usr/bin/detect_mesos_ip}}, users would see what mesos is doing and make an informed decision, rather than running Mesos and having things break with really bad error messages (Wrong hostname/IP on your Scheduler? No logging of things breaking happens...). As far as generalizing it further: note I'm saying IP, HOSTNAME are host-specific, which is why this sort of capability makes sense. It is impossible for me to know when I'm installing static config files to a Host, VM, Docker what the IP and Hostname are going to be. That is not the case for {{\-\-resources}}, {{\-quiet}} and the like. They are able to be pre-determined for a host. 
IP and Hostname are Runtime parameters of a machine (When you attach your machine to a network, they are assigned dynamically). Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME Key: MESOS-2902 URL: https://issues.apache.org/jira/browse/MESOS-2902 Project: Mesos Issue Type: Improvement Components: master, modules, slave Reporter: Cody Maloney Priority: Minor Labels: mesosphere Currently Mesos tries to guess the IP, HOSTNAME by doing a reverse DNS lookup. This doesn't work on a lot of clouds as we want things like public IPs (which aren't the default DNS), there aren't FQDN names (Azure), or the correct way to figure it out is to call some cloud-specific endpoint. If Mesos / Libprocess could load a mesos-module (Or run a script) which is provided per-cloud, we can figure out perfectly the IP / Hostname for the given environment. It also means we can ship one identical set of files to all hosts in a given provider which doesn't happen to have the DNS scheme + hostnames that libprocess/Mesos expects. Currently we have to generate host-specific config files which Mesos uses to guess. The host-specific files break / fall apart if machines change IP / hostname without being reinstalled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
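The script variant of the proposed {{--ip-detection}} flag could behave roughly as sketched below. This is only an illustration under stated assumptions: the helper name `detectIp`, the "first line of stdout is an IPv4 address" contract, and the decision to fail loudly on anything else are all hypothetical and not part of Mesos or libprocess.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#include <cstdio>
#include <string>

// Hypothetical helper (not a Mesos API): runs `command` through the shell,
// takes the first line it prints, and accepts it only if it parses as an
// IPv4 address. Returns false instead of silently guessing a fallback.
bool detectIp(const std::string& command, std::string* ip)
{
  FILE* pipe = ::popen(command.c_str(), "r");
  if (pipe == nullptr) {
    return false;
  }

  // Drain all output so the child is not killed by a broken pipe, but only
  // keep the first line.
  std::string firstLine;
  bool haveFirstLine = false;
  char buffer[256];
  while (::fgets(buffer, sizeof(buffer), pipe) != nullptr) {
    if (!haveFirstLine) {
      firstLine = buffer;
      haveFirstLine = true;
    }
  }

  // A non-zero exit status means the detection script itself failed.
  if (::pclose(pipe) != 0) {
    return false;
  }

  // Trim the trailing newline and whitespace.
  const std::size_t end = firstLine.find_last_not_of(" \t\r\n");
  if (end == std::string::npos) {
    return false;
  }
  firstLine.erase(end + 1);

  in_addr address;
  if (::inet_pton(AF_INET, firstLine.c_str(), &address) != 1) {
    return false;
  }

  *ip = firstLine;
  return true;
}
```

The point of the sketch is the failure mode: a broken or mistyped detection script produces an explicit error at startup rather than an IP that was guessed and merely happens to work.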
[jira] [Commented] (MESOS-2832) Enable configuring Mesos with environment variables without having them leak to tasks launched
[ https://issues.apache.org/jira/browse/MESOS-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590499#comment-14590499 ] Cody Maloney commented on MESOS-2832: - For DCOS at least we don't want to just strip some out. We want to replace the entire environment with one statically specified. The reason for this is we explicitly want to make it hard to depend on special DCOS-internal components that mesos-slave has in its PATH, LD_LIBRARY_PATH but which DCOS Services should not. Removing variables by magic pattern matching seems more complicated to implement than: load the exact set of environment variables to use from this map, then explicitly add in the Mesos API provided ones, such as MESOS_SANDBOX, etc. Enable configuring Mesos with environment variables without having them leak to tasks launched -- Key: MESOS-2832 URL: https://issues.apache.org/jira/browse/MESOS-2832 Project: Mesos Issue Type: Wish Reporter: Cody Maloney Assignee: Benjamin Hindman Priority: Critical Labels: mesosphere Currently if mesos is configured with environment variables (MESOS_MODULES), those show up in every task which is launched unless the executor explicitly cleans them up. If the task being launched happens to be something libprocess / mesos based, this can often prevent the task from starting up (A scheduler has issues loading a module intended for the slave). There are also cases where it would be nice to be able to change what the PATH is that tasks launch with (the host may have more in the path than tasks are supposed to / allowed to depend upon). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
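The "exact map plus Mesos-provided variables" approach from the comment above can be sketched in a few lines. The names `Environment` and `buildTaskEnvironment` are hypothetical and do not come from the Mesos code base; the only point illustrated is that the agent's own environment is never consulted.

```cpp
#include <map>
#include <string>

typedef std::map<std::string, std::string> Environment;

// Hypothetical helper: builds a task environment from an operator-supplied
// static map, then layers the variables Mesos itself provides (for example
// MESOS_SANDBOX) on top. Nothing is inherited from the agent process, so
// variables such as MESOS_MODULES cannot leak into tasks.
Environment buildTaskEnvironment(
    const Environment& staticEnvironment,
    const Environment& mesosProvided)
{
  Environment result = staticEnvironment;

  for (const auto& entry : mesosProvided) {
    // Mesos-provided values win over the static map on a name clash.
    result[entry.first] = entry.second;
  }

  return result;
}
```

Compared with pattern-based stripping, there is no list of "dangerous" variable names to maintain: anything not named in one of the two maps simply never reaches the task.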
[jira] [Created] (MESOS-2862) mesos-fetcher won't fetch uris which begin with a
Cody Maloney created MESOS-2862: --- Summary: mesos-fetcher won't fetch uris which begin with a Key: MESOS-2862 URL: https://issues.apache.org/jira/browse/MESOS-2862 Project: Mesos Issue Type: Bug Components: fetcher Affects Versions: 0.22.1 Reporter: Cody Maloney Priority: Minor Discovered while running mesos with marathon on top. If I launch a marathon task with a URI which is http://apache.osuosl.org/mesos/0.22.1/mesos-0.22.1.tar.gz; mesos will log to stderr: {code} I0611 22:39:22.815636 35673 logging.cpp:177] Logging to STDERR I0611 22:39:25.643889 35673 fetcher.cpp:214] Fetching URI ' http://apache.osuosl.org/mesos/0.22.1/mesos-0.22.1.tar.gz' I0611 22:39:25.648111 35673 fetcher.cpp:94] Hadoop Client not available, skipping fetch with Hadoop Client Failed to fetch: http://apache.osuosl.org/mesos/0.22.1/mesos-0.22.1.tar.gz Failed to synchronize with slave (it's probably exited) {code} It would be nice if mesos trimmed leading whitespace before doing protocol detection so that simple mistakes are just fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
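A minimal sketch of the requested behaviour, trimming leading whitespace before looking for the protocol. The function names `trimLeadingWhitespace` and `detectScheme` are made up for illustration and are not the actual mesos-fetcher code.

```cpp
#include <string>

// Hypothetical helper: drops leading spaces, tabs and newlines from a URI.
std::string trimLeadingWhitespace(const std::string& uri)
{
  const std::size_t first = uri.find_first_not_of(" \t\r\n");
  return first == std::string::npos ? std::string() : uri.substr(first);
}

// Hypothetical helper: returns the scheme in front of "://" ("http", "hdfs",
// ...) after trimming, or an empty string if the URI has no scheme.
std::string detectScheme(const std::string& rawUri)
{
  const std::string uri = trimLeadingWhitespace(rawUri);
  const std::size_t separator = uri.find("://");
  return separator == std::string::npos ? std::string()
                                        : uri.substr(0, separator);
}
```

With this in place a URI such as " http://apache.osuosl.org/mesos/0.22.1/mesos-0.22.1.tar.gz" (note the leading space, as in the log above) would be recognised as an http download instead of failing.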
[jira] [Updated] (MESOS-1739) Allow slave reconfiguration on restart
[ https://issues.apache.org/jira/browse/MESOS-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Maloney updated MESOS-1739: Assignee: (was: Cody Maloney) Allow slave reconfiguration on restart -- Key: MESOS-1739 URL: https://issues.apache.org/jira/browse/MESOS-1739 Project: Mesos Issue Type: Epic Reporter: Patrick Reilly Labels: mesosphere, myriad Make it so that either via a slave restart or a out of process reconfigure ping, the attributes and resources of a slave can be updated to be a superset of what they used to be. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2830) Add an endpoint to slaves to allow launching system administration tasks
Cody Maloney created MESOS-2830: --- Summary: Add an endpoint to slaves to allow launching system administration tasks Key: MESOS-2830 URL: https://issues.apache.org/jira/browse/MESOS-2830 Project: Mesos Issue Type: Wish Components: slave Reporter: Cody Maloney Priority: Minor As a System Administrator I often need to run an organization-mandated task on every machine in the cluster. Ideally I could do this within the framework of mesos resources if it is a cleanup or auditing task, but sometimes I just have to run something, and run it now, regardless of whether a machine has un-accounted resources (Ex: Adding/removing a user). Currently to do this I have to completely bypass Mesos and SSH to the box. Ideally I could tell a mesos slave (With proper authentication) to run a container with the limited special permissions needed to get the task done. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2832) Enable configuring Mesos with environment variables without having them leak to tasks launched
Cody Maloney created MESOS-2832: --- Summary: Enable configuring Mesos with environment variables without having them leak to tasks launched Key: MESOS-2832 URL: https://issues.apache.org/jira/browse/MESOS-2832 Project: Mesos Issue Type: Wish Reporter: Cody Maloney Priority: Critical Currently if mesos is configured with environment variables (MESOS_MODULES), those show up in every task which is launched unless the executor explicitly cleans them up. If the task being launched happens to be something libprocess / mesos based, this can often prevent the task from starting up (A scheduler has issues loading a module intended for the slave). There are also cases where it would be nice to be able to change what the PATH is that tasks launch with (the host may have more in the path than tasks are supposed to / allowed to depend upon). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2810) mesos-executor reimplements subprocess
Cody Maloney created MESOS-2810: --- Summary: mesos-executor reimplements subprocess Key: MESOS-2810 URL: https://issues.apache.org/jira/browse/MESOS-2810 Project: Mesos Issue Type: Improvement Components: slave Reporter: Cody Maloney The launchTask method is a re-implementation of libprocess subprocess https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L110 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2811) process/subprocess.hpp API hard to use, extend
Cody Maloney created MESOS-2811: --- Summary: process/subprocess.hpp API hard to use, extend Key: MESOS-2811 URL: https://issues.apache.org/jira/browse/MESOS-2811 Project: Mesos Issue Type: Improvement Components: slave Affects Versions: 0.22.1 Reporter: Cody Maloney https://github.com/apache/mesos/blob/master/3rdparty/libprocess/include/process/subprocess.hpp There are many overloads of subprocess() construction, a lot of them are very similar. It passes environment in as an {{Option<std::map<std::string, std::string>>}} which isn't what stout's os::environment() returns ({{hashmap<std::string, std::string> environment()}}). Ideally those should match for easy passing environments around + manipulating. It isn't possible to tell it not to copy in the environment of the running process (Useful to isolate slave environments from the running process). This becomes critical when configuring mesos via environment variables. Currently mesos explicitly unsets LIBPROCESS_IP when launching new processes because that one is known to cause trouble when mesos launches another libprocess-based thing. ExecEnv is just weird: it isn't great / modern C++, results in a lot of unnecessary / useless copies of things as it currently stands, and doesn't follow modern C++ interface standards. The code is hard to read / follow:
{code}
// Close the copies. We need to make sure that we do not close the
// file descriptor assigned to stdin/stdout/stderr in case the
// parent has closed stdin/stdout/stderr when calling this
// function (in that case, a dup'ed file descriptor may have the
// same file descriptor number as stdin/stdout/stderr).
if (stdinFd[0] != STDIN_FILENO &&
    stdinFd[0] != STDOUT_FILENO &&
    stdinFd[0] != STDERR_FILENO) {
  while (::close(stdinFd[0]) == -1 && errno == EINTR);
}
if (stdoutFd[1] != STDIN_FILENO &&
    stdoutFd[1] != STDOUT_FILENO &&
    stdoutFd[1] != STDERR_FILENO) {
  while (::close(stdoutFd[1]) == -1 && errno == EINTR);
}
if (stderrFd[1] != STDIN_FILENO &&
    stderrFd[1] != STDOUT_FILENO &&
    stderrFd[1] != STDERR_FILENO) {
  while (::close(stderrFd[1]) == -1 && errno == EINTR);
}
{code}
Why do we switch between fd[0] vs [1]? Why are we hand-coding while-EINTR loops over and over? Doesn't stout have an os::close? https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L165 -- os::execvpe() can fail for perfectly good reasons, we should really log the name of the command / info that was trying to be run. There shouldn't be a backtrace printed (which abort does). A lot of the subprocess overloads needlessly re-implement functionality which the underlying exec() C APIs provide; using those APIs instead of re-implementing all the variations would be a much better model. Mesos doesn't use / need most of the subprocess overloads that exist. A lot of the usage patterns probably could / should be removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
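The three near-identical blocks quoted above could be collapsed into one small helper. The following is a sketch only: `isStandardFd` and `closeUnlessStandard` are hypothetical names, not functions from stout or libprocess, and the EINTR retry simply mirrors what the quoted code does.

```cpp
#include <cerrno>
#include <unistd.h>

// Hypothetical helper: true if `fd` is one of stdin/stdout/stderr.
bool isStandardFd(int fd)
{
  return fd == STDIN_FILENO || fd == STDOUT_FILENO || fd == STDERR_FILENO;
}

// Hypothetical helper: closes `fd` unless it is a standard descriptor.
// Returns 0 on success or when the close was skipped, -1 with errno set
// otherwise.
//
// The retry on EINTR mirrors the quoted code. Note that POSIX leaves the
// state of the descriptor unspecified after close() fails with EINTR, so
// whether retrying is safe is platform-dependent.
int closeUnlessStandard(int fd)
{
  if (isStandardFd(fd)) {
    return 0;
  }

  int result;
  do {
    result = ::close(fd);
  } while (result == -1 && errno == EINTR);

  return result;
}
```

With such a helper the call sites would shrink to `closeUnlessStandard(stdinFd[0])`, `closeUnlessStandard(stdoutFd[1])` and `closeUnlessStandard(stderrFd[1])`, which also makes the read-end versus write-end distinction easier to spot.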
[jira] [Created] (MESOS-2812) Document mesos internal launching a container path
Cody Maloney created MESOS-2812: --- Summary: Document mesos internal launching a container path Key: MESOS-2812 URL: https://issues.apache.org/jira/browse/MESOS-2812 Project: Mesos Issue Type: Improvement Components: slave Affects Versions: 0.22.1 Reporter: Cody Maloney Sometimes mesos uses LinuxLauncher, sometimes it uses PosixLauncher. These both share a lot of implementation. Just because we're on Linux doesn't mean we use the LinuxLauncher. These rely on mesos-containerizer (another subprocess implementation) and mesos-executor (yet another subprocess launcher in its launchTask method). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2814) os::read should have one implementation
Cody Maloney created MESOS-2814: --- Summary: os::read should have one implementation Key: MESOS-2814 URL: https://issues.apache.org/jira/browse/MESOS-2814 Project: Mesos Issue Type: Improvement Components: stout Reporter: Cody Maloney Currently stout os::read() has two radically different implementations when you give it a {{std::string}} vs. a {{const char *}}. Ideally these have one implementation that does things like intelligently size the buffer that it writes into rather than re-allocating repeatedly with every time it lengthens the string (resulting in copious copying). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
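One way the "single implementation" idea could look is sketched below, assuming plain POSIX calls. The name `readFile` and its signature are hypothetical and deliberately do not mimic stout's actual os::read interface; the sketch only shows sizing the buffer once from `fstat` and letting the C-string overload forward to the same code.

```cpp
#include <cerrno>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#include <string>

// Hypothetical single implementation: reads the whole file at `path` into
// `*contents`. The buffer is sized from fstat() up front, so a regular file
// is normally read without any reallocation.
bool readFile(const std::string& path, std::string* contents)
{
  const int fd = ::open(path.c_str(), O_RDONLY | O_CLOEXEC);
  if (fd == -1) {
    return false;
  }

  struct stat info;
  if (::fstat(fd, &info) == -1) {
    ::close(fd);
    return false;
  }

  // One extra byte lets the final read() report end-of-file without forcing
  // a resize. Files that report size 0 (e.g. under /proc) get a default.
  std::string buffer;
  buffer.resize(
      info.st_size > 0 ? static_cast<std::size_t>(info.st_size) + 1 : 4096);

  std::size_t used = 0;
  while (true) {
    if (used == buffer.size()) {
      buffer.resize(buffer.size() * 2);  // File grew past the size hint.
    }

    const ssize_t n = ::read(fd, &buffer[used], buffer.size() - used);
    if (n == -1) {
      if (errno == EINTR) {
        continue;
      }
      ::close(fd);
      return false;
    }
    if (n == 0) {
      break;  // End of file.
    }
    used += static_cast<std::size_t>(n);
  }

  ::close(fd);
  buffer.resize(used);
  contents->swap(buffer);
  return true;
}

// The C-string overload forwards instead of carrying its own read loop.
bool readFile(const char* path, std::string* contents)
{
  return readFile(std::string(path), contents);
}
```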
[jira] [Commented] (MESOS-2131) Add a reverse proxy endpoint to mesos
[ https://issues.apache.org/jira/browse/MESOS-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14549538#comment-14549538 ] Cody Maloney commented on MESOS-2131: - This is stalled at the moment (I haven't been working on it, heading out of town). Can talk to someone about remaining issues with it, path forward if they resurrect it. Add a reverse proxy endpoint to mesos - Key: MESOS-2131 URL: https://issues.apache.org/jira/browse/MESOS-2131 Project: Mesos Issue Type: Improvement Components: master, slave Reporter: Cody Maloney Assignee: Cody Maloney Priority: Minor Labels: mesosphere A new libprocess Process inside mesos which allows attaching/detaching known endpoints at a specific path. Ideally I want to be able to do things like attach 'slave-id' and pass HTTP requests on to that slave. Sample endpoint actions: C++ api: attach(std::string name, Node target): Add a new reverse proxy path; detach(std::string name): Remove an established reverse proxy path. HTTP endpoints: /proxy/go/{name} - Prefix matches a path, forwards the remaining path onto the remote endpoint; /proxy/debug.json - Prints out all attached endpoints. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
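The attach / detach / prefix-match behaviour described in the issue can be illustrated with a small routing table. This is a sketch under assumptions: `Target` is a hypothetical stand-in for the `Node target` mentioned in the proposal, `ProxyTable` and `resolve` are invented names, and no actual HTTP forwarding or libprocess integration is shown.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the `Node target` of the proposed C++ api.
struct Target
{
  std::string host;
  int port;
};

class ProxyTable
{
public:
  // Adds a new reverse proxy path; returns false if `name` is already taken.
  bool attach(const std::string& name, const Target& target)
  {
    return targets.insert(std::make_pair(name, target)).second;
  }

  // Removes an established reverse proxy path; false if it was not attached.
  bool detach(const std::string& name)
  {
    return targets.erase(name) > 0;
  }

  // Resolves "/proxy/go/{name}/rest/of/path" to the attached target and the
  // remaining path that would be forwarded to it.
  bool resolve(
      const std::string& path,
      Target* target,
      std::string* remaining) const
  {
    const std::string prefix = "/proxy/go/";
    if (path.compare(0, prefix.size(), prefix) != 0) {
      return false;
    }

    const std::size_t end = path.find('/', prefix.size());
    const std::string name = path.substr(
        prefix.size(),
        end == std::string::npos ? std::string::npos : end - prefix.size());

    const std::map<std::string, Target>::const_iterator it =
      targets.find(name);
    if (it == targets.end()) {
      return false;
    }

    *target = it->second;
    *remaining = end == std::string::npos ? std::string("/") : path.substr(end);
    return true;
  }

private:
  std::map<std::string, Target> targets;
};
```

For example, after attaching a target under a slave's id, a request for `/proxy/go/<slave-id>/state.json` would resolve to that target with `/state.json` as the path to forward.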
[jira] [Commented] (MESOS-1375) Log rotation capable
[ https://issues.apache.org/jira/browse/MESOS-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546016#comment-14546016 ] Cody Maloney commented on MESOS-1375: - For configuring things even using Mesosphere init scripts in the current init wrappers you can add arbitrary flags as well as do environment variables which will be sourced. That said, definitely we've felt the pain of those old init scripts (Our newer mesos packaging we use in DCOS completely foregoes them), we may actually look at removing them in a new generation of the packaging. Log rotation capable Key: MESOS-1375 URL: https://issues.apache.org/jira/browse/MESOS-1375 Project: Mesos Issue Type: Improvement Components: master, slave Affects Versions: 0.18.0 Reporter: Damien Hardy Labels: ops, twitter Please provide a way to let ops manage logs. A log4j like configuration would be hard but make rotation capable without restarting the service at least. Based on external logrotate tool would be great : * write to a constant log file name * check for file change (recreated by logrotate) before write -- This message was sent by Atlassian JIRA (v6.3.4#6332)
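The two bullet points in the request (write to a constant file name, check whether logrotate recreated the file before each write) can be sketched with POSIX calls as follows. `RotatableLog` is a hypothetical class for illustration only; it is not how Mesos or glog handle logging.

```cpp
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#include <string>

// Hypothetical sketch: always writes to the same path, and reopens the file
// when the path no longer refers to the file that is currently open (for
// example because logrotate renamed the old one away).
class RotatableLog
{
public:
  explicit RotatableLog(const std::string& path) : path_(path), fd_(-1) {}

  ~RotatableLog()
  {
    if (fd_ != -1) {
      ::close(fd_);
    }
  }

  RotatableLog(const RotatableLog&) = delete;
  RotatableLog& operator=(const RotatableLog&) = delete;

  bool write(const std::string& line)
  {
    if (!reopenIfRotated()) {
      return false;
    }
    const ssize_t n = ::write(fd_, line.data(), line.size());
    return n == static_cast<ssize_t>(line.size());
  }

private:
  bool reopenIfRotated()
  {
    struct stat onDisk;
    struct stat opened;

    // Same device and inode means the path still names the open file.
    const bool same =
      fd_ != -1 &&
      ::stat(path_.c_str(), &onDisk) == 0 &&
      ::fstat(fd_, &opened) == 0 &&
      onDisk.st_dev == opened.st_dev &&
      onDisk.st_ino == opened.st_ino;

    if (same) {
      return true;
    }

    if (fd_ != -1) {
      ::close(fd_);
    }
    fd_ = ::open(
        path_.c_str(), O_WRONLY | O_APPEND | O_CREAT | O_CLOEXEC, 0644);
    return fd_ != -1;
  }

  std::string path_;
  int fd_;
};
```

The extra `stat` per write is the cost of not needing a restart or a signal; a real implementation might check less often.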
[jira] [Commented] (MESOS-1303) ExamplesTest.{TestFramework, NoExecutorFramework} flaky
[ https://issues.apache.org/jira/browse/MESOS-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542583#comment-14542583 ] Cody Maloney commented on MESOS-1303: - [~tillt] Would it be reasonable to just implement dirname ourselves in C++? What people expect to have happen isn't that hard to get (Although need to make sure we don't break expectations around things that end in '/'). ExamplesTest.{TestFramework, NoExecutorFramework} flaky --- Key: MESOS-1303 URL: https://issues.apache.org/jira/browse/MESOS-1303 Project: Mesos Issue Type: Bug Components: test Reporter: Ian Downes Labels: flaky I'm having trouble reproducing this but I did observe it once on my OSX system: {noformat} [==] Running 2 tests from 1 test case. [--] Global test environment set-up. [--] 2 tests from ExamplesTest [ RUN ] ExamplesTest.TestFramework ../../src/tests/script.cpp:81: Failure Failed test_framework_test.sh terminated with signal 'Abort trap: 6' [ FAILED ] ExamplesTest.TestFramework (953 ms) [ RUN ] ExamplesTest.NoExecutorFramework [ OK ] ExamplesTest.NoExecutorFramework (10162 ms) [--] 2 tests from ExamplesTest (5 ms total) [--] Global test environment tear-down [==] 2 tests from 1 test case ran. (11121 ms total) [ PASSED ] 1 test. 
[ FAILED ] 1 test, listed below: [ FAILED ] ExamplesTest.TestFramework {noformat} when investigating a failed make check for https://reviews.apache.org/r/20971/ {noformat} [--] 6 tests from ExamplesTest [ RUN ] ExamplesTest.TestFramework [ OK ] ExamplesTest.TestFramework (8643 ms) [ RUN ] ExamplesTest.NoExecutorFramework tests/script.cpp:81: Failure Failed no_executor_framework_test.sh terminated with signal 'Aborted' [ FAILED ] ExamplesTest.NoExecutorFramework (7220 ms) [ RUN ] ExamplesTest.JavaFramework [ OK ] ExamplesTest.JavaFramework (11181 ms) [ RUN ] ExamplesTest.JavaException [ OK ] ExamplesTest.JavaException (5624 ms) [ RUN ] ExamplesTest.JavaLog [ OK ] ExamplesTest.JavaLog (6472 ms) [ RUN ] ExamplesTest.PythonFramework [ OK ] ExamplesTest.PythonFramework (14467 ms) [--] 6 tests from ExamplesTest (53607 ms total) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
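As a rough idea of what "implement dirname ourselves in C++" from the comment above might involve, here is a sketch that follows the usual POSIX dirname results, including the trailing-'/' cases the comment worries about. The name `directoryName` is hypothetical; this is not the code that ended up in stout.

```cpp
#include <string>

// Hypothetical helper: returns the directory part of `path` without
// modifying its argument or relying on static storage.
std::string directoryName(const std::string& path)
{
  if (path.empty()) {
    return ".";
  }

  // Ignore trailing slashes: "/usr/lib/" is treated like "/usr/lib".
  const std::size_t end = path.find_last_not_of('/');
  if (end == std::string::npos) {
    return "/";  // The path consists only of slashes.
  }

  // Find the slash separating the final component from its parent.
  const std::size_t slash = path.rfind('/', end);
  if (slash == std::string::npos) {
    return ".";  // No directory part, e.g. "file".
  }

  // Drop the run of slashes in front of the final component.
  const std::size_t parentEnd = path.find_last_not_of('/', slash);
  if (parentEnd == std::string::npos) {
    return "/";  // The parent is the root, e.g. "/usr".
  }

  return path.substr(0, parentEnd + 1);
}
```

Because the function takes and returns `std::string` by value semantics, callers do not need to copy the input first or worry about the result being overwritten by a later call.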
[jira] [Commented] (MESOS-1739) Allow slave reconfiguration on restart
[ https://issues.apache.org/jira/browse/MESOS-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14536280#comment-14536280 ] Cody Maloney commented on MESOS-1739: - The biggest thing which came up in my old patchset was race conditions around re-registering in how the mesos registerSlave / reregisterSlave code is setup which probably will need some structural reworking. The case that was broken in my patch set is when a slave tries to register multiple times because it hasn't gotten a response from the master yet, and 1+ of those retries aren't identical to the first because they contain different resources / attributes (The slave started re-registration, then was restarted with new attributes before the master fully processed it), the master doesn't notice and just discards them as repeats. Allow slave reconfiguration on restart -- Key: MESOS-1739 URL: https://issues.apache.org/jira/browse/MESOS-1739 Project: Mesos Issue Type: Epic Reporter: Patrick Reilly Assignee: Cody Maloney Labels: mesosphere Make it so that either via a slave restart or a out of process reconfigure ping, the attributes and resources of a slave can be updated to be a superset of what they used to be. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2690) --enable-optimize build fails with maybe-uninitialized
[ https://issues.apache.org/jira/browse/MESOS-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14531119#comment-14531119 ] Cody Maloney commented on MESOS-2690: - So {{\-\-enable\-optimize}}, inside the script we add the {{\-O2}} as a user-shortcut. {{\-\-enable-optimize}} we provide as a user shortcut, and if people touch CXXFLAGS themselves, it doesn't do anything (Didn't use to anyways, with https://reviews.apache.org/r/33828/ we now always add a flag, regardless of if we don't add {{\-O2}} which is something I should have caught in my review...). The magic shortcuts are just making these combinations easier to use and work right (Sort of like how we add very specific flags if we see you are using compiler X so that mesos builds without needing to manually specify CXXFLAGS to work around specific compiler versions). --enable-optimize build fails with maybe-uninitialized -- Key: MESOS-2690 URL: https://issues.apache.org/jira/browse/MESOS-2690 Project: Mesos Issue Type: Bug Components: build Environment: GCC 4.8 - 4.9 Reporter: Joris Van Remoortere Assignee: Joris Van Remoortere Priority: Blocker When building with the `enable-optimize` flag, the build fails with `maybe-uninitialized' errors. This is due to a bug in GCC when building optimized code triggering false positives for this warning. Please see: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59970 We can disable this warning when using GCC + --enable-optimize. A quick work-around until there is a patch: ../configure CXXFLAGS=-Wno-maybe-uninitialized your-other-flags-here -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-2690) --enable-optimize build fails with maybe-uninitialized
[ https://issues.apache.org/jira/browse/MESOS-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14531060#comment-14531060 ] Cody Maloney edited comment on MESOS-2690 at 5/6/15 6:03 PM: - Grepping for {{-O2}} in {{CXXFLAGS}} is fairly fragile, and moderately unsafe because it's one particular GCC optimization, which happens to be included in {{-O2}}. Unless we implement parsing all of GCC's flags, finding which one enables the optimization that breaks {{-Wno-maybe-uninitialized}} we've made a very, very environment-specific patch to work around a particular bug which could quite likely be fixed in a point release of GCC at some point rendering the code incorrect. In the spec file you can fairly simply add the flag to {{CXXFLAGS}} passing it into configure like all the other manually-set {{CXXFLAGS}} by configure. What sets the optimization flag which doesn't work well with our warning flags sets the bypass for the bug that pops up as well. It's all set on the outside and works its way in. From the automake manual: {code} This section attempts to answer all the above questions. We will mostly discuss CPPFLAGS in our examples, but actually the answer holds for all the compile flags used in Automake: CCASFLAGS, CFLAGS, CPPFLAGS, CXXFLAGS, FCFLAGS, FFLAGS, GCJFLAGS, LDFLAGS, LFLAGS, LIBTOOLFLAGS, OBJCFLAGS, OBJCXXFLAGS, RFLAGS, UPCFLAGS, and YFLAGS. {code} ... {code} You should not add options to these user variables within configure either, for the same reason. Occasionally you need to modify these variables to perform a test, but you should reset their values afterwards. In contrast, it is OK to modify the ‘AM_’ variables within configure if you AC_SUBST them, but it is rather rare that you need to do this, unless you really want to change the default definitions of the ‘AM_’ variables in all Makefiles. 
{code} -- http://www.gnu.org/software/automake/manual/html_node/Flag-Variables-Ordering.html was (Author: cmaloney): Grepping for {{-O2}} in {{CXXFLAGS}} is fairly fragile, and moderately unsafe because it's one particular GCC optimization, which happens to be included in {{-O2}}. Unless we implement parsing all of GCC's flags, finding which one enables the optimization that breaks {{-Wno-maybe-uninitialized}} we've made a very, very environment-specific patch to work around a particular bug which could quite likely be fixed in a point release of GCC at some point rendering the code incorrect. In the spec file you can fairly simply add the flag to {{CXXFLAGS}} passing it into configure like all the other manually-set {{CXXFLAGS}} by configure. What sets the optimization flag which doesn't work well with our warning flags sets the bypass for the bug that pops up as well. It's all set on the outside and works its way in. From the automake manual: {{code}} This section attempts to answer all the above questions. We will mostly discuss CPPFLAGS in our examples, but actually the answer holds for all the compile flags used in Automake: CCASFLAGS, CFLAGS, CPPFLAGS, CXXFLAGS, FCFLAGS, FFLAGS, GCJFLAGS, LDFLAGS, LFLAGS, LIBTOOLFLAGS, OBJCFLAGS, OBJCXXFLAGS, RFLAGS, UPCFLAGS, and YFLAGS. {{code}} ... {{code}} You should not add options to these user variables within configure either, for the same reason. Occasionally you need to modify these variables to perform a test, but you should reset their values afterwards. In contrast, it is OK to modify the ‘AM_’ variables within configure if you AC_SUBST them, but it is rather rare that you need to do this, unless you really want to change the default definitions of the ‘AM_’ variables in all Makefiles. 
{{code}} -- http://www.gnu.org/software/automake/manual/html_node/Flag-Variables-Ordering.html --enable-optimize build fails with maybe-uninitialized -- Key: MESOS-2690 URL: https://issues.apache.org/jira/browse/MESOS-2690 Project: Mesos Issue Type: Bug Components: build Environment: GCC 4.8 - 4.9 Reporter: Joris Van Remoortere Assignee: Joris Van Remoortere Priority: Blocker When building with the `enable-optimize` flag, the build fails with `maybe-uninitialized' errors. This is due to a bug in GCC when building optimized code triggering false positives for this warning. Please see: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59970 We can disable this warning when using GCC + --enable-optimize. A quick work-around until there is a patch: ../configure CXXFLAGS=-Wno-maybe-uninitialized your-other-flags-here -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2690) --enable-optimize build fails with maybe-uninitialized
[ https://issues.apache.org/jira/browse/MESOS-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14531060#comment-14531060 ] Cody Maloney commented on MESOS-2690: - Grepping for {{-O2}} in {{CXXFLAGS}} is fairly fragile, and moderately unsafe because it's one particular GCC optimization, which happens to be included in {{-O2}}. Unless we implement parsing all of GCC's flags, finding which one enables the optimization that breaks {{-Wno-maybe-uninitialized}} we've made a very, very environment-specific patch to work around a particular bug which could quite likely be fixed in a point release of GCC at some point rendering the code incorrect. In the spec file you can fairly simply add the flag to {{CXXFLAGS}} passing it into configure like all the other manually-set {{CXXFLAGS}} by configure. What sets the optimization flag which doesn't work well with our warning flags sets the bypass for the bug that pops up as well. It's all set on the outside and works its way in. From the automake manual: {{code}} This section attempts to answer all the above questions. We will mostly discuss CPPFLAGS in our examples, but actually the answer holds for all the compile flags used in Automake: CCASFLAGS, CFLAGS, CPPFLAGS, CXXFLAGS, FCFLAGS, FFLAGS, GCJFLAGS, LDFLAGS, LFLAGS, LIBTOOLFLAGS, OBJCFLAGS, OBJCXXFLAGS, RFLAGS, UPCFLAGS, and YFLAGS. {{code}} ... {{code}} You should not add options to these user variables within configure either, for the same reason. Occasionally you need to modify these variables to perform a test, but you should reset their values afterwards. In contrast, it is OK to modify the ‘AM_’ variables within configure if you AC_SUBST them, but it is rather rare that you need to do this, unless you really want to change the default definitions of the ‘AM_’ variables in all Makefiles. 
{{code}} -- http://www.gnu.org/software/automake/manual/html_node/Flag-Variables-Ordering.html --enable-optimize build fails with maybe-uninitialized -- Key: MESOS-2690 URL: https://issues.apache.org/jira/browse/MESOS-2690 Project: Mesos Issue Type: Bug Components: build Environment: GCC 4.8 - 4.9 Reporter: Joris Van Remoortere Assignee: Joris Van Remoortere Priority: Blocker When building with the `enable-optimize` flag, the build fails with `maybe-uninitialized' errors. This is due to a bug in GCC when building optimized code triggering false positives for this warning. Please see: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59970 We can disable this warning when using GCC + --enable-optimize. A quick work-around until there is a patch: ../configure CXXFLAGS=-Wno-maybe-uninitialized your-other-flags-here -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2690) --enable-optimize build fails with maybe-uninitialized
[ https://issues.apache.org/jira/browse/MESOS-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527543#comment-14527543 ] Cody Maloney commented on MESOS-2690: - Not if you still want optimization / debug info. For that you need to include it in CXXFLAGS / CFLAGS yourself, so something like: `-O2 -Wno-maybe-uninitialized`. --enable-optimize and --enable-debug don't modify CFLAGS/CXXFLAGS if they are passed in by the user. --enable-optimize build fails with maybe-uninitialized -- Key: MESOS-2690 URL: https://issues.apache.org/jira/browse/MESOS-2690 Project: Mesos Issue Type: Bug Components: build Environment: GCC 4.8 - 4.9 Reporter: Joris Van Remoortere Assignee: Joris Van Remoortere When building with the `enable-optimize` flag, the build fails with `maybe-uninitialized' errors. This is due to a bug in GCC when building optimized code triggering false positives for this warning. Please see: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59970 We can disable this warning when using GCC + --enable-optimize. A quick work-around until there is a patch: ../configure CXXFLAGS=-Wno-maybe-uninitialized your-other-flags-here -- This message was sent by Atlassian JIRA (v6.3.4#6332)
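The behaviour described in the comment above (user-supplied CXXFLAGS are left untouched, so the optimization level and the warning bypass have to be passed in together) can be sketched roughly as follows. This is a hypothetical Python model, not the actual configure logic: the function name `effective_cxxflags` and the default flags `-g` and `-O2` are assumptions made purely for illustration.

```python
def effective_cxxflags(user_cxxflags=None, enable_optimize=False, enable_debug=False):
    """Hypothetical model of how the final CXXFLAGS are chosen.

    Assumption: when the user passes CXXFLAGS, --enable-optimize and
    --enable-debug add nothing, as the comment above states.
    """
    if user_cxxflags is not None:
        # User-set flags win and are used exactly as given.
        return user_cxxflags.split()

    flags = []
    if enable_debug:
        flags.append("-g")   # assumed default, for illustration only
    if enable_optimize:
        flags.append("-O2")  # assumed default, for illustration only
    return flags


# Passing only the warning bypass drops the optimization level ...
print(effective_cxxflags("-Wno-maybe-uninitialized", enable_optimize=True))
# ... so both have to be spelled out by the caller.
print(effective_cxxflags("-O2 -Wno-maybe-uninitialized", enable_optimize=True))
```

Under these assumptions the first call yields no `-O2` at all, which is why the work-around in the issue description needs the optimization flag added by hand.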
[jira] [Commented] (MESOS-1375) Log rotation capable
[ https://issues.apache.org/jira/browse/MESOS-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527512#comment-14527512 ] Cody Maloney commented on MESOS-1375: - Another option that would be really nice is to integrate with systemd / journald when running on a host that has them, and just use the journal. That way the logs are properly size-capped / rotated, and things could eventually use more structured, auditable logging if they want. Log rotation capable Key: MESOS-1375 URL: https://issues.apache.org/jira/browse/MESOS-1375 Project: Mesos Issue Type: Improvement Components: master, slave Affects Versions: 0.18.0 Reporter: Damien Hardy Labels: ops, twitter Please provide a way to let ops manage logs. A log4j like configuration would be hard but make rotation capable without restarting the service at least. Based on external logrotate tool would be great : * write to a constant log file name * check for file change (recreated by logrotate) before write -- This message was sent by Atlassian JIRA (v6.3.4#6332)
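The two bullet points in the issue description (write to a constant file name, check whether the file was recreated before each write) can be sketched as below. This is a minimal illustration in Python on a POSIX filesystem, not Mesos code; the class name `ReopeningLogFile` is made up for this sketch. Python's standard `logging.handlers.WatchedFileHandler` applies the same device/inode check.

```python
import os


class ReopeningLogFile:
    """Appends to a fixed path and reopens it if the file was replaced."""

    def __init__(self, path):
        self.path = path
        self._open()

    def _open(self):
        self.stream = open(self.path, "a")
        stat = os.fstat(self.stream.fileno())
        # Remember which file we actually hold open.
        self.identity = (stat.st_dev, stat.st_ino)

    def _rotated(self):
        try:
            stat = os.stat(self.path)
        except FileNotFoundError:
            return True  # moved away and not recreated yet
        return (stat.st_dev, stat.st_ino) != self.identity

    def write(self, line):
        # Check for file change (e.g. done by logrotate) before the write.
        if self._rotated():
            self.stream.close()
            self._open()
        self.stream.write(line + "\n")
        self.stream.flush()
```

The check costs one `stat` call per write, which is the price for not needing a signal or a service restart when the rotation tool moves the file.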
[jira] [Resolved] (MESOS-2604) Upgrade minimum required compilers for MESOS
[ https://issues.apache.org/jira/browse/MESOS-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Maloney resolved MESOS-2604. - Resolution: Fixed Fix Version/s: 0.23.0 Upgrade minimum required compilers for MESOS Key: MESOS-2604 URL: https://issues.apache.org/jira/browse/MESOS-2604 Project: Mesos Issue Type: Improvement Components: build Affects Versions: 0.23.0 Reporter: Cody Maloney Assignee: Cody Maloney Labels: c++11 Fix For: 0.23.0 As discussed in the last community meeting we would like to upgrade the minimum mesos compiler version to GCC 4.8+, Clang 3.5. GCC primarily for Linux. Clang for OS X, as well as linux for enabling Mesos tooling improvements ([clang-format|http://mesos.apache.org/documentation/clang-format/], clang-tidy among others). Some documents for reference: [Compilers by Distribution Version|https://docs.google.com/spreadsheets/d/1Ji8p3p_1JqUsMxE31mJqqztHf7LDx7mGMXh253azWpU/edit?usp=sharing] Shows we can get GCC 4.8+ or clang 3.5+ on all supported platforms. C++11 features supported by each compiler: [https://gcc.gnu.org/projects/cxx0x.html] [http://clang.llvm.org/cxx_status.html] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2604) Upgrade minimum required compilers for MESOS
[ https://issues.apache.org/jira/browse/MESOS-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527255#comment-14527255 ] Cody Maloney commented on MESOS-2604: - {code} author Cody Maloney <c...@mesosphere.io> Thu, 23 Apr 2015 14:38:48 -0700 (14:38 -0700) committer Benjamin Hindman <benjamin.hind...@gmail.com> Sat, 25 Apr 2015 16:21:46 -0700 (16:21 -0700) commit 0f5c78fad3423181f7227027eb42d162811514e7 tree 5c6158257e926e29279e5eee13f189f46cf8fe07 parent b4bbfd6ae0c5287d0328caeff89d0c574ae4a546 Warn if g++ 4.8 or a C++ standard library is too old for Mesos. After this a whole bunch more of the C++11 checks can be removed, we can unconditionally use -std=c++11, among other things with this change. Note that we don't explicitly check the clang version number since extracting it is hard (OS X clang behaves differently than Linux clang), and 'clang -dumpversion' always reports 4.2.1 for compatibility with some random tools that used GCC. {code} {code} author Benjamin Hindman <benjamin.hind...@gmail.com> Sat, 25 Apr 2015 16:06:38 -0700 (16:06 -0700) committer Benjamin Hindman <benjamin.hind...@gmail.com> Sat, 25 Apr 2015 16:21:35 -0700 (16:21 -0700) commit b4bbfd6ae0c5287d0328caeff89d0c574ae4a546 tree 68a2adab47a3e93e95064ad5dce87e1a99f726c3 parent 4919aa52a9eae4af0874cb41e3a1a6d10c2eafa7 Warn if g++ 4.8 or a C++ standard library is too old for libprocess. After this a whole bunch more of the C++11 checks can be removed, we can unconditionally use -std=c++11, among other things with this change. Note that we don't explicitly check the clang version number since extracting it is hard (OS X clang behaves differently than Linux clang), and 'clang -dumpversion' always reports 4.2.1 for compatibility with some random tools that used GCC. 
{code} Upgrade minimum required compilers for MESOS Key: MESOS-2604 URL: https://issues.apache.org/jira/browse/MESOS-2604 Project: Mesos Issue Type: Improvement Components: build Affects Versions: 0.23.0 Reporter: Cody Maloney Assignee: Cody Maloney Labels: c++11 As discussed in the last community meeting we would like to upgrade the minimum mesos compiler version to GCC 4.8+, Clang 3.5. GCC primarily for Linux. Clang for OS X, as well as linux for enabling Mesos tooling improvements ([clang-format|http://mesos.apache.org/documentation/clang-format/], clang-tidy among others). Some documents for reference: [Compilers by Distribution Version|https://docs.google.com/spreadsheets/d/1Ji8p3p_1JqUsMxE31mJqqztHf7LDx7mGMXh253azWpU/edit?usp=sharing] Shows we can get GCC 4.8+ or clang 3.5+ on all supported platforms. C++11 features supported by each compiler: [https://gcc.gnu.org/projects/cxx0x.html] [http://clang.llvm.org/cxx_status.html] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
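As a rough illustration of the kind of minimum-version check the commit messages above describe, the sketch below compares a `g++ -dumpversion` style string against the GCC 4.8 minimum from the issue. It is written in Python for brevity and is not the actual configure check; the helper names `parse_version` and `gcc_is_new_enough` are hypothetical. As the commit message notes, this approach does not carry over to clang, whose `-dumpversion` reportedly always prints 4.2.1.

```python
import re

MIN_GCC = (4, 8)  # minimum GCC version named in the issue description


def parse_version(text):
    """Extract the leading dotted version number as a tuple of ints."""
    match = re.match(r"\s*(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError("no version number in %r" % text)
    return tuple(int(part) for part in match.group(1).split("."))


def gcc_is_new_enough(dumpversion_output, minimum=MIN_GCC):
    """Compare only as many components as the minimum specifies."""
    return parse_version(dumpversion_output)[:len(minimum)] >= minimum


for reported in ("4.7.3", "4.8.2", "4.9"):
    verdict = "ok" if gcc_is_new_enough(reported) else "too old"
    print(reported, verdict)
```

Comparing integer tuples rather than strings avoids the usual pitfall where "4.10" would sort before "4.8".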
[jira] [Commented] (MESOS-2644) AS a framework developer I WANT to check and depend on a Mesos version
[ https://issues.apache.org/jira/browse/MESOS-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14507931#comment-14507931 ] Cody Maloney commented on MESOS-2644: - We may also want to think about exposing 'feature' flags which schedulers can depend upon rather than hard version requirements. Could be useful for when a feature needs to be hot-patched in (Or when working off a fork for testing out a feature), then lands in a later release. AS a framework developer I WANT to check and depend on a Mesos version -- Key: MESOS-2644 URL: https://issues.apache.org/jira/browse/MESOS-2644 Project: Mesos Issue Type: Story Components: framework Affects Versions: 0.22.0 Reporter: Aaron Bell Example: I'm developing a framework that makes use of persistent volumes, MESOS-1554. At startup I want my scheduler to verify the Mesos master's version and abort if it's less than e.g. {{0.23.0}}, which I know is the minimum version for that feature. I've looked at MESOS-753 and MESOS-986 and they don't seem to address this cleanly. Version may be available in {{state.json}}, but this is an unboundedly large value to parse. It would seem sensible to have an HTTP endpoint {{/version}} or similar. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
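To make the difference between the two approaches in this thread concrete, here is a small sketch in Python. The first function is the hard version requirement from the issue description (abort below {{0.23.0}}); the second is the feature-flag idea from the comment. Both are hypothetical: Mesos is not claimed to expose a feature list, the feature name `persistent_volumes` is invented for the example, and versions are assumed to be plain dotted numbers.

```python
def version_tuple(text):
    """Turn '0.23.0' into (0, 23, 0). Assumes plain dotted numeric versions."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(master_version, minimum="0.23.0"):
    """Hard version requirement, as in the issue description."""
    return version_tuple(master_version) >= version_tuple(minimum)


def missing_features(advertised, required):
    """Feature-flag requirement: return the required features not advertised."""
    return sorted(set(required) - set(advertised))


# A fork at 0.22.1 with the feature hot-patched in fails the version check ...
print(meets_minimum("0.22.1"))
# ... but would pass a feature check if it advertised the (hypothetical) flag.
print(missing_features({"persistent_volumes"}, ["persistent_volumes"]))
```

The feature check keeps working when a feature is backported or tested on a fork, which is the case the comment raises.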
[jira] [Commented] (MESOS-2144) Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread
[ https://issues.apache.org/jira/browse/MESOS-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498789#comment-14498789 ] Cody Maloney commented on MESOS-2144: - Just got one of these with full backtrace: {code} I0416 12:21:01.673476 36776 authenticatee.hpp:115] Initializing client SASL @0x110e9284a google::LogMessage::Fail() @0x110e917dd google::LogMessage::SendToLog() @0x110e924ea google::LogMessage::Flush() @0x110e99348 google::LogMessageFatal::~LogMessageFatal() I0416 12:21:01.747539 308416512 process.cpp:2091] Resuming reaper(1)@127.0.0.1:52842 at 2015-04-16 19:21:33.747597056+00:00 @0x110e92ca5 google::LogMessageFatal::~LogMessageFatal() @0x10f3d33d3 _CheckFatal::~_CheckFatal() @0x10f3d3025 _CheckFatal::~_CheckFatal() @0x10fd94da6 mesos::internal::slave::Slave::__recover() @0x10fe7f09d _ZZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS_6FutureI7NothingEES7_EEvRKNS_3PIDIT_EEMSB_FvT0_ET1_ENKUlPNS_11ProcessBaseEE_clESK_ @0x10fe7ee7f _ZNSt3__110__function6__funcIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS2_6FutureI7NothingEESA_EEvRKNS2_3PIDIT_EEMSE_FvT0_ET1_EUlPNS2_11ProcessBaseEE_NS_9allocatorISO_EEFvSN_EEclEOSN_ @0x110d74e7b std::__1::function::operator()() @0x110d5c5bf process::ProcessBase::visit() @0x110de6c0e process::DispatchEvent::visit() @0x10f3d0841 process::ProcessBase::serve() @0x110d45abe process::ProcessManager::resume() @0x110d451de process::schedule() @ 0x7fff8f1eb268 _pthread_body @ 0x7fff8f1eb1e5 _pthread_start @ 0x7fff8f1e941d thread_start {code} The full log from the test (MESOS_VERBOSE, GLOG_v=2) {code} [ RUN ] ExamplesTest.LowLevelSchedulerPthread Using temporary directory '/tmp/ExamplesTest_LowLevelSchedulerPthread_vVqryS' I0416 12:21:01.637110 2105078528 logging.cpp:177] Logging to STDERR Enabling authentication for the scheduler I0416 12:21:01.639566 2105078528 process.cpp:2081] Spawned process __gc__@127.0.0.1:52945 I0416 12:21:01.639770 2105078528 process.cpp:2081] Spawned process 
help@127.0.0.1:52945 I0416 12:21:01.639583 365723648 process.cpp:2091] Resuming __gc__@127.0.0.1:52945 at 2015-04-16 19:21:01.639622912+00:00 I0416 12:21:01.639777 367869952 process.cpp:2091] Resuming help@127.0.0.1:52945 at 2015-04-16 19:21:01.639796992+00:00 I0416 12:21:01.639875 366260224 process.cpp:2091] Resuming logging@127.0.0.1:52945 at 2015-04-16 19:21:01.639906816+00:00 I0416 12:21:01.639909 2105078528 process.cpp:2081] Spawned process logging@127.0.0.1:52945 I0416 12:21:01.639978 367869952 process.cpp:2091] Resuming profiler@127.0.0.1:52945 at 2015-04-16 19:21:01.640003840+00:00 I0416 12:21:01.640033 368943104 process.cpp:2091] Resuming help@127.0.0.1:52945 at 2015-04-16 19:21:01.640058880+00:00 I0416 12:21:01.640051 2105078528 process.cpp:2081] Spawned process profiler@127.0.0.1:52945 I0416 12:21:01.640246 368406528 process.cpp:2091] Resuming system@127.0.0.1:52945 at 2015-04-16 19:21:01.640268032+00:00 I0416 12:21:01.640236 368943104 process.cpp:2091] Resuming __gc__@127.0.0.1:52945 at 2015-04-16 19:21:01.640258048+00:00 I0416 12:21:01.640318 2105078528 process.cpp:2081] Spawned process system@127.0.0.1:52945 I0416 12:21:01.640321 368943104 process.cpp:2091] Resuming __limiter__(1)@127.0.0.1:52945 at 2015-04-16 19:21:01.640336128+00:00 I0416 12:21:01.640390 368406528 process.cpp:2081] Spawned process __limiter__(1)@127.0.0.1:52945 I0416 12:21:01.640425 365723648 process.cpp:2091] Resuming metrics@127.0.0.1:52945 at 2015-04-16 19:21:01.640440064+00:00 I0416 12:21:01.640472 368406528 process.cpp:2081] Spawned process metrics@127.0.0.1:52945 I0416 12:21:01.640521 367869952 process.cpp:2091] Resuming help@127.0.0.1:52945 at 2015-04-16 19:21:01.640538880+00:00 I0416 12:21:01.640733 366796800 process.cpp:2091] Resuming help@127.0.0.1:52945 at 2015-04-16 19:21:01.640760064+00:00 I0416 12:21:01.640913 2105078528 process.cpp:2081] Spawned process __processes__@127.0.0.1:52945 I0416 12:21:01.640919 366796800 process.cpp:2091] Resuming 
__processes__@127.0.0.1:52945 at 2015-04-16 19:21:01.640937984+00:00 I0416 12:21:01.640949 2105078528 process.cpp:912] libprocess is initialized on 127.0.0.1:52945 for 8 cpus I0416 12:21:01.640971 365723648 process.cpp:2091] Resuming help@127.0.0.1:52945 at 2015-04-16 19:21:01.640985856+00:00 W0416 12:21:01.641326 2105078528 scheduler.cpp:134] ** Scheduler driver bound to loopback interface! Cannot communicate with remote master(s). You might want to set 'LIBPROCESS_IP' environment variable to use a routable IP address. ** I0416 12:21:01.641348 2105078528 scheduler.cpp:149] Version: 0.23.0 I0416
[jira] [Created] (MESOS-2627) ExamplesTest.PersistentVolumeFramework is flaky on OS X
Cody Maloney created MESOS-2627: --- Summary: ExamplesTest.PersistentVolumeFramework is flaky on OS X Key: MESOS-2627 URL: https://issues.apache.org/jira/browse/MESOS-2627 Project: Mesos Issue Type: Bug Environment: OS X Yosemite Reporter: Cody Maloney This just failed for the first time on our OS X Bot (Far less frequent flaky than the other ExamplesTest, but still flaky) while compiling master at commit f6620f851f635b3346c6ebf878152f38b3932ad9. There weren't any commits which touched / changed anything in the test in the set. {code} [ RUN ] ExamplesTest.PersistentVolumeFramework ../../src/tests/script.cpp:83: Failure Failed persistent_volume_framework_test.sh terminated with signal Abort trap: 6 [ FAILED ] ExamplesTest.PersistentVolumeFramework (7865 ms) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2627) ExamplesTest.PersistentVolumeFramework is flaky on OS X
[ https://issues.apache.org/jira/browse/MESOS-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498472#comment-14498472 ] Cody Maloney commented on MESOS-2627: - [~jieyu] any clue why this might be flaky on OS X? ExamplesTest.PersistentVolumeFramework is flaky on OS X --- Key: MESOS-2627 URL: https://issues.apache.org/jira/browse/MESOS-2627 Project: Mesos Issue Type: Bug Environment: OS X Yosemite Reporter: Cody Maloney Labels: flaky, flaky-test This just failed for the first time on our OS X Bot (Far less frequent flaky than the other ExamplesTest, but still flaky) while compiling master at commit f6620f851f635b3346c6ebf878152f38b3932ad9. There weren't any commits which touched / changed anything in the test in the set. {code} [ RUN ] ExamplesTest.PersistentVolumeFramework ../../src/tests/script.cpp:83: Failure Failed persistent_volume_framework_test.sh terminated with signal Abort trap: 6 [ FAILED ] ExamplesTest.PersistentVolumeFramework (7865 ms) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2627) ExamplesTest.PersistentVolumeFramework is flaky on OS X
[ https://issues.apache.org/jira/browse/MESOS-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498484#comment-14498484 ] Cody Maloney commented on MESOS-2627: - No, want me to just turn GLOG_v=2 on for the box and I'll ping when it happens again? ExamplesTest.PersistentVolumeFramework is flaky on OS X --- Key: MESOS-2627 URL: https://issues.apache.org/jira/browse/MESOS-2627 Project: Mesos Issue Type: Bug Environment: OS X Yosemite Reporter: Cody Maloney Labels: flaky, flaky-test This just failed for the first time on our OS X Bot (Far less frequent flaky than the other ExamplesTest, but still flaky) while compiling master at commit f6620f851f635b3346c6ebf878152f38b3932ad9. There weren't any commits which touched / changed anything in the test in the set. {code} [ RUN ] ExamplesTest.PersistentVolumeFramework ../../src/tests/script.cpp:83: Failure Failed persistent_volume_framework_test.sh terminated with signal Abort trap: 6 [ FAILED ] ExamplesTest.PersistentVolumeFramework (7865 ms) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2605) The slave sometimes does not send active executors during reregistration
[ https://issues.apache.org/jira/browse/MESOS-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14495003#comment-14495003 ] Cody Maloney commented on MESOS-2605: - That sounds like this might be related to MESOS-2601 then. Mesos doesn't currently save what containerizer created / owns a container, and so it just tries to recover the container with all of them. The slave sometimes does not send active executors during reregistration Key: MESOS-2605 URL: https://issues.apache.org/jira/browse/MESOS-2605 Project: Mesos Issue Type: Bug Affects Versions: 0.22.0 Reporter: Elizabeth Lingg Assignee: Michael Park Labels: mesosphere The slave sometimes does not send active executors during reregistration. Framework checkpointing is enabled, and the executor successfully reregisters. However, the tasks in that executor are LOST (by abnormal executor termination) because the executor is removed by the mesos master as unknown. See the example below, task.journalnode.journalnode.NodeExecutor.1428609184051. 
See the Slave Logs here for the Task: {code} Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.778790 25126 status_update_manager.cpp:317] Received status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008 Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.779013 25126 status_update_manager.hpp:346] Checkpointing UPDATE for status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008 Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.781788 25123 slave.cpp:2753] Forwarding the update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008 to master@10.142.250.253:5050 Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.781889 25123 slave.cpp:2686] Sending acknowledgement for status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008 to executor(1)@10.168.119.78:47638 Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.784503 25124 status_update_manager.cpp:389] Received status update acknowledgement (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008 Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.784567 25124 status_update_manager.hpp:346] Checkpointing ACK for status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task 
task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008 {code} Master Logs: {code} Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: W0409 20:19:43.008666 1067 master.cpp:4015] Executor executor.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008 possibly unknown to the slave 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 (ec2-54-237-57-237.compute-1.amazonaws.com) Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 20:19:43.008652 1074 hierarchical.hpp:648] Recovered cpus(*):0.1; mem(*):1536 (total allocatable: cpus(*):3.5; mem(*):21113; disk(*):142210; ports(*):[3889-5044, 5046-5049, 2182-2958, 2960-3887, 1025-2180, 8082-9041, 9043-9159, 9161-, 5052-6999, 7002-7198, 7200-8079, 10001-65535]) on slave 20150407-233647-2059219722-5050-1659-S5 from framework 20150408-002100-4261056010-5050-1047-0008 Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 20:19:43.008712 1067 master.cpp:4714] Removing executor 'executor.journalnode.NodeExecutor.1428609184051' with resources cpus(*):0.1; mem(*):1536 of framework 20150408-002100-4261056010-5050-1047-0008 on slave 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 (ec2-54-237-57-237.compute-1.amazonaws.com) Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 20:19:43.010372 1067 master.cpp:3295] Status update TASK_LOST (UUID: e5532567-e5b2-4fca-87aa-f3f98e371640) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework
[jira] [Commented] (MESOS-2550) Mesos doesn't compile with clang 3.6
[ https://issues.apache.org/jira/browse/MESOS-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493071#comment-14493071 ] Cody Maloney commented on MESOS-2550: - https://reviews.apache.org/r/32747/ https://reviews.apache.org/r/32748/ https://reviews.apache.org/r/32749/ Mesos doesn't compile with clang 3.6 Key: MESOS-2550 URL: https://issues.apache.org/jira/browse/MESOS-2550 Project: Mesos Issue Type: Improvement Components: build Affects Versions: 0.22.0 Environment: ArchLinux with Clang 3.6 Reporter: Cody Maloney Assignee: Cody Maloney The bundled libev fails to compile with the error: {code} ev.c:970:42: error: '_Noreturn' keyword must precede function declarator ecb_inline void ecb_unreachable (void) ecb_noreturn; ^~~~ _Noreturn {code} Can be patched by moving the noreturn to earlier in the line / where C++11 noreturn attributes go. Bundled boost fails with errors like: {code} ../3rdparty/libprocess/3rdparty/boost-1.53.0/boost/concept_check.hpp:653:11: error: unused typedef 'boost_concept_check653' [-Werror,-Wunused-local-typedef] BOOST_CONCEPT_ASSERT((InputIterator<const_iterator>)); ^ {code} Can be fixed by adding '-Wno-unused-local-typedef' if we detect clang 3.6 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2550) Mesos doesn't compile with clang 3.6
[ https://issues.apache.org/jira/browse/MESOS-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493074#comment-14493074 ] Cody Maloney commented on MESOS-2550: - There is going to be a clang 3.6.1 release next month (The code freeze for it is May 5). The patches might land in that. Mesos doesn't compile with clang 3.6 Key: MESOS-2550 URL: https://issues.apache.org/jira/browse/MESOS-2550 Project: Mesos Issue Type: Improvement Components: build Affects Versions: 0.22.0 Environment: ArchLinux with Clang 3.6 Reporter: Cody Maloney Assignee: Cody Maloney The bundled libev fails to compile with the error: {code} ev.c:970:42: error: '_Noreturn' keyword must precede function declarator ecb_inline void ecb_unreachable (void) ecb_noreturn; ^~~~ _Noreturn {code} Can be patched by moving the noreturn to earlier in the line / where C++11 noreturn attributes go. Bundled boost fails with errors like: {code} ../3rdparty/libprocess/3rdparty/boost-1.53.0/boost/concept_check.hpp:653:11: error: unused typedef 'boost_concept_check653' [-Werror,-Wunused-local-typedef] BOOST_CONCEPT_ASSERT((InputIterator<const_iterator>)); ^ {code} Can be fixed by adding '-Wno-unused-local-typedef' if we detect clang 3.6 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2604) Upgrade minimum required compilers for MESOS
Cody Maloney created MESOS-2604: --- Summary: Upgrade minimum required compilers for MESOS Key: MESOS-2604 URL: https://issues.apache.org/jira/browse/MESOS-2604 Project: Mesos Issue Type: Improvement Components: build Affects Versions: 0.23.0 Reporter: Cody Maloney Assignee: Cody Maloney As discussed in the last community meeting we would like to upgrade the minimum mesos compiler version to GCC 4.8+, Clang 3.5. GCC primarily for Linux. Clang for OS X, as well as linux for enabling Mesos tooling improvements ([clang-format|http://mesos.apache.org/documentation/clang-format/], clang-tidy among others). Some documents for reference: [Compilers by Distribution Version|https://docs.google.com/spreadsheets/d/1Ji8p3p_1JqUsMxE31mJqqztHf7LDx7mGMXh253azWpU/edit?usp=sharing] Shows we can get GCC 4.8+ or clang 3.5+ on all supported platforms. C++11 features supported by each compiler: [https://gcc.gnu.org/projects/cxx0x.html] [http://clang.llvm.org/cxx_status.html] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-830) ExamplesTest.JavaFramework is flaky
[ https://issues.apache.org/jira/browse/MESOS-830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487890#comment-14487890 ] Cody Maloney commented on MESOS-830: Still failing (Although frequency seems to have increased) on our OSX Buildbot. ExamplesTest.JavaFramework is flaky --- Key: MESOS-830 URL: https://issues.apache.org/jira/browse/MESOS-830 Project: Mesos Issue Type: Bug Components: test Reporter: Vinod Kone Labels: flaky [ RUN ] ExamplesTest.JavaFramework Using temporary directory '/tmp/ExamplesTest_JavaFramework_wSc7u8' Enabling authentication for the framework I1120 15:13:39.820032 1681264640 master.cpp:285] Master started on 172.25.133.171:52576 I1120 15:13:39.820180 1681264640 master.cpp:299] Master ID: 201311201513-2877626796-52576-3234 I1120 15:13:39.820194 1681264640 master.cpp:302] Master only allowing authenticated frameworks to register! I1120 15:13:39.821197 1679654912 slave.cpp:112] Slave started on 1)@172.25.133.171:52576 I1120 15:13:39.821795 1679654912 slave.cpp:212] Slave resources: cpus(*):4; mem(*):7168; disk(*):481998; ports(*):[31000-32000] I1120 15:13:39.822855 1682337792 slave.cpp:112] Slave started on 2)@172.25.133.171:52576 I1120 15:13:39.823652 1682337792 slave.cpp:212] Slave resources: cpus(*):4; mem(*):7168; disk(*):481998; ports(*):[31000-32000] I1120 15:13:39.825330 1679118336 master.cpp:744] The newly elected leader is master@172.25.133.171:52576 I1120 15:13:39.825445 1679118336 master.cpp:748] Elected as the leading master! 
I1120 15:13:39.825907 1681264640 state.cpp:33] Recovering state from '/tmp/ExamplesTest_JavaFramework_wSc7u8/0/meta' I1120 15:13:39.826127 1681264640 status_update_manager.cpp:180] Recovering status update manager I1120 15:13:39.826331 1681801216 process_isolator.cpp:317] Recovering isolator I1120 15:13:39.826738 1682874368 slave.cpp:2743] Finished recovery I1120 15:13:39.827747 1682337792 state.cpp:33] Recovering state from '/tmp/ExamplesTest_JavaFramework_wSc7u8/1/meta' I1120 15:13:39.827945 1680191488 slave.cpp:112] Slave started on 3)@172.25.133.171:52576 I1120 15:13:39.828415 1682337792 status_update_manager.cpp:180] Recovering status update manager I1120 15:13:39.828608 1680728064 sched.cpp:260] Authenticating with master master@172.25.133.171:52576 I1120 15:13:39.828606 1680191488 slave.cpp:212] Slave resources: cpus(*):4; mem(*):7168; disk(*):481998; ports(*):[31000-32000] I1120 15:13:39.828680 1682874368 slave.cpp:497] New master detected at master@172.25.133.171:52576 I1120 15:13:39.828765 1682337792 process_isolator.cpp:317] Recovering isolator I1120 15:13:39.829828 1680728064 sched.cpp:229] Detecting new master I1120 15:13:39.830288 1679654912 authenticatee.hpp:100] Initializing client SASL I1120 15:13:39.831635 1680191488 state.cpp:33] Recovering state from '/tmp/ExamplesTest_JavaFramework_wSc7u8/2/meta' I1120 15:13:39.831991 1679118336 status_update_manager.cpp:158] New master detected at master@172.25.133.171:52576 I1120 15:13:39.832042 1682874368 slave.cpp:524] Detecting new master I1120 15:13:39.832314 1682337792 slave.cpp:2743] Finished recovery I1120 15:13:39.832309 1681264640 master.cpp:1266] Attempting to register slave on vkone.local at slave(1)@172.25.133.171:52576 I1120 15:13:39.832929 1680728064 status_update_manager.cpp:180] Recovering status update manager I1120 15:13:39.833371 1681801216 slave.cpp:497] New master detected at master@172.25.133.171:52576 I1120 15:13:39.833273 1681264640 master.cpp:2513] Adding slave 
201311201513-2877626796-52576-3234-0 at vkone.local with cpus(*):4; mem(*):7168; disk(*):481998; ports(*):[31000-32000] I1120 15:13:39.833595 1680728064 process_isolator.cpp:317] Recovering isolator I1120 15:13:39.833859 1681801216 slave.cpp:524] Detecting new master I1120 15:13:39.833861 1682874368 status_update_manager.cpp:158] New master detected at master@172.25.133.171:52576 I1120 15:13:39.834092 1680191488 slave.cpp:542] Registered with master master@172.25.133.171:52576; given slave ID 201311201513-2877626796-52576-3234-0 I1120 15:13:39.834486 1681264640 master.cpp:1266] Attempting to register slave on vkone.local at slave(2)@172.25.133.171:52576 I1120 15:13:39.834549 1681264640 master.cpp:2513] Adding slave 201311201513-2877626796-52576-3234-1 at vkone.local with cpus(*):4; mem(*):7168; disk(*):481998; ports(*):[31000-32000] I1120 15:13:39.834750 1680191488 slave.cpp:555] Checkpointing SlaveInfo to '/tmp/ExamplesTest_JavaFramework_wSc7u8/0/meta/slaves/201311201513-2877626796-52576-3234-0/slave.info' I1120 15:13:39.834875 1682874368 hierarchical_allocator_process.hpp:445] Added slave 201311201513-2877626796-52576-3234-0 (vkone.local) with cpus(*):4; mem(*):7168; disk(*):481998;
[jira] [Commented] (MESOS-2601) Tasks are not removed after recovery from slave and mesos containerizer
[ https://issues.apache.org/jira/browse/MESOS-2601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14486105#comment-14486105 ] Cody Maloney commented on MESOS-2601: - [~jieyu] The --containerizer flag has never been changed on the host. Isolator flags also haven't changed at runtime ever on the host (only with a full workdir wipeout / reboot / kill all tasks / new slave id). Tasks are not removed after recovery from slave and mesos containerizer --- Key: MESOS-2601 URL: https://issues.apache.org/jira/browse/MESOS-2601 Project: Mesos Issue Type: Bug Components: containerization, slave Affects Versions: 0.22.1 Reporter: Timothy Chen We've seen in our test cluster that tasks that were launched with the mesos containerizer are recovered after slave restart, but actual command process is not running anymore and the checkpointed executor is not marked as completed. The Mesos containerizer recovers and all the isolators couldn't recover the task, but the containerizer itself is somehow never removed and the monitor kept calling usage on the containerizer. Relevant log lines from the beginning of slave recovery: I0408 18:06:33.261379 32504 slave.cpp:577] Successfully attached file '/hdd/mesos/slave/slaves/20150401-160104-251662508-5050-2197-S1/frameworks/20141222-194154-218108076-5050-4125-0004/executors/ct:1427921848104:0:EM DataDog Uploader:/runs/990741ed-909e-49cc-83f8-be63298872da' ... 
I0408 18:06:36.583277 32511 containerizer.cpp:350] Recovering container '990741ed-909e-49cc-83f8-be63298872da' for executor 'ct:1427921848104:0:EM DataDog Uploader:' of framework 20141222-194154-218108076-5050-4125-0004 I0408 18:06:37.017122 32511 linux_launcher.cpp:162] Couldn't find freezer cgroup for container 990741ed-909e-49cc-83f8-be63298872da, assuming already destroyed W0408 18:06:37.074916 32496 cpushare.cpp:199] Couldn't find cgroup for container 990741ed-909e-49cc-83f8-be63298872da I0408 18:06:37.075173 32486 mem.cpp:158] Couldn't find cgroup for container 990741ed-909e-49cc-83f8-be63298872da E0408 18:06:37.092279 32496 containerizer.cpp:1136] Error in a resource limitation for container 990741ed-909e-49cc-83f8-be63298872da: Unknown container I0408 18:06:37.092643 32496 containerizer.cpp:906] Destroying container '990741ed-909e-49cc-83f8-be63298872da' W0408 18:06:37.229626 32501 containerizer.cpp:807] Ignoring update for currently being destroyed container: 990741ed-909e-49cc-83f8-be63298872da W0408 18:06:38.129873 32484 containerizer.cpp:844] Skipping resource statistic for container 990741ed-909e-49cc-83f8-be63298872da because: Unknown container W0408 18:06:38.129909 32484 containerizer.cpp:844] Skipping resource statistic for container 990741ed-909e-49cc-83f8-be63298872da because: Unknown container -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1303) ExamplesTest.{TestFramework, NoExecutorFramework} flaky
[ https://issues.apache.org/jira/browse/MESOS-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482453#comment-14482453 ]
Cody Maloney commented on MESOS-1303:
This is definitely still flaky. From our OSX Buildbot earlier today with master commit: 740dcb3d55944bc1410818d48efc49f0091b037d
[----------] 8 tests from ExamplesTest
[ RUN ] ExamplesTest.TestFramework
../../src/tests/script.cpp:83: Failure
Failed
test_framework_test.sh terminated with signal Abort trap: 6
[ FAILED ] ExamplesTest.TestFramework (7925 ms)
ExamplesTest.{TestFramework, NoExecutorFramework} flaky
Key: MESOS-1303
URL: https://issues.apache.org/jira/browse/MESOS-1303
Project: Mesos
Issue Type: Bug
Components: test
Reporter: Ian Downes
Labels: flaky
I'm having trouble reproducing this but I did observe it once on my OSX system:
{noformat}
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from ExamplesTest
[ RUN ] ExamplesTest.TestFramework
../../src/tests/script.cpp:81: Failure
Failed
test_framework_test.sh terminated with signal 'Abort trap: 6'
[ FAILED ] ExamplesTest.TestFramework (953 ms)
[ RUN ] ExamplesTest.NoExecutorFramework
[ OK ] ExamplesTest.NoExecutorFramework (10162 ms)
[----------] 2 tests from ExamplesTest (5 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (11121 ms total)
[ PASSED ] 1 test.
[ FAILED ] 1 test, listed below:
[ FAILED ] ExamplesTest.TestFramework
{noformat}
when investigating a failed make check for https://reviews.apache.org/r/20971/
{noformat}
[----------] 6 tests from ExamplesTest
[ RUN ] ExamplesTest.TestFramework
[ OK ] ExamplesTest.TestFramework (8643 ms)
[ RUN ] ExamplesTest.NoExecutorFramework
tests/script.cpp:81: Failure
Failed
no_executor_framework_test.sh terminated with signal 'Aborted'
[ FAILED ] ExamplesTest.NoExecutorFramework (7220 ms)
[ RUN ] ExamplesTest.JavaFramework
[ OK ] ExamplesTest.JavaFramework (11181 ms)
[ RUN ] ExamplesTest.JavaException
[ OK ] ExamplesTest.JavaException (5624 ms)
[ RUN ] ExamplesTest.JavaLog
[ OK ] ExamplesTest.JavaLog (6472 ms)
[ RUN ] ExamplesTest.PythonFramework
[ OK ] ExamplesTest.PythonFramework (14467 ms)
[----------] 6 tests from ExamplesTest (53607 ms total)
{noformat}
[jira] [Reopened] (MESOS-1303) ExamplesTest.{TestFramework, NoExecutorFramework} flaky
[ https://issues.apache.org/jira/browse/MESOS-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Cody Maloney reopened MESOS-1303:
ExamplesTest.{TestFramework, NoExecutorFramework} flaky
Key: MESOS-1303
URL: https://issues.apache.org/jira/browse/MESOS-1303
Project: Mesos
Issue Type: Bug
Components: test
Reporter: Ian Downes
Labels: flaky
[jira] [Assigned] (MESOS-2550) Mesos doesn't compile with clang 3.6
[ https://issues.apache.org/jira/browse/MESOS-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Cody Maloney reassigned MESOS-2550:
Assignee: Cody Maloney
Mesos doesn't compile with clang 3.6
Key: MESOS-2550
URL: https://issues.apache.org/jira/browse/MESOS-2550
Project: Mesos
Issue Type: Improvement
Components: build
Affects Versions: 0.22.0
Environment: ArchLinux with Clang 3.6
Reporter: Cody Maloney
Assignee: Cody Maloney
The bundled libev fails to compile with the error:
{code}
ev.c:970:42: error: '_Noreturn' keyword must precede function declarator
ecb_inline void ecb_unreachable (void) ecb_noreturn;
^~~~
_Noreturn
{code}
This can be patched by moving the noreturn specifier earlier in the declaration, where C++11 noreturn attributes go.
Bundled boost fails with errors like:
{code}
../3rdparty/libprocess/3rdparty/boost-1.53.0/boost/concept_check.hpp:653:11: error: unused typedef 'boost_concept_check653' [-Werror,-Wunused-local-typedef]
BOOST_CONCEPT_ASSERT((InputIterator<const_iterator>));
^
{code}
This can be fixed by adding '-Wno-unused-local-typedef' when clang 3.6 is detected.
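The noreturn placement discussed in MESOS-2550 can be illustrated with a minimal sketch. It is written in C++11 rather than the C of ev.c, and it is not the libev patch itself: `failFast` and `checkedDivide` are hypothetical names chosen only for this example. The point it shows is that the `[[noreturn]]` attribute sits at the start of the declaration, ahead of the return type, instead of trailing the parameter list.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical helper, not taken from libev or Mesos.
// The attribute comes first, before the declaration specifiers,
// which is the position the clang 3.6 error message asks for.
[[noreturn]] inline void failFast(const char* message)
{
  std::fprintf(stderr, "fatal: %s\n", message);
  std::abort();
}

// Hypothetical caller. Because failFast is declared noreturn, the
// compiler knows control never continues past the error branch.
inline int checkedDivide(int numerator, int denominator)
{
  if (denominator == 0) {
    failFast("division by zero");
  }
  return numerator / denominator;
}
```

Calling `checkedDivide(6, 3)` returns normally, while a zero denominator aborts the process through `failFast`.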